Probably found bug in GraphTokenStreamFiniteStrings

2021-02-23 Thread Aleksandr Menshikov
Hi everyone,

I ran into some exceptions in my production service, which is based on
Lucene. After some investigation I found the problem and built a minimal
example as a test for GraphTokenStreamFiniteStrings (you can add it
to TestGraphTokenStreamFiniteStrings):



public void testX() throws IOException {
  CannedTokenStream cts =
      new CannedTokenStream(
          token("мужские2", 1, 2),
          token("мужчина", 2, 1),
          token("мужской", 0, 1),
          token("2", 1, 1));
  GraphTokenStreamFiniteStrings graph = new GraphTokenStreamFiniteStrings(cts);

  assertTrue(graph.getTerms("", 0).length > 0);
}



Currently this code fails on an assertion at:
org.apache.lucene.util.automaton.Automaton.initTransition(Automaton.java:484)
with message "state=0 nextState=0".

I ran it on the latest master revision: eba0e255352adb2cb72031699c3f8d3963286d89

In production we use 8.4.1.

Moreover, I think the problem is somewhere in the Operations#removeDeadStates
code, because if I remove this operation from the
GraphTokenStreamFiniteStrings constructor, the test passes:



public GraphTokenStreamFiniteStrings(TokenStream in) throws IOException {
  Automaton aut = build(in);
  this.det = aut;
  // Operations.removeDeadStates(
  //     Operations.determinize(aut, DEFAULT_MAX_DETERMINIZED_STATES));
}




I have to say I'm not totally sure whether this is a bug in
GraphTokenStreamFiniteStrings or in the analyzer that produces such a
TokenStream.


Originally a user sent the misspelled phrase "мужские2", with a missing space
before '2'. Our analyzer then did some morphology work, which is how
"мужчина" and "мужской" were produced.

Here is more information about the TokenStream:



term: мужские2, type: , startOffset:0, endOffset:8, posInc:1, posLength:2
term: мужчина, type: SYNONYM, startOffset:0, endOffset:7, posInc:2, posLength:1
term: мужской, type: , startOffset:0, endOffset:7, posInc:0, posLength:1
term: 2, type: , startOffset:7, endOffset:8, posInc:1, posLength:1
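
To make the shape of this token graph easier to see, here is a minimal sketch in plain Java that turns the posInc/posLength values above into arcs. It assumes the usual convention that a token's start node is the running sum of position increments (starting from -1) and its end node is the start plus the position length. This is not the code in GraphTokenStreamFiniteStrings, which may handle gaps differently, and the names TokenGraphArcs, Arc and arcs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: derives graph arcs from
// position increments and position lengths.
class TokenGraphArcs {

  /** One arc of the token graph: the term leaves node `from` and arrives at node `to`. */
  static final class Arc {
    final String term;
    final int from;
    final int to;

    Arc(String term, int from, int to) {
      this.term = term;
      this.from = from;
      this.to = to;
    }

    @Override
    public String toString() {
      return from + " -> " + to + " [" + term + "]";
    }
  }

  /** Assumed convention: start = sum of posInc so far - 1, end = start + posLength. */
  static List<Arc> arcs(String[] terms, int[] posIncs, int[] posLengths) {
    List<Arc> result = new ArrayList<>();
    int pos = -1; // position before the first token
    for (int i = 0; i < terms.length; i++) {
      pos += posIncs[i];
      result.add(new Arc(terms[i], pos, pos + posLengths[i]));
    }
    return result;
  }

  public static void main(String[] args) {
    // Values copied from the token dump above.
    String[] terms = {"мужские2", "мужчина", "мужской", "2"};
    int[] posIncs = {1, 2, 0, 1};
    int[] posLengths = {2, 1, 1, 1};
    for (Arc arc : arcs(terms, posIncs, posLengths)) {
      System.out.println(arc);
    }
  }
}
```

Under this reading the arcs are 0 -> 2, 2 -> 3, 2 -> 3 and 3 -> 4, so no arc starts or ends at node 1. Whether that gap is what the analyzer intended may be worth checking.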




If you need any extra information, just let me know.

I'm also ready to file an issue in JIRA if this is a bug in
GraphTokenStreamFiniteStrings.

So what do you think about it?


- Alexander Menshikov


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Baris Kazar
So, just cat  will do this.
Thanks

From: Robert Muir 
Sent: Tuesday, February 23, 2021 4:45 PM
To: Baris Kazar 
Cc: java-user 
Subject: Re: MMapDirectory vs In Memory Lucene Index (i.e., 
ByteBuffersDirectory)

The preload isn't magical.
It only "reads in the whole file" to get it cached, same as if you did that 
yourself with 'cat' or 'dd'.
It "warms" the file.

It just does this in an efficient way at the low level to make the warming 
itself efficient. It madvise()s kernel to announce some read-ahead and then 
reads the first byte of every mmap'd page (which is enough to fault it in).

At the end of the day it doesn't matter if you wrote a shitty shell script that 
uses 'dd' to read in each index file and send it to /dev/null, or whether you 
spent lots of time writing fancy java code to call this preload thing: you get 
the same result, same end state.

Maybe the preload takes 18 seconds to "warm" the index, vs. your crappy shell 
script which takes 22 seconds. It is mainly more important for servers and 
portability (e.g. it will work fine on windows, but obviously will not call 
madvise).

On Tue, Feb 23, 2021 at 4:18 PM baris.ka...@oracle.com wrote:

Thanks again, Robert. Could you please explain "preload"? Which functionality 
is that? we discussed in this thread before about a preload.

Is there a Lucene url / site that i can look at for preload?

Thanks for the explanations. This thread will be useful for many folks i 
believe.

Best regards


On 2/23/21 4:15 PM, Robert Muir wrote:


On Tue, Feb 23, 2021 at 4:07 PM baris.ka...@oracle.com wrote:

What i want to achieve: Problem statement:

base case is disk based Lucene index with FSDirectory

speedup case was supposed to be in memory Lucene index with MMapDirectory

On 64-bit systems, FSDirectory just invokes MMapDirectory already. So you don't 
need to do anything.

Either way MMapDirectory or NIOFSDirectory are doing the same thing: reading 
your index as a normal file and letting the operating system cache it.
The MMapDirectory is just better because it avoids some overheads, such as 
read() system call, copying and buffering into java memory space, etc etc.
Some of these overheads are only getting worse, e.g. spectre/meltdown-type 
fixes make syscalls 8x slower on my computer. So it is good that MMapDirectory 
avoids it.

So I suggest just stop fighting the operating system, don't give your J2EE 
container huge amounts of ram, let the kernel do its job.
If you want to "warm" a cold system because nothing is in kernel's cache, then 
look into preload and so on. It is just "reading files" to get them cached.


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
The preload isn't magical.
It only "reads in the whole file" to get it cached, same as if you did that
yourself with 'cat' or 'dd'.
It "warms" the file.

It just does this in an efficient way at the low level to make the warming
itself efficient. It madvise()s kernel to announce some read-ahead and then
reads the first byte of every mmap'd page (which is enough to fault it in).

At the end of the day it doesn't matter if you wrote a shitty shell script
that uses 'dd' to read in each index file and send it to /dev/null, or
whether you spent lots of time writing fancy java code to call this preload
thing: you get the same result, same end state.

Maybe the preload takes 18 seconds to "warm" the index, vs. your crappy
shell script which takes 22 seconds. It is mainly more important for
servers and portability (e.g. it will work fine on windows, but obviously
will not call madvise).
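
The 'cat' / 'dd' style of warming described above can be sketched in plain Java. This is a minimal illustration, not Lucene's preload: the class name IndexWarmer and the method warm are hypothetical, and the code only reads every regular file in a directory once and discards the bytes, so that the operating system gets a chance to cache them. It does not call madvise().

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Lucene.
class IndexWarmer {

  /** Reads every regular file directly under indexDir once; returns the bytes read. */
  static long warm(Path indexDir) throws IOException {
    long total = 0;
    // Reused scratch buffer; the data itself is thrown away.
    ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
    try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
      for (Path file : files) {
        if (!Files.isRegularFile(file)) {
          continue; // skip subdirectories and special files
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
          while (true) {
            buf.clear();
            int n = ch.read(buf);
            if (n < 0) {
              break; // end of file
            }
            total += n;
          }
        }
      }
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    // The index path is an assumption; pass your own directory as the first argument.
    Path dir = Paths.get(args.length > 0 ? args[0] : ".");
    System.out.println("bytes read: " + warm(dir));
  }
}
```

Running this once before the instance starts serving queries is one way to get the same end state as the shell-script approach; whether the pages stay cached depends on how much free memory the kernel has.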

On Tue, Feb 23, 2021 at 4:18 PM  wrote:

> Thanks again, Robert. Could you please explain "preload"? Which
> functionality is that? we discussed in this thread before about a preload.
>
> Is there a Lucene url / site that i can look at for preload?
>
> Thanks for the explanations. This thread will be useful for many folks i
> believe.
>
> Best regards
>
>
> On 2/23/21 4:15 PM, Robert Muir wrote:
>
>
>
> On Tue, Feb 23, 2021 at 4:07 PM  wrote:
>
>> What i want to achieve: Problem statement:
>>
>> base case is disk based Lucene index with FSDirectory
>>
>> speedup case was supposed to be in memory Lucene index with MMapDirectory
>>
> On 64-bit systems, FSDirectory just invokes MMapDirectory already. So you
> don't need to do anything.
>
> Either way MMapDirectory or NIOFSDirectory are doing the same thing:
> reading your index as a normal file and letting the operating system cache
> it.
> The MMapDirectory is just better because it avoids some overheads, such as
> read() system call, copying and buffering into java memory space, etc etc.
> Some of these overheads are only getting worse, e.g. spectre/meltdown-type
> fixes make syscalls 8x slower on my computer. So it is good that
> MMapDirectory avoids it.
>
> So I suggest just stop fighting the operating system, don't give your J2EE
> container huge amounts of ram, let the kernel do its job.
> If you want to "warm" a cold system because nothing is in kernel's cache,
> then look into preload and so on. It is just "reading files" to get them
> cached.
>
>


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread baris . kazar
Thanks again, Robert. Could you please explain "preload"? Which
functionality is that? We discussed a preload earlier in this thread.


Is there a Lucene url / site that i can look at for preload?

Thanks for the explanations. This thread will be useful for many folks i 
believe.


Best regards


On 2/23/21 4:15 PM, Robert Muir wrote:



On Tue, Feb 23, 2021 at 4:07 PM baris.ka...@oracle.com wrote:


What i want to achieve: Problem statement:

base case is disk based Lucene index with FSDirectory

speedup case was supposed to be in memory Lucene index with
MMapDirectory

On 64-bit systems, FSDirectory just invokes MMapDirectory already. So 
you don't need to do anything.


Either way MMapDirectory or NIOFSDirectory are doing the same thing: 
reading your index as a normal file and letting the operating system 
cache it.
The MMapDirectory is just better because it avoids some overheads, 
such as read() system call, copying and buffering into java memory 
space, etc etc.
Some of these overheads are only getting worse, e.g. 
spectre/meltdown-type fixes make syscalls 8x slower on my computer. So 
it is good that MMapDirectory avoids it.


So I suggest just stop fighting the operating system, don't give your 
J2EE container huge amounts of ram, let the kernel do its job.
If you want to "warm" a cold system because nothing is in kernel's 
cache, then look into preload and so on. It is just "reading files" to 
get them cached.


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
On Tue, Feb 23, 2021 at 4:07 PM  wrote:

> What i want to achieve: Problem statement:
>
> base case is disk based Lucene index with FSDirectory
>
> speedup case was supposed to be in memory Lucene index with MMapDirectory
>
On 64-bit systems, FSDirectory just invokes MMapDirectory already. So you
don't need to do anything.

Either way MMapDirectory or NIOFSDirectory are doing the same thing:
reading your index as a normal file and letting the operating system cache
it.
The MMapDirectory is just better because it avoids some overheads, such as
read() system call, copying and buffering into java memory space, etc etc.
Some of these overheads are only getting worse, e.g. spectre/meltdown-type
fixes make syscalls 8x slower on my computer. So it is good that
MMapDirectory avoids it.

So I suggest just stop fighting the operating system, don't give your J2EE
container huge amounts of ram, let the kernel do its job.
If you want to "warm" a cold system because nothing is in kernel's cache,
then look into preload and so on. It is just "reading files" to get them
cached.


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread baris . kazar

(edited previous response)


Thanks, but for each different query I see some slowdown on the first run
(not much, though) with both MMapDirectory and FSDirectory, compared to the
second and third runs (due to cold start).


The cold-start slowdown is a little larger with FSDirectory. So
MMapDirectory is slightly better in that respect, too.



What i want to achieve: Problem statement:

base case is disk based Lucene index with FSDirectory

speedup case was supposed to be in memory Lucene index with MMapDirectory


Uwe mentioned tmpfs will help. i will try that next.


I thought preload was not helping much as we discussed here.

Thanks


On 2/23/21 3:54 PM, Robert Muir wrote:
speedup over what? You are probably already using MMapDirectory (it is 
the default). So I don't know what you are trying to achieve, but 
giving lots of memory to your java process is not going to help.


If you just want to prevent the first few queries to a fresh cold 
machine instance from being slow, you can use the preload for that 
before you make it available. You could also use 'cat' or 'dd'.


On Tue, Feb 23, 2021 at 3:45 PM baris.ka...@oracle.com wrote:


Thanks but then how will MMapDirectory help gain speedup?

i will try tmpfs and see what happens. i was expecting to get on
order of magnitude of speedup from already very fast on disk
Lucene indexes.

So i was expecting really really really fast response with
MMapDirectory.

Thanks


On 2/23/21 3:40 PM, Robert Muir wrote:

Don't give gobs of memory to your java process, you will just
make things slower. The kernel will cache your index files.

On Tue, Feb 23, 2021 at 1:45 PM baris.ka...@oracle.com wrote:

Ok, but how is this MMapDirectory used then?

Best regards


On 2/23/21 7:03 AM, Robert Muir wrote:
>
>
> On Tue, Feb 23, 2021 at 2:30 AM baris.ka...@oracle.com wrote:
>
>     Hi,-
>
>       I tried MMapDirectory and i allocated as big as index size on my
>     J2EE Container but
>
>
> Don't allocate java heap memory for the index, MMapDirectory does not
> use java heap memory!



Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread baris . kazar
Thanks, but for each different query I see some slowdown (not much, though)
with both MMapDirectory and FSDirectory.


It is a little larger with FSDirectory. So MMapDirectory is slightly
better in that respect, too, i.e., on cold start.



What i want to achieve: Problem statement:

base case is disk based Lucene index with FSDirectory

speedup case was supposed to be in memory Lucene index with MMapDirectory


Uwe mentioned tmpfs will help. i will try that next.

Thanks


On 2/23/21 3:54 PM, Robert Muir wrote:
speedup over what? You are probably already using MMapDirectory (it is 
the default). So I don't know what you are trying to achieve, but 
giving lots of memory to your java process is not going to help.


If you just want to prevent the first few queries to a fresh cold 
machine instance from being slow, you can use the preload for that 
before you make it available. You could also use 'cat' or 'dd'.


On Tue, Feb 23, 2021 at 3:45 PM baris.ka...@oracle.com wrote:


Thanks but then how will MMapDirectory help gain speedup?

i will try tmpfs and see what happens. i was expecting to get on
order of magnitude of speedup from already very fast on disk
Lucene indexes.

So i was expecting really really really fast response with
MMapDirectory.

Thanks


On 2/23/21 3:40 PM, Robert Muir wrote:

Don't give gobs of memory to your java process, you will just
make things slower. The kernel will cache your index files.

On Tue, Feb 23, 2021 at 1:45 PM baris.ka...@oracle.com wrote:

Ok, but how is this MMapDirectory used then?

Best regards


On 2/23/21 7:03 AM, Robert Muir wrote:
>
>
> On Tue, Feb 23, 2021 at 2:30 AM baris.ka...@oracle.com wrote:
>
>     Hi,-
>
>       I tried MMapDirectory and i allocated as big as index size on my
>     J2EE Container but
>
>
> Don't allocate java heap memory for the index, MMapDirectory does not
> use java heap memory!



Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
speedup over what? You are probably already using MMapDirectory (it is the
default). So I don't know what you are trying to achieve, but giving lots
of memory to your java process is not going to help.

If you just want to prevent the first few queries to a fresh cold machine
instance from being slow, you can use the preload for that before you make
it available. You could also use 'cat' or 'dd'.

On Tue, Feb 23, 2021 at 3:45 PM  wrote:

> Thanks but then how will MMapDirectory help gain speedup?
>
> i will try tmpfs and see what happens. i was expecting to get on order of
> magnitude of speedup from already very fast on disk Lucene indexes.
>
> So i was expecting really really really fast response with MMapDirectory.
>
> Thanks
>
>
> On 2/23/21 3:40 PM, Robert Muir wrote:
>
> Don't give gobs of memory to your java process, you will just make things
> slower. The kernel will cache your index files.
>
> On Tue, Feb 23, 2021 at 1:45 PM  wrote:
>
>> Ok, but how is this MMapDirectory used then?
>>
>> Best regards
>>
>>
>> On 2/23/21 7:03 AM, Robert Muir wrote:
>> >
>> >
>> > On Tue, Feb 23, 2021 at 2:30 AM baris.ka...@oracle.com wrote:
>> >
>> > Hi,-
>> >
>> >   I tried MMapDirectory and i allocated as big as index size on my
>> > J2EE
>> > Container but
>> >
>> >
>> > Don't allocate java heap memory for the index, MMapDirectory does not
>> > use java heap memory!
>>
>


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread baris . kazar

Thanks but then how will MMapDirectory help gain speedup?

I will try tmpfs and see what happens. I was expecting to get an order
of magnitude of speedup over the already very fast on-disk Lucene indexes.


So i was expecting really really really fast response with MMapDirectory.

Thanks


On 2/23/21 3:40 PM, Robert Muir wrote:
Don't give gobs of memory to your java process, you will just make 
things slower. The kernel will cache your index files.


On Tue, Feb 23, 2021 at 1:45 PM baris.ka...@oracle.com wrote:


Ok, but how is this MMapDirectory used then?

Best regards


On 2/23/21 7:03 AM, Robert Muir wrote:
>
>
> On Tue, Feb 23, 2021 at 2:30 AM baris.ka...@oracle.com wrote:
>
>     Hi,-
>
>       I tried MMapDirectory and i allocated as big as index size on my
>     J2EE Container but
>
>
> Don't allocate java heap memory for the index, MMapDirectory does not
> use java heap memory!



Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
Don't give gobs of memory to your java process, you will just make things
slower. The kernel will cache your index files.

On Tue, Feb 23, 2021 at 1:45 PM  wrote:

> Ok, but how is this MMapDirectory used then?
>
> Best regards
>
>
> On 2/23/21 7:03 AM, Robert Muir wrote:
> >
> >
> > On Tue, Feb 23, 2021 at 2:30 AM baris.ka...@oracle.com wrote:
> >
> > Hi,-
> >
> >   I tried MMapDirectory and i allocated as big as index size on my
> > J2EE
> > Container but
> >
> >
> > Don't allocate java heap memory for the index, MMapDirectory does not
> > use java heap memory!
>


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread baris . kazar
As Uwe suggested some time ago, using a tmpfs file system with MMapDirectory
is the only way to get a high speedup over an on-disk Lucene index, right?

Best regards


On 2/23/21 1:44 PM, baris.ka...@oracle.com wrote:


Ok, but how is this MMapDirectory used then?

Best regards


On 2/23/21 7:03 AM, Robert Muir wrote:



On Tue, Feb 23, 2021 at 2:30 AM baris.ka...@oracle.com wrote:


Hi,-

  I tried MMapDirectory and i allocated as big as index size on
my J2EE
Container but


Don't allocate java heap memory for the index, MMapDirectory does not 
use java heap memory!


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread baris . kazar

Ok, but how is this MMapDirectory used then?

Best regards


On 2/23/21 7:03 AM, Robert Muir wrote:



On Tue, Feb 23, 2021 at 2:30 AM baris.ka...@oracle.com wrote:


Hi,-

  I tried MMapDirectory and i allocated as big as index size on my
J2EE
Container but


Don't allocate java heap memory for the index, MMapDirectory does not 
use java heap memory!


[ANNOUNCE] Apache Lucene 8.8.1 released

2021-02-23 Thread Timothy Potter
The Lucene PMC is pleased to announce the release of Apache Lucene 8.8.1.


Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for nearly
any application that requires full-text search, especially cross-platform.


This release contains numerous bug fixes, optimizations, and improvements,
some of which are highlighted below. The release is available for immediate
download at:


  


### Lucene 8.8.1 Release Highlights:


No changes from 8.8.0


Please read CHANGES.txt for a full list of changes:


  


Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases. It is possible that the mirror you are using may not
have replicated the release yet. If that is the case, please try another
mirror. This also applies to Maven access.




Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
On Tue, Feb 23, 2021 at 2:30 AM  wrote:

> Hi,-
>
>   I tried MMapDirectory and i allocated as big as index size on my J2EE
> Container but
>
>
Don't allocate java heap memory for the index, MMapDirectory does not use
java heap memory!
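
A small JDK-only sketch of why heap size is not the relevant setting for memory-mapped files. It assumes MMapDirectory relies on the same file-mapping mechanism that FileChannel.map exposes; the class and method names (MmapOffHeapDemo, mapReadOnly) are made up for illustration. A mapped buffer is a direct buffer, so its contents live outside the Java heap and are served from the operating system's file cache.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of Lucene.
class MmapOffHeapDemo {

  /** Maps a whole file read-only. A single mapping is limited to Integer.MAX_VALUE bytes. */
  static MappedByteBuffer mapReadOnly(Path file) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      // The mapping stays valid after the channel is closed.
      return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    }
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("segment", ".bin");
    Files.write(file, new byte[] {1, 2, 3, 4});

    MappedByteBuffer buf = mapReadOnly(file);
    // isDirect() reports that the buffer is not backed by a heap array.
    System.out.println("direct (off-heap): " + buf.isDirect());
    System.out.println("first byte: " + buf.get(0));
  }
}
```

Because the mapped bytes are not on the heap, a larger -Xmx does not make more room for them; it only leaves less physical memory for the kernel to cache the files.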