Re: indexing all suffixes to support leading wildcard?

2014-08-28 Thread Jack Krupansky

Use the ngram token filter, and then a query of 512 would match by itself:
http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
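
A minimal sketch of this ngram approach, assuming Lucene 4.9 and a keyword-style
"phoneNum" field (the 1..10 gram sizes are illustrative):

[ java code ]
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.util.Version;

public class PhoneNGramAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Treat the whole value as one token, then emit every 1..10 character n-gram,
    // so a plain TermQuery for "512" matches an indexed value of 5551234.
    Tokenizer source = new KeywordTokenizer(reader);
    TokenStream ngrams = new NGramTokenFilter(Version.LUCENE_4_9, source, 1, 10);
    return new TokenStreamComponents(source, ngrams);
  }
}

The trade-off is index size: every substring up to maxGram gets indexed, and queries
longer than maxGram would still need a wildcard.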

-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Thursday, August 28, 2014 11:52 PM
To: java-user
Subject: Re: indexing all suffixes to support leading wildcard?

The "usual" approach is to index to a second field but backwards.
See ReverseStringFilter... Then all your leading wildcards
are really trailing wildcards in the reversed field.

Best,
Erick


On Thu, Aug 28, 2014 at 10:38 AM, Rob Nikander 
wrote:


Hi,

I've got some short fields (phone num, email) that I'd like to search using
good old string matching.  (The full query is a boolean "or" that also uses
real text fields.) I see the warnings about wildcard queries that start
with *, and I'm wondering... do you think it would be a good idea to index
all the suffixes?  Eg, a phone num 5551234, would become 7 values for the
"phoneNum" field: 4, 34, 234, etc.  So "512*" would be a hit.

And maybe do something with the boosts so it doesn't overvalue the match
when it hits multiple values.  ?

Rob




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: indexing all suffixes to support leading wildcard?

2014-08-28 Thread Erick Erickson
The "usual" approach is to index to a second field but backwards.
See ReverseStringFilter... Then all your leading wildcards
are really trailing wildcards in the reversed field.

Best,
Erick
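
A sketch of this reversed-field setup, assuming Lucene 4.x; the phoneNum_rev field
name and the query rewrite are illustrative:

[ java code ]
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.reverse.ReverseStringFilter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.util.Version;

public class ReversedKeywordAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Index the value reversed, e.g. "5551234" becomes the term "4321555".
    Tokenizer source = new KeywordTokenizer(reader);
    return new TokenStreamComponents(source,
        new ReverseStringFilter(Version.LUCENE_4_9, source));
  }

  // A leading wildcard "*1234" on phoneNum becomes the trailing wildcard "4321*" here.
  public static Query leadingWildcard(String suffix) {
    String reversed = new StringBuilder(suffix).reverse().toString();
    return new WildcardQuery(new Term("phoneNum_rev", reversed + "*"));
  }
}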


On Thu, Aug 28, 2014 at 10:38 AM, Rob Nikander 
wrote:

> Hi,
>
> I've got some short fields (phone num, email) that I'd like to search using
> good old string matching.  (The full query is a boolean "or" that also uses
> real text fields.) I see the warnings about wildcard queries that start
> with *, and I'm wondering... do you think it would be a good idea to index
> all the suffixes?  Eg, a phone num 5551234, would become 7 values for the
> "phoneNum" field: 4, 34, 234, etc.  So "512*" would be a hit.
>
> And maybe do something with the boosts so it doesn't overvalue the match
> when it hits multiple values.  ?
>
> Rob
>


Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread craiglang44
-(FST)=
Sent from my BlackBerry® smartphone

-Original Message-
From: Michael McCandless 
Date: Thu, 28 Aug 2014 15:49:30 
To: Lucene Users
Reply-To: java-user@lucene.apache.org
Subject: Re: BlockTreeTermsReader consumes crazy amount of memory

Ugh, you're right: this still won't re-use from IW's reader pool.  Can
you open an issue?  Somehow we should make this easier.

In the meantime, I guess you can use openIfChanged from your "back in
time" reader to open another "back in time" reader.  This way you have
two pools... IW's pool for the series of NRT readers, and another pool
shared by the "back in time" readers ... but we should somehow fix
this so it's one pool.
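
A rough sketch of this openIfChanged chaining, assuming Lucene 4.x (class, method
and variable names are illustrative): each new "back in time" reader is seeded from
the previous one so that SegmentReaders for unchanged segments are shared.

[ java code ]
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;

public final class SnapshotReaders {
  // Returns a reader for the given commit, sharing SegmentReaders with the
  // previous snapshot where the two commit points have segments in common.
  // openIfChanged returns null when the old reader already matches the commit.
  public static DirectoryReader openSnapshot(DirectoryReader previousSnapshot,
                                             IndexCommit commit) throws IOException {
    DirectoryReader next = DirectoryReader.openIfChanged(previousSnapshot, commit);
    return next != null ? next : previousSnapshot;
  }
}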

OK looks like it's the FST terms index, and yes synthetic terms gives
you synthetic results :)  However, to reduce the FST ram here you can
just increase the block sizes used by the terms index (see
BlockTreeTermsWriter).  Larger blocks = smaller terms index (FST) but
possibly slower searches, especially MultiTermQueries ...

Mike McCandless

http://blog.mikemccandless.com


On Thu, Aug 28, 2014 at 2:50 PM, Vitaly Funstein  wrote:
> Thanks, Mike - I think the issue is actually the latter, i.e. SegmentReader
> on its own can certainly use enough heap to cause problems, which of course
> would be made that much worse by failure to pool readers for unchanged
> segments.
>
> But where are you seeing the behavior that would result in reuse of
> SegmentReaders from the pool inside the index writer? If I'm reading the
> code right here, here's what it calls:
>
>   protected DirectoryReader doOpenIfChanged(final IndexCommit commit)
> throws IOException {
> ensureOpen();
>
> // If we were obtained by writer.getReader(), re-ask the
> // writer to get a new reader.
> if (writer != null) {
>   return doOpenFromWriter(commit);
> } else {
>   return doOpenNoWriter(commit);
> }
>   }
>
>   private DirectoryReader doOpenFromWriter(IndexCommit commit) throws
> IOException {
> if (commit != null) {
>   return doOpenFromCommit(commit);
> }
> ..
>
> There is no attempt made to inspect the segments inside the commit point
> here, for possible reader pool reuse.
>
> So here's a drill down into the SegmentReader memory footprint. There
> aren't actually 88 fields here - rather, this number reflects the "shallow"
> heap size of the BlockTreeTermsReader instance, i.e. calculated size without
> following any of the references from it (at depth 0).
>
> https://drive.google.com/file/d/0B5eRTXMELFjjVmxLejQzazVPZzA/edit?usp=sharing
>
> I suppose totally randomly generated field values are a bit of a contrived
> use case, since in a real world there will be far less randomness to each,
> but perhaps this gives us an idea for the worst case scenario... just
> guessing though.
>
>
>
> On Thu, Aug 28, 2014 at 11:28 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Can you drill down some more to see what's using those ~46 MB?  Is it
>> the FSTs in the terms index?
>>
>> But, we need to decouple the "single segment is opened with multiple
>> SegmentReaders" from e.g. "single SegmentReader is using too much RAM
>> to hold terms index".  E.g. from this screen shot it looks like there
>> are 88 fields totaling ~46 MB so ~0.5 MB per indexed field ...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Aug 28, 2014 at 1:56 PM, Vitaly Funstein 
>> wrote:
>> > Here's the link:
>> >
>> https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing
>> >
>> > I'm indexing let's say 11 unique fields per document. Also, the NRT
>> reader
>> > is opened continually, and "regular" searches use that one. But a special
>> > kind of feature allows searching a particular point in time (they get
>> > cleaned out based on some other logic), which requires opening a non-NRT
>> > reader just to service such search requests - in my understanding no
>> > segment readers for this reader can be shared with the NRT reader's
>> pool...
>> > or am I off here? This seems evident from another heap dump fragment that
>> > shows a full new set of segment readers attached to that "temporary"
>> > reader:
>> >
>> >
>> https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp=sharing
>> >
>> >
>> > On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
>> > luc...@mikemccandless.com> wrote:
>> >
>> >> Hmm screen shot didn't make it ... can you post link?
>> >>
>> >> If you are using NRT reader then when a new one is opened, it won't
>> >> open new SegmentReaders for all segments, just for newly
>> >> flushed/merged segments since the last reader was opened.  So for your
>> >> N commit points that you have readers open for, they will be sharing
>> >> SegmentReaders for segments they have in common.
>> >>
>> >> How many unique fields are you adding?
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >>
>> >> On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein 
>> >> wrote:

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread craiglang44
(Commit=all!!
Sent from my BlackBerry® smartphone

-Original Message-
From: Vitaly Funstein 
Date: Thu, 28 Aug 2014 13:18:08 
To: 
Reply-To: java-user@lucene.apache.org
Subject: Re: BlockTreeTermsReader consumes crazy amount of memory

Thanks for the suggestions! I'll file an enhancement request.

But I am still a little skeptical about the approach of "pooling" segment
readers from prior DirectoryReader instances, opened at earlier commit
points. It looks like the up to date check for non-NRT directory reader
just compares the segment infos file names, and since each commit will
create a new SI file, doesn't this make the check moot?

  private DirectoryReader doOpenNoWriter(IndexCommit commit) throws
IOException {

if (commit == null) {
  if (isCurrent()) {
return null;
  }
} else {
  if (directory != commit.getDirectory()) {
throw new IOException("the specified commit does not match the
specified Directory");
  }
  if (segmentInfos != null &&
commit.getSegmentsFileName().equals(segmentInfos.getSegmentsFileName())) {
return null;
  }
}

return doOpenFromCommit(commit);
  }

As for tuning the block size - would you recommend increasing it to
BlockTreeTermsWriter.DEFAULT_MAX_BLOCK_SIZE, or close to it? And if I did
this, would I have readability issues for indices created before this
change? We are already using a customized codec though, so perhaps adding
this to the codec is okay and transparent?


On Thu, Aug 28, 2014 at 12:49 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Ugh, you're right: this still won't re-use from IW's reader pool.  Can
> you open an issue?  Somehow we should make this easier.
>
> In the meantime, I guess you can use openIfChanged from your "back in
> time" reader to open another "back in time" reader.  This way you have
> two pools... IW's pool for the series of NRT readers, and another pool
> shared by the "back in time" readers ... but we should somehow fix
> this so it's one pool.
>
> OK looks like it's the FST terms index, and yes synthetic terms gives
> you synthetic results :)  However, to reduce the FST ram here you can
> just increase the block sizes used by the terms index (see
> BlockTreeTermsWriter).  Larger blocks = smaller terms index (FST) but
> possibly slower searches, especially MultiTermQueries ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Aug 28, 2014 at 2:50 PM, Vitaly Funstein 
> wrote:
> > Thanks, Mike - I think the issue is actually the latter, i.e.
> SegmentReader
> > on its own can certainly use enough heap to cause problems, which of
> course
> > would be made that much worse by failure to pool readers for unchanged
> > segments.
> >
> > But where are you seeing the behavior that would result in reuse of
> > SegmentReaders from the pool inside the index writer? If I'm reading the
> > code right here, here's what it calls:
> >
> >   protected DirectoryReader doOpenIfChanged(final IndexCommit commit)
> > throws IOException {
> > ensureOpen();
> >
> > // If we were obtained by writer.getReader(), re-ask the
> > // writer to get a new reader.
> > if (writer != null) {
> >   return doOpenFromWriter(commit);
> > } else {
> >   return doOpenNoWriter(commit);
> > }
> >   }
> >
> >   private DirectoryReader doOpenFromWriter(IndexCommit commit) throws
> > IOException {
> > if (commit != null) {
> >   return doOpenFromCommit(commit);
> > }
> > ..
> >
> > There is no attempt made to inspect the segments inside the commit point
> > here, for possible reader pool reuse.
> >
> > So here's a drill down into the SegmentReader memory footprint. There
> > aren't actually 88 fields here - rather, this number reflects the "shallow"
> > heap size of the BlockTreeTermsReader instance, i.e. calculated size without
> > following any of the references from it (at depth 0).
> >
> >
> https://drive.google.com/file/d/0B5eRTXMELFjjVmxLejQzazVPZzA/edit?usp=sharing
> >
> > I suppose totally randomly generated field values are a bit of a
> contrived
> > use case, since in a real world there will be far less randomness to
> each,
> > but perhaps this gives us an idea for the worst case scenario... just
> > guessing though.
> >
> >
> >
> > On Thu, Aug 28, 2014 at 11:28 AM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >> Can you drill down some more to see what's using those ~46 MB?  Is it
> >> the FSTs in the terms index?
> >>
> >> But, we need to decouple the "single segment is opened with multiple
> >> SegmentReaders" from e.g. "single SegmentReader is using too much RAM
> >> to hold terms index".  E.g. from this screen shot it looks like there
> >> are 88 fields totaling ~46 MB so ~0.5 MB per indexed field ...
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Thu, Aug 28, 2014 at 1:56 PM, Vitaly Funstein 
> >> wrote:
> >> > Here's the link:
> >> >
> >>
> https://drive.google.

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread craiglang44
(commit != null) {madnbr...@gmail.com
 ] LatestNRIreader-return doOpenFromCommit(commit;
=dandappe...@gmail.com
Sent from my BlackBerry® smartphone

-Original Message-
From: Michael McCandless 
Date: Thu, 28 Aug 2014 14:25:11 
To: Lucene Users
Reply-To: java-user@lucene.apache.org
Subject: Re: BlockTreeTermsReader consumes crazy amount of memory

You can actually use IndexReader.openIfChanged(latestNRTReader,
IndexCommit): this should pull/share SegmentReaders from the pool
inside IW, when available.  But it will fail to share e.g.
SegmentReader no longer part of the NRT view but shared by e.g. two
"back in time" readers.

Really we need to factor out the reader pooling somehow, such that IW
is a user of its NRT pool, but commit-point readers could also more
easily use a shared pool.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Aug 28, 2014 at 2:07 PM, Uwe Schindler  wrote:
> Hi,
>
> if you open the 2nd instance (the back in time reader) using 
> DirectoryReader.open(IndexCommit), then it has of course nothing in common 
> with the IndexWriter, so how can they share the SegmentReaders?
>
> NRT readers from DirectoryReader.open(IndexWriter) are cached inside 
> IndexWriter, but the completely outside DirectoryReader on the older commit 
> point opens all segments on its own. Maybe a solution would be to extend
> IndexWriter.open() to also take a commit point with IndexWriter.
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: Vitaly Funstein [mailto:vfunst...@gmail.com]
>> Sent: Thursday, August 28, 2014 7:56 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: BlockTreeTermsReader consumes crazy amount of memory
>>
>> Here's the link:
>> https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?us
>> p=sharing
>>
>> I'm indexing let's say 11 unique fields per document. Also, the NRT reader is
>> opened continually, and "regular" searches use that one. But a special kind 
>> of
>> feature allows searching a particular point in time (they get cleaned out
>> based on some other logic), which requires opening a non-NRT reader just to
>> service such search requests - in my understanding no segment readers for
>> this reader can be shared with the NRT reader's pool...
>> or am I off here? This seems evident from another heap dump fragment that
>> shows a full new set of segment readers attached to that "temporary"
>> reader:
>>
>> https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp
>> =sharing
>>
>>
>> On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>> > Hmm screen shot didn't make it ... can you post link?
>> >
>> > If you are using NRT reader then when a new one is opened, it won't
>> > open new SegmentReaders for all segments, just for newly
>> > flushed/merged segments since the last reader was opened.  So for your
>> > N commit points that you have readers open for, they will be sharing
>> > SegmentReaders for segments they have in common.
>> >
>> > How many unique fields are you adding?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein 
>> > wrote:
>> > > Mike,
>> > >
>> > > Here's the screenshot; not sure if it will go through as an
>> > > attachment though - if not, I'll post it as a link. Please ignore
>> > > the altered
>> > package
>> > > names, since Lucene is shaded in as part of our build process.
>> > >
>> > > Some more context about the use case. Yes, the terms are pretty much
>> > unique;
>> > > the schema for the data set is actually borrowed from here:
>> > > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the
>> > > UserVisits set, with a couple of other fields added by us. The
>> > > values for the fields are generated almost randomly, though some
>> > > string fields are picked at random from a fixed dictionary.
>> > >
>> > > Also, this type of heap footprint might be tolerable if it stayed
>> > relatively
>> > > constant throughout the system's life cycle (of course, given the
>> > > index
>> > set
>> > > stays more or less static). However, what happens here is that one
>> > > IndexReader reference is maintained by ReaderManager as an NRT
>> reader.
>> > But
>> > > we also would like to support an ability to execute searches against
>> > specific
>> > > index commit points, ideally in parallel. As you might imagine, as
>> > > soon
>> > as a
>> > > new DirectoryReader is opened at a given commit, a whole new set of
>> > > SegmentReader instances is created and populated, effectively
>> > > doubling
>> > the
>> > > already large heap usage... if there was a way to somehow reuse
>> > > readers
>> > for
>> > > unchanged segments already pooled by IndexWriter, that would help
>> > > tremendously here. But I don't think there's a way to link up the
>> > > two
>> > sets,
>> > > at least no

Re: Can I update one field in doc?

2014-08-28 Thread Rob Nikander
I used the "Luke" tool to look at my documents. It shows that the positions
and offsets in the term vectors get wiped out, in all fields.  I'm using
Lucene 4.8.  I guess I'll just rebuild the entire doc.

Rob
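
For reference, in-place field updates in this Lucene line are limited to doc-values
fields; everything else needs the whole document re-added. A hedged sketch, with
illustrative field names, assuming "lastModified" was indexed as a
NumericDocValuesField from the start:

[ java code ]
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public final class DocValueTouch {
  // Rewrites only the numeric doc values for docs matching the id term; stored
  // fields, postings and term vectors are left untouched.
  public static void touch(IndexWriter writer, String id, long timestamp)
      throws IOException {
    writer.updateNumericDocValue(new Term("id", id), "lastModified", timestamp);
  }
}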


On Thu, Aug 28, 2014 at 5:33 PM, Rob Nikander 
wrote:

> I tried something like this, to loop through all docs in my index and
> patch a field.  But it appears to wipe out some parts of the stored values
> in the document. For example, highlighting stopped working.
>
> [ scala code ]
> val q = new MatchAllDocsQuery()
> val topDocs = searcher.search(q, 100)
> val field = new StringField(FieldNames.phone, "", Field.Store.YES)
>
> for (sdoc <- topDocs.scoreDocs) {
>val doc = searcher.doc(sdoc.doc)
>val id = doc.get(FieldNames.id)
>var phone = doc.get(FieldNames.phone)
>phone = phone + " changed"
>doc.removeField(FieldNames.phone)
>field.setStringValue(searchable)
>doc.add(field)
>writer.updateDocument(new Term(FieldNames.id, id), doc)
> }
>
> Should it work?  The documents have many fields and it takes 35 minutes to
> rebuild the index from scratch. I'd like to be able to run smaller "patch"
> tasks like this.
>
> Rob
>


Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread Vitaly Funstein
On Thu, Aug 28, 2014 at 2:38 PM, Vitaly Funstein 
wrote:

>
> Looks like this is used inside Lucene41PostingsFormat, which simply passes
> in those defaults - so you are effectively saying the minimum (and
> therefore, maximum) block size can be raised to reuse the size of the terms
> index inside those TreeMap nodes?
>
>

Sorry, I meant reduce - not reuse.


Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread Vitaly Funstein
On Thu, Aug 28, 2014 at 1:25 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

>
> The segments_N file can be different, that's fine: after that, we then
> re-use SegmentReaders when they are in common between the two commit
> points.  Each segments_N file refers to many segments...
>
>
Yes, you are totally right - I didn't follow the code far enough the first
time around. :) This is an excellent idea, actually - I can probably
arrange maintained commit points as an MRU data structure (e.g.
LinkedHashMap with access order), and simply grab the most recently opened
reader to pass in when obtaining a new one from the new commit point - to
maximize segment reader reuse.
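
A small sketch of that MRU arrangement (keys, capacity and eviction handling are
purely illustrative): an access-ordered LinkedHashMap whose most recently used
reader seeds openIfChanged for the next commit point.

[ java code ]
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.index.DirectoryReader;

public final class SnapshotPool {
  private static final int MAX_SNAPSHOTS = 8;  // assumed cap on live commit points

  // Keyed by segments_N file name; iteration order runs LRU -> MRU, so the last
  // entry is the reader to pass into DirectoryReader.openIfChanged(mru, newCommit).
  private final Map<String, DirectoryReader> snapshots =
      new LinkedHashMap<String, DirectoryReader>(16, 0.75f, true /* access order */) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, DirectoryReader> eldest) {
          // Evicted readers must still be closed by whoever owns them.
          return size() > MAX_SNAPSHOTS;
        }
      };
}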


> You can set it (min and max) as high as you want; the only hard
> requirement is that max >= 2*(min-1), I believe.
>

Looks like this is used inside Lucene41PostingsFormat, which simply passes
in those defaults - so you are effectively saying the minimum (and
therefore, maximum) block size can be raised to reuse the size of the terms
index inside those TreeMap nodes?


>
> > We are already using a customized codec though, so perhaps adding
> > this to the codec is okay and transparent?
>
> Hmmm :)  Customized in what manner?
>
>
We need to have the ability to turn off stored fields compression, so there
is one codec in case the system is configured that way. The other one
exists for compression on, but there I tweaked stored fields format for
bias toward decompression, as well as a smaller chunk size - based on some
empirical observations in executed tests. I am guessing I'll just add
another customization to both that deals with the block sizing for postings
format, and see what difference that makes...
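
If it helps, a hedged sketch of such a customization (the 64/128 block sizes and the
Lucene49Codec base class are assumptions; the hard constraint noted above is
max >= 2*(min-1), and since block size only affects writing, segments already
written with the defaults remain readable):

[ java code ]
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
import org.apache.lucene.codecs.lucene49.Lucene49Codec;

public class LargeBlockCodec extends Lucene49Codec {
  // Larger term blocks shrink the in-heap terms index (FST), at some cost to
  // term lookups and MultiTermQuery performance.
  private final PostingsFormat postings = new Lucene41PostingsFormat(64, 128);

  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    return postings;
  }
}
// Plugged in at write time via IndexWriterConfig.setCodec(new LargeBlockCodec()).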


Can I update one field in doc?

2014-08-28 Thread Rob Nikander
I tried something like this, to loop through all docs in my index and patch
a field.  But it appears to wipe out some parts of the stored values in the
document. For example, highlighting stopped working.

[ scala code ]
val q = new MatchAllDocsQuery()
val topDocs = searcher.search(q, 100)
val field = new StringField(FieldNames.phone, "", Field.Store.YES)

for (sdoc <- topDocs.scoreDocs) {
   val doc = searcher.doc(sdoc.doc)
   val id = doc.get(FieldNames.id)
   var phone = doc.get(FieldNames.phone)
   phone = phone + " changed"
   doc.removeField(FieldNames.phone)
   field.setStringValue(searchable)
   doc.add(field)
   writer.updateDocument(new Term(FieldNames.id, id), doc)
}

Should it work?  The documents have many fields and it takes 35 minutes to
rebuild the index from scratch. I'd like to be able to run smaller "patch"
tasks like this.

Rob


Re: madzmad-gleeson consumes crazy amount of memory

2014-08-28 Thread craiglang44
Doopenfromcommit!=mep
Sent from my BlackBerry® smartphone

-Original Message-
From: Vitaly Funstein 
Date: Thu, 28 Aug 2014 11:50:34 
To: 
Reply-To: java-user@lucene.apache.org
Subject: Re: BlockTreeTermsReader consumes crazy amount of memory

Thanks, Mike - I think the issue is actually the latter, i.e. SegmentReader
on its own can certainly use enough heap to cause problems, which of course
would be made that much worse by failure to pool readers for unchanged
segments.

But where are you seeing the behavior that would result in reuse of
SegmentReaders from the pool inside the index writer? If I'm reading the
code right here, here's what it calls:

  protected DirectoryReader doOpenIfChanged(final IndexCommit commit)
throws IOException {
ensureOpen();

// If we were obtained by writer.getReader(), re-ask the
// writer to get a new reader.
if (writer != null) {
  return doOpenFromWriter(commit);
} else {
  return doOpenNoWriter(commit);
}
  }

  private DirectoryReader doOpenFromWriter(IndexCommit commit) throws
IOException {
if (commit != null) {
  return doOpenFromCommit(commit);
}
..

There is no attempt made to inspect the segments inside the commit point
here, for possible reader pool reuse.

So here's a drill down into the SegmentReader memory foot print. There
aren't actually 88 fields here - rather, this number reflects the "shallow"
heap size of BlockTreeTermsReader instance, i.e. calculated size without
following any of the references from it (at depth 0).

https://drive.google.com/file/d/0B5eRTXMELFjjVmxLejQzazVPZzA/edit?usp=sharing

I suppose totally randomly generated field values are a bit of a contrived
use case, since in a real world there will be far less randomness to each,
but perhaps this gives us an idea for the worst case scenario... just
guessing though.



On Thu, Aug 28, 2014 at 11:28 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Can you drill down some more to see what's using those ~46 MB?  Is the
> the FSTs in the terms index?
>
> But, we need to decouple the "single segment is opened with multiple
> SegmentReaders" from e.g. "single SegmentReader is using too much RAM
> to hold terms index".  E.g. from this screen shot it looks like there
> are 88 fields totaling ~46 MB so ~0.5 MB per indexed field ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Aug 28, 2014 at 1:56 PM, Vitaly Funstein 
> wrote:
> > Here's the link:
> >
> https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing
> >
> > I'm indexing let's say 11 unique fields per document. Also, the NRT
> reader
> > is opened continually, and "regular" searches use that one. But a special
> > kind of feature allows searching a particular point in time (they get
> > cleaned out based on some other logic), which requires opening a non-NRT
> > reader just to service such search requests - in my understanding no
> > segment readers for this reader can be shared with the NRT reader's
> pool...
> > or am I off here? This seems evident from another heap dump fragment that
> > shows a full new set of segment readers attached to that "temporary"
> > reader:
> >
> >
> https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp=sharing
> >
> >
> > On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >> Hmm screen shot didn't make it ... can you post link?
> >>
> >> If you are using NRT reader then when a new one is opened, it won't
> >> open new SegmentReaders for all segments, just for newly
> >> flushed/merged segments since the last reader was opened.  So for your
> >> N commit points that you have readers open for, they will be sharing
> >> SegmentReaders for segments they have in common.
> >>
> >> How many unique fields are you adding?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein 
> >> wrote:
> >> > Mike,
> >> >
> >> > Here's the screenshot; not sure if it will go through as an attachment
> >> > though - if not, I'll post it as a link. Please ignore the altered
> >> package
> >> > names, since Lucene is shaded in as part of our build process.
> >> >
> >> > Some more context about the use case. Yes, the terms are pretty much
> >> unique;
> >> > the schema for the data set is actually borrowed from here:
> >> > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the
> UserVisits
> >> > set, with a couple of other fields added by us. The values for the
> fields
> >> > are generated almost randomly, though some string fields are picked at
> >> > random from a fixed dictionary.
> >> >
> >> > Also, this type of heap footprint might be tolerable if it stayed
> >> relatively
> >> > constant throughout the system's life cycle (of course, given the
> index
> >> set
> >> > stays more or less static). However, what happens here is that one
> >> > IndexReader reference is maintained by Re

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread Michael McCandless
On Thu, Aug 28, 2014 at 4:18 PM, Vitaly Funstein  wrote:
> Thanks for the suggestions! I'll file an enhancement request.
>
> But I am still a little skeptical about the approach of "pooling" segment
> readers from prior DirectoryReader instances, opened at earlier commit
> points. It looks like the up to date check for non-NRT directory reader
> just compares the segment infos file names, and since each commit will
> create a new SI file, doesn't this make the check moot?

The segments_N file can be different, that's fine: after that, we then
re-use SegmentReaders when they are in common between the two commit
points.  Each segments_N file refers to many segments...

>   private DirectoryReader doOpenNoWriter(IndexCommit commit) throws
> IOException {
>
> if (commit == null) {
>   if (isCurrent()) {
> return null;
>   }
> } else {
>   if (directory != commit.getDirectory()) {
> throw new IOException("the specified commit does not match the
> specified Directory");
>   }
>   if (segmentInfos != null &&
> commit.getSegmentsFileName().equals(segmentInfos.getSegmentsFileName())) {
> return null;
>   }
> }
>
> return doOpenFromCommit(commit);
>   }
>
> As for tuning the block size - would you recommend increasing it to
> BlockTreeTermsWriter.DEFAULT_MAX_BLOCK_SIZE, or close to it?

You can set it (min and max) as high as you want; the only hard
requirement is that max >= 2*(min-1), I believe.

> And if I did
> this, would I have readability issues for indices created before this
> change?

It won't have any effect on them: these parameters are already "baked
into" those indices... only newly written indices with your custom
codec will write larger blocks.

> We are already using a customized codec though, so perhaps adding
> this to the codec is okay and transparent?

Hmmm :)  Customized in what manner?

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread Vitaly Funstein
Thanks for the suggestions! I'll file an enhancement request.

But I am still a little skeptical about the approach of "pooling" segment
readers from prior DirectoryReader instances, opened at earlier commit
points. It looks like the up to date check for non-NRT directory reader
just compares the segment infos file names, and since each commit will
create a new SI file, doesn't this make the check moot?

  private DirectoryReader doOpenNoWriter(IndexCommit commit) throws
IOException {

if (commit == null) {
  if (isCurrent()) {
return null;
  }
} else {
  if (directory != commit.getDirectory()) {
throw new IOException("the specified commit does not match the
specified Directory");
  }
  if (segmentInfos != null &&
commit.getSegmentsFileName().equals(segmentInfos.getSegmentsFileName())) {
return null;
  }
}

return doOpenFromCommit(commit);
  }

As for tuning the block size - would you recommend increasing it to
BlockTreeTermsWriter.DEFAULT_MAX_BLOCK_SIZE, or close to it? And if I did
this, would I have readability issues for indices created before this
change? We are already using a customized codec though, so perhaps adding
this to the codec is okay and transparent?


On Thu, Aug 28, 2014 at 12:49 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Ugh, you're right: this still won't re-use from IW's reader pool.  Can
> you open an issue?  Somehow we should make this easier.
>
> In the meantime, I guess you can use openIfChanged from your "back in
> time" reader to open another "back in time" reader.  This way you have
> two pools... IW's pool for the series of NRT readers, and another pool
> shared by the "back in time" readers ... but we should somehow fix
> this so it's one pool.
>
> OK looks like it's the FST terms index, and yes synthetic terms gives
> you synthetic results :)  However, to reduce the FST ram here you can
> just increase the block sizes used by the terms index (see
> BlockTreeTermsWriter).  Larger blocks = smaller terms index (FST) but
> possibly slower searches, especially MultiTermQueries ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Aug 28, 2014 at 2:50 PM, Vitaly Funstein 
> wrote:
> > Thanks, Mike - I think the issue is actually the latter, i.e.
> SegmentReader
> > on its own can certainly use enough heap to cause problems, which of
> course
> > would be made that much worse by failure to pool readers for unchanged
> > segments.
> >
> > But where are you seeing the behavior that would result in reuse of
> > SegmentReaders from the pool inside the index writer? If I'm reading the
> > code right here, here's what it calls:
> >
> >   protected DirectoryReader doOpenIfChanged(final IndexCommit commit)
> > throws IOException {
> > ensureOpen();
> >
> > // If we were obtained by writer.getReader(), re-ask the
> > // writer to get a new reader.
> > if (writer != null) {
> >   return doOpenFromWriter(commit);
> > } else {
> >   return doOpenNoWriter(commit);
> > }
> >   }
> >
> >   private DirectoryReader doOpenFromWriter(IndexCommit commit) throws
> > IOException {
> > if (commit != null) {
> >   return doOpenFromCommit(commit);
> > }
> > ..
> >
> > There is no attempt made to inspect the segments inside the commit point
> > here, for possible reader pool reuse.
> >
> > So here's a drill down into the SegmentReader memory foot print. There
> > aren't actually 88 fields here - rather, this number reflects the
> "shallow"
> > heap size of BlockTreeTermsReader instance, i.e. calculated size without
> > following any of the references from it (at depth 0).
> >
> >
> https://drive.google.com/file/d/0B5eRTXMELFjjVmxLejQzazVPZzA/edit?usp=sharing
> >
> > I suppose totally randomly generated field values are a bit of a
> contrived
> > use case, since in a real world there will be far less randomness to
> each,
> > but perhaps this gives us an idea for the worst case scenario... just
> > guessing though.
> >
> >
> >
> > On Thu, Aug 28, 2014 at 11:28 AM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >> Can you drill down some more to see what's using those ~46 MB?  Is the
> >> the FSTs in the terms index?
> >>
> >> But, we need to decouple the "single segment is opened with multiple
> >> SegmentReaders" from e.g. "single SegmentReader is using too much RAM
> >> to hold terms index".  E.g. from this screen shot it looks like there
> >> are 88 fields totaling ~46 MB so ~0.5 MB per indexed field ...
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Thu, Aug 28, 2014 at 1:56 PM, Vitaly Funstein 
> >> wrote:
> >> > Here's the link:
> >> >
> >>
> https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing
> >> >
> >> > I'm indexing let's say 11 unique fields per document. Also, the NRT
> >> reader
> >> > is opened continually, and "regular" searches use that one. But a
> special
> >> > kin

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread craiglang44
if (commit != null) {
  return doOpenFromCommit(commit);
}
Sent from my BlackBerry® smartphone

-Original Message-
From: Michael McCandless 
Date: Thu, 28 Aug 2014 14:25:11 
To: Lucene Users
Reply-To: java-user@lucene.apache.org
Subject: Re: BlockTreeTermsReader consumes crazy amount of memory

You can actually use IndexReader.openIfChanged(latestNRTReader,
IndexCommit): this should pull/share SegmentReaders from the pool
inside IW, when available.  But it will fail to share e.g.
SegmentReader no longer part of the NRT view but shared by e.g. two
"back in time" readers.

Really we need to factor out the reader pooling somehow, such that IW
is a user for its NRT pool, but commit-point readers could also more
easily use a shared pool.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Aug 28, 2014 at 2:07 PM, Uwe Schindler  wrote:
> Hi,
>
> if you open the 2nd instance (the back in time reader) using 
> DirectoryReader.open(IndexCommit), then it has of course nothing in common 
> with the IndexWriter, so how can they share the SegmentReaders?
>
> NRT readers from DirectoryReader.open(IndexWriter) are cached inside 
> IndexWriter, but the completely outside DirectoryReader on the older commit 
> point opens all segments on its own. Maybe a solution would be to extend
> IndexWriter.open() to also take a commit point with IndexWriter.
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: Vitaly Funstein [mailto:vfunst...@gmail.com]
>> Sent: Thursday, August 28, 2014 7:56 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: BlockTreeTermsReader consumes crazy amount of memory
>>
>> Here's the link:
>> https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?us
>> p=sharing
>>
>> I'm indexing let's say 11 unique fields per document. Also, the NRT reader is
>> opened continually, and "regular" searches use that one. But a special kind 
>> of
>> feature allows searching a particular point in time (they get cleaned out
>> based on some other logic), which requires opening a non-NRT reader just to
>> service such search requests - in my understanding no segment readers for
>> this reader can be shared with the NRT reader's pool...
>> or am I off here? This seems evident from another heap dump fragment that
>> shows a full new set of segment readers attached to that "temporary"
>> reader:
>>
>> https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp
>> =sharing
>>
>>
>> On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>> > Hmm screen shot didn't make it ... can you post link?
>> >
>> > If you are using NRT reader then when a new one is opened, it won't
>> > open new SegmentReaders for all segments, just for newly
>> > flushed/merged segments since the last reader was opened.  So for your
>> > N commit points that you have readers open for, they will be sharing
>> > SegmentReaders for segments they have in common.
>> >
>> > How many unique fields are you adding?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein 
>> > wrote:
>> > > Mike,
>> > >
>> > > Here's the screenshot; not sure if it will go through as an
>> > > attachment though - if not, I'll post it as a link. Please ignore
>> > > the altered
>> > package
>> > > names, since Lucene is shaded in as part of our build process.
>> > >
>> > > Some more context about the use case. Yes, the terms are pretty much
>> > unique;
>> > > the schema for the data set is actually borrowed from here:
>> > > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the
>> > > UserVisits set, with a couple of other fields added by us. The
>> > > values for the fields are generated almost randomly, though some
>> > > string fields are picked at random from a fixed dictionary.
>> > >
>> > > Also, this type of heap footprint might be tolerable if it stayed
>> > relatively
>> > > constant throughout the system's life cycle (of course, given the
>> > > index
>> > set
>> > > stays more or less static). However, what happens here is that one
>> > > IndexReader reference is maintained by ReaderManager as an NRT
>> reader.
>> > But
>> > > we also would like support an ability to execute searches against
>> > specific
>> > > index commit points, ideally in parallel. As you might imagine, as
>> > > soon
>> > as a
>> > > new DirectoryReader is opened at a given commit, a whole new set of
>> > > SegmentReader instances is created and populated, effectively
>> > > doubling
>> > the
>> > > already large heap usage... if there was a way to somehow reuse
>> > > readers
>> > for
>> > > unchanged segments already pooled by IndexWriter, that would help
>> > > tremendously here. But I don't think there's a way to link up the
>> > > two
>> > sets,
>> > > at least not in the Lucene 

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread Michael McCandless
Ugh, you're right: this still won't re-use from IW's reader pool.  Can
you open an issue?  Somehow we should make this easier.

In the meantime, I guess you can use openIfChanged from your "back in
time" reader to open another "back in time" reader.  This way you have
two pools... IW's pool for the series of NRT readers, and another pool
shared by the "back in time" readers ... but we should somehow fix
this so it's one pool.

OK looks like it's the FST terms index, and yes synthetic terms gives
you synthetic results :)  However, to reduce the FST ram here you can
just increase the block sizes used by the terms index (see
BlockTreeTermsWriter).  Larger blocks = smaller terms index (FST) but
possibly slower searches, especially MultiTermQueries ...

Mike McCandless

http://blog.mikemccandless.com


On Thu, Aug 28, 2014 at 2:50 PM, Vitaly Funstein  wrote:
> Thanks, Mike - I think the issue is actually the latter, i.e. SegmentReader
> on its own can certainly use enough heap to cause problems, which of course
> would be made that much worse by failure to pool readers for unchanged
> segments.
>
> But where are you seeing the behavior that would result in reuse of
> SegmentReaders from the pool inside the index writer? If I'm reading the
> code right here, here's what it calls:
>
>   protected DirectoryReader doOpenIfChanged(final IndexCommit commit)
> throws IOException {
> ensureOpen();
>
> // If we were obtained by writer.getReader(), re-ask the
> // writer to get a new reader.
> if (writer != null) {
>   return doOpenFromWriter(commit);
> } else {
>   return doOpenNoWriter(commit);
> }
>   }
>
>   private DirectoryReader doOpenFromWriter(IndexCommit commit) throws
> IOException {
> if (commit != null) {
>   return doOpenFromCommit(commit);
> }
> ..
>
> There is no attempt made to inspect the segments inside the commit point
> here, for possible reader pool reuse.
>
> So here's a drill down into the SegmentReader memory foot print. There
> aren't actually 88 fields here - rather, this number reflects the "shallow"
> heap size of BlockTreeTermsReader instance, i.e. calculated size without
> following any of the references from it (at depth 0).
>
> https://drive.google.com/file/d/0B5eRTXMELFjjVmxLejQzazVPZzA/edit?usp=sharing
>
> I suppose totally randomly generated field values are a bit of a contrived
> use case, since in a real world there will be far less randomness to each,
> but perhaps this gives us an idea for the worst case scenario... just
> guessing though.
>
>
>
> On Thu, Aug 28, 2014 at 11:28 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Can you drill down some more to see what's using those ~46 MB?  Is the
>> the FSTs in the terms index?
>>
>> But, we need to decouple the "single segment is opened with multiple
>> SegmentReaders" from e.g. "single SegmentReader is using too much RAM
>> to hold terms index".  E.g. from this screen shot it looks like there
>> are 88 fields totaling ~46 MB so ~0.5 MB per indexed field ...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Aug 28, 2014 at 1:56 PM, Vitaly Funstein 
>> wrote:
>> > Here's the link:
>> >
>> https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing
>> >
>> > I'm indexing let's say 11 unique fields per document. Also, the NRT
>> reader
>> > is opened continually, and "regular" searches use that one. But a special
>> > kind of feature allows searching a particular point in time (they get
>> > cleaned out based on some other logic), which requires opening a non-NRT
>> > reader just to service such search requests - in my understanding no
>> > segment readers for this reader can be shared with the NRT reader's
>> pool...
>> > or am I off here? This seems evident from another heap dump fragment that
>> > shows a full new set of segment readers attached to that "temporary"
>> > reader:
>> >
>> >
>> https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp=sharing
>> >
>> >
>> > On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
>> > luc...@mikemccandless.com> wrote:
>> >
>> >> Hmm screen shot didn't make it ... can you post link?
>> >>
>> >> If you are using NRT reader then when a new one is opened, it won't
>> >> open new SegmentReaders for all segments, just for newly
>> >> flushed/merged segments since the last reader was opened.  So for your
>> >> N commit points that you have readers open for, they will be sharing
>> >> SegmentReaders for segments they have in common.
>> >>
>> >> How many unique fields are you adding?
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >>
>> >> On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein 
>> >> wrote:
>> >> > Mike,
>> >> >
>> >> > Here's the screenshot; not sure if it will go through as an attachment
>> >> > though - if not, I'll post it as a link. Please ignore the altered
>> >> package
>> >> > names, since Lucene is shaded in as part of our build p

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread craiglang44

Sent from my BlackBerry® smartphone

-Original Message-
From: Vitaly Funstein 
Date: Thu, 28 Aug 2014 10:56:17 
To: 
Reply-To: java-user@lucene.apache.org
Subject: Re: BlockTreeTermsReader consumes crazy amount of memory

Here's the link:
https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing

I'm indexing let's say 11 unique fields per document. Also, the NRT reader
is opened continually, and "regular" searches use that one. But a special
kind of feature allows searching a particular point in time (they get
cleaned out based on some other logic), which requires opening a non-NRT
reader just to service such search requests - in my understanding no
segment readers for this reader can be shared with the NRT reader's pool...
or am I off here? This seems evident from another heap dump fragment that
shows a full new set of segment readers attached to that "temporary"
reader:

https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp=sharing


On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hmm screen shot didn't make it ... can you post link?
>
> If you are using NRT reader then when a new one is opened, it won't
> open new SegmentReaders for all segments, just for newly
> flushed/merged segments since the last reader was opened.  So for your
> N commit points that you have readers open for, they will be sharing
> SegmentReaders for segments they have in common.
>
> How many unique fields are you adding?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein 
> wrote:
> > Mike,
> >
> > Here's the screenshot; not sure if it will go through as an attachment
> > though - if not, I'll post it as a link. Please ignore the altered
> package
> > names, since Lucene is shaded in as part of our build process.
> >
> > Some more context about the use case. Yes, the terms are pretty much
> unique;
> > the schema for the data set is actually borrowed from here:
> > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the UserVisits
> > set, with a couple of other fields added by us. The values for the fields
> > are generated almost randomly, though some string fields are picked at
> > random from a fixed dictionary.
> >
> > Also, this type of heap footprint might be tolerable if it stayed
> relatively
> > constant throughout the system's life cycle (of course, given the index
> set
> > stays more or less static). However, what happens here is that one
> > IndexReader reference is maintained by ReaderManager as an NRT reader.
> But
> > we also would like support an ability to execute searches against
> specific
> > index commit points, ideally in parallel. As you might imagine, as soon
> as a
> > new DirectoryReader is opened at a given commit, a whole new set of
> > SegmentReader instances is created and populated, effectively doubling
> the
> > already large heap usage... if there was a way to somehow reuse readers
> for
> > unchanged segments already pooled by IndexWriter, that would help
> > tremendously here. But I don't think there's a way to link up the two
> sets,
> > at least not in the Lucene version we are using (4.6.1) - is this
> correct?
> >
> >
> > On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
> >  wrote:
> >>
> >> This is surprising: unless you have an excessive number of unique
> >> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
> >>
> >> But you only have 12 unique fields?
> >>
> >> Can you post screen shots of the heap usage?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein 
> >> wrote:
> >> > This is a follow up to the earlier thread I started to understand
> memory
> >> > usage patterns of SegmentReader instances, but I decided to create a
> >> > separate post since this issue is much more serious than the heap
> >> > overhead
> >> > created by use of stored field compression.
> >> >
> >> > Here is the use case, once again. The index totals around 300M
> >> > documents,
> >> > with 7 string, 2 long, 1 integer, 1 date and 1 float fields which are
> >> > both
> >> > indexed and stored. It is split into 4 shards, which are basically
> >> > separate
> >> > indices... if that matters. After the index is populated (but not
> >> > optimized
> >> > since we don't do that), the overall heap usage taken up by Lucene is
> >> > over
> >> > 1 GB, much of which is taken up by instances of BlockTreeTermsReader.
> >> > For
> >> > instance for the largest segment in one such an index, the retained
> heap
> >> > size of the internal tree map is around 50 MB. This is evident from
> heap
> >> > dump analysis, which I have screenshots of that I can post here, if
> that
> >> > helps. As there are many segments of various sizes in the index, as
> >> > expected, the total heap usage for one shard stands at around 280 MB.
> >> >
> >> > Could someone shed some light on whether t

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread Vitaly Funstein
Thanks, Mike - I think the issue is actually the latter, i.e. SegmentReader
on its own can certainly use enough heap to cause problems, which of course
would be made that much worse by failure to pool readers for unchanged
segments.

But where are you seeing the behavior that would result in reuse of
SegmentReaders from the pool inside the index writer? If I'm reading the
code right here, here's what it calls:

  protected DirectoryReader doOpenIfChanged(final IndexCommit commit)
throws IOException {
ensureOpen();

// If we were obtained by writer.getReader(), re-ask the
// writer to get a new reader.
if (writer != null) {
  return doOpenFromWriter(commit);
} else {
  return doOpenNoWriter(commit);
}
  }

  private DirectoryReader doOpenFromWriter(IndexCommit commit) throws
IOException {
if (commit != null) {
  return doOpenFromCommit(commit);
}
..

There is no attempt made to inspect the segments inside the commit point
here, for possible reader pool reuse.

So here's a drill down into the SegmentReader memory foot print. There
aren't actually 88 fields here - rather, this number reflects the "shallow"
heap size of BlockTreeTermsReader instance, i.e. calculated size without
following any of the references from it (at depth 0).

https://drive.google.com/file/d/0B5eRTXMELFjjVmxLejQzazVPZzA/edit?usp=sharing

I suppose totally randomly generated field values are a bit of a contrived
use case, since in a real world there will be far less randomness to each,
but perhaps this gives us an idea for the worst case scenario... just
guessing though.



On Thu, Aug 28, 2014 at 11:28 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Can you drill down some more to see what's using those ~46 MB?  Is the
> the FSTs in the terms index?
>
> But, we need to decouple the "single segment is opened with multiple
> SegmentReaders" from e.g. "single SegmentReader is using too much RAM
> to hold terms index".  E.g. from this screen shot it looks like there
> are 88 fields totaling ~46 MB so ~0.5 MB per indexed field ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Aug 28, 2014 at 1:56 PM, Vitaly Funstein 
> wrote:
> > Here's the link:
> >
> https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing
> >
> > I'm indexing let's say 11 unique fields per document. Also, the NRT
> reader
> > is opened continually, and "regular" searches use that one. But a special
> > kind of feature allows searching a particular point in time (they get
> > cleaned out based on some other logic), which requires opening a non-NRT
> > reader just to service such search requests - in my understanding no
> > segment readers for this reader can be shared with the NRT reader's
> pool...
> > or am I off here? This seems evident from another heap dump fragment that
> > shows a full new set of segment readers attached to that "temporary"
> > reader:
> >
> >
> https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp=sharing
> >
> >
> > On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >> Hmm screen shot didn't make it ... can you post link?
> >>
> >> If you are using NRT reader then when a new one is opened, it won't
> >> open new SegmentReaders for all segments, just for newly
> >> flushed/merged segments since the last reader was opened.  So for your
> >> N commit points that you have readers open for, they will be sharing
> >> SegmentReaders for segments they have in common.
> >>
> >> How many unique fields are you adding?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein 
> >> wrote:
> >> > Mike,
> >> >
> >> > Here's the screenshot; not sure if it will go through as an attachment
> >> > though - if not, I'll post it as a link. Please ignore the altered
> >> package
> >> > names, since Lucene is shaded in as part of our build process.
> >> >
> >> > Some more context about the use case. Yes, the terms are pretty much
> >> unique;
> >> > the schema for the data set is actually borrowed from here:
> >> > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the
> UserVisits
> >> > set, with a couple of other fields added by us. The values for the
> fields
> >> > are generated almost randomly, though some string fields are picked at
> >> > random from a fixed dictionary.
> >> >
> >> > Also, this type of heap footprint might be tolerable if it stayed
> >> relatively
> >> > constant throughout the system's life cycle (of course, given the
> index
> >> set
> >> > stays more or less static). However, what happens here is that one
> >> > IndexReader reference is maintained by ReaderManager as an NRT reader.
> >> But
> >> > we also would like support an ability to execute searches against
> >> specific
> >> > index commit points, ideally in parallel. As you might imagine, as
> soon
> >> as a
> >> > new DirectoryReader is opene

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread Michael McCandless
Can you drill down some more to see what's using those ~46 MB?  Is the
the FSTs in the terms index?

But, we need to decouple the "single segment is opened with multiple
SegmentReaders" from e.g. "single SegmentReader is using too much RAM
to hold terms index".  E.g. from this screen shot it looks like there
are 88 fields totaling ~46 MB so ~0.5 MB per indexed field ...

Mike McCandless

http://blog.mikemccandless.com


On Thu, Aug 28, 2014 at 1:56 PM, Vitaly Funstein  wrote:
> Here's the link:
> https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing
>
> I'm indexing let's say 11 unique fields per document. Also, the NRT reader
> is opened continually, and "regular" searches use that one. But a special
> kind of feature allows searching a particular point in time (they get
> cleaned out based on some other logic), which requires opening a non-NRT
> reader just to service such search requests - in my understanding no
> segment readers for this reader can be shared with the NRT reader's pool...
> or am I off here? This seems evident from another heap dump fragment that
> shows a full new set of segment readers attached to that "temporary"
> reader:
>
> https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp=sharing
>
>
> On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hmm screen shot didn't make it ... can you post link?
>>
>> If you are using NRT reader then when a new one is opened, it won't
>> open new SegmentReaders for all segments, just for newly
>> flushed/merged segments since the last reader was opened.  So for your
>> N commit points that you have readers open for, they will be sharing
>> SegmentReaders for segments they have in common.
>>
>> How many unique fields are you adding?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein 
>> wrote:
>> > Mike,
>> >
>> > Here's the screenshot; not sure if it will go through as an attachment
>> > though - if not, I'll post it as a link. Please ignore the altered
>> package
>> > names, since Lucene is shaded in as part of our build process.
>> >
>> > Some more context about the use case. Yes, the terms are pretty much
>> unique;
>> > the schema for the data set is actually borrowed from here:
>> > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the UserVisits
>> > set, with a couple of other fields added by us. The values for the fields
>> > are generated almost randomly, though some string fields are picked at
>> > random from a fixed dictionary.
>> >
>> > Also, this type of heap footprint might be tolerable if it stayed
>> relatively
>> > constant throughout the system's life cycle (of course, given the index
>> set
>> > stays more or less static). However, what happens here is that one
>> > IndexReader reference is maintained by ReaderManager as an NRT reader.
>> But
>> > we also would like support an ability to execute searches against
>> specific
>> > index commit points, ideally in parallel. As you might imagine, as soon
>> as a
>> > new DirectoryReader is opened at a given commit, a whole new set of
>> > SegmentReader instances is created and populated, effectively doubling
>> the
>> > already large heap usage... if there was a way to somehow reuse readers
>> for
>> > unchanged segments already pooled by IndexWriter, that would help
>> > tremendously here. But I don't think there's a way to link up the two
>> sets,
>> > at least not in the Lucene version we are using (4.6.1) - is this
>> correct?
>> >
>> >
>> > On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
>> >  wrote:
>> >>
>> >> This is surprising: unless you have an excessive number of unique
> > >> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
> > >>
> > >> But you only have 12 unique fields?
>> >>
>> >> Can you post screen shots of the heap usage?
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >>
>> >> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein 
>> >> wrote:
>> >> > This is a follow up to the earlier thread I started to understand
>> memory
>> >> > usage patterns of SegmentReader instances, but I decided to create a
>> >> > separate post since this issue is much more serious than the heap
>> >> > overhead
>> >> > created by use of stored field compression.
>> >> >
>> >> > Here is the use case, once again. The index totals around 300M
>> >> > documents,
>> >> > with 7 string, 2 long, 1 integer, 1 date and 1 float fields which are
>> >> > both
>> >> > indexed and stored. It is split into 4 shards, which are basically
>> >> > separate
>> >> > indices... if that matters. After the index is populated (but not
>> >> > optimized
>> >> > since we don't do that), the overall heap usage taken up by Lucene is
>> >> > over
>> >> > 1 GB, much of which is taken up by instances of BlockTreeTermsReader.
>> >> > For
>> >> > instance for the largest segment in one such an index, the retained
>> h

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread Michael McCandless
You can actually use IndexReader.openIfChanged(latestNRTReader,
IndexCommit): this should pull/share SegmentReaders from the pool
inside IW, when available.  But it will fail to share e.g.
SegmentReader no longer part of the NRT view but shared by e.g. two
"back in time" readers.

Really we need to factor out the reader pooling somehow, such that IW
is a user for its NRT pool, but commit-point readers could also more
easily use a shared pool.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Aug 28, 2014 at 2:07 PM, Uwe Schindler  wrote:
> Hi,
>
> if you open the 2nd instance (the back in time reader) using 
> DirectoryReader.open(IndexCommit), then it has of course nothing in common 
> with the IndexWriter, so how can they share the SegmentReaders?
>
> NRT readers from DirectoryReader.open(IndexWriter) are cached inside 
> IndexWriter, but the completely outside DirectoryReader on the older commit 
> point opens all segments on its own. Maybe a solution would be to extend
> IndexWriter.open() to also take a commit point with IndexWriter.
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: Vitaly Funstein [mailto:vfunst...@gmail.com]
>> Sent: Thursday, August 28, 2014 7:56 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: BlockTreeTermsReader consumes crazy amount of memory
>>
>> Here's the link:
>> https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?us
>> p=sharing
>>
>> I'm indexing let's say 11 unique fields per document. Also, the NRT reader is
>> opened continually, and "regular" searches use that one. But a special kind 
>> of
>> feature allows searching a particular point in time (they get cleaned out
>> based on some other logic), which requires opening a non-NRT reader just to
>> service such search requests - in my understanding no segment readers for
>> this reader can be shared with the NRT reader's pool...
>> or am I off here? This seems evident from another heap dump fragment that
>> shows a full new set of segment readers attached to that "temporary"
>> reader:
>>
>> https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp
>> =sharing
>>
>>
>> On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>> > Hmm screen shot didn't make it ... can you post link?
>> >
>> > If you are using NRT reader then when a new one is opened, it won't
>> > open new SegmentReaders for all segments, just for newly
>> > flushed/merged segments since the last reader was opened.  So for your
>> > N commit points that you have readers open for, they will be sharing
>> > SegmentReaders for segments they have in common.
>> >
>> > How many unique fields are you adding?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein 
>> > wrote:
>> > > Mike,
>> > >
>> > > Here's the screenshot; not sure if it will go through as an
>> > > attachment though - if not, I'll post it as a link. Please ignore
>> > > the altered
>> > package
>> > > names, since Lucene is shaded in as part of our build process.
>> > >
>> > > Some more context about the use case. Yes, the terms are pretty much
>> > unique;
>> > > the schema for the data set is actually borrowed from here:
>> > > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the
>> > > UserVisits set, with a couple of other fields added by us. The
>> > > values for the fields are generated almost randomly, though some
>> > > string fields are picked at random from a fixed dictionary.
>> > >
>> > > Also, this type of heap footprint might be tolerable if it stayed
>> > relatively
>> > > constant throughout the system's life cycle (of course, given the
>> > > index
>> > set
>> > > stays more or less static). However, what happens here is that one
>> > > IndexReader reference is maintained by ReaderManager as an NRT
>> reader.
>> > But
>> > > we also would like support an ability to execute searches against
>> > specific
>> > > index commit points, ideally in parallel. As you might imagine, as
>> > > soon
>> > as a
>> > > new DirectoryReader is opened at a given commit, a whole new set of
>> > > SegmentReader instances is created and populated, effectively
>> > > doubling
>> > the
>> > > already large heap usage... if there was a way to somehow reuse
>> > > readers
>> > for
>> > > unchanged segments already pooled by IndexWriter, that would help
>> > > tremendously here. But I don't think there's a way to link up the
>> > > two
>> > sets,
>> > > at least not in the Lucene version we are using (4.6.1) - is this
>> > correct?
>> > >
>> > >
>> > > On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
>> > >  wrote:
>> > >>
>> > >> This is surprising: unless you have an excessive number of unique
>> > >> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
>> > >>
>> > >> But you only have 12 unique fields?

RE: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread Uwe Schindler
Hi,

if you open the 2nd instance (the back in time reader) using 
DirectoryReader.open(IndexCommit), then it has of course nothing in common with 
the IndexWriter, so how can they share the SegmentReaders?

NRT readers from DirectoryReader.open(IndexWriter) are cached inside 
IndexWriter, but the completely outside DirectoryReader on the older commit 
point opens all segments on its own. Maybe a solution would be to extend
IndexWriter.open() to also take a commit point with IndexWriter.
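
For illustration, the two open paths in question (Lucene 4.x API; writer and
commit are placeholders):

import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;

class OpenPaths {
  // NRT reader: goes through the writer, which pools SegmentReaders, so
  // successive reopens share readers for unchanged segments.
  static DirectoryReader openNrt(IndexWriter writer) throws IOException {
    return DirectoryReader.open(writer, true);
  }

  // "Back in time" reader: opened directly from a commit point, completely
  // outside the writer, so it builds its own SegmentReader for every segment.
  static DirectoryReader openAtCommit(IndexCommit commit) throws IOException {
    return DirectoryReader.open(commit);
  }
}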

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Vitaly Funstein [mailto:vfunst...@gmail.com]
> Sent: Thursday, August 28, 2014 7:56 PM
> To: java-user@lucene.apache.org
> Subject: Re: BlockTreeTermsReader consumes crazy amount of memory
> 
> Here's the link:
> https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?us
> p=sharing
> 
> I'm indexing let's say 11 unique fields per document. Also, the NRT reader is
> opened continually, and "regular" searches use that one. But a special kind of
> feature allows searching a particular point in time (they get cleaned out
> based on some other logic), which requires opening a non-NRT reader just to
> service such search requests - in my understanding no segment readers for
> this reader can be shared with the NRT reader's pool...
> or am I off here? This seems evident from another heap dump fragment that
> shows a full new set of segment readers attached to that "temporary"
> reader:
> 
> https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp
> =sharing
> 
> 
> On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
> 
> > Hmm screen shot didn't make it ... can you post link?
> >
> > If you are using NRT reader then when a new one is opened, it won't
> > open new SegmentReaders for all segments, just for newly
> > flushed/merged segments since the last reader was opened.  So for your
> > N commit points that you have readers open for, they will be sharing
> > SegmentReaders for segments they have in common.
> >
> > How many unique fields are you adding?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein 
> > wrote:
> > > Mike,
> > >
> > > Here's the screenshot; not sure if it will go through as an
> > > attachment though - if not, I'll post it as a link. Please ignore
> > > the altered
> > package
> > > names, since Lucene is shaded in as part of our build process.
> > >
> > > Some more context about the use case. Yes, the terms are pretty much
> > unique;
> > > the schema for the data set is actually borrowed from here:
> > > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the
> > > UserVisits set, with a couple of other fields added by us. The
> > > values for the fields are generated almost randomly, though some
> > > string fields are picked at random from a fixed dictionary.
> > >
> > > Also, this type of heap footprint might be tolerable if it stayed
> > relatively
> > > constant throughout the system's life cycle (of course, given the
> > > index
> > set
> > > stays more or less static). However, what happens here is that one
> > > IndexReader reference is maintained by ReaderManager as an NRT
> reader.
> > But
> > > we also would like support an ability to execute searches against
> > specific
> > > index commit points, ideally in parallel. As you might imagine, as
> > > soon
> > as a
> > > new DirectoryReader is opened at a given commit, a whole new set of
> > > SegmentReader instances is created and populated, effectively
> > > doubling
> > the
> > > already large heap usage... if there was a way to somehow reuse
> > > readers
> > for
> > > unchanged segments already pooled by IndexWriter, that would help
> > > tremendously here. But I don't think there's a way to link up the
> > > two
> > sets,
> > > at least not in the Lucene version we are using (4.6.1) - is this
> > correct?
> > >
> > >
> > > On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
> > >  wrote:
> > >>
> > >> This is surprising: unless you have an excessive number of unique
> > >> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
> > >>
> > >> But you only have 12 unique fields?
> > >>
> > >> Can you post screen shots of the heap usage?
> > >>
> > >> Mike McCandless
> > >>
> > >> http://blog.mikemccandless.com
> > >>
> > >>
> > >> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein
> > >> 
> > >> wrote:
> > >> > This is a follow up to the earlier thread I started to understand
> > memory
> > >> > usage patterns of SegmentReader instances, but I decided to
> > >> > create a separate post since this issue is much more serious than
> > >> > the heap overhead created by use of stored field compression.
> > >> >
> > >> > Here is the use case, once again. The index totals around 300M
> > >> > documents, with 7 string, 2 long, 1 integer, 1 date and 1 float

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread Vitaly Funstein
Here's the link:
https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing

I'm indexing let's say 11 unique fields per document. Also, the NRT reader
is opened continually, and "regular" searches use that one. But a special
kind of feature allows searching a particular point in time (they get
cleaned out based on some other logic), which requires opening a non-NRT
reader just to service such search requests - in my understanding no
segment readers for this reader can be shared with the NRT reader's pool...
or am I off here? This seems evident from another heap dump fragment that
shows a full new set of segment readers attached to that "temporary"
reader:

https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp=sharing


On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hmm screen shot didn't make it ... can you post link?
>
> If you are using NRT reader then when a new one is opened, it won't
> open new SegmentReaders for all segments, just for newly
> flushed/merged segments since the last reader was opened.  So for your
> N commit points that you have readers open for, they will be sharing
> SegmentReaders for segments they have in common.
>
> How many unique fields are you adding?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein 
> wrote:
> > Mike,
> >
> > Here's the screenshot; not sure if it will go through as an attachment
> > though - if not, I'll post it as a link. Please ignore the altered
> package
> > names, since Lucene is shaded in as part of our build process.
> >
> > Some more context about the use case. Yes, the terms are pretty much
> unique;
> > the schema for the data set is actually borrowed from here:
> > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the UserVisits
> > set, with a couple of other fields added by us. The values for the fields
> > are generated almost randomly, though some string fields are picked at
> > random from a fixed dictionary.
> >
> > Also, this type of heap footprint might be tolerable if it stayed
> relatively
> > constant throughout the system's life cycle (of course, given the index
> set
> > stays more or less static). However, what happens here is that one
> > IndexReader reference is maintained by ReaderManager as an NRT reader.
> But
> > we also would like support an ability to execute searches against
> specific
> > index commit points, ideally in parallel. As you might imagine, as soon
> as a
> > new DirectoryReader is opened at a given commit, a whole new set of
> > SegmentReader instances is created and populated, effectively doubling
> the
> > already large heap usage... if there was a way to somehow reuse readers
> for
> > unchanged segments already pooled by IndexWriter, that would help
> > tremendously here. But I don't think there's a way to link up the two
> sets,
> > at least not in the Lucene version we are using (4.6.1) - is this
> correct?
> >
> >
> > On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
> >  wrote:
> >>
> >> This is surprising: unless you have an excessive number of unique
> >> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
> >>
> >> But you only have 12 unique fields?
> >>
> >> Can you post screen shots of the heap usage?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein 
> >> wrote:
> >> > This is a follow up to the earlier thread I started to understand
> memory
> >> > usage patterns of SegmentReader instances, but I decided to create a
> >> > separate post since this issue is much more serious than the heap
> >> > overhead
> >> > created by use of stored field compression.
> >> >
> >> > Here is the use case, once again. The index totals around 300M
> >> > documents,
> >> > with 7 string, 2 long, 1 integer, 1 date and 1 float fields which are
> >> > both
> >> > indexed and stored. It is split into 4 shards, which are basically
> >> > separate
> >> > indices... if that matters. After the index is populated (but not
> >> > optimized
> >> > since we don't do that), the overall heap usage taken up by Lucene is
> >> > over
> >> > 1 GB, much of which is taken up by instances of BlockTreeTermsReader.
> >> > For
> >> > instance for the largest segment in one such an index, the retained
> heap
> >> > size of the internal tree map is around 50 MB. This is evident from
> heap
> >> > dump analysis, which I have screenshots of that I can post here, if
> that
> >> > helps. As there are many segments of various sizes in the index, as
> >> > expected, the total heap usage for one shard stands at around 280 MB.
> >> >
> >> > Could someone shed some light on whether this is expected, and if so -
> >> > how
> >> > could I possibly trim down memory usage here? Is there a way to switch
> >> > to a
> >> > different terms index implementation, one that doesn't preload all the
> >> > terms into RAM, 

indexing all suffixes to support leading wildcard?

2014-08-28 Thread Rob Nikander
Hi,

I've got some short fields (phone num, email) that I'd like to search using
good old string matching.  (The full query is a boolean "or" that also uses
real text fields.) I see the warnings about wildcard queries that start
with *, and I'm wondering... do you think it would be a good idea to index
all the suffixes?  Eg, a phone num 5551234, would become 7 values for the
"phoneNum" field: 4, 34, 234, etc.  So "512*" would be a hit.

And maybe do something with the boosts so it doesn't overvalue the match
when it hits multiple values.  ?

Rob


Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-28 Thread Michael McCandless
Hmm screen shot didn't make it ... can you post link?

If you are using NRT reader then when a new one is opened, it won't
open new SegmentReaders for all segments, just for newly
flushed/merged segments since the last reader was opened.  So for your
N commit points that you have readers open for, they will be sharing
SegmentReaders for segments they have in common.

How many unique fields are you adding?

Mike McCandless

http://blog.mikemccandless.com


On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein  wrote:
> Mike,
>
> Here's the screenshot; not sure if it will go through as an attachment
> though - if not, I'll post it as a link. Please ignore the altered package
> names, since Lucene is shaded in as part of our build process.
>
> Some more context about the use case. Yes, the terms are pretty much unique;
> the schema for the data set is actually borrowed from here:
> https://amplab.cs.berkeley.edu/benchmark/#workload - it's the UserVisits
> set, with a couple of other fields added by us. The values for the fields
> are generated almost randomly, though some string fields are picked at
> random from a fixed dictionary.
>
> Also, this type of heap footprint might be tolerable if it stayed relatively
> constant throughout the system's life cycle (of course, given the index set
> stays more or less static). However, what happens here is that one
> IndexReader reference is maintained by ReaderManager as an NRT reader. But
> we also would like support an ability to execute searches against specific
> index commit points, ideally in parallel. As you might imagine, as soon as a
> new DirectoryReader is opened at a given commit, a whole new set of
> SegmentReader instances is created and populated, effectively doubling the
> already large heap usage... if there was a way to somehow reuse readers for
> unchanged segments already pooled by IndexWriter, that would help
> tremendously here. But I don't think there's a way to link up the two sets,
> at least not in the Lucene version we are using (4.6.1) - is this correct?
>
>
> On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
>  wrote:
>>
>> This is surprising: unless you have an excessive number of unique
>> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
>>
>> But you only have 12 unique fields?
>>
>> Can you post screen shots of the heap usage?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein 
>> wrote:
>> > This is a follow up to the earlier thread I started to understand memory
>> > usage patterns of SegmentReader instances, but I decided to create a
>> > separate post since this issue is much more serious than the heap
>> > overhead
>> > created by use of stored field compression.
>> >
>> > Here is the use case, once again. The index totals around 300M
>> > documents,
>> > with 7 string, 2 long, 1 integer, 1 date and 1 float fields which are
>> > both
>> > indexed and stored. It is split into 4 shards, which are basically
>> > separate
>> > indices... if that matters. After the index is populated (but not
>> > optimized
>> > since we don't do that), the overall heap usage taken up by Lucene is
>> > over
>> > 1 GB, much of which is taken up by instances of BlockTreeTermsReader.
>> > For
>> > instance for the largest segment in one such an index, the retained heap
>> > size of the internal tree map is around 50 MB. This is evident from heap
>> > dump analysis, which I have screenshots of that I can post here, if that
>> > helps. As there are many segments of various sizes in the index, as
>> > expected, the total heap usage for one shard stands at around 280 MB.
>> >
>> > Could someone shed some light on whether this is expected, and if so -
>> > how
>> > could I possibly trim down memory usage here? Is there a way to switch
>> > to a
>> > different terms index implementation, one that doesn't preload all the
>> > terms into RAM, or only does this partially, i.e. as a cache? I'm not
>> > sure
>> > if I'm framing my questions correctly, as I'm obviously not an expert on
>> > Lucene's internals, but this is going to become a critical issue for
>> > large
>> > scale use cases of our system.
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to not span fields with phrase query?

2014-08-28 Thread craiglang44
`getPositionIncrementGap` 
Sent from my BlackBerry® smartphone

-Original Message-
From: Rob Nikander 
Date: Thu, 28 Aug 2014 10:26:00 
To: 
Reply-To: java-user@lucene.apache.org
Subject: Re: How to not span fields with phrase query?

Thank you for the explanation. I subclassed Analyzer and overrode
`getPositionIncrementGap` for this field.  It appears to have worked.

Rob


On Thu, Aug 28, 2014 at 10:21 AM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> Usually that's referred to as multiple "values" for the same field; in the
> index there is no distinction between title:C and title:X as far as which
> field they are in -- they're in the same field.
>
> If you want to prevent phrase queries from matching B C X, insert a
> position gap between C and X; so A B C would be positions 0, 1, 2, and X,
> Y, Z might be 4, 5, 6 instead of 3, 4, 5, which is probably what you have
> now
>
> -Mike
>
>
> On 08/28/2014 09:53 AM, Rob Nikander wrote:
>
>> Hi,
>>
>> If I have a document with multiple fields "title"
>>
>>  title: A B C
>>  title: X Y Z
>>
>> A phrase search for title:"B C X" matches this document. Can I prevent
>> that?
>>
>> thanks,
>> Rob
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



Re: How to not span fields with phrase query?

2014-08-28 Thread Rob Nikander
Thank you for the explanation. I subclassed Analyzer and overrode
`getPositionIncrementGap` for this field.  It appears to have worked.
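
(A minimal sketch of what such a subclass can look like; the token chain and
the gap of 100 are illustrative assumptions, not the actual code used:)

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

class TitleAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    StandardTokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
    TokenStream result = new LowerCaseFilter(Version.LUCENE_46, source);
    return new TokenStreamComponents(source, result);
  }

  @Override
  public int getPositionIncrementGap(String fieldName) {
    // Leave a large hole between successive values of "title" so a phrase
    // (even with modest slop) can never span two values.
    return "title".equals(fieldName) ? 100 : 0;
  }
}

(The gap only takes effect at indexing time, so documents indexed before the
change keep their old positions until they are re-indexed.)
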

Rob


On Thu, Aug 28, 2014 at 10:21 AM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> Usually that's referred to as multiple "values" for the same field; in the
> index there is no distinction between title:C and title:X as far as which
> field they are in -- they're in the same field.
>
> If you want to prevent phrase queries from matching B C X, insert a
> position gap between C and X; so A B C would be positions 0, 1, 2, and X,
> Y, Z might be 4, 5, 6 instead of 3, 4, 5, which is probably what you have
> now
>
> -Mike
>
>
> On 08/28/2014 09:53 AM, Rob Nikander wrote:
>
>> Hi,
>>
>> If I have a document with multiple fields "title"
>>
>>  title: A B C
>>  title: X Y Z
>>
>> A phrase search for title:"B C X" matches this document. Can I prevent
>> that?
>>
>> thanks,
>> Rob
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: How to not span fields with phrase query?

2014-08-28 Thread Michael Sokolov
Usually that's referred to as multiple "values" for the same field; in 
the index there is no distinction between title:C and title:X as far as 
which field they are in -- they're in the same field.


If you want to prevent phrase queries from matching B C X, insert a 
position gap between C and X; so A B C would be positions 0, 1, 2, and 
X, Y, Z might be 4, 5, 6 instead of 3, 4, 5, which is probably what you 
have now.
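
(Concretely, a small sketch of the indexing side; the field name and values
come from the example above, the writer is assumed to be configured with an
analyzer whose getPositionIncrementGap returns the gap, and 100 is just an
illustrative choice:)

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

class MultiValuedTitleExample {
  static void addDoc(IndexWriter writer) throws IOException {
    Document doc = new Document();
    // Two values for the same field name; in the index this is one field
    // whose token positions simply continue from one value to the next.
    doc.add(new TextField("title", "A B C", Field.Store.YES));
    doc.add(new TextField("title", "X Y Z", Field.Store.YES));
    writer.addDocument(doc);
    // Default gap (0): A=0 B=1 C=2 X=3 Y=4 Z=5, so the phrase "B C X" matches.
    // Gap of 100:      A=0 B=1 C=2 X=103 Y=104 Z=105, so it no longer can.
  }
}
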


-Mike

On 08/28/2014 09:53 AM, Rob Nikander wrote:

Hi,

If I have a document with multiple fields "title"

 title: A B C
 title: X Y Z

A phrase search for title:"B C X" matches this document. Can I prevent
that?

thanks,
Rob




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



How to not span fields with phrase query?

2014-08-28 Thread Rob Nikander
Hi,

If I have a document with multiple fields "title"

title: A B C
title: X Y Z

A phrase search for title:"B C X" matches this document. Can I prevent
that?

thanks,
Rob