Re: Exploiting a whole lot of memory

2013-10-08 Thread Benson Margulies
Oh, drat, I left out an 's'. I got it now.


On Tue, Oct 8, 2013 at 7:40 PM, Benson Margulies wrote:

> Mike, where do I find DirectPostingFormat?
>
>
> On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> DirectPostingsFormat?
>>
>> It stores all terms + postings as simple java arrays, uncompressed.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Tue, Oct 8, 2013 at 5:45 PM, Benson Margulies 
>> wrote:
>> > Consider a Lucene index consisting of 10m documents with a total disk
>> > footprint of 3G. Consider an application that treats this index as
>> > read-only, and runs very complex queries over it. Queries with many
>> terms,
>> > some of them 'fuzzy' and 'should' terms and a dismax. And, finally,
>> > consider doing all this on a box with over 100G of physical memory, some
>> > cores, and nothing else to do with its time.
>> >
>> > I should probably just stop here and see what thoughts come back, but
>> I'll
>> > go out on a limb and type the word 'codec'. The MMapDirectory, of
>> course,
>> > cheerfully gets to keep every single bit in memory. And then each query
>> > runs, exercising the codec, building up a flurry of Java objects,
>> all
>> > of which turn into garbage and we start all over. So, I find myself
>> > wondering, is there some sort of an opportunity for a codec-that-caches
>> in
>> > here? In other words, I'd like to sell some of my space to buy some
>> time.
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>


Re: Exploiting a whole lot of memory

2013-10-08 Thread Benson Margulies
Mike, where do I find DirectPostingFormat?


On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> DirectPostingsFormat?
>
> It stores all terms + postings as simple java arrays, uncompressed.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Oct 8, 2013 at 5:45 PM, Benson Margulies 
> wrote:
> > Consider a Lucene index consisting of 10m documents with a total disk
> > footprint of 3G. Consider an application that treats this index as
> > read-only, and runs very complex queries over it. Queries with many
> terms,
> > some of them 'fuzzy' and 'should' terms and a dismax. And, finally,
> > consider doing all this on a box with over 100G of physical memory, some
> > cores, and nothing else to do with its time.
> >
> > I should probably just stop here and see what thoughts come back, but
> I'll
> > go out on a limb and type the word 'codec'. The MMapDirectory, of course,
> > cheerfully gets to keep every single bit in memory. And then each query
> > runs, exercising the codec, building up a flurry of Java objects,
> all
> > of which turn into garbage and we start all over. So, I find myself
> > wondering, is there some sort of an opportunity for a codec-that-caches
> in
> > here? In other words, I'd like to sell some of my space to buy some time.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Exploiting a whole lot of memory

2013-10-08 Thread Michael McCandless
DirectPostingsFormat?

It stores all terms + postings as simple java arrays, uncompressed.
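
A minimal sketch of wiring it in per field, assuming Lucene 4.5 with the
lucene-codecs jar on the classpath; the choice to use it for every field is
illustrative, not something prescribed here.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene45.Lucene45Codec;
import org.apache.lucene.codecs.memory.DirectPostingsFormat;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class DirectCodecConfig {
    public static IndexWriterConfig newConfig() {
        IndexWriterConfig iwc = new IndexWriterConfig(
                Version.LUCENE_45, new StandardAnalyzer(Version.LUCENE_45));
        // Route every field's postings through DirectPostingsFormat, which
        // holds terms and postings as plain java arrays on the heap.
        iwc.setCodec(new Lucene45Codec() {
            private final PostingsFormat direct = new DirectPostingsFormat();
            @Override
            public PostingsFormat getPostingsFormatForField(String field) {
                return direct;
            }
        });
        return iwc;
    }
}

Existing segments keep whatever postings format they were written with, so a
reindex (or a full merge) is needed before the new format takes effect, and
the heap has to be sized for the uncompressed arrays.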

Mike McCandless

http://blog.mikemccandless.com


On Tue, Oct 8, 2013 at 5:45 PM, Benson Margulies  wrote:
> Consider a Lucene index consisting of 10m documents with a total disk
> footprint of 3G. Consider an application that treats this index as
> read-only, and runs very complex queries over it. Queries with many terms,
> some of them 'fuzzy' and 'should' terms and a dismax. And, finally,
> consider doing all this on a box with over 100G of physical memory, some
> cores, and nothing else to do with its time.
>
> I should probably just stop here and see what thoughts come back, but I'll
> go out on a limb and type the word 'codec'. The MMapDirectory, of course,
> cheerfully gets to keep every single bit in memory. And then each query
> runs, exercising the codec, building up a flurry of Java objects, all
> of which turn into garbage and we start all over. So, I find myself
> wondering, is there some sort of an opportunity for a codec-that-caches in
> here? In other words, I'd like to sell some of my space to buy some time.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Exploiting a whole lot of memory

2013-10-08 Thread Benson Margulies
Consider a Lucene index consisting of 10m documents with a total disk
footprint of 3G. Consider an application that treats this index as
read-only, and runs very complex queries over it. Queries with many terms,
some of them 'fuzzy' and 'should' terms and a dismax. And, finally,
consider doing all this on a box with over 100G of physical memory, some
cores, and nothing else to do with its time.

I should probably just stop here and see what thoughts come back, but I'll
go out on a limb and type the word 'codec'. The MMapDirectory, of course,
cheerfully gets to keep every single bit in memory. And then each query
runs, exercising the codec, building up a flurry of Java objects, all
of which turn into garbage and we start all over. So, I find myself
wondering, is there some sort of an opportunity for a codec-that-caches in
here? In other words, I'd like to sell some of my space to buy some time.
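
For concreteness, the query shape described above might look roughly like the
sketch below. This is illustrative only: the field names, the dismax
tie-breaker, and the fuzzy edit distance are made-up placeholders, not
anything from this thread.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryShape {
    // Builds a dismax over per-field boolean queries whose clauses are all
    // SHOULD, some exact and some fuzzy.
    public static Query build(String text) {
        DisjunctionMaxQuery dismax = new DisjunctionMaxQuery(0.1f); // tie-breaker
        for (String field : new String[] {"title", "body"}) {
            BooleanQuery perField = new BooleanQuery();
            for (String token : text.split("\\s+")) {
                perField.add(new TermQuery(new Term(field, token)), Occur.SHOULD);
                perField.add(new FuzzyQuery(new Term(field, token), 2), Occur.SHOULD);
            }
            dismax.add(perField);
        }
        return dismax;
    }
}

Each fuzzy clause expands into many term lookups, which is where the
per-query object churn mentioned above comes from.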


Re: Analyzer classes versus the constituent components

2013-10-08 Thread Michael Sokolov
There are some Analyzer methods you might want to override (initReader
for inserting a CharFilter, the position/offset gap methods), but if you don't
need that, it seems to be mostly about packaging things neatly, as you say.
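
A hypothetical sketch of what those hooks buy you in an Analyzer subclass,
assuming Lucene 4.x; the tokenizer, filter, and gap value are illustrative
choices, not from Benson's code.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class HtmlAwareAnalyzer extends Analyzer {
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // The Analyzer-level hook for wrapping the Reader with a CharFilter.
        return new HTMLStripCharFilter(reader);
    }

    @Override
    public int getPositionIncrementGap(String fieldName) {
        // Keep phrases from matching across instances of a multi-valued field.
        return 100;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_45, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_45, source);
        return new TokenStreamComponents(source, result);
    }
}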


-Mike

On 10/8/13 10:30 AM, Benson Margulies wrote:

Is there some advice around about when it's appropriate to create an
Analyzer class, as opposed to just Tokenizer and TokenFilter classes?

The advantage of the constituent elements is that they allow the
consuming application to add more filters. The only disadvantage I see
is that the following is a bit on the verbose side. Is there some
advantage or use of an Analyzer class that I'm missing?

private Analyzer newAnalyzer() {
    return new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = tokenizerFactory.create(reader, LanguageCode.JAPANESE);
            com.basistech.rosette.bl.Analyzer rblAnalyzer;
            try {
                rblAnalyzer = analyzerFactory.create(LanguageCode.JAPANESE);
            } catch (IOException e) {
                throw new RuntimeException("Error creating RBL analyzer", e);
            }
            BaseLinguisticsTokenFilter filter =
                    new BaseLinguisticsTokenFilter(source, rblAnalyzer);
            return new TokenStreamComponents(source, filter);
        }
    };
}

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Equivalent LatLongDistanceFilter in Lucene 4.4 API

2013-10-08 Thread David Smiley (@MITRE.org)
Hi James,

The spatial module in v4 is completely different from the one in v3. It
would be good for you to review the new API rather than looking for a 1-1
equivalent to a class that existed in v3. Take a look at the top-level
javadocs for the spatial module, and in particular at SpatialExample.java:
http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/spatial/src/test/org/apache/lucene/spatial/SpatialExample.java?view=markup

A hint at a solution is that you should query by intersection with a circle
shape. Think in terms of shapes, not distances, unless you need to sort or
boost by the actual distance.
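
Roughly, following the pattern in SpatialExample.java, a point-within-distance
check becomes an Intersects query against a circle. A sketch, assuming the
points were indexed with the same strategy; the field name "location", the
geohash precision, and kilometers as the unit are placeholders.

import com.spatial4j.core.context.SpatialContext;
import com.spatial4j.core.distance.DistanceUtils;
import com.spatial4j.core.shape.Shape;

import org.apache.lucene.search.Filter;
import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;
import org.apache.lucene.spatial.query.SpatialArgs;
import org.apache.lucene.spatial.query.SpatialOperation;

public class CircleFilterExample {
    public static Filter within(double lat, double lon, double distKm) {
        SpatialContext ctx = SpatialContext.GEO;
        RecursivePrefixTreeStrategy strategy =
                new RecursivePrefixTreeStrategy(new GeohashPrefixTree(ctx, 11), "location");
        // "Is the indexed point within distKm of (lat, lon)?" becomes
        // "does the indexed shape intersect a circle of that radius?"
        double radiusDeg = DistanceUtils.dist2Degrees(distKm, DistanceUtils.EARTH_MEAN_RADIUS_KM);
        Shape circle = ctx.makeCircle(lon, lat, radiusDeg);
        return strategy.makeFilter(new SpatialArgs(SpatialOperation.Intersects, circle));
    }
}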

~ David


james bond wrote
> Hi All,
> 
> Can you please let me know if there is an equivalent of
> LatLongDistanceFilter in the Lucene 4.4 API.
> This class was present in the Lucene 3.6 API.
> 
> I mainly have to compute whether a point (lat, long) lies
> within a distance d of another point (lat, long).
> 
> I have checked different classes from the spatial package,
> but there is no constructor with 5 arguments like LatLongDistanceFilter
> had.
> I tried DisjointSpatialFilter separately for both
> latitude and longitude, but I'm
> not sure whether it will serve the purpose.
> 
> Please provide your thoughts on it.
> 
> Thanks
> Jamie





-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Analyzer classes versus the constituent components

2013-10-08 Thread Benson Margulies
Is there some advice around about when it's appropriate to create an
Analyzer class, as opposed to just Tokenizer and TokenFilter classes?

The advantage of the constituent elements is that they allow the
consuming application to add more filters. The only disadvantage I see
is that the following is a bit on the verbose side. Is there some
advantage or use of an Analyzer class that I'm missing?

private Analyzer newAnalyzer() {
    return new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = tokenizerFactory.create(reader, LanguageCode.JAPANESE);
            com.basistech.rosette.bl.Analyzer rblAnalyzer;
            try {
                rblAnalyzer = analyzerFactory.create(LanguageCode.JAPANESE);
            } catch (IOException e) {
                throw new RuntimeException("Error creating RBL analyzer", e);
            }
            BaseLinguisticsTokenFilter filter =
                    new BaseLinguisticsTokenFilter(source, rblAnalyzer);
            return new TokenStreamComponents(source, filter);
        }
    };
}

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 4.4.0 mergeSegments OutOfMemoryError

2013-10-08 Thread Michael McCandless
When you open this index for searching, how much heap do you give it?
In general, you should give IndexWriter the same heap size, since
during merge it will need to open N readers at once, and if you have
RAM-resident doc values fields, those need enough heap space.

Also, the default DocValuesFormat in 4.5 has changed to be mostly
disk-based; if you upgrade and cut over your index, then you should need
much less heap to open readers and do merging.
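
A rough sketch of what that cutover might look like, assuming Lucene 4.5:
newly flushed and merged segments pick up the disk-based DocValuesFormat,
and older segments are rewritten as ordinary merges touch them. The analyzer
and directory handling here are placeholders.

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.lucene45.Lucene45Codec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class CutoverTo45 {
    public static IndexWriter openWriter(File indexDir) throws IOException {
        Directory dir = FSDirectory.open(indexDir);
        IndexWriterConfig iwc = new IndexWriterConfig(
                Version.LUCENE_45, new StandardAnalyzer(Version.LUCENE_45));
        // Explicitly the 4.5 default: doc values are kept mostly on disk
        // instead of on the heap.
        iwc.setCodec(new Lucene45Codec());
        return new IndexWriter(dir, iwc);
    }
}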


Mike McCandless

http://blog.mikemccandless.com


On Tue, Oct 8, 2013 at 2:53 AM, Michael van Rooyen  wrote:
> With forceMerge(1) throwing an OOM error, we switched to forceMergeDeletes()
> which worked for a while, but that is now also running out of memory.  As a
> result, I've turned all manner of forced merges off.
>
> I'm more than a little apprehensive that if the OOM error can happen as part
> of a forced merge, then it may also be able to happen as part of normal
> merges as the index grows.  I'd be grateful if someone who's grokked the
> code for segment merges could shed some light on whether I'm worrying
> unnecessarily...
>
> Thanks,
> Michael.
>
> On 2013/09/26 01:43 PM, Michael van Rooyen wrote:
>>
>> Thanks for the suggestion Ian.  I switched the optimization to do
>> forceMergeDeletes() instead of forceMerge(1) and it completed successfully,
>> so we will use that instead.  At least then we're guaranteed to have no more
>> than 10% of dead space in the index.
>>
>> I love the videos on Mike's post - I've always thought that the Lucene
>> segment/merge mechanism is such an elegant and efficient way of handling a
>> dynamic index.
>>
>> Michael.
>>
>> On 2013/09/26 12:45 PM, Ian Lea wrote:
>>>
>>> There's a blog posting from Mike McCandless  about merging at
>>>
>>> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html.
>>>   Not very recent but probably still relevant.
>>>
>>> You could try IndexWriter.forceMergeDeletes() rather than
>>> forceMerge(1).  Still costly but probably less so, and might complete!
>>>
>>>
>>> --
>>> Ian.
>>>
>>>
>>>
>>> On Thu, Sep 26, 2013 at 11:25 AM, Michael van Rooyen 
>>> wrote:

 Yes, it happens as part of the early morning optimize, and yes, it's a
 forceMerge(1) which I've disabled for now.

 I haven't looked at the persistence mechanism for Lucene since 2.x, but
 if I
 remember correctly, the deleted documents would stay in an index segment
 until that segment was eventually merged.  Without forcing a merge
 (optimize
 in old versions), the footprint on disk could be a multiple of the
 actual
 space required for the live documents, and this would have an impact on
 performance (the deleted documents would clutter the buffer cache).

 Is this still the case?  I would have thought it good practice to force
 the
 dead space out of an index periodically, but if the underlying storage
 mechanism has changed and the current index files are more efficient at
 housekeeping, this may no longer be necessary.

 If someone could shed a little light on best practice for indexes where
 documents are frequently updated (i.e. deleted and re-added), that would
 be
 great.

 Michael.


 On 2013/09/26 11:43 AM, Ian Lea wrote:
>
> Is this OOM happening as part of your early morning optimize or at
> some other point?  By optimize do you mean IndexWriter.forceMerge(1)?
> You really shouldn't have to use that. If the index grows forever
> without it then something else is going on which you might wish to
> report separately.
>
>
> --
> Ian.
>
>
> On Wed, Sep 25, 2013 at 12:35 PM, Michael van Rooyen
> 
> wrote:
>>
>> We've recently upgraded to Lucene 4.4.0 and mergeSegments now causes
>> an
>> OOM
>> error.
>>
>> As background, our index contains about 14 million documents (growing
>> slowly) and we process about 1 million updates per day. It's about 8GB
>> on
>> disk.  I'm not sure if the Lucene segments merge the way they used to
>> in
>> the
>> early versions, but we've always optimized at 3am to get rid of dead
>> space
>> in the index, or otherwise it grows forever.
>>
>> The mergeSegments was working under 4.3.1 but the index has grown
>> somewhat
>> on disk since then, probably due to a couple of added NumericDocValues
>> fields.  The java process is assigned about 3GB (the maximum, as it's
>> running on a 32 bit i686 Linux box), and it still goes OOM.
>>
>> Any advice as to the possible cause and how to circumvent it would be
>> great.
>> Here's the stack trace:
>>
>> org.apache.lucene.index.MergePolicy$MergeException: java.lang.OutOfMemoryError: Java heap space
>>     org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
>>     org.apache.luc

Re: optimal way to access many TermVectors

2013-10-08 Thread Adrien Grand
Hi,

On Mon, Oct 7, 2013 at 9:31 PM, Rose, Stuart J  wrote:
> Is there an optimal way to access many document TermVectors (in the same 
> chunk) consecutively when using the LZ4 termvector compression?
>
> I'm curious to know whether all TermVectors in a single compressed chunk are 
> decompressed and cached when one TermVector in the same chunk is accessed?

The main use cases for term vectors today are more-like-this and
highlighting, so term vectors are generally accessed in no particular
order. This is why we don't cache the uncompressed chunk (it would
never get reused), so you need to decompress every time you
retrieve a document or its term vectors.

> Also wondering if there is a mapping of TermVector order to docID order? Or 
> is it always one to one? If docIds are dynamic, then presumably they are not 
> necessarily in the same order as their documents' corresponding term 
> vectors...

Term vectors are stored in doc ID order, meaning that for a given
segment, term vectors for document N are followed by term vectors for
document N+1.
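
So if many vectors do need to be read, walking them in increasing docID order
at least matches the on-disk layout, although each access still decompresses
its chunk. A sketch, assuming Lucene 4.x; the field name is a placeholder.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class TermVectorScan {
    public static void scan(IndexReader reader, String field) throws IOException {
        for (int docID = 0; docID < reader.maxDoc(); docID++) { // ascending docID order
            Terms vector = reader.getTermVector(docID, field);
            if (vector == null) {
                continue; // this document has no term vector for the field
            }
            TermsEnum termsEnum = vector.iterator(null);
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                long freq = termsEnum.totalTermFreq(); // within-document frequency
                // ... use term.utf8ToString() and freq ...
            }
        }
    }
}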

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org