Re: Exploiting a whole lot of memory
Oh, drat, I left out an 's'. I got it now.

On Tue, Oct 8, 2013 at 7:40 PM, Benson Margulies wrote:
> Mike, where do I find DirectPostingFormat?
>
> On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
>> DirectPostingsFormat?
>>
>> It stores all terms + postings as simple java arrays, uncompressed.
Re: Exploiting a whole lot of memory
Mike, where do I find DirectPostingFormat?

On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> DirectPostingsFormat?
>
> It stores all terms + postings as simple java arrays, uncompressed.
>
> Mike McCandless
> http://blog.mikemccandless.com
Re: Exploiting a whole lot of memory
DirectPostingsFormat?

It stores all terms + postings as simple java arrays, uncompressed.

Mike McCandless
http://blog.mikemccandless.com

On Tue, Oct 8, 2013 at 5:45 PM, Benson Margulies wrote:
> Consider a Lucene index consisting of 10m documents with a total disk footprint of 3G. Consider an application that treats this index as read-only, and runs very complex queries over it. ...
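For anyone following along, here is a minimal sketch of wiring DirectPostingsFormat into an IndexWriter, assuming Lucene 4.4 (where the default codec is Lucene42Codec) and that DirectPostingsFormat is registered under the SPI name "Direct"; the class name DirectCodecSketch is mine:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene42.Lucene42Codec;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;

    public class DirectCodecSketch {
        public static IndexWriter openWriter(Directory dir) throws Exception {
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44, analyzer);
            // Use the Direct postings format for every field: all terms and
            // postings are held on the heap as plain java arrays, uncompressed.
            iwc.setCodec(new Lucene42Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    return PostingsFormat.forName("Direct");
                }
            });
            return new IndexWriter(dir, iwc);
        }
    }

Note that, if I remember right, DirectPostingsFormat lives in the lucene-codecs jar, so that needs to be on the classpath for the SPI lookup to succeed, and the index has to be written (or re-written) with this codec before searches see the uncompressed arrays.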
Exploiting a whole lot of memory
Consider a Lucene index consisting of 10m documents with a total disk footprint of 3G. Consider an application that treats this index as read-only, and runs very complex queries over it: queries with many terms, some of them 'fuzzy' and 'should' terms, and a dismax. And, finally, consider doing all this on a box with over 100G of physical memory, some cores, and nothing else to do with its time.

I should probably just stop here and see what thoughts come back, but I'll go out on a limb and type the word 'codec'. The MMapDirectory, of course, cheerfully gets to keep every single bit in memory. And then each query runs, exercising the codec, building up a flurry of Java objects, all of which turn into garbage, and we start all over. So, I find myself wondering: is there some sort of an opportunity for a codec-that-caches in here? In other words, I'd like to sell some of my space to buy some time.
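As a point of reference, a minimal sketch of the read-only setup described above, assuming the Lucene 4.x API (the class name is mine):

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.MMapDirectory;

    public class ReadOnlySearchSketch {
        public static IndexSearcher open(File indexDir) throws Exception {
            // Maps the index files into virtual memory; with 100G of RAM the
            // OS page cache holds the whole 3G index after warm-up, so
            // repeated queries never touch the disk.
            MMapDirectory dir = new MMapDirectory(indexDir);
            DirectoryReader reader = DirectoryReader.open(dir);
            return new IndexSearcher(reader);
        }
    }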
Re: Analyzer classes versus the constituent components
There are some Analyzer methods you might want to override (initReader for inserting a CharFilter, stuff about gaps), but if you don't need that, it seems to be mostly about packaging neatly, as you say.

-Mike

On 10/8/13 10:30 AM, Benson Margulies wrote:
> Is there some advice around about when it's appropriate to create an Analyzer class, as opposed to just Tokenizer and TokenFilter classes? The advantage of the constituent elements is that they allow the consuming application to add more filters. ...
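A small sketch of the initReader hook mentioned above, assuming Lucene 4.x and the stock HTMLStripCharFilter from lucene-analyzers-common; the filter choice and class name are illustrative only:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public class InitReaderSketch {
        static Analyzer newHtmlStrippingAnalyzer() {
            return new Analyzer() {
                @Override
                protected Reader initReader(String fieldName, Reader reader) {
                    // Runs before the Tokenizer sees any characters; this hook
                    // only exists on Analyzer, not on the bare components.
                    return new HTMLStripCharFilter(reader);
                }

                @Override
                protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                    Tokenizer source = new StandardTokenizer(Version.LUCENE_44, reader);
                    return new TokenStreamComponents(source);
                }
            };
        }
    }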
Re: Equivalent LatLongDistanceFilter in Lucene 4.4 API
Hi James,

The spatial module in v4 is completely different from the one in v3. It would be good for you to review the new API rather than looking for a 1-1 equivalent to a class that existed in v3. Take a look at the top-level javadocs for the spatial module, and in particular look at SpatialExample.java: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/spatial/src/test/org/apache/lucene/spatial/SpatialExample.java?view=markup

A hint at a solution is that you should query by intersection with a circle shape. Think in terms of shapes, not distances, unless you need to sort or boost by the actual distance.

~ David

james bond wrote
> Hi All,
>
> Can you please let me know if there is an equivalent of LatLongDistanceFilter in the Lucene 4.4 API. This API was present in the Lucene 3.6 API.
>
> I mainly have to compute whether a point (lat, long) is present at a distance d from another point (lat, long).
>
> I have checked different classes from the spatial package, but there is no constructor with 5 arguments like LatLongDistanceFilter had. I tried DisjointSpatialFilter separately for both latitude and longitude, but I'm not sure whether it will serve the purpose.
>
> Please provide your thoughts on it.
>
> Thanks
> Jamie

-
Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
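A minimal sketch of that hint, modeled loosely on SpatialExample.java and assuming the Lucene 4.4 spatial module plus spatial4j; the field name "location", the tree depth, and the class name are illustrative, and the index must have been built with the same strategy and field:

    import java.io.IOException;
    import com.spatial4j.core.context.SpatialContext;
    import com.spatial4j.core.distance.DistanceUtils;
    import com.spatial4j.core.shape.Circle;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.spatial.SpatialStrategy;
    import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
    import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;
    import org.apache.lucene.spatial.query.SpatialArgs;
    import org.apache.lucene.spatial.query.SpatialOperation;

    public class CircleQuerySketch {
        // Finds documents whose indexed point lies within distanceKm of (lat, lon).
        public static TopDocs within(IndexSearcher searcher, double lat, double lon,
                                     double distanceKm) throws IOException {
            SpatialContext ctx = SpatialContext.GEO;
            SpatialStrategy strategy = new RecursivePrefixTreeStrategy(
                    new GeohashPrefixTree(ctx, 11), "location");
            // Convert the search radius from kilometers to degrees.
            double degrees = DistanceUtils.dist2Degrees(
                    distanceKm, DistanceUtils.EARTH_MEAN_RADIUS_KM);
            // Shapes take x (longitude) first, then y (latitude).
            Circle circle = ctx.makeCircle(lon, lat, degrees);
            Filter filter = strategy.makeFilter(
                    new SpatialArgs(SpatialOperation.Intersects, circle));
            return searcher.search(new MatchAllDocsQuery(), filter, 10);
        }
    }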
Analyzer classes versus the constituent components
Is there some advice around about when it's appropriate to create an Analyzer class, as opposed to just Tokenizer and TokenFilter classes? The advantage of the constituent elements is that they allow the consuming application to add more filters. The only disadvantage I see is that the following is a bit on the verbose side. Is there some advantage or use of an Analyzer class that I'm missing?

    private Analyzer newAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                Tokenizer source = tokenizerFactory.create(reader, LanguageCode.JAPANESE);
                com.basistech.rosette.bl.Analyzer rblAnalyzer;
                try {
                    rblAnalyzer = analyzerFactory.create(LanguageCode.JAPANESE);
                } catch (IOException e) {
                    throw new RuntimeException("Error creating RBL analyzer", e);
                }
                BaseLinguisticsTokenFilter filter =
                    new BaseLinguisticsTokenFilter(source, rblAnalyzer);
                return new TokenStreamComponents(source, filter);
            }
        };
    }
Re: Lucene 4.4.0 mergeSegments OutOfMemoryError
When you open this index for searching, how much heap do you give it?

In general, you should give IndexWriter the same heap size, since during merge it will need to open N readers at once, and if you have RAM-resident doc values fields, those need enough heap space.

Also, the default DocValuesFormat in 4.5 has changed to be mostly disk-based; if you upgrade & cut over your index, then you should need much less heap to open readers / do merging.

Mike McCandless
http://blog.mikemccandless.com

On Tue, Oct 8, 2013 at 2:53 AM, Michael van Rooyen wrote:
> With forceMerge(1) throwing an OOM error, we switched to forceMergeDeletes(), which worked for a while, but that is now also running out of memory. As a result, I've turned all manner of forced merges off.
>
> I'm more than a little apprehensive that if the OOM error can happen as part of a forced merge, then it may also be able to happen as part of normal merges as the index grows. I'd be grateful if someone who's grokked the code for segment merges could shed some light on whether I'm worrying unnecessarily...
>
> Thanks,
> Michael.
>
> On 2013/09/26 01:43 PM, Michael van Rooyen wrote:
>> Thanks for the suggestion Ian. I switched the optimization to do forceMergeDeletes() instead of forceMerge(1) and it completed successfully, so we will use that instead. At least then we're guaranteed to have no more than 10% of dead space in the index.
>>
>> I love the videos on Mike's post - I've always thought that the Lucene segment/merge mechanism is such an elegant and efficient way of handling a dynamic index.
>>
>> Michael.
>>
>> On 2013/09/26 12:45 PM, Ian Lea wrote:
>>> There's a blog posting from Mike McCandless about merging at
>>> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html.
>>> Not very recent but probably still relevant.
>>>
>>> You could try IndexWriter.forceMergeDeletes() rather than forceMerge(1). Still costly but probably less so, and might complete!
>>>
>>> --
>>> Ian.
>>>
>>> On Thu, Sep 26, 2013 at 11:25 AM, Michael van Rooyen wrote:
>>>> Yes, it happens as part of the early morning optimize, and yes, it's a forceMerge(1) which I've disabled for now.
>>>>
>>>> I haven't looked at the persistence mechanism for Lucene since 2.x, but if I remember correctly, the deleted documents would stay in an index segment until that segment was eventually merged. Without forcing a merge (optimize in old versions), the footprint on disk could be a multiple of the actual space required for the live documents, and this would have an impact on performance (the deleted documents would clutter the buffer cache). Is this still the case?
>>>>
>>>> I would have thought it good practice to force the dead space out of an index periodically, but if the underlying storage mechanism has changed and the current index files are more efficient at housekeeping, this may no longer be necessary. If someone could shed a little light on best practice for indexes where documents are frequently updated (i.e. deleted and re-added), that would be great.
>>>>
>>>> Michael.
>>>>
>>>> On 2013/09/26 11:43 AM, Ian Lea wrote:
>>>>> Is this OOM happening as part of your early morning optimize or at some other point? By optimize do you mean IndexWriter.forceMerge(1)? You really shouldn't have to use that. If the index grows forever without it then something else is going on which you might wish to report separately.
>>>>>
>>>>> --
>>>>> Ian.
>>>>>
>>>>> On Wed, Sep 25, 2013 at 12:35 PM, Michael van Rooyen wrote:
>>>>>> We've recently upgraded to Lucene 4.4.0 and mergeSegments now causes an OOM error.
>>>>>>
>>>>>> As background, our index contains about 14 million documents (growing slowly) and we process about 1 million updates per day. It's about 8GB on disk. I'm not sure if the Lucene segments merge the way they used to in the early versions, but we've always optimized at 3am to get rid of dead space in the index, or otherwise it grows forever.
>>>>>>
>>>>>> The mergeSegments was working under 4.3.1 but the index has grown somewhat on disk since then, probably due to a couple of added NumericDocValues fields. The java process is assigned about 3GB (the maximum, as it's running on a 32 bit i686 Linux box), and it still goes OOM.
>>>>>>
>>>>>> Any advice as to the possible cause and how to circumvent it would be great. Here's the stack trace:
>>>>>>
>>>>>> org.apache.lucene.index.MergePolicy$MergeException: java.lang.OutOfMemoryError: Java heap space
>>>>>> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
>>>>>> org.apache.luc
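For reference, a minimal sketch of the forceMergeDeletes() approach the thread converges on, assuming Lucene 4.4 with TieredMergePolicy; the reclaim weight, analyzer, and class name are illustrative:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;

    public class MergeDeletesSketch {
        public static void expungeDeletes(Directory dir) throws Exception {
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44, analyzer);
            TieredMergePolicy mp = new TieredMergePolicy();
            // Bias ordinary merges toward segments carrying deletes (default
            // is 2.0), so dead space is reclaimed without any forced merge.
            mp.setReclaimDeletesWeight(3.0);
            iwc.setMergePolicy(mp);
            IndexWriter writer = new IndexWriter(dir, iwc);
            try {
                // Much cheaper than forceMerge(1): only rewrites segments whose
                // deleted-doc ratio exceeds the policy's threshold (about 10%).
                writer.forceMergeDeletes();
            } finally {
                writer.close();
            }
        }
    }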
Re: optimal way to access many TermVectors
Hi,

On Mon, Oct 7, 2013 at 9:31 PM, Rose, Stuart J wrote:
> Is there an optimal way to access many document TermVectors (in the same chunk) consecutively when using the LZ4 termvector compression?
>
> I'm curious to know whether all TermVectors in a single compressed chunk are decompressed and cached when one TermVector in the same chunk is accessed?

Since the main use-cases for term vectors today are more-like-this and highlighting, term vectors are generally accessed in no particular order. This is why we don't cache the uncompressed chunk (it would never get reused), so you need to decompress every time you retrieve a document or its term vectors.

> Also wondering if there is a mapping of TermVector order to docID order? Or is it always one to one? If docIds are dynamic, then presumably they are not necessarily in the same order as their documents' corresponding term vectors...

Term vectors are stored in doc ID order, meaning that for a given segment, term vectors for document N are followed by term vectors for document N+1.

--
Adrien
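To make the access pattern concrete, a minimal sketch of scanning term vectors in doc ID order, assuming the Lucene 4.x API; the field name "body" and the class name are placeholders. Per the above, each call still decompresses its chunk afresh, but visiting documents in ID order at least keeps the underlying reads sequential:

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Fields;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.BytesRef;

    public class TermVectorScanSketch {
        public static void scan(Directory dir) throws Exception {
            IndexReader reader = DirectoryReader.open(dir);
            try {
                for (int docId = 0; docId < reader.maxDoc(); docId++) {
                    Fields vectors = reader.getTermVectors(docId);
                    if (vectors == null) continue; // doc has no term vectors
                    Terms terms = vectors.terms("body");
                    if (terms == null) continue;   // field not vectorized
                    TermsEnum te = terms.iterator(null);
                    BytesRef term;
                    while ((term = te.next()) != null) {
                        // For term vectors, totalTermFreq() is the frequency
                        // of the term within this one document.
                        System.out.println(term.utf8ToString() + " x " + te.totalTermFreq());
                    }
                }
            } finally {
                reader.close();
            }
        }
    }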