Re: Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting
On Tue, Dec 16, 2014 at 3:25 PM, Piotr Idzikowski wrote:
> So for instance, if I store documents with e.g. a creation date and I have
> data (millions of documents) from, let's say, the last 3 years and I'd like
> to do a range filter to get docs from some month only, is it better to use
> an ordinary numeric query instead of FieldCacheRangeFilter?

Yes.

>> Both FieldCacheRangeFilter and FieldCacheTermsFilter would work on the
>> same SortedDocValues field. What makes you think you need two fields?
>
> Code:
>
> FieldCacheRangeFilter:
>
>   public static FieldCacheRangeFilter<Long> newLongRange(String field,
>       FieldCache.LongParser parser, Long lowerVal, Long upperVal,
>       boolean includeLower, boolean includeUpper) {
>     return new FieldCacheRangeFilter<Long>(field, parser, lowerVal,
>         upperVal, includeLower, includeUpper) {
>       @Override
>       public DocIdSet getDocIdSet(AtomicReaderContext context,
>           Bits acceptDocs) throws IOException {
>         final long inclusiveLowerPoint, inclusiveUpperPoint;
>         if (lowerVal != null) {
>           long i = lowerVal.longValue();
>           if (!includeLower && i == Long.MAX_VALUE)
>             return null;
>           inclusiveLowerPoint = includeLower ? i : (i + 1L);
>         } else {
>           inclusiveLowerPoint = Long.MIN_VALUE;
>         }
>         if (upperVal != null) {
>           long i = upperVal.longValue();
>           if (!includeUpper && i == Long.MIN_VALUE)
>             return null;
>           inclusiveUpperPoint = includeUpper ? i : (i - 1L);
>         } else {
>           inclusiveUpperPoint = Long.MAX_VALUE;
>         }
>
>         if (inclusiveLowerPoint > inclusiveUpperPoint)
>           return null;
>
>         final FieldCache.Longs values =
>             FieldCache.DEFAULT.getLongs(context.reader(), field,
>                 (FieldCache.LongParser) parser, false);
>         return new FieldCacheDocIdSet(context.reader().maxDoc(),
>             acceptDocs) {
>           @Override
>           protected boolean matchDoc(int doc) {
>             final long value = values.get(doc);
>             return value >= inclusiveLowerPoint
>                 && value <= inclusiveUpperPoint;
>           }
>         };
>       }
>     };
>   }
>
> FieldCacheTermsFilter:
>
>   @Override
>   public DocIdSet getDocIdSet(AtomicReaderContext context,
>       Bits acceptDocs) throws IOException {
>     final SortedDocValues fcsi =
>         getFieldCache().getTermsIndex(context.reader(), field);
>     final FixedBitSet bits = new FixedBitSet(fcsi.getValueCount());
>     for (int i = 0; i < terms.length; i++) {
>       int ord = fcsi.lookupTerm(terms[i]);
>       if (ord >= 0) {
>         bits.set(ord);
>       }
>     }

The FieldCacheRangeFilter you copied is for longs indeed, but there is also a
newStringRange method that works on sorted doc values.

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
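To make the "same SortedDocValues field" point concrete, here is a minimal plain-Java sketch with no Lucene dependency. It assumes a sorted doc values field can be modelled as a sorted term dictionary plus one ordinal per document. The names `SortedOrdsSketch`, `termsFilter` and `rangeFilter` are invented for this illustration and are not Lucene API; the real filters are more involved.

```java
import java.util.Arrays;
import java.util.BitSet;

public class SortedOrdsSketch {
    // Sorted, de-duplicated term dictionary: a term's ordinal is its index here.
    private final String[] dict;
    // One ordinal per document (docId -> ord).
    private final int[] docToOrd;

    public SortedOrdsSketch(String[] valuePerDoc) {
        dict = Arrays.stream(valuePerDoc).distinct().sorted().toArray(String[]::new);
        docToOrd = new int[valuePerDoc.length];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            docToOrd[doc] = Arrays.binarySearch(dict, valuePerDoc[doc]);
        }
    }

    /** Terms filter: resolve each term to an ordinal once, then test every doc's ordinal. */
    public BitSet termsFilter(String... terms) {
        BitSet acceptedOrds = new BitSet(dict.length);
        for (String term : terms) {
            int ord = Arrays.binarySearch(dict, term);
            if (ord >= 0) {
                acceptedOrds.set(ord); // terms missing from the dictionary are ignored
            }
        }
        return scan(acceptedOrds);
    }

    /** Range filter (both bounds inclusive): map the bounds to an ordinal range, then scan. */
    public BitSet rangeFilter(String lower, String upper) {
        int lo = Arrays.binarySearch(dict, lower);
        if (lo < 0) lo = -lo - 1; // first ordinal whose term is >= lower
        int hi = Arrays.binarySearch(dict, upper);
        if (hi < 0) hi = -hi - 2; // last ordinal whose term is <= upper
        BitSet acceptedOrds = new BitSet(dict.length);
        if (lo <= hi) {
            acceptedOrds.set(lo, hi + 1);
        }
        return scan(acceptedOrds);
    }

    // Both filters end in the same per-document pass over the ordinals.
    private BitSet scan(BitSet acceptedOrds) {
        BitSet docs = new BitSet(docToOrd.length);
        for (int doc = 0; doc < docToOrd.length; doc++) {
            if (acceptedOrds.get(docToOrd[doc])) {
                docs.set(doc);
            }
        }
        return docs;
    }
}
```

Both methods visit every document, which is one way to see why this style of filter pays off mainly when a large share of the documents match.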
Re: Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting
> So for instance, if I store documents with e.g. a creation date and I have
> data (millions of documents) from, let's say, the last 3 years and I'd like
> to do a range filter to get docs from some month only, is it better to use
> an ordinary numeric query instead of FieldCacheRangeFilter?

Of course I meant NumericRangeQuery.
Re: Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting
Hello.
Thanks for your reply.

On Tue, Dec 16, 2014 at 3:14 PM, Adrien Grand wrote:
> Hi Piotr,
>
> On Mon, Dec 15, 2014 at 9:43 PM, Piotr Idzikowski wrote:
> > Hello.
> > I am going to switch to the newest (4.10.2) version of Lucene and I'd like
> > to make some optimizations in my index and code. I would like to use
> > DocValuesField to get values but also for filtering and sorting. So here I
> > have some questions: If I'd like to use a range filter
> > (FieldCacheRangeFilter) I need to store a value in XxxDocValuesField, but
> > if I want to use a terms filter (FieldCacheTermsFilter) I need to store a
> > value in SortedDocValuesField. So it looks like if I want to use range and
> > terms filters I need to have two different fields. Am I right? Am I using
> > it correctly?
>
> FieldCacheRangeFilter and FieldCacheTermsFilter only work well when
> you have lots of terms and most documents match your filter. Otherwise
> you should consider using the regular numeric range filter and terms
> filter. Although they might be a bit slower in the dense case, they
> will be significantly faster when few terms/documents match.

So for instance, if I store documents with e.g. a creation date and I have
data (millions of documents) from, let's say, the last 3 years and I'd like to
do a range filter to get docs from some month only, is it better to use an
ordinary numeric query instead of FieldCacheRangeFilter?

> Both FieldCacheRangeFilter and FieldCacheTermsFilter would work on the
> same SortedDocValues field. What makes you think you need two fields?

Code:

FieldCacheRangeFilter:

  public static FieldCacheRangeFilter<Long> newLongRange(String field,
      FieldCache.LongParser parser, Long lowerVal, Long upperVal,
      boolean includeLower, boolean includeUpper) {
    return new FieldCacheRangeFilter<Long>(field, parser, lowerVal,
        upperVal, includeLower, includeUpper) {
      @Override
      public DocIdSet getDocIdSet(AtomicReaderContext context,
          Bits acceptDocs) throws IOException {
        final long inclusiveLowerPoint, inclusiveUpperPoint;
        if (lowerVal != null) {
          long i = lowerVal.longValue();
          if (!includeLower && i == Long.MAX_VALUE)
            return null;
          inclusiveLowerPoint = includeLower ? i : (i + 1L);
        } else {
          inclusiveLowerPoint = Long.MIN_VALUE;
        }
        if (upperVal != null) {
          long i = upperVal.longValue();
          if (!includeUpper && i == Long.MIN_VALUE)
            return null;
          inclusiveUpperPoint = includeUpper ? i : (i - 1L);
        } else {
          inclusiveUpperPoint = Long.MAX_VALUE;
        }

        if (inclusiveLowerPoint > inclusiveUpperPoint)
          return null;

        final FieldCache.Longs values =
            FieldCache.DEFAULT.getLongs(context.reader(), field,
                (FieldCache.LongParser) parser, false);
        return new FieldCacheDocIdSet(context.reader().maxDoc(),
            acceptDocs) {
          @Override
          protected boolean matchDoc(int doc) {
            final long value = values.get(doc);
            return value >= inclusiveLowerPoint
                && value <= inclusiveUpperPoint;
          }
        };
      }
    };
  }

FieldCacheTermsFilter:

  @Override
  public DocIdSet getDocIdSet(AtomicReaderContext context,
      Bits acceptDocs) throws IOException {
    final SortedDocValues fcsi =
        getFieldCache().getTermsIndex(context.reader(), field);
    final FixedBitSet bits = new FixedBitSet(fcsi.getValueCount());
    for (int i = 0; i < terms.length; i++) {
      int ord = fcsi.lookupTerm(terms[i]);
      if (ord >= 0) {
        bits.set(ord);
      }
    }

Regards
Piotr
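The bound handling in the newLongRange code above can be tried in isolation. The following is a minimal plain-Java sketch, not the Lucene class: `LongRangeSketch`, `inclusiveBounds` and `countMatches` are hypothetical names, and a plain `long[]` stands in for the per-document values that the real filter reads from the field cache.

```java
public class LongRangeSketch {
    /**
     * Converts possibly open or exclusive bounds into inclusive {lower, upper} bounds.
     * Returns null when no value can match, mirroring the "return null" branches above.
     */
    public static long[] inclusiveBounds(Long lowerVal, Long upperVal,
                                         boolean includeLower, boolean includeUpper) {
        long lo = Long.MIN_VALUE; // a null lower bound means "unbounded below"
        long hi = Long.MAX_VALUE; // a null upper bound means "unbounded above"
        if (lowerVal != null) {
            if (!includeLower && lowerVal == Long.MAX_VALUE) {
                return null; // lowerVal + 1 would overflow; nothing is greater anyway
            }
            lo = includeLower ? lowerVal : lowerVal + 1L;
        }
        if (upperVal != null) {
            if (!includeUpper && upperVal == Long.MIN_VALUE) {
                return null; // upperVal - 1 would underflow; nothing is smaller anyway
            }
            hi = includeUpper ? upperVal : upperVal - 1L;
        }
        return lo > hi ? null : new long[] {lo, hi};
    }

    /** Checks every document's value against the bounds, as matchDoc does per document. */
    public static int countMatches(long[] valuePerDoc, long[] bounds) {
        if (bounds == null) {
            return 0;
        }
        int count = 0;
        for (long value : valuePerDoc) {
            if (value >= bounds[0] && value <= bounds[1]) {
                count++;
            }
        }
        return count;
    }
}
```

Normalizing to inclusive bounds up front keeps the per-document check down to two comparisons, and the explicit MAX_VALUE/MIN_VALUE guards avoid silent overflow when an exclusive bound sits at the edge of the long range.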
Re: Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting
Hi Piotr,

On Mon, Dec 15, 2014 at 9:43 PM, Piotr Idzikowski wrote:
> Hello.
> I am going to switch to the newest (4.10.2) version of Lucene and I'd like
> to make some optimizations in my index and code. I would like to use
> DocValuesField to get values but also for filtering and sorting. So here I
> have some questions: If I'd like to use a range filter
> (FieldCacheRangeFilter) I need to store a value in XxxDocValuesField, but
> if I want to use a terms filter (FieldCacheTermsFilter) I need to store a
> value in SortedDocValuesField. So it looks like if I want to use range and
> terms filters I need to have two different fields. Am I right? Am I using
> it correctly?

FieldCacheRangeFilter and FieldCacheTermsFilter only work well when
you have lots of terms and most documents match your filter. Otherwise
you should consider using the regular numeric range filter and terms
filter. Although they might be a bit slower in the dense case, they
will be significantly faster when few terms/documents match.

Both FieldCacheRangeFilter and FieldCacheTermsFilter would work on the
same SortedDocValues field. What makes you think you need two fields?

> Another thing is Sort. I can choose between SortedNumericSortField and
> SortField. The first one requires SortedNumericDocValues, the other
> NumericDocValuesField. Is there any (big) difference in performance? Should
> I use SortedNumericSortField (adding another field to the index)?

SortedNumericSortField is just a helper class to sort on a multi-valued
field that stores numeric doc values (in order to know whether the min
or max value should be considered for sorting). SortField already
correctly handles both numeric and sorted doc values, so you can use
either one. If you have the choice to store your data either in a numeric
doc values field or a sorted doc values field, then the numeric field
might be a bit better performance-wise (but it only works with
single-valued numerics).

> And the last one. Am I right that all corresponding DocValuesField will be
> removed from the index when the doc is removed? I saw an IndexWriter method
> for an update doc value but no delete method for doc value.

Yes, doc values will be removed too. The reason why there is this
method on IndexWriter is that Lucene supports updating doc values
fields without reindexing the document completely (which is what the
updateDocument method does).

--
Adrien
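As an illustration of the min/max remark about multi-valued numeric fields, here is a small plain-Java sketch with no Lucene dependency. `MultiValuedSortSketch`, `Selector` and `sortDocs` are names made up for this example, and it assumes every document carries at least one value.

```java
import java.util.Arrays;
import java.util.Comparator;

public class MultiValuedSortSketch {
    /** Which of a document's values represents it when sorting. */
    public enum Selector { MIN, MAX }

    // Reduces a document's values to the single value used as its sort key.
    // Assumption for this sketch: values is never empty.
    static long pick(long[] values, Selector selector) {
        return selector == Selector.MIN
                ? Arrays.stream(values).min().getAsLong()
                : Arrays.stream(values).max().getAsLong();
    }

    /** Returns doc IDs in ascending order of each document's selected value. */
    public static Integer[] sortDocs(long[][] valuesPerDoc, Selector selector) {
        Integer[] docs = new Integer[valuesPerDoc.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        Arrays.sort(docs,
                Comparator.comparingLong((Integer doc) -> pick(valuesPerDoc[doc], selector)));
        return docs;
    }
}
```

The same documents can come out in a different order depending on the selector, which is why a multi-valued sort needs that extra piece of information while a single-valued numeric field does not.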
Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting
Hello.

I am going to switch to the newest (4.10.2) version of Lucene and I'd like to
make some optimizations in my index and code. I would like to use
DocValuesField to get values but also for filtering and sorting. So here I
have some questions:

If I'd like to use a range filter (FieldCacheRangeFilter) I need to store a
value in XxxDocValuesField, but if I want to use a terms filter
(FieldCacheTermsFilter) I need to store a value in SortedDocValuesField. So it
looks like if I want to use range and terms filters I need to have two
different fields. Am I right? Am I using it correctly?

Another thing is Sort. I can choose between SortedNumericSortField and
SortField. The first one requires SortedNumericDocValues, the other
NumericDocValuesField. Is there any (big) difference in performance? Should I
use SortedNumericSortField (adding another field to the index)?

And the last one. Am I right that all corresponding DocValuesField will be
removed from the index when the doc is removed? I saw an IndexWriter method
for an update doc value but no delete method for doc value.

Regards
Piotr
Re: SortedDocValuesField
Don't use RAMDirectory: it's not very performant and is really intended for
e.g. testing. Also, using a RAMDirectory here defeats the purpose: the idea
behind using a doc values field in most cases is to keep (most of) such
data structures out of heap memory. The data structures and even the
compression used are optimized for mmap and NIO access...

On Thu, Jun 26, 2014 at 11:59 AM, Sandeep Khanzode wrote:
> Hi,
>
> I was checking the SortedDocValuesField and its performance in Sort as
> opposed to a normal field, i.e. StringField, and its performance in the
> same sort. So, I used the same string/BytesRef value in both fields and,
> in separate JVM processes, I launched the two sorts.
>
> I used a RAMDirectory and created a million items. The SortedDocValuesField
> sort took 12/13 seconds and consumed approx. 500-550 MB of RAM, whereas the
> StringField took 10/11 seconds and consumed 350-400 MB of RAM.
>
> Is this normal behavior? I was expecting the SDVF to perform better since
> it is indexed for sorting and not stored for any other purpose.
>
> ---
>
> Thanks n Regards,
> Sandeep Ramesh Khanzode
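To show what "out of heap memory" can look like in practice, here is a minimal plain-Java sketch of reading fixed-width per-document values through a memory-mapped file. It is only an analogy for the access pattern: `MmapValuesSketch` and its file layout (one 8-byte long per document) are assumptions made for this example and do not reflect Lucene's actual doc values format or its directory implementations.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.LongBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapValuesSketch {
    /** Writes one 8-byte value per document, so doc N lives at byte offset N * 8. */
    public static void write(Path file, long[] valuePerDoc) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(valuePerDoc.length * Long.BYTES);
        for (long value : valuePerDoc) {
            buf.putLong(value);
        }
        Files.write(file, buf.array());
    }

    /**
     * Maps the file read-only. The bytes are served from the OS page cache
     * rather than being copied into a long[] on the Java heap.
     */
    public static LongBuffer open(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // The mapping remains valid after the channel is closed.
            return mapped.asLongBuffer();
        }
    }
}
```

With this layout, `open(file).get(docId)` is a random-access read by document ID that needs no heap-resident copy of the data, which is the property a RAMDirectory benchmark cannot show.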
SortedDocValuesField
Hi,

I was checking the SortedDocValuesField and its performance in Sort as
opposed to a normal field, i.e. StringField, and its performance in the same
sort. So, I used the same string/BytesRef value in both fields and, in
separate JVM processes, I launched the two sorts.

I used a RAMDirectory and created a million items. The SortedDocValuesField
sort took 12/13 seconds and consumed approx. 500-550 MB of RAM, whereas the
StringField took 10/11 seconds and consumed 350-400 MB of RAM.

Is this normal behavior? I was expecting the SDVF to perform better since it
is indexed for sorting and not stored for any other purpose.

---

Thanks n Regards,
Sandeep Ramesh Khanzode