Re: Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting

2014-12-16 Thread Adrien Grand
On Tue, Dec 16, 2014 at 3:25 PM, Piotr Idzikowski
 wrote:
> So, for instance, if I store documents with e.g. a creation date, and I have
> data (millions of documents) from the last, say, 3 years, and I'd like to use
> a range filter to get docs from one month only, is it better to use an
> ordinary numeric query instead of FieldCacheRangeFilter?

Yes.
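
A minimal sketch of how the bounds for such a one-month range might be
computed in plain Java (java.time, Java 8+). The class name MonthRange and
the field name "created" are assumptions for illustration only. The Lucene
call appears only in a comment, and it presumes the date was indexed as a
LongField holding epoch milliseconds.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

/** Hypothetical helper: epoch-millisecond bounds of one calendar month in UTC. */
final class MonthRange {
    final long lowerInclusive;
    final long upperExclusive;

    MonthRange(int year, int month) {
        LocalDate first = LocalDate.of(year, month, 1);
        // Start of the month, inclusive.
        this.lowerInclusive =
            first.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        // Start of the following month, exclusive.
        this.upperExclusive =
            first.plusMonths(1).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        MonthRange r = new MonthRange(2014, 12);
        // With Lucene 4.x on the classpath, the bounds could be passed on as:
        //   NumericRangeQuery.newLongRange("created",
        //       r.lowerInclusive, r.upperExclusive, true, false);
        System.out.println(r.lowerInclusive + " <= created < " + r.upperExclusive);
    }
}
```

Using an exclusive upper bound at the start of the next month avoids having
to work out the last millisecond of the month by hand.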

>> Both FieldCacheRangeFilter and FieldCacheTermsFilter would work on the
>> same SortedDocValues field. What makes you think you need two fields ?
>>
> Code:
> FieldCacheRangeFilter
>
> public static FieldCacheRangeFilter<Long> newLongRange(String field,
>     FieldCache.LongParser parser, Long lowerVal, Long upperVal,
>     boolean includeLower, boolean includeUpper) {
>   return new FieldCacheRangeFilter<Long>(field, parser, lowerVal, upperVal,
>       includeLower, includeUpper) {
>     @Override
>     public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
>         throws IOException {
>       final long inclusiveLowerPoint, inclusiveUpperPoint;
>       if (lowerVal != null) {
>         long i = lowerVal.longValue();
>         if (!includeLower && i == Long.MAX_VALUE)
>           return null;
>         inclusiveLowerPoint = includeLower ? i : (i + 1L);
>       } else {
>         inclusiveLowerPoint = Long.MIN_VALUE;
>       }
>       if (upperVal != null) {
>         long i = upperVal.longValue();
>         if (!includeUpper && i == Long.MIN_VALUE)
>           return null;
>         inclusiveUpperPoint = includeUpper ? i : (i - 1L);
>       } else {
>         inclusiveUpperPoint = Long.MAX_VALUE;
>       }
>
>       if (inclusiveLowerPoint > inclusiveUpperPoint)
>         return null;
>
>       final FieldCache.Longs values = FieldCache.DEFAULT.getLongs(
>           context.reader(), field, (FieldCache.LongParser) parser, false);
>       return new FieldCacheDocIdSet(context.reader().maxDoc(), acceptDocs) {
>         @Override
>         protected boolean matchDoc(int doc) {
>           final long value = values.get(doc);
>           return value >= inclusiveLowerPoint && value <= inclusiveUpperPoint;
>         }
>       };
>     }
>   };
> }
>
> FieldCacheTermsFilter:
>
> @Override
> public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
>     throws IOException {
>   final SortedDocValues fcsi =
>       getFieldCache().getTermsIndex(context.reader(), field);
>   final FixedBitSet bits = new FixedBitSet(fcsi.getValueCount());
>   for (int i = 0; i < terms.length; i++) {
>     int ord = fcsi.lookupTerm(terms[i]);
>     if (ord >= 0) {
>       bits.set(ord);
>     }
>   }

The FieldCacheRangeFilter you copied is for longs indeed, but there is
also a newStringRange method that works on sorted doc values.
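
To illustrate why a string range fits naturally on sorted doc values, here is
a toy model in plain Java. It is not Lucene code: the class
OrdinalRangeSketch and its fields are made up for this sketch, assuming only
that a sorted doc values column consists of a sorted dictionary of unique
terms plus one ordinal per document. The bounds are resolved to ordinals
once, and each document is then matched by comparing ints.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Toy model of a sorted doc values column; not Lucene code. */
final class OrdinalRangeSketch {
    private final String[] dictionary; // unique terms, sorted
    private final int[] docToOrd;      // ordinal per document, -1 if no value

    OrdinalRangeSketch(String[] sortedUniqueTerms, int[] docToOrd) {
        this.dictionary = sortedUniqueTerms;
        this.docToOrd = docToOrd;
    }

    /** Returns the documents whose term lies in [lower, upper], both inclusive. */
    BitSet rangeInclusive(String lower, String upper) {
        // Resolve each bound once; a negative result encodes the insertion point.
        int lo = Arrays.binarySearch(dictionary, lower);
        if (lo < 0) lo = -lo - 1; // ordinal of the first term >= lower
        int hi = Arrays.binarySearch(dictionary, upper);
        if (hi < 0) hi = -hi - 2; // ordinal of the last term <= upper

        BitSet matches = new BitSet(docToOrd.length);
        for (int doc = 0; doc < docToOrd.length; doc++) {
            int ord = docToOrd[doc];
            // A missing value (ord == -1) never matches because lo >= 0.
            if (ord >= lo && ord <= hi) {
                matches.set(doc);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "banana", "cherry", "plum"};
        int[] ords = {2, 0, -1, 3, 1};
        OrdinalRangeSketch column = new OrdinalRangeSketch(terms, ords);
        System.out.println(column.rangeInclusive("b", "d"));
    }
}
```

The same ordinals also serve a terms filter (look up each term, mark its
ordinal, test each document's ordinal), which is why one sorted field can
back both kinds of filter.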

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting

2014-12-16 Thread Piotr Idzikowski
> So, for instance, if I store documents with e.g. a creation date, and I have
> data (millions of documents) from the last, say, 3 years, and I'd like to use
> a range filter to get docs from one month only, is it better to use an
> ordinary numeric query instead of FieldCacheRangeFilter?

Of course I meant NumericRangeQuery.


Re: Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting

2014-12-16 Thread Piotr Idzikowski
Hello.
Thanks for your reply.

On Tue, Dec 16, 2014 at 3:14 PM, Adrien Grand  wrote:
>
> Hi Piotr,
>
> On Mon, Dec 15, 2014 at 9:43 PM, Piotr Idzikowski
>  wrote:
> > Hello.
> > I am going to switch to the newest version of Lucene (4.10.2) and I'd
> > like to make some optimizations in my index and code. I would like to use
> > DocValuesField to get values but also for filtering and sorting. So here
> > I have some questions: If I'd like to use a range filter
> > (FieldCacheRangeFilter) I need to store a value in XxxDocValuesField, but
> > if I want to use a terms filter (FieldCacheTermsFilter) I need to store a
> > value in SortedDocValuesField. So it looks like if I want to use both
> > range and terms filters I need to have two different fields. Am I right?
> > Am I using it correctly?
>
> FieldCacheRangeFilter and FieldCacheTermsFilter only work well when
> you have lots of terms and most documents match your filter. Otherwise
> you should consider using the regular numeric range filter and terms
> filter. Although they might be a bit slower in the dense case, they
> will be significantly faster when few terms/documents match.
>
So, for instance, if I store documents with e.g. a creation date, and I have
data (millions of documents) from the last, say, 3 years, and I'd like to use
a range filter to get docs from one month only, is it better to use an
ordinary numeric query instead of FieldCacheRangeFilter?


>
> Both FieldCacheRangeFilter and FieldCacheTermsFilter would work on the
> same SortedDocValues field. What makes you think you need two fields ?
>
Code:
FieldCacheRangeFilter

public static FieldCacheRangeFilter<Long> newLongRange(String field,
    FieldCache.LongParser parser, Long lowerVal, Long upperVal,
    boolean includeLower, boolean includeUpper) {
  return new FieldCacheRangeFilter<Long>(field, parser, lowerVal, upperVal,
      includeLower, includeUpper) {
    @Override
    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
        throws IOException {
      final long inclusiveLowerPoint, inclusiveUpperPoint;
      if (lowerVal != null) {
        long i = lowerVal.longValue();
        if (!includeLower && i == Long.MAX_VALUE)
          return null;
        inclusiveLowerPoint = includeLower ? i : (i + 1L);
      } else {
        inclusiveLowerPoint = Long.MIN_VALUE;
      }
      if (upperVal != null) {
        long i = upperVal.longValue();
        if (!includeUpper && i == Long.MIN_VALUE)
          return null;
        inclusiveUpperPoint = includeUpper ? i : (i - 1L);
      } else {
        inclusiveUpperPoint = Long.MAX_VALUE;
      }

      if (inclusiveLowerPoint > inclusiveUpperPoint)
        return null;

      final FieldCache.Longs values = FieldCache.DEFAULT.getLongs(
          context.reader(), field, (FieldCache.LongParser) parser, false);
      return new FieldCacheDocIdSet(context.reader().maxDoc(), acceptDocs) {
        @Override
        protected boolean matchDoc(int doc) {
          final long value = values.get(doc);
          return value >= inclusiveLowerPoint && value <= inclusiveUpperPoint;
        }
      };
    }
  };
}

FieldCacheTermsFilter:

@Override
public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
    throws IOException {
  final SortedDocValues fcsi =
      getFieldCache().getTermsIndex(context.reader(), field);
  final FixedBitSet bits = new FixedBitSet(fcsi.getValueCount());
  for (int i = 0; i < terms.length; i++) {
    int ord = fcsi.lookupTerm(terms[i]);
    if (ord >= 0) {
      bits.set(ord);
    }
  }



Regards
Piotr


Re: Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting

2014-12-16 Thread Adrien Grand
Hi Piotr,

On Mon, Dec 15, 2014 at 9:43 PM, Piotr Idzikowski
 wrote:
> Hello.
> I am going to switch to the newest version of Lucene (4.10.2) and I'd like
> to make some optimizations in my index and code. I would like to use
> DocValuesField to get values but also for filtering and sorting. So here I
> have some questions: If I'd like to use a range filter
> (FieldCacheRangeFilter) I need to store a value in XxxDocValuesField, but
> if I want to use a terms filter (FieldCacheTermsFilter) I need to store a
> value in SortedDocValuesField. So it looks like if I want to use both range
> and terms filters I need to have two different fields. Am I right? Am I
> using it correctly?

FieldCacheRangeFilter and FieldCacheTermsFilter only work well when
you have lots of terms and most documents match your filter. Otherwise
you should consider using the regular numeric range filter and terms
filter. Although they might be a bit slower in the dense case, they
will be significantly faster when few terms/documents match.
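
The trade-off can be sketched with a toy cost model in plain Java. This is
not Lucene code: FilterCostSketch and its methods are invented for
illustration, and the TreeMap merely stands in for an inverted index (it
ignores the trie encoding that real numeric range queries rely on). The
point is only that a per-document column scan does work proportional to the
number of documents, while an index walk does work proportional to the
number of matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy cost model for the two filtering strategies; not Lucene code. */
final class FilterCostSketch {

    /** Column scan: tests every document, so the work grows with maxDoc. */
    static int scanColumn(long[] valuePerDoc, long min, long max, List<Integer> out) {
        int visited = 0;
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            visited++;
            if (valuePerDoc[doc] >= min && valuePerDoc[doc] <= max) {
                out.add(doc);
            }
        }
        return visited;
    }

    /** Index walk: visits only the values inside [min, max] and their postings. */
    static int walkIndex(TreeMap<Long, List<Integer>> postings,
                         long min, long max, List<Integer> out) {
        int visited = 0;
        for (List<Integer> docs : postings.subMap(min, true, max, true).values()) {
            visited += docs.size();
            out.addAll(docs);
        }
        return visited;
    }

    /** Builds the inverted view (value -> doc ids) of a per-document column. */
    static TreeMap<Long, List<Integer>> invert(long[] valuePerDoc) {
        TreeMap<Long, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            postings.computeIfAbsent(valuePerDoc[doc], k -> new ArrayList<>()).add(doc);
        }
        return postings;
    }

    public static void main(String[] args) {
        long[] values = new long[1000];
        for (int doc = 0; doc < values.length; doc++) {
            values[doc] = doc; // one distinct value per document
        }
        List<Integer> fromScan = new ArrayList<>();
        List<Integer> fromIndex = new ArrayList<>();
        System.out.println("column scan visited "
            + scanColumn(values, 10, 19, fromScan));
        System.out.println("index walk visited "
            + walkIndex(invert(values), 10, 19, fromIndex));
    }
}
```

When most documents match, the two visit counts converge and the simpler
per-document comparison of the column scan tends to win.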

Both FieldCacheRangeFilter and FieldCacheTermsFilter would work on the
same SortedDocValues field. What makes you think you need two fields?

> Another thing is Sort. I can choose between SortedNumericSortField and
> SortField. The first one requires SortedNumericDocValues, the other
> NumericDocValuesField. Is there any (big) difference in performance? Should
> I use SortedNumericSortField (adding another field to the index)?

SortedNumericSortField is just a helper class to sort on a
multi-valued field that stores numeric doc values (in order to know
whether the min or max value should be considered for sorting).
SortField already handles both numeric and sorted doc values
correctly, so you can use either one. If you have the choice to store your
data either in a numeric doc values field or a sorted doc values
field, then the numeric field might be a bit better performance-wise
(but it only works with single-valued numerics).
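
The min/max choice for a multi-valued field can be sketched in plain Java.
This is a toy model, not Lucene code: MultiValuedSortSketch, Selector and
the sample data are all assumptions made for illustration. Each document is
reduced to one representative value, and the documents are then sorted on
that value.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Toy model of sorting on a multi-valued numeric field; not Lucene code. */
final class MultiValuedSortSketch {
    enum Selector { MIN, MAX }

    /** Picks the value that represents a document; assumes at least one value. */
    static long select(long[] values, Selector selector) {
        long picked = values[0];
        for (long v : values) {
            picked = (selector == Selector.MIN) ? Math.min(picked, v) : Math.max(picked, v);
        }
        return picked;
    }

    /** Returns doc ids in ascending order of their selected value. */
    static Integer[] sortDocs(long[][] valuesPerDoc, Selector selector) {
        Integer[] docs = new Integer[valuesPerDoc.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        // Arrays.sort on objects is stable, so ties keep doc id order.
        Arrays.sort(docs, Comparator.<Integer>comparingLong(
            doc -> select(valuesPerDoc[doc], selector)));
        return docs;
    }

    public static void main(String[] args) {
        long[][] valuesPerDoc = { {5, 1}, {3}, {2, 9} };
        System.out.println(Arrays.toString(sortDocs(valuesPerDoc, Selector.MIN)));
        System.out.println(Arrays.toString(sortDocs(valuesPerDoc, Selector.MAX)));
    }
}
```

With a single value per document the selector makes no difference, which is
why a plain SortField is enough in that case.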

> And the last one. Am I right that all corresponding DocValuesFields will be
> removed from the index when a doc is removed? I saw an IndexWriter method
> for updating a doc value but no method for deleting one.

Yes, doc values will be removed too. The reason why there is an update
method on IndexWriter is that Lucene supports updating doc values
fields in place, without reindexing the document completely (which is
what the updateDocument method does).

-- 
Adrien




Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting

2014-12-15 Thread Piotr Idzikowski
Hello.
I am going to switch to the newest version of Lucene (4.10.2) and I'd like
to make some optimizations in my index and code. I would like to use
DocValuesField to get values but also for filtering and sorting. So here I
have some questions: If I'd like to use a range filter
(FieldCacheRangeFilter) I need to store a value in XxxDocValuesField, but
if I want to use a terms filter (FieldCacheTermsFilter) I need to store a
value in SortedDocValuesField. So it looks like if I want to use both range
and terms filters I need to have two different fields. Am I right? Am I
using it correctly?

Another thing is Sort. I can choose between SortedNumericSortField and
SortField. The first one requires SortedNumericDocValues, the other
NumericDocValuesField. Is there any (big) difference in performance? Should
I use SortedNumericSortField (adding another field to the index)?

And the last one. Am I right that all corresponding DocValuesFields will be
removed from the index when a doc is removed? I saw an IndexWriter method
for updating a doc value but no method for deleting one.

Regards Piotr


Re: SortedDocValuesField

2014-06-26 Thread Robert Muir
Don't use RAMDirectory: it's not very performant and is really intended
for e.g. testing and so on.

Also, using a RAMDirectory here defeats the purpose: the idea behind
using a doc values field in most cases is to keep (most of) such
data structures out of heap memory. The data structures and even the
compression used are optimized for mmap and NIO access...



On Thu, Jun 26, 2014 at 11:59 AM, Sandeep Khanzode
 wrote:
> Hi,
>
> I was comparing the sort performance of SortedDocValuesField with that of a
> normal field, i.e. StringField. I used the same string/BytesRef value in
> both fields and launched the two sorts in separate JVM processes.
>
> I used a RAMDirectory and created a million items. The SortedDocValuesField
> sort took 12-13 seconds and consumed approx. 500-550 MB of RAM, whereas the
> StringField sort took 10-11 seconds and consumed 350-400 MB of RAM.
> Is this normal behavior? I was expecting the SDVF to perform better since it
> is indexed for sorting and not stored for any other purpose.
>
> ---
>
> Thanks n Regards,
> Sandeep Ramesh Khanzode




SortedDocValuesField

2014-06-26 Thread Sandeep Khanzode
Hi,
 
I was comparing the sort performance of SortedDocValuesField with that of a
normal field, i.e. StringField. I used the same string/BytesRef value in both
fields and launched the two sorts in separate JVM processes.

I used a RAMDirectory and created a million items. The SortedDocValuesField
sort took 12-13 seconds and consumed approx. 500-550 MB of RAM, whereas the
StringField sort took 10-11 seconds and consumed 350-400 MB of RAM.
Is this normal behavior? I was expecting the SDVF to perform better since it is
indexed for sorting and not stored for any other purpose.

---

Thanks n Regards,
Sandeep Ramesh Khanzode