Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Ravikumar Govindarajan
Shai,

This is the code snippet I use inside my class...

public class MySorter extends Sorter {

  @Override
  public DocMap sort(AtomicReader reader) throws IOException {
    // Eagerly builds a docID -> sort-term map for the whole reader.
    final Map<Integer, BytesRef> docVsId = loadSortTerm(reader);

    final Sorter.DocComparator comparator = new Sorter.DocComparator() {
      @Override
      public int compare(int docID1, int docID2) {
        BytesRef v1 = docVsId.get(docID1);
        BytesRef v2 = docVsId.get(docID2);
        return v1.compareTo(v2);
      }
    };

    return sort(reader.maxDoc(), comparator);
  }
  // (other methods of the class omitted from this snippet)
}

My problem is that the AtomicReader passed to the Sorter.sort method is actually
a SlowCompositeReaderWrapper, composed of a list of AtomicReaders, each of which is
already sorted.

I find this loadSortTerm(compositeReader) call a bit heavy, as it tries to load
all the doc-to-term mappings eagerly...

Are there some alternatives for this?

--
Ravi


On Tue, Jun 17, 2014 at 10:58 AM, Shai Erera ser...@gmail.com wrote:

 I'm not sure that I follow ... where do you see DocMap being loaded up
 front? Specifically, Sorter.sort may return null if the readers are already
 sorted ... I think we already optimized for the case where the readers are
 sorted.

 Shai


 On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

  I am planning to use SortingMergePolicy where all the merge-participating
  segments are already sorted... I understand that I need to define a
 DocMap
  with old-new doc-id mappings.
 
  Is it possible to optimize the eager loading of DocMap and make it kind
 of
  lazy load on-demand?
 
  Ex: Pass List<AtomicReader> to the caller and ask for next new-old doc
  mapping..
 
  Since my segments are already sorted, I could save on memory a little-bit
  this way, instead of loading the full DocMap upfront
 
  --
  Ravi
 



RE: Lucene Upgrade from 2.9.x to 4.7.x

2014-06-17 Thread Uwe Schindler
Hi,

 Thanks Uwe. I tried this path and I do not find any .cfs files.

Lucene 3 and Lucene 4 indexes do not necessarily always contain CFS files, 
especially not if they are optimized. This depends on the merge policy. The 
index upgrader uses the default one, which creates no CFS file for the largest 
segment of an index. As there is only one segment after the upgrade, it is not in 
compound format.
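
For reference, the upgrade itself can be invoked roughly like this (a sketch only; the path is a placeholder and the Version constant should match the release you use):

// Sketch: upgrade a 3.x index in place to the 4.x format using the default merge policy.
Directory dir = FSDirectory.open(new File("/path/to/index"));
new IndexUpgrader(dir, Version.LUCENE_47).upgrade();
dir.close();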

 All that I see in my index directory after running upgrader is following 
 files.

 -rw--- 1 root root  245 Jun 16 22:38 _1.fdt
 -rw--- 1 root root   45 Jun 16 22:38 _1.fdx
 -rw--- 1 root root 2809 Jun 16 22:38 _1.fnm
 -rw--- 1 root root  487 Jun 16 22:38 _1_Lucene41_0.doc
 -rw--- 1 root root   34 Jun 16 22:38 _1_Lucene41_0.pay
 -rw--- 1 root root 3999 Jun 16 22:38 _1_Lucene41_0.pos
 -rw--- 1 root root 5575 Jun 16 22:38 _1_Lucene41_0.tim
 -rw--- 1 root root  834 Jun 16 22:38 _1_Lucene41_0.tip
 -rw--- 1 root root  110 Jun 16 22:38 _1.nvd
 -rw--- 1 root root  343 Jun 16 22:38 _1.nvm
 -rw--- 1 root root  419 Jun 16 22:38 _1.si

That looks perfectly fine, although the index is very small. This is already 
the 4.x index - what did the Lucene 3.6 index look like? The size of the index 
should be of the same order of magnitude as before the upgrade.

 My search query returns zero objects. Can someone help me here? 

The reason for this can be changes in the analysis. Lucene searches only work 
if the index-time and query-time analysis are compatible, which is not guaranteed 
with such a gap in Lucene versions. Please make sure that you use the same analyzers 
before and after the upgrade, with the same matchVersion parameter (in your case you 
would need to pass the Version.LUCENE_29 parameter to your analyzer, which is no 
longer available in Lucene 4). Whether it is possible to easily upgrade without 
reindexing all the data depends on the behavior of the analyzer that was used 
before. E.g., StandardAnalyzer changed its behavior to be Unicode-conformant in 
Lucene 3.x. This makes it incompatible for some queries, but simple ones still 
work.
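
To illustrate the matchVersion point, a minimal sketch (StandardAnalyzer and the field name "body" are placeholders, not necessarily what you use):

// Sketch: the same analyzer type and matchVersion must be used at index and query time.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_47, analyzer); // index time
QueryParser parser = new QueryParser(Version.LUCENE_47, "body", analyzer);  // query time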

Uwe


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Search degradation on Windows when upgrading from lucene 3.6 to lucene 4.7.2

2014-06-17 Thread Shlomit Rosen
Hi, 

We are in the process of upgrading from lucene 3.6.0 to lucene 4.7.2, 
and our tests show a significant search degradation on the Windows platform.

Trying to figure this out, here are a couple of points we noticed. 
Any suggestions/thoughts will be greatly appreciated. 

Thanks!

1) Running search on an optimized collection.

Our first run on Windows machine showed the following results: 
Lucene 3.6: 115 queries / sec
Lucene 4.7.2:   74   queries / sec

Looking at the collections themselves, we got the following 
characterization: 


Lucene 3.6
General Index Information:
==
Num docs: 10485760
Num deleted docs: 0
Deletion rate: 0%
Number of files in FOLDER: 116
Total size of files in FOLDER: 81558862032 bytes (75.96 GB)

Commit Point Information:
=
Version: 1399567203042
Timestamp: 1399593668185
Generation: 6018
Segments file name: segments_4n6
Number of segments: 32
Committed size: 81216915273 bytes (75.64 GB)
Number of files in COMMIT POINT: 89
Total size of files in COMMIT POINT: 81216923390 bytes (75.64 GB)


Lucene 4.7.2: 
General Index Information:
==
Num docs: 10485760
Num deleted docs: 0
Deletion rate: 0%
Number of files in FOLDER: 301
Total size of files in FOLDER: 71019073768 bytes (66.14 GB)

Commit Point Information:
=
Generation: 4518
Segments file name: segments_3hi
Number of segments: 38
Committed size: 70635339707 bytes (65.78 GB)
Number of files in COMMIT POINT: 115
Total size of files in COMMIT POINT: 70635341223 bytes (65.78 GB)

We saw that the collection created by lucene 4.7.2 was ~10GB smaller, but it 
had more segments. 
We thought that more segments might account for the search degradation, and 
so we decided to run optimization on the 4.7.2 index before rerunning the 
search test. 

The index was more compact: 

Lucene 4.7.2
General Index Information:
==
Num docs: 10485760
Num deleted docs: 0
Deletion rate: 0%
Number of files in FOLDER: 38
Total size of files in FOLDER: 70488334388 bytes (65.65 GB)

Commit Point Information:
=
Generation: 4519
Segments file name: segments_3hj
Number of segments: 12
Committed size: 70488333864 bytes (65.65 GB)
Number of files in COMMIT POINT: 37
Total size of files in COMMIT POINT: 70488334368 bytes (65.65 GB)
And as expected, the search results were much better: 
4.7.2:   118 queries / sec


We thought that this might be a good direction, so our next step was to 
simulate a more compact index as part of our indexing session without 
running a full optimize at the end. 
To do that we changed maxMergeMB from 4 GB to 6 GB. The collection was 
indeed more compact: 


Win64 4.7.2 merge=6000 commitPoints:
General Index Information:
==
Num docs: 10485760
Num deleted docs: 0
Deletion rate: 0%
Number of files in FOLDER: 213
Total size of files in FOLDER: 83038952682 bytes (77.34 GB)

Commit Point Information:
=
Generation: 4406
Segments file name: segments_3ee
Number of segments: 14
Committed size: 70324985193 bytes (65.50 GB)
Number of files in COMMIT POINT: 91
Total size of files in COMMIT POINT: 70324985781 bytes (65.50 GB)
But search results were not good at all: 
4.7.2:  72 queries / sec

Does this make sense? 
We thought of optimize as mainly decreasing the number of segments in 
the collection and removing deletions. 
In this scenario, we had no deletions, and we saw that the number of 
segments did in fact decrease substantially, 
so why are we not seeing this reflected in search performance? Is there any 
other optimize influence or hidden operation that we are missing here? 

(Note that we are using LogByteSizeMergePolicy. We know that 
TieredMergePolicy is supposed to be better in this respect, but it is 
important to us 
to keep the order of the documents the same between commit points...)
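
For clarity, the merge-policy change described above is essentially the following (a sketch of our configuration, not the complete writer setup; indexWriterConfig stands for our IndexWriterConfig):

// Sketch: LogByteSizeMergePolicy with the larger max segment size we tested.
LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
mergePolicy.setMaxMergeMB(6 * 1024);           // was 4 * 1024 (4 GB) before
indexWriterConfig.setMergePolicy(mergePolicy);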


2) Search Directory
On Lucene 3.6, we did comprehensive testing and saw that the best search 
performance is reached when using an MMapDirectory 
(for indexing we are using SimpleFSDirectory). 
We tried different directories again with lucene 4.7.2, and while the 
differences were not big, it still seems that MMap is no longer the best 
option: 

Lucene 4.7.2 with MMap: 72 queries / sec
Lucene 4.7.2 with SimpleFS: 84 queries / sec
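
For reference, the two search-time directory setups being compared are essentially these (a sketch; the index path is a placeholder):

// Sketch: the two directory implementations used for searching in our tests.
Directory mmapDir   = new MMapDirectory(new File("/path/to/index"));
Directory simpleDir = new SimpleFSDirectory(new File("/path/to/index"));
IndexReader reader  = DirectoryReader.open(mmapDir); // or simpleDir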

Were there any changes around the MMap directory that might account for 
this difference? 
If so, do you think that those changes might account for the overall 
performance we are seeing? 

3) Java 6 / Java 7
We are currently running on Java 6 (that is also the reason we stopped at 
lucene 4.7.2 and not 4.8). 
Is there a reason to believe that the degradation might be connected to 
this? 


Thanks again in advance!


Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Ravikumar Govindarajan
I am afraid the DocMap still maintains doc-id mappings till merge and I am
trying to avoid it...

I think lucene itself has a MergeIterator in o.a.l.util package.

A MergePolicy can wrap a simple MergeIterator for iterating docs across
different AtomicReaders in the correct sort order for a given field/term.

That should be fine, right?

--
Ravi

--
Ravi


On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera ser...@gmail.com wrote:

 loadSortTerm is your method right? In the current Sorter.sort
 implementation, I see this code:

 boolean sorted = true;
 for (int i = 1; i < maxDoc; ++i) {
   if (comparator.compare(i-1, i) > 0) {
     sorted = false;
     break;
   }
 }
 if (sorted) {
   return null;
 }

 Perhaps you can write similar code?
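
 For illustration, a minimal sketch of such a check inside your sort(AtomicReader)
 method, assuming the sort key is also available as a SortedDocValues field (the
 field name "sortKey" is hypothetical):

 // Sketch: cheap "already sorted?" test before building the full doc->term map.
 SortedDocValues dv = reader.getSortedDocValues("sortKey");
 if (dv != null) {
   boolean alreadySorted = true;
   for (int i = 1; i < reader.maxDoc(); ++i) {
     // ordinals are assigned in term order, so comparing ords compares sort keys
     if (dv.getOrd(i - 1) > dv.getOrd(i)) {
       alreadySorted = false;
       break;
     }
   }
   if (alreadySorted) {
     return null; // segment already in order; loadSortTerm can be skipped entirely
   }
 }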

 Also note that the sorting interface has changed, I think in 4.8, and now
 you don't really need to implement a Sorter, but rather pass a SortField,
 if that works for you.
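
 If I recall the newer API correctly, that looks roughly like this (the field name
 "sortKey" is hypothetical and iwc stands for your IndexWriterConfig):

 // Sketch (newer API, from memory): sort by a SortField instead of a custom Sorter.
 Sort sort = new Sort(new SortField("sortKey", SortField.Type.STRING));
 iwc.setMergePolicy(new SortingMergePolicy(new TieredMergePolicy(), sort));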

 Shai


 On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

  Shai,
 
  This is the code snippet I use inside my class...
 
  public class MySorter extends Sorter {
 
  @Override
 
  public DocMap sort(AtomicReader reader) throws IOException {
 
    final Map<Integer, BytesRef> docVsId = loadSortTerm(reader);
 
final Sorter.DocComparator comparator = new Sorter.DocComparator() {
 
@Override
 
 public int compare(int docID1, int docID2) {
 
BytesRef v1 = docVsId.get(docID1);
 
BytesRef v2 = docVsId.get(docID2);
 
 return v1.compareTo(v2);
 
 }
 
   };
 
   return sort(reader.maxDoc(), comparator);
 
  }
  }
 
  My Problem is, the AtomicReader passed to Sorter.sort method is
 actually
  a SlowCompositeReader, composed of a list of AtomicReaders each of which
 is
  already sorted.
 
  I find this loadSortTerm(compositeReader) to be a bit heavy where it
  tries to all load the doc-to-term mappings eagerly...
 
  Are there some alternatives for this?
 
  --
  Ravi
 
 
  On Tue, Jun 17, 2014 at 10:58 AM, Shai Erera ser...@gmail.com wrote:
 
   I'm not sure that I follow ... where do you see DocMap being loaded up
    front? Specifically, Sorter.sort may return null if the readers are
  already
   sorted ... I think we already optimized for the case where the readers
  are
   sorted.
  
   Shai
  
  
   On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan 
   ravikumar.govindara...@gmail.com wrote:
  
I am planning to use SortingMergePolicy where all the
  merge-participating
segments are already sorted... I understand that I need to define a
   DocMap
with old-new doc-id mappings.
   
Is it possible to optimize the eager loading of DocMap and make it
 kind
   of
lazy load on-demand?
   
 Ex: Pass List<AtomicReader> to the caller and ask for next new-old
 doc
mapping..
   
Since my segments are already sorted, I could save on memory a
  little-bit
this way, instead of loading the full DocMap upfront
   
--
Ravi
   
  
 



Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera

 I am afraid the DocMap still maintains doc-id mappings till merge and I am
 trying to avoid it...


What do you mean 'till merge'? The method OneMerge.getMergeReaders() is
called only when the merge is executed, not when the MergePolicy decided to
merge those segments. Therefore the DocMap is initialized only when the
merge actually executes ... what is there more to postpone?

And besides, if the segments are already sorted, you should return a null
DocMap, like Lucene code does ...

If I miss your point, I'd appreciate if you can point me to a code example,
preferably in Lucene source, which demonstrates the problem.

Shai


On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:

 I am afraid the DocMap still maintains doc-id mappings till merge and I am
 trying to avoid it...

 I think lucene itself has a MergeIterator in o.a.l.util package.

 A MergePolicy can wrap a simple MergeIterator for iterating docs across
 different AtomicReaders in correct sort-order for a given field/term

 That should be fine right?

 --
 Ravi

 --
 Ravi


 On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera ser...@gmail.com wrote:

  loadSortTerm is your method right? In the current Sorter.sort
  implementation, I see this code:
 
  boolean sorted = true;
  for (int i = 1; i < maxDoc; ++i) {
    if (comparator.compare(i-1, i) > 0) {
  sorted = false;
  break;
}
  }
  if (sorted) {
return null;
  }
 
  Perhaps you can write similar code?
 
  Also note that the sorting interface has changed, I think in 4.8, and now
  you don't really need to implement a Sorter, but rather pass a SortField,
  if that works for you.
 
  Shai
 
 
  On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan 
  ravikumar.govindara...@gmail.com wrote:
 
   Shai,
  
   This is the code snippet I use inside my class...
  
   public class MySorter extends Sorter {
  
   @Override
  
   public DocMap sort(AtomicReader reader) throws IOException {
  
  final Map<Integer, BytesRef> docVsId = loadSortTerm(reader);
  
 final Sorter.DocComparator comparator = new Sorter.DocComparator() {
  
 @Override
  
  public int compare(int docID1, int docID2) {
  
 BytesRef v1 = docVsId.get(docID1);
  
 BytesRef v2 = docVsId.get(docID2);
  
  return v1.compareTo(v2);
  
  }
  
};
  
return sort(reader.maxDoc(), comparator);
  
   }
   }
  
   My Problem is, the AtomicReader passed to Sorter.sort method is
  actually
   a SlowCompositeReader, composed of a list of AtomicReaders each of
 which
  is
   already sorted.
  
   I find this loadSortTerm(compositeReader) to be a bit heavy where it
   tries to all load the doc-to-term mappings eagerly...
  
   Are there some alternatives for this?
  
   --
   Ravi
  
  
   On Tue, Jun 17, 2014 at 10:58 AM, Shai Erera ser...@gmail.com wrote:
  
I'm not sure that I follow ... where do you see DocMap being loaded
 up
 front? Specifically, Sorter.sort may return null if the readers are
   already
sorted ... I think we already optimized for the case where the
 readers
   are
sorted.
   
Shai
   
   
On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:
   
 I am planning to use SortingMergePolicy where all the
   merge-participating
 segments are already sorted... I understand that I need to define a
DocMap
 with old-new doc-id mappings.

 Is it possible to optimize the eager loading of DocMap and make it
  kind
of
 lazy load on-demand?

  Ex: Pass List<AtomicReader> to the caller and ask for next new-old
  doc
 mapping..

 Since my segments are already sorted, I could save on memory a
   little-bit
 this way, instead of loading the full DocMap upfront

 --
 Ravi

   
  
 



Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Ravikumar Govindarajan

 Therefore the DocMap is initialized only when the
 merge actually executes ... what is there more to postpone?


Agreed. However, what I am asking is: if there is an alternative to DocMap,
will that be better? Please read on.

 And besides, if the segments are already sorted, you should return a
null DocMap,
 like Lucene code does ...


What I am trying to say is that my individual segments are sorted. However,
when a merge combines N individual sorted segments, there needs to be a
global sort order for writing the new segment. Passing a null DocMap won't
work here, no?

DocMap is one way of bringing in the global order during a merge. Another way
is to use something like a MergedIterator<SegmentReader> instead of a DocMap,
which doesn't need any memory.

I was trying to get a heads-up on these two approaches. Please do let me know
if I have understood correctly.

--
Ravi




On Tue, Jun 17, 2014 at 5:42 PM, Shai Erera ser...@gmail.com wrote:

 
  I am afraid the DocMap still maintains doc-id mappings till merge and I
 am
  trying to avoid it...
 

 What do you mean 'till merge'? The method OneMerge.getMergeReaders() is
 called only when the merge is executed, not when the MergePolicy decided to
 merge those segments. Therefore the DocMap is initialized only when the
 merge actually executes ... what is there more to postpone?

 And besides, if the segments are already sorted, you should return a null
 DocMap, like Lucene code does ...

 If I miss your point, I'd appreciate if you can point me to a code example,
 preferably in Lucene source, which demonstrates the problem.

 Shai


 On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

  I am afraid the DocMap still maintains doc-id mappings till merge and I
 am
  trying to avoid it...
 
  I think lucene itself has a MergeIterator in o.a.l.util package.
 
  A MergePolicy can wrap a simple MergeIterator for iterating docs across
  different AtomicReaders in correct sort-order for a given field/term
 
  That should be fine right?
 
  --
  Ravi
 
  --
  Ravi
 
 
  On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera ser...@gmail.com wrote:
 
   loadSortTerm is your method right? In the current Sorter.sort
   implementation, I see this code:
  
   boolean sorted = true;
   for (int i = 1; i < maxDoc; ++i) {
     if (comparator.compare(i-1, i) > 0) {
   sorted = false;
   break;
 }
   }
   if (sorted) {
 return null;
   }
  
   Perhaps you can write similar code?
  
   Also note that the sorting interface has changed, I think in 4.8, and
 now
   you don't really need to implement a Sorter, but rather pass a
 SortField,
   if that works for you.
  
   Shai
  
  
   On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan 
   ravikumar.govindara...@gmail.com wrote:
  
Shai,
   
This is the code snippet I use inside my class...
   
public class MySorter extends Sorter {
   
@Override
   
public DocMap sort(AtomicReader reader) throws IOException {
   
   final Map<Integer, BytesRef> docVsId = loadSortTerm(reader);
   
  final Sorter.DocComparator comparator = new Sorter.DocComparator()
 {
   
  @Override
   
   public int compare(int docID1, int docID2) {
   
  BytesRef v1 = docVsId.get(docID1);
   
  BytesRef v2 = docVsId.get(docID2);
   
   return v1.compareTo(v2);
   
   }
   
 };
   
 return sort(reader.maxDoc(), comparator);
   
}
}
   
My Problem is, the AtomicReader passed to Sorter.sort method is
   actually
a SlowCompositeReader, composed of a list of AtomicReaders each of
  which
   is
already sorted.
   
I find this loadSortTerm(compositeReader) to be a bit heavy where
 it
tries to all load the doc-to-term mappings eagerly...
   
Are there some alternatives for this?
   
--
Ravi
   
   
On Tue, Jun 17, 2014 at 10:58 AM, Shai Erera ser...@gmail.com
 wrote:
   
 I'm not sure that I follow ... where do you see DocMap being loaded
  up
  front? Specifically, Sorter.sort may return null if the readers are
already
 sorted ... I think we already optimized for the case where the
  readers
are
 sorted.

 Shai


 On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

  I am planning to use SortingMergePolicy where all the
merge-participating
  segments are already sorted... I understand that I need to
 define a
 DocMap
  with old-new doc-id mappings.
 
  Is it possible to optimize the eager loading of DocMap and make
 it
   kind
 of
  lazy load on-demand?
 
   Ex: Pass List<AtomicReader> to the caller and ask for next
 new-old
   doc
  mapping..
 
  Since my segments are already sorted, I could save on memory a
little-bit
  this way, instead of loading the full DocMap upfront
 
  --
  Ravi
 

   
  
 



Re: Facets in Lucene 4.7.2

2014-06-17 Thread Sandeep Khanzode
Hi,

Thanks again!

This time, I have indexed data with the following specs. I run into > 40 
seconds for the FastTaxonomyFacetCounts to create all the facets. Is this as 
per your measurements? Subsequent runs fare much better probably because of the 
Windows file system cache. How can I speed this up? 
I believe there was a CategoryListCache earlier. Is there any cache or other 
implementation that I can use?

Secondly, I had a general question. If I extrapolate these numbers to a 
billion documents, my search and facet numbers would probably be unusable in a 
real-time scenario. What are the strategies employed when you deal with such 
large scale? I am new to Lucene so please also direct me to the relevant info 
sources. Thanks!
 
Corpus:
Count: 20M, Size: 51GB
 
Index:
Size (w/o Facets): 19GB, Size (w/Facets): 20.12GB
Creation Time (w/o Facets): 3.46hrs, Creation Time (w/Facets): 3.49hrs
 
Search Performance:
   With 29055 hits (5 terms in query): 
   Query Execution: 8 seconds
   Facet counts execution: 40-45 seconds
   
   With 4.22M hits (2 terms in query): 
   Query Execution: 3 seconds
   Facet counts execution: 42-46 seconds
 
   With 15.1M hits (1 term in query): 
   Query Execution: 2 seconds
   Facet counts execution: 45-53 seconds
 
   With 6183 hits (5 different values for the same 5 terms): (Without Flushing Windows File Cache on Next run)
   Query Execution: 11 seconds
   Facet counts execution: < 1 second
 
   With 4.9M hits (1 different value for the 1 term): (Without Flushing Windows File Cache on Next run)
   Query Execution: 2 seconds
   Facet counts execution: 3 seconds

---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Monday, June 16, 2014 8:11 PM, Shai Erera ser...@gmail.com wrote:
 


Hi

1.] Is there any API that gives me the count of a specific dimension from
 FacetCollector in response to a search query. Currently, I use the
 getTopChildren() with some value and then check the
 FacetResult object for
 the actual number of dimensions hit along with their occurrences. Also, the
 getSpecificValue() does not work without a path attribute to the API.


To get the value of the dimension itself, you should call getTopChildren(1,
dim). Note that getSpecificValue does not allow passing only the dimension,
and getTopChildren requires topN to be > 0. Passing 1 is a hack, but I'm
not sure we should specifically support getting the aggregated value of
just the dimension ... once you get that, the FacetResult.value tells you
the aggregated count.
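
In code, that hack looks roughly like this (a sketch; "Author" is just an example dimension and facets is your Facets instance):

// Sketch: aggregated value of a whole dimension via getTopChildren(1, dim).
FacetResult r = facets.getTopChildren(1, "Author");
Number dimCount = (r == null) ? 0 : r.value; // aggregated count for the dimension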

2.] Can I find the MAX or MIN value of a Numeric type field written to the
 index?


Depends how you index them. If you index the field as a numeric field (e.g.
LongField), I believe you can use NumericUtils.getMaxLong. If it's a
DocValues field, I don't know of a built-in function that does it, but this
thread has a demo code:
http://www.gossamer-threads.com/lists/lucene/java-user/195594.

3.] I am trying to compare and contrast Lucene Facets with Elastic Search.
 I could determine that ES does search time faceting and dynamically returns
 the response without any prior faceting during indexing time. Is index time
 lag is not my concern, can I assume that, in general, performance-wise
 Lucene facets would be faster?


I will start by saying that I don't know much about how ES facets work. We
have some committers who know both how Lucene and ES facets work, so they
can comment on that. But I personally don't think there's truly no index-time
decision when it comes to faceting. Well .. not unless you're faceting on
arbitrary terms. Otherwise, you already make decisions such as indexing the
field as not tokenized/analyzed/lowercased/doc-values etc.

Note that Lucene facets also support non-taxonomy based faceting option,
using the DocValues fields. Look at SortedSetDocValuesFacetField. This too
can be perceived as an index-time decision though... And there are some
built-in dynamic faceting capabilities too, like range facets
(LongRangeFacetCounts), which can work on any NumericDocValuesField, as
well as any ValueSource (such as Expressions).
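
For example, a rough sketch of the range option (the "timestamp" field and the ranges are made up, and fc is the FacetsCollector from your search):

// Sketch: dynamic range facets over a NumericDocValuesField, no taxonomy needed.
long now = System.currentTimeMillis() / 1000;
Facets ranges = new LongRangeFacetCounts("timestamp", fc,
    new LongRange("past hour", now - 3600, true, now, true),
    new LongRange("past day",  now - 86400, true, now, true));
FacetResult result = ranges.getTopChildren(10, "timestamp");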

I cannot compare ES facets to Lucene's in terms of performance, as I
haven't benchmarked them yet.

4.] I index a semi-large-ish corpus of 20M files across 50GB. If I do not
 use IndexWriter.commit(), I get standard files like cfe/cfs/si in the index
 directory. However, if I do use the commit(), then as I understand it, the
 state is persisted to the disk. But this time, there are additional file
 extensions like doc/pos/tim/tip/dvd/dvm, etc. I am not sure about this
 difference and its cause.


The information of the doc/tim/tip etc. is buffered in memory (controlled
by ramBufferSizeMB) and when they are flushed (on commit or when the RAM
buffer fills up), those files materialize on disk. When you call 

Facet migration 4.6.1 to 4.7.0

2014-06-17 Thread Nicola Buso
Hi,

I'm migrating from lucene 4.6.1 to 4.8.1 and I noticed some Facet API
changes that happened in 4.7.0, probably mostly related to this ticket:
http://issues.apache.org/jira/browse/LUCENE-5339

Here are a few questions about some customizations/extensions we did that
seem not to have a direct counterpart/extension point in the new API;
can someone help with these questions?

- we are extending FacetResultsHandler to change the order of the facet
results (i.e. date facets ordered by date instead of count). How can I
achieve this now?

- we have the usual IndexReaders opened in groups with MultiReader, then we're
merging the TaxonomyReaders in RAM to obtain a correspondence of the
MultiReader for the taxonomies. Do you think I can still do this?

- at some point you removed the residue information from facets and we
calculated it differently; am I right that I can now calculate it as
FacetResult.childCount - FacetResult.labelValues.length?

- we are extending TaxonomyFacetsAccumulator to provide:
  - specific FacetResultsHandler(s) depending on the facet
  - facets other than the top-k if the user selected some facet values
from the residue.
Where does the API permit extending the behavior to achieve this?


Any help will be really appreciated,



Nicola.



-- 
Nicola Buso
Software Engineer - Web Production Team

European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory

Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

URL: http://www.ebi.ac.uk


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Indexing size increase 20% after switching from lucene 4.4 to 4.5 or 4.8 with BinaryDocValuesField

2014-06-17 Thread Zhao, Gang
I used lucene 4.4 to create an index for some documents. One of the indexed 
fields is a BinaryDocValuesField. After I changed the dependency to lucene 4.5, 
the index size for 1 million documents increased from 293MB to 357MB. If I did 
not use BinaryDocValuesField, the index size increased only about 2%. I also 
tried lucene 4.8; the index size is similar to the index size with lucene 4.5.

I am wondering what changed in the handling of BinaryDocValuesField from 4.4 to 
4.5 or 4.8.

Gang Zhao
Software Engineer - EA Digital Platform
207 Redwood Shores Parkway
Redwood City, CA 94065
Direct Line: 650-628-3719



Re: Indexing size increase 20% after switching from lucene 4.4 to 4.5 or 4.8 with BinaryDocValuesField

2014-06-17 Thread Robert Muir
Again, because merging is based on byte size, you have to be careful how
you measure (hint: use LogDocMergePolicy).

Otherwise you are comparing apples and oranges.

Separately, your configuration is using experimental codecs like
disk/memory, which aren't as heavily benchmarked as the default index
format.
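
For example, something along these lines (a sketch; analyzer and the Version constant are whatever you already use):

// Sketch: merge by document count so size comparisons aren't skewed by byte-based merging.
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, analyzer);
iwc.setMergePolicy(new LogDocMergePolicy());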


On Fri, Jun 13, 2014 at 8:09 PM, Zhao, Gang gz...@ea.com wrote:

   I used lucene 4.4 to create index for some documents. One of the
 indexing fields is BinaryDocValuesField. After I change the dependency to
 lucene 4.5. The index size for 1 million documents increases from 293MB to
 357MB. If I did not use BinaryDocValuesField, the index size increases only
 about 2%. I also tried lucene 4.8. The index size is similar to index size
 with lucene 4.5.



 I am wondering what the change for handling BinaryDocValuesField from 4.4
 to 4.5 or 4.8 is.



 Gang Zhao

 Software Engineer - EA Digital Platform

 207 Redwood Shores Parkway
 Redwood City, CA 94065

 Direct Line: 650-628-3719






Re: Facets in Lucene 4.7.2

2014-06-17 Thread Shai Erera
Hi

40 seconds for faceted search is ... crazy. Also, note how the times don't
differ much even though the number of hits is much higher (29K vs 15.1M)
... That, together with the fact that you say subsequent queries are much faster (a few
seconds), suggests that something is seriously messed up w/ your
environment. Maybe it's a faulty disk? E.g. after the file system cache is
warm, you no longer hit the disk?

In general, the more hits you have, the more expensive is faceted search.
It's also true for scoring as well (i.e. even without facets). There's just
more work to determine the top results (docs, facets...). With facets, you
can use sampling (see RandomSamplingFacetsCollector), but I would do that
only after you verify that collecting 15M docs is very expensive for you,
even when the file system cache is hot.
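
A rough sketch of the sampling option, should you get to that point (the sample size is arbitrary; searcher, query, taxoReader and config are the ones from your code):

// Sketch: sample the matching docs before counting facets; useful when millions of docs match.
RandomSamplingFacetsCollector sfc = new RandomSamplingFacetsCollector(100000);
FacetsCollector.search(searcher, query, 100, sfc);
Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, sfc);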

I've never seen those numbers before, therefore it's difficult for me to
relate to them.

There's a caching mechanism for facets, through CachedOrdinalsReader. But I
wouldn't go there until you verify that your IO system is good (try another
machine, OS, disk ...)., and that the 40s times are truly from the faceting
code.
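
If you do end up trying it, it is wired in roughly like this (a sketch, using the default "$facets" index field; fc is your FacetsCollector):

// Sketch: cache the decoded facet ordinals in memory across searches.
OrdinalsReader ords = new CachedOrdinalsReader(
    new DocValuesOrdinalsReader(FacetsConfig.DEFAULT_INDEX_FIELD_NAME));
Facets facets = new TaxonomyFacetCounts(ords, taxoReader, config, fc);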

Shai


On Tue, Jun 17, 2014 at 4:21 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi,

 Thanks again!

 This time, I have indexed data with the following specs. I run into > 40
 seconds for the FastTaxonomyFacetCounts to create all the facets. Is this
 as per your measurements? Subsequent runs fare much better probably because
 of the Windows file system cache. How can I speed this up?
 I believe there was a CategoryListCache earlier. Is there any cache or
 other implementation that I can use?

 Secondly, I had a general question. If I extrapolate these numbers for a
 billion documents, my search and facet number may probably be unusable in a
 real time scenario. What are the strategies employed when you deal with
 such large scale? I am new to Lucene so please also direct me to the
 relevant info sources. Thanks!

 Corpus:
 Count: 20M, Size: 51GB

 Index:
 Size (w/o Facets): 19GB, Size
 (w/Facets): 20.12GB
 Creation Time (w/o Facets):
 3.46hrs, Creation Time (w/Facets): 3.49hrs

 Search Performance:
With 29055 hits (5 terms in query):
Query Execution: 8 seconds
Facet counts execution: 40-45 seconds

With 4.22M hits (2 terms in query):
Query Execution: 3 seconds
Facet counts execution: 42-46 seconds

With 15.1M hits (1 term in query):
Query Execution: 2 seconds
Facet counts execution: 45-53 seconds

With 6183 hits (5 different values for the same 5 terms):
  (Without Flushing Windows File Cache on Next
 run)
Query Execution: 11 seconds
    Facet counts execution: < 1 second

With 4.9M hits (1 different value for the 1 term): (Without
 Flushing
 Windows File Cache on Next run)
Query Execution: 2 seconds
Facet counts execution: 3 seconds

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Monday, June 16, 2014 8:11 PM, Shai Erera ser...@gmail.com wrote:



 Hi

 1.] Is there any API that gives me the count of a specific dimension from
  FacetCollector in response to a search query. Currently, I use the
  getTopChildren() with some value and then check the
  FacetResult object for
  the actual number of dimensions hit along with their occurrences. Also,
 the
  getSpecificValue() does not work without a path attribute to the API.
 

 To get the value of the dimension itself, you should call getTopChildren(1,
 dim). Note that getSpecificValue does not allow to pass only the dimension,
 and getTopChildren requires topN to be > 0. Passing 1 is a hack, but I'm
 not sure we should specifically support getting the aggregated value of
 just the dimension ... once you get that, the FacetResult.value tells you
 the aggregated count.

 2.] Can I find the MAX or MIN value of a Numeric type field written to the
  index?
 

 Depends how you index them. If you
  index the field as a numeric field (e.g.
 LongField), I believe you can use NumericUtils.getMaxLong. If it's a
 DocValues field, I don't know of a built-in function that does it, but this
 thread has a demo code:
 http://www.gossamer-threads.com/lists/lucene/java-user/195594.

 3.] I am trying to compare and contrast Lucene Facets with Elastic Search.
  I could determine that ES does search time faceting and dynamically
 returns
  the response without any prior faceting during indexing time. Is index
 time
  lag is not my concern, can I assume that, in general, performance-wise
  Lucene facets would be faster?
 

 I will start by saying that I don't know much about how ES facets work. We
 have some committers who know both how
  Lucene and ES facets work, so they
 can comment on that. But I personally don't think there's no index-time
 decision when it comes to faceting. Well .. 

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera
OK I think I now understand what you're asking :). It's unrelated though to
SortingMergePolicy. You propose to do the merge part of a merge-sort,
since we know the indexes are already sorted, right?

This is something we've considered in the past, but it is very tricky (see
below) and we went with the SortingAR for simplicity and speed of coding.
If however you have an idea how we can easily implement that, that would be
awesome.

So let's consider merging the posting lists of f:val from the N readers.
Say that each returns docs 0-3, and the merged posting will have 4*N
entries (say we don't have deletes). To properly merge them, you need to
lookup the sort-value of each document from each reader, and compare
according to it.

Now you move on to f:val2 (another posting) and it wants to merge 100 other
docs. So you need to lookup the value of each document, compare by it, and
merge them. And the process continues ...

These lookups are expensive and will be done millions of times (each term,
each DV field, each .. everything).

More than that, there's a serious issue of correctness, because you never
make a global sorting decision. So if f:val sees only a single document -
0, in all segments, you want to map them to 4 GLOBALLY SORTED documents. If
you make a local decision based on these 4 documents, you will end up w/ a
completely messed up segment.

I think the global DocMap is really required. Forget about the fact that other
code, e.g. IndexWriter, relies on this in order to properly apply incoming
document deletions and field updates while the segments were merging. It's
just a matter of correctness - we need to know the global sorted segment
map.

Shai


On Tue, Jun 17, 2014 at 3:41 PM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:

 
  Therefore the DocMap is initialized only when the
  merge actually executes ... what is there more to postpone?


 Agreed. However, what I am asking is, if there is an alternative to DocMap,
 will that be better? Plz read-on

  And besides, if the segments are already sorted, you should return a
 null DocMap,
  like Lucene code does ...


 What I am trying to say is, my individual segments are sorted. However,
 when a merge combines N individual sorted-segments, there needs to be a
 global sort-order for writing the new segment. Passing null DocMap won't
 work here, no?

 DocMap is one-way of bringing the global order during a merge. Another way
 is to use something like a MergedIterator<SegmentReader> instead of DocMap,
 which doesn't need any memory

 I was trying to get a heads-up on these 2 approaches. Please do let me know
 if I have understood correctly

 --
 Ravi




 On Tue, Jun 17, 2014 at 5:42 PM, Shai Erera ser...@gmail.com wrote:

  
   I am afraid the DocMap still maintains doc-id mappings till merge and I
  am
   trying to avoid it...
  
 
  What do you mean 'till merge'? The method OneMerge.getMergeReaders() is
  called only when the merge is executed, not when the MergePolicy decided
 to
  merge those segments. Therefore the DocMap is initialized only when the
  merge actually executes ... what is there more to postpone?
 
  And besides, if the segments are already sorted, you should return a null
  DocMap, like Lucene code does ...
 
  If I miss your point, I'd appreciate if you can point me to a code
 example,
  preferably in Lucene source, which demonstrates the problem.
 
  Shai
 
 
  On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan 
  ravikumar.govindara...@gmail.com wrote:
 
   I am afraid the DocMap still maintains doc-id mappings till merge and I
  am
   trying to avoid it...
  
   I think lucene itself has a MergeIterator in o.a.l.util package.
  
   A MergePolicy can wrap a simple MergeIterator for iterating docs across
   different AtomicReaders in correct sort-order for a given field/term
  
   That should be fine right?
  
   --
   Ravi
  
   --
   Ravi
  
  
   On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera ser...@gmail.com wrote:
  
loadSortTerm is your method right? In the current Sorter.sort
implementation, I see this code:
   
boolean sorted = true;
 for (int i = 1; i < maxDoc; ++i) {
   if (comparator.compare(i-1, i) > 0) {
sorted = false;
break;
  }
}
if (sorted) {
  return null;
}
   
Perhaps you can write similar code?
   
Also note that the sorting interface has changed, I think in 4.8, and
  now
you don't really need to implement a Sorter, but rather pass a
  SortField,
if that works for you.
   
Shai
   
   
On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:
   
 Shai,

 This is the code snippet I use inside my class...

 public class MySorter extends Sorter {

 @Override

 public DocMap sort(AtomicReader reader) throws IOException {

   final Map<Integer, BytesRef> docVsId = loadSortTerm(reader);

   

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera
That said... if we generate the global DocMap up front, there's no reason
to not execute the merge of the segments more efficiently, i.e. without
wrapping them in a SlowCompositeReaderWrapper.

But that's not work for SortingMergePolicy, it's either a special
SortingAtomicReader which wraps a group of readers + a global DocMap, and
then merge-sorts them more efficiently than how it's done now. Or we tap
into SegmentMerger .. which is way more complicated.

Perhaps it would be worth exploring a SortingMultiSortedAtomicReader which
merge-sorts the postings and other data that way ... Looking at e.g. how
doc-values are merged, I'm not sure it will improve performance. But if you
want to cons up a patch, that'd be awesome!

Shai


On Tue, Jun 17, 2014 at 8:01 PM, Shai Erera ser...@gmail.com wrote:

 OK I think I now understand what you're asking :). It's unrelated though
 to SortingMergePolicy. You propose to do the merge part of a merge-sort,
 since we know the indexes are already sorted, right?

 This is something we've considered in the past, but it is very tricky (see
 below) and we went with the SortingAR for simplicity and speed of coding.
 If however you have an idea how we can easily implement that, that would be
 awesome.

 So let's consider merging the posting lists of f:val from the N readers.
 Say that each returns docs 0-3, and the merged posting will have 4*N
 entries (say we don't have deletes). To properly merge them, you need to
 lookup the sort-value of each document from each reader, and compare
 according to it.

 Now you move on to f:val2 (another posting) and it wants to merge 100
 other docs. So you need to lookup the value of each document, compare by
 it, and merge them. And the process continues ...

 These lookups are expensive and will be done millions of times (each term,
 each DV field, each .. everything).

 More than that, there's a serious issue of correctness, because you never
 make a global sorting decision. So if f:val sees only a single document -
 0, in all segments, you want to map them to 4 GLOBALLY SORTED documents. If
 you make a local decision based on these 4 documents, you will end up w/ a
 completely messed up segment.

 I think the global DocMap is really required. Forget about that that other
 code, e.g. IndexWriter relies on this in order to properly apply incoming
 document deletions and field updates while the segments were merging. It's
 just a matter of correctness - we need to know the global sorted segment
 map.

 Shai


 On Tue, Jun 17, 2014 at 3:41 PM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

 
  Therefore the DocMap is initialized only when the
  merge actually executes ... what is there more to postpone?


 Agreed. However, what I am asking is, if there is an alternative to
 DocMap,
 will that be better? Plz read-on

  And besides, if the segments are already sorted, you should return a
 null DocMap,
  like Lucene code does ...


 What I am trying to say is, my individual segments are sorted. However,
 when a merge combines N individual sorted-segments, there needs to be a
 global sort-order for writing the new segment. Passing null DocMap won't
 work here, no?

 DocMap is one-way of bringing the global order during a merge. Another way
 is to use something like a MergedIterator<SegmentReader> instead of
 DocMap,
 which doesn't need any memory

 I was trying to get a heads-up on these 2 approaches. Please do let me
 know
 if I have understood correctly

 --
 Ravi




 On Tue, Jun 17, 2014 at 5:42 PM, Shai Erera ser...@gmail.com wrote:

  
   I am afraid the DocMap still maintains doc-id mappings till merge and
 I
  am
   trying to avoid it...
  
 
  What do you mean 'till merge'? The method OneMerge.getMergeReaders() is
  called only when the merge is executed, not when the MergePolicy
 decided to
  merge those segments. Therefore the DocMap is initialized only when the
  merge actually executes ... what is there more to postpone?
 
  And besides, if the segments are already sorted, you should return a
 null
  DocMap, like Lucene code does ...
 
  If I miss your point, I'd appreciate if you can point me to a code
 example,
  preferably in Lucene source, which demonstrates the problem.
 
  Shai
 
 
  On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan 
  ravikumar.govindara...@gmail.com wrote:
 
   I am afraid the DocMap still maintains doc-id mappings till merge and
 I
  am
   trying to avoid it...
  
   I think lucene itself has a MergeIterator in o.a.l.util package.
  
   A MergePolicy can wrap a simple MergeIterator for iterating docs
 across
   different AtomicReaders in correct sort-order for a given field/term
  
   That should be fine right?
  
   --
   Ravi
  
   --
   Ravi
  
  
   On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera ser...@gmail.com wrote:
  
loadSortTerm is your method right? In the current Sorter.sort
implementation, I see this code:
   
boolean sorted = true;
for (int i = 1; i < 

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Sandeep Khanzode
Hi,

Thanks for your response. It does sound pretty bad which is why I am not sure 
whether there is an issue with the code, the index, the searcher, or just the 
machine, as you say. 
I will try with another machine just to make sure and post the results.

Meanwhile, can you tell me if there is anything wrong in the below measurement? 
Or is the API usage or the pattern incorrect?

I used a tool called RAMMap to clean the Windows cache. If I do not, the 
results are very fast as I mentioned already. If I do, then the total time is 
40s. 

Can you please provide any pointers on what could be wrong? I will be checking 
on a Linux box anyway.

=
System.out.println("1. Start Date: " + new Date());
TopDocs topDocs = FacetsCollector.search(searcher, query, 100, fc);
System.out.println("1. End Date: " + new Date());
// Above part takes approx 2-12 seconds depending on the query

System.out.println("2. Start Date: " + new Date());
List<FacetResult> results = new ArrayList<FacetResult>();
Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
System.out.println("2. End Date: " + new Date());
// Above part takes approx 40-53 seconds depending on the query for the first 
// time on Windows

System.out.println("3. Start Date: " + new Date());
results.add(facets.getTopChildren(1000, "F1"));
results.add(facets.getTopChildren(1000, "F2"));
results.add(facets.getTopChildren(1000, "F3"));
results.add(facets.getTopChildren(1000, "F4"));
results.add(facets.getTopChildren(1000, "F5"));
results.add(facets.getTopChildren(1000, "F6"));
results.add(facets.getTopChildren(1000, "F7"));
System.out.println("3. End Date: " + new Date());
// Above part takes approx less than 1 second
= 

---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Tuesday, June 17, 2014 10:15 PM, Shai Erera ser...@gmail.com wrote:
 


Hi

40 seconds for faceted search is ... crazy. Also, note how the times don't
differ much even though the number of hits is much higher (29K vs 15.1M)
... That, w/ that you say that subsequent queries are much faster (few
seconds)
 suggests that something is seriously messed up w/ your
environment. Maybe it's a faulty disk? E.g. after the file system cache is
warm, you no longer hit the disk?

In general, the more hits you have, the more expensive is faceted search.
It's also true for scoring as well (i.e. even without facets). There's just
more work to determine the top results (docs, facets...). With facets, you
can use sampling (see RandomSamplingFacetsCollector), but I would do that
only after you verify that collecting 15M docs is very expensive for you,
even when the file system cache is hot.

I've never
 seen those numbers before, therefore it's difficult for me to
relate to them.

There's a caching mechanism for facets, through CachedOrdinalsReader. But I
wouldn't go there until you verify that your IO system is good (try another
machine, OS, disk ...)., and that the 40s times are truly from the faceting
code.

Shai



On Tue, Jun 17, 2014 at 4:21 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi,

 Thanks again!

  This time, I have indexed data with the following specs. I run into > 40
 seconds for the FastTaxonomyFacetCounts to create all the facets. Is this
 as per your measurements? Subsequent runs fare much better probably because
 of the Windows file system cache. How can I speed this up?
 I believe there was a CategoryListCache earlier. Is there any cache or
 other implementation that I can use?

 Secondly, I had a general question. If I extrapolate these numbers for a
 billion documents, my search and facet number may probably be unusable in a
 real time scenario. What are the strategies employed when you deal with
 such large scale? I am new to Lucene so please also direct me to the
 relevant info sources. Thanks!

 Corpus:
 Count: 20M, Size: 51GB

 Index:
 Size (w/o Facets): 19GB, Size
 (w/Facets): 20.12GB
 Creation Time (w/o Facets):
 3.46hrs,
 Creation Time (w/Facets): 3.49hrs

 Search Performance:
                With 29055 hits (5 terms in query):
                Query Execution: 8 seconds
                Facet counts execution: 40-45 seconds

                With 4.22M hits (2 terms in query):
                Query Execution: 3 seconds
                Facet counts execution: 42-46 seconds

                With 15.1M hits (1 term in query):
                Query Execution: 2 seconds
                Facet counts execution: 45-53 seconds

                With 6183 hits (5 different values for the same 5 terms):
  (Without Flushing Windows File Cache on Next
 run)
                Query Execution: 11 seconds
                Facet counts execution: < 1
 second

                With 4.9M hits (1 different value for the 1 term): (Without
 Flushing
 Windows File Cache on Next run)
                Query Execution: 2 seconds
                Facet counts execution: 3 seconds

 

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Shai Erera
Nothing suspicious ... code looks fine. The call to FastTaxoFacetCounts
actually computes the counts ... that's the expensive part of faceted
search.

How big is your taxonomy (number categories)?
Is it hierarchical (i.e. are your dimensions flat, or deep like A/1/2/3/)?
What does your FacetsConfig look like?

Still, well maybe if your taxonomy is huge (hundreds of millions of
categories), I don't think you can intentionally mess up something that
much to end up w/ 40-45s response times!

Shai


On Tue, Jun 17, 2014 at 8:51 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi,

 Thanks for your response. It does sound pretty bad which is why I am not
 sure whether there is an issue with the code, the index, the searcher, or
 just the machine, as you say.
 I will try with another machine just to make sure and post the results.

 Meanwhile, can you tell me if there is anything wrong in the below
 measurement? Or is the API usage or the pattern incorrect?

 I used a tool called RAMMap to clean the Windows cache. If I do not, the
 results are very fast as I mentioned already. If I do, then the total time
 is 40s.

 Can you please provide any pointers on what could be wrong? I will be
 checking on a Linux box anyway.

 =
 System.out.println("1. Start Date: " + new Date());
 TopDocs topDocs = FacetsCollector.search(searcher, query, 100, fc);
 System.out.println("1. End Date: " + new Date());
 // Above part takes approx 2-12 seconds depending on the query

 System.out.println("2. Start Date: " + new Date());
 List<FacetResult> results = new ArrayList<FacetResult>();
 Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
 System.out.println("2. End Date: " + new Date());
 // Above part takes approx 40-53 seconds depending on the query for the
 // first time on Windows

 System.out.println("3. Start Date: " + new Date());
 results.add(facets.getTopChildren(1000, "F1"));
 results.add(facets.getTopChildren(1000, "F2"));
 results.add(facets.getTopChildren(1000, "F3"));
 results.add(facets.getTopChildren(1000, "F4"));
 results.add(facets.getTopChildren(1000, "F5"));
 results.add(facets.getTopChildren(1000, "F6"));
 results.add(facets.getTopChildren(1000, "F7"));
 System.out.println("3. End Date: " + new Date());
 // Above part takes approx less than 1 second
 =

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Tuesday, June 17, 2014 10:15 PM, Shai Erera ser...@gmail.com wrote:



 Hi

 40 seconds for faceted search is ... crazy. Also, note how the times don't
 differ much even though the number of hits is much higher (29K vs 15.1M)
 ... That, w/ that you say that subsequent queries are much faster (few
 seconds)
  suggests that something is seriously messed up w/ your
 environment. Maybe it's a faulty disk? E.g. after the file system cache is
 warm, you no longer hit the disk?

 In general, the more hits you have, the more expensive is faceted search.
 It's also true for scoring as well (i.e. even without facets). There's just
 more work to determine the top results (docs, facets...). With facets, you
 can use sampling (see RandomSamplingFacetsCollector), but I would do that
 only after you verify that collecting 15M docs is very expensive for you,
 even when the file system cache is hot.

 I've never
  seen those numbers before, therefore it's difficult for me to
 relate to them.

 There's a caching mechanism for facets, through CachedOrdinalsReader. But I
 wouldn't go there until you verify that your IO system is good (try another
 machine, OS, disk ...)., and that the 40s times are truly from the faceting
 code.

 Shai



 On Tue, Jun 17, 2014 at 4:21 PM, Sandeep Khanzode 
 sandeep_khanz...@yahoo.com.invalid wrote:

  Hi,
 
  Thanks again!
 
  This time, I have indexed data with the following specs. I run into > 40
  seconds for the FastTaxonomyFacetCounts to create all the facets. Is this
  as per your measurements? Subsequent runs fare much better probably
 because
  of the Windows file system cache. How can I speed this up?
  I believe there was a CategoryListCache earlier. Is there any cache or
  other implementation that I can use?
 
  Secondly, I had a general question. If I extrapolate these numbers for a
  billion documents, my search and facet number may probably be unusable
 in a
  real time scenario. What are the strategies employed when you deal with
  such large scale? I am new to Lucene so please also direct me to the
  relevant info sources. Thanks!
 
  Corpus:
  Count: 20M, Size: 51GB
 
  Index:
  Size (w/o Facets): 19GB, Size
  (w/Facets): 20.12GB
  Creation Time (w/o Facets):
  3.46hrs,
  Creation Time (w/Facets): 3.49hrs
 
  Search Performance:
 With 29055 hits (5 terms in query):
 Query Execution: 8 seconds
 Facet counts execution: 40-45 seconds
 
 With 4.22M hits (2 terms in query):
 

Lucene QueryParser/Analyzer inconsistency

2014-06-17 Thread Luis Pureza
Hi,

I'm experience a puzzling behaviour with the QueryParser and was hoping
someone around here can help me.

I have a very simple Analyzer that tries to replace forward slashes (/) with
spaces. Because QueryParser forces me to escape strings with slashes before
parsing, I added a MappingCharFilter to the analyzer that replaces "\/"
with a single space. The analyzer is defined as follows:

@Override
protected TokenStreamComponents createComponents(String field, Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\\/", " ");
    Reader mappingFilter = new MappingCharFilter(builder.build(), in);

    Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
    return new TokenStreamComponents(tokenizer);
}

Then I use this analyzer in the QueryParser to parse a string with slashes:

String text = QueryParser.escape("one/two");
QueryParser parser = new QueryParser(Version.LUCENE_48, "f", new
MyAnalyzer(Version.LUCENE_48));
System.err.println(parser.parse(text));

The expected output would be

f:one f:two

However, I get:

f:one/two

The puzzling thing is that when I debug the analyzer, it tokenizes the
input string correctly, returning two tokens instead of one.

What is going on?

Many thanks,

Luís Pureza

P.S.: I was able to fix this issue temporarily by creating my own tokenizer
that tokenizes on whitespace and slashes. However, I still don't understand
what's going on.
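
For reference, the workaround tokenizer mentioned above is roughly the following (a sketch of what I did, not necessarily the cleanest solution; in is the Reader handed to createComponents):

// Sketch: tokenizer that splits on whitespace and '/' directly, sidestepping the char filter.
Tokenizer tokenizer = new CharTokenizer(Version.LUCENE_48, in) {
    @Override
    protected boolean isTokenChar(int c) {
        return !Character.isWhitespace(c) && c != '/';
    }
};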


Re: Facets in Lucene 4.7.2

2014-06-17 Thread Sandeep Khanzode
If I am counting correctly, the $facets field in the index shows a count of 
approx. 28k. That does not sound like much, I guess. All my facets are flat and 
the FacetsConfig only defines a couple of them to be multi-valued.

Let me know if I am not counting the taxonomy size correctly. The 
taxoReader.getSize() also shows this count.

I will check on a Linux box to make sure. Thanks,
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Tuesday, June 17, 2014 11:28 PM, Shai Erera ser...@gmail.com wrote:
 


Nothing suspicious ... code looks fine. The call to FastTaxoFacetCounts
actually computes the counts ... that's the expensive part of faceted
search.

How big is your taxonomy (number categories)?
Is it hierarchical (i.e. are your dimensions flat, or deep like A/1/2/3/)?
What does your FacetsConfig look like?

Still, well maybe if your taxonomy is huge (hundreds of millions of
categories), I don't think you can intentionally mess up something that
much to end up w/ 40-45s response times!

Shai


On Tue, Jun 17, 2014 at 8:51 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi,

 Thanks for your response. It does sound pretty bad which is why I am not
 sure whether there is an issue with the code, the index, the searcher, or
 just the machine, as you say.
 I will try with another machine just to make sure and post the results.

 Meanwhile, can you tell me if there is anything wrong in the below
 measurement? Or is the API usage or the pattern incorrect?

 I used a tool called RAMMap to clean the Windows cache. If I do not, the
 results are very fast as I mentioned already. If I do, then the total time
 is 40s.

 Can you please provide any pointers on what could be wrong? I will be
 checking on a Linux box anyway.

 =
 System.out.println("1. Start Date: " + new Date());
 TopDocs topDocs = FacetsCollector.search(searcher, query, 100, fc);
 System.out.println("1. End Date: " + new Date());
 // Above part takes approx 2-12 seconds depending on the query

 System.out.println("2. Start Date: " + new Date());
 List<FacetResult> results = new ArrayList<FacetResult>();
 Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
 System.out.println("2. End Date: " + new Date());
 // Above part takes approx 40-53 seconds depending on the query for the
 // first time on Windows

 System.out.println("3. Start Date: " + new Date());
 results.add(facets.getTopChildren(1000, "F1"));
 results.add(facets.getTopChildren(1000, "F2"));
 results.add(facets.getTopChildren(1000, "F3"));
 results.add(facets.getTopChildren(1000, "F4"));
 results.add(facets.getTopChildren(1000, "F5"));
 results.add(facets.getTopChildren(1000, "F6"));
 results.add(facets.getTopChildren(1000, "F7"));
 System.out.println("3. End Date: " + new Date());
 // Above part takes approx less than 1 second
 =

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Tuesday, June 17, 2014 10:15 PM, Shai Erera ser...@gmail.com wrote:



 Hi

 40 seconds for faceted search is ... crazy. Also, note how the times don't
 differ much even though the number of hits is much higher (29K vs 15.1M)
 ... That, together with the fact that you say subsequent queries are much
 faster (a few seconds), suggests that something is seriously messed up w/ your
 environment. Maybe it's a faulty disk? E.g. after the file system cache is
 warm, you no longer hit the disk?

 In general, the more hits you have, the more expensive is faceted search.
 It's also true for scoring as well (i.e. even without facets). There's just
 more work to determine the top results (docs, facets...). With facets, you
 can use sampling (see RandomSamplingFacetsCollector), but I would do that
 only after you verify that collecting 15M docs is very expensive for you,
 even when the file system cache is hot.

 I've never seen those numbers before, therefore it's difficult for me to
 relate to them.

 There's a caching mechanism for facets, through CachedOrdinalsReader. But I
 wouldn't go there until you verify that your IO system is good (try another
 machine, OS, disk ...), and that the 40s times are truly from the faceting
 code.

 Shai
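
As an aside, a rough sketch of the sampling suggestion above (RandomSamplingFacetsCollector), assuming the same searcher, query, config and taxoReader as in the measurement snippet, and an arbitrary sample size:

// Illustrative sketch only: count facets over a random sample of the matches
// instead of over all hits. The sample size (10000) is arbitrary.
RandomSamplingFacetsCollector sampler = new RandomSamplingFacetsCollector(10000);
TopDocs topDocs = FacetsCollector.search(searcher, query, 100, sampler);
Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, sampler);
FacetResult topF1 = facets.getTopChildren(1000, "F1");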



 On Tue, Jun 17, 2014 at 4:21 PM, Sandeep Khanzode 
 sandeep_khanz...@yahoo.com.invalid wrote:

  Hi,
 
  Thanks again!
 
  This time, I have indexed data with the following specs. I run into > 40
  seconds for the FastTaxonomyFacetCounts to create all the facets. Is this
  as per your measurements? Subsequent runs fare much better, probably because
  of the Windows file system cache. How can I speed this up?
  I believe there was a CategoryListCache earlier. Is there any cache or
  other implementation that I can use?
 
  Secondly, I had a general question. If I extrapolate these numbers for a
  billion documents, my search and facet numbers will probably be unusable in a
  real-time scenario. What are the strategies employed when you deal 

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Shai Erera
You can get the size of the taxonomy by calling taxoReader.getSize(). What
does the 28K of the $facets field denote - the number of terms
(drill-down)? If so, that sounds like your taxonomy is of that size.

And indeed, this is a tiny taxonomy ...
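
(As a quick sanity check, both numbers can be printed side by side; a rough sketch, assuming the taxonomy and index Directory instances are already open as taxoDir and indexDir and that the default "$facets" index field is used:)

// Illustrative sketch, not code from the thread: compare the taxonomy size
// with the number of unique drill-down terms in the facets field.
TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir);
System.out.println("Taxonomy size (incl. root): " + taxoReader.getSize());

DirectoryReader indexReader = DirectoryReader.open(indexDir);
Terms facetTerms = MultiFields.getTerms(indexReader, FacetsConfig.DEFAULT_INDEX_FIELD_NAME);
// Terms.size() may return -1 if the codec does not store the term count
System.out.println("Unique $facets terms: " + (facetTerms == null ? 0 : facetTerms.size()));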

How many facets do you record per document? This also affects the amount of
IO that's done during search, as we traverse the BinaryDocValues field,
reading the categories of each document.

Shai


On Tue, Jun 17, 2014 at 9:32 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 If I am counting correctly, the $facets field in the index shows a count
 of approx. 28k. That does not sound like much, I guess. All my facets are
 flat and the FacetsConfig only defines a couple of them to be multi-valued.

 Let me know if I am not counting the taxonomy size correctly. The
 taxoReader.getSize() also shows this count.

 I will check on a Linux box to make sure. Thanks,

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Tuesday, June 17, 2014 11:28 PM, Shai Erera ser...@gmail.com wrote:



 Nothing suspicious ... code looks fine. The call to FastTaxonomyFacetCounts
 actually computes the counts ... that's the expensive part of faceted
 search.

 How big is your taxonomy (number categories)?
 Is it hierarchical (i.e. are your dimensions flat, or deep like A/1/2/3/)?
 What does your FacetsConfig look like?

 Still, unless your taxonomy is huge (hundreds of millions of categories), I
 don't think you could mess something up badly enough to end up w/ 40-45s
 response times!

 Shai


 On Tue, Jun 17, 2014 at 8:51 PM, Sandeep Khanzode 
 sandeep_khanz...@yahoo.com.invalid wrote:

  Hi,
 
  Thanks for your response. It does sound pretty bad which is why I am not
  sure whether there is an issue with the code, the index, the searcher, or
  just the machine, as you say.
  I will try with another machine just to make sure and post the results.
 
  Meanwhile, can you tell me if there is anything wrong in the below
  measurement? Or is the API usage or the pattern incorrect?
 
  I used a tool called RAMMap to clean the Windows cache. If I do not, the
  results are very fast as I mentioned already. If I do, then the total time
  is 40s.
 
  Can you please provide any pointers on what could be wrong? I will be
  checking on a Linux box anyway.
 
  =
  System.out.println("1. Start Date: " + new Date());
  TopDocs topDocs = FacetsCollector.search(searcher, query, 100, fc);
  System.out.println("1. End Date: " + new Date());
  // Above part takes approx 2-12 seconds depending on the query

  System.out.println("2. Start Date: " + new Date());
  List<FacetResult> results = new ArrayList<FacetResult>();
  Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
  System.out.println("2. End Date: " + new Date());
  // Above part takes approx 40-53 seconds depending on the query for the
  // first time on Windows

  System.out.println("3. Start Date: " + new Date());
  results.add(facets.getTopChildren(1000, "F1"));
  results.add(facets.getTopChildren(1000, "F2"));
  results.add(facets.getTopChildren(1000, "F3"));
  results.add(facets.getTopChildren(1000, "F4"));
  results.add(facets.getTopChildren(1000, "F5"));
  results.add(facets.getTopChildren(1000, "F6"));
  results.add(facets.getTopChildren(1000, "F7"));
  System.out.println("3. End Date: " + new Date());
  // Above part takes approx less than 1 second
  =
 
  ---
  Thanks n Regards,
  Sandeep Ramesh Khanzode
 
 
  On Tuesday, June 17, 2014 10:15 PM, Shai Erera ser...@gmail.com wrote:
 
 
 
  Hi
 
  40 seconds for faceted search is ... crazy. Also, note how the times don't
  differ much even though the number of hits is much higher (29K vs 15.1M)
  ... That, together with the fact that you say subsequent queries are much
  faster (a few seconds), suggests that something is seriously messed up w/ your
  environment. Maybe it's a faulty disk? E.g. after the file system cache is
  warm, you no longer hit the disk?
 
  In general, the more hits you have, the more expensive is faceted search.
  It's also true for scoring as well (i.e. even without facets). There's just
  more work to determine the top results (docs, facets...). With facets, you
  can use sampling (see RandomSamplingFacetsCollector), but I would do that
  only after you verify that collecting 15M docs is very expensive for you,
  even when the file system cache is hot.
 
  I've never seen those numbers before, therefore it's difficult for me to
  relate to them.
 
  There's a caching mechanism for facets, through CachedOrdinalsReader. But I
  wouldn't go there until you verify that your IO system is good (try another
  machine, OS, disk ...), and that the 40s times are truly from the faceting
  code.
 
  Shai
 
 
 
  On Tue, Jun 17, 2014 at 4:21 PM, Sandeep Khanzode 
  sandeep_khanz...@yahoo.com.invalid wrote:
 
   Hi,

Re: Lucene QueryParser/Analyzer inconsistency

2014-06-17 Thread Jack Krupansky

Yeah, this is kind of tricky and confusing! Here's what happens:

1. The query parser parses the input string into individual source terms, 
each delimited by white space. The escape is removed in this process, but... 
no analyzer has been called at this stage.


2. The query parser (generator) calls the analyzer for each source term.
Your analyzer is called at this stage, but... the escape is already gone,
so... the backslash + slash ("\\/") mapping rule is not triggered, leaving the
slash recorded in the source term from step 1.


You do need the backslash in your original query because a slash introduces 
a regex query term. It is added by the escape method you call, but the 
escaping will be gone by the time your analyzer is called.


So, just try a simple, unescaped slash in your char mapping table.
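
For example, a minimal sketch assuming the same analyzer shape as in the original post; only the mapping rule changes:

@Override
protected TokenStreamComponents createComponents(String field, Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    // Map a plain slash: by the time the analyzer runs, the parser has already
    // removed the escaping backslash, so a "\\/" rule would never match.
    builder.add("/", " ");
    Reader mappingFilter = new MappingCharFilter(builder.build(), in);

    Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
    return new TokenStreamComponents(tokenizer);
}

With that change, QueryParser.escape("one/two") still keeps the parser from treating the slash as a regex delimiter, and the parsed query should come out as f:one f:two.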

-- Jack Krupansky

-Original Message- 
From: Luis Pureza

Sent: Tuesday, June 17, 2014 1:43 PM
To: java-user@lucene.apache.org
Subject: Lucene QueryParser/Analyzer inconsistency

Hi,

I'm experiencing a puzzling behaviour with the QueryParser and was hoping
someone around here can help me.

I have a very simple Analyzer that tries to replace forward slashes (/) by
spaces. Because QueryParser forces me to escape strings with slashes before
parsing, I added a MappingCharFilter to the analyzer that replaces "\/"
with a single space. The analyzer is defined as follows:

@Override
protected TokenStreamComponents createComponents(String field, Reader in) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\\/", " ");
    Reader mappingFilter = new MappingCharFilter(builder.build(), in);

    Tokenizer tokenizer = new WhitespaceTokenizer(version, mappingFilter);
    return new TokenStreamComponents(tokenizer);
}

Then I use this analyzer in the QueryParser to parse a string with slashes:

String text = QueryParser.escape("one/two");
QueryParser parser = new QueryParser(Version.LUCENE_48, "f", new
MyAnalyzer(Version.LUCENE_48));
System.err.println(parser.parse(text));

The expected output would be

f:one f:two

However, I get:

f:one/two

The puzzling thing is that when I debug the analyzer, it tokenizes the
input string correctly, returning two tokens instead of one.

What is going on?

Many thanks,

Luís Pureza

P.S.: I was able to fix this issue temporarily by creating my own tokenizer
that tokenizes on whitespace and slashes. However, I still don't understand
what's going on. 


