Re: Facets in Lucene 4.7.2

2014-06-17 Thread Sandeep Khanzode
Hi,

Thanks again!

This time, I have indexed data with the following specs. FastTaxonomyFacetCounts 
takes 40 seconds or more to create all the facets. Is this in line with your 
measurements? Subsequent runs fare much better, probably because of the 
Windows file system cache. How can I speed this up? 
I believe there was a CategoryListCache earlier. Is there any cache or other 
implementation that I can use?

Secondly, I had a general question. If I extrapolate these numbers to a 
billion documents, my search and facet times would probably be unusable in a 
real-time scenario. What strategies are employed when you deal with such 
large scale? I am new to Lucene, so please also direct me to the relevant info 
sources. Thanks!
 
Corpus:
   Count: 20M
   Size: 51GB

Index:
   Size (w/o Facets): 19GB
   Size (w/Facets): 20.12GB
   Creation Time (w/o Facets): 3.46hrs
   Creation Time (w/Facets): 3.49hrs

Search Performance:
   With 29055 hits (5 terms in query):
   Query Execution: 8 seconds
   Facet counts execution: 40-45 seconds

   With 4.22M hits (2 terms in query):
   Query Execution: 3 seconds
   Facet counts execution: 42-46 seconds

   With 15.1M hits (1 term in query):
   Query Execution: 2 seconds
   Facet counts execution: 45-53 seconds

   With 6183 hits (5 different values for the same 5 terms):
   (Without Flushing Windows File Cache on Next run)
   Query Execution: 11 seconds
   Facet counts execution: 1 second

   With 4.9M hits (1 different value for the 1 term):
   (Without Flushing Windows File Cache on Next run)
   Query Execution: 2 seconds
   Facet counts execution: 3 seconds

---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Monday, June 16, 2014 8:11 PM, Shai Erera ser...@gmail.com wrote:
 


Hi

1.] Is there any API that gives me the count of a specific dimension from
 FacetsCollector in response to a search query? Currently, I use
 getTopChildren() with some value and then check the FacetResult object for
 the actual number of dimensions hit along with their occurrences. Also,
 getSpecificValue() does not work without a path attribute to the API.


To get the value of the dimension itself, you should call getTopChildren(1,
dim). Note that getSpecificValue does not allow passing only the dimension,
and getTopChildren requires topN to be > 0. Passing 1 is a hack, but I'm
not sure we should specifically support getting the aggregated value of
just the dimension ... once you get that, FacetResult.value tells you
the aggregated count.

2.] Can I find the MAX or MIN value of a Numeric type field written to the
 index?


Depends on how you index them. If you index the field as a numeric field (e.g.
LongField), I believe you can use NumericUtils.getMaxLong. If it's a
DocValues field, I don't know of a built-in function that does it, but this
thread has demo code:
http://www.gossamer-threads.com/lists/lucene/java-user/195594.

3.] I am trying to compare and contrast Lucene Facets with Elastic Search.
 I could determine that ES does search-time faceting and dynamically returns
 the response without any prior faceting during indexing time. If index-time
 lag is not my concern, can I assume that, in general, performance-wise
 Lucene facets would be faster?


I will start by saying that I don't know much about how ES facets work. We
have some committers who know how both Lucene and ES facets work, so they
can comment on that. But I personally don't think there is such a thing as
faceting with no index-time decision. Well .. not unless you're faceting on
arbitrary terms. Otherwise, you already make decisions such as indexing the
field as not tokenized/analyzed/lowercased/doc-values etc.

Note that Lucene facets also support a non-taxonomy-based faceting option,
using DocValues fields. Look at SortedSetDocValuesFacetField. This too
can be perceived as an index-time decision though... And there are some
built-in dynamic faceting capabilities too, like range facets
(LongRangeFacetCounts), which can work on any NumericDocValuesField, as
well as any ValueSource (such as Expressions).

I cannot compare ES facets to Lucene's in terms of performance, as I
haven't benchmarked them yet.

4.] I index a semi-large-ish corpus of 20M files across 50GB. If I do not
 use IndexWriter.commit(), I get standard files like cfe/cfs/si in the index
 directory. However, if I do use the commit(), then as I understand it, the
 state is persisted to the disk. But this time, there are additional file
 extensions like doc/pos/tim/tip/dvd/dvm, etc. I am not sure about this
 difference and its cause.


The information of the doc/tim/tip etc. is buffered in memory (controlled
by ramBufferSizeMB) and when they are flushed (on commit or when the RAM
buffer fills up), those files materialize on disk. When you call 

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Shai Erera
Hi

40 seconds for faceted search is ... crazy. Also, note how the times don't
differ much even though the number of hits is much higher (29K vs 15.1M)
... That, together with what you say about subsequent queries being much
faster (a few seconds), suggests that something is seriously messed up w/ your
environment. Maybe it's a faulty disk? E.g. after the file system cache is
warm, you no longer hit the disk?

In general, the more hits you have, the more expensive faceted search is.
That's true for scoring as well (i.e. even without facets). There's just
more work to determine the top results (docs, facets...). With facets, you
can use sampling (see RandomSamplingFacetsCollector), but I would do that
only after you verify that collecting 15M docs is very expensive for you,
even when the file system cache is hot.
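
The idea behind sampling can be sketched in plain Java: count categories over a random subset of the hits and scale each count by the inverse of the sampling rate. This only illustrates the principle and is not how RandomSamplingFacetsCollector is implemented; the class SampledCounts, its parameters, and the one-category-per-hit input are assumptions made up for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Illustration only (hypothetical class): estimates category counts from a random sample of hits. */
final class SampledCounts {

    /**
     * @param hitCategories one category label per matching document (a simplifying assumption)
     * @param rate          fraction of hits to inspect, in (0, 1]
     * @param seed          seed so that the sample is reproducible
     * @return estimated count per category, scaled back up by 1 / rate
     */
    static Map<String, Long> estimate(String[] hitCategories, double rate, long seed) {
        if (rate <= 0.0 || rate > 1.0) {
            throw new IllegalArgumentException("rate must be in (0, 1]");
        }
        Random random = new Random(seed);
        Map<String, Long> sampled = new HashMap<>();
        for (String category : hitCategories) {
            // nextDouble() is in [0, 1), so a rate of 1.0 keeps every hit.
            if (random.nextDouble() < rate) {
                sampled.merge(category, 1L, Long::sum);
            }
        }
        Map<String, Long> estimated = new HashMap<>();
        for (Map.Entry<String, Long> entry : sampled.entrySet()) {
            estimated.put(entry.getKey(), Math.round(entry.getValue() / rate));
        }
        return estimated;
    }
}
```

With a rate of 1.0 the result is exact; lower rates trade accuracy for less per-hit work, which is why it only makes sense once per-hit counting is confirmed to be the bottleneck.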

I've never seen those numbers before, therefore it's difficult for me to
relate to them.

There's a caching mechanism for facets, through CachedOrdinalsReader. But I
wouldn't go there until you verify that your IO system is good (try another
machine, OS, disk ...), and that the 40s times truly come from the faceting
code.

Shai


Re: Facets in Lucene 4.7.2

2014-06-17 Thread Sandeep Khanzode
Hi,

Thanks for your response. It does sound pretty bad which is why I am not sure 
whether there is an issue with the code, the index, the searcher, or just the 
machine, as you say. 
I will try with another machine just to make sure and post the results.

Meanwhile, can you tell me if there is anything wrong in the below measurement? 
Or is the API usage or the pattern incorrect?

I used a tool called RAMMap to clean the Windows cache. If I do not, the 
results are very fast as I mentioned already. If I do, then the total time is 
40s. 

Can you please provide any pointers on what could be wrong? I will be checking 
on a Linux box anyway.

=
System.out.println("1. Start Date: " + new Date());
TopDocs topDocs = FacetsCollector.search(searcher, query, 100, fc);
System.out.println("1. End Date: " + new Date());
// Above part takes approx 2-12 seconds depending on the query

System.out.println("2. Start Date: " + new Date());
List<FacetResult> results = new ArrayList<FacetResult>();
Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
System.out.println("2. End Date: " + new Date());
// Above part takes approx 40-53 seconds depending on the query for the first
// time on Windows

System.out.println("3. Start Date: " + new Date());
results.add(facets.getTopChildren(1000, "F1"));
results.add(facets.getTopChildren(1000, "F2"));
results.add(facets.getTopChildren(1000, "F3"));
results.add(facets.getTopChildren(1000, "F4"));
results.add(facets.getTopChildren(1000, "F5"));
results.add(facets.getTopChildren(1000, "F6"));
results.add(facets.getTopChildren(1000, "F7"));
System.out.println("3. End Date: " + new Date());
// Above part takes approx less than 1 second
= 
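
The three phases above are timed by printing new Date(), which has to be subtracted by hand. A minimal sketch of a small helper that measures each phase with System.nanoTime() might look like the following; PhaseTimer and its Phase interface are hypothetical names invented for this example and are not part of Lucene, and the workloads in main are stand-ins for the real search, counting and getTopChildren calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not a Lucene class): records wall-clock time per named phase. */
final class PhaseTimer {

    /** A unit of work that may throw a checked exception, e.g. IOException. */
    interface Phase<T> {
        T run() throws Exception;
    }

    // Insertion order is kept so phases print in the order they ran.
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    /** Runs the phase, adds its duration to the phase's total, and returns its result. */
    <T> T time(String phase, Phase<T> body) {
        long start = System.nanoTime();
        try {
            return body.run();
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("phase failed: " + phase, e);
        } finally {
            elapsedNanos.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total milliseconds recorded for the phase, or 0 if it never ran. */
    long millis(String phase) {
        Long nanos = elapsedNanos.get(phase);
        return nanos == null ? 0L : TimeUnit.NANOSECONDS.toMillis(nanos);
    }

    Iterable<String> phases() {
        return elapsedNanos.keySet();
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        // Stand-in workloads; real code would wrap the three calls being measured.
        timer.time("1. search", () -> { Thread.sleep(5); return "topDocs"; });
        timer.time("2. count facets", () -> { Thread.sleep(20); return "facets"; });
        timer.time("3. top children", () -> { Thread.sleep(1); return "results"; });
        for (String phase : timer.phases()) {
            System.out.println(phase + ": " + timer.millis(phase) + " ms");
        }
    }
}
```

Running each query several times with such a timer, once with a cold and once with a warm file system cache, would make it easier to separate disk IO from the counting work itself.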

---
Thanks n Regards,
Sandeep Ramesh Khanzode


Re: Facets in Lucene 4.7.2

2014-06-17 Thread Shai Erera
Nothing suspicious ... code looks fine. The call to FastTaxonomyFacetCounts
actually computes the counts ... that's the expensive part of faceted
search.

How big is your taxonomy (number of categories)?
Is it hierarchical (i.e. are your dimensions flat, or deep like A/1/2/3/)?
What does your FacetsConfig look like?

Still, except maybe if your taxonomy is huge (hundreds of millions of
categories), I don't think you could mess something up that badly, even
intentionally, to end up w/ 40-45s response times!

Shai


Re: Facets in Lucene 4.7.2

2014-06-17 Thread Sandeep Khanzode
If I am counting correctly, the $facets field in the index shows a count of 
approx. 28k. That does not sound like much, I guess. All my facets are flat and 
the FacetsConfig only defines a couple of them to be multi-valued.

Let me know if I am not counting the taxonomy size correctly. The 
taxoReader.getSize() also shows this count.

I will check on a Linux box to make sure. Thanks,
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode


Re: Facets in Lucene 4.7.2

2014-06-17 Thread Shai Erera
You can get the size of the taxonomy by calling taxoReader.getSize(). What
does the 28K of the $facets field denote - the number of terms
(drill-down)? If so, that sounds like your taxonomy is of that size.

And indeed, this is a tiny taxonomy ...

How many facets do you record per document? This also affects the amount of
IO that's done during search, as we traverse the BinaryDocValues field,
reading the categories of each document.
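
The traversal described above can be pictured as a nested loop: for every hit, read that document's category ordinals and bump one counter per ordinal. In the sketch below, a plain int[][] stands in for the ordinals decoded from the BinaryDocValues field; OrdinalCounter and its inputs are hypothetical and only meant to show why the work grows with both the number of hits and the number of facets recorded per document.

```java
/** Illustration only (hypothetical class): counts category ordinals over the matching documents. */
final class OrdinalCounter {

    /**
     * @param ordinalsPerDoc category ordinals of each document, indexed by doc id
     *                       (a stand-in for values decoded from doc values)
     * @param hitDocIds      doc ids that matched the query
     * @param taxonomySize   number of categories in the taxonomy
     * @return counts[ord] = number of hits tagged with ordinal ord
     */
    static int[] count(int[][] ordinalsPerDoc, int[] hitDocIds, int taxonomySize) {
        int[] counts = new int[taxonomySize];
        for (int docId : hitDocIds) {
            // One read per hit; its cost grows with the facets stored on the document.
            for (int ord : ordinalsPerDoc[docId]) {
                counts[ord]++;
            }
        }
        return counts;
    }
}
```

With a taxonomy of roughly 28K categories the counts array itself is small, so in this picture the dominant cost is reading the per-document ordinals for every hit.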

Shai


Re: Facets in Lucene 4.7.2

2014-06-16 Thread Sandeep Khanzode
Hi Shai,

Thanks for the response. Appreciated! I understand that this particular use 
case has to be handled in a different way.

Can you please help me with the below questions? 

1.] Is there any API that gives me the count of a specific dimension from 
FacetsCollector in response to a search query? Currently, I use 
getTopChildren() with some value and then check the FacetResult object for the 
actual number of dimensions hit along with their occurrences. Also, 
getSpecificValue() does not work without a path attribute to the API.

2.] Can I find the MAX or MIN value of a Numeric type field written to the 
index?

3.] I am trying to compare and contrast Lucene Facets with Elastic Search. I 
could determine that ES does search-time faceting and dynamically returns the 
response without any prior faceting during indexing time. If index-time lag is 
not my concern, can I assume that, in general, performance-wise Lucene facets 
would be faster?

4.] I index a semi-large-ish corpus of 20M files across 50GB. If I do not use 
IndexWriter.commit(), I get standard files like cfe/cfs/si in the index 
directory. However, if I do use the commit(), then as I understand it, the 
state is persisted to the disk. But this time, there are additional file 
extensions like doc/pos/tim/tip/dvd/dvm, etc. I am not sure about this 
difference and its cause. 

5.] Does the RAMBufferSizeMB() control the commit intervals, so that when the 
limit is reached across all writing threads, the contents are flushed to disk 
periodically?

Appreciate your response to the above queries. Thanks again,

 
---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Sunday, June 15, 2014 10:40 AM, Shai Erera ser...@gmail.com wrote:
 


Hi

Currently there's no way to add e.g. terms to already indexed documents;
you have to re-index them. The only updatable field types Lucene currently
offers are DocValues fields. If the list of markers/flags is fixed in
your case, and you can map them to an integer, I think you could use a
NumericDocValues field, which supports field-level updates.

Once you do that, you can then:

* Count on this field pretty easily. You will need to write a Facets
implementation, but otherwise it's very easy.

* Filter queries: you will need to write a Filter which returns a DocIdSet
of the documents that belong to one category (e.g. Financially Relevant).
Here you might want to consider caching the result of the Filter, by using
CachingWrapperFilter.
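
One possible way to map a fixed list of markers to an integer is a bitmask, with one bit per marker, so that a single long value per document can carry several flags. The sketch below assumes that encoding; MarkerCodec, the Marker enum and its constants are hypothetical names chosen for this example, and the counting loop runs over a plain long[] rather than a real NumericDocValues field.

```java
import java.util.EnumSet;

/** Illustration only (hypothetical class): packs a fixed set of markers into one long per document. */
final class MarkerCodec {

    /** Example markers; the real list would come from the application. */
    enum Marker { HISTORICALLY_INTERESTING, FINANCIALLY_RELEVANT, MEDIA, AGRICULTURE }

    /** Sets one bit per marker, using the enum ordinal as the bit position. */
    static long encode(EnumSet<Marker> markers) {
        long bits = 0L;
        for (Marker marker : markers) {
            bits |= 1L << marker.ordinal();
        }
        return bits;
    }

    /** Inverse of encode: recovers the marker set from the stored value. */
    static EnumSet<Marker> decode(long bits) {
        EnumSet<Marker> markers = EnumSet.noneOf(Marker.class);
        for (Marker marker : Marker.values()) {
            if ((bits & (1L << marker.ordinal())) != 0) {
                markers.add(marker);
            }
        }
        return markers;
    }

    /** Counts, per marker ordinal, how many of the given per-document values carry that marker. */
    static int[] countByMarker(long[] perDocValues) {
        int[] counts = new int[Marker.values().length];
        for (long bits : perDocValues) {
            for (Marker marker : Marker.values()) {
                if ((bits & (1L << marker.ordinal())) != 0) {
                    counts[marker.ordinal()]++;
                }
            }
        }
        return counts;
    }
}
```

A custom Facets implementation or Filter would apply the same bit tests to the values read for each matching document; note that this encoding supports at most 64 markers.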

It's not the best approach; updatable Terms would better suit your use case.
However, we don't offer them yet and it will be a while until we do (IF we
do). You should also benchmark that approach vs re-indexing the
documents, since the current implementation of updatable doc-values fields
isn't optimized for a few document updates between index reopens. See here:
http://shaierera.blogspot.com/2014/04/benchmarking-updatable-docvalues.html

Shai



On Fri, Jun 13, 2014 at 10:19 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi Shai,

 Thanks so much for the clear explanation.

 I agree on the first question. Taxonomy Writer with a separate index would
 probably be my approach too.

 For the second question:
 I am a little new to the Facets API so I will try to figure out the
 approach that you outlined below.

 However, the scenario is such: Assume a document corpus that is indexed.
 For a user query, a document is returned and selected by the user for
 editing as part of some use case/workflow. That document is now marked as
 either historically interesting or not, financially relevant, specific to
 media or entertainment domain, etc. by the user. So, essentially the user
 is flagging the document with certain markers.
 Another set of users could possibly want to query on these markers. So,
 lets say, a second user comes along, and wants to see the top documents
 belonging to one category, say, agriculture or farming. Since these markers
 are run time activities, how can I use the facets on them? So, I was
 envisioning facets as the various markers. But, if I constantly re-index or
 update the documents whenever a marker changes, I believe it would not be
 very efficient.

 Is there anything, facets or otherwise, in Lucene that can help me solve
 this use case?

 Please let me know. And, thanks!

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Friday, June 13, 2014 9:51 PM, Shai Erera ser...@gmail.com wrote:



 Hi

 You can check the demo code here:

 https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/
 .
 This code is updated with each release, so you always get a working code
 examples, even when the API changes.

 If you don't mind managing the sidecar index, which I agree isn't such a
 big deal, then yes - the taxonomy index currently performs the fastest. I
 plan to explore porting the taxonomy-based approach from BinaryDocValues to
 the 

Re: Facets in Lucene 4.7.2

2014-06-16 Thread Sandeep Khanzode
Correction on [4] below. I do get doc/pos/tim/tip/dvd/dvm files in either case. 
What I meant was that the number of those files appears to differ between the two cases. 
Also, does commit() stop the world and behave serially to flush the contents?
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Monday, June 16, 2014 7:10 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.INVALID wrote:
 


Hi Shai,

Thanks for the response. Appreciated! I understand that this particular use 
case has to be handled in a different way.

Can you please help me with the below questions? 

1.] Is there any API that gives me the count of a specific dimension from 
FacetCollector in response to a search query? Currently, I use the 
getTopChildren() with some value and then check the FacetResult object for the 
actual number of dimensions hit along with their occurrences. Also, the 
getSpecificValue() does not work without a path attribute to the API.

2.] Can I find the MAX or MIN value of a Numeric type field written to the 
index?

3.] I am trying to compare and contrast Lucene Facets with Elastic Search. I 
could determine that ES does search time faceting and dynamically returns the 
response without any prior faceting during indexing time. If index time lag is 
not my concern, can I assume that, in general, performance-wise Lucene facets 
would be faster?

4.] I index a semi-large-ish corpus of 20M files across 50GB. If I do not use 
IndexWriter.commit(), I get standard files like cfe/cfs/si in the index 
directory. However, if I do use the commit(), then as I understand it, the 
state is persisted to the disk. But this time, there are additional file 
extensions like doc/pos/tim/tip/dvd/dvm, etc. I am not sure about this 
difference and its cause. 

5.] Does the RAMBufferSizeMB() control the commit intervals, so that when the 
limit is reached across all writing threads, the contents are flushed to disk 
periodically?

Appreciate your response to the above queries. Thanks again,

 
---
Thanks n Regards,
Sandeep Ramesh Khanzode




Re: Facets in Lucene 4.7.2

2014-06-14 Thread Shai Erera
Hi

Currently there's no way to add e.g. terms to already indexed documents;
you have to re-index them. The only updatable field types Lucene currently
offers are DocValues fields. If the list of markers/flags is fixed in
your case, and you can map them to an integer, I think you could use a
NumericDocValues field, which supports field-level updates.

Once you do that, you can then:

* Count on this field pretty easily. You will need to write a Facets
implementation, but otherwise it's very easy.

* Filter queries: you will need to write a Filter which returns a DocIdSet
of the documents that belong to one category (e.g. Financially Relevant).
Here you might want to consider caching the result of the Filter, by using
CachingWrapperFilter.
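
As a rough illustration of the two steps above, here is a plain-Java sketch. It is not Lucene code: the array stands in for the per-document numeric field, the BitSet stands in for a DocIdSet, and the small map stands in for what CachingWrapperFilter would do. The Marker enum, the MarkerIndex class and all method names are hypothetical, and packing several markers into one value as a bitmask is an assumption about how the markers would be mapped to an integer.

```java
import java.util.BitSet;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, fixed list of markers; each constant's ordinal is its bit position.
enum Marker { HISTORICAL, FINANCIAL, MEDIA, AGRICULTURE }

// Plain-Java stand-in for a per-document numeric field: flags[docId] is a bitmask.
final class MarkerIndex {
    private final long[] flags;
    private final Map<Marker, BitSet> filterCache = new EnumMap<>(Marker.class);

    MarkerIndex(int maxDoc) {
        flags = new long[maxDoc];
    }

    // Field-level update: only this document's value changes, nothing is re-indexed.
    void setMarker(int docId, Marker m, boolean on) {
        long bit = 1L << m.ordinal();
        flags[docId] = on ? (flags[docId] | bit) : (flags[docId] & ~bit);
        filterCache.remove(m); // the cached doc set for this marker is now stale
    }

    // "Count on this field": tally each marker over the documents that matched a query.
    Map<Marker, Integer> count(BitSet hits) {
        Map<Marker, Integer> counts = new EnumMap<>(Marker.class);
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            for (Marker m : Marker.values()) {
                if ((flags[doc] & (1L << m.ordinal())) != 0) {
                    counts.merge(m, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // "Filter queries": the documents carrying one marker, cached until the next update.
    BitSet docsWith(Marker m) {
        BitSet cached = filterCache.get(m);
        if (cached == null) {
            cached = new BitSet(flags.length);
            for (int doc = 0; doc < flags.length; doc++) {
                if ((flags[doc] & (1L << m.ordinal())) != 0) {
                    cached.set(doc);
                }
            }
            filterCache.put(m, cached);
        }
        return (BitSet) cached.clone(); // hand out a copy so callers cannot corrupt the cache
    }
}
```

The point of the sketch is the cache invalidation in setMarker: any cached filter result has to be dropped when the underlying values change, which in a real index would presumably happen when the reader is reopened.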

It's not the best approach; updatable Terms would better suit your use case.
However, we don't offer them yet, and it will be a while until we do (IF we
do). You should also benchmark that approach vs re-indexing the
documents, since the current implementation of updatable doc-values fields
isn't optimized for a few document updates between index reopens. See here:
http://shaierera.blogspot.com/2014/04/benchmarking-updatable-docvalues.html

Shai



Facets in Lucene 4.7.2

2014-06-13 Thread Sandeep Khanzode
Hi,
 
I am evaluating Lucene Facets for a project. Since the Facets API has changed a 
lot in 4.7.2, I am relying on the unit tests (UTs) for reference. Please let me 
know if there are other sources of information. 

I have a couple of questions:

1.] All categories in my application are flat, not hierarchical. But it seems 
from a few sources that, even so, you would want to use a taxonomy-based index 
for performance reasons. It is faster but uses more RAM. Or is the deterrent to 
using it the fact that it is a separate data structure? If one can live with 
the life-cycle management of the extra index, should we go ahead with the 
taxonomy index for better performance across tens of millions of documents? 

Another note to add is that I do not see a scenario wherein I would want to 
re-index my collection over and over again or, in other words, the changes 
would be spread over time. 

2.] I need a type of dynamic facet that allows me to add a flag or marker to 
the document at runtime, since it will change every time a user modifies or 
adds to the list of markers. Is this possible with the current implementation? 
I believe that currently all faceting is done at indexing time.

 
---
Thanks n Regards,
Sandeep Ramesh Khanzode

Re: Facets in Lucene 4.7.2

2014-06-13 Thread Shai Erera
Hi

You can check the demo code here:
https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/.
This code is updated with each release, so you always get working code
examples, even when the API changes.

If you don't mind managing the sidecar index, which I agree isn't such a
big deal, then yes - the taxonomy index currently performs the fastest. I
plan to explore porting the taxonomy-based approach from BinaryDocValues to
the new SortedNumericDocValues (coming out in 4.9) since it might perform
even faster.

I didn't quite get the marker/flag facet. Can you give an example? For
instance, if you can model that as a NumericDocValuesField added to
documents (w/ the different markers/flags translated to numbers), then you
can use Lucene's updatable numeric DocValues and write a custom Facets to
aggregate on that NumericDocValues field.
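
To make "markers/flags translated to numbers" concrete, here is a minimal plain-Java sketch of one possible encoding. It assumes a document can carry several markers at once and packs them into a single long as a bitmask, which is the kind of value one could then store in a NumericDocValuesField. The DocFlag enum and the FlagCodec class are hypothetical names used only for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical fixed list of markers. The ordinal is used as the bit position,
// so constants must only ever be appended, never reordered or removed.
enum DocFlag { HISTORICALLY_INTERESTING, FINANCIALLY_RELEVANT, MEDIA, ENTERTAINMENT }

final class FlagCodec {
    private FlagCodec() {}

    // Pack a set of markers into one number (one bit per marker).
    static long encode(Set<DocFlag> flags) {
        long bits = 0L;
        for (DocFlag f : flags) {
            bits |= 1L << f.ordinal();
        }
        return bits;
    }

    // Recover the set of markers from the stored number.
    static Set<DocFlag> decode(long bits) {
        Set<DocFlag> flags = EnumSet.noneOf(DocFlag.class);
        for (DocFlag f : DocFlag.values()) {
            if ((bits & (1L << f.ordinal())) != 0) {
                flags.add(f);
            }
        }
        return flags;
    }
}
```

If each document only ever has one marker, the simpler choice would be to store the marker's ID directly instead of a bitmask.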

Shai




Re: Facets in Lucene 4.7.2

2014-06-13 Thread Sandeep Khanzode
Hi Shai,
 
Thanks so much for the clear explanation.

I agree on the first question. Taxonomy Writer with a separate index would 
probably be my approach too.

For the second question:
I am a little new to the Facets API, so I will try to figure out the approach 
that you outlined.

However, the scenario is this: assume a document corpus that is indexed. For a 
user query, a document is returned and selected by the user for editing as part 
of some use case/workflow. That document is now marked by the user as either 
historically interesting or not, financially relevant, specific to the media or 
entertainment domain, etc. So, essentially, the user is flagging the document 
with certain markers.
Another set of users could possibly want to query on these markers. So, let's 
say a second user comes along and wants to see the top documents belonging to 
one category, say, agriculture or farming. Since these markers are runtime 
activities, how can I use facets on them? I was envisioning facets as 
the various markers. But if I constantly re-index or update the documents 
whenever a marker changes, I believe it would not be very efficient. 

Is there anything, facets or otherwise, in Lucene that can help me solve this 
use case? 

Please let me know. And, thanks!

---
Thanks n Regards,
Sandeep Ramesh Khanzode

