Re: Facets in Lucene 4.7.2

Sandeep Khanzode Tue, 17 Jun 2014 11:32:55 -0700

If I am counting correctly, the $facets field in the index shows a count of 
approx. 28k. That does not sound like much, I guess. All my facets are flat and 
the FacetsConfig only defines a couple of them to be multi-valued.


Let me know if I am not counting the taxonomy size correctly. The 
taxoReader.getSize() also shows this count.

I will check on a Linux box to make sure. Thanks,
 
-----------------------
Thanks n Regards,
Sandeep Ramesh Khanzode


On Tuesday, June 17, 2014 11:28 PM, Shai Erera <[email protected]> wrote:
 


Nothing suspicious ... code looks fine. The call to FastTaxonomyFacetCounts
actually computes the counts ... that's the expensive part of faceted
search.
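
As an aside, the per-stage timings in a test like this are easier to compare
when taken with System.nanoTime() than by printing Dates. A minimal sketch,
assuming plain Java 8 and no Lucene dependency; StageTimer and its methods are
hypothetical names used only for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

public class StageTimer {
    // Accumulated elapsed time per stage, kept in insertion order.
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    // Runs the given work, records how long it took under the stage name,
    // and returns the work's result. Checked exceptions are rethrown wrapped.
    public <T> T time(String stage, Callable<T> work) {
        long start = System.nanoTime();
        try {
            return work.call();
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            elapsedNanos.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    // Elapsed milliseconds for a stage, or 0 if the stage never ran.
    public long millis(String stage) {
        return elapsedNanos.getOrDefault(stage, 0L) / 1_000_000L;
    }

    // One "stage: N ms" line per recorded stage.
    public String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : elapsedNanos.entrySet()) {
            sb.append(e.getKey()).append(": ")
              .append(e.getValue() / 1_000_000L).append(" ms\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        // Dummy workload standing in for a search or counting stage.
        Callable<Long> work = () -> {
            long s = 0;
            for (int i = 0; i < 1_000_000; i++) {
                s += i;
            }
            return s;
        };
        long sum = timer.time("dummy work", work);
        System.out.println("sum = " + sum);
        System.out.print(timer.report());
    }
}
```

With the names used in this thread, the counting stage could be wrapped as
timer.time("facet counts", () -> new FastTaxonomyFacetCounts(taxoReader,
config, fc)). Running it once right after clearing the file system cache and
once again with the cache warm helps separate disk cost from counting cost.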

How big is your taxonomy (number of categories)?
Is it hierarchical (i.e. are your dimensions flat, or deep like A/1/2/3/)?
What does your FacetsConfig look like?

Still, unless your taxonomy is huge (hundreds of millions of
categories), I don't think you could mess something up that badly, even
intentionally, to end up w/ 40-45s response times!

Shai


On Tue, Jun 17, 2014 at 8:51 PM, Sandeep Khanzode <
[email protected]> wrote:

> Hi,
>
> Thanks for your response. It does sound pretty bad which is why I am not
> sure whether there is an issue with the code, the index, the searcher, or
> just the machine, as you say.
> I will try with another machine just to make sure and post the results.
>
> Meanwhile, can you tell me if there is anything wrong in the below
> measurement? Or is the API usage or the pattern incorrect?
>
> I used a tool called RAMMap to clean the Windows cache. If I do not, the
> results are very fast as I mentioned already. If I do, then the total time
> is 40s.
>
> Can you please provide any pointers on what could be wrong? I will be
> checking on a Linux box anyway.
>
> =========================================================
> System.out.println("1. Start Date: " + new Date());
> TopDocs topDocs = FacetsCollector.search(searcher, query, 100, fc);
> System.out.println("1. End Date: " + new Date());
> // Above part takes approx 2-12 seconds depending on the query
>
> System.out.println("2. Start Date: " + new Date());
> List<FacetResult> results = new ArrayList<FacetResult>();
> Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
> System.out.println("2. End Date: " + new Date());
> // Above part takes approx 40-53 seconds depending on the query, for the
> // first time on Windows
>
> System.out.println("3. Start Date: " + new Date());
> results.add(facets.getTopChildren(1000, "F1"));
> results.add(facets.getTopChildren(1000, "F2"));
> results.add(facets.getTopChildren(1000, "F3"));
> results.add(facets.getTopChildren(1000, "F4"));
> results.add(facets.getTopChildren(1000, "F5"));
> results.add(facets.getTopChildren(1000, "F6"));
> results.add(facets.getTopChildren(1000, "F7"));
> System.out.println("3. End Date: " + new Date());
> // Above part takes less than 1 second
> =========================================================
>
> -----------------------
> Thanks n Regards,
> Sandeep Ramesh Khanzode
>
>
> On Tuesday, June 17, 2014 10:15 PM, Shai Erera <[email protected]> wrote:
>
>
>
> Hi
>
> 40 seconds for faceted search is ... crazy. Also, note how the times don't
> differ much even though the number of hits is much higher (29K vs 15.1M)
> ... That, together w/ the fact that subsequent queries are much faster (a
> few seconds), suggests that something is seriously messed up w/ your
> environment. Maybe it's a faulty disk? E.g. after the file system cache is
> warm, you no longer hit the disk?
>
> In general, the more hits you have, the more expensive faceted search is.
> That's true for scoring as well (i.e. even without facets). There's just
> more work to determine the top results (docs, facets...). With facets, you
> can use sampling (see RandomSamplingFacetsCollector), but I would do that
> only after you verify that collecting 15M docs is very expensive for you,
> even when the file system cache is hot.
>
> I've never seen those numbers before, therefore it's difficult for me to
> relate to them.
>
> There's a caching mechanism for facets, through CachedOrdinalsReader. But I
> wouldn't go there until you verify that your IO system is good (try another
> machine, OS, disk ...), and that the 40s times are truly from the faceting
> code.
>
> Shai
>
>
>
> On Tue, Jun 17, 2014 at 4:21 PM, Sandeep Khanzode <
> [email protected]> wrote:
>
> > Hi,
> >
> > Thanks again!
> >
> > This time, I have indexed data with the following specs. It takes > 40
> > seconds for FastTaxonomyFacetCounts to compute all the facets. Is this in
> > line with your measurements? Subsequent runs fare much better, probably
> > because of the Windows file system cache. How can I speed this up?
> > I believe there was a CategoryListCache earlier. Is there any cache or
> > other implementation that I can use?
> >
> > Secondly, I had a general question. If I extrapolate these numbers to a
> > billion documents, my search and facet numbers would probably be unusable
> > in a real time scenario. What strategies are employed when you deal with
> > such large scale? I am new to Lucene so please also direct me to the
> > relevant info sources. Thanks!
> >
> > Corpus:
> > Count: 20M, Size: 51GB
> >
> > Index:
> > Size (w/o Facets): 19GB, Size (w/Facets): 20.12GB
> > Creation Time (w/o Facets): 3.46hrs, Creation Time (w/Facets): 3.49hrs
> >
> > Search Performance:
> >                With 29055 hits (5 terms in query):
> >                Query Execution: 8 seconds
> >                Facet counts execution: 40-45 seconds
> >
> >                With 4.22M hits (2 terms in query):
> >                Query Execution: 3 seconds
> >                Facet counts execution: 42-46 seconds
> >
> >                With 15.1M hits (1 term in query):
> >                Query Execution: 2 seconds
> >                Facet counts execution: 45-53 seconds
> >
> >                With 6183 hits (5 different values for the same 5 terms):
> >                (Without Flushing Windows File Cache on Next run)
> >                Query Execution: 11 seconds
> >                Facet counts execution: < 1 second
> >
> >                With 4.9M hits (1 different value for the 1 term):
> >                (Without Flushing Windows File Cache on Next run)
> >                Query Execution: 2 seconds
> >                Facet counts execution: 3 seconds
> >
> > -----------------------
> > Thanks n Regards,
> > Sandeep Ramesh Khanzode
> >
> >
> > On Monday, June 16, 2014 8:11 PM, Shai Erera <[email protected]> wrote:
> >
> >
> >
> > Hi
> >
> > > 1.] Is there any API that gives me the count of a specific dimension from
> > > FacetsCollector in response to a search query? Currently, I use the
> > > getTopChildren() with some value and then check the FacetResult object for
> > > the actual number of dimensions hit along with their occurrences. Also, the
> > > getSpecificValue() does not work without a path attribute to the API.
> > >
> >
> > To get the value of the dimension itself, you should call
> > getTopChildren(1, dim). Note that getSpecificValue does not allow passing
> > only the dimension, and getTopChildren requires topN to be > 0. Passing 1 is
> > a hack, but I'm not sure we should specifically support getting the
> > aggregated value of just the dimension ... once you get that, the
> > FacetResult.value tells you the aggregated count.
> >
> > > 2.] Can I find the MAX or MIN value of a Numeric type field written to the
> > > index?
> > >
> >
> > Depends how you index them. If you index the field as a numeric field (e.g.
> > LongField), I believe you can use NumericUtils.getMaxLong. If it's a
> > DocValues field, I don't know of a built-in function that does it, but this
> > thread has demo code:
> > http://www.gossamer-threads.com/lists/lucene/java-user/195594.
> >
> > > 3.] I am trying to compare and contrast Lucene Facets with Elastic Search.
> > > I could determine that ES does search time faceting and dynamically returns
> > > the response without any prior faceting during indexing time. If index time
> > > lag is not my concern, can I assume that, in general, performance-wise
> > > Lucene facets would be faster?
> > >
> >
> > I will start by saying that I don't know much about how ES facets work. We
> > have some committers who know how both Lucene and ES facets work, so they
> > can comment on that. But I personally don't think faceting can ever be free
> > of index-time decisions. Well .. not unless you're faceting on arbitrary
> > terms. Otherwise, you already make decisions such as indexing the field as
> > not tokenized/analyzed/lowercased/doc-values etc.
> >
> > Note that Lucene facets also support a non-taxonomy-based faceting option,
> > using DocValues fields. Look at SortedSetDocValuesFacetField. This too
> > can be perceived as an index-time decision though... And there are some
> > built-in dynamic faceting capabilities too, like range facets
> > (LongRangeFacetCounts), which can work on any NumericDocValuesField, as
> > well as any ValueSource (such as Expressions).
> >
> > I cannot compare ES facets to Lucene's in terms of performance, as I
> > haven't benchmarked them yet.
> >
> > > 4.] I index a semi-large-ish corpus of 20M files across 50GB. If I do not
> > > use IndexWriter.commit(), I get standard files like cfe/cfs/si in the index
> > > directory. However, if I do use the commit(), then as I understand it, the
> > > state is persisted to the disk. But this time, there are additional file
> > > extensions like doc/pos/tim/tip/dvd/dvm, etc. I am not sure about this
> > > difference and its cause.
> > >
> >
> > The information in the doc/tim/tip etc. files is buffered in memory
> > (controlled by ramBufferSizeMB), and when it is flushed (on commit or when
> > the RAM buffer fills up), those files materialize on disk. When you call
> > commit there's no stop-the-world activity going on. Rather, all in-memory
> > buffers are flushed, the files are fsync'd and a new commit point is
> > generated. Indexing can continue as usual though. Concurrency might be
> > affected, depending on the speed of your IO system, but there's no
> > intentional stop-the-world.
> >
> > *5.] Does the RAMBufferSizeMB() control the commit intervals, so that when
> > the limit is reached across all writing threads, the contents are flushed
> > to disk periodically?*
> >
> > The RAM buffer limit controls the flush intervals. Commit is an explicit
> > operation that you have to call yourself, as it's rather expensive (fsync
> > is expensive). Note that since 4.0 Lucene flushes each thread's indexing
> > state independently of other threads. So when the RAM buffer fills up, one
> > thread's indexing state is picked and flushed, while other threads can
> > continue indexing (whereas before, this flush would be a stop-the-world
> > action, preventing indexing for a while).
> >
> > Shai
> >
> >
> >
> >
> > On Mon, Jun 16, 2014 at 4:57 PM, Sandeep Khanzode <
> > [email protected]> wrote:
> >
> > > Correction on [4] below. I do get doc/pos/tim/tip/dvd/dvm files in either
> > > case. What I meant was that the number of those files appears different in
> > > the two cases. Also, does commit() stop the world and behave serially to
> > > flush the contents?
> > >
> > > -----------------------
> > > Thanks n Regards,
> > > Sandeep Ramesh Khanzode
> > >
> > >
> > > On Monday, June 16, 2014 7:10 PM, Sandeep Khanzode
> > > <[email protected]> wrote:
> > >
> > >
> > >
> > > Hi Shai,
> > >
> > > Thanks for the response. Appreciated! I understand that this particular
> > > use case has to be handled in a different way.
> > >
> > > Can you please help me with the below questions?
> > >
> > > 1.] Is there any API that gives me the count of a specific dimension from
> > > FacetsCollector in response to a search query? Currently, I use the
> > > getTopChildren() with some value and then check the FacetResult object for
> > > the actual number of dimensions hit along with their occurrences. Also, the
> > > getSpecificValue() does not work without a path attribute to the API.
> > >
> > > 2.] Can I find the MAX or MIN value of a Numeric type field written to the
> > > index?
> > >
> > > 3.] I am trying to compare and contrast Lucene Facets with Elastic Search.
> > > I could determine that ES does search time faceting and dynamically returns
> > > the response without any prior faceting during indexing time. If index time
> > > lag is not my concern, can I assume that, in general, performance-wise
> > > Lucene facets would be faster?
> > >
> > > 4.] I index a semi-large-ish corpus of 20M files across 50GB. If I do not
> > > use IndexWriter.commit(), I get standard files like cfe/cfs/si in the index
> > > directory. However, if I do use the commit(), then as I understand it, the
> > > state is persisted to the disk. But this time, there are additional file
> > > extensions like doc/pos/tim/tip/dvd/dvm, etc. I am not sure about this
> > > difference and its cause.
> > >
> > > 5.] Does the RAMBufferSizeMB() control the commit intervals, so that when
> > > the limit is reached across all writing threads, the contents are flushed
> > > to disk periodically?
> > >
> > > Appreciate your response to the above queries. Thanks again,
> > >
> > >
> > >
> > > -----------------------
> > > Thanks n Regards,
> > > Sandeep Ramesh Khanzode
> > >
> > >
> > > On Sunday, June 15, 2014 10:40 AM, Shai Erera <[email protected]> wrote:
> > >
> > >
> > >
> > > Hi
> > >
> > > Currently there's no way to add e.g. terms to already indexed documents,
> > > you have to re-index them. The only updatable field types Lucene currently
> > > offers are DocValues fields. If the list of markers/flags is fixed in
> > > your case, and you can map them to an integer, I think you could use a
> > > NumericDocValues field, which supports field-level updates.
> > >
> > > Once you do that, you can then:
> > >
> > > * Count on this field pretty easily. You will need to write a Facets
> > > implementation, but otherwise it's very easy.
> > >
> > > * Filter queries: you will need to write a Filter which returns a DocIdSet
> > > of the documents that belong to one category (e.g. Financially Relevant).
> > > Here you might want to consider caching the result of the Filter, by using
> > > CachingWrapperFilter.
> > >
> > > It's not the best approach. Updatable Terms would better suit your use
> > > case, but we don't offer them yet and it will be a while until we do (IF
> > > we do). You should also benchmark that approach vs re-indexing the
> > > documents, since the current implementation of updatable doc-values fields
> > > isn't optimized for a few document updates between index reopens. See here:
> > > http://shaierera.blogspot.com/2014/04/benchmarking-updatable-docvalues.html
> > >
> > > Shai
> > >
> > >
> > >
> > > On Fri, Jun 13, 2014 at 10:19 PM, Sandeep Khanzode <
> > > [email protected]> wrote:
> > >
> > > > Hi Shai,
> > > >
> > > > Thanks so much for the clear explanation.
> > > >
> > > > I agree on the first question. Taxonomy Writer with a separate index would
> > > > probably be my approach too.
> > > >
> > > > For the second question:
> > > > I am a little new to the Facets API so I will try to figure out the
> > > > approach that you outlined below.
> > > >
> > > > However, the scenario is this: Assume a document corpus that is indexed.
> > > > For a user query, a document is returned and selected by the user for
> > > > editing as part of some use case/workflow. That document is now marked by
> > > > the user as either historically interesting or not, financially relevant,
> > > > specific to the media or entertainment domain, etc. So, essentially the
> > > > user is flagging the document with certain markers.
> > > > Another set of users could possibly want to query on these markers. So,
> > > > let's say, a second user comes along, and wants to see the top documents
> > > > belonging to one category, say, agriculture or farming. Since these
> > > > markers are run time activities, how can I use the facets on them? So, I
> > > > was envisioning facets as the various markers. But, if I constantly
> > > > re-index or update the documents whenever a marker changes, I believe it
> > > > would not be very efficient.
> > > >
> > > > Is there anything, facets or otherwise, in Lucene that can help me solve
> > > > this use case?
> > > >
> > > > Please let me know. And, thanks!
> > > >
> > > > -----------------------
> > > > Thanks n Regards,
> > > > Sandeep Ramesh Khanzode
> > > >
> > > >
> > > > On Friday, June 13, 2014 9:51 PM, Shai Erera <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > Hi
> > > >
> > > > You can check the demo code here:
> > > > https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/
> > > > This code is updated with each release, so you always get working code
> > > > examples, even when the API changes.
> > > >
> > > > If you don't mind managing the sidecar index, which I agree isn't such a
> > > > big deal, then yes - the taxonomy index currently performs the fastest. I
> > > > plan to explore porting the taxonomy-based approach from BinaryDocValues
> > > > to the new SortedNumericDocValues (coming out in 4.9) since it might
> > > > perform even faster.
> > > >
> > > > I didn't quite get the marker/flag facet. Can you give an example? For
> > > > instance, if you can model that as a NumericDocValuesField added to
> > > > documents (w/ the different markers/flags translated to numbers), then
> > > > you can use Lucene's updatable numeric DocValues and write a custom Facets
> > > > to aggregate on that NumericDocValues field.
> > > >
> > > > Shai
> > > >
> > > >
> > > >
> > > > On Fri, Jun 13, 2014 at 11:48 AM, Sandeep Khanzode <
> > > > [email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am evaluating Lucene Facets for a project. Since there is a lot of
> > > > > change in 4.7.2 for Facets, I am relying on UTs for reference. Please
> > > > > let me know if there are other sources of information.
> > > > >
> > > > > I have a couple of questions:
> > > > >
> > > > > 1.] All categories in my application are flat, not hierarchical. But,
> > > > > it seems from a few sources that, even so, you would want to use a
> > > > > Taxonomy based index for performance reasons. It is faster but uses
> > > > > more RAM. Or is the deterrent to using it the fact that it is a
> > > > > separate data structure? If one can handle the life-cycle management of
> > > > > the extra index, should we go ahead with the taxonomy index for better
> > > > > performance across tens of millions of documents?
> > > > >
> > > > > Another note to add is that I do not see a scenario wherein I would
> > > > > want to re-index my collection over and over again or, in other words,
> > > > > the changes would be spread over time.
> > > > >
> > > > > 2.] I need a type of dynamic facet that allows me to add a flag or
> > > > > marker to the document at runtime since it will change/update every
> > > > > time a user modifies or adds to the list of markers. Is this possible
> > > > > to do with the current implementation? I believe that currently all
> > > > > faceting is done at indexing time.
> > > > >
> > > > >
> > > > >
> > > > > -----------------------
> > > > > Thanks n Regards,
> > > > > Sandeep Ramesh Khanzode
> > >
