Re: Facet DrillDown Exclusion

2016-12-06 Thread Shai Erera
Hey Matt,

You basically don't need to use DrillDownQuery (DDQ) in that case. You can
construct a BooleanQuery with a MUST_NOT clause to filter out the facet path.
Here's a short code snippet:

// find the index field that backs the "Author" facet
String indexedField = config.getDimConfig("Author").indexFieldName;
Query q = new BooleanQuery.Builder()
  // here you would usually use a different query
  .add(new MatchAllDocsQuery(), Occur.MUST)
  // do not match documents with "Author/Lisa" in their facets
  .add(new TermQuery(DrillDownQuery.term(indexedField, "Author", "Lisa")), Occur.MUST_NOT)
  .build();
searcher.search(q, 10);

Hope this helps

Shai

On Tue, Dec 6, 2016 at 1:55 AM Matt Hicks  wrote:

> I'm currently drilling down adding a facet path, but I'd like to be able to
> do the same as a NOT query.  Is there any way to do an exclusion drill down
> on a facet to exclude docs that match the facet while including all others?
>
> Thanks
>


Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-30 Thread Shai Erera
This feature is not available in Lucene currently, but it shouldn't be hard
to add it. See Mike's comment here:
http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html?showComment=1412777154420#c363162440067733144

One more tricky (yet nicer) feature would be to have it all in one go, i.e.
you'd say something like "facet on field price" and you'd get "interesting"
buckets, per the variance in the results.

But before that, we could have a StatsFacets in Lucene which provides some
statistics about a numeric field (min/max/avg etc.).
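
Until such a StatsFacets exists, here is a minimal sketch of what an application
can do today; this is not a built-in Lucene feature, it is written against the
6.x doc-values API, and the collector and the "date" field name are made up for
illustration:

class MinMaxCollector extends SimpleCollector {
  long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
  private NumericDocValues values;

  @Override
  protected void doSetNextReader(LeafReaderContext context) throws IOException {
    values = context.reader().getNumericDocValues("date");
  }

  @Override
  public void collect(int doc) throws IOException {
    if (values == null) return; // this segment has no "date" values
    long v = values.get(doc);   // note: 6.x returns 0 for docs missing the field
    min = Math.min(min, v);
    max = Math.max(max, v);
  }

  @Override
  public boolean needsScores() {
    return false;
  }
}

MinMaxCollector minMax = new MinMaxCollector();
searcher.search(query, minMax); // afterwards read minMax.min / minMax.max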

On Wed, Nov 30, 2016 at 7:50 AM Chitra R  wrote:

> Thank you so much, mike... Hope, gained a lot of stuff on Doc
> Values faceting and also clarified all my doubts. Thanks..!!
>
>
> *Another use case:*
>
> After getting matching documents for the given query, is there any way to
> calculate min and max values on a NumericDocValuesField (say a date field)?
>
>
> I would like to implement it in numeric range faceting by splitting the
> numeric values (getting from resulted documents) into ranges.
>
>
> Chitra
>
>
> On Wed, Nov 30, 2016 at 3:51 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
> > Doc values fields are never loaded into memory; at most some small
> > index structures are.
> >
> > When you use those fields, the bytes (for just the one doc values
> > field you are using) are pulled from disk, and the OS will cache them
> > in memory if available.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Mon, Nov 28, 2016 at 6:01 AM, Chitra R  wrote:
> > > Hi,
> > >  When opening SortedSetDocValuesReaderState at search time,
> > whether
> > > the whole doc value files (.dvd & .dvm) information are loaded in
> memory
> > or
> > > specified field information(say $facets field) alone load in memory?
> > >
> > >
> > >
> > >
> > > Any help is much appreciated.
> > >
> > >
> > > Regards,
> > > Chitra
> > >
> > > On Tue, Nov 22, 2016 at 5:47 PM, Chitra R 
> wrote:
> > >>
> > >>
> > >> Kindly post your suggestions.
> > >>
> > >> Regards,
> > >> Chitra
> > >>
> > >> On Sat, Nov 19, 2016 at 1:38 PM, Chitra R 
> > wrote:
> > >>>
> > >>> Hey, I got it clearly. Thank you so much. Could you please help us to
> > >>> implement it in our use case?
> > >>>
> > >>>
> > >>> In our case, we are having dynamic index and it is variable depth
> too.
> > So
> > >>> flat facet is enough.No need of hierarchical facets.
> > >>>
> > >>> What I think is,
> > >>>
> > >>> Index my facet field as normal doc value field, so that no special
> > >>> operation (like taxonomy and sorted set doc values facet field) will
> > be done
> > >>> at index time and only doc value field stores its ordinals in their
> > >>> respective field.
> > >>> At search time, I will pass query (user search query) , filter (path
> > >>> traversed list)  and collect the matching documents in
> Facetscollector.
> > >>> To compute facet count for the specific field, I will gather those
> > >>> resulted docs, then move through each segment for collecting the
> > matching
> > >>> ordinals using AtomicReader.
> > >>>
> > >>>
> > >>> And know when I use this means, can't calculate facet count for more
> > than
> > >>> one field(facet) in a search.
> > >>>
> > >>> Instead of loading all the dimensions in DocValuesReaderState (will
> > take
> > >>> more time and memory) at search time, loading specific fields will
> > take less
> > >>> time and memory, hope so. Kindly help to solve.
> > >>>
> > >>>
> > >>> It will do it in a minimal index and search cost, I think. And hope
> > this
> > >>> won't put overload at index time, also at search time this will be
> > better.
> > >>>
> > >>>
> > >>> Kindly post your suggestions.
> > >>>
> > >>>
> > >>> Regards,
> > >>> Chitra
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Fri, Nov 18, 2016 at 7:15 PM, Michael McCandless
> > >>>  wrote:
> > 
> >  I think you've summed up exactly the differences!
> > 
> >  And, yes, it would be possible to emulate hierarchical facets on top
> >  of flat facets, if the hierarchy is fixed depth like year/month/day.
> > 
> >  But if it's variable depth, it's trickier (but I think still
> >  possible).  See e.g. the Committed Paths drill-down on the left, on
> >  our dog-food server
> >  http://jirasearch.mikemccandless.com/search.py?index=jira
> > 
> >  Mike McCandless
> > 
> >  http://blog.mikemccandless.com
> > 
> > 
> >  On Fri, Nov 18, 2016 at 1:43 AM, Chitra R 
> > wrote:
> >  > case 1:
> >  > In taxonomy, for each indexed document, examines facet
> > label ,
> >  > computes their 

Re: Lucene 6.3 faceting documentation

2016-11-10 Thread Shai Erera
We removed the userguide a long time ago. We have a set of example files
under lucene-demo, e.g. here
https://lucene.apache.org/core/6_3_0/demo/src-html/org/apache/lucene/demo/facet/
.

Also, you can read some blog posts, start here:
http://shaierera.blogspot.com/2012/11/lucene-facets-part-1.html and then
http://shaierera.blogspot.com/2012/11/lucene-facets-part-2.html, though the
code examples may be outdated. The lucene-demo source is up-to-date though.

Shai

On Thu, Nov 10, 2016 at 4:40 PM Glen Newton  wrote:

> I am looking for documentation on Lucene faceting. The most recent
> documentation I can find is for 4.0.0 here:
>
> http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html
>
> Is there more recent documentation for 6.3.0? Or 6.x?
>
> Thanks,
> Glen
>


Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-10 Thread Shai Erera
Hi

The reason IMO is historic - ES and Solr had faceting solutions before
Lucene had it. There were discussions in the past about using the Lucene
faceting module in Solr (can't tell for ES) but, sadly, I can't say I see
it happening at this point.

Regarding your other question, IMO the Lucene faceting engine, in terms of
performance and customizability, is on par with Solr/ES. However, it lacks
distributed faceting support and aggregations. Since many people use
Solr/ES and not Lucene directly, the Solr/ES faceting module continues to
advance separately from the Lucene one.

Enhancing Lucene facets with aggregations and even distributed faceting
capabilities is mostly a matter of time and priorities. If you're
interested in it, I'd be willing to collaborate with you on that as much as
I can!

And I'd still hope that this work finds its way into Solr/ES, as I think
it's silly to have that many faceting implementations when they
all rely on the same low-level data structure - Lucene!

Shai


On Thu, Nov 10, 2016 at 12:32 PM Kumaran Ramasubramanian 
wrote:

> Hi All,
> We all know that Lucene supports faceting by providing
> Taxonomy(Separate index and hierarchical facets) and
> SortedSetDocValuesFacetField ( flat facets and no sidecar index).
>
>   Then why did solr and elastic search go for its own implementation ?
>  ( that is, solr uses block join & elasticsearch uses aggregations ) Is
> there any limitations in lucene's implementation ?
>
>
> --
> Kumaran R
>


Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization

2016-09-28 Thread Shai Erera
*> However, that should not lead to NSFE. At worst it should lead to
> "ordinal is not known" (maybe as an AIOOBE) from the taxonomy reader.*

That is correct: this interleaved indexing case can potentially result in
an AIOOBE-like exception during faceted search, when the facets that are in
the "sneaked-in" docs are found by a search, but resolving the ordinals
to their labels fails because the labels are unknown to the
taxonomy.

I wonder if committing in the opposite order solves this problem. So in the
above use case, IW.commit() commits all the new docs with their facets;
if more indexing happens before TIW.commit(), the commit to the
taxonomy index results in more facets than are known to the search index,
but that's OK.

I'm just not sure if that covers all concurrency cases though. I remember
this was discussed several times in the past, and we eventually reached a
conclusion, but clearly if it was the latter, it wasn't clarified in the
javadocs. I can't think of a use case that breaks this commit order though (
IW.commit() followed by TIW.commit()). This feels safe to me ... can you
try to think of a use case that breaks it? Assuming that each doc-indexing
does addTaxo() followed by addDoc().

Maybe we should have a helper which takes an IW and TIW and exposes
commit() APIs that will do it in the correct order?
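
For illustration, a minimal sketch of what such a helper could look like (this
is not an existing Lucene class; it simply applies the commit order discussed
above):

class IndexAndTaxonomyCommitter {
  private final IndexWriter indexWriter;
  private final DirectoryTaxonomyWriter taxoWriter;

  IndexAndTaxonomyCommitter(IndexWriter iw, DirectoryTaxonomyWriter tw) {
    this.indexWriter = iw;
    this.taxoWriter = tw;
  }

  // Assumes each document's categories are added to the taxonomy (addTaxo)
  // before the document itself is added (addDoc), per the discussion above.
  synchronized void commit() throws IOException {
    indexWriter.commit(); // commit the search index first ...
    taxoWriter.commit();  // ... then the taxonomy
  }
}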

Now I'm thinking about SearcherTaxoManager -- it reopens the readers by
first re-opening IR, then TIR. It does so under the assumption of first
committing to TIW then to IW. Now if we reverse the order, then you need to
be more careful in when you commit changes to the two writers, and when you
re-open the readers. If you always do that from the same thread, then you
should be fine, the order of re-opens doesn't really matter.

But if you re-open from a different thread than the one you commit from, I am
not sure that committing to IW first and then TIW plays well with every
re-open order. I.e. one case which breaks it: you commit to IW, then re-open
both IR and TIR before you commit to TIW, and a search may find
ordinals that are unknown to the TIR.

So I'd say that if you refresh() from the same thread that you do commit(),
then commit to IW first then TIW, and use SearcherTaxoManager as it's
currently implemented. But I'd like to hear your thoughts about it.

Shai


On Wed, Sep 28, 2016 at 1:26 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Wed, Sep 28, 2016 at 3:05 AM, William Moss
>  wrote:
> > Thank you both for your quick reply!
>
> You're welcome!
>
> > * We actually tried the upgrade to 6.0 a few months back (when that was
> the
> > newest) and were getting similar errors to the ones I'm seeing now. We
> were
> > not able to track them down, which is part of the motivation for me
> asking
> > all these questions. We'll get there though :-)
>
> OK, we gotta get to the root cause.  Sounds like it happens in either
> version...
>
> > * The last time we tested this (which I think was still post
> > ConcurrentMergePolicy) we saw that the read speed would slowly degrade
> over
> > time. My understanding was that forceMerge was very expensive, but would
> > make reads faster once complete. Is this not correct?
>
> It really depends on what queries you are running.  Really you should
> test in your use case and be certain that the massive expense of force
> merge is worthwhile / necessary.  In general it's not worth it, even
> if searches are a bit faster, except for indices that will never
> change again.
>
> > Also, we never
> > attempted to tune the MergePolicy at all, so while we're on the subject,
> is
> > there good documentation on how to do that? I'd much prefer to get away
> > from calling forceMerge. If it's useful information, we've got a
> relatively
> > small corpus, only ~2+M documents.
>
> Just use the defaults :)  Tuning those settings is dangerous unless
> you have a very specific problem to fix.
>
> > * We want to be able to ensure that if a machine or JVM crashes we are
> in a
> > coherent state. To that end, we need to call commit on Lucene and then
> > commit back what we've read so far to Kafka. Calling commit is the only
> way
> > to ensure this, right?
>
> Correct: commit in Lucene, then notify Kafka what offset you had
> indexed just before you called IW.commit.
>
> But you may want to replicate the index across machines if you don't
> want to have a single point of failure.  We recently added
> near-real-time replication to Lucene for this use case ...
>
> > * To make sure I understand how maybeRefresh works, ignoring whether or
> not
> > we commit for a second, if I add a document via IndexWriter, it will not
> be
> > reflected in IndexSearchers I get by calling acquire on
> SearcherAndTaxonomy
> > until I call maybeRefresh?
>
> Correct.
>
> > Now, on to the concurrency issue. I was thinking a little more about this
> > and I think the fundamental issue is that while IndexWriter and
> 

Re: Clarification on LUCENE 4795 discussions ( Add FacetsCollector based on SortedSetDocValues )

2016-09-26 Thread Shai Erera
Hey,

Here's a blog I wrote a couple years ago about using facet associations:
http://shaierera.blogspot.com/2013/01/facet-associations.html. Note that
the examples in the blog were written against a very old Lucene version
(4.7 maybe). We have a couple of demo files that are maintained with the
code changes here
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=tree;f=lucene/demo/src/java/org/apache/lucene/demo/facet;h=41085e3aaa1d4d0697a5ef5d9853a093c1600ca6;hb=HEAD.
Check them out, especially this one:
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=blob;f=lucene/demo/src/java/org/apache/lucene/demo/facet/AssociationsFacetsExample.java;h=3e2737d0c8f02d12e4fdb76f97891c8593ef5fbc;hb=HEAD
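
For convenience, a minimal sketch along the lines of that demo (the dimension
name, values and weights are made up; note that each associations dimension is
mapped to its own index field via FacetsConfig):

FacetsConfig config = new FacetsConfig();
config.setMultiValued("tags", true);
config.setIndexFieldName("tags", "$facets.int");

Document doc = new Document();
// associate a weight of 3 with the "lucene" tag on this document
doc.add(new IntAssociationFacetField(3, "tags", "lucene"));
indexWriter.addDocument(config.build(taxoWriter, doc));

// at search time, sum the associations instead of counting documents
FacetsCollector fc = new FacetsCollector();
FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
Facets tags = new TaxonomyFacetSumIntAssociations("$facets.int", taxoReader, config, fc);
FacetResult result = tags.getTopChildren(10, "tags");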

Hope this helps!

Shai

On Tue, Sep 27, 2016 at 7:20 AM Kumaran Ramasubramanian 
wrote:

> Hi mike,
>
> Thanks for the clarification. Any example about difference in using flat vs
> hierarchical facets? Any demo or sample page?
>
> In a previous thread yesterday ( Faceting: Taxonomy index Vs
> SortedSetDocValues ), there is a point like
>
> "tried to achieve multilevel (hierarchical) categorization using
> SortedSetDocValues and got it simply by changing the query  and opening the
> IndexReader for each level of query using SortedSetDocValuesReaderState. "
>
> Is it possible easily?
>
> -
> Kumaran R
>
> On Sep 27, 2016 9:38 AM, "Michael McCandless" 
> wrote:
> >
> > Weighted facets is the ability to associate a float value with each
> > facet label you index, and at search time to aggregate those floats.
> > See e.g. FloatAssociationFacetField.
> >
> > "other features" refers to hierarchical facets, which
> > SortedSetDocValuesFacetField does not support (just flat facets)
> > though this is possible to fix, I think (patches welcome!).
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Mon, Sep 26, 2016 at 5:24 PM, Kumaran Ramasubramanian
> >  wrote:
> > >
> > >
> > > Hi All,
> > >
> > > i want to know the list of features which can be used by applications
> > > using facet module of lucene.
> > >
> > >
> https://issues.apache.org/jira/browse/LUCENE-4795?focusedCommentId=13599687
> > >
> > > I ask because it seems that the only thing that we get from this
> SortedSet
> > >> approach is not having to maintain a sidecar index (which for some
> reason
> > >> freaks everybody), and we even lose performance. Plus, I don't see how
> we
> > >> can support other facet features with it.
> > >
> > >
> > > on the other hand SortedSet doesn't have these problems. maybe it
> doesnt
> > >> support weighted facets or other features, but its a nice option. I
> > >> personally don't think its the end of the world if Mike's patch doesnt
> > >> support all the features of the faceting module initially or even
> ever.
> > >
> > >
> > >
> > >
> > > What is meant by weighted facets? What are other facet features?
> > >
> > >
> > > --
> > > Kumaran R
> > >
>


Re: Lucene Facets performance problems (version 4.7.2)

2016-02-26 Thread Shai Erera
True, but Erick's questions are still valid :-). We need more info to
answer these questions. So Simona, the more info you can give us the better
we'll be able to answer.

On Fri, Feb 26, 2016, 10:54 Uwe Schindler  wrote:

> Hi Erick,
>
> this was a question about Lucene so "=true" won't help. It also
> talks about *Lucene's Facetting*, not Solr's.
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Friday, February 26, 2016 8:22 AM
> > To: java-user 
> > Subject: Re: Lucene Facets performance problems (version 4.7.2)
> >
> > You haven't given us much to go on. What is the cardinality of the fields
> > you're faceting on? What does your query look like? How are you measuring
> > time? What is the output if you add =true?
> >
> > In short, your question is far too vague to give any meaningful
> > information, there could be any of a dozen recommendations.
> >
> > Best
> > Erick
> > On Feb 26, 2016 18:01, "Simona Russo"  wrote:
> >
> > > Hi all,
> > >
> > > we use Lucene *Facet* library version* 4.7.2.*
> > >
> > > We have an *index* with *45 millions *of documents (size about 15 GB)
> > and
> > > a *taxonomy* index with *57* millions of documents (size about 2 GB).
> > >
> > > The total *facet search* time achieve *15 seconds*!
> > >
> > > Is it possible to improve this time? Is there any tips to *configure*
> the
> > > *taxonomy* index to avoid this waste of time?
> > >
> > >
> > > Thanks in advance
> > >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: how to backup lucene index file

2016-01-13 Thread Shai Erera
You should use Lucene's replicator module, which helps you take backups
from live snapshots of your index, even while indexing happens. You can
read about how to use it here:
http://shaierera.blogspot.co.il/2013/05/the-replicator.html

Shai

On Wed, Jan 13, 2016, 19:14 Erick Erickson  wrote:

> Just copy the index directory, it's self contained. I'd
> make sure I wasn't actively indexing to it and
> I'd committed all my indexing first, but that's all.
>
> On Wed, Jan 13, 2016 at 8:33 AM, 鞠朕  wrote:
> > Hi, I am using Lucene to build a full text search system. I put the
> index file in some directory on my server. Considering robustness, I think I
> should backup the index file somewhere else. If the index file is broken,
> I can switch to the backup one. Can you tell me how to do this, what
> API to use, and can you give me a simple demo? Thanks, From juzhen
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: SOLR/LUCENE 5.2.1: Solution of CharTermAtt, StartOffset, EndOffset, Position

2015-08-07 Thread Shai Erera
I think you can just write a TokenFilter which sets the
PositionIncrementAttribute of every other token to 0. Then you can use
StandardTokenizer and wrap it with that filter.
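
For illustration, a minimal sketch of such a filter (the class name is made up;
it stacks every second token on the previous position):

public final class EveryOtherZeroPosIncFilter extends TokenFilter {
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private boolean zeroNext = false;

  public EveryOtherZeroPosIncFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (zeroNext) {
      posIncAtt.setPositionIncrement(0); // stack this token on the previous position
    }
    zeroNext = !zeroNext;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    zeroNext = false;
  }
}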

Shai
On Aug 8, 2015 6:33 AM, Văn Châu vankimc...@gmail.com wrote:

 Hi,

 I'm looking a solution for the following format in solr/lucene 5.2.1
 version:
 Text eg: fast wi fi network is down. If using
 solr.StandardTokenizerFactory , I have the Position  corresponding to
 displayed : fast ( 1 ) -  wi ( 2 ) -  fi ( 3 ) -  Network ( 4 ) -  is (
 5 ) - -  down ( 6 ) . But I need you just create a new custom or class to
 the question above is fast wi fi network is down but the analysis is
 currently Position as follows : fast ( 1 ) -  fi ( 2 ) -  is ( 3 ) or wi
 ( 1 ) -  network ( 2 ) -  down ( 3 ) . I know it involves startOffset ,
 endOffset ... but I can not figure out how to solve?
 Thanks in advance!




 ---
 VĂN KIM CHÂU
 [P]: +84.933.233.047



Re: How to merge several Taxonomy indexes

2015-04-02 Thread Shai Erera
In some cases, MMapDirectory offers even better performance, since the JVM
doesn't need to manage that RAM when it's doing GC.

Also, using only RAMDirectory is not safe in that if the JVM crashes, your
index is lost.

On Thu, Apr 2, 2015 at 12:54 PM, Christoph Kaser lucene_l...@iconparc.de
wrote:

 Hi Gimantha,

 why do you use a RAMDirectory? If your merged index fits into RAM
 completely, a MMapDirectory should offer almost the same performance. And
 if not, it is definitely the better choice.

 Regards
 Christoph


 Am 02.04.2015 um 12:38 schrieb Gimantha Bandara:

 Hi All,

 I have successfully setup a merged indices and drilldown and usual search
 operations work perfect.
 But, I have a side question. If I selected RAMDirectory as the destination
 Indices in merging, probably the jvm can go out of memory if the merged
 indices are too big. Is there a way I can handle this issue?

 On Tue, Mar 24, 2015 at 12:18 PM, Gimantha Bandara giman...@wso2.com
 wrote:

  Hi Christoph,

 My mistake. :) It does the exactly what i need. figured it out later..
 Thanks a lot!

 On Tue, Mar 24, 2015 at 3:14 AM, Gimantha Bandara giman...@wso2.com
 wrote:

  Hi Christoph,

 I think TaxonomyMergeUtils is to merge a taxonomy directory and an index
 together (Correct me if I am wrong). Can it be used to merge several
 taxonomyDirectories together and create one taxonomy index?

 On Mon, Mar 23, 2015 at 9:19 PM, Christoph Kaser 
 lucene_l...@iconparc.de

 wrote:
 Hi Gimantha,

 have a look at the class org.apache.lucene.facet.
 taxonomy.TaxonomyMergeUtils,
 which does exactly what you need.

 Best regards,
 Christoph

 Am 23.03.2015 um 15:44 schrieb Gimantha Bandara:

  Hi all,

 Can anyone point me how to merge several taxonomy indexes? My
 requirement
 is as follows. I have  several taxonomy indexes and normal document
 indexes. I want to merge taxonomy indexes together and other document
 indexes together and perform search on them. One part I have figured
 out.
 It is easy. To Merge document indexes, all I have to do is create a
 MultiReader and pass it to IndexSearcher. But I am stuck at merging
 the
 taxonomy indexes. Is there a way to merge taxonomy indexes?


  --
 Dipl.-Inf. Christoph Kaser

 IconParc GmbH
 Sophienstrasse 1
 80333 München

 www.iconparc.de

 Tel +49 -89- 15 90 06 - 21
 Fax +49 -89- 15 90 06 - 49

 Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer.
 HRB
 121830, Amtsgericht München



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


  --
 Gimantha Bandara
 Software Engineer
 WSO2. Inc : http://wso2.com
 Mobile : +94714961919

  --
 Gimantha Bandara
 Software Engineer
 WSO2. Inc : http://wso2.com
 Mobile : +94714961919




 --
 Dipl.-Inf. Christoph Kaser

 IconParc GmbH
 Sophienstrasse 1
 80333 München

 www.iconparc.de

 Tel +49 -89- 15 90 06 - 21
 Fax +49 -89- 15 90 06 - 49

 Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer. HRB
 121830, Amtsgericht München



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: How to merge several Taxonomy indexes

2015-04-02 Thread Shai Erera
MMapDirectory uses memory-mapped files. This is an operating system level
feature, where even though the file resides on disk, the OS can memory-map
it and access it more efficiently. It is loaded into memory outside the JVM
heap, and usually on a properly configured server you should not worry
about running out of memory, since if the file cannot be brought into
memory, it's accessed from disk.
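
For illustration, a minimal sketch of opening such an index (the path is made
up; this uses the 4.x API where MMapDirectory takes a File, while 5.x and later
take a java.nio.file.Path):

Directory dir = new MMapDirectory(new File("/path/to/merged-index")); // memory-mapped, off-heap
DirectoryReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);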

You mentioned that you store the index in a DB, which is distributed. Have
you considered using Solr for managing your distributed index? It might be
better than storing it in a DB, merging taxonomies for search etc. and Solr
has quite rich faceted search capabilities.

On Thu, Apr 2, 2015 at 1:51 PM, Gimantha Bandara giman...@wso2.com wrote:

 Btw I was using a RAMDirectory for just testing purposes..

 On Thu, Apr 2, 2015 at 5:16 PM, Gimantha Bandara giman...@wso2.com
 wrote:

  Hi Christoph and Shai,
 
  Thanks for the quick response!.
  Indices are stored in a relational database ( using a custom Directory
  implementation ). The Problem comes since the indices are sharded (both
  taxonomy indices and normal doc indices), when a user wants to
 drilldown, I
  have to merge all the indices. For that I used mergeUtils (which
  worksperfect). For now I am using RAMDirectory as the merged indices.
  Anyway The indices can grow to a bigger size as time goes. MMapDirectory
  again uses memory right? Can It deal with possible out of memory issue?
 
  I am thinking of using the same Database to store the merged indices. But
  the problem is the original sharded indices can be updated, when new
  entries come in. So the merged final indices also needs to be updated
  accordingly.
 
  On Thu, Apr 2, 2015 at 4:55 PM, Shai Erera ser...@gmail.com wrote:
 
  In some cases, MMapDirectory offers even better performance, since the
 JVM
  doesn't need to manage that RAM when it's doing GC.
 
  Also, using only RAMDirectory is not safe in that if the JVM crashes,
 your
  index is lost.
 
  On Thu, Apr 2, 2015 at 12:54 PM, Christoph Kaser 
 lucene_l...@iconparc.de
  
  wrote:
 
   Hi Gimantha,
  
   why do you use a RAMDirectory? If your merged index fits into RAM
   completely, a MMapDirectory should offer almost the same performance.
  And
   if not, it is definitely the better choice.
  
   Regards
   Christoph
  
  
   Am 02.04.2015 um 12:38 schrieb Gimantha Bandara:
  
   Hi All,
  
   I have successfully setup a merged indices and drilldown and usual
  search
   operations work perfect.
   But, I have a side question. If I selected RAMDirectory as the
  destination
   Indices in merging, probably the jvm can go out of memory if the
 merged
   indices are too big. Is there a way I can handle this issue?
  
   On Tue, Mar 24, 2015 at 12:18 PM, Gimantha Bandara 
 giman...@wso2.com
   wrote:
  
Hi Christoph,
  
   My mistake. :) It does the exactly what i need. figured it out
 later..
   Thanks a lot!
  
   On Tue, Mar 24, 2015 at 3:14 AM, Gimantha Bandara 
 giman...@wso2.com
   wrote:
  
Hi Christoph,
  
   I think TaxonomyMergeUtils is to merge a taxonomy directory and an
  index
   together (Correct me if I am wrong). Can it be used to merge
 several
   taxonomyDirectories together and create one taxonomy index?
  
   On Mon, Mar 23, 2015 at 9:19 PM, Christoph Kaser 
   lucene_l...@iconparc.de
  
   wrote:
   Hi Gimantha,
  
   have a look at the class org.apache.lucene.facet.
   taxonomy.TaxonomyMergeUtils,
   which does exactly what you need.
  
   Best regards,
   Christoph
  
   Am 23.03.2015 um 15:44 schrieb Gimantha Bandara:
  
Hi all,
  
   Can anyone point me how to merge several taxonomy indexes? My
   requirement
   is as follows. I have  several taxonomy indexes and normal
 document
   indexes. I want to merge taxonomy indexes together and other
  document
   indexes together and perform search on them. One part I have
  figured
   out.
   It is easy. To Merge document indexes, all I have to do is
 create a
   MultiReader and pass it to IndexSearcher. But I am stuck at
 merging
   the
   taxonomy indexes. Is there a way to merge taxonomy indexes?
  
  
--
   Dipl.-Inf. Christoph Kaser
  
   IconParc GmbH
   Sophienstrasse 1
   80333 München
  
   www.iconparc.de
  
   Tel +49 -89- 15 90 06 - 21
   Fax +49 -89- 15 90 06 - 49
  
   Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven
  Angerer.
   HRB
   121830, Amtsgericht München
  
  
  
  
  -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  
--
   Gimantha Bandara
   Software Engineer
   WSO2. Inc : http://wso2.com
   Mobile : +94714961919
  
--
   Gimantha Bandara
   Software Engineer
   WSO2. Inc : http://wso2.com
   Mobile : +94714961919
  
  
  
  
   --
   Dipl.-Inf. Christoph Kaser
  
   IconParc GmbH
   Sophienstrasse 1
   80333 München
  
   www.iconparc.de

Re: Sampled Hit counts using Lucene Facets.

2015-03-11 Thread Shai Erera
OK, yes, then sampling isn't the right word. So what you would want is an
API like "count facets in N buckets between a range of [min..max]
values". That would create the ranges for you, and then you would be able to
use the RangeFacetCounts as usual.

Would you like to open a JIRA issue and post a patch? I guess it can either
be an additional constructor on LongRangeFacetCounts (and Double), or a
separate utility class which given min/max values and numBuckets, creates
the proper Range[]?
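
For illustration, a minimal sketch of such a utility (not part of Lucene; the
field name, labels and bucket count are made up):

static LongRange[] buckets(long min, long max, int numBuckets) {
  LongRange[] ranges = new LongRange[numBuckets];
  long width = (max - min) / numBuckets;
  for (int i = 0; i < numBuckets; i++) {
    long lo = min + i * width;
    boolean last = (i == numBuckets - 1);
    long hi = last ? max : lo + width;
    // upper bound is exclusive except for the last bucket, so buckets don't overlap
    ranges[i] = new LongRange(lo + "-" + hi, lo, true, hi, last);
  }
  return ranges;
}

// usage: count 72 five-minute buckets over a "timestamp" field
Facets facets = new LongRangeFacetCounts("timestamp", fc, buckets(minTs, maxTs, 72));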

Shai

On Tue, Mar 10, 2015 at 4:07 PM, Gimantha Bandara giman...@wso2.com wrote:

 Hi Shai,

 Yes, Splitting ranges into smaller ranges is not as same as sampling. I
 have used the wrong word there. I think RandomSamplingFacetsCollector is
 for sampling a larger dataset and that class cannot be used to implement
 the described example above. I think I ll have to prepare the Ranges
 manually and pass them to LongRangeFacetsCounts.

 On Tue, Mar 10, 2015 at 4:54 PM, Shai Erera ser...@gmail.com wrote:

  I am not sure that splitting the ranges into smaller ranges is the same
 as
  sampling.
 
  Take a look RandomSamplingFacetsCollector - it implements sampling by
  sampling the document space, not the facet values space.
 
  So if for instance you use a LongRangeFacetCounts in conjunction with a
  RandomSamplingFacetsCollector, you would get the matching documents space
  sampled, and the counts you would get for each range could be considered
  sampled too. This is at least how we implemented facet sampling.
 
  Shai
 
  On Tue, Mar 10, 2015 at 10:21 AM, Gimantha Bandara giman...@wso2.com
  wrote:
 
   What I am planning to do is, split the given time range into smaller
 time
   ranges  by myself and pass them to a LongRangeFacetsCount object and
 get
   the counts for each sub range. Is this the correct way?
  
   On Tue, Mar 10, 2015 at 12:01 AM, Gimantha Bandara giman...@wso2.com
   wrote:
  
Any updates on this please? Do I have to write my own code to sample
  and
get the hitcount?
   
On Sat, Mar 7, 2015 at 2:14 PM, Gimantha Bandara giman...@wso2.com
wrote:
   
Any help on this please?
   
On Fri, Mar 6, 2015 at 3:13 PM, Gimantha Bandara giman...@wso2.com
 
wrote:
   
Hi,
   
I am trying to create some APIs using lucene facets APIs. First I
  will
explain my requirement with an example. Lets say I am keeping track
  of
   the
count of  people who enter through a certain door. Lets say the
 time
   range
I am interested in Last 6 hours( to get the total count, I know
 that
  I
   ll
have to use Ranged Facets). How do I sample this time range and get
  the
counts of each sample? In other words, as an example, If I split
 the
   last
6 hours into 5 minutes samples, I get 72 (6*60/5 ) different time
   ranges. I
would be interested in getting hit counts for each of these 72
 ranges
   in an
array with the respective lower bound of each sample. Can someone
   point me
the direction I should follow/ the classes which can be helpful
   looking at?
ElasticSearch already has this feature exposed by their Javascript
  API.
   
Is it possible to implement the same with lucene?
Is there a Facets user guide for lucene 4.10.3 or lucene 5.0.0 ?
   
Thanks,
   
--
Gimantha Bandara
Software Engineer
WSO2. Inc : http://wso2.com
Mobile : +94714961919
   
   
   
   
--
Gimantha Bandara
Software Engineer
WSO2. Inc : http://wso2.com
Mobile : +94714961919
   
   
   
   
--
Gimantha Bandara
Software Engineer
WSO2. Inc : http://wso2.com
Mobile : +94714961919
   
  
  
  
   --
   Gimantha Bandara
   Software Engineer
   WSO2. Inc : http://wso2.com
   Mobile : +94714961919
  
 



 --
 Gimantha Bandara
 Software Engineer
 WSO2. Inc : http://wso2.com
 Mobile : +94714961919



Re: Filtering question

2015-03-11 Thread Shai Erera
I don't see that you use acceptDocs in your MyNDVFilter. I think it would
return false for all userB docs, but you should confirm that.

Anyway, because you use an NDV field, you can't automatically skip
unrelated documents, but rather your code would look something like:

for (int i = 0; i < reader.maxDoc(); i++) {
  if (!acceptDocs.get(i)) {
    continue;
  }
  // document is accepted, read values
  ...
}

Shai

On Wed, Mar 11, 2015 at 1:25 PM, Ian Lea ian@gmail.com wrote:

 Can you use a BooleanFilter (or ChainedFilter in 4.x) alongside your
 BooleanQuery?   Seems more logical and I suspect would solve the problem.
 Caching filters can be good too, depending on how often your data changes.
 See CachingWrapperFilter.

 --
 Ian.


 On Tue, Mar 10, 2015 at 12:45 PM, Chris Bamford cbamf...@mimecast.com
 wrote:

 
   Hi,
 
   I have an index of 30 docs, 20 of which have an owner field of UserA
  and 10 of UserB.
  I also have a query which consists of:
 
   BooleanQuery:
  -- Clause 1: TermQuery
  -- Clause 2: FilteredQuery
  - Branch 1: MatchAllDocsQuery()
  - Branch 2: MyNDVFilter
 
   I execute my search as follows:
 
   searcher.search(booleanQuery,
  new TermFilter(new Term("owner", "UserA")),
  50);
 
   The TermFilter's job is to reduce the number of searchable documents
  from 30 to 20, which it does for all clauses of the BooleanQuery except
 for
  MyNDVFilter which iterates through the full 30 docs, 10 needlessly.  How
  can I restrict it so it behaves the same as the other query branches?
 
   MyNDVFilter source code:
 
   public class MyNDVFilter extends Filter {
 
   private String fieldName;
  private String matchTag;
 
   public MyNDVFilter(String ndvFieldName, String matchTag) {
  this.fieldName = ndvFieldName;
  this.matchTag = matchTag;
  }
 
   @Override
  public DocIdSet getDocIdSet(AtomicReaderContext context, Bits
  acceptDocs) throws IOException {
 
   AtomicReader reader = context.reader();
  int maxDoc = reader.maxDoc();
  final FixedBitSet bitSet = new FixedBitSet(maxDoc);
  BinaryDocValues ndv = reader.getBinaryDocValues(fieldName);
 
   if (ndv != null) {
   for (int i = 0; i < maxDoc; i++) {
  BytesRef br = ndv.get(i);
   if (br.length > 0) {
  String strval = br.utf8ToString();
  if (strval.equals(matchTag)) {
  bitSet.set(i);
   System.out.println("MyNDVFilter " + matchTag +
    " matched " + i + " [" + strval + "]");
  }
  }
  }
  }
 
   return new DVDocSetId(bitSet);// just wraps a FixedBitSet
  }
  }
 
 
 
Chris Bamford m: +44 7860 405292  w: www.mimecast.com  Senior
 Developer p:
  +44 207 847 8700 Address click here
  http://www.mimecast.com/About-us/Contact-us/
  --
 
 



Re: Sampled Hit counts using Lucene Facets.

2015-03-10 Thread Shai Erera
I am not sure that splitting the ranges into smaller ranges is the same as
sampling.

Take a look at RandomSamplingFacetsCollector - it implements sampling by
sampling the document space, not the facet values space.

So if for instance you use a LongRangeFacetCounts in conjunction with a
RandomSamplingFacetsCollector, you would get the matching documents space
sampled, and the counts you would get for each range could be considered
sampled too. This is at least how we implemented facet sampling.
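
For illustration, a minimal sketch of that combination (the field name, range
and sample size are made up):

RandomSamplingFacetsCollector fc = new RandomSamplingFacetsCollector(10000); // sample size
FacetsCollector.search(searcher, query, 10, fc);
// count ranges over the sampled document space
Facets facets = new LongRangeFacetCounts("timestamp", fc,
    new LongRange("last hour", nowMillis - 3600000L, true, nowMillis, true));
FacetResult result = facets.getTopChildren(10, "timestamp");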

Shai

On Tue, Mar 10, 2015 at 10:21 AM, Gimantha Bandara giman...@wso2.com
wrote:

 What I am planning to do is, split the given time range into smaller time
 ranges  by myself and pass them to a LongRangeFacetsCount object and get
 the counts for each sub range. Is this the correct way?

 On Tue, Mar 10, 2015 at 12:01 AM, Gimantha Bandara giman...@wso2.com
 wrote:

  Any updates on this please? Do I have to write my own code to sample and
  get the hitcount?
 
  On Sat, Mar 7, 2015 at 2:14 PM, Gimantha Bandara giman...@wso2.com
  wrote:
 
  Any help on this please?
 
  On Fri, Mar 6, 2015 at 3:13 PM, Gimantha Bandara giman...@wso2.com
  wrote:
 
  Hi,
 
  I am trying to create some APIs using lucene facets APIs. First I will
  explain my requirement with an example. Lets say I am keeping track of
 the
  count of  people who enter through a certain door. Lets say the time
 range
  I am interested in Last 6 hours( to get the total count, I know that I
 ll
  have to use Ranged Facets). How do I sample this time range and get the
  counts of each sample? In other words, as an example, If I split the
 last
  6 hours into 5 minutes samples, I get 72 (6*60/5 ) different time
 ranges. I
  would be interested in getting hit counts for each of these 72 ranges
 in an
  array with the respective lower bound of each sample. Can someone
 point me
  the direction I should follow/ the classes which can be helpful
 looking at?
  ElasticSearch already has this feature exposed by their Javascript API.
 
  Is it possible to implement the same with lucene?
  Is there a Facets user guide for lucene 4.10.3 or lucene 5.0.0 ?
 
  Thanks,
 
  --
  Gimantha Bandara
  Software Engineer
  WSO2. Inc : http://wso2.com
  Mobile : +94714961919
 
 
 
 
  --
  Gimantha Bandara
  Software Engineer
  WSO2. Inc : http://wso2.com
  Mobile : +94714961919
 
 
 
 
  --
  Gimantha Bandara
  Software Engineer
  WSO2. Inc : http://wso2.com
  Mobile : +94714961919
 



 --
 Gimantha Bandara
 Software Engineer
 WSO2. Inc : http://wso2.com
 Mobile : +94714961919



Re: Faceted Search Hierarchy

2015-01-08 Thread Shai Erera
Lucene does not understand the word India, therefore the facets that are
actually indexed are:

Doc1: Asia + Asia/India
Doc2: India + India/Gujarat

When you ask for top children, you will get Asia + India, both with a count
of 1.

Shai

On Thu, Jan 8, 2015 at 1:48 PM, Jigar Shah jigaronl...@gmail.com wrote:

 Very simple question, on facet

 Index has 2 documents as follows:

 Doc1
 Indexed facet path: Asia/India
 Doc2
 Indexed facet path: India/Gujarat


 Now while faceted search

 facets.getTopChildren()

 Will it return 1(Asia) result or 2(Asia, India) ?

 So basically will it join values and return hierarchy ?

 Thanks,



Re: Faceted Search Hierarchy

2015-01-08 Thread Shai Erera
Not automatically. There's no reason to assume that 'India' is the same in
'India/Gujarat' and 'Asia/India'. Furthermore, if you first add a document
with India/Gujarat and later add a document Asia/India, we cannot go back
to the other document and update the hierarchy.
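
If the application itself always indexes the full path, the counts do roll up.
A minimal sketch (the dimension name is illustrative, and it assumes the
application knows the complete hierarchy for each document):

FacetsConfig config = new FacetsConfig();
config.setHierarchical("Region", true);

Document doc1 = new Document();
doc1.add(new FacetField("Region", "Asia", "India"));
writer.addDocument(config.build(taxoWriter, doc1));

Document doc2 = new Document();
doc2.add(new FacetField("Region", "Asia", "India", "Gujarat"));
writer.addDocument(config.build(taxoWriter, doc2));

// getTopChildren(10, "Region") should now return Asia (2), and
// getTopChildren(10, "Region", "Asia") should return India (2).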

On Thu, Jan 8, 2015 at 3:27 PM, Jigar Shah jigaronl...@gmail.com wrote:

 Is there some way to achieve this at Lucene level. so i can get facet like
 below ?

 Doc1: Asia + Asia/India
 Doc2: India + Asia/India/Gujarat

 Which can result into this:

 Asia/India/Gujarat (2)

 Can Lucene internally index like above, as 'India' value already exist as
 path of some other document ?
 Or some other ways that can be explored within Lucene.



 On Thu, Jan 8, 2015 at 5:26 PM, Shai Erera ser...@gmail.com wrote:

  Lucene does not understand the word India, therefore the facets that
 are
  actually indexed are:
 
  Doc1: Asia + Asia/India
  Doc2: India + India/Gujarat
 
  When you ask for top children, you will get Asia + India, both with a
 count
  of 1.
 
  Shai
 
  On Thu, Jan 8, 2015 at 1:48 PM, Jigar Shah jigaronl...@gmail.com
 wrote:
 
   Very simple question, on facet
  
   Index has 2 documents as follows:
  
   Doc1
   Indexed facet path: Asia/India
   Doc2
   Indexed facet path: India/Gujarat
  
  
   Now while faceted search
  
   facets.getTopChildren()
  
   Will it return 1(Asia) result or 2(Asia, India) ?
  
   So basically will it join values and return hierarchy ?
  
   Thanks,
  
 



Re: Facet Result Order

2014-12-14 Thread Shai Erera
Hi Mrugesh,

This is strange indeed, as the facets are ordered by count, and we use a
facet ordinal (integer code) as a tie breaker. What do you mean by
refreshed? Do you have a sample test that shows this behavior?

Shai

On Fri, Dec 12, 2014 at 8:37 AM, patel mrugesh patelmruge...@yahoo.co.in
wrote:


 Hi All,
 I am working on Lucene Facet now a day and facet seems working fine. Just
 one thing that come to my attention is, order of facet results get changed
 if there is same total count.
 For example, for country facet following results have been noticed.
 First time:
 - USA(10)- India(9)- UK(9)

 When refreshed, second time,
 - USA(10)- UK(9)
 - India(9)
 When refreshed, third time,
 - USA(10)- India(9)- UK(9)
 It would be great if I can have same result every time, I mean order of
 the result should come same even there is count is same ( in our example
 either India should come second every time or UK should come second time
 every time).

 Thanks in advance,Mrugesh



Re: Index replication strategy

2014-12-04 Thread Shai Erera
Do you use Lucene or Solr? Lucene also has a replication module, which will
allow you to replicate index changes.

On Thu, Dec 4, 2014 at 4:19 PM, Vijay B vijay.nip...@gmail.com wrote:

 Hello,

 We index docs coming from database nightly. Current index is sitting on
 NFS. Due to obvious performance reasons, we are switching are planning to
 switch to local index. W have cluster of 4 servers and with NFS it was not
 a problem for us until now to share the index. but going forward, we are
 looking for our design options for index replication on to local storage.

 Our setup:
 Index size: 8GB (grows by 2GB every year)
 Lucene: 4.2.1
 64-bit Java

 The options we considered:
  *  Each server instance,hosting a nightly job to pull delta of data from
 db. But, this would result in high DB load. (4 severs =4 times the load)

  * An additional nightly job sitting on another sever, that pushes the data
 on to local disks of each instances..This may not work out as the local
 disk may not be visible.

 * each sever hosting a replication job that pulls delta of data from NFS
 and stores in the local index...so far this is the only promising option we
 have.

 * Does solr an option for us in this case? (I know it's a question for solr
 group..but experts here might have some thoughts..)..

 Thank you for your attention.



Re: Index replication strategy

2014-12-04 Thread Shai Erera
Ooops, didn't notice that :).

So you'll need to upgrade to Lucene 4.4.0 in order to use it. You can read
some details as well as example code here:
http://shaierera.blogspot.com/2013/05/the-replicator.html.

Shai

On Thu, Dec 4, 2014 at 4:36 PM, Vijay B vijay.nip...@gmail.com wrote:

 As indicated in my post, we use Lucene 4.2.1.

 On Thu, Dec 4, 2014 at 9:29 AM, Shai Erera ser...@gmail.com wrote:

  Do you use Lucene or Solr? Lucene also has a replication module, which
 will
  allow you to replicate index changes.
 
  On Thu, Dec 4, 2014 at 4:19 PM, Vijay B vijay.nip...@gmail.com wrote:
 
   Hello,
  
   We index docs coming from database nightly. Current index is sitting on
   NFS. Due to obvious performance reasons, we are switching are planning
 to
   switch to local index. W have cluster of 4 servers and with NFS it was
  not
   a problem for us until now to share the index. but going forward, we
 are
   looking for our design options for index replication on to local
 storage.
  
   Our setup:
   Index size: 8GB (grows by 2GB every year)
   Lucene: 4.2.1
   64-bit Java
  
   The options we considered:
*  Each server instance,hosting a nightly job to pull delta of data
 from
   db. But, this would result in high DB load. (4 severs =4 times the
 load)
  
* An additional nightly job sitting on another sever, that pushes the
  data
   on to local disks of each instances..This may not work out as the local
   disk may not be visible.
  
   * each sever hosting a replication job that pulls delta of data from
 NFS
   and stores in the local index...so far this is the only promising
 option
  we
   have.
  
   * Does solr an option for us in this case? (I know it's a question for
  solr
   group..but experts here might have some thoughts..)..
  
   Thank you for your attention.
  
 



Re: hierarchical facets

2014-11-25 Thread Shai Erera
Yes, hierarchical faceting in Lucene is only supported by the taxonomy
index, at least currently.

Shai

On Tue, Nov 25, 2014 at 3:46 PM, Vincent Sevel v.se...@lombardodier.com
wrote:

 hi,
 I saw that SortedSetDocValuesFacetCounts does not support hierarchical
 facets.
 Is that to say that hierarchical facets are only supported through the
 Taxonomy index?
 I am using lucene 4.7.2.
 Regards,
 vince


  DISCLAIMER 
 This message is intended only for use by the person to
 whom it is addressed. It may contain information that is
 privileged and confidential. Its content does not constitute
 a formal commitment by Bank Lombard Odier  Co Ltd or any
 of its branches or affiliates. If you are not the intended recipient
 of this message, kindly notify the sender immediately and
 destroy this message. Thank You.
 *



Re: Lucene not showing Low Score Doc

2014-10-27 Thread Shai Erera
Hi

Your question is a bit fuzzy -- what do you mean by not showing low
scores? Are you sure that these 2 documents are matched by the query? Can
you boil it down to a short test case that demonstrates the problem?

In general though, when you search through IndexSearcher.search(Query, int),
you won't get all matching documents, but only the number that you
specified (that's the 'int' that you pass). I don't think that's the
problem you're describing though, as it sounds like there are only 10
documents, and the default is to return the top-10.

Again, if you have a short test that demonstrates the problem, that would
be good.

Shai

On Mon, Oct 27, 2014 at 2:39 PM, Priyanka Tufchi 
priyanka.tuf...@launchship.com wrote:

 Hi All

 Actually I have set of 10 doc which i gave for comparison  through apache
 lucene now when i check score for the set ,out of 10 i am getting 8 in my
 database , rest 2 are not showing . If the score is very less still lucene
 should show something , how can i handle it as i have to show all 10 score
 index.



 Thanks
 Priyanka

 --
 Launchship Technology  respects your privacy. This email is intended only
 for the use of the party to which it is addressed and may contain
 information that is privileged, confidential, or protected by law. If you
 have received this message in error, or do not want to receive any further
 emails from us, please notify us immediately by replying to the message and
 deleting it from your computer.



Re: Lucene not showing Low Score Doc

2014-10-27 Thread Shai Erera
I'm sorry, I still don't feel like I have all the information in order to
help with the problem that you're seeing. Can you at least paste the
contents of the documents and the query?

Can you search with a TotalHitCountCollector only, and print the total
number of hits?
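
For example, a minimal sketch of that check:

TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(query, collector); // no top-docs collection, just counting
System.out.println("total hits: " + collector.getTotalHits());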

Shai

On Mon, Oct 27, 2014 at 3:36 PM, Priyanka Tufchi 
priyanka.tuf...@launchship.com wrote:

 Hi
 Actually , It should give 10 docs match index but it is giving for 8 . I
 checked rest 2 are not matching doc with very less score . Is there any way
 I can get those two doc which have not matched.

 And I have set hitpage =10 .


 Thanks
 Priyanka


 On Mon, Oct 27, 2014 at 6:14 AM, Shai Erera ser...@gmail.com wrote:

  Hi
 
  Your question is a bit fuzzy -- what do you mean by not showing low
  scores? Are you sure that these 2 documents are matched by the query?
 Can
  you boil it down to a short test case that demonstrates the problem?
 
  In general though, when you search through IndexSearch.search(Query,
 int),
  you won't get all matching documents, but only the number that you
  specified (that's the 'int' that you pass). I don't think that's the
  problem you're describing through as it sounds like there are only 10
  documents, and the default is to return the top-10.
 
  Again, if you have a short test that demonstrates the problem, that would
  be good.
 
  Shai
 
  On Mon, Oct 27, 2014 at 2:39 PM, Priyanka Tufchi 
  priyanka.tuf...@launchship.com wrote:
 
   Hi All
  
   Actually I have set of 10 doc which i gave for comparison  through
 apache
   lucene now when i check score for the set ,out of 10 i am getting 8 in
 my
   database , rest 2 are not showing . If the score is very less still
  lucene
   should show something , how can i handle it as i have to show all 10
  score
   index.
  
  
  
   Thanks
   Priyanka
  
   --
   Launchship Technology  respects your privacy. This email is intended
 only
   for the use of the party to which it is addressed and may contain
   information that is privileged, confidential, or protected by law. If
 you
   have received this message in error, or do not want to receive any
  further
   emails from us, please notify us immediately by replying to the message
  and
   deleting it from your computer.
  
 

 --
 Launchship Technology  respects your privacy. This email is intended only
 for the use of the party to which it is addressed and may contain
 information that is privileged, confidential, or protected by law. If you
 have received this message in error, or do not want to receive any further
 emails from us, please notify us immediately by replying to the message and
 deleting it from your computer.



Re: Exception from FastTaxonomyFacetCounts

2014-10-15 Thread Shai Erera
Yes, SearcherTaxonomyManager returns a SearcherAndTaxonomy containing a
sync'd IndexSearcher and DirectoryTaxonomyReader.

Shai

On Mon, Oct 13, 2014 at 12:15 PM, Jigar Shah jigaronl...@gmail.com wrote:

 In my application i have two intances of SearcherManager.

 1) SearcherManager with 'applyAllDeletes = true' which is used by Indexer.
 (Works in NRT mode, deletes should be visible to it, also i have
 ControlledRealTimeReopenThread, which refeshes searcher)
 2) SearcherManager with 'applyAllDeletes = false' which is used by searcher
 (Only performs search, javadoc says, we may gain some performance if
 'false', as it will not wait for flushing deletes,).

 I have intoduced Taxonomy Facets in my applicaiton. Should i replace both
 SearcherManager by SearcherTaxonomyManager (one with applyAllDeletes=true
 and another applyAllDeletes=false)

 Will IndexSearcher and TaxonomyReader be in sync, in both
 SearcherTaxonomyManager ?

 On Fri, Oct 10, 2014 at 12:08 AM, Shai Erera ser...@gmail.com wrote:

  This usually means that your IndexReader and TaxonomyReader are out of
  sync. That is, the IndexReader sees category ordinals that the
  TaxonomyReader does not yet see.
 
  Do you use SearcherTaxonomyManager in your application? It ensures that
 the
  two are always in sync, i.e. reopened together and that your application
  always sees a consistent view of the two.
 
  Shai
 
  On Tue, Oct 7, 2014 at 10:03 AM, Jigar Shah jigaronl...@gmail.com
 wrote:
 
   Intermittently while search i am getting this exception on huge index.
   (FacetsConfig used while indexing and searching is same.)
  
   java.lang.ArrayIndexOutOfBoundsException: 252554
   06:28:37,954 ERROR [stderr] at
  
  
 
 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.count(FastTaxonomyFacetCounts.java:73)
   06:28:37,954 ERROR [stderr] at
  
  
 
 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.init(FastTaxonomyFacetCounts.java:49)
   06:28:37,954 ERROR [stderr] at
  
  
 
 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.init(FastTaxonomyFacetCounts.java:39)
   06:28:37,954 ERROR [stderr] at
  
  
 
 com.company.search.CustomDrillSideways.buildFacetsResult(LuceneDrillSideways.java:41)
   06:28:37,954 ERROR [stderr] at
   org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:146)
   06:28:37,955 ERROR [stderr] at
   org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:203)
  
   Thanks,
   Jigar Shah
  
 



Re: Delete / Update facets from taxonomy index

2014-10-09 Thread Shai Erera
Hi

You cannot remove facets from the taxonomy index, but you can reindex a
single document and update its facets. This will add new facets to the
taxonomy index (if they do not already exist). You do that just like you
reindex any document, by calling IndexWriter.updateDocument(). Just make
sure to rebuild the document with FacetsConfig.

Shai

On Tue, Oct 7, 2014 at 12:42 AM, wesli we...@hotmail.com wrote:

 I'm using lucene for a full text search on a online store.
 I've build a indexer program which creates a lucene and a taxonomy index.
 The taxonomy index contains facets with categories and article features
 (like color, brand, etc.).
 Is it possible to re-add or update single document facets? F.g. the shop
 owner changes the category of an article or some feature (like color f.g.).
 As I read in the documentation, the taxonomy index can be rebuild but it is
 not possible to re-add (delete and add) facets.
 I don't want to rebuild the whole taxonomy index each time when some single
 article (document) facet is changed.
 Is there another solution to update the taxonomy index?
 I'm using lucene 4.10

 Regards



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Delete-Update-facets-from-taxonomy-index-tp4163014.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: topdocs per facet

2014-10-09 Thread Shai Erera
The facets translation should be done at the application level. So if you
index the dimension A w/ two facets A/A1 and A/A2, where A1 should also be
translated to B1 and A2 translated to B2, there are several options:

Index the dimensions A and B with their respective facets, and count the
relevant dimension based on the user's locale. Then the user can drill-down
on any of the returned facets easily. I'd say that if your index and/or
taxonomy aren't big, this is the easiest solution and most straightforward
to implement.

Another way is to index the facet Root/R1 and Root/R2, which are
language-independent. At the application level you translate Root/R1 to
either A/A1 or B/B1 based on the user locale. You also then do the reverse
translation when the user drills-down. So e.g. if the user clicked A/A1,
you translate that to Root/R1 and drill-down on that. If your application
is UI based, you probably can return e.g a JSON construct which contains
the labels to display + the facet values to drill-down by and then you
don't need to do any reverse translation.

As for retrieving a document's facets, you can either index them as
separate StoredFields (easy), or use DocValuesOrdinalsReader to traverse
the facets list along with the MatchingDocs, read the facet ordinals and
translate them. If it sounds complex, just use StoredFields :).
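
A minimal sketch of the StoredField option (field names are made up):

Document doc = new Document();
doc.add(new FacetField("Category", "A1"));          // counted / drilled-down on
doc.add(new StoredField("Category_stored", "A1"));  // retrievable per hit
writer.addDocument(config.build(taxoWriter, doc));

// at search time, read the label back for each hit:
for (ScoreDoc sd : topDocs.scoreDocs) {
  String label = searcher.doc(sd.doc).get("Category_stored");
}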

Shai

On Mon, Sep 29, 2014 at 7:15 PM, Jürgen Albert j.alb...@data-in-motion.biz
wrote:

 Hi,

 I'm currently implementing the lucene facets in the version 4.8.1 and two
 questions remain for me:

 1. Is the an easy way to have translations for the facets? If we use e.g.
 the books example, the user should see the translation. But if he clicks on
 a link the english value should be used for the search. Thus I have to
 return the facet translation and the actual value by the search.
 2. Is there a possibility to get the docs per facet?

 As An example I have e.g. a DrillDownQuery returning 5 docs and 2
 dimensions with 2 facets each. I guess the solution is somewhere in the
 MatchingDocs.  If I try:

  List<MatchingDocs> matchingDocs = facetsCollector.getMatchingDocs();

 for(MatchingDocs doc : matchingDocs){
 DocIdSet docSet = doc.bits;
 DocIdSetIterator iterator = docSet.iterator();
 int docId = iterator.nextDoc();
 while (docId != DocIdSetIterator.NO_MORE_DOCS){
 Document document = doc.context.reader().document(
 docId);
 System.out.println(document.toString());
 docId = iterator.nextDoc();
 }
 }

 result:

  A List with as many MatchingDocs as dimensions, but only one MatchingDocs
  gives me my docs at all. I can't see how I could get the docs per facet,
  nor how I could get the facets of a doc.

 What do I miss?

 Thx,

 Jürgen Albert.

 --
 Jürgen Albert
 Geschäftsführer

 Data In Motion UG (haftungsbeschränkt)

 Kahlaische Str. 4
 07745 Jena

 Mobil:  0157-72521634
 E-Mail: j.alb...@datainmotion.de
 Web: www.datainmotion.de

 XING:   https://www.xing.com/profile/Juergen_Albert5

 Rechtliches

 Jena HBR 507027
 USt-IdNr: DE274553639
 St.Nr.: 162/107/04586


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Exception from FastTaxonomyFacetCounts

2014-10-09 Thread Shai Erera
This usually means that your IndexReader and TaxonomyReader are out of
sync. That is, the IndexReader sees category ordinals that the
TaxonomyReader does not yet see.

Do you use SearcherTaxonomyManager in your application? It ensures that the
two are always in sync, i.e. reopened together and that your application
always sees a consistent view of the two.
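
For example, a minimal sketch of wiring it up (variable names are illustrative):

SearcherTaxonomyManager mgr = new SearcherTaxonomyManager(indexWriter, true, null, taxoWriter);
// ... after indexing some documents, refresh both readers together:
mgr.maybeRefresh();
SearcherAndTaxonomy pair = mgr.acquire();
try {
  FacetsCollector fc = new FacetsCollector();
  FacetsCollector.search(pair.searcher, query, 10, fc);
  Facets facets = new FastTaxonomyFacetCounts(pair.taxonomyReader, config, fc);
} finally {
  mgr.release(pair);
}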

Shai

On Tue, Oct 7, 2014 at 10:03 AM, Jigar Shah jigaronl...@gmail.com wrote:

 Intermittently while searching I am getting this exception on a huge index.
 (FacetsConfig used while indexing and searching is same.)

 java.lang.ArrayIndexOutOfBoundsException: 252554
 06:28:37,954 ERROR [stderr] at

 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.count(FastTaxonomyFacetCounts.java:73)
 06:28:37,954 ERROR [stderr] at

 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:49)
 06:28:37,954 ERROR [stderr] at

 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:39)
 06:28:37,954 ERROR [stderr] at

 com.company.search.CustomDrillSideways.buildFacetsResult(LuceneDrillSideways.java:41)
 06:28:37,954 ERROR [stderr] at
 org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:146)
 06:28:37,955 ERROR [stderr] at
 org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:203)

 Thanks,
 Jigar Shah



Re: FacetsConfig usage

2014-10-05 Thread Shai Erera
Hi

The FacetsConfig object is the one that you use to index facets, and at
search time it is consulted about the facets attributes (multi-valued,
hierarchical etc.). You can make changes to the FacetsConfig, as long as
they don't contradict the indexed data in a problematic manner.

Usually the facets configuration does not change, but I believe it will
work if you add new dimensions. Current in-flight searches won't
query/count those dimensions anyway, and new searches will find those
dimensions in recently indexed documents only. It is up to you to decide if
the old 1 million documents that don't contain the new Person facet are OK
to display together w/ the 10 new documents that do, but as long as you're
OK with that, application-wise, adding new dimensions should just work.
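
For example (a sketch; the dimension names are made up):

FacetsConfig config = new FacetsConfig();
config.setMultiValued("Tags", true);      // dimension that existed from day one
// later on, a new dimension is introduced; old documents simply won't have it:
config.setHierarchical("Person", true);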

Contradicting changes are changes to the attributes of one dimension, e.g.
from hierarchical to flat. In that case, the fact that there are 1 million old
documents indexed w/ A/B/C hierarchy and 10 new documents w/ only A/B
doesn't matter to the FacetsConfig - all documents will be considered flat
in that case. Here I'm less sure about the effects of that on search (I
don't think we have a test for it), but I hope that you don't do that. It's
not advisable, just like any other schema changes to your fields while
there are already indexed documents.

Shai


Re: confused facet example

2014-09-30 Thread Shai Erera
Thanks Yonghui,

I will commit a fix - need to initialize the example class before each
example is run!

Shai

On Tue, Sep 30, 2014 at 1:26 PM, Yonghui Zhao zhaoyong...@gmail.com wrote:


 https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/SimpleFacetsExample.java

 In SimpleFacetsExample,

   /** Runs the search example. */
   public List<FacetResult> runFacetOnly() throws IOException {
     index();
     return facetsOnly();
   }

   /** Runs the search example. */
   public List<FacetResult> runSearch() throws IOException {
     index();
     return facetsWithSearch();
   }

   /** Runs the drill-down example. */
   public FacetResult runDrillDown() throws IOException {
     index();
     return drillDown();
   }

   /** Runs the drill-sideways example. */
   public List<FacetResult> runDrillSideways() throws IOException {
     index();
     return drillSideways();
   }

   /** Runs the search and drill-down examples and prints the results. */
   public static void main(String[] args) throws Exception {
     System.out.println("Facet counting example:");
     System.out.println("-----------------------");
     SimpleFacetsExample example1 = new SimpleFacetsExample();
     List<FacetResult> results1 = example1.runFacetOnly();
     System.out.println("Author: " + results1.get(0));
     System.out.println("Publish Date: " + results1.get(1));

     System.out.println("Facet counting example (combined facets and search):");
     System.out.println("-----------------------------------------------------");
     SimpleFacetsExample example = new SimpleFacetsExample();
     List<FacetResult> results = example.runSearch();
     System.out.println("Author: " + results.get(0));
     System.out.println("Publish Date: " + results.get(1));

     System.out.println("\n");
     System.out.println("Facet drill-down example (Publish Date/2010):");
     System.out.println("---------------------------------------------");
     System.out.println("Author: " + example.runDrillDown());

     System.out.println("\n");
     System.out.println("Facet drill-sideways example (Publish Date/2010):");
     System.out.println("-------------------------------------------------");
     for (FacetResult result : example.runDrillSideways()) {
       System.out.println(result);
     }
   }


 The example doesn't create a new SimpleFacetsExample each time.

 So in the drill-down example it indexes 2 times, and the resulting counts are doubled;
 in the drill-sideways example it indexes 3 times, and the counts are tripled.

 Is it intended?



Re: sortedset vs taxonomy

2014-09-27 Thread Shai Erera
Hi

The taxonomy faceting approach maintains a sidecar index where it keeps the
taxonomy and assigns an integer (ordinal) to each category. Those integers
are encoded in a BinaryDocValues field for each document. It supports
hierarchical faceting as well as assigning additional metadata to each
facet occurrence (called associations). At search time, faceting is done by
aggregating the category ordinals found in each document. Since those
ordinals are global to the index, merging and finding the top-K facets
across segments is relatively cheap.

The SortedSet faceting approach does not need a sidecar index and relies on
the SortedSet fields. Here too each term/category is assigned an ordinal
and at search time the facets are aggregated using those ordinals. However,
the ordinal of the same category is not the same across segments, and
therefore finding the top-K facets is a bit more expensive (roughly 20%
slower if I remember correctly).

Another difference is that the SortedSet approach keeps a true ordinal for
a facet, so e.g. the category A/B will always receive an ordinal that is
smaller than A/C. In the taxonomy approach though, whichever facet got
added first receives the lowest ordinal, except that the parent of all
categories at a certain level in the hierarchy always receives a smaller
ordinal than all its children.

Working w/ SortedSet facets is indeed simpler than the taxonomy, but the
taxonomy does not seriously complicate things. If you need a facet
hierarchy, you should use the taxonomy approach. Otherwise, I would just
try each and see which one works better for your usecase.

As for optimizing an index, the taxonomy facets do not make any difference
in that case.
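
To make the SortedSet option concrete, here is a minimal sketch, assuming a
FacetsCollector 'fc' already filled by a search:

// indexing: no taxonomy writer involved
doc.add(new SortedSetDocValuesFacetField("Author", "Lisa"));
writer.addDocument(config.build(doc));

// searching
SortedSetDocValuesReaderState state = new DefaultSortedSetDocValuesReaderState(indexReader);
Facets facets = new SortedSetDocValuesFacetCounts(state, fc);
FacetResult authors = facets.getTopChildren(10, "Author");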

Shai

On Mon, Sep 22, 2014 at 8:48 PM, Yonghui Zhao zhaoyong...@gmail.com wrote:

 If we want to implement simple facet counting feature, it seems we can do
 it via sortedset or taxonomy writer/reader.

 Seems sortedset is simpler but doesn't support hierarchical  facet count
 such as A/B/C.

 I want to know what's advantage/disadvantage of sortedset or taxonomy?

 Is there any trouble with taxonomy when index is optimized(merged)?



Re: document boost at lucene 4.8.1

2014-09-21 Thread Shai Erera
You can read some discussion here:
http://search-lucene.com/m/Z2GP220szmSsubj=RE+What+is+equivalent+to+Document+setBoost+from+Lucene+3+6+inLucene+4+1+
.

I wrote a post on how to achieve that with the new API:
http://shaierera.blogspot.com/2013/09/boosting-documents-in-lucene.html.
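
In short, the idea is to index the boost as a doc-values field and fold it into
the score at query time. A rough sketch of one way to do it with the expressions
module (the 'boost' field name is illustrative):

// index time
doc.add(new NumericDocValuesField("boost", 2L));

// search time: combine the boost with the relevance score
Expression expr = JavascriptCompiler.compile("_score * boost");
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("_score", SortField.Type.SCORE));
bindings.add(new SortField("boost", SortField.Type.LONG));
Sort sort = new Sort(expr.getSortField(bindings, true)); // descending by boosted score
TopDocs td = searcher.search(query, 10, sort);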

Shai

On Sun, Sep 21, 2014 at 11:23 AM, #LI JUN# jli...@e.ntu.edu.sg wrote:

 Hi all,


 How come in 4.8.1, the document.setBoost method is missing. So what is the
 method for document level boost now?


 Regards,

 Jun




Re: improve indexing speed with nomergepolicy

2014-08-14 Thread Shai Erera
I opened https://issues.apache.org/jira/browse/LUCENE-5883 to handle that.

Shai


On Thu, Aug 7, 2014 at 6:42 PM, Uwe Schindler u...@thetaphi.de wrote:

 This is a good idea, because sometimes it's nice to change the MergePolicy
 on the fly without reopening! One example is
 https://issues.apache.org/jira/browse/LUCENE-5526
 In my case, I would like to open an IndexWriter, set its merge policy to
 IndexUpdaterMergePolicy, force a merge to upgrade all segments and then
 proceed with normal indexing and other stuff. Currently you have to close
 IW - this is bad in multithreaded environments: If you start an Index
 Upgrade after installing a new version of your favourite Solr/ES/...
 server, but need to index documents in parallel (real time system) - so
 with little downtime.
 The proposal in the above issue is to allow to pass a MergePolicy to
 forceMerge().

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


  -Original Message-
  From: Shai Erera [mailto:ser...@gmail.com]
  Sent: Thursday, August 07, 2014 4:11 PM
  To: java-user@lucene.apache.org
  Subject: Re: improve indexing speed with nomergepolicy
 
  Yes, currently an MP isn't a live setting on IndexWriter, meaning you
 pass it
  at construction time and don't change it afterwards. I wonder if after
  LUCENE-5711 we can move MergePolicy to LiveIndexWriterConfig and fix
  IndexWriter to not hold on to it, but rather pull it from the config.
 
  Not sure what others think about it.
 
  Shai
 
 
  On Thu, Aug 7, 2014 at 5:05 PM, Jon Stewart
  j...@lightboxtechnologies.com
  wrote:
 
   Related, how does one change the MergePolicy on an IndexWriter (e.g.,
   use NoMergePolicy during batch indexing, then change to something
   better once finished with batch)? It looks like the MergePolicy is set
   through IndexWriterConfig but I don't see a way to update an IWC on an
   IW.
  
   Thanks,
  
   Jon
  
  
   On Thu, Aug 7, 2014 at 7:37 AM, Shai Erera ser...@gmail.com wrote:
Using NoMergePolicy for online indexes is usually not recommended.
You
   want
to use NoMP in case where you build an index in a batch job, then in
the end before the index is published you run a forceMerge or
maybeMerge (with a real MergePolicy).
   
For online indexes, i.e. indexes that are being searched while they
are updated, if you use NoMP you will accumulate many segments in the
  index.
This means higher resources consumption overall: file handles, RAM,
potentially disk space, and usually results in slower searches.
   
You may want to tweak the default MP's settings though, to not kick
off a merge unless there are a large number of segments in the
index. E.g. the default MP merges segments when there are 10 at the
  same level (i.e.
roughly the same size). You can increase that.
   
Also, do you use NRTCachingDirectory? It's usually recommended for
NRT, even with default MP, since the tiny segments are merged
in-memory, and your NRT reopens don't result in flushing new segments
  to disk.
   
Shai
   
   
On Thu, Aug 7, 2014 at 1:14 PM, Sascha Janz sascha.j...@gmx.net
  wrote:
   
hi,
   
i try to speed up our indexing process. we use SeacherManager with
applydeletes to get near real time Reader.
   
we have not really much incoming documents, but the documents
must be updated from time to time and the amount of documents to be
updated
   could
be quite large.
   
i tried some tests with NoMergePolicy and the indexing process was
25 % faster.
   
so i think of a change in our code, to use NoMergePolicy for a
specific time interval, when users are active and do a
forceMerge(20) every
   night,
which last about 2 - 5 minutes.
   
is this a good idea? or will i perhaps get into trouble?
   
Sascha
   
   
---
-- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
   
   
  
  
  
   --
   Jon Stewart, Principal
   (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA
  
   -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Questions for facets search

2014-08-13 Thread Shai Erera
Sheng,

I assume that you're using the Lucene faceting module, so I answer
following that:

(1) A document can be associated with many facet labels, e.g. Tags/lucene
and Author/Shai. The way to extract all facet labels for a particular
document is this:

  OrdinalsReader ordinals = new DocValuesOrdinalsReader();
  OrdinalsSegmentReader ordsSegment =
ordinals.getReader(indexReader.leaves().get(0)); // we have only one segment
  IntsRef scratch = new IntsRef();
  ordsSegment.get(0, scratch);
  for (int i = 0; i < scratch.length; i++) {
System.out.println(taxoReader.getPath(scratch.ints[i]));
  }

Note that OrdinalsSegmentReader works on an AtomicReader. That means that
the doc-id that you pass to it must be relative to the segment. If you have
a global doc-id, you can wrap the DirectoryReader with a
SlowCompositeReaderWrapper, which presents the DirectoryReader as an
AtomicReader.

(2) I'm not quite sure I understand what you mean by facet cache. Do you
mean the taxonomy index? If so the answer is no. Think of the taxonomy
index as a large global Map<FacetLabel, Integer>, where each facet label is
mapped to an integer, irrespective of the segment it is indexed in. That
map is used to encode the facet information in the *Search Index* more
efficiently.

Therefore the taxonomy index itself doesn't hold all the information that
is needed for faceted search, and you cannot only rebuild it.

Shai


On Wed, Aug 13, 2014 at 8:08 AM, Ralf Heyde ralf.he...@gmx.de wrote:

 For 1st: from Solr Level i guess, you could select (only) the document by
 uniqueid. Then you have the facets for that particular document. But this
 results in one additional query/doc.

 Gesendet von meinem BlackBerry 10-Smartphone.
   Originalnachricht
 Von: Sheng
 Gesendet: Dienstag, 12. August 2014 23:35
 An: java-user@lucene.apache.org
 Antwort an: java-user@lucene.apache.org
 Betreff: Questions for facets search

 I actually have 2 questions:

 1. Is it possible to get the facet label for a particular document? The
 reason we want this is we'd like to allow users to see tags for each hit in
 addition to the taxonomy for his/her search.

 2. Is it possible to re-index the facet cache without reindexing the whole
 lucene cache, since they are separated? We have a dynamic list of faceted
 fields, being able to quickly rebuild the whole facet lucene cache would be
 quite desirable.

 Again, I am using lucene 4.7, thanks in advance to your answers!

 Sheng

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Questions for facets search

2014-08-13 Thread Shai Erera
Glad it helped Sheng.

Note, the taxonomy index is not exactly like what you implement, just want
to clarify that. You implemented something like a JOIN between two indexes,
where a document in Index1 can be joined with a document (or set of
docs) in Index2, by some primary key.

The taxonomy index is different. It's an auxiliary index, but the word
'index' is just an implementation detail. Again, think of it as a large Map
from a String to Integer. Every facet in the taxonomy gets a unique ID
(integer), and that integer is encoded in the search index for all
documents that are associated with that facet.

Lucene implements a similar feature, per-segment, through
SortedSetDocValues (and the facet module supports that one too, without the
need for an auxiliary index). The difference is that SortedSetDocValues
implement that mapping per-segment, so e.g. the facet Tags/Lucene may
receive the integer 5 in seg1 and 12 in seg2, where the taxonomy index maps
it *once* to an integer (say 4), and that integer is encoded in a
BinaryDocValuesField in all segments of the search index.

The only lookup that is done at search time is when you want to label top
facets. Since the search index holds only the integer values of the facets,
the taxonomy index is used to label them (so now it's more of a
bidirectional Map).

Just wanted to clarify the differences.

Shai


On Thu, Aug 14, 2014 at 2:56 AM, Sheng sheng...@gmail.com wrote:

 Shai,

 Thanks a lot for your answers! Sorry, I was distracted by some other
 matters during the day and cannot try your suggestions until now. So what
 you suggest on 1 is working like a charm :) for 2, it is a pity but I can
 understand. By the way, the way you described that facet index gets stored
 like a map is quite similar to how we store the payload :) We use an
 integer as payload for each token, and store more complicated information
 in another Lucene index with the integer payload as the key for each
 document.

 Sheng

 On Wednesday, August 13, 2014, Shai Erera ser...@gmail.com wrote:

  Sheng,
 
  I assume that you're using the Lucene faceting module, so I answer
  following that:
 
  (1) A document can be associated with many facet labels, e.g. Tags/lucene
  and Author/Shai. The way to extract all facet labels for a particular
  document is this:
 
OrdinalsReader ordinals = new DocValuesOrdinalsReader();
OrdinalsSegmentReader ordsSegment =
  ordinals.getReader(indexReader.leaves().get(0)); // we have only one
  segment
IntsRef scratch = new IntsRef();
ordsSegment.get(0, scratch);
    for (int i = 0; i < scratch.length; i++) {
  System.out.println(taxoReader.getPath(scratch.ints[i]));
}
 
  Note that OrdinalsSegmentReader works on an AtomicReader. That means that
  the doc-id that you pass to it must be relative to the segment. If you
 have
  a global doc-id, you can wrap the DirectoryReader with a
  SlowCompositeReaderWrapper, which presents the DirectoryReader as an
  AtomicReader.
 
  (2) I'm not quite sure I understand what you mean by facet cache. Do
 you
  mean the taxonomy index? If so the answer is no. Think of the taxonomy
   index as a large global Map<FacetLabel, Integer>, where each facet label
 is
  mapped to an integer, irrespective of the segment it is indexed in. That
  map is used to encode the facet information in the *Search Index* more
  efficiently.
 
  Therefore the taxonomy index itself doesn't hold all the information that
  is needed for faceted search, and you cannot only rebuild it.
 
  Shai
 
 
  On Wed, Aug 13, 2014 at 8:08 AM, Ralf Heyde ralf.he...@gmx.de
  javascript:; wrote:
 
   For 1st: from Solr Level i guess, you could select (only) the document
 by
   uniqueid. Then you have the facets for that particular document. But
 this
   results in one additional query/doc.
  
   Gesendet von meinem BlackBerry 10-Smartphone.
 Originalnachricht
   Von: Sheng
   Gesendet: Dienstag, 12. August 2014 23:35
   An: java-user@lucene.apache.org javascript:;
   Antwort an: java-user@lucene.apache.org javascript:;
   Betreff: Questions for facets search
  
   I actually have 2 questions:
  
   1. Is it possible to get the facet label for a particular document? The
   reason we want this is we'd like to allow users to see tags for each
 hit
  in
   addition to the taxonomy for his/her search.
  
   2. Is it possible to re-index the facet cache without reindexing the
  whole
   lucene cache, since they are separated? We have a dynamic list of
 faceted
   fields, being able to quickly rebuild the whole facet lucene cache
 would
  be
   quite desirable.
  
   Again, I am using lucene 4.7, thanks in advance to your answers!
  
   Sheng
  
   -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  javascript:;
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  javascript:;
  
  
 



Re: improve indexing speed with nomergepolicy

2014-08-07 Thread Shai Erera
Using NoMergePolicy for online indexes is usually not recommended. You want
to use NoMP in cases where you build an index in a batch job, then in the
end before the index is published you run a forceMerge or maybeMerge
(with a real MergePolicy).

For online indexes, i.e. indexes that are being searched while they are
updated, if you use NoMP you will accumulate many segments in the index.
This means higher resources consumption overall: file handles, RAM,
potentially disk space, and usually results in slower searches.

You may want to tweak the default MP's settings though, to not kick off a
merge unless there are a large number of segments in the index. E.g. the
default MP merges segments when there are 10 at the same level (i.e.
roughly the same size). You can increase that.

Also, do you use NRTCachingDirectory? It's usually recommended for NRT,
even with default MP, since the tiny segments are merged in-memory, and
your NRT reopens don't result in flushing new segments to disk.
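
For example, a sketch along those lines for Lucene 4.x (the numbers are
illustrative only, and indexDir is assumed to be your index directory):

TieredMergePolicy mp = new TieredMergePolicy();
mp.setSegmentsPerTier(20.0);   // default is 10
mp.setMaxMergeAtOnce(20);

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, analyzer);
iwc.setMergePolicy(mp);

// cache tiny flushed segments in RAM so NRT reopens don't hit disk
Directory dir = new NRTCachingDirectory(FSDirectory.open(indexDir), 5.0, 60.0);
IndexWriter writer = new IndexWriter(dir, iwc);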

Shai


On Thu, Aug 7, 2014 at 1:14 PM, Sascha Janz sascha.j...@gmx.net wrote:

 hi,

 i try to speed up our indexing process. we use SeacherManager with
 applydeletes to get near real time Reader.

 we have not really much incoming documents, but the documents must be
 updated from time to time and the amount of documents to be updated could
 be quite large.

 i tried some tests with NoMergePolicy and the indexing process was 25 %
 faster.

 so i think of a change in our code, to use NoMergePolicy for a specific
 time interval, when users are active and do a forceMerge(20) every night,
 which last about 2 - 5 minutes.

 is this a good idea? or will i perhaps get into trouble?

 Sascha


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: improve indexing speed with nomergepolicy

2014-08-07 Thread Shai Erera
Yes, currently an MP isn't a live setting on IndexWriter, meaning you
pass it at construction time and don't change it afterwards. I wonder if
after LUCENE-5711 we can move MergePolicy to LiveIndexWriterConfig and fix
IndexWriter to not hold on to it, but rather pull it from the config.

Not sure what others think about it.

Shai


On Thu, Aug 7, 2014 at 5:05 PM, Jon Stewart j...@lightboxtechnologies.com
wrote:

 Related, how does one change the MergePolicy on an IndexWriter (e.g.,
 use NoMergePolicy during batch indexing, then change to something
 better once finished with batch)? It looks like the MergePolicy is set
 through IndexWriterConfig but I don't see a way to update an IWC on an
 IW.

 Thanks,

 Jon


 On Thu, Aug 7, 2014 at 7:37 AM, Shai Erera ser...@gmail.com wrote:
  Using NoMergePolicy for online indexes is usually not recommended. You
 want
  to use NoMP in case where you build an index in a batch job, then in the
  end before the index is published you run a forceMerge or maybeMerge
  (with a real MergePolicy).
 
  For online indexes, i.e. indexes that are being searched while they are
  updated, if you use NoMP you will accumulate many segments in the index.
  This means higher resources consumption overall: file handles, RAM,
  potentially disk space, and usually results in slower searches.
 
  You may want to tweak the default MP's settings though, to not kick off a
  merge unless there are a large number of segments in the index. E.g. the
  default MP merges segments when there are 10 at the same level (i.e.
  roughly the same size). You can increase that.
 
  Also, do you use NRTCachingDirectory? It's usually recommended for NRT,
  even with default MP, since the tiny segments are merged in-memory, and
  your NRT reopens don't result in flushing new segments to disk.
 
  Shai
 
 
  On Thu, Aug 7, 2014 at 1:14 PM, Sascha Janz sascha.j...@gmx.net wrote:
 
  hi,
 
  i try to speed up our indexing process. we use SeacherManager with
  applydeletes to get near real time Reader.
 
  we have not really much incoming documents, but the documents must be
  updated from time to time and the amount of documents to be updated
 could
  be quite large.
 
  i tried some tests with NoMergePolicy and the indexing process was 25 %
  faster.
 
  so i think of a change in our code, to use NoMergePolicy for a specific
  time interval, when users are active and do a forceMerge(20) every
 night,
  which last about 2 - 5 minutes.
 
  is this a good idea? or will i perhaps get into trouble?
 
  Sascha
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 



 --
 Jon Stewart, Principal
 (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Sort, Search Facets

2014-07-10 Thread Shai Erera
Hi

Currently we do not provide the means to use a single SortedSetDVField for
both faceting and sorting. You can add a SortedSetDVFacetField to a
Document, then use FacetsConfig.build(), but that encodes all your
dimensions under a single SSDV field. It's done for efficiency, since at
search time, when you ask to count the different dimensions, we need to
read a single field.

It might be worth it to explore sharing the same SSDV field for both
faceting and sorting, and compare the performance implications of doing
that (when faceting). If you want to try it, I suggest that you look at
SortedSetDocValuesReaderState and see if you can use it for this task.
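
Until then, the straightforward approach is two fields per attribute, e.g.
(a sketch; the field names are made up):

doc.add(new SortedDocValuesField("brand_sort", new BytesRef("Acme"))); // for sorting
doc.add(new SortedSetDocValuesFacetField("Brand", "Acme"));            // for faceting
writer.addDocument(config.build(doc));
...
Sort sort = new Sort(new SortField("brand_sort", SortField.Type.STRING));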

Shai


On Tue, Jul 8, 2014 at 9:50 AM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi,

 I am using Lucene 4.7.2 and my primary use case for Lucene is to do three
 things: (a) search, (b) sort by a number of fields for the search results,
 and (c) facet on probably an equal number of fields (probably the most
 standard use cases anyway).

 Let us say, I have a corpus of more than a 100m docs with each document
 having approx. 10-15 fields excluding the content (body) which will also be
 one of the fields. Out of 10-15, I have a requirement to have sorting
 enabled on all 10-15 and the facets as well. That makes a total of approx.
 ~45 fields to be indexed for various reasons, once for
 String/Long/TextField, once for SortedDocValuesField, and once for
 FacetField each.

 What will be the impact of this on the indexing operation w.r.t. the time
 taken as well as the extra disk space required? Will it grow linearly with
 the increase in the number of fields?

 What is the impact on the memory usage during search time?


 I will attempt to benchmark some of these, but if you have any experience
 with this, request you to share the details. Thanks,

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


Re: Incremental Field Updates

2014-07-02 Thread Shai Erera
Using BinaryDocValues is not recommended for all scenarios. It is a
catchall alternative to the other DocValues types. I would not use it
unless it makes sense for your application, even if it means that you need
to re-index a document in order to update a single field.

DocValues are not good for search - by search I assume you mean take a
query such as apache AND lucene and find all documents which contain both
terms under the same field. They are good for sorting and faceting though.

So I guess the answer to your question is it depends (it always is!) - I
would use DocValues for sorting and faceting, but not for regular search
queries. And I would use BinaryDocValues only when the other DocValues
types don't match.

Also, note that the current field-level update of DocValues is not always
better than re-indexing the document, you can read here for more details:
http://shaierera.blogspot.com/2014/04/benchmarking-updatable-docvalues.html
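
For completeness, a field-level DocValues update looks roughly like this (the
field names are hypothetical, and the fields must have been indexed as doc-values
fields in the first place):

// assumes the document was indexed with new NumericDocValuesField("price", ...)
writer.updateNumericDocValue(new Term("id", "doc-17"), "price", 990L);
// binary doc-values can be updated similarly since 4.8:
writer.updateBinaryDocValue(new Term("id", "doc-17"), "title-sort", new BytesRef("new title"));
writer.commit();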

Shai


On Tue, Jul 1, 2014 at 9:17 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi Shai,

 So one follow-up question.

 Assume that my use case is to have approx. ~50M documents indexed with
 each document having about ~10-15 indexed but not stored fields. These
 fields will never change, but there are another ~5-6 fields that will
 change and will continue to change after the index is written. These ~5-6
 fields may also be multivalued. The size of this index turns out to be
 ~120GB.

 In this case, I would like to sort or facet or search on these ~5-6
 fields. Which approach do you suggest? Should I use BinaryDocValues and
 update using IW or use either a ParallelReader/Join query.

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Tuesday, July 1, 2014 9:53 PM, Shai Erera ser...@gmail.com wrote:



 Except that Lucene now offers efficient numeric and binary DocValues
 updates. See IndexWriter.updateNumeric/Binary...

 On Jul 1, 2014 5:51 PM, Erick Erickson erickerick...@gmail.com wrote:

  This JIRA is complicated, don't really expect it in 4.9 as it's
  been hanging around for quite a while. Everyone would like this,
  but it's not easy.
 
  Atomic updates will work, but you have to stored=true for all
  source fields. Under the covers this actually reads the document
  out of the stored fields, deletes the old one and adds it
  over again.
 
  FWIW,
  Erick
 
  On Tue, Jul 1, 2014 at 5:32 AM, Sandeep Khanzode
  sandeep_khanz...@yahoo.com.invalid wrote:
   Hi,
  
   I wanted to know of the best approach to follow if a few fields in my
  indexed documents are changing at run time (after index and before or
  during search), but a majority of them are created at index time.
  
   I could see the JIRA given below but it is scheduled for Lucene 4.9, I
  believe.
  
   There are a few other approaches, like maintaining a separate index for
  changing fields and use either a parallelreader or use a Join.
  
   Can everyone share their experience for this scenario on how it is
  handled in your systems? Thanks,
  
   [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF
  JIRA
  
  
[LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF
  JIRA
   Shai and I would like to start working on the proposal to Incremental
  Field Updates outlined here (
 http://markmail.org/message/zhrdxxpfk6qvdaex
  ).
   View on issues.apache.org Preview by Yahoo
  
  
   ---
   Thanks n Regards,
   Sandeep Ramesh Khanzode
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 


Re: Incremental Field Updates

2014-07-01 Thread Shai Erera
Except that Lucene now offers efficient numeric and binary DocValues
updates. See IndexWriter.updateNumeric/Binary...
On Jul 1, 2014 5:51 PM, Erick Erickson erickerick...@gmail.com wrote:

 This JIRA is complicated, don't really expect it in 4.9 as it's
 been hanging around for quite a while. Everyone would like this,
 but it's not easy.

 Atomic updates will work, but you have to stored=true for all
 source fields. Under the covers this actually reads the document
 out of the stored fields, deletes the old one and adds it
 over again.

 FWIW,
 Erick

 On Tue, Jul 1, 2014 at 5:32 AM, Sandeep Khanzode
 sandeep_khanz...@yahoo.com.invalid wrote:
  Hi,
 
  I wanted to know of the best approach to follow if a few fields in my
 indexed documents are changing at run time (after index and before or
 during search), but a majority of them are created at index time.
 
  I could see the JIRA given below but it is scheduled for Lucene 4.9, I
 believe.
 
  There are a few other approaches, like maintaining a separate index for
 changing fields and use either a parallelreader or use a Join.
 
  Can everyone share their experience for this scenario on how it is
 handled in your systems? Thanks,
 
  [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF
 JIRA
 
 
   [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF
 JIRA
  Shai and I would like to start working on the proposal to Incremental
 Field Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex
 ).
  View on issues.apache.org Preview by Yahoo
 
 
  ---
  Thanks n Regards,
  Sandeep Ramesh Khanzode

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Lucene Facets Module 4.8.1

2014-06-23 Thread Shai Erera
There is no sample code for doing that but it's quite straightforward - if
you know you indexed some dimensions under different indexFieldNames,
initialize a FacetCounts per such field name, e.g.:

FastTaxonomyFacetCounts defaultCounts = new FastTaxonomyFacetCounts(...); // for your regular facets
FastTaxonomyFacetCounts cityCounts = new FastTaxonomyFacetCounts(...); // for your CITY facets

Something like that...
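
A slightly fuller sketch, assuming the CITY dimension was mapped to the "city"
index field via config.setIndexFieldName("CITY", "city"):

FacetsCollector fc = new FacetsCollector();
FacetsCollector.search(searcher, query, 10, fc);

Facets defaultFacets = new FastTaxonomyFacetCounts(taxoReader, config, fc);       // reads the "$facets" field
Facets cityFacets = new FastTaxonomyFacetCounts("city", taxoReader, config, fc);  // reads the "city" field
FacetResult cities = cityFacets.getTopChildren(5, "CITY");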

Shai


On Mon, Jun 23, 2014 at 9:04 AM, Jigar Shah jigaronl...@gmail.com wrote:

 On commenting

  //config.setIndexFieldName("CITY", "city"); at search time, this is before
  I do getTopChildren(...)

 I get following exception.

 Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
 at

 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.count(FastTaxonomyFacetCounts.java:74)
 [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
 at

 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.init(FastTaxonomyFacetCounts.java:49)
 [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
 at

 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.init(FastTaxonomyFacetCounts.java:39)
 [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
 at

 org.apache.lucene.facet.DrillSideways.buildFacetsResult(DrillSideways.java:110)
 [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
 at org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:177)
 [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
 at org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:203)
 [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]

 Application level excepitons.
 ...
 ...



 On Sat, Jun 21, 2014 at 10:56 PM, Michael McCandless 
 luc...@mikemccandless.com wrote:

  Are you sure it's the same FacetsConfig at search time?  Because the
  exception implies your CITY field didn't have
  config.setIndexFieldName(CITY, city) called.
 
  Or, can you try commenting out 'config.setIndexFieldName(CITY,
  city)' at index time and see if the exception still happens?
 
  Mike McCandless
 
  http://blog.mikemccandless.com
 
 
  On Sat, Jun 21, 2014 at 1:08 AM, Jigar Shah jigaronl...@gmail.com
 wrote:
   Thanks for helping me.
  
   Yes, i did couple of things:
  
   Below is simple code for indexing which i use.
  
   TrackingIndexWriter nrtWriter
   DirectoryTaxonomyWriter taxoWriter = ...
   
   FacetsConfig config = new FacetConfig();
   config.setHierarchical(CITY, true)
   config.setMultiValued(CITY, true);
   config.setIndexFieldName(CITY,city) // I kept dimName different
 from
   indexFieldName
   
   Added indexing searchable fields...
   
  
   doc.add( new FacetField(CITY, India, Gujarat, Vadodara ))
   doc.add( new FacetField(CITY, India, Gujarat, Ahmedabad ))
  
nrtWriter.addDocument(config.build(taxoWriter, doc));
  
   Below is code which i use for searching
  
   TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoWriter);
  
   Query query = ...
   IndexSearcher searcher = ...
   DrillDownQuery ddq = new DrillDownQuery(config, query);
   DrillSideways ds = new DrillSideways(searcher, config, taxoReader); //
   Config object is same which i created before
   DrillSidewaysResult result = ds.search(query, null, null, start +
 limit,
   null, true, true)
   ...
   Facets f = result.facets
   FacetResult fr = f.getTopChildren(5, CITY) [Exception is geneated]//
   Didn't perform any drill-down,really, its just original query for first
   time, but wrapped in DrillDownQuery.
  
   ... and below gives me empty collection.
  
   ListFacetResult frs= f.getAllDims(5)
  
   I debug source code and found, it internally calls
  
   FastTaxonomyFacetCounts(indexFieldName, taxoReader, config) // Config
   object is same which i created before
  
   which then calls
  
   IntTaxonomyFacets(indexFieldName, taxoReader, config) // Config object
 is
   same which i created before
  
   And during this calls the value of indexFieldName is $facets defined
 by
   constant  'public static final String DEFAULT_INDEX_FIELD_NAME =
  $facets;'
   in FacetsConfig.
  
   My question is if i am using same FacetsConfig while indexing and
   searching. why its not identifying correct name of field, and goes for
   $facets
  
   Please correct me if i understood wrong. or correct way to solve above
   problem.
  
   Many Thanks.
   Jigar Shah.
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 



Re: Lucene Facets Module 4.8.1

2014-06-23 Thread Shai Erera
Basically, it's not very common to change the indexFieldName. You should do
that in case you e.g. count facets in groups of dimensions, rather than
counting all of them. So for example, if you have 20 dimensions, but you
know you only count d1-d5, d6-d12 and d13-d20, then separating them into 3
different indexFieldNames will probably improve performance.

But if you can't make such a decision, it's better to not modify this. When
you initialize a FacetCounts, it counts all the dimensions that are indexed
under that indexFieldName, so if you need the counts of all of them, or the
majority of them, that's ok. But if you know you *always* need the count of
a subset of them, then separating that subset to a different field is
better.
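
For example, a sketch of such a grouping (the field and dimension names are made up):

FacetsConfig config = new FacetsConfig();
for (int i = 1; i <= 5; i++)   config.setIndexFieldName("d" + i, "group1");
for (int i = 6; i <= 12; i++)  config.setIndexFieldName("d" + i, "group2");
for (int i = 13; i <= 20; i++) config.setIndexFieldName("d" + i, "group3");
// at search time, counting d1-d5 only needs to read the "group1" field:
Facets group1Counts = new FastTaxonomyFacetCounts("group1", taxoReader, config, fc);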

Hope that clarifies.

Shai


On Mon, Jun 23, 2014 at 4:18 PM, Jigar Shah jigaronl...@gmail.com wrote:

 Thanks this worked for me :)

 Is there any advantage of indexing some facets as not providing any
 indexFieldName ?

 Thanks




 On Mon, Jun 23, 2014 at 12:55 PM, Shai Erera ser...@gmail.com wrote:

  There is no sample code for doing that but it's quite straightforward -
 if
  you know you indexed some dimensions under different indexFieldNames,
  initialize a FacetCounts per such field name, e.g.:
 
   FastTaxonomyFacetCounts defaultCounts = new FastTaxonomyFacetCounts(...); // for your regular facets
   FastTaxonomyFacetCounts cityCounts = new FastTaxonomyFacetCounts(...); // for your CITY facets
 
  Something like that...
 
  Shai
 
 
  On Mon, Jun 23, 2014 at 9:04 AM, Jigar Shah jigaronl...@gmail.com
 wrote:
 
   On commenting
  
   //config.setIndexFieldName(CITY, city); at search time, this is
  before
   i do, getTopChildren(...)
  
   I get following exception.
  
   Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
   at
  
  
 
 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.count(FastTaxonomyFacetCounts.java:74)
   [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
   at
  
  
 
 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.init(FastTaxonomyFacetCounts.java:49)
   [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
   at
  
  
 
 org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.init(FastTaxonomyFacetCounts.java:39)
   [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
   at
  
  
 
 org.apache.lucene.facet.DrillSideways.buildFacetsResult(DrillSideways.java:110)
   [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
   at
  org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:177)
   [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
   at
  org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:203)
   [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
  
   Application level excepitons.
   ...
   ...
  
  
  
   On Sat, Jun 21, 2014 at 10:56 PM, Michael McCandless 
   luc...@mikemccandless.com wrote:
  
Are you sure it's the same FacetsConfig at search time?  Because the
exception implies your CITY field didn't have
config.setIndexFieldName(CITY, city) called.
   
Or, can you try commenting out 'config.setIndexFieldName(CITY,
city)' at index time and see if the exception still happens?
   
Mike McCandless
   
http://blog.mikemccandless.com
   
   
On Sat, Jun 21, 2014 at 1:08 AM, Jigar Shah jigaronl...@gmail.com
   wrote:
 Thanks for helping me.

 Yes, i did couple of things:

 Below is simple code for indexing which i use.

 TrackingIndexWriter nrtWriter
 DirectoryTaxonomyWriter taxoWriter = ...
 
 FacetsConfig config = new FacetConfig();
 config.setHierarchical(CITY, true)
 config.setMultiValued(CITY, true);
 config.setIndexFieldName(CITY,city) // I kept dimName different
   from
 indexFieldName
 
 Added indexing searchable fields...
 

 doc.add( new FacetField(CITY, India, Gujarat, Vadodara ))
 doc.add( new FacetField(CITY, India, Gujarat, Ahmedabad ))

  nrtWriter.addDocument(config.build(taxoWriter, doc));

 Below is code which i use for searching

 TaxonomyReader taxoReader = new
 DirectoryTaxonomyReader(taxoWriter);

 Query query = ...
 IndexSearcher searcher = ...
 DrillDownQuery ddq = new DrillDownQuery(config, query);
 DrillSideways ds = new DrillSideways(searcher, config, taxoReader);
  //
 Config object is same which i created before
 DrillSidewaysResult result = ds.search(query, null, null, start +
   limit,
 null, true, true)
 ...
 Facets f = result.facets
 FacetResult fr = f.getTopChildren(5, CITY) [Exception is
  geneated]//
 Didn't perform any drill-down,really, its just original query for
  first
 time, but wrapped in DrillDownQuery.

 ... and below gives me empty collection.

 ListFacetResult frs= f.getAllDims(5)

 I debug source code and found

Re: A question about FacetField constructor

2014-06-22 Thread Shai Erera
What do you mean by does not index anything? Do you get an exception when
you add a String[] with more than one element?

You should probably call conf.setHierarchical(dimension), but if you don't
do that you should receive an IllegalArgumentException telling you to do
that...
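
For example (a minimal sketch):

FacetsConfig config = new FacetsConfig();
config.setHierarchical("Publish Date", true); // required for multi-component paths
doc.add(new FacetField("Publish Date", "2010", "10", "15"));
writer.addDocument(config.build(taxoWriter, doc));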

Shai


On Sun, Jun 22, 2014 at 6:34 AM, west suhanic west.suha...@gmail.com
wrote:

 Hello All:

 I am building sample code using lucene v4.8.1 to explore
 the new facet API. The problem I am having is that if I pass
 a populated string array nothing gets indexed while if
 I pass only the first element of the string array that value gets indexed.
 The code found below shows the case that works and the case that does not
 work. What am I doing wrong?

 Start of code sample*

 void showStuff( String... va )
 {
   /** This code prints out the contents of va successfully. **/
   for ( int ii = 0; ii < va.length; ii++ )
     System.out.println( "value[" + ii + "] " + va[ii] );
 }

 for ( final Map<String, String[]> fd : allFacetData )
 {
   final Document doc = new Document();
   for ( final Map.Entry<String, String[]> entry : fd.entrySet() )
   {
     final String key = entry.getKey();
     String[] value = entry.getValue();
     showStuff( value );

     /** This call indexes successfully **/
     final FacetField newFF = new FacetField( key, value[0] );

     /**
      * This call will not index anything if the value String array
      * has more than one element.
      * final FacetField newFF = new FacetField( key, value );
      */
     doc.add( newFF );
   }

   try
   {
     final Document theBuildDoc = configFacetsHandle.build( taxoWriter, doc );
     indexWriter.addDocument( theBuildDoc );
     indexWriter.addDocument( configFacetsHandle.build( taxoWriter, doc ) );
   }
   catch ( IOException ioe )
   {
     eMsg.append( method );
     eMsg.append( " failed with the exception " );
     eMsg.append( ioe.toString() );
     return constantValuesInterface.FAILURE;
   }
 }

 ***End of code sample***

 regards,

 West Suhanic



Re: A question about FacetField constructor

2014-06-22 Thread Shai Erera
Reply wasn't sent to the list.
On Jun 22, 2014 8:15 PM, Shai Erera ser...@gmail.com wrote:

 Can you post an example which demonstrates the problem? It's also
 interesting how you count the facets, eg do you use a TaxonomyFacets object
 or something else?

 Have you looked at the facet demo code? It contains examples for using
 hierarchical facets.

 Shai
 On Jun 22, 2014 8:08 PM, west suhanic west.suha...@gmail.com wrote:

 Hello:

 What do you mean by does not index anything?

 When I do a search the value returned for the dim set to Publish Date
 is null. If I pass through value[0] the publish date year is returned by
 the search.

 setHierarchical was called.

 When a String[] with more than one element is passed an exception is not
 thrown.

 I am open to all suggestions as to what I am missing.

 regards,

 west suhanic


 On Sun, Jun 22, 2014 at 3:23 AM, Shai Erera ser...@gmail.com wrote:

 What do you mean by does not index anything? Do you get an exception
 when you add a String[] with more than one element?

 You should probably call conf.setHierarchical(dimension), but if you
 don't do that you should receive an IllegalArgumentException telling you to
 do that...

 Shai


 On Sun, Jun 22, 2014 at 6:34 AM, west suhanic west.suha...@gmail.com
 wrote:

 Hello All:

 I am building sample code using lucene v4.8.1 to explore
 the new facet API. The problem I am having is that if I pass
 a populated string array nothing gets indexed while if
 I pass only the first element of the string array that value gets
 indexed.
 The code found below shows the case that works and the case that does
 not
 work. What am I doing wrong?

 Start of code sample*

 void showStuff( String... va )
 {
   /** This code permits out the contents of va
 successfully.**/
   for( int ii = 0 ; ii  va.length ; ii++ )
   System.out.println( value[ + ii + ]  + va[ii]
 );
 }

 for( final Map String, String[]  fd : allFacetData )
 {

 final Document doc = new Document();
 for( final Map.Entry String, String[]  entry :
 fd.entrySet() )
 {
 final String key = entry.getKey();
 String[] value = entry.getValue();
 showStuff( value );

 /**  This call indexes successfully **/
 final FacetField newFF = new FacetField(
 key, value[0] );

 /**
* This call will not index anything
 if
 the value String array
* has more than one element.
*final FacetField newFF = new
 FacetField( key, value );
*/
 doc.add( newFF );
 }

 try
 {
 final Document theBuildDoc =
 configFacetsHandle.
 build( taxoWriter, doc );
 indexWriter.addDocument( theBuildDoc );
 indexWriter.addDocument(
 configFacetsHandle.buil
 d( taxoWriter, doc ) );
 }
 catch( IOException ioe )
 {
 eMsg.append( method );
 eMsg.append(   failed with the
 exception 
 );
 eMsg.append( ioe.toString() );
 return constantValuesInterface.FAILURE;
 }
 }

 ***End of code sample***

 regards,

 West Suhanic






Re: Lucene Facets Module 4.8.1

2014-06-21 Thread Shai Erera
If you can, while in debug mode try to note the instance ID of the
FacetsConfig, and assert it is indeed the same (i.e. indexConfig ==
searchConfig).

Shai


On Sat, Jun 21, 2014 at 8:26 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Are you sure it's the same FacetsConfig at search time?  Because the
 exception implies your CITY field didn't have
 config.setIndexFieldName(CITY, city) called.

 Or, can you try commenting out 'config.setIndexFieldName(CITY,
 city)' at index time and see if the exception still happens?

 Mike McCandless

 http://blog.mikemccandless.com


 On Sat, Jun 21, 2014 at 1:08 AM, Jigar Shah jigaronl...@gmail.com wrote:
  Thanks for helping me.
 
  Yes, i did couple of things:
 
  Below is simple code for indexing which i use.
 
  TrackingIndexWriter nrtWriter
  DirectoryTaxonomyWriter taxoWriter = ...
  
   FacetsConfig config = new FacetsConfig();
   config.setHierarchical("CITY", true);
   config.setMultiValued("CITY", true);
   config.setIndexFieldName("CITY", "city"); // I kept dimName different from indexFieldName
  
  Added indexing searchable fields...
  
 
   doc.add( new FacetField("CITY", "India", "Gujarat", "Vadodara") );
   doc.add( new FacetField("CITY", "India", "Gujarat", "Ahmedabad") );
 
   nrtWriter.addDocument(config.build(taxoWriter, doc));
 
  Below is code which i use for searching
 
  TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoWriter);
 
  Query query = ...
  IndexSearcher searcher = ...
  DrillDownQuery ddq = new DrillDownQuery(config, query);
  DrillSideways ds = new DrillSideways(searcher, config, taxoReader); //
  Config object is same which i created before
  DrillSidewaysResult result = ds.search(query, null, null, start + limit,
  null, true, true)
  ...
  Facets f = result.facets
   FacetResult fr = f.getTopChildren(5, "CITY"); [Exception is generated] //
  Didn't perform any drill-down,really, its just original query for first
  time, but wrapped in DrillDownQuery.
 
  ... and below gives me empty collection.
 
   List<FacetResult> frs = f.getAllDims(5);
 
  I debug source code and found, it internally calls
 
  FastTaxonomyFacetCounts(indexFieldName, taxoReader, config) // Config
  object is same which i created before
 
  which then calls
 
  IntTaxonomyFacets(indexFieldName, taxoReader, config) // Config object is
  same which i created before
 
  And during this calls the value of indexFieldName is $facets defined by
  constant  'public static final String DEFAULT_INDEX_FIELD_NAME =
 $facets;'
  in FacetsConfig.
 
  My question is if i am using same FacetsConfig while indexing and
  searching. why its not identifying correct name of field, and goes for
  $facets
 
  Please correct me if i understood wrong. or correct way to solve above
  problem.
 
  Many Thanks.
  Jigar Shah.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Lucene Facets Module 4.8.1

2014-06-20 Thread Shai Erera
How do you add facets to your documents? Did you play with the
FacetsConfig, such as alter the field under which the CITY dimension is
indexed?

If you can reproduce this failure in a simple program, I guess it will be
easy to spot the error. Looks like a configuration error to me...

Shai


On Fri, Jun 20, 2014 at 3:12 PM, Jigar Shah jigaronl...@gmail.com wrote:

 Hello,

 I am using DrillSideways facets, and intermittently while getting
 children I am getting the below exception:

 17:02:10,496 ERROR [stderr:71] (Thread-2
 (HornetQ-client-global-threads-790878673))
 java.lang.IllegalArgumentException: dimension CITY was not indexed into
 field $facets
 17:02:10,500 ERROR [stderr:71] (Thread-2
 (HornetQ-client-global-threads-790878673)) at

 org.apache.lucene.facet.taxonomy.TaxonomyFacets.verifyDim(TaxonomyFacets.java:80)
 17:02:10,503 ERROR [stderr:71] (Thread-2
 (HornetQ-client-global-threads-790878673)) at

 org.apache.lucene.facet.taxonomy.IntTaxonomyFacets.getTopChildren(IntTaxonomyFacets.java:95)

 I have used, TestDrillSideways.java test case to understand concept.

 Is there any mistake in creating FacetsConfig object, or configured
 something wrong ?

 Thanks,



Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera

 I am afraid the DocMap still maintains doc-id mappings till merge and I am
 trying to avoid it...


What do you mean 'till merge'? The method OneMerge.getMergeReaders() is
called only when the merge is executed, not when the MergePolicy decided to
merge those segments. Therefore the DocMap is initialized only when the
merge actually executes ... what is there more to postpone?

And besides, if the segments are already sorted, you should return a null
DocMap, like Lucene code does ...

If I miss your point, I'd appreciate if you can point me to a code example,
preferably in Lucene source, which demonstrates the problem.

Shai


On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:

 I am afraid the DocMap still maintains doc-id mappings till merge and I am
 trying to avoid it...

 I think lucene itself has a MergeIterator in o.a.l.util package.

 A MergePolicy can wrap a simple MergeIterator for iterating docs across
 different AtomicReaders in correct sort-order for a given field/term

 That should be fine right?

 --
 Ravi

 --
 Ravi


 On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera ser...@gmail.com wrote:

  loadSortTerm is your method right? In the current Sorter.sort
  implementation, I see this code:
 
  boolean sorted = true;
  for (int i = 1; i < maxDoc; ++i) {
    if (comparator.compare(i-1, i) > 0) {
      sorted = false;
      break;
    }
  }
  if (sorted) {
    return null;
  }
 
  Perhaps you can write similar code?
 
  Also note that the sorting interface has changed, I think in 4.8, and now
  you don't really need to implement a Sorter, but rather pass a SortField,
  if that works for you.
 
  Shai
 
 
  On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan 
  ravikumar.govindara...@gmail.com wrote:
 
   Shai,
  
   This is the code snippet I use inside my class...
  
    public class MySorter extends Sorter {

      @Override
      public DocMap sort(AtomicReader reader) throws IOException {
        final Map<Integer, BytesRef> docVsId = loadSortTerm(reader);

        final Sorter.DocComparator comparator = new Sorter.DocComparator() {
          @Override
          public int compare(int docID1, int docID2) {
            BytesRef v1 = docVsId.get(docID1);
            BytesRef v2 = docVsId.get(docID2);
            return v1.compareTo(v2);
          }
        };

        return sort(reader.maxDoc(), comparator);
      }
    }
  
   My Problem is, the AtomicReader passed to Sorter.sort method is
  actually
   a SlowCompositeReader, composed of a list of AtomicReaders each of
 which
  is
   already sorted.
  
   I find this loadSortTerm(compositeReader) to be a bit heavy where it
    tries to load all the doc-to-term mappings eagerly...
  
   Are there some alternatives for this?
  
   --
   Ravi
  
  
   On Tue, Jun 17, 2014 at 10:58 AM, Shai Erera ser...@gmail.com wrote:
  
I'm not sure that I follow ... where do you see DocMap being loaded
 up
front? Specifically, Sorter.sort may return null of the readers are
   already
sorted ... I think we already optimized for the case where the
 readers
   are
sorted.
   
Shai
   
   
On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:
   
 I am planning to use SortingMergePolicy where all the
   merge-participating
 segments are already sorted... I understand that I need to define a
DocMap
 with old-new doc-id mappings.

 Is it possible to optimize the eager loading of DocMap and make it
  kind
of
 lazy load on-demand?

 Ex: Pass ListAtomicReader to the caller and ask for next new-old
  doc
 mapping..

 Since my segments are already sorted, I could save on memory a
   little-bit
 this way, instead of loading the full DocMap upfront

 --
 Ravi

   
  
 



Re: Facets in Lucene 4.7.2

2014-06-17 Thread Shai Erera
Hi

40 seconds for faceted search is ... crazy. Also, note how the times don't
differ much even though the number of hits is much higher (29K vs 15.1M)
... That, together with the fact that subsequent queries are much faster (a few
seconds), suggests that something is seriously messed up w/ your
environment. Maybe it's a faulty disk? E.g. after the file system cache is
warm, you no longer hit the disk?

In general, the more hits you have, the more expensive is faceted search.
It's also true for scoring as well (i.e. even without facets). There's just
more work to determine the top results (docs, facets...). With facets, you
can use sampling (see RandomSamplingFacetsCollector), but I would do that
only after you verify that collecting 15M docs is very expensive for you,
even when the file system cache is hot.
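
If you do end up needing sampling, it is roughly a drop-in replacement for the
regular collector (the sample size and seed below are illustrative):

RandomSamplingFacetsCollector fc = new RandomSamplingFacetsCollector(10000, 42);
searcher.search(query, fc);
Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
FacetResult authors = facets.getTopChildren(10, "Author");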

I've never seen those numbers before, therefore it's difficult for me to
relate to them.

There's a caching mechanism for facets, through CachedOrdinalsReader. But I
wouldn't go there until you verify that your IO system is good (try another
machine, OS, disk ...)., and that the 40s times are truly from the faceting
code.

Shai


On Tue, Jun 17, 2014 at 4:21 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi,

 Thanks again!

 This time, I have indexed data with the following specs. I run into > 40
 seconds for the FastTaxonomyFacetCounts to create all the facets. Is this
 as per your measurements? Subsequent runs fare much better probably because
 of the Windows file system cache. How can I speed this up?
 I believe there was a CategoryListCache earlier. Is there any cache or
 other implementation that I can use?

 Secondly, I had a general question. If I extrapolate these numbers for a
 billion documents, my search and facet number may probably be unusable in a
 real time scenario. What are the strategies employed when you deal with
 such large scale? I am new to Lucene so please also direct me to the
 relevant info sources. Thanks!

 Corpus:
 Count: 20M, Size: 51GB

 Index:
 Size (w/o Facets): 19GB, Size
 (w/Facets): 20.12GB
 Creation Time (w/o Facets):
 3.46hrs, Creation Time (w/Facets): 3.49hrs

 Search Performance:
With 29055 hits (5 terms in query):
Query Execution: 8 seconds
Facet counts execution: 40-45 seconds

With 4.22M hits (2 terms in query):
Query Execution: 3 seconds
Facet counts execution: 42-46 seconds

With 15.1M hits (1 term in query):
Query Execution: 2 seconds
Facet counts execution: 45-53 seconds

With 6183 hits (5 different values for the same 5 terms):
  (Without Flushing Windows File Cache on Next
 run)
Query Execution: 11 seconds
Facet counts execution:  1 second

With 4.9M hits (1 different value for the 1 term): (Without
 Flushing
 Windows File Cache on Next run)
Query Execution: 2 seconds
Facet counts execution: 3 seconds

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Monday, June 16, 2014 8:11 PM, Shai Erera ser...@gmail.com wrote:



 Hi

 1.] Is there any API that gives me the count of a specific dimension from
  FacetCollector in response to a search query. Currently, I use the
  getTopChildren() with some value and then check the
  FacetResult object for
  the actual number of dimensions hit along with their occurrences. Also,
 the
  getSpecificValue() does not work without a path attribute to the API.
 

 To get the value of the dimension itself, you should call getTopChildren(1,
 dim). Note that getSpecificValue does not allow to pass only the dimension,
 and getTopChildren requires topN to be > 0. Passing 1 is a hack, but I'm
 not sure we should specifically support getting the aggregated value of
 just the dimension ... once you get that, the FacetResult.value tells you
 the aggregated count.
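
 For illustration, that hack would look roughly like this ("Author" is just an
 example dimension; getTopChildren may return null if the dimension was never seen):

 FacetResult author = facets.getTopChildren(1, "Author");
 if (author != null) {
   Number dimTotal = author.value; // aggregated count for the whole "Author" dimension
 }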

 2.] Can I find the MAX or MIN value of a Numeric type field written to the
  index?
 

 Depends how you index them. If you
  index the field as a numeric field (e.g.
 LongField), I believe you can use NumericUtils.getMaxLong. If it's a
 DocValues field, I don't know of a built-in function that does it, but this
 thread has a demo code:
 http://www.gossamer-threads.com/lists/lucene/java-user/195594.
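
 As a rough, untested sketch of the DocValues case, one can scan each segment of
 an open reader (the field name "date" is made up here; deleted docs and docs
 without a value, which report 0, are not filtered out):

 long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
 for (AtomicReaderContext ctx : reader.leaves()) {
   NumericDocValues dv = ctx.reader().getNumericDocValues("date");
   if (dv == null) continue; // this segment has no values for the field
   int maxDoc = ctx.reader().maxDoc();
   for (int doc = 0; doc < maxDoc; doc++) {
     long v = dv.get(doc);
     min = Math.min(min, v);
     max = Math.max(max, v);
   }
 }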

 3.] I am trying to compare and contrast Lucene Facets with Elastic Search.
  I could determine that ES does search time faceting and dynamically
 returns
  the response without any prior faceting during indexing time. Is index
 time
  lag is not my concern, can I assume that, in general, performance-wise
  Lucene facets would be faster?
 

 I will start by saying that I don't know much about how ES facets work. We
 have some committers who know both how
  Lucene and ES facets work, so they
 can comment on that. But I personally doubt there's truly no index-time
 decision when it comes to faceting. Well

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera
OK I think I now understand what you're asking :). It's unrelated though to
SortingMergePolicy. You propose to do the merge part of a merge-sort,
since we know the indexes are already sorted, right?

This is something we've considered in the past, but it is very tricky (see
below) and we went with the SortingAR for simplicity and speed of coding.
If however you have an idea how we can easily implement that, that would be
awesome.

So let's consider merging the posting lists of f:val from the N readers.
Say that each returns docs 0-3, and the merged posting will have 4*N
entries (say we don't have deletes). To properly merge them, you need to
lookup the sort-value of each document from each reader, and compare
according to it.

Now you move on to f:val2 (another posting) and it wants to merge 100 other
docs. So you need to lookup the value of each document, compare by it, and
merge them. And the process continues ...

These lookups are expensive and will be done millions of times (each term,
each DV field, each .. everything).

More than that, there's a serious issue of correctness, because you never
make a global sorting decision. So if f:val sees only a single document -
0, in all segments, you want to map them to 4 GLOBALLY SORTED documents. If
you make a local decision based on these 4 documents, you will end up w/ a
completely messed up segment.

I think the global DocMap is really required. Forget for a moment that other
code, e.g. IndexWriter, relies on this in order to properly apply incoming
document deletions and field updates while the segments were merging. It's
just a matter of correctness - we need to know the global sorted segment
map.

Shai


On Tue, Jun 17, 2014 at 3:41 PM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:

 
  Therefore the DocMap is initialized only when the
  merge actually executes ... what is there more to postpone?


 Agreed. However, what I am asking is, if there is an alternative to DocMap,
 will that be better? Plz read-on

  And besides, if the segments are already sorted, you should return a
 null DocMap,
  like Lucene code does ...


 What I am trying to say is, my individual segments are sorted. However,
 when a merge combines N individual sorted-segments, there needs to be a
 global sort-order for writing the new segment. Passing null DocMap won't
 work here, no?

 DocMap is one way of bringing the global order during a merge. Another way
 is to use something like a MergedIterator<SegmentReader> instead of DocMap,
 which doesn't need any memory

 I was trying to get a heads-up on these 2 approaches. Please do let me know
 if I have understood correctly

 --
 Ravi




 On Tue, Jun 17, 2014 at 5:42 PM, Shai Erera ser...@gmail.com wrote:

  
   I am afraid the DocMap still maintains doc-id mappings till merge and I
  am
   trying to avoid it...
  
 
  What do you mean 'till merge'? The method OneMerge.getMergeReaders() is
  called only when the merge is executed, not when the MergePolicy decided
 to
  merge those segments. Therefore the DocMap is initialized only when the
  merge actually executes ... what is there more to postpone?
 
  And besides, if the segments are already sorted, you should return a null
  DocMap, like Lucene code does ...
 
  If I miss your point, I'd appreciate if you can point me to a code
 example,
  preferably in Lucene source, which demonstrates the problem.
 
  Shai
 
 
  On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan 
  ravikumar.govindara...@gmail.com wrote:
 
   I am afraid the DocMap still maintains doc-id mappings till merge and I
  am
   trying to avoid it...
  
   I think lucene itself has a MergeIterator in o.a.l.util package.
  
   A MergePolicy can wrap a simple MergeIterator for iterating docs across
   different AtomicReaders in correct sort-order for a given field/term
  
   That should be fine right?
  
   --
   Ravi
  
   --
   Ravi
  
  
   On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera ser...@gmail.com wrote:
  
loadSortTerm is your method right? In the current Sorter.sort
implementation, I see this code:
   
boolean sorted = true;
 for (int i = 1; i < maxDoc; ++i) {
   if (comparator.compare(i-1, i) > 0) {
sorted = false;
break;
  }
}
if (sorted) {
  return null;
}
   
Perhaps you can write similar code?
   
Also note that the sorting interface has changed, I think in 4.8, and
  now
you don't really need to implement a Sorter, but rather pass a
  SortField,
if that works for you.
   
Shai
   
   
On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:
   
 Shai,

 This is the code snippet I use inside my class...

 public class MySorter extends Sorter {

 @Override

 public DocMap sort(AtomicReader reader) throws IOException {

    final Map<Integer, BytesRef> docVsId = loadSortTerm(reader

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera
That said... if we generate the global DocMap up front, there's no reason
to not execute the merge of the segments more efficiently, i.e. without
wrapping them in a SlowCompositeReaderWrapper.

But that's not work for SortingMergePolicy, it's either a special
SortingAtomicReader which wraps a group of readers + a global DocMap, and
then merge-sorts them more efficiently than how it's done now. Or we tap
into SegmentMerger .. which is way more complicated.

Perhaps it would be worth exploring a SortingMultiSortedAtomicReader which
merge-sorts the postings and other data that way ... I looked at e.g. how
doc-values are merged ... not sure it will improve performance. But if you
want to cons up a patch, that'd be awesome!

Shai


On Tue, Jun 17, 2014 at 8:01 PM, Shai Erera ser...@gmail.com wrote:

 OK I think I now understand what you're asking :). It's unrelated though
 to SortingMergePolicy. You propose to do the merge part of a merge-sort,
 since we know the indexes are already sorted, right?

 This is something we've considered in the past, but it is very tricky (see
 below) and we went with the SortingAR for simplicity and speed of coding.
 If however you have an idea how we can easily implement that, that would be
 awesome.

 So let's consider merging the posting lists of f:val from the N readers.
 Say that each returns docs 0-3, and the merged posting will have 4*N
 entries (say we don't have deletes). To properly merge them, you need to
 lookup the sort-value of each document from each reader, and compare
 according to it.

 Now you move on to f:val2 (another posting) and it wants to merge 100
 other docs. So you need to lookup the value of each document, compare by
 it, and merge them. And the process continues ...

 These lookups are expensive and will be done millions of times (each term,
 each DV field, each .. everything).

 More than that, there's a serious issue of correctness, because you never
 make a global sorting decision. So if f:val sees only a single document -
 0, in all segments, you want to map them to 4 GLOBALLY SORTED documents. If
 you make a local decision based on these 4 documents, you will end up w/ a
 completely messed up segment.

  I think the global DocMap is really required. Forget for a moment that other
  code, e.g. IndexWriter, relies on this in order to properly apply incoming
 document deletions and field updates while the segments were merging. It's
 just a matter of correctness - we need to know the global sorted segment
 map.

 Shai


 On Tue, Jun 17, 2014 at 3:41 PM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

 
  Therefore the DocMap is initialized only when the
  merge actually executes ... what is there more to postpone?


 Agreed. However, what I am asking is, if there is an alternative to
 DocMap,
 will that be better? Plz read-on

  And besides, if the segments are already sorted, you should return a
 null DocMap,
  like Lucene code does ...


 What I am trying to say is, my individual segments are sorted. However,
 when a merge combines N individual sorted-segments, there needs to be a
 global sort-order for writing the new segment. Passing null DocMap won't
 work here, no?

  DocMap is one way of bringing the global order during a merge. Another way
  is to use something like a MergedIterator<SegmentReader> instead of
 DocMap,
 which doesn't need any memory

 I was trying to get a heads-up on these 2 approaches. Please do let me
 know
 if I have understood correctly

 --
 Ravi




 On Tue, Jun 17, 2014 at 5:42 PM, Shai Erera ser...@gmail.com wrote:

  
   I am afraid the DocMap still maintains doc-id mappings till merge and
 I
  am
   trying to avoid it...
  
 
  What do you mean 'till merge'? The method OneMerge.getMergeReaders() is
  called only when the merge is executed, not when the MergePolicy
 decided to
  merge those segments. Therefore the DocMap is initialized only when the
  merge actually executes ... what is there more to postpone?
 
  And besides, if the segments are already sorted, you should return a
 null
  DocMap, like Lucene code does ...
 
  If I miss your point, I'd appreciate if you can point me to a code
 example,
  preferably in Lucene source, which demonstrates the problem.
 
  Shai
 
 
  On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan 
  ravikumar.govindara...@gmail.com wrote:
 
   I am afraid the DocMap still maintains doc-id mappings till merge and
 I
  am
   trying to avoid it...
  
   I think lucene itself has a MergeIterator in o.a.l.util package.
  
   A MergePolicy can wrap a simple MergeIterator for iterating docs
 across
   different AtomicReaders in correct sort-order for a given field/term
  
   That should be fine right?
  
   --
   Ravi
  
   --
   Ravi
  
  
   On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera ser...@gmail.com wrote:
  
loadSortTerm is your method right? In the current Sorter.sort
implementation, I see this code:
   
boolean sorted = true;
for (int i = 1; i

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Shai Erera
Nothing suspicious ... code looks fine. The call to FastTaxoFacetCounts
actually computes the counts ... that's the expensive part of faceted
search.

How big is your taxonomy (number categories)?
Is it hierarchical (i.e. are your dimensions flat, or deep like A/1/2/3/)?
What does your FacetsConfig look like?

Still, unless maybe your taxonomy is huge (hundreds of millions of
categories), I don't think you could mess something up badly enough to end
up w/ 40-45s response times!

Shai


On Tue, Jun 17, 2014 at 8:51 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi,

 Thanks for your response. It does sound pretty bad which is why I am not
 sure whether there is an issue with the code, the index, the searcher, or
 just the machine, as you say.
 I will try with another machine just to make sure and post the results.

 Meanwhile, can you tell me if there is anything wrong in the below
 measurement? Or is the API usage or the pattern incorrect?

 I used a tool called RAMMap to clean the Windows cache. If I do not, the
 results are very fast as I mentioned already. If I do, then the total time
 is 40s.

 Can you please provide any pointers on what could be wrong? I will be
 checking on a Linux box anyway.

 =
 System.out.println("1. Start Date: " + new Date());
 TopDocs topDocs = FacetsCollector.search(searcher, query, 100, fc);
 System.out.println("1. End Date: " + new Date());
 // Above part takes approx 2-12 seconds depending on the query

 System.out.println("2. Start Date: " + new Date());
 List<FacetResult> results = new ArrayList<FacetResult>();
 Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
 System.out.println("2. End Date: " + new Date());
 // Above part takes approx 40-53 seconds depending on the query for the
 // first time on Windows

 System.out.println("3. Start Date: " + new Date());
 results.add(facets.getTopChildren(1000, "F1"));
 results.add(facets.getTopChildren(1000, "F2"));
 results.add(facets.getTopChildren(1000, "F3"));
 results.add(facets.getTopChildren(1000, "F4"));
 results.add(facets.getTopChildren(1000, "F5"));
 results.add(facets.getTopChildren(1000, "F6"));
 results.add(facets.getTopChildren(1000, "F7"));
 System.out.println("3. End Date: " + new Date());
 // Above part takes approx less than 1 second
 =

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Tuesday, June 17, 2014 10:15 PM, Shai Erera ser...@gmail.com wrote:



 Hi

 40 seconds for faceted search is ... crazy. Also, note how the times don't
 differ much even though the number of hits is much higher (29K vs 15.1M)
 ... That, w/ that you say that subsequent queries are much faster (few
 seconds)
  suggests that something is seriously messed up w/ your
 environment. Maybe it's a faulty disk? E.g. after the file system cache is
 warm, you no longer hit the disk?

 In general, the more hits you have, the more expensive is faceted search.
 It's also true for scoring as well (i.e. even without facets). There's just
 more work to determine the top results (docs, facets...). With facets, you
 can use sampling (see RandomSamplingFacetsCollector), but I would do that
 only after you verify that collecting 15M docs is very expensive for you,
 even when the file system cache is hot.

 I've never
  seen those numbers before, therefore it's difficult for me to
 relate to them.

 There's a caching mechanism for facets, through CachedOrdinalsReader. But I
 wouldn't go there until you verify that your IO system is good (try another
 machine, OS, disk ...)., and that the 40s times are truly from the faceting
 code.

 Shai



 On Tue, Jun 17, 2014 at 4:21 PM, Sandeep Khanzode 
 sandeep_khanz...@yahoo.com.invalid wrote:

  Hi,
 
  Thanks again!
 
  This time, I have indexed data with the following specs. I run into > 40
  seconds for the FastTaxonomyFacetCounts to create all the facets. Is this
  as per your measurements? Subsequent runs fare much better probably
 because
  of the Windows file system cache. How can I speed this up?
  I believe there was a CategoryListCache earlier. Is there any cache or
  other implementation that I can use?
 
  Secondly, I had a general question. If I extrapolate these numbers for a
  billion documents, my search and facet number may probably be unusable
 in a
  real time scenario. What are the strategies employed when you deal with
  such large scale? I am new to Lucene so please also direct me to the
  relevant info sources. Thanks!
 
  Corpus:
  Count: 20M, Size: 51GB
 
  Index:
  Size (w/o Facets): 19GB, Size
  (w/Facets): 20.12GB
  Creation Time (w/o Facets):
  3.46hrs,
  Creation Time (w/Facets): 3.49hrs
 
  Search Performance:
 With 29055 hits (5 terms in query):
 Query Execution: 8 seconds
 Facet counts execution: 40-45 seconds
 
 With 4.22M hits (2 terms in query

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Shai Erera
You can get the size of the taxonomy by calling taxoReader.getSize(). What
does the 28K of the $facets field denote - the number of terms
(drill-down)? If so, that sounds like your taxonomy is of that size.

And indeed, this is a tiny taxonomy ...

How many facets do you record per document? This also affects the amount of
IO that's done during search, as we traverse the BinaryDocValues field,
reading the categories of each document.

Shai


On Tue, Jun 17, 2014 at 9:32 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 If I am counting correctly, the $facets field in the index shows a count
 of approx. 28k. That does not sound like much, I guess. All my facets are
 flat and the FacetsConfig only defines a couple of them to be multi-valued.

 Let me know if I am not counting the taxonomy size correctly. The
 taxoReader.getSize() also shows this count.

 I will check on a Linux box to make sure. Thanks,

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Tuesday, June 17, 2014 11:28 PM, Shai Erera ser...@gmail.com wrote:



 Nothing suspicious ... code looks fine. The call to FastTaxoFacetCounts
 actually computes the counts ... that's the expensive part of faceted
 search.

 How big is your taxonomy (number categories)?
 Is it hierarchical (i.e. are your dimensions flat, or deep like A/1/2/3/)?
 What does your FacetsConfig look like?

  Still, unless maybe your taxonomy is huge (hundreds of millions of
  categories), I don't think you could mess something up badly enough to end
  up w/ 40-45s response times!

 Shai


 On Tue, Jun 17, 2014 at 8:51 PM, Sandeep Khanzode 
 sandeep_khanz...@yahoo.com.invalid wrote:

  Hi,
 
  Thanks for your response. It does sound pretty bad which is why I am not
  sure whether there is an issue with the code, the index, the searcher, or
  just the machine, as you say.
  I will try with another machine just to make sure and post the results.
 
  Meanwhile, can you tell me if there is anything wrong in the below
  measurement? Or is the API usage or the pattern incorrect?
 
  I used a tool called RAMMap to clean the Windows cache. If I do not, the
  results are very fast as I mentioned already. If I do, then the total
 time
  is 40s.
 
  Can you please provide any pointers on what could be wrong? I will be
  checking on a Linux box anyway.
 
  =
  System.out.println("1. Start Date: " + new Date());
  TopDocs topDocs = FacetsCollector.search(searcher, query, 100, fc);
  System.out.println("1. End Date: " + new Date());
  // Above part takes approx 2-12 seconds depending on the query

  System.out.println("2. Start Date: " + new Date());
  List<FacetResult> results = new ArrayList<FacetResult>();
  Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
  System.out.println("2. End Date: " + new Date());
  // Above part takes approx 40-53 seconds depending on the query for the
  // first time on Windows

  System.out.println("3. Start Date: " + new Date());
  results.add(facets.getTopChildren(1000, "F1"));
  results.add(facets.getTopChildren(1000, "F2"));
  results.add(facets.getTopChildren(1000, "F3"));
  results.add(facets.getTopChildren(1000, "F4"));
  results.add(facets.getTopChildren(1000, "F5"));
  results.add(facets.getTopChildren(1000, "F6"));
  results.add(facets.getTopChildren(1000, "F7"));
  System.out.println("3. End Date: " + new Date());
  // Above part takes approx less than 1 second
  =
 
  ---
  Thanks n Regards,
  Sandeep Ramesh Khanzode
 
 
  On Tuesday, June 17, 2014 10:15 PM, Shai Erera ser...@gmail.com wrote:
 
 
 
  Hi
 
  40 seconds for faceted search is ... crazy. Also, note how the times
 don't
  differ much even though the number of hits is much higher (29K vs 15.1M)
  ... That, w/ that you say that subsequent queries are much faster (few
  seconds)
   suggests that something is seriously messed up w/ your
  environment. Maybe it's a faulty disk? E.g. after the file system cache
 is
  warm, you no longer hit the disk?
 
  In general, the more hits you have, the more expensive is faceted search.
  It's also true for scoring as well (i.e. even without facets). There's
 just
  more work to determine the top results (docs, facets...). With facets,
 you
  can use sampling (see RandomSamplingFacetsCollector), but I would do that
  only after you verify that collecting 15M docs is very expensive for you,
  even when the file system cache is hot.
 
  I've never
   seen those numbers before, therefore it's difficult for me to
  relate to them.
 
  There's a caching mechanism for facets, through CachedOrdinalsReader.
 But I
  wouldn't go there until you verify that your IO system is good (try
 another
  machine, OS, disk ...)., and that the 40s times are truly from the
 faceting
  code.
 
  Shai
 
 
 
  On Tue, Jun 17, 2014 at 4:21 PM, Sandeep Khanzode 
  sandeep_khanz...@yahoo.com.invalid wrote:
 
   Hi

Re: Lucene 4.8.1 - Taxonomy

2014-06-16 Thread Shai Erera
Err ... are you sure there's an index in the directory that you point Luke
at? I see that the exception points to ".", which suggests the local
directory from where Luke was run.

There's nothing special about the taxonomy index, as far as Luke is
concerned. However, note that I do not recommend trying to alter the taxonomy
index via Luke in any way, as its structure is very specific and things
rely on it. It's not a usual index ... i.e. there's no point trying to
search it or something like that.

Shai


On Mon, Jun 16, 2014 at 9:35 AM, Mrugesh Patel mrugesh.pa...@infodesk.com
wrote:

 Hi,



 I would like to open taxonomy indices in a tool (like Luke). Please could
 you help? Currently I am able to open other lucene indices  in Luke 4.8.1
 but unable to open taxonomy indices. When I try to open taxonomy indices in
 Luke 4.8.1 then it shows

 org.apache.lucene.index.IndexNotFoundException: no segments* file found in
 . exception.



 Please help.



 Thanks,

 Mrugesh




Re: SortingMergePolicy for already sorted segments

2014-06-16 Thread Shai Erera
I'm not sure that I follow ... where do you see DocMap being loaded up
front? Specifically, Sorter.sort may return null if the readers are already
sorted ... I think we already optimized for the case where the readers are
sorted.

Shai


On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:

 I am planning to use SortingMergePolicy where all the merge-participating
 segments are already sorted... I understand that I need to define a DocMap
 with old-new doc-id mappings.

 Is it possible to optimize the eager loading of DocMap and make it kind of
 lazy load on-demand?

 Ex: Pass a List<AtomicReader> to the caller and ask for the next new-old doc
 mapping..

 Since my segments are already sorted, I could save on memory a little-bit
 this way, instead of loading the full DocMap upfront

 --
 Ravi



Re: Facets in Lucene 4.7.2

2014-06-14 Thread Shai Erera
Hi

Currently there's no way to add e.g. terms to already indexed documents;
you have to re-index them. The only updatable field type Lucene offers
currently are DocValues fields. If the list of markers/flags is fixed in
your case, and you can map them to an integer, I think you could use a
NumericDocValues field, which supports field-level updates.

Once you do that, you can then:

* Count on this field pretty easily. You will need to write a Facets
implementation, but otherwise it's very easy.

* Filter queries: you will need to write a Filter which returns a DocIdSet
of the documents that belong to one category (e.g. Financially Relevant).
Here you might want to consider caching the result of the Filter, by using
CachingWrapperFilter.

It's not the best approach, updatable Terms would better suit your usecase,
however we don't offer them yet and it will be a while until we do (and IF
we do). You should also benchmark that approach vs re-indexing the
documents since the current implementation of updatable doc-values fields
isn't optimized for a few document updates between index reopens. See here:
http://shaierera.blogspot.com/2014/04/benchmarking-updatable-docvalues.html

Shai
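
A minimal, untested sketch of that marker update, assuming the documents were
indexed with a NumericDocValuesField("marker", ...) and a unique "id" term (both
field names are made up):

// flip the marker of one document without re-indexing it
writer.updateNumericDocValue(new Term("id", "doc-42"), "marker", 3L); // 3 = e.g. "financially relevant"
writer.commit(); // or rely on the normal commit/refresh cycle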


On Fri, Jun 13, 2014 at 10:19 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi Shai,

 Thanks so much for the clear explanation.

 I agree on the first question. Taxonomy Writer with a separate index would
 probably be my approach too.

 For the second question:
 I am a little new to the Facets API so I will try to figure out the
 approach that you outlined below.

 However, the scenario is such: Assume a document corpus that is indexed.
 For a user query, a document is returned and selected by the user for
 editing as part of some use case/workflow. That document is now marked as
 either historically interesting or not, financially relevant, specific to
 media or entertainment domain, etc. by the user. So, essentially the user
 is flagging the document with certain markers.
 Another set of users could possibly want to query on these markers. So,
 lets say, a second user comes along, and wants to see the top documents
 belonging to one category, say, agriculture or farming. Since these markers
 are run time activities, how can I use the facets on them? So, I was
 envisioning facets as the various markers. But, if I constantly re-index or
 update the documents whenever a marker changes, I believe it would not be
 very efficient.

 Is there anything, facets or otherwise, in Lucene that can help me solve
 this use case?

 Please let me know. And, thanks!

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Friday, June 13, 2014 9:51 PM, Shai Erera ser...@gmail.com wrote:



 Hi

 You can check the demo code here:

 https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/
 .
 This code is updated with each release, so you always get working code
 examples, even when the API changes.

 If you don't mind managing the sidecar index, which I agree isn't such a
 big deal, then yes - the taxonomy index currently performs the fastest. I
 plan to explore porting the taxonomy-based approach from BinaryDocValues to
 the new SortedNumericDocValues (coming out in 4.9) since it might perform
 even faster.

 I didn't quite get the marker/flag facet. Can you give an example? For
 instance, if you can model that as a NumericDocValuesField added to
 documents (w/ the different markers/flags translated to numbers), then you
 can use Lucene's updatable numeric DocValues and write a custom Facets to
 aggregate on that NumericDocValues field.

 Shai



 On Fri, Jun 13, 2014 at 11:48 AM, Sandeep Khanzode 
 sandeep_khanz...@yahoo.com.invalid wrote:

  Hi,
 
  I am evaluating Lucene Facets for a project. Since there is a lot of
  change in 4.7.2 for Facets, I am relying on UTs for reference. Please let
  me know if there are other sources of information.
 
  I have a couple of questions:
 
  1.] All categories in my application are flat, not hierarchical. But, it
  seems from a few sources, that even that notwithstanding, you would want
 to
  use a Taxonomy based index for performance reasons. It is faster but uses
  more RAM. Or is the deterrent to use it is the fact that it is a separate
  data structure. If one could do with the life-cycle management of the
 extra
  index, should we go ahead with the taxonomy index for better performance
  across tens of millions of documents?
 
  Another note to add is that I do not see a scenario wherein I would want
  to re-index my collection over and over again or, in other words, the
  changes would be spread over time.
 
  2.] I need a type of dynamic facet that allows me to add a flag or marker
  to the document at runtime since it will change/update every time a user
  modifies or adds to the list of markers. Is this possible to do with the
  current implementation? Since I believe, that currently all

Re: Facets in Lucene 4.7.2

2014-06-13 Thread Shai Erera
Hi

You can check the demo code here:
https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/.
This code is updated with each release, so you always get working code
examples, even when the API changes.

If you don't mind managing the sidecar index, which I agree isn't such a
big deal, then yes - the taxonomy index currently performs the fastest. I
plan to explore porting the taxonomy-based approach from BinaryDocValues to
the new SortedNumericDocValues (coming out in 4.9) since it might perform
even faster.

I didn't quite get the marker/flag facet. Can you give an example? For
instance, if you can model that as a NumericDocValuesField added to
documents (w/ the different markers/flags translated to numbers), then you
can use Lucene's updatable numeric DocValues and write a custom Facets to
aggregate on that NumericDocValues field.

Shai


On Fri, Jun 13, 2014 at 11:48 AM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi,

 I am evaluating Lucene Facets for a project. Since there is a lot of
 change in 4.7.2 for Facets, I am relying on UTs for reference. Please let
 me know if there are other sources of information.

 I have a couple of questions:

 1.] All categories in my application are flat, not hierarchical. But, it
 seems from a few sources, that even that notwithstanding, you would want to
 use a Taxonomy based index for performance reasons. It is faster but uses
 more RAM. Or is the deterrent to use it is the fact that it is a separate
 data structure. If one could do with the life-cycle management of the extra
 index, should we go ahead with the taxonomy index for better performance
 across tens of millions of documents?

 Another note to add is that I do not see a scenario wherein I would want
 to re-index my collection over and over again or, in other words, the
 changes would be spread over time.

 2.] I need a type of dynamic facet that allows me to add a flag or marker
 to the document at runtime since it will change/update every time a user
 modifies or adds to the list of markers. Is this possible to do with the
 current implementation? Since I believe, that currently all faceting is
 done at indexing time.


 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


Re: Faceted Search User's Guide for Lucene 4.8.1

2014-06-11 Thread Shai Erera
Hi

We removed the userguide a long time ago, and replaced it with better
documentation on the classes and package.html, as well as demo code that
you can find here:
https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/

You can also look up some blog posts that I wrote a while ago on facets,
that explain how they work and some internals, even though the code
examples are not up-to-date w/ latest API changes:

http://shaierera.blogspot.com/2012/11/lucene-facets-part-1.html
http://shaierera.blogspot.com/2012/11/lucene-facets-part-2.html
http://shaierera.blogspot.com/2012/12/lucene-facets-under-hood.html
http://shaierera.blogspot.com/2013/01/facet-associations.html

Shai


On Wed, Jun 11, 2014 at 10:51 AM, Raf r.ventag...@gmail.com wrote:

 Hi,
 I have found this useful guide to the *Lucene Faceted Search*:

 http://lucene.apache.org/core/4_4_0/facet/org/apache/lucene/facet/doc-files/userguide.html

 The problem is that it refers to Lucene version 4.4, while I am using the
 latest available release (4.8.1) and I cannot find some classes (e.g.
 FacetSearchParams
 or CountFacetRequest).

 Is there an updated version of that guide?
 I tried this
 http://lucene.apache.org/core/*4_8_1*/facet/org/apache/lucene/facet/doc-files/userguide.html
 but it does not work :|

 Thank you for any help you can provide.

 Regards,
 *Raf*



Re: Multi-thread indexing, should the commit be called from each thread?

2014-05-21 Thread Shai Erera
You don't need to commit from each thread, you can definitely commit when
all threads are done. In general, you should commit only when you want to
ensure the data is safe on disk.

Shai
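
A minimal sketch of that pattern (IndexWriter is thread-safe; 'writer' and 'docs'
are assumptions here, and checked exceptions are handled crudely):

ExecutorService pool = Executors.newFixedThreadPool(4);
for (final Document doc : docs) {
  pool.submit(new Runnable() {
    public void run() {
      try {
        writer.addDocument(doc); // safe to call concurrently
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);
writer.commit(); // a single commit once all indexing threads are done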


On Wed, May 21, 2014 at 2:58 PM, andi rexha a_re...@hotmail.com wrote:

 Hi!
 I have a question about multi-thread indexing. When I perform a
 Multi-thread indexing, should I commit from each thread that I add
 documents or the commit should be done only when all the threads are done
 with their indexing task?

 Thank you!



Re: best choice for ramBufferSizeMB

2014-05-14 Thread Shai Erera
Well, first make sure that you set ramBufferSizeMB to well below the max
Java heap size, otherwise you could run into OOMs.

While a larger RAM buffer may speed up indexing (since it flushes less
often to disk), it's not the only factor that affects indexing speed.

For instance, if a big portion of your indexing work is reading the files
from a slow storage device (maybe NFS share, remote Http etc.), then that
could easily shadow any benefits of using large RAM buffer.

Also, do you index with a single or multiple threads? Lucene supports
multi-threaded indexing, and it's recommended to do whenever you can, e.g.
when you run on a sufficiently strong HW (4+ cores...).

Another thing: in the past I noticed that too large a RAM buffer did not
improve indexing at all, e.g. if your underlying IO system is slow (e.g.
indexing to an NFS share, distributed file-system etc.), then the cost of
flushing a big RAM buffer became significant, more than indexing in RAM,
and e.g. I did not observe improvements when using ramBufferSizeMB=512 vs
128. Also, using a big RAM buffer uses more space on the heap, and makes
the job of the GC harder. So I think it might be that a too big RAM buffer
may actually slow things down, rather than speed up.

Indexing speed is affected by multiple parameters, the RAM buffer is only
one of them...

Shai
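
For what it's worth, a hedged Lucene 3.6 sketch of the relevant settings
('directory' and 'analyzer' are assumed to exist; 128 MB stays well below the
512m heap above):

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, analyzer);
iwc.setRAMBufferSizeMB(128.0);                                // flush by RAM usage...
iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH); // ...not by doc count
IndexWriter writer = new IndexWriter(directory, iwc);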


On Wed, May 14, 2014 at 4:33 PM, Gudrun Siedersleben 
siedersle...@mpdl.mpg.de wrote:

 Hi all,

 we want to speed up building our lucene index.  We set ramBufferSize to
 some values between 32 and 128 MB, but that does not make any difference
 concerning the time used for reindexing. We did not set maxBufferedDocs, ..
 which could conflict.
 We start the JVM with the following JAVA_OPTS:

 -Xms128m -Xmx512m -XX:MaxPermSize=256m

 What is the recommended value for ramBufferSizeMB depending on JAVA_OPTS
 and perhaps other lucene parameters set? We use Lucene 3.6.0.

 Best regards

 Gudrun





Re: Fields, Index segments and docIds (second Try)

2014-05-02 Thread Shai Erera
I don't think that you need to be concerned with the internal docIDs much.
Just imagine the indexes as a big table with multiple columns, where
columns are grouped together. Each group is a different index. If a
document does not have a value in one column, then you have an empty cell.
if a document doesn't have a value in an entire group of columns, then you
denote that by adding an empty document.

Oh, and make sure to use a LogMergePolicy, so segments are merged in the
same order across all indexes.

And given that you rebuild the indexes every time, you can create them
one-by-one. You don't need to do that in parallel to all indexes, unless
it's more convenient for you.

Shai
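
A rough sketch of that setup, using the LogDocMergePolicy suggested in
ParallelCompositeReader's jdocs ('commonDir' and 'analyzer' are assumptions):

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, analyzer);
iwc.setMergePolicy(new LogDocMergePolicy()); // merges adjacent segments, preserving order
IndexWriter commonWriter = new IndexWriter(commonDir, iwc);
// build each of the other indexes (English, German, ...) with its own,
// identically configured IndexWriterConfig, adding documents in the same order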


On Fri, May 2, 2014 at 9:28 AM, Olivier Binda olivier.bi...@wanadoo.frwrote:

 On 05/02/2014 06:05 AM, Shai Erera wrote:

 If you're always rebuilding, let alone forceMerge, you shouldn't have too
 much trouble implementing it. Just make sure that you add documents in the
 same order to all indexes.

 If you're always rebuilding, how come you have deletions? Anyway, you must
 also delete in all indexes.


 Indeed, I don't have deletions and I'm mainly concerned with merges.
 But I just want to understand the whole docId remapping process,
 out of curiosity and also because obtaining a docId (and not losing it)
 seems to be the key of parallel indexes

  On May 2, 2014 1:57 AM, Olivier Binda olivier.bi...@wanadoo.fr wrote:

  On 05/01/2014 10:28 AM, Shai Erera wrote:

  I'm glad it helped you. Good luck with the implementation.

  Thanks. First I started looking at the lucene internal code. To
 understand
 when/where and why docIds are changing/need to be changed (in merge and
 doc
 deletions) .
 I have always wanted to understand this and I think the understanding may
 help me somehow.

  One thing I didn't mention (though it's in the jdocs) -- it's not enough
 to
 have the documents of each index aligned, you also have to have the
 segments aligned. That is, if both indexes have documents 0-5 aligned,
 but
 one index contains a single segment and the other one 2 segments, that's
 not going to work.

  That's good to know.

   It is possible to do w/ some care -- when you build the German index,

 disable merges (use NoMergePolicy) and flush whenever you indexed enough
 documents to match an existing segment on e.g. the Common index.

 Or, if rebuilding all indexes won't take long, you can always rebuild
 all
 of them.

  Yes. That's what I am usually doing (it takes less than 1 minute)
 Yet, I usually do a forceMerge too to only have 1 segment :/

   Shai


 On Thu, May 1, 2014 at 12:00 AM, Olivier Binda 
 olivier.bi...@wanadoo.fr
 wrote:

   On 04/30/2014 10:48 AM, Shai Erera wrote:

   I hope I got all the details right, if I didn't then please clarify.

 Also,
 I haven't read the entire thread, so if someone already suggested this
 ...
 well, it probably means it's the right solution :)

 It sounds like you could use Lucene's ParallelCompositeReader, which
 already handles multiple IndexReaders that are aligned by their
 internal
 document IDs. The way it would work, as far as I understand your
 scenario
 is something like the following table (columns denote different
 indexes).
 Each index contains a subset of relevant fields, where common contains
 the
 common fields, and each language index contains the respective
 language
 fields.

  DocID      LuceneID  Common  English       German
  FirstDoc   0         A,B,C   EN_words,     DE_words,
                               EN_sentences  DE_sentences
  SecondDoc  1         A,B,C
  ThirdDoc   2         A,B,C

 Each index can contain all relevant fields, or only a subset (e.g.
 maybe
 not all documents have a value for the 'B' field in the 'common'
 index).
 What's absolutely very important here though is that the indexes are
 created very carefully, and if e.g. SecondDoc is not translated into
 German, *you must still have an empty document* in the German index,
 or
 otherwise, document IDs will not align.

   That's exactly how I saw it and what I need to do. So, I'll have a
 very

 good look at

 ParallelCompositeReader


   Lucene does not offer a way to build those indexes though (patches

 welcome!!).

   This answers my question 1. Thanks.  :)

 I somehow hoped that there was already support for that kind of
 situation
 in lucene but well,
 now at least I know that I won't find an already made solution to my
 problem in the lucene classes and that I will have to code one myself,
 by taking inspiration in the lucene classes that do similar processing.

   We've started some effort very long time ago on LUCENE-1879

 (there's a patch and a discussion for an alternative approach) as well
 as
 there is a very useful suggestion in ParallelCompositeReader's jdocs
 (use
 LogDocMergePolicy).

   Wow, priceless. This gives me some headstart and inspiration. :)


   One challenge is how to support multi-threaded indexing, but perhaps

 this
 isn't a problem in your

Re: Fields, Index segments and docIds (second Try)

2014-05-01 Thread Shai Erera
I'm glad it helped you. Good luck with the implementation.

One thing I didn't mention (though it's in the jdocs) -- it's not enough to
have the documents of each index aligned, you also have to have the
segments aligned. That is, if both indexes have documents 0-5 aligned, but
one index contains a single segment and the other one 2 segments, that's
not going to work.

It is possible to do w/ some care -- when you build the German index,
disable merges (use NoMergePolicy) and flush whenever you indexed enough
documents to match an existing segment on e.g. the Common index.

Or, if rebuilding all indexes won't take long, you can always rebuild all
of them.

Shai


On Thu, May 1, 2014 at 12:00 AM, Olivier Binda olivier.bi...@wanadoo.frwrote:

 On 04/30/2014 10:48 AM, Shai Erera wrote:

 I hope I got all the details right, if I didn't then please clarify. Also,
 I haven't read the entire thread, so if someone already suggested this ...
 well, it probably means it's the right solution :)

 It sounds like you could use Lucene's ParallelCompositeReader, which
 already handles multiple IndexReaders that are aligned by their internal
 document IDs. The way it would work, as far as I understand your scenario
 is something like the following table (columns denote different indexes).
 Each index contains a subset of relevant fields, where common contains the
 common fields, and each language index contains the respective language
 fields.

 DocIDLuceneID  Common  English   German
 FirstDoc   0 A,B,C   EN_words, DE_words,
 EN_sentences  DE_sentences
 SecondDoc  1 A,B,C
 ThirdDoc   2 A,B,C

 Each index can contain all relevant fields, or only a subset (e.g. maybe
 not all documents have a value for the 'B' field in the 'common' index).
 What's absolutely very important here though is that the indexes are
 created very carefully, and if e.g. SecondDoc is not translated into
 German, *you must still have an empty document* in the German index, or
 otherwise, document IDs will not align.


 That's exactly how I saw it and what I need to do. So, I'll have a very
 good look at

 ParallelCompositeReader


 Lucene does not offer a way to build those indexes though (patches
 welcome!!).


 This answers my question 1. Thanks.  :)
 I somehow hoped that there was already support for that kind of situation
 in lucene but well,
 now at least I know that I won't find an already made solution to my
 problem in the lucene classes and that I will have to code one myself,
 by taking inspiration in the lucene classes that do similar processing.

 We've started some effort very long time ago on LUCENE-1879
 (there's a patch and a discussion for an alternative approach) as well as
 there is a very useful suggestion in ParallelCompositeReader's jdocs (use
 LogDocMergePolicy).


 Wow, priceless. This gives me some headstart and inspiration. :)


 One challenge is how to support multi-threaded indexing, but perhaps this
 isn't a problem in your application? It sounds like, by you writing that a
 user will download the german index, that the indexes are built offline?

 Indeed. The index is built offline, in a single thread, and once it is
 built, it is read only.
 Cant find an easier situation. :)


  Another challenge is how to control segment merging, so that the *exact
 same segments* are merged over the parallel indexes. Again, if your
 application builds the indexes offline, then this should be easier to
 accomplish.

 I assume though that when you index e.g. the German documents, then the
 already indexed 'common' fields do not change for a document. If they do,
 you will need to rebuild the 'common' index too.

 Once you achieve a correct parallel index, it is very easy to open a
 ParallelCompositeReader on any subset of the indexes, e.g. Common+English,
 Common+German, or Common+English+German and search it, since the internal
 document IDs are perfectly aligned.

 Shai


 Many thanks for the awesome answer and the help (I love you).
 As I really really really need this to happen, I'm going to start working
 on this really soon.

  I'm definitely not an expert on threads, filesystems, and lucene inner
  workings, so I can't promise to contribute a miraculous patch though.
 Especially since I won't work on the muli-thread aspect of the problem.
 But I'll do the best I can and contribute back whatever code I can produce.

 Many thanks, again. :)



 On Wed, Apr 30, 2014 at 7:07 AM, Jose Carlos Canova 
 jose.carlos.can...@gmail.com wrote:

  My suggestion is you not worry about the docId, in practice it is an
 internal lucene id, quite similar with a rowId on a database, each
 index
 may generate a different docId (it is their problem) from a translated
 document, you may use your own ID that relates one document to another on
 different index mainly because like you mention are translated documents
 that on theory can be ranked differently from language

Re: Fields, Index segments and docIds (second Try)

2014-05-01 Thread Shai Erera
If you're always rebuilding, let alone forceMerge, you shouldn't have too
much trouble implementing it. Just make sure that you add documents in the
same order to all indexes.

If you're always rebuilding, how come you have deletions? Anyway, you must
also delete in all indexes.
On May 2, 2014 1:57 AM, Olivier Binda olivier.bi...@wanadoo.fr wrote:

 On 05/01/2014 10:28 AM, Shai Erera wrote:

 I'm glad it helped you. Good luck with the implementation.


 Thanks. First I started looking at the lucene internal code. To understand
 when/where and why docIds are changing/need to be changed (in merge and doc
 deletions) .
 I have always wanted to understand this and I think the understanding may
 help me somehow.


 One thing I didn't mention (though it's in the jdocs) -- it's not enough
 to
 have the documents of each index aligned, you also have to have the
 segments aligned. That is, if both indexes have documents 0-5 aligned, but
 one index contains a single segment and the other one 2 segments, that's
 not going to work.


 That's good to know.

  It is possible to do w/ some care -- when you build the German index,
 disable merges (use NoMergePolicy) and flush whenever you indexed enough
 documents to match an existing segment on e.g. the Common index.

 Or, if rebuilding all indexes won't take long, you can always rebuild all
 of them.

 Yes. That's what I am usually doing (it takes less than 1 minute)
  Yet, I usually do a forceMerge too to only have 1 segment :/

  Shai


 On Thu, May 1, 2014 at 12:00 AM, Olivier Binda olivier.bi...@wanadoo.fr
 wrote:

  On 04/30/2014 10:48 AM, Shai Erera wrote:

  I hope I got all the details right, if I didn't then please clarify.
 Also,
 I haven't read the entire thread, so if someone already suggested this
 ...
 well, it probably means it's the right solution :)

 It sounds like you could use Lucene's ParallelCompositeReader, which
 already handles multiple IndexReaders that are aligned by their internal
 document IDs. The way it would work, as far as I understand your
 scenario
 is something like the following table (columns denote different
 indexes).
 Each index contains a subset of relevant fields, where common contains
 the
 common fields, and each language index contains the respective language
 fields.

 DocIDLuceneID  Common  English   German
 FirstDoc   0 A,B,C   EN_words, DE_words,
  EN_sentences  DE_sentences
 SecondDoc  1 A,B,C
 ThirdDoc   2 A,B,C

 Each index can contain all relevant fields, or only a subset (e.g. maybe
 not all documents have a value for the 'B' field in the 'common' index).
 What's absolutely very important here though is that the indexes are
 created very carefully, and if e.g. SecondDoc is not translated into
 German, *you must still have an empty document* in the German index, or
 otherwise, document IDs will not align.

  That's exactly how I saw it and what I need to do. So, I'll have a very
 good look at

 ParallelCompositeReader


  Lucene does not offer a way to build those indexes though (patches
 welcome!!).

  This answers my question 1. Thanks.  :)
 I somehow hoped that there was already support for that kind of situation
 in lucene but well,
 now at least I know that I won't find an already made solution to my
 problem in the lucene classes and that I will have to code one myself,
 by taking inspiration in the lucene classes that do similar processing.

  We've started some effort very long time ago on LUCENE-1879
 (there's a patch and a discussion for an alternative approach) as well
 as
 there is a very useful suggestion in ParallelCompositeReader's jdocs
 (use
 LogDocMergePolicy).

  Wow, priceless. This gives me some headstart and inspiration. :)


  One challenge is how to support multi-threaded indexing, but perhaps
 this
 isn't a problem in your application? It sounds like, by you writing
 that a
 user will download the german index, that the indexes are built
 offline?

  Indeed. The index is built offline, in a single thread, and once it is
 built, it is read only.
 Cant find an easier situation. :)


   Another challenge is how to control segment merging, so that the *exact

 same segments* are merged over the parallel indexes. Again, if your
 application builds the indexes offline, then this should be easier to
 accomplish.

 I assume though that when you index e.g. the German documents, then the
  already indexed 'common' fields do not change for a document. If they
 do,
 you will need to rebuild the 'common' index too.

 Once you achieve a correct parallel index, it is very easy to open a
 ParallelCompositeReader on any subset of the indexes, e.g.
 Common+English,
 Common+German, or Common+English+German and search it, since the
 internal
 document IDs are perfectly aligned.

 Shai

  Many thanks for the awesome answer and the help (I love you).
 As I really really really need this to happen, I'm going to start working

Re: Fields, Index segments and docIds (second Try)

2014-04-30 Thread Shai Erera
I hope I got all the details right, if I didn't then please clarify. Also,
I haven't read the entire thread, so if someone already suggested this ...
well, it probably means it's the right solution :)

It sounds like you could use Lucene's ParallelCompositeReader, which
already handles multiple IndexReaders that are aligned by their internal
document IDs. The way it would work, as far as I understand your scenario
is something like the following table (columns denote different indexes).
Each index contains a subset of relevant fields, where common contains the
common fields, and each language index contains the respective language
fields.

DocID      LuceneID  Common  English       German
FirstDoc   0         A,B,C   EN_words,     DE_words,
                             EN_sentences  DE_sentences
SecondDoc  1         A,B,C
ThirdDoc   2         A,B,C

Each index can contain all relevant fields, or only a subset (e.g. maybe
not all documents have a value for the 'B' field in the 'common' index).
What's absolutely very important here though is that the indexes are
created very carefully, and if e.g. SecondDoc is not translated into
German, *you must still have an empty document* in the German index, or
otherwise, document IDs will not align.

Lucene does not offer a way to build those indexes though (patches
welcome!!). We've started some effort very long time ago on LUCENE-1879
(there's a patch and a discussion for an alternative approach) as well as
there is a very useful suggestion in ParallelCompositeReader's jdocs (use
LogDocMergePolicy).

One challenge is how to support multi-threaded indexing, but perhaps this
isn't a problem in your application? It sounds like, by you writing that a
user will download the german index, that the indexes are built offline?

Another challenge is how to control segment merging, so that the *exact
same segments* are merged over the parallel indexes. Again, if your
application builds the indexes offline, then this should be easier to
accomplish.

I assume though that when you index e.g. the German documents, then the
already indexed 'common' fields do not change for a document. If they do,
you will need to rebuild the 'common' index too.

Once you achieve a correct parallel index, it is very easy to open a
ParallelCompositeReader on any subset of the indexes, e.g. Common+English,
Common+German, or Common+English+German and search it, since the internal
document IDs are perfectly aligned.

Shai
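
Once the indexes are aligned, opening them together could look roughly like this
('commonDir' and 'germanDir' are assumed Directory instances; untested):

DirectoryReader common = DirectoryReader.open(commonDir);
DirectoryReader german = DirectoryReader.open(germanDir);
ParallelCompositeReader parallel = new ParallelCompositeReader(common, german);
IndexSearcher searcher = new IndexSearcher(parallel);
// fields from both indexes are now searchable as if they lived in one index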


On Wed, Apr 30, 2014 at 7:07 AM, Jose Carlos Canova 
jose.carlos.can...@gmail.com wrote:

 My suggestion is that you not worry about the docId; in practice it is an
 internal lucene id, quite similar to a rowId in a database. Each index
 may generate a different docId (that is its own concern) for a translated
 document. You may use your own ID that relates one document to another in
 different indexes, mainly because, as you mention, these are translated
 documents that in theory can be ranked differently from language to language
 (it is not guaranteed that a set of documents in different languages has the
 same rank order, but I am not 100% sure about this).

 The second reason is that they may change the internal structure of lucene
 without guaranteeing compatibility, and then you lose forward compatibility.

 I am not an expert on Lucene like Schindler, but reading the
 documentation I understood that they pay special attention to
 internal lucene and experimental lucene, which means internal is not
 guaranteed to stay compatible, and experimental may be removed.

 For example, if they (apache-lucene) discover a more efficient manner to
 relate each document and change some mechanism, and your application uses an
 internal mechanism that is highly coupled with lucene version xxx (marked as
 internal-lucene), you can get stuck on a specific version and in the future
 have to rewrite some code; this might cause some management conflict if your
 project follows continuous integration and you are subordinated to a
 management structure (bad for you).

 I saw this in several projects that use Lucene: they do not upgrade
 their lucene components in their new releases. One example, if I am not wrong,
 still uses Lucene 3, and another that I saw around (e.g. Luke) suggests
 that the project was abandoned because the manner in which they integrated with
 Lucene was not fully functional.

 Another interesting thing is that developing around Lucene is more
 effective: you guarantee that your product will work, and they guarantee
 that Lucene works too. This is related to design by contract.

 Regards.







 On Tue, Apr 29, 2014 at 7:11 PM, Olivier Binda olivier.bi...@wanadoo.fr
 wrote:

  Hello.
 
  Sorry to bring this up again. I don't want to be rude and I mean no
  disrespect, but after thinking it through today,
  I need to and would really love to have the answer to the following
  question :
 
  1) At lucene indexing time, is it possible to rewrite a read-only index
 so
  that some fields 

Re: Getting multi-values to use in filter?

2014-04-29 Thread Shai Erera
Hi Rob,

While the demo code uses a fixed number of 3 values, you don't need to
encode the number of values up front. Since you read the byte[] of a
document up front, you can read in a while loop as long as in.position() <
in.length().

Shai


On Tue, Apr 29, 2014 at 10:04 AM, Rob Audenaerde
rob.audenae...@gmail.comwrote:

 Hi Shai,

 I read the article on your blog, thanks for it! It seems to be a natural
 fit to do multi-values like this, and it is helpful indeed. For my specific
 problem, I have multiple values that do not have a fixed number, so it can
 be either 0 or 10 values. I think the best way to solve this is to encode
 the number of values as first entry in the BDV. This is not that hard so I
 will take this road.

 -Rob


  On 27 Apr 2014, at 21:27, Shai Erera ser...@gmail.com wrote:
 
  Hi Rob,
 
  Your question got me interested, so I wrote a quick prototype of what I
  think solves your problem (and if not, I hope it solves someone else's!
  :)). The idea is to write a special ValueSource, e.g. MaxValueSource
 which
  reads a BinaryDocValues, decodes the values and returns the maximum one.
 It
  can then be embedded in an expression quite easily.
 
  I published a post on Lucene expressions and included some prototype code
  which demonstrates how to do it. Hope it's still helpful to you:
  http://shaierera.blogspot.com/2014/04/expressions-with-lucene.html.
 
  Shai
 
 
  On Thu, Apr 24, 2014 at 1:20 PM, Shai Erera ser...@gmail.com wrote:
 
  I don't think that you should use the facet module. If all you want is
 to
  encode a bunch of numbers under a 'foo' field, you can encode them into
 a
  byte[] and index them as a BDV. Then at search time you get the BDV and
  decode the numbers back. The facet module adds complexity here: yes, you
  get the encoding/decoding for free, but at the cost of adding mock
  categories to the taxonomy, or use associations, for no good reason IMO.
 
  Once you do that, you need to figure out how to extend the expressions
  module to support a function like maxValues(fieldName) (cannot use 'max'
  since it's reserved). I read about it some, and still haven't figured
 out
  exactly how to do it. The JavascriptCompiler can take custom functions
 to
  compile expressions, but the methods should take only double values. So
 I
  think it should be some sort of binding, but I'm not sure yet how to do
 it.
  Perhaps it should be a name like max_fieldName, which you add a custom
  Expression to as a binding ... I will try to look into it later.
 
  Shai
 
 
  On Wed, Apr 23, 2014 at 6:49 PM, Rob Audenaerde 
 rob.audenae...@gmail.comwrote:
 
  Thanks for all the questions, gives me an opportunity to clarify it :)
 
  I want the user to be able to give a (simple) formula (so I don't know
 it
  on beforehand) and use that formula in the search. The Javascript
  expressions are really powerful in this use case, but have the
  single-value
  limitation. Ideally, I would like to make it really flexible by for
  example
  allowing (in-document aggregating) expressions like: max(fieldA) - fieldB > fieldC.
 
  Currently, using single values, I can handle expressions in the form of
  fieldA - fieldB - fieldC > 0 and evaluate the long-value that I
 receive
  from the FunctionValues and the ValueSource. I also optimize the query
 by
  assuring the field exists and has a value, etc. to the search still
 fast
  enough. This works well, but single value only.
 
  I also looked into the facets Association Fields, as they somewhat look
  like the thing that I want. Only in the faceting module, all ordinals
 and
  values are stored in one field, so there is no easy way extract the
 fields
  that are used in the expression.
 
  I like the solution one you suggested, to add all the numeric fields an
  encoded byte[] like the facets do, but then on a per-field basis, so
 that
  each numeric field has a BDV field that contains all multiple values
 for
  that field for that document.
 
  Now that I am typing this, I think there is another way. I could use
 the
  faceting module and add a different facet field ($facetFIELDA,
  $facetFIELDB) in the FacetsConfig for each field. That way it would be
  relatively straightforward to get all the values for a field, as they
 are
  exact all the values for the BDV for that document's facet field. Only
  aggregating all facets will be harder, as the
  TaxonomyFacetSum*Associations
  would need to do this for all fields that I need facet counts/sums for.
 
  What do you think?
 
  -Rob
 
 
  On Wed, Apr 23, 2014 at 5:13 PM, Shai Erera ser...@gmail.com wrote:
 
  A NumericDocValues field can only hold one value. Have you thought
 about
  encoding the values in a BinaryDocValues field? Or are you talking
 about
  multiple fields (different names), each has its own single value, and
 at
  search time you sum the values from a different set of fields?
 
  If it's one field, multiple values, then why do you need to separate

Re: No Compound Files

2014-04-29 Thread Shai Erera
The problem is that compound files settings are split between MergePolicy
and IndexWriterConfig. As documented on IWC.setUseCompoundFile, this
setting controls how new segments are flushed, while the MP setting
controls how merged segments are written.

If we only offer NoMP.INSTANCE, what would it do w/ merged segments? always
compound? always not-compound? But that won't solve the problem of new
flushed segments, since that's controlled by IWC.

If we can move all of that to IWC, I think it will remove the confusion..
it always confuses me that I use NoMP.COMPOUND_FILES, yet I see non-compound
segments, until I remember to change the IWC setting.
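I.e. today you need to set both, e.g. (4.x-style sketch; 'dir' and 'analyzer'
are assumed to exist):

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_47, analyzer);
iwc.setUseCompoundFile(false); // newly flushed segments: no compound file
TieredMergePolicy mp = new TieredMergePolicy();
mp.setNoCFSRatio(0.0); // merged segments: no compound file
iwc.setMergePolicy(mp);
IndexWriter writer = new IndexWriter(dir, iwc);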

Shai


On Tue, Apr 29, 2014 at 3:07 PM, Robert Muir rcm...@gmail.com wrote:

 I think NoMergePolicy.NO_COMPOUND_FILES and
 NoMergePolicy.COMPOUND_FILES should be removed, and replaced with
 NoMergePolicy.INSTANCE

 If you want to change whether CFS is used by indexwriter flush, you
 need to set that in IndexWriterConfig.

 On Tue, Apr 29, 2014 at 8:03 AM, Varun Thacker
 varunthacker1...@gmail.com wrote:
  I wanted to use the NoMergePolicy.NO_COMPOUND_FILES to ensure that no
  merges take place on the index. However I was unsuccessful at it. What am I
  doing wrong here?
 
  Attaching a gist with -
  1. Output when using NoMergePolicy.NO_COMPOUND_FILES
  2. Output when using TieredMergePolicy with policy.setNoCFSRatio(0.0)
  3. The code snippet I used.
 
  https://gist.github.com/vthacker/11398124
 
  I tried it using Lucene 4.7
 
 
 
 
 
  --
 
 
  Regards,
  Varun Thacker
  http://www.vthacker.in/

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: No Compound Files

2014-04-29 Thread Shai Erera
NoMP means no merges, and indeed it seems silly that NoMP distinguishes
between compound/non-compound settings. Perhaps it's rooted somewhere in
the past, I don't remember.

I checked and IndexWriter.addIndexes consults
MP.useCompoundFile(segmentInfo) when it adds the segments. But maybe
NoMP.useCompoundFile can be changed to return
newSegment.info.isCompoundFile? I.e. it doesn't change the type of the new
segment?

Shai


On Tue, Apr 29, 2014 at 3:50 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 +1 to just have NoMergePolicy.INSTANCE

 Mike McCandless

 http://blog.mikemccandless.com


 On Tue, Apr 29, 2014 at 8:07 AM, Robert Muir rcm...@gmail.com wrote:
  I think NoMergePolicy.NO_COMPOUND_FILES and
  NoMergePolicy.COMPOUND_FILES should be removed, and replaced with
  NoMergePolicy.INSTANCE
 
  If you want to change whether CFS is used by indexwriter flush, you
  need to set that in IndexWriterConfig.
 
  On Tue, Apr 29, 2014 at 8:03 AM, Varun Thacker
  varunthacker1...@gmail.com wrote:
  I wanted to use the NoMergePolicy.NO_COMPOUND_FILES to ensure that no
  merges take place on the index. However I was unsuccessful at it. What am I
  doing wrong here?
 
  Attaching a gist with -
  1. Output when using NoMergePolicy.NO_COMPOUND_FILES
  2. Output when using TieredMergePolicy with policy.setNoCFSRatio(0.0)
  3. The code snippet I used.
 
  https://gist.github.com/vthacker/11398124
 
  I tried it using Lucene 4.7
 
 
 
 
 
  --
 
 
  Regards,
  Varun Thacker
  http://www.vthacker.in/
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Getting multi-values to use in filter?

2014-04-27 Thread Shai Erera
Hi Rob,

Your question got me interested, so I wrote a quick prototype of what I
think solves your problem (and if not, I hope it solves someone else's!
:)). The idea is to write a special ValueSource, e.g. MaxValueSource which
reads a BinaryDocValues, decodes the values and returns the maximum one. It
can then be embedded in an expression quite easily.

I published a post on Lucene expressions and included some prototype code
which demonstrates how to do it. Hope it's still helpful to you:
http://shaierera.blogspot.com/2014/04/expressions-with-lucene.html.
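The gist is something like the following (a condensed, untested sketch of the
idea rather than the exact code from the post; it assumes the values were
encoded as VLongs into a BinaryDocValues field):

class MaxValueSource extends ValueSource {
  private final String field;
  MaxValueSource(String field) { this.field = field; }

  @Override
  public FunctionValues getValues(Map context, AtomicReaderContext readerContext) throws IOException {
    final BinaryDocValues bdv = readerContext.reader().getBinaryDocValues(field); // null check omitted
    return new DoubleDocValues(this) {
      @Override
      public double doubleVal(int doc) {
        BytesRef bytes = new BytesRef();
        bdv.get(doc, bytes);
        ByteArrayDataInput in = new ByteArrayDataInput(bytes.bytes, bytes.offset, bytes.length);
        long max = Long.MIN_VALUE;
        while (!in.eof()) {
          max = Math.max(max, in.readVLong()); // keep the largest decoded value
        }
        return max;
      }
    };
  }

  @Override public String description() { return "max(" + field + ")"; }
  @Override public boolean equals(Object o) { return o instanceof MaxValueSource && ((MaxValueSource) o).field.equals(field); }
  @Override public int hashCode() { return field.hashCode(); }
}

It can then be plugged into the expression as described in the post.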

Shai


On Thu, Apr 24, 2014 at 1:20 PM, Shai Erera ser...@gmail.com wrote:

 I don't think that you should use the facet module. If all you want is to
 encode a bunch of numbers under a 'foo' field, you can encode them into a
 byte[] and index them as a BDV. Then at search time you get the BDV and
 decode the numbers back. The facet module adds complexity here: yes, you
 get the encoding/decoding for free, but at the cost of adding mock
 categories to the taxonomy, or use associations, for no good reason IMO.

 Once you do that, you need to figure out how to extend the expressions
 module to support a function like maxValues(fieldName) (cannot use 'max'
 since it's reserved). I read about it some, and still haven't figured out
 exactly how to do it. The JavascriptCompiler can take custom functions to
 compile expressions, but the methods should take only double values. So I
 think it should be some sort of binding, but I'm not sure yet how to do it.
 Perhaps it should be a name like max_fieldName, which you add a custom
 Expression to as a binding ... I will try to look into it later.

 Shai


 On Wed, Apr 23, 2014 at 6:49 PM, Rob Audenaerde 
 rob.audenae...@gmail.comwrote:

 Thanks for all the questions, gives me an opportunity to clarify it :)

 I want the user to be able to give a (simple) formula (so I don't know it
 on beforehand) and use that formula in the search. The Javascript
 expressions are really powerful in this use case, but have the
 single-value
 limitation. Ideally, I would like to make it really flexible by for
 example
 allowing (in-document aggregating) expressions like: max(fieldA) - fieldB > fieldC.

 Currently, using single values, I can handle expressions in the form of
 fieldA - fieldB - fieldC > 0 and evaluate the long-value that I receive
 from the FunctionValues and the ValueSource. I also optimize the query by
 assuring the field exists and has a value, etc. to the search still fast
 enough. This works well, but single value only.

 I also looked into the facets Association Fields, as they somewhat look
 like the thing that I want. Only in the faceting module, all ordinals and
 values are stored in one field, so there is no easy way extract the fields
 that are used in the expression.

 I like the solution one you suggested, to add all the numeric fields an
 encoded byte[] like the facets do, but then on a per-field basis, so that
 each numeric field has a BDV field that contains all multiple values for
 that field for that document.

 Now that I am typing this, I think there is another way. I could use the
 faceting module and add a different facet field ($facetFIELDA,
 $facetFIELDB) in the FacetsConfig for each field. That way it would be
 relatively straightforward to get all the values for a field, as they are
 exact all the values for the BDV for that document's facet field. Only
 aggregating all facets will be harder, as the
 TaxonomyFacetSum*Associations
 would need to do this for all fields that I need facet counts/sums for.

 What do you think?

 -Rob


 On Wed, Apr 23, 2014 at 5:13 PM, Shai Erera ser...@gmail.com wrote:

  A NumericDocValues field can only hold one value. Have you thought about
  encoding the values in a BinaryDocValues field? Or are you talking about
  multiple fields (different names), each has its own single value, and at
  search time you sum the values from a different set of fields?
 
  If it's one field, multiple values, then why do you need to separate the
  values? Is it because you sometimes sum and sometimes e.g. avg? Do you
  always include all values of a document in the formula, but the formula
  changes between searches, or do you sometimes use only a subset of the
  values?
 
  If you always use all values, but change the formula between queries,
 then
  perhaps you can just encode the pre-computed value under different NDV
  fields? If you only use a handful of functions (and they are known in
  advance), it may not be too heavy on the index, and definitely perform
  better during search.
 
  Otherwise, I believe I'd consider indexing them as a BDV field. For
 facets,
  we basically need the same multi-valued numeric field, and given that
 NDV
  is single valued, we went w/ BDV.
 
  If I misunderstood the scenario, I'd appreciate if you clarify it :)
 
  Shai
 
 
  On Wed, Apr 23, 2014 at 5:49 PM, Rob Audenaerde 
 rob.audenae...@gmail.com
  wrote:
 
   Hi Shai, all,
  
   I am

Re: Getting multi-values to use in filter?

2014-04-24 Thread Shai Erera
I don't think that you should use the facet module. If all you want is to
encode a bunch of numbers under a 'foo' field, you can encode them into a
byte[] and index them as a BDV. Then at search time you get the BDV and
decode the numbers back. The facet module adds complexity here: yes, you
get the encoding/decoding for free, but at the cost of adding mock
categories to the taxonomy, or using associations, for no good reason IMO.
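For example, the encoding side could look roughly like this (untested sketch;
assumes non-negative longs, VLong encoding, and 'doc' is the Document being
indexed):

long[] values = { 60, 40 }; // the document's values for field 'foo'
byte[] buf = new byte[values.length * 9]; // a VLong takes at most 9 bytes
ByteArrayDataOutput out = new ByteArrayDataOutput(buf);
for (long v : values) {
  out.writeVLong(v);
}
doc.add(new BinaryDocValuesField("foo", new BytesRef(buf, 0, out.getPosition())));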

Once you do that, you need to figure out how to extend the expressions
module to support a function like maxValues(fieldName) (cannot use 'max'
since it's reserved). I read about it some, and still haven't figured out
exactly how to do it. The JavascriptCompiler can take custom functions to
compile expressions, but the methods should take only double values. So I
think it should be some sort of binding, but I'm not sure yet how to do it.
Perhaps it should be a name like max_fieldName, which you add a custom
Expression to as a binding ... I will try to look into it later.

Shai


On Wed, Apr 23, 2014 at 6:49 PM, Rob Audenaerde rob.audenae...@gmail.com wrote:

 Thanks for all the questions, gives me an opportunity to clarify it :)

 I want the user to be able to give a (simple) formula (so I don't know it
 on beforehand) and use that formula in the search. The Javascript
 expressions are really powerful in this use case, but have the single-value
 limitation. Ideally, I would like to make it really flexible by for example
 allowing (in-document aggregating) expressions like: max(fieldA) - fieldB > fieldC.

 Currently, using single values, I can handle expressions in the form of
 fieldA - fieldB - fieldC > 0 and evaluate the long-value that I receive
 from the FunctionValues and the ValueSource. I also optimize the query by
 assuring the field exists and has a value, etc. to the search still fast
 enough. This works well, but single value only.

 I also looked into the facets Association Fields, as they somewhat look
 like the thing that I want. Only in the faceting module, all ordinals and
 values are stored in one field, so there is no easy way extract the fields
 that are used in the expression.

 I like the solution one you suggested, to add all the numeric fields an
 encoded byte[] like the facets do, but then on a per-field basis, so that
 each numeric field has a BDV field that contains all multiple values for
 that field for that document.

 Now that I am typing this, I think there is another way. I could use the
 faceting module and add a different facet field ($facetFIELDA,
 $facetFIELDB) in the FacetsConfig for each field. That way it would be
 relatively straightforward to get all the values for a field, as they are
 exact all the values for the BDV for that document's facet field. Only
 aggregating all facets will be harder, as the TaxonomyFacetSum*Associations
 would need to do this for all fields that I need facet counts/sums for.

 What do you think?

 -Rob


 On Wed, Apr 23, 2014 at 5:13 PM, Shai Erera ser...@gmail.com wrote:

  A NumericDocValues field can only hold one value. Have you thought about
  encoding the values in a BinaryDocValues field? Or are you talking about
  multiple fields (different names), each has its own single value, and at
  search time you sum the values from a different set of fields?
 
  If it's one field, multiple values, then why do you need to separate the
  values? Is it because you sometimes sum and sometimes e.g. avg? Do you
  always include all values of a document in the formula, but the formula
  changes between searches, or do you sometimes use only a subset of the
  values?
 
  If you always use all values, but change the formula between queries,
 then
  perhaps you can just encode the pre-computed value under different NDV
  fields? If you only use a handful of functions (and they are known in
  advance), it may not be too heavy on the index, and definitely perform
  better during search.
 
  Otherwise, I believe I'd consider indexing them as a BDV field. For
 facets,
  we basically need the same multi-valued numeric field, and given that NDV
  is single valued, we went w/ BDV.
 
  If I misunderstood the scenario, I'd appreciate if you clarify it :)
 
  Shai
 
 
  On Wed, Apr 23, 2014 at 5:49 PM, Rob Audenaerde 
 rob.audenae...@gmail.com
  wrote:
 
   Hi Shai, all,
  
   I am trying to write that Filter :). But I'm a bit at loss as how to
   efficiently grab the multi-values. I can access the
   context.reader().document() that accesses the storedfields, but that
  seems
   slow.
  
   For single-value fields I use a compiled JavaScript Expression with
   simplebindings as ValueSource, which seems to work quite well. The
  downside
   is that I cannot find a way to implement multi-value through that
  solution.
  
   These create for example a LongFieldSource, which uses the
   FieldCache.LongParser. These parsers only seem to parse one field.
  
   Is there an efficient way to get -all- of the (numeric) values for a
  field

Re: Getting multi-values to use in filter?

2014-04-23 Thread Shai Erera
You can do that by writing a Filter which returns matching documents based
on a sum of the field's value. However I suspect that is going to be slow,
unless you know that you will need several such filters and can cache them.

Another approach would be to write a Collector which serves as a Filter,
but computes the sum only for documents that match the query. Hopefully
that would mean you compute the sum for less documents than you would have
w/ the Filter approach.
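A very rough sketch of that second approach (4.x Collector API, untested; it
assumes the document's values sit in a BinaryDocValues field named 'field',
encoded as VLongs):

class SumFilterCollector extends Collector {
  private final long expectedSum;
  private final List<Integer> matching = new ArrayList<>(); // global docIDs that pass
  private BinaryDocValues bdv;
  private int docBase;

  SumFilterCollector(long expectedSum) { this.expectedSum = expectedSum; }

  @Override public void setScorer(Scorer scorer) {}
  @Override public boolean acceptsDocsOutOfOrder() { return true; }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    bdv = context.reader().getBinaryDocValues("field"); // null check omitted
    docBase = context.docBase;
  }

  @Override
  public void collect(int doc) throws IOException {
    BytesRef bytes = new BytesRef();
    bdv.get(doc, bytes);
    ByteArrayDataInput in = new ByteArrayDataInput(bytes.bytes, bytes.offset, bytes.length);
    long sum = 0;
    while (!in.eof()) {
      sum += in.readVLong();
    }
    if (sum == expectedSum) {
      matching.add(docBase + doc); // only docs that matched the query get here
    }
  }
}

You'd run searcher.search(query, new SumFilterCollector(100)) and read
'matching' afterwards (or feed it into a second pass).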

Shai


On Wed, Apr 23, 2014 at 5:11 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 This isn't really a good use case for an index like Lucene.  The most
 essential property of an index is that it lets you look up documents very
 quickly based on *precomputed* values.

 -Mike


 On 04/23/2014 06:56 AM, Rob Audenaerde wrote:

 Hi all,

 I'm looking for a way to use multi-values in a filter.

 I want to be able to search on  sum(field)=100, where field has values in
  one document:

 field=60
 field=40

 In this case 'field' is a LongField. I examined the code in the
 FieldCache,
 but that seems to focus on single-valued fields only, or


  Is this something that can be done in Lucene? And what would be a good
 approach?

 Thanks in advance,

 -Rob



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Getting multi-values to use in filter?

2014-04-23 Thread Shai Erera
A NumericDocValues field can only hold one value. Have you thought about
encoding the values in a BinaryDocValues field? Or are you talking about
multiple fields (different names), each has its own single value, and at
search time you sum the values from a different set of fields?

If it's one field, multiple values, then why do you need to separate the
values? Is it because you sometimes sum and sometimes e.g. avg? Do you
always include all values of a document in the formula, but the formula
changes between searches, or do you sometimes use only a subset of the
values?

If you always use all values, but change the formula between queries, then
perhaps you can just encode the pre-computed value under different NDV
fields? If you only use a handful of functions (and they are known in
advance), it may not be too heavy on the index, and definitely perform
better during search.

Otherwise, I believe I'd consider indexing them as a BDV field. For facets,
we basically need the same multi-valued numeric field, and given that NDV
is single valued, we went w/ BDV.

If I misunderstood the scenario, I'd appreciate if you clarify it :)

Shai


On Wed, Apr 23, 2014 at 5:49 PM, Rob Audenaerde rob.audenae...@gmail.com wrote:

 Hi Shai, all,

  I am trying to write that Filter :). But I'm a bit at a loss as to how to
 efficiently grab the multi-values. I can access the
 context.reader().document() that accesses the storedfields, but that seems
 slow.

 For single-value fields I use a compiled JavaScript Expression with
 simplebindings as ValueSource, which seems to work quite well. The downside
 is that I cannot find a way to implement multi-value through that solution.

 These create for example a LongFieldSource, which uses the
  FieldCache.LongParser. These parsers only seem to parse one field.

 Is there an efficient way to get -all- of the (numeric) values for a field
 in a document?


 On Wed, Apr 23, 2014 at 4:38 PM, Shai Erera ser...@gmail.com wrote:

  You can do that by writing a Filter which returns matching documents
 based
  on a sum of the field's value. However I suspect that is going to be
 slow,
  unless you know that you will need several such filters and can cache
 them.
 
  Another approach would be to write a Collector which serves as a Filter,
  but computes the sum only for documents that match the query. Hopefully
  that would mean you compute the sum for less documents than you would
 have
  w/ the Filter approach.
 
  Shai
 
 
  On Wed, Apr 23, 2014 at 5:11 PM, Michael Sokolov 
  msoko...@safaribooksonline.com wrote:
 
   This isn't really a good use case for an index like Lucene.  The most
   essential property of an index is that it lets you look up documents
 very
   quickly based on *precomputed* values.
  
   -Mike
  
  
   On 04/23/2014 06:56 AM, Rob Audenaerde wrote:
  
   Hi all,
  
   I'm looking for a way to use multi-values in a filter.
  
   I want to be able to search on  sum(field)=100, where field has values
  in
   one document:
  
   field=60
   field=40
  
   In this case 'field' is a LongField. I examined the code in the
   FieldCache,
   but that seems to focus on single-valued fields only, or
  
  
    Is this something that can be done in Lucene? And what would be a good
   approach?
  
   Thanks in advance,
  
   -Rob
  
  
  
   -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  
 



Re: IndexReplication Client and IndexWriter

2014-04-16 Thread Shai Erera
Hi Christoph,

Apologies for the delayed response, I'm on a holiday vacation. I will take
a look at your issues as soon as I can.

Shai


On Fri, Apr 11, 2014 at 12:02 PM, Christoph Kaser
lucene_l...@iconparc.de wrote:

 Hello Shai and Mike,

 thank you for your answers!

 I created LUCENE-5597 for this feature. Unfortunately, I am not sure I
 will be able to provide patches: I don't need this feature at the moment
 (my interest was more academic) and unfortunately don't have the time to
 work on this.

 Additionally, I created LUCENE-5599, which provides a patch to fix a small
 performance issue I had with the replicator when replicating large indexes.

 Regards,
 Christoph Kaser



 Am 08.04.2014 12:45, schrieb Michael McCandless:

 You might be able to use a class on the NRT replication branch
 (LUCENE-5438), InfosRefCounts (weird name), whose purpose is to do
 what IndexFileDeleter does for IndexWriter, ie keep track of which
 files are still referenced, delete them when they are done, etc.  This
 could used on the client side to hold a lease for another client.

 Mike McCandless

 http://blog.mikemccandless.com


 On Tue, Apr 8, 2014 at 6:26 AM, Shai Erera ser...@gmail.com wrote:

 IndexRevision uses the IndexWriter for deleting unused files when the
 revision is released, as well as to obtain the SnapshotDeletionPolicy.

 I think that you will need to implement two things on the client side:

 * Revision, which doesn't use IndexWriter.
 * Replicator which keeps track of how many refs a file has (basically
 what
 IndexFileDeleter does)

 Then you could setup any node in the middle to be both a client and a
 server. Would be interesting to explore that. Would you like to open an
 issue? And maybe even try to come up w/ a patch?

 Shai


 On Tue, Apr 8, 2014 at 1:05 PM, Michael McCandless 
 luc...@mikemccandless.com wrote:

  It's not safe also opening an IndexWriter on the client side.

 But I agree, supporting tree topology would make sense; it seems like
 we just need a way for the ReplicationClient to also be a Replicator.
 It seems like it should be possible, since it's clearly aware of the
 SessionToken it's pulled from the original Replicator.

 Mike McCandless

 http://blog.mikemccandless.com


 On Tue, Apr 8, 2014 at 3:42 AM, Christoph Kaser 
 lucene_l...@iconparc.de
 wrote:

 Hi all,

 I am trying out the (highly useful) index replicator module (with the
 HttpReplicator) and have stumbled upon a question:
 It seems, the IndexReplicationHandler is working directly on the index
 directory, without using an indexwriter. Could there be a problem if I

 open

 an IndexWriter on the client side?
 Usually, this should not be needed, as only the master should be
 changed,
 however if I want to implement a tree topology, I need an IndexWriter

 on a

 non-leaf client, because the IndexRevision that I need to publish needs

 one.

 Regards,
 Christoph


  -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


  -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org






 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: NRT facet issue (bug?), hard to reproduce, please advise

2014-04-11 Thread Shai Erera
Hi

I am not sure how more than one client_no field ends up w/ a document, and
I'm not sure it's related to the taxonomy at all.

However, looking at the code example you pasted above, and since you
mention that you index+commit in one thread, while another thread does the
reopen, I wonder if that's the issue: you first commit the taxo, then
commit the index. But what if a new document makes it into the index after
you committed to taxo, with a new client_no? In that case, the reopening
thread will discover an older taxonomy, while the index will have
categories with ordinals larger than the taxonomy's greatest ordinal?

I also think that it's a mistake to commit and reopen in two separate
threads. If possible, I suggest that you always do both in the same thread,
and in that order: first commit the index, then the taxonomy. That way, if
a document goes into the index (and new facets into the taxonomy) after the
index.commit(), then when you reopen, the worst case is that the taxonomy is
ahead of the index, which is fine. When you reopen, also reopen in the
same order.
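I.e. something like (sketch, reusing the names from your snippet below):

synchronized ( GlobalIndexCommitAndCloseLock.LOCK )
{
    this.luceneIndexWriter.commit();   // index first ...
    this.taxonomyWriter.commit();      // ... then taxonomy, so it is never behind the index
}

and the reopening thread refreshes in the same order.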

Could you try that and see if that resolves your issue? Although, I don't
understand how this can lead to more than one client_no ending up in one
document, unless there's also a concurrency bug in the indexing code ... or
I misunderstood the issue.

Shai


On Fri, Apr 11, 2014 at 2:49 PM, Rob Audenaerde rob.audenae...@gmail.com wrote:

 Hi all,

 I have a issue using the near real-time search in the taxonomy. I could
 really use some advise on how to debug/proceed this issue.

 The issue is as follows:

 I index 100k documents, with about 40 fields each. For each field, I also
 add a FacetField (issues arises both with FacetField as
 FloatAssociationFacetField). Each document has a unique number field
 (client_no).

 When just indexing and searching afterwards, all is fine.

 When searching while indexing, sometimes the number of facets associated
 with a document is too high, i.e. when collecting facets there is more than
 one client_no on one document, which of course should not be the case.

 Before each search, I use the manager.maybeRefreshBlocking() before the
 search, because I want the most-actual results.

 I have a taxonomy and indexreader combined in a ReferenceManager (I created
 this before the SearcherTaxonomyManager existed, but it behaves exactly the
 same, similar refcount logic)

 During indexing I commit every 5000 documents (not needed for the NRT
 search, but needed to prevent data loss should the application shut down). I
 commit as follows:

 public void commit() throws DocumentIndexException
 {
 try
 {
 synchronized ( GlobalIndexCommitAndCloseLock.LOCK )
 {
 this.taxonomyWriter.commit();
 this.luceneIndexWriter.commit();
 }
 }
 catch ( final OutOfMemoryError | IOException e )
 {
 tryCloseWritersOnOOME( this.luceneIndexWriter,
 this.taxonomyWriter );
 throw new DocumentIndexException( e );
 }
 }

 I use a standard IndexWriterConfig and both IndexWriter and TaxonomyWriter
 are RAMDirectory().

 My testcase indexes the 100k documents, while another thread is
 continuously calling 'manager.maybeRefreshBlocking()'. This is enough to
 sometimes cause the taxonomy to be incorrect.

 The number of indexing threads does not seems to influence the issue, as it
 also appears when I have only 1 indexing thread.

 I know it is an index problem, because when I write in the index to file
 instead of RAM and reopen it in a clean application, I see the same
 behaviour.


 I could really use some advise on how to debug/proceed this issue. If more
 info is needed, just ask.

 Thanks in advance,

 -Rob



Re: IndexReplication Client and IndexWriter

2014-04-08 Thread Shai Erera
IndexRevision uses the IndexWriter for deleting unused files when the
revision is released, as well as to obtain the SnapshotDeletionPolicy.

I think that you will need to implement two things on the client side:

* Revision, which doesn't use IndexWriter.
* Replicator which keeps track of how many refs a file has (basically what
IndexFileDeleter does)

Then you could setup any node in the middle to be both a client and a
server. Would be interesting to explore that. Would you like to open an
issue? And maybe even try to come up w/ a patch?

Shai


On Tue, Apr 8, 2014 at 1:05 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 It's not safe also opening an IndexWriter on the client side.

 But I agree, supporting tree topology would make sense; it seems like
 we just need a way for the ReplicationClient to also be a Replicator.
 It seems like it should be possible, since it's clearly aware of the
 SessionToken it's pulled from the original Replicator.

 Mike McCandless

 http://blog.mikemccandless.com


 On Tue, Apr 8, 2014 at 3:42 AM, Christoph Kaser lucene_l...@iconparc.de
 wrote:
  Hi all,
 
  I am trying out the (highly useful) index replicator module (with the
  HttpReplicator) and have stumbled upon a question:
  It seems, the IndexReplicationHandler is working directly on the index
  directory, without using an indexwriter. Could there be a problem if I
 open
  an IndexWriter on the client side?
  Usually, this should not be needed, as only the master should be changed,
  however if I want to implement a tree topology, I need an IndexWriter
 on a
  non-leaf client, because the IndexRevision that I need to publish needs
 one.
 
  Regards,
  Christoph
 
  --
  Dipl.-Inf. Christoph Kaser
 
  IconParc GmbH
  Sophienstrasse 1
  80333 München
 
  www.iconparc.de
 
  Tel +49 -89- 15 90 06 - 21
  Fax +49 -89- 15 90 06 - 49
 
  Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer.
 HRB
  121830, Amtsgericht München
 
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Replicator: how to use it?

2014-03-20 Thread Shai Erera

 Even if the commit is called just before the close, the close triggers
 a last commit.


That seems wrong. If you do writer.commit() and then immediately
writer.close(), and there are no changes to the writer in between (i.e. no
thread comes in and adds/updates/deletes a document), then close() should
not create a new commit point. Do you see that it does?

Shai


On Wed, Mar 19, 2014 at 11:09 PM, Roberto Franchini franch...@celi.it wrote:

 On Sat, Mar 15, 2014 at 12:56 PM, Roberto Franchini franch...@celi.it
 wrote:
  On Sat, Mar 15, 2014 at 12:47 PM, Shai Erera ser...@gmail.com wrote:
  If you use LocalReplicator on both sides, you have to use the same
 instance
  on both sides. Otherwise the replicas will never see the published
  revisions, which are done in a separate instance. Can you try that?
 
  Ok, I missed it. I was using different instances.
  I'll try this afternoon.

 Hi,
 the replicator works fine on live writer, but when the writer is
 closed it does a last commit that isn't replicated.

 Even if the commit is called just before the close, the close triggers
 a last commit.

 And trying to use the writer after close is impossible:

 writer.close();
 revision= new IndexRevision(writer);

 produce:
 org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
 

 So, I can replicate the last commit before the close, and not worry
 about the inner commit that close does.
 But maybe I'll lose something?
 RF

 --
 Roberto Franchini
 The impossible is inevitable.
 http://www.celi.it http://www.blogmeter.it
 http://github.com/celi-uim   http://github.com/robfrank
 Tel +39.011.562.71.15
 jabber:ro.franch...@gmail.com skype:ro.franchini tw:@robfrankie

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Few questions on updatable DocValues

2014-03-15 Thread Shai Erera
Double fields can be implemented today over NumericDVField and therefore
already support updates.

String can be implemented on Sorted/SortedSetDVField, but updates for them
are not supported yet. I hope that once I'm done w/ LUCENE-5513, adding
update support for Sorted/SortedSet will be even easier.

Shai

On Fri, Mar 14, 2014 at 6:22 PM, Gopal Patwa gopalpa...@gmail.com wrote:

 r lot;s of use case where you have muc


Re: Replicator: how to use it?

2014-03-15 Thread Shai Erera
If you use LocalReplicator on both sides, you have to use the same instance
on both sides. Otherwise the replicas will never see the published
revisions, which are done in a separate instance. Can you try that?

Shai
On Mar 15, 2014 1:10 PM, Roberto Franchini franch...@celi.it wrote:

 On Sat, Mar 15, 2014 at 11:58 AM, Michael McCandless
 luc...@mikemccandless.com wrote:
  I think maybe the problem is you are using LocalReplicator on the
  replicas?  I think you should only use that on the master.  I think
  e.g. you should use HttpReplicator on the clients?  Or, your own
  implementation that moves the files its own way.
 
  Have you seen Shai's blog post about this?
  http://shaierera.blogspot.com/2013/05/the-replicator.html

 Yes, I've seen it. I checkout the replicator code and looked at test code.

 I'm trying to use the local replicator because, as a first step, I
 want only to incrementally back up indexes.

 So I've implemented a sort of producer/consumer where the indexer is
 the producer: it runs on its own thread and publishes revisions, and
 the consumer will be the replicator client, which runs on its own thread.

 Code samples aren't, at least for me, very clear in how to use the
 replicator.
 So, if someone has a clean sample of use of replicator I would appreciate
 it.

 REgards,
 RF

 --
 Roberto Franchini
 The impossible is inevitable.
 http://www.celi.it http://www.blogmeter.it
 http://github.com/celi-uim   http://github.com/robfrank
 Tel +39.011.562.71.15
 jabber:ro.franch...@gmail.com skype:ro.franchini tw:@robfrankie

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Few questions on updatable DocValues

2014-03-14 Thread Shai Erera
Hi

1. Is it possible to provide updateNumericDocValue(Term term,
 MapString,Long), incase I wish to update multiple-fields and it's
 doc-values?


For now you can call updateNDV multiple times, each time w/ a new field.
Under the covers, we currently process each update separately anyway.
I think in order to change it we'd need to change the API such that it
allows you to define an update in many ways (e.g. Query, see below). Then,
an update by a single Term to multiple fields is atomic. I don't want,
though, to add many updateNDV variants to IW, especially as we'd like to add
more DV update capabilities. Want to open an issue to explore that?
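E.g. for now it's simply (illustrative field names):

Term term = new Term("id", "doc-17");
writer.updateNumericDocValue(term, "price", 100L);
writer.updateNumericDocValue(term, "stock", 42L);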

2. Instead of a Term based update, is it possible to extend it to using a
 Query? What are the obvious problems in doing so?


Technically yes, but currently it's not exposed. At the lowest level we
pull a DocsEnum and iterate on docs to apply the update. So Term/Query
would work the same. I think we can explore generalizing the API such that
you can define your own update following some well-thought-out API; that
way you have the flexibility on one hand, yet we don't need to maintain all
options in the Lucene source code. We can explore that on an issue.

3. TrackingIndexWriter does not have updateNumericDocValue exposed. Any
 reason of not doing so?


No reason in particular :). Can you open an issue (separate from the API)?

4. Is it possible to update a DocValue other than long, like lets say a
 BinaryDV?


This is something I currently do on LUCENE-5513, so hopefully very soon you
will be able to do that. If I'm fast enough, maybe even in 4.8 :).

Shai


On Fri, Mar 14, 2014 at 12:14 PM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:

 Hi,

 I have few questions related to updatable DocValues API... It would be
 great if I can get help.

 1. Is it possible to provide updateNumericDocValue(Term term,
 MapString,Long), incase I wish to update multiple-fields and it's
 doc-values?

 2. Instead of a Term based update, is it possible to extend it to using a
 Query? What are the obvious problems in doing so?

 3. TrackingIndexWriter does not have updateNumericDocValue exposed. Any
 reason of not doing so?

 4. Is it possible to update a DocValue other than long, like lets say a
 BinaryDV?

 --
 Ravi



Re: Adding custom weights to individual terms

2014-02-13 Thread Shai Erera
I often prefer to manage such weights outside the index. Usually managing
them inside the index leads to problems in the future when e.g. the weights
change. If they are encoded in the index, it means re-indexing. Also, if
the weight changes, then in some segments the weight will be different than in
others. I think that if you manage the weights e.g. in a simple FST (which
is very compact), it will give you the best flexibility and it's very easy
to use.
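E.g. a tiny sketch with the FST API (untested; since PositiveIntOutputs outputs
longs I store the weights scaled by 100, and inputs must be added in sorted
order):

PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
IntsRef scratch = new IntsRef();
builder.add(Util.toIntsRef(new BytesRef("lucene"), scratch), 70L); // weight 0.70
builder.add(Util.toIntsRef(new BytesRef("search"), scratch), 99L); // weight 0.99
FST<Long> fst = builder.finish();
Long weight = Util.get(fst, new BytesRef("lucene")); // 70, i.e. 0.70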

Shai


On Thu, Feb 13, 2014 at 1:36 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 You could stuff your custom weights into a payload, and index that,
 but this is per term per document per position, while it sounds like
 you just want one float for each term regardless of which
 documents/positions where that term occurred?

 Doing your own custom attribute would be a challenge: not only must
 you create  set this attribute during indexing, but you then must
 change the indexing process (custom chain, custom codec) to get the
 new attribute into the index, and then make a custom query that can
 pull this attribute at search time.

 What are these term weights?  Are you sure you can't compute these
 weights at search time with a custom similarity using the stats that
 are already stored (docFreq, totalTermFreq, maxDoc, etc.)?

 Mike McCandless

 http://blog.mikemccandless.com


 On Thu, Feb 13, 2014 at 2:40 AM, Rune Stilling s...@rdfined.dk wrote:
  Hi list
 
  I'm trying to figure out how customizable scoring and weighting is in
 the Lucene API. I read about the API's but still can't figure out if the
 following is possible.
 
  I would like to do normal document text indexing, but I would like to
 control the weight added to tokens myself; also I would like to control
 the weighting of query tokens and how things are added together.

  When indexing a word I would like to attach my own weights to the word,
 and use these weights when querying for documents. F.ex.
 
  Doc 1
  Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99)
 API(0.3)
 
  Doc 2
  Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)
 
  The floats in parentheses are ones I would like to add in the indexing
 process, not something coming from Lucene tf/idf, for example.

  When querying I would like to repeat this and also create the weights for
 each term myself and control how the final doc score is calculated.
 
  I have read that it's possible to attach your own custom attributes to
 tokens. Is this the way to go? Ie. should I add my custom weight as
 attributes to tokens, and then access these attributes when calculating
 document score in the search process (described here
 https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/analysis/package-summary.html
  under adding a custom attribute)?
 
  The reason why I'm asking is that I can't find any examples of this
 being done anywhere. But I found someone stating With Lucene, it is
 impossible to increase or decrease the weight of individual terms in a
 document.
 
  With regards
  Rune

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Shai Erera
Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks adjacent
segments and SortingMP ensures the merged segment is also sorted.
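I.e. roughly (sketch; SortingMergePolicy lives in the misc module, and I'm
assuming a numeric 'timestamp' field sorted newest-first):

Sort sortByTime = new Sort(new SortField("timestamp", SortField.Type.LONG, true));
iwc.setMergePolicy(new SortingMergePolicy(new LogByteSizeMergePolicy(), sortByTime));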

Shai


On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:

 Yes exactly as you have described.

 Ex: Consider segments [S1, S2, S3 & S4] in reverse-chronological order that
 go for a merge

 While SortingMergePolicy will correctly solve the merge-part, it does not
 however play any role in picking segments to merge right?

 SMP internally delegates to TieredMergePolicy, which might pick S1 & S4 to
 merge disturbing the global-order. Ideally only adjacent segments should
 be picked up for merge. Ex: {S1,S2} or {S2,S3,S4} etc...

 Can there be a better selection of segments to merge in this case, so as to
 maintain a semblance of global-ordering?

 --
 Ravi



 On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless 
 luc...@mikemccandless.com wrote:

  OK, I see (early termination).
 
  That's a challenge, because you really want the docs sorted backwards
  from how they were added right?  And, e.g., merged and then searched
  in reverse segment order?
 
  I think you should be able to do this w/ SortingMergePolicy?  And then
  use a custom collector that stops after you've gone back enough in
  time for a given search.
 
  Mike McCandless
 
  http://blog.mikemccandless.com
 
 
  On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan
  ravikumar.govindara...@gmail.com wrote:
   Mike,
  
   All our queries need to be sorted by timestamp field, in descending
 order
   of time. [latest-first]
  
   Each segment is sorted in itself. But TieredMergePolicy picks arbitrary
   segments and merges them [even with SortingMergePolicy etc...]. I am
  trying
   to avoid this and see if an approximate global ordering of segments [by
   time-stamp field] can be maintained via merge.
  
   Ex: TopN results will only examine recent 2-3 smaller segments
  [best-case]
   and return, without examining older and bigger segments.
  
   I do not know the terminology, maybe Early Query Termination Across
   Segments etc...?
  
   --
   Ravi
  
  
   On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless 
   luc...@mikemccandless.com wrote:
  
   LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total
   order.
  
   Only TieredMergePolicy merges out-of-order segments.
  
   I don't understand why you need to encouraging merging of the more
   recent (by your time field) segments...
  
   Mike McCandless
  
   http://blog.mikemccandless.com
  
  
   On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
   ravikumar.govindara...@gmail.com wrote:
Mike,
   
Each of my flushed segment is fully ordered by time. But
   TieredMergePolicy
or LogByteSizeMergePolicy is going to pick arbitrary time-segments
 and
disturb this arrangement and I wanted some kind of control on this.
   
But like you pointed-out, going by only be time-adjacent merges can
 be
disastrous.
   
Is there a way to mix both time and size to arrive at a somewhat
[less-than-accurate] global order of segment merges.
   
Like attempt a time-adjacent merge, provided size of segments is not
extremely skewed etc...
   
--
Ravi
   
   
   
   
   
   
   
On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless 
luc...@mikemccandless.com wrote:
   
You want to focus merging on the segments containing newer
 documents?
Why?  This seems somewhat dangerous...
   
Not taking into account the true segment size can lead to very
 very
poor merge decisions ... you should turn on IndexWriter's
 infoStream
and do a long running test to convince yourself the merging is
 being
sane.
   
Mike
   
Mike McCandless
   
http://blog.mikemccandless.com
   
   
On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
ravikumar.govindara...@gmail.com wrote:
 Thanks Mike,

 Will try your suggestion. I will try to describe the actual
  use-case
itself

 There is a requirement for merging time-adjacent segments
   [append-only,
 rolling time-series data]

 All Documents have a timestamp affixed and during flush I need to
  note
down
 the least timestamp for all documents, through Codec.

 Then, I define a TimeMergePolicy extends LogMergePolicy and
 define
  the
 segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].

 LogMergePolicy will auto-arrange levels of segments according
 time
  and
 proceed with merges. Latest segments will be lesser in size and
   preferred
 during merges than older and bigger segments

 Do you think such an approach will be fine or there are better
  ways to
 solve this?

 --
 Ravi


 On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless 
 luc...@mikemccandless.com wrote:

 Somewhere in those numeric trie terms are the exact integers
 from
   your
 documents, 

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Shai Erera
Hi

LogMP *always* picks adjacent segments together. Therefore, if you have
segments S1, S2, S3, S4 where the date-wise sort order is S4 > S3 > S2 > S1,
then LogMP will pick either S1-S4, S2-S4, S2-S3 and so on. But always adjacent
segments and in a row (i.e. it doesn't skip segments).

I guess what both Mike and I don't understand is why you insist on merging
based on the timestamp of each segment. I.e. if the order, timestamp-wise,
of the segments isn't as I described above, then merging them like so won't
hurt - i.e. they will still be unsorted. No harm is done.

Maybe MergePolicy isn't what you need here. If you can record somewhere the
min/max timestamp of each segment, you can use a MultiReader to wrap the
sorted list of IndexReaders (actually SegmentReaders). Then your reader
always traverses segments from new to old.
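Roughly like this (sketch; 'byMaxTimestampDesc' is a comparator you'd implement
over your own per-segment timestamp bookkeeping, it's not a Lucene API):

List<AtomicReaderContext> leaves = new ArrayList<>(directoryReader.leaves());
Collections.sort(leaves, byMaxTimestampDesc); // newest segment first
IndexReader[] ordered = new IndexReader[leaves.size()];
for (int i = 0; i < leaves.size(); i++) {
  ordered[i] = leaves.get(i).reader();
}
MultiReader reader = new MultiReader(ordered, false); // don't close the shared subreaders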

If this approach won't address your issue, then you can merge based on
timestamps - there's nothing wrong about it. What Mike suggested is that
you benchmark your application with this merge policy, for a long period of
time (few hours/days, depending on your indexing rate), because what might
happen is that your merges are always unbalanced and your indexing
performance will degrade because of unbalanced amount of IO that happens
during the merge.

Shai


On Thu, Feb 13, 2014 at 7:25 AM, Ravikumar Govindarajan 
ravikumar.govindara...@gmail.com wrote:

 @Mike,

 I had suggested the same approach in one of my previous mails, where-by
 each segment records min/max timestamps in seg-info diagnostics and use it
 for merging adjacent segments.

 Then, I define a TimeMergePolicy extends LogMergePolicy and define the
 segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag]. 

 But you have expressed reservations

 This seems somewhat dangerous...

 Not taking into account the true segment size can lead to very very
 poor merge decisions ... you should turn on IndexWriter's infoStream
 and do a long running test to convince yourself the merging is being
 sane.

 Will merging be disastrous, if I choose a TimeMergePolicy? I will also test
 and verify, but it's always great to hear finer points from experts.

 @Shai,

 LogByteSizeMP categorizes adjacency by size, whereas it would be better
 if timestamp is used in my case

 Sure, I need to wrap this in an SMP to make sure that the newly-created
 segment is also in sorted-order

 --
 Ravi



 On Wed, Feb 12, 2014 at 8:29 PM, Shai Erera ser...@gmail.com wrote:

  Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks
 adjacent
  segments and SortingMP ensures the merged segment is also sorted.
 
  Shai
 
 
  On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan 
  ravikumar.govindara...@gmail.com wrote:
 
   Yes exactly as you have described.
  
   Ex: Consider Segment[S1, S2, S3 & S4] are in reverse-chronological order
  and
   goes for a merge
  
   While SortingMergePolicy will correctly solve the merge-part, it does
 not
   however play any role in picking segments to merge right?
  
   SMP internally delegates to TieredMergePolicy, which might pick S1 & S4
 to
   merge disturbing the global-order. Ideally only adjacent segments
  should
   be picked up for merge. Ex: {S1,S2} or {S2,S3,S4} etc...
  
   Can there be a better selection of segments to merge in this case, so
 as
  to
   maintain a semblance of global-ordering?
  
   --
   Ravi
  
  
  
   On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless 
   luc...@mikemccandless.com wrote:
  
OK, I see (early termination).
   
That's a challenge, because you really want the docs sorted backwards
from how they were added right?  And, e.g., merged and then searched
in reverse segment order?
   
I think you should be able to do this w/ SortingMergePolicy?  And
 then
use a custom collector that stops after you've gone back enough in
time for a given search.
   
Mike McCandless
   
http://blog.mikemccandless.com
   
   
On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan
ravikumar.govindara...@gmail.com wrote:
 Mike,

 All our queries need to be sorted by timestamp field, in descending
   order
 of time. [latest-first]

 Each segment is sorted in itself. But TieredMergePolicy picks
  arbitrary
 segments and merges them [even with SortingMergePolicy etc...]. I
 am
trying
 to avoid this and see if an approximate global ordering of segments
  [by
 time-stamp field] can be maintained via merge.

 Ex: TopN results will only examine recent 2-3 smaller segments
[best-case]
 and return, without examining older and bigger segments.

 I do not know the terminology, may be Early Query Termination
 Across
 Segments etc...?

 --
 Ravi


 On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless 
 luc...@mikemccandless.com wrote:

 LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the
  total
 order.

 Only TieredMergePolicy merges out

Re: Regarding DrillDown search

2014-02-10 Thread Shai Erera
Hi

You will need to build a BooleanQuery which comprises a list of
PrefixQuery. The relation between each PrefixQuery should be OR or AND, as
you see fit (I believe OR?).

In order to get documents' attributes you should execute searcher.search()
w/ e.g. MultiCollector which wraps a FacetsCollector and
TopScoreDocCollector. Then after .search() finished, you should pull the
facet results from the FacetsCollector instance and the document results
from the TopScoreDocCollector instance. Something like (I hope it compiles
in 3.6! :)):

TopScoreDocCollector tsdc = TopScoreDocCollector.create(...);
FacetsCollector fc = FacetsCollector.create(...);
searcher.search(query, MultiCollector.wrap(tsdc, fc));

List<FacetResult> facetResults = fc.getFacetResults();
TopDocs topDocs = tsdc.topDocs();

Something like that..

Shai


On Mon, Feb 10, 2014 at 1:57 PM, Jebarlin Robertson jebar...@gmail.com wrote:

 Dear Shai,

 Thank you for the quick response :)

 I have checked with PrefixQuery and term, it is working fine, But I think I
 cannot pass multiple Category path in it. I am calling the
 DrillDown.term() method 'N' number of times based on the number of Category
 Path list.

  And I have one more question: when I get the FacetResult, I am getting only
  the count of documents matched with the Category Path.
  Is there any way to also get the Document object along with the count, to
  know the file names? For ex. files (file names - the title field in the Document)
  which have the same Author, from the FacetResult. I have read some articles
  about this in one of your answers, I believe.
 In that you have explained like this Categories will be listed to the user
 and when the user clicks the category we have to do DrillDown search to get
 further result.
 I just want to know if we can get the document names as well in the first
 Facet query search itself, when we get the count (no of hits) of documents
 along with the FacetResult. Is there any solution available already or what
 I can do for that.

 Kindly Guide me :)

 Thank you for All your Support.

 Regards,
 Jebarlin.R


 On Mon, Feb 10, 2014 at 1:28 PM, Shai Erera ser...@gmail.com wrote:

  Hi
 
  If you want to drill-down on first name only, then you have several
  options:
 
  1) Index Author/First, Author/Last, Author/First_Last as facets on the
  document. This is the faster approach, but bloats the index. Also, if you
  index the author Author/Jebarlin, Author/Robertson and
  Author/Jebarlin_Robertson, it still won't allow you to execute a query
  Author/Jebar.
 
  2) You should modify the query to be a PrefixQuery, as if the user chose
 to
  search Author/Jeral*. You can do that with DrillDown.term() to create a
  Term($facets, Author/Jeral) (NOTE: you shouldn't pass '*' as part of the
  CategoryPath) and then construct your own PrefixQuery with that Term.
 
  Hope that helps,
  Shai
 
 
  On Mon, Feb 10, 2014 at 6:21 AM, Jebarlin Robertson jebar...@gmail.com
  wrote:
 
   Dear Shai,
  
   I have one doubt in DrillDown search, when I search with a CategoryPath
  of
   author, it is giving me the result if I give the accurate full name
 only.
   Is there any way to get the result even if I give the first or last
 name.
   Can you help me to search like (*contains* the word in Facet search),
 if
   the latest API supports or any other APIs.
  
   Thank You
  
   --
   Thanks  Regards,
   Jebarlin Robertson.R
   GSM: 91-9538106181.
  
 



 --
 Thanks  Regards,
 Jebarlin Robertson.R
 GSM: 91-9538106181.



Re: Regarding DrillDown search

2014-02-10 Thread Shai Erera
Ahh I see ... so given a single FacetResultNode, you would like to know
which documents contributed to its weight (count in your case). This is not
available immediately, that's why you need to do a drill-down query. So if
you return the user a list of categories, when he clicks one of them, you
perform a drill-down query on that category and retrieve all the associated
documents.

May I ask why do you need to know the list of documents given a
FacetResultNode?

Basically in the 3.6 API it's kind of not so simple to do what you want in
one-pass, but in the 4.x API (especially the upcoming 4.7) it should be
very easy -- when you traverse the list of matching documents, besides only
reading the list of categories associated with it, you also store a map
Category -> List<docID>. This isn't very cheap though ...

So I guess it would be good if I understood why you need to know which
documents contributed to which category, before the results are returned to
the user.

Shai


On Mon, Feb 10, 2014 at 3:16 PM, Jebarlin Robertson jebar...@gmail.com wrote:

 Hi Shai,

 Thanks,

 I am using the same approach, a BooleanQuery with a list of PrefixQuery clauses.
 I think I confused you, sorry :).

 I am using the same code as above to get the resulting documents. I am getting
 the TopDocs and retrieving the Documents as well; if I didn't even try that
 much, you would kill me :D.
 But my question was different: from the List of FacetResult I am getting
 only the counts (number of hits) of documents in each category after iterating
 the list.
 I believe that the getLevel() of FacetNode returns the number of hits, i.e. the
 number of documents that fall into the particular category.
 I also need to know which documents fall under the same category,
 from the FacetResult object itself.

 I hope you will understand my question :)

 Thank you :)

 --
 Jebarlin



 On Mon, Feb 10, 2014 at 9:09 PM, Shai Erera ser...@gmail.com wrote:

  Hi
 
  You will need to build a BooleanQuery which comprises a list of
  PrefixQuery. The relation between each PrefixQuery should be OR or AND,
 as
  you see fit (I believe OR?).
 
  In order to get documents' attributes you should execute
 searcher.search()
  w/ e.g. MultiCollector which wraps a FacetsCollector and
  TopScoreDocCollector. Then after .search() finished, you should pull the
  facet results from the FacetsCollector instance and the document results
  from the TopScoreDocCollector instance. Something like (I hope it
 compiles
  in 3.6! :)):
 
  TopScoreDocCollector tsdc = TopScoreDocCollector.create(...);
  FacetsCollector fc = FacetsCollector.create(...);
  searcher.search(query, MultiCollector.wrap(tsdc, fc));
 
   List<FacetResult> facetResults = fc.getFacetResults();
  TopDocs topDocs = tsdc.topDocs();
 
  Something like that..
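  To also read each hit's stored fields (e.g. the file name in your title field),
  load the documents from the searcher once the search has finished -- a rough
  sketch, with "title" as an assumed stored field name:
 
  for (ScoreDoc sd : topDocs.scoreDocs) {
    Document hit = searcher.doc(sd.doc); // loads only the stored fields of this hit
    String fileName = hit.get("title");  // "title" is an illustrative field name
    // display fileName next to the facet counts
  }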
 
  Shai
 
 
  On Mon, Feb 10, 2014 at 1:57 PM, Jebarlin Robertson jebar...@gmail.com
  wrote:
 
   Dear Shai,
  
   Thank you for the quick response :)
  
   I have checked with PrefixQuery and term, it is working fine, But I
  think I
   cannot pass multiple Category path in it. I am calling the
   DrillDown.term() method 'N' number of times based on the number of
  Category
   Path list.
  
   And I have one more question, When I get the FacetResult, I am getting
  only
   the count of documents matched with the Category Path.
   Is there anyway to get the Document object also along with the count to
   know the file names For ex. Files (file names -title Field in Document)
   which have the same Author from the FacetResult. I have read some
  articles
   for the same from one of your answer I believe.
   In that you have explained like this Categories will be listed to the
  user
   and when the user clicks the category we have to do DrillDown search to
  get
   further result.
   I just want to know if we can get the document names as well in the
 first
   Facet query search itself, when we get the count (no of hits) of
  documents
   along with the FacetResult. Is there any solution available already or
  what
   I can do for that.
  
   Kindly Guide me :)
  
   Thank you for All your Support.
  
   Regards,
   Jebarlin.R
  
  
   On Mon, Feb 10, 2014 at 1:28 PM, Shai Erera ser...@gmail.com wrote:
  
Hi
   
If you want to drill-down on first name only, then you have several
options:
   
1) Index Author/First, Author/Last, Author/First_Last as facets on
 the
document. This is the faster approach, but bloats the index. Also, if
  you
index the author Author/Jebarlin, Author/Robertson and
Author/Jebarlin_Robertson, it still won't allow you to execute a
 query
Author/Jebar.
   
2) You should modify the query to be a PrefixQuery, as if the user
  chose
   to
search Author/Jeral*. You can do that with DrillDown.term() to
 create a
Term($facets, Author/Jeral) (NOTE: you shouldn't pass '*' as part of
  the
CategoryPath) and then construct your own PrefixQuery with that Term.
   
Hope that helps,
Shai
   
   
On Mon

Re: Regarding DrillDown search

2014-02-10 Thread Shai Erera
What you want sounds like grouping more than faceting?

So e.g. if you have an Author field with values A1, A2, A3, and the user
searches for 'love',
then if I understand correctly, you want to display something like:

Author/A1
  Doc1
  Doc2
Author/A2
  Doc3
  Doc4
Author/A3
  Doc5
  Doc6

Is that right?


Whereas today your result page looks like this:

Facets   Results
--   ---
Author   Doc1_Title
  A1 (4) Doc1_Highlight
  A2 (3) 
  A3 (1) Doc2_Title
 Doc2_Highlight
 +++
 ...

(Forgive my lack of creativity :)).

If you're not interested in join, and just want to add to each document its
Author facet in the results pane, then I suggest you add another stored
field (only stored, not indexed) with the category value. And then you
could display:

Facets   Results
--   ---
Author   Doc1_Title
  A1 (4) Doc1_Highlight
  A2 (3) Author: A1
  A3 (1) 
 Doc2_Title
 Doc2_Highlight
 Author: A2
 +++
 ...

Did I understand properly?
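If so, indexing the extra stored-only field next to the facet could look roughly
like this (a 3.6-style sketch; the field names and values are illustrative):

Document doc = new Document();
// regular indexed + stored fields as before, e.g. the title
doc.add(new Field("title", "Doc1_Title", Field.Store.YES, Field.Index.ANALYZED));
// stored-only copy of the author, used purely for display in the results pane
doc.add(new Field("author_display", "A1", Field.Store.YES, Field.Index.NO));
// the Author/A1 CategoryPath is still added through the facets API as before

// at display time:
// String author = searcher.doc(scoreDoc.doc).get("author_display");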

Shai

On Mon, Feb 10, 2014 at 4:51 PM, Jebarlin Robertson jebar...@gmail.com wrote:

 Hi Shai,

 Thanks for the explanation :)

 For my requirement, I just want to display the list of resulting documents
 to the user.
 In the facet search case too, I already have the resulting document list in
 the TopDocs, and the FacetResults only have the count of documents contributed to
 each Category.

 According to my understanding:

 Suppose I query for the word Love. Now I do a facet search and get 4
 (file) documents as matched results from the TopScoreDocCollector as TopDocs,
 and I get the FacetResult from the FacetsCollector.
 And the FacetResultNode gives me only the values of the category and the
 count of how many documents fall under the same category (maybe by Author or
 other provided categories) among the 4 resulting documents only.

 I feel it would be good if I could get the category association with the resulting
 documents, as I already have the document list from the TopScoreDocCollector.

 I can also do a DrillDown search by selecting each category, but in my case I
 just want to display the 4 document results first and then category-wise,
 say 2 documents by the same Author, etc.

 As per my requirement, I am doing a DrillDown search by asking the user to
 provide the title of the document, the author of the document, etc. as an
 advanced search option.

 ---
 Jebarlin Robertson.R



 On Mon, Feb 10, 2014 at 10:30 PM, Shai Erera ser...@gmail.com wrote:

  Ahh I see ... so given a single FacetResultNode, you would like to know
  which documents contributed to its weight (count in your case). This is
 not
  available immediately, that's why you need to do a drill-down query. So
 if
  you return the user a list of categories, when he clicks one of them, you
  perform a drill-down query on that category and retrieve all the
 associated
  documents.
 
  May I ask why do you need to know the list of documents given a
  FacetResultNode?
 
  Basically in the 3.6 API it's kind of not so simple to do what you want
 in
  one-pass, but in the 4.x API (especially the upcoming 4.7) it should be
  very easy -- when you traverse the list of matching documents, besides
 only
  reading the list of categories associated with it, you also store a map
   Category -> List<docIDs>. This isn't very cheap though ...
 
  So I guess it would be good if I understand why do you need to know which
  documents contributed to which category, before the results are returned
 to
  the user.
 
  Shai
 
 
  On Mon, Feb 10, 2014 at 3:16 PM, Jebarlin Robertson jebar...@gmail.com
  wrote:
 
   Hi Shai,
  
   Thanks,
  
   I am using the same way of BooleanQuery only with list of PrefixQuery
  only.
   I think I confused you sorry :) .
  
   I am using the same above code to get the result of documents. I am
  getting
   the TopDocs and retrieving the Documents also, If I don't even try that
  for
   the basic you will kill me :D.
   But my question was different, from the List of FacetResult I am
 getting
   only the counts or no of hits of Document in each category after
  iterating
   the list.
   I believe that the getLevel() of FacetNode returns the no of hits or no
  of
   documents falls into the particular Category.
   I need to know which are the documents are falling under the same
  category
   from the FacetResult Object also.
  
   I hope you will understand my question :)
  
   Thank you :)
  
   --
   Jebarlin
  
  
  
   On Mon, Feb 10, 2014 at 9:09 PM, Shai Erera ser...@gmail.com wrote:
  
Hi
   
You will need to build a BooleanQuery which comprises a list of
PrefixQuery. The relation between each PrefixQuery should be OR or
 AND,
   as
you see fit (I believe OR?).
   
In order to get documents' attributes you should execute
   searcher.search()
w/ e.g

Re: Regarding DrillDown search

2014-02-10 Thread Shai Erera
You're welcome. And I suggest that you upgrade to 4.7 as soon as it's out!
:)

Shai


On Mon, Feb 10, 2014 at 5:48 PM, Jebarlin Robertson jebar...@gmail.com wrote:

 Hi Shai,

 Yeah, that is exactly the way I want to display it.

 Then I will do it that way, with a stored field.

 It is not about a lack of creativity; I might not have explained it to you
 properly :)

 Thank you for all the support :)


 On Tue, Feb 11, 2014 at 12:23 AM, Shai Erera ser...@gmail.com wrote:

   What you want sounds like grouping more than faceting?
 
  So e.g. if you have an Author field with values A1, A2, A3, and the user
  searches for 'love',
  then if I understand correctly, you want to display something like:
 
  Author/A1
Doc1
Doc2
  Author/A2
Doc3
Doc4
  Author/A3
Doc5
Doc6
 
  Is that right?
 
 
   Whereas today your result page looks like this:
 
  Facets   Results
  --   ---
  Author   Doc1_Title
A1 (4) Doc1_Highlight
A2 (3) 
A3 (1) Doc2_Title
   Doc2_Highlight
   +++
   ...
 
  (Forgive my lack of creativity :)).
 
  If you're not interested in join, and just want to add to each document
 its
  Author facet in the results pane, then I suggest you add another stored
  field (only stored, not indexed) with the category value. And then you
  could display:
 
  Facets   Results
  --   ---
  Author   Doc1_Title
A1 (4) Doc1_Highlight
A2 (3) Author: A1
A3 (1) 
   Doc2_Title
   Doc2_Highlight
   Author: A2
   +++
   ...
 
  Did I understand properly?
 
  Shai
 
  On Mon, Feb 10, 2014 at 4:51 PM, Jebarlin Robertson jebar...@gmail.com
  wrote:
 
   Hi Shai,
  
   Thanks for the explanation :)
  
   For my requirement, I just want to display the list of resulted
 documents
   to the user.
   In Facet search case also, I already have the resulted documents list
 in
   TopDoc and the FacetResults have only the count of documents
 contributed
  to
    each Category,
  
   According to my understanding,
  
   Suppose I query for the word Love, Now I do Facet Search and gets 4
   (Files) documents as matched results from TopScoreDocCollector as
 TopDocs
   and I will get the FacetResult from the FacetCollector.
   And the FacetResultsNode gives me only the values of the category and
 the
   count of how many documents falls under same category (May be by Author
  or
   other provided categories ) among the 4 resulted documents only.
  
   I feel, It will be good if I get the category association with the
  resulted
   documents, as I have the document list already from
 TopScoreDocCollector.
  
   I can do DrillDown Search also by selecting each category, But in my
  case I
   just want to display the 4 documents result first and then category
 wise,
   suppose 2 documents by the same Author etc
  
   As per my requirement, I am doing DrillDown Search by asking the user
 to
    provide such as title of the document, author of the document, etc... as
   advanced search option.
  
   ---
   Jebarlin Robertson.R
  
  
  
   On Mon, Feb 10, 2014 at 10:30 PM, Shai Erera ser...@gmail.com wrote:
  
Ahh I see ... so given a single FacetResultNode, you would like to
 know
which documents contributed to its weight (count in your case). This
 is
   not
available immediately, that's why you need to do a drill-down query.
 So
   if
you return the user a list of categories, when he clicks one of them,
  you
perform a drill-down query on that category and retrieve all the
   associated
documents.
   
May I ask why do you need to know the list of documents given a
FacetResultNode?
   
Basically in the 3.6 API it's kind of not so simple to do what you
 want
   in
one-pass, but in the 4.x API (especially the upcoming 4.7) it should
 be
very easy -- when you traverse the list of matching documents,
 besides
   only
reading the list of categories associated with it, you also store a
 map
 Category -> List<docIDs>. This isn't very cheap though ...
   
So I guess it would be good if I understand why do you need to know
  which
documents contributed to which category, before the results are
  returned
   to
the user.
   
Shai
   
   
On Mon, Feb 10, 2014 at 3:16 PM, Jebarlin Robertson 
  jebar...@gmail.com
wrote:
   
 Hi Shai,

 Thanks,

 I am using the same way of BooleanQuery only with list of
 PrefixQuery
only.
 I think I confused you sorry :) .

 I am using the same above code to get the result of documents. I am
getting
 the TopDocs and retrieving the Documents also, If I don't even try
  that
for
 the basic you will kill me :D.
 But my question was different, from the List of FacetResult I am
   getting
 only the counts

Re: Regarding DrillDown search

2014-02-09 Thread Shai Erera
Hi

If you want to drill-down on first name only, then you have several options:

1) Index Author/First, Author/Last, Author/First_Last as facets on the
document. This is the faster approach, but bloats the index. Also, if you
index the author Author/Jebarlin, Author/Robertson and
Author/Jebarlin_Robertson, it still won't allow you to execute a query
Author/Jebar.

2) You should modify the query to be a PrefixQuery, as if the user chose to
search Author/Jeral*. You can do that with DrillDown.term() to create a
Term($facets, Author/Jeral) (NOTE: you shouldn't pass '*' as part of the
CategoryPath) and then construct your own PrefixQuery with that Term.
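For example, a rough sketch (the exact DrillDown.term() overload differs between
versions, and indexingParams, userQuery and the Jeral prefix are placeholders):

// build the drill-down term for the partial path Author/Jeral (no '*' inside the path)
Term t = DrillDown.term(indexingParams, new CategoryPath("Author", "Jeral"));
// wrap it in a PrefixQuery so Author/Jeral, Author/Jeralin, ... all match
Query prefixDrillDown = new PrefixQuery(t);
// AND it with the user's text query
BooleanQuery q = new BooleanQuery();
q.add(userQuery, BooleanClause.Occur.MUST);
q.add(prefixDrillDown, BooleanClause.Occur.MUST);
searcher.search(q, 10);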

Hope that helps,
Shai


On Mon, Feb 10, 2014 at 6:21 AM, Jebarlin Robertson jebar...@gmail.com wrote:

 Dear Shai,

 I have one doubt in DrillDown search, when I search with a CategoryPath of
 author, it is giving me the result if I give the accurate full name only.
 Is there any way to get the result even if I give the first or last name.
 Can you help me to search like (*contains* the word in Facet search), if
 the latest API supports or any other APIs.

 Thank You

 --
 Thanks & Regards,
 Jebarlin Robertson.R
 GSM: 91-9538106181.



Re: Regarding CorruptedIndexException in using Lucene Facet Search

2014-02-07 Thread Shai Erera
Hi

Since 4.2 the facets module has gone under major changes, both API and
implementation and performance has improved x4. If you want to upgrade,
then I recommend waiting for 4.7 since we overhauled the API again - this
will save you the efforts to migrate to e.g 4.6 and then to the new API
once 4.7 is out.

And you should always use the same version of Lucene for all of its modules
- it's the only way to guarantee things will work :).

Shai


On Fri, Feb 7, 2014 at 9:05 AM, Jebarlin Robertson jebar...@gmail.com wrote:

 Dear Shai,

 I only made the mistake by using the same directory for both IndexWriter
 and FacetWriter. Now it is working fine .Thank you :)

 Could you please tell me if there is any major performance difference in
  using the *3.6* and *4.x* *Facet* library?
 Since I use the Lucene 3.6 version, I am using Facet library also the same
 version.

 Kindly guide me to use the best and the working one. :)
 Thank you :)


 Thanks and Regards,
 Jebarlin Robertson.R



 On Fri, Feb 7, 2014 at 12:41 PM, Jebarlin Robertson jebar...@gmail.com
 wrote:

  Dear Shai,
 
  Thank you for your reply.
 
  Actually I am using Lucene 3.6 on Android. It is working fine, but with the
  latest versions there are some issues.
  Now I just added this Facet search library also along with the old Lucene
  code.
  After this facet search integration, it is giving these CorruptIndexException and
  NullPointerException errors when I add the document object to the IndexWriter.
 
  Below is the exception.
 
  02-07 12:38:11.006: W/System.err(5411): java.lang.NullPointerException
  02-07 12:38:11.006: W/System.err(5411): at
 
 org.apache.lucene.facet.index.streaming.CategoryParentsStream.incrementToken(CategoryParentsStream.java:138)
  02-07 12:38:11.006: W/System.err(5411): at
 
 org.apache.lucene.facet.index.streaming.CountingListTokenizer.incrementToken(CountingListTokenizer.java:63)
  02-07 12:38:11.006: W/System.err(5411): at
 
 org.apache.lucene.facet.index.streaming.CategoryTokenizer.incrementToken(CategoryTokenizer.java:48)
  02-07 12:38:11.006: W/System.err(5411): at
 
 org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:141)
  02-07 12:38:11.006: W/System.err(5411): at
 
 org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:276)
  02-07 12:38:11.006: W/System.err(5411): at
 
 org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)
  02-07 12:38:11.006: W/System.err(5411): at
  org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2060)
  02-07 12:38:11.006: W/System.err(5411): at
  org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2034)
  02-07 12:38:11.006: W/System.err(5411): at
 
 com.example.lucene.threads.AsyncIndexWriter.addDocumentSynchronous(AsyncIndexWriter.java:343)
  02-07 12:38:11.006: W/System.err(5411): at
 
 com.example.lucene.threads.AsyncIndexWriter.addDocumentSync(AsyncIndexWriter.java:371)
 
 
  Just try to help, If I am missing something.
 
  Thanks and regards,
  Jebarlin.R
 
 
  On Thu, Feb 6, 2014 at 11:04 PM, Shai Erera ser...@gmail.com wrote:
 
  It looks like something's wrong with the index indeed. Are you sure you
  committed both the IndexWriter and TaxoWriter?
  Do you have some sort of testcase / short program which demonstrates the
  problem?
 
  I know there were few issues running Lucene on Android, so I cannot
  guarantee it works fully .. we never tested this code on Android.
 
  Shai
 
 
  On Thu, Feb 6, 2014 at 3:21 PM, Jebarlin Robertson jebar...@gmail.com
  wrote:
 
   Hi,
  
   I am using Lucene 3.6 version for indexing and searching in Android
   application.
   I have implemented Facet search. But when I try to search, it is
 giving
  the
   below exception  while creating the DirectoryTaxonomyReader object.
  
   02-06 21:00:58.082: W/System.err(15518):
   org.apache.lucene.index.CorruptIndexException: Missing parent data for
   category 1
  
  
   Can anyone help me to know what is the problem for this. Whether the
   Categories are not added to the Lucene index or some other problem.
  
  
   It will be better if somebody provides some sample code to use lucene
  facet
   for 3.6 version.
  
  
   --
    Thanks & Regards,
   Jebarlin Robertson.R
   GSM: 91-9538106181.
  
 
 
 
 
  --
   Thanks & Regards,
  Jebarlin Robertson.R
  GSM: 91-9538106181.
 



 --
 Thanks & Regards,
 Jebarlin Robertson.R
 GSM: 91-9538106181.



Re: Regarding CorruptedIndexException in using Lucene Facet Search

2014-02-06 Thread Shai Erera
It looks like something's wrong with the index indeed. Are you sure you
committed both the IndexWriter and TaxoWriter?
Do you have some sort of testcase / short program which demonstrates the
problem?

I know there were few issues running Lucene on Android, so I cannot
guarantee it works fully .. we never tested this code on Android.

Shai


On Thu, Feb 6, 2014 at 3:21 PM, Jebarlin Robertson jebar...@gmail.com wrote:

 Hi,

 I am using Lucene 3.6 version for indexing and searching in Android
 application.
 I have implemented Facet search. But when I try to search, it is giving the
 below exception  while creating the DirectoryTaxonomyReader object.

 02-06 21:00:58.082: W/System.err(15518):
 org.apache.lucene.index.CorruptIndexException: Missing parent data for
 category 1


 Can anyone help me to know what is the problem for this. Whether the
 Categories are not added to the Lucene index or some other problem.


 It will be better if somebody provides some sample code to use lucene facet
 for 3.6 version.


 --
 Thanks & Regards,
 Jebarlin Robertson.R
 GSM: 91-9538106181.



Re: updating docs when using SortedSetDocValuesFacetFields

2014-01-22 Thread Shai Erera
Note that Lucene doesn't support general in-place document updates, and
updating a document means first deleting it and adding it back.

Therefore if you only intend to add/change few categories of an existing
document, you have to fully re-index the document. This is not specific to
categories but applies for any field that you add, except NumericDocValues
fields which support in-place document updates since Lucene 4.6.
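For illustration, a minimal sketch (the "id" term and the "price" field are just
example names):

// a "full" update: delete-by-term plus add, i.e. the document is re-indexed
writer.updateDocument(new Term("id", "doc-42"), rebuiltDoc);

// since Lucene 4.6, a numeric doc-values field can be updated in place,
// without re-indexing the rest of the document
writer.updateNumericDocValue(new Term("id", "doc-42"), "price", 100L);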

Shai


On Wed, Jan 22, 2014 at 1:15 AM, Rose, Stuart J stuart.r...@pnnl.gov wrote:

 I'm using Lucene 4.4 with SortedSetDocValuesFacetFields and would like to
 add and/or remove CategoryPaths for certain documents in the index.

 Basically, as additional sets of docs are added, the CategoryPaths for
 some of the previously indexed documents need to be changed.

 My current testing using writer.updateDocument(docIdTerm, docFields)
 seems to be generating some duplicates, as there are more documents in the
 index than expected.

 Is this a known issue with SortedSetDocValuesFacetFields and discouraged?

 Thanks!
 Stuart




Re: Issue with FacetFields.addFields() throwing ArrayIndexOutOfBoundsException

2014-01-17 Thread Shai Erera
Do you have a test which reproduces the error? Are you adding categories
with very deep hierarchies?

Shai


On Fri, Jan 17, 2014 at 11:59 PM, Matthew Petersen mdpe...@gmail.com wrote:

 I've confirmed that using the LruTaxonomyWriterCache solves the issue for
  me.  It would appear there is in fact a bug in the Cl2oTaxonomyWriterCache
 or I am using it incorrectly (I use it as default, no customization).


 On Fri, Jan 17, 2014 at 9:29 AM, Matthew Petersen mdpe...@gmail.com
 wrote:

  I'm sure.  I had seen that issue and it looked similar but the stack
 trace
  is slightly different.  I've found that if I replace the
  Cl2oTaxonomyWriterCache with the LruTaxonomyWriterCache the problem seems
  to go away.  I'm working right now on running a test that will prove this
  but it takes a while as the cache needs to get very large.  If this
 proves
  to solve the problem then I'd say there is still a bug in the
  Cl2oTaxonomyWriterCache implementation.
 
  Thanks for the response.
  Matt
 
 
  On Fri, Jan 17, 2014 at 6:36 AM, Michael McCandless 
  luc...@mikemccandless.com wrote:
 
  Are you sure you're using 4.4?
 
  Because ... this looks like
  https://issues.apache.org/jira/browse/LUCENE-5048 but that was
  supposedly fixed in 4.4.
 
  Mike McCandless
 
  http://blog.mikemccandless.com
 
 
  On Thu, Jan 16, 2014 at 5:33 PM, Matthew Petersen mdpe...@gmail.com
  wrote:
   I’m having an issue with an index when adding category paths to a
  document.
They seem to be added without issue for a long period of time, then
 for
   some unknown reason the addition fails with an ArrayIndexOutOfBounds
   exception.  Subsequent attempts to add category paths fail with the
 same
   exception.  I’ve run CheckIndex on both the index and the taxonomy
   directory and both come back as clean with no issues.  I cannot fix
 the
   index because according to lucene it is not broken.  Could this be a
  bug in
   lucene?  Below is the stack trace when the exception occurs:
  
  
   Lucene v4.4.0
  
  
   java.lang.ArrayIndexOutOfBoundsException: -65535
  
   at java.util.ArrayList.elementData(ArrayList.java:371)
  
   at java.util.ArrayList.get(ArrayList.java:384)
  
   at
  
 
 org.apache.lucene.facet.taxonomy.writercache.cl2o.CharBlockArray.charAt(CharBlockArray.java:152)
  
   at
  
 
 org.apache.lucene.facet.taxonomy.writercache.cl2o.CategoryPathUtils.equalsToSerialized(CategoryPathUtils.java:61)
  
   at
  
 
 org.apache.lucene.facet.taxonomy.writercache.cl2o.CompactLabelToOrdinal.getOrdinal(CompactLabelToOrdinal.java:257)
  
   at
  
 
 org.apache.lucene.facet.taxonomy.writercache.cl2o.CompactLabelToOrdinal.getOrdinal(CompactLabelToOrdinal.java:140)
  
   at
  
 
 org.apache.lucene.facet.taxonomy.writercache.cl2o.Cl2oTaxonomyWriterCache.get(Cl2oTaxonomyWriterCache.java:74)
  
   at
  
 
 org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter.addCategory(DirectoryTaxonomyWriter.java:455)
  
   at
 
 org.apache.lucene.facet.index.FacetFields.addFields(FacetFields.java:175)
  
   at
  
 
 com.logrhythm.messaging.indexing.LogIndexerImpl.getDocument(LogIndexerImpl.java:478)
  
   at
  
 
 com.logrhythm.messaging.indexing.LogIndexerImpl.indexLog(LogIndexerImpl.java:392)
  
   at
  
 
 com.logrhythm.messaging.indexing.LogIndexerImpl.indexLogs(LogIndexerImpl.java:357)
  
   at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
  
   at
  
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  
   at java.lang.reflect.Method.invoke(Method.java:601)
  
   at
  
 
 com.logrhythm.tests.unit.messaging.indexing.LogIndexerTests.logIndexerLoadTest(LogIndexerTests.java:752)
  
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  
   at
  
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  
   at
  
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  
   at java.lang.reflect.Method.invoke(Method.java:601)
  
   at
  
 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
  
   at
  
 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
  
   at
  
 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
  
   at
  
 
 org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
  
   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
  
   at
  
 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
  
   at
  
 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
  
   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
  
   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
  
   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
  
   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
  
   at 

Re: Issue with FacetFields.addFields() throwing ArrayIndexOutOfBoundsException

2014-01-17 Thread Shai Erera
Can you open an issue and attach the test there?
On Jan 18, 2014 12:41 AM, Matthew Petersen mdpe...@gmail.com wrote:

 I do have a test that will reproduce.  I'm not adding categories with very
 deep hierarchies, I'm adding 129 category paths per document (all docs have
 paths with same label) with each path having one value.  All of the values
 are completely random and likely unique.  It's basically a worst case test
 for our app but the condition has been seen in the field (the error has
  been encountered at less than the worst-case scenario).  The test I have
  reproduces it very quickly; I only have to index ~330K docs.


 On Fri, Jan 17, 2014 at 3:27 PM, Shai Erera ser...@gmail.com wrote:

  Do you have a test which reproduces the error? Are you adding categories
  with very deep hierarchies?
 
  Shai
 
 
  On Fri, Jan 17, 2014 at 11:59 PM, Matthew Petersen mdpe...@gmail.com
  wrote:
 
   I've confirmed that using the LruTaxonomyWriterCache solves the issue
 for
   me.  It would appear there is in fact a bug in the
   Cl2oTaxonomyWriterCache
   or I am using it incorrectly (I use it as default, no customization).
  
  
   On Fri, Jan 17, 2014 at 9:29 AM, Matthew Petersen mdpe...@gmail.com
   wrote:
  
I'm sure.  I had seen that issue and it looked similar but the stack
   trace
is slightly different.  I've found that if I replace the
Cl2oTaxonomyWriterCache with the LruTaxonomyWriterCache the problem
  seems
to go away.  I'm working right now on running a test that will prove
  this
but it takes a while as the cache needs to get very large.  If this
   proves
to solve the problem then I'd say there is still a bug in the
Cl2oTaxonomyWriterCache implementation.
   
Thanks for the response.
Matt
   
   
On Fri, Jan 17, 2014 at 6:36 AM, Michael McCandless 
luc...@mikemccandless.com wrote:
   
Are you sure you're using 4.4?
   
Because ... this looks like
https://issues.apache.org/jira/browse/LUCENE-5048 but that was
supposedly fixed in 4.4.
   
Mike McCandless
   
http://blog.mikemccandless.com
   
   
On Thu, Jan 16, 2014 at 5:33 PM, Matthew Petersen 
 mdpe...@gmail.com
wrote:
 I’m having an issue with an index when adding category paths to a
document.
  They seem to be added without issue for a long period of time,
 then
   for
 some unknown reason the addition fails with an
 ArrayIndexOutOfBounds
 exception.  Subsequent attempts to add category paths fail with
 the
   same
 exception.  I’ve run CheckIndex on both the index and the taxonomy
 directory and both come back as clean with no issues.  I cannot
 fix
   the
 index because according to lucene it is not broken.  Could this
 be a
bug in
 lucene?  Below is the stack trace when the exception occurs:


 Lucene v4.4.0


 java.lang.ArrayIndexOutOfBoundsException: -65535

 at java.util.ArrayList.elementData(ArrayList.java:371)

 at java.util.ArrayList.get(ArrayList.java:384)

 at

   
  
 
 org.apache.lucene.facet.taxonomy.writercache.cl2o.CharBlockArray.charAt(CharBlockArray.java:152)

 at

   
  
 
 org.apache.lucene.facet.taxonomy.writercache.cl2o.CategoryPathUtils.equalsToSerialized(CategoryPathUtils.java:61)

 at

   
  
 
 org.apache.lucene.facet.taxonomy.writercache.cl2o.CompactLabelToOrdinal.getOrdinal(CompactLabelToOrdinal.java:257)

 at

   
  
 
 org.apache.lucene.facet.taxonomy.writercache.cl2o.CompactLabelToOrdinal.getOrdinal(CompactLabelToOrdinal.java:140)

 at

   
  
 
 org.apache.lucene.facet.taxonomy.writercache.cl2o.Cl2oTaxonomyWriterCache.get(Cl2oTaxonomyWriterCache.java:74)

 at

   
  
 
 org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter.addCategory(DirectoryTaxonomyWriter.java:455)

 at
   
  
 org.apache.lucene.facet.index.FacetFields.addFields(FacetFields.java:175)

 at

   
  
 
 com.logrhythm.messaging.indexing.LogIndexerImpl.getDocument(LogIndexerImpl.java:478)

 at

   
  
 
 com.logrhythm.messaging.indexing.LogIndexerImpl.indexLog(LogIndexerImpl.java:392)

 at

   
  
 
 com.logrhythm.messaging.indexing.LogIndexerImpl.indexLogs(LogIndexerImpl.java:357)

 at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)

 at

   
  
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.lang.reflect.Method.invoke(Method.java:601)

 at

   
  
 
 com.logrhythm.tests.unit.messaging.indexing.LogIndexerTests.logIndexerLoadTest(LogIndexerTests.java:752)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at

   
  
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

 at

   
  
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43

Re: Index + Taxonomy Replication

2013-11-01 Thread Shai Erera
SearcherTaxonomyManager can be used only for NRT, as it only takes an
IndexWriter and DirectoryTaxonomyWriter. And I don't think you want to keep
those writers open on the slaves side.

I think that a ReferenceManager, which returns a SearcherAndTaxonomy, is
the right thing to do. The reason why we don't offer it is because it's
very tricky to use outside of a well defined refresh protocol. If we let
you refresh a Directory-based pair, and you're not careful enough, you
could end up reopening the IndexReader before the TaxonomyReader was
committed, or vice versa. Both lead to unsynchronized IR/TR pair, which is
bad. However, if your app always calls this maybeRefresh once the Handler
is done (i.e. as a Callback), and it is *the only one* that refreshes the
pair, then you're safe.

Maybe we should offer such a ReferenceManager (maybe it can even be
SearcherTaxonomyManager which takes a pair of Directory in another ctor),
and document that its maybeRefresh needs to be called by the same thread
that modified the index (i.e. commit() or replication).
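As a rough sketch of that idea (pairManager stands for such a hypothetical
ReferenceManager over an IndexSearcher + TaxonomyReader pair, it is not an
existing Lucene class):

Callable<Boolean> onUpdateDone = new Callable<Boolean>() {
  @Override
  public Boolean call() throws Exception {
    // invoked by the replication client after both the index and the taxonomy
    // directories were copied and committed, so refreshing the pair here
    // keeps the two readers in sync
    pairManager.maybeRefresh();
    return true;
  }
};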

Shai


On Thu, Oct 31, 2013 at 12:53 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Maybe have a look at how the IndexAndTaxonomyReplicationClientTest.java
 works?

 Hmm, in its callback, it manually reopens the index + taxoIndex, but I
 think you could instead use a SearcherTaxonomyManager and call its
 .maybeRefresh inside your callback?

 Mike McCandless

 http://blog.mikemccandless.com


 On Wed, Oct 30, 2013 at 11:24 AM, Joe Eckard eckar...@gmail.com wrote:
  Hello,
 
  I'm attempting to setup a master/slave arrangment between two servers
 where
  the master uses a SearcherTaxonomyManger to index and search, and the
 slave
  is read-only - using just an IndexSearcher and TaxonomyReader.
 
  So far I am able to publish new IndexAndTaxonomyRevisions on the master
 and
  pull them down to the slave with no problems (using the HttpReplicator
 and
  an IndexAndTaxonomyReplicationHandler), but I'm not sure how to correctly
  reopen the IndexSearcher and TaxonomyReader pair in the
  ReplicationHandler's callback.
 
  Should I wrap them in some kind of ReferenceManager to allow searches to
  continue on the read-only server during the cutover? Is there a specific
  order they should be reopened in?
 
  Any advice or pointers would be much appreciated.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Index + Taxonomy Replication

2013-11-01 Thread Shai Erera
Opened https://issues.apache.org/jira/browse/LUCENE-5320.

Shai


On Fri, Nov 1, 2013 at 4:59 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Fri, Nov 1, 2013 at 3:12 AM, Shai Erera ser...@gmail.com wrote:

  Maybe we should offer such a ReferenceManager (maybe it can even be
  SearcherTaxonomyManager which takes a pair of Directory in another ctor),
  and document that its maybeRefresh needs to be called by the same thread
  that modified the index (i.e. commit() or replication).

 +1, I think we should do this?

 Mike McCandless

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Merging ordered segments without re-sorting.

2013-10-23 Thread Shai Erera
Hi

You can use SortingMergePolicy and SortingAtomicReader to achieve that. You
can read more about index sorting here:
http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html
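As a rough sketch (the "timestamp" sort field, analyzer and dir are placeholders,
and the exact SortingMergePolicy constructor -- Sorter vs. Sort -- depends on the
Lucene version you're on):

Sort sort = new Sort(new SortField("timestamp", SortField.Type.LONG));
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_45, analyzer);
// wrap the default merge policy so that merged segments come out sorted
conf.setMergePolicy(new SortingMergePolicy(conf.getMergePolicy(), sort));
IndexWriter writer = new IndexWriter(dir, conf);
// ... addDocument() calls ...
writer.forceMerge(1); // the single resulting segment is sorted by "timestamp"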

Shai


On Wed, Oct 23, 2013 at 8:13 PM, Arvind Kalyan bas...@gmail.com wrote:

 Hi there, I'm looking for pointers, suggestions on how to approach this in
 Lucene 4.5.

 Say I am creating an index using a sequence of addDocument() calls and end
 up with segments that each contain documents in a specified ordering. It is
 guaranteed that there won't be updates/deletes/reads etc happening on the
 index -- this is an offline index building task for a read-only index.

 I create the index in the above mentioned fashion
 using LogByteSizeMergePolicy and finally do a forceMerge(1) to get a single
 segment in the ordering I want.

 Now my requirement is that I need to be able to merge this single segment
 with another such segment (say from yesterday's index) and guarantee some
 ordering -- say I have a comparator which looks at some field values in the
 2 given docs and defines the ordering.

 Index 1 with segment X:
 (a,1)
 (b,2)
 (e,10)

 Index 2 (say from yesterday) with some segment Y:
 (c,4)
 (d,6)

 Essentially we have 2 ordered segments, and I'm looking to 'merge' them
 (literally) using the value of some field, without having to re-sort them
  which would be too time & resource consuming.

 Output Index, with some segment Z:
 (a,1)
 (b,2)
 (c,4)
 (d,6)
 (e,10)

 Is this already possible? If not, any tips on how I can approach
 implementing this requirement?

 Thanks,

 --
 Arvind Kalyan



Re: Merging ordered segments without re-sorting.

2013-10-23 Thread Shai Erera
SortingAtomicReader uses the TimSort algorithm, which performs well when
the two segments are already sorted.
Anyway, that's the way to do it, even if it looks like it does more work
than it should.

Shai


On Wed, Oct 23, 2013 at 10:46 PM, Arvind Kalyan bas...@gmail.com wrote:

 Thanks, my understanding is that SortingMergePolicy performs sorting after
 wrapping the 2 segments, correct?

 As I mentioned in my original email I would like to avoid the re-sorting
 and exploit the fact that the input segments are already sorted.



 On Wed, Oct 23, 2013 at 11:02 AM, Shai Erera ser...@gmail.com wrote:

  Hi
 
  You can use SortingMergePolicy and SortingAtomicReader to achieve that.
 You
  can read more about index sorting here:
  http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html
 
  Shai
 
 
  On Wed, Oct 23, 2013 at 8:13 PM, Arvind Kalyan bas...@gmail.com wrote:
 
   Hi there, I'm looking for pointers, suggestions on how to approach this
  in
   Lucene 4.5.
  
   Say I am creating an index using a sequence of addDocument() calls and
  end
   up with segments that each contain documents in a specified ordering.
 It
  is
   guaranteed that there won't be updates/deletes/reads etc happening on
 the
   index -- this is an offline index building task for a read-only index.
  
   I create the index in the above mentioned fashion
   using LogByteSizeMergePolicy and finally do a forceMerge(1) to get a
  single
   segment in the ordering I want.
  
   Now my requirement is that I need to be able to merge this single
 segment
   with another such segment (say from yesterday's index) and guarantee
 some
   ordering -- say I have a comparator which looks at some field values in
  the
   2 given docs and defines the ordering.
  
   Index 1 with segment X:
   (a,1)
   (b,2)
   (e,10)
  
   Index 2 (say from yesterday) with some segment Y:
   (c,4)
   (d,6)
  
   Essentially we have 2 ordered segments, and I'm looking to 'merge' them
   (literally) using the value of some field, without having to re-sort
 them
    which would be too time & resource consuming.
  
   Output Index, with some segment Z:
   (a,1)
   (b,2)
   (c,4)
   (d,6)
   (e,10)
  
   Is this already possible? If not, any tips on how I can approach
   implementing this requirement?
  
   Thanks,
  
   --
   Arvind Kalyan
  
 



 --
 Arvind Kalyan
 http://www.linkedin.com/in/base16
 cell: (408) 761-2030



Re: external file stored field codec

2013-10-17 Thread Shai Erera

 The codec intercepts merges in order to clean up files that are no longer
 referenced


What happens if a document is deleted while there's a reader open on the
index, and the segments are merged? Maybe I misunderstand what you meant by
this statement, but if the external file is deleted, since the document is
pruned from the index, how will the reader be able to read the stored
fields from it? How do you track references to the external files?

Since you write that all tests in the o.a.l.index package pass, I assume
you handle this, but here's a simple testcase I have in mind:

IndexWriter writer = new IndexWriter(dir, configWithNewCodec());
writer.addDocument(addDocWithStoredFields(doc1));
writer.addDocument(addDocWithStoredFields(doc2));
writer.commit();
writer.addDocument(addDocWithStoredFields(doc3));
writer.addDocument(addDocWithStoredFields(doc4));
IndexReader reader = writer.getReader();
writer.deleteDocuments(doc1);
writer.deleteDocuments(doc4);
writer.forceMerge(1);
writer.close();
System.out.println(reader.document(doc1));
System.out.println(reader.document(doc4));

Does this test pass?

Shai


On Fri, Oct 18, 2013 at 7:14 AM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 On 10/13/13 8:09 PM, Michael Sokolov wrote:

 On 10/13/2013 1:52 PM, Adrien Grand wrote:

 Hi Michael,

 I'm not aware enough of operating system internals to know what
 exactly happens when a file is open but it sounds to be like having
 separate files per document or field adds levels of indirection when
 loading stored fields, so I would be surprised it it actually proved
 to be more efficient than storing everything in a single file.

  That's true, Adrien, there's definitely a cost to using files. There
 are some gnarly challenges in here (mostly to do with the large number of
 files, as you say, and with cleaning up after deletes - deletion is always
 hard).  I'm not sure it's going to be possible to both clean up and
 maintain files for stale commits; this will become problematic in the way
 that having index files on NFS mounts are problematic.

 I think the hope is that there will be countervailing savings during
 writes and merges (mostly) because we may be able to cleverly avoid copying
 the contents of stored fields being merged.  There may also be savings when
 querying due to reduced RAM requirements since the large stored fields
 won't be paged in while performing queries.  As I said, some simple tests
 do show improvements under at least some circumstances, so I'm pursuing
 this a bit further.  I have a preliminary implementation as a codec now,
 and I'm learning a bit about Lucene's index internals. BTW SimpleTextCodec
 is a great tool for learning and debugging.

 The background for this is a document store with large files (think PDFs,
 but lots of formats) that have to be tracked, and have associated metadata.
  We've been storing these externally, but it would be beneficial to have a
 single data management layer: i.e. to push this down into Lucene, for a
 variety of reasons.  For one, we could rely on Solr to do our replication
 for us.

 I'll post back when I have some measurements.

 -Mike

 This idea actually does seem to be working out pretty nicely.  I compared
 time to write and then to read documents that included a couple of small
 indexed fields and a binary stored field that varied in size.  Writing to
 external files, via the FSFieldCodec, was 3-20 times faster than writing to
 the index in the normal way (using MMapDirectory).  Reading was sometimes
 faster and sometimes slower. I also measured time for a forceMerge(1) at
 the end of each test: this was almost always nearly zero when binaries were
 external, and grew larger with more data in the normal case.  I believe the
 improvements we're seeing here result largely from removing the bulk of the
 data from the merge I/O path.

 As with any performance measurements, a lot of factors can affect the
 measurements, but this effect seems pretty robust across the conditions I
 measured (different file sizes, numbers of files, and frequency of commits,
 with lots of repetition).  One oddity is a large difference between Mac SSD
 filesystem (15-20x writing, reading 0.6x)  via FSFieldCodec) and Linux ext4
 HD filesystem (3-4x writing, 1.5x reading).

 The codec works as a wrapper around another codec (like the compressing
 codecs), intercepting binary and string stored fields larger than a
 configurable threshold, and storing a file number as a reference in the
 main index which then functions kind of like a symlink.  The codec
 intercepts merges in order to clean up files that are no longer referenced,
 taking special care to preserve the ability of the other codecs to perform
 bulk merges.  The codec passes all the Lucene unit tests in the o.a.l.index
 package.

 The implementation is still very experimental: there are lots of details
 to be worked out: for example, I haven't yet measured the performance
 impact of deletions, which 

Re: Huge FacetArrays while using SortedSetDocValuesAccumulator

2013-08-28 Thread Shai Erera
Oops you're right, it was committed in LUCENE-4985 which will be released
in Lucene 4.5.

Shai


On Wed, Aug 28, 2013 at 6:16 PM, Krishnamurthy, Kannan 
kannan.krishnamur...@contractor.cengage.com wrote:

 Thanks for the response. I double checked that
 SortedSetDocValuesAccumulator doesn't take a FacetArray in its ctor
 currently in 4.3.0 and 4.4. But FacetAccumulator does take FacetArray in
 its ctor. Am I missing something here? We have a high traffic application
 currently doing about 250 searches and facet request per second. We haven't
 performance tested our facet implementation yet to see if object allocation
 is a problem.

 Thanks,
 +Kannan.



 Hi

 SortedSetDocValuesAccumulator does receive FacetArrays in its ctor, so you
 can pass ReusingFacetArrays. You will need to call FacetArrays.free() when
 you're done with accumulation though. However, do notice that
 ReusingFacetArrays did not show any big gain even with large taxonomies --
 that is that the overhead of allocating and freeing them wasn't noticeable.

 If you expect to use very large taxonomies, then facet partitions can help.
 But for that you need to use the sidecar taxonomy index.

 Shai


 On Mon, Aug 26, 2013 at 11:45 PM, Krishnamurthy, Kannan 
 kannan.krishnamur...@contractor.cengage.com wrote:

  Hello,
 
  We are working with large lucene 4.3.0 index and using
  SortedSetDocValuesFacetFields for creating facets and
  SortedSetDocValuesAccumulator for facet accumulation. We couldn't use a
  taxonomy based facet implementation (We use MultiReader for searching and
  our indices is composed of multiple physical lucene indices, hence we
  cannot have a single taxonomy index). We have two million categories and
  expect to have another two million in the near future. As the current
  implementation of SortedSetDocValuesAccumulator does not support
  ReusingFacetArrays, we are concerned with potential garbage collector
  related performance issues in our high traffic application. Will future
  Lucene release support using ReusingFacetArrays in
  SortedSetDocValuesAccumulator ?
 
  Also as an alternative we are considering subclassing FacetIndexingParams
  and provide dimension specific CategoryListParams during indexing time.
  This will help to reduce the size of the FacetArray per facet request. We
  realize this approach will not support multiple FacetRequest in a single
  SortedSetDocValuesAccumulator, as SortedSetDocValuesReaderState hardcodes
  the category to null while calling
  FacetIndexingParams.getCategoryListParams(null) in its constructor.
 
  Are there better approaches to this problem ?
 
 
  Thanks in advance for any help.
 
  Kannan
  Cengage Learning
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 





Re: Huge FacetArrays while using SortedSetDocValuesAccumulator

2013-08-27 Thread Shai Erera
Hi

SortedSetDocValuesAccumulator does receive FacetArrays in its ctor, so you
can pass ReusingFacetArrays. You will need to call FacetArrays.free() when
you're done with accumulation though. However, do notice that
ReusingFacetArrays did not show any big gain even with large taxonomies --
that is that the overhead of allocating and freeing them wasn't noticeable.

If you expect to use very large taxonomies, then facet partitions can help.
But for that you need to use the sidecar taxonomy index.

Shai


On Mon, Aug 26, 2013 at 11:45 PM, Krishnamurthy, Kannan 
kannan.krishnamur...@contractor.cengage.com wrote:

 Hello,

 We are working with large lucene 4.3.0 index and using
 SortedSetDocValuesFacetFields for creating facets and
 SortedSetDocValuesAccumulator for facet accumulation. We couldn't use a
 taxonomy based facet implementation (We use MultiReader for searching and
 our indices is composed of multiple physical lucene indices, hence we
 cannot have a single taxonomy index). We have two million categories and
 expect to have another two million in the near future. As the current
 implementation of SortedSetDocValuesAccumulator does not support
 ReusingFacetArrays, we are concerned with potential garbage collector
 related performance issues in our high traffic application. Will future
 Lucene release support using ReusingFacetArrays in
 SortedSetDocValuesAccumulator ?

 Also as an alternative we are considering subclassing FacetIndexingParams
 and provide dimension specific CategoryListParams during indexing time.
 This will help to reduce the size of the FacetArray per facet request. We
 realize this approach will not support multiple FacetRequest in a single
 SortedSetDocValuesAccumulator, as SortedSetDocValuesReaderState hardcodes
 the category to null while calling
 FacetIndexingParams.getCategoryListParams(null) in its constructor.

 Are there better approaches to this problem ?


 Thanks in advance for any help.

 Kannan
 Cengage Learning
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Shai Erera
Rob, when DiskDV becomes the default DVFormat, would it not make sense to
load the values into the cache if someone uses FieldCache API? Vs. if
someone calls DV API directly, he uses whatever is the default Codec, or
the one that he plugs.

That's what I would expect from a 'cache'. So it's ok that currently all
FieldCache does is delegate the call to DV API, but perhaps we'd want to
change that so that in the DiskDV case, it actually caches things?

Or, you'd like to keep FieldCache API for sort of back-compat with existing
features, and let the app control the caching by using an explicit
RamDVFormat?

Shai


On Mon, Aug 12, 2013 at 7:07 PM, Ross Woolf r...@rosswoolf.com wrote:

 Yes, I will open an issue.


 On Mon, Aug 12, 2013 at 10:02 AM, Robert Muir rcm...@gmail.com wrote:

  On Mon, Aug 12, 2013 at 8:48 AM, Ross Woolf r...@rosswoolf.com wrote:
   Okay, just for clarity sake, what you are saying is that if I make the
   FieldCache call it won't actually create and impose the loading time of
  the
   FieldCache, but rather just use the NumericDocValuesField instead.  Is
  this
   correct?
 
  Yes, exactly. its a little confusing, but a tradeoff to make docvalues
  work transparently with lots of existing code built off of fieldcache
  (sorting/grouping/joins/faceting/...) without having to have 2
  separate implementations of what is the same thing. so its like
  docvalues is a fieldcache you already built at index-time.
 
  
   Also, my similarity was extending SimilarityBase, and I can't see how
 to
   get a docId as it is not passed in the score method score(BasicStats
   stats, float freq, float docLen).  Will I need to extend using
  Similarity
   instead of SimilarityBase, or is there a way to get the docId using
   SimilarityBase?
 
  Maybe we should just add a 'int doc' parameter to the
  SimilarityBase.score() method? Do you want to open a JIRA issue for
  this?
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 



Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Shai Erera
ok that makes sense.

Shai


On Mon, Aug 12, 2013 at 9:18 PM, Robert Muir rcm...@gmail.com wrote:

 On Mon, Aug 12, 2013 at 11:06 AM, Shai Erera ser...@gmail.com wrote:
 
  Or, you'd like to keep FieldCache API for sort of back-compat with
 existing
  features, and let the app control the caching by using an explicit
  RamDVFormat?
 

 Yes. In the future ideally fieldcache goes away and is a
 UninvertingFilterReader or something like that, that exposes DV apis.

 so then things can just use the DV apis... but to get things started
 we did it this way in the interim.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Lucene 4 - Faceted Search with Sorting

2013-08-02 Thread Shai Erera
Hi

Basically, every IndexSearcher.search() variant has a matching Collector.
They are there for easier usage by users.
TopFieldCollector.create() takes searchAfter (TopFieldDoc), so you can use
it in conjunction with FacetsCollector as I've outlined before.
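A rough 4.4-style sketch (lastFieldDoc is the FieldDoc you kept from the previous
page; the sort field and page size are just examples):

Sort sort = new Sort(new SortField("NAME", SortField.Type.STRING));
TopFieldCollector tfc = TopFieldCollector.create(
    sort, 10, lastFieldDoc /* after */, true, false, false, false);
FacetsCollector fc = FacetsCollector.create(searchParams, indexReader, taxonomyReader);
searcher.search(query, MultiCollector.wrap(tfc, fc));
TopDocs nextPage = tfc.topDocs();
List<FacetResult> facets = fc.getFacetResults(); // or reuse the cached results from page 1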

In general you're right that for pagination you don't need to collect
facets again. I would cache the ListFacetResult though, and not the
FacetsCollector.
Maybe even cache the way it's output, e.g. the String that you send back.
But, note that such caching means the server becomes stateful, which
usually complicates matters for apps.
Whether it's a problem or not for your app, you'll be the judge, I just
wanted to point that out.

Shai


On Fri, Aug 2, 2013 at 9:35 AM, Sanket Paranjape 
sanket.paranjape.mailingl...@gmail.com wrote:

 Hi Shai,

 Thanks for helping out.

 It worked. :)

 I also want to add a pagination feature. This can be done via the searchAfter
 method in IndexSearcher. But this does not take a Collector (and I want facets
 from this).

 I think this has been done intentionally, because the facets would remain the same
 while paginating/sorting.

 So my approach to this would be following.

 1. On the first search, call the code below to get the first set of results along
with the facets.
 2. Save the *last* ScoreDoc somewhere in the session so that it can be used
for pagination. Also save the facetsCollector so it can be used later, on a
pagination request, to show the facets.
 3. On subsequent pagination requests, use the IndexSearcher.searchAfter
method to get the next set of results, using the ScoreDoc from the session.
 4. If the user wants to narrow down on facets, then follow steps 1 to 3
using the drill-down feature.

 Am I correct?

 On 01-08-2013 11:33 PM, Shai Erera wrote:

 Hi

 You should do the following:

 TopFieldCollector tfc = TopFieldCollector.create();
 FacetsCollector fc = FacetsCollector.create();
 searcher.search(query, MultiCollector.wrap(tfc, fc));

 Basically IndexSearcher.search(..., Sort) creates TopFieldCollector
 internally, so you need to create it outside and wrap both collectors with
 MultiCollector.

 BTW, you don't need to do new CategoryPath(CATEGORY_PATH, '/'), when the
 category does not contain the delimiter.
 You can use the vararg constructor which takes the path components
 directly, if you have them already.

 Shai


 On Thu, Aug 1, 2013 at 7:46 PM, Sanket Paranjape 
 sanket.paranjape.mailingl...@gmail.com
 wrote:

  I am trying to move from lucene 2.4 to 4.4. I had used bobo-browse for
 faceting in 2.4.

 I used below code (from Lucene examples) to query documents and get
 facets.

  List<FacetRequest> categories = new ArrayList<FacetRequest>();
  categories.add(new CountFacetRequest(new CategoryPath(CATEGORY_PATH, '/'), 10));
  FacetSearchParams searchParams = new FacetSearchParams(categories);
  TopScoreDocCollector topScoreDocCollector = TopScoreDocCollector.create(200, true);
  FacetsCollector facetsCollector = FacetsCollector.create(searchParams, indexReader, taxonomyReader);
  indexSearcher.search(new MatchAllDocsQuery(),
      MultiCollector.wrap(topScoreDocCollector, facetsCollector));

 Above code gives me results along with facets.

 Now I want to add a Sort field on document, say I want to sort by name. I
 can achieve this using following

  Sort sort = new Sort(new SortField(NAME, Type.STRING));
  TopFieldDocs docs = indexSearcher.search(new MatchAllDocsQuery(),
 100,
 sort);

 Now, how do I achieve sorting along with faceting, because there is no
 method in IndexSearcher which has Collector and Sort.


 I have asked this question on stackoverflow as well:
 http://stackoverflow.com/questions/17992183/lucene-4-faceted-search-with-sorting


 Please Help !!





Re: IndexUpgrade - Any ways to speed up?

2013-08-02 Thread Shai Erera
Hi

You cannot just update headers -- the file formats have changed. Therefore
you need to rewrite the index entirely, at least from 2.3.1 to 3.6.2 (for
4.1 to be able to read it).
If your index is already optimized, then IndexUpgrader is your best option.
The reason it calls forceMerge(1) is that it needs to guarantee *every*
segment in your index gets rewritten.
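The two-step upgrade itself boils down to something like this (a sketch; each step
must run against the matching Lucene jars, and the path is a placeholder):

// step 1, with the 3.6.2 jars: rewrites every 2.x segment in the 3.6 format
new IndexUpgrader(FSDirectory.open(new File("/path/to/index")), Version.LUCENE_36).upgrade();

// step 2, with the 4.1 jars: rewrites the 3.6 index in the 4.x format
new IndexUpgrader(FSDirectory.open(new File("/path/to/index")), Version.LUCENE_41).upgrade();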

BTW, you might want to upgrade to 4.4 already.

Shai


On Fri, Aug 2, 2013 at 2:49 PM, Ramprakash Ramamoorthy 
youngestachie...@gmail.com wrote:

 Team,

 We are migrating from lucene version 2.3.1 to 4.1. We are migrating
 the indices as well, and we do this in two steps 2.3.1 to 3.6.2 and 3.6.2
 to 4. We just call IndexUpgrader.upgrade(), using the
 IndexUpgraderMergePolicy. I see that, the upgrade() method actually calls a
 forcemerge(1) over the indices.

 However, we have all our indices optimized and there are no deletes
 as well. This forcemerge(1) seems a very costly operation and since our
 index is already optimized, there is no space benefit as well. Is there a
 faster way to upgrade our indices (like reading the indices and modifying
 the headers, something of that sort)? We are not expecting any compaction
 during the process.

  Currently it takes 4 minutes for a GB of index to get migrated to
 4.1 from 2.3.1. Any pointers would be appreciated. Thanks in advance.


 --
 With Thanks and Regards,
 Ramprakash Ramamoorthy,
 Chennai, India.



Re: IndexUpgrade - Any ways to speed up?

2013-08-02 Thread Shai Erera
Unfortunately you cannot upgrade directly from 2.3.1 to 4.1.

You can consider upgrading to 3.6.2 and stop there. Lucene 4.1 can read 3.x
indexes, and when segments are merged, they are upgraded automatically
to the newest file format.
However, if this single segment is too big, such that it won't be picked
for merges, you will need to upgrade it anyway when one day you will
upgrade to Lucene 5.0.
So I'd say, if you're not stressed with time, upgrade to 4.1 now ... it's a
one time process.

Shai


On Fri, Aug 2, 2013 at 3:22 PM, Ramprakash Ramamoorthy 
youngestachie...@gmail.com wrote:

 Thank you Shai for the quick response. Have responded inline.


 On Fri, Aug 2, 2013 at 5:37 PM, Shai Erera ser...@gmail.com wrote:

  Hi
 
  You cannot just update headers -- the file formats have changed.
 Therefore
  you need to rewrite the index entirely, at least from 2.3.1 to 3.6.2 (for
  4.1 to be able to read it).
 
 Yeah, as of now, we call IndexUpgrader of 3.6.2 and then IndexUpgrader of
 4.0, and then the indices become readable by 4.1

  If your index is already optimized, then IndexUpgrader is your best
 option.
  The reason it calls forceMerge(1) is that it needs to guarantee *every*
  segment in your index gets rewritten.
 
 Understood. Looks like we will have to stick to what we have written as on
 date.

 
  BTW, you might want to upgrade to 4.4 already.
 
 Yeah, we upgraded the code base when 4.1 was the most recent version, now
 that we are looking forward to migrate the older indices to be compatible.
 Thanks again.

 
  Shai
 
 
  On Fri, Aug 2, 2013 at 2:49 PM, Ramprakash Ramamoorthy 
  youngestachie...@gmail.com wrote:
 
   Team,
  
   We are migrating from lucene version 2.3.1 to 4.1. We are
  migrating
   the indices as well, and we do this in two steps 2.3.1 to 3.6.2 and
 3.6.2
   to 4. We just call IndexUpgrader.upgrade(), using the
   IndexUpgraderMergePolicy. I see that, the upgrade() method actually
  calls a
   forcemerge(1) over the indices.
  
   However, we have all our indices optimized and there are no
  deletes
   as well. This forcemerge(1) seems a very costly operation and since our
   index is already optimized, there is no space benefit as well. Is
 there a
   faster way to upgrade our indices (like reading the indices and
 modifying
   the headers, something of that sort)? We are not expecting any
 compaction
   during the process.
  
Currently it takes 4 minutes for a GB of index to get migrated
  to
   4.1 from 2.3.1. Any pointers would be appreciated. Thanks in advance.
  
  
   --
   With Thanks and Regards,
   Ramprakash Ramamoorthy,
   Chennai, India.
  
 



 --
 With Thanks and Regards,
 Ramprakash Ramamoorthy,
 Chennai, India.


