Re: Facet DrillDown Exclusion
Hey Matt,

You basically don't need to use DrillDownQuery (DDQ) in that case. You can construct a BooleanQuery with a MUST_NOT clause to filter out the facet path. Here's a short code snippet:

    String indexedField = config.getDimConfig("Author").indexFieldName; // find the index field of the "Author" facet
    Query q = new BooleanQuery.Builder()
        .add(new MatchAllDocsQuery(), Occur.MUST) // here you would usually use a different query
        .add(new TermQuery(DrillDownQuery.term(indexedField, "Author", "Lisa")), Occur.MUST_NOT) // do not match documents with "Author/Lisa" in their facets
        .build();
    searcher.search(q, 10);

Hope this helps,
Shai

On Tue, Dec 6, 2016 at 1:55 AM Matt Hicks wrote:
> I'm currently drilling down adding a facet path, but I'd like to be able to
> do the same as a NOT query. Is there any way to do an exclusion drill down
> on a facet to exclude docs that match the facet while including all others?
>
> Thanks
Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?
This feature is not available in Lucene currently, but it shouldn't be hard to add. See Mike's comment here: http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html?showComment=1412777154420#c363162440067733144

One more tricky (yet nicer) feature would be to have it all in one go, i.e. you'd say something like "facet on field price" and you'd get "interesting" buckets, per the variance in the results. But before that, we could have a StatsFacets in Lucene which provides some statistics about a numeric field (min/max/avg etc.).

On Wed, Nov 30, 2016 at 7:50 AM Chitra R wrote:
> Thank you so much, Mike... I gained a lot of knowledge about doc values faceting, and all my doubts are clarified. Thanks..!!
>
> *Another use case:*
>
> After getting the matching documents for a given query, is there any way to calculate the min and max values of a NumericDocValuesField (say a date field)?
>
> I would like to implement numeric range faceting by splitting the numeric values (taken from the resulting documents) into ranges.
>
> Chitra
>
> On Wed, Nov 30, 2016 at 3:51 AM, Michael McCandless < luc...@mikemccandless.com> wrote:
> > Doc values fields are never loaded into memory; at most some small index structures are.
> >
> > When you use those fields, the bytes (for just the one doc values field you are using) are pulled from disk, and the OS will cache them in memory if available.
> >
> > Mike McCandless
> > http://blog.mikemccandless.com
> >
> > On Mon, Nov 28, 2016 at 6:01 AM, Chitra R wrote:
> > > Hi,
> > > When opening a SortedSetDocValuesReaderState at search time, are the whole doc values files (.dvd & .dvm) loaded into memory, or is only the information of the specified field (say the $facets field) loaded?
> > >
> > > Any help is much appreciated.
> > >
> > > Regards,
> > > Chitra
> > >
> > > On Tue, Nov 22, 2016 at 5:47 PM, Chitra R wrote:
> > >> Kindly post your suggestions.
> > >>
> > >> Regards,
> > >> Chitra
> > >>
> > >> On Sat, Nov 19, 2016 at 1:38 PM, Chitra R wrote:
> > >>> Hey, I got it clearly. Thank you so much. Could you please help us to implement it in our use case?
> > >>>
> > >>> In our case, we have a dynamic index and it is of variable depth too, so flat facets are enough; no need for hierarchical facets.
> > >>>
> > >>> What I think is:
> > >>>
> > >>> Index my facet field as a normal doc values field, so that no special operation (like the taxonomy and sorted set doc values facet fields) is done at index time, and the doc values field alone stores its ordinals in the respective field.
> > >>> At search time, I will pass the query (the user's search query) and filter (the path-traversed list) and collect the matching documents in a FacetsCollector.
> > >>> To compute the facet count for the specific field, I will gather those resulting docs, then move through each segment, collecting the matching ordinals using an AtomicReader.
> > >>>
> > >>> I know that this way I can't calculate facet counts for more than one field (facet) in a search.
> > >>> Instead of loading all the dimensions in the DocValuesReaderState (which will take more time and memory) at search time, loading specific fields will take less time and memory, I hope. Kindly help to solve this.
> > >>>
> > >>> It will do it at a minimal index and search cost, I think. And I hope this won't add overhead at index time; at search time this should be better too.
> > >>>
> > >>> Kindly post your suggestions.
> > >>>
> > >>> Regards,
> > >>> Chitra
> > >>>
> > >>> On Fri, Nov 18, 2016 at 7:15 PM, Michael McCandless wrote:
> > >>>> I think you've summed up exactly the differences!
> > >>>>
> > >>>> And, yes, it would be possible to emulate hierarchical facets on top of flat facets, if the hierarchy is fixed depth like year/month/day.
> > >>>>
> > >>>> But if it's variable depth, it's trickier (but I think still possible). See e.g. the Committed Paths drill-down on the left, on our dog-food server http://jirasearch.mikemccandless.com/search.py?index=jira
> > >>>>
> > >>>> Mike McCandless
> > >>>>
> > >>>> http://blog.mikemccandless.com
> > >>>>
> > >>>> On Fri, Nov 18, 2016 at 1:43 AM, Chitra R wrote:
> > >>>> > case 1:
> > >>>> > In taxonomy, for each indexed document, it examines the facet label, computes their
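For the min/max use case Chitra asks about above, a minimal sketch against the pre-7.0 random-access DocValues API (the field name "date" is an assumption for illustration); it walks the matching documents collected by a FacetsCollector and reads each document's value directly:

    static long[] minMax(IndexSearcher searcher, Query query) throws IOException {
      FacetsCollector fc = new FacetsCollector();
      searcher.search(query, fc);
      long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
      for (FacetsCollector.MatchingDocs hits : fc.getMatchingDocs()) {
        // per-segment doc values; null if this segment has no "date" values
        NumericDocValues dates = hits.context.reader().getNumericDocValues("date");
        if (dates == null) continue;
        DocIdSetIterator it = hits.bits.iterator();
        if (it == null) continue; // empty segment
        for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
          long v = dates.get(doc); // random-access lookup (pre-7.0 API)
          min = Math.min(min, v);
          max = Math.max(max, v);
        }
      }
      return new long[] { min, max };
    }

The resulting [min..max] interval can then be split into LongRanges and handed to LongRangeFacetCounts, as discussed.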
Re: Lucene 6.3 faceting documentation
We removed the userguide a long time ago. We have a set of example files under lucene-demo, e.g. here: https://lucene.apache.org/core/6_3_0/demo/src-html/org/apache/lucene/demo/facet/

Also, you can read some blog posts, starting here: http://shaierera.blogspot.com/2012/11/lucene-facets-part-1.html and then http://shaierera.blogspot.com/2012/11/lucene-facets-part-2.html, though the code examples there may be outdated. The lucene-demo source is up to date though.

Shai

On Thu, Nov 10, 2016 at 4:40 PM Glen Newton wrote:
> I am looking for documentation on Lucene faceting. The most recent
> documentation I can find is for 4.0.0 here:
>
> http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html
>
> Is there more recent documentation for 6.3.0? Or 6.x?
>
> Thanks,
> Glen
Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?
Hi,

The reason IMO is historic - ES and Solr had faceting solutions before Lucene had one. There were discussions in the past about using the Lucene faceting module in Solr (can't tell for ES) but, sadly, I can't say I see it happening at this point.

Regarding your other question, IMO the Lucene faceting engine, in terms of performance and customizability, is on par with Solr/ES. However, it lacks distributed faceting support and aggregations. Since many people use Solr/ES and not Lucene directly, the Solr/ES faceting modules continue to advance separately from the Lucene one.

Enhancing Lucene facets with aggregations and even distributed faceting capabilities is mostly a matter of time and priorities. If you're interested in it, I'd be willing to collaborate with you on that as much as I can! And I'd still hope that this work finds its way into Solr/ES, as I think it's silly to have that many faceting implementations, when they all rely on the same low-level data structures - Lucene!

Shai

On Thu, Nov 10, 2016 at 12:32 PM Kumaran Ramasubramanian wrote:
> Hi All,
> We all know that Lucene supports faceting by providing
> Taxonomy (a separate index and hierarchical facets) and
> SortedSetDocValuesFacetField (flat facets and no sidecar index).
>
> Then why did Solr and Elasticsearch go for their own implementations?
> (That is, Solr uses block join & Elasticsearch uses aggregations.) Are
> there any limitations in Lucene's implementation?
>
> --
> Kumaran R
Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization
*> However, that should not lead to NSFE. At worst it should lead to
> "ordinal is not known" (maybe as an AIOOBE) from the taxonomy reader.*

That is correct. This interleaving indexing case can potentially result in an AIOOBE-like exception during faceted search, when the facets that are in the "sneaked-in" docs are found by a search, but resolving the ordinals to their labels fails because the labels are unknown to the taxonomy.

I wonder if committing in the opposite order solves this problem. So in the above use case, IW.commit() commits all the new docs with their facets; then if more indexing happens before TIW.commit(), the commit to the taxonomy index results in more facets than are known to the search index, but that's OK. I'm just not sure if that covers all concurrency cases though.

I remember this was discussed several times in the past, and we eventually reached a conclusion, but clearly if it was the latter, it wasn't clarified in the javadocs. I can't think of a use case that breaks this commit order though (IW.commit() followed by TIW.commit()). This feels safe to me ... can you try to think of a use case that breaks it? Assuming that each doc-indexing does addTaxo() followed by addDoc(). Maybe we should have a helper which takes an IW and TIW and exposes commit() APIs that will do it in the correct order?

Now I'm thinking about SearcherTaxoManager -- it reopens the readers by first re-opening the IR, then the TIR. It does so under the assumption of first committing to TIW, then to IW. Now if we reverse the order, you need to be more careful about when you commit changes to the two writers, and when you re-open the readers. If you always do that from the same thread, then you should be fine; the order of re-opens doesn't really matter. But if you re-open from a different thread than the one you commit from, I am not sure that committing to IW first and then TIW can play well with any re-open order. I.e. one case which breaks it: you commit to IW, then re-open both IR and TIR before you commit to TIW, and you have a search which may find ordinals that are unknown to the TIR.

So I'd say that if you refresh() from the same thread that you do commit(), then commit to IW first and then TIW, and use SearcherTaxoManager as it's currently implemented. But I'd like to hear your thoughts about it.

Shai

On Wed, Sep 28, 2016 at 1:26 PM Michael McCandless < luc...@mikemccandless.com> wrote:
> On Wed, Sep 28, 2016 at 3:05 AM, William Moss wrote:
> > Thank you both for your quick reply!
>
> You're welcome!
>
> > * We actually tried the upgrade to 6.0 a few months back (when that was the
> > newest) and were getting similar errors to the ones I'm seeing now. We were
> > not able to track them down, which is part of the motivation for me asking
> > all these questions. We'll get there though :-)
>
> OK, we gotta get to the root cause. Sounds like it happens in either
> version...
>
> > * The last time we tested this (which I think was still post
> > ConcurrentMergePolicy) we saw that the read speed would slowly degrade over
> > time. My understanding was that forceMerge was very expensive, but would
> > make reads faster once complete. Is this not correct?
>
> It really depends on what queries you are running. Really you should
> test in your use case and be certain that the massive expense of force
> merge is worthwhile / necessary. In general it's not worth it, even
> if searches are a bit faster, except for indices that will never
> change again.
> > Also, we never
> > attempted to tune the MergePolicy at all, so while we're on the subject, is
> > there good documentation on how to do that? I'd much prefer to get away
> > from calling forceMerge. If it's useful information, we've got a relatively
> > small corpus, only ~2+M documents.
>
> Just use the defaults :) Tuning those settings is dangerous unless
> you have a very specific problem to fix.
>
> > * We want to be able to ensure that if a machine or JVM crashes we are in a
> > coherent state. To that end, we need to call commit on Lucene and then
> > commit back what we've read so far to Kafka. Calling commit is the only way
> > to ensure this, right?
>
> Correct: commit in Lucene, then notify Kafka what offset you had
> indexed just before you called IW.commit.
>
> But you may want to replicate the index across machines if you don't
> want to have a single point of failure. We recently added
> near-real-time replication to Lucene for this use case ...
>
> > * To make sure I understand how maybeRefresh works, ignoring whether or not
> > we commit for a second: if I add a document via IndexWriter, it will not be
> > reflected in IndexSearchers I get by calling acquire on SearcherAndTaxonomy
> > until I call maybeRefresh?
>
> Correct.
>
> > Now, on to the concurrency issue. I was thinking a little more about this
> > and I think the fundamental issue is that while IndexWriter and
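The helper floated above might look like the following purely hypothetical sketch (it is not a Lucene class); it hard-codes the IW-then-TIW order discussed in this thread, so swap the two calls if you conclude the taxonomy-first order fits your reopen strategy better:

    import java.io.IOException;
    import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
    import org.apache.lucene.index.IndexWriter;

    // Hypothetical helper: serializes the two commits in one fixed order so
    // that concurrent indexing cannot interleave between them.
    class FacetCommitHelper {
      private final IndexWriter indexWriter;
      private final DirectoryTaxonomyWriter taxoWriter;

      FacetCommitHelper(IndexWriter iw, DirectoryTaxonomyWriter tw) {
        this.indexWriter = iw;
        this.taxoWriter = tw;
      }

      synchronized void commit() throws IOException {
        indexWriter.commit(); // the order suggested above: IW first ...
        taxoWriter.commit();  // ... then the taxonomy writer
      }
    }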
Re: Clarification on LUCENE 4795 discussions ( Add FacetsCollector based on SortedSetDocValues )
Hey,

Here's a blog I wrote a couple of years ago about using facet associations: http://shaierera.blogspot.com/2013/01/facet-associations.html. Note that the examples in the blog were written against a very old Lucene version (4.7 maybe).

We have a couple of demo files that are maintained with the code changes here: https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=tree;f=lucene/demo/src/java/org/apache/lucene/demo/facet;h=41085e3aaa1d4d0697a5ef5d9853a093c1600ca6;hb=HEAD. Check them out, especially this one: https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=blob;f=lucene/demo/src/java/org/apache/lucene/demo/facet/AssociationsFacetsExample.java;h=3e2737d0c8f02d12e4fdb76f97891c8593ef5fbc;hb=HEAD

Hope this helps!

Shai

On Tue, Sep 27, 2016 at 7:20 AM Kumaran Ramasubramanian wrote:
> Hi Mike,
>
> Thanks for the clarification. Is there any example of the difference between using flat vs
> hierarchical facets? Any demo or sample page?
>
> In a previous thread yesterday (Faceting: Taxonomy index Vs
> SortedSetDocValues), there is a point like
>
> "tried to achieve multilevel (hierarchical) categorization using
> SortedSetDocValues and got it simply by changing the query and opening the
> IndexReader for each level of query using SortedSetDocValuesReaderState."
>
> Is it possible easily?
>
> -
> Kumaran R
>
> On Sep 27, 2016 9:38 AM, "Michael McCandless" wrote:
> >
> > Weighted facets is the ability to associate a float value with each
> > facet label you index, and at search time to aggregate those floats.
> > See e.g. FloatAssociationFacetField.
> >
> > "other features" refers to hierarchical facets, which
> > SortedSetDocValuesFacetField does not support (just flat facets),
> > though this is possible to fix, I think (patches welcome!).
> >
> > Mike McCandless
> > http://blog.mikemccandless.com
> >
> > On Mon, Sep 26, 2016 at 5:24 PM, Kumaran Ramasubramanian wrote:
> > > Hi All,
> > >
> > > I want to know the list of features which can be used by applications
> > > using the facet module of Lucene.
> > >
> > > https://issues.apache.org/jira/browse/LUCENE-4795?focusedCommentId=13599687
> > >
> > > I ask because it seems that the only thing that we get from this SortedSet
> > >> approach is not having to maintain a sidecar index (which for some reason
> > >> freaks everybody), and we even lose performance. Plus, I don't see how we
> > >> can support other facet features with it.
> > >
> > > on the other hand SortedSet doesn't have these problems. maybe it doesn't
> > >> support weighted facets or other features, but it's a nice option. I
> > >> personally don't think it's the end of the world if Mike's patch doesn't
> > >> support all the features of the faceting module initially or even ever.
> > >
> > > What is meant by weighted facets? What are other facet features?
> > >
> > > --
> > > Kumaran R
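For reference, the gist of that demo as a minimal sketch (Lucene 4.7+ classes from org.apache.lucene.facet; the "tags" dimension and the "$facets.int" field name follow the demo's conventions):

    FacetsConfig config = new FacetsConfig();
    config.setMultiValued("tags", true);
    config.setIndexFieldName("tags", "$facets.int"); // associations live in their own field

    Document doc = new Document();
    doc.add(new IntAssociationFacetField(3, "tags", "lucene")); // weight 3 for "tags/lucene"
    writer.addDocument(config.build(taxoWriter, doc));

    // search time: sum the int associations per facet label
    FacetsCollector fc = new FacetsCollector();
    searcher.search(new MatchAllDocsQuery(), fc);
    Facets tags = new TaxonomyFacetSumIntAssociations("$facets.int", taxoReader, config, fc);
    System.out.println(tags.getTopChildren(10, "tags"));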
Re: Lucene Facets performance problems (version 4.7.2)
True, but Erick's questions are still valid :-). We need more info to answer these questions. So Simona, the more info you can give us, the better we'll be able to answer.

On Fri, Feb 26, 2016, 10:54 Uwe Schindler wrote:
> Hi Erick,
>
> this was a question about Lucene so "=true" won't help. It also
> talks about *Lucene's* faceting, not Solr's.
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Friday, February 26, 2016 8:22 AM
> > To: java-user
> > Subject: Re: Lucene Facets performance problems (version 4.7.2)
> >
> > You haven't given us much to go on. What is the cardinality of the fields
> > you're faceting on? What does your query look like? How are you measuring
> > time? What is the output if you add =true?
> >
> > In short, your question is far too vague to give any meaningful
> > information; there could be any of a dozen recommendations.
> >
> > Best
> > Erick
> >
> > On Feb 26, 2016 18:01, "Simona Russo" wrote:
> > > Hi all,
> > >
> > > we use the Lucene *Facet* library version *4.7.2*.
> > >
> > > We have an *index* with *45 million* documents (size about 15 GB) and
> > > a *taxonomy* index with *57 million* documents (size about 2 GB).
> > >
> > > The total *facet search* time reaches *15 seconds*!
> > >
> > > Is it possible to improve this time? Are there any tips to *configure* the
> > > *taxonomy* index to avoid this waste of time?
> > >
> > > Thanks in advance
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: how to backup lucene index file
You should use Lucene's replicator module, which helps you take backups from live snapshots of your index, even while indexing happens. You can read about how to use it here: http://shaierera.blogspot.co.il/2013/05/the-replicator.html

Shai

On Wed, Jan 13, 2016, 19:14 Erick Erickson wrote:
> Just copy the index directory, it's self contained. I'd
> make sure I wasn't actively indexing to it and
> I'd committed all my indexing first, but that's all.
>
> On Wed, Jan 13, 2016 at 8:33 AM, 鞠朕 wrote:
> > Hi, I am using Lucene to build a full-text search system. I put the
> > index files in a directory on my server. For robustness, I think I should
> > back up the index files somewhere else, so that if the index gets broken
> > I can switch to the backup one. Can you tell me how to do this, which
> > API to use, and give me a simple demo? Thanks, From juzhen
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
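If upgrading for the replicator module isn't possible, the mechanism it builds on, SnapshotDeletionPolicy, can be used directly; a minimal sketch using the 4.4+ API shape (directory and analyzer variables are assumed; 4.x IndexWriterConfig also takes a Version argument):

    SnapshotDeletionPolicy snapshotter =
        new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
    IndexWriterConfig conf = new IndexWriterConfig(analyzer)
        .setIndexDeletionPolicy(snapshotter);
    IndexWriter writer = new IndexWriter(dir, conf);

    // later, even while indexing continues:
    IndexCommit commit = snapshotter.snapshot(); // pins this commit's files
    try {
      for (String fileName : commit.getFileNames()) {
        // copy fileName from the index directory to the backup location
      }
    } finally {
      snapshotter.release(commit); // unpin so the files may be deleted again
    }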
Re: SOLR/LUCENE 5.2.1: Solution of CharTermAtt, StartOffset, EndOffset, Position
I think you can just write a TokenFilter which sets the PositionIncrementAttribute of every other token to 0. Then you can use StandardTokenizer and wrap it with that filter.

Shai

On Aug 8, 2015 6:33 AM, Văn Châu vankimc...@gmail.com wrote:
> Hi,
>
> I'm looking for a solution for the following in Solr/Lucene version 5.2.1.
>
> Example text: "fast wi fi network is down". Using
> solr.StandardTokenizerFactory, I get the positions: fast (1) - wi (2) -
> fi (3) - network (4) - is (5) - down (6).
>
> But I need to create a custom tokenizer or filter so that the text
> "fast wi fi network is down" is analyzed with positions like:
> fast (1) - fi (2) - is (3), or wi (1) - network (2) - down (3).
>
> I know it involves startOffset, endOffset ... but I can not figure out
> how to solve it. Thanks in advance!
>
> ---
> VĂN KIM CHÂU
> [P]: +84.933.233.047
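A minimal sketch of the filter described above: it zeroes the position increment of every second token so that adjacent tokens share a position (whether pairs should start on the first or the second token is an application decision):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    final class EveryOtherZeroPosIncFilter extends TokenFilter {
      private final PositionIncrementAttribute posIncAtt =
          addAttribute(PositionIncrementAttribute.class);
      private boolean zeroNext = false;

      EveryOtherZeroPosIncFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        if (zeroNext) {
          posIncAtt.setPositionIncrement(0); // stack onto the previous token
        }
        zeroNext = !zeroNext;
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        zeroNext = false;
      }
    }

It would be wrapped around StandardTokenizer inside a custom Analyzer's createComponents().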
Re: How to merge several Taxonomy indexes
In some cases, MMapDirectory offers even better performance, since the JVM doesn't need to manage that RAM when it's doing GC. Also, using only a RAMDirectory is not safe in that if the JVM crashes, your index is lost.

On Thu, Apr 2, 2015 at 12:54 PM, Christoph Kaser lucene_l...@iconparc.de wrote:
> Hi Gimantha,
>
> why do you use a RAMDirectory? If your merged index fits into RAM completely, an MMapDirectory should offer almost the same performance. And if not, it is definitely the better choice.
>
> Regards,
> Christoph
>
> Am 02.04.2015 um 12:38 schrieb Gimantha Bandara:
> > Hi All,
> > I have successfully set up merged indices, and drilldown and the usual search operations work perfectly. But I have a side question. If I select a RAMDirectory as the destination indices in merging, the JVM can probably go out of memory if the merged indices are too big. Is there a way I can handle this issue?
> >
> > On Tue, Mar 24, 2015 at 12:18 PM, Gimantha Bandara giman...@wso2.com wrote:
> >> Hi Christoph,
> >> My mistake. :) It does exactly what I need; figured it out later. Thanks a lot!
> >>
> >> On Tue, Mar 24, 2015 at 3:14 AM, Gimantha Bandara giman...@wso2.com wrote:
> >>> Hi Christoph,
> >>> I think TaxonomyMergeUtils is for merging a taxonomy directory and an index together (correct me if I am wrong). Can it be used to merge several taxonomy directories together and create one taxonomy index?
> >>>
> >>> On Mon, Mar 23, 2015 at 9:19 PM, Christoph Kaser lucene_l...@iconparc.de wrote:
> >>>> Hi Gimantha,
> >>>> have a look at the class org.apache.lucene.facet.taxonomy.TaxonomyMergeUtils, which does exactly what you need.
> >>>> Best regards,
> >>>> Christoph
> >>>>
> >>>> Am 23.03.2015 um 15:44 schrieb Gimantha Bandara:
> >>>>> Hi all,
> >>>>> Can anyone point me to how to merge several taxonomy indexes? My requirement is as follows. I have several taxonomy indexes and normal document indexes. I want to merge the taxonomy indexes together and the other document indexes together and perform search on them. One part I have figured out; it is easy: to merge document indexes, all I have to do is create a MultiReader and pass it to IndexSearcher. But I am stuck at merging the taxonomy indexes. Is there a way to merge taxonomy indexes?
>
> --
> Dipl.-Inf. Christoph Kaser, IconParc GmbH, Sophienstrasse 1, 80333 München, www.iconparc.de
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How to merge several Taxonomy indexes
MMapDirectory uses memory-mapped files. This is an operating-system-level feature: even though the file resides on disk, the OS can memory-map it and access it more efficiently. It is loaded into memory outside the JVM heap, and usually on a properly configured server you should not worry about running out of memory, since if the file cannot be brought into memory, it's accessed from disk.

You mentioned that you store the index in a DB, which is distributed. Have you considered using Solr for managing your distributed index? It might be better than storing it in a DB, merging taxonomies for search etc., and Solr has quite rich faceted search capabilities.

On Thu, Apr 2, 2015 at 1:51 PM, Gimantha Bandara giman...@wso2.com wrote:
> Btw I was using a RAMDirectory just for testing purposes..
>
> On Thu, Apr 2, 2015 at 5:16 PM, Gimantha Bandara giman...@wso2.com wrote:
> > Hi Christoph and Shai,
> > Thanks for the quick response! The indices are stored in a relational database (using a custom Directory implementation). The problem comes in because the indices are sharded (both the taxonomy indices and the normal doc indices): when a user wants to drill down, I have to merge all the indices. For that I used the merge utils (which work perfectly). For now I am using a RAMDirectory for the merged indices. Anyway, the indices can grow to a bigger size as time goes by. MMapDirectory again uses memory, right? Can it deal with a possible out-of-memory issue? I am thinking of using the same database to store the merged indices. But the problem is that the original sharded indices can be updated when new entries come in, so the merged final indices also need to be updated accordingly.
> >
> > On Thu, Apr 2, 2015 at 4:55 PM, Shai Erera ser...@gmail.com wrote:
> > > In some cases, MMapDirectory offers even better performance, since the JVM doesn't need to manage that RAM when it's doing GC. Also, using only a RAMDirectory is not safe in that if the JVM crashes, your index is lost.
> > >
> > > [snip]
Re: Sampled Hit counts using Lucene Facets.
OK, yes, then "sampling" isn't the right word. So what you would want to have is an API like "count facets in N buckets between a range of [min..max] values". That would create the ranges for you, and then you would be able to use the RangeFacetCounts as usual.

Would you like to open a JIRA issue and post a patch? I guess it can either be an additional constructor on LongRangeFacetCounts (and Double), or a separate utility class which, given min/max values and numBuckets, creates the proper Range[]?

Shai

On Tue, Mar 10, 2015 at 4:07 PM, Gimantha Bandara giman...@wso2.com wrote:
> Hi Shai,
> Yes, splitting ranges into smaller ranges is not the same as sampling; I used the wrong word there. I think RandomSamplingFacetsCollector is for sampling a larger dataset, and that class cannot be used to implement the example described above. I think I'll have to prepare the Ranges manually and pass them to LongRangeFacetCounts.
>
> On Tue, Mar 10, 2015 at 4:54 PM, Shai Erera ser...@gmail.com wrote:
> > I am not sure that splitting the ranges into smaller ranges is the same as sampling. Take a look at RandomSamplingFacetsCollector - it implements sampling by sampling the document space, not the facet values space. So if for instance you use a LongRangeFacetCounts in conjunction with a RandomSamplingFacetsCollector, you would get the matching documents space sampled, and the counts you would get for each range could be considered sampled too. This is at least how we implemented facet sampling.
> > Shai
> >
> > On Tue, Mar 10, 2015 at 10:21 AM, Gimantha Bandara giman...@wso2.com wrote:
> > > What I am planning to do is split the given time range into smaller time ranges myself and pass them to a LongRangeFacetCounts object and get the counts for each sub-range. Is this the correct way?
> > >
> > > On Tue, Mar 10, 2015 at 12:01 AM, Gimantha Bandara giman...@wso2.com wrote:
> > > > Any updates on this please? Do I have to write my own code to sample and get the hit count?
> > > >
> > > > On Sat, Mar 7, 2015 at 2:14 PM, Gimantha Bandara giman...@wso2.com wrote:
> > > > > Any help on this please?
> > > > >
> > > > > On Fri, Mar 6, 2015 at 3:13 PM, Gimantha Bandara giman...@wso2.com wrote:
> > > > > > Hi,
> > > > > > I am trying to create some APIs using the Lucene facets APIs. First I will explain my requirement with an example. Let's say I am keeping track of the count of people who enter through a certain door, and the time range I am interested in is the last 6 hours (to get the total count, I know that I'll have to use ranged facets). How do I sample this time range and get the counts of each sample? In other words, as an example, if I split the last 6 hours into 5-minute samples, I get 72 (6*60/5) different time ranges. I would be interested in getting hit counts for each of these 72 ranges in an array, with the respective lower bound of each sample. Can someone point me in the direction I should follow / the classes which would be helpful to look at? Elasticsearch already has this feature exposed by their JavaScript API. Is it possible to implement the same with Lucene? Is there a Facets user guide for Lucene 4.10.3 or 5.0.0?
> > > > > > Thanks,
>
> --
> Gimantha Bandara
> Software Engineer
> WSO2. Inc : http://wso2.com
> Mobile : +94714961919
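The utility suggested above might look like this hypothetical sketch; it splits [min..max] into numBuckets equal-width LongRanges (the label format is arbitrary):

    static LongRange[] buckets(long min, long max, int numBuckets) {
      LongRange[] ranges = new LongRange[numBuckets];
      long width = (max - min) / numBuckets;
      for (int i = 0; i < numBuckets; i++) {
        long lo = min + i * width;
        boolean last = (i == numBuckets - 1);
        long hi = last ? max : lo + width; // last bucket absorbs the rounding remainder
        // maxInclusive only on the last bucket, so boundary values aren't counted twice
        ranges[i] = new LongRange(String.valueOf(lo), lo, true, hi, last);
      }
      return ranges;
    }

Usage would be e.g. new LongRangeFacetCounts("timestamp", fc, buckets(min, max, 72)), where "timestamp" stands in for whatever NumericDocValuesField the application indexes.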
Re: Filtering question
I don't see that you use acceptDocs in your MyNDVFilter. I think it would return false for all UserB docs, but you should confirm that. Anyway, because you use an NDV field, you can't automatically skip unrelated documents, but rather your code would look something like:

    for (int i = 0; i < reader.maxDoc(); i++) {
      if (!acceptDocs.get(i)) {
        continue;
      }
      // document is accepted, read values ...
    }

Shai

On Wed, Mar 11, 2015 at 1:25 PM, Ian Lea ian@gmail.com wrote:
> Can you use a BooleanFilter (or ChainedFilter in 4.x) alongside your
> BooleanQuery? Seems more logical and I suspect would solve the problem.
>
> Caching filters can be good too, depending on how often your data
> changes. See CachingWrapperFilter.
>
> --
> Ian.
>
> On Tue, Mar 10, 2015 at 12:45 PM, Chris Bamford cbamf...@mimecast.com wrote:
> > Hi,
> >
> > I have an index of 30 docs, 20 of which have an owner field of UserA and
> > 10 of UserB. I also have a query which consists of:
> >
> > BooleanQuery:
> > -- Clause 1: TermQuery
> > -- Clause 2: FilteredQuery
> >    - Branch 1: MatchAllDocsQuery()
> >    - Branch 2: MyNDVFilter
> >
> > I execute my search as follows:
> >
> >     searcher.search(booleanQuery, new TermFilter(new Term("owner", "UserA")), 50);
> >
> > The TermFilter's job is to reduce the number of searchable documents from
> > 30 to 20, which it does for all clauses of the BooleanQuery except for
> > MyNDVFilter, which iterates through the full 30 docs, 10 needlessly. How
> > can I restrict it so it behaves the same as the other query branches?
> >
> > MyNDVFilter source code:
> >
> >     public class MyNDVFilter extends Filter {
> >
> >       private String fieldName;
> >       private String matchTag;
> >
> >       public TagFilter(String ndvFieldName, String matchTag) {
> >         this.fieldName = ndvFieldName;
> >         this.matchTag = matchTag;
> >       }
> >
> >       @Override
> >       public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
> >         AtomicReader reader = context.reader();
> >         int maxDoc = reader.maxDoc();
> >         final FixedBitSet bitSet = new FixedBitSet(maxDoc);
> >         BinaryDocValues ndv = reader.getBinaryDocValues(fieldName);
> >         if (ndv != null) {
> >           for (int i = 0; i < maxDoc; i++) {
> >             BytesRef br = ndv.get(i);
> >             if (br.length > 0) {
> >               String strval = br.utf8ToString();
> >               if (strval.equals(matchTag)) {
> >                 bitSet.set(i);
> >                 System.out.println("MyNDVFilter " + matchTag + " matched " + i + " [" + strval + "]");
> >               }
> >             }
> >           }
> >         }
> >         return new DVDocSetId(bitSet); // just wraps a FixedBitSet
> >       }
> >     }
> >
> > Chris Bamford
> > m: +44 7860 405292
> > w: www.mimecast.com
> > Senior Developer
> > p: +44 207 847 8700
Re: Sampled Hit counts using Lucene Facets.
I am not sure that splitting the ranges into smaller ranges is the same as sampling. Take a look at RandomSamplingFacetsCollector - it implements sampling by sampling the document space, not the facet values space. So if for instance you use a LongRangeFacetCounts in conjunction with a RandomSamplingFacetsCollector, you would get the matching documents space sampled, and the counts you would get for each range could be considered sampled too. This is at least how we implemented facet sampling.

Shai

On Tue, Mar 10, 2015 at 10:21 AM, Gimantha Bandara giman...@wso2.com wrote:
> What I am planning to do is split the given time range into smaller time ranges myself and pass them to a LongRangeFacetCounts object and get the counts for each sub-range. Is this the correct way?
>
> On Tue, Mar 10, 2015 at 12:01 AM, Gimantha Bandara giman...@wso2.com wrote:
> > Any updates on this please? Do I have to write my own code to sample and get the hit count?
> >
> > On Sat, Mar 7, 2015 at 2:14 PM, Gimantha Bandara giman...@wso2.com wrote:
> > > Any help on this please?
> > >
> > > On Fri, Mar 6, 2015 at 3:13 PM, Gimantha Bandara giman...@wso2.com wrote:
> > > > Hi,
> > > > I am trying to create some APIs using the Lucene facets APIs. First I will explain my requirement with an example. Let's say I am keeping track of the count of people who enter through a certain door, and the time range I am interested in is the last 6 hours (to get the total count, I know that I'll have to use ranged facets). How do I sample this time range and get the counts of each sample? In other words, as an example, if I split the last 6 hours into 5-minute samples, I get 72 (6*60/5) different time ranges. I would be interested in getting hit counts for each of these 72 ranges in an array, with the respective lower bound of each sample. Can someone point me in the direction I should follow / the classes which would be helpful to look at? Elasticsearch already has this feature exposed by their JavaScript API. Is it possible to implement the same with Lucene? Is there a Facets user guide for Lucene 4.10.3 or 5.0.0?
> > > > Thanks,
>
> --
> Gimantha Bandara
> Software Engineer
> WSO2. Inc : http://wso2.com
> Mobile : +94714961919
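Combining the two as described, a minimal sketch (the sample size and the "timestamp" field are made-up values; RandomSamplingFacetsCollector ships with the facet module in recent 4.x releases):

    // collect a random sample of the matching documents instead of all of them
    RandomSamplingFacetsCollector fc = new RandomSamplingFacetsCollector(10000);
    searcher.search(query, fc);

    // range counts computed over the sampled document space
    Facets facets = new LongRangeFacetCounts("timestamp", fc, ranges);
    FacetResult result = facets.getTopChildren(ranges.length, "timestamp");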
Re: Faceted Search Hierarchy
Lucene does not understand the word "India"; therefore the facets that are actually indexed are:

Doc1: Asia + Asia/India
Doc2: India + India/Gujarat

When you ask for top children, you will get Asia + India, both with a count of 1.

Shai

On Thu, Jan 8, 2015 at 1:48 PM, Jigar Shah jigaronl...@gmail.com wrote:
> A very simple question: my facet index has 2 documents as follows:
>
> Doc1 indexed facet path: Asia/India
> Doc2 indexed facet path: India/Gujarat
>
> Now, during a faceted search, will facets.getTopChildren() return 1 result (Asia) or 2 (Asia, India)? So basically, will it join values and return a hierarchy?
>
> Thanks,
Re: Faceted Search Hierarchy
Not automatically. There's no reason to assume that 'India' is the same in 'India/Gujarat' and 'Asia/India'. Furthermore, if you first add a document with India/Gujarat and later add a document with Asia/India, we cannot go back to the other document and update the hierarchy.

On Thu, Jan 8, 2015 at 3:27 PM, Jigar Shah jigaronl...@gmail.com wrote:
> Is there some way to achieve this at the Lucene level, so I can get facets like below?
>
> Doc1: Asia + Asia/India
> Doc2: India + Asia/India/Gujarat
>
> Which can result in this: Asia/India/Gujarat (2)
>
> Can Lucene internally index like the above, since the 'India' value already exists as a path component of some other document? Or are there other ways that can be explored within Lucene?
>
> On Thu, Jan 8, 2015 at 5:26 PM, Shai Erera ser...@gmail.com wrote:
> > Lucene does not understand the word "India"; therefore the facets that are actually indexed are:
> >
> > Doc1: Asia + Asia/India
> > Doc2: India + India/Gujarat
> >
> > When you ask for top children, you will get Asia + India, both with a count of 1.
> > Shai
> >
> > On Thu, Jan 8, 2015 at 1:48 PM, Jigar Shah jigaronl...@gmail.com wrote:
> > > A very simple question: my facet index has 2 documents as follows:
> > > Doc1 indexed facet path: Asia/India
> > > Doc2 indexed facet path: India/Gujarat
> > > Now, during a faceted search, will facets.getTopChildren() return 1 result (Asia) or 2 (Asia, India)? So basically, will it join values and return a hierarchy?
> > > Thanks,
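In other words, the hierarchy only lines up if the application indexes full, consistent paths itself; a minimal sketch (the "Place" dimension name is made up for illustration):

    FacetsConfig config = new FacetsConfig();
    config.setHierarchical("Place", true);

    Document doc1 = new Document();
    doc1.add(new FacetField("Place", "Asia", "India"));
    writer.addDocument(config.build(taxoWriter, doc1));

    Document doc2 = new Document();
    doc2.add(new FacetField("Place", "Asia", "India", "Gujarat"));
    writer.addDocument(config.build(taxoWriter, doc2));

    // getTopChildren(10, "Place") now returns Asia (2), and drilling into
    // "Place"/"Asia" returns India (2), because both documents share the prefix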
Re: Facet Result Order
Hi Mrugesh,

This is strange indeed, as the facets are ordered by count, and we use the facet ordinal (an integer code) as a tie breaker. What do you mean by "refreshed"? Do you have a sample test that shows this behavior?

Shai

On Fri, Dec 12, 2014 at 8:37 AM, patel mrugesh patelmruge...@yahoo.co.in wrote:
> Hi All,
>
> I am working with Lucene facets nowadays and faceting seems to be working fine. Just one thing that came to my attention: the order of the facet results changes when totals are equal. For example, for a country facet the following results have been noticed.
>
> First time:
> - USA (10)
> - India (9)
> - UK (9)
>
> When refreshed, second time:
> - USA (10)
> - UK (9)
> - India (9)
>
> When refreshed, third time:
> - USA (10)
> - India (9)
> - UK (9)
>
> It would be great if I could get the same result every time; I mean the order of the results should be the same even when counts are equal (in our example, either India should come second every time or UK should come second every time).
>
> Thanks in advance,
> Mrugesh
Re: Index replication strategy
Do you use Lucene or Solr? Lucene also has a replication module, which will allow you to replicate index changes.

On Thu, Dec 4, 2014 at 4:19 PM, Vijay B vijay.nip...@gmail.com wrote:
> Hello,
>
> We index docs coming from a database nightly. The current index is sitting on NFS. For obvious performance reasons, we are planning to switch to a local index. We have a cluster of 4 servers, and with NFS it was not a problem for us until now to share the index. But going forward, we are looking at our design options for index replication onto local storage.
>
> Our setup: Index size: 8GB (grows by 2GB every year), Lucene 4.2.1, 64-bit Java.
>
> The options we considered:
>
> * Each server instance hosting a nightly job to pull a delta of data from the DB. But this would result in high DB load (4 servers = 4 times the load).
> * An additional nightly job sitting on another server that pushes the data onto the local disks of each instance. This may not work out, as the local disk may not be visible.
> * Each server hosting a replication job that pulls a delta of data from NFS and stores it in the local index. So far this is the only promising option we have.
> * Is Solr an option for us in this case? (I know it's a question for the Solr group, but experts here might have some thoughts.)
>
> Thank you for your attention.
Re: Index replication strategy
Ooops, didn't notice that :). So you'll need to upgrade to Lucene 4.4.0 in order to use it. You can read some details as well as example code here: http://shaierera.blogspot.com/2013/05/the-replicator.html

Shai

On Thu, Dec 4, 2014 at 4:36 PM, Vijay B vijay.nip...@gmail.com wrote:
> As indicated in my post, we use Lucene 4.2.1.
>
> On Thu, Dec 4, 2014 at 9:29 AM, Shai Erera ser...@gmail.com wrote:
> > Do you use Lucene or Solr? Lucene also has a replication module, which
> > will allow you to replicate index changes.
> >
> > [snip]
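For reference, the shape of the API from that post, as a minimal sketch (the directory/path variables are assumptions, and the IndexWriter must be configured with a SnapshotDeletionPolicy for IndexRevision to work):

    // publisher side: expose a new revision after every commit
    LocalReplicator replicator = new LocalReplicator();
    indexWriter.commit();
    replicator.publish(new IndexRevision(indexWriter));

    // replica side: copy new revisions into a local directory
    ReplicationClient client = new ReplicationClient(
        replicator,
        new IndexReplicationHandler(localIndexDir, null /* optional callback */),
        new PerSessionDirectoryFactory(workDir));
    client.updateNow(); // one-shot; or client.startUpdateThread(60000, "replicator")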
Re: hierarchical facets
Yes, hierarchical faceting in Lucene is only supported by the taxonomy index, at least currently.

Shai

On Tue, Nov 25, 2014 at 3:46 PM, Vincent Sevel v.se...@lombardodier.com wrote:
> Hi,
> I saw that SortedSetDocValuesFacetCounts does not support hierarchical facets. Is that to say that hierarchical facets are only supported through the taxonomy index? I am using Lucene 4.7.2.
> Regards,
> vince
Re: Lucene not showing Low Score Doc
Hi,

Your question is a bit fuzzy -- what do you mean by not showing low scores? Are you sure that these 2 documents are matched by the query? Can you boil it down to a short test case that demonstrates the problem?

In general though, when you search through IndexSearcher.search(Query, int), you won't get all matching documents, but only the number that you specified (that's the 'int' that you pass). I don't think that's the problem you're describing though, as it sounds like there are only 10 documents, and the default is to return the top 10.

Again, if you have a short test that demonstrates the problem, that would be good.

Shai

On Mon, Oct 27, 2014 at 2:39 PM, Priyanka Tufchi priyanka.tuf...@launchship.com wrote:
> Hi All,
>
> I have a set of 10 docs which I compare through Apache Lucene. When I check the scores for the set, out of 10 I am getting 8 in my database; the other 2 are not showing up. Even if the score is very low, Lucene should still show something. How can I handle this, as I have to show the scores for all 10?
>
> Thanks,
> Priyanka
Re: Lucene not showing Low Score Doc
I'm sorry, I still don't feel like I have all the information in order to help with the problem that you're seeing. Can you at least paste the contents of the documents and the query? Can you search with a TotalHitCountCollector only, and print the total number of hits?

Shai

On Mon, Oct 27, 2014 at 3:36 PM, Priyanka Tufchi priyanka.tuf...@launchship.com wrote:
> Hi,
>
> It should give matches for all 10 docs, but it is only giving them for 8. I checked: the remaining 2 are non-matching docs with a very low score. Is there any way I can get those two docs which have not matched? And I have set hitpage = 10.
>
> Thanks,
> Priyanka
>
> On Mon, Oct 27, 2014 at 6:14 AM, Shai Erera ser...@gmail.com wrote:
> > Hi,
> > Your question is a bit fuzzy -- what do you mean by not showing low scores?
> >
> > [snip]
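For reference, the TotalHitCountCollector check suggested above is a three-liner (standard Lucene API):

    TotalHitCountCollector counter = new TotalHitCountCollector();
    searcher.search(query, counter);
    System.out.println("total matches: " + counter.getTotalHits());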
Re: Exception from FastTaxonomyFacetCounts
Yes, SearcherTaxonomyManager returns a SearcherAndTaxonomy containing a sync'd IndexSearcher and DirectoryTaxonomyReader.

Shai

On Mon, Oct 13, 2014 at 12:15 PM, Jigar Shah jigaronl...@gmail.com wrote:
> In my application I have two instances of SearcherManager:
>
> 1) A SearcherManager with 'applyAllDeletes = true', which is used by the indexer. (It works in NRT mode, deletes should be visible to it; I also have a ControlledRealTimeReopenThread, which refreshes the searcher.)
>
> 2) A SearcherManager with 'applyAllDeletes = false', which is used by the searcher. (It only performs searches; the javadoc says we may gain some performance if it's 'false', as it will not wait for flushing deletes.)
>
> I have introduced taxonomy facets in my application. Should I replace both SearcherManagers with SearcherTaxonomyManagers (one with applyAllDeletes=true and another with applyAllDeletes=false)? Will the IndexSearcher and TaxonomyReader be in sync in both SearcherTaxonomyManagers?
>
> On Fri, Oct 10, 2014 at 12:08 AM, Shai Erera ser...@gmail.com wrote:
> > This usually means that your IndexReader and TaxonomyReader are out of sync. That is, the IndexReader sees category ordinals that the TaxonomyReader does not yet see. Do you use SearcherTaxonomyManager in your application? It ensures that the two are always in sync, i.e. reopened together, and that your application always sees a consistent view of the two.
> >
> > Shai
> >
> > On Tue, Oct 7, 2014 at 10:03 AM, Jigar Shah jigaronl...@gmail.com wrote:
> > > Intermittently, while searching, I am getting this exception on a huge index. (The FacetsConfig used while indexing and searching is the same.)
> > >
> > > java.lang.ArrayIndexOutOfBoundsException: 252554
> > > 06:28:37,954 ERROR [stderr] at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.count(FastTaxonomyFacetCounts.java:73)
> > > 06:28:37,954 ERROR [stderr] at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:49)
> > > 06:28:37,954 ERROR [stderr] at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:39)
> > > 06:28:37,954 ERROR [stderr] at com.company.search.CustomDrillSideways.buildFacetsResult(LuceneDrillSideways.java:41)
> > > 06:28:37,954 ERROR [stderr] at org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:146)
> > > 06:28:37,955 ERROR [stderr] at org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:203)
> > >
> > > Thanks,
> > > Jigar Shah
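A minimal sketch of the usage being discussed (4.x constructor); the acquire/release discipline is what guarantees the searcher and taxonomy reader stay consistent:

    SearcherTaxonomyManager mgr = new SearcherTaxonomyManager(
        indexWriter, true /* applyAllDeletes */, new SearcherFactory(), taxoWriter);

    // periodically, or after indexing a batch:
    mgr.maybeRefresh();

    SearcherTaxonomyManager.SearcherAndTaxonomy sat = mgr.acquire();
    try {
      FacetsCollector fc = new FacetsCollector();
      sat.searcher.search(query, fc);
      // searcher and taxonomyReader are guaranteed to be in sync here
      Facets facets = new FastTaxonomyFacetCounts(sat.taxonomyReader, config, fc);
    } finally {
      mgr.release(sat);
    }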
Re: Delete / Update facets from taxonomy index
Hi,

You cannot remove facets from the taxonomy index, but you can reindex a single document and update its facets. This will add new facets to the taxonomy index (if they do not already exist). You do that just like you reindex any document, by calling IndexWriter.updateDocument(). Just make sure to rebuild the document with FacetsConfig.

Shai

On Tue, Oct 7, 2014 at 12:42 AM, wesli we...@hotmail.com wrote:
> I'm using Lucene for full-text search in an online store. I've built an indexer program which creates a Lucene index and a taxonomy index. The taxonomy index contains facets with categories and article features (like color, brand, etc.).
>
> Is it possible to re-add or update the facets of a single document? E.g. the shop owner changes the category of an article or some feature (like its color). As I read in the documentation, the taxonomy index can be rebuilt, but it is not possible to re-add (delete and add) facets. I don't want to rebuild the whole taxonomy index each time a single article (document) facet is changed. Is there another solution to update the taxonomy index?
>
> I'm using Lucene 4.10.
>
> Regards
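A minimal sketch of the reindex-one-document flow (the "id" field and the dimension names are made up for illustration):

    Document doc = new Document();
    doc.add(new StringField("id", articleId, Field.Store.YES));
    doc.add(new FacetField("Color", "red"));      // the updated facet value
    doc.add(new FacetField("Category", "shoes"));
    // ... the rest of the article's fields ...

    // atomically replaces the old version of the document; any new categories
    // are added to the taxonomy by FacetsConfig.build() as a side effect
    indexWriter.updateDocument(new Term("id", articleId),
        facetsConfig.build(taxoWriter, doc));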
Re: topdocs per facet
The facet translation should be done at the application level. So if you index the dimension A with two facets, A/A1 and A/A2, where A1 should be translated to B1 and A2 to B2, there are several options:

Index the dimensions A and B with their respective facets, and count the relevant dimension based on the user's locale. The user can then drill down on any of the returned facets easily. I'd say that if your index and/or taxonomy aren't big, this is the easiest solution and the most straightforward to implement.

Another way is to index the facets Root/R1 and Root/R2, which are language-independent. At the application level you translate Root/R1 to either A/A1 or B/B1 based on the user's locale. You also then do the reverse translation when the user drills down. So e.g. if the user clicked A/A1, you translate that to Root/R1 and drill down on that. If your application is UI based, you can probably return e.g. a JSON construct which contains the labels to display plus the facet values to drill down by, and then you don't need to do any reverse translation.

As for retrieving a document's facets, you can either index them as separate StoredFields (easy), or use DocValuesOrdinalsReader to traverse the facets list along with the MatchingDocs, read the facet ordinals and translate them. If it sounds complex, just use StoredFields :).

Shai

On Mon, Sep 29, 2014 at 7:15 PM, Jürgen Albert j.alb...@data-in-motion.biz wrote:
> Hi,
>
> I'm currently implementing the Lucene facets in version 4.8.1, and two questions remain for me:
>
> 1. Is there an easy way to have translations for the facets? If we use e.g. the books example, the user should see the translation, but if he clicks on a link, the English value should be used for the search. Thus I have to return both the facet translation and the actual value from the search.
>
> 2. Is there a possibility to get the docs per facet? As an example, I have e.g. a DrillDownQuery returning 5 docs and 2 dimensions with 2 facets each. I guess the solution is somewhere in the MatchingDocs. If I try:
>
>     List<MatchingDocs> matchingDocs = facetsCollector.getMatchingDocs();
>     for (MatchingDocs doc : matchingDocs) {
>       DocIdSet docSet = doc.bits;
>       DocIdSetIterator iterator = docSet.iterator();
>       int docId = iterator.nextDoc();
>       while (docId != DocIdSetIterator.NO_MORE_DOCS) {
>         Document document = doc.context.reader().document(docId);
>         System.out.println(document.toString());
>         docId = iterator.nextDoc();
>       }
>     }
>
> the result is a list with as many MatchingDocs as dimensions, but only one MatchingDocs gives me my docs at all. How I could get the docs per facet I can't see at all, nor how I could get the facets of a doc. What do I miss?
>
> Thx,
> Jürgen Albert
>
> --
> Jürgen Albert
> Geschäftsführer
> Data In Motion UG (haftungsbeschränkt)
> Kahlaische Str. 4, 07745 Jena
> E-Mail: j.alb...@datainmotion.de
> Web: www.datainmotion.de
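The "easy" StoredFields option from the last paragraph, sketched (the field names are made up; a parallel ".stored" field is just one possible convention):

    // index time: store the label redundantly next to the facet
    doc.add(new FacetField("Brand", brand));
    doc.add(new StoredField("Brand.stored", brand));

    // search time: read it back per hit
    Document hit = searcher.doc(scoreDoc.doc);
    String brandLabel = hit.get("Brand.stored");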
Re: Exception from FastTaxonomyFacetCounts
This usually means that your IndexReader and TaxonomyReader are out of sync. That is, the IndexReader sees category ordinals that the TaxonomyReader does not yet see. Do you use SearcherTaxonomyManager in your application? It ensures that the two are always in sync, i.e. reopened together, and that your application always sees a consistent view of the two.

Shai

On Tue, Oct 7, 2014 at 10:03 AM, Jigar Shah jigaronl...@gmail.com wrote:
> Intermittently, while searching, I am getting this exception on a huge index. (The FacetsConfig used while indexing and searching is the same.)
>
> java.lang.ArrayIndexOutOfBoundsException: 252554
> 06:28:37,954 ERROR [stderr] at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.count(FastTaxonomyFacetCounts.java:73)
> 06:28:37,954 ERROR [stderr] at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:49)
> 06:28:37,954 ERROR [stderr] at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:39)
> 06:28:37,954 ERROR [stderr] at com.company.search.CustomDrillSideways.buildFacetsResult(LuceneDrillSideways.java:41)
> 06:28:37,954 ERROR [stderr] at org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:146)
> 06:28:37,955 ERROR [stderr] at org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:203)
>
> Thanks,
> Jigar Shah
Re: FacetsConfig usage
Hi,

The FacetsConfig object is the one that you use to index facets, and at search time it is consulted about the facets' attributes (multi-valued, hierarchical etc.). You can make changes to the FacetsConfig, as long as they don't contradict the indexed data in a problematic manner.

Usually the facets configuration does not change, but I believe it will work if you add new dimensions. Current in-flight searches won't query/count those dimensions anyway, and new searches will find those dimensions in recently indexed documents only. It is up to you to decide if the old 1 million documents that don't contain the new Person facet are OK to display together with the 10 new documents that do, but as long as you're OK with that, application-wise, adding new dimensions should just work.

Contradicting changes are changes to the attributes of one dimension, e.g. from hierarchical to flat. In that case, the fact that there are 1 million old documents indexed with an A/B/C hierarchy and 10 new documents with only A/B doesn't matter to the FacetsConfig - all documents will be considered flat in that case. Here I'm less sure about the effects of that on search (I don't think we have a test for it), but I hope that you don't do that. It's not advisable, just like any other schema change to your fields while there are already indexed documents.
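A small sketch of the kind of configuration discussed above (dimension names follow the email's example):

    FacetsConfig config = new FacetsConfig();
    config.setMultiValued("Author", true);       // a doc may have several authors
    config.setHierarchical("PublishDate", true); // e.g. year/month/day paths

    // adding a brand-new dimension later is the benign case described above;
    // it simply starts appearing in documents indexed from that point on:
    config.setMultiValued("Person", true);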
Re: confused facet example
Thanks Yonghui, I will commit a fix - need to initialize the example class before each example is run! Shai On Tue, Sep 30, 2014 at 1:26 PM, Yonghui Zhao zhaoyong...@gmail.com wrote: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/SimpleFacetsExample.java In SimpleFacetsExample,

/** Runs the search example. */
public List<FacetResult> runFacetOnly() throws IOException {
  index();
  return facetsOnly();
}

/** Runs the search example. */
public List<FacetResult> runSearch() throws IOException {
  index();
  return facetsWithSearch();
}

/** Runs the drill-down example. */
public FacetResult runDrillDown() throws IOException {
  index();
  return drillDown();
}

/** Runs the drill-sideways example. */
public List<FacetResult> runDrillSideways() throws IOException {
  index();
  return drillSideways();
}

/** Runs the search and drill-down examples and prints the results. */
public static void main(String[] args) throws Exception {
  System.out.println("Facet counting example:");
  System.out.println("---");
  SimpleFacetsExample example1 = new SimpleFacetsExample();
  List<FacetResult> results1 = example1.runFacetOnly();
  System.out.println("Author: " + results1.get(0));
  System.out.println("Publish Date: " + results1.get(1));
  System.out.println("Facet counting example (combined facets and search):");
  System.out.println("---");
  SimpleFacetsExample example = new SimpleFacetsExample();
  List<FacetResult> results = example.runSearch();
  System.out.println("Author: " + results.get(0));
  System.out.println("Publish Date: " + results.get(1));
  System.out.println("\n");
  System.out.println("Facet drill-down example (Publish Date/2010):");
  System.out.println("-");
  System.out.println("Author: " + example.runDrillDown());
  System.out.println("\n");
  System.out.println("Facet drill-sideways example (Publish Date/2010):");
  System.out.println("-");
  for (FacetResult result : example.runDrillSideways()) {
    System.out.println(result);
  }
}

The example doesn't create a new SimpleFacetsExample each time. So in the drill-down example it indexes 2 times and the result numbers are doubled; in the drill-sideways example it indexes 3 times and the result numbers are tripled. Is it intended?
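The fix Shai mentions amounts to not reusing an instance (and thus not re-running index() over the same index) across examples; a sketch:

// each example gets a fresh instance, so index() runs exactly once per index
System.out.println("Author: " + new SimpleFacetsExample().runDrillDown());
for (FacetResult result : new SimpleFacetsExample().runDrillSideways()) {
  System.out.println(result);
}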
Re: sortedset vs taxonomy
Hi The taxonomy faceting approach maintains a sidecar index where it keeps the taxonomy and assigns an integer (ordinal) to each category. Those integers are encoded in a BinaryDocValues field for each document. It supports hierarchical faceting as well as assigning additional metadata to each facet occurrence (called associations). At search time, faceting is done by aggregating the category ordinals found in each document. Since those ordinals are global to the index, merging and finding the top-K facets across segments is relatively cheap. The SortedSet faceting approach does not need a sidecar index and relies on the SortedSet fields. Here too each term/category is assigned an ordinal and at search time the facets are aggregated using those ordinals. However, the ordinals of the same category are not the same across segments, and therefore finding the top-K facets is a bit more expensive (roughly 20% slower if I remember correctly). Another difference is that the SortedSet approach keeps a true ordinal for a facet, so e.g. the category A/B will always receive an ordinal that is smaller than A/C. In the taxonomy approach though, whichever facet got added first receives the lowest ordinal, except that the parent of all categories at a certain level in the hierarchy always receives a smaller ordinal than all its children. Working w/ SortedSet facets is indeed simpler than the taxonomy, but the taxonomy does not seriously complicate things. If you need a facet hierarchy, you should use the taxonomy approach. Otherwise, I would just try each and see which one works better for your usecase. As for optimizing an index, the taxonomy facets do not make any difference in that case. Shai On Mon, Sep 22, 2014 at 8:48 PM, Yonghui Zhao zhaoyong...@gmail.com wrote: If we want to implement a simple facet counting feature, it seems we can do it via sortedset or taxonomy writer/reader. It seems sortedset is simpler but doesn't support hierarchical facet counts such as A/B/C. I want to know what's the advantage/disadvantage of sortedset or taxonomy? Is there any trouble with taxonomy when the index is optimized (merged)?
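To make the difference concrete, a minimal sketch of indexing and counting with each approach (dimension names and values are placeholders; fc is a FacetsCollector that already collected a search):

// taxonomy approach: sidecar taxonomy index, hierarchy supported
doc.add(new FacetField("Place", "Asia", "India")); // a hierarchical path
writer.addDocument(config.build(taxoWriter, doc));
Facets taxoFacets = new FastTaxonomyFacetCounts(taxoReader, config, fc);

// sorted-set approach: flat dimensions only, no sidecar index
doc2.add(new SortedSetDocValuesFacetField("Author", "Lisa"));
writer.addDocument(config.build(doc2)); // note: no taxonomy writer involved
SortedSetDocValuesReaderState state = new DefaultSortedSetDocValuesReaderState(indexReader);
Facets ssdvFacets = new SortedSetDocValuesFacetCounts(state, fc);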
Re: document boost at lucene 4.8.1
You can read some discussion here: http://search-lucene.com/m/Z2GP220szmSsubj=RE+What+is+equivalent+to+Document+setBoost+from+Lucene+3+6+inLucene+4+1+ . I wrote a post on how to achieve that with the new API: http://shaierera.blogspot.com/2013/09/boosting-documents-in-lucene.html. Shai On Sun, Sep 21, 2014 at 11:23 AM, #LI JUN# jli...@e.ntu.edu.sg wrote: Hi all, How come in 4.8.1, the document.setBoost method is missing. So what is the method for document level boost now? Regards, Jun
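The linked post describes replacing document boosts with an updatable DocValues field. As a rough sketch of one way to wire such a field into scoring via the expressions module (the "boost" field name and the formula here are illustrative, not necessarily the post's exact code):

// index time: store the boost as a doc-values field
doc.add(new NumericDocValuesField("boost", 3L));

// search time: rank by boost * relevance score
Expression expr = JavascriptCompiler.compile("boost * _score");
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("boost", SortField.Type.LONG));
bindings.add(new SortField("_score", SortField.Type.SCORE));
Sort sort = new Sort(expr.getSortField(bindings, true)); // true = descending
TopDocs td = searcher.search(query, 10, sort);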
Re: improve indexing speed with nomergepolicy
I opened https://issues.apache.org/jira/browse/LUCENE-5883 to handle that. Shai On Thu, Aug 7, 2014 at 6:42 PM, Uwe Schindler u...@thetaphi.de wrote: This is a good idea, because sometimes it's nice to change the MergePolicy on the fly without reopening! One example is https://issues.apache.org/jira/browse/LUCENE-5526 In my case, I would like to open an IndexWriter, set its merge policy to IndexUpdaterMergePolicy, force a merge to upgrade all segments and then proceed with normal indexing and other stuff. Currently you have to close IW - this is bad in multithreaded environments: If you start an Index Upgrade after installing a new version of your favourite Solr/ES/... server, but need to index documents in parallel (real time system) - so with little downtime. The proposal in the above issue is to allow to pass a MergePolicy to forceMerge(). Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Shai Erera [mailto:ser...@gmail.com] Sent: Thursday, August 07, 2014 4:11 PM To: java-user@lucene.apache.org Subject: Re: improve indexing speed with nomergepolicy Yes, currently an MP isn't a live setting on IndexWriter, meaning you pass it at construction time and don't change it afterwards. I wonder if after LUCENE-5711 we can move MergePolicy to LiveIndexWriterConfig and fix IndexWriter to not hold on to it, but rather pull it from the config. Not sure what others think about it. Shai On Thu, Aug 7, 2014 at 5:05 PM, Jon Stewart j...@lightboxtechnologies.com wrote: Related, how does one change the MergePolicy on an IndexWriter (e.g., use NoMergePolicy during batch indexing, then change to something better once finished with batch)? It looks like the MergePolicy is set through IndexWriterConfig but I don't see a way to update an IWC on an IW. Thanks, Jon On Thu, Aug 7, 2014 at 7:37 AM, Shai Erera ser...@gmail.com wrote: Using NoMergePolicy for online indexes is usually not recommended. You want to use NoMP in case where you build an index in a batch job, then in the end before the index is published you run a forceMerge or maybeMerge (with a real MergePolicy). For online indexes, i.e. indexes that are being searched while they are updated, if you use NoMP you will accumulate many segments in the index. This means higher resources consumption overall: file handles, RAM, potentially disk space, and usually results in slower searches. You may want to tweak the default MP's settings though, to not kick off a merge unless there are a large number of segments in the index. E.g. the default MP merges segments when there are 10 at the same level (i.e. roughly the same size). You can increase that. Also, do you use NRTCachingDirectory? It's usually recommended for NRT, even with default MP, since the tiny segments are merged in-memory, and your NRT reopens don't result in flushing new segments to disk. Shai On Thu, Aug 7, 2014 at 1:14 PM, Sascha Janz sascha.j...@gmx.net wrote: hi, i try to speed up our indexing process. we use SeacherManager with applydeletes to get near real time Reader. we have not really much incoming documents, but the documents must be updated from time to time and the amount of documents to be updated could be quite large. i tried some tests with NoMergePolicy and the indexing process was 25 % faster. so i think of a change in our code, to use NoMergePolicy for a specific time interval, when users are active and do a forceMerge(20) every night, which last about 2 - 5 minutes. is this a good idea? 
or will i perhaps get into trouble? Sascha -- Jon Stewart, Principal (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Questions for facets search
Sheng, I assume that you're using the Lucene faceting module, so I answer following that: (1) A document can be associated with many facet labels, e.g. Tags/lucene and Author/Shai. The way to extract all facet labels for a particular document is this:

OrdinalsReader ordinals = new DocValuesOrdinalsReader();
OrdinalsSegmentReader ordsSegment = ordinals.getReader(indexReader.leaves().get(0)); // we have only one segment
IntsRef scratch = new IntsRef();
ordsSegment.get(0, scratch);
for (int i = 0; i < scratch.length; i++) {
  System.out.println(taxoReader.getPath(scratch.ints[i]));
}

Note that OrdinalsSegmentReader works on an AtomicReader. That means that the doc-id that you pass to it must be relative to the segment. If you have a global doc-id, you can wrap the DirectoryReader with a SlowCompositeReaderWrapper, which presents the DirectoryReader as an AtomicReader. (2) I'm not quite sure I understand what you mean by facet cache. Do you mean the taxonomy index? If so the answer is no. Think of the taxonomy index as a large global Map<FacetLabel, Integer>, where each facet label is mapped to an integer, irrespective of the segment it is indexed in. That map is used to encode the facet information in the *Search Index* more efficiently. Therefore the taxonomy index itself doesn't hold all the information that is needed for faceted search, and you cannot rebuild only it. Shai On Wed, Aug 13, 2014 at 8:08 AM, Ralf Heyde ralf.he...@gmx.de wrote: For 1st: from Solr Level i guess, you could select (only) the document by uniqueid. Then you have the facets for that particular document. But this results in one additional query/doc. Sent from my BlackBerry 10 smartphone. Original message From: Sheng Sent: Tuesday, August 12, 2014 23:35 To: java-user@lucene.apache.org Reply to: java-user@lucene.apache.org Subject: Questions for facets search I actually have 2 questions: 1. Is it possible to get the facet label for a particular document? The reason we want this is we'd like to allow users to see tags for each hit in addition to the taxonomy for his/her search. 2. Is it possible to re-index the facet cache without reindexing the whole lucene cache, since they are separated? We have a dynamic list of faceted fields, being able to quickly rebuild the whole facet lucene cache would be quite desirable. Again, I am using lucene 4.7, thanks in advance to your answers! Sheng - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Questions for facets search
Glad it helped Sheng. Note, the taxonomy index is not exactly like what you implement, just want to clarify that. You implemented something like a JOIN between two indexes, where a document in index Index1 can be joined with a document (or set of docs) in Index2, by some primary key. The taxonomy index is different. It's an auxiliary index, but the word 'index' is just an implementation detail. Again, think of it as a large Map from a String to Integer. Every facet in the taxonomy gets a unique ID (integer), and that integer is encoded in the search index for all documents that are associated with that facet. Lucene implements a similar feature, per-segment, through SortedSetDocValues (and the facet module supports that one too, without the need for an auxiliary index). The difference is that SortedSetDocValues implements that mapping per-segment, so e.g. the facet Tags/Lucene may receive the integer 5 in seg1 and 12 in seg2, where the taxonomy index maps it *once* to an integer (say 4), and that integer is encoded in a BinaryDocValuesField in all segments of the search index. The only lookup that is done at search time is when you want to label top facets. Since the search index holds only the integer values of the facets, the taxonomy index is used to label them (so now it's more of a bidirectional Map). Just wanted to clarify the differences. Shai On Thu, Aug 14, 2014 at 2:56 AM, Sheng sheng...@gmail.com wrote: Shai, Thanks a lot for your answers! Sorry, I was distracted by some other matters during the day and cannot try your suggestions until now. So what you suggest on 1 is working like a charm :) for 2, it is a pity but I can understand. By the way, the way you described that facet index gets stored like a map is quite similar to how we store the payload :) We use an integer as payload for each token, and store more complicated information in another Lucene index with the integer payload as the key for each document. Sheng On Wednesday, August 13, 2014, Shai Erera ser...@gmail.com wrote: Sheng, I assume that you're using the Lucene faceting module, so I answer following that: (1) A document can be associated with many facet labels, e.g. Tags/lucene and Author/Shai. The way to extract all facet labels for a particular document is this:

OrdinalsReader ordinals = new DocValuesOrdinalsReader();
OrdinalsSegmentReader ordsSegment = ordinals.getReader(indexReader.leaves().get(0)); // we have only one segment
IntsRef scratch = new IntsRef();
ordsSegment.get(0, scratch);
for (int i = 0; i < scratch.length; i++) {
  System.out.println(taxoReader.getPath(scratch.ints[i]));
}

Note that OrdinalsSegmentReader works on an AtomicReader. That means that the doc-id that you pass to it must be relative to the segment. If you have a global doc-id, you can wrap the DirectoryReader with a SlowCompositeReaderWrapper, which presents the DirectoryReader as an AtomicReader. (2) I'm not quite sure I understand what you mean by facet cache. Do you mean the taxonomy index? If so the answer is no. Think of the taxonomy index as a large global Map<FacetLabel, Integer>, where each facet label is mapped to an integer, irrespective of the segment it is indexed in. That map is used to encode the facet information in the *Search Index* more efficiently. Therefore the taxonomy index itself doesn't hold all the information that is needed for faceted search, and you cannot rebuild only it.
Shai On Wed, Aug 13, 2014 at 8:08 AM, Ralf Heyde ralf.he...@gmx.de wrote: For 1st: from Solr Level i guess, you could select (only) the document by uniqueid. Then you have the facets for that particular document. But this results in one additional query/doc. Sent from my BlackBerry 10 smartphone. Original message From: Sheng Sent: Tuesday, August 12, 2014 23:35 To: java-user@lucene.apache.org Reply to: java-user@lucene.apache.org Subject: Questions for facets search I actually have 2 questions: 1. Is it possible to get the facet label for a particular document? The reason we want this is we'd like to allow users to see tags for each hit in addition to the taxonomy for his/her search. 2. Is it possible to re-index the facet cache without reindexing the whole lucene cache, since they are separated? We have a dynamic list of faceted fields, being able to quickly rebuild the whole facet lucene cache would be quite desirable. Again, I am using lucene 4.7, thanks in advance to your answers! Sheng - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: improve indexing speed with nomergepolicy
Using NoMergePolicy for online indexes is usually not recommended. You want to use NoMP in case where you build an index in a batch job, then in the end before the index is published you run a forceMerge or maybeMerge (with a real MergePolicy). For online indexes, i.e. indexes that are being searched while they are updated, if you use NoMP you will accumulate many segments in the index. This means higher resources consumption overall: file handles, RAM, potentially disk space, and usually results in slower searches. You may want to tweak the default MP's settings though, to not kick off a merge unless there are a large number of segments in the index. E.g. the default MP merges segments when there are 10 at the same level (i.e. roughly the same size). You can increase that. Also, do you use NRTCachingDirectory? It's usually recommended for NRT, even with default MP, since the tiny segments are merged in-memory, and your NRT reopens don't result in flushing new segments to disk. Shai On Thu, Aug 7, 2014 at 1:14 PM, Sascha Janz sascha.j...@gmx.net wrote: hi, i try to speed up our indexing process. we use SeacherManager with applydeletes to get near real time Reader. we have not really much incoming documents, but the documents must be updated from time to time and the amount of documents to be updated could be quite large. i tried some tests with NoMergePolicy and the indexing process was 25 % faster. so i think of a change in our code, to use NoMergePolicy for a specific time interval, when users are active and do a forceMerge(20) every night, which last about 2 - 5 minutes. is this a good idea? or will i perhaps get into trouble? Sascha - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
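If the batch-then-publish pattern described above fits, a minimal sketch (4.x-era APIs; dir and analyzer are placeholders, and NoMergePolicy.COMPOUND_FILES is the 4.x singleton):

// batch phase: no merging while the batch job runs
IndexWriterConfig batchConf = new IndexWriterConfig(Version.LUCENE_48, analyzer);
batchConf.setMergePolicy(NoMergePolicy.COMPOUND_FILES);
IndexWriter writer = new IndexWriter(dir, batchConf);
// ... add/update documents ...
writer.close();

// publish phase: reopen with a real MergePolicy and compact before searching
IndexWriterConfig mergeConf = new IndexWriterConfig(Version.LUCENE_48, analyzer); // default TieredMergePolicy
IndexWriter publishWriter = new IndexWriter(dir, mergeConf);
publishWriter.forceMerge(20);
publishWriter.close();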
Re: improve indexing speed with nomergepolicy
Yes, currently an MP isn't a live setting on IndexWriter, meaning you pass it at construction time and don't change it afterwards. I wonder if after LUCENE-5711 we can move MergePolicy to LiveIndexWriterConfig and fix IndexWriter to not hold on to it, but rather pull it from the config. Not sure what others think about it. Shai On Thu, Aug 7, 2014 at 5:05 PM, Jon Stewart j...@lightboxtechnologies.com wrote: Related, how does one change the MergePolicy on an IndexWriter (e.g., use NoMergePolicy during batch indexing, then change to something better once finished with batch)? It looks like the MergePolicy is set through IndexWriterConfig but I don't see a way to update an IWC on an IW. Thanks, Jon On Thu, Aug 7, 2014 at 7:37 AM, Shai Erera ser...@gmail.com wrote: Using NoMergePolicy for online indexes is usually not recommended. You want to use NoMP in case where you build an index in a batch job, then in the end before the index is published you run a forceMerge or maybeMerge (with a real MergePolicy). For online indexes, i.e. indexes that are being searched while they are updated, if you use NoMP you will accumulate many segments in the index. This means higher resources consumption overall: file handles, RAM, potentially disk space, and usually results in slower searches. You may want to tweak the default MP's settings though, to not kick off a merge unless there are a large number of segments in the index. E.g. the default MP merges segments when there are 10 at the same level (i.e. roughly the same size). You can increase that. Also, do you use NRTCachingDirectory? It's usually recommended for NRT, even with default MP, since the tiny segments are merged in-memory, and your NRT reopens don't result in flushing new segments to disk. Shai On Thu, Aug 7, 2014 at 1:14 PM, Sascha Janz sascha.j...@gmx.net wrote: hi, i try to speed up our indexing process. we use SeacherManager with applydeletes to get near real time Reader. we have not really much incoming documents, but the documents must be updated from time to time and the amount of documents to be updated could be quite large. i tried some tests with NoMergePolicy and the indexing process was 25 % faster. so i think of a change in our code, to use NoMergePolicy for a specific time interval, when users are active and do a forceMerge(20) every night, which last about 2 - 5 minutes. is this a good idea? or will i perhaps get into trouble? Sascha - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- Jon Stewart, Principal (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Sort, Search Facets
Hi Currently we do not provide the means to use a single SortedSetDVField for both faceting and sorting. You can add a SortedSetDVFacetField to a Document, then use FacetsConfig.build(), but that encodes all your dimensions under a single SSDV field. It's done for efficiency, since at search time, when you ask to count the different dimensions, we need to read a single field. It might be worth it to explore sharing the same SSDV field for both faceting and sorting, and compare the performance implications of doing that (when faceting). If you want to try it, I suggest that you look at SortedSetDocValuesReaderState and see if you can use it for this task. Shai On Tue, Jul 8, 2014 at 9:50 AM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote: Hi, I am using Lucene 4.7.2 and my primary use case for Lucene is to do three things: (a) search, (b) sort by a number of fields for the search results, and (c) facet on probably an equal number of fields (probably the most standard use cases anyway). Let us say, I have a corpus of more than 100m docs with each document having approx. 10-15 fields excluding the content (body) which will also be one of the fields. Out of 10-15, I have a requirement to have sorting enabled on all 10-15 and the facets as well. That makes a total of approx. ~45 fields to be indexed for various reasons, once for String/Long/TextField, once for SortedDocValuesField, and once for FacetField each. What will be the impact of this on the indexing operation w.r.t. the time taken as well as the extra disk space required? Will it grow linearly with the increase in the number of fields? What is the impact on the memory usage during search time? I will attempt to benchmark some of these, but if you have any experience with this, request you to share the details. Thanks, --- Thanks n Regards, Sandeep Ramesh Khanzode
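As a concrete picture of the ~3 fields per logical field mentioned above, a minimal sketch (field names and values are placeholders):

Document doc = new Document();
doc.add(new StringField("author", "lisa", Field.Store.NO));             // for search
doc.add(new SortedDocValuesField("author", new BytesRef("lisa")));      // for sorting
doc.add(new FacetField("Author", "lisa"));                              // for faceting, via the taxonomy
writer.addDocument(config.build(taxoWriter, doc));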
Re: Incremental Field Updates
Using BinaryDocValues is not recommended for all scenarios. It is a catchall alternative to the other DocValues types. I would not use it unless it makes sense for your application, even if it means that you need to re-index a document in order to update a single field. DocValues are not good for search - by search I assume you mean take a query such as apache AND lucene and find all documents which contain both terms under the same field. They are good for sorting and faceting though. So I guess the answer to your question is it depends (it always is!) - I would use DocValues for sorting and faceting, but not for regular search queries. And I would use BinaryDocValues only when the other DocValues types don't match. Also, note that the current field-level update of DocValues is not always better than re-indexing the document, you can read here for more details: http://shaierera.blogspot.com/2014/04/benchmarking-updatable-docvalues.html Shai On Tue, Jul 1, 2014 at 9:17 PM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote: Hi Shai, So one follow-up question. Assume that my use case is to have approx. ~50M documents indexed with each document having about ~10-15 indexed but not stored fields. These fields will never change, but there are another ~5-6 fields that will change and will continue to change after the index is written. These ~5-6 fields may also be multivalued. The size of this index turns out to be ~120GB. In this case, I would like to sort or facet or search on these ~5-6 fields. Which approach do you suggest? Should I use BinaryDocValues and update using IW or use either a ParallelReader/Join query. --- Thanks n Regards, Sandeep Ramesh Khanzode On Tuesday, July 1, 2014 9:53 PM, Shai Erera ser...@gmail.com wrote: Except that Lucene now offers efficient numeric and binary DocValues updates. See IndexWriter.updateNumeric/Binary... On Jul 1, 2014 5:51 PM, Erick Erickson erickerick...@gmail.com wrote: This JIRA is complicated, don't really expect it in 4.9 as it's been hanging around for quite a while. Everyone would like this, but it's not easy. Atomic updates will work, but you have to stored=true for all source fields. Under the covers this actually reads the document out of the stored fields, deletes the old one and adds it over again. FWIW, Erick On Tue, Jul 1, 2014 at 5:32 AM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote: Hi, I wanted to know of the best approach to follow if a few fields in my indexed documents are changing at run time (after index and before or during search), but a majority of them are created at index time. I could see the JIRA given below but it is scheduled for Lucene 4.9, I believe. There are a few other approaches, like maintaining a separate index for changing fields and use either a parallelreader or use a Join. Can everyone share their experience for this scenario on how it is handled in your systems? Thanks, [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF JIRA [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF JIRA Shai and I would like to start working on the proposal to Incremental Field Updates outlined here ( http://markmail.org/message/zhrdxxpfk6qvdaex ). View on issues.apache.org Preview by Yahoo --- Thanks n Regards, Sandeep Ramesh Khanzode - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
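For reference, the field-level DocValues update API mentioned above looks like this (the term, field names and values here are made up):

// update the "price" NumericDocValues of all documents matching the term,
// without re-indexing those documents
writer.updateNumericDocValue(new Term("id", "doc-17"), "price", 999L);
writer.updateBinaryDocValue(new Term("id", "doc-17"), "payload", new BytesRef("..."));
writer.commit();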
Re: Incremental Field Updates
Except that Lucene now offers efficient numeric and binary DocValues updates. See IndexWriter.updateNumeric/Binary... On Jul 1, 2014 5:51 PM, Erick Erickson erickerick...@gmail.com wrote: This JIRA is complicated, don't really expect it in 4.9 as it's been hanging around for quite a while. Everyone would like this, but it's not easy. Atomic updates will work, but you have to stored=true for all source fields. Under the covers this actually reads the document out of the stored fields, deletes the old one and adds it over again. FWIW, Erick On Tue, Jul 1, 2014 at 5:32 AM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote: Hi, I wanted to know of the best approach to follow if a few fields in my indexed documents are changing at run time (after index and before or during search), but a majority of them are created at index time. I could see the JIRA given below but it is scheduled for Lucene 4.9, I believe. There are a few other approaches, like maintaining a separate index for changing fields and use either a parallelreader or use a Join. Can everyone share their experience for this scenario on how it is handled in your systems? Thanks, [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF JIRA [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF JIRA Shai and I would like to start working on the proposal to Incremental Field Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex ). View on issues.apache.org Preview by Yahoo --- Thanks n Regards, Sandeep Ramesh Khanzode - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Lucene Facets Module 4.8.1
There is no sample code for doing that but it's quite straightforward - if you know you indexed some dimensions under different indexFieldNames, initialize a FacetCounts per such field name, e.g.:

FastTaxonomyFacetCounts defaultCounts = new FastTaxonomyFacetCounts(...); // for your regular facets
FastTaxonomyFacetCounts cityCounts = new FastTaxonomyFacetCounts(...); // for your CITY facets

Something like that... Shai On Mon, Jun 23, 2014 at 9:04 AM, Jigar Shah jigaronl...@gmail.com wrote: On commenting out //config.setIndexFieldName("CITY", "city"); at search time, before i do getTopChildren(...), I get the following exception.

Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.count(FastTaxonomyFacetCounts.java:74) [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:49) [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:39) [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
at org.apache.lucene.facet.DrillSideways.buildFacetsResult(DrillSideways.java:110) [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
at org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:177) [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
at org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:203) [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]

Application level exceptions. ... ... On Sat, Jun 21, 2014 at 10:56 PM, Michael McCandless luc...@mikemccandless.com wrote: Are you sure it's the same FacetsConfig at search time? Because the exception implies your CITY field didn't have config.setIndexFieldName("CITY", "city") called. Or, can you try commenting out 'config.setIndexFieldName("CITY", "city")' at index time and see if the exception still happens? Mike McCandless http://blog.mikemccandless.com On Sat, Jun 21, 2014 at 1:08 AM, Jigar Shah jigaronl...@gmail.com wrote: Thanks for helping me. Yes, i did couple of things: Below is the simple code for indexing which i use.

TrackingIndexWriter nrtWriter = ...
DirectoryTaxonomyWriter taxoWriter = ...
FacetsConfig config = new FacetsConfig();
config.setHierarchical("CITY", true);
config.setMultiValued("CITY", true);
config.setIndexFieldName("CITY", "city"); // I kept dimName different from indexFieldName
// Added indexing searchable fields...
doc.add(new FacetField("CITY", "India", "Gujarat", "Vadodara"));
doc.add(new FacetField("CITY", "India", "Gujarat", "Ahmedabad"));
nrtWriter.addDocument(config.build(taxoWriter, doc));

Below is the code which i use for searching:

TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoWriter);
Query query = ...
IndexSearcher searcher = ...
DrillDownQuery ddq = new DrillDownQuery(config, query);
DrillSideways ds = new DrillSideways(searcher, config, taxoReader); // Config object is same which i created before
DrillSidewaysResult result = ds.search(query, null, null, start + limit, null, true, true);
...
Facets f = result.facets;
FacetResult fr = f.getTopChildren(5, "CITY"); // [Exception is generated] Didn't perform any drill-down, really; it's just the original query for the first time, but wrapped in DrillDownQuery.

... and below gives me an empty collection.

List<FacetResult> frs = f.getAllDims(5);

I debugged the source code and found that it internally calls FastTaxonomyFacetCounts(indexFieldName, taxoReader, config) // Config object is same which i created before which then calls IntTaxonomyFacets(indexFieldName, taxoReader, config) // Config object is same which i created before And during these calls the value of indexFieldName is "$facets", defined by the constant 'public static final String DEFAULT_INDEX_FIELD_NAME = "$facets";' in FacetsConfig. My question is: if i am using the same FacetsConfig while indexing and searching, why is it not identifying the correct name of the field, and goes for "$facets"? Please correct me if i understood wrong, or suggest the correct way to solve the above problem. Many Thanks. Jigar Shah. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
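Spelled out for this thread's CITY setup, the per-field counting Shai describes looks roughly like this (fc is a FacetsCollector that already collected a search):

FacetsCollector fc = new FacetsCollector();
searcher.search(query, fc);
Facets defaultFacets = new FastTaxonomyFacetCounts(taxoReader, config, fc);          // dimensions under "$facets"
Facets cityFacets = new FastTaxonomyFacetCounts("city", taxoReader, config, fc);     // the CITY dimension
FacetResult topCities = cityFacets.getTopChildren(5, "CITY");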
Re: Lucene Facets Module 4.8.1
Basically, it's not very common to change the indexFieldName. You should do that in case you e.g. count facets in groups of dimensions, rather than counting all of them. So for example, if you have 20 dimensions, but you know you only count d1-d5, d6-d12 and d13-d20, then separating them into 3 different indexFieldNames will probably improve performance. But if you can't make such a decision, it's better to not modify this. When you initialize a FacetCounts, it counts all the dimensions that are indexed under that indexFieldName, so if you need the counts of all of them, or the majority of them, that's ok. But if you know you *always* need the count of a subset of them, then separating that subset to a different field is better. Hope that clarifies. Shai On Mon, Jun 23, 2014 at 4:18 PM, Jigar Shah jigaronl...@gmail.com wrote: Thanks this worked for me :) Is there any advantage of indexing some facets while not providing any indexFieldName? Thanks On Mon, Jun 23, 2014 at 12:55 PM, Shai Erera ser...@gmail.com wrote: There is no sample code for doing that but it's quite straightforward - if you know you indexed some dimensions under different indexFieldNames, initialize a FacetCounts per such field name, e.g.:

FastTaxonomyFacetCounts defaultCounts = new FastTaxonomyFacetCounts(...); // for your regular facets
FastTaxonomyFacetCounts cityCounts = new FastTaxonomyFacetCounts(...); // for your CITY facets

Something like that... Shai On Mon, Jun 23, 2014 at 9:04 AM, Jigar Shah jigaronl...@gmail.com wrote: On commenting out //config.setIndexFieldName("CITY", "city"); at search time, before i do getTopChildren(...), I get the following exception.

Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.count(FastTaxonomyFacetCounts.java:74) [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:49) [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:39) [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
at org.apache.lucene.facet.DrillSideways.buildFacetsResult(DrillSideways.java:110) [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
at org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:177) [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
at org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:203) [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]

Application level exceptions. ... ... On Sat, Jun 21, 2014 at 10:56 PM, Michael McCandless luc...@mikemccandless.com wrote: Are you sure it's the same FacetsConfig at search time? Because the exception implies your CITY field didn't have config.setIndexFieldName("CITY", "city") called. Or, can you try commenting out 'config.setIndexFieldName("CITY", "city")' at index time and see if the exception still happens? Mike McCandless http://blog.mikemccandless.com On Sat, Jun 21, 2014 at 1:08 AM, Jigar Shah jigaronl...@gmail.com wrote: Thanks for helping me. Yes, i did couple of things: Below is the simple code for indexing which i use.

TrackingIndexWriter nrtWriter = ...
DirectoryTaxonomyWriter taxoWriter = ...
FacetsConfig config = new FacetsConfig();
config.setHierarchical("CITY", true);
config.setMultiValued("CITY", true);
config.setIndexFieldName("CITY", "city"); // I kept dimName different from indexFieldName
// Added indexing searchable fields...
doc.add(new FacetField("CITY", "India", "Gujarat", "Vadodara"));
doc.add(new FacetField("CITY", "India", "Gujarat", "Ahmedabad"));
nrtWriter.addDocument(config.build(taxoWriter, doc));

Below is the code which i use for searching:

TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoWriter);
Query query = ...
IndexSearcher searcher = ...
DrillDownQuery ddq = new DrillDownQuery(config, query);
DrillSideways ds = new DrillSideways(searcher, config, taxoReader); // Config object is same which i created before
DrillSidewaysResult result = ds.search(query, null, null, start + limit, null, true, true);
...
Facets f = result.facets;
FacetResult fr = f.getTopChildren(5, "CITY"); // [Exception is generated] Didn't perform any drill-down, really; it's just the original query for the first time, but wrapped in DrillDownQuery.

... and below gives me an empty collection.

List<FacetResult> frs = f.getAllDims(5);

I debug source code and found
Re: A question about FacetField constructor
What do you mean by "does not index anything"? Do you get an exception when you add a String[] with more than one element? You should probably call conf.setHierarchical(dimension), but if you don't do that you should receive an IllegalArgumentException telling you to do that... Shai On Sun, Jun 22, 2014 at 6:34 AM, west suhanic west.suha...@gmail.com wrote: Hello All: I am building sample code using lucene v4.8.1 to explore the new facet API. The problem I am having is that if I pass a populated string array nothing gets indexed, while if I pass only the first element of the string array that value gets indexed. The code found below shows the case that works and the case that does not work. What am I doing wrong? Start of code sample*

void showStuff(String... va) {
  /** This code prints out the contents of va successfully. **/
  for (int ii = 0; ii < va.length; ii++)
    System.out.println("value[" + ii + "] " + va[ii]);
}

for (final Map<String, String[]> fd : allFacetData) {
  final Document doc = new Document();
  for (final Map.Entry<String, String[]> entry : fd.entrySet()) {
    final String key = entry.getKey();
    String[] value = entry.getValue();
    showStuff(value);
    /** This call indexes successfully **/
    final FacetField newFF = new FacetField(key, value[0]);
    /**
     * This call will not index anything if the value String array
     * has more than one element.
     * final FacetField newFF = new FacetField(key, value);
     */
    doc.add(newFF);
  }
  try {
    final Document theBuildDoc = configFacetsHandle.build(taxoWriter, doc);
    indexWriter.addDocument(theBuildDoc);
    indexWriter.addDocument(configFacetsHandle.build(taxoWriter, doc));
  } catch (IOException ioe) {
    eMsg.append(method);
    eMsg.append(" failed with the exception ");
    eMsg.append(ioe.toString());
    return constantValuesInterface.FAILURE;
  }
}

***End of code sample*** regards, West Suhanic
Re: A question about FacetField constructor
Reply wasn't sent to the list. On Jun 22, 2014 8:15 PM, Shai Erera ser...@gmail.com wrote: Can you post an example which demonstrates the problem? It's also interesting how you count the facets, eg do you use a TaxonomyFacets object or something else? Have you looked at the facet demo code? It contains examples for using hierarchical facets. Shai On Jun 22, 2014 8:08 PM, west suhanic west.suha...@gmail.com wrote: Hello: What do you mean by does not index anything? When I do a search the value returned for the dim set to Publish Date is null. If I pass through value[0] the publish date year is returned by the search. setHierarchical was called. When a String[] with more than one element is passed an exception is not thrown. I am open to all suggestions as to what I am missing. regards, west suhanic On Sun, Jun 22, 2014 at 3:23 AM, Shai Erera ser...@gmail.com wrote: What do you mean by does not index anything? Do you get an exception when you add a String[] with more than one element? You should probably call conf.setHierarchical(dimension), but if you don't do that you should receive an IllegalArgumentException telling you to do that... Shai On Sun, Jun 22, 2014 at 6:34 AM, west suhanic west.suha...@gmail.com wrote: Hello All: I am building sample code using lucene v4.8.1 to explore the new facet API. The problem I am having is that if I pass a populated string array nothing gets indexed while if I pass only the first element of the string array that value gets indexed. The code found below shows the case that works and the case that does not work. What am I doing wrong? Start of code sample* void showStuff( String... va ) { /** This code permits out the contents of va successfully.**/ for( int ii = 0 ; ii va.length ; ii++ ) System.out.println( value[ + ii + ] + va[ii] ); } for( final Map String, String[] fd : allFacetData ) { final Document doc = new Document(); for( final Map.Entry String, String[] entry : fd.entrySet() ) { final String key = entry.getKey(); String[] value = entry.getValue(); showStuff( value ); /** This call indexes successfully **/ final FacetField newFF = new FacetField( key, value[0] ); /** * This call will not index anything if the value String array * has more than one element. *final FacetField newFF = new FacetField( key, value ); */ doc.add( newFF ); } try { final Document theBuildDoc = configFacetsHandle. build( taxoWriter, doc ); indexWriter.addDocument( theBuildDoc ); indexWriter.addDocument( configFacetsHandle.buil d( taxoWriter, doc ) ); } catch( IOException ioe ) { eMsg.append( method ); eMsg.append( failed with the exception ); eMsg.append( ioe.toString() ); return constantValuesInterface.FAILURE; } } ***End of code sample*** regards, West Suhanic
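One point worth making explicit for this thread: FacetField's String... argument is a single hierarchical *path*, not a list of independent values, and a multi-element path requires the dimension to be declared hierarchical. A minimal sketch using the Publish Date dimension from the demo (the date components are placeholders):

FacetsConfig config = new FacetsConfig();
config.setHierarchical("Publish Date", true);
Document doc = new Document();
doc.add(new FacetField("Publish Date", "2014", "6", "22")); // one path: 2014/6/22
writer.addDocument(config.build(taxoWriter, doc));
// for multiple independent values, call setMultiValued and add one FacetField per value instead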
Re: Lucene Facets Module 4.8.1
If you can, while in debug mode try to note the instance ID of the FacetsConfig, and assert it is indeed the same (i.e. indexConfig == searchConfig). Shai On Sat, Jun 21, 2014 at 8:26 PM, Michael McCandless luc...@mikemccandless.com wrote: Are you sure it's the same FacetsConfig at search time? Because the exception implies your CITY field didn't have config.setIndexFieldName(CITY, city) called. Or, can you try commenting out 'config.setIndexFieldName(CITY, city)' at index time and see if the exception still happens? Mike McCandless http://blog.mikemccandless.com On Sat, Jun 21, 2014 at 1:08 AM, Jigar Shah jigaronl...@gmail.com wrote: Thanks for helping me. Yes, i did couple of things: Below is simple code for indexing which i use. TrackingIndexWriter nrtWriter DirectoryTaxonomyWriter taxoWriter = ... FacetsConfig config = new FacetConfig(); config.setHierarchical(CITY, true) config.setMultiValued(CITY, true); config.setIndexFieldName(CITY,city) // I kept dimName different from indexFieldName Added indexing searchable fields... doc.add( new FacetField(CITY, India, Gujarat, Vadodara )) doc.add( new FacetField(CITY, India, Gujarat, Ahmedabad )) nrtWriter.addDocument(config.build(taxoWriter, doc)); Below is code which i use for searching TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoWriter); Query query = ... IndexSearcher searcher = ... DrillDownQuery ddq = new DrillDownQuery(config, query); DrillSideways ds = new DrillSideways(searcher, config, taxoReader); // Config object is same which i created before DrillSidewaysResult result = ds.search(query, null, null, start + limit, null, true, true) ... Facets f = result.facets FacetResult fr = f.getTopChildren(5, CITY) [Exception is geneated]// Didn't perform any drill-down,really, its just original query for first time, but wrapped in DrillDownQuery. ... and below gives me empty collection. ListFacetResult frs= f.getAllDims(5) I debug source code and found, it internally calls FastTaxonomyFacetCounts(indexFieldName, taxoReader, config) // Config object is same which i created before which then calls IntTaxonomyFacets(indexFieldName, taxoReader, config) // Config object is same which i created before And during this calls the value of indexFieldName is $facets defined by constant 'public static final String DEFAULT_INDEX_FIELD_NAME = $facets;' in FacetsConfig. My question is if i am using same FacetsConfig while indexing and searching. why its not identifying correct name of field, and goes for $facets Please correct me if i understood wrong. or correct way to solve above problem. Many Thanks. Jigar Shah. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Lucene Facets Module 4.8.1
How do you add facets to your documents? Did you play with the FacetsConfig, such as altering the field under which the CITY dimension is indexed? If you can reproduce this failure in a simple program, I guess it will be easy to spot the error. Looks like a configuration error to me... Shai On Fri, Jun 20, 2014 at 3:12 PM, Jigar Shah jigaronl...@gmail.com wrote: Hello, I am getting the below exception while using DrillSideways facets. While getting children i am getting the below exception:

17:02:10,496 ERROR [stderr:71] (Thread-2 (HornetQ-client-global-threads-790878673)) java.lang.IllegalArgumentException: dimension "CITY" was not indexed into field "$facets"
17:02:10,500 ERROR [stderr:71] (Thread-2 (HornetQ-client-global-threads-790878673)) at org.apache.lucene.facet.taxonomy.TaxonomyFacets.verifyDim(TaxonomyFacets.java:80)
17:02:10,503 ERROR [stderr:71] (Thread-2 (HornetQ-client-global-threads-790878673)) at org.apache.lucene.facet.taxonomy.IntTaxonomyFacets.getTopChildren(IntTaxonomyFacets.java:95)

I have used the TestDrillSideways.java test case to understand the concept. Is there any mistake in creating the FacetsConfig object, or did I configure something wrong? Thanks,
Re: SortingMergePolicy for already sorted segments
I am afraid the DocMap still maintains doc-id mappings till merge and I am trying to avoid it... What do you mean 'till merge'? The method OneMerge.getMergeReaders() is called only when the merge is executed, not when the MergePolicy decided to merge those segments. Therefore the DocMap is initialized only when the merge actually executes ... what is there more to postpone? And besides, if the segments are already sorted, you should return a null DocMap, like Lucene code does ... If I miss your point, I'd appreciate if you can point me to a code example, preferably in Lucene source, which demonstrates the problem. Shai On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: I am afraid the DocMap still maintains doc-id mappings till merge and I am trying to avoid it... I think lucene itself has a MergeIterator in the o.a.l.util package. A MergePolicy can wrap a simple MergeIterator for iterating docs across different AtomicReaders in correct sort-order for a given field/term. That should be fine right? -- Ravi -- Ravi On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera ser...@gmail.com wrote: loadSortTerm is your method right? In the current Sorter.sort implementation, I see this code:

boolean sorted = true;
for (int i = 1; i < maxDoc; ++i) {
  if (comparator.compare(i-1, i) > 0) {
    sorted = false;
    break;
  }
}
if (sorted) {
  return null;
}

Perhaps you can write similar code? Also note that the sorting interface has changed, I think in 4.8, and now you don't really need to implement a Sorter, but rather pass a SortField, if that works for you. Shai On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: Shai, This is the code snippet I use inside my class...

public class MySorter extends Sorter {
  @Override
  public DocMap sort(AtomicReader reader) throws IOException {
    final Map<Integer, BytesRef> docVsId = loadSortTerm(reader);
    final Sorter.DocComparator comparator = new Sorter.DocComparator() {
      @Override
      public int compare(int docID1, int docID2) {
        BytesRef v1 = docVsId.get(docID1);
        BytesRef v2 = docVsId.get(docID2);
        return v1.compareTo(v2);
      }
    };
    return sort(reader.maxDoc(), comparator);
  }
}

My problem is, the AtomicReader passed to the Sorter.sort method is actually a SlowCompositeReader, composed of a list of AtomicReaders each of which is already sorted. I find this loadSortTerm(compositeReader) to be a bit heavy, where it tries to load all the doc-to-term mappings eagerly... Are there some alternatives for this? -- Ravi On Tue, Jun 17, 2014 at 10:58 AM, Shai Erera ser...@gmail.com wrote: I'm not sure that I follow ... where do you see DocMap being loaded up front? Specifically, Sorter.sort may return null if the readers are already sorted ... I think we already optimized for the case where the readers are sorted. Shai On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: I am planning to use SortingMergePolicy where all the merge-participating segments are already sorted... I understand that I need to define a DocMap with old-new doc-id mappings. Is it possible to optimize the eager loading of DocMap and make it a kind of lazy load on-demand? Ex: Pass List<AtomicReader> to the caller and ask for the next new-old doc mapping... Since my segments are already sorted, I could save on memory a little bit this way, instead of loading the full DocMap upfront -- Ravi
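For reference, the 4.8 API Shai mentions lets you pass a Sort instead of implementing a Sorter; a minimal sketch (the timestamp field is a placeholder):

Sort sort = new Sort(new SortField("timestamp", SortField.Type.LONG));
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, analyzer);
iwc.setMergePolicy(new SortingMergePolicy(iwc.getMergePolicy(), sort)); // wraps the default policy
IndexWriter writer = new IndexWriter(dir, iwc);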
Re: Facets in Lucene 4.7.2
Hi 40 seconds for faceted search is ... crazy. Also, note how the times don't differ much even though the number of hits is much higher (29K vs 15.1M) ... That, together with the fact that subsequent queries are much faster (a few seconds), suggests that something is seriously messed up w/ your environment. Maybe it's a faulty disk? E.g. after the file system cache is warm, you no longer hit the disk? In general, the more hits you have, the more expensive faceted search is. It's also true for scoring (i.e. even without facets). There's just more work to determine the top results (docs, facets...). With facets, you can use sampling (see RandomSamplingFacetsCollector), but I would do that only after you verify that collecting 15M docs is very expensive for you, even when the file system cache is hot. I've never seen those numbers before, therefore it's difficult for me to relate to them. There's a caching mechanism for facets, through CachedOrdinalsReader. But I wouldn't go there until you verify that your IO system is good (try another machine, OS, disk ...), and that the 40s times are truly from the faceting code. Shai On Tue, Jun 17, 2014 at 4:21 PM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote: Hi, Thanks again! This time, I have indexed data with the following specs. I run into 40 seconds for the FastTaxonomyFacetCounts to create all the facets. Is this as per your measurements? Subsequent runs fare much better probably because of the Windows file system cache. How can I speed this up? I believe there was a CategoryListCache earlier. Is there any cache or other implementation that I can use? Secondly, I had a general question. If I extrapolate these numbers for a billion documents, my search and facet numbers may probably be unusable in a real time scenario. What are the strategies employed when you deal with such large scale? I am new to Lucene so please also direct me to the relevant info sources. Thanks!

Corpus: Count: 20M, Size: 51GB
Index: Size (w/o Facets): 19GB, Size (w/ Facets): 20.12GB
Creation Time (w/o Facets): 3.46hrs, Creation Time (w/ Facets): 3.49hrs
Search Performance:
With 29055 hits (5 terms in query): Query Execution: 8 seconds; Facet counts execution: 40-45 seconds
With 4.22M hits (2 terms in query): Query Execution: 3 seconds; Facet counts execution: 42-46 seconds
With 15.1M hits (1 term in query): Query Execution: 2 seconds; Facet counts execution: 45-53 seconds
With 6183 hits (5 different values for the same 5 terms) (without flushing the Windows file cache on the next run): Query Execution: 11 seconds; Facet counts execution: 1 second
With 4.9M hits (1 different value for the 1 term) (without flushing the Windows file cache on the next run): Query Execution: 2 seconds; Facet counts execution: 3 seconds

--- Thanks n Regards, Sandeep Ramesh Khanzode On Monday, June 16, 2014 8:11 PM, Shai Erera ser...@gmail.com wrote: Hi 1.] Is there any API that gives me the count of a specific dimension from FacetCollector in response to a search query. Currently, I use the getTopChildren() with some value and then check the FacetResult object for the actual number of dimensions hit along with their occurrences. Also, the getSpecificValue() does not work without a path attribute to the API. To get the value of the dimension itself, you should call getTopChildren(1, dim). Note that getSpecificValue does not allow to pass only the dimension, and getTopChildren requires topN to be > 0.
Passing 1 is a hack, but I'm not sure we should specifically support getting the aggregated value of just the dimension ... once you get that, the FacetResult.value tells you the aggregated count. 2.] Can I find the MAX or MIN value of a Numeric type field written to the index? Depends how you index them. If you index the field as a numeric field (e.g. LongField), I believe you can use NumericUtils.getMaxLong. If it's a DocValues field, I don't know of a built-in function that does it, but this thread has a demo code: http://www.gossamer-threads.com/lists/lucene/java-user/195594. 3.] I am trying to compare and contrast Lucene Facets with Elastic Search. I could determine that ES does search time faceting and dynamically returns the response without any prior faceting during indexing time. Is index time lag is not my concern, can I assume that, in general, performance-wise Lucene facets would be faster? I will start by saying that I don't know much about how ES facets work. We have some committers who know both how Lucene and ES facets work, so they can comment on that. But I personally don't think there's no index-time decision when it comes to faceting. Well
Re: SortingMergePolicy for already sorted segments
OK I think I now understand what you're asking :). It's unrelated though to SortingMergePolicy. You propose to do the merge part of a merge-sort, since we know the indexes are already sorted, right? This is something we've considered in the past, but it is very tricky (see below) and we went with the SortingAR for simplicity and speed of coding. If however you have an idea how we can easily implement that, that would be awesome. So let's consider merging the posting lists of f:val from the N readers. Say that each returns docs 0-3, and the merged posting will have 4*N entries (say we don't have deletes). To properly merge them, you need to lookup the sort-value of each document from each reader, and compare according to it. Now you move on to f:val2 (another posting) and it wants to merge 100 other docs. So you need to lookup the value of each document, compare by it, and merge them. And the process continues ... These lookups are expensive and will be done millions of times (each term, each DV field, each ... everything). More than that, there's a serious issue of correctness, because you never make a global sorting decision. So if f:val sees only a single document - 0, in all segments, you want to map them to 4 GLOBALLY SORTED documents. If you make a local decision based on these 4 documents, you will end up w/ a completely messed up segment. I think the global DocMap is really required. Forget about the fact that other code, e.g. IndexWriter, relies on this in order to properly apply incoming document deletions and field updates while the segments were merging. It's just a matter of correctness - we need to know the global sorted segment map. Shai On Tue, Jun 17, 2014 at 3:41 PM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: Therefore the DocMap is initialized only when the merge actually executes ... what is there more to postpone? Agreed. However, what I am asking is, if there is an alternative to DocMap, will that be better? Plz read-on And besides, if the segments are already sorted, you should return a null DocMap, like Lucene code does ... What I am trying to say is, my individual segments are sorted. However, when a merge combines N individual sorted-segments, there needs to be a global sort-order for writing the new segment. Passing a null DocMap won't work here, no? DocMap is one way of bringing the global order during a merge. Another way is to use something like a MergedIteratorSegmentReader instead of DocMap, which doesn't need any memory. I was trying to get a heads-up on these 2 approaches. Please do let me know if I have understood correctly -- Ravi On Tue, Jun 17, 2014 at 5:42 PM, Shai Erera ser...@gmail.com wrote: I am afraid the DocMap still maintains doc-id mappings till merge and I am trying to avoid it... What do you mean 'till merge'? The method OneMerge.getMergeReaders() is called only when the merge is executed, not when the MergePolicy decided to merge those segments. Therefore the DocMap is initialized only when the merge actually executes ... what is there more to postpone? And besides, if the segments are already sorted, you should return a null DocMap, like Lucene code does ... If I miss your point, I'd appreciate if you can point me to a code example, preferably in Lucene source, which demonstrates the problem. Shai On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: I am afraid the DocMap still maintains doc-id mappings till merge and I am trying to avoid it...
I think Lucene itself has a MergeIterator in the o.a.l.util package. A MergePolicy can wrap a simple MergeIterator for iterating docs across different AtomicReaders in correct sort-order for a given field/term. That should be fine right? -- Ravi On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera ser...@gmail.com wrote: loadSortTerm is your method right? In the current Sorter.sort implementation, I see this code:

boolean sorted = true;
for (int i = 1; i < maxDoc; ++i) {
  if (comparator.compare(i-1, i) > 0) {
    sorted = false;
    break;
  }
}
if (sorted) {
  return null;
}

Perhaps you can write similar code? Also note that the sorting interface has changed, I think in 4.8, and now you don't really need to implement a Sorter, but rather pass a SortField, if that works for you. Shai On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: Shai, This is the code snippet I use inside my class...

public class MySorter extends Sorter {
  @Override
  public DocMap sort(AtomicReader reader) throws IOException {
    final Map<Integer, BytesRef> docVsId = loadSortTerm(reader
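As an aside to the SortField suggestion above, a minimal sketch of the newer API (Lucene 4.8-era; SortingMergePolicy lives in the misc module, and the field name "sortkey" is just an example):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.index.sorter.SortingMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.util.Version;

public class SortedMergesSketch {
  public static IndexWriterConfig sortedMergesConfig(Version matchVersion, Analyzer analyzer) {
    // segments produced by merges will come out sorted by this field
    Sort sort = new Sort(new SortField("sortkey", SortField.Type.LONG));
    IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, analyzer);
    iwc.setMergePolicy(new SortingMergePolicy(new TieredMergePolicy(), sort));
    return iwc;
  }
}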
Re: SortingMergePolicy for already sorted segments
That said ... if we generate the global DocMap up front, there's no reason not to execute the merge of the segments more efficiently, i.e. without wrapping them in a SlowCompositeReaderWrapper. But that's not a job for SortingMergePolicy; it's either a special SortingAtomicReader which wraps a group of readers + a global DocMap, and then merge-sorts them more efficiently than how it's done now, or we tap into SegmentMerger .. which is way more complicated. Perhaps it would be worth exploring a SortingMultiSortedAtomicReader which merge-sorts the postings and other data that way ... I'd look at e.g. how doc-values are merged .. not sure it will improve performance. But if you want to cons up a patch, that'd be awesome! Shai
Re: Facets in Lucene 4.7.2
Nothing suspicious ... code looks fine. The call to FastTaxoFacetCounts actually computes the counts ... that's the expensive part of faceted search. How big is your taxonomy (number of categories)? Is it hierarchical (i.e. are your dimensions flat, or deep like A/1/2/3/)? What does your FacetsConfig look like? Still - well, maybe if your taxonomy is huge (hundreds of millions of categories) - I don't think you can even intentionally mess something up badly enough to end up w/ 40-45s response times! Shai On Tue, Jun 17, 2014 at 8:51 PM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote: Hi, Thanks for your response. It does sound pretty bad, which is why I am not sure whether there is an issue with the code, the index, the searcher, or just the machine, as you say. I will try with another machine just to make sure and post the results. Meanwhile, can you tell me if there is anything wrong in the below measurement? Or is the API usage or the pattern incorrect? I used a tool called RAMMap to clean the Windows cache. If I do not, the results are very fast as I mentioned already. If I do, then the total time is 40s. Can you please provide any pointers on what could be wrong? I will be checking on a Linux box anyway.

=
System.out.println("1. Start Date: " + new Date());
TopDocs topDocs = FacetsCollector.search(searcher, query, 100, fc);
System.out.println("1. End Date: " + new Date());
// Above part takes approx 2-12 seconds depending on the query

System.out.println("2. Start Date: " + new Date());
List<FacetResult> results = new ArrayList<FacetResult>();
Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
System.out.println("2. End Date: " + new Date());
// Above part takes approx 40-53 seconds depending on the query, for the first time on Windows

System.out.println("3. Start Date: " + new Date());
results.add(facets.getTopChildren(1000, "F1"));
results.add(facets.getTopChildren(1000, "F2"));
results.add(facets.getTopChildren(1000, "F3"));
results.add(facets.getTopChildren(1000, "F4"));
results.add(facets.getTopChildren(1000, "F5"));
results.add(facets.getTopChildren(1000, "F6"));
results.add(facets.getTopChildren(1000, "F7"));
System.out.println("3. End Date: " + new Date());
// Above part takes approx less than 1 second
=

--- Thanks n Regards, Sandeep Ramesh Khanzode On Tuesday, June 17, 2014 10:15 PM, Shai Erera ser...@gmail.com wrote: Hi 40 seconds for faceted search is ... crazy. Also, note how the times don't differ much even though the number of hits is much higher (29K vs 15.1M) ... That, plus the fact that you say subsequent queries are much faster (a few seconds), suggests that something is seriously messed up w/ your environment. Maybe it's a faulty disk? E.g. after the file system cache is warm, you no longer hit the disk? In general, the more hits you have, the more expensive faceted search is. The same is true for scoring as well (i.e. even without facets): there's just more work to determine the top results (docs, facets...). With facets, you can use sampling (see RandomSamplingFacetsCollector), but I would do that only after you verify that collecting 15M docs is very expensive for you, even when the file system cache is hot. I've never seen those numbers before, therefore it's difficult for me to relate to them. There's a caching mechanism for facets, through CachedOrdinalsReader. But I wouldn't go there until you verify that your IO system is good (try another machine, OS, disk ...), and that the 40s times are truly from the faceting code.
Shai On Tue, Jun 17, 2014 at 4:21 PM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote: Hi, Thanks again! This time, I have indexed data with the following specs. I run into 40 seconds for the FastTaxonomyFacetCounts to create all the facets. Is this as per your measurements? Subsequent runs fare much better, probably because of the Windows file system cache. How can I speed this up? I believe there was a CategoryListCache earlier. Is there any cache or other implementation that I can use? Secondly, I had a general question. If I extrapolate these numbers to a billion documents, my search and facet numbers would probably be unusable in a real-time scenario. What are the strategies employed when you deal with such large scale? I am new to Lucene so please also direct me to the relevant info sources. Thanks!

Corpus: Count: 20M, Size: 51GB
Index: Size (w/o Facets): 19GB, Size (w/ Facets): 20.12GB
Creation Time (w/o Facets): 3.46hrs, Creation Time (w/ Facets): 3.49hrs

Search Performance:
With 29055 hits (5 terms in query):
Query Execution: 8 seconds
Facet counts execution: 40-45 seconds
With 4.22M hits (2 terms in query
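As a concrete illustration of the sampling suggestion above, a minimal sketch against the 4.7-era facet API; the sample size of 10,000 and the dimension name "F1" are arbitrary examples, and searcher, query, taxoReader and config come from the surrounding application:

// collect a random sample of at most 10,000 matching docs instead of all of them
RandomSamplingFacetsCollector fc = new RandomSamplingFacetsCollector(10000);
FacetsCollector.search(searcher, query, 100, fc);
Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
FacetResult result = facets.getTopChildren(10, "F1"); // sampled, i.e. approximate, counts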
Re: Facets in Lucene 4.7.2
You can get the size of the taxonomy by calling taxoReader.getSize(). What does the 28K of the $facets field denote - the number of terms (drill-down)? If so, that sounds like your taxonomy is of that size. And indeed, this is a tiny taxonomy ... How many facets do you record per document? This also affects the amount of IO that's done during search, as we traverse the BinaryDocValues field, reading the categories of each document. Shai On Tue, Jun 17, 2014 at 9:32 PM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote: If I am counting correctly, the $facets field in the index shows a count of approx. 28k. That does not sound like much, I guess. All my facets are flat and the FacetsConfig only defines a couple of them to be multi-valued. Let me know if I am not counting the taxonomy size correctly. The taxoReader.getSize() also shows this count. I will check on a Linux box to make sure. Thanks, --- Thanks n Regards, Sandeep Ramesh Khanzode
Re: Lucene 4.8.1 - Taxonomy
Err ... are you sure there's an index in the directory that you point Luke at? I see that the exception points to . which suggests the local directory from where Luke was run. There's nothing special about the taxonomy index, as far as Luke is concerned. However, note that I do not recommend trying to alter the taxonomy index via Luke in any way, as its structure is very specific and things rely on it. It's not a usual index ... i.e. there's no point trying to search it or something like that. Shai On Mon, Jun 16, 2014 at 9:35 AM, Mrugesh Patel mrugesh.pa...@infodesk.com wrote: Hi, I would like to open taxonomy indices in a tool (like Luke). Please could you help? Currently I am able to open other lucene indices in Luke 4.8.1 but unable to open taxonomy indices. When I try to open taxonomy indices in Luke 4.8.1 then it shows org.apache.lucene.index.IndexNotFoundException: no segments* file found in . exception. Please help. Thanks, Mrugesh
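One way to sanity-check the directory outside Luke is to open it with the taxonomy API directly (a sketch using the 4.x classes FSDirectory and DirectoryTaxonomyReader; the path is an example):

Directory dir = FSDirectory.open(new File("/path/to/taxonomy/index"));
// fails with the same IndexNotFoundException if there is no segments* file here
DirectoryTaxonomyReader taxoReader = new DirectoryTaxonomyReader(dir);
System.out.println("categories: " + taxoReader.getSize());
taxoReader.close();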
Re: SortingMergePolicy for already sorted segments
I'm not sure that I follow ... where do you see DocMap being loaded up front? Specifically, Sorter.sort may return null if the readers are already sorted ... I think we already optimized for the case where the readers are sorted. Shai On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: I am planning to use SortingMergePolicy where all the merge-participating segments are already sorted... I understand that I need to define a DocMap with old-new doc-id mappings. Is it possible to optimize away the eager loading of DocMap and make it a kind of lazy, on-demand load? Ex: Pass a List<AtomicReader> to the caller and ask for the next new-old doc mapping.. Since my segments are already sorted, I could save on memory a little-bit this way, instead of loading the full DocMap upfront -- Ravi
Re: Facets in Lucene 4.7.2
Hi Currently there's no way to add e.g. terms to already indexed documents, you have to re-index them. The only updatable fields Lucene currently offers are DocValues fields. If the list of markers/flags is fixed in your case, and you can map them to an integer, I think you could use a NumericDocValues field, which supports field-level updates. Once you do that, you can then:
* Count on this field pretty easily. You will need to write a Facets implementation, but otherwise it's very easy.
* Filter queries: you will need to write a Filter which returns a DocIdSet of the documents that belong to one category (e.g. Financially Relevant). Here you might want to consider caching the result of the Filter, by using CachingWrapperFilter.
It's not the best approach, updatable Terms would better suit your usecase, however we don't offer them yet and it will be a while until we do (and IF we do). You should also benchmark that approach vs re-indexing the documents, since the current implementation of updatable doc-values fields isn't optimized for a few document updates between index reopens. See here: http://shaierera.blogspot.com/2014/04/benchmarking-updatable-docvalues.html Shai On Fri, Jun 13, 2014 at 10:19 PM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote: Hi Shai, Thanks so much for the clear explanation. I agree on the first question. Taxonomy Writer with a separate index would probably be my approach too. For the second question: I am a little new to the Facets API so I will try to figure out the approach that you outlined below. However, the scenario is such: Assume a document corpus that is indexed. For a user query, a document is returned and selected by the user for editing as part of some use case/workflow. That document is now marked as either historically interesting or not, financially relevant, specific to media or entertainment domain, etc. by the user. So, essentially the user is flagging the document with certain markers. Another set of users could possibly want to query on these markers. So, lets say, a second user comes along, and wants to see the top documents belonging to one category, say, agriculture or farming. Since these markers are run time activities, how can I use the facets on them? So, I was envisioning facets as the various markers. But, if I constantly re-index or update the documents whenever a marker changes, I believe it would not be very efficient. Is there anything, facets or otherwise, in Lucene that can help me solve this use case? Please let me know. And, thanks! --- Thanks n Regards, Sandeep Ramesh Khanzode
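A minimal sketch of the NumericDocValues route described above; "id" and "marker" are example field names, the marker-to-long mapping is application-defined, and writer is an open IndexWriter (updateNumericDocValue is available since Lucene 4.6):

// at indexing time: every document gets an id term and a marker NDV field
Document doc = new Document();
doc.add(new StringField("id", "doc-17", Field.Store.NO));
doc.add(new NumericDocValuesField("marker", 0L)); // 0 = no marker yet
writer.addDocument(doc);

// later, when a user flags the document -- no re-indexing needed
long FINANCIALLY_RELEVANT = 2L; // hypothetical application-defined code
writer.updateNumericDocValue(new Term("id", "doc-17"), "marker", FINANCIALLY_RELEVANT);
writer.commit(); // or reopen a near-real-time reader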
Re: Facets in Lucene 4.7.2
Hi You can check the demo code here: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/. This code is updated with each release, so you always get working code examples, even when the API changes. If you don't mind managing the sidecar index, which I agree isn't such a big deal, then yes - the taxonomy index currently performs the fastest. I plan to explore porting the taxonomy-based approach from BinaryDocValues to the new SortedNumericDocValues (coming out in 4.9) since it might perform even faster. I didn't quite get the marker/flag facet. Can you give an example? For instance, if you can model that as a NumericDocValuesField added to documents (w/ the different markers/flags translated to numbers), then you can use Lucene's updatable numeric DocValues and write a custom Facets to aggregate on that NumericDocValues field. Shai On Fri, Jun 13, 2014 at 11:48 AM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote: Hi, I am evaluating Lucene Facets for a project. Since there is a lot of change in 4.7.2 for Facets, I am relying on UTs for reference. Please let me know if there are other sources of information. I have a couple of questions: 1.] All categories in my application are flat, not hierarchical. But, it seems from a few sources, that even that notwithstanding, you would want to use a Taxonomy based index for performance reasons. It is faster but uses more RAM. Or is the deterrent to using it the fact that it is a separate data structure? If one could do with the life-cycle management of the extra index, should we go ahead with the taxonomy index for better performance across tens of millions of documents? Another note to add is that I do not see a scenario wherein I would want to re-index my collection over and over again or, in other words, the changes would be spread over time. 2.] I need a type of dynamic facet that allows me to add a flag or marker to the document at runtime since it will change/update every time a user modifies or adds to the list of markers. Is this possible to do with the current implementation? Since I believe that currently all faceting is done at indexing time. --- Thanks n Regards, Sandeep Ramesh Khanzode
Re: Faceted Search User's Guide for Lucene 4.8.1
Hi We removed the userguide a long time ago, and replaced it with better documentation on the classes and package.html, as well as demo code that you can find here: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/ You can also look up some blog posts that I wrote a while ago on facets, which explain how they work and some internals, even though the code examples are not up-to-date w/ the latest API changes: http://shaierera.blogspot.com/2012/11/lucene-facets-part-1.html http://shaierera.blogspot.com/2012/11/lucene-facets-part-2.html http://shaierera.blogspot.com/2012/12/lucene-facets-under-hood.html http://shaierera.blogspot.com/2013/01/facet-associations.html Shai On Wed, Jun 11, 2014 at 10:51 AM, Raf r.ventag...@gmail.com wrote: Hi, I have found this useful guide to the *Lucene Faceted Search*: http://lucene.apache.org/core/4_4_0/facet/org/apache/lucene/facet/doc-files/userguide.html The problem is that it refers to Lucene version 4.4, while I am using the latest available release (4.8.1) and I cannot find some classes (e.g. FacetSearchParams or CountFacetRequest). Is there an updated version of that guide? I tried this http://lucene.apache.org/core/*4_8_1*/facet/org/apache/lucene/facet/doc-files/userguide.html but it does not work :| Thank you for any help you can provide. Regards, *Raf*
Re: Multi-thread indexing, should the commit be called from each thread?
You don't need to commit from each thread; you can definitely commit when all threads are done. In general, you should commit only when you want to ensure the data is safe on disk. Shai On Wed, May 21, 2014 at 2:58 PM, andi rexha a_re...@hotmail.com wrote: Hi! I have a question about multi-thread indexing. When I perform multi-threaded indexing, should I commit from each thread that adds documents, or should the commit be done only when all the threads are done with their indexing task? Thank you!
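For illustration, a minimal sketch of that pattern: several threads share one IndexWriter (which is thread-safe) and a single commit happens after they all finish. The class and method names are hypothetical.

import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class IndexingSketch {
  public static void indexAll(final IndexWriter writer, List<List<Document>> batches)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (final List<Document> batch : batches) {
      pool.execute(new Runnable() {
        public void run() {
          try {
            for (Document doc : batch) {
              writer.addDocument(doc); // IndexWriter is safe to share across threads
            }
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    writer.commit(); // one commit, after all indexing threads are done
  }
}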
Re: best choice for ramBufferSizeMB
Well, first make sure that you set ramBufferSizeMB to well below the max Java heap size, otherwise you could run into OOMs. While a larger RAM buffer may speed up indexing (since it flushes less often to disk), it's not the only factor that affects indexing speed. For instance, if a big portion of your indexing work is reading the files from a slow storage device (maybe an NFS share, remote HTTP etc.), then that could easily shadow any benefits of using a large RAM buffer. Also, do you index with a single thread or multiple threads? Lucene supports multi-threaded indexing, and it's recommended to use it whenever you can, e.g. when you run on sufficiently strong HW (4+ cores...). Another thing: in the past I noticed that very large RAM buffers did not improve indexing at all. E.g. if your underlying IO system is slow (indexing to an NFS share, distributed file-system etc.), then the cost of flushing a big RAM buffer became significant, more than indexing in RAM, and e.g. I did not observe improvements when using ramBufferSizeMB=512 vs 128. Also, using a big RAM buffer uses more space on the heap, and makes the job of the GC harder. So I think it might be that a too-big RAM buffer may actually slow things down, rather than speed them up. Indexing speed is affected by multiple parameters; the RAM buffer is only one of them... Shai On Wed, May 14, 2014 at 4:33 PM, Gudrun Siedersleben siedersle...@mpdl.mpg.de wrote: Hi all, we want to speed up building our lucene index. We set ramBufferSize to some values between 32 and 128 MB, but that does not make any difference concerning the time used for reindexing. We did not set maxBufferedDocs, .. which could conflict. We start the JVM with the following JAVA_OPTS: -Xms128m -Xmx512m -XX:MaxPermSize=256m What is the recommended value for ramBufferSizeMB depending on JAVA_OPTS and perhaps other lucene parameters set? We use Lucene 3.6.0. Best regards Gudrun
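For example, a sketch against the Lucene 3.6-era API (the analyzer and directory are whatever the application already uses); with -Xmx512m as in the mail above, a buffer of 64-128 MB leaves room for the rest of the application:

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, analyzer);
iwc.setRAMBufferSizeMB(128); // keep this well below the max heap size
// flush by RAM usage only, so a maxBufferedDocs setting can't interfere
iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
IndexWriter writer = new IndexWriter(directory, iwc);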
Re: Fields, Index segments and docIds (second Try)
I don't think that you need to be concerned with the internal docIDs much. Just imagine the indexes as a big table with multiple columns, where columns are grouped together. Each group is a different index. If a document does not have a value in one column, then you have an empty cell. If a document doesn't have a value in an entire group of columns, then you denote that by adding an empty document. Oh, and make sure to use a LogMergePolicy, so segments are merged in the same order across all indexes. And given that you rebuild the indexes every time, you can create them one-by-one. You don't need to do that in parallel for all indexes, unless it's more convenient for you. Shai On Fri, May 2, 2014 at 9:28 AM, Olivier Binda olivier.bi...@wanadoo.fr wrote: On 05/02/2014 06:05 AM, Shai Erera wrote: If you're always rebuilding, let alone forceMerge, you shouldn't have too much trouble implementing it. Just make sure that you add documents in the same order to all indexes. If you're always rebuilding, how come you have deletions? Anyway, you must also delete in all indexes. Indeed, I don't have deletions and I'm mainly concerned with merges. But I just want to understand the whole docId remapping process, out of curiosity and also because obtaining a docId (and not losing it) seems to be the key of parallel indexes
Re: Fields, Index segments and docIds (second Try)
I'm glad it helped you. Good luck with the implementation. One thing I didn't mention (though it's in the jdocs) -- it's not enough to have the documents of each index aligned, you also have to have the segments aligned. That is, if both indexes have documents 0-5 aligned, but one index contains a single segment and the other one 2 segments, that's not going to work. It is possible to do w/ some care -- when you build the German index, disable merges (use NoMergePolicy) and flush whenever you indexed enough documents to match an existing segment on e.g. the Common index. Or, if rebuilding all indexes won't take long, you can always rebuild all of them. Shai On Thu, May 1, 2014 at 12:00 AM, Olivier Binda olivier.bi...@wanadoo.fr wrote: On 04/30/2014 10:48 AM, Shai Erera wrote: I hope I got all the details right, if I didn't then please clarify. Also, I haven't read the entire thread, so if someone already suggested this ... well, it probably means it's the right solution :) It sounds like you could use Lucene's ParallelCompositeReader, which already handles multiple IndexReaders that are aligned by their internal document IDs. The way it would work, as far as I understand your scenario, is something like the following table (columns denote different indexes). Each index contains a subset of relevant fields, where common contains the common fields, and each language index contains the respective language fields.

DocID      LuceneID  Common  English                 German
FirstDoc   0         A,B,C   EN_words, EN_sentences  DE_words, DE_sentences
SecondDoc  1         A,B,C
ThirdDoc   2         A,B,C

Each index can contain all relevant fields, or only a subset (e.g. maybe not all documents have a value for the 'B' field in the 'common' index). What's absolutely very important here though is that the indexes are created very carefully, and if e.g. SecondDoc is not translated into German, *you must still have an empty document* in the German index, or otherwise document IDs will not align. That's exactly how I saw it and what I need to do. So, I'll have a very good look at ParallelCompositeReader. Lucene does not offer a way to build those indexes though (patches welcome!!). This answers my question 1. Thanks. :) I somehow hoped that there was already support for that kind of situation in lucene but well, now at least I know that I won't find an already made solution to my problem in the lucene classes and that I will have to code one myself, by taking inspiration from the lucene classes that do similar processing. We've started some effort a very long time ago on LUCENE-1879 (there's a patch and a discussion for an alternative approach) as well as there is a very useful suggestion in ParallelCompositeReader's jdocs (use LogDocMergePolicy). Wow, priceless. This gives me some headstart and inspiration. :) One challenge is how to support multi-threaded indexing, but perhaps this isn't a problem in your application? It sounds like, by you writing that a user will download the german index, that the indexes are built offline? Indeed. The index is built offline, in a single thread, and once it is built, it is read only. Can't find an easier situation. :) Another challenge is how to control segment merging, so that the *exact same segments* are merged over the parallel indexes. Again, if your application builds the indexes offline, then this should be easier to accomplish. I assume though that when you index e.g. the German documents, then the already indexed 'common' fields do not change for a document. If they do, you will need to rebuild the 'common' index too. Once you achieve a correct parallel index, it is very easy to open a ParallelCompositeReader on any subset of the indexes, e.g. Common+English, Common+German, or Common+English+German and search it, since the internal document IDs are perfectly aligned. Shai Many thanks for the awesome answer and the help (I love you). As I really really really need this to happen, I'm going to start working on this really soon. I'm definitely not an expert on threads/filesystems/and lucene inner workings, so I can't promise to contribute a miraculous patch though. Especially since I won't work on the multi-thread aspect of the problem. But I'll do the best I can and contribute back whatever code I can produce. Many thanks, again. :)
Re: Fields, Index segments and docIds (second Try)
If you're always rebuilding, let alone forceMerge, you shouldn't have too much trouble implementing it. Just make sure that you add documents in the same order to all indexes. If you're always rebuilding, how come you have deletions? Anyway, you must also delete in all indexes. On May 2, 2014 1:57 AM, Olivier Binda olivier.bi...@wanadoo.fr wrote: On 05/01/2014 10:28 AM, Shai Erera wrote: I'm glad it helped you. Good luck with the implementation. Thanks. First I started looking at the lucene internal code, to understand when/where and why docIds are changing/need to be changed (in merges and doc deletions). I have always wanted to understand this and I think the understanding may help me somehow. One thing I didn't mention (though it's in the jdocs) -- it's not enough to have the documents of each index aligned, you also have to have the segments aligned. That is, if both indexes have documents 0-5 aligned, but one index contains a single segment and the other one 2 segments, that's not going to work. That's good to know. It is possible to do w/ some care -- when you build the German index, disable merges (use NoMergePolicy) and flush whenever you indexed enough documents to match an existing segment on e.g. the Common index. Or, if rebuilding all indexes won't take long, you can always rebuild all of them. Yes. That's what I am usually doing (it takes less than 1 minute). Yet, I usually do a forceMerge too to only have 1 segment :/
Re: Fields, Index segments and docIds (second Try)
I hope I got all the details right, if I didn't then please clarify. Also, I haven't read the entire thread, so if someone already suggested this ... well, it probably means it's the right solution :) It sounds like you could use Lucene's ParallelCompositeReader, which already handles multiple IndexReaders that are aligned by their internal document IDs. The way it would work, as far as I understand your scenario, is something like the following table (columns denote different indexes). Each index contains a subset of relevant fields, where common contains the common fields, and each language index contains the respective language fields.

DocID      LuceneID  Common  English                 German
FirstDoc   0         A,B,C   EN_words, EN_sentences  DE_words, DE_sentences
SecondDoc  1         A,B,C
ThirdDoc   2         A,B,C

Each index can contain all relevant fields, or only a subset (e.g. maybe not all documents have a value for the 'B' field in the 'common' index). What's absolutely very important here though is that the indexes are created very carefully, and if e.g. SecondDoc is not translated into German, *you must still have an empty document* in the German index, or otherwise document IDs will not align. Lucene does not offer a way to build those indexes though (patches welcome!!). We've started some effort a very long time ago on LUCENE-1879 (there's a patch and a discussion for an alternative approach) as well as there is a very useful suggestion in ParallelCompositeReader's jdocs (use LogDocMergePolicy). One challenge is how to support multi-threaded indexing, but perhaps this isn't a problem in your application? It sounds like, by you writing that a user will download the german index, that the indexes are built offline? Another challenge is how to control segment merging, so that the *exact same segments* are merged over the parallel indexes. Again, if your application builds the indexes offline, then this should be easier to accomplish. I assume though that when you index e.g. the German documents, then the already indexed 'common' fields do not change for a document. If they do, you will need to rebuild the 'common' index too. Once you achieve a correct parallel index, it is very easy to open a ParallelCompositeReader on any subset of the indexes, e.g. Common+English, Common+German, or Common+English+German and search it, since the internal document IDs are perfectly aligned. Shai On Wed, Apr 30, 2014 at 7:07 AM, Jose Carlos Canova jose.carlos.can...@gmail.com wrote: My suggestion is that you not worry about the docId; in practice it is an internal Lucene id, quite similar to a rowId on a database. Each index may generate a different docId (it is their problem) for a translated document; you may use your own ID that relates one document to another on different indexes, mainly because, like you mention, these are translated documents that in theory can be ranked differently from language to language (it is not an obligation that a set of documents in different languages spans the same rank order, but I am not 100% sure about this). The second reason is that 'they may change the internal structure of lucene without warrant', and then you lose forward compatibility. I am not an expert on Lucene like Schindler, but reading their documentation I understood that they pay special attention to internal lucene and experimental lucene, which means internal carries no compatibility warrant, and experimental may be removed. For example, if they (apache-lucene) discover a new, more efficient manner to relate each document and change some mechanism, and your application uses an internal mechanism that is highly coupled with lucene version xxx (marked as internal-lucene), you can get stuck on a specific version and in the future have to rewrite some code; this might cause some management conflict if your project follows continuous integration and you are subordinated to a management structure (bad for you). I saw this on several projects that use Lucene: they do not upgrade their lucene components in their new releases; one example, if I am not wrong, still uses Lucene 3, and another that I saw around (e.g. Luke) was abandoned because the manner in which it integrated with Lucene was not fully functional. Another interesting thing is that developing around Lucene is more effective: you guarantee that your product will work and they guarantee that Lucene works too. This is related to design by contract. Regards. On Tue, Apr 29, 2014 at 7:11 PM, Olivier Binda olivier.bi...@wanadoo.fr wrote: Hello. Sorry to bring this up again. I don't want to be rude and I mean no disrespect, but after thinking it through today, I need to and would really love to have the answer to the following question: 1) At lucene indexing time, is it possible to rewrite a read-only index so that some fields
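To make the parallel-index recipe from this thread concrete, a sketch of both halves against the 4.x API (the directory and variable names are examples): index each parallel index with LogDocMergePolicy so merges pick the same segments in the same order, then open the aligned indexes together.

// at indexing time, for EVERY parallel index:
IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, analyzer);
iwc.setMergePolicy(new LogDocMergePolicy()); // keeps segment geometry aligned across indexes

// at search time, over any subset of the aligned indexes:
DirectoryReader common = DirectoryReader.open(commonDir);
DirectoryReader german = DirectoryReader.open(germanDir);
// works only if documents AND segments are perfectly aligned
ParallelCompositeReader reader = new ParallelCompositeReader(common, german);
IndexSearcher searcher = new IndexSearcher(reader);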
Re: Getting multi-values to use in filter?
Hi Rob, While the demo code uses a fixed number of 3 values, you don't need to encode the number of values up front. Since you read the byte[] of a document up front, you can read in a while loop as long as in.position() < in.length(). Shai On Tue, Apr 29, 2014 at 10:04 AM, Rob Audenaerde rob.audenae...@gmail.com wrote: Hi Shai, I read the article on your blog, thanks for it! It seems to be a natural fit to do multi-values like this, and it is helpful indeed. For my specific problem, I have multiple values that do not have a fixed number, so it can be either 0 or 10 values. I think the best way to solve this is to encode the number of values as the first entry in the BDV. This is not that hard so I will take this road. -Rob
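A sketch of that encoding, assuming non-negative values (vlongs don't handle negative numbers well); no count prefix is needed because the reader simply loops until the bytes are exhausted. The class and method names are hypothetical.

import java.io.IOException;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.store.ByteArrayDataInput;
import org.apache.lucene.store.ByteArrayDataOutput;
import org.apache.lucene.util.BytesRef;

public class MultiValueCodecSketch {
  // encode 0..N non-negative values into one BinaryDocValues field
  public static BinaryDocValuesField encode(String field, long[] values) {
    byte[] buf = new byte[values.length * 9]; // a vlong takes at most 9 bytes
    ByteArrayDataOutput out = new ByteArrayDataOutput(buf);
    try {
      for (long v : values) out.writeVLong(v);
    } catch (IOException e) {
      throw new RuntimeException(e); // cannot happen for an in-memory byte array
    }
    return new BinaryDocValuesField(field, new BytesRef(buf, 0, out.getPosition()));
  }

  // decode at search time: read until the bytes run out
  public static long max(BytesRef bytes) {
    ByteArrayDataInput in = new ByteArrayDataInput(bytes.bytes, bytes.offset, bytes.length);
    long max = Long.MIN_VALUE;
    while (!in.eof()) {
      max = Math.max(max, in.readVLong());
    }
    return max;
  }
}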
Re: No Compound Files
The problem is that compound file settings are split between MergePolicy and IndexWriterConfig. As documented on IWC.setUseCompoundFile, this setting controls how new segments are flushed, while the MP setting controls how merged segments are written. If we only offer NoMP.INSTANCE, what would it do w/ merged segments? Always compound? Always not-compound? But that won't solve the problem of new flushed segments, since that's controlled by IWC. If we can move all of that to IWC, I think it will remove the confusion .. it always confuses me that I use NoMP.COMPOUND_FILES, yet I see non-compound segments, until I remember to change the IWC setting. Shai On Tue, Apr 29, 2014 at 3:07 PM, Robert Muir rcm...@gmail.com wrote: I think NoMergePolicy.NO_COMPOUND_FILES and NoMergePolicy.COMPOUND_FILES should be removed, and replaced with NoMergePolicy.INSTANCE If you want to change whether CFS is used by indexwriter flush, you need to set that in IndexWriterConfig. On Tue, Apr 29, 2014 at 8:03 AM, Varun Thacker varunthacker1...@gmail.com wrote: I wanted to use the NoMergePolicy.NO_COMPOUND_FILES to ensure that no merges take place on the index. However I was unsuccessful at it. What am I doing wrong here? Attaching a gist with - 1. Output when using NoMergePolicy.NO_COMPOUND_FILES 2. Output when using TieredMergePolicy with policy.setNoCFSRatio(0.0) 3. The code snippet I used. https://gist.github.com/vthacker/11398124 I tried it using Lucene 4.7 -- Regards, Varun Thacker http://www.vthacker.in/
Re: No Compound Files
NoMP means no merges, and indeed it seems silly that NoMP distinguishes between compound/non-compound settings. Perhaps it's rooted somewhere in the past, I don't remember. I checked and IndexWriter.addIndexes consults MP.useCompoundFile(segmentInfo) when it adds the segments. But maybe NoMP.useCompoundFile can be changed to return newSegment.info.isCompoundFile? I.e. it doesn't change the type of the new segment? Shai On Tue, Apr 29, 2014 at 3:50 PM, Michael McCandless luc...@mikemccandless.com wrote: +1 to just have NoMergePolicy.INSTANCE Mike McCandless http://blog.mikemccandless.com On Tue, Apr 29, 2014 at 8:07 AM, Robert Muir rcm...@gmail.com wrote: I think NoMergePolicy.NO_COMPOUND_FILES and NoMergePolicy.COMPOUND_FILES should be removed, and replaced with NoMergePolicy.INSTANCE If you want to change whether CFS is used by indexwriter flush, you need to set that in IndexWriterConfig. On Tue, Apr 29, 2014 at 8:03 AM, Varun Thacker varunthacker1...@gmail.com wrote: I wanted to use the NoMergePolicy.NO_COMPOUND_FILES to ensure that no merges take place on the index. However I was unsuccessful at it. What am I doing wrong here? Attaching a gist with - 1. Output when using NoMergePolicy.NO_COMPOUND_FILES 2. Output when using TieredMergePolicy with policy.setNoCFSRatio(0.0) 3. The code snippet I used. https://gist.github.com/vthacker/11398124 I tried it using Lucene 4.7 -- Regards, Varun Thacker http://www.vthacker.in/
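Putting the two knobs from this thread together, a sketch of disabling compound files entirely (4.8-era API; matchVersion and analyzer are whatever the application already uses):

IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, analyzer);
iwc.setUseCompoundFile(false);   // newly flushed segments: no CFS
TieredMergePolicy mp = new TieredMergePolicy();
mp.setNoCFSRatio(0.0);           // merged segments: no CFS
iwc.setMergePolicy(mp);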
Re: Getting multi-values to use in filter?
Hi Rob, Your question got me interested, so I wrote a quick prototype of what I think solves your problem (and if not, I hope it solves someone else's! :)). The idea is to write a special ValueSource, e.g. MaxValueSource which reads a BinadyDocValues, decodes the values and returns the maximum one. It can then be embedded in an expression quite easily. I published a post on Lucene expressions and included some prototype code which demonstrates how to do it. Hope it's still helpful to you: http://shaierera.blogspot.com/2014/04/expressions-with-lucene.html. Shai On Thu, Apr 24, 2014 at 1:20 PM, Shai Erera ser...@gmail.com wrote: I don't think that you should use the facet module. If all you want is to encode a bunch of numbers under a 'foo' field, you can encode them into a byte[] and index them as a BDV. Then at search time you get the BDV and decode the numbers back. The facet module adds complexity here: yes, you get the encoding/decoding for free, but at the cost of adding mock categories to the taxonomy, or use associations, for no good reason IMO. Once you do that, you need to figure out how to extend the expressions module to support a function like maxValues(fieldName) (cannot use 'max' since it's reserved). I read about it some, and still haven't figured out exactly how to do it. The JavascriptCompiler can take custom functions to compile expressions, but the methods should take only double values. So I think it should be some sort of binding, but I'm not sure yet how to do it. Perhaps it should be a name like max_fieldName, which you add a custom Expression to as a binding ... I will try to look into it later. Shai On Wed, Apr 23, 2014 at 6:49 PM, Rob Audenaerde rob.audenae...@gmail.comwrote: Thanks for all the questions, gives me an opportunity to clarify it :) I want the user to be able to give a (simple) formula (so I don't know it on beforehand) and use that formula in the search. The Javascript expressions are really powerful in this use case, but have the single-value limitation. Ideally, I would like to make it really flexible by for example allowing (in-document aggregating) expressions like: max(fieldA) - fieldB fieldC. Currently, using single values, I can handle expressions in the form of fieldA - fieldB - fieldC 0 and evaluate the long-value that I receive from the FunctionValues and the ValueSource. I also optimize the query by assuring the field exists and has a value, etc. to the search still fast enough. This works well, but single value only. I also looked into the facets Association Fields, as they somewhat look like the thing that I want. Only in the faceting module, all ordinals and values are stored in one field, so there is no easy way extract the fields that are used in the expression. I like the solution one you suggested, to add all the numeric fields an encoded byte[] like the facets do, but then on a per-field basis, so that each numeric field has a BDV field that contains all multiple values for that field for that document. Now that I am typing this, I think there is another way. I could use the faceting module and add a different facet field ($facetFIELDA, $facetFIELDB) in the FacetsConfig for each field. That way it would be relatively straightforward to get all the values for a field, as they are exact all the values for the BDV for that document's facet field. Only aggregating all facets will be harder, as the TaxonomyFacetSum*Associations would need to do this for all fields that I need facet counts/sums for. What do you think? 
-Rob On Wed, Apr 23, 2014 at 5:13 PM, Shai Erera ser...@gmail.com wrote: A NumericDocValues field can only hold one value. Have you thought about encoding the values in a BinaryDocValues field? Or are you talking about multiple fields (different names), each has its own single value, and at search time you sum the values from a different set of fields? If it's one field, multiple values, then why do you need to separate the values? Is it because you sometimes sum and sometimes e.g. avg? Do you always include all values of a document in the formula, but the formula changes between searches, or do you sometimes use only a subset of the values? If you always use all values, but change the formula between queries, then perhaps you can just encode the pre-computed value under different NDV fields? If you only use a handful of functions (and they are known in advance), it may not be too heavy on the index, and definitely perform better during search. Otherwise, I believe I'd consider indexing them as a BDV field. For facets, we basically need the same multi-valued numeric field, and given that NDV is single valued, we went w/ BDV. If I misunderstood the scenario, I'd appreciate if you clarify it :) Shai On Wed, Apr 23, 2014 at 5:49 PM, Rob Audenaerde rob.audenae...@gmail.com wrote: Hi Shai, all, I am
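[Editor's note: for readers landing on this thread later, here is a minimal sketch of the MaxValueSource idea described above. It is untested, assumes Lucene 4.x APIs, and assumes the per-document values were encoded as fixed-width 8-byte big-endian longs; see the blog post above for the real prototype.]

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.queries.function.FunctionValues;
    import org.apache.lucene.queries.function.ValueSource;
    import org.apache.lucene.queries.function.docvalues.DoubleDocValues;
    import org.apache.lucene.util.BytesRef;

    public class MaxValueSource extends ValueSource {
      private final String field;
      public MaxValueSource(String field) { this.field = field; }

      @Override
      public FunctionValues getValues(Map context, AtomicReaderContext ctx) throws IOException {
        // may be null if this segment has no values for the field
        final BinaryDocValues bdv = ctx.reader().getBinaryDocValues(field);
        return new DoubleDocValues(this) {
          @Override
          public double doubleVal(int doc) {
            if (bdv == null) return 0;
            BytesRef bytes = new BytesRef();
            bdv.get(doc, bytes); // in 4.x this fills the given BytesRef
            long max = Long.MIN_VALUE;
            for (int i = 0; i + 8 <= bytes.length; i += 8) { // decode each 8-byte long
              long v = 0;
              for (int b = 0; b < 8; b++) {
                v = (v << 8) | (bytes.bytes[bytes.offset + i + b] & 0xFF);
              }
              max = Math.max(max, v);
            }
            return max == Long.MIN_VALUE ? 0 : max;
          }
        };
      }

      @Override public String description() { return "max(" + field + ")"; }
      @Override public boolean equals(Object o) {
        return o instanceof MaxValueSource && ((MaxValueSource) o).field.equals(field);
      }
      @Override public int hashCode() { return field.hashCode(); }
    }

Hooking it into a compiled expression then requires a custom Bindings that returns this ValueSource for a name like max_fieldA, as discussed above.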
Re: Getting multi-values to use in filter?
I don't think that you should use the facet module. If all you want is to encode a bunch of numbers under a 'foo' field, you can encode them into a byte[] and index them as a BDV. Then at search time you get the BDV and decode the numbers back. The facet module adds complexity here: yes, you get the encoding/decoding for free, but at the cost of adding mock categories to the taxonomy, or using associations, for no good reason IMO. Once you do that, you need to figure out how to extend the expressions module to support a function like maxValues(fieldName) (cannot use 'max' since it's reserved). I read about it some, and still haven't figured out exactly how to do it. The JavascriptCompiler can take custom functions to compile expressions, but the methods should take only double values. So I think it should be some sort of binding, but I'm not sure yet how to do it. Perhaps it should be a name like max_fieldName, which you add a custom Expression to as a binding ... I will try to look into it later. Shai On Wed, Apr 23, 2014 at 6:49 PM, Rob Audenaerde rob.audenae...@gmail.comwrote: Thanks for all the questions, gives me an opportunity to clarify it :) I want the user to be able to give a (simple) formula (so I don't know it beforehand) and use that formula in the search. The Javascript expressions are really powerful in this use case, but have the single-value limitation. Ideally, I would like to make it really flexible by for example allowing (in-document aggregating) expressions like: max(fieldA) - fieldB > fieldC. Currently, using single values, I can handle expressions in the form of fieldA - fieldB - fieldC > 0 and evaluate the long-value that I receive from the FunctionValues and the ValueSource. I also optimize the query by assuring the field exists and has a value, etc., so the search is still fast enough. This works well, but single value only. I also looked into the facets Association Fields, as they somewhat look like the thing that I want. Only in the faceting module, all ordinals and values are stored in one field, so there is no easy way to extract the fields that are used in the expression. I like the solution you suggested, to add all the numeric fields as an encoded byte[] like the facets do, but then on a per-field basis, so that each numeric field has a BDV field that contains all the multiple values for that field for that document. Now that I am typing this, I think there is another way. I could use the faceting module and add a different facet field ($facetFIELDA, $facetFIELDB) in the FacetsConfig for each field. That way it would be relatively straightforward to get all the values for a field, as they are exactly all the values in the BDV for that document's facet field. Only aggregating all facets will be harder, as the TaxonomyFacetSum*Associations would need to do this for all fields that I need facet counts/sums for. What do you think? -Rob On Wed, Apr 23, 2014 at 5:13 PM, Shai Erera ser...@gmail.com wrote: A NumericDocValues field can only hold one value. Have you thought about encoding the values in a BinaryDocValues field? Or are you talking about multiple fields (different names), each has its own single value, and at search time you sum the values from a different set of fields? If it's one field, multiple values, then why do you need to separate the values? Is it because you sometimes sum and sometimes e.g. avg? Do you always include all values of a document in the formula, but the formula changes between searches, or do you sometimes use only a subset of the values? 
If you always use all values, but change the formula between queries, then perhaps you can just encode the pre-computed value under different NDV fields? If you only use a handful of functions (and they are known in advance), it may not be too heavy on the index, and definitely perform better during search. Otherwise, I believe I'd consider indexing them as a BDV field. For facets, we basically need the same multi-valued numeric field, and given that NDV is single valued, we went w/ BDV. If I misunderstood the scenario, I'd appreciate if you clarify it :) Shai On Wed, Apr 23, 2014 at 5:49 PM, Rob Audenaerde rob.audenae...@gmail.com wrote: Hi Shai, all, I am trying to write that Filter :). But I'm a bit at a loss as to how to efficiently grab the multi-values. I can access the context.reader().document() that accesses the stored fields, but that seems slow. For single-value fields I use a compiled JavaScript Expression with SimpleBindings as ValueSource, which seems to work quite well. The downside is that I cannot find a way to implement multi-value through that solution. These create for example a LongFieldSource, which uses the FieldCache.LongParser. These parsers only seem to parse one value. Is there an efficient way to get -all- of the (numeric) values for a field in a document?
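[Editor's note: a minimal sketch of the encode-numbers-into-a-BDV idea from the message above. Untested; the 'foo' field name and the fixed-width big-endian encoding are just examples, not a prescribed format.]

    import org.apache.lucene.document.BinaryDocValuesField;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.util.BytesRef;

    static BinaryDocValuesField encodeLongs(String field, long... values) {
      byte[] bytes = new byte[values.length * 8];
      for (int i = 0; i < values.length; i++) {
        long v = values[i];
        for (int b = 7; b >= 0; b--) { // big-endian, 8 bytes per value
          bytes[i * 8 + b] = (byte) v;
          v >>>= 8;
        }
      }
      return new BinaryDocValuesField(field, new BytesRef(bytes));
    }

    // indexing:
    Document doc = new Document();
    doc.add(encodeLongs("foo", 60L, 40L));
    // search time: reader.getBinaryDocValues("foo"), then reverse the loop above to decode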
Re: Getting multi-values to use in filter?
You can do that by writing a Filter which returns matching documents based on a sum of the field's values. However I suspect that is going to be slow, unless you know that you will need several such filters and can cache them. Another approach would be to write a Collector which serves as a Filter, but computes the sum only for documents that match the query. Hopefully that would mean you compute the sum for fewer documents than you would have w/ the Filter approach. Shai On Wed, Apr 23, 2014 at 5:11 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: This isn't really a good use case for an index like Lucene. The most essential property of an index is that it lets you look up documents very quickly based on *precomputed* values. -Mike On 04/23/2014 06:56 AM, Rob Audenaerde wrote: Hi all, I'm looking for a way to use multi-values in a filter. I want to be able to search on sum(field)=100, where field has values in one document: field=60 field=40 In this case 'field' is a LongField. I examined the code in the FieldCache, but that seems to focus on single-valued fields only. Is this something that can be done in Lucene? And what would be a good approach? Thanks in advance, -Rob
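[Editor's note: a rough sketch of the Collector-as-filter idea suggested above. Untested; it assumes Lucene 4.x Collector APIs and that the multi-values live in a BinaryDocValues field encoded as fixed-width 8-byte big-endian longs.]

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;
    import org.apache.lucene.util.BytesRef;

    public class SumMatchCollector extends Collector {
      private final String field;
      private final long target;
      private final List<Integer> hits = new ArrayList<Integer>();
      private BinaryDocValues bdv;
      private int docBase;

      public SumMatchCollector(String field, long target) {
        this.field = field;
        this.target = target;
      }

      @Override public void setScorer(Scorer scorer) {} // scores not needed
      @Override public boolean acceptsDocsOutOfOrder() { return true; }

      @Override public void setNextReader(AtomicReaderContext ctx) throws IOException {
        bdv = ctx.reader().getBinaryDocValues(field); // per-segment values
        docBase = ctx.docBase;
      }

      @Override public void collect(int doc) {
        if (bdv == null) return;
        BytesRef scratch = new BytesRef();
        bdv.get(doc, scratch);
        long sum = 0;
        for (int i = 0; i + 8 <= scratch.length; i += 8) { // decode and sum each value
          long v = 0;
          for (int b = 0; b < 8; b++) {
            v = (v << 8) | (scratch.bytes[scratch.offset + i + b] & 0xFF);
          }
          sum += v;
        }
        if (sum == target) hits.add(docBase + doc); // keep global docIDs whose sum matches
      }

      public List<Integer> getHits() { return hits; }
    }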
Re: Getting multi-values to use in filter?
A NumericDocValues field can only hold one value. Have you thought about encoding the values in a BinaryDocValues field? Or are you talking about multiple fields (different names), each has its own single value, and at search time you sum the values from a different set of fields? If it's one field, multiple values, then why do you need to separate the values? Is it because you sometimes sum and sometimes e.g. avg? Do you always include all values of a document in the formula, but the formula changes between searches, or do you sometimes use only a subset of the values? If you always use all values, but change the formula between queries, then perhaps you can just encode the pre-computed value under different NDV fields? If you only use a handful of functions (and they are known in advance), it may not be too heavy on the index, and definitely perform better during search. Otherwise, I believe I'd consider indexing them as a BDV field. For facets, we basically need the same multi-valued numeric field, and given that NDV is single valued, we went w/ BDV. If I misunderstood the scenario, I'd appreciate if you clarify it :) Shai On Wed, Apr 23, 2014 at 5:49 PM, Rob Audenaerde rob.audenae...@gmail.comwrote: Hi Shai, all, I am trying to write that Filter :). But I'm a bit at a loss as to how to efficiently grab the multi-values. I can access the context.reader().document() that accesses the stored fields, but that seems slow. For single-value fields I use a compiled JavaScript Expression with SimpleBindings as ValueSource, which seems to work quite well. The downside is that I cannot find a way to implement multi-value through that solution. These create for example a LongFieldSource, which uses the FieldCache.LongParser. These parsers only seem to parse one value. Is there an efficient way to get -all- of the (numeric) values for a field in a document? On Wed, Apr 23, 2014 at 4:38 PM, Shai Erera ser...@gmail.com wrote: You can do that by writing a Filter which returns matching documents based on a sum of the field's values. However I suspect that is going to be slow, unless you know that you will need several such filters and can cache them. Another approach would be to write a Collector which serves as a Filter, but computes the sum only for documents that match the query. Hopefully that would mean you compute the sum for fewer documents than you would have w/ the Filter approach. Shai On Wed, Apr 23, 2014 at 5:11 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: This isn't really a good use case for an index like Lucene. The most essential property of an index is that it lets you look up documents very quickly based on *precomputed* values. -Mike On 04/23/2014 06:56 AM, Rob Audenaerde wrote: Hi all, I'm looking for a way to use multi-values in a filter. I want to be able to search on sum(field)=100, where field has values in one document: field=60 field=40 In this case 'field' is a LongField. I examined the code in the FieldCache, but that seems to focus on single-valued fields only. Is this something that can be done in Lucene? And what would be a good approach? Thanks in advance, -Rob
Re: IndexReplication Client and IndexWriter
Hi Christoph, Apologies for the delayed response, I'm on a holiday vacation. I will take a look at your issues as soon as I can. Shai On Fri, Apr 11, 2014 at 12:02 PM, Christoph Kaser lucene_l...@iconparc.dewrote: Hello Shai and Mike, thank you for your answers! I created LUCENE-5597 for this feature. Unfortunately, I am not sure I will be able to provide patches: I don't need this feature at the moment (my interest was more academic) and unfortunately don't have the time to work on this. Additionally, I created LUCENE-5599, which provides a patch to fix a small performance issue I had with the replicator when replicating large indexes. Regards, Christoph Kaser On 08.04.2014 12:45, Michael McCandless wrote: You might be able to use a class on the NRT replication branch (LUCENE-5438), InfosRefCounts (weird name), whose purpose is to do what IndexFileDeleter does for IndexWriter, i.e. keep track of which files are still referenced, delete them when they are done, etc. This could be used on the client side to hold a lease for another client. Mike McCandless http://blog.mikemccandless.com On Tue, Apr 8, 2014 at 6:26 AM, Shai Erera ser...@gmail.com wrote: IndexRevision uses the IndexWriter for deleting unused files when the revision is released, as well as to obtain the SnapshotDeletionPolicy. I think that you will need to implement two things on the client side: * Revision, which doesn't use IndexWriter. * Replicator which keeps track of how many refs a file has (basically what IndexFileDeleter does) Then you could set up any node in the middle to be both a client and a server. Would be interesting to explore that. Would you like to open an issue? And maybe even try to come up w/ a patch? Shai On Tue, Apr 8, 2014 at 1:05 PM, Michael McCandless luc...@mikemccandless.com wrote: It's not safe also opening an IndexWriter on the client side. But I agree, supporting tree topology would make sense; it seems like we just need a way for the ReplicationClient to also be a Replicator. It seems like it should be possible, since it's clearly aware of the SessionToken it's pulled from the original Replicator. Mike McCandless http://blog.mikemccandless.com On Tue, Apr 8, 2014 at 3:42 AM, Christoph Kaser lucene_l...@iconparc.de wrote: Hi all, I am trying out the (highly useful) index replicator module (with the HttpReplicator) and have stumbled upon a question: It seems the IndexReplicationHandler works directly on the index directory, without using an IndexWriter. Could there be a problem if I open an IndexWriter on the client side? Usually, this should not be needed, as only the master should be changed, however if I want to implement a tree topology, I need an IndexWriter on a non-leaf client, because the IndexRevision that I need to publish needs one. Regards, Christoph
Re: NRT facet issue (bug?), hard to reproduce, please advise
Hi I am not sure how more than one client_no field ends up w/ a document, and I'm not sure it's related to the taxonomy at all. However, looking at the code example you pasted above, and since you mention that you index+commit in one thread, while another thread does the reopen, I wonder if that's the issue: you first commit the taxo, then commit the index. But what if a new document makes it into the index after you committed the taxo, with a new client_no? In that case, the reopening thread will discover an older taxonomy, while the index will have categories with ordinals larger than the taxonomy's greatest ordinal. I also think that it's a mistake to commit and reopen in two separate threads. If possible, I suggest that you do that always in the same thread, and in that order: first commit the index, then the taxonomy. That way, if a document goes into the index (and new facets to the taxonomy) after the index.commit(), then when you reopen the worst case is that the taxonomy is ahead of the index, which is fine. When you reopen, also reopen in the same order. Could you try that and see if that resolves your issue? Although, I don't understand how this can lead to more than one client_no ending up in one document, unless there's also a concurrency bug in the indexing code ... or I misunderstood the issue. Shai On Fri, Apr 11, 2014 at 2:49 PM, Rob Audenaerde rob.audenae...@gmail.comwrote: Hi all, I have an issue using the near real-time search in the taxonomy. I could really use some advice on how to debug/proceed with this issue. The issue is as follows: I index 100k documents, with about 40 fields each. For each field, I also add a FacetField (the issue arises both with FacetField and FloatAssociationFacetField). Each document has a unique number field (client_no). When just indexing and searching afterwards, all is fine. When searching while indexing, sometimes the number of facets associated with a document is too high, i.e. when collecting facets there is more than one client_no on one document, which of course should not be the case. Before each search, I use manager.maybeRefreshBlocking(), because I want the most up-to-date results. I have a taxonomy and index reader combined in a ReferenceManager (I created this before the SearcherTaxonomyManager existed, but it behaves exactly the same, similar refcount logic). During indexing I commit every 5000 documents (not needed for the NRT search, but needed to prevent loss should the application shut down). I commit as follows: public void commit() throws DocumentIndexException { try { synchronized ( GlobalIndexCommitAndCloseLock.LOCK ) { this.taxonomyWriter.commit(); this.luceneIndexWriter.commit(); } } catch ( final OutOfMemoryError | IOException e ) { tryCloseWritersOnOOME( this.luceneIndexWriter, this.taxonomyWriter ); throw new DocumentIndexException( e ); } } I use a standard IndexWriterConfig, and both IndexWriter and TaxonomyWriter write to a RAMDirectory(). My testcase indexes the 100k documents, while another thread is continuously calling 'manager.maybeRefreshBlocking()'. This is enough to sometimes cause the taxonomy to be incorrect. The number of indexing threads does not seem to influence the issue, as it also appears when I have only 1 indexing thread. I know it is an index problem, because when I write the index to file instead of RAM and reopen it in a clean application, I see the same behaviour. I could really use some advice on how to debug/proceed with this issue. If more info is needed, just ask. 
Thanks in advance, -Rob
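[Editor's note: a sketch of the corrected commit method, following the ordering advice in the reply above. The lock and writer names simply mirror the code quoted in the mail.]

    public void commit() throws DocumentIndexException {
      try {
        synchronized (GlobalIndexCommitAndCloseLock.LOCK) {
          this.luceneIndexWriter.commit(); // 1) index first
          this.taxonomyWriter.commit();    // 2) taxonomy second: if a doc slips in between,
                                           //    the committed taxonomy is "ahead", which is fine
        }
      } catch (final OutOfMemoryError | IOException e) {
        tryCloseWritersOnOOME(this.luceneIndexWriter, this.taxonomyWriter);
        throw new DocumentIndexException(e);
      }
    }

Per the same advice, the reopen should happen in the same thread and in the same order, rather than from a separate refresh thread.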
Re: IndexReplication Client and IndexWriter
IndexRevision uses the IndexWriter for deleting unused files when the revision is released, as well as to obtain the SnapshotDeletionPolicy. I think that you will need to implement two things on the client side: * Revision, which doesn't use IndexWriter. * Replicator which keeps track of how many refs a file has (basically what IndexFileDeleter does) Then you could set up any node in the middle to be both a client and a server. Would be interesting to explore that. Would you like to open an issue? And maybe even try to come up w/ a patch? Shai On Tue, Apr 8, 2014 at 1:05 PM, Michael McCandless luc...@mikemccandless.com wrote: It's not safe also opening an IndexWriter on the client side. But I agree, supporting tree topology would make sense; it seems like we just need a way for the ReplicationClient to also be a Replicator. It seems like it should be possible, since it's clearly aware of the SessionToken it's pulled from the original Replicator. Mike McCandless http://blog.mikemccandless.com On Tue, Apr 8, 2014 at 3:42 AM, Christoph Kaser lucene_l...@iconparc.de wrote: Hi all, I am trying out the (highly useful) index replicator module (with the HttpReplicator) and have stumbled upon a question: It seems the IndexReplicationHandler works directly on the index directory, without using an IndexWriter. Could there be a problem if I open an IndexWriter on the client side? Usually, this should not be needed, as only the master should be changed, however if I want to implement a tree topology, I need an IndexWriter on a non-leaf client, because the IndexRevision that I need to publish needs one. Regards, Christoph -- Dipl.-Inf. Christoph Kaser IconParc GmbH Sophienstrasse 1 80333 München www.iconparc.de Tel +49 -89- 15 90 06 - 21 Fax +49 -89- 15 90 06 - 49 Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer. HRB 121830, Amtsgericht München
Re: Replicator: how to use it?
Even if the commit is called just before the close, the close triggers a last commit. That seems wrong. If you do writer.commit() and then immediately writer.close(), and there are no changes to the writer in between (i.e. no thread comes in and adds/updates/deletes a document), then close() should not create a new commit point. Do you see that it does? Shai On Wed, Mar 19, 2014 at 11:09 PM, Roberto Franchini franch...@celi.itwrote: On Sat, Mar 15, 2014 at 12:56 PM, Roberto Franchini franch...@celi.it wrote: On Sat, Mar 15, 2014 at 12:47 PM, Shai Erera ser...@gmail.com wrote: If you use LocalReplicator on both sides, you have to use the same instance on both sides. Otherwise the replicas will never see the published revisions, which are done in a separate instance. Can you try that? Ok, I missed it. I was using different instances. I'll try this afternoon. Hi, the replicator works fine on a live writer, but when the writer is closed it does a last commit that isn't replicated. Even if the commit is called just before the close, the close triggers a last commit. And trying to use the writer after close is impossible: writer.close(); revision = new IndexRevision(writer); produces: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed So, I can replicate the last commit before the close, and not worry about the inner commit that close does. Maybe I'll lose something? RF -- Roberto Franchini The impossible is inevitable. http://www.celi.it http://www.blogmeter.it http://github.com/celi-uim http://github.com/robfrank Tel +39.011.562.71.15 jabber:ro.franch...@gmail.com skype:ro.franchini tw:@robfrankie
Re: Few questions on updatable DocValues
Double fields can be implemented today over NumericDVField and therefore already support updates. String fields can be implemented on Sorted/SortedSetDVField, but updates for them are not supported yet. I hope that once I'm done w/ LUCENE-5513, adding update support for Sorted/SortedSet will be even easier. Shai On Fri, Mar 14, 2014 at 6:22 PM, Gopal Patwa gopalpa...@gmail.com wrote:
Re: Replicator: how to use it?
If you use LocalReplicator on both sides, you have to use the same instance on both sides. Otherwise the replicas will never see the published revisions, which are done in a separate instance. Can you try that? Shai On Mar 15, 2014 1:10 PM, Roberto Franchini franch...@celi.it wrote: On Sat, Mar 15, 2014 at 11:58 AM, Michael McCandless luc...@mikemccandless.com wrote: I think maybe the problem is you are using LocalReplicator on the replicas? I think you should only use that on the master. I think e.g. you should use HttpReplicator on the clients? Or, your own implementation that moves the files its own way. Have you seen Shai's blog post about this? http://shaierera.blogspot.com/2013/05/the-replicator.html Yes, I've seen it. I checked out the replicator code and looked at the test code. I'm trying to use the local replicator because, as a first step, I want only to incrementally back up indexes. So I've implemented a sort of producer/consumer where the indexer is the producer: it runs on its own thread and publishes revisions, and the consumer will be the replicator client that's on its own thread. Code samples aren't, at least for me, very clear in how to use the replicator. So, if someone has a clean sample of use of the replicator I would appreciate it. Regards, RF -- Roberto Franchini The impossible is inevitable. http://www.celi.it http://www.blogmeter.it http://github.com/celi-uim http://github.com/robfrank Tel +39.011.562.71.15 jabber:ro.franch...@gmail.com skype:ro.franchini tw:@robfrankie
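[Editor's note: a rough sketch of wiring both sides to one LocalReplicator instance, per the advice above. Untested; it assumes the 4.x replicator-module APIs, and backupDir/workDir are illustrative names.]

    import org.apache.lucene.replicator.IndexReplicationHandler;
    import org.apache.lucene.replicator.IndexRevision;
    import org.apache.lucene.replicator.LocalReplicator;
    import org.apache.lucene.replicator.PerSessionDirectoryFactory;
    import org.apache.lucene.replicator.ReplicationClient;
    import org.apache.lucene.replicator.Replicator;

    Replicator replicator = new LocalReplicator(); // the single shared instance

    // producer (indexer) side: publish a revision after each commit;
    // the IndexWriter must be configured with a SnapshotDeletionPolicy
    indexWriter.commit();
    replicator.publish(new IndexRevision(indexWriter));

    // consumer (backup) side: uses the SAME replicator instance
    ReplicationClient client = new ReplicationClient(
        replicator,
        new IndexReplicationHandler(backupDir, null /* optional callback on update */),
        new PerSessionDirectoryFactory(workDir));
    client.updateNow(); // or start an update thread to poll periodically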
Re: Few questions on updatable DocValues
Hi 1. Is it possible to provide updateNumericDocValue(Term term, Map<String,Long>), in case I wish to update multiple fields and their doc-values? For now you can call updateNDV multiple times, each time w/ a new field. Under the covers, we currently process each update separately anyway. I think in order to change it we'd need to change the API such that it allows you to define an update in many ways (e.g. Query, see below). Then, an update by a single Term to multiple fields is atomic. I don't want, though, to add many updateNDV variants to IW, especially as we'd like to add more DV update capabilities. Want to open an issue to explore that? 2. Instead of a Term based update, is it possible to extend it to using a Query? What are the obvious problems in doing so? Technically yes, but currently it's not exposed. At the lowest level we pull a DocsEnum and iterate on docs to apply the update. So Term/Query would work the same. I think we can explore generalizing the API such that you can define your own update following some well-thought-out API, and that way you have the flexibility in one hand, yet we don't need to maintain all options in the Lucene source code. We can explore that on an issue. 3. TrackingIndexWriter does not have updateNumericDocValue exposed. Any reason for not doing so? No reason in particular :). Can you open an issue (separate from the API)? 4. Is it possible to update a DocValue other than long, like, let's say, a BinaryDV? This is something I currently do on LUCENE-5513, so hopefully very soon you will be able to do that. If I'm fast enough, maybe even in 4.8 :). Shai On Fri, Mar 14, 2014 at 12:14 PM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: Hi, I have a few questions related to the updatable DocValues API... It would be great if I could get help. 1. Is it possible to provide updateNumericDocValue(Term term, Map<String,Long>), in case I wish to update multiple fields and their doc-values? 2. Instead of a Term based update, is it possible to extend it to using a Query? What are the obvious problems in doing so? 3. TrackingIndexWriter does not have updateNumericDocValue exposed. Any reason for not doing so? 4. Is it possible to update a DocValue other than long, like, let's say, a BinaryDV? -- Ravi
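[Editor's note: a tiny sketch of the one-call-per-field workaround from answer 1 above (Lucene 4.6+ API; the "id"/"price"/"stock" field names are made up).]

    // under the covers each update is processed separately anyway
    Term id = new Term("id", "doc-17");
    writer.updateNumericDocValue(id, "price", 990L);
    writer.updateNumericDocValue(id, "stock", 42L);
    writer.commit(); // note: the two updates are not atomic as a pair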
Re: Adding custom weights to individual terms
I often prefer to manage such weights outside the index. Usually managing them inside the index leads to problems in the future when e.g. the weights change. If they are encoded in the index, it means re-indexing. Also, if the weight changes then in some segments the weight will be different than in others. I think that if you manage the weights e.g. in a simple FST (which is very compact), it will give you the best flexibility and it's very easy to use. Shai On Thu, Feb 13, 2014 at 1:36 PM, Michael McCandless luc...@mikemccandless.com wrote: You could stuff your custom weights into a payload, and index that, but this is per term per document per position, while it sounds like you just want one float for each term regardless of the documents/positions where that term occurred? Doing your own custom attribute would be a challenge: not only must you create and set this attribute during indexing, but you then must change the indexing process (custom chain, custom codec) to get the new attribute into the index, and then make a custom query that can pull this attribute at search time. What are these term weights? Are you sure you can't compute these weights at search time with a custom similarity using the stats that are already stored (docFreq, totalTermFreq, maxDoc, etc.)? Mike McCandless http://blog.mikemccandless.com On Thu, Feb 13, 2014 at 2:40 AM, Rune Stilling s...@rdfined.dk wrote: Hi list I'm trying to figure out how customizable scoring and weighting is in the Lucene API. I read about the APIs but still can't figure out if the following is possible. I would like to do normal document text indexing, but I would like to control the weight added to tokens myself; also I would like to control the weighting of query tokens and how things are added together. When indexing a word I would like to attach my own weights to the word, and use these weights when querying for documents. F.ex. Doc 1 Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99) API(0.3) Doc 2 Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1) The floats in parentheses are some I would like to add in the indexing process, not something coming from Lucene tf/idf. When querying I would like to repeat this and also create the weights for each term myself and control how the final doc score is calculated. I have read that it's possible to attach your own custom attributes to tokens. Is this the way to go? I.e. should I add my custom weight as attributes to tokens, and then access these attributes when calculating document score in the search process (described here https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/analysis/package-summary.html under "adding a custom attribute")? The reason why I'm asking is that I can't find any examples of this being done anywhere. But I found someone stating "With Lucene, it is impossible to increase or decrease the weight of individual terms in a document." With regards Rune
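[Editor's note: a minimal sketch of keeping the term weights in an FST outside the index, as suggested above. Untested; it assumes the 4.x FST API, terms must be added in sorted order, and the float weights are scaled to longs since these FST outputs are longs.]

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRef;
    import org.apache.lucene.util.fst.Builder;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.PositiveIntOutputs;
    import org.apache.lucene.util.fst.Util;

    Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, PositiveIntOutputs.getSingleton());
    IntsRef scratch = new IntsRef();
    // inputs must be added in sorted order; weights scaled by 100
    builder.add(Util.toIntsRef(new BytesRef("indexing"), scratch), 62L); // 0.62
    builder.add(Util.toIntsRef(new BytesRef("lucene"), scratch), 70L);   // 0.70
    builder.add(Util.toIntsRef(new BytesRef("search"), scratch), 99L);   // 0.99
    FST<Long> fst = builder.finish();

    Long weight = Util.get(fst, new BytesRef("lucene")); // 70, or null if absent

Because the FST lives outside the index, changing a weight only means rebuilding this small structure, not re-indexing.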
Re: Actual min and max-value of NumericField during codec flush
Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks adjacent segments and SortingMP ensures the merged segment is also sorted. Shai On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: Yes exactly as you have described. Ex: Consider segments [S1, S2, S3, S4] that are in reverse-chronological order and go for a merge. While SortingMergePolicy will correctly solve the merge part, it does not however play any role in picking segments to merge, right? SMP internally delegates to TieredMergePolicy, which might pick S1 and S4 to merge, disturbing the global order. Ideally only adjacent segments should be picked up for merge. Ex: {S1,S2} or {S2,S3,S4} etc... Can there be a better selection of segments to merge in this case, so as to maintain a semblance of global ordering? -- Ravi On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless luc...@mikemccandless.com wrote: OK, I see (early termination). That's a challenge, because you really want the docs sorted backwards from how they were added, right? And, e.g., merged and then searched in reverse segment order? I think you should be able to do this w/ SortingMergePolicy? And then use a custom collector that stops after you've gone back enough in time for a given search. Mike McCandless http://blog.mikemccandless.com On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: Mike, All our queries need to be sorted by timestamp field, in descending order of time. [latest-first] Each segment is sorted in itself. But TieredMergePolicy picks arbitrary segments and merges them [even with SortingMergePolicy etc...]. I am trying to avoid this and see if an approximate global ordering of segments [by time-stamp field] can be maintained via merge. Ex: TopN results will only examine recent 2-3 smaller segments [best-case] and return, without examining older and bigger segments. I do not know the terminology, may be Early Query Termination Across Segments etc...? -- Ravi On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless luc...@mikemccandless.com wrote: LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total order. Only TieredMergePolicy merges out-of-order segments. I don't understand why you need to encouraging merging of the more recent (by your time field) segments... Mike McCandless http://blog.mikemccandless.com On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: Mike, Each of my flushed segments is fully ordered by time. But TieredMergePolicy or LogByteSizeMergePolicy is going to pick arbitrary time-segments and disturb this arrangement, and I wanted some kind of control on this. But like you pointed out, going by only time-adjacent merges can be disastrous. Is there a way to mix both time and size to arrive at a somewhat [less-than-accurate] global order of segment merges? Like attempt a time-adjacent merge, provided the size of segments is not extremely skewed etc... -- Ravi On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless luc...@mikemccandless.com wrote: You want to focus merging on the segments containing newer documents? Why? This seems somewhat dangerous... Not taking into account the true segment size can lead to very very poor merge decisions ... you should turn on IndexWriter's infoStream and do a long running test to convince yourself the merging is being sane. 
Mike Mike McCandless http://blog.mikemccandless.com On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: Thanks Mike, Will try your suggestion. I will try to describe the actual use-case itself. There is a requirement for merging time-adjacent segments [append-only, rolling time-series data]. All documents have a timestamp affixed, and during flush I need to note down the least timestamp for all documents, through a Codec. Then, I define a TimeMergePolicy that extends LogMergePolicy and define the segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag]. LogMergePolicy will auto-arrange levels of segments according to time and proceed with merges. The latest segments will be smaller in size and preferred during merges over older and bigger segments. Do you think such an approach will be fine, or are there better ways to solve this? -- Ravi On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless luc...@mikemccandless.com wrote: Somewhere in those numeric trie terms are the exact integers from your documents,
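[Editor's note: a sketch of the LogByteSizeMP + SortingMP combination suggested above. Untested; it assumes the 4.x misc-module SortingMergePolicy that wraps another policy and takes a Sort, and the "timestamp" field name is an example.]

    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.index.SortingMergePolicy;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.util.Version;

    Sort byTimeDesc = new Sort(new SortField("timestamp", SortField.Type.LONG, true)); // latest first
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_47, analyzer);
    iwc.setMergePolicy(new SortingMergePolicy(new LogByteSizeMergePolicy(), byTimeDesc));

LogByteSizeMP keeps merges to adjacent segments, while SortingMP re-sorts each merged segment by the timestamp.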
Re: Actual min and max-value of NumericField during codec flush
Hi LogMP *always* picks adjacent segments together. Therefore, if you have segments S1, S2, S3, S4 where the date-wise sort order is S4S3S2S1, then LogMP will pick either S1-S4, S2-S4, S2-S3 and so on. But always adjacent segments and in a row (i.e. it doesn't skip segments). I guess what both Mike and I don't understand is why you insist on merging based on the timestamp of each segment. I.e. if the order, timestamp-wise, of the segments isn't as I described above, then merging them like so won't hurt - i.e. they will still be unsorted. No harm is done. Maybe MergePolicy isn't what you need here. If you can record somewhere the min/max timestamp of each segment, you can use a MultiReader to wrap the sorted list of IndexReaders (actually SegmentReaders). Then your reader always traverses segments from new to old. If this approach won't address your issue, then you can merge based on timestamps - there's nothing wrong about it. What Mike suggested is that you benchmark your application with this merge policy, for a long period of time (a few hours/days, depending on your indexing rate), because what might happen is that your merges are always unbalanced and your indexing performance will degrade because of the unbalanced amount of IO that happens during the merge. Shai On Thu, Feb 13, 2014 at 7:25 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: @Mike, I had suggested the same approach in one of my previous mails, whereby each segment records min/max timestamps in seg-info diagnostics and uses it for merging adjacent segments. Then, I define a TimeMergePolicy that extends LogMergePolicy and define the segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag]. But you have expressed reservations: This seems somewhat dangerous... Not taking into account the true segment size can lead to very very poor merge decisions ... you should turn on IndexWriter's infoStream and do a long running test to convince yourself the merging is being sane. Will merging be disastrous if I choose a TimeMergePolicy? I will also test and verify, but it's always great to hear finer points from experts. @Shai, LogByteSizeMP categorizes adjacency by size, whereas it would be better if timestamp is used in my case. Sure, I need to wrap this in an SMP to make sure that the newly-created segment is also in sorted order. -- Ravi On Wed, Feb 12, 2014 at 8:29 PM, Shai Erera ser...@gmail.com wrote: Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks adjacent segments and SortingMP ensures the merged segment is also sorted. Shai On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: Yes exactly as you have described. Ex: Consider segments [S1, S2, S3, S4] that are in reverse-chronological order and go for a merge. While SortingMergePolicy will correctly solve the merge part, it does not however play any role in picking segments to merge, right? SMP internally delegates to TieredMergePolicy, which might pick S1 and S4 to merge, disturbing the global order. Ideally only adjacent segments should be picked up for merge. Ex: {S1,S2} or {S2,S3,S4} etc... Can there be a better selection of segments to merge in this case, so as to maintain a semblance of global ordering? -- Ravi On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless luc...@mikemccandless.com wrote: OK, I see (early termination). That's a challenge, because you really want the docs sorted backwards from how they were added, right? And, e.g., merged and then searched in reverse segment order? 
I think you should be able to do this w/ SortingMergePolicy? And then use a custom collector that stops after you've gone back enough in time for a given search. Mike McCandless http://blog.mikemccandless.com On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: Mike, All our queries need to be sorted by timestamp field, in descending order of time. [latest-first] Each segment is sorted in itself. But TieredMergePolicy picks arbitrary segments and merges them [even with SortingMergePolicy etc...]. I am trying to avoid this and see if an approximate global ordering of segments [by time-stamp field] can be maintained via merge. Ex: TopN results will only examine recent 2-3 smaller segments [best-case] and return, without examining older and bigger segments. I do not know the terminology, may be Early Query Termination Across Segments etc...? -- Ravi On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless luc...@mikemccandless.com wrote: LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total order. Only TieredMergePolicy merges out
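[Editor's note: a tiny sketch of the MultiReader idea mentioned above; sortedReaders is an assumed list of per-segment readers ordered newest-first by their recorded max timestamp.]

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;

    // wrap the readers so traversal order is newest segment -> oldest segment
    IndexReader newToOld = new MultiReader(sortedReaders.toArray(new IndexReader[0]));
    IndexSearcher searcher = new IndexSearcher(newToOld);
    // a custom collector can then early-terminate once it has gone far enough back in time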
Re: Regarding DrillDown search
Hi You will need to build a BooleanQuery which comprises a list of PrefixQuery. The relation between each PrefixQuery should be OR or AND, as you see fit (I believe OR?). In order to get documents' attributes you should execute searcher.search() w/ e.g. MultiCollector which wraps a FacetsCollector and TopScoreDocCollector. Then after .search() finished, you should pull the facet results from the FacetsCollector instance and the document results from the TopScoreDocCollector instance. Something like (I hope it compiles in 3.6! :)): TopScoreDocCollector tsdc = TopScoreDocCollector.create(...); FacetsCollector fc = FacetsCollector.create(...); searcher.search(query, MultiCollector.wrap(tsdc, fc)); List<FacetResult> facetResults = fc.getFacetResults(); TopDocs topDocs = tsdc.topDocs(); Something like that.. Shai On Mon, Feb 10, 2014 at 1:57 PM, Jebarlin Robertson jebar...@gmail.comwrote: Dear Shai, Thank you for the quick response :) I have checked with PrefixQuery and term, it is working fine, but I think I cannot pass multiple category paths in it. I am calling the DrillDown.term() method 'N' times based on the number of category paths in the list. And I have one more question. When I get the FacetResult, I am getting only the count of documents matched with the CategoryPath. Is there any way to get the Document objects also, along with the count, to know the file names - for ex. files (file names - the title field in the Document) which have the same Author - from the FacetResult? I have read some articles for the same from one of your answers, I believe. In that you explained that categories will be listed to the user, and when the user clicks a category we have to do a DrillDown search to get further results. I just want to know if we can get the document names as well in the first facet query search itself, when we get the count (no of hits) of documents along with the FacetResult. Is there any solution available already, or what can I do for that? Kindly Guide me :) Thank you for All your Support. Regards, Jebarlin.R On Mon, Feb 10, 2014 at 1:28 PM, Shai Erera ser...@gmail.com wrote: Hi If you want to drill-down on first name only, then you have several options: 1) Index Author/First, Author/Last, Author/First_Last as facets on the document. This is the faster approach, but bloats the index. Also, if you index the author Author/Jebarlin, Author/Robertson and Author/Jebarlin_Robertson, it still won't allow you to execute a query Author/Jebar. 2) You should modify the query to be a PrefixQuery, as if the user chose to search Author/Jeral*. You can do that with DrillDown.term() to create a Term($facets, Author/Jeral) (NOTE: you shouldn't pass '*' as part of the CategoryPath) and then construct your own PrefixQuery with that Term. Hope that helps, Shai On Mon, Feb 10, 2014 at 6:21 AM, Jebarlin Robertson jebar...@gmail.com wrote: Dear Shai, I have one doubt in DrillDown search: when I search with a CategoryPath of author, it is giving me the result only if I give the accurate full name. Is there any way to get the result even if I give just the first or last name? Can you help me to search like (*contains* the word in facet search), if the latest API or any other APIs support it? Thank You -- Thanks Regards, Jebarlin Robertson.R GSM: 91-9538106181. -- Thanks Regards, Jebarlin Robertson.R GSM: 91-9538106181.
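[Editor's note: a sketch of option (2) plus the MultiCollector advice, in 3.6 terms. Untested; the searchParams/userQuery objects and the "Jeb" prefix are illustrative, and the exact DrillDown.term() overload may differ in your 3.6 build.]

    Term t = DrillDown.term(searchParams, new CategoryPath("Author", "Jeb")); // note: no trailing '*'
    BooleanQuery q = new BooleanQuery();
    q.add(userQuery, BooleanClause.Occur.MUST);
    q.add(new PrefixQuery(t), BooleanClause.Occur.MUST); // prefix-matches Author/Jeb*

    TopScoreDocCollector tsdc = TopScoreDocCollector.create(10, true);
    FacetsCollector fc = new FacetsCollector(searchParams, indexReader, taxoReader);
    searcher.search(q, MultiCollector.wrap(tsdc, fc));
    // tsdc.topDocs() for the documents, fc.getFacetResults() for the facet counts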
Re: Regarding DrillDown search
Ahh I see ... so given a single FacetResultNode, you would like to know which documents contributed to its weight (count in your case). This is not available immediately, that's why you need to do a drill-down query. So if you return the user a list of categories, when he clicks one of them, you perform a drill-down query on that category and retrieve all the associated documents. May I ask why you need to know the list of documents given a FacetResultNode? Basically in the 3.6 API it's kind of not so simple to do what you want in one pass, but in the 4.x API (especially the upcoming 4.7) it should be very easy -- when you traverse the list of matching documents, besides only reading the list of categories associated with each, you also store a map Category -> List<docID>. This isn't very cheap though ... So I guess it would be good if I understand why you need to know which documents contributed to which category, before the results are returned to the user. Shai On Mon, Feb 10, 2014 at 3:16 PM, Jebarlin Robertson jebar...@gmail.comwrote: Hi Shai, Thanks, I am using the same BooleanQuery approach with a list of PrefixQuery. I think I confused you, sorry :). I am using the same above code to get the result of documents. I am getting the TopDocs and retrieving the Documents also - if I don't even try that for the basics you will kill me :D. But my question was different: from the List of FacetResult I am getting only the counts (no of hits) of documents in each category after iterating the list. I believe that the getLevel() of FacetNode returns the no of hits, or no of documents that fall into the particular category. I need to know which documents fall under the same category, from the FacetResult object, also. I hope you will understand my question :) Thank you :) -- Jebarlin On Mon, Feb 10, 2014 at 9:09 PM, Shai Erera ser...@gmail.com wrote: Hi You will need to build a BooleanQuery which comprises a list of PrefixQuery. The relation between each PrefixQuery should be OR or AND, as you see fit (I believe OR?). In order to get documents' attributes you should execute searcher.search() w/ e.g. MultiCollector which wraps a FacetsCollector and TopScoreDocCollector. Then after .search() finished, you should pull the facet results from the FacetsCollector instance and the document results from the TopScoreDocCollector instance. Something like (I hope it compiles in 3.6! :)): TopScoreDocCollector tsdc = TopScoreDocCollector.create(...); FacetsCollector fc = FacetsCollector.create(...); searcher.search(query, MultiCollector.wrap(tsdc, fc)); List<FacetResult> facetResults = fc.getFacetResults(); TopDocs topDocs = tsdc.topDocs(); Something like that.. Shai On Mon, Feb 10, 2014 at 1:57 PM, Jebarlin Robertson jebar...@gmail.com wrote: Dear Shai, Thank you for the quick response :) I have checked with PrefixQuery and term, it is working fine, but I think I cannot pass multiple category paths in it. I am calling the DrillDown.term() method 'N' times based on the number of category paths in the list. And I have one more question. When I get the FacetResult, I am getting only the count of documents matched with the CategoryPath. Is there any way to get the Document objects also, along with the count, to know the file names - for ex. files (file names - the title field in the Document) which have the same Author - from the FacetResult? I have read some articles for the same from one of your answers, I believe. 
In that you explained that categories will be listed to the user, and when the user clicks a category we have to do a DrillDown search to get further results. I just want to know if we can get the document names as well in the first facet query search itself, when we get the count (no of hits) of documents along with the FacetResult. Is there any solution available already, or what can I do for that? Kindly Guide me :) Thank you for All your Support. Regards, Jebarlin.R On Mon, Feb 10, 2014 at 1:28 PM, Shai Erera ser...@gmail.com wrote: Hi If you want to drill-down on first name only, then you have several options: 1) Index Author/First, Author/Last, Author/First_Last as facets on the document. This is the faster approach, but bloats the index. Also, if you index the author Author/Jebarlin, Author/Robertson and Author/Jebarlin_Robertson, it still won't allow you to execute a query Author/Jebar. 2) You should modify the query to be a PrefixQuery, as if the user chose to search Author/Jeral*. You can do that with DrillDown.term() to create a Term($facets, Author/Jeral) (NOTE: you shouldn't pass '*' as part of the CategoryPath) and then construct your own PrefixQuery with that Term. Hope that helps, Shai On Mon
Re: Regarding DrillDown search
What you want sounds more like grouping than faceting? So e.g. if you have an Author field with values A1, A2, A3, and the user searches for 'love', then if I understand correctly, you want to display something like:

Author/A1
  Doc1
  Doc2
Author/A2
  Doc3
  Doc4
Author/A3
  Doc5
  Doc6

Is that right? Whereas today your result page looks like this:

Facets       Results
------       -------
Author       Doc1_Title
 A1 (4)      Doc1_Highlight
 A2 (3)
 A3 (1)      Doc2_Title
             Doc2_Highlight
             +++
...

(Forgive my lack of creativity :)). If you're not interested in join, and just want to add to each document its Author facet in the results pane, then I suggest you add another stored field (only stored, not indexed) with the category value. And then you could display:

Facets       Results
------       -------
Author       Doc1_Title
 A1 (4)      Doc1_Highlight
 A2 (3)      Author: A1
 A3 (1)
             Doc2_Title
             Doc2_Highlight
             Author: A2
             +++
...

Did I understand properly? Shai On Mon, Feb 10, 2014 at 4:51 PM, Jebarlin Robertson jebar...@gmail.comwrote: Hi Shai, Thanks for the explanation :) For my requirement, I just want to display the list of resulted documents to the user. In the facet search case also, I already have the resulted documents list in TopDocs, and the FacetResults have only the count of documents contributed to each category. According to my understanding: suppose I query for the word Love. Now I do a facet search and get 4 (file) documents as matched results from TopScoreDocCollector as TopDocs, and I will get the FacetResult from the FacetsCollector. And the FacetResultNode gives me only the values of the category and the count of how many documents fall under the same category (maybe by Author or other provided categories) among the 4 resulted documents only. I feel it will be good if I get the category association with the resulted documents, as I have the document list already from TopScoreDocCollector. I can do a DrillDown search also by selecting each category, but in my case I just want to display the 4 documents result first and then category-wise, suppose 2 documents by the same Author etc. As per my requirement, I am doing a DrillDown search by asking the user to provide e.g. the title of the document, the author of the document, etc... as an advanced search option. --- Jebarlin Robertson.R On Mon, Feb 10, 2014 at 10:30 PM, Shai Erera ser...@gmail.com wrote: Ahh I see ... so given a single FacetResultNode, you would like to know which documents contributed to its weight (count in your case). This is not available immediately, that's why you need to do a drill-down query. So if you return the user a list of categories, when he clicks one of them, you perform a drill-down query on that category and retrieve all the associated documents. May I ask why you need to know the list of documents given a FacetResultNode? Basically in the 3.6 API it's kind of not so simple to do what you want in one pass, but in the 4.x API (especially the upcoming 4.7) it should be very easy -- when you traverse the list of matching documents, besides only reading the list of categories associated with each, you also store a map Category -> List<docID>. This isn't very cheap though ... So I guess it would be good if I understand why you need to know which documents contributed to which category, before the results are returned to the user. Shai On Mon, Feb 10, 2014 at 3:16 PM, Jebarlin Robertson jebar...@gmail.com wrote: Hi Shai, Thanks, I am using the same BooleanQuery approach with a list of PrefixQuery. I think I confused you, sorry :). I am using the same above code to get the result of documents. 
I am getting the TopDocs and retrieving the Documents also - if I don't even try that for the basics you will kill me :D. But my question was different: from the List of FacetResult I am getting only the counts (no of hits) of documents in each category after iterating the list. I believe that the getLevel() of FacetNode returns the no of hits, or no of documents that fall into the particular category. I need to know which documents fall under the same category, from the FacetResult object, also. I hope you will understand my question :) Thank you :) -- Jebarlin On Mon, Feb 10, 2014 at 9:09 PM, Shai Erera ser...@gmail.com wrote: Hi You will need to build a BooleanQuery which comprises a list of PrefixQuery. The relation between each PrefixQuery should be OR or AND, as you see fit (I believe OR?). In order to get documents' attributes you should execute searcher.search() w/ e.g
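[Editor's note: a sketch of the stored-only field suggestion from the reply above, in 3.6 terms; "author_display" is a made-up field name.]

    // at index time: stored but not indexed, purely for display in the results pane
    doc.add(new Field("author_display", authorName, Field.Store.YES, Field.Index.NO));

    // at search time, for each hit:
    Document hit = searcher.doc(scoreDoc.doc);
    String author = hit.get("author_display"); // e.g. shown as "Author: A1"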
Re: Regarding DrillDown search
You're welcome. And I suggest that you upgrade to 4.7 as soon as it's out! :) Shai On Mon, Feb 10, 2014 at 5:48 PM, Jebarlin Robertson jebar...@gmail.comwrote: Hi Shai, Yeah, exactly the same way I want to display it. Then I will do it the same way, with a stored field. It is not about lack of creativity, I might not have explained it in the proper way :) Thank you for all the support :) On Tue, Feb 11, 2014 at 12:23 AM, Shai Erera ser...@gmail.com wrote: What you want sounds more like grouping than faceting? So e.g. if you have an Author field with values A1, A2, A3, and the user searches for 'love', then if I understand correctly, you want to display something like:

Author/A1
  Doc1
  Doc2
Author/A2
  Doc3
  Doc4
Author/A3
  Doc5
  Doc6

Is that right? Whereas today your result page looks like this:

Facets       Results
------       -------
Author       Doc1_Title
 A1 (4)      Doc1_Highlight
 A2 (3)
 A3 (1)      Doc2_Title
             Doc2_Highlight
             +++
...

(Forgive my lack of creativity :)). If you're not interested in join, and just want to add to each document its Author facet in the results pane, then I suggest you add another stored field (only stored, not indexed) with the category value. And then you could display:

Facets       Results
------       -------
Author       Doc1_Title
 A1 (4)      Doc1_Highlight
 A2 (3)      Author: A1
 A3 (1)
             Doc2_Title
             Doc2_Highlight
             Author: A2
             +++
...

Did I understand properly? Shai On Mon, Feb 10, 2014 at 4:51 PM, Jebarlin Robertson jebar...@gmail.com wrote: Hi Shai, Thanks for the explanation :) For my requirement, I just want to display the list of resulted documents to the user. In the facet search case also, I already have the resulted documents list in TopDocs, and the FacetResults have only the count of documents contributed to each category. According to my understanding: suppose I query for the word Love. Now I do a facet search and get 4 (file) documents as matched results from TopScoreDocCollector as TopDocs, and I will get the FacetResult from the FacetsCollector. And the FacetResultNode gives me only the values of the category and the count of how many documents fall under the same category (maybe by Author or other provided categories) among the 4 resulted documents only. I feel it will be good if I get the category association with the resulted documents, as I have the document list already from TopScoreDocCollector. I can do a DrillDown search also by selecting each category, but in my case I just want to display the 4 documents result first and then category-wise, suppose 2 documents by the same Author etc. As per my requirement, I am doing a DrillDown search by asking the user to provide e.g. the title of the document, the author of the document, etc... as an advanced search option. --- Jebarlin Robertson.R On Mon, Feb 10, 2014 at 10:30 PM, Shai Erera ser...@gmail.com wrote: Ahh I see ... so given a single FacetResultNode, you would like to know which documents contributed to its weight (count in your case). This is not available immediately, that's why you need to do a drill-down query. So if you return the user a list of categories, when he clicks one of them, you perform a drill-down query on that category and retrieve all the associated documents. May I ask why you need to know the list of documents given a FacetResultNode? Basically in the 3.6 API it's kind of not so simple to do what you want in one pass, but in the 4.x API (especially the upcoming 4.7) it should be very easy -- when you traverse the list of matching documents, besides only reading the list of categories associated with each, you also store a map Category -> List<docID>. 
This isn't very cheap though ... So I guess it would be good if I understand why you need to know which documents contributed to which category, before the results are returned to the user. Shai On Mon, Feb 10, 2014 at 3:16 PM, Jebarlin Robertson jebar...@gmail.com wrote: Hi Shai, Thanks, I am using the same BooleanQuery approach with a list of PrefixQuery. I think I confused you, sorry :). I am using the same above code to get the result of documents. I am getting the TopDocs and retrieving the Documents also - if I don't even try that for the basics you will kill me :D. But my question was different: from the List of FacetResult I am getting only the counts
Re: Regarding DrillDown search
Hi If you want to drill-down on first name only, then you have several options: 1) Index Author/First, Author/Last, Author/First_Last as facets on the document. This is the faster approach, but bloats the index. Also, if you index the author Author/Jebarlin, Author/Robertson and Author/Jebarlin_Robertson, it still won't allow you to execute a query Author/Jebar. 2) You should modify the query to be a PrefixQuery, as if the user chose to search Author/Jeral*. You can do that with DrillDown.term() to create a Term($facets, Author/Jeral) (NOTE: you shouldn't pass '*' as part of the CategoryPath) and then construct your own PrefixQuery with that Term. Hope that helps, Shai On Mon, Feb 10, 2014 at 6:21 AM, Jebarlin Robertson jebar...@gmail.comwrote: Dear Shai, I have one doubt in DrillDown search: when I search with a CategoryPath of author, it is giving me the result only if I give the accurate full name. Is there any way to get the result even if I give just the first or last name? Can you help me to search like (*contains* the word in facet search), if the latest API or any other APIs support it? Thank You -- Thanks Regards, Jebarlin Robertson.R GSM: 91-9538106181.
Re: Regarding CorruptedIndexException in using Lucene Facet Search
Hi Since 4.2 the facets module has undergone major changes, both API and implementation, and performance has improved x4. If you want to upgrade, then I recommend waiting for 4.7 since we overhauled the API again - this will save you the effort of migrating to e.g. 4.6 and then to the new API once 4.7 is out. And you should always use the same version of Lucene for all of its modules - it's the only way to guarantee things will work :). Shai On Fri, Feb 7, 2014 at 9:05 AM, Jebarlin Robertson jebar...@gmail.comwrote: Dear Shai, I had only made the mistake of using the same directory for both the IndexWriter and the TaxonomyWriter. Now it is working fine. Thank you :) Could you please tell me if there is any major performance difference between the *3.6* and *4.x* *facet* library? Since I use the Lucene 3.6 version, I am using the facet library of the same version. Kindly guide me to use the best and the working one. :) Thank you :) Thanks and Regards, Jebarlin Robertson.R On Fri, Feb 7, 2014 at 12:41 PM, Jebarlin Robertson jebar...@gmail.com wrote: Dear Shai, Thank you for your reply. Actually I am using Lucene 3.6 on Android. It is working fine, but with the latest versions there are some issues. Now I just added this facet search library also along with the old Lucene code. After this facet search integration, it is giving these Corrupted and NullPointerException errors when I add the document object to the IndexWriter. Below is the exception:

02-07 12:38:11.006: W/System.err(5411): java.lang.NullPointerException
02-07 12:38:11.006: W/System.err(5411): at org.apache.lucene.facet.index.streaming.CategoryParentsStream.incrementToken(CategoryParentsStream.java:138)
02-07 12:38:11.006: W/System.err(5411): at org.apache.lucene.facet.index.streaming.CountingListTokenizer.incrementToken(CountingListTokenizer.java:63)
02-07 12:38:11.006: W/System.err(5411): at org.apache.lucene.facet.index.streaming.CategoryTokenizer.incrementToken(CategoryTokenizer.java:48)
02-07 12:38:11.006: W/System.err(5411): at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:141)
02-07 12:38:11.006: W/System.err(5411): at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:276)
02-07 12:38:11.006: W/System.err(5411): at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)
02-07 12:38:11.006: W/System.err(5411): at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2060)
02-07 12:38:11.006: W/System.err(5411): at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2034)
02-07 12:38:11.006: W/System.err(5411): at com.example.lucene.threads.AsyncIndexWriter.addDocumentSynchronous(AsyncIndexWriter.java:343)
02-07 12:38:11.006: W/System.err(5411): at com.example.lucene.threads.AsyncIndexWriter.addDocumentSync(AsyncIndexWriter.java:371)

Please help if I am missing something. Thanks and regards, Jebarlin.R On Thu, Feb 6, 2014 at 11:04 PM, Shai Erera ser...@gmail.com wrote: It looks like something's wrong with the index indeed. Are you sure you committed both the IndexWriter and TaxoWriter? Do you have some sort of testcase / short program which demonstrates the problem? I know there were few issues running Lucene on Android, so I cannot guarantee it works fully .. we never tested this code on Android. Shai On Thu, Feb 6, 2014 at 3:21 PM, Jebarlin Robertson jebar...@gmail.com wrote: Hi, I am using Lucene 3.6 version for indexing and searching in an Android application. I have implemented facet search. 
But when I try to search, it throws the exception below while creating the DirectoryTaxonomyReader object:

02-06 21:00:58.082: W/System.err(15518): org.apache.lucene.index.CorruptIndexException: Missing parent data for category 1

Can anyone help me figure out the cause? Are the categories not being added to the Lucene index, or is it some other problem? It would be great if somebody could provide some sample code for using Lucene facets with version 3.6.

-- Thanks Regards, Jebarlin Robertson.R GSM: 91-9538106181.
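For anyone hitting the same wall: a minimal sketch of the fix described above, against the 3.6 facet API (CategoryDocumentBuilder), using two separate directories and committing the taxonomy before the index. The paths and analyzer choice here are assumptions for illustration, not from the thread:

import java.io.File;
import java.util.Collections;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.facet.index.CategoryDocumentBuilder;
import org.apache.lucene.facet.taxonomy.CategoryPath;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Two distinct directories -- sharing one Directory between the two
// writers is exactly the mistake that corrupted the index in this thread.
Directory indexDir = FSDirectory.open(new File("/data/index"));
Directory taxoDir = FSDirectory.open(new File("/data/taxo"));
IndexWriter writer = new IndexWriter(indexDir,
    new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);

Document doc = new Document();
// Adds the drill-down terms and category payloads to the document,
// and the category itself (plus its parents) to the taxonomy index.
new CategoryDocumentBuilder(taxoWriter)
    .setCategoryPaths(Collections.singletonList(new CategoryPath("Author", "Lisa")))
    .build(doc);
writer.addDocument(doc);

// Commit the taxonomy first, then the index, so the search index never
// references an ordinal the taxonomy doesn't have yet.
taxoWriter.commit();
writer.commit();
taxoWriter.close();
writer.close();

Committing (and closing) in that order is what Shai's "did you commit both writers?" question is getting at: an index commit that outruns its taxonomy commit produces exactly the "Missing parent data" style of corruption seen here.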
Re: Regarding CorruptedIndexException in using Lucene Facet Search
It looks like something's wrong with the index indeed. Are you sure you committed both the IndexWriter and TaxoWriter? Do you have some sort of test case / short program which demonstrates the problem? I know there were a few issues running Lucene on Android, so I cannot guarantee it works fully .. we never tested this code on Android.

Shai

On Thu, Feb 6, 2014 at 3:21 PM, Jebarlin Robertson jebar...@gmail.com wrote:

Hi, I am using Lucene 3.6 for indexing and searching in an Android application. I have implemented facet search. But when I try to search, it throws the exception below while creating the DirectoryTaxonomyReader object:

02-06 21:00:58.082: W/System.err(15518): org.apache.lucene.index.CorruptIndexException: Missing parent data for category 1

Can anyone help me figure out the cause? Are the categories not being added to the Lucene index, or is it some other problem? It would be great if somebody could provide some sample code for using Lucene facets with version 3.6.

-- Thanks Regards, Jebarlin Robertson.R GSM: 91-9538106181.
Re: updating docs when using SortedSetDocValuesFacetFields
Note that Lucene doesn't support general in-place document updates; updating a document means first deleting it and then adding it back. Therefore, even if you only intend to add or change a few categories of an existing document, you have to fully re-index the document. This is not specific to categories but applies to any field you add, except NumericDocValues fields, which support in-place value updates since Lucene 4.6.

Shai

On Wed, Jan 22, 2014 at 1:15 AM, Rose, Stuart J stuart.r...@pnnl.gov wrote:

I'm using Lucene 4.4 with SortedSetDocValuesFacetFields and would like to add and/or remove CategoryPaths for certain documents in the index. Basically, as additional sets of docs are added, the CategoryPaths for some of the previously indexed documents need to be changed. My current testing with writer.updateDocument(docIdTerm, docFields) seems to be generating duplicates, as there are more documents in the index than expected. Is this a known issue with SortedSetDocValuesFacetFields, and is this usage discouraged? Thanks! Stuart
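A sketch of what a full re-index looks like on 4.4 with SortedSetDocValuesFacetFields; the "id"/"body" field names and category values are made up for illustration. One plausible cause of the duplicates reported above (an assumption - the thread doesn't confirm it) is a delete term that doesn't match how the id field was indexed; an unanalyzed StringField avoids that:

// The replacement document must carry ALL fields and ALL categories,
// not just the changed ones -- updateDocument() is a delete-then-add.
Document doc = new Document();
doc.add(new StringField("id", "doc-42", Field.Store.YES)); // not analyzed, so the delete Term matches exactly
doc.add(new TextField("body", "the full original text ...", Field.Store.NO));

List<CategoryPath> categories = new ArrayList<CategoryPath>();
categories.add(new CategoryPath("Author", "Lisa"));    // unchanged category, still re-added
categories.add(new CategoryPath("Topic", "Faceting")); // the newly added category
new SortedSetDocValuesFacetFields().addFields(doc, categories);

// Atomically deletes every doc whose "id" term is "doc-42" and adds the new one.
writer.updateDocument(new Term("id", "doc-42"), doc);

If the id field were tokenized, the Term passed to updateDocument might match nothing, so the old document would survive alongside the new one - which would show up as exactly the "more documents than expected" symptom.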
Re: Issue with FacetFields.addFields() throwing ArrayIndexOutOfBoundsException
Do you have a test which reproduces the error? Are you adding categories with very deep hierarchies?

Shai

On Fri, Jan 17, 2014 at 11:59 PM, Matthew Petersen mdpe...@gmail.com wrote:

I've confirmed that using the LruTaxonomyWriterCache solves the issue for me. It would appear there is in fact a bug in the Cl2oTaxonomyWriterCache, or I am using it incorrectly (I use it as the default, with no customization).

On Fri, Jan 17, 2014 at 9:29 AM, Matthew Petersen mdpe...@gmail.com wrote:

I'm sure. I had seen that issue and it looked similar, but the stack trace is slightly different. I've found that if I replace the Cl2oTaxonomyWriterCache with the LruTaxonomyWriterCache, the problem seems to go away. I'm working right now on running a test that will prove this, but it takes a while, as the cache needs to get very large. If this proves to solve the problem, then I'd say there is still a bug in the Cl2oTaxonomyWriterCache implementation. Thanks for the response. Matt

On Fri, Jan 17, 2014 at 6:36 AM, Michael McCandless luc...@mikemccandless.com wrote:

Are you sure you're using 4.4? Because ... this looks like https://issues.apache.org/jira/browse/LUCENE-5048, but that was supposedly fixed in 4.4.

Mike McCandless http://blog.mikemccandless.com

On Thu, Jan 16, 2014 at 5:33 PM, Matthew Petersen mdpe...@gmail.com wrote:

I'm having an issue with an index when adding category paths to a document. They seem to be added without issue for a long period of time; then, for some unknown reason, the addition fails with an ArrayIndexOutOfBoundsException. Subsequent attempts to add category paths fail with the same exception. I've run CheckIndex on both the index and the taxonomy directory, and both come back clean with no issues. I cannot fix the index because, according to Lucene, it is not broken. Could this be a bug in Lucene?
Below is the stack trace when the exception occurs:

Lucene v4.4.0

java.lang.ArrayIndexOutOfBoundsException: -65535
at java.util.ArrayList.elementData(ArrayList.java:371)
at java.util.ArrayList.get(ArrayList.java:384)
at org.apache.lucene.facet.taxonomy.writercache.cl2o.CharBlockArray.charAt(CharBlockArray.java:152)
at org.apache.lucene.facet.taxonomy.writercache.cl2o.CategoryPathUtils.equalsToSerialized(CategoryPathUtils.java:61)
at org.apache.lucene.facet.taxonomy.writercache.cl2o.CompactLabelToOrdinal.getOrdinal(CompactLabelToOrdinal.java:257)
at org.apache.lucene.facet.taxonomy.writercache.cl2o.CompactLabelToOrdinal.getOrdinal(CompactLabelToOrdinal.java:140)
at org.apache.lucene.facet.taxonomy.writercache.cl2o.Cl2oTaxonomyWriterCache.get(Cl2oTaxonomyWriterCache.java:74)
at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter.addCategory(DirectoryTaxonomyWriter.java:455)
at org.apache.lucene.facet.index.FacetFields.addFields(FacetFields.java:175)
at com.logrhythm.messaging.indexing.LogIndexerImpl.getDocument(LogIndexerImpl.java:478)
at com.logrhythm.messaging.indexing.LogIndexerImpl.indexLog(LogIndexerImpl.java:392)
at com.logrhythm.messaging.indexing.LogIndexerImpl.indexLogs(LogIndexerImpl.java:357)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.logrhythm.tests.unit.messaging.indexing.LogIndexerTests.logIndexerLoadTest(LogIndexerTests.java:752)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at
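For reference, the workaround discussed above is a one-argument change when constructing the taxonomy writer. A rough sketch on 4.4; the cache size is an arbitrary assumption - size it to your label cardinality, since misses fall back to lookups against the taxonomy index:

import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.facet.taxonomy.writercache.lru.LruTaxonomyWriterCache;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;

// Replaces the default Cl2oTaxonomyWriterCache with the LRU cache that
// Matthew found avoids the ArrayIndexOutOfBoundsException.
DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(
    taxoDir, OpenMode.CREATE_OR_APPEND, new LruTaxonomyWriterCache(4000000));

The trade-off is that the LRU cache can evict labels and re-read them from the taxonomy index, whereas Cl2o keeps a compact label-to-ordinal map for everything; for mostly-unique random labels, as in this test, the LRU behavior is arguably the better fit anyway.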
Re: Issue with FacetFields.addFields() throwing ArrayIndexOutOfBoundsException
Can you open an issue and attach the test there?

On Jan 18, 2014 12:41 AM, Matthew Petersen mdpe...@gmail.com wrote:

I do have a test that will reproduce it. I'm not adding categories with very deep hierarchies; I'm adding 129 category paths per document (all docs have paths with the same labels), with each path having one value. All of the values are completely random and likely unique. It's basically a worst-case test for our app, but the condition has been seen in the field (the error has been encountered at less than the worst-case scenario). The test I have reproduces it very quickly - it only has to index ~330K docs.

On Fri, Jan 17, 2014 at 3:27 PM, Shai Erera ser...@gmail.com wrote:

Do you have a test which reproduces the error? Are you adding categories with very deep hierarchies?

Shai

On Fri, Jan 17, 2014 at 11:59 PM, Matthew Petersen mdpe...@gmail.com wrote:

I've confirmed that using the LruTaxonomyWriterCache solves the issue for me. It would appear there is in fact a bug in the Cl2oTaxonomyWriterCache, or I am using it incorrectly (I use it as the default, with no customization).

On Fri, Jan 17, 2014 at 9:29 AM, Matthew Petersen mdpe...@gmail.com wrote:

I'm sure. I had seen that issue and it looked similar, but the stack trace is slightly different. I've found that if I replace the Cl2oTaxonomyWriterCache with the LruTaxonomyWriterCache, the problem seems to go away. I'm working right now on running a test that will prove this, but it takes a while, as the cache needs to get very large. If this proves to solve the problem, then I'd say there is still a bug in the Cl2oTaxonomyWriterCache implementation. Thanks for the response. Matt

On Fri, Jan 17, 2014 at 6:36 AM, Michael McCandless luc...@mikemccandless.com wrote:

Are you sure you're using 4.4? Because ... this looks like https://issues.apache.org/jira/browse/LUCENE-5048, but that was supposedly fixed in 4.4.

Mike McCandless http://blog.mikemccandless.com

On Thu, Jan 16, 2014 at 5:33 PM, Matthew Petersen mdpe...@gmail.com wrote:

I'm having an issue with an index when adding category paths to a document. They seem to be added without issue for a long period of time; then, for some unknown reason, the addition fails with an ArrayIndexOutOfBoundsException. Subsequent attempts to add category paths fail with the same exception. I've run CheckIndex on both the index and the taxonomy directory, and both come back clean with no issues. I cannot fix the index because, according to Lucene, it is not broken. Could this be a bug in Lucene?
Below is the stack trace when the exception occurs:

Lucene v4.4.0

java.lang.ArrayIndexOutOfBoundsException: -65535
at java.util.ArrayList.elementData(ArrayList.java:371)
at java.util.ArrayList.get(ArrayList.java:384)
at org.apache.lucene.facet.taxonomy.writercache.cl2o.CharBlockArray.charAt(CharBlockArray.java:152)
at org.apache.lucene.facet.taxonomy.writercache.cl2o.CategoryPathUtils.equalsToSerialized(CategoryPathUtils.java:61)
at org.apache.lucene.facet.taxonomy.writercache.cl2o.CompactLabelToOrdinal.getOrdinal(CompactLabelToOrdinal.java:257)
at org.apache.lucene.facet.taxonomy.writercache.cl2o.CompactLabelToOrdinal.getOrdinal(CompactLabelToOrdinal.java:140)
at org.apache.lucene.facet.taxonomy.writercache.cl2o.Cl2oTaxonomyWriterCache.get(Cl2oTaxonomyWriterCache.java:74)
at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter.addCategory(DirectoryTaxonomyWriter.java:455)
at org.apache.lucene.facet.index.FacetFields.addFields(FacetFields.java:175)
at com.logrhythm.messaging.indexing.LogIndexerImpl.getDocument(LogIndexerImpl.java:478)
at com.logrhythm.messaging.indexing.LogIndexerImpl.indexLog(LogIndexerImpl.java:392)
at com.logrhythm.messaging.indexing.LogIndexerImpl.indexLogs(LogIndexerImpl.java:357)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.logrhythm.tests.unit.messaging.indexing.LogIndexerTests.logIndexerLoadTest(LogIndexerTests.java:752)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43
Re: Index + Taxonomy Replication
SearcherTaxonomyManager can be used only for NRT, as it only takes an IndexWriter and DirectoryTaxonomyWriter. And I don't think you want to keep those writers open on the slaves' side.

I think that a ReferenceManager which returns a SearcherAndTaxonomy is the right thing to do. The reason we don't offer one is that it's very tricky to use outside of a well-defined refresh protocol. If we let you refresh a Directory-based pair and you're not careful enough, you could end up reopening the IndexReader before the TaxonomyReader was committed, or vice versa. Both lead to an unsynchronized IR/TR pair, which is bad. However, if your app always calls maybeRefresh once the Handler is done (i.e. as a callback), and it is *the only one* that refreshes the pair, then you're safe.

Maybe we should offer such a ReferenceManager (maybe it can even be SearcherTaxonomyManager taking a pair of Directories in another ctor), and document that its maybeRefresh needs to be called by the same thread that modified the index (i.e. commit() or replication).

Shai

On Thu, Oct 31, 2013 at 12:53 PM, Michael McCandless luc...@mikemccandless.com wrote:

Maybe have a look at how IndexAndTaxonomyReplicationClientTest.java works? Hmm, in its callback it manually reopens the index + taxo index, but I think you could instead use a SearcherTaxonomyManager and call its .maybeRefresh inside your callback?

Mike McCandless http://blog.mikemccandless.com

On Wed, Oct 30, 2013 at 11:24 AM, Joe Eckard eckar...@gmail.com wrote:

Hello, I'm attempting to set up a master/slave arrangement between two servers, where the master uses a SearcherTaxonomyManager to index and search, and the slave is read-only - using just an IndexSearcher and TaxonomyReader. So far I am able to publish new IndexAndTaxonomyRevisions on the master and pull them down to the slave with no problems (using the HttpReplicator and an IndexAndTaxonomyReplicationHandler), but I'm not sure how to correctly reopen the IndexSearcher and TaxonomyReader pair in the ReplicationHandler's callback. Should I wrap them in some kind of ReferenceManager to allow searches to continue on the read-only server during the cutover? Is there a specific order they should be reopened in? Any advice or pointers would be much appreciated.
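For the NRT case described above, the usage looks roughly like this (Lucene 4.x; indexWriter, taxoWriter, and query are assumed to exist). The point is that acquire() hands back the searcher and the taxonomy reader as one synchronized pair:

import org.apache.lucene.facet.taxonomy.SearcherTaxonomyManager;
import org.apache.lucene.facet.taxonomy.SearcherTaxonomyManager.SearcherAndTaxonomy;
import org.apache.lucene.search.TopDocs;

SearcherTaxonomyManager mgr =
    new SearcherTaxonomyManager(indexWriter, true, null, taxoWriter);

SearcherAndTaxonomy pair = mgr.acquire();
try {
  TopDocs hits = pair.searcher.search(query, 10);
  // ... compute facet counts against pair.taxonomyReader; both views are in sync
} finally {
  mgr.release(pair);
}

// Refresh from the same thread that committed (or from the replication
// handler's callback), and from nowhere else:
mgr.maybeRefresh();

At the time of this thread only the writer-based ctor existed; the Directory-based variant Shai proposes is what the issue opened in the next message tracks.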
Re: Index + Taxonomy Replication
Opened https://issues.apache.org/jira/browse/LUCENE-5320.

Shai

On Fri, Nov 1, 2013 at 4:59 PM, Michael McCandless luc...@mikemccandless.com wrote:

On Fri, Nov 1, 2013 at 3:12 AM, Shai Erera ser...@gmail.com wrote:

Maybe we should offer such a ReferenceManager (maybe it can even be SearcherTaxonomyManager taking a pair of Directories in another ctor), and document that its maybeRefresh needs to be called by the same thread that modified the index (i.e. commit() or replication).

+1, I think we should do this.

Mike McCandless http://blog.mikemccandless.com
Re: Merging ordered segments without re-sorting.
Hi

You can use SortingMergePolicy and SortingAtomicReader to achieve that. You can read more about index sorting here: http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html

Shai

On Wed, Oct 23, 2013 at 8:13 PM, Arvind Kalyan bas...@gmail.com wrote:

Hi there, I'm looking for pointers and suggestions on how to approach this in Lucene 4.5. Say I am creating an index using a sequence of addDocument() calls and end up with segments that each contain documents in a specified ordering. It is guaranteed that there won't be updates/deletes/reads etc. happening on the index -- this is an offline index-building task for a read-only index. I create the index in the above-mentioned fashion using LogByteSizeMergePolicy and finally do a forceMerge(1) to get a single segment in the ordering I want.

Now my requirement is that I need to be able to merge this single segment with another such segment (say from yesterday's index) and guarantee some ordering -- say I have a comparator which looks at some field values in the 2 given docs and defines the ordering.

Index 1 with segment X: (a,1) (b,2) (e,10)
Index 2 (say from yesterday) with segment Y: (c,4) (d,6)

Essentially we have 2 ordered segments, and I'm looking to 'merge' them (literally) using the value of some field, without having to re-sort them, which would be too time- and resource-consuming.

Output index, with some segment Z: (a,1) (b,2) (c,4) (d,6) (e,10)

Is this already possible? If not, any tips on how I can approach implementing this requirement?

Thanks, -- Arvind Kalyan
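A rough sketch of the setup from the blog post. Note that the ctor signatures moved around in 4.x (4.5 took a Sorter; from 4.6 a Sort), so this is written against the Sort-based form, and "timestamp" is a hypothetical NumericDocValues field:

import org.apache.lucene.index.sorter.SortingMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

Sort sort = new Sort(new SortField("timestamp", SortField.Type.LONG));
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
// Wrap the default merge policy so every merged segment comes out sorted.
iwc.setMergePolicy(new SortingMergePolicy(iwc.getMergePolicy(), sort));
IndexWriter writer = new IndexWriter(dir, iwc);
// With all merges sorting, a final forceMerge(1) yields one fully ordered segment.
writer.forceMerge(1);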
Re: Merging ordered segments without re-sorting.
SortingAtomicReader uses the TimSort algorithm, which performs well when the two segments are already sorted. Anyway, that's the way to do it, even if it looks like it does more work than it should.

Shai

On Wed, Oct 23, 2013 at 10:46 PM, Arvind Kalyan bas...@gmail.com wrote:

Thanks. My understanding is that SortingMergePolicy performs sorting after wrapping the 2 segments, correct? As I mentioned in my original email, I would like to avoid the re-sorting and exploit the fact that the input segments are already sorted.

On Wed, Oct 23, 2013 at 11:02 AM, Shai Erera ser...@gmail.com wrote:

Hi

You can use SortingMergePolicy and SortingAtomicReader to achieve that. You can read more about index sorting here: http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html

Shai

On Wed, Oct 23, 2013 at 8:13 PM, Arvind Kalyan bas...@gmail.com wrote:

Hi there, I'm looking for pointers and suggestions on how to approach this in Lucene 4.5. Say I am creating an index using a sequence of addDocument() calls and end up with segments that each contain documents in a specified ordering. It is guaranteed that there won't be updates/deletes/reads etc. happening on the index -- this is an offline index-building task for a read-only index. I create the index in the above-mentioned fashion using LogByteSizeMergePolicy and finally do a forceMerge(1) to get a single segment in the ordering I want.

Now my requirement is that I need to be able to merge this single segment with another such segment (say from yesterday's index) and guarantee some ordering -- say I have a comparator which looks at some field values in the 2 given docs and defines the ordering.

Index 1 with segment X: (a,1) (b,2) (e,10)
Index 2 (say from yesterday) with segment Y: (c,4) (d,6)

Essentially we have 2 ordered segments, and I'm looking to 'merge' them (literally) using the value of some field, without having to re-sort them, which would be too time- and resource-consuming.

Output index, with some segment Z: (a,1) (b,2) (c,4) (d,6) (e,10)

Is this already possible? If not, any tips on how I can approach implementing this requirement?

Thanks, -- Arvind Kalyan

-- Arvind Kalyan http://www.linkedin.com/in/base16 cell: (408) 761-2030
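Putting the pieces together for the day-over-day case, one hedged sketch: wrap yesterday's (already sorted) index with SortingAtomicReader and addIndexes() it into today's writer, whose merge policy is the SortingMergePolicy from the previous sketch. Because the inputs are already sorted, TimSort's galloping mode makes the merge close to a linear pass. yesterdayDir, writer, and sort are carried over from that sketch; the Sort-based wrap is the 4.6+ form:

DirectoryReader yesterday = DirectoryReader.open(yesterdayDir);
// addIndexes() wants reader-level input; flatten to a single AtomicReader
// view, then expose it in sorted order.
AtomicReader sortedView = SortingAtomicReader.wrap(
    SlowCompositeReaderWrapper.wrap(yesterday), sort);
writer.addIndexes(sortedView);
writer.forceMerge(1); // single segment, globally ordered
yesterday.close();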
Re: external file stored field codec
The codec intercepts merges in order to clean up files that are no longer referenced

What happens if a document is deleted while there's a reader open on the index, and the segments are merged? Maybe I misunderstand what you meant by this statement, but if the external file is deleted, since the document is pruned from the index, how will the reader be able to read the stored fields from it? How do you track references to the external files? Since you write that all tests in the o.a.l.index package pass, I assume you handle this, but here's a simple test case I have in mind:

IndexWriter writer = new IndexWriter(dir, configWithNewCodec());
writer.addDocument(addDocWithStoredFields(doc1));
writer.addDocument(addDocWithStoredFields(doc2));
writer.commit();
writer.addDocument(addDocWithStoredFields(doc3));
writer.addDocument(addDocWithStoredFields(doc4));
IndexReader reader = writer.getReader();
writer.deleteDocuments(doc1);
writer.deleteDocuments(doc4);
writer.forceMerge(1);
writer.close();
System.out.println(reader.document(doc1));
System.out.println(reader.document(doc4));

Does this test pass?

Shai

On Fri, Oct 18, 2013 at 7:14 AM, Michael Sokolov msoko...@safaribooksonline.com wrote:

On 10/13/13 8:09 PM, Michael Sokolov wrote:

On 10/13/2013 1:52 PM, Adrien Grand wrote: Hi Michael, I'm not aware enough of operating system internals to know what exactly happens when a file is open, but it sounds to me like having separate files per document or field adds levels of indirection when loading stored fields, so I would be surprised if it actually proved to be more efficient than storing everything in a single file.

That's true, Adrien, there's definitely a cost to using files. There are some gnarly challenges in here (mostly to do with the large number of files, as you say, and with cleaning up after deletes - deletion is always hard). I'm not sure it's going to be possible to both clean up and maintain files for stale commits; this will become problematic in the way that having index files on NFS mounts is problematic. I think the hope is that there will be countervailing savings during writes and merges (mostly), because we may be able to cleverly avoid copying the contents of stored fields being merged. There may also be savings when querying, due to reduced RAM requirements, since the large stored fields won't be paged in while performing queries. As I said, some simple tests do show improvements under at least some circumstances, so I'm pursuing this a bit further. I have a preliminary implementation as a codec now, and I'm learning a bit about Lucene's index internals. BTW, SimpleTextCodec is a great tool for learning and debugging.

The background for this is a document store with large files (think PDFs, but lots of formats) that have to be tracked and have associated metadata. We've been storing these externally, but it would be beneficial to have a single data-management layer: i.e. to push this down into Lucene, for a variety of reasons. For one, we could rely on Solr to do our replication for us. I'll post back when I have some measurements. -Mike

This idea actually does seem to be working out pretty nicely. I compared the time to write and then to read documents that included a couple of small indexed fields and a binary stored field that varied in size. Writing to external files, via the FSFieldCodec, was 3-20 times faster than writing to the index in the normal way (using MMapDirectory). Reading was sometimes faster and sometimes slower.
I also measured the time for a forceMerge(1) at the end of each test: this was almost always nearly zero when binaries were external, and grew larger with more data in the normal case. I believe the improvements we're seeing here result largely from removing the bulk of the data from the merge I/O path. As with any performance measurements, a lot of factors can affect the numbers, but this effect seems pretty robust across the conditions I measured (different file sizes, numbers of files, and frequency of commits, with lots of repetition). One oddity is a large difference between the Mac SSD filesystem (15-20x writing, 0.6x reading via FSFieldCodec) and the Linux ext4 HD filesystem (3-4x writing, 1.5x reading).

The codec works as a wrapper around another codec (like the compressing codecs), intercepting binary and string stored fields larger than a configurable threshold and storing a file number as a reference in the main index, which then functions kind of like a symlink. The codec intercepts merges in order to clean up files that are no longer referenced, taking special care to preserve the ability of the other codecs to perform bulk merges. The codec passes all the Lucene unit tests in the o.a.l.index package. The implementation is still very experimental; there are lots of details to be worked out: for example, I haven't yet measured the performance impact of deletions, which
Re: Huge FacetArrays while using SortedSetDocValuesAccumulator
Oops, you're right - it was committed in LUCENE-4985, which will be released in Lucene 4.5.

Shai

On Wed, Aug 28, 2013 at 6:16 PM, Krishnamurthy, Kannan kannan.krishnamur...@contractor.cengage.com wrote:

Thanks for the response. I double-checked that SortedSetDocValuesAccumulator doesn't take FacetArrays in its ctor currently, in either 4.3.0 or 4.4, but FacetsAccumulator does take FacetArrays in its ctor. Am I missing something here? We have a high-traffic application currently doing about 250 searches and facet requests per second. We haven't performance-tested our facet implementation yet to see if object allocation is a problem.

Thanks, +Kannan.

Hi

SortedSetDocValuesAccumulator does receive FacetArrays in its ctor, so you can pass ReusingFacetArrays. You will need to call FacetArrays.free() when you're done with accumulation, though. However, do notice that ReusingFacetArrays did not show any big gain even with large taxonomies -- that is, the overhead of allocating and freeing the arrays wasn't noticeable. If you expect to use very large taxonomies, then facet partitions can help, but for that you need to use the sidecar taxonomy index.

Shai

On Mon, Aug 26, 2013 at 11:45 PM, Krishnamurthy, Kannan kannan.krishnamur...@contractor.cengage.com wrote:

Hello, We are working with a large Lucene 4.3.0 index, using SortedSetDocValuesFacetFields for creating facets and SortedSetDocValuesAccumulator for facet accumulation. We couldn't use a taxonomy-based facet implementation (we use a MultiReader for searching, and our index is composed of multiple physical Lucene indices, hence we cannot have a single taxonomy index). We have two million categories and expect another two million in the near future. As the current implementation of SortedSetDocValuesAccumulator does not support ReusingFacetArrays, we are concerned about potential garbage-collector-related performance issues in our high-traffic application. Will a future Lucene release support ReusingFacetArrays in SortedSetDocValuesAccumulator?

Also, as an alternative, we are considering subclassing FacetIndexingParams to provide dimension-specific CategoryListParams at indexing time. This would help reduce the size of the FacetArrays per facet request. We realize this approach will not support multiple FacetRequests in a single SortedSetDocValuesAccumulator, as SortedSetDocValuesReaderState hardcodes the category to null while calling FacetIndexingParams.getCategoryListParams(null) in its constructor. Are there better approaches to this problem? Thanks in advance for any help.

Kannan Cengage Learning
Re: Huge FacetArrays while using SortedSetDocValuesAccumulator
Hi

SortedSetDocValuesAccumulator does receive FacetArrays in its ctor, so you can pass ReusingFacetArrays. You will need to call FacetArrays.free() when you're done with accumulation, though. However, do notice that ReusingFacetArrays did not show any big gain even with large taxonomies -- that is, the overhead of allocating and freeing the arrays wasn't noticeable. If you expect to use very large taxonomies, then facet partitions can help, but for that you need to use the sidecar taxonomy index.

Shai

On Mon, Aug 26, 2013 at 11:45 PM, Krishnamurthy, Kannan kannan.krishnamur...@contractor.cengage.com wrote:

Hello, We are working with a large Lucene 4.3.0 index, using SortedSetDocValuesFacetFields for creating facets and SortedSetDocValuesAccumulator for facet accumulation. We couldn't use a taxonomy-based facet implementation (we use a MultiReader for searching, and our index is composed of multiple physical Lucene indices, hence we cannot have a single taxonomy index). We have two million categories and expect another two million in the near future. As the current implementation of SortedSetDocValuesAccumulator does not support ReusingFacetArrays, we are concerned about potential garbage-collector-related performance issues in our high-traffic application. Will a future Lucene release support ReusingFacetArrays in SortedSetDocValuesAccumulator?

Also, as an alternative, we are considering subclassing FacetIndexingParams to provide dimension-specific CategoryListParams at indexing time. This would help reduce the size of the FacetArrays per facet request. We realize this approach will not support multiple FacetRequests in a single SortedSetDocValuesAccumulator, as SortedSetDocValuesReaderState hardcodes the category to null while calling FacetIndexingParams.getCategoryListParams(null) in its constructor. Are there better approaches to this problem? Thanks in advance for any help.

Kannan Cengage Learning
Re: How to retrieve value of NumericDocValuesField in similarity
Rob, when DiskDV becomes the default DVFormat, would it not make sense to load the values into the cache if someone uses the FieldCache API? Versus, if someone calls the DV API directly, he uses whatever the default Codec is, or the one that he plugs in. That's what I would expect from a 'cache'. So it's OK that currently all FieldCache does is delegate the call to the DV API, but perhaps we'd want to change that so that in the DiskDV case it actually caches things? Or would you like to keep the FieldCache API for sort of back-compat with existing features, and let the app control the caching by using an explicit RamDVFormat?

Shai

On Mon, Aug 12, 2013 at 7:07 PM, Ross Woolf r...@rosswoolf.com wrote:

Yes, I will open an issue.

On Mon, Aug 12, 2013 at 10:02 AM, Robert Muir rcm...@gmail.com wrote:

On Mon, Aug 12, 2013 at 8:48 AM, Ross Woolf r...@rosswoolf.com wrote:

Okay, just for clarity's sake, what you are saying is that if I make the FieldCache call, it won't actually create and impose the loading time of the FieldCache, but rather just use the NumericDocValuesField instead. Is this correct?

Yes, exactly. It's a little confusing, but it's a tradeoff to make doc values work transparently with lots of existing code built off of FieldCache (sorting/grouping/joins/faceting/...) without having to have 2 separate implementations of what is the same thing. So it's like doc values is a FieldCache you already built at index time.

Also, my similarity was extending SimilarityBase, and I can't see how to get a docId, as it is not passed to the score method score(BasicStats stats, float freq, float docLen). Will I need to extend Similarity instead of SimilarityBase, or is there a way to get the docId using SimilarityBase?

Maybe we should just add an 'int doc' parameter to the SimilarityBase.score() method? Do you want to open a JIRA issue for this?
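To make the equivalence concrete, a per-segment sketch (Lucene 4.x; "popularity", context, and docID are placeholders). With a NumericDocValuesField, both routes end up reading the same index-time data:

AtomicReader reader = context.reader(); // the per-segment reader
NumericDocValues values = reader.getNumericDocValues("popularity");
long direct = values == null ? 0L : values.get(docID); // DV API

// FieldCache route: with a DV field this delegates to the doc values
// rather than uninverting or loading anything into RAM.
FieldCache.Longs cached = FieldCache.DEFAULT.getLongs(reader, "popularity", false);
long viaCache = cached.get(docID);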
Re: How to retrieve value of NumericDocValuesField in similarity
OK, that makes sense.

Shai

On Mon, Aug 12, 2013 at 9:18 PM, Robert Muir rcm...@gmail.com wrote:

On Mon, Aug 12, 2013 at 11:06 AM, Shai Erera ser...@gmail.com wrote:

Or would you like to keep the FieldCache API for sort of back-compat with existing features, and let the app control the caching by using an explicit RamDVFormat?

Yes. In the future, ideally FieldCache goes away and becomes an UninvertingFilterReader or something like that, which exposes the DV APIs. Then things can just use the DV APIs... but to get things started we did it this way in the interim.
Re: Lucene 4 - Faceted Search with Sorting
Hi

Basically, every IndexSearcher.search() variant has a matching Collector; they are there for easier usage. TopFieldCollector.create() takes a searchAfter doc (a FieldDoc), so you can use it in conjunction with FacetsCollector as I've outlined before.

In general you're right that for pagination you don't need to collect facets again. I would cache the List<FacetResult> though, and not the FacetsCollector - maybe even cache the output form, e.g. the String that you send back. But note that such caching means the server becomes stateful, which usually complicates matters for apps. Whether it's a problem for your app, you'll be the judge; I just wanted to point that out.

Shai

On Fri, Aug 2, 2013 at 9:35 AM, Sanket Paranjape sanket.paranjape.mailingl...@gmail.com wrote:

Hi Shai, Thanks for helping out. It worked. :) I also want to add a pagination feature. This can be done via the searchAfter method in IndexSearcher, but that variant does not take a Collector (and I want facets from this). I think this has been done intentionally, because facets would remain the same while paginating/sorting. So my approach would be the following:

1. On the first search, call the code below to get the first set of results along with facets.
2. Save the *last* ScoreDoc somewhere in the session so that it can be used for pagination. Also save the FacetsCollector so as to use it later, on a pagination request, to show facets.
3. On subsequent pagination requests, use IndexSearcher.searchAfter to get the next set of results using the ScoreDoc from the session.
4. If the user wants to narrow down on facets, follow steps 1 to 3 using the drill-down feature.

Am I correct?

On 01-08-2013 11:33 PM, Shai Erera wrote:

Hi

You should do the following:

TopFieldCollector tfc = TopFieldCollector.create();
FacetsCollector fc = FacetsCollector.create();
searcher.search(query, MultiCollector.wrap(tfc, fc));

Basically IndexSearcher.search(..., Sort) creates a TopFieldCollector internally, so you need to create it outside and wrap both collectors with MultiCollector.

BTW, you don't need new CategoryPath(CATEGORY_PATH, '/') when the category does not contain the delimiter. You can use the vararg constructor, which takes the path components directly, if you have them already.

Shai

On Thu, Aug 1, 2013 at 7:46 PM, Sanket Paranjape sanket.paranjape.mailingl...@gmail.com wrote:

I am trying to move from Lucene 2.4 to 4.4. I had used bobo-browse for faceting in 2.4. I used the code below (from the Lucene examples) to query documents and get facets:

List<FacetRequest> categories = new ArrayList<FacetRequest>();
categories.add(new CountFacetRequest(new CategoryPath(CATEGORY_PATH, '/'), 10));
FacetSearchParams searchParams = new FacetSearchParams(categories);
TopScoreDocCollector topScoreDocCollector = TopScoreDocCollector.create(200, true);
FacetsCollector facetsCollector = FacetsCollector.create(searchParams, indexReader, taxonomyReader);
indexSearcher.search(new MatchAllDocsQuery(), MultiCollector.wrap(topScoreDocCollector, facetsCollector));

The above code gives me results along with facets. Now I want to add a sort field on the document; say I want to sort by name. I can achieve this using:

Sort sort = new Sort(new SortField(NAME, Type.STRING));
TopFieldDocs docs = indexSearcher.search(new MatchAllDocsQuery(), 100, sort);

Now, how do I achieve sorting along with faceting, given that there is no method in IndexSearcher which takes both a Collector and a Sort? I have asked this question on stackoverflow as well.
(http://stackoverflow.com/questions/17992183/lucene-4-faceted-search-with-sorting) Please Help !!
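A fleshed-out version of the wiring for 4.4, including the searchAfter variant for step 3 of the plan above - the variable names, sort field, and facet dimension are assumptions. Only the first page needs the FacetsCollector; later pages reuse its results:

Sort sort = new Sort(new SortField("name", SortField.Type.STRING));
FacetSearchParams fsp = new FacetSearchParams(
    new CountFacetRequest(new CategoryPath("Category"), 10));

// fillFields=true so the returned FieldDocs can seed searchAfter.
TopFieldCollector tfc = TopFieldCollector.create(sort, 100, true, false, false, false);
FacetsCollector fc = FacetsCollector.create(fsp, indexReader, taxonomyReader);
indexSearcher.search(query, MultiCollector.wrap(tfc, fc));

TopDocs page1 = tfc.topDocs();
List<FacetResult> facets = fc.getFacetResults(); // cache these for later pages

// Page 2: collect hits after the last hit of page 1; no facet collection needed.
FieldDoc after = (FieldDoc) page1.scoreDocs[page1.scoreDocs.length - 1];
TopFieldCollector tfc2 = TopFieldCollector.create(sort, 100, after, true, false, false, false);
indexSearcher.search(query, tfc2);
TopDocs page2 = tfc2.topDocs();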
Re: IndexUpgrade - Any ways to speed up?
Hi

You cannot just update the headers -- the file formats have changed. Therefore you need to rewrite the index entirely, at least from 2.3.1 to 3.6.2 (for 4.1 to be able to read it). If your index is already optimized, then IndexUpgrader is your best option. The reason it calls forceMerge(1) is that it needs to guarantee *every* segment in your index gets rewritten. BTW, you might want to upgrade to 4.4 already.

Shai

On Fri, Aug 2, 2013 at 2:49 PM, Ramprakash Ramamoorthy youngestachie...@gmail.com wrote:

Team, We are migrating from Lucene version 2.3.1 to 4.1. We are migrating the indices as well, and we do this in two steps: 2.3.1 to 3.6.2, and 3.6.2 to 4.1. We just call IndexUpgrader.upgrade(), which uses the UpgradeIndexMergePolicy. I see that the upgrade() method actually calls a forceMerge(1) over the indices. However, all our indices are optimized and there are no deletes either. This forceMerge(1) seems a very costly operation, and since our index is already optimized, there is no space benefit. Is there a faster way to upgrade our indices (like reading the indices and modifying the headers, something of that sort)? We are not expecting any compaction during the process. Currently it takes 4 minutes for a GB of index to get migrated to 4.1 from 2.3.1. Any pointers would be appreciated. Thanks in advance. -- With Thanks and Regards, Ramprakash Ramamoorthy, Chennai, India.
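The two-step upgrade itself is small. A sketch, assuming each step runs in a JVM with only that version's lucene-core on the classpath, and a hypothetical index path (the same class can also be run from the command line as java org.apache.lucene.index.IndexUpgrader <dir>):

// Step 1 -- run against lucene-core-3.6.2.jar:
new IndexUpgrader(FSDirectory.open(new File("/data/index")), Version.LUCENE_36).upgrade();

// Step 2 -- run against lucene-core-4.1.0.jar:
new IndexUpgrader(FSDirectory.open(new File("/data/index")), Version.LUCENE_41).upgrade();

// Each upgrade() is effectively a forceMerge(1) with a merge policy that
// guarantees every segment is rewritten in the new file format -- which is
// why there is no cheaper header-only shortcut.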
Re: IndexUpgrade - Any ways to speed up?
Unfortunately you cannot upgrade directly from 2.3.1 to 4.1. You can consider upgrading to 3.6.2 and stopping there. Lucene 4.1 can read 3.x indexes, and when segments are merged, they are upgraded automatically to the newest file format. However, if this single segment is too big, such that it won't be picked for merges, you will need to upgrade it anyway one day, when you upgrade to Lucene 5.0. So I'd say, if you're not pressed for time, upgrade to 4.1 now ... it's a one-time process.

Shai

On Fri, Aug 2, 2013 at 3:22 PM, Ramprakash Ramamoorthy youngestachie...@gmail.com wrote:

Thank you Shai for the quick response. I have responded inline.

On Fri, Aug 2, 2013 at 5:37 PM, Shai Erera ser...@gmail.com wrote:

Hi, you cannot just update the headers -- the file formats have changed. Therefore you need to rewrite the index entirely, at least from 2.3.1 to 3.6.2 (for 4.1 to be able to read it).

Yes - as of now, we call the IndexUpgrader of 3.6.2 and then the IndexUpgrader of 4.0, and then the indices become readable by 4.1.

If your index is already optimized, then IndexUpgrader is your best option. The reason it calls forceMerge(1) is that it needs to guarantee *every* segment in your index gets rewritten.

Understood. Looks like we will have to stick with what we have written as of today.

BTW, you might want to upgrade to 4.4 already.

Yes, we upgraded the code base when 4.1 was the most recent version; now we are looking to migrate the older indices to be compatible. Thanks again.

Shai

On Fri, Aug 2, 2013 at 2:49 PM, Ramprakash Ramamoorthy youngestachie...@gmail.com wrote:

Team, We are migrating from Lucene version 2.3.1 to 4.1. We are migrating the indices as well, and we do this in two steps: 2.3.1 to 3.6.2, and 3.6.2 to 4.1. We just call IndexUpgrader.upgrade(), which uses the UpgradeIndexMergePolicy. I see that the upgrade() method actually calls a forceMerge(1) over the indices. However, all our indices are optimized and there are no deletes either. This forceMerge(1) seems a very costly operation, and since our index is already optimized, there is no space benefit. Is there a faster way to upgrade our indices (like reading the indices and modifying the headers, something of that sort)? We are not expecting any compaction during the process. Currently it takes 4 minutes for a GB of index to get migrated to 4.1 from 2.3.1. Any pointers would be appreciated. Thanks in advance.

-- With Thanks and Regards, Ramprakash Ramamoorthy, Chennai, India.