Sort, Search Facets
Hi, I am using Lucene 4.7.2, and my primary use case is to do three things: (a) search, (b) sort the search results by a number of fields, and (c) facet on probably an equal number of fields (probably the most standard use cases anyway). Say I have a corpus of more than 100M docs, each with approx. 10-15 fields in addition to the content (body), which is also a field. I need both sorting and faceting enabled on all 10-15 of them. That makes up to ~45 fields to index: each one once as a StringField/LongField/TextField, once as a SortedDocValuesField, and once as a FacetField. What will be the impact of this on indexing, in terms of time taken and extra disk space required? Will it grow linearly with the number of fields? What is the impact on memory usage at search time? I will attempt to benchmark some of this, but if you have any experience with it, please share the details. Thanks, --- Thanks n Regards, Sandeep Ramesh Khanzode
DrillSideways accepting FacetCollector parameter
Currently DrillSideways provides the following method: public DrillSidewaysResult search(DrillDownQuery query, Collector hitCollector); Could the same class also provide the following method? public DrillSidewaysResult search(DrillDownQuery query, Collector hitCollector, FacetsCollector facetCollector); Currently, FacetsCollector drillDownCollector = new FacetsCollector(); is created inside the API method public DrillSidewaysResult search(DrillDownQuery query, Collector hitCollector) throws IOException. Could it be passed in as a parameter instead? That would let an application use the same FacetsCollector to fetch other facets, i.e. non-sideways facets. Thanks, Jigar Shah.
re-use IndexWriter
Lately I've been trying every way I can to improve indexing performance. IndexWriter's close operation is really costly, and the Lucene docs suggest re-using the IndexWriter instance. So I did: I keep one IndexWriter instance and hand it to every request thread. But this causes a big problem: my searches never see the index changes, because the changes are still in RAM. Maybe there is a way to flush all the changes to stable storage without closing the IndexWriter, so that I can keep re-using it. Am I right on this point? There are several points I don't quite understand. 1. What's the difference between commit and flush? I thought that with these two methods I could see the changes in my Directory without closing the IndexWriter. 2. When should I close the writer? If I use it as a singleton (so I don't have to worry about LockObtainException), and commit and flush take care of persisting the changes, then do I ever have to close it?
Re: re-use IndexWriter
Read the javadocs to understand the difference between commit() and flush(). You need commit(), or close(). There are no hard and fast rules; it depends on how much data you are indexing, how fast, how many searches you're getting and how up to date they need to be, and how much you worry about losing indexed data. One option is to pick a value that makes sense to you and commit() the writer every n seconds|minutes|hours|docs. close() it when your indexing job exits. You'll need to reopen index searchers to pick up changes. See the javadocs for IndexSearcher. Another option is to use Lucene's near-real-time (NRT) features. Also see the IndexSearcher javadocs for a way in to that. -- Ian. (In reply to Jason.H 469673...@qq.com, Tue, Jul 8, 2014 at 10:08 AM.) - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
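The "commit() every n docs" option from the reply above could be sketched roughly as follows. This uses only the JDK so that it runs on its own: Committable is a hypothetical stand-in for the single IndexWriter.commit() call, and BatchCommitPolicy and its method names are invented for illustration, not part of Lucene. The batch size is an assumption you would tune to how much uncommitted data you can afford to lose.

```java
import java.io.IOException;

/** Hypothetical stand-in for the one IndexWriter method this sketch needs. */
interface Committable {
    void commit() throws IOException;
}

/** Counts added documents and commits once every batchSize additions. */
class BatchCommitPolicy {
    private final Committable writer;
    private final int batchSize;
    private int pending = 0;   // documents added since the last commit
    private int commits = 0;   // number of commits issued so far

    BatchCommitPolicy(Committable writer, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.writer = writer;
        this.batchSize = batchSize;
    }

    /** Call after each successful add; synchronized so request threads can share it. */
    synchronized void documentAdded() throws IOException {
        pending++;
        if (pending >= batchSize) {
            commitNow();
        }
    }

    /** Call when the indexing job exits, before closing the writer. */
    synchronized void finish() throws IOException {
        if (pending > 0) {
            commitNow();
        }
    }

    private void commitNow() throws IOException {
        writer.commit();
        pending = 0;
        commits++;
    }

    synchronized int commitCount() { return commits; }

    synchronized int pendingCount() { return pending; }
}
```

With a real writer the stand-in would presumably be supplied as something like writer::commit, and searchers would still need to be reopened after each commit to see the changes, as the reply notes.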
Re: DrillSideways accepting FacetCollector parameter
We could do this, but what's the use case? E.g. DrillSideways also hardwires the drill-sideways collectors it creates ... do you want control over those as well? Maybe we could add methods to the DrillSideways class that you could override? Mike McCandless http://blog.mikemccandless.com (In reply to Jigar Shah jigaronl...@gmail.com, Tue, Jul 8, 2014 at 7:14 AM.)
Re: Incremental Field Updates
That's a cool patch. Thanks
On Thursday, July 3, 2014, Gopal Patwa gopalpa...@gmail.com wrote: Thanks Ravi, it is good to know the general problems with updatable fields. In our use case we have a few fields which update more frequently than the main index. We are using this SOLR join contrib patch with a DocTransformer for returning data from the join core. This approach has some performance impact, but if that hit is acceptable for your use case and you are using SOLR, you can give it a try. https://issues.apache.org/jira/browse/SOLR-4787
On Thu, Jul 3, 2014 at 3:22 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: In case of sorting, updatable DocValues may be what you are looking for. But updatable fields for searching are a different beast. A sample approach is documented at http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/ The general problems with updatable postings lists, AFAIK, are:
1. Impossible to correctly score updated documents
2. Segment merges could miss out updates
3. Might behave incorrectly with NRT
4. Frequent updates could end up creating lots of files because of the append-only nature of Lucene...
Maybe if you are not too worried about scoring, correct NRT behavior etc. you can attempt a solution like the RedisCodec stuff... Segregating static and dynamic fields into 2 separate indexes as described here http://www.lucenerevolution.org/2013/Sidecar-Index-Solr-Components-for-Parallel-Index-Management may be of some use to you -- Ravi
On Wed, Jul 2, 2014 at 7:29 PM, Shai Erera ser...@gmail.com wrote: Using BinaryDocValues is not recommended for all scenarios. It is a catch-all alternative to the other DocValues types. I would not use it unless it makes sense for your application, even if it means that you need to re-index a document in order to update a single field. DocValues are not good for search - by search I assume you mean take a query such as apache AND lucene and find all documents which contain both terms under the same field. They are good for sorting and faceting though. So I guess the answer to your question is it depends (it always is!) - I would use DocValues for sorting and faceting, but not for regular search queries. And I would use BinaryDocValues only when the other DocValues types don't match. Also, note that the current field-level update of DocValues is not always better than re-indexing the document; you can read here for more details: http://shaierera.blogspot.com/2014/04/benchmarking-updatable-docvalues.html Shai
On Tue, Jul 1, 2014 at 9:17 PM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote: Hi Shai, So one follow-up question. Assume that my use case is to have approx. 50M documents indexed, with each document having about 10-15 indexed but not stored fields. These fields will never change, but there are another 5-6 fields that will change and will continue to change after the index is written. These 5-6 fields may also be multivalued. The size of this index turns out to be ~120GB. In this case, I would like to sort or facet or search on these 5-6 fields. Which approach do you suggest? Should I use BinaryDocValues and update using IW, or use either a ParallelReader or a Join query? --- Thanks n Regards, Sandeep Ramesh Khanzode
On Tuesday, July 1, 2014 9:53 PM, Shai Erera ser...@gmail.com wrote: Except that Lucene now offers efficient numeric and binary DocValues updates. See IndexWriter.updateNumeric/Binary...
On Jul 1, 2014 5:51 PM, Erick Erickson erickerick...@gmail.com wrote: This JIRA is complicated, don't really expect it in 4.9 as it's been hanging around for quite a while. Everyone would like this, but it's not easy. Atomic updates will work, but you have to set stored=true for all source fields. Under the covers this actually reads the document out of the stored fields, deletes the old one and adds it over again. FWIW, Erick
On Tue, Jul 1, 2014 at 5:32 AM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote: Hi, I wanted to know the best approach to follow if a few fields in my indexed documents change at run time (after indexing and before or during search), but the majority of them are created at index time. I could see the JIRA given below, but it is scheduled for Lucene 4.9, I believe. There are a few other approaches, like maintaining a separate index for the changing fields and using either a ParallelReader or a Join. Can everyone share their experience for
Adding/removing a term from a document
Hi all, I am trying to figure out how to easily remove a keyword from, or add a keyword to, a document's index entry (or equivalently, decrease/increase that keyword's frequency in the document). I know Lucene allows you to reindex a document using the IndexWriter.updateDocument(docPath, doc) call, but that's too expensive for my purposes. I already know the removed and added keywords from a previous pass through the document, and I would like to avoid Lucene doing another pass. I am looking for something like IndexWriter.adjustTermFreqInDoc(keyword, doc, deltafreq), which would change the frequency of keyword in 'doc' by 'deltafreq'. This could result in either adding a keyword to or removing a keyword from the document in the index. Is there a way to do this? At first I thought adding term vectors to the index could help with this, but it seems like that will dramatically increase the index size. Cheers, Alin
IndexSearcher.doc thread safe problem
Hi all, I know IndexSearcher is thread safe, but maybe IndexSearcher.doc is not. I tried the following. First, I collect the docIDs from the index and add each one to a queue (ConcurrentLinkedQueue). Second, after the collection step ends, I poll docIDs from the queue and extract the field value for each one. This second step runs on multiple threads. Here is a summary of the code I used: searcher.search( query, filter, new Collector() { public void collect( int doc ) { queue.add( docBase + doc ); } } ); Thread thread1 = new Thread( () -> { while( !queue.isEmpty() ) { System.out.println( searcher.doc( queue.poll() ).get(content) ); } } ); Thread thread2 = new Thread( thread1 ); thread1.start(); thread2.start(); --- The result was different on every execution. Is my method wrong, or is this an IndexSearcher bug? Please help me.
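One thing worth checking in the snippet above, independent of Lucene, is the isEmpty()/poll() pair. Between the two calls another thread can take the last element, so poll() returns null and unboxing it for searcher.doc(int) throws a NullPointerException. Output order across two printing threads is also nondeterministic by nature. The following JDK-only sketch shows a drain loop without that race. QueueDrain and the loadDoc parameter are hypothetical names; loadDoc stands in for whatever per-document lookup is done, and this is only an assumption about the cause, not a confirmed diagnosis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.IntFunction;

class QueueDrain {
    /**
     * Drains docIds using workerCount threads and returns the loaded values.
     * loadDoc is a hypothetical stand-in for the per-document lookup.
     */
    static List<String> drain(Queue<Integer> docIds, int workerCount,
                              IntFunction<String> loadDoc) throws InterruptedException {
        Queue<String> results = new ConcurrentLinkedQueue<>();
        Runnable worker = () -> {
            Integer id;
            // poll() alone decides whether there is work left. A separate
            // isEmpty() check can pass and then lose the element to another thread.
            while ((id = docIds.poll()) != null) {
                results.add(loadDoc.apply(id));
            }
        };
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            Thread t = new Thread(worker);
            threads.add(t);
            t.start();
        }
        // Wait for every worker before reading the results.
        for (Thread t : threads) {
            t.join();
        }
        return new ArrayList<>(results);
    }
}
```

The returned list still arrives in whatever order the threads finished, so sort it (or key it by docID) if a stable order matters.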
Re: DrillSideways accepting FacetCollector parameter
Use case: I perform the search with the code below. DrillSideways drillSideWays = new DrillSideways(searcher, config, engine.getTaxoReader()); DrillSidewaysResult result = drillSideWays.search(filterQuery, null, null, first + limit, sort, true, true); In this code I have no reference to the FacetsCollector fc that is used. Suppose I want to get LongRangeFacetCounts, which is based on a NumericDocValuesField: facets = new LongRangeFacetCounts(facetField.getQueryName(), fc, longRanges.toArray(new LongRange[longRanges.size()])); If I use the code below instead, I get access to the current fc: FacetsCollector fc = new FacetsCollector(); TopDocs topDocs = FacetsCollector.search(searcher, query, null, first + limit, sort, true, true, fc); The difference is that with FacetsCollector.search(searcher, query, null, first + limit, sort, true, true, fc) I can get the FacetsCollector, which is not possible with DrillSideways. Let me know if some other way is already provided. Thanks, Jigar Shah. (In reply to Michael McCandless luc...@mikemccandless.com, Tue, Jul 8, 2014 at 8:15 PM.)