Index size increases 20% after switching from Lucene 4.4 to 4.5 or 4.8 with BinaryDocValuesField
I used Lucene 4.4 to create an index for some documents; one of the indexed fields is a BinaryDocValuesField. After I changed the dependency to Lucene 4.5, the index size for 1 million documents increased from 293MB to 357MB. If I do not use BinaryDocValuesField, the index size increases only about 2%. I also tried Lucene 4.8; the index size is similar to that with 4.5. I am wondering what changed in the handling of BinaryDocValuesField between 4.4 and 4.5/4.8. Gang Zhao Software Engineer - EA Digital Platform 207 Redwood Shores Parkway Redwood City, CA 94065 Direct Line: 650-628-3719
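For reference, a minimal sketch of how the field type in question is typically indexed in Lucene 4.x (field names such as "payload" and "id" are made up here, not taken from the poster's code):

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class BinaryDVIndexing {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_48,
                new WhitespaceAnalyzer(Version.LUCENE_48));
        IndexWriter writer = new IndexWriter(dir, cfg);
        Document doc = new Document();
        doc.add(new StringField("id", "doc-1", Field.Store.YES));
        // Per-document binary payload; its on-disk encoding is what the
        // default DocValuesFormat changed between 4.4 and 4.5.
        doc.add(new BinaryDocValuesField("payload", new BytesRef(new byte[] {1, 2, 3})));
        writer.addDocument(doc);
        writer.close();
        dir.close();
    }
}
```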
Re: Facets in Lucene 4.7.2
Hi Shai, Thanks so much for the clear explanation. I agree on the first question: a taxonomy writer with a separate index would probably be my approach too. For the second question: I am a little new to the Facets API, so I will try to figure out the approach you outlined below. The scenario is this: assume a document corpus that is indexed. For a user query, a document is returned and selected by the user for editing as part of some use case/workflow. That document is then marked by the user as, for example, historically interesting or not, financially relevant, or specific to the media or entertainment domain. Essentially, the user is flagging the document with certain markers. Another set of users may want to query on these markers. So, let's say a second user comes along and wants to see the top documents belonging to one category, say agriculture or farming. Since these markers are assigned at run time, how can I facet on them? I was envisioning the various markers as facets, but if I constantly re-index or update the documents whenever a marker changes, I believe it would not be very efficient. Is there anything, facets or otherwise, in Lucene that can help me solve this use case? Please let me know. And, thanks! --- Thanks n Regards, Sandeep Ramesh Khanzode On Friday, June 13, 2014 9:51 PM, Shai Erera wrote: Hi You can check the demo code here: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/. This code is updated with each release, so you always get a working code example, even when the API changes. If you don't mind managing the sidecar index, which I agree isn't such a big deal, then yes - the taxonomy index currently performs the fastest. I plan to explore porting the taxonomy-based approach from BinaryDocValues to the new SortedNumericDocValues (coming in 4.9), since it might perform even faster. I didn't quite get the marker/flag facet.
Can you give an example? For instance, if you can model that as a NumericDocValuesField added to documents (with the different markers/flags translated to numbers), then you can use Lucene's updatable numeric DocValues and write a custom Facets to aggregate on that NumericDocValues field. Shai On Fri, Jun 13, 2014 at 11:48 AM, Sandeep Khanzode < sandeep_khanz...@yahoo.com.invalid> wrote: > Hi, > > I am evaluating Lucene Facets for a project. Since there is a lot of > change in 4.7.2 for Facets, I am relying on UTs for reference. Please let > me know if there are other sources of information. > > I have a couple of questions: > > 1.] All categories in my application are flat, not hierarchical. But it > seems, from a few sources, that even so you would want to > use a taxonomy-based index for performance reasons: it is faster but uses > more RAM. Or is the deterrent the fact that it is a separate > data structure? If one can handle the life-cycle management of the extra > index, should we go ahead with the taxonomy index for better performance > across tens of millions of documents? > > Another note: I do not see a scenario wherein I would want > to re-index my collection over and over again; in other words, the > changes would be spread over time. > > 2.] I need a type of dynamic facet that allows me to add a flag or marker > to the document at runtime, since it will change/update every time a user > modifies or adds to the list of markers. Is this possible with the > current implementation? I believe that currently all faceting is > done at indexing time. > > > --- > Thanks n Regards, > Sandeep Ramesh Khanzode
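Shai's suggestion above can be sketched roughly as follows. This is a hypothetical illustration, not the poster's code: the field names ("id", "markers") and the bit-flag encoding of markers are assumptions, and `updateNumericDocValue` requires Lucene 4.6 or later:

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MarkerUpdate {
    // Hypothetical bit flags for the user-assigned markers.
    static final long HISTORICAL = 1L << 0;
    static final long FINANCIAL  = 1L << 1;

    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_48, new WhitespaceAnalyzer(Version.LUCENE_48)));

        Document doc = new Document();
        doc.add(new StringField("id", "doc-1", Field.Store.NO));
        doc.add(new NumericDocValuesField("markers", HISTORICAL));
        writer.addDocument(doc);

        // Later, when a user adds the "financial" marker: update the doc
        // values in place -- no re-indexing of the whole document.
        writer.updateNumericDocValue(new Term("id", "doc-1"), "markers",
                HISTORICAL | FINANCIAL);
        writer.close();
        dir.close();
    }
}
```

A custom Facets implementation would then read the "markers" NumericDocValues per hit and aggregate counts per flag.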
JTRES 2014: Deadline extended to June 23
(Apologies if you receive multiple copies of this message.) DEADLINE EXTENDED TO JUNE 23, 2014 The 12th International Workshop on Java Technologies for Real-time and Embedded Systems - JTRES 2014 October 13th - 14th Niagara Falls, NY, USA Call for Papers MOTIVATION Over 90% of all microprocessors are now used for real-time and embedded applications. Embedded devices are deployed on a broad diversity of distinct processor architectures and operating systems. The application software for many embedded devices is custom tailored, if not written entirely from scratch. The size of typical embedded system software applications is growing exponentially from year to year, with many of today's embedded systems comprising multiple millions of lines of code. For all of these reasons, the software portability, reuse, and modular composability benefits offered by Java are especially valuable to developers of embedded systems. Both embedded and general-purpose software frequently need to comply with real-time constraints. Higher-level programming languages and middleware are needed to robustly and productively design, implement, compose, integrate, validate, and enforce memory and real-time constraints along with conventional functional requirements for reusable software components. The Java programming language has become an attractive choice because of its safety, productivity, relatively low maintenance costs, and the availability of well-trained developers. Although Java features good software engineering characteristics, traditional Java virtual machine (JVM) implementations are unsuitable for deploying real-time software due to under-specification of thread scheduling and synchronization semantics, unclear demand and utilization of memory and CPU resources, and unpredictable interference associated with automatic garbage collection and adaptive compilation.
GOAL Interest in real-time Java by both the academic research community and commercial industry has been motivated by the need to manage the complexity and costs associated with continually expanding embedded real-time software systems. The goal of the workshop is to gather researchers working on real-time and embedded Java to identify the challenging problems that still need to be solved in order to assure the success of real-time Java as a technology, and to report results and experience gained by researchers. The Java ecosystem has outgrown the combination of Java as programming language and the JVM. For example, Android uses Java as its source language and the Dalvik virtual machine for execution. Languages such as Scala are compiled to Java bytecode and executed on the JVM. JTRES welcomes submissions that apply such approaches to embedded and/or real-time systems. TOPICS OF INTEREST Topics of interest to this workshop include, but are not limited to: - New real-time programming paradigms and language features - Industrial experience and practitioner reports - Open source solutions for real-time Java - Real-time design patterns and programming idioms - High-integrity and safety critical system support - Java-based real-time operating systems and processors - Extensions to the RTSJ and SCJ - Real-time and embedded virtual machines and execution environments - Memory management and real-time garbage collection - Scheduling frameworks, feasibility analysis, and timing analysis - Multiprocessor and distributed real-time Java - Real-time solutions for Android - Languages other than Java on real-time or embedded JVMs SUBMISSION REQUIREMENTS Participants are expected to submit a paper of at most 10 pages (ACM Conference Format, i.e., two-column, 10-point font). Industrial experience and practitioner reports may be submitted as 4-page short papers.
Accepted papers will be published in the ACM International Conference Proceedings Series via the ACM Digital Library and must be presented by one of the authors at JTRES. LaTeX and Word templates can be found at: http://www.acm.org/sigs/pubs/proceed/template.html The ISBN for JTRES 2014 is TBD. Papers describing open source projects shall include a description of how to obtain the source and how to run the experiments in the appendix. Papers should be submitted through EasyChair. Please use the submission link: https://www.easychair.org/conferences/?conf=jtres2014 The best papers will be invited for submission to a special issue of the Journal on Concurrency and Computation: Practice and Experience, as determined by the program committee. IMPORTANT DATES - Paper Submission: extended to 23 June, 2014 - Notification of Acceptance: 27 July, 2014 - Camera Ready Paper Due: 24 August, 2014 - Workshop: 13-14 October, 2014 PROGRAM CHAIR Wolfgang Puffitsch, Technical University of Denmark WORKSHOP CHAIR Lukasz Ziarek, SUNY Buffalo PROGRAM CO
Re: Facets in Lucene 4.7.2
Hi You can check the demo code here: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/. This code is updated with each release, so you always get a working code example, even when the API changes. If you don't mind managing the sidecar index, which I agree isn't such a big deal, then yes - the taxonomy index currently performs the fastest. I plan to explore porting the taxonomy-based approach from BinaryDocValues to the new SortedNumericDocValues (coming in 4.9), since it might perform even faster. I didn't quite get the marker/flag facet. Can you give an example? For instance, if you can model that as a NumericDocValuesField added to documents (with the different markers/flags translated to numbers), then you can use Lucene's updatable numeric DocValues and write a custom Facets to aggregate on that NumericDocValues field. Shai On Fri, Jun 13, 2014 at 11:48 AM, Sandeep Khanzode < sandeep_khanz...@yahoo.com.invalid> wrote: > Hi, > > I am evaluating Lucene Facets for a project. Since there is a lot of > change in 4.7.2 for Facets, I am relying on UTs for reference. Please let > me know if there are other sources of information. > > I have a couple of questions: > > 1.] All categories in my application are flat, not hierarchical. But it > seems, from a few sources, that even so you would want to > use a taxonomy-based index for performance reasons: it is faster but uses > more RAM. Or is the deterrent the fact that it is a separate > data structure? If one can handle the life-cycle management of the extra > index, should we go ahead with the taxonomy index for better performance > across tens of millions of documents? > > Another note: I do not see a scenario wherein I would want > to re-index my collection over and over again; in other words, the > changes would be spread over time. > > 2.]
I need a type of dynamic facet that allows me to add a flag or marker > to the document at runtime, since it will change/update every time a user > modifies or adds to the list of markers. Is this possible with the > current implementation? I believe that currently all faceting is > done at indexing time. > > > --- > Thanks n Regards, > Sandeep Ramesh Khanzode
Re: [lucene 4.6] NPE when calling IndexReader#openIfChanged
On Fri, Jun 13, 2014 at 8:53 AM, Clemens Wyss DEV wrote: > Thanks a lot! >>"large text fields" > What is a good limit (in characters) to switch from StringField to TextField? > Do Analyzers (e.g. GermanAnalyzer) help a lot in reducing the size > of an Index? It's more based on your app's requirements. StringField indexes everything as a single token. >> Add XXXDocValuesField instead of e.g. StringField. > Does this apply only for StringFields? Or for TextFields too? > >> Upgrade to the upcoming Lucene 4.9 > we have not yet transitioned to Java 7/8 ... hopefully soon ;) > >> and take a heap dump and see what's using RAM > Find attached a snippet from MemoryAnalyzer Does this say 59255872 bytes (i.e., ~56.5 MB) being used by the StandardDirectoryReader? I'm a little confused because I don't see which structures sum up to that total. And I would expect the FST (terms index) to take more RAM. Mike McCandless http://blog.mikemccandless.com
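Mike's point that StringField vs. TextField is not about length but about analysis can be illustrated with a short sketch (field names and values here are invented for illustration):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class FieldChoice {
    public static Document build() {
        Document doc = new Document();
        // StringField: the whole value becomes one un-analyzed token --
        // exact match only, so length is irrelevant; use it for ids,
        // paths, and enum-like keys.
        doc.add(new StringField("path", "/invoices/2014/06", Field.Store.YES));
        // TextField: the value is run through the Analyzer (a GermanAnalyzer
        // would also stem) and split into individually searchable terms.
        doc.add(new TextField("body", "Der schnelle braune Fuchs", Field.Store.NO));
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(build().getFields().size());
    }
}
```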
fuzzy/case-insensitive AnalyzingSuggester
Looking for an AnalyzingSuggester which supports: - fuzziness - case insensitivity - a small in-memory footprint (*) (*) I just tried to "hand" my big IndexReader (see the other post "[lucene 4.6] NPE when calling IndexReader#openIfChanged") to JaspellLookup and got an OOM. Is there any (Jaspell)Lookup implementation that can handle really big indexes (by swapping out part of the "lookup table")?
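For what it's worth, Lucene 4.x already ships a suggester that covers the first two requirements: FuzzySuggester (an AnalyzingSuggester subclass) tolerates edit-distance errors, and case insensitivity falls out of using a lower-casing analyzer. A minimal sketch, with an invented in-line dictionary standing in for the real index:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.suggest.Lookup;
import org.apache.lucene.search.suggest.analyzing.FuzzySuggester;
import org.apache.lucene.util.Version;

public class FuzzySuggestExample {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer lower-cases input, giving case insensitivity;
        // FuzzySuggester adds Levenshtein-tolerant matching on top of
        // AnalyzingSuggester's compact FST.
        FuzzySuggester suggester =
                new FuzzySuggester(new StandardAnalyzer(Version.LUCENE_48));
        suggester.build(new PlainTextDictionary(
                new StringReader("Lucene\nLucille\nLufthansa\n")));
        // "lucne" (typo, lower case) should still reach "Lucene".
        for (Lookup.LookupResult r : suggester.lookup("lucne", false, 5)) {
            System.out.println(r.key);
        }
    }
}
```

The FST it builds is far more memory-frugal than JaspellLookup's trie, though it is still fully RAM-resident; whether that is small enough for a really big index has to be measured.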
AW: [lucene 4.6] NPE when calling IndexReader#openIfChanged
Thanks a lot! >"large text fields" What is a good limit (in characters) to switch from StringField to TextField? Do Analyzers (e.g. GermanAnalyzer) help a lot in reducing the size of an Index? > Add XXXDocValuesField instead of e.g. StringField. Does this apply only for StringFields? Or for TextFields too? > Upgrade to the upcoming Lucene 4.9 we have not yet transitioned to Java 7/8 ... hopefully soon ;) > and take a heap dump and see what's using RAM Find attached a snippet from MemoryAnalyzer:

Class Name | Shallow Heap | Retained Heap | Percentage
---
org.apache.lucene.index.StandardDirectoryReader @ 0x783932460 | 72 | 59'255'872 | 3.04%
|- org.apache.lucene.index.SegmentReader[24] @ 0x794089ee0 | 112 | 59'190'960 | 3.03%
| |- org.apache.lucene.index.SegmentReader @ 0x788820f40 | 72 | 16'905'072 | 0.87%
| | |- org.apache.lucene.index.SegmentCoreReaders @ 0x7910cacc8 | 56 | 16'895'576 | 0.87%
| | | |- org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader @ 0x780661c50 | 24 | 16'864'864 | 0.86%
| | | | |- org.apache.lucene.codecs.BlockTreeTermsReader @ 0x7910cae50 | 56 | 16'864'240 | 0.86%
| | | | | |- java.util.TreeMap @ 0x783902738 | 48 | 16'858'472 | 0.86%
| | | | | | '- java.util.TreeMap$Entry @ 0x77ec5f9f8 | 40 | 16'858'424 | 0.86%
| | | | | | |- java.util.TreeMap$Entry @ 0x77ec5fa20 | 40 | 10'895'656 | 0.56%
| | | | | | |- java.util.TreeMap$Entry @ 0x77ec5fa48 | 40 | 5'960'072 | 0.31%
| | | | | | | |- java.util.TreeMap$Entry @ 0x77ec5fa98 | 40 | 5'958'072 | 0.31%
| | | | | | | | |- java.util.TreeMap$Entry @ 0x77fc09bf0 | 40 | 5'949'864 | 0.30%
| | | | | | | | |- org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader @ 0x788820e20 | 72 | 8'168 | 0.00%
| | | | | | | | '- Total: 2 entries
| | | | | | | |- java.util.TreeMap$Entry @ 0x77ec5fa70 | 40 | 1'000 | 0.00%
| | | | | | | | '- org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader @ 0x78347fbc0 | 72 | 960 | 0.00%
| | | | | | | | |- org.apache.lucene.util.fst.FST @ 0x788fe34c8 | 104 | 840 | 0.00%
| | | | | | | | | |- org.apache.lucene.util.fst.FST$Arc[128] @ 0x7870932a0 | 528 | 528 | 0.00%
| | | | | | | | | |- org.apache.lucene.util.fst.BytesStore @ 0x77ec5fb60 | 40 | 144 | 0.00%
| | | | | | | | | | '- java.util.ArrayList @ 0x780663b28 | 24 | 104 | 0.00%
| | | | | | | | | |- org.apache.lucene.util.BytesRef @ 0x780663b10 | 24 | 48 | 0.00%
| | | | | | | | | | '- byte[5] @ 0x780
Re: [lucene 4.6] NPE when calling IndexReader#openIfChanged
On Fri, Jun 13, 2014 at 3:02 AM, Clemens Wyss DEV wrote: >> limit how many fields have norms enabled > We have one index for approx 7000 pdfs (24GB). Of course no content is STOREd > (but ANALYZEd). This very index occupies 4GB on disk and the corresponding > IndexReader is 60MB. > Are norms per default enabled for org.apache.lucene.document.TextField? Yes. Norms are a good idea for "large text fields", e.g. body text or a catch-all field, but usually not a good idea for tiny fields (e.g. title). >> use disk-based doc values not field cache > How is this done? Add XXXDocValuesField instead of e.g. StringField. >> etc. > such as? ;) Upgrade to the upcoming Lucene 4.9; there have been some improvements e.g. to norms compression. You can tune your terms index settings, but the terms index usually doesn't use much RAM. You can fire up your app, get all searchers warmed, and take a heap dump and see what's using RAM. We can iterate from there. Mike McCandless http://blog.mikemccandless.com
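The two indexing-time measures Mike names (doc values instead of the field cache, norms only where they help) look roughly like this; field names and values are invented for the sketch:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.util.BytesRef;

public class RamSavingFields {
    public static Document build() {
        Document doc = new Document();
        // Disk-based doc values: sorting/grouping reads a per-document
        // column from disk instead of un-inverting the field into the
        // RAM-resident FieldCache at search time.
        doc.add(new SortedDocValuesField("category", new BytesRef("news")));
        // Norms off for a tiny field: saves one byte per document per field,
        // and length normalization rarely helps on short titles anyway.
        FieldType titleType = new FieldType(TextField.TYPE_STORED);
        titleType.setOmitNorms(true);
        titleType.freeze();
        doc.add(new Field("title", "Quarterly report", titleType));
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(build().getFields().size());
    }
}
```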
Facets in Lucene 4.7.2
Hi, I am evaluating Lucene Facets for a project. Since there is a lot of change in 4.7.2 for Facets, I am relying on UTs for reference. Please let me know if there are other sources of information. I have a couple of questions: 1.] All categories in my application are flat, not hierarchical. But, it seems from a few sources, that even that notwithstanding, you would want to use a Taxonomy based index for performance reasons. It is faster but uses more RAM. Or is the deterrent to use it is the fact that it is a separate data structure. If one could do with the life-cycle management of the extra index, should we go ahead with the taxonomy index for better performance across tens of millions of documents? Another note to add is that I do not see a scenario wherein I would want to re-index my collection over and over again or, in other words, the changes would be spread over time. 2.] I need a type of dynamic facet that allows me to add a flag or marker to the document at runtime since it will change/update every time a user modifies or adds to the list of markers. Is this possible to do with the current implementation? Since I believe, that currently all faceting is done at indexing time. --- Thanks n Regards, Sandeep Ramesh Khanzode
searching in hierarchical structures
We use Lucene to search in hierarchical structures, like a folder structure in a filesystem. The documents have an extra field which specifies the location of the document, so if you want to search for documents under a specific folder you have to query a prefix of this field. But if the documents are moved to another location, every document must be updated; in our case this is not a good option. Are there any concepts for implementing hierarchical structures in Lucene? Does someone have a suggestion? I know Lucene is full-text search and therefore primarily for flat structures. Greetings, Sascha
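The prefix-on-a-path-field scheme the poster describes can be sketched as below (the field name "path" and folder value are hypothetical). The caveat from the post stands: because a folder move changes the prefix of every descendant, each affected document must be re-indexed under this scheme.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

public class FolderScope {
    // Matches all documents whose indexed location starts with the
    // given folder path, e.g. "/projects/reports/".
    public static Query underFolder(String folderPath) {
        return new PrefixQuery(new Term("path", folderPath));
    }

    public static void main(String[] args) {
        System.out.println(underFolder("/projects/reports/"));
    }
}
```

One common workaround is to index a document's immediate parent by a stable folder id rather than by full path, so a folder move rewrites only the folder's own record; scoping a search to a subtree then requires resolving the set of descendant folder ids first.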
AW: [lucene 4.6] NPE when calling IndexReader#openIfChanged
> limit how many fields have norms enabled We have one index for approx 7000 pdfs (24GB). Of course no content is STOREd (but ANALYZEd). This very index occupies 4GB on disk and the corresponding IndexReader is 60MB. Are norms per default enabled for org.apache.lucene.document.TextField? > use disk-based doc values not field cache How is this done? > etc. such as? ;) -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, 21 May 2014 11:21 To: Lucene Users Subject: Re: [lucene 4.6] NPE when calling IndexReader#openIfChanged On Wed, May 21, 2014 at 3:17 AM, Clemens Wyss DEV wrote: >> Can you just decrease IW's ramBufferSizeMB to relieve the memory pressure? > +1 > Is there something alike for IndexReaders? No, although you can take steps during indexing to reduce the RAM required during searching, e.g. limit how many fields have norms enabled, use disk-based doc values not field cache, etc. Mike McCandless http://blog.mikemccandless.com
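Decreasing IndexWriter's RAM buffer, as discussed in the quoted exchange, is a one-line configuration change; the 8 MB value below is an arbitrary example, not a recommendation from the thread:

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class WriterTuning {
    public static IndexWriterConfig smallBufferConfig() {
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_48,
                new WhitespaceAnalyzer(Version.LUCENE_48));
        // Default is 16 MB; a smaller buffer flushes segments earlier,
        // trading indexing throughput for lower heap pressure.
        cfg.setRAMBufferSizeMB(8.0);
        return cfg;
    }

    public static void main(String[] args) {
        System.out.println(smallBufferConfig().getRAMBufferSizeMB());
    }
}
```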