Interest in Lucene specialists..
----- Original Message -----
From: William Wong
To: Lucene-user
Sent: Friday, October 05, 2001 5:12 PM
Subject: RE: Lucene has moved to Jakarta

How about adding filters for different file types, such as:
- HTML (there is one in the demo already)
- XML
- PDF
- MS Word/RTF
- other common file formats

Thanks.
-william

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Friday, October 05, 2001 11:42 AM
To: '[EMAIL PROTECTED]'; [EMAIL PROTECTED]
Subject: RE: Lucene has moved to Jakarta

> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
>
> Congratulations on the move!

Thanks!

> As near as I can see, the two major changes for 1.2-rc1 are:
> - switch to org.apache.lucene package names
> - Apache license instead of LGPL

Yes. Thanks for pointing these out. These are big incompatible changes that
I forgot to mention. Other changes since 1.01b include:

- ant-only build -- no more makefiles
- addition of lock files -- now fully thread and process safe
- addition of German stemmer
- MultiSearcher now supports low-level search API
- added RangeQuery, for term-range searching
- Analyzers can choose tokenizer based on field name
- misc bug fixes

I need to work up detailed release notes for the final 1.2 release.

> Sometime when someone has a chance, I'd love to hear a bit about what
> plans there are for Lucene development.

Let's see, some short-term tasks for the 1.2 release:

- get source code back into releases
- clean up example code
- write release notes

Some mid-term tasks:

- add contributed Chinese analyzers
- add Hits.setOrdering() support
- add some term highlighting support

Longer-term tasks:

- add JDBC-based Directory
- optimize simple conjunctive queries
- optionally store document vectors in the index

Have I missed your favorite?

Doug
Re: Interest in Lucene specialists..
I'm interested in exploring further whether Lucene is suitable for one of our projects and would like to leverage specialist expertise. Is there any forum where specialist Lucene expertise (evaluation and development support) can be sought on commercial terms?

I'm not sure if this is the right forum to post this question; if not, my apologies in advance, with a request to be redirected to the appropriate forum. Please advise.

Sun
File Handles issue
We're having a heck of a time with too many file handles around here. When we create large indexes, we often get thousands of temporary files in a given index! Even worse, we just plain run out of file handles -- even on boxes where we've upped the limits as much as we think we can! We've played around with various settings for the mergeFactor and maxMergeDocs, but these seem to have at best an indirect effect on the number of temporary files created.

I'm not very familiar with the Lucene file system yet, so can someone briefly explain how Lucene creates an index? How does it determine when to create a new temporary file in the index, and when does it decide to compress the index? Also, is there any way we could limit the number of file handles used by Lucene?

This is becoming a huge problem for us, so any insight would be appreciated.

Thanks,
Scott
RE: File Handles issue
> From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
>
> We're having a heck of a time with too many file handles around here.
> When we create large indexes, we often get thousands of temporary files
> in a given index!

Thousands, eh? That seems high. The maximum number of segments should be f*log_f(N), where f is the IndexWriter.mergeFactor and N is the number of documents. The default merge factor is ten. There are seven files per segment, plus one per field. If we assume that you have three fields per document, then it's ten files per segment. So to get 1000 files in an index with three fields and a mergeFactor of ten, you'd need 10 billion documents, which I doubt you have. (Lucene can't handle more than 2 billion anyway...)

How many fields do you have? (How many different .f files are there per segment?)

Have you lowered IndexWriter.maxMergeDocs? If you, e.g., lowered this to 10,000, then with a million documents you'd have 100 segments, which would give you 1000 files. So, to minimize the number of files, keep maxMergeDocs at Integer.MAX_VALUE, its default.

Another possibility is that you're running on Win32, where obsolete files kept open by IndexReaders cannot be deleted. Could that be the case?

> Even worse, we just plain run out of file handles -- even on boxes where
> we've upped the limits as much as we think we can!

You should endeavor to keep just one IndexReader open at a time per index. When it is out of date, don't close it, as this could break queries running in other threads; just let it get garbage collected. The finalizers will close things and free the file handles.

> I'm not very familiar with the Lucene file system yet, so can someone
> briefly explain how Lucene creates an index? How does it determine when
> to create a new temporary file in the index, and when does it decide to
> compress the index?

Assume mergeFactor is ten, the default. A new segment is created on disk for every ten documents added, or sooner if IndexWriter.close() is called before ten have been added. When the tenth segment of size ten is added, all ten are merged into a single segment of size 100. When ten such segments of size 100 have been added, these are merged into a single segment containing 1000 documents, and so on. So at any time there can be no more than nine segments at each power-of-ten index size. When optimize() is called, all segments are merged into a single segment.

The exception is that no segments will be created larger than IndexWriter.maxMergeDocs. So if this were set to 1000, then when you add the 10,000th document, instead of merging everything into a single segment of 10,000, it would add a tenth segment of size 1000, and keep adding segments of size 1000 for every 1000 documents added.

> Also, is there any way we could limit the number of file handles used by
> Lucene?

An IndexReader keeps all files in all segments open while it is open. So to minimize the number of file handles, you should minimize the number of segments, minimize the number of fields, and minimize the number of IndexReaders open at once. An IndexWriter also has all files in all segments open at once, so updating in a separate process would also buy you more file handles.

Doug
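[In code, the tuning described above might look like this minimal sketch. It assumes the Lucene 1.x API of the time, where mergeFactor and maxMergeDocs are public fields on IndexWriter; the index path, analyzer choice, and field name are placeholders, not anything from the thread.]

    import java.io.IOException;

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Sketch: keep the segment count, and hence the open-file count, low.
    public class TunedIndexing {
      public static void main(String[] args) throws IOException {
        IndexWriter writer = new IndexWriter("/tmp/index", new SimpleAnalyzer(), true);
        writer.mergeFactor = 10;                 // the default; each merge folds ten segments into one
        writer.maxMergeDocs = Integer.MAX_VALUE; // the default; lowering it caps segment size
                                                 // but multiplies the number of segments

        Document doc = new Document();
        doc.add(Field.Text("contents", "some text"));
        writer.addDocument(doc);

        writer.optimize(); // merge everything down to a single segment
        writer.close();
      }
    }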
Index Optimization: Which is Better?
We are experimenting with XML-aware indexing. The approach we're trying is to index every element in a given XML document as a separate Lucene document, along with another Lucene document that captures just the concatenated text content of the document (to handle searching for phrases across element boundaries) -- what we're calling the "all-content" Lucene document. We are using a node-type field to distinguish the different types of XML constructs we are indexing (elements, comments, PIs, etc.), and we also thought we would use node type to distinguish the all-content document. When we get a hit list, we can then use the node type to figure out which XML constructs contained the target text and reduce the per-element Lucene documents to single XML documents for the final query result. We can also use node type to limit the query (you might want to search just in PIs or just in comments, for example).

Our question is this: given that for the all-content document we could either use the default content field for the text and the node-type field to label the document as the all-content node, or simply use a different field name for the content (e.g., alltext or something), which of the following queries would tend to perform better? This:

    "some text" AND nodetype:ALL_CONTENT

or:

    alltext:"some text"

Or is there any practical difference? The way we construct the Lucene document will affect how our front end and/or users have to construct queries. It would be slightly more convenient for front ends to get the all-content doc by default (using the content field for the text), but we thought the AND clause needed to limit searches to just the text (thus ignoring element-specific searching) might incur a performance penalty.

In a related question, is there anything we can or need to do to optimize Lucene to handle lots of little Lucene documents?

Thanks,
Eliot
--
W. Eliot Kimber | Lead Brain
1016 La Posada Dr. | Suite 240 | Austin TX 78752
T 512.656.4139 | F 512.419.1860 | [EMAIL PROTECTED]
w w w . d a t a c h a n n e l . c o m
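[Roughly, the two alternatives correspond to query shapes like the following. This is a sketch only; it assumes the Lucene 1.x static QueryParser.parse() API, and WhitespaceAnalyzer stands in for whatever analyzer the index actually uses. The field names contents, nodetype, and alltext are the ones from the message above.]

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class QueryShapes {
      public static void main(String[] args) throws Exception {
        // Option 1: phrase in the default content field, plus a
        // node-type clause to restrict hits to the all-content doc.
        Query filtered = QueryParser.parse(
            "\"some text\" AND nodetype:ALL_CONTENT",
            "contents", new WhitespaceAnalyzer());

        // Option 2: a dedicated field holding the concatenated text,
        // so no extra clause is needed.
        Query dedicated = QueryParser.parse(
            "alltext:\"some text\"",
            "contents", new WhitespaceAnalyzer());

        System.out.println(filtered);
        System.out.println(dedicated);
      }
    }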
Re: Index Optimization: Which is Better?
Doug wrote:

> I'm having trouble getting a clear picture of your indexing scheme.

I've been doing a lot of thinking about this same problem, so I may be a little more in tune with what Eliot's saying. By the way, Eliot, I'm very interested in your results. I considered the basic approach you're using, but I thought it was a bit extreme in terms of having zillions of tiny Lucene Documents. I'm working on a quick kludge that may serve my immediate purposes (if it does, I'm planning to post the details here).

> Could you provide some simple examples? E.g., for the xml:
>
>     <tag1>this is some text
>       <tag2>and some other text</tag2>
>     </tag1>
>
> would you have something like the following?
>
>     doc1
>       node_type: tag1
>       contents: this is some text
>     doc2
>       node_type: tag2
>       contents: and some other text
>     doc3
>       node_type: all_contents
>       contents: this is some text and some other text

I think that's exactly what Eliot is intending.

> My first instinct would be to have something like:
>
>     doc1
>       tag1: this is some text
>       tag2: and some other text
>       all-tags: this is some text and some other text
>
> What do you need that that does not achieve?

Name collision -- you can have multiple elements at different levels, and you may have attributes and tags sharing the same name. Obviously one way around this is "don't do that", but that could get really tiresome, quickly. If you just conflate the elements and attributes under the same name (i.e., field "blah" contains a concatenated set of values from all occurrences of both elements and attributes), then your searches become much more limited in what you can specify. This is, by the way, the approach I'm trying out, with a second stage to refine the results and drop out false positives. But I'll have to wait to say any more about that.

All of this, of course, is in the context of having arbitrary XML documents. If you have predefined XML schemas, then you can hand-code the mappings from elements to Lucene document fields. But then you trade a heck of a lot of flexibility for a lot of maintenance.

Steven J. Owens
[EMAIL PROTECTED]
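[For concreteness, the per-element scheme in Doug's example above might be built like this. A minimal, hypothetical sketch using the Lucene 1.x Field factory methods: the index path is made up, and the node_type/contents/all_contents field names and values come straight from the example XML.]

    import java.io.IOException;

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class PerElementIndexing {
      public static void main(String[] args) throws IOException {
        IndexWriter writer = new IndexWriter("/tmp/xml-index", new SimpleAnalyzer(), true);

        // One small Document per element: identically named tags at
        // different levels never collide, at the cost of many tiny docs.
        Document d1 = new Document();
        d1.add(Field.Keyword("node_type", "tag1"));    // untokenized, for exact filtering
        d1.add(Field.Text("contents", "this is some text"));
        writer.addDocument(d1);

        Document d2 = new Document();
        d2.add(Field.Keyword("node_type", "tag2"));
        d2.add(Field.Text("contents", "and some other text"));
        writer.addDocument(d2);

        // The all-content document, for phrase searches that cross
        // element boundaries.
        Document d3 = new Document();
        d3.add(Field.Keyword("node_type", "all_contents"));
        d3.add(Field.Text("contents", "this is some text and some other text"));
        writer.addDocument(d3);

        writer.optimize();
        writer.close();
      }
    }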