RE: Preventing merging by IndexWriter
> Why go through all this effort when it's easy to make your own unique ID?
> Add a new field to each document "myuniqueid" and fill it in yourself. It'll
> never change then.

I am sorry I did not mention in my post that I am aware of this solution, but it cannot be used for my purposes. I need the stable ID during filtering and afterwards for counting for faceted browsing. My tests show, and from the documentation and mailing lists I conclude, that retrieving a stable ID from a field during filtering and for each hit in a query result is too expensive.

After your post I put some more thought into storing a stable ID in a field. I figured I could read all the stable IDs once and create a map from Lucene IDs to stable IDs. But this takes too long (> 600 ms on my laptop) even for a small set of documents (< 150,000). Another problem is that I would have to do millions of additional lookups during filtering and counting.

> Of course, I may misunderstand your problem space enough that this is
> useless. If so, please tell us the problem you're trying to solve and maybe
> wiser heads than mine will have better suggestions

Here is a description of our problem. We want to build a repository that can handle a number of documents in the low millions (we are designing the repository for 10 million documents initially). Almost all navigation through this repository will be faceted. For this we need to be able to filter based on the facet values selected by the user, and we have to count how many documents in the search result have a particular facet value, for multiple (estimate: 25-40) facet values.

The documents in the repository are constantly changed, and we want the faceted navigation to be updated in near real time: if the user refreshes a search page after making changes to a document, the changes should be visible. I estimate we have about 250-500 ms, the time it takes to go to another page and refresh it, to update the index(es).

My idea is to use Lucene for regular searching, and a custom index for filtering based on facets and for counting the number of matches for facet values. For this to work, (reasonably) stable IDs are needed so that updating the facet value index is simply changing values in a number of arrays. I am willing to sacrifice search performance for stable IDs if it gains performance in faceted filtering and counting.

Johan
RE: Preventing merging by IndexWriter
> > So my questions are: is there a way to prevent the IndexWriter from
> > merging, forcing it to create a new segment for each indexing batch?
>
> Already done in the Lucene trunk:
> http://issues.apache.org/jira/browse/LUCENE-672
>
> Background:
> http://www.gossamer-threads.com/lists/lucene/java-dev/39799#39799
>
> > And if so, will it still be possible to merge the disk segments when I want to?
>
> call optimize()

Thanks, I have got it working now. But I think it is not a viable solution either. These are the problems that I see:

- The number of segments will probably increase too fast, requiring regular optimizations.
- Given the size of the external data, making it consistent with the Lucene index will, in the worst case, require processing and writing to disk of hundreds of megabytes.

The biggest problem is of course the combination of the two: having to process too much data too many times.

Johan
Re: index architectures
No, you've got that right. But there's something I think you might be able to try. Fair warning: I'm remembering things I've read on this list, and my memory isn't what it used to be.

I *think* that if you reduce your result set by, say, a filter, you might drastically reduce what gets sorted. I'm thinking of something like this:

BooleanQuery bq = new BooleanQuery();
bq.add(filter for the last N days wrapped in a ConstantScoreQuery, MUST);
bq.add(all the rest of your stuff);

RangeFilter might work for you here. Even if this works, you'll still have to deal with making the range big enough to do what you want. Perhaps an iterative approach: if the first time you run the query you don't get your 25 (or whatever) results, increase the range and try again.

Again, I'm not entirely sure when the filter gets applied, before or after the sort. Nor am I sure how to tell. I'd sure like you to do the work and tell me how. I *am* sure that this has been discussed in this mailing list, so a search there might settle this. C'mon Chris, Erik and Yonik, can't you recognize a plea for help when you read it?

Although here's yet another thing that flitted through my mind. Is date order really the same as doc ID order? And would you be able to sort on doc ID instead? And would it matter? If you're adding your documents as they come in, this might work. Doc IDs change, but I *believe* that if doc A is added after doc B, the doc ID for A will always be greater than the doc ID for B, although neither of them is guaranteed to be the same between index optimizations. Again, not sure if this helps at all.

Good luck!
Erick

On 10/18/06, Paul Waite <[EMAIL PROTECTED]> wrote:

Many thanks to Erik and Ollie for responding - a lot of ideas and I'll have my work cut out grokking them properly and thinking about what to do. I'll respond further as that develops.

One quick thing though - Erik wrote:

> So, I wonder if your out of memory issue is really related to the number
> of requests you're servicing. But only you will be able to figure that
> out.

These problems are...er...unpleasant to track down... Indeed!

> I guess I wonder a bit about what large result sets is all about. That
> is, do your users really care about results 100-10,000 or do they just
> want to page through them on demand?

No they don't want that. They just want a small number. What happens is they enter some silly query, like searching for all stories with a single common non-stop-word in them, with the usual sort criterion of date (i.e. a field) descending, and a limit of, say, 25.

So Lucene then presumably has to haul out a massive result set, sort it, and return the top 25 (out of 500,000 or whatever). Isn't that how it goes? Or am I missing something horribly obvious?

Cheers,
Paul.
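A minimal sketch of the filter idea described above, for concreteness. The field names "date" and "body", the yyyyMMdd term format, and the hard-coded range endpoints are assumptions for illustration only; dates would need to be indexed as sortable strings for this to work.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.RangeFilter;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TermQuery;

    public class RecentOnlySearch {
        // Restrict candidates to a date window before sorting, so the sort sees
        // far fewer documents than the raw query would produce.
        public static Hits searchRecent(IndexSearcher searcher) throws Exception {
            Filter lastNDays = new RangeFilter("date", "20061001", "20061018", true, true);

            BooleanQuery bq = new BooleanQuery();
            // Wrap the filter so it can sit inside the BooleanQuery as a required clause.
            bq.add(new ConstantScoreQuery(lastNDays), BooleanClause.Occur.MUST);
            // ... plus whatever the user actually asked for:
            bq.add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.MUST);

            // Sort by date descending, newest first.
            return searcher.search(bq, new Sort(new SortField("date", SortField.STRING, true)));
        }
    }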
Re: Preventing merging by IndexWriter
Your problem is out of my experience, so all I can suggest is that you search the list archive. I know the idea of faceted searching has been discussed by people with waaay more experience in that realm than I have and, as I remember, there were some links provided.

I just searched for 'faceted' in the e-mails I've seen since I subscribed to the list, and there are certainly discussions out there... This thread might be particularly useful, started 15-May-2006:

*Aggregating category hits*

Best of luck
Erick

On 10/18/06, Johan Stuyts <[EMAIL PROTECTED]> wrote:

> > So my questions are: is there a way to prevent the IndexWriter from
> > merging, forcing it to create a new segment for each indexing batch?
>
> Already done in the Lucene trunk:
> http://issues.apache.org/jira/browse/LUCENE-672
>
> Background:
> http://www.gossamer-threads.com/lists/lucene/java-dev/39799#39799
>
> > And if so, will it still be possible to merge the disk segments when I want to?
>
> call optimize()

Thanks, I have got it working now. But I think it is not a viable solution either. These are the problems that I see:

- The number of segments will probably increase too fast, requiring regular optimizations.
- Given the size of the external data, making it consistent with the Lucene index will, in the worst case, require processing and writing to disk of hundreds of megabytes.

The biggest problem is of course the combination of the two: having to process too much data too many times.

Johan
Re: index architectures
: I *think* that if you reduce your result set by, say, a filter, you might
: drastically reduce what gets sorted. I'm thinking of something like this
: BooleanQuery bq = new BooleanQuery();
: bq.add(Filter for the last N days wrapped in a ConstantScoreQuery, MUST)
: bq.add(all the rest of your stuff).
	...
: Again, I'm not entirely sure when the filter gets applied, before or after
: the sort. Nor am I sure how to tell. I'd sure like you to do the work and

The memory required for sorting on a field is independent of the size of the result -- so a Filter wouldn't help you here. The reason is because sorting builds/uses the FieldCache, which contains all the values for all docs, so that it can be reused for sorting future queries as well.

That said: if you are seeing OOM errors when you sort by a field (but not when you use the docId ordering, or sort by score) then it sounds like you are keeping references to IndexReaders around after you've stopped using them -- the FieldCache is kept in a WeakHashMap keyed off of the IndexReader, so it should get garbage collected as soon as you let go of it. Another possibility is that you are sorting on too many fields for it to be able to build the FieldCache for all of them in the RAM you have available.

There was some discussion recently on java-dev about an approach to sorting that took advantage of lazy field loading instead of the FieldCache to sort on the *stored* value of a field, the goal being to make sorting small result sets possible with a small amount of RAM ... but I don't remember if the person working on it ever submitted a patch.

-Hoss
Re: index architectures
Hi,

On Wed, 2006-10-18 at 19:05 +1300, Paul Waite wrote:
> No they don't want that. They just want a small number. What happens is
> they enter some silly query, like searching for all stories with a single
> common non-stop-word in them, and with the usual sort criterion of by date
> (ie. a field) descending, and a limit of, say 25.
>
> So Lucene then presumably has to haul out a massive resultset, sort it, and
> return the top 25 (out of 500,000 or whatever).

I had a similar issue recently: users only want the 100 (or whatever) most recently updated documents which match, and our documents aren't stored in date order. Originally, we would walk the result set, instantiate a Document instance, pull out the timestamp field, and keep around the top 100 documents. Obviously this is extremely slow for large result sets.

What I initially did to address this was store a reverse timestamp and walk the list of terms in the reverse timestamp field (they're sorted lexicographically), returning the 100 most recent matching documents. In most cases this was a lot faster (for a search which returned 153,142 matches, I only had to walk 288 documents to find the 100 most recent), but in some cases it was a lot slower (for another search which returned 339 matches, I had to walk 292,911 documents to find the 100 most recent).

In the end I found that I could walk 5 terms for every 2 documents I could instantiate, and tuned a heuristic so that in the worst case (my second example) searches are 50% slower, but in almost all other cases they're quite a bit faster.

Hope this helps,

Joe
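For anyone wondering what "walking the list of terms in the reverse timestamp field" can look like in code, here is a rough sketch. The field name "rts" and the idea that the matching documents are available as a BitSet (e.g. produced by a Filter) are assumptions of the sketch, not part of the actual code described above.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    public class MostRecentWalker {
        // Walk the terms of the reverse-timestamp field in lexicographic order
        // (i.e. newest document first) and keep the first `limit` documents that
        // are part of the result set.
        public static List findMostRecent(IndexReader reader, BitSet matches, int limit)
                throws IOException {
            List result = new ArrayList();
            TermEnum terms = reader.terms(new Term("rts", ""));
            TermDocs termDocs = reader.termDocs();
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !"rts".equals(t.field())) break;  // ran past the field
                    termDocs.seek(terms);
                    while (termDocs.next() && result.size() < limit) {
                        if (matches.get(termDocs.doc())) {
                            result.add(new Integer(termDocs.doc()));
                        }
                    }
                } while (result.size() < limit && terms.next());
            } finally {
                termDocs.close();
                terms.close();
            }
            return result;
        }
    }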
Re: index architectures
On Wed, 2006-10-18 at 19:05 +1300, Paul Waite wrote:
> No they don't want that. They just want a small number. What happens is
> they enter some silly query, like searching for all stories with a single
> common non-stop-word in them, and with the usual sort criterion of by date
> (ie. a field) descending, and a limit of, say 25.
>
> So Lucene then presumably has to haul out a massive resultset, sort it, and
> return the top 25 (out of 500,000 or whatever).

I had a similar requirement on a project last year. I implemented a two-pronged approach:

1. Index (i.e. addDocument()) the documents in the order I wanted the final sort to be.
2. Modify a few classes to give a "first N" results capability.

We had 4 or 5 sort orders, so I had multiple indexes on the same data, 2 per sort order (forward and reverse), and used the appropriate one at search time. It wasn't a one-hour change, but it didn't take a man-year either.

Hope it helps!

--MDC
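A sketch of what a "first N" collector might look like when the documents were added in the desired sort order, so that increasing doc id order *is* the sort order. This is an illustration only, not the actual change described above, and it assumes the scorer delivers hits in increasing doc id order.

    import org.apache.lucene.search.HitCollector;

    // Keeps only the first N doc ids it sees; everything after that is ignored.
    public class FirstNCollector extends HitCollector {
        private final int[] docs;
        private int count = 0;

        public FirstNCollector(int n) {
            docs = new int[n];
        }

        public void collect(int doc, float score) {
            if (count < docs.length) {
                docs[count++] = doc;
            }
            // No early termination here: later hits are simply discarded.
        }

        public int[] topDocs() {
            int[] result = new int[count];
            System.arraycopy(docs, 0, result, 0, count);
            return result;
        }
    }

Used as searcher.search(query, new FirstNCollector(25)); the reverse sort order would come from the second ("reverse") index mentioned above.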
Error using Luke
Hi,

I am getting this error when accessing my index with Luke:

No sub-file with id _1.f0 found

Does anyone have any idea about this? Any help would be appreciated.

Thanks,
-Vasu
Re: Error using Luke
It seems that you created your index with norms turned off and are trying to open it with Luke, which may contain an older version of Lucene.

vasu shah wrote:
> Hi,
>
> I am getting this error when accessing my index with Luke:
>
> No sub-file with id _1.f0 found
>
> Does anyone have any idea about this? Any help would be appreciated.
>
> Thanks,
> -Vasu

--
regards,
Volodymyr Bychkoviak
Re: Error using Luke
Thank you very much. I have indeed turned off the norms.

Is there any new version of Luke that I can use?

Thanks,
-Vasu

Volodymyr Bychkoviak <[EMAIL PROTECTED]> wrote:

It seems that you created your index with norms turned off and are trying to open it with Luke, which may contain an older version of Lucene.

vasu shah wrote:
> Hi,
>
> I am getting this error when accessing my index with Luke:
>
> No sub-file with id _1.f0 found
>
> Does anyone have any idea about this? Any help would be appreciated.
>
> Thanks,
> -Vasu

--
regards,
Volodymyr Bychkoviak
Re: near duplicates
On 17 Oct 2006, at 18:55, Andrzej Bialecki wrote:

> You need to create a fuzzy signature of the document, based on term
> histogram or shingles - take a look at the Signature framework in Nutch.
>
> There is a substantial literature on this subject - go to Citeseer and run
> a search for "near duplicate detection".

Interesting. I'll have to check this out a bit more some day(tm).
Re: Error using Luke
You can get Lucene 1.9.1 and make Luke use this version (you need luke.jar, not luke-all.jar). Version 1.9.1 contains API which was removed in version 2.0 of Lucene (as deprecated) and should still be able to read indexes created by Lucene 2.0 (correct me if I'm wrong).

Then run Luke with a command line like this:

java -classpath luke.jar;lucene-1.9.1.jar org.getopt.luke.Luke

vasu shah wrote:
> Thank you very much. I have indeed turned off the norms.
>
> Is there any new version of Luke that I can use?
>
> Thanks,
> -Vasu
>
> Volodymyr Bychkoviak <[EMAIL PROTECTED]> wrote:
> It seems that you created your index with norms turned off and are trying
> to open it with Luke, which may contain an older version of Lucene.
>
> vasu shah wrote:
> > Hi,
> >
> > I am getting this error when accessing my index with Luke:
> >
> > No sub-file with id _1.f0 found
> >
> > Does anyone have any idea about this? Any help would be appreciated.
> >
> > Thanks,
> > -Vasu

--
regards,
Volodymyr Bychkoviak
Re: Lucene 2.0.1 release date
> This makes it relatively safe for people to grab a snapshot of the trunk
> with less concern about latent bugs.

> I think the concern is that if we start doing this stuff on trunk now,
> people that are accustomed to snapping from the trunk might be surprised,
> and not in a good way.

+1 on this. There are some great performance improvements in 2.0.1.

Peter

On 10/17/06, Steven Parkes <[EMAIL PROTECTED]> wrote:

I think the idea is that 2.0.1 would be a patch-fix release from the branch created at the 2.0 release. This release would incorporate only back-ported high-impact patches, where "high-impact" is defined by the community. Certainly security vulnerabilities would be included. As Otis said, to date, nobody seems to have raised any issues to that level.

2.1 will include all the patches and new features that have been committed since 2.0; there've been a number of these. But releases are done pretty ad hoc at this point and there hasn't been anyone that has expressed strong interest in (i.e., lobbied for) a release.

There was a little discussion on this topic at the ApacheCon BOF. For a number of reasons, the Lucene Java trunk has been kept "pretty stable", with relatively few large changes. This makes it relatively safe for people to grab a snapshot of the trunk with less concern about latent bugs. I don't know how many people/projects are doing this rather than sticking with 2.0.

Keeping the trunk stable doesn't provide an obvious place to start working on things that people may want to work on and share, but at the same time want to allow to percolate for a while. I think the concern is that if we start doing this stuff on trunk now, people that are accustomed to snapping from the trunk might be surprised, and not in a good way. Nobody wants that.

So releases can be about both what people want (getting features out) and allowing a bit more instability in trunk. That is, if the community wants that.

Food for thought and/or discussion?

-----Original Message-----
From: George Aroush [mailto:[EMAIL PROTECTED]]
Sent: Sunday, October 15, 2006 5:15 PM
To: java-user@lucene.apache.org
Subject: RE: Lucene 2.0.1 release date

Thanks for the reply Otis.

I looked at the CHANGES.txt file and saw quite a bit of changes. For my port from Java to C#, I can't rely on the trunk code, as it (to my knowledge) changes on a monthly basis if not weekly. What I need is an official release so that I can use it as the port point.

Regards,

-- George Aroush

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Sunday, October 15, 2006 12:41 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene 2.0.1 release date

I'd have to check CHANGES.txt, but I don't think that many bugs have been fixed and not that many new features added that anyone is itching for a new release.

Otis

----- Original Message -----
From: George Aroush <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org; java-user@lucene.apache.org
Sent: Saturday, October 14, 2006 10:32:47 AM
Subject: RE: Lucene 2.0.1 release date

Hi folks,

Sorry for reposting this question (see original email below), this time to both mailing lists. If anyone can tell me what the plan is for the Lucene 2.0.1 release, I would appreciate it very much. As some of you may know, I am the porter of Lucene to Lucene.Net; knowing when 2.0.1 will be released will help me plan things out.

Regards,

-- George Aroush

-----Original Message-----
From: George Aroush [mailto:[EMAIL PROTECTED]]
Sent: Thursday, October 12, 2006 12:07 AM
To: java-dev@lucene.apache.org
Subject: Lucene 2.0.1 release date

Hi folks,

What's the plan for the Lucene 2.0.1 release date?

Thanks!

-- George Aroush
Scalability Questions
Hello All,

Lucene looks very interesting to me. I was wondering if any of you could comment on a few questions:

1) Assuming I use a typical server such as a dual-core dual-processor Dell 2950, about how many files can Lucene index and still have a sub-two-second search speed for a simple search string such as "invoice 2005 mitsubishi"? For the sake of argument, I figure that a typical file will have about 30KB of text in it.

2) How many of these servers would it take to manage an index of one billion such files?

3) Are there any HOWTOs on constructing a large Lucene search cluster?

4) Roughly how large is the index file in comparison to the size of the input files?

5) How does Lucene's search performance/scalability compare to some of the expensive commercial search products such as Fast? (www.fastsearch.com)

Thank you all for any comments or suggestions!

Guerre
DateTools oddity....
When I run this java code:

    Long dates = new Long("1154481345000");
    Date dada = new Date(dates.longValue());
    System.out.println(dada.toString());
    System.out.println(DateTools.dateToString(dada, DateTools.Resolution.DAY));

I get this output:

    Tue Aug 01 21:15:45 EDT 2006
    20060802

Huh?! Should it be:

    20060801

??

Any ideas?
Re: DateTools oddity....
DateTools uses GMT as its timezone.

Tue Aug 01 21:15:45 EDT 2006 is Wed Aug 02 01:15:45 GMT 2006.

Michael J. Prichard wrote:
> When I run this java code:
>
>     Long dates = new Long("1154481345000");
>     Date dada = new Date(dates.longValue());
>     System.out.println(dada.toString());
>     System.out.println(DateTools.dateToString(dada, DateTools.Resolution.DAY));
>
> I get this output:
>
>     Tue Aug 01 21:15:45 EDT 2006
>     20060802
>
> Huh?! Should it be:
>     20060801
>
> ??
>
> Any ideas?
Re: DateTools oddity....
Michael J. Prichard wrote:
> I get this output:
>
>     Tue Aug 01 21:15:45 EDT 2006

That's August 2, 2006 at 01:15:45 GMT.

>     20060802
>
> Huh?! Should it be:
>     20060801

DateTools uses GMT.

Doug
Re: DateTools oddity....
Dang it :) Any way to set the timezone?

Emmanuel Bernard wrote:
> DateTools uses GMT as its timezone.
>
> Tue Aug 01 21:15:45 EDT 2006 is Wed Aug 02 01:15:45 GMT 2006.
>
> Michael J. Prichard wrote:
>> When I run this java code:
>>
>>     Long dates = new Long("1154481345000");
>>     Date dada = new Date(dates.longValue());
>>     System.out.println(dada.toString());
>>     System.out.println(DateTools.dateToString(dada, DateTools.Resolution.DAY));
>>
>> I get this output:
>>
>>     Tue Aug 01 21:15:45 EDT 2006
>>     20060802
>>
>> Huh?! Should it be:
>>     20060801
>>
>> ??
>>
>> Any ideas?
Re: DateTools oddity....
No, but using a constant timezone is a good thing anyway, since the index will not keep track of that info, and will not really care as long as you always use DateTools (at index and search time). You can always rewrite DateTools with your own timezone, but EDT is a bad choice since it is vulnerable to the daylight saving mess.

Michael J. Prichard wrote:
> Dang it :) Any way to set the timezone?
>
> Emmanuel Bernard wrote:
>> DateTools uses GMT as its timezone.
>>
>> Tue Aug 01 21:15:45 EDT 2006 is Wed Aug 02 01:15:45 GMT 2006.
>>
>> Michael J. Prichard wrote:
>>> When I run this java code:
>>>
>>>     Long dates = new Long("1154481345000");
>>>     Date dada = new Date(dates.longValue());
>>>     System.out.println(dada.toString());
>>>     System.out.println(DateTools.dateToString(dada, DateTools.Resolution.DAY));
>>>
>>> I get this output:
>>>
>>>     Tue Aug 01 21:15:45 EDT 2006
>>>     20060802
>>>
>>> Huh?! Should it be:
>>>     20060801
>>>
>>> ??
>>>
>>> Any ideas?
RE: DateTools oddity....
DITTO!!!

I like date truncation, but when I store a truncated date, I don't want to retrieve the time in Greenwich, England at midnight of the date I'm truncating in the local machine's time zone. Nothing against the Brits, it just doesn't do me any good to know what time it was over there on the day in question. What I want back is midnight of the correct day in the time zone of the local machine. In other words, when I specify DAY resolution, I'm saying TIME ZONE IS IRRELEVANT, ALWAYS GIVE ME THE CORRECT DATE IN THE LOCAL TIME ZONE.

Understanding the need for backwards compatibility, I vote that there ought to be some kind of parameter I can set when converting Date-to-String and String-to-Date to force TRUNCATION of a date (as in Oracle, for example) so that it will return the correct date in local time when retrieved. Without the ability to force symmetrical storage and retrieval, I think any DateTools resolution to time units greater than hours has no practical value and only serves to mislead people.

-----Original Message-----
From: Michael J. Prichard [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 18, 2006 2:39 PM
To: java-user@lucene.apache.org
Subject: Re: DateTools oddity

Dang it :) Any way to set the timezone?

Emmanuel Bernard wrote:
> DateTools uses GMT as its timezone.
>
> Tue Aug 01 21:15:45 EDT 2006 is Wed Aug 02 01:15:45 GMT 2006.
>
> Michael J. Prichard wrote:
>> When I run this java code:
>>
>>     Long dates = new Long("1154481345000");
>>     Date dada = new Date(dates.longValue());
>>     System.out.println(dada.toString());
>>     System.out.println(DateTools.dateToString(dada, DateTools.Resolution.DAY));
>>
>> I get this output:
>>
>>     Tue Aug 01 21:15:45 EDT 2006
>>     20060802
>>
>> Huh?! Should it be:
>>     20060801
>>
>> ??
>>
>> Any ideas?
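For what it's worth, a small self-contained illustration of the "roll your own" route mentioned earlier in the thread: format dates with a formatter pinned to an explicit timezone instead of DateTools, and use the same formatter at index and search time so the values line up. This is plain java.text/java.util, nothing Lucene-specific.

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    // DAY-resolution date strings in an explicit timezone. GMT avoids
    // daylight-saving ambiguities; a local zone gives the "local calendar day"
    // behaviour asked for above.
    public class DayResolution {
        private static String toDayString(Date date, TimeZone tz) {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
            fmt.setTimeZone(tz);
            return fmt.format(date);
        }

        public static void main(String[] args) {
            Date d = new Date(1154481345000L);
            System.out.println(toDayString(d, TimeZone.getTimeZone("GMT")));               // 20060802
            System.out.println(toDayString(d, TimeZone.getTimeZone("America/New_York"))); // 20060801
        }
    }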
termpositions at index time...
Here's my problem: We're indexing books. I need to
a> return books ordered by relevancy
b> for any single book, return the number of hits in each chapter (which, of course, may be many pages).

1> If I index each page as a document, creating the relevance on a book basis is interesting, but collecting page hits per book is easy.
2> If I index each book as a document, returning the books by relevance is easy, but aggregating hits per chapter is interesting.

No, creating two indexes is not an option at present, although that would be the least work for me.

I can make <2> work if, for a particular field, I can determine what the last term position on each page is *at index time*. Oh, and we don't want searches to span pages. Pages are added to the doc with multiple calls like so:

doc.add("field", first page text);
doc.add("field", second page text);
doc.add("field", third page text);

The only approach I've really managed to come up with so far is to make my own Analyzer that has the following characteristics...

1> Override getPositionIncrementGap for this field and return, say, 100. This should keep us from spanning pages, and provide a convenient trigger for me to know we've finished (or are starting to) index a new page.
2> Record the last token position and provide a mechanism for me to retrieve that number.

I can then keep a record in this document of what offset each page starts at, and then accomplish my aggregation by storing, with the document, the term positions of the start (or end) of each page. Note, I'm rolling my own counter for where terms hit. It'll be a degenerate case of only ANDing things together, so it should be pretty simple even in the wildcard case. I'm using the Srnd* classes to do my spans, since they may include wildcards, and I don't see a way to get a Spans object from that, but it's late in the day.

Last time I appealed to y'all, you wrote back that it was already done. My hope is that it's already done again, but I've spent a couple of hours looking and it isn't obvious to me. What I want is a way to do something like this:

doc.add("field", first page text);
int pos = XXX.getLastTermPosition("field");
doc.add("field", second page text);
pos = XXX.getLastTermPosition("field");
doc.add("field", third page text);
pos = XXX.getLastTermPosition("field");

But if I understand what's happening, the text doesn't get analyzed until the doc is added to the index; all the doc.add(field, value) calls are just set-up work without any position information really being available yet. I'd be happy to be wrong about that.

Thanks
Erick
Re: termpositions at index time...
Erick Erickson wrote:
> Here's my problem: We're indexing books. I need to
> a> return books ordered by relevancy
> b> for any single book, return the number of hits in each chapter (which,
> of course, may be many pages).
>
> 1> If I index each page as a document, creating the relevance on a book
> basis is interesting, but collecting page hits per book is easy.
> 2> If I index each book as a document, returning the books by relevance is
> easy, but aggregating hits per chapter is interesting.
>
> No, creating two indexes is not an option at present, although that would
> be the least work for me.

Could you elaborate on why this approach isn't an option?

--MDC
Re: termpositions at index time...
Arbitrary restrictions by IT on the space the indexes can take up.

Actually, I won't categorically say I *can't* make this happen, but in order to use this option, I need to be able to present a convincing case. And I can't do that until I've exhausted my options/creativity. And this way it keeps folks on the list from suggesting it when I've already thought of it.

Erick

On 10/18/06, Michael D. Curtin <[EMAIL PROTECTED]> wrote:

Erick Erickson wrote:
> Here's my problem:
>
> We're indexing books. I need to
> a> return books ordered by relevancy
> b> for any single book, return the number of hits in each chapter (which, of
> course, may be many pages).
>
> 1> If I index each page as a document, creating the relevance on a book basis
> is interesting, but collecting page hits per book is easy.
> 2> If I index each book as a document, returning the books by relevance is
> easy, but aggregating hits per chapter is interesting.
>
> No, creating two indexes is not an option at present, although that would be
> the least work for me.

Could you elaborate on why this approach isn't an option?

--MDC
Re: termpositions at index time...
Erick Erickson wrote:
> Arbitrary restrictions by IT on the space the indexes can take up.
>
> Actually, I won't categorically say I *can't* make this happen, but in order
> to use this option, I need to be able to present a convincing case. And I
> can't do that until I've exhausted my options/creativity.

Disk space is a LOT cheaper than engineering time. Any manager worth his/her salt should be able to evaluate that tradeoff in a millisecond, and any IT professional unable to do so should be reprimanded. Maybe your boss can fix it. If not, yours is probably not the only such situation in the world ...

If you can retrieve the pre-index content at search time, maybe this would work:

1. Create the "real" index, in the form that lets you get the top N books by relevance, on IT's disks.
2. Create a temporary index on those books, in the form that gives you the chapter counts, in RAM; search it, then discard it.

If N is sufficiently small, #2 could be pretty darn fast.

If that wouldn't work, here's another idea. I'm not clear on how your solution with getLastTermPosition() would work, but how about just counting words in the pages as you document.add() them (instead of relying on getLastTermPosition())? It would mean two passes of parsing, but you wouldn't have to modify any Lucene code ...

--MDC
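A rough sketch of suggestion #2 above, the throwaway in-RAM index built per request. Book, Page, the field names and the query variable are hypothetical stand-ins; only the Lucene calls are real.

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.RAMDirectory;

    public class ChapterCounter {
        // Hypothetical stand-ins for the application's own model classes.
        public interface Book { List pages(); }
        public interface Page { String chapterId(); String text(); }

        // Index one book's pages into a temporary RAMDirectory, run the user's
        // query against it, and tally hits per chapter. Nothing is written to disk.
        public static Map chapterCounts(Book book, Query query) throws Exception {
            RAMDirectory tmp = new RAMDirectory();
            IndexWriter writer = new IndexWriter(tmp, new StandardAnalyzer(), true);
            for (Iterator it = book.pages().iterator(); it.hasNext();) {
                Page page = (Page) it.next();
                Document doc = new Document();
                doc.add(new Field("chapter", page.chapterId(), Field.Store.YES, Field.Index.UN_TOKENIZED));
                doc.add(new Field("text", page.text(), Field.Store.NO, Field.Index.TOKENIZED));
                writer.addDocument(doc);
            }
            writer.close();

            IndexSearcher searcher = new IndexSearcher(tmp);
            Hits hits = searcher.search(query);
            Map counts = new HashMap();   // chapter id -> Integer
            for (int i = 0; i < hits.length(); i++) {
                String chapter = hits.doc(i).get("chapter");
                Integer n = (Integer) counts.get(chapter);
                counts.put(chapter, new Integer(n == null ? 1 : n.intValue() + 1));
            }
            searcher.close();
            return counts;
        }
    }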
question regarding usage of IndexWriter.setMaxFieldLength()
Hello-

I was wondering about the usage of IndexWriter.setMaxFieldLength(). It is limited, by default, to 10k terms per field. Can anyone tell me if this is a "per field" limit or a "per uniquely named field" limit? I.e., in the following snippet I add many words to different Fields all w/ the same name. Will all words be indexed w/ no problem, allowing me to conduct a search across the "text" field for any word occurring in any of these long strings?

String longString1 = <~9k words in string>;
String longString2 = <~9k words in string>;
String longString3 = <~9k words in string>;

Document doc = new Document();
doc.add(new Field("text", longString1, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("text", longString2, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("text", longString3, Field.Store.YES, Field.Index.UN_TOKENIZED));

thanks.
-david
Re: near duplicates
On 10/18/06, Isabel Drost <[EMAIL PROTECTED]> wrote:

> Find Me wrote:
> > How to eliminate near duplicates from the index? Someone suggested that I
> > could look at the TermVectors and do a comparison to remove the duplicates.
>
> As an alternative you could also have a look at the paper "Detecting
> Phrase-Level Duplication on the World Wide Web" by Dennis Fetterly, Mark
> Manasse, Marc Najork.

Another good reference would be Soumen Chakrabarti's reference book, "Mining the Web - Discovering Knowledge from Hypertext Data", 2003, and the section on shingling and the elimination of near duplicates. Of course, I think this works at the document level rather than at the term vector level, but it might be useful to prevent duplicate documents from being indexed altogether.

> > One major problem with this is the structure of the document is
> > no longer important. Are there any obvious pitfalls? For example: Document
> > A being a subset of Document B but in no particular order.
>
> I think this case is pretty unlikely. But I am not sure whether you can
> detect near duplicates by only comparing term-document vectors. There might
> be problems with documents with slightly changed words, words that were
> replaced with synonyms... However, if you want to keep some information on
> the word order, you might consider comparing n-gram document vectors. That
> is, each dimension in the vector does not only represent one word but a
> sequence of 2, 3, 4, 5... words.

Would this involve something like a window of 2-5 words around a particular term in a document?

> Cheers,
> Isabel
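To make the n-gram idea concrete, here is a small word-level shingling sketch: slide a window of k consecutive words over the whole text (not around a particular term) and count each shingle. The whitespace tokenization and the choice of k are just for illustration; two documents can then be compared by the overlap of their shingle maps, e.g. a Jaccard score over the key sets.

    import java.util.HashMap;
    import java.util.Map;

    public class Shingler {
        // Word-level n-gram ("shingle") frequencies: each key is a run of k
        // consecutive words, so some word-order information is preserved,
        // unlike a plain term histogram.
        public static Map shingles(String text, int k) {
            String[] words = text.split("\\s+");
            Map counts = new HashMap();
            for (int i = 0; i + k <= words.length; i++) {
                StringBuffer sb = new StringBuffer();
                for (int j = 0; j < k; j++) {
                    if (j > 0) sb.append(' ');
                    sb.append(words[i + j]);
                }
                String shingle = sb.toString();
                Integer n = (Integer) counts.get(shingle);
                counts.put(shingle, new Integer(n == null ? 1 : n.intValue() + 1));
            }
            return counts;
        }
    }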
Re: question regarding usage of IndexWriter.setMaxFieldLength()
I had a similar question a while ago, and the answer is "you can't cheat". According to what the guys said, this:

doc.add("field", ...)
doc.add("field", ...)
doc.add("field", ...)

is just the same as this:

doc.add("field", ...)

But go ahead and increase the maxFieldLength. I'm successfully indexing (unstored) a 7,500 page book with all the text as a single field. I think I set maxFieldLength to something like 10,000,000. I had to bump the max memory in the JVM to do it, but it worked.

Erick

On 10/18/06, d rj <[EMAIL PROTECTED]> wrote:

Hello-

I was wondering about the usage of IndexWriter.setMaxFieldLength(). It is limited, by default, to 10k terms per field. Can anyone tell me if this is a "per field" limit or a "per uniquely named field" limit? I.e., in the following snippet I add many words to different Fields all w/ the same name. Will all words be indexed w/ no problem, allowing me to conduct a search across the "text" field for any word occurring in any of these long strings?

String longString1 = <~9k words in string>;
String longString2 = <~9k words in string>;
String longString3 = <~9k words in string>;

Document doc = new Document();
doc.add(new Field("text", longString1, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("text", longString2, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("text", longString3, Field.Store.YES, Field.Index.UN_TOKENIZED));

thanks.
-david
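In code, raising the limit looks roughly like this. The 10,000,000 figure is only the number mentioned above, and the index path is made up.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class BigFieldIndexing {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
            // The default is 10,000 terms per field; anything beyond that is
            // silently dropped, so raise the limit before adding large documents.
            writer.setMaxFieldLength(10000000);
            // ... writer.addDocument(...) calls go here ...
            writer.close();
        }
    }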
Re: termpositions at index time...
I tried the notion of a temporary RAMDirectory already, and the documents parse unacceptably slowly: 8-10 seconds. Great minds think alike. Believe it or not, I have to deal with a 7,500 page book that details Civil War records of Michigan volunteers. The XML form is 24M, probably 16M of text exclusive of tags.

About your second suggestion, I'm trying to figure out how to do essentially that. But a word count isn't very straightforward with stop words and dirty ASCII (OCR) data. I'm trying to hook that process into the tokenizer so the counts have a better chance of being accurate, which is the essence of the scheme. I'd far rather get the term offset from the same place the indexer will than try to do a similar-but-not-quite-identical algorithm that fails miserably on, say, the 3,000th and subsequent pages... I'm sure you've been somewhere similar.

OK, you've just caused me to think a bit, for which I thank you. I think it's actually pretty simple. Just instantiate a class that is a thin wrapper around the Lucene analyzer and that implements the tokenStream (or whatever) interface by calling a contained analyzer (has-a). Return the token and do any recording I want to, and provide any additional data to my process as necessary. I'll have to look at that in the morning.

All in all, I'm probably going to make your exact argument about disk space being way cheaper than engineering time. That said, exploring this serves two purposes: first, it lets me back my recommendation with data. Second, and longer term, we're using Lucene on more and more products, and exploring the nooks and crannies involved in exotic schemes vastly increases my ability to quickly triage ways of doing things. The *other* thing my boss is good at is being OK with a reasonable amount of time "wasted" in order to increase my toolkit. So it isn't as frustrating as it might have appeared from my rather off-hand blaming of IT.

Thanks for the suggestions,
Erick

On 10/18/06, Michael D. Curtin <[EMAIL PROTECTED]> wrote:

Erick Erickson wrote:
> Arbitrary restrictions by IT on the space the indexes can take up.
>
> Actually, I won't categorically say I *can't* make this happen, but in order
> to use this option, I need to be able to present a convincing case. And I
> can't do that until I've exhausted my options/creativity.

Disk space is a LOT cheaper than engineering time. Any manager worth his/her salt should be able to evaluate that tradeoff in a millisecond, and any IT professional unable to do so should be reprimanded. Maybe your boss can fix it. If not, yours is probably not the only such situation in the world ...

If you can retrieve the pre-index content at search time, maybe this would work:

1. Create the "real" index, in the form that lets you get the top N books by relevance, on IT's disks.
2. Create a temporary index on those books, in the form that gives you the chapter counts, in RAM; search it, then discard it.

If N is sufficiently small, #2 could be pretty darn fast.

If that wouldn't work, here's another idea. I'm not clear on how your solution with getLastTermPosition() would work, but how about just counting words in the pages as you document.add() them (instead of relying on getLastTermPosition())? It would mean two passes of parsing, but you wouldn't have to modify any Lucene code ...

--MDC
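A sketch of that has-a wrapper idea, with made-up names. One caveat, as noted earlier in the thread: analysis only happens when writer.addDocument() runs, so the recorded counts can only be read after the document has been added, not after each doc.add(); the wrapper therefore records one token count per page and must be reset between documents. The gap of 100 mirrors the getPositionIncrementGap idea from the earlier message.

    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Thin wrapper around a contained analyzer (has-a). It inserts a position
    // gap between successive values of the same field (so spans cannot cross
    // pages) and records how many positions each page consumed.
    public class PageOffsetRecordingAnalyzer extends Analyzer {
        private final Analyzer delegate;
        private final Map tokenCounts = new HashMap();   // field name -> List of Integer

        public PageOffsetRecordingAnalyzer(Analyzer delegate) {
            this.delegate = delegate;
        }

        public int getPositionIncrementGap(String fieldName) {
            return 100;   // keep pages from looking adjacent to phrase/span queries
        }

        /** Call before each writer.addDocument(doc). */
        public void reset() {
            tokenCounts.clear();
        }

        /** After writer.addDocument(doc): positions used by each page of the field, in order. */
        public List getPageTokenCounts(String fieldName) {
            List counts = (List) tokenCounts.get(fieldName);
            return counts == null ? Collections.EMPTY_LIST : counts;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            final List counts = pageList(fieldName);
            final int page = counts.size();
            counts.add(new Integer(0));
            return new TokenFilter(delegate.tokenStream(fieldName, reader)) {
                private int positions = 0;

                public Token next() throws IOException {
                    Token token = input.next();
                    if (token != null) {
                        positions += token.getPositionIncrement();
                        counts.set(page, new Integer(positions));
                    }
                    return token;
                }
            };
        }

        private List pageList(String fieldName) {
            List counts = (List) tokenCounts.get(fieldName);
            if (counts == null) {
                counts = new ArrayList();
                tokenCounts.put(fieldName, counts);
            }
            return counts;
        }
    }

Page start offsets within the field are then the running sum of (page count + gap) over the preceding pages.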
Re: index architectures
Some excellent feedback guys - thanks heaps.

On my OOM issue, I think Hoss has nailed it here:

> That said: if you are seeing OOM errors when you sort by a field (but
> not when you use the docId ordering, or sort by score) then it sounds
> like you are keeping references to IndexReaders around after you've
> stopped using them -- the FieldCache is kept in a WeakHashMap keyed off of
> the IndexReader, so it should get garbage collected as soon as you let go
> of it. Another possibility is that you are sorting on too many fields
> for it to be able to build the FieldCache for all of them in the RAM you
> have available.

I'm using a piece of code written by Peter Halacsy which implements a SearcherCache class. When we do a search we request a searcher, and this class looks after giving us one.

It checks whether the index has been updated since the most recent Searcher was created. If so, it creates a new one. At the same time it 'retires' outdated Searchers, once they have no queries busy with them.

Looking at that code, if the system gets busy indexing new stuff and doing complex searches, this is all rather open-ended as to the potential number of fresh Searchers being created, each with the overhead of building its FieldCache for the first time. No wonder I'm having problems as the archive has grown! Looking at it in this light, my OOMs all seem to come just after a bout of articles have been indexed, while querying is being done simultaneously, so it does fit.

I guess a solution is probably to cap this process with a maximum number of active Searchers, meaning potentially some queries might be fobbed off with slightly out-of-date versions of the index, in extremis, but it would right itself once everything settles down again.

Obviously the index partitioning would probably make this a non-issue, but it seems better to sort the basic problem out anyway, and make it 100% stable.

Thanks Hoss!

Cheers,
Paul.
Re: index architectures
Not sure if this is the case, but you said "searchers", so this might be it - you can (and should) reuse searchers for multiple/concurrent queries. IndexSearcher is thread-safe, so there is no need to have a different searcher for each query. Keep using the same searcher until you decide to open a new one - actually, until you have 'warmed up' the new (updated) searcher; then switch to using the new searcher and close the old one.

- Doron

Paul Waite <[EMAIL PROTECTED]> wrote on 18/10/2006 18:22:30:

> Some excellent feedback guys - thanks heaps.
>
> On my OOM issue, I think Hoss has nailed it here:
>
> > That said: if you are seeing OOM errors when you sort by a field (but
> > not when you use the docId ordering, or sort by score) then it sounds
> > like you are keeping references to IndexReaders around after you've
> > stopped using them -- the FieldCache is kept in a WeakHashMap keyed off of
> > the IndexReader, so it should get garbage collected as soon as you let go
> > of it. Another possibility is that you are sorting on too many fields
> > for it to be able to build the FieldCache for all of them in the RAM you
> > have available.
>
> I'm using a piece of code written by Peter Halacsy which implements a
> SearcherCache class. When we do a search we request a searcher, and this
> class looks after giving us one.
>
> It checks whether the index has been updated since the most recent Searcher
> was created. If so, it creates a new one. At the same time it 'retires'
> outdated Searchers, once they have no queries busy with them.
>
> Looking at that code, if the system gets busy indexing new stuff and doing
> complex searches, this is all rather open-ended as to the potential number
> of fresh Searchers being created, each with the overhead of building its
> FieldCache for the first time. No wonder I'm having problems as the archive
> has grown! Looking at it in this light, my OOMs all seem to come just
> after a bout of articles have been indexed, while querying is being done
> simultaneously, so it does fit.
>
> I guess a solution is probably to cap this process with a maximum number
> of active Searchers, meaning potentially some queries might be fobbed off
> with slightly out-of-date versions of the index, in extremis, but it would
> right itself once everything settles down again.
>
> Obviously the index partitioning would probably make this a non-issue, but
> it seems better to sort the basic problem out anyway, and make it 100%
> stable.
>
> Thanks Hoss!
>
> Cheers,
> Paul.
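A stripped-down sketch of that reuse-and-swap pattern. It is deliberately simplified: a real version needs reference counting, which is what the SearcherCache class already does, before the old searcher can be closed safely, since in-flight queries may still be using it. The warming query and sort are whatever the application typically runs.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Sort;

    // One shared searcher for all threads. Reopen only when the index version
    // changes, warm the new searcher with a sorted query so its FieldCache is
    // built up front, then swap and close the old one.
    public class SearcherHolder {
        private final String indexDir;
        private IndexSearcher current;

        public SearcherHolder(String indexDir) throws IOException {
            this.indexDir = indexDir;
            this.current = new IndexSearcher(indexDir);
        }

        public synchronized IndexSearcher get() {
            return current;
        }

        /** Call periodically from a maintenance thread, not once per query. */
        public synchronized void maybeReopen(Query warmupQuery, Sort sort) throws IOException {
            long onDisk = IndexReader.getCurrentVersion(indexDir);
            if (onDisk == current.getIndexReader().getVersion()) {
                return;   // nothing new on disk
            }
            IndexSearcher fresh = new IndexSearcher(indexDir);
            Hits warm = fresh.search(warmupQuery, sort);   // forces the FieldCache to load
            warm.length();                                 // touch the result

            IndexSearcher old = current;
            current = fresh;
            old.close();   // simplified: only safe once queries running on `old` have finished
        }
    }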
Re: constructing smaller phrase queries given a multi-word query
Resending, with the hope that the search gurus missed this. Would really appreciate any advice on this. I would not want to reinvent the wheel, and I am sure this is something that has been done before.

Thanks,
mek

On 10/16/06, Mek <[EMAIL PROTECTED]> wrote:

Has anyone dealt with the problem of constructing sub-queries given a multi-word query? Here is an example to illustrate what I mean: the user queries for -> A B C D

Right now I change that query to:

"A B C D" A B C D

to give phrase matches higher weightage. What might happen, though, is that the user is looking for a document where "A B" is in Field1 and "C D" is in Field2. So I should ideally be constructing the query as:

"A B C D"^20 "A B"^10 "C D"^10 "B C D"^15 "A B C"^15 A B C D

Has someone solved this problem? Are there other ways to handle it?

Thanks,
mek.

--
http://mekin.livejournal.com/
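One way to generate those boosted sub-phrases mechanically - a sketch, not a recommendation. It emits every contiguous window of 2..n words as a PhraseQuery, boosted by length, plus the single terms. Note that it also produces windows like "B C" that the hand-written example above skips, and the boost formula is arbitrary.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.TermQuery;

    public class SubPhraseExpander {
        // "A B C D" -> OR of every contiguous sub-phrase of length 2..n
        // (boosted by length) plus the individual terms.
        public static BooleanQuery expand(String field, String[] words) {
            BooleanQuery q = new BooleanQuery();
            for (int len = words.length; len >= 2; len--) {
                for (int start = 0; start + len <= words.length; start++) {
                    PhraseQuery phrase = new PhraseQuery();
                    for (int i = start; i < start + len; i++) {
                        phrase.add(new Term(field, words[i]));
                    }
                    phrase.setBoost(5.0f * len);   // arbitrary: longer phrase, bigger boost
                    q.add(phrase, BooleanClause.Occur.SHOULD);
                }
            }
            for (int i = 0; i < words.length; i++) {
                q.add(new TermQuery(new Term(field, words[i])), BooleanClause.Occur.SHOULD);
            }
            return q;
        }
    }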
NativeFSLockFactory problem
Hi all,

I'm trying to use the new class NativeFSLockFactory, but as you can guess I have a problem using it. I don't know what I'm doing wrong, so here is the code:

FSDirectory dir = FSDirectory.getDirectory(indexDir, create, NativeFSLockFactory.getLockFactory());
logger.info("Index: " + indexDir.getAbsolutePath() + " Lock file: " + dir.getLockID());
this.writer = new IndexWriter(dir, new StandardAnalyzer(), create);

Just to explain: indexDir is a File, create is set to false. The 2nd line is there to see what is going on.

My problem is that there are many indices; for testing purposes just 4 of them. The first one is started and working like it should, but from the 2nd on I get those "Lock obtain timed out" exceptions. This is the log output:

08:38:05,199 INFO [IndexerManager] No indexer found for directory D:\[mydir]\index1 - starting new one!
08:38:05,199 INFO [Indexer] Index: D:\[mydir]\index1 Lock file: lucene-0ca7838f9396a636d1feda5aabb9b8db
08:38:05,215 INFO [IndexerManager] New amount of Indexers: 1
08:38:05,215 INFO [IndexerManager] No indexer found for directory D:\[mydir]\index2 - starting new one!
08:38:05,215 INFO [Indexer] Index: D:\[mydir]\index2 Lock file: lucene-cc9dfaabbf7ad61c4bb3af007b88288c
08:38:06,213 ERROR [IndexerManager] Lock obtain timed out: [EMAIL PROTECTED]:\Dokumente und Einstellungen\[user]\Lokale Einstellungen\Temp\lucene-fd415060ae453638d69faa9fa07fbe95-n-write.lock
java.io.IOException: Lock obtain timed out: [EMAIL PROTECTED]:\Dokumente und Einstellungen\[user]\Lokale Einstellungen\Temp\lucene-fd415060ae453638d69faa9fa07fbe95-n-write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:68)
    at org.apache.lucene.index.IndexWriter.(IndexWriter.java:257)
    at org.apache.lucene.index.IndexWriter.(IndexWriter.java:247)
    at de.innosystec.iar.indexing.Indexer.setUp(Indexer.java:101)
    at de.innosystec.iar.indexing.Indexer.(Indexer.java:80)
    at de.innosystec.iar.indexing.IndexerManager.addDocumentElement(IndexerManager.java:228)
    at de.innosystec.iar.parsing.ParserManager.indexDocumentElement(ParserManager.java:286)
    at de.innosystec.iar.parsing.ParserThread.startWorking(ParserThread.java:378)
    at de.innosystec.iar.parsing.ParserThread.run(ParserThread.java:175)
    at java.lang.Thread.run(Unknown Source)

The lock file mentioned in the exception is really created and used by the first index. It seems like the FSDirectory.getLockID method doesn't work for NativeFSLockFactory? I'm using Win XP on my test platform.

Regards,
Frank
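I can't verify this against the exact trunk revision in use here, so treat it purely as an assumption worth testing: if the NativeFSLockFactory(File) constructor is available in that build, giving each index its own lock factory rooted at its own directory should keep the lock files of different indices from colliding in the shared temp directory.

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.NativeFSLockFactory;

    // ASSUMPTION: one NativeFSLockFactory per index, pointed at that index's own
    // directory, instead of the shared factory in the system temp directory.
    public class PerIndexLocking {
        public static IndexWriter openWriter(File indexDir) throws Exception {
            NativeFSLockFactory lockFactory = new NativeFSLockFactory(indexDir);
            FSDirectory dir = FSDirectory.getDirectory(indexDir, false, lockFactory);
            return new IndexWriter(dir, new StandardAnalyzer(), false);
        }
    }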