Re: Search results excerpt similar to Google
I think they do a proximity result based on keyword matches. So... if you search for "lucene" and the document returned has this word at the very start and the very end of the document, then you will see the two sentences (sequences of words) surrounding the two keyword matches, one from the start of the document and one from the end. How you determine which words from the result you include in the summary is up to you.

The problem with this is that in Lucene-land you have to store the content of the document inside the index verbatim (so you can get arbitrary portions of it out). This means your index will be larger than it really needs to be. I usually just store the first 255 characters in the index and use this as a summary. It's not as good as Google, but it seems to work OK.

- Original Message -
From: "Ben" <[EMAIL PROTECTED]>
To: "Lucene"
Sent: Friday, January 28, 2005 5:08 PM
Subject: Search results excerpt similar to Google

Hi

Is it hard to implement a function that displays search result excerpts similar to Google's? Is it just string manipulation, or is there some logic behind it? I like their excerpts.

Thanks
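For illustration, a minimal sketch of the "store a short summary" approach described above, using the Lucene 1.4 field constructors (the field names and the fullText variable are made up):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    static Document makeDoc(String fullText) {
        Document doc = new Document();
        // searchable but not stored, which keeps the index small
        doc.add(Field.UnStored("contents", fullText));
        // store only the first 255 characters to display as the hit summary
        String summary = fullText.substring(0, Math.min(255, fullText.length()));
        doc.add(Field.UnIndexed("summary", summary));
        return doc;
    }

At search time you would then read the stored "summary" field from each hit instead of re-reading the original document.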
Re: google mini? who needs it when Lucene is there
Xiaohong Yang (Sharon) wrote:
> Hi, I agree that Google mini is quite expensive. It might be similar to the desktop version in quality. Does anyone know Google's ratio of index to text? Is it true that Lucene's index is about 500 times the original text size (not including image size)? I don't have one installed, so I cannot measure.
>
> Best,
>
> Sharon

500:1 for Lucene? I don't think so. In my Wikipedia search engine, the data in the MySQL DB I index from is approx 1.0 GB (the sum of the lengths of title and body), while the Lucene index of just these two fields is 250 MB. So in this case the Lucene index is 25% of the corpus size.
Search results excerpt similar to Google
Hi

Is it hard to implement a function that displays search result excerpts similar to Google's? Is it just string manipulation, or is there some logic behind it? I like their excerpts.

Thanks
Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there
Jason Polites wrote:
> I think everyone agrees that this would be a very neat application of open-source technology like Lucene... however (opens drawer, pulls out devil's advocate hat, places on head)... there are several complexities here not addressed by Lucene (et al.). Not because Lucene isn't damn fantastic, just because it's not its job. One of the big ones is security. Enterprise search is no good if it doesn't match up with the authentication and authorization paradigms existing in the organisation. How useful is it to return a whole bunch of search results for documents to which you don't have access? Not to mention the issues around whether you are even authorized to know they exist.

I was gonna mention this - you beat me to the punch. I suspect that LDAP/JNDI integration is a start, but you need hooks for an arbitrary auth plugin. And once we address this, it might be the case that a user has to *log in* to the search server. We have Verity where I work and this is all the case, along with the fact that a sale seems to involve mandatory consulting work (not that that's bad, but if you're trying to ship a shrink-wrapped search engine in a box then this is an issue).

> The other prickly one is file types. It's all well and good to index HTML, XML and text, but when you start looking at PDF, MS Office (OLE docs, PSTs, Outlook MSG files, MS Project files etc.), Lotus Notes databases etc., things begin to look less simple and far less elegant than a nice clean Lucene rackmount. Sure there are great projects like Apache POI, but they still have a bit of a way to go before they mature to the point of really solving these problems. After which time Microsoft will probably be rolling out Longhorn and everyone may need to start from scratch.

You also need http://jcifs.samba.org/ so you can spider Windows file shares.

> This is not to say that it's not a great idea, but as with most great ideas the challenge is not the formation of the idea, but its implementation.

Indeed.

> I think a great first step would be to start developing good, reliable, open-source extensions to Lucene which strive to solve some of these issues. End rant.

- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Friday, January 28, 2005 12:40 PM
Subject: Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

I discuss this with myself a lot inside my head... :) Seriously, I agree with Erik. I think this is a business opportunity. How many people are hating me now and going "shh"? Raise your hands!

Otis
Re: google mini? who needs it when Lucene is there
Overall, even if the Google mini gives a lot of cool features compared to a bare-bones Lucene project, what good is it with the 50,000-document limit? It is useless with that limit. That is just their way of trying to turn it into another cash cow.

Jian

On Thu, 27 Jan 2005 17:45:03 -0800 (PST), Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> 500 times the original data? Not true! :)
>
> Otis
Re: query term frequency
No, the number of occurrences of a term in a Query.

Jonathan

Quoting David Spencer <[EMAIL PROTECTED]>:
> Jonathan Lasko wrote:
> > What do I call to get the term frequencies for terms in the Query? I can't seem to find it in the Javadoc...
>
> Do you mean the # of docs that have a term?
>
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#docFreq(org.apache.lucene.index.Term)
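Lucene 1.4 has no direct API for counting how often a term occurs within a Query, but for the common case of a BooleanQuery built from TermQuerys you can walk the clauses yourself - a rough sketch only (other Query types, e.g. PhraseQuery, would need their own handling):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    static int countTerm(BooleanQuery query, Term wanted) {
        int count = 0;
        BooleanClause[] clauses = query.getClauses();
        for (int i = 0; i < clauses.length; i++) {
            // BooleanClause.query is a public field in Lucene 1.4
            if (clauses[i].query instanceof TermQuery
                    && ((TermQuery) clauses[i].query).getTerm().equals(wanted)) {
                count++;
            }
        }
        return count;
    }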
Re: LuceneReader.delete (term t) Failure ?
Could you work up a self-contained RAMDirectory-using example that demonstrates this issue?

Erik

On Jan 27, 2005, at 9:10 PM, <[EMAIL PROTECTED]> wrote:

Erik, I am using the keyword field:

doc.add(Field.Keyword("uid", pathRelToArea));

Anything else I can check on?

thanks
atul

PS: we worked together on the Darden project
Re: Re: LuceneReader.delete (term t) Failure ?
Erik, I am using the keyword field:

doc.add(Field.Keyword("uid", pathRelToArea));

Anything else I can check on?

thanks
atul

PS: we worked together on the Darden project

> From: Erik Hatcher <[EMAIL PROTECTED]>
> Date: 2005/01/27 Thu PM 07:46:40 EST
> To: "Lucene Users List"
> Subject: Re: LuceneReader.delete (term t) Failure ?
>
> How did you index the "uid" field? Field.Keyword? If not, that may be the problem, in that the field was analyzed. For a key field like this, it needs to be unanalyzed/untokenized.
>
> Erik
Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there
I think everyone agrees that this would be a very neat application of open-source technology like Lucene... however (opens drawer, pulls out devil's advocate hat, places on head)... there are several complexities here not addressed by Lucene (et al.). Not because Lucene isn't damn fantastic, just because it's not its job.

One of the big ones is security. Enterprise search is no good if it doesn't match up with the authentication and authorization paradigms existing in the organisation. How useful is it to return a whole bunch of search results for documents to which you don't have access? Not to mention the issues around whether you are even authorized to know they exist.

The other prickly one is file types. It's all well and good to index HTML, XML and text, but when you start looking at PDF, MS Office (OLE docs, PSTs, Outlook MSG files, MS Project files etc.), Lotus Notes databases etc., things begin to look less simple and far less elegant than a nice clean Lucene rackmount. Sure there are great projects like Apache POI, but they still have a bit of a way to go before they mature to the point of really solving these problems. After which time Microsoft will probably be rolling out Longhorn and everyone may need to start from scratch.

This is not to say that it's not a great idea, but as with most great ideas the challenge is not the formation of the idea, but its implementation. I think a great first step would be to start developing good, reliable, open-source extensions to Lucene which strive to solve some of these issues. End rant.

- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Friday, January 28, 2005 12:40 PM
Subject: Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

I discuss this with myself a lot inside my head... :) Seriously, I agree with Erik. I think this is a business opportunity. How many people are hating me now and going "shh"? Raise your hands!

Otis
Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there
As they say, nothing lasts forever ;) I like the idea. If a project like this gets going, I think I'd be interested in helping.

The Google mini looks very well done (they have two demos on the web page). For $5000, it's probably a very good solution for many businesses. If the demos are accurate, it seems like you almost literally plug it in, configure a few things using the web interface, and you're in business. Demos are at http://www.google.com/enterprise/mini/product_tours_demos.html

-chris

On Thu, 27 Jan 2005 17:40:53 -0800 (PST), Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> I discuss this with myself a lot inside my head... :) Seriously, I agree with Erik. I think this is a business opportunity. How many people are hating me now and going "shh"? Raise your hands!
>
> Otis
Re: Reloading an index
: processes ended. If you're under linux, try running the 'lsof'
: command to see if there are any handles to files marked "(deleted)".

: > Searcher, the old Searcher is closed and nulled, but I
: > still see about twice the amount of memory in use well
: > after the original searcher has been closed. Is
: > there something else I can do to get this memory
: > reclaimed? Should I explicitly call garbage
: > collection? Any ideas?

In addition to the previous advice, keep in mind that, depending on the implementation of your JVM, it may never actually "free" memory back to the OS. And even the JVMs that can will only do so after a GC which results in a ratio of unused/used memory that they deem worthy of freeing (usually based on tuning parameters).

Assuming you are using a Sun JVM, take a look at http://java.sun.com/docs/hotspot/gc1.4.2/index.html and search for MinHeapFreeRatio and MaxHeapFreeRatio.

-Hoss
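On a Sun 1.4 JVM those knobs are set on the command line, for example (the values and the MySearchApp class name here are made up; tune for your own heap and load):

    java -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -Xmx512m MySearchApp

A lower MaxHeapFreeRatio makes the JVM return free heap to the OS more eagerly after a GC, at the cost of more frequent heap resizing.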
Re: google mini? who needs it when Lucene is there
500 times the original data? Not true! :)

Otis

--- "Xiaohong Yang (Sharon)" <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I agree that Google mini is quite expensive. It might be similar to the desktop version in quality. Does anyone know Google's ratio of index to text? Is it true that Lucene's index is about 500 times the original text size (not including image size)? I don't have one installed, so I cannot measure.
>
> Best,
>
> Sharon
RE: Disk space used by optimize
Have you tried using the multifile index format? Now I wonder if there is actually a difference in the disk space consumed by optimize() between the multifile and compound index formats...

Otis

--- "Kauler, Leto S" <[EMAIL PROTECTED]> wrote:
> Our copy of LIA is "in the mail" ;)
>
> Yes, the final three files are: the .cfs (46.8MB), deletable (4 bytes), and segments (29 bytes).
>
> --Leto
Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there
I discuss this with myself a lot inside my head... :) Seriously, I agree with Erik. I think this is a business opportunity. How many people are hating me now and going "shh"? Raise your hands!

Otis

--- David Spencer <[EMAIL PROTECTED]> wrote:
> This reminds me, has anyone ever discussed something similar:
>
> - rackmount server (or, for coolness factor, that Mac mini)
> - web i/f for config/control
> - of course the server would have the following s/w:
> -- web server
> -- lucene / nutch
>
> Part of the work here I think is having a decent web i/f to configure the thing and to customize the L&F of the search results.
Re: Reloading an index
I just ran into a similar issue. When you close an IndexSearcher, it doesn't necessarily close the underlying IndexReader. It depends which constructor you used to create the IndexSearcher. See the constructors' javadocs or the source for the details.

In my case, we were updating and optimizing the index from another process and reopening IndexSearchers. We would eventually run out of disk space, because open file handles to deleted files were being left around, so the disk space was never made available until the JVM processes ended. If you're under linux, try running the 'lsof' command to see if there are any handles to files marked "(deleted)".

-Chris

On Thu, 27 Jan 2005 08:28:30 -0800 (PST), Greg Gershman <[EMAIL PROTECTED]> wrote:
> I have an index that is frequently updated. When indexing is completed, an event triggers a new Searcher to be opened. When the new Searcher is opened, incoming searches are redirected to the new Searcher, the old Searcher is closed and nulled, but I still see about twice the amount of memory in use well after the original searcher has been closed. Is there something else I can do to get this memory reclaimed? Should I explicitly call garbage collection? Any ideas?
>
> Thanks.
>
> Greg Gershman
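A sketch of the constructor difference Chris describes, per the Lucene 1.4 javadocs (the paths are made up):

    IndexSearcher a = new IndexSearcher("/path/to/index"); // opens its own IndexReader
    a.close(); // also closes the reader it opened internally

    IndexReader reader = IndexReader.open("/path/to/index");
    IndexSearcher b = new IndexSearcher(reader); // wraps the caller's reader
    b.close();      // does NOT close 'reader'
    reader.close(); // must be closed explicitly, or its files stay open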
Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there
I've often said that there is a business to be had in packaging up Lucene (and now Nutch) into a cute little box with user-friendly management software to search your intranet. SearchBlox is already there (except they don't include the box).

I really hope that an application like SearchBlox/Zilverline can be created as part of the Lucene project itself, replacing the sad demos that currently ship with Lucene. I've got so many things on my plate that I don't foresee myself getting to this as soon as I'd like, but I would most definitely support and contribute what time I could to such an effort. If the web UI used Tapestry, I'd be very inclined to dig in hardcore; any other web UI technology would likely turn me off. One of these days I'll Tapestry-ify Nutch just for grins and submit it as a replacement for the JSPs. And I'm even more sold on it if Mac minis are involved! :)

Erik

On Jan 27, 2005, at 7:16 PM, David Spencer wrote:

This reminds me, has anyone ever discussed something similar:

- rackmount server (or, for coolness factor, that Mac mini)
- web i/f for config/control
- of course the server would have the following s/w:
-- web server
-- lucene / nutch

Part of the work here I think is having a decent web i/f to configure the thing and to customize the L&F of the search results.
Re: text highlighting
Thanks for your reply. I use QueryParser instead of TermQuery, and everything works well now! Thanks.

Youngho

- Original Message -
From: "mark harwood" <[EMAIL PROTECTED]>
Sent: Thursday, January 27, 2005 7:05 PM
Subject: Re: text highlighting

> >> sometimes the returned String is empty.
> >> Is the code analyzer-dependent?
>
> When highlighter.getBestFragments returns nothing, it is because no match was found for the query terms in the TokenStream supplied. This is nearly always because of Analyzer issues. Check the post-analysis tokens produced for the query and the tokens produced in the TokenStream passed to the highlighter. The highlighter simply looks for matches in the two sources of terms and uses the token offsets to select the best sections of the supplied text.
>
> Cheers
> Mark
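For reference, a minimal sketch of that working combination - QueryParser plus the sandbox Highlighter - where the field name, the query string and the String variable 'text' are all made up, and the same analyzer is used both for parsing and for the TokenStream, as Mark advises:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    Analyzer analyzer = new StandardAnalyzer();
    Query query = QueryParser.parse("lucene", "contents", analyzer);
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    String fragment = highlighter.getBestFragment(
            analyzer.tokenStream("contents", new StringReader(text)), text);

If the tokens produced on both sides don't line up (different analyzers, stemming, etc.), getBestFragment returns null - the symptom described above.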
Re: LuceneReader.delete (term t) Failure ?
How did you index the "uid" field? Field.Keyword? If not, that may be the problem in that the field was analyzed. For a key field like this, it needs to be unanalyzed/untokenized. Erik On Jan 27, 2005, at 6:21 PM, <[EMAIL PROTECTED]> wrote: Hi, I am trying to delete a document from Lucene index using: Term aTerm = new Term( "uid", path ); aReader.delete( aTerm ); aReader.close(); If the variable path="xxx/foo.txt" then I am able to delete the document. However, if path variable has "-" in the string, the delete method does not work e.g. path="xxx-yyy/foo.txt" // Does Not work!! Can I get around this problem. I cannot subsitute minus character with '.' as it has other implications. is this a bug ? I am using Lucene 1.4-final version. Thanks for the help Atul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: google mini? who needs it when Lucene is there
I disagree. Most small companies don't have an IT staff capable of implementing a custom search engine using Lucene for less than $5,000. Nutch might make this possible, but compared to a plug-in-and-go solution like the Google mini, it still would probably cost a significant amount of money.

Getting Lucene/Nutch to the point where it is possible to easily install it on a computer and administrate its settings in a user-friendly way is a great goal, though.

Regards,
Luke Francl

From: jian chen [mailto:[EMAIL PROTECTED]]
Sent: Thu 1/27/2005 5:44 PM
To: Lucene Users List
Subject: google mini? who needs it when Lucene is there

> It seems to me that any small biz will be ripped off if they install this google mini thing, compared to using Lucene to implement an easy-to-use search software, which could search up to whatever number of documents you could imagine.
Re: google mini? who needs it when Lucene is there
I think the Google mini also includes crawling and a server wrapper, so it is not entirely a 1-to-1 comparison. Of course, extending Lucene to have those features is not at all difficult anyway.

-John

On Thu, 27 Jan 2005 16:04:54 -0800 (PST), Xiaohong Yang (Sharon) <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I agree that Google mini is quite expensive. It might be similar to the desktop version in quality. Does anyone know Google's ratio of index to text? Is it true that Lucene's index is about 500 times the original text size (not including image size)? I don't have one installed, so I cannot measure.
>
> Best,
>
> Sharon
rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there
This reminds me, has anyone ever discussed something similar:

- rackmount server (or, for coolness factor, that Mac mini)
- web i/f for config/control
- of course the server would have the following s/w:
-- web server
-- lucene / nutch

Part of the work here I think is having a decent web i/f to configure the thing and to customize the L&F of the search results.

jian chen wrote:
> Hi,
>
> I was searching using Google and just found that there was a new feature called "google mini". Initially I thought it was another free service for small companies. Then I realized that it costs quite some money ($4,995) for the hardware and software. (I guess the proprietary software costs a whole lot more than the actual hardware.)
>
> The "nice" feature is that you can only index up to 50,000 documents at this price. If you need to index more, sorry, send in the check...
>
> It seems to me that any small biz will be ripped off if they install this google mini thing, compared to using Lucene to implement an easy-to-use search software, which could search up to whatever number of documents you could imagine.
>
> I hope the Lucene project gets exposed more to the enterprise, so that people know that they have not only cheaper but, more importantly, BETTER alternatives.
>
> Jian
RE: Disk space used by optimize
Our copy of LIA is "in the mail" ;) Yes the final three files are: the .cfs (46.8MB), deletable (4 bytes), and segments (29 bytes). --Leto > -Original Message- > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] > > Hello, > > Yes, that is how optimize works - copies all existing index > segments into one unified index segment, thus optimizing it. > > see hit #1: http://www.lucenebook.com/search?query=optimize+disk+space > > However, three times the space sounds a bit too much, or I > make a mistake in the book. :) > > You said you end up with 3 files - .cfs is one of them, right? > > Otis > > > --- "Kauler, Leto S" <[EMAIL PROTECTED]> wrote: > > > > > Just a quick question: after writing an index and then calling > > optimize(), is it normal for the index to expand to about > three times > > the size before finally compressing? > > > > In our case the optimise grinds the disk, expanding the index into > > many files of about 145MB total, before compressing down to three > > files of about 47MB total. That must be a lot of disk activity for > > the people with multi-gigabyte indexes! > > > > Regards, > > Leto CONFIDENTIALITY NOTICE AND DISCLAIMER Information in this transmission is intended only for the person(s) to whom it is addressed and may contain privileged and/or confidential information. If you are not the intended recipient, any disclosure, copying or dissemination of the information is unauthorised and you should delete/destroy all copies and notify the sender. No liability is accepted for any unauthorised use of the information contained in this transmission. This disclaimer has been automatically added. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: google mini? who needs it when Lucene is there
Hi,

I agree that Google mini is quite expensive. It might be similar to the desktop version in quality. Does anyone know Google's ratio of index to text? Is it true that Lucene's index is about 500 times the original text size (not including image size)? I don't have one installed, so I cannot measure.

Best,

Sharon

jian chen <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I was searching using Google and just found that there was a new feature called "google mini". Initially I thought it was another free service for small companies. Then I realized that it costs quite some money ($4,995) for the hardware and software. (I guess the proprietary software costs a whole lot more than the actual hardware.)
>
> The "nice" feature is that you can only index up to 50,000 documents at this price. If you need to index more, sorry, send in the check...
>
> It seems to me that any small biz will be ripped off if they install this google mini thing, compared to using Lucene to implement an easy-to-use search software, which could search up to whatever number of documents you could imagine.
>
> I hope the Lucene project gets exposed more to the enterprise, so that people know that they have not only cheaper but, more importantly, BETTER alternatives.
>
> Jian
Re: Disk space used by optimize
Hello,

Yes, that is how optimize works - it copies all existing index segments into one unified index segment, thus optimizing it.

See hit #1: http://www.lucenebook.com/search?query=optimize+disk+space

However, three times the space sounds a bit too much, or I made a mistake in the book. :)

You said you end up with 3 files - .cfs is one of them, right?

Otis

--- "Kauler, Leto S" <[EMAIL PROTECTED]> wrote:
> Just a quick question: after writing an index and then calling optimize(), is it normal for the index to expand to about three times the size before finally compressing?
>
> In our case the optimise grinds the disk, expanding the index into many files of about 145MB total, before compressing down to three files of about 47MB total. That must be a lot of disk activity for the people with multi-gigabyte indexes!
>
> Regards,
> Leto
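For reference, the calls involved look like this (the path and analyzer are made up); the temporary growth happens inside optimize(), after the merged segment is written and before the old segment files are deleted:

    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    writer.setUseCompoundFile(true); // compound format: yields the single .cfs file
    writer.optimize();               // needs extra disk space while segments merge
    writer.close();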
google mini? who needs it when Lucene is there
Hi,

I was searching using Google and just found that there was a new feature called "google mini". Initially I thought it was another free service for small companies. Then I realized that it costs quite some money ($4,995) for the hardware and software. (I guess the proprietary software costs a whole lot more than the actual hardware.)

The "nice" feature is that you can only index up to 50,000 documents at this price. If you need to index more, sorry, send in the check...

It seems to me that any small biz will be ripped off if they install this google mini thing, compared to using Lucene to implement an easy-to-use search software, which could search up to whatever number of documents you could imagine.

I hope the Lucene project gets exposed more to the enterprise, so that people know that they have not only cheaper but, more importantly, BETTER alternatives.

Jian
LuceneReader.delete (term t) Failure ?
Hi,

I am trying to delete a document from a Lucene index using:

Term aTerm = new Term( "uid", path );
aReader.delete( aTerm );
aReader.close();

If the variable path="xxx/foo.txt" then I am able to delete the document. However, if the path variable has "-" in the string, the delete method does not work, e.g.:

path="xxx-yyy/foo.txt" // Does not work!!

Can I get around this problem? I cannot substitute the minus character with '.' as it has other implications. Is this a bug? I am using the Lucene 1.4-final version.

Thanks for the help
Atul
Disk space used by optimize
Just a quick question: after writing an index and then calling optimize(), is it normal for the index to expand to about three times the size before finally compressing?

In our case the optimise grinds the disk, expanding the index into many files of about 145MB total, before compressing down to three files of about 47MB total. That must be a lot of disk activity for the people with multi-gigabyte indexes!

Regards,
Leto
Re: query term frequency
Jonathan Lasko wrote:
> What do I call to get the term frequencies for terms in the Query? I can't seem to find it in the Javadoc...
>
> Thanks.
>
> Jonathan

Do you mean the # of docs that have a term?

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#docFreq(org.apache.lucene.index.Term)
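Usage is a single call (the field and term values are made up):

    IndexReader reader = IndexReader.open("/path/to/index");
    int docFreq = reader.docFreq(new Term("contents", "lucene")); // # of docs containing the term
    reader.close();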
query term frequency
What do I call to get the term frequencies for terms in the Query? I can't seem to find it in the Javadoc...

Thanks.

Jonathan
Re: Sort Performance Problems across large dataset
Peter Hollas wrote:
> Currently we can issue a simple search query and expect a response back in about 0.2 seconds (~3,000 results) with the Lucene index that we have built. Lucene gives a much more predictable and faster average query time than using standard fulltext indexing with mySQL. This however returns results in score order, and not alphabetically. To sort the result set into alphabetical order, we added the species names as a separate keyword field, and sorted using it whilst querying.
>
> This solution works fine, but is unacceptable since a query that returns thousands of results can take upwards of 30 seconds to sort them.

Are you using a Lucene Sort? If you reuse the same IndexReader (or IndexSearcher) then perhaps the first query specifying a Sort will take 30 seconds (although that's much slower than I'd expect), but subsequent searches that sort on the same field should be nearly as fast as results sorted by score.

Doug
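A sketch of what Doug is suggesting (the field name and path are made up); the point is that the searcher is opened once and kept around, so the sort cache built on the first sorted query serves all later ones:

    IndexSearcher searcher = new IndexSearcher("/path/to/index"); // open once, reuse
    Sort byName = new Sort(new SortField("species", SortField.STRING));
    Hits hits = searcher.search(query, byName); // first sorted search pays the cache cost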
Re: Opening up one large index takes 940M or memory?
Kevin A. Burton wrote:
> Is there any way to reduce this footprint? The index is fully optimized... I'm willing to take a performance hit if necessary. Is this documented anywhere?

You can increase TermInfosWriter.indexInterval. You'll need to re-write the .tii file for this to take effect. The simplest way to do this is to use IndexWriter.addIndexes(), adding your index to a new, empty directory. This will of course take a while for a 60GB index...

Doubling TermInfosWriter.indexInterval should halve the Term memory usage and double the time required to look up terms in the dictionary. With an index this large the latter is probably not an issue, since processing term frequency and proximity data probably overwhelmingly dominates search performance.

Perhaps we should make this public by adding an IndexWriter method?

Also, you can list the size of your .tii file by using the main() from CompoundFileReader.

Doug
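A sketch of the rewrite step (paths and analyzer are made up; since TermInfosWriter.indexInterval is not public in 1.4, as Doug notes, raising it means changing the source and rebuilding first):

    // with the modified Lucene on the classpath, rewrite into a fresh directory
    IndexWriter writer = new IndexWriter("/new/index", new StandardAnalyzer(), true);
    writer.addIndexes(new Directory[] {
        FSDirectory.getDirectory("/old/index", false)
    });
    writer.close(); // addIndexes() optimizes, so the new .tii spacing takes effect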
Re: XML index
Hello Karl,

Grab the source code for Lucene in Action - it's got code that parses and indexes XML with DOM and SAX. You can see the coverage of that stuff here: http://lucenebook.com/search?query=indexing+XML+section%3A7*

I haven't used kXML, but I imagine the LIA code should get you going quickly, and you are free to adapt it to work with kXML.

Otis

--- Karl Koch <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I want to use kXML with Lucene to index XML files. I think it is possible to dynamically assign node names as Document fields and node texts as Text (after using an Analyser).
>
> I have seen some XML indexing in the Sandbox. Is anybody here who has done something with a thin pull parser (perhaps even kXML)? Does anybody know of a project or some source code available which covers this topic?
>
> Karl
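For flavor, here is the naive element-to-field mapping Karl describes, sketched as a SAX handler (the class name is made up, and nested elements are ignored; porting the same event loop to kXML's pull API should be mostly mechanical):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class XmlDocHandler extends DefaultHandler {
        private final Document doc = new Document();
        private final StringBuffer text = new StringBuffer();
        private String element;

        public void startElement(String uri, String local, String qName, Attributes atts) {
            element = qName;   // the element name becomes the field name
            text.setLength(0);
        }

        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        public void endElement(String uri, String local, String qName) {
            if (element != null && text.length() > 0) {
                doc.add(Field.Text(element, text.toString())); // analyzed and stored
            }
            element = null;
        }

        public Document getDocument() { return doc; }
    }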
RE: Index Layout Question
That's good to know. I'm indexing on 11 fields (9 keyword, 2 text). The documents themselves are between 1K and 2K in size. Is there a point at which IndexSearcher performance begins to fall off (in terms of the number of index records)?

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS 66219
(913) 577-1496
[EMAIL PROTECTED]

-----Original Message-----
From: Ian Soboroff [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 27, 2005 10:31 AM
To: Lucene Users List
Subject: Re: Index Layout Question

> Depends on your search infrastructure. Doug Cutting has sent out some basic optimization guidelines on this list which should be in the archives... simply, you need to think about how many CPUs and spindles are involved. 1.5m documents isn't a challenge for Lucene to index or search on a single machine with a monolithic index. I indexed about 1.6m web pages in 22 hours on a single machine with all data local, and search with a single IndexSearcher was instantaneous. We've also done some testing with a larger collection (25m pages) and ParallelMultiSearchers on several machines, and likewise on a fast network haven't felt a slowdown, but we haven't actually benchmarked it.
>
> Ian
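For what it's worth, the per-month layout is searched like this (the index paths are made up); ParallelMultiSearcher takes the same array and queries the sub-indexes in separate threads:

    Searchable[] months = new Searchable[] {
        new IndexSearcher("/indexes/2004-12"),
        new IndexSearcher("/indexes/2005-01"),
    };
    MultiSearcher searcher = new MultiSearcher(months);
    Hits hits = searcher.search(query);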
XML index
Hi,

I want to use kXML with Lucene to index XML files. I think it is possible to dynamically assign node names as Document fields and node texts as Text (after using an Analyser).

I have seen some XML indexing in the Sandbox. Is anybody here who has done something with a thin pull parser (perhaps even kXML)? Does anybody know of a project or some source code available which covers this topic?

Karl
Re: Boosting Questions
Thanks Otis.

- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Thursday, January 27, 2005 12:11 PM
Subject: Re: Boosting Questions

> Luke,
>
> Boosting is only one of the factors involved in Document/Query scoring. Assuming that applying your boosts to Document A, or to a single field of Document A, increases the total score enough, then yes, Document A may have the highest score. But just because you boost a single Document and not others, it does not mean it will emerge at the top. You should check out the Explanation class, which can dump all scoring factors in text or HTML format.
>
> Otis
Re: Boosting Questions
Luke,

Boosting is only one of the factors involved in Document/Query scoring. Assuming that applying your boosts to Document A, or to a single field of Document A, increases the total score enough, then yes, Document A may have the highest score. But just because you boost a single Document and not others, it does not mean it will emerge at the top. You should check out the Explanation class, which can dump all scoring factors in text or HTML format.

Otis

--- Luke Shannon <[EMAIL PROTECTED]> wrote:
> Hi All;
>
> I just want to make sure I have the right idea about boosting.
>
> So if I boost a document (Document A) after I index it (let's say with a factor of 2.0), Lucene will now consider this document relatively more important than other documents in the index with a boost factor of less than 2.0. This boost factor will also be applied to all the fields in Document A. Therefore, if I do a TermQuery on a field that all my documents share ("title"), then in the returned Hits (assuming Document A was among the returned documents) Document A will score higher than other documents with a lower boost factor, because the "title" field in A would have been boosted along with all its other fields. Correct?
>
> Now if at indexing time I decided to boost a particular field, let's say "address" in Document A (a field which all documents have), the boost factor is only applied to the "address" field of Document A. Nothing else is boosted by this operation. This means that if a TermQuery on the "address" field returns Document A along with a collection of other documents, Document A will score higher than the others because of the boosting. Correct?
>
> Thanks,
>
> Luke
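Getting an Explanation is a one-liner against an IndexSearcher (the hit index is made up):

    Explanation exp = searcher.explain(query, hits.id(0));
    System.out.println(exp.toString()); // or exp.toHtml(), per Otis's note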
Boosting Questions
Hi All; I just want to make sure I have the right idea about boosting. So if I boost a document (Document A) when I index it (let's say with a boost of 2.0), Lucene will now consider this document relatively more important than other documents in the index with a boost factor less than 2.0. This boost factor will also be applied to all the fields in Document A. Therefore, if I do a TermQuery on a field that all my documents share ("title"), then in the returned Hits (assuming Document A was among the returned documents), Document A will score higher than other documents with a lower boost factor, because the "title" field in A would have been boosted along with all its other fields. Correct? Now if at indexing time I decide to boost a particular field, let's say "address" in Document A (this is a field which all documents have), the boost factor is only applied to the "address" field of Document A. Nothing else is boosted by this operation. This means that if a TermQuery on the "address" field returns Document A along with a collection of other documents, Document A will score higher than the others because of the boosting. Correct? Thanks, Luke
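For reference, a sketch of where the two kinds of boost are applied, assuming the Lucene 1.4-era Field factories; the path and field values are placeholders. Note that both boosts are set before addDocument() - they are baked in at index time, not applied afterwards:

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BoostExample {
    public static void main(String[] args) throws IOException {
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

        Document docA = new Document();
        docA.add(Field.Text("title", "annual report"));

        Field address = Field.Text("address", "100 Main Street");
        address.setBoost(3.0f);  // field-level boost: affects scores on "address" only
        docA.add(address);

        docA.setBoost(2.0f);     // document-level boost: folded into every field's boost
        writer.addDocument(docA); // boosts are recorded at index time
        writer.close();
    }
}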
RE: Reloading an index
Make sure that the old searcher is not referenced anywhere else; otherwise the garbage collector cannot delete it. Just remember that the garbage collector runs when memory is needed, not immediately after a reference is set to null. -Original Message- From: Greg Gershman [mailto:[EMAIL PROTECTED] Sent: Thursday, January 27, 2005 17:29 To: lucene-user@jakarta.apache.org Subject: Reloading an index
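A sketch of the swap Greg describes, under the assumption that no other thread still holds the old searcher when it is closed (in-flight searches would need reference counting on top of this):

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
    private IndexSearcher current;
    private final String indexDir;

    public SearcherHolder(String indexDir) throws IOException {
        this.indexDir = indexDir;
        this.current = new IndexSearcher(indexDir);
    }

    public synchronized IndexSearcher getSearcher() {
        return current;
    }

    // call this after the index has been updated
    public synchronized void reopen() throws IOException {
        IndexSearcher old = current;
        current = new IndexSearcher(indexDir);
        old.close();
        // close() releases the index files; the heap itself is only reclaimed
        // at the next GC cycle, so seeing roughly double the memory for a
        // while after a swap is expected, not a leak
    }
}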
Re: Index Layout Question
"Jerry Jalenak" <[EMAIL PROTECTED]> writes: > I am in the process of indexing about 1.5 million documents, and have > started down the path of indexing these by month. Each month has between > 100,000 and 200,000 documents. From a performance standpoint, is this the > right approach? This allows me to use MultiSearcher (or > ParallelMultiSearcher), but I'm not sure if the performance gains are really > there. Would one monolithic index be better? Depends on your search infrastructure. Doug Cutting has sent out some basic optimization guidelines on this list which should be in the archives... simply, you need to think about how many CPUs and spindles are involved. 1.5m documents isn't a challenge for Lucene to index or search on a single machine with a monolithic index. I indexed about 1.6m web pages in 22 hours on a single machine with all data local, and search with a single IndexSearcher was instantaneous. We've also done some testing with a larger collection (25m pages) and ParallelMultiSearchers on several machines, and likewise on a fast network haven't felt a slowdown, but we haven't actually benchmarked it. Ian - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Reloading an index
I have an index that is frequently updated. When indexing is completed, an event triggers a new Searcher to be opened. When the new Searcher is opened, incoming searches are redirected to it, and the old Searcher is closed and nulled, but I still see about twice the amount of memory in use well after the original searcher has been closed. Is there something else I can do to get this memory reclaimed? Should I explicitly call garbage collection? Any ideas? Thanks. Greg Gershman
Index Layout Question
I am in the process of indexing about 1.5 million documents, and have started down the path of indexing these by month. Each month has between 100,000 and 200,000 documents. From a performance standpoint, is this the right approach? This allows me to use MultiSearcher (or ParallelMultiSearcher), but I'm not sure if the performance gains are really there. Would one monolithic index be better? Thanks. Jerry Jalenak Senior Programmer / Analyst, Web Publishing LabOne, Inc. 10101 Renner Blvd. Lenexa, KS 66219 (913) 577-1496 [EMAIL PROTECTED]
Re: Different Documents (with fields) in one index?
Nope, you don't need one index per set - it is very possible. We have an index that holds the search info for documents, messages in discussion threads, filled-in forms, etc., each having their own structure. cheers, Aad Karl Koch wrote: > Is it possible to have different kinds of Documents with different index fields in ONE index? Or do I need one index for each set?
Re: Different Documents (with fields) in one index?
Karl, This is completely fine. You can have documents with different fields in the same index. Otis
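A sketch of what that looks like in practice, with hypothetical field names; a query on a field only ever matches documents that actually contain that field:

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class MixedIndex {
    public static void main(String[] args) throws IOException {
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

        // an article document with title/body fields
        Document article = new Document();
        article.add(Field.Text("title", "Q4 results"));
        article.add(Field.Text("body", "Revenue was up this quarter ..."));

        // a form document with a completely different field set
        Document form = new Document();
        form.add(Field.Keyword("formId", "F-17"));
        form.add(Field.Text("answers", "yes no yes"));

        writer.addDocument(article);
        writer.addDocument(form);
        writer.close();
        // a TermQuery on "formId" can only match the form document:
        // documents have no postings for fields they do not contain
    }
}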
Different Documents (with fields) in one index?
Hello all, perhaps not such a sophisticated question: I would like to have a very diverse set of documents in one index. Depending on the content of the text documents, I would like to put parts of the text into different fields. This means that in searches, when searching a particular field, some of those documents won't be addressed at all. Is it possible to have different kinds of Documents with different index fields in ONE index? Or do I need one index for each set? Karl
LuceneRAR nearing first release
https://lucenerar.dev.java.net LuceneRAR is now working, verified, on two containers: the J2EE 1.4 RI and Orion. WebSphere testing is underway, with JBoss to follow. LuceneRAR is a resource adapter for Lucene, allowing J2EE components to look up an entry in a JNDI tree and use that reference to add and search for documents. It's much like RemoteSearcher would be, except it uses JNDI semantics for communication instead of RMI, which is a little more elegant in a J2EE environment (where JNDI communication is very common). LuceneRAR was created to allow J2EE components to legitimately use filesystem indexes (for speed) while not violating J2EE's suggestion not to rely on filesystem access. It also allows distributed access to the index (remote servers simply establish a JNDI connection to the LuceneRAR home). Please take a look at it if you're interested; the feature set isn't complete, but it's workable. A sample application that allows creation, searches, and statistical data about the search is included in the distribution. Any comments are welcomed. --- Joseph B. Ottinger http://enigmastation.com IT Consultant [EMAIL PROTECTED]
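To illustrate only the generic J2EE side of this - the JNDI name below is hypothetical, and the actual binding is whatever the LuceneRAR deployment configures:

import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

public class LookupExample {
    public static void main(String[] args) throws NamingException {
        Context ctx = new InitialContext();
        // hypothetical JNDI name; consult the LuceneRAR docs for the real binding
        Object connector = ctx.lookup("java:comp/env/ra/lucene");
        System.out.println("Got: " + connector.getClass().getName());
    }
}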
Re: text highlighting
>> sometimes the returned String is empty. >> Is the code analyzer-dependent? When highlighter.getBestFragments returns nothing, it is because no match was found for the query terms in the supplied TokenStream. This is nearly always due to Analyzer issues. Check the post-analysis tokens produced for the query against the tokens produced in the TokenStream passed to the highlighter. The highlighter simply looks for matches between the two sources of terms and uses the token offsets to select the best sections of the supplied text. Cheers Mark
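A small diagnostic along the lines Mark suggests - dump the tokens an analyzer actually produces, offsets included, and compare the query side against the highlighted-text side (Lucene 1.4-era TokenStream API assumed; swap in the analyzer under test):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class DumpTokens {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(); // or new CJKAnalyzer(), etc.
        TokenStream ts =
            analyzer.tokenStream("h", new StringReader("... Family ..."));
        Token t;
        while ((t = ts.next()) != null) {
            // prints each post-analysis term with its character offsets
            System.out.println(t.termText()
                + " [" + t.startOffset() + "," + t.endOffset() + "]");
        }
    }
}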
Re: text highlighting
One more test result: if the text contains "... Family ...", then the query string "family" works OK. But if the query string is "Family", then the highlighter returns nothing. Thanks. Youngho - Original Message - From: "Youngho Cho" <[EMAIL PROTECTED]> To: "Lucene Users List" Cc: "Che Dong" <[EMAIL PROTECTED]> Sent: Thursday, January 27, 2005 6:10 PM Subject: Re: text highlighting > When I used the code with CJKAnalyzer to search English text, > sometimes the returned String is empty. > Is the code analyzer-dependent?
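That behaviour fits Mark's diagnosis: a TermQuery is never run through the analyzer, so the literal term "Family" is looked up as-is, while the analyzer presumably lowercased everything at index time. A sketch of the usual fix - let the same analyzer normalize the query string (QueryParser.parse here is the Lucene 1.4-era static method; whether CJKAnalyzer lowercases Latin text should be verified against its source):

import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CaseMismatch {
    public static void main(String[] args) throws ParseException {
        // bypasses the analyzer: looks up the literal, capitalized term
        Query raw = new TermQuery(new Term("h", "Family"));

        // runs "Family" through the analyzer, matching the indexed form
        Query analyzed = QueryParser.parse("Family", "h", new CJKAnalyzer());

        System.out.println(raw + " vs. " + analyzed);
    }
}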
Re: text highlighting
Hello, When I used the code with CJKAnalyzer to search English text (the text is mixed Korean and English), sometimes the returned String is empty. Other cases work well. Is the code analyzer-dependent? Thanks. Youngho --- Test Code (just a copy of the book code) ---

private static final String HIGH_LIGHT_OPEN = "<span class=\"highlight\">";
private static final String HIGH_LIGHT_CLOSE = "</span>";

public static String highLight(String value, String queryString) throws IOException
{
    if (StringUtils.isEmpty(value) || StringUtils.isEmpty(queryString))
    {
        return value;
    }

    TermQuery query = new TermQuery(new Term("h", queryString));
    QueryScorer scorer = new QueryScorer(query);
    SimpleHTMLFormatter formatter =
        new SimpleHTMLFormatter(HIGH_LIGHT_OPEN, HIGH_LIGHT_CLOSE);
    Highlighter highlighter = new Highlighter(formatter, scorer);

    Fragmenter fragmenter = new SimpleFragmenter(50);
    highlighter.setTextFragmenter(fragmenter);

    TokenStream tokenStream = new CJKAnalyzer().tokenStream("h",
        new StringReader(value));

    return highlighter.getBestFragments(tokenStream, value, 5, "...");
}

- Original Message - From: "Erik Hatcher" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Thursday, January 27, 2005 8:37 AM Subject: Re: text highlighting > Also, there are some examples in the Lucene in Action source code (grab > it from http://www.lucenebook.com) (see HighlightIt.java). > > Erik > > On Jan 26, 2005, at 5:52 PM, markharw00d wrote: > > Michael Celona wrote: > >> Does anyone have a working example of the highlighter class found in the > >> sandbox? > > There are several in the accompanying JUnit test: > > http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/ > > contributions/highlighter/src/test/org/apache/lucene/search/highlight/ > > Cheers > > Mark
Re: Searching with words that contain % , / and the like
Without looking at the source, my guess is that StandardAnalyzer (and StandardTokenizer) is the culprit. The StandardAnalyzer grammar (in StandardTokenizer.jj) is probably defined so "x/y" parses into two tokens, "x" and "y". "s" is a default stopword (see StopAnalyzer.ENGLISH_STOP_WORDS), so it gets filtered out, while "p" does not. To get what you want, you can use a WhitespaceAnalyzer, write your own custom Analyzer or Tokenizer, or modify the StandardTokenizer.jj grammar to suit your needs. WhitespaceAnalyzer is much simpler than StandardAnalyzer, so you may see some other things being tokenized differently. -Chris On Thu, 27 Jan 2005 12:12:16 +0530, Robinson Raju <[EMAIL PROTECTED]> wrote: > Hi , > > Is there a way to search for words that contain "/" or "%" . > if my query is "test/s" , it is just taken as "test" > if my query is "test/p" , it is just taken as "test p" > has anyone done this / faced such an issue ? > > Regards > Robin
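A quick way to confirm Chris's guess is to run the same input through both analyzers and look at the tokens; the expected output in the comments follows his explanation, not a verified run:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class CompareAnalyzers {
    public static void main(String[] args) throws IOException {
        dump(new StandardAnalyzer(), "test/s");   // expected: "test" ("s" is a stopword)
        dump(new WhitespaceAnalyzer(), "test/s"); // expected: "test/s", kept intact
    }

    private static void dump(Analyzer a, String text) throws IOException {
        TokenStream ts = a.tokenStream("value", new StringReader(text));
        Token t;
        while ((t = ts.next()) != null) {
            System.out.println(t.termText());
        }
    }
}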
Re: Searching with words that contain % , / and the like
Hi Jason, yes, the documentation does mention escaping, but that's only for special characters used in queries, right? I've tried escaping too. To answer your question: I am sure it is not the HTTP request that is eating it up. Query query = MultiFieldQueryParser.parse("test/s", "value", analyzer); query ends up as "value:test". I am using StandardAnalyzer. On Thu, 27 Jan 2005 17:53:39 +1100, Jason Polites <[EMAIL PROTECTED]> wrote: > The Lucene documentation mentions escaping, but doesn't include the "/" char... > -- > Lucene supports escaping special characters that are part of the query > syntax. The current list of special characters is: > + - && || ! ( ) { } [ ] ^ " ~ * ? : \ > To escape these characters, use the \ before the character. For example, to > search for (1+1):2 use the query: > \(1\+1\)\:2 > -- > You could try escaping it anyway? > Are you sure it's not an HTTP request which is screwing with the parameter? > > - Original Message - > From: "Robinson Raju" <[EMAIL PROTECTED]> > To: "Lucene Users List" > Sent: Thursday, January 27, 2005 5:42 PM > Subject: Searching with words that contain % , / and the like > > > Is there a way to search for words that contain "/" or "%" . -- Regards, Robin