RE: How do I prevent the HTML tags being added to Lucene Index..
Hey, look at the file Test.java under lucene-1.4; it strips out HTML tags and gives you the content.

with regards
Karthik

-----Original Message-----
From: Mahesh
Sent: Thursday, May 20, 2004 11:13 AM
To: [EMAIL PROTECTED]
Subject: How do I prevent the HTML tags being added to Lucene Index..

I am using Lucene 1.4 to index the information. I have a lot of HTML tags in the information that I will be indexing, so let me know if there is any way to keep the HTML tags from being indexed.

MAHESH
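For reference, the demo's HTMLParser (the same class Test.java exercises) can feed its tag-stripped Reader straight into a field at indexing time. A rough sketch along the lines of the demo's HTMLDocument class - untested, and the class and field names here are illustrative:

    import java.io.File;
    import org.apache.lucene.demo.html.HTMLParser;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class HtmlStripExample {
        // Build a Lucene Document whose indexed "contents" field sees
        // only the text of the page, never the markup.
        public static Document fromHtml(File file) throws Exception {
            HTMLParser parser = new HTMLParser(file);
            Document doc = new Document();
            // Field.Text(String, Reader): indexed and tokenized, not stored
            doc.add(Field.Text("contents", parser.getReader()));
            doc.add(Field.Text("title", parser.getTitle()));
            return doc;
        }
    }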
How do I prevent the HTML tags being added to Lucene Index..
I am using Lucene 1.4 to index the information. I have a lot of HTML tags in the information that I will be indexing, so let me know if there is any way to keep the HTML tags from being indexed.

MAHESH
Re: Lucene and MVC (was Re: Bad file descriptor (IOException) using SearchBean contribution)
On May 20, 2004, at 04:38, Erik Hatcher wrote:
> OffTopic: havoc and Struts go well together ;) Pick up Tapestry instead!

Nah. Keep it really Simple [1] instead :o)

[1] http://simpleweb.sourceforge.net/

PA.
Lucene and MVC (was Re: Bad file descriptor (IOException) using SearchBean contribution)
On May 19, 2004, at 8:04 AM, Timothy Stone wrote:
>> Could you elaborate on what you mean by MVC here? A value list handler piece has been developed and links posted to it on this list - if this is the type of thing you're referring to.
>
> Again, maybe I was naively associating the "SearchBean" with something that it was not supposed to be doing. To elaborate, I would like to take the demo, which has been working with some success for two years on my site, and follow the suggestions of Andrew C. Oliver and go "Model 2 on the demo."

I've never seen a user story (or use case) that said "this feature must use MVC" :) What is the purpose of going MVC here? Is it just for architectural purity?

> So the SearchBean's purpose, as I understood it, was to provide a Model 2 component for use in JSPs.

Consider a query that generates a million hits. How should the JSP iterate over them? In a pure MVC world, the JSP would be pushed the hits and allowed to display them however it likes. With Lucene Hits, you get this capability already. I'm just not convinced a wrapper is needed, especially now that sorting is built in. Again, I'm open to being convinced otherwise.

>> A value list handler piece has been developed and links posted to it on this list - if this is the type of thing you're referring to.
>
> I tried looking for references to such, but no luck.

http://www.nitwit.de/vlh2/

Also, for JSP use, there is the taglib contribution in the sandbox that might be of interest to you. I've not gotten it to work yet, and it's not quite my cup of tea (being an anti-JSP kinda guy, that is).

> I must admit that I get the feeling that "newbies" to Lucene seem to get less attention on the list. I'm one that tries real hard to research my question first in the archives (marc.theaimsgroup.com) then on the web. Even I get frustrated on some lists where the most obvious question is being asked and the asker misses hints and outright help.

When I was first learning Ant, I lurked on the ant-user list, and when a question came up that I knew, I'd answer it. When one came up that I didn't know, I'd research it by experimenting and cross-referencing the source code to try to figure it out. We really get out of this community what we put into it, in my opinion. Newbies need to be savvy and do some homework and not expect everything to be spelled out beautifully - none of us have time to flesh out full-fledged example applications to answer every question. Sometimes a question comes along that I could reply to, but I let it go because I'm crushed for time as it is. Sometimes I'll answer - especially if the question piques my curiosity or has some aspect of a challenge for me to learn something new. I personally try to answer professionally and thoroughly, but sometimes I might answer off-the-cuff or quickly and it comes out a bit terse or perhaps intimidating. My contributions as a whole, though, are hopefully taken positively by the community.

> The Lucene User list can be intimidating even for the advanced novice who may be on the right track but not phrasing or wording or describing the problem or task in front of him/her.

You are not the only one that gets blown away by things on this list. There are many times I've been baffled and completely mind-blown by things here - what underlies Lucene and what folks can build around it is simply astonishing. This is no typical open source project we're dealing with here. Thankfully the API is so straightforward to use, though, that Lucene usage is clear - it's the bigger picture that is daunting (to me). I'm personally reading Managing Gigabytes at the moment, and my head is spinning. But it is helping me get a clearer picture of the underlying concepts that Lucene is built upon.

> a new desire to tackle Struts, and well, havoc ensues.

OffTopic: havoc and Struts go well together ;) Pick up Tapestry instead!

Erik
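To make the Hits point above concrete, a minimal sketch of paging through results without any wrapper bean - Hits fetches stored documents lazily, so only the documents a page actually touches are read from disk. The index path and field name here are invented:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class HitsPager {
        // Print one page of results; only the documents accessed in
        // the loop are pulled off disk, however large hits.length() is.
        public static void printPage(Query query, int start, int size)
                throws Exception {
            IndexSearcher searcher = new IndexSearcher("index"); // assumed path
            Hits hits = searcher.search(query);
            int end = Math.min(start + size, hits.length());
            for (int i = start; i < end; i++) {
                Document doc = hits.doc(i); // lazy fetch
                System.out.println(doc.get("title") + " (score " + hits.score(i) + ")");
            }
            searcher.close();
        }
    }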
Re: Internal full content store within Lucene
Morus Walter wrote:
> Kevin Burton writes:
>> How much interest is there for this? I have to do this for work and will certainly take the extra effort into making this a standard Lucene feature.
>
> Sounds interesting. How would you handle deletions?

They aren't a requirement in our scenario... It would probably be more efficient to just leave the content on disk. If you want to GC over time, the arc files can be grouped together by time, so you can just eventually delete a whole arc file...

Kevin
Re: Possible to fetch a document without all fields for performance?
Morus Walter wrote:
> I don't understand that. You get the document object, which does not contain the document's field contents. It just provides access to this data. It's up to you which fields you access. And remember that you don't have to store fields at all, if you don't need to retrieve them (e.g. because the original documents are somewhere else).

Nope... When you get the Document, the fields are already pre-parsed from disk. Even if you don't call ANY methods to get fields, it still has to read all the fields off disk.

Kevin
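A sketch of the workaround Morus hints at, assuming the original documents are retrievable outside the index (the class, field names, and helper signature below are made up): index the huge body without storing it, and store only the small fields plus a pointer to the original.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class TitleOnlyIndexer {
        // Index a document so that only the small title and a pointer
        // are stored; the large body is indexed but never written to
        // the index as stored data.
        public static void add(IndexWriter writer, String title,
                               String bodyText, String pathToOriginal)
                throws java.io.IOException {
            Document doc = new Document();
            doc.add(Field.Text("title", title));            // indexed + stored
            doc.add(Field.UnStored("body", bodyText));      // indexed, NOT stored
            doc.add(Field.Keyword("path", pathToOriginal)); // stored pointer
            writer.addDocument(doc);
        }
    }

At search time, hits.doc(i).get("title") then only reads the stored title and path, because the body bytes were never put in the index in the first place.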
RE: org.apache.lucene.search.highlight.Highlighter
> Thanks for "highlighting" the problem with the Javadocs...

Groan. :)

Regards,
Bruce Ritchie
Re: org.apache.lucene.search.highlight.Highlighter
>> Was investigating, found some compile-time error..

I see the code you have is taken from the example in the javadocs. Unfortunately that example wasn't complete, because the class didn't include the method defined in the Formatter interface. I have updated the Javadocs to correct this oversight. To correct your problem, either make your class implement the Formatter interface to perform your choice of custom formatting, or remove the "this" parameter from your call to create a new Highlighter with the default Formatter implementation.

Thanks for "highlighting" the problem with the Javadocs...

Cheers
Mark
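For what it's worth, applying Mark's second suggestion to the code from the original post (quoted in full later in this digest), the loop would become something like this sketch - hits, analyzer, the imports, and the "bookid" field are all as in that post:

    // default Formatter: no "this" argument needed
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    for (int i = 0; i < hits.length(); i++) {
        String text = hits.doc(i).get("bookid");
        TokenStream tokenStream =
            analyzer.tokenStream("bookid", new StringReader(text));
        // get the 3 best fragments, separated by "..."
        String result = highlighter.getBestFragments(tokenStream, text, 3, "...");
        System.out.println(result);
    }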
RE: AW: Problem indexing Spanish Characters
Here is an example method in org.apache.lucene.demo.html.HTMLParser that uses a different buffered reader for a different encoding:

    public Reader getReader() throws IOException {
        if (pipeIn == null) {
            pipeInStream = new MyPipedInputStream();
            pipeOutStream = new PipedOutputStream(pipeInStream);
            pipeIn = new InputStreamReader(pipeInStream);
            pipeOut = new OutputStreamWriter(pipeOutStream);

            // check the first 4 bytes for the FFFE marker;
            // if it's there, we know it's UTF-16 encoding
            if (useUTF16) {
                try {
                    pipeIn = new BufferedReader(
                        new InputStreamReader(pipeInStream, "UTF-16"));
                } catch (Exception e) {
                }
            }
            Thread thread = new ParserThread(this);
            thread.start(); // start parsing
        }
        return pipeIn;
    }

-----Original Message-----
From: Martin Remy
Sent: Wednesday, May 19, 2004 2:09 PM
To: 'Lucene Users List'
Subject: RE: AW: Problem indexing Spanish Characters

The tokenizers deal with unicode characters (CharStream, char), so the problem is not there. This problem must be solved at the point where the bytes from your source files are turned into CharSequences/Strings, i.e. by wrapping your input stream in an InputStreamReader (rather than using a plain FileReader) and specifying "UTF-8" (or whatever encoding is appropriate) in the InputStreamReader constructor. You must either detect the encoding from HTTP headers or XML declarations or, if you know that it's the same for all of your source files, just hardcode UTF-8, for example.

Martin

[snip]
RE: AW: Problem indexing Spanish Characters
The tokenizers deal with unicode characters (CharStream, char), so the problem is not there. This problem must be solved at the point where the bytes from your source files are turned into CharSequences/Strings, i.e. by wrapping your input stream in an InputStreamReader (rather than using a plain FileReader) and specifying "UTF-8" (or whatever encoding is appropriate) in the InputStreamReader constructor. You must either detect the encoding from HTTP headers or XML declarations or, if you know that it's the same for all of your source files, just hardcode UTF-8, for example.

Martin

-----Original Message-----
From: Hannah c
Sent: Wednesday, May 19, 2004 10:35 AM
Subject: RE: AW: Problem indexing Spanish Characters

[snip]
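A minimal sketch of the fix Martin describes, assuming the source files really are UTF-8 (the class name and file name are illustrative):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    public class Utf8Reader {
        // Decode the bytes explicitly as UTF-8 instead of relying on
        // the platform default encoding (which is what FileReader does).
        public static Reader open(String fileName) throws java.io.IOException {
            return new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), "UTF-8"));
        }
    }

With the bytes decoded correctly, accented characters reach StandardTokenizer as single chars and, as Martin notes, the tokenizer itself handles them fine.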
RE: AW: Problem indexing Spanish Characters
Hi,

I had a quick look at the sandbox, but my problem is that I don't need a Spanish stemmer. However, there must be a replacement tokenizer that supports foreign characters to go along with the foreign-language Snowball stemmers. Does anyone know where I could find one?

In answer to Peter's question - yes, I'm also using "UTF-8" encoded XML documents as the source. Below is an example of what happens when I tokenize the text using the StandardTokenizer.

Thanks
Hannah Cumming

-- text I'm trying to index:

    century palace known as la "Fundación Hospital de Na. Señora del Pilar"

-- tokens output by StandardTokenizer:

    century palace known as la â FundaciÃ* n * Hospital de Na Seà * ora * del Pilar â

-----Original Message-----
From: "Peter M Cipollone" <[EMAIL PROTECTED]>
Sent: Wednesday, May 19, 2004 11:41 AM
Subject: Re: Problem indexing Spanish Characters

could you send some sample text that causes this to happen?

[snip]
AW: Problem indexing Spanish Characters
Hi Hannah, Otis,

I cannot help, but I have exactly the same problems with special German characters. I used the Snowball analyser, but this does not help because the problem (tokenizing) appears before the analyser comes into action. I just posted the question "Problem tokenizing UTF-8 with German umlauts" some minutes ago, which describes my problem; Hannah's seems to be similar. Do you also have UTF-8 encoded pages?

Peter MH

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 19 May 2004 17:42
To: Lucene Users List
Subject: Re: Problem indexing Spanish Characters

[snip]
Re: Problem indexing Spanish Characters
It looks like the Snowball project supports Spanish: http://www.google.com/search?q=snowball spanish

If it does, take a look at the Lucene Sandbox. There is a project that allows you to use Snowball analyzers with Lucene.

Otis

--- Hannah c <[EMAIL PROTECTED]> wrote:

[snip]
Re: Possible to fetch a document without all fields for performance?
Hi Kevin,

There is no API for this, and I agree it would be handy.

Otis

--- Kevin Burton <[EMAIL PROTECTED]> wrote:
> Say I have a query result for the term Linux... now I just want the TITLE of these documents, not the BODY.
>
> To further this scenario, imagine the TITLE is 500 bytes but the BODY is 50M.
>
> The current impl of fetching a document will pull in ALL 50,000,500 bytes, not just the 500 that I need.
>
> Obviously if I could just get the TITLE field, this would be a HUGE speedup.
>
> Is there a somewhat simple and efficient way to get a document with a restricted set of fields? Digging through the API, it didn't seem obvious.
>
> Kevin
Problem tokenizing UTF-8 with German umlauts
Hello,

I have HTML documents which are UTF-8 encoded and contain English and/or German content. I have written my own Analyzer and Filter to replace the German umlauts with the commonly used character pairs (ü=ue, ä=ae, ö=oe) to avoid any problems. Still, in the HTML code the German umlauts show up as a pair of characters representing the UTF-8 encoding (I think). As a result, the StandardTokenizer misinterprets the string and splits a word with an umlaut into two tokens, which is of no use anymore. Does anyone have experience with this case and can help me back onto the road?

Peter MH
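For reference, the kind of replacement filter described here might look like the following sketch against the Lucene 1.4 token API (UmlautFilter is an invented name; and note that, as this thread works out, such a filter only helps once the bytes have been decoded with the right encoding):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class UmlautFilter extends TokenFilter {
        public UmlautFilter(TokenStream in) {
            super(in);
        }

        // Rewrite each token, replacing umlauts with their ASCII pairs.
        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) {
                return null;
            }
            String s = t.termText()
                .replaceAll("ä", "ae").replaceAll("ö", "oe").replaceAll("ü", "ue")
                .replaceAll("Ä", "Ae").replaceAll("Ö", "Oe").replaceAll("Ü", "Ue")
                .replaceAll("ß", "ss");
            return new Token(s, t.startOffset(), t.endOffset(), t.type());
        }
    }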
Problem indexing Spanish Characters
Hi,

I am indexing a number of English articles on Spanish resorts. As such, there are a number of Spanish characters throughout the text; most of these are in the place names, which are the type of words I would like to use as queries. My problem is with the StandardTokenizer class, which cuts a word in two when it comes across any of the Spanish characters. I had a look at the source, but the code was generated by JavaCC and so is not very readable. I was wondering if there was a way around this problem, or which area of the code I would need to change to avoid it.

Thanks
Hannah Cumming
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
Thanks, I will look at the sorting code. Sorting results by date is next on the list. For now I only have a small number of documents, but the set is to grow to over 8 million documents for the collection I am working on. Another collection we have is 40 million documents or so. From what you say, it seems to me that sorting will not scale when I get to larger numbers of documents. I am considering using an SQL back end to implement sorting: bring back the unique IDs from Lucene and then sort in SQL.

Claude

On May 18, 2004, at 11:23 PM, Morus Walter wrote:
> Claude Devarenne writes:
>> Hi,
>>
>> I have over 60,000 documents in my index, which is slightly over 1 GB in size. The documents range from the late seventies up to now. I have indexed dates as a keyword field using a string, because the dates are in YYYYMMDD format. When I do range queries things are OK as long as I don't exceed the built-in number of boolean clauses, so that's a range of 3 years, e.g. 1979 to 1981. The users are not only doing complex queries but also want to query over long ranges, e.g. [19790101 TO 19991231]. Given these requirements, I am thinking of doing a query without the date range, bringing the unique ids back from the hits and then doing a date query in the SQL database I have that contains the same data. Another alternative is to do the query without the date range in Lucene and then sort the results within the range. I still have to learn how to use the new sorting code and confess I did not have time to look at it yet. Is there a simpler, easier way to do this?
>
> I think it would be worthwhile to take a look at the sorting code. The idea of the sorting code is to have an array of the dates for each doc in memory and access this array for sorting. Now sorting isn't the only thing one might use this array for. Doing a range check is another. So you might extend the sorting code by a range selection. There is no code for this in Lucene and you have to create your own searcher, but it gives you a fast way to search and sort by date. I did this independently of the new sorting code (I just started a little too early) and it works quite well. The only drawback of this (and the new sorting code) is that it requires an array of field values that must be rebuilt each time the index changes. Shouldn't be a problem for 60,000 documents.
>
> Morus
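To make the filter-style alternative concrete, here is a rough sketch against the Lucene 1.4 Filter API (DateRangeFilter is a hypothetical class, not something shipped with Lucene): it walks the term index once over the YYYYMMDD keyword field instead of expanding the range into boolean clauses, so the "too many boolean clauses" limit never comes into play.

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.search.Filter;

    public class DateRangeFilter extends Filter {
        private final String field;
        private final String lowerDate; // e.g. "19790101"
        private final String upperDate; // e.g. "19991231"

        public DateRangeFilter(String field, String lowerDate, String upperDate) {
            this.field = field;
            this.lowerDate = lowerDate;
            this.upperDate = upperDate;
        }

        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());
            TermEnum enumerator = reader.terms(new Term(field, lowerDate));
            TermDocs termDocs = reader.termDocs();
            try {
                do {
                    Term term = enumerator.term();
                    // stop once we leave the field or pass the upper bound
                    if (term == null || !term.field().equals(field)
                            || term.text().compareTo(upperDate) > 0) {
                        break;
                    }
                    termDocs.seek(enumerator);
                    while (termDocs.next()) {
                        bits.set(termDocs.doc());
                    }
                } while (enumerator.next());
            } finally {
                enumerator.close();
                termDocs.close();
            }
            return bits;
        }
    }

Usage would be along the lines of searcher.search(query, new DateRangeFilter("date", "19790101", "19991231")); the date constraint never enters the query itself.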
Re: Bad file descriptor (IOException) using SearchBean contribution
Erik Hatcher wrote:
> On May 18, 2004, at 1:43 PM, Timothy Stone wrote:
>> Erik Hatcher wrote:
>>> Lucene 1.4 (now in release candidate stage) includes built-in sorting capabilities, so I definitely recommend you have a look at that. SearchBean is effectively deprecated based on this new, much more powerful feature.
>>>
>>> Erik
>>
>> Forgive my naivety, but isn't the purpose of the SearchBean more than just sorting? Without the SearchBean, creating an MVC demo becomes a larger exercise to undertake.
>
> Could you elaborate on what you mean by MVC here? A value list handler piece has been developed and links posted to it on this list - if this is the type of thing you're referring to.

Again, maybe I was naively associating the "SearchBean" with something that it was not supposed to be doing. To elaborate, I would like to take the demo, which has been working with some success for two years on my site, and follow the suggestions of Andrew C. Oliver and go "Model 2 on the demo."

You and I have moved away in this thread from my original question, why I am getting the IOException: Bad File Descriptor, *and that is okay*, I'm learning a lot. However, I hope that we can come back to it later, if necessary off-list.

So the SearchBean's purpose, as I understood it, was to provide a Model 2 component for use in JSPs.

> A value list handler piece has been developed and links posted to it on this list - if this is the type of thing you're referring to.

I tried looking for references to such, but no luck.

[snip]

> I'd love to hear how folks are using SearchBean though, and why they feel it is beneficial.

See above as to how I think it could be used. :)

I agree that Lucene offers a tremendous amount of power! Kudos to all of the developers working so hard on this. It is a testament to the flexibility of Java.

I must admit that I get the feeling that "newbies" to Lucene seem to get less attention on the list. I'm one that tries real hard to research my question first in the archives (marc.theaimsgroup.com) then on the web. Even I get frustrated on some lists where the most obvious question is being asked and the asker misses hints and outright help.

The Lucene User list can be intimidating even for the advanced novice who may be on the right track but not phrasing or wording or describing the problem or task in front of him/her. So forgive me, Lucene is a very powerful API/library (see, I understand what Lucene is ;) ) and I get lost in the new search terminology confronting me. Couple this with a new desire to tackle Struts, and well, havoc ensues.

Many thanks for your help and answers.

Tim
RE: about search and update one index simultaneously
There is no problem with updating and searching simultaneously. Two threads updating the same index simultaneously over NFS can be a problem, though, as the locking does not work reliably there. Have a look through the archives for NFS; there are some solutions scattered about.

David

-----Original Message-----
From: xuemei li [mailto:[EMAIL PROTECTED]
Sent: 18 May 2004 23:01
To: [EMAIL PROTECTED]
Subject: about search and update one index simultaneously

Hi all,

Can we do search and update on one index simultaneously? Does anyone know something about this? I have done some experiments. Right now the search is blocked while the index is being updated. The error on the search node is like this:

    caught a class java.io.IOException with message: Stale NFS file handle

Thanks
Xuemei Li
RE: SELECTIVE Indexing
Hey Lucene Users,

My original intention for indexing was to index certain portions of the HTML [not the whole document]. If JTidy does not support this, then what are my options?

Karthik

-----Original Message-----
From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 19, 2004 1:43 PM
To: 'Lucene Users List'
Subject: RE: SELECTIVE Indexing

[snip]
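One possible route, sketched below and purely illustrative (SelectiveHtml and the XPath expression are made up, and JTidy's DOM implementation can be picky with XPath engines, so treat this as a starting point rather than a recipe): let JTidy build a DOM from the HTML, select the interesting nodes with an XPath (here via Xalan's XPathAPI), and index only their text.

    import java.io.FileInputStream;
    import org.apache.xpath.XPathAPI;   // Xalan
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.w3c.tidy.Tidy;           // JTidy

    public class SelectiveHtml {
        // Collect the text under the nodes matched by the XPath;
        // hand the result to Field.Text(...) when building the Document.
        public static StringBuffer extract(String htmlFile, String xpath)
                throws Exception {
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            org.w3c.dom.Document dom =
                tidy.parseDOM(new FileInputStream(htmlFile), null);
            NodeList nodes = XPathAPI.selectNodeList(dom, xpath);
            StringBuffer text = new StringBuffer();
            for (int i = 0; i < nodes.getLength(); i++) {
                appendText(nodes.item(i), text);
            }
            return text;
        }

        private static void appendText(Node node, StringBuffer out) {
            if (node.getNodeType() == Node.TEXT_NODE) {
                out.append(node.getNodeValue()).append(' ');
            }
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                appendText(c, out);
            }
        }
    }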
RE: SELECTIVE Indexing
I doubt if it can be used as a plug-in. Would be good to know if it can.

Regards,
Kiran.

-----Original Message-----
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 12:30
To: Lucene Users List
Subject: RE: SELECTIVE Indexing

Hi,

Can I use Tidy [as a plug-in] with Lucene?

with regards
Karthik

-----Original Message-----
From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
Sent: Monday, May 17, 2004 3:27 PM
To: 'Lucene Users List'
Subject: RE: SELECTIVE Indexing

Try using Tidy. It creates a Document from the HTML and allows you to apply XPath.

Hope this helps.
Kiran.

-----Original Message-----
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 11:59
To: Lucene Users List
Subject: SELECTIVE Indexing

Hi all,

Can somebody tell me how to index a CERTAIN PORTION OF THE HTML FILE only? ex:-

with regards
Karthik
org.apache.lucene.search.highlight.Highlighter
Hey Guys,

Found the Highlighter package in the CVS directory. Was investigating it and found a compile-time error. Please can somebody tell me what this is?

The code:

    private IndexReader reader = null;
    private Highlighter highlighter = null;

    public SearchFiles() {
    }

    public void searchIndex0(String srchkey, String pathfile) throws Exception {
        IndexSearcher searcher = new IndexSearcher(pathfile);
        Query query = QueryParser.parse(srchkey, "bookid", analyzer);
        query = query.rewrite(reader); // required to expand search terms
        Hits hits = searcher.search(query);
        highlighter = new Highlighter(this, new QueryScorer(query));
        for (int i = 0; i < hits.length(); i++) {
            String text = hits.doc(i).get(bookid);
            TokenStream tokenStream = analyzer.tokenStream(bookid, new StringReader(text));
            // Get 3 best fragments and separate with a "..."
            String result = highlighter.getBestFragments(tokenStream, text, 3, "...");
            System.out.println(result);
        }
    }

The error:

    src\org\apache\lucene\search\higlight\SearchFiles.java:46: cannot resolve symbol
    symbol  : constructor Highlighter (com.controlnet.higlight.SearchFiles,com.controlnet.higlight.QueryScorer)
    location: class org.apache.lucene.search.highlight.Highlighter
        highlighter = new Highlighter(this, new QueryScorer(query));

Also, the URL referred to in the lucene-dev archives is not available for proper documentation: http://home.clara.net/markharwood/lucene/highlight.htm

WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK]