Re: Search Chinese in Unicode !!!
I want that Chinese Analyzer !! On Fri, 21 Jan 2005 17:36:17 +0100, Safarnejad, Ali (AFIS) <[EMAIL PROTECTED]> wrote: > I've written a Chinese Analyzer for Lucene that uses a segmenter written by > Erik Peterson. However, as the author of the segmenter does not want his code > released under apache open source license (although his code _is_ > opensource), I cannot place my work in the Lucene Sandbox. This is > unfortunate, because I believe the analyzer works quite well in indexing and > searching chinese docs in GB2312 and UTF-8 encoding, and I like more people > to test, use, and confirm this. So anyone who wants it, can have it. Just > shoot me an email. > BTW, I also have written an arabic analyzer, which is collecting dust for > similar reasons. > Good luck, > > Ali Safarnejad > > > -Original Message- > From: Eric Chow [mailto:[EMAIL PROTECTED] > Sent: 21 January 2005 11:42 > To: Lucene Users List > Subject: Re: Search Chinese in Unicode !!! > > Search not really correct with UTF-8 !!! > > The following is the search result that I used the SearchFiles in the lucene > demo. > > d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java > org.apache.lucene.demo.SearchFiles c:\temp\myindex > Usage: java SearchFiles > Query: ç > Searching for: g strange ?? > 3 total matching documents > 0. ../docs/ChineseDemo.htmlthis files contains > the ç > - > 1. ../docs/luceneplan.html > - Jakarta Lucene - Plan for enhancements to Lucene > 2. ../docs/api/index-all.html > - Index (Lucene 1.4.3 API) > Query: > > From the above result only the ChineseDemo.html includes the character that I > want to search ! > > The modified code in SearchFiles.java: > > BufferedReader in = new BufferedReader(new InputStreamReader(System.in, > "UTF-8")); > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Document 'Context' & Relation to each other
As a log4j developer, I've been toying with the idea of what Lucene could do for me, maybe as an excuse to play around with Lucene. I've started creating a LoggingEvent->Document converter, and thinking through how I'd like this utility to work when I came across a question I wasn't sure about. When scanning/searching through logging events, one is usually looking for a particular matching event which Lucene does excellently, but what a person usually needs is also the context of that matching logging event around it. With grep, one can use the "-C" argument to grep to provide X # of lines around the matching entry. I'd like to be able to do the same thing with Lucene. Now, I could provide a Field to the LoggingEvent Document that has a sequence #, and once a user has chosen an appropriate matching event, do another search for the documents with a Sequence # between +/- the context size. My question is, is that going to be an efficient way to do this? The sequence # would be treated as text, wouldn't it? Would the range search on an int be the most efficient way to do this? I know from the Hits documentation that one can retrieve the Document ID of a matching entry. What is the contract on this Document ID? Is each Document added to the Index given an increasing number? Can one search an index by Document ID? Could one search for Document ID's between a range? (Hope you can see where I'm going here). If you have any other recommendations about "Context" searching I would appreciate any thoughts. Many thanks for an excellent API, and kudos to Erik & Otis for a great eBook btw. regards, Paul Smith - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
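Paul's sequence-number idea works for range searching if the numbers are indexed as fixed-width, zero-padded terms, because Lucene compares range-query bounds lexicographically as text. A minimal sketch of the padding (class name, method names, and the 10-digit width are illustrative assumptions, not from this thread):

```java
// Sketch: zero-pad sequence numbers so lexicographic term ordering matches
// numeric ordering, which is what a text-based range query needs.
// Without padding, the term "10" would sort before "9".
class SeqPad {
    static final int WIDTH = 10;   // illustrative width; pick one wider than your max sequence

    // 42 -> "0000000042"
    static String pad(long seq) {
        String s = Long.toString(seq);
        StringBuilder sb = new StringBuilder(WIDTH);
        for (int i = s.length(); i < WIDTH; i++) sb.append('0');
        return sb.append(s).toString();
    }

    // Lower and upper bounds for a grep -C style window of `context`
    // events around the matched event's sequence number.
    static String[] windowBounds(long seq, int context) {
        return new String[] { pad(Math.max(0, seq - context)), pad(seq + context) };
    }
}
```

The two strings from windowBounds would then feed a range query on the sequence field to pull the surrounding log events.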
Re: Opening up one large index takes 940M of memory?
: We have one large index right now... its about 60G ... When I open it : the Java VM used 940M of memory. The VM does nothing else besides open Just out of curiosity, have you tried turning on the verbose gc log, and putting in some thread sleeps after you open the reader, to see if the memory footprint "settles down" after a little while? You're currently checking the memoory usage immediately after opening the index, and some of that memory may be used holding transient data that will get freed up after some GC iterations. : IndexReader ir = IndexReader.open( dir ); : System.out.println( ir.getClass() ); : long after = System.currentTimeMillis(); : System.out.println( "opening...done - duration: " + : (after-before) ); : : System.out.println( "totalMemory: " + : Runtime.getRuntime().totalMemory() ); : System.out.println( "freeMemory: " + : Runtime.getRuntime().freeMemory() ); -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
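Hoss's suggestion of watching whether the footprint "settles down" can be sketched as a pure-JDK harness: request a collection, sleep briefly, and sample used memory a few times before and after opening the reader. The class name and sampling intervals below are my own assumptions, and System.gc() is only a hint the VM may ignore:

```java
// Rough harness for checking whether heap usage settles after opening a
// large index: nudge the collector, sleep, and sample used memory.
class MemWatch {
    static long usedMemory() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    static void sample(String label, int rounds) throws InterruptedException {
        for (int i = 0; i < rounds; i++) {
            System.gc();              // a hint only; the VM may ignore it
            Thread.sleep(500);        // give GC a chance to run
            System.out.println(label + " used=" + usedMemory() / (1024 * 1024) + "M");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        sample("before-open", 1);
        // IndexReader ir = IndexReader.open(dir);   // open the index here
        sample("after-open", 3);
    }
}
```

If the "after-open" samples drop over a few rounds, part of the 940M was transient allocation rather than the reader's steady-state footprint.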
Re: Opening up one large index takes 940M of memory?
Kevin A. Burton wrote: We have one large index right now... its about 60G ... When I open it the Java VM used 940M of memory. The VM does nothing else besides open this index. After thinking about it I guess 1.5% of memory per index really isn't THAT bad. What would be nice is if there were a way to do this from disk and then use a buffer (either via the filesystem or in-vm memory) to access these variables. This would be similar to the way the MySQL index cache works... Kevin -- Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat. Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html If you're interested in RSS, Weblogs, Social Networking, etc... then you should work for Rojo! If you recommend someone and we hire them you'll get a free iPod! Kevin A. Burton, Location - San Francisco, CA AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Opening up one large index takes 940M of memory?
We have one large index right now... it's about 60G ... When I open it the Java VM used 940M of memory. The VM does nothing else besides open this index. Here's the code: System.out.println( "opening..." ); long before = System.currentTimeMillis(); Directory dir = FSDirectory.getDirectory( "/var/ksa/index-1078106952160/", false ); IndexReader ir = IndexReader.open( dir ); System.out.println( ir.getClass() ); long after = System.currentTimeMillis(); System.out.println( "opening...done - duration: " + (after-before) ); System.out.println( "totalMemory: " + Runtime.getRuntime().totalMemory() ); System.out.println( "freeMemory: " + Runtime.getRuntime().freeMemory() ); Is there any way to reduce this footprint? The index is fully optimized... I'm willing to take a performance hit if necessary. Is this documented anywhere? Kevin -- Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat. Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html If you're interested in RSS, Weblogs, Social Networking, etc... then you should work for Rojo! If you recommend someone and we hire them you'll get a free iPod! Kevin A. Burton, Location - San Francisco, CA AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Stemming
Also if you can't wait, see page 2 of http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html or the LIA e-book ;) On Fri, 21 Jan 2005 09:27:42 -0500, Kevin L. Cobb <[EMAIL PROTECTED]> wrote: > OK, OK ... I'll buy the book. I guess its about time since I am deeply > and forever in love with Lucene. Might as well take the final plunge. > > > -Original Message- > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] > Sent: Friday, January 21, 2005 9:12 AM > To: Lucene Users List > Subject: Re: Stemming > > Hi Kevin, > > Stemming is an optional operation and is done in the analysis step. > Lucene comes with a Porter stemmer and a Filter that you can use in an > Analyzer: > > ./src/java/org/apache/lucene/analysis/PorterStemFilter.java > ./src/java/org/apache/lucene/analysis/PorterStemmer.java > > You can find more about it here: > http://www.lucenebook.com/search?query=stemming > You can also see mentions of SnowballAnalyzer in those search results, > and you can find an adapter for SnowballAnalyzers in Lucene Sandbox. > > Otis > > --- "Kevin L. Cobb" <[EMAIL PROTECTED]> wrote: > > > I want to understand how Lucene uses stemming but can't find any > > documentation on the Lucene site. I'll continue to google but hope > > that > > this list can help narrow my search. I have several questions on the > > subject currently but hesitate to list them here since finding a good > > document on the subject may answer most of them. > > > > > > > > Thanks in advance for any pointers, > > > > > > > > Kevin > > > > > > > > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
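To get a feel for what stemming does before buying the book: a stemmer conflates inflected forms onto a shared root so that "cats" and "cat" index and match as the same term. Below is a deliberately naive suffix stripper for illustration only; it is NOT the Porter algorithm (Porter reduces "running" to "run", this toy yields "runn") — in practice, use Lucene's PorterStemFilter as Otis describes:

```java
// Toy suffix stripper illustrating the idea behind stemming.
// NOT the Porter algorithm shipped in Lucene's PorterStemFilter;
// it only strips a few common English suffixes.
class ToyStemmer {
    static String stem(String word) {
        String w = word.toLowerCase();
        // Length guards keep short words like "is" or "ring" intact.
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("es") && w.length() > 4) return w.substring(0, w.length() - 2);
        if (w.endsWith("s") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }
}
```

In a real analyzer chain the equivalent step happens token by token, after tokenization and lowercasing.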
Re: Closed IndexWriter reuse
> --- Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > > > No, you can't add documents to an index once you close the IndexWriter. > > You can re-open the IndexWriter and add more documents, of course. > > > > Otis After my previous post I made some further tests with multithreading, and it does indeed randomly throw NullPointerExceptions and lock exceptions when reusing a closed IndexWriter. My example was a bad one because it was based on a very simple single thread. But wouldn't it be safer if IndexWriter threw an exception immediately when its modifying methods are called after it has been closed? __ Do you Yahoo!? Yahoo! Mail - 250MB free storage. Do more. Manage less. http://info.mail.yahoo.com/mail_250 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
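The fail-fast behaviour being asked for could look like the following wrapper sketch. The class and method names are hypothetical, not part of the Lucene API, and the delegation to the real IndexWriter is elided; it only shows the guard:

```java
// Sketch of a fail-fast wrapper: once close() has run, any further
// modifying call throws immediately instead of failing later with a
// NullPointerException or lock exception.
class GuardedWriter {
    private boolean closed = false;

    void addDocument(Object doc) {
        ensureOpen();
        // ... would delegate to the real IndexWriter.addDocument here
    }

    void close() {
        closed = true;
        // ... would delegate to the real IndexWriter.close here
    }

    private void ensureOpen() {
        if (closed) throw new IllegalStateException("IndexWriter already closed");
    }
}
```

Note the guard alone is not thread-safe; real multithreaded use would also need synchronization around the flag.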
Re: Lucene and multiple languages
I send you the source code in a private mail. Ernesto. aurora wrote: Thanks. I would like to give it a try. Is the source code available? I'm using a Python version of Lucene so it would need to be wrapped or ported :) Hi Aurora I developed a tool that deals with this multiple-languages issue. I found the Nutch library "language-identifier" very useful. The jar has Nutch dependencies, but I deleted all the code that was unnecessary (for me, obviously). The language-identifier I use works fine and is very simple. For example: LanguageIdentifier languageIdentifier = LanguageIdentifier.getInstance(); String userInputText = "free text"; String language = languageIdentifier.identify(userInputText); This works for 11 languages: English, Spanish, Portuguese, Dutch, German, French, Italian, and others. I can send you this modified jar, but remember that this jar is from Nutch, for copyright (or left :). http://www.nutch.org/LICENSE.txt More comments below... aurora wrote: I'm trying to build some web search tool that could work for multiple languages. I understand that Lucene is shipped with StandardAnalyzer plus German and Russian analyzers and some more in the sandbox, and that indexing and searching should use the same analyzer. Now let's say I have an index with documents in multiple languages, analyzed by an assortment of analyzers. When a user enters a query, which analyzer should be used? Should the user be asked for the language upfront? What should one expect when the analyzer and the document don't match? Let's say the query is parsed using StandardAnalyzer. Would it match any documents indexed with the German analyzer at all, or would it end up with poor results? When this happens, in most cases you do not get matches. Also, is there a good way to find out the languages used in a web page? There is a 'Content-Language' header in HTTP and a 'lang' attribute in HTML, but it looks like people don't really use them. How can we recognize the language? With the language identifier. :) Even more interesting is multiple languages used in one document, let's say half English and half French. Is there a good way to deal with those cases? The language identifier only returns one language. I looked into language-identifier: it works with a score for each language and returns the language with the greatest value. Maybe you can modify language-identifier to take the highest values. Bye Ernesto. Thanks for any guidance. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: FOP Generated PDF and PDFBox
Thanks Ben. I know of no related issues now. For the time being I will be using path. Once I get a chance I will try this on the command line as you have recommended. Luke - Original Message - From: "Ben Litchfield" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Friday, January 21, 2005 1:05 PM Subject: Re: FOP Generated PDF and PDFBox > > > Ya, when calling LucenePDFDocument.getDocument( File ) then it should be > the same as the path. > > This is the code that the class uses to set those fields. > > document.add( Field.UnIndexed("path", file.getPath() ) ); > document.add(Field.UnIndexed("url", file.getPath().replace(FILE_SEPARATOR, > '/'))); > > I have no idea why an FOP PDF would be any different than another PDF. > > You can also run it from the command line, this is just for debugging > purposes like this. > > java org.pdfbox.searchengine.lucene.LucenePDFDocument > > and it should print out the fields of the lucene Document object. Is the > url there and is it correct? > > Ben > > On Fri, 21 Jan 2005, Luke Shannon wrote: > > > That is correct. No difference with how other PDF are handled. > > > > I am looking at the index in Luke now. The FOP generated documents have a > > path but no URL? I would guess that these would be the same? > > > > Thanks for the speedy reply. > > > > Luke > > > > > > - Original Message - > > From: "Ben Litchfield" <[EMAIL PROTECTED]> > > To: "Lucene Users List" > > Sent: Friday, January 21, 2005 12:34 PM > > Subject: Re: FOP Generated PDF and PDFBox > > > > > > > > > > > > > Are you indexing the FOP PDF's differently than other PDF documents? > > > > > > Can I assume that you are using PDFBox's LucenePDFDocument.getDocument() > > > method? > > > > > > Ben > > > > > > On Fri, 21 Jan 2005, Luke Shannon wrote: > > > > > > > Hello; > > > > > > > > Our CMS now allows users to create PDF documents (uses FOP) and than > > search > > > > them. > > > > > > > > I seem to be able to index these documents ok. 
But when I am generating > > the > > > > results to display I get a Null Pointer Exception while trying to use a > > > > variable that should contain the url keyword for one of these documents > > in > > > > the index: > > > > > > > > Document doc = hits.doc(i); > > > > String path = doc.get("url"); > > > > > > > > Path contains null. > > > > > > > > The interesting thing is this only happens with PDF that are generate > > with > > > > FOP. Other PDFs are fine. > > > > > > > > What I find weird is shouldn't the "url" field just contain the path of > > the > > > > file? > > > > > > > > Anyone else seen this before? > > > > > > > > Any ideas? > > > > > > > > Thanks, > > > > > > > > Luke > > > > > > > > > > > > > > > > - > > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > - > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Search Chinese in Unicode !!!
Hi, I have done some studies on Chinese text search. The main problem is how to separate the words, as in Chinese there is no white space between words. The typical commercial search engines these days use a dictionary based approach: that is, look through the Chinese text and find the words that are in the dictionary. As for those characters that do not match words in the dictionary, you could use a bi-gram based approach. Say, for a b c, you could index 2 (pseudo) words, ab, bc. I think a pure bi-gram based approach is not good for a relatively large Chinese text collection, as you end up with many pseudo terms that are not actual words. Cheers, Jian On Fri, 21 Jan 2005 18:55:56 +0100, Safarnejad, Ali (AFIS) <[EMAIL PROTECTED]> wrote: > The ChineseAnalyzer tokenizes based on some english stopwords. The > CJKAnalyzer is not much more sophisticated for Chinese Analysis (2 byte > tokenizing). The analyzer I just sent you (using Erik Peterson's > segmenter:), looks up three dictionaries to segment the chinese text, based > on real word matches. > > > -Original Message- > From: news [mailto:[EMAIL PROTECTED] On Behalf Of aurora > Sent: 21 January 2005 18:29 > To: lucene-user@jakarta.apache.org > Subject: Re: Search Chinese in Unicode !!! > > I would love to give it a try. Please email me at aurora00 at gmail.com. > Thanks! > > Also what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some > people actually said the StandardAnalyzer works better. I wonder what's > the pros and cons. > > > I've written a Chinese Analyzer for Lucene that uses a segmenter > > written > > by > > Erik Peterson. However, as the author of the segmenter does not want his > > code > > released under apache open source license (although his code _is_ > > opensource), I cannot place my work in the Lucene Sandbox. 
This is > > unfortunate, because I believe the analyzer works quite well in indexing > > and > > searching chinese docs in GB2312 and UTF-8 encoding, and I like more > > people > > to test, use, and confirm this. So anyone who wants it, can have it. > > Just > > shoot me an email. > > BTW, I also have written an arabic analyzer, which is collecting dust for > > similar reasons. > > Good luck, > > > > Ali Safarnejad > > > > > > -Original Message- > > From: Eric Chow [mailto:[EMAIL PROTECTED] > > Sent: 21 January 2005 11:42 > > To: Lucene Users List > > Subject: Re: Search Chinese in Unicode !!! > > > > > > Search not really correct with UTF-8 !!! > > > > > > The following is the search result that I used the SearchFiles in the > > lucene > > demo. > > > > d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java > > org.apache.lucene.demo.SearchFiles c:\temp\myindex > > Usage: java SearchFiles > > Query: ç > > Searching for: g > > strange ?? > > 3 total matching documents > > 0. ../docs/ChineseDemo.htmlthis files > > contains > > the ç > >- > > 1. ../docs/luceneplan.html > >- Jakarta Lucene - Plan for enhancements to Lucene > > 2. ../docs/api/index-all.html > >- Index (Lucene 1.4.3 API) > > Query: > > > > > > > > From the above result only the ChineseDemo.html includes the character > > that I > > want to search ! > > > > > > > > > > The modified code in SearchFiles.java: > > > > > > BufferedReader in = new BufferedReader(new InputStreamReader(System.in, > > "UTF-8")); > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > -- > Using Opera's revolutionary e-mail client: http://www.opera.com/m2/ > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
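Jian's bi-gram fallback can be sketched in a few lines; this is only an illustration of the idea (the CJKAnalyzer does something similar over runs of CJK characters), with hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of bi-gram tokenization for a run of characters with no word
// boundaries: the characters a b c become the two pseudo-terms "ab", "bc".
class Bigrams {
    static List<String> bigrams(String run) {
        List<String> terms = new ArrayList<String>();
        if (run.length() < 2) {            // a lone character stays a unigram
            if (run.length() == 1) terms.add(run);
            return terms;
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            terms.add(run.substring(i, i + 2));
        }
        return terms;
    }
}
```

The trade-off Jian describes follows directly: every adjacent pair becomes an indexed term, so a large collection accumulates many pseudo-terms that are not real words, which a dictionary-based segmenter avoids.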
thanks for the URLDirectory pointer
lucene-user got blacklisted on SPEW, so I didn't actually get the responses to my last question via email. But I managed to dig them out of the archive, and it should do what I needed. Thanks for the pointer! Bill - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: FOP Generated PDF and PDFBox
Ya, when calling LucenePDFDocument.getDocument( File ) then it should be the same as the path. This is the code that the class uses to set those fields. document.add( Field.UnIndexed("path", file.getPath() ) ); document.add(Field.UnIndexed("url", file.getPath().replace(FILE_SEPARATOR, '/'))); I have no idea why an FOP PDF would be any different than another PDF. You can also run it from the command line, this is just for debugging purposes like this. java org.pdfbox.searchengine.lucene.LucenePDFDocument and it should print out the fields of the lucene Document object. Is the url there and is it correct? Ben On Fri, 21 Jan 2005, Luke Shannon wrote: > That is correct. No difference with how other PDF are handled. > > I am looking at the index in Luke now. The FOP generated documents have a > path but no URL? I would guess that these would be the same? > > Thanks for the speedy reply. > > Luke > > > - Original Message - > From: "Ben Litchfield" <[EMAIL PROTECTED]> > To: "Lucene Users List" > Sent: Friday, January 21, 2005 12:34 PM > Subject: Re: FOP Generated PDF and PDFBox > > > > > > > > Are you indexing the FOP PDF's differently than other PDF documents? > > > > Can I assume that you are using PDFBox's LucenePDFDocument.getDocument() > > method? > > > > Ben > > > > On Fri, 21 Jan 2005, Luke Shannon wrote: > > > > > Hello; > > > > > > Our CMS now allows users to create PDF documents (uses FOP) and than > search > > > them. > > > > > > I seem to be able to index these documents ok. But when I am generating > the > > > results to display I get a Null Pointer Exception while trying to use a > > > variable that should contain the url keyword for one of these documents > in > > > the index: > > > > > > Document doc = hits.doc(i); > > > String path = doc.get("url"); > > > > > > Path contains null. > > > > > > The interesting thing is this only happens with PDF that are generate > with > > > FOP. Other PDFs are fine. 
> > > > > > What I find weird is shouldn't the "url" field just contain the path of > the > > > file? > > > > > > Anyone else seen this before? > > > > > > Any ideas? > > > > > > Thanks, > > > > > > Luke > > > > > > > > > > > > - > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Search Chinese in Unicode !!!
The ChineseAnalyzer tokenizes based on some english stopwords. The CJKAnalyzer is not much more sophisticated for Chinese Analysis (2 byte tokenizing). The analyzer I just sent you (using Erik Peterson's segmenter:), looks up three dictionaries to segment the chinese text, based on real word matches. -Original Message- From: news [mailto:[EMAIL PROTECTED] On Behalf Of aurora Sent: 21 January 2005 18:29 To: lucene-user@jakarta.apache.org Subject: Re: Search Chinese in Unicode !!! I would love to give it a try. Please email me at aurora00 at gmail.com. Thanks! Also what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some people actually said the StandardAnalyzer works better. I wonder what's the pros and cons. > I've written a Chinese Analyzer for Lucene that uses a segmenter > written > by > Erik Peterson. However, as the author of the segmenter does not want his > code > released under apache open source license (although his code _is_ > opensource), I cannot place my work in the Lucene Sandbox. This is > unfortunate, because I believe the analyzer works quite well in indexing > and > searching chinese docs in GB2312 and UTF-8 encoding, and I like more > people > to test, use, and confirm this. So anyone who wants it, can have it. > Just > shoot me an email. > BTW, I also have written an arabic analyzer, which is collecting dust for > similar reasons. > Good luck, > > Ali Safarnejad > > > -Original Message- > From: Eric Chow [mailto:[EMAIL PROTECTED] > Sent: 21 January 2005 11:42 > To: Lucene Users List > Subject: Re: Search Chinese in Unicode !!! > > Search not really correct with UTF-8 !!! > > The following is the search result that I used the SearchFiles in the > lucene > demo. > > d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java > org.apache.lucene.demo.SearchFiles c:\temp\myindex > Usage: java SearchFiles > Query: ç > Searching for: g > strange ?? > 3 total matching documents > 0. ../docs/ChineseDemo.htmlthis files > contains > the ç > - > 1. ../docs/luceneplan.html > - Jakarta Lucene - Plan for enhancements to Lucene > 2. ../docs/api/index-all.html > - Index (Lucene 1.4.3 API) > Query: > > From the above result only the ChineseDemo.html includes the character > that I > want to search ! > > The modified code in SearchFiles.java: > > BufferedReader in = new BufferedReader(new InputStreamReader(System.in, > "UTF-8")); > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] -- Using Opera's revolutionary e-mail client: http://www.opera.com/m2/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene and multiple languages
Thanks. I would like to give it a try. Is the source code available? I'm using a Python version of Lucene so it would need to be wrapped or ported :) Hi Aurora I developed a tool that deals with this multiple-languages issue. I found the Nutch library "language-identifier" very useful. The jar has Nutch dependencies, but I deleted all the code that was unnecessary (for me, obviously). The language-identifier I use works fine and is very simple. For example: LanguageIdentifier languageIdentifier = LanguageIdentifier.getInstance(); String userInputText = "free text"; String language = languageIdentifier.identify(userInputText); This works for 11 languages: English, Spanish, Portuguese, Dutch, German, French, Italian, and others. I can send you this modified jar, but remember that this jar is from Nutch, for copyright (or left :). http://www.nutch.org/LICENSE.txt More comments below... aurora wrote: I'm trying to build some web search tool that could work for multiple languages. I understand that Lucene is shipped with StandardAnalyzer plus German and Russian analyzers and some more in the sandbox, and that indexing and searching should use the same analyzer. Now let's say I have an index with documents in multiple languages, analyzed by an assortment of analyzers. When a user enters a query, which analyzer should be used? Should the user be asked for the language upfront? What should one expect when the analyzer and the document don't match? Let's say the query is parsed using StandardAnalyzer. Would it match any documents indexed with the German analyzer at all, or would it end up with poor results? When this happens, in most cases you do not get matches. Also, is there a good way to find out the languages used in a web page? There is a 'Content-Language' header in HTTP and a 'lang' attribute in HTML, but it looks like people don't really use them. How can we recognize the language? With the language identifier. :) Even more interesting is multiple languages used in one document, let's say half English and half French. Is there a good way to deal with those cases? The language identifier only returns one language. I looked into language-identifier: it works with a score for each language and returns the language with the greatest value. Maybe you can modify language-identifier to take the highest values. Bye Ernesto. Thanks for any guidance. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Using Opera's revolutionary e-mail client: http://www.opera.com/m2/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: FOP Generated PDF and PDFBox
That is correct. No difference with how other PDF are handled. I am looking at the index in Luke now. The FOP generated documents have a path but no URL? I would guess that these would be the same? Thanks for the speedy reply. Luke - Original Message - From: "Ben Litchfield" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Friday, January 21, 2005 12:34 PM Subject: Re: FOP Generated PDF and PDFBox > > > Are you indexing the FOP PDF's differently than other PDF documents? > > Can I assume that you are using PDFBox's LucenePDFDocument.getDocument() > method? > > Ben > > On Fri, 21 Jan 2005, Luke Shannon wrote: > > > Hello; > > > > Our CMS now allows users to create PDF documents (uses FOP) and than search > > them. > > > > I seem to be able to index these documents ok. But when I am generating the > > results to display I get a Null Pointer Exception while trying to use a > > variable that should contain the url keyword for one of these documents in > > the index: > > > > Document doc = hits.doc(i); > > String path = doc.get("url"); > > > > Path contains null. > > > > The interesting thing is this only happens with PDF that are generate with > > FOP. Other PDFs are fine. > > > > What I find weird is shouldn't the "url" field just contain the path of the > > file? > > > > Anyone else seen this before? > > > > Any ideas? > > > > Thanks, > > > > Luke > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: FOP Generated PDF and PDFBox
Are you indexing the FOP PDF's differently than other PDF documents? Can I assume that you are using PDFBox's LucenePDFDocument.getDocument() method? Ben On Fri, 21 Jan 2005, Luke Shannon wrote: > Hello; > > Our CMS now allows users to create PDF documents (uses FOP) and than search > them. > > I seem to be able to index these documents ok. But when I am generating the > results to display I get a Null Pointer Exception while trying to use a > variable that should contain the url keyword for one of these documents in > the index: > > Document doc = hits.doc(i); > String path = doc.get("url"); > > Path contains null. > > The interesting thing is this only happens with PDF that are generate with > FOP. Other PDFs are fine. > > What I find weird is shouldn't the "url" field just contain the path of the > file? > > Anyone else seen this before? > > Any ideas? > > Thanks, > > Luke > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Search Chinese in Unicode !!!
I would love to give it a try. Please email me at aurora00 at gmail.com. Thanks! Also what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some people actually said the StandardAnalyzer works better. I wonder what's the pros and cons. I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under apache open source license (although his code _is_ opensource), I cannot place my work in the Lucene Sandbox. This is unfortunate, because I believe the analyzer works quite well in indexing and searching chinese docs in GB2312 and UTF-8 encoding, and I like more people to test, use, and confirm this. So anyone who wants it, can have it. Just shoot me an email. BTW, I also have written an arabic analyzer, which is collecting dust for similar reasons. Good luck, Ali Safarnejad -Original Message- From: Eric Chow [mailto:[EMAIL PROTECTED] Sent: 21 January 2005 11:42 To: Lucene Users List Subject: Re: Search Chinese in Unicode !!! Search not really correct with UTF-8 !!! The following is the search result that I used the SearchFiles in the lucene demo. d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java org.apache.lucene.demo.SearchFiles c:\temp\myindex Usage: java SearchFiles Query: ç Searching for: g strange ?? 3 total matching documents 0. ../docs/ChineseDemo.htmlthis files contains the ç - 1. ../docs/luceneplan.html - Jakarta Lucene - Plan for enhancements to Lucene 2. ../docs/api/index-all.html - Index (Lucene 1.4.3 API) Query: From the above result only the ChineseDemo.html includes the character that I want to search ! 
The modified code in SearchFiles.java: BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8")); -- Using Opera's revolutionary e-mail client: http://www.opera.com/m2/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
FOP Generated PDF and PDFBox
Hello; Our CMS now allows users to create PDF documents (using FOP) and then search them. I seem to be able to index these documents OK. But when I am generating the results to display, I get a NullPointerException while trying to use a variable that should contain the url keyword for one of these documents in the index: Document doc = hits.doc(i); String path = doc.get("url"); Path contains null. The interesting thing is this only happens with PDFs that are generated with FOP. Other PDFs are fine. What I find weird is: shouldn't the "url" field just contain the path of the file? Anyone else seen this before? Any ideas? Thanks, Luke - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
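One way to debug a case like this, sketched against the 1.4-era Hits API used in the snippet above (the class and method names here are illustrative, not part of PDFBox or Lucene): guard against the missing stored field and print which fields each hit actually carries.

```java
import java.util.Enumeration;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Hits;

public class ResultDebugger {
    // Hypothetical helper: list hits, tolerating documents whose "url"
    // field was never stored (as seems to happen with the FOP PDFs).
    public static void print(Hits hits) throws Exception {
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            String path = doc.get("url");
            if (path == null) {
                System.out.println(i + ". <no url field> -- stored fields are:");
                for (Enumeration e = doc.fields(); e.hasMoreElements();) {
                    System.out.println("   " + ((Field) e.nextElement()).name());
                }
            } else {
                System.out.println(i + ". " + path);
            }
        }
    }
}
```

Seeing which fields the FOP-generated documents do carry should show whether they were indexed under a different field name than the other PDFs.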
RE: Search Chinese in Unicode !!!
If you are hosting the code somewhere (e.g. your site, SF, java.net, etc.), we should link to them from one of the Lucene pages where we link to related external tools, apps, and such. Otis --- "Safarnejad, Ali (AFIS)" <[EMAIL PROTECTED]> wrote: > I've written a Chinese Analyzer for Lucene that uses a segmenter > written by > Erik Peterson. However, as the author of the segmenter does not want > his code > released under apache open source license (although his code _is_ > opensource), I cannot place my work in the Lucene Sandbox. This is > unfortunate, because I believe the analyzer works quite well in > indexing and > searching chinese docs in GB2312 and UTF-8 encoding, and I like more > people > to test, use, and confirm this. So anyone who wants it, can have it. > Just > shoot me an email. > BTW, I also have written an arabic analyzer, which is collecting dust > for > similar reasons. > Good luck, > > Ali Safarnejad > > > -Original Message- > From: Eric Chow [mailto:[EMAIL PROTECTED] > Sent: 21 January 2005 11:42 > To: Lucene Users List > Subject: Re: Search Chinese in Unicode !!! > > > Search not really correct with UTF-8 !!! > > > The following is the search result that I used the SearchFiles in the > lucene > demo. > > d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java > org.apache.lucene.demo.SearchFiles c:\temp\myindex > Usage: java SearchFiles > Query: å´ > Searching for: g > strange ?? > 3 total matching documents > 0. ../docs/ChineseDemo.htmlthis files > contains > the å´ >- > 1. ../docs/luceneplan.html >- Jakarta Lucene - Plan for enhancements to Lucene > 2. ../docs/api/index-all.html >- Index (Lucene 1.4.3 API) > Query: > > > > From the above result only the ChineseDemo.html includes the > character that I > want to search ! 
> > > > > The modified code in SearchFiles.java: > > > BufferedReader in = new BufferedReader(new > InputStreamReader(System.in, > "UTF-8")); > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Suggestion needed for extranet search
Free as in orange juice. Otis --- "Ranjan K. Baisak" <[EMAIL PROTECTED]> wrote: > Otis, > Thanks for your help. Is nutch a freeware tool? > > regards, > Ranjan > --- Otis Gospodnetic <[EMAIL PROTECTED]> > wrote: > > > Hi Ranjan, > > > > It sounds like you are should look at and use Nutch: > > http://www.nutch.org > > > > Otis > > > > --- "Ranjan K. Baisak" <[EMAIL PROTECTED]> > > wrote: > > > > > I am planning to move to Lucene but not have much > > > knowledge on the same. The search engine which I > > had > > > developed is searching some extranet URLs e.g. > > > codeguru.com/index.html. Is is possible to get the > > > same functionality using Lucene. So basically can > > I > > > make Lucene as a search engine to search > > extranets. > > > > > > regards, > > > Ranjan > > > > > > __ > > > Do You Yahoo!? > > > Tired of spam? Yahoo! Mail has the best spam > > protection around > > > http://mail.yahoo.com > > > > > > > > > - > > > To unsubscribe, e-mail: > > [EMAIL PROTECTED] > > > For additional commands, e-mail: > > [EMAIL PROTECTED] > > > > > > > > > > > > > - > > To unsubscribe, e-mail: > > [EMAIL PROTECTED] > > For additional commands, e-mail: > > [EMAIL PROTECTED] > > > > > > > > > __ > Do you Yahoo!? > The all-new My Yahoo! - What will yours do? > http://my.yahoo.com > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Search Chinese in Unicode !!!
I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under apache open source license (although his code _is_ opensource), I cannot place my work in the Lucene Sandbox. This is unfortunate, because I believe the analyzer works quite well in indexing and searching chinese docs in GB2312 and UTF-8 encoding, and I like more people to test, use, and confirm this. So anyone who wants it, can have it. Just shoot me an email. BTW, I also have written an arabic analyzer, which is collecting dust for similar reasons. Good luck, Ali Safarnejad -Original Message- From: Eric Chow [mailto:[EMAIL PROTECTED] Sent: 21 January 2005 11:42 To: Lucene Users List Subject: Re: Search Chinese in Unicode !!! Search not really correct with UTF-8 !!! The following is the search result that I used the SearchFiles in the lucene demo. d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java org.apache.lucene.demo.SearchFiles c:\temp\myindex Usage: java SearchFiles Query: ç Searching for: g strange ?? 3 total matching documents 0. ../docs/ChineseDemo.htmlthis files contains the ç - 1. ../docs/luceneplan.html - Jakarta Lucene - Plan for enhancements to Lucene 2. ../docs/api/index-all.html - Index (Lucene 1.4.3 API) Query: From the above result only the ChineseDemo.html includes the character that I want to search ! The modified code in SearchFiles.java: BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8")); - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Concurrent read and write
Hello Ashley, You can read/search while modifying the index, but you have to ensure only one thread or only one process is modifying an index at any given time. Both IndexReader and IndexWriter can be used to modify an index. The former to delete Documents and the latter to add them. You have to ensure these two operations don't overlap. c.f. http://www.lucenebook.com/search?query=concurrent Otis --- Ashley Steigerwalt <[EMAIL PROTECTED]> wrote: > I am a little fuzzy on the thread-safeness of Lucene, or maybe just > java. > From what I understand, and correct me if I'm wrong, Lucene takes > care of > concurrency issues and it is ok to run a query while writing to an > index. > > My question is, does this still hold true if the reader and writer > are being > executed as separate programs? I have a cron job that will update > the index > periodically. I also have a search application on a web form. Is > this going > to cause trouble if someone runs a query while the indexer is > updating? > > Ashley > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
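Otis's rule above — reads may overlap writes, but deletes (via IndexReader) and adds (via IndexWriter) must never overlap each other — can be sketched with a single in-process lock. This is only a sketch under the assumption that all modifications go through this one class in a single JVM; coordinating separate processes (as in the cron-job question) needs an external mechanism on top of Lucene's own lock files.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IndexUpdater {
    private final Object modLock = new Object(); // serializes all modifications
    private final String indexDir;               // assumed path to an existing index

    public IndexUpdater(String indexDir) { this.indexDir = indexDir; }

    public void delete(Term term) throws Exception {
        synchronized (modLock) {                 // no writer may run now
            IndexReader reader = IndexReader.open(indexDir);
            try { reader.delete(term); } finally { reader.close(); }
        }
    }

    public void add(Document doc) throws Exception {
        synchronized (modLock) {                 // no deleting reader may run now
            IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
            try { writer.addDocument(doc); } finally { writer.close(); }
        }
    }
    // Searching needs no lock: an IndexSearcher sees the index as of the
    // moment it was opened, even while modifications proceed.
}
```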
Re: Concurrent read and write
Hi, My limited experience shows that reading/searching in a servlet at the "same" time as writing to the index from an application (e.g. by a scheduled script) works very well. The only thing that has caused me problems is applications (e.g. cron-started) writing to the index that "crash" while the write lock is in effect. (The "crash" is in my case often caused by bad socket programming, and has nothing to do with Lucene.) The following scheduled applications will then, of course, not be able to update the index. cheers Clas / frisim.com On Fri, 21 Jan 2005 09:57:22 -0500, Ashley Steigerwalt <[EMAIL PROTECTED]> wrote: > I am a little fuzzy on the thread-safeness of Lucene, or maybe just java. > From what I understand, and correct me if I'm wrong, Lucene takes care of > concurrency issues and it is ok to run a query while writing to an index. > > My question is, does this still hold true if the reader and writer are being > executed as separate programs? I have a cron job that will update the index > periodically. I also have a search application on a web form. Is this going > to cause trouble if someone runs a query while the indexer is updating? > > Ashley - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
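For the crashed-writer case Clas describes, a scheduled job could clear a stale write lock before updating. A hedged sketch using the 1.4-era IndexReader.isLocked/unlock calls; the path is illustrative, and this is only safe when you know no other process is actually writing, since forcing the lock away from a live writer can corrupt the index.

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class StaleLockCleaner {
    public static void main(String[] args) throws Exception {
        // Path is illustrative; point it at the index your cron job updates.
        Directory dir = FSDirectory.getDirectory("/path/to/index", false);
        if (IndexReader.isLocked(dir)) {
            // Assumes the previous indexer crashed: forcibly remove its lock.
            IndexReader.unlock(dir);
        }
        dir.close();
    }
}
```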
Re: Closed IndexWriter reuse
--- Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > No, you can't add documents to an index once you close the IndexWriter. > You can re-open the IndexWriter and add more documents, of course. > > Otis

That's what I expected at first, but:

1- It's a disappointment, because such a 'feature' would have made IndexWriter management much easier. You would open an IndexWriter at startup and reuse it during the whole life of the application, just flushing on a regular basis using the close() method and without worrying whether other objects are currently using the writer.

2- When you say you can't add, do you mean it's impossible, or that you shouldn't because, for example, it could corrupt the index? Maybe I'm wrong, but I think it's possible. Let's look at the following code:

public static void main(String[] args) throws IOException {
    final IndexWriter writer1 = new IndexWriter("/tmp/test-reuse", new StandardAnalyzer(), true);

    // First write with the writer
    Document doc = new Document();
    doc.add(new Field("name", "John", Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer1.addDocument(doc);

    System.out.println("1 After first write, before closing the writer ---");
    Searcher searcher = new IndexSearcher("/tmp/test-reuse");
    Query query = new TermQuery(new Term("name", "John"));
    Hits hits = searcher.search(query);
    System.out.println("===> hits: " + hits.length());
    System.out.println();

    // CLOSING THE WRITER ONCE
    writer1.close();

    System.out.println("2 After first write, after closing the writer ---");
    searcher = new IndexSearcher("/tmp/test-reuse");
    hits = searcher.search(query);
    System.out.println("===> hits: " + hits.length());
    System.out.println();

    // Second write, THE WRITER HAS ALREADY BEEN CLOSED ONCE
    writer1.addDocument(doc);

    System.out.println("3 After second write, the writer has been closed once ---");
    hits = searcher.search(query);
    System.out.println("===> hits: " + hits.length());
    System.out.println();

    // Closing the writer again
    writer1.close();

    System.out.println("4 After second write, the writer has been closed twice ---");
    searcher = new IndexSearcher("/tmp/test-reuse");
    hits = searcher.search(query);
    System.out.println("===> hits: " + hits.length());
}

== Results ==

1 After first write, before closing the writer ---
===> hits: 0

2 After first write, after closing the writer ---
===> hits: 1

3 After second write, the writer has been closed once ---
===> hits: 1

4 After second write, the writer has been closed twice ---
===> hits: 2

As you can see, not only does the code above execute without complaint, but it also gives the right results. Thanks for your comments. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
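Even if calling addDocument() on a closed writer happens to run, as the experiment above shows, Otis's advice suggests the supported flush pattern is to close and re-open rather than reuse. A minimal sketch (path and analyzer are just the ones from the experiment, not requirements):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ReopenPattern {
    public static void main(String[] args) throws Exception {
        Document doc = new Document();
        doc.add(new Field("name", "John", Field.Store.YES, Field.Index.UN_TOKENIZED));

        IndexWriter writer = new IndexWriter("/tmp/test-reuse", new StandardAnalyzer(), true);
        writer.addDocument(doc);
        writer.close();   // flush: searchers opened after this see the document

        // Re-open on the same directory with create=false instead of
        // reusing the closed writer -- this writer takes a fresh write lock.
        writer = new IndexWriter("/tmp/test-reuse", new StandardAnalyzer(), false);
        writer.addDocument(doc);
        writer.close();
    }
}
```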
Concurrent read and write
I am a little fuzzy on the thread-safeness of Lucene, or maybe just java. From what I understand, and correct me if I'm wrong, Lucene takes care of concurrency issues and it is ok to run a query while writing to an index. My question is, does this still hold true if the reader and writer are being executed as separate programs? I have a cron job that will update the index periodically. I also have a search application on a web form. Is this going to cause trouble if someone runs a query while the indexer is updating? Ashley - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Stemming
OK, OK ... I'll buy the book. I guess it's about time, since I am deeply and forever in love with Lucene. Might as well take the final plunge. -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Friday, January 21, 2005 9:12 AM To: Lucene Users List Subject: Re: Stemming Hi Kevin, Stemming is an optional operation and is done in the analysis step. Lucene comes with a Porter stemmer and a Filter that you can use in an Analyzer: ./src/java/org/apache/lucene/analysis/PorterStemFilter.java ./src/java/org/apache/lucene/analysis/PorterStemmer.java You can find more about it here: http://www.lucenebook.com/search?query=stemming You can also see mentions of SnowballAnalyzer in those search results, and you can find an adapter for SnowballAnalyzers in the Lucene Sandbox. Otis --- "Kevin L. Cobb" <[EMAIL PROTECTED]> wrote: > I want to understand how Lucene uses stemming but can't find any > documentation on the Lucene site. I'll continue to google but hope > that this list can help narrow my search. I have several questions on the > subject currently but hesitate to list them here since finding a good > document on the subject may answer most of them. > > Thanks in advance for any pointers, > > Kevin - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Suggestion needed for extranet search
Otis, Thanks for your help. Is nutch a freeware tool? regards, Ranjan --- Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Hi Ranjan, > > It sounds like you are should look at and use Nutch: > http://www.nutch.org > > Otis > > --- "Ranjan K. Baisak" <[EMAIL PROTECTED]> > wrote: > > > I am planning to move to Lucene but not have much > > knowledge on the same. The search engine which I > had > > developed is searching some extranet URLs e.g. > > codeguru.com/index.html. Is is possible to get the > > same functionality using Lucene. So basically can > I > > make Lucene as a search engine to search > extranets. > > > > regards, > > Ranjan > > > > __ > > Do You Yahoo!? > > Tired of spam? Yahoo! Mail has the best spam > protection around > > http://mail.yahoo.com > > > > > - > > To unsubscribe, e-mail: > [EMAIL PROTECTED] > > For additional commands, e-mail: > [EMAIL PROTECTED] > > > > > > > - > To unsubscribe, e-mail: > [EMAIL PROTECTED] > For additional commands, e-mail: > [EMAIL PROTECTED] > > __ Do you Yahoo!? The all-new My Yahoo! - What will yours do? http://my.yahoo.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Search on heterogeneous index
Hello all. I'm new to Lucene and am thinking about using it in my project. I have price lists with a dynamic structure containing wares: about 10K price lists with 500K wares in total. Each price list has about 5 text fields. I'll do searches on wares. The difficult part is that I'll search across all wares; the search is not bound to a particular price list structure. My question is, how should I organize my indices? Can Lucene handle data effectively if I have one index containing different Fields in Documents? Or should I create a separate index for each price list, with the same Field structure across Documents? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
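Lucene handles this in one index: a Document carries only the Fields you add to it, and documents lacking a queried field simply never match that clause. A sketch (the field names and the "price_id" tag are assumptions for illustration; the Field.Store/Field.Index style follows other snippets in this archive):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class HeterogeneousDocs {
    // Build a ware Document with the fields every ware shares.
    public static Document ware(String priceId, String name) {
        Document doc = new Document();
        doc.add(new Field("price_id", priceId, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("name", name, Field.Store.YES, Field.Index.TOKENIZED));
        return doc;
    }

    public static void main(String[] args) {
        Document shoe = ware("price-017", "running shoe");
        shoe.add(new Field("size", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));

        Document apple = ware("price-101", "apple");
        apple.add(new Field("variety", "fuji", Field.Store.YES, Field.Index.TOKENIZED));
        // Both can go to the same IndexWriter: a query on "name" spans all
        // wares, while "price_id" recovers which price list a hit came from.
    }
}
```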
Re: Suggestion needed for extranet search
Hi Ranjan, It sounds like you should look at and use Nutch: http://www.nutch.org Otis --- "Ranjan K. Baisak" <[EMAIL PROTECTED]> wrote: > I am planning to move to Lucene but do not have much > knowledge of it. The search engine which I had > developed is searching some extranet URLs e.g. > codeguru.com/index.html. Is it possible to get the > same functionality using Lucene? So basically, can I > make Lucene a search engine to search > extranets? > > regards, > Ranjan > > __ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Filtering w/ Multiple Terms
This: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.TooManyClauses.html ? You can control that limit via http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.html#maxClauseCount Otis --- Jerry Jalenak <[EMAIL PROTECTED]> wrote: > OK. But isn't there a limit on the number of BooleanQueries that can > be > combined with AND / OR / etc? > > > > Jerry Jalenak > Senior Programmer / Analyst, Web Publishing > LabOne, Inc. > 10101 Renner Blvd. > Lenexa, KS 66219 > (913) 577-1496 > > [EMAIL PROTECTED] > > > > -Original Message- > > From: Erik Hatcher [mailto:[EMAIL PROTECTED] > > Sent: Thursday, January 20, 2005 5:05 PM > > To: Lucene Users List > > Subject: Re: Filtering w/ Multiple Terms > > > > > > > > On Jan 20, 2005, at 5:02 PM, Jerry Jalenak wrote: > > > > > In looking at the examples for filtering of hits, it looks > > like I can > > > only > > > specify a single term; i.e. > > > > > > Filter f = new QueryFilter(new TermQuery(new Term("acct", > > > "acct1"))); > > > > > > I need to specify more than one term in my filter. Short of > using > > > something > > > like ChainFilter, how are others handling this? > > > > You can make as complex of a Query as you want for > > QueryFilter. If you > > want to filter on multiple terms, construct a BooleanQuery > > with nested > > TermQuery's, either in an AND or OR fashion. > > > > Erik > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: > [EMAIL PROTECTED] > > > > > > This transmission (and any information attached to it) may be > confidential and > is intended solely for the use of the individual or entity to which > it is > addressed. 
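Combining the two answers in this thread, a sketch of a multi-term filter with the clause ceiling raised first. It uses the 1.4-era BooleanQuery.add(query, required, prohibited) signature; the field name, account values, and the 4096 limit are made up for illustration.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class MultiTermFilter {
    public static Filter forAccounts(String[] accts) {
        // Raise the ceiling if the filter may need more than the default
        // 1024 clauses (BooleanQuery.TooManyClauses otherwise).
        BooleanQuery.setMaxClauseCount(4096);

        BooleanQuery q = new BooleanQuery();
        for (int i = 0; i < accts.length; i++) {
            // required=false, prohibited=false => clauses are OR-ed
            q.add(new TermQuery(new Term("acct", accts[i])), false, false);
        }
        return new QueryFilter(q);
    }
}
```

The result is passed as the second argument of Searcher.search(query, filter).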
If you are not the intended recipient or the person > responsible for > delivering the transmission to the intended recipient, be advised > that you > have received this transmission in error and that any use, > dissemination, > forwarding, printing, or copying of this information is strictly > prohibited. > If you have received this transmission in error, please immediately > notify > LabOne at the following email address: > [EMAIL PROTECTED] > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Stemming
Hi Kevin, Stemming is an optional operation and is done in the analysis step. Lucene comes with a Porter stemmer and a Filter that you can use in an Analyzer: ./src/java/org/apache/lucene/analysis/PorterStemFilter.java ./src/java/org/apache/lucene/analysis/PorterStemmer.java You can find more about it here: http://www.lucenebook.com/search?query=stemming You can also see mentions of SnowballAnalyzer in those search results, and you can find an adapter for SnowballAnalyzers in Lucene Sandbox. Otis --- "Kevin L. Cobb" <[EMAIL PROTECTED]> wrote: > I want to understand how Lucene uses stemming but can't find any > documentation on the Lucene site. I'll continue to google but hope > that > this list can help narrow my search. I have several questions on the > subject currently but hesitate to list them here since finding a good > document on the subject may answer most of them. > > > > Thanks in advance for any pointers, > > > > Kevin > > > > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
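A minimal analyzer wiring the PorterStemFilter Otis mentions into an analysis chain — a sketch in which the tokenizer choice (LowerCaseTokenizer) is an assumption, not the only option; the key point is to use the same analyzer at both index and query time so the stems line up.

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

public class PorterAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split on non-letters and lowercase, then stem:
        // "generates", "generated" -> "generat"
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}
```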
Stemming
I want to understand how Lucene uses stemming but can't find any documentation on the Lucene site. I'll continue to google but hope that this list can help narrow my search. I have several questions on the subject currently but hesitate to list them here since finding a good document on the subject may answer most of them. Thanks in advance for any pointers, Kevin
RE: Filtering w/ Multiple Terms
OK. But isn't there a limit on the number of BooleanQueries that can be combined with AND / OR / etc? Jerry Jalenak Senior Programmer / Analyst, Web Publishing LabOne, Inc. 10101 Renner Blvd. Lenexa, KS 66219 (913) 577-1496 [EMAIL PROTECTED] > -Original Message- > From: Erik Hatcher [mailto:[EMAIL PROTECTED] > Sent: Thursday, January 20, 2005 5:05 PM > To: Lucene Users List > Subject: Re: Filtering w/ Multiple Terms > > > > On Jan 20, 2005, at 5:02 PM, Jerry Jalenak wrote: > > > In looking at the examples for filtering of hits, it looks > like I can > > only > > specify a single term; i.e. > > > > Filter f = new QueryFilter(new TermQuery(new Term("acct", > > "acct1"))); > > > > I need to specify more than one term in my filter. Short of using > > something > > like ChainFilter, how are others handling this? > > You can make as complex of a Query as you want for > QueryFilter. If you > want to filter on multiple terms, construct a BooleanQuery > with nested > TermQuery's, either in an AND or OR fashion. > > Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Search Chinese in Unicode !!!
On Jan 21, 2005, at 11:42, Eric Chow wrote: Search not really correct with UTF-8 !!! Lucene works just fine with any flavor of Unicode as long as _your_ application knows how to consistently deal with Unicode as well. Remember: the world is not just one Big5 pile. As far as Analyzer goes, you may or may not be better off using something more tailored to your linguistic needs. That said, even the default Analyzer does a fairly decent job at handling non-roman languages. YMMV. Cheers -- PA http://alt.textdrive.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!
>>1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) >>apparently produces non-word stems .. i.e. not really human readable. It is possible to derive the human-readable form of a stemmed term using either re-analysis of indexed content or TermPositionVector. Either of these techniques should give you the position data required to discover the original form. The highlighter package is one example of where this technique is used. Cheers Mark ___ ALL-NEW Yahoo! Messenger - all new features - even more fun! http://uk.messenger.yahoo.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
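The re-analysis idea Mark mentions can be sketched by running two token streams over the same stored text — one stemmed, one not — built from identical tokenizers so they stay position-aligned. The class name and the LowerCaseTokenizer/PorterStemFilter chain are illustrative assumptions; a chain that drops tokens (e.g. a stop filter) would break the alignment this sketch relies on.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class StemDisplayForms {
    // Map each stem to the first surface form that produced it,
    // e.g. "generat" -> "generates", for display in a navigation UI.
    public static Map stemToWord(String text) throws IOException {
        TokenStream plain = new LowerCaseTokenizer(new StringReader(text));
        TokenStream stemmed =
            new PorterStemFilter(new LowerCaseTokenizer(new StringReader(text)));
        Map map = new HashMap();
        for (Token s = stemmed.next(), p = plain.next();
             s != null && p != null;
             s = stemmed.next(), p = plain.next()) {
            if (!map.containsKey(s.termText())) {
                map.put(s.termText(), p.termText());
            }
        }
        return map;
    }
}
```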
Re: Search Chinese in Unicode !!!
Search not really correct with UTF-8 !!! The following is the search result that I used the SearchFiles in the lucene demo. d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java org.apache.lucene.demo.SearchFiles c:\temp\myindex Usage: java SearchFiles Query: ç Searching for: g strange ?? 3 total matching documents 0. ../docs/ChineseDemo.htmlthis files contains the ç - 1. ../docs/luceneplan.html - Jakarta Lucene - Plan for enhancements to Lucene 2. ../docs/api/index-all.html - Index (Lucene 1.4.3 API) Query: From the above result only the ChineseDemo.html includes the character that I want to search ! The modified code in SearchFiles.java: BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8")); - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
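The fix quoted above works because the problem is byte decoding, not Lucene: System.in must be decoded as UTF-8 before the query string ever reaches the analyzer. A stdlib-only sketch showing the difference between the raw UTF-8 bytes and the characters Java should see (the sample bytes encode two Chinese characters):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class Utf8ReadDemo {
    // Decode a UTF-8 byte stream into Java's internal Unicode form,
    // exactly as the SearchFiles fix does for System.in.
    static String readLineUtf8(InputStream in) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        return r.readLine();
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 bytes for the two characters U+4E2D U+6587 plus a newline.
        byte[] utf8 = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD,
                       (byte) 0xE6, (byte) 0x96, (byte) 0x87, '\n'};
        String s = readLineUtf8(new ByteArrayInputStream(utf8));
        System.out.println(s.length());        // prints 2: characters, not the 6 bytes
        System.out.println((int) s.charAt(0)); // prints 20013, i.e. U+4E2D
    }
}
```

If the reader is built without the "UTF-8" argument, the platform default (often a Latin or GB encoding) is used instead, which is exactly the "g strange ??" garbling shown in the output above.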
Re: How works *
On Fri, 2005-01-21 at 10:58 +0100, Bertrand VENZAL wrote: > I wondered how lucene implement the * character, I know that is working > but when I look at the Query Object, it doesn t seem to appear somewhere, > does someone know how is it implemented ? Take a look at the PrefixQuery and WildcardQuery. PrefixQuery works by finding all terms beginning with the query then constructing a boolean query of them. I assume WildcardQuery works in a similar way. If you have several terms or a short prefix (e.g. a*) you might need to increase the maximum number of clauses allowed in a boolean query because the number of terms might exceed the default (i.e. 1024). -- Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
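Miles's description can be sketched directly: a PrefixQuery stands in for a trailing *, and because it rewrites against the index into a BooleanQuery over every matching term, a short prefix may need the clause limit raised first. The field name and the 10000 limit are illustrative assumptions.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

public class PrefixExample {
    // Equivalent of the user typing "prefix*" in the query syntax.
    public static Query starQuery(String field, String prefix) {
        // Allow more expanded terms than the default 1024 before searching,
        // in case the prefix matches many terms (e.g. "a*").
        BooleanQuery.setMaxClauseCount(10000);
        return new PrefixQuery(new Term(field, prefix));
    }
}
```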
Re: Search Chinese in Unicode !!!
On Jan 21, 2005, at 4:49 AM, Eric Chow wrote: How to create index with chinese (in utf-8 encoding ) HTML and search with Lucene ? Indexing and searching Chinese basically is no different than using English with Lucene. We covered a bit about it in Lucene in Action: http://www.lucenebook.com/search?query=chinese And a screenshot here: http://www.blogscene.org/erik/LuceneInAction/i18n.html The main issues of dealing with Chinese, and of course other languages, are encoding concerns in both indexing and querying of reading in the text and analysis (as you can see from the screenshot). Lucene itself works with Unicode fine and you're free to index anything. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
How works *
Hi, I wonder how Lucene implements the * character. I know that it works, but when I look at the Query object it doesn't seem to appear anywhere. Does anyone know how it is implemented? thanks
Search Chinese in Unicode !!!
How do I create an index from Chinese HTML (in UTF-8 encoding) and search it with Lucene? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!
Morus Walter wrote: Owen Densmore writes: 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems .. i.e. not really human readable. (Example: generate, generates, generated, generating -> generat) Although in typical queries this is not important because the result of the search is a document list, it *would* be important if we use the stems within a graphical navigation interface. So the question is: Is there a way to have the stemmer produce english base forms of the words being stemmed? rule based stemmers such as porter/snowball cannot do that. But there are (commercial) dictionary based tools that can. E.g. the canoo lemmatizer. You might also have a look at egothors stemmer, that are word list based. Egothor stemmers are algorithmic, they only use word lists for training. Stems produced by them are usually closer to lemmas than in e.g. Porter's stemmer, but there is a significant amount of stems like in the example above. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]