Re: Call Lucene default command line Search from PHP script
milu07 wrote:

> Hello, my machine is Ubuntu 7.10 and I am working with Apache Lucene. I have
> finished the indexer and tried the command-line Searcher (the default one
> included in the Lucene package: http://lucene.apache.org/java/2_3_1/demo2.html).
> When I run this at the command line:
>
>     java Searcher -query algorithm
>
> it works and returns a list of results to me. Here 'algorithm' is the keyword
> to search for. However, I want a web search interface written in PHP, so I use
> PHP exec() to call this Searcher from my PHP script:
>
>     exec("java Searcher -query algorithm ", $arr, $retVal);
>
> [I also tried: exec("java Searcher -query 'algorithm' ", $arr, $retVal)]
>
> It does not work; I print the value of $retVal and it is 1. I came back and tried:
>
>     exec("java Searcher -query algorithm 2>&1 ", $arr, $retVal);
>
> and received:
>
>     Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer
>
> and $retVal is still 1. The command-line Searcher.java of Lucene imports many
> classes; is this the problem?
>
>     import org.apache.lucene.analysis.Analyzer;
>     import org.apache.lucene.analysis.standard.StandardAnalyzer;
>
> I guess this is a path problem. However, I do not know how to fix it, because
> it works at the command line (where $CLASSPATH points to the .jar file of the
> Lucene library). Maybe PHP does not know about $CLASSPATH, so I added the
> Lucene jar to $PATH:
>
>     export PATH=$PATH:/usr/lib/lucene-core-2.3.1.jar:/usr/lib
>
> However, I get the same NoClassDefFoundError when I try the exec() call again.
> Could you please help? Thank you.

Using the command line from PHP is a bad idea; a socket is a better way:
https://admin.garambrogne.net/projets/passerelle/browser/trunk/goniometre

M.
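For completeness: exec() runs the command through a non-interactive shell, which does not inherit the interactive $CLASSPATH, and $PATH is never consulted for Java classes at all. So if you do shell out, the classpath has to be passed explicitly on the java command line. An untested sketch, keeping the jar path from the original post (the '.' stands in for wherever Searcher.class actually lives):

    exec("java -cp /usr/lib/lucene-core-2.3.1.jar:. Searcher -query algorithm 2>&1", $arr, $retVal);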
RE: Field values ...
Thanks.

> Date: Mon, 24 Mar 2008 21:03:13 -0700
> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
> Subject: RE: Field values ...
>
> : The Id and Phone fields are stored. So I can just do a MatchAllQuery as
> : you suggested. I have read about field selectors on this mailing list
> : but have never used it. Does anyone know where I can find some sample
> : code? Thank you.
>
> there's a couple of reusable implementations in subversion...
>
> http://www.krugle.org/kse/files?query=%22implements%20FieldSelector%22%20lucene&lang=java&findin=code
>
> -Hoss
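For reference, a FieldSelector does not need to be much code at all. A minimal sketch against the Lucene 2.3 API, assuming the two stored fields from the question are named "id" and "phone":

    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.FieldSelectorResult;

    // Load only the "id" and "phone" stored fields; skip everything else.
    public class IdPhoneSelector implements FieldSelector {
        public FieldSelectorResult accept(String fieldName) {
            if ("id".equals(fieldName) || "phone".equals(fieldName)) {
                return FieldSelectorResult.LOAD;
            }
            return FieldSelectorResult.NO_LOAD;
        }
    }

It is then passed to the document-loading call, e.g. searcher.doc(docId, new IdPhoneSelector()).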
Improving Index Search Performance
Hi Everyone,

We are using Lucene to search an index of around 20G in size, with around 3 million documents. We are facing performance issues loading large results from the index. Based on various posts on the forum and the documentation, we have made the following code changes to improve performance:

i. Modified the code to use HitCollector instead of Hits, since we will be loading all the documents in the index that match the keyword
ii. Added a MapFieldSelector to load only the 2 fields we need instead of all 14

After all these changes, it still takes around 90 secs to load 17k documents. After profiling, we found that most of the time is spent in searcher.doc(id, selector). Here is the code:

    public void collect(int id, float score) {
        try {
            MapFieldSelector selector = new MapFieldSelector(new String[] {COMPANY_ID, ID});
            doc = searcher.doc(id, selector);
            mappedCompanies = doc.getValues(COMPANY_ID);
        } catch (IOException e) {
            logger.debug("inside IDCollector.collect() :" + e.getMessage());
        }
    }

We also read in one of the posts that we should use bitSet.set(doc) instead of calling searcher.doc(id). But we are unable to understand how this might help in our case, since we will anyway have to load the document to get the other required field (company_id). Also, we observed that the searcher is actually using only 1G of RAM, though we have 4G allocated to it.

Can someone suggest any other optimization that can be done to improve the search performance on the MultiSearcher? Any help would be appreciated.

Thanks,
Vipin
Integrating Spell Checker contributed to Lucene
Hi Guys,

Has anybody integrated the Spell Checker contributed to Lucene? I need advice on where to get a free dictionary file (one that contains all words in English) that could be used to create an instance of the PlainTextDictionary class. For my tests I currently use the corresponding files from the Jazzy and JADT projects, but I think I do not have the right to use them officially outside of their applications.

Best Regards,
Ivan
Re: Integrating Spell Checker contributed to Lucene
Ivan Vasilev wrote:
> Hi Guys,
> Has anybody integrated the Spell Checker contributed to Lucene?

http://blog.garambrogne.net/index.php?post/2008/03/07/A-lexicon-approach-for-Lucene-index
https://issues.apache.org/jira/browse/LUCENE-1190

> I need advice on where to get a free dictionary file (one that contains
> all words in English) that could be used to create an instance of the
> PlainTextDictionary class.

A list of "all English words" is nonsense; have a look at WordNet and hunspell.

> For my tests I currently use the corresponding files from the Jazzy and
> JADT projects, but I think I do not have the right to use them officially
> outside of their applications.
>
> Best Regards,
> Ivan
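For anyone integrating it, the contrib spell checker wiring is roughly the following. This is a sketch against the 2.3 contrib API; every path and the field name "contents" are made-up placeholders. Option B sidesteps the dictionary-file question entirely by reusing the terms already in your own index, which is essentially the lexicon approach linked above:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.spell.LuceneDictionary;
    import org.apache.lucene.search.spell.PlainTextDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.FSDirectory;

    public class SpellSetup {
        public static void main(String[] args) throws Exception {
            SpellChecker spell =
                    new SpellChecker(FSDirectory.getDirectory("/tmp/spellindex"));

            // Option A: a plain word list, one word per line
            spell.indexDictionary(new PlainTextDictionary(new File("words.txt")));

            // Option B: skip the external dictionary and reuse the terms
            // already in your own index
            IndexReader reader = IndexReader.open("/path/to/index");
            spell.indexDictionary(new LuceneDictionary(reader, "contents"));

            String[] suggestions = spell.suggestSimilar("disambiguaet", 5);
            for (int i = 0; i < suggestions.length; i++) {
                System.out.println(suggestions[i]);
            }
            reader.close();
        }
    }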
hitcollector topdocs
Hi everybody,

I was searching for information about the HitCollector. I was wondering whether the values of the fields have to be stored or not. I tested it and it worked both ways, but I'm still not really sure about it. Second question: can I work with tokenized fields?

Best regards,
Jens
Re: Improving Index Search Performance
On Tue, 2008-03-25 at 18:13 +0530, Shailendra Mudgal wrote:
> We are using Lucene to search an index of around 20G in size, with around
> 3 million documents. We are facing performance issues loading large
> results from the index.
[...]
> After all these changes, it still takes around 90 secs to load 17k
> documents.
[...]

That's fairly slow. Are you doing any warm-up? It is my experience that it helps tremendously with performance. I tried requesting a stored field from all hits, for all searches with logged queries, on our index (9 million documents, 37GB): no fancy tricks, just Hits and hit.get(fieldname). For the first couple of minutes, using standard hard disks, performance was about 200-300 field requests/second. After that, the speed increased to about 2000-3000 field requests/second. Using solid state drives, the same pattern could be seen, just with a much lower warm-up time before the full speed kicked in.

> Here is the code:
>
>     public void collect(int id, float score) {
>         try {
>             MapFieldSelector selector = new MapFieldSelector(new String[] {COMPANY_ID, ID});
>             doc = searcher.doc(id, selector);
>             mappedCompanies = doc.getValues(COMPANY_ID);
>         } catch (IOException e) {
>             logger.debug("inside IDCollector.collect() :" + e.getMessage());
>         }
>     }

There's no need to initialize the selector on every collect() call. Try moving the initialization outside of the collect method.

> Also we observed that the searcher is actually using only 1G of RAM,
> though we have 4G allocated to it.

The system will (hopefully) utilize the free RAM for disk cache, so the last 3GB are not wasted.
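In code, that change amounts to hoisting the selector into the collector. A sketch; the string values of the field-name constants are made up, since the original post only shows the constants:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.MapFieldSelector;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.Searcher;

    // Reworked collector: the MapFieldSelector is created once per search
    // instead of once per matching document.
    public class IDCollector extends HitCollector {
        private static final String COMPANY_ID = "company_id"; // assumed names
        private static final String ID = "id";

        private final Searcher searcher;
        private final MapFieldSelector selector =
                new MapFieldSelector(new String[] {COMPANY_ID, ID});

        public IDCollector(Searcher searcher) {
            this.searcher = searcher;
        }

        public void collect(int doc, float score) {
            try {
                Document d = searcher.doc(doc, selector);
                String[] mappedCompanies = d.getValues(COMPANY_ID);
                // ... accumulate mappedCompanies as before ...
            } catch (IOException e) {
                // log and continue, as in the original code
            }
        }
    }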
Re: explain() - fieldnorm
Another problem just occurred. These are the results from explain():

0.27576536 = (MATCH) product of:
  0.827296 = (MATCH) sum of:
    0.827296 = (MATCH) sum of:
      0.24544832 = (MATCH) weight(ti:genetik in 1849319), product of:
        0.015469407 = queryWeight(ti:genetik), product of:
          10.577795 = idf(docFreq=270)
          0.0014624415 = queryNorm
        15.866693 = (MATCH) fieldWeight(ti:genetik in 1849319), product of:
          1.0 = tf(termFreq(ti:genetik)=1)
          10.577795 = idf(docFreq=270)
          1.5 = fieldNorm(field=ti, doc=1849319)
      0.58184767 = (MATCH) weight(au:knippers in 1849319), product of:
        0.020028148 = queryWeight(au:knippers), product of:
          13.695007 = idf(docFreq=11)
          0.0014624415 = queryNorm
        29.051497 = (MATCH) fieldWeight(au:knippers in 1849319), product of:
          1.4142135 = tf(termFreq(au:knippers)=2)
          13.695007 = idf(docFreq=11)
          1.5 = fieldNorm(field=au, doc=1849319)
  0.3334 = coord(1/3)

0.27576536 = (MATCH) product of:
  0.827296 = (MATCH) sum of:
    0.827296 = (MATCH) sum of:
      0.24544832 = (MATCH) weight(ti:genetik in 3221603), product of:
        0.015469407 = queryWeight(ti:genetik), product of:
          10.577795 = idf(docFreq=270)
          0.0014624415 = queryNorm
        15.866693 = (MATCH) fieldWeight(ti:genetik in 3221603), product of:
          1.0 = tf(termFreq(ti:genetik)=1)
          10.577795 = idf(docFreq=270)
          1.5 = fieldNorm(field=ti, doc=3221603)
      0.58184767 = (MATCH) weight(au:knippers in 3221603), product of:
        0.020028148 = queryWeight(au:knippers), product of:
          13.695007 = idf(docFreq=11)
          0.0014624415 = queryNorm
        29.051497 = (MATCH) fieldWeight(au:knippers in 3221603), product of:
          1.4142135 = tf(termFreq(au:knippers)=2)
          13.695007 = idf(docFreq=11)
          1.5 = fieldNorm(field=au, doc=3221603)
  0.3334 = coord(1/3)

As you can see, both are exactly the same. What I don't understand is that the two documents have different document boosts (the first one got a boost of 1.62, the second of 1.65; the boosts differ because the two books have different publication years), yet explain() tells me that the fieldNorm value is 1.5 for both. While indexing I use a custom Similarity class whose lengthNorm just returns 1, so the field length should not matter any more.

Best Regards,
Jens Burkhardt

hossman wrote:
> : As my subject says, I have a little problem analyzing the explain()
> : output. I know that the fieldnorm value consists of document boost,
> : field boost and lengthNorm. Is it possible to retrieve the single
> : values? I know that they are multiplied while indexing, but can they
> : be stored so that I can read them when I analyze my search?
>
> the number of terms the docs have in a given field can be determined by
> doing a nested iteration over a TermEnum and TermDocs and keeping count,
> but there is no way to extract the document boost vs the field boost --
> if you want to know what those were later, you have to store them
> yourself (in a stored field perhaps).
>
> : The problem is that I have 2 documents I want to compare, but the only
> : difference is the fieldnorm value, and I don't know which value
> : exactly makes this difference.
>
> typically the answer to that question for me is "length", because I
> don't use field boosts and doc boosts -- if you *do* use field boosts or
> doc boosts, you would typically know what you had, and could check what
> boost values you had used later (based on whatever source you originally
> built your index from)
>
> -Hoss
Re: feedback: Indexing speed improvement lucene 2.2->2.3.1
Uwe,

This is a little off thread-topic, but I was wondering how your search relevance and search performance have fared with this bigram-based index. Is it significantly better than before you used the NGramAnalyzer?

-jake

On 3/24/08, Uwe Goetzke <[EMAIL PROTECTED]> wrote:
> Hi Ivan,
> No, we do not use StandardAnalyzer or StandardTokenizer.
>
> Most data is processed by:
>
>     fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
>     result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter modified so that ö -> oe
>     result = new org.apache.lucene.analysis.LowerCaseFilter(result);
>     result = new org.apache.lucene.analysis.NGramStemFilter(result,2); // just a bigram tokenizer
>
> We use our own query parser. The bigrams are searched with a tolerant
> phrase query, scoring in a doc the greatest bigram clusters covering the
> phrase tokens.
>
> Best Regards
>
> Uwe
>
> -Original Message-
> From: Ivan Vasilev [mailto:[EMAIL PROTECTED]
> Sent: Friday, 21 March 2008 16:25
> To: java-user@lucene.apache.org
> Subject: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1
>
> Hi Uwe,
>
> Could you tell us what Analyzer you used when you measured so big an
> indexing speedup? If you use StandardAnalyzer (which uses
> StandardTokenizer), maybe the reason is in it. See the second-to-last
> report in the thread "Indexing Speed: 2.3 vs 2.2 (real world numbers)".
> According to the reporter, Jake Mannix, this is because StandardTokenizer
> now uses a StandardTokenizerImpl generated by JFlex instead of JavaCC.
> I am asking because I noticed a great speedup in adding documents to the
> index in our system (we time this in debug mode): THEY ARE NOW ADDED
> 5 TIMES FASTER!!! But at the same time the total indexing process in our
> case improved by only about 8%. As our system is very big and complex, I
> am wondering whether document addition really is reduced so remarkably
> and our system causes the slowdown, or whether Lucene does some
> optimizations on the index, merges or something else, and that is the
> reason the total indexing process is not so much faster.
>
> Best Regards,
> Ivan
>
> Uwe Goetzke wrote:
> > This week I switched the Lucene library version on one customer system.
> > The indexing time went down from 46m32s to 16m20s for the complete
> > task, including optimisation. Great job!
> > We index product catalogs from several suppliers; in this case around
> > 56,000 product groups and 360,000 products including descriptions were
> > indexed.
> >
> > Regards
> >
> > Uwe
[...]
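ISOLatin2AccentFilter and NGramStemFilter in the chain above are Uwe's own classes, not part of Lucene. For readers who want to experiment, a rough stand-in can be assembled from stock parts: core's ISOLatin1AccentFilter and the contrib NGramTokenFilter discussed later in this thread. A sketch only, not Uwe's actual code:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.ISOLatin1AccentFilter;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.ngram.NGramTokenFilter; // contrib-analyzers

    public class BigramAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream result = new WhitespaceTokenizer(reader);
            // stand-in for Uwe's ISOLatin2AccentFilter (his version maps ö -> oe)
            result = new ISOLatin1AccentFilter(result);
            result = new LowerCaseFilter(result);
            // stand-in for Uwe's NGramStemFilter(result, 2): bigrams only
            result = new NGramTokenFilter(result, 2, 2);
            return result;
        }
    }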
AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
Jake,

With the bigram-based index we gave up the struggle to find a well-working language-based index. We had implemented soundex (and various "sound"-alikes) and hyphenation, but failed to deliver user-explainable search results ("why is this ranked higher", and so on). One reason may be that product descriptions contain a lot of abbreviations.

The index size grew by about 30%. The search performance seems a bit slower, but I have no concrete figures. The evaluation for one document is a bit more complex than a phrase query, one reason of course being that more terms are evaluated. But nevertheless it is quite good.

The search relevance improved tremendously. Missing characters, switched letters and partial word fragments are no real problem any more (dependent, of course, on the length of the search word). The search term "weekday" also finds "day of the week"; "disabigaute" finds "disambiguate". The algorithms I developed might not fit other domains, but for multi-language catalogs of products it works quite well for us. So far...

Regards

Uwe

-Original Message-
From: Jake Mannix [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 25 March 2008 17:13
To: java-user@lucene.apache.org
Subject: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1

Uwe,

This is a little off thread-topic, but I was wondering how your search relevance and search performance have fared with this bigram-based index. Is it significantly better than before you used the NGramAnalyzer?

-jake
[...]
random accessing term value
Hi:

Is there a way to randomly access a term value in a field? E.g., in my field "content" the terms are: lucene, is, cool. Is there a way to access content[2] -> cool?

Thanks

-John
Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
Hi Uwe,

I am curious what NGramStemFilter is. Is it a combination of Porter stemming and word n-gram identification?

Thanks!

Jay

Uwe Goetzke wrote:
> Hi Ivan,
> No, we do not use StandardAnalyzer or StandardTokenizer.
>
> Most data is processed by:
>
>     fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
>     result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter modified so that ö -> oe
>     result = new org.apache.lucene.analysis.LowerCaseFilter(result);
>     result = new org.apache.lucene.analysis.NGramStemFilter(result,2); // just a bigram tokenizer
>
> We use our own query parser. The bigrams are searched with a tolerant
> phrase query, scoring in a doc the greatest bigram clusters covering the
> phrase tokens.
>
> Best Regards
>
> Uwe
[...]
Re: random accessing term value
On Mar 25, 2008, at 1:32 PM, John Wang wrote:
> Is there a way to randomly access a term value in a field? E.g., in my
> field "content" the terms are: lucene, is, cool. Is there a way to
> access content[2] -> cool?

Via term vectors, or reanalysis of the field, are two that come to mind. Maybe other ways?

Erik
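To make the term-vector route concrete: if the field was indexed with TermVector.WITH_POSITIONS, the recorded positions let you recover the token at a given slot. An untested sketch against the 2.3 API, using the example from the question:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermPositionVector;

    public class TermAtPosition {
        // Returns the token recorded at the given position, or null.
        public static String termAt(IndexReader reader, int docId,
                                    String field, int position) throws IOException {
            TermPositionVector tpv =
                    (TermPositionVector) reader.getTermFreqVector(docId, field);
            if (tpv == null) return null; // no term vector stored for this doc/field
            String[] terms = tpv.getTerms();
            for (int i = 0; i < terms.length; i++) {
                int[] positions = tpv.getTermPositions(i);
                if (positions == null) continue; // positions were not stored
                for (int j = 0; j < positions.length; j++) {
                    if (positions[j] == position) {
                        return terms[i]; // e.g. termAt(reader, d, "content", 2) -> "cool"
                    }
                }
            }
            return null;
        }
    }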
Re: Improving Index Search Performance
Shailendra,

Have a look at the javadocs of HitCollector:
http://lucene.apache.org/java/2_3_0/api/core/org/apache/lucene/search/HitCollector.html

The problem is with the use of the disk head: when retrieving documents during collecting, the disk head has to move between the inverted index and the stored documents; see also the file formats. To avoid such excessive disk head movement, you need to collect all (or at least many more than 1 of) your document ids during collect(), for example into an int[]. After collecting, retrieve all the docs with Searcher.doc(). Also, for the same reason, retrieving docs is best done in doc id order, but that is unlikely to go wrong, as doc ids are normally collected in increasing order.

Regards,
Paul Elschot

On Tuesday 25 March 2008 13:43:18, Shailendra Mudgal wrote:
> We are using Lucene to search an index of around 20G in size, with around
> 3 million documents. We are facing performance issues loading large
> results from the index.
[...]
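A sketch of what that looks like, with the field names assumed from the earlier posts: collect bare ids first, then make all the searcher.doc() calls in a single pass, in increasing doc id order:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.MapFieldSelector;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    public class CollectThenFetch {
        public static List fetchCompanyIds(Searcher searcher, Query query)
                throws IOException {
            final List docIds = new ArrayList();
            searcher.search(query, new HitCollector() {
                public void collect(int doc, float score) {
                    docIds.add(new Integer(doc)); // ids arrive in increasing order
                }
            });

            // second pass: stored fields only, one sweep over the index
            MapFieldSelector selector =
                    new MapFieldSelector(new String[] {"company_id", "id"});
            List companyIds = new ArrayList();
            for (int i = 0; i < docIds.size(); i++) {
                int docId = ((Integer) docIds.get(i)).intValue();
                Document d = searcher.doc(docId, selector);
                String[] values = d.getValues("company_id");
                if (values == null) continue;
                for (int j = 0; j < values.length; j++) {
                    companyIds.add(values[j]);
                }
            }
            return companyIds;
        }
    }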
Re: Improving Index Search Performance
: We also read in one of the posts that we should use bitSet.set(doc)
: instead of calling searcher.doc(id). But we are unable to understand how
: this might help in our case since we will anyway have to load the document
: to get the other required field (company_id). Also we observed that the
: searcher is actually using only 1G of RAM though we have 4G allocated to it.

In addition to Paul's previous excellent suggestion, note that if:

* companyId is a single-value field (ie: no document has more than one)
* companyId is indexed

you can use the FieldCache to look up the companyId for each doc. In the aggregate this will most likely be much faster than accessing the stored fields.

-Hoss
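A sketch of the FieldCache variant, valid only under exactly those two assumptions; the field name "company_id" is taken from the earlier posts:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    public class CompanyIdLookup {
        // One string per document, cached per IndexReader; after the first
        // call, reading a companyId is a plain array access by doc id.
        public static String[] load(IndexReader reader) throws IOException {
            return FieldCache.DEFAULT.getStrings(reader, "company_id");
        }
    }

Inside collect(int doc, float score), reading the value is then just companyIds[doc], with no stored-field access at all. One caveat for the MultiSearcher in the original post: the cache array is per IndexReader, while a MultiSearcher's collected doc ids span all of its searchables, so the lookup has to be done against each underlying reader in its own id space.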
Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
Jay,

Have a look at Lucene config, it's all there, including tests. This filter will take a token such as "foobar" and chop it up into n-grams (e.g. foobar -> fo oo ob ba ar would be the set of bi-grams). You can specify the n-gram size, and even min and max n-gram sizes.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
From: Jay <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, March 25, 2008 1:32:24 PM
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Hi Uwe,

I am curious what NGramStemFilter is. Is it a combination of Porter stemming and word n-gram identification?

Thanks!

Jay
[...]
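A hypothetical snippet to see that transformation, using the contrib NGramTokenFilter and the pre-2.4 TokenStream API:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.ngram.NGramTokenFilter;

    public class NGramDemo {
        public static void main(String[] args) throws Exception {
            // min and max gram size both 2: plain bigrams
            TokenStream ts = new NGramTokenFilter(
                    new WhitespaceTokenizer(new StringReader("foobar")), 2, 2);
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.print(t.termText() + " "); // fo oo ob ba ar
            }
        }
    }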
Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
Sorry, I could not find the filter in the 2.3 API class list (core + contrib + test). I am not aware of a Lucene config file either. Could you please tell me where it is in the 2.3 release?

Thanks!

Jay

Otis Gospodnetic wrote:
> Jay,
>
> Have a look at Lucene config, it's all there, including tests. This
> filter will take a token such as "foobar" and chop it up into n-grams
> (e.g. foobar -> fo oo ob ba ar would be the set of bi-grams). You can
> specify the n-gram size, and even min and max n-gram sizes.
>
> Otis
[...]
Re: hitcollector topdocs
Hi Jens,

I'm having a bit of a hard time following this, so perhaps you could rephrase, show your sample code, or explain a bit more about what you are trying to do at a higher level?

Cheers,
Grant

On Mar 25, 2008, at 10:46 AM, JensBurkhardt wrote:
> Hi everybody,
>
> I was searching for information about the HitCollector. I was wondering
> whether the values of the fields have to be stored or not. I tested it
> and it worked both ways, but I'm still not really sure about it. Second
> question: can I work with tokenized fields?
>
> Best regards,
> Jens

--
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: explain() - fieldnorm
On Mar 25, 2008, at 12:10 PM, JensBurkhardt wrote:
> As you can see, both are exactly the same. What I don't understand is
> that the two documents have different document boosts (the first one got
> a boost of 1.62, the second of 1.65), yet explain() tells me that the
> fieldNorm value is 1.5 for both.

Document boosts do not have much granularity, due to the limited number of bits in the norm. I seem to recall Yonik publishing a list of values at one time on the mailing list, but I can't for the life of me conjure the keywords to find it at the moment, as it was on a related topic. Perhaps his memory is better than mine...

HTH,
Grant
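One way to see that granularity limit directly: the norm (document boost x field boost x lengthNorm) is packed into a single byte, and Similarity exposes the encoding. A sketch; with lengthNorm fixed at 1, both boosts from the earlier message should collapse to the stored value 1.5 that explain() reported:

    import org.apache.lucene.search.Similarity;

    public class NormPrecision {
        public static void main(String[] args) {
            float[] boosts = {1.62f, 1.65f};
            for (int i = 0; i < boosts.length; i++) {
                byte b = Similarity.encodeNorm(boosts[i]);
                // the stored byte has only a few mantissa bits, so nearby
                // boosts collapse to the same decoded value
                System.out.println(boosts[i] + " -> " + Similarity.decodeNorm(b));
            }
        }
    }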
Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
Hi Jay,

Sorry, lapsus calami, that would be Lucene *contrib*. Have a look:
http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/index.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
From: Jay <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, March 25, 2008 6:15:54 PM
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Sorry, I could not find the filter in the 2.3 API class list (core + contrib + test). I am not aware of a Lucene config file either. Could you please tell me where it is in the 2.3 release?

Thanks!

Jay
[...]
Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
Hi Otis,

I checked that contrib before and could not find NGramStemFilter. Am I missing another contrib?

Thanks for the link!

Jay

Otis Gospodnetic wrote:
> Hi Jay,
>
> Sorry, lapsus calami, that would be Lucene *contrib*. Have a look:
> http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/index.html
>
> Otis
[...]
Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
Sorry, I wrote this stuff but forgot the naming. Look:
http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/org/apache/lucene/analysis/ngram/package-summary.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
From: yu <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, March 26, 2008 12:04:33 AM
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Hi Otis,

I checked that contrib before and could not find NGramStemFilter. Am I missing another contrib?

Thanks for the link!

Jay
[...]
Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
Sorry for my ignorance, but I am looking for NGramStemFilter specifically. Are you suggesting that it's the same as NGramTokenFilter? Does it have stemming in it?

Thanks again.

Jay

Otis Gospodnetic wrote:
> Sorry, I wrote this stuff but forgot the naming. Look:
> http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/org/apache/lucene/analysis/ngram/package-summary.html
>
> Otis
[...]
Re: random accessing term value
I am not sure how term vectors would help me. Term vectors are ordered by frequency, not in lex order. Since I know that in the dictionary the terms are ordered lexicographically, it seems it should be possible for me to randomly get the nth term in the dictionary without having to seek to it. Thoughts?

Thanks

-John

On Tue, Mar 25, 2008 at 11:16 AM, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> On Mar 25, 2008, at 1:32 PM, John Wang wrote:
> > Is there a way to randomly access a term value in a field? E.g., in my
> > field "content" the terms are: lucene, is, cool. Is there a way to
> > access content[2] -> cool?
>
> Via term vectors, or reanalysis of the field, are two that come to mind.
> Maybe other ways?
>
> Erik