Re: Using RangeFilter
I have a field indexed as NO_NORMS; does it have to be untokenized to be able to sort on it? On Jan 21, 2008 12:47 PM, Antony Bowesman [EMAIL PROTECTED] wrote: vivek sar wrote: I need to be able to sort on optime as well, thus need to store it. Lucene's default sorting does not need the field to be stored, only indexed as untokenized. Antony - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Is Fair Similarity working with lucene 2.2 ?
Is there anything I can do to pass my unit test? Or is it impossible? Thanks a lot, Fabrice Fabrice Robini wrote: Hi Srikant, I really thank you for your reply, it's very interesting. I have to say I am confused now... I do not know what I can do to pass this unit test... I agree with you, it may be an issue of computing relevance. Fabrice Srikant Jakilinki-3 wrote: OK, got it to work. Thanks. By a quick scoring comparison, I got the same scores for both hits. Maybe there is a loss of precision somewhere. Or, when scores are equal, Lucene is doing something unintended/overlooked and thus putting shorter documents higher, as the experiment is a special case where the TF of a queried term (for both suites, the TF of x = 10%) is equal, which very rarely happens. Or maybe the IDF factor is kicking in in some strange way, although it shouldn't. There are a number of varied reasons, but to the naked eye there isn't much. However, that said, length normalization is not a science but an art, and the simple scheme we have here in FairSimilarity will not always perform as expected in real-world scenarios. Maybe I am missing something or have forgotten my basics, but that is not to say your observation is trivial. Rather, the contrary. Hope there will be more activity on this topic, because it is an issue of computing relevance, which is the core of search. Cheers, Srikant Fabrice Robini wrote: Oooops sorry, bad cut/paste...
Here is the right one :-)

public void testFairSimilarity() throws CorruptIndexException, IOException, ParseException {
    Directory theDirectory = new RAMDirectory();
    Analyzer theAnalyzer = new StandardAnalyzer();
    IndexWriter theIndexWriter = new IndexWriter(theDirectory, theAnalyzer);
    theIndexWriter.setSimilarity(new FairSimilarity());
    Document doc1 = new Document();
    Field name1 = new Field(NAME, SHORT_SUITE, Field.Store.YES, Field.Index.UN_TOKENIZED);
    Field content1 = new Field(CONTENT, "x 2 3 4 5 6 7 8 9 10", Field.Store.NO, Field.Index.TOKENIZED);
    doc1.add(name1);
    doc1.add(content1);
    theIndexWriter.addDocument(doc1);
    Document doc2 = new Document();
    Field name2 = new Field(NAME, BIG_SUITE, Field.Store.YES, Field.Index.UN_TOKENIZED);
    Field content2 = new Field(CONTENT, "x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20", Field.Store.NO, Field.Index.TOKENIZED);
    doc2.add(name2);
    doc2.add(content2);
    theIndexWriter.addDocument(doc2);
    theIndexWriter.close();
    Searcher searcher = new IndexSearcher(theDirectory);
    searcher.setSimilarity(new FairSimilarity());
    QueryParser queryParser = new QueryParser(CONTENT, theAnalyzer);
    Hits hits = searcher.search(queryParser.parse("x"));
    assertEquals(2, hits.length());
    assertEquals(BIG_SUITE, hits.doc(0).get(NAME));
    assertEquals(SHORT_SUITE, hits.doc(1).get(NAME));
}

Srikant Jakilinki-3 wrote: Well, I can't seem to even get past the assertions of this code. The first assertion is failing in that I get 0 hits. I am using SimpleAnalyzer since I do not have a FrenchAnalyzer. Any thoughts? Srikant - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Multiple searchers (Was: CachingWrapperFilter: why cache per IndexReader?)
On Thu, 2008-01-24 at 08:18 +1100, Antony Bowesman wrote: These are odd. The last case in both of the above shows a slowdown compared to the 2.1 index and version, and in the first 50K queries the 2.3 index and version is even slower than 2.3 with the 2.1 index. It catches up in the longer result set. Any ideas why that might be? Looking at the graphs I can see that the 2 threads / shared searcher case is suspiciously fast at getting up to full speed. It could be because the disk read cache wasn't properly flushed. I'll rerun the test. I've inspected the graphs for my other published measurements and they looked as expected. I'll spend some more time on it tomorrow. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Lucene search strings two
Hi all. I need to check two conditions in a search. First I need to find a bank name; next, among those results, I need to find documents containing a particular city. Finally, I need the documents which satisfy both conditions, i.e., documents with bank+city. Can anyone please help me? Thanks, prathiba.P
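The usual answer here is a single BooleanQuery with two required clauses. As a minimal sketch (the field names bank and city are hypothetical; adapt them to your schema), the equivalent query-parser syntax can be built like this:

```java
// Sketch: two required conditions expressed in Lucene query-parser syntax.
// "bank" and "city" are assumed field names, not part of the original post.
public class BankCityQuery {
    // Equivalent to a BooleanQuery with two clauses added with Occur.MUST.
    static String buildQuery(String bank, String city) {
        return "+bank:\"" + bank + "\" +city:\"" + city + "\"";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("HSBC", "Mumbai"));
    }
}
```

Parsing that string with QueryParser, or programmatically adding two TermQuery clauses with Occur.MUST to a BooleanQuery, returns only documents matching both conditions.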
Re: Using RangeFilter
vivek sar wrote: I have a field indexed as NO_NORMS; does it have to be untokenized to be able to sort on it? NO_NORMS is the same as UNTOKENIZED + omitNorms, so you can sort on that. Antony - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Full Text Searching a Relational Model
Hi, (Warning, not for the weak-hearted) I'm currently working on a project where we have a large and complex data model, related to Genomics. We are trying to build a search engine that provides full text and field-based text searches for our customer base (mostly academic research), and are evaluating different tools for this purpose. As a starting point, we have, as an example, a set of objects (stored in tables as a relational model): Gene [ID, Symbol, Description] Article - M:M with Gene [ID, Title] Disease - M:M with Gene [ID, Name] Author - M:M with Article [ID, Name] (Note: M:M tables exist, just link IDs) An example model would be (hierarchical, relations dealt with as duplications) Gene [ID=1, Symbol=EGFR, Description=epidermal growth factor receptor] Article [ID=1, Title=EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy] Author [ID=1, Name=H. Michaelson] Author [ID=2, Name=J. Watson] Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry] Author [ID=1, Name=H. Michaelson] Author [ID=3, Name=M. Roberts] Disease [ID=1, Name=Epidermal sluffing] Gene [ID=2, Symbol=AHCY, Description=S-adenosylhomocysteine hydrolase] Article [ID=3, Title=Limited proteolysis of S-adenosylhomocysteine hydrolase: implications for the three-dimensional structure] Author [ID=4, Name=B. Cohen] Author [ID=5, Name=L. Alexander] Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry] Author [ID=1, Name=H. Michaelson] Author [ID=3, Name=M. Roberts] Note IDs in the objects above, as they relay the relations in the hierarchical model. In our Full-Text search, we would like to allow users to search ANY textual field for any string. For instance, the term epidermal, and display the list of genes which have any data associated with them with that term (ranked, of course). 
Our list of results would be something like: EGFR Found in Description (epidermal growth factor receptor) Found in Article ID#2, in Title (proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry) Found in Disease ID#1, in Name (Epidermal sluffing) AHCY Found in Article ID#2, in Title (proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry) Note that the results retain a hierarchical view of our Genes (us being Gene-centric, we're pretty much framing the question as: find this term in information related to those genes). Also note that Article ID #2 has an M:M with Gene ID 2 (AHCY) and Gene ID 1 (EGFR), and only due to that fact is AHCY considered a gene that has epidermal in its annotations. Obviously, we'd like to rank fields by location in the hierarchy (a term in a gene name is scored higher than the name of the author of an article related to a gene) and by number of hits (the number of times a term is found related to that gene, 3 in the case of EGFR above). Ideas for how to take on this challenge? Implementation? Tools? Thanks! Yaron Golan - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
LogMergePolicy
Hello, I'm curious, why is LogMergePolicy named *Log*MergePolicy? (Why not ExpMergePolicy? :-) Thank you, Koji - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: LogMergePolicy
I'm curious, why is LogMergePolicy named *Log*MergePolicy? (Why not ExpMergePolicy? :-) Well, I guess it's a matter of perspective. When you look at the way the algorithm works, the merge decisions are based on a concept of levels, and levels are assigned based on the log of the number of documents in a segment (going back to Ning's equation). When one is in the code, it's very natural to think/talk about log-base-merge-factor. This does result in the number of documents in segments being order-of-magnitude/exponentially related, so Exp might have made more sense to users; perhaps it wasn't the best naming decision... - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: LogMergePolicy
On Jan 24, 2008 8:40 AM, Steven Parkes [EMAIL PROTECTED] wrote: [...] It could be accurately described either way, but there is precedent for log too... log-normal, for example, is normal after one takes the log (it could have been called exponential-normal). I also tend to think of our number system as logarithmic in nature rather than exponential. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
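The log-base-merge-factor bucketing from this thread can be sketched in plain Java. This is an illustration of the idea, not Lucene's exact code:

```java
// Illustrative sketch (not Lucene's actual implementation): segments are
// bucketed into levels by the log, base mergeFactor, of their document
// count; a merge is triggered when mergeFactor segments pile up on the
// same level. Note how document counts on successive levels grow
// exponentially, which is the Log-vs-Exp naming question in a nutshell.
public class LogLevelDemo {
    static int level(int docCount, int mergeFactor) {
        return (int) Math.floor(Math.log(docCount) / Math.log(mergeFactor));
    }

    public static void main(String[] args) {
        // With mergeFactor=10: 1-9 docs -> level 0, 10-99 -> 1, 100-999 -> 2
        System.out.println(level(5, 10));   // 0
        System.out.println(level(50, 10));  // 1
        System.out.println(level(500, 10)); // 2
    }
}
```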
Re: LogMergePolicy
Thank you Steven and Yonik, I think I got it. And I can see that LogMergePolicy uses Math.log() to find merges. :-) Thank you again, Koji - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Full Text Searching a Relational Model
In general, you just need to denormalize the data and create a list of Genes, adding each Gene's related information via SQL. Ranking can easily be adjusted via each field's weight; not a big deal. Seems an ideal case for using DBSight. It can also do incremental indexing, which you may also need. -- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding! On Jan 24, 2008 5:42 AM, [EMAIL PROTECTED] wrote: [...] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
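The denormalization idea can be sketched without any database at all. This is a toy in-memory stand-in for the SQL joins (the real data would come from the M:M tables); each Gene's related titles and names are flattened into one text blob that would become a single Lucene document:

```java
import java.util.*;

// Sketch: flatten each Gene plus all of its related rows into one
// searchable text string per Gene. The map below stands in for the
// result of joining Gene, Article, Disease and Author via SQL.
public class DenormDemo {
    static Map<String, String> denormalize(Map<String, List<String>> geneToRelated) {
        Map<String, String> docs = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : geneToRelated.entrySet()) {
            // One Gene -> one document whose content is all related text
            docs.put(e.getKey(), String.join(" ", e.getValue()));
        }
        return docs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> db = new LinkedHashMap<>();
        db.put("EGFR", Arrays.asList("epidermal growth factor receptor",
                                     "EGFR mutations in lung cancer"));
        System.out.println(denormalize(db).get("EGFR"));
    }
}
```

In a real index each related table would more likely go into its own field (description, articleTitle, diseaseName, ...) so that per-field boosts can express the hierarchy-based ranking the original post asks for.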
Creating search query
Hi, I have an index with some fields which are indexed and un_tokenized (keywords) and one field which is indexed and tokenized (content). Now I want to create a Query object:

TermQuery k1 = new TermQuery(new Term("foo", "some foo"));
TermQuery k2 = new TermQuery(new Term("bar", "some bar"));
QueryParser p = new QueryParser("content", new SomeAnalyzer()); // same analyzer as used for indexing
Query c = p.parse("text we are looking for");
BooleanQuery q = new BooleanQuery();
q.add(k1, Occur.MUST);
q.add(k2, Occur.MUST);
q.add(c, Occur.MUST);

Is this the best way? Thank you - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Compass
Thank you. -Original Message- From: Lukas Vlcek [mailto:[EMAIL PROTECTED] Sent: Mittwoch, 23. Januar 2008 08:23 To: java-user@lucene.apache.org Subject: Re: Compass Hi, I am using Compass with Spring and JPA. It works pretty nicely. I don't store the index in a database; I use a traditional file-system-based Lucene index. Updates work very well, but you have to be careful about the proper mapping of your objects into the search engine (especially parent-child mappings). Regards, Lukas On Jan 21, 2008 8:08 PM, [EMAIL PROTECTED] wrote: Hi, Compass (http://www.opensymphony.com/compass/content/lucene.html) promises many nice things, in my opinion. Does anybody have production experience with it? Especially the Jdbc Directory and updates? Thank you. -- http://blog.lukas-vlcek.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Creating search query
That should work fine, assuming that foo and bar are the untokenized fields and content is the tokenized content. Erick On Jan 24, 2008 1:18 PM, [EMAIL PROTECTED] wrote: [...] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Creating search query
Yes, sorry, that's the case. Thank you! -Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Donnerstag, 24. Januar 2008 19:49 To: java-user@lucene.apache.org Subject: Re: Creating search query That should work fine, assuming that foo and bar are the untokenized fields and content is the tokenized content. Erick [...] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Design questions
-Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Freitag, 11. Januar 2008 16:16 To: java-user@lucene.apache.org Subject: Re: Design questions But you could also vary this scheme by simply storing in your document the offsets for the beginning of each page. Well, this is the best for my app I think, but... how do I find out these offsets? I'm adding the content field with: IndexWriter#add(new Field("content", myContentReader)); I have no clue how to find out the offsets in this reader. Must be something with an analyzer and a TokenStream? Thank you - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Design questions
I think you'll have to implement your own Analyzer and count. That is, every call to next() that returns a token will have to also increment some counter by 1. To use this, you must have some way of knowing when a page ends, and at that point you ask your instance of your custom analyzer what the count is. Or your analyzer maintains the list and you can ask for it after you've added all the pages. Analyzer.getPositionIncrementGap() is called every time you call document.add(field). So, you have something like this:

while (more pages for doc) {
    String pagedata = getPageText();
    doc.add(new Field("text", pagedata, Field.Store.NO, Field.Index.TOKENIZED));
}

Under the covers, your custom analyzer adds the current offset (which you've kept track of) to, say, an ArrayList. And after the last page is added, you get this ArrayList and add it to your document. Or, you could just do things twice. That is, send your text through a TokenStream, then call next() and count. Then send it all through doc.add(). There are probably cleverer ways, but that should do for a start. Best Erick On Jan 24, 2008 2:33 PM, [EMAIL PROTECTED] wrote: [...] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
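The counting step Erick describes can be sketched without Lucene at all. Here a simple whitespace split stands in for the Analyzer, and the list records the token offset at which each page starts:

```java
import java.util.*;

// Sketch: track a running token count while feeding pages, and record
// the offset of the first token of each page. That offset list is what
// would be stored in the document to map a match position back to a page.
public class PageOffsets {
    static List<Integer> pageStartOffsets(List<String> pages) {
        List<Integer> offsets = new ArrayList<>();
        int tokenCount = 0;
        for (String page : pages) {
            offsets.add(tokenCount); // first token position of this page
            String trimmed = page.trim();
            tokenCount += trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
        return offsets;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("one two three", "four five", "six");
        System.out.println(pageStartOffsets(pages)); // [0, 3, 5]
    }
}
```

In the real analyzer the increment would happen in next() rather than via a split, but the bookkeeping is the same.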
RE: Lucene, HTML and Hebrew
Steve and all, I didn't know whether to send a detailed description of my case to aid with seeing the whole picture, or to send a list of short questions which would require loads of follow-up. I guess I know what is better now, thanks. Lucene does not store proximity relations between data in different fields, only within individual fields So are 2 calls to doc.add() with the same field but different texts considered as 1 field (the latter call being internally appended to the former, merged into one field), or as two instances of the same field which do not share proximity and frequency data? From what you wrote later in your response, it seems the case is the former. How can I inhibit this appending -- are there any approaches other than appending an invalid string like $$$? I've been thinking about this a bit, and I think I'd go with one big field for all the content, and I'd want to incorporate the headers into it as well. How would I boost those specific words, so that the content field can contain both all words and all headers in their original order (for proximity and frequency data to be valid), yet keep the terms which were originally in a header or a sub-header boosted? This could be a good practice for boosting bolded or italic text in normal paragraphs as well (only with a lower boost). Generally, stemming increases recall (proportion of matching relevant docs among relevant docs in the entire corpus), and decreases precision (proportion of relevant docs among matching docs). That's a great definition, thanks. I'm trying to think this through, since Hebrew is not a regular case. If you google for Hebrew and Stemming you will get pages which talk about how complicated Hebrew is compared to English and other European languages.
[ Warning: technical data, questions follow after this paragraph -- to comply with the 30-second rule :) ] This is extremely difficult, since Hebrew has unique features like Niqqud-less spelling (which causes many words to have several spelling options, only one legal but the others too common to ignore) and three-letter stems which have many derivations. Furthermore, English words like and, that, of, to etc. are represented in Hebrew as one letter appended to the beginning of the word, forming a whole new word from the original. Discarding them while indexing is not a smart move, since one would try to look for a specific term *with* this initial and would not expect results without it. Furthermore, some words which use these initials have another meaning when pronounced differently (like KLBI, which could be read as Ke-libi [as my heart], where I can omit the leading K, and also as Kalbi [my dog], where I cannot). So, to overcome the challenges above, I was thinking about the query inflation approach, having a negative boost for the inflated terms as you suggested. I will appreciate any different takes on this one, as this is going to be the first public Lucene Hebrew analyzer... Using this approach I only need to make sure I do not inflate those too much (1024 is the standard limit, right?). Also, how can I check whether a word I inflated exists in the index BEFORE executing the query? Is that recommended at all? -- I'm looking for the most efficient way, so that search speed will still be measured in a few ms, as it is now. The idea is to prevent, or minimize, the use of a dictionary, keeping the stemmer as simple as possible (and thereby produce invalid words and eliminate them before executing the search). It's worth noting, as the above-linked documentation for Field.setBoost() does, that field boosts are not stored independently of other normalization factors in the index. Does this mean I should stick with boosting fields in the query phase only? Itamar.
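The query-inflation step can be sketched in plain Java. This is a minimal illustration, assuming a hypothetical set of one-letter prefixes and a 0.5 boost for the de-prefixed alternative; it is not a real Hebrew stemmer:

```java
import java.util.*;

// Sketch of "query inflation": a word starting with a prepositional prefix
// letter is expanded into the original form plus the de-prefixed form, the
// latter with a reduced boost. The prefix set and the 0.5 boost value are
// illustrative assumptions, not part of any existing analyzer.
public class QueryInflation {
    static final String PREFIXES = "בולמכשה"; // common one-letter prefixes (illustrative)

    static Map<String, Float> inflate(String term) {
        Map<String, Float> expansions = new LinkedHashMap<>();
        expansions.put(term, 1.0f); // original form, full weight
        if (term.length() > 2 && PREFIXES.indexOf(term.charAt(0)) >= 0) {
            expansions.put(term.substring(1), 0.5f); // de-prefixed alternative
        }
        return expansions;
    }

    public static void main(String[] args) {
        // The KLBI example from the post: כלבי also expands to לבי
        System.out.println(inflate("כלבי"));
    }
}
```

Each map entry would then become a boosted clause in the final BooleanQuery; trimming expansions that do not occur in the index (via IndexReader.terms()) keeps the clause count well under the 1024 default.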
-Original Message- From: Steven A Rowe [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 23, 2008 1:06 AM To: java-user@lucene.apache.org Subject: RE: Lucene, HTML and Hebrew Hi Itamar, In another thread, you wrote: Yesterday I sent an email to this group querying about some very important (to me...) features of Lucene. I'm giving it another chance before it goes unnoticed or forgotten. If it was too long please let me know and I will email a shorter list of questions I think I have something like a 30-second rule for posts on this list: if I can't figure out what the question is within 30 seconds, I move on. Your post was so verbose that I gave up before I asked myself whether I could help. (Déjà vu - upon re-reading this paragraph, it sounds very much like something Hoss has said on this list...) Although I answer your original post below, please don't take this as affirmation of your reminder approach. In my experience, this strategy is interpreted as badgering, and tends to affect response rate in the opposite direction to that intended. Short, focused questions will maximize the response rate here (and elsewhere, I suspect). Also, it helps if there is some
FYI: parallel corpus in 22 languages
Hi all, Just FYI, perhaps this is old news for you ... This large corpus is freely available and it is pairwise sentence-aligned for all language combinations. This looks like a good resource for linguistic information, such as frequent words and phrases, n-gram profiles, etc. http://wt.jrc.it/lt/Acquis/ -- Best regards, Andrzej Bialecki -- Information Retrieval, Semantic Web, Embedded Unix, System Integration -- http://www.sigram.com Contact: info at sigram dot com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Lucene, HTML and Hebrew
Hi Itamar, On 01/24/2008 at 2:55 PM, Itamar Syn-Hershko wrote: Lucene does not store proximity relations between data in different fields, only within individual fields So are 2 calls for doc-add with the same field but different texts are considered as 1 field (latter call being internally appended into the former, merged into one field), or as two instances of the same field which do not share proximity and frequency data? As it seems from what you wrote later in your response, it seems the case is the former. Yes. From http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/document/Document.html#add(org.apache.lucene.document.Field): Adds a field to a document. Several fields may be added with the same name. In this case, if the fields are indexed, their text is treated as though appended for the purposes of search. How can I inhibit this appending -- are there any more approaches than appending an invalid string like $$$? Here's an idea, though it is entirely untested and may be completely false :) : Lucene's Tokenizers are fed a Reader (in Java - I don't know about CLucene's setup, but I assume the interface is similar) and emit Tokens. Assuming that each field value from same-named fields gets its own Reader, then you could create a custom Tokenizer that, for the first Token it emits, sets a position increment greater than one - in so doing, phrase matching across same-named field values should be inhibited: http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int) I've been thinking about this a bit, and I think I'd go with one big field for all the content, and I'd want to incorporate the headers into it as well. How would I boost those specific words - so the content field can contain both all words and all headers in their original order (for proximity and frequency data to be valid), yet keep the terms which were originally in a header or a sub-header boosted? 
Like I wrote in a previous response: One very coarse-grained boosting trick you could use is to repeat the text of headers, etc., that you want to boost. This trick adjusts the term frequency of important terms. I don't know of any other approaches besides this trick, except using field boosting, which would require you to have separate fields. So, to overcome the challenges above, I was thinking about the query inflation approach, having a negative boost for the inflated terms as you suggested. Actually, I was referring to a reduced, but non-negative, boost - like 0.5 instead of 1.0. AFAIK, Lucene does not support negative boosts. I will appreciate any different takes on this one, as this is going to be the first public Lucene Hebrew analyzer... One thought - for ambiguous terms, your stemming component could emit all of the alternatives at the same position. Using this approach I only need to make sure I do not inflate those too much (1024 is the standard limit, right?). 1024 is the default maximum number of BooleanClause children, but you can set this higher (or lower) should you desire: http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/search/BooleanQuery.html#setMaxClauseCount(int) Also, how can I check whether a word I inflated exists in the index BEFORE executing the query? Is that recommended at all? See IndexReader.terms(): http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/index/IndexReader.html#terms() If, as an offline process, you were to trim your query expansion map so that it included only terms known to be in the index, the resulting simpler queries should impact positively on performance. It's worth noting, as the above-linked documentation for Field.setBoost() does, that field boosts are not stored independently of other normalization factors in the index. Does this mean I should stick with boosting fields in the query phase only? 
No - I mentioned this only to alert you to the fact that field boosts are stored in the index only as part of the field norm, which is an amalgam including a couple of other factors. Index-time field boosting could potentially do good things for you - it's worth trying out. Steve - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
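The position-increment trick from earlier in this message can be illustrated without Lucene: if the first token of each subsequent field value advances the position by a large gap, a phrase query (which needs consecutive positions) can no longer match across value boundaries. A sketch:

```java
import java.util.*;

// Sketch: assign token positions across multiple same-named field values,
// inserting a large position increment at the start of each value after
// the first. Positions 1=fox and 101=blue are not consecutive, so the
// phrase "fox blue" cannot match across the two values.
public class PositionGapDemo {
    static Map<Integer, String> positions(List<String> values, int gap) {
        Map<Integer, String> posToToken = new LinkedHashMap<>();
        int pos = -1;
        for (int v = 0; v < values.size(); v++) {
            String[] tokens = values.get(v).split("\\s+");
            for (int t = 0; t < tokens.length; t++) {
                pos += (v > 0 && t == 0) ? gap : 1; // big increment at value start
                posToToken.put(pos, tokens[t]);
            }
        }
        return posToToken;
    }

    public static void main(String[] args) {
        System.out.println(positions(Arrays.asList("red fox", "blue sky"), 100));
    }
}
```

This mirrors what Token.setPositionIncrement(int) achieves inside a custom Tokenizer (or what Analyzer.getPositionIncrementGap() provides in later Lucene versions).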
Re: strange exception while indexing
That means that one of the merges, which run in the background by default with 2.3, hit an unhandled exception. Did you see another exception logged / printed to stderr before this one? Mike Cam Bazz wrote: Does anyone have any idea about the error I got while indexing? Best Regards, -C.B.

Exception in thread "main" java.io.IOException: background merge hit exception: _kq:C962870 _kr:C2591 into _ks [optimize]
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1749)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1689)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1669)

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: strange exception while indexing
no. only after that there was a gc error. I am also not using the compound index file format in order to increase indexing speed. could it be because of that? I will run the test case again tomorrow. What can I do to increase logging? Best, -C.B. On Jan 24, 2008 11:52 PM, Michael McCandless [EMAIL PROTECTED] wrote: That means that one of the merges, which run in the background by default with 2.3, hit an unhandled exception. Did you see another exception logged / printed to stderr before this one? Mike Cam Bazz wrote: Does anyone have any idea about the error I got while indexing? Best Regards, -C.B. Exception in thread main java.io.IOException: background merge hit exception: _kq:C962870 _kr:C2591 into _ks [optimize] at org.apache.lucene.index.IndexWriter.optimize (IndexWriter.java:1749) at org.apache.lucene.index.IndexWriter.optimize (IndexWriter.java:1689) at org.apache.lucene.index.IndexWriter.optimize (IndexWriter.java:1669)
Re: strange exception while indexing
Hmm, you should have seen an exception before that one from optimize. Can you post the GC error? Was it an OutOfMemoryError situation? Mike On Jan 24, 2008, at 5:32 PM, Cam Bazz wrote: no. only after that there was a gc error. I am also not using the compound index file format in order to increase indexing speed. could it be because of that? I will run the test case again tomorrow. What can I do to increase logging? Best, -C.B. On Jan 24, 2008 11:52 PM, Michael McCandless [EMAIL PROTECTED] wrote: That means that one of the merges, which run in the background by default with 2.3, hit an unhandled exception. Did you see another exception logged / printed to stderr before this one? Mike Cam Bazz wrote: Does anyone have any idea about the error I got while indexing? Best Regards, -C.B. Exception in thread main java.io.IOException: background merge hit exception: _kq:C962870 _kr:C2591 into _ks [optimize] at org.apache.lucene.index.IndexWriter.optimize (IndexWriter.java:1749) at org.apache.lucene.index.IndexWriter.optimize (IndexWriter.java:1689) at org.apache.lucene.index.IndexWriter.optimize (IndexWriter.java:1669)
Re: strange exception while indexing
Oh, also, I don't think not using CFS would lead to this, unless it's somehow triggering too many file descriptors... Mike Cam Bazz wrote: no. only after that there was a gc error. I am also not using the compound index file format in order to increase indexing speed. could it be because of that? I will run the test case again tomorrow. What can I do to increase logging? Best, -C.B. On Jan 24, 2008 11:52 PM, Michael McCandless [EMAIL PROTECTED] wrote: That means that one of the merges, which run in the background by default with 2.3, hit an unhandled exception. Did you see another exception logged / printed to stderr before this one? Mike Cam Bazz wrote: Does anyone have any idea about the error I got while indexing? Best Regards, -C.B. Exception in thread main java.io.IOException: background merge hit exception: _kq:C962870 _kr:C2591 into _ks [optimize] at org.apache.lucene.index.IndexWriter.optimize (IndexWriter.java:1749) at org.apache.lucene.index.IndexWriter.optimize (IndexWriter.java:1689) at org.apache.lucene.index.IndexWriter.optimize (IndexWriter.java:1669)
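When an optimize() throws "background merge hit exception", the exception from the merge thread is usually attached as the cause of the IOException, so the root of the getCause() chain is what you want to inspect. A minimal, hedged helper for that (the `RootCause` class name is made up; for the "increase logging" question, IndexWriter also offers setInfoStream for verbose merge output):

```java
// Walk an exception's cause chain to the bottom. Useful when a wrapper
// exception like "background merge hit exception" hides the real error
// (e.g. an OutOfMemoryError or too-many-open-files IOException) that
// occurred on the background merge thread.
public class RootCause {
    public static Throwable of(Throwable t) {
        // getCause() returns null at the bottom; guard against self-causes
        while (t.getCause() != null && t.getCause() != t) {
            t = t.getCause();
        }
        return t;
    }
}
```

For example, catching the IOException from optimize() and logging `RootCause.of(e)` would show the underlying merge failure rather than just the wrapper message.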
RE: Design questions
Or, you could just do things twice. That is, send your text through a TokenStream, then call next() and count. Then send it all through doc.add(). Hm. This means reading the content twice, no matter whether I use my own analyzer or override/wrap the main analyzer. Is there a hook anywhere where I can grab the last token when I call Document#add? Thank you.
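A one-pass alternative to "do things twice" is to wrap the token source so it counts as the consumer pulls tokens from it. In Lucene terms this would be a TokenFilter that increments a counter (and remembers the last token) in next(); the sketch below uses a plain `Iterator<String>` as a stand-in for the TokenStream, and the `CountingTokens` name is made up.

```java
import java.util.Iterator;

// Wrap a token source so that token count and last token are captured
// while the consumer (in real code, IndexWriter via Document#add) pulls
// from it - the content is read only once.
public class CountingTokens implements Iterator<String> {
    private final Iterator<String> in;
    private int count = 0;
    private String last = null;

    public CountingTokens(Iterator<String> in) { this.in = in; }

    public boolean hasNext() { return in.hasNext(); }

    public String next() {
        String tok = in.next();
        count++;    // updated as the stream is consumed
        last = tok; // the "hook" for the last token
        return tok;
    }

    public void remove() { in.remove(); }

    public int getCount() { return count; }
    public String getLast() { return last; }
}
```

After indexing has consumed the field you would read getCount()/getLast() from the wrapper, rather than tokenizing the text a second time.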
Lucene to index OCR text
I've been poking around the list archives and didn't really come up against anything interesting. Anyone using Lucene to index OCR text? Any strategies/algorithms/packages you recommend? I have a large collection (10^7 docs) that's mostly the result of OCR. We index/search/etc. with Lucene without any trouble, but OCR errors are a problem, when doing exact phrase matches in particular. I'm looking for ideas on how to deal with this thorny problem. -- Renaud Waldura Applications Group Manager Library and Center for Knowledge Management University of California, San Francisco (415) 502-6660
MapReduce usage with Lucene Indexing
Hi, I am very new to Lucene and Hadoop, and I have a project where I need to use Lucene to index some input given either as a huge collection of Java objects or as one huge Java object. I read about Hadoop's MapReduce utilities and I want to leverage that feature in the case described above. Can someone please tell me how I can approach this problem? All the Hadoop MapReduce examples out there show only file-based input and don't explicitly deal with data coming in as a huge Java object. Any help is greatly appreciated. Thanks, Roger
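One common answer to the "huge Java object" problem is that MapReduce wants independent records, so the first step is to break the collection into chunks that individual map tasks can index. The sketch below shows only that splitting step in plain Java (the `ObjectChunker` name is made up); each chunk could then be serialized out, e.g. as a file of records, and fed to one map task.

```java
import java.util.ArrayList;
import java.util.List;

// Partition a large in-memory collection into fixed-size chunks. Each
// chunk is an independent unit of work: written to storage as records,
// it can become the input of one map task that builds a partial index.
public class ObjectChunker {
    public static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be > 0");
        List<List<T>> chunks = new ArrayList<List<T>>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, items.size());
            chunks.add(new ArrayList<T>(items.subList(i, end)));
        }
        return chunks;
    }
}
```

The partial indexes produced per chunk would then be merged in a reduce step; the chunk size trades off parallelism against per-task overhead.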
Re: Lucene to index OCR text
Lots of luck to you, because I haven't a clue. My company deals with OCR data and we haven't had a single workable idea. Of course, our data sets are minuscule compared to what you're talking about, so we haven't tried to heuristically clean up the data. Given that Google is scanning the entire U of Mich library, there has to be an answer out there, but I wonder if it's applicable to already OCRd data or whether it's the scanning itself. There are, as you well know, two issues. First, are the words recognizable. As in actual English words. Which is easily checkable via a dictionary. Which doesn't help much since I've seen OCR that consists of English words that are total nonsense. Assuming you're scanning English texts. Assuming it's modern English. Second, particularly in our case, we have a very significant number of names to deal with. So a dictionary check is pretty useless. We've squirmed out of the problem by having the tables of contents keyed in by hand and then providing our users with links to the OCR image of the scanned data. Since this is genealogy research, it at least gives them a way to verify what our searches return. But inevitably there are false hits as well as false misses. I've considered creating a dictionary of non-English words on the assumption that there will be a finite number of misspellings. But this is OCR data, so the set of misspelled words could very well be bigger than the total number of words in the English language, depending on the condition of your source and how well the OCR is done. But, again, our situation is that the projects aren't large enough to make significant investments in even exploring this. I suppose that one could think about asking a Dictionary program for suggestions, but I haven't a clue how useful that would be. Especially for names or technical data. 
The LDS church (The Church of Jesus Christ of Latter-day Saints) is doing something interesting that has the flavor of [EMAIL PROTECTED] They're getting volunteers to key in pages. Two different volunteers key in each page. Then a comparison is done and the differences are arbitrated. As you can tell, I have nothing really useful to suggest on the scale you're talking about. 10^7 is a LOT of documents. But I'd also be very interested in anything you come across. Especially in the way of cleaning existing OCRd data. Mostly, I'm expressing sympathy for the size and complexity of the task you're undertaking G.. Best Erick On Jan 24, 2008 8:43 PM, Renaud Waldura [EMAIL PROTECTED] wrote: I've been poking around the list archives and didn't really come up against anything interesting. Anyone using Lucene to index OCR text? Any strategies/algorithms/packages you recommend? I have a large collection (10^7 docs) that's mostly the result of OCR. We index/search/etc. with Lucene without any trouble, but OCR errors are a problem, when doing exact phrase matches in particular. I'm looking for ideas on how to deal with this thorny problem. -- Renaud Waldura Applications Group Manager Library and Center for Knowledge Management University of California, San Francisco (415) 502-6660
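The "dictionary of misspellings" idea above is often approximated with edit distance instead: treat an OCR token as a probable misrecognition of a dictionary word when the Levenshtein distance between them is small (say 1-2 for short words). Whether that helps for names and technical terms is exactly the open question raised above; this is just the standard dynamic-programming distance, with a made-up class name.

```java
// Classic Levenshtein edit distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn a into b. Small
// distances between an OCR token and a dictionary word suggest a scan error
// (e.g. "tbe" vs "the" is distance 1).
public class EditDistance {
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,      // deletion
                                 d[i][j - 1] + 1),     // insertion
                        d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[a.length()][b.length()];
    }
}
```

A cleanup pass could map tokens within distance 1 of a dictionary word (and not themselves in the dictionary) back to that word, though on 10^7 documents you would want a smarter candidate lookup than comparing against the whole dictionary.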
Re: Lucene to index OCR text
I've been poking around the list archives and didn't really come up against anything interesting. Anyone using Lucene to index OCR text? Any strategies/algorithms/packages you recommend? I have a large collection (10^7 docs) that's mostly the result of OCR. We index/search/etc. with Lucene without any trouble, but OCR errors are a problem, when doing exact phrase matches in particular. I'm looking for ideas on how to deal with this thorny problem. How about letter-by-letter n-grams coupled with SpanQueries (or more likely, a custom query utilizing the TermPositions iterator)? -- Kyle Maxwell Software Engineer CastTV, Inc http://www.casttv.com
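The letter n-gram suggestion works because an OCR error corrupts only the few grams that overlap the bad character, while the rest of the word's grams still match. Generating the grams is trivial; the sketch below (made-up class name) shows the decomposition that would feed the index.

```java
import java.util.ArrayList;
import java.util.List;

// Decompose a word into its overlapping character n-grams. Indexing these
// instead of (or alongside) whole words lets a query match a word even
// when OCR has corrupted one character: e.g. "lucene" and "lvcene" still
// share the trailing grams "cen" and "ene".
public class LetterNgrams {
    public static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }
}
```

At query time you would require most (not all) of a word's grams to match, in order - which is where the SpanQuery / TermPositions machinery mentioned above comes in.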
[ANNOUNCE] Lucene Java 2.3.0 release available
Release 2.3.0 of Lucene Java is now available! Many new features, optimizations, and bug fixes have been added since 2.2, including:

* significantly improved indexing performance
* segment merging in background threads
* refreshable IndexReaders
* faster StandardAnalyzer and improved Token API
* TermVectorMapper to customize how term vectors are loaded
* live backups (without pausing indexing) with SnapshotDeletionPolicy
* CheckIndex tool to test and recover a corrupt index
* pluggable MergePolicy and MergeScheduler
* partial optimize(int maxNumSegments) method
* new contrib module for working with Wikipedia content

The detailed change log is at: http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_3_0/CHANGES.txt

Lucene 2.3 includes index format changes that are not readable by older versions of Lucene. Lucene 2.3 can both read and update older Lucene indexes. Adding to an index with an older format will cause it to be converted to the newer format.

Binary and source distributions are available at http://www.apache.org/dyn/closer.cgi/lucene/java/ Lucene artifacts are also available in the Maven2 repository at http://repo1.maven.org/maven2/org/apache/lucene/

-Michael (on behalf of the Lucene team)
Threads blocking on isDeleted when swapping indices for a very long time...
Hi all, I've been tracking down a problem happening in our production environment. When we switch an index (after doing deletes and adds, running some searches, and finally changing the pointer from the old index to the new), the threads all start stacking up waiting on isDeleted(). The threads do seem to finish; they just get really slow, taking up to 30-60 seconds. The problem has been discussed here before, in 2005: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200510.mbox/[EMAIL PROTECTED] Does anyone have any suggestions on how to work around this? -M
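A workaround that has been suggested for this kind of contention is to pay the synchronized isDeleted() cost once, right after opening the new reader, by snapshotting the deletions into a per-searcher BitSet that hot search paths then consult lock-free. Sketch below in plain Java; `DeletionSnapshot` is a made-up name and the `DeletedCheck` interface stands in for the IndexReader methods (maxDoc/isDeleted) used here.

```java
import java.util.BitSet;

// Snapshot an index reader's deletions into a BitSet so that subsequent
// deleted-doc checks are plain bit tests instead of synchronized method
// calls that many search threads contend on.
public class DeletionSnapshot {
    // Stand-in for the two IndexReader methods this technique needs.
    public interface DeletedCheck {
        int maxDoc();
        boolean isDeleted(int doc);
    }

    public static BitSet snapshot(DeletedCheck reader) {
        BitSet deleted = new BitSet(reader.maxDoc());
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (reader.isDeleted(doc)) {
                deleted.set(doc); // pay the per-call cost once, up front
            }
        }
        return deleted;           // contention-free reads from here on
    }
}
```

The snapshot is only valid for the life of that reader, which fits the switch-over pattern described above: build it once when the new index is opened, throw it away with the old one.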
Re: Archiving Index using partitions
Thanks Otis for your response. I've a few more questions:

1) Is it recommended to do index partitioning for large indexes?
- We index around 35 fields (storing only two of them - simple ids)
- Each document is around 200 bytes
- Our index grows to around 50G a week

2) The reasons I could think of for partitioning would be:
- optimization would be faster on smaller indexes
- search would be faster if I have to search only on a specific partition
- I would be able to archive old partitions
- even if a partition gets corrupt I wouldn't lose all data

Is this correct? Are there any other reasons? Thanks, -vivek

On Jan 21, 2008 2:32 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Why not just design your system to roll over to a new index on a weekly basis (new IndexWriter on a new index dir, roughly speaking)? You can't partition a single Document, if that is what you are asking. But you can create multiple smaller indices (e.g. weekly) instead of one large one, and then every 2 weeks archive the one that is 2 weeks old. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message - From: vivek sar [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Monday, January 21, 2008 3:06:50 PM Subject: Archiving Index using partitions

Hi, As a requirement I need to be able to archive any indexes older than 2 weeks (due to space and performance reasons). That means I would need to maintain weekly indexes. Here are my questions:

1) What's the best way to partition indexes using Lucene?
2) Is there a way I can partition documents, but not indexes? I don't want each partitioned index to be a full index, as that would be a waste of space. We collect over 10K new documents per min (with each document around 250 bytes).
3) Is ParallelMultiSearcher the way to go for partitioned indexes? Do I ever have to merge these partitioned indexes?
4) I'm hoping I can reload the archived indexes in the future if needed. Not sure if there is a standard way to archive indexes using Lucene. 
Thanks, -vivek
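The weekly-rollover suggestion boils down to deriving the index directory name from the current week, so the writer naturally starts a fresh index each week and whole directories can be archived or dropped once they age past two weeks. A minimal sketch, assuming an "index-YYYY-wWW" naming convention (the class and prefix are made-up examples):

```java
import java.time.LocalDate;
import java.time.temporal.WeekFields;

// Map a date to a per-week index directory name using ISO week numbering.
// All documents arriving in the same ISO week go to the same directory;
// archiving "indexes older than 2 weeks" is then just moving directories.
public class WeeklyIndexDir {
    public static String dirFor(LocalDate date) {
        int week = date.get(WeekFields.ISO.weekOfWeekBasedYear());
        int year = date.get(WeekFields.ISO.weekBasedYear());
        return String.format("index-%04d-w%02d", year, week);
    }
}
```

At open time you would point a new IndexWriter at dirFor(today); searches spanning several weeks would open one searcher per surviving directory, along the lines of the ParallelMultiSearcher question above.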