Re: Bucketing (was Re: Wikia search goes live today)
Otis Gospodnetic wrote:
> Sounds useful. I suppose this means one would have a custom function
> for within-bucket reordering? E.g. for a web search you might reorder
> based on URL length if you think shorter URLs are an indicator of
> higher quality. It also sounds like something that can easily sit
> outside Lucene, or do you have something else in mind, such as a
> mechanism to pass a reordering function into Lucene?

Yes, that's precisely the idea. It combines the advantages of simple (hence fast) scoring inside the IR system with a complex (hence slow) reordering of a small sample of results, performed outside the IR system prior to delivering the results.

It should definitely be something outside Lucene - it's meant for cases that require ranking more complex (or faster) than what is available through function query. I only mentioned it here because it is simple to implement, yet produces useful results that are difficult to obtain through the usual means (similarity, boosting, even function query).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
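A minimal sketch of what such an external reorderer might look like, assuming the 2.x Hits API and a stored "url" field (an illustration only, not Andrzej's actual code; a fuller version would group the sample into coarse score buckets and reorder within each):

    import java.util.Arrays;
    import java.util.Comparator;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class BucketReranker {
        // Fast, simple scoring inside Lucene; slow, complex reordering of a
        // small sample of the top hits outside it, before delivering results.
        public static Document[] rerankByUrlLength(IndexSearcher searcher,
                                                   Query q, int sampleSize)
                throws Exception {
            Hits hits = searcher.search(q);
            int n = Math.min(sampleSize, hits.length());
            Document[] top = new Document[n];
            for (int i = 0; i < n; i++) {
                top[i] = hits.doc(i);
            }
            // Within-bucket reordering: shorter URLs first.
            Arrays.sort(top, new Comparator() {
                public int compare(Object a, Object b) {
                    return ((Document) a).get("url").length()
                            - ((Document) b).get("url").length();
                }
            });
            return top;
        }
    }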
Why is lucene so slow indexing in nfs file system ?
Hi:

I have seen the post at http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and I am implementing a similar application in a distributed environment: a cluster of only 5 nodes. The operating system I use is Linux (CentOS), so I am using an NFS file system to access the home directory where the documents to be indexed reside, and I would like to know how much time an application should spend to index a large amount of documents, say 10 GB. I use Lucene version 2.2.0; every node has a dual Xeon 2.4 GHz CPU and 512 MB of RAM, and the LAN is 1 Gbit/s.

The problem I have is that my application spends a lot of time indexing all the documents: the delay to index 10 GB of PDF documents is about 2 days (to convert PDF to text I am using PDFBox), which is of course a lot of time. Other applications based on Lucene, for instance IBM OmniFind, take only 5 hours to index the same amount of PDF documents. I would like to find out why my application has this big delay to index; any help is welcome.

Do you know of other distributed applications that use Lucene to index large amounts of documents? How long do they take to index?

I hope you can help me. Greetings
Re: Bucketing (was Re: Wikia search goes live today)
Would be a nice contrib module, though...

-Grant

On Jan 9, 2008, at 5:30 AM, Andrzej Bialecki wrote:
> Yes, that's precisely the idea. It combines the advantages of simple
> (hence fast) scoring inside the IR system, with a complex (hence
> slow) reordering of a small sample of results [...]

--
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Basic Named Entity Indexing
Taking your example (text by John Bear, old.), the NGramAnalyzerWrapper creates the following tokens: "text", "text by", "by", "by John", "John", "John Bear,", "Bear,", "Bear, old."

I have managed to get rid of the error, but now it just doesn't add anything to the index :s I'm attaching the NGramAnalyzerWrapper and NGramFilter which I am referring to, as well as my own NamedEntityAnalyzer/TokenFilter, which may help you understand better. http://www.nabble.com/file/p14712313/rem.rar
Re: Query processing with Lucene
On Tuesday 08 January 2008 22:49:18 Doron Cohen wrote:
> This is done by Lucene's scorers. You should however start at
> http://lucene.apache.org/java/docs/scoring.html - scorers are
> described in the Algorithm section.

Offsets are used by the phrase scorers and by the span scorer - that is, for the case that "offsets" was meant as positions within a document. It is also possible that "offsets" was meant in the sense of using skipTo(doc) instead of next() on a Scorer; this is done during query search when at least one term is required.

Regards,
Paul Elschot

On Jan 8, 2008 11:24 PM, Marjan Celikik wrote:
> Doron Cohen wrote:
>> Hi Marjan,
>> Lucene processes the query in what can be called a one-doc-at-a-time
>> fashion. For the example query - x y - (not the phrase query "x y") -
>> all documents containing either x or y are considered a match. When
>> processing the query x y, the posting lists of these two index terms
>> are traversed, and for each document met on the way a score is
>> computed (taking into account both terms) and collected. At the end
>> of the traversal, usually the best N collected docs are returned as
>> the search result. So this is an exhaustive computation, creating a
>> union of the two posting lists. For the query - +x +y - an
>> intersection rather than a union is required, and the way Lucene does
>> it is again to traverse the two posting lists, except that only
>> documents seen in both lists are scored and collected. This allows
>> Lucene to optimize the search, skipping large chunks of the posting
>> lists, especially when one term is rarer than the other.
>
> Thank you for your answer. I am having trouble finding the function
> which traverses the documents such that they get scored. Can you
> please tell me where the posting lists (for a +x +y query) get
> intersected after they get read (by next(), I guess) from the index?
> In particular, I am interested in how Lucene gets the positions
> (offsets) of the documents seen in both posting lists, i.e. positions
> (in a document) for the query word x, and positions for the query
> word y.
>
> Thank you in advance!
>
> Marjan.
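To see the skipTo() idea in isolation, here is a minimal sketch of intersecting two posting lists the way a conjunction does, using the public TermDocs API rather than Lucene's internal scorers (field and term names are made up; for the within-document positions, reader.termPositions(term) with nextPosition() gives the view the phrase scorers use):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class ConjunctionDemo {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(args[0]);
            TermDocs x = reader.termDocs(new Term("content", "x"));
            TermDocs y = reader.termDocs(new Term("content", "y"));
            // Leapfrog: each list skips to the other's current doc,
            // jumping over chunks of postings instead of scanning them.
            if (x.next() && y.next()) {
                while (true) {
                    if (x.doc() == y.doc()) {
                        System.out.println("both terms in doc " + x.doc());
                        if (!x.next()) break;
                    } else if (x.doc() < y.doc()) {
                        if (!x.skipTo(y.doc())) break;
                    } else {
                        if (!y.skipTo(x.doc())) break;
                    }
                }
            }
            reader.close();
        }
    }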
Re: Why is lucene so slow indexing in nfs file system ?
> would like to find out why my application has this big delay to index

Well, then you have to measure <g>. The first thing I'd do is pinpoint where the time is being spent. Until you have that answered, you simply cannot take any meaningful action.

1> Don't do any of the indexing. No new Documents, don't add any fields, etc. This will just time the PDF parsing. (I'd run this for a set number of documents rather than the whole 10 GB.) This will tell you whether the issue is indexing or PDFBox. A timing sketch follows at the end of this message.
2> Perhaps try the above with local files rather than files on the NFS mount.
3> Put back some of the indexing and measure each step. For instance, create the new Documents but don't add them to the index.
4> Then go ahead and add them to the index.

The numbers you get for these measurements will tell you a lot. At that point, perhaps folks will have more useful suggestions.

The reason I'm being so unhelpful is that without lots more detail there's really nothing we can help with, since there are so many variables that it's just impossible to say which one is the problem. For instance, is it a single 10 GB document and you're swapping like crazy? Are you CPU bound or IO bound? Have you tried profiling your process at all to find the choke points?

Best
Erick

On Jan 9, 2008 8:50 AM, Ariel wrote:
> The problem I have is that my application spends a lot of time to
> index all the documents [...]
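For step 1, a rough sketch of timing the PDF-to-text step alone might look like the following (the org.pdfbox package names are from the pre-Apache PDFBox releases; adjust for your version):

    import java.io.File;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class ParseOnlyTimer {
        public static void main(String[] args) throws Exception {
            File[] files = new File(args[0]).listFiles();
            PDFTextStripper stripper = new PDFTextStripper();
            int limit = Math.min(100, files.length); // fixed sample, not all 10 GB
            long start = System.currentTimeMillis();
            for (int i = 0; i < limit; i++) {
                PDDocument pdf = PDDocument.load(files[i]);
                try {
                    stripper.getText(pdf); // parse only: no Documents, no index
                } finally {
                    pdf.close();
                }
            }
            System.out.println(limit + " files parsed in "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }

Run it once against the NFS mount and once against a local copy of the same files (step 2), and the difference tells you how much NFS itself is costing you.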
RE: Why is lucene so slow indexing in nfs file system ?
Hi Ariel,

On 01/09/2008 at 8:50 AM, Ariel wrote:
> Do you know of other distributed applications that use Lucene to
> index large amounts of documents?

Apache Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat. http://lucene.apache.org/solr/

Steve
Re: Basic Named Entity Indexing
Solved it... I was using token.toString() instead of token.termText(). Thanks for the help :)
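For anyone hitting the same thing with the 2.x Token API: toString() gives a debugging representation of the token rather than the raw term, while termText() returns just the term text. A minimal sketch of the corrected pattern (the filter name and what it does are invented here):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class NamedEntityFilter extends TokenFilter {
        public NamedEntityFilter(TokenStream in) {
            super(in);
        }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) {
                return null;
            }
            String text = t.termText(); // not t.toString()
            return new Token(text, t.startOffset(), t.endOffset(), t.type());
        }
    }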
Re: Why is lucene so slow indexing in nfs file system ?
There's also Nutch. However, 10GB isn't that big... Perhaps you can index where the docs/index lives, then just make the index available via NFS? Or, better yet, use rsync to replicate it like Solr does.

-Grant

On Jan 9, 2008, at 10:49 AM, Steven A Rowe wrote:
> Apache Solr is an open source enterprise search server based on the
> Lucene Java search library [...]

--
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Empty lucene-similarity jars on maven mirrors
Hi,

The lucene-similarity (2.1.0 and 2.2.0) jar files available on Maven mirrors don't contain any files: http://mvnrepository.com/artifact/org.apache.lucene/lucene-similarity/2.2.0

Seems like a deployment config problem.

--
~sanjay
Highlighting + phrase queries
Dear all,

Let's assume I have a phrase query and a document which contains the phrase, but also contains separate occurrences of each query term. How does the highlighter know that it should only display fragments which contain the phrase, and not fragments which contain only the query words (not as a phrase)?

Thank you in advance!

Marjan.
Re: Highlighting + phrase queries
The contrib Highlighter doesn't know and highlights them all. Check out my patch here for position sensitive highlighting: https://issues.apache.org/jira/browse/LUCENE-794

Marjan Celikik wrote:
> Let's assume I have a phrase query and a document which contains the
> phrase, but also contains separate occurrences of each query term [...]
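For reference, the stock contrib Highlighter pattern that shows this behaviour looks roughly like the sketch below. QueryScorer extracts the terms of the phrase query individually and scores fragments per term, with no position information, so isolated occurrences of each word get highlighted too:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    public class HighlightDemo {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Query query = new QueryParser("content", analyzer).parse("\"x y\"");
            Highlighter h = new Highlighter(new QueryScorer(query));
            String text = "x alone ... y alone ... and the phrase x y here";
            // Highlights the lone x and y as well as the phrase.
            System.out.println(h.getBestFragment(analyzer, "content", text));
        }
    }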
Re: Highlighting + phrase queries
Mark Miller wrote:
> The contrib Highlighter doesn't know and highlights them all. Check
> out my patch here for position sensitive highlighting:
> https://issues.apache.org/jira/browse/LUCENE-794

OK, before trying it out I would like to know: does the patch work for mixed queries, e.g. a b +c -d f g?

Thanks!

Marjan.
how do I get my own TopDocHitCollector?
Question: the documents that I index have two ids - a unique document id, and a record_id that can link multiple documents together that belong to a common record. I'd like to use something like TopDocs to return the first 1024 results that have unique record_ids, but I will want to skip some of the returned documents that have the same record_id. We're using the ParallelMultiSearcher. I read that I could use a HitCollector and throw an exception to get it to stop, but is there a cleaner way?
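The exception trick mentioned above usually looks something like the sketch below - all names are invented, and lookupRecordId() stands in for however you map a Lucene doc id to your record_id (e.g. via FieldCache). Note that collect() sees hits in index order, not score order, so "first 1024" here means first by doc id:

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.search.HitCollector;

    public class FirstNUniqueRecords extends HitCollector {
        public static class StopCollecting extends RuntimeException {}

        private final Set seen = new HashSet();
        private final int limit;

        public FirstNUniqueRecords(int limit) {
            this.limit = limit;
        }

        public void collect(int doc, float score) {
            String recordId = lookupRecordId(doc);
            // seen.add() returns false for a duplicate record_id, so
            // duplicates neither count toward nor trigger the limit.
            if (seen.add(recordId) && seen.size() >= limit) {
                throw new StopCollecting(); // aborts the search loop
            }
        }

        private String lookupRecordId(int doc) {
            return null; // placeholder: FieldCache or a stored field
        }
    }

    // usage:
    //   try { searcher.search(query, new FirstNUniqueRecords(1024)); }
    //   catch (FirstNUniqueRecords.StopCollecting expected) { /* done */ }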
Design questions
Hi,

I have to index (tokenized) documents which may have very many pages, up to 10,000. I also have to know on which pages the search phrase occurs. I have to update some stored index fields for my document; the content is never changed. Thus I think I have to add one Lucene document with the index fields and one Lucene document per page.

Mapping
===
MyDocument
- ID
- Field 1-N
- Page 1-N

Lucene
- Lucene Document with ID, page number 0, and Field 1-N (stored fields)
- Lucene Document 1 with ID, page number 1, and tokenized content of Page 1
...
- Lucene Document N with ID, page number N, and tokenized content of Page N

Delete of MyDocument: IndexWriter#deleteDocuments(Term: ID=foo)
Update of stored index fields: IndexWriter#updateDocument(Term: ID=foo, page number = 0)

Search with index and content:
Step 1: Search on stored index fields -> list of IDs
Step 2: Search on ID field (list from above OR'ed together) and content -> list of IDs and page numbers

Does this work? What drawbacks does this approach have? Is there another way to achieve what I want?

Thank you.

P.S. There are millions of documents with a page range from 1 to 10,000.
Re: Why is lucene so slow indexing in nfs file system ?
Ariel wrote:
> The problem I have is that my application spends a lot of time to
> index all the documents, the delay to index 10 gb of pdf documents is
> about 2 days (to convert pdf to text I am using pdfbox) [...]

If you are using log4j, make sure you have the PDFBox log4j categories set to INFO or higher, otherwise this really slows it down (by a factor of 10), or make sure you are using the non-log4j version. See http://sourceforge.net/forum/message.php?msg_id=3947448

Antony
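If you'd rather do it in code than in log4j.properties, the equivalent is roughly the following (assuming log4j 1.2 and the pre-Apache org.pdfbox package name for the category):

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    // Raise PDFBox's very chatty debug category to INFO before indexing starts.
    Logger.getLogger("org.pdfbox").setLevel(Level.INFO);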
Re: Design questions
You can do several things:

Rather than index one doc per page, you could index a special token between pages. Say you index $ as the special token, so your index looks like ...last token of page 1, $, first token of page 2... ...last token of page 2, $, first token of page 3... and so on. Now, if you used SpanNearQuery with a slop of 0, you would never match across pages. And you can call SpanNearQuery.getSpans() to get the offsets of all your matches. You can then correlate these to pages by using TermPositions (?) or a similar interface and determine which pages you matched on. This is not as expensive as it sounds, since you're not reading the document, just the indexes.

This is a possibility, though I'd think it would be easier to keep track of if there's a 1-to-1 correspondence between your documents in the two indexes.

As an aside, note that you don't *require* two separate indexes. There's no requirement that all documents in an index have the same fields. So you could index your meta-data with an ID of, say, meta_doc_id and your page text with text_id, where these are your unique (external to Lucene) IDs. Then you could delete with a term delete on meta_doc_id. So a meta-doc looks something like:

meta_doc_id: 453
field1: ...
field2: ...
field3: ...

and the text doc (the one and only) would be:

text_id: 543
text: (all 10,000 pages, with page delimiters, maybe - see below)

You could even store all of the page offsets in your meta-data document in a special field if you wanted, then lazy-load that field rather than dynamically counting. You'd have to be careful that your offsets corresponded to the data *after* it was analyzed, but that shouldn't be too hard. You'd have to read this field before deleting the doc and make sure it was stored with the replacement.

One caution: Lucene by default only indexes the first 10,000 tokens of a field in a document. So be sure to bump this limit with IndexWriter.setMaxFieldLength.

If you stored all the offsets of page breaks, you wouldn't have to store the special token, since you'd have no reason to count them later. Be aware that you'd then get a match for a phrase that spans the last word of one page and the first word of the next. Which may be good, but you'll have to decide that. You should be able to do this pretty easily with a custom Analyzer.

One more point: I once determined that the following two actions are identical: 1) create one really big string with all the page data concatenated together and add it to a document, or 2) just add successive fragments to the same document. That is,

Document doc = new Document();
doc.add(new Field("text", allPagesConcatenated, ...));

is just like

Document doc = new Document();
while (morePages) {
    doc.add(new Field("text", textForThisPage, ...));
}

I like the second variant better.

And, since I'm getting random ideas anyway, here's another. The position increment gap is the distance (measured in terms) between two tokens. Let's claim that you have no page with more than 10,000 (or whatever) tokens. Just bump the position increment to the next multiple of 10,000 at the start of each page. So the first term on the first page has a position of 0, the first term on the second page has a position of 10,000, and the first term on the third page has a position of 20,000. Now, with the SpanNearQuery trick from above, your term position divided by 10,000 is also your page. This would also NOT match across pages. Hmm, I kind of like that idea. (A sketch follows at the end of this message.)

I guess my last question is: how often will a document change? The added complexity of keeping two documents per unique ID may be unnecessary if your documents don't change all that often.

Anyway, all FWIW.

Best
Erick

On Jan 9, 2008 4:39 PM, [EMAIL PROTECTED] wrote:
> I have to index (tokenized) documents which may have very many pages,
> up to 10,000. I also have to know on which pages the search phrase
> occurs [...]
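A minimal sketch of that last idea for Lucene 2.2, assuming no page exceeds the gap and that your analyzer can be subclassed (names invented). Note that a constant gap only guarantees pages never run together; to make position / 10,000 identify the page exactly, you'd need a token filter that rounds each page's starting position up to the next multiple of 10,000:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class PageGapAnalyzer extends StandardAnalyzer {
        // Gap inserted between successive "text" field instances, so a
        // SpanNearQuery with a small slop can never match across pages.
        public int getPositionIncrementGap(String fieldName) {
            return 10000;
        }

        // One Lucene document, one field instance per page. The writer
        // must be built with: new IndexWriter(dir, new PageGapAnalyzer(), ...)
        static void addPagedDocument(IndexWriter writer, String id,
                                     String[] pages) throws IOException {
            Document doc = new Document();
            doc.add(new Field("id", id, Field.Store.YES,
                              Field.Index.UN_TOKENIZED));
            for (int p = 0; p < pages.length; p++) {
                doc.add(new Field("text", pages[p], Field.Store.NO,
                                  Field.Index.TOKENIZED));
            }
            writer.addDocument(doc);
        }
    }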
RE: Empty lucene-similarity jars on maven mirrors
Hi Sanjay,

On 01/09/2008 at 3:02 PM, Sanjay Dahiya wrote:
> lucene-similarity (2.1.0 and 2.2.0) jar files available on maven
> mirrors don't contain any files.

That's because the o.a.l.search.similar package (the sole contents of the contrib/similarity/ directory) has been empty as of the 2.1.0 release. The idea behind its continued existence in this state is for it to be a home for future contributions of custom similarity implementations.

From Lucene's changelog (trunk/CHANGES.txt):

--
=== Release 2.1.0 2007-02-14 ===
[...]
Bug fixes
[...]
24. LUCENE-728: Removed duplicate/old MoreLikeThis and SimilarityQueries classes from contrib/similarity, as their new home is under contrib/queries. (Otis Gospodnetic)
--

Here's the issue that tracked this change: http://issues.apache.org/jira/browse/LUCENE-728

Steve
Re: how do I get my own TopDocHitCollector?
Beard, Brian wrote:
> The documents that I index have two id's - a unique document id and a
> record_id that can link multiple documents together that belong to a
> common record [...]

I'm doing a similar thing. I have external ids (equivalent to your record_id), which have one or more Lucene Documents associated with them. I wrote a custom HitCollector that uses a Map to hold the so-far-collected external ids along with the collected document. I had to write my own priority queue to know when an object was dropped off the bottom of the score-sorted queue, but the latest PriorityQueue on the trunk now has insertWithOverflow(), which does the same thing. Note that ResultDoc extends ScoreDoc, so that the external id of the item dropped off the queue can be used to remove it from my Map.

The code snippet is roughly as below (I am caching my external ids, hence the cache usage):

    protected Map<OfficeId, ResultDoc> results;

    public void collect(int doc, float score) {
        if (score > 0.0f) {
            totalHits++;
            if (pq.size() < numHits || score > minScore) {
                OfficeId id = cache.get(doc);
                ResultDoc rd = results.get(id);
                // No current result for this ID found yet
                if (rd == null) {
                    rd = new ResultDoc(id, doc, score);
                    ResultDoc added = pq.insert(rd);
                    if (added == null) {
                        // Nothing dropped off the bottom
                        results.put(id, rd);
                    } else {
                        // Return value dropped off the bottom
                        results.remove(added.id);
                        results.put(id, rd);
                        remaining++;
                    }
                }
                // Already found this ID, so replace the high score if necessary
                else if (score > rd.score) {
                    pq.remove(rd);
                    rd.score = score;
                    pq.insert(rd);
                }
                // Readjust our minimum score from the top entry
                minScore = pq.peek().score;
            } else {
                remaining++;
            }
        }
    }

HTH
Antony
Re: Why is lucene so slow indexing in nfs file system ?
Ariel,

I believe PDFBox is not the fastest thing, and was built more to handle all possible PDFs than for speed (just my impression - Ben, PDFBox's author, might still be on this list and might comment).

Pulling data from NFS to index seems like a bad idea. I hope at least the indices are local and not on a remote NFS... We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one), and indexing over NFS was slooow.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Ariel [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Wednesday, January 9, 2008 2:50:41 PM
Subject: Why is lucene so slow indexing in nfs file system ?
> [...]
Re: linkedin group for lucene interest group
Hm, propaganda! :)

There is also a Lucene group on Simpy with lots of Lucene/search/IR resources - http://www.simpy.com/group/363 - you'll see some familiar names from the list on the right side of the screen. Let me know if you want to join.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: John Wang [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Wednesday, January 9, 2008 6:23:30 PM
Subject: linkedin group for lucene interest group

> To join: http://www.linkedin.com/e/gis/49647/019FD71A8AEF
>
> -John