RE: Lucene 4.0 scalability and performance.
Thank you

-----Original Message-----
From: Steve Rowe [mailto:sar...@gmail.com]
Sent: Sunday, December 23, 2012 8:20 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 4.0 scalability and performance.

Hi Vitaly,

Anything by Tom Burton-West should interest you - he works on the HathiTrust digital library project http://www.hathitrust.org, which currently indexes 7TB of full-length books, e.g.:

Practical Relevance Ranking for 10 Million Books (paper)
INEX 2012, September 2012, Rome, Italy
http://www.clef-initiative.eu/documents/71612/943abea5-6e48-48dd-ba89-72c174d001ef

HathiTrust Large Scale Search: Scalability meets Usability (slides)
Code4Lib 2012, February 2012, Seattle, Washington
http://www.hathitrust.org/documents/HathiTrust-Code4Lib-201202.pptx

Large-scale Search (blog)
http://www.hathitrust.org/blogs/large-scale-search

Steve

On Dec 23, 2012, at 6:11 AM, vitaly_arte...@mcafee.com wrote:

> Hi all,
>
> We are starting to evaluate Lucene 4.0 for use in a production environment.
> This means that we need to index millions of documents with terabytes of content and search in it.
> For now we want to define only one indexed field, containing the content of the documents, with the ability to search for terms and retrieve the term offsets.
>
> Has anybody already tested Lucene with terabytes of data?
> Does Lucene have any known limitations related to the number of indexed documents or to the size of the indexed documents?
> What about search performance on a huge data set?
>
> Thanks in advance,
> Vitaly

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Lucene 4.0 scalability and performance.
On 23.12.2012 12:11, vitaly_arte...@mcafee.com wrote:

> This means that we need to index millions of documents with terabytes of content and search in it.
> For now we want to define only one indexed field, containing the content of the documents, with the ability to search for terms and retrieve the term offsets.
>
> Has anybody already tested Lucene with terabytes of data?
> Does Lucene have any known limitations related to the number of indexed documents or to the size of the indexed documents?
> What about search performance on a huge data set?

Hi Vitaly,

we've been working on a linguistic search engine based on Lucene 4.0 and have performed a few tests with large text corpora. There is at least some overlap with the functionality you mentioned (term offsets). See http://www.oegai.at/konvens2012/proceedings/27_schnober12p/ (mainly section 5).

Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
RE: Lucene 4.0 scalability and performance.
Thank you

-----Original Message-----
From: Carsten Schnober [mailto:schno...@ids-mannheim.de]
Sent: Monday, December 24, 2012 3:25 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 4.0 scalability and performance.
Re: Lucene 4.0 scalability and performance.
Hi Vitaly,

Anything by Tom Burton-West should interest you - he works on the HathiTrust digital library project http://www.hathitrust.org, which currently indexes 7TB of full-length books, e.g.:

Practical Relevance Ranking for 10 Million Books (paper)
INEX 2012, September 2012, Rome, Italy
http://www.clef-initiative.eu/documents/71612/943abea5-6e48-48dd-ba89-72c174d001ef

HathiTrust Large Scale Search: Scalability meets Usability (slides)
Code4Lib 2012, February 2012, Seattle, Washington
http://www.hathitrust.org/documents/HathiTrust-Code4Lib-201202.pptx

Large-scale Search (blog)
http://www.hathitrust.org/blogs/large-scale-search

Steve