RE: Lucene 4.0 scalability and performance.

2012-12-24 Thread Vitaly_Artemov
Thank you

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





Re: Lucene 4.0 scalability and performance.

2012-12-24 Thread Carsten Schnober
On 23.12.2012 at 12:11, vitaly_arte...@mcafee.com wrote:


 This means that we need to index millions of documents with terabytes of 
 content and search them.
 For now we want to define only one indexed field, containing the content of 
 the documents, with the ability to search for terms and retrieve the term 
 offsets.
 Has anybody already tested Lucene with terabytes of data?
 Does Lucene have any known limitations related to the number of indexed 
 documents or to the size of the indexed documents?
 What about search performance on a huge data set?

Hi Vitaly,
we've been working on a linguistic search engine based on Lucene 4.0 and
have performed a few tests with large text corpora. There is at least
some overlap with the functionality you mentioned (term offsets). See
http://www.oegai.at/konvens2012/proceedings/27_schnober12p/ (mainly
section 5).
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




RE: Lucene 4.0 scalability and performance.

2012-12-24 Thread Vitaly_Artemov
Thank you




Re: Lucene 4.0 scalability and performance.

2012-12-23 Thread Steve Rowe
Hi Vitaly,

Anything by Tom Burton-West should interest you - he works on the HathiTrust 
digital library project http://www.hathitrust.org, which currently indexes 
7TB of full-length books, e.g.:

Practical Relevance Ranking for 10 Million Books (paper)
INEX 2012, September 2012, Rome, Italy
http://www.clef-initiative.eu/documents/71612/943abea5-6e48-48dd-ba89-72c174d001ef

HathiTrust Large Scale Search: Scalability meets Usability (slides)
Code4Lib 2012, February 2012, Seattle, Washington 
http://www.hathitrust.org/documents/HathiTrust-Code4Lib-201202.pptx

Large-scale Search (blog)
http://www.hathitrust.org/blogs/large-scale-search

Steve

On Dec 23, 2012, at 6:11 AM, vitaly_arte...@mcafee.com wrote:

 Hi all,
 We are starting to evaluate Lucene 4.0 for use in a production environment.
 This means that we need to index millions of documents with terabytes of 
 content and search them.
 For now we want to define only one indexed field, containing the content of 
 the documents, with the ability to search for terms and retrieve the term 
 offsets.
 Has anybody already tested Lucene with terabytes of data?
 Does Lucene have any known limitations related to the number of indexed 
 documents or to the size of the indexed documents?
 What about search performance on a huge data set?
 Thanks in advance, Vitaly

