RE: Lucene 4.0 scalability and performance.

2012-12-24 Thread Vitaly_Artemov
Thank you

-----Original Message-----
From: Steve Rowe [mailto:sar...@gmail.com] 
Sent: Sunday, December 23, 2012 8:20 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 4.0 scalability and performance.

Hi Vitaly,

Anything by Tom Burton-West should interest you - he works on the HathiTrust 
digital library project http://www.hathitrust.org, which currently indexes 
7TB of full-length books, e.g.:

Practical Relevance Ranking for 10 Million Books (paper) INEX 2012, September 
2012, Rome, Italy 
http://www.clef-initiative.eu/documents/71612/943abea5-6e48-48dd-ba89-72c174d001ef

HathiTrust Large Scale Search: Scalability meets Usability (slides) Code4Lib 
2012, February 2012, Seattle, Washington 
http://www.hathitrust.org/documents/HathiTrust-Code4Lib-201202.pptx

Large-scale Search (blog)
http://www.hathitrust.org/blogs/large-scale-search

Steve

On Dec 23, 2012, at 6:11 AM, vitaly_arte...@mcafee.com wrote:

 Hi all,
 We are starting to evaluate Lucene 4.0 for use in a production environment.
 This means that we need to index millions of documents with terabytes of 
 content and search in it.
 For now we want to define only one indexed field, containing the content of 
 the documents, with the ability to search for terms and retrieve the term 
 offsets.
 Has anybody already tested Lucene with terabytes of data?
 Does Lucene have any known limitations related to the number of indexed 
 documents or to the size of indexed documents?
 What about search performance on a huge data set?
 Thanks in advance, Vitaly
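
To make the requirement above concrete, here is a minimal sketch of what "search for a term and retrieve its offsets" amounts to. It is plain Java and deliberately uses no Lucene classes; the class name TermOffsetSketch and the method index are hypothetical, and whitespace tokenization with lower-casing is an assumption made only for illustration. It says nothing about how Lucene stores offsets internally or how it behaves at terabyte scale.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of "search a term, get its offsets"; this is NOT Lucene code.
class TermOffsetSketch {

    /** Maps each lower-cased term to its [start, end) character offsets in the text. */
    static Map<String, List<int[]>> index(String content) {
        Map<String, List<int[]>> postings = new HashMap<>();
        int i = 0;
        int n = content.length();
        while (i < n) {
            // Skip whitespace between tokens.
            while (i < n && Character.isWhitespace(content.charAt(i))) i++;
            int start = i;
            // Consume one token (a run of non-whitespace characters).
            while (i < n && !Character.isWhitespace(content.charAt(i))) i++;
            if (i > start) {
                String term = content.substring(start, i).toLowerCase(Locale.ROOT);
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(new int[] {start, i});
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        Map<String, List<int[]>> postings = index("Lucene scales; lucene searches");
        // Print every occurrence of "lucene" as start-end.
        for (int[] span : postings.get("lucene")) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```

The in-memory map is the part that does not scale; a real index keeps this information on disk, which is why the pointers in the replies are worth reading before sizing anything.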


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





Re: Lucene 4.0 scalability and performance.

2012-12-24 Thread Carsten Schnober
On 23.12.2012 12:11, vitaly_arte...@mcafee.com wrote:


 This means that we need to index millions of documents with terabytes of 
 content and search in it.
 For now we want to define only one indexed field, containing the content of 
 the documents, with the ability to search for terms and retrieve the term 
 offsets.
 Has anybody already tested Lucene with terabytes of data?
 Does Lucene have any known limitations related to the number of indexed 
 documents or to the size of indexed documents?
 What about search performance on a huge data set?

Hi Vitaly,
we've been working on a linguistic search engine based on Lucene 4.0 and
have performed a few tests with large text corpora. There are at least
some overlaps in the functionality you mentioned (term offsets). See
http://www.oegai.at/konvens2012/proceedings/27_schnober12p/ (mainly
section 5).
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




RE: Lucene 4.0 scalability and performance.

2012-12-24 Thread Vitaly_Artemov
Thank you




Re: how to implement a TokenFilter?

2012-12-24 Thread Xi Shen
Hi Lance,

I got Lucene 4 from
http://mirror.bjtu.edu.cn/apache/lucene/java/4.0.0/lucene-4.0.0-src.tgz. It
is an Ant project, but I do not know which IDE can import it. I tried
Eclipse, but it cannot import the build.xml file.


Thanks,
D.


On Mon, Dec 24, 2012 at 12:02 PM, Lance Norskog goks...@gmail.com wrote:

 You need to use an IDE. Find the Attribute type and show all subclasses.
 This shows a lot of rare ones and a few which are used a lot. Now, look at
 source code for various TokenFilters and search for other uses of the
 Attributes you find. This generally is how I figured it out.

 Also, after the full Analyzer stack is called, the caller saves the output
 (I guess to codecs?). You can look at which Attributes it saves.


 On 12/23/2012 06:30 PM, Xi Shen wrote:

 thanks a lot :)


 On Mon, Dec 24, 2012 at 10:22 AM, feng lu amuseme...@gmail.com wrote:

  hi Shen

 Maybe you can look at some source code in the org.apache.lucene.analysis
 package, such as LowerCaseFilter.java, StopFilter.java, and so on.

 Some common attributes include:

 offsetAtt = addAttribute(OffsetAttribute.class);
 termAtt = addAttribute(CharTermAttribute.class);
 typeAtt = addAttribute(TypeAttribute.class);
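
The pattern behind those addAttribute calls can be sketched without Lucene on the classpath. The following is a plain-Java model only: TermState, Stream, WhitespaceSource and LowerCaseSketchFilter are hypothetical stand-ins, not Lucene classes, and the real API differs in detail. What it is meant to show is the shape of a filter: it wraps an upstream stream, shares the same per-token state object with it, and rewrites that state in place each time incrementToken() advances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java model of the attribute pattern; these are NOT Lucene classes.
class TokenFilterSketch {

    /** Hypothetical stand-in for a term attribute: mutable state shared along the chain. */
    static final class TermState {
        String term = "";
    }

    /** Hypothetical stand-in for a token stream. */
    static abstract class Stream {
        final TermState termAtt;

        Stream(TermState termAtt) {
            this.termAtt = termAtt;
        }

        /** Advances to the next token; returns false when the stream is exhausted. */
        abstract boolean incrementToken();
    }

    /** Token source: splits the input on whitespace. */
    static final class WhitespaceSource extends Stream {
        private final String[] parts;
        private int next = 0;

        WhitespaceSource(String text) {
            super(new TermState());
            String trimmed = text.trim();
            this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        @Override
        boolean incrementToken() {
            if (next >= parts.length) return false;
            termAtt.term = parts[next++];
            return true;
        }
    }

    /** Filter: wraps an upstream stream and rewrites the shared term in place. */
    static final class LowerCaseSketchFilter extends Stream {
        private final Stream input;

        LowerCaseSketchFilter(Stream input) {
            super(input.termAtt); // same state object as the upstream stream
            this.input = input;
        }

        @Override
        boolean incrementToken() {
            if (!input.incrementToken()) return false;
            termAtt.term = termAtt.term.toLowerCase(Locale.ROOT);
            return true;
        }
    }

    /** Runs the source plus filter chain and collects the resulting terms. */
    static List<String> lowerCaseTerms(String text) {
        Stream stream = new LowerCaseSketchFilter(new WhitespaceSource(text));
        List<String> out = new ArrayList<>();
        while (stream.incrementToken()) out.add(stream.termAtt.term);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lowerCaseTerms("Hello Lucene TokenFilter"));
    }
}
```

Other attributes (offsets, types) would be further shared state objects handled the same way; which ones a given filter needs depends on what it changes about the token.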

 Regards


 On Sun, Dec 23, 2012 at 4:01 PM, Rafał Kuć r@solr.pl wrote:

  Hello!

 The simplest way is to look at the Lucene javadoc and see which
 implementations of the Attribute interface there are:

 http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/util/Attribute.html

 --
 Regards,
   Rafał Kuć
   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch

  Thanks, I have read this already. It is useful, but it is too 'small'.
 E.g., for this.charTermAttr = addAttribute(CharTermAttribute.class);
 I want to know which other attributes I need in order to implement my
 function. Where can I find a reference for these attributes? I tried the
 Lucene and Solr wiki, but all I found was a list of the attribute names,
 nothing about what they are capable of.




  On Sat, Dec 22, 2012 at 10:37 PM, Rafał Kuć r@solr.pl wrote:

 Hello!

 A small example with some explanation can be found here:
 http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/

 --
 Regards,
   Rafał Kuć
   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch

  Hi,
 I need a guide to implement my own TokenFilter. I checked the wiki, but I
 could not find any useful guide :(










 --
 Don't Grow Old, Grow Up... :-)









-- 
Regards,
David Shen

http://about.me/davidshen
https://twitter.com/#!/davidshen84