RE: Lucene 4.0 scalability and performance.
Thank you.

-----Original Message-----
From: Steve Rowe [mailto:sar...@gmail.com]
Sent: Sunday, December 23, 2012 8:20 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 4.0 scalability and performance.

Hi Vitaly,

Anything by Tom Burton-West should interest you - he works on the HathiTrust digital library project (http://www.hathitrust.org), which currently indexes 7 TB of full-length books, e.g.:

Practical Relevance Ranking for 10 Million Books (paper), INEX 2012, September 2012, Rome, Italy
http://www.clef-initiative.eu/documents/71612/943abea5-6e48-48dd-ba89-72c174d001ef

HathiTrust Large Scale Search: Scalability meets Usability (slides), Code4Lib 2012, February 2012, Seattle, Washington
http://www.hathitrust.org/documents/HathiTrust-Code4Lib-201202.pptx

Large-scale Search (blog)
http://www.hathitrust.org/blogs/large-scale-search

Steve

On Dec 23, 2012, at 6:11 AM, vitaly_arte...@mcafee.com wrote:

> Hi all,
> We are starting to evaluate Lucene 4.0 for use in a production environment. This means we need to index millions of documents with terabytes of content and search in them.
> For now we want to define only one indexed field, containing the content of the documents, with the ability to search for terms and retrieve the term offsets.
> Has anybody already tested Lucene with terabytes of data? Does Lucene have any known limits on the number or size of indexed documents? What about search performance on a huge data set?
> Thanks in advance,
> Vitaly

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
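[Editorial note in the thread's terms: the setup Vitaly describes - a single indexed content field with term offsets retrievable later - can be sketched with the Lucene 4.0 API roughly as below. This is a minimal illustration, not the poster's actual code; the class name, field name, and index path are made up, and it assumes Lucene 4.0's option to store offsets in the postings lists.]

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.FieldInfo.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class OffsetIndexer {
  public static void main(String[] args) throws Exception {
    // One indexed (not stored) full-text field whose postings also carry
    // character offsets - an option introduced in Lucene 4.0.
    FieldType contentType = new FieldType(TextField.TYPE_NOT_STORED);
    contentType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    contentType.freeze();

    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_40,
        new StandardAnalyzer(Version.LUCENE_40));
    IndexWriter writer = new IndexWriter(FSDirectory.open(new File("index")), cfg);

    Document doc = new Document();
    doc.add(new Field("content", "terabytes of document text go here", contentType));
    writer.addDocument(doc);
    writer.close();
  }
}
```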
Re: Lucene 4.0 scalability and performance.
On 23.12.2012 12:11, vitaly_arte...@mcafee.com wrote:

> This means we need to index millions of documents with terabytes of content and search in them. For now we want to define only one indexed field, containing the content of the documents, with the ability to search for terms and retrieve the term offsets.
> Has anybody already tested Lucene with terabytes of data? Does Lucene have any known limits on the number or size of indexed documents? What about search performance on a huge data set?

Hi Vitali,

we've been working on a linguistic search engine based on Lucene 4.0 and have performed a few tests with large text corpora. There is at least some overlap with the functionality you mentioned (term offsets). See http://www.oegai.at/konvens2012/proceedings/27_schnober12p/ (mainly section 5).

Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
RE: Lucene 4.0 scalability and performance.
Thank you.

-----Original Message-----
From: Carsten Schnober [mailto:schno...@ids-mannheim.de]
Sent: Monday, December 24, 2012 3:25 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 4.0 scalability and performance.

> [...]
> There is at least some overlap with the functionality you mentioned (term offsets). See http://www.oegai.at/konvens2012/proceedings/27_schnober12p/ (mainly section 5).
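[Editorial note: the other half of Vitaly's requirement - reading term offsets back out of the index - can be sketched against the Lucene 4.0 postings API as follows. This is a self-contained, hedged illustration, not code from the thread: it indexes one document in memory with offsets enabled and then walks the postings for one term. Class, field, and term names are invented.]

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.FieldInfo.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class OffsetDemo {
  public static void main(String[] args) throws IOException {
    RAMDirectory dir = new RAMDirectory();

    // Index one document whose postings carry offsets (Lucene 4.0 feature).
    FieldType type = new FieldType(TextField.TYPE_NOT_STORED);
    type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    IndexWriter w = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));
    Document doc = new Document();
    doc.add(new Field("content", "lucene scales well and lucene is fast", type));
    w.addDocument(doc);
    w.close();

    // Walk the postings of the term "lucene" and print each occurrence's offsets.
    DirectoryReader reader = DirectoryReader.open(dir);
    for (AtomicReaderContext leaf : reader.leaves()) {
      Terms terms = leaf.reader().terms("content");
      if (terms == null) continue;
      TermsEnum te = terms.iterator(null);
      if (!te.seekExact(new BytesRef("lucene"), false)) continue;
      DocsAndPositionsEnum dpe =
          te.docsAndPositions(null, null, DocsAndPositionsEnum.FLAG_OFFSETS);
      while (dpe.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        for (int i = 0; i < dpe.freq(); i++) {
          dpe.nextPosition();
          System.out.println("start=" + dpe.startOffset() + " end=" + dpe.endOffset());
        }
      }
    }
    reader.close();
  }
}
```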
Re: how to implement a TokenFilter?
Hi Lance,

I got the Lucene 4 sources from http://mirror.bjtu.edu.cn/apache/lucene/java/4.0.0/lucene-4.0.0-src.tgz; it is an Ant project, but I do not know which IDE can import it. I tried Eclipse, but it cannot import the build.xml file.

Thanks,
D.

On Mon, Dec 24, 2012 at 12:02 PM, Lance Norskog goks...@gmail.com wrote:

> You need to use an IDE. Find the Attribute type and show all of its subclasses. This shows a lot of rare ones and a few that are used heavily. Then look at the source code for various TokenFilters and search for other uses of the Attributes you find. That is generally how I figured it out. Also, after the full Analyzer stack is called, the caller saves the output (I guess to codecs?). You can look at which Attributes it saves.

On 12/23/2012 06:30 PM, Xi Shen wrote:

> thanks a lot :)

On Mon, Dec 24, 2012 at 10:22 AM, feng lu amuseme...@gmail.com wrote:

> Hi Shen,
>
> Maybe you can look at some source code in the org.apache.lucene.analysis package, such as LowerCaseFilter.java, StopFilter.java, and so on. Some common attributes include:
>
> offsetAtt = addAttribute(OffsetAttribute.class);
> termAtt = addAttribute(CharTermAttribute.class);
> typeAtt = addAttribute(TypeAttribute.class);
>
> Regards

On Sun, Dec 23, 2012 at 4:01 PM, Rafał Kuć r@solr.pl wrote:

> Hello!
>
> The simplest way is to look at the Lucene javadoc and see what implementations of the Attribute interface there are:
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/util/Attribute.html
>
> --
> Regards,
> Rafał Kuć
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch

> thanks, i read this already. it is useful, but it is too 'small'... e.g. for
> this.charTermAttr = addAttribute(CharTermAttribute.class);
> i want to know what other attributes I need in order to implement my function. Where can I find a reference to these attributes? I tried the Lucene/Solr wiki, but all I found is a list of the names of these attributes, nothing about what they are capable of...

On Sat, Dec 22, 2012 at 10:37 PM, Rafał Kuć r@solr.pl wrote:

> Hello!
>
> A small example with some explanation can be found here:
> http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/
>
> --
> Regards,
> Rafał Kuć
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch

> Hi,
> I need a guide to implement my own TokenFilter. I checked the wiki, but I could not find any useful guide :(

--
Don't Grow Old, Grow Up... :-)

--
Regards,
David Shen
http://about.me/davidshen
https://twitter.com/#!/davidshen84
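[Editorial note: the addAttribute() pattern discussed in this thread can be sketched as a minimal custom TokenFilter. This is an invented demo class, not code from any of the posters: it lower-cases each token in place and shows how CharTermAttribute and OffsetAttribute are declared once and then read per token.]

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

// Hypothetical filter: lower-cases each token in place. The standard pattern
// is to declare the Attributes you need via addAttribute() in the constructor,
// then read/modify them inside incrementToken().
public final class LowerCaseDemoFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public LowerCaseDemoFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false; // end of the token stream
    }
    // Modify the term text in the attribute's buffer in place.
    char[] buffer = termAtt.buffer();
    int len = termAtt.length();
    for (int i = 0; i < len; i++) {
      buffer[i] = Character.toLowerCase(buffer[i]);
    }
    return true;
  }

  public static void main(String[] args) throws IOException {
    TokenStream ts = new LowerCaseDemoFilter(
        new WhitespaceTokenizer(Version.LUCENE_40, new StringReader("Hello Lucene WORLD")));
    CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
    OffsetAttribute off = ts.getAttribute(OffsetAttribute.class);
    ts.reset(); // required before the first incrementToken() in Lucene 4.x
    while (ts.incrementToken()) {
      System.out.println(term + " " + off.startOffset() + "-" + off.endOffset());
    }
    ts.end();
    ts.close();
  }
}
```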