140GB index directory, what can I do?

2010-08-14 Thread Andrew Bruno
Hi all, I have an index directory that is growing pretty fast, and is now at 138GB. A while ago, this index got corrupted. It was rebuilt, but the engineer cannot remember whether he deleted the corrupt directory before the rebuild. Is there a way to know if any files are not being used or

RE: 140GB index directory, what can I do?

2010-08-14 Thread Uwe Schindler
Optimize the index once; then all unused segment files are *for-sure* removed. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de
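A minimal sketch of Uwe's suggestion, assuming a Lucene 3.0-era API; the index path is a placeholder and the analyzer choice is an assumption:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class OptimizeIndex {
    public static void main(String[] args) throws Exception {
        // Open the existing index directory (path is a placeholder)
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        // Merge all segments down to one; segment files left over from
        // earlier commits are deleted once the merge completes.
        writer.optimize();
        writer.close();
    }
}
```

Note that optimize needs free disk space for the merge (transiently up to roughly the index size again), so on a 140GB index it should be run when space allows.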

lucene usage on TREC data

2010-08-14 Thread Ramneek Maan Singh
Hello Everyone, Can anyone point me to a publicly available question answering system built using Lucene on TREC or non-TREC data? Regards, Ramneek

Re: 140GB index directory, what can I do?

2010-08-14 Thread Shai Erera
You can also call deleteUnusedFiles(), and all unreferenced files will be deleted as well. Make sure the index DeletionPolicy is set to KeepOnlyLastCommit (which is the default) before you do that. That's only relevant, though, if you've built the index using either 3x or 4.0 code. If not, you can
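A sketch of Shai's alternative, again assuming a Lucene 3.x-era API; with the default KeepOnlyLastCommitDeletionPolicy, files not referenced by the last commit are removed without a full merge:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class CleanUnusedFiles {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
        // No IndexDeletionPolicy passed, so the writer uses the default
        // KeepOnlyLastCommitDeletionPolicy: only the last commit is kept.
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        // Delete index files no longer referenced by the current commit,
        // without rewriting any segments.
        writer.deleteUnusedFiles();
        writer.close();
    }
}
```

This is much cheaper than optimize() when the goal is only to reclaim space from stale files, though (as Shai notes) it only helps for indexes written with 3.x or 4.0 code.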

Re: scalability limit in terms of numbers of large documents

2010-08-14 Thread Erick Erickson
As asked, that's really an unanswerable question. The math is pretty easy in terms of running out of document IDs, but "searched quickly" depends on too many variables. I suspect, though, that long before you ran out of document IDs, you'd need to shard your index. Have you looked at SOLR? Best

Re: scalability limit in terms of numbers of large documents

2010-08-14 Thread andynuss
Hi Erick, My documents are roughly 0.5 to 1 million chars, divided into normal words and into 50 chapters, with each chapter streamed into a docid unit. So a search hit is a chapter. How do I find out more about sharding and SOLR? Andy

Interaction of Tokenattributes and Tokenizer

2010-08-14 Thread Devshree Sane
Hi, Can anyone explain to me how exactly the Tokenizers and TokenAttributes interact with each other? Or perhaps point me to a link which has the interaction/sequence diagram for the same? I want to extend the Token class to allow use of some more types of Token Attributes. Thanks -Devshree.

Not a valid hit number: 0

2010-08-14 Thread Herbert Roitblat
I was setting up a new instance of my program on a new computer. I got this error: 2010-08-14 10:05:21,951 ERROR Thread LuceneThread: java.lang.IndexOutOfBoundsException: Not a valid hit number: 0 Java stacktrace: java.lang.IndexOutOfBoundsException: Not a valid hit number: 0 at

Re: Interaction of Tokenattributes and Tokenizer

2010-08-14 Thread Simon Willnauer
You might want to look at the What's New in Lucene 2.9 whitepaper from Lucid Imagination http://www.lucidimagination.com/developer/whitepaper/Whats-New-in-Apache-Lucene-2-9 - on page 7 you'll find an introduction to this API. This should get you started :) simon

Re: lucene usage on TREC data

2010-08-14 Thread Glen Newton
Lucene has been used - usually as a starting base that has been modified for specific tasks - by a number of IR researchers for various TREC challenges. Here are some (there are many more): IBM Haifa: http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_-_IBM_Haifa_Team