Re: Calculating Average Document Length with Lucene

2012-06-19 Thread Kasun Perera
I found this is the correct way of calculating the average document length of a document having three fields: byte[] normsDocLengthArrField1 = indexReader.norms("filed1"); byte[] normsDocLengthArrField2 = indexReader.norms("filed2"); byte[] normsDocLengthArrField3 = indexReader.norms("filed3"); double s
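For reference, a minimal sketch of that approach, assuming Lucene 3.x, the default Similarity (norm roughly 1/sqrt(numTerms)) and no index-time boosts; the field name is a placeholder:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

// Approximate average length (in terms) of one field across the index.
// Only meaningful if no index-time boosts were used, and only roughly,
// because norms are stored in a lossy single byte.
static double averageFieldLength(IndexReader reader, String field) throws java.io.IOException {
  byte[] norms = reader.norms(field);
  if (norms == null) return 0;                           // norms omitted for this field
  double total = 0;
  int counted = 0;
  for (int docID = 0; docID < norms.length; docID++) {
    if (reader.isDeleted(docID)) continue;
    float norm = Similarity.decodeNorm(norms[docID]);    // ~ 1/sqrt(numTerms)
    if (norm > 0) {
      total += 1.0 / (norm * norm);                      // ~ numTerms
      counted++;
    }
  }
  return counted == 0 ? 0 : total / counted;
}

Summing the three per-field averages then gives an approximate average document length over all three fields.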

Different Weights to Lucene fields with Okapi Similarity

2012-06-19 Thread Kasun Perera
Based on this link http://www2002.org/CDROM/refereed/643/node6.html, I'm calculating the Okapi similarity between the query document and another document as below using Lucene. I have indexed the documents using 3 fields. I want to give higher weight to field 2 and field 3. I can't use Lucene's boost
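Since the preview is cut off, here is only a sketch of the general idea being described: score each field with Okapi BM25 separately, then combine the field scores with explicit weights rather than Lucene's boosts. The constants and weights below are illustrative assumptions, not values from the original post:

// Okapi BM25 weight of one term in one field.
// tf = term frequency in the field, df = document frequency of the term,
// numDocs = number of documents, docLen/avgDocLen = field lengths in terms.
static double bm25(double tf, double df, double numDocs,
                   double docLen, double avgDocLen) {
  final double k1 = 1.2, b = 0.75;                          // common defaults
  double idf = Math.log((numDocs - df + 0.5) / (df + 0.5));
  double lengthNorm = k1 * ((1 - b) + b * docLen / avgDocLen);
  return idf * (tf * (k1 + 1)) / (tf + lengthNorm);
}

// Combine per-field scores, weighting field 2 and field 3 higher (placeholder weights).
static double combinedScore(double field1Score, double field2Score, double field3Score) {
  return 1.0 * field1Score + 2.0 * field2Score + 2.0 * field3Score;
}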

zero sized cfs files in index lead to IOException: read past EOF

2012-06-19 Thread Chris Gioran
Hello everyone, I am having a problem with a Lucene store. When starting an IndexWriter on it, it throws the following exception: Caused by: java.io.IOException: read past EOF: MMapIndexInput(path="/path/to/index/_drs.cfs") at org.apache.lucene.store.MMapDirectory$MMapIndexInput.readByte

Re: zero sized cfs files in index lead to IOException: read past EOF

2012-06-19 Thread Michael McCandless
This shouldn't normally happen, even on crash, kill -9, power loss, etc. It can only mean either there is a bug in Lucene, or there's something wrong with your hardware/IO system, or the fsync operation doesn't actually work on the IO system. You can run CheckIndex to see what's broken (then, add
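A sketch of running CheckIndex programmatically against the affected directory (the path is a placeholder); the same check is available from the command line as java -ea org.apache.lucene.index.CheckIndex /path/to/index, and the -fix option removes unreadable segments, so back the index up first:

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class CheckBrokenIndex {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
    CheckIndex checker = new CheckIndex(dir);
    CheckIndex.Status status = checker.checkIndex();   // read-only diagnosis
    if (status.clean) {
      System.out.println("Index is OK");
    } else {
      System.out.println("Index is broken; fixing would lose "
          + status.totLoseDocCount + " documents");
    }
    dir.close();
  }
}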

Wikipedia Index

2012-06-19 Thread Elshaimaa Ali
Hi everybody, I'm using Lucene 3.6 to index Wikipedia documents, which is over 3 million articles. The data is in a MySQL database and it is taking more than 24 hours so far. Do you know any tips that can speed up the indexing process? Here is my code: public static void main(String[] args) {

Re: Wikipedia Index

2012-06-19 Thread Michael McCandless
Likely the bottleneck is pulling content from the database? Maybe test just that and see how long it takes? 24 hours is way too long to index all of Wikipedia. For example, we index Wikipedia every night for our trunk/4.0 performance tests, here: http://people.apache.org/~mikemccand/luceneb
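One way to test that, as a rough sketch with made-up table and column names: time the MySQL fetch and the IndexWriter.addDocument calls separately and see which side dominates:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Assumes an "articles" table with "title" and "content" columns; adjust to your schema.
static void profileIndexing(Connection conn, IndexWriter writer) throws Exception {
  long dbNanos = 0, indexNanos = 0;
  Statement stmt = conn.createStatement();
  stmt.setFetchSize(Integer.MIN_VALUE);   // MySQL driver hint: stream rows instead of buffering them all
  ResultSet rs = stmt.executeQuery("SELECT title, content FROM articles");
  while (true) {
    long t0 = System.nanoTime();
    if (!rs.next()) break;
    String title = rs.getString(1), content = rs.getString(2);
    dbNanos += System.nanoTime() - t0;

    long t1 = System.nanoTime();
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("content", content, Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
    indexNanos += System.nanoTime() - t1;
  }
  rs.close();
  stmt.close();
  System.out.println("fetching: " + dbNanos / 1e9 + " s, indexing: " + indexNanos / 1e9 + " s");
}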

Re: zero sized cfs files in index lead to IOException: read past EOF

2012-06-19 Thread Chris Gioran
On Tue, Jun 19, 2012 at 6:18 PM, Michael McCandless wrote: > This shouldn't normally happen, even on crash, kill -9, power loss, etc. > > It can only mean either there is a bug in Lucene, or there's something > wrong with your hardware/IO system, or the fsync operation doesn't > actually work on t

Re: Wikipedia Index

2012-06-19 Thread Reyna Melara
Could it be possible to index Wikipedia on a 2-core machine with 3 GB of RAM? I have had the same problem trying to index it. I've tried with a dump from April 2011. Thanks Reyna CIC-IPN Mexico 2012/6/19 Michael McCandless > Likely the bottleneck is pulling content from the database? Maybe >

RE: Wikipedia Index

2012-06-19 Thread Elshaimaa Ali
Thanks Mike for the prompt reply. Do you have a fully indexed version of Wikipedia? I mainly need two fields for each document: the indexed content of the Wikipedia articles and the title. If there is any place where I can get the index, that will save me a great deal of time. Regards, Shaimaa > From: l

Re: zero sized cfs files in index lead to IOException: read past EOF

2012-06-19 Thread Michael McCandless
Hmm which Lucene version are you using? For 3.x before 3.4, there was a bug (https://issues.apache.org/jira/browse/LUCENE-3418) where we failed to actually fsync... More below: On Tue, Jun 19, 2012 at 4:54 PM, Chris Gioran wrote: > On Tue, Jun 19, 2012 at 6:18 PM, Michael McCandless > wrote: >

Re: Wikipedia Index

2012-06-19 Thread Michael McCandless
3 GB RAM is plenty for indexing Wikipedia (eg, that nightly benchmark uses a 2 GB heap). 2 cores just means it'll take longer than more cores... just use 2 indexing threads. Mike McCandless http://blog.mikemccandless.com On Tue, Jun 19, 2012 at 5:26 PM, Reyna Melara wrote: > Could it be possib
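A sketch of the two-thread setup, assuming documents arrive on a queue that some producer (e.g. the database reader) fills and terminates with a sentinel; IndexWriter itself is safe to share across threads, and raising IndexWriterConfig.setRAMBufferSizeMB (say to 256) also tends to help:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// END is a sentinel document the producer puts on the queue after the last real document.
static void indexWithTwoThreads(final IndexWriter writer,
                                final BlockingQueue<Document> queue,
                                final Document END) throws Exception {
  ExecutorService pool = Executors.newFixedThreadPool(2);
  for (int i = 0; i < 2; i++) {
    pool.submit(new Callable<Void>() {
      public Void call() throws Exception {
        while (true) {
          Document doc = queue.take();
          if (doc == END) {            // pass the sentinel on so the other worker stops too
            queue.put(END);
            return null;
          }
          writer.addDocument(doc);
        }
      }
    });
  }
  pool.shutdown();
  pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
}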

Re: Wikipedia Index

2012-06-19 Thread Michael McCandless
I have the index locally ... but it's really impractical to send it especially if you already have the source text locally. Maybe index directly from the source text instead of via a database? Lucene's benchmark contrib/module has code to decode the XML into documents... Mike McCandless http://b

RE: Wikipedia Index

2012-06-19 Thread Elshaimaa Ali
I only have the source text in a MySQL database. Do you know where I can download it in XML, and is it possible to split the documents into content and title? Thanks, Shaimaa > From: luc...@mikemccandless.com > Date: Tue, 19 Jun 2012 19:48:24 -0400 > Subject: Re: Wikipedia Index > To: java-user@lucene

Re: Wikipedia Index

2012-06-19 Thread Greg Bowyer
It depends on what you want, but the Wikipedia data dumps can be found here: http://en.wikipedia.org/wiki/Wikipedia:Database_download On 19/06/12 17:03, Elshaimaa Ali wrote: I only have the source text on a mysql database Do you know where I can download it in xml and is it possible to split the
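To split the dump into title and content without going through MySQL, a minimal sketch using plain StAX (Lucene's benchmark module has a more complete Wikipedia parser; this version is simplified and just pairs each <title> with the following <text> element of the decompressed pages-articles dump):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Walks the decompressed pages-articles XML and reports each article's title and text size.
// Replace the println with building a Lucene Document ("title" and "content" fields) and addDocument().
public static void parseDump(String path) throws Exception {
  XMLStreamReader xml = XMLInputFactory.newInstance()
      .createXMLStreamReader(new FileInputStream(path));
  String title = null;
  while (xml.hasNext()) {
    if (xml.next() == XMLStreamConstants.START_ELEMENT) {
      if ("title".equals(xml.getLocalName())) {
        title = xml.getElementText();
      } else if ("text".equals(xml.getLocalName())) {
        String content = xml.getElementText();
        System.out.println(title + " -> " + content.length() + " chars");
      }
    }
  }
  xml.close();
}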