distributed lucene progress

2008-05-14 Thread John Wang
Hi: What is the current status on the distributed lucene project proposed at: http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html Thanks -John

Document clustering with Lucene

2008-05-14 Thread Supheakmungkol SARIN
Dear all, I'd like to do document clustering using full-text with Lucene. In other words, I would like to group similar documents in their respective groups. I searched the mailing list and found that there are two ways to do this. The first method is to represent one document as a query and sear

Re: IndexWriter cache swetspots

2008-05-14 Thread Mark Miller
It's been months since I've tested this sort of thing, but from what I remember there is a point where, as you go higher, performance starts to very slowly drop. The point was lower than I'd expect, and definitely created what looked like sweet spot settings. On Wed, 2008-05-14 at 18:36 -0700, Otis Gospodn

Re: IndexWriter cache swetspots

2008-05-14 Thread Otis Gospodnetic
Karl, which caches are you referring to? Things like maxBufferedDocs and the recent memory-based in-memory buffer? If so, isn't "the bigger the better" the answer? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Karl Wettin <[EMAIL PROTE

IndexWriter cache swetspots

2008-05-14 Thread Karl Wettin
I have an index with several million documents that each contain between a few hundred terms and up to about a million terms. To me it feels like there would be a rather big difference between the sweet spot setting for the cache size when adding very large and very small documents. What a

Re: RedHad GFS

2008-05-14 Thread Otis Gospodnetic
I have not used RedHat GFS personally, but I know one of Sematext's big customers (see http://sematext.com/clients ) has used it or is still using it and they were happy with it. I bet Wikipedia has a good list of RH GFS alternatives. Here is what I've got: http://www.simpy.com/user/otis/sea

RedHad GFS

2008-05-14 Thread nch
Hi, all. I'd like to know about your experience sharing a Lucene index on a RedHat GFS filesystem. Is that a good choice? Is it reliable? What other good alternatives exist? Kind regards

Using more tokens in TokenFilter :(

2008-05-14 Thread broddoi
Hello, I created for a project a new class that extends TokenFilter, but I have a big problem... I have a dictionary of terms loaded using a HashSet, but these terms have more than one token (for example "computer science"), so I can only search 1-word tokens (I tried adding one invented by me) and it

Re: Boosting more-recent documents in an index

2008-05-14 Thread Grant Ingersoll
Index time boosts don't have much granularity, so you would run out of values pretty quickly, unless I am misunderstanding your proposal. From Similarity.encodeNorm: /** Encodes a normalization factor for storage in an index. * * The encoding uses a three-bit mantissa, a five-bit exponent

Boosting more-recent documents in an index

2008-05-14 Thread Erick Erickson
Don't ask me why this occurred to me, since I'm working on a completely different project... Mostly, this is intended to have folks who really understand the scoring algorithms chime in and tell me it's a silly idea. We've seen multiple threads asking the question: "How can I cause more-recent do

Re: text extraction from pdf

2008-05-14 Thread Bill Janssen
> > the unix program pdf2text can convert keeping the text places, but I wanted > > to ask you guys if you know something better, > > AFAIK, PDFBox has a lower-level API that allows you to get hold of text > positions. In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and font in

Re: Exact match query on a field in index which has been indexed using StandardAnalyzer

2008-05-14 Thread Erick Erickson
Keeping a duplicate field is certainly one way to go, and assuming that it's just the title, the duplicate field probably won't increase your index much. I'd recommend just giving it a try; it probably costs less resource-wise than you think. You'll have to do something like this to get things to w

Re: Numerical Range Query

2008-05-14 Thread Dan Hardiker
Erick Erickson wrote: Are you using NumberTools both at index and query time? Because this works exactly as I expect Yes, the code I posted showed the usage of NumberTools -- here it is from my 2nd reply: Taking your advice I'm now indexing using: document.add( new Field(RateUtils.SF_F

Re: "Off By One": CorruptIndexException

2008-05-14 Thread Michael McCandless
OK thanks for the update. It's another datapoint, and it tells us _06 doesn't fix it. I'll add it to the Jira issue. Mike Ian Lea wrote: Hi My job (http://lucene.markmail.org/message/awkkunr7j24nh4qj) still fails with java version 1.6.0_06 (build 1.6.0_06-b02), downloaded today, with bo

Re: "Off By One": CorruptIndexException

2008-05-14 Thread Ian Lea
Hi My job (http://lucene.markmail.org/message/awkkunr7j24nh4qj) still fails with java version 1.6.0_06 (build 1.6.0_06-b02), downloaded today, with both lucene 2.3.1 and 2.3.2. For me, downgrading to 1.6.0_03-b05 fixed things. -- Ian. On Tue, May 13, 2008 at 7:56 PM, Stu Hood <[EMAIL PROTECT

Re: text extraction from pdf

2008-05-14 Thread Andrzej Bialecki
Cam Bazz wrote: Hello All, Any suggestions for extracting text from PDF? I have tried PDFBox, and it works nicely; however, if the PDF is structured, it won't provide good results. For example consider the pdf: P1 Lorem Ipsum Bla bla P3 Lorem2 Ipsum2 P1 bla bla

text extraction from pdf

2008-05-14 Thread Cam Bazz
Hello All, Any suggestions for extracting text from PDF? I have tried PDFBox, and it works nicely; however, if the PDF is structured, it won't provide good results. For example consider the pdf: P1 Lorem Ipsum Bla bla P3 Lorem2 Ipsum2 P1 bla bla P2 bla bla bla P