Hi:
What is the current status of the distributed Lucene project proposed at:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html
Thanks
-John
Dear all,
I'd like to do document clustering using full-text with Lucene. In other words,
I would like to group similar documents in their respective groups. I searched
the mailing list and found that there are two approaches. The first method is
to represent one document as a query and sear
It's been months since I've tested this sort of thing, but from what I
remember there is a point where, as you go higher, performance starts to
drop very slowly. The point was lower than I'd expect, and definitely created
what looked like sweet-spot settings.
On Wed, 2008-05-14 at 18:36 -0700, Otis Gospodn
Karl, which caches are you referring to? Things like maxBufferedDocs and the
recent memory-based in-memory buffer? If so, isn't "the bigger the better" the
answer?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Karl Wettin <[EMAIL PROTE
I have an index with several million documents that each contain
between a few hundred terms and up to about a million terms. To me it
feels like there would be a rather big difference between the sweet-spot
setting for the cache size when adding very large and very small
documents.
What a
I have not used RedHat GFS personally, but I know one of Sematext's big
customers (see http://sematext.com/clients ) has used it or is still using it
and they were happy with it. I bet Wikipedia has a good list of RH GFS
alternatives. Here is what I've got:
http://www.simpy.com/user/otis/sea
Hi, all.
I'd like to know about your experience sharing a Lucene index on a RedHat GFS
filesystem.
Is that a good choice? Is it reliable? What other good alternatives exist?
Kind regards
Hello, for a project I created a new class that extends TokenFilter, but I have
a big problem... I have a dictionary of terms loaded into a HashSet, but
these terms have more than one token (for example "computer science"), so I
can only search 1-word tokens (I tried adding one invented by me) and it
Index time boosts don't have much granularity, so you would run out of
values pretty quickly, unless I am misunderstanding your proposal.
From Similarity.encodeNorm:
/** Encodes a normalization factor for storage in an index.
*
* The encoding uses a three-bit mantissa, a five-bit exponent
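To make the granularity point concrete, here is a minimal, self-contained sketch of a one-byte float encoding. It is modeled on my recollection of the byte encoding in Lucene's SmallFloat utility, so treat the class name, method names, and constants as assumptions rather than the library's actual API. It only illustrates how nearby boost values collapse to the same stored byte.

```java
// Illustrative sketch only: names and constants are assumptions,
// not taken from the Lucene source.
final class NormEncodingSketch {

    // Assumed zero point of the exponent range.
    private static final int ZERO_EXP = 15;

    /** Packs a float into a single byte, discarding most of its precision. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        // Drop the low 21 bits: only the sign, the exponent, and the
        // leading mantissa bits survive.
        int small = bits >> (24 - 3);
        int floor = (63 - ZERO_EXP) << 3;
        if (small <= floor) {
            // Zero and negatives map to 0; tiny positives map to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= floor + 0x100) {
            // Overflow maps to the largest code.
            return (byte) -1;
        }
        return (byte) (small - floor);
    }

    /** Expands a byte produced by encode back into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - ZERO_EXP) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] boosts = {1.0f, 1.1f, 1.2f, 1.25f};
        for (float boost : boosts) {
            System.out.println(boost + " -> " + decode(encode(boost)));
        }
    }
}
```

In this sketch, boosts of 1.1 and 1.2 both come back as 1.0, while 1.25 survives unchanged, which is why fine-grained index-time boosts are indistinguishable once stored.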
Don't ask me why this occurred to me, since I'm working on a
completely different project... Mostly, this is intended to have
folks who really understand the scoring algorithms chime in and
tell me it's a silly idea.
We've seen multiple threads asking the question: "How can I
cause more-recent do
> > the unix program pdf2text can convert keeping the text places, but I wanted
> > to ask you guys if you know something better,
>
> AFAIK, PDFBox has a lower-level API that allows you to get hold of text
> positions.
In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and
font in
Keeping a duplicate field is certainly one way to go, and assuming that
it's just the title, the duplicate field probably won't increase your
index much. I'd recommend just giving it a try; it probably costs less
resource-wise than you think.
You'll have to do something like this to get things to w
Erick Erickson wrote:
Are you using NumberTools both at index and query time? Because
this works exactly as I expect
Yes, the code I posted showed the usage of NumberTools -- here it is
from my 2nd reply:
Taking your advice I'm now indexing using:
document.add( new Field(RateUtils.SF_F
OK thanks for the update. It's another datapoint, and it tells us
_06 doesn't fix it. I'll add it to the Jira issue.
Mike
Ian Lea wrote:
Hi
My job (http://lucene.markmail.org/message/awkkunr7j24nh4qj) still
fails with java version 1.6.0_06 (build 1.6.0_06-b02), downloaded
today, with bo
Hi
My job (http://lucene.markmail.org/message/awkkunr7j24nh4qj) still
fails with java version 1.6.0_06 (build 1.6.0_06-b02), downloaded
today, with both Lucene 2.3.1 and 2.3.2.
For me, downgrading to 1.6.0_03-b05 fixed things.
--
Ian.
On Tue, May 13, 2008 at 7:56 PM, Stu Hood <[EMAIL PROTECT
Cam Bazz wrote:
Hello All,
Any suggestions for extracting text from PDF? I have tried PDFBox, and it
works nicely; however, if the PDF is structured, it won't provide good results.
For example consider the pdf:
P1 Lorem Ipsum Bla bla P3 Lorem2 Ipsum2
P1 bla bla
Hello All,
Any suggestions for extracting text from PDF? I have tried PDFBox, and it
works nicely; however, if the PDF is structured, it won't provide good results.
For example consider the pdf:
P1 Lorem Ipsum Bla bla P3 Lorem2 Ipsum2
P1 bla bla
P2 bla bla bla
P
17 matches