It turns out this is somehow related to an interaction between SWT and
the Java Decompresser class - certainly not Lucene related.
FYI:
https://bugs.eclipse.org/bugs/show_bug.cgi?id=169484
--
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.arm
This probably isn't a Lucene error - but I'm hoping that maybe somebody
here has seen it before and can shed some light.
I'm trying to add a simple document to an empty index. The toString() on
the document looks like this:
Document
stored/uncompressed,indexed
stored/uncompressed,indexed
i
Ariel Isaac Romero Cartaya wrote:
Hi everybody:
I am getting a problem during the indexing process. I am indexing big
amounts of text, most of it in PDF format, using PDFBox version 0.6.
The free space on the hard disk before the indexing process begins is
around 120 GB, but incredibly, even
A few more Google searches will probably turn up some reasonable lists
of abbreviation rules or lists of common names - I found this right away:
(google cache link that converts pdf to html)
http://72.14.205.104/search?q=cache:dh7HGiQ-G4wJ:immigrants.byu.edu/Downloads/BritishNames.pdf+common+n
I have someone wanting to do a query like this - "top sta*" - but from
what I have been able to gather, Lucene doesn't have any built-in
support for wildcards inside of phrases?
Well, at least not complete support. I was led to the MultiPhraseQuery
class - but looking at that leaves me wonderi
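The snippet cuts off there, but for what it's worth, here is a minimal sketch of the MultiPhraseQuery approach - the field name "contents" and the prefix-expansion helper are my assumptions, not the original poster's code:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.MultiPhraseQuery;

public class PhrasePrefixExample {
    // Builds a query matching "top" followed by any term starting with "sta".
    public static MultiPhraseQuery build(IndexReader reader) throws Exception {
        MultiPhraseQuery query = new MultiPhraseQuery();
        query.add(new Term("contents", "top")); // the fixed first word

        List terms = new ArrayList();
        TermEnum te = reader.terms(new Term("contents", "sta"));
        try {
            do {
                Term t = te.term();
                if (t == null || !t.field().equals("contents")
                        || !t.text().startsWith("sta")) {
                    break;
                }
                terms.add(t);
            } while (te.next());
        } finally {
            te.close();
        }
        // All of the expanded terms share the second position in the phrase.
        query.add((Term[]) terms.toArray(new Term[terms.size()]));
        return query;
    }
}

The obvious caveat is that a short prefix can expand into thousands of terms.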
Ross Rankin wrote:
We keep getting JVM crashes on 1.4.3. I found in the archive that setting a
JVM parameter solved the problem for a few users. We've tried that and it
has not worked. Here's our JVM parameters:
Why not try a new JVM?
Either a newer Sun release, a different vendor's JDK, or Blackdown...
In o
The reason I'm
asking this that I'm still trying to figure out whether having a machine
with huge ram actually helps Lucene, or not.
Thanks,
Nadav.
Memory can help a little at index time, but you will mostly be disk/IO
bound: how fast can you read your data in, how fast can you write i
Benjamin Stein wrote:
I could probably store the little RAMDirectories to disk as many
FSDirectories, and then addIndexes() of *all* the FSDirectories at the end
instead of every time. That would probably be smart.
Glad I asked myself!
That was what I was going to suggest - you may also wa
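For anyone finding this in the archive: a minimal sketch of that batching idea (the paths and the analyzer choice are assumptions) - flush each batch to its own FSDirectory, then do one addIndexes() at the end.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class BatchMergeExample {
    // Merges all the per-batch indexes into one final index in a single call.
    public static void mergeAll(String[] batchDirs, String finalDir) throws Exception {
        Directory[] dirs = new Directory[batchDirs.length];
        for (int i = 0; i < batchDirs.length; i++) {
            dirs[i] = FSDirectory.getDirectory(batchDirs[i], false); // open existing
        }
        IndexWriter writer = new IndexWriter(finalDir, new StandardAnalyzer(), true);
        try {
            writer.addIndexes(dirs); // one big merge instead of one per batch
        } finally {
            writer.close();
        }
    }
}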
Rahil wrote:
No, I have around 50GB free on my external disk in which I'm creating the
indexes. So hopefully that shouldn't be the problem.
How is the external disk mounted? Samba from Unix? NTFS? I wonder if
there isn't something strange going on here.
Have you tried building the index on a
Your out-of-memory error is likely due to a MySQL bug outlined here:
http://bugs.mysql.com/bug.php?id=7698
Thanks for the article. My query executed in no time without any errors !!!
The MySQL drivers are horrible at dealing with large result sets - that
article gives you the workaround to
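The snippet cuts off before the workaround itself; assuming it is the usual Connector/J streaming trick, a minimal sketch looks like this (the table and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingQueryExample {
    public static void stream(String url, String user, String pass) throws Exception {
        Connection conn = DriverManager.getConnection(url, user, pass);
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        // Tells Connector/J to stream rows one at a time instead of
        // buffering the entire result set in client memory.
        stmt.setFetchSize(Integer.MIN_VALUE);
        ResultSet rs = stmt.executeQuery("SELECT id, body FROM documents");
        while (rs.next()) {
            // feed rs.getString("body") to the indexer here
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}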
Thanks guys, as always... Lucene (and especially the people behind
it) are top notch.
Less than 6 hours from the time I figured out that the bug was in
Lucene (and not my code, which is usually the case) - and it's already
fixed (I'm going to assume - I'll test it tomorrow when I get to work)
A
Doug Cutting wrote:
I assume that your merge factor when calling addIndexes() is less than
90. If it's 90, then what you're doing is the same as Lucene would
automatically do. I think you could save yourself a lot of trouble if
you simply lowered your merge factor substantially and then ind
Yonik Seeley wrote:
For your test case, try lowering numbers, such as maxBufferedDocs=2,
mergeFactor=2 or 3
to create more segments more quickly and cause more merges with fewer documents.
Good suggestion. A merge factor of 2 made it happen much more quickly.
Bug is filed:
http://issues.ap
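For reference, a minimal sketch of the settings Yonik suggests (Lucene 1.9-era API; the index path is made up) - tiny buffers and a low merge factor force frequent merges with very few documents:

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class SmallSegmentTest {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/test-index",
                new SimpleAnalyzer(), true);
        writer.setMaxBufferedDocs(2); // flush a new segment every 2 docs
        writer.setMergeFactor(2);     // merge as soon as 2 segments pile up
        // ... add documents here to provoke frequent merges ...
        writer.close();
    }
}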
Yonik Seeley wrote:
On 4/5/06, Dan Armbrust <[EMAIL PROTECTED]> wrote:
I'll continue to try to generate a test case that gets the docs out of
order... but if someone in the know could answer authoritatively whether
I browsed the code for IndexWriter.addIndexes(Directory[]), and it lo
Chris Hostetter wrote:
: exactly the same as how I insert them. Lucene is supposed to maintain
: document order, even across index merges, correct?
Lucene definitely maintains index order for document additions -- but I
don't know if any similar claim has been made about merging whole indexes.
I'm using Lucene 1.9.1, and I'm seeing some odd behavior that I hope
someone can help me with.
My application counts on Lucene maintaining the order of the documents
exactly the same as how I insert them. Lucene is supposed to maintain
document order, even across index merges, correct?
My i
z shalev wrote:
Hi all,
I've been trying to load a 6GB index on Linux (16GB RAM) but am having no success.
I wrote a program that allocates memory, and it was able to allocate as much RAM as I requested (stopped at 12GB).
Was your program that got up to 12GB of memory written
I would give the IBM or Blackdown JVM a try on Linux - I've seen pretty
wide variance in their speed on different operations.
Sometimes better than Sun, sometimes worse - it depended on the task (I
did some ad hoc tests at one point that showed Sun was faster for
indexing, but IBM was faster fo
Mufaddal Khumri wrote:
When I do a search, for example on "batteries", I get 1200+ results. I
would like to show the user, let's say, 300. I can do that by only
extracting the first 300 hits (sorted by decreasing relevance by
default) and displaying those to the user.
If you are only talking ab
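The reply is cut off, but the mechanics of showing only the first 300 hits are simple enough. A sketch (the field names are assumed, not from the thread):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class TopNExample {
    public static void showTop300(IndexSearcher searcher) throws Exception {
        Query query = new TermQuery(new Term("contents", "batteries"));
        Hits hits = searcher.search(query); // already sorted by decreasing relevance
        int max = Math.min(300, hits.length());
        for (int i = 0; i < max; i++) {
            System.out.println(hits.doc(i).get("title")); // "title" is an assumed field
        }
    }
}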
Aigner, Thomas wrote:
I believe that the files are actually deleted from Lucene when the
optimize is run.
That gets things into the 'deletable' file - but it's never actually
deleting all of the files from the deletable file. I'm almost always
ending up with at least 1 duplicate copy of my
If I am using Lucene (a daily build from ~ a month ago or so) on Windows -
when I merge two indexes together, I get a number of .cfs files
noted in my 'deletable' file - but they never seem to actually be
deleted by Lucene.
When does Lucene try to delete these files - does it ever work on
Pol, Parikshit wrote:
Hi Folks.
I downloaded Lucene and tried to do an Ant build. It initially gave me the
following error:
...
Are you using a current version of Ant?
Lucene 1.4.3 should already be fully built when you downloaded it - you
shouldn't have to compile it.
If you want the "curre
Dan Quaroni wrote:
And together we will rule the galaxy as father and son?
-----Original Message-----
From: Rob Young [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 20, 2005 2:22 PM
To: java-user@lucene.apache.org
Subject: Join Me
42!
So here comes the next part of my applet ignorance.
Can I embed the Lucene, etc., jar files in my applet so that when the
user starts up the applet, they can be used on the local machine?
This alone probably stops me from using an applet, I guess.
Anyone have any idea where the definitive rules
I see your words, but I hate to admit that I don't understand them in
totality!
When you say that the search is executed on the web server, that means
that we would need to code it in Perl or some such, no?
I don't see (except for a Perl or PHP script) how the search could
execute on the website
J. David Boyd wrote:
Here's my dilemma.
For years, we have supplied paper documentation to our customers. Many
pages of paper. All together, it makes a 3 foot stack when printed.
Also for many years, customers have been asking for docs in electronic
format, so, recently, I wrote some Perl scr
Ben Gollmer wrote:
> Chris Lamprecht wrote:
>> I've wanted something similar, for the same purpose -- to keep Lucene
>> from consuming disk I/O resources when another process is running on
>> the same machine.
> Sorry for jumping in (I'm a Lucene newb) but isn't this better handled
I wrote a slightly modified version of the WhitespaceTokenizer that
allows me to treat other characters as whitespace. My thought was that
this would be an easy way to make it tokenize on characters such as "-".
My tokenizer looks like this:
public class CustomWhiteSpaceTokenizer extends Char
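(The code is cut off above; what follows is my own guess at a minimal version, assuming it extends CharTokenizer the way WhitespaceTokenizer does - not the poster's actual class.)

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

public class CustomWhiteSpaceTokenizer extends CharTokenizer {
    private final String extraSeparators; // e.g. "-" to also split on hyphens

    public CustomWhiteSpaceTokenizer(Reader in, String extraSeparators) {
        super(in);
        this.extraSeparators = extraSeparators;
    }

    // A char is part of a token unless it is whitespace or one of the
    // configured extra separator characters.
    protected boolean isTokenChar(char c) {
        return !Character.isWhitespace(c) && extraSeparators.indexOf(c) < 0;
    }
}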
Daniel Naber wrote:
Correct handling of multiple terms per position was only added in SVN; it's
not part of Lucene 1.4.3.
Regards
Daniel
Cool - is there a daily build somewhere, or do I have to roll my own? I
couldn't find a daily build or a 1.9 alpha, beta, etc. on the site.
Any idea whe
I have a custom Analyzer which performs normalization on all of the
terms as they pass through.
It does normalization like the following:
trees -> tree
Sometimes my normalizer returns multiple words for a normalization - for
example:
leaves -> leaf leave
The second and all subsequent terms
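The message is truncated, but the usual way to index extra forms at the same position is a position increment of 0. A minimal sketch of that technique (my code, with a placeholder normalize() - not the poster's):

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class NormalizingFilter extends TokenFilter {
    private final LinkedList pending = new LinkedList(); // queued extra forms

    public NormalizingFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            return (Token) pending.removeFirst();
        }
        Token t = input.next();
        if (t == null) return null;
        String[] forms = normalize(t.termText()); // e.g. "leaves" -> {"leaf", "leave"}
        Token first = new Token(forms[0], t.startOffset(), t.endOffset());
        for (int i = 1; i < forms.length; i++) {
            Token extra = new Token(forms[i], t.startOffset(), t.endOffset());
            extra.setPositionIncrement(0); // stack it on the same position
            pending.addLast(extra);
        }
        return first;
    }

    private String[] normalize(String term) {
        return new String[] { term }; // placeholder: plug the real normalizer in here
    }
}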
I am implementing a filter that will remove certain characters from the
tokens - things like '(', etc. - but the chars to be removed will be
customizable.
This is what I have come up with - but it doesn't seem very efficient.
Is there a better way?
Should I be adjusting the token endOffset when
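The poster's code isn't shown; for the archive, here is one straightforward way such a filter can be written - the names are mine, and tokens that end up empty are simply dropped:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class CharStrippingFilter extends TokenFilter {
    private final String charsToRemove; // e.g. "()" - customizable

    public CharStrippingFilter(TokenStream input, String charsToRemove) {
        super(input);
        this.charsToRemove = charsToRemove;
    }

    public Token next() throws IOException {
        Token t;
        while ((t = input.next()) != null) {
            String text = t.termText();
            StringBuffer sb = new StringBuffer(text.length());
            for (int i = 0; i < text.length(); i++) {
                if (charsToRemove.indexOf(text.charAt(i)) < 0) {
                    sb.append(text.charAt(i));
                }
            }
            if (sb.length() > 0) {
                // keep the original offsets; only the text changes
                return new Token(sb.toString(), t.startOffset(), t.endOffset());
            }
            // the token was nothing but stripped chars - drop it and move on
        }
        return null;
    }
}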
I know there used to be a webpage that gave the algorithm used by Lucene
for scoring, along with some info on what each variable controlled, to
some extent... I was looking to brush up on what the idf controls (and
what will happen if I override it), but I can't seem to find that page
any longer.
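If I remember right, the scoring formula is documented in the Similarity javadoc, and overriding idf is done through a custom Similarity. A minimal sketch, assuming a flat idf is what's wanted:

import org.apache.lucene.search.DefaultSimilarity;

public class FlatIdfSimilarity extends DefaultSimilarity {
    // Returning a constant removes the rare-term boost entirely.
    public float idf(int docFreq, int numDocs) {
        return 1.0f;
    }
}

To take effect consistently, it should be installed on both the IndexWriter and the Searcher via setSimilarity().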
It is my understanding that the StandardAnalyzer will remove underscores
- so "some_word" would be indexed as 'some' and 'word'.
I want to keep the underscores, so I was thinking of changing over to an
Analyzer that uses the WhitespaceTokenizer, LowerCaseFilter, and StopFilter.
What other tokenizin
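A minimal sketch of the analyzer just described (straightforward chaining; the class name is mine):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class UnderscorePreservingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new WhitespaceTokenizer(reader); // splits on whitespace only
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);
        return result;
    }
}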
Mufaddal Khumri wrote:
Are there
analyzers that do this already?
It's not an analyzer, but the "norm" feature of this tool does a good job
of getting to the normalized form of words...
http://umlslex.nlm.nih.gov/lvg/current/
http://umlslex.nlm.nih.gov/lvg/current/docs/userDoc/norm.htm
May I suggest:
Don't call optimize. You don't need it. Here is my approach:
Keep each one of your 250,000-document indexes separate - so run your
batch, build the index, and then just close it. Don't try to optimize
it. For each 250,000-document batch, just put it into a different folder.
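The natural follow-on question is how to search the separate folders. One way (my sketch, not necessarily what the poster does) is a MultiSearcher over all of them:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

public class BatchSearchExample {
    // Opens one searcher per batch folder and wraps them in a MultiSearcher.
    public static MultiSearcher open(String[] batchFolders) throws Exception {
        Searchable[] searchers = new Searchable[batchFolders.length];
        for (int i = 0; i < batchFolders.length; i++) {
            searchers[i] = new IndexSearcher(batchFolders[i]);
        }
        return new MultiSearcher(searchers);
    }
}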
Otis Gospodnetic wrote:
You may also want to consider PostgreSQL for a few reasons:
3) it seems that the new
versions let you embed Java directly into the database (perhaps
something like Oracle's Java-embedding thing).
Really? I realize this is off topic, but could you point me to some
d
Has anyone ever written code to make it possible to return from a
search after a given amount of time, returning the results that have
been collected so far (but not necessarily all of them)?
The only thing that I can see to do through the public Lucene APIs
would be to do the search using a
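One way to do this through the public API is a custom HitCollector; a minimal sketch (the abort-by-exception trick and all names are my assumptions, not from the thread):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.HitCollector;

public class TimeLimitedCollector extends HitCollector {
    public static class TimeUp extends RuntimeException {}

    private final long deadline;
    public final List docIds = new ArrayList(); // Integer doc ids, in collection order

    public TimeLimitedCollector(long budgetMillis) {
        this.deadline = System.currentTimeMillis() + budgetMillis;
    }

    public void collect(int doc, float score) {
        if (System.currentTimeMillis() > deadline) {
            throw new TimeUp(); // aborts the search; the caller catches this
        }
        docIds.add(new Integer(doc));
    }
}

The caller runs searcher.search(query, collector) inside a try/catch for TimeUp and uses whatever ended up in docIds.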
In my indexes where the available fields vary by document, I maintain an
additional field that lists out which fields are in use per document.
That way, I can query for all documents that contain field "foo", or all
documents that contain a field "foo" and don't contain "bar"... etc.
Avi
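A minimal sketch of that bookkeeping trick (Lucene 1.9-style Field API; the field and value names are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class FieldPresenceExample {
    // At index time: record the name of every field the document carries.
    public static void addBookkeeping(Document doc, String[] fieldNames) {
        for (int i = 0; i < fieldNames.length; i++) {
            doc.add(new Field("fields", fieldNames[i],
                    Field.Store.NO, Field.Index.UN_TOKENIZED));
        }
    }

    // At search time: documents that have "foo" but not "bar".
    public static BooleanQuery fooWithoutBar() {
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("fields", "foo")), BooleanClause.Occur.MUST);
        q.add(new TermQuery(new Term("fields", "bar")), BooleanClause.Occur.MUST_NOT);
        return q;
    }
}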
Is there a straightforward way that I could change the scoring algorithm
such that it would break ties by looking at the value of a field?
I'm not actually searching for the value in the field, so it's not part
of the query - I just want documents that have a particular field set to
a par
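One way to get that effect without touching the scoring algorithm at all (my suggestion, not from the thread) is a compound sort - relevance first, then the field:

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class TieBreakExample {
    public static Hits search(IndexSearcher searcher, Query query) throws Exception {
        Sort sort = new Sort(new SortField[] {
            SortField.FIELD_SCORE,      // primary: relevance, as usual
            new SortField("priority")   // tie-break: an (assumed) indexed field
        });
        return searcher.search(query, sort);
    }
}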
I'm pretty sure the answer is no... but I'll check with the gurus anyway...
In my collection of documents, I have a non-tokenized field that only
occurs 0 or 1 times per document.
Is it possible to create a query so that a document would be returned if
(field == "some value" OR field does not
You should be careful, however, not to end up with two VM instances each
trying to open an IndexWriter at the same time - one of them is going
to fail.
That is, if someone using your web interface tries to add a new document to
the index while you have the optimizer running standalone, the web
i
I would run your optimize process in a separate thread, so that your web
client doesn't have to wait for it to return.
You may even want to set the optimize part up to run on a weekly
schedule, at a low-load time. I probably wouldn't re-optimize after
every 30 documents on an index that size.
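A minimal sketch of pushing optimize() into a background thread (the directory path and analyzer are assumptions; the locking concerns from the earlier message still apply):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BackgroundOptimizer {
    public static void optimizeAsync(final String indexDir) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    IndexWriter writer = new IndexWriter(indexDir,
                            new StandardAnalyzer(), false); // open the existing index
                    writer.optimize();
                    writer.close();
                } catch (Exception e) {
                    e.printStackTrace(); // log and move on
                }
            }
        });
        t.setPriority(Thread.MIN_PRIORITY); // stay out of the searchers' way
        t.start();
    }
}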