Re: obscure error...

2007-01-04 Thread Dan Armbrust
It turns out, this is somehow related to an interaction between SWT and the java Decompresser class - certainly not lucene related. FYI: https://bugs.eclipse.org/bugs/show_bug.cgi?id=169484 -- Daniel Armbrust Biomedical Informatics Mayo Clinic Rochester daniel.arm

obscure error...

2007-01-03 Thread Dan Armbrust
This probably isn't a lucene error - but I'm hoping that maybe somebody here has seen it before, and can shed some light. I'm trying to add a simple document to an empty index. The toString on the document looks like this: Document stored/uncompressed,indexed stored/uncompressed,indexed i

Re: Full disk space during indexing process with 120 gb of free disk space

2006-12-07 Thread Dan Armbrust
Ariel Isaac Romero Cartaya wrote: Hi everybody: I am having a problem during the indexing process. I am indexing large amounts of text, most of it in PDF format, using PDFBox version 0.6. The free space on the hard disk before the indexing process begins is around 120 GB, but incredibly even

Re: Help on search

2006-11-07 Thread Dan Armbrust
A few more google searches will probably turn up some reasonable lists of abbreviation rules or lists for common names - I found this right away: (google cache link that converts pdf to html) http://72.14.205.104/search?q=cache:dh7HGiQ-G4wJ:immigrants.byu.edu/Downloads/BritishNames.pdf+common+n

wildcards in quoted phrases?

2006-09-25 Thread Dan Armbrust
I have someone wanting to do a query like this - "top sta*", but from what I have been able to gather, lucene doesn't have any built in support for wildcards inside of phrases? Well, at least not complete support. I was led to the MultiPhraseQuery class - but looking at that leaves me wonderi

Re: JVM Crash

2006-06-13 Thread Dan Armbrust
Ross Rankin wrote: We keep getting JVM crashes on 1.4.3. I found in the archive that setting a JVM parameter solved the problem for a few users. We've tried that and it has not worked. Here's our JVM parameters: Why not try a new JVM? Either a newer sun... or a JDK, or a blackdown... In o

Re: Does more memory help Lucene?

2006-06-12 Thread Dan Armbrust
>The reason I'm asking this that I'm still trying to figure out whether having a machine with huge ram actually helps Lucene, or not. Thanks, Nadav. Memory can help a little at index time, but you will mostly be Disk / IO bound. How fast can you read your data in, how fast can you write i

Re: IndexWriter.addIndexes & optimization

2006-06-07 Thread Dan Armbrust
Benjamin Stein wrote: I could probably store the little RAMDirectories to disk as many FSDirectories, and then addIndexes() of *all* the FSDirectories at the end instead of every time. That would probably be smart. Glad I asked myself! That was what I was going to suggest - you may also wa

Re: IOException Access Denied errors [ modified]

2006-05-24 Thread Dan Armbrust
Rahil wrote: No, I have around 50GB free on my external disk, on which I'm creating the indexes. So hopefully that shouldn't be the problem. How is the external disk mounted? Samba from Unix? NTFS? I wonder if there isn't something strange going on here. Have you tried building the index on a

Re: OutOfMemory and IOException Access Denied errors

2006-05-22 Thread Dan Armbrust
Your out of memory error is likely due to a mysql bug outlined here: http://bugs.mysql.com/bug.php?id=7698 Thanks for the article. My query executed in no time without any errors !!! The MySQL drivers are horrible at dealing with large result sets - that article gives you the workaround to

Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
Thanks guys as always... Lucene (and especially the people behind it) are top notch. Less than 6 hours from the time I figured out that the bug was in Lucene (and not my code, which is usually the case) - and it's already fixed (I'm going to assume - I'll test it tomorrow when I get to work) A

Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
Doug Cutting wrote: I assume that your merge factor when calling addIndexes() is less than 90. If it's 90, then what you're doing is the same as Lucene would automatically do. I think you could save yourself a lot of trouble if you simply lowered your merge factor substantially and then ind

Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
Yonik Seeley wrote: For your test case, try lowering numbers, such as maxBufferedDocs=2, mergeFactor=2 or 3 to create more segments more quickly and cause more merges with fewer documents. Good suggestion. A merge factor of 2 made it happen much more quickly. Bug is filed: http://issues.ap

Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
Yonik Seeley wrote: On 4/5/06, Dan Armbrust <[EMAIL PROTECTED]> wrote: I'll continue to try to generate a test case that gets the docs out of order... but if someone in the know could answer authoritatively whether I browsed the code for IndexWriter.addIndexes(Dir[]), and it lo

Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
Chris Hostetter wrote: : exactly the same as how I insert them. Lucene is supposed to maintain : document order, even across index merges, correct? Lucene definitely maintains index order for document additions -- but i don't know if any similar claim has been made about merging whole indexes.

Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
I'm using Lucene 1.9.1, and I'm seeing some odd behavior that I hope someone can help me with. My application counts on Lucene maintaining the order of the documents exactly the same as how I insert them. Lucene is supposed to maintain document order, even across index merges, correct? My i

Re: 1.4.3 and 64bit support? out of memory??

2006-03-08 Thread Dan Armbrust
z shalev wrote: hi all, i've been trying to load a 6GB index on linux (16GB RAM) but am having no success. i wrote a program that allocates memory and it was able to allocate as much RAM as i requested (stopped at 12GB) Was your program that got up to 12GB of memory written

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Dan Armbrust
I would give the IBM or blackdown JVM a try on linux - I've seen pretty wide variance in their speed on different operations. Sometimes better than Sun, sometimes worse - it depended on the task (I did some adhoc tests at one point that showed sun was faster for indexing, but IBM was faster fo

Re: get results by relevance, limiting results and then sort the results by some criterion

2006-02-21 Thread Dan Armbrust
Mufaddal Khumri wrote: When I do a search for example on "batteries" i get 1200+ results. I would like to show the user lets say 300. I can do that by only extracting the first 300 hits (sorted by decreasing relevance by default) and displaying those to the user. If you are only talking ab
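
A minimal sketch of the two-step idea in the question above: keep only the top N hits by relevance, then re-sort that subset by some other criterion. `RelevanceCutoff`, `Hit`, `price`, and `topThenSort` are hypothetical names used for illustration only, not Lucene classes, and the in-memory list stands in for whatever the search actually returns.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RelevanceCutoff {
    /** Hypothetical stand-in for a search hit; not a Lucene class. */
    static final class Hit {
        final String name;
        final double score; // relevance score reported by the search
        final double price; // assumed secondary sort key

        Hit(String name, double score, double price) {
            this.name = name;
            this.score = score;
            this.price = price;
        }
    }

    /** Keeps the `limit` most relevant hits, then orders them by ascending price. */
    static List<Hit> topThenSort(List<Hit> hits, int limit) {
        // Step 1: order a copy by decreasing relevance.
        List<Hit> byScore = new ArrayList<>(hits);
        byScore.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());

        // Step 2: cut off everything past the limit.
        int end = Math.max(0, Math.min(limit, byScore.size()));
        List<Hit> top = new ArrayList<>(byScore.subList(0, end));

        // Step 3: re-sort only the survivors by the secondary criterion.
        top.sort(Comparator.comparingDouble((Hit h) -> h.price));
        return top;
    }
}
```

Because the cutoff happens before the second sort, a cheap but barely relevant hit never displaces a relevant one; it simply does not make the list.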

Re: When do files in 'deleteable' get deleted?

2006-02-13 Thread Dan Armbrust
Aigner, Thomas wrote: I believe that the files are actually deleted from lucene when the optimize is run. That gets things into the 'deleteable' file - but it's never actually deleting all of the files from the deleteable file. I'm almost always ending up with at least 1 duplicate copy of my

When do files in 'deleteable' get deleted?

2006-02-13 Thread Dan Armbrust
If I am using lucene (daily build from ~ a month ago or so) on windows - and when I merge two indexes together, I get a number of .cfs files noted in my 'deleteable' file - but they never seem to actually be deleted by lucene. When does lucene try to delete these files - does it ever work on

Re: Issues while doing ant on lucene source

2005-11-17 Thread Dan Armbrust
Pol, Parikshit wrote: Hi Folks. I downloaded the Lucene and tried to do an ant. It initially gave me the following error: ... Are you using a current version of ant? Lucene 1.4.3 should already be fully built when you downloaded it - you shouldn't have to compile it. If you want the "curre

Re: Join Me

2005-10-20 Thread Dan Armbrust
Dan Quaroni wrote: And together we will rule the galaxy as father and son? -Original Message- From: Rob Young [mailto:[EMAIL PROTECTED] Sent: Thursday, October 20, 2005 2:22 PM To: java-user@lucene.apache.org Subject: Join Me 42!

Re: Lucene and remote index and java applet, with no java app server

2005-10-12 Thread Dan Armbrust
So here comes the next part of my applet ignorance. Can I embed the Lucene, etc, jar files in my applet so that when the user starts up the applet, they can be used on the local machine. This alone probably stops me from using an applet, I guess. Anyone have any idea where the definitive rules

Re: Lucene and remote index and java applet, with no java app server

2005-10-10 Thread Dan Armbrust
I see your words, but I hate to admit that I don't understand them in totality! When you say that the search is executed on the web server, that means that we would need to code it in Perl or some such, no? I don't see (except for a Perl or PHP script) how the search could execute on the website

Re: Lucene and remote index and java applet, with no java app server

2005-10-10 Thread Dan Armbrust
J. David Boyd wrote: Here's my dilemma. For years, we have supplied paper documentation to our customers. Many pages of paper. All together, it makes a 3 foot stack when printed. Also for many years, customers have been asking for docs in electronic format, so, recently, I wrote some Perl scr

Re: IO bandwidth throttling

2005-09-01 Thread Dan Armbrust
Ben Gollmer wrote: Chris Lamprecht wrote: I've wanted something similar, for the same purpose -- to keep lucene from consuming disk I/O resources when another process is running on the same machine. Sorry for jumping in (I'm a Lucene newb) but isn't this better handled

WhiteSpace Tokenizer question

2005-08-23 Thread Dan Armbrust
I wrote a slightly modified version of the WhitespaceTokenizer that allows me to treat other characters as whitespace. My thought was that this would be an easy way to make it tokenize on characters such as "-". My tokenizer looks like this: public class CustomWhiteSpaceTokenizer extends Char
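
Since the posted class is cut off in this listing, here is a standalone sketch of the idea it describes: split on whitespace plus a configurable set of extra separator characters. It is plain Java with no Lucene dependency, and `ExtraSeparatorSplitter` and its members are hypothetical names. The `isTokenChar` name is borrowed from the per-character hook that Lucene's `CharTokenizer` exposes, but this is not the poster's code.

```java
import java.util.ArrayList;
import java.util.List;

final class ExtraSeparatorSplitter {
    private final String extraSeparators; // characters to treat like whitespace

    ExtraSeparatorSplitter(String extraSeparators) {
        this.extraSeparators = extraSeparators;
    }

    /** True if the character belongs inside a token rather than between tokens. */
    boolean isTokenChar(char c) {
        return !Character.isWhitespace(c) && extraSeparators.indexOf(c) < 0;
    }

    /** Splits text into maximal runs of token characters. */
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A separator ends the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the trailing token
        }
        return tokens;
    }
}
```

With `"-"` as the extra separator, `top-notch search` would come out as three tokens: `top`, `notch`, `search`.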

1.9 official betas WAS: Query Parser custom analyzer question

2005-08-22 Thread Dan Armbrust
Daniel Naber wrote: Correct handling of multiple terms per position was only added to SVN, it's not part of Lucene 1.4.3. Regards Daniel Cool - is there a daily build somewhere, or do I have to roll my own? I couldn't find a daily build or a 1.9 alpha, beta, etc. on the site. Any idea whe

Query Parser custom analyzer question

2005-08-22 Thread Dan Armbrust
I have a custom Analyzer which performs normalization on all of the terms as they pass through. It does normalization like the following: trees -> tree Sometimes my normalizer returns multiple words for a normalization - for example: leaves -> leaf leave The second and all subsequent terms

Token Filter question

2005-08-18 Thread Dan Armbrust
I am implementing a filter that will remove certain characters from the tokens - things like '(', etc. - but the chars to be removed will be customizable. This is what I have come up with - but it doesn't seem very efficient. Is there a better way? Should I be adjusting the token endOffset when
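
The posted filter is not shown in this listing, so the following is only a sketch of one way to strip a customizable set of characters from a term without allocating when nothing needs removing. `CharStripper` is a hypothetical helper in plain Java, not a Lucene `TokenFilter`, and it does not address the offset question raised above.

```java
import java.util.HashSet;
import java.util.Set;

final class CharStripper {
    private final Set<Character> removable = new HashSet<>();

    /** @param charsToRemove every character in this string is stripped from terms */
    CharStripper(String charsToRemove) {
        for (char c : charsToRemove.toCharArray()) {
            removable.add(c);
        }
    }

    /** Returns the term with removable characters deleted, or the same instance if unchanged. */
    String strip(String term) {
        StringBuilder kept = null; // created only once a removable character is seen
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (removable.contains(c)) {
                if (kept == null) {
                    // First removal: copy the clean prefix seen so far.
                    kept = new StringBuilder(term.length());
                    kept.append(term, 0, i);
                }
            } else if (kept != null) {
                kept.append(c);
            }
        }
        return kept == null ? term : kept.toString();
    }
}
```

Most terms contain none of the unwanted characters, so the common path here is a single scan with no copying.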

Lucene score algorithm details?

2005-08-08 Thread Dan Armbrust
I know there used to be a webpage that gave the algorithm used by Lucene for scoring, along with some info on what each variable controlled, to some extent... I was looking to brush up on what the idf controls (and what will happen if I override it) but I can't seem to find that page any longer

Analyzer question

2005-08-08 Thread Dan Armbrust
It is my understanding that the StandardAnalyzer will remove underscores - so "some_word" will be indexed as 'some' and 'word'. I want to keep the underscores, so I was thinking of changing over to an Analyzer that uses the WhitespaceTokenizer, LowerCaseFilter, and StopFilter. What other tokenizin

Re: de pluralization

2005-08-05 Thread Dan Armbrust
Mufaddal Khumri wrote: Are there analyzers that do this already? It's not an analyzer, but the "norm" feature of this tool does a good job at getting to the normalized form of the words... http://umlslex.nlm.nih.gov/lvg/current/ http://umlslex.nlm.nih.gov/lvg/current/docs/userDoc/norm.htm

Re: Any problems with a failed IndexWriter optimize call?

2005-08-01 Thread Dan Armbrust
May I suggest: Don't call optimize. You don't need it. Here is my approach: Keep each one of your 250,000 document indexes separate - so run your batch, build the index, and then just close it. Don't try to optimize it. For each 250,000 document batch, just put it into a different folder.

Off Topic: Lucene vs Derby (vs MySQL) for spatial indexing

2005-07-28 Thread Dan Armbrust
Otis Gospodnetic wrote: You may also want to consider PostgreSQL for a few reasons: 3) it seems that the new versions let you embed Java directly into the database (perhaps something like Oracle's Java-embedding thing). Really? I realize this is off topic, but could you point me to some d

Search Timeout - abort a search

2005-07-07 Thread Dan Armbrust
Has anyone ever written code to make it possible to return from a search, after a given amount of time, returning the results that have been collected so far (but not necessarily all of them)? The only thing that I can see to do through the public Lucene APIs would be to do the search using a
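
One general shape for this, sketched in plain Java under stated assumptions: a collecting callback checks a deadline on every hit and aborts by throwing, while the caller keeps whatever was gathered before the abort. `TimeLimitedCollector`, `TimeExceeded`, and `searchWithTimeout` are hypothetical names, the `int[]` of document ids stands in for a real search loop, and the injectable clock exists only so the behavior can be exercised deterministically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

final class TimeLimitedCollector {
    /** Thrown from collect() to unwind out of the search loop. */
    static final class TimeExceeded extends RuntimeException {
        TimeExceeded() {
            super("search time budget exceeded");
        }
    }

    private final LongSupplier clock;      // e.g. System::nanoTime in real use
    private final long deadline;           // clock value after which collecting stops
    private final List<Integer> docIds = new ArrayList<>();

    TimeLimitedCollector(long budget, LongSupplier clock) {
        this.clock = clock;
        this.deadline = clock.getAsLong() + budget;
    }

    /** Records one hit, or aborts if the deadline has passed. */
    void collect(int docId) {
        if (clock.getAsLong() - deadline > 0) {
            throw new TimeExceeded();
        }
        docIds.add(docId);
    }

    List<Integer> collected() {
        return docIds;
    }

    /** Runs the (simulated) search and returns whatever was collected in time. */
    static List<Integer> searchWithTimeout(int[] matchingDocs, long budget, LongSupplier clock) {
        TimeLimitedCollector collector = new TimeLimitedCollector(budget, clock);
        try {
            for (int docId : matchingDocs) {
                collector.collect(docId);
            }
        } catch (TimeExceeded e) {
            // Deliberately swallowed: partial results are the point.
        }
        return collector.collected();
    }
}
```

The trade-off is that the partial results are simply the hits visited first, which are not necessarily the most relevant ones.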

Re: Is a field in use?

2005-06-23 Thread Dan Armbrust
In my indexes where the available fields vary by document, I maintain an additional field that lists which fields are in use per document. That way, I can query for all documents that contain field "foo", or all documents that contain a field "foo" and don't contain "bar"... etc. Avi
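
The bookkeeping described above can be modeled in memory roughly as follows. This is an analogue of the extra "which fields does this document have" field, not Lucene code; `FieldPresenceIndex` and its methods are hypothetical names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class FieldPresenceIndex {
    // Maps each field name to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> docsByField = new HashMap<>();

    /** Records which field names a document carries (the values are ignored here). */
    void addDocument(int docId, Map<String, String> fields) {
        for (String name : fields.keySet()) {
            docsByField.computeIfAbsent(name, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Documents that have the `required` field but not the `excluded` one. */
    Set<Integer> withButWithout(String required, String excluded) {
        Set<Integer> result =
                new TreeSet<>(docsByField.getOrDefault(required, Collections.emptySet()));
        result.removeAll(docsByField.getOrDefault(excluded, Collections.emptySet()));
        return result;
    }
}
```

In an index, the equivalent is to add each field's name as a term of one extra field at indexing time, so that presence and absence become ordinary term queries.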

weight score based on a fields value

2005-06-22 Thread Dan Armbrust
Is there a straightforward way that I could change the scoring algorithm such that it would break ties based on looking at the value of a field? I'm not actually searching for the value in the field, so it's not part of the query - I just want documents that have a particular field set to a par

Search for documents where field does not exist?

2005-06-17 Thread Dan Armbrust
I'm pretty sure the answer is no... but I'll check with the gurus anyway... In my collection of documents, I have a non-tokenized field that only occurs 0 or 1 times per document. Is it possible to create a query so that a document would be returned if (field == "some value" OR field does not

Re: how long should optimizing take

2005-06-02 Thread Dan Armbrust
You should be careful, however, not to end up with two VM instances each trying to open an index writer at the same time - one of them is going to fail. That is, if someone using your web interface tries to add a new document to the index while you have the optimizer running standalone, the web i

Re: how long should optimizing take

2005-06-02 Thread Dan Armbrust
I would run your optimize process in a separate thread, so that your web client doesn't have to wait for it to return. You may even want to set the optimize part up to run on a weekly schedule, at a low load time. I probably wouldn't reoptimize after every 30 documents, on an index that size.
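
A rough sketch of the "separate thread" suggestion using a single-threaded executor, so the web request returns immediately and two optimize runs never overlap. `BackgroundOptimizer` is a hypothetical wrapper, and the `Runnable` passed in is assumed to be whatever code performs the actual optimize call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BackgroundOptimizer {
    // One worker thread: submitted optimize runs are queued, never concurrent.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Runnable optimizeTask; // assumed: wraps the real optimize call

    BackgroundOptimizer(Runnable optimizeTask) {
        this.optimizeTask = optimizeTask;
    }

    /** Queues an optimize run and returns at once; the Future reports completion. */
    Future<?> optimizeAsync() {
        return worker.submit(optimizeTask);
    }

    /** Lets queued work finish, then stops the worker thread. */
    void shutdown() {
        worker.shutdown();
    }
}
```

For the weekly, low-load schedule mentioned above, a `ScheduledExecutorService` could take the place of the plain executor; that variation is not shown here.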