Re: Indexing is hung or doesn't complete

2010-10-13 Thread Bill Janssen
Ching wrote: > I use PDFBox version 1.1.0; I did find a workaround now. Just wondering > which tools do you use to extract text from pdf? Thanks. Ching, in UpLib I use a patched version of xpdf which reports the bounding box and font information for each word (as well as the Unicode characters o

Re: Indexing is hung or doesn't complete

2010-10-13 Thread Ching
I use PDFBox version 1.1.0; I did find a workaround now. Just wondering which tools do you use to extract text from pdf? Thanks. On Wed, Oct 13, 2010 at 11:36 AM, Fabiano Nunes wrote: > What version of PDFBox are you running? > PDFBox 0.72 does not work properly with some pdf documents. See more

Re: MultiFieldQueryParser

2010-10-13 Thread Erick Erickson
I'm not quite sure what you mean by "run a query against multiple fields". But would creating your own BooleanQuery where each clause was the parsed result against a specific field work? If this is irrelevant, could you give a couple of examples of what you're looking to accomplish? Best Erick O
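One way to picture the "one clause per field" idea is at the query-string level, without any Lucene classes. This is only a sketch under stated assumptions: the class `MultiFieldExpansion` and its `expand` helper are hypothetical names, the user query is assumed to contain no unbalanced parentheses or unescaped special characters, and the output relies on the field-grouping form `field:(...)` of the classic query syntax. It swaps in plain string expansion for the BooleanQuery construction suggested above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: applies one user query to several fields by string expansion. */
class MultiFieldExpansion {

    /** Builds "f1:(q) OR f2:(q) ..." so the same query text is tried against every field. */
    static String expand(String userQuery, List<String> fields) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("at least one field is required");
        }
        List<String> clauses = new ArrayList<>();
        for (String field : fields) {
            // One clause per field; the parentheses keep multi-word queries inside the field.
            clauses.add(field + ":(" + userQuery + ")");
        }
        // OR means a match in any single field is enough.
        return String.join(" OR ", clauses);
    }
}
```

The expanded string would then be handed to an ordinary single-field query parser. Unlike the array-based call that raised the exception, there is only one query here, so there is no length to keep in step with the field list.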

MultiFieldQueryParser

2010-10-13 Thread Lev Bronshtein
Hi Group, I have an issue when using MultiFieldQueryParser. I would like to use one query against a number of fields; however, I get a java.lang.IllegalArgumentException: queries.length != fields.length. I looked at the javadoc, and it looks like the only way to run one query against multiple fie

Re: How about lucene's delete performance ?

2010-10-13 Thread Otis Gospodnetic
Hello, Of course, if you actually want a rolling last-7-days effect rather than this week vs. previous week, then you could go with smaller indices, say daily ones. Then you'd always add new docs to the latest index and remove the oldest index completely every 24 hours. You could go hourly
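The daily-index rotation described above can be sketched without Lucene. This is a minimal illustration, not the poster's implementation: `RollingDailyIndexes` is a hypothetical class, each "index" is just an in-memory list of strings, and `search` is a naive substring scan standing in for a multi-index search.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical stand-in for a set of per-day indexes; each bucket is a list of documents. */
class RollingDailyIndexes {
    private final int daysToKeep;
    // Sorted by date, so everything older than the window sits at the head of the map.
    private final TreeMap<LocalDate, List<String>> buckets = new TreeMap<>();

    RollingDailyIndexes(int daysToKeep) {
        if (daysToKeep < 1) {
            throw new IllegalArgumentException("daysToKeep must be >= 1");
        }
        this.daysToKeep = daysToKeep;
    }

    /** Adds a document to its day's bucket, then drops buckets that fell out of the window. */
    void add(LocalDate day, String doc) {
        buckets.computeIfAbsent(day, d -> new ArrayList<>()).add(doc);
        expire(buckets.lastKey());
    }

    /** Removes every bucket older than the window that ends at 'today'. */
    void expire(LocalDate today) {
        LocalDate oldestKept = today.minusDays(daysToKeep - 1L);
        // headMap is a live view: clearing it discards whole buckets, never single documents.
        buckets.headMap(oldestKept, false).clear();
    }

    /** Scans all live buckets, the way a multi-index search would cover all daily indexes. */
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (List<String> docs : buckets.values()) {
            for (String doc : docs) {
                if (doc.contains(term)) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }

    int bucketCount() {
        return buckets.size();
    }
}
```

The point of the design is that expiry never touches individual documents: an old day is dropped as a whole unit, which is why no per-document deletes are needed.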

Re: Indexing is hung or doesn't complete

2010-10-13 Thread Fabiano Nunes
What version of PDFBox are you running? PDFBox 0.72 does not work properly with some pdf documents. See more in https://issues.apache.org/jira/browse/PDFBOX-361. So, I wrote an extractor (a copy of the original, in fact) based on the trunk version (1.2.1, actually). Furthermore, this version is faster e

Re: Indexing is hung or doesn't complete

2010-10-13 Thread Ching
Hi, Thank you for your suggestions. I found the reason: PDFBox seems to have a problem parsing large documents (20MB). I have a few of them among those 2000 docs, and those are the ones throwing OutOfMemory errors. The app does exit, and the JVM died. I am running on a 32-bit machine. -- Ching On

Re: Questions about Lucene usage recommendations

2010-10-13 Thread Umesh Prasad
One more suggestion: With Lucene 2.1 you might be using the Hits API to search, which preloads the documents. See https://issues.apache.org/jira/browse/LUCENE-954?focusedCommentId=12579258&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12579258 The performance hit i

determining the type of a term - retrieving a payload

2010-10-13 Thread Sykes, Derek
Hi there, I'm currently trying to work out how I can determine the type (string/number/date/etc.) of a term. I've not seen any off-the-shelf way to do it, so I am trying to store a payload against each term that records the type. I'm having a little trouble retrieving a payload I'd stored onto the

Re: How about lucene's delete performance ?

2010-10-13 Thread Shai Erera
Note that deleteAll does not require you to optimize anything. It literally removes all segments from the index in one shot, and when the files are unreferenced, they will be removed entirely. Shai On Wed, Oct 13, 2010 at 4:53 PM, Dan OConnor wrote: > Jeff, > I would suggest not deleting documen

Re: How about lucene's delete performance ?

2010-10-13 Thread Dan OConnor
Jeff, I would suggest not deleting documents off the back of the index unless you can optimize your index regularly. (Depending on your volume, this could be every day or once a week.) I would suggest having two indexes, one that is "this" week and one that is "last" week and a multi-index searc

Re: How about lucene's delete performance ?

2010-10-13 Thread Shai Erera
There's a deleteAll() method on IndexWriter, which is very fast. After you commit(), the documents will no longer be visible to searchers. When the last searcher is closed, the documents will completely disappear from the index. All in all it's quite a good approach to take. You can also consi

How about lucene's delete performance ?

2010-10-13 Thread Jeff Zhang
Hi all, I only want to index the latest week's data; the previous data can be deleted. So I'd like to know about Lucene's delete performance and whether it will have an impact on search performance when I do lots of delete operations in the meantime. Thanks -- Best Regards Jeff Zhang -

Re: Indexing is hung or doesn't complete

2010-10-13 Thread Senthil
Hi Ching, I do not think the issue is with Lucene for 2000 documents. As Anshum mentioned, give more details about the environment, and check the CPU usage and the index .fdt file timestamp while it hangs. Using logs would also help to identify the real cause. I used to work with Lucene 2.4 and recently 3.0.2. No sim