Ching wrote:
> I use PDFBox version 1.1.0; I did find a workaround now. Just wondering
> which tools do you use to extract text from pdf? Thanks.
Ching, in UpLib I use a patched version of xpdf which reports the
bounding box and font information for each word (as well as the Unicode
characters o
I use PDFBox version 1.1.0; I did find a workaround now. Just wondering
which tools do you use to extract text from pdf? Thanks.
On Wed, Oct 13, 2010 at 11:36 AM, Fabiano Nunes wrote:
> What version of PDFBox are you running?
> PDFBox 0.72 does not work properly with some pdf documents. See more
I'm not quite sure what you mean by "run a query against multiple fields".
But would creating your own BooleanQuery where each clause was the parsed
result against a specific field work?
If this is irrelevant, could you give a couple of examples of what you're
looking to accomplish?
Best
Erick
O
Hi Group,
I have an issue when using MultiFieldQueryParser. I would like to use one query
against a number of fields; however, I get a
java.lang.IllegalArgumentException: queries.length != fields.length
Looked at the javadoc, and it looks like the only way to run one query against
multiple fie
Hello,
Of course, if you actually want the rolling last-7-days effect and not this
week vs. previous week, then you could go with smaller indices, say daily ones.
Then you'd always add new docs to the latest index and remove the oldest
index completely every 24 hours.
You could go hourly
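A minimal sketch of the daily-index bookkeeping described above, in plain Java with no Lucene calls. The class name, the "index-yyyy-MM-dd" naming scheme, and the daysToKeep parameter are assumptions made for illustration; opening, searching, and deleting the actual index directories is left out.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the naming scheme and retention window are
// illustrative assumptions, not part of Lucene's API.
final class RollingIndexPlan {
    private RollingIndexPlan() {}

    // Name of the index that holds documents for the given day.
    // LocalDate.toString() yields ISO-8601 (yyyy-MM-dd).
    static String indexNameFor(LocalDate day) {
        return "index-" + day;
    }

    // Indices to search: today plus the previous (daysToKeep - 1) days,
    // newest first. New documents always go into the first entry.
    static List<String> liveIndices(LocalDate today, int daysToKeep) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < daysToKeep; i++) {
            names.add(indexNameFor(today.minusDays(i)));
        }
        return names;
    }

    // Of the indices currently present, those outside the window.
    // Each of these can be removed as a whole, with no per-document deletes.
    static List<String> expiredIndices(List<String> existing, LocalDate today, int daysToKeep) {
        List<String> live = liveIndices(today, daysToKeep);
        List<String> expired = new ArrayList<>();
        for (String name : existing) {
            if (!live.contains(name)) {
                expired.add(name);
            }
        }
        return expired;
    }
}
```

A scheduled job could call expiredIndices once every 24 hours and drop whatever it returns, while searches run over the names from liveIndices. Switching to hourly indices would only change the naming granularity and the window size.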
What version of PDFBox are you running?
PDFBox 0.72 does not work properly with some pdf documents. See more in
https://issues.apache.org/jira/browse/PDFBOX-361.
So, I wrote an extractor (a copy of the original, in fact) based on trunk
version (1.2.1, actually). Furthermore, this version is faster e
Hi,
Thank you for your suggestions. I found the reason: PDFBox seems to have
problems parsing large documents (20MB). I have a few of them within those
2000 docs, and those are the ones throwing OutOfMemory errors. The app does
exit, and the JVM dies. I am running on a 32-bit machine.
-- Ching
On
One more suggestion:
With Lucene 2.1 you might be using the Hits API to search, which preloads
the documents.
See
https://issues.apache.org/jira/browse/LUCENE-954?focusedCommentId=12579258&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12579258
The performance hit i
Hi there,
I'm currently trying to work out how I can determine the type
(string/number/date/etc.) of a term. I've not seen any off-the-shelf way to do
it, so I am trying to store a payload against each term that records the type.
I'm having a little trouble retrieving a payload I'd stored onto the
Note that deleteAll does not require you to optimize anything. It literally
removes all segments from the index in one shot, and when the files are
unreferenced, they will be removed entirely.
Shai
On Wed, Oct 13, 2010 at 4:53 PM, Dan OConnor wrote:
> Jeff,
> I would suggest not deleting documen
Jeff,
I would suggest not deleting documents off the back of the index unless you can
optimize your index regularly. (Depending on your volume, this could be every
day or once a week.)
I would suggest having two indexes, one that is "this" week and one that is
"last" week and a multi-index searc
There's a deleteAll() method on IndexWriter, which is very fast. After you
commit(), none of the documents will be visible to new searchers. When the
last searcher is closed, the documents will completely disappear from the
index. All in all it's quite a good approach to take.
You can also consi
Hi all,
I only want to index the latest week's data; the earlier data can
be deleted. So I'd like to know about Lucene's delete performance and
whether it will have an impact on search performance when I do lots of
delete operations in the meantime. Thanks
--
Best Regards
Jeff Zhang
-
Hi Ching
I do not think the issue is with Lucene for 2000 documents. As Anshum
mentioned, please give more details about the environment.
Also check the CPU usage and the index .fdt file timestamp while it hangs.
Using logs would help to identify the real cause. I used to work with Lucene 2.4
and recently 3.0.2. No sim