Re-read my last message, and then take a look at the Solr source code,
which will give you an idea of how to use Tika, even though you are using
only Lucene. If you have specific questions, please be specific.
To answer your latest question, yes, Tika is good enough. Solr
/update/extract uses i
We are not using Solr, just the Lucene core 4.0 engine. I am trying to
see if we can use the Tika library to extract textual information from
PDF/Word/Excel documents. I am mainly interested in reading the contents
inside the documents and indexing them with Lucene. My question is: is the Tika
framewor
You may be able to use Tika directly without needing to choose the specific
classes, although the latter may give you the specific data you need without
the extra overhead.
You could take a look at the Solr Extracting Request Handler source for an
example:
http://svn.apache.org/viewvc/lucene/
Have you tried using the PDFParser [1] and the OfficeParser [2]
classes from Tika?
This question seems more appropriate for the Tika user mailing list [3].
[1]
http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream,
org.xml.sax.ContentHandler, or
You should set your RAMBufferSizeMB to something smaller than the full
heap size of your JVM.
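
As a rough illustration of that advice, here is a minimal sketch that derives a buffer size from the JVM's maximum heap. The class name, the method name, and the 25% fraction are hypothetical choices made for this example, not anything Lucene prescribes. The only Lucene call referenced is IndexWriterConfig.setRAMBufferSizeMB, and it appears in a comment so the sketch compiles without Lucene on the classpath.

```java
// Hypothetical helper: RamBufferSizing, suggestRamBufferMb, and the 25%
// fraction are illustrative assumptions, not part of Lucene.
final class RamBufferSizing {

    private static final double BYTES_PER_MB = 1024.0 * 1024.0;

    /**
     * Returns the given fraction of the heap, expressed in megabytes.
     * The fraction must be strictly between 0 and 1 so that the result
     * is always smaller than the full heap.
     */
    static double suggestRamBufferMb(long maxHeapBytes, double heapFraction) {
        if (maxHeapBytes <= 0) {
            throw new IllegalArgumentException("maxHeapBytes must be positive");
        }
        if (heapFraction <= 0.0 || heapFraction >= 1.0) {
            throw new IllegalArgumentException("heapFraction must be in (0, 1)");
        }
        return (maxHeapBytes / BYTES_PER_MB) * heapFraction;
    }

    public static void main(String[] args) {
        // Maximum heap the JVM will try to use (controlled by -Xmx).
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        double bufferMb = suggestRamBufferMb(maxHeapBytes, 0.25);
        System.out.printf("max heap: %.0f MB, suggested RAM buffer: %.0f MB%n",
                maxHeapBytes / BYTES_PER_MB, bufferMb);

        // With Lucene 4.0 on the classpath, the value could then be applied as:
        //   IndexWriterConfig config =
        //       new IndexWriterConfig(Version.LUCENE_40, analyzer);
        //   config.setRAMBufferSizeMB(bufferMb);
    }
}
```

The right fraction depends on what else lives in the heap (searchers, caches, extracted document text), so treat 0.25 as a starting point to tune rather than a recommendation.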
Mike McCandless
http://blog.mikemccandless.com
On Sat, Jan 26, 2013 at 11:39 PM, wgggfiy wrote:
> I found it is very easy to run into an OutOfMemoryError.
> My idea is that Lucene could set the RAM memor