date:20130127

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Jack Krupansky

Re-read my last message - and then take a look at that Solr source code, which will give you an idea how to use Tika, even though you are using Lucene only. If you have specific questions, please be specific. To answer your latest question, yes, Tika is good enough. Solr /update/extract uses i

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread saisantoshi

We are not using Solr, just the Lucene core 4.0 engine. I am trying to see if we can use the Tika library to extract textual information from PDF/Word/Excel documents. I am mainly interested in reading the contents of the documents and indexing them with Lucene. My question here is, is tika framewor

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Jack Krupansky

You may be able to use Tika directly without needing to choose the specific classes, although the latter may give you the specific data you need without the extra overhead. You could take a look at the Solr Extracting Request Handler source for an example: http://svn.apache.org/viewvc/lucene/

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Adrien Grand

Have you tried using the PDFParser [1] and the OfficeParser [2] classes from Tika? This question seems to be more appropriate for the Tika user mailing list [3].
[1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, or

Re: how to avoid OutOfMemoryError while indexing ?

2013-01-27 Thread Michael McCandless

You should set your RAMBufferSizeMB to something smaller than the full heap size of your JVM.
Mike McCandless
http://blog.mikemccandless.com
On Sat, Jan 26, 2013 at 11:39 PM, wgggfiy wrote:
> I found it is very easy to come into OutOfMemoryError.
> My idea is that lucene could set the RAM memor
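
The advice above (keep RAMBufferSizeMB below the JVM heap size) could be sketched roughly as follows. This is only an illustration, not code from the thread: the class name `RamBufferSketch`, the helper `ramBufferMb`, and the choice of one quarter of the heap are all assumptions made for the example. The only Lucene call referred to is `IndexWriterConfig.setRAMBufferSizeMB(double)`, and it appears in a comment so that the sketch runs without Lucene on the classpath.

```java
class RamBufferSketch {

    // Hypothetical helper: returns a RAM buffer size in MB that is a given
    // fraction of the heap, so it always stays below the full heap size.
    static double ramBufferMb(long maxHeapBytes, double fraction) {
        if (maxHeapBytes <= 0) {
            throw new IllegalArgumentException("maxHeapBytes must be positive");
        }
        if (fraction <= 0.0 || fraction >= 1.0) {
            throw new IllegalArgumentException("fraction must be between 0 and 1 (exclusive)");
        }
        double heapMb = maxHeapBytes / (1024.0 * 1024.0);
        return heapMb * fraction;
    }

    public static void main(String[] args) {
        // Maximum heap the JVM will attempt to use (controlled by -Xmx).
        long maxHeapBytes = Runtime.getRuntime().maxMemory();

        // Assumption for illustration: give the indexing buffer a quarter of the heap.
        double bufferMb = ramBufferMb(maxHeapBytes, 0.25);
        System.out.printf("max heap: %.1f MB, RAM buffer: %.1f MB%n",
                maxHeapBytes / (1024.0 * 1024.0), bufferMb);

        // With Lucene available, the value would then be applied along these lines:
        //   IndexWriterConfig config = ...;
        //   config.setRAMBufferSizeMB(bufferMb);
    }
}
```

The right fraction depends on what else the application keeps on the heap (open readers, documents being built, and so on), so the quarter used here is a starting point to tune rather than a recommendation.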

5 matches
