[ 
https://issues.apache.org/jira/browse/SOLR-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-2886:
---------------------------

    Fix Version/s:     (was: 4.0)

Removing fixVersion=4.0 since there is no evidence that anyone is currently 
working on this issue.  (This can certainly be revisited if volunteers step 
forward.)

FWIW: it's not clear to me from reading the comments how Solr would/could use 
the suggested workaround in the PDFBOX issue, since Solr doesn't invoke PDFBox 
directly and instead delegates to Tika.

If someone with more Tika knowledge can suggest a way in which Solr users can 
configure/influence how Tika uses PDFBox to control this setting, that seems 
like it would resolve things.
                
> Out of Memory Error with DIH and TikaEntityProcessor
> ----------------------------------------------------
>
>                 Key: SOLR-2886
>                 URL: https://issues.apache.org/jira/browse/SOLR-2886
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler, contrib - Solr Cell (Tika 
> extraction)
>    Affects Versions: 4.0-ALPHA
>            Reporter: Tricia Jenkins
>
> I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to 
> apache-solr-4.0-2011-10-14_08-56-59.war and then 
> apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 pdfs, of various 
> sizes, using the TikaEntityProcessor.  My indexing would run to completion 
> and was completely successful under the June build.  The only error was 
> readability of the fulltext in highlighting.  This was fixed in Tika 0.10 
> (TIKA-611).  I chose to use the October 14 build of Solr because Tika 0.10 
> had recently been included (SOLR-2372).  
> On the same machine without changing any memory settings my initial problem 
> is a Perm Gen error.  Fine, I increase the PermGen space.
> I've set the "onError" parameter to "skip" for the TikaEntityProcessor.  Now 
> I get several (6)
> SEVERE: Exception thrown while getting data
> java.net.SocketTimeoutException: Read timed out
> SEVERE: Exception in entity : 
> tika:org.apache.solr.handler.dataimport.DataImportHandlerException: 
> Exception in invoking url <url removed> # 2975
> pairs.  And after ~3881 documents, with auto commit set unreasonably 
> frequently I consistently get an Out of Memory Error 
> SEVERE: Exception while processing: f document : 
> null:org.apache.solr.handler.dataimport.DataImportHandlerException: 
> java.lang.OutOfMemoryError: Java heap space
> The stack trace points to 
> org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
>  and 
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718).
> The October 30 build performs identically.
> Funny thing is that monitoring via JConsole doesn't reveal any memory issues.
> Because the out of Memory error did not occur in June, this leads me to 
> believe that a bug has been introduced to the code since then.

