[jira] [Commented] (SOLR-2886) Out of Memory Error with DIH and TikaEntityProcessor

2016-10-01 Thread Alexandre Rafalovitch (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15539674#comment-15539674 ]

Alexandre Rafalovitch commented on SOLR-2886:
---------------------------------------------

Does this happen with the latest version of Solr/Tika? If not, or if it 
cannot be reproduced, I suggest closing this issue.

> Out of Memory Error with DIH and TikaEntityProcessor
> ----------------------------------------------------
>
> Key: SOLR-2886
> URL: https://issues.apache.org/jira/browse/SOLR-2886
> Project: Solr
> Issue Type: Bug
> Components: contrib - DataImportHandler, contrib - Solr Cell (Tika extraction)
> Affects Versions: 4.0-ALPHA
> Reporter: Tricia Jenkins
>
> I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to 
> apache-solr-4.0-2011-10-14_08-56-59.war and then 
> apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 pdfs, of various 
> sizes, using the TikaEntityProcessor.  My indexing would run to completion 
> and was completely successful under the June build.  The only error was 
> readability of the fulltext in highlighting.  This was fixed in Tika 0.10 
> (TIKA-611).  I chose to use the October 14 build of Solr because Tika 0.10 
> had recently been included (SOLR-2372).  
> On the same machine, without changing any memory settings, my initial problem 
> is a PermGen error.  Fine, I increase the PermGen space.
> I've set the "onError" parameter to "skip" for the TikaEntityProcessor.  Now 
> I get several (6)
> SEVERE: Exception thrown while getting data
> java.net.SocketTimeoutException: Read timed out
> SEVERE: Exception in entity : 
> tika:org.apache.solr.handler.dataimport.DataImportHandlerException: 
> Exception in invoking url <url removed> # 2975
> pairs.  And after ~3881 documents, with auto commit set unreasonably 
> frequently, I consistently get an Out of Memory Error: 
> SEVERE: Exception while processing: f document : 
> null:org.apache.solr.handler.dataimport.DataImportHandlerException: 
> java.lang.OutOfMemoryError: Java heap space
> The stack trace points to 
> org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
>  and 
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718).
> The October 30 build performs identically.
> Funny thing is that monitoring via JConsole doesn't reveal any memory issues.
> Because the Out of Memory error did not occur in June, I believe a bug has 
> been introduced to the code since then.






[jira] [Commented] (SOLR-2886) Out of Memory Error with DIH and TikaEntityProcessor

2011-11-25 Thread Tricia Williams (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157238#comment-13157238 ]

Tricia Williams commented on SOLR-2886:
---------------------------------------

Some further bug tracing shows that PDFBox's RandomAccessBuffer is responsible 
for the Out of Memory error.  To grow, it allocates a new, larger buffer and 
copies the existing buffer into it, with no check that enough heap is 
available for the larger buffer.  Use of a RandomAccessBuffer was introduced 
by PDFBOX-948 at revision 1072678 on Feb 20, 2011 and shipped in the PDFBox 
1.5 and 1.6 releases, and hence reached Tika 0.10 (revisions 1080162 on Mar 
10, 2011 and 1171497 on Sep 16, 2011).  After moving my DIH workflow to 
another machine with more memory available, indexing runs to completion.
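
For illustration, the growth pattern described above looks roughly like the 
following sketch (a simplified illustration, not the actual PDFBox source; 
names and sizes are made up).  The old array stays live while the larger one 
is allocated, so the allocation itself can fail even when the document data 
would otherwise fit in the heap:

    // Simplified sketch of a doubling scratch buffer; not PDFBox code.
    class GrowingBuffer {
        private byte[] buffer = new byte[16 * 1024];
        private int size = 0;

        void ensureCapacity(int needed) {
            if (needed <= buffer.length) {
                return;
            }
            int newLength = Math.max(buffer.length * 2, needed);
            // Old and new arrays are both live during the copy, and nothing
            // checks that the heap can hold the larger one, so this is the
            // line where java.lang.OutOfMemoryError: Java heap space surfaces.
            byte[] newBuffer = new byte[newLength];
            System.arraycopy(buffer, 0, newBuffer, 0, size);
            buffer = newBuffer;
        }
    }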

It appears that I am the victim of the caveat stated in PDFBOX-948:

For normal sized PDF files, the in-memory implementation RandomAccessBuffer 
should not increase the memory usage too much, while providing faster IO as 
all access operations are only memory copies.

Therefore, please consider switching the default to in-memory scratch buffers. 
Users with very large files can still pass a temporary directory.

Now to track down how to detect large files and use a temporary directory 
instead.  This may turn out to be a Tika issue rather than a Solr one.
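
A minimal sketch of what that could look like, assuming the PDFBox 1.x 
PDDocument.load overload that takes a caller-supplied RandomAccess scratch 
buffer; the 50 MB threshold, class name, and temp-file naming are all 
hypothetical:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.pdfbox.io.RandomAccess;
    import org.apache.pdfbox.io.RandomAccessBuffer;
    import org.apache.pdfbox.io.RandomAccessFile;
    import org.apache.pdfbox.pdmodel.PDDocument;

    public class ScratchAwareLoader {
        // Hypothetical cutoff: larger files get a disk-backed scratch buffer.
        private static final long MAX_IN_MEMORY_BYTES = 50L * 1024 * 1024;

        // Caller is responsible for closing the returned document.
        public static PDDocument load(File pdf, File tmpDir) throws Exception {
            RandomAccess scratch;
            if (pdf.length() > MAX_IN_MEMORY_BYTES) {
                // Large file: back the parser's scratch space with a temp
                // file so buffer growth happens on disk, not in the heap.
                File tmp = File.createTempFile("pdfbox-scratch", ".tmp", tmpDir);
                tmp.deleteOnExit();
                scratch = new RandomAccessFile(tmp, "rw");
            } else {
                // Normal-sized file: keep the faster in-memory buffer.
                scratch = new RandomAccessBuffer();
            }
            InputStream in = new FileInputStream(pdf);
            try {
                return PDDocument.load(in, scratch);
            } finally {
                in.close();
            }
        }
    }

The same choice would have to be made wherever Tika's PDFParser calls 
PDDocument.load, which is why this may end up as a Tika change rather than a 
Solr one.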

> Out of Memory Error with DIH and TikaEntityProcessor
> ----------------------------------------------------
>
> Key: SOLR-2886
> URL: https://issues.apache.org/jira/browse/SOLR-2886
> Project: Solr
> Issue Type: Bug
> Components: contrib - DataImportHandler, contrib - Solr Cell (Tika extraction)
> Affects Versions: 4.0
> Reporter: Tricia Williams
> Fix For: 4.0


