[jira] [Updated] (TIKA-836) parsing really slow on some documents

Rob Tulloh (Updated) (JIRA) Thu, 29 Dec 2011 12:46:54 -0800

     [ 
https://issues.apache.org/jira/browse/TIKA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rob Tulloh updated TIKA-836:
----------------------------

    Description: 
We are seeing that Tika sometimes takes a very long time to parse some content 
(likely PDF). For example, with the following EML file, which contains 4 
documents (2 PDF, 1 MS Excel, 1 text):

{noformat}
fgrep --binary-file=text Content-Type: XXX.eml
Content-Type: multipart/mixed;
Content-Type: multipart/alternative;
Content-Type: text/plain;
Content-Type: text/html;
Content-Type: application/octet-stream;
Content-Type: application/octet-stream;
Content-Type: application/vnd.ms-excel;

du -sh XXX.eml
6.0M    XXX.eml
{noformat}
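
To narrow down which attachment is responsible, a per-part breakdown that also reports each part's decoded size may be more useful than the Content-Type listing alone. The following is a minimal sketch using only Python's standard `email` package. The helper name `list_parts` is hypothetical and is not part of Tika or Solr.

```python
import email
from email import policy


def list_parts(raw_bytes):
    """Return (content_type, filename, decoded_size) for every MIME part.

    Hypothetical helper for illustration; it only inspects the message
    structure and does not invoke Tika.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            # multipart/* containers have no payload of their own.
            rows.append((part.get_content_type(), None, 0))
            continue
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True) or b""
        rows.append((part.get_content_type(), part.get_filename(), len(payload)))
    return rows


# Example use (XXX.eml is the redacted file name from this report):
#   with open("XXX.eml", "rb") as fh:
#       for ctype, name, size in list_parts(fh.read()):
#           print(f"{size:>10}  {ctype}  {name or '-'}")
```

Each non-container part could then be saved to its own file and run through tika-app separately, which should show whether one of the two application/octet-stream parts accounts for most of the time.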

Note that Tika takes nearly 30 minutes to process this content even though 
the source is only 6 MB in size:

{noformat}
time java -Xmx2G -jar ../../tika-app-1.0.jar -m XXX.eml >meta.out
WARN - Did not found XRef object at specified startxref position 230521
WARN - Did not found XRef object at specified startxref position 3742379

real    29m16.913s
user    18m17.050s
sys     0m19.465s
{noformat}

Is there any way to configure Tika (in particular via Solr) to process such 
files more quickly?

  was:
We are seeing that tika sometimes takes a very long time to parse some content 
(likely PDF). For example, with the following EML file that contains 4 
documents (2 PDF, 1 MS Excel, 1 text):

{noformat}
fgrep --binary-file=text Content-Type: XXX.eml
Content-Type: multipart/mixed;
Content-Type: multipart/alternative;
Content-Type: text/plain;
Content-Type: text/html;
Content-Type: application/octet-stream;
Content-Type: application/octet-stream;
Content-Type: application/vnd.ms-excel;

du -sh XXX.eml
6.0M    1326391801.eml
{noformat}

Note that it takes tika nearly 30 minutes to process this content even though 
the source is only 6M in size:

{noformat}
time java -Xmx2G -jar ../../tika-app-1.0.jar -m XXX.eml >meta.out
WARN - Did not found XRef object at specified startxref position 230521
WARN - Did not found XRef object at specified startxref position 3742379

real    29m16.913s
user    18m17.050s
sys     0m19.465s
{noformat}

Is there any way to configure tika (in particular via solr) to process files 
more quickly?

    
> parsing really slow on some documents
> -------------------------------------
>
>                 Key: TIKA-836
>                 URL: https://issues.apache.org/jira/browse/TIKA-836
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>         Environment: CentOS 4.x/5.x/6.x
>            Reporter: Rob Tulloh

