[
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178361#comment-14178361
]
Andrew Jackson edited comment on TIKA-1302 at 10/21/14 12:59 PM:
-----------------------------------------------------------------
Okay, so the c. 300,000 exceptions are here:
https://www.dropbox.com/s/ka19fguaxflp725/parse_errors.csv.gz?dl=0 - let me
know if you'd like the file placed elsewhere (it's 14MB of compressed CSV).
This conversation has helped me spot a gap in our code. We currently do a
Tika.detect() before we do a Tika.parse(), and only do the latter if the former
succeeded. Sadly, the version of the code that I used to generate this data did
not record the Tika exception for the .detect() step, only the .parse() step.
This will explain why there are no hung-thread events in this result set - the
interrupted .detect() was not recorded properly. We'll be re-running this scan
soonish, so I'll make sure the next version records all the exceptions. IIRC,
from looking at the MIME types, the permanent hangs were mostly ZIPs, Office
documents, and maybe some PDFs.
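To make the gap concrete, here is a minimal sketch of the detect-then-parse flow with failures recorded for both steps. It is not our actual code: `scan_resource`, the `detect` and `parse` callables, and the row fields are hypothetical stand-ins for the real Tika calls and our CSV columns.

```python
def scan_resource(resource, detect, parse):
    """Run detect, then parse only if detect succeeded.

    `detect` and `parse` are hypothetical callables standing in for the
    Tika calls. Returns (content_type, error_rows); a failure in either
    step is recorded together with the step that raised it.
    """
    errors = []
    try:
        content_type = detect(resource)
    except Exception as exc:
        # This is the case the earlier run failed to record.
        errors.append({
            "resource": resource,
            "step": "detect",
            "content_type": "",
            "exception": type(exc).__name__,
            "message": str(exc),
        })
        return None, errors
    try:
        parse(resource)
    except Exception as exc:
        errors.append({
            "resource": resource,
            "step": "parse",
            "content_type": content_type,
            "exception": type(exc).__name__,
            "message": str(exc),
        })
    return content_type, errors
```

With a `step` field on every row, an interrupted detect shows up in the output as its own row instead of disappearing.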
Note that the CSV includes the Content-Type from the .detect() step, and this
should indicate which module was run on the resource (i.e. whatever the Tika
1.5 mapping was for that MIME type). I don't think we changed the parse
configuration significantly, so it seems HTML and XHTML and XML should all have
gone through the HtmlParser (I'm not 100% sure about this, and will try to
check).
I'm not sure it's worth giving you all the SAX exceptions, as there are a lot
of repeats of the same problems. I think a random sample of about 50,000 should
be plenty. Does that sound okay to you?
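For the sample, one option is reservoir sampling, which draws a uniform random sample in a single pass without loading every row into memory. This is only a sketch of one way to do it, not what I have run: `reservoir_sample` and its parameters are hypothetical names.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Uses the classic single-pass reservoir method (Algorithm R), so the
    input can be streamed, e.g. line by line from a large CSV.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(row)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample
```

Called with `k=50000` over the SAX exception rows, this would give a sample of the size suggested above; passing a fixed `seed` makes the draw repeatable.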
EDIT: Oh, and I meant to say, I'm glad to hear about [~gostep] and
[[email protected]]'s efforts to run this on GovDocs, and would be
interested in comparing results. We already publish format profile data about
web archives, and would love to have more data to refer to.
> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
> Issue Type: Improvement
> Components: cli, general, server
> Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and
> running again, it might be fun to run Tika regularly against a large set of
> docs and report metrics.
> One excellent candidate corpus is govdocs1:
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?
> [~willp-bl], have anything handy you'd like to contribute?
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
> ;)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)