[ 
https://issues.apache.org/jira/browse/NUTCH-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16518239#comment-16518239
 ] 

Sebastian Nagel commented on NUTCH-2603:
----------------------------------------

Hi [~ArkadiKosmynin], this contradicts my experience: I've used 
Nutch since version 0.9 (without parse-tika) to also process PDFs and office 
documents. Most issues with these documents disappeared over time as Tika 
matured. 

I've tried to reproduce the parsing issues using the recent Nutch master branch 
(with Tika 1.18) by randomly picking documents parsed with one of the "legacy" 
parsers. 
- at least some documents require authentication, e.g. 
[1|https://www.atnf.csiro.au/lists/at_meetings/2006/att-0064/CJ_SKA_Industry_Astro_update_Apr_06.ppt], 
[2|https://www.atnf.csiro.au/lists/at_meetings/2008/att-0001/ASKAP_Antennas_summary.doc], 
[3|https://svn.atnf.csiro.au/askap/ASKAPDesignEnhancements/PCB/790-0015-Bullant/Datasheets/Control.xls]. 
[~ArkadiKosmynin], could you provide some of these documents or remove the 
access restrictions?
- among the remaining URLs there is one systematic error: the server behind 
www.atnf.csiro.au regularly sends {{application/msword}} for plain-text 
documents ending in {{.doc}}, e.g. 
[update.doc|https://www.atnf.csiro.au/computing/software/gipsy/doc/update.doc]. 
Tika can parse this document; I've opened NUTCH-2606 to address this.
- Tika fails to parse MS Word 2.0 documents: 
[zenpap4.doc|http://www.atnf.csiro.au/pub/data/pmn/zenpap4.doc]. It's a known 
issue that Tika (or rather POI) cannot parse old MS Office documents, see 
TIKA-2107. However, I doubt that they were successfully parsed by the "legacy" 
parsers:
-* testing with Nutch 1.0 and parse-msword I get a failed parse with the error:
{noformat}
Can't be handled as Microsoft document. java.io.IOException: Invalid header 
signature; read 867295287388775899, expected -2226271756974174256
{noformat}
-* similarly, using the recent Nutch master and Tika 1.18:
{noformat}
% bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika' -forceAs 
application/msword http://www.atnf.csiro.au/pub/data/pmn/zenpap4.doc
...
Status: failed(2,0): Invalid header signature; read 0x0C094078002DA5DB, 
expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
{noformat}
-* when forcing the OpenDocumentParser, the result is a successful but empty 
parse, both with recent Nutch/Tika and with Nutch 1.0 and parse-oo:
{noformat}
% bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika'  
-forceAs application/vnd.oasis.opendocument.text 
http://www.atnf.csiro.au/pub/data/pmn/zenpap4.doc
...
Status: success(1,0)
Title: 
Outlinks: 0
{noformat}
and with Nutch 1.0, respectively:
{noformat}
% nutch-1.0/bin/nutch org.apache.nutch.parse.ParserChecker -forceAs 
application/vnd.oasis.opendocument.text 
http://www.atnf.csiro.au/pub/data/pmn/zenpap4.doc
...
Status: success(1,0)
Title: 
Outlinks: 0
...
{noformat}
The same happens when calling the main routine of the plugin parse-oo directly 
(with a local file):
{noformat}
% nutch-1.0/bin/nutch plugin parse-oo org.apache.nutch.parse.oo.OOParser 
.../zenpap4.doc
Version: 5
Status: success(1,0)
Title: 
Outlinks: 0
Content Metadata: 
Parse Metadata: 

Text: ''
{noformat}
I've opened TIKA-2675 to address this problem. [~ArkadiKosmynin], is it 
possible that the message in the attached public_docs.txt is also misleading?
{noformat}
arch.log.2018-06-15:2018-06-15 16:05:49,686 INFO  parse.ParseUtil - 
Successfully parsed [http://www.atnf.csiro.au/pub/data/pmn/zenpap4.doc] with 
[org.apache.nutch.parse.oo.OOParser@380d5de2]
{noformat}
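
For reference, the two "Invalid header signature" messages above appear to report the same comparison, once in signed decimal (Nutch 1.0) and once in hex (Tika 1.18). The following is a minimal sketch of such a check, not code taken from POI, Tika or Nutch: the class and method names are hypothetical, and it assumes the signature is the first 8 bytes of the file read as a little-endian long. Only the two constants come from the messages quoted above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Ole2SignatureCheck {

    // Expected OLE2 signature, as quoted in the Tika 1.18 error message.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    // Value read from zenpap4.doc according to the same error message.
    static final long ZENPAP4_SIGNATURE = 0x0C094078002DA5DBL;

    // Assumption: the signature is the first 8 bytes, read little-endian.
    static long readSignature(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 header bytes");
        }
        return ByteBuffer.wrap(header, 0, 8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
    }

    // True if the header starts with the OLE2 signature.
    static boolean looksLikeOle2(byte[] header) {
        return header.length >= 8 && readSignature(header) == OLE2_SIGNATURE;
    }

    // Helper for the demo: lay out a long as 8 little-endian bytes.
    static byte[] toLittleEndianBytes(long value) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(value)
                .array();
    }

    public static void main(String[] args) {
        // Signed decimal form of both values, for comparison with the
        // numbers in the Nutch 1.0 message.
        System.out.println(OLE2_SIGNATURE);
        System.out.println(ZENPAP4_SIGNATURE);

        // The header reported for zenpap4.doc does not match the OLE2 signature.
        System.out.println(looksLikeOle2(toLittleEndianBytes(ZENPAP4_SIGNATURE)));
        System.out.println(looksLikeOle2(toLittleEndianBytes(OLE2_SIGNATURE)));
    }
}
```

A check of this kind only tells whether a file is an OLE2 container; it says nothing about whether a parser that returns success(1,0) actually extracted any text.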

So far, these are the only problems I've seen. Please open separate 
issues for other problems. I strongly favor fixing issues over 
maintaining potentially buggy legacy parsers.

> Bring back legacy pre-Tika parsers and use them as back up parsers
> ------------------------------------------------------------------
>
>                 Key: NUTCH-2603
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2603
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Arkadi Kosmynin
>            Priority: Major
>         Attachments: public_docs.txt
>
>
> There are cases when legacy parsers successfully parse documents on which 
> Tika fails. I am attaching a list of examples of such documents. Nutch allows 
> use of more than one parser on a document, in a sequence, until the document 
> has been parsed successfully. Thus, old parsers can be combined with Tika to 
> achieve better parsing success rate, at least until Tika is perfect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
