[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated NUTCH-1997:
---------------------------
    Comment: was deleted

(was: Notes:
The attached CBOR file contains the magic bytes of both the xhtml and cbor 
types. With a magic priority of 40 on application/cbor, we run into the 
following issues:

Problem 1: magic priority 40.
        application/xhtml+xml has a higher magic priority (50) than 
application/cbor (40). [I don't know who assigned 40 to cbor, or why.] So if 
xhtml is read and compared first, cbor is not even placed in the magic 
estimation list, because its priority is lower. Tests confirm that xhtml is 
indeed read and compared against the input file first, so any type with a 
priority below 50 is disregarded. 


Problem 2: magic priority 50 for both types.
        In Tika, given a file dumped by the Nutch dumper tool, both types 
(xhtml and cbor) are selected as candidate MIME types and put in the magic 
estimation list. Since xhtml is read first, it is placed above cbor. To break 
that tie, Tika relies on the decision of the extension method. If the 
extension method fails to detect the type (for now, let's ignore the metadata 
hint method for simplicity, although the same applies to it), xhtml is 
eventually returned.

Pull request I am going to send: I will set the magic priority of the cbor 
type to 50, the same as xhtml, because it would probably be risky to discard 
any of the estimated types without consulting the extension method.
)
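
The two problems in the deleted comment can be illustrated with a small, 
simplified model. This is only a sketch of the behaviour the comment 
describes, not Tika's actual detection code: the names MagicMatch, 
candidates_by_magic and detect are hypothetical, and the rule "fall back to 
the first candidate when the extension hint does not match" is an assumption 
taken from the comment's wording.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MagicMatch:
    """A MIME type whose magic bytes matched the input (hypothetical model)."""
    mime_type: str
    priority: int


def candidates_by_magic(matches: List[MagicMatch]) -> List[str]:
    """Keep only the matches sharing the highest priority, in read order."""
    if not matches:
        return []
    top = max(m.priority for m in matches)
    return [m.mime_type for m in matches if m.priority == top]


def detect(matches: List[MagicMatch], extension_hint: Optional[str] = None) -> str:
    """Pick a type: the extension hint breaks ties among magic candidates."""
    candidates = candidates_by_magic(matches)
    if not candidates:
        return extension_hint or "application/octet-stream"
    if extension_hint in candidates:
        return extension_hint
    # Assumption: with no usable hint, the first candidate read wins.
    return candidates[0]


XHTML = "application/xhtml+xml"
CBOR = "application/cbor"

# Problem 1: cbor at priority 40 never reaches the candidate list.
problem1 = [MagicMatch(XHTML, 50), MagicMatch(CBOR, 40)]
# Problem 2 / proposed fix: both at 50, so the extension hint can decide.
problem2 = [MagicMatch(XHTML, 50), MagicMatch(CBOR, 50)]

print(detect(problem1, extension_hint=CBOR))
print(detect(problem2, extension_hint=CBOR))
print(detect(problem2))
```

In this model, raising cbor to priority 50 keeps it in the candidate list, so 
a working extension hint can select it; without a hint, xhtml is still 
returned because it is read first.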

> Add CBOR "magic header" to CommonCrawlDataDumper output
> -------------------------------------------------------
>
>                 Key: NUTCH-1997
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1997
>             Project: Nutch
>          Issue Type: Improvement
>          Components: tool
>            Reporter: Giuseppe Totaro
>            Priority: Minor
>         Attachments: NUTCH-1997.patch
>
>
> For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} 
> wraps a single string value, representing the JSON text, into CBOR. 
> For instance, using the Unix {{hexdump}} tool, we can see that, as expected, 
> the first byte of all files is "0x7A" (the first three bits are "011", that 
> is the major type for text strings, and the following 5 bits are "11010", 
> meaning a uint32_t encodes the length of the following text), and the 
> following 4 bytes (an unsigned 32-bit integer) encode the correct length of 
> that text (as described in [RFC7049|http://tools.ietf.org/html/rfc7049]). 
> Therefore, no CBOR tag is currently included in the file (a list of CBOR 
> tags is available 
> [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).
> In order to add support for CBOR detection using Apache Tika (as described in 
> [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be 
> great if the {{CommonCrawlDataDumper}} tool were able to add the 
> self-describing CBOR "magic header" ([Tag 
> 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded 
> output files. 
> Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] 
> for supporting me on this work.
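
The byte layout described in the issue can be sketched as follows. This is a 
minimal illustration, not the CommonCrawlDataDumper code: the function names 
are hypothetical, and always using a 4-byte length is an assumption based on 
the hexdump observation above. The three tag bytes 0xD9 0xD9 0xF7 are the 
serialization of tag 55799 given in RFC 7049, section 2.4.5.

```python
import struct

# Self-describe CBOR tag 55799 (RFC 7049, section 2.4.5).
SELF_DESCRIBE_TAG = bytes([0xD9, 0xD9, 0xF7])


def encode_text_uint32(text: str) -> bytes:
    """Wrap a string as a CBOR text string with a 4-byte (uint32) length."""
    payload = text.encode("utf-8")
    # Major type 3 (text string, bits 011) + additional info 26 (bits 11010).
    initial = (3 << 5) | 26  # 0x7A
    return bytes([initial]) + struct.pack(">I", len(payload)) + payload


def describe_initial_byte(b: int) -> tuple:
    """Split an initial byte into (major type, additional information)."""
    return b >> 5, b & 0x1F


def add_magic_header(cbor: bytes) -> bytes:
    """Prepend the self-describe tag unless it is already present."""
    if cbor.startswith(SELF_DESCRIBE_TAG):
        return cbor
    return SELF_DESCRIBE_TAG + cbor


encoded = encode_text_uint32('{"a":1}')
tagged = add_magic_header(encoded)
print(encoded[:5].hex())
print(tagged[:8].hex())
```

With the tag prepended, a detector only needs to look at the first three 
bytes of the file instead of guessing from the string header.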



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
