[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510382#comment-14510382
 ] 

Luke sh edited comment on TIKA-1610 at 4/24/15 2:43 AM:
--------------------------------------------------------

Notes:
The attached cbor file (i.e. NUTCH-1997.cbor) contains the magic bytes for 
both type xhtml and type cbor. With a magic priority of 40 on 
application/cbor, this leads to the following issues.

Problem 1: magic priority 40.
application/xhtml+xml has a higher priority (50) than application/cbor (40) 
[I don't know who assigned 40 to cbor, or why]. So if xhtml is read and 
compared first, cbor is not even placed in the magic estimation list, because 
its priority is lower. The tests confirm that xhtml is indeed read and 
compared against the input file first, so any type with a priority below 50 
is disregarded.

Problem 2: magic priority 50.
Given a file dumped by the Nutch dumper tool, Tika selects both types (xhtml 
and cbor) as candidate mime types and puts them in the magic estimation list. 
Since xhtml is read first, it is placed above cbor. To break the tie, Tika 
relies on the result of the extension method. If the extension method fails 
to detect the type (ignoring the metadata hint method for simplicity, 
although the same applies to it), xhtml is eventually returned.

Pull request to be sent: I am going to set the magic priority of the cbor 
type to 50, the same as xhtml, because it would probably be risky to discard 
any of the estimated types without consulting the extension method.
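
The two problems above can be illustrated with a small toy model. This is only a sketch of the behaviour as described in this comment, not Tika's actual detection code: the class MagicPrioritySketch, its Match type, and the candidates/resolve methods are hypothetical names made up for illustration. It assumes that only the matches sharing the highest priority survive, that read order is preserved, and that an extension-based hint breaks the tie when it names one of the surviving candidates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the priority behaviour described above; NOT Tika's real implementation. */
final class MagicPrioritySketch {

    /** A magic rule that matched the input: a media type plus its configured priority. */
    static final class Match {
        final String type;
        final int priority;

        Match(String type, int priority) {
            this.type = type;
            this.priority = priority;
        }
    }

    /** Keeps only the matches that share the highest priority, in read order. */
    static List<String> candidates(List<Match> matchesInReadOrder) {
        int best = Integer.MIN_VALUE;
        for (Match m : matchesInReadOrder) {
            best = Math.max(best, m.priority);
        }
        List<String> kept = new ArrayList<>();
        for (Match m : matchesInReadOrder) {
            if (m.priority == best) {
                kept.add(m.type);
            }
        }
        return kept;
    }

    /** Uses the extension hint if it names a candidate; otherwise falls back to the first candidate. */
    static String resolve(List<String> candidates, String extensionHint) {
        if (candidates.isEmpty()) {
            return "application/octet-stream";
        }
        if (extensionHint != null && candidates.contains(extensionHint)) {
            return extensionHint;
        }
        return candidates.get(0);
    }

    public static void main(String[] args) {
        // Problem 1: cbor at priority 40 never reaches the candidate list.
        List<Match> cborAt40 = Arrays.asList(
                new Match("application/xhtml+xml", 50),
                new Match("application/cbor", 40));
        System.out.println(candidates(cborAt40));

        // Problem 2: cbor at priority 50 ties with xhtml; the hint decides.
        List<Match> cborAt50 = Arrays.asList(
                new Match("application/xhtml+xml", 50),
                new Match("application/cbor", 50));
        System.out.println(resolve(candidates(cborAt50), "application/cbor"));
        // A dumped file named *.html gives no useful hint, so xhtml wins.
        System.out.println(resolve(candidates(cborAt50), "text/html"));
    }
}
```

In this model, raising cbor to priority 50 keeps it in the candidate list, but a file dumped with an html extension would still resolve to xhtml unless some other hint points at cbor.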



> CBOR Parser and detection [improvement]
> ---------------------------------------
>
>                 Key: TIKA-1610
>                 URL: https://issues.apache.org/jira/browse/TIKA-1610
>             Project: Tika
>          Issue Type: New Feature
>          Components: detector, mime, parser
>    Affects Versions: 1.7
>            Reporter: Luke sh
>            Assignee: Chris A. Mattmann
>            Priority: Trivial
>              Labels: memex
>         Attachments: 1424402690000.html, NUTCH-1997.cbor, 
> cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg
>
>
> CBOR is a data format whose design goals include the possibility of extremely 
> small code size, fairly small message size, and extensibility without the 
> need for version negotiation (cited from http://cbor.io/ ).
> It would be great if Tika could provide a CBOR parser and CBOR type 
> identification. In the current project with Nutch, the Nutch 
> CommonCrawlDataDumper is used to dump the crawled segments to files in CBOR 
> format, and Tika support would make it possible to read/parse those dumped 
> files. However, CommonCrawlDataDumper does not dump with the correct 
> extension; it follows its own rule, and the default extension of the dumped 
> file is html. It would therefore be less painful if Tika could detect and 
> parse those files without any pre-processing steps. 
> CommonCrawlDataDumper uses the following classes to dump CBOR:
> import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
> import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
> fasterxml is a third-party library for converting JSON to CBOR and vice versa.
> According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
> CBOR does not yet have magic numbers by which other applications can 
> detect/identify it (PFA: rfc_cbor.jpg).
> It seems that the only ways to inform other applications of the type as of 
> now are the extension (i.e. .cbor) or possibly content detection (e.g. byte 
> histogram distribution estimation).  
> Another thing worth noting: it looks like Tika has already attempted to 
> support cbor mime detection in tika-mimetypes.xml 
> (PFA: cbor_tika.mimetypes.xml.jpg), but this detection does not work with 
> the cbor file dumped by CommonCrawlDataDumper. 
> According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
> self-describing tag 55799 that seems to be intended for cbor type 
> identification (the hex code might be 0xd9d9f7), but it is probably up to 
> the application to handle this tag, and it is also possible that the 
> fasterxml library used by the Nutch dumping tool does not emit this tag. An 
> example cbor file dumped by the Nutch tool, i.e. CommonCrawlDataDumper, has 
> also been attached (PFA: 1424402690000.html).
> The following info is cited from the rfc, "...a decoder might be able to 
> parse both CBOR and JSON.
>    Such a decoder would need to mechanically distinguish the two
>    formats.  An easy way for an encoder to help the decoder would be to
>    tag the entire CBOR item with tag 55799, the serialization of which
>    will never be found at the beginning of a JSON text..."
> It looks like a file can have two parts/sections, i.e. the plain text 
> parts and the json prettified by cbor; this may also be worth considering 
> for parsing and type identification.
> On the other hand, it is worth noting that an entry for cbor extension 
> detection also needs to be added to tika-mimetypes.xml, 
> e.g.
> <glob pattern="*.cbor"/>
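
The self-describe tag mentioned in the description above (tag 55799, serialized as the bytes 0xd9 0xd9 0xf7 per RFC 7049 section 2.4.5) can be checked with a simple prefix comparison. The following is a minimal sketch, assuming the tag is present at the very start of the data; as noted in the description, the files dumped by the Nutch tool may not carry it. The class and method names are hypothetical and are not part of Tika or Jackson.

```java
/** Minimal sketch: checks for the CBOR self-describe tag (55799) at the start of a buffer. */
final class CborSelfDescribeSketch {

    // Serialization of tag 55799 as given in RFC 7049, section 2.4.5.
    private static final byte[] SELF_DESCRIBE_TAG = {
        (byte) 0xD9, (byte) 0xD9, (byte) 0xF7
    };

    /** Returns true only if the data begins with the three tag bytes. */
    static boolean startsWithSelfDescribeTag(byte[] data) {
        if (data == null || data.length < SELF_DESCRIBE_TAG.length) {
            return false;
        }
        for (int i = 0; i < SELF_DESCRIBE_TAG.length; i++) {
            if (data[i] != SELF_DESCRIBE_TAG[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Tagged CBOR: self-describe tag followed by an empty map (0xa0).
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, (byte) 0xA0};
        // JSON text never begins with these bytes.
        byte[] json = "{}".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(startsWithSelfDescribeTag(tagged));
        System.out.println(startsWithSelfDescribeTag(json));
    }
}
```

A check like this can only confirm CBOR when the encoder chose to write the tag; its absence says nothing, so the extension or other hints would still be needed for untagged files.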



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
