[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506414#comment-14506414 ]
Hudson commented on TIKA-1610:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #640 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/640/])
WIP Fix for TIKA-1610: Support MIME extension for CBOR files contributed by LukeLiush <hanson311...@gmail.com> this closes #42 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1675250)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> CBOR Parser and detection [improvement]
> ---------------------------------------
>
>                 Key: TIKA-1610
>                 URL: https://issues.apache.org/jira/browse/TIKA-1610
>             Project: Tika
>          Issue Type: New Feature
>          Components: detector, mime, parser
>    Affects Versions: 1.7
>            Reporter: Luke sh
>            Assignee: Chris A. Mattmann
>            Priority: Trivial
>              Labels: memex
>         Attachments: 1424402690000.html, cbor_tika.mimetypes.xml.jpg,
> rfc_cbor.jpg
>
>
> CBOR is a data format whose design goals include the possibility of extremely
> small code size, fairly small message size, and extensibility without the
> need for version negotiation (cited from http://cbor.io/ ).
> It would be great if Tika could support CBOR parsing and identification.
> In the current project with Nutch, the Nutch CommonCrawlDataDumper is used
> to dump the crawled segments to files in CBOR format. To read and parse
> those dumped files, it would be great if Tika could parse CBOR. The problem
> is that CommonCrawlDataDumper does not dump with the correct extension: it
> follows its own rule, and the default extension of a dumped file is html.
> It would therefore be less painful if Tika could detect and parse those
> files without any pre-processing steps.
> CommonCrawlDataDumper uses the following to dump CBOR:
> import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
> import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
> fasterxml is a third-party library for converting JSON to CBOR and vice
> versa.
> According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like
> CBOR does not yet have a magic number by which other applications can
> detect/identify it (PFA: rfc_cbor.jpg).
> It seems that the only way to inform other applications of the type as of now
> is the extension (i.e. .cbor), or possibly content detection (i.e. byte
> histogram distribution estimation).
> One more thing is worth attention: it looks like Tika has already attempted
> to add CBOR MIME detection in tika-mimetypes.xml
> (PFA: cbor_tika.mimetypes.xml.jpg). This detection does not work with the
> CBOR file dumped by CommonCrawlDataDumper.
> According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a
> self-describing Tag 55799 that seems to be used for CBOR type
> identification (the hex code might be 0xd9d9f7), but it is probably up to
> the application to take care of this tag, and it is also possible that
> fasterxml, which the Nutch dumping tool uses, does not emit this tag. An
> example CBOR file dumped by the Nutch tool, i.e. CommonCrawlDataDumper, has
> also been attached (PFA: 1424402690000.html).
> The following is cited from the RFC: "...a decoder might be able to
> parse both CBOR and JSON.
> Such a decoder would need to mechanically distinguish the two
> formats. An easy way for an encoder to help the decoder would be to
> tag the entire CBOR item with tag 55799, the serialization of which
> will never be found at the beginning of a JSON text..."
> It looks like a file can have two parts/sections, i.e. the plain-text
> part and the JSON prettified by cbor; this might also be worth attention
> and consideration for parsing and type identification.
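
The tag check described above can be sketched in plain Java. This is only an illustration, not Tika's or Jackson's code: the class name CborSelfDescribeCheck and both method names are hypothetical. It relies on the fact that tag 55799 (0xD9F7) is serialized as major type 6 with a two-byte argument, which gives the three bytes 0xD9 0xD9 0xF7 mentioned in the description.

```java
// Hypothetical helper for illustration only; not part of Tika or Jackson.
final class CborSelfDescribeCheck {

    // RFC 7049 section 2.4.5: tag 55799 (0xD9F7) is encoded as
    // 0xD9 (major type 6, two-byte argument) followed by 0xD9 0xF7.
    private static final byte[] SELF_DESCRIBE_PREFIX = {
        (byte) 0xD9, (byte) 0xD9, (byte) 0xF7
    };

    private CborSelfDescribeCheck() {
    }

    // Returns true if the data starts with the self-describe tag bytes.
    // A false result does NOT mean the data is not CBOR: the tag is optional.
    static boolean hasSelfDescribeTag(byte[] data) {
        if (data == null || data.length < SELF_DESCRIBE_PREFIX.length) {
            return false;
        }
        for (int i = 0; i < SELF_DESCRIBE_PREFIX.length; i++) {
            if (data[i] != SELF_DESCRIBE_PREFIX[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns a copy of an already-encoded CBOR item with the tag prepended.
    // Assumes the caller passes exactly one complete CBOR data item.
    static byte[] withSelfDescribeTag(byte[] encodedItem) {
        byte[] out = new byte[SELF_DESCRIBE_PREFIX.length + encodedItem.length];
        System.arraycopy(SELF_DESCRIBE_PREFIX, 0, out, 0, SELF_DESCRIBE_PREFIX.length);
        System.arraycopy(encodedItem, 0, out, SELF_DESCRIBE_PREFIX.length, encodedItem.length);
        return out;
    }
}
```

Because the tag is optional, a check like this can only confirm CBOR when the encoder chose to emit the tag. Files written without it, as the attached dump may be, would still need the extension or some other content heuristic.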
> On the other hand, it is worth noting that entries for CBOR extension
> detection need to be added to tika-mimetypes.xml too, e.g.
> <glob pattern="*.cbor"/>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)