[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Hudson (JIRA)


[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510426#comment-14510426 ]

Hudson commented on TIKA-1610:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #644 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/644/])
TIKA-1610 Bump the CBOR mime magic priority to 60, to be more specific than 
(x)html, which is what CBOR often contains, and add a detection unit test 
(nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1675755)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/NUTCH-1997.cbor


 CBOR Parser and detection [improvement]
 ---

 Key: TIKA-1610
 URL: https://issues.apache.org/jira/browse/TIKA-1610
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime, parser
Affects Versions: 1.7
Reporter: Luke sh
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: memex
 Attachments: 142440269.html, NUTCH-1997.cbor, 
 cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg


 CBOR is a data format whose design goals include the possibility of extremely 
 small code size, fairly small message size, and extensibility without the 
 need for version negotiation (cited from http://cbor.io/ ).
 It would be great if Tika could provide a CBOR parser and CBOR type 
 identification. In the current project with Nutch, the Nutch 
 CommonCrawlDataDumper is used to dump the crawled segments to files in CBOR 
 format. To read and parse the files dumped by this tool, it would help if 
 Tika supported parsing CBOR. The catch is that CommonCrawlDataDumper does 
 not dump with the correct extension: it follows its own rule, and the 
 default extension of the dumped files is html. It would therefore be less 
 painful if Tika could detect and parse those files without any 
 pre-processing steps. 
 CommonCrawlDataDumper uses the following classes to dump CBOR:
 import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
 import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
 fasterxml (Jackson) is a third-party library for converting JSON to CBOR and 
 vice versa.
 According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
 CBOR does not yet have magic numbers by which other applications can detect 
 or identify it (PFA: rfc_cbor.jpg).
 It seems that, as of now, the only way to inform other applications of the 
 type is the file extension (i.e. .cbor), or possibly content detection 
 (e.g. byte histogram distribution estimation).
 Another point worth noting: it looks like Tika has already attempted to add 
 CBOR mime detection in tika-mimetypes.xml (PFA: 
 cbor_tika.mimetypes.xml.jpg), but this detection does not work with the 
 CBOR files dumped by CommonCrawlDataDumper. 
 According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
 self-describing Tag 55799 that seems to be intended for CBOR type 
 identification (the hex code might be 0xd9d9f7), but it is probably up to 
 the application to handle this tag, and it is also possible that the 
 fasterxml library used by the Nutch dumping tool does not emit it. An 
 example CBOR file dumped by the Nutch tool (CommonCrawlDataDumper) has also 
 been attached (PFA: 142440269.html).
 The following is quoted from the RFC: "...a decoder might be able to 
 parse both CBOR and JSON. Such a decoder would need to mechanically 
 distinguish the two formats. An easy way for an encoder to help the decoder 
 would be to tag the entire CBOR item with tag 55799, the serialization of 
 which will never be found at the beginning of a JSON text..."
 It looks like a file can have two parts or sections, i.e. the plain-text 
 part and the JSON serialized as CBOR; this is also worth considering for 
 parsing and type identification.
 On the other hand, it is worth noting that an entry for CBOR extension 
 detection also needs to be added to tika-mimetypes.xml, e.g.
 <glob pattern="*.cbor"/>
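
The tag-based identification mentioned in the description can be sketched in a few lines of Java. This is only an illustration of the RFC 7049 section 2.4.5 idea, not code from Tika or Nutch: the class and method names (CborSelfDescribeCheck, selfDescribeTagBytes, startsWithSelfDescribeTag) are made up for this example. It derives the three bytes of tag 55799 from the CBOR head encoding (major type 6, additional information 25, then a 2-byte tag number) and checks whether a buffer starts with them.

```java
public class CborSelfDescribeCheck {

    /** Serializes the head of tag 55799: major type 6 (tag), additional info 25 (2-byte value follows). */
    static byte[] selfDescribeTagBytes() {
        int tag = 55799;                       // 0xD9F7
        return new byte[] {
            (byte) ((6 << 5) | 25),            // 0xD9: major type 6 with a 16-bit argument
            (byte) (tag >> 8),                 // 0xD9: high byte of the tag number
            (byte) (tag & 0xFF)                // 0xF7: low byte of the tag number
        };
    }

    /** Returns true if the data begins with the self-describe tag bytes 0xD9 0xD9 0xF7. */
    static boolean startsWithSelfDescribeTag(byte[] data) {
        byte[] magic = selfDescribeTagBytes();
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Tag 55799 followed by 0xA0 (an empty CBOR map).
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, (byte) 0xA0};
        // The same empty map without the tag: valid CBOR, but not self-describing.
        byte[] untagged = {(byte) 0xA0};
        System.out.println(startsWithSelfDescribeTag(tagged));
        System.out.println(startsWithSelfDescribeTag(untagged));
    }
}
```

As the description notes, an encoder is not required to emit this tag, so a check like this can only confirm CBOR when the tag is present; files written without it need some other detection rule.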



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Luke sh (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510402#comment-14510402 ]

Luke sh commented on TIKA-1610:
---

Thanks a lot [~gagravarr] for the prompt response.
I thought it would probably be risky to discard any of the estimated types
because of the magic priority (one being higher than the other); I wanted
Tika to rely on the extension when there is a tie to break.

For now, in this particular case, I also cannot think of any reason not to
use 60; maybe I am being too skeptical.

Thanks

[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Luke sh (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510382#comment-14510382 ]

Luke sh commented on TIKA-1610:
---

Notes:
The attached CBOR file contains the magic bytes for both xhtml and cbor.
With priority 40 on application/cbor, we have the following issues.

Problem 1: magic priority 40.
application/xhtml+xml has a higher priority (50) than application/cbor (40)
[I don't know who assigned 40 to cbor, or why]. So if xhtml gets read and
compared first, cbor will not even be placed in the magic estimation list
because of its lower priority. The tests confirm that xhtml does get read and
compared with the input file first, so any type with a priority below 50 is
disregarded.

Problem 2: magic priority 50.
In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and
cbor) are selected as candidate mime types and put in the magic estimation
list. Since the xhtml type gets read first, it is placed above cbor. To break
that tie, Tika relies on the decision from the extension method. If the
extension method fails to detect the type (for now, let's ignore the metadata
hint method for simplicity, but the same applies to it), then xhtml is
eventually returned.

My pull request, to be sent: I am going to set the magic priority of the cbor
type to 50, the same as xhtml, because it would probably be risky to discard
any of the estimated types without consulting the extension method.
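
The selection behaviour described in these notes can be modelled with a small, self-contained Java sketch. It is a deliberately simplified model of what this thread describes, not Tika's actual detection code: the class MagicPrioritySketch, the Match type and the detect method are hypothetical names, and the rule "keep only the highest-priority magic matches, then let the extension break a tie" is taken from the notes above rather than from the Tika sources. The priorities 40, 50 and 60 are the values discussed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MagicPrioritySketch {

    /** A magic rule that already matched the input; illustrative type, not a Tika class. */
    static final class Match {
        final String type;
        final int priority;

        Match(String type, int priority) {
            this.type = type;
            this.priority = priority;
        }
    }

    /**
     * Keeps only the matches with the highest priority. If several remain, the
     * type suggested by the file extension wins when it is one of them;
     * otherwise the first remaining match (in the order given) is returned.
     */
    static String detect(List<Match> matches, String typeFromExtension) {
        if (matches.isEmpty()) {
            return typeFromExtension;
        }
        List<Match> sorted = new ArrayList<>(matches);
        // List.sort is stable, so equal priorities keep their original order.
        sorted.sort((a, b) -> Integer.compare(b.priority, a.priority));
        int top = sorted.get(0).priority;
        List<String> candidates = new ArrayList<>();
        for (Match m : sorted) {
            if (m.priority == top) {
                candidates.add(m.type);
            }
        }
        if (candidates.size() > 1 && candidates.contains(typeFromExtension)) {
            return typeFromExtension;
        }
        return candidates.get(0);
    }

    /** Both magics match the dumped file; only the cbor priority varies. */
    static List<Match> matchesWithCborPriority(int cborPriority) {
        return Arrays.asList(
                new Match("application/xhtml+xml", 50),
                new Match("application/cbor", cborPriority));
    }

    public static void main(String[] args) {
        // The Nutch dumper writes CBOR with an html extension, so the extension hint does not say cbor.
        String hint = "application/xhtml+xml";
        for (int priority : new int[] {40, 50, 60}) {
            System.out.println(priority + " -> " + detect(matchesWithCborPriority(priority), hint));
        }
    }
}
```

In this model, priority 40 drops cbor before the extension is consulted, priority 50 leaves the outcome to the extension (which is html for the dumped files), and only a priority above 50 makes cbor win regardless of the extension.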

[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-23 Thread Nick Burch (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510394#comment-14510394 ]

Nick Burch commented on TIKA-1610:
--

Based on that, I think the CBOR mime magic needs to be higher than the (x)html
one, not lower and not the same. So, in r1675755, I've set it to 60 and added
detection unit tests. These tests failed before the bump from 40 to 60, so I
think we're in a better place now!

[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-21 Thread Chris A. Mattmann (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506359#comment-14506359 ]

Chris A. Mattmann commented on TIKA-1610:
-

Applied pull request #42, thanks [~Lukeliush]!

{noformat}
[chipotle:~/tmp/tika] mattmann% svn commit -m "WIP Fix for TIKA-1610: Support MIME extension for CBOR files contributed by LukeLiush hanson311...@gmail.com this closes #42" CHANGES.txt tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Sending        CHANGES.txt
Sending        tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Transmitting file data ..
Committed revision 1675250.
[chipotle:~/tmp/tika] mattmann%
{noformat}

Will look for improvements and the parser next, so will leave this open!

[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-21 Thread Hudson (JIRA)


[ https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506414#comment-14506414 ]

Hudson commented on TIKA-1610:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #640 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/640/])
WIP Fix for TIKA-1610: Support MIME extension for CBOR files contributed by 
LukeLiush hanson311...@gmail.com this closes #42 (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1675250)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

