[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output
[ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512589#comment-14512589 ]

Hudson commented on NUTCH-1997:
---

FAILURE: Integrated in Nutch-trunk #3089 (See [https://builds.apache.org/job/Nutch-trunk/3089/])
NUTCH-1997: Fix for Add CBOR magic header to CommonCrawlDataDumper output contributed by Giuseppe Totaro, and Luke Sh. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1676029)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java

> Add CBOR "magic header" to CommonCrawlDataDumper output
> ---
>
> Key: NUTCH-1997
> URL: https://issues.apache.org/jira/browse/NUTCH-1997
> Project: Nutch
> Issue Type: Improvement
> Components: tool
> Reporter: Giuseppe Totaro
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.10
>
> Attachments: NUTCH-1997.patch
>
>
> For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}}
> wraps a single string value, representing the JSON text, into CBOR.
> For instance, using the Unix {{hexdump}} tool, we can see that, as expected,
> the first byte of all files is "0x7A" (the first three bits are "011", that
> is the major type for text strings, and the following 5 bits are "11010",
> meaning a uint32_t encodes the length of the following text), and the
> following 4 bytes (a uint32_t) encode the correct length of the text (as
> described in [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no
> CBOR tag is currently included in the file (a list of CBOR tags is available
> [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).
> In order to add support for CBOR detection using Apache Tika (as described in
> [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be
> great if the {{CommonCrawlDataDumper}} tool could add the self-describing
> CBOR "magic header" ([Tag
> 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded
> output files.
> Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann]
> for supporting me on this work.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
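The byte layout discussed in the issue description can be sketched as follows. This is a minimal illustration in Python, not the code from NUTCH-1997.patch: the function names are hypothetical, and the encoder always uses the 4-byte length form described in the issue rather than the shortest possible one. The three-byte prefix is the encoding of tag 55799 from RFC 7049, section 2.4.5.

```python
import struct

# Self-describing CBOR tag 55799 (RFC 7049, section 2.4.5).
# Major type 6 (tag) with additional info 25 (uint16 follows) gives 0xD9,
# followed by 55799 == 0xD9F7 in big-endian order.
SELF_DESCRIBE_CBOR = bytes([0xD9, 0xD9, 0xF7])


def describe_initial_byte(b):
    """Split a CBOR initial byte into (major type, additional info)."""
    return b >> 5, b & 0x1F


def encode_text_uint32(text):
    """Encode a text string with a uint32 length (hypothetical helper).

    Major type 3 (text string) and additional info 26 (uint32 length
    follows) combine into the initial byte 0x7A.
    """
    payload = text.encode("utf-8")
    initial = (3 << 5) | 26
    return bytes([initial]) + struct.pack(">I", len(payload)) + payload


def with_magic_header(item):
    """Prefix an encoded CBOR data item with the self-describing tag."""
    return SELF_DESCRIBE_CBOR + item


if __name__ == "__main__":
    item = encode_text_uint32('{"url": "http://example.org/"}')
    print(with_magic_header(item)[:8].hex())
```

A file written this way starts with D9 D9 F7 followed by the original 0x7A string header, so the tag gives a detector a fixed byte sequence to match on.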
[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output
[ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510380#comment-14510380 ]

Luke sh commented on NUTCH-1997:

Notes:
The attached cbor file contains magic bytes for both the xhtml type and the cbor type. Depending on the magic priority of application/cbor, we will have one of the following issues.

Problem 1: magic priority 40.
application/xhtml+xml has a higher priority (50) than application/cbor (40); I don't know who assigned 40 to cbor, or why. So if xhtml is read and compared first, cbor will not even be placed in the magic estimation list, because it has the lower priority. The tests confirm that xhtml is indeed read and compared with the input file first, so any type with a priority below 50 is disregarded.

Problem 2: magic priority 50.
In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and cbor) are selected as candidate mime types and put in the magic estimation list. Since the xhtml type is read first, it is placed above cbor. To break that tie, Tika relies on the decision of the extension method. If the extension method fails to detect the type, then xhtml is eventually returned. (For simplicity this ignores the metadata hint method, but the same applies to it too.)

My pull request to be sent:
I am going to set the magic priority of the cbor type to 50, the same as xhtml, because it would probably be risky to discard any of the candidate types without consulting the extension method.
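The two problems above can be illustrated with a toy model of the selection logic as this comment describes it. This is only a sketch under that description, not Tika's actual detection code; the `pick_type` function and its arguments are hypothetical.

```python
def pick_type(candidates, extension_hint=None):
    """Toy model of the magic/extension tie-break described above.

    candidates: list of (mime_type, priority) pairs, in the order in
    which the magic rules matched the file.
    extension_hint: type suggested by the file extension, or None if the
    extension method could not detect anything.
    """
    if not candidates:
        return extension_hint
    # Only the matches with the highest priority stay in the estimation list.
    top = max(priority for _, priority in candidates)
    shortlist = [mime for mime, priority in candidates if priority == top]
    # A tie is broken by the extension method when it names a listed type.
    if extension_hint in shortlist:
        return extension_hint
    # Otherwise the type that was read first wins.
    return shortlist[0]


if __name__ == "__main__":
    xhtml, cbor = "application/xhtml+xml", "application/cbor"
    # Problem 1: cbor at priority 40 never reaches the estimation list.
    print(pick_type([(xhtml, 50), (cbor, 40)], extension_hint=cbor))
    # Problem 2: at priority 50 the extension method can break the tie.
    print(pick_type([(xhtml, 50), (cbor, 50)], extension_hint=cbor))
    print(pick_type([(xhtml, 50), (cbor, 50)]))
```

In this model, raising cbor to priority 50 keeps it in the shortlist, so a matching file extension can select it; without an extension hint xhtml is still returned.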
[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output
[ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508549#comment-14508549 ]

Giuseppe Totaro commented on NUTCH-1997:

Great. Thanks [~Lukeliush]. Please let me know if you need support with adding cbor detection to Tika.
Thanks a lot.
[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output
[ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508541#comment-14508541 ]

Luke sh commented on NUTCH-1997:

I am working on the update.
[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output
[ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508540#comment-14508540 ]

Giuseppe Totaro commented on NUTCH-1997:

Thanks [~Lukeliush]. Did you verify whether Tika is able to detect these files as cbor?
Thanks a lot.
[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output
[ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508522#comment-14508522 ]

Luke sh commented on NUTCH-1997:

Thanks a lot [~gostep], highly appreciated. This patch works too: I ran a quick test and was able to see that the magic tag is added at the beginning of the cbor file.
Thanks
Luke
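A quick check like the one described above could look roughly like this. It is a sketch, not the test that was actually run; `starts_with_cbor_magic` is a hypothetical helper that only compares the first three bytes against the encoding of tag 55799.

```python
import io

# Encoding of the self-describing CBOR tag 55799 (RFC 7049, section 2.4.5).
CBOR_MAGIC = b"\xd9\xd9\xf7"


def starts_with_cbor_magic(stream):
    """Return True if a binary stream begins with the CBOR magic header."""
    return stream.read(len(CBOR_MAGIC)) == CBOR_MAGIC


if __name__ == "__main__":
    # In-memory stand-ins for a dumped file with and without the header.
    tagged = io.BytesIO(CBOR_MAGIC + b"\x7a\x00\x00\x00\x02{}")
    untagged = io.BytesIO(b"\x7a\x00\x00\x00\x02{}")
    print(starts_with_cbor_magic(tagged), starts_with_cbor_magic(untagged))
```

For a real dump, the same function can be called on a file opened in binary mode, for example `open(path, "rb")`.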
[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output
[ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508055#comment-14508055 ]

Luke sh commented on NUTCH-1997:

Thanks a lot [~gostep]. Let me test it out and I will let you know the result.
Thanks