[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output

2015-04-25 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512589#comment-14512589
 ] 

Hudson commented on NUTCH-1997:
---

FAILURE: Integrated in Nutch-trunk #3089 (See 
[https://builds.apache.org/job/Nutch-trunk/3089/])
NUTCH-1997: Fix for Add CBOR magic header to CommonCrawlDataDumper output 
contributed by Giuseppe Totaro, and Luke Sh. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1676029)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java


> Add CBOR "magic header" to CommonCrawlDataDumper output
> ---
>
> Key: NUTCH-1997
> URL: https://issues.apache.org/jira/browse/NUTCH-1997
> Project: Nutch
> Issue Type: Improvement
> Components: tool
> Reporter: Giuseppe Totaro
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.10
>
> Attachments: NUTCH-1997.patch
>
>
> For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} 
> wraps a single string value, representing the JSON text, into CBOR. 
> For instance, using the Unix {{hexdump}} tool, we can see that, as expected, 
> the first byte of all files is "0x7A" (the first three bits are "011", that 
> is the major type for strings, and the following 5 bits are "11010", meaning 
> a uint32_t encodes the length of the following text), and the following 4 
> bytes (the uint32_t) encode the right length of the file (as described in 
> [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is 
> currently included in the file (a list of CBOR tags is available 
> [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).
> In order to add support for CBOR detection using Apache Tika (as described in 
> [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be 
> great if the {{CommonCrawlDataDumper}} tool were able to add the 
> self-describing CBOR "magic header" ([Tag 
> 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded 
> output files. 
> Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] 
> for supporting me on this work.
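
The byte layout in the description can be sketched in Python. This is only an
illustration of the encoding rules in RFC 7049, not the code from
NUTCH-1997.patch; the function names are hypothetical. It builds a text string
item with a 4-byte length (initial byte 0x7A) and prefixes it with the
three-byte encoding of tag 55799 (0xD9 0xD9 0xF7).

```python
import struct

# Tag 55799 ("self-describe CBOR", RFC 7049 section 2.4.5):
# 0xD9 = major type 6 (tag) with a 2-byte argument, 0xD9F7 = 55799.
SELF_DESCRIBE_TAG = bytes([0xD9, 0xD9, 0xF7])


def encode_text_uint32(text: str) -> bytes:
    """Encode text as a CBOR text string whose length is a 4-byte uint32.

    Hypothetical helper: it always uses the 4-byte length form, matching the
    hexdump described in the issue, rather than the shortest possible form.
    """
    payload = text.encode("utf-8")
    head = (3 << 5) | 26  # major type 3 (011), additional info 26 (11010)
    return bytes([head]) + struct.pack(">I", len(payload)) + payload


def decode_head(data: bytes):
    """Split the initial byte into (major type, additional info)."""
    return data[0] >> 5, data[0] & 0x1F


def with_magic_header(cbor_item: bytes) -> bytes:
    """Prefix an encoded CBOR item with the self-describe tag."""
    return SELF_DESCRIBE_TAG + cbor_item
```

With the tag in place, a detector only needs to compare the first three bytes
of a file against a fixed pattern, which is what makes the tag usable as a
magic header.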



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output

2015-04-23 Thread Luke sh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510380#comment-14510380
 ] 

Luke sh commented on NUTCH-1997:


Notes:
The attached cbor file contains magic bytes for both the xhtml type and the 
cbor type. With priority 40 on application/cbor, we will have the following 
issues.

Problem 1: magic priority 40.
application/xhtml+xml has a higher priority (50) than application/cbor (40) 
[I don't know who assigned 40 to cbor, or why]. So if xhtml gets read and 
compared first, cbor will not even be placed in the magic estimation list, 
because it has a lower priority. My tests confirm that xhtml does get read 
and compared with the input file first, so any type with a priority below 50 
is disregarded. 

Problem 2: magic priority again, this time with 50.
With priority 50 on cbor, given a file dumped by the Nutch dumper tool, Tika 
selects both types (xhtml and cbor) as candidate MIME types and puts them in 
the magic estimation list. Since the xhtml type gets read first, it is placed 
above cbor. To break that tie, Tika relies on the decision of the extension 
method. If the extension method fails to detect the type (for now, let's 
ignore the metadata hint method for simplicity, but the same applies to it 
too), then xhtml is eventually returned.

My pull request, still to be sent: I am going to set the magic priority of the 
cbor type to 50, the same as xhtml, because it would probably be risky to 
discard any of the estimated types without consulting the extension method.
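
The two problems above can be illustrated with a toy model in Python. This is
not Tika's code or API; it only mimics the behaviour described in this
comment, and the function name, the tuple layout, and the "first match wins"
fallback are assumptions made for the illustration.

```python
from typing import List, Optional, Tuple


def pick_type(candidates: List[Tuple[str, int]],
              extension_hint: Optional[str] = None) -> Optional[str]:
    """Toy model of the selection described above (hypothetical helper).

    candidates: (MIME type, magic priority) pairs, in the order in which
    the magics matched the input file.
    """
    if not candidates:
        return extension_hint
    # Only the types with the highest priority survive in the list.
    top = max(priority for _, priority in candidates)
    shortlist = [mime for mime, priority in candidates if priority == top]
    # A tie is broken by the extension hint when it names a surviving type.
    if extension_hint in shortlist:
        return extension_hint
    # Otherwise the type that was read first wins.
    return shortlist[0]


XHTML = "application/xhtml+xml"
CBOR = "application/cbor"

# Problem 1: cbor at priority 40 is dropped, whatever the extension says.
print(pick_type([(XHTML, 50), (CBOR, 40)], extension_hint=CBOR))
# Problem 2: at equal priority the extension hint can break the tie.
print(pick_type([(XHTML, 50), (CBOR, 50)], extension_hint=CBOR))
```

In this model, raising cbor to 50 keeps both candidates alive so that the
extension method gets a say, which is the motivation given for the planned
pull request.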



[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output

2015-04-22 Thread Giuseppe Totaro (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508549#comment-14508549
 ] 

Giuseppe Totaro commented on NUTCH-1997:


Great. Thanks [~Lukeliush]. Please let me know if you need support with 
adding cbor detection to Tika.
Thanks a lot.


[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output

2015-04-22 Thread Luke sh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508541#comment-14508541
 ] 

Luke sh commented on NUTCH-1997:


I am working on the update.


[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output

2015-04-22 Thread Giuseppe Totaro (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508540#comment-14508540
 ] 

Giuseppe Totaro commented on NUTCH-1997:


Thanks [~Lukeliush]. Did you verify whether Tika is able to detect these files 
as cbor?
Thanks a lot.


[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output

2015-04-22 Thread Luke sh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508522#comment-14508522
 ] 

Luke sh commented on NUTCH-1997:


Thanks a lot [~gostep], highly appreciated. This patch works too: I ran a quick 
test and was able to see that the magic tag is prepended at the beginning of 
the cbor file.

Thanks
Luke
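
A quick check of this kind could look like the following Python sketch. It is
not the test that was actually run; the helper names are hypothetical, and it
assumes the dumper output is an ordinary file on disk. It only looks at the
first three bytes, the encoding of tag 55799 from RFC 7049.

```python
# Three-byte encoding of CBOR tag 55799 (self-describe CBOR).
SELF_DESCRIBE_TAG = bytes([0xD9, 0xD9, 0xF7])


def has_magic_header(data: bytes) -> bool:
    """Return True if the bytes start with the self-describe CBOR tag."""
    return data[:3] == SELF_DESCRIBE_TAG


def file_has_magic_header(path: str) -> bool:
    """Read only the first three bytes of a file and check them."""
    with open(path, "rb") as handle:
        return has_magic_header(handle.read(3))
```

A file written before the patch would start with 0x7A and fail the check,
while a file written after it would start with 0xD9 0xD9 0xF7.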


[jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output

2015-04-22 Thread Luke sh (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508055#comment-14508055
 ] 

Luke sh commented on NUTCH-1997:


Thanks a lot [~gostep]. Let me test it out and I will let you know the result. 
Thanks

