Hi professor,

Please see the following results.
<match value="&lt;html xmlns=" type="string" offset="0:1024"/>
Result: "text/html"

<match value="&lt;html xmlns=" type="string" offset="0:6000"/>
Result: "application/xhtml+xml"
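
The difference between the two results above can be illustrated with a small sketch. It assumes that an offset range such as 0:6000 means the pattern may start at any offset between the two bounds; the class name, the helper, and the sample data are hypothetical, and this is not Tika's actual matching code.

```java
import java.nio.charset.StandardCharsets;

final class MagicRangeSketch {
    /**
     * Returns true if pattern occurs in data starting at some offset in
     * [start, end] (both inclusive). Assumption: this is what an offset
     * range like "0:6000" means for a string magic.
     */
    static boolean matchesInRange(byte[] data, byte[] pattern, int start, int end) {
        // Never start a comparison that would run past the end of the data.
        int lastStart = Math.min(end, data.length - pattern.length);
        for (int i = Math.max(start, 0); i <= lastStart; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] pattern = "<html xmlns=".getBytes(StandardCharsets.US_ASCII);

        // Hypothetical file: 3000 filler bytes, then an embedded XHTML root element.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 3000; i++) {
            sb.append('x');
        }
        sb.append("<html xmlns=\"http://www.w3.org/1999/xhtml\">");
        byte[] data = sb.toString().getBytes(StandardCharsets.US_ASCII);

        System.out.println(matchesInRange(data, pattern, 0, 1024)); // false
        System.out.println(matchesInRange(data, pattern, 0, 6000)); // true
    }
}
```

Under that assumption, a pattern that first appears between offsets 1024 and 6000 is missed by the narrower range and found by the wider one.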


Thanks
Luke

-----Original Message-----
From: Chris Mattmann [mailto:[email protected]] 
Sent: Wednesday, April 22, 2015 4:21 AM
To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U (3980-Affiliate)'; 
[email protected]
Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF 
Polar CyberInfrastructure DR Students'; [email protected]
Subject: Re: [memex-jpl] this week action from luke

Hi Luke,

Actually I just meant go into tika-mimetypes.xml and change the magic offsets 
for application/xhtml+xml and see if that works. The code you changed below is 
actually how many bytes Tika will first download to do MIME checking.

Cheers,
Chris

------------------------
Chris Mattmann
[email protected]




-----Original Message-----
From: Luke <[email protected]>
Date: Wednesday, April 22, 2015 at 2:25 AM
To: Chris Mattmann <[email protected]>, Chris Mattmann 
<[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'"
<[email protected]>, <[email protected]>
Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, "'Zimdars, Paul 
A (3980-Affiliate)'" <[email protected]>, NSF Polar 
CyberInfrastructure DR Students <[email protected]>,
<[email protected]>
Subject: RE: [memex-jpl] this week action from luke

>
>Hi professor,
>
>I just tried it with minLength set to 1024, I get the following 
>"text/plain"
>I am a bit surprised....
>
>BTW, the 6000 min length still gives "application/xhtml+xml"; with 
>anything below 1024 min length, I am seeing "text/plain". :)
>
>BTW, the min length I am referring/altering is as follows 
>MimeTypes.java
>    public int getMinLength() {
>        // This needs to be reasonably large to be able to correctly detect
>        // things like XML root elements after initial comment and DTDs
>        return 64 * 1024;
>    }
>
>
>Thanks
>Luke
>
>-----Original Message-----
>From: Chris Mattmann [mailto:[email protected]]
>Sent: Tuesday, April 21, 2015 7:48 PM
>To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
>(3980-Affiliate)'; [email protected]
>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>[email protected]
>Subject: Re: [memex-jpl] this week action from luke
>
>Thanks Luke.
>
>So I guess all I was asking was could you try it out. Thanks for the 
>lesson in the RFC.
>
>Cheers,
>Chris
>
>------------------------
>Chris Mattmann
>[email protected]
>
>
>
>
>-----Original Message-----
>From: Luke <[email protected]>
>Date: Wednesday, April 22, 2015 at 1:46 AM
>To: Chris Mattmann <[email protected]>, Chris Mattmann 
><[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'"
><[email protected]>, <[email protected]>
>Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, 
>"'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, NSF 
>Polar CyberInfrastructure DR Students 
><[email protected]>,
><[email protected]>
>Subject: RE: [memex-jpl] this week action from luke
>
>>Hi professor,
>>
>>
>>I think it depends heavily on the content being read by Tika. If the 
>>file contains a byte sequence that matches the magic of one or more 
>>MIME types defined in tika-mimetypes.xml, I guess Tika puts those 
>>types in its estimation list; note that the magic-byte detection 
>>approach can yield multiple candidate types. Tika also considers the 
>>decision made by the extension detection approach: if the type the 
>>extension suggests is the first one in the magic estimation list, then 
>>the first one will certainly be returned (the same applies to the 
>>metadata hint approach). Of course, Tika also prefers the type that is 
>>the most specialized.
>>
>>Let's get back to the following question; here is my guess.
>>[Prof]: Also what happens if you tweak the definition of XHTML to not 
>>scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>Let's consider an extreme case where we only scan 10 bytes or 1 byte: 
>>magic bytes will then inevitably detect nothing, and I think Tika 
>>will return something like "application/octet-stream", which is the 
>>most general type. As mentioned, Tika favours the most specialized 
>>type, and in this extreme case I believe almost every type is a 
>>subclass of "application/octet-stream", so whatever the extension 
>>approach returns is more specialized. Therefore the answer in this 
>>extreme case may be yes: I think it is very possible that the CBOR 
>>type detected by the extension approach takes over...
>>
>>My idea was and still is that if the CBOR self-describing tag 55799 is 
>>present in the CBOR file, then it can be used to detect the CBOR type.
>>Again, the CBOR type will probably be appended to the magic 
>>estimation list together with another one such as application/html; I 
>>guess the order in the list probably also matters, with the first one 
>>preferred over the next. The decision from the extension detection 
>>approach also plays a role in breaking the tie, e.g. if the extension 
>>detection method agrees on CBOR with one of the estimated types in the 
>>magic list, then CBOR will be returned (again, the same applies to the 
>>metadata hint method).
>>
>>I have not taken a closer look at a CBOR file that has the tag 55799, 
>>but I expect its hex to be something like 0xd9d9f7, i.e. the tag 
>>should be present in the header as a fixed sequence of bytes 
>>(https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is 
>>present in the file, or preferably in the header (within a reasonable 
>>range of bytes), I believe it can probably be used as the magic 
>>number for the CBOR type.
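
A minimal sketch of the check described above, assuming the tag sits at the very start of the file. RFC 7049 section 2.4.5 gives the serialization of tag 55799 as 0xd9d9f7; the class and method names here are hypothetical, and this is not Tika's detector code.

```java
import java.nio.charset.StandardCharsets;

final class CborSelfDescribeSketch {
    // Tag 55799 serializes to the three bytes 0xd9 0xd9 0xf7 (RFC 7049, section 2.4.5).
    private static final byte[] SELF_DESCRIBE_TAG = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    /** Hypothetical helper: true if data begins with the self-describe tag. */
    static boolean startsWithSelfDescribeTag(byte[] data) {
        if (data == null || data.length < SELF_DESCRIBE_TAG.length) {
            return false;
        }
        for (int i = 0; i < SELF_DESCRIBE_TAG.length; i++) {
            if (data[i] != SELF_DESCRIBE_TAG[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Tag 55799 followed by the one-character CBOR text string "a" (0x61 0x61).
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, 0x61, 0x61};
        byte[] json = "{\"a\":1}".getBytes(StandardCharsets.UTF_8);

        System.out.println(startsWithSelfDescribeTag(tagged)); // true
        System.out.println(startsWithSelfDescribeTag(json));   // false
    }
}
```

A file written without the tag would fail this check even if it is valid CBOR, so the check only helps when the encoder opts in.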
>>
>>
>>There is another thing I mentioned in the JIRA ticket I opened 
>>yesterday for the CBOR parser and detection: it is also possible 
>>that CBOR content is embedded inside a plain JSON file, and the way 
>>a decoder can distinguish them in that file is again by looking at 
>>tag 55799. This may rarely happen, but a robust parser might be 
>>able to take care of it; Tika might need to consider the FasterXML 
>>library used by the Nutch tool when developing the CBOR parser...
>>Again, let me cite the same paragraph from the RFC:
>>
>>" a decoder might be able to parse both CBOR and JSON.
>>   Such a decoder would need to mechanically distinguish the two
>>   formats.  An easy way for an encoder to help the decoder would be to
>>   tag the entire CBOR item with tag 55799, the serialization of which
>>   will never be found at the beginning of a JSON text."
>>
>>
>>Thanks
>>Luke
>>
>>
>>
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:[email protected]]
>>Sent: Tuesday, April 21, 2015 9:49 PM
>>To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>>Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); 
>>'NSF Polar CyberInfrastructure DR Students'; 
>>[email protected]
>>Subject: Re: [memex-jpl] this week action from luke
>>
>>Hi Luke,
>>
>>Can you post the below conversation to dev@tika and summarize it there.
>>Also what happens if you tweak the definition of XHTML to not scan 
>>until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>
>>Cheers,
>>Chris
>>
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398) NASA Jet 
>>Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: [email protected]
>>WWW:  http://sunset.usc.edu/~mattmann/
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Adjunct Associate Professor, Computer Science Department University of 
>>Southern California, Los Angeles, CA 90089 USA
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>
>>
>>
>>
>>-----Original Message-----
>>From: Luke <[email protected]>
>>Date: Wednesday, April 22, 2015 at 12:19 AM
>>To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe U 
>>(3980-Affiliate)" <[email protected]>, Chris Mattmann 
>><[email protected]>
>>Cc: "Bryant, Ann C (398G-Affiliate)" <[email protected]>, 
>>"Zimdars, Paul A (3980-Affiliate)" <[email protected]>, NSF 
>>Polar CyberInfrastructure DR Students 
>><[email protected]>,
>>"[email protected]" <[email protected]>
>>Subject: RE: [memex-jpl] this week action from luke
>>
>>>Hi Professor,
>>>Please see attached jpg for the difference.
>>>Thanks
>>>Luke
>>>
>>>-----Original Message-----
>>>From: Chris Mattmann [mailto:[email protected]]
>>>Sent: Tuesday, April 21, 2015 5:27 PM
>>>To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>[email protected]
>>>Subject: Re: [memex-jpl] this week action from luke
>>>
>>>Hey Luke what happens if you do java -jar /path/to/tika-app -m 
>>>/path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app -m 
>>>< /path/to/cbor/file.cbor any difference?
>>>
>>>------------------------
>>>Chris Mattmann
>>>[email protected]
>>>
>>>
>>>
>>>
>>>-----Original Message-----
>>>From: Luke <[email protected]>
>>>Date: Tuesday, April 21, 2015 at 5:41 PM
>>>To: 'Luke' <[email protected]>, Chris Mattmann 
>>><[email protected]>, 'Giuseppe Totaro' 
>>><[email protected]>, Chris Mattmann 
>>><[email protected]>
>>>Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, 
>>>"'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, 
>>>NSF Polar CyberInfrastructure DR Students 
>>><[email protected]>,
>>><[email protected]>
>>>Subject: RE: [memex-jpl] this week action from luke
>>>
>>>>Hi professor,
>>>>I just sent a pull request for adding cbor extension.
>>>>The interesting thing is that tika is still identifying the file 
>>>>dumped by the nutch dump tool as a "application/xhtml+xml" even when 
>>>>I manually change the file extension to the correct one (i.e. *.cbor ).
>>>>
>>>>The reason is probably that Tika identifies "application/xhtml+xml"
>>>>by searching for "&lt;html" in the file content (PFA: 
>>>>xhtml+xml.jpg). Now if you take a look at the CBOR file dumped by 
>>>>Nutch, you see that we do have that element as part of the CBOR 
>>>>content, because the entire crawled XHTML document seems to be 
>>>>embedded in the CBOR JSON (PFA: cbor.jpg). Also, in Tika the magic 
>>>>detection seems to have higher priority than the glob detection, 
>>>>thus the type is being incorrectly detected.
>>>>
>>>>Therefore, I would like to mention that adding the entry 
>>>><glob pattern="*.cbor"/> does not resolve the issue as of now 
>>>>without some fixed magic bytes / patterns for CBOR.
>>>>I would also like to add that things will be different with our 
>>>>probabilistic mime detection selector, because if we know that the 
>>>>file extension is more reliable than magic bytes, then we can 
>>>>certainly add more preferential weight to the extension... this also 
>>>>might show that the current MimeTypes detection implementation is a 
>>>>bit stiff or less flexible in this scenario. :)
>>>>
>>>>
>>>>Thanks
>>>>Luke
>>>>
>>>>-----Original Message-----
>>>>From: Luke [mailto:[email protected]]
>>>>Sent: Tuesday, April 21, 2015 12:14 PM
>>>>To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>>'[email protected]'
>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>
>>>>Yes, let me add the cbor extension entry in tika xml, will send the 
>>>>pull request soon.
>>>>
>>>>Thanks
>>>>Luke
>>>>-----Original Message-----
>>>>From: Chris Mattmann [mailto:[email protected]]
>>>>Sent: Tuesday, April 21, 2015 6:51 AM
>>>>To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>>Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>>>(3980-Affiliate); NSF Polar CyberInfrastructure DR Students; 
>>>>[email protected]
>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>
>>>>Giuseppe both of these ideas supporting the CBOR WRITE_TYPE_HEADER 
>>>>and tag along with adding an -extension command would be fantastic.
>>>>Can you file both of those NUTCH issues, wait a day or so, and then 
>>>>based on feedback use your new Nutch commit karma to get those into 
>>>>Nutch?
>>>>
>>>>And then when creating the issues, can you link to the TIKA-1610 issue?
>>>>At that point, when those two to be defined NUTCH issues are up, 
>>>>Luke, in parallel can you throw up a pull request/patch in Tika for 
>>>>the extension along with the MIME detection?
>>>>
>>>>Cheers,
>>>>Chris
>>>>
>>>>------------------------
>>>>Chris Mattmann
>>>>[email protected]
>>>>
>>>>
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Giuseppe Totaro <[email protected]>
>>>>Date: Tuesday, April 21, 2015 at 12:33 PM
>>>>To: Chris Mattmann <[email protected]>
>>>>Cc: Luke <[email protected]>, Chris Mattmann 
>>>><[email protected]>, "Bryant, Ann C (398G-Affiliate)"
>>>><[email protected]>, "Zimdars, Paul A (3980-Affiliate)"
>>>><[email protected]>, NSF Polar CyberInfrastructure DR 
>>>>Students <[email protected]>,
>>>>"[email protected]"
>>>><[email protected]>
>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>
>>>>>Thanks Luke. Great work.
>>>>>Chris, we wrap a single string value, representing the JSON text, 
>>>>>for each file into CBOR (by using serializeCBORData method). For 
>>>>>instance, using the Unix hex dump tool, we can see that, as 
>>>>>expected, the first byte of all files is "0x7F" (the first three 
>>>>>bits are "011", that is the major type for strings, and the 
>>>>>following 5 bits are "11010", meaning a uint32_t encodes the length 
>>>>>of following text), and the following 4 bytes (single-precision
>>>>>float) encodes the right length of file (as described in RFC7049 
>>>>><http://tools.ietf.org/html/rfc7049>).
>>>>>Therefore, a CBOR tag is currently included into the file (a list 
>>>>>of cbor tags is available here 
>>>>><http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>I did not know about the CBOR "magic header". Thanks a lot Luke for 
>>>>>this great research. Chris, if you agree, I can add support for 
>>>>>prepending the self-describing CBOR tag 55799 to the 
>>>>>CommonCrawlDataDumper class. I believe it is very easy because I 
>>>>>only have to enable the WRITE_TYPE_HEADER feature of the 
>>>>>CBORGenerator class (the source code is available here 
>>>>><https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>Then, I can comment the TIKA-1610
>>>>><https://issues.apache.org/jira/browse/TIKA-1610> issue.
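
For reference, a small sketch of how a CBOR initial byte splits into its two fields under RFC 7049: the top three bits are the major type and the low five bits are the additional information. The class and method names are hypothetical. For a text string (major type 3), additional information 26 means a 4-byte unsigned length follows, and 31 marks an indefinite-length string.

```java
final class CborInitialByteSketch {
    /** Major type: the top three bits of the initial byte. */
    static int majorType(int initialByte) {
        return (initialByte & 0xFF) >>> 5;
    }

    /** Additional information: the low five bits of the initial byte. */
    static int additionalInfo(int initialByte) {
        return initialByte & 0x1F;
    }

    public static void main(String[] args) {
        // 0x7A = 011 11010: text string, 4-byte length follows.
        System.out.println(majorType(0x7A) + " " + additionalInfo(0x7A)); // 3 26
        // 0x7F = 011 11111: text string, indefinite length.
        System.out.println(majorType(0x7F) + " " + additionalInfo(0x7F)); // 3 31
        // 0xD9 = 110 11001: tag, 2-byte tag number follows (as for tag 55799).
        System.out.println(majorType(0xD9) + " " + additionalInfo(0xD9)); // 6 25
    }
}
```

Running the same split over the first byte of a dumped file shows directly which string encoding the writer used.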
>>>>>
>>>>>Regarding the file extension, in the Memex CCA format the original 
>>>>>file extension is used. We could add support for a -extension 
>>>>>command-line option allowing the user to give a file extension 
>>>>>(e.g.,
>>>>>cbor) for all files dumped out.
>>>>>
>>>>>Thanks a lot,
>>>>>Giuseppe
>>>>>
>>>>>
>>>>>
>>>>>On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980) 
>>>>><[email protected]> wrote:
>>>>>
>>>>>Thanks for this great research, Luke!
>>>>>
>>>>>Giuseppe, any idea why this tag doesn’t make it into the file?
>>>>>
>>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>Chris Mattmann, Ph.D.
>>>>>Chief Architect
>>>>>Instrument Software and Science Data Systems Section (398) NASA Jet 
>>>>>Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>Office: 168-519, Mailstop: 168-527
>>>>>Email: [email protected]
>>>>>WWW:  http://sunset.usc.edu/~mattmann/
>>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>Adjunct Associate Professor, Computer Science Department University 
>>>>>of Southern California, Los Angeles, CA 90089 USA
>>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>-----Original Message-----
>>>>>From: Luke <[email protected]>
>>>>>Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>>To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe U 
>>>>>(3980-Affiliate)" <[email protected]>, Chris Mattmann 
>>>>><[email protected]>, "Bryant, Ann C (398G-Affiliate)"
>>>>><[email protected]>, "Zimdars, Paul A (3980-Affiliate)"
>>>>><[email protected]>, NSF Polar CyberInfrastructure DR 
>>>>>Students <[email protected]>,
>>>>>"[email protected]"
>>>>><[email protected]>
>>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>>
>>>>>>Thanks professor.
>>>>>>Hi professor and all.
>>>>>>JIRA issue : CBOR Parser and detection improvement
>>>>>>https://issues.apache.org/jira/browse/TIKA-1610
>>>>>>
>>>>>>I did a bit of research on this CBOR detection.
>>>>>>
>>>>>>It looks like there is a self describing tag that needs to be 
>>>>>>written in the cbor file thru which other applications might be 
>>>>>>able to identify the cbor type....
>>>>>>Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>>
>>>>>>I don’t see that tag being present in the cbor file dumped by the 
>>>>>>nutch tool, I am not very sure though.
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Chris Mattmann [mailto:[email protected]]
>>>>>>Sent: Monday, April 20, 2015 4:10 AM
>>>>>>To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C 
>>>>>>(398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar 
>>>>>>CyberInfrastructure DR Students'; [email protected]
>>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>>
>>>>>>Nice one, Luke. If you have a second and you can open up an issue 
>>>>>>in Tika to make it support CBOR, then yes, by all means! :)
>>>>>>
>>>>>>
>>>>>>------------------------
>>>>>>Chris Mattmann
>>>>>>[email protected]
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Luke <[email protected]>
>>>>>>Date: Monday, April 20, 2015 at 4:15 AM
>>>>>>To: 'Giuseppe Totaro' <[email protected]>, Chris Mattmann 
>>>>>><[email protected]>, Chris Mattmann 
>>>>>><[email protected]>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>>><[email protected]>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>>><[email protected]>, NSF Polar CyberInfrastructure DR 
>>>>>>Students <[email protected]>,
>>>>>><[email protected]>
>>>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>>>
>>>>>>>Thanks a lot Giuseppe for the prompt response clearing up a bit 
>>>>>>>of my confusion with the Nutch CommonCrawlDataDumper , appreciated.
>>>>>>>
>>>>>>>BTW, it looks like Tika might need to consider supporting a 
>>>>>>>CBOR parser and detection.
>>>>>>>I checked the RFC; it looks like CBOR does not have magic numbers.
>>>>>>>PFA:
>>>>>>>rfc_cbor.jpg
>>>>>>>Actually, I don’t quite understand why the CommonCrawlDataDumper 
>>>>>>>is not dumping the nutch segments with the .cbor extension, which 
>>>>>>>seems to be helpful for type detection.
>>>>>>>
>>>>>>>To professor Mattmann,
>>>>>>>Tika does not support the detection of CBOR. Although the trunk 
>>>>>>>version has entries for CBOR in tika-mimetypes.xml (PFA: 
>>>>>>>cbor_tika.mimetypes.xml), those entries do not properly detect 
>>>>>>>the CBOR files dumped by CommonCrawlDataDumper. Also, CBOR does 
>>>>>>>not have magic bytes; off the top of my head, the only ways we can 
>>>>>>>detect it are the extension and the content byte histogram 
>>>>>>>(please note, this is a locally optimal and data-dependent 
>>>>>>>solution). :)
>>>>>>>
>>>>>>>I think I am bit deviating from the main route and discussion of 
>>>>>>>this thread…. i.e. the plan for testing the “probabilistic mime 
>>>>>>>detector selection” with polar data.
>>>>>>>Anyway, I plan to repackage tika by incorporating the 
>>>>>>>probabilistic selection feature and replace the tika jar in nutch 
>>>>>>>with the repackaged one, and then run the CommonCrawlDataDumper 
>>>>>>>and see how it goes. If you have any specific ideas and thought 
>>>>>>>with the testing, please kindly let me know.
>>>>>>>
>>>>>>>Thanks
>>>>>>>Luke
>>>>>>>
>>>>>>>From: Giuseppe Totaro [mailto:[email protected]]
>>>>>>>Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>To: Luke liu
>>>>>>>Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C 
>>>>>>>(398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF 
>>>>>>>Polar CyberInfrastructure DR Students; [email protected]
>>>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Hi Luke,
>>>>>>>
>>>>>>>
>>>>>>>my name is Giuseppe and I am a PhD student working under the 
>>>>>>>supervision of Prof. Chris Mattmann. I worked on 
>>>>>>>CommonCrawlDataDumper tool, so I can give some feedback on a 
>>>>>>>couple of your observations. My comments inline below.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Il giorno 19/apr/2015, alle ore 12:11, Luke liu 
>>>>>>><[email protected]> ha
>>>>>>>scritto:
>>>>>>>
>>>>>>>
>>>>>>>Thanks a lot professor; Sorry for the brief delay, I was spending 
>>>>>>>some time in understanding the code repo i.e.
>>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>
>>>>>>>From gen-common-crawl.sh, it looks like commoncrawldump is 
>>>>>>>dumping the crawl segments to json files with the human readable 
>>>>>>>and understandable content.
>>>>>>>1) I am trying to run one of the commands on my side as shown in 
>>>>>>>gen-common-crawl.sh, but the generated files all end with .html 
>>>>>>>or htm. The command listed in gen-common-crawl.sh seems to 
>>>>>>>allude to where the data is located on our 
>>>>>>>nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org/>; 
>>>>>>>although the locations are not 
>>>>>>>exactly correct (probably they need to be updated), part of the 
>>>>>>>patterns was able to allow me to locate some similar datasets (e.g.
>>>>>>>/data2/crawls/raw/CS572Spring2015 ) again I am seeing the dumped 
>>>>>>>files are all ending with html, but surprisingly inside those 
>>>>>>>outputted html files, the contents are present in json format;
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>The file extension is (almost) always the same as the original file.
>>>>>>>More in detail, using the -epochFilename command-line option (as 
>>>>>>>in gen-common-crawl.sh), the scraped data will be stored with a 
>>>>>>>filename of the format <epochtime(milliseconds)>.<filetype>, 
>>>>>>>where <filetype> is either the extension of the original file or 
>>>>>>>.html as default if the original file does not have an extension. 
>>>>>>>This schema is used for file naming and it does not depend on 
>>>>>>>internal output format (JSON).
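
The naming scheme described above can be sketched as follows. This is a hypothetical helper written only from the description in this message (epoch time in milliseconds, then the original extension, with html as the default); it is not the CommonCrawlDataDumper implementation.

```java
final class EpochFilenameSketch {
    /**
     * Builds "<epochMillis>.<filetype>", where filetype is the extension of
     * the original file name, or "html" if the name has no extension.
     */
    static String epochFilename(long epochMillis, String originalName) {
        // Look only at the last path segment so dots in directories are ignored.
        String base = originalName.substring(originalName.lastIndexOf('/') + 1);
        int dot = base.lastIndexOf('.');
        boolean hasExtension = dot >= 0 && dot < base.length() - 1;
        String fileType = hasExtension ? base.substring(dot + 1) : "html";
        return epochMillis + "." + fileType;
    }

    public static void main(String[] args) {
        System.out.println(epochFilename(1423894754000L, "report.pdf")); // 1423894754000.pdf
        System.out.println(epochFilename(1423894754000L, "index"));      // 1423894754000.html
    }
}
```

Because the extension comes from the crawled URL rather than from the CBOR container, the name alone gives a detector no hint that the payload is CBOR.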
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>2) Another problem is that the root object is being set with some 
>>>>>>>garbled chars in each of the outputted json files (with extension 
>>>>>>>html in the end), PFA: garbled.jpg and one of the outputted json 
>>>>>>>file has been also attached as an example too (PFA:
>>>>>>>1423894754000.html); the json files cannot be parsed properly by 
>>>>>>>aggregate.py due to those garbled chars.
>>>>>>>Even if I get rid of those garbled chars, there is no mimeTypes 
>>>>>>>element, which is what aggregate.py reads.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Text content and metadata extracted from the crawled binary data 
>>>>>>>are stored in a structured document format (JSON). Furthermore, 
>>>>>>>this document is encoded using CBOR <http://cbor.io/> 
>>>>>>>serialization. Each not human-readable character that you notice 
>>>>>>>in front and at the end of JSON data is due to CBOR-encoding.
>>>>>>>Thus, if you need to read JSON data from a document dumped out by 
>>>>>>>CommonCrawlDataDumper, you have to deserialize the CBOR-encoded 
>>>>>>>data structure inside the file.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>I hope this short overview can help in your work. I really 
>>>>>>>appreciate your feedback and, by the way, thanks a lot for your 
>>>>>>>great job in detection.
>>>>>>>
>>>>>>>I am available to provide all the support I can give, so do 
>>>>>>>not hesitate to contact me if you need any further information.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Thanks,
>>>>>>>
>>>>>>>Giuseppe
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Finally, after some research, I guess that the statistical 
>>>>>>>information (present in the readme of the code repo) is not being 
>>>>>>>collected and computed by aggregate.py from those output json 
>>>>>>>files but it looks like it is coming from the log.... see the 
>>>>>>>following as an example:
>>>>>>>
>>>>>>>2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - 
>>>>>>>CommonsCrawlDataDumper File Stats:
>>>>>>>TOTAL Stats:
>>>>>>>[
>>>>>>>   {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>   {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>   {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>   {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>   {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>   {"mimeType":"application/zip","count":"6"}
>>>>>>>   {"mimeType":"application/xml","count":"11"}
>>>>>>>   {"mimeType":"image/png","count":"110"}
>>>>>>>   {"mimeType":"image/jpeg","count":"70"}
>>>>>>>   {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>   {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>   {"mimeType":"video/mp4","count":"3"}
>>>>>>>   {"mimeType":"text/plain","count":"104"}
>>>>>>>   {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>   {"mimeType":"image/gif","count":"2"}
>>>>>>>   {"mimeType":"text/x-php","count":"1"}
>>>>>>>   {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>   {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>   {"mimeType":"text/html","count":"9506"}
>>>>>>>   {"mimeType":"application/pdf","count":"280"}
>>>>>>>]
>>>>>>>
>>>>>>>It turns out that aggregate.py is not the one that produces the 
>>>>>>>statistical information; I am not sure what it does. Anyway, I 
>>>>>>>think I understand the whole idea and I concur with it. Maybe 
>>>>>>>we can repackage Tika with the feature (i.e. probabilistic mime 
>>>>>>>selection) incorporated and see if it outputs the same information 
>>>>>>>in the log as the version without it.
>>>>>>>
>>>>>>>BTW, Regarding the use of the feature with probabilistic mime
>>>>>>>selection:
>>>>>>>in my pull request, I added a simple test case which might tell a 
>>>>>>>bit more about how the feature is called and used, it is simple 
>>>>>>>though.
>>>>>>>Here is an example snippet:
>>>>>>>    ProbabilisticMimeDetectionSelector probSel =
>>>>>>>        new ProbabilisticMimeDetectionSelector();
>>>>>>>    probSel.detect(input::InputStream, metadata::Metadata)
>>>>>>>It is similar to MimeTypes::detect(...) (more information on this 
>>>>>>>can be found in https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>>Now, in order to allow Tika().detect() to call
>>>>>>>ProbabilisticMimeDetectionSelector::detect(...) (as
>>>>>>>Tika().detect() is being called by commoncrawldump), we need to 
>>>>>>>modify/add some code in TikaConfig, which initializes a list 
>>>>>>>of default detectors: we need to remove the detector 
>>>>>>>mimeTypes::MimeTypes from the list and replace it with 
>>>>>>>probSel::ProbabilisticMimeDetectionSelector. (Not sure if I should 
>>>>>>>create another pull request with this change for TikaConfig.)
>>>>>>>
>>>>>>>I think that is all of my initial thought with some finding and 
>>>>>>>plan; if you have anything you would like to please add and 
>>>>>>>comment, please do kindly let me know, then I will start working 
>>>>>>>on my 'finale'. BTW, don’t worry, even after I am graduated, the 
>>>>>>>graduation is not my termination with tika and this project, 
>>>>>>>after then I still can and want to help this polar project and 
>>>>>>>tika as much as possible, and correct the programming faults and 
>>>>>>>bugs, respond to the tika issues ,etc.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Thanks
>>>>>>>Luke
>>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Chris Mattmann [mailto:[email protected]]
>>>>>>>Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C 
>>>>>>>(398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; 
>>>>>>>[email protected]
>>>>>>>Subject: Re: this week action from luke
>>>>>>>Importance: High
>>>>>>>
>>>>>>>Awesome Luke. I am going to work specifically on now benchmarking 
>>>>>>>your code in real situations. For example, it would be fantastic 
>>>>>>>to now run your Bayesian MIME detector over the whole NSF TREC 
>>>>>>>Dynamic Domain data for Polar described here:
>>>>>>>
>>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>
>>>>>>>Paul Zimdars, CC’ed, can provide you with access to the data, and 
>>>>>>>Annie can explain it, also CC’ed.
>>>>>>>
>>>>>>>Can we make that your goal for the next 2 weeks to actually test 
>>>>>>>it and produce a real result over the whole TREC-DD data for 
>>>>>>>Polar? My goal will be to get your code committed and integrated 
>>>>>>>into Tika.
>>>>>>>The more you can write me a guide of how to build and test your 
>>>>>>>code with Tika so I can get it committed the better.
>>>>>>>
>>>>>>>Also CC’ing the Memex list for context. Note everyone: Luke is 
>>>>>>>building a Bayesian MIME classifier to evaluate against Tika’s 
>>>>>>>existing MIME detection approach. If folks have any Memex needs 
>>>>>>>to try and test more accurate file identification with Tika, Luke 
>>>>>>>is the guy to talk to and I have him for 2 more weeks.
>>>>>>>
>>>>>>>Thanks!
>>>>>>>
>>>>>>>Cheers,
>>>>>>>Chris
>>>>>>>
>>>>>>>------------------------
>>>>>>>Chris Mattmann
>>>>>>>[email protected]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Luke liu <[email protected]>
>>>>>>>Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>To: Chris Mattmann <[email protected]>, Chris Mattmann 
>>>>>>><[email protected]>
>>>>>>>Cc: 'Luke' <[email protected]>
>>>>>>>Subject: this week action from luke
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Hi Professor Mattmann,
>>>>>>>
>>>>>>>I think I am in the final phase of the research, and last week I 
>>>>>>>finished the last item in the list, and hopefully everything will 
>>>>>>>be fine.
>>>>>>>
>>>>>>>For now, I can probably spend some time verifying or 
>>>>>>>optimizing the code; the majority of the research has been 
>>>>>>>done… and it would also be great if you could comment on my 
>>>>>>>work (the 2 pull requests) when you have time.
>>>>>>>
>>>>>>>If you do have confusion with any of my work, please also do let 
>>>>>>>me know.
>>>>>>>
>>>>>>>Thanks, and I am glad to be working with you. For the next couple 
>>>>>>>of weeks before graduation, I am going to continue revising and 
>>>>>>>testing the code and features to get rid of any flaws 
>>>>>>>when I have time.
>>>>>>>
>>>>>>>I am not sure if I have missed something; if I have missed 
>>>>>>>anything important, please let me know too.
>>>>>>>
>>>>>>>Thanks
>>>>>>>Luke
>>>>>>>
>>>>>>>
>>>>>>>--
>>>>>>>You received this message because you are subscribed to the 
>>>>>>>Google Groups "JPL-Kitware-Continuum Memex Group" group.
>>>>>>>To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>send an email to [email protected]
>>>>>>><mailto:memex-jpl%[email protected]>.
>>>>>>>To post to this group, send email to [email protected].
>>>>>>>Visit this group at http://groups.google.com/group/memex-jpl.
>>>>>>>To view this discussion on the web visit
>>>>>>>https://groups.google.com/d/msgid/memex-jpl/000f01d07ad4%24b3510070%2419f30150%24%40edu.
>>>>>>>For more options, visit https://groups.google.com/d/optout.
>>>>>>><garbled.jpg><1423894754000.html>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>

