Thanks, Luke. This is probably a good opportunity to test out your Bayesian MIME 
detector. How does it perform here?


> On Apr 22, 2015, at 3:29 PM, Luke <[email protected]> wrote:
> 
> Hi professor,
> 
> Please see the following results.
> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
> Result: "text/html"
> 
> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
> Result: "application/xhtml+xml"
> 
> 
> Thanks
> Luke
> 
> -----Original Message-----
> From: Chris Mattmann [mailto:[email protected]] 
> Sent: Wednesday, April 22, 2015 4:21 AM
> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U (3980-Affiliate)'; 
> [email protected]
> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 
> 'NSF Polar CyberInfrastructure DR Students'; [email protected]
> Subject: Re: [memex-jpl] this week action from luke
> 
> Hi Luke,
> 
> Actually I just meant go into tika-mimetypes.xml and change the magic offsets 
> for application/xhtml+xml and see if that works. The code you changed below 
> is actually how many bytes Tika will first download to do MIME checking.
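
The distinction drawn here, between the offset range of a magic rule and the number of bytes read for detection, can be sketched in plain Java. This is a toy matcher under stated assumptions, not Tika's implementation: the class name, the method, and the position 3000 of the pattern are all made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
final class OffsetRangeMagic {
    /**
     * Returns true if pattern starts at some offset in [from, to] within the
     * first `available` bytes of data (the bytes actually read for detection).
     */
    static boolean matches(byte[] data, int available, byte[] pattern, int from, int to) {
        int limit = Math.min(available, data.length);
        for (int start = from; start <= to && start + pattern.length <= limit; start++) {
            boolean hit = true;
            for (int i = 0; i < pattern.length; i++) {
                if (data[start + i] != pattern[i]) {
                    hit = false;
                    break;
                }
            }
            if (hit) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] pattern = "<html xmlns=".getBytes(StandardCharsets.US_ASCII);
        byte[] data = new byte[8192];
        // Assumed position of the pattern, chosen only for the demo.
        System.arraycopy(pattern, 0, data, 3000, pattern.length);
        System.out.println(matches(data, data.length, pattern, 0, 1024)); // offset range too short
        System.out.println(matches(data, data.length, pattern, 0, 6000)); // range covers the pattern
        System.out.println(matches(data, 1024, pattern, 0, 6000));        // too few bytes read
    }
}
```

In this sketch a rule can miss for two separate reasons: its offset range stops before the pattern, or the detector never read that far into the file.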
> 
> Cheers,
> Chris
> 
> ------------------------
> Chris Mattmann
> [email protected]
> 
> 
> 
> 
> -----Original Message-----
> From: Luke <[email protected]>
> Date: Wednesday, April 22, 2015 at 2:25 AM
> To: Chris Mattmann <[email protected]>, Chris Mattmann 
> <[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'"
> <[email protected]>, <[email protected]>
> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, "'Zimdars, 
> Paul A (3980-Affiliate)'" <[email protected]>, NSF Polar 
> CyberInfrastructure DR Students <[email protected]>,
> <[email protected]>
> Subject: RE: [memex-jpl] this week action from luke
> 
>> 
>> Hi professor,
>> 
>> I just tried it with minLength set to 1024 and got "text/plain", which 
>> surprised me a bit.
>> 
>> BTW, a min length of 6000 still gives "application/xhtml+xml"; with 
>> anything below 1024 I am seeing "text/plain". :)
>> 
>> BTW, the min length I am referring to and altering is the following, in 
>> MimeTypes.java:
>>    public int getMinLength() {
>>        // This needs to be reasonably large to be able to correctly detect
>>        // things like XML root elements after initial comment and DTDs
>>        return 64 * 1024;
>>    }
>> 
>> 
>> Thanks
>> Luke
>> 
>> -----Original Message-----
>> From: Chris Mattmann [mailto:[email protected]]
>> Sent: Tuesday, April 21, 2015 7:48 PM
>> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
>> (3980-Affiliate)'; [email protected]
>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>> [email protected]
>> Subject: Re: [memex-jpl] this week action from luke
>> 
>> Thanks Luke.
>> 
>> So I guess all I was asking was could you try it out. Thanks for the 
>> lesson in the RFC.
>> 
>> Cheers,
>> Chris
>> 
>> ------------------------
>> Chris Mattmann
>> [email protected]
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Luke <[email protected]>
>> Date: Wednesday, April 22, 2015 at 1:46 AM
>> To: Chris Mattmann <[email protected]>, Chris Mattmann 
>> <[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'"
>> <[email protected]>, <[email protected]>
>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, 
>> "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, NSF 
>> Polar CyberInfrastructure DR Students 
>> <[email protected]>,
>> <[email protected]>
>> Subject: RE: [memex-jpl] this week action from luke
>> 
>>> Hi professor,
>>> 
>>> 
>>> I think it depends heavily on the content Tika is reading. If the file 
>>> contains a byte sequence that matches the magic of one or more MIME 
>>> types defined in tika-mimetypes.xml, I guess Tika puts those types in 
>>> its estimation list; note that magic-byte detection can yield several 
>>> candidate types. Tika then also considers the result of extension 
>>> detection: if the extension points to the first type in the magic 
>>> estimation list, that first type is returned (the same applies to the 
>>> metadata hint approach). Tika also prefers the most specialized type.
>>> 
>>> Let's get back to the following question; this is only my guess.
>>> [Prof]: Also what happens if you tweak the definition of XHTML to not 
>>> scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>> Consider an extreme case where we scan only 1 or 10 bytes. Magic-byte 
>>> detection will then inevitably find nothing, and I think it will 
>>> return something like "application/octet-stream", the most general 
>>> type. As mentioned, Tika favours the most specialized type, and I 
>>> believe almost every type is a subclass of 
>>> "application/octet-stream", so whatever the extension approach 
>>> returns is more specialized. In this extreme case the answer may 
>>> therefore be yes: I think it is very possible that the CBOR type 
>>> detected by the extension approach takes over.
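
The selection behaviour guessed at in this message can be written down as a small toy model. It encodes only the rules as described in the email (the first magic hit is preferred, the extension breaks ties, a more specialized type wins); it is not Tika's actual `MimeTypes.detect` logic, and the class, method, and parent map are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Toy model of the rules described in the email, not Tika code.
final class ToyMimeSelector {
    static final String OCTET_STREAM = "application/octet-stream";

    /** True if type equals ancestor or reaches it by following parent links. */
    static boolean isSpecializationOf(String type, String ancestor, Map<String, String> parentOf) {
        for (String t = type; t != null; t = parentOf.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    /** Picks a type from the magic hits (in list order) and the extension-based type. */
    static String select(List<String> magicHits, String globType, Map<String, String> parentOf) {
        String best = magicHits.isEmpty() ? OCTET_STREAM : magicHits.get(0);
        if (globType == null) {
            return best;
        }
        if (magicHits.contains(globType)) {
            return globType; // extension agrees with one magic hit: it breaks the tie
        }
        if (isSpecializationOf(globType, best, parentOf)) {
            return globType; // extension type is more specialized than the magic result
        }
        return best; // otherwise the magic result wins
    }
}
```

With an assumed hierarchy in which `application/cbor` is a child of `application/octet-stream`, an XHTML magic hit beats the `.cbor` extension, while an empty magic list lets the extension take over, which is the extreme case discussed above.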
>>> 
>>> My idea was, and still is, that if the CBOR self-describing tag 55799 
>>> is present in the CBOR file, it can be used to detect the CBOR type.
>>> Again, the CBOR type will probably be appended to the magic estimation 
>>> list together with another one such as application/xhtml+xml. I guess 
>>> the order in the list also matters: the first entry is preferred over 
>>> the next. The decision from the extension detection approach also 
>>> plays a role in breaking the tie: if the extension detection method 
>>> agrees on CBOR with one of the estimated types in the magic list, then 
>>> CBOR will be returned (again, the same applies to the metadata hint 
>>> method).
>>> 
>>> I have not taken a closer look at a CBOR file that has the tag 55799, 
>>> but I expect its hex to be something like 0xd9d9f7, i.e. the tag 
>>> should be present in the header as a fixed sequence of bytes 
>>> (https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is 
>>> present in the file, preferably in the header (within a reasonable 
>>> range of bytes), I believe it can probably be used as the magic 
>>> number for the CBOR type.
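
A minimal sketch of the check proposed here, assuming the tag sits at the very start of the file. Tag 55799 is 0xD9F7, and a CBOR tag with a two-byte value is introduced by 0xD9, which gives the three bytes 0xD9 0xD9 0xF7 from RFC 7049 section 2.4.5. The class and method names are hypothetical and this is not a Tika detector.

```java
// Hypothetical helper illustrating the magic-byte check; not Tika code.
final class CborSelfDescribe {
    // 0xD9 = major type 6 (tag) with a 2-byte value, followed by 0xD9F7 = 55799.
    static final byte[] MAGIC = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    /** Returns true if the buffer begins with the self-describe CBOR tag. */
    static boolean startsWithSelfDescribeTag(byte[] header) {
        if (header.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A file that starts directly with a text string or with JSON text fails this check, so the approach only helps once the writer emits the tag.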
>>> 
>>> 
>>> There is another thing I mentioned in the JIRA ticket I opened 
>>> yesterday against the CBOR parser and detection: it is also possible 
>>> for CBOR content to be embedded inside a plain JSON file, and a 
>>> decoder can distinguish the two in that file by looking at the tag 
>>> 55799 again. This may rarely happen, but a robust parser might be 
>>> able to take care of it. Tika might need to consider using the 
>>> FasterXML library that the Nutch tool uses when developing the CBOR 
>>> parser.
>>> Again, let me cite the same paragraph from the RFC:
>>> 
>>> " a decoder might be able to parse both CBOR and JSON.
>>>  Such a decoder would need to mechanically distinguish the two
>>>  formats.  An easy way for an encoder to help the decoder would be to
>>>  tag the entire CBOR item with tag 55799, the serialization of which
>>>  will never be found at the beginning of a JSON text."
>>> 
>>> 
>>> Thanks
>>> Luke
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Mattmann, Chris A (3980) [mailto:[email protected]]
>>> Sent: Tuesday, April 21, 2015 9:49 PM
>>> To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>>> Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); 
>>> 'NSF Polar CyberInfrastructure DR Students'; 
>>> [email protected]
>>> Subject: Re: [memex-jpl] this week action from luke
>>> 
>>> Hi Luke,
>>> 
>>> Can you post the below conversation to dev@tika and summarize it there.
>>> Also what happens if you tweak the definition of XHTML to not scan 
>>> until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398) NASA Jet 
>>> Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: [email protected]
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department University of 
>>> Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Luke <[email protected]>
>>> Date: Wednesday, April 22, 2015 at 12:19 AM
>>> To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe U 
>>> (3980-Affiliate)" <[email protected]>, Chris Mattmann 
>>> <[email protected]>
>>> Cc: "Bryant, Ann C (398G-Affiliate)" <[email protected]>, 
>>> "Zimdars, Paul A (3980-Affiliate)" <[email protected]>, NSF 
>>> Polar CyberInfrastructure DR Students 
>>> <[email protected]>,
>>> "[email protected]" <[email protected]>
>>> Subject: RE: [memex-jpl] this week action from luke
>>> 
>>>> Hi Professor,
>>>> Please see attached jpg for the difference.
>>>> Thanks
>>>> Luke
>>>> 
>>>> -----Original Message-----
>>>> From: Chris Mattmann [mailto:[email protected]]
>>>> Sent: Tuesday, April 21, 2015 5:27 PM
>>>> To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>> [email protected]
>>>> Subject: Re: [memex-jpl] this week action from luke
>>>> 
>>>> Hey Luke, what happens if you do: java -jar /path/to/tika-app -m 
>>>> /path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app -m 
>>>> < /path/to/cbor/file.cbor? Any difference?
>>>> 
>>>> ------------------------
>>>> Chris Mattmann
>>>> [email protected]
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Luke <[email protected]>
>>>> Date: Tuesday, April 21, 2015 at 5:41 PM
>>>> To: 'Luke' <[email protected]>, Chris Mattmann 
>>>> <[email protected]>, 'Giuseppe Totaro' 
>>>> <[email protected]>, Chris Mattmann 
>>>> <[email protected]>
>>>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, 
>>>> "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, 
>>>> NSF Polar CyberInfrastructure DR Students 
>>>> <[email protected]>,
>>>> <[email protected]>
>>>> Subject: RE: [memex-jpl] this week action from luke
>>>> 
>>>>> Hi professor,
>>>>> I just sent a pull request for adding cbor extension.
>>>>> The interesting thing is that Tika still identifies the file 
>>>>> dumped by the Nutch dump tool as "application/xhtml+xml", even when 
>>>>> I manually change the file extension to the correct one (i.e. *.cbor).
>>>>> 
>>>>> The reason is probably that Tika identifies "application/xhtml+xml"
>>>>> by searching for "&lt;html" in the file content (PFA: 
>>>>> xhtml+xml.jpg). If you take a look at the CBOR file dumped by Nutch, 
>>>>> you see that we do have that element as part of the CBOR content, 
>>>>> because the entire crawled XHTML document seems to be embedded in 
>>>>> the CBOR JSON (PFA: cbor.jpg). Also, in Tika, magic detection seems 
>>>>> to have higher priority than glob detection, so the type is 
>>>>> detected incorrectly.
>>>>> 
>>>>> Therefore, I would like to mention that adding the entry 
>>>>> <glob pattern="*.cbor"/> does not resolve the issue for now, 
>>>>> without some fixed magic bytes / patterns for CBOR.
>>>>> I would also like to add that things will be different with our 
>>>>> probabilistic MIME detection selector: if we know that the file 
>>>>> extension is more reliable than the magic bytes, we can give more 
>>>>> weight to the extension. This might also show that the current 
>>>>> MimeTypes detection implementation is a bit stiff, or less 
>>>>> flexible, in this scenario. :)
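
The idea of trusting the extension more than the magic bytes can be illustrated with a plain weighted vote. This is a deliberate simplification and not the Bayesian computation in ProbabilisticMimeDetectionSelector; the class name, the detector labels, and the weights are assumptions made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified weighted vote; NOT the ProbabilisticMimeDetectionSelector algorithm.
final class WeightedVote {
    /**
     * Adds each detector's trust weight to the type it voted for and returns
     * the best-scoring type, or null if there are no votes.
     */
    static String pick(Map<String, String> voteByDetector, Map<String, Double> trustByDetector) {
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, String> vote : voteByDetector.entrySet()) {
            double weight = trustByDetector.getOrDefault(vote.getKey(), 0.0);
            score.merge(vote.getValue(), weight, Double::sum);
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : score.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```

If the "magic" detector votes for application/xhtml+xml and the "glob" detector votes for application/cbor, raising the glob weight above the magic weight flips the result to CBOR, which a fixed priority order cannot do.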
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Luke [mailto:[email protected]]
>>>>> Sent: Tuesday, April 21, 2015 12:14 PM
>>>>> To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>>> '[email protected]'
>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>> 
>>>>> Yes, let me add the cbor extension entry in tika xml, will send the 
>>>>> pull request soon.
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> -----Original Message-----
>>>>> From: Chris Mattmann [mailto:[email protected]]
>>>>> Sent: Tuesday, April 21, 2015 6:51 AM
>>>>> To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>>> Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>>>> (3980-Affiliate); NSF Polar CyberInfrastructure DR Students; 
>>>>> [email protected]
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>> Giuseppe, both of these ideas (supporting the CBOR WRITE_TYPE_HEADER 
>>>>> feature and tag, along with adding an -extension command) would be 
>>>>> fantastic.
>>>>> Can you file both of those NUTCH issues, wait a day or so, and then, 
>>>>> based on feedback, use your new Nutch commit karma to get those into 
>>>>> Nutch?
>>>>> 
>>>>> And then when creating the issues, can you link to the TIKA-1610 issue?
>>>>> At that point, when those two to-be-defined NUTCH issues are up, 
>>>>> Luke, can you in parallel throw up a pull request/patch in Tika for 
>>>>> the extension along with the MIME detection?
>>>>> 
>>>>> Cheers,
>>>>> Chris
>>>>> 
>>>>> ------------------------
>>>>> Chris Mattmann
>>>>> [email protected]
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Giuseppe Totaro <[email protected]>
>>>>> Date: Tuesday, April 21, 2015 at 12:33 PM
>>>>> To: Chris Mattmann <[email protected]>
>>>>> Cc: Luke <[email protected]>, Chris Mattmann 
>>>>> <[email protected]>, "Bryant, Ann C (398G-Affiliate)"
>>>>> <[email protected]>, "Zimdars, Paul A (3980-Affiliate)"
>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR 
>>>>> Students <[email protected]>,
>>>>> "[email protected]"
>>>>> <[email protected]>
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>>> Thanks Luke. Great work.
>>>>>> Chris, we wrap a single string value, representing the JSON text, 
>>>>>> for each file into CBOR (by using the serializeCBORData method). For 
>>>>>> instance, using the Unix hex dump tool, we can see that, as 
>>>>>> expected, the first byte of all files is "0x7F" (the first three 
>>>>>> bits are "011", that is the major type for strings, and the 
>>>>>> following 5 bits are "11010", meaning a uint32_t encodes the length 
>>>>>> of the following text), and the following 4 bytes (a 32-bit 
>>>>>> unsigned integer) encode the length of the file (as described in 
>>>>>> RFC7049 <http://tools.ietf.org/html/rfc7049>).
>>>>>> Therefore, no CBOR tag is currently included in the file (a list 
>>>>>> of CBOR tags is available here 
>>>>>> <http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>> I did not know about the CBOR "magic header". Thanks a lot, Luke, for 
>>>>>> this great research. Chris, if you agree, I can add support for 
>>>>>> prepending the self-describing CBOR tag 55799 to the 
>>>>>> CommonCrawlDataDumper class. I believe it is very easy because I 
>>>>>> only have to enable the WRITE_TYPE_HEADER feature of the 
>>>>>> CBORGenerator class (the source code is available here 
>>>>>> <https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>> Then, I can comment on the TIKA-1610
>>>>>> <https://issues.apache.org/jira/browse/TIKA-1610> issue.
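
The initial-byte layout described in this message can be checked with a few bit operations. The sketch below splits a CBOR initial byte into its major type (top 3 bits) and additional information (low 5 bits), and shows at the byte level what prepending tag 55799 amounts to. It does not call Jackson, and the class and method names are hypothetical. Note that the bit pattern quoted above ("011" followed by "11010") corresponds to the byte 0x7A, while 0x7F is major type 3 with additional information 31, which RFC 7049 uses for indefinite-length strings.

```java
// Hypothetical helper for inspecting CBOR heads; not Jackson or Nutch code.
final class CborHead {
    /** Major type: the top 3 bits of the initial byte. */
    static int majorType(int initialByte) {
        return (initialByte & 0xFF) >>> 5;
    }

    /** Additional information: the low 5 bits of the initial byte. */
    static int additionalInfo(int initialByte) {
        return initialByte & 0x1F;
    }

    /** Returns a copy of item with tag 55799 (bytes 0xD9 0xD9 0xF7) prepended. */
    static byte[] withSelfDescribeTag(byte[] item) {
        byte[] out = new byte[item.length + 3];
        out[0] = (byte) 0xD9;
        out[1] = (byte) 0xD9;
        out[2] = (byte) 0xF7;
        System.arraycopy(item, 0, out, 3, item.length);
        return out;
    }
}
```

For example, `CborHead.majorType(0x7A)` is 3 and `CborHead.additionalInfo(0x7A)` is 26, meaning a four-byte length follows.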
>>>>>> 
>>>>>> Regarding the file extension, in the Memex CCA format the original 
>>>>>> file extension is used. We could add support for a -extension 
>>>>>> command-line option allowing the user to give a file extension 
>>>>>> (e.g.,
>>>>>> cbor) for all files dumped out.
>>>>>> 
>>>>>> Thanks a lot,
>>>>>> Giuseppe
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980) 
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> Thanks for this great research, Luke!
>>>>>> 
>>>>>> Giuseppe, any idea why this tag doesn't make it into the file?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Luke <[email protected]>
>>>>>> Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>>> To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe U 
>>>>>> (3980-Affiliate)" <[email protected]>, Chris Mattmann 
>>>>>> <[email protected]>, "Bryant, Ann C (398G-Affiliate)"
>>>>>> <[email protected]>, "Zimdars, Paul A (3980-Affiliate)"
>>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR 
>>>>>> Students <[email protected]>,
>>>>>> "[email protected]"
>>>>>> <[email protected]>
>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>> 
>>>>>>> Thanks professor.
>>>>>>> Hi professor and all.
>>>>>>> JIRA issue : CBOR Parser and detection improvement
>>>>>>> https://issues.apache.org/jira/browse/TIKA-1610
>>>>>>> 
>>>>>>> I did a bit of research on CBOR detection.
>>>>>>> 
>>>>>>> It looks like there is a self-describing tag that needs to be 
>>>>>>> written in the CBOR file, through which other applications might be 
>>>>>>> able to identify the CBOR type.
>>>>>>> Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>>> 
>>>>>>> I don't see that tag in the CBOR file dumped by the Nutch tool, 
>>>>>>> though I am not very sure.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Luke
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Chris Mattmann [mailto:[email protected]]
>>>>>>> Sent: Monday, April 20, 2015 4:10 AM
>>>>>>> To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C 
>>>>>>> (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar 
>>>>>>> CyberInfrastructure DR Students'; [email protected]
>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>> Nice one, Luke. If you have a second and you can open up an issue 
>>>>>>> in Tika to make it support CBOR, then yes, by all means! :)
>>>>>>> 
>>>>>>> 
>>>>>>> ------------------------
>>>>>>> Chris Mattmann
>>>>>>> [email protected]
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Luke <[email protected]>
>>>>>>> Date: Monday, April 20, 2015 at 4:15 AM
>>>>>>> To: 'Giuseppe Totaro' <[email protected]>, Chris Mattmann 
>>>>>>> <[email protected]>, Chris Mattmann 
>>>>>>> <[email protected]>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>>>> <[email protected]>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR 
>>>>>>> Students <[email protected]>,
>>>>>>> <[email protected]>
>>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>>> Thanks a lot Giuseppe for the prompt response clearing up a bit 
>>>>>>>> of my confusion with the Nutch CommonCrawlDataDumper , appreciated.
>>>>>>>> 
>>>>>>>> BTW, it looks like Tika might need to consider supporting a CBOR 
>>>>>>>> parser and detection.
>>>>>>>> I checked the RFC; it looks like CBOR has no magic numbers.
>>>>>>>> PFA:
>>>>>>>> rfc_cbor.jpg
>>>>>>>> Actually, I don't quite understand why the CommonCrawlDataDumper 
>>>>>>>> is not dumping the Nutch segments with the .cbor extension, which 
>>>>>>>> would seem helpful for type detection.
>>>>>>>> 
>>>>>>>> To professor Mattmann,
>>>>>>>> Tika does not support the detection of CBOR. Although the trunk 
>>>>>>>> version has entries for CBOR in tika-mimetypes.xml (PFA: 
>>>>>>>> cbor_tika.mimetypes.xml), those entries do not properly detect 
>>>>>>>> the CBOR files dumped by CommonCrawlDataDumper. Also, CBOR does 
>>>>>>>> not have magic bytes; off the top of my head, the only ways we can 
>>>>>>>> detect it are the extension and a content byte histogram 
>>>>>>>> (please note, this is a locally optimal, data-dependent 
>>>>>>>> solution). :)
>>>>>>>> 
>>>>>>>> I think I am deviating a bit from the main discussion of this 
>>>>>>>> thread, i.e. the plan for testing the "probabilistic mime 
>>>>>>>> detector selection" with polar data.
>>>>>>>> Anyway, I plan to repackage Tika with the probabilistic 
>>>>>>>> selection feature, replace the Tika jar in Nutch with the 
>>>>>>>> repackaged one, and then run the CommonCrawlDataDumper to see how 
>>>>>>>> it goes. If you have any specific ideas or thoughts about the 
>>>>>>>> testing, please kindly let me know.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> From: Giuseppe Totaro [mailto:[email protected]]
>>>>>>>> Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>> To: Luke liu
>>>>>>>> Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF 
>>>>>>>> Polar CyberInfrastructure DR Students; [email protected]
>>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Luke,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> my name is Giuseppe and I am a PhD student working under the 
>>>>>>>> supervision of Prof. Chris Mattmann. I worked on 
>>>>>>>> CommonCrawlDataDumper tool, so I can give some feedback on a 
>>>>>>>> couple of your observations. My comments inline below.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Il giorno 19/apr/2015, alle ore 12:11, Luke liu 
>>>>>>>> <[email protected]> ha
>>>>>>>> scritto:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks a lot professor; Sorry for the brief delay, I was spending 
>>>>>>>> some time in understanding the code repo i.e.
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> From gen-common-crawl.sh, it looks like commoncrawldump is 
>>>>>>>> dumping the crawl segments to JSON files with human-readable 
>>>>>>>> content.
>>>>>>>> 1) I am trying to run one of the commands shown in 
>>>>>>>> gen-common-crawl.sh on my side, but the generated files all end 
>>>>>>>> with .html or .htm. The commands listed in gen-common-crawl.sh 
>>>>>>>> seem to allude to where the data is located on our 
>>>>>>>> nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org/>. 
>>>>>>>> Although the locations are not exactly correct (they probably 
>>>>>>>> need to be updated), part of the patterns allowed me to locate 
>>>>>>>> some similar datasets (e.g. /data2/crawls/raw/CS572Spring2015). 
>>>>>>>> Again, the dumped files all end with html, but surprisingly the 
>>>>>>>> contents inside those html files are in JSON format;
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The file extension is (almost) always the same as the original file.
>>>>>>>> In more detail, using the -epochFilename command-line option (as 
>>>>>>>> in gen-common-crawl.sh), the scraped data will be stored with a 
>>>>>>>> filename of the format <epochtime(milliseconds)>.<filetype>, 
>>>>>>>> where <filetype> is either the extension of the original file or 
>>>>>>>> .html as default if the original file does not have an extension. 
>>>>>>>> This schema is used for file naming and it does not depend on 
>>>>>>>> internal output format (JSON).
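
The naming scheme described here can be sketched as a small function: the epoch time in milliseconds, a dot, then the original extension or "html" when there is none. This is an illustration of the rule as stated in the email, not the CommonCrawlDataDumper source; the class and method names are hypothetical and edge cases (such as a URL with no path) are not handled.

```java
// Hypothetical illustration of the naming rule; not CommonCrawlDataDumper code.
final class DumpFileName {
    /** Builds "<epochMillis>.<ext>", falling back to "html" when the name has no extension. */
    static String of(long epochMillis, String originalName) {
        int slash = originalName.lastIndexOf('/');
        String base = originalName.substring(slash + 1); // last path segment
        int dot = base.lastIndexOf('.');
        boolean hasExtension = dot >= 0 && dot < base.length() - 1;
        String ext = hasExtension ? base.substring(dot + 1) : "html";
        return epochMillis + "." + ext;
    }
}
```

Because the extension comes from the original file, a CBOR-encoded dump of an HTML page still ends in .html under this rule, which is why extension-based detection cannot recognize it as CBOR.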
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2) Another problem is that the root object is written with some 
>>>>>>>> garbled chars in each of the output JSON files (with extension 
>>>>>>>> html); PFA: garbled.jpg, and one of the output JSON files is 
>>>>>>>> attached as an example too (PFA: 1423894754000.html). The JSON 
>>>>>>>> files cannot be parsed properly by aggregate.py due to those 
>>>>>>>> garbled chars.
>>>>>>>> Even if I get rid of those garbled chars, there is no mimeTypes 
>>>>>>>> element, which is what aggregate.py reads.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Text content and metadata extracted from the crawled binary data 
>>>>>>>> are stored in a structured document format (JSON). Furthermore, 
>>>>>>>> this document is encoded using CBOR <http://cbor.io/> 
>>>>>>>> serialization. Each non-human-readable character that you notice 
>>>>>>>> at the front and at the end of the JSON data is due to the CBOR 
>>>>>>>> encoding. Thus, if you need to read JSON data from a document 
>>>>>>>> dumped out by CommonCrawlDataDumper, you have to deserialize the 
>>>>>>>> CBOR-encoded data structure inside the file.
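
To make the deserialization step concrete, here is a minimal stdlib-only sketch that unwraps one definite-length CBOR text string (major type 3) and returns the text inside, skipping a leading tag 55799 if present. It is an assumption-laden illustration, not the code used by Nutch: the class name is hypothetical, and indefinite-length (chunked) strings, which start with 0x7F, are not handled, so a full CBOR library is the better choice for real dumps.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical minimal decoder for a single CBOR text string; not a full CBOR parser.
final class CborTextString {
    /** Decodes one definite-length CBOR text string, skipping tag 55799 if present. */
    static String decode(byte[] in) {
        int pos = 0;
        if (in.length >= 3 && (in[0] & 0xFF) == 0xD9
                && (in[1] & 0xFF) == 0xD9 && (in[2] & 0xFF) == 0xF7) {
            pos = 3; // optional self-describe tag
        }
        int initial = in[pos++] & 0xFF;
        if ((initial >>> 5) != 3) {
            throw new IllegalArgumentException("not a CBOR text string");
        }
        int info = initial & 0x1F;
        long length;
        if (info < 24) {
            length = info; // length stored directly in the initial byte
        } else if (info <= 27) {
            int lengthBytes = 1 << (info - 24); // 1, 2, 4 or 8 big-endian length bytes
            length = 0;
            for (int i = 0; i < lengthBytes; i++) {
                length = (length << 8) | (in[pos++] & 0xFF);
            }
        } else {
            throw new IllegalArgumentException("indefinite or reserved length not handled");
        }
        return new String(in, pos, (int) length, StandardCharsets.UTF_8);
    }
}
```

The returned string is the JSON text, which can then be handed to any JSON parser.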
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I hope this short overview can help in your work. I really 
>>>>>>>> appreciate your feedback and, by the way, thanks a lot for your 
>>>>>>>> great job in detection.
>>>>>>>> 
>>>>>>>> I am available to provide all the support I can give, so do not 
>>>>>>>> hesitate to contact me if you need any further information.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Giuseppe
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Finally, after some research, I guess that the statistical 
>>>>>>>> information (present in the readme of the code repo) is not being 
>>>>>>>> collected and computed by aggregate.py from those output json 
>>>>>>>> files but it looks like it is coming from the log.... see the 
>>>>>>>> following as an example:
>>>>>>>> 
>>>>>>>> 2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - 
>>>>>>>> CommonsCrawlDataDumper File Stats:
>>>>>>>> TOTAL Stats:
>>>>>>>> [
>>>>>>>>  {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>>  {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>>  {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>>  {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>>  {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>>  {"mimeType":"application/zip","count":"6"}
>>>>>>>>  {"mimeType":"application/xml","count":"11"}
>>>>>>>>  {"mimeType":"image/png","count":"110"}
>>>>>>>>  {"mimeType":"image/jpeg","count":"70"}
>>>>>>>>  {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>>  {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>>  {"mimeType":"video/mp4","count":"3"}
>>>>>>>>  {"mimeType":"text/plain","count":"104"}
>>>>>>>>  {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>>  {"mimeType":"image/gif","count":"2"}
>>>>>>>>  {"mimeType":"text/x-php","count":"1"}
>>>>>>>>  {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>>  {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>>  {"mimeType":"text/html","count":"9506"}
>>>>>>>>  {"mimeType":"application/pdf","count":"280"}
>>>>>>>> ]
>>>>>>>> 
>>>>>>>> It turns out that aggregate.py is not what produces the 
>>>>>>>> statistical information; I am not sure what it does. Anyway, I 
>>>>>>>> think I understand the whole idea and I concur with it: maybe we 
>>>>>>>> can repackage Tika with the feature (i.e. probabilistic mime 
>>>>>>>> selection) and see whether it outputs the same information in 
>>>>>>>> the log as the version without it.
>>>>>>>> 
>>>>>>>> BTW, regarding the use of the probabilistic mime selection 
>>>>>>>> feature: in my pull request, I added a simple test case which 
>>>>>>>> might tell a bit more about how the feature is called and used; 
>>>>>>>> it is simple though.
>>>>>>>> Here is an example snippet:
>>>>>>>>     ProbabilisticMimeDetectionSelector probSel =
>>>>>>>>         new ProbabilisticMimeDetectionSelector();
>>>>>>>>     // input is an InputStream, metadata is a Metadata
>>>>>>>>     probSel.detect(input, metadata);
>>>>>>>> It is similar to MimeTypes::detect(...) (more information on 
>>>>>>>> this can be found in
>>>>>>>> https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>>> Now, in order to allow Tika().detect() to call
>>>>>>>> ProbabilisticMimeDetectionSelector::detect(...) (as
>>>>>>>> Tika().detect() is being called by commoncrawldump), we need to 
>>>>>>>> modify/add some code in TikaConfig, which initializes a list 
>>>>>>>> of default detectors: we need to remove the MimeTypes detector 
>>>>>>>> (mimeTypes) from the list and replace it with the 
>>>>>>>> ProbabilisticMimeDetectionSelector (probSel). (I am not sure 
>>>>>>>> whether I should create another pull request with this change 
>>>>>>>> for TikaConfig.)
>>>>>>>> 
>>>>>>>> I think that is all of my initial thoughts, findings and plan. 
>>>>>>>> If there is anything you would like to add or comment on, please 
>>>>>>>> kindly let me know; then I will start working on my 'finale'. 
>>>>>>>> BTW, don't worry: graduation is not the end of my involvement 
>>>>>>>> with Tika and this project. After graduating I still can and 
>>>>>>>> want to help this polar project and Tika as much as possible, 
>>>>>>>> correct programming faults and bugs, respond to Tika issues, etc.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Chris Mattmann [mailto:[email protected]]
>>>>>>>> Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>> To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>> Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; 
>>>>>>>> [email protected]
>>>>>>>> Subject: Re: this week action from luke
>>>>>>>> Importance: High
>>>>>>>> 
>>>>>>>> Awesome Luke. I am going to work specifically on now benchmarking 
>>>>>>>> your code in real situations. For example, it would be fantastic 
>>>>>>>> to now run your Bayesian MIME detector over the whole NSF TREC 
>>>>>>>> Dynamic Domain data for Polar described here:
>>>>>>>> 
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> Paul Zimdars, CC'ed, can provide you with access to the data, and 
>>>>>>>> Annie can explain it, also CC'ed.
>>>>>>>> 
>>>>>>>> Can we make that your goal for the next 2 weeks to actually test 
>>>>>>>> it and produce a real result over the whole TREC-DD data for 
>>>>>>>> Polar? My goal will be to get your code committed and integrated 
>>>>>>>> into Tika.
>>>>>>>> The more you can write me a guide of how to build and test your 
>>>>>>>> code with Tika so I can get it committed the better.
>>>>>>>> 
>>>>>>>> Also CC'ing the Memex list for context. Note everyone: Luke is 
>>>>>>>> building a Bayesian MIME classifier to evaluate against Tika's 
>>>>>>>> existing MIME detection approach. If folks have any Memex needs 
>>>>>>>> to try and test more accurate file identification with Tika, Luke 
>>>>>>>> is the guy to talk to and I have him for 2 more weeks.
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>> 
>>>>>>>> ------------------------
>>>>>>>> Chris Mattmann
>>>>>>>> [email protected]
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Luke liu <[email protected]>
>>>>>>>> Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>> To: Chris Mattmann <[email protected]>, Chris Mattmann 
>>>>>>>> <[email protected]>
>>>>>>>> Cc: 'Luke' <[email protected]>
>>>>>>>> Subject: this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Professor Mattmann,
>>>>>>>> 
>>>>>>>> I think I am in the final phase of the research, and last week I 
>>>>>>>> finished the last item in the list, and hopefully everything will 
>>>>>>>> be fine.
>>>>>>>> 
>>>>>>>> For now, I can probably spend some time verifying or 
>>>>>>>> optimizing the code; the majority of the research has been 
>>>>>>>> done. It would also be great if you could comment on my 
>>>>>>>> work (the 2 pull requests) when you have time.
>>>>>>>> 
>>>>>>>> If you do have confusion with any of my work, please also do let 
>>>>>>>> me know.
>>>>>>>> 
>>>>>>>> Thanks, and I am glad to be working with you. For the next couple 
>>>>>>>> of weeks before graduation, I am going to continue revising and 
>>>>>>>> testing the code and features to get rid of any flaws when I 
>>>>>>>> have time.
>>>>>>>> 
>>>>>>>> I am not sure whether I have missed something; if I have missed 
>>>>>>>> anything important, please let me know too.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>> Google Groups "JPL-Kitware-Continuum Memex Group" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>> send an email to [email protected]
>>>>>>>> <mailto:memex-jpl%[email protected]>.
>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>> Visit this group at http://groups.google.com/group/memex-jpl.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/memex-jpl/000f01d07ad4%24b3510070%2419f30150%24%40edu.
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>> <garbled.jpg><1423894754000.html>
> 
> 
