Hi professor,

I think it depends heavily on the content Tika reads. If the file contains a 
sequence of bytes that matches the magic pattern of one or more of the mime 
types defined in our tika-mimetypes.xml, I guess Tika will put those types in 
its candidate list; please note that the magic-byte detection approach can 
produce several candidate mime types. Tika then also considers the decision 
made by the extension detection approach: if the extension agrees with the 
first type in the magic candidate list, then that first type will certainly be 
returned (the same applies to the metadata hint approach).
Of course, Tika also prefers the most specialized type.

Let's get back to the following question; this is only my guess, though.
[Prof]: Also what happens if you tweak the definition of XHTML to not scan 
until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
Let's consider an extreme case where we only scan 10 bytes, or even 1 byte. 
Then it seems that magic-byte detection will inevitably detect nothing, and I 
think it will return something like "application/octet-stream", which is the 
most general type. As mentioned, Tika favours the most specialized type, and I 
believe almost every type is a specialization of "application/octet-stream", 
so in this extreme case whatever the extension approach returns is the more 
specialized one. Therefore the answer in this extreme case may be yes: I think 
it is very possible that the CBOR type detected by the extension approach 
takes over...
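
To make the reasoning above concrete, here is a toy model in Java of the precedence I am guessing at. It is only a sketch and not Tika's actual code: the class HintModel, the PARENT table and the detect method are hypothetical names, and the parent links are assumptions chosen for illustration.

```java
import java.util.List;
import java.util.Map;

final class HintModel {
    static final String OCTET_STREAM = "application/octet-stream";

    // Assumed parent links (child -> more general type); illustrative only.
    static final Map<String, String> PARENT = Map.of(
            "application/xhtml+xml", "application/xml",
            "application/xml", OCTET_STREAM,
            "application/cbor", OCTET_STREAM);

    // True if 'type' sits somewhere below 'ancestor' in the PARENT chain.
    static boolean isSpecializationOf(String type, String ancestor) {
        for (String t = PARENT.get(type); t != null; t = PARENT.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    // magicCandidates: types whose magic bytes matched, in priority order.
    // extensionHint: type suggested by the file extension, or null.
    static String detect(List<String> magicCandidates, String extensionHint) {
        // No magic match is modelled as the most general type.
        List<String> candidates = magicCandidates.isEmpty()
                ? List.of(OCTET_STREAM) : magicCandidates;
        if (extensionHint != null) {
            for (String candidate : candidates) {
                // The hint only wins if it agrees with, or refines, a candidate.
                if (extensionHint.equals(candidate)
                        || isSpecializationOf(extensionHint, candidate)) {
                    return extensionHint;
                }
            }
        }
        return candidates.get(0);
    }

    public static void main(String[] args) {
        // Magic sees "<html" inside the CBOR payload, so the .cbor hint loses.
        System.out.println(detect(List.of("application/xhtml+xml"), "application/cbor"));
        // Scan window too short for any magic match, so the .cbor hint wins.
        System.out.println(detect(List.of(), "application/cbor"));
    }
}
```

In this model the .cbor extension can only take over when magic detection finds nothing more specific than the most general type, which is the extreme case I described.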

My idea was, and still is, that if the CBOR self-describing tag 55799 is 
present in the CBOR file, then it can be used to detect the CBOR type.
Again, the CBOR type will probably be added to the magic candidate list 
together with another type such as application/xhtml+xml. I guess the order in 
the list probably also matters: the first entry is preferred over the next 
one. The decision from the extension detection approach also plays a role in 
breaking the tie, e.g. if the extension detection method agrees on CBOR with 
one of the candidate types in the magic list, then CBOR will be returned 
(again, the same applies to the metadata hint method).

I have not taken a closer look at a CBOR file that has tag 55799, but I 
expect its hex serialization to be 0xd9d9f7, i.e. the tag should be present in 
the header as a fixed sequence of bytes 
(https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is present in 
the file, or preferably in the header (within a reasonable range of bytes), I 
believe it can probably be used as the magic number for the CBOR type.
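
As a quick check of where 0xd9d9f7 comes from and how a magic test on it could look, here is a small Java sketch. It is not Tika code: CborMagic, encodeTag16 and containsWithin are hypothetical names, and the maxOffset parameter is my assumption for "within a reasonable range of bytes".

```java
final class CborMagic {
    static final int SELF_DESCRIBE_TAG = 55799;

    // Encodes a tag number in 256..65535: major type 6 with a 2-byte argument.
    static byte[] encodeTag16(int tag) {
        return new byte[] {
            (byte) ((6 << 5) | 25),   // 0xD9: major type 6, additional info 25
            (byte) (tag >> 8),        // high byte of the tag number
            (byte) (tag & 0xFF)       // low byte of the tag number
        };
    }

    // True if 'magic' occurs in 'prefix' starting at an offset in [0, maxOffset].
    static boolean containsWithin(byte[] prefix, byte[] magic, int maxOffset) {
        int last = Math.min(maxOffset, prefix.length - magic.length);
        for (int off = 0; off <= last; off++) {
            int i = 0;
            while (i < magic.length && prefix[off + i] == magic[i]) {
                i++;
            }
            if (i == magic.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] magic = encodeTag16(SELF_DESCRIBE_TAG);
        StringBuilder hex = new StringBuilder("0x");
        for (byte b : magic) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        System.out.println(hex); // 0xd9d9f7

        // Tag followed by the one-character text string "a" (0x61 0x61).
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, 0x61, 0x61};
        System.out.println(containsWithin(tagged, magic, 0));
    }
}
```

With maxOffset set to 0 the check only accepts the tag at the very start of the file, which is where an encoder would write it.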


There is another thing I mentioned in the JIRA ticket I opened yesterday for 
the CBOR parser and detection: it is also possible that CBOR content is 
embedded inside a plain JSON file, and the way a decoder can distinguish the 
two in that file is again by looking at tag 55799. This may rarely happen, 
but a robust parser should be able to take care of it. Tika might want to 
consider using the FasterXML CBOR library (jackson-dataformat-cbor), which the 
Nutch tool already uses, when developing the CBOR parser...
Again, let me cite the same paragraph from the RFC:

" a decoder might be able to parse both CBOR and JSON.
   Such a decoder would need to mechanically distinguish the two
   formats.  An easy way for an encoder to help the decoder would be to
   tag the entire CBOR item with tag 55799, the serialization of which
   will never be found at the beginning of a JSON text."
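
A decoder along the lines of that paragraph could be sketched in Java as follows. This is only an illustration under my own assumptions: FormatSniffer and sniff are hypothetical names, and the JSON check is a rough "plausible first character" test, not a parser.

```java
final class FormatSniffer {
    enum Format { CBOR, JSON, UNKNOWN }

    // Looks only at the first bytes of the input to choose a decoder.
    static Format sniff(byte[] head) {
        // Self-describe tag 55799 is serialized as d9 d9 f7.
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xD9
                && (head[1] & 0xFF) == 0xD9
                && (head[2] & 0xFF) == 0xF7) {
            return Format.CBOR;
        }
        for (byte b : head) {
            char c = (char) (b & 0xFF);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                continue; // JSON allows leading whitespace
            }
            // First significant character of any JSON value.
            boolean jsonStart = c == '{' || c == '[' || c == '"' || c == '-'
                    || (c >= '0' && c <= '9')
                    || c == 't' || c == 'f' || c == 'n';
            return jsonStart ? Format.JSON : Format.UNKNOWN;
        }
        return Format.UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] cbor = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, 0x61, 0x61};
        byte[] json = "  {\"a\": 1}".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        System.out.println(sniff(cbor));
        System.out.println(sniff(json));
    }
}
```

Please note that untagged CBOR can begin with a byte that also looks like JSON (0x7b, for example, is both '{' and a CBOR text string header), so without the tag the JSON answer here only means "plausibly JSON".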


Thanks
Luke



-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Tuesday, April 21, 2015 9:49 PM
To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); 'NSF 
Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Hi Luke,

Can you post the below conversation to dev@tika and summarize it there. Also 
what happens if you tweak the definition of XHTML to not scan until 8192, but 
say 6000 (e.g., 0:6000), does CBOR take over then?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion 
Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department University of Southern 
California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Luke <hanson311...@gmail.com>
Date: Wednesday, April 22, 2015 at 12:19 AM
To: Chris Mattmann <chris.mattm...@gmail.com>, "Totaro, Giuseppe U 
(3980-Affiliate)" <tot...@di.uniroma1.it>, Chris Mattmann 
<chris.a.mattm...@jpl.nasa.gov>
Cc: "Bryant, Ann C (398G-Affiliate)" <anniebry...@gmail.com>, "Zimdars, Paul A 
(3980-Affiliate)" <paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure 
DR Students <nsf-polar-usc-stude...@googlegroups.com>,
"memex-...@googlegroups.com" <memex-...@googlegroups.com>
Subject: RE: [memex-jpl] this week action from luke

>Hi Professor,
>Please see attached jpg for the difference.
>Thanks
>Luke
>
>-----Original Message-----
>From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
>Sent: Tuesday, April 21, 2015 5:27 PM
>To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>memex-...@googlegroups.com
>Subject: Re: [memex-jpl] this week action from luke
>
>Hey Luke what happens if you do java -jar /path/to/tika-app -m 
>/path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app -m < 
>/path/to/cbor/file.cbor any difference?
>
>------------------------
>Chris Mattmann
>chris.mattm...@gmail.com
>
>
>
>
>-----Original Message-----
>From: Luke <hanson311...@gmail.com>
>Date: Tuesday, April 21, 2015 at 5:41 PM
>To: 'Luke' <hanson311...@gmail.com>, Chris Mattmann 
><chris.mattm...@gmail.com>, 'Giuseppe Totaro' <tot...@di.uniroma1.it>, 
>Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>
>Cc: "'Bryant, Ann C (398G-Affiliate)'" <anniebry...@gmail.com>, 
>"'Zimdars, Paul A (3980-Affiliate)'" <paul.a.zimd...@jpl.nasa.gov>, NSF 
>Polar CyberInfrastructure DR Students 
><nsf-polar-usc-stude...@googlegroups.com>,
><memex-...@googlegroups.com>
>Subject: RE: [memex-jpl] this week action from luke
>
>>Hi professor,
>>I just sent a pull request for adding cbor extension.
>>The interesting thing is that Tika still identifies the file dumped by 
>>the Nutch dump tool as "application/xhtml+xml" even when I manually 
>>change the file extension to the correct one (i.e. *.cbor).
>>
>>The reason is probably that Tika identifies "application/xhtml+xml" by 
>>searching for "<html" in the file content (PFA: xhtml+xml.jpg). Now, if 
>>you take a look at the CBOR file dumped by Nutch, you will see that we do 
>>have that element as part of the CBOR content, because the entire crawled 
>>XHTML document seems to be embedded in the CBOR JSON (PFA: cbor.jpg). 
>>Also, in Tika, magic detection seems to have higher priority than glob 
>>detection, so the type is detected incorrectly.
>>
>>Therefore, I would like to mention that adding the entry 
>><glob pattern="*.cbor"/> does not resolve the issue as of now without 
>>some fixed magic bytes / patterns for CBOR.
>>I would also like to add that things will be different with our 
>>probabilistic mime detection selector, because if we know that the 
>>file extension is more reliable than magic bytes, then we can 
>>certainly give more preferential weight to the extension... This also 
>>might show that the current implementation of MimeTypes detection is a 
>>bit stiff or inflexible in this scenario. :)
>>
>>
>>Thanks
>>Luke
>>
>>-----Original Message-----
>>From: Luke [mailto:hanson311...@gmail.com]
>>Sent: Tuesday, April 21, 2015 12:14 PM
>>To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>'memex-...@googlegroups.com'
>>Subject: RE: [memex-jpl] this week action from luke
>>
>>Yes, let me add the cbor extension entry in tika xml, will send the 
>>pull request soon.
>>
>>Thanks
>>Luke
>>-----Original Message-----
>>From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
>>Sent: Tuesday, April 21, 2015 6:51 AM
>>To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>(3980-Affiliate); NSF Polar CyberInfrastructure DR Students; 
>>memex-...@googlegroups.com
>>Subject: Re: [memex-jpl] this week action from luke
>>
>>Giuseppe both of these ideas supporting the CBOR WRITE_TYPE_HEADER and 
>>tag along with adding an -extension command would be fantastic. Can 
>>you file both of those NUTCH issues, wait a day or so, and then based 
>>on feedback use your new Nutch commit karma to get those into Nutch?
>>
>>And then when creating the issues, can you link to the TIKA-1610 issue?
>>At that point, when those two to be defined NUTCH issues are up, Luke, 
>>in parallel can you throw up a pull request/patch in Tika for the 
>>extension along with the MIME detection?
>>
>>Cheers,
>>Chris
>>
>>------------------------
>>Chris Mattmann
>>chris.mattm...@gmail.com
>>
>>
>>
>>
>>-----Original Message-----
>>From: Giuseppe Totaro <tot...@di.uniroma1.it>
>>Date: Tuesday, April 21, 2015 at 12:33 PM
>>To: Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>
>>Cc: Luke <hanson311...@gmail.com>, Chris Mattmann 
>><chris.mattm...@gmail.com>, "Bryant, Ann C (398G-Affiliate)"
>><anniebry...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>><paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>Students <nsf-polar-usc-stude...@googlegroups.com>,
>>"memex-...@googlegroups.com"
>><memex-...@googlegroups.com>
>>Subject: Re: [memex-jpl] this week action from luke
>>
>>>Thanks Luke. Great work.
>>>Chris, we wrap a single string value, representing the JSON text, for 
>>>each file into CBOR (by using serializeCBORData method). For 
>>>instance, using the Unix hex dump tool, we can see that, as expected, 
>>>the first byte of all files is "0x7F" (the first three bits are 
>>>"011", that is the major type for strings, and the following 5 bits 
>>>are "11010", meaning a uint32_t encodes the length of following 
>>>text), and the following 4 bytes (single-precision float) encodes the 
>>>right length of file (as described in RFC7049 
>>><http://tools.ietf.org/html/rfc7049>).
>>>Therefore, no CBOR tag is currently included in the file (a list of 
>>>CBOR tags is available here 
>>><http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>I did not know about CBOR "magic header". Thanks a lot Luke for this 
>>>great research. Chris, if you agree, I can add support for prepending 
>>>self-describing CBOR tag 55799 to CommonCrawldataDumper class. I 
>>>believe it is very easy because I have to enable the 
>>>WRITE_TYPE_HEADER feature for CBORGenerator class (the source code is 
>>>available here 
>>><https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>Then, I can comment the TIKA-1610
>>><https://issues.apache.org/jira/browse/TIKA-1610> issue.
>>>
>>>Regarding the file extension, in the Memex CCA format the original 
>>>file extension is used. We could add support for a -extension 
>>>command-line option allowing the user to give a file extension (e.g.,
>>>cbor) for all files dumped out.
>>>
>>>Thanks a lot,
>>>Giuseppe
>>>
>>>
>>>
>>>On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980) 
>>><chris.a.mattm...@jpl.nasa.gov> wrote:
>>>
>>>Thanks for this great research, Luke!
>>>
>>>Giuseppe, any idea why this tag doesn’t make it into the file?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>-----Original Message-----
>>>From: Luke <hanson311...@gmail.com>
>>>Date: Tuesday, April 21, 2015 at 2:55 AM
>>>To: Chris Mattmann <chris.mattm...@gmail.com>, "Totaro, Giuseppe U 
>>>(3980-Affiliate)" <tot...@di.uniroma1.it>, Chris Mattmann 
>>><chris.a.mattm...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)"
>>><anniebry...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>><paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>Students <nsf-polar-usc-stude...@googlegroups.com>,
>>>"memex-...@googlegroups.com"
>>><memex-...@googlegroups.com>
>>>Subject: RE: [memex-jpl] this week action from luke
>>>
>>>>Thanks professor.
>>>>Hi professor and all.
>>>>JIRA issue : CBOR Parser and detection improvement
>>>>https://issues.apache.org/jira/browse/TIKA-1610
>>>>
>>>>I tried to conduct a bit of research on this CBOR detection.
>>>>
>>>>It looks like there is a self describing tag that needs to be 
>>>>written in the cbor file thru which other applications might be able 
>>>>to identify the cbor type....
>>>>Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>
>>>>I don’t see that tag being present in the cbor file dumped by the 
>>>>nutch tool, I am not very sure though.
>>>>
>>>>Thanks
>>>>Luke
>>>>
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
>>>>Sent: Monday, April 20, 2015 4:10 AM
>>>>To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C 
>>>>(398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar 
>>>>CyberInfrastructure DR Students'; memex-...@googlegroups.com
>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>
>>>>Nice one, Luke. If you have a second and you can open up an issue in 
>>>>Tika to make it support CBOR, then yes, by all means! :)
>>>>
>>>>
>>>>------------------------
>>>>Chris Mattmann
>>>>chris.mattm...@gmail.com
>>>>
>>>>
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Luke <hanson311...@gmail.com>
>>>>Date: Monday, April 20, 2015 at 4:15 AM
>>>>To: 'Giuseppe Totaro' <tot...@di.uniroma1.it>, Chris Mattmann 
>>>><chris.mattm...@gmail.com>, Chris Mattmann 
>>>><chris.a.mattm...@jpl.nasa.gov>, "'Bryant, Ann C (398G-Affiliate)'"
>>>><anniebry...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>><paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR 
>>>>Students <nsf-polar-usc-stude...@googlegroups.com>,
>>>><memex-...@googlegroups.com>
>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>
>>>>>Thanks a lot Giuseppe for the prompt response clearing up a bit of 
>>>>>my confusion with the Nutch CommonCrawlDataDumper , appreciated.
>>>>>
>>>>>BTW, it looks like Tika might need to consider supporting CBOR 
>>>>>parsing and detection.
>>>>>I checked the RFC, and it looks like CBOR does not have magic numbers. 
>>>>>PFA: rfc_cbor.jpg
>>>>>Actually, I don’t quite understand why the CommonCrawlDataDumper 
>>>>>is not dumping the Nutch segments with the .cbor extension, which 
>>>>>would seem to be helpful for type detection.
>>>>>
>>>>>To professor Mattmann,
>>>>>Tika does not support the detection of CBOR. Although the trunk 
>>>>>version has entries for CBOR in tika-mimetypes.xml (PFA: 
>>>>>cbor_tika.mimetypes.xml), those entries do not properly detect the 
>>>>>CBOR files dumped by CommonCrawlDataDumper. Also, CBOR does not 
>>>>>have magic bytes; off the top of my head, the only ways we can detect 
>>>>>it are the extension and a content byte histogram (please note, 
>>>>>this is a locally optimal and data-dependent solution). :)
>>>>>
>>>>>I think I am bit deviating from the main route and discussion of 
>>>>>this thread…. i.e. the plan for testing the “probabilistic mime 
>>>>>detector selection” with polar data.
>>>>>Anyway, I plan to repackage tika by incorporating the probabilistic 
>>>>>selection feature and replace the tika jar in nutch with the 
>>>>>repackaged one, and then run the CommonCrawlDataDumper and see how 
>>>>>it goes. If you have any specific ideas and thought with the 
>>>>>testing, please kindly let me know.
>>>>>
>>>>>Thanks
>>>>>Luke
>>>>>
>>>>>From: Giuseppe Totaro [mailto:tot...@di.uniroma1.it]
>>>>>Sent: Sunday, April 19, 2015 11:17 PM
>>>>>To: Luke liu
>>>>>Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C (398G-Affiliate); 
>>>>>Zimdars, Paul A (3980-Affiliate); Luke; NSF Polar 
>>>>>CyberInfrastructure DR Students; memex-...@googlegroups.com
>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>
>>>>>
>>>>>
>>>>>Hi Luke,
>>>>>
>>>>>
>>>>>my name is Giuseppe and I am a PhD student working under the 
>>>>>supervision of Prof. Chris Mattmann. I worked on 
>>>>>CommonCrawlDataDumper tool, so I can give some feedback on a couple 
>>>>>of your observations. My comments inline below.
>>>>>
>>>>>
>>>>>
>>>>>Il giorno 19/apr/2015, alle ore 12:11, Luke liu <shuai...@usc.edu> 
>>>>>ha
>>>>>scritto:
>>>>>
>>>>>
>>>>>Thanks a lot professor; Sorry for the brief delay, I was spending 
>>>>>some time in understanding the code repo i.e.
>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>
>>>>>From gen-common-crawl.sh, it looks like commoncrawldump is dumping 
>>>>>the crawl segments to json files with the human readable and 
>>>>>understandable content.
>>>>>1) I am trying to run one of the commands shown in 
>>>>>gen-common-crawl.sh on my side, but the generated files all end with 
>>>>>.html or .htm. The command listed in gen-common-crawl.sh seems to 
>>>>>allude to where the data is located on our nsfpolardata.dyndns.org 
>>>>><http://nsfpolardata.dyndns.org/>; although the locations are not 
>>>>>exactly correct (probably they need to be updated), part of the 
>>>>>patterns allowed me to locate some similar datasets (e.g. 
>>>>>/data2/crawls/raw/CS572Spring2015). Again, I am seeing that the dumped 
>>>>>files all end with html, but surprisingly, inside those output html 
>>>>>files, the contents are in JSON format;
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>The file extension is (almost) always the same as the original file.
>>>>>More in detail, using the -epochFilename command-line option (as in 
>>>>>gen-common-crawl.sh), the scraped data will be stored with a 
>>>>>filename of the format <epochtime(milliseconds)>.<filetype>, where 
>>>>><filetype> is either the extension of the original file or .html as 
>>>>>default if the original file does not have an extension. This 
>>>>>schema is used for file naming and it does not depend on internal 
>>>>>output format (JSON).
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>2) Another problem is that the root object is being set with some 
>>>>>garbled chars in each of the outputted json files (with extension 
>>>>>html in the end), PFA: garbled.jpg and one of the outputted json 
>>>>>file has been also attached as an example too (PFA:
>>>>>1423894754000.html); the json files cannot be parsed properly by 
>>>>>aggregate.py due to those garbled chars.
>>>>>Even if I get rid of those garbled chars, there are not mimeTypes 
>>>>>element which are being read by aggregate.py.
>>>>>
>>>>>
>>>>>
>>>>>Text content and metadata extracted from the crawled binary data 
>>>>>are stored in a structured document format (JSON). Furthermore, 
>>>>>this document is encoded using CBOR <http://cbor.io/> 
>>>>>serialization. Each non-human-readable character that you notice at 
>>>>>the front and at the end of the JSON data is due to CBOR encoding. 
>>>>>Thus, if you need to read JSON data from a document dumped out by 
>>>>>CommonCrawlDataDumper, you have to deserialize the CBOR-encoded data 
>>>>>structure inside the file.
>>>>>
>>>>>
>>>>>
>>>>>I hope this short overview can help in your work. I really 
>>>>>appreciate your feedback and, by the way, thanks a lot for your 
>>>>>great job in detection.
>>>>>
>>>>>I am available to provide all the support I can give, so do not 
>>>>>hesitate to contact me if you need any further information.
>>>>>
>>>>>
>>>>>
>>>>>Thanks,
>>>>>
>>>>>Giuseppe
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>Finally, after some research, I guess that the statistical 
>>>>>information (present in the readme of the code repo) is not being 
>>>>>collected and computed by aggregate.py from those output json files 
>>>>>but it looks like it is coming from the log.... see the following 
>>>>>as an example:
>>>>>
>>>>>2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - 
>>>>>CommonsCrawlDataDumper File Stats:
>>>>>TOTAL Stats:
>>>>>[
>>>>>   {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>   {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>   {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>   {"mimeType":"application/octet-stream","count":"641"}
>>>>>   {"mimeType":"application/epub+zip","count":"1"}
>>>>>   {"mimeType":"application/zip","count":"6"}
>>>>>   {"mimeType":"application/xml","count":"11"}
>>>>>   {"mimeType":"image/png","count":"110"}
>>>>>   {"mimeType":"image/jpeg","count":"70"}
>>>>>   {"mimeType":"application/atom+xml","count":"213"}
>>>>>   {"mimeType":"application/rss+xml","count":"43"}
>>>>>   {"mimeType":"video/mp4","count":"3"}
>>>>>   {"mimeType":"text/plain","count":"104"}
>>>>>   {"mimeType":"application/rdf+xml","count":"2"}
>>>>>   {"mimeType":"image/gif","count":"2"}
>>>>>   {"mimeType":"text/x-php","count":"1"}
>>>>>   {"mimeType":"video/x-msvideo","count":"1"}
>>>>>   {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>   {"mimeType":"text/html","count":"9506"}
>>>>>   {"mimeType":"application/pdf","count":"280"}
>>>>>]
>>>>>
>>>>>It turns out that aggregate.py is not the one that produces the 
>>>>>statistical information; I am not sure what it does... But anyway, I 
>>>>>think I understand the whole idea and I do concur with it. Maybe 
>>>>>we can repackage Tika with the feature (i.e. probabilistic mime 
>>>>>selection) incorporated and see if it outputs the same information in 
>>>>>the log as the one without it.
>>>>>
>>>>>BTW, Regarding the use of the feature with probabilistic mime
>>>>>selection:
>>>>>in my pull request, I added a simple test case which might tell a 
>>>>>bit more about how the feature is called and used, it is simple 
>>>>>though.
>>>>>Here is an example snippet:
>>>>>    // input is an InputStream, metadata is a Metadata instance
>>>>>    ProbabilisticMimeDetectionSelector probSel =
>>>>>            new ProbabilisticMimeDetectionSelector();
>>>>>    MediaType type = probSel.detect(input, metadata);
>>>>>It is similar to MimeTypes::detect(...) (more information about this 
>>>>>can be found in https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>Now, in order to allow the Tika().detect() to call the
>>>>>ProbabilisticMimeDetectionSelector::detect(...) (as Tika().detect() 
>>>>>is being called by commoncrawldump), we need to modify/add some 
>>>>>code in the TikaConfig which initializes a list of default 
>>>>>detectors, and we need to get rid of the detector - mimeTypes:: 
>>>>>MimeTypes in the list and replace it with probSel::
>>>>>ProbabilisticMimeDetectionSelector. (not sure if I should create 
>>>>>another pull request with this change for
>>>>>TikaConfig)
>>>>>
>>>>>I think that is all of my initial thoughts, with some findings and a 
>>>>>plan; if you have anything you would like to add or comment on, 
>>>>>please do kindly let me know, and then I will start working on 
>>>>>my 'finale'. BTW, don’t worry, my graduation is not the end of my 
>>>>>work with Tika and this project; after that I still can and want to 
>>>>>help this polar project and Tika as much as possible, correct 
>>>>>programming faults and bugs, respond to Tika issues, etc.
>>>>>
>>>>>
>>>>>
>>>>>Thanks
>>>>>Luke
>>>>>
>>>>>-----Original Message-----
>>>>>From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
>>>>>Sent: Saturday, April 18, 2015 6:26 AM
>>>>>To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C 
>>>>>(398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; 
>>>>>memex-...@googlegroups.com
>>>>>Subject: Re: this week action from luke
>>>>>Importance: High
>>>>>
>>>>>Awesome Luke. I am going to work specifically on now benchmarking 
>>>>>your code in real situations. For example, it would be fantastic to 
>>>>>now run your Bayesian MIME detector over the whole NSF TREC Dynamic 
>>>>>Domain data for Polar described here:
>>>>>
>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>
>>>>>Paul Zimdars, CC’ed, can provide you with access to the data, and 
>>>>>Annie can explain it, also CC’ed.
>>>>>
>>>>>Can we make that your goal for the next 2 weeks to actually test it 
>>>>>and produce a real result over the whole TREC-DD data for Polar? My 
>>>>>goal will be to get your code committed and integrated into Tika.
>>>>>The more you can write me a guide of how to build and test your 
>>>>>code with Tika so I can get it committed the better.
>>>>>
>>>>>Also CC’ing the Memex list for context. Note everyone: Luke is 
>>>>>building a Bayesian MIME classifier to evaluate against Tika’s 
>>>>>existing MIME detection approach. If folks have any Memex needs to 
>>>>>try and test more accurate file identification with Tika, Luke is 
>>>>>the guy to talk to and I have him for 2 more weeks.
>>>>>
>>>>>Thanks!
>>>>>
>>>>>Cheers,
>>>>>Chris
>>>>>
>>>>>------------------------
>>>>>Chris Mattmann
>>>>>chris.mattm...@gmail.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>-----Original Message-----
>>>>>From: Luke liu <shuai...@usc.edu>
>>>>>Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>To: Chris Mattmann <chris.mattm...@gmail.com>, Chris Mattmann 
>>>>><chris.a.mattm...@jpl.nasa.gov>
>>>>>Cc: 'Luke' <hanson311...@gmail.com>
>>>>>Subject: this week action from luke
>>>>>
>>>>>
>>>>>
>>>>>Hi Professor Mattmann,
>>>>>
>>>>>I think I am in the final phase of the research, and last week I 
>>>>>finished the last item in the list, and hopefully everything will 
>>>>>be fine.
>>>>>
>>>>>For now, I can probably spend some time verifying or optimizing 
>>>>>the code; the majority of the research has been done… It would 
>>>>>also be great if you could please comment on my work (the 2 pull 
>>>>>requests) when you have time.
>>>>>
>>>>>If you do have confusion with any of my work, please also do let me 
>>>>>know.
>>>>>
>>>>>Thanks, and I am glad to be working with you. For the next couple of 
>>>>>weeks before graduation, I am going to continue revising and 
>>>>>testing the code and features to get rid of some flaws (if any) 
>>>>>when I have time.
>>>>>
>>>>>Not sure if I missed something, and if I did miss something 
>>>>>important, please do let me know too.
>>>>>
>>>>>Thanks
>>>>>Luke
>>>>>
>>>>>
>>>>><garbled.jpg><1423894754000.html>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>

