Thanks Luke. So I guess all I was asking was could you try it out. Thanks for the lesson in the RFC.
Cheers,
Chris

------------------------
Chris Mattmann
chris.mattm...@gmail.com

-----Original Message-----
From: Luke <hanson311...@gmail.com>
Date: Wednesday, April 22, 2015 at 1:46 AM
To: Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>, Chris Mattmann <chris.mattm...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'" <tot...@di.uniroma1.it>, <dev@tika.apache.org>
Cc: "'Bryant, Ann C (398G-Affiliate)'" <anniebry...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'" <paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR Students <nsf-polar-usc-stude...@googlegroups.com>, <memex-...@googlegroups.com>
Subject: RE: [memex-jpl] this week action from luke

>Hi professor,
>
>I think it depends heavily on the content Tika reads. If the file
>contains a byte sequence that matches the magic pattern of one or more
>MIME types defined in our tika-mimetypes.xml, I guess Tika will put
>those types in its estimation list; note that the magic-byte approach
>can yield several estimated MIME types. Tika then also considers the
>decision made by the extension detection approach: if the extension
>says the file type is the first one in the magic estimation list, then
>certainly the first one will be returned (the same applies to the
>metadata hint approach).
>Of course, Tika also prefers the most specialized type.
>
>Let's get back to the following question; here is my guess, though.
>[Prof]: Also what happens if you tweak the definition of XHTML to not
>scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>Consider an extreme case where we scan only 10 bytes, or 1 byte: then
>it seems that magic bytes will inevitably detect nothing, and I think
>detection will return something like "application/octet-stream", the
>most general type.
>As mentioned, Tika favours the most specialized type. If the extension
>approach returns a more specialized one (and in this extreme case I
>believe almost every type is a subclass of "application/octet-stream"),
>then the answer may be yes: I think it is very possible that the CBOR
>type detected by the extension approach takes over in this case.
>
>My idea was, and still is, that if the CBOR self-describing tag 55799
>is present in the CBOR file, it can be used to detect the CBOR type.
>Again, the CBOR type will probably be appended to the magic estimation
>list together with another one such as application/html. I guess the
>order in the list also matters, with the first entry preferred over the
>next. The decision from the extension detection approach also plays a
>role in breaking the tie: e.g., if the extension detection method
>agrees on CBOR with one of the estimated types in the magic list, then
>CBOR will be returned (again, the same applies to the metadata hint
>method).
>
>I have not taken a closer look at a CBOR file that has tag 55799, but
>I expect its hex to be something like 0xd9d9f7, or the tag to be
>present in the header as a fixed sequence of bytes
>(https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is present
>in the file, or preferably in the header (within a reasonable range of
>bytes), I believe it can probably be used as the magic number for the
>CBOR type.
>
>There is another thing I mentioned in the JIRA ticket I opened
>yesterday against the CBOR parser and detection: it is also possible
>that CBOR content is embedded inside a plain JSON file, and a decoder
>can distinguish the two in that file by looking at tag 55799 again.
>This may rarely happen, but a robust parser might be able to take care
>of it; when developing the CBOR parser, Tika might need to consider the
>FasterXML library used by the Nutch tool.
>Again let me cite the same paragraph from the rfc,
>
>" a decoder might be able to parse both CBOR and JSON.
> Such a decoder would need to mechanically distinguish the two
> formats. An easy way for an encoder to help the decoder would be to
> tag the entire CBOR item with tag 55799, the serialization of which
> will never be found at the beginning of a JSON text."
>
>Thanks
>Luke
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>Sent: Tuesday, April 21, 2015 9:49 PM
>To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); 'NSF Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
>Subject: Re: [memex-jpl] this week action from luke
>
>Hi Luke,
>
>Can you post the below conversation to dev@tika and summarize it there.
>Also what happens if you tweak the definition of XHTML to not scan until
>8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>
>Cheers,
>Chris
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>-----Original Message-----
>From: Luke <hanson311...@gmail.com>
>Date: Wednesday, April 22, 2015 at 12:19 AM
>To: Chris Mattmann <chris.mattm...@gmail.com>, "Totaro, Giuseppe U (3980-Affiliate)" <tot...@di.uniroma1.it>, Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>
>Cc: "Bryant, Ann C (398G-Affiliate)" <anniebry...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)" <paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR Students <nsf-polar-usc-stude...@googlegroups.com>, "memex-...@googlegroups.com" <memex-...@googlegroups.com>
>Subject: RE: [memex-jpl] this week action from luke
>
>>Hi Professor,
>>Please see attached jpg for the difference.
>>Thanks
>>Luke
>>
>>-----Original Message-----
>>From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
>>Sent: Tuesday, April 21, 2015 5:27 PM
>>To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
>>Subject: Re: [memex-jpl] this week action from luke
>>
>>Hey Luke what happens if you do
>>    java -jar /path/to/tika-app -m /path/to/cbor/file.cbor
>>compared to:
>>    java -jar /path/to/tika-app -m < /path/to/cbor/file.cbor
>>any difference?
>>
>>------------------------
>>Chris Mattmann
>>chris.mattm...@gmail.com
>>
>>-----Original Message-----
>>From: Luke <hanson311...@gmail.com>
>>Date: Tuesday, April 21, 2015 at 5:41 PM
>>To: 'Luke' <hanson311...@gmail.com>, Chris Mattmann <chris.mattm...@gmail.com>, 'Giuseppe Totaro' <tot...@di.uniroma1.it>, Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>
>>Cc: "'Bryant, Ann C (398G-Affiliate)'" <anniebry...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'" <paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR Students <nsf-polar-usc-stude...@googlegroups.com>, <memex-...@googlegroups.com>
>>Subject: RE: [memex-jpl] this week action from luke
>>
>>>Hi professor,
>>>I just sent a pull request adding the cbor extension.
>>>The interesting thing is that Tika still identifies the file dumped
>>>by the Nutch dump tool as "application/xhtml+xml", even when I
>>>manually change the file extension to the correct one (i.e. *.cbor).
>>>
>>>The reason is probably that Tika identifies "application/xhtml+xml"
>>>by searching for "<html" in the file content (PFA: xhtml+xml.jpg).
>>>If you take a look at the CBOR file dumped by Nutch, you see that we
>>>do have that element as part of the CBOR content, because the entire
>>>crawled XHTML document seems to be embedded in the CBOR JSON (PFA:
>>>cbor.jpg). Also, in Tika the magic detection seems to have higher
>>>priority than the glob detection, so the type is detected
>>>incorrectly.
>>>
>>>Therefore, I would like to mention that adding the entry
>>><glob pattern="*.cbor"/> does not resolve the issue as of now, not
>>>without some fixed magic bytes / patterns for CBOR.
>>>I would also like to add that things will be different with our
>>>probabilistic MIME detection selector: if we know that the file
>>>extension is more reliable than magic bytes, then we can certainly
>>>add more preferential weight to the extension...
>>>This might also show that the current MimeTypes detection
>>>implementation is a bit stiff, or less flexible, in this scenario. :)
>>>
>>>Thanks
>>>Luke
>>>
>>>-----Original Message-----
>>>From: Luke [mailto:hanson311...@gmail.com]
>>>Sent: Tuesday, April 21, 2015 12:14 PM
>>>To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 'memex-...@googlegroups.com'
>>>Subject: RE: [memex-jpl] this week action from luke
>>>
>>>Yes, let me add the cbor extension entry in the Tika XML; I will send
>>>the pull request soon.
>>>
>>>Thanks
>>>Luke
>>>
>>>-----Original Message-----
>>>From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
>>>Sent: Tuesday, April 21, 2015 6:51 AM
>>>To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); NSF Polar CyberInfrastructure DR Students; memex-...@googlegroups.com
>>>Subject: Re: [memex-jpl] this week action from luke
>>>
>>>Giuseppe, both of these ideas (supporting the CBOR WRITE_TYPE_HEADER
>>>and tag, along with adding an -extension command) would be fantastic.
>>>Can you file both of those NUTCH issues, wait a day or so, and then
>>>based on feedback use your new Nutch commit karma to get those into
>>>Nutch?
>>>
>>>And then when creating the issues, can you link to the TIKA-1610
>>>issue? At that point, when those two to-be-defined NUTCH issues are
>>>up, Luke, in parallel can you throw up a pull request/patch in Tika
>>>for the extension along with the MIME detection?
>>>
>>>Cheers,
>>>Chris
>>>
>>>------------------------
>>>Chris Mattmann
>>>chris.mattm...@gmail.com
>>>
>>>-----Original Message-----
>>>From: Giuseppe Totaro <tot...@di.uniroma1.it>
>>>Date: Tuesday, April 21, 2015 at 12:33 PM
>>>To: Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>
>>>Cc: Luke <hanson311...@gmail.com>, Chris Mattmann <chris.mattm...@gmail.com>, "Bryant, Ann C (398G-Affiliate)" <anniebry...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)" <paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR Students <nsf-polar-usc-stude...@googlegroups.com>, "memex-...@googlegroups.com" <memex-...@googlegroups.com>
>>>Subject: Re: [memex-jpl] this week action from luke
>>>
>>>>Thanks Luke. Great work.
>>>>Chris, we wrap a single string value, representing the JSON text,
>>>>for each file into CBOR (by using the serializeCBORData method). For
>>>>instance, using the Unix hex dump tool, we can see that, as
>>>>expected, the first byte of all files is "0x7F" (the first three
>>>>bits are "011", that is the major type for text strings, and the
>>>>following 5 bits are "11010", meaning a uint32_t encodes the length
>>>>of the following text), and the following 4 bytes (a 32-bit unsigned
>>>>integer) encode the right length of the file (as described in
>>>>RFC7049 <http://tools.ietf.org/html/rfc7049>).
>>>>Therefore, no CBOR tag is currently included in the file (a list of
>>>>CBOR tags is available here
>>>><http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>I did not know about the CBOR "magic header". Thanks a lot Luke for
>>>>this great research. Chris, if you agree, I can add support for
>>>>prepending the self-describing CBOR tag 55799 to the
>>>>CommonCrawlDataDumper class.
>>>>I believe it is very easy, because I only have to enable the
>>>>WRITE_TYPE_HEADER feature of the CBORGenerator class (the source
>>>>code is available here
>>>><https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>Then I can comment on the TIKA-1610
>>>><https://issues.apache.org/jira/browse/TIKA-1610> issue.
>>>>
>>>>Regarding the file extension, in the Memex CCA format the original
>>>>file extension is used. We could add support for an -extension
>>>>command-line option allowing the user to give a file extension
>>>>(e.g., cbor) for all files dumped out.
>>>>
>>>>Thanks a lot,
>>>>Giuseppe
>>>>
>>>>On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980)
>>>><chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>
>>>>Thanks for this great research, Luke!
>>>>
>>>>Giuseppe, any idea why this tag doesn’t make it into the file?
>>>>
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>Chris Mattmann, Ph.D.
>>>>Chief Architect
>>>>Instrument Software and Science Data Systems Section (398)
>>>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>Office: 168-519, Mailstop: 168-527
>>>>Email: chris.a.mattm...@nasa.gov
>>>>WWW: http://sunset.usc.edu/~mattmann/
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>Adjunct Associate Professor, Computer Science Department
>>>>University of Southern California, Los Angeles, CA 90089 USA
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>>-----Original Message-----
>>>>From: Luke <hanson311...@gmail.com>
>>>>Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>To: Chris Mattmann <chris.mattm...@gmail.com>, "Totaro, Giuseppe U (3980-Affiliate)" <tot...@di.uniroma1.it>, Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)" <anniebry...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)" <paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR Students <nsf-polar-usc-stude...@googlegroups.com>, "memex-...@googlegroups.com" <memex-...@googlegroups.com>
>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>
>>>>>Thanks professor.
>>>>>Hi professor and all.
>>>>>JIRA issue: CBOR Parser and detection improvement
>>>>>https://issues.apache.org/jira/browse/TIKA-1610
>>>>>
>>>>>I did a bit of research on CBOR detection.
>>>>>
>>>>>It looks like there is a self-describing tag that needs to be
>>>>>written in the CBOR file, through which other applications might
>>>>>be able to identify the CBOR type.
>>>>>Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>
>>>>>I don’t see that tag in the CBOR file dumped by the Nutch tool,
>>>>>though I am not very sure.
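A quick way to check a dumped file for the tag is to look at its first three bytes. The sketch below is a minimal, self-contained Java check, assuming that the tag, when present, sits at the very start of the data as the 3-byte sequence 0xd9 0xd9 0xf7 given in RFC 7049 section 2.4.5. The class and method names are made up for illustration; this is not Tika or Nutch code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CborSelfDescribeCheck {
    // RFC 7049, section 2.4.5: tag 55799 is serialized as 0xd9 0xd9 0xf7.
    private static final byte[] SELF_DESCRIBE_TAG = {(byte) 0xd9, (byte) 0xd9, (byte) 0xf7};

    /** Returns true if the data starts with the self-describe CBOR tag. */
    public static boolean startsWithSelfDescribeTag(byte[] data) {
        if (data == null || data.length < SELF_DESCRIBE_TAG.length) {
            return false;
        }
        byte[] head = Arrays.copyOf(data, SELF_DESCRIBE_TAG.length);
        return Arrays.equals(head, SELF_DESCRIBE_TAG);
    }

    public static void main(String[] args) {
        // Tagged item: the tag followed by the one-character text string "a" (0x61 0x61).
        byte[] tagged = {(byte) 0xd9, (byte) 0xd9, (byte) 0xf7, 0x61, 0x61};
        // A plain JSON text never starts with these bytes.
        byte[] json = "{\"a\":1}".getBytes(StandardCharsets.UTF_8);

        System.out.println(startsWithSelfDescribeTag(tagged)); // true
        System.out.println(startsWithSelfDescribeTag(json));   // false
    }
}
```

If the dumper wrote the tag, the same three bytes at offset 0 would be the natural candidate for a magic entry in the MIME type definitions.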
>>>>>
>>>>>Thanks
>>>>>Luke
>>>>>
>>>>>-----Original Message-----
>>>>>From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
>>>>>Sent: Monday, April 20, 2015 4:10 AM
>>>>>To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>
>>>>>Nice one, Luke. If you have a second and you can open up an issue
>>>>>in Tika to make it support CBOR, then yes, by all means! :)
>>>>>
>>>>>------------------------
>>>>>Chris Mattmann
>>>>>chris.mattm...@gmail.com
>>>>>
>>>>>-----Original Message-----
>>>>>From: Luke <hanson311...@gmail.com>
>>>>>Date: Monday, April 20, 2015 at 4:15 AM
>>>>>To: 'Giuseppe Totaro' <tot...@di.uniroma1.it>, Chris Mattmann <chris.mattm...@gmail.com>, Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>, "'Bryant, Ann C (398G-Affiliate)'" <anniebry...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'" <paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR Students <nsf-polar-usc-stude...@googlegroups.com>, <memex-...@googlegroups.com>
>>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>>
>>>>>>Thanks a lot Giuseppe for the prompt response clearing up a bit
>>>>>>of my confusion about the Nutch CommonCrawlDataDumper;
>>>>>>appreciated.
>>>>>>
>>>>>>BTW, it looks like Tika might need to consider supporting a CBOR
>>>>>>parser and detection.
>>>>>>I checked the RFC; it looks like CBOR has no magic numbers. PFA:
>>>>>>rfc_cbor.jpg
>>>>>>Actually, I don’t quite understand why the CommonCrawlDataDumper
>>>>>>is not dumping the Nutch segments with the .cbor extension, which
>>>>>>seems to be helpful for type detection.
>>>>>>
>>>>>>To professor Mattmann,
>>>>>>Tika does not support the detection of CBOR. Although the trunk
>>>>>>version has entries for CBOR in tika-mimetypes.xml (PFA:
>>>>>>cbor_tika.mimetypes.xml), those entries do not properly detect
>>>>>>the CBOR files dumped by CommonCrawlDataDumper. Also, CBOR does
>>>>>>not have magic bytes; off the top of my head, the only way we can
>>>>>>detect it is by using the extension and the content byte
>>>>>>histogram (please note, this is a locally optimal and
>>>>>>data-dependent solution). :)
>>>>>>
>>>>>>I think I am deviating a bit from the main route and discussion
>>>>>>of this thread, i.e. the plan for testing the “probabilistic mime
>>>>>>detector selection” with polar data.
>>>>>>Anyway, I plan to repackage Tika by incorporating the
>>>>>>probabilistic selection feature, replace the Tika jar in Nutch
>>>>>>with the repackaged one, and then run the CommonCrawlDataDumper
>>>>>>and see how it goes. If you have any specific ideas or thoughts
>>>>>>about the testing, please kindly let me know.
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>From: Giuseppe Totaro [mailto:tot...@di.uniroma1.it]
>>>>>>Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>To: Luke liu
>>>>>>Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF Polar CyberInfrastructure DR Students; memex-...@googlegroups.com
>>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>>
>>>>>>Hi Luke,
>>>>>>
>>>>>>my name is Giuseppe and I am a PhD student working under the
>>>>>>supervision of Prof. Chris Mattmann. I worked on the
>>>>>>CommonCrawlDataDumper tool, so I can give some feedback on a
>>>>>>couple of your observations. My comments are inline below.
>>>>>>
>>>>>>On 19 Apr 2015, at 12:11, Luke liu <shuai...@usc.edu> wrote:
>>>>>>
>>>>>>Thanks a lot professor; sorry for the brief delay, I was spending
>>>>>>some time understanding the code repo, i.e.
>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>
>>>>>>From gen-common-crawl.sh, it looks like commoncrawldump dumps the
>>>>>>crawl segments to JSON files with human-readable and
>>>>>>understandable content.
>>>>>>1) I am trying to run one of the commands shown in
>>>>>>gen-common-crawl.sh on my side, but the generated files all end
>>>>>>with .html or .htm. The commands listed in gen-common-crawl.sh
>>>>>>seem to allude to where the data is located on our
>>>>>>nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org/>.
>>>>>>Although the locations are not exactly correct (they probably
>>>>>>need to be updated), part of the patterns allowed me to locate
>>>>>>some similar datasets (e.g. /data2/crawls/raw/CS572Spring2015).
>>>>>>Again, the dumped files all end with html, but surprisingly,
>>>>>>inside those output html files the contents are in JSON format.
>>>>>>
>>>>>>The file extension is (almost) always the same as that of the
>>>>>>original file. In more detail, using the -epochFilename
>>>>>>command-line option (as in gen-common-crawl.sh), the scraped data
>>>>>>will be stored with a filename of the format
>>>>>><epochtime(milliseconds)>.<filetype>, where <filetype> is either
>>>>>>the extension of the original file or .html by default if the
>>>>>>original file does not have an extension. This scheme is used for
>>>>>>file naming and does not depend on the internal output format
>>>>>>(JSON).
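The naming rule described above can be sketched in a few lines of Java. This is only an illustration of the rule as stated (epoch time in milliseconds, then the original extension, falling back to html); the class and method names are hypothetical, and the real CommonCrawlDataDumper code may well differ.

```java
public class DumpFileName {
    /**
     * Hypothetical helper: builds "<epochMillis>.<ext>", where <ext> is the
     * extension of the original name, or "html" if it has none.
     */
    public static String epochFilename(long epochMillis, String originalName) {
        int dot = originalName.lastIndexOf('.');
        int slash = originalName.lastIndexOf('/');
        // Only a dot in the last path segment, followed by at least one
        // character, counts as introducing an extension.
        boolean hasExtension = dot > slash && dot < originalName.length() - 1;
        String ext = hasExtension ? originalName.substring(dot + 1) : "html";
        return epochMillis + "." + ext;
    }

    public static void main(String[] args) {
        System.out.println(epochFilename(1423894754000L, "index.htm")); // 1423894754000.htm
        System.out.println(epochFilename(1423894754000L, "about"));     // 1423894754000.html
    }
}
```

Under this rule the file name only reflects the crawled resource, which is why CBOR-encoded output can end up with an .html name.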
>>>>>>
>>>>>>2) Another problem is that the root object is set with some
>>>>>>garbled chars in each of the output JSON files (with the html
>>>>>>extension at the end); PFA: garbled.jpg, and one of the output
>>>>>>JSON files is also attached as an example (PFA:
>>>>>>1423894754000.html). The JSON files cannot be parsed properly by
>>>>>>aggregate.py because of those garbled chars.
>>>>>>Even if I get rid of those garbled chars, there is no mimeTypes
>>>>>>element, which is what aggregate.py reads.
>>>>>>
>>>>>>Text content and metadata extracted from the crawled binary data
>>>>>>are stored in a structured document format (JSON). Furthermore,
>>>>>>this document is encoded using CBOR <http://cbor.io/>
>>>>>>serialization. Each non-human-readable character that you notice
>>>>>>at the front and at the end of the JSON data is due to the CBOR
>>>>>>encoding. Thus, if you need to read JSON data from a document
>>>>>>dumped out by CommonCrawlDataDumper, you have to deserialize the
>>>>>>CBOR-encoded data structure inside the file.
>>>>>>
>>>>>>I hope this short overview can help in your work. I really
>>>>>>appreciate your feedback and, by the way, thanks a lot for your
>>>>>>great job on detection.
>>>>>>
>>>>>>I am available to provide all the support I can give, so do not
>>>>>>hesitate to contact me if you need any further information.
>>>>>>
>>>>>>Thanks,
>>>>>>
>>>>>>Giuseppe
>>>>>>
>>>>>>Finally, after some research, I guess that the statistical
>>>>>>information (present in the readme of the code repo) is not
>>>>>>collected and computed by aggregate.py from those output JSON
>>>>>>files; it looks like it is coming from the log....
>>>>>>See the following as an example:
>>>>>>
>>>>>>2015-04-19 04:55:42,078 INFO tools.CommonCrawlDataDumper - CommonsCrawlDataDumper File Stats:
>>>>>>TOTAL Stats:
>>>>>>[
>>>>>> {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>> {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>> {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>> {"mimeType":"application/octet-stream","count":"641"}
>>>>>> {"mimeType":"application/epub+zip","count":"1"}
>>>>>> {"mimeType":"application/zip","count":"6"}
>>>>>> {"mimeType":"application/xml","count":"11"}
>>>>>> {"mimeType":"image/png","count":"110"}
>>>>>> {"mimeType":"image/jpeg","count":"70"}
>>>>>> {"mimeType":"application/atom+xml","count":"213"}
>>>>>> {"mimeType":"application/rss+xml","count":"43"}
>>>>>> {"mimeType":"video/mp4","count":"3"}
>>>>>> {"mimeType":"text/plain","count":"104"}
>>>>>> {"mimeType":"application/rdf+xml","count":"2"}
>>>>>> {"mimeType":"image/gif","count":"2"}
>>>>>> {"mimeType":"text/x-php","count":"1"}
>>>>>> {"mimeType":"video/x-msvideo","count":"1"}
>>>>>> {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>> {"mimeType":"text/html","count":"9506"}
>>>>>> {"mimeType":"application/pdf","count":"280"}
>>>>>>]
>>>>>>
>>>>>>It turns out that aggregate.py is not what produces the
>>>>>>statistical information; I am not sure what it does. Anyway, I
>>>>>>think I understand the whole idea and I concur with it: maybe we
>>>>>>can repackage Tika with the feature (i.e. probabilistic mime
>>>>>>selection) incorporated and see whether it outputs the same
>>>>>>information in the log as the version without it.
>>>>>>
>>>>>>BTW, regarding the use of the probabilistic mime selection
>>>>>>feature: in my pull request I added a simple test case that might
>>>>>>tell a bit more about how the feature is called and used; it is
>>>>>>simple, though.
>>>>>>Here is an example snippet:
>>>>>>    ProbabilisticMimeDetectionSelector probSel = new ProbabilisticMimeDetectionSelector();
>>>>>>    probSel.detect(input, metadata);  // input: InputStream, metadata: Metadata
>>>>>>It is similar to MimeTypes::detect(...) (more information on this
>>>>>>can be found in https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>Now, in order to allow Tika().detect() to call
>>>>>>ProbabilisticMimeDetectionSelector::detect(...) (as
>>>>>>Tika().detect() is called by commoncrawldump), we need to
>>>>>>modify/add some code in TikaConfig, which initializes a list of
>>>>>>default detectors: we need to remove the detector
>>>>>>mimeTypes::MimeTypes from the list and replace it with
>>>>>>probSel::ProbabilisticMimeDetectionSelector. (Not sure if I
>>>>>>should create another pull request with this change for
>>>>>>TikaConfig.)
>>>>>>
>>>>>>I think that is all of my initial thoughts, with some findings
>>>>>>and a plan; if you have anything to add or comment on, please
>>>>>>kindly let me know, and then I will start working on my 'finale'.
>>>>>>BTW, don’t worry: graduation is not my termination with Tika and
>>>>>>this project. After I graduate I still can and want to help this
>>>>>>polar project and Tika as much as possible, correct programming
>>>>>>faults and bugs, respond to Tika issues, etc.
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
>>>>>>Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; memex-...@googlegroups.com
>>>>>>Subject: Re: this week action from luke
>>>>>>Importance: High
>>>>>>
>>>>>>Awesome Luke.
>>>>>>I am going to work specifically on now benchmarking your code in
>>>>>>real situations. For example, it would be fantastic to now run
>>>>>>your Bayesian MIME detector over the whole NSF TREC Dynamic
>>>>>>Domain data for Polar described here:
>>>>>>
>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>
>>>>>>Paul Zimdars, CC’ed, can provide you with access to the data, and
>>>>>>Annie can explain it, also CC’ed.
>>>>>>
>>>>>>Can we make that your goal for the next 2 weeks to actually test
>>>>>>it and produce a real result over the whole TREC-DD data for
>>>>>>Polar? My goal will be to get your code committed and integrated
>>>>>>into Tika. The more you can write me a guide of how to build and
>>>>>>test your code with Tika so I can get it committed the better.
>>>>>>
>>>>>>Also CC’ing the Memex list for context. Note everyone: Luke is
>>>>>>building a Bayesian MIME classifier to evaluate against Tika’s
>>>>>>existing MIME detection approach. If folks have any Memex needs
>>>>>>to try and test more accurate file identification with Tika, Luke
>>>>>>is the guy to talk to and I have him for 2 more weeks.
>>>>>>
>>>>>>Thanks!
>>>>>>
>>>>>>Cheers,
>>>>>>Chris
>>>>>>
>>>>>>------------------------
>>>>>>Chris Mattmann
>>>>>>chris.mattm...@gmail.com
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Luke liu <shuai...@usc.edu>
>>>>>>Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>To: Chris Mattmann <chris.mattm...@gmail.com>, Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>
>>>>>>Cc: 'Luke' <hanson311...@gmail.com>
>>>>>>Subject: this week action from luke
>>>>>>
>>>>>>Hi Professor Mattmann,
>>>>>>
>>>>>>I think I am in the final phase of the research, and last week I
>>>>>>finished the last item in the list, and hopefully everything will
>>>>>>be fine.
>>>>>>
>>>>>>For now, I can probably spend some time verifying or optimizing
>>>>>>the code; the majority of the research has been done. It would
>>>>>>also be great if you could comment on my work (the 2 pull
>>>>>>requests) when you have time.
>>>>>>
>>>>>>If anything in my work is confusing, please do let me know.
>>>>>>
>>>>>>Thanks, and I am glad to be working with you. For the next couple
>>>>>>of weeks before graduation, I am going to continue revising and
>>>>>>testing the code and features to get rid of flaws (if any) when I
>>>>>>have time.
>>>>>>
>>>>>>Not sure if I have missed something; if I have missed something
>>>>>>important, please do let me know too.
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>><garbled.jpg><1423894754000.html>