Thanks Luke.

So I guess all I was asking was could you try it out. Thanks for the
lesson in the RFC.

Cheers,
Chris

------------------------
Chris Mattmann
chris.mattm...@gmail.com




-----Original Message-----
From: Luke <hanson311...@gmail.com>
Date: Wednesday, April 22, 2015 at 1:46 AM
To: Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>, Chris Mattmann
<chris.mattm...@gmail.com>, "'Totaro, Giuseppe U (3980-Affiliate)'"
<tot...@di.uniroma1.it>, <dev@tika.apache.org>
Cc: "'Bryant, Ann C (398G-Affiliate)'" <anniebry...@gmail.com>, "'Zimdars,
Paul A (3980-Affiliate)'" <paul.a.zimd...@jpl.nasa.gov>, NSF Polar
CyberInfrastructure DR Students <nsf-polar-usc-stude...@googlegroups.com>,
<memex-...@googlegroups.com>
Subject: RE: [memex-jpl] this week action from luke

>Hi professor,
>
>
>I think it highly depends on the content being read by tika. E.g. if
>there is a sequence of bytes in the file being read that is the same as
>the magic of one or more mime types defined in our tika-mimetypes.xml, I
>guess that tika will put those types in its estimation list; please note
>there could be multiple mime types estimated by the magic-byte detection
>approach. Now tika also considers the decision made by the extension
>detection approach: if the extension says the file type it believes is the
>first one in the magic type estimation list, then certainly the first one
>will be returned (the same applies to the metadata hint approach).
>Of course, tika also prefers the type that is the most specialized.
>
>let's get back to the following question, here is my guess though.
>[Prof]: Also what happens if you tweak the definition of XHTML to not
>scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>Let's consider an extreme case where we only scan 10 or 1 bytes; then it
>seems that magic bytes will inevitably detect nothing, and I think it
>will return something like "application/octet-stream", which is the most
>general type. As mentioned, tika favours the one that is the most
>specialized, and in this extreme case I believe almost every type is a
>subclass of "application/octet-stream", so if the extension approach
>returns a more specialized type, that one wins. Therefore the answer in
>this extreme case may be yes: I think it is very possible that the CBOR
>type detected by the extension approach takes over in this case...
>
>My idea was and still is that if the cbor self-describing tag 55799 is
>present in the cbor file, then that can be used to detect the cbor type.
>Again, the cbor type will probably be appended to the magic estimation
>list together with another one such as application/xhtml+xml. I guess the
>order in the list probably also matters: the first one is preferred over
>the next one. The decision from the extension detection approach
>also plays a role in breaking the tie,
>e.g. if the extension detection method agrees on cbor with one of the
>estimated types in the magic list, then cbor will be returned (again, the
>same thing applies to the metadata hint method).
>
>I have not taken a closer look at a cbor file that has the tag 55799, but
>I expect its hex to be something like 0xd9d9f7, i.e. the tag should be
>present in the header as a fixed sequence of bytes
>(https://tools.ietf.org/html/rfc7049#section-2.4.5 ). If this is
>present in the file, or preferably in the header (within a reasonable
>range of bytes), I believe it can probably be used as the magic number
>for the cbor type.
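To make the magic-byte idea concrete, here is a minimal sketch in plain Java (no Tika classes; the class and method names are made up for illustration). It only checks whether a buffer begins with the three bytes 0xD9 0xD9 0xF7, which is how tag 55799 serializes according to RFC 7049 section 2.4.5. It is not how Tika implements magic matching, just the byte comparison such a magic entry would express.

```java
import java.util.Arrays;

public class CborSelfDescribe {
    // Tag 55799 is 0xD9F7. As a CBOR tag (major type 6) with a 2-byte
    // argument it serializes as 0xD9 0xD9 0xF7 (RFC 7049, section 2.4.5).
    private static final byte[] SELF_DESCRIBE_TAG =
            {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    /** Returns true if the data begins with the self-describe CBOR tag. */
    public static boolean startsWithSelfDescribeTag(byte[] data) {
        if (data == null || data.length < SELF_DESCRIBE_TAG.length) {
            return false;
        }
        byte[] head = Arrays.copyOf(data, SELF_DESCRIBE_TAG.length);
        return Arrays.equals(head, SELF_DESCRIBE_TAG);
    }

    public static void main(String[] args) {
        // Tag followed by the one-character text string "a" (0x61 0x61).
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, 0x61, 0x61};
        // The start of a JSON text: '{' '}'.
        byte[] json = {0x7B, 0x7D};
        System.out.println(startsWithSelfDescribeTag(tagged));
        System.out.println(startsWithSelfDescribeTag(json));
    }
}
```

A check at offset 0 only helps if the encoder actually writes the tag, which is the open question in this thread for the files dumped by nutch.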
>
>
>There is another thing I mentioned in the jira ticket I opened
>yesterday against the cbor parser and detection: it is also possible that
>cbor content can be embedded inside a plain json file, and the way that a
>decoder can distinguish them in that file is by looking at the tag 55799
>again. This may rarely happen, but a robust parser might be able to take
>care of that. Tika might need to consider using the FasterXML library
>used by the nutch tool when developing the cbor parser...
>Again let me cite the same paragraph from the rfc,
>
>" a decoder might be able to parse both CBOR and JSON.
>   Such a decoder would need to mechanically distinguish the two
>   formats.  An easy way for an encoder to help the decoder would be to
>   tag the entire CBOR item with tag 55799, the serialization of which
>   will never be found at the beginning of a JSON text."
>
>
>Thanks
>Luke
>
>
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>Sent: Tuesday, April 21, 2015 9:49 PM
>To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate);
>'NSF Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
>Subject: Re: [memex-jpl] this week action from luke
>
>Hi Luke,
>
>Can you post the below conversation to dev@tika and summarize it there.
>Also what happens if you tweak the definition of XHTML to not scan until
>8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>
>Cheers,
>Chris
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398) NASA Jet
>Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department University of
>Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
>-----Original Message-----
>From: Luke <hanson311...@gmail.com>
>Date: Wednesday, April 22, 2015 at 12:19 AM
>To: Chris Mattmann <chris.mattm...@gmail.com>, "Totaro, Giuseppe U
>(3980-Affiliate)" <tot...@di.uniroma1.it>, Chris Mattmann
><chris.a.mattm...@jpl.nasa.gov>
>Cc: "Bryant, Ann C (398G-Affiliate)" <anniebry...@gmail.com>, "Zimdars,
>Paul A (3980-Affiliate)" <paul.a.zimd...@jpl.nasa.gov>, NSF Polar
>CyberInfrastructure DR Students <nsf-polar-usc-stude...@googlegroups.com>,
>"memex-...@googlegroups.com" <memex-...@googlegroups.com>
>Subject: RE: [memex-jpl] this week action from luke
>
>>Hi Professor,
>>Please see attached jpg for the difference.
>>Thanks
>>Luke
>>
>>-----Original Message-----
>>From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
>>Sent: Tuesday, April 21, 2015 5:27 PM
>>To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>memex-...@googlegroups.com
>>Subject: Re: [memex-jpl] this week action from luke
>>
>>Hey Luke what happens if you do java -jar /path/to/tika-app -m
>>/path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app -m <
>>/path/to/cbor/file.cbor any difference?
>>
>>------------------------
>>Chris Mattmann
>>chris.mattm...@gmail.com
>>
>>
>>
>>
>>-----Original Message-----
>>From: Luke <hanson311...@gmail.com>
>>Date: Tuesday, April 21, 2015 at 5:41 PM
>>To: 'Luke' <hanson311...@gmail.com>, Chris Mattmann
>><chris.mattm...@gmail.com>, 'Giuseppe Totaro' <tot...@di.uniroma1.it>,
>>Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>
>>Cc: "'Bryant, Ann C (398G-Affiliate)'" <anniebry...@gmail.com>,
>>"'Zimdars, Paul A (3980-Affiliate)'" <paul.a.zimd...@jpl.nasa.gov>, NSF
>>Polar CyberInfrastructure DR Students
>><nsf-polar-usc-stude...@googlegroups.com>,
>><memex-...@googlegroups.com>
>>Subject: RE: [memex-jpl] this week action from luke
>>
>>>Hi professor,
>>>I just sent a pull request for adding cbor extension.
>>>The interesting thing is that tika is still identifying the file
>>>dumped by the nutch dump tool as "application/xhtml+xml" even when I
>>>manually change the file extension to the correct one (i.e. *.cbor ).
>>>
>>>The reason is probably that tika identifies "application/xhtml+xml"
>>>by searching for "<html" in the file content (PFA:
>>>xhtml+xml.jpg). Now if you take a look at the cbor file dumped by
>>>nutch,
>>>you see that we do have that element as part of the cbor content,
>>>because the entire crawled xhtml document seems to be embedded in the
>>>cbor json (PFA:
>>>cbor.jpg). Also, in Tika, magic detection seems to have higher
>>>priority than glob detection, thus the type is being incorrectly
>>>detected.
>>>
>>>Therefore, I would like to mention that adding the entry
>>><glob pattern="*.cbor"/> does not resolve the issue as of now without
>>>some fixed magic bytes / patterns for cbor.
>>>I would also like to add that things will be different with our
>>>probabilistic mime detection selector, because if we know that the
>>>file extension is more reliable than magic bytes, then we can
>>>certainly add more preferential weight to the extension... this also
>>>might show that the current implementation of MimeTypes detection is a
>>>bit stiff or less flexible in this scenario. :)
>>>
>>>
>>>Thanks
>>>Luke
>>>
>>>-----Original Message-----
>>>From: Luke [mailto:hanson311...@gmail.com]
>>>Sent: Tuesday, April 21, 2015 12:14 PM
>>>To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
>>>(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
>>>'memex-...@googlegroups.com'
>>>Subject: RE: [memex-jpl] this week action from luke
>>>
>>>Yes, let me add the cbor extension entry in tika xml, will send the
>>>pull request soon.
>>>
>>>Thanks
>>>Luke
>>>-----Original Message-----
>>>From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
>>>Sent: Tuesday, April 21, 2015 6:51 AM
>>>To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
>>>(3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
>>>memex-...@googlegroups.com
>>>Subject: Re: [memex-jpl] this week action from luke
>>>
>>>Giuseppe both of these ideas supporting the CBOR WRITE_TYPE_HEADER and
>>>tag along with adding an -extension command would be fantastic. Can
>>>you file both of those NUTCH issues, wait a day or so, and then based
>>>on feedback use your new Nutch commit karma to get those into Nutch?
>>>
>>>And then when creating the issues, can you link to the TIKA-1610 issue?
>>>At that point, when those two to be defined NUTCH issues are up, Luke,
>>>in parallel can you throw up a pull request/patch in Tika for the
>>>extension along with the MIME detection?
>>>
>>>Cheers,
>>>Chris
>>>
>>>------------------------
>>>Chris Mattmann
>>>chris.mattm...@gmail.com
>>>
>>>
>>>
>>>
>>>-----Original Message-----
>>>From: Giuseppe Totaro <tot...@di.uniroma1.it>
>>>Date: Tuesday, April 21, 2015 at 12:33 PM
>>>To: Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>
>>>Cc: Luke <hanson311...@gmail.com>, Chris Mattmann
>>><chris.mattm...@gmail.com>, "Bryant, Ann C (398G-Affiliate)"
>>><anniebry...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>><paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR
>>>Students <nsf-polar-usc-stude...@googlegroups.com>,
>>>"memex-...@googlegroups.com"
>>><memex-...@googlegroups.com>
>>>Subject: Re: [memex-jpl] this week action from luke
>>>
>>>>Thanks Luke. Great work.
>>>>Chris, we wrap a single string value, representing the JSON text, for
>>>>each file into CBOR (by using the serializeCBORData method). For
>>>>instance, using the Unix hex dump tool, we can see that, as expected,
>>>>the first byte of all files is "0x7F" (the first three bits are
>>>>"011", that is the major type for text strings, and the following 5
>>>>bits are "11010", meaning a uint32_t encodes the length of the
>>>>following text), and the following 4 bytes (an unsigned 32-bit
>>>>integer) encode the length of the text (as described in RFC7049
>>>><http://tools.ietf.org/html/rfc7049>).
>>>>Therefore, a CBOR tag is currently included into the file (a list of
>>>>cbor tags is available here
>>>><http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>I did not know about CBOR "magic header". Thanks a lot Luke for this
>>>>great research. Chris, if you agree, I can add support for prepending
>>>>self-describing CBOR tag 55799 to CommonCrawldataDumper class. I
>>>>believe it is very easy because I have to enable the
>>>>WRITE_TYPE_HEADER feature for CBORGenerator class (the source code is
>>>>available here 
>>>><https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>Then, I can comment the TIKA-1610
>>>><https://issues.apache.org/jira/browse/TIKA-1610> issue.
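A note on the byte layout described above: the initial byte of a CBOR data item packs the major type into the high-order 3 bits and the additional information into the low-order 5 bits (RFC 7049, section 2.1). The sketch below is plain Java with a hypothetical class name; it just splits a byte into those two fields. Note that the bit pattern 011 11010 works out to 0x7A (a text string whose length follows in 4 bytes), whereas 0x7F is 011 11111, which RFC 7049 uses for an indefinite-length text string, so it may be worth re-checking the hex dump to see which of the two is really at the start of the files.

```java
public class CborInitialByte {
    /** Major type: the high-order 3 bits of the initial byte. */
    public static int majorType(int initialByte) {
        return (initialByte & 0xFF) >>> 5;
    }

    /** Additional information: the low-order 5 bits of the initial byte. */
    public static int additionalInfo(int initialByte) {
        return initialByte & 0x1F;
    }

    public static void main(String[] args) {
        // 0x7A: text string, length in the next 4 bytes.
        // 0x7F: text string, indefinite length.
        // 0xD9: tag, tag number in the next 2 bytes (as used by tag 55799).
        int[] samples = {0x7A, 0x7F, 0xD9};
        for (int b : samples) {
            System.out.printf("0x%02X -> major type %d, additional info %d%n",
                    b, majorType(b), additionalInfo(b));
        }
    }
}
```

Major type 3 is a text string and major type 6 is a tag, so a file carrying the self-describing tag would start with a major type 6 byte rather than a major type 3 byte.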
>>>>
>>>>Regarding the file extension, in the Memex CCA format the original
>>>>file extension is used. We could add support for a -extension
>>>>command-line option allowing the user to give a file extension (e.g.,
>>>>cbor) for all files dumped out.
>>>>
>>>>Thanks a lot,
>>>>Giuseppe
>>>>
>>>>
>>>>
>>>>On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980)
>>>><chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>
>>>>Thanks for this great research, Luke!
>>>>
>>>>Giuseppe, any idea why this tag doesn’t make it into the file?
>>>>
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>Chris Mattmann, Ph.D.
>>>>Chief Architect
>>>>Instrument Software and Science Data Systems Section (398) NASA Jet
>>>>Propulsion Laboratory Pasadena, CA 91109 USA
>>>>Office: 168-519, Mailstop: 168-527
>>>>Email: chris.a.mattm...@nasa.gov
>>>>WWW:  http://sunset.usc.edu/~mattmann/
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>Adjunct Associate Professor, Computer Science Department University
>>>>of Southern California, Los Angeles, CA 90089 USA
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Luke <hanson311...@gmail.com>
>>>>Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>To: Chris Mattmann <chris.mattm...@gmail.com>, "Totaro, Giuseppe U
>>>>(3980-Affiliate)" <tot...@di.uniroma1.it>, Chris Mattmann
>>>><chris.a.mattm...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)"
>>>><anniebry...@gmail.com>, "Zimdars, Paul A (3980-Affiliate)"
>>>><paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR
>>>>Students <nsf-polar-usc-stude...@googlegroups.com>,
>>>>"memex-...@googlegroups.com"
>>>><memex-...@googlegroups.com>
>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>
>>>>>Thanks professor.
>>>>>Hi professor and all.
>>>>>JIRA issue : CBOR Parser and detection improvement
>>>>>https://issues.apache.org/jira/browse/TIKA-1610
>>>>>
>>>>>I tried to conduct a bit of research on this cbor detection.
>>>>>
>>>>>It looks like there is a self-describing tag that needs to be
>>>>>written in the cbor file, through which other applications might be
>>>>>able to identify the cbor type....
>>>>>Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>
>>>>>I don’t see that tag being present in the cbor file dumped by the
>>>>>nutch tool, though I am not very sure.
>>>>>
>>>>>Thanks
>>>>>Luke
>>>>>
>>>>>
>>>>>
>>>>>-----Original Message-----
>>>>>From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
>>>>>Sent: Monday, April 20, 2015 4:10 AM
>>>>>To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C
>>>>>(398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF Polar
>>>>>CyberInfrastructure DR Students'; memex-...@googlegroups.com
>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>
>>>>>Nice one, Luke. If you have a second and you can open up an issue in
>>>>>Tika to make it support CBOR, then yes, by all means! :)
>>>>>
>>>>>
>>>>>------------------------
>>>>>Chris Mattmann
>>>>>chris.mattm...@gmail.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>-----Original Message-----
>>>>>From: Luke <hanson311...@gmail.com>
>>>>>Date: Monday, April 20, 2015 at 4:15 AM
>>>>>To: 'Giuseppe Totaro' <tot...@di.uniroma1.it>, Chris Mattmann
>>>>><chris.mattm...@gmail.com>, Chris Mattmann
>>>>><chris.a.mattm...@jpl.nasa.gov>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>><anniebry...@gmail.com>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>><paul.a.zimd...@jpl.nasa.gov>, NSF Polar CyberInfrastructure DR
>>>>>Students <nsf-polar-usc-stude...@googlegroups.com>,
>>>>><memex-...@googlegroups.com>
>>>>>Subject: RE: [memex-jpl] this week action from luke
>>>>>
>>>>>>Thanks a lot Giuseppe for the prompt response clearing up a bit of
>>>>>>my confusion with the Nutch CommonCrawlDataDumper , appreciated.
>>>>>>
>>>>>>BTW, it looks like Tika might need to consider supporting a
>>>>>>CBOR parser and detection.
>>>>>>I checked the rfc, it looks like CBOR has not got magic numbers. PFA:
>>>>>>rfc_cbor.jpg
>>>>>>Actually, I don’t quite understand why the CommonCrawlDataDumper
>>>>>>is not dumping the nutch segments with the .cbor extension, which
>>>>>>seems to be helpful for type detection.
>>>>>>
>>>>>>To professor Mattmann,
>>>>>>Tika does not support the detection of CBOR. Although the trunk
>>>>>>version has entries (PFA: cbor_tika.mimetypes.xml) for cbor in
>>>>>>tika-mimetypes.xml, those entries are not properly detecting
>>>>>>the cbor files dumped by CommonCrawlDataDumper. Also, CBOR does not
>>>>>>have magic bytes; off the top of my head, the only way we can detect
>>>>>>it is using the extension and a content byte histogram (please note,
>>>>>>this is a locally optimal solution and data-dependent.)  :)
>>>>>>
>>>>>>I think I am deviating a bit from the main route and discussion of
>>>>>>this thread, i.e. the plan for testing the “probabilistic mime
>>>>>>detector selection” with polar data.
>>>>>>Anyway, I plan to repackage tika by incorporating the probabilistic
>>>>>>selection feature and replace the tika jar in nutch with the
>>>>>>repackaged one, and then run the CommonCrawlDataDumper and see how
>>>>>>it goes. If you have any specific ideas and thought with the
>>>>>>testing, please kindly let me know.
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>From: Giuseppe Totaro [mailto:tot...@di.uniroma1.it]
>>>>>>Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>To: Luke liu
>>>>>>Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C (398G-Affiliate);
>>>>>>Zimdars, Paul A (3980-Affiliate); Luke; NSF Polar
>>>>>>CyberInfrastructure DR Students; memex-...@googlegroups.com
>>>>>>Subject: Re: [memex-jpl] this week action from luke
>>>>>>
>>>>>>
>>>>>>
>>>>>>Hi Luke,
>>>>>>
>>>>>>
>>>>>>my name is Giuseppe and I am a PhD student working under the
>>>>>>supervision of Prof. Chris Mattmann. I worked on
>>>>>>CommonCrawlDataDumper tool, so I can give some feedback on a couple
>>>>>>of your observations. My comments inline below.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Il giorno 19/apr/2015, alle ore 12:11, Luke liu <shuai...@usc.edu>
>>>>>>ha
>>>>>>scritto:
>>>>>>
>>>>>>
>>>>>>Thanks a lot professor; Sorry for the brief delay, I was spending
>>>>>>some time in understanding the code repo i.e.
>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>
>>>>>>From gen-common-crawl.sh, it looks like commoncrawldump is dumping
>>>>>>the crawl segments to json files with the human readable and
>>>>>>understandable content.
>>>>>>1) I am trying to run one of the commands on my side as shown in
>>>>>>gen-common-crawl.sh, but the generated files all end with .html or
>>>>>>.htm. The command listed in gen-common-crawl.sh seems to allude
>>>>>>to where the data is located on our nsfpolardata.dyndns.org
>>>>>><http://nsfpolardata.dyndns.org/>; although the locations are not
>>>>>>exactly correct (probably they need to be updated), part of the
>>>>>>patterns allowed me to locate some similar datasets (e.g.
>>>>>>/data2/crawls/raw/CS572Spring2015 ). Again I am seeing that the
>>>>>>dumped files all end with html, but surprisingly, inside those
>>>>>>outputted html files, the contents are present in json format;
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>The file extension is (almost) always the same as the original file.
>>>>>>More in detail, using the -epochFilename command-line option (as in
>>>>>>gen-common-crawl.sh), the scraped data will be stored with a
>>>>>>filename of the format <epochtime(milliseconds)>.<filetype>, where
>>>>>><filetype> is either the extension of the original file or .html as
>>>>>>default if the original file does not have an extension. This
>>>>>>schema is used for file naming and it does not depend on internal
>>>>>>output format (JSON).
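The naming rule described above can be sketched in a few lines of plain Java. This is only an illustration of the rule as stated in this mail (epoch time in milliseconds, then the original extension, falling back to html); the class and method names are hypothetical and this is not the code in CommonCrawlDataDumper, whose extension handling may differ in detail.

```java
public class DumpFileName {
    /**
     * Builds "<epochMillis>.<extension>", using the extension of the
     * original file name, or "html" when the original has none.
     */
    public static String epochFileName(long epochMillis, String originalName) {
        // Look only at the last path segment, so dots in directories are ignored.
        String base = originalName.substring(originalName.lastIndexOf('/') + 1);
        int dot = base.lastIndexOf('.');
        boolean hasExtension = dot > 0 && dot < base.length() - 1;
        String extension = hasExtension ? base.substring(dot + 1) : "html";
        return epochMillis + "." + extension;
    }

    public static void main(String[] args) {
        System.out.println(epochFileName(1423894754000L, "docs/report.pdf"));
        System.out.println(epochFileName(1423894754000L, "docs/index"));
    }
}
```

Under this rule the extension says nothing about the container format, which is why a CBOR-encoded dump of an HTML page still ends in .html.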
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>2) Another problem is that the root object is being set with some
>>>>>>garbled chars in each of the outputted json files (with extension
>>>>>>html in the end), PFA: garbled.jpg; one of the outputted json
>>>>>>files has also been attached as an example (PFA:
>>>>>>1423894754000.html). The json files cannot be parsed properly by
>>>>>>aggregate.py due to those garbled chars.
>>>>>>Even if I get rid of those garbled chars, there is no mimeTypes
>>>>>>element, which is what aggregate.py reads.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Text content and metadata extracted from the crawled binary data
>>>>>>are stored in a structured document format (JSON). Furthermore,
>>>>>>this document is encoded using CBOR <http://cbor.io/>
>>>>>>serialization. Each non-human-readable character that you notice in
>>>>>>front and at the end of the JSON data is due to CBOR encoding. Thus,
>>>>>>if you need to read JSON data from a document dumped out by
>>>>>>CommonCrawlDataDumper, you have to deserialize the CBOR-encoded
>>>>>>data structure inside the file.
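As a rough illustration of the deserialization step, here is a stdlib-only Java sketch that pulls the JSON text out of a CBOR text string. It assumes the file holds a single text string (major type 3), either with a definite length or as an indefinite-length sequence of chunks closed by the 0xFF break byte; the class name is hypothetical, trailing bytes are ignored, and for real data a CBOR library such as the Jackson CBOR module mentioned elsewhere in this thread is the safer choice.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class CborTextUnwrap {
    /** Extracts the text of a single CBOR text string item. */
    public static String unwrapText(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data); // big-endian, as CBOR requires
        int initial = buf.get() & 0xFF;
        if ((initial >>> 5) != 3) {
            throw new IllegalArgumentException("not a CBOR text string");
        }
        int info = initial & 0x1F;
        if (info != 31) {
            return readDefinite(buf, info);
        }
        // Indefinite length: definite-length chunks until the break byte.
        StringBuilder text = new StringBuilder();
        while (true) {
            int next = buf.get() & 0xFF;
            if (next == 0xFF) {
                return text.toString();
            }
            if ((next >>> 5) != 3) {
                throw new IllegalArgumentException("chunk is not a text string");
            }
            text.append(readDefinite(buf, next & 0x1F));
        }
    }

    // Reads the length encoded by the additional information, then the text.
    private static String readDefinite(ByteBuffer buf, int info) {
        long length;
        if (info < 24) {
            length = info;                        // length stored directly
        } else if (info == 24) {
            length = buf.get() & 0xFF;            // 1-byte length
        } else if (info == 25) {
            length = buf.getShort() & 0xFFFF;     // 2-byte length
        } else if (info == 26) {
            length = buf.getInt() & 0xFFFFFFFFL;  // 4-byte length
        } else {
            throw new IllegalArgumentException("unsupported length encoding " + info);
        }
        if (length > buf.remaining()) {
            throw new IllegalArgumentException("truncated text string");
        }
        byte[] bytes = new byte[(int) length];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] json = "{\"a\":1}".getBytes(StandardCharsets.UTF_8);
        byte[] definite = ByteBuffer.allocate(5 + json.length)
                .put((byte) 0x7A).putInt(json.length).put(json).array();
        System.out.println(unwrapText(definite));
    }
}
```

The header bytes in front (and the break byte at the end, in the indefinite-length case) are exactly the kind of unreadable characters that show up around the JSON when the file is opened as text.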
>>>>>>
>>>>>>
>>>>>>
>>>>>>I hope this short overview can help in your work. I really
>>>>>>appreciate your feedback and, by the way, thanks a lot for your
>>>>>>great job in detection.
>>>>>>
>>>>>>I am available to provide all the support I can give, so do not
>>>>>>hesitate to contact me if you need any further information.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Thanks,
>>>>>>
>>>>>>Giuseppe
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>Finally, after some research, I guess that the statistical
>>>>>>information (present in the readme of the code repo) is not being
>>>>>>collected and computed by aggregate.py from those output json files
>>>>>>but it looks like it is coming from the log.... see the following
>>>>>>as an example:
>>>>>>
>>>>>>2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper -
>>>>>>CommonsCrawlDataDumper File Stats:
>>>>>>TOTAL Stats:
>>>>>>[
>>>>>>   {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>   {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>   {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>   {"mimeType":"application/octet-stream","count":"641"}
>>>>>>   {"mimeType":"application/epub+zip","count":"1"}
>>>>>>   {"mimeType":"application/zip","count":"6"}
>>>>>>   {"mimeType":"application/xml","count":"11"}
>>>>>>   {"mimeType":"image/png","count":"110"}
>>>>>>   {"mimeType":"image/jpeg","count":"70"}
>>>>>>   {"mimeType":"application/atom+xml","count":"213"}
>>>>>>   {"mimeType":"application/rss+xml","count":"43"}
>>>>>>   {"mimeType":"video/mp4","count":"3"}
>>>>>>   {"mimeType":"text/plain","count":"104"}
>>>>>>   {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>   {"mimeType":"image/gif","count":"2"}
>>>>>>   {"mimeType":"text/x-php","count":"1"}
>>>>>>   {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>   {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>   {"mimeType":"text/html","count":"9506"}
>>>>>>   {"mimeType":"application/pdf","count":"280"}
>>>>>>]
>>>>>>
>>>>>>It turns out that aggregate.py is not the one that produces the
>>>>>>statistical information, not sure what it does... but anyway, I
>>>>>>think I understand the whole idea and I do concur with it, might be
>>>>>>we can repackage the tika by incorporating the feature (i.e.
>>>>>>probabilistic mime
>>>>>>selection) in it and see if it can output the same information as
>>>>>>the one without it in the log.
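If the statistics do come from the log, comparing two runs (with and without the probabilistic selector) could be done by pulling the counts out of the "File Stats" block. Below is a small stdlib-only Java sketch under that assumption; the class name is made up, and the regular expression only targets the entry format shown in the log excerpt above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MimeStatsFromLog {
    // Matches entries such as {"mimeType":"text/html","count":"9506"}
    private static final Pattern ENTRY = Pattern.compile(
            "\\{\"mimeType\":\"([^\"]+)\",\"count\":\"(\\d+)\"\\}");

    /** Collects mimeType -> count from log text, keeping log order. */
    public static Map<String, Integer> parse(String log) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(log);
        while (m.find()) {
            counts.merge(m.group(1), Integer.parseInt(m.group(2)), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String log = String.join("\n",
                "TOTAL Stats:",
                "[",
                "   {\"mimeType\":\"application/xhtml+xml\",\"count\":\"3000\"}",
                "   {\"mimeType\":\"text/html\",\"count\":\"9506\"}",
                "   {\"mimeType\":\"application/pdf\",\"count\":\"280\"}",
                "]");
        Map<String, Integer> counts = parse(log);
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        System.out.println(counts);
        System.out.println("total files: " + total);
    }
}
```

Running it over the logs of both runs and diffing the two maps would show which types the selector moves files between.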
>>>>>>
>>>>>>BTW, Regarding the use of the feature with probabilistic mime
>>>>>>selection:
>>>>>>in my pull request, I added a simple test case which might tell a
>>>>>>bit more about how the feature is called and used, it is simple
>>>>>>though.
>>>>>>Here is an example snippet:
>>>>>>                ProbabilisticMimeDetectionSelector probSel =
>>>>>>                    new ProbabilisticMimeDetectionSelector();
>>>>>>                probSel.detect(input::InputStream, metadata::Metadata)
>>>>>>It is similar to MimeTypes::detect(...) (more information
>>>>>>on this can be found in
>>>>>>https://issues.apache.org/jira/browse/TIKA-1517).
>>>>>>Now, in order to allow Tika().detect() to call
>>>>>>ProbabilisticMimeDetectionSelector::detect(...) (as Tika().detect()
>>>>>>is being called by commoncrawldump), we need to modify/add some
>>>>>>code in the TikaConfig which initializes a list of default
>>>>>>detectors, and we need to get rid of the detector
>>>>>>mimeTypes::MimeTypes in the list and replace it with
>>>>>>probSel::ProbabilisticMimeDetectionSelector (not sure if I should
>>>>>>create another pull request with this change for
>>>>>>TikaConfig).
>>>>>>
>>>>>>I think that is all of my initial thought with some finding and
>>>>>>plan; if you have anything you would like to please add and
>>>>>>comment, please do kindly let me know, then I will start working on
>>>>>>my 'finale'. BTW, don’t worry, even after I am graduated, the
>>>>>>graduation is not my termination with tika and this project, after
>>>>>>then I still can and want to help this polar project and tika as
>>>>>>much as possible, and correct the programming faults and bugs,
>>>>>>respond to the tika issues ,etc.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
>>>>>>Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C
>>>>>>(398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students;
>>>>>>memex-...@googlegroups.com
>>>>>>Subject: Re: this week action from luke
>>>>>>Importance: High
>>>>>>
>>>>>>Awesome Luke. I am going to work specifically on now benchmarking
>>>>>>your code in real situations. For example, it would be fantastic to
>>>>>>now run your Bayesian MIME detector over the whole NSF TREC Dynamic
>>>>>>Domain data for Polar described here:
>>>>>>
>>>>>>http://github.com/chrismattmann/trec-dd-polar/
>>>>>>
>>>>>>Paul Zimdars, CC’ed, can provide you with access to the data, and
>>>>>>Annie can explain it, also CC’ed.
>>>>>>
>>>>>>Can we make that your goal for the next 2 weeks to actually test it
>>>>>>and produce a real result over the whole TREC-DD data for Polar? My
>>>>>>goal will be to get your code committed and integrated into Tika.
>>>>>>The more you can write me a guide of how to build and test your
>>>>>>code with Tika so I can get it committed the better.
>>>>>>
>>>>>>Also CC’ing the Memex list for context. Note everyone: Luke is
>>>>>>building a Bayesian MIME classifier to evaluate against Tika’s
>>>>>>existing MIME detection approach. If folks have any Memex needs to
>>>>>>try and test more accurate file identification with Tika, Luke is
>>>>>>the guy to talk to and I have him for 2 more weeks.
>>>>>>
>>>>>>Thanks!
>>>>>>
>>>>>>Cheers,
>>>>>>Chris
>>>>>>
>>>>>>------------------------
>>>>>>Chris Mattmann
>>>>>>chris.mattm...@gmail.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Luke liu <shuai...@usc.edu>
>>>>>>Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>To: Chris Mattmann <chris.mattm...@gmail.com>, Chris Mattmann
>>>>>><chris.a.mattm...@jpl.nasa.gov>
>>>>>>Cc: 'Luke' <hanson311...@gmail.com>
>>>>>>Subject: this week action from luke
>>>>>>
>>>>>>
>>>>>>
>>>>>>Hi Professor Mattmann,
>>>>>>
>>>>>>I think I am in the final phase of the research, and last week I
>>>>>>finished the last item in the list, and hopefully everything will
>>>>>>be fine.
>>>>>>
>>>>>>For now, i probably can spend some time in verifying or optimizing
>>>>>>the codes, the majority of the research has been done…and it will
>>>>>>be also great if you can please comment on my work (the 2 pull
>>>>>>requests) when you have time.
>>>>>>
>>>>>>If you do have confusion with any of my work, please also do let me
>>>>>>know.
>>>>>>
>>>>>>Thanks, and I am glad to be working with you. For the next couple of
>>>>>>weeks before graduation, I am going to continue revising and
>>>>>>testing the code and features to get rid of some flaws (if any)
>>>>>>when I have time.
>>>>>>
>>>>>>Not sure if I have missed something; if I have missed something
>>>>>>important, please do let me know too.
>>>>>>
>>>>>>Thanks
>>>>>>Luke
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>

