Hi Prof,

The test has finished, and the result is as expected.
Both runs (Tika with the prob feature and Tika without it) produced the same
"Stats total". Please see the attached matched.txt, dumped by a small
program that compares, verbatim and line by line, every section of the
"Stats total" in the log produced by the Tika that has the feature against
the log produced by the one without it.
If string.equals(...) holds for a line, that line is dumped out. If there is
a mismatch (e.g. the count for a particular mime type is different), an error
is dumped out. In the end, I don't see any error in the printout, so I think
the feature seems to have passed the test.
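
For reference, the comparison can be sketched roughly as below. This is a minimal sketch and not the attached program itself; the class name StatsCompare and the format of the error message are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StatsCompare {
    /**
     * Compares two "Stats total" sections line by line.
     * A line that matches verbatim is copied to the report; anything else
     * is reported as an error with its (1-based) line number.
     */
    static List<String> compare(List<String> withFeature, List<String> withoutFeature) {
        List<String> report = new ArrayList<>();
        int n = Math.max(withFeature.size(), withoutFeature.size());
        for (int i = 0; i < n; i++) {
            String a = i < withFeature.size() ? withFeature.get(i) : null;
            String b = i < withoutFeature.size() ? withoutFeature.get(i) : null;
            if (a != null && a.equals(b)) {
                report.add(a); // verbatim match: dump the line itself
            } else {
                report.add("ERROR line " + (i + 1) + ": [" + a + "] vs [" + b + "]");
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("{\"mimeType\":\"image/png\",\"count\":\"110\"}");
        List<String> b = Arrays.asList("{\"mimeType\":\"image/png\",\"count\":\"110\"}");
        compare(a, b).forEach(System.out::println);
    }
}
```

A missing line on either side is also reported as an error, so a shorter section cannot pass silently.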


The processing times of the 2 tests are as follows.
The following shows the start time and end time for the test where the Nutch
dumper tool ran with the prob selection feature.
from
2015-04-22 15:47:08,330
to
2015-04-22 17:48:28,877

The following shows the start time and end time for the test where the Nutch
dumper tool ran with the Tika that does not have the feature.
from
2015-04-22 22:41:23,459
to
2015-04-23 00:11:02,767
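
The elapsed times follow from the timestamps above: about 2 h 1 min for the run with the feature and about 1 h 30 min for the run without it (the second run crosses midnight). A small sketch of the arithmetic is below; the class name ElapsedTime is only an illustrative assumption, and the pattern assumes the log4j-style timestamps shown above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class ElapsedTime {
    // Log timestamps use a comma before the milliseconds
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the time elapsed between two log timestamps. */
    static Duration elapsed(String from, String to) {
        return Duration.between(LocalDateTime.parse(from, LOG_TS),
                                LocalDateTime.parse(to, LOG_TS));
    }

    public static void main(String[] args) {
        Duration withFeature = elapsed("2015-04-22 15:47:08,330", "2015-04-22 17:48:28,877");
        Duration withoutFeature = elapsed("2015-04-22 22:41:23,459", "2015-04-23 00:11:02,767");
        System.out.println("with feature:    " + withFeature);
        System.out.println("without feature: " + withoutFeature);
    }
}
```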


BTW, I forgot to mention that the probabilistic mime selector with default
weight settings also gives the following result, because by default I
intentionally assign a higher weight to the magic bytes method so as
to make it work in a way similar to the old strategy. On the other hand, if
I know that the extension is more reliable, I can certainly add more weight
to the extension approach; in this case, the prob mime selector will return
application/cbor, which then has the higher weight.

> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
> Result: "text/html"
> 
> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
> Result: "application/xhtml+xml"
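
To illustrate the effect of the weights described above, here is a deliberately simplified sketch. It is a plain weighted vote, not the actual probabilistic selector; the class name WeightedVote and the weight values 0.75 / 0.25 are assumptions chosen only for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WeightedVote {
    /**
     * Each detection method votes for one type with a weight; the type with
     * the highest total wins. On a tie the magic-bytes vote wins because it
     * is inserted first.
     */
    static String select(String magicType, double magicWeight,
                         String extensionType, double extensionWeight) {
        Map<String, Double> score = new LinkedHashMap<>();
        score.merge(magicType, magicWeight, Double::sum);
        score.merge(extensionType, extensionWeight, Double::sum);
        String best = null;
        for (Map.Entry<String, Double> e : score.entrySet()) {
            if (best == null || e.getValue() > score.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Magic bytes trusted more (hypothetical default-like weights)
        System.out.println(select("application/xhtml+xml", 0.75, "application/cbor", 0.25));
        // Extension trusted more
        System.out.println(select("application/xhtml+xml", 0.25, "application/cbor", 0.75));
    }
}
```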


Please let me know if anything about the tests is unclear.


Thanks
Luke

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:[email protected]] 
Sent: Wednesday, April 22, 2015 3:49 PM
To: Luke
Cc: Chris Mattmann; Totaro, Giuseppe U (3980-Affiliate);
[email protected]; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
(3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
[email protected]
Subject: Re: [memex-jpl] this week action from luke

Thanks Luke, this is probably a good opportunity to test out your Bayesian
mime detector. How does it perform here?

Sent from my iPhone

> On Apr 22, 2015, at 3:29 PM, Luke <[email protected]> wrote:
> 
> Hi professor,
> 
> Please see the following results.
> <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
> Result: "text/html"
> 
> <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
> Result: "application/xhtml+xml"
> 
> 
> Thanks
> Luke
> 
> -----Original Message-----
> From: Chris Mattmann [mailto:[email protected]]
> Sent: Wednesday, April 22, 2015 4:21 AM
> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
> (3980-Affiliate)'; [email protected]
> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
> [email protected]
> Subject: Re: [memex-jpl] this week action from luke
> 
> Hi Luke,
> 
> Actually I just meant go into tika-mimetypes.xml and change the magic
> offsets for application/xhtml+xml and see if that works. The code you
> changed below is actually how many bytes Tika will first download to do
> MIME checking.
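
To make the difference between the two settings concrete, here is a minimal sketch of an offset-window match, assuming that an offset such as 0:6000 means the pattern may start anywhere between byte 0 and byte 6000. It is not Tika's implementation; the class name OffsetWindowMatch and the position 3000 used in the example are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class OffsetWindowMatch {
    /** True if pattern starts at some byte offset in [from, to] of data. */
    static boolean matches(byte[] data, String pattern, int from, int to) {
        byte[] p = pattern.getBytes(StandardCharsets.US_ASCII);
        // The pattern must fit entirely inside the data
        int last = Math.min(to, data.length - p.length);
        for (int start = from; start <= last; start++) {
            if (Arrays.equals(Arrays.copyOfRange(data, start, start + p.length), p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] data = new byte[4000];
        byte[] p = "<html xmlns=".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(p, 0, data, 3000, p.length); // hypothetical position
        System.out.println(matches(data, "<html xmlns=", 0, 1024));
        System.out.println(matches(data, "<html xmlns=", 0, 6000));
    }
}
```

The window only limits where a match may start. How many bytes are available to search at all is a separate setting, which is what getMinLength() controls.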
> 
> Cheers,
> Chris
> 
> ------------------------
> Chris Mattmann
> [email protected]
> 
> 
> 
> 
> -----Original Message-----
> From: Luke <[email protected]>
> Date: Wednesday, April 22, 2015 at 2:25 AM
> To: Chris Mattmann <[email protected]>, Chris Mattmann
> <[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'"
> <[email protected]>, <[email protected]>
> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, 
> "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, 
> NSF Polar CyberInfrastructure DR Students 
> <[email protected]>,
> <[email protected]>
> Subject: RE: [memex-jpl] this week action from luke
> 
>> 
>> Hi professor,
>> 
>> I just tried it with minLength set to 1024, and I get the following:
>> "text/plain"
>> I am a bit surprised....
>> 
>> BTW, a min length of 6000 still gives "application/xhtml+xml"; with 
>> anything below 1024 min length, I am seeing "text/plain". :)
>> 
>> BTW, the min length I am referring to/altering is the following, in 
>> MimeTypes.java:
>>    public int getMinLength() {
>>        // This needs to be reasonably large to be able to correctly detect
>>        // things like XML root elements after initial comment and DTDs
>>        return 64 * 1024;
>>    }
>> 
>> 
>> Thanks
>> Luke
>> 
>> -----Original Message-----
>> From: Chris Mattmann [mailto:[email protected]]
>> Sent: Tuesday, April 21, 2015 7:48 PM
>> To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U 
>> (3980-Affiliate)'; [email protected]
>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>> [email protected]
>> Subject: Re: [memex-jpl] this week action from luke
>> 
>> Thanks Luke.
>> 
>> So I guess all I was asking was could you try it out. Thanks for the 
>> lesson in the RFC.
>> 
>> Cheers,
>> Chris
>> 
>> ------------------------
>> Chris Mattmann
>> [email protected]
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Luke <[email protected]>
>> Date: Wednesday, April 22, 2015 at 1:46 AM
>> To: Chris Mattmann <[email protected]>, Chris Mattmann 
>> <[email protected]>, "'Totaro, Giuseppe U (3980-Affiliate)'"
>> <[email protected]>, <[email protected]>
>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, 
>> "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, 
>> NSF Polar CyberInfrastructure DR Students 
>> <[email protected]>,
>> <[email protected]>
>> Subject: RE: [memex-jpl] this week action from luke
>> 
>>> Hi professor,
>>> 
>>> 
>>> I think it highly depends on the content being read by tika. E.g. if 
>>> a sequence of bytes in the file being read matches the magic of one 
>>> or more of the mime types defined in our tika-mimetypes.xml, I guess 
>>> that tika will put those types in its estimation list; please note 
>>> that the magic-byte detection approach can produce multiple estimated 
>>> mime types. Tika also considers the decision made by the extension 
>>> detection approach: if the extension says the file type is the first 
>>> one in the magic type estimation list, then certainly the first one 
>>> will be returned (the same applies to the metadata hint approach). 
>>> Of course, tika also prefers the type that is the most specialized.
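
The selection behaviour guessed at above can be sketched as follows. This is only a sketch of that guess and not Tika's code: the class name MagicThenHint is hypothetical, and the type hierarchy is reduced to the single rule that every type specializes application/octet-stream.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class MagicThenHint {
    static final String OCTET_STREAM = "application/octet-stream";

    /**
     * magicCandidates: types whose magic matched, in preference order (may be empty).
     * extensionType: type suggested by the file extension, or null if unknown.
     */
    static String select(List<String> magicCandidates, String extensionType) {
        if (magicCandidates.isEmpty()) {
            // Nothing matched: magic falls back to the most general type,
            // so any extension hint is more specialized and takes over.
            return extensionType != null ? extensionType : OCTET_STREAM;
        }
        if (extensionType != null && magicCandidates.contains(extensionType)) {
            return extensionType; // the extension agrees with one magic candidate
        }
        return magicCandidates.get(0); // otherwise the first magic candidate wins
    }

    public static void main(String[] args) {
        System.out.println(select(Arrays.asList("application/xhtml+xml"), "application/cbor"));
        System.out.println(select(Collections.<String>emptyList(), "application/cbor"));
    }
}
```

Under these assumptions a .cbor extension alone cannot override a magic match for application/xhtml+xml, but it does take over when magic detection finds nothing.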
>>> 
>>> Let's get back to the following question; here is my guess though.
>>> [Prof]: Also what happens if you tweak the definition of XHTML to 
>>> not scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over
>>> then?
>>> Let's consider an extreme case where we only scan 10 or 1 bytes; 
>>> then it seems that magic bytes will inevitably detect nothing, and I 
>>> think it will return something like "application/octet-stream", 
>>> which is the most general type. As mentioned, tika favours the type 
>>> that is the most specialized, so the extension approach wins if it 
>>> returns a more specialized one, and in this extreme case I believe 
>>> almost every type is a subclass of "application/octet-stream".... 
>>> Therefore the answer in this extreme case may be yes; I think it is 
>>> very possible that the CBOR type detected by the extension approach 
>>> takes over in this case...
>>> 
>>> My idea was and still is that if the cbor self-describing tag 55799 
>>> is present in the cbor file, then that can be used to detect the cbor
>>> type.
>>> Again, the cbor type will probably be appended to the magic 
>>> estimation list together with another one such as application/html; 
>>> I guess the order in the list probably also matters, with the first 
>>> one preferred over the next one. The decision from the extension 
>>> detection approach also plays a role in breaking the tie,
>>> e.g. if the extension detection method agrees on cbor with one of the 
>>> estimated types in the magic list, then cbor will be returned 
>>> (again, the same thing applies to the metadata hint method).
>>> 
>>> I have not taken a closer look at a cbor file that has the tag 
>>> 55799, but I expect its hex to be something like 0xd9d9f7, or the 
>>> tag should be present in the header as a fixed sequence of bytes
>>> (https://tools.ietf.org/html/rfc7049#section-2.4.5). If this 
>>> is present in the file, or preferably in the header (within a 
>>> reasonable range of bytes), I believe it can probably be used as 
>>> the magic number for the cbor type.
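
As a small illustration of the idea, RFC 7049 section 2.4.5 gives the serialization of tag 55799 as the three bytes 0xd9 0xd9 0xf7, so a check on the start of the file could look like the sketch below. The class and method names are hypothetical, and this is not a Tika detector.

```java
import java.util.Arrays;

class CborSelfDescribe {
    // Tag 55799 serialized: 0xd9 0xd9 0xf7 (RFC 7049, section 2.4.5)
    static final byte[] TAG_55799 = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    /** True if the data starts with the self-describe CBOR tag. */
    static boolean hasSelfDescribeTag(byte[] data) {
        return data.length >= TAG_55799.length
                && Arrays.equals(Arrays.copyOf(data, TAG_55799.length), TAG_55799);
    }

    /** Returns a copy of the CBOR item with the self-describe tag in front. */
    static byte[] prependSelfDescribeTag(byte[] cborItem) {
        byte[] out = new byte[TAG_55799.length + cborItem.length];
        System.arraycopy(TAG_55799, 0, out, 0, TAG_55799.length);
        System.arraycopy(cborItem, 0, out, TAG_55799.length, cborItem.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] emptyText = {0x60}; // CBOR text string of length 0
        System.out.println(hasSelfDescribeTag(emptyText));
        System.out.println(hasSelfDescribeTag(prependSelfDescribeTag(emptyText)));
    }
}
```

A file that lacks the tag is not necessarily non-CBOR, since the tag is optional; the check can only confirm, never rule out.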
>>> 
>>> 
>>> There is another thing I mentioned in the jira ticket I opened 
>>> yesterday against the cbor parser and detection: it is also possible 
>>> that cbor content is embedded inside a plain json file, and the way 
>>> that a decoder can distinguish them in that file is by looking at 
>>> the tag 55799 again. This may rarely happen, but a robust parser 
>>> might be able to take care of it; tika might need to consider the 
>>> FasterXML library used by the nutch tool when developing the cbor
>>> parser...
>>> Again let me cite the same paragraph from the rfc,
>>> 
>>> " a decoder might be able to parse both CBOR and JSON.
>>>  Such a decoder would need to mechanically distinguish the two  
>>> formats.  An easy way for an encoder to help the decoder would be to  
>>> tag the entire CBOR item with tag 55799, the serialization of which  
>>> will never be found at the beginning of a JSON text."
>>> 
>>> 
>>> Thanks
>>> Luke
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Mattmann, Chris A (3980) 
>>> [mailto:[email protected]]
>>> Sent: Tuesday, April 21, 2015 9:49 PM
>>> To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
>>> Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>> (3980-Affiliate); 'NSF Polar CyberInfrastructure DR Students'; 
>>> [email protected]
>>> Subject: Re: [memex-jpl] this week action from luke
>>> 
>>> Hi Luke,
>>> 
>>> Can you post the below conversation to dev@tika and summarize it there.
>>> Also what happens if you tweak the definition of XHTML to not scan 
>>> until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398) NASA Jet 
>>> Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: [email protected]
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department University 
>>> of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Luke <[email protected]>
>>> Date: Wednesday, April 22, 2015 at 12:19 AM
>>> To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe U 
>>> (3980-Affiliate)" <[email protected]>, Chris Mattmann 
>>> <[email protected]>
>>> Cc: "Bryant, Ann C (398G-Affiliate)" <[email protected]>, 
>>> "Zimdars, Paul A (3980-Affiliate)" <[email protected]>, 
>>> NSF Polar CyberInfrastructure DR Students 
>>> <[email protected]>,
>>> "[email protected]" <[email protected]>
>>> Subject: RE: [memex-jpl] this week action from luke
>>> 
>>>> Hi Professor,
>>>> Please see attached jpg for the difference.
>>>> Thanks
>>>> Luke
>>>> 
>>>> -----Original Message-----
>>>> From: Chris Mattmann [mailto:[email protected]]
>>>> Sent: Tuesday, April 21, 2015 5:27 PM
>>>> To: Luke; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>> [email protected]
>>>> Subject: Re: [memex-jpl] this week action from luke
>>>> 
>>>> Hey Luke what happens if you do java -jar /path/to/tika-app -m 
>>>> /path/to/cbor/file.cbor, compared to: java -jar /path/to/tika-app 
>>>> -m < /path/to/cbor/file.cbor any difference?
>>>> 
>>>> ------------------------
>>>> Chris Mattmann
>>>> [email protected]
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Luke <[email protected]>
>>>> Date: Tuesday, April 21, 2015 at 5:41 PM
>>>> To: 'Luke' <[email protected]>, Chris Mattmann 
>>>> <[email protected]>, 'Giuseppe Totaro'
>>>> <[email protected]>, Chris Mattmann 
>>>> <[email protected]>
>>>> Cc: "'Bryant, Ann C (398G-Affiliate)'" <[email protected]>, 
>>>> "'Zimdars, Paul A (3980-Affiliate)'" <[email protected]>, 
>>>> NSF Polar CyberInfrastructure DR Students 
>>>> <[email protected]>,
>>>> <[email protected]>
>>>> Subject: RE: [memex-jpl] this week action from luke
>>>> 
>>>>> Hi professor,
>>>>> I just sent a pull request for adding cbor extension.
>>>>> The interesting thing is that tika is still identifying the file 
>>>>> dumped by the nutch dump tool as "application/xhtml+xml" even 
>>>>> when I manually change the file extension to the correct one (i.e.
>>>>> *.cbor).
>>>>> 
>>>>> The reason is probably that tika is identifying
>>>>> "application/xhtml+xml" by searching for the "&lt;html" in the file
>>>>> content (PFA: xhtml+xml.jpg). Now if you take a look at the cbor file
>>>>> dumped by nutch, you see that we do have that element as part of the
>>>>> cbor content, because the entire crawled xhtml document seems to be
>>>>> embedded in the cbor json (PFA: cbor.jpg). Also, in Tika the magic
>>>>> detection seems to have higher priority over the glob detection,
>>>>> thus the type is being incorrectly detected.
>>>>> 
>>>>> Therefore, I would like to mention that adding the entry 
>>>>> <glob pattern="*.cbor"/> does not resolve the issue as of now 
>>>>> without some fixed magic bytes / patterns for cbor.
>>>>> I would also like to add that things will be different with our 
>>>>> probabilistic mime detection selector, because if we know that the 
>>>>> file extension is more reliable than magic bytes, then we can 
>>>>> certainly add more preferential weight to the extension... this 
>>>>> might also show that the current implementation of MimeTypes 
>>>>> detection is a bit stiff or less flexible in this scenario. :)
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Luke [mailto:[email protected]]
>>>>> Sent: Tuesday, April 21, 2015 12:14 PM
>>>>> To: 'Chris Mattmann'; 'Giuseppe Totaro'; 'Mattmann, Chris A (3980)'
>>>>> Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A 
>>>>> (3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students'; 
>>>>> '[email protected]'
>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>> 
>>>>> Yes, let me add the cbor extension entry in tika xml, will send 
>>>>> the pull request soon.
>>>>> 
>>>>> Thanks
>>>>> Luke
>>>>> -----Original Message-----
>>>>> From: Chris Mattmann [mailto:[email protected]]
>>>>> Sent: Tuesday, April 21, 2015 6:51 AM
>>>>> To: Giuseppe Totaro; Mattmann, Chris A (3980)
>>>>> Cc: Luke; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A 
>>>>> (3980-Affiliate); NSF Polar CyberInfrastructure DR Students; 
>>>>> [email protected]
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>> Giuseppe, both of these ideas (supporting the CBOR WRITE_TYPE_HEADER 
>>>>> and tag, along with adding an -extension command) would be fantastic.
>>>>> Can you file both of those NUTCH issues, wait a day or so, and 
>>>>> then based on feedback use your new Nutch commit karma to get 
>>>>> those into Nutch?
>>>>> 
>>>>> And then when creating the issues, can you link to the TIKA-1610
>>>>> issue?
>>>>> At that point, when those two to-be-defined NUTCH issues are up, 
>>>>> Luke, in parallel can you throw up a pull request/patch in Tika 
>>>>> for the extension along with the MIME detection?
>>>>> 
>>>>> Cheers,
>>>>> Chris
>>>>> 
>>>>> ------------------------
>>>>> Chris Mattmann
>>>>> [email protected]
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Giuseppe Totaro <[email protected]>
>>>>> Date: Tuesday, April 21, 2015 at 12:33 PM
>>>>> To: Chris Mattmann <[email protected]>
>>>>> Cc: Luke <[email protected]>, Chris Mattmann 
>>>>> <[email protected]>, "Bryant, Ann C (398G-Affiliate)"
>>>>> <[email protected]>, "Zimdars, Paul A (3980-Affiliate)"
>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR 
>>>>> Students <[email protected]>,
>>>>> "[email protected]"
>>>>> <[email protected]>
>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>> 
>>>>>> Thanks Luke. Great work.
>>>>>> Chris, we wrap a single string value, representing the JSON text, 
>>>>>> for each file into CBOR (by using serializeCBORData method). For 
>>>>>> instance, using the Unix hex dump tool, we can see that, as 
>>>>>> expected, the first byte of all files is "0x7F" (the first three 
>>>>>> bits are "011", that is the major type for strings, and the 
>>>>>> following 5 bits are "11010", meaning a uint32_t encodes the 
>>>>>> length of the following text), and the following 4 bytes 
>>>>>> (an unsigned 32-bit integer) encode the right length of the file 
>>>>>> (as described in RFC7049 <http://tools.ietf.org/html/rfc7049>).
>>>>>> Therefore, a CBOR tag is currently included into the file (a list 
>>>>>> of cbor tags is available here 
>>>>>> <http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml>).
>>>>>> I did not know about CBOR "magic header". Thanks a lot Luke for 
>>>>>> this great research. Chris, if you agree, I can add support for 
>>>>>> prepending the self-describing CBOR tag 55799 to the 
>>>>>> CommonCrawlDataDumper class. I believe it is very easy because I 
>>>>>> have to enable the WRITE_TYPE_HEADER feature for the CBORGenerator 
>>>>>> class (the source code is available here 
>>>>>> <https://github.com/FasterXML/jackson-dataformat-cbor/blob/master/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java>).
>>>>>> Then, I can comment on the TIKA-1610 
>>>>>> <https://issues.apache.org/jira/browse/TIKA-1610> issue.
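
The initial-byte layout discussed above (major type in the top three bits, additional information in the low five bits) can be checked with a small decoder. This is a minimal sketch following RFC 7049 section 2.1; the class name CborInitialByte is hypothetical. For reference, the bit pattern 011 11010 is the byte 0x7A, while 0x7F decodes as 011 11111, a text string of indefinite length.

```java
class CborInitialByte {
    /** Top three bits: the major type (3 means text string). */
    static int majorType(int initialByte) {
        return (initialByte & 0xFF) >>> 5;
    }

    /** Low five bits: the additional information. */
    static int additionalInfo(int initialByte) {
        return initialByte & 0x1F;
    }

    /** Human-readable summary of how the length is encoded. */
    static String describe(int initialByte) {
        int info = additionalInfo(initialByte);
        String length;
        if (info < 24) {
            length = "length " + info + " stored in the initial byte";
        } else if (info <= 27) {
            length = "length in the next " + (1 << (info - 24)) + " byte(s)";
        } else if (info == 31) {
            length = "indefinite length";
        } else {
            length = "reserved additional information";
        }
        return "major type " + majorType(initialByte) + ", " + length;
    }

    public static void main(String[] args) {
        System.out.println(describe(0x7A));
        System.out.println(describe(0x7F));
    }
}
```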
>>>>>> 
>>>>>> Regarding the file extension, in the Memex CCA format the 
>>>>>> original file extension is used. We could add support for a 
>>>>>> -extension command-line option allowing the user to give a file 
>>>>>> extension (e.g., cbor) for all files dumped out.
>>>>>> 
>>>>>> Thanks a lot,
>>>>>> Giuseppe
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Apr 21, 2015 at 7:31 AM, Mattmann, Chris A (3980) 
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> Thanks for this great research, Luke!
>>>>>> 
>>>>>> Giuseppe, any idea why this tag doesn't make it into the file?
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Luke <[email protected]>
>>>>>> Date: Tuesday, April 21, 2015 at 2:55 AM
>>>>>> To: Chris Mattmann <[email protected]>, "Totaro, Giuseppe 
>>>>>> U (3980-Affiliate)" <[email protected]>, Chris Mattmann 
>>>>>> <[email protected]>, "Bryant, Ann C (398G-Affiliate)"
>>>>>> <[email protected]>, "Zimdars, Paul A (3980-Affiliate)"
>>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR 
>>>>>> Students <[email protected]>,
>>>>>> "[email protected]"
>>>>>> <[email protected]>
>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>> 
>>>>>>> Thanks professor.
>>>>>>> Hi professor and all.
>>>>>>> JIRA issue : CBOR Parser and detection improvement
>>>>>>> https://issues.apache.org/jira/browse/TIKA-1610
>>>>>>> 
>>>>>>> I tried to conduct a bit of research on this cbor detection.
>>>>>>> 
>>>>>>> It looks like there is a self-describing tag that needs to be 
>>>>>>> written in the cbor file, through which other applications might 
>>>>>>> be able to identify the cbor type....
>>>>>>> Please refer to http://tools.ietf.org/html/rfc7049#section-2.4.5
>>>>>>> 
>>>>>>> I don't see that tag being present in the cbor file dumped by 
>>>>>>> the nutch tool, I am not very sure though.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Luke
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Chris Mattmann [mailto:[email protected]]
>>>>>>> Sent: Monday, April 20, 2015 4:10 AM
>>>>>>> To: Luke; 'Giuseppe Totaro'; 'Chris Mattmann'; 'Bryant, Ann C 
>>>>>>> (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF 
>>>>>>> Polar CyberInfrastructure DR Students'; 
>>>>>>> [email protected]
>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>> Nice one, Luke. If you have a second and you can open up an 
>>>>>>> issue in Tika to make it support CBOR, then yes, by all means! 
>>>>>>> :)
>>>>>>> 
>>>>>>> 
>>>>>>> ------------------------
>>>>>>> Chris Mattmann
>>>>>>> [email protected]
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Luke <[email protected]>
>>>>>>> Date: Monday, April 20, 2015 at 4:15 AM
>>>>>>> To: 'Giuseppe Totaro' <[email protected]>, Chris Mattmann 
>>>>>>> <[email protected]>, Chris Mattmann 
>>>>>>> <[email protected]>, "'Bryant, Ann C (398G-Affiliate)'"
>>>>>>> <[email protected]>, "'Zimdars, Paul A (3980-Affiliate)'"
>>>>>>> <[email protected]>, NSF Polar CyberInfrastructure DR 
>>>>>>> Students <[email protected]>,
>>>>>>> <[email protected]>
>>>>>>> Subject: RE: [memex-jpl] this week action from luke
>>>>>>> 
>>>>>>>> Thanks a lot Giuseppe for the prompt response clearing up a bit 
>>>>>>>> of my confusion with the Nutch CommonCrawlDataDumper , appreciated.
>>>>>>>> 
>>>>>>>> BTW, it looks like Tika might need to consider supporting 
>>>>>>>> CBOR parsing and detection.
>>>>>>>> I checked the rfc, and it looks like CBOR has no magic numbers.
>>>>>>>> PFA:
>>>>>>>> rfc_cbor.jpg
>>>>>>>> Actually, I don't quite understand why the 
>>>>>>>> CommonCrawlDataDumper is not dumping the nutch segments with 
>>>>>>>> the .cbor extension, which seems to be helpful for type detection.
>>>>>>>> 
>>>>>>>> To professor Mattmann,
>>>>>>>> Tika does not support the detection of CBOR. Although the trunk 
>>>>>>>> version has entries for cbor in tika-mimetypes.xml (PFA: 
>>>>>>>> cbor_tika.mimetypes.xml), those entries are not properly 
>>>>>>>> detecting the cbor files dumped by CommonCrawlDataDumper. Also, 
>>>>>>>> CBOR does not have magic bytes; off the top of my head, the only 
>>>>>>>> way we can detect it is by using the extension and a content byte 
>>>>>>>> histogram (please note, this is a locally optimal solution and
>>>>>>>> data-dependent.)  :)
>>>>>>>> 
>>>>>>>> I think I am deviating a bit from the main route and discussion 
>>>>>>>> of this thread.... i.e. the plan for testing the "probabilistic 
>>>>>>>> mime detector selection" with polar data.
>>>>>>>> Anyway, I plan to repackage tika by incorporating the 
>>>>>>>> probabilistic selection feature, replace the tika jar in 
>>>>>>>> nutch with the repackaged one, and then run the 
>>>>>>>> CommonCrawlDataDumper and see how it goes. If you have any 
>>>>>>>> specific ideas or thoughts about the testing, please kindly let me
>>>>>>>> know.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> From: Giuseppe Totaro [mailto:[email protected]]
>>>>>>>> Sent: Sunday, April 19, 2015 11:17 PM
>>>>>>>> To: Luke liu
>>>>>>>> Cc: Chris Mattmann; Chris Mattmann; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); Luke; NSF 
>>>>>>>> Polar CyberInfrastructure DR Students; 
>>>>>>>> [email protected]
>>>>>>>> Subject: Re: [memex-jpl] this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Luke,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> my name is Giuseppe and I am a PhD student working under the 
>>>>>>>> supervision of Prof. Chris Mattmann. I worked on 
>>>>>>>> CommonCrawlDataDumper tool, so I can give some feedback on a 
>>>>>>>> couple of your observations. My comments inline below.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 19 Apr 2015, at 12:11, Luke liu 
>>>>>>>> <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks a lot professor; sorry for the brief delay, I was 
>>>>>>>> spending some time understanding the code repo, i.e.
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> From gen-common-crawl.sh, it looks like commoncrawldump is 
>>>>>>>> dumping the crawl segments to json files with human-readable 
>>>>>>>> and understandable content.
>>>>>>>> 1) I am trying to run one of the commands on my side as shown 
>>>>>>>> in gen-common-crawl.sh, but the generated files all end with 
>>>>>>>> .html or .htm. The commands listed in gen-common-crawl.sh seem 
>>>>>>>> to allude to where the data is located on our 
>>>>>>>> nsfpolardata.dyndns.org <http://nsfpolardata.dyndns.org/>. 
>>>>>>>> Although the locations are not exactly correct (probably they 
>>>>>>>> need to be updated), part of the patterns allowed me to locate 
>>>>>>>> some similar datasets (e.g. /data2/crawls/raw/CS572Spring2015). 
>>>>>>>> Again I am seeing that the dumped files all end with html, but 
>>>>>>>> surprisingly, inside those outputted html files the contents 
>>>>>>>> are in json format.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The file extension is (almost) always the same as the original
>>>>>>>> file.
>>>>>>>> More in detail, using the -epochFilename command-line option 
>>>>>>>> (as in gen-common-crawl.sh), the scraped data will be stored 
>>>>>>>> with a filename of the format 
>>>>>>>> <epochtime(milliseconds)>.<filetype>,
>>>>>>>> where <filetype> is either the extension of the original file 
>>>>>>>> or .html as default if the original file does not have an
>>>>>>>> extension.
>>>>>>>> This schema is used for file naming and it does not depend on 
>>>>>>>> the internal output format (JSON).
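
The naming scheme just described can be sketched as follows. This is a minimal sketch, not the CommonCrawlDataDumper code: the class name EpochFilename is hypothetical, and it assumes it is given a bare file name without a directory path.

```java
class EpochFilename {
    /**
     * Builds <epochtime(milliseconds)>.<filetype>, where filetype is the
     * extension of the original file, or "html" when there is none.
     */
    static String build(long epochMillis, String originalName) {
        int dot = originalName.lastIndexOf('.');
        // A trailing dot or no dot at all counts as "no extension"
        boolean hasExtension = dot >= 0 && dot < originalName.length() - 1;
        String fileType = hasExtension ? originalName.substring(dot + 1) : "html";
        return epochMillis + "." + fileType;
    }

    public static void main(String[] args) {
        System.out.println(build(1423894754000L, "report.pdf"));
        System.out.println(build(1423894754000L, "index"));
    }
}
```

This is why a dumped file can end in .html while its content is CBOR-encoded JSON.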
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2) Another problem is that the root object is set with 
>>>>>>>> some garbled chars in each of the outputted json files (with 
>>>>>>>> extension html in the end), PFA: garbled.jpg; one of the 
>>>>>>>> outputted json files has also been attached as an example (PFA:
>>>>>>>> 1423894754000.html). The json files cannot be parsed properly 
>>>>>>>> by aggregate.py due to those garbled chars.
>>>>>>>> Even if I get rid of those garbled chars, there is no 
>>>>>>>> mimeTypes element, which is what aggregate.py reads.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Text content and metadata extracted from the crawled binary 
>>>>>>>> data are stored in a structured document format (JSON). 
>>>>>>>> Furthermore, this document is encoded using CBOR 
>>>>>>>> <http://cbor.io/> serialization. Each non-human-readable 
>>>>>>>> character that you notice in front and at the end of the JSON 
>>>>>>>> data is due to the CBOR encoding.
>>>>>>>> Thus, if you need to read JSON data from a document dumped out by 
>>>>>>>> CommonCrawlDataDumper, you have to deserialize the 
>>>>>>>> CBOR-encoded data structure inside the file.
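
To show where the leading non-readable bytes come from, here is a minimal sketch that decodes one definite-length CBOR text string by hand. It assumes the file holds a single text string wrapping the JSON, and it does not handle indefinite-length (chunked) strings or tags; in practice a CBOR library should be used. The class name CborTextString is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class CborTextString {
    /** Decodes a single definite-length CBOR text string (major type 3). */
    static String decode(byte[] data) {
        int pos = 0;
        int initial = data[pos++] & 0xFF;
        if ((initial >>> 5) != 3) {
            throw new IllegalArgumentException("not a CBOR text string");
        }
        int info = initial & 0x1F;
        long length;
        if (info < 24) {
            length = info; // short strings keep the length in the initial byte
        } else if (info <= 27) {
            int n = 1 << (info - 24); // 1, 2, 4 or 8 length bytes, big-endian
            length = 0;
            for (int i = 0; i < n; i++) {
                length = (length << 8) | (data[pos++] & 0xFF);
            }
        } else {
            throw new IllegalArgumentException("indefinite or reserved length not handled");
        }
        if (length < 0 || length > data.length - pos) {
            throw new IllegalArgumentException("declared length exceeds the data");
        }
        return new String(data, pos, (int) length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] json = "{\"a\":1}".getBytes(StandardCharsets.UTF_8);
        byte[] cbor = new byte[1 + json.length];
        cbor[0] = 0x67; // major type 3, length 7
        System.arraycopy(json, 0, cbor, 1, json.length);
        System.out.println(decode(cbor));
    }
}
```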
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I hope this short overview can help in your work. I really 
>>>>>>>> appreciate your feedback and, by the way, thanks a lot for your 
>>>>>>>> great job on detection.
>>>>>>>> 
>>>>>>>> I am available to provide all the support I can give, so do 
>>>>>>>> not hesitate to contact me if you need any further information.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Giuseppe
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Finally, after some research, I guess that the statistical 
>>>>>>>> information (present in the readme of the code repo) is not 
>>>>>>>> being collected and computed by aggregate.py from those output 
>>>>>>>> json files but it looks like it is coming from the log.... see 
>>>>>>>> the following as an example:
>>>>>>>> 
>>>>>>>> 2015-04-19 04:55:42,078 INFO  tools.CommonCrawlDataDumper - 
>>>>>>>> CommonsCrawlDataDumper File Stats:
>>>>>>>> TOTAL Stats:
>>>>>>>> [
>>>>>>>>  {"mimeType":"application/x-tika-msoffice","count":"17"}
>>>>>>>>  {"mimeType":"application/vnd.ms-excel","count":"7"}
>>>>>>>>  {"mimeType":"application/xhtml+xml","count":"3000"}
>>>>>>>>  {"mimeType":"application/octet-stream","count":"641"}
>>>>>>>>  {"mimeType":"application/epub+zip","count":"1"}
>>>>>>>>  {"mimeType":"application/zip","count":"6"}
>>>>>>>>  {"mimeType":"application/xml","count":"11"}
>>>>>>>>  {"mimeType":"image/png","count":"110"}
>>>>>>>>  {"mimeType":"image/jpeg","count":"70"}
>>>>>>>>  {"mimeType":"application/atom+xml","count":"213"}
>>>>>>>>  {"mimeType":"application/rss+xml","count":"43"}
>>>>>>>>  {"mimeType":"video/mp4","count":"3"}
>>>>>>>>  {"mimeType":"text/plain","count":"104"}
>>>>>>>>  {"mimeType":"application/rdf+xml","count":"2"}
>>>>>>>>  {"mimeType":"image/gif","count":"2"}
>>>>>>>>  {"mimeType":"text/x-php","count":"1"}
>>>>>>>>  {"mimeType":"video/x-msvideo","count":"1"}
>>>>>>>>  {"mimeType":"application/x-tika-ooxml","count":"3"}
>>>>>>>>  {"mimeType":"text/html","count":"9506"}
>>>>>>>>  {"mimeType":"application/pdf","count":"280"}
>>>>>>>> ]
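
Since the statistics live in the log, the per-type counts can be pulled out of lines like the ones above with a regular expression. This is a minimal sketch under the assumption that each entry is formatted exactly as shown; the class name StatsLines is hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StatsLines {
    // Matches entries such as {"mimeType":"image/png","count":"110"}
    private static final Pattern ENTRY =
            Pattern.compile("\\{\"mimeType\":\"([^\"]+)\",\"count\":\"(\\d+)\"\\}");

    /** Returns mime type to count, in the order the entries appear. */
    static Map<String, Integer> parse(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = ENTRY.matcher(line);
            if (m.find()) {
                counts.put(m.group(1), Integer.parseInt(m.group(2)));
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "TOTAL Stats:",
                "[",
                " {\"mimeType\":\"image/png\",\"count\":\"110\"}",
                " {\"mimeType\":\"text/html\",\"count\":\"9506\"}",
                "]");
        System.out.println(parse(lines));
    }
}
```

Lines that are not entries (the header and the brackets) are simply skipped.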
>>>>>>>> 
>>>>>>>> It turns out that aggregate.py is not the one that produces the 
>>>>>>>> statistical information, not sure what it does... but anyway, I 
>>>>>>>> think I understand the whole idea and I do concur with it: 
>>>>>>>> maybe we can repackage tika by incorporating the feature
>>>>>>>> (i.e. probabilistic mime selection) in it and see if it can 
>>>>>>>> output the same information in the log as the one without it.
>>>>>>>> 
>>>>>>>> BTW, regarding the use of the probabilistic mime selection
>>>>>>>> feature:
>>>>>>>> in my pull request, I added a simple test case which might tell 
>>>>>>>> a bit more about how the feature is called and used, it is 
>>>>>>>> simple though.
>>>>>>>> Here is an example snippet
>>>>>>>>               ProbabilisticMimeDetectionSelector probSel =
>>>>>>>>                   new ProbabilisticMimeDetectionSelector();
>>>>>>>>               // input is an InputStream, metadata is a Metadata
>>>>>>>>               probSel.detect(input, metadata);
>>>>>>>> It is similar to MimeTypes::detect(...) (more 
>>>>>>>> information on this can be found in
>>>>>>>> https://issues.apache.org/jira/browse/TIKA-1517)
>>>>>>>> Now, in order to allow Tika().detect() to call
>>>>>>>> ProbabilisticMimeDetectionSelector::detect(...) (as
>>>>>>>> Tika().detect() is being called by commoncrawldump), we need to 
>>>>>>>> modify/add some code in the TikaConfig, which initializes a list 
>>>>>>>> of default detectors: we need to get rid of the detector
>>>>>>>> mimeTypes::MimeTypes in the list and replace it with
>>>>>>>> probSel::ProbabilisticMimeDetectionSelector (not sure if I should 
>>>>>>>> create another pull request with this change for
>>>>>>>> TikaConfig).
>>>>>>>> 
>>>>>>>> I think that is all of my initial thoughts, with some findings 
>>>>>>>> and a plan; if you have anything you would like to add or 
>>>>>>>> comment on, please do let me know, then I will start 
>>>>>>>> working on my 'finale'. BTW, don't worry, even after I have 
>>>>>>>> graduated, graduation is not the end of my work with tika and 
>>>>>>>> this project; after that I still can and want to help this 
>>>>>>>> polar project and tika as much as possible, correct 
>>>>>>>> programming faults and bugs, respond to tika issues, etc.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Chris Mattmann [mailto:[email protected]]
>>>>>>>> Sent: Saturday, April 18, 2015 6:26 AM
>>>>>>>> To: Luke liu; 'Mattmann, Chris A (3980)'; Bryant, Ann C 
>>>>>>>> (398G-Affiliate); Zimdars, Paul A (3980-Affiliate)
>>>>>>>> Cc: 'Luke'; NSF Polar CyberInfrastructure DR Students; 
>>>>>>>> [email protected]
>>>>>>>> Subject: Re: this week action from luke
>>>>>>>> Importance: High
>>>>>>>> 
>>>>>>>> Awesome Luke. I am going to work specifically on now 
>>>>>>>> benchmarking your code in real situations. For example, it 
>>>>>>>> would be fantastic to now run your Bayesian MIME detector over 
>>>>>>>> the whole NSF TREC Dynamic Domain data for Polar described here:
>>>>>>>> 
>>>>>>>> http://github.com/chrismattmann/trec-dd-polar/
>>>>>>>> 
>>>>>>>> Paul Zimdars, CC'ed, can provide you with access to the data, 
>>>>>>>> and Annie can explain it, also CC'ed.
>>>>>>>> 
>>>>>>>> Can we make that your goal for the next 2 weeks to actually 
>>>>>>>> test it and produce a real result over the whole TREC-DD data 
>>>>>>>> for Polar? My goal will be to get your code committed and 
>>>>>>>> integrated into Tika.
>>>>>>>> The more you can write me a guide of how to build and test your 
>>>>>>>> code with Tika so I can get it committed the better.
>>>>>>>> 
>>>>>>>> Also CC'ing the Memex list for context. Note everyone: Luke is 
>>>>>>>> building a Bayesian MIME classifier to evaluate against Tika's 
>>>>>>>> existing MIME detection approach. If folks have any Memex needs 
>>>>>>>> to try and test more accurate file identification with Tika, 
>>>>>>>> Luke is the guy to talk to and I have him for 2 more weeks.
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>> 
>>>>>>>> ------------------------
>>>>>>>> Chris Mattmann
>>>>>>>> [email protected]
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Luke liu <[email protected]>
>>>>>>>> Date: Thursday, April 16, 2015 at 11:29 PM
>>>>>>>> To: Chris Mattmann <[email protected]>, Chris Mattmann 
>>>>>>>> <[email protected]>
>>>>>>>> Cc: 'Luke' <[email protected]>
>>>>>>>> Subject: this week action from luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Professor Mattmann,
>>>>>>>> 
>>>>>>>> I think I am in the final phase of the research, and last week 
>>>>>>>> I finished the last item in the list, and hopefully everything 
>>>>>>>> will be fine.
>>>>>>>> 
>>>>>>>> For now, I can probably spend some time verifying or
>>>>>>>> optimizing the code, as the majority of the research has been
>>>>>>>> done. It would also be great if you could comment on my work
>>>>>>>> (the 2 pull requests) when you have time.
>>>>>>>>
>>>>>>>> If anything in my work is unclear, please do let me know.
>>>>>>>>
>>>>>>>> Thanks, and I am glad to be working with you. For the next
>>>>>>>> couple of weeks before graduation, I am going to continue
>>>>>>>> revising and testing the code and features to get rid of any
>>>>>>>> flaws when I have time.
>>>>>>>>
>>>>>>>> I am not sure whether I have missed something; if I have
>>>>>>>> missed anything important, please do let me know too.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Luke
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>> Google Groups "JPL-Kitware-Continuum Memex Group" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>> it, send an email to [email protected]
>>>>>>>> <mailto:memex-jpl%[email protected]>.
>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>> Visit this group at http://groups.google.com/group/memex-jpl.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/memex-jpl/000f01d07ad4%24b3510070%2419f30150%24%40edu.
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
> 
> 
map1 (stats map for the log produced by the tika without prob mime selector) size: 21
map2 (stats map for the log produced by the tika with            prob mime selector) size: 21
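
The program that produced this listing is not included in the thread. A minimal sketch of such a check is given here, assuming each log's "Stats total" has already been parsed into a map from section path to the lines of that section, and assuming lines are compared by position with String.equals(...). The class name StatsCompare, the method compare, and the "[error]" message wording are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the program that produced matched.txt.
final class StatsCompare {

    // Compares two stats maps (section path -> lines of that section).
    // Identical lines are reported as "[matched]", anything else as "[error]".
    static List<String> compare(Map<String, List<String>> map1,
                                Map<String, List<String>> map2) {
        List<String> report = new ArrayList<>();
        report.add("map1 size: " + map1.size());
        report.add("map2 size: " + map2.size());
        int index = 0;
        for (Map.Entry<String, List<String>> section : map1.entrySet()) {
            List<String> expected = section.getValue();
            List<String> actual = map2.get(section.getKey());
            report.add(section.getKey());
            report.add(index + " [");
            index++;
            if (actual == null) {
                report.add("[error]: section missing from map2");
            } else {
                // Walk both sections by position so extra or missing lines are caught too.
                int n = Math.max(expected.size(), actual.size());
                for (int i = 0; i < n; i++) {
                    String e = i < expected.size() ? expected.get(i) : null;
                    String a = i < actual.size() ? actual.get(i) : null;
                    if (e != null && e.equals(a)) {
                        report.add("[matched]:     " + e);
                    } else {
                        report.add("[error]: map1 has " + e + " but map2 has " + a);
                    }
                }
            }
            report.add("]");
        }
        // Sections that exist only in the second log are errors as well.
        for (String path : map2.keySet()) {
            if (!map1.containsKey(path)) {
                report.add("[error]: section only in map2: " + path);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        Map<String, List<String>> map1 = new LinkedHashMap<>();
        Map<String, List<String>> map2 = new LinkedHashMap<>();
        // Made-up section path and counts, for illustration only.
        map1.put("/example/crawl/", Arrays.asList(
                "{\"mimeType\":\"text/html\",\"count\":\"2\"}",
                "{\"mimeType\":\"image/png\",\"count\":\"1\"}"));
        map2.put("/example/crawl/", Arrays.asList(
                "{\"mimeType\":\"text/html\",\"count\":\"2\"}",
                "{\"mimeType\":\"image/png\",\"count\":\"1\"}"));
        for (String line : compare(map1, map2)) {
            System.out.println(line);
        }
    }
}
```

With this shape, a run in which no line starts with "[error]" corresponds to the outcome reported in the listing: every line of every section is identical in the two logs.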

/usr/local/ndeploy/data/raw/CS572Spring2015/Team6/raw/acadis_plain/
0 [
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"26"}
[matched]:     {"mimeType":"application/vnd.ms-excel","count":"5"}
[matched]:     {"mimeType":"application/x-hdf","count":"14"}
[matched]:     {"mimeType":"image/tiff","count":"2"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"3677"}
[matched]:     {"mimeType":"application/octet-stream","count":"8483"}
[matched]:     {"mimeType":"application/x-sh","count":"463"}
[matched]:     {"mimeType":"application/zip","count":"466"}
[matched]:     {"mimeType":"application/xml","count":"323"}
[matched]:     {"mimeType":"image/jpeg","count":"10"}
[matched]:     {"mimeType":"image/png","count":"12"}
[matched]:     {"mimeType":"application/rss+xml","count":"4"}
[matched]:     {"mimeType":"application/atom+xml","count":"2"}
[matched]:     {"mimeType":"video/mp4","count":"1"}
[matched]:     {"mimeType":"text/dif+xml","count":"268"}
[matched]:     {"mimeType":"text/plain","count":"8387"}
[matched]:     {"mimeType":"application/gzip","count":"1"}
[matched]:     {"mimeType":"application/rtf","count":"1"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"5"}
[matched]:     {"mimeType":"text/html","count":"3379"}
[matched]:     {"mimeType":"application/pdf","count":"118"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team36/raw/second crawl Raw/AMD_nasa/
1 [
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"2"}
[matched]:     {"mimeType":"application/x-elc","count":"3"}
[matched]:     {"mimeType":"application/vnd.ms-excel","count":"1"}
[matched]:     {"mimeType":"application/x-msdownload; format=pe32","count":"1"}
[matched]:     {"mimeType":"image/vnd.microsoft.icon","count":"14"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"493"}
[matched]:     {"mimeType":"application/x-java-jnilib","count":"1"}
[matched]:     {"mimeType":"application/octet-stream","count":"200"}
[matched]:     {"mimeType":"application/xml","count":"35"}
[matched]:     {"mimeType":"image/jpeg","count":"111"}
[matched]:     {"mimeType":"image/png","count":"48"}
[matched]:     {"mimeType":"application/rss+xml","count":"9"}
[matched]:     {"mimeType":"text/plain","count":"10398"}
[matched]:     {"mimeType":"image/gif","count":"19"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"3"}
[matched]:     {"mimeType":"text/html","count":"2196"}
[matched]:     {"mimeType":"application/pdf","count":"60"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team34/raw/AMD.crawl/
2 [
[matched]:     {"mimeType":"application/xml","count":"32"}
[matched]:     {"mimeType":"image/jpeg","count":"11"}
[matched]:     {"mimeType":"image/png","count":"767"}
[matched]:     {"mimeType":"application/rss+xml","count":"2"}
[matched]:     {"mimeType":"application/atom+xml","count":"45"}
[matched]:     {"mimeType":"text/plain","count":"24"}
[matched]:     {"mimeType":"image/gif","count":"385"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"9162"}
[matched]:     {"mimeType":"application/octet-stream","count":"3804"}
[matched]:     {"mimeType":"text/html","count":"85628"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team36/raw/FirstCrawl/first_aoncadis/
3 [
[matched]:     {"mimeType":"application/x-elc","count":"3"}
[matched]:     {"mimeType":"application/vnd.google-earth.kml+xml","count":"1"}
[matched]:     {"mimeType":"image/vnd.microsoft.icon","count":"12"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"870"}
[matched]:     {"mimeType":"application/octet-stream","count":"101"}
[matched]:     {"mimeType":"application/zip","count":"1"}
[matched]:     {"mimeType":"application/xml","count":"24"}
[matched]:     {"mimeType":"image/jpeg","count":"297"}
[matched]:     {"mimeType":"image/png","count":"184"}
[matched]:     {"mimeType":"application/rss+xml","count":"28"}
[matched]:     {"mimeType":"text/plain","count":"548"}
[matched]:     {"mimeType":"application/rdf+xml","count":"1"}
[matched]:     {"mimeType":"image/gif","count":"60"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"1"}
[matched]:     {"mimeType":"text/html","count":"864"}
[matched]:     {"mimeType":"application/pdf","count":"95"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team6/raw/acadis/
4 [
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"17"}
[matched]:     {"mimeType":"application/vnd.ms-excel","count":"7"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"3000"}
[matched]:     {"mimeType":"application/octet-stream","count":"641"}
[matched]:     {"mimeType":"application/epub+zip","count":"1"}
[matched]:     {"mimeType":"application/zip","count":"6"}
[matched]:     {"mimeType":"application/xml","count":"10"}
[matched]:     {"mimeType":"image/png","count":"110"}
[matched]:     {"mimeType":"image/jpeg","count":"70"}
[matched]:     {"mimeType":"application/atom+xml","count":"213"}
[matched]:     {"mimeType":"application/rss+xml","count":"43"}
[matched]:     {"mimeType":"video/mp4","count":"3"}
[matched]:     {"mimeType":"text/plain","count":"97"}
[matched]:     {"mimeType":"application/rdf+xml","count":"2"}
[matched]:     {"mimeType":"image/gif","count":"2"}
[matched]:     {"mimeType":"text/x-php","count":"1"}
[matched]:     {"mimeType":"video/x-msvideo","count":"1"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"3"}
[matched]:     {"mimeType":"text/html","count":"9514"}
[matched]:     {"mimeType":"application/pdf","count":"280"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team43/raw/ade.crawl/
5 [
[matched]:     {"mimeType":"application/x-tar","count":"9"}
[matched]:     {"mimeType":"application/vnd.ms-excel","count":"1"}
[matched]:     {"mimeType":"application/x-bzip2","count":"17"}
[matched]:     {"mimeType":"image/tiff","count":"31"}
[matched]:     {"mimeType":"application/x-font-ttf","count":"7"}
[matched]:     {"mimeType":"application/x-tex","count":"11"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"103336"}
[matched]:     {"mimeType":"application/x-sh","count":"4"}
[matched]:     {"mimeType":"application/zip","count":"122"}
[matched]:     {"mimeType":"application/x-bittorrent","count":"3"}
[matched]:     {"mimeType":"application/fits","count":"24"}
[matched]:     {"mimeType":"image/png","count":"591"}
[matched]:     {"mimeType":"application/atom+xml","count":"286"}
[matched]:     {"mimeType":"application/x-gtar","count":"12"}
[matched]:     {"mimeType":"application/x-bibtex-text-file","count":"1"}
[matched]:     {"mimeType":"image/gif","count":"64"}
[matched]:     {"mimeType":"text/html","count":"158884"}
[matched]:     {"mimeType":"audio/mpeg","count":"82"}
[matched]:     {"mimeType":"message/rfc822","count":"3"}
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"440"}
[matched]:     {"mimeType":"application/ogg","count":"9"}
[matched]:     {"mimeType":"application/vnd.google-earth.kml+xml","count":"28"}
[matched]:     {"mimeType":"application/x-hdf","count":"4"}
[matched]:     {"mimeType":"application/x-compress","count":"2"}
[matched]:     {"mimeType":"image/vnd.microsoft.icon","count":"3"}
[matched]:     {"mimeType":"text/x-perl","count":"7"}
[matched]:     {"mimeType":"image/vnd.adobe.photoshop","count":"1"}
[matched]:     {"mimeType":"image/x-xcf","count":"1"}
[matched]:     {"mimeType":"application/octet-stream","count":"20280"}
[matched]:     {"mimeType":"application/x-java-jnilib","count":"1"}
[matched]:     {"mimeType":"application/xml","count":"765"}
[matched]:     {"mimeType":"image/jpeg","count":"2004"}
[matched]:     {"mimeType":"application/rss+xml","count":"3068"}
[matched]:     {"mimeType":"video/mp4","count":"146"}
[matched]:     {"mimeType":"application/gzip","count":"3"}
[matched]:     {"mimeType":"text/x-php","count":"10"}
[matched]:     {"mimeType":"application/rtf","count":"10"}
[matched]:     {"mimeType":"audio/x-wav","count":"1"}
[matched]:     {"mimeType":"application/x-elc","count":"1"}
[matched]:     {"mimeType":"text/x-vcard","count":"13"}
[matched]:     {"mimeType":"application/vnd.sun.xml.writer","count":"1"}
[matched]:     {"mimeType":"application/xslt+xml","count":"1"}
[matched]:     {"mimeType":"application/postscript","count":"114"}
[matched]:     {"mimeType":"video/x-ms-wmv","count":"24"}
[matched]:     {"mimeType":"application/x-sqlite3","count":"1"}
[matched]:     {"mimeType":"video/x-ms-asf","count":"3"}
[matched]:     {"mimeType":"application/x-executable","count":"9"}
[matched]:     {"mimeType":"application/rdf+xml","count":"158"}
[matched]:     {"mimeType":"application/x-grib","count":"3"}
[matched]:     {"mimeType":"application/msword","count":"11"}
[matched]:     {"mimeType":"video/x-msvideo","count":"15"}
[matched]:     {"mimeType":"video/x-flv","count":"4"}
[matched]:     {"mimeType":"application/vnd.oasis.opendocument.text","count":"5"}
[matched]:     {"mimeType":"application/x-shockwave-flash","count":"23"}
[matched]:     {"mimeType":"image/svg+xml","count":"181"}
[matched]:     {"mimeType":"application/epub+zip","count":"8"}
[matched]:     {"mimeType":"application/x-rar-compressed","count":"1"}
[matched]:     {"mimeType":"text/plain","count":"6244"}
[matched]:     {"mimeType":"video/quicktime","count":"105"}
[matched]:     {"mimeType":"audio/x-flac","count":"1"}
[matched]:     {"mimeType":"audio/basic","count":"1"}
[matched]:     {"mimeType":"video/mpeg","count":"4"}
[matched]:     {"mimeType":"audio/vorbis","count":"2"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"563"}
[matched]:     {"mimeType":"application/vnd.rn-realmedia","count":"9"}
[matched]:     {"mimeType":"application/pdf","count":"11756"}
[matched]:     {"mimeType":"video/x-m4v","count":"26"}
[matched]:     {"mimeType":"text/x-python","count":"1"}
[matched]:     {"mimeType":"application/x-xz","count":"9"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team43/raw/acadis.crawl/
6 [
[matched]:     {"mimeType":"application/x-elc","count":"8"}
[matched]:     {"mimeType":"application/x-tex","count":"2"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"8117"}
[matched]:     {"mimeType":"application/x-sh","count":"557"}
[matched]:     {"mimeType":"application/zip","count":"548"}
[matched]:     {"mimeType":"image/png","count":"1010"}
[matched]:     {"mimeType":"application/atom+xml","count":"8"}
[matched]:     {"mimeType":"application/rdf+xml","count":"3"}
[matched]:     {"mimeType":"image/gif","count":"355"}
[matched]:     {"mimeType":"application/msword","count":"4"}
[matched]:     {"mimeType":"video/x-msvideo","count":"7"}
[matched]:     {"mimeType":"text/html","count":"6409"}
[matched]:     {"mimeType":"application/vnd.oasis.opendocument.text","count":"1"}
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"22"}
[matched]:     {"mimeType":"application/vnd.google-earth.kml+xml","count":"4"}
[matched]:     {"mimeType":"image/vnd.microsoft.icon","count":"57"}
[matched]:     {"mimeType":"application/x-shockwave-flash","count":"7"}
[matched]:     {"mimeType":"image/svg+xml","count":"6"}
[matched]:     {"mimeType":"application/octet-stream","count":"9237"}
[matched]:     {"mimeType":"image/x-ms-bmp","count":"2"}
[matched]:     {"mimeType":"application/xml","count":"1103"}
[matched]:     {"mimeType":"image/jpeg","count":"1767"}
[matched]:     {"mimeType":"application/rss+xml","count":"94"}
[matched]:     {"mimeType":"text/plain","count":"1609"}
[matched]:     {"mimeType":"text/dif+xml","count":"902"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"23"}
[matched]:     {"mimeType":"application/rtf","count":"2"}
[matched]:     {"mimeType":"application/pdf","count":"365"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team16/raw/amd.crawl/
7 [
[matched]:     {"mimeType":"application/xml","count":"8"}
[matched]:     {"mimeType":"image/png","count":"725"}
[matched]:     {"mimeType":"application/atom+xml","count":"361"}
[matched]:     {"mimeType":"application/rss+xml","count":"1"}
[matched]:     {"mimeType":"text/plain","count":"418"}
[matched]:     {"mimeType":"image/gif","count":"1957"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"2615"}
[matched]:     {"mimeType":"application/octet-stream","count":"4506"}
[matched]:     {"mimeType":"text/html","count":"58203"}
]

/data2/crawls/raw/CS572Spring2015/Team41/raw/amd.crawl/
8 [
[matched]:     {"mimeType":"message/rfc822","count":"13"}
[matched]:     {"mimeType":"application/xml","count":"13"}
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"2"}
[matched]:     {"mimeType":"image/jpeg","count":"87"}
[matched]:     {"mimeType":"image/png","count":"21"}
[matched]:     {"mimeType":"application/rss+xml","count":"2"}
[matched]:     {"mimeType":"text/plain","count":"9106"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"405"}
[matched]:     {"mimeType":"application/x-java-jnilib","count":"1"}
[matched]:     {"mimeType":"application/octet-stream","count":"61"}
[matched]:     {"mimeType":"text/html","count":"2394"}
[matched]:     {"mimeType":"application/pdf","count":"61"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team36/raw/second crawl Raw/Aoncadis/
9 [
[matched]:     {"mimeType":"application/xml","count":"176"}
[matched]:     {"mimeType":"application/atom+xml","count":"1"}
[matched]:     {"mimeType":"text/plain","count":"176"}
[matched]:     {"mimeType":"text/dif+xml","count":"175"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"994"}
[matched]:     {"mimeType":"application/octet-stream","count":"1392"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"1"}
[matched]:     {"mimeType":"application/pdf","count":"1"}
[matched]:     {"mimeType":"application/x-sh","count":"146"}
[matched]:     {"mimeType":"application/zip","count":"136"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team18/raw/
10 [
[matched]:     {"mimeType":"application/xml","count":"253"}
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"23"}
[matched]:     {"mimeType":"application/atom+xml","count":"1"}
[matched]:     {"mimeType":"text/dif+xml","count":"264"}
[matched]:     {"mimeType":"text/plain","count":"276"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"1034"}
[matched]:     {"mimeType":"application/octet-stream","count":"1638"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"1"}
[matched]:     {"mimeType":"application/x-sh","count":"190"}
[matched]:     {"mimeType":"application/pdf","count":"1"}
[matched]:     {"mimeType":"application/zip","count":"85"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team41/raw/ade.crawl/
11 [
[matched]:     {"mimeType":"application/x-elc","count":"1"}
[matched]:     {"mimeType":"application/vnd.ms-excel","count":"11"}
[matched]:     {"mimeType":"video/x-ms-wmv","count":"12"}
[matched]:     {"mimeType":"image/tiff","count":"162"}
[matched]:     {"mimeType":"audio/x-aiff","count":"2"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"2323"}
[matched]:     {"mimeType":"application/x-sh","count":"1"}
[matched]:     {"mimeType":"application/zip","count":"8"}
[matched]:     {"mimeType":"application/x-matroska","count":"32"}
[matched]:     {"mimeType":"application/x-executable","count":"2"}
[matched]:     {"mimeType":"image/png","count":"1806"}
[matched]:     {"mimeType":"application/atom+xml","count":"3"}
[matched]:     {"mimeType":"application/rdf+xml","count":"1"}
[matched]:     {"mimeType":"image/gif","count":"4"}
[matched]:     {"mimeType":"video/x-msvideo","count":"11"}
[matched]:     {"mimeType":"text/html","count":"5813"}
[matched]:     {"mimeType":"audio/mpeg","count":"1"}
[matched]:     {"mimeType":"message/rfc822","count":"135"}
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"31"}
[matched]:     {"mimeType":"application/vnd.google-earth.kml+xml","count":"2"}
[matched]:     {"mimeType":"application/x-hdf","count":"11"}
[matched]:     {"mimeType":"application/x-compress","count":"36"}
[matched]:     {"mimeType":"text/x-perl","count":"2"}
[matched]:     {"mimeType":"application/x-java-jnilib","count":"1"}
[matched]:     {"mimeType":"application/octet-stream","count":"114"}
[matched]:     {"mimeType":"application/xml","count":"38"}
[matched]:     {"mimeType":"image/jpeg","count":"3284"}
[matched]:     {"mimeType":"application/rss+xml","count":"99"}
[matched]:     {"mimeType":"text/dif+xml","count":"1"}
[matched]:     {"mimeType":"text/plain","count":"13858"}
[matched]:     {"mimeType":"video/mp4","count":"68"}
[matched]:     {"mimeType":"video/quicktime","count":"75"}
[matched]:     {"mimeType":"audio/basic","count":"1"}
[matched]:     {"mimeType":"application/gzip","count":"30"}
[matched]:     {"mimeType":"video/mpeg","count":"45"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"7"}
[matched]:     {"mimeType":"application/rtf","count":"3"}
[matched]:     {"mimeType":"application/pdf","count":"1090"}
[matched]:     {"mimeType":"video/x-m4v","count":"21"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team36/raw/second crawl Raw/ADE_acadis/
12 [
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"1"}
[matched]:     {"mimeType":"application/x-elc","count":"3"}
[matched]:     {"mimeType":"application/vnd.ms-excel","count":"1"}
[matched]:     {"mimeType":"application/x-msdownload; format=pe32","count":"1"}
[matched]:     {"mimeType":"image/vnd.microsoft.icon","count":"7"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"454"}
[matched]:     {"mimeType":"application/x-java-jnilib","count":"1"}
[matched]:     {"mimeType":"application/octet-stream","count":"170"}
[matched]:     {"mimeType":"application/xml","count":"11"}
[matched]:     {"mimeType":"image/png","count":"76"}
[matched]:     {"mimeType":"image/jpeg","count":"113"}
[matched]:     {"mimeType":"application/atom+xml","count":"1"}
[matched]:     {"mimeType":"application/rss+xml","count":"8"}
[matched]:     {"mimeType":"text/plain","count":"4961"}
[matched]:     {"mimeType":"image/gif","count":"10"}
[matched]:     {"mimeType":"text/html","count":"2161"}
[matched]:     {"mimeType":"application/pdf","count":"52"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team29/raw/
13 [
[matched]:     {"mimeType":"application/xml","count":"33"}
[matched]:     {"mimeType":"image/png","count":"405"}
[matched]:     {"mimeType":"image/jpeg","count":"18"}
[matched]:     {"mimeType":"application/rss+xml","count":"1"}
[matched]:     {"mimeType":"application/atom+xml","count":"1389"}
[matched]:     {"mimeType":"text/plain","count":"379"}
[matched]:     {"mimeType":"image/gif","count":"719"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"4856"}
[matched]:     {"mimeType":"application/octet-stream","count":"6943"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"1"}
[matched]:     {"mimeType":"text/html","count":"51259"}
[matched]:     {"mimeType":"application/x-sh","count":"251"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team22/raw/amd.crawl/
14 [
[matched]:     {"mimeType":"application/x-tar","count":"13"}
[matched]:     {"mimeType":"application/vnd.ms-excel","count":"68"}
[matched]:     {"mimeType":"image/tiff","count":"235"}
[matched]:     {"mimeType":"application/x-bzip2","count":"2"}
[matched]:     {"mimeType":"audio/x-aiff","count":"5"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"22586"}
[matched]:     {"mimeType":"application/x-sh","count":"3"}
[matched]:     {"mimeType":"application/zip","count":"81"}
[matched]:     {"mimeType":"image/png","count":"7738"}
[matched]:     {"mimeType":"text/x-jsp","count":"1"}
[matched]:     {"mimeType":"application/atom+xml","count":"149"}
[matched]:     {"mimeType":"application/x-gtar","count":"7"}
[matched]:     {"mimeType":"image/gif","count":"11361"}
[matched]:     {"mimeType":"text/html","count":"94167"}
[matched]:     {"mimeType":"audio/mpeg","count":"227"}
[matched]:     {"mimeType":"message/rfc822","count":"28"}
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"1022"}
[matched]:     {"mimeType":"application/x-hdf","count":"10"}
[matched]:     {"mimeType":"application/vnd.google-earth.kml+xml","count":"93"}
[matched]:     {"mimeType":"application/ogg","count":"13"}
[matched]:     {"mimeType":"image/vnd.microsoft.icon","count":"111"}
[matched]:     {"mimeType":"application/octet-stream","count":"6990"}
[matched]:     {"mimeType":"application/xml","count":"432"}
[matched]:     {"mimeType":"image/jpeg","count":"35833"}
[matched]:     {"mimeType":"application/rss+xml","count":"1597"}
[matched]:     {"mimeType":"video/mp4","count":"292"}
[matched]:     {"mimeType":"application/gzip","count":"10"}
[matched]:     {"mimeType":"text/x-php","count":"7"}
[matched]:     {"mimeType":"application/rtf","count":"16"}
[matched]:     {"mimeType":"application/vnd.ms-excel.sheet.4","count":"1"}
[matched]:     {"mimeType":"application/dita+xml; format=concept","count":"319"}
[matched]:     {"mimeType":"audio/x-wav","count":"33"}
[matched]:     {"mimeType":"application/x-elc","count":"117"}
[matched]:     {"mimeType":"application/postscript","count":"4"}
[matched]:     {"mimeType":"video/x-ms-wmv","count":"74"}
[matched]:     {"mimeType":"image/vnd.dwg","count":"3"}
[matched]:     {"mimeType":"audio/x-ms-wma","count":"52"}
[matched]:     {"mimeType":"video/x-ms-asf","count":"22"}
[matched]:     {"mimeType":"application/x-matroska","count":"32"}
[matched]:     {"mimeType":"application/rdf+xml","count":"276"}
[matched]:     {"mimeType":"application/msword","count":"5"}
[matched]:     {"mimeType":"video/x-msvideo","count":"30"}
[matched]:     {"mimeType":"application/x-shockwave-flash","count":"60"}
[matched]:     {"mimeType":"application/epub+zip","count":"1"}
[matched]:     {"mimeType":"image/x-ms-bmp","count":"19"}
[matched]:     {"mimeType":"text/plain","count":"13264"}
[matched]:     {"mimeType":"video/quicktime","count":"530"}
[matched]:     {"mimeType":"audio/mp4","count":"18"}
[matched]:     {"mimeType":"audio/basic","count":"52"}
[matched]:     {"mimeType":"video/mpeg","count":"155"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"434"}
[matched]:     {"mimeType":"application/vnd.rn-realmedia","count":"91"}
[matched]:     {"mimeType":"application/pdf","count":"11358"}
[matched]:     {"mimeType":"video/x-m4v","count":"98"}
[matched]:     {"mimeType":"application/x-msmetafile","count":"1"}
[matched]:     {"mimeType":"text/x-python","count":"1"}
]

/usr/local/ndeploy/data/raw/ade.crawl-090314/
15 [
[matched]:     {"mimeType":"application/xml","count":"9"}
[matched]:     {"mimeType":"image/png","count":"5"}
[matched]:     {"mimeType":"image/jpeg","count":"16"}
[matched]:     {"mimeType":"application/rss+xml","count":"10"}
[matched]:     {"mimeType":"text/plain","count":"34"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"388"}
[matched]:     {"mimeType":"application/octet-stream","count":"11"}
[matched]:     {"mimeType":"text/html","count":"241"}
[matched]:     {"mimeType":"application/pdf","count":"10"}
[matched]:     {"mimeType":"application/zip","count":"1"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team34/raw/NSIDC.crawl/
16 [
[matched]:     {"mimeType":"application/x-tar","count":"1"}
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"7"}
[matched]:     {"mimeType":"application/vnd.google-earth.kml+xml","count":"6"}
[matched]:     {"mimeType":"application/x-compress","count":"1"}
[matched]:     {"mimeType":"application/x-shockwave-flash","count":"2"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"1891"}
[matched]:     {"mimeType":"application/octet-stream","count":"46"}
[matched]:     {"mimeType":"application/x-sh","count":"4"}
[matched]:     {"mimeType":"application/zip","count":"3"}
[matched]:     {"mimeType":"application/xml","count":"19"}
[matched]:     {"mimeType":"application/atom+xml","count":"7"}
[matched]:     {"mimeType":"application/rss+xml","count":"118"}
[matched]:     {"mimeType":"text/dif+xml","count":"1"}
[matched]:     {"mimeType":"video/mp4","count":"1"}
[matched]:     {"mimeType":"text/plain","count":"119"}
[matched]:     {"mimeType":"application/rdf+xml","count":"1"}
[matched]:     {"mimeType":"text/html","count":"3408"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"4"}
[matched]:     {"mimeType":"application/pdf","count":"338"}
[matched]:     {"mimeType":"audio/mpeg","count":"1"}
]

/usr/local/ndeploy/data/raw/AmdCrawl/
17 [
[matched]:     {"mimeType":"application/xml","count":"13"}
[matched]:     {"mimeType":"image/png","count":"537"}
[matched]:     {"mimeType":"image/jpeg","count":"17"}
[matched]:     {"mimeType":"application/atom+xml","count":"22"}
[matched]:     {"mimeType":"application/rss+xml","count":"1"}
[matched]:     {"mimeType":"text/plain","count":"178"}
[matched]:     {"mimeType":"image/gif","count":"1179"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"1454"}
[matched]:     {"mimeType":"application/octet-stream","count":"2690"}
[matched]:     {"mimeType":"text/html","count":"33733"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team36/raw/FirstCrawl/first_amd/
18 [
[matched]:     {"mimeType":"application/x-elc","count":"39"}
[matched]:     {"mimeType":"application/vnd.ms-excel","count":"2"}
[matched]:     {"mimeType":"image/tiff","count":"1"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"5258"}
[matched]:     {"mimeType":"application/zip","count":"10"}
[matched]:     {"mimeType":"image/png","count":"914"}
[matched]:     {"mimeType":"application/atom+xml","count":"460"}
[matched]:     {"mimeType":"application/x-gtar","count":"1"}
[matched]:     {"mimeType":"application/rdf+xml","count":"61"}
[matched]:     {"mimeType":"image/gif","count":"1377"}
[matched]:     {"mimeType":"text/html","count":"14898"}
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"17"}
[matched]:     {"mimeType":"application/vnd.google-earth.kml+xml","count":"1"}
[matched]:     {"mimeType":"image/vnd.microsoft.icon","count":"57"}
[matched]:     {"mimeType":"image/svg+xml","count":"1"}
[matched]:     {"mimeType":"application/octet-stream","count":"1040"}
[matched]:     {"mimeType":"application/xml","count":"58"}
[matched]:     {"mimeType":"image/x-ms-bmp","count":"2"}
[matched]:     {"mimeType":"image/jpeg","count":"1446"}
[matched]:     {"mimeType":"application/rss+xml","count":"111"}
[matched]:     {"mimeType":"text/plain","count":"2892"}
[matched]:     {"mimeType":"application/gzip","count":"20"}
[matched]:     {"mimeType":"video/mpeg","count":"2"}
[matched]:     {"mimeType":"application/rtf","count":"4"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"6"}
[matched]:     {"mimeType":"application/pdf","count":"455"}
[matched]:     {"mimeType":"application/dita+xml; format=concept","count":"26"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team16/raw/ade.crawl/
19 [
[matched]:     {"mimeType":"application/x-tar","count":"2"}
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"69"}
[matched]:     {"mimeType":"application/vnd.google-earth.kml+xml","count":"11"}
[matched]:     {"mimeType":"application/x-compress","count":"2"}
[matched]:     {"mimeType":"video/x-ms-wmv","count":"6"}
[matched]:     {"mimeType":"text/x-perl","count":"2"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"59019"}
[matched]:     {"mimeType":"application/octet-stream","count":"21565"}
[matched]:     {"mimeType":"application/zip","count":"8"}
[matched]:     {"mimeType":"application/xml","count":"73"}
[matched]:     {"mimeType":"image/jpeg","count":"370"}
[matched]:     {"mimeType":"image/png","count":"22"}
[matched]:     {"mimeType":"application/rss+xml","count":"442"}
[matched]:     {"mimeType":"application/atom+xml","count":"12"}
[matched]:     {"mimeType":"video/mp4","count":"3"}
[matched]:     {"mimeType":"text/plain","count":"816"}
[matched]:     {"mimeType":"application/rdf+xml","count":"2"}
[matched]:     {"mimeType":"image/gif","count":"2"}
[matched]:     {"mimeType":"video/x-msvideo","count":"18"}
[matched]:     {"mimeType":"text/html","count":"7748"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"4"}
[matched]:     {"mimeType":"application/pdf","count":"2361"}
[matched]:     {"mimeType":"audio/mpeg","count":"6"}
]

/usr/local/ndeploy/data/raw/CS572Spring2015/Team36/raw/FirstCrawl/first_nsidc/
20 [
[matched]:     {"mimeType":"application/vnd.ms-excel","count":"26"}
[matched]:     {"mimeType":"image/x-bpg","count":"7"}
[matched]:     {"mimeType":"application/x-bzip2","count":"17"}
[matched]:     {"mimeType":"image/tiff","count":"12"}
[matched]:     {"mimeType":"audio/x-aiff","count":"2"}
[matched]:     {"mimeType":"application/x-font-ttf","count":"2"}
[matched]:     {"mimeType":"application/xhtml+xml","count":"46041"}
[matched]:     {"mimeType":"application/zip","count":"119"}
[matched]:     {"mimeType":"image/png","count":"6394"}
[matched]:     {"mimeType":"application/atom+xml","count":"183"}
[matched]:     {"mimeType":"application/x-gtar","count":"7"}
[matched]:     {"mimeType":"application/x-bibtex-text-file","count":"1"}
[matched]:     {"mimeType":"image/gif","count":"11906"}
[matched]:     {"mimeType":"text/html","count":"58501"}
[matched]:     {"mimeType":"audio/mpeg","count":"131"}
[matched]:     {"mimeType":"application/x-tika-msoffice","count":"406"}
[matched]:     {"mimeType":"application/ogg","count":"3"}
[matched]:     {"mimeType":"application/vnd.google-earth.kml+xml","count":"29"}
[matched]:     {"mimeType":"image/vnd.microsoft.icon","count":"440"}
[matched]:     {"mimeType":"text/x-perl","count":"1"}
[matched]:     {"mimeType":"image/vnd.adobe.photoshop","count":"1"}
[matched]:     {"mimeType":"application/octet-stream","count":"5024"}
[matched]:     {"mimeType":"text/troff","count":"1"}
[matched]:     {"mimeType":"application/xml","count":"380"}
[matched]:     {"mimeType":"image/jpeg","count":"12948"}
[matched]:     {"mimeType":"application/rss+xml","count":"1273"}
[matched]:     {"mimeType":"video/mp4","count":"42"}
[matched]:     {"mimeType":"application/gzip","count":"748"}
[matched]:     {"mimeType":"text/x-php","count":"3"}
[matched]:     {"mimeType":"application/rtf","count":"7"}
[matched]:     {"mimeType":"audio/x-wav","count":"1"}
[matched]:     {"mimeType":"application/x-elc","count":"70"}
[matched]:     {"mimeType":"text/x-vcard","count":"6"}
[matched]:     {"mimeType":"application/xslt+xml","count":"4"}
[matched]:     {"mimeType":"application/postscript","count":"18"}
[matched]:     {"mimeType":"video/x-ms-wmv","count":"14"}
[matched]:     {"mimeType":"audio/x-ms-wma","count":"1"}
[matched]:     {"mimeType":"application/x-matroska","count":"1"}
[matched]:     {"mimeType":"application/x-msdownload","count":"7"}
[matched]:     {"mimeType":"application/x-executable","count":"6"}
[matched]:     {"mimeType":"application/rdf+xml","count":"45"}
[matched]:     {"mimeType":"application/msword","count":"2"}
[matched]:     {"mimeType":"video/x-msvideo","count":"2"}
[matched]:     {"mimeType":"application/x-bzip","count":"6"}
[matched]:     {"mimeType":"application/x-shockwave-flash","count":"22"}
[matched]:     {"mimeType":"image/svg+xml","count":"31"}
[matched]:     {"mimeType":"application/epub+zip","count":"12"}
[matched]:     {"mimeType":"image/x-ms-bmp","count":"22"}
[matched]:     {"mimeType":"text/plain","count":"14851"}
[matched]:     {"mimeType":"video/quicktime","count":"78"}
[matched]:     {"mimeType":"audio/x-mpegurl","count":"1"}
[matched]:     {"mimeType":"audio/x-flac","count":"1"}
[matched]:     {"mimeType":"video/mpeg","count":"12"}
[matched]:     {"mimeType":"application/x-tika-ooxml","count":"308"}
[matched]:     {"mimeType":"application/vnd.rn-realmedia","count":"3"}
[matched]:     {"mimeType":"audio/vorbis","count":"1"}
[matched]:     {"mimeType":"application/pdf","count":"6073"}
[matched]:     {"mimeType":"video/x-m4v","count":"25"}
[matched]:     {"mimeType":"text/x-python","count":"2"}
[matched]:     {"mimeType":"application/x-msmetafile","count":"4"}
]
