Re: [memex-jpl] this week action from luke
Great work Luke, and both of these changes make sense. Please send the pull request for that, thank you! Great work Giuseppe! Go team!

Cheers,
Chris

Chris Mattmann
chris.mattm...@gmail.com

-----Original Message-----
From: Luke hanson311...@gmail.com
Date: Thursday, April 23, 2015 at 3:08 AM
To: 'Luke' hanson311...@gmail.com, Chris Mattmann chris.a.mattm...@jpl.nasa.gov, Chris Mattmann chris.mattm...@gmail.com, 'Totaro, Giuseppe U (3980-Affiliate)' tot...@di.uniroma1.it, dev@tika.apache.org, 'Bryant, Ann C (398G-Affiliate)' anniebry...@gmail.com, 'Zimdars, Paul A (3980-Affiliate)' paul.a.zimd...@jpl.nasa.gov, NSF Polar CyberInfrastructure DR Students nsf-polar-usc-stude...@googlegroups.com, memex-...@googlegroups.com
Subject: RE: [memex-jpl] this week action from luke
RE: [memex-jpl] this week action from luke
Both patches from Giuseppe work, based on my tests. I could see the magic tag being written at the beginning of the file, and the cbor extension being appended as well, when running the Nutch dump tool command with the -extension cbor option. Thanks a lot for the kind help, Giuseppe, highly appreciated. I want to give a big thumbs up to Giuseppe's work; it is thorough and considerate.

Professor, even with Giuseppe's two patches we still need to make a small change in Tika's tika-mimetypes.xml. (BTW, the cbor magic tag can be used as magic bytes in Tika because it does not look very common; even if it accidentally appears in some other type of file, Tika still has the extension and the metadata hint as a fallback strategy.) I am going to send another pull request with that change, but before that I would like to explain what I am going to change, to avoid any confusion.

Right now we have two problems.

Problem 1: magic priority 40. application/xhtml+xml has a higher magic priority (50) than application/cbor (40). [I don't know who assigned 40 to cbor, or why.] So if xhtml gets read and compared first, cbor will not even be placed in the magic estimation list, because it has a lower priority. Based on the tests, xhtml does get read and compared with the input file first, so any type with a priority below 50 is disregarded.

Problem 2: magic priority 50. If cbor is given priority 50 as well, then for a file dumped by the Nutch dumper tool both types (xhtml and cbor) are selected as candidate mime types and put in the magic estimation list. Since the xhtml type gets read first, it is placed above cbor. To break that tie, Tika relies on the decision from the extension method. If the extension method fails to detect the type (let's ignore the metadata hint method for simplicity, but the same applies to it), then xhtml is eventually returned.
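The two problems above can be modelled with a small sketch. It assumes, as the message describes, that magic rules are tried in descending priority order, that rules with a lower priority are skipped once something has matched, and that an extension hint breaks ties among the remaining candidates. MagicRule, candidates and pick are hypothetical names for illustration only; this is a toy model, not Tika's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model of the behaviour described in the message; not Tika's actual code.
final class MagicPrioritySketch {

    // Hypothetical magic rule: a type, its priority, and whether it matched the file.
    static final class MagicRule {
        final String type;
        final int priority;
        final boolean matches;

        MagicRule(String type, int priority, boolean matches) {
            this.type = type;
            this.priority = priority;
            this.matches = matches;
        }
    }

    // Collect matching types, highest priority first. Once a rule has matched,
    // rules with a strictly lower priority are never consulted (Problem 1).
    static List<String> candidates(List<MagicRule> rules) {
        List<MagicRule> sorted = new ArrayList<>(rules);
        // List.sort is stable, so rules with equal priority keep their input order.
        sorted.sort(Comparator.comparingInt((MagicRule r) -> r.priority).reversed());
        List<String> result = new ArrayList<>();
        int matchedPriority = -1;
        for (MagicRule rule : sorted) {
            if (matchedPriority > rule.priority) {
                break;
            }
            if (rule.matches) {
                result.add(rule.type);
                matchedPriority = rule.priority;
            }
        }
        return result;
    }

    // Tie-break (Problem 2): prefer the extension hint if it is among the
    // candidates, otherwise fall back to the first candidate in the list.
    static String pick(List<String> candidates, String extensionHint) {
        if (candidates.isEmpty()) {
            return "application/octet-stream";
        }
        if (extensionHint != null && candidates.contains(extensionHint)) {
            return extensionHint;
        }
        return candidates.get(0);
    }

    public static void main(String[] args) {
        List<MagicRule> cborAt40 = Arrays.asList(
                new MagicRule("application/xhtml+xml", 50, true),
                new MagicRule("application/cbor", 40, true));
        List<MagicRule> cborAt50 = Arrays.asList(
                new MagicRule("application/xhtml+xml", 50, true),
                new MagicRule("application/cbor", 50, true));
        System.out.println(candidates(cborAt40));
        System.out.println(pick(candidates(cborAt50), "application/cbor"));
        System.out.println(pick(candidates(cborAt50), null));
    }
}
```

In this model, cbor at priority 40 never reaches the candidate list. With both types at 50, the extension hint decides, and a missing hint falls back to xhtml because it was listed first.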
My pull request to be sent: I am going to set the magic priority of the cbor type to 50, the same as xhtml, because it would probably be risky to discard any of the estimated types without consulting the extension method.

Any comments, suggestions, or thoughts are welcome and appreciated.

Thanks
Luke

-----Original Message-----
From: Luke [mailto:hanson311...@gmail.com]
Sent: Wednesday, April 22, 2015 7:45 PM
To: 'Mattmann, Chris A (3980)'
Subject: RE: [memex-jpl] this week action from luke

Hi Prof,

The test has finished and the result is as expected. Both runs (Tika with the prob feature and Tika without it) produced the same Stats total. Please see the attached matched.txt, dumped by a small program that compares, verbatim, each line in every section of the Stats total between the log produced by the Tika that has the feature and the log produced by the one without it. If string.equals(...) is satisfied, the line is dumped out; if there is a mismatch (e.g. the count for a particular mime type is different), an error is dumped out. I don't see any error in the printout, so I think the feature has passed the test.

The processing times of the two tests are as follows.

Nutch dumper tool with the Tika that has the prob selection feature:
from 2015-04-22 15:47:08,330 to 2015-04-22 17:48:28,877

Nutch dumper tool with the Tika that does not have the feature:
from 2015-04-22 22:41:23,459 to 2015-04-23 00:11:02,767

BTW, I forgot to mention that the probabilistic mime selector with default weight settings also gives the following result, because by default I intentionally assign a higher weight to the magic bytes method so that it works in a way similar to the old strategy. On the other hand, if I know that the extension is more reliable, I can add more weight to the extension approach; in that case the prob mime selector will return application/cbor, with a higher weight value.

<match value="&lt;html xmlns=" type="string" offset="0:1024"/>
Result: text/html

<match value="&lt;html xmlns=" type="string" offset="0:6000"/>
Result: application/xhtml+xml

Please let me know if anything about the tests is unclear.

Thanks
Luke
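For reference, the two log intervals quoted above can be turned into durations. The sketch below only parses the four timestamps from the message with java.time; the class and method names are made up for illustration, and it says nothing about why the two runs differ.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative helper (hypothetical names): elapsed time between two log timestamps.
final class RunDuration {
    // Timestamp layout as it appears in the message, e.g. 2015-04-22 15:47:08,330
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    static Duration between(String start, String end) {
        return Duration.between(LocalDateTime.parse(start, LOG_TIME),
                                LocalDateTime.parse(end, LOG_TIME));
    }

    public static void main(String[] args) {
        // Run with the prob selection feature.
        System.out.println(between("2015-04-22 15:47:08,330", "2015-04-22 17:48:28,877"));
        // Run without the feature (crosses midnight).
        System.out.println(between("2015-04-22 22:41:23,459", "2015-04-23 00:11:02,767"));
    }
}
```

The first interval works out to 2 h 1 min 20.547 s and the second to 1 h 29 min 39.308 s.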
Re: [memex-jpl] this week action from luke
Hi Luke,

Actually I just meant go into tika-mimetypes.xml and change the magic offsets for application/xhtml+xml and see if that works. The code you changed below is actually how many bytes Tika will first download to do MIME checking.

Cheers,
Chris

Chris Mattmann
chris.mattm...@gmail.com
RE: [memex-jpl] this week action from luke
Hi professor,

I just tried it with minLength set to 1024, and I get the following:

text/plain

I am a bit surprised. BTW, a min length of 6000 still gives application/xhtml+xml; with anything below 1024 I am also seeing text/plain. :)

BTW, the min length I am referring to and altering is the following, in MimeTypes.java:

    public int getMinLength() {
        // This needs to be reasonably large to be able to correctly detect
        // things like XML root elements after initial comment and DTDs
        return 64 * 1024;
    }

Thanks
Luke
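As Chris notes in his reply, getMinLength() governs how many leading bytes are fetched for MIME checking, which is a separate knob from the offset range of an individual magic rule. A minimal sketch of that kind of bounded read, using only the JDK, might look like the following. readPrefix is a hypothetical helper, not a Tika method, and it assumes a stream that supports mark/reset.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper: peek at up to maxBytes from the start of a stream,
// then rewind it so a later parser still sees the whole content.
final class PrefixReader {
    static byte[] readPrefix(InputStream in, int maxBytes) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(maxBytes);
        try {
            byte[] buffer = new byte[maxBytes];
            int total = 0;
            while (total < maxBytes) {
                int n = in.read(buffer, total, maxBytes - total);
                if (n == -1) {
                    break; // the stream is shorter than maxBytes
                }
                total += n;
            }
            // Only the bytes actually read are handed to the detector.
            return Arrays.copyOf(buffer, total);
        } finally {
            in.reset();
        }
    }
}
```

With a small limit, any magic string that starts beyond the buffered prefix can never match, which would be consistent with the fallback to text/plain reported above.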
RE: [memex-jpl] this week action from luke
Hi professor,

Please see the following results.

<match value="&lt;html xmlns=" type="string" offset="0:1024"/>
Result: text/html

<match value="&lt;html xmlns=" type="string" offset="0:6000"/>
Result: application/xhtml+xml

Thanks
Luke
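One way to read the two results for offset 0:1024 and offset 0:6000 is that the offset range limits where the matched string may begin. The sketch below assumes that interpretation (the pattern may start at any position from the range start to the range end) and uses a made-up input in which the root element begins after byte 1024; the thread does not say where it begins in the real test file. matchesWithin is a hypothetical helper, not Tika's API.

```java
import java.nio.charset.StandardCharsets;

// Sketch of an offset-range string match under the assumption stated above.
final class OffsetRangeMatch {
    // True if pattern occurs in data starting at some index in [start, end].
    static boolean matchesWithin(byte[] data, byte[] pattern, int start, int end) {
        int last = Math.min(end, data.length - pattern.length);
        for (int i = Math.max(start, 0); i <= last; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up document: 3000 bytes of leading whitespace, then the root element.
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 3000; i++) {
            doc.append(' ');
        }
        doc.append("<html xmlns=\"http://www.w3.org/1999/xhtml\">");
        byte[] data = doc.toString().getBytes(StandardCharsets.US_ASCII);
        byte[] pattern = "<html xmlns=".getBytes(StandardCharsets.US_ASCII);

        System.out.println(matchesWithin(data, pattern, 0, 1024)); // false
        System.out.println(matchesWithin(data, pattern, 0, 6000)); // true
    }
}
```

Under this reading, narrowing the range to 0:1024 simply stops the xhtml rule from matching, so a less specific type is reported instead.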
RE: [memex-jpl] this week action from luke
Hi Prof,

I am actually working on that. It takes a while (around 2 or 3 hours) to run the whole gen-common-crawl.sh script. A couple of suspicious errors also made me rerun the script a couple of times; I need to be careful when testing with data of that size. I will keep you updated on the findings and progress.

Thanks
Luke

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Wednesday, April 22, 2015 3:49 PM
To: Luke
Subject: Re: [memex-jpl] this week action from luke

Thanks Luke, this is probably a good opportunity to test out your Bayesian mime detector. How does it perform here?

Sent from my iPhone

On Apr 22, 2015, at 3:29 PM, Luke hanson311...@gmail.com wrote:

Hi professor,

Please see the following results.

<match value="&lt;html xmlns=" type="string" offset="0:1024"/>
Result: text/html

<match value="&lt;html xmlns=" type="string" offset="0:6000"/>
Result: application/xhtml+xml

Thanks
Luke

-----Original Message-----
From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
Sent: Wednesday, April 22, 2015 4:21 AM
Subject: Re: [memex-jpl] this week action from luke

Hi Luke,

Actually I just meant go into tika-mimetypes.xml and change the magic offsets for application/xhtml+xml and see if that works. The code you changed below is actually how many bytes Tika will first download to do MIME checking.

Cheers,
Chris

-----Original Message-----
From: Luke hanson311...@gmail.com
Date: Wednesday, April 22, 2015 at 2:25 AM
Subject: RE: [memex-jpl] this week action from luke

Hi professor,

I just tried it with minLength set to 1024, and I get the following:

text/plain

I am a bit surprised. BTW, a min length of 6000 still gives application/xhtml+xml; with anything below 1024 I am also seeing text/plain. :)

BTW, the min length I am referring to and altering is the following, in MimeTypes.java:

    public int getMinLength() {
        // This needs to be reasonably large to be able to correctly detect
        // things like XML root elements after initial comment and DTDs
        return 64 * 1024;
    }

Thanks
Luke

-----Original Message-----
From: Chris Mattmann [mailto:chris.mattm...@gmail.com]
Sent: Tuesday, April 21, 2015 7:48 PM
Subject: Re: [memex-jpl] this week action from luke

Thanks Luke. So I guess all I was asking was could you try it out. Thanks for the lesson in the RFC.

Cheers,
Chris
Re: [memex-jpl] this week action from luke
Thanks Luke. So I guess all I was asking was could you try it out. Thanks for the lesson in the RFC.

Cheers,
Chris

Chris Mattmann
chris.mattm...@gmail.com

-----Original Message-----
From: Luke hanson311...@gmail.com
Date: Wednesday, April 22, 2015 at 1:46 AM
Subject: RE: [memex-jpl] this week action from luke

Hi professor,

I think it depends heavily on the content being read by Tika. If the file contains a sequence of bytes that matches the magic of one or more of the mime types defined in our tika-mimetypes.xml, I guess Tika will put those types in its estimation list; note that the magic-byte detection approach can produce multiple estimated mime types. Tika then also considers the decision made by the extension detection approach: if the extension says the file type is the first one in the magic estimation list, then certainly the first one will be returned (the same applies to the metadata hint approach). Of course, Tika also prefers the type that is the most specialized.

Let's get back to the following question; this is only my guess, though.

[Prof]: Also what happens if you tweak the definition of XHTML to not scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?

Let's consider an extreme case where we only scan 10 bytes or 1 byte. Then magic bytes will inevitably detect nothing, and I think Tika will return something like application/octet-stream, which is the most general type.
As mentioned, Tika favours the most specialized type. In this extreme case almost every type is a subclass of application/octet-stream, so if the extension approach returns a more specialized type, the answer may be yes: I think it is very possible that the CBOR type detected by the extension approach takes over in this case...

My idea was, and still is, that if the CBOR self-describing tag 55799 is present in the cbor file, it can be used to detect the cbor type. Again, the cbor type will probably be added to the magic estimation list together with another type such as application/xhtml+xml. I guess the order in the list also matters: the first entry is preferred over the next one. The decision from the extension detection approach also plays a role in breaking the tie; e.g. if the extension detection method agrees on cbor with one of the estimated types in the magic list, then cbor will be returned (again, the same applies to the metadata hint method).

I have not taken a closer look at a cbor file that has the tag 55799, but I expect its hex to be something like 0xd9d9f7, or the tag to be present in the header as a fixed sequence of bytes (https://tools.ietf.org/html/rfc7049#section-2.4.5). If it is present in the file, or preferably in the header (within a reasonable range of bytes), I believe it can probably be used as the magic number for the cbor type.

There is another thing I mentioned in the JIRA ticket I opened yesterday about the cbor parser and detection: it is also possible that cbor content is embedded inside a plain JSON file, and the way a decoder can distinguish them in that file is again by looking at the tag 55799. This may rarely happen, but a robust parser might be able to take care of it; Tika might need to consider using FasterXML (Jackson), which the Nutch tool uses, when developing the cbor parser...
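The check on the self-describing tag can be sketched in a few lines. Tag 55799 serializes to the three bytes 0xd9 0xd9 0xf7 (RFC 7049, section 2.4.5), so the sketch only compares the first three bytes of the data. The class and method names are made up for illustration, and CBOR data written without the optional tag would not be recognized this way.

```java
import java.util.Arrays;

// Sketch: does the data begin with the CBOR self-describe tag 55799?
final class CborSelfDescribe {
    // 0xd9 = major type 6 (tag) with a 2-byte argument; 0xd9f7 = 55799.
    private static final byte[] TAG_55799 = {(byte) 0xd9, (byte) 0xd9, (byte) 0xf7};

    static boolean startsWithSelfDescribeTag(byte[] data) {
        if (data == null || data.length < TAG_55799.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOfRange(data, 0, TAG_55799.length), TAG_55799);
    }
}
```

Because a JSON text cannot begin with these bytes, the same check also separates tagged CBOR from JSON at the start of a file.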
Again, let me cite the same paragraph from the RFC: "A decoder might be able to parse both CBOR and JSON. Such a decoder would need to mechanically distinguish the two formats. An easy way for an encoder to help the decoder would be to tag the entire CBOR item with tag 55799, the serialization of which will never be found at the beginning of a JSON text."

Thanks
Luke

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, April 21, 2015 9:49 PM
To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); 'NSF Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Hi Luke,

Can you post the below conversation to dev@tika and summarize it there. Also what happens if you tweak the definition of XHTML to not scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data
RE: [memex-jpl] this week action from luke
Hi professor,

I think it depends heavily on the content Tika is reading. If the file contains a byte sequence that matches the magic of one or more MIME types defined in our tika-mimetypes.xml, I guess that Tika will put those types in its estimation list; please note that the magic-byte detection approach can yield multiple estimated MIME types. Tika also considers the decision made by the extension detection approach: if the extension agrees with the first type in the magic estimation list, then that first type will certainly be returned (the same applies to the metadata hint approach). Of course, Tika also prefers the most specialized type.

Let's get back to the following question; here is my guess, though.

[Prof]: Also what happens if you tweak the definition of XHTML to not scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?

Let's consider an extreme case where we scan only 10 bytes, or even 1 byte. Then magic-byte detection will inevitably detect nothing, and I think Tika will return something like application/octet-stream, which is the most general type. As mentioned, Tika favours the most specialized type. I believe almost every type is a subtype of application/octet-stream, so in this extreme case the extension approach returns the more specialized type and the answer may be yes: I think it is very possible that the CBOR type detected by the extension approach takes over...

My idea was, and still is, that if the CBOR self-describing tag 55799 is present in the CBOR file, it can be used to detect the CBOR type. Again, the CBOR type will probably be added to the magic estimation list together with another type such as application/html. I guess the order in the list also matters: the first entry is preferred over the next one. The decision from the extension detection approach also plays a role in breaking the tie; e.g. if the extension detection method agrees on CBOR with one of the estimated types in the magic list, then CBOR will be returned (again, the same applies to the metadata hint method).

I have not taken a closer look at a CBOR file that has tag 55799, but I expect its hex to be something like 0xd9d9f7, i.e. the tag should be present in the header as a fixed sequence of bytes (https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is present in the file, or preferably in the header (within a reasonable range of bytes), I believe it can probably be used as the magic number for the CBOR type.

There is another thing I mentioned in the JIRA ticket I opened yesterday against the CBOR parser and detection: it is also possible for CBOR content to be embedded inside a plain JSON file, and the way a decoder can distinguish them in that file is again by looking at tag 55799. This may rarely happen, but a robust parser might be able to take care of it. Tika might need to consider the use of fastXML, which is used by the Nutch tool, when developing the CBOR parser...

Again, let me cite the same paragraph from the RFC: "A decoder might be able to parse both CBOR and JSON. Such a decoder would need to mechanically distinguish the two formats. An easy way for an encoder to help the decoder would be to tag the entire CBOR item with tag 55799, the serialization of which will never be found at the beginning of a JSON text."

Thanks
Luke

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, April 21, 2015 9:49 PM
To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); 'NSF Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Hi Luke,

Can you post the below conversation to dev@tika and summarize it there.
Also what happens if you tweak the definition of XHTML to not scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: Luke hanson311...@gmail.com
Date: Wednesday, April 22, 2015 at 12:19 AM
To: Chris Mattmann chris.mattm...@gmail.com, Totaro, Giuseppe U (3980-Affiliate) tot...@di.uniroma1.it, Chris Mattmann chris.a.mattm...@jpl.nasa.gov
Cc: Bryant, Ann