Re: [memex-jpl] this week action from luke

2015-04-23 Thread Chris Mattmann
Great work, Luke. Both of these changes make sense.
Please send the pull request for that. Thank you!

Great work Giuseppe! Go team!

Cheers,
Chris


Chris Mattmann
chris.mattm...@gmail.com




RE: [memex-jpl] this week action from luke

2015-04-23 Thread Luke
Both of Giuseppe's patches work in my tests. I could see the magic tag being
written at the beginning of the file, and the cbor extension being appended
when running the Nutch dump tool with the -extension cbor option. Thanks a lot
for the kind help, Giuseppe, it is highly appreciated. I want to give a big
thumbs up to Giuseppe's work; it is thorough and considerate.

To the professor:
With Giuseppe's two patches in place, we still need to make a small change to
Tika's tika-mimetypes.xml. (BTW, the cbor magic tag can be used as magic bytes
in Tika because it does not look very common; even if it accidentally appears
in some other type of file, Tika still has the extension and the metadata hint
as a fallback strategy.) I am going to send another pull request with that
change, but before that I would like to spell out what I am going to change,
to avoid any confusion.

Now we have two problems.
Problem 1: magic priority 40.
application/xhtml+xml has a higher magic priority (50) than application/cbor
(40). [I don't know who assigned 40 to cbor, or why.] So if xhtml is read and
compared first, cbor will not even be placed in the magic estimation list,
because its priority is lower. My tests confirm that xhtml is indeed read and
compared with the input file first, so any type with a priority below 50 is
disregarded.


Problem 2: magic priority again, this time with both types at 50.
In Tika, given a file dumped by the Nutch dumper tool, both types (xhtml and
cbor) are selected as candidate mime types and put in the magic estimation
list. Since the xhtml type is read first, it is placed above cbor. To break
that tie, Tika relies on the decision of the extension method. If the
extension method fails to detect the type (let's ignore the metadata hint
method for simplicity, but the same applies to it), then xhtml is eventually
returned.

My pull request to be sent: I am going to set the magic priority of the cbor
type to 50, the same as xhtml, because it would probably be risky to discard
any of the estimated types without consulting the extension method.
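
The two problems and the proposed change can be pictured with a toy model. This is only a sketch: it assumes that magic matches are kept only at the highest priority seen and that the extension guess breaks ties, which is how the behaviour is described in this message. The class MagicPrioritySketch and its methods are hypothetical and are not Tika's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the behaviour described in this message; NOT Tika's implementation.
final class MagicPrioritySketch {

    // A magic rule that matched the input, with its priority from the mime types file.
    static final class Match {
        final String type;
        final int priority;

        Match(String type, int priority) {
            this.type = type;
            this.priority = priority;
        }
    }

    // Assumption: only matches that share the highest priority stay in the list.
    static List<String> candidates(List<Match> matchesInReadOrder) {
        int best = Integer.MIN_VALUE;
        for (Match m : matchesInReadOrder) {
            best = Math.max(best, m.priority);
        }
        List<String> out = new ArrayList<>();
        for (Match m : matchesInReadOrder) {
            if (m.priority == best) {
                out.add(m.type);
            }
        }
        return out;
    }

    // Assumption: the extension guess wins if it is a candidate, else the first candidate wins.
    static String detect(List<Match> matchesInReadOrder, String extensionGuess) {
        List<String> c = candidates(matchesInReadOrder);
        if (c.isEmpty()) {
            return "application/octet-stream";
        }
        if (extensionGuess != null && c.contains(extensionGuess)) {
            return extensionGuess;
        }
        return c.get(0);
    }

    public static void main(String[] args) {
        Match xhtml = new Match("application/xhtml+xml", 50);
        // Problem 1: cbor at priority 40 never reaches the list.
        System.out.println(detect(Arrays.asList(xhtml, new Match("application/cbor", 40)), "application/cbor"));
        // Proposed change: cbor at 50 stays in the list and the extension breaks the tie.
        System.out.println(detect(Arrays.asList(xhtml, new Match("application/cbor", 50)), "application/cbor"));
        // Problem 2: equal priorities but no usable extension, so the first entry is returned.
        System.out.println(detect(Arrays.asList(xhtml, new Match("application/cbor", 50)), null));
    }
}
```

In this model, raising cbor to 50 is what lets the extension method take part in the decision at all.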

Any comments, suggestions, or thoughts are welcome and appreciated.

Thanks
Luke

-Original Message-
From: Luke [mailto:hanson311...@gmail.com] 
Sent: Wednesday, April 22, 2015 7:45 PM
To: 'Mattmann, Chris A (3980)'
Cc: 'Chris Mattmann'; 'Totaro, Giuseppe U (3980-Affiliate)';
'dev@tika.apache.org'; 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A
(3980-Affiliate)'; 'NSF Polar CyberInfrastructure DR Students';
'memex-...@googlegroups.com'
Subject: RE: [memex-jpl] this week action from luke

Hi Prof,

The test has finished, and the result is as expected.
Both runs (Tika with the prob feature and Tika without it) produced the same
Stats total. Please see the attached matched.txt, dumped by a small program
that compares, verbatim, each line in every section of the Stats total between
the log produced by Tika with the feature and the log produced by Tika without
it. If string.equals(...) holds, the line is dumped out; if there is a mismatch
(e.g. the count for a particular mime type is different), an error is dumped
out instead. I don't see any error in the printout, so I think the feature has
passed the test.
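
The comparison program itself is not included in the thread. A minimal sketch of the check described above could look like the following; the class name, the report format and the sample lines are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical re-creation of the verbatim line comparison; not the attached program.
final class StatsCompareSketch {

    // Returns one report line per input line: the line itself if both logs agree,
    // otherwise an error entry naming the line number and both values.
    static List<String> compare(List<String> withFeature, List<String> withoutFeature) {
        List<String> report = new ArrayList<>();
        int n = Math.max(withFeature.size(), withoutFeature.size());
        for (int i = 0; i < n; i++) {
            String a = i < withFeature.size() ? withFeature.get(i) : null;
            String b = i < withoutFeature.size() ? withoutFeature.get(i) : null;
            if (a != null && a.equals(b)) {
                report.add(a);
            } else {
                report.add("ERROR line " + (i + 1) + ": [" + a + "] vs [" + b + "]");
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Made-up sample lines; the real Stats total sections are not shown in the thread.
        List<String> a = Arrays.asList("text/html: 3", "application/cbor: 2");
        List<String> b = Arrays.asList("text/html: 3", "application/cbor: 1");
        for (String line : compare(a, b)) {
            System.out.println(line);
        }
    }
}
```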


The processing times of the two tests are as follows.
Start and end time for the test where the Nutch dumper tool used Tika with the
prob selection feature:
from
2015-04-22 15:47:08,330
to
2015-04-22 17:48:28,877

Start and end time for the test where the Nutch dumper tool used Tika without
the feature:
from
2015-04-22 22:41:23,459
to
2015-04-23 00:11:02,767
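
For reference, the elapsed time of each run can be computed from the timestamps above. The sketch below assumes the log format is yyyy-MM-dd HH:mm:ss,SSS; the class name is made up. The run with the feature comes to roughly 2 h 01 min, and the run without it to roughly 1 h 30 min.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for turning the two log timestamps of a run into a duration.
final class ElapsedSketch {

    // Assumed timestamp layout, e.g. "2015-04-22 15:47:08,330".
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    static Duration elapsed(String from, String to) {
        return Duration.between(LocalDateTime.parse(from, LOG_TS), LocalDateTime.parse(to, LOG_TS));
    }

    public static void main(String[] args) {
        System.out.println("with feature:    "
                + elapsed("2015-04-22 15:47:08,330", "2015-04-22 17:48:28,877"));
        System.out.println("without feature: "
                + elapsed("2015-04-22 22:41:23,459", "2015-04-23 00:11:02,767"));
    }
}
```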


BTW, I forgot to mention that the probabilistic mime selector with default
weight settings also gives the following result, because by default I
intentionally assign a higher weight to the magic bytes method so that it
behaves similarly to the old strategy. On the other hand, if I know that the
extension is more reliable, I can certainly give more weight to the extension
approach; in that case, the prob mime selector will return application/cbor
with a higher weight value.

 <match value="&lt;html xmlns=" type="string" offset="0:1024"/>
 Result: text/html

 <match value="&lt;html xmlns=" type="string" offset="0:6000"/>
 Result: application/xhtml+xml
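
The effect of the weights described above can be pictured with a toy weighted vote. This is an illustration only: the class name, the weight values and the scoring rule are assumptions and do not reproduce the probabilistic selector's actual computation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy weighted vote between two detection methods; NOT the probabilistic selector itself.
final class WeightedVoteSketch {

    // Each method votes for its guess with its weight; the highest total wins.
    // On an exact tie the magic guess wins because it is inserted first.
    static String pick(String magicGuess, double magicWeight,
                       String extensionGuess, double extensionWeight) {
        Map<String, Double> score = new LinkedHashMap<>();
        score.merge(magicGuess, magicWeight, Double::sum);
        score.merge(extensionGuess, extensionWeight, Double::sum);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : score.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Magic weighted higher: behaves like the old strategy.
        System.out.println(pick("application/xhtml+xml", 0.7, "application/cbor", 0.3));
        // Extension weighted higher: cbor is returned.
        System.out.println(pick("application/xhtml+xml", 0.3, "application/cbor", 0.7));
    }
}
```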


Please let me know if anything about the tests is unclear.


Thanks
Luke

Re: [memex-jpl] this week action from luke

2015-04-22 Thread Chris Mattmann
Hi Luke,

Actually, I just meant: go into tika-mimetypes.xml, change the magic offsets
for application/xhtml+xml, and see if that works. The code you changed below
actually controls how many bytes Tika will first download to do MIME checking.

Cheers,
Chris


Chris Mattmann
chris.mattm...@gmail.com





RE: [memex-jpl] this week action from luke

2015-04-22 Thread Luke

Hi professor,

I just tried it with minLength set to 1024, and I get the following:
text/plain
I am a bit surprised.

BTW, a min length of 6000 still gives application/xhtml+xml; with anything
below 1024, I am seeing text/plain. :)

BTW, the min length I am referring to (and altering) is the following, in
MimeTypes.java:
    public int getMinLength() {
        // This needs to be reasonably large to be able to correctly detect
        // things like XML root elements after initial comment and DTDs
        return 64 * 1024;
    }


Thanks
Luke


RE: [memex-jpl] this week action from luke

2015-04-22 Thread Luke
Hi professor,

Please see the following results.
<match value="&lt;html xmlns=" type="string" offset="0:1024"/>
Result: text/html

<match value="&lt;html xmlns=" type="string" offset="0:6000"/>
Result: application/xhtml+xml
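
One way to read these two results is that the offset attribute bounds where the pattern may start. The sketch below is a simplified model of such a range match, not Tika's matcher. It also assumes that the `<html xmlns=` string in the test file starts somewhere between offsets 1024 and 6000, which the thread does not state; the padding of 3000 bytes is made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified model of a string magic with an offset range; NOT Tika's matcher.
final class OffsetRangeSketch {

    static final byte[] PATTERN = "<html xmlns=".getBytes(StandardCharsets.US_ASCII);

    // True if the pattern starts at some offset in [from, to] inside data.
    static boolean matches(byte[] data, byte[] pattern, int from, int to) {
        int last = Math.min(to, data.length - pattern.length);
        for (int start = from; start <= last; start++) {
            int i = 0;
            while (i < pattern.length && data[start + i] == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return true;
            }
        }
        return false;
    }

    // Builds a hypothetical document whose root element starts after `padding` bytes.
    static byte[] sample(int padding) {
        byte[] tail = "<html xmlns=\"http://www.w3.org/1999/xhtml\"></html>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] doc = new byte[padding + tail.length];
        Arrays.fill(doc, 0, padding, (byte) ' ');
        System.arraycopy(tail, 0, doc, padding, tail.length);
        return doc;
    }

    public static void main(String[] args) {
        byte[] doc = sample(3000);
        System.out.println("offset 0:1024 matches: " + matches(doc, PATTERN, 0, 1024));
        System.out.println("offset 0:6000 matches: " + matches(doc, PATTERN, 0, 6000));
    }
}
```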


Thanks
Luke


RE: [memex-jpl] this week action from luke

2015-04-22 Thread Luke
Hi Prof,
I am actually working on that; it takes a while (around 2 or 3 hours) to run
the whole gen-common-crawl.sh script.
A couple of suspicious errors also made me rerun the script a couple of times.
I need to be careful when testing with data of that size.

I will keep you updated on the findings and progress.

Thanks
Luke

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Wednesday, April 22, 2015 3:49 PM
To: Luke
Cc: Chris Mattmann; Totaro, Giuseppe U (3980-Affiliate);
dev@tika.apache.org; Bryant, Ann C (398G-Affiliate); Zimdars, Paul A
(3980-Affiliate); NSF Polar CyberInfrastructure DR Students;
memex-...@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Thanks, Luke. This is probably a good opportunity to test out your Bayesian
mime detector. How does it perform here?

Sent from my iPhone


Re: [memex-jpl] this week action from luke

2015-04-21 Thread Chris Mattmann
Thanks Luke.

So I guess all I was asking was whether you could try it out. Thanks for the
lesson on the RFC.

Cheers,
Chris


Chris Mattmann
chris.mattm...@gmail.com




-Original Message-
From: Luke hanson311...@gmail.com
Date: Wednesday, April 22, 2015 at 1:46 AM
To: Chris Mattmann chris.a.mattm...@jpl.nasa.gov, Chris Mattmann
chris.mattm...@gmail.com, 'Totaro, Giuseppe U (3980-Affiliate)'
tot...@di.uniroma1.it, dev@tika.apache.org
Cc: 'Bryant, Ann C (398G-Affiliate)' anniebry...@gmail.com, 'Zimdars,
Paul A (3980-Affiliate)' paul.a.zimd...@jpl.nasa.gov, NSF Polar
CyberInfrastructure DR Students nsf-polar-usc-stude...@googlegroups.com,
memex-...@googlegroups.com
Subject: RE: [memex-jpl] this week action from luke

Hi professor,


I think it depends heavily on the content being read by Tika. For example, if
the file contains a byte sequence that matches the magic of one or more mime
types defined in our tika-mimetypes.xml, I guess Tika will put those types in
its estimation list; please note that the magic-byte detection approach can
yield multiple estimated mime types. Tika also considers the decision made by
the extension detection approach: if the extension says the file type is the
first one in the magic estimation list, then certainly the first one will be
returned (the same applies to the metadata hint approach).
Of course, Tika also prefers the type that is the most specialized.

Let's get back to the following question; this is only my guess, though.
[Prof]: Also what happens if you tweak the definition of XHTML to not
scan until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
Consider an extreme case where we only scan 10 bytes, or even 1 byte: the
magic bytes will then inevitably detect nothing, and I think Tika will return
something like application/octet-stream, which is the most general type. As
mentioned, Tika favours the most specialized type, and I believe almost every
type is a subclass of application/octet-stream. So if the extension approach
returns a more specialized type, the answer in this extreme case may be yes: I
think it is very possible that the CBOR type detected by the extension
approach takes over...

My idea was and still is that if the cbor self-Describing tag 55799 is
present in the cbor file, then that can be used to detect the cbor type.
Again, the cbor type will probably be appended into the magic estimation
list together with another one such as application/html, I guess the
order in the list probably also matters, the first one is preferred over
the next one. Also the decision from the extension detection approach
also play the role the break the tie.
e.g. if extension detection method agrees on cbor with one of the
estimated type in the magic list, then cbor will be returned. (again,
same thing applies to metadatahint method).

I have not taken a closer look at a cbor file that has the tag 55799, but
I expect to see its hex is something like 0xd9d9f7 or the tag should be
present in the header with a fixed sequence of
bytes(https://tools.ietf.org/html/rfc7049#section-2.4.5 ), if this is
present in the file or preferable in the header (within a reasonable
range of bytes ), I believe it can probably be used as the magic numbers
for the cbor type.


There is another thing I have mentioned in the jira ticket I opened
yesterday against the cbor parser and detection, it is also possible that
cbor content can be imbedded inside a plain json file, the way that a
decoder can distinguish them in that file is by looking at the tag 55799
again. This may rarely happen but a robust parser might be able to take
care of that, tika might need to consider the use of fastXML being used
by the nutch tool when developing the cbor parser...
Again let me cite the same paragraph from the rfc,

 a decoder might be able to parse both CBOR and JSON.
   Such a decoder would need to mechanically distinguish the two
   formats.  An easy way for an encoder to help the decoder would be to
   tag the entire CBOR item with tag 55799, the serialization of which
   will never be found at the beginning of a JSON text.


Thanks
Luke



-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, April 21, 2015 9:49 PM
To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate);
'NSF Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Hi Luke,

Can you post the below conversation to dev@tika and summarize it there.
Also what happens if you tweak the definition of XHTML to not scan until
8192, but say 6000 (e.g., 0:6000), does CBOR take over then?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data

RE: [memex-jpl] this week action from luke

2015-04-21 Thread Luke
Hi professor,


I think it depends heavily on the content Tika reads. If the file contains
a byte sequence that matches the magic pattern of one or more MIME types
defined in tika-mimetypes.xml, I guess Tika puts those types in its
estimation list; note that magic-byte detection can produce several
candidate types. Tika then also considers the result of extension
detection: if the type suggested by the extension is the first one in the
magic estimation list, then that first one is certainly returned (the same
applies to the metadata hint approach).
Of course, Tika also prefers the most specialized type.
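
To make my guess concrete, here is a minimal Python sketch of the flow as I
described it. It is only a model of my description, not Tika's actual code:
the PARENT table and the names is_specialization and combine are made up for
illustration and are not copied from tika-mimetypes.xml.

```python
# Simplified model of the detection flow described above; NOT Tika's real code.
# PARENT is an illustrative type hierarchy, not the one Tika ships.
PARENT = {
    "application/cbor": "application/octet-stream",
    "application/xhtml+xml": "application/xml",
    "application/xml": "application/octet-stream",
}


def is_specialization(child, ancestor):
    """True if `child` equals `ancestor` or descends from it in PARENT."""
    while child is not None:
        if child == ancestor:
            return True
        child = PARENT.get(child)
    return False


def combine(magic_candidates, hint):
    """Pick a type from ordered magic candidates plus an extension/metadata hint."""
    if not magic_candidates:
        # Nothing matched by magic: start from the most general type.
        magic_candidates = ["application/octet-stream"]
    if hint is not None:
        for candidate in magic_candidates:
            # The hint wins when it agrees with, or refines, a magic candidate.
            if is_specialization(hint, candidate):
                return hint
    # Otherwise the first magic candidate is preferred.
    return magic_candidates[0]
```

With candidates ["application/xhtml+xml", "application/cbor"] and a .cbor
extension hint, this model returns application/cbor; without any hint it
returns the first candidate.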

Let's get back to the following question; what follows is only my guess.
[Prof]: Also what happens if you tweak the definition of XHTML to not scan
until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
Consider an extreme case where we scan only 10 bytes, or even 1 byte.
Magic-byte detection will then most likely match nothing, and I think it
will fall back to something like application/octet-stream, the most general
type. As mentioned, Tika favours the most specialized type, and I believe
almost every type is a subclass of application/octet-stream, so a type
returned by the extension approach is more specialized. In this extreme
case the answer may therefore be yes: it is quite possible that the CBOR
type detected by the extension approach takes over.
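
A small Python sketch of what shrinking the scan window means. I am assuming
here that an offset range such as 0:8192 means "the pattern may begin at any
offset in that range"; the helper name magic_matches is hypothetical and
this is not Tika's matcher.

```python
def magic_matches(data, pattern, start, end):
    """True if `pattern` begins at some offset in [start, end] of `data`.

    Hypothetical helper that models an offset range such as "0:8192".
    """
    # A match beginning at offset `end` still needs len(pattern) bytes after it.
    window = data[start:end + len(pattern)]
    return pattern in window


# An HTML marker that sits 7000 bytes into the file.
data = b" " * 7000 + b"<html"
in_wide_window = magic_matches(data, b"<html", 0, 8192)
in_narrow_window = magic_matches(data, b"<html", 0, 6000)
```

In this model the marker is found with the 0:8192 window but missed with
0:6000, so the narrower the window, the more often magic detection has
nothing to say and the extension or metadata hint decides.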

My idea was, and still is, that if the CBOR self-describing tag 55799 is
present in the CBOR file, it can be used to detect the CBOR type.
Again, the CBOR type will probably be added to the magic estimation list
together with another one such as application/xhtml+xml. I guess the order
in the list also matters: the first entry is preferred over the next one.
The result of extension detection also plays a role in breaking the tie.
E.g. if extension detection agrees on CBOR with one of the estimated types
in the magic list, then CBOR will be returned (again, the same applies to
the metadata hint method).

I have not yet looked closely at a CBOR file that carries tag 55799, but I
expect its hex encoding to be 0xd9d9f7, i.e. the tag should appear in the
header as a fixed sequence of bytes
(https://tools.ietf.org/html/rfc7049#section-2.4.5). If this sequence is
present in the file, preferably in the header (within a reasonable range of
bytes), I believe it can probably be used as the magic number for the CBOR
type.
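
The expected bytes can be worked out from the CBOR encoding rules: a tag is
major type 6, and a tag number that needs 16 bits uses additional
information 25 followed by two big-endian bytes. A minimal Python sketch
(the function names are my own, chosen for illustration):

```python
import struct


def encode_tag_head(tag):
    """Encode the head of a CBOR tag (major type 6) for a 16-bit tag number."""
    if not 256 <= tag <= 0xFFFF:
        raise ValueError("this sketch only handles tags needing a 2-byte argument")
    # Initial byte: major type 6 in the high 3 bits, additional info 25
    # ("a 16-bit unsigned integer follows") in the low 5 bits.
    initial = (6 << 5) | 25
    return bytes([initial]) + struct.pack(">H", tag)


SELF_DESCRIBE = encode_tag_head(55799)


def has_self_describe_tag(data):
    """True if `data` starts with the self-describe CBOR tag."""
    return data.startswith(SELF_DESCRIBE)
```

Here SELF_DESCRIBE comes out as the three bytes d9 d9 f7, which is the
sequence a magic rule at offset 0 would look for.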


There is another thing I mentioned in the JIRA ticket I opened yesterday
for the CBOR parser and detection: it is also possible for CBOR content to
be embedded inside a plain JSON file, and a decoder can distinguish the two
in that file by looking for tag 55799 again. This may rarely happen, but a
robust parser might be able to handle it. When developing the CBOR parser,
Tika might need to consider using the FasterXML library that the Nutch tool
uses.
Again, let me cite the same paragraph from the RFC:

 a decoder might be able to parse both CBOR and JSON.
   Such a decoder would need to mechanically distinguish the two
   formats.  An easy way for an encoder to help the decoder would be to
   tag the entire CBOR item with tag 55799, the serialization of which
   will never be found at the beginning of a JSON text.
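
A decoder along those lines could look only at the leading bytes. The
following Python sketch is my own illustration of that idea, not code from
Tika, Nutch or the RFC; the name sniff and the return values are
assumptions.

```python
SELF_DESCRIBE_CBOR = bytes.fromhex("d9d9f7")
JSON_WHITESPACE = b" \t\r\n"
# First characters a JSON text can have after optional whitespace.
JSON_FIRST_BYTES = b'{["-0123456789tfn'


def sniff(data):
    """Return 'cbor', 'json', or 'unknown' from the leading bytes only."""
    if data.startswith(SELF_DESCRIBE_CBOR):
        return "cbor"
    first = data.lstrip(JSON_WHITESPACE)[:1]
    # Guard against empty input: an empty bytes object is "in" any bytes.
    if first and first in JSON_FIRST_BYTES:
        return "json"
    return "unknown"
```

Untagged CBOR comes back as 'unknown' here, and it could even be mistaken
for JSON when its first byte happens to be one of those ASCII characters,
which is exactly why tagging the item with 55799 helps.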


Thanks
Luke



-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Tuesday, April 21, 2015 9:49 PM
To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); 'NSF 
Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Hi Luke,

Can you post the below conversation to dev@tika and summarize it there?
Also, what happens if you tweak the definition of XHTML to scan not until
8192 but, say, 6000 (e.g., 0:6000)? Does CBOR take over then?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Luke hanson311...@gmail.com
Date: Wednesday, April 22, 2015 at 12:19 AM
To: Chris Mattmann chris.mattm...@gmail.com, Totaro, Giuseppe U 
(3980-Affiliate) tot...@di.uniroma1.it, Chris Mattmann 
chris.a.mattm...@jpl.nasa.gov
Cc: Bryant, Ann