RE: [memex-jpl] this week action from luke

2015-04-21 Thread Luke

Hi professor,

I just tried it with minLength set to 1024, and I get "text/plain". I am a 
bit surprised.

BTW, the 6000 min length still gives "application/xhtml+xml"; with anything 
below 1024, I am seeing "text/plain". :)

BTW, the min length I am referring to/altering is the following, in 
MimeTypes.java:

public int getMinLength() {
    // This needs to be reasonably large to be able to correctly detect
    // things like XML root elements after initial comment and DTDs
    return 64 * 1024;
}
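To illustrate why this limit matters: a detector can only look at the bytes it buffers before handing the stream on to a parser, so any magic that sits past the buffered prefix is invisible to detection. Below is a minimal, self-contained sketch (not Tika's actual code; the class and method names are made up for illustration) of reading a bounded prefix with mark/reset so the stream can still be consumed in full afterwards:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class PrefixRead {
    // Read up to maxPrefix bytes for magic detection, then rewind the
    // stream so the eventual parser still sees the full content.
    static byte[] readDetectionPrefix(InputStream in, int maxPrefix) throws IOException {
        in.mark(maxPrefix);
        byte[] buf = new byte[maxPrefix];
        int total = 0;
        while (total < maxPrefix) {
            int n = in.read(buf, total, maxPrefix - total);
            if (n == -1) break;   // stream shorter than the prefix limit
            total += n;
        }
        in.reset();
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<?xml version=\"1.0\"?><html>...".getBytes();
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] prefix = readDetectionPrefix(in, 16);
        System.out.println(prefix.length);          // 16
        System.out.println(in.read() == data[0]);   // true: stream was rewound
    }
}
```

Any magic rule whose offset range extends past the prefix limit (here 16; in Tika, getMinLength()) simply cannot fire, which is consistent with the behavior changes observed above when shrinking minLength.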


Thanks
Luke

-Original Message-
From: Chris Mattmann [mailto:chris.mattm...@gmail.com] 
Sent: Tuesday, April 21, 2015 7:48 PM
To: Luke; 'Mattmann, Chris A (3980)'; 'Totaro, Giuseppe U (3980-Affiliate)'; 
dev@tika.apache.org
Cc: 'Bryant, Ann C (398G-Affiliate)'; 'Zimdars, Paul A (3980-Affiliate)'; 'NSF 
Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Thanks Luke.

So I guess all I was asking was could you try it out. Thanks for the lesson in 
the RFC.

Cheers,
Chris


Chris Mattmann
chris.mattm...@gmail.com






RE: [memex-jpl] this week action from luke

2015-04-21 Thread Luke
Hi professor,


I think it highly depends on the content being read by Tika. For example, if 
there is a sequence of bytes in the file being read that matches one or more 
of the MIME types defined in our tika-mimetypes.xml, Tika will put those types 
in its candidate list; please note that magic-byte detection can yield 
multiple candidate MIME types. Tika also considers the decision made by the 
extension-detection approach: if the extension suggests the same type as the 
first one in the magic candidate list, then certainly that first one will be 
returned (the same applies to the metadata-hint approach).
Of course, Tika also prefers the type that is the most specialized.

Let's get back to the following question; here is my guess, though.
[Prof]: Also what happens if you tweak the definition of XHTML to not scan 
until 8192, but say 6000 (e.g., 0:6000), does CBOR take over then?
Let's consider an extreme case where we only scan 10 bytes, or even 1 byte: 
magic-byte detection will inevitably find nothing, and I think it will fall 
back to something like "application/octet-stream", the most general type. As 
mentioned, Tika favours the most specialized type, so if the extension 
approach returns something more specialized — and in this extreme case almost 
every type is a subtype of "application/octet-stream" — the answer may be 
yes: I think it is very possible that the CBOR type detected by the extension 
approach takes over in this case.

My idea was, and still is, that if the CBOR self-describe tag 55799 is present 
in the CBOR file, it can be used to detect the CBOR type.
Again, the CBOR type would probably be appended to the magic candidate list 
together with another type such as application/xhtml+xml; I guess the order in 
the list probably also matters, with the first entry preferred over the next. 
The decision from the extension-detection approach also plays a role in 
breaking ties: if the extension method agrees with one of the candidate types 
in the magic list, then CBOR will be returned (again, the same applies to the 
metadata-hint method).

I have not taken a closer look at a CBOR file that has tag 55799, but I expect 
its hex to be something like 0xd9d9f7, i.e. the tag should be present in the 
header as a fixed sequence of bytes 
(https://tools.ietf.org/html/rfc7049#section-2.4.5). If this is present in the 
file, preferably in the header (within a reasonable range of bytes), I believe 
it can probably be used as the magic number for the CBOR type.
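The check described above amounts to comparing the first three bytes of the stream against the serialized tag. A minimal sketch (the class and method names are hypothetical, not Tika code), assuming the encoder actually emitted the self-describe tag:

```java
public class CborMagic {
    // RFC 7049 section 2.4.5: tag 55799 serializes as 0xD9 0xD9 0xF7,
    // a sequence that can never start a valid JSON text.
    private static final byte[] SELF_DESCRIBE = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    static boolean isSelfDescribedCbor(byte[] header) {
        if (header.length < SELF_DESCRIBE.length) return false;
        for (int i = 0; i < SELF_DESCRIBE.length; i++) {
            if (header[i] != SELF_DESCRIBE[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, (byte) 0xA1};
        byte[] json = "{\"a\":1}".getBytes();
        System.out.println(isSelfDescribedCbor(tagged)); // true
        System.out.println(isSelfDescribedCbor(json));   // false
    }
}
```

The same three-byte pattern is what a magic entry in tika-mimetypes.xml would match at offset 0.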


There is another thing I mentioned in the JIRA ticket I opened yesterday 
against the CBOR parser and detection: it is also possible that CBOR content 
can be embedded inside a plain JSON file, and the way a decoder can 
distinguish the two in that file is by looking at tag 55799 again. This may 
rarely happen, but a robust parser might be able to take care of it; Tika 
might need to consider using the FasterXML library that the Nutch tool uses 
when developing the CBOR parser.
Again, let me cite the same paragraph from the RFC:

" a decoder might be able to parse both CBOR and JSON.
   Such a decoder would need to mechanically distinguish the two
   formats.  An easy way for an encoder to help the decoder would be to
   tag the entire CBOR item with tag 55799, the serialization of which
   will never be found at the beginning of a JSON text."
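On the encoder side, jackson-dataformat-cbor (the FasterXML library the Nutch dumper uses) exposes a generator feature for exactly this: CBORGenerator.Feature.WRITE_TYPE_HEADER, which is disabled by default and, when enabled, prepends the self-describe tag 55799 to the output. A sketch of how the dumper could be configured — I have not verified this against the Nutch code, and the exact feature availability depends on the Jackson version in use:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Collections;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;

public class SelfDescribingCborWriter {
    public static void main(String[] args) throws IOException {
        CBORFactory factory = new CBORFactory();
        // Disabled by default; when enabled the generator prepends the
        // self-describe tag 55799 (bytes 0xD9 0xD9 0xF7) to the output.
        factory.configure(CBORGenerator.Feature.WRITE_TYPE_HEADER, true);

        ObjectMapper mapper = new ObjectMapper(factory);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        mapper.writeValue(out, Collections.singletonMap("url", "http://example.com"));

        byte[] bytes = out.toByteArray();
        // The first three bytes should now be the magic that a detector can match.
        System.out.printf("%02x %02x %02x%n", bytes[0], bytes[1], bytes[2]);
    }
}
```

If the dumper wrote the tag, extension detection would no longer be the only signal, and the magic-based detection discussed above would work even on files misnamed .html.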


Thanks
Luke



-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Tuesday, April 21, 2015 9:49 PM
To: Luke; 'Chris Mattmann'; Totaro, Giuseppe U (3980-Affiliate)
Cc: Bryant, Ann C (398G-Affiliate); Zimdars, Paul A (3980-Affiliate); 'NSF 
Polar CyberInfrastructure DR Students'; memex-...@googlegroups.com
Subject: Re: [memex-jpl] this week action from luke

Hi Luke,

Can you post the below conversation to dev@tika and summarize it there. Also 
what happens if you tweak the definition of XHTML to not scan until 8192, but 
say 6000 (e.g., 0:6000), does CBOR take over then?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion 
Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department University of Southern 
California, Los Angeles, CA 90089 USA
++







[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506414#comment-14506414
 ] 

Hudson commented on TIKA-1610:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #640 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/640/])
WIP Fix for TIKA-1610: Support MIME extension for CBOR files contributed by 
LukeLiush  this closes #42 (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1675250)
* /tika/trunk/CHANGES.txt
* 
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> CBOR Parser and detection [improvement]
> ---
>
> Key: TIKA-1610
> URL: https://issues.apache.org/jira/browse/TIKA-1610
> Project: Tika
>  Issue Type: New Feature
>  Components: detector, mime, parser
>Affects Versions: 1.7
>Reporter: Luke sh
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: memex
> Attachments: 142440269.html, cbor_tika.mimetypes.xml.jpg, 
> rfc_cbor.jpg
>
>
> CBOR is a data format whose design goals include the possibility of extremely 
> small code size, fairly small message size, and extensibility without the 
> need for version negotiation (cited from http://cbor.io/ ).
> It would be great if Tika were able to provide support for CBOR parsing and 
> identification. In the current project with Nutch, the Nutch 
> CommonCrawlDataDumper is used to dump the crawled segments to files in 
> CBOR format. In order to read/parse those dumped files with this tool, 
> it would be great if Tika were able to parse CBOR. The thing is 
> that CommonCrawlDataDumper does not dump with the correct extension; it 
> dumps with its own rule, and the default extension of a dumped file is .html, 
> so it would be less painful if Tika were able to detect and parse those files 
> without any pre-processing steps. 
> CommonCrawlDataDumper calls the following to dump with CBOR:
> import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
> import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
> FasterXML is a 3rd-party library for converting JSON to CBOR and vice versa.
> According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
> CBOR does not yet have magic numbers by which other applications can 
> detect/identify it (PFA: rfc_cbor.jpg).
> It seems that the only way to inform other applications of the type as of now 
> is the extension (i.e. .cbor), or perhaps content detection (e.g. byte 
> histogram distribution estimation).  
> There is another thing worth attention: it looks like Tika has attempted 
> to add support for CBOR MIME detection in tika-mimetypes.xml 
> (PFA: cbor_tika.mimetypes.xml.jpg); this detection is not working with the 
> CBOR files dumped by CommonCrawlDataDumper. 
> According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
> self-describing tag 55799 that can be used for CBOR type identification 
> (the hex code might be 0xd9d9f7), but it is probably up to the 
> application to take care of this tag, and it is also possible that the 
> FasterXML code used by the Nutch dumping tool is missing this tag. An 
> example CBOR file dumped by the Nutch tool (CommonCrawlDataDumper) has 
> also been attached (PFA: 142440269.html).
> The following is cited from the RFC: "...a decoder might be able to 
> parse both CBOR and JSON.
>Such a decoder would need to mechanically distinguish the two
>formats.  An easy way for an encoder to help the decoder would be to
>tag the entire CBOR item with tag 55799, the serialization of which
>will never be found at the beginning of a JSON text..."
> It looks like a file can have two parts/sections, i.e. plain-text 
> parts and JSON prettified by CBOR; this might also be worth attention 
> and consideration for parsing and type identification.
> On the other hand, it is worth noting that the entries for CBOR extension 
> detection need to be appended to tika-mimetypes.xml too, 
> e.g.
> 
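The XML example was stripped by the archive. As a sketch of what such a tika-mimetypes.xml entry could look like (this is illustrative, not the actual committed entry; the priority value is an assumption):

```xml
<mime-type type="application/cbor">
  <!-- Self-describe tag 55799 from RFC 7049 section 2.4.5 -->
  <magic priority="60">
    <match value="\xd9\xd9\xf7" type="string" offset="0"/>
  </magic>
  <glob pattern="*.cbor"/>
</mime-type>
```

The glob covers correctly named files, while the magic match would catch self-described CBOR regardless of extension.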



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: NUTCH-1994 and UCAR Dependencies

2015-04-21 Thread Mattmann, Chris A (3980)
Thanks Lewis!







-Original Message-
From: Lewis John Mcgibbney 
Reply-To: "dev@tika.apache.org" 
Date: Tuesday, April 21, 2015 at 7:14 PM
To: "dev@tika.apache.org" 
Subject: Re: NUTCH-1994 and UCAR Dependencies

>Hi Folks,
>OK, so the final part of this jigsaw is as follows
>
>I've requested a staging area [0] on Sonatype OSSRH to release the MIT
>licensed 3rd party bzip2 artifacts.
>I had to Mavenize the project. I will submit this patch to the bzip2
>project and hopefully they will pull it in. If not then I will fork the
>project and maintain it myself.
>
>[0] https://issues.sonatype.org/browse/OSSRH-15143
>[1] https://code.google.com/p/jbzip2/
>
>On Tue, Apr 21, 2015 at 3:49 PM, Lewis John Mcgibbney <
>lewis.mcgibb...@gmail.com> wrote:
>
>> Hi Folks,
>> Update
>>
>> On Tue, Apr 21, 2015 at 10:50 AM, Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com> wrote:
>>
>>>
>>>
>>> [ivy:resolve] ::
>>> [ivy:resolve] :: edu.ucar#jj2000;5.2: not found
>>> [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found
>>> [ivy:resolve] ::
>>>
>>
>>
>> Both of the above are now on Maven Central.
>> I had to fix a couple of issues in the jj2000 library, namely
>> https://github.com/Unidata/jj2000/pull/3 which was blocking us.
>>
>> I'm moving on to deal with the final one
>>
>> [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found
>>
>> I'll update in due course.
>> Thanks
>> Lewis
>>
>
>
>
>-- 
>*Lewis*



[jira] [Commented] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-21 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506359#comment-14506359
 ] 

Chris A. Mattmann commented on TIKA-1610:
-

Applied Pull request #42 thanks [~Lukeliush]!

{noformat}
[chipotle:~/tmp/tika] mattmann% svn commit -m "WIP Fix for TIKA-1610: Support 
MIME extension for CBOR files contributed by LukeLiush  
this closes #42" CHANGES.txt 
tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
SendingCHANGES.txt
Sending
tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Transmitting file data ..
Committed revision 1675250.
[chipotle:~/tmp/tika] mattmann% 
{noformat}

Will look for improvements and the parser next, so will leave this open!




[GitHub] tika pull request: add entry for cbor glob extension in the tika-m...

2015-04-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/42


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Assigned] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-1610:
---

Assignee: Chris A. Mattmann



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506214#comment-14506214
 ] 

Tim Allison commented on TIKA-1513:
---

In looking at [this|http://www.dbf2002.com/dbf-file-format.html], I wonder if 
we could add 0x00 at 30 and 31?

I'm currently grepping the Common Crawl slice from Julien Nioche for files 
starting with 0x03, and I'm getting a vast majority ".dbf", but there are some 
that end in .dct, .ndx (dbf index?), .tfm, .ctg...  Will report findings 
tomorrow.
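The stricter check being proposed — version byte 0x03 at offset 0 plus zeroed reserved bytes at offsets 30 and 31 (offsets per the dbf-file-format page linked above) — could be sketched as follows; the class and method names are hypothetical, and this is a plausibility filter, not a full DBF validator:

```java
public class DbfSniff {
    // Proposed stricter DBF magic: version byte 0x03 at offset 0, plus the
    // two reserved bytes at offsets 30 and 31, which should both be 0x00.
    static boolean looksLikeDbf(byte[] header) {
        return header.length >= 32
                && header[0] == 0x03
                && header[30] == 0x00
                && header[31] == 0x00;
    }

    public static void main(String[] args) {
        byte[] dbf = new byte[32];
        dbf[0] = 0x03;                 // dBASE III without memo
        System.out.println(looksLikeDbf(dbf)); // true

        byte[] other = new byte[32];
        other[0] = 0x03;
        other[30] = 0x01;              // reserved byte set -> likely not DBF
        System.out.println(looksLikeDbf(other)); // false
    }
}
```

The extra two-byte condition would cut down false positives from the many unrelated formats (.dct, .ndx, .tfm, .ctg, ...) that also happen to start with 0x03.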

> Add mime detection and parsing for dbf files
> 
>
> Key: TIKA-1513
> URL: https://issues.apache.org/jira/browse/TIKA-1513
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.9
>
>
> I just came across an Apache licensed dbf parser that is available on 
> [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Detection problem: Parsing scientific source codes for geoscientists

2015-04-21 Thread Nick Burch

On Tue, 21 Apr 2015, Oh, Ji-Hyun (329F-Affiliate) wrote:
For the first step, I listed the file formats that are widely used in 
climate science:


FORTRAN (.f, .f90, .f77)
Python (.py)
R (.R)
Matlab (.m)
GrADS (Grid Analysis and Display System) (.gs)
NCL (NCAR Command Language) (.ncl)
IDL (Interactive Data Language) (.pro)

I checked that Fortran and Matlab are included in tika-mimetypes.xml, but 
when I used Tika to obtain the content type of the files (with suffixes .f, 
.f90, .m), Tika detected these files as text/plain.


Your first step then is probably to try to work out how to identify these 
files, and add suitable mime magic for them, if possible. At the same 
time, make sure the common file extensions for them are listed against 
their mime entries, and make sure we have mime entries for all of these 
formats.


I'd probably recommend creating one JIRA per format with detection issues, 
then use that to track the work to add/expand the mime type, attach a 
small sample file, add detection unit tests etc.


Should I build a parser for each file format to get an exact 
content-type, as Java has SourceCodeParser?


As Lewis has said, once detection is working, you'll then want to add the 
missing parsers. You might find that the current SourceCodeParser could, 
with a little bit of work, handle some of these formats itself. Additional 
libraries+parsers may well be needed for the others. I'd suggest one JIRA 
per format you want a parser for that we lack, then use those to track the 
work


Good luck!

Nick


Re: Detection problem: Parsing scientific source codes for geoscientists

2015-04-21 Thread Lewis John Mcgibbney
Hi Ji-Hyun,

On Tue, Apr 21, 2015 at 4:15 PM,  wrote:

>
> FORTRAN (.f, .f90, f77)
> Python (.py)
> R (.R)
> Matlab (.m)
> GrADS (Grid Analysis and Display System)
> (.gs)
> NCL (NCAR Command Language) (.ncl)
> IDL (Interactive Data Language) (.pro)
>

NICE list


>
> I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when
> I used Tika to obtain the content type of the files (with suffixes .f, .f90, .m),
> Tika detected these files as text/plain:
>
> ohjihyun% tika -m spctime.f
>
> Content-Encoding: ISO-8859-1
> Content-Length: 16613
> Content-Type: text/plain; charset=ISO-8859-1
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
> resourceName: spctime.f
>
>
[SNIP]


> Should I build a parser for each file format to get an exact content-type,
> as Java has SourceCodeParser?


As far as I know we have no parser for Fortran documents.
You could try using the following Java project
http://sourceforge.net/projects/fortran-parser/
It is dual licensed under Eclipse and BSD licenses.
Hope this helps.
Lewis


Re: NUTCH-1994 and UCAR Dependencies

2015-04-21 Thread Lewis John Mcgibbney
Patch for Mavenizing the bzip2 project
https://code.google.com/p/jbzip2/issues/detail?id=3
Lewis

On Tue, Apr 21, 2015 at 4:14 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Folks,
> OK, so the final part of this jigsaw is as follows
>
> I've requested a staging area [0] on Sonatype OSSRH to release the MIT
> licensed 3rd party bzip2 artifacts.
> I had to Mavenize the project. I will submit this patch to the bzip2
> project and hopefully they will pull it in. If not then I will fork the
> project and maintain it myself.
>
> [0] https://issues.sonatype.org/browse/OSSRH-15143
> [1] https://code.google.com/p/jbzip2/
>
> On Tue, Apr 21, 2015 at 3:49 PM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
>
>> Hi Folks,
>> Update
>>
>> On Tue, Apr 21, 2015 at 10:50 AM, Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com> wrote:
>>
>>>
>>>
>>> [ivy:resolve] ::
>>> [ivy:resolve] :: edu.ucar#jj2000;5.2: not found
>>> [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found
>>> [ivy:resolve] ::
>>>
>>
>>
>> Both of the above are now on Maven Central.
>> I had to fix a couple of issues in the jj2000 library, namely
>> https://github.com/Unidata/jj2000/pull/3 which was blocking us.
>>
>> I'm moving on to deal with the final one
>>
>> [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found
>>
>> I'll update in due course.
>> Thanks
>> Lewis
>>
>
>
>
> --
> *Lewis*
>



-- 
*Lewis*


Re: NUTCH-1994 and UCAR Dependencies

2015-04-21 Thread Lewis John Mcgibbney
Hi Folks,
OK, so the final part of this jigsaw is as follows

I've requested a staging area [0] on Sonatype OSSRH to release the MIT
licensed 3rd party bzip2 artifacts.
I had to Mavenize the project. I will submit this patch to the bzip2
project and hopefully they will pull it in. If not then I will fork the
project and maintain it myself.

[0] https://issues.sonatype.org/browse/OSSRH-15143
[1] https://code.google.com/p/jbzip2/

On Tue, Apr 21, 2015 at 3:49 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Folks,
> Update
>
> On Tue, Apr 21, 2015 at 10:50 AM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
>
>>
>>
>> [ivy:resolve] ::
>> [ivy:resolve] :: edu.ucar#jj2000;5.2: not found
>> [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found
>> [ivy:resolve] ::
>>
>
>
> Both of the above are now on Maven Central.
> I had to fix a couple of issues in the jj2000 library, namely
> https://github.com/Unidata/jj2000/pull/3 which was blocking us.
>
> I'm moving on to deal with the final one
>
> [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found
>
> I'll update in due course.
> Thanks
> Lewis
>



-- 
*Lewis*


Re: NUTCH-1994 and UCAR Dependencies

2015-04-21 Thread Lewis John Mcgibbney
Hi Folks,
Update

On Tue, Apr 21, 2015 at 10:50 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

>
>
> [ivy:resolve] ::
> [ivy:resolve] :: edu.ucar#jj2000;5.2: not found
> [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found
> [ivy:resolve] ::
>


Both of the above are now on Maven Central.
I had to fix a couple of issues in the jj2000 library, namely
https://github.com/Unidata/jj2000/pull/3 which was blocking us.

I'm moving on to deal with the final one

[ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found

I'll update in due course.
Thanks
Lewis


[GitHub] tika pull request: add entry for cbor glob extension in the tika-m...

2015-04-21 Thread LukeLiush
GitHub user LukeLiush opened a pull request:

https://github.com/apache/tika/pull/42

add entry for cbor glob extension in the tika-mimetypes.xml



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/LukeLiush/tika cborExtension

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/42.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #42


commit 5b86cccdfc6d637cb44c9f8b2642e438c2ae5ff4
Author: LukeLiush 
Date:   2015-04-21T21:39:07Z

add entry for cbor glob extension in the tika-mimetypes.xml




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: NUTCH-1994 and UCAR Dependencies

2015-04-21 Thread Tyler Palsulich
Hi Lewis,

I also tried upgrading Tika in Nutch, but ran into the same issue (though udunits
is now found, as expected):

[ivy:retrieve] ::
[ivy:retrieve] ::  UNRESOLVED DEPENDENCIES ::
[ivy:retrieve] ::
[ivy:retrieve] :: edu.ucar#jj2000;5.2: not found
[ivy:retrieve] :: org.itadaki#bzip2;0.9.1: not found
[ivy:retrieve] ::

Thanks for pushing the dependencies out.

Tyler

On Tue, Apr 21, 2015 at 1:50 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Folks,
> Whilst addressing NUTCH-1994, I've experienced a dependency problem
> (related to unpublished artifacts on Maven Central) which I am working
> through right now.
> When making the upgrade in Nutch, I get the following
>
> [ivy:resolve]   -- artifact edu.ucar#udunits;4.5.5!udunits.jar:
> [ivy:resolve]
>
> http://oss.sonatype.org/content/repositories/releases/edu/ucar/udunits/4.5.5/udunits-4.5.5.jar
> [ivy:resolve] ::
> [ivy:resolve] ::  UNRESOLVED DEPENDENCIES ::
> [ivy:resolve] ::
> [ivy:resolve] :: edu.ucar#jj2000;5.2: not found
> [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found
> [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found
> [ivy:resolve] ::
> [ivy:resolve]
> [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
>
> BUILD FAILED
> /usr/local/trunk_clean/build.xml:112: The following error occurred while
> executing this line:
> /usr/local/trunk_clean/src/plugin/build.xml:60: The following error
> occurred while executing this line:
> /usr/local/trunk_clean/src/plugin/build-plugin.xml:229: impossible to
> resolve dependencies:
> resolve failed - see output for details
>
> Total time: 17 seconds
>
> I've just this minute pushed the edu.ucar#udunits;4.5.5 artifacts, so they
> will be available imminently. The remaining artifact, edu.ucar#jj2000;5.2,
> has a corrupted POM, which means that OSS Nexus will not accept it. I'll
> send a pull request upstream for that ASAP.
>
> Finally, the BZIP dependency is a 3rd-party dependency from another org,
> licensed under the MIT license. So I will register interest to publish this
> dependency, push it, and then we will be good to go.
>
> Lewis
>
>
>
> --
> *Lewis*
>


[jira] [Commented] (TIKA-1601) Integrate Jackcess to handle MSAccess files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505633#comment-14505633
 ] 

Tim Allison commented on TIKA-1601:
---

I don't. That's half the fun of a patch, right? :) On the sqlite parser, I 
tried to have at least one column for each data type, non-ASCII text to 
confirm no encoding problems, and an embedded doc.  

Happy to generate this if it would help. Thank you, again.

> Integrate Jackcess to handle MSAccess files
> ---
>
> Key: TIKA-1601
> URL: https://issues.apache.org/jira/browse/TIKA-1601
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>
> Recently, James Ahlborn, the current maintainer of 
> [Jackcess|http://jackcess.sourceforge.net/], kindly agreed to relicense 
> Jackcess to Apache 2.0.  [~boneill], the CTO at [Health Market Science, a 
> LexisNexis® Company|https://www.healthmarketscience.com/], also agreed with 
> this relicensing and led the charge to obtain all necessary corporate 
> approval to deliver a 
> [CCLA|https://www.apache.org/licenses/cla-corporate.txt] for Jackcess to 
> Apache.  As anyone who has tried to get corporate approval for anything 
> knows, this can sometimes require not a small bit of effort.
> If I may speak on behalf of Tika and the larger Apache community, I offer a 
> sincere thanks to James, Brian and the other developers and contributors to 
> Jackcess!!!
> Once the licensing info has been changed in Jackcess and the new release is 
> available in maven, we can integrate Jackcess into Tika and add a capability 
> to process MSAccess.
> As a side note, I reached out to the developers and contributors to determine 
> if there were any objections.  I couldn't find addresses for everyone, and 
> not everyone replied, but those who did offered their support to this move. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1601) Integrate Jackcess to handle MSAccess files

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505377#comment-14505377
 ] 

Luis Filipe Nassif commented on TIKA-1601:
--

Great! Give me 3 more days to submit the patch. Do you have some Apache 2 MDB 
file for unit tests?

> Integrate Jackcess to handle MSAccess files
> ---
>
> Key: TIKA-1601
> URL: https://issues.apache.org/jira/browse/TIKA-1601
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>
> Recently, James Ahlborn, the current maintainer of 
> [Jackcess|http://jackcess.sourceforge.net/], kindly agreed to relicense 
> Jackcess to Apache 2.0.  [~boneill], the CTO at [Health Market Science, a 
> LexisNexis® Company|https://www.healthmarketscience.com/], also agreed with 
> this relicensing and led the charge to obtain all necessary corporate 
> approval to deliver a 
> [CCLA|https://www.apache.org/licenses/cla-corporate.txt] for Jackcess to 
> Apache.  As anyone who has tried to get corporate approval for anything 
> knows, this can sometimes require not a small bit of effort.
> If I may speak on behalf of Tika and the larger Apache community, I offer a 
> sincere thanks to James, Brian and the other developers and contributors to 
> Jackcess!!!
> Once the licensing info has been changed in Jackcess and the new release is 
> available in maven, we can integrate Jackcess into Tika and add a capability 
> to process MSAccess.
> As a side note, I reached out to the developers and contributors to determine 
> if there were any objections.  I couldn't find addresses for everyone, and 
> not everyone replied, but those who did offered their support to this move. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Detection problem: Parsing scientific source codes for geoscientists

2015-04-21 Thread Oh, Ji-Hyun (329F-Affiliate)
Hi Tika friends,

I am currently engaged in a project funded by the National Science Foundation. Our 
goal is to develop a research-friendly environment where geoscientists, like 
me, can easily find the source code they need. According to a survey, scientists 
spend a considerable amount of their time processing data instead of doing 
actual science. In my experience as a climate scientist, there is a set of 
analysis tools used most frequently in atmospheric science, so it could be 
helpful if these tools were easily shared among scientists. The catch is that 
the tools are written in various scientific languages, so we are trying to 
provide metadata for source code stored in public repositories to help 
scientists select source code for their own use.

For the first step, I listed the file formats that are widely used in climate 
science.

FORTRAN (.f, .f90, f77)
Python (.py)
R (.R)
Matlab (.m)
GrADS (Grid Analysis and Display System)
(.gs)
NCL (NCAR Command Language) (.ncl)
IDL (Interactive Data Language) (.pro)

I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when I 
used Tika to obtain the content type of the files (with suffixes .f, .f90, .m), 
Tika detected these files as text/plain:

ohjihyun% tika -m spctime.f

Content-Encoding: ISO-8859-1
Content-Length: 16613
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: spctime.f

ohjihyun% tika -m wavelet.m
Content-Encoding: ISO-8859-1
Content-Length: 5868
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: wavelet.m

I checked that Tika gives the correct content type (text/x-java-source) for a 
Java file:
ohjihyun% tika -m UrlParser.java
Content-Encoding: ISO-8859-1
Content-Length: 2178
Content-Type: text/x-java-source
LoC: 70
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser
resourceName: UrlParser.java

Should I build a parser for each file format to get an exact content-type, as 
Java has SourceCodeParser?
Thank you in advance for your insightful comments.

Ji-Hyun
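Until proper mime magic exists for these formats, the kind of signal a detector could use can be sketched as follows (a hypothetical keyword heuristic, not Tika's actual detection logic; the keyword list and indentation bound are illustrative):

```python
import re

# Hypothetical Fortran hints: structural keywords at (possibly fixed-form
# indented) line starts. A real mime-magic rule would encode similar strings.
FORTRAN_HINTS = re.compile(
    r"^[ ]{0,6}(program|subroutine|function|module|implicit[ ]+none)\b",
    re.IGNORECASE | re.MULTILINE,
)

def guess_fortran(text: str) -> bool:
    """Return True if the text shows Fortran-style structure."""
    return bool(FORTRAN_HINTS.search(text))

src = """      program spctime
      implicit none
      end program spctime
"""
```

Analogous keyword sets could be tried for NCL, GrADS, and IDL scripts, which otherwise look like generic text/plain.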


[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505367#comment-14505367
 ] 

Luis Filipe Nassif commented on TIKA-879:
-

Yes, thank you very much for testing with govdocs1 ([~gagravarr]'s suggestion)!

> Detection problem: message/rfc822 file is detected as text/plain.
> -
>
> Key: TIKA-879
> URL: https://issues.apache.org/jira/browse/TIKA-879
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, mime
>Affects Versions: 1.0, 1.1, 1.2
> Environment: linux 3.2.9
> oracle jdk7, openjdk7, sun jdk6
>Reporter: Konstantin Gribov
>  Labels: new-parser
> Attachments: TIKA-879-thunderbird.eml
>
>
> When using {{DefaultDetector}}, the mime type detected for {{.eml}} files 
> differs (you can test this on {{testRFC822}} and {{testRFC822_base64}} in 
> {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for such behavior is that only the magic detector really works 
> for such files, even if you set {{CONTENT_TYPE}} in metadata or an {{.eml}} 
> file name in {{RESOURCE_NAME_KEY}}.
> As I found, {{MediaTypeRegistry.isSpecializationOf("message/rfc822", 
> "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
> works only by magic.
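Why the missing supertype relation matters can be sketched like this (a minimal illustration of the design, not Tika's API; the registry contents and function names are invented for the example):

```python
# Minimal media-type registry sketch: a name-based hint may only refine a
# magic-based guess when the hint is a specialization of that guess.
SUPERTYPE = {
    "message/rfc822": "text/plain",  # the relation TIKA-879 reports as missing
    "text/x-fortran": "text/plain",
}

def is_specialization_of(child: str, parent: str) -> bool:
    t = child
    while t in SUPERTYPE:
        t = SUPERTYPE[t]
        if t == parent:
            return True
    return False

def combine(magic_guess: str, name_hint: str) -> str:
    """Prefer the filename hint only when it refines the magic result."""
    if is_specialization_of(name_hint, magic_guess):
        return name_hint
    return magic_guess
```

With the rfc822-to-text/plain relation registered, the {{.eml}} filename hint can win over the magic result; without it, the hint is discarded, which matches the reported behavior.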



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505368#comment-14505368
 ] 

Luis Filipe Nassif commented on TIKA-879:
-

Yes, thank you very much for testing with govdocs1 ([~gagravarr]'s suggestion)!

> Detection problem: message/rfc822 file is detected as text/plain.
> -
>
> Key: TIKA-879
> URL: https://issues.apache.org/jira/browse/TIKA-879
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, mime
>Affects Versions: 1.0, 1.1, 1.2
> Environment: linux 3.2.9
> oracle jdk7, openjdk7, sun jdk6
>Reporter: Konstantin Gribov
>  Labels: new-parser
> Attachments: TIKA-879-thunderbird.eml
>
>
> When using {{DefaultDetector}}, the mime type detected for {{.eml}} files 
> differs (you can test this on {{testRFC822}} and {{testRFC822_base64}} in 
> {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for such behavior is that only the magic detector really works 
> for such files, even if you set {{CONTENT_TYPE}} in metadata or an {{.eml}} 
> file name in {{RESOURCE_NAME_KEY}}.
> As I found, {{MediaTypeRegistry.isSpecializationOf("message/rfc822", 
> "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
> works only by magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


NUTCH-1994 and UCAR Dependencies

2015-04-21 Thread Lewis John Mcgibbney
Hi Folks,
Whilst addressing NUTCH-1994, I've experienced a dependency problem
(related to unpublished artifacts on Maven Central) which I am working
through right now.
When making the upgrade in Nutch, I get the following

[ivy:resolve]   -- artifact edu.ucar#udunits;4.5.5!udunits.jar:
[ivy:resolve]
http://oss.sonatype.org/content/repositories/releases/edu/ucar/udunits/4.5.5/udunits-4.5.5.jar
[ivy:resolve] ::
[ivy:resolve] ::  UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::
[ivy:resolve] :: edu.ucar#jj2000;5.2: not found
[ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found
[ivy:resolve] :: edu.ucar#udunits;4.5.5: not found
[ivy:resolve] ::
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

BUILD FAILED
/usr/local/trunk_clean/build.xml:112: The following error occurred while
executing this line:
/usr/local/trunk_clean/src/plugin/build.xml:60: The following error
occurred while executing this line:
/usr/local/trunk_clean/src/plugin/build-plugin.xml:229: impossible to
resolve dependencies:
resolve failed - see output for details

Total time: 17 seconds

I've just this minute pushed the edu.ucar#udunits;4.5.5 artifacts, so they
will be available imminently. The remaining artifact, edu.ucar#jj2000;5.2,
has a corrupted POM, which means that OSS Nexus will not accept it. I'll
send a pull request upstream for that ASAP.

Finally, the BZIP dependency is a 3rd-party dependency from another org,
licensed under the MIT license. So I will register interest to publish this
dependency, push it, and then we will be good to go.

Lewis
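Once the three artifacts are published, the corresponding declarations in Nutch's ivy.xml would look along these lines (a sketch based on the unresolved coordinates above; the exact conf mappings in Nutch's build are not shown):

```xml
<dependency org="edu.ucar" name="udunits" rev="4.5.5"/>
<dependency org="edu.ucar" name="jj2000" rev="5.2"/>
<dependency org="org.itadaki" name="bzip2" rev="0.9.1"/>
```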



-- 
*Lewis*


[jira] [Commented] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents

2015-04-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505358#comment-14505358
 ] 

Hudson commented on TIKA-1611:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #639 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/639/])
TIKA-1611 -- allow RecursiveParserWrapper to catch exceptions caused by 
embedded documents (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1675159)
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java
* 
/tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
* /tika/trunk/tika-batch/src/test/java/org/apache/tika/util
* 
/tika/trunk/tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
* /tika/trunk/tika-core/src/main/java/org/apache/tika/utils/ExceptionUtils.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/test_recursive_embedded_npe.docx
* 
/tika/trunk/tika-server/src/test/java/org/apache/tika/server/RecursiveMetadataResourceTest.java


> Allow RecursiveParserWrapper to catch exceptions from embedded documents
> 
>
> Key: TIKA-1611
> URL: https://issues.apache.org/jira/browse/TIKA-1611
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.9
>
>
> While parsing embedded documents, currently, if a parser hits an 
> EncryptedDocumentException or anything wrapped in a TikaException, the 
> Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
> {noformat}
> DELEGATING_PARSER.parse(
> newStream,
> new EmbeddedContentHandler(new 
> BodyContentHandler(handler)),
> metadata, context);
> } catch (EncryptedDocumentException ede) {
> // TODO: can we log a warning that we lack the password?
> // For now, just skip the content
> } catch (TikaException e) {
> // TODO: can we log a warning somehow?
> // Could not parse the entry, just skip the content
> } finally {
> tmp.close();
> }
> {noformat}
> For some applications, it might be better to store the stack trace of the 
> attachment that caused an exception.
> The proposal would be to include the stack trace in the metadata object for 
> that particular attachment.
> The user will be able to specify whether or not to store stack traces, and 
> the default will be to store stack traces.  This will be a small change to 
> the legacy behavior.
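The proposal quoted above can be sketched as follows (Python pseudocode of the design, not Tika's actual API; the function and metadata key names are invented for illustration):

```python
import traceback

def parse_embedded(stream, parse, metadata, store_stack_traces=True):
    """Sketch of the proposal: instead of silently swallowing an embedded
    document's exception, record its stack trace in that attachment's
    metadata. Storing traces is the default, a small change to legacy
    behavior, and can be switched off."""
    try:
        parse(stream)
    except Exception:
        if store_stack_traces:
            metadata["embedded-exception"] = traceback.format_exc()

def failing_parser(stream):
    # Stand-in for an embedded document that cannot be parsed.
    raise ValueError("bad entry")

md = {}
parse_embedded(b"...", failing_parser, md)
```

The outer parse continues either way; the difference is only whether the failure leaves a trace in the attachment's metadata.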



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1612) Exceptions getting image data in PPT files

2015-04-21 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1612:
-

 Summary: Exceptions getting image data in PPT files
 Key: TIKA-1612
 URL: https://issues.apache.org/jira/browse/TIKA-1612
 Project: Tika
  Issue Type: Bug
Reporter: Tim Allison
Priority: Minor


In numerous (~500) ppt files in govdocs1, we're getting zip exceptions (unknown 
compression method, bad block, etc) when Tika's HSLFExtractor calls 
{{getData()}} on an embedded image.

Under normal circumstances (I just learned today...), if an attachment causes a 
RuntimeException, we are currently swallowing that in 
{{ParsingEmbeddedDocumentExtractor}}.

However, because we're calling {{getData()}} before the embedded extractor 
takes over, if there is an exception there, the parse of the entire file fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1612) Exceptions getting image data in PPT files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505335#comment-14505335
 ] 

Tim Allison commented on TIKA-1612:
---

Not sure how we want to fix this.  To make this parallel to our handling of 
other embedded files, we'd just swallow the exception...I really don't like 
that option.

Recommendations?

> Exceptions getting image data in PPT files
> --
>
> Key: TIKA-1612
> URL: https://issues.apache.org/jira/browse/TIKA-1612
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
>
> In numerous (~500) ppt files in govdocs1, we're getting zip exceptions 
> (unknown compression method, bad block, etc) when Tika's HSLFExtractor calls 
> {{getData()}} on an embedded image.
> Under normal circumstances (I just learned today...), if an attachment causes 
> a RuntimeException, we are currently swallowing that in 
> {{ParsingEmbeddedDocumentExtractor}}.
> However, because we're calling {{getData()}} before the embedded extractor 
> takes over, if there is an exception there, the parse of the entire file 
> fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents

2015-04-21 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1611.
---
Resolution: Fixed

r1675159.

Nothing like testing to see behavior, rather than assumptions. :(

> Allow RecursiveParserWrapper to catch exceptions from embedded documents
> 
>
> Key: TIKA-1611
> URL: https://issues.apache.org/jira/browse/TIKA-1611
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.9
>
>
> While parsing embedded documents, currently, if a parser hits an 
> EncryptedDocumentException or anything wrapped in a TikaException, the 
> Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
> {noformat}
> DELEGATING_PARSER.parse(
> newStream,
> new EmbeddedContentHandler(new 
> BodyContentHandler(handler)),
> metadata, context);
> } catch (EncryptedDocumentException ede) {
> // TODO: can we log a warning that we lack the password?
> // For now, just skip the content
> } catch (TikaException e) {
> // TODO: can we log a warning somehow?
> // Could not parse the entry, just skip the content
> } finally {
> tmp.close();
> }
> {noformat}
> For some applications, it might be better to store the stack trace of the 
> attachment that caused an exception.
> The proposal would be to include the stack trace in the metadata object for 
> that particular attachment.
> The user will be able to specify whether or not to store stack traces, and 
> the default will be to store stack traces.  This will be a small change to 
> the legacy behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents

2015-04-21 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1611:
--
Description: 
While parsing embedded documents, currently, if a parser hits an 
EncryptedDocumentException or anything wrapped in a TikaException, the 
Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
{noformat}
DELEGATING_PARSER.parse(
newStream,
new EmbeddedContentHandler(new 
BodyContentHandler(handler)),
metadata, context);
} catch (EncryptedDocumentException ede) {
// TODO: can we log a warning that we lack the password?
// For now, just skip the content
} catch (TikaException e) {
// TODO: can we log a warning somehow?
// Could not parse the entry, just skip the content
} finally {
tmp.close();
}
{noformat}


For some applications, it might be better to store the stack trace of the 
attachment that caused an exception.

The proposal would be to include the stack trace in the metadata object for 
that particular attachment.

The user will be able to specify whether or not to store stack traces, and the 
default will be to store stack traces.  This will be a small change to the 
legacy behavior.

  was:
While parsing embedded documents, currently, if a parser hits an Exception, the 
Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
{noformat}
DELEGATING_PARSER.parse(
newStream,
new EmbeddedContentHandler(new 
BodyContentHandler(handler)),
metadata, context);
} catch (EncryptedDocumentException ede) {
// TODO: can we log a warning that we lack the password?
// For now, just skip the content
} catch (TikaException e) {
// TODO: can we log a warning somehow?
// Could not parse the entry, just skip the content
} finally {
tmp.close();
}
{noformat}


For some applications, it might be better to store the stack trace of the 
attachment that caused an exception.

The proposal would be to include the stack trace in the metadata object for 
that particular attachment.

The user will be able to specify whether or not to store stack traces, and the 
default will be to store stack traces.  This will be a small change to the 
legacy behavior.


> Allow RecursiveParserWrapper to catch exceptions from embedded documents
> 
>
> Key: TIKA-1611
> URL: https://issues.apache.org/jira/browse/TIKA-1611
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.9
>
>
> While parsing embedded documents, currently, if a parser hits an 
> EncryptedDocumentException or anything wrapped in a TikaException, the 
> Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
> {noformat}
> DELEGATING_PARSER.parse(
> newStream,
> new EmbeddedContentHandler(new 
> BodyContentHandler(handler)),
> metadata, context);
> } catch (EncryptedDocumentException ede) {
> // TODO: can we log a warning that we lack the password?
> // For now, just skip the content
> } catch (TikaException e) {
> // TODO: can we log a warning somehow?
> // Could not parse the entry, just skip the content
> } finally {
> tmp.close();
> }
> {noformat}
> For some applications, it might be better to store the stack trace of the 
> attachment that caused an exception.
> The proposal would be to include the stack trace in the metadata object for 
> that particular attachment.
> The user will be able to specify whether or not to store stack traces, and 
> the default will be to store stack traces.  This will be a small change to 
> the legacy behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents

2015-04-21 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1611:
--
Description: 
While parsing embedded documents, currently, if a parser hits an Exception, the 
Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
{noformat}
DELEGATING_PARSER.parse(
newStream,
new EmbeddedContentHandler(new 
BodyContentHandler(handler)),
metadata, context);
} catch (EncryptedDocumentException ede) {
// TODO: can we log a warning that we lack the password?
// For now, just skip the content
} catch (TikaException e) {
// TODO: can we log a warning somehow?
// Could not parse the entry, just skip the content
} finally {
tmp.close();
}
{noformat}


For some applications, it might be better to store the stack trace of the 
attachment that caused an exception.

The proposal would be to include the stack trace in the metadata object for 
that particular attachment.

The user will be able to specify whether or not to store stack traces, and the 
default will be to store stack traces.  This will be a small change to the 
legacy behavior.

  was:
While parsing embedded documents, currently, if a parser hits an Exception, the 
parsing of the entire document comes to a grinding halt.  For some 
applications, it might be better to catch the exception at the attachment level.

The proposal would be to include the stack trace in the metadata object for 
that particular attachment.

The user will be able to specify whether or not to catch embedded exceptions, 
and the default will be to catch embedded exceptions.  This will be a small 
change to the legacy behavior.


> Allow RecursiveParserWrapper to catch exceptions from embedded documents
> 
>
> Key: TIKA-1611
> URL: https://issues.apache.org/jira/browse/TIKA-1611
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.9
>
>
> While parsing embedded documents, currently, if a parser hits an Exception, 
> the Exception is swallowed by {{ParsingEmbeddedDocumentExtractor}}:
> {noformat}
> try {
>     DELEGATING_PARSER.parse(
>             newStream,
>             new EmbeddedContentHandler(new BodyContentHandler(handler)),
>             metadata, context);
> } catch (EncryptedDocumentException ede) {
> // TODO: can we log a warning that we lack the password?
> // For now, just skip the content
> } catch (TikaException e) {
> // TODO: can we log a warning somehow?
> // Could not parse the entry, just skip the content
> } finally {
> tmp.close();
> }
> {noformat}
> For some applications, it might be better to store the stack trace of the 
> attachment that caused an exception.
> The proposal would be to include the stack trace in the metadata object for 
> that particular attachment.
> The user will be able to specify whether or not to store stack traces, and 
> the default will be to store stack traces.  This will be a small change to 
> the legacy behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505269#comment-14505269
 ] 

Tim Allison edited comment on TIKA-879 at 4/21/15 5:04 PM:
---

Y, will do. Results probably tomorrow.

This?

(proposed mime-magic XML snippet stripped by the mail archive)

was (Author: talli...@mitre.org):
Y, will do. Results probably tomorrow.

> Detection problem: message/rfc822 file is detected as text/plain.
> -
>
> Key: TIKA-879
> URL: https://issues.apache.org/jira/browse/TIKA-879
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, mime
>Affects Versions: 1.0, 1.1, 1.2
> Environment: linux 3.2.9
> oracle jdk7, openjdk7, sun jdk6
>Reporter: Konstantin Gribov
>  Labels: new-parser
> Attachments: TIKA-879-thunderbird.eml
>
>
> When using {{DefaultDetector}}, the mime type detected for {{.eml}} files 
> differs (you can test this on {{testRFC822}} and {{testRFC822_base64}} in 
> {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for this behavior is that only the magic detector really 
> works for such files, even if you set {{CONTENT_TYPE}} in the metadata or an 
> {{.eml}} file name in {{RESOURCE_NAME_KEY}}.
> As I found {{MediaTypeRegistry.isSpecializationOf("message/rfc822", 
> "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
> works only by magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505269#comment-14505269
 ] 

Tim Allison commented on TIKA-879:
--

Y, will do. Results probably tomorrow.

> Detection problem: message/rfc822 file is detected as text/plain.
> -
>
> Key: TIKA-879
> URL: https://issues.apache.org/jira/browse/TIKA-879
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, mime
>Affects Versions: 1.0, 1.1, 1.2
> Environment: linux 3.2.9
> oracle jdk7, openjdk7, sun jdk6
>Reporter: Konstantin Gribov
>  Labels: new-parser
> Attachments: TIKA-879-thunderbird.eml
>
>
> When using {{DefaultDetector}}, the mime type detected for {{.eml}} files 
> differs (you can test this on {{testRFC822}} and {{testRFC822_base64}} in 
> {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for this behavior is that only the magic detector really 
> works for such files, even if you set {{CONTENT_TYPE}} in the metadata or an 
> {{.eml}} file name in {{RESOURCE_NAME_KEY}}.
> As I found {{MediaTypeRegistry.isSpecializationOf("message/rfc822", 
> "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
> works only by magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505178#comment-14505178
 ] 

Jeremy B. Merrill commented on TIKA-1608:
-

It's the only one I've found so far out of 300,000ish documents (most of which 
are plain emails, few of which are .docs).

> RuntimeException on extracting text from Word 97-2004 Document
> --
>
> Key: TIKA-1608
> URL: https://issues.apache.org/jira/browse/TIKA-1608
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Jeremy B. Merrill
> Attachments: 1534-attachment.doc
>
>
> Extracting text from the Word 97-2004 document attached here fails with the 
> following stacktrace:
> $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
> 1534-attachment.doc 
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@69af0db6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at 
> org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
>   at 
> org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
>   at 
> org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   ... 5 more
> I'm using trunk from Github, which I think is a flavor of 1.9. The document 
> opens properly in Word for Mac '11.
> Happy to answer questions; I'm also on the "user" mailing list. If it's 
> relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
> that document here in Jira rather than on my own dropbox.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1554) Improve EMF file detection

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505172#comment-14505172
 ] 

Luis Filipe Nassif commented on TIKA-1554:
--

Actually r1667661

> Improve EMF file detection
> --
>
> Key: TIKA-1554
> URL: https://issues.apache.org/jira/browse/TIKA-1554
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.7
>Reporter: Luis Filipe Nassif
>Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: nonEmf.dat
>
>
> I am getting many files being incorrectly detected as application/x-emf. I 
> think the current magic is too common. According to MS documentation 
> (https://msdn.microsoft.com/en-us/library/cc230635.aspx and 
> https://msdn.microsoft.com/en-us/library/dd240211.aspx), it can be improved 
> to:
> {code}
> EMF
> <_comment>Extended Metafile
> <!-- magic match elements stripped by the mail archive -->
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505132#comment-14505132
 ] 

Luis Filipe Nassif commented on TIKA-879:
-

Maybe we could keep the original magics and ADD the widened versions with a 
"\n" prefix to decrease the number of false positives (I have got a small 
number of them)? Could you try the widened magics with govdocs1 
[~talli...@mitre.org]?
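For concreteness, the "keep both" idea might look like this in tika-mimes.xml. The values, offsets, and priorities below are purely illustrative assumptions, not the actual magics under test:

```xml
<mime-type type="message/rfc822">
  <!-- original, strict magic kept at its current priority -->
  <magic priority="50">
    <match value="From:" type="string" offset="0"/>
  </magic>
  <!-- widened variant anchored on a preceding newline to curb false positives -->
  <magic priority="40">
    <match value="\nFrom:" type="string" offset="0:1000"/>
  </magic>
</mime-type>
```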

> Detection problem: message/rfc822 file is detected as text/plain.
> -
>
> Key: TIKA-879
> URL: https://issues.apache.org/jira/browse/TIKA-879
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, mime
>Affects Versions: 1.0, 1.1, 1.2
> Environment: linux 3.2.9
> oracle jdk7, openjdk7, sun jdk6
>Reporter: Konstantin Gribov
>  Labels: new-parser
> Attachments: TIKA-879-thunderbird.eml
>
>
> When using {{DefaultDetector}}, the mime type detected for {{.eml}} files 
> differs (you can test this on {{testRFC822}} and {{testRFC822_base64}} in 
> {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for this behavior is that only the magic detector really 
> works for such files, even if you set {{CONTENT_TYPE}} in the metadata or an 
> {{.eml}} file name in {{RESOURCE_NAME_KEY}}.
> As I found {{MediaTypeRegistry.isSpecializationOf("message/rfc822", 
> "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
> works only by magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505113#comment-14505113
 ] 

Tim Allison commented on TIKA-1608:
---

In govdocs1, there are 24 of these:
{noformat}
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.poi.hwpf.sprm.SprmBuffer.append(SprmBuffer.java:128)
at org.apache.poi.hwpf.model.PAPBinTable.rebuild(PAPBinTable.java:293)
at org.apache.poi.hwpf.model.PAPBinTable.rebuild(PAPBinTable.java:116)
at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:136)
at o.a.t.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
{noformat}

There are 2 of those in our commoncrawl slice.

Nothing that matches your trace, though.  
Thank you for attaching it.  How common is this stack trace in your set?

> RuntimeException on extracting text from Word 97-2004 Document
> --
>
> Key: TIKA-1608
> URL: https://issues.apache.org/jira/browse/TIKA-1608
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Jeremy B. Merrill
> Attachments: 1534-attachment.doc
>
>
> Extracting text from the Word 97-2004 document attached here fails with the 
> following stacktrace:
> $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
> 1534-attachment.doc 
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@69af0db6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at 
> org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
>   at 
> org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
>   at 
> org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   ... 5 more
> I'm using trunk from Github, which I think is a flavor of 1.9. The document 
> opens properly in Word for Mac '11.
> Happy to answer questions; I'm also on the "user" mailing list. If it's 
> relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
> that document here in Jira rather than on my own dropbox.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy B. Merrill updated TIKA-1608:

Description: 
Extracting text from the Word 97-2004 document attached here fails with the 
following stacktrace:

$ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
1534-attachment.doc 
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected 
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at 
org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
at 
org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
at 
org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
at 
org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
at 
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
... 5 more

I'm using trunk from Github, which I think is a flavor of 1.9. The document 
opens properly in Word for Mac '11.

Happy to answer questions; I'm also on the "user" mailing list. If it's 
relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
that document here in Jira rather than on my own dropbox.)


  was:
Extracting text from the Word 97-2004 document located here 
(https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails with 
the following stacktrace:

$ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
1534-attachment.doc 
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected 
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at 
org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
at 
org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
at 
org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
at 
org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
at 
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
... 5 more

I'm using trunk from Github, which I think is a flavor of 1.9. The document 
opens properly in Word for Mac '11.

Happy to answer questions; I'm also on the "user" mailing list. If it's 
relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
that document here in Jira rather than on my own dropbox.)



> RuntimeException on extracting text from Word 97-2004 Document
> --
>
> Key: TIKA-1608
> URL: https://issues.apache.org/jira/browse/TIKA-1608
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Jeremy B. Merrill
> Attachments: 1534-attachment.doc
>
>
> Extracting text from the Word 97-2004 document attached here fails with the 
> following stacktrace:
> $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
> 1534-attachment.doc 
> Exception in thread 

[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505102#comment-14505102
 ] 

Jeremy B. Merrill commented on TIKA-1608:
-

POI bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=57843

> RuntimeException on extracting text from Word 97-2004 Document
> --
>
> Key: TIKA-1608
> URL: https://issues.apache.org/jira/browse/TIKA-1608
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Jeremy B. Merrill
> Attachments: 1534-attachment.doc
>
>
> Extracting text from the Word 97-2004 document attached here fails with the 
> following stacktrace:
> $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
> 1534-attachment.doc 
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@69af0db6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at 
> org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
>   at 
> org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
>   at 
> org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   ... 5 more
> I'm using trunk from Github, which I think is a flavor of 1.9. The document 
> opens properly in Word for Mac '11.
> Happy to answer questions; I'm also on the "user" mailing list. If it's 
> relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
> that document here in Jira rather than on my own dropbox.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persitsence of Tika Metadata

2015-04-21 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-1607:
---
Summary: Introduce new arbitrary object key/values data structure for 
persitsence of Tika Metadata  (was: Introduce new HashMap data 
structure for persitsence of Tika Metadata)

> Introduce new arbitrary object key/values data structure for persitsence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.9
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection of HashMaps, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the String-to-Object mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy B. Merrill updated TIKA-1608:

Attachment: 1534-attachment.doc

document failing under this bug

> RuntimeException on extracting text from Word 97-2004 Document
> --
>
> Key: TIKA-1608
> URL: https://issues.apache.org/jira/browse/TIKA-1608
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Jeremy B. Merrill
> Attachments: 1534-attachment.doc
>
>
> Extracting text from the Word 97-2004 document located here 
> (https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails 
> with the following stacktrace:
> $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
> 1534-attachment.doc 
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@69af0db6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at 
> org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
>   at 
> org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
>   at 
> org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   ... 5 more
> I'm using trunk from Github, which I think is a flavor of 1.9. The document 
> opens properly in Word for Mac '11.
> Happy to answer questions; I'm also on the "user" mailing list. If it's 
> relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
> that document here in Jira rather than on my own dropbox.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505093#comment-14505093
 ] 

Jeremy B. Merrill commented on TIKA-1608:
-

Hi Tim,

I added the document. I'm totally cool with the document being viewed by the 
public. I can't really grant it to the ASF since I didn't create it. It's an 
attachment from an email in an email dump (http://jebemail.com) posted by 
former Florida governor Jeb Bush. So whether it's usable is probably a question 
for the ASF's lawyers. 

But for the avoidance of doubt, I grant any rights that I might have in the 
document to the ASF.

I'll open a POI bug.

> RuntimeException on extracting text from Word 97-2004 Document
> --
>
> Key: TIKA-1608
> URL: https://issues.apache.org/jira/browse/TIKA-1608
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Jeremy B. Merrill
> Attachments: 1534-attachment.doc
>
>
> Extracting text from the Word 97-2004 document located here 
> (https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails 
> with the following stacktrace:
> $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
> 1534-attachment.doc 
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@69af0db6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at 
> org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
>   at 
> org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
>   at 
> org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   ... 5 more
> I'm using trunk from Github, which I think is a flavor of 1.9. The document 
> opens properly in Word for Mac '11.
> Happy to answer questions; I'm also on the "user" mailing list. If it's 
> relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
> that document here in Jira rather than on my own dropbox.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505092#comment-14505092
 ] 

Tim Allison commented on TIKA-1513:
---

Completely agree.  

Only 2,386 files.

This is the table of the file extensions for files identified as 
application/octet-stream.

||File Extension||Count||
|dbase3|1664|
|wp|362|
|unk|   285|
|gls|   60|
|ileaf| 4|
|sys|   3|
|chp|   2|
|lnk|   2|
|mac|   2|
|squeak|1|
|bin|   1|

Would very much appreciate hearing what you find, and yes, we can certainly 
decrease the priority...I had my priorities backwards.  Sorry.

Obviously, if you find false positives, we'll back off to file suffix.  I, too, 
was less than enthusiastic about a single byte mime id'er.

Thank you!

> Add mime detection and parsing for dbf files
> 
>
> Key: TIKA-1513
> URL: https://issues.apache.org/jira/browse/TIKA-1513
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.9
>
>
> I just came across an Apache licensed dbf parser that is available on 
> [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-1554) Improve EMF file detection

2015-04-21 Thread Luis Filipe Nassif (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Filipe Nassif closed TIKA-1554.

   Resolution: Fixed
Fix Version/s: 1.8

Resolved in r4608ff5. Thanks.

> Improve EMF file detection
> --
>
> Key: TIKA-1554
> URL: https://issues.apache.org/jira/browse/TIKA-1554
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.7
>Reporter: Luis Filipe Nassif
>Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: nonEmf.dat
>
>
> I am getting many files being incorrectly detected as application/x-emf. I 
> think the current magic is too common. According to MS documentation 
> (https://msdn.microsoft.com/en-us/library/cc230635.aspx and 
> https://msdn.microsoft.com/en-us/library/dd240211.aspx), it can be improved 
> to:
> {code}
> EMF
> <_comment>Extended Metafile
> <!-- magic match elements stripped by the mail archive -->
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505057#comment-14505057
 ] 

Luis Filipe Nassif commented on TIKA-1513:
--

No, I did not give a try to 0x03. How many files are detected as octet-stream 
in govdocs1? I wouldn't like to hit an issue similar to TIKA-1554 again (I am 
indexing ALL desktop files). I will test 0x03 and report the results here. Can 
we at least decrease the magic priority to 10 or 20 for now?
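Lowering a magic's priority is a one-attribute change in tika-mimes.xml. The sketch below is illustrative only (the mime type name, match value, and offset are assumptions, not the actual dbf entry):

```xml
<mime-type type="application/x-dbf">
  <!-- illustrative: a low priority lets stronger, longer magics win first -->
  <magic priority="20">
    <match value="0x03" type="string" offset="0"/>
  </magic>
  <glob pattern="*.dbf"/>
</mime-type>
```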

> Add mime detection and parsing for dbf files
> 
>
> Key: TIKA-1513
> URL: https://issues.apache.org/jira/browse/TIKA-1513
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.9
>
>
> I just came across an Apache licensed dbf parser that is available on 
> [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new HashMap data structure for persitsence of Tika Metadata

2015-04-21 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505054#comment-14505054
 ] 

Ray Gauss II commented on TIKA-1607:


We've had a few discussions on structured metadata over the years, some of 
which were captured in the [MetadataRoadmap Wiki 
page|http://wiki.apache.org/tika/MetadataRoadmap].

I'd agree that we should strive to maintain backwards compatibility for simple 
values.

I think we should also consider serialization of the metadata store, not just 
in the {{Serializable}} interface sense, but perhaps being able to easily 
marshal the entire metadata store into JSON and XML.
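As a rough illustration of the JSON idea, marshaling a flat metadata store could look like the following JDK-only sketch (a real implementation would likely use a proper library such as Jackson and handle escaping fully; the class name is illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataJsonSketch {
    // Naive JSON marshaling of a flat String-to-String metadata map.
    // Only quote characters are escaped here; this is a sketch, not
    // a complete JSON emitter.
    static String toJson(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (!first) sb.append(",");
            first = false;
            sb.append('"').append(e.getKey().replace("\"", "\\\""))
              .append("\":\"").append(e.getValue().replace("\"", "\\\"")).append('"');
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("dc:title", "example");
        System.out.println(toJson(m)); // {"dc:title":"example"}
    }
}
```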

As [~gagravarr] points out, work has been done to express structured metadata 
via the existing metadata store.  In that email thread you'll find reference to 
the external [tika-ffmpeg project|https://github.com/AlfrescoLabs/tika-ffmpeg].

> Introduce new HashMap data structure for persitsence of Tika 
> Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.9
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1501) Fix the disabled Tika Bundle OSGi related unit tests

2015-04-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505051#comment-14505051
 ] 

Hudson commented on TIKA-1501:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #638 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/638/])
TIKA-1501: Fix disabled OSGi related unit tests. Fixes from Bob Paulin. 
(tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1675121)
* /tika/trunk/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java


> Fix the disabled Tika Bundle OSGi related unit tests
> 
>
> Key: TIKA-1501
> URL: https://issues.apache.org/jira/browse/TIKA-1501
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Affects Versions: 1.6, 1.7
>Reporter: Nick Burch
> Fix For: 1.9
>
> Attachments: TIKA-1501-trunk.patch, TIKA-1501-trunkv2.patch, 
> TIKA-1501.patch
>
>
> Currently, the unit tests for the Tika Bundle contain several bits like:
> {code}
> @Ignore // TODO Fix this test
> {code}
> We should really fix these unit tests so they work, and re-enable them



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-04-21 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505042#comment-14505042
 ] 

Moritz Dorka commented on TIKA-1315:


I believe I could speed up the process by ultimately writing a unit test for 
the POI part... I'm just having a hard time motivating myself to write unit 
tests for a few stupid getters.

What you could also do is to hardcode 
{code}getLevelNumberingPlaceholderOffsets(){code} to always return 
{code}[1,3,5,7,9,11,13,15,17]{code}. This should hold true for most 
(trivial) cases (however, I have not tested how my code reacts to such 
cheating).
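For reference, the hardcoding suggested above could look like this stand-in for the POI accessor (a sketch of the workaround, not the real POI implementation):

```java
import java.util.Arrays;

public class PlaceholderSketch {
    // Hardcoded stand-in for the POI accessor discussed above: the odd
    // offsets 1,3,5,... correspond to the trivial "one placeholder per
    // list level" layout for the nine possible levels.
    static int[] getLevelNumberingPlaceholderOffsets() {
        return new int[]{1, 3, 5, 7, 9, 11, 13, 15, 17};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(getLevelNumberingPlaceholderOffsets()));
    }
}
```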

There is also a very subtle bug left in my code which only triggers in 
ListLevelOverrides and _sometimes_ provokes wrong number increments. If I find 
the time I will update my patch.

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Priority: Minor
> Fix For: 1.9
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
>
>
> Hello guys, I am really sorry to post issue like this because I have no other 
> way of contacting you and I don't quite understand how you manage forks and 
> pull requests (I don't think you do that). Plus I don't know your coding 
> styles and stuff.
> In my project I needed for tika to parse numbered lists from word .doc 
> documents, but TIKA doesn't support it. So I looked for solution and found 
> one here: 
> http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
>  . So I adapted this solution to Apache TIKA with few fixes and improvements. 
> Anyway feel free to use any of it so it can help people who struggle with 
> lists in TIKA like I did.
> Attached files are:
> Updated test
> Fixed WordExtractor
> Added ListUtils



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [ANNOUNCE] Apache Tika 1.8 Released

2015-04-21 Thread Mattmann, Chris A (3980)
Yay thanks Tyler!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Tuesday, April 21, 2015 at 8:34 AM
To: "dev@tika.apache.org" 
Subject: RE: [ANNOUNCE] Apache Tika 1.8 Released

>Thank you, Tyler!
>
>-Original Message-
>From: Tyler Palsulich [mailto:tpalsul...@apache.org]
>Sent: Monday, April 20, 2015 5:09 PM
>To: dev@tika.apache.org; u...@tika.apache.org; annou...@apache.org
>Subject: [ANNOUNCE] Apache Tika 1.8 Released
>
>The Apache Tika project is pleased to announce the release of Apache Tika
>1.8. The release
>contents have been pushed out to the main Apache release site and to the
>Maven Central sync, so the releases should be available as soon as the
>mirrors get the syncs.
>
>Apache Tika is a toolkit for detecting and extracting metadata and
>structured text content
>from various documents using existing parser libraries.
>
>Apache Tika 1.8 contains a number of improvements and bug fixes. Details
>can be found in the changes file:
>http://www.apache.org/dist/tika/CHANGES-1.8.txt
>
>Apache Tika is available in source form from the following download page:
>http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.8-src.zip
>
>Apache Tika is also available in binary form or for use using Maven 2 from
>the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/
>
>In the initial 48 hours, the release may not be available on all mirrors.
>When downloading from a mirror site, please remember to verify the
>downloads using signatures found on the Apache site:
>https://people.apache.org/keys/group/tika.asc
>
>For more information on Apache Tika, visit the project home page:
>http://tika.apache.org/
>
>-- Tyler Palsulich, on behalf of the Apache Tika community



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505006#comment-14505006
 ] 

Tim Allison commented on TIKA-1513:
---

Yes, I was concerned about that generally.  Are you getting false positives 
with 0x03 specifically?  I didn't find any in govdocs1, but I realize that 
corpus has limitations.

Will add text/plain as supertype.  Thank you!

> Add mime detection and parsing for dbf files
> 
>
> Key: TIKA-1513
> URL: https://issues.apache.org/jira/browse/TIKA-1513
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.9
>
>
> I just came across an Apache licensed dbf parser that is available on 
> [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504996#comment-14504996
 ] 

Luis Filipe Nassif commented on TIKA-1513:
--

Hi Tim,

I am ok with 1) and 2). But I think a one-byte magic can result in many false 
positives, especially for binary files. My current approach is detection by 
extension only, which required declaring text/plain as a supertype.

> Add mime detection and parsing for dbf files
> 
>
> Key: TIKA-1513
> URL: https://issues.apache.org/jira/browse/TIKA-1513
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.9
>
>
> I just came across an Apache licensed dbf parser that is available on 
> [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-04-21 Thread Moritz Dorka (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505004#comment-14505004
 ] 

Moritz Dorka commented on TIKA-1315:


Well, the original patch by Filip is essentially an 80% solution. Everything 
that I added is rather obscure functionality...

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Priority: Minor
> Fix For: 1.9
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
>
>
> Hello guys, I am really sorry to post issue like this because I have no other 
> way of contacting you and I don't quite understand how you manage forks and 
> pull requests (I don't think you do that). Plus I don't know your coding 
> styles and stuff.
> In my project I needed for tika to parse numbered lists from word .doc 
> documents, but TIKA doesn't support it. So I looked for solution and found 
> one here: 
> http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
>  . So I adapted this solution to Apache TIKA with few fixes and improvements. 
> Anyway feel free to use any of it so it can help people who struggle with 
> lists in TIKA like I did.
> Attached files are:
> Updated test
> Fixed WordExtractor
> Added ListUtils



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505008#comment-14505008
 ] 

Tim Allison commented on TIKA-1315:
---

Ha.  Ok, but your patch is really well done.  Let me take a look at Filip's.  
I'll see if we can find someone on POI to add that call soon.  Thank you!

> Basic list support in WordExtractor
> ---
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Filip Bednárik
>Priority: Minor
> Fix For: 1.9
>
> Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
>
>
> Hello guys, I am really sorry to post issue like this because I have no other 
> way of contacting you and I don't quite understand how you manage forks and 
> pull requests (I don't think you do that). Plus I don't know your coding 
> styles and stuff.
> In my project I needed for tika to parse numbered lists from word .doc 
> documents, but TIKA doesn't support it. So I looked for solution and found 
> one here: 
> http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
>  . So I adapted this solution to Apache TIKA with few fixes and improvements. 
> Anyway feel free to use any of it so it can help people who struggle with 
> lists in TIKA like I did.
> Attached files are:
> Updated test
> Fixed WordExtractor
> Added ListUtils



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1611) Allow RecursiveParserWrapper to catch exceptions from embedded documents

2015-04-21 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1611:
-

 Summary: Allow RecursiveParserWrapper to catch exceptions from 
embedded documents
 Key: TIKA-1611
 URL: https://issues.apache.org/jira/browse/TIKA-1611
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.9


Currently, while parsing embedded documents, if a parser hits an Exception, 
parsing of the entire document comes to a grinding halt.  For some 
applications, it might be better to catch the exception at the attachment level.

The proposal would be to include the stack trace in the metadata object for 
that particular attachment.

The user will be able to specify whether or not to catch embedded exceptions, 
and the default will be to catch embedded exceptions.  This will be a small 
change to the legacy behavior.
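A rough sketch of the proposed behavior, with the parser reduced to a Runnable and a hypothetical metadata key (the real key name and wiring would be decided in the patch):

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

public class EmbeddedCatchSketch {
    // Hypothetical metadata key; the actual key name is up to TIKA-1611.
    static final String EMBEDDED_EXCEPTION = "X-TIKA:EXCEPTION:embedded_exception";

    // "Parse" one attachment; on failure, record the stack trace in that
    // attachment's metadata map instead of aborting the whole document.
    static Map<String, String> parseAttachment(Runnable parser, boolean catchEmbedded) {
        Map<String, String> metadata = new HashMap<>();
        try {
            parser.run();
        } catch (RuntimeException e) {
            if (!catchEmbedded) throw e; // legacy behavior: propagate
            StringWriter sw = new StringWriter();
            e.printStackTrace(new PrintWriter(sw));
            metadata.put(EMBEDDED_EXCEPTION, sw.toString());
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> m =
                parseAttachment(() -> { throw new RuntimeException("boom"); }, true);
        System.out.println(m.containsKey(EMBEDDED_EXCEPTION)); // true
    }
}
```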



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new HashMap data structure for persitsence of Tika Metadata

2015-04-21 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504999#comment-14504999
 ] 

Sergey Beryozkin commented on TIKA-1607:


Hi, 
IMHO it indeed makes sense to keep the existing Metadata methods that return 
String values, but also to offer optional support for representing Metadata as 
a multivalued map of arbitrary object keys/values, where the original String to 
String[] pairs are converted into something more sophisticated if required...

By the way, JAX-RS API has this interface:
http://docs.oracle.com/javaee/7/api/javax/ws/rs/core/MultivaluedMap.html

Not suggesting to use natively in Tika, but it might be of interest...
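A minimal sketch of such a multivalued store that keeps a backwards-compatible String accessor (names are illustrative, not a proposed Tika API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultivaluedSketch {
    // Each key maps to a list of arbitrary values; String-only callers
    // keep working by reading the first value rendered as a String.
    private final Map<String, List<Object>> store = new LinkedHashMap<>();

    void add(String name, Object value) {
        store.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Backwards-compatible accessor in the spirit of Metadata.get(String).
    String get(String name) {
        List<Object> values = store.get(name);
        return (values == null || values.isEmpty()) ? null : String.valueOf(values.get(0));
    }

    List<Object> getValues(String name) {
        return store.getOrDefault(name, List.of());
    }

    public static void main(String[] args) {
        MultivaluedSketch m = new MultivaluedSketch();
        m.add("phonenumbers", Map.of("LibPN-CountryCode", "US"));
        m.add("phonenumbers", Map.of("LibPN-CountryCode", "UK"));
        System.out.println(m.getValues("phonenumbers").size()); // 2
    }
}
```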

Cheers, Sergey



> Introduce new HashMap data structure for persitsence of Tika 
> Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.9
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the  Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1501) Fix the disabled Tika Bundle OSGi related unit tests

2015-04-21 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1501.
---
   Resolution: Fixed
Fix Version/s: 1.9

r1675121.

Thank you, [~bobpaulin]!

> Fix the disabled Tika Bundle OSGi related unit tests
> 
>
> Key: TIKA-1501
> URL: https://issues.apache.org/jira/browse/TIKA-1501
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Affects Versions: 1.6, 1.7
>Reporter: Nick Burch
> Fix For: 1.9
>
> Attachments: TIKA-1501-trunk.patch, TIKA-1501-trunkv2.patch, 
> TIKA-1501.patch
>
>
> Currently, the unit tests for the Tika Bundle contain several bits like:
> {code}
> @Ignore // TODO Fix this test
> {code}
> We should really fix these unit tests so they work, and re-enable them



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504951#comment-14504951
 ] 

Tim Allison commented on TIKA-1513:
---

From govdocs1, it looks like a first byte of 0x03 is a safe way to identify 
these files.  

[This|http://www.digitalpreservation.gov/formats/fdd/fdd000325.shtml] was 
useful.

Two mime type questions:
1)  What should we use as the canonical mime type for .dbf files?  Proposal: 
{{application/x-dbf}}.

2)  What mimes should the parser "accept", or what should we include in the 
aliases?
From [filext.com|http://filext.com/file-extension/DBF]:
* application/dbase
* application/x-dbase
* application/dbf
* application/x-dbf
* zz-application/zz-winassoc-dbf

First attempt at mime definition:
{noformat}
  <mime-type type="application/x-dbf">
    <alias type="application/dbase"/>
    <alias type="application/x-dbase"/>
    <alias type="application/dbf"/>
    <alias type="zz-application/zz-winassoc-dbf"/>
    <magic priority="50">
      <match value="0x03" type="string" offset="0"/>
    </magic>
    <glob pattern="*.dbf"/>
  </mime-type>
{noformat}
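The 0x03 first-byte check is easy to prototype outside Tika; a minimal sketch (illustrative names, not the Tika detector API):

```java
public class DbfSniff {
    // A DBF file's first byte encodes the version; 0x03 is the common
    // dBASE III/IV "file without memo" marker discussed in this thread.
    // A single-byte magic is weak on its own, which is why the thread
    // debates false positives and a lower magic priority.
    static boolean looksLikeDbf(byte[] header) {
        return header.length > 0 && header[0] == 0x03;
    }

    public static void main(String[] args) {
        byte[] dbfHeader = {0x03, 0x5F, 0x07, 0x1A};
        byte[] textHeader = "plain text".getBytes();
        System.out.println(looksLikeDbf(dbfHeader));   // true
        System.out.println(looksLikeDbf(textHeader));  // false
    }
}
```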

> Add mime detection and parsing for dbf files
> 
>
> Key: TIKA-1513
> URL: https://issues.apache.org/jira/browse/TIKA-1513
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.9
>
>
> I just came across an Apache licensed dbf parser that is available on 
> [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1532) DIF Parser

2015-04-21 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504904#comment-14504904
 ] 

Konstantin Gribov commented on TIKA-1532:
-

{{text/\*+xml}} is quite an unusual type. OTOH, there are a lot of 
{{application/\*+xml}} and {{application/vnd.\*+xml}} types in the IANA media 
types list (http://www.iana.org/assignments/media-types/media-types.xhtml)

> DIF Parser
> --
>
> Key: TIKA-1532
> URL: https://issues.apache.org/jira/browse/TIKA-1532
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Aakarsh Medleri Hire Math
>  Labels: memex
>
> MIME Type detection & content parser for .dif format



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


RE: [ANNOUNCE] Apache Tika 1.8 Released

2015-04-21 Thread Allison, Timothy B.
Thank you, Tyler!

-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@apache.org] 
Sent: Monday, April 20, 2015 5:09 PM
To: dev@tika.apache.org; u...@tika.apache.org; annou...@apache.org
Subject: [ANNOUNCE] Apache Tika 1.8 Released

The Apache Tika project is pleased to announce the release of Apache Tika
1.8. The release
contents have been pushed out to the main Apache release site and to the
Maven Central sync, so the releases should be available as soon as the
mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content
from various documents using existing parser libraries.

Apache Tika 1.8 contains a number of improvements and bug fixes. Details
can be found in the changes file:
http://www.apache.org/dist/tika/CHANGES-1.8.txt

Apache Tika is available in source form from the following download page:
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.8-src.zip

Apache Tika is also available in binary form or for use using Maven 2 from
the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors.
When downloading from a mirror site, please remember to verify the
downloads using signatures found on the Apache site:
https://people.apache.org/keys/group/tika.asc

For more information on Apache Tika, visit the project home page:
http://tika.apache.org/

-- Tyler Palsulich, on behalf of the Apache Tika community


[jira] [Commented] (TIKA-1295) Make some Dublin Core items multi-valued

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504884#comment-14504884
 ] 

Tim Allison commented on TIKA-1295:
---

[~lewismc], +1 to adding potential for hierarchical metadata on TIKA-1607.  We 
should ensure during the transition (and maybe forever), that users can still 
get strings fairly easily.

> Make some Dublin Core items multi-valued
> 
>
> Key: TIKA-1295
> URL: https://issues.apache.org/jira/browse/TIKA-1295
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.9
>
>
> According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, 
> dc:title, dc:description and dc:rights should allow multiple values because 
> of language alternatives.  Unless anyone objects in the next few days, I'll 
> switch those to Property.toInternalTextBag() from Property.toInternalText().  
> I'll also modify PDFParser to extract dc:rights.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504871#comment-14504871
 ] 

Tim Allison commented on TIKA-1608:
---

[~jeremybmerrill], thank you for raising this issue. If you go to "More", 
there's an "Attach Files" option.  As I'm sure you've done, please only attach 
files that are ok to share with the public, and please let us know if the file 
is "granted" to Apache under ASF 2.0 so that we can use it in unit tests in the 
future.

I'll take a look at our govdocs1/CommonCrawl exceptions and see if I can find a 
doc in there that matches your stack trace.

From the stacktrace, it looks like the fix will have to be made at the POI 
level.  I could be wrong, though!  If you haven't done so already, please open 
a ticket on POI's 
[bugzilla|https://bz.apache.org/bugzilla/buglist.cgi?quicksearch=poi&list_id=123825]
 and add a hyperlink from there to here and vice versa so that we can track 
progress over here.

Thank you, again.

> RuntimeException on extracting text from Word 97-2004 Document
> --
>
> Key: TIKA-1608
> URL: https://issues.apache.org/jira/browse/TIKA-1608
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Jeremy B. Merrill
>
> Extracting text from the Word 97-2004 document located here 
> (https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails 
> with the following stacktrace:
> $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 
> 1534-attachment.doc 
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@69af0db6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at 
> org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
>   at 
> org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
>   at 
> org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   ... 5 more
> I'm using trunk from Github, which I think is a flavor of 1.9. The document 
> opens properly in Word for Mac '11.
> Happy to answer questions; I'm also on the "user" mailing list. If it's 
> relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
> that document here in Jira rather than on my own dropbox.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1610) CBOR Parser and detection [improvement]

2015-04-21 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Summary: CBOR Parser and detection [improvement]  (was: CBOR Parser and 
detection improvement)

> CBOR Parser and detection [improvement]
> ---
>
> Key: TIKA-1610
> URL: https://issues.apache.org/jira/browse/TIKA-1610
> Project: Tika
>  Issue Type: New Feature
>  Components: detector, mime, parser
>Affects Versions: 1.7
>Reporter: Luke sh
>Priority: Trivial
>  Labels: memex
> Attachments: 142440269.html, cbor_tika.mimetypes.xml.jpg, 
> rfc_cbor.jpg
>
>
> CBOR is a data format whose design goals include the possibility of extremely 
> small code size, fairly small message size, and extensibility without the 
> need for version negotiation (cited from http://cbor.io/ ).
> It would be great if Tika were able to provide support for CBOR parsing and 
> identification. In the current project with Nutch, the Nutch 
> CommonCrawlDataDumper is used to dump the crawled segments to files in 
> the CBOR format. In order to read/parse those dumped files, it would be 
> great if tika were able to parse cbor. The thing is that the 
> CommonCrawlDataDumper does not dump with the correct extension; it dumps 
> with its own rule, and the default extension of the dumped file is html, so 
> it would be less painful if tika were able to detect and parse those files 
> without any pre-processing steps. 
> CommonCrawlDataDumper is calling the following to dump with cbor.
> import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
> import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
> fasterxml is a 3rd-party library for converting json to .cbor and vice versa.
> According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
> CBOR does not yet have its magic numbers to be detected/identified by other 
> applications (PFA: rfc_cbor.jpg)
> It seems that the only way to inform other applications of the type as of now 
> is using the extension (i.e. .cbor), or probably content detection (i.e. byte 
> histogram distribution estimation).  
> There is another thing worth attention: it looks like tika has attempted 
> to add support for cbor mime detection in the tika-mimetypes.xml 
> (PFA: cbor_tika.mimetypes.xml.jpg). This detection is not working with the 
> cbor file dumped by CommonCrawlDataDumper. 
> According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
> self-describing Tag 55799 that seems to be usable for cbor type 
> identification (the hex code might be 0xd9d9f7), but it is probably up to the 
> application to take care of this tag, and it is also possible that the 
> fasterxml library used by the Nutch dumping tool omits this tag. An example 
> cbor file dumped by the Nutch tool, i.e. CommonCrawlDataDumper, has also 
> been attached (PFA: 142440269.html).
> The following info is cited from the rfc, "...a decoder might be able to 
> parse both CBOR and JSON.
>Such a decoder would need to mechanically distinguish the two
>formats.  An easy way for an encoder to help the decoder would be to
>tag the entire CBOR item with tag 55799, the serialization of which
>will never be found at the beginning of a JSON text..."
> It looks like a file can have two parts/sections, i.e. the plain text 
> parts and the json prettified by cbor; this might also be worth attention 
> and consideration for parsing and type identification.
> On the other hand, it is worth noting that an entry for cbor extension 
> detection needs to be appended to the tika-mimetypes.xml too, 
> e.g.
> <glob pattern="*.cbor"/>
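The tag 55799 check described in the issue amounts to looking for the three bytes 0xD9 0xD9 0xF7 at the start of the stream; a minimal sketch (illustrative names, not Tika code):

```java
public class CborSniff {
    // RFC 7049 section 2.4.5: an encoder may prefix the entire CBOR item
    // with tag 55799, whose serialization is the bytes 0xD9 0xD9 0xF7 --
    // a sequence that can never start a JSON text.
    static boolean hasSelfDescribeTag(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xD9
                && (data[1] & 0xFF) == 0xD9
                && (data[2] & 0xFF) == 0xF7;
    }

    public static void main(String[] args) {
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, (byte) 0xA1};
        byte[] json = "{\"a\":1}".getBytes();
        System.out.println(hasSelfDescribeTag(tagged)); // true
        System.out.println(hasSelfDescribeTag(json));   // false
    }
}
```

As the issue notes, this only helps when the encoder actually emits the tag; files dumped by CommonCrawlDataDumper may lack it, in which case extension globbing or statistical content detection remain the fallbacks.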



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1610) CBOR Parser and detection improvement

2015-04-21 Thread Luke sh (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--
Description: 
CBOR is a data format whose design goals include the possibility of extremely 
small code size, fairly small message size, and extensibility without the need 
for version negotiation (cited from http://cbor.io/ ).

It would be great if Tika were able to provide support for CBOR parsing and 
identification. In the current project with Nutch, the Nutch 
CommonCrawlDataDumper is used to dump the crawled segments to files in the 
CBOR format. In order to read/parse those dumped files, it would be great if 
tika were able to parse cbor. The thing is that the CommonCrawlDataDumper does 
not dump with the correct extension; it dumps with its own rule, and the 
default extension of the dumped file is html, so it would be less painful if 
tika were able to detect and parse those files without any 
pre-processing steps. 

CommonCrawlDataDumper calls the following to dump CBOR:
import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;

fasterxml (Jackson) is a third-party library for converting JSON to CBOR and vice versa.
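To make the binary nature of the dumped files concrete, here is a hand-rolled sketch of the CBOR encoding of the one-pair map {"a": 1} per RFC 7049. This is illustrative only; it does not show how CommonCrawlDataDumper invokes jackson, and the class/method names are made up for the example.

```java
import java.io.ByteArrayOutputStream;

// Hand-rolled CBOR encoding of the single-pair map {"a": 1}, showing the
// compact binary framing defined by RFC 7049. Illustrative sketch only.
public class MiniCbor {
    public static byte[] encodeMapA1() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0xA1);       // major type 5 (map), 1 key/value pair
        out.write(0x61);       // major type 3 (text string), length 1
        out.write('a');        // key "a"
        out.write(0x01);       // major type 0 (unsigned int), value 1
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encodeMapA1()) {
            System.out.printf("%02X ", b);
        }
        System.out.println(); // A1 61 61 01
    }
}
```

Note that none of these bytes resemble the `{` that opens the equivalent JSON text, which is why extension-based or magic-byte detection matters here.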

According to RFC 7049 (http://tools.ietf.org/html/rfc7049), CBOR does not yet 
have a registered magic number that other applications can use to detect and 
identify it (PFA: rfc_cbor.jpg).
It seems the only ways to inform other applications of the type, as of now, 
are the file extension (i.e. .cbor) or content-based detection (e.g. byte 
histogram distribution estimation).
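As a toy illustration of what byte-histogram-style content detection could look at: binary formats like CBOR tend to contain many bytes outside printable ASCII, while JSON text rarely does. This is a crude heuristic sketch, not Tika's actual detection logic; the class and threshold are assumptions for the example.

```java
// Crude content-based discriminator: fraction of printable ASCII bytes.
// Binary CBOR scores low, JSON text scores near 1.0. Toy sketch only.
public class ByteHistogram {
    public static double printableFraction(byte[] data) {
        if (data.length == 0) return 1.0;
        int printable = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if ((v >= 0x20 && v <= 0x7E) || v == '\n' || v == '\r' || v == '\t') {
                printable++;
            }
        }
        return (double) printable / data.length;
    }

    public static void main(String[] args) {
        byte[] json = "{\"url\":\"http://example.com\"}".getBytes();
        byte[] cborish = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7,
                          (byte) 0xA1, 0x61, 0x61, 0x01};
        System.out.println(printableFraction(json));          // 1.0
        System.out.println(printableFraction(cborish) < 0.5); // true
    }
}
```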

Another thing worth attention: Tika has already attempted to add CBOR MIME 
detection in tika-mimetypes.xml (PFA: cbor_tika.mimetypes.xml.jpg), but this 
detection does not work with the CBOR files dumped by CommonCrawlDataDumper.
According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a 
self-describe tag, 55799, that can be used for CBOR type identification (its 
encoding is the three bytes 0xd9 0xd9 0xf7), but it is up to the encoding 
application to emit this tag, and it is possible that the fasterxml library 
used by the Nutch dumping tool omits it; an example CBOR file dumped by the 
Nutch tool, i.e. CommonCrawlDataDumper, is attached (PFA: 142440269.html).
The following is cited from the RFC: "...a decoder might be able to parse 
both CBOR and JSON.
   Such a decoder would need to mechanically distinguish the two
   formats.  An easy way for an encoder to help the decoder would be to
   tag the entire CBOR item with tag 55799, the serialization of which
   will never be found at the beginning of a JSON text..."
It also looks like a dumped file can contain two parts/sections, i.e. a 
plain-text part and the JSON serialized as CBOR; this is also worth attention 
when parsing and identifying the type.
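Checking for the self-describe tag is mechanically simple when it is present, since tag 55799 serializes to the fixed prefix 0xD9 0xD9 0xF7. The sketch below shows such a check; the class and method names are illustrative, not part of Tika's API.

```java
import java.util.Arrays;

// Sketch: detect the optional CBOR self-describe tag 55799
// (RFC 7049 section 2.4.5), which serializes to 0xD9 0xD9 0xF7.
public class CborTagCheck {
    private static final byte[] SELF_DESCRIBE = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    public static boolean hasSelfDescribeTag(byte[] prefix) {
        if (prefix == null || prefix.length < SELF_DESCRIBE.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(prefix, 3), SELF_DESCRIBE);
    }

    public static void main(String[] args) {
        byte[] tagged = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7, (byte) 0xA1, 0x61, 0x61, 0x01};
        byte[] json = "{\"a\":1}".getBytes();
        System.out.println(hasSelfDescribeTag(tagged)); // true
        System.out.println(hasSelfDescribeTag(json));   // false
    }
}
```

A check like this only helps for encoders that actually emit the tag; files produced without it (apparently including the attached dump) still need extension or other content-based detection.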

On the other hand, it is worth noting that entries for .cbor extension 
detection need to be appended to tika-mimetypes.xml too, 
e.g.
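The XML example that followed here did not survive the archive. As a sketch, a glob entry of the kind being described would typically look like the following; the media type name application/cbor is an assumption, and the actual entry in the attachment may differ:

```xml
<mime-type type="application/cbor">
  <glob pattern="*.cbor"/>
</mime-type>
```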



