Re: 1.7 release?

2014-12-22 Thread Tyler Palsulich
Hi All,

Nick added the temporary fix for TIKA-1445 and made the POI updates for
TIKA-1469 (thanks!). And, I'll volunteer to be the Release Manager for 1.7!
:)

I'll start the process this weekend or a couple days into the new year.

Cheers,
Tyler
On Dec 18, 2014 9:45 PM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 +1

 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Tyler Palsulich tpalsul...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Thursday, December 18, 2014 at 9:15 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: 1.7 release?

 I'm OK with trying the fix in 1.8 (or 1.7 if people feel strongly). As
 Nick just recommended, I'll try adding metadata extraction to Tesseract
 soon, then adding the extensible solution in 1.8.
 
 Tyler
 
 On Thu, Dec 18, 2014 at 11:58 PM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 
  I haven’t tried my hand at it - been super busy. Tyler, if you have a
  chance, go for it; I think that’s the remaining blocker.
 
  -Original Message-
  From: Tyler Palsulich tpalsul...@gmail.com
  Reply-To: dev@tika.apache.org dev@tika.apache.org
  Date: Thursday, December 18, 2014 at 12:54 PM
  To: dev@tika.apache.org dev@tika.apache.org
  Subject: Re: 1.7 release?
 
  Hi All,
  
  It's been a few months, so I just want to follow up on this thread.
  We've resolved/closed 51 issues for v1.7 [0]. There are two on JIRA
  marked as 1.7 (TIKA-1465 and TIKA-894). Do we still want to aim for 1.7
  with TIKA-1445? Has anyone tried their hand at the suggested
  (significant) fix?
  
  Are there any other issues someone would like to fit in?
  
  Cheers,
  Tyler
  
  [0] - https://issues.apache.org/jira/browse/TIKA/fixforversion/12327096/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel
  
  On Tue, Oct 28, 2014 at 1:46 AM, Mattmann, Chris A (3980) 
  chris.a.mattm...@jpl.nasa.gov wrote:
  
   Thanks Tim, saw your patch and am looking now.
  
   -Original Message-
   From: Allison, Timothy B. talli...@mitre.org
   Reply-To: dev@tika.apache.org dev@tika.apache.org
   Date: Monday, October 27, 2014 at 12:30 PM
   To: dev@tika.apache.org dev@tika.apache.org
   Subject: RE: 1.7 release?
  
   Sounds good.  As long as the default behavior remains the same, I'm
   happy.  I'm going to play with a combination of your patch and Tyler's
   and see what the ramifications are for embedded docs.
   
   To confirm, the OCR integration is fantastic.  Thank you and Tyler!
   
   
   Best,
   
  Tim
   
   -Original Message-
   From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
   Sent: Friday, October 24, 2014 5:36 PM
   To: dev@tika.apache.org
   Subject: Re: 1.7 release?
   
   Hey Tim,
   
   What do you think about my existing patch for 1445? For example to
   just call all the parsers? I thought I was seeing behavior that was
   slow because of that, but it turned out to be Tesseract and my machine
   at the time?
   
   I think my patch for 1445 may be enough, and we should get the metadata
   I think? Thoughts?
   
   I honestly think we need to 

Re: 1.7 release?

2014-12-22 Thread Mattmann, Chris A (3980)
WOOO HOO! Go Tyler go! :0) Merry Christmas bud.


Re: 1.7 release?

2014-12-22 Thread Thomas Ledoux
+1 for going.
Many thanks to Tyler, and to Nick for taking on the POI upgrade.

So many Christmas gifts in advance or just after :-)

Merry Christmas to all


[jira] [Commented] (TIKA-1483) Create a general raw string parser

2014-12-22 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256304#comment-14256304
 ] 

Luis Filipe Nassif commented on TIKA-1483:
--

Do you think it would be useful to add a first implementation, specific to and 
optimized for extracting Latin1 scripts (Western European languages) encoded as 
ISO-8859-1, UTF-8 and UTF-16? If so, I will try to submit a patch.

 Create a general raw string parser
 --

 Key: TIKA-1483
 URL: https://issues.apache.org/jira/browse/TIKA-1483
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif

 I think it would be very useful to add a general parser able to extract raw 
 strings from files (like the strings command), which could be used as the 
 fallback parser for all mimetypes that have no specific parser 
 implementation, such as application/octet-stream. It could also be used as a 
 fallback for corrupt files that throw a TikaException.
 It must be configured with the script/language to be extracted from the files 
 (currently I have implemented one specific to Latin1).
 It can use heuristics to extract strings encoded with different charsets 
 within the same file, mainly the common ISO-8859-1, UTF-8 and UTF-16.
 What does the community think about that?
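
As a rough illustration of the idea described above (not the patch proposed for TIKA-1483), here is a minimal Java sketch of a strings-like scan restricted to ISO-8859-1. The class name RawStringsSketch, the minimum run length, and the choice of which byte values count as printable are all assumptions made for this example; UTF-8 and UTF-16 handling is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RawStringsSketch {

    /** Assumed definition of "printable": tab, ASCII 0x20-0x7E, and Latin-1 0xA0-0xFF. */
    static boolean isPrintableLatin1(int b) {
        return b == '\t' || (b >= 0x20 && b <= 0x7E) || (b >= 0xA0 && b <= 0xFF);
    }

    /** Returns every run of at least minLength printable bytes, decoded as ISO-8859-1. */
    static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        int start = -1; // index where the current run began, or -1 if no run is open
        // Iterate one step past the end so a run that reaches the last byte is flushed.
        for (int i = 0; i <= data.length; i++) {
            boolean printable = i < data.length && isPrintableLatin1(data[i] & 0xFF);
            if (printable) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minLength) {
                    found.add(new String(data, start, i - start, StandardCharsets.ISO_8859_1));
                }
                start = -1;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Binary junk mixed with two runs that are long enough and one that is too short.
        byte[] sample = {0x00, 0x01, 'T', 'i', 'k', 'a', 0x00, 'a', 'b', 0x02, 'c', 'a', 'f', (byte) 0xE9};
        for (String s : extract(sample, 4)) {
            System.out.println(s);
        }
    }
}
```

A real fallback parser would also need to stream large files rather than hold them in memory, and to emit its output through Tika's content handler; both are outside the scope of this sketch.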



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1483) Create a general raw string parser

2014-12-22 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256377#comment-14256377
 ] 

Luis Filipe Nassif commented on TIKA-1483:
--

[~talli...@apache.org],
Do you mean adding language models to do automatic language/charset detection? My 
original purpose was to extract strings from binary and non-text files, so I 
think it would be difficult to detect the language and charset used in those 
files. My idea was to let the user configure the language(s) and charsets of 
interest, and the parser would make a best effort to decode them. I think 
TextParser already does automatic charset detection (I do not know about 
language detection).



[jira] [Created] (TIKA-1502) Mime magic for database file formats

2014-12-22 Thread Nick Burch (JIRA)
Nick Burch created TIKA-1502:


 Summary: Mime magic for database file formats
 Key: TIKA-1502
 URL: https://issues.apache.org/jira/browse/TIKA-1502
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.6
Reporter: Nick Burch


I noticed today that Tika can't detect a lot of common database formats, such 
as SQLite, Berkeley DB or MyISAM.

The unix file utility got most of those, which makes me think that there's a 
sensible-ish header on most that we can write some mime magic for.

It'd therefore be good to add mime entries, with magic where possible, for many 
of these common database file formats.
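
To make "magic" concrete: a SQLite 3 database file begins with the 16-byte string "SQLite format 3" followed by a NUL byte, so a header check can be sketched as below. This is a hypothetical standalone illustration in Java, not the entry that goes into tika-mimetypes.xml; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SqliteMagicSketch {

    // The 16-byte header string at offset 0 of a SQLite 3 database file.
    static final byte[] SQLITE3_MAGIC =
            "SQLite format 3\0".getBytes(StandardCharsets.US_ASCII);

    /** True if the buffer is long enough and begins with the SQLite 3 header. */
    static boolean looksLikeSqlite3(byte[] head) {
        if (head.length < SQLITE3_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, SQLITE3_MAGIC.length), SQLITE3_MAGIC);
    }

    public static void main(String[] args) {
        // Simulate the first bytes of a database file: the header, then zero padding.
        byte[] fakeHeader = Arrays.copyOf(SQLITE3_MAGIC, 100);
        System.out.println(looksLikeSqlite3(fakeHeader));
        System.out.println(looksLikeSqlite3(new byte[100]));
    }
}
```

Formats without a fixed header, which the thread notes is the case for some of the MySQL files, cannot be recognised this way and would have to rely on other hints such as the file extension.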





[jira] [Commented] (TIKA-1502) Mime magic for database file formats

2014-12-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256527#comment-14256527
 ] 

Hudson commented on TIKA-1502:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #366 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/366/])
Some test database files for TIKA-1502 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647473)
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_5.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMYSQL.MYD
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMYSQL.MYI
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMYSQL.frm
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testSQLITE3.db




[jira] [Commented] (TIKA-1502) Mime magic for database file formats

2014-12-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256561#comment-14256561
 ] 

Hudson commented on TIKA-1502:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #383 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/383/])
TIKA-1502 MySQL and SQLite3 mime types, with magic where possible (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647478)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Some test database files for TIKA-1502 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647473)
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_5.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMYSQL.MYD
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMYSQL.MYI
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMYSQL.frm
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testSQLITE3.db




[jira] [Commented] (TIKA-1502) Mime magic for database file formats

2014-12-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256571#comment-14256571
 ] 

Hudson commented on TIKA-1502:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #367 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/367/])
TIKA-1502 MySQL and SQLite3 mime types, with magic where possible (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647478)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml




[jira] [Commented] (TIKA-1502) Mime magic for database file formats

2014-12-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256640#comment-14256640
 ] 

Hudson commented on TIKA-1502:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #368 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/368/])
More test database files for TIKA-1502 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647484)
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_5.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_5.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_5.db




[jira] [Commented] (TIKA-1502) Mime magic for database file formats

2014-12-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256655#comment-14256655
 ] 

Nick Burch commented on TIKA-1502:
--

As of r1647486, we now have mime types for SQLite3, MySQL (most) and Berkeley 
DB. We have magic for SQLite3, most of the MySQL formats (some are headerless), 
and expanded BDB ones.

One remaining issue is getting MimeTypesReaderTest.testReadParameterHeirarchy() 
to pass - for some reason the 3 level hierarchy of the BDB mime types is 
getting flattened to just two levels.



[jira] [Commented] (TIKA-1502) Mime magic for database file formats

2014-12-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256663#comment-14256663
 ] 

Hudson commented on TIKA-1502:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #384 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/384/])
Split the Berkeley DB mimetypes into three levels, and add a detection test 
(passes) and a heirarchy test (disabled as fails) TIKA-1502 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647486)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* /tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeTypesReaderTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
Start on magic for subtypes of Berkeley DB TIKA-1502 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647485)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
More test database files for TIKA-1502 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647484)
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_5.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_5.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_5.db

