Re: 1.7 release?
Hi All,

Nick added the temporary fix for TIKA-1445 and made the POI updates for TIKA-1469 (thanks!). And, I'll volunteer to be the Release Manager for 1.7! :) I'll start the process this weekend or a couple days into the new year.

Cheers,
Tyler

On Dec 18, 2014 9:45 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

+1

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: Tyler Palsulich tpalsul...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Thursday, December 18, 2014 at 9:15 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: 1.7 release?

I'm OK with trying the fix in 1.8 (or 1.7 if people feel strongly). As Nick just recommended, I'll try adding metadata extraction to Tesseract soon, then adding the extensible solution in 1.8.

Tyler

On Thu, Dec 18, 2014 at 11:58 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

I haven’t tried my hand at it - been super busy. Tyler, if you have a chance, go for it; I think that’s the remaining blocker.

-Original Message-
From: Tyler Palsulich tpalsul...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Thursday, December 18, 2014 at 12:54 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: 1.7 release?

Hi All,

It's been a few months, so I just want to follow up on this thread. We've resolved/closed 51 issues for v1.7 [0]. There are two on JIRA marked as 1.7 (TIKA-1465 and TIKA-894). Do we still want to aim for 1.7 with TIKA-1445? Has anyone tried their hand at the suggested (significant) fix? Are there any other issues someone would like to fit in?

Cheers,
Tyler

[0] - https://issues.apache.org/jira/browse/TIKA/fixforversion/12327096/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel

On Tue, Oct 28, 2014 at 1:46 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Thanks Tim, saw your patch and am looking now.

-Original Message-
From: Allison, Timothy B. talli...@mitre.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Monday, October 27, 2014 at 12:30 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: RE: 1.7 release?

Sounds good. As long as the default behavior remains the same, I'm happy. I'm going to play with a combination of your patch and Tyler's and see what the ramifications are for embedded docs. To confirm, the OCR integration is fantastic. Thank you and Tyler!

Best,
Tim

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Friday, October 24, 2014 5:36 PM
To: dev@tika.apache.org
Subject: Re: 1.7 release?

Hey Tim,

What do you think about my existing patch for 1445? For example to just call all the parsers? I thought I was seeing behavior that was slow because of that, but it turned out to be Tesseract and my machine at the time?

I think my patch for 1445 may be enough, and we should get the metadata I think? Thoughts? I honestly think we need to
Re: 1.7 release?
WOOO HOO! Go Tyler go! :0) Merry Christmas bud.

++
Chris Mattmann, Ph.D.
++

-Original Message-
From: Tyler Palsulich tpalsul...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Monday, December 22, 2014 at 10:57 AM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: 1.7 release?

Hi All,

Nick added the temporary fix for TIKA-1445 and made the POI updates for TIKA-1469 (thanks!). And, I'll volunteer to be the Release Manager for 1.7! :) I'll start the process this weekend or a couple days into the new year.

Cheers,
Tyler
Re: 1.7 release?
+1 for going. Many thanks to Tyler, and to Nick for taking on the POI upgrade. So many Christmas gifts, in advance or just after :-)

Merry Christmas to all

2014-12-22 19:59 GMT+01:00 Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov:

WOOO HOO! Go Tyler go! :0) Merry Christmas bud.
[jira] [Commented] (TIKA-1483) Create a general raw string parser
[ https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256304#comment-14256304 ]

Luis Filipe Nassif commented on TIKA-1483:

Do you think it would be useful to add a first implementation, specific to and optimized for extracting Latin1 scripts (Western European languages) encoded as ISO-8859-1, UTF-8 and UTF-16? If yes, I will try to submit a patch.

Create a general raw string parser

Key: TIKA-1483
URL: https://issues.apache.org/jira/browse/TIKA-1483
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif

I think it would be very useful to add a general parser able to extract raw strings from files (like the strings command), which could be used as the fallback parser for all mimetypes that have no specific parser implementation, such as application/octet-stream. It could also be used as a fallback for corrupt files that throw a TikaException. It must be configured with the script/language to be extracted from the files (currently I have implemented one specific to Latin1). It can use heuristics to extract strings encoded with different charsets within the same file, mainly the common ISO-8859-1, UTF-8 and UTF-16. What does the community think about that?

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
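For readers unfamiliar with the strings-style approach proposed in this issue, here is a minimal sketch in Python. It is not Tika code and not the reporter's implementation; the function name `extract_strings` and the `min_len` parameter are made up for illustration. It assumes that a "string" is a run of at least `min_len` printable Latin-1 characters, stored either one byte per character (ISO-8859-1) or as UTF-16LE code units. UTF-8 multi-byte sequences are not handled here.

```python
import re

# One printable Latin-1 character: ASCII 0x20-0x7E or the upper range 0xA0-0xFF.
_LATIN1_UNIT = rb"[\x20-\x7e\xa0-\xff]"
# The same character stored as a little-endian UTF-16 code unit (high byte zero).
_UTF16LE_UNIT = rb"(?:[\x20-\x7e\xa0-\xff]\x00)"


def extract_strings(data, min_len=4):
    """Return printable runs found in raw bytes, ordered by file offset.

    Hypothetical helper: scans once per configured encoding and keeps
    every run of at least `min_len` characters.
    """
    found = []
    for unit, codec in ((_LATIN1_UNIT, "latin-1"), (_UTF16LE_UNIT, "utf-16-le")):
        # Build a pattern such as [...]{4,} for the requested minimum length.
        pattern = re.compile(unit + b"{%d,}" % min_len)
        for match in pattern.finditer(data):
            found.append((match.start(), match.group().decode(codec)))
    found.sort()
    return [text for _, text in found]
```

A real fallback parser would presumably stream the input rather than hold it in memory, and would make the set of scripts and charsets configurable, as the issue description suggests.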
[jira] [Commented] (TIKA-1483) Create a general raw string parser
[ https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256377#comment-14256377 ]

Luis Filipe Nassif commented on TIKA-1483:

[~talli...@apache.org],

Do you mean adding language models to do automatic language/charset detection? My original purpose was to extract strings from binary and non-text files, so I think it would be difficult to detect the language and charset used in those files. My idea was to let the user configure the language(s) and charsets of interest, and the parser would make a best effort to decode them. I think TextParser already does automatic charset detection (I do not know about language).
[jira] [Created] (TIKA-1502) Mime magic for database file formats
Nick Burch created TIKA-1502:

Summary: Mime magic for database file formats
Key: TIKA-1502
URL: https://issues.apache.org/jira/browse/TIKA-1502
Project: Tika
Issue Type: Improvement
Components: mime
Affects Versions: 1.6
Reporter: Nick Burch

I noticed today that Tika can't detect a lot of common database formats, such as sqlite, Berkeley DB or MyISAM.

The unix file utility got most of those, which makes me think that there's a sensible-ish header on most that we can write some mime magic for.

It'd therefore be good to add mime entries, with magic where possible, for many of these common database file formats.
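As an illustration of what "mime magic" means here, the following Python sketch checks a file header against a table of (offset, byte sequence, mime type) rules. It is not Tika's detector and does not reflect the entries actually added to tika-mimetypes.xml. The only rule shown is the documented 16-byte SQLite 3 header string; the names `MAGIC_RULES` and `detect_mime` and the mime type string are assumptions made for this example.

```python
from typing import Optional

# Each rule: (offset into the file, expected bytes at that offset, mime type).
# Only the SQLite 3 header is listed; the mime type name is an assumption.
MAGIC_RULES = [
    (0, b"SQLite format 3\x00", "application/x-sqlite3"),
]


def detect_mime(header: bytes) -> Optional[str]:
    """Return the mime type of the first matching rule, or None.

    `header` is the leading bytes of the file; a short header simply
    fails to match, because the slice comes back shorter than the magic.
    """
    for offset, magic, mime in MAGIC_RULES:
        if header[offset:offset + len(magic)] == magic:
            return mime
    return None
```

Formats without a fixed header (the issue notes that some MySQL files are headerless) cannot be matched this way and would need to fall back on something else, such as the file name.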
[jira] [Commented] (TIKA-1502) Mime magic for database file formats
[ https://issues.apache.org/jira/browse/TIKA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256527#comment-14256527 ]

Hudson commented on TIKA-1502:

SUCCESS: Integrated in tika-trunk-jdk1.6 #366 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/366/])

Some test database files for TIKA-1502 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647473)
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_5.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMYSQL.MYD
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMYSQL.MYI
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMYSQL.frm
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testSQLITE3.db
[jira] [Commented] (TIKA-1502) Mime magic for database file formats
[ https://issues.apache.org/jira/browse/TIKA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256561#comment-14256561 ]

Hudson commented on TIKA-1502:

SUCCESS: Integrated in tika-trunk-jdk1.7 #383 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/383/])

TIKA-1502 MySQL and SQLite3 mime types, with magic where possible (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647478)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

Some test database files for TIKA-1502 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647473)
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_5.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMYSQL.MYD
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMYSQL.MYI
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testMYSQL.frm
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testSQLITE3.db
[jira] [Commented] (TIKA-1502) Mime magic for database file formats
[ https://issues.apache.org/jira/browse/TIKA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256571#comment-14256571 ]

Hudson commented on TIKA-1502:

SUCCESS: Integrated in tika-trunk-jdk1.6 #367 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/367/])

TIKA-1502 MySQL and SQLite3 mime types, with magic where possible (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647478)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
[jira] [Commented] (TIKA-1502) Mime magic for database file formats
[ https://issues.apache.org/jira/browse/TIKA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256640#comment-14256640 ]

Hudson commented on TIKA-1502:

SUCCESS: Integrated in tika-trunk-jdk1.6 #368 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/368/])

More test database files for TIKA-1502 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647484)
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_5.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_5.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_5.db
[jira] [Commented] (TIKA-1502) Mime magic for database file formats
[ https://issues.apache.org/jira/browse/TIKA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256655#comment-14256655 ]

Nick Burch commented on TIKA-1502:

As of r1647486, we now have mime types for SQLite3, MySQL (most) and Berkeley DB. We have magic for SQLite3, most of the MySQL formats (some are headerless), and expanded BDB ones.

One remaining issue is getting MimeTypesReaderTest.testReadParameterHeirarchy() to pass: for some reason the 3-level hierarchy of the BDB mime types is getting flattened to just two levels.
[jira] [Commented] (TIKA-1502) Mime magic for database file formats
[ https://issues.apache.org/jira/browse/TIKA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256663#comment-14256663 ]

Hudson commented on TIKA-1502:

SUCCESS: Integrated in tika-trunk-jdk1.7 #384 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/384/])

Split the Berkeley DB mimetypes into three levels, and add a detection test (passes) and a heirarchy test (disabled as fails) TIKA-1502 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647486)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* /tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeTypesReaderTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java

Start on magic for subtypes of Berkeley DB TIKA-1502 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647485)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

More test database files for TIKA-1502 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647484)
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_5.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_btree_5.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_2.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_3.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_4.db
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testBDB_hash_5.db