[jira] [Created] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy
Giovanni Usai created TIKA-1843:
---

Summary: Tika parser for SEG-Y files and new MIME type application/segy
Key: TIKA-1843
URL: https://issues.apache.org/jira/browse/TIKA-1843
Project: Tika
Issue Type: New Feature
Components: mime, parser
Reporter: Giovanni Usai
Priority: Minor

This ticket refers to the parsing of SEG-Y files (extensions .seg, .segy and .sgy).
The SEG-Y format is used to store seismic data; more information is available at http://pubs.usgs.gov/of/2001/of01-326/HTML/FILEFORM.HTM.

I have:
- added a new MIME type application/segy matching the file name extensions .segy, .seg and .sgy.
- created a new SEGYParser, matching that MIME type.

In order to parse the SEG-Y files, I am using a modified version of the sigrun code (available under the Apache license at https://github.com/mikhail-aksenov/sigrun). Notably, I have made a fix and changed some method signatures to be able to read from a ReadableByteChannel instead of a FileChannel.

For the moment I have put it directly into Tika's new segy package. Is this the right thing to do, or should I reference it as an external library, which would mean modifying the pom.xml?

Thanks and best regards,
Giovanni

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
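The move from FileChannel to ReadableByteChannel matters because a Tika parser is handed an InputStream rather than a file, and any InputStream can be wrapped as a ReadableByteChannel with the JDK's `Channels.newChannel`. The following is a minimal sketch of that idea using only the JDK. The class name `SegyChannelSketch` and the helpers `readFully` and `readTextHeader` are illustrative assumptions, and the block size used is the 3200-byte textual header that opens a SEG-Y file; this is not sigrun's or Tika's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class SegyChannelSketch {
    /** Size in bytes of the textual header at the start of a SEG-Y file. */
    public static final int TEXT_HEADER_SIZE = 3200;

    /** Reads exactly size bytes from the channel, or throws EOFException. */
    public static ByteBuffer readFully(ReadableByteChannel channel, int size) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(size);
        while (buffer.hasRemaining()) {
            if (channel.read(buffer) < 0) {
                throw new EOFException("Stream ended after " + buffer.position() + " bytes");
            }
        }
        buffer.flip(); // switch the buffer from writing to reading
        return buffer;
    }

    /** Wraps a plain stream as a channel and reads the textual header block. */
    public static ByteBuffer readTextHeader(InputStream stream) throws IOException {
        ReadableByteChannel channel = Channels.newChannel(stream);
        return readFully(channel, TEXT_HEADER_SIZE);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real SEG-Y stream (hypothetical data, all zero bytes).
        InputStream stream = new ByteArrayInputStream(new byte[3600]);
        ByteBuffer header = readTextHeader(stream);
        System.out.println(header.remaining()); // 3200
    }
}
```

A FileChannel-based reader could not accept this in-memory stream at all, which is the limitation the signature change works around.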
[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy
[ https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123568#comment-15123568 ]

Giovanni Usai commented on TIKA-1843:
-

Hi Nick,
thanks for the fast reply!

The latest sigrun commit (from a few days ago) is mine; I had to rename a class to make sigrun compile. Apart from that, there have been no other commits in a year.

Anyway, no problem: I will submit my modifications to sigrun and come back to you once my pull request is merged.

Please note that, as far as I know, the sigrun artifact is not yet published in any Maven repository.

Thanks again!
[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy
[ https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123552#comment-15123552 ]

Nick Burch commented on TIKA-1843:
--

Looks like Sigrun is an active project, so the best bet would be to submit GitHub pull requests to them to add the `ReadableByteChannel` support. Then, once they've added that + released, we'll add a Tika dependency on that + add the parser code.

ASF best practice is to avoid forking upstream projects + bundling modified versions whenever possible, so putting customised versions of Sigrun classes in the Tika segy package should be avoided if possible. Much better to get them to accept the fixes upstream!
[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy
[ https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123612#comment-15123612 ]

Nick Burch commented on TIKA-1843:
--

Getting a Maven-built project into the Sonatype OSS repo for Maven use isn't too bad. Ideally we'd work with the Sigrun team to get their POM into shape so it can be released as per http://central.sonatype.org/pages/ossrh-guide.html ; otherwise we can take over and upload it for them as a third party.

Ask on the dev list for help with either of those if needed. We have several people well experienced in both routes!
Re: [VOTE] Apache Tika 1.12 Release Candidate #1
Thank you Tim for catching this. If you remember, please file a
ticket for the below and I’ll fix it in 1.13 (or someone else will :) )

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: "Allison, Timothy B."
Reply-To: "dev@tika.apache.org"
Date: Friday, January 29, 2016 at 10:07 AM
To: "dev@tika.apache.org"
Subject: RE: [VOTE] Apache Tika 1.12 Release Candidate #1

>+1
>
>With the one caveat that the PooledTimeSeriesParser is now taking
>precedence over the MP4Parser. So, for those mp4 video files for which
>we used to extract some metadata (length, and a handful of other items),
>we're now getting nothing if the external pooled-time-series application
>is not installed. This could be a big problem for some people...
>
>Thank you, Chris!
>
>With any luck, I'll be fully dug out by next week and onto our new git
>repo. :) Onward to Tika 1.13 (after TIKA-1830) soon.
RE: [VOTE] Apache Tika 1.12 Release Candidate #1
+1

With the one caveat that the PooledTimeSeriesParser is now taking precedence over the MP4Parser. So, for those mp4 video files for which we used to extract some metadata (length, and a handful of other items), we're now getting nothing if the external pooled-time-series application is not installed. This could be a big problem for some people...

Thank you, Chris!

With any luck, I'll be fully dug out by next week and onto our new git repo. :) Onward to Tika 1.13 (after TIKA-1830) soon.

-Original Message-
From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Thursday, January 28, 2016 2:44 PM
To: dev@tika.apache.org
Subject: RE: [VOTE] Apache Tika 1.12 Release Candidate #1

Built & installed on Mac OS X 10.8.

Switched Bixo to use 1.12, all tests pass.

+1.

-- Ken

> From: Mattmann, Chris A (3980)
> Sent: January 25, 2016 11:58:04am PST
> To: u...@tika.apache.org; dev@tika.apache.org
> Subject: [VOTE] Apache Tika 1.12 Release Candidate #1
>
> Hi Folks,
>
> A first candidate for the Tika 1.12 release is available at:
>
> https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
> https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db2427f9e84bc4ff31e569ae661c
>
> The SHA1 checksum of the archive is:
> 30e64645af643959841ac3bb3c41f7e64eba7e5f
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1015/
>
> Please vote on releasing this package as Apache Tika 1.12.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.12
> [ ] -1 Do not release this package because...
>
> Cheers,
> Chris
>
> P.S. Of course here is my +1.

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
[jira] [Created] (TIKA-1844) PooledTimeSeriesParser takes precedence over MP4Parser
Tim Allison created TIKA-1844:
-

Summary: PooledTimeSeriesParser takes precedence over MP4Parser
Key: TIKA-1844
URL: https://issues.apache.org/jira/browse/TIKA-1844
Project: Tika
Issue Type: Bug
Reporter: Tim Allison
Priority: Minor

The PooledTimeSeriesParser currently takes precedence over the MP4Parser even if the pooled-time-series application is not installed. This means that clients will lose metadata formerly extracted by the MP4Parser unless they remove the PooledTimeSeriesParser.

This is similar to what happened with the integration of the Tesseract Parser (TIKA-1445). We should probably follow a similar pattern to that...run both parsers and combine metadata.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
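The "run both parsers and combine metadata" pattern can be sketched without any Tika classes. In this hypothetical example, `MetadataExtractor` stands in for a parser, metadata is a plain `Map`, and an extractor that fails (for instance because an external application is missing) is skipped instead of aborting the whole run. None of these names are Tika's API, and the eventual fix may well look different.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinedMetadataSketch {
    /** Hypothetical stand-in for a parser that yields metadata for a file. */
    public interface MetadataExtractor {
        Map<String, String> extract(String fileName) throws Exception;
    }

    /**
     * Runs every extractor in order and merges the results. A failing
     * extractor is skipped; on duplicate keys the earlier extractor wins.
     */
    public static Map<String, String> extractAll(String fileName,
                                                 List<MetadataExtractor> extractors) {
        Map<String, String> combined = new LinkedHashMap<>();
        for (MetadataExtractor extractor : extractors) {
            try {
                for (Map.Entry<String, String> e : extractor.extract(fileName).entrySet()) {
                    combined.putIfAbsent(e.getKey(), e.getValue());
                }
            } catch (Exception ex) {
                // For example, the external application is not installed:
                // keep whatever the other extractors can provide.
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        // Hypothetical extractors: the first simulates a missing external tool.
        MetadataExtractor external = f -> {
            throw new IllegalStateException("external tool not installed");
        };
        MetadataExtractor mp4 = f -> Collections.singletonMap("duration", "12.5");
        System.out.println(extractAll("clip.mp4", Arrays.asList(external, mp4)));
    }
}
```

With this shape, the order of the list only decides which value wins on a key collision; it no longer decides whether the second extractor runs at all, which is the behaviour the ticket asks for.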
Re: [VOTE] Apache Tika 1.12 Release Candidate #1
All enabled tests passed on openjdk8u72.
SHA1 and gpg signature are correct.
Checked tika-app and tika-server on some documents from my collection.

[x] +1 Release this package as Apache Tika 1.12
[ ] -1 Do not release this package because…

On Fri, 29 Jan 2016 at 21:21, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:

> Thank you Tim for catching this. If you remember, please file a
> ticket for the below and I’ll fix it in 1.13 (or someone else will :) )

--
Best regards,
Konstantin Gribov
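Checking a published SHA1 checksum, as done for this release candidate, needs nothing beyond the JDK. The following is a minimal sketch; the class name `Sha1Check` and its methods are illustrative, and because the thread does not name the archive file, the example hashes a short in-memory string instead of the real download.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Check {
    /** Returns the lowercase hex SHA-1 digest of everything readable from the stream. */
    public static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            digest.update(chunk, 0, n); // feed the data in chunks, not all at once
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Compares the computed digest with a published one, ignoring case and padding. */
    public static boolean matches(InputStream in, String published)
            throws IOException, NoSuchAlgorithmException {
        return sha1Hex(in).equalsIgnoreCase(published.trim());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream("abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(sha1Hex(in)); // a9993e364706816aba3e25717850c26c9cd0d89d
    }
}
```

For a real release check, the stream would come from the downloaded archive (for example via `Files.newInputStream`) and the second argument to `matches` would be the checksum quoted in the vote mail. The gpg signature check is a separate step that this sketch does not cover.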