[jira] [Created] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy

2016-01-29 Thread Giovanni Usai (JIRA)
Giovanni Usai created TIKA-1843:
---

     Summary: Tika parser for SEG-Y files and new MIME type application/segy
         Key: TIKA-1843
         URL: https://issues.apache.org/jira/browse/TIKA-1843
     Project: Tika
  Issue Type: New Feature
  Components: mime, parser
    Reporter: Giovanni Usai
    Priority: Minor


This ticket covers the parsing of SEG-Y files (extensions .seg, .segy and 
.sgy). 
The SEG-Y format is used to store seismic data; more information is available 
at http://pubs.usgs.gov/of/2001/of01-326/HTML/FILEFORM.HTM.

I have:
- added a new MIME type application/segy matching the file name extensions 
.segy, .seg and .sgy.
- created a new SEGYParser, matching that MIME type.
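
As a rough illustration of the extension matching described in the list above, the following Java sketch maps the three file name extensions to application/segy. It is not Tika's detection mechanism or the code attached to this ticket; the class, constant and method names are hypothetical, and the case-insensitive comparison is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SegyMimeSketch {
    // Only the type string and the three extensions come from the ticket;
    // the names used here are hypothetical.
    public static final String SEGY_TYPE = "application/segy";
    private static final Set<String> SEGY_EXTENSIONS =
            new HashSet<>(Arrays.asList("seg", "segy", "sgy"));

    /** Returns application/segy if the name ends in a SEG-Y extension, else null. */
    public static String detectByName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null; // no extension at all
        }
        // Assumption: extensions are compared case-insensitively.
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SEGY_EXTENSIONS.contains(ext) ? SEGY_TYPE : null;
    }
}
```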

To parse the SEG-Y files, I am using a modified version of the sigrun code 
(available under the Apache license at 
https://github.com/mikhail-aksenov/sigrun). Notably, I have made one fix and 
changed some method signatures so that it can read from a ReadableByteChannel 
instead of a FileChannel.
For the moment I have put the modified code directly into Tika's new segy 
package. Is this the right thing to do, or should I reference sigrun as an 
external library and modify the pom.xml accordingly?
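
To illustrate why reading from a ReadableByteChannel matters here: a stream-based parser is handed an InputStream rather than a file, and the JDK's Channels.newChannel can wrap any InputStream as a ReadableByteChannel, whereas a FileChannel needs a real file. The sketch below is a minimal, assumed example using only JDK classes; the class and method names are hypothetical and it is not the modified sigrun code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class ChannelReadSketch {
    /** Reads exactly n bytes from the channel, or throws if the data ends early. */
    public static byte[] readFully(ReadableByteChannel channel, int n) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(n);
        while (buffer.hasRemaining()) {
            // read() returns -1 at end of stream.
            if (channel.read(buffer) < 0) {
                throw new IOException(
                        "Stream ended after " + buffer.position() + " of " + n + " bytes");
            }
        }
        return buffer.array();
    }

    /** Wraps a plain InputStream as a channel and reads a fixed-size header from it. */
    public static byte[] readHeader(InputStream stream, int headerLength) throws IOException {
        ReadableByteChannel channel = Channels.newChannel(stream);
        return readFully(channel, headerLength);
    }
}
```

For example, readHeader(stream, 3200) would pull the 3200-byte textual header that SEG-Y files begin with, without requiring the data to be on disk.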

Thanks and best regards,
Giovanni



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy

2016-01-29 Thread Giovanni Usai (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123568#comment-15123568
 ] 

Giovanni Usai commented on TIKA-1843:
-

Hi Nick,
thanks for the fast reply!
The latest sigrun commit (from a few days ago) is mine; I had to rename a 
class to make sigrun compile.
Apart from that, there have been no other commits in the past year.

Anyway, no problem: I will submit my modifications to sigrun and come back to 
you once my pull request has been merged.

Please note that, as far as I know, the sigrun artifact is not yet published 
in any Maven repository.

Thanks again!



[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy

2016-01-29 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123552#comment-15123552
 ] 

Nick Burch commented on TIKA-1843:
--

Looks like Sigrun is an active project, so the best bet would be to submit 
GitHub pull requests to them to add the `ReadableByteChannel` support. Then, 
once they've added that + released, we'll add a Tika dependency on that + add 
the parser code.

ASF best practice is to avoid forking upstream projects + bundling modified 
versions whenever possible, so putting customised versions of Sigrun classes in 
the Tika segy package should be avoided if possible. Much better to get them to 
accept the fixes upstream!



[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy

2016-01-29 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123612#comment-15123612
 ] 

Nick Burch commented on TIKA-1843:
--

Getting a Maven-built project into the Sonatype OSS repo for Maven use isn't 
too bad. Ideally we'd work with the Sigrun team to get their POM into shape so 
it can be released as per http://central.sonatype.org/pages/ossrh-guide.html ; 
otherwise we can take over and upload it for them as a third party. Ask on the 
dev list for help with either of those if needed; we've several people well 
experienced in both routes!



Re: [VOTE] Apache Tika 1.12 Release Candidate #1

2016-01-29 Thread Mattmann, Chris A (3980)
Thank you Tim for catching this. If you remember, please file a
ticket for the below and I’ll fix it in 1.13 (or someone else will :) )

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-----Original Message-----
From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Friday, January 29, 2016 at 10:07 AM
To: "dev@tika.apache.org" 
Subject: RE: [VOTE] Apache Tika 1.12 Release Candidate #1

>+1
>
>With the one caveat that the PooledTimeSeriesParser is now taking
>precedence over the MP4Parser.  So, for those mp4 video files for which
>we used to extract some metadata (length, and a handful of other items),
>we're now getting nothing if the external pooled-time-series application
>is not installed.  This could be a big problem for some people...
>
>Thank you, Chris!
>
>With any luck, I'll be fully dug out by next week and onto our new git
>repo. :) Onward to Tika 1.13 (after TIKA-1830) soon.



RE: [VOTE] Apache Tika 1.12 Release Candidate #1

2016-01-29 Thread Allison, Timothy B.
+1

With the one caveat that the PooledTimeSeriesParser is now taking precedence 
over the MP4Parser.  So, for those mp4 video files for which we used to extract 
some metadata (length, and a handful of other items), we're now getting nothing 
if the external pooled-time-series application is not installed.  This could be 
a big problem for some people...

Thank you, Chris!

With any luck, I'll be fully dug out by next week and onto our new git repo. :) 
Onward to Tika 1.13 (after TIKA-1830) soon.


-----Original Message-----
From: Ken Krugler [mailto:kkrugler_li...@transpac.com] 
Sent: Thursday, January 28, 2016 2:44 PM
To: dev@tika.apache.org
Subject: RE: [VOTE] Apache Tika 1.12 Release Candidate #1

Built & installed on Mac OS X 10.8.

Switched Bixo to use 1.12, all tests pass.

+1.

-- Ken

> From: Mattmann, Chris A (3980)
> Sent: January 25, 2016 11:58:04am PST
> To: u...@tika.apache.org; dev@tika.apache.org
> Subject: [VOTE] Apache Tika 1.12 Release Candidate #1
> 
> Hi Folks,
> 
> A first candidate for the Tika 1.12 release is available at:
> 
>  https://dist.apache.org/repos/dist/dev/tika/
> 
> The release candidate is a zip archive of the sources in:
> https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db2427f9e84bc4ff31e569ae661c
> 
> 
> The SHA1 checksum of the archive is:
> 30e64645af643959841ac3bb3c41f7e64eba7e5f
> 
> In addition, a staged maven repository is available here:
> 
> https://repository.apache.org/content/repositories/orgapachetika-1015/
> 
> 
> Please vote on releasing this package as Apache Tika 1.12.
> The vote is open for the next 72 hours and passes if a majority of at 
> least three +1 Tika PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Tika 1.12
> [ ] -1 Do not release this package because...
> 
> Cheers,
> Chris
> 
> P.S. Of course here is my +1.

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







[jira] [Created] (TIKA-1844) PooledTimeSeriesParser takes precedence over MP4Parser

2016-01-29 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1844:
-

 Summary: PooledTimeSeriesParser takes precedence over MP4Parser
 Key: TIKA-1844
 URL: https://issues.apache.org/jira/browse/TIKA-1844
 Project: Tika
  Issue Type: Bug
Reporter: Tim Allison
Priority: Minor


The PooledTimeSeriesParser currently takes precedence over the MP4Parser even 
if the pooled-time-series application is not installed.  This means that 
clients will lose metadata formerly extracted by the MP4Parser unless they 
remove the PooledTimeSeriesParser.

This is similar to what happened with the integration of the Tesseract parser 
(TIKA-1445).  We should probably follow the same pattern: run both parsers and 
combine their metadata.
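
The "run both parsers and combine metadata" idea could look roughly like the Java sketch below. It deliberately avoids Tika's own classes: MetadataExtractor and extractAll are hypothetical names, metadata is modelled as a plain map, and the rule that earlier extractors win on duplicate keys is an assumption of the sketch, not a description of what was done for TIKA-1445.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombineMetadataSketch {
    /** Hypothetical stand-in for a parser: returns whatever metadata it can extract. */
    public interface MetadataExtractor {
        Map<String, String> extract(byte[] content) throws Exception;
    }

    /** Runs every extractor and merges the results instead of stopping at the first one. */
    public static Map<String, String> extractAll(List<MetadataExtractor> extractors,
                                                 byte[] content) {
        Map<String, String> combined = new LinkedHashMap<>();
        for (MetadataExtractor extractor : extractors) {
            try {
                // Assumption: earlier extractors win on key clashes; later ones fill gaps.
                extractor.extract(content).forEach(combined::putIfAbsent);
            } catch (Exception e) {
                // A failing extractor (for example, a missing external tool) is skipped
                // so that it cannot discard metadata found by the others.
            }
        }
        return combined;
    }
}
```

With this shape, an extractor that fails because its external application is not installed contributes nothing, while the metadata from the remaining extractor is still returned.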





Re: [VOTE] Apache Tika 1.12 Release Candidate #1

2016-01-29 Thread Konstantin Gribov
All enabled tests passed on openjdk8u72.
SHA1 and gpg signature are correct.
Checked tika-app and tika-server on some documents from my collection.
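
For anyone repeating the checksum step, a minimal Java sketch of a SHA-1 check using the JDK's MessageDigest is shown below. The class and method names are hypothetical, and this is only one way to do it; command-line tools do the same job.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1CheckSketch {
    /** Streams the input through SHA-1 and returns the digest as lowercase hex. */
    public static String sha1Hex(InputStream in)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            // Mask to 0..255 so negative bytes format as two hex digits.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /** Compares the computed digest with a published checksum, ignoring case. */
    public static boolean matches(InputStream in, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        return sha1Hex(in).equalsIgnoreCase(expectedHex.trim());
    }
}
```

Opening the downloaded source archive as a stream and passing it to matches together with 30e64645af643959841ac3bb3c41f7e64eba7e5f, the checksum quoted in the vote email, should return true for an intact download.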

[x] +1 Release this package as Apache Tika 1.12
[ ] -1 Do not release this package because…

--
Best regards,
Konstantin Gribov