looking to contribute

2015-12-16 Thread Joey Hong
Hi Tika Developers,

My name is Joey. I am a college freshman with programming experience, looking to 
get into the world of open source. I was hoping to contribute to the Tika 
project, and was wondering if there were any tasks that a beginner like me 
could tackle. I am willing to do anything, whether that means fixing a minor bug 
or adding tests or documentation.

Thanks,
Joey

[jira] [Created] (TIKA-1814) Add a standalone XMPScannerParser

2015-12-16 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1814:
-

 Summary: Add a standalone XMPScannerParser
 Key: TIKA-1814
 URL: https://issues.apache.org/jira/browse/TIKA-1814
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor


Several parsers make use of XMP data and normalize it via Dublin Core (dc) or 
other standards into our metadata object.  We're currently either relying on 
dependencies to make sense of multiple XMP packets within a file (PDFBox for 
PDFParser) or just grabbing the first packet (TiffParser via JempboxExtractor 
and XMPPacketScanner).  Which other parsers are processing XMP?

It might be useful to extract all XMP packets from a file and store their raw 
bytes as Base64-encoded Strings in the Metadata object.  Advanced users could 
then have access to the raw XMP streams.

For Tika 1.x, unless users configured it, nothing would call it.  For Tika 2.x, 
once we get the combo configurable parsers set up, a user could configure a 
combo/additive parser, e.g., a PDFParser that is a combination of our current 
PDFParser and then this new XMPScannerParser.
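
A minimal sketch of the extraction step described above, not the proposed 
parser itself. It assumes packets are delimited by the ASCII processing 
instructions "<?xpacket begin=" and "<?xpacket end=", so UTF-16 packets would 
be missed. The class and method names are hypothetical and are not Tika's 
existing XMPPacketScanner.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Hypothetical helper: finds every complete XMP packet in a byte array.
final class XmpPacketSketch {
    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);

    /** Returns each packet (begin marker through the closing "?>") as a Base64 String. */
    static List<String> extractBase64Packets(byte[] data) {
        List<String> packets = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = indexOf(data, BEGIN, from);
            if (start < 0) break;
            int endPi = indexOf(data, END, start + BEGIN.length);
            if (endPi < 0) break;                 // unterminated packet: stop scanning
            int close = indexOf(data, CLOSE, endPi + END.length);
            if (close < 0) break;
            int stop = close + CLOSE.length;
            packets.add(Base64.getEncoder().encodeToString(Arrays.copyOfRange(data, start, stop)));
            from = stop;                          // continue after this packet
        }
        return packets;
    }

    /** Naive byte-pattern search; returns -1 when the pattern is absent. */
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }
}
```

Each returned String could then be added to the Metadata object under a 
multi-valued key of the parser's choosing. The key name is left open here 
because the issue does not specify one.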



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: looking to contribute

2015-12-16 Thread Nick Burch

On Wed, 16 Dec 2015, Joey Hong wrote:
> My name is Joey. I am a college freshman with programming experience 
> looking to get into the world of open-source. I was hoping to contribute 
> to the Tika project, and was wondering if there were any tasks that a 
> beginner like me could tackle. I am willing to do anything, whether it 
> be fixing a minor bug, or adding test suites or documentation.


On the docs / examples side, we have a few examples on the website, but 
probably not enough! One thing might be to look through those, identify 
gaps with your fresh eyes, and work on those. We also have instructions 
for some more complicated integrations on the wiki, maybe try some of 
those and feed back on which ones aren't clear enough?


If you want to try more coding, Tim quite often runs Tika against some 
large filesets, and has a nifty tool to report on what breaks. He can 
hopefully point you at the most recent report! Maybe have a look through 
that, identify a few common failures from unidentified or common 
exceptions, and try to fix one or two of those?


Nick


[jira] [Updated] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl

2015-12-16 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1813:
--
Attachment: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB
25JIANLV77U645GUSJ2E67YSM4B2TNSP
2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA

Some examples.

The file lengths are suspiciously regular.  Given that these are Common Crawl 
docs, there's a chance that they were truncated.

225... looks like an SPSS output file (SPO)...maybe?

[Gary Kessler|http://www.garykessler.net/library/file_sigs.html] has a helpful 
list of non-MS file types that rely on OLE.

> Figure out file types for several unknown OLE files in Common Crawl
> ---
>
> Key: TIKA-1813
> URL: https://issues.apache.org/jira/browse/TIKA-1813
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB, 
> 25JIANLV77U645GUSJ2E67YSM4B2TNSP, 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA
>
>
> We're getting around 300 exceptions from "application/x-tika-msoffice" files 
> in our current slice of Common Crawl documents that look roughly like this:
> {noformat}
> java.lang.IllegalArgumentException: Position 86528 past the end of the file
> at org.apache.poi.poifs.nio.FileBackedDataSource.read
> {noformat}
> I suspect these are non-MS OLE file formats.  Any help identifying the file 
> types and patching our OLE mime detector would be great.





[jira] [Created] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl

2015-12-16 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1813:
-

 Summary: Figure out file types for several unknown OLE files in 
Common Crawl
 Key: TIKA-1813
 URL: https://issues.apache.org/jira/browse/TIKA-1813
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor


We're getting around 300 exceptions from "application/x-tika-msoffice" files in 
our current slice of Common Crawl documents that look roughly like this:

{noformat}
java.lang.IllegalArgumentException: Position 86528 past the end of the file
at org.apache.poi.poifs.nio.FileBackedDataSource.read
{noformat}

I suspect these are non-MS OLE file formats.  Any help identifying the file 
types and patching our OLE mime detector would be great.





[jira] [Commented] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl

2015-12-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1505#comment-1505
 ] 

Tim Allison commented on TIKA-1813:
---

{{file}} yields:  {{Composite Document File V2 Document, corrupt: Can't read 
SSAT}}






[jira] [Commented] (TIKA-1799) Upgrade to POI 3.14-Beta1 when available

2015-12-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060029#comment-15060029
 ] 

Tim Allison commented on TIKA-1799:
---

[~kiwiwings], any recommendations for what I need to change in our bundle pom?  
Thank you!

> Upgrade to POI 3.14-Beta1 when available
> 
>
> Key: TIKA-1799
> URL: https://issues.apache.org/jira/browse/TIKA-1799
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 349008.ppt, 349008.ppt.json
>
>
> Should be out in the next week or two.





[GitHub] tika pull request: fix for TIKA-1803 contributed by msha...@usc.ed...

2015-12-16 Thread smadha
GitHub user smadha opened a pull request:

https://github.com/apache/tika/pull/65

fix for TIKA-1803 contributed by msha...@usc.edu



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smadha/tika TIKA-1803

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/65.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #65


commit a55990aa5d6a0c521358123f8d7bbd6947255174
Author: smadha 
Date:   2015-12-16T15:26:23Z

fix for TIKA-1803 contributed by msha...@usc.edu




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl

2015-12-16 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060203#comment-15060203
 ] 

Nick Burch commented on TIKA-1813:
--

My best guess is that these have been truncated. Having a look with 
{{{org.apache.poi.poifs.dev.POIFSHeaderDumper}}} it certainly looks that way
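
A rough sketch of how the truncation guess could be checked without POI, 
assuming the standard OLE2 (compound file) header layout: signature at offset 
0, sector shift at offset 30, first directory sector at 48, first mini FAT 
("SSAT") sector at 60, first DIFAT sector at 68, and 109 DIFAT entries from 
offset 76. Sector n is taken to start at byte (n + 1) * sectorSize. The class 
and method names are hypothetical. This is not what POIFSHeaderDumper does, 
only a cheap first test. Note that 86528 = 169 * 512, which is where sector 
168 would start in a file with 512-byte sectors.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: flags OLE2 files whose header points past the end of the file.
final class Ole2TruncationSketch {
    // Bytes D0 CF 11 E0 A1 B1 1A E1, read as a little-endian long.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;
    static final int HEADER_SIZE = 512;
    // Sector numbers above this value are markers (end of chain, free sector, ...).
    static final long MAX_REGULAR_SECTOR = 0xFFFFFFFAL;

    static boolean isOle2(byte[] header) {
        return header.length >= HEADER_SIZE
                && ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getLong(0) == OLE2_SIGNATURE;
    }

    /** True if any sector referenced directly by the header lies beyond fileLength. */
    static boolean looksTruncated(byte[] header, long fileLength) {
        if (!isOle2(header)) {
            throw new IllegalArgumentException("not an OLE2 header");
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        int sectorShift = buf.getShort(30) & 0xFFFF;
        if (sectorShift != 9 && sectorShift != 12) {
            throw new IllegalArgumentException("unexpected sector shift: " + sectorShift);
        }
        long sectorSize = 1L << sectorShift;
        if (pastEnd(uint32(buf, 48), sectorSize, fileLength)) return true; // first directory sector
        if (pastEnd(uint32(buf, 60), sectorSize, fileLength)) return true; // first mini FAT sector
        if (pastEnd(uint32(buf, 68), sectorSize, fileLength)) return true; // first DIFAT sector
        for (int i = 0; i < 109; i++) {                                    // FAT sectors listed in the header
            if (pastEnd(uint32(buf, 76 + 4 * i), sectorSize, fileLength)) return true;
        }
        return false;
    }

    /** A regular sector must fit entirely inside the file; marker values are ignored. */
    static boolean pastEnd(long sector, long sectorSize, long fileLength) {
        if (sector > MAX_REGULAR_SECTOR) return false;
        return (sector + 2) * sectorSize > fileLength;
    }

    static long uint32(ByteBuffer buf, int offset) {
        return buf.getInt(offset) & 0xFFFFFFFFL;
    }
}
```

This only inspects sectors named in the header itself. A file can pass this 
check and still be cut off further along a FAT chain, so a negative result 
does not prove the file is complete.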






[jira] [Commented] (TIKA-1803) Use lucene-geo-gazetteer REST API in GeoTopicParser

2015-12-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060182#comment-15060182
 ] 

ASF GitHub Bot commented on TIKA-1803:
--

GitHub user smadha opened a pull request:

https://github.com/apache/tika/pull/65

fix for TIKA-1803 contributed by msha...@usc.edu



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smadha/tika TIKA-1803

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/65.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #65


commit a55990aa5d6a0c521358123f8d7bbd6947255174
Author: smadha 
Date:   2015-12-16T15:26:23Z

fix for TIKA-1803 contributed by msha...@usc.edu




> Use lucene-geo-gazetteer REST API in GeoTopicParser
> ---
>
> Key: TIKA-1803
> URL: https://issues.apache.org/jira/browse/TIKA-1803
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Madhav Sharan
>
> As of now, Tika uses the lucene-geo-gazetteer CLI to extract the coordinates 
> of a location. The CLI requires a JVM and Lucene to be instantiated for every 
> request. With the new REST API, it should be possible to improve on this.
> The idea is to create a lucene-geo-gazetteer client in Tika and use it in 
> GeoTopicParser.
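
A small sketch of the client side of that idea, limited to building the 
request URI. The "/api/search" path, the repeated "s" query parameter and the 
port used in the example are assumptions about the lucene-geo-gazetteer REST 
service, not something stated in the issue. The class name is hypothetical and 
this is not the code from the pull request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;
import java.util.List;

// Hypothetical helper: builds a gazetteer search URI for one or more location names.
final class GazetteerUrlSketch {

    /** Appends one "s=<name>" query parameter per location (endpoint shape is assumed). */
    static URI searchUri(String baseUrl, List<String> locations) {
        StringBuilder sb = new StringBuilder(baseUrl).append("/api/search");
        char separator = '?';
        for (String location : locations) {
            sb.append(separator).append("s=").append(encode(location));
            separator = '&';
        }
        return URI.create(sb.toString());
    }

    /** Form-encodes a value as UTF-8; spaces become '+'. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

A long-running service queried this way would let the parser reuse one 
already-started index instead of paying the JVM and Lucene startup cost on 
each lookup, which is the overhead the issue describes.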





[jira] [Comment Edited] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl

2015-12-16 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060203#comment-15060203
 ] 

Nick Burch edited comment on TIKA-1813 at 12/16/15 3:58 PM:


My best guess is that these have been truncated. Having a look with 
{{org.apache.poi.poifs.dev.POIFSHeaderDumper}} it certainly looks that way


was (Author: gagravarr):
My best guess is that these have been truncated. Having a look with 
{{{org.apache.poi.poifs.dev.POIFSHeaderDumper}}} it certainly looks that way






[jira] [Updated] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl

2015-12-16 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1813:
--
Attachment: 27BYDLE36XWCDZXA3PPV6MF524UQ6KAF

This looks like a Revit project file (block 2).

> Figure out file types for several unknown OLE files in Common Crawl
> ---
>
> Key: TIKA-1813
> URL: https://issues.apache.org/jira/browse/TIKA-1813
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB, 
> 25JIANLV77U645GUSJ2E67YSM4B2TNSP, 27BYDLE36XWCDZXA3PPV6MF524UQ6KAF, 
> 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA
>
>
> We're getting around 300 exceptions from "application/x-tika-msoffice" files 
> in our current slice of Common Crawl documents that look roughly like this:
> {noformat}
> java.lang.IllegalArgumentException: Position 86528 past the end of the file
> at org.apache.poi.poifs.nio.FileBackedDataSource.read
> {noformat}
> I suspect these are non-MS OLE file formats.  Any help identifying the file 
> types and patching our OLE mime detector would be great.





[jira] [Commented] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl

2015-12-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060254#comment-15060254
 ] 

Tim Allison commented on TIKA-1813:
---

Duh...I initially posted the exceptions on the theory that we may be misreading 
an old version of how many bytes to read, but yes, truncated makes sense.

I'll post some other tika-msoffice files that didn't cause exceptions.  Thank 
you for the tip on the header dumper.






[jira] [Updated] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl

2015-12-16 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1813:
--
Attachment: unidentified_ole_docs_in_common_crawl_slice.csv

Rather than posting files, here's the list of files that did not result in an 
exception (probably not truncated) and were identified as x-tika-msoffice.

> Figure out file types for several unknown OLE files in Common Crawl
> ---
>
> Key: TIKA-1813
> URL: https://issues.apache.org/jira/browse/TIKA-1813
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB, 
> 25JIANLV77U645GUSJ2E67YSM4B2TNSP, 27BYDLE36XWCDZXA3PPV6MF524UQ6KAF, 
> 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA, 
> unidentified_ole_docs_in_common_crawl_slice.csv
>
>
> We're getting around 300 exceptions from "application/x-tika-msoffice" files 
> in our current slice of Common Crawl documents that look roughly like this:
> {noformat}
> java.lang.IllegalArgumentException: Position 86528 past the end of the file
> at org.apache.poi.poifs.nio.FileBackedDataSource.read
> {noformat}
> I suspect these are non-MS OLE file formats.  Any help identifying the file 
> types and patching our OLE mime detector would be great.


