looking to contribute
Hi Tika Developers,
My name is Joey. I am a college freshman with programming experience looking to get into the world of open source. I was hoping to contribute to the Tika project, and was wondering if there were any tasks that a beginner like me could tackle. I am willing to do anything, whether it be fixing a minor bug or adding test suites or documentation.
Thanks,
Joey
[jira] [Created] (TIKA-1814) Add a standalone XMPScannerParser
Tim Allison created TIKA-1814:
Summary: Add a standalone XMPScannerParser
Key: TIKA-1814
URL: https://issues.apache.org/jira/browse/TIKA-1814
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
Several parsers make use of XMP data and normalize it via dc or other standards into our metadata object. We're currently either relying on dependencies to make sense of multiple XMP packets within a file (PDFBox for PDFParser), or we're just grabbing the first packet (TiffParser via JempboxExtractor and XMPPacketScanner), or...which other parsers are processing XMP?
It might be useful to extract all XMP packets from a file and store those raw bytes as Base64-encoded Strings in the Metadata object. Advanced users could then have access to the raw XMP streams.
For Tika 1.x, nothing would call it unless users configured it. For Tika 2.x, once we get the combo configurable parsers set up, a user could configure a combo/additive parser, e.g., a PDFParser that is a combination of our current PDFParser and this new XMPScannerParser.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
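The idea in TIKA-1814 of collecting every XMP packet as a Base64 string could be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed parser: the class name XmpPacketSketch and its methods are hypothetical, it uses no Tika classes, and it only recognizes packets whose `<?xpacket begin=` and `<?xpacket end=` processing instructions are stored as single-byte (UTF-8/ASCII) text. UTF-16 or UTF-32 packets would need extra handling.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Hypothetical helper, not part of Tika: scans raw bytes for XMP packets.
final class XmpPacketSketch {
    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);

    /** Returns every complete packet found in data, each Base64-encoded. */
    static List<String> extractPackets(byte[] data) {
        List<String> packets = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = indexOf(data, BEGIN, from);
            if (start < 0) break;
            // Find the trailer instruction, then the "?>" that closes it.
            int trailer = indexOf(data, END, start + BEGIN.length);
            if (trailer < 0) break;
            int close = indexOf(data, CLOSE, trailer + END.length);
            if (close < 0) break;
            int stop = close + CLOSE.length;
            byte[] packet = Arrays.copyOfRange(data, start, stop);
            packets.add(Base64.getEncoder().encodeToString(packet));
            from = stop; // continue after this packet so later packets are found too
        }
        return packets;
    }

    /** Naive byte search; returns the index of needle at or after from, or -1. */
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }
}
```

A caller would read the file into a byte array, call XmpPacketSketch.extractPackets, and store each returned string under whatever metadata key the project settles on. An incomplete packet at the end of a file is skipped rather than returned.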
Re: looking to contribute
On Wed, 16 Dec 2015, Joey Hong wrote:
> My name is Joey. I am a college freshman with programming experience looking to get into the world of open source. I was hoping to contribute to the Tika project, and was wondering if there were any tasks that a beginner like me could tackle. I am willing to do anything, whether it be fixing a minor bug or adding test suites or documentation.
On the docs / examples side, we have a few examples on the website, but probably not enough! One thing might be to look through those, identify gaps with your fresh eyes, and work on those. We also have instructions for some more complicated integrations on the wiki; maybe try some of those and feed back on which ones aren't clear enough?
If you want to try more coding, Tim quite often runs Tika against some large filesets, and has a nifty tool to report on what breaks. He can hopefully point you at the most recent report! Maybe have a look through that, identify a few common failures from unidentified or common exceptions, and try to fix one or two of those?
Nick
[jira] [Updated] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl
[ https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1813: Attachment: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB 25JIANLV77U645GUSJ2E67YSM4B2TNSP 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA Some examples. The file lengths are suspiciously regular. Given that these are Common Crawl docs, there's a chance that they were truncated. 225... looks like an SPSS output file (SPO)...maybe? [Gary Kessler|http://www.garykessler.net/library/file_sigs.html] has a helpful list of non-MS file types that rely on OLE.
[jira] [Created] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl
Tim Allison created TIKA-1813:
Summary: Figure out file types for several unknown OLE files in Common Crawl
Key: TIKA-1813
URL: https://issues.apache.org/jira/browse/TIKA-1813
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
We're getting around 300 exceptions from "application/x-tika-msoffice" files in our current slice of Common Crawl documents that look roughly like this:
{noformat}
java.lang.IllegalArgumentException: Position 86528 past the end of the file
    at org.apache.poi.poifs.nio.FileBackedDataSource.read
{noformat}
I suspect these are non-MS OLE file formats. Any help identifying the file types and patching our OLE mime detector would be great.
[jira] [Commented] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl
[ https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1505#comment-1505 ] Tim Allison commented on TIKA-1813: {{file}} yields: {{Composite Document File V2 Document, corrupt: Can't read SSAT}}
[jira] [Commented] (TIKA-1799) Upgrade to POI 3.14-Beta1 when available
[ https://issues.apache.org/jira/browse/TIKA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060029#comment-15060029 ]
Tim Allison commented on TIKA-1799:
[~kiwiwings], any recommendations for what I need to change in our bundle pom? Thank you!
> Upgrade to POI 3.14-Beta1 when available
> Key: TIKA-1799
> URL: https://issues.apache.org/jira/browse/TIKA-1799
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: 349008.ppt, 349008.ppt.json
>
> Should be out in the next week or two.
[GitHub] tika pull request: fix for TIKA-1803 contributed by msha...@usc.ed...
GitHub user smadha opened a pull request:
https://github.com/apache/tika/pull/65
fix for TIKA-1803 contributed by msha...@usc.edu
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/smadha/tika TIKA-1803
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tika/pull/65.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #65
commit a55990aa5d6a0c521358123f8d7bbd6947255174
Author: smadha
Date: 2015-12-16T15:26:23Z
fix for TIKA-1803 contributed by msha...@usc.edu
---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
[jira] [Commented] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl
[ https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060203#comment-15060203 ] Nick Burch commented on TIKA-1813: My best guess is that these have been truncated. Having a look with {{{org.apache.poi.poifs.dev.POIFSHeaderDumper}}} it certainly looks that way
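The truncation hypothesis can also be checked without POI by comparing the sector locations named in the OLE2 header against the actual file length. The sketch below is a rough heuristic, not what POIFSHeaderDumper does: the class name Ole2TruncationSketch is hypothetical, the header offsets follow the published compound file layout (sector shift at byte 30, first directory sector at byte 48, 109 DIFAT entries from byte 76), and it inspects only the header, so a false result does not prove a file is complete. As a side note, 86528 equals 169 x 512, which is where sector 168 would start if the files use 512-byte sectors.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of Tika or POI: header-only truncation check.
final class Ole2TruncationSketch {
    // Signature bytes D0 CF 11 E0 A1 B1 1A E1, read as a little-endian long.
    private static final long MAGIC = 0xE11AB1A1E011CFD0L;
    private static final int HEADER_SIZE = 512;
    private static final int SECTOR_SHIFT_OFFSET = 30;
    private static final int FIRST_DIR_SECTOR_OFFSET = 48;
    private static final int DIFAT_OFFSET = 76;
    private static final int DIFAT_ENTRIES = 109;
    // Sector numbers above this value are special markers (e.g. free, end of chain).
    private static final long MAX_REGULAR_SECTOR = 0xFFFFFFFAL;

    /**
     * Returns true if the header points at a FAT or directory sector that
     * would extend past fileLength, which suggests the file was cut short.
     */
    static boolean looksTruncated(byte[] header, long fileLength) {
        if (header.length < HEADER_SIZE) {
            throw new IllegalArgumentException("need the first 512 bytes of the file");
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getLong(0) != MAGIC) {
            throw new IllegalArgumentException("not an OLE2 compound file");
        }
        int sectorShift = buf.getShort(SECTOR_SHIFT_OFFSET) & 0xFFFF;
        if (sectorShift != 9 && sectorShift != 12) {
            throw new IllegalArgumentException("unexpected sector shift: " + sectorShift);
        }
        long sectorSize = 1L << sectorShift;

        // FAT sectors listed in the header's DIFAT array.
        for (int i = 0; i < DIFAT_ENTRIES; i++) {
            long sector = buf.getInt(DIFAT_OFFSET + 4 * i) & 0xFFFFFFFFL;
            if (endsPastFile(sector, sectorSize, fileLength)) return true;
        }
        // First sector of the directory stream.
        long dirSector = buf.getInt(FIRST_DIR_SECTOR_OFFSET) & 0xFFFFFFFFL;
        return endsPastFile(dirSector, sectorSize, fileLength);
    }

    private static boolean endsPastFile(long sector, long sectorSize, long fileLength) {
        if (sector > MAX_REGULAR_SECTOR) return false; // special marker, not a location
        // Sector n starts at (n + 1) * sectorSize and ends one sector later.
        return (sector + 2) * sectorSize > fileLength;
    }
}
```

To use it, read the first 512 bytes of a suspect file and pass them together with the file's size on disk. Running it over the ~300 failing files would show how many fail for this reason, though that result is not known here.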
[jira] [Commented] (TIKA-1803) Use lucene-geo-gazetteer REST API in GeoTopicParser
[ https://issues.apache.org/jira/browse/TIKA-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060182#comment-15060182 ]
ASF GitHub Bot commented on TIKA-1803:
GitHub user smadha opened a pull request:
https://github.com/apache/tika/pull/65
fix for TIKA-1803 contributed by msha...@usc.edu
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/smadha/tika TIKA-1803
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tika/pull/65.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #65
commit a55990aa5d6a0c521358123f8d7bbd6947255174
Author: smadha
Date: 2015-12-16T15:26:23Z
fix for TIKA-1803 contributed by msha...@usc.edu
> Use lucene-geo-gazetteer REST API in GeoTopicParser
> Key: TIKA-1803
> URL: https://issues.apache.org/jira/browse/TIKA-1803
> Project: Tika
> Issue Type: Sub-task
> Components: parser
> Reporter: Madhav Sharan
>
> Tika currently uses the lucene-geo-gazetteer CLI to extract the coordinates of a location. The CLI requires a JVM and Lucene to be instantiated for every request. With the new REST API it should be possible to improve on this.
> The idea is to create a lucene-geo-gazetteer client in Tika and use it in GeoTopicParser.
[jira] [Comment Edited] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl
[ https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060203#comment-15060203 ]
Nick Burch edited comment on TIKA-1813 at 12/16/15 3:58 PM:
My best guess is that these have been truncated. Having a look with {{org.apache.poi.poifs.dev.POIFSHeaderDumper}} it certainly looks that way
was (Author: gagravarr):
My best guess is that these have been truncated. Having a look with {{{org.apache.poi.poifs.dev.POIFSHeaderDumper}}} it certainly looks that way
[jira] [Updated] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl
[ https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1813: Attachment: 27BYDLE36XWCDZXA3PPV6MF524UQ6KAF This looks like a Revit project file (block 2).
[jira] [Commented] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl
[ https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060254#comment-15060254 ] Tim Allison commented on TIKA-1813: Duh...I initially posted the exceptions on the theory that we may be misreading an old version of how many bytes to read, but yes, truncated makes sense. I'll post some other tika-msoffice files that didn't cause exceptions. Thank you for the tip on the header dumper.
[jira] [Updated] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl
[ https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1813: Attachment: unidentified_ole_docs_in_common_crawl_slice.csv Rather than posting files, here's the list of files that did not result in an exception (probably not truncated) and were identified as x-tika-msoffice.