Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

2016-03-28 Thread Nick Burch

On Sun, 27 Mar 2016, Bob Paulin wrote:
Yes, I think overall if these functions can live somewhere, either 
inside Tika or in a smaller dependent library, we're in a better place. I'll 
take a look at Ogg-Vorbis.


The two util classes there, that spring to mind, are:
https://github.com/Gagravarr/VorbisJava/blob/master/core/src/main/java/org/gagravarr/ogg/IOUtils.java
https://github.com/Gagravarr/VorbisJava/blob/master/core/src/main/java/org/gagravarr/ogg/BitsReader.java

Nick


Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

2016-03-27 Thread Nick Burch

On Sun, 27 Mar 2016, Bob Paulin wrote:
Currently the Apache POI dependency is in several modules and it's sort 
of a beast (> 2 MB in size).


You should've seen it before Jukka and Yegor spent a crazy ApacheCon 
hacking up the ooxml-lite support... ;-)



It appears many of the modules are only using the IOUtils library.


I suspect a strong overlap with the parser classes I've helped write...

Any concerns with replacing this POI stuff with commons-io? Does POI 
offer anything above the commons-io functionality in IOUtils? If not I 
think it would be great to isolate the poi dependency to the office 
module only.


A lot of the use is for endian-specific reading of numbers and strings. 
Might be a bit of stream stuff, but mostly that can be passed off to the 
Tika IO utils classes.


From a quick check, I can't see any endian number stuff in commons IO, but 
I might have missed it, or it might be in a different commons module. If 
not, there might be something to be said for popping that POI logic along 
with some of the Ogg-Vorbis utils stuff (another one with my grubby mitts 
all over it) into a more helpful general utils grouping.
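
As a rough illustration of the kind of endian-specific reading being discussed, here is a minimal sketch using only the JDK. The class and method names are hypothetical; this is not the POI, commons-io, Ogg-Vorbis or Tika code, just one way such a small utils grouping might look.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not an existing POI or Tika class.
final class EndianReadSketch {
    private EndianReadSketch() {}

    /** Reads an unsigned 16-bit little-endian value starting at offset. */
    static int getUInt16LE(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 2)
                .order(ByteOrder.LITTLE_ENDIAN).getShort() & 0xFFFF;
    }

    /** Reads a signed 32-bit little-endian value starting at offset. */
    static int getInt32LE(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4)
                .order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    /** Reads a signed 32-bit big-endian value starting at offset. */
    static int getInt32BE(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4)
                .order(ByteOrder.BIG_ENDIAN).getInt();
    }

    /** Reads a fixed-length UTF-16LE string of the given number of chars. */
    static String getUtf16LE(byte[] data, int offset, int chars) {
        return new String(data, offset, chars * 2, StandardCharsets.UTF_16LE);
    }
}
```

Working on byte arrays keeps the sketch free of checked exceptions; a stream-based variant would first read the required number of bytes fully and then decode them the same way.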


Nick


[jira] [Commented] (TIKA-1908) --list-met-models does not display Dublin core along with other metadata models

2016-03-25 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212612#comment-15212612
 ] 

Nick Burch commented on TIKA-1908:
--

I seem to recall there was a deliberate policy to avoid putting all the 
new-style metadata keys onto the Metadata interface. It won't only be the DC 
ones missing; I'd guess most/all of the new property collections are 
missing too.

Probably the "right" fix would be to pull the properties from the other 
metadata collections into the listing action as well.

> --list-met-models does not display Dublin core along with other metadata 
> models
> ---
>
> Key: TIKA-1908
> URL: https://issues.apache.org/jira/browse/TIKA-1908
> Project: Tika
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 1.12
> Environment: Windows
>Reporter: Sharmilee S
>Priority: Minor
>  Labels: easyfix
> Fix For: 1.12
>
> Attachments: Metadata.java
>
>
> The --list-met-models option of the Tika client jar (tika-app.jar) 
> does not list the Dublin Core metadata model. It seems the interface was 
> left out when the Metadata class was implemented. Added the 
> DublinCore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1898) backslashes in mime-type : application/vnd.mif are wrong

2016-03-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209418#comment-15209418
 ] 

Nick Burch commented on TIKA-1898:
--

Ah, ok, got it. We had a random impenetrable hex string magic too, which turned 
out to be "

> backslashes in mime-type : application/vnd.mif are wrong 
> -
>
> Key: TIKA-1898
> URL: https://issues.apache.org/jira/browse/TIKA-1898
> Project: Tika
>  Issue Type: Bug
>  Components: config, core
> Environment: Win64, Eclipse
>Reporter: Steffen Netz
>Priority: Minor
>  Labels: easyfix, patch
> Fix For: 1.13
>
> Attachments: test.doc, test.fm, test.mif, tika-bug.log
>
>
> In
> tika-core\src\main\resources\org\apache\tika\mime\tika-mimetypes.xml  
> there are the lines:
> 
>   
>   
>   
>   
>   
>   
>   wrong.
> the backslashes must be removed.





[jira] [Resolved] (TIKA-1898) backslashes in mime-type : application/vnd.mif are wrong

2016-03-23 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1898.
--
   Resolution: Fixed
Fix Version/s: 1.13

> backslashes in mime-type : application/vnd.mif are wrong 
> -
>
> Key: TIKA-1898
> URL: https://issues.apache.org/jira/browse/TIKA-1898
> Project: Tika
>  Issue Type: Bug
>  Components: config, core
> Environment: Win64, Eclipse
>Reporter: Steffen Netz
>Priority: Minor
>  Labels: easyfix, patch
> Fix For: 1.13
>
> Attachments: test.doc, test.fm, test.mif, tika-bug.log
>
>
> In
> tika-core\src\main\resources\org\apache\tika\mime\tika-mimetypes.xml  
> there are the lines:
> 
>   
>   
>   
>   
>   
>   
>   wrong.
> the backslashes must be removed.





[jira] [Commented] (TIKA-1888) Update mimetype for application/x-netcdf

2016-03-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206152#comment-15206152
 ] 

Nick Burch commented on TIKA-1888:
--

Which match is missing? We already have CDF 0x01, which is what your 
hard-to-read hex string codes for
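
To show why the text form is easier to review, here is a small sketch that decodes a hex magic into readable text. The hex value `0x43444601` is worked out here from ASCII ("C", "D", "F" followed by byte 0x01) rather than copied from the issue, and the `\xNN` escape style is purely illustrative, not the escape syntax of tika-mimetypes.xml. The class name is hypothetical.

```java
// Hypothetical helper for illustration only; not part of Tika.
final class HexMagicSketch {
    private HexMagicSketch() {}

    /** Decodes a hex string such as "0x43444601" (prefix optional) into bytes. */
    static byte[] decodeHex(String hex) {
        String h = hex.startsWith("0x") ? hex.substring(2) : hex;
        if (h.length() % 2 != 0) {
            throw new IllegalArgumentException("Odd number of hex digits: " + hex);
        }
        byte[] out = new byte[h.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(h.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Renders bytes as text, escaping anything outside printable ASCII as \xNN. */
    static String toReadable(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v >= 0x20 && v < 0x7F && v != '\\') {
                sb.append((char) v);
            } else {
                sb.append(String.format("\\x%02x", v));
            }
        }
        return sb.toString();
    }
}
```

Calling `HexMagicSketch.toReadable(HexMagicSketch.decodeHex("0x43444601"))` yields the three letters CDF followed by an escaped 0x01 byte, which is much easier to compare against an existing match by eye.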

> Update mimetype for application/x-netcdf
> 
>
> Key: TIKA-1888
> URL: https://issues.apache.org/jira/browse/TIKA-1888
> Project: Tika
>  Issue Type: Improvement
>  Components: core, mime
>Affects Versions: 1.13
>Reporter: Ajay Kumar Loganathan Ravichandran
>  Labels: mimetypes
> Fix For: 1.13
>
>
> Updating tika-mimetype.xml to identify .cdf and .nc file format.
> 
>   
>   
>  
>
> 
> 





[jira] [Commented] (TIKA-1898) backslashes in mime-type : application/vnd.mif are wrong

2016-03-14 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15194363#comment-15194363
 ] 

Nick Burch commented on TIKA-1898:
--

I've just tried with your test file, and Tika is able to detect the file 
correctly with the data only (no filename). That makes me think that the 
mimetype is correct:

{code}
$ java -jar tika-app-1.13-SNAPSHOT.jar --detect < test.mif 
application/vnd.mif
{code}

Are you able to produce a junit unit test that shows your detection issue, and 
ideally shows your proposed patch fixes it? (Bonus marks if it's as a Github 
Pull Request or a Patch attached to the JIRA!)

> backslashes in mime-type : application/vnd.mif are wrong 
> -
>
> Key: TIKA-1898
> URL: https://issues.apache.org/jira/browse/TIKA-1898
> Project: Tika
>  Issue Type: Bug
>  Components: config, core
> Environment: Win64, Eclipse
>Reporter: Steffen Netz
>Priority: Minor
>  Labels: easyfix, patch
> Attachments: test.doc, test.fm, test.mif, tika-bug.log
>
>
> In
> tika-core\src\main\resources\org\apache\tika\mime\tika-mimetypes.xml  
> there are the lines:
> 
>   
>   
>   
>   
>   
>   
>   wrong.
> the backslashes must be removed.





[jira] [Commented] (TIKA-1898) backslashes in mime-type : application/vnd.mif are wrong

2016-03-10 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189482#comment-15189482
 ] 

Nick Burch commented on TIKA-1898:
--

Do you have a small sample file we could use to write a unit test to verify 
this fix?

> backslashes in mime-type : application/vnd.mif are wrong 
> -
>
> Key: TIKA-1898
> URL: https://issues.apache.org/jira/browse/TIKA-1898
> Project: Tika
>  Issue Type: Bug
>  Components: config, core
> Environment: Win64, Eclipse
>Reporter: Steffen Netz
>Priority: Minor
>  Labels: easyfix, patch
>
> In
> tika-core\src\main\resources\org\apache\tika\mime\tika-mimetypes.xml  
> there are the lines:
> 
>   
>   
>   
>   
>   
>   
>   wrong.
> the backslashes must be removed.





[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-09 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15187205#comment-15187205
 ] 

Nick Burch commented on TIKA-1508:
--

> I think that's exactly what ParseContext should be for..it should be a 
> vehicle for Param passing. We can delineate by property name (FQ) and/or by 
> class.

I view {{ParseContext}} as somewhere you configure things on a per-document 
basis, not a per-parser basis. 

So, need to set where Tesseract lives on your system? Applies to everything, so 
on the parser. Need to tell Tesseract to use a German not an English dictionary 
on this particular jpeg? Applies to just this one document being parsed, so on 
the {{ParseContext}}
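
The split between parser-level and per-document settings can be sketched without Tika on the classpath. Everything below is a hypothetical stand-in: `DocContext` only loosely mimics the type-keyed idea behind {{ParseContext}}, and `OcrParser` and `OcrLanguage` are invented names, not Tika's Tesseract classes.

```java
import java.util.HashMap;
import java.util.Map;

// All names here are hypothetical stand-ins, not Tika classes.
final class ConfigScopeSketch {
    private ConfigScopeSketch() {}

    /** Per-document overrides, keyed by type. */
    static final class DocContext {
        private final Map<Class<?>, Object> values = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            values.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object v = values.get(key);
            return v == null ? defaultValue : key.cast(v);
        }
    }

    /** A setting that may differ from one document to the next. */
    static final class OcrLanguage {
        final String code;

        OcrLanguage(String code) {
            this.code = code;
        }
    }

    /** Parser-level config: set once, applies to every document. */
    static final class OcrParser {
        private final String tesseractPath;
        private final OcrLanguage defaultLanguage;

        OcrParser(String tesseractPath, String defaultLanguage) {
            this.tesseractPath = tesseractPath;
            this.defaultLanguage = new OcrLanguage(defaultLanguage);
        }

        /** Describes the effective settings for one document; nothing is executed. */
        String describeRun(String file, DocContext context) {
            OcrLanguage lang = context.get(OcrLanguage.class, defaultLanguage);
            return "tesseract=" + tesseractPath + " lang=" + lang.code + " file=" + file;
        }
    }
}
```

In this sketch the Tesseract path is fixed when the parser is built, while a caller handling one German scan puts an `OcrLanguage("deu")` into the `DocContext` for that call only; every other document keeps the parser default.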

> Add uniformity to parser parameter configuration
> 
>
> Key: TIKA-1508
> URL: https://issues.apache.org/jira/browse/TIKA-1508
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Fix For: 1.13
>
>
> We can currently configure parsers by the following means:
> 1) programmatically by direct calls to the parsers or their config objects
> 2) sending in a config object through the ParseContext
> 3) modifying .properties files for specific parsers (e.g. PDFParser)
> Rather than scattering the landscape with .properties files for each parser, 
> it would be great if we could specify parser parameters in the main config 
> file, something along the lines of this:
> {noformat}
> 
>   
> 2
> something or other
>   
>   audio/basic
>   audio/x-aiff
>   audio/x-wav
> 
> {noformat}





[jira] [Commented] (TIKA-1888) Update mimetype for application/x-netcdf

2016-03-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182860#comment-15182860
 ] 

Nick Burch commented on TIKA-1888:
--

Our current mimetype definition for netcdf is:
{code}
  



  
  

  
{code}

Your change seems to remove one possible magic match, and make the magic less 
easy to read... Am I missing something?

> Update mimetype for application/x-netcdf
> 
>
> Key: TIKA-1888
> URL: https://issues.apache.org/jira/browse/TIKA-1888
> Project: Tika
>  Issue Type: Improvement
>  Components: core, mime
>Affects Versions: 1.13
>Reporter: Ajay Kumar Loganathan Ravichandran
>  Labels: mimetypes
> Fix For: 1.13
>
>
> Updating tika-mimetype.xml to identify .cdf and .nc file format.
> 
>   
>   
>  
>
> 
> 





[jira] [Resolved] (TIKA-1891) Update mimetype for mime-type image/fits

2016-03-07 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1891.
--
   Resolution: Duplicate
Fix Version/s: (was: 1.13)

> Update mimetype for mime-type image/fits
> 
>
> Key: TIKA-1891
> URL: https://issues.apache.org/jira/browse/TIKA-1891
> Project: Tika
>  Issue Type: Improvement
>  Components: core, mime
>Affects Versions: 1.13
>Reporter: Ajay Kumar Loganathan Ravichandran
>  Labels: mimetypes
>
> Updating tika-mimetype.xml to identify image/fits files
> Updated mime-type:
> 
>   
> 
>
>  
> 





[jira] [Resolved] (TIKA-1889) Update mimetype for *.qt and *.mov files detection

2016-03-06 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1889.
--
   Resolution: Duplicate
Fix Version/s: (was: 1.13)

> Update mimetype for *.qt and *.mov files detection
> --
>
> Key: TIKA-1889
> URL: https://issues.apache.org/jira/browse/TIKA-1889
> Project: Tika
>  Issue Type: Improvement
>  Components: core, mime
>Affects Versions: 1.13
>Reporter: Ajay Kumar Loganathan Ravichandran
>  Labels: mime-type
>
> Updating tika-mimetype.xml to identify quicktime file format.
> Updated match value for quicktime file format
>  





[jira] [Resolved] (TIKA-1892) Mime Magic for application/x-mobipocket-ebook and application/x-shapefile

2016-03-06 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1892.
--
   Resolution: Fixed
Fix Version/s: 1.13

Thanks, SHP added and MOBI updated in 74e71eb

> Mime Magic for application/x-mobipocket-ebook and application/x-shapefile
> -
>
> Key: TIKA-1892
> URL: https://issues.apache.org/jira/browse/TIKA-1892
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.12
>Reporter: Suman Kashyap
>Priority: Minor
> Fix For: 1.13
>
>
> Our FHT analysis for mobipocket-ebook and shapefiles shows a high correlation of 
> initial header bytes. Further inspection of these files, both from sources available 
> online and from the TREC polar data sets, revealed common bytes usable for mime 
> identification.
> patch content
> 
>   NETCDF
>   <_comment>Network Common Data Format
>   
>   
>   
>   
> 
> 
>   MOBI
>   <_comment>Mobipocket Ebook
>   
>   
>   
>   
> 
> 
>   ESRI Shapefiles
>   <_comment>ESRI Shapefiles
>   
>   
>   
>   
> 
>  





[jira] [Commented] (TIKA-1893) Add new mimetype for *.icns (Apple Icon Image Format) files

2016-03-06 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182181#comment-15182181
 ] 

Nick Burch commented on TIKA-1893:
--

Do you have a patch or pull request for this?

> Add new mimetype for *.icns (Apple Icon Image Format) files 
> 
>
> Key: TIKA-1893
> URL: https://issues.apache.org/jira/browse/TIKA-1893
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.11
>Reporter: Manisha Kampasi
>Priority: Minor
>  Labels: patch
>
> Currently, TIKA does not support the "image/icns" mime type for *.icns files 
> (Apple Icon Image Format). This can be added to the tika-mimetypes.xml.





[jira] [Resolved] (TIKA-1890) Update mimetype for application/vnd.ms-cab-compressed

2016-03-06 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1890.
--
Resolution: Fixed

More specific mime magic added in f7d3097 along with a unit test

> Update mimetype for application/vnd.ms-cab-compressed
> -
>
> Key: TIKA-1890
> URL: https://issues.apache.org/jira/browse/TIKA-1890
> Project: Tika
>  Issue Type: Improvement
>  Components: core, mime
>Affects Versions: 1.13
>Reporter: Ajay Kumar Loganathan Ravichandran
>  Labels: mimetypes
> Fix For: 1.13
>
>
> Updating tika-mimetype.xml to identify *.cab file format.
> Updated mime-type:
>  
>   
>  
>
>  
> 





[jira] [Commented] (TIKA-1889) Update mimetype for *.qt and *.mov files detection

2016-03-06 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182156#comment-15182156
 ] 

Nick Burch commented on TIKA-1889:
--

Isn't this a duplicate of TIKA-1882?

> Update mimetype for *.qt and *.mov files detection
> --
>
> Key: TIKA-1889
> URL: https://issues.apache.org/jira/browse/TIKA-1889
> Project: Tika
>  Issue Type: Improvement
>  Components: core, mime
>Affects Versions: 1.13
>Reporter: Ajay Kumar Loganathan Ravichandran
>  Labels: mime-type
> Fix For: 1.13
>
>
> Updating tika-mimetype.xml to identify quicktime file format.
> Updated match value for quicktime file format
>  





[jira] [Commented] (TIKA-1888) Update mimetype for application/x-netcdf

2016-03-06 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182155#comment-15182155
 ] 

Nick Burch commented on TIKA-1888:
--

That match looks to be a string. In order to keep it readable and consistent 
with the others, any chance you could re-do it as a text match rather than a 
hex one?

> Update mimetype for application/x-netcdf
> 
>
> Key: TIKA-1888
> URL: https://issues.apache.org/jira/browse/TIKA-1888
> Project: Tika
>  Issue Type: Improvement
>  Components: core, mime
>Affects Versions: 1.13
>Reporter: Ajay Kumar Loganathan Ravichandran
>  Labels: mimetypes
> Fix For: 1.13
>
>
> Updating tika-mimetype.xml to identify .cdf and .nc file format.
> 
>   
>   
>  
>
> 
> 





[jira] [Commented] (TIKA-1887) Add new mimetype for file extensions .po

2016-03-06 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182154#comment-15182154
 ] 

Nick Burch commented on TIKA-1887:
--

http://www.icanlocalize.com/site/tutorials/how-to-translate-with-gettext-po-and-pot-files/
 seems a good introduction to these formats, for those new to it all

{{text/x-gettext-translation}} and {{text/x-po}} seem to be moderately widely 
used for these already, so it might be good to use the former and set the 
latter as an alias, rather than inventing our own. (We also shouldn't use 
{{text/po}} as it isn't officially assigned, so would need an x- prefix to 
indicate this)

> Add new mimetype for file extensions .po 
> -
>
> Key: TIKA-1887
> URL: https://issues.apache.org/jira/browse/TIKA-1887
> Project: Tika
>  Issue Type: Improvement
>  Components: core, mime
>Reporter: Manali Shah
>  Labels: mimetypes
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi, 
> While analyzing the Trec DD polar data, we came across files that were 
> classified as octet-stream. 
> On using content based algorithms such as BFA, BFCC  and FHT we were able to 
> determine more magic bytes for certain files.
> The GNU gettext toolset is used by programmers and translators for producing, 
> updating and using translation files, mainly PO files, which are 
> textual, editable files.
> We suggest a new mimetype as text/po to be added to the existing mime 
> repository of Tika.





[jira] [Commented] (TIKA-1886) Updating tika-mimetypes.xml to detect .hfa files

2016-03-06 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182150#comment-15182150
 ] 

Nick Burch commented on TIKA-1886:
--

Matching pull request is https://github.com/apache/tika/pull/88 , but it needs 
a few tweaks before it can be merged

> Updating tika-mimetypes.xml to detect .hfa files
> 
>
> Key: TIKA-1886
> URL: https://issues.apache.org/jira/browse/TIKA-1886
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.12
>Reporter: Nandan Chandrashekar
>Priority: Minor
> Fix For: 1.11
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Updating tika-mimetype.xml to identify .hfa file format. 
> Details about .hfa file format. 
> Links : 
> 1. 
> ftp://ftp.ecn.purdue.edu/jshan/86/help/html/appendices/hfa_object_directory.htm
> 2. ftp://ftp.ecn.purdue.edu/jshan/86/help/html/appendices/Ehfa_HeaderTag.htm





[jira] [Commented] (TIKA-1885) Updated tika-mimestype.xml and a detector to identify new types of files based on analysis

2016-03-06 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182149#comment-15182149
 ] 

Nick Burch commented on TIKA-1885:
--

Any luck with the pull request?

> Updated tika-mimestype.xml and a detector to identify new types of files 
> based on analysis
> --
>
> Key: TIKA-1885
> URL: https://issues.apache.org/jira/browse/TIKA-1885
> Project: Tika
>  Issue Type: Improvement
>  Components: core, detector, mime
>Affects Versions: 1.11
> Environment: Windows OS X64 , Java
>Reporter: Adesh Gupta
>Priority: Critical
> Fix For: 1.11
>
>
> Updated tika-mimetypes.xml and detector to identify new file types in TREC DD 
> Polar dataset.





[jira] [Commented] (TIKA-1882) Updating the tika-mimetypes.xml for new mime magic patterns

2016-03-06 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182146#comment-15182146
 ] 

Nick Burch commented on TIKA-1882:
--

Just because other people think it's a magic doesn't necessarily mean it is - 
many others just blindly find a few bytes that look common without trying to 
understand the underlying format, and consequently can get it wrong...

As the QuickTime container is a base for MP4, and our MP4 Video mime type 
declares QuickTime Video as its parent, if things are common then QuickTime is 
the right place to put it. 

I've had a go in bee1a87d7d9ad3a1c5f45cf65082b9505dbe9fc0 to better express the 
QuickTime/MP4 relationship in the mime types hierarchy. If you could merge that 
and re-test, and all tests pass, plus switch hex strings to text where possible 
(see pull request comments) then I think we should be fine to apply

> Updating the tika-mimetypes.xml for new mime magic patterns
> ---
>
> Key: TIKA-1882
> URL: https://issues.apache.org/jira/browse/TIKA-1882
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.11
>Reporter: Manisha Kampasi
>Priority: Minor
>  Labels: patch
>
> The following mime magic can be added to better detect the below mime-types:
> 1. vnd.ms-cab-compressed (.cab files) - pattern "MCSF" in the first 4 bytes
> 2. application/vnd.xara (.xar files) - pattern "xar!" in the first 4 bytes
> 3. application/x-mobipocket-ebook (.mobi files) - pattern "BOOKMOBI" starting 
> at byte position 60
> 4. video/quicktime (.mov files) - patterns "free" and "wide" seen starting at 
> byte position 4
> The changes can be seen here:
> https://github.com/mkampasi/tika/commit/f7433daf434a44937ba3ae8b15813a768f95e334





[jira] [Commented] (TIKA-1883) Identification of Mime Type for Empty Files

2016-03-06 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182145#comment-15182145
 ] 

Nick Burch commented on TIKA-1883:
--

This pull request is almost impossible to understand. Any chance you could 
re-base your commits onto the latest Git Head, then try again?

> Identification of Mime Type for Empty Files
> ---
>
> Key: TIKA-1883
> URL: https://issues.apache.org/jira/browse/TIKA-1883
> Project: Tika
>  Issue Type: Improvement
>  Components: core, parser
>Affects Versions: 1.12
>Reporter: Aditya Ramachandra Desai
>Priority: Minor
>  Labels: patch
> Fix For: 1.12
>
>
> Identification of Mime types for empty files, updating TIKA 1.12 source code 
> to fix this issue. The Tika Detector and Parsers have been modified 
> accordingly to identify the empty files and classify them.
> Team 20
> Shashank, Rashmi and Aditya Desai
> CSCI 599 USC Spring 2016





[jira] [Resolved] (TIKA-1878) Upgrade Apache SIS 0.6

2016-03-06 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1878.
--
   Resolution: Fixed
Fix Version/s: 1.13

Thanks, upgraded, but in a slightly different way (I pulled the version string out 
to a property so it only needs changing once)

> Upgrade Apache SIS 0.6
> --
>
> Key: TIKA-1878
> URL: https://issues.apache.org/jira/browse/TIKA-1878
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.12
>Reporter: Hendy Irawan
>Priority: Trivial
> Fix For: 1.13
>
>
> Pull request here: https://github.com/apache/tika/pull/79





[jira] [Commented] (TIKA-1881) On updating mime magic for existing mime types

2016-03-06 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182137#comment-15182137
 ] 

Nick Burch commented on TIKA-1881:
--

As mentioned on the Github pull request:

For the Atom, RSS and RDF ones - is the magic required? Doesn't the XML 
detector get them already via the namespace? And without risk of mis-detecting 
text files which happen to mention feed or rss or rdf near the start?
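
To make the namespace point concrete, here is a minimal JDK-only sketch of detection by root element. It is not Tika's XML detector; the class name is hypothetical and the approach (StAX, stopping at the first start element) is just one assumed way to do it.

```java
import java.io.ByteArrayInputStream;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch for illustration only; not Tika's actual XML detection code.
final class XmlRootSketch {
    private XmlRootSketch() {}

    /** Returns the root element's qualified name, or null if the data is not parseable XML. */
    static QName rootElement(byte[] data) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Avoid DTD processing and external entities while sniffing content.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(data));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getName();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return null; // Not XML, so no root element to report
        }
        return null;
    }
}
```

An Atom document reports the root `feed` in the `http://www.w3.org/2005/Atom` namespace, whereas a plain text file that merely mentions feed, rss or rdf near the start yields null and cannot be mistaken for one.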

For the Postscript one - could you re-do this as text rather than hex, so it's 
easier to read?

(Others look fine!)


> On updating mime magic for existing mime types
> --
>
> Key: TIKA-1881
> URL: https://issues.apache.org/jira/browse/TIKA-1881
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.11
>Reporter: Namitha Sanjeeva Ganiga
>Priority: Minor
>  Labels: mime
> Fix For: 1.11
>
>
> Updated Mime-Magic for 6 mime types:
> 1. application/postscript : files begin with pattern "%!PS-Adobe-3.0 
> EPSF-3.0".
> 2. application/wordperfect: files begin with pattern "ÿWPC" .
> 3. image/tiff : updated pattern for "MM.+" for Big endian format.(occur at 
> the beginning of files of tiff mime type)
> 4. application/rdf+xml : updated pattern "rdf" ( from byte offset 5 to 400) 
> 5. application/atom+xml : updated pattern "feed" ( from byte offset 5 to 50)
> 6. application/rss+xml : updated pattern "rss" ( from byte offset 5 to 50)
> https://github.com/NamithaGS/tika/commit/780100767e24505a24595ea6db43978d0700e220





[jira] [Resolved] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything

2016-03-06 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1877.
--
   Resolution: Fixed
Fix Version/s: 1.13

Patch applied, with a slight tweak to rename the test file to better match our 
naming standards. Thanks!

> On updating the tika-mimetypes.xml to detect .fts file format, tika detector 
> does not return anything
> -
>
> Key: TIKA-1877
> URL: https://issues.apache.org/jira/browse/TIKA-1877
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Reporter: Prasad Nagaraj Subramanya
>Priority: Minor
> Fix For: 1.13
>
> Attachments: 
> 3DEE2CE70CAD248DC8A46C2D0BD0BD6C21AACE54AC958264773390B39C8AF079, 
> 4E8D6B46E2366D7063DE3926AF0F976A0DCCD57A7E3B53B7D54768F16DD23984, 
> tika-mimetypes.xml
>
>
> The match value for .fts file format in tika-mimetypes.xml is "SIMPLE  =  
>   T".
> Tika detected a .fts file as application/octet-stream. On verifying the 
> header I found the value to be "SIMPLE  =T"(just 16 spaces 
> before = and T)
> I tried the following changes-
> Change 1) Updated the existing match value. But the build failed 
> Change 2) Added a new match value  type="string" offset="0"/> after the existing one.
> But now, tika returns empty value. It neither identifies the file as .fts nor 
> as application/octet-stream.





[jira] [Commented] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything

2016-03-06 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182131#comment-15182131
 ] 

Nick Burch commented on TIKA-1877:
--

With your patch applied, the Tika app correctly detects your new text file for 
me, both with and without the filename hint:
{code}
$ tika --detect 
tika-parsers/src/test/resources/test-documents/testFITS_ShorterHeader.fits
application/fits
$ tika --detect < 
tika-parsers/src/test/resources/test-documents/testFITS_ShorterHeader.fits
application/fits
{code}

> On updating the tika-mimetypes.xml to detect .fts file format, tika detector 
> does not return anything
> -
>
> Key: TIKA-1877
> URL: https://issues.apache.org/jira/browse/TIKA-1877
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Reporter: Prasad Nagaraj Subramanya
>Priority: Minor
> Attachments: 
> 3DEE2CE70CAD248DC8A46C2D0BD0BD6C21AACE54AC958264773390B39C8AF079, 
> 4E8D6B46E2366D7063DE3926AF0F976A0DCCD57A7E3B53B7D54768F16DD23984, 
> tika-mimetypes.xml
>
>
> The match value for .fts file format in tika-mimetypes.xml is "SIMPLE  =  
>   T".
> Tika detected a .fts file as application/octet-stream. On verifying the 
> header I found the value to be "SIMPLE  =T"(just 16 spaces 
> before = and T)
> I tried the following changes-
> Change 1) Updated the existing match value. But the build failed 
> Change 2) Added a new match value  type="string" offset="0"/> after the existing one.
> But now, tika returns empty value. It neither identifies the file as .fts nor 
> as application/octet-stream.





[jira] [Resolved] (TIKA-1875) Updating tika-mimetypes.xml to detect .NC files

2016-03-06 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1875.
--
   Resolution: Fixed
Fix Version/s: (was: 1.11)
   1.13

Thanks for the new patch, now applied

> Updating tika-mimetypes.xml to detect .NC files 
> 
>
> Key: TIKA-1875
> URL: https://issues.apache.org/jira/browse/TIKA-1875
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.12
>Reporter: Prasad Nagaraj Subramanya
>Priority: Minor
>  Labels: patch
> Fix For: 1.13
>
>
> Adding magic number to detect .NC files





[jira] [Commented] (TIKA-1885) Updated tika-mimestype.xml and a detector to identify new types of files based on analysis

2016-03-03 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177566#comment-15177566
 ] 

Nick Burch commented on TIKA-1885:
--

Did you mean to close this? Is there a matching pull request or patch that 
needs to be applied to implement the changes? And what file types are you 
working with?

> Updated tika-mimestype.xml and a detector to identify new types of files 
> based on analysis
> --
>
> Key: TIKA-1885
> URL: https://issues.apache.org/jira/browse/TIKA-1885
> Project: Tika
>  Issue Type: Improvement
>  Components: core, detector, mime
>Affects Versions: 1.11
> Environment: Windows OS X64 , Java
>Reporter: Adesh Gupta
>Priority: Critical
> Fix For: 1.11
>
>
> Updated tika-mimetypes.xml and detector to identify new file types in TREC DD 
> Polar dataset.





Re: Need suggestion on file type .HFA to be added Tika.

2016-03-02 Thread Nick Burch

On Wed, 2 Mar 2016, Nandan Padar Chandrashekar wrote:

Identified the HFA (Hierarchical File Architecture) file format, which is not
presently identified by Tika.

extension : *.hfa
Header tag contains string  EHFA_HEADER_TAG


Looks fine for adding to Tika to me


Should this be considered a custom mime type or a standard mime type?


As it's a common, well-known file type, it should be a standard one. It'd 
really only need to be a custom one if it was only used in your lab / 
school / company and nowhere else



Need a suggestion for the content type (mime type) of this file format.


application/x-erdas-hfa seems to be used in at least some places online, 
so I'd suggest using that, at least for now


Nick


[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2016-03-02 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176652#comment-15176652
 ] 

Nick Burch commented on TIKA-1663:
--

The other parser decorators are specified with options inside the parent 
parser, eg mime includes or excludes are decorators given as options to the 
main parser. In some ways, this is quite nice, as you do the main definition on 
the thing that'll do the work, then the decorators after

One option, for the general case, would be to add additional decorators too, eg 
http://tika.apache.org/1.12/configuring.html#Configuring_Parsers becomes
{code}

  image/jpeg
  application/pdf
  
  
  

{code}

For the specific case of the digester, it's a well known thing, so we could 
give it custom tags. That would make things clearer, and would get round the 
parameter issue. One option is:
{code}

  image/jpeg
  application/pdf
  MD5,SHA256
  

{code}

The other to keep it more in line with the mime includes/excludes is:
{code}

  image/jpeg
  application/pdf
  MD5
  SHA256
  

{code}
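
Whichever spelling of the option is chosen, the work behind it is small. As a rough sketch only, this is what a digesting step might compute for a setting like `MD5,SHA256`, using nothing but the JDK's `java.security.MessageDigest`. The class name `DigestSketch`, the method `digestAll` and the `X-Digest:` key prefix are made up for illustration; they are not taken from Tika or from the attached patch.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class DigestSketch {

    /**
     * Hashes the data once per algorithm named in a comma separated list
     * such as "MD5,SHA256", returning hypothetical metadata key/value pairs.
     */
    static Map<String, String> digestAll(byte[] data, String algorithms) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : algorithms.split(",")) {
            String configName = name.trim().toUpperCase(Locale.ROOT);
            if (configName.isEmpty()) {
                continue;
            }
            // The JDK spells these SHA-1, SHA-256 etc, so map SHA256 -> SHA-256
            String jdkName = configName.replaceFirst("^SHA(\\d)", "SHA-$1");
            try {
                MessageDigest md = MessageDigest.getInstance(jdkName);
                // "X-Digest:" is an assumed key prefix, not an agreed Tika one
                fields.put("X-Digest:" + configName, toHex(md.digest(data)));
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalArgumentException("Unknown digest: " + configName, e);
            }
        }
        return fields;
    }

    /** Lower case hex encoding of a byte array. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

Calling `DigestSketch.digestAll(bytes, "MD5,SHA256")` gives one entry per requested algorithm, which is roughly the shape a decorator would need to copy into the Metadata object.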

What do people think?

> Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
> ---
>
> Key: TIKA-1663
> URL: https://issues.apache.org/jira/browse/TIKA-1663
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: digesting_parser_v1.patch
>
>
> It might be useful to integrate commons' DigestUtils and allow users to 
> easily add the MD5 or other supported hashes to the Metadata object.
> Anyone else find this of use?





[jira] [Commented] (TIKA-1882) Updating the tika-mimetypes.xml for new mime magic patterns

2016-03-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174295#comment-15174295
 ] 

Nick Burch commented on TIKA-1882:
--

I'm not sure the quicktime pattern is correct - I have some MOV files without 
either there, and some MP4s which do have it. (MP4 and Quicktime MOV are 
related formats)
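
To illustrate the concern, here is a small stand-alone sketch of the proposed rule ("free" or "wide" at byte position 4) in plain Java. The class `OffsetMagic` and its methods are hypothetical and only mimic an offset-based string match; they are not how Tika evaluates tika-mimetypes.xml. The rule only looks at the four-character type of the first atom, which is not something every MOV file has to start with.

```java
import java.nio.charset.StandardCharsets;

final class OffsetMagic {

    /** True if the ASCII pattern occurs in data starting exactly at offset. */
    static boolean matchesAt(byte[] data, int offset, String pattern) {
        byte[] magic = pattern.getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || data.length - offset < magic.length) {
            return false; // not enough bytes to hold the pattern
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** The rule from the issue: "free" or "wide" at byte position 4. */
    static boolean looksLikeProposedQuicktime(byte[] data) {
        return matchesAt(data, 4, "free") || matchesAt(data, 4, "wide");
    }
}
```

With made-up sample bytes, a file whose first atom is `free` matches the rule whether the rest of it is MOV or MP4, while a file whose first atom is something else (for example `ftyp`) does not match at all.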

> Updating the tika-mimetypes.xml for new mime magic patterns
> ---
>
> Key: TIKA-1882
> URL: https://issues.apache.org/jira/browse/TIKA-1882
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.11
>Reporter: Manisha Kampasi
>Priority: Minor
>  Labels: patch
>
> The following mime magic can be added to better detect the below mime-types:
> 1. vnd.ms-cab-compressed (.cab files) - pattern "MCSF" in the first 4 bytes
> 2. application/vnd.xara (.xar files) - pattern "xar!" in the first 4 bytes
> 3. application/x-mobipocket-ebook (.mobi files) - pattern "BOOKMOBI" starting 
> at byte position 60
> 4. video/quicktime (.mov files) - patterns "free" and "wide" seen starting at 
> byte position 4
> The changes can be seen here:
> https://github.com/mkampasi/tika/commit/f7433daf434a44937ba3ae8b15813a768f95e334





[jira] [Commented] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything

2016-02-27 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170579#comment-15170579
 ] 

Nick Burch commented on TIKA-1877:
--

Posting the whole modified tika mimetypes file isn't ideal - it's hard for us 
to see what has changed and what hasn't, especially given the file's large 
size. Would you be able to post a patch/diff showing just your changes, to help 
us review and possibly spot the issue?

(I tried diff'ing it to trunk, but got such a large number of changes I 
couldn't see what was supposed to be your change amongst them)

Ideally, also, it would be easier if you could write a short junit unit test 
showing the detection issue. That's generally much quicker and easier to test 
with, as well as having the bonus of providing a check to ensure that post-fix it 
stays fixed!
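
Short of a full JUnit test against Tika itself, the spacing problem from the report can be pinned down with a stand-alone check like the sketch below. It assumes (this is a guess from the description, not taken from the attached file) that the header is the keyword `SIMPLE`, some run of spaces, `=`, another run of spaces, then `T`. The names `FitsHeaderCheck` and `looksLikeFits` are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class FitsHeaderCheck {

    // SIMPLE, any number of spaces, '=', any number of spaces, 'T',
    // anchored at the very start of the data
    private static final Pattern HEADER = Pattern.compile("^SIMPLE *= *T");

    /** Checks only the first 80 bytes, tolerating different amounts of padding. */
    static boolean looksLikeFits(byte[] data) {
        int length = Math.min(data.length, 80);
        String head = new String(data, 0, length, StandardCharsets.US_ASCII);
        return HEADER.matcher(head).find();
    }
}
```

A check along these lines accepts both the widely padded and the tightly packed variants of the header, so it could serve as the expected behaviour in a real unit test once a small sample file is available.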

> On updating the tika-mimetypes.xml to detect .fts file format, tika detector 
> does not return anything
> -
>
> Key: TIKA-1877
> URL: https://issues.apache.org/jira/browse/TIKA-1877
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Reporter: Prasad Nagaraj Subramanya
>Priority: Minor
> Attachments: 
> 4E8D6B46E2366D7063DE3926AF0F976A0DCCD57A7E3B53B7D54768F16DD23984, 
> tika-mimetypes.xml
>
>
> The match value for .fts file format in tika-mimetypes.xml is "SIMPLE  =  
>   T".
> Tika detected a .fts file as application/octet-stream. On verifying the 
> header I found the value to be "SIMPLE  =T"(just 16 spaces 
> before = and T)
> I tried the following changes-
> Change 1) Updated the existing match value. But the build failed 
> Change 2) Added a new match value  type="string" offset="0"/> after the existing one.
> But now, tika returns empty value. It neither identifies the file as .fts nor 
> as application/octet-stream.





[jira] [Commented] (TIKA-1875) Updating tika-mimetypes.xml to detect .NC files

2016-02-27 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170572#comment-15170572
 ] 

Nick Burch commented on TIKA-1875:
--

As mentioned on list, there is a github pull for this: 
https://github.com/apache/tika/pull/78 (needs some more work before committing 
though)

> Updating tika-mimetypes.xml to detect .NC files 
> 
>
> Key: TIKA-1875
> URL: https://issues.apache.org/jira/browse/TIKA-1875
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.12
>Reporter: Prasad Nagaraj Subramanya
>Priority: Minor
>  Labels: patch
> Fix For: 1.11
>
>
> Adding magic number to detect .NC files





[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata

2016-02-26 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169005#comment-15169005
 ] 

Nick Burch commented on TIKA-1865:
--

Whatever we do, matching changes should be made to the other Email file format 
parsers to keep things consistent

I'm not sure we should be changing the existing keys to suddenly hold different 
values, that'll break backwards compatibility and likely confuse existing users

Maybe we should find a suitable metadata scheme for this, and add additional 
keys that hold the email addresses and the names in a way that they can be 
helpfully associated together?

> Save sender email address in Outlook MSG metadata
> -
>
> Key: TIKA-1865
> URL: https://issues.apache.org/jira/browse/TIKA-1865
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
> Environment: Windows 7 x64, jre 1.8.0_60 x64
>Reporter: Luis Filipe Nassif
>
> Sender email address is lost when extracting metadata from Outlook msg files. 
> Currently only sender name is extracted. That is important information to 
> extract for search engines.





[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata

2016-02-25 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167231#comment-15167231
 ] 

Nick Burch commented on TIKA-1865:
--

IIRC, the "fixed length properties" support needs to be completed before the 
sender email address can be extracted

> Save sender email address in Outlook MSG metadata
> -
>
> Key: TIKA-1865
> URL: https://issues.apache.org/jira/browse/TIKA-1865
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
> Environment: Windows 7 x64, jre 1.8.0_60 x64
>Reporter: Luis Filipe Nassif
>
> Sender email address is lost when extracting metadata from Outlook msg files. 
> Currently only sender name is extracted. That is important information to 
> extract for search engines.





[jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules

2016-02-25 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167123#comment-15167123
 ] 

Nick Burch commented on TIKA-1855:
--

Currently, we have most test documents in Tika Parsers, and a handful in Tika 
Core, which is sometimes confusing. We also end up with quite a lot of the unit 
tests for Tika Core actually being in the Tika Parsers test area, so that they 
can use the test documents in parsers which aren't in core. Based on my 
experiences with this (eg where I start putting things in the wrong module, 
initially can't find the right unit test etc), I find it non-ideal, and I 
suspect it's not intuitive at all for new contributors.

For the Ogg Vorbis stuff I maintain, I've opted to put all of the test files 
needed in {{core/src/test/resources}} then have the other maven modules (eg the 
Tika one and the Tools one) depend on the core-test artifact as a test-scope 
dependency in order for their unit tests to access the common set of test 
files. I find this actually works quite well, now I have it set up, and it 
seems ok for both InputStream and File based tests

So, given the above two, I would suggest that we put all of our test documents 
from core, parsers, server and bundle (all of which seem to have their own ones 
at the moment!) into a single artifact. We then depend on that artifact for all 
of our tests, with a test scope

> TIka 2.0 - Move shared test-code back to tika-core and distribute test files 
> to parser modules
> --
>
> Key: TIKA-1855
> URL: https://issues.apache.org/jira/browse/TIKA-1855
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Tim Allison
>
> Undo TIKA-1851, and divide test docs to appropriate parser modules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1873) Test Cases failed when tika-mimetypes.xml is changed

2016-02-25 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167012#comment-15167012
 ] 

Nick Burch commented on TIKA-1873:
--

Interesting stuff! I'd skip most container-based formats, and especially OLE2 
formats though. With OLE2 the only bit you can be sure of is the 512/4096 (1 
block) header at the start, which basically says "I'm OLE2". After that, you 
can put the blocks in any order, so one file could have the first bit of word 
data starting at 513 bytes, another could have that as the last 512 bytes of 
the file, and both are valid!
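
The one fixed part can be checked in a few lines. The sketch below tests only the 8-byte signature that OLE2 compound files start with (`D0 CF 11 E0 A1 B1 1A E1`). The class and method names are illustrative. It deliberately says nothing about whether the file is Word, Excel or Visio, since that information can sit in any later block.

```java
final class Ole2Signature {

    // The 8-byte signature at the start of an OLE2 compound file
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    /** True if the data starts with the OLE2 signature, i.e. "I'm OLE2" and no more. */
    static boolean isOle2(byte[] data) {
        if (data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Anything more specific than this needs the container to be opened and its directory entries read, which is why fixed-offset magic for the individual OLE2-based types tends to go wrong.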

> Test Cases failed when tika-mimetypes.xml is changed
> 
>
> Key: TIKA-1873
> URL: https://issues.apache.org/jira/browse/TIKA-1873
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: Antriksh Saxena
>  Labels: test
>
> The test cases were failing when tika was built after updating the 
> tika-mimetypes.xml. The failure logs are as follows.
> {code}
> TestContainerAwareDetector.testTruncatedFiles:395 
> expected: but was:
>   TestMimeTypes.testOLE2Detection:138->assertTypeByData:1045 
> expected: but was:
>   TestMimeTypes.testOldExcel:251->assertTypeByData:1045 
> expected: but was:
>   TestMimeTypes.testVisioDetection:305->assertTypeByNameAndData:1071 
> expected: but was:
>   ExcelParserTest.testExcel95:320 expected: but 
> was:
> {code}





[jira] [Commented] (TIKA-1873) Test Cases failed when tika-mimetypes.xml is changed

2016-02-24 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15166334#comment-15166334
 ] 

Nick Burch commented on TIKA-1873:
--

What changes did you make to the mime types file?

If you alter how files are detected which the unit tests check, then clearly 
those unit tests will (and should!) fail...

> Test Cases failed when tika-mimetypes.xml is changed
> 
>
> Key: TIKA-1873
> URL: https://issues.apache.org/jira/browse/TIKA-1873
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: Antriksh Saxena
>  Labels: test
>
> The test cases were failing when tika was built after updating the 
> tika-mimetypes.xml. The failure logs are as follows.
> {code}
> TestContainerAwareDetector.testTruncatedFiles:395 
> expected: but was:
>   TestMimeTypes.testOLE2Detection:138->assertTypeByData:1045 
> expected: but was:
>   TestMimeTypes.testOldExcel:251->assertTypeByData:1045 
> expected: but was:
>   TestMimeTypes.testVisioDetection:305->assertTypeByNameAndData:1071 
> expected: but was:
>   ExcelParserTest.testExcel95:320 expected: but 
> was:
> {code}





Re: PDFParser in-process mode

2016-02-24 Thread Nick Burch

On Wed, 24 Feb 2016, Pei Chen wrote:

Does the default PDF parser, when used via the auto detect parser, require 
Tika to run in server mode?


No

It seems to try and open an http connection to localhost:8080 by 
default?  Can it run in-process?


The stacktrace shows you're not using the PDF parser:


at 
org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:74)
at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:60)


See https://wiki.apache.org/tika/GrobidJournalParser for how to configure 
the grobid parser if you want to use it


Nick


[jira] [Resolved] (TIKA-1869) Jackson update to latest version

2016-02-24 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1869.
--
Resolution: Fixed

Thanks, patch applied

> Jackson update to latest version
> 
>
> Key: TIKA-1869
> URL: https://issues.apache.org/jira/browse/TIKA-1869
> Project: Tika
>  Issue Type: Bug
>  Components: translation
>Affects Versions: 1.11, 1.12
>Reporter: John Patrick
>  Labels: github-import, newbie, patch
> Fix For: 1.13
>
>
> Linked to TIKA-1868 this is to update the version of Jackson used from 2.4.0 
> to 2.7.1.





[jira] [Commented] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server

2016-02-24 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163117#comment-15163117
 ] 

Nick Burch commented on TIKA-1870:
--

Currently the class lacks javadocs to explain what it does, and seems to lack 
unit tests. Any chance you could knock up a patch to fix those two, then we can 
move it over? (Potentially you'd need to put the unit test in the Tika Parsers 
test package, to get access to the test documents, unless you just in-lined a 
small snippet of HTML to show the translation)

> Relocating RichTextContentHandler into tika-core from tika-server
> -
>
> Key: TIKA-1870
> URL: https://issues.apache.org/jira/browse/TIKA-1870
> Project: Tika
>  Issue Type: Bug
>  Components: core, server
>Reporter: John Patrick
>  Labels: newbie, patch
> Fix For: 1.13
>
>
> linked to TIKA-1868, different solution by refactoring class into tika-core 
> so don't need to depend upon tika-server and changing other classes used to 
> custom ones or other alternatives.





[jira] [Commented] (TIKA-1868) create clean tika-server jar and shaded classifier jar

2016-02-24 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163016#comment-15163016
 ] 

Nick Burch commented on TIKA-1868:
--

I'm not sure why you'd want to be using that Tika Server exception on its own? 
It's intended for the Tika Server only, which as stated you shouldn't be using 
except by running it

If you think that the RichTextContentHandler would be useful generally (i.e. 
outside the server), you should open a request to have that moved over to the 
core package

> create clean tika-server jar and shaded classifier jar
> --
>
> Key: TIKA-1868
> URL: https://issues.apache.org/jira/browse/TIKA-1868
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.11, 1.12
> Environment: n/a
>Reporter: John Patrick
>  Labels: github-import, maven, newbie, patch
> Fix For: 1.13
>
>
> If using tika-server-VERSION.jar as a standalone component it works. But if 
> you use it as a dependency so is included with other jars then it causes 
> classpath issues specifically around jackson.
> The project I'm working on is using Jackson 2.6.1, we have just added tika 
> but when adding tika-server-VERSION.jar we have discovered it contains 
> Jackson 2.4.0 classes.
> I've updated the maven build so two jars are now created.
> 1) tika-server-VERSION.jar correct clean jar
> 2) tika-server-VERSION-standalone.jar what was previously created
> This in my view is more in line with how maven should be used to create 
> jars, as the previous way restricted the consumer's ability to override maven 
> dependencies.
> I've also updated the documentation in source control that refs to 
> tika-server to include the new tika-server standalone jar. I realize other 
> documentation might also need to change.





[jira] [Commented] (TIKA-1869) Jackson update to latest version

2016-02-24 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162917#comment-15162917
 ] 

Nick Burch commented on TIKA-1869:
--

Could you try bumping the version in your own checkout of Tika head from git, 
and report back if all the unit tests still pass afterwards? That'll give us an 
idea of how much work doing the upgrade would be

> Jackson update to latest version
> 
>
> Key: TIKA-1869
> URL: https://issues.apache.org/jira/browse/TIKA-1869
> Project: Tika
>  Issue Type: Bug
>  Components: translation
>Affects Versions: 1.11, 1.12
>Reporter: John Patrick
>  Labels: github-import, newbie, patch
> Fix For: 1.13
>
>
> Linked to TIKA-1868 this is to update the version of Jackson used from 2.4.0 
> to 2.7.1.





[jira] [Commented] (TIKA-1868) create clean tika-server jar and shaded classifier jar

2016-02-24 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162915#comment-15162915
 ] 

Nick Burch commented on TIKA-1868:
--

As explained by several people on the mailing list, you shouldn't be depending 
on the Tika Server jar! It's intended as a standalone runnable server.

To include Tika in your own project, you should depend on {{tika-parsers}} if 
you want everything, or {{tika-core}} if you don't want any parsers or 
detectors (just core + mime), or the OSGi bundle if you're in an OSGi 
environment. You are welcome to depend on {{tika-parsers}} and exclude a few 
dependencies, if you don't want those specific parsers. Alternatively, the Tika 
2.x branch has the parsers split out into groupings, so you could have all 
parsers, or just a few.

> create clean tika-server jar and shaded classifier jar
> --
>
> Key: TIKA-1868
> URL: https://issues.apache.org/jira/browse/TIKA-1868
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.11, 1.12
> Environment: n/a
>Reporter: John Patrick
>  Labels: github-import, maven, newbie, patch
> Fix For: 1.13
>
>
> If using tika-server-VERSION.jar as a standalone component it works. But if 
> you use it as a dependency so is included with other jars then it causes 
> classpath issues specifically around jackson.
> The project I'm working on is using Jackson 2.6.1, we have just added tika 
> but when adding tika-server-VERSION.jar we have discovered it contains 
> Jackson 2.4.0 classes.
> I've updated the maven build so two jars are now created.
> 1) tika-server-VERSION.jar correct clean jar
> 2) tika-server-VERSION-standalone.jar what was previously created
> This in my view is more in line with how maven should be used to create 
> jars, as the previous way restricted the consumer's ability to override maven 
> dependencies.
> I've also updated the documentation in source control that refs to 
> tika-server to include the new tika-server standalone jar. I realize other 
> documentation might also need to change.





[jira] [Commented] (TIKA-1867) Tika external parsers cannot be turned off without patching the tika-app-XX.jar

2016-02-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159042#comment-15159042
 ] 

Nick Burch commented on TIKA-1867:
--

You should be able to exclude the CompositeExternalParser with a ~5 line Tika 
Config file, which requires no patching or jars. Just use default parser but 
with a parser exclude for that one parser

See http://tika.apache.org/1.12/configuring.html for more on how to configure 
Tika, including an example of how to disable just one parser in config

> Tika external parsers cannot be turned off without patching the 
> tika-app-XX.jar
> ---
>
> Key: TIKA-1867
> URL: https://issues.apache.org/jira/browse/TIKA-1867
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
>Reporter: Roman Kratochvil
>
> The CompositeExternalParser calls ExternalParsersFactory.create() which 
> always uses configuration from 
> org/apache/tika/parser/external/tika-external-parsers.xml. The issue is that 
> this introduces performance regression as the parser initialization checks 
> for presence of external commands (ffmpeg, exiftool) and that takes time.
> Unfortunately, there is no way how to turn off this functionality without 
> patching the tika-app JAR -- one has to either change the 
> tika-external-parsers.xml or remove the whole CompositeExternalParser from 
> list of services in /META-INF/services/org.apache.tika.parser.Parser.





[jira] [Commented] (TIKA-1864) org.apache.poi.hssf.record.formula.UnaryPlusPtg package for tika-app-1.10

2016-02-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15156790#comment-15156790
 ] 

Nick Burch commented on TIKA-1864:
--

First up, I'd suggest you upgrade to Apache Tika 1.12, which is recently out

Secondly, as covered in the Apache POI FAQ 
http://poi.apache.org/faq.html#faq-N1019C , all your Apache POI jars must be 
from the same version. It's not possible to run with some old jars and some new 
jars. You'll need to remove all of your old POI jars, in favour of the ones 
Tika provides. If you were previously on 2.5.1 (12 years old! 12 years! 
http://poi.apache.org/changes.html#2.5.1-FINAL), you'll need to recompile your 
code against the newer jars. POI does try to be backwards compatible, but not 
for releases from more than a decade ago...

> org.apache.poi.hssf.record.formula.UnaryPlusPtg package for tika-app-1.10
> -
>
> Key: TIKA-1864
> URL: https://issues.apache.org/jira/browse/TIKA-1864
> Project: Tika
>  Issue Type: Test
>Affects Versions: 1.10
>Reporter: Mohammed Manna
>Priority: Critical
>  Labels: hssf, poi, ss
>
> Hello,
> Due to legacy code issues, I had to remove POI-2.5.1-Final from my build path 
> as I was told that tika-app-1.10 will have all the necessary POI files for 
> the project. But then I got a build error in my ant script that said 
> `org.apache.poi.hssf.record.formula.UnaryPlusPtg' is missing from the build 
> path. I found out that the class is also in 
> `org.apache.poi.ss.formula.ptg.UnaryPlusPtg'. Can I replace them or is it 
> something I need a separate package for?
> KR,





[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-19 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154208#comment-15154208
 ] 

Nick Burch commented on TIKA-1607:
--

We have generally required those developing a parser to do more thinking, so 
that users of Tika don't need to. A random bytes bucket does seem to be going 
the other way, making it very easy for a parser developer to chuck random stuff 
into this "other" bucket, and putting all the work onto the now-confused user. 
So, like Ray, I'd advise against it

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap HashMap> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis





[jira] [Resolved] (TIKA-1862) Exception in thread "Thread-9" java.lang.UnsatisfiedLinkError: /usr/lib/jvm/jre/lib/amd64/headless/libmawt.so: libcups.so.2: cannot open shared object file: No such file

2016-02-19 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1862.
--
Resolution: Invalid

This isn't a Tika issue. You either need to fix your JVM installation, or talk 
to the provider of your JVM about why they need cups to do simple headless 
graphical calculations

> Exception in thread "Thread-9" java.lang.UnsatisfiedLinkError: 
> /usr/lib/jvm/jre/lib/amd64/headless/libmawt.so: libcups.so.2: cannot open 
> shared object file: No such file or directory
> --
>
> Key: TIKA-1862
> URL: https://issues.apache.org/jira/browse/TIKA-1862
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: Ubuntu 14.04.03 with openjdk-7-jre, 
> openjdk-7-jre-headless installed
>Reporter: Avinash
> Fix For: 1.13
>
>
> java -jar tika-app-1.11.jar --text testPDF_bookmarks.pdf
> Exception in thread "main" java.lang.UnsatisfiedLinkError: 
> /usr/lib/jvm/jre/lib/amd64/headless/libmawt.so: libcups.so.2: cannot open 
> shared object file: No such file or directory
> at java.lang.ClassLoader$NativeLibrary.load(Native Method)
> at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1965)
> at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1890)
> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1851)
> at java.lang.Runtime.load0(Runtime.java:795)
> at java.lang.System.load(System.java:1062)
> at java.lang.ClassLoader$NativeLibrary.load(Native Method)
> at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1965)
> at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1890)
> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1872)
> at java.lang.Runtime.loadLibrary0(Runtime.java:849)
> at java.lang.System.loadLibrary(System.java:1088)
> at 
> sun.security.action.LoadLibraryAction.run(LoadLibraryAction.java:67)
> at 
> sun.security.action.LoadLibraryAction.run(LoadLibraryAction.java:47)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.awt.Toolkit.loadLibraries(Toolkit.java:1657)
> at java.awt.Toolkit.(Toolkit.java:1686)
> at java.awt.Color.(Color.java:275)
> at org.apache.pdfbox.pdmodel.PDPage.(PDPage.java:79)
> at 
> org.apache.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:217)
> at 
> org.apache.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:185)
> at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:213)
> at 
> org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)
> It works if libcups.so is installed, but libcups needs avahi, which is not 
> recommended from a security standpoint.
> Why does PDF extraction need libmawt.so and/or libcups.so.2?





[jira] [Resolved] (TIKA-1856) Error while parsing an ogg file

2016-02-17 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1856.
--
   Resolution: Fixed
Fix Version/s: 1.13

The fix was fairly quick in the end, but the process of getting a new release 
out with the fix in wasn't :( After lots of annoyances, and one bug report to 
Maven, I've done a 0.8 release + bumped the dependency in Tika. With that done, 
the Tika App can detect these files without error (just with a warning about 
their truncation), and it can even get some simple metadata out!

> Error while parsing an ogg file
> ---
>
> Key: TIKA-1856
> URL: https://issues.apache.org/jira/browse/TIKA-1856
> Project: Tika
>  Issue Type: Bug
>  Components: detector, parser
>Affects Versions: 1.12
> Environment: python
>Reporter: Yash Tanna
>  Labels: newbie, tika
> Fix For: 1.13
>
> Attachments: 
> 1B7A7AE8FE999D22E2A677EFDA38982C8957CF77BEF3371E48852F7D67A7, 
> 1DE811ACAB8432D526EFE9D941E5EFE58F3C89F1AAB6CB7152091961DD854431, 
> 4600B9FF184F6AB71AA0CF6873E580FB0A31D75CE1218998057E9A185A5FFBB2, 
> 5E5892EA6C2B4A07BE998403A04127C7924E5539DB3EB0D27B9BD34D11A1575B, 
> CA3065B754E6CE79E4BF128464F4A202B0F2CF0336FBE73FA33F13776CD01CE8, 
> F036789D92EE18032556D9D0ECAC75073CED52226E1833001E379740E23E183D, 
> F33BFE4B1AF562D40E5B9D9F5D4B34EA6734F8F3A06F99535F100F957958D9BA, 
> F47F833BFD4A7E55C128DD76DB3666EEFFD0F5EDA24BF31D6F2427BA092D, 
> FA9D1D2B8D0FB50CFE306FA6024EC48BD771562878B9B70D38D106DF4E61147A
>
>
> Unable to detect a malformed ogg file. The error thrown was 
> Exception in thread "main" java.io.IOException: Asked to read 4335 bytes
> from 0 but hit EoF at 780
> at org.gagravarr.ogg.IOUtils.readFully(IOUtils.java:39)
> at org.gagravarr.ogg.IOUtils.readFully(IOUtils.java:31)
> at org.gagravarr.ogg.OggPage.<init>(OggPage.java:82)
> at
> org.gagravarr.ogg.OggPacketReader.getNextPacket(OggPacketReader.java:116)
> at org.gagravarr.tika.OggDetector.detect(OggDetector.java:97)
> at
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
> at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:291)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
> [xdatadeploy@xdata upload]$





[jira] [Commented] (TIKA-1859) file poi reads tika does not bring the content

2016-02-17 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150596#comment-15150596
 ] 

Nick Burch commented on TIKA-1859:
--

Which file? How isn't it working? How are you calling Apache Tika? Did you try 
other ways? How can we reproduce your issue?

> file poi reads tika does not bring the content
> --
>
> Key: TIKA-1859
> URL: https://issues.apache.org/jira/browse/TIKA-1859
> Project: Tika
>  Issue Type: Bug
>  Components: handler
>Affects Versions: 1.12
>Reporter: Movses
>Priority: Blocker
> Fix For: 1.12
>
>
> I have an xlsx file that I'm able to read and process using POI, but Tika 
> does not extract the content of the file





[jira] [Commented] (TIKA-1858) Unable to extract content from chunked portion of large file

2016-02-17 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150370#comment-15150370
 ] 

Nick Burch commented on TIKA-1858:
--

Other than a handful of text-based file types, Tika will need the whole file in 
order to be able to process it. Other file formats are simply not built to 
support chunk-based processing
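
To illustrate the point, here is a minimal sketch using only the JDK. It assumes a ZIP container (the packaging that OOXML files such as .xlsx use) as a stand-in for "a format not built for chunking"; the class and method names are hypothetical and nothing here is Tika code. It builds a small archive in memory and then tries to read just the first half of its bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical demo class; not part of Tika. */
final class ChunkedZipDemo {
    private ChunkedZipDemo() {}

    /** Builds an in-memory ZIP with two entries of incompressible data. */
    static byte[] buildZip() throws IOException {
        Random random = new Random(42); // fixed seed keeps the demo repeatable
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"part1.bin", "part2.bin"}) {
                byte[] data = new byte[4096];
                random.nextBytes(data);
                zip.putNextEntry(new ZipEntry(name));
                zip.write(data);
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Counts entries that can be read to the end; stops at the first I/O error. */
    static int countReadableEntries(byte[] zipBytes) {
        int count = 0;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buffer = new byte[1024];
            while (zip.getNextEntry() != null) {
                while (zip.read(buffer) != -1) {
                    // drain the entry so truncated data is actually noticed
                }
                count++;
            }
        } catch (IOException truncatedOrCorrupt) {
            // fall through and report how many entries were readable
        }
        return count;
    }
}
```

Calling `countReadableEntries(buildZip())` sees both entries, while passing only the first half of the same bytes does not, because the second entry is cut off mid-stream. A parser handed one chunk of a container file is in the same position.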

> Unable to extract content from chunked portion of large file
> 
>
> Key: TIKA-1858
> URL: https://issues.apache.org/jira/browse/TIKA-1858
> Project: Tika
>  Issue Type: Bug
>Reporter: raghu
>
> Hi All,
> We are using Tika server (REST-based API) to extract content in a .NET 
> application.
> We need to extract content from a very large file (500MB). We want to split this 
> file into chunks and pass each chunk to Tika, but we are not able to get any 
> result from Tika. 
> Please help me.





[jira] [Commented] (TIKA-1856) Error while parsing an ogg file

2016-02-16 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148629#comment-15148629
 ] 

Nick Burch commented on TIKA-1856:
--

Picking one of those files to look at, {{oggz-info}} processes it without 
warning. {{ogginfo}} warns about the EOS being missing on both streams, but 
otherwise gives no errors

Trying with mplayer, it reports some issues with the file:
{code}
[vorbis @ 0x7f1470f5cb00]partition out of bounds: type, begin, end, size, 
blocksize: 2, 0, 192, 16, 1024
[vorbis @ 0x7f1470f5cb00] Vorbis setup header packet corrupt (residues). 
[vorbis @ 0x7f1470f5cb00]Setup header corrupt.
Could not open codec.
{code}

Do you know where these files came from? It looks like they have been truncated 
somehow, could that be the case? 

(If so, we'd probably just need to improve the truncation error handling)
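
As a rough sketch of what friendlier truncation handling could look like (an assumption for illustration only: the class and method names below are hypothetical, and this is not the VorbisJava or Tika code), a reader can separate "the stream ended early" from other I/O failures, so the caller can warn and carry on:

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper; not the VorbisJava or Tika API. */
final class TruncationAwareReader {
    private TruncationAwareReader() {}

    /**
     * Fills buf completely, or throws EOFException reporting how many bytes
     * were available before the stream ended.
     */
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int read = 0;
        while (read < buf.length) {
            int n = in.read(buf, read, buf.length - read);
            if (n < 0) {
                throw new EOFException("Asked to read " + buf.length
                        + " bytes but hit EOF at " + read);
            }
            read += n;
        }
    }

    /**
     * Returns true if a full page of pageSize bytes could be read, false if
     * the stream was truncated. Other I/O errors still propagate.
     */
    static boolean tryReadPage(InputStream in, int pageSize) throws IOException {
        try {
            readFully(in, new byte[pageSize]);
            return true;
        } catch (EOFException truncated) {
            return false; // caller can log a warning and use what it already has
        }
    }
}
```

With the numbers from the report (4335 bytes requested, 780 available), `tryReadPage` would return false rather than failing detection outright.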

> Error while parsing an ogg file
> ---
>
> Key: TIKA-1856
> URL: https://issues.apache.org/jira/browse/TIKA-1856
> Project: Tika
>  Issue Type: Bug
>  Components: detector, parser
>Affects Versions: 1.12
> Environment: python
>Reporter: Yash Tanna
>  Labels: newbie, tika
> Attachments: 
> 1B7A7AE8FE999D22E2A677EFDA38982C8957CF77BEF3371E48852F7D67A7, 
> 1DE811ACAB8432D526EFE9D941E5EFE58F3C89F1AAB6CB7152091961DD854431, 
> 4600B9FF184F6AB71AA0CF6873E580FB0A31D75CE1218998057E9A185A5FFBB2, 
> 5E5892EA6C2B4A07BE998403A04127C7924E5539DB3EB0D27B9BD34D11A1575B, 
> CA3065B754E6CE79E4BF128464F4A202B0F2CF0336FBE73FA33F13776CD01CE8, 
> F036789D92EE18032556D9D0ECAC75073CED52226E1833001E379740E23E183D, 
> F33BFE4B1AF562D40E5B9D9F5D4B34EA6734F8F3A06F99535F100F957958D9BA, 
> F47F833BFD4A7E55C128DD76DB3666EEFFD0F5EDA24BF31D6F2427BA092D, 
> FA9D1D2B8D0FB50CFE306FA6024EC48BD771562878B9B70D38D106DF4E61147A
>
>
> Unable to detect a malformed ogg file. The error thrown was 
> Exception in thread "main" java.io.IOException: Asked to read 4335 bytes
> from 0 but hit EoF at 780
> at org.gagravarr.ogg.IOUtils.readFully(IOUtils.java:39)
> at org.gagravarr.ogg.IOUtils.readFully(IOUtils.java:31)
> at org.gagravarr.ogg.OggPage.<init>(OggPage.java:82)
> at
> org.gagravarr.ogg.OggPacketReader.getNextPacket(OggPacketReader.java:116)
> at org.gagravarr.tika.OggDetector.detect(OggDetector.java:97)
> at
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
> at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:291)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
> [xdatadeploy@xdata upload]$





Re: Need project suggestions to contribute to Apache Tika

2016-02-13 Thread Nick Burch

On Fri, 12 Feb 2016, Prasad N S wrote:

I have over 5 years of experience in software development. My favorite
language is Java, though I am comfortable with Python too. I have worked on
a range of databases from relational to NoSQL and distributed systems. I am
a quick learner and open to learn new technologies.

I am here to kick start my contributions to Open Source projects. Please
let me know if there are any small projects or bug fixes that I can get
started with.


First up I'd suggest you work through the "5 minute" parser guide, to get 
happy with adding new mime types to Tika, adding new parsers, that sort of 
thing:

http://tika.apache.org/1.11/parser_guide.html

You may hit some issues on the way, if so, please try the troubleshooting 
guide to assist:

http://wiki.apache.org/tika/Troubleshooting%20Tika

Then report back / contribute fixes to the 5 minute guide + 
troubleshooting guide!



I've seen a few queries on the Tika Python stuff recently, so if you know 
python, you could try with that. Take a look at the "apache-tika" tag on 
StackOverflow to get an idea of the problems people are having, areas 
where we need more examples, areas where the docs need work, that sort of 
thing


Once you're up to speed with all that, it really depends on what you're 
interested in. If there's some formats you use in your personal life / 
other research that aren't supported, have a go at adding mime magic then 
a parser. If there's something with limited support you're interested in, 
have a go at expanding it. If you're into Big Data, help with Tika Batch 
and Tika Eval, or maybe with integrations with things like Behemoth or 
Storm Crawler. If you're just generally interested, take a look at the 
Tika Batch+Eval reports, find an interesting looking failure / exception / 
etc, and dive in!



Oh, and one other possible thing - rework this email slightly, put it on 
the wiki as a "how to get started contributing" guide, invite others to 
help, and expand it as you learn :)


Nick


Re: scm info in pom.xml

2016-02-11 Thread Nick Burch

On Sat, 6 Feb 2016, Ken Krugler wrote:

I'm revisiting the creation of a new tika-langdetect module in the 2.x branch, 
and have created a pom.xml

But in looking at what I started with (from tika-translate), I see this:

 
 <scm>
   <url>http://svn.apache.org/viewvc/tika/trunk/tika-langdetect</url>
   <connection>scm:svn:http://svn.apache.org/repos/asf/tika/trunk/tika-langdetect</connection>
   <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/trunk/tika-langdetect</developerConnection>
 </scm>
 

What's the plan (if any) for switching to git details in poms?


I think it needs fixing in both trunk and the 2.x branch, since we're on 
Git for both


Nick


[jira] [Commented] (TIKA-1850) Tika erroneously detects some versions of jQuery as "text/html"

2016-02-04 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132249#comment-15132249
 ] 

Nick Burch commented on TIKA-1850:
--

It's showing up for me in the snapshots repo - see 
https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-parsers/1.13-SNAPSHOT/

1.12 is being voted on now, but the commits for this were done after the 1.12 
release candidates were cut. Unless there has to be a re-creation of the RCs, 
expect it in 1.13 in 2-4 months

> Tika erroneously detects some versions of jQuery as "text/html"
> ---
>
> Key: TIKA-1850
> URL: https://issues.apache.org/jira/browse/TIKA-1850
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.11
> Environment: {code}
> ProductName:  Mac OS X
> ProductVersion:   10.11.3
> BuildVersion: 15D21
> {code}
>Reporter: Boris Slobodin
>
> This sets the wrong {{Content-Type}} on S3 as a result, for example, when 
> using s3_website and breaks some browsers like IE.
> {code}
> ➜  wget https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js -O 
> jquery-1.7.1.min.js
> --2016-02-02 15:21:33--  
> https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js
> Resolving ajax.googleapis.com... 216.58.193.106, 2607:f8b0:400a:801::200a
> Connecting to ajax.googleapis.com|216.58.193.106|:443... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/javascript]
> Saving to: 'jquery-1.7.1.min.js'
> jquery-1.7.1.min.js[  <=> 
>  ]  91.67K   323KB/sin 0.3s
> 2016-02-02 15:21:33 (323 KB/s) - 'jquery-1.7.1.min.js' saved [93868]
> {code}
> {code}
> ➜  wget https://ajax.googleapis.com/ajax/libs/jquery/1.12.0/jquery.min.js -O 
> jquery-1.12.0.min.js
> --2016-02-02 15:22:10--  
> https://ajax.googleapis.com/ajax/libs/jquery/1.12.0/jquery.min.js
> Resolving ajax.googleapis.com... 216.58.193.106, 2607:f8b0:400a:801::200a
> Connecting to ajax.googleapis.com|216.58.193.106|:443... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/javascript]
> Saving to: 'jquery-1.12.0.min.js'
> jquery-1.12.0.min.js   [ <=>  
>  ]  95.08K  --.-KB/sin 0.03s
> 2016-02-02 15:22:10 (3.30 MB/s) - 'jquery-1.12.0.min.js' saved [97362]
> {code}
> {code}
> ➜  wget https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js -O 
> jquery-2.2.0.min.js
> --2016-02-02 15:22:24--  
> https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js
> Resolving ajax.googleapis.com... 216.58.193.106, 2607:f8b0:400a:801::200a
> Connecting to ajax.googleapis.com|216.58.193.106|:443... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/javascript]
> Saving to: 'jquery-2.2.0.min.js'
> jquery-2.2.0.min.js[ <=>  
>  ]  83.58K  --.-KB/sin 0.02s
> 2016-02-02 15:22:24 (3.39 MB/s) - 'jquery-2.2.0.min.js' saved [85589]
> {code}
> {color:red}{{jquery-1.7.1.min.js}}{color}
> {code}
> ➜  java -jar tika-app-1.11.jar --detect jquery-1.7.1.min.js
> text/html
> {code}
> {color:green}{{jquery-1.12.0.min.js}}{color}
> {code}
> ➜  java -jar tika-app-1.11.jar --detect jquery-1.12.0.min.js
> application/javascript
> {code}
> {color:green}{{jquery-2.2.0.min.js}}{color}
> {code}
> ➜  java -jar tika-app-1.11.jar --detect jquery-2.2.0.min.js
> application/javascript
> {code}
> Thank you.





[jira] [Commented] (TIKA-1848) Address issues with Tika 1.12rc#1

2016-02-03 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130376#comment-15130376
 ] 

Nick Burch commented on TIKA-1848:
--

I'm not sure if our test files should have license headers in them, especially 
not if it'll break the things we're using to test for! Since we're not adding 
license metadata to our PNGs, our Ogg files or our Office documents (for just a 
few examples), I don't see why we should be monkeying with the HTML ones only?

The Charset stuff doesn't have our standard header, as it's third party 
(suitably licensed) code that we've incorporated + re-packaged + bugfixed

Is it worth getting DRAT to pull in the excludes we've put into the POMs that 
normal RAT uses?

> Address issues with Tika 1.12rc#1
> -
>
> Key: TIKA-1848
> URL: https://issues.apache.org/jira/browse/TIKA-1848
> Project: Tika
>  Issue Type: Task
>Affects Versions: 1.12
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
>
> The following files for the 1.12rc#1 have unsuitable license headers
> {code}
>   /usr/local/drat/deploy/data/jobs/rat/1454458514778/input/testJAVA.java
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458508087/input/CharsetDetector.java
>   /usr/local/drat/deploy/data/jobs/rat/1454458508087/input/CharsetMatch.java
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458508087/input/CharsetRecog_2022.java
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458508087/input/CharsetRecog_UTF8.java
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458508087/input/CharsetRecog_Unicode.java
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458508087/input/CharsetRecog_mbcs.java
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458508087/input/CharsetRecog_sbcs.java
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458508087/input/CharsetRecognizer.java
>   /usr/local/drat/deploy/data/jobs/rat/1454458515805/input/big-preamble.html
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458515805/input/boilerplate-whitespace.html
>   /usr/local/drat/deploy/data/jobs/rat/1454458515805/input/boilerplate.html
>   /usr/local/drat/deploy/data/jobs/rat/1454458515805/input/resume.html
>   /usr/local/drat/deploy/data/jobs/rat/1454458515805/input/test-tika-327.html
>   /usr/local/drat/deploy/data/jobs/rat/1454458515805/input/test.html
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458515805/input/testHTMLNoisyMetaEncoding_1.html
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458515805/input/testHTMLNoisyMetaEncoding_2.html
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458515805/input/testHTMLNoisyMetaEncoding_3.html
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458515805/input/testHTMLNoisyMetaEncoding_4.html
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458515805/input/testJsonMultipleInts.html
>   
> /usr/local/drat/deploy/data/jobs/rat/1454458515805/input/testlargerbuffer.html
>   /usr/local/drat/deploy/data/jobs/rat/1454458515805/input/tika434.html
> {code}





[jira] [Resolved] (TIKA-1821) Problem in Tika().detect for xml file signed in CADES

2016-02-03 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1821.
--
   Resolution: Fixed
Fix Version/s: 1.13

Thanks for these, I've used them to add unit tests which verify that we now 
correctly detect all the files with and without the filename

> Problem in Tika().detect for xml file signed in CADES
> -
>
> Key: TIKA-1821
> URL: https://issues.apache.org/jira/browse/TIKA-1821
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.8
>Reporter: Alessandro De Angelis
> Fix For: 1.13
>
> Attachments: test-tika-error-v2.xml.p7m, test-tika-error-v3.xml.p7m, 
> test-tika-error-v4.xml.p7m, test-tika-error.xml.p7m
>
>
> We have an XML file with a base64 attachment, signed with a CADES signature. 
> In this case Tika recognizes the resulting file's mime type as "text/plain" and 
> not "application/pkcs7-signature" as we expected.





[jira] [Commented] (TIKA-1850) Tika erroneously detects some versions of jQuery as "text/html"

2016-02-03 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130678#comment-15130678
 ] 

Nick Burch commented on TIKA-1850:
--

Looks like a duplicate to me, are you happy to close it as such?

> Tika erroneously detects some versions of jQuery as "text/html"
> ---
>
> Key: TIKA-1850
> URL: https://issues.apache.org/jira/browse/TIKA-1850
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.11
> Environment: {code}
> ProductName:  Mac OS X
> ProductVersion:   10.11.3
> BuildVersion: 15D21
> {code}
>Reporter: Boris Slobodin
>





[jira] [Commented] (TIKA-1141) javascript files that contain "<html" are detected as text/html

2016-02-03 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130723#comment-15130723
 ] 

Nick Burch commented on TIKA-1141:
--

I've tweaked the mime magic for HTML, so we give

> javascript files that contain "<html" are detected as text/html
> ---
>
> Key: TIKA-1141
> URL: https://issues.apache.org/jira/browse/TIKA-1141
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.2
>Reporter: David Hara
>Priority: Minor
>
> The Mimetypes detector will return text/html as the mimetype for any 
> javascript file that contains the string "<html" in it. I believe this is due 
> to the rule  in the 
> tika-mimetypes.xml file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1850) Tika erroneously detects some versions of jQuery as "text/html"

2016-02-03 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130860#comment-15130860
 ] 

Nick Burch commented on TIKA-1850:
--

Please grab a nightly build / build from git, and check - the test jquery files 
mentioned here and in the other bug now detect correctly for me when the 
filename is given
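
To show why supplying the filename helps, here is a toy sketch (loudly an assumption: this is not Tika's detection logic, and the class, method names and rules are made up for illustration). A weak content-based guess is overridden when the file name carries a more specific hint:

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Toy detector for illustration only; not how Tika's detectors are implemented. */
final class NameAwareDetector {
    private NameAwareDetector() {}

    /** Content-only guess: anything containing "<html" looks like HTML. */
    static String detectFromContent(byte[] prefix) {
        String text = new String(prefix, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        return text.contains("<html") ? "text/html" : "text/plain";
    }

    /** Content guess refined by the file name, when one is available. */
    static String detect(byte[] prefix, String fileName) {
        String fromContent = detectFromContent(prefix);
        boolean looksLikeJs = fileName != null
                && fileName.toLowerCase(Locale.ROOT).endsWith(".js");
        if (looksLikeJs && fromContent.startsWith("text/")) {
            return "application/javascript"; // name hint beats a generic text match
        }
        return fromContent;
    }
}
```

For a script containing the string `"<html><body>"`, `detect(bytes, null)` gives `text/html`, while `detect(bytes, "jquery-1.7.1.min.js")` gives `application/javascript`.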

> Tika erroneously detects some versions of jQuery as "text/html"
> ---
>
> Key: TIKA-1850
> URL: https://issues.apache.org/jira/browse/TIKA-1850
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.11
> Environment: {code}
> ProductName:  Mac OS X
> ProductVersion:   10.11.3
> BuildVersion: 15D21
> {code}
>Reporter: Boris Slobodin
>





[jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX

2016-02-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126532#comment-15126532
 ] 

Nick Burch commented on TIKA-1841:
--

Ideally we would break out the header and footer into separate divs/paragraphs 
within the slide's contents. If you can tweak the code to do that, please do! 
If only one format makes it easy, do it "right" for that one, and add a TODO 
for the other

Otherwise, assuming no last minute objections (eg from [~talli...@mitre.org]), 
then go ahead with your plan, and submit a pull request once it's all ready + 
unit tested!

> Different XML output structure for PPT and PPTX
> ---
>
> Key: TIKA-1841
> URL: https://issues.apache.org/jira/browse/TIKA-1841
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Sam H
>
> Issue is slightly related to TIKA-1840
> I've noticed that the XML structure of Powerpoint (PPT) and PPTX files is 
> different. 
> The structure for PPTX seems as follows:
> {code}
> 
> 
>  //optional
>  //optional
> ...
> 
> 
>  //optional
>  //optional
> {code}
> Note that there's no parent slide element to indicate the start and end of 
> each slide.
> For powerpoint the structure is as follows:
> {code}
> 
>   
> 
> 
>  //added in TIKA-1840
>  
>   
>   ...
>   
> 
> 
>  //added in TIKA-1840
> 
>   
> 
> 
> {code}
> In my application, I'm using XPath to get the desired information. As the 
> XML structure is different, I have to differentiate my XPath queries whether 
> the file is PPT (old) or PPTX (new). It would be nice for Tika to return the 
> same XML for both.
> I would propose changing the XML structure to this:
> {code}
> 
>   
> 
> 
>  //added in TIKA-1840
>  
>   
>   ...
>   
> 
> 
>  //added in TIKA-1840
> 
>   
> 
> {code}
> So, essentially, like the current PPT output, but without the list of notes 
> at the end (as this is also omitted for PPTX).
> On the one hand this generalizes PPT(X) handling, on the other it can break 
> existing (external) functionality relying on a specific XML output format.
> I don't know if this is something the project wants fixed or not. If so, I'm 
> willing to donate my time.





[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126317#comment-15126317
 ] 

Nick Burch commented on TIKA-1845:
--

Near the top of the jira page are some buttons, please hit "More" then "Attach 
Files", and then upload the smallest file you have which triggers the issue. We 
can then use that for investigating, testing and (hopefully!) later unit 
testing of fixes.

> Unable to extract content from certain RTFs using tika-server versions since 
> 1.5 
> -
>
> Key: TIKA-1845
> URL: https://issues.apache.org/jira/browse/TIKA-1845
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.6, 1.9, 1.11
> Environment: Windows
>Reporter: Ian Williams
>
> I have some patient letters that are RTF documents.  When I extract the text 
> from these documents using tika-server-1.5.jar, it works fine.
> However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 
> 1.11), it fails with the stack trace and error shown below.
> I can provide a sample RTF that is failing.  I'm not sure how to attach files 
> to this issue so here is a link to an Evernote note containing an example RTF 
> that fails:
> https://www.evernote.com/shard/s66/sh/4a003611-2400-4959-a1cc-2be5b3efe2cf/284a6f2dd3e0a290
> I wondered whether the error might be related to the following change that 
> was introduced in 1.6?:
>   * Made RTFParser's list handling slightly more robust against corrupt
> list metadata (TIKA-1305)
> It's possible that there is some issue with the RTF documents, but they are 
> real patient letters and they open in Microsoft Word without any problems.
> Many thanks
> Ian
> Steps to reproduce issue
> 
> 1. HTTP PUT to Tika server using curl:
> C:\Downloads\Apache Tika>curl -X PUT --data-binary 
> @test-anonymised-letter.rtf http://localhost:9998/tika --header 
> "Content-Type: application/rtf" --header "Accept: text/plain"
> --> this works fine when running tika-server-1.5.jar, but fails with 
> tika-server-1.6.jar
> 2. Screen capture from the server:
> INFO: Starting Apache Tika 1.9 server
> Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
> INFO: Setting the server's publish address to be http://localhost:9998/
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: jetty-8.y.z-SNAPSHOT
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Started SelectChannelConnector@localhost:9998
> Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
> INFO: Started
> Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource 
> logRequest
> INFO: tika (application/rtf)
> Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.rtf.RTFParser@32a6dc
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
> at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at 
> org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
> at 
> org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
> at 
> org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
> at 
> org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
> at 
> org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
> at 
> org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
> at 
> org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
> at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
> at 
> org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
> at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
> at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
> at 
> org.a

[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy

2016-02-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126450#comment-15126450
 ] 

Nick Burch commented on TIKA-1843:
--

Ideally you'd work with the Sigrun owner to have them do it - it's best if the 
people who "own" the code and "do" the releases are also the ones who push the 
files to Maven central. (Doesn't have to be, there is the third party process, 
but it's certainly preferred)

If I were you, I'd review the docs, then suggest any POM fixes to them. Once 
those are in, work with the Sigrun team to get them to request their access + 
get things uploaded

If you need an example project to crib from for the pom, my own 
https://github.com/Gagravarr/VorbisJava/blob/master/parent/pom.xml is one place 
you could start (amongst others!)

> Tika parser for SEG-Y files and new MIME type application/segy
> --
>
> Key: TIKA-1843
> URL: https://issues.apache.org/jira/browse/TIKA-1843
> Project: Tika
>  Issue Type: New Feature
>  Components: mime, parser
>Reporter: Giovanni Usai
>Priority: Minor
>
> This ticket refers to the parsing of SEG-Y files (extensions .seg, .segy and 
> .sgy). 
> The SEG-Y format is used to store seismic data, you can find more information 
> here http://pubs.usgs.gov/of/2001/of01-326/HTML/FILEFORM.HTM.
> I have:
> - added a new MIME type application/segy matching the file name extensions 
> .segy, .seg and .sgy.
> - created a new SEGYParser, matching that MIME type.
> In order to parse the SEG-Y files, I am using a modified version of the 
> sigrun code (available under Apache license, here 
> https://github.com/mikhail-aksenov/sigrun). Notably I have done a fix and 
> changed some method signatures to be able to read from a ReadableByteChannel 
> instead of FileChannel.
> For the moment I have put it directly into the new Tika's segy package. Is 
> this the right thing to do or should I reference it as external library thus 
> modifying the pom.xml?
> Thanks and best regards,
> Giovanni



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy

2016-01-29 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123552#comment-15123552
 ] 

Nick Burch commented on TIKA-1843:
--

Looks like Sigrun is an active project, so best bet would be to submit Github 
pull requests to them to add the `ReadableByteChannel` support. Then, once 
they've added that + released, we'll add a Tika dependency to that + add the 
parser code

ASF best-practice is to avoid forking upstream projects + bundling modified 
versions whenever possible, so putting customised versions of Sigrun classes in 
the Tika segy package should be avoided. Much better to get them to accept the 
fixes upstream!
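
To illustrate why accepting a `ReadableByteChannel` is the more general 
signature, here is a minimal sketch using only JDK classes. The class and 
method names are hypothetical and are not Sigrun's actual API; the point is 
only that `FileChannel` implements `ReadableByteChannel`, so existing 
file-based callers keep working while stream-based callers (such as Tika) can 
use the same method.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Hypothetical helper for illustration only, not part of Sigrun or Tika
class ChannelHeaderReader {

    /** Reads exactly 'length' bytes from any channel, or fails if it ends early. */
    static byte[] readFully(ReadableByteChannel channel, int length) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        // A single read may return fewer bytes than requested, so loop until full
        while (buffer.hasRemaining()) {
            if (channel.read(buffer) < 0) {
                throw new IOException("Stream ended after " + buffer.position() + " bytes");
            }
        }
        return buffer.array();
    }

    public static void main(String[] args) throws IOException {
        // An InputStream (what a Tika parser is handed) wrapped as a channel
        byte[] data = {1, 2, 3, 4, 5};
        ReadableByteChannel channel = Channels.newChannel(new ByteArrayInputStream(data));
        System.out.println(readFully(channel, 4).length);
    }
}
```

A caller holding a `FileChannel` could pass it to the same method unchanged, 
which is why this kind of change is usually easy for an upstream project to 
accept.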






[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy

2016-01-29 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123612#comment-15123612
 ] 

Nick Burch commented on TIKA-1843:
--

Getting a maven-built project into the Sonatype OSS repo for maven use isn't 
too bad. Ideally we'd work with the Sigrun team to get their POM into shape so 
it can be released as per http://central.sonatype.org/pages/ossrh-guide.html , 
otherwise we can take over and upload it for them as a third party. Ask on the 
dev list for help with any of those if needed, we've several people well 
experienced in both routes!






[jira] [Resolved] (TIKA-1823) Support detecting DWF format

2016-01-26 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1823.
--
   Resolution: Fixed
Fix Version/s: 1.13

Thanks, I've added this magic, along with a unit test, and some more specific 
magic which can give version information too, in 38fbc504 & 6a092332
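
As a rough illustration of how version-specific magic can work for this 
format, here is a minimal sketch based only on the signatures quoted in the 
issue below, `(DWF V06.00)` and `(DWF V00.55)`. The helper class is 
hypothetical: Tika itself expresses this as match rules in 
tika-mimetypes.xml rather than Java code, and this is not the content of the 
commits mentioned above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not how Tika implements detection
class DwfMagic {
    private static final byte[] PREFIX = "(DWF V".getBytes(StandardCharsets.US_ASCII);

    /** Returns the version text such as "06.00", or null if the header is not DWF. */
    static String detectVersion(byte[] header) {
        // Need the prefix, five version characters, and the closing bracket
        if (header.length < PREFIX.length + 6) {
            return null;
        }
        for (int i = 0; i < PREFIX.length; i++) {
            if (header[i] != PREFIX[i]) {
                return null;
            }
        }
        if (header[PREFIX.length + 5] != ')') {
            return null;
        }
        return new String(header, PREFIX.length, 5, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] zipBased = "(DWF V06.00)PK".getBytes(StandardCharsets.US_ASCII);
        byte[] older = "(DWF V00.55)".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectVersion(zipBased));
        System.out.println(detectVersion(older));
    }
}
```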

> Support detecting DWF format
> 
>
> Key: TIKA-1823
> URL: https://issues.apache.org/jira/browse/TIKA-1823
> Project: Tika
>  Issue Type: Improvement
>  Components: detector, mime
>Reporter: Luca Moretti
>Priority: Minor
>  Labels: detection, dwf, mime
> Fix For: 1.13
>
> Attachments: blocks_and_tables.dwf
>
>
> Tika currently detects dwf files as application/octet-stream.
> To make Tika's mime magic detector correctly recognize dwf files, this code 
> fragment should be added to the _tika-mimetypes.xml_ registry:
> {code:xml}
> 
>   dwf
>   <_comment>Design Web Format
>   
>   
>   
>   
>   
>   
>   
>   
> 
> {code}
> \\
> In the current version (DWF 6.0), a dwf file is a ZIP-compressed container for 
> vector-based CAD drawings. It is basically a ZIP archive with the _(DWF 
> V06.00)_ signature added before the regular ZIP magic number. For this 
> reason, the match value to detect dwf files should be: {{(DWF V06.00)PK}}.
> In the previous versions, the dwf data transport isn't a ZIP file format, so 
> the magic number is only the _(DWF V00.55)_ signature in the file header.
> To make Tika detect dwf files with this version too I propose the match value 
> in the code above.
> Thanks,
> Luca
> \\
> P.S.: The DWF format specification is included in the DWF Toolkit. The DWF 
> Toolkit is available for free at [http://www.autodesk.com/dwftoolkit]





[jira] [Reopened] (TIKA-1840) No way to link slide notes to slide in PPT output.

2016-01-25 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch reopened TIKA-1840:
--

Re-opening as the applied patch causes the notes text to be included twice, 
which isn't ideal, so further work still remains. (Details on the github 
request)

> No way to link slide notes to slide in PPT output.
> --
>
> Key: TIKA-1840
> URL: https://issues.apache.org/jira/browse/TIKA-1840
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Sam H
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
>
> I'm integrating Apache Tika into my project, and I want to extract (text) 
> information from PowerPoint slides, both PPT and PPTX.
> I've noticed when using PPT format, the slide notes are all aggregated at the 
> end of the XML output, and there is no way to identify which note belongs to 
> which slide.
> I began looking at the code and found the following:
> {code}
> // TODO Find the Notes for this slide and extract inline
> {code}
> in 
> [HSLFExtractor.java|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java]
>  on line 140 
> I would like to implement this part and contribute





[jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX

2016-01-25 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115857#comment-15115857
 ] 

Nick Burch commented on TIKA-1841:
--

I think it would be good to have the PPT and PPTX parsers return xhtml as close 
to identical as we can reasonably get for equivalent input files.

Looking at the XHTML examples given, wrapping things up in a per-slide block 
seems more sensible and useful to me

We do try to avoid making breaking changes where we can, but as I can't think 
of any way to do so here without making an even-more-breaking change of 
duplicating all the text and the markup, it seems that our best bet would be to 
rationalise + warn in the changelog

I think we should have some test PowerPoint files with both a .ppt and a .pptx 
version. It might be good if we could write a unit test that verifies that the 
two parsers correctly do the slide -> contents + slide -> notes markup, as well 
as both producing the same output. Any chance you'd be able to write that?

Let's give it a few more days for everyone else interested to review + comment 
on this, before we finalise an XHTML representation for PowerPoint slideshows 
and update the parsers to match
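
To show why a per-slide wrapper is convenient for consumers, here is a minimal 
sketch using the JDK's XPath support. The markup and the class names (`slide`, 
`slide-content`, `slide-notes`) are assumptions made up for this example, not 
the agreed format. Real Tika output is also in the XHTML namespace, which this 
sketch leaves out for brevity.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical consumer code; element layout and class names are assumed
class SlideNotesLookup {
    static final String SAMPLE =
        "<body>"
        + "<div class=\"slide\"><div class=\"slide-content\">Intro</div>"
        + "<div class=\"slide-notes\">Welcome everyone</div></div>"
        + "<div class=\"slide\"><div class=\"slide-content\">Results</div></div>"
        + "</body>";

    /** Returns the notes text of the given 1-based slide, or "" if it has none. */
    static String notesForSlide(String xhtml, int slideNumber) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Pick the Nth slide wrapper, then only the notes nested inside it
        String query = "string((//div[@class='slide'])[" + slideNumber
            + "]/div[@class='slide-notes'])";
        return xpath.evaluate(query, new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println(notesForSlide(SAMPLE, 1));
    }
}
```

With the notes nested inside each slide block, the same query works whichever 
parser produced the output, which is the property a shared .ppt/.pptx unit 
test could check.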

> Different XML output structure for PPT and PPTX
> ---
>
> Key: TIKA-1841
> URL: https://issues.apache.org/jira/browse/TIKA-1841
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Sam H
>
> Issue is slightly related to TIKA-1840
> I've noticed that the XML structure of Powerpoint (PPT) and PPTX files is 
> different. 
> The structure for PPTX seems as follows:
> {code}
> 
> 
>  //optional
>  //optional
> ...
> 
> 
>  //optional
>  //optional
> {code}
> Note that there's no parent slide element to indicate the start and end of 
> each slide.
> For powerpoint the structure is as follows:
> {code}
> 
>   
> 
> 
>  //added in TIKA-1840
>  
>   
>   ...
>   
> 
> 
>  //added in TIKA-1840
> 
>   
> 
> 
> {code}
> In my application, I'm using XPath to get the desired information. As the 
> XML structure is different, I have to use different XPath queries depending on 
> whether the file is PPT (old) or PPTX (new). It would be nice for Tika to 
> return the same XML for both.
> I would propose changing the XML structure to this:
> {code}
> 
>   
> 
> 
>  //added in TIKA-1840
>  
>   
>   ...
>   
> 
> 
>  //added in TIKA-1840
> 
>   
> 
> {code}
> So, essentially, like the current PPT output, but without the list of notes 
> at the end (as this is also omitted for PPTX).
> On the one hand this generalizes PPT(X) handling, on the other it can break 
> existing (external) functionality relying on a specific XML output format.
> I don't know if this is something the project wants fixed or not. If so, I'm 
> willing to donate my time.





[jira] [Created] (TIKA-1839) Update website inclusion of Examples for Git

2016-01-22 Thread Nick Burch (JIRA)
Nick Burch created TIKA-1839:


 Summary: Update website inclusion of Examples for Git
 Key: TIKA-1839
 URL: https://issues.apache.org/jira/browse/TIKA-1839
 Project: Tika
  Issue Type: Task
  Components: example
Affects Versions: 1.11
Reporter: Nick Burch


Currently, the Tika website remains in SVN. However, the Tika Examples have 
moved to Git, along with the rest of the main area of the source tree

The website pulls in snippets of code from the Examples area of the source 
tree, to be displayed on the site. (This allows us to have the examples shown 
on the site be compiled+unit tested, to help minimise the chance of them 
getting out of date)

This needs updating to pull the examples from Git now, rather than SVN





Re: Are we on git?

2016-01-22 Thread Nick Burch

On Fri, 22 Jan 2016, Mattmann, Chris A (3980) wrote:

Our new ASF git repo is:
https://git-wip-us.apache.org/repos/asf/tika.git

Here’s an email I sent to the OODT-dev list about how
to convert from your existing SVN checkout to Git.
http://s.apache.org/UNr


Steps I followed on my trunk checkout:
 * svn status
 * (ensured no local changes)
 * mv .svn .svn.old
 * git init
 * git remote add origin https://git-wip-us.apache.org/repos/asf/tika.git
 * git checkout -b merge-branch
 * git fetch --all
 * git reset --hard origin/master
 * git checkout master

And on my Tika 2.x checkout the last two steps were changed to:
 * git reset --hard origin/2.x
 * git checkout 2.x

All seems to be working well now, thanks for the pointers!



Can we file a ticket to update the contribute page?


I've done that page, and the parser guide links


The thing that remains to be done is to sort out the site to import the 
examples from Git rather than SVN. I'll raise a ticket for that


Nick

Are we on git?

2016-01-21 Thread Nick Burch

Hi All

I've seen a commit message to git, but no "stop using SVN", and 
http://tika.apache.org/contribute.html still talks about SVN being our 
master.


What's the status? Have we switched? Still in progress? Where should we 
commit to? Is it time to delete our SVN checkouts and re-checkout from 
git?


Cheers
Nick


[jira] [Commented] (TIKA-1823) Support detecting DWF format

2016-01-18 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105081#comment-15105081
 ] 

Nick Burch commented on TIKA-1823:
--

Do you have a very small sample DWF file (ideally your own, otherwise a 
suitably licensed sample) that we could use for unit testing the detection?

> Support detecting DWF format
> 
>
> Key: TIKA-1823
> URL: https://issues.apache.org/jira/browse/TIKA-1823
> Project: Tika
>  Issue Type: Improvement
>  Components: detector, mime
>Reporter: Luca Moretti
>Priority: Minor
>  Labels: detection, dwf, mime
>
> Tika currently detects dwf files as application/octet-stream.
> To make Tika's mime magic detector correctly recognize dwf files, this code 
> fragment should be added to the _tika-mimetypes.xml_ registry:
> {code:xml}
> 
>   dwf
>   <_comment>Design Web Format
>   
>   
>   
>   
>   
>   
>   
>   
> 
> {code}
> \\
> In the current version (DWF 6.0), a dwf file is a ZIP-compressed container for 
> vector-based CAD drawings. It is basically a ZIP archive with the _(DWF 
> V06.00)_ signature added before the regular ZIP magic number. For this 
> reason, the match value to detect dwf files should be: {{(DWF V06.00)PK}}.
> In the previous versions, the dwf data transport isn't a ZIP file format, so 
> the magic number is only the _(DWF V00.55)_ signature in the file header.
> To make Tika detect dwf files with this version too I propose the match value 
> in the code above.
> Thanks,
> Luca
> \\
> P.S.: The DWF format specification is included in the DWF Toolkit. The DWF 
> Toolkit is available for free at [http://www.autodesk.com/dwftoolkit]





Re: WMF extraction

2016-01-15 Thread Nick Burch

On Thu, 14 Jan 2016, Andreas Beeker wrote:
POI will have a WMF module (org.apache.poi.hwmf.*) in the next beta. 
Looking over the govdocs collection, those embedded wmfs might contain 
interesting information for TIKA.


Should the output be part of the embedding document, e.g. ppt, or does 
it make sense to crawl over various extensions and extract those 
metadata separately?


I'd suggest a two-step process. One is to update the current office 
parsers (especially HSLF) as needed to expose the embedded WMF files as 
embedded resources, much as they do for embedded jpegs, pngs etc


Next, add a WMF parser that uses HWMF to expose any useful metadata you 
can find


Tika will then call the WMF parser for embedded WMFs where requested

Nick


RE: Tika questions on StackOverflow

2016-01-14 Thread Nick Burch

On Wed, 13 Jan 2016, Allison, Timothy B. wrote:

Are there other consumer lists we should be following?  Elastic Search?


I think Elastic Search only has a forum-type thingy, this probably should 
let you see Tika posts there (not that frequent)

https://discuss.elastic.co/search?q=tika%20category%3A6%20order%3Alatest

Otherwise Alfresco, Nutch and StormCrawler are probably the next biggest 
open source users, I guess?


Nick


[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-14 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097884#comment-15097884
 ] 

Nick Burch commented on TIKA-1824:
--

Tika already supports using a custom classloader for loading parser + detector 
classes + spi files - 
http://tika.apache.org/1.11/api/org/apache/tika/config/TikaConfig.html#TikaConfig%28java.lang.ClassLoader%29

> Tika 2.0 -  Create Initial Parser Modules
> -
>
> Key: TIKA-1824
> URL: https://issues.apache.org/jira/browse/TIKA-1824
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Create initial break down of parser modules.





Tika questions on StackOverflow

2016-01-13 Thread Nick Burch

Hi All

This may be old news for some of you, in which case you can skip the 
email, but for others... StackOverflow is a programming-focused question 
and answer site, with excellent google-fu, and wide and growing use. At the 
moment I'd say there's something like a new Tika question a day on it, and 
that number seems to be climbing. (It's quite bursty though, 2 one day, 
nothing for the next few)


Increasingly, new users seem to be turning to StackOverflow to get help 
with projects, learn how to use them etc, in place of joining a mailing 
list and asking a question. There's also a lot of people out there who 
know about Tika, aren't on our lists, but are posting helpful replies 
(answers) to questions on how to use Tika.


(There's also a fair number of useless people asking very basic questions, 
without full information, and without having done any research / checked 
existing questions / checked our site / etc. They tend to get moderated 
down pretty quickly though, or they learn and edit the question)


Because StackOverflow gets a lot of newbie traffic, they have some rules, 
and can be quite strict on enforcing them. A lot stricter than many of the 
other StackExchange network sites, largely because of that traffic. That 
means you will find some restrictions at the start, but they go away soon. 
You do need to be careful to actually answer questions with an answer: 
asking for clarifications or saying "can't help, ask on the list" as an 
answer won't go down well.



If you're interested to see what sort of questions there are, see
http://stackoverflow.com/questions/tagged/apache-tika?sort=newest=50
for what has been asked recently, and
http://stackoverflow.com/questions/tagged/apache-tika?sort=votes=50
for the most "popular"


There are a few of us on StackOverflow already, but you might want to join 
in too. You certainly don't have to! But you might want to, not only to 
help, but also to get bug reports, find out what docs we need to update, 
and maybe even spot people answering who we can ask to join the project.


If you sign up for an account, you can get emails when people ask Tika 
related questions, so you can know to go look if it interests you. To do 
that, go to

http://stackexchange.com/filters/212512/apache-tika-questions
On the right it should have an "Email Updates" box, where you can 
subscribe to get emailed for new questions on a timing of your choice



If you have questions on using StackOverflow, I'm happy to do my best to 
explain. They have pretty good help/documentation, and they have the 
"meta" site to check policies / the reasons behind them / etc.


You will suffer some restrictions as a new user, but they go away when 
your answers get a few up-votes. Let us know your username if you sign up 
and answer something, then the few of us who already use StackOverflow can 
up vote you to get you to the minimum rep score to escape them!


Nick


Re: [VOTE] Moving SCM to Git

2016-01-11 Thread Nick Burch

On 02/01/16 04:30, Mattmann, Chris A (3980) wrote:

Hi Everyone,

DISCUSS thread here: http://s.apache.org/wVE

Time to officially VOTE on moving Tika to Git. I’ve made a wiki
page for our SCM explaining how to use Git at Apache, and how to
use it with Github, and how to use it even in a traditional SVN
sense. The page is here:

https://wiki.apache.org/tika/UsingGit

https://wiki.apache.org/tika/DeveloperResources
https://wiki.apache.org/tika/ReleaseProcess


Thanks for all those docs! Looks fine to me, at first glance, and we can 
fix anything else as we go along :)



If you’d like to revise your VOTE or to VOTE for the first time,
please use the ballot below:

[ ] +1 Move the Apache Tika source control to Writeable Git repos
at the ASF


+1 from me now

Nick


[jira] [Commented] (TIKA-1821) Problem in Tika().detect for xml file signed in CADES

2016-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087613#comment-15087613
 ] 

Nick Burch commented on TIKA-1821:
--

Hopefully fixed in r1723581 - the length is part of the initial magic 
(effectively), but we weren't handling all the combinations, and our unit test 
only did the shortest possible one

I'm reluctant to add your test file to SVN, as it's a little large. Any chance 
you could create a slightly smaller test file for us to use?
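
For anyone wondering why a length would interfere with magic detection, here 
is a minimal sketch. It assumes the signed file is an ASN.1 BER/DER structure 
that opens with a SEQUENCE tag (0x30) followed by a length field and then an 
OBJECT IDENTIFIER (tag 0x06). The length field can take several forms, so the 
bytes that follow it do not sit at a fixed offset. The class and method names 
are hypothetical, and this is not the change made in r1723581.

```java
// Hypothetical illustration of variable-length ASN.1 headers
class DerHeader {

    /**
     * Returns the offset of the first content byte after a SEQUENCE tag and
     * its length field, or -1 if the data does not start with a SEQUENCE.
     */
    static int contentOffset(byte[] data) {
        if (data.length < 2 || (data[0] & 0xFF) != 0x30) {
            return -1;
        }
        int first = data[1] & 0xFF;
        if (first < 0x80) {
            return 2; // short form: this single byte is the length
        }
        // 0x80 is the indefinite form (no further length bytes);
        // 0x81, 0x82, ... mean 1, 2, ... further length bytes follow
        int extraLengthBytes = first & 0x7F;
        int offset = 2 + extraLengthBytes;
        return offset <= data.length ? offset : -1;
    }

    /** True if a SEQUENCE header is immediately followed by an OBJECT IDENTIFIER tag. */
    static boolean startsWithSequenceThenOid(byte[] data) {
        int offset = contentOffset(data);
        return offset >= 0 && offset < data.length && (data[offset] & 0xFF) == 0x06;
    }
}
```

A test that only covers the short form would miss larger files, where the 
two-byte and longer forms of the length push the next tag further along.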

> Problem in Tika().detect for xml file signed in CADES
> -
>
> Key: TIKA-1821
> URL: https://issues.apache.org/jira/browse/TIKA-1821
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.8
>Reporter: Alessandro De Angelis
> Attachments: test-tika-error.xml.p7m
>
>
> We have an XML file with a base64 attachment, signed with a CADES signature. 
> In this case Tika recognizes the resulting file's mime type as "text/plain" 
> and not "application/pkcs7-signature" as we expected.





[jira] [Commented] (TIKA-1817) Extracts entire file content for ASCII DXF files

2015-12-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070169#comment-15070169
 ] 

Nick Burch commented on TIKA-1817:
--

Thanks for that. Test file from JustCAD added in r1721576, along with a unit 
test, and the required license + attribution

We still ideally want a Binary DXF file, and a DXB, if someone can find/produce 
suitably licensed ones!

> Extracts entire file content for ASCII DXF files
> 
>
> Key: TIKA-1817
> URL: https://issues.apache.org/jira/browse/TIKA-1817
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.11
>Reporter: Zoltan Toth
> Attachments: SMA-Controller.dxf, house design.dxf, 
> jcsample-screendump.jpg, jcsample.dxf
>
>
> By definition, ASCII DXF files are encoded in plain text.  However, the vast 
> majority of their content is not intended to be human readable (see 
> https://en.wikipedia.org/wiki/AutoCAD_DXF).  Unfortunately for these files, 
> Tika simply "extracts" the entire content of the file instead of the 
> human-readable portions (i.e. comments etc.) that a CAD tool would render.  
> This results in massive amounts of rubbish data being returned with dire 
> consequences for applications that rely on this.
> It would be nice if only the human-readable text fields were extracted.  
> Failing this, it would still be nice if no text was extracted from these 
> files at all.  





[jira] [Commented] (TIKA-1817) Extracts entire file content for ASCII DXF files

2015-12-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068077#comment-15068077
 ] 

Nick Burch commented on TIKA-1817:
--

I've had a go at adding mime subtypes for binary and ascii for DXF, as well as 
the related DXB, in r1721390. No unit tests though :( Needs some suitable 
sample files

With that in place, ascii dxf files should no longer end up routed to the text 
parser. That's probably slightly better, but not ideal... We really need 
someone to volunteer to write a proper parser!

Writing one shouldn't be too bad, especially for strings and metadata, along 
the lines of the DWG one we already have. 
http://www.fileformat.info/format/dxf/egff.htm seems a good overview of the 
file format, and there's also published stuff at 
http://www.autodesk.com/techpubs/autocad/acadr14/dxf/drawing_interchange_file_formats.htm
 that should help
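
As a starting point for whoever picks this up, here is a minimal sketch of the 
text-extraction idea, using only the JDK. It assumes the usual ASCII DXF 
layout of alternating lines (a numeric group code, then its value), and treats 
group code 1 (primary text value) and 999 (comment) as the human-readable 
ones. The class name is hypothetical and this is not a Tika parser. A real one 
would also need to track which section it is in, since code 1 is used for 
other string values too.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not an actual Tika parser
class DxfTextSketch {

    /** Collects the values of group codes 1 and 999 from ASCII DXF content. */
    static List<String> extractText(String dxf) throws IOException {
        List<String> found = new ArrayList<>();
        BufferedReader reader = new BufferedReader(new StringReader(dxf));
        String codeLine;
        // Each record is two lines: the group code, then the value
        while ((codeLine = reader.readLine()) != null) {
            String value = reader.readLine();
            if (value == null) {
                break; // truncated file: code without a value
            }
            String code = codeLine.trim();
            if (code.equals("1") || code.equals("999")) {
                found.add(value.trim());
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        String sample = "999\nMade by hand\n  0\nSECTION\n  2\nENTITIES\n"
            + "  0\nTEXT\n  1\nHello world\n  0\nENDSEC\n  0\nEOF\n";
        System.out.println(extractText(sample));
    }
}
```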

> Extracts entire file content for ASCII DXF files
> 
>
> Key: TIKA-1817
> URL: https://issues.apache.org/jira/browse/TIKA-1817
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.11
>Reporter: Zoltan Toth
> Attachments: jcsample-screendump.jpg, jcsample.dxf
>
>
> By definition, ASCII DXF files are encoded in plain text.  However, the vast 
> majority of their content is not intended to be human readable (see 
> https://en.wikipedia.org/wiki/AutoCAD_DXF).  Unfortunately for these files, 
> Tika simply "extracts" the entire content of the file instead of the 
> human-readable portions (i.e. comments etc.) that a CAD tool would render.  
> This results in massive amounts of rubbish data being returned with dire 
> consequences for applications that rely on this.
> It would be nice if only the human-readable text fields were extracted.  
> Failing this, it would still be nice if no text was extracted from these 
> files at all.  





Re: looking to contribute

2015-12-22 Thread Nick Burch

On Wed, 16 Dec 2015, Nick Burch wrote:
If you want to try more coding, Tim quite often runs Tika against some 
large filesets, and has a nifty tool to report on what breaks. He can 
hopefully point you at the most recent report! Maybe have a look through 
that, identify a few common failures from unidentified or common 
exceptions, and try to fix one or two of those?


Another one might be TIKA-1817 - needs two or three new parsers, all 
hopefully fairly straightforward. There'll want to be a text-based one for 
ASCII DXF, likely along the lines of some of the scientific text-based 
formats. There also needs to be a binary one for binary DXF, maybe also able 
to do DXB at the same time. The DWG parser might be a good starting point for 
that, or maybe even could be extended to do those too


Nick


[jira] [Commented] (TIKA-1817) Extracts entire file content for ASCII DXF files

2015-12-21 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067319#comment-15067319
 ] 

Nick Burch commented on TIKA-1817:
--

Any chance you could upload a small sample DXF file? Ideally with the same / 
similar metadata and contents as our other AutoCAD files, but failing that 
anything with known contents

First task will be using that to get detection working properly, so if you know 
the mime type for these files, that'll help!

Once we have detection, then it's a question of parsing. That should be quite 
quick to do, from the sound of it, and might even be a good starting point for 
one of our new volunteers for the project :) Either way, needs some test files!

> Extracts entire file content for ASCII DXF files
> 
>
> Key: TIKA-1817
> URL: https://issues.apache.org/jira/browse/TIKA-1817
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.11
>Reporter: Zoltan Toth
>
> By definition, ASCII DXF files are encoded in plain text.  However, the vast 
> majority of their content is not intended to be human readable (see 
> https://en.wikipedia.org/wiki/AutoCAD_DXF).  Unfortunately for these files, 
> Tika simply "extracts" the entire content of the file instead of the 
> human-readable portions (i.e. comments etc.) that a CAD tool would render.  
> This results in massive amounts of rubbish data being returned with dire 
> consequences for applications that rely on this.
> It would be nice if only the human-readable text fields were extracted.  
> Failing this, it would still be nice if no text was extracted from these 
> files at all.  





Re: looking to contribute

2015-12-20 Thread Nick Burch

On Sun, 20 Dec 2015, Joey Hong wrote:
Oh, my bad. I should have realized when the HTML looked generated. I 
have now added the usage examples to the examples.apt file, and the page 
looks fine after it was built by mvn. As of now, the examples are edited 
both for the 1.11/ and 1.12/ folders; should they only affect the 1.12/ 
one?


If the example applies to the 1.11 release (eg because we forgot to add it 
there in time), pop it into the 1.11 apt file and add a note that we 
should apply it to 1.12 as well


If the example relies on new functionality added since 1.11, just in the 
1.12 folder


Also, when this is all done, would i svn commit my changes the same way 
as for the main Tika app?


Because they're in different bits of the tree, you'd likely need one 
patch/commit for the changes to the example source+tests, and one for the 
examples page that references + explains it


Nick


Re: looking to contribute

2015-12-20 Thread Nick Burch

On Sat, 19 Dec 2015, Joey Hong wrote:

Regarding TIKA-1329, I found the tika-site in the Subversion source code, and I 
called:
svn checkout https://svn.apache.org/repos/asf/tika/site/publish/1.11/ 
.

Since this isn’t part of the main tika/trunk repository, I was wondering 
if I should still follow the same protocol and svn commit my changes to 
the site folder.


You shouldn't be working on those files - they're the generated HTML. You 
need to work on the original APT (Almost Plain Text) files which are in a 
sibling folder


I'd suggest, if you want to work on any docs stuff (yey!), you just 
checkout https://svn.apache.org/repos/asf/tika/site


Then edit the files in src/site/apt/1.x/, and use "mvn install" in the 
checkout root to test how the resulting HTML looks


In case I shouldn’t, I’ve attached my changes to the usage examples page 
of the website below. I basically added how to parse documents with 
embedded docs using the RecursiveParserWrapper class, and how to 
serialize the returned Metadata list to JSON, with some description.


Examples is a bit special! Any code should go into the tika-example module 
in the main source tree, along with unit tests that verify that they work 
properly + stay working properly. That avoids the common issue of examples 
that no longer work/compile!


Once your changes are in the example svn area, edit the file 
src/site/apt/1.{x+1}/examples.apt in the site folder to both pull in the 
appropriate code snippet + describe it. Use the %{include} directive to have 
the code pulled in, tell it which file to grab from, and which method, and 
it'll nicely inline the unit-tested example for you


Nick

[jira] [Commented] (TIKA-1773) No XML Metadata output for JP2 files

2015-12-18 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063957#comment-15063957
 ] 

Nick Burch commented on TIKA-1773:
--

We can't depend on a LGPL library - see 
http://www.apache.org/legal/resolved.html#category-x

We'd need to have it as an externally hosted parser, which users could download 
manually if the license proved acceptable for their use case. It could then be 
listed at http://wiki.apache.org/tika/3rd%20party%20parser%20plugins

(Or we'd need to find or write a suitably licensed alternative)

> No XML Metadata output for JP2 files
> 
>
> Key: TIKA-1773
> URL: https://issues.apache.org/jira/browse/TIKA-1773
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.8, 1.9, 1.10
>Reporter: Andreas Hirtzel
> Attachments: testJPEG.jp2
>
>
> Hi,
> Tika doesn't return output for JPEG2000 (.jp2) files in xhtml format. We're 
> using tika libraries in our application and get only empty html output for 
> this file type. If you open a jp2 file with the gui and switch to structured 
> text view, you don't get any results. There is no exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: looking to contribute

2015-12-16 Thread Nick Burch

On Wed, 16 Dec 2015, Joey Hong wrote:
My name is Joey. I am a college freshman with programming experience 
looking to get into the world of open-source. I was hoping to contribute 
to the Tika project, and was wondering if there were any tasks that a 
beginner like me could tackle. I am willing to do anything, whether it 
be fixing a minor bug, or adding test suites or documentation.


On the docs / examples side, we have a few examples on the website, but 
probably not enough! One thing might be to look through those, identify 
gaps with your fresh eyes, and work on those. We also have instructions 
for some more complicated integrations on the wiki, maybe try some of 
those and feed back on which ones aren't clear enough?


If you want to try more coding, Tim quite often runs Tika against some 
large filesets, and has a nifty tool to report on what breaks. He can 
hopefully point you at the most recent report! Maybe have a look through 
that, identify a few common failures from unidentified or common 
exceptions, and try to fix one or two of those?


Nick


[jira] [Commented] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl

2015-12-16 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060203#comment-15060203
 ] 

Nick Burch commented on TIKA-1813:
--

My best guess is that these have been truncated. Having a look with 
{{{org.apache.poi.poifs.dev.POIFSHeaderDumper}}} it certainly looks that way
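
For anyone wanting a quick sanity check without POI on hand, here's a rough sketch of the kind of test that points at truncation. It is not what POIFSHeaderDumper does, and the class and method names are made up for illustration. It relies on the OLE2 header layout (sector shift at byte 30, first directory sector at byte 48, both little-endian) and on sector n starting at (n + 1) << sectorShift. With 512 byte sectors, sector 168 starts at 86528, which is the position in the exception quoted below.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Ole2TruncationCheck {
    // The bytes D0 CF 11 E0 A1 B1 1A E1, read as a little-endian long
    private static final long SIGNATURE = 0xE11AB1A1E011CFD0L;

    /** Byte offset at which the given sector starts (the header occupies the first sector-sized slot). */
    public static long sectorOffset(int sectorIndex, int sectorShift) {
        return ((long) sectorIndex + 1) << sectorShift;
    }

    /**
     * Returns true if the first directory sector named in the header
     * would extend past the end of a file of the given length.
     */
    public static boolean looksTruncated(byte[] header, long fileLength) {
        if (header.length < 512) {
            throw new IllegalArgumentException("Need the full 512 byte header");
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getLong(0) != SIGNATURE) {
            throw new IllegalArgumentException("Not an OLE2 file");
        }
        int sectorShift = buf.getShort(30);   // 9 = 512 byte sectors, 12 = 4096 byte sectors
        if (sectorShift != 9 && sectorShift != 12) {
            throw new IllegalArgumentException("Unexpected sector shift " + sectorShift);
        }
        int firstDirSector = buf.getInt(48);
        long endOfDirSector = sectorOffset(firstDirSector, sectorShift) + (1L << sectorShift);
        return endOfDirSector > fileLength;
    }
}
```

Checking only the directory sector is a deliberate simplification. A fuller check would walk the FAT as well, but a header that points past the end of the file is already a strong hint.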

> Figure out file types for several unknown OLE files in Common Crawl
> ---
>
> Key: TIKA-1813
> URL: https://issues.apache.org/jira/browse/TIKA-1813
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB, 
> 25JIANLV77U645GUSJ2E67YSM4B2TNSP, 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA
>
>
> We're getting around 300 exceptions from "application/x-tika-msoffice" files 
> in our current slice of Common Crawl documents that look roughly like this:
> {noformat}
> java.lang.IllegalArgumentException: Position 86528 past the end of the file
> at org.apache.poi.poifs.nio.FileBackedDataSource.read
> {noformat}
> I suspect these are non-MS OLE file formats.  Any help identifying the file 
> types and patching our OLE mime detector would be great.





[jira] [Comment Edited] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl

2015-12-16 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060203#comment-15060203
 ] 

Nick Burch edited comment on TIKA-1813 at 12/16/15 3:58 PM:


My best guess is that these have been truncated. Having a look with 
{{org.apache.poi.poifs.dev.POIFSHeaderDumper}} it certainly looks that way


was (Author: gagravarr):
My best guess is that these have been truncated. Having a look with 
{{{org.apache.poi.poifs.dev.POIFSHeaderDumper}}} it certainly looks that way

> Figure out file types for several unknown OLE files in Common Crawl
> ---
>
> Key: TIKA-1813
> URL: https://issues.apache.org/jira/browse/TIKA-1813
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB, 
> 25JIANLV77U645GUSJ2E67YSM4B2TNSP, 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA
>
>
> We're getting around 300 exceptions from "application/x-tika-msoffice" files 
> in our current slice of Common Crawl documents that look roughly like this:
> {noformat}
> java.lang.IllegalArgumentException: Position 86528 past the end of the file
> at org.apache.poi.poifs.nio.FileBackedDataSource.read
> {noformat}
> I suspect these are non-MS OLE file formats.  Any help identifying the file 
> types and patching our OLE mime detector would be great.





Re: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Nick Burch

On Sun, 13 Dec 2015, Bob Paulin wrote:

So in short

Source in tika-parser
Dependencies managed in tika-parser and copied to module

Source in Modules
Dependencies managed in modules and consolidated via maven shade plugin. 
Conflicting dependencies managed by maven.


IIRC there are some util / parent classes in the tika parsers module which 
many different parsers need. Where would those end up?


Thanks
Nick


Re: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Nick Burch

On 14/12/15 16:26, Ray Gauss wrote:

I'd vote for a tika-parser-common(s) artifact for common util classes and 
dependencies.


That would make sense to me

Nick


Re: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Nick Burch

On Mon, 14 Dec 2015, Bob Paulin wrote:
So there seems to be a pretty good consensus forming around moving the 
sources but some differing opinions on where to put shared parser code.


I know it'll be a bit dull and some work, but... Could someone put 
together a list (probably in the wiki or on jira so we can edit it) of the 
candidate classes to go in core/commons, along with their dependencies?


Once we've finalised that list, the answer may become clear just from 
that!


Nick


[jira] [Commented] (TIKA-1806) Bouncy Castle conflict

2015-12-03 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038105#comment-15038105
 ] 

Nick Burch commented on TIKA-1806:
--

I've just tried that file with the Tika App, and I don't get that exception. 
Googling suggests it's caused by having mis-matched BouncyCastle jars. Could 
your runtime possibly have an older bc jar on the classpath? (Maybe use code 
similar to http://poi.apache.org/faq.html#faq-N10006 to check which bc jars are 
really being used?)
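
Something along these lines should do for the check. It's a minimal sketch in the spirit of that FAQ entry rather than a copy of it, and the ClassOrigin helper name is made up. It asks the classloader where a class file actually came from, without initialising the class (so it shouldn't trip over the NoSuchFieldError itself).

```java
import java.net.URL;

public class ClassOrigin {

    /** Returns the URL the given class file was found at, or null if that can't be determined. */
    public static URL locate(Class<?> clazz) {
        String resource = "/" + clazz.getName().replace('.', '/') + ".class";
        return clazz.getResource(resource);
    }

    /** Looks a class up by name and describes where it lives, rather than throwing if it is absent. */
    public static String describe(String className) {
        try {
            // initialize=false: find the class, but don't run its static initialiser
            Class<?> clazz = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            URL url = locate(clazz);
            return className + " -> " + (url == null ? "(unknown location)" : url.toString());
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " -> not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the stack trace in this issue
        System.out.println(describe("org.bouncycastle.operator.jcajce.OperatorHelper"));
    }
}
```

Run it with the same classpath as the failing application. If the jar named in the output isn't the BouncyCastle version that Tika pulls in, that's the likely culprit.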

> Bouncy Castle conflict
> --
>
> Key: TIKA-1806
> URL: https://issues.apache.org/jira/browse/TIKA-1806
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Blocker
> Attachments: K7JQOGUCTHLSOPCSG5OCKIKM2LZKD7AH
>
>
> On a recent run with trunk against our Common Crawl corpus, I found quite a 
> few of these:
> {noformat}
> java.lang.NoSuchFieldError: gostR3411_94_with_gostR3410_94
>   at org.bouncycastle.operator.jcajce.OperatorHelper.<clinit>(Unknown Source)
>   at org.bouncycastle.operator.jcajce.JcaDigestCalculatorProviderBuilder.<init>(Unknown Source)
>   at org.apache.tika.parser.crypto.Pkcs7Parser.parse(Pkcs7Parser.java:63)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> {noformat}





[jira] [Created] (TIKA-1805) Default parser/detector loading should warn on missing/empty classes

2015-12-01 Thread Nick Burch (JIRA)
Nick Burch created TIKA-1805:


 Summary: Default parser/detector loading should warn on 
missing/empty classes
 Key: TIKA-1805
 URL: https://issues.apache.org/jira/browse/TIKA-1805
 Project: Tika
  Issue Type: Improvement
  Components: config
Affects Versions: 2.0
Reporter: Nick Burch
 Fix For: 2.0


As mentioned on-list, with the parser modularisation changes in 2.x, the 
chances of a newbie getting something wrong go up. We should therefore change 
the default in 2.x to warn (rather than silently ignore) if parsers or 
detectors are missing / none are defined

This remains configurable with Tika Config XML, explicit TikaConfig object 
setup etc, so it can be easily silenced if wanted. It's just the default which 
will warn people if they've made a mistake!





[jira] [Resolved] (TIKA-1805) Default parser/detector loading should warn on missing/empty classes

2015-12-01 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1805.
--
Resolution: Fixed

Changed as of r1717560, along with an additional handler method to alert if a 
service has no implementations (eg DefaultParser has no parser service files 
available)

> Default parser/detector loading should warn on missing/empty classes
> 
>
> Key: TIKA-1805
> URL: https://issues.apache.org/jira/browse/TIKA-1805
> Project: Tika
>  Issue Type: Improvement
>  Components: config
>Affects Versions: 2.0
>    Reporter: Nick Burch
> Fix For: 2.0
>
>
> As mentioned on-list, with the parser modularisation changes in 2.x, the 
> chances of a newbie getting something wrong go up. We should therefore 
> change the default in 2.x to warn (rather than silently ignore) if parsers or 
> detectors are missing / none are defined
> This remains configurable with Tika Config XML, explicit TikaConfig object 
> setup etc, so it can be easily silenced if wanted. It's just the default 
> which will warn people if they've made a mistake!





Re: more modular parser bundles

2015-11-30 Thread Nick Burch

On Mon, 30 Nov 2015, Allison, Timothy B. wrote:
Perhaps we could start with a tika-advanced-bundle to gather all of the 
nlp/advanced parsers?  Or would this have to wait for Tika 2.0?


I've noticed that there have been a lot fewer queries (on our list, on 
stackoverflow, at events etc) caused by people missing jars of late. Not 
sure if the message has got out there better, the right posts are getting 
to the top of google, the troubleshooting page has done its magic, or 
something else entirely! But I'm now less worried about the impact of 
modular parsers on newbies than I have been before


To try to avoid all the existing guidance (most of it external) from going 
stale, I'd lean towards either keeping "tika-parsers" as the full version, 
or making "tika-parsers" an alias to "tika-parsers-all", so that current 
behaviour remains


I'd also probably suggest we change the default load error handler to 
warn/log, so that people by default will find out more quickly that 
they've missed jars, and probably also have an extra load error log/check 
which triggers in the event of 0 parser definitions being found. People 
can turn that off if they want, as now, but maybe the new default should 
be so that newbies tend to get told quickly what they've done wrong!
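
To make the idea concrete, here's a stand-alone sketch of the behaviour I have in mind. It's deliberately not Tika's actual loading code, and the LenientLoader / ProblemHandler names are invented for illustration. Each class that can't be loaded gets reported to a handler, and there's one extra report if nothing at all was loaded.

```java
import java.util.ArrayList;
import java.util.List;

public class LenientLoader {

    /** Hypothetical callback, told about anything that went wrong while loading. */
    public interface ProblemHandler {
        void report(String message);
    }

    /** The old-style default: say nothing. */
    public static final ProblemHandler IGNORE = message -> { };

    /** The proposed default: tell the user, but carry on. */
    public static final ProblemHandler WARN = message -> System.err.println("WARN: " + message);

    /** Instantiates every class it can, reporting the ones it can't rather than failing. */
    public static List<Object> loadAll(List<String> classNames, ProblemHandler handler) {
        List<Object> loaded = new ArrayList<>();
        for (String name : classNames) {
            try {
                loaded.add(Class.forName(name).getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | LinkageError e) {
                handler.report("Could not load " + name + ": " + e);
            }
        }
        if (loaded.isEmpty()) {
            // The extra check: zero implementations almost always means missing jars
            handler.report("No implementations could be loaded at all");
        }
        return loaded;
    }
}
```

Passing IGNORE gives today's silent behaviour, so anyone who wants the quiet life only has to opt out.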


Oh, and we'll need to update the troubleshooting page too for the new 
bundles world :)


Nick


[jira] [Commented] (TIKA-1804) Tika use no free json.org

2015-11-30 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15031681#comment-15031681
 ] 

Nick Burch commented on TIKA-1804:
--

The JSON license has been approved for use by Apache Projects by the ASF Legal 
Affairs committee, without affecting the license conditions of the overall 
software, see http://www.apache.org/legal/resolved.html#json

If you feel that's incorrect, you'd need to take that up with the Legal Affairs 
committee on legal-discuss@ - they're the ones qualified / charged with 
deciding on this sort of stuff, not us!

> Tika use no free json.org
> -
>
> Key: TIKA-1804
> URL: https://issues.apache.org/jira/browse/TIKA-1804
> Project: Tika
>  Issue Type: Bug
>Reporter: gil cattaneo
>
> Hi
> Your project is licensed under Apache License Version 2,
> but your code pulls in code from json.org under Douglas Crockford’s bad 
> licence [1] , and is non-free [2].
> Such usage restriction makes the license incompatible with The Open Source 
> Definition and
> The Free Software Definition. Because Tika binary distribution includes this 
> software,
> it effectively becomes proprietary software itself.
> You may also comment that the json.org license is valid for You but for many 
> Linux distributions it is not acceptable.
> I hope to continue to maintain Tika for Fedora, without having to run into 
> these problems.
> Please try to replace it with one of the many free alternatives.
> Regards
> [1]
> ./tika-1.11/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java
> ./tika-1.11/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
> ./tika-1.11/tika-parsers/src/main/java/org/apache/tika/parser/journal/TEIParser.java
> [2]
> https://wiki.debian.org/qa.debian.org/jsonevil
> http://www.sonatype.com/people/2012/03/use-json-well-youd-better-not-be-evil/
> http://tanguy.ortolo.eu/blog/article46/json-license





[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-11-26 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15028642#comment-15028642
 ] 

Nick Burch commented on TIKA-1706:
--

Does anyone have any objections to us going ahead with this for Tika 1.12?

If no objections are raised in 1 week (by 2015-12-03), then I think we should 
go ahead and commit

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.12
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the 
> dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns:
> - Most of the non-core modules already use commons-io, and since tika-core is 
> usually not used by itself, commons-io is already included with it
> - Since some modules use both tika-core and commons-io, it's not clear which 
> code should be used
> - Having the inlined classes causes more maintenance and/or technology debt 
> (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset 
> objects instead of encoding names, being able to use StringBuilder instead of 
> StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes 
> with commons-io classes if this is accepted.





Re: NER wiki page up

2015-11-20 Thread Nick Burch

On Fri, 20 Nov 2015, Mattmann, Chris A (3980) wrote:

P.S. Nick - Git instructions coming next :)


Woot! :)

Nick


Re: Incompatibility between apacke tikka and apache commons email jar

2015-11-20 Thread Nick Burch

On Fri, 20 Nov 2015, Neel79 wrote:
I am using Apache commons email jar 1.4 and Apache Tika jar 1.10. I 
see the following error


Caused by: java.lang.UnsupportedClassVersionError: JVMCFRE003 bad major
version; class=org/apache/tika/detect/Detector, offset=6


Apache Tika now requires Java 7 or higher. Apache Commons Email only 
requires Java 5. From the look of it, your JVM (IBM WebSphere?) doesn't 
support Java 7. You'll need to upgrade your JVM to Java 7+ to use recent 
versions of Tika
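
If you want to confirm which Java release a given class was built for, the class file header tells you. A minimal sketch (the ClassFileVersion name is just for illustration): the major version is the two-byte big-endian value at bytes 6-7, right after the 0xCAFEBABE magic and the minor version, which would be consistent with the offset=6 in the error above. Java 5 classes have major version 49 and Java 7 classes have 51.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class ClassFileVersion {

    /** Extracts the major version from the first 8 bytes of a class file. */
    public static int majorVersion(byte[] classBytes) {
        // Class files are big-endian, which is ByteBuffer's default order
        ByteBuffer buf = ByteBuffer.wrap(classBytes);
        if (classBytes.length < 8 || buf.getInt(0) != 0xCAFEBABE) {
            throw new IllegalArgumentException("Not a Java class file");
        }
        return buf.getShort(6) & 0xFFFF; // bytes 4-5 are the minor version, 6-7 the major
    }

    /** Maps a class file major version to the Java release that introduced it. */
    public static int javaRelease(int majorVersion) {
        return majorVersion - 44; // 49 is Java 5, 51 is Java 7, 52 is Java 8
    }

    public static void main(String[] args) throws IOException {
        // Inspect this class's own class file as a demonstration
        byte[] header = new byte[8];
        try (DataInputStream in = new DataInputStream(
                ClassFileVersion.class.getResourceAsStream("ClassFileVersion.class"))) {
            in.readFully(header);
        }
        int major = majorVersion(header);
        System.out.println("major=" + major + ", needs Java " + javaRelease(major) + " or later");
    }
}
```

Point it at org/apache/tika/detect/Detector.class pulled out of the Tika jar and you should see which release your JVM needs to support.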


Nick


Re: [DISCUSS] Moving to Git

2015-11-19 Thread Nick Burch

On Thu, 19 Nov 2015, Mattmann, Chris A (3980) wrote:
I’ll be happy to update our docs and to write a wiki page on using Tika 
& Git that we can refer folks to. I think I’ve demonstrated documenting 
things on the Tika wiki :)


Great stuff! Scribble something sensible down, and I can vote +1 to the 
move, plus learn more about Git at the same time :)


Nick
