Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

2016-03-28 Thread Nick Burch
On Sun, 27 Mar 2016, Bob Paulin wrote: Yes, I think overall if these functions can live somewhere, either inside Tika or in a smaller dependent library, we're in a better place. I'll take a look at Ogg-Vorbis. The two util classes there, that spring to mind, are:

Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

2016-03-27 Thread Nick Burch
On Sun, 27 Mar 2016, Bob Paulin wrote: Currently the Apache POI dependency is in several modules and it's sort of a beast (> 2 MB in size). You should've seen it before Jukka and Yegor spent a crazy ApacheCon hacking up the ooxml-lite support... ;-) It appears many of the modules are only

[jira] [Commented] (TIKA-1908) --list-met-models does not display Dublin core along with other metadata models

2016-03-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212612#comment-15212612 ] Nick Burch commented on TIKA-1908: -- I seem to recall there was a deliberate policy to avoid putting all

[jira] [Commented] (TIKA-1898) backslashes in mime-type : application/vnd.mif are wrong

2016-03-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209418#comment-15209418 ] Nick Burch commented on TIKA-1898: -- Ah, ok, got it. We had a random impenetrable hex string magic too

[jira] [Resolved] (TIKA-1898) backslashes in mime-type : application/vnd.mif are wrong

2016-03-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1898. -- Resolution: Fixed Fix Version/s: 1.13 > backslashes in mime-type : application/vnd.mif are wr

[jira] [Commented] (TIKA-1888) Update mimetype for application/x-netcdf

2016-03-22 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206152#comment-15206152 ] Nick Burch commented on TIKA-1888: -- Which match is missing? We already have CDF 0x01, which is what your

[jira] [Commented] (TIKA-1898) backslashes in mime-type : application/vnd.mif are wrong

2016-03-14 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15194363#comment-15194363 ] Nick Burch commented on TIKA-1898: -- I've just tried with your test file, and Tika is able to detect

[jira] [Commented] (TIKA-1898) backslashes in mime-type : application/vnd.mif are wrong

2016-03-10 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189482#comment-15189482 ] Nick Burch commented on TIKA-1898: -- Do you have a small sample file we could use to write a unit test

[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-09 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15187205#comment-15187205 ] Nick Burch commented on TIKA-1508: -- > I think that's exactly what ParseContext should be for..it sho

[jira] [Commented] (TIKA-1888) Update mimetype for application/x-netcdf

2016-03-07 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182860#comment-15182860 ] Nick Burch commented on TIKA-1888: -- Our current mimetype definition for netcdf is: {code

[jira] [Resolved] (TIKA-1891) Update mimetype for mime-type image/fits

2016-03-07 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1891. -- Resolution: Duplicate Fix Version/s: (was: 1.13) > Update mimetype for mime-type image/f

[jira] [Resolved] (TIKA-1889) Update mimetype for *.qt and *.mov files detection

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1889. -- Resolution: Duplicate Fix Version/s: (was: 1.13) > Update mimetype for *.qt and *.mov fi

[jira] [Resolved] (TIKA-1892) Mime Magic for application/x-mobipocket-ebook and application/x-shapefile

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1892. -- Resolution: Fixed Fix Version/s: 1.13 Thanks, SHP added and MOBI updated in 74e71eb > Mime Ma

[jira] [Commented] (TIKA-1893) Add new mimetype for *.icns (Apple Icon Image Format) files

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182181#comment-15182181 ] Nick Burch commented on TIKA-1893: -- Do you have a patch or pull request for this? > Add new mimet

[jira] [Resolved] (TIKA-1890) Update mimetype for application/vnd.ms-cab-compressed

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1890. -- Resolution: Fixed More specific mime magic added in f7d3097 along with a unit test > Update mimet

[jira] [Commented] (TIKA-1889) Update mimetype for *.qt and *.mov files detection

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182156#comment-15182156 ] Nick Burch commented on TIKA-1889: -- Isn't this a duplicate of TIKA-1882? > Update mimetype for *

[jira] [Commented] (TIKA-1888) Update mimetype for application/x-netcdf

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182155#comment-15182155 ] Nick Burch commented on TIKA-1888: -- That match looks to be a string. In order to keep it readable

[jira] [Commented] (TIKA-1887) Add new mimetype for file extensions .po

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182154#comment-15182154 ] Nick Burch commented on TIKA-1887: -- http://www.icanlocalize.com/site/tutorials/how-to-translate

[jira] [Commented] (TIKA-1886) Updating tika-mimetypes.xml to detect .hfa files

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182150#comment-15182150 ] Nick Burch commented on TIKA-1886: -- Matching pull request is https://github.com/apache/tika/pull/88

[jira] [Commented] (TIKA-1885) Updated tika-mimestype.xml and a detector to identify new types of files based on analysis

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182149#comment-15182149 ] Nick Burch commented on TIKA-1885: -- Any luck with the pull request? > Updated tika-mimestype.

[jira] [Commented] (TIKA-1882) Updating the tika-mimetypes.xml for new mime magic patterns

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182146#comment-15182146 ] Nick Burch commented on TIKA-1882: -- Just because other people think it's a magic doesn't necessarily mean

[jira] [Commented] (TIKA-1883) Identification of Mime Type for Empty Files

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182145#comment-15182145 ] Nick Burch commented on TIKA-1883: -- This pull request is almost impossible to understand. Any chance you

[jira] [Resolved] (TIKA-1878) Upgrade Apache SIS 0.6

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1878. -- Resolution: Fixed Fix Version/s: 1.13 Thanks, upgraded but a slightly different way (I pulled

[jira] [Commented] (TIKA-1881) On updating mime magic for existing mime types

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182137#comment-15182137 ] Nick Burch commented on TIKA-1881: -- As mentioned on the Github pull request: For the Atom, RSS and RDF

[jira] [Resolved] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1877. -- Resolution: Fixed Fix Version/s: 1.13 Patch applied, with a slight tweak to rename the test file

[jira] [Commented] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182131#comment-15182131 ] Nick Burch commented on TIKA-1877: -- With your patch applied, the Tika app correctly detects your new text

[jira] [Resolved] (TIKA-1875) Updating tika-mimetypes.xml to detect .NC files

2016-03-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1875. -- Resolution: Fixed Fix Version/s: (was: 1.11) 1.13 Thanks for the new patch

[jira] [Commented] (TIKA-1885) Updated tika-mimestype.xml and a detector to identify new types of files based on analysis

2016-03-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177566#comment-15177566 ] Nick Burch commented on TIKA-1885: -- Did you mean to close this? Is there a matching pull request or patch

Re: Need suggestion on file type .HFA to be added Tika.

2016-03-02 Thread Nick Burch
On Wed, 2 Mar 2016, Nandan Padar Chandrashekar wrote:
> Identified (Hierarchical File Architecture) HFA file format which is not presently being identified through Tika. extension : *.hfa Header tag contains string EHFA_HEADER_TAG
Looks fine for adding to Tika to me
> Should this be considered
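The detection described in the message above (a *.hfa extension plus a header tag containing the string EHFA_HEADER_TAG) can be sketched roughly as follows. This is only an illustration of the magic-byte idea, not Tika's detector or its tika-mimetypes.xml syntax; the function names, the offset of 0, and the type name returned are all assumptions made for the example.

```python
# Hypothetical sketch of a magic-byte check; not Tika's implementation.
EHFA_MAGIC = b"EHFA_HEADER_TAG"  # the string quoted in the message


def looks_like_hfa(data: bytes, offset: int = 0) -> bool:
    """Return True if `data` carries the HFA header tag at `offset`.

    The default offset of 0 is an assumption; the message only says
    that the header tag contains this string.
    """
    return data[offset:offset + len(EHFA_MAGIC)] == EHFA_MAGIC


def guess_type(name: str, data: bytes) -> str:
    """Prefer the magic match, then fall back to the *.hfa extension."""
    if looks_like_hfa(data) or name.lower().endswith(".hfa"):
        return "application/x-hfa-example"  # placeholder name, an assumption
    return "application/octet-stream"
```

Checking the content first and the file name second mirrors the usual reason for adding magic at all: a renamed file can still be recognised from its leading bytes.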

[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2016-03-02 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176652#comment-15176652 ] Nick Burch commented on TIKA-1663: -- The other parser decorators are specified with options inside

[jira] [Commented] (TIKA-1882) Updating the tika-mimetypes.xml for new mime magic patterns

2016-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174295#comment-15174295 ] Nick Burch commented on TIKA-1882: -- I'm not sure the quicktime pattern is correct - I have some MOV files

[jira] [Commented] (TIKA-1877) On updating the tika-mimetypes.xml to detect .fts file format, tika detector does not return anything

2016-02-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170579#comment-15170579 ] Nick Burch commented on TIKA-1877: -- Posting the whole modified tika mimetypes file isn't ideal - it's hard

[jira] [Commented] (TIKA-1875) Updating tika-mimetypes.xml to detect .NC files

2016-02-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170572#comment-15170572 ] Nick Burch commented on TIKA-1875: -- As mentioned on list, there is a github pull for this: https

[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata

2016-02-26 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169005#comment-15169005 ] Nick Burch commented on TIKA-1865: -- Whatever we do, matching changes should be made to the other Email

[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata

2016-02-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167231#comment-15167231 ] Nick Burch commented on TIKA-1865: -- IIRC it needs the "fixed length properties" support to be

[jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules

2016-02-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167123#comment-15167123 ] Nick Burch commented on TIKA-1855: -- Currently, we have most test documents in Tika Parsers, and a handful

[jira] [Commented] (TIKA-1873) Test Cases failed when tika-mimetypes.xml is changed

2016-02-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167012#comment-15167012 ] Nick Burch commented on TIKA-1873: -- Interesting stuff! I'd skip most container-based formats

[jira] [Commented] (TIKA-1873) Test Cases failed when tika-mimetypes.xml is changed

2016-02-24 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15166334#comment-15166334 ] Nick Burch commented on TIKA-1873: -- What changes did you make to the mime types file? If you alter how

Re: PDFParser in-process mode

2016-02-24 Thread Nick Burch
On Wed, 24 Feb 2016, Pei Chen wrote:
> Does the default PDF parser, when used through the auto-detect parser, require Tika to run in server mode?
No
> It seems to try and open an http connection to localhost:8080 by default? Can it run in-process?
The stacktrace shows you're not using the PDF parser: at

[jira] [Resolved] (TIKA-1869) Jackson update to latest version

2016-02-24 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1869. -- Resolution: Fixed Thanks, patch applied > Jackson update to latest vers

[jira] [Commented] (TIKA-1870) Relocating RichTextContentHandler into tika-core from tika-server

2016-02-24 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163117#comment-15163117 ] Nick Burch commented on TIKA-1870: -- Currently the class lacks javadocs to explain what it does, and seems

[jira] [Commented] (TIKA-1868) create clean tika-server jar and shaded classifier jar

2016-02-24 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163016#comment-15163016 ] Nick Burch commented on TIKA-1868: -- I'm not sure why you'd want to be using that Tika Server exception

[jira] [Commented] (TIKA-1869) Jackson update to latest version

2016-02-24 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162917#comment-15162917 ] Nick Burch commented on TIKA-1869: -- Could you try bumping the version in your own checkout of Tika head

[jira] [Commented] (TIKA-1868) create clean tika-server jar and shaded classifier jar

2016-02-24 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162915#comment-15162915 ] Nick Burch commented on TIKA-1868: -- As explained by several people on the mailing list, you shouldn't

[jira] [Commented] (TIKA-1867) Tika external parsers cannot be turned off without patching the tika-app-XX.jar

2016-02-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159042#comment-15159042 ] Nick Burch commented on TIKA-1867: -- You should be able to exclude the CompositeExternalParser with a ~5

[jira] [Commented] (TIKA-1864) org.apache.poi.hssf.record.formula.UnaryPlusPtg package for tika-app-1.10

2016-02-22 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15156790#comment-15156790 ] Nick Burch commented on TIKA-1864: -- First up, I'd suggest you upgrade to Apache Tika 1.12, which

[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-19 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154208#comment-15154208 ] Nick Burch commented on TIKA-1607: -- We have generally required those developing a parser to do more

[jira] [Resolved] (TIKA-1862) Exception in thread "Thread-9" java.lang.UnsatisfiedLinkError: /usr/lib/jvm/jre/lib/amd64/headless/libmawt.so: libcups.so.2: cannot open shared object file: No such file

2016-02-19 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1862. -- Resolution: Invalid This isn't a Tika issue. You either need to fix your JVM installation, or talk

[jira] [Resolved] (TIKA-1856) Error while parsing an ogg file

2016-02-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1856. -- Resolution: Fixed Fix Version/s: 1.13 The fix was fairly quick in the end, but the process

[jira] [Commented] (TIKA-1859) file poi reads tika does not bring the content

2016-02-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150596#comment-15150596 ] Nick Burch commented on TIKA-1859: -- Which file? How isn't it working? How are you calling Apache Tika? Did

[jira] [Commented] (TIKA-1858) Unable to extract content from chunked portion of large file

2016-02-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150370#comment-15150370 ] Nick Burch commented on TIKA-1858: -- Other than a handful of text-based file types, Tika will need

[jira] [Commented] (TIKA-1856) Error while parsing an ogg file

2016-02-16 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148629#comment-15148629 ] Nick Burch commented on TIKA-1856: -- Picking one of those files to look at,{{oggz-info}} processes

Re: Need project suggestions to contribute to Apache Tika

2016-02-13 Thread Nick Burch
On Fri, 12 Feb 2016, Prasad N S wrote: I have over 5 years of experience in software development. My favorite language is Java, though I am comfortable with Python too. I have worked on a range of databases from relational to NoSQL and distributed systems. I am a quick learner and open to learn

Re: scm info in pom.xml

2016-02-11 Thread Nick Burch
On Sat, 6 Feb 2016, Ken Krugler wrote: I'm revisiting the creation of a new tika-langdetect module in the 2.x branch, and have created a pom.xml But in looking at what I started with (from tika-translate), I see this: http://svn.apache.org/viewvc/tika/trunk/tika-langdetect

[jira] [Commented] (TIKA-1850) Tika erroneously detects some versions of jQuery as "text/html"

2016-02-04 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132249#comment-15132249 ] Nick Burch commented on TIKA-1850: -- It's showing up for me in the snapshots repo - see https

[jira] [Commented] (TIKA-1848) Address issues with Tika 1.12rc#1

2016-02-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130376#comment-15130376 ] Nick Burch commented on TIKA-1848: -- I'm not sure if our test files should have license headers in them

[jira] [Resolved] (TIKA-1821) Problem in Tika().detect for xml file signed in CADES

2016-02-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1821. -- Resolution: Fixed Fix Version/s: 1.13 Thanks for these, I've used them to add unit tests which

[jira] [Commented] (TIKA-1850) Tika erroneously detects some versions of jQuery as "text/html"

2016-02-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130678#comment-15130678 ] Nick Burch commented on TIKA-1850: -- Looks like a duplicate to me, are you happy to close

[jira] [Commented] (TIKA-1141) javascript files that contain "<html" are detected as text/html

2016-02-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130723#comment-15130723 ] Nick Burch commented on TIKA-1141: -- I've tweaked the mime magic for HTML, so we give javascript files

[jira] [Commented] (TIKA-1850) Tika erroneously detects some versions of jQuery as "text/html"

2016-02-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130860#comment-15130860 ] Nick Burch commented on TIKA-1850: -- Please grab a nightly build / build from git, and check - the test

[jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX

2016-02-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126532#comment-15126532 ] Nick Burch commented on TIKA-1841: -- Ideally we would break out the header and footer into separate divs

[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126317#comment-15126317 ] Nick Burch commented on TIKA-1845: -- Near the top of the jira page are some buttons, please hit "

[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy

2016-02-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126450#comment-15126450 ] Nick Burch commented on TIKA-1843: -- Ideally you'd work with the Sigrun owner to have them do it - it's

[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy

2016-01-29 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123552#comment-15123552 ] Nick Burch commented on TIKA-1843: -- Looks like Sigrun is an active project, so best bet would be to submit

[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy

2016-01-29 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123612#comment-15123612 ] Nick Burch commented on TIKA-1843: -- Getting a maven-built project into the Sonatype OSS repo for maven use

[jira] [Resolved] (TIKA-1823) Support detecting DWF format

2016-01-26 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1823. -- Resolution: Fixed Fix Version/s: 1.13 Thanks, I've added this magic, along with a unit test

[jira] [Reopened] (TIKA-1840) No way to link slide notes to slide in PPT output.

2016-01-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch reopened TIKA-1840: -- Re-opening as the applied patch causes the notes text to be included twice, which isn't ideal, so further

[jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX

2016-01-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115857#comment-15115857 ] Nick Burch commented on TIKA-1841: -- I think it would be good to have the PPT and PPTX parsers return xhtml

[jira] [Created] (TIKA-1839) Update website inclusion of Examples for Git

2016-01-22 Thread Nick Burch (JIRA)
Nick Burch created TIKA-1839: Summary: Update website inclusion of Examples for Git Key: TIKA-1839 URL: https://issues.apache.org/jira/browse/TIKA-1839 Project: Tika Issue Type: Task

Re: Are we on git?

2016-01-22 Thread Nick Burch
On Fri, 22 Jan 2016, Mattmann, Chris A (3980) wrote: Our new ASF git repo is: https://git-wip-us.apache.org/repos/asf/tika.git Here’s an email I sent to the OODT-dev list about how to convert from your existing SVN checkout to Git. http://s.apache.org/UNr Steps I followed on my trunk

Are we on git?

2016-01-21 Thread Nick Burch
Hi All I've seen a commit message to git, but no "stop using SVN", and http://tika.apache.org/contribute.html still talks about SVN being our master. What's the status? Have we switched? Still in progress? Where should we commit to? Is it time to delete our SVN checkouts and re-checkout

[jira] [Commented] (TIKA-1823) Support detecting DWF format

2016-01-18 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105081#comment-15105081 ] Nick Burch commented on TIKA-1823: -- Do you have a very small sample DWF file (ideally your own, otherwise

Re: WMF extraction

2016-01-15 Thread Nick Burch
On Thu, 14 Jan 2016, Andreas Beeker wrote: POI will have a WMF module (org.apache.poi.hwmf.*) in the next beta. Looking over the govdocs collection, those embedded wmfs might contain interesting information for TIKA. Should the output be part of the embedding document, e.g. ppt, or does it

RE: Tika questions on StackOverflow

2016-01-14 Thread Nick Burch
On Wed, 13 Jan 2016, Allison, Timothy B. wrote:
> Are there other consumer lists we should be following? Elastic Search?
I think Elastic Search only has a forum-type thingy, this probably should let you see Tika posts there (not that frequent)

[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-14 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097884#comment-15097884 ] Nick Burch commented on TIKA-1824: -- Tika already supports using a custom classloader for loading parser

Tika questions on StackOverflow

2016-01-13 Thread Nick Burch
Hi All This may be old news for some of you, in which case you can skip the email, but for others... StackOverflow is a programming-focused question and answer site, with excellent google-fu, quite wide use, and growing use. At the moment I'd say there's something like a new Tika question a

Re: [VOTE] Moving SCM to Git

2016-01-11 Thread Nick Burch
On 02/01/16 04:30, Mattmann, Chris A (3980) wrote: Hi Everyone, DISCUSS thread here: http://s.apache.org/wVE Time to officially VOTE on moving Tika to Git. I’ve made a wiki page for our SCM explaining how to use Git at Apache, and how to use it with Github, and how to use it even in a

[jira] [Commented] (TIKA-1821) Problem in Tika().detect for xml file signed in CADES

2016-01-07 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087613#comment-15087613 ] Nick Burch commented on TIKA-1821: -- Hopefully fixed in r1723581 - the length is part of the initial magic

[jira] [Commented] (TIKA-1817) Extracts entire file content for ASCII DXF files

2015-12-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070169#comment-15070169 ] Nick Burch commented on TIKA-1817: -- Thanks for that. Test file from JustCAD added in r1721576, along

[jira] [Commented] (TIKA-1817) Extracts entire file content for ASCII DXF files

2015-12-22 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068077#comment-15068077 ] Nick Burch commented on TIKA-1817: -- I've had a go at adding mime subtypes for binary and ascii for DXF

Re: looking to contribute

2015-12-22 Thread Nick Burch
On Wed, 16 Dec 2015, Nick Burch wrote: If you want to try more coding, Tim quite often runs Tika against some large filesets, and has a nifty tool to report on what breaks. He can hopefully point you at the most recent report! Maybe have a look through that, identify a few common failures from

[jira] [Commented] (TIKA-1817) Extracts entire file content for ASCII DXF files

2015-12-21 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067319#comment-15067319 ] Nick Burch commented on TIKA-1817: -- Any chance you could upload a small sample DXF file? Ideally

Re: looking to contribute

2015-12-20 Thread Nick Burch
On Sun, 20 Dec 2015, Joey Hong wrote: Oh, my bad. I should have realized when the HTML looked generated. I have now added the usage examples to the examples.apt file, and the page looks fine after it was built by mvn. As of now, the examples are edited both for the 1.11/ and 1.12/ folders;

Re: looking to contribute

2015-12-20 Thread Nick Burch
On Sat, 19 Dec 2015, Joey Hong wrote: Regarding TIKA-1329, I found the tika-site in the Subversion source code, and I called: svn checkout https://svn.apache.org/repos/asf/tika/site/publish/1.11/ . Since this isn’t part of the

[jira] [Commented] (TIKA-1773) No XML Metadata output for JP2 files

2015-12-18 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063957#comment-15063957 ] Nick Burch commented on TIKA-1773: -- We can't depend on a LGPL library - see http://www.apache.org/legal

Re: looking to contribute

2015-12-16 Thread Nick Burch
On Wed, 16 Dec 2015, Joey Hong wrote: My name is Joey. I am a college freshman with programming experience looking to get into the world of open-source. I was hoping to contribute to the Tika project, and was wondering if there were any tasks that a beginner like me could tackle. I am willing

[jira] [Commented] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl

2015-12-16 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060203#comment-15060203 ] Nick Burch commented on TIKA-1813: -- My best guess is that these have been truncated. Having a look

[jira] [Comment Edited] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl

2015-12-16 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060203#comment-15060203 ] Nick Burch edited comment on TIKA-1813 at 12/16/15 3:58 PM: My best guess

Re: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Nick Burch
On Sun, 13 Dec 2015, Bob Paulin wrote: So in short: Source in tika-parser: dependencies managed in tika-parser and copied to module. Source in Modules: dependencies managed in modules and consolidated via maven shade plugin. Conflicting dependencies managed by maven. IIRC there are some util /

Re: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Nick Burch
On 14/12/15 16:26, Ray Gauss wrote: I'd vote for a tika-parser-common(s) artifact for common util classes and dependencies. That would make sense to me Nick

Re: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Nick Burch
On Mon, 14 Dec 2015, Bob Paulin wrote: So there seems to be a pretty good consensus forming around moving the sources but some differing opinions on where to put shared parser code. I know it'll be a bit dull and some work, but... Could someone put together a list (probably in the wiki or on

[jira] [Commented] (TIKA-1806) Bouncy Castle conflict

2015-12-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038105#comment-15038105 ] Nick Burch commented on TIKA-1806: -- I've just tried that file with the Tika App, and I don't get

[jira] [Created] (TIKA-1805) Default parser/detector loading should warn on missing/empty classes

2015-12-01 Thread Nick Burch (JIRA)
Nick Burch created TIKA-1805: Summary: Default parser/detector loading should warn on missing/empty classes Key: TIKA-1805 URL: https://issues.apache.org/jira/browse/TIKA-1805 Project: Tika

[jira] [Resolved] (TIKA-1805) Default parser/detector loading should warn on missing/empty classes

2015-12-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1805. -- Resolution: Fixed Changed as of r1717560, along with an additional handler method to alert if a service

Re: more modular parser bundles

2015-11-30 Thread Nick Burch
On Mon, 30 Nov 2015, Allison, Timothy B. wrote: Perhaps we could start with a tika-advanced-bundle to gather all of the nlp/advanced parsers? Or would this have to wait for Tika 2.0? I've noticed that there have been a lot fewer queries (on our list, on stackoverflow, at events etc) caused

[jira] [Commented] (TIKA-1804) Tika use no free json.org

2015-11-30 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15031681#comment-15031681 ] Nick Burch commented on TIKA-1804: -- The JSON license has been approved for use by Apache Projects

[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-11-26 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15028642#comment-15028642 ] Nick Burch commented on TIKA-1706: -- Does anyone have any objections to us going ahead with this for Tika

Re: NER wiki page up

2015-11-20 Thread Nick Burch
On Fri, 20 Nov 2015, Mattmann, Chris A (3980) wrote: P.S. Nick - Git instructions coming next :) Woot! :) Nick

Re: Incompatibility between apacke tikka and apache commons email jar

2015-11-20 Thread Nick Burch
On Fri, 20 Nov 2015, Neel79 wrote: I am using Apache commons email jar 1.4 and Apache Tika jar 1.10. I see the following error Caused by: java.lang.UnsupportedClassVersionError: JVMCFRE003 bad major version; class=org/apache/tika/detect/Detector, offset=6 Apache Tika now requires Java 7 or
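
The "bad major version" error quoted above is what a JVM reports when a class file was compiled for a newer Java release than the running JVM supports. As a rough illustration only, and not something from the thread itself, the major version can be read from the first eight bytes of a .class file. The function names below are hypothetical; the version numbers come from the class-file format (51 corresponds to Java 7).

```python
import struct

# Class-file major versions for a few Java releases (class-file format).
MAJOR_TO_JAVA = {49: "5", 50: "6", 51: "7", 52: "8"}


def class_file_major_version(data: bytes) -> int:
    """Return the major version stored in a Java class-file header.

    Hypothetical helper: the header is a 4-byte magic number (0xCAFEBABE),
    then a 2-byte minor version and a 2-byte major version, all big-endian.
    """
    if len(data) < 8:
        raise ValueError("too short to be a class file")
    magic, _minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major


def required_java(data: bytes) -> str:
    """Map the major version to the Java release the class was compiled for."""
    major = class_file_major_version(data)
    return MAJOR_TO_JAVA.get(major, "unknown (major %d)" % major)
```

Running this over the bytes of a class extracted from a jar (for example the Detector class named in the error) would show which Java release the jar was built for; a JVM older than that release refuses to load the class.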

Re: [DISCUSS] Moving to Git

2015-11-19 Thread Nick Burch
On Thu, 19 Nov 2015, Mattmann, Chris A (3980) wrote: I’ll be happy to update our docs and to write a wiki page on using Tika & Git that we can refer folks to. I think I’ve demonstrated documenting things on the Tika wiki :) Great stuff! Scribble something sensible down, and I can vote +1 to
