Re: Tika at ApacheCon

2012-08-03 Thread Nick Burch
On Fri, 3 Aug 2012, Jukka Zitting wrote: Did someone already submit a talk about Tika to ApacheCon Europe [1]? The proposal review functionality will be going live in a few hours, you'd be able to check about submissions yourself then. As it isn't live just yet, I've gone in and checked, and

Re: AutoDetectParser not picking up custom parser

2012-08-06 Thread Nick Burch
On Mon, 6 Aug 2012, 122jxgcn wrote: In order to AutoDetectParser to pick up my parser, I followed the instructions listed in http://tika.apache.org/1.1/parser_guide.html#List_the_new_parser However, AutoDetectParser is not picking up my parser when I do Parser parser = new AutoDetectParser();

Re: AutoDetectParser not picking up custom parser

2012-08-07 Thread Nick Burch
On Mon, 6 Aug 2012, 122jxgcn wrote: Did you make sure both your parser and the service file are on the classpath? If you miss one or both then your new parser won't be loaded I'm sorry but can you be more specific? I'm not sure what do you mean by service file. The step is described here:
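
The "service file" mentioned here is the plain-text file META-INF/services/org.apache.tika.parser.Parser described in the parser guide linked earlier in the thread. As a rough diagnostic, the sketch below lists every copy of that file visible on the classpath. It uses only the JDK and is not part of Tika; the class name ServiceFileCheck and its helper method are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class ServiceFileCheck {
    // Service-provider file that lists parser class names, one per line.
    static final String SERVICE_FILE = "META-INF/services/org.apache.tika.parser.Parser";

    /** Returns every copy of the named resource that the class loader can see. */
    static List<URL> findResources(String name) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(cl.getResources(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<URL> found = findResources(SERVICE_FILE);
        if (found.isEmpty()) {
            System.out.println("No service file on the classpath: " + SERVICE_FILE);
        }
        for (URL url : found) {
            // Each URL shows which jar or directory supplied a copy.
            System.out.println("Found: " + url);
        }
    }
}
```

Run it with the same classpath as the application. If the jar or directory holding the custom parser does not appear in the output, the service file is not being found.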

Re: Found bug in PSDParser.java

2012-10-11 Thread Nick Burch
On Thu, 11 Oct 2012, Andrew Stepanov wrote: org.apache.tika.exception.TikaException: Invalid Image Resource Block Signature Found, got 3686985 0x384249 but the spec defines 943868237 I compared specification ( http://www.adobe.com/devnet-apps/photoshop/fileformatashtml/PhotoshopFileFormats.htm)
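
The two numbers in the error message are easier to read as ASCII. The sketch below (JDK only; the class name PsdSignature is made up for illustration) decodes each value as a big-endian 32-bit integer and prints its non-zero bytes as characters.

```java
public class PsdSignature {
    /** Renders the non-zero bytes of a big-endian 32-bit value as ASCII text. */
    static String asAscii(int value) {
        StringBuilder sb = new StringBuilder();
        for (int shift = 24; shift >= 0; shift -= 8) {
            int b = (value >>> shift) & 0xFF;
            if (b != 0) {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Expected signature from the error message: 0x3842494D
        System.out.println(asAscii(943868237)); // prints "8BIM"
        // Value actually read: 0x00384249
        System.out.println(asAscii(3686985));   // prints "8BI"
    }
}
```

The value read is the expected signature shifted by one byte, with a leading zero byte. That would be consistent with the read starting one byte too early, although the excerpt above does not confirm the cause.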

Re: MimeTypes.java final?

2012-10-29 Thread Nick Burch
On Mon, 29 Oct 2012, Ryan McKinley wrote: The key things I am stuck with: 1. As is, MimeTypes#forName(String name) will get or create the MimeType. There is no way to ask if the MimeTypes registry already knows about the type. I think the idea is that you use the underlying MediaTypeRegistry i

Re: Patching fix for Tika-521 on Tika 0.8

2012-11-21 Thread Nick Burch
On Wed, 21 Nov 2012, Jana, Kumar Raja wrote: Is it possible to patch the fix for Tika-521 to Tika 0.8 without upgrading to POI 3.8? Tika 0.8 is fairly old, there have been lots of new features and bug fixes since then. Ditto POI 3.7 There is TikaExcelEventBasedExtraction.diff attached to the

Re: Tika OneNote Support

2012-11-25 Thread Nick Burch
On Wed, 14 Nov 2012, 122jxgcn wrote: Has anyone worked on extracting content from MS OneNote files (*.one)? It would be great if someone could tell me how to parse OneNote files programmatically. I'm not aware of anything. The good news is that the file format is fully docume

Re: Contribution of parser for FITS file format to Apache Tika

2012-12-05 Thread Nick Burch
On Wed, 5 Dec 2012, Rahul Khanna wrote: I'm a developer who has used Apache Tika in a Research Data Repository System at The Australian National University. As part of the requirements of the project we extended the functionality of Apache Tika by creating a parser that extracts the headers of

Re: Tika Parser 1.2 - MP4Parser.java Query

2013-01-07 Thread Nick Burch
On Mon, 7 Jan 2013, Sharon Corbett wrote: I have a question regarding the MP4Parser.java file contained in version 1.2 of Apache Tika Parser. The file contains the following comment, "This uses the MP4Parser project from http://code.google.com/p/mp4parser/ to do the underlying parsing". I'd li

Re: [DISCUSS] Release Candidate for 1.3?

2013-01-09 Thread Nick Burch
On Wed, 9 Jan 2013, Jukka Zitting wrote: Re: binary compatibility; Before cutting the release it would be a good idea to update the clirr plugin configuration to use Tika 1.2 instead of 1.0 when checking for binary compatibility. Can we ask clirr to check both versions? Ideally we need to cont

Re: Microsoft Office versions supported by Tika 1.3?

2013-02-05 Thread Nick Burch
On Tue, 5 Feb 2013, saisantoshi wrote: I am looking at the versions supported by newer version of Tika (1.3) and was not sure what version(s) of the Microsoft office it supports (97/2000/2010/2013) for each of the below? http://tika.apache.org/1.3/formats.html#Microsoft_Office_document_formats

Re: Build failed in Jenkins: Tika-trunk #977

2013-02-07 Thread Nick Burch
On Thu, 7 Feb 2013, Michael McCandless wrote: Hmm it looks like the Tika build is failing on Jenkins due to this: [ERROR] /home/jenkins/jenkins-slave/workspace/Tika-trunk/trunk/tika-server/src/main/java/org/apache/tika/server/CSVMessageBodyWriter.java:[51,3] method does not override a method fr

Join us at ApacheCon North America!

2013-02-12 Thread Nick Burch
Hi All It's now about 2 weeks until ApacheCon North America, which is taking place Sunday 24th Feb - Thursday 28th in Portland. Quite a few people from our project will be there, and we'd love to see you all! If you haven't already registered for the conference, then we've some good news - w

Re: PICT format detection

2013-02-14 Thread Nick Burch
On Thu, 14 Feb 2013, Jérémie Lesage wrote: Do I have to create an issue in Jira ? Yes, please do, and attach the patch there Nick

Re: how to add more metadata to tika extraction?

2013-02-27 Thread Nick Burch
On Wed, 27 Feb 2013, eShard wrote: Here's my quandary: I'm using manifoldcf v1.1.1 to crawl non standard (IBM) RSS feeds and custom RSS feeds. There's additional metadata in each item that we need to capture. I added the additional fields to the Solr schema (4.0 final) but the additional fields a

Re: how to add more metadata to tika extraction?

2013-03-05 Thread Nick Burch
On Wed, 27 Feb 2013, eShard wrote: I manually ran the tika-app --gui and I dropped the rss feed into it. Here's what the metadata output: Content-Length: 615913 Content-Type: application/rss+xml dc:description: This is an IBM C3 Public Files feed generated by a Java application. dc:title: IBM -

Re: FW: [Tika Wiki] Update of "RecursiveMetadata" by domtheo

2013-03-07 Thread Nick Burch
On Thu, 7 Mar 2013, Mattmann, Chris A (388J) wrote: Guys I reverted this spammer but don't know how to block him. Help? I think you need to ask infra Nick

Re: Questions about java TIKA project.

2013-03-07 Thread Nick Burch
On Thu, 7 Mar 2013, A Z wrote: I also notice that you are building on POI (presumably 3.9). -POI has shortfalls around HWPFDocument objects; Microsoft Word  .doc files. One may not really easily insert org.apache.poi.hwpf.usermodel.Picture Apache Tika only reads files in through the various l

Re: Tika GUI can't get the original file

2013-03-08 Thread Nick Burch
On Fri, 8 Mar 2013, Juri Linkov wrote: This works flawlessly in the CLI version. But when using the GUI interface, the original stream gets wrapped through ProgressMonitorInputStream, so hasFile() returns false. Fortunately, getFile() automagically creates a spooled temporary file. Thanks fo

Wiki permissions changes

2013-03-10 Thread Nick Burch
Hi All Just to let everyone know that we've made some changes to permissions on the Tika wiki, to hopefully avoid the recent repeated spam attacks on it. As such, before you can make changes to the wiki (edit pages, create new ones etc), your wiki user account needs to be added to the Contrib

Re: MP4Parser Triggers no ContentHandler.startDocument() and ContentHandler.endDocument() in one case

2013-05-28 Thread Nick Burch
On Tue, 28 May 2013, Christian Reuschling wrote: This works like a charm, but inside MP4Parser there are these lines of code: Lines 146-154, parse() method: MovieBox moov = getOrNull(isoFile, MovieBox.class); if (moov == null) { // Bail out return;

Re: MP4Parser Triggers no ContentHandler.startDocument() and ContentHandler.endDocument() in one case

2013-05-29 Thread Nick Burch
On Wed, 29 May 2013, Christian Reuschling wrote: Nevertheless, in this case an Exception (like in all other parsers) or a tika body with length zero, which is indicated at least by handler.endDocument() would be the appropriate way, isn't it? - From the ContentHandlers point of view, there is n

Re: MP4Parser triggers .... something between an exception and endDocument() from the ContentHandler's point of view?

2013-06-07 Thread Nick Burch
On Fri, 7 Jun 2013, Ray Gauss II wrote: I think the Parser interface Javadoc would make sense as a place to document, but I don't know if there is an existing policy. It might be helpful if some kind soul could take a few hours to review all the existing parsers, and give a summary of what the

Re: MP4Parser Triggers no ContentHandler.startDocument() and ContentHandler.endDocument() in one case

2013-06-24 Thread Nick Burch
On Wed, 29 May 2013, Nick Burch wrote: I'm not sure if we do have a properly documented policy on what a parser should do if it receives a file it can't handle. For ones that are invalid (eg corrupt), I believe an exception is the expected result. The case when the file seems valid,

RFC822Parser build error on gump

2013-06-25 Thread Nick Burch
Hi All Anyone have any idea about this compiler error on the tika parsers project as hit by gump? http://vmgump.apache.org/gump/public/tika/tika-parsers/gump_work/build_tika_tika-parsers.html Gump notifications will hopefully start again soon, which'd let us find out about breaking changes fr

RE: need URL openStream() to test Tika-327 in MimeDetectionTest?

2013-06-28 Thread Nick Burch
On Fri, 28 Jun 2013, Allison, Timothy B. wrote: Doh! Please ignore last email: https://issues.apache.org/jira/browse/TIKA-1129 Would anyone mind if I recreated the structure from the offending html so that we can return this test to test a local copy of the document? I think as long as we re

Re: Keynote Thumbnails?

2013-06-28 Thread Nick Burch
On Fri, 28 Jun 2013, Mike Patterson wrote: I understand that Tika accomplishes the text portion of this project today. I'm curious however, given the familiarity with the keynote file format, if anyone has any suggestions for extracting/generating larger thumbnail images from these presentation

Re: MagicDetector doesn't work for all RFC822 message types.

2013-07-11 Thread Nick Burch
On Thu, 11 Jul 2013, Kai-Uwe Schmidt wrote: I am trying to use Tika to extract metadata from .eml files created via Novell Groupwise. Doing this, I ran into a problem with the detection of "message/rfc822". The MagicDetector (working with the default tika-mimetypes.xml) compares the "match" values binar

Re: AW: MagicDetector doesn't work for all RFC822 message types.

2013-07-11 Thread Nick Burch
On Thu, 11 Jul 2013, Kai-Uwe Schmidt wrote: Where can I read how to provide a patch? Hmm. I was going to say: * Go to the website at and follow the link to download the source * Look on the website at ??? and see the contribution instructions However, unless I'm missing something, we don'

RE: MagicDetector doesn't work for all RFC822 message types.

2013-07-11 Thread Nick Burch
On Thu, 11 Jul 2013, Allison, Timothy B. wrote: I think I may be uniquely qualified to answer this from an Idiot's guide/newish to Tika perspective. :) Apologies if I'm missing out on more obvious answers! Feedback from people like you is exactly what we need! I've been around too long to be

RE: MagicDetector doesn't work for all RFC822 message types.

2013-07-11 Thread Nick Burch
On Thu, 11 Jul 2013, Allison, Timothy B. wrote: I'm sorry that I missed your response (wound up in my spambox). I'd be happy to draft a section on how to contribute for Tika's website. How do I contribute that? Open an issue and submit html? Should I create a separate html or modify the h

[Announce] Welcome Tim Allison as Tika PMC member and committer

2013-07-30 Thread Nick Burch
Hi All The Tika PMC VOTE'd to add Tim Allison to our merry group as a PMC member and committer. Welcome, Tim! Please feel free to say a bit about yourself. Cheers Nick

Re: Tika JAX-RS server port bug

2013-09-16 Thread Nick Burch
On Mon, 16 Sep 2013, Kevin Slote wrote: My name is Kevin Slote and I am looking to report a bug that I think I found in the Tika JAX-RS server. I was not really sure where I should report this. Your best bet would be to create a new bug in JIRA, our issue tracker - https://issues.apache.org/ji

Excel files with "holes" in the cell sequence

2013-10-08 Thread Nick Burch
Hi All The Excel file formats (.xls and .xlsx) are somewhat sparse formats, and where a cell has never been used it generally doesn't get written to the file. (Being a Microsoft format, there are exceptions to this...). Currently, if you parse a file with cells at A1 B1 F1 G1, then Tika will

Re: problem with embedded OLE attachments

2013-10-17 Thread Nick Burch
On Thu, 17 Oct 2013, kevin slote wrote: Hi, I was trying to parse a word file with an embedded OLE attachment and I got this error... Caused by: java.lang.IllegalAccessError: tried to access method org.apache.poi.POIDocument.(Lorg/apache/poi/poifs/filesystem/DirectoryNode;)V from class org.apac

Re: problem with embedded OLE attachments

2013-10-17 Thread Nick Burch
On Thu, 17 Oct 2013, kevin slote wrote: I do. But, when I deleted the jars from my classpath, I got the same error. You haven't got them all then. See the POI FAQ for how to check what jar you're really using Additionally, is there a work around

RE: Extract thumbnail from openxml office files

2014-01-09 Thread Nick Burch
On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote: By searching on issues, I found the issue already created: https://issues.apache.org/jira/browse/TIKA-90 I'm not sure if the metadata is the right place to return this. Some formats offer a small thumbnail, others can offer a small thumbnail for ev

RE: Extract thumbnail from openxml office files

2014-01-09 Thread Nick Burch
On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote: I agree with you that metadata is not the best place to store the thumbnail result. Until now, our metadata has been a simple map of key:value pairs. This structure is not really flexible in some cases. Currently, we have four kinds of "things" that we return for

RE: Extract thumbnail from openxml office files

2014-01-09 Thread Nick Burch
On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote: I'm convinced that using embedded resources is a better solution. OK, sounds like we have a consensus and can go ahead with it, great! One outstanding query is what name we should give to these when we return them as embedded resources, and if we sh

RE: Extract thumbnail from openxml office files

2014-01-09 Thread Nick Burch
On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote: BTW, I may be wrong, but thumbnail handling in Alfresco seems quite complex because Alfresco can call external thumbnail generation with PDFBox or PDFRender. It can do, yes, but there are also dedicated classes to pull out most of the common

Re: Submission to ApacheCon on Tika

2014-01-31 Thread Nick Burch
On Fri, 31 Jan 2014, Jukka Zitting wrote: [to: only tika] On Fri, Jan 31, 2014 at 1:02 AM, Chris Mattmann wrote: I submitted the below talk on Apache Tika, Nutch and Solr to ApacheCon NA 2014: Nice! Looking forward to seeing you there. :-) I'm considering submitting an updated version of my

Re: Unsubscribe?

2014-02-05 Thread Nick Burch
On Wed, 5 Feb 2014, A Z wrote: Can someone unsubscribe or tell me how to do this from this Tika emai list? Easiest thing is for you to do it. Just send an email to dev-unsubscr...@tika.apache.org and follow the confirmation instructions (That email address is given in the welcome email when

Re: building Tika without bundle dependency on repositories

2014-02-05 Thread Nick Burch
On Wed, 5 Feb 2014, Allison, Timothy B. wrote: Speaking of building...is there an easy way to build Tika "locally" without reference to the repositories and without building each component one by one (in the correct order) and then manually installing in a local repository. I just do: cd

Re: [VOTE] Apache Tika 1.5 RC2

2014-02-10 Thread Nick Burch
On Sun, 9 Feb 2014, Dave Meikle wrote: A new release candidate for the Tika 1.5 release is now available at: http://people.apache.org/~dmeikle/tika-1.5-rc2/ Any chance you could explain the difference between tika-app-1.5.jar and original-tika-app-1.5.jar ? And if the latter is needed, is there

Re: [VOTE] Apache Tika 1.5 RC2

2014-02-14 Thread Nick Burch
On Fri, 14 Feb 2014, David Meikle wrote: Had a check on this and there isn’t anyone local I can find for a quick meet up. I am based in Scotland but travel a bit, so will look out for an opportunity to meet up with someone soon but doubt it will be in the coming weeks. Which bit of Scotland?

Re: [VOTE] Apache Tika 1.5 RC2

2014-02-16 Thread Nick Burch
On Sun, 16 Feb 2014, David Meikle wrote: On 14 Feb 2014, at 22:24, Nick Burch wrote: Which bit of Scotland? I might know someone… I am based in Edinburgh, so can get around quite easily if you know someone in the Central belt. I'll see if I can track someone down. Lewis would'

Failing test - PDFParserTest.testSequentialParser

2014-02-19 Thread Nick Burch
I've just tried to build Tika from svn, and despite doing a clean I've got a failing unit test when I try to build: --- Test set: org.apache.tika.parser.pdf.PDFParserTest -

RE: Failing test - PDFParserTest.testSequentialParser

2014-02-19 Thread Nick Burch
On Wed, 19 Feb 2014, Allison, Timothy B. wrote: Did you, by chance, add a pdf to your local test-documents directory? Yes! I have one uncommitted pdf in there There appear to be 16 pdfs in test-documents under parsers in trunk. The goal of that test was to make sure that the test-documents di

Re: [VOTE] Apache Tika 1.5 RC2

2014-02-22 Thread Nick Burch
On Fri, 14 Feb 2014, Annie Burgess wrote: I also live in a sort-of removed location - Anchorage, AK. If anyone knows of any developers up north, I'd love to try to connect with the AK Apache community. There are two main public places where Apache committers announce their locations: * htt

Re: Build failure at trunk in org.apache.tika.server.UnpackerResourceTest

2014-02-25 Thread Nick Burch
On Tue, 25 Feb 2014, Ken Krugler wrote: Failed tests: testText(org.apache.tika.server.UnpackerResourceTest) testImageDOCX(org.apache.tika.server.UnpackerResourceTest): expected:<[5516590467b069fa59397432677bad4d]> but was:<[bfb451ca6aa8f5a5095afd5228034e6a]> testImageXSL(org.apache.tika.se

Re: Build failure at trunk in org.apache.tika.server.UnpackerResourceTest

2014-02-26 Thread Nick Burch
On Wed, 26 Feb 2014, Ken Krugler wrote: I'm curious why we're not seeing buildbot failures, given that it seems to be a general problem and not just an issue on my Mac. Is buildbot configured to build that module? Or does it perhaps skip the server module? Nick

Re: Project dependencies page

2014-02-27 Thread Nick Burch
On Thu, 27 Feb 2014, Vadim Roizman wrote: Looks like this page is misleading: https://tika.apache.org/dependencies.html Looks like someone deleted the source to that page some time ago, but forgot to zap the published html version Have you found that page linked from anywhere on the Tika sit

Re: Project dependencies page

2014-02-27 Thread Nick Burch
On Thu, 27 Feb 2014, Vadim Roizman wrote: No, it probably just pops up in Google results, but then it propagates: https://stackoverflow.com/questions/21929040/special-characters-stored-when-extracting-content-from-microsoft-word-documents#comment33371269_22007797 It seems they're being generate

Re: Submission to ApacheCon on Tika

2014-03-02 Thread Nick Burch
On Sun, 2 Mar 2014, Jukka Zitting wrote: The schedule for Tika-related talks (http://apacheconnorthamerica2014.sched.org/?s=tika) looks a bit awkward. My talk is scheduled for Wednesday morning before Nick's afternoon slot, and Chris' and Annie's case studies overlap at 10am on Wednesday. I gu

Re: Using guava on tika ?

2014-03-06 Thread Nick Burch
On Thu, 6 Mar 2014, Hong-Thai Nguyen wrote: Guava (https://code.google.com/p/guava-libraries/) provides many facilities for text, file, collection ... manipulation. Should we use it in Tika? Can you give an example of where using Guava would either simplify some existing code, or improve its effe

Re: Inconsistent logging in current tika (1.5)

2014-03-06 Thread Nick Burch
On Fri, 7 Mar 2014, Konstantin Gribov wrote: Tika-core is quite pure (uses only java.util.logging) but tika-parsers uses commons-logging 1.1.1 (through pdfbox), slf4j-api 1.5.6 (through netcdf) and log4j 1.2.14 (through slf4j-log4j as test scope dependency). Also some parsers (like pdfbox) logs

How should video files with audio be handled by parsers?

2014-03-27 Thread Nick Burch
Hi All Does anyone know if we have a recommended way / plan of a way to handle video files with possibly multiple audio streams? Most of the multimedia container formats support video and zero or one audio streams, and a fair number support video and multiple audio streams. A few can actuall

Re: How should video files with audio be handled by parsers?

2014-03-27 Thread Nick Burch
On Thu, 27 Mar 2014, Konstantin Gribov wrote: Some containers (like matroska/mkv) tags audio and subtitle streams with language tag and some comment. From mplayer console output: [lavf] stream 0: video (h264), -vid 0 [lavf] stream 1: audio (aac), -aid 0, -alang rus, Rus BaibaKo.tv [lavf] stream

Re: How should video files with audio be handled by parsers?

2014-03-28 Thread Nick Burch
On Fri, 28 Mar 2014, Konstantin Gribov wrote: I think you should have three info blocks: video streams, audio streams and subtitles (if container supports their embedding). Sort naturally or by vid/aid/sid if present. That's not something Tika supports though. We have a metadata object we can

Re: metadata key for original file path?

2014-03-28 Thread Nick Burch
On Fri, 28 Mar 2014, Allison, Timothy B. wrote: In working on TIKA-1010, there are some cases where the full original file path is stored with an image or embedded document. TikaMetadatakeys.RESOURCE_NAME_KEY should be used for file name (right?), but what should I use for file path? I can on

Re: How to exclude a mimetype from being indexed in solr using tika?

2014-03-28 Thread Nick Burch
On Fri, 28 Mar 2014, eShard wrote: I'm using solr 4.0 Final I need movies "hidden" in zip files that need to be excluded from the index. I can't filter movies on the crawler because then I would have to exclude all zip files. If you're calling Tika directly, this is very easy. When tika hits e

Re: Unable to commit SVN ?

2014-04-03 Thread Nick Burch
On Thu, 3 Apr 2014, Hong-Thai Nguyen wrote: I have 500 error when committing to tika SVN. Do you have same problem ? POST request on '/repos/asf/!svn/me' failed: 500 Internal Server Error As posted on infrastructure@ a few minutes ago: - As per http://status.apache.org and twitt

Re: Tika VM Service

2014-04-08 Thread Nick Burch
On Tue, 8 Apr 2014, Lewis John Mcgibbney wrote: I would like to propose that we get a Tika service up and running on a VM. Tika users can do adhoc parsing, etc and can do this based on possibly stable nightly SNAPSHOT's or alternatively based on the most recent stable release. Preferably, the ser

Re: Tika VM Service

2014-04-09 Thread Nick Burch
On Wed, 9 Apr 2014, Nick Burch wrote: My vision of how this would work would be to use the Tika Server, with some extensions so that it self hosted some basic documentation. We're thinking of trying to start that tomorrow in the hackathon, any help / ideas / projects to crib off grate

Re: Tika VM Service

2014-04-09 Thread Nick Burch
On Wed, 9 Apr 2014, Konstantin Gribov wrote: I can recommend packer.io to generate images for major virtualization systems. Virtual appliances is useful for learning some software. With Tika, you can download either the Tika App single jar or the Tika Server single jar, and both let you get up

Re: Starter tasks for a new comer

2014-04-20 Thread Nick Burch
On Sun, 20 Apr 2014, Sachith Withana wrote: I'm Sachith and I'm from the Apache Airavata community. We are interested in using Apache Tika and wondering what would be the best place to start. Three places spring to mind: * Website, eg http://tika.apache.org/1.5/gettingstarted.html * The thre

Re: unit tests and classpaths

2014-04-25 Thread Nick Burch
On Thu, 24 Apr 2014, Annie Burgess wrote: I'm working on a very simple starter unit test for a new parser and am coming up with some roadblocks. I suspect it may be classpath related, but have tried many iterations and am coming up short. At first glance, it looks like your unit test is runni

Calling OSGi experts - TIKA-1276 patch review

2014-04-28 Thread Nick Burch
Hi All I know enough OSGi to be dangerous, but not enough to be sure of exactly what I should and shouldn't do... On TIKA-1276 we've got some suggested patches from Rupert Westenthaler which hopefully fix some Tika OSGi problems, as well as adding some more unit tests for the OSGi support.

Re: Shared MIME info update

2014-04-28 Thread Nick Burch
On Mon, 28 Apr 2014, Matthias Krueger wrote: I ran a diff on tika-mimetypes.xml and the latest Freedesktop share MIME info DB release (http://cgit.freedesktop.org/xdg/shared-mime-info/). It seems they have diverged quite a lot. I don't think they've ever been the same. We use their XML format,

Re: [DISCUSS] Nightly Jenkins Builds for Trunk

2014-05-14 Thread Nick Burch
On Wed, 14 May 2014, Lewis John Mcgibbney wrote: Right now in Jenkins (builds.apache.org) we don't seem to have a Tika project directory which contains the trunk build... it is just a free standing project buried under the mountain of jobs currently running on that box. I believe that Buildbot

JAXRS, endpoints and a / welcome page - any ideas why it's broken?

2014-05-15 Thread Nick Burch
Hi All One for our JAXRS gurus here... At ApacheCon, we came up with the idea of having a welcome page on the Tika Server, so that we could point people to it to try Tika, and let them discover what it offered. Based on that, and the mailing list discussions, we raised TIKA-1269. (Related t

Re: JAXRS, endpoints and a / welcome page - any ideas why it's broken?

2014-05-15 Thread Nick Burch
On Wed, 14 May 2014, Sergey Beryozkin wrote: UnpackerResource has no Path annotation so it is defaulted to "/". Every endpoint method within the class does have one though. I would've expected it to match based on those, is that not the case? However, the selection between multiple root reso

Re: parser metadata empty after tika detect

2014-05-16 Thread Nick Burch
On Fri, 16 May 2014, aliosha79 wrote: For this purpose I have written these few lines of code: File f = new File("MyEmail.eml"); is = new FileInputStream(f); Tika tika = new Tika(); String mimeType = tika.detect(is); This will most likely use a fair bit (to possibly all) of
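
The reply is cut off, but the thread is about how much of an input stream gets used up by detection before parsing starts. A general JDK technique for sniffing leading bytes without losing them is mark/reset on a buffered stream. The sketch below illustrates only that technique. It is not Tika's implementation, and the class name PeekHeader and its method are made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PeekHeader {
    /** Reads up to n leading bytes, then rewinds so later readers start from the beginning. */
    static byte[] peek(InputStream in, int n) {
        if (!in.markSupported()) {
            // A plain FileInputStream lands here; wrap it in a BufferedInputStream first.
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(n);
            byte[] buf = new byte[n];
            int total = 0;
            int read;
            while (total < n && (read = in.read(buf, total, n - total)) != -1) {
                total += read;
            }
            in.reset(); // rewind to the marked position
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "From: someone@example.org\r\n".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(new String(peek(in, 5), StandardCharsets.US_ASCII));
        // The stream was rewound, so a second peek sees the same bytes.
        System.out.println(new String(peek(in, 5), StandardCharsets.US_ASCII));
    }
}
```

The alternative, when the source is a file, is simply to open a fresh stream for parsing after detection has finished.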

Re: JAXRS, endpoints and a / welcome page - any ideas why it's broken?

2014-05-19 Thread Nick Burch
On Mon, 19 May 2014, Sergey Beryozkin wrote: I've just looked at the source, unfortunately adding a new Path value will affect the request URIs, UnpackerResource has 2 methods accepting path segments starting from "/unpacker" and "/all". So if we updated then the users would have to modify URI

Re: JAXRS, endpoints and a / welcome page - any ideas why it's broken?

2014-05-19 Thread Nick Burch
On Mon, 19 May 2014, Sergey Beryozkin wrote: I think it might be good to push them into a common path prefix. Though /unpack/unpacker seems a bit unwieldy... If we do introduce "/unpack" then may be we can drop "/unpacker", and have two methods with "/" & "/all", so users will work with "/unpa

Re: [IMPORTANT] INFRA-7751 - Create a VM for Apache Tika

2014-05-20 Thread Nick Burch
On Tue, 20 May 2014, Lewis John Mcgibbney wrote: * - what is the external name used by users. tika-vm.a.o is solely for ssh, not for public * This is entirely up to you guys. Over in Any23 we were lucky enough to have someone on the project team own any23.org... what about service.tika.apache.

Re: [UPDATE] Tika Nightly Builds

2014-05-20 Thread Nick Burch
Hi Lewis Thanks for that! Any chance you could add these details to the "for developers" page on the website? On second thoughts, any chance you could create a "For Developers" page, with links to the SVN repo and the dev list, then add these details to it? Along with a few lines on how we l

Re: Property type closed choice

2014-05-20 Thread Nick Burch
On Tue, 20 May 2014, Allison, Timothy B. wrote: When I run this: Property attach = TikaCoreProperties.EMBEDDED_RESOURCE_TYPE; Metadata m = new Metadata(); m.add(attach, "blah"); m.set(attach, "blah"); I don't get an exception. Should metadata be throwing an exception whe

RE: [DISCUSS] Centralizing JSON handling of Metadata

2014-05-29 Thread Nick Burch
On Wed, 28 May 2014, Ray Gauss II wrote: However, that sort of modularization is probably a broader discussion than what we need for this particular issue, so between those two I’d vote for tika-serialization. Tika-CLI and Tika-Server will likely want to depend on all of the serialisation met

Re: [jira] [Commented] (TIKA-93) OCR support

2014-06-02 Thread Nick Burch
On Mon, 2 Jun 2014, Tyler Palsulich wrote: Good point! We should figure out a way to fail gracefully when Tesseract isn't installed, right? Unless there is, in fact, some pure Java OCR implementation. I believe the standard policy is that a parser which can't work should either throw an exce

Re: [jira] [Commented] (TIKA-93) OCR support

2014-06-02 Thread Nick Burch
On Mon, 2 Jun 2014, Tyler Palsulich wrote: How do we know when Tesseract is installed? There isn't an easy, cross-platform Java method to check if a given program is installed. Maybe, we make the user specify the install location in some config file? Then, don't have to worry about Tesseract be
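
One common workaround for the missing cross-platform check is to scan the directories listed in the PATH environment variable for the executable. The sketch below shows that idea with the JDK only. It is not the approach taken in TIKA-93, which the excerpt does not describe; the class name FindOnPath is made up for illustration, and the Windows suffix list is an assumption.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Optional;

public class FindOnPath {
    /** Looks for an executable named program in each directory of a PATH-style string. */
    static Optional<Path> find(String program, String pathVar) {
        if (pathVar == null || pathVar.isEmpty()) {
            return Optional.empty();
        }
        boolean windows = System.getProperty("os.name", "")
                .toLowerCase(Locale.ROOT).startsWith("windows");
        // Assumed suffixes; Windows really consults the PATHEXT variable.
        String[] suffixes = windows ? new String[] {".exe", ".cmd", ".bat", ""} : new String[] {""};
        for (String dir : pathVar.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String suffix : suffixes) {
                try {
                    Path candidate = Paths.get(dir, program + suffix);
                    if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Malformed PATH entry; skip it.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> hit = find("tesseract", System.getenv("PATH"));
        System.out.println(hit.isPresent() ? "Found at " + hit.get() : "tesseract not found on PATH");
    }
}
```

A program installed outside PATH would be missed, which is why a configurable install location, as suggested in the message, remains a useful fallback.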

Example code in documentation?

2014-06-02 Thread Nick Burch
Hi All Currently, we have some example code on the website, and some in the wiki, neither of which gets checked to ensure it compiles, nor unit tested. However, it is easy to add I've noticed that a couple of ASF projects now have their example code in svn, and use a new-ish cms feature

Re: JAXRS, endpoints and a / welcome page - any ideas why it's broken?

2014-06-02 Thread Nick Burch
On Tue, 20 May 2014, Sergey Beryozkin wrote: Maybe we should post to users@, and see if anyone says they do? Sounds good, please ask or I can do it, let me know please As our jaxrs guru, can you? :) I've just asked at the users list Based on the silence, I don't think the unpacker resourc

Re: [jira] [Created] (TIKA-1316) Old Site Code in Trunk

2014-06-02 Thread Nick Burch
On Mon, 2 Jun 2014, Mattmann, Chris A (3980) wrote: We generate the site directory from src/site - if src/site is old, and not what's at /site, then I think whoever updated the site last forgot to check in their changes. Guys? The site is generated from https://svn.apache.org/repos/asf/tika/sit

Re: Review Request 22246: New parser for Matlab .mat files

2014-06-05 Thread Nick Burch
> On June 4, 2014, 11:25 p.m., Matthias Krueger wrote: > > The Matlab MIME types used seem to be application/x-matlab-data or > > application/matlab-mat. > > > > Would it make sense to add them to the mime XML for detection? > > > > > > MATLAB data file > > > > > > > > > >

Re: Timezone issue with TTF parser?

2014-06-09 Thread Nick Burch
On Mon, 9 Jun 2014, Ken Krugler wrote: I just did an svn up from trunk, and mvn clean install is failing with: Failed tests: testTTFParsing(org.apache.tika.parser.font.FontParsersTest): expected:<1904-01-01T0[0]:00:00Z> but was:<1904-01-01T0[8]:00:00Z> See TIKA-1325. Pesky to discover it's
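
The failing assertion differs from the expected value by exactly eight hours, and 1904-01-01 is the epoch that TrueType date fields count from. The sketch below shows how a shift like that can appear when a wall-clock time is interpreted in a local zone rather than UTC. It is an illustration only: the UTC-8 offset is an assumption about the build machine, the class name EpochShift is made up, and the actual cause is whatever TIKA-1325 records.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class EpochShift {
    /** Interprets a wall-clock time in the given zone and formats the instant in UTC. */
    static String asUtc(LocalDateTime wallClock, ZoneId assumedZone) {
        return wallClock.atZone(assumedZone).toInstant().toString();
    }

    public static void main(String[] args) {
        LocalDateTime epoch = LocalDateTime.of(1904, 1, 1, 0, 0);
        // Interpreted as UTC, the value round-trips unchanged.
        System.out.println(asUtc(epoch, ZoneOffset.UTC));         // prints 1904-01-01T00:00:00Z
        // Interpreted in a zone eight hours behind UTC, it shifts.
        System.out.println(asUtc(epoch, ZoneOffset.ofHours(-8))); // prints 1904-01-01T08:00:00Z
    }
}
```

Tests that compare formatted dates stay stable across machines when the code under test pins the zone explicitly instead of relying on the JVM default.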

Re: svn commit: r1601805 - in /tika/trunk: CHANGES.txt tika-bundle/pom.xml tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml tika-parsers/pom.xml tika-parsers/src/test/resources/tes

2014-06-11 Thread Nick Burch
On Wed, 11 Jun 2014, mattm...@apache.org wrote: --- tika/trunk/tika-parsers/pom.xml (original) +++ tika/trunk/tika-parsers/pom.xml Wed Jun 11 03:23:29 2014 @@ -76,9 +76,14 @@ edu.ucar netcdf - 4.2.20 + 4.2-min You appear to have backed out the netcdf upgrade with

Re: Can some of tika-parsers module dependencies be made optional ?

2014-06-17 Thread Nick Burch
On Tue, 17 Jun 2014, Sergey Beryozkin wrote: The problem seems to be that Tika Parsers module contains many dependencies that may not be needed by a specific custom JAX-RS application. For example, we'd expect a given application dealing with PDF only, or a certain set of image formats only, o

Re: Can some of tika-parsers module dependencies be made optional ?

2014-06-18 Thread Nick Burch
On Wed, 18 Jun 2014, Ray Gauss wrote: I think for 2.0 we should consider splitting out parsers into their own projects for a streamlined dependency hierarchy then reassembling them with something like a tika-parsers-all artifact. We had another thread on that not that long ago, where someone c

Re: Can some of tika-parsers module dependencies be made optional ?

2014-06-18 Thread Nick Burch
On Wed, 18 Jun 2014, Sergey Beryozkin wrote: The reason we need it is that CXF can not ship all of Tika Parser dependencies because CXF will only offer a light-weight Tika-aware handler. Sounds like you just want to depend on tika-core then, and not tika-parsers. That'll give you mime magic de

Re: Can some of tika-parsers module dependencies be made optional ?

2014-06-18 Thread Nick Burch
On Wed, 18 Jun 2014, Sergey Beryozkin wrote: Can we start with adding a section to Tika docs documenting the core dependencies of the tika-parsers module to make the life a bit easier for developers who do not expect the specific parser implementations immediately downloaded ? Are you not jus

Re: Can some of tika-parsers module dependencies be made optional ?

2014-06-18 Thread Nick Burch
On Wed, 18 Jun 2014, Ken Krugler wrote: On Jun 18, 2014, at 9:08am, Nick Burch wrote: On Wed, 18 Jun 2014, Sergey Beryozkin wrote: Can we start with adding a section to Tika docs documenting the core dependencies of the tika-parsers module to make the life a bit easier for developers who do

Re: Can some of tika-parsers module dependencies be made optional ?

2014-06-18 Thread Nick Burch
On Wed, 18 Jun 2014, Ken Krugler wrote: I'm not much of a Maven maven, so what's the right way to manually pull some subset of parsers & dependencies? If you didn't want POI, you'd do something like: ${project.groupId} tika-parsers ${project.version}

Re: Can some of tika-parsers module dependencies be made optional ?

2014-06-19 Thread Nick Burch
On Thu, 19 Jun 2014, Ray Gauss wrote: The point of a tika-parsers-all artifact would be a single dependency that re-aggregates everything so that downstream projects could work the same way they do now and not worry about missing dependencies. What’s the disadvantage for splitting things up (i

Re: tika-server exception handling

2014-06-19 Thread Nick Burch
On Thu, 19 Jun 2014, Allison, Timothy B. wrote: I won't make any changes until after 1.6 is released, and the default behaviors will all remain in place, but I wanted to get feedback from the community to make sure that I'm on a track that makes sense. The only precedent that springs to mind i

Re: Can some of tika-parsers module dependencies be made optional ?

2014-06-21 Thread Nick Burch
On Sat, 21 Jun 2014, Ray Gauss wrote: I’d have to respectfully disagree with most of those points but if there’s that much resistance to the idea I’ll drop it. Please make your case! Nick

Re: [jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note

2014-06-23 Thread Nick Burch
On Mon, 23 Jun 2014, kevin slote wrote: What tika version will have the pst support? See TIKA-623 - PST support is already in trunk, and will be included in Tika 1.6 when that gets released Nick

Re: Review Request 22892: New parser for ENVI header files

2014-06-24 Thread Nick Burch
/EnviHeaderParser.java <https://reviews.apache.org/r/22892/#comment81964> This might be better using something like a BufferedReader, so you can read in one line of the Envi file at a time, and output each into their own p tag / li tag within a ul - Nick Burch On June 23, 2014, 11:14 p.m
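The review suggestion above (read the header one line at a time and emit each line in its own p tag) could look roughly like the sketch below. It is not the code under review: the class name, the helper methods and the sample header text are all made up for illustration, and it builds a plain string with the JDK only, whereas a real Tika parser would write to its content handler instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration only; not part of Tika or of the patch under review.
public class LineToParagraphSketch {

    /** Reads the input one line at a time and wraps each non-blank line in its own p element. */
    static String toParagraphs(Reader input) throws IOException {
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines rather than emitting empty paragraphs
                }
                html.append("<p>").append(escape(line)).append("</p>\n");
            }
        }
        return html.toString();
    }

    /** Minimal escaping so header text cannot break the markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws IOException {
        // Sample header text is an assumption made for this example.
        String header = "ENVI\nsamples = 2400\nlines = 2400\n";
        System.out.print(toParagraphs(new StringReader(header)));
    }
}
```

Reading line by line keeps memory use flat for large headers, and switching to li elements inside a ul would only change the wrapper strings.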

Re: Patch: self-contained HTML using Data URI

2014-06-24 Thread Nick Burch
On Tue, 24 Jun 2014, Andrew Skiba wrote: I started with org.apache.tika.parser.microsoft.WordExtractor and immediately saw that it already makes a recursive call to the org.apache.tika.parser.image.ImageParser. But ImageParser currently only enriches metadata, and does not create element itsel
