[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bamford updated TIKA-1010: Attachment: 114032807362001301.gif 114032807362001201.gif

[jira] [Commented] (TIKA-93) OCR support

2014-03-28 Thread Timo Boehme (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950487#comment-13950487 ] Timo Boehme commented on TIKA-93: - Hi Anurag, which PDF are you referring to? Without knowing

Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
Hi Jukka, thanks a lot for your reply. On #1, I am still wondering why we need structure information for indexing. Is there any particular reason? Wouldn't it make more sense to get just the text by default and only optionally get the structure? On #2, I expected the code you presented would not

[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950461#comment-13950461 ] Chris Bamford edited comment on TIKA-1010 at 3/28/14 9:49 AM: --

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bamford updated TIKA-1010: Attachment: (was: 114032807362001301.gif) Embedded documents in RTF are not extracted

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bamford updated TIKA-1010: Attachment: (was: 114032807362001001.gif) Embedded documents in RTF are not extracted

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bamford updated TIKA-1010: Attachment: (was: 114032807362000801.gif) Embedded documents in RTF are not extracted

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bamford updated TIKA-1010: Attachment: (was: 114032807362000901.gif) Embedded documents in RTF are not extracted

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bamford updated TIKA-1010: Attachment: (was: 114032807362001201.gif) Embedded documents in RTF are not extracted

Re: How should video files with audio be handled by parsers?

2014-03-28 Thread Konstantin Gribov
I think you should have three info blocks: video streams, audio streams and subtitles (if container supports their embedding). Sort naturally or by vid/aid/sid if present. You shouldn't multiplex video and audio streams since any video stream can be combined with any audio stream. In terms of
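
The grouping proposed above can be sketched roughly as follows. This is only an illustration of "one block per stream kind, sorted by id": the `Kind` enum, the `Stream` class and `toBlocks` are hypothetical names invented for the example and are not part of Tika.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a container's streams; none of these types exist in Tika. */
class StreamBlocksSketch {

    enum Kind { VIDEO, AUDIO, SUBTITLE }

    static final class Stream {
        final Kind kind;
        final int id;        // stands in for vid/aid/sid
        final String label;

        Stream(Kind kind, int id, String label) {
            this.kind = kind;
            this.id = id;
            this.label = label;
        }
    }

    /** Groups streams into one block per kind and sorts each block by id. */
    static Map<Kind, List<Stream>> toBlocks(List<Stream> streams) {
        Map<Kind, List<Stream>> blocks = new EnumMap<>(Kind.class);
        for (Kind kind : Kind.values()) {
            blocks.put(kind, new ArrayList<>());
        }
        for (Stream s : streams) {
            blocks.get(s.kind).add(s);
        }
        for (List<Stream> block : blocks.values()) {
            block.sort(Comparator.comparingInt((Stream s) -> s.id));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<Stream> streams = new ArrayList<>();
        streams.add(new Stream(Kind.AUDIO, 2, "commentary"));
        streams.add(new Stream(Kind.VIDEO, 1, "main"));
        streams.add(new Stream(Kind.AUDIO, 1, "original"));
        for (Map.Entry<Kind, List<Stream>> e : toBlocks(streams).entrySet()) {
            for (Stream s : e.getValue()) {
                System.out.println(e.getKey() + " " + s.id + " " + s.label);
            }
        }
    }
}
```

Video and audio streams stay in separate blocks here, which matches the point that any video stream can be combined with any audio stream.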

Re: PDF parser (two more questions)

2014-03-28 Thread Konstantin Gribov
The exception is rethrown only if the write limit has not been reached. So if the exception occurs within the first 100k chars, it affects the result. If the exception is thrown after that, it will be suppressed. -- Best regards, Konstantin Gribov. On 28.03.2014 13:32, Stefano Fornari stefano.forn...@gmail.com wrote:
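
A minimal sketch of the behaviour described above, using only the JDK's SAX classes. `LimitReached`, `LimitedText` and `extract` are hypothetical stand-ins written for this illustration; they are not Tika's actual classes, and the real implementation may differ in detail. The assumption is that the handler signals "limit reached" with a dedicated SAXException, and the caller swallows only that one.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a write-limited handler; not Tika's actual class. */
class WriteLimitSketch {

    /** Marker exception signalling that the character limit was hit. */
    static class LimitReached extends SAXException {
        LimitReached() {
            super("write limit reached");
        }
    }

    /** Collects text until the limit is reached, then aborts with LimitReached. */
    static class LimitedText extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        private final int limit;

        LimitedText(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - out.length();
            if (length > room) {
                out.append(ch, start, room); // keep what still fits
                throw new LimitReached();    // then stop the parse
            }
            out.append(ch, start, length);
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Feeds chunks to the handler; a null chunk simulates a genuine parser failure. */
    static String extract(String[] chunks, int limit) throws SAXException {
        LimitedText handler = new LimitedText(limit);
        try {
            for (String chunk : chunks) {
                if (chunk == null) {
                    throw new SAXException("simulated parser failure");
                }
                handler.characters(chunk.toCharArray(), 0, chunk.length());
            }
        } catch (LimitReached e) {
            // Expected: the limit was hit, so return the text collected so far.
        }
        // Any other SAXException propagates to the caller unchanged.
        return handler.toString();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(extract(new String[] {"hello ", "world"}, 8));
    }
}
```

In this sketch a failure that occurs before the limit reaches the caller, while anything that would have failed after the limit is never seen because processing has already stopped.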

Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
Yes, got it. Which is a strange use case: if I set the limit, first I would not expect an exception (which represents an unexpected error condition); secondly, I would not expect to rethrow it only under certain conditions. I understood the trick, but I am trying to understand this is done in this

Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
On Fri, Mar 28, 2014 at 11:26 AM, Stefano Fornari stefano.forn...@gmail.com wrote: I understood the trick, but I am trying to understand this is done in this way (that at a first glance does not seem clean). ... trying to understand why this is done in this way...

Re: PDF parser (two more questions)

2014-03-28 Thread Konstantin Gribov
SAXException is checked, so you have to catch it or add it to the method's throws list (otherwise javac won't compile the code). Tika usually rethrows exceptions, wrapping them in a TikaException. In the code above, the method throws SAXException. Suppressing the exception is done to avoid a parser failure after parsing

Re: How should video files with audio be handled by parsers?

2014-03-28 Thread Nick Burch
On Fri, 28 Mar 2014, Konstantin Gribov wrote: I think you should have three info blocks: video streams, audio streams and subtitles (if container supports their embedding). Sort naturally or by vid/aid/sid if present. That's not something Tika supports though. We have a metadata object we

Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
Well, I should look at the code; I can't do it now, but I guess my point is that BodyContentHandler should not throw an exception (and most probably not a SAXException in any case) when the limit is reached. This means that the limit should not be put on the WriteOutContentHandler, but on

Re: PDF parser (two more questions)

2014-03-28 Thread Konstantin Gribov
All such handlers are implementations of the org.xml.sax.ContentHandler interface, so their methods throw SAXException. But in the code above, none of the contentHandler methods are invoked directly (only inside parser.parse, where the content handler is passed). You can take a look at
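
To illustrate the point about who calls the handler, here is a small JDK-only sketch. `TextCollector` and `textOf` are hypothetical names used for this example; the JDK's own SAX parser stands in for a Tika parser. The calling code never invokes `characters` itself; the parser does, which is why the checked SAXException surfaces at the `parse` call.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that collects character data; DefaultHandler implements ContentHandler. */
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    // Called by the parser, not by application code; declared to throw SAXException.
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    /** Parses the XML string and returns the concatenated character data. */
    static String textOf(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textOf("<p>Hello <b>Tika</b></p>"));
    }
}
```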

Re: How should video files with audio be handled by parsers?

2014-03-28 Thread Konstantin Gribov
I was talking about output to the content handler, not to metadata. How to handle metadata for containers with several video streams is another problem. The Tika metadata model is somewhat weird to me, so I try not to look at it too often =) -- Best regards, Konstantin Gribov. 2014-03-28 14:59

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950628#comment-13950628 ] Tim Allison commented on TIKA-1010: --- Chris, Thank you for digging into the spec and

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1010: -- Attachment: testRTFRegularImages.rtf This is an example of regular images -- pict -- not embedded data.

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1010: -- Attachment: testRTF_embbededFiles.zip This is the test file I'll use to test poifs package and embedded

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950714#comment-13950714 ] Chris Bamford commented on TIKA-1010: - Hi Tim Sorry about the confusion with the GIFs

[jira] [Assigned] (TIKA-1244) Better parsing of Mbox files

2014-03-28 Thread Hong-Thai Nguyen (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen reassigned TIKA-1244: -- Assignee: Hong-Thai Nguyen Better parsing of Mbox files

Re: metadata key for original file path?

2014-03-28 Thread Nick Burch
On Fri, 28 Mar 2014, Allison, Timothy B. wrote: In working on TIKA-1010, there are some cases where the full original file path is stored with an image or embedded document. TikaMetadataKeys.RESOURCE_NAME_KEY should be used for the file name (right?), but what should I use for the file path? I can

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951009#comment-13951009 ] Chris Bamford commented on TIKA-1010: - Hi again Tim Dunno if this helps, but there is

[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951004#comment-13951004 ] Tim Allison edited comment on TIKA-1010 at 3/28/14 4:47 PM:

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951004#comment-13951004 ] Tim Allison commented on TIKA-1010: --- Chris, Thanks for pointing that out. The objdata

[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951004#comment-13951004 ] Tim Allison edited comment on TIKA-1010 at 3/28/14 4:44 PM:

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951097#comment-13951097 ] Chris Bamford commented on TIKA-1010: - The binary actually looks like this: {noformat}

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951092#comment-13951092 ] Chris Bamford commented on TIKA-1010: - Ideally I'd like to be able to extract any file,

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951107#comment-13951107 ] Tim Allison commented on TIKA-1010: --- Y, thanks, I got that. I can add an extract all

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951116#comment-13951116 ] Tim Allison commented on TIKA-1010: --- As a side note, I can grab file names for: 1)

How to exclude a mimetype from being indexed in solr using tika?

2014-03-28 Thread eShard
Good afternoon, I already asked this question in the solr-user forum and I didn't get anywhere. They suggested I ask the Tika community... I'm using Solr 4.0 Final. I need movies hidden in zip files to be excluded from the index. I can't filter movies on the crawler because then I would

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950835#comment-13950835 ] Chris Bamford commented on TIKA-1010: - Hi Tim I have found one - please see

Re: How to exclude a mimetype from being indexed in solr using tika?

2014-03-28 Thread Nick Burch
On Fri, 28 Mar 2014, eShard wrote: I'm using Solr 4.0 Final. I need movies hidden in zip files to be excluded from the index. I can't filter movies on the crawler because then I would have to exclude all zip files. If you're calling Tika directly, this is very easy. When Tika hits