Re: How to exclude a mimetype from being indexed in solr using tika?

2014-03-28 Thread Nick Burch
On Fri, 28 Mar 2014, eShard wrote: I'm using solr 4.0 Final I need movies "hidden" in zip files that need to be excluded from the index. I can't filter movies on the crawler because then I would have to exclude all zip files. If you're calling Tika directly, this is very easy. When tika hits e

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950835#comment-13950835 ] Chris Bamford commented on TIKA-1010: - Hi Tim I have found one - please see https://is

Re: PDF parser (two more questions)

2014-03-28 Thread Jukka Zitting
Hi, On Fri, Mar 28, 2014 at 5:32 AM, Stefano Fornari wrote: > On #1 I am still wondering why for indexing we need structure information. > is there any particular reason? wouldn't make more sense to get just the > text by default and only optionally getting the structure? The trouble is that the

How to exclude a mimetype from being indexed in solr using tika?

2014-03-28 Thread eShard
Good afternoon, I already asked this question in the solr-user forum and I didn't get anywhere. They suggested I ask the tika community... I'm using solr 4.0 Final I need movies "hidden" in zip files that need to be excluded from the index. I can't filter movies on the crawler because then I woul

metadata key for original file path?

2014-03-28 Thread Allison, Timothy B.
All, In working on TIKA-1010, there are some cases where the full original file path is stored with an image or embedded document. TikaMetadataKeys.RESOURCE_NAME_KEY should be used for file name (right?), but what should I use for file path? Thank you. Best,

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951116#comment-13951116 ] Tim Allison commented on TIKA-1010: --- As a side note, I can grab file names for: 1) image

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951107#comment-13951107 ] Tim Allison commented on TIKA-1010: --- Y, thanks, I got that. I can add an "extract all" m

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951092#comment-13951092 ] Chris Bamford commented on TIKA-1010: - Ideally I'd like to be able to extract any file,

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951097#comment-13951097 ] Chris Bamford commented on TIKA-1010: - The binary actually looks like this: {noformat}

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951009#comment-13951009 ] Chris Bamford commented on TIKA-1010: - Hi again Tim Dunno if this helps, but there is

[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951004#comment-13951004 ] Tim Allison edited comment on TIKA-1010 at 3/28/14 4:44 PM: Chr

[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951004#comment-13951004 ] Tim Allison edited comment on TIKA-1010 at 3/28/14 4:47 PM: Chr

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951004#comment-13951004 ] Tim Allison commented on TIKA-1010: --- Chris, Thanks for pointing that out. The objdata

Re: metadata key for original file path?

2014-03-28 Thread Nick Burch
On Fri, 28 Mar 2014, Allison, Timothy B. wrote: In working on TIKA-1010, there are some cases where the full original file path is stored with an image or embedded document. TikaMetadataKeys.RESOURCE_NAME_KEY should be used for file name (right?), but what should I use for file path? I can on

[jira] [Assigned] (TIKA-1244) Better parsing of Mbox files

2014-03-28 Thread Hong-Thai Nguyen (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen reassigned TIKA-1244: -- Assignee: Hong-Thai Nguyen > Better parsing of Mbox files >

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950714#comment-13950714 ] Chris Bamford commented on TIKA-1010: - Hi Tim Sorry about the confusion with the GIFs

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1010: -- Attachment: testRTF_embbededFiles.zip This is the test file I'll use to test poifs package and embedded

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1010: -- Attachment: testRTFRegularImages.rtf This is an example of regular images -- pict -- not embedded data.

[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950628#comment-13950628 ] Tim Allison commented on TIKA-1010: --- Chris, Thank you for digging into the spec and sha

Re: How should video files with audio be handled by parsers?

2014-03-28 Thread Konstantin Gribov
I was talking about output to the content handler, not to metadata. How to handle metadata for containers with several video streams is another problem. The Tika metadata model is rather weird to me, so I try not to look at it too often =) -- Best regards, Konstantin Gribov. 2014-03-28 14:59 GMT+04:

Re: PDF parser (two more questions)

2014-03-28 Thread Konstantin Gribov
All such handlers are implementations of the org.xml.sax.ContentHandler interface, so their methods throw SAXException. But in the code above, none of the contentHandler methods are invoked directly (only inside parser.parse, to which the content handler is passed). You can take a look at org.apache.tika.Tika.parseToString(InputSt

Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
Well, I should look at the code, which I can't do right now, but I guess my point is that BodyContentHandler should not throw an exception (and most probably not a SAXException in any case) when the limit is reached. This means that the limit should not be put on the WriteOutContentHandler, but on Body

Re: How should video files with audio be handled by parsers?

2014-03-28 Thread Nick Burch
On Fri, 28 Mar 2014, Konstantin Gribov wrote: I think you should have three info blocks: video streams, audio streams and subtitles (if container supports their embedding). Sort naturally or by vid/aid/sid if present. That's not something Tika supports though. We have a metadata object we can

Re: PDF parser (two more questions)

2014-03-28 Thread Konstantin Gribov
SAXException is a checked exception, so you have to catch it or add it to the method's throws list (otherwise javac won't compile the code). Tika usually rethrows exceptions wrapped in a TikaException. In the code above, the method throws SAXException. The exception is suppressed to avoid a parser failure after parsing

Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
On Fri, Mar 28, 2014 at 11:26 AM, Stefano Fornari wrote: > I understood the trick, but I am trying to understand this is done in this > way (that at a first glance does not seem clean). > > ... trying to understand why this is done in this way...

Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
Yes, got it. Which is a strange use case: if I set the limit, first I would not expect an exception (which represents an unexpected error condition); secondly, I would not expect to rethrow it only under certain conditions. I understood the trick, but I am trying to understand this is done in this

Re: PDF parser (two more questions)

2014-03-28 Thread Konstantin Gribov
The exception is rethrown only if the write limit has not been reached. So if the exception occurred within the first 100k chars, it affects the result. If the exception is thrown after that, it is suppressed. -- Best regards, Konstantin Gribov. On 28.03.2014 13:32, user "Stefano Fornari" wrote: > Hi Jukka, > thanks a l
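
The rethrow-or-suppress behaviour described in this message can be sketched in plain Java using only the JDK's org.xml.sax classes. This is a simplified stand-in under stated assumptions, not Tika's actual implementation: WriteLimitSketch, WriteLimitReached, LimitedTextHandler, fakeParse and the failAt parameter are all hypothetical names invented for illustration, and the "parser" here simply feeds string chunks to the handler.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WriteLimitSketch {

    /** Hypothetical marker type: tells "limit reached" apart from real parse errors. */
    static class WriteLimitReached extends SAXException {
        WriteLimitReached(String message) {
            super(message);
        }
    }

    /** Collects character data and throws once more than 'limit' chars arrive. */
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        LimitedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - text.length();
            if (length > room) {
                text.append(ch, start, room); // keep the part that still fits
                throw new WriteLimitReached("write limit of " + limit + " chars reached");
            }
            text.append(ch, start, length);
        }

        String text() {
            return text.toString();
        }
    }

    /** Stand-in for a parser: emits chunks, failing at index failAt (-1 = never). */
    static void fakeParse(String[] chunks, int failAt, ContentHandler handler)
            throws SAXException {
        for (int i = 0; i < chunks.length; i++) {
            if (i == failAt) {
                throw new SAXException("simulated parse failure");
            }
            char[] c = chunks[i].toCharArray();
            handler.characters(c, 0, c.length);
        }
    }

    /** Suppresses only the limit signal; any other SAXException is rethrown. */
    static String parseToString(String[] chunks, int failAt, int limit)
            throws SAXException {
        LimitedTextHandler handler = new LimitedTextHandler(limit);
        try {
            fakeParse(chunks, failAt, handler);
        } catch (SAXException e) {
            if (!(e instanceof WriteLimitReached)) {
                throw e; // a real failure before the limit affects the result
            }
            // limit reached: return the text collected so far
        }
        return handler.text();
    }

    public static void main(String[] args) throws SAXException {
        String[] chunks = {"abc", "def", "ghi"};
        System.out.println(parseToString(chunks, -1, 5)); // prints "abcde"
        try {
            parseToString(chunks, 1, 100);
        } catch (SAXException e) {
            System.out.println("rethrown: " + e.getMessage());
        }
    }
}
```

In this sketch the handler's exception doubles as a stop signal: the parse ends as soon as the limit is hit, so any failure that would have occurred later in the input is never reached, while a failure before the limit still propagates to the caller.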

Re: How should video files with audio be handled by parsers?

2014-03-28 Thread Konstantin Gribov
I think you should have three info blocks: video streams, audio streams and subtitles (if container supports their embedding). Sort naturally or by vid/aid/sid if present. You shouldn't multiplex video and audio streams since any video stream can be combined with any audio stream. In terms of xml

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bamford updated TIKA-1010: Attachment: (was: 114032807362001001.gif) > Embedded documents in RTF are not extracted > --

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bamford updated TIKA-1010: Attachment: (was: 114032807362000801.gif) > Embedded documents in RTF are not extracted > --

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bamford updated TIKA-1010: Attachment: (was: 114032807362000901.gif) > Embedded documents in RTF are not extracted > --

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bamford updated TIKA-1010: Attachment: (was: 114032807362001201.gif) > Embedded documents in RTF are not extracted > --

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bamford updated TIKA-1010: Attachment: (was: 114032807362001301.gif) > Embedded documents in RTF are not extracted > --

[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950461#comment-13950461 ] Chris Bamford edited comment on TIKA-1010 at 3/28/14 9:49 AM: --

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bamford updated TIKA-1010: Attachment: (was: 114032807362001101.gif) > Embedded documents in RTF are not extracted > --

Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
Hi Jukka, thanks a lot for your reply. On #1 I am still wondering why for indexing we need structure information. is there any particular reason? wouldn't make more sense to get just the text by default and only optionally getting the structure? On #2, I expected the code you presented would not

[jira] [Commented] (TIKA-93) OCR support

2014-03-28 Thread Timo Boehme (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950487#comment-13950487 ] Timo Boehme commented on TIKA-93: - Hi Anurag, which PDF are you referring to? Without knowing

[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bamford updated TIKA-1010: Attachment: 114032807362001301.gif 114032807362001201.gif 11403280736