[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Bamford updated TIKA-1010:
Attachment: 114032807362001301.gif
114032807362001201.gif
[
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950487#comment-13950487
]
Timo Boehme commented on TIKA-93:
-
Hi Anurag, which PDF are you referring to? Without knowing
Hi Jukka,
thanks a lot for your reply.
On #1 I am still wondering why we need structure information for indexing.
Is there any particular reason? Wouldn't it make more sense to get just the
text by default and only optionally get the structure?
On #2, I expected the code you presented would not
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950461#comment-13950461
]
Chris Bamford edited comment on TIKA-1010 at 3/28/14 9:49 AM:
--
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Bamford updated TIKA-1010:
Attachment: (was: 114032807362001301.gif)
Embedded documents in RTF are not extracted
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Bamford updated TIKA-1010:
Attachment: (was: 114032807362001001.gif)
Embedded documents in RTF are not extracted
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Bamford updated TIKA-1010:
Attachment: (was: 114032807362000801.gif)
Embedded documents in RTF are not extracted
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Bamford updated TIKA-1010:
Attachment: (was: 114032807362000901.gif)
Embedded documents in RTF are not extracted
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Bamford updated TIKA-1010:
Attachment: (was: 114032807362001201.gif)
Embedded documents in RTF are not extracted
I think you should have three info blocks: video streams, audio streams and
subtitles (if container supports their embedding). Sort naturally or by
vid/aid/sid if present.
You shouldn't multiplex video and audio streams since any video stream can
be combined with any audio stream.
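A minimal sketch of that layout, assuming a hypothetical StreamInfo type with a kind, a numeric id (standing in for vid/aid/sid) and a free-text description. None of these names come from Tika; they only illustrate keeping the three blocks separate and sorted instead of pairing video with audio:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class StreamInfoSketch {
    // The three info blocks suggested above.
    enum Kind { VIDEO, AUDIO, SUBTITLE }

    /** Hypothetical description of one stream inside a container. */
    static final class StreamInfo {
        final Kind kind;
        final int id; // stands in for vid/aid/sid
        final String description;

        StreamInfo(Kind kind, int id, String description) {
            this.kind = kind;
            this.id = id;
            this.description = description;
        }
    }

    /** Groups streams into one block per kind and sorts each block by id. */
    static Map<Kind, List<StreamInfo>> toBlocks(List<StreamInfo> streams) {
        Map<Kind, List<StreamInfo>> blocks = new EnumMap<>(Kind.class);
        for (Kind kind : Kind.values()) {
            blocks.put(kind, new ArrayList<>());
        }
        for (StreamInfo stream : streams) {
            blocks.get(stream.kind).add(stream);
        }
        for (List<StreamInfo> block : blocks.values()) {
            block.sort(Comparator.comparingInt((StreamInfo s) -> s.id));
        }
        return blocks;
    }

    public static void main(String[] args) {
        Map<Kind, List<StreamInfo>> blocks = toBlocks(Arrays.asList(
                new StreamInfo(Kind.AUDIO, 2, "commentary track"),
                new StreamInfo(Kind.VIDEO, 1, "main video"),
                new StreamInfo(Kind.AUDIO, 1, "main audio")));
        for (Map.Entry<Kind, List<StreamInfo>> block : blocks.entrySet()) {
            System.out.println(block.getKey() + ":");
            for (StreamInfo stream : block.getValue()) {
                System.out.println("  " + stream.id + " " + stream.description);
            }
        }
    }
}
```

Because video and audio streams are never paired up here, a container with two video and three audio streams yields five entries rather than six combinations.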
In terms of
The exception is rethrown only if the write limit has not been reached. So if
the exception occurred within the first 100k chars, it affects the result. If
it is thrown after that, it is suppressed.
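The behaviour described above can be sketched roughly as follows. This is not Tika's code: LimitedTextHandler, WriteLimitSketch and the Source interface are hypothetical stand-ins that use only the JDK's org.xml.sax classes, and the sketch assumes the handler itself raises a SAXException once its limit is hit.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical write-limited handler; a stand-in, not a Tika class. */
final class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int writeLimit;
    private boolean limitReached = false;

    LimitedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int room = writeLimit - text.length();
        if (length <= room) {
            text.append(ch, start, length);
            return;
        }
        // Keep what still fits, then signal that the limit was hit.
        text.append(ch, start, Math.max(room, 0));
        limitReached = true;
        throw new SAXException("write limit of " + writeLimit + " characters reached");
    }

    boolean isLimitReached() {
        return limitReached;
    }

    String text() {
        return text.toString();
    }
}

final class WriteLimitSketch {
    /** Hypothetical parser stand-in: anything that pushes characters into the handler. */
    interface Source {
        void feed(LimitedTextHandler handler) throws SAXException;
    }

    static String extract(Source source, int writeLimit) throws SAXException {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        try {
            source.feed(handler);
        } catch (SAXException e) {
            if (!handler.isLimitReached()) {
                throw e; // failure before the limit: it affects the result
            }
            // Limit reached: suppress and return the text collected so far.
        }
        return handler.text();
    }
}
```

For example, `WriteLimitSketch.extract(h -> h.characters("hello world".toCharArray(), 0, 11), 5)` returns the truncated text instead of failing, while a SAXException thrown before the limit is reached still propagates to the caller.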
--
Best regards,
Konstantin Gribov.
28.03.2014 13:32, user Stefano Fornari stefano.forn...@gmail.com
wrote:
Yes, got it. Which is a strange use case: if I set the limit, first I would
not expect an exception (which represents an unexpected error condition);
secondly, I would not expect to rethrow it only under certain conditions. I
understood the trick, but I am trying to understand this is done in this
On Fri, Mar 28, 2014 at 11:26 AM, Stefano Fornari stefano.forn...@gmail.com
wrote:
I understood the trick, but I am trying to understand this is done in this
way (that at a first glance does not seem clean).
... trying to understand why this is done in this way...
SAXException is a checked exception, so you have to catch it or add it to the
method's throws list (otherwise javac won't compile the code). Tika usually
rethrows exceptions wrapped in a TikaException. In the code above, the method
throws SAXException.
The exception is suppressed to avoid a parser failure after parsing
On Fri, 28 Mar 2014, Konstantin Gribov wrote:
I think you should have three info blocks: video streams, audio streams
and subtitles (if container supports their embedding). Sort naturally or
by vid/aid/sid if present.
That's not something Tika supports though. We have a metadata object we
Well, I should look at the code, which I can't do now, but I guess my point is
that BodyContentHandler should not throw an exception (and most probably
not a SAXException in any case) when the limit is reached. This
means that the limit should not be put on the WriteOutContentHandler, but on
All such handlers are implementations of the org.xml.sax.ContentHandler
interface, so their methods throw SAXException. But in the code above none of
the ContentHandler methods are invoked directly (only inside parser.parse, to
which the content handler is passed).
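To make the point concrete, here is a small sketch using only the JDK's org.xml.sax types. The parse method and ExtractionException are hypothetical: parse stands in for a parser that is the only caller of the handler's methods, and ExtractionException plays the role that TikaException plays in Tika.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class HandlerExceptionSketch {
    /** Hypothetical wrapper exception, in the role TikaException has in Tika. */
    static final class ExtractionException extends Exception {
        ExtractionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical parser: the only place where handler methods are called. */
    static void parse(String input, ContentHandler handler) throws SAXException {
        handler.startDocument();
        char[] chars = input.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endDocument();
    }

    /** Option 1: declare the checked exception and let it propagate. */
    static String extractDeclaring(String input) throws SAXException {
        final StringBuilder out = new StringBuilder();
        parse(input, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });
        return out.toString();
    }

    /** Option 2: catch the checked exception and rethrow it wrapped. */
    static void extractWrapping(String input, ContentHandler handler)
            throws ExtractionException {
        try {
            parse(input, handler);
        } catch (SAXException e) {
            throw new ExtractionException("parse failed", e);
        }
    }
}
```

In both variants the calling code never touches the handler's methods itself; the SAXException can only surface from the parse call, which is why that call is the one place where it must be caught or declared.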
You can take a look at
I was talking about output to the content handler, not to metadata. How to
handle metadata for containers with several video streams is another problem.
The Tika metadata model seems rather odd to me, so I try not to look at it too
often =)
--
Best regards,
Konstantin Gribov.
2014-03-28 14:59
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950628#comment-13950628
]
Tim Allison commented on TIKA-1010:
---
Chris,
Thank you for digging into the spec and
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-1010:
--
Attachment: testRTFRegularImages.rtf
This is an example of regular images -- pict -- not embedded data.
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-1010:
--
Attachment: testRTF_embbededFiles.zip
This is the test file I'll use to test poifs package and embedded
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950714#comment-13950714
]
Chris Bamford commented on TIKA-1010:
-
Hi Tim
Sorry about the confusion with the GIFs
[
https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hong-Thai Nguyen reassigned TIKA-1244:
--
Assignee: Hong-Thai Nguyen
Better parsing of Mbox files
On Fri, 28 Mar 2014, Allison, Timothy B. wrote:
In working on TIKA-1010, there are some cases where the full original
file path is stored with an image or embedded document.
TikaMetadataKeys.RESOURCE_NAME_KEY should be used for file name
(right?), but what should I use for file path?
I can
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951009#comment-13951009
]
Chris Bamford commented on TIKA-1010:
-
Hi again Tim
Dunno if this helps, but there is
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951004#comment-13951004
]
Tim Allison edited comment on TIKA-1010 at 3/28/14 4:47 PM:
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951004#comment-13951004
]
Tim Allison commented on TIKA-1010:
---
Chris,
Thanks for pointing that out. The objdata
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951004#comment-13951004
]
Tim Allison edited comment on TIKA-1010 at 3/28/14 4:44 PM:
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951097#comment-13951097
]
Chris Bamford commented on TIKA-1010:
-
The binary actually looks like this:
{noformat}
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951092#comment-13951092
]
Chris Bamford commented on TIKA-1010:
-
Ideally I'd like to be able to extract any file,
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951107#comment-13951107
]
Tim Allison commented on TIKA-1010:
---
Y, thanks, I got that. I can add an extract all
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951116#comment-13951116
]
Tim Allison commented on TIKA-1010:
---
As a side note, I can grab file names for:
1)
Good afternoon,
I already asked this question in the solr-user forum and didn't get
anywhere.
They suggested I ask the tika community...
I'm using solr 4.0 Final
I need movies hidden in zip files that need to be excluded from the index.
I can't filter movies on the crawler because then I would
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950835#comment-13950835
]
Chris Bamford commented on TIKA-1010:
-
Hi Tim
I have found one - please see
On Fri, 28 Mar 2014, eShard wrote:
I'm using solr 4.0 Final
I need movies hidden in zip files that need to be excluded from the index.
I can't filter movies on the crawler because then I would have to exclude
all zip files.
If you're calling Tika directly, this is very easy. When tika hits