On Fri, 28 Mar 2014, eShard wrote:
I'm using solr 4.0 Final
I need to exclude movies "hidden" in zip files from the index.
I can't filter movies on the crawler because then I would have to exclude
all zip files.
If you're calling Tika directly, this is very easy. When tika hits
e
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950835#comment-13950835
]
Chris Bamford commented on TIKA-1010:
-
Hi Tim
I have found one - please see https://is
Hi,
On Fri, Mar 28, 2014 at 5:32 AM, Stefano Fornari
wrote:
> On #1 I am still wondering why we need structure information for indexing.
> Is there any particular reason? Wouldn't it make more sense to get just the
> text by default and only optionally get the structure?
The trouble is that the
Good afternoon,
I already asked this question in the solr-user forum and I didn't get
anywhere.
They suggested I ask the tika community...
I'm using solr 4.0 Final
I need to exclude movies "hidden" in zip files from the index.
I can't filter movies on the crawler because then I woul
All,
In working on TIKA-1010, there are some cases where the full original file
path is stored with an image or embedded document.
TikaMetadataKeys.RESOURCE_NAME_KEY should be used for file name (right?), but
what should I use for file path?
Thank you.
Best,
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951116#comment-13951116
]
Tim Allison commented on TIKA-1010:
---
As a side note, I can grab file names for:
1) image
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951107#comment-13951107
]
Tim Allison commented on TIKA-1010:
---
Y, thanks, I got that. I can add an "extract all" m
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951092#comment-13951092
]
Chris Bamford commented on TIKA-1010:
-
Ideally I'd like to be able to extract any file,
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951097#comment-13951097
]
Chris Bamford commented on TIKA-1010:
-
The binary actually looks like this:
{noformat}
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951009#comment-13951009
]
Chris Bamford commented on TIKA-1010:
-
Hi again Tim
Dunno if this helps, but there is
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951004#comment-13951004
]
Tim Allison edited comment on TIKA-1010 at 3/28/14 4:44 PM:
Chr
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951004#comment-13951004
]
Tim Allison edited comment on TIKA-1010 at 3/28/14 4:47 PM:
Chr
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951004#comment-13951004
]
Tim Allison commented on TIKA-1010:
---
Chris,
Thanks for pointing that out. The objdata
On Fri, 28 Mar 2014, Allison, Timothy B. wrote:
In working on TIKA-1010, there are some cases where the full original
file path is stored with an image or embedded document.
TikaMetadataKeys.RESOURCE_NAME_KEY should be used for file name
(right?), but what should I use for file path?
I can on
[
https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hong-Thai Nguyen reassigned TIKA-1244:
--
Assignee: Hong-Thai Nguyen
> Better parsing of Mbox files
>
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950714#comment-13950714
]
Chris Bamford commented on TIKA-1010:
-
Hi Tim
Sorry about the confusion with the GIFs
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-1010:
--
Attachment: testRTF_embbededFiles.zip
This is the test file I'll use to test poifs package and embedded
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-1010:
--
Attachment: testRTFRegularImages.rtf
This is an example of regular images -- pict -- not embedded data.
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950628#comment-13950628
]
Tim Allison commented on TIKA-1010:
---
Chris,
Thank you for digging into the spec and sha
I said it about output to the content handler, not to metadata. How to handle
metadata for containers with several video streams is another problem. The Tika
metadata model seems rather weird to me, so I try not to look at it too
often =)
--
Best regards,
Konstantin Gribov.
2014-03-28 14:59 GMT+04:
All such handlers are implementations of the org.xml.sax.ContentHandler
interface, so their methods throw SAXException. But in the code above none of
the contentHandler methods are invoked (only in parser.parse, to which the
content handler is passed).
You can take a look at org.apache.tika.Tika.parseToString(InputSt
Well, I should look at the code, which I can't do right now, but I guess my point is
that BodyContentHandler should not throw an exception (and most probably
not a SAXException in any case) when the limit is reached. This
means that the limit should not be put on the WriteOutContentHandler, but on
Body
On Fri, 28 Mar 2014, Konstantin Gribov wrote:
I think you should have three info blocks: video streams, audio streams
and subtitles (if container supports their embedding). Sort naturally or
by vid/aid/sid if present.
That's not something Tika supports though. We have a metadata object we
can
SAXException is checked, so you have to catch it or add it to the method's throws
list (or javac won't compile it). Tika usually rethrows exceptions,
wrapping them in TikaException. In the code above, the method throws
SAXException.
Suppressing the exception is done to avoid parser fail after parsing
On Fri, Mar 28, 2014 at 11:26 AM, Stefano Fornari wrote:
> I understood the trick, but I am trying to understand this is done in this
> way (that at a first glance does not seem clean).
>
> ... trying to understand why this is done in this way...
Yes, got it. Which is a strange use case: if I set the limit, first I would
not expect an exception (which represents an unexpected error condition);
secondly, I would not expect to rethrow it only under certain conditions. I
understood the trick, but I am trying to understand this is done in this
The exception is rethrown only if the write limit has not been reached. So if
the exception occurs within the first 100k chars, it affects the result. If it
is thrown after that, it is suppressed.
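
To make the rule above concrete, here is a minimal sketch of that pattern. It is not Tika's actual code: the class and method names (WriteLimitSketch, LimitedTextHandler, LimitReachedException, extract) are hypothetical, and a null chunk is used only as a stand-in for a genuine parse failure. It relies solely on the org.xml.sax classes shipped with the JDK.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WriteLimitSketch {

    /** Hypothetical marker exception signalling that the limit was hit. */
    static class LimitReachedException extends SAXException {
        LimitReachedException(String message) {
            super(message);
        }
    }

    /** Hypothetical handler that collects text up to a character limit. */
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit;
        private boolean limitReached = false;

        LimitedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - text.length();
            if (length > room) {
                // Keep what still fits, then stop the parse with an exception.
                text.append(ch, start, room);
                limitReached = true;
                throw new LimitReachedException("write limit of " + limit + " reached");
            }
            text.append(ch, start, length);
        }

        boolean isLimitReached() {
            return limitReached;
        }

        String text() {
            return text.toString();
        }
    }

    /**
     * Feeds chunks to the handler. A SAXException is suppressed only when
     * the limit has been reached; otherwise it is rethrown to the caller.
     */
    static String extract(String[] chunks, int limit) throws SAXException {
        LimitedTextHandler handler = new LimitedTextHandler(limit);
        try {
            for (String chunk : chunks) {
                if (chunk == null) {
                    // Assumption for illustration: null simulates a parser error.
                    throw new SAXException("simulated parse failure");
                }
                handler.characters(chunk.toCharArray(), 0, chunk.length());
            }
        } catch (SAXException e) {
            if (!handler.isLimitReached()) {
                throw e; // failure before the limit: it affects the result
            }
            // Limit reached: the truncated text is the result.
        }
        return handler.text();
    }

    public static void main(String[] args) throws SAXException {
        // prints "hello wo"
        System.out.println(extract(new String[] {"hello ", "world"}, 8));
    }
}
```

In this sketch the exception is purely a way to abort parsing early; the caller decides whether it was the expected stop signal or a real error by asking the handler whether the limit was reached.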
--
Best regards,
Konstantin Gribov.
On 28.03.2014 13:32, "Stefano Fornari" wrote:
> Hi Jukka,
> thanks a l
I think you should have three info blocks: video streams, audio streams and
subtitles (if container supports their embedding). Sort naturally or by
vid/aid/sid if present.
You shouldn't multiplex video and audio streams since any video stream can
be combined with any audio stream.
In terms of xml
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Bamford updated TIKA-1010:
Attachment: (was: 114032807362001001.gif)
> Embedded documents in RTF are not extracted
> --
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Bamford updated TIKA-1010:
Attachment: (was: 114032807362000801.gif)
> Embedded documents in RTF are not extracted
> --
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Bamford updated TIKA-1010:
Attachment: (was: 114032807362000901.gif)
> Embedded documents in RTF are not extracted
> --
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Bamford updated TIKA-1010:
Attachment: (was: 114032807362001201.gif)
> Embedded documents in RTF are not extracted
> --
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Bamford updated TIKA-1010:
Attachment: (was: 114032807362001301.gif)
> Embedded documents in RTF are not extracted
> --
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950461#comment-13950461
]
Chris Bamford edited comment on TIKA-1010 at 3/28/14 9:49 AM:
--
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Bamford updated TIKA-1010:
Attachment: (was: 114032807362001101.gif)
> Embedded documents in RTF are not extracted
> --
Hi Jukka,
thanks a lot for your reply.
On #1 I am still wondering why we need structure information for indexing.
Is there any particular reason? Wouldn't it make more sense to get just the
text by default and only optionally get the structure?
On #2, I expected the code you presented would not
[
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950487#comment-13950487
]
Timo Boehme commented on TIKA-93:
-
Hi Anurag, which PDF are you referring to? Without knowing
[
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Bamford updated TIKA-1010:
Attachment: 114032807362001301.gif
114032807362001201.gif
11403280736