Re: Parsing huge PDF (400Mb, 2700 pages)

2019-11-14 Thread Sergey Beryozkin
Hi, Are you using tika-server? If so, and you can submit the data as a multipart/form-data payload, that may help: CXF (used by tika-server) should make a best effort to save the multipart payloads to temp locations on disk, and thus minimize the memory requirements. Cheers, Sergey
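A minimal sketch of the client side of that suggestion, using only the JDK (Java 11+). It streams the multipart body from disk rather than loading the PDF into a byte array. The part name "upload", the application/pdf part type, and the `/tika/form` endpoint path are assumptions made for illustration, not taken from the message above; check them against the tika-server version in use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

final class TikaMultipartUpload {

    /** Chains part header + file content + closing boundary into one stream. */
    static InputStream multipartBody(Path file, String boundary) throws IOException {
        // "upload" is a hypothetical part name chosen for this sketch.
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"upload\"; filename=\""
                + file.getFileName() + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";
        List<InputStream> parts = List.of(
                new ByteArrayInputStream(head.getBytes(StandardCharsets.UTF_8)),
                Files.newInputStream(file), // read lazily while the request is sent
                new ByteArrayInputStream(tail.getBytes(StandardCharsets.UTF_8)));
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    /** Builds a POST whose body is produced from the stream above on demand. */
    static HttpRequest buildRequest(URI endpoint, Path file, String boundary) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .header("Accept", "text/plain")
                .POST(HttpRequest.BodyPublishers.ofInputStream(() -> {
                    try {
                        return multipartBody(file, boundary);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }))
                .build();
    }
}
```

The request would then be sent with something like `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. Whether the server side actually spools the part to disk depends on the CXF attachment settings, as the message says ("best effort").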

Re: How to parse PDF more effectively

2019-07-18 Thread Sergey Beryozkin
exclude the PDFParser from the default > parser and then add the custom configured one back in: > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066 > > Let me know if you have any surprises. > > Best, > > Tim > > O

Re: How to parse PDF more effectively

2019-07-18 Thread Sergey Beryozkin
If it is not possible yet, then please create an issue and assign it to me (can do myself as well), will take care of it a bit later on. (don't mind having tika config in JSON format supported as well) Sergey On Wed, Jul 17, 2019 at 5:20 PM Sergey Beryozkin wrote: > Hi Tim, > > How

Re: How to parse PDF more effectively

2019-07-17 Thread Sergey Beryozkin
not a big deal though :-)) I also looked at the source and I'm still not sure which ContentHandler you used to get the HTML tags added. (I may experiment with a custom one sitting on top of it, adding the table tags maybe...) Sergey On Thu, Jul 11, 2019 at 9:52 PM Sergey Beryozkin wrote: > Hi

Re: Are Tika parser instances thread safe ?

2019-07-17 Thread Sergey Beryozkin
, Jul 16, 2019 at 10:40 PM Tim Allison wrote: > Y. They should be! Let me know if you find any that aren’t. The regression > tests assume thread safety. > > On Tue, Jul 16, 2019 at 5:36 PM Sergey Beryozkin > wrote: > >> Hi >> >> I've looked around the web, found [

Are Tika parser instances thread safe ?

2019-07-16 Thread Sergey Beryozkin
Hi, I've looked around the web and found [1]. Is it still the case that all the parsers are thread safe (higher-level ones like AutoDetectParser and specific ones like PDFParser)? It is great if so, but I would appreciate some confirmation. Thanks, Sergey [1]

Re: How to parse PDF more effectively

2019-07-11 Thread Sergey Beryozkin
licedinvoices.com/demo > > <a href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a> > > <a href="http://slicedinvoices.com/demo">http://slicedinvoices.com/demo</a> > > <a href="mailto:ad...@slicedinvoices.com">mailto:ad...@slicedinvoices.com</a> >

Re: [EXTERNAL] How to parse PDF more effectively

2019-07-11 Thread Sergey Beryozkin
g I have been looking at for this as well as doing > like Deep Neural Nets… > > > > > > > > *From: *Sergey Beryozkin > *Reply-To: *"user@tika.apache.org" > *Date: *Thursday, July 11, 2019 at 10:25 AM > *To: *"user@tika.apache.org" > *Sub

How to parse PDF more effectively

2019-07-11 Thread Sergey Beryozkin
Hi I've used Tika to parse this invoice PDF: https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf (AutoDetectParser, ToTextContentHandler), see below what is returned. The numbers like (1), (2) are added by myself, this is the preferred order (approximately). Is it possible

Re: "Stream closed" error when extracting text using Tika Server

2017-06-02 Thread Sergey Beryozkin
Any help or comments are appreciated! Haris -- Sergey Beryozkin Talend Community Coders http://coders.talend.com/

Re: "Stream closed" error when extracting text using Tika Server

2017-06-02 Thread Sergey Beryozkin
Hi By default, the CXF JAX-RS MessageBodyWriter which deals with InputStream closes it as soon as the copy is complete; this can be disabled, but it would indeed be simpler to avoid using a try-with-resources. I can fix it... FYI, re your test code, you can do response.getEntity(String.class)

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
. :) -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Friday, May 19, 2017 12:40 PM To: user@tika.apache.org Subject: Re: Extracting Text from embedded images in PDF docs Hi Tim On 19/05/17 17:31, Allison, Timothy B. wrote: The autoscaling feature of Beam and the job

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
-2328 It will take me a few more weeks to create a PR, Thanks, Sergey -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Friday, May 19, 2017 12:27 PM To: user@tika.apache.org Subject: Re: Extracting Text from embedded images in PDF docs Hi Chris I'm getting

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
sity of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++ On 5/19/17, 9:11 AM, "Sergey Beryozkin" <sberyoz...@gmail.com> wrote: Hi Tim On 19/05/17 16:47, Allison, Timothy B. wrote:

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
Hi Tim On 19/05/17 16:47, Allison, Timothy B. wrote: Yes I was asking about it as I thought it was confusing it did not work - I saw you following up on this possible issue in the other email... Y, I agree. That _should_ work. I'm doing some work with Tika now so it was of an immediate

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
On 19/05/17 16:25, Allison, Timothy B. wrote: and when is "extractInlineImages" actually effective ? Not sure I understand the question exactly? If the question is "why didn't extractInlineImages work on a specific document"? That's probably a bug or could be user error in the

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
Hi Tim and when is "extractInlineImages" actually effective ? Thanks, Sergey On 19/05/17 16:16, Allison, Timothy B. wrote: Y, well, sorry. I’m thrilled someone is using it! I tried to document that here: https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29 See the OCR section.

Streaming and Tika

2016-11-10 Thread Sergey Beryozkin
Hi All I've been looking at how to integrate Tika in some of the streaming pipelines, and I'm finding it difficult to set up with the callback-based SAX mechanism. Would it make sense to consider adding a StAX-like Parser API? So far the only reference to StAX I've seen is
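To illustrate the difference the message is getting at, here is a small JDK-only sketch of the pull (StAX) style: the consumer drives the loop and can stop whenever it has enough, instead of receiving SAX callbacks. This is not a Tika API; the class and method names are hypothetical, and the input is assumed to be well-formed XHTML of the kind Tika's handlers emit.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class PullStyleDemo {

    /** Pulls events until 'limit' paragraphs have been read, then stops early. */
    static List<String> firstParagraphs(String xhtml, int limit) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xhtml));
        List<String> paragraphs = new ArrayList<>();
        try {
            while (reader.hasNext() && paragraphs.size() < limit) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "p".equals(reader.getLocalName())) {
                    // Valid only for text-only <p> elements, which this sketch assumes.
                    paragraphs.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return paragraphs;
    }
}
```

With SAX the equivalent early exit usually means throwing an exception out of a callback, which is part of why the pull style fits streaming pipelines more naturally.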

Re: How to parse PDF files effectively with Tika

2016-09-15 Thread Sergey Beryozkin
Hi On 12/09/16 22:19, Sergey Beryozkin wrote: Hi Tim This is very helpful, thanks. I'll experiment with the code below. By the way, I've found out AutoDetectParser may not work if the (pdf) stream is an attachment stream which may not support a mark. I've been wondering, would it make sense
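The mark/reset point can be sketched with the JDK alone: type detection needs to read the first bytes and then rewind, so a stream that does not support `mark` has to be wrapped first. The helper names below are hypothetical; Tika's own `TikaInputStream.get(InputStream)` is, as far as I know, the library's way of doing this wrapping.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

final class MarkSupport {

    /** Returns the stream unchanged if it can rewind, otherwise a buffered wrapper. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Reads up to n leading bytes and rewinds, the way a type detector would. */
    static byte[] peek(InputStream in, int n) throws IOException {
        in.mark(n);
        try {
            return in.readNBytes(n);
        } finally {
            in.reset(); // the caller still sees the stream from its first byte
        }
    }
}
```

After the wrapping, the same stream instance must be used for both detection and parsing; reading from the original unwrapped stream afterwards would skip whatever the buffer has already consumed.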

Re: Query on correct use of 'fileUrl' in TikaJAXRS Server to extract document at remote url - my request is not working

2016-09-12 Thread Sergey Beryozkin
Hi, can you do me a favor and paste the -v output? -H identifies a request header, I wonder if it should be curl -i fileUrl:http://www.bbc.co.uk/news -H "Accept: text/plain" -X PUT http://localhost:9998/tika ? (though I've never used this option) Thanks, Sergey On 11/09/16 09:48, John
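For comparison, a hedged Java sketch of the same idea: `fileUrl` is sent as a request header on a body-less PUT, alongside `Accept`. The header name comes from the thread subject; whether the server honours it (and whether it has to be enabled explicitly) depends on the tika-server version, so treat this as an assumption to verify rather than a confirmed recipe.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class FileUrlRequest {

    /** Asks the server to fetch and parse a remote document named in a header. */
    static HttpRequest build(URI tikaEndpoint, String remoteUrl) {
        return HttpRequest.newBuilder(tikaEndpoint)
                .header("fileUrl", remoteUrl)   // header name as used in the thread
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }
}
```

For example, `FileUrlRequest.build(URI.create("http://localhost:9998/tika"), "http://www.bbc.co.uk/news")` mirrors the endpoint and URL quoted in the message.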

How to parse PDF files effectively with Tika

2016-09-09 Thread Sergey Beryozkin
Hi All While I've experimented with writing a simple demo code which creates a Tika PDFParser (and few other parsers) and provides a ToTextContentHandler for it to return the content, I'm realizing I'm not really quite sure what the best strategy is. For example, Tim has mentioned that it

Re: How to create a Parser from InputStream alone

2016-09-08 Thread Sergey Beryozkin
Oops, it is called AutoDetectParser :-) On 08/09/16 11:15, Sergey Beryozkin wrote: Hi All Is there a way in Tika to create a Parser from some factory which accepts InputStream ? For example, rather than doing 'new PDFParser()' I'd like give Tika InputStream which will return a Parser (using

How to create a Parser from InputStream alone

2016-09-08 Thread Sergey Beryozkin
Hi All Is there a way in Tika to create a Parser from some factory which accepts an InputStream? For example, rather than doing 'new PDFParser()' I'd like to give Tika an InputStream and get back a Parser (using the auto-detection)? Thanks, Sergey

Re: Unable to extract content from chunked portion of large file

2016-02-29 Thread Sergey Beryozkin
to extract content out of chunked portion of file Will TIKA supports to extract content from file chunk.? Regards, Raghu. From: Sergey Beryozkin <sberyoz...@gmail.com> Sent: Monday, February 29, 2016 7:23 PM To: user@tika.apache.org Subject: Re:

Re: Unable to extract content from chunked portion of large file

2016-02-29 Thread Sergey Beryozkin
documents of this size. we need to handle this. please help us. Regards, Raghu. From: Sergey Beryozkin <sberyoz...@gmail.com> Sent: Monday, February 29, 2016 6:50 PM To: user@tika.apache.org Subject: Re: Unable to extract content from chunked p

Re: Unable to extract content from chunked portion of large file

2016-02-29 Thread Sergey Beryozkin
giving empty text. I assume Tika is relied on file structure that why it is not giving any content. we are using Tika Server(REST api) in our .net application. please suggest us better approach for this scenario. Regards, Raghu. -- Ken Krugler +1 530-210-6378 http://www.sc

Re: Tika Wiki Login

2016-02-24 Thread Sergey Beryozkin
, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Sergey Beryozkin <sberyoz...@gmail.com> Reply-To: "user@tika.apache.org" <user@ti

Tika Wiki Login

2016-02-24 Thread Sergey Beryozkin
Hi Chris Can you please give me the rights to edit the wiki, I have all the docs signed. I can edit CXF and Camel wikis with a 'sergey_beryozkin' login, thought could do the same with Tika Thanks, Sergey

Re: Unable to extract content from chunked portion of large file

2016-02-24 Thread Sergey Beryozkin
e suggest us better approach for this scenario. Regards, Raghu. -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr -- Sergey Beryozkin Talend Community Coders http://coders.talend.com/

Re: message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header

2015-10-13 Thread Sergey Beryozkin
! Do you guys think this is a bug, or am I doing something wrong? Thanks! Sergey -- Sergey Beryozkin Talend Community Coders http://coders.talend.com/

Re: Use TikaJAXRS with HDD offsets instead of urls

2015-09-01 Thread Sergey Beryozkin
Hi the server accepts an InputStream from a multipart attachment or from the immediate request body; in the latter case it is HTTP PUT, so you can use the client library to PUT bytes to the server Cheers, Sergey On 01/09/15 09:44, zahlenm...@gmx.de wrote: Hey everyone, I am parsing file
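A rough JDK-only sketch of what "PUT bytes to the server" could look like when the document lives at a known offset inside a larger file or disk image. The offset/length scenario is inferred from the thread subject, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class OffsetPut {

    /** Reads up to 'length' bytes starting at 'offset' from the given file. */
    static byte[] readRange(Path source, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(source, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(length);
            channel.position(offset);
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // keep reading until the range is filled or the file ends
            }
            buffer.flip();
            byte[] out = new byte[buffer.remaining()];
            buffer.get(out);
            return out;
        }
    }

    /** Wraps the bytes in an HTTP PUT aimed at the server's parsing endpoint. */
    static HttpRequest putBytes(URI endpoint, byte[] document) {
        return HttpRequest.newBuilder(endpoint)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

This holds the whole range in memory, which is fine for small documents; for large ones a streaming body publisher would be the safer choice.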

Re: [VOTE] Apache Tika 1.10 Release Candidate #1

2015-08-02 Thread Sergey Beryozkin
+1 Sergey On 02/08/15 10:15, David Meikle wrote: Hi Everyone, A candidate for the Apache Tika 1.10 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/tika/tags/1.10-rc1/ The

Re: JAX-RS: SEVERE Problem with writing the data when parser hits exception?

2015-02-27 Thread Sergey Beryozkin
Hi Tim, The problem appears to be happening during a write process, when a JAX-RS runtime provider delegates back to JAX-RS StreamingOutput TikaResource implementation. I'm presuming this causes an actual exception reporting. Do you think it should not be reported/logged ? This can be

Re: JAX-RS: SEVERE Problem with writing the data when parser hits exception?

2015-02-27 Thread Sergey Beryozkin
to parse and added a custom ExceptionMapper. We could handle it there, if we wanted. However, if you're not batting an eye at the warning, I'm happy to ignore the logs. Thank you! Best, Tim -Original Message- From: Sergey Beryozkin [mailto:sberyoz

Re: Outputting JSON from tika-server/meta

2014-12-18 Thread Sergey Beryozkin
Hi, I see MetadataResource returning StreamingOutput and it has @Produces("text/csv") only. As such this MBW has no effect at the moment. We can update MetadataResource to return Metadata directly if application/json is requested or update MetadataResource to directly convert Metadata to JSON

Re: Outputting JSON from tika-server/meta

2014-12-18 Thread Sergey Beryozkin
Hi Peter Thanks, you are too nice, it is a minor bug :-) Cheers, Sergey On 18/12/14 14:50, Peter Bowyer wrote: Thanks Sergey, I have opened TIKA-1497 for this enhancement. Best wishes, Peter On 18 December 2014 at 14:31, Sergey Beryozkin <sberyoz...@gmail.com> wrote

Re: Compression of Tika server output files

2014-08-07 Thread Sergey Beryozkin
Hi I can try to enhance a CXF GzipOutInterceptor (at CXF level) to use a compressing Deflater in GZIP compatible mode. The server will react to a client accepting GZIP and compress the out payloads. I think it would be a good idea to have a Tika server war module introduced for users easily
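From the client's point of view, the negotiation described here could look like the JDK-only sketch below: the request advertises `Accept-Encoding: gzip`, and the response body is unwrapped only if the server answers with `Content-Encoding: gzip`. The class name and the PUT-to-parse usage are assumptions for illustration; the JDK `HttpClient` does not decompress responses on its own, hence the explicit step.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.zip.GZIPInputStream;

final class GzipAwareClient {

    /** Builds a request that tells the server compressed output is acceptable. */
    static HttpRequest gzipRequest(URI endpoint, byte[] document) {
        return HttpRequest.newBuilder(endpoint)
                .header("Accept", "text/plain")
                .header("Accept-Encoding", "gzip")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    /** Decompresses the body only when the response declares gzip encoding. */
    static InputStream decode(InputStream body, String contentEncoding) throws IOException {
        return "gzip".equalsIgnoreCase(contentEncoding) ? new GZIPInputStream(body) : body;
    }
}
```

In real use the second argument would come from the response, e.g. `response.headers().firstValue("Content-Encoding").orElse(null)`.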

Re: Compression of Tika server output files

2014-08-07 Thread Sergey Beryozkin
By the way, would a default GZIP compression suit ? If yes we can have it done even without the extra CXF changes. Sergey On 07/08/14 16:15, Sergey Beryozkin wrote: Hi I can try to enhance a CXF GzipOutInterceptor (at CXF level) to use a compressing Deflater in GZIP compatible mode

Re: How to index the parsed content effectively

2014-07-11 Thread Sergey Beryozkin
-Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Wednesday, July 02, 2014 8:27 AM To: user@tika.apache.org Subject: How to index the parsed content effectively Hi All, We've been experimenting with indexing the parsed content in Lucene and our initial attempt

Re: How to index the parsed content effectively

2014-07-02 Thread Sergey Beryozkin
Hi, On 02/07/14 13:54, Ken Krugler wrote: On Jul 2, 2014, at 5:27am, Sergey Beryozkin <sberyoz...@gmail.com> wrote: Hi All, We've been experimenting with indexing the parsed content in Lucene and our initial attempt was to index the output from

Re: How to index the parsed content effectively

2014-07-02 Thread Sergey Beryozkin
, Tim -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Wednesday, July 02, 2014 8:27 AM To: user@tika.apache.org Subject: How to index the parsed content effectively Hi All, We've been experimenting with indexing the parsed content in Lucene and our initial attempt

Re: How to index the parsed content effectively

2014-07-02 Thread Sergey Beryozkin
, cutting the document, print a warning, etc. Sure Of course, everything depends on the use case ;) I agree, Many thanks for the feedback, Definitely has been useful for me and hopefully for some other users :-) Cheers, Sergey On 02.07.2014 17:45, Sergey Beryozkin wrote: Hi Tim Thanks for sharing

Possible update to Tika Server Unpacker service

2014-05-20 Thread Sergey Beryozkin
Hi All, As you know Tika Server [1] has a number of JAX-RS endpoints, with Unpacker service being one of them. Unpacker resource has 2 methods, matching /unpacker and /all URI segments with Unpacker itself having a default JAX-RS @Path("/") annotation. The problem here is that introducing