Hi Sergey,

Thanks for digging into the code - I'd seen the docs and assumed it wouldn't 
work.

Anybody have a chance to give that a try? Maybe Raghu? :)

-- Ken

> From: Sergey Beryozkin
> Sent: February 24, 2016 7:44:13am PST
> To: user@tika.apache.org
> Subject: Re: Unable to extract content from chunked portion of large file
> 
> Hi All
> 
> If a large file is passed to a Tika server as a multipart/form-data payload,
> then CXF itself will buffer it to a temp file on disk.
> 
> Hmm... I was looking for a reference to it and I found the advice not to
> use multipart/form-data:
> https://wiki.apache.org/tika/TikaJAXRS (in Services)
> 
> I believe that advice should be removed. See, for example, this method from
> 
> http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java:
> 
> @POST
> @Consumes("multipart/form-data")
> @Produces("text/plain")
> @Path("form")
> public StreamingOutput getTextFromMultipart(Attachment att, @Context final UriInfo info) {
>     return produceText(att.getObject(InputStream.class), att.getHeaders(), info);
> }
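> 
> A quick client-side sketch for exercising that endpoint (assuming the
> resource is mounted at /tika, so the form endpoint is /tika/form; the "file"
> part name and the use of Apache HttpClient's mime module are also my
> assumptions):
> 
> import java.io.File;
> 
> import org.apache.http.HttpResponse;
> import org.apache.http.client.methods.HttpPost;
> import org.apache.http.entity.mime.MultipartEntityBuilder;
> import org.apache.http.impl.client.CloseableHttpClient;
> import org.apache.http.impl.client.HttpClients;
> import org.apache.http.util.EntityUtils;
> 
> public class TikaFormClient {
>     public static void main(String[] args) throws Exception {
>         try (CloseableHttpClient client = HttpClients.createDefault()) {
>             // POST the file as multipart/form-data; on the server side CXF
>             // buffers the attachment to disk instead of holding it in memory.
>             HttpPost post = new HttpPost("http://localhost:9998/tika/form");
>             post.setEntity(MultipartEntityBuilder.create()
>                     .addBinaryBody("file", new File("big.xlsx"))
>                     .build());
>             HttpResponse resp = client.execute(post);
>             System.out.println(EntityUtils.toString(resp.getEntity()));
>         }
>     }
> }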
> 
> 
> Cheers, Sergey
> 
> 
> 
> On 24/02/16 15:37, Ken Krugler wrote:
>> Hi Raghu,
>> 
>> I don't think you understood what I was proposing.
>> 
>> I suggested creating a service that could receive chunks of the file
>> (persisted to local disk). This service could then implement an input
>> stream class that reads sequentially from those pieces. That input stream
>> would be passed to Tika, giving Tika a single continuous stream of the
>> entire file content.
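>> 
>> A minimal sketch of that idea (the chunk paths and class name are
>> hypothetical; the JDK's SequenceInputStream does the sequential
>> concatenation):
>> 
>> import java.io.*;
>> import java.nio.file.*;
>> import java.util.*;
>> 
>> import org.apache.tika.Tika;
>> 
>> public class ChunkedUploadReader {
>>     public static void main(String[] args) throws Exception {
>>         // Hypothetical chunk files persisted by the upload service, in order.
>>         List<Path> chunks = Arrays.asList(
>>                 Paths.get("/tmp/uploads/abc123/chunk-0"),
>>                 Paths.get("/tmp/uploads/abc123/chunk-1"),
>>                 Paths.get("/tmp/uploads/abc123/chunk-2"));
>> 
>>         Vector<InputStream> streams = new Vector<>();
>>         for (Path p : chunks) {
>>             streams.add(new BufferedInputStream(Files.newInputStream(p)));
>>         }
>> 
>>         // SequenceInputStream reads each chunk stream to exhaustion and
>>         // then moves on to the next, so Tika sees one continuous stream.
>>         try (InputStream whole = new SequenceInputStream(streams.elements())) {
>>             String text = new Tika().parseToString(whole);
>>             System.out.println("Extracted " + text.length() + " chars");
>>         }
>>     }
>> }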
>> 
>> -- Ken
>> 
>>> ------------------------------------------------------------------------
>>> 
>>> *From:* raghu vittal
>>> 
>>> *Sent:* February 24, 2016 4:32:01am PST
>>> 
>>> *To:* user@tika.apache.org <mailto:user@tika.apache.org>
>>> 
>>> *Subject:* Re: Unable to extract content from chunked portion of large
>>> file
>>> 
>>> 
>>> Thanks for your reply.
>>> 
>>> In our application, users can upload large files. Our intention is to
>>> extract the content from these large files and index it in Elasticsearch
>>> for content-based search. We have .xlsx and .doc files larger than 300 MB,
>>> and sending files that large to Tika causes timeout issues.
>>> 
>>> I tried taking a chunk of a file and passing it to Tika, but Tika gave me
>>> an invalid-data exception.
>>> 
>>> I think we need to pass Tika the entire file at once to extract content.
>>> 
>>> Raghu.
>>> 
>>> ------------------------------------------------------------------------
>>> *From:* Ken Krugler <kkrugler_li...@transpac.com
>>> <mailto:kkrugler_li...@transpac.com>>
>>> *Sent:* Friday, February 19, 2016 8:22 PM
>>> *To:* user@tika.apache.org <mailto:user@tika.apache.org>
>>> *Subject:* RE: Unable to extract content from chunked portion of large file
>>> 
>>> One option is to create your own RESTful API that lets you send chunks of
>>> the file; you can then provide an input stream that gives Tika the
>>> seamless view of the data across the chunks (which is what it needs).
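>>> 
>>> A rough sketch of the chunk-receiving side (the resource path, directory
>>> layout, and class name are all hypothetical, using plain JAX-RS):
>>> 
>>> import java.io.IOException;
>>> import java.io.InputStream;
>>> import java.nio.file.*;
>>> 
>>> import javax.ws.rs.*;
>>> import javax.ws.rs.core.*;
>>> 
>>> @Path("/upload")
>>> public class ChunkUploadResource {
>>> 
>>>     // The client PUTs each chunk with an upload id and sequence number;
>>>     // every chunk is persisted to local disk for later reassembly.
>>>     @PUT
>>>     @Path("{uploadId}/{seq}")
>>>     @Consumes(MediaType.APPLICATION_OCTET_STREAM)
>>>     public Response putChunk(@PathParam("uploadId") String uploadId,
>>>                              @PathParam("seq") int seq,
>>>                              InputStream body) throws IOException {
>>>         // Hypothetical on-disk layout: /tmp/uploads/<uploadId>/chunk-<seq>
>>>         Path dir = Files.createDirectories(Paths.get("/tmp/uploads", uploadId));
>>>         Files.copy(body, dir.resolve("chunk-" + seq),
>>>                 StandardCopyOption.REPLACE_EXISTING);
>>>         return Response.ok().build();
>>>     }
>>> }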
>>> 
>>> -- Ken
>>> 
>>>> ------------------------------------------------------------------------
>>>> *From:* raghu vittal
>>>> *Sent:* February 19, 2016 1:37:49am PST
>>>> *To:* user@tika.apache.org <mailto:user@tika.apache.org>
>>>> *Subject:* Unable to extract content from chunked portion of large file
>>>> 
>>>> Hi All
>>>> 
>>>> We have very large PDF, .docx, and .xlsx files. We are using Tika to
>>>> extract their content and dump the data into Elasticsearch for full-text
>>>> search. Sending very large files to Tika causes an out-of-memory exception.
>>>> 
>>>> We want to chunk a file and send the chunks to Tika for content
>>>> extraction, but when we passed a chunked portion of a file to Tika, it
>>>> returned empty text. I assume Tika relies on the file structure; that is
>>>> why it is not returning any content.
>>>> 
>>>> We are using the Tika Server (REST API) from our .NET application.
>>>> 
>>>> Please suggest a better approach for this scenario.
>>>> 
>>>> Regards,
>>>> Raghu.





--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




