Hi Sergey,

Thanks for digging into the code - I'd seen the docs and assumed it wouldn't work.
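Assuming the endpoint Sergey found below is live, posting the file as multipart/form-data should let CXF spool it to a temp file instead of buffering it all in memory. A quick client-side sketch (untested - the /tika/form path, the "file" part name, and the use of Apache HttpClient here are my assumptions, so check them against your server):

import java.io.File;

import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class TikaFormPost {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            // POST the file as a multipart/form-data part; server-side,
            // CXF should write a large payload to a temp file rather than
            // holding it in memory. Port 9998 is the tika-server default.
            HttpPost post = new HttpPost("http://localhost:9998/tika/form");
            post.setEntity(MultipartEntityBuilder.create()
                .addBinaryBody("file", new File(args[0]))
                .build());
            HttpResponse response = client.execute(post);
            System.out.println(response.getStatusLine());
        }
    }
}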
Anybody have a chance to give that a try? Maybe Raghu? :)

-- Ken

> From: Sergey Beryozkin
> Sent: February 24, 2016 7:44:13am PST
> To: user@tika.apache.org
> Subject: Re: Unable to extract content from chunked portion of large file
>
> Hi All
>
> > If a large file is passed to a Tika server as a multipart/form payload
> > then CXF will be creating a temp file on the disk itself.
>
> Hmm... I was looking for a reference to it, and instead I found advice not
> to use multipart/form-data:
> https://wiki.apache.org/tika/TikaJAXRS (in Services)
>
> I believe that advice should be removed - see:
>
> http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
>
> for example:
>
> @POST
> @Consumes("multipart/form-data")
> @Produces("text/plain")
> @Path("form")
> public StreamingOutput getTextFromMultipart(Attachment att,
>         @Context final UriInfo info) {
>     return produceText(att.getObject(InputStream.class), att.getHeaders(),
>             info);
> }
>
> Cheers, Sergey
>
> On 24/02/16 15:37, Ken Krugler wrote:
>> Hi Raghu,
>>
>> I don't think you understood what I was proposing.
>>
>> I suggested creating a service that could receive chunks of the file
>> (persisted to local disk). Then this service could implement an input
>> stream class that would read sequentially from those pieces. This input
>> stream would be passed to Tika, thus giving Tika a single continuous
>> stream over the entire file content.
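>>
>> A minimal sketch of that idea (untested; the chunk file naming and the
>> streamOverChunks helper are just illustrative) - the JDK's
>> java.io.SequenceInputStream already handles the stitching:
>>
>> import java.io.*;
>> import java.util.*;
>>
>> // Builds one continuous stream over chunk files persisted to disk as
>> // chunk.0, chunk.1, ..., so Tika sees the complete document even
>> // though it arrived in pieces.
>> public static InputStream streamOverChunks(File dir, int numChunks)
>>         throws IOException {
>>     Vector<InputStream> parts = new Vector<>();
>>     for (int i = 0; i < numChunks; i++) {
>>         parts.add(new FileInputStream(new File(dir, "chunk." + i)));
>>     }
>>     return new SequenceInputStream(parts.elements());
>> }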
>>
>> -- Ken
>>
>>> ------------------------------------------------------------------------
>>> *From:* raghu vittal
>>> *Sent:* February 24, 2016 4:32:01am PST
>>> *To:* user@tika.apache.org
>>> *Subject:* Re: Unable to extract content from chunked portion of large
>>> file
>>>
>>> Thanks for your reply.
>>>
>>> In our application, users can upload large files. Our intention is to
>>> extract the content from those large files and dump it into
>>> Elasticsearch for content-based search. We have .xlsx and .doc files
>>> over 300 MB in size, and sending files that large to Tika causes
>>> timeout issues.
>>>
>>> I tried taking a chunk of a file and passing it to Tika, and Tika gave
>>> me an invalid data exception.
>>>
>>> I think we need to pass the entire file to Tika at once to extract
>>> content.
>>>
>>> Raghu.
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Ken Krugler <kkrugler_li...@transpac.com>
>>> *Sent:* Friday, February 19, 2016 8:22 PM
>>> *To:* user@tika.apache.org
>>> *Subject:* RE: Unable to extract content from chunked portion of large
>>> file
>>>
>>> One option is to create your own RESTful API that lets you send chunks
>>> of the file, and then provide an input stream that gives Tika the
>>> seamless view of the data it needs.
>>>
>>> -- Ken
>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* raghu vittal
>>>> *Sent:* February 19, 2016 1:37:49am PST
>>>> *To:* user@tika.apache.org
>>>> *Subject:* Unable to extract content from chunked portion of large file
>>>>
>>>> Hi All
>>>>
>>>> We have very large PDF, .docx, and .xlsx files. We are using Tika to
>>>> extract content and dump the data into Elasticsearch for full-text
>>>> search. Sending very large files to Tika causes an out-of-memory
>>>> exception.
>>>>
>>>> We want to chunk the file and send it to Tika for content extraction.
>>>> When we passed a chunked portion of a file to Tika, it returned empty
>>>> text. I assume Tika relies on the file structure; that is why it is
>>>> not returning any content.
>>>>
>>>> We are using the Tika Server (REST API) from our .NET application.
>>>>
>>>> Please suggest a better approach for this scenario.
>>>>
>>>> Regards,
>>>> Raghu.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr