Ok, thank you for your support

Best regards

2015-01-29 15:14 GMT+01:00 Konstantin Gribov <gros...@gmail.com>:

> Hi, Gabriele.
>
> If you're using an InputStream that doesn't support mark/reset, the Tika
> facade (org.apache.tika.Tika) creates a BufferedInputStream, which consumes
> up to 8k of the original InputStream by default, so the Tika MIME type
> detector can't find the PDF magic after the first call.
>
> The second case (copying to byte[]) is similar. If you make this copy
> before calling tika.detect, you consume the input stream, and subsequent
> calls on that stream return application/octet-stream, the default MIME type.
> Detection on the bytes works fine, though, since they hold a full copy of
> the original stream.
>
> If you call tika.detect on the input stream before copying it to bytes, you
> are back in the first case: you copy the input stream without its first 8k,
> so the PDF magic is dropped.
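
The first and third cases can be reproduced without Tika. What follows is a
minimal sketch under stated assumptions, not Tika's actual code: sniff() is a
hypothetical stand-in for tika.detect(InputStream) that only knows the "%PDF"
magic and, as described above, wraps a non-markable stream in a fresh
BufferedInputStream on every call. nonMarkable(), fakePdf() and remaining()
are illustration-only helpers as well.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ConsumedStreamDemo {
    static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns a stream over data that reports markSupported() == false. */
    static InputStream nonMarkable(byte[] data) {
        ByteArrayInputStream source = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override public int read() { return source.read(); }
            @Override public int read(byte[] b, int off, int len) { return source.read(b, off, len); }
        };
    }

    /** Hypothetical stand-in for tika.detect(InputStream); it only knows the PDF magic. */
    static String sniff(InputStream stream) {
        // Assumption: a non-markable stream gets a new buffered wrapper on every call.
        InputStream in = stream.markSupported() ? stream : new BufferedInputStream(stream);
        try {
            byte[] head = new byte[PDF_MAGIC.length];
            in.mark(head.length);
            in.readNBytes(head, 0, head.length); // Java 9+
            in.reset(); // rewinds the wrapper only; what it buffered is gone from 'stream'
            return Arrays.equals(head, PDF_MAGIC) ? "application/pdf" : "application/octet-stream";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fake PDF: the magic followed by zero padding, larger than the 8k buffer. */
    static byte[] fakePdf(int size) {
        byte[] data = new byte[size];
        System.arraycopy(PDF_MAGIC, 0, data, 0, PDF_MAGIC.length);
        return data;
    }

    /** Copies whatever is still unread, much like IOUtils.toByteArray(stream). */
    static byte[] remaining(InputStream stream) {
        try {
            return stream.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] original = fakePdf(20_000);
        InputStream raw = nonMarkable(original);
        System.out.println(" 1 mime : " + sniff(raw));
        System.out.println(" 2 mime : " + sniff(raw));
        byte[] bytes = remaining(raw); // copied only after detection
        System.out.println(" 3 mime : " + sniff(new ByteArrayInputStream(bytes)));
        System.out.println(" copied " + bytes.length + " of " + original.length + " bytes");
    }
}
```

Only the first call sees the magic here. Each later call, and the late copy,
starts somewhere past the header because the earlier wrapper already pulled
those bytes out of the underlying stream.
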
>
> You have to recreate the input stream, copy it to some temporary resource
> (such as a byte[] or a temp file), or wrap it in a BufferedInputStream before
> passing it to tika.detect.
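
Two of these options, sketched with plain JDK classes. This is an
illustration under the same assumptions as before, not a definitive recipe:
sniff() is a hypothetical stand-in for tika.detect(InputStream) that only
checks the "%PDF" magic, and nonMarkable(), fakePdf() and readAll() are
made-up helpers. The point is that the caller creates the BufferedInputStream
once and passes that same wrapper to every call, or copies the content once
and opens a fresh stream over the bytes each time.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReusableStreamDemo {
    static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns a stream over data that reports markSupported() == false. */
    static InputStream nonMarkable(byte[] data) {
        ByteArrayInputStream source = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override public int read() { return source.read(); }
            @Override public int read(byte[] b, int off, int len) { return source.read(b, off, len); }
        };
    }

    /** Hypothetical stand-in for tika.detect(InputStream); it only knows the PDF magic. */
    static String sniff(InputStream stream) {
        InputStream in = stream.markSupported() ? stream : new BufferedInputStream(stream);
        try {
            byte[] head = new byte[PDF_MAGIC.length];
            in.mark(head.length);
            in.readNBytes(head, 0, head.length); // Java 9+
            in.reset(); // a markable stream is back at its start after this
            return Arrays.equals(head, PDF_MAGIC) ? "application/pdf" : "application/octet-stream";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fake PDF: the magic followed by zero padding. */
    static byte[] fakePdf(int size) {
        byte[] data = new byte[size];
        System.arraycopy(PDF_MAGIC, 0, data, 0, PDF_MAGIC.length);
        return data;
    }

    /** Reads the stream to its end, much like IOUtils.toByteArray(stream). */
    static byte[] readAll(InputStream stream) {
        try {
            return stream.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] pdf = fakePdf(20_000);

        // Option 1: wrap once and reuse the same wrapper for every call.
        InputStream buffered = new BufferedInputStream(nonMarkable(pdf));
        System.out.println(" 1 mime : " + sniff(buffered));
        System.out.println(" 2 mime : " + sniff(buffered));
        // The wrapper can still deliver the complete content afterwards.
        System.out.println(" content intact : " + Arrays.equals(pdf, readAll(buffered)));

        // Option 2: copy once, then open a fresh stream over the bytes per use.
        byte[] copy = readAll(nonMarkable(pdf));
        System.out.println(" 3 mime : " + sniff(new ByteArrayInputStream(copy)));
        System.out.println(" 3.2 mime : " + sniff(new ByteArrayInputStream(copy)));
    }
}
```

The byte[] copy holds the whole content in memory, so for large files the
single reused wrapper or a temp file is probably the better fit.
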
>
> --
> Best regards,
> Konstantin Gribov
>
> Thu Jan 29 2015 at 16:07:12, Gabriele Guidi <gabriele.gu...@eng.it>:
>
> Hi
>>
>> No, I checked with the markSupported() function
>> <http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#markSupported()>
>> and it says "NO".
>> The stream is not recreated.
>>
>> The test code is very simple:
>>
>> InputStream inputsbust = content.getContentStream();
>> System.out.println(" mark and reset inputStream ? " + (inputsbust.markSupported() ? "YES" : "NO"));
>> System.out.println(" 1 mime : " + tika.detect(inputsbust));
>> System.out.println(" 2 mime : " + tika.detect(inputsbust));
>> byte[] bytes = IOUtils.toByteArray(inputsbust);
>> System.out.println(" 3 mime : " + tika.detect(bytes));
>> System.out.println(" 3.2 mime : " + tika.detect(bytes));
>>
>>
>> The result:
>>
>> mark and reset of inputStream ? NO
>>
>>  1 mime : application/pdf
>>  2 mime : application/octet-stream
>>  3 mime : application/octet-stream
>>  3.2 mime : application/octet-stream
>>
>>
>> If I move the 5th line ("byte[] bytes = IOUtils.toByteArray(inputsbust);")
>> up to be the second line, the result is:
>>
>> mark and reset of inputStream ? NO
>>
>>  1 mime : application/octet-stream
>>  2 mime : application/octet-stream
>>  3 mime : application/pdf
>>  3.2 mime : application/pdf
>>
>>
>> I hope it helps
>>
>> Thanks
>>
>>
>> 2015-01-29 10:49 GMT+01:00 Konstantin Gribov <gros...@gmail.com>:
>>
>>> Hi,
>>>
>>> Does this InputStream support mark/reset functionality? Is the InputStream
>>> recreated before each subsequent call to tika.detect, or is detect called on
>>> a partially consumed stream (in case mark isn't supported)?
>>>
>>> --
>>> Best regards,
>>> Konstantin Gribov
>>>
>>> Thu Jan 29 2015 at 9:25:28, Mattmann, Chris A (3980) <
>>> chris.a.mattm...@jpl.nasa.gov>:
>>>
>>> Dear Gabriele,
>>>>
>>>> Thanks for your question. It should be sent to dev@tika.apache.org
>>>> (moving dev-ow...@tika.apache.org to BCC).
>>>>
>>>> I’ll take a look tomorrow if someone else hasn’t answered yet.
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Chief Architect
>>>> Instrument Software and Science Data Systems Section (398)
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 168-519, Mailstop: 168-527
>>>> Email: chris.a.mattm...@nasa.gov
>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Associate Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Gabriele Guidi <gabriele.gu...@eng.it>
>>>> Date: Wednesday, January 28, 2015 at 5:25 AM
>>>> To: "dev-ow...@tika.apache.org" <dev-ow...@tika.apache.org>
>>>> Subject: multiple detect call -> different results (tika 1.7)
>>>>
>>>> >
>>>> >
>>>> >Hi,
>>>> >
>>>> >
>>>> >I found a strange behavior. I have a p7m file and extract the file inside
>>>> >the signed one; after that I use Tika to discover its MIME type. The first
>>>> >call gives me "application/pdf" (that's correct), BUT every subsequent
>>>> >call to Tika's detect method on the same InputStream gives me
>>>> >"application/octet-stream". Why?
>>>> >I cannot understand this behavior or find a solution.
>>>> >
>>>> >
>>>> >Just a snippet of code:
>>>> >
>>>> >
>>>> >
>>>> >InputStream inputsbust = content.getContentStream();
>>>> >
>>>> >System.out.println(" 1 mime " + filepath + " : " + tika.detect(inputsbust));
>>>> >System.out.println(" 2 mime " + filepath + " : " + tika.detect(inputsbust));
>>>> >System.out.println(" 3 mime " + filepath + " : " + tika.detect(inputsbust));
>>>> >
>>>> >
>>>> >
>>>> >Result:
>>>> >
>>>> > 1 mime /home/gguidi/01_file.pdf : application/pdf
>>>> > 2 mime /home/gguidi/01_file.pdf : application/octet-stream
>>>> > 3 mime /home/gguidi/01_file.pdf : application/octet-stream
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >Thanks
>>>> >
>>>> >
>>>> >--
>>>> >
>>>> >
>>>> >Gabriele Guidi
>>>> >Direzione Pubblica Amministrazione
>>>> >gabriele.gu...@eng.it
>>>> >
>>>> >Engineering Ingegneria Informatica spa
>>>> >Via Marconi, 10 - 40122, Bologna
>>>> >Tel. +39-051.0435135
>>>> >www.eng.it <http://www.eng.it>
>>>> >
>>>> >
>>>> >Rispetta l'ambiente. Non stampare questa e-mail se non necessario.
>>>> >Respect the environment. Please don't print this e-mail unless you
>>>> >really need to.
>>>> >Le informazioni trasmesse sono destinate esclusivamente alla persona o
>>>> >alla società in indirizzo e sono da intendersi confidenziali e
>>>> >riservate. Ogni trasmissione, inoltro, diffusione o altro uso di queste
>>>> >informazioni a persone o società differenti dal destinatario è
>>>> >proibita. Se ricevete questa comunicazione per errore, contattate il
>>>> >mittente e cancellate le informazioni da ogni computer.
>>>> >The information transmitted is intended only for the person or entity
>>>> >to which it is addressed and may contain confidential and/or privileged
>>>> >material. Any review, retransmission, dissemination or other use of, or
>>>> >taking of any action in reliance upon, this information by persons or
>>>> >entities other than the intended recipient is prohibited. If you
>>>> >received this in error, please contact the sender and delete the
>>>> >material from any computer.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>>
>>>>
>>
>>
>>
>


-- 



* Gabriele Guidi*
 Direzione Pubblica Amministrazione
gabriele.gu...@eng.it

*Engineering Ingegneria Informatica spa*
Via Marconi, 10 - 40122, Bologna
Tel. +39-051.0435135
 www.eng.it

