Tika's detector first runs detection on the byte stream.  If the type
implied by the file's suffix is a subtype of the type detected from the
bytes, then it applies that subtype.

So, if you tell Tika that your text file is an XML file, it will
detect the bytes as text, but then see that XML is a subtype of text
and run the XML parser against the file, leading to a parse exception.
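
To make the rule concrete, here is a toy Python model of it.  This is
only a sketch of the behavior described above, not Tika's actual code:
the function names (is_subtype, refine) and the small hand-written type
table are assumptions made up for illustration.

```python
import os

# Toy model of the rule described above; NOT Tika's implementation.
# Hypothetical, hand-written type hierarchy: child type -> parent type.
PARENT = {
    "application/xml": "text/plain",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        "application/zip",
}

# Hypothetical mapping from file suffix to the type that suffix implies.
SUFFIX_TYPE = {
    ".xml": "application/xml",
    ".txt": "text/plain",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}


def is_subtype(candidate, ancestor):
    """Return True if `candidate` is a strict descendant of `ancestor`."""
    current = PARENT.get(candidate)
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def refine(detected_from_bytes, filename=None):
    """Apply the suffix-implied type only if it specializes the byte-detected type."""
    if filename is None:
        return detected_from_bytes
    suffix = os.path.splitext(filename)[1].lower()
    hinted = SUFFIX_TYPE.get(suffix)
    if hinted is not None and is_subtype(hinted, detected_from_bytes):
        return hinted
    return detected_from_bytes


# The bytes of t1.xml are plain text in every case; only the file name varies.
for name in (None, "t1.txt", "t1.docx", "t1.xml"):
    print(name, "->", refine("text/plain", name))
```

In this model only the t1.xml name changes the outcome (to
application/xml), because the other suffixes do not name a subtype of
text/plain.  That matches the pattern in the results quoted below.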

If you use the /rmeta endpoint, you'll get better feedback about parse
exceptions.

On Thu, Apr 20, 2023 at 2:47 AM didon...@126.com <didon...@126.com> wrote:
>
> Hi,
>
> I have encountered a problem before: some files cannot be detected based on 
> their content.
> So I added the file name and solved the problem.
>
> But now I have another problem: adding a file name actually resulted in not 
> being detected.
>
> If that's the case, I need to make two attempts:
> Try using content detection first,
> and then try using file name detection.
>
>
> ________________________________
> didon...@126.com
>
>
> From: Tilman Hausherr
> Date: 2023-04-20 14:32
> To: user
> Subject: Re: Tika server extraction failed
> Hi,
>
> I don't see why this is a problem, and you're mentioning the solution 
> yourself. If you want detection by content, then don't pass the filename.
>
> Tilman
>
> On 20.04.2023 08:19, didon...@126.com wrote:
>
> Hi, Tilman
>
>     I have encountered another problem.
>
>     t1.xml is a simple plain text file, not a standard XML file.
>     When I use Tika Server 2.7.0 to extract file content, the results are as 
> follows:
>
> curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain" -H "Content-Disposition: attachment; filename=t1.xml"
> Result: fail (empty)
>
> curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain"
> curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain" -H "Content-Disposition: attachment; filename=t1.txt"
> curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain" -H "Content-Disposition: attachment; filename=t1.docx"
> Result: success
>
>     The file name information affects the extraction result.
>
>
>
