Thank you for your information.
didon...@126.com

From: Tim Allison
Date: 2023-04-20 18:58
To: user
Subject: Re: Re: Tika server extraction failed

Tika's detector runs detection on the byte stream. If the suffix of the file is a subtype of the detected byte stream, then it applies that subtype. So, if you tell tika that your text file is an xml file, it will detect the bytes as text, but then see that xml is a subtype of text and run the XML parser against the file, leading to a parse exception.

If you use the /rmeta endpoint, you'll get better feedback about parse exceptions.

On Thu, Apr 20, 2023 at 2:47 AM didon...@126.com <didon...@126.com> wrote:
>
> Hi,
>
> I have encountered a problem before: some files cannot be detected based on
> their content.
> So I added the file name and solved the problem.
>
> But now I have another problem: adding a file name actually resulted in not
> being detected.
>
> If that's the case, I need to make two attempts:
> Try using content detection first,
> and then try using file name detection.
>
>
> ________________________________
> didon...@126.com
>
>
> From: Tilman Hausherr
> Date: 2023-04-20 14:32
> To: user
> Subject: Re: Tika server extraction failed
> Hi,
>
> I don't see why this is a problem, and you're mentioning the solution
> yourself. If you want detection by content, then don't pass the filename.
>
> Tilman
>
> On 20.04.2023 08:19, didon...@126.com wrote:
>
> Hi, Tilman
>
> I have encountered another problem.
>
> t1.xml is a simple plain text file, not a standard XML file.
> When I use Tika Server 2.7.0 to extract file content, the results are as
> follows:
>
> curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain" -H "Content-Disposition: attachment; filename=t1.xml"
> Result: fail (empty)
>
> curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain"
> curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain" -H "Content-Disposition: attachment; filename=t1.txt"
> curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain" -H "Content-Disposition: attachment; filename=t1.docx"
> Result: success
>
> The file name information affects the extraction result.
>
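
The two-attempt approach discussed in this thread (content detection first, then file name detection) could be scripted roughly as follows. This is only a sketch: `extract_text` and `TIKA_URL` are illustrative names, not part of Tika, and the server address is the one used in the curl commands quoted in the thread. It assumes an empty response body means the extraction failed, which may not hold for documents that genuinely contain no text.

```shell
#!/bin/sh
# Sketch of a two-attempt extraction against Tika Server's /tika endpoint.
# TIKA_URL and extract_text are hypothetical names chosen for this example.
TIKA_URL="${TIKA_URL:-http://127.0.0.1:12000/tika}"

extract_text() {
    file="$1"
    name="$(basename "$file")"

    # Attempt 1: send no file name, so detection relies on the bytes only.
    text="$(curl -s -T "$file" "$TIKA_URL" -H "Accept: text/plain")"

    # Attempt 2: if nothing came back, retry with the file name as a hint.
    # Assumption: an empty body is treated as a failed extraction.
    if [ -z "$text" ]; then
        text="$(curl -s -T "$file" "$TIKA_URL" -H "Accept: text/plain" \
            -H "Content-Disposition: attachment; filename=$name")"
    fi

    printf '%s\n' "$text"
}
```

Calling `extract_text t1.xml` would then return the text from the first attempt for the file described in the thread, since that file extracts fine when no file name is passed. Trying content detection first avoids the case where a misleading suffix sends a plain text file to the XML parser.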