[jira] Issue Comment Edited: (TIKA-522) AutoDetectParser treats HTML/XML files as Audio

Dennis Adler (JIRA) Tue, 05 Oct 2010 12:56:56 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917814#action_12917814
 ]


Dennis Adler edited comment on TIKA-522 at 10/5/10 3:54 PM:
------------------------------------------------------------

Hi Nick,

I can't tell what it is getting; it seems to be the right length. Then it
goes into the FOR loop, and when it fails, it decides all subsequent HTML
files are AUDIO/MPEG (instead of getting the parent type of text, which
would accept the HTML extension and revise the type to TEXT/HTML).

That is part of what is so odd... when it chokes on the first one, it chokes
on all the rest (a subdirectory tree is enumerated one dir/file at a
time using the java.io.File class, and each file is then passed in to Tika).
When it goes south, it does fine with Word, PDF and MSG files (that's what I
have in my test suite), but chokes on all HTML files (most of them, but not
all, were generated from Word documents). Even when I have 700+ files in
several directories... they either all work or all fail, depending on what
the first one does.
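
The enumeration described above could be sketched roughly as follows, with
a check that reports the first HTML file whose detected type is not
text/html. This is only a minimal sketch using the JDK: CrawlCheck, its
Detector interface and firstMisdetectedHtml are hypothetical names invented
for illustration, and the Detector is a stand-in for whatever detection
call is actually under test (it is not a Tika API). Sorting the directory
entries is an assumption added here so that "the first file" is the same on
every run.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class CrawlCheck {
    /** Hypothetical stand-in for the detection call under test. */
    interface Detector {
        String detect(File file);
    }

    /** Recursively collects regular files under root, one directory at a time. */
    static List<File> listFiles(File root) {
        List<File> out = new ArrayList<File>();
        File[] children = root.listFiles();
        if (children == null) {
            return out; // root is not a directory, or an I/O error occurred
        }
        Arrays.sort(children); // fixed order, so the first file is reproducible
        for (File child : children) {
            if (child.isDirectory()) {
                out.addAll(listFiles(child));
            } else {
                out.add(child);
            }
        }
        return out;
    }

    /** Returns the first .htm/.html file not detected as text/html, or null. */
    static File firstMisdetectedHtml(File root, Detector detector) {
        for (File f : listFiles(root)) {
            String name = f.getName().toLowerCase(Locale.ROOT);
            boolean looksHtml = name.endsWith(".html") || name.endsWith(".htm");
            if (looksHtml && !"text/html".equals(detector.detect(f))) {
                return f;
            }
        }
        return null;
    }
}
```

Knowing which file fails first, and whether it is the same file on every
run, may help separate a problem with one particular document from state
that carries over between files.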



      was (Author: dennisad):
    Hi Nick,

Pardon the alias on the GMail addr...

I can't tell what it is getting; it seems to be the right length. Then it
goes into the FOR loop, and when it fails, it decides all subsequent HTML
files are AUDIO/MPEG (instead of getting the parent type of text, which
would accept the HTML extension and revise the type to TEXT/HTML).

That is part of what is so odd... when it chokes on the first one, it chokes
on all the rest (a subdirectory tree is enumerated one dir/file at a
time using the java.io.File class, and each file is then passed in to Tika).
When it goes south, it does fine with Word, PDF and MSG files (that's what I
have in my test suite), but chokes on all HTML files (most of them, but not
all, were generated from Word documents). Even when I have 700+ files in
several directories... they either all work or all fail, depending on what
the first one does.


  
> AutoDetectParser treats HTML/XML files as Audio
> -----------------------------------------------
>
>                 Key: TIKA-522
>                 URL: https://issues.apache.org/jira/browse/TIKA-522
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Windows 7 x64, java v6.0.170.4, jdk1.6.0_21, Eclipse 
> 20100617-1415
>            Reporter: Dennis Adler
>            Assignee: Ken Krugler
>
> I am crawling an SMB share. I've used the steps outlined in Tika samples to 
> initialize; given a File object in f, my code is:
>       parser = new AutoDetectParser();
>       context.set(Parser.class, parser);
>       // Get the URL
>       URL url = f.toURI().toURL();
>       // Extract Metadata
>       Metadata metadata = new Metadata();
>       // -1 = infinite size for XML string buffer (per file)
>       BodyContentHandler handler = new BodyContentHandler(-1);
>       // Get the input stream
>       InputStream input = MetadataHelper.getInputStream(url, metadata);
>       // Parse the document
>       parser.parse(input, handler, metadata, context);
> If I place a breakpoint right after the parser.parse call, the metadata 
> identifies my input as an audio file. If I step through the parse in the 
> debugger, it is correctly tagged as text/html. This looks like a 
> timing-related problem.
> I have a half-baked workaround: I invoke Thread.sleep(5000) just after the 
> context.set invoke... in 3 sequential test runs that works fine. Problem is, 
> this was working fine several days ago without that (perhaps my computer was 
> busy with other things and the timing issue did not pop up then).
> I have downloaded and am building today's 0.8 from svn to see if that helps, 
> though I am concerned about the impact on the rest of my testing if I have 
> to switch to 0.8. Just understanding what was going on would be a huge help :)
> * UPDATE * I was able to repro this once under the debugger. MimeTypes.detect 
> invokes org.apache.tika.mime.MimeTypes.getMimeType on the input stream to 
> determine the Mime Type based on the first 8k of data. I did not trace into 
> getMimeType, but did see it return "audio/mpeg" on an HTML file one time, and 
> "text/html" most others. I can supply the HTML file if desired.
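
To rule out the input side when detection depends on the first 8k of data,
one option might be to read that prefix in a loop and inspect it directly.
A single InputStream.read call is allowed to return fewer bytes than
requested, so a loop is the reliable way to collect a full prefix. The
following is only a sketch using the JDK: PrefixSniffer, readPrefix and
sniff are hypothetical names, the magic checks are a toy illustration and
not Tika's actual rules, and nothing here asserts that a short read is the
cause of this issue.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

final class PrefixSniffer {
    /** 8k, matching the prefix size mentioned in the issue description. */
    static final int PREFIX_LENGTH = 8 * 1024;

    /** Reads up to max bytes, looping because one read() may return fewer. */
    static byte[] readPrefix(InputStream in, int max) throws IOException {
        byte[] buf = new byte[max];
        int total = 0;
        while (total < max) {
            int n = in.read(buf, total, max - total);
            if (n < 0) {
                break; // end of stream reached before max bytes
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    /** Toy classifier (assumption): HTML markers first, then an ID3 tag. */
    static String sniff(byte[] prefix) {
        int len = Math.min(prefix.length, 256);
        String head = new String(prefix, 0, len, StandardCharsets.ISO_8859_1)
                .trim().toLowerCase(Locale.ROOT);
        if (head.startsWith("<!doctype html") || head.startsWith("<html")) {
            return "text/html";
        }
        if (prefix.length >= 3
                && prefix[0] == 'I' && prefix[1] == 'D' && prefix[2] == '3') {
            return "audio/mpeg";
        }
        return "application/octet-stream";
    }
}
```

Logging the length and first few bytes of the prefix for a file that comes
back as audio/mpeg would show whether the detector saw the same data as on
a run where the file is reported as text/html.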

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
