[jira] Updated: (TIKA-522) AutoDetectParser treats HTML/XML files as Audio

Dennis Adler (JIRA) Fri, 01 Oct 2010 12:30:58 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dennis Adler updated TIKA-522:
------------------------------

    Description: 
I am crawling an SMB share. I've used the steps outlined in Tika samples to 
initialize; given a File object in f, my code is:

        parser = new AutoDetectParser();
        context.set(Parser.class, parser);
        // Get the URL
        URL url = f.toURI().toURL();
        // Extract Metadata
        Metadata metadata = new Metadata();
        // -1 = infinite size for XML string buffer (per file)
        BodyContentHandler handler = new BodyContentHandler(-1);
        // Get the input stream
        InputStream input = MetadataHelper.getInputStream(url, metadata);
        // Parse the document
        parser.parse(input, handler, metadata, context);

If I place a breakpoint right after the parser.parse invoke, I find the 
metadata calling my input out as an Audio file. If I try to debug the parse 
steps, it correctly tags it as Text/HTML. Seems like a timing-related problem.

I have a half-baked workaround: I invoke Thread.sleep(5000) just after the 
context.set invoke... in 3 sequential test runs that works fine. Problem is, 
this was working fine several days ago without that (perhaps my computer was 
busy with other things and the timing issue did not pop up then).

I have downloaded and am building today's 0.8 from svn to see if that helps, 
though I am concerned about the impact on the rest of my testing if I have to 
switch to 0.8. Just understanding what was going on would be a huge help :)

* UPDATE * I was able to repro this once under the debugger. MimeTypes.detect 
invokes org.apache.tika.mime.MimeTypes.getMimeType on the input stream to 
determine the Mime Type based on the first 8k of data. I did not trace into 
getMimeType, but did see it return "audio/mpeg" on an HTML file one time, and 
"text/html" most others. I can supply the HTML file if desired.
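
The UPDATE above says detection works from the first 8k of the stream. One way 
to test the timing theory would be to capture exactly those bytes on a good run 
and on a bad run and compare them. The following is only a minimal sketch using 
java.io: PrefixReader and readPrefix are hypothetical names, not Tika API, and 
the 8192 limit is taken from the "8k" figure above rather than from Tika's 
source. It loops on read() because a single call may return fewer bytes than 
requested, then rewinds so the parser still sees the whole stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical diagnostic helper, not part of Tika.
final class PrefixReader {

    /**
     * Reads up to 'limit' bytes from the start of 'in', then rewinds the
     * stream to where it was. The stream must support mark/reset.
     */
    static byte[] readPrefix(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        try {
            byte[] buf = new byte[limit];
            int total = 0;
            // One read() may deliver only part of the request, so keep
            // reading until the buffer is full or the stream ends.
            while (total < limit) {
                int n = in.read(buf, total, limit - total);
                if (n == -1) {
                    break; // end of stream before 'limit' bytes
                }
                total += n;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset(); // leave the stream positioned for the real parse
        }
    }
}
```

Calling PrefixReader.readPrefix(input, 8192) just before parser.parse (with 
input wrapped in a java.io.BufferedInputStream if it does not support 
mark/reset) and logging the result should show whether a misdetected run saw 
fewer or different bytes than a correct one.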

  was:
I am crawling an SMB share. I've used the steps outlined in Tika samples to 
initialize; given a File object in f, my code is:

        parser = new AutoDetectParser();
        context.set(Parser.class, parser);
        // Get the URL
        URL url = f.toURI().toURL();
        // Extract Metadata
        Metadata metadata = new Metadata();
        // -1 = infinite size for XML string buffer (per file)
        BodyContentHandler handler = new BodyContentHandler(-1);
        // Get the input stream
        InputStream input = MetadataHelper.getInputStream(url, metadata);
        // Parse the document
        parser.parse(input, handler, metadata, context);

If I place a breakpoint right after the parser.parse invoke, I find the 
metadata calling my input out as an Audio file. If I try to debug the parse 
steps, it correctly tags it as Text/HTML. Seems like a timing-related problem.

I have a half-baked workaround: I invoke Thread.sleep(5000) just after the 
context.set invoke... in 3 sequential test runs that works fine. Problem is, 
this was working fine several days ago without that (perhaps my computer was 
busy with other things and the timing issue did not pop up then).

I have downloaded and am building today's 0.8 from svn to see if that helps, 
though I am concerned about the impact on the rest of my testing if I have to 
switch to 0.8. Just understanding what was going on would be a huge help :)





Added *UPDATE*  to description

> AutoDetectParser treats HTML/XML files as Audio
> -----------------------------------------------
>
>                 Key: TIKA-522
>                 URL: https://issues.apache.org/jira/browse/TIKA-522
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Windows 7 x64, java v6.0.170.4, jdk1.6.0_21, Eclipse 
> 20100617-1415
>            Reporter: Dennis Adler
>
> I am crawling an SMB share. I've used the steps outlined in Tika samples to 
> initialize; given a File object in f, my code is:
>       parser = new AutoDetectParser();
>       context.set(Parser.class, parser);
>       // Get the URL
>       URL url = f.toURI().toURL();
>       // Extract Metadata
>       Metadata metadata = new Metadata();
>       // -1 = infinite size for XML string buffer (per file)
>       BodyContentHandler handler = new BodyContentHandler(-1);
>       // Get the input stream
>       InputStream input = MetadataHelper.getInputStream(url, metadata);
>       // Parse the document
>       parser.parse(input, handler, metadata, context);
> If I place a breakpoint right after the parser.parse invoke, I find the 
> metadata calling my input out as an Audio file. If I try to debug the parse 
> steps, it correctly tags it as Text/HTML. Seems like a timing-related problem.
> I have a half-baked workaround: I invoke Thread.sleep(5000) just after the 
> context.set invoke... in 3 sequential test runs that works fine. Problem is, 
> this was working fine several days ago without that (perhaps my computer was 
> busy with other things and the timing issue did not pop up then).
> I have downloaded and am building today's 0.8 from svn to see if that helps, 
> though I am concerned about the impact on the rest of my testing if I have 
> to switch to 0.8. Just understanding what was going on would be a huge help :)
> * UPDATE * I was able to repro this once under the debugger. MimeTypes.detect 
> invokes org.apache.tika.mime.MimeTypes.getMimeType on the input stream to 
> determine the Mime Type based on the first 8k of data. I did not trace into 
> getMimeType, but did see it return "audio/mpeg" on an HTML file one time, and 
> "text/html" most others. I can supply the HTML file if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
