[ 
https://issues.apache.org/jira/browse/TIKA-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919685#action_12919685
 ] 

Dennis Adler commented on TIKA-522:
-----------------------------------

As soon as I can develop a reliable repro case I would be happy to retest.
Right now the problem comes and goes with 0.7. Previously, when I thought
the SLEEP call fixed things, it worked fine for a week or so before failing
again. It is one of those very frustrating sorts of bugs (both for me and
for those of you trying to help me figure it out).

If I install a 0.8 SVN build and the problem does not repro, I am unsure
what that demonstrates... perhaps the timing is different and the bug pops
somewhere else, or perhaps the bug is really fixed? Sometimes I can run for
several days and the bug does not reoccur in 0.7. Sometimes it shows up
every few runs.

Until I have a reliable repro I do not know how to figure out if fixes in
0.8 MIME detection have made the problem disappear or if it is just masking
the bug I've hit. If there is a good repro case, then (a) I can determine if
the bug disappears for that repro case, and (b) others who know Tika better
may decide to trace down the origins of what I found to see if there is some
tricky, timing-related bug still hiding there or determine that it was
really fixed.

Suggestions on how to determine otherwise are welcome! In the meantime I
will try to find a good repro case as time permits (still running tests with
the work-around patches to the MimeTypes class). Once that is done I will
also try several runs with 0.8 to see what happens.
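
A minimal sketch of what a repro harness for an intermittent detection
result might look like. It is not taken from Tika or from the issue: the
class name DetectionStressHarness, the run method, and the stand-in detector
are all hypothetical, and it assumes Java 8 or later for the lambdas. The
idea is to call a detection function many times from several threads at once
and tally every distinct answer, so a rare wrong result (for example
"audio/mpeg" for an HTML file) shows up as a second key in the tally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical harness: not part of Tika, names chosen for illustration only.
public class DetectionStressHarness {

    // Calls the detector threads * iterationsPerThread times and returns
    // how often each distinct result was seen.
    public static Map<String, Integer> run(Function<byte[], String> detector,
                                           byte[] data,
                                           int threads,
                                           int iterationsPerThread) throws Exception {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                Callable<Void> task = () -> {
                    // Hold every worker until all are queued, to maximize overlap.
                    start.await();
                    for (int i = 0; i < iterationsPerThread; i++) {
                        // Each call gets its own copy of the bytes.
                        String type = detector.apply(data.clone());
                        counts.merge(type, 1, Integer::sum);
                    }
                    return null;
                };
                futures.add(pool.submit(task));
            }
            start.countDown();
            for (Future<Void> f : futures) {
                f.get(); // rethrows any failure from a worker
            }
        } finally {
            pool.shutdown();
        }
        return new TreeMap<>(counts);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in detector (assumption). A real run would replace this with
        // a call into the detection code under test.
        Function<byte[], String> standIn = bytes -> "text/html";
        byte[] sample = "<html><body>hello</body></html>".getBytes("UTF-8");
        System.out.println(run(standIn, sample, 8, 1000));
    }
}
```

For a real test the stand-in would be swapped for the detection call being
investigated. Since the SLEEP workaround sat right after set-up, it may also
be worth having the detector build a fresh parser or detector on every call
rather than reuse one, in case the problem is tied to initialization. A
tally with more than one key for a single input file would be the repro.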




> AutoDetectParser treats HTML/XML files as Audio
> -----------------------------------------------
>
>                 Key: TIKA-522
>                 URL: https://issues.apache.org/jira/browse/TIKA-522
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Windows 7 x64, java v6.0.170.4, jdk1.6.0_21, Eclipse 
> 20100617-1415
>            Reporter: Dennis Adler
>            Assignee: Ken Krugler
>         Attachments: Tika MimeTypes bug repro case.htm
>
>
> I am crawling an SMB share. I've used the steps outlined in Tika samples to 
> initialize; given a File object in f, my code is:
>       parser = new AutoDetectParser();
>       context.set(Parser.class, parser);
>       // Get the URL
>       URL url = f.toURI().toURL();
>       // Extract Metadata
>       Metadata metadata = new Metadata();
>       // -1 = infinite size for XML string buffer (per file)
>       BodyContentHandler handler = new BodyContentHandler(-1);
>       // Get the input stream
>       InputStream input = MetadataHelper.getInputStream(url, metadata);
>       // Parse the document
>       parser.parse(input, handler, metadata, context);
> If I place a breakpoint right after the parser.parse invoke, I find the 
> metadata calling my input out as an Audio file. If I try to debug the parse 
> steps, it correctly tags it as Text/HTML. Seems like a timing-related problem.
> I have a half-baked workaround: I invoke Thread.sleep(5000) just after the 
> context.set invoke... in 3 sequential test runs that works fine. Problem is, 
> this was working fine several days ago without that (perhaps my computer was 
> busy with other things and the timing issue did not pop up then).
> I have downloaded and am building today's 0.8 from svn to see if that helps, 
> though I am concerned about the impacts to the rest of my testing if I have 
> to switch to 0.8. Just understanding what was going on would be a huge help :)
> * UPDATE * I was able to repro this once under the debugger. MimeTypes.detect 
> invokes org.apache.tika.mime.MimeTypes.getMimeType on the input stream to 
> determine the Mime Type based on the first 8k of data. I did not trace into 
> getMimeType, but did see it return "audio/mpeg" on an HTML file one time, and 
> "text/html" most others. I can supply the HTML file if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
