[ https://issues.apache.org/jira/browse/TIKA-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917814#action_12917814 ]
Dennis Adler commented on TIKA-522:
-----------------------------------

Hi Nick,

Pardon the alias on the GMail addr...

I can't tell what it is getting; it seems to be the right length. Then it goes into the FOR loop, and when it fails, it decides all subsequent HTML files are AUDIO/MPEG (instead of getting the parent type of text, which would accept the HTML extension and revise the type to TEXT/HTML).

That is part of what is so odd... when it chokes on the first one, it chokes on all the rest. (A subdirectory tree is enumerated one dir/file at a time using the java.io.File class, then passed in to Tika.) When it goes south, it does fine with Word, PDF and MSG files (that's what I have in my test suite), but chokes on all HTML files (most of them, but not all, were generated from Word documents).

Even when I have 700+ files in several directories... they either all work or all fail, depending on what the first one does.

> AutoDetectParser treats HTML/XML files as Audio
> -----------------------------------------------
>
>              Key: TIKA-522
>              URL: https://issues.apache.org/jira/browse/TIKA-522
>          Project: Tika
>       Issue Type: Bug
>       Components: parser
> Affects Versions: 0.7
>      Environment: Windows 7 x64, java v6.0.170.4, jdk1.6.0_21, Eclipse 20100617-1415
>         Reporter: Dennis Adler
>         Assignee: Ken Krugler
>
> I am crawling an SMB share.
> I've used the steps outlined in Tika samples to initialize; given a File object in f, my code is:
>
>     parser = new AutoDetectParser();
>     context.set(Parser.class, parser);
>     // Get the URL
>     URL url = f.toURI().toURL();
>     // Extract Metadata
>     Metadata metadata = new Metadata();
>     BodyContentHandler handler = new BodyContentHandler(-1); // -1 = infinite size for XML string buffer (per file)
>     // Get the input stream
>     InputStream input = MetadataHelper.getInputStream(url, metadata);
>     // Parse the document
>     parser.parse(input, handler, metadata, context);
>
> If I place a breakpoint right after the parser.parse call, I find the metadata calling my input out as an Audio file. If I try to debug the parse steps, it correctly tags it as Text/HTML. Seems like a timing-related problem.
>
> I have a half-baked workaround: I invoke Thread.sleep(5000) just after the context.set call... in 3 sequential test runs that works fine. Problem is, this was working fine several days ago without that (perhaps my computer was busy with other things and the timing issue did not pop up then).
>
> I have downloaded and am building today's 0.8 from svn to see if that helps, though I am concerned about the impacts to the rest of my testing if I have to switch to 0.8. Just understanding what was going on would be a huge help :)
>
> * UPDATE * I was able to repro this once under the debugger. MimeTypes.detect invokes org.apache.tika.mime.MimeTypes.getMimeType on the input stream to determine the Mime Type based on the first 8k of data. I did not trace into getMimeType, but did see it return "audio/mpeg" on an HTML file one time, and "text/html" most others. I can supply the HTML file if desired.
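
For anyone trying to narrow this down, here is a minimal sketch of a Tika-independent cross-check that could be run over the same directory tree. It is not Tika code and not part of the original report. It walks the tree with java.io.File, reads the first 8k of each file (the same window size the update above mentions), and prints whether that head contains an obvious HTML marker. The class and method names (HtmlSniffSketch, looksLikeHtml, readHead, walk) are hypothetical, and the marker test is a deliberately crude assumption, not a reimplementation of Tika's detection. The output only gives a stable baseline to compare against whatever type ends up in the Metadata object on a failing run.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical diagnostic helper; uses only the JDK, not Tika. */
public class HtmlSniffSketch {
    // Assumption: look at the same "first 8k" window the report mentions.
    static final int HEAD_SIZE = 8 * 1024;
    // ISO-8859-1 maps every byte to a char, so decoding never fails.
    static final Charset LATIN1 = Charset.forName("ISO-8859-1");

    /** Crude check: does the head of the file contain an HTML marker? */
    public static boolean looksLikeHtml(byte[] head) {
        String text = new String(head, LATIN1).toLowerCase(Locale.ENGLISH);
        return text.contains("<html") || text.contains("<!doctype html");
    }

    /** Reads at most HEAD_SIZE bytes from the start of the file. */
    public static byte[] readHead(File file) throws IOException {
        InputStream in = new FileInputStream(file);
        try {
            byte[] buf = new byte[HEAD_SIZE];
            int total = 0;
            while (total < buf.length) {
                int n = in.read(buf, total, buf.length - total);
                if (n == -1) {
                    break; // end of file reached before the window was full
                }
                total += n;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.close();
        }
    }

    /** Recursively visits a directory tree, one dir/file at a time. */
    public static void walk(File dir) throws IOException {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or not readable
        }
        for (File f : children) {
            if (f.isDirectory()) {
                walk(f);
            } else {
                String verdict = looksLikeHtml(readHead(f)) ? "html-like" : "other";
                System.out.println(f + "\t" + verdict);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The root directory is taken from the command line; "." is only a default.
        walk(new File(args.length > 0 ? args[0] : "."));
    }
}
```

If a file is reported as html-like here on every run but comes back from the parser as audio/mpeg only on some runs, that would point at the detection step rather than at the file contents.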