I noticed that, when crawling the same HTML files repeatedly, fetcher.java does not always extract identical outlinks. On closer inspection, the CyberNeko HTML parser appears to "randomly" scan attributes in <a>..</a> elements incorrectly. For example, I have seen the attribute <a href=blah> interpreted as <a name=blah>, or vice versa, but not every time. The problem goes away, however, when fetcher.java is run with a single thread.
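Since the problem only shows up with multiple threads, one possible explanation (an assumption, not something confirmed here) is that mutable parser state is being shared between fetcher threads. The sketch below shows one way to check for that kind of inconsistency: it parses the same HTML many times across a thread pool, with one parser instance per thread, and collects the distinct outlink lists. `StatefulLinkParser` is a hypothetical regex-based stand-in with mutable state; it is not the CyberNeko API and not the code in fetcher.java.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OutlinkConsistencyCheck {

    // Hypothetical stand-in for a parser that keeps mutable scanning state.
    // An instance like this is not safe to share between threads.
    static class StatefulLinkParser {
        private static final Pattern HREF =
            Pattern.compile("<a\\s+href=\"?([^\"\\s>]+)\"?", Pattern.CASE_INSENSITIVE);
        private final List<String> current = new ArrayList<>();

        List<String> parse(String html) {
            current.clear();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                current.add(m.group(1));
            }
            return new ArrayList<>(current); // copy so callers never see later mutations
        }
    }

    // Assumption: giving each thread its own parser avoids shared-state corruption.
    private static final ThreadLocal<StatefulLinkParser> PARSER =
        ThreadLocal.withInitial(StatefulLinkParser::new);

    // Parses the same HTML threads * runsPerThread times and returns the distinct results.
    // A single entry means every run extracted identical outlinks.
    static Set<List<String>> distinctResults(String html, int threads, int runsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Callable<List<String>> task = () -> PARSER.get().parse(html);
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i < threads * runsPerThread; i++) {
                futures.add(pool.submit(task));
            }
            Set<List<String>> distinct = new HashSet<>();
            for (Future<List<String>> f : futures) {
                distinct.add(f.get());
            }
            return distinct;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">A</a> <a name=\"top\">T</a> <a href=b.html>B</a>";
        Set<List<String>> distinct = distinctResults(html, 8, 100);
        System.out.println((distinct.size() == 1 ? "consistent: " : "inconsistent: ") + distinct);
    }
}
```

If a check like this reports more than one distinct result when a single parser instance is shared, but a single result with per-thread instances, that would point toward shared parser state rather than the HTML itself.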
Given its nature, I would expect the CyberNeko HTML parser to interpret attributes differently for non-standard HTML. But my tests were done with valid HTML. Has anyone else experienced the same problem?

John

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
