Bugs item #993099, was opened at 2004-07-18 01:19
Message generated for change (Comment added) made by joa23
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=993099&group_id=59548

Category: plugin: parse-html
Group: None
>Status: Closed
>Resolution: Rejected
Priority: 5
Submitted By: Jungshik Shin (jshin)
Assigned to: Nobody/Anonymous (nobody)
Summary: The content of alt, longdesc, title attribute aren't stored

Initial Comment:
'alt', 'title' and 'longdesc' can contain valuable information, but their values are not currently stored by Nutch. Those attributes describe the content of image files, audio files, video files, and other non-textual data for those who can't view, listen to, or watch them. That includes not just blind or deaf users and text browser users but also crawlers like Nutch. Note that 'alt' is mandatory for the 'img' tag in HTML 4.x and XHTML 1.x because it's an important accessibility feature (see http://www.w3.org/WAI).

When web accessibility is promoted, it's almost always mentioned that accessibility is not only for the disabled but also for machine agents (crawlers, search engines, etc.). That is, 'alt', 'longdesc', and 'title' are there for Nutch to take advantage of, and Nutch should do that.

The other day, while 'browsing' web pages Nutch collected (both raw HTML files and text summary files), I hit upon a page with two dozen JPEG photos with detailed descriptions for them in longdesc and title. Nutch doesn't have any of those descriptions in the text summary.

Attached is my patch, which uses regex. I'm not sure whether regex is too slow for this, in which case I'll make it use 'Set'.

----------------------------------------------------------------------

>Comment By: Stefan Groschupf (joa23)
Date: 2005-03-10 20:17

Message:
Logged In: YES
user_id=396197

See the last comment by Doug; please update your patch against the latest source code in Apache's Subversion repository or write your own index plugin.
Please submit a new patch to the new issue tracker: http://issues.apache.org/jira/browse/Nutch

----------------------------------------------------------------------

Comment By: Jungshik Shin (jshin)
Date: 2004-07-20 04:53

Message:
Logged In: YES
user_id=307557

Thanks for taking a look. I agree with you on all three points. I'll make the changes you suggested later this week and upload a new patch.

----------------------------------------------------------------------

Comment By: Doug Cutting (cutting)
Date: 2004-07-19 18:49

Message:
Logged In: YES
user_id=21778

I agree that we should fix this, but I'm not sure this patch is quite ready.

First, it is against an old version of the code, not the latest CVS, I think.

Second, I think that, instead of regex, a Set of names to check would be much faster. And shouldn't this check be case-insensitive? If so, then the set should use String.CASE_INSENSITIVE_ORDER.

Third, should we only index these attributes where they're legal? According to http://www.w3.org/TR/html4/index/attributes.html, LONGDESC is only allowed in IMG, ALT is only allowed in IMG, AREA, APPLET, and INPUT, and TITLE is allowed almost everywhere. If we allow them everywhere we might be tempting spammers...

----------------------------------------------------------------------

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
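
Doug's second and third suggestions in the thread (a Set of names instead of a regex, compared case-insensitively, and indexing an attribute only on elements where it is legal) might be sketched roughly as follows. This is an illustration only, not the attached patch and not Nutch code: the class name IndexableAttributes and the method isIndexable are hypothetical, the element lists are copied from the comment as written rather than re-checked against the HTML 4 attribute index, and TITLE is simply treated as allowed on every element, which simplifies "almost everywhere".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of Nutch. */
final class IndexableAttributes {

    // Attribute name -> elements on which it may be indexed.
    // Both the map keys and the sets compare case-insensitively.
    private static final Map<String, Set<String>> LEGAL =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    static {
        // Element lists as quoted in the thread (assumption: not re-verified).
        LEGAL.put("longdesc", names("img"));
        LEGAL.put("alt", names("img", "area", "applet", "input"));
        // An empty set means "no element restriction" in this sketch.
        LEGAL.put("title", names());
    }

    private IndexableAttributes() {}

    private static Set<String> names(String... values) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(Arrays.asList(values));
        return Collections.unmodifiableSet(set);
    }

    /** Returns true if the attribute's value should be indexed for this element. */
    static boolean isIndexable(String element, String attribute) {
        if (element == null || attribute == null) {
            return false; // the comparator does not accept null keys
        }
        Set<String> elements = LEGAL.get(attribute);
        if (elements == null) {
            return false; // not one of alt, longdesc, title
        }
        return elements.isEmpty() || elements.contains(element);
    }
}
```

With this sketch, a parser walking the DOM would call IndexableAttributes.isIndexable(node.getNodeName(), attr.getNodeName()) before appending an attribute value to the extracted text, so that ALT on an IMG is kept while ALT on a DIV is ignored.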
