Bugs item #993099, was opened at 2004-07-17 21:19
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=993099&group_id=59548
Category: plugin: parse-html
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jungshik Shin (jshin)
Assigned to: Nobody/Anonymous (nobody)
Summary: The content of alt, longdesc, title attribute aren't stored

Initial Comment:
The 'alt', 'title' and 'longdesc' attributes can contain valuable information, but their values are not currently stored by Nutch. These attributes describe the content of image files, audio files, video files, and other non-textual data for anyone who can't view, listen to, or watch them. That includes not just the blind, the deaf, and text browser users but also crawlers like Nutch. Note that 'alt' is mandatory for the 'img' tag in HTML 4.x and XHTML 1.x because it's an important accessibility feature (see http://www.w3.org/WAI).

When web accessibility is promoted, it's almost always mentioned that accessibility is not only for the disabled but also for machine agents (crawlers, search engines, etc.). That is, 'alt', 'longdesc', and 'title' are there for Nutch to take advantage of, and Nutch should do so.

The other day, while 'browsing' web pages Nutch had collected (both raw HTML files and text summary files), I came across a page with two dozen JPEG photos that had detailed descriptions in longdesc and title. Nutch has none of those descriptions in its text summary.

Attached is my patch, which uses a regex. I'm not sure whether a regex is too slow for this; if it is, I'll make it use a 'Set' instead.
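For readers without the attachment, the idea in the report can be sketched roughly as follows. This is not the attached patch and does not use any Nutch API: the class name DescriptiveAttributes and the method extract are hypothetical, and the sketch assumes attribute values are quoted. It uses a regex to find name="value" pairs and a Set to decide which names to keep, so it illustrates both options mentioned in the comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch or of the patch.
class DescriptiveAttributes {

    // Attribute names whose values the report asks to keep.
    private static final Set<String> WANTED =
        new HashSet<>(Arrays.asList("alt", "title", "longdesc"));

    // Matches name="value" or name='value'.
    // Group 1 is the attribute name, group 3 is the value without quotes.
    private static final Pattern ATTR = Pattern.compile(
        "([A-Za-z][A-Za-z0-9-]*)\\s*=\\s*([\"'])(.*?)\\2", Pattern.DOTALL);

    // Returns the non-empty values of the wanted attributes, in document order.
    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            String value = m.group(3).trim();
            if (WANTED.contains(name) && !value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String html =
            "<img src=\"a.jpg\" ALT=\"Sunset over the bay\" longdesc='sunset.html'>"
            + "<a href=\"x.html\" title=\"Trip notes\">notes</a>";
        System.out.println(extract(html));
    }
}
```

A sketch like this also matches attribute-looking text outside of tags and ignores unquoted values, so walking the parsed document tree and checking each attribute name against the Set would likely be more robust than scanning raw HTML.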
