Bugs item #993099, was opened at 2004-07-17 18:19
Message generated for change (Comment added) made by cutting
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=993099&group_id=59548

Category: plugin: parse-html
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jungshik Shin (jshin)
Assigned to: Nobody/Anonymous (nobody)
Summary: The content of alt, longdesc, title attribute aren't stored

Initial Comment:
'alt', 'title' and 'longdesc' can contain valuable information, but their values are not currently stored by Nutch. Those attributes describe the content of image files, audio files, video files, and other non-textual data for those who can't view, listen to, or watch them. That group includes not just blind and deaf users and users of text browsers, but also crawlers like Nutch. Note that 'alt' is mandatory for the 'img' tag in HTML 4.x and XHTML 1.x because it's an important accessibility feature (see http://www.w3.org/WAI).

When web accessibility is promoted, it's almost always mentioned that accessibility is not only for the disabled but also for machine agents (crawlers, search engines, etc.). That is, 'alt', 'longdesc', and 'title' are there for Nutch to take advantage of, and Nutch should do that.

The other day, while 'browsing' web pages Nutch had collected (both raw HTML files and text summary files), I hit upon a page with two dozen JPEG photos that had detailed descriptions in longdesc and title. None of those descriptions appear in Nutch's text summary.

Attached is my patch, which uses a regex. I'm not sure whether a regex is too slow for this; if it is, I'll make it use a 'Set'.

----------------------------------------------------------------------

>Comment By: Doug Cutting (cutting)
Date: 2004-07-19 11:49

Message:
Logged In: YES
user_id=21778

I agree that we should fix this, but I'm not sure this patch is quite ready.

First, it is against an old version of the code, not the latest CVS, I think.

Second, I think that, instead of a regex, a Set of names to check would be much faster. And shouldn't this check be case-insensitive? If so, then the set should use String.CASE_INSENSITIVE_ORDER.

Third, should we only index these attributes where they're legal? According to http://www.w3.org/TR/html4/index/attributes.html, LONGDESC is only allowed in IMG, ALT is only allowed in IMG, AREA, APPLET, and INPUT, and TITLE is allowed almost everywhere. If we allow them everywhere we might be tempting spammers...
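
The second and third points above could be combined into a single lookup. What follows is only a minimal sketch, not Nutch code: the class `DescriptiveAttributes` and the method `shouldIndex` are hypothetical names, the element lists are taken from the comment above, TITLE is simplified to "allowed on any element", and the generics syntax assumes a later Java release than the one in use when this was written.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper (not part of Nutch): decides whether the value of a
// descriptive attribute should be indexed for a given element.
final class DescriptiveAttributes {

    // Sentinel meaning "allowed on any element". This is a simplification
    // for TITLE, which the comment describes as allowed "almost everywhere".
    private static final Set<String> ANY_ELEMENT = Collections.emptySet();

    // Attribute name -> elements on which it is indexed. The comparator
    // makes attribute lookups case-insensitive.
    private static final Map<String, Set<String>> ALLOWED =
        new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    static {
        // Element lists as stated in the comment above.
        ALLOWED.put("longdesc", elements("img"));
        ALLOWED.put("alt", elements("img", "area", "applet", "input"));
        ALLOWED.put("title", ANY_ELEMENT);
    }

    // Builds a case-insensitive set of element names.
    private static Set<String> elements(String... names) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(Arrays.asList(names));
        return set;
    }

    // Returns true if the attribute is one of the descriptive attributes
    // and appears on an element where the table above allows it.
    static boolean shouldIndex(String element, String attribute) {
        if (element == null || attribute == null) {
            return false;
        }
        Set<String> allowedOn = ALLOWED.get(attribute);
        if (allowedOn == null) {
            return false; // not a descriptive attribute
        }
        return allowedOn == ANY_ELEMENT || allowedOn.contains(element);
    }
}
```

With this table, `DescriptiveAttributes.shouldIndex("IMG", "ALT")` is true while `DescriptiveAttributes.shouldIndex("div", "alt")` is false, so an alt value placed on an arbitrary element would be skipped rather than indexed.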
