Bugs item #993099, was opened at 2004-07-17 21:19
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=993099&group_id=59548

Category: plugin: parse-html
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jungshik Shin (jshin)
Assigned to: Nobody/Anonymous (nobody)
Summary: The contents of the alt, longdesc, and title attributes aren't stored

Initial Comment:
The 'alt', 'title' and 'longdesc' attributes can contain
valuable information, but Nutch does not currently store
their values.
These attributes describe the content of image files,
audio files, video files, and other non-textual data for
anyone who can't view, listen to, or watch them. That
includes not just blind or deaf users and text-browser
users but also crawlers like Nutch.
Note that 'alt' is mandatory for the 'img' tag in HTML 4.x
and XHTML 1.x because it's an important accessibility
feature (see http://www.w3.org/WAI).

When web accessibility is promoted, it's almost always
pointed out that accessibility benefits not only people
with disabilities but also machine agents (crawlers,
search engines, etc). That is, 'alt', 'longdesc' and
'title' are there for Nutch to take advantage of, and
Nutch should do so.

The other day, while 'browsing' web pages Nutch had
collected (both raw HTML files and text summary files),
I hit upon a page with two dozen JPEG photos that had
detailed descriptions in longdesc and title.
None of those descriptions appear in Nutch's text summary.


Attached is my patch, which uses a regex to match the
attribute names. If regex turns out to be too slow for
this, I'll change it to use a 'Set' instead.
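
To make the regex-versus-Set choice concrete, here is a minimal sketch of
both ways of matching attribute names. It is not the attached patch: the
class and method names are hypothetical, and it walks a DOM built by the
JDK's own XML parser (so the input must be well-formed) rather than going
through the parse-html plugin.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Nutch.
final class DescriptiveAttributes {
    // Option 1: a precompiled, case-insensitive regex over the attribute name.
    private static final Pattern BY_REGEX =
        Pattern.compile("alt|title|longdesc", Pattern.CASE_INSENSITIVE);

    // Option 2: a set lookup on the lower-cased attribute name.
    private static final Set<String> BY_SET = Set.of("alt", "title", "longdesc");

    static boolean matchesRegex(String name) {
        // matches() requires the whole name to match, so "altitude" is rejected.
        return BY_REGEX.matcher(name).matches();
    }

    static boolean matchesSet(String name) {
        return BY_SET.contains(name.toLowerCase(Locale.ROOT));
    }

    // Walks the tree in document order and appends every non-empty
    // alt/title/longdesc value to 'out'.
    static void collect(Node node, List<String> out) {
        NamedNodeMap attrs = node.getAttributes(); // null for non-element nodes
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                String value = attr.getNodeValue().trim();
                if (matchesSet(attr.getNodeName()) && !value.isEmpty()) {
                    out.add(value);
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    // Convenience wrapper for trying the sketch on a well-formed snippet.
    static List<String> fromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }
}
```

Both matchers accept the same three names, so swapping one for the other
inside collect() would not change what gets stored; only the per-attribute
cost differs, and that would need measuring on real pages.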


----------------------------------------------------------------------
