Selective/Configurable HTML Parsing?
Hi, I need some help with understanding how the HTML parser works in Nutch. I have to write a plugin that, while crawling, will help me identify certain pre-specified words/phrases in the text. For example, I might want to index pages in a specific way in case they have the name Jimi Hendrix occurring on them. In such a case, how do I write an extension that allows me to check for the occurrence of a certain word on a page? In other words, where do I start? I have read the HTML parser code in the Nutch source files and could understand it to an extent. Is there a text library/dictionary that Nutch uses while it parses the page content? I read the documentation on the Neko parser, but am still not able to understand it completely. - Sagar
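For what it's worth, the phrase check itself is the easy part. The sketch below shows roughly the kind of matching a parse-filter plugin could run over the text the HTML parser has already extracted; the class and method names are made up for illustration and are not part of the Nutch API. In Nutch the natural home for this logic would be an HtmlParseFilter extension, with the result written into the parse metadata so an indexing filter can pick it up later.

{code}
import java.util.Arrays;
import java.util.List;

/** Illustration only: the core "does this page mention one of my phrases?"
 *  check that a parse-filter plugin could run over the extracted page text.
 *  Class and method names are hypothetical, not part of the Nutch API. */
public class PhraseMatcher {

  private final List<String> phrases;

  public PhraseMatcher(List<String> phrases) {
    this.phrases = phrases;
  }

  /** Returns the first pre-specified phrase found in the page text, or null. */
  public String firstMatch(String pageText) {
    String haystack = pageText.toLowerCase();
    for (String phrase : phrases) {
      if (haystack.contains(phrase.toLowerCase())) {
        return phrase;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    PhraseMatcher m = new PhraseMatcher(Arrays.asList("Jimi Hendrix", "Stratocaster"));
    System.out.println(m.firstMatch("A page about Jimi Hendrix and his guitars"));
  }
}
{code}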
[jira] Commented: (NUTCH-488) Avoid parsing unnecessary links and get a more relevant outlink list
[ https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535014 ]

Dennis Kubes commented on NUTCH-488:

Tested. Working well. +1

> Avoid parsing unnecessary links and get a more relevant outlink list
> Key: NUTCH-488
> URL: https://issues.apache.org/jira/browse/NUTCH-488
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Environment: Windows, Java 1.5
> Reporter: Emmanuel Joke
> Attachments: DOMContentUtils.patch, ignore_tags_v2.patch, ignore_tags_v3.patch, nutch-default.xml.patch
>
> The NekoHTML parser uses a method to extract all outlinks from an HTML page. It extracts them from the HTML content based on the list of parameters defined in setConf(). This list of links is then truncated to the maximum number of outlinks we'll process for a page, defined in nutch-default.xml (db.max.outlinks.per.page = 100 by default), and finally it goes through all of the configured URL filters.
> Unfortunately the list of outlinks can contain more than 100 entries, in which case it is truncated and some relevant links may be removed.
> So I've added a few options to nutch-default.xml to enable/disable the extraction of links from specific HTML tags in this parser (SCRIPT, IMG, FORM, LINK).
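To make the improvement concrete: the point is that fewer irrelevant links are extracted in the first place, so the db.max.outlinks.per.page cap of 100 is less likely to push out real navigation links. The sketch below shows one way an outlink walk could skip configured tags. It is an illustration only, not the attached DOMContentUtils.patch, and the configuration wiring is deliberately omitted.

{code}
import java.util.HashSet;
import java.util.Set;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Illustration only, not the attached DOMContentUtils.patch: collect
 *  outlink candidates from a DOM tree while skipping the link attributes
 *  of tags the configuration asks the parser to ignore. */
public class OutlinkSketch {

  // In Nutch this set would come from configuration in setConf();
  // here it is passed in directly to keep the sketch self-contained.
  private final Set<String> ignoredTags = new HashSet<String>();

  public OutlinkSketch(String[] tagsToIgnore) {
    for (String tag : tagsToIgnore) {
      ignoredTags.add(tag.toLowerCase());
    }
  }

  /** Walk the DOM and print href/src values from tags that are not ignored. */
  public void collect(Node node) {
    String name = node.getNodeName().toLowerCase();
    if (!ignoredTags.contains(name) && node.getAttributes() != null) {
      Node href = node.getAttributes().getNamedItem("href");
      Node src = node.getAttributes().getNamedItem("src");
      if (href != null) System.out.println(href.getNodeValue());
      if (src != null) System.out.println(src.getNodeValue());
    }
    // Always recurse: an ignored FORM, for instance, may still contain
    // ordinary <a href="..."> links that we do want to keep.
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i));
    }
  }
}
{code}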
Anyone looked for a better HTML parser?
I've spent quite a bit of time working with both Neko and Tagsoup, and they both have some fairly serious bugs: Neko has some occasional hangs, and it doesn't deal very well with a fair amount of "bad" HTML that displays just fine in a browser.

Tagsoup is better in terms of handling "bad" HTML, but it has a pretty serious bug in that HTML character entities are expanded in inappropriate places, e.g. inside of hrefs, so that a dynamic URL of the form http://www.foo.com/bar?x=1&sub=5 has problems: the &sub is interpreted as an HTML character entity, and an invalid href is created. John Cowan, the author of Tagsoup, more or less said "yeah, I know, everybody mentions that, but that's done at such a low level in the code it's not likely to get fixed any time soon". (See a discussion of this and other issues at http://tech.groups.yahoo.com/group/tagsoup-friends/message/838).

The tagsoup bug affects some 3-4% of the sites in my index, so I consider it fatal, and I *know* Neko misses some text, sometimes entire documents, because it can't deal with pathological HTML. Has anyone (a) got local fixes for any of these problems, or (b) found a superior Java HTML parser out there?

Doug

-- View this message in context: http://www.nabble.com/Anyone-looked-for-a-better-HTML-parser--tf4630266.html#a13221500 Sent from the Nutch - Dev mailing list archive at Nabble.com.
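The failure mode is easy to reproduce outside of any real parser. The snippet below is a stand-alone illustration of a lenient entity decoder that, like the behaviour Doug describes, accepts entity names without a trailing semicolon; it is not Tagsoup's actual code.

{code}
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-alone illustration of the entity-expansion problem described above.
 *  This is NOT Tagsoup's code; it just mimics a decoder that expands known
 *  entity names even when no ';' terminator is present. */
public class EntityBugDemo {

  private static final Map<String, String> ENTITIES = new LinkedHashMap<String, String>();
  static {
    ENTITIES.put("&amp;", "&");
    ENTITIES.put("&sub;", "\u2282");  // subset-of character
    ENTITIES.put("&sub", "\u2282");   // lenient: also matches without the semicolon
  }

  static String lenientDecode(String s) {
    for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
      s = s.replace(e.getKey(), e.getValue());
    }
    return s;
  }

  public static void main(String[] args) {
    String href = "http://www.foo.com/bar?x=1&sub=5";
    // "&sub" is swallowed and replaced by the subset-of character,
    // so the printed href is no longer a valid URL.
    System.out.println(lenientDecode(href));
  }
}
{code}

A strictly well-formed page would have escaped the ampersand as &amp;sub=5, which is exactly why lenient parsers try to guess at bare ampersands; inside query strings the guess is destructive.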
[jira] Updated: (NUTCH-488) Avoid parsing unnecessary links and get a more relevant outlink list
[ https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-488:

Attachment: ignore_tags_v3.patch

OK, yet another approach based on Doğacan's comments. Sorry for the delay, but I didn't notice the comment earlier.

- I hadn't noticed the conf.getStrings() method. Thanks for the hint :)
- I did make it backward compatible with the "use_action" param, but it works a bit differently now if there is no value set. The default now is that forms should be used, but they can be dropped with the ignore_tags setting if use_action is not specified. If someone has use_action explicitly set to true, it cannot be overridden by ignore_tags. It is still a bit inconsistent, but it is understandable that the specific setting (use_action) takes precedence. If the default were "false", then with no "use_action" defined and "form" not added to ignore_tags, one could expect the form to be taken, but it wouldn't be. Keeping the backward compatibility makes the code a bit clumsy :( ... and I think I've made it overly flexible, but that was the cleanest solution here.
- As for the repeated if statements: I agree, they are error-prone, but on the other hand they are easy to understand. I didn't quite understand Doğacan's proposal :( but I think I did something acceptable: simply remove all specified tags from the link params.
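The precedence rule described above (an explicit use_action=true always wins; otherwise forms are kept unless listed in ignore_tags) could be expressed roughly like the sketch below. This is not the attached ignore_tags_v3.patch, and the actual configuration keys in nutch-default.xml may be named differently; the values are passed in directly here to keep the sketch self-contained.

{code}
import java.util.HashSet;
import java.util.Set;

/** Sketch of the precedence rule discussed above; not the actual patch.
 *  Configuration reading is replaced by plain parameters. */
public class IgnoreTagsConfigSketch {

  private Set<String> ignoredTags;
  private boolean useForms;

  public void configure(String[] ignoreTags, Boolean useActionExplicit) {
    ignoredTags = new HashSet<String>();
    for (String tag : ignoreTags) {
      ignoredTags.add(tag.toLowerCase());
    }
    if (useActionExplicit != null && useActionExplicit.booleanValue()) {
      // An explicit use_action=true always wins: forms are used
      // even if "form" was listed in ignore_tags.
      useForms = true;
      ignoredTags.remove("form");
    } else {
      // Default: forms are used unless "form" appears in ignore_tags.
      useForms = !ignoredTags.contains("form");
    }
  }

  public static void main(String[] args) {
    IgnoreTagsConfigSketch c = new IgnoreTagsConfigSketch();
    c.configure(new String[] {"script", "img", "form"}, Boolean.TRUE);
    System.out.println(c.useForms);     // true: the explicit setting wins
    System.out.println(c.ignoredTags);  // "form" removed; script and img remain
  }
}
{code}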
[jira] Commented: (NUTCH-442) Integrate Solr/Nutch
[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534869 ]

Enis Soztutar commented on NUTCH-442:

Using Nutch with Solr has been a frequently requested feature, so it will be very useful when this makes it into trunk. I have spent some time reviewing the patch, which I find quite elegant. Some improvements to the patch would be:

- make NutchDocument implement VersionedWritable instead of Writable, and delegate version checking to the superclass
- refactor the getDetails() methods in HitDetailer into Searcher (it is not likely that a class would implement Searcher but not HitDetailer)
- use Searcher, delete HitDetailer and SearchBean
- rename the XXXBean classes so that they do not include "bean" (I think it is confusing to have bean objects with non-trivial functionality)
- refactor LuceneSearchBean.VERSION to RPCSearchBean
- remove unrelated changes from the patch (the changes in NGramProfile, HTMLLanguageParser, LanguageIdentifier, ... correct me if I'm wrong)

As far as I can see, we do not need any metadata for the Solr backend, and only need the Store, Index and Vector options for the Lucene backend, so I think we can simplify NutchDocument#metadata. We may implement:

{code}
class FieldMeta {
  o.a.l.document.Field.Store store;
  o.a.l.document.Field.Index index;
  o.a.l.document.Field.TermVector tv;
}

FieldMeta[] IndexingFilter.getFields();

class NutchDocument {
  ...
  private ArrayList fieldMeta;
  ...
}
{code}

Or alternatively we may wish to keep the add methods of NutchDocument compatible with o.a.l.document.Document, keeping the metadata up-to-date as we add new fields, using this info in LuceneWriter but ignoring it in SolrWriter. This will be slightly slower but the API will be much more intuitive.

> Integrate Solr/Nutch
> Key: NUTCH-442
> URL: https://issues.apache.org/jira/browse/NUTCH-442
> Project: Nutch
> Issue Type: New Feature
> Environment: Ubuntu linux
> Reporter: rubdabadub
> Attachments: NUTCH_442_v3.patch, RFC_multiple_search_backends.patch, schema.xml
>
> Hi:
> After trying out Sami's patch regarding Solr/Nutch (found at http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html), I can confirm it worked :-) That led me to the following request: I would be very, very grateful if this could be included in Nutch 0.9, as I am trying to eliminate my Python-based crawler which posts documents to Solr. As I am in a corporate environment I can't install a trunk version in the production environment, so I am asking for this to be included in the 0.9 release. I hope my wish will be granted.
> I look forward to getting some feedback. Thank you.
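A rough sketch of the alternative described in the last paragraph, where a NutchDocument-style add() mirrors o.a.l.document.Document and records the field metadata as fields are added, might look like the code below. The class shape and names are illustrative only and are not part of the NUTCH-442 patch.

{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Field;

/** Sketch only, not the NUTCH-442 patch: a NutchDocument-like class whose
 *  add() mirrors o.a.l.document.Document, recording per-field metadata that
 *  a LuceneWriter could honour and a SolrWriter could simply ignore. */
public class NutchDocumentSketch {

  public static class FieldMeta {
    final String name;
    final Field.Store store;
    final Field.Index index;
    final Field.TermVector tv;

    FieldMeta(String name, Field.Store store, Field.Index index, Field.TermVector tv) {
      this.name = name;
      this.store = store;
      this.index = index;
      this.tv = tv;
    }
  }

  private final List<String> values = new ArrayList<String>();
  private final List<FieldMeta> fieldMeta = new ArrayList<FieldMeta>();

  /** Modelled on the way fields are added to a Lucene Document. */
  public void add(String name, String value,
                  Field.Store store, Field.Index index, Field.TermVector tv) {
    values.add(value);
    fieldMeta.add(new FieldMeta(name, store, index, tv));
  }

  /** LuceneWriter would consult the recorded metadata; SolrWriter would not. */
  public List<FieldMeta> getFieldMeta() {
    return fieldMeta;
  }
}
{code}

Recording the metadata as a side effect of add() keeps the calling code identical for both backends, which is the trade-off Enis describes: slightly more bookkeeping per field in exchange for a more intuitive API.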