Selective/Configurable HTML Parsing?

2007-10-15 Thread Sagar Vibhute
Hi,

I need some help with understanding how the HTML parser works in Nutch. I
have to write a plugin that, while crawling, will help me identify certain
pre-specified words/phrases in the text.

e.g.: I might want to index pages in a particular way in case they have the
name Jimi Hendrix occurring on them.

In such a case, how do I write an extension that allows me to check for the
occurrence of a certain word on the page? Meaning, where do I start? I have
read the HTML parser code in the Nutch source files and could understand it
to an extent. Is there a text library/dictionary that Nutch uses while it
parses the page content? I read the documentation on the NekoHTML parser, but
am still not able to understand it completely.
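
Roughly, what I am after is something like the sketch below, assuming the
HtmlParseFilter extension point is the right place to hook in (the class name,
property name, and metadata key are all made up, and the filter signature may
differ between Nutch versions):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class PhraseDetectorFilter implements HtmlParseFilter {

  private Configuration conf;
  private String phrase;   // e.g. "Jimi Hendrix"

  public Parse filter(Content content, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    // The HTML parser has already extracted the plain text of the page.
    String text = parse.getText();
    if (text != null && text.toLowerCase().contains(phrase.toLowerCase())) {
      // Record the hit so an indexing filter can act on it later.
      parse.getData().getParseMeta().set("phrase.match", phrase);
    }
    return parse;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // "phrase.detector.phrase" is a made-up property name.
    this.phrase = conf.get("phrase.detector.phrase", "Jimi Hendrix");
  }

  public Configuration getConf() {
    return conf;
  }
}
{code}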

- Sagar


[jira] Commented: (NUTCH-488) Avoid parsing unnecessary links and get a more relevant outlink list

2007-10-15 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535014
 ] 

Dennis Kubes commented on NUTCH-488:


Tested. Works well. +1

> Avoid parsing unnecessary links and get a more relevant outlink list
> 
>
> Key: NUTCH-488
> URL: https://issues.apache.org/jira/browse/NUTCH-488
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.9.0
> Environment: Windows, Java 1.5
>Reporter: Emmanuel Joke
> Attachments: DOMContentUtils.patch, ignore_tags_v2.patch, 
> ignore_tags_v3.patch, nutch-default.xml.patch
>
>
> The NekoHTML parser uses a method to extract all outlinks from the HTML page. 
> It extracts them from the HTML content based on the list of parameters defined 
> in the setConf() method. This list of links is then truncated to the maximum 
> number of outlinks that we'll process for a page, defined in nutch-default.xml 
> (db.max.outlinks.per.page = 100 by default), and finally it goes through all 
> the URL filters defined.
> Unfortunately, the list of outlinks can contain more than 100 entries, in 
> which case the list will be truncated and some relevant links could be removed.
> So I've added a few options in nutch-default.xml to enable/disable the 
> extraction of links from specific HTML tags in this parser (SCRIPT, IMG, 
> FORM, LINK).
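> A hypothetical shape for such an option (the exact property name and default 
> in the attached patches may differ):
> {code}
> <property>
>   <name>parser.html.outlinks.ignore_tags</name>
>   <value>script,img,form,link</value>
>   <description>Comma-separated list of HTML tags whose links are skipped
>   during outlink extraction, so the db.max.outlinks.per.page limit is
>   spent on more relevant anchors.</description>
> </property>
> {code}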




Anyone looked for a better HTML parser?

2007-10-15 Thread Doug Cook


I've spent quite a bit of time working with both Neko and Tagsoup, and they
both have some fairly serious bugs:

Neko has some occasional hangs, and it doesn't deal very well with a fair
amount of "bad" HTML that displays just fine in a browser. 

Tagsoup is better in terms of handling "bad" HTML, but it has a pretty
serious bug in that HTML character entities are expanded in inappropriate
places, e.g. inside of hrefs, so that a dynamic URL of the form
http://www.foo.com/bar?x=1&sub=5 has problems: the &sub is interpreted as an
HTML character entity, and an invalid href is created.  John Cowan, the
author of Tagsoup, more or less said "yeah, I know, everybody mentions that,
but that's done at such a low level in the code it's not likely to get fixed
any time soon". (See a discussion of this and other issues at
http://tech.groups.yahoo.com/group/tagsoup-friends/message/838). 
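
A minimal reproduction of the entity problem, assuming TagSoup's SAX
interface (org.ccil.cowan.tagsoup.Parser):

{code}
import java.io.StringReader;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupEntityDemo {
  public static void main(String[] args) throws Exception {
    String html = "<a href=\"http://www.foo.com/bar?x=1&sub=5\">link</a>";
    Parser parser = new Parser();  // TagSoup's SAX-style parser
    parser.setContentHandler(new DefaultHandler() {
      public void startElement(String uri, String local, String name,
                               Attributes atts) {
        if ("a".equals(local)) {
          // Expected: http://www.foo.com/bar?x=1&sub=5
          // Observed: "&sub" is expanded to the subset character (U+2282)
          // even without a trailing semicolon, corrupting the href.
          System.out.println(atts.getValue("href"));
        }
      }
    });
    parser.parse(new InputSource(new StringReader(html)));
  }
}
{code}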

The Tagsoup bug affects some 3-4% of the sites in my index, so I consider it
fatal, and I *know* Neko misses some text, sometimes entire documents,
because it can't deal with pathological HTML.

Has anyone (a) got local fixes for any of these problems, or (b) found a
superior Java HTML parser out there?

Doug



[jira] Updated: (NUTCH-488) Avoid parsing unnecessary links and get a more relevant outlink list

2007-10-15 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-488:
-

Attachment: ignore_tags_v3.patch

OK, yet another approach based on Doğacan's comments. Sorry for the delay, but 
I didn't notice the comment earlier.

- I hadn't noticed the conf.getStrings() method. Thanks for the hint :)
- I kept the backward compatibility with the "use_action" param, but it works 
a bit differently now when no value is set. The default now is to use forms, 
but form extraction can be dropped via the ignore_tags setting if use_action 
is not specified. If someone has use_action set to true explicitly, then it 
cannot be overridden by ignore_tags. It is still a bit inconsistent, but it is 
understandable that the specific setting (use_action) takes precedence. If the 
default were "false", then with "use_action" undefined and form not added to 
ignore_tags, one could expect the form to be taken, but it wouldn't be. 
Keeping the backward compatibility makes the code a bit clumsy :( ... and I 
think I've made it overly flexible, but that was the cleanest solution here.
- Regarding the repeated ifs: I agree they are error-prone, but on the other 
hand they are easy to understand. I didn't quite understand Doğacan's proposal 
:( but I think I did something acceptable: simply remove all specified tags 
from the link params. 
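
For clarity, the configuration handling in setConf() is roughly shaped like
the sketch below, using conf.getStrings(); the class and property names here
are illustrative, not copied from the patch:

{code}
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;

public class OutlinkTagConfigSketch {
  private Configuration conf;
  private Set<String> ignoreTags;

  public void setConf(Configuration conf) {
    this.conf = conf;
    // getStrings() splits a comma-separated property value into an array.
    String[] ignored = conf.getStrings("parser.html.outlinks.ignore_tags");
    ignoreTags = new HashSet<String>();
    if (ignored != null) {
      for (String tag : ignored) {
        ignoreTags.add(tag.trim().toLowerCase());
      }
    }
    // An explicit use_action=true wins over ignore_tags for FORM links.
    if (conf.getBoolean("parser.html.form.use_action", false)) {
      ignoreTags.remove("form");
    }
  }
}
{code}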




> Avoid parsing unnecessary links and get a more relevant outlink list
> 
>
> Key: NUTCH-488
> URL: https://issues.apache.org/jira/browse/NUTCH-488
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.9.0
> Environment: Windows, Java 1.5
>Reporter: Emmanuel Joke
> Attachments: DOMContentUtils.patch, ignore_tags_v2.patch, 
> ignore_tags_v3.patch, nutch-default.xml.patch
>
>
> The NekoHTML parser uses a method to extract all outlinks from the HTML page. 
> It extracts them from the HTML content based on the list of parameters defined 
> in the setConf() method. This list of links is then truncated to the maximum 
> number of outlinks that we'll process for a page, defined in nutch-default.xml 
> (db.max.outlinks.per.page = 100 by default), and finally it goes through all 
> the URL filters defined.
> Unfortunately, the list of outlinks can contain more than 100 entries, in 
> which case the list will be truncated and some relevant links could be removed.
> So I've added a few options in nutch-default.xml to enable/disable the 
> extraction of links from specific HTML tags in this parser (SCRIPT, IMG, 
> FORM, LINK).




[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

2007-10-15 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534869
 ] 

Enis Soztutar commented on NUTCH-442:
-

Using Nutch with Solr has been a frequently requested feature, so it will be 
very useful when this makes it into trunk. I have spent some time reviewing 
the patch, which I find quite elegant. 

Some improvements to the patch would be:
- make NutchDocument extend VersionedWritable instead of implementing 
Writable, and delegate version checking to the superclass (see the sketch 
after this list)
- refactor the getDetails() methods in HitDetailer into Searcher (it is not 
likely that a class would implement Searcher but not HitDetailer)
- use Searcher, and delete HitDetailer and SearchBean
- rename the XXXBean classes so that they do not include "Bean" (I think it is 
confusing to have bean objects with non-trivial functionality)
- refactor LuceneSearchBean.VERSION to RPCSearchBean
- remove unrelated changes from the patch (the changes in NGramProfile, 
HTMLLanguageParser, LanguageIdentifier, ...; correct me if I'm wrong)
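
A minimal sketch of the VersionedWritable idea, assuming Hadoop's
org.apache.hadoop.io.VersionedWritable; field serialization is elided:

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.VersionedWritable;

public class NutchDocument extends VersionedWritable {
  private static final byte VERSION = 1;

  public byte getVersion() {
    return VERSION;   // verified automatically when readFields() runs
  }

  public void write(DataOutput out) throws IOException {
    super.write(out);       // writes the version byte
    // ... serialize fields ...
  }

  public void readFields(DataInput in) throws IOException {
    super.readFields(in);   // reads and verifies the version byte
    // ... deserialize fields ...
  }
}
{code}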

As far as I can see, we do not need any metadata for the Solr backend, and we 
only need the Store, Index, and TermVector options for the Lucene backend, so 
I think we can simplify NutchDocument#metadata. We may implement:
{code}
// Per-field index-time options (Lucene-specific).
class FieldMeta {
  o.a.l.document.Field.Store store;
  o.a.l.document.Field.Index index;
  o.a.l.document.Field.TermVector tv;
}

// Each IndexingFilter declares up front which fields it will add.
FieldMeta[] IndexingFilter.getFields();

class NutchDocument {
  ...
  private ArrayList<FieldMeta> fieldMeta;
  ...
}
{code}

Alternatively, we may wish to keep the add methods of NutchDocument compatible 
with o.a.l.document.Document, keeping the metadata up to date as we add new 
fields, using this info in LuceneWriter but ignoring it in SolrWriter. This 
will be slightly slower, but the API will be much more intuitive. 
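
A sketch of this alternative, with illustrative names (not from the patch):

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Field;

public class NutchDocumentSketch {

  // Per-field metadata; meaningful for the Lucene backend only.
  static class FieldMeta {
    final String name;
    final Field.Store store;
    final Field.Index index;
    final Field.TermVector tv;

    FieldMeta(String name, Field.Store store,
              Field.Index index, Field.TermVector tv) {
      this.name = name;
      this.store = store;
      this.index = index;
      this.tv = tv;
    }
  }

  private final List<String> values = new ArrayList<String>();
  private final List<FieldMeta> meta = new ArrayList<FieldMeta>();

  // Signature stays close to o.a.l.document.Document#add(), so existing
  // indexing filters need no changes; metadata is recorded as a side
  // effect of each add.
  public void add(String name, String value, Field.Store store,
                  Field.Index index, Field.TermVector tv) {
    values.add(value);
    meta.add(new FieldMeta(name, store, index, tv));
  }

  // LuceneWriter would consume this; SolrWriter would ignore it.
  public List<FieldMeta> getFieldMeta() {
    return meta;
  }
}
{code}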

> Integrate Solr/Nutch
> 
>
> Key: NUTCH-442
> URL: https://issues.apache.org/jira/browse/NUTCH-442
> Project: Nutch
>  Issue Type: New Feature
> Environment: Ubuntu linux
>Reporter: rubdabadub
> Attachments: NUTCH_442_v3.patch, RFC_multiple_search_backends.patch, 
> schema.xml
>
>
> Hi:
> After trying out Sami's patch regarding Solr/Nutch, which can be found here 
> (http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html), 
> I can confirm it worked :-) And that led me to request the following:
> I would be very grateful if this could be included in Nutch 0.9, as I am 
> trying to eliminate my Python-based crawler which posts documents to Solr. 
> As I am in a corporate environment, I can't install the trunk version in 
> production; thus I am asking for this to be included in the 0.9 release. I 
> hope my wish will be granted.
> I look forward to getting some feedback.
> Thank you.
