[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495350
 ] 

Doğacan Güney commented on NUTCH-485:
-

You probably should not add "put(String/Text key, Parse parse)" methods to 
ParseResult. ParseResult intentionally has no method for adding a Parse object 
directly, so that it can check whether the parse object comes from a real url 
or a sub-url. 
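
To illustrate the point, here is a minimal sketch, not the actual Nutch class. 
It assumes a container that only accepts the raw pieces of a parse together 
with their url, so the container itself decides whether an entry belongs to the 
originally fetched url or to a sub-url. All names (SketchParse, 
ParseResultSketch, fromOriginalUrl) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a parse; not Nutch's Parse interface. */
final class SketchParse {
  final String text;
  final boolean fromOriginalUrl;

  SketchParse(String text, boolean fromOriginalUrl) {
    this.text = text;
    this.fromOriginalUrl = fromOriginalUrl;
  }
}

/** Hypothetical container; the real ParseResult API may differ. */
final class ParseResultSketch {
  private final String originalUrl;
  private final Map<String, SketchParse> parses = new LinkedHashMap<>();

  ParseResultSketch(String originalUrl) {
    this.originalUrl = originalUrl;
  }

  // Deliberately no put(String, SketchParse): callers hand over raw text,
  // and the container marks the entry as original url or sub-url itself.
  void put(String url, String text) {
    parses.put(url, new SketchParse(text, originalUrl.equals(url)));
  }

  SketchParse get(String url) {
    return parses.get(url);
  }
}
```

With a pre-built parse object passed in from outside, the container would have 
to trust whatever flag the caller had set.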

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> --
>
> Key: NUTCH-485
> URL: https://issues.apache.org/jira/browse/NUTCH-485
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Gal Nitzan
> Fix For: 1.0.0
>
> Attachments: NUTCH-485.200705122151.patch, 
> NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch
>
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to 
> add parse objects to the ParseResult object.
> A change to the HtmlParseFilter is needed which allows the filter to return 
> ParseResult, and of course a change to HtmlParseFilters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread Gal Nitzan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gal Nitzan updated NUTCH-485:
-

Attachment: NUTCH-485.200705131241.patch

Thanks Doğacan, I missed it :( 

Thanks to all reviewers.
 
Yet another patch...

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> --
>
> Key: NUTCH-485
> URL: https://issues.apache.org/jira/browse/NUTCH-485
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Gal Nitzan
> Fix For: 1.0.0
>
> Attachments: NUTCH-485.200705122151.patch, 
> NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, 
> NUTCH-485.200705131241.patch
>
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to 
> add parse objects to the ParseResult object.
> A change to the HtmlParseFilter is needed which allows the filter to return 
> ParseResult, and of course a change to HtmlParseFilters.




[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495357
 ] 

Doğacan Güney commented on NUTCH-443:
-

Well... That's embarrassing. It seems I forgot to include the necessary changes 
to Indexer. Indexer has to read crawl_parse too so that it can pick up sub-urls' 
fetch datums. 

So, that seemed easy (just a couple of lines) but then I realized that there is 
another bug. In my defense, it was present in Nutch before 443, so the bug was 
already there; I only made it worse :)

It is a bit difficult to describe, so please bear with me. The problem goes 
like this:

In fetcher, if max.redirect is 0, Nutch pushes an empty Content to content and 
a LINKED datum to crawl_fetch (let's call this url foo). ParseSegment parses 
the empty Content and creates a parse data and an empty parse text. After 
updatedb and one more generate-fetch-parse-updatedb cycle, we now have proper 
content, parse text and parse data for foo in the new segment.

Now, assume I index both of these segments together. Url foo will have two sets 
of (fetch datum, parse), one coming from the first segment, the other coming 
from the second segment. Since the first fetch datum is LINKED, this code in 
Indexer.reduce will cause foo to be discarded:

if (redir != null) {
  // XXX page was redirected - what should we do?
  // XXX discard it for now
  return;
}

And it doesn't work if we just remove this code. Remember that foo has two sets 
of (fetch datum, parse) and one of the parses contains an empty parse text. 
Since Indexer will randomly choose one of the parses in reduce, it is likely 
that we will get an empty parse text for url foo.
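
The two failure modes can be modelled roughly as follows. This is a simplified 
sketch, not the real Indexer.reduce: SegmentEntry and ReduceSketch are 
hypothetical names, a boolean stands in for the LINKED datum, and "last value 
wins" stands in for whatever order the values happen to arrive in.

```java
import java.util.List;

/** Hypothetical, simplified model of what reduce sees for one url. */
final class SegmentEntry {
  final boolean linked;    // stands in for a LINKED fetch datum
  final String parseText;  // "" models the empty parse text

  SegmentEntry(boolean linked, String parseText) {
    this.linked = linked;
    this.parseText = parseText;
  }
}

final class ReduceSketch {
  /** Mirrors the quoted check: any redirect marker discards the url. */
  static String reduceWithRedirectCheck(List<SegmentEntry> entries) {
    boolean redir = false;
    String text = null;
    for (SegmentEntry e : entries) {
      if (e.linked) {
        redir = true;
      }
      text = e.parseText; // later values overwrite earlier ones
    }
    return redir ? null : text; // null means "discarded"
  }

  /** Same loop without the check: whichever parse arrives last wins. */
  static String reduceWithoutCheck(List<SegmentEntry> entries) {
    String text = null;
    for (SegmentEntry e : entries) {
      text = e.parseText;
    }
    return text;
  }
}
```

In this model, foo is always dropped while the check is in place, and once the 
check is removed the result depends purely on the order of the values.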

This is the part that I made worse: since Indexer has to read crawl_parse, it 
will get a lot of STATUS_LINKED datums (which are written to crawl_parse as 
outlinks) and discard a lot of useful pages in any multi-segment index job.

Sorry if the description is unnecessarily complex.



> allow parsers to return multiple Parse object, this will speed up the rss 
> parser
> 
>
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Renaud Richardet
> Assigned To: Andrzej Bialecki 
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
> NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
> NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, 
> NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, 
> parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map. This way, the RSS parser 
> can return multiple parse objects, that will all be indexed separately. 
> Advantage: no need to fetch all feed-items separately.
> see the discussion at 
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html




[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-13 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-443:


Attachment: redirect_and_index.patch

Patch for the problem. 

Now, if Fetcher gets a null content, it filters the null content out instead of 
pushing an empty content. 

It may change the semantics very slightly, but I don't think that it will be a 
problem. Before this patch, Fetcher creates an empty content, then passes the 
score from datum to content. Parse then passes it from content to parse data so 
that it can distribute the score to outlinks. But empty pages don't have 
outlinks anyway and they should not be indexed (so an adjust datum has no 
purpose).
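
As a rough illustration of the behaviour change only (this is not the patch 
itself, and FetchOutputSketch and its methods are hypothetical names), the 
difference between substituting an empty content and filtering a null content 
could look like this:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the content output; not Nutch's Fetcher. */
final class FetchOutputSketch {
  final List<String> written = new ArrayList<>();

  /** Behaviour before the patch, as described: null becomes empty content. */
  void outputWithEmptyContent(String url, byte[] content) {
    byte[] c = (content == null) ? new byte[0] : content;
    written.add(url + ":" + c.length);
  }

  /** Behaviour after the patch, as described: null content is skipped. */
  void outputFilteringNull(String url, byte[] content) {
    if (content == null) {
      return; // nothing is written, so nothing empty gets parsed later
    }
    written.add(url + ":" + content.length);
  }
}
```

The sketch only models the content record; what happens to the fetch datum is 
left out.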

Sorry about missing this bug in the first place, but, man, this is a subtle one.


> allow parsers to return multiple Parse object, this will speed up the rss 
> parser
> 
>
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Renaud Richardet
> Assigned To: Andrzej Bialecki 
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
> NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
> NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, 
> NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, 
> parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff, 
> redirect_and_index.patch
>
>
> allow Parser#parse to return a Map. This way, the RSS parser 
> can return multiple parse objects, that will all be indexed separately. 
> Advantage: no need to fetch all feed-items separately.
> see the discussion at 
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html




[jira] Resolved: (NUTCH-484) Nutch Nightly API link is broken in site

2007-05-13 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-484.
--

Resolution: Fixed

committed and updated site, thanks Gal

> Nutch Nightly API link is broken in site
> 
>
> Key: NUTCH-484
> URL: https://issues.apache.org/jira/browse/NUTCH-484
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Gal Nitzan
>Priority: Trivial
> Fix For: 1.0.0
>
> Attachments: NUTCH-484.200705121200.patch
>
>
> The Nightly API link is broken




[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-05-13 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-444:


Attachment: NUTCH-444.patch
feed.tar.bz2

First version of feed plugin featuring a Parser and an IndexingFilter. You 
would need the latest patch from NUTCH-443 (redirect_and_index.patch) to test 
it.

> Possibly use a different library to parse RSS feed for improved performance 
> and compatibility
> -
>
> Key: NUTCH-444
> URL: https://issues.apache.org/jira/browse/NUTCH-444
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Renaud Richardet
> Assigned To: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: feed.tar.bz2, NUTCH-444.patch, parse-feed-v2.tar.bz2, 
> parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??




[jira] Reopened: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-13 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reopened NUTCH-443:
-

  Assignee: Chris A. Mattmann  (was: Andrzej Bialecki )

Per Doğacan's comment, we need to reopen this and test out his new patch for 
it. Andrzej, I'd be happy for you to reassign it to yourself; otherwise, I will 
have some time on Tuesday to look at this if you haven't got to it by then.

> allow parsers to return multiple Parse object, this will speed up the rss 
> parser
> 
>
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Renaud Richardet
> Assigned To: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
> NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
> NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, 
> NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, 
> parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff, 
> redirect_and_index.patch
>
>
> allow Parser#parse to return a Map. This way, the RSS parser 
> can return multiple parse objects, that will all be indexed separately. 
> Advantage: no need to fetch all feed-items separately.
> see the discussion at 
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-05-13 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495381
 ] 

Chris A. Mattmann commented on NUTCH-444:
-

Doğacan -- I will check this out tomorrow (Monday) night, Tuesday at the 
latest. I've reopened NUTCH-443 and will also look at your new patch from there.

> Possibly use a different library to parse RSS feed for improved performance 
> and compatibility
> -
>
> Key: NUTCH-444
> URL: https://issues.apache.org/jira/browse/NUTCH-444
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Renaud Richardet
> Assigned To: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: feed.tar.bz2, NUTCH-444.patch, parse-feed-v2.tar.bz2, 
> parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??




[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495410
 ] 

Doğacan Güney commented on NUTCH-485:
-

I have two more minor nits:

1) ParseResult.isSuccess returns true only if all parses are successful. This 
makes sense, but I think you should make it more obvious by mentioning it in 
the method's javadoc. 

2) There seem to be some whitespace issues. For example, some indents are 4 
spaces. All indents should be 2 spaces.
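
For point 1, something along these lines would make the semantics explicit. 
This is only a sketch with hypothetical names (AllOrNothingResult, add), not 
the code from the patch, and treating an empty result as successful is an 
assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual ParseResult from the patch. */
final class AllOrNothingResult {
  private final List<Boolean> statuses = new ArrayList<>();

  /** Records whether one individual parse succeeded. */
  void add(boolean parseSucceeded) {
    statuses.add(parseSucceeded);
  }

  /**
   * Returns true only if every recorded parse succeeded; a single
   * failed parse makes the whole result unsuccessful. An empty
   * result is reported as successful in this sketch.
   */
  boolean isSuccess() {
    for (boolean ok : statuses) {
      if (!ok) {
        return false;
      }
    }
    return true;
  }
}
```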

Anyway, I don't know if my vote counts, but, besides these two issues, I am +1 
on this patch.

I think this may be very useful for image search. After parsing a page, one can 
traverse the DOM, add image src's as urls and the immediate text around images 
as parse text (+ whatever data you can gather as parse data). Of course, this 
doesn't automatically make Nutch an image search engine, but it is a good first 
step.
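
A minimal sketch of that traversal, using only the standard org.w3c.dom API. 
ImageCollectorSketch is a hypothetical name, and using the parent element's 
text as the "immediate text around the image" is an assumption made for 
brevity; a real filter would likely be more selective.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper; not part of Nutch. */
final class ImageCollectorSketch {
  /** Maps each image src under root to the text of its parent node. */
  static Map<String, String> collect(Element root) {
    Map<String, String> found = new LinkedHashMap<>();
    NodeList imgs = root.getElementsByTagName("img");
    for (int i = 0; i < imgs.getLength(); i++) {
      Element img = (Element) imgs.item(i);
      String src = img.getAttribute("src");
      if (src.isEmpty()) {
        continue; // no url to add for this image
      }
      // Assumption: the parent's text is "the text around the image".
      String nearby = img.getParentNode().getTextContent().trim();
      found.put(src, nearby);
    }
    return found;
  }
}
```

Each map entry would then correspond to one extra (url, parse text) pair added 
to the result for the page.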

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> --
>
> Key: NUTCH-485
> URL: https://issues.apache.org/jira/browse/NUTCH-485
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Gal Nitzan
> Fix For: 1.0.0
>
> Attachments: NUTCH-485.200705122151.patch, 
> NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, 
> NUTCH-485.200705131241.patch
>
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to 
> add parse objects to the ParseResult object.
> A change to the HtmlParseFilter is needed which allows the filter to return 
> ParseResult, and of course a change to HtmlParseFilters.




[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread Gal Nitzan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gal Nitzan updated NUTCH-485:
-

Attachment: NUTCH-485.200705140001.patch

Thanks Doğacan for taking the time to review the code.

I agree with your comments on the usage. I run a video search and it is sure 
going to help. The ability to "discover" and add content "on the fly" to the 
segment while parsing is a long-awaited functionality, and it was all made 
possible by NUTCH-443... :)


And yet one more update with a better description in javadoc and some fixes to 
indentation.

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> --
>
> Key: NUTCH-485
> URL: https://issues.apache.org/jira/browse/NUTCH-485
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Gal Nitzan
> Fix For: 1.0.0
>
> Attachments: NUTCH-485.200705122151.patch, 
> NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, 
> NUTCH-485.200705131241.patch, NUTCH-485.200705140001.patch
>
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to 
> add parse objects to the ParseResult object.
> A change to the HtmlParseFilter is needed which allows the filter to return 
> ParseResult, and of course a change to HtmlParseFilters.
