[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495350 ]

Doğacan Güney commented on NUTCH-485:
-------------------------------------

You probably should not add "put(String/Text key, Parse parse)" methods to ParseResult. ParseResult deliberately has no method for adding a Parse object directly, so that it can check whether the parse object comes from a real url or a sub-url.

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> ------------------------------------------------------------------------------
>
> Key: NUTCH-485
> URL: https://issues.apache.org/jira/browse/NUTCH-485
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Gal Nitzan
> Fix For: 1.0.0
>
> Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object.
> A change to the HtmlParseFilter is needed which allows the filter to return ParseResult, and of course a change to HtmlParseFilters.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gal Nitzan updated NUTCH-485:
----------------------------

    Attachment: NUTCH-485.200705131241.patch

Thanks Doğacan, I missed it :(
Thanks to all reviewers. Yet another patch...

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> ------------------------------------------------------------------------------
>
> Key: NUTCH-485
> URL: https://issues.apache.org/jira/browse/NUTCH-485
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Gal Nitzan
> Fix For: 1.0.0
>
> Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, NUTCH-485.200705131241.patch
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object.
> A change to the HtmlParseFilter is needed which allows the filter to return ParseResult, and of course a change to HtmlParseFilters.
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495357 ]

Doğacan Güney commented on NUTCH-443:
-------------------------------------

Well... That's embarrassing. It seems I forgot to include the necessary changes to Indexer. Indexer has to read crawl_parse too so that it can pick up sub-urls' fetch datums.

So, that seemed easy (just a couple of lines) but then I realized that there is another bug. (Which, in my defense, was present in Nutch before 443. So the bug was there, I only made it worse:) It is a bit difficult to describe, so please bear with me.

The problem goes like this: In fetcher, if max.redirect is 0, Nutch pushes an empty Content to content and a LINKED datum to crawl_fetch (let's call this url foo). ParseSegment parses the empty Content and creates a parse data and an empty parse text. After updatedb and one more generate-fetch-parse-updatedb cycle, we now have a proper content, parse text and parse data for foo in the new segment.

Now, assume I index both of these segments together. Url foo will have two sets of (fetch datum, parse), one coming from the first segment, the other coming from the second segment. Since the first fetch datum is LINKED, this code in Indexer.reduce will cause foo to be discarded:

    if (redir != null) {
      // XXX page was redirected - what should we do?
      // XXX discard it for now
      return;
    }

And it doesn't work if we just remove this code. Remember that foo has two sets of (fetch datum, parse) and one of the parses contains an empty parse text. Since Indexer will randomly choose one of the parses in reduce, it is likely that we will get an empty parse text for url foo.

This is the part that I made worse: Since Indexer has to read crawl_parse, it will get a lot of STATUS_LINKED datums (that are written to crawl_parse as outlinks) and discard a lot of useful pages in any multi-segment index job.

Sorry if the description is unnecessarily complex.
> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Assigned To: Andrzej Bialecki
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
> allow Parser#parse to return a Map. This way, the RSS parser can return multiple parse objects, that will all be indexed separately.
> Advantage: no need to fetch all feed-items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
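
The Indexer.reduce problem described in the NUTCH-443 comment above can be modeled with a small, self-contained sketch. It is a simplified illustration only and not Nutch's code: the class RedirectDiscardSketch, the Status enum, the Entry type and both reduce methods are hypothetical names invented here, and "random choice" of a parse is approximated by taking whichever value arrives last.

```java
import java.util.Arrays;
import java.util.List;

public class RedirectDiscardSketch {

  // Hypothetical stand-ins for the fetch datum status; not Nutch's real constants.
  enum Status { FETCH_SUCCESS, LINKED }

  // One (fetch datum, parse text) pair contributed by a single segment.
  static final class Entry {
    final Status status;
    final String parseText;

    Entry(Status status, String parseText) {
      this.status = status;
      this.parseText = parseText;
    }
  }

  // Models the quoted check: a single LINKED datum discards the url (returns null).
  static String reduceWithDiscard(List<Entry> values) {
    String text = null;
    for (Entry e : values) {
      if (e.status == Status.LINKED) {
        return null;
      }
      text = e.parseText;
    }
    return text;
  }

  // Models removing the check: whichever parse arrives last wins,
  // so the empty parse text can still be the one that gets indexed.
  static String reduceWithoutCheck(List<Entry> values) {
    String text = null;
    for (Entry e : values) {
      text = e.parseText;
    }
    return text;
  }

  public static void main(String[] args) {
    Entry first = new Entry(Status.LINKED, "");                 // segment 1
    Entry second = new Entry(Status.FETCH_SUCCESS, "real page text"); // segment 2
    System.out.println(reduceWithDiscard(Arrays.asList(first, second)));
    System.out.println(reduceWithoutCheck(Arrays.asList(second, first)).isEmpty());
  }
}
```

In this model the url is lost entirely while the check is in place, and once the check is removed the result depends on the order in which the two segments' values arrive, which is the behavior the comment warns about.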
[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-443:
-------------------------------

    Attachment: redirect_and_index.patch

Patch for the problem. Now, if Fetcher gets a null content, instead of pushing an empty content, it filters null content.

It may change the semantics very slightly, but I don't think that it will be a problem. Before this patch, Fetcher creates an empty content, then passes the score from datum to content. Parse then passes it from content to parse data so that it can distribute the score to outlinks. But empty pages don't have outlinks anyway and they should not be indexed (so an adjust datum has no purpose).

Sorry about missing this bug in the first place, but, man, this is a subtle one.

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Assigned To: Andrzej Bialecki
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff, redirect_and_index.patch
>
> allow Parser#parse to return a Map. This way, the RSS parser can return multiple parse objects, that will all be indexed separately.
> Advantage: no need to fetch all feed-items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
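
The change in Fetcher behavior described in this update can be contrasted in a few lines. This is a rough sketch under stated assumptions, not the contents of redirect_and_index.patch: the class NullContentSketch and the methods pushEmpty and filterNull are hypothetical, and page content is reduced to a plain String keyed by url.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NullContentSketch {

  // Behavior before the patch, as described: a null fetch result
  // is replaced by an empty content and still written out.
  static Map<String, String> pushEmpty(Map<String, String> fetched) {
    Map<String, String> out = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : fetched.entrySet()) {
      out.put(e.getKey(), e.getValue() == null ? "" : e.getValue());
    }
    return out;
  }

  // Behavior after the patch, as described: urls whose content is null
  // are filtered out, so no empty parse is ever produced for them.
  static Map<String, String> filterNull(Map<String, String> fetched) {
    Map<String, String> out = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : fetched.entrySet()) {
      if (e.getValue() != null) {
        out.put(e.getKey(), e.getValue());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> fetched = new LinkedHashMap<>();
    fetched.put("foo", null);          // e.g. a redirect that was not followed
    fetched.put("bar", "page text");
    System.out.println(pushEmpty(fetched).keySet());
    System.out.println(filterNull(fetched).keySet());
  }
}
```

With filtering, url foo contributes nothing to the first segment, so a later multi-segment index job only sees the parse from the segment where foo was actually fetched.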
[jira] Resolved: (NUTCH-484) Nutch Nightly API link is broken in site
[ https://issues.apache.org/jira/browse/NUTCH-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren resolved NUTCH-484.
-----------------------------

    Resolution: Fixed

committed and updated site, thanks Gal

> Nutch Nightly API link is broken in site
> ----------------------------------------
>
> Key: NUTCH-484
> URL: https://issues.apache.org/jira/browse/NUTCH-484
> Project: Nutch
> Issue Type: Bug
> Components: documentation
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Gal Nitzan
> Priority: Trivial
> Fix For: 1.0.0
>
> Attachments: NUTCH-484.200705121200.patch
>
> The Nightly API link is broken
[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-444:
-------------------------------

    Attachment: NUTCH-444.patch
                feed.tar.bz2

First version of feed plugin featuring a Parser and an IndexingFilter. You would need the latest patch from NUTCH-443 (redirect_and_index.patch) to test it.

> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
> Key: NUTCH-444
> URL: https://issues.apache.org/jira/browse/NUTCH-444
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Assigned To: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: feed.tar.bz2, NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome
> - Informa
> - custom implementation based on Stax
> - ??
[jira] Reopened: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reopened NUTCH-443:
------------------------------------

    Assignee: Chris A. Mattmann  (was: Andrzej Bialecki )

Per Doğacan's comment, we need to reopen this and test out his new patch for it. Andrzej, I'd be happy if you reassigned it to yourself; however, I will have some time on Tuesday to look at this if you don't get to it before then.

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Assigned To: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff, redirect_and_index.patch
>
> allow Parser#parse to return a Map. This way, the RSS parser can return multiple parse objects, that will all be indexed separately.
> Advantage: no need to fetch all feed-items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495381 ]

Chris A. Mattmann commented on NUTCH-444:
-----------------------------------------

Doğacan -- I will check this out tomorrow (Monday) night, latest Tuesday. I've reopened NUTCH-443 and will also look at your new patch from there.

> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
> Key: NUTCH-444
> URL: https://issues.apache.org/jira/browse/NUTCH-444
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Assigned To: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: feed.tar.bz2, NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome
> - Informa
> - custom implementation based on Stax
> - ??
[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495410 ]

Doğacan Güney commented on NUTCH-485:
-------------------------------------

I have two more minor nits:

1) ParseResult.isSuccess returns true only if all parses are successful. This makes sense, but I think you should make it more obvious by mentioning it in the method's javadoc.

2) There seem to be some whitespace issues. For example, some indents are 4 spaces. All indents should be 2-space indents.

Anyway, I don't know if my vote counts, but, besides these two issues, I am +1 on this patch. I think this may be very useful for image search. After parsing a page, one can traverse the DOM, add image src's as urls and the immediate text around images as parse text (+ whatever data you can gather as parse data). Of course, this doesn't automatically make Nutch an image search engine, but it is a good first step.

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> ------------------------------------------------------------------------------
>
> Key: NUTCH-485
> URL: https://issues.apache.org/jira/browse/NUTCH-485
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Gal Nitzan
> Fix For: 1.0.0
>
> Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, NUTCH-485.200705131241.patch
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object.
> A change to the HtmlParseFilter is needed which allows the filter to return ParseResult, and of course a change to HtmlParseFilters.
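
The "true only if all parses are successful" semantics from nit 1 above, together with the kind of javadoc being asked for, might look roughly like the following. This is an illustrative stand-in and not the class from the patch: ParseResultSketch and its put(String, boolean) method are hypothetical, a parse is reduced to a success flag, and returning false for an empty result is a choice made only for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ParseResultSketch {

  // url -> whether the parse for that url succeeded (hypothetical simplification).
  private final Map<String, Boolean> successByUrl = new LinkedHashMap<>();

  void put(String url, boolean parseSucceeded) {
    successByUrl.put(url, parseSucceeded);
  }

  /**
   * Returns true only if every parse in this result succeeded.
   * A single failed parse, whether for the original url or for any
   * sub-url, makes this method return false. An empty result also
   * returns false (an assumption of this sketch).
   */
  boolean isSuccess() {
    if (successByUrl.isEmpty()) {
      return false;
    }
    for (boolean ok : successByUrl.values()) {
      if (!ok) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    ParseResultSketch result = new ParseResultSketch();
    result.put("http://example.com/feed", true);
    result.put("http://example.com/feed/item1", false);
    System.out.println(result.isSuccess());
  }
}
```

Spelling the all-or-nothing rule out in the javadoc matters because a caller might otherwise assume that success of the original url alone is enough.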
[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gal Nitzan updated NUTCH-485:
----------------------------

    Attachment: NUTCH-485.200705140001.patch

Thanks Doğacan for taking the time to review the code.

I agree with your comments on the usage. I run a video search and it is sure going to help. The ability to "discover" and add content "on the fly" to the segment while parsing is a long-awaited piece of functionality, and it was all made possible by NUTCH-443... :)

And yet one more update with a better description in javadoc and some fixes to indentation.

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> ------------------------------------------------------------------------------
>
> Key: NUTCH-485
> URL: https://issues.apache.org/jira/browse/NUTCH-485
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.0.0
> Environment: All
> Reporter: Gal Nitzan
> Fix For: 1.0.0
>
> Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, NUTCH-485.200705131241.patch, NUTCH-485.200705140001.patch
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object.
> A change to the HtmlParseFilter is needed which allows the filter to return ParseResult, and of course a change to HtmlParseFilters.
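
The idea discussed in this thread, discovering extra content while parsing by walking the DOM and recording image sources together with their surrounding text, can be sketched with the JDK's own XML DOM classes. This is only an illustration under stated assumptions: ImageSubUrlSketch, collectImages and parse are hypothetical names, the input is assumed to be well-formed XHTML, "surrounding text" is approximated by the text of the image's parent element, and nothing here uses Nutch's HtmlParseFilter or ParseResult API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ImageSubUrlSketch {

  // Maps each image src (a candidate sub-url) to the text of its parent element.
  static Map<String, String> collectImages(Document doc) {
    Map<String, String> found = new LinkedHashMap<>();
    NodeList images = doc.getElementsByTagName("img");
    for (int i = 0; i < images.getLength(); i++) {
      Element img = (Element) images.item(i);
      String src = img.getAttribute("src");
      if (src.isEmpty()) {
        continue; // no src attribute, nothing to add
      }
      found.put(src, img.getParentNode().getTextContent().trim());
    }
    return found;
  }

  // Parses well-formed XHTML; wraps checked exceptions for brevity.
  static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("not well-formed XHTML", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><body>"
        + "<p><img src=\"a.png\"/>A cat</p>"
        + "<p><img src=\"b.png\"/>A dog</p>"
        + "</body></html>";
    System.out.println(collectImages(parse(page)));
  }
}
```

In a filter that returns a ParseResult, each entry of such a map would presumably become one additional parse keyed by the image url, with the collected text as its parse text.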