[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Andrzej Bialecki (JIRA) Wed, 28 Feb 2007 07:01:23 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476600 ]


Andrzej Bialecki  commented on NUTCH-443:
-----------------------------------------

Almost there ... ParseResult seemed to tidy up this patch quite a bit. 
Remaining issues:

* you create the "fake" CrawlDatum-s in ParseOutputFormat, and then set 
fetchTime to the current time. This is incorrect - parsing may have been 
performed long after the content was fetched. The correct place to create and 
store these "fake" CrawlDatum-s is in the FetcherThread.output(), where you 
loop through Entry<Text, Parse>, i.e.:

          long curTime = System.currentTimeMillis();
          for (Entry<Text, Parse> entry : parseResult) {
            Text k = entry.getKey();
            output.collect(k, 
                new ObjectWritable(new ParseImpl(entry.getValue())));
            if (!k.equals(key)) {
              // sub-url: derive a "fake" datum from the real one
              CrawlDatum fake = (CrawlDatum) datum.clone();
              fake.setFetchTime(curTime);
              output.collect(k, new ObjectWritable(fake)); 
            } else {
              // save the real datum
              output.collect(k, new ObjectWritable(datum));
            }
          }

* I'm pretty sure that ParseResult.filter() must NOT be called under normal 
circumstances ... We need to store the information that parsing was 
unsuccessful - if we remove this information from the ParseResult we will never 
know that parsing failed for this content (or a part thereof).
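To illustrate the concern above, here is a minimal sketch using hypothetical stand-in types (KeepFailedParses and Outcome are invented for this example and are not the Nutch ParseResult or ParseStatus classes). It contrasts dropping failed entries with keeping them and checking the status later:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins; not the Nutch ParseResult / ParseStatus classes.
class KeepFailedParses {
    /** Assumed per-url outcome: success flag plus a short detail message. */
    static final class Outcome {
        final boolean success;
        final String detail;

        Outcome(boolean success, String detail) {
            this.success = success;
            this.detail = detail;
        }
    }

    /** Filtering approach: failed entries are dropped, so the failure is forgotten. */
    static Map<String, Outcome> filtered(Map<String, Outcome> all) {
        Map<String, Outcome> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Outcome> e : all.entrySet()) {
            if (e.getValue().success) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    /** Alternative: leave every entry in place and inspect the status when needed. */
    static int countFailures(Map<String, Outcome> all) {
        int failures = 0;
        for (Outcome o : all.values()) {
            if (!o.success) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Map<String, Outcome> all = new LinkedHashMap<>();
        all.put("http://example.com/feed", new Outcome(true, "ok"));
        all.put("http://example.com/feed#item1", new Outcome(false, "malformed item"));

        // After filtering, only one entry remains and the failure is no longer visible.
        System.out.println("entries after filtering: " + filtered(all).size());
        // Without filtering, the failure can still be stored and reported.
        System.out.println("failures still recorded: " + countFailures(all));
    }
}
```

In this sketch the unfiltered map is what would be written out, so a later job can still see which urls failed to parse.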

* we have a backward-compatibility issue with ParseImpl.isFetched - i.e. data 
created with earlier versions of Nutch won't be compatible with the new format, 
and there is no versioning information in the already existing data. We need to 
do one of the following:
  - bite the bullet and give up on backward compatibility - not so nice 
... all existing segments will have to be re-parsed. Ouch.
  - add look-ahead code that tests whether the data coming from DataInput 
contains this boolean flag or a likely Text length - somewhat unreliable...
  - store this flag in ParseData.contentMeta - ugly hack.

Out of these three the last option seems the safest for now. From the long-term 
point of view we should later on add versioning information and handling of 
different versions in Parse.
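For the longer-term versioning idea, a rough sketch of what a version-aware record could look like, using only java.io. The class name VersionedParseRecord, its fields and the version numbers are assumptions made up for illustration; this is not the Nutch Parse or ParseImpl code. Note that it only helps for data written after a version byte is introduced, so it does not by itself solve the problem of existing unversioned segments:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for a versioned Parse record; not the actual Nutch class.
class VersionedParseRecord {
    // Assumed version numbers, chosen for illustration only.
    static final byte VERSION_TEXT_ONLY = 1; // layout: version, text
    static final byte VERSION_WITH_FLAG = 2; // layout: version, text, flag
    static final byte CUR_VERSION = VERSION_WITH_FLAG;

    String text = "";
    boolean canonical = true; // true when the record belongs to the original url

    /** Always writes the newest layout, led by its version byte. */
    void write(DataOutput out) throws IOException {
        out.writeByte(CUR_VERSION);
        out.writeUTF(text);
        out.writeBoolean(canonical);
    }

    /** Reads any known layout by branching on the leading version byte. */
    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version < VERSION_TEXT_ONLY || version > CUR_VERSION) {
            throw new IOException("Unsupported record version: " + version);
        }
        text = in.readUTF();
        // Older records predate the flag; treat them as the original url.
        canonical = version >= VERSION_WITH_FLAG ? in.readBoolean() : true;
    }

    /** Demo helper: serializes a record in the current layout. */
    static byte[] toBytes(VersionedParseRecord r) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        r.write(new DataOutputStream(buf));
        return buf.toByteArray();
    }

    /** Demo helper: deserializes a record from bytes. */
    static VersionedParseRecord fromBytes(byte[] bytes) throws IOException {
        VersionedParseRecord r = new VersionedParseRecord();
        r.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        return r;
    }

    /** Demo helper: builds bytes in the older, flag-less layout by hand. */
    static byte[] legacyBytes(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(VERSION_TEXT_ONLY);
        out.writeUTF(text);
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        VersionedParseRecord sub = new VersionedParseRecord();
        sub.text = "feed item";
        sub.canonical = false;
        VersionedParseRecord back = fromBytes(toBytes(sub));
        System.out.println(back.text + " canonical=" + back.canonical);

        VersionedParseRecord old = fromBytes(legacyBytes("old page"));
        System.out.println(old.text + " canonical=" + old.canonical);
    }
}
```

The reader accepts both layouts while the writer emits only the newest one, so old data stays readable and gets upgraded whenever it is rewritten.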

* the method name Parse.isFetched is somewhat misleading - it's not about 
whether the content was fetched, but whether this Parse corresponds to the 
original url or to a sub-url. Perhaps isCanonical, isRoot, or some other 
name ...?

* in ParseSegment - what's the reason for creating a new copy of ParseImpl in 
this line below? I think we should store the one we already have in "parse":

      output.collect(url, new ParseImpl(new ParseText(parse.getText()), 
                                        parse.getData(), parse.isFetched()));


Thank you for your perseverance!

> allow parsers to return multiple Parse object, this will speed up the rss 
> parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
> NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
> NUTCH-443.022507.patch.txt, NUTCH-443.02282007.patch, 
> parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
> can return multiple parse objects, that will all be indexed separately. 
> Advantage: no need to fetch all feed-items separately.
> see the discussion at 
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
