RE: Sequence File Question
Got it. I am going to document this on the wiki. Thanks.

Steve

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 29, 2007 2:31 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Sequence File Question

Steve Severance wrote:
>> DB updates - or actually replacements - see e.g. CrawlDb.install()
>> method for details. This is not needed in case of segments, which
>> are created once and never updated.
>
> How does the reader know which one to expect? For instance, I can
> make a reader that reads a linkdb just by instantiating it on the
> directory crawl/linkdb, and it knows to go inside the "current"
> directory. But when opening a parse_data there is no "current". So
> how does it know which layout to expect?

Use The Source, Luke ;) It follows this (arbitrary) naming convention:
we always use a "current" subdirectory when working with LinkDb and
CrawlDb, and a different naming convention when we use SegmentReader.

One comment: CrawlDbReader, LinkDbReader and SegmentReader are Nutch
classes. However, the real data is stored using Hadoop classes,
specifically MapFileOutputFormat. CrawlDbReader knows about the Nutch
naming convention and always appends "current" to the db name. But if
you were to use MapFileOutputFormat.getReaders() directly, this Hadoop
class of course doesn't know about that, so you need to provide a full
path that includes "current".

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Sequence File Question
Steve Severance wrote:
>> DB updates - or actually replacements - see e.g. CrawlDb.install()
>> method for details. This is not needed in case of segments, which
>> are created once and never updated.
>
> How does the reader know which one to expect? For instance, I can
> make a reader that reads a linkdb just by instantiating it on the
> directory crawl/linkdb, and it knows to go inside the "current"
> directory. But when opening a parse_data there is no "current". So
> how does it know which layout to expect?

Use The Source, Luke ;) It follows this (arbitrary) naming convention:
we always use a "current" subdirectory when working with LinkDb and
CrawlDb, and a different naming convention when we use SegmentReader.

One comment: CrawlDbReader, LinkDbReader and SegmentReader are Nutch
classes. However, the real data is stored using Hadoop classes,
specifically MapFileOutputFormat. CrawlDbReader knows about the Nutch
naming convention and always appends "current" to the db name. But if
you were to use MapFileOutputFormat.getReaders() directly, this Hadoop
class of course doesn't know about that, so you need to provide a full
path that includes "current".

--
Best regards,
Andrzej Bialecki <><
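The path convention described above can be sketched in plain Java. This is only an illustration of the path logic, not Nutch's actual reader code; the class and method names below are invented for the example:

```java
import java.nio.file.Path;

// Sketch of the naming convention (hypothetical helper, not Nutch code):
// CrawlDbReader/LinkDbReader append "current" to the db directory before
// opening it, while segment data (e.g. parse_data) is opened as-is. A raw
// MapFileOutputFormat.getReaders() call knows nothing about "current",
// so a caller using it directly must include "current" in the path.
public class DbPathConvention {

    // Databases (crawldb, linkdb) keep their live data under "current".
    static Path resolveDbPath(Path dbDir) {
        return dbDir.resolve("current");
    }

    // Segment subdirectories have no "current" level.
    static Path resolveSegmentPath(Path segmentDir, String subdir) {
        return segmentDir.resolve(subdir);
    }

    public static void main(String[] args) {
        Path linkdb = Path.of("crawl", "linkdb");
        Path segment = Path.of("crawl", "segments", "20070329000000");

        // What a db reader would effectively open: crawl/linkdb/current
        System.out.println(resolveDbPath(linkdb));
        // What a segment reader would open for parse data:
        System.out.println(resolveSegmentPath(segment, "parse_data"));
    }
}
```

The segment name above is a made-up example; the point is only that the "current" level exists for dbs and not for segments.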
RE: Sequence File Question
> -----Original Message-----
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 28, 2007 4:34 PM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Sequence File Question
>
> Steve Severance wrote:
> > Let me actually refine that question: why do some directories, like
> > the linkdb, have a "current", and why do others, like parse_data,
> > not? Is there a convention on this?
>
> First, to answer your original question: you should use the
> MapFileOutputFormat class for reading such output. It handles these
> part- subdirectories automatically.
>
> Second, the "current" subdirectory is there in order to properly handle
> DB updates - or actually replacements - see e.g. CrawlDb.install()
> method for details. This is not needed in case of segments, which are
> created once and never updated.

How does the reader know which one to expect? For instance, I can make
a reader that reads a linkdb just by instantiating it on the directory
crawl/linkdb, and it knows to go inside the "current" directory. But
when opening a parse_data there is no "current". So how does it know
which layout to expect?

Steve

> Thirdly, although you didn't ask about it ;) the latest version of
> Hadoop contains a handy facility called Counters - if you use the PR
> power method you need to collect PR from dangling nodes in order to
> redistribute it later. You can use Counters for this, and save on a
> separate aggregation step.
>
> --
> Best regards,
> Andrzej Bialecki <><
Re: Sequence File Question
Steve Severance wrote:
> Let me actually refine that question: why do some directories, like
> the linkdb, have a "current", and why do others, like parse_data, not?
> Is there a convention on this?

First, to answer your original question: you should use the
MapFileOutputFormat class for reading such output. It handles these
part- subdirectories automatically.

Second, the "current" subdirectory is there in order to properly handle
DB updates - or actually replacements - see e.g. CrawlDb.install()
method for details. This is not needed in case of segments, which are
created once and never updated.

Thirdly, although you didn't ask about it ;) the latest version of
Hadoop contains a handy facility called Counters - if you use the PR
power method you need to collect PR from dangling nodes in order to
redistribute it later. You can use Counters for this, and save on a
separate aggregation step.

--
Best regards,
Andrzej Bialecki <><
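The dangling-node trick in the last paragraph can be sketched as a single-process power-method step. This is illustrative only: in a real MapReduce job the running sum would be a Hadoop Counter read back after the job, and the graph, damping factor, and class names here are invented for the example:

```java
import java.util.Arrays;

// Single-process sketch of collecting dangling-node PageRank mass in one
// pass (the role a Hadoop Counter would play), then redistributing it,
// instead of running a separate aggregation step.
public class DanglingPageRank {

    static final double DAMPING = 0.85; // conventional value, assumed here

    // One power-method iteration. adj[i] lists the outlinks of page i;
    // an empty array means page i is dangling.
    static double[] iterate(int[][] adj, double[] rank) {
        int n = rank.length;
        double[] next = new double[n];
        double danglingMass = 0.0;          // the "counter"

        for (int i = 0; i < n; i++) {
            if (adj[i].length == 0) {
                danglingMass += rank[i];    // collect instead of distributing
            } else {
                double share = rank[i] / adj[i].length;
                for (int j : adj[i]) next[j] += share;
            }
        }
        // Spread the collected dangling mass evenly, then apply damping.
        for (int i = 0; i < n; i++) {
            next[i] = (1 - DAMPING) / n
                    + DAMPING * (next[i] + danglingMass / n);
        }
        return next;
    }

    public static void main(String[] args) {
        // Tiny made-up graph: 0 -> 1, 1 -> {0, 2}, 2 is dangling.
        int[][] adj = { {1}, {0, 2}, {} };
        double[] rank = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };
        for (int it = 0; it < 50; it++) rank = iterate(adj, rank);
        System.out.println(Arrays.toString(rank));
    }
}
```

Because the dangling mass is redistributed in the same pass, the total rank stays 1 after every iteration, which is what makes the separate aggregation step unnecessary.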
RE: Sequence File Question
Let me actually refine that question: why do some directories, like the
linkdb, have a "current", and why do others, like parse_data, not? Is
there a convention on this?

Steve

> -----Original Message-----
> From: Steve Severance [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 28, 2007 4:11 PM
> To: nutch-dev@lucene.apache.org
> Subject: Sequence File Question
>
> Hey guys,
> I have a mapreduce job that sets up a directory for pagerank. It
> iterates over all the segments and then outputs a MapFile containing
> the data. When I go to open the outputted directory with another
> MapReduce job it fails, saying that it cannot find the path. The path
> that it thinks it is trying to open does not include the part-0
> directory. Both my directory (and all other directories, for that
> matter) have the same structure, which is /path/part-0/. I feel like
> this is a really stupid error and I have forgotten something that is
> easily fixed. Any ideas?
>
> Steve
Sequence File Question
Hey guys,
I have a mapreduce job that sets up a directory for pagerank. It
iterates over all the segments and then outputs a MapFile containing
the data. When I go to open the outputted directory with another
MapReduce job it fails, saying that it cannot find the path. The path
that it thinks it is trying to open does not include the part-0
directory. Both my directory (and all other directories, for that
matter) have the same structure, which is /path/part-0/. I feel like
this is a really stupid error and I have forgotten something that is
easily fixed. Any ideas?

Steve
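For context on the layout Steve describes: a MapReduce job writes one part- subdirectory per reducer under the output path, so a reader pointed at the parent must enumerate those parts itself, which is what the MapFileOutputFormat class mentioned in the replies does. A plain-Java sketch of that enumeration (illustrative only, not Hadoop code; the directory names are examples, not taken from Steve's job):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Enumerate the per-reducer part- subdirectories under a job output
// directory, the way a format-aware reader would before opening one
// reader per part. Opening the parent path as if it were the data
// itself is what produces the "cannot find the path" failure.
public class PartDirs {

    // Return the part-* subdirectories of a job output dir, sorted.
    static List<Path> listParts(Path outputDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> ds =
                 Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : ds) parts.add(p);
        }
        Collections.sort(parts); // deterministic reader order
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Build an example layout in a temp dir.
        Path out = Files.createTempDirectory("pagerank-out");
        Files.createDirectory(out.resolve("part-00000"));
        Files.createDirectory(out.resolve("part-00001"));
        System.out.println(listParts(out)); // two part dirs, in order
    }
}
```

With a helper like this, each returned part directory would then be opened with its own reader, rather than handing the parent path to a single one.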