RE: Sequence File Question

2007-03-29 Thread Steve Severance
Got it. I am going to document this on the wiki. Thanks.

Steve
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 29, 2007 2:31 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Sequence File Question

Steve Severance wrote:
>> DB updates - or actually replacements - see e.g. CrawlDb.install()
>>  method for details. This is not needed in case of segments, which 
>> are created once and never updated.
> 
> How does the reader know which one to expect? For instance, I can
> create a reader for a linkdb just by instantiating it on the directory
> crawl/linkdb, and it knows to go inside the "current" subdirectory.
> But when opening a parse_data there is no "current". So how does it
> know which to expect?

Use The Source, Luke ;) It follows an (arbitrary) naming convention: we
always use a "current" subdirectory when working with LinkDb and
CrawlDb, and a different convention when we use SegmentReader.

One comment: CrawlDbReader, LinkDbReader and SegmentReader are Nutch
classes. However, the real data is stored using Hadoop classes,
specifically MapFileOutputFormat. CrawlDbReader knows about the Nutch
naming convention and always appends "current" to the db name. But if
you were to use MapFileOutputFormat.getReaders() directly, this Hadoop
class of course doesn't know about the convention, so you need to
provide a full path that includes "current".
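To make the convention concrete, here is a minimal, self-contained sketch of the two path rules (the class and method names are mine for illustration, not actual Nutch APIs):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch of the naming convention described above; the class
// and method names are hypothetical, not Nutch APIs.
public class PathConvention {

    // CrawlDbReader/LinkDbReader style: the live data lives under
    // <db>/current, so the reader appends "current" for you.
    static Path resolveDb(Path dbDir) {
        return dbDir.resolve("current");
    }

    // SegmentReader style: segments are written once and never replaced,
    // so the given directory is used directly.
    static Path resolveSegmentData(Path segmentDir, String subDir) {
        return segmentDir.resolve(subDir);
    }

    public static void main(String[] args) {
        System.out.println(resolveDb(Paths.get("crawl/linkdb")));
        System.out.println(resolveSegmentData(Paths.get("crawl/segments/X"), "parse_data"));
    }
}
```

So a reader pointed at crawl/linkdb looks in crawl/linkdb/current, while a segment subdirectory like parse_data is opened as given.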


-- 
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Sequence File Question

2007-03-29 Thread Andrzej Bialecki

Steve Severance wrote:

DB updates - or actually replacements - see e.g. CrawlDb.install()
 method for details. This is not needed in case of segments, which 
are created once and never updated.


How does the reader know which one to expect? For instance, I can
create a reader for a linkdb just by instantiating it on the directory
crawl/linkdb, and it knows to go inside the "current" subdirectory.
But when opening a parse_data there is no "current". So how does it
know which to expect?


Use The Source, Luke ;) It follows an (arbitrary) naming convention: we
always use a "current" subdirectory when working with LinkDb and
CrawlDb, and a different convention when we use SegmentReader.

One comment: CrawlDbReader, LinkDbReader and SegmentReader are Nutch
classes. However, the real data is stored using Hadoop classes,
specifically MapFileOutputFormat. CrawlDbReader knows about the Nutch
naming convention and always appends "current" to the db name. But if
you were to use MapFileOutputFormat.getReaders() directly, this Hadoop
class of course doesn't know about the convention, so you need to
provide a full path that includes "current".


--
Best regards,
Andrzej Bialecki <><



RE: Sequence File Question

2007-03-29 Thread Steve Severance
> -Original Message-
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 28, 2007 4:34 PM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Sequence File Question
> 
> Steve Severance wrote:
> > Let me actually refine that question: why do some directories like
> > the linkdb have a current, and why do others like parse_data not?
> > Is there a convention on this?
> 
> First, to answer your original question: you should use the
> MapFileOutputFormat class for reading such output. It handles these
> part- subdirectories automatically.
> 
> Second, the "current" subdirectory is there in order to properly handle
> DB updates - or actually replacements - see e.g. CrawlDb.install()
> method for details. This is not needed in case of segments, which are
> created once and never updated.

How does the reader know which one to expect? For instance, I can create a 
reader for a linkdb just by instantiating it on the directory crawl/linkdb, 
and it knows to go inside the "current" subdirectory. But when opening a 
parse_data there is no "current". So how does it know which to expect?

Steve

> 
> Thirdly, although you didn't ask about it ;) the latest version of
> Hadoop contains a handy facility called Counters - if you use the PR
> PowerMethod you need to collect PR from dangling nodes in order to
> redistribute it later. You can use Counters for this, and save on a
> separate aggregation step.
> 
> 
> --
> Best regards,
> Andrzej Bialecki <><



Re: Sequence File Question

2007-03-28 Thread Andrzej Bialecki

Steve Severance wrote:

Let me actually refine that question: why do some directories like the linkdb
have a current, and why do others like parse_data not? Is there a convention
on this?


First, to answer your original question: you should use the
MapFileOutputFormat class for reading such output. It handles these
part- subdirectories automatically.
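As a rough illustration of what "handles these part- subdirectories" means (this mimics the behavior, it is not the Hadoop source): a job's output directory contains one MapFile per reduce task, named part-00000, part-00001, and so on, and a reader collects them in sorted partition order:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: mimics how a reader gathers the per-reducer MapFile
// subdirectories (part-00000, part-00001, ...) from a job output directory.
public class PartDirs {
    static List<String> selectPartDirs(List<String> children) {
        List<String> parts = new ArrayList<String>();
        for (String name : children) {
            if (name.startsWith("part-")) {
                parts.add(name);
            }
        }
        // Sorting the zero-padded names keeps the readers in partition order.
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) {
        List<String> children = List.of("part-00001", "_logs", "part-00000");
        System.out.println(selectPartDirs(children)); // [part-00000, part-00001]
    }
}
```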


Second, the "current" subdirectory is there in order to properly handle 
DB updates - or actually replacements - see e.g. CrawlDb.install() 
method for details. This is not needed in case of segments, which are 
created once and never updated.
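The replacement that CrawlDb.install() performs can be sketched roughly like this (plain java.nio.file instead of the Hadoop FileSystem API; error handling and cleanup of a stale "old" directory are omitted):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Rough sketch of the CrawlDb.install()-style replacement described above.
// It mirrors the idea (promote the freshly built db under "current", keep
// the previous version as "old"); it is not the actual Hadoop FS code.
public class DbInstall {
    static void install(Path newDb, Path dbDir) throws IOException {
        Path current = dbDir.resolve("current");
        Path old = dbDir.resolve("old");
        if (Files.exists(current)) {
            Files.move(current, old);   // keep the last version as "old"
        }
        Files.move(newDb, current);     // promote the new output
    }
}
```

Because readers always open the db through "current", an update is just this directory swap; segments never need it.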


Thirdly, although you didn't ask about it ;) the latest version of 
Hadoop contains a handy facility called Counters - if you use the PR 
PowerMethod you need to collect PR from dangling nodes in order to 
redistribute it later. You can use Counters for this, and save on a 
separate aggregation step.
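To spell out the idea: in the power method, the PageRank mass sitting on dangling nodes (pages with no outlinks) cannot flow along any edge, so you collect it during the pass and redistribute it uniformly. Below is a self-contained sketch of the arithmetic in plain Java; in a real job the danglingMass variable would be accumulated with a Hadoop Counter, which is what saves the separate aggregation step:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of one power-method iteration with dangling-mass
// redistribution. In a Hadoop job, danglingMass would be a Counter
// incremented during the pass and folded in at the next iteration.
public class DanglingMass {
    static Map<String, Double> step(Map<String, Double> pr,
                                    Map<String, String[]> outlinks,
                                    double damping) {
        int n = pr.size();
        Map<String, Double> next = new HashMap<String, Double>();
        for (String page : pr.keySet()) next.put(page, 0.0);

        double danglingMass = 0.0;   // what the Counter would collect
        for (Map.Entry<String, Double> e : pr.entrySet()) {
            String[] links = outlinks.get(e.getKey());
            if (links == null || links.length == 0) {
                danglingMass += e.getValue();   // nowhere to send it
            } else {
                double share = e.getValue() / links.length;
                for (String t : links) next.merge(t, share, Double::sum);
            }
        }
        // Redistribute dangling mass uniformly, then apply damping.
        for (String page : next.keySet()) {
            double rank = next.get(page) + danglingMass / n;
            next.put(page, (1 - damping) / n + damping * rank);
        }
        return next;
    }
}
```

Total PageRank mass stays 1.0 across iterations, which is the invariant the redistribution step exists to preserve.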



--
Best regards,
Andrzej Bialecki <><



RE: Sequence File Question

2007-03-28 Thread Steve Severance
Let me actually refine that question: why do some directories like the linkdb
have a current, and why do others like parse_data not? Is there a convention
on this?

Steve

> -Original Message-
> From: Steve Severance [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 28, 2007 4:11 PM
> To: nutch-dev@lucene.apache.org
> Subject: Sequence File Question
> 
> Hey guys,
> I have a mapreduce job that sets up a directory for pagerank. It
> iterates over all the segments and then outputs a MapFile containing
> the data. When I go to open the outputted directory with another
> MapReduce job, it fails saying that it cannot find the path. The path
> that it thinks it is trying to open does not include the part-0
> directory. My directory (and, for that matter, all other output
> directories) has the same structure, which is /path/part-0/. I feel
> like this is a really stupid error and I have forgotten something
> that is easily fixed. Any ideas?
> 
> Steve



Sequence File Question

2007-03-28 Thread Steve Severance
Hey guys,
I have a mapreduce job that sets up a directory for pagerank. It iterates
over all the segments and then outputs a MapFile containing the data. When I
go to open the outputted directory with another MapReduce job, it fails
saying that it cannot find the path. The path that it thinks it is trying to
open does not include the part-0 directory. My directory (and, for that
matter, all other output directories) has the same structure, which is
/path/part-0/. I feel like this is a really stupid error and I have
forgotten something that is easily fixed. Any ideas?

Steve