Re: getSize() API of Source

Josh Wills Thu, 05 Jul 2012 08:21:12 -0700

Hey Rahul,

getSize is used to determine how many reducers should run on a MapReduce
job. If the data set that you will be processing is small, then configuring
getSize to return a small integer value is fine, although I would probably
opt for 1 instead of 0. If you find that you need to control the number of
reducers that run as part of the job, there is a groupByKey(int
numReducers) option that allows you to specify the reducer count explicitly.
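
To make the relationship between the reported size and the reducer count
concrete, here is a rough sketch in plain Java. It is not Crunch's actual
planner code: the class name ReducerCountSketch, the method reducersFor, and
the one-reducer-per-GiB threshold are all assumptions made up for
illustration. The only point is that, under a rule of this shape, any small
size ends up with a single reducer.

```java
final class ReducerCountSketch {
    // Hypothetical threshold: one reducer per GiB of input.
    // This is an illustrative assumption, not a constant taken from Crunch.
    static final long BYTES_PER_REDUCER = 1024L * 1024L * 1024L;

    /** Maps a size in bytes to a reducer count that is always at least 1. */
    static int reducersFor(long sizeInBytes) {
        if (sizeInBytes < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        long reducers = 1 + sizeInBytes / BYTES_PER_REDUCER;
        // Clamp so the cast to int cannot overflow for very large inputs.
        return (int) Math.min(reducers, Integer.MAX_VALUE);
    }
}
```

When a rule like this picks the wrong count for your job, passing an
explicit count to groupByKey(int numReducers) is the way around it.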

When you say Sequential file, do you mean SequenceFile? I ask because
Crunch has a ReadableSource impl for SequenceFiles in
com.cloudera.crunch.io.seq. If you aren't using SequenceFiles, another
option is to have your ReadableSource extend
com.cloudera.crunch.io.impl.FileSourceImpl, which takes care of some of the
boilerplate around creating Sources (including figuring out the size of an
input source using the FileSystem API) for you.
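
If you do end up writing getSize() yourself, the idea is just to add up the
bytes under the input path. The sketch below shows that idea against the
local filesystem with java.nio.file, so it runs without a Hadoop classpath;
it is a stand-in, not what FileSourceImpl does internally. The class name
LocalSizeSketch and the method sizeInBytes are hypothetical. On HDFS the
equivalent lookup would go through org.apache.hadoop.fs.FileSystem instead,
for example getContentSummary(path).getLength().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class LocalSizeSketch {
    /**
     * Returns the size in bytes of a single file, or the combined size of
     * all regular files under a directory (walked recursively).
     */
    static long sizeInBytes(Path path) {
        // Files.walk on a regular file yields just that file.
        try (Stream<Path> entries = Files.walk(path)) {
            return entries
                .filter(Files::isRegularFile)
                .mapToLong(LocalSizeSketch::fileSize)
                .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Wraps the checked IOException so the method can be used in a stream.
    private static long fileSize(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A getSize() implementation would return a value computed along these lines
rather than a hard-coded 0.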

J

On Thu, Jul 5, 2012 at 1:39 AM, Rahul <[email protected]> wrote:

> Hi Guys,
>
> I am trying my existing Map-Reduce application on crunch. My current
> application is triggered from a custom server component. The server is
> capable of accepting data on open connection and it writes back the same on
> Sequential File before triggering the application. Now I have migrated the
> whole business logic parts on Crunch, I have also created a ReadableSource
> based on Sequential file that can read data in crunch. But I could not make
> out how to implement the getSize() API and return back 0 from there. The
> piece of code works but I wonder if it will break at some place ? How can I
> return back size of Sequential file ?
>
> regards
> Rahul
>
>

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
