Hey Rahul,

getSize is used to determine how many reducers should run on a MapReduce job. If the data set that you will be processing is small, then configuring getSize to return a small integer value is fine, although I would probably opt for 1 instead of 0. If you find that you need to control the number of reducers that run as part of the job, there is a groupByKey(int numReducers) option that allows you to specify the reducer count explicitly.
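
If you would rather have getSize() return a real number than a placeholder, one way to think about it is "total bytes of all files under the input path". Below is a minimal sketch of that calculation. It is not Crunch code: it uses java.nio.file as a stand-in so that it runs without a Hadoop cluster, and the class and method names (InputSizeSketch, sizeInBytes) are made up for illustration. On HDFS, the equivalent call should be FileSystem.getContentSummary(path).getLength().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper (not part of Crunch): estimates the size of an input
// as the total byte length of every regular file at or under a path.
class InputSizeSketch {

    static long sizeInBytes(Path root) throws IOException {
        // Files.walk visits root itself plus everything beneath it, so this
        // works for a single file as well as a directory of part files.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        // Lambdas cannot throw checked exceptions; rethrow unchecked.
                        throw new UncheckedIOException(e);
                    }
                })
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no path is given.
        Path input = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(sizeInBytes(input));
    }
}
```

A getSize() implementation could return a value computed this way, so the reducer estimate follows the actual input size instead of a hard-coded constant.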

When you say Sequential file, do you mean SequenceFile? I ask because Crunch has a ReadableSource impl for SequenceFiles in com.cloudera.crunch.io.seq. If you aren't using SequenceFiles, another option is to have your ReadableSource extend com.cloudera.crunch.io.impl.FileSourceImpl, which takes care of some of the boilerplate around creating Sources (including figuring out the size of an input source using the FileSystem API) for you.

J

On Thu, Jul 5, 2012 at 1:39 AM, Rahul <[email protected]> wrote:

> Hi Guys,
>
> I am trying my existing Map-Reduce application on crunch. My current
> application is triggered from a custom server component. The server is
> capable of accepting data on open connection and it writes back the same on
> Sequential File before triggering the application. Now I have migrated the
> whole business logic parts on Crunch, I have also created a ReadableSource
> based on Sequential file that can read data in crunch. But I could not make
> out how to implement the getSize() API and return back 0 from there. The
> piece of code works but I wonder if it will break at some place ? How can I
> return back size of Sequential file ?
>
> regards
> Rahul

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
