That would be awesome - I will generalize it and blog about what a great person you are :D
On Fri, Aug 31, 2012 at 3:12 PM, Terry Siu <terry....@datasphere.com> wrote:

> Thanks, Russell, I'll dig into your recommendations. I'd be happy to open
> source it, but at the moment, it's not exactly general enough. However, I
> can certainly put it on github for your perusal.
>
> -Terry
>
> -----Original Message-----
> From: Russell Jurney [mailto:russell.jur...@gmail.com]
> Sent: Friday, August 31, 2012 3:03 PM
> To: user@pig.apache.org
> Subject: Re: Custom DB Loader UDF
>
> I don't have an answer, and I'm only learning these APIs myself, but
> you're writing something I'm planning on writing very soon - a
> MySQL-specific LoadFunc for Pig. I would greatly appreciate it if you would
> open source it on github or contribute it to Piggybank :)
>
> The InputSplits should determine the number of mappers, but to debug you
> might try forcing it by setting some properties in your script re:
> input splits (see
> https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/1QtL9bBwL0c):
>
> The input split size is determined by mapred.min.split.size,
> dfs.block.size, and mapred.map.tasks:
>
>   goalSize = totalSize / mapred.map.tasks
>   minSize = max(mapred.min.split.size, minSplitSize)
>   splitSize = max(minSize, min(goalSize, dfs.block.size))
>
> minSplitSize is determined by each InputFormat, such as
> SequenceFileInputFormat.
>
> I'd play around with those and see if you can get it doing what you want.
>
> On Fri, Aug 31, 2012 at 2:02 PM, Terry Siu <terry....@datasphere.com> wrote:
>
> > Hi all,
> >
> > I know this question has probably been posed multiple times, but I'm
> > having difficulty figuring out a couple of aspects of a custom
> > LoadFunc to read from a DB. And yes, I did try to Google my way to an
> > answer. Anyhoo, for what it's worth, I have a MySQL table that I wish
> > to load via Pig. I have the LoadFunc working using PigServer in a Java
> > app, but I noticed the following when my job gets submitted to my MR
> > cluster. I generated 6 InputSplits in my custom InputFormat, where
> > each split specifies a non-overlapping range/page of records to read
> > from. I thought that each InputSplit would correspond to a map task,
> > but what I see in the JobTracker is that the submitted job only has 1
> > map task, which executes each split serially. Is my understanding even
> > correct that a split can effectively be assigned to a single map task?
> > If so, can I coerce the submitted MR job to properly get each of my
> > splits to execute in its own map task?
> >
> > Thanks,
> > -Terry

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
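[Editor's note] The split-size rule Russell quotes can be checked numerically. The following is a small, self-contained Java sketch of that formula; the class name, the 1 GiB input size, and the 64 MiB block size are illustrative assumptions, not values from the thread:

```java
// Sketch of how FileInputFormat-era Hadoop derives a split size from
// mapred.map.tasks, mapred.min.split.size, and dfs.block.size.
// All concrete values below are assumed for illustration.
public class SplitSizeSketch {
    static long computeSplitSize(long totalSize, int numMapTasks,
                                 long minSplitSizeProp, long formatMinSplitSize,
                                 long blockSize) {
        // goalSize = totalSize / mapred.map.tasks
        long goalSize = totalSize / numMapTasks;
        // minSize = max(mapred.min.split.size, minSplitSize)
        long minSize = Math.max(minSplitSizeProp, formatMinSplitSize);
        // splitSize = max(minSize, min(goalSize, dfs.block.size))
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalSize = 1L << 30;   // 1 GiB of input (assumed)
        long blockSize = 64L << 20;  // 64 MiB dfs.block.size (old default)
        // 16 requested map tasks: goal is 64 MiB, within the block-size cap.
        System.out.println(computeSplitSize(totalSize, 16, 1, 1, blockSize));
        // A large mapred.min.split.size (256 MiB) overrides the block-size cap.
        System.out.println(computeSplitSize(totalSize, 16, 256L << 20, 1, blockSize));
    }
}
```

As the second call shows, mapred.min.split.size can override the block-size cap entirely, which is why tweaking these properties changes how many map tasks a FileInputFormat-based job gets.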
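[Editor's note] The "non-overlapping range/page of records" scheme Terry describes can be sketched as plain range arithmetic, independent of Hadoop. Everything here (class, record, and method names) is hypothetical, since his actual code is not shown in the thread; each page would back one InputSplit and map onto a `SELECT ... LIMIT ? OFFSET ?` query:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the paging logic behind a DB InputFormat's
// getSplits(): divide a table's rows into contiguous, non-overlapping
// pages, one page per split.
public class DbRangeSplitter {
    /** Offset (first row, inclusive) and row count for one split's query. */
    public record Page(long offset, long limit) {}

    public static List<Page> pages(long totalRows, int numSplits) {
        List<Page> out = new ArrayList<>();
        long base = totalRows / numSplits;
        long rem = totalRows % numSplits;
        long offset = 0;
        for (int i = 0; i < numSplits; i++) {
            // Spread any remainder over the first `rem` pages so sizes
            // differ by at most one row.
            long limit = base + (i < rem ? 1 : 0);
            out.add(new Page(offset, limit));
            offset += limit;
        }
        return out;
    }
}
```

With 1000 rows and 6 splits this yields four pages of 167 rows followed by two of 166, covering the table exactly once, so in the intended design each of the 6 splits would be read by its own map task.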