Terry, Russell,

Just a proposal: maybe it should be added to DBStorage?
http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/DBStorage.html

As far as I know it only stores data for now, but I think it can be extended to load and store, like PigStorage.
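To sketch what the load side could look like: something along these lines, built on Hadoop's DBInputFormat, might be a starting point. This is an untested illustration, not the actual Piggybank code - the class name, the two-column id/name schema, and the connection settings below are all made up.

// Untested sketch of a JDBC-backed LoadFunc built on Hadoop's DBInputFormat.
// The class names, the Row schema (id, name) and the connection settings are
// invented purely for illustration.
package example;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class DBLoader extends LoadFunc {

  // One row of the (hypothetical) table, as DBInputFormat hands it to us.
  public static class Row implements Writable, DBWritable {
    long id;
    String name;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong(1);
      name = rs.getString(2);
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setString(2, name);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      name = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeUTF(name);
    }
  }

  private final TupleFactory tupleFactory = TupleFactory.getInstance();
  private RecordReader reader;

  @Override
  public String relativeToAbsolutePath(String location, Path curDir) throws IOException {
    return location;  // the location is a table name, not an HDFS path
  }

  @Override
  public void setLocation(String location, Job job) throws IOException {
    // Placeholder connection settings; a real loader would take them as
    // constructor arguments from the script.
    DBConfiguration.configureDB(job.getConfiguration(),
        "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/mydb", "user", "password");
    DBInputFormat.setInput(job, Row.class, location, null, "id", "id", "name");
  }

  @Override
  public InputFormat getInputFormat() throws IOException {
    return new DBInputFormat<Row>();
  }

  @Override
  public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
    this.reader = reader;
  }

  @Override
  public Tuple getNext() throws IOException {
    try {
      if (!reader.nextKeyValue()) {
        return null;  // no more rows in this split
      }
      Row row = (Row) reader.getCurrentValue();
      Tuple t = tupleFactory.newTuple(2);
      t.set(0, row.id);
      t.set(1, row.name);
      return t;
    } catch (InterruptedException e) {
      throw new IOException(e);
    }
  }
}

A script would then point at it with something like LOAD 'mytable' USING example.DBLoader();, with the table name arriving in setLocation(). The real work for a general version would be deriving the schema and the tuple layout from the table metadata instead of hard-coding them.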
Ruslan

On Sat, Sep 1, 2012 at 3:03 AM, Russell Jurney <russell.jur...@gmail.com> wrote:
> That would be awesome - I will generalize it and blog about what a great
> person you are :D
>
> On Fri, Aug 31, 2012 at 3:12 PM, Terry Siu <terry....@datasphere.com> wrote:
>
>> Thanks, Russell, I'll dig in to your recommendations. I'd be happy to open
>> source it, but at the moment it's not exactly general enough. However, I
>> can certainly put it on github for your perusal.
>>
>> -Terry
>>
>> -----Original Message-----
>> From: Russell Jurney [mailto:russell.jur...@gmail.com]
>> Sent: Friday, August 31, 2012 3:03 PM
>> To: user@pig.apache.org
>> Subject: Re: Custom DB Loader UDF
>>
>> I don't have an answer, and I'm only learning these APIs myself, but
>> you're writing something I'm planning on writing very soon - a
>> MySQL-specific LoadFunc for Pig. I would greatly appreciate it if you would
>> open source it on github or contribute it to Piggybank :)
>>
>> The InputSplits should determine the number of mappers, but to debug you
>> might try forcing it by setting some properties in your script re: input
>> splits (see
>> https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/1QtL9bBwL0c):
>>
>> The input split size is determined by mapred.min.split.size, dfs.block.size
>> and mapred.map.tasks:
>>
>> goalSize = totalSize / mapred.map.tasks
>> minSize = max{mapred.min.split.size, minSplitSize}
>> splitSize = max(minSize, min(goalSize, dfs.block.size))
>>
>> minSplitSize is determined by each InputFormat, such as
>> SequenceFileInputFormat.
>>
>> I'd play around with those and see if you can get it doing what you want.
>>
>> On Fri, Aug 31, 2012 at 2:02 PM, Terry Siu <terry....@datasphere.com> wrote:
>>
>> > Hi all,
>> >
>> > I know this question has probably been posed multiple times, but I'm
>> > having difficulty figuring out a couple of aspects of a custom
>> > LoadFunc to read from a DB. And yes, I did try to Google my way to an
>> > answer. Anyhoo, for what it's worth, I have a MySQL table that I wish
>> > to load via Pig. I have the LoadFunc working using PigServer in a Java
>> > app, but I noticed the following when my job gets submitted to my MR
>> > cluster. I generated 6 InputSplits in my custom InputFormat, where
>> > each split specifies a non-overlapping range/page of records to read
>> > from. I thought that each InputSplit would correspond to a map task,
>> > but what I see in the JobTracker is that the submitted job only has 1
>> > map task, which executes each split serially. Is my understanding even
>> > correct that each split can effectively be assigned to its own map task?
>> > If so, can I coerce the submitted MR job to properly get each of my
>> > splits to execute in its own map task?
>> >
>> > Thanks,
>> > -Terry
>>
>> --
>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
>> datasyndrome.com
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com

--
Best Regards,
Ruslan Al-Fakikh
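As a footnote to the thread: here is the split-size formula from Russell's message worked through with made-up numbers (1 GB of input, a 64 MB block size, mapred.map.tasks = 6), just to show how the properties interact.

// Worked example of the split-size formula quoted in the thread above.
// The input size, block size, task count and minimum split sizes are
// made-up numbers for illustration.
public class SplitSizeExample {
  public static void main(String[] args) {
    long totalSize     = 1024L * 1024 * 1024; // 1 GB of input (assumed)
    long blockSize     = 64L * 1024 * 1024;   // dfs.block.size = 64 MB (assumed)
    long mapTasks      = 6;                   // mapred.map.tasks
    long configuredMin = 1;                   // mapred.min.split.size (assumed)
    long formatMin     = 1;                   // minSplitSize from the InputFormat (assumed)

    long goalSize  = totalSize / mapTasks;                             // ~170 MB
    long minSize   = Math.max(configuredMin, formatMin);               // 1
    long splitSize = Math.max(minSize, Math.min(goalSize, blockSize)); // 64 MB

    System.out.println("splitSize = " + splitSize + " bytes, about "
        + (totalSize / splitSize) + " splits");
  }
}

With those numbers splitSize comes out to 64 MB, i.e. about 16 splits for 1 GB of input. This sizing applies to FileInputFormat-style inputs; a custom InputFormat that builds its own splits, as in the original question, simply returns whatever its getSplits() produces.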