Hi Ruslan,

Yep, I've heard of Sqoop and had originally thought of using it, but wanted to 
give the LoaderFunc a try first. Regarding overriding setLocation, I'm not sure 
I understand how you're using it to cache your DB data to HDFS. Ultimately, the 
location is used (per the documentation) "so that the input format can get 
itself set up properly before reading". I figured that in my case it isn't 
necessary, so long as I pass the correct parameters to my InputFormat so that I 
can construct the splits and RecordReaders correctly. That works for me, and I 
can store my generated Tuples in HDFS. Can you elaborate on your comment?
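For reference, the kind of non-overlapping paging I use to build the splits is 
roughly the following (class and method names are illustrative, not my actual 
code):

```java
import java.util.ArrayList;
import java.util.List;

public class PageRanges {
    // One (offset, limit) pair per InputSplit; ranges do not overlap and
    // together cover all totalRows rows. Each pair would back a query like
    // "SELECT ... LIMIT offset, limit" in MySQL.
    static List<long[]> pageRanges(long totalRows, int numSplits) {
        List<long[]> ranges = new ArrayList<>();
        long pageSize = (totalRows + numSplits - 1) / numSplits; // ceiling division
        for (long offset = 0; offset < totalRows; offset += pageSize) {
            long limit = Math.min(pageSize, totalRows - offset);
            ranges.add(new long[] {offset, limit});
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : pageRanges(100, 6)) {
            System.out.println("offset=" + r[0] + " limit=" + r[1]);
        }
    }
}
```

Each range becomes one InputSplit, and the RecordReader for that split just 
pages through its own range.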

Thanks,
-Terry

-----Original Message-----
From: Ruslan Al-Fakikh [mailto:ruslan.al-fak...@jalent.ru] 
Sent: Friday, August 31, 2012 2:45 PM
To: user@pig.apache.org
Subject: Re: Custom DB Loader UDF

Hi Terry,

I am not sure whether your architecture is correct, but here is what we do in 
my team: we override setLocation in LoadFunc so that it caches the DB data to 
HDFS. That said, the simplest way is to copy the data from MySQL to HDFS with 
Sqoop and then read it with Pig as a normal input.
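Roughly, the Sqoop route looks like this (connection string, credentials, and 
paths below are placeholders, adjust for your environment):

```shell
# Copy the MySQL table into HDFS as delimited text.
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser -P \
  --table my_table \
  --target-dir /data/my_table \
  --num-mappers 6

# Then in Pig, read it like any other HDFS input:
#   data = LOAD '/data/my_table' USING PigStorage(',');
```

With --num-mappers you also get the import parallelism you want, since Sqoop 
splits the table across that many mappers itself.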

Ruslan

On Sat, Sep 1, 2012 at 1:02 AM, Terry Siu <terry....@datasphere.com> wrote:
> Hi all,
>
> I know this question has probably been posed multiple times, but I'm having 
> difficulty figuring out a couple of aspects of a custom LoaderFunc to read 
> from a DB. And yes, I did try to Google my way to an answer. Anyhoo, for what 
> it's worth, I have a MySql table that I wish to load via Pig. I have the 
> LoaderFunc working using PigServer in a Java app, but I noticed the following 
> when my job gets submitted to my MR cluster. I generated 6 InputSplits in my 
> custom InputFormat, where each split specifies a non-overlapping range/page 
> of records to read. I thought that each InputSplit would correspond to a map 
> task, but what I see in the JobTracker is that the submitted job has only 1 
> map task, which executes each split serially. Is my understanding correct 
> that each split should be assigned to its own map task? If so, how can I 
> coerce the submitted MR job to run each of my splits in its own map task?
>
> Thanks,
> -Terry
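P.S. Regarding the single-map-task question: if I remember correctly, Pig 
combines small input splits into one map task by default (controlled by 
pig.maxCombinedSplitSize), which would explain why your 6 splits run serially 
in one mapper. You could try disabling split combination in your script and 
see whether you get one map task per split:

```
SET pig.noSplitCombination true;
```

Property names are from memory, so please double-check them against the docs 
for your Pig version.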



--
Best Regards,
Ruslan Al-Fakikh
