What version of Hadoop are you using?
It may be that you are creating a new connection in each map call.

Create your connection in configure() and close it in close(), perhaps
committing every 1000 calls in the mapper.
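
A minimal sketch of that pattern against the old org.apache.hadoop.mapred
API (the JDBC config keys and the "records" table here are invented for
illustration, not anything your job already defines):

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ParseToMysqlMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

      private Connection conn;
      private PreparedStatement insert;
      private int pending = 0;

      @Override
      public void configure(JobConf job) {
        try {
          // One connection per task attempt, opened once -- not per map() call.
          Class.forName("com.mysql.jdbc.Driver");
          conn = DriverManager.getConnection(
              job.get("parse.jdbc.url"),      // invented config keys
              job.get("parse.jdbc.user"),
              job.get("parse.jdbc.password"));
          conn.setAutoCommit(false);          // we commit in batches ourselves
          insert = conn.prepareStatement(
              "INSERT INTO records (payload) VALUES (?)"); // invented table
        } catch (Exception e) {
          throw new RuntimeException("cannot open MySQL connection", e);
        }
      }

      public void map(LongWritable key, Text value,
          OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
          throws IOException {
        try {
          insert.setString(1, value.toString()); // parse the record here
          insert.addBatch();
          if (++pending >= 1000) {               // commit every 1000 calls
            insert.executeBatch();
            conn.commit();
            pending = 0;
          }
        } catch (SQLException e) {
          throw new IOException("insert failed", e);
        }
      }

      @Override
      public void close() throws IOException {
        try {
          insert.executeBatch();                 // flush the last partial batch
          conn.commit();
          conn.close();
        } catch (SQLException e) {
          throw new IOException("final commit failed", e);
        }
      }
    }

Each task attempt then holds exactly one connection, so your 8 concurrent
tasks need 8 connections instead of one per record.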

On Mon, Nov 16, 2009 at 3:33 PM, yz5od2 <woods5242-outdo...@yahoo.com> wrote:

> Thanks all for the replies, that makes sense. I think I am allocating
> connection resources per-mapper, instead of per-task.
>
> How do I programmatically allocate a "pool" or shared resource for a task,
> that all Mapper instances can have access to?
>
> 1) I have 4 nodes, each node has a map capacity of 2 for a total of 8 tasks
> running simultaneously. The job I am running is queuing up ~950 tasks that
> need to be done.
>
> 2) The MySQL server I am connecting to is configured to permit 300
> connections.
>
> 3) When a Mapper instance starts, right now each mapper instance is
> handling its own connections. Obviously this is my problem, as each task
> must be spinning up dozens/hundreds of mapper instances to process the task
> (is that right? or does one mapper instance process an entire split?). I
> need to move this to the "task", but this is where I need some pointers on
> where to look.
>
> When I submit my job is there some way to say:
>
>
> jobConf.setTaskHandlingClass(SomeClassThatCreatesThePoolThatTaskMapperInstancesAccess.class)
>
> ??
>
>        -
>
>
> On Nov 15, 2009, at 7:57 PM, Jeff Zhang wrote:
>
>> Each map task will run in a separate JVM, so you should create a
>> connection pool for each task, and all the mapper instances in one task
>> will share the same connection pool.
>>
>> Another suggestion is that you can use JNDI to manage the connections. It
>> can be shared by all the map tasks in your cluster.
>>
>>
>> Jeff Zhang
>>
>> On Mon, Nov 16, 2009 at 8:52 AM, yz5od2 <woods5242-outdo...@yahoo.com>
>> wrote:
>>
>>> Hi,
>>>
>>> a) I have a Mapper-only job: it reads in records, then parses them
>>> apart. There is no reduce phase.
>>>
>>> b) I would like this mapper job to save each record into a shared MySQL
>>> database on the network.
>>>
>>> c) I am running a 4-node cluster and obviously running out of
>>> connections very quickly; that is something I can work on from the db
>>> server side.
>>>
>>> What I am trying to understand is: for each mapper task instance that is
>>> processing an input split... does that run in its own classloader? I am
>>> trying to figure out how to manage a connection pool on each processing
>>> node, so that all mapper instances would use it to get access to the
>>> database. Right now it appears that each node is creating thousands of
>>> mapper instances, each with its own connection management, hence this is
>>> blowing up quite quickly. I would like the connection management to live
>>> separately from the mapper instances on each node.
>>>
>>> I hope I am explaining what I want to do OK; please let me know if anyone
>>> has any thoughts, tips, best practices, features I should look at, etc.
>>>
>>> thanks
>>>
>
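
On the "pool that all Mapper instances can have access to" question: I do
not believe there is a JobConf hook like the setTaskHandlingClass you
sketch. Since each map task attempt runs in its own JVM, the pool only
needs to be a per-JVM singleton that configure() reaches for. A rough
sketch, assuming Commons DBCP is on the task classpath (the TaskDbPool
class name and config keys are invented for illustration):

    import javax.sql.DataSource;
    import org.apache.commons.dbcp.BasicDataSource;
    import org.apache.hadoop.mapred.JobConf;

    // Invented helper: one lazily created pool per JVM, i.e. per task,
    // shared by every mapper instance that JVM runs.
    public final class TaskDbPool {

      private static BasicDataSource pool;

      public static synchronized DataSource get(JobConf job) {
        if (pool == null) {
          pool = new BasicDataSource();
          pool.setDriverClassName("com.mysql.jdbc.Driver");
          pool.setUrl(job.get("parse.jdbc.url"));       // invented keys
          pool.setUsername(job.get("parse.jdbc.user"));
          pool.setPassword(job.get("parse.jdbc.password"));
          // Keep it tiny: 8 concurrent tasks x 2 = 16 connections,
          // far below your 300-connection MySQL limit.
          pool.setMaxActive(2);
        }
        return pool;
      }

      private TaskDbPool() {}
    }

Call TaskDbPool.get(job) from configure() and return connections to the
pool in close(); however many mapper instances run in that JVM, they all
draw from the same small pool.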


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
