An alternative thought: in addition to the key/value input interface provided by Hama, each process (within the bsp() function) could read data directly from an external source using a Reader-style class; the processes might then need something like ZooKeeper to coordinate which files each one handles.
FYI

On 26 May 2015 at 06:43, Edward J. Yoon <[email protected]> wrote:
> Hi,
>
> Currently the task capacity of the cluster should be larger than the number
> of blocks or files in the input dataset. The alternative is to merge them
> into one file using the hadoop fs -getmerge command.
>
> --
> Best Regards, Edward J. Yoon
>
> -----Original Message-----
> From: Behroz Sikander [mailto:[email protected]]
> Sent: Tuesday, May 26, 2015 1:14 AM
> To: [email protected]
> Subject: Hama partition 1000 files on 3 tasks/machine
>
> Hi,
> I have a problem regarding data partitioning but was not able to find any
> solution online.
>
> Problem: I have around 1000 files that I want to process using Hama. Each
> file has the same schema/structure but different data. How can I divide
> these files across my cluster? I mean, if I have 3 tasks/machines, then each
> task should process around 333 files.
>
> So:
> 1- How can I take a thousand files as input in Hama? With my current
> understanding, Hama will open 1000 tasks (1 task for each file).
> 2- How can I divide the files across different machines (a custom
> partitioner, maybe)?
> 3- If this approach is not supported, then what would be an alternative
> approach to solving this?
>
> Regards,
> Behroz Sikander
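To illustrate the idea of dividing ~1000 files across a fixed number of tasks, here is a minimal, self-contained sketch of round-robin file assignment. This is not Hama API code: in a real job the peer index and peer count would come from the BSPPeer object passed into bsp() (methods along the lines of getPeerIndex()/getNumPeers(); check the API of the Hama version you run), and the file names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class FileAssignment {
    // Round-robin assignment: file i is handled by peer (i % numPeers),
    // so every peer gets roughly allFiles.size() / numPeers files.
    static List<String> filesForPeer(List<String> allFiles, int peerIndex, int numPeers) {
        List<String> mine = new ArrayList<>();
        for (int i = 0; i < allFiles.size(); i++) {
            if (i % numPeers == peerIndex) {
                mine.add(allFiles.get(i));
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        // Hypothetical list of 1000 input files.
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            files.add("input/part-" + i);
        }
        // With 3 peers, each peer ends up with 333 or 334 files.
        for (int peer = 0; peer < 3; peer++) {
            System.out.println("peer " + peer + " -> "
                    + filesForPeer(files, peer, 3).size() + " files");
        }
    }
}
```

Because every peer derives its share deterministically from its own index, no ZooKeeper-style coordination is needed for the assignment itself; coordination would only matter if the file list changed while the job ran.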
