Hi Ted

Thanks for the info :)

We do not know yet how the input data is going to be generated,
because the data will be observations from a satellite that is still
being designed.  But we could try to make every binary file as big as
a block (then we need to decide the block size), or, if the files are
bigger than a block, split them before adding the data to the
cluster.  But, as you said, each file should be large enough that its
computation lasts more than 10 seconds.
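
For the one-file-per-block option, something like this is what we had
in mind (just a sketch, not tested; the class and method names
ObservationLoader/copyIn are made up, and the replication and buffer
values are placeholders).  FileSystem.create accepts a per-file block
size, so a file written this way occupies a single block:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical loader: copy one observation file into HDFS with the
// block size rounded up to the file length, so the file never spans
// more than one block.
public class ObservationLoader {
  public static void copyIn(String localFile, String hdfsDest, long fileLength)
      throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // The block size must be a multiple of io.bytes.per.checksum
    // (512 bytes by default), so round the file length up.
    long blockSize = ((fileLength + 511) / 512) * 512;

    InputStream in = new FileInputStream(localFile);
    FSDataOutputStream out = fs.create(
        new Path(hdfsDest),
        true,          // overwrite
        4096,          // buffer size (placeholder)
        (short) 3,     // replication (placeholder)
        blockSize);    // one block per file
    IOUtils.copyBytes(in, out, 4096, true);  // closes both streams
  }
}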

Then each map task will process the files that are stored locally on
its node (does the framework control this?) <location transparency>

All this is fine. We already have a grid solution with agents on
every node polling for jobs. Each job sent to a node computes 1-n
files (which could be zipped) of simulated data.

One solution is to move to map/reduce and let the framework do the
distribution of tasks and data.
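
To make your InputFormat suggestion concrete for ourselves, this is
roughly what we understood (a sketch against the
org.apache.hadoop.mapred API, not tested against our data;
WholeFileInputFormat is just our own name for it).  It marks each
file as non-splittable and hands the whole file to one mapper, so the
framework can schedule that map task on a node holding the file's
block:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Non-splittable InputFormat: each binary file becomes exactly one
// map task, and the whole file reaches the mapper as a single
// BytesWritable value.
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;  // never split a file into block-sized pieces
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  static class WholeFileRecordReader
      implements RecordReader<NullWritable, BytesWritable> {

    private final FileSplit split;
    private final JobConf conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf conf) {
      this.split = split;
      this.conf = conf;
    }

    public boolean next(NullWritable key, BytesWritable value) throws IOException {
      if (processed) {
        return false;
      }
      // Read the entire file into the value on the first (and only) call.
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? split.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() { }
  }
}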


Another thing we want to consider is to make our simple grid aware
of the data location, in order to move the task to the node which
contains the data: a way of getting the hostname where the file's
blocks are, and then calling the dfs API from that node.
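
A minimal sketch of that lookup, assuming a Hadoop version that
exposes FileSystem.getFileBlockLocations (the class name
BlockHostLookup is made up).  Our agent could call this before
dispatching a job and pick one of the returned hosts:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Ask HDFS which hosts hold the blocks of a file, so our own grid
// scheduler can send the job to one of those nodes.
public class BlockHostLookup {
  public static String[] hostsFor(String pathInDfs) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path(pathInDfs);
    FileStatus status = fs.getFileStatus(path);

    // For a file that fits in one block this returns a single entry;
    // larger files get one BlockLocation per block.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    return blocks.length > 0 ? blocks[0].getHosts() : new String[0];
  }

  public static void main(String[] args) throws Exception {
    for (String host : hostsFor(args[0])) {
      System.out.println(host);
    }
  }
}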

Cheers
Alfonso


On 17/03/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>
>  This sounds very different from your earlier questions.
>
>  If you have a moderate (10's to 1000's) number of binary files, then it is
>  very easy to write a special purpose InputFormat that tells hadoop that the
>  file is not splittable.  This allows you to add all of the files as inputs
>  to the map step and you will get the locality that you want.  The files
>  should be large enough so that you take at least 10 seconds or more
>  processing them to get good performance relative to startup costs.  If they
>  are not, then you may want to package them up in a form that can be read
>  sequentially.  This need not be splittable, but it would be nice if it were.
>
>  If you are producing a single file per hour, then this style works pretty
>  well.  In my own work, we have a few compressed and encrypted files each
>  hour that are map-reduced into a more congenial and splittable form each
>  hour.  Then subsequent steps are used to aggregate or process the data as
>  needed.
>
>  This gives you all of the locality that you were looking for.
>
>
>  On 3/17/08 6:19 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
>  wrote:
>
>
>  > Hi there.
>  >
>  > After reading a bit about the Hadoop framework and trying the WordCount
>  > example, I have several doubts about how to use map/reduce with
>  > binary files.
>  >
>  > In my case binary files are generated on a timeline basis, let's say
>  > 1 file per hour. The size of each file is different (briefly, we are
>  > getting pictures from space and the star density is different between
>  > observations). The mappers, rather than receiving the file content,
>  > should receive the file name.  I read that if the input files
>  > are big (several blocks), they are split among several tasks on the
>  > same/different node/s (block sizes?).  But we want each map task to
>  > process a file rather than a block (or a line of a file as in the
>  > WordCount sample).
>  >
>  > In a previous post to this forum, I was recommended to use an
>  > input file with all the file names, so the mappers would receive the
>  > file name. But there is a drawback related to data location (this was
>  > also mentioned), because data then has to be moved from one node
>  > to another.  Data is not going to be replicated to all the nodes.  So
>  > a task taskA that has to process fileB on nodeN has to be executed
>  > on nodeN. How can we achieve that???  What if a task requires a file
>  > that is on another node? Does the framework move the logic to that
>  > node?  We need to define a URI file map in each node
>  > (hostname/path/filename) for all the files. Tasks would access the
>  > local URI file map in order to process the files.
>  >
>  > Another approach we have thought of is to use the distributed file
>  > system to load-balance the data among the nodes, and have our processes
>  > running on every node (without using the map/reduce framework). Then
>  > each process accesses the local node to process the data,
>  > using the dfs API (or checking the local URI file map).  This approach
>  > would be more flexible for us, because depending on the machine
>  > (quad-core, dual-core) we know how many Java threads we can run in order
>  > to get the maximum performance out of the machine.  Using the framework
>  > we can only set a number of tasks to be executed on every node, but then
>  > all the nodes have to be the same.
>  >
>  > URI file map.
>  > Once the files are copied to the distributed file system, we need
>  > to create this map table. Or is there a way to access a <directory> at
>  > the data node and retrieve only the files it handles, rather than
>  > getting all the files in all the nodes in that <directory>?  i.e.
>  >
>  > NodeA  /tmp/.../mytask/input/fileA-1
>  >             /tmp/.../mytask/input/fileA-2
>  >
>  > NodeB /tmp/.../mytask/input/fileB
>  >
>  > A process at nodeB listing the /tmp/.../input directory would get only
>  > fileB
>  >
>  > Any ideas?
>  > Thanks
>  > Alfonso.
>
>
