/2010_06_10/81ae7c24745211df9f6d002590008422 already existed
(because my script created it the first time).
Still odd behavior; whatever it is, it is fine, and sorry to bother you. I just added a step to
my automation script to remove the directory before I do a copyToLocal.
On Thu, Jun 10, 2010 at 7:12 PM, Joseph
Hi, so I am using copyToLocal through an automation script we have and
seeing odd results.
I am not sure if this is something I am doing wrong, a defect, or if there is a known good
reason for it. Let me know; I would like to correct this either in my own
script, or I am happy to give fixing a bug in the fs code a try.
1) Only the namenode is "formatted"; what happens is basically that the
image file is created and prepped. The image file holds the metadata
about how your files are stored on the cluster.
2) The datanodes are not formatted in the conventional sense. Their
(datanode) disk usage will grow only when data blocks are written to them.
You need to either report status or increment a counter from within
your task. In your Java code there is a little trick to help the job
be “aware” within the cluster of tasks that are not dead but just
working hard: during execution of a task there is no built-in
reporting that the job is still running, so you have to do it yourself.
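A sketch of that keep-alive pattern. In real Hadoop code the calls are `reporter.progress()` / `reporter.incrCounter(...)` (old API) or `context.progress()` (new API); the `Reporter` interface and `CountingReporter` below are minimal stand-ins so the shape of the loop is visible without a cluster, and `REPORT_EVERY` is an arbitrary illustrative interval.

```java
import java.util.HashMap;
import java.util.Map;

// Keep-alive sketch: Reporter is a stand-in for Hadoop's reporter/context;
// the real framework kills tasks that stay silent past mapred.task.timeout.
class KeepAlive {

    interface Reporter {
        void progress();                                   // tells the framework "not dead, just busy"
        void incrCounter(String group, String name, long amount);
    }

    static class CountingReporter implements Reporter {
        int progressCalls = 0;
        final Map<String, Long> counters = new HashMap<>();
        public void progress() { progressCalls++; }
        public void incrCounter(String group, String name, long amount) {
            counters.merge(group + ":" + name, amount, Long::sum);
        }
    }

    // Hypothetical interval: ping the framework every 1000 records so a
    // long-running task is not mistaken for a hung one.
    static final int REPORT_EVERY = 1000;

    static void process(String[] records, Reporter reporter) {
        int seen = 0;
        for (String record : records) {
            // ... expensive per-record work would happen here ...
            seen++;
            if (seen % REPORT_EVERY == 0) {
                reporter.progress();
                reporter.incrCounter("MyJob", "RECORDS_PROCESSED", REPORT_EVERY);
            }
        }
    }
}
```

The same idea works from streaming jobs by writing a `reporter:counter:group,name,amount` line to stderr.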
Hadoopers & Hadooperets, I wanted to see if any folks would be
interested in being a guest on a new podcast we ( www.medialets.com )
are very seriously thinking about producing & hosting specifically to
talk about Hadoop.
This is still in the pre-production phase, but we are starting to firm it
up as
> …table in Step 3, and in Step 2, you just use table.exists(hash-key) to check if it is a
> dup. You still need Step 1 to populate the table with your historical data.
>
> Hope this helps
>
> Cheers,
> jp
>
>
> -----Original Message-----
> From: Joseph Stein [mail
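A runnable model of the `table.exists(hash-key)` check jp describes. In the real flow the store would be an HBase table (Step 1 bulk-loads the historical hashes); `DedupTable` here is a HashSet-backed stand-in, and `hashKey`/`isNew` are hypothetical helper names, so only the shape of the check is being shown.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

// Stand-in for an HBase table of previously-seen record hashes.
class DedupCheck {

    static class DedupTable {
        private final Set<String> keys = new HashSet<>();
        boolean exists(String hashKey) { return keys.contains(hashKey); }
        void put(String hashKey) { keys.add(hashKey); }
    }

    // Hash the record to a fixed-size key (MD5 here; any stable digest works).
    static String hashKey(String record) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(record.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 ships with every JDK
        }
    }

    // Step 2: skip the record if its hash is already in the table;
    // otherwise remember it (Step 3) and let it through.
    static boolean isNew(String record, DedupTable table) {
        String key = hashKey(record);
        if (table.exists(key)) return false;   // seen before: a dup
        table.put(key);                        // remember it for next time
        return true;
    }
}
```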
> …you can loop
> through them. As the simplest solution, you just take the first one.
>
> Sincerely,
> Mark
>
> On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein wrote:
>
>> I have been researching ways to handle de-dupping data while running a
>> map/reduce program (so
I have been researching ways to handle de-dupping data while running a
map/reduce program (so as not to re-calculate/re-aggregate data that
we have seen before, possibly months before).
The data sets we have are littered with repeats of data from mobile
devices, which continue to come in over time
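The reduce-side version of Mark's suggestion can be sketched in plain Java: the map phase keys each record by a fingerprint, the shuffle groups equal keys, and the reducer loops through each group's values and takes the first one. `LinkedHashMap` stands in for Hadoop's shuffle here so the flow runs without a cluster, and keying on the whole record is an illustrative choice — a real job might hash selected fields instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Model of reduce-side dedup: group by key, emit the first value per group.
class ReduceSideDedup {

    static List<String> dedup(List<String> records) {
        // "map" + "shuffle": group records by key (here the record itself)
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String record : records) {
            grouped.computeIfAbsent(record, k -> new ArrayList<>()).add(record);
        }
        // "reduce": loop through the values for a key, take the first one
        List<String> out = new ArrayList<>();
        for (List<String> values : grouped.values()) {
            out.add(values.get(0));
        }
        return out;
    }
}
```

This only dedups within one job's input; repeats arriving months apart still need an external lookup (the HBase-style table discussed above) because a single map/reduce run never sees the historical records.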