Re: copy data from one hadoop cluster to another hadoop cluster + can't use distcp

2015-06-19 Thread Joep Rottinghuis
You can't set up a proxy?
You probably want to avoid writing to the local file system, because aside from
being slow, it limits the size of your file to the free space on your local
disk.

If you do need to go commando and go through a single client machine that can
see both clusters, you probably want to pipe a get to a put.
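For example, from a box that can reach both namenodes (addresses and paths here
are made up), something along these lines:

  # stream straight from cluster 1 into cluster 2, nothing written locally
  hadoop fs -cat hdfs://nn1.example.com:8020/data/part-00000 | \
    hadoop fs -put - hdfs://nn2.example.com:8020/data/part-00000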

Any kind of serious data volume pulled through a "straw" is going to be rather
slow, though.

Cheers,

Joep

Sent from my iPhone

> On Jun 19, 2015, at 12:09 AM, Nitin Pawar  wrote:
> 
> yes 
> 
>> On Fri, Jun 19, 2015 at 11:36 AM, Divya Gehlot  
>> wrote:
>> In that case it will be like a three-step process (commands sketched below):
>> 1. first cluster (secure zone) HDFS -> copyToLocal -> user local file system
>> 2. user local space -> copy data -> second cluster user local file system
>> 3. second cluster user local file system -> copyFromLocal -> second cluster HDFS
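>> Something like this, with hosts and paths purely as placeholders:
>> 
>>   # on a client of cluster 1
>>   hadoop fs -copyToLocal /secure/data/part-00000 /tmp/part-00000
>>   # ship it over to a client machine of cluster 2
>>   scp /tmp/part-00000 user@cluster2-client:/tmp/part-00000
>>   # on the client of cluster 2
>>   hadoop fs -copyFromLocal /tmp/part-00000 /landing/data/part-00000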
>> 
>> Am I on the right track?
>> 
>>  
>> 
>>> On 19 June 2015 at 12:38, Nitin Pawar  wrote:
>>> What's the size of the data?
>>> If you cannot do distcp between the clusters, then the other way is doing an
>>> hdfs get on the data and then an hdfs put on the other cluster
>>> 
 On 19-Jun-2015 9:56 am, "Divya Gehlot"  wrote:
 Hi,
 I need to copy data from the first hadoop cluster to the second hadoop cluster.
 I can't access the second hadoop cluster from the first hadoop cluster due to a
 security issue.
 Can anyone point me to how I can do this apart from the distcp command?
 For instance:
 Cluster 1 (secured zone) -> copy HDFS data to -> cluster 2 (non-secured zone)
  
 
 
 Thanks,
 Divya
> 
> 
> 
> -- 
> Nitin Pawar


Re: Question about log files

2015-04-06 Thread Joep Rottinghuis
This depends on your OS.
When you "delete" a file on Linux, you merely unlink the entry from the 
directory.
The file does not actually get deleted until the last reference (open
handle) goes away. Note that this could lead to an interesting way to fill up a
disk.
You should be able to see the files a process still holds open with the lsof command.
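For example (the pid placeholder is whichever daemon you are looking at):

  # unlinked-but-open files show up with "(deleted)" in the lsof output
  lsof -p <namenode_pid> | grep '(deleted)'
  # or list every open file whose link count has dropped to zero
  lsof +L1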
The process itself does not know that a dentry has been removed, so there is 
nothing that log4j or the Hadoop code can do about it.
Assuming you have some rolling file appender configured, log4j should start
logging to a new file at some point, or you have to bounce your daemon process.
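For reference, the daily rolling appender that ships in Hadoop's log4j.properties
looks roughly like this (names and pattern can differ per version/distribution):

  log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
  log4j.appender.DRFA.File=${hadoop.log.dir}/${hadoop.log.file}
  # roll once a day; a fresh file gets created at the next roll
  log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
  log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
  log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n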

Cheers,

Joep

Sent from my iPhone

> On Apr 6, 2015, at 6:19 AM, Fabio C.  wrote:
> 
> I noticed that too. I think Hadoop keeps the file open all the time, and when 
> you delete it, it is just no longer able to write to it and doesn't try to 
> recreate it. Not sure if it's a Log4j problem or a Hadoop one...
> yanghaogn, what is the *correct* way to delete the Hadoop logs? I didn't 
> find anything better than deleting the file and restarting the service...
> 
>> On Mon, Apr 6, 2015 at 9:27 AM, 杨浩  wrote:
>> I think the log information has been lost.
>> 
>> Hadoop is not designed for that; you deleted these files incorrectly.
>> 
>> 2015-04-02 11:45 GMT+08:00 煜 韦 :
>>> Hi there,
>>> If log files are deleted without restarting the service, it seems that the logs 
>>> are lost for later operation, for example on the namenode or datanode.
>>> Why can't log files be re-created when they are deleted, by mistake or on purpose, 
>>> while the cluster is running?
>>> 
>>> Thanks,
>>> Jared
> 


Re: Fair Scheduler of Hadoop

2013-01-21 Thread Joep Rottinghuis
You could configure it like that if you wanted. Keep in mind that would waste 
some resources. Imagine a 10-minute task that has been running for 9 minutes. 
If you have that task killed immediately then it would have to be re-scheduled 
and redo all 10 minutes.
Give it another minute and the task is complete and out of the way.

So, consider how busy your cluster is overall and how long you are willing to 
wait for fairness, trading this off against a certain amount of wasted work.
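For the old (MR1) fair scheduler that blog post describes, that wait is set per
pool in the allocations file, roughly like this (pool name and numbers purely
illustrative):

  <?xml version="1.0"?>
  <allocations>
    <pool name="reports">
      <minMaps>20</minMaps>
      <minReduces>10</minReduces>
      <!-- seconds below the min share before tasks elsewhere get preempted -->
      <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
    </pool>
    <!-- seconds below half the fair share before preemption kicks in -->
    <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
  </allocations>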

Cheers,

Joep

Sent from my iPhone

On Jan 21, 2013, at 9:30 AM, Lin Ma  wrote:

> Hi Joep,
> 
> Excellent answer! I think you have answered my confusions. And one remaining 
> issue after reading this document again, even though it is old. :-)
> 
> It is mentioned, "which will allow you to set how long each pool will wait 
> before preempting other jobs’ tasks to reach its guaranteed capacity"; my 
> question is why each pool needs to wait here. If a pool cannot get its guaranteed 
> capacity because jobs in other pools overuse the capacity, shouldn't we kill 
> such jobs immediately? I would appreciate it if you could elaborate a bit more 
> on why we need to wait even for the guaranteed capacity.
> 
> regards,
> Lin
> 
> On Mon, Jan 21, 2013 at 8:24 AM, Joep Rottinghuis  
> wrote:
>> Lin,
>> 
>> The article you are reading is old.
>> Fair scheduler does have preemption.
>> Tasks get killed and rerun later, potentially on a different node.
>> 
>> You can set a minimum / guaranteed capacity. The sum of those across pools 
>> would typically equal the total capacity of your cluster or less.
>> Then you can configure each pool to go beyond that capacity. That would 
>> happen if the cluster is temporarily not used to its full capacity.
>> Then when the demand for capacity increases, and jobs are queued in other 
>> pools that are not running at their minimum guaranteed capacity, some long 
>> running tasks from jobs in the pool that is using more than its minimum 
>> capacity get killed (to be run later again).
>> 
>> Does that make sense?
>> 
>> Cheers,
>> 
>> Joep
>> 
>> Sent from my iPhone
>> 
>> On Jan 20, 2013, at 6:25 AM, Lin Ma  wrote:
>> 
>>> Hi guys,
>>> 
>>> I have a quick question regarding the fair scheduler of Hadoop. I am reading 
>>> this article => 
>>> http://blog.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/, and my 
>>> question is about the following statement: "There is currently no support 
>>> for preemption of long tasks, but this is being added in HADOOP-4665, which 
>>> will allow you to set how long each pool will wait before preempting other 
>>> jobs’ tasks to reach its guaranteed capacity.".
>>> 
>>> My questions are,
>>> 
>>> 1. What does "preemption of long tasks" mean? Killing long running tasks, pausing 
>>> long running tasks to give resources to other tasks, or something 
>>> else?
>>> 2. I am also confused about "set how long each pool will wait before 
>>> preempting other jobs’ tasks to reach its guaranteed capacity". What does 
>>> "reach its guaranteed capacity" mean? I think when using the fair scheduler, each 
>>> pool has predefined resource allocation settings (and the settings 
>>> guarantee each pool has the resources as configured); is that true? In what 
>>> situations would a pool not have its guaranteed (or configured) capacity?
>>> 
>>> regards,
>>> Lin
> 


Re: Fair Scheduler of Hadoop

2013-01-20 Thread Joep Rottinghuis
Lin,

The article you are reading is old.
Fair scheduler does have preemption.
Tasks get killed and rerun later, potentially on a different node.

You can set a minimum / guaranteed capacity. The sum of those across pools 
would typically equal the total capacity of your cluster or less.
Then you can configure each pool to go beyond that capacity. That would happen 
if the cluster is temporarily not used to its full capacity.
Then when the demand for capacity increases, and jobs are queued in other pools 
that are not running at their minimum guaranteed capacity, some long running 
tasks from jobs in the pool that is using more than its minimum capacity get 
killed (to be run later again).
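To make that concrete, a purely illustrative MR1 allocations file with two pools
whose minimum shares together fit within the cluster's slots; either pool may
grow past its minimum while the other is idle:

  <?xml version="1.0"?>
  <allocations>
    <pool name="etl">
      <minMaps>60</minMaps>
      <minReduces>20</minReduces>
    </pool>
    <pool name="adhoc">
      <minMaps>40</minMaps>
      <minReduces>10</minReduces>
    </pool>
  </allocations>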

Does that make sense?

Cheers,

Joep

Sent from my iPhone

On Jan 20, 2013, at 6:25 AM, Lin Ma  wrote:

> Hi guys,
> 
> I have a quick question regarding the fair scheduler of Hadoop. I am reading 
> this article => 
> http://blog.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/, and my question 
> is about the following statement: "There is currently no support for 
> preemption of long tasks, but this is being added in HADOOP-4665, which will 
> allow you to set how long each pool will wait before preempting other jobs’ 
> tasks to reach its guaranteed capacity.".
> 
> My questions are,
> 
> 1. What does "preemption of long tasks" mean? Killing long running tasks, pausing 
> long running tasks to give resources to other tasks, or something 
> else?
> 2. I am also confused about "set how long each pool will wait before 
> preempting other jobs’ tasks to reach its guaranteed capacity". What does 
> "reach its guaranteed capacity" mean? I think when using the fair scheduler, each pool 
> has predefined resource allocation settings (and the settings guarantee 
> each pool has the resources as configured); is that true? In what situations 
> would a pool not have its guaranteed (or configured) capacity?
> 
> regards,
> Lin


Re: Fastest way to transfer files

2012-12-28 Thread Joep Rottinghuis
Not sure why you are implying a contradiction when you say: "... distcp is 
useful _but_ you want to do 'it' in java..."

First of all distcp _is_ written in Java.
You can call distcp or any other MR job from Java just fine.
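For the chunk-by-chunk part of the question below, a minimal sketch using the
plain FileSystem API (cluster addresses and paths invented for the example) that
streams a single file across without ever materializing the whole file anywhere.
For any real volume you still want distcp, since it spreads the copy over many
map tasks.

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class StreamOneFile {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // one FileSystem handle per cluster
      FileSystem src = FileSystem.get(URI.create("hdfs://nn1.example.com:8020"), conf);
      FileSystem dst = FileSystem.get(URI.create("hdfs://nn2.example.com:8020"), conf);
      try (FSDataInputStream in = src.open(new Path("/data/part-00000"));
           FSDataOutputStream out = dst.create(new Path("/data/part-00000"))) {
        IOUtils.copyBytes(in, out, 64 * 1024, false); // copy in 64 KB buffers
      }
    }
  }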

Cheers,

Joep

Sent from my iPhone

On Dec 28, 2012, at 12:01 PM, burakkk  wrote:

> Hi,
> I have two different hdfs clusters. I need to transfer files between these 
> environments. What's the fastest way to transfer files in that situation? 
> 
> I've researched this and found the distcp command. It's useful, but I want to 
> do it in Java, so is there any way to do this?
> 
> Is there any way to transfer files chunk by chunk from one hdfs cluster to 
> another one, or is there any way to implement a process that works on chunks 
> without needing the whole file?
> 
> Thanks
> Best Regards...
> 
> -- 
> BURAK ISIKLI | http://burakisikli.wordpress.com
>