How to delete files older than X days in HDFS/Hadoop

2011-11-26 Thread Raimon Bosch
Hi,

I'm wondering how to delete files older than X days with HDFS/Hadoop. On
Linux we can do it with the following command:

find ~/datafolder/* -mtime +7 -exec rm {} \;

Any ideas?
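
The closest I can think of is going through the FileSystem API — a rough
sketch, untested (the class name and the 7-day cutoff are only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCleaner {
  public static void main(String[] args) throws Exception {
    // cutoff: anything modified before this instant is "older than 7 days"
    long cutoff = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;
    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus status : fs.listStatus(new Path("/datafolder"))) {
      // getModificationTime() is milliseconds since the epoch
      if (!status.isDir() && status.getModificationTime() < cutoff) {
        fs.delete(status.getPath(), false); // false = do not recurse
      }
    }
  }
}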


How to assign a mapreduce process to a specific pool

2011-10-25 Thread Raimon Bosch
Hi all,

I'm trying to launch a process that maps our log data into Hive tables, and I
was trying to assign this job to a specific pool name. These are our
configuration files:

---- allocations.xml ----
<?xml version="1.0"?>
<allocations>

  <pool name="log2hive">
    <minMaps>14</minMaps>
    <minReduces>6</minReduces>
    <maxRunningJobs>1</maxRunningJobs>
    <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
    <weight>1.0</weight>
  </pool>

  <defaultMinSharePreemptionTimeout>600</defaultMinSharePreemptionTimeout>
  <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>

</allocations>

---- mapred-site.xml ----
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>

  <property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>/hadoop/conf/allocations.xml</value>
  </property>

My question is: how can I launch jobs in this new pool? I'm using the `hadoop
jar` command, but I would also be interested in assigning pools from my code.
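
From code, what I would expect to work is something like this — a sketch,
assuming the Fair Scheduler build we run honors the mapred.fairscheduler.pool
property; otherwise the jobconf property named by
mapred.fairscheduler.poolnameproperty has to be set instead:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(Log2HiveJob.class); // Log2HiveJob is just a placeholder name
// Assumption: this scheduler version reads mapred.fairscheduler.pool directly;
// older ones only look at the property configured via
// mapred.fairscheduler.poolnameproperty (which defaults to user.name).
conf.set("mapred.fairscheduler.pool", "log2hive");
// ... input/output paths, mapper, reducer, formats ...
JobClient.runJob(conf);

From the `hadoop jar` command line, the same property should be settable with
-D mapred.fairscheduler.pool=log2hive, provided the job's main class parses
generic options (ToolRunner/GenericOptionsParser), but I haven't verified it.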


Thanks in advance,


Re: why is one of the reducers always slower?

2011-10-24 Thread Raimon Bosch
The answer is yes. I have checked my code, and I was generating one map key
for each table when I didn't need to.

Now I'm generating keys that include the name of the table and a unique id.
That information is used by MultipleOutputFormat to generate the proper
output for each table.
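
In code the idea looks roughly like this (a sketch with illustrative names,
not the actual job):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Map keys carry "tableName|uniqueId", so records for one table no longer
// collapse onto a single reducer; the output format strips the id again and
// keeps only the table name when choosing the output file.
public class PerTableOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    String table = key.toString().split("\\|", 2)[0];
    return table + "/" + name;
  }
}

// in the mapper:
//   output.collect(new Text(tableName + "|" + recordId), new Text(hiveRow));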

2011/10/23 Raimon Bosch raimon.bo...@gmail.com

 Thanks for your help,

 In fact, I'm using MultipleOutputFormat to generate one file for each Hive
 table, and in this case I'm generating only one of the possible Hive tables.
 Can I use MultipleOutputFormat and still distribute my keys over the whole
 cluster?

 2011/10/23 Ayon Sinha ayonsi...@yahoo.com

 Looks like that is the reducer that is actually doing the work, with 14M
 input records.


  Reduce input groups 1
  Combine output records 0
  Reduce shuffle bytes 5,135,004,496
  Reduce output records 14,232,592
  Spilled Records 14,232,592
  Combine input records 0
  Reduce input records 14,232,592



 Other reducers have this:
 Reduce output records 0
 Spilled Records 0
 Combine input records 0
 Reduce input records 0

 -Ayon
 See My Photos on Flickr
 Also check out my Blog for answers to commonly asked questions.



 
 From: Raimon Bosch raimon.bo...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Saturday, October 22, 2011 6:01 PM
 Subject: why is one of the reducers always slower?

 Hi all,

 I'm executing a job that converts logs into Hive tables. The times are very
 good once we add a proper number of nodes, but the reduce phase always
 takes more time on one of the machines.

 task_id                       progress  start time            finish time                          counters
 task_201110211442_0086_r_00   100.00%   23-Oct-2011 00:26:42  23-Oct-2011 00:28:09 (1mins, 27sec)   9
 task_201110211442_0086_r_01   100.00%   23-Oct-2011 00:26:42  23-Oct-2011 00:28:10 (1mins, 27sec)   9
 task_201110211442_0086_r_02   100.00%   23-Oct-2011 00:26:43  23-Oct-2011 00:28:10 (1mins, 27sec)   9
 task_201110211442_0086_r_03   100.00%   23-Oct-2011 00:26:43  23-Oct-2011 00:28:10 (1mins, 27sec)   9
 task_201110211442_0086_r_04   100.00%   23-Oct-2011 00:26:44  23-Oct-2011 00:35:56 (9mins, 11sec)  10
 task_201110211442_0086_r_05   100.00%   23-Oct-2011 00:26:44  23-Oct-2011 00:28:09 (1mins, 24sec)   9

 As you can see in the statistics, out of the 6 reduce executions one spends 9
 minutes while the rest spend 1 minute. I think it is because one of the
 reducers has to spend time sorting the results from the rest of the nodes.

 Is there a way to reduce this time?

 Thanks in advance,
 Raimon Bosch





Re: why is one of the reducers always slower?

2011-10-23 Thread Raimon Bosch
Thanks for your help,

In fact, I'm using MultipleOutputFormat to generate one file for each Hive
table, and in this case I'm generating only one of the possible Hive tables.
Can I use MultipleOutputFormat and still distribute my keys over the whole
cluster?

2011/10/23 Ayon Sinha ayonsi...@yahoo.com

 Looks like that is the reducer that is actually doing the work, with 14M
 input records.


  Reduce input groups 1
  Combine output records 0
  Reduce shuffle bytes 5,135,004,496
  Reduce output records 14,232,592
  Spilled Records 14,232,592
  Combine input records 0
  Reduce input records 14,232,592



 Other reducers have this:
 Reduce output records 0
 Spilled Records 0
 Combine input records 0
 Reduce input records 0

 -Ayon
 See My Photos on Flickr
 Also check out my Blog for answers to commonly asked questions.



 
 From: Raimon Bosch raimon.bo...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Saturday, October 22, 2011 6:01 PM
 Subject: why is one of the reducers always slower?

 Hi all,

 I'm executing a job that converts logs into Hive tables. The times are very
 good once we add a proper number of nodes, but the reduce phase always
 takes more time on one of the machines.

 task_id                       progress  start time            finish time                          counters
 task_201110211442_0086_r_00   100.00%   23-Oct-2011 00:26:42  23-Oct-2011 00:28:09 (1mins, 27sec)   9
 task_201110211442_0086_r_01   100.00%   23-Oct-2011 00:26:42  23-Oct-2011 00:28:10 (1mins, 27sec)   9
 task_201110211442_0086_r_02   100.00%   23-Oct-2011 00:26:43  23-Oct-2011 00:28:10 (1mins, 27sec)   9
 task_201110211442_0086_r_03   100.00%   23-Oct-2011 00:26:43  23-Oct-2011 00:28:10 (1mins, 27sec)   9
 task_201110211442_0086_r_04   100.00%   23-Oct-2011 00:26:44  23-Oct-2011 00:35:56 (9mins, 11sec)  10
 task_201110211442_0086_r_05   100.00%   23-Oct-2011 00:26:44  23-Oct-2011 00:28:09 (1mins, 24sec)   9

 As you can see in the statistics, out of the 6 reduce executions one spends 9
 minutes while the rest spend 1 minute. I think it is because one of the
 reducers has to spend time sorting the results from the rest of the nodes.

 Is there a way to reduce this time?

 Thanks in advance,
 Raimon Bosch



why is one of the reducers always slower?

2011-10-22 Thread Raimon Bosch
Hi all,

I'm executing a job that converts logs into Hive tables. The times are very
good once we add a proper number of nodes, but the reduce phase always takes
more time on one of the machines.

task_id                       progress  start time            finish time                          counters
task_201110211442_0086_r_00   100.00%   23-Oct-2011 00:26:42  23-Oct-2011 00:28:09 (1mins, 27sec)   9
task_201110211442_0086_r_01   100.00%   23-Oct-2011 00:26:42  23-Oct-2011 00:28:10 (1mins, 27sec)   9
task_201110211442_0086_r_02   100.00%   23-Oct-2011 00:26:43  23-Oct-2011 00:28:10 (1mins, 27sec)   9
task_201110211442_0086_r_03   100.00%   23-Oct-2011 00:26:43  23-Oct-2011 00:28:10 (1mins, 27sec)   9
task_201110211442_0086_r_04   100.00%   23-Oct-2011 00:26:44  23-Oct-2011 00:35:56 (9mins, 11sec)  10
task_201110211442_0086_r_05   100.00%   23-Oct-2011 00:26:44  23-Oct-2011 00:28:09 (1mins, 24sec)   9

As you can see in the statistics, out of the 6 reduce executions one spends 9
minutes while the rest spend 1 minute. I think it is because one of the
reducers has to spend time sorting the results from the rest of the nodes.

Is there a way to reduce this time?

Thanks in advance,
Raimon Bosch


cannot use distcp in some s3 buckets

2011-10-13 Thread Raimon Bosch
Hi,

I've been having some problems with one of our S3 buckets. I have asked
Amazon support, with no luck yet
(https://forums.aws.amazon.com/thread.jspa?threadID=78001).

I'm getting this exception only with our oldest S3 bucket, with this command:
hadoop distcp s3://MY_BUCKET_NAME/logfile-20110815.gz
/tmp/logfile-20110815.gz

java.lang.IllegalArgumentException: Invalid hostname in URI
s3://MY_BUCKET_NAME/logfile-20110815.gz /tmp/logfile-20110815.gz
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:41)
at
org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)

As you can see, Hadoop is rejecting my URL before it even starts the
authorization steps. Has anyone run into a similar issue? I have already
tested the same operation on newer S3 buckets and the command works
correctly.

Thanks in advance,
Raimon Bosch.


Re: cannot use distcp in some s3 buckets

2011-10-13 Thread Raimon Bosch
By the way,

The URL I'm trying has a '_' in the bucket name. Could this be the problem?
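
A quick check outside Hadoop suggests it could well be (bucket names here are
made up, just to illustrate): java.net.URI returns a null host for an
authority containing '_', and S3Credentials.initialize throws exactly this
"Invalid hostname in URI" error when the host is null.

import java.net.URI;

public class BucketNameCheck {
  public static void main(String[] args) throws Exception {
    // an underscore makes the authority unparseable as a hostname
    System.out.println(new URI("s3://my_bucket/logfile.gz").getHost()); // prints null
    System.out.println(new URI("s3://mybucket/logfile.gz").getHost());  // prints mybucket
  }
}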

2011/10/13 Raimon Bosch raimon.bo...@gmail.com

 Hi,

 I've been having some problems with one of our S3 buckets. I have asked
 Amazon support, with no luck yet
 (https://forums.aws.amazon.com/thread.jspa?threadID=78001).

 I'm getting this exception only with our oldest s3 bucket with this
 command: hadoop distcp s3://MY_BUCKET_NAME/logfile-20110815.gz
 /tmp/logfile-20110815.gz

 java.lang.IllegalArgumentException: Invalid hostname in URI
 s3://MY_BUCKET_NAME/logfile-20110815.gz /tmp/logfile-20110815.gz
 at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:41)
 at
 org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)

 As you can see, Hadoop is rejecting my URL before it even starts the
 authorization steps. Has anyone run into a similar issue? I have already
 tested the same operation on newer S3 buckets and the command works
 correctly.

 Thanks in advance,
 Raimon Bosch.





How to get number of live nodes in hadoop

2011-10-11 Thread Raimon Bosch
Hi,

Following the instructions at
http://wiki.apache.org/hadoop/HowManyMapsAndReduces, I've read that the best
number of reducers for a job is 0.95 or 1.75 * (nodes *
mapred.tasktracker.tasks.maximum), so I would like to call
conf.setNumReduceTasks(int num) according to how many nodes I have working.

So how can I get the number of live nodes from my hadoop code?
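
Something like this is what I have in mind — a sketch, untested, using the old
mapred API (MyJob is a placeholder; ClusterStatus also exposes the total
reduce slots directly, which may be even simpler than counting nodes):

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyJob.class);
ClusterStatus cluster = new JobClient(conf).getClusterStatus();

int liveNodes = cluster.getTaskTrackers();      // number of live TaskTrackers
int reduceSlots = cluster.getMaxReduceTasks();  // total reduce slots in the cluster

// either derive it from the node count and the per-node maximum
// (note: this reads the client-side config, which may differ from the TaskTrackers')...
int slotsPerNode = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
conf.setNumReduceTasks((int) (0.95 * liveNodes * slotsPerNode));
// ...or simply use the cluster-wide slot count:
conf.setNumReduceTasks((int) (0.95 * reduceSlots));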

Thanks in advance,
Raimon Bosch.


How to iterate over a hdfs folder with hadoop

2011-10-10 Thread Raimon Bosch
Hi,

I'm wondering how I can browse an HDFS folder using the classes
in the org.apache.hadoop.fs package. The operation I'm looking for is
'hadoop dfs -ls'.

The standard file system equivalent would be:

File f = new File(outputPath);
if(f.isDirectory()){
  String files[] = f.list();
  for(String file : files){
//Do your logic
  }
}

Thanks in advance,
Raimon Bosch.


Re: How to iterate over a hdfs folder with hadoop

2011-10-10 Thread Raimon Bosch
Thanks John!

Here is the complete solution:


Configuration jc = new Configuration();
List<String> files_in_hdfs = new ArrayList<String>();

FileSystem fs = FileSystem.get(jc);
FileStatus[] file_status = fs.listStatus(new Path(outputPath));
for (FileStatus fileStatus : file_status) {
  files_in_hdfs.add(fileStatus.getPath().getName());
}

String[] files = files_in_hdfs.toArray(new String[files_in_hdfs.size()]);

2011/10/10 John Conwell j...@iamjohn.me

 FileStatus[] files = fs.listStatus(new Path(path));

 for (FileStatus fileStatus : files) {
   // ...do stuff here
 }

 On Mon, Oct 10, 2011 at 8:03 AM, Raimon Bosch raimon.bo...@gmail.com
 wrote:

  Hi,
 
  I'm wondering how I can browse an HDFS folder using the classes
  in the org.apache.hadoop.fs package. The operation I'm looking for is
  'hadoop dfs -ls'.
 
  The standard file system equivalent would be:
 
  File f = new File(outputPath);
  if(f.isDirectory()){
    String files[] = f.list();
    for(String file : files){
      //Do your logic
    }
  }
 
  Thanks in advance,
  Raimon Bosch.
 



 --

 Thanks,
 John C



How to solve a DisallowedDatanodeException?

2011-10-07 Thread Raimon Bosch
Hi,

I'm running a cluster on Amazon, and sometimes I'm getting this exception:

2011-10-07 10:36:28,014 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException: Datanode
denied communication with namenode:
ip-10-235-57-112.eu-west-1.compute.internal:50010
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:2042)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.register(NameNode.java:687)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

at org.apache.hadoop.ipc.Client.call(Client.java:740)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy4.register(Unknown Source)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.register(DataNode.java:531)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.runDatanodeDaemon(DataNode.java:1208)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1247)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)

Since I get this exception I'm not able to run any datanode. I have checked
all the connections between the nodes and they are OK, and I have also tried
to format the namenode, but the problem remains.

Do I need to remove the datanode's local data? rm -rf
${HOME}/dfs-xvdh/dn

I would prefer a solution that doesn't imply formatting or erasing
anything...


Regards,
Raimon Bosch.


Re: How to solve a DisallowedDatanodeException?

2011-10-07 Thread Raimon Bosch
My dfs.hosts list was correct on all the servers. In this case I had a
problem with Amazon's internal DNS. I had to restart all my nodes to get rid
of this problem.

After some changes on my cluster (renaming nodes), some nodes had
automatically changed their IPs, and I had to perform a restart to force a
change in the internal IPs as well.

2011/10/7 Eric Fiala e...@fiala.ca

 Raimon - the error
 org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException:
 Datanode denied communication with namenode

 Usually indicates that the datanode that is trying to connect to the
 namenode is either:

   - listed in the file defined by dfs.hosts.exclude (explicitly excluded), or
   - that dfs.hosts (explicitly included) is used and the node is not listed
     within that file

 Make sure the datanode is not listed in excludes, and if you are using
 dfs.hosts, add it to the includes, and run hadoop dfsadmin -refreshNodes

 You should not have to remove any data on local disc to solve this problem.

 HTH

 EF

 On Fri, Oct 7, 2011 at 4:47 AM, Raimon Bosch raimon.bo...@gmail.com
 wrote:

  Hi,
 
  I'm running a cluster on amazon and sometimes I'm getting this exception:
 
  2011-10-07 10:36:28,014 ERROR
  org.apache.hadoop.hdfs.server.datanode.DataNode:
  org.apache.hadoop.ipc.RemoteException:
  org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException:
  Datanode
  denied communication with namenode:
  ip-10-235-57-112.eu-west-1.compute.internal:50010
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:2042)
 at
 
 org.apache.hadoop.hdfs.server.namenode.NameNode.register(NameNode.java:687)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
 
 at org.apache.hadoop.ipc.Client.call(Client.java:740)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
 at $Proxy4.register(Unknown Source)
 at
 
 org.apache.hadoop.hdfs.server.datanode.DataNode.register(DataNode.java:531)
 at
 
 
 org.apache.hadoop.hdfs.server.datanode.DataNode.runDatanodeDaemon(DataNode.java:1208)
 at
 
 
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1247)
 at
  org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
 
  Since I have this exception I'm not able to run any datanode. I have
  checked
  all the connections between the nodes and they are ok, I have tried also
 to
  format the namenode but the problem is still remaining.
 
  Shall I need to remove the information about the datanode? rm -rf
  ${HOME}/dfs-xvdh/dn
 
  I would prefer a solution that doesn't implies a format or erasing
  anything...
 
 
  Regards,
  Raimon Bosch.
 



 --
 *Eric Fiala*
 *Fiala Consulting*
 T: 403.828.1117
 E: e...@fiala.ca
 http://www.fiala.ca



Re: How to solve a DisallowedDatanodeException?

2011-10-07 Thread Raimon Bosch
in the internal DNS's, sorry...

2011/10/7 Raimon Bosch raimon.bo...@gmail.com

 My dfs.hosts list was correct on all the servers. In this case I had a
 problem with Amazon's internal DNS. I had to restart all my nodes to get rid
 of this problem.

 After some changes on my cluster (renaming nodes), some nodes had
 automatically changed their IPs, and I had to perform a restart to force a
 change in the internal IPs as well.


 2011/10/7 Eric Fiala e...@fiala.ca

 Raimon - the error
 org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException:
 Datanode denied communication with namenode

 Usually indicates that the datanode that is trying to connect to the
 namenode is either:

   - listed in the file defined by dfs.hosts.exclude (explicitly excluded), or
   - that dfs.hosts (explicitly included) is used and the node is not listed
     within that file

 Make sure the datanode is not listed in excludes, and if you are using
 dfs.hosts, add it to the includes, and run hadoop dfsadmin -refreshNodes

 You should not have to remove any data on local disc to solve this
 problem.

 HTH

 EF

 On Fri, Oct 7, 2011 at 4:47 AM, Raimon Bosch raimon.bo...@gmail.com
 wrote:

  Hi,
 
  I'm running a cluster on amazon and sometimes I'm getting this
 exception:
 
  2011-10-07 10:36:28,014 ERROR
  org.apache.hadoop.hdfs.server.datanode.DataNode:
  org.apache.hadoop.ipc.RemoteException:
  org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException:
  Datanode
  denied communication with namenode:
  ip-10-235-57-112.eu-west-1.compute.internal:50010
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:2042)
 at
 
 org.apache.hadoop.hdfs.server.namenode.NameNode.register(NameNode.java:687)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
 
 at org.apache.hadoop.ipc.Client.call(Client.java:740)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
 at $Proxy4.register(Unknown Source)
 at
 
 org.apache.hadoop.hdfs.server.datanode.DataNode.register(DataNode.java:531)
 at
 
 
 org.apache.hadoop.hdfs.server.datanode.DataNode.runDatanodeDaemon(DataNode.java:1208)
 at
 
 
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1247)
 at
  org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
 
  Since I have this exception I'm not able to run any datanode. I have
  checked
  all the connections between the nodes and they are ok, I have tried also
 to
  format the namenode but the problem is still remaining.
 
  Shall I need to remove the information about the datanode? rm -rf
  ${HOME}/dfs-xvdh/dn
 
  I would prefer a solution that doesn't implies a format or erasing
  anything...
 
 
  Regards,
  Raimon Bosch.
 



 --
 *Eric Fiala*
 *Fiala Consulting*
 T: 403.828.1117
 E: e...@fiala.ca
 http://www.fiala.ca





Re: How to solve a DisallowedDatanodeException?

2011-10-07 Thread Raimon Bosch
Definitely it was an Amazon problem. They were assigning a new internal IP,
but some of the nodes were using the old one. I had to add entries to
/etc/hosts on all my nodes, redirecting the old DNS names to the correct IPs:

[NEW_IP]  ip-[OLD_IP].eu-west-1.compute.internal
[NEW_IP]  ip-[OLD_IP]

2011/10/7 Raimon Bosch raimon.bo...@gmail.com

 in the internal DNS's, sorry...


 2011/10/7 Raimon Bosch raimon.bo...@gmail.com

 My dfs.hosts list was correct on all the servers. In this case I had a
 problem with Amazon's internal DNS. I had to restart all my nodes to get rid
 of this problem.

 After some changes on my cluster (renaming nodes), some nodes had
 automatically changed their IPs, and I had to perform a restart to force a
 change in the internal IPs as well.


 2011/10/7 Eric Fiala e...@fiala.ca

 Raimon - the error
 org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException:
 Datanode denied communication with namenode

 Usually indicates that the datanode that is trying to connect to the
 namenode is either:

   - listed in the file defined by dfs.hosts.exclude (explicitly excluded), or
   - that dfs.hosts (explicitly included) is used and the node is not listed
     within that file

 Make sure the datanode is not listed in excludes, and if you are using
 dfs.hosts, add it to the includes, and run hadoop dfsadmin -refreshNodes

 You should not have to remove any data on local disc to solve this
 problem.

 HTH

 EF

 On Fri, Oct 7, 2011 at 4:47 AM, Raimon Bosch raimon.bo...@gmail.com
 wrote:

  Hi,
 
  I'm running a cluster on amazon and sometimes I'm getting this
 exception:
 
  2011-10-07 10:36:28,014 ERROR
  org.apache.hadoop.hdfs.server.datanode.DataNode:
  org.apache.hadoop.ipc.RemoteException:
  org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException:
  Datanode
  denied communication with namenode:
  ip-10-235-57-112.eu-west-1.compute.internal:50010
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:2042)
 at
 
 org.apache.hadoop.hdfs.server.namenode.NameNode.register(NameNode.java:687)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
 
 at org.apache.hadoop.ipc.Client.call(Client.java:740)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
 at $Proxy4.register(Unknown Source)
 at
 
 org.apache.hadoop.hdfs.server.datanode.DataNode.register(DataNode.java:531)
 at
 
 
 org.apache.hadoop.hdfs.server.datanode.DataNode.runDatanodeDaemon(DataNode.java:1208)
 at
 
 
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1247)
 at
 
 org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
 
  Since I have this exception I'm not able to run any datanode. I have
  checked
  all the connections between the nodes and they are ok, I have tried
 also to
  format the namenode but the problem is still remaining.
 
  Shall I need to remove the information about the datanode? rm -rf
  ${HOME}/dfs-xvdh/dn
 
  I would prefer a solution that doesn't implies a format or erasing
  anything...
 
 
  Regards,
  Raimon Bosch.
 



 --
 *Eric Fiala*
 *Fiala Consulting*
 T: 403.828.1117
 E: e...@fiala.ca
 http://www.fiala.ca






Re: Creating a hive table for a custom log

2011-09-16 Thread Raimon Bosch



Any ideas?

The most common approach would be writing your own SerDe and plugging it into
Hive, like:

http://code.google.com/p/hive-json-serde/

But I'm wondering if there is some work already done in this area.
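
For what it's worth, the parsing core such a SerDe would need is small — a
rough sketch (class and method names are only illustrative), assuming the
key=value pairs can arrive in any order:

import java.util.LinkedHashMap;
import java.util.Map;

public class RequestLineParser {
  /**
   * Turns "/client/action1/?transaction_id=8002&user_id=...&ts=..." into a
   * map of parameter name -> value, plus the action taken from the path,
   * regardless of the order of the parameters.
   */
  public static Map<String, String> parse(String requestPath) {
    Map<String, String> fields = new LinkedHashMap<String, String>();
    String[] pathAndQuery = requestPath.split("\\?", 2);
    String[] pathParts = pathAndQuery[0].split("/"); // e.g. "/client/action1/" -> "action1"
    fields.put("action", pathParts[pathParts.length - 1]);
    if (pathAndQuery.length > 1) {
      for (String pair : pathAndQuery[1].split("&")) {
        String[] kv = pair.split("=", 2);
        fields.put(kv[0], kv.length > 1 ? kv[1] : "");
      }
    }
    return fields;
  }
}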


Raimon Bosch wrote:
 
 Hi,
 
 I'm trying to create a table similar to apache_log, but I want to avoid
 writing my own map-reduce task because I don't want to store my HDFS files
 twice.
 
 So if you're working with log lines like this:
 
 186.92.134.151 [31/Aug/2011:00:10:41 +0000] GET
 /client/action1/?transaction_id=8002&user_id=87179311248&ts=1314749223525&item1=271&item2=6045&environment=2
 HTTP/1.1

 112.201.65.238 [31/Aug/2011:00:10:41 +0000] GET
 /client/action1/?transaction_id=9002&ts=1314749223525&user_id=9048871793100&item2=6045&item1=271&environment=2
 HTTP/1.1

 90.45.198.251 [31/Aug/2011:00:10:41 +0000] GET
 /client/action2/?transaction_id=9022&ts=1314749223525&user_id=9048871793100&item2=6045&item1=271&environment=2
 HTTP/1.1
 
 Keeping in mind that the parameters can come in different orders, which
 would be the best strategy to create this table? Writing my own
 org.apache.hadoop.hive.contrib.serde2 SerDe? Is there any resource already
 implemented that I could use to perform this task?
 
 In the end the objective is to convert all the parameters into fields and
 use the action as the type. With this big table I will be able to perform my
 queries, my joins and my views.
 
 Any ideas?
 
 Thanks in Advance,
 Raimon Bosch.
 

-- 
View this message in context: 
http://old.nabble.com/Creating-a-hive-table-for-a-custom-log-tp32379849p32481457.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.