Re: Running terasort with 1 map task

2013-02-26 Thread Bertrand Dechoux
http://wiki.apache.org/hadoop/HowManyMapsAndReduces

It is possible to have a single mapper if the input is not splittable BUT
it is rarely seen as a feature.
One could ask why you want to use a platform for distributed computing for
a job that shouldn't be distributed.
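For reference, a minimal sketch of the usual trick for getting a single mapper: subclass an input format and override isSplitable() so each input file becomes exactly one split. This is illustrative only (TeraSort actually ships its own input format), and the class name is made up:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// One split, and therefore one mapper, per input file.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

A job would then call job.setInputFormatClass(NonSplittableTextInputFormat.class); note this still yields one mapper per file, so the input directory must hold a single file.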

Regards

Bertrand


On Tue, Feb 26, 2013 at 12:09 PM, Arindam Choudhury 
arindamchoudhu...@gmail.com wrote:

 Hi all,

 I am trying to run terasort using one map and one reduce. So, I generated
 the input data using:

 hadoop jar hadoop-examples-1.0.4.jar teragen -Dmapred.map.tasks=1
 -Dmapred.reduce.tasks=1 3200 /user/hadoop/input32mb1map

 Then I launched the hadoop terasort job using:

 hadoop jar hadoop-examples-1.0.4.jar terasort -Dmapred.map.tasks=1
 -Dmapred.reduce.tasks=1 /user/hadoop/input32mb1map /user/hadoop/output1

 I thought it would run the job using 1 map and 1 reduce, but when I inspect
 the job statistics I find:

 hadoop job -history /user/hadoop/output1

 Task Summary
 ============
 Kind     Total  Successful  Failed  Killed  StartTime             FinishTime

 Setup        1           1       0       0  26-Feb-2013 10:57:47  26-Feb-2013 10:57:55 (8sec)
 Map         24          24       0       0  26-Feb-2013 10:57:57  26-Feb-2013 11:05:37 (7mins, 40sec)
 Reduce       1           1       0       0  26-Feb-2013 10:58:21  26-Feb-2013 11:08:31 (10mins, 10sec)
 Cleanup      1           1       0       0  26-Feb-2013 11:08:32  26-Feb-2013 11:08:36 (4sec)
 ============

 So, although I asked for one map task, there are 24 of them.

 How do I solve this problem? How do I tell Hadoop to launch only one map?

 Thanks,



Re: Running terasort with 1 map task

2013-02-26 Thread Julien Muller
Maybe your goal is to have a baseline for performance measurement?
In that case, you might want to consider running only one TaskTracker: you
would still have multiple tasks, but they would run on only one machine. Also,
you could make the mappers run serially by configuring only one map slot on
your one-node cluster (see the sketch below).
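A hedged sketch of the one-map-slot idea (property names as I recall them for Hadoop 1.x, so please verify against your release): the per-TaskTracker slot counts are set in mapred-site.xml, e.g.

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>

Note this serialises the tasks on that node; it does not reduce the number of map tasks the job creates.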

Nevertheless I agree with Bertrand, this is not really a realistic use case
(or maybe you can give us more clues).

Julien




Re: Running terasort with 1 map task

2013-02-26 Thread Arindam Choudhury
Thanks. As Julien said, I want to do a performance measurement.

Actually,
hadoop jar hadoop-examples-1.0.4.jar teragen -Dmapred.map.tasks=1
-Dmapred.reduce.tasks=1 3200 /user/hadoop/input32mb1map

has generated:
Total size: 3200029737 B
Total dirs: 3
Total files: 5
Total blocks (validated): 27 (avg. block size 118519619 B)

That's why there are so many maps.
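(For reference, that map count follows from the block count of the input: assuming a 128 MB (134217728 B) block size, 3200029737 B / 134217728 B per block ≈ 23.8, which rounds up to 24 blocks for the data and hence the 24 map tasks in the job history; the remaining blocks in the fsck total presumably belong to the small bookkeeping files in the directory.)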





Re: Running terasort with 1 map task

2013-02-26 Thread Arindam Choudhury
In my $HADOOP_HOME/conf/hdfs-site.xml, I have set the data-block size:

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <final>true</final>
</property>

While running teragen I am specifying it again to be sure:

hadoop jar /opt/hadoop-1.0.4/hadoop-examples-1.0.4.jar teragen
-Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1 -Ddfs.block.size=134217728
32 /user/hadoop/input

but it generates 3 blocks:

hadoop fsck -blocks -files -locations /user/hadoop/input
Status: HEALTHY
 Total size:32029543 B
 Total dirs:3
 Total files:4
 Total blocks (validated):3 (avg. block size 10676514 B)
 Minimally replicated blocks:3 (100.0 %)

What am I doing wrong? How can I generate only one block?





Re: Running terasort with 1 map task

2013-02-26 Thread Arindam Choudhury
Sorry, my bad, it is solved.





Re: Hadoop efficient resource isolation

2013-02-26 Thread Arun C Murthy
CapacityScheduler has features that allow a user to specify the amount of virtual
memory per map/reduce task, and the TaskTracker monitors all tasks and their
process trees to ensure fork bombs don't kill the node.
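For reference, a rough sketch of what that looks like in a Hadoop 1.x mapred-site.xml. The property names and values below are from memory and purely illustrative, so verify them against your release:

<property>
  <name>mapred.cluster.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapred.cluster.max.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapred.job.map.memory.mb</name>
  <value>2048</value>
</property>

With limits like these in place, as I understand it the TaskTracker can kill any task whose process tree grows beyond the memory it requested.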

On Feb 25, 2013, at 8:27 PM, Marcin Mejran wrote:

 That won't stop a bad job (say a fork bomb or a massive memory leak in a
 streaming script) from taking out a node, which is what I believe Dhanasekaran
 was asking about. He wants to physically isolate certain jobs to certain
 non-critical nodes. I don't believe this is possible, and data would be spread
 to those nodes, assuming they're data nodes, which would still cause
 cluster-wide issues (and if the data is isolated, why not have two separate
 clusters?).
 
 I've read references in the docs to some kind of memory-based constraints in
 Hadoop, but I don't know the details. Does anyone know how they work?
 
 Also, I believe there are tools in Linux that can kill processes in case of
 memory issues and otherwise restrict what a certain user can do. These seem
 like a more flexible solution, although they won't cover all potential issues.
 
 -Marcin
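On the Linux-side tools Marcin mentions, one hedged example (the user name and values are placeholders, not from this thread): per-user caps in /etc/security/limits.conf can blunt fork bombs and runaway memory use for processes started by that account, e.g.

# /etc/security/limits.conf -- illustrative values only
techuser  hard  nproc   512        # max processes, a rough fork-bomb guard
techuser  hard  as      4194304    # max address space per process, in KB (4 GB)

This restricts the Unix account the tasks run as, so it only helps if jobs from different pools actually run under different users.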
 
 On Feb 25, 2013, at 7:20 PM, Arun C Murthy a...@hortonworks.com wrote:
 
 CapacityScheduler is what you want...
 
 On Feb 21, 2013, at 5:16 AM, Dhanasekaran Anbalagan wrote:
 
 Hi Guys,
 
  Is it possible to isolate job submission on a Hadoop cluster? We currently 
  run a 48-machine cluster. We have observed that Hadoop does not provide 
  efficient resource isolation. In our case we run tech and research pools; 
  when a tech job has a memory leak it occupies the whole cluster. We finally 
  figured out the issue was with a tech job: it screwed up the whole Hadoop 
  cluster, and in the end 10 datanodes were dead.
  
  Is there any way to prevent this at job submission with more efficient 
  resource allocation, so that when something goes wrong in a particular job 
  it only affects that pool and does not affect other jobs? Is there any way 
  to achieve this?
  
  Please guide me, guys.
  
  My idea is: when a tech user submits a job, it should only run on, in my 
  case, 24 machines; the other machines would be only for research users.
  
  That would contain the memory leak problem. 
  
 
 -Dhanasekaran.
 Did I learn something today? If not, I wasted it.
 
 --
 Arun C. Murthy
 Hortonworks Inc.
 http://hortonworks.com/
 
 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




HDFS Backup for Hadoop Update

2013-02-26 Thread Pablo Musa

Hello guys,
I am starting the upgrade from hadoop 0.20 to a newer version which changes
the HDFS format (2.0). I read a lot of tutorials and they say that data loss is
possible (as expected). In order to avoid HDFS data loss I will probably
back up the whole HDFS structure (7TB per node). However, this is a huge amount
of data and it will take a lot of time, during which my service would be
unavailable.

I was thinking about a simple approach: copying all files to a different place.
I tried to find some parallel file compactor to speed up the process, but could
not find one.

How did you guys do it?
Is there some trick?
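One possible answer to the "parallel compactor" part, offered as an assumption rather than something from this thread: GNU tar piped through pigz (a parallel gzip) compresses a directory using all cores. The path below is only a placeholder for a dfs.data.dir location.

# Illustrative only
tar cf - /data/dfs/dn | pigz -p "$(nproc)" > /backup/dfs-dn-$(date +%Y%m%d).tar.gz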

Thank you in advance,
Pablo Musa


Re: Running terasort with 1 map task

2013-02-26 Thread Mahesh Balija
Does passing dfs.block.size=134217728 resolve your issue, or was it
something else that fixed your problem?




Re: JobTracker security

2013-02-26 Thread Serge Blazhievsky
I am trying not to use Kerberos...

Is there another option?

Thanks
Serge

On Tue, Feb 26, 2013 at 3:31 PM, Patai Sangbutsarakum 
patai.sangbutsara...@turn.com wrote:

  Kerberos

   From: Serge Blazhievsky hadoop...@gmail.com
 Reply-To: user@hadoop.apache.org
 Date: Tue, 26 Feb 2013 15:29:08 -0800
 To: user@hadoop.apache.org
 Subject: JobTracker security

  Hi all,

  Is there a way to restrict job monitoring and management only to jobs
 started by each individual user?


  The basic scenario is:

  1. Start a job under user1
 2. Login as user2
 3. hadoop job -list to retrieve job id
 4. hadoop job -kill job_id
 5. Job gets terminated

  Is there something that needs to be enabled to prevent that from
 happening?

  Thanks
 Serge



Re: HDFS Backup for Hadoop Update

2013-02-26 Thread Pablo Musa
Following the idea of doing a copy of the data structure, I thought about
rsync.

I could run rsync while the server is ON and later just apply the diff, which
would be much faster, decreasing the system's offline time.
But I do not know whether hadoop makes a lot of changes to the data
structure (blocks).
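A hedged sketch of that two-pass rsync idea (paths are placeholders for a dfs.data.dir location):

# First pass while the datanode is still running:
rsync -a /data/dfs/dn/ /backup/dfs/dn/
# Stop the datanode, then copy only the differences:
rsync -a --delete /data/dfs/dn/ /backup/dfs/dn/

The datanode can delete, finalize, and rewrite block files and metadata while it is running, so the first pass should be treated as approximate; only the second pass, taken with the daemon stopped, is consistent.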


Thanks again,
Pablo





Re: JobTracker security

2013-02-26 Thread Jean-Marc Spaggiari
Maybe restrict access to the hadoop file(s) to user1?




Re: JobTracker security

2013-02-26 Thread Serge Blazhievsky
Hi Jean,

Do you mean the input files for Hadoop, or the Hadoop directory?

Serge




Re: QJM HA and ClusterID

2013-02-26 Thread Azuryy Yu
Anybody here? Thanks!


On Tue, Feb 26, 2013 at 9:57 AM, Azuryy Yu azury...@gmail.com wrote:

 Hi all,
 I've been stuck on this question for several days. I want to upgrade my
 cluster from hadoop-1.0.3 to hadoop-2.0.3-alpha, and I've configured QJM
 successfully.

 How can I customize the clusterID myself? It generates a random clusterID now.

 It doesn't work when I run:

 start-dfs.sh -upgrade -clusterId 12345-test

 Thanks!




Re: JobTracker security

2013-02-26 Thread Jean-Marc Spaggiari
I mean the executable files, or even the entire hadoop directory.
People might still be able to install a local copy of Hadoop,
configure it to point to the same trackers, and then do the kill, but
at least that will complicate things a bit.

If user1 and user2 are in different groups, that might also allow you
to block some user2 actions against user1 processes. Also, you should
take a look at the Security chapter in Hadoop: The Definitive Guide
and at the hadoop-policy.xml file (I have never looked at that file, so
maybe it's not related at all).
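For completeness, one more knob in this area, sketched from memory (so the property names should be checked against the release in use): MapReduce job ACLs, which make the JobTracker refuse kill/modify requests from anyone but the job owner and the listed users or groups.

<property>
  <name>mapred.acls.enabled</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.acl-modify-job</name>
  <value> </value>
</property>
<property>
  <name>mapreduce.job.acl-view-job</name>
  <value> </value>
</property>

A single space as the value means nobody beyond the owner and cluster admins. Without Kerberos the username is still client-asserted, so this raises the bar rather than closing the hole completely.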




Re: JobTracker security

2013-02-26 Thread Serge Blazhievsky
All right!

Thanks for the advice!


Serge




Re: QJM HA and ClusterID

2013-02-26 Thread Suresh Srinivas
It looks like start-dfs.sh has a bug: it only takes the -upgrade option and
ignores the clusterId.

Consider running the command directly (which is what start-dfs.sh calls):
bin/hdfs start namenode -upgrade -clusterId <your cluster ID>

Please file a bug, if you can, for start-dfs.sh ignoring the additional
parameters.





-- 
http://hortonworks.com/download/


Concatenate adjacent lines with hadoop

2013-02-26 Thread Matthieu Labour
Hi

Please find below the issue I need to solve. Thank you in advance for your
help/ tips.

I have log files where sometimes log lines are split (this happens when
the log line exceeds a specific length):

Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3]
LOGTAG<TAB>FIELD-0<TAB>...<TAB>FIELD-MAX
Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3]
LOGTAG<TAB>FIELD-0<TAB>...<TAB>FIELD-N   <=== log line is being
split
Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3]
FIELD-N<TAB>FIELD-N+1 ... FIELD-MAX

Can I reconcile/concatenate split log lines with a hadoop map reduce
job?

In other words, using a map reduce job, can I concatenate the two following
adjacent lines (provided that I 'detect' them)

Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3]
LOGTAG<TAB>FIELD-0<TAB>...<TAB>FIELD-N   <=== log line is being
split
Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3]
FIELD-N<TAB>FIELD-N+1 ... FIELD-MAX

into

Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3]
LOGTAG<TAB>FIELD-0<TAB>...<TAB>FIELD-N<TAB>FIELD-N+1 ... FIELD-MAX

Thank you!


Re: Datanodes shutdown and HBase's regionservers not working

2013-02-26 Thread Jean-Marc Spaggiari
Hi Davey,

So were you able to find the issue?

JM

2013/2/25 Davey Yan davey@gmail.com:
 Hi Nicolas,

 I think I found what led to the shutdown of all of the datanodes, but I am
 not completely certain.
 I will return to this mailing list when my cluster is stable again.

 On Mon, Feb 25, 2013 at 8:01 PM, Nicolas Liochon nkey...@gmail.com wrote:
 Network error messages are not always friendly, especially if there is a
 misconfiguration.
 This said, a "connection refused" error says that the remote machine was reached,
 but that the remote port was not open on the remote box, i.e. the process
 was dead.
 It could be useful to pastebin the whole logs as well...


 On Mon, Feb 25, 2013 at 12:44 PM, Davey Yan davey@gmail.com wrote:

 But... there was no log like network unreachable.


 On Mon, Feb 25, 2013 at 6:07 PM, Nicolas Liochon nkey...@gmail.com
 wrote:
  I agree.
  Then for HDFS, ...
  The first thing to check is the network I would say.
 
 
 
 
  On Mon, Feb 25, 2013 at 10:46 AM, Davey Yan davey@gmail.com wrote:
 
  Thanks for reply, Nicolas.
 
  My question: What can lead to shutdown of all of the datanodes?
  I believe that the regionservers will be OK if the HDFS is OK.
 
 
  On Mon, Feb 25, 2013 at 5:31 PM, Nicolas Liochon nkey...@gmail.com
  wrote:
   Ok, what's your question?
   When you say the datanode went down, was it the datanode processes or
   the
   machines, with both the datanodes and the regionservers?
  
   The NameNode pings its datanodes every 3 seconds. However it will
   internally
   mark the datanodes as dead after 10:30 minutes (even if in the gui
   you
   have
   'no answer for x minutes').
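    (For reference, assuming default settings and the property names as I
    recall them, that 10:30 figure comes from HDFS's dead-node formula:
    2 * heartbeat.recheck.interval (5 min) + 10 * dfs.heartbeat.interval (3 s)
    = 630 s = 10 minutes 30 seconds.)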
   HBase monitoring is done by ZooKeeper. By default, a regionserver is
   considered as dead after 180s with no answer. Before, well, it's
   considered
   as live.
   When you stop a regionserver, it tries to flush its data to the disk
   (i.e.
   hdfs, i.e. the datanodes). That's why if you have no datanodes, or if
   a
   high
    ratio of your datanodes are dead, it can't shut down. Connection refused
    and socket timeouts come from the fact that before the 10:30 minutes hdfs
   does
   not declare the nodes as dead, so hbase tries to use them (and,
   obviously,
   fails). Note that there is now  an intermediate state for hdfs
   datanodes,
   called stale: an intermediary state where the datanode is used only
   if
   you
   have to (i.e. it's the only datanode with a block replica you need).
   It
   will
   be documented in HBase for the 0.96 release. But if all your
   datanodes
   are
   down it won't change much.
  
   Cheers,
  
   Nicolas
  
  
  
   On Mon, Feb 25, 2013 at 10:10 AM, Davey Yan davey@gmail.com
   wrote:
  
   Hey guys,
  
   We have a cluster with 5 nodes(1 NN and 4 DNs) running for more than
   1
   year, and it works fine.
   But the datanodes got shutdown twice in the last month.
  
   When the datanodes got shutdown, all of them became Dead Nodes in
   the NN web admin UI(http://ip:50070/dfshealth.jsp),
   but regionservers of HBase were still live in the HBase web
   admin(http://ip:60010/master-status), of course, they were zombies.
   All of the processes of jvm were still running, including
   hmaster/namenode/regionserver/datanode.
  
   When the datanodes got shutdown, the load (using the top command)
   of
   slaves became very high, more than 10, higher than normal running.
   From the top command, we saw that the processes of datanode and
    regionserver were consuming CPU.
  
   We could not stop the HBase or Hadoop cluster through normal
   commands(stop-*.sh/*-daemon.sh stop *).
   So we stopped datanodes and regionservers by kill -9 PID, then the
   load of slaves returned to normal level, and we start the cluster
   again.
  
  
   Log of NN at the shutdown point(All of the DNs were removed):
   2013-02-22 11:10:02,278 INFO org.apache.hadoop.net.NetworkTopology:
   Removing a node: /default-rack/192.168.1.152:50010
   2013-02-22 11:10:02,278 INFO org.apache.hadoop.hdfs.StateChange:
   BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
   192.168.1.149:50010
   2013-02-22 11:10:02,693 INFO org.apache.hadoop.net.NetworkTopology:
   Removing a node: /default-rack/192.168.1.149:50010
   2013-02-22 11:10:02,693 INFO org.apache.hadoop.hdfs.StateChange:
   BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
   192.168.1.150:50010
   2013-02-22 11:10:03,004 INFO org.apache.hadoop.net.NetworkTopology:
   Removing a node: /default-rack/192.168.1.150:50010
   2013-02-22 11:10:03,004 INFO org.apache.hadoop.hdfs.StateChange:
   BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
   192.168.1.148:50010
   2013-02-22 11:10:03,339 INFO org.apache.hadoop.net.NetworkTopology:
   Removing a node: /default-rack/192.168.1.148:50010
  
  
   Logs in DNs indicated there were many IOException and
   SocketTimeoutException:
   2013-02-22 11:02:52,354 ERROR
   org.apache.hadoop.hdfs.server.datanode.DataNode:
   

Re: Concatenate adjacent lines with hadoop

2013-02-26 Thread Azuryy Yu
That's easy. In your example:

Map output key: FIELD-N; map output value: just the original value.
In the reducer: if there is a LOGTAG<TAB> in the value, then this is the
first log entry; if not, it is a split continuation, so just take the
substring and concatenate it onto the first log entry.

Did I explain that clearly?





Re: Concatenate adjacent lines with hadoop

2013-02-26 Thread Matthieu Labour
Thank you for your answer. I am not sure I understand fully; my email was
most likely not very clear. Here is an example of a log line. Please note the
tag YSLOGROW at the beginning of the log line, and note that the second line
should be concatenated with the first line.

Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3] YSLOGROW
20121216T214720.345Z
remote-addr=166.137.156.155user-agent=Mozilla%2F5.0+%28Linux%3B+U%3B+Android+4.0.4%3B+en-us%3B+SAMSUNG-SGH-I717+Build%2FIMM76D%29+AppleWebKit%2F534.30+%28KHTML%2C+like+Gecko%29+Version%2F4.0+Mobile+Safari%2F534.30referrer=http%3A%2F%
2Flp.mydas.mobi
%2F%2Frich%2Ffoundation%2FdynamicInterstitial%2Fint_launch.php%3Fmm_urid%3DWBNMMG9h4XmbJBUHbDrNWWWm%26mm_ipaddress%3D166.137.156.155%26mm_handset%3D8440%26mm_carrier%3D2%26mm_apid%3D78683%26mm_acid%3D1050500%26mm_osid%3D14%26mm_uip%3D166.137.156.155%26mm_ua%3DMozilla%252F5.0%2B%2528Linux%253B%2BU%253B%2BAndroid%2B4.0.4%253B%2Ben-us%253B%2BSAMSUNG-SGH-I717%2BBuild%252FIMM76D%2529%2BAppleWebKit%252F534.30%2B%2528KHTML%252C%2Blike%2BGecko%2529%2BVersion%252F4.0%2BMobile%2BSafari%252F534.30SAMSUNG-SGH-I717%26mtpid%3DUNKNOWN%26mm_msuid%3DUNKNOWN%26mm_mmisdk%3D4.6.0-12.07.16.a%26mm_mxsdk%3DUNKNOWN%26mm_dv%3DAndroid4.0.4%26mm_adtype%3DMMFullScreenAdTransition%26mm_hswd%3DUNKNOWN%26mm_dm%3DSAMSUNG-SGH-I717%26mm_hsht%3DUNKNOWN%26mm_auid%3Dmmi

Dec 16 21:47:20 d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app[web.3]
d_bd6b33dc569994102eaa60a060987d99e9_013b35a758bd%26mm_accelerometer%3Dtrue%26mm_lat%3DUNKNOWN%26mm_long%3DUNKNOWN%26mm_hpx%3D1280%26mm_wpx%3D800%26mm_density%3D2.0%26mm_dpi%3DUNKNOWN%26mm_campaignid%3D45695%26autoExpand%3Dtruequery-string=ncid%3DWBNMMG9h4XmbJBUHbDrNWWWm
tr7y MLNL 1009 10034 3401 t4fx 10034 click





-- 
Matthieu Labour, Engineering | *Action**X* |
584 Broadway, Suite 1002 – NY, NY 10012
415-994-3480 (m)


Re: Concatenate adjacent lines with hadoop

2013-02-26 Thread Azuryy Yu
I just noticed that your two lines both start with: Dec 16 21:47:20
d.14b48e47-abf2-403e-8a1a-04e821a42cb6 app

Is that prefix different for other lines? If your answer is yes, then just use
this prefix as the map output key.
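To make that concrete, here is a minimal, hedged sketch in the new MapReduce API. The class name, the five-field prefix, and the YSLOGROW handling are assumptions taken from the example above; it also assumes the prefix is unique per logical row and that a row is split into at most two physical lines.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ConcatSplitLines {

    // Key every physical line by its "Dec 16 21:47:20 <dyno-id> app[web.3]" prefix.
    public static class PrefixMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(" ", 6);
            if (parts.length < 6) {
                return; // not a line we recognise
            }
            String prefix = parts[0] + " " + parts[1] + " " + parts[2] + " "
                    + parts[3] + " " + parts[4];
            context.write(new Text(prefix), new Text(parts[5]));
        }
    }

    // Glue the continuation fragment onto the fragment carrying the YSLOGROW tag.
    public static class ConcatReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String head = null; // fragment starting with YSLOGROW
            String tail = "";   // continuation fragment (at most one assumed)
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("YSLOGROW")) {
                    head = s;
                } else {
                    tail = tail + s;
                }
            }
            context.write(key, new Text(head != null ? head + tail : tail));
        }
    }
}

As noted above, this only works if the prefix really is unique per logical row; if two rows can share the same second and source, the key would need an extra discriminating field.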




java.lang.NumberFormatException and Thanks to Hemanth and Harsh

2013-02-26 Thread Fatih Haltas
Hi all,

First, I would like to thank you all, especially Hemanth and Harsh.

I solved my problem; it was exactly the Java version and Hadoop version
incompatibility, and now I can run my compiled and jarred MapReduce
program.

I have a different question now. I created a job that finds each IP's
packet count in a given time interval for the NetFlow data.

However, I am getting a java.lang.NumberFormatException.
1. Here is my code in Java:
package org.myorg;
import java.io.IOException;
import org.apache.hadoop.io.*;
import java.util.NoSuchElementException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Mapper;
import java.util.StringTokenizer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;



public class MapReduce extends Configured implements Tool
{


public int run (String[] args) throws Exception
{
System.out.println("Debug1");

if(args.length != 2)
{
System.err.println("Usage: MapReduce <input path> <output path>");
ToolRunner.printGenericCommandUsage(System.err);
}

Job job = new Job();

job.setJarByClass(MapReduce.class);
System.out.println("Debug2");
job.setJobName("MaximumPacketFlowIP");
System.out.println("Debug3");

 FileInputFormat.addInputPath(job, new Path(args[0]));
 System.out.println("Debug8");
 FileOutputFormat.setOutputPath(job, new Path(args[1]));
 System.out.println("Debug9");

job.setMapperClass(FlowPortMapper.class);
System.out.println("Debug6");
job.setReducerClass(FlowPortReducer.class);
System.out.println("Debug7");

job.setOutputKeyClass(Text.class);
System.out.println("Debug4");
job.setOutputValueClass(IntWritable.class);
System.out.println("Debug5");

//System.exit(job.waitForCompletion(true) ? 0:1);
return job.waitForCompletion(true) ? 0:1 ;
}


/* --main-*/
public static void main(String[] args) throws Exception
{
int exitCode = ToolRunner.run(new MapReduce(), args);
System.exit(exitCode);
}

/*-Mapper-*/
static class FlowPortMapper extends Mapper<LongWritable,Text,Text,IntWritable>
{
public void map (LongWritable key, Text value, Context context) throws IOException,
InterruptedException
{
String flow = value.toString();
long starttime=0;
long endtime=0;
long time1=1357289339;
long time2=1357289342;
StringTokenizer line = new StringTokenizer(flow);
String internalip="i";

//Getting the internalip from flow
if(line.hasMoreTokens())
internalip=line.nextToken();

//Getting the starttime and endtime from flow
//Note: Long.parseLong() throws java.lang.NumberFormatException if the token
//is not a plain decimal number, which is the likely source of the exception.
for(int i=0;i<9;i++)
if(line.hasMoreTokens())
starttime=Long.parseLong(line.nextToken());
if(line.hasMoreTokens())
endtime=Long.parseLong(line.nextToken());

//If the time is in the given interval then emit 1
if(starttime>=time1 && endtime<=time2)
context.write(new Text(internalip), new
IntWritable(1));

}

}

/* Reducer---*/

static class FlowPortReducer extends
Reducer<Text,IntWritable,Text,IntWritable>
{
public  void reduce(Text key, Iterable<IntWritable> values,
Context context) throws