RE: HDFS: Couldn't obtain the locations of the last block
Hi Zesheng,

I learned from your offline email that your Hadoop version is 2.0.0-alpha, and you also said "The block is allocated successfully in NN, but isn't created in DN". Yes, we may have this issue in 2.0.0-alpha; I suspect your issue is similar to HDFS-4516. Can you try Hadoop 2.4 or later? You should not be able to reproduce it on those versions.

From your description, the second block is created successfully and the NN would flush the edit log info to the shared journal; the shared storage might persist the info, but before it reports back over RPC there might be a timeout to the NN. So the block exists in the shared edit log, but no DN ever creates it. On restart, the client can fail, because in that Hadoop version the client would retry only if the NN reported the last block size as non-zero, i.e. if it had been synced (see more in HDFS-4516).

Regards,
Yi Liu

From: Zesheng Wu [mailto:wuzeshen...@gmail.com]
Sent: Tuesday, September 09, 2014 6:16 PM
To: user@hadoop.apache.org
Subject: HDFS: Couldn't obtain the locations of the last block

Hi,

These days we encountered a critical bug in HDFS which can prevent HBase from starting normally. The scenario is as follows:
1. rs1 writes data to HDFS file f1, and the first block is written successfully
2. rs1 applies to create the second block successfully; at this point nn1 (active NN) crashes due to a journal write timeout
3. nn2 (standby NN) doesn't become active because zkfc2 is in an abnormal state
4. nn1 is restarted and becomes active
5. During the restart of nn1, rs1 crashes because it writes to nn1 while nn1 is in safe mode
6. As a result, file f1 is left in an abnormal state and the HBase cluster can't serve any more

We can list the file with the command-line shell; it looks like the following:
-rw--- 3 hbase_srv supergroup 134217728 2014-09-05 11:32 /hbase/lgsrv-push/xxx

But when we try to download the file from HDFS, the DFS client complains:
14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 3 times
14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 2 times
14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 1 times
get: Could not obtain the last block locations.

Can anyone help with this?

--
Best Wishes!
Yours, Zesheng
S3 with Hadoop 2.5.0 - Not working
Hi,

I have downloaded hadoop-2.5.0 and am trying to get it working with an S3 backend (single-node in pseudo-distributed mode). I have made changes to core-site.xml according to https://wiki.apache.org/hadoop/AmazonS3 and I have a backend object store running on my machine that supports S3.

I get the following message when I try to start the daemons:

Incorrect configuration: namenode address dfs.namenode.servicerpc-address or dfs.namenode.rpc-address is not configured.

root@ubuntu:/build/hadoop/hadoop-2.5.0# ./sbin/start-dfs.sh
Incorrect configuration: namenode address dfs.namenode.servicerpc-address or dfs.namenode.rpc-address is not configured.
Starting namenodes on []
localhost: starting namenode, logging to /build/hadoop/hadoop-2.5.0/logs/hadoop-root-namenode-ubuntu.out
localhost: starting datanode, logging to /build/hadoop/hadoop-2.5.0/logs/hadoop-root-datanode-ubuntu.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /build/hadoop/hadoop-2.5.0/logs/hadoop-root-secondarynamenode-ubuntu.out
root@ubuntu:/build/hadoop/hadoop-2.5.0#

The daemons don't start after the above. I get the same error if I add the property fs.defaultFS and set its value to the s3 bucket, but if I change the defaultFS to hdfs:// it works fine - I am able to launch the daemons.

My core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3://bucket1</value>
  </property>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>abcd</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>1234</value>
  </property>
</configuration>

I am able to list the buckets and their contents via s3cmd and boto, but unable to get an S3 configuration started via Hadoop. Also, in the core-default.xml listed on the website I don't see an implementation for s3:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
There is an fs.s3.impl entry up to the 1.2.1 release. So does the 2.5.0 release support S3, or do I need to do anything else?

cheers,
Dhiraj
Re: HDFS: Couldn't obtain the locations of the last block
Thanks Yi, I will look into HDFS-4516.

2014-09-10 15:03 GMT+08:00 Liu, Yi A yi.a@intel.com:
[...]

--
Best Wishes!
Yours, Zesheng
Start standby namenode using bootstrapStandby hangs
Hi Experts,

My Hadoop cluster has HA enabled with QJM, and I failed to upgrade it from version 2.2.0 to 2.4.1. Why? Is this an existing issue?

My steps:
1. Stop the Hadoop cluster
2. On each node, upgrade the Hadoop binaries to the newer version
3. On each JournalNode: sbin/hadoop-daemon.sh start journalnode
4. On each DataNode: sbin/hadoop-daemon.sh start datanode
5. On the previously active NameNode: sbin/hadoop-daemon.sh start namenode -upgrade
6. On the previously standby NameNode: sbin/hadoop-daemon.sh start namenode -bootstrapStandby

Encountered issue: the NameNode service failed to start normally, with a warning as below:

2014-09-10 15:57:41,730 WARN org.apache.hadoop.hdfs.server.common.Util: Path /hadoop/hdfs/name should be specified as a URI in configuration files. Please update hdfs configuration.

After printing the above warning, the command hangs and does not produce any further warning or error messages.

Thanks!
Re: S3 with Hadoop 2.5.0 - Not working
> Incorrect configuration: namenode address dfs.namenode.servicerpc-address or dfs.namenode.rpc-address is not configured.
> Starting namenodes on []

The NameNode/DataNode are part of an HDFS service. It makes no sense to try to run them over an S3 URL as the default filesystem, since S3 is a complete filesystem in itself. The services need fs.defaultFS to be set to an HDFS URI to be able to start up.

> but unable to get an s3 config started via hadoop

You can run jobs over S3 input and output data by running a regular MR cluster on HDFS - just pass the right URIs as the input and output parameters of the job. To do this, set your S3 properties in core-site.xml but let fs.defaultFS stay of HDFS type.

> There is an s3.impl until 1.2.1 release. So does the 2.5.0 release support s3 or do i need to do anything else.

In Apache Hadoop 2 we dynamically load the FS classes, so we do not need the fs.NAME.impl configs anymore as we did in Apache Hadoop 1.

On Wed, Sep 10, 2014 at 1:15 PM, Dhiraj jar...@gmail.com wrote:
[...]

--
Harsh J
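To illustrate the point above (this is not code from the thread): with fs.defaultFS left as an HDFS URI, an S3 bucket can still be addressed explicitly by URI from the same configuration. A minimal sketch, reusing the fs.s3.* credential keys and bucket name from the original mail, with a placeholder NameNode address:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListS3Bucket {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The HDFS daemons keep their own default filesystem (placeholder address) ...
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        // ... while S3 is addressed explicitly by URI, using the same credential
        // keys as in the core-site.xml from the original mail.
        conf.set("fs.s3.awsAccessKeyId", "abcd");
        conf.set("fs.s3.awsSecretAccessKey", "1234");

        FileSystem s3 = FileSystem.get(URI.create("s3://bucket1/"), conf);
        for (FileStatus status : s3.listStatus(new Path("s3://bucket1/"))) {
            System.out.println(status.getPath());
        }
    }
}

The same idea applies to MR jobs: keep HDFS as the default filesystem and pass s3:// URIs as the job's input or output paths.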
RE: Error and problem when running a hadoop job
Thank you all for your support. I could fix the issue this morning using this link; it is clearly explained:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/#java-io-ioexception-incompatible-namespaceids
You can use the link as well.

Warm regards

From: vivek [mailto:vivvekbha...@gmail.com]
Sent: Tuesday 9 September 2014 19:31
To: user@hadoop.apache.org
Subject: Re: Error and problem when running a hadoop job

Is there any namespace mismatch? Try to delete the data in the datanode directory.

On Tue, Sep 9, 2014 at 10:41 PM, Sandeep Khurana skhurana...@gmail.com wrote:
Check the log file at ./hadoop/hadoop-datanide-latdevweb02.out (as per your last screenshot). There can be various reasons for the datanode not starting; the real issue will be logged in that file.

On Tue, Sep 9, 2014 at 10:06 PM, YIMEN YIMGA Gael gael.yimen-yi...@sgcib.com wrote:
Hi,
When I run the following command to launch the DATANODE as shown in the screenshot below, all is OK. But when I run the JPS command, I do not see the datanode process.
[screenshot]
That's where my worry is ☹ ☹
Standing by ....

From: vivek [mailto:vivvekbha...@gmail.com]
Sent: Tuesday 9 September 2014 17:27
To: user@hadoop.apache.org
Subject: Re: Error and problem when running a hadoop job

Check whether the datanode is started.

On Tue, Sep 9, 2014 at 7:26 PM, YIMEN YIMGA Gael gael.yimen-yi...@sgcib.com wrote:
Yes, all the ssh access setup has been done. My cluster is a single-node cluster.
Standing by ...

From: Sandeep Khurana [mailto:skhurana...@gmail.com]
Sent: Tuesday 9 September 2014 15:54
To: user@hadoop.apache.org
Subject: Re: Error and problem when running a hadoop job

I hope you did set up passphrase-less ssh access to localhost by generating keys etc.?

On Sep 9, 2014 7:18 PM, YIMEN YIMGA Gael gael.yimen-yi...@sgcib.com wrote:
Hello dear Hadoopers,
I hope you are doing well. I tried to run the WordCount.jar file to experience running Hadoop jobs. After launching the program as shown in the screenshot below, I got the message in the screenshot. The job tries to connect to the datanode but fails after 10 attempts, and I got the error in the second screenshot.
After that, I first stopped all the Hadoop daemons, second formatted the dfs, third re-launched the Hadoop daemons, and I noticed using the JPS command that the DATANODE was not running. I then ran the datanode alone with the command bin/hadoop-daemon.sh start datanode as shown in the third screenshot, but the datanode is still not up and running.
Could someone advise in this case, please?
Standing by for your habitual support. Thanks in advance.
GYY
[screenshots]

--
Thanks and Regards,
VIVEK KOUL

--
Thanks and regards
Sandeep Khurana

--
Thanks and Regards,
VIVEK KOUL
MapReduce data decompression using a custom codec
Hello,

I developed a custom compression codec for Hadoop. Of course, Hadoop is set to use my codec when compressing data. For testing purposes, I use the following two commands:

Compression test command:
---
hadoop jar /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop//../hadoop-mapreduce/hadoop-streaming.jar -Dmapreduce.output.fileoutputformat.compress=true -input /originalFiles/ -output /compressedFiles/ -mapper cat -reducer cat

Decompression test command:
---
hadoop jar /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop//../hadoop-mapreduce/hadoop-streaming.jar -Dmapreduce.output.fileoutputformat.compress=false -input /compressedFiles/ -output /decompressedFiles/ -mapper cat -reducer cat

As you can see, both are quite similar: only the compression option and the input/output directories change. The first command compresses the input data, then 'cat's it (the Linux command, you know) to the output file. The second one decompresses the input data (which is supposed to be compressed), then 'cat's it to the output file. As I understand it, Hadoop is supposed to auto-detect compressed input data and decompress it using the right codec.

Both the compression and decompression tests work well when Hadoop is set to use a default codec, like BZip2 or Snappy. However, when using my custom compression codec, only the compression works: the decompression is sluggish and triggers errors (Java heap space):

packageJobJar: [] [/opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop-mapreduce/hadoop-streaming-2.3.0-cdh5.1.2.jar] /tmp/streamjob6475393520304432687.jar tmpDir=null
14/09/09 15:33:21 INFO client.RMProxy: Connecting to ResourceManager at bluga2/10.1.96.222:8032
14/09/09 15:33:22 INFO client.RMProxy: Connecting to ResourceManager at bluga2/10.1.96.222:8032
14/09/09 15:33:23 INFO mapred.FileInputFormat: Total input paths to process : 1
14/09/09 15:33:23 INFO mapreduce.JobSubmitter: number of splits:1
14/09/09 15:33:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410264242020_0016
14/09/09 15:33:24 INFO impl.YarnClientImpl: Submitted application application_1410264242020_0016
14/09/09 15:33:24 INFO mapreduce.Job: The url to track the job: http://bluga2:8088/proxy/application_1410264242020_0016/
14/09/09 15:33:24 INFO mapreduce.Job: Running job: job_1410264242020_0016
14/09/09 15:33:30 INFO mapreduce.Job: Job job_1410264242020_0016 running in uber mode : false
14/09/09 15:33:30 INFO mapreduce.Job: map 0% reduce 0%
14/09/09 15:35:12 INFO mapreduce.Job: map 100% reduce 0%
14/09/09 15:35:13 INFO mapreduce.Job: Task Id : attempt_1410264242020_0016_m_00_0, Status : FAILED
Error: Java heap space
14/09/09 15:35:14 INFO mapreduce.Job: map 0% reduce 0%
14/09/09 15:35:41 INFO mapreduce.Job: Task Id : attempt_1410264242020_0016_m_00_1, Status : FAILED
Error: Java heap space
14/09/09 15:36:02 INFO mapreduce.Job: Task Id : attempt_1410264242020_0016_m_00_2, Status : FAILED
Error: Java heap space
14/09/09 15:36:49 INFO mapreduce.Job: map 100% reduce 0%
14/09/09 15:36:50 INFO mapreduce.Job: map 100% reduce 100%
14/09/09 15:36:56 INFO mapreduce.Job: Job job_1410264242020_0016 failed with state FAILED due to: Task failed task_1410264242020_0016_m_00
Job failed as tasks failed. failedMaps:1 failedReduces:0
14/09/09 15:36:58 INFO mapreduce.Job: Counters: 9
Job Counters
Failed map tasks=4
Launched map tasks=4
Other local map tasks=3
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=190606
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=190606
Total vcore-seconds taken by all map tasks=190606
Total megabyte-seconds taken by all map tasks=195180544
14/09/09 15:36:58 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!

I already tried to increase the maximum map heap size (the mapreduce.map.java.opts.max.heap YARN property) from 1 GiB to 2 GiB, but the decompression still doesn't work. By the way, I'm compressing and decompressing a small ~2 MB file and use the latest Cloudera version.

I built a quick Java test environment to try to reproduce the Hadoop codec calls (instantiating the codec, creating a new compression stream from it, ...). I noticed that the decompression is an infinite loop in which only the first block of compressed data is decompressed, over and over. This could explain the Java heap space error above.

What am I doing wrong / what did I forget? How could my codec decompress data without trouble?

Thank you for helping!
Kévin Poupon
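As a point of reference for the kind of standalone test described above (this is not the poster's actual code), a round trip through a codec outside MapReduce might look like the sketch below; com.example.MyCodec is a placeholder for the custom codec class. If the custom Decompressor never reports finished() once its input is exhausted, the read loop here never terminates, which matches the "first block decompressed over and over" behaviour and would eventually exhaust the heap.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder class name: instantiate the custom codec the same way Hadoop does.
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(
                conf.getClassByName("com.example.MyCodec"), conf);

        byte[] original = new byte[2 * 1024 * 1024];
        new java.util.Random(42).nextBytes(original);

        // Compress into memory through the codec's output stream.
        ByteArrayOutputStream compressedBytes = new ByteArrayOutputStream();
        CompressionOutputStream out = codec.createOutputStream(compressedBytes);
        out.write(original);
        out.finish();
        out.close();

        // Decompress again. If the Decompressor never signals that it is finished,
        // this loop spins forever on the first block of data.
        CompressionInputStream in = codec.createInputStream(
                new ByteArrayInputStream(compressedBytes.toByteArray()));
        ByteArrayOutputStream decompressedBytes = new ByteArrayOutputStream();
        byte[] buffer = new byte[64 * 1024];
        int n;
        while ((n = in.read(buffer)) != -1) {
            decompressedBytes.write(buffer, 0, n);
        }
        in.close();

        System.out.println("round trip ok: "
                + java.util.Arrays.equals(original, decompressedBytes.toByteArray()));
    }
}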
Re: Regular expressions in fs paths?
I want to unsubscribe from this mailing list.

On Wed, Sep 10, 2014 at 4:42 PM, Charles Robertson charles.robert...@gmail.com wrote:
Hi all,
Is it possible to use regular expressions in fs commands? Specifically, I want to use the copy (-cp) and move (-mv) commands on all files in a directory that match a pattern (the pattern being all files that do not end in '.tmp'). Can this be done?
Thanks,
Charles
Re: Regular expressions in fs paths?
Yes you can: hadoop fs -ls /tmp/myfiles*
I would recommend first using -ls in order to verify you are selecting the right files.

#Mahesh: do you need some help doing this?

On 10.09.2014 13:46, Mahesh Khandewal wrote:
[...]
Re: HDFS: Couldn't obtain the locations of the last block
Hi Yi, I went through HDFS-4516, and it really solves our problem. Thanks very much!

2014-09-10 16:39 GMT+08:00 Zesheng Wu wuzeshen...@gmail.com:
[...]

--
Best Wishes!
Yours, Zesheng
Re: Regular expressions in fs paths?
Hi Georgi,

Thanks for your reply. Won't hadoop fs -ls /tmp/myfiles* return all files that begin with 'myfiles' in the tmp directory? What I don't understand is how I can specify a pattern that excludes files ending in '.tmp'. I have tried using the normal regular expression syntax for this ^(.tmp) but it tries to match it literally.

Regards,
Charles

On 10 September 2014 13:07, Georgi Ivanov iva...@vesseltracker.com wrote:
[...]
RE: HDFS: Couldn't obtain the locations of the last block
That’s great.

Regards,
Yi Liu

From: Zesheng Wu [mailto:wuzeshen...@gmail.com]
Sent: Wednesday, September 10, 2014 8:25 PM
To: user@hadoop.apache.org
Subject: Re: HDFS: Couldn't obtain the locations of the last block
[...]
Error when executing a WordCount Program
Hello Hadoopers,

Here is the error I'm facing when running the WordCount example program written by myself. Kindly find attached the files of my WordCount program. Below is the error.

===
-bash-4.1$ bin/hadoop jar WordCount.jar
Entrée dans le programme MAIN !!!
14/09/10 15:00:24 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/09/10 15:00:24 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/09/10 15:00:24 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/09/10 15:00:24 WARN snappy.LoadSnappy: Snappy native library not loaded
14/09/10 15:00:24 INFO mapred.JobClient: Cleaning up the staging area hdfs://latdevweb02:9000/user/hadoop/.staging/job_201409101141_0001
14/09/10 15:00:24 ERROR security.UserGroupInformation: PriviledgedActionException as:hadoop cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://latdevweb02:9000/home/hadoop/hadoop/input
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://latdevweb02:9000/home/hadoop/hadoop/input
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
        at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
        at fr.societegenerale.bigdata.lactool.WordCountDriver.main(WordCountDriver.java:50)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
-bash-4.1$
===

Thanks in advance for your help.

Warm regards
GYY

Attachments: WordCountReducer.java, WordCountMapper.java, WordCountDriver.java
Re: Error when executing a WordCount Program
hdfs://latdevweb02:9000/home/hadoop/hadoop/input - is this a valid path on HDFS? Can you access this path outside of the program, for example using the hadoop fs -ls command?

Also, were this path and the files in it created by a different user? The exception seems to say that it does not exist or that the running user does not have permission to read it.

Regards,
Shahab

On Wed, Sep 10, 2014 at 9:09 AM, YIMEN YIMGA Gael gael.yimen-yi...@sgcib.com wrote:
[...]
Re: Error when executing a WordCount Program
Hi, have you set a class in your code?

WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

Also, you need to check the path of your input file:

Input path does not exist: hdfs://latdevweb02:9000/home/hadoop/hadoop/input

These are pretty straightforward errors; resolve them and you should be good to go.

Sent from my iPhone

On 10 Sep 2014, at 14:19, Shahab Yunus shahab.yu...@gmail.com wrote:
[...]
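For reference (this is not the attached code): with the old mapred API that appears in the stack trace, a driver usually addresses both warnings by passing the driver class to JobConf, so the containing job jar is located and shipped, and by setting explicit input/output paths that exist on HDFS. A minimal sketch with placeholder class and path names:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriverSketch {
    public static void main(String[] args) throws Exception {
        // Passing the driver class lets Hadoop find and ship the containing jar,
        // which is what the "No job jar file set" warning is about.
        JobConf job = new JobConf(WordCountDriverSketch.class);
        job.setJobName("wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // job.setMapperClass(...) / job.setReducerClass(...) would point at the
        // attached WordCountMapper / WordCountReducer classes.

        // These must be existing HDFS paths, e.g. created beforehand with
        //   hadoop fs -mkdir /user/hadoop/input
        //   hadoop fs -put some-local-file.txt /user/hadoop/input
        FileInputFormat.setInputPaths(job, new Path("/user/hadoop/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output"));

        JobClient.runJob(job);
    }
}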
running beyond virtual memory limits
Hello,

I am getting the following error when running on a 500 MB dataset compressed in the Avro data format.

Container [pid=22961,containerID=container_1409834588043_0080_01_10] is running beyond virtual memory limits. Current usage: 636.6 MB of 1 GB physical memory used; 2.1 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1409834588043_0080_01_10 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 22961 16896 22961 22961 (bash) 0 0 9424896 312 /bin/bash -c /usr/java/default/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx768m -Djava.io.tmpdir=/home/hadoop/yarn/local/usercache/jobsubmit/appcache/application_1409834588043_0080/container_1409834588043_0080_01_10/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_10 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 153.87.47.116 47184 attempt_1409834588043_0080_r_00_0 10 1/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_10/stdout 2/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_10/stderr
|- 22970 22961 22961 22961 (java) 24692 1165 2256662528 162659 /usr/java/default/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx768m -Djava.io.tmpdir=/home/hadoop/yarn/local/usercache/jobsubmit/appcache/application_1409834588043_0080/container_1409834588043_0080_01_10/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_10 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 153.87.47.116 47184 attempt_1409834588043_0080_r_00_0 10
Container killed on request. Exit code is 143

I have read a lot about Hadoop YARN memory settings, but it seems something basic is missing in my understanding of how YARN and MR2 work. I have a pretty small testing cluster of 5 machines, 2 NN and 3 DN, with the following parameters set:

# hadoop - yarn-site.xml
yarn.nodemanager.resource.memory-mb : 2048
yarn.scheduler.minimum-allocation-mb : 256
yarn.scheduler.maximum-allocation-mb : 2048

# hadoop - mapred-site.xml
mapreduce.map.memory.mb : 768
mapreduce.map.java.opts : -Xmx512m
mapreduce.reduce.memory.mb : 1024
mapreduce.reduce.java.opts : -Xmx768m
mapreduce.task.io.sort.mb : 100
yarn.app.mapreduce.am.resource.mb : 1024
yarn.app.mapreduce.am.command-opts : -Xmx768m

I understand the mathematics here for the parameters, but what I do not understand is: do your containers need to grow with the size of your dataset, e.g. by setting mapreduce.map.memory.mb and mapreduce.map.java.opts on a per-job basis?

My reducer doesn't cache any data; it is simply in - out, and just categorizes the data to multiple outputs as follows, using AvroMultipleOutputs():

@Override
public void reduce(Text key, Iterable<AvroValue<PosData>> values, Context context) throws IOException, InterruptedException {
    try {
        log.info("Processing key {}", key.toString());
        final StoreIdDob storeIdDob = separateKey(key);
        log.info("Processing DOB {}, SotoreId {}", storeIdDob.getDob(), storeIdDob.getStoreId());
        int size = 0;
        Output out;
        String path;
        if (storeIdDob.getDob() != null && isValidDOB(storeIdDob.getDob())
                && storeIdDob.getStoreId() != null && !storeIdDob.getStoreId().isEmpty()) {
            // reasonable data
            if (isHistoricalDOB(storeIdDob.getDob())) {
                out = Output.HISTORY;
            } else {
                out = Output.ACTUAL;
            }
            path = out.getKey() + "/" + storeIdDob.getDob() + "/" + storeIdDob.getStoreId();
        } else {
            // error data
            out = Output.ERROR;
            path = out.getKey() + "/" + part;
        }
        for (AvroValue<PosData> posData : values) {
            amos.write(out.getKey(), new AvroKey<PosData>(posData.datum()), null, path);
        }
    } catch (Exception e) {
        log.error("Error on reducer ", e);
        // TODO audit log :-)
    }
}

Do I need to grow the container size with the size of the dataset? That seems odd to me, and I expected that this is what MR is for. Or am I missing some setting that decides the size of the data chunks?

Thx
Jakub
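For reference, the container sizes above are ordinary job configuration, so they can also be overridden per job instead of cluster-wide in mapred-site.xml. A minimal sketch with hypothetical values, keeping each -Xmx below the corresponding *.memory.mb container limit:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PerJobMemory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical per-job overrides of the cluster defaults listed above.
        conf.setInt("mapreduce.map.memory.mb", 1024);
        conf.set("mapreduce.map.java.opts", "-Xmx800m");
        conf.setInt("mapreduce.reduce.memory.mb", 2048);
        conf.set("mapreduce.reduce.java.opts", "-Xmx1600m");

        Job job = Job.getInstance(conf, "per-job memory example");
        // ... mapper/reducer/input/output would be configured here before
        // job.waitForCompletion(true) submits the job.
    }
}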
RE: Error when executing a WordCount Program
Hi,

Please, that is my real problem. Could you please look into my attached code and tell me how I can update it? How do I set a job jar file?

And now, here is my hdfs-site.xml:
==
-bash-4.1$ cat conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-hadoop/dfs/data</value>
  </property>
</configuration>
-bash-4.1$
==

Could you advise on how to solve the error "input path does not exist"?

Standing by ...

Cheers

From: Chris MacKenzie [mailto:stu...@chrismackenziephotography.co.uk]
Sent: Wednesday 10 September 2014 15:27
To: user@hadoop.apache.org
Subject: Re: Error when executing a WordCount Program
[...]
RE: Error when executing a WordCount Program
Hi,

In fact, hdfs://latdevweb02:9000/home/hadoop/hadoop/input is not a folder on HDFS. I created a folder /tmp/hadoop-hadoop/dfs/data, where data will be saved in HDFS. And in my HADOOP_HOME folder there are two folders, "input" and "output", but I don't know how to configure them in the program.

Please could you look into my code and advise?

Standing by ...

Warm regards

From: Shahab Yunus [mailto:shahab.yu...@gmail.com]
Sent: Wednesday 10 September 2014 15:19
To: user@hadoop.apache.org
Subject: Re: Error when executing a WordCount Program
[...]
Hadoop Smoke Test: TERASORT
Hi,

I am trying the smoke test for Hadoop (2.4.1). Regarding "terasort", below is my test command. The Map part completed very quickly because it was split into many subtasks, but the Reduce part is taking a very long time and there is only one Reduce task running. Is there a way to speed up the reduce phase by splitting the large reduce job into many smaller ones and running them across the cluster, like the Map part?

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort /tmp/teragenout /tmp/terasortout

Job ID                   Name      State    Maps Total  Maps Completed  Reduce Total  Reduce Completed
job_1409876705457_0002   TeraSort  RUNNING  22352       22352           1             0

Regards
Arthur
Re: Regular expressions in fs paths?
HDFS doesn't support the full range of glob matching you will find in Linux. If you want to exclude all files from a directory listing that meet a certain criterion, try doing your listing and using grep -v to exclude the matching records.
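Not from the reply above, but as an alternative sketch when the copy or move is driven from Java rather than the fs shell: the FileSystem API accepts a PathFilter, so '.tmp' files can be excluded directly. The directory names here are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class ListNonTmp {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Keep every file in the directory except those ending in ".tmp".
        PathFilter notTmp = new PathFilter() {
            @Override
            public boolean accept(Path path) {
                return !path.getName().endsWith(".tmp");
            }
        };

        for (FileStatus status : fs.listStatus(new Path("/data/incoming"), notTmp)) {
            System.out.println(status.getPath());
            // A move would be, for example:
            // fs.rename(status.getPath(), new Path("/data/ready", status.getPath().getName()));
        }
    }
}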
Writing output from streaming task without dealing with key/value
Hello!

Imagine the following common task: I want to process a big text file line-by-line using the streaming interface - run the Unix grep command, for instance, or some other line-by-line processing, e.g. line.upper(). I copy the file to HDFS. Then I run a map task on this file which reads one line, modifies it in some way and then writes it to the output.

TextInputFormat suits the reading well: its key is the offset in bytes (meaningless in my case) and the value is the line itself, so I can iterate over lines like this (in Python):

for line in sys.stdin:
    print(line.upper())

The problem arises with TextOutputFormat: it tries to split the resulting line on mapreduce.output.textoutputformat.separator, which results in an extra separator in the output if that character is missing from the line (an extra TAB at the end if we stick to the defaults).

Is there any way to write the result of a streaming task without any internal processing, so it appears exactly as the script produces it? If it is impossible with Hadoop, which works with key/value pairs, maybe there are other frameworks on top of HDFS which allow this?

Thanks in advance!
Re: Hadoop Smoke Test: TERASORT
You can set the number of reducers used in any hadoop job from the command line by using -Dmapred.reduce.tasks=XX. e.g. hadoop jar hadoop-mapreduce-examples.jar terasort -Dmapred.reduce.tasks=10 /terasort-input /terasort-output
Re: Writing output from streaming task without dealing with key/value
If you don't want the key in the final output, you can set it like this in Java:

job.setOutputKeyClass(NullWritable.class);

It will then just print the value in the output file. I don't know how to do it in Python.

On 9/10/14, Dmitry Sivachenko trtrmi...@gmail.com wrote:
[...]
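In context, that suggestion for a plain Java (non-streaming) job would look roughly like the sketch below; note that the mapper or reducer must also actually emit NullWritable.get() as its key for the separator to disappear.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ValueOnlyJobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "value-only output");
        // With NullWritable keys, TextOutputFormat writes only the value and
        // appends no separator, so each output line is exactly the value text.
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // ... mapper/reducer classes and input/output paths would be set here ...
    }
}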
Re: Writing output from streaming task without dealing with key/value
In Python, or any streaming program, just set the output value to the empty string and you will get something like key\t. -- *Kernighan's Law* Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
Re: Writing output from streaming task without dealing with key/value
On Sep 10, 2014, at 22:05, Rich Haase rdha...@gmail.com wrote: In Python, or any streaming program, just set the output value to the empty string and you will get something like key\t.
I see, but I want to use many existing programs (like UNIX grep), and I don't want to have an extra \t in the output. Is there any way to achieve this? Or maybe it is possible to write a custom XxxOutputFormat to work around that issue? (Something opposite to TextInputFormat: it passes the input line without any modification to the script's stdin; there should be a way to write stdout to a file as is.) Thanks!
Re: Writing output from streaming task without dealing with key/value
You can write a custom output format, or you can write your mapreduce job in Java and use a NullWritable as Susheel recommended. grep (and every other *nix text processing command I can think of) would not be bothered by a trailing tab character. It's also quite easy to strip away that tab character if you don't want it during the post-processing steps you perform with *nix commands. -- *Kernighan's Law* Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
Re: Writing output from streaming task without dealing with key/value
On Sep 10, 2014, at 22:19, Rich Haase rdha...@gmail.com wrote: You can write a custom output format
Any clues how this can be done?
, or you can write your mapreduce job in Java and use a NullWritable as Susheel recommended. grep (and every other *nix text processing command I can think of) would not be bothered by a trailing tab character. It's also quite easy to strip away that tab character if you don't want it during the post-processing steps you perform with *nix commands.
The problem is that when the line itself contains a TAB in the middle, there will not be an extra trailing TAB at the end. So it is not that simple: you never know whether a TAB came from the original line or is the extra TAB added by TextOutputFormat. Thanks!
Re: Writing output from streaming task without dealing with key/value
Examples (the top ones are related to streaming jobs): http://www.infoq.com/articles/HadoopOutputFormat http://research.neustar.biz/2011/08/30/custom-inputoutput-formats-in-hadoop-streaming/ http://stackoverflow.com/questions/12759651/how-to-override-inputformat-and-outputformat-in-hadoop-application Regards, Shahab
Re: Writing output from streaming task without dealing with key/value
On Sep 10, 2014, at 22:47, Shahab Yunus shahab.yu...@gmail.com wrote: Examples (the top ones are related to streaming jobs)
Thanks for the links. The problem is that in RecordWriter() I get two parameters: key and value. If one of them is empty, I have no way to tell whether I should output the delimiter (because it was present in the original line) or not. What is the proper way to work around that issue?
Re: Writing output from streaming task without dealing with key/value
Use ‘tr -s’ to strip out the tabs?
$ echo -e a\t\t\tb
a b
$ echo -e a\t\t\tb | tr -s \t
a b
Re: Writing output from streaming task without dealing with key/value
If you don’t want anything to get inserted, just set your output to key only or value only. TextOutputFormat$LineRecordWriter won’t insert anything unless both values are set:
public synchronized void write(K key, V value) throws IOException {
  boolean nullKey = key == null || key instanceof NullWritable;
  boolean nullValue = value == null || value instanceof NullWritable;
  if (nullKey && nullValue) {
    return;
  }
  if (!nullKey) {
    writeObject(key);
  }
  if (!(nullKey || nullValue)) {
    out.write(keyValueSeparator);
  }
  if (!nullValue) {
    writeObject(value);
  }
  out.write(newline);
}
On Sep 10, 2014, at 1:37 PM, Dmitry Sivachenko trtrmi...@gmail.com wrote: There can be tabs in the input, I want to keep the input lines without any modification. Actually it is a rather standard task: process lines one by one without inserting extra characters. There should be a standard solution for it, IMO.
Re: Writing output from streaming task without dealing with key/value
On Sep 11, 2014, at 0:47, Felix Chern idry...@gmail.com wrote: If you don’t want anything to get inserted, just set your output to key only or value only. TextOutputFormat$LineRecordWriter won’t insert anything unless both values are set:
If I output the value only, for instance, and my line contains a TAB, then everything before the TAB will be lost? And if I output the key only, and my line contains a TAB, then everything after the TAB will be lost?
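For the custom output format route discussed in this thread, here is a rough sketch of what it could look like. The class name is made up, and it uses the old org.apache.hadoop.mapred API, which is what streaming's -inputformat/-outputformat options work with as far as I know. It writes key and value back out as one raw line and only inserts a TAB when the framework actually split one off, so in the common case the script's output is reproduced byte for byte. The ambiguity raised above does remain in one corner case: a line that originally ended with a bare trailing TAB arrives with an empty value, so that TAB cannot be reconstructed.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

// Illustrative sketch: write "key<TAB>value" back out as one raw line.
public class RawTextOutputFormat extends FileOutputFormat<Text, Text> {

  public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored, JobConf job,
      String name, Progressable progress) throws IOException {
    Path file = FileOutputFormat.getTaskOutputPath(job, name);
    final FSDataOutputStream out = file.getFileSystem(job).create(file, progress);

    return new RecordWriter<Text, Text>() {
      public void write(Text key, Text value) throws IOException {
        // Write the key bytes exactly as received from the streaming script.
        out.write(key.getBytes(), 0, key.getLength());
        // Re-insert the tab only when the framework actually split something
        // off into the value; a line with no tab arrives with an empty value.
        if (value != null && value.getLength() > 0) {
          out.write('\t');
          out.write(value.getBytes(), 0, value.getLength());
        }
        out.write('\n');
      }

      public void close(Reporter reporter) throws IOException {
        out.close();
      }
    };
  }
}

It would then be referenced from the streaming command with -outputformat (plus shipping the jar, e.g. via -libjars); I have not verified the exact invocation here, so treat that part as a hint rather than a recipe.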
Re: Regular expressions in fs paths?
I solved this in the end by using a shell script (initiated by an Oozie shell action) to use grep and loop through the results - I didn't have to use the -v option, as the -e option gives you access to a fuller range of regular expression functionality. Thanks for your help (again!) Rich. Charles On 10 September 2014 16:50, Rich Haase rdha...@gmail.com wrote: HDFS doesn't support the full range of glob matching you will find in Linux. If you want to exclude all files from a directory listing that meet certain criteria, try doing your listing and using grep -v to exclude the matching records.
The running job is blocked for a while if the queue is short of resources
Hi experts, I faced one strange issue I cannot understand; can you tell me if this is a bug or if I configured something wrong? Below is my situation. I'm running the Hadoop 2.2.0 release and all my jobs are uberized; each node can only run a single job at a time. I use the CapacityScheduler and configured 2 queues (default and small), and I only give 5% capacity (10 nodes) to the small queue. What I found is that the throughput of the small queue is very poor when it's under heavy load (the inflow rate exceeds the processing speed). I checked the log of a job and found that each job takes an extra 1-2 minutes in the job commit phase; see the log below:
2014-09-10 14:01:13,665 INFO [uber-SubtaskRunner] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from attempt_1410336300553_9902_m_00_0
2014-09-10 14:01:13,665 INFO [uber-SubtaskRunner] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1410336300553_9902_m_00_0 is : 1.0
2014-09-10 14:01:13,670 INFO [uber-SubtaskRunner] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from attempt_1410336300553_9902_m_00_0
2014-09-10 14:01:13,670 INFO [uber-SubtaskRunner] org.apache.hadoop.mapred.Task: Task 'attempt_1410336300553_9902_m_00_0' done.
2014-09-10 14:01:13,671 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1410336300553_9902_m_00_0 TaskAttempt Transitioned from RUNNING to SUCCESS_CONTAINER_CLEANUP
2014-09-10 14:01:13,671 INFO [uber-SubtaskRunner] org.apache.hadoop.mapred.LocalContainerLauncher: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1410336300553_9902_01_01 taskAttempt attempt_1410336300553_9902_m_00_0
2014-09-10 14:01:13,675 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1410336300553_9902_m_00_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2014-09-10 14:01:13,685 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1410336300553_9902_m_00_0
2014-09-10 14:01:13,687 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1410336300553_9902_m_00 Task Transitioned from RUNNING to SUCCEEDED
2014-09-10 14:01:13,693 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 1
2014-09-10 14:01:13,694 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.TIEMRAppMetrics: task is completed on
2014-09-10 14:01:13,697 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1410336300553_9902Job Transitioned from RUNNING to COMMITTING
2014-09-10 14:01:13,697 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_COMMIT
2014-09-10 14:02:30,121 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Calling handler for JobFinishedEvent
2014-09-10 14:02:30,122 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1410336300553_9902Job Transitioned from COMMITTING to SUCCEEDED
As you can see, the job commit started at 14:01:13 and ended at 14:02:30, so it took a lot of time. I also captured a thread dump of the job (AppMaster); the interesting part is here:
CommitterEvent Processor #1 id=91 idx=0x16c tid=29593 prio=5 alive, waiting, native_blocked -- Waiting for notification
on: org/apache/hadoop/mapreduce/v2/app/commit/CommitterEventHandler$EventProcessor@0x906b46d0[fat lock]
at jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method)
at java/lang/Object.wait(J)V(Native Method)
at java/lang/Object.wait(Object.java:485)
at org/apache/hadoop/mapreduce/v2/app/commit/CommitterEventHandler$EventProcessor.waitForValidCommitWindow(CommitterEventHandler.java:313)
^-- Lock released while waiting: org/apache/hadoop/mapreduce/v2/app/commit/CommitterEventHandler$EventProcessor@0x906b46d0[fat lock]
at org/apache/hadoop/mapreduce/v2/app/commit/CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:252)
at org/apache/hadoop/mapreduce/v2/app/commit/CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:216)
at java/util/concurrent/ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java/lang/Thread.run(Thread.java:662)
at jrockit/vm/RNI.c2java(J)V(Native Method)
-- end of trace
I checked the code; it is blocked waiting for the heartbeat to the RM. I also checked org.apache.hadoop.mapreduce.v2.app.local.LocalContainerAllocator.heartbeat(); it seems to send another resource allocate request to the RM. So my understanding (correct me if wrong) is if the
Balancing is very slow.
hadoop 2.4.1. Balancing is very slow.
$HADOOP_PREFIX/bin/hdfs dfsadmin -setBalancerBandwidth 52428800
It takes a long time to move one block:
2014. 09. 11. 11:38:01 Block begins to move
2014-09-11 11:47:20 Complete block move
# 10.2.1.211 netstat, block begins to move, 10.2.1.210 -->>> 10.2.1.211
2014. 09. 11. 11:38:01 tcp 1110650 0 10.2.1.211:56819 10.2.1.210:40010 ESTABLISHED -
# datanode log, 10.2.1.211
2014-09-11 11:47:09,819 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Copied BP-1770955034-0.0.0.0-1401163460236:blk_1077753386_4013196 to /10.2.1.211:56819
# namenode balancer log
2014-09-11 11:47:20,782 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved blk_1077753386_4013196 with size=134217728 from 10.2.1.204:40010 to 10.2.1.211:40010 through 10.2.1.210:40010
# check network state: file transfer speed using scp is 76.7MB/s
dummy.tar 100% 230MB 76.7MB/s 00:03