RE: HDFS: Couldn't obtain the locations of the last block

2014-09-10 Thread Liu, Yi A
Hi Zesheng,

I learned from an offline email from you that your Hadoop version is 2.0.0-alpha, and 
you also said “The block is allocated successfully in NN, but isn’t created in DN”.
Yes, we may have this issue in 2.0.0-alpha. I suspect your issue is similar to 
HDFS-4516. Can you try Hadoop 2.4 or later? You should not be able to reproduce it 
on those versions.

From your description, the second block is allocated successfully and the NN flushes 
the edit log info to the shared journal; the shared storage may persist the info, but 
the RPC back to the NN can time out before it reports success. So the block exists in 
the shared edit log, but the DN never creates it. On restart, the client could fail, 
because in that Hadoop version the client retries only when the NN reports the last 
block size as non-zero, i.e. when it was synced (see HDFS-4516 for more).

Regards,
Yi Liu

From: Zesheng Wu [mailto:wuzeshen...@gmail.com]
Sent: Tuesday, September 09, 2014 6:16 PM
To: user@hadoop.apache.org
Subject: HDFS: Couldn't obtain the locations of the last block

Hi,

These days we encountered a critical bug in HDFS which can prevent HBase from 
starting normally.
The scenario is as follows:
1. rs1 writes data to HDFS file f1, and the first block is written successfully
2. rs1 allocates the second block successfully; at this moment, nn1 (ann) crashes 
due to a journal write timeout
3. nn2 (snn) doesn't become active because zkfc2 is in an abnormal state
4. nn1 is restarted and becomes active
5. While nn1 is restarting, rs1 crashes because it writes to nn1 while nn1 is still 
in safe mode
6. As a result, the file f1 is left in an abnormal state and the HBase cluster can't 
serve any more

We can use the command-line shell to list the file; it looks like the following:

-rw---   3 hbase_srv supergroup  134217728 2014-09-05 11:32 
/hbase/lgsrv-push/xxx
But when we try to download the file from HDFS, the DFS client complains:

14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. 
Datanodes might not have reported blocks completely. Will retry for 3 times

14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. 
Datanodes might not have reported blocks completely. Will retry for 2 times

14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. 
Datanodes might not have reported blocks completely. Will retry for 1 times

get: Could not obtain the last block locations.

Anyone can help on this?
--
Best Wishes!

Yours, Zesheng


S3 with Hadoop 2.5.0 - Not working

2014-09-10 Thread Dhiraj
Hi,

I have downloaded hadoop-2.5.0 and am trying to get it working with an S3
backend *(single node in pseudo-distributed mode)*.
I have made changes to core-site.xml according to
https://wiki.apache.org/hadoop/AmazonS3

I have a backend object store running on my machine that supports S3.

I get the following message when I try to start the daemons:
*Incorrect configuration: namenode address dfs.namenode.servicerpc-address
or dfs.namenode.rpc-address is not configured.*


root@ubuntu:/build/hadoop/hadoop-2.5.0# ./sbin/start-dfs.sh
Incorrect configuration: namenode address dfs.namenode.servicerpc-address
or dfs.namenode.rpc-address is not configured.
Starting namenodes on []
localhost: starting namenode, logging to
/build/hadoop/hadoop-2.5.0/logs/hadoop-root-namenode-ubuntu.out
localhost: starting datanode, logging to
/build/hadoop/hadoop-2.5.0/logs/hadoop-root-datanode-ubuntu.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to
/build/hadoop/hadoop-2.5.0/logs/hadoop-root-secondarynamenode-ubuntu.out
root@ubuntu:/build/hadoop/hadoop-2.5.0#

The daemons don't start after the above.
I get the same error if I add the property fs.defaultFS and set its value
to the s3 bucket, but if I change the defaultFS to *hdfs://* it works fine -
I am able to launch the daemons.

my core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3://bucket1</value>
  </property>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>abcd</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>1234</value>
  </property>
</configuration>


I am able to list the buckets and their contents via s3cmd and boto, but I am
unable to get an S3-backed configuration started via Hadoop.

Also, in the core-default.xml listed on the website, I don't see an
implementation for s3:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml

There is an fs.s3.impl entry up to the 1.2.1 release. So does the 2.5.0 release
support s3, or do I need to do anything else?

cheers,
Dhiraj


Re: HDFS: Couldn't obtain the locations of the last block

2014-09-10 Thread Zesheng Wu
Thanks Yi, I will look into HDFS-4516.


2014-09-10 15:03 GMT+08:00 Liu, Yi A yi.a@intel.com:

  Hi Zesheng,



 I got from an offline email of you and knew your Hadoop version was
 2.0.0-alpha and you also said “The block is allocated successfully in NN,
 but isn’t created in DN”.

 Yes, we may have this issue in 2.0.0-alpha. I suspect your issue is
 similar with HDFS-4516.   And can you try Hadoop 2.4 or later, you should
 not be able to re-produce it for these versions.



 From your description, the second block is created successfully and NN
 would flush the edit log info to shared journal and shared storage might
 persist the info, but before reporting back in rpc, there might be timeout
 to NN from shared storage.  So the block exist in shared edit log, but DN
 doesn’t create it in anyway.  On restart, client could fail, because in
 that Hadoop version, client would retry only in the case of NN last block
 size reported as non-zero if it was synced (see more in HDFS-4516).



 Regards,

 Yi Liu



 *From:* Zesheng Wu [mailto:wuzeshen...@gmail.com]
 *Sent:* Tuesday, September 09, 2014 6:16 PM
 *To:* user@hadoop.apache.org
 *Subject:* HDFS: Couldn't obtain the locations of the last block



 Hi,



 These days we encountered a critical bug in HDFS which can result in HBase
 can't start normally.

 The scenario is like following:

 1.  rs1 writes data to HDFS file f1, and the first block is written
 successfully

 2.  rs1 apply to create the second block successfully, at this time,
 nn1(ann) is crashed due to writing journal timeout

 3. nn2(snn) isn't become active because of zkfc2 is in abnormal state

 4. nn1 is restarted and becomes active

 5. During the process of nn1 restarting, rs1 is crashed due to writing to
 safemode nn(nn1)

 6. As a result, the file f1 is in abnormal state and the HBase cluster
 can't serve any more



 We can use the command line shell to list the file, look like following:

 -rw---   3 hbase_srv supergroup  134217728 2014-09-05 11:32 
 /hbase/lgsrv-push/xxx

  But when we try to download the file from hdfs, the dfs client complains:

 14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. 
 Datanodes might not have reported blocks completely. Will retry for 3 times

 14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. 
 Datanodes might not have reported blocks completely. Will retry for 2 times

 14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. 
 Datanodes might not have reported blocks completely. Will retry for 1 times

 get: Could not obtain the last block locations.

 Anyone can help on this?

  --
 Best Wishes!

 Yours, Zesheng




-- 
Best Wishes!

Yours, Zesheng


Start standby namenode using bootstrapStandby hangs

2014-09-10 Thread sam liu
Hi Experts,

My Hadoop cluster has HA enabled with QJM, and I failed to upgrade it from
version 2.2.0 to 2.4.1. Why did it fail? Is this an existing issue?

My steps:
1. Stop hadoop cluster
2. On each node, upgrade hadoop binary with the newer version
3. On each JournalNode:
sbin/hadoop-daemon.sh start journalnode
4. On each DataNode:
sbin/hadoop-daemon.sh start datanode
5. On previous active NameNode:
sbin/hadoop-daemon.sh start namenode -upgrade
6. On previous standby NameNode:
sbin/hadoop-daemon.sh start namenode -bootstrapStandby

Encountered issue:
The NameNode service failed to start normally, with the warning below:
2014-09-10 15:57:41,730 WARN org.apache.hadoop.hdfs.server.common.Util:
Path /hadoop/hdfs/name should be specified as a URI in configuration files.
Please update hdfs configuration.
After printing the above warning, the command hangs and does not emit any
further warning or error messages.

Thanks!


Re: S3 with Hadoop 2.5.0 - Not working

2014-09-10 Thread Harsh J
 Incorrect configuration: namenode address dfs.namenode.servicerpc-address or 
 dfs.namenode.rpc-address is not configured.
 Starting namenodes on []

The NameNode/DataNode are part of an HDFS service. It makes no sense to try
to run them over an S3 default URL; S3 is a distributed filesystem in
itself. The services need fs.defaultFS to be set to an HDFS URI for them to
be able to start up.

 but unable to get an s3 config started via hadoop

You can run jobs over S3 input and output data with a regular MR cluster
running on HDFS - just pass the right URIs as the input and output
parameters of the job. To do this, set your S3 properties in core-site.xml
but keep fs.defaultFS of HDFS type.
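
For illustration only - the property names below assume the old s3:// block
filesystem connector described on the wiki page, and the bucket, key and path
values are placeholders - a driver that reads S3 input while the cluster itself
stays on HDFS could look roughly like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3InputJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // S3 credentials for the s3:// block filesystem (placeholder values).
    conf.set("fs.s3.awsAccessKeyId", "ACCESS_KEY");
    conf.set("fs.s3.awsSecretAccessKey", "SECRET_KEY");

    Job job = Job.getInstance(conf, "s3-input-example");
    job.setJarByClass(S3InputJob.class);
    // Input comes from S3, output goes to the cluster's HDFS;
    // fs.defaultFS in core-site.xml stays an hdfs:// URI.
    FileInputFormat.addInputPath(job, new Path("s3://bucket1/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/s3-example-out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}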

 There is an s3.impl until 1.2.1 release. So does the 2.5.0 release support s3 
 or do i need to do anything else.

In Apache Hadoop 2 we dynamically load the FS classes, so we do not
need the fs.NAME.impl configs anymore as we did in Apache Hadoop 1.

On Wed, Sep 10, 2014 at 1:15 PM, Dhiraj jar...@gmail.com wrote:
 Hi,

 I have downloaded hadoop-2.5.0 and am trying to get it working for s3
 backend (single-node in a pseudo-distributed mode).
 I have made changes to the core-site.xml according to
 https://wiki.apache.org/hadoop/AmazonS3

 I have an backend object store running on my machine that supports S3.

 I get the following message when i try to start the daemons
 Incorrect configuration: namenode address dfs.namenode.servicerpc-address or
 dfs.namenode.rpc-address is not configured.


 root@ubuntu:/build/hadoop/hadoop-2.5.0# ./sbin/start-dfs.sh
 Incorrect configuration: namenode address dfs.namenode.servicerpc-address or
 dfs.namenode.rpc-address is not configured.
 Starting namenodes on []
 localhost: starting namenode, logging to
 /build/hadoop/hadoop-2.5.0/logs/hadoop-root-namenode-ubuntu.out
 localhost: starting datanode, logging to
 /build/hadoop/hadoop-2.5.0/logs/hadoop-root-datanode-ubuntu.out
 Starting secondary namenodes [0.0.0.0]
 0.0.0.0: starting secondarynamenode, logging to
 /build/hadoop/hadoop-2.5.0/logs/hadoop-root-secondarynamenode-ubuntu.out
 root@ubuntu:/build/hadoop/hadoop-2.5.0#

 The deamons dont start after the above.
 i get the same error if i add the property fs.defaultFS and set its value
 to the s3 bucket but if i change the defaultFS to hdfs:// it works fine - am
 able to launch the daemons.

 my core-site.xml:
 <configuration>
   <property>
     <name>fs.defaultFS</name>
     <value>s3://bucket1</value>
   </property>
   <property>
     <name>fs.s3.awsAccessKeyId</name>
     <value>abcd</value>
   </property>
   <property>
     <name>fs.s3.awsSecretAccessKey</name>
     <value>1234</value>
   </property>
 </configuration>


 I am able to list the buckets and its contents via s3cmd and boto; but
 unable to get an s3 config started via hadoop

 Also from the following core-file.xml listed on the website; i dont see an
 implementation for s3
 http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml

 There is an s3.impl until 1.2.1 release. So does the 2.5.0 release support
 s3 or do i need to do anything else.

 cheers,
 Dhiraj






-- 
Harsh J


RE: Error and problem when running a hadoop job

2014-09-10 Thread YIMEN YIMGA Gael
Thank you all for your support.

I fixed the issue this morning using the link below; it is clearly explained there.

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/#java-io-ioexception-incompatible-namespaceids

You can use the link as well.

Warm regards

From: vivek [mailto:vivvekbha...@gmail.com]
Sent: Tuesday 9 September 2014 19:31
To: user@hadoop.apache.org
Subject: Re: Error and problem when running a hadoop job

Is there a namespaceID mismatch?
Try deleting the data in the DataNode data directory.

On Tue, Sep 9, 2014 at 10:41 PM, Sandeep Khurana 
skhurana...@gmail.commailto:skhurana...@gmail.com wrote:
Check the log file at ./hadoop/hadoop-datanide-latdevweb02.out (as per your 
last screenshot). There can be various reasons for the DataNode not starting; the 
real issue will be logged in this file.

On Tue, Sep 9, 2014 at 10:06 PM, YIMEN YIMGA Gael 
gael.yimen-yi...@sgcib.commailto:gael.yimen-yi...@sgcib.com wrote:
Hi,

When I run the following command to launch the DataNode, as shown in the screenshot 
below, all is OK.
But when I run the jps command, I do not see the DataNode process.

[cid:image001.png@01CFCCED.E2FD4BC0]

That’s where my worry is ☹ ☹

Standing by ….

From: vivek [mailto:vivvekbha...@gmail.commailto:vivvekbha...@gmail.com]
Sent: Tuesday 9 September 2014 17:27

To: user@hadoop.apache.orgmailto:user@hadoop.apache.org
Subject: Re: Error and problem when running a hadoop job

check whether datanode is started.


On Tue, Sep 9, 2014 at 7:26 PM, YIMEN YIMGA Gael 
gael.yimen-yi...@sgcib.commailto:gael.yimen-yi...@sgcib.com wrote:
Yes, everything regarding SSH access has been done.

My cluster is a single-node cluster.

Standing by …

From: Sandeep Khurana 
[mailto:skhurana...@gmail.commailto:skhurana...@gmail.com]
Sent: Tuesday 9 September 2014 15:54
To: user@hadoop.apache.orgmailto:user@hadoop.apache.org
Subject: Re: Error and problem when running a hadoop job


I hope you did set up passphrase-less SSH access to localhost by generating keys, 
etc.?
On Sep 9, 2014 7:18 PM, YIMEN YIMGA Gael 
gael.yimen-yi...@sgcib.commailto:gael.yimen-yi...@sgcib.com wrote:
Hello Dear hadoopers,

I hope you are doing well.

I tried to run the WordCount.jar file to gain experience running Hadoop jobs. After 
launching the program as shown in the screenshot below, I got the message shown in 
the screenshot.
The job tries to connect to the DataNode but fails after 10 attempts; I got the error 
in the second screenshot.
After that, I first stopped all the Hadoop daemons, then formatted the DFS, then 
re-launched the Hadoop daemons, and I noticed using the jps command that the DataNode 
is not running.
I then ran the DataNode alone with the command bin/hadoop –deamon.sh start datanode, 
as shown in the third screenshot, but the DataNode is still not up and running.

Could someone advise in this case, please?

Standing by for your habitual support.

Thanks in advance.

GYY

[cid:image002.png@01CFCCED.E2FD4BC0]


[cid:image003.png@01CFCCED.E2FD4BC0]


[cid:image004.png@01CFCCED.E2FD4BC0]




--







Thanks and Regards,

VIVEK KOUL



--
Thanks and regards
Sandeep Khurana



--







Thanks and Regards,

VIVEK KOUL


MapReduce data decompression using a custom codec

2014-09-10 Thread POUPON Kevin
Hello,

I developed a custom compression codec for Hadoop. Of course Hadoop is set to 
use my codec when compressing data.
For testing purposes, I use the following two commands:

Compression test command:
---
hadoop jar 
/opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop//../hadoop-mapreduce/hadoop-streaming.jar
 -Dmapreduce.output.fileoutputformat.compress=true -input /originalFiles/ 
-output /compressedFiles/ -mapper cat -reducer cat


Decompression test command:
---
hadoop jar 
/opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop//../hadoop-mapreduce/hadoop-streaming.jar
 -Dmapreduce.output.fileoutputformat.compress=false -input /compressedFiles/ 
-output /decompressedFiles/ -mapper cat -reducer cat


As you can see, both of them are quite similar: only the compression option and the 
input/output directories change.

The first command 'cat's the input data (the Linux command, you know) and writes it 
compressed to the output.
The second one reads the input data (which is supposed to be compressed), decompresses 
it, and 'cat's it to the output. As I understand it, Hadoop is supposed to auto-detect 
compressed input data and decompress it using the right codec.

These compression and decompression tests work well when Hadoop is set to use a 
default codec, like BZip2 or Snappy.

However, when using my custom compression codec, only the compression works: the 
decompression is sluggish and triggers errors (Java heap space):

packageJobJar: [] 
[/opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop-mapreduce/hadoop-streaming-2.3.0-cdh5.1.2.jar]
 /tmp/streamjob6475393520304432687.jar tmpDir=null
14/09/09 15:33:21 INFO client.RMProxy: Connecting to ResourceManager at 
bluga2/10.1.96.222:8032
14/09/09 15:33:22 INFO client.RMProxy: Connecting to ResourceManager at 
bluga2/10.1.96.222:8032
14/09/09 15:33:23 INFO mapred.FileInputFormat: Total input paths to process : 1
14/09/09 15:33:23 INFO mapreduce.JobSubmitter: number of splits:1
14/09/09 15:33:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
job_1410264242020_0016
14/09/09 15:33:24 INFO impl.YarnClientImpl: Submitted application 
application_1410264242020_0016
14/09/09 15:33:24 INFO mapreduce.Job: The url to track the job: 
http://bluga2:8088/proxy/application_1410264242020_0016/
14/09/09 15:33:24 INFO mapreduce.Job: Running job: job_1410264242020_0016
14/09/09 15:33:30 INFO mapreduce.Job: Job job_1410264242020_0016 running in 
uber mode : false
14/09/09 15:33:30 INFO mapreduce.Job:  map 0% reduce 0%
14/09/09 15:35:12 INFO mapreduce.Job:  map 100% reduce 0%
14/09/09 15:35:13 INFO mapreduce.Job: Task Id : 
attempt_1410264242020_0016_m_00_0, Status : FAILED
Error: Java heap space
14/09/09 15:35:14 INFO mapreduce.Job:  map 0% reduce 0%
14/09/09 15:35:41 INFO mapreduce.Job: Task Id : 
attempt_1410264242020_0016_m_00_1, Status : FAILED
Error: Java heap space
14/09/09 15:36:02 INFO mapreduce.Job: Task Id : 
attempt_1410264242020_0016_m_00_2, Status : FAILED
Error: Java heap space
14/09/09 15:36:49 INFO mapreduce.Job:  map 100% reduce 0%
14/09/09 15:36:50 INFO mapreduce.Job:  map 100% reduce 100%
14/09/09 15:36:56 INFO mapreduce.Job: Job job_1410264242020_0016 failed with 
state FAILED due to: Task failed task_1410264242020_0016_m_00
Job failed as tasks failed. failedMaps:1 failedReduces:0

14/09/09 15:36:58 INFO mapreduce.Job: Counters: 9
   Job Counters
 Failed map tasks=4
 Launched map tasks=4
 Other local map tasks=3
 Data-local map tasks=1
 Total time spent by all maps in occupied slots (ms)=190606
 Total time spent by all reduces in occupied slots (ms)=0
 Total time spent by all map tasks (ms)=190606
 Total vcore-seconds taken by all map tasks=190606
 Total megabyte-seconds taken by all map tasks=195180544
14/09/09 15:36:58 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!

I already tried to increase the maximum map heap size (the 
mapreduce.map.java.opts.max.heap YARN property) from 1 GiB to 2 GiB, but the 
decompression still doesn't work. By the way, I'm compressing and decompressing a 
small ~2 MB file and using the latest Cloudera version.

I built a quick Java test environment to try to reproduce the Hadoop codec calls 
(instantiating the codec, creating a new compression stream from it, ...). I noticed 
that the decompression runs in an infinite loop in which only the first block of 
compressed data is decompressed, over and over. This could explain the Java heap 
space error above.
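
For reference, the read loop in that test harness is essentially the following (a 
simplified sketch; the input path is just whatever file my codec's registered 
extension matches):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecReadTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);                      // a compressed part file
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(input);    // resolved by file extension

    try (InputStream in = codec.createInputStream(FileSystem.get(conf).open(input))) {
      byte[] buf = new byte[64 * 1024];
      long total = 0;
      int n;
      // The stream must eventually return -1; if it keeps re-serving the first
      // block, this loop never ends, which matches the heap-space symptom.
      while ((n = in.read(buf)) != -1) {
        total += n;
      }
      System.out.println("Decompressed " + total + " bytes");
    }
  }
}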

What am I doing wrong / what did I forget? How can I get my codec to decompress data 
without trouble?

Thank you for helping !

Kévin Poupon



Re: Regular expressions in fs paths?

2014-09-10 Thread Mahesh Khandewal
I want to unsubscribe from this mailing list

On Wed, Sep 10, 2014 at 4:42 PM, Charles Robertson 
charles.robert...@gmail.com wrote:

 Hi all,

 Is it possible to use regular expressions in fs commands? Specifically, I
 want to use the copy (-cp) and move (-mv) commands on all files in a
 directory that match a pattern (the pattern being all files that do not end
 in '.tmp').

 Can this be done?

 Thanks,
 Charles



Re: Regular expressions in fs paths?

2014-09-10 Thread Georgi Ivanov

Yes you can :
hadoop fs -ls /tmp/myfiles*

I would recommend first using -ls in order to verify  you are selecting 
the right files.


#Mahesh : do you need some help doing this ?


On 10.09.2014 13:46, Mahesh Khandewal wrote:

I want to unsubscribe from this mailing list

On Wed, Sep 10, 2014 at 4:42 PM, Charles Robertson 
charles.robert...@gmail.com mailto:charles.robert...@gmail.com wrote:


Hi all,

Is it possible to use regular expressions in fs commands?
Specifically, I want to use the copy (-cp) and move (-mv) commands
on all files in a directory that match a pattern (the pattern
being all files that do not end in '.tmp').

Can this be done?

Thanks,
Charles






Re: HDFS: Couldn't obtain the locations of the last block

2014-09-10 Thread Zesheng Wu
Hi Yi,

I went through HDFS-4516, and it really solves our problem, thanks very
much!

2014-09-10 16:39 GMT+08:00 Zesheng Wu wuzeshen...@gmail.com:

 Thanks Yi, I will look into HDFS-4516.


 2014-09-10 15:03 GMT+08:00 Liu, Yi A yi.a@intel.com:

  Hi Zesheng,



 I got from an offline email of you and knew your Hadoop version was
 2.0.0-alpha and you also said “The block is allocated successfully in NN,
 but isn’t created in DN”.

 Yes, we may have this issue in 2.0.0-alpha. I suspect your issue is
 similar with HDFS-4516.   And can you try Hadoop 2.4 or later, you should
 not be able to re-produce it for these versions.



 From your description, the second block is created successfully and NN
 would flush the edit log info to shared journal and shared storage might
 persist the info, but before reporting back in rpc, there might be timeout
 to NN from shared storage.  So the block exist in shared edit log, but DN
 doesn’t create it in anyway.  On restart, client could fail, because in
 that Hadoop version, client would retry only in the case of NN last block
 size reported as non-zero if it was synced (see more in HDFS-4516).



 Regards,

 Yi Liu



 *From:* Zesheng Wu [mailto:wuzeshen...@gmail.com]
 *Sent:* Tuesday, September 09, 2014 6:16 PM
 *To:* user@hadoop.apache.org
 *Subject:* HDFS: Couldn't obtain the locations of the last block



 Hi,



 These days we encountered a critical bug in HDFS which can result in
 HBase can't start normally.

 The scenario is like following:

 1.  rs1 writes data to HDFS file f1, and the first block is written
 successfully

 2.  rs1 apply to create the second block successfully, at this time,
 nn1(ann) is crashed due to writing journal timeout

 3. nn2(snn) isn't become active because of zkfc2 is in abnormal state

 4. nn1 is restarted and becomes active

 5. During the process of nn1 restarting, rs1 is crashed due to writing to
 safemode nn(nn1)

 6. As a result, the file f1 is in abnormal state and the HBase cluster
 can't serve any more



 We can use the command line shell to list the file, look like following:

 -rw---   3 hbase_srv supergroup  134217728 2014-09-05 11:32 
 /hbase/lgsrv-push/xxx

  But when we try to download the file from hdfs, the dfs client
 complains:

 14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. 
 Datanodes might not have reported blocks completely. Will retry for 3 times

 14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. 
 Datanodes might not have reported blocks completely. Will retry for 2 times

 14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. 
 Datanodes might not have reported blocks completely. Will retry for 1 times

 get: Could not obtain the last block locations.

 Anyone can help on this?

  --
 Best Wishes!

 Yours, Zesheng




 --
 Best Wishes!

 Yours, Zesheng




-- 
Best Wishes!

Yours, Zesheng


Re: Regular expressions in fs paths?

2014-09-10 Thread Charles Robertson
Hi Georgi,

Thanks for your reply. Won't hadoop fs -ls /tmp/myfiles* return all files
that begin with 'myfiles' in the tmp directory? What I don't understand is
how I can specify a pattern that excludes files ending in '.tmp'. I have
tried using the normal regular expression syntax for this ^(.tmp) but it
tries to match it literally.

Regards,
Charles

On 10 September 2014 13:07, Georgi Ivanov iva...@vesseltracker.com wrote:

  Yes you can :
 hadoop fs -ls /tmp/myfiles*

 I would recommend first using -ls in order to verify  you are selecting
 the right files.

 #Mahesh : do you need some help doing this ?



 On 10.09.2014 13:46, Mahesh Khandewal wrote:

 I want to unsubscribe from this mailing list

 On Wed, Sep 10, 2014 at 4:42 PM, Charles Robertson 
 charles.robert...@gmail.com wrote:

 Hi all,

  Is it possible to use regular expressions in fs commands? Specifically,
 I want to use the copy (-cp) and move (-mv) commands on all files in a
 directory that match a pattern (the pattern being all files that do not end
 in '.tmp').

  Can this be done?

  Thanks,
 Charles






RE: HDFS: Couldn't obtain the locations of the last block

2014-09-10 Thread Liu, Yi A
That’s great.

Regards,
Yi Liu

From: Zesheng Wu [mailto:wuzeshen...@gmail.com]
Sent: Wednesday, September 10, 2014 8:25 PM
To: user@hadoop.apache.org
Subject: Re: HDFS: Couldn't obtain the locations of the last block

Hi Yi,

I went through HDFS-4516, and it really solves our problem, thanks very much!

2014-09-10 16:39 GMT+08:00 Zesheng Wu 
wuzeshen...@gmail.commailto:wuzeshen...@gmail.com:
Thanks Yi, I will look into HDFS-4516.


2014-09-10 15:03 GMT+08:00 Liu, Yi A 
yi.a@intel.commailto:yi.a@intel.com:

Hi Zesheng,

I got from an offline email of you and knew your Hadoop version was 2.0.0-alpha 
and you also said “The block is allocated successfully in NN, but isn’t created 
in DN”.
Yes, we may have this issue in 2.0.0-alpha. I suspect your issue is similar 
with HDFS-4516.   And can you try Hadoop 2.4 or later, you should not be able 
to re-produce it for these versions.

From your description, the second block is created successfully and NN would 
flush the edit log info to shared journal and shared storage might persist the 
info, but before reporting back in rpc, there might be timeout to NN from 
shared storage.  So the block exist in shared edit log, but DN doesn’t create 
it in anyway.  On restart, client could fail, because in that Hadoop version, 
client would retry only in the case of NN last block size reported as non-zero 
if it was synced (see more in HDFS-4516).

Regards,
Yi Liu

From: Zesheng Wu [mailto:wuzeshen...@gmail.commailto:wuzeshen...@gmail.com]
Sent: Tuesday, September 09, 2014 6:16 PM
To: user@hadoop.apache.orgmailto:user@hadoop.apache.org
Subject: HDFS: Couldn't obtain the locations of the last block

Hi,

These days we encountered a critical bug in HDFS which can result in HBase 
can't start normally.
The scenario is like following:
1.  rs1 writes data to HDFS file f1, and the first block is written successfully
2.  rs1 apply to create the second block successfully, at this time, nn1(ann) 
is crashed due to writing journal timeout
3. nn2(snn) isn't become active because of zkfc2 is in abnormal state
4. nn1 is restarted and becomes active
5. During the process of nn1 restarting, rs1 is crashed due to writing to 
safemode nn(nn1)
6. As a result, the file f1 is in abnormal state and the HBase cluster can't 
serve any more

We can use the command line shell to list the file, look like following:

-rw---   3 hbase_srv supergroup  134217728 2014-09-05 11:32 
/hbase/lgsrv-push/xxx
But when we try to download the file from hdfs, the dfs client complains:

14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. 
Datanodes might not have reported blocks completely. Will retry for 3 times

14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. 
Datanodes might not have reported blocks completely. Will retry for 2 times

14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. 
Datanodes might not have reported blocks completely. Will retry for 1 times

get: Could not obtain the last block locations.

Anyone can help on this?
--
Best Wishes!

Yours, Zesheng



--
Best Wishes!

Yours, Zesheng



--
Best Wishes!

Yours, Zesheng


Error when executing a WordCount Program

2014-09-10 Thread YIMEN YIMGA Gael
Hello Hadoopers,

Here is the error I'm facing when running a WordCount example program written by 
myself.
Kindly find attached the files of my WordCount program.
The error is below.

===
-bash-4.1$ bin/hadoop jar WordCount.jar
Entrée dans le programme MAIN !!!
14/09/10 15:00:24 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
14/09/10 15:00:24 WARN mapred.JobClient: No job jar file set.  User classes may 
not be found. See JobConf(Class) or JobConf#setJar(String).
14/09/10 15:00:24 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/09/10 15:00:24 WARN snappy.LoadSnappy: Snappy native library not loaded
14/09/10 15:00:24 INFO mapred.JobClient: Cleaning up the staging area 
hdfs://latdevweb02:9000/user/hadoop/.staging/job_201409101141_0001
14/09/10 15:00:24 ERROR security.UserGroupInformation: 
PriviledgedActionException as:hadoop 
cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not 
exist: hdfs://latdevweb02:9000/home/hadoop/hadoop/input
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
hdfs://latdevweb02:9000/home/hadoop/hadoop/input
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at 
org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at 
fr.societegenerale.bigdata.lactool.WordCountDriver.main(WordCountDriver.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
-bash-4.1$
===

Thanks in advance for your help.

Warm regards
GYY


WordCountReducer.java
Description: WordCountReducer.java


WordCountMapper.java
Description: WordCountMapper.java


WordCountDriver.java
Description: WordCountDriver.java


Re: Error when executing a WordCount Program

2014-09-10 Thread Shahab Yunus
*hdfs://latdevweb02:9000/home/hadoop/hadoop/input*

Is this a valid path on HDFS? Can you access this path outside of the
program, for example using the hadoop fs -ls command? Also, were this path and
the files in it created by a different user?

The exception seems to say that it does not exist or that the running user does
not have permission to read it.

Regards,
Shahab



On Wed, Sep 10, 2014 at 9:09 AM, YIMEN YIMGA Gael 
gael.yimen-yi...@sgcib.com wrote:

 Hello Hadoopers,



 Here is the error, I’m facing when running WordCount example program
 written by myself.

 Kind find attached the file of my WordCount program.

 Below the error.




 ===

 *-bash-4.1$ bin/hadoop jar WordCount.jar*

 *Entr?e dans le programme MAIN !!!*

 *14/09/10 15:00:24 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.*

 *14/09/10 15:00:24 WARN mapred.JobClient: No job jar file set.  User
 classes may not be found. See JobConf(Class) or JobConf#setJar(String).*

 *14/09/10 15:00:24 INFO util.NativeCodeLoader: Loaded the native-hadoop
 library*

 *14/09/10 15:00:24 WARN snappy.LoadSnappy: Snappy native library not
 loaded*

 *14/09/10 15:00:24 INFO mapred.JobClient: Cleaning up the staging area
 hdfs://latdevweb02:9000/user/hadoop/.staging/job_201409101141_0001*

 *14/09/10 15:00:24 ERROR security.UserGroupInformation:
 PriviledgedActionException as:hadoop
 cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
 exist: hdfs://latdevweb02:9000/home/hadoop/hadoop/input*

 *org.apache.hadoop.mapred.InvalidInputException: Input path does not
 exist: hdfs://latdevweb02:9000/home/hadoop/hadoop/input*

 *at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)*

 *at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)*

 *at
 org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)*

 *at
 org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)*

 *at
 org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)*

 *at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)*

 *at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)*

 *at java.security.AccessController.doPrivileged(Native Method)*

 *at javax.security.auth.Subject.doAs(Subject.java:415)*

 *at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)*

 *at
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)*

 *at
 org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)*

 *at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)*

 *at
 fr.societegenerale.bigdata.lactool.WordCountDriver.main(WordCountDriver.java:50)*

 *at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)*

 *at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)*

 *at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)*

 *at java.lang.reflect.Method.invoke(Method.java:601)*

 *at org.apache.hadoop.util.RunJar.main(RunJar.java:160)*

 *-bash-4.1$*


 ===



 Thanks in advance for your help.



 Warm regards

 GYY




Re: Error when executing a WordCount Program

2014-09-10 Thread Chris MacKenzie
Hi, have you set the job jar (via a class) in your code?

 WARN mapred.JobClient: No job jar file set.  User classes may not be found. 
 See JobConf(Class) or JobConf#setJar(String).
 


Also, you need to check the path of your input file:

 Input path does not exist: hdfs://latdevweb02:9000/home/hadoop/hadoop/input
 

These are pretty straightforward errors; resolve them and you should be good to 
go.
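
For reference, a minimal driver using the old mapred API (which your stack trace 
shows) that sets the job jar via a class and points at an explicit HDFS input path 
could look roughly like this - the class and path names are only placeholders, so 
adapt them to your attached sources:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    // Passing the driver class tells Hadoop which jar to ship,
    // which fixes the "No job jar file set" warning.
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    // WordCountMapper / WordCountReducer stand for the attached classes
    // (assumed to implement the old org.apache.hadoop.mapred interfaces).
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    // Use a path that actually exists in HDFS
    // (check with: hadoop fs -ls /user/hadoop/input).
    FileInputFormat.setInputPaths(conf, new Path("/user/hadoop/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/hadoop/output"));

    JobClient.runJob(conf);
  }
}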

Sent from my iPhone

 On 10 Sep 2014, at 14:19, Shahab Yunus shahab.yu...@gmail.com wrote:
 
 hdfs://latdevweb02:9000/home/hadoop/hadoop/input
 
 is this is a valid path on hdfs? Can you access this path outside of the 
 program? For example using hadoop fs -ls command? Also, was this path and 
 files in it, created by a different user?
 
 The exception seem to say that it does not exist or the running user does not 
 have permission to read it.
 
 Regards,
 Shahab
 
 
 
 On Wed, Sep 10, 2014 at 9:09 AM, YIMEN YIMGA Gael 
 gael.yimen-yi...@sgcib.com wrote:
 Hello Hadoopers,
 
  
 
 Here is the error, I’m facing when running WordCount example program written 
 by myself.
 
 Kind find attached the file of my WordCount program.
 
 Below the error.
 
  
 
 ===
 
 -bash-4.1$ bin/hadoop jar WordCount.jar
 
 Entr?e dans le programme MAIN !!!
 
 14/09/10 15:00:24 WARN mapred.JobClient: Use GenericOptionsParser for 
 parsing the arguments. Applications should implement Tool for the same.
 
 14/09/10 15:00:24 WARN mapred.JobClient: No job jar file set.  User classes 
 may not be found. See JobConf(Class) or JobConf#setJar(String).
 
 14/09/10 15:00:24 INFO util.NativeCodeLoader: Loaded the native-hadoop 
 library
 
 14/09/10 15:00:24 WARN snappy.LoadSnappy: Snappy native library not loaded
 
 14/09/10 15:00:24 INFO mapred.JobClient: Cleaning up the staging area 
 hdfs://latdevweb02:9000/user/hadoop/.staging/job_201409101141_0001
 
 14/09/10 15:00:24 ERROR security.UserGroupInformation: 
 PriviledgedActionException as:hadoop 
 cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not 
 exist: hdfs://latdevweb02:9000/home/hadoop/hadoop/input
 
 org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
 hdfs://latdevweb02:9000/home/hadoop/hadoop/input
 
 at 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
 
 at 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
 
 at 
 org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
 
 at 
 org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
 
 at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
 
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
 
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
 
 at java.security.AccessController.doPrivileged(Native Method)
 
 at javax.security.auth.Subject.doAs(Subject.java:415)
 
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
 
 at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
 
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
 
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
 
 at 
 fr.societegenerale.bigdata.lactool.WordCountDriver.main(WordCountDriver.java:50)
 
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
 at java.lang.reflect.Method.invoke(Method.java:601)
 
 at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
 
 -bash-4.1$
 
 ===
 
  
 
 Thanks in advance for your help.
 
  
 
 Warm regards
 
 GYY
 
 

running beyond virtual memory limits

2014-09-10 Thread Jakub Stransky
Hello,

I am getting the following error when running on a 500 MB dataset compressed in
the Avro data format.

Container [pid=22961,containerID=container_1409834588043_0080_01_10] is
running beyond virtual memory limits. Current usage: 636.6 MB of 1 GB
physical memory used; 2.1 GB of 2.1 GB virtual memory used.
Killing container. Dump of the process-tree for
container_1409834588043_0080_01_10 :
|- PIDPPID  PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 22961  16896 22961  22961  (bash)0  0
9424896   312 /bin/bash -c
/usr/java/default/bin/java -Djava.net.preferIPv4Stack=true
-Dhadoop.metrics.log.level=WARN -Xmx768m
-Djava.io.tmpdir=/home/hadoop/yarn/local/usercache/jobsubmit/appcache/application_1409834588043_0080/container_1409834588043_0080_01_10/tmp
-Dlog4j.configuration=container-log4j.properties
-Dyarn.app.container.log.dir=/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_10
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
org.apache.hadoop.mapred.YarnChild 153.87.47.116 47184
attempt_1409834588043_0080_r_00_0 10
1/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_10/stdout
2/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_10/stderr
|- 22970 22961 22961 22961 (java) 24692 1165 2256662528 162659
/usr/java/default/bin/java -Djava.net.preferIPv4Stack=true
-Dhadoop.metrics.log.level=WARN -Xmx768m
-Djava.io.tmpdir=/home/hadoop/yarn/local/usercache/jobsubmit/appcache/application_1409834588043_0080/container_1409834588043_0080_01_10/tmp
-Dlog4j.configuration=container-log4j.properties
-Dyarn.app.container.log.dir=/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_10
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
org.apache.hadoop.mapred.YarnChild 153.87.47.116 47184
attempt_1409834588043_0080_r_00_0 10 Container killed on request. Exit
code is 143

I have read a lot about Hadoop YARN memory settings, but it seems I am missing
something basic in my understanding of how YARN and MR2 work.
I have a pretty small test cluster of 5 machines, 2 NN and 3 DN, with the
following parameters set:

# hadoop - yarn-site.xml
yarn.nodemanager.resource.memory-mb  : 2048
yarn.scheduler.minimum-allocation-mb : 256
yarn.scheduler.maximum-allocation-mb : 2048

# hadoop - mapred-site.xml
mapreduce.map.memory.mb  : 768
mapreduce.map.java.opts  : -Xmx512m
mapreduce.reduce.memory.mb   : 1024
mapreduce.reduce.java.opts   : -Xmx768m
mapreduce.task.io.sort.mb: 100
yarn.app.mapreduce.am.resource.mb: 1024
yarn.app.mapreduce.am.command-opts   : -Xmx768m

I understand the arithmetic behind these parameters, but what I do not
understand is: do the containers need to grow with the size of the
dataset, e.g. by setting mapreduce.map.memory.mb and
mapreduce.map.java.opts on a per-job basis? My reducer doesn't cache any
data; it is simply in-out, just categorizing data to multiple outputs
using AvroMultipleOutputs(), as follows:

    @Override
    public void reduce(Text key, Iterable<AvroValue<PosData>> values,
            Context context) throws IOException, InterruptedException {
        try {
            log.info("Processing key {}", key.toString());
            final StoreIdDob storeIdDob = separateKey(key);

            log.info("Processing DOB {}, StoreId {}", storeIdDob.getDob(),
                    storeIdDob.getStoreId());
            int size = 0;

            Output out;
            String path;

            if (storeIdDob.getDob() != null
                    && isValidDOB(storeIdDob.getDob())
                    && storeIdDob.getStoreId() != null
                    && !storeIdDob.getStoreId().isEmpty()) {
                // reasonable data
                if (isHistoricalDOB(storeIdDob.getDob())) {
                    out = Output.HISTORY;
                } else {
                    out = Output.ACTUAL;
                }
                path = out.getKey() + "/" + storeIdDob.getDob() + "/"
                        + storeIdDob.getStoreId();
            } else {
                // error data
                out = Output.ERROR;
                path = out.getKey() + "/" + "part";
            }

            for (AvroValue<PosData> posData : values) {
                amos.write(out.getKey(), new AvroKey<PosData>(posData.datum()),
                        null, path);
            }

        } catch (Exception e) {
            log.error("Error on reducer", e);
            // TODO audit log :-)
        }
    }

Do I need to grow the container size with the size of the dataset? That seems
odd to me, and I expected that handling this is exactly what MR is for. Or am I
missing some setting that decides the size of the data chunks?
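
(If per-job overrides are indeed the expected approach, I assume it would look
something like this in the driver - example values only:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PerJobMemoryExample {
  public static Job createJob() throws Exception {
    Configuration conf = new Configuration();
    // Per-job overrides (example values): reducer container size and JVM heap,
    // leaving the cluster-wide defaults in mapred-site.xml untouched.
    conf.set("mapreduce.reduce.memory.mb", "2048");
    conf.set("mapreduce.reduce.java.opts", "-Xmx1536m");
    return Job.getInstance(conf, "pos-data-categorize");
  }
}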

Thx
Jakub


RE: Error when executing a WordCount Program

2014-09-10 Thread YIMEN YIMGA Gael
Hi,

Please, that is exactly my problem.
Could you please look into my attached code and tell me how I can update it?

How do I set the job jar file?

And now, here is my hdfs-site.xml

==
-bash-4.1$ cat conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>/tmp/hadoop-hadoop/dfs/data</value>
   </property>
</configuration>
-bash-4.1$
==

Could you advise on how to solve the “input path does not exist” error?

Standing by …

Cheers


From: Chris MacKenzie [mailto:stu...@chrismackenziephotography.co.uk]
Sent: Wednesday 10 September 2014 15:27
To: user@hadoop.apache.org
Subject: Re: Error when executing a WordCount Program

Hi have you set a class in your code ?

WARN mapred.JobClient: No job jar file set.  User classes may not be found. See 
JobConf(Class) or JobConf#setJar(String).

Also you need to check the path for your input file

Input path does not exist: hdfs://latdevweb02:9000/home/hadoop/hadoop/input

These are pretty straight forward errors resolve them and you should be good to 
go.

Sent from my iPhone

On 10 Sep 2014, at 14:19, Shahab Yunus 
shahab.yu...@gmail.commailto:shahab.yu...@gmail.com wrote:
hdfs://latdevweb02:9000/home/hadoop/hadoop/input

is this is a valid path on hdfs? Can you access this path outside of the 
program? For example using hadoop fs -ls command? Also, was this path and files 
in it, created by a different user?

The exception seem to say that it does not exist or the running user does not 
have permission to read it.

Regards,
Shahab



On Wed, Sep 10, 2014 at 9:09 AM, YIMEN YIMGA Gael 
gael.yimen-yi...@sgcib.commailto:gael.yimen-yi...@sgcib.com wrote:
Hello Hadoopers,

Here is the error, I’m facing when running WordCount example program written by 
myself.
Kind find attached the file of my WordCount program.
Below the error.

===
-bash-4.1$ bin/hadoop jar WordCount.jar
Entr?e dans le programme MAIN !!!
14/09/10 15:00:24 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
14/09/10 15:00:24 WARN mapred.JobClient: No job jar file set.  User classes may 
not be found. See JobConf(Class) or JobConf#setJar(String).
14/09/10 15:00:24 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/09/10 15:00:24 WARN snappy.LoadSnappy: Snappy native library not loaded
14/09/10 15:00:24 INFO mapred.JobClient: Cleaning up the staging area 
hdfs://latdevweb02:9000/user/hadoop/.staging/job_201409101141_0001
14/09/10 15:00:24 ERROR security.UserGroupInformation: 
PriviledgedActionException as:hadoop 
cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not 
exist: hdfs://latdevweb02:9000/home/hadoop/hadoop/input
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
hdfs://latdevweb02:9000/home/hadoop/hadoop/input
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at 
org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at 
fr.societegenerale.bigdata.lactool.WordCountDriver.main(WordCountDriver.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
-bash-4.1$
===

Thanks in advance for your help.

Warm regards
GYY

*
This message and any attachments (the 

RE: Error when executing a WordCount Program

2014-09-10 Thread YIMEN YIMGA Gael
Hi,

In fact,

hdfs://latdevweb02:9000/home/hadoop/hadoop/input
is not a folder on HDFS.

I created a folder /tmp/hadoop-hadoop/dfs/data, where HDFS data will be
saved.

In my HADOOP_HOME folder there are two folders, “input” and “output”, but I
don't know how to reference them in the program.

Could you please look into my code and advise?

Standing by …

Warm regards

From: Shahab Yunus [mailto:shahab.yu...@gmail.com]
Sent: Wednesday 10 September 2014 15:19
To: user@hadoop.apache.org
Subject: Re: Error when executing a WordCount Program

hdfs://latdevweb02:9000/home/hadoop/hadoop/input

is this is a valid path on hdfs? Can you access this path outside of the 
program? For example using hadoop fs -ls command? Also, was this path and files 
in it, created by a different user?

The exception seem to say that it does not exist or the running user does not 
have permission to read it.

Regards,
Shahab



On Wed, Sep 10, 2014 at 9:09 AM, YIMEN YIMGA Gael 
gael.yimen-yi...@sgcib.commailto:gael.yimen-yi...@sgcib.com wrote:
Hello Hadoopers,

Here is the error, I’m facing when running WordCount example program written by 
myself.
Kind find attached the file of my WordCount program.
Below the error.

===
-bash-4.1$ bin/hadoop jar WordCount.jar
Entr?e dans le programme MAIN !!!
14/09/10 15:00:24 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
14/09/10 15:00:24 WARN mapred.JobClient: No job jar file set.  User classes may 
not be found. See JobConf(Class) or JobConf#setJar(String).
14/09/10 15:00:24 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/09/10 15:00:24 WARN snappy.LoadSnappy: Snappy native library not loaded
14/09/10 15:00:24 INFO mapred.JobClient: Cleaning up the staging area 
hdfs://latdevweb02:9000/user/hadoop/.staging/job_201409101141_0001
14/09/10 15:00:24 ERROR security.UserGroupInformation: 
PriviledgedActionException as:hadoop 
cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not 
exist: hdfs://latdevweb02:9000/home/hadoop/hadoop/input
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
hdfs://latdevweb02:9000/home/hadoop/hadoop/input
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at 
org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at 
fr.societegenerale.bigdata.lactool.WordCountDriver.main(WordCountDriver.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
-bash-4.1$
===

Thanks in advance for your help.

Warm regards
GYY


Hadoop Smoke Test: TERASORT

2014-09-10 Thread arthur.hk.c...@gmail.com
Hi,

I am trying the smoke test for Hadoop (2.4.1). Regarding “terasort”, below is my 
test command. The map part completed very quickly because it was split into many 
subtasks; however, the reduce part takes a very long time and there is only 1 
running reduce task. Is there a way to speed up the reduce phase by splitting the 
large reduce job into many smaller ones and running them across the cluster, like 
the map part?


bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar  terasort 
/tmp/teragenout /tmp/terasortout


Job ID                  Name      State    Maps Total  Maps Completed  Reduce Total  Reduce Completed
job_1409876705457_0002  TeraSort  RUNNING  22352       22352           1             0


Regards
Arthur
























Re: Regular expressions in fs paths?

2014-09-10 Thread Rich Haase
HDFS doesn't support the full range of glob matching you will find in Linux.
If you want to exclude from a directory listing all files that meet certain
criteria, try doing your listing and using grep -v to exclude the matching
records.
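
If the shell glob plus grep -v becomes awkward to script, a small Java client can
do the same filtering with FileSystem.globStatus and a PathFilter (the paths below
are only placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class MoveNonTmp {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Match everything in the source directory except files ending in ".tmp".
    PathFilter notTmp = new PathFilter() {
      public boolean accept(Path path) {
        return !path.getName().endsWith(".tmp");
      }
    };

    FileStatus[] matches = fs.globStatus(new Path("/data/incoming/*"), notTmp);
    for (FileStatus status : matches) {
      // Move (the equivalent of -mv); use FileUtil.copy(...) instead for a copy.
      fs.rename(status.getPath(), new Path("/data/ready", status.getPath().getName()));
    }
  }
}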


Writing output from streaming task without dealing with key/value

2014-09-10 Thread Dmitry Sivachenko
Hello!

Imagine the following common task: I want to process a big text file line-by-line 
using the streaming interface.
Run the Unix grep command, for instance, or some other line-by-line processing, 
e.g. line.upper().
I copy the file to HDFS.

Then I run a map task on this file which reads one line, modifies it some way 
and then writes it to the output.

TextInputFormat suites well for reading: it's key is the offset in bytes 
(meaningless in my case) and the value is the line itself, so I can iterate 
over line like this (in python):
for line in sys.stdin:
  print(line.upper())

The problem arises with TextOutputFormat:  It tries to split the resulting line 
on mapreduce.output.textoutputformat.separator which results in extra separator 
in output if this character is missing in the line, for instance (extra TAB at 
the end if we stick to defaults).

Is there any way to write the result of streaming task without any internal 
processing so it appears exactly as the script produces it?

If it is impossible with Hadoop, which works with key/value pairs, may be there 
are other frameworks which work on top of HDFS which allow to do this?

Thanks in advance!

Re: Hadoop Smoke Test: TERASORT

2014-09-10 Thread Rich Haase
You can set the number of reducers used in any hadoop job from the command
line by using -Dmapred.reduce.tasks=XX.

e.g.  hadoop jar hadoop-mapreduce-examples.jar terasort
-Dmapred.reduce.tasks=10  /terasort-input /terasort-output
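(For reference, the same thing can be set from a Java driver via Job.setNumReduceTasks.
The class below is only an illustrative sketch with made-up names, not something from
this thread.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "terasort-like job");
    // Same effect as -Dmapred.reduce.tasks=10 on the command line
    // (the newer property name is mapreduce.job.reduces).
    job.setNumReduceTasks(10);
    // ...configure mapper, reducer, input and output paths, then submit.
  }
}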


Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Susheel Kumar Gadalay
If you don't want the key in the final output, you can set it like this in Java.

job.setOutputKeyClass(NullWritable.class);

It will just print the value in the output file.

I don't know how to do it in python.
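
A minimal Java driver sketch of this approach (class and job names are made up for
illustration, not taken from this thread):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ValueOnlyJobSketch {

  public static class ValueOnlyMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Emit the line as the value with a NullWritable key, so the default
      // TextOutputFormat writes the line alone, with no key and no separator.
      ctx.write(NullWritable.get(), line);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "value-only output");
    job.setJarByClass(ValueOnlyJobSketch.class);
    job.setMapperClass(ValueOnlyMapper.class);
    job.setNumReduceTasks(0);                 // map-only, like a streaming filter
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}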

On 9/10/14, Dmitry Sivachenko trtrmi...@gmail.com wrote:
 Hello!

 Imagine the following common task: I want to process big text file
 line-by-line using streaming interface.
 Run unix grep command for instance.  Or some other line-by-line processing,
 e.g. line.upper().
 I copy file to HDFS.

 Then I run a map task on this file which reads one line, modifies it some
 way and then writes it to the output.

 TextInputFormat suites well for reading: it's key is the offset in bytes
 (meaningless in my case) and the value is the line itself, so I can iterate
 over line like this (in python):
 for line in sys.stdin:
   print(line.upper())

 The problem arises with TextOutputFormat:  It tries to split the resulting
 line on mapreduce.output.textoutputformat.separator which results in extra
 separator in output if this character is missing in the line, for instance
 (extra TAB at the end if we stick to defaults).

 Is there any way to write the result of streaming task without any internal
 processing so it appears exactly as the script produces it?

 If it is impossible with Hadoop, which works with key/value pairs, may be
 there are other frameworks which work on top of HDFS which allow to do
 this?

 Thanks in advance!


Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Rich Haase
In python, or any streaming program, just set the output value to the empty
string and you will get something like key\t.

On Wed, Sep 10, 2014 at 12:03 PM, Susheel Kumar Gadalay skgada...@gmail.com
 wrote:

 If you don't want key in the final output, you can set like this in Java.

 job.setOutputKeyClass(NullWritable.class);

 It will just print the value in the output file.

 I don't how to do it in python.

 On 9/10/14, Dmitry Sivachenko trtrmi...@gmail.com wrote:
  Hello!
 
  Imagine the following common task: I want to process big text file
  line-by-line using streaming interface.
  Run unix grep command for instance.  Or some other line-by-line
 processing,
  e.g. line.upper().
  I copy file to HDFS.
 
  Then I run a map task on this file which reads one line, modifies it some
  way and then writes it to the output.
 
  TextInputFormat suites well for reading: it's key is the offset in bytes
  (meaningless in my case) and the value is the line itself, so I can
 iterate
  over line like this (in python):
  for line in sys.stdin:
print(line.upper())
 
  The problem arises with TextOutputFormat:  It tries to split the
 resulting
  line on mapreduce.output.textoutputformat.separator which results in
 extra
  separator in output if this character is missing in the line, for
 instance
  (extra TAB at the end if we stick to defaults).
 
  Is there any way to write the result of streaming task without any
 internal
  processing so it appears exactly as the script produces it?
 
  If it is impossible with Hadoop, which works with key/value pairs, may be
  there are other frameworks which work on top of HDFS which allow to do
  this?
 
  Thanks in advance!




-- 
*Kernighan's Law*
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it.


Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Dmitry Sivachenko

On 10 Sept. 2014, at 22:05, Rich Haase rdha...@gmail.com wrote:

 In python, or any streaming program just set the output value to the empty 
 string and you will get something like key\t.
 


I see, but I want to use many existing programs (like UNIX grep), and I don't 
want to have an extra \t in the output.

Is there any way to achieve this?  Or maybe it is possible to write a custom 
XxxOutputFormat to work around that issue?

(Something opposite to TextInputFormat: it passes the input line to the script's 
stdin without any modification; there should be a way to write stdout to a file 
as is.)


Thanks!


 On Wed, Sep 10, 2014 at 12:03 PM, Susheel Kumar Gadalay skgada...@gmail.com 
 wrote:
 If you don't want key in the final output, you can set like this in Java.
 
 job.setOutputKeyClass(NullWritable.class);
 
 It will just print the value in the output file.
 
 I don't how to do it in python.
 
 On 9/10/14, Dmitry Sivachenko trtrmi...@gmail.com wrote:
  Hello!
 
  Imagine the following common task: I want to process big text file
  line-by-line using streaming interface.
  Run unix grep command for instance.  Or some other line-by-line processing,
  e.g. line.upper().
  I copy file to HDFS.
 
  Then I run a map task on this file which reads one line, modifies it some
  way and then writes it to the output.
 
  TextInputFormat suites well for reading: it's key is the offset in bytes
  (meaningless in my case) and the value is the line itself, so I can iterate
  over line like this (in python):
  for line in sys.stdin:
print(line.upper())
 
  The problem arises with TextOutputFormat:  It tries to split the resulting
  line on mapreduce.output.textoutputformat.separator which results in extra
  separator in output if this character is missing in the line, for instance
  (extra TAB at the end if we stick to defaults).
 
  Is there any way to write the result of streaming task without any internal
  processing so it appears exactly as the script produces it?
 
  If it is impossible with Hadoop, which works with key/value pairs, may be
  there are other frameworks which work on top of HDFS which allow to do
  this?
 
  Thanks in advance!
 
 
 
 -- 
 Kernighan's Law
 Debugging is twice as hard as writing the code in the first place.  
 Therefore, if you write the code as cleverly as possible, you are, by 
 definition, not smart enough to debug it.



Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Rich Haase
You can write a custom output format, or you can write your mapreduce job
in Java and use a NullWritable as Susheel recommended.

grep (and every other *nix text processing command I can think of) would not
be bothered by a trailing tab character.  It's even quite easy to strip away
that tab character during the post-processing steps you perform with *nix
commands if you don't want it.

On Wed, Sep 10, 2014 at 12:12 PM, Dmitry Sivachenko trtrmi...@gmail.com
wrote:


 On 10 Sept. 2014, at 22:05, Rich Haase rdha...@gmail.com wrote:

  In python, or any streaming program just set the output value to the
 empty string and you will get something like key\t.
 


 I see, but I want to use many existing programs (like UNIX grep), and I
 don't want to have and extra \t in the output.

 Is there any way to achieve this?  Or may be it is possible to write
 custom XxxOutputFormat to workaround that issue?

 (something opposite to TextInputFormat: it passes input line without any
 modification to script's stdin, there should be a way to write stdout to
 file as is).


 Thanks!


  On Wed, Sep 10, 2014 at 12:03 PM, Susheel Kumar Gadalay 
 skgada...@gmail.com wrote:
  If you don't want key in the final output, you can set like this in Java.
 
  job.setOutputKeyClass(NullWritable.class);
 
  It will just print the value in the output file.
 
  I don't how to do it in python.
 
  On 9/10/14, Dmitry Sivachenko trtrmi...@gmail.com wrote:
   Hello!
  
   Imagine the following common task: I want to process big text file
   line-by-line using streaming interface.
   Run unix grep command for instance.  Or some other line-by-line
 processing,
   e.g. line.upper().
   I copy file to HDFS.
  
   Then I run a map task on this file which reads one line, modifies it
 some
   way and then writes it to the output.
  
   TextInputFormat suites well for reading: it's key is the offset in
 bytes
   (meaningless in my case) and the value is the line itself, so I can
 iterate
   over line like this (in python):
   for line in sys.stdin:
 print(line.upper())
  
   The problem arises with TextOutputFormat:  It tries to split the
 resulting
   line on mapreduce.output.textoutputformat.separator which results in
 extra
   separator in output if this character is missing in the line, for
 instance
   (extra TAB at the end if we stick to defaults).
  
   Is there any way to write the result of streaming task without any
 internal
   processing so it appears exactly as the script produces it?
  
   If it is impossible with Hadoop, which works with key/value pairs, may
 be
   there are other frameworks which work on top of HDFS which allow to do
   this?
  
   Thanks in advance!
 
 
 
  --
  Kernighan's Law
  Debugging is twice as hard as writing the code in the first place.
 Therefore, if you write the code as cleverly as possible, you are, by
 definition, not smart enough to debug it.




-- 
*Kernighan's Law*
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it.


Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Dmitry Sivachenko

On 10 Sept. 2014, at 22:19, Rich Haase rdha...@gmail.com wrote:

 You can write a custom output format


Any clues how this can be done?



 , or you can write your mapreduce job in Java and use a NullWritable as 
 Susheel recommended.  
 
 grep (and every other *nix text processing command) I can think of would not 
 be limited by a trailing tab character.  It's even quite easy to strip away 
 that tab character if you don't want it during the post processing steps you 
 want to perform with *nix commands. 


The problem is that when the line itself contains a TAB in the middle, there will be 
no extra trailing TAB at the end.
So it is not that simple:
you never know whether a TAB comes from the original line or is an extra TAB added 
by TextOutputFormat.

Thanks!

Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Shahab Yunus
Examples (the top ones are related to streaming jobs):

http://www.infoq.com/articles/HadoopOutputFormat
http://research.neustar.biz/2011/08/30/custom-inputoutput-formats-in-hadoop-streaming/
http://stackoverflow.com/questions/12759651/how-to-override-inputformat-and-outputformat-in-hadoop-application

Regards,
Shahab
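
A rough sketch of the idea behind those links: a custom format whose RecordWriter
writes the key as-is and only adds the separator when the value is non-empty. Class
names are made up for illustration, the code targets the newer
org.apache.hadoop.mapreduce API (streaming's -outputformat option has historically
expected the older org.apache.hadoop.mapred interfaces, so it may need porting), and,
as pointed out later in this thread, it still cannot tell an originally empty value
apart from a line that simply had no separator.

import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RawTextOutputFormat extends FileOutputFormat<Text, Text> {

  @Override
  public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext job)
      throws IOException, InterruptedException {
    Path file = getDefaultWorkFile(job, "");
    FSDataOutputStream out =
        file.getFileSystem(job.getConfiguration()).create(file, false);
    return new RawRecordWriter(out);
  }

  private static class RawRecordWriter extends RecordWriter<Text, Text> {
    private static final byte[] TAB = "\t".getBytes();
    private static final byte[] NEWLINE = "\n".getBytes();
    private final DataOutputStream out;

    RawRecordWriter(DataOutputStream out) {
      this.out = out;
    }

    @Override
    public void write(Text key, Text value) throws IOException {
      // Write the key exactly as received; append the separator and value
      // only when there is a non-empty value to follow.
      if (key != null) {
        out.write(key.getBytes(), 0, key.getLength());
      }
      if (value != null && value.getLength() > 0) {
        out.write(TAB);
        out.write(value.getBytes(), 0, value.getLength());
      }
      out.write(NEWLINE);
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException {
      out.close();
    }
  }
}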

On Wed, Sep 10, 2014 at 2:28 PM, Dmitry Sivachenko trtrmi...@gmail.com
wrote:


 On 10 Sept. 2014, at 22:19, Rich Haase rdha...@gmail.com wrote:

  You can write a custom output format


 Any clues how can this can be done?



  , or you can write your mapreduce job in Java and use a NullWritable as
 Susheel recommended.
 
  grep (and every other *nix text processing command) I can think of would
 not be limited by a trailing tab character.  It's even quite easy to strip
 away that tab character if you don't want it during the post processing
 steps you want to perform with *nix commands.


 Problem is that the line itself contains a TAB in the middle, there will
 not be extra trailing TAB at the end.
 So it is not that simple.
 You never know if it is a TAB from the original line or it is extra TAB
 added by TextOutputFormat.

 Thanks!


Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Dmitry Sivachenko


 On 10 Sept. 2014, at 22:47, Shahab Yunus shahab.yu...@gmail.com wrote:
 
 Examples (the top ones are related to streaming jobs):
 
 http://www.infoq.com/articles/HadoopOutputFormat
 http://research.neustar.biz/2011/08/30/custom-inputoutput-formats-in-hadoop-streaming/
 http://stackoverflow.com/questions/12759651/how-to-override-inputformat-and-outputformat-in-hadoop-application
 


Thanks for the links.  The problem is that in the RecordWriter I get two parameters: 
key and value. If one of them is empty, I have no way to tell whether I should output 
the delimiter (because it was present in the original line) or not.

What is the proper way to work around that issue?


 Regards,
 Shahab
 
 On Wed, Sep 10, 2014 at 2:28 PM, Dmitry Sivachenko trtrmi...@gmail.com 
 wrote:
 
 On 10 Sept. 2014, at 22:19, Rich Haase rdha...@gmail.com wrote:
 
  You can write a custom output format
 
 
 Any clues how can this can be done?
 
 
 
  , or you can write your mapreduce job in Java and use a NullWritable as 
  Susheel recommended.
 
  grep (and every other *nix text processing command) I can think of would 
  not be limited by a trailing tab character.  It's even quite easy to strip 
  away that tab character if you don't want it during the post processing 
  steps you want to perform with *nix commands.
 
 
 Problem is that the line itself contains a TAB in the middle, there will not 
 be extra trailing TAB at the end.
 So it is not that simple.
 You never know if it is a TAB from the original line or it is extra TAB 
 added by TextOutputFormat.
 
 Thanks!
 


Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Felix Chern
Use ‘tr -s’ to strip out tabs?

 $ echo -e "a\t\t\tb"
a                       b

 $ echo -e "a\t\t\tb" | tr -s "\t"
a       b


On Sep 10, 2014, at 11:28 AM, Dmitry Sivachenko trtrmi...@gmail.com wrote:

 
 On 10 Sept. 2014, at 22:19, Rich Haase rdha...@gmail.com wrote:
 
 You can write a custom output format
 
 
 Any clues how can this can be done?
 
 
 
 , or you can write your mapreduce job in Java and use a NullWritable as 
 Susheel recommended.  
 
 grep (and every other *nix text processing command) I can think of would not 
 be limited by a trailing tab character.  It's even quite easy to strip away 
 that tab character if you don't want it during the post processing steps you 
 want to perform with *nix commands. 
 
 
 Problem is that the line itself contains a TAB in the middle, there will not 
 be extra trailing TAB at the end.
 So it is not that simple.
 You never know if it is a TAB from the original line or it is extra TAB added 
 by TextOutputFormat.
 
 Thanks!



Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Felix Chern
If you don’t want anything to get inserted, just set your output to key only or 
value only.
TextOutputFormat$LineRecordWriter won’t insert anything unless both values are 
set:

public synchronized void write(K key, V value)
    throws IOException {

  boolean nullKey = key == null || key instanceof NullWritable;
  boolean nullValue = value == null || value instanceof NullWritable;
  if (nullKey && nullValue) {
    return;
  }
  if (!nullKey) {
    writeObject(key);
  }
  if (!(nullKey || nullValue)) {
    out.write(keyValueSeparator);
  }
  if (!nullValue) {
    writeObject(value);
  }
  out.write(newline);
}

On Sep 10, 2014, at 1:37 PM, Dmitry Sivachenko trtrmi...@gmail.com wrote:

 
 On 10 Sept. 2014, at 22:33, Felix Chern idry...@gmail.com wrote:
 
 Use ‘tr -s’ to stripe out tabs?
 
 $ echo -e a\t\t\tb
 ab
 
 $ echo -e a\t\t\tb | tr -s \t
 ab
 
 
 There can be tabs in the input, I want to keep input lines without any 
 modification.
 
 Actually it is rather standard task: process lines one by one without 
 inserting extra characters.  There should be standard solution for it IMO.
 



Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Dmitry Sivachenko
On 11 Sept. 2014, at 0:47, Felix Chern idry...@gmail.com wrote:

 If you don’t want anything get inserted, just set your output to key only or 
 value only.
 TextOutputFormat$LineRecordWriter won’t insert anything unless both values 
 are set:


If I output value only, for instance, and my line contains TAB then everything 
before TAB will be lost?
If I output key only, and my line contains TAB then everything after TAB will 
be lost?


 
 public synchronized void write(K key, V value)
   throws IOException {
 
   boolean nullKey = key == null || key instanceof NullWritable;
   boolean nullValue = value == null || value instanceof NullWritable;
   if (nullKey && nullValue) {
 return;
   }
   if (!nullKey) {
 writeObject(key);
   }
   if (!(nullKey || nullValue)) {
 out.write(keyValueSeparator);
   }
   if (!nullValue) {
 writeObject(value);
   }
   out.write(newline);
 }
 
 On Sep 10, 2014, at 1:37 PM, Dmitry Sivachenko trtrmi...@gmail.com wrote:
 
 
 On 10 Sept. 2014, at 22:33, Felix Chern idry...@gmail.com wrote:
 
 Use ‘tr -s’ to stripe out tabs?
 
 $ echo -e a\t\t\tb
 a   b
 
 $ echo -e a\t\t\tb | tr -s \t
 a   b
 
 
 There can be tabs in the input, I want to keep input lines without any 
 modification.
 
 Actually it is rather standard task: process lines one by one without 
 inserting extra characters.  There should be standard solution for it IMO.
 
 



Re: Regular expressions in fs paths?

2014-09-10 Thread Charles Robertson
I solved this in the end by using a shell script (initiated by an oozie
shell action) to run grep and loop through the results - I didn't have to use
the -v option, as the -e option gives you access to a fuller range of regular
expression functionality.

Thanks for your help (again!) Rich.

Charles

On 10 September 2014 16:50, Rich Haase rdha...@gmail.com wrote:

 HDFS doesn't support he full range of glob matching you will find in
 Linux.  If you want to exclude all files from a directory listing that meet
 a certain criteria try doing your listing and using grep -v to exclude the
 matching records.



The running job is blocked for a while if the queue is short of resources

2014-09-10 Thread Anfernee Xu
Hi experts,

I faced one strange issue I cannot understand; can you guys tell me whether this
is a bug or I configured something wrong? Below is my situation.

I'm running the Hadoop 2.2.0 release and all my jobs are uberized; each
node can only run a single job at a time. I use the CapacityScheduler
and configured 2 queues (default and small), and I only give 5% capacity (10
nodes) to the small queue. What I found is that the throughput of the small queue
is very poor when it's under heavy load (the inflow rate > processing speed).
I checked the log of the job and found that each job takes an extra 1-2 minutes
in the job commit phase; see the log below:

2014-09-10 14:01:13,665 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410336300553_9902_m_00_0
2014-09-10 14:01:13,665 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of
TaskAttemptattempt_1410336300553_9902_m_00_0 is : 1.0
2014-09-10 14:01:13,670 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from
attempt_1410336300553_9902_m_00_0
2014-09-10 14:01:13,670 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.Task: Task 'attempt_1410336300553_9902_m_00_0'
done.
2014-09-10 14:01:13,671 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1410336300553_9902_m_00_0 TaskAttempt Transitioned from RUNNING
to SUCCESS_CONTAINER_CLEANUP
2014-09-10 14:01:13,671 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.LocalContainerLauncher: Processing the event
EventType: CONTAINER_REMOTE_CLEANUP for container
container_1410336300553_9902_01_01 taskAttempt
attempt_1410336300553_9902_m_00_0
2014-09-10 14:01:13,675 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1410336300553_9902_m_00_0 TaskAttempt Transitioned from
SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2014-09-10 14:01:13,685 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with
attempt attempt_1410336300553_9902_m_00_0
2014-09-10 14:01:13,687 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
task_1410336300553_9902_m_00 Task Transitioned from RUNNING to SUCCEEDED
2014-09-10 14:01:13,693 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 1
2014-09-10 14:01:13,694 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.TIEMRAppMetrics: task is completed on
2014-09-10 14:01:13,697 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
job_1410336300553_9902Job Transitioned from RUNNING to COMMITTING
2014-09-10 14:01:13,697 INFO [CommitterEvent Processor #1]
org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing
the event EventType: JOB_COMMIT
2014-09-10 14:02:30,121 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Calling handler for
JobFinishedEvent
2014-09-10 14:02:30,122 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
job_1410336300553_9902Job Transitioned from COMMITTING to SUCCEEDED

As you can see the job commit started at 14:01:13 and ended at  14:02:30,
it took a lot of time, I also captured the thread dump of the
job(AppMaster), the interesting part is here

CommitterEvent Processor #1 id=91 idx=0x16c tid=29593 prio=5 alive,
waiting, native_blocked
-- Waiting for notification on:
org/apache/hadoop/mapreduce/v2/app/commit/CommitterEventHandler$EventProcessor0x906b46d0[fat
lock]
at jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native
Method)
at java/lang/Object.wait(J)V(Native Method)
at java/lang/Object.wait(Object.java:485)
at
org/apache/hadoop/mapreduce/v2/app/commit/CommitterEventHandler$EventProcessor.waitForValidCommitWindow(CommitterEventHandler.java:313)
^-- Lock released while waiting:
org/apache/hadoop/mapreduce/v2/app/commit/CommitterEventHandler$EventProcessor0x906b46d0[fat
lock]
at
org/apache/hadoop/mapreduce/v2/app/commit/CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:252)
at
org/apache/hadoop/mapreduce/v2/app/commit/CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:216)
at
java/util/concurrent/ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java/lang/Thread.run(Thread.java:662)
at jrockit/vm/RNI.c2java(J)V(Native Method)
-- end of trace

I checked the code; it got blocked waiting for the heartbeat to the RM. I also
checked

org.apache.hadoop.mapreduce.v2.app.local.LocalContainerAllocator.heartbeat()

and it seems to send another resource-allocate request to the RM.

So my understanding (correct me if wrong) is if the 

Balancing is very slow.

2014-09-10 Thread cho ju il
hadoop 2.4.1
Balancing is very slow. 
 
$HADOOP_PREFIX/bin/hdfs dfsadmin -setBalancerBandwidth 52428800
 
It takes a long time to move one block.
 
2014-09-11 11:38:01  Block begins to move
2014-09-11 11:47:20  Block move complete
 
 
#10.2.1.211 netstat, block begins to move, 10.2.1.210 -->>> 10.2.1.211
2014-09-11 11:38:01
tcp   1110650  0  10.2.1.211:56819   10.2.1.210:40010
ESTABLISHED -
 
 
# datanode log, 10.2.1.211 
2014-09-11 11:47:09,819 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Copied BP-1770955034-0.0.0.0-1401163460236:blk_1077753386_4013196 to 
/10.2.1.211:56819
 
# namenode balancer log
2014-09-11 11:47:20,782 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 
Successfully moved blk_1077753386_4013196 with size=134217728 from 
10.2.1.204:40010 to 10.2.1.211:40010 through 10.2.1.210:40010

# check network state: file transfer speed using scp is 76.7MB/s
dummy.tar                                  100%  230MB  76.7MB/s   00:03