ignoring map task failure

2014-08-18 Thread parnab kumar
Hi All,

   I am running a job with between 1300-1400 map tasks. Some
map tasks fail due to errors. When 4 such maps fail, the job naturally
gets killed. How can I ignore the failed tasks and go on executing the
other map tasks? I am okay with losing some data for the failed tasks.

Thanks,
Parnab


Re: Data cleansing in modern data architecture

2014-08-18 Thread Jens Scheidtmann
Hi Bob,

the answer to your original question depends entirely on the procedures and
conventions set forth for your data warehouse. So only you can answer it.

If you're asking for best practices, it still depends:
- How large are your files?
- Have you enough free space for recoding?
- Are you better off writing an exception file?
- How do you make sure it is always respected?
- etc.

Best regards,

Jens


Re: ignoring map task failure

2014-08-18 Thread Susheel Kumar Gadalay
Check the parameter yarn.app.mapreduce.client.max-retries.
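If the goal is to let the job succeed despite a handful of failed maps, another knob worth checking is mapreduce.map.failures.maxpercent, the percentage of map tasks allowed to fail before the whole job is failed. A hedged sketch for the job configuration (verify the property name against mapred-default.xml for your Hadoop version):

```xml
<!-- Sketch only: allow up to 5% of map tasks to fail without failing
     the job. Check mapred-default.xml for your Hadoop version. -->
<property>
  <name>mapreduce.map.failures.maxpercent</name>
  <value>5</value>
</property>
```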

On 8/18/14, parnab kumar parnab.2...@gmail.com wrote:
 Hi All,

I am running a job with between 1300-1400 map tasks. Some
 map tasks fail due to errors. When 4 such maps fail, the job naturally
 gets killed. How can I ignore the failed tasks and go on executing the
 other map tasks? I am okay with losing some data for the failed tasks.

 Thanks,
 Parnab



Re: Top K words problem

2014-08-18 Thread Jens Scheidtmann
Google for "streaming algorithms" and also "stream processing" to get ideas.

Best regards, Jens
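As a hedged, non-streaming baseline (an exact count, not a bounded-memory streaming sketch such as Misra-Gries or Count-Min, which the streaming-algorithms literature covers), top-K by word frequency might look like this in plain Java:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TopK {
    // Count each word's frequency, then keep the k most frequent words,
    // most frequent first. Exact, but needs memory for all distinct words.
    static List<String> topK(Stream<String> words, int k) {
        Map<String, Long> counts = words.collect(
                Collectors.groupingBy(w -> w, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "a" appears 3 times, "b" twice, "c" once.
        System.out.println(topK(Stream.of("a", "b", "a", "c", "b", "a"), 2));
        // prints [a, b]
    }
}
```

For inputs too large to count exactly, a streaming sketch trades a bounded error for O(k) memory, which is the idea the search terms above lead to.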


Passing VM arguments while submitting a job in Hadoop YARN.

2014-08-18 Thread S.L
Hi All,

We submit a job using the bin/hadoop script on the resource manager node.
What if we need to pass a VM argument to our job, in my case
'target-env'? How do I do that, and will this argument be passed to all
the node managers on different nodes?
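One common approach (a hedged sketch; the property names are from mapred-default.xml, and 'target-env' with a placeholder value is the asker's own property) is to put the JVM option into the task options so every container JVM receives it:

```xml
<!-- Sketch: pass -Dtarget-env to every map task JVM. Reduce tasks and
     the AM have analogous properties: mapreduce.reduce.java.opts and
     yarn.app.mapreduce.am.command-opts. -->
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024m -Dtarget-env=staging</value>
</property>
```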


Re: hadoop/yarn and task parallelization on non-hdfs filesystems

2014-08-18 Thread Calvin
OK, I figured out exactly what was happening.

I had set the configuration value yarn.nodemanager.vmem-pmem-ratio
to 10. Since there is no swap space available for use, every task
which is requesting 2 GB of memory is also requesting an additional 20
GB of memory. This 20 GB isn't represented in the Memory Used column
on the YARN applications status page and thus it seemed like I was
underutilizing the YARN cluster (when in actuality I had allocated all
the memory available).

(The cluster underutilization occurs regardless of using HDFS or
LocalFileSystem; I must have made this configuration change after
testing HDFS and before testing the local filesystem.)

The solution is to set  yarn.nodemanager.vmem-pmem-ratio to 1 (since
I have no swap) *and* yarn.nodemanager.vmem-check.enabled to false.
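In yarn-site.xml, the two settings would look something like this (note that yarn-default.xml spells the second property yarn.nodemanager.vmem-check-enabled, as the follow-up in this thread points out):

```xml
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>1</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```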

Part of the reason why I had set such a high setting was due to
containers being killed because of virtual memory usage. The Cloudera
folks have a good blog post [1] on this topic (see #6) and I wish I
had read that sooner.

With the above configuration values, I can now utilize the cluster at 100%.

Thanks for everyone's input!

Calvin

[1] 
http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/

On Fri, Aug 15, 2014 at 2:11 PM, java8964 java8...@hotmail.com wrote:
 Interesting to know that.

 I also want to know what underlying logic limits this to only 25-35
 parallel containers, instead of up to 1300.

 Another suggestion I can give is the following:

 1) In your driver, generate a text file including all your 1300 bz2 file
 names with absolute paths.
 2) In your MR job, use NLineInputFormat; with the default setting, each
 line of content will trigger one mapper task.
 3) In your mapper, the key/value pair will be byte offset/line content;
 just start processing the file, as it should be available from the mount
 path on the local data nodes.
 4) I assume that you are using YARN. In this case, at least 1300 container
 requests will be issued to the cluster. You generate 1300 parallel
 requests; it is then up to the cluster to decide how many containers can
 run in parallel.

 Yong
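Step 1 of the suggestion above (generating the listing file for NLineInputFormat) could be sketched in plain Java; the directory and file names here are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListBz2Files {
    // Write the absolute path of every .bz2 file under dir into listFile,
    // one per line, so NLineInputFormat can hand one line to each mapper.
    // Returns the number of paths written.
    static long writeFileList(Path dir, Path listFile) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            List<String> lines = files
                    .filter(p -> p.toString().endsWith(".bz2"))
                    .map(p -> p.toAbsolutePath().toString())
                    .sorted()
                    .collect(Collectors.toList());
            Files.write(listFile, lines);
            return lines.size();
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo with a temporary directory standing in for the mount path.
        Path dir = Files.createTempDirectory("scratch");
        Files.createFile(dir.resolve("part-0001.bz2"));
        Files.createFile(dir.resolve("part-0002.bz2"));
        System.out.println(writeFileList(dir, dir.resolve("file-list.txt"))
                + " paths written");
    }
}
```

The resulting file is then the job input, with NLineInputFormat configured for one line per split as described in step 2.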

 Date: Fri, 15 Aug 2014 12:30:09 -0600

 Subject: Re: hadoop/yarn and task parallelization on non-hdfs filesystems
 From: iphcal...@gmail.com
 To: user@hadoop.apache.org


 Thanks for the responses!

 To clarify, I'm not using any special FileSystem implementation. An
 example input parameter to a MapReduce job would be something like
 -input file:///scratch/data. Thus I think (any clarification would
 be helpful) Hadoop is then utilizing LocalFileSystem
 (org.apache.hadoop.fs.LocalFileSystem).

 The input data is large enough and splittable (1300 .bz2 files, 274MB
 each, 350GB total). Thus even if the input data weren't splittable,
 Hadoop should be able to parallelize up to 1300 map tasks if capacity
 is available; in my case, I find that the Hadoop cluster is not fully
 utilized (i.e., ~25-35 containers running when it can scale up to ~80
 containers) when not using HDFS, while achieving maximum use when
 using HDFS.

 I'm wondering if Hadoop is holding back or throttling the I/O if
 LocalFileSystem is being used, and what changes I can make to have the
 Hadoop tasks scale.

 In the meantime, I'll take a look at the API calls that Harsh mentioned.


 On Fri, Aug 15, 2014 at 10:15 AM, Harsh J ha...@cloudera.com wrote:
  The split configurations in FIF mentioned earlier would work for local
  files as well. They aren't deemed unsplittable, just considered as one
  single block.

  If the FS in use has its own advantages, it's better to implement a
  proper interface to it that makes use of them than to rely on the LFS
  by mounting it. This is what we do with HDFS.
 
  On Aug 15, 2014 8:52 PM, java8964 java8...@hotmail.com wrote:
 
  I believe that Calvin mentioned before that this parallel file system is
  mounted into the local file system.

  In this case, will Hadoop just use java.io.File as the local file
  system, treat them as local files, and not split the files?

  I just want to know the logic in Hadoop for handling local files.

  One suggestion I can think of is to split the files manually outside of
  Hadoop. For example, generate lots of small files of 128M or 256M in
  size.

  In this case, each mapper will process one small file, so you can get
  good utilization of your cluster, assuming you have a lot of small
  files.
 
  Yong
 
   From: ha...@cloudera.com
   Date: Fri, 15 Aug 2014 16:45:02 +0530
   Subject: Re: hadoop/yarn and task parallelization on non-hdfs
   filesystems
   To: user@hadoop.apache.org
  
   Does your non-HDFS filesystem implement a getBlockLocations API, that
   MR relies on to know how to split files?
  
   The API is at
  
   http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus,
   long, long), and MR calls it at
  
  
   

Re: hadoop/yarn and task parallelization on non-hdfs filesystems

2014-08-18 Thread Calvin
Oops, one of the settings should read
yarn.nodemanager.vmem-check-enabled. The blog post has a typo and a
comment pointed that out as well.

Thanks,
Calvin

On Mon, Aug 18, 2014 at 4:45 PM, Calvin iphcal...@gmail.com wrote:
 OK, I figured out exactly what was happening.

 I had set the configuration value yarn.nodemanager.vmem-pmem-ratio
 to 10. Since there is no swap space available for use, every task
 which is requesting 2 GB of memory is also requesting an additional 20
 GB of memory. This 20 GB isn't represented in the Memory Used column
 on the YARN applications status page and thus it seemed like I was
 underutilizing the YARN cluster (when in actuality I had allocated all
 the memory available).

 (The cluster underutilization occurs regardless of using HDFS or
 LocalFileSystem; I must have made this configuration change after
 testing HDFS and before testing the local filesystem.)

 The solution is to set  yarn.nodemanager.vmem-pmem-ratio to 1 (since
 I have no swap) *and* yarn.nodemanager.vmem-check.enabled to false.

 [...]

Re: 100% CPU consumption by Resource Manager process

2014-08-18 Thread Wangda Tan
Hi Krishna,

4) What's the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms in
your configuration?
50

I think this config is problematic; too small a heartbeat interval will
cause the NMs to contact the RM too often. I would suggest setting this
value larger, e.g. 1000.
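As a sketch, in yarn-site.xml on the resource manager:

```xml
<!-- Raise the NM heartbeat interval from 50 ms back toward the
     1000 ms default. -->
<property>
  <name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name>
  <value>1000</value>
</property>
```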

Thanks,
Wangda



On Wed, Aug 13, 2014 at 4:42 PM, Krishna Kishore Bonagiri 
write2kish...@gmail.com wrote:

 Hi Wangda,
   Thanks for the reply, here are the details, please see if you could
 suggest anything.

 1) Number of nodes and running app in the cluster
 2 nodes, and I am running my own application that keeps asking for
 containers,
 a) running something on the containers,
 b) releasing the containers,
 c) ask for more containers with incremented priority value, and repeat the
 same process

 2) What's the version of your Hadoop?
 apache hadoop-2.4.0

 3) Have you set
 yarn.scheduler.capacity.schedule-asynchronously.enable=true?
 No

 4) What's the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms in
 your configuration?
 50




 On Tue, Aug 12, 2014 at 12:44 PM, Wangda Tan wheele...@gmail.com wrote:

 Hi Krishna,
 To get more understanding about the problem, could you please share
 following information:
 1) Number of nodes and running app in the cluster
 2) What's the version of your Hadoop?
 3) Have you set
 yarn.scheduler.capacity.schedule-asynchronously.enable=true?
 4) What's the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms
 in your configuration?

 Thanks,
 Wangda Tan



 On Sun, Aug 10, 2014 at 11:29 PM, Krishna Kishore Bonagiri 
 write2kish...@gmail.com wrote:

 Hi,
   My YARN resource manager is consuming 100% CPU when I am running an
 application that is running for about 10 hours, requesting as many as 27000
 containers. The CPU consumption was very low at the starting of my
 application, and it gradually went high to over 100%. Is this a known issue
 or are we doing something wrong?

 Every dump of the Event Processor thread shows it running
 LeafQueue::assignContainers(), specifically the for loop below from
 LeafQueue.java, and it seems to be looping through some priority list.

 // Try to assign containers to applications in order
 for (FiCaSchedulerApp application : activeApplications) {
 ...
 // Schedule in priority order
 for (Priority priority : application.getPriorities()) {

 3XMTHREADINFO  ResourceManager Event Processor
 J9VMThread:0x01D08600, j9thread_t:0x7F032D2FAA00,
 java/lang/Thread:0x8341D9A0, state:CW, prio=5
 3XMJAVALTHREAD(java/lang/Thread getId:0x1E, isDaemon:false)
 3XMTHREADINFO1(native thread ID:0x4B64, native priority:0x5,
 native policy:UNKNOWN)
 3XMTHREADINFO2(native stack address range
 from:0x7F0313DF8000, to:0x7F0313E39000, size:0x41000)
 3XMCPUTIME   *CPU usage total: 42334.614623696 secs*
 3XMHEAPALLOC Heap bytes allocated since last GC cycle=20456
 (0x4FE8)
 3XMTHREADINFO3   Java callstack:
 4XESTACKTRACEat
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.assignContainers(LeafQueue.java:850(Compiled
 Code))
 5XESTACKTRACE   (entered lock:
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp@0x8360DFE0,
 entry count: 1)
 5XESTACKTRACE   (entered lock:
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue@0x833B9280,
 entry count: 1)
 4XESTACKTRACEat
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.assignContainersToChildQueues(ParentQueue.java:655(Compiled
 Code))
 5XESTACKTRACE   (entered lock:
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue@0x83360A80,
 entry count: 2)
 4XESTACKTRACEat
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.assignContainers(ParentQueue.java:569(Compiled
 Code))
 5XESTACKTRACE   (entered lock:
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue@0x83360A80,
 entry count: 1)
 4XESTACKTRACEat
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:831(Compiled
 Code))
 5XESTACKTRACE   (entered lock:
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler@0x834037C8,
 entry count: 1)
 4XESTACKTRACEat
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.handle(CapacityScheduler.java:878(Compiled
 Code))
 4XESTACKTRACEat
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.handle(CapacityScheduler.java:100(Compiled
 Code))
 4XESTACKTRACEat
 

Re: 100% CPU consumption by Resource Manager process

2014-08-18 Thread Krishna Kishore Bonagiri
Thanks Wangda, I think I have reduced this when I was trying to reduce the
container allocation time.

-Kishore


On Tue, Aug 19, 2014 at 7:39 AM, Wangda Tan wheele...@gmail.com wrote:

 Hi Krishna,

 4) What's the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms in
 your configuration?
 50

 I think this config is problematic; too small a heartbeat interval will
 cause the NMs to contact the RM too often. I would suggest setting this
 value larger, e.g. 1000.

 Thanks,
 Wangda



 [...]


mismatch layout version.

2014-08-18 Thread juil cho
 
I installed version 2.4.1.

I found the following information in the datanode log file.

But the cluster is healthy.
*** datanode log
2014-07-21 15:43:27,365 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Data-node version: -55 and name-node layout version: -56