RE: data loss after cluster wide power loss

2013-07-02 Thread Uma Maheswara Rao G
Hi Dave, Looks like your analysis is correct. I have faced a similar issue some time back. See the discussion link: http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results On sudden restarts, it can lose the OS filesystem edits. A similar thing happened in ou

Re: data loss after cluster wide power loss

2013-07-02 Thread Azuryy Yu
Hi Uma, I think there is minimal performance degradation if you set dfs.datanode.synconclose to true. On Tue, Jul 2, 2013 at 3:31 PM, Uma Maheswara Rao G wrote: > Hi Dave, > > Looks like your analysis is correct. I have faced a similar issue some time > back. > See the discussion link: > http://markm
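For reference, a hedged hdfs-site.xml sketch of the setting discussed here; with it set, the datanode fsyncs block files when they are closed, trading a little write latency for durability across power loss:

<property>
  <name>dfs.datanode.synconclose</name>
  <value>true</value>
</property>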

Re: How to write/run MPI program on Yarn?

2013-07-02 Thread sam liu
Could anyone help answer the above questions? Thanks a lot! 2013/7/1 sam liu > Thanks Pramod and Clark! > > 1. What's the relationship between the Hadoop 2.x branch and the mpich2-yarn project? > 2. Does the Hadoop 2.x branch plan to include an MPI implementation? I mentioned > there is already a JIRA: > https://issu

Re: How to write/run MPI program on Yarn?

2013-07-02 Thread Olivier Renault
Sam, The fundamental idea of YARN is to split the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. With YARN, we will be able to run multiple workloads, one being MapReduce and another being MPI. So mpich2-yarn

Latest distcp code

2013-07-02 Thread Jagat Singh
Hi, Which branch of Hadoop has the latest distcp code? The branch-1 mentions something like distcp2 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1/src/tools/org/apache/hadoop/tools/distcp2/util/DistCpUtils.java The trunk has no mention of distcp2 http://svn.apache.org/repos/asf/h

Re: Latest distcp code

2013-07-02 Thread Harsh J
The trunk's distcp is distcp2 by default. The branch-1 received a backport of distcp2 recently, so it is named differently. In general we try not to introduce new features in branch-1; all new features must go to trunk first, before being back-ported into maintained release branches. On T
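For anyone searching the archives, invocation is the same either way (standard distcp CLI; the cluster names and paths are hypothetical):

hadoop distcp hdfs://nn1:8020/src/data hdfs://nn2:8020/dst/data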

Re: Latest distcp code

2013-07-02 Thread Jagat Singh
Hello Harsh, Thank you very much for your reply. Regards, Jagat On Tue, Jul 2, 2013 at 8:29 PM, Harsh J wrote: > The trunk's distcp is distcp2 by default. The branch-1 received a > backport of distcp2 recently, so it is named differently. > > In general we try not to introduc

Re: data loss after cluster wide power loss

2013-07-02 Thread Dave Latham
Hi Uma, Thanks for the pointer. Your case sounds very similar. The main differences that I see are that in my case it happened on all 3 replicas and the power failure occurred merely seconds after the blocks were finalized. So I guess the question is whether HDFS can do anything to better recov

Fwd: Number format exception : For input string

2013-07-02 Thread Mix Nin
I wrote a script as below.

Data = LOAD 'part-r-0' AS (session_start_gmt:long)
FilterData = FILTER Data BY session_start_gmt=1369546091667

I get the error below:

2013-07-01 22:48:06,510 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: For input string: "1369546091667"

In detail log it
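A hedged guess at the cause, not taken from the thread itself: Pig parses a bare integer constant as a 32-bit int, and 1369546091667 overflows that range, which would produce exactly this NumberFormatException at parse time. If so, suffixing the constant with L makes Pig read it as a long (note also that Pig's equality operator is == and statements end with a semicolon):

FilterData = FILTER Data BY session_start_gmt == 1369546091667L;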

RE: intermediate results files

2013-07-02 Thread John Lilley
Replication also has downstream effects: it puts pressure on the available network bandwidth and disk I/O bandwidth when the cluster is loaded. john From: Mohammad Tariq [mailto:donta...@gmail.com] Sent: Monday, July 01, 2013 6:35 PM To: user@hadoop.apache.org Subject: Re: intermediate results fi

RE: YARN tasks and child processes

2013-07-02 Thread John Lilley
Devaraj, Thanks, this is also good information. But I was really asking if a child *process* that was spawned by a task can persist, in addition to the data. john From: Devaraj k [mailto:devara...@huawei.com] Sent: Monday, July 01, 2013 11:50 PM To: user@hadoop.apache.org Subject: RE: YARN tasks

Re: intermediate results files

2013-07-02 Thread Ravi Prakash
Hi John! If your block is going to be replicated to three nodes, then in the default block placement policy, 2 of them will be on the same rack, and a third one will be on a different rack. Depending on the network bandwidths available intra-rack and inter-rack, writing with replication factor=3
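If the intermediate files do not need full durability, one standard mitigation (plain HDFS shell; the path is hypothetical) is to lower their replication; -w waits for the change to take effect:

hadoop fs -setrep -w 1 /tmp/intermediate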

Re: YARN tasks and child processes

2013-07-02 Thread Ravi Prakash
Nopes! The node manager kills the entire process tree when the task reports that it is done. Now if you were able to figure out a way for one of the children to break out of the process tree, maybe? However, your approach is obviously not recommended: you would be stealing from the resources tha

HDFS file section rewrite

2013-07-02 Thread John Lilley
I'm sure this has been asked a zillion times, so please just point me to the JIRA comments: is there a feature underway to allow for re-writing of HDFS file sections? Thanks John

typical JSON data sets

2013-07-02 Thread John Lilley
I would like to hear your experiences working with large JSON data sets, specifically:
1) How large is each JSON document?
2) Do they tend to be a single JSON doc per file, or multiples per file?
3) Do the JSON schemas change over time?
4) Are there interesting public data

Containers and CPU

2013-07-02 Thread John Lilley
I have YARN tasks that benefit from multicore scaling. However, they don't *always* use more than one core. I would like to allocate containers based only on memory, and let each task use as many cores as needed, without allocating exclusive CPU "slots" in the scheduler. For example, on an 8-

RE: some idea about the Data Compression

2013-07-02 Thread John Lilley
Geelong, 1. These files will probably be in some standard format like .gz, .bz2 or .zip. In that case, pick an appropriate InputFormat. See e.g. http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/, http://stackoverflow.com/questions/14497572/reading-gzipped-file-in-hadoop

Custom JoinRecordReader class

2013-07-02 Thread Chloe Guszo
Hi all, I would like some help/direction on implementing a custom join class. I believe this is the way to address my task at hand, which is given 2 matrices in SequenceFile format, I wish to run operations on all pairs of rows between them. The rows may not be equal in number. The actual operatio

RE: Containers and CPU

2013-07-02 Thread Chuan Liu
I believe this is the default behavior. By default, only memory limit on resources is enforced. The capacity scheduler will use DefaultResourceCalculator to compute resource allocation for containers by default, which also does not take CPU into account. -Chuan From: John Lilley [mailto:john.lil
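As a hedged sketch of the alternative (property and class name as documented for the CapacityScheduler; verify they exist in your release), capacity-scheduler.xml can switch to a CPU-aware calculator:

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>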

Re: Containers and CPU

2013-07-02 Thread Sandy Ryza
CPU limits are only enforced if cgroups is turned on. With cgroups on, they are only limited when there is contention, in which case tasks are given CPU time in proportion to the number of cores requested for/allocated to them. Does that make sense? -Sandy On Tue, Jul 2, 2013 at 9:50 AM, Chuan

RE: How can a YarnTask read/write local-host HDFS blocks?

2013-07-02 Thread John Lilley
Blah blah, One point you might have missed: multiple tasks cannot all write the same HDFS file at the same time. So you can't just split an output file into sections and say "task1 writes block1", etc. Typically each task outputs a separate file and these file-parts are read or merged later. jo
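A common pattern for the merge step (standard HDFS shell; paths are hypothetical) is to let each task write its own part file under a single directory and combine them afterwards:

hadoop fs -getmerge /out/job1 merged.txt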

RE: Yarn HDFS and Yarn Exceptions when processing "larger" datasets.

2013-07-02 Thread John Lilley
Blah blah, Can you build and run the DistributedShell example? If it does not run correctly, this would tend to implicate your configuration. If it runs correctly, then your code is suspect. John From: blah blah [mailto:tmp5...@gmail.com] Sent: Tuesday, June 25, 2013 6:09 PM To: user@hadoop.apac

RE: Exception in createBlockOutputStream - poss firewall issue

2013-07-02 Thread John Lilley
I don’t know the answer… but if it is possible to make the DNs report a domain name instead of an IP quad, it may help. John From: Robin East [mailto:robin.e...@xense.co.uk] Sent: Thursday, June 27, 2013 12:18 AM To: user@hadoop.apache.org Subject: Re: Exception in createBlockOutputStream - poss

RE: Business Analysts in Hadoop World

2013-07-02 Thread John Lilley
Hadoop does not yet have a gentle learning curve, so I'd recommend that you start with Amazon Elastic MapReduce as an experimental platform to start learning. John From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:vijaya.bho...@huawei.com] Sent: Friday, June 28, 2013 7:10 AM To: user@hadoop.apache.org Su

Re: HDFS file section rewrite

2013-07-02 Thread Suresh Srinivas
HDFS only supports regular writes and append. Random write is not supported. I do not know of any feature/jira that is underway to support this feature. On Tue, Jul 2, 2013 at 9:01 AM, John Lilley wrote: > I’m sure this has been asked a zillion times, so please just point me to > the JIRA comme
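To make the two supported write paths concrete, a minimal sketch against the standard FileSystem API (class name and path are hypothetical; append also assumes a release/configuration where it is enabled):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/tmp/append-demo.txt");
    // regular write: create() starts a fresh file (overwrite = true)
    FSDataOutputStream out = fs.create(p, true);
    out.writeBytes("first record\n");
    out.close();
    // append: bytes can only be added at the end; there is no random rewrite
    out = fs.append(p);
    out.writeBytes("second record\n");
    out.close();
    fs.close();
  }
}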

RE: YARN tasks and child processes

2013-07-02 Thread John Lilley
Thanks, that answers my question. I am trying to explore alternatives to a YARN auxiliary service, but apparently this isn’t an option. John From: Ravi Prakash [mailto:ravi...@ymail.com] Sent: Tuesday, July 02, 2013 9:55 AM To: user@hadoop.apache.org Subject: Re: YARN tasks and child processes

RE: Containers and CPU

2013-07-02 Thread John Lilley
Sandy, Sorry, I don't completely follow. When you say "with cgroups on", is that an attribute of the AM, the Scheduler, or the Site/RM? In other words is it site-wide or something that my application can control? With cgroups on, is there still a way to get my desired behavior? I'd really like

RE: Containers and CPU

2013-07-02 Thread John Lilley
To explain my reasoning, suppose that I have an application that performs some CPU-intensive calculation, and can scale to multiple cores internally, but it doesn't need those cores all the time because the CPU-intensive phase is only a part of the overall computation. I'm not sure I understand

HDFS temporary file locations

2013-07-02 Thread John Lilley
Is there any convention for clients/applications wishing to use temporary file space in HDFS? For example, my application wants to:
1) Load data into some temporary space in HDFS as an external client
2) Run an AM, which produces HDFS output (also in the temporary space)
3) Read

Re: Containers and CPU

2013-07-02 Thread Sandy Ryza
Use of cgroups for controlling CPU is off by default, but can be turned on as a nodemanager configuration with yarn.nodemanager.linux-container-executor.resources-handler.class. So it is site-wide. If you want tasks to purely fight it out in the OS thread scheduler, simply don't change from the d
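A hedged yarn-site.xml sketch of what turning that on looks like (class names as commonly documented for the LinuxContainerExecutor; verify against your release):

<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>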

RE: Containers and CPU

2013-07-02 Thread John Lilley
Sandy, Thanks, I think I understand. So it only makes a difference if cgroups is on AND the AM requests multiple cores? E.g. if each task wants 4 cores the RM would only allow two containers per 8-core node? John From: Sandy Ryza [mailto:sandy.r...@cloudera.com] Sent: Tuesday, July 02, 2013 1

Re: Yarn HDFS and Yarn Exceptions when processing "larger" datasets.

2013-07-02 Thread blah blah
Hi, Just a quick short reply (tomorrow is my prototype presentation). @Omkar Joshi
- RM port 8030 already running when I start my AM
- I'll do the client thread size AM
- Only AM communicates with RM
- RM/NM no exceptions there (as far as I remember, will check later [sorry])
Furthermore in fully

RE: HDFS file section rewrite

2013-07-02 Thread John Lilley
I found this: https://issues.apache.org/jira/browse/HADOOP-5215 Doesn't seem to have attracted much interest. John From: Suresh Srinivas [mailto:sur...@hortonworks.com] Sent: Tuesday, July 02, 2013 1:03 PM To: hdfs-u...@hadoop.apache.org Subject: Re: HDFS file section rewrite HDFS only supports

Re: Containers and CPU

2013-07-02 Thread Sandy Ryza
That's correct. -Sandy On Tue, Jul 2, 2013 at 12:28 PM, John Lilley wrote: > Sandy, > > Thanks, I think I understand. So it only makes a difference if cgroups is > on AND the AM requests multiple cores? E.g. if each task wants 4 cores the > RM would only allow two containers per 8-core n

Re: Exception in createBlockOutputStream - poss firewall issue

2013-07-02 Thread Robin East
Hi John, exactly what I was thinking, however I haven't found a way to do that. If I ever have time I'll trawl through the code, however I've managed to avoid the issue by placing both machines inside the firewall. Regards Robin Sent from my iPhone On 2 Jul 2013, at 19:48, John Lilley wrote:

Unable to load native-hadoop library

2013-07-02 Thread Chui-Hui Chiu
Hello, I have a Hadoop 2.0.5 Alpha cluster. When I execute any Hadoop command, I see the following message. WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Is it in the lib/native folder? How do I configure the s

Re: Unable to load native-hadoop library

2013-07-02 Thread Ted Yu
Take a look here: http://search-hadoop.com/m/FXOOOTJruq1 On Tue, Jul 2, 2013 at 3:25 PM, Chui-Hui Chiu wrote: > Hello, > > I have a Hadoop 2.0.5 Alpha cluster. When I execute any Hadoop command, I > see the following message. > > WARN util.NativeCodeLoader: Unable to load native-hadoop library
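For the archives, the usual checks look roughly like this (a hedged sketch assuming a standard tarball layout; a common cause is a 32-bit libhadoop.so under a 64-bit JVM):

# confirm a native libhadoop built for your platform is present
ls $HADOOP_HOME/lib/native
# point the JVM at it, e.g. in hadoop-env.sh
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"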

What's Yarn?

2013-07-02 Thread Azuryy Yu
Hi Dear all, I just found this by chance; maybe you all know it already, but I am sharing it here again. Yet Another Resource Negotiator (YARN) from: http://adtmag.com/blogs/watersworks/2012/08/apache-yarn-promotion.aspx

Parameter 'yarn.nodemanager.resource.cpu-cores' does not work

2013-07-02 Thread sam liu
Hi, With Hadoop 2.0.4-alpha, yarn.nodemanager.resource.cpu-cores does not work for me:
1. The performance of the same terasort job does not change, even after increasing or decreasing the value of 'yarn.nodemanager.resource.cpu-cores' in yarn-site.xml and restarting the yarn cluster.
2. Even if I
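This ties back to the Containers and CPU thread above: with the DefaultResourceCalculator the scheduler ignores CPU entirely, so changing the core count cannot affect job performance unless a CPU-aware calculator and/or cgroups enforcement is enabled. As a hedged note, later 2.x releases also rename the property; a sketch of the newer spelling (verify against your release):

<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>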