Re: What else can be built on top of YARN.

2013-05-29 Thread Krishna Kishore Bonagiri
Hi Rahul,

  It is at least for the reasons that Vinod listed that porting my
application to YARN, instead of making it work in the MapReduce framework,
makes my life easier. The main purpose of my using YARN is to
exploit its resource management capabilities.

Thanks,
Kishore


On Wed, May 29, 2013 at 11:00 PM, Rahul Bhattacharjee <
rahul.rec@gmail.com> wrote:

> Thanks for the response Krishna.
>
> I was wondering whether it was possible to use MR to solve your problem
> instead of building the whole stack on top of YARN.
> Most likely it's not possible, and that's why you are building it. I wanted
> to know why that is.
>
> I am just trying to find out the need, or why we might need to write the
> application on YARN.
>
> Rahul
>


Re: What else can be built on top of YARN.

2013-05-29 Thread Vinod Kumar Vavilapalli


Historically, many applications/frameworks wanted to take advantage of just the 
resource management capabilities and failure handling of Hadoop (via the
JobTracker/TaskTracker), but were forced to use MapReduce even though they 
didn't need to. Obvious examples are graph processing (Giraph), BSP (Hama), 
Storm/S4, and even a simple tool like DistCp.

There are issues even with map-only jobs:
 - You have to fake key-value processing, periodic pings, and key-value outputs.
 - You are limited to the map slot capacity in the cluster.
 - The number of tasks is static, so you cannot grow and shrink your job.
 - You are forced to sort data all the time (even though this has changed 
recently).
 - You are tied to faking things like OutputCommitter even if you don't need it.

That's just for starters. I can definitely think harder and list more ;)

YARN lets you move ahead without those limitations.

HTH
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/
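
To make the contrast above concrete, here is a minimal sketch of submitting a
non-MapReduce application through the Hadoop 2.x yarn-client API. Everything
except the YARN classes themselves (the application name, the AM class, and
the command line) is illustrative, not from this thread:

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        // Talk to the ResourceManager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Ask for a new application id and fill in the submission context.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("my-non-mapreduce-app");

        // The ApplicationMaster container runs whatever command you give it:
        // no mappers, no reducers, no fake key-value pairs.
        // "my.app.MyApplicationMaster" is a placeholder for your own AM class.
        ContainerLaunchContext amContainer =
                Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "$JAVA_HOME/bin/java my.app.MyApplicationMaster"));
        ctx.setAMContainerSpec(amContainer);

        // Resources requested for the AM container.
        Resource amResource = Records.newRecord(Resource.class);
        amResource.setMemory(512);
        ctx.setResource(amResource);

        yarnClient.submitApplication(ctx);
    }
}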


On May 29, 2013, at 7:34 AM, Rahul Bhattacharjee wrote:

> Hi all,
> 
> I was going through the motivation behind YARN. Splitting the responsibility 
> of the JT is the major concern. Ultimately the base (YARN) was built in a 
> generic way for building other generic distributed applications too.
> 
> I am not able to think of any other parallel processing use case that would 
> be useful to build on top of YARN. I thought of a lot of use cases that would 
> be beneficial when run in parallel, but again, we can do those using map-only 
> jobs in MR.
> 
> Can someone tell me a scenario where an application can utilize YARN 
> features, or can be built on top of YARN, and at the same time cannot be 
> done efficiently using MRv2 jobs?
> 
> thanks,
> Rahul
> 
> 



Re: Reading json format input

2013-05-29 Thread Rahul Bhattacharjee
Whatever you have mentioned should work, Jamal. You can debug this.

Thanks,
Rahul


On Thu, May 30, 2013 at 5:14 AM, jamal sasha  wrote:

> Hi,
>   For some reason, this has to be in Java :(
> I am trying to use the org.json library, something like this (in the mapper):
> JSONObject jsn = new JSONObject(value.toString());
>
> String text = (String) jsn.get("text");
> StringTokenizer itr = new StringTokenizer(text);
>
> But it's not working :(
> It would be better to get this working properly, but I wouldn't mind using a
> hack as well :)
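
For reference, a minimal sketch of the mapper being discussed, assuming the
org.json library is shipped with the job (e.g. via -libjars) and the
one-record-per-line input quoted above; the class and counter names are
illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONException;
import org.json.JSONObject;

public class JsonWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // Parse the whole line as JSON, then tokenize only the "text" field.
            JSONObject json = new JSONObject(value.toString());
            StringTokenizer itr = new StringTokenizer(json.getString("text"));
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        } catch (JSONException e) {
            // Count and skip malformed lines instead of failing the task.
            context.getCounter("JsonWordCount", "BAD_RECORDS").increment(1);
        }
    }
}

The reducer and driver stay the same as in the stock wordcount example; only
the mapper changes.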


Re: Reading json format input

2013-05-29 Thread Michael Segel
You have the entire string. 
If you tokenize on commas ... 

Starting with:
>> {"author":"foo", "text": "hello"}
>> {"author":"foo123", "text": "hello world"}
>> {"author":"foo234", "text": "hello this world"}

You end up with two tokens:
{"author":"foo" and "text": "hello"}

So you can ignore the first token, then split the second token on the colon (':').

This gives you "text" and "hello"}

You can again ignore the first token and you now have "hello"}

And now you can parse out the stuff within the quotes. 

HTH
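
For what it's worth, a minimal sketch of that split-and-strip approach. It
assumes the exact one-line record shape quoted in this thread (a single
"text" field after the first comma); for anything less regular, a real JSON
parser is the safer route:

public class NaiveJsonText {

    // Extract the value of the "text" field from one record such as
    // {"author":"foo", "text": "hello world"}
    public static String extractText(String line) {
        String[] tokens = line.split(",", 2);   // drop the "author" token
        String[] kv = tokens[1].split(":", 2);  // drop the "text" key
        String value = kv[1].trim();            // now: "hello world"}

        // The stuff within the quotes.
        int open = value.indexOf('"');
        int close = value.lastIndexOf('"');
        return value.substring(open + 1, close);
    }

    public static void main(String[] args) {
        String line = "{\"author\":\"foo\", \"text\": \"hello world\"}";
        System.out.println(extractText(line)); // prints: hello world
    }
}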


On May 29, 2013, at 6:44 PM, jamal sasha  wrote:

> Hi,
>   For some reason, this has to be in Java :(
> I am trying to use the org.json library, something like this (in the mapper):
> JSONObject jsn = new JSONObject(value.toString());
> 
> String text = (String) jsn.get("text");
> StringTokenizer itr = new StringTokenizer(text);
> 
> But it's not working :(
> It would be better to get this working properly, but I wouldn't mind using a
> hack as well :)
> 



Re: issue launching mapreduce job with kerberos secured hadoop

2013-05-29 Thread Robert Molina
Hi Neeraj,
This error doesn't look to be Kerberos-related at first glance. Can you
verify that 192.168.49.51
has the TaskTracker process running?

Regards,
Robert


On Tue, May 28, 2013 at 7:58 PM, Rahul Bhattacharjee <
rahul.rec@gmail.com> wrote:

> The error looks a little low-level, at the network level; the HTTP server
> for some reason couldn't bind to the port.
> It might have nothing to do with Kerberos.
>
> Thanks,
> Rahul
>
>
> On Tue, May 28, 2013 at 6:36 PM, Neeraj Chaplot  wrote:
>
>> Hi All,
>>
>> When Hadoop is started with Kerberos authentication, hadoop fs commands work
>> well but MR jobs fail.
>>
>> A simple wordcount program fails at the reducer stage, giving the following
>> exception:
>> 2013-05-28 17:43:58,896 WARN org.apache.hadoop.mapred.ReduceTask:
>> attempt_201305281729_0003_r_00_1 copy failed:
>> attempt_201305281729_0003_m_00_0 from 192.168.49.51
>> 2013-05-28 17:43:58,897 WARN org.apache.hadoop.mapred.ReduceTask:
>> java.net.ConnectException: Connection refused
>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>> at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
>> at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
>> at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>> at java.net.Socket.connect(Socket.java:529)
>> at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
>> at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
>> at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
>> at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
>> at sun.net.www.http.HttpClient.New(HttpClient.java:306)
>> at sun.net.www.http.HttpClient.New(HttpClient.java:323)
>> at
>> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
>> at
>> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
>> at
>> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
>> at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618)
>> at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
>> at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
>> at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
>> at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)
>>
>> Please provide some inputs to fix the issue.
>>
>> Thanks
>>
>>
>


Re: Reading json format input

2013-05-29 Thread Rishi Yadav
For that, you have to write intermediate data only if the word equals "text":

String[] words = line.split("\\W+");

for (String word : words) {
    if (word.equals("text")) {
        context.write(new Text(word), new IntWritable(1));
    }
}


I am assuming you have a huge volume of data for this; otherwise MapReduce
will be overkill and a simple regex will do.



On Wed, May 29, 2013 at 4:45 PM, jamal sasha  wrote:

> Hi Rishi,
> But I don't want the wordcount of all the words.
> In the JSON there is a field "text", and those are the words I wish to count.
>


Re: Reading json format input

2013-05-29 Thread jamal sasha
Hi Rishi,
   But I don't want the wordcount of all the words.
In the JSON there is a field "text", and those are the words I wish to count.


On Wed, May 29, 2013 at 4:43 PM, Rishi Yadav  wrote:

> Hi Jamal,
>
> I took your input and put it in a sample wordcount program, and it's working
> just fine, giving this output:
>
> author 3
> foo234 1
> text 3
> foo 1
> foo123 1
> hello 3
> this 1
> world 2
>
>
> When we split using
>
> String[] words = input.split("\\W+");
>
> it takes care of all non-alphanumeric characters.
>
> Thanks and Regards,
>
> Rishi Yadav
>


Re: Reading json format input

2013-05-29 Thread jamal sasha
Hi,
  For some reason, this has to be in Java :(
I am trying to use the org.json library, something like this (in the mapper):
JSONObject jsn = new JSONObject(value.toString());

String text = (String) jsn.get("text");
StringTokenizer itr = new StringTokenizer(text);

But it's not working :(
It would be better to get this working properly, but I wouldn't mind using a
hack as well :)


On Wed, May 29, 2013 at 4:30 PM, Michael Segel wrote:

> Yeah,
> I have to agree w Russell. Pig is definitely the way to go on this.
>
> If you want to do it as a Java program you will have to do some work on
> the input string but it too should be trivial.
> How formal do you want to go?
> Do you want to strip it down or just find the quote after the text part?
>


Re: Reading json format input

2013-05-29 Thread Rishi Yadav
Hi Jamal,

I took your input and put it in a sample wordcount program, and it's working
just fine, giving this output:

author 3
foo234 1
text 3
foo 1
foo123 1
hello 3
this 1
world 2


When we split using

String[] words = input.split("\\W+");

it takes care of all non-alphanumeric characters.

Thanks and Regards,

Rishi Yadav

On Wed, May 29, 2013 at 2:54 PM, jamal sasha  wrote:

> Hi,
> I am stuck again. :(
> My input data is in HDFS. I am again trying to do wordcount, but there is a
> slight difference.
> The data is in JSON format.
> So each line of data is:
>
> {"author":"foo", "text": "hello"}
> {"author":"foo123", "text": "hello world"}
> {"author":"foo234", "text": "hello this world"}
>
> So I want to do wordcount for the text part.
> I understand that in the mapper I just have to parse this data as JSON and
> extract "text", and the rest of the code is just the same, but I am trying
> to switch from Python to Java Hadoop.
> How do I do this?
> Thanks
>


Re: Reading json format input

2013-05-29 Thread Michael Segel
Yeah,
I have to agree with Russell. Pig is definitely the way to go on this.

If you want to do it as a Java program, you will have to do some work on the
input string, but it too should be trivial.
How formal do you want to go?
Do you want to strip it down or just find the quote after the text part?


On May 29, 2013, at 5:13 PM, Russell Jurney  wrote:

> Seriously consider Pig (free answer, 4 LOC):
> 
> my_data = LOAD 'my_data.json' USING 
> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
> words = FOREACH my_data GENERATE $0#'author' as author, 
> FLATTEN(TOKENIZE($0#'text')) as word;
> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word, 
> COUNT_STAR(words) AS word_count;
> STORE word_counts INTO '/tmp/word_counts.txt';
> 
> It will be faster than the Java you'll likely write.
> 
> -- 
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com



Re: Reading json format input

2013-05-29 Thread Russell Jurney
Seriously consider Pig (free answer, 4 LOC):

my_data = LOAD 'my_data.json' USING
com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
words = FOREACH my_data GENERATE $0#'author' as author,
FLATTEN(TOKENIZE($0#'text')) as word;
word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
COUNT_STAR(words) AS word_count;
STORE word_counts INTO '/tmp/word_counts.txt';

It will be faster than the Java you'll likely write.


On Wed, May 29, 2013 at 2:54 PM, jamal sasha  wrote:

> Hi,
> I am stuck again. :(
> My input data is in HDFS. I am again trying to do wordcount, but there is a
> slight difference.
> The data is in JSON format.
> So each line of data is:
>
> {"author":"foo", "text": "hello"}
> {"author":"foo123", "text": "hello world"}
> {"author":"foo234", "text": "hello this world"}
>
> So I want to do wordcount for the text part.
> I understand that in the mapper I just have to parse this data as JSON and
> extract "text", and the rest of the code is just the same, but I am trying
> to switch from Python to Java Hadoop.
> How do I do this?
> Thanks
>



-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Reading json format input

2013-05-29 Thread jamal sasha
Hi,
   I am stuck again. :(
My input data is in HDFS. I am again trying to do wordcount, but there is a
slight difference.
The data is in JSON format.
So each line of data is:

{"author":"foo", "text": "hello"}
{"author":"foo123", "text": "hello world"}
{"author":"foo234", "text": "hello this world"}

So I want to do wordcount for the text part.
I understand that in the mapper I just have to parse this data as JSON and
extract "text", and the rest of the code is just the same, but I am trying to
switch from Python to Java Hadoop.
How do I do this?
Thanks


Re: Writing data in db instead of hdfs

2013-05-29 Thread Mohammad Tariq
Hello Jamal,

Yes, it is possible. You could use TableReducer to do that. Use it
instead of the normal reducer in your wordcount example. Alternatively you
could use HFileOutputFormat to write directly to HFiles.

Warm Regards,
Tariq
cloudfront.blogspot.com
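
A minimal sketch of that TableReducer route for the wordcount example,
assuming an existing HBase table named "wordcount" with a column family "cf"
(both names are illustrative, not from this thread):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class HBaseWordCountReducer
        extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Row key is the word itself; the count lands in cf:count.
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(sum));
        context.write(null, put);
    }
}

In the driver, instead of setting an output path, you would wire the reducer
to the table with something like
TableMapReduceUtil.initTableReducerJob("wordcount", HBaseWordCountReducer.class, job);
so the job writes rows to HBase and produces no HDFS output.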


On Thu, May 30, 2013 at 2:08 AM, jamal sasha  wrote:

> Hi,
>
>   Is it possible to save data in a database (HBase, Cassandra?) directly
> from Hadoop,
> so that there is no output in HDFS and it directly writes the data into
> this DB?
>
> If I want to modify the wordcount example to achieve this, what/where should I
> make these modifications?
> Any help/suggestions?
> Thanks
>


Writing data in db instead of hdfs

2013-05-29 Thread jamal sasha
Hi,

  Is it possible to save data in a database (HBase, Cassandra?) directly
from Hadoop,
so that there is no output in HDFS and it directly writes the data into
this DB?

If I want to modify the wordcount example to achieve this, what/where should I
make these modifications?
Any help/suggestions?
Thanks


Re: What else can be built on top of YARN.

2013-05-29 Thread Viral Bajaria
There is a project at Yahoo that makes it possible to run Storm on YARN. I
think the team behind it is going to give a talk at Hadoop Summit and plans
to open-source it after that.

-Viral

On Wed, May 29, 2013 at 11:04 AM, John Conwell  wrote:

> Storm, a distributed realtime computation framework used for analyzing
> realtime streams of data, doesn't really need to be ported. It's doing fine
> by itself, though I think it's a prime candidate for a YARN port.


Re: Help: error in hadoop build

2013-05-29 Thread Ted Yu
What's the output of:

protoc --version

You should be using 2.4.1

Cheers

On Wed, May 29, 2013 at 11:33 AM, John Lilley wrote:

>  Sorry if this is a dumb question, but I’m not sure where to start.  I am
> following BUILDING.txt instructions for source checked out today using git:
> 
>
> git clone git://git.apache.org/hadoop-common.git Hadoop
>
>
> Following build steps and adding -X for more logging:
>
> mvn compile -X
>
>
> But I get this error in Hadoop common
>
> [WARNING] [protoc,
> --java_out=/home/jlilley/hadoop/hadoop-common-project/hadoop-common/target/generated-sources/java,
> -I/home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto,
> /home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/Security.proto,
> /home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/ZKFCProtocol.proto,
> /home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/RpcHeader.proto,
> /home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/ProtobufRpcEngine.proto,
> /home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/IpcConnectionContext.proto,
> /home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/HAServiceProtocol.proto,
> /home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/ProtocolInfo.proto]
> failed with error code 1
>
> [DEBUG] Security.proto:22:8: Option "java_generate_equals_and_hash"
> unknown.
>
> [ERROR] protoc compiler error
>
>
> The following were installed today, this is basically on a clean CentOS6
> system:
>
> # history | grep yum
>
>81  yum install protobuf-compiler
>
>82  yum install gcc
>
>84  yum install gcc-c++
>
>85  yum install cmake
>
>86  yum install make
>
>87  yum install zlib
>
>90  yum install git
>
>93  yum install eclipse
>
>
> This seems to indicate that the problem may have to do with a too-recent
> protobuf and additional settings that must be applied because of that:
>
>
> http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201106.mbox/%3c4dfa3c67.50...@gmail.com%3E
> 
>
>
> Thanks
>
> John
>


Help: error in hadoop build

2013-05-29 Thread John Lilley
Sorry if this is a dumb question, but I'm not sure where to start.  I am 
following BUILDING.txt instructions for source checked out today using git:
git clone git://git.apache.org/hadoop-common.git Hadoop

Following build steps and adding -X for more logging:
mvn compile -X

But I get this error in Hadoop common
[WARNING] [protoc, 
--java_out=/home/jlilley/hadoop/hadoop-common-project/hadoop-common/target/generated-sources/java,
 -I/home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto, 
/home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/Security.proto,
 
/home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/ZKFCProtocol.proto,
 
/home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/RpcHeader.proto,
 
/home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/ProtobufRpcEngine.proto,
 
/home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/IpcConnectionContext.proto,
 
/home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/HAServiceProtocol.proto,
 
/home/jlilley/hadoop/hadoop-common-project/hadoop-common/src/main/proto/ProtocolInfo.proto]
 failed with error code 1
[DEBUG] Security.proto:22:8: Option "java_generate_equals_and_hash" unknown.
[ERROR] protoc compiler error

The following were installed today, this is basically on a clean CentOS6 system:
# history | grep yum
   81  yum install protobuf-compiler
   82  yum install gcc
   84  yum install gcc-c++
   85  yum install cmake
   86  yum install make
   87  yum install zlib
   90  yum install git
   93  yum install eclipse

This seems to indicate that the problem may have to do with a too-recent 
protobuf and additional settings that must be applied because of that:
http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201106.mbox/%3c4dfa3c67.50...@gmail.com%3E

Thanks
John


Re: What else can be built on top of YARN.

2013-05-29 Thread John Conwell
Two scenarios I can think of are re-implementations of Twitter's Storm (
http://storm-project.net/) and DryadLinq (
http://research.microsoft.com/en-us/projects/dryadlinq/).

Storm, a distributed realtime computation framework used for analyzing
realtime streams of data, doesn't really need to be ported. It's doing fine
by itself, though I think it's a prime candidate for a YARN port.

DryadLinq is a (now closed) research project out of Microsoft Research that
allowed the user to write standard LINQ code (in any .NET language); it would
build a DAG-based execution structure from the LINQ statement and execute
the DAG on an MS HPC cluster.

The LINQ syntax is very much like Pig's, though way more flexible; it has
full IDE support (in Visual Studio) and is used in standard single-process
programming.  That, to me, was the beauty behind DryadLinq: the programming
language for distributed execution was exactly the same as a well-known
language already used by hundreds of thousands of programmers for standard
single-process programming, so the learning curve and acceptance debt are
really low.  But, like all good things that come out of MS Research, it
was killed because they sat on it too long.

The interesting thing is that distributed DAG execution is one of the main
examples given for the types of YARN applications that could be developed.

On Wed, May 29, 2013 at 10:30 AM, Rahul Bhattacharjee <
rahul.rec@gmail.com> wrote:

> Thanks for the response Krishna.
>
> I was wondering whether it was possible to use MR to solve your problem
> instead of building the whole stack on top of YARN.
> Most likely it's not possible, and that's why you are building it. I wanted
> to know why that is.
>
> I am just trying to find out the need, or why we might need to write the
> application on YARN.
>
> Rahul
>


-- 

Thanks,
John C


Re: What else can be built on top of YARN.

2013-05-29 Thread Rahul Bhattacharjee
Thanks for the response Krishna.

I was wondering whether it was possible to use MR to solve your problem
instead of building the whole stack on top of YARN.
Most likely it's not possible, and that's why you are building it. I wanted
to know why that is.

I am just trying to find out the need, or why we might need to write the
application on YARN.

Rahul


On Wed, May 29, 2013 at 8:23 PM, Krishna Kishore Bonagiri <
write2kish...@gmail.com> wrote:

> Hi Rahul,
>
>   I am porting a distributed application that runs on a fixed set of given
> resources to YARN, with the aim of being able to run it on dynamically
> selected resources, whichever are available at the time of running the
> application.
>
> Thanks,
> Kishore
>


Re: OpenJDK?

2013-05-29 Thread Lenin Raj
Yup, that's right.


Thanks,
Lenin


On Wed, May 29, 2013 at 10:23 PM, John Lilley wrote:

> Great, that's what I've done. At least I think so. This is JRE6, right?
>
> # java -version
> java version "1.6.0_43"
> Java(TM) SE Runtime Environment (build 1.6.0_43-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01, mixed mode)
>
> john
>


RE: OpenJDK?

2013-05-29 Thread John Lilley
Great, that's what I've done.  At least I think so.  This is JRE6, right?

# java -version
java version "1.6.0_43"
Java(TM) SE Runtime Environment (build 1.6.0_43-b01)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01, mixed mode)

john

From: Lenin Raj [mailto:emaille...@gmail.com]
Sent: Wednesday, May 29, 2013 9:34 AM
To: user@hadoop.apache.org
Subject: Re: OpenJDK?

Yes. Use Sun/Oracle JDK

I have had memory issues while using Oozie. When I replaced OpenJDK with Sun
JDK 6, the memory issue was resolved.


Thanks,
Lenin

On Wed, May 29, 2013 at 8:22 PM, John Lilley
<john.lil...@redpoint.net> wrote:
I am having trouble finding a definitive answer about OpenJDK vs Sun JDK in 
regards to building Hadoop.  This:
http://wiki.apache.org/hadoop/HadoopJavaVersions
Indicates that OpenJDK is not recommended, but is that an authoritative answer?
BUILDING.txt states no preference.

Thanks
John






Re: OpenJDK?

2013-05-29 Thread Lenin Raj
Yes, use the Sun/Oracle JDK.

I have had memory issues while using Oozie. When I replaced OpenJDK with
Sun JDK 6, the memory issue was resolved.


Thanks,
Lenin


On Wed, May 29, 2013 at 8:22 PM, John Lilley wrote:

>  I am having trouble finding a definitive answer about OpenJDK vs Sun JDK
> in regards to building Hadoop.  This:
>
> http://wiki.apache.org/hadoop/HadoopJavaVersions
>
> Indicates that OpenJDK is not recommended, but is that an authoritative
> answer?
>
> BUILDING.txt states no preference.
>
>
> Thanks
>
> John
>
>


Re: What else can be built on top of YARN.

2013-05-29 Thread Krishna Kishore Bonagiri
Hi Rahul,

  I am porting a distributed application that runs on a fixed set of given
resources to YARN, with the aim of being able to run it on dynamically
selected resources, whichever are available at the time of running the
application.

Thanks,
Kishore


On Wed, May 29, 2013 at 8:04 PM, Rahul Bhattacharjee <
rahul.rec@gmail.com> wrote:

> Hi all,
>
> I was going through the motivation behind YARN. Splitting the
> responsibility of the JT is the major concern. Ultimately the base (YARN)
> was built in a generic way for building other generic distributed
> applications too.
>
> I am not able to think of any other parallel processing use case that
> would be useful to build on top of YARN. I thought of a lot of use cases
> that would be beneficial when run in parallel, but again, we can do those
> using map-only jobs in MR.
>
> Can someone tell me a scenario where an application can utilize YARN
> features, or can be built on top of YARN, and at the same time cannot be
> done efficiently using MRv2 jobs?
>
> thanks,
> Rahul
>
>
>


OpenJDK?

2013-05-29 Thread John Lilley
I am having trouble finding a definitive answer about OpenJDK vs Sun JDK in 
regards to building Hadoop.  This:
http://wiki.apache.org/hadoop/HadoopJavaVersions
indicates that OpenJDK is not recommended, but is that an authoritative answer?
BUILDING.txt states no preference.

Thanks
John



Reduce side question on MR

2013-05-29 Thread Rahul Bhattacharjee
Hi,

I have one question related to the reduce phase of MR jobs.

The intermediate outputs of map tasks are pulled from the nodes that ran the
map tasks to the node where the reducer is going to run, and that
intermediate data is written to the reducer's local FS. My question is this:
if a job processes a huge amount of data and has multiple
mappers but only one reducer, then it's possible that the job would never
complete successfully, as the single host's disk might not be sufficient to
hold all the map outputs of the job.

The job essentially would fail after retrying the configured number of
attempts.

Thanks,
Rahul


What else can be built on top of YARN.

2013-05-29 Thread Rahul Bhattacharjee
Hi all,

I was going through the motivation behind YARN. Splitting the
responsibility of the JT is the major concern. Ultimately the base (YARN)
was built in a generic way for building other generic distributed
applications too.

I am not able to think of any other parallel processing use case that would
be useful to build on top of YARN. I thought of a lot of use cases that
would be beneficial when run in parallel, but again, we can do those using
map-only jobs in MR.

Can someone tell me a scenario where an application can utilize YARN
features, or can be built on top of YARN, and at the same time cannot be
done efficiently using MRv2 jobs?

thanks,
Rahul


Re: Please help me with heartbeat storm

2013-05-29 Thread Philippe Signoret
This might be relevant: https://issues.apache.org/jira/browse/MAPREDUCE-4478

"There are two configuration items to control the TaskTracker's heartbeat
interval. One is mapreduce.tasktracker.outofband.heartbeat. The other is
mapreduce.tasktracker.outofband.heartbeat.damper. If we set
mapreduce.tasktracker.outofband.heartbeat to true and set
mapreduce.tasktracker.outofband.heartbeat.damper to its default value
(100), the TaskTracker may send heartbeats without any interval."


Philippe

---
Philippe Signoret
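
A hedged sketch of what acting on that JIRA would look like in
mapred-site.xml. The property name comes from the JIRA quoted above; the
value shown (disabling out-of-band heartbeats) is only an illustration, not a
tested fix:

<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>false</value>
</property>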


On Tue, May 28, 2013 at 4:44 AM, Rajesh Balamohan <
rajesh.balamo...@gmail.com> wrote:

> The default value of CLUSTER_INCREMENT is 100, so Math.max(1000 * 29/100,
> 3000) = 3000 always. This is the reason why you are seeing so many
> heartbeats. You might want to set it to 1 or 5; this would increase the
> interval between heartbeats sent from the TT to the JT.
>
>
> ~Rajesh.B
>
>
> On Mon, May 27, 2013 at 2:12 PM, Eremikhin Alexey <
> a.eremi...@corp.badoo.com> wrote:
>
>>  Hi!
>>
>> Tried 5 seconds. Fewer nodes get into the storm, but some still do.
>> Additionally, an update of the ntp service helped a little.
>>
>> Initially almost 50% got into storming on each MR job, but after the ntp
>> update and increasing the heartbeat to 5 seconds the level is around 10%.
>>
>>
>> On 26/05/13 10:43, murali adireddy wrote:
>>
>> Hi ,
>>
>>  Just try this one.
>>
>>  In the file "hdfs-site.xml", try to add the below property,
>> "dfs.heartbeat.interval", with a value in seconds.
>>
>>  The default value is '3' seconds. In your case, increase the value.
>>
>>  <property>
>>    <name>dfs.heartbeat.interval</name>
>>    <value>3</value>
>>  </property>
>>
>>  You can find more properties and default values in the below link.
>>
>>
>> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>>
>>
>>  Please let me know is the above solution worked for you ..?
>>
>>
>>
>>
>>  On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey <
>> a.eremi...@corp.badoo.com> wrote:
>>
>>> Hi all,
>>> I have a 29-server Hadoop cluster in an almost default configuration.
>>> After installing Hadoop 1.0.4 I've noticed that the JT and some TTs waste
>>> CPU. I started stracing their behaviour and found that some TTs send
>>> heartbeats without any limit.
>>> It means hundreds per second.
>>>
>>> A daemon restart solves the issue, but even the easiest Hive MR job brings
>>> the issue back.
>>>
>>> Here is the filtered strace of heartbeating process
>>>
>>> hadoop9.mlan:~$ sudo strace -tt -f -s 1 -p 6032 2>&1  | grep 6065 |
>>> grep write
>>>
>>>
>>> [pid  6065] 13:07:34.801106 write(70,
>>> "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30",
>>> 284) = 284
>>> [pid  6065] 13:07:34.807968 write(70,
>>> "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31",
>>> 284 
>>> [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
>>> [pid  6065] 13:07:34.814473 write(70,
>>> "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32",
>>> 284 
>>> [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
>>> [pid  6065] 13:07:34.820960 write(70,
>>> "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33",
>>> 284 
>>>
>>>
>>> Please help me to stop this storming 8(
>>>
>>>
>>
>>
>
>
> --
> ~Rajesh.B
>