Re: remote job submission

2012-04-20 Thread Harsh J
If you are allowed a remote connection to the cluster's service ports,
then you can directly submit your jobs from your local CLI. Just make
sure your local configuration points to the right locations.
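
For a concrete picture, a minimal (untested) driver sketch along those lines -
the host names and ports below are placeholders, and normally you would simply
put the cluster's *-site.xml files on the client's classpath instead of
hard-coding anything:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class RemoteSubmit {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(RemoteSubmit.class);
    // Point the local client at the remote cluster's services.
    // These host names are made up - use your own NameNode/JobTracker addresses.
    conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
    conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

    conf.setJobName("remote-submission-example");
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Ships the job jar/config to the cluster and submits to the JobTracker.
    JobClient.runJob(conf);
  }
}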

Otherwise, perhaps you can choose to use Apache Oozie (Incubating)
(http://incubator.apache.org/oozie/). It does provide a REST interface
that launches jobs for you on the supplied clusters, but it is more
oriented towards workflow management. Or perhaps HUE:
https://github.com/cloudera/hue

On Fri, Apr 20, 2012 at 5:37 PM, Arindam Choudhury
 wrote:
> Hi,
>
> Do hadoop have any web service or other interface so I can submit jobs from
> remote machine?
>
> Thanks,
> Arindam



-- 
Harsh J


Re: remote job submission

2012-04-20 Thread Arindam Choudhury
"If you are allowed a remote connection to the cluster's service ports,
then you can directly submit your jobs from your local CLI. Just make
sure your local configuration points to the right locations."

Can you elaborate in details please?

On Fri, Apr 20, 2012 at 2:20 PM, Harsh J  wrote:

> If you are allowed a remote connection to the cluster's service ports,
> then you can directly submit your jobs from your local CLI. Just make
> sure your local configuration points to the right locations.
>
> Otherwise, perhaps you can choose to use Apache Oozie (Incubating)
> (http://incubator.apache.org/oozie/) It does provide a REST interface
> that launches jobs up for you over the supplied clusters, but its more
> oriented towards workflow management or perhaps HUE:
> https://github.com/cloudera/hue
>
> On Fri, Apr 20, 2012 at 5:37 PM, Arindam Choudhury
>  wrote:
> > Hi,
> >
> > Do hadoop have any web service or other interface so I can submit jobs
> from
> > remote machine?
> >
> > Thanks,
> > Arindam
>
>
>
> --
> Harsh J
>


Re: remote job submission

2012-04-20 Thread Harsh J
Arindam,

If your machine can access the clusters' NN/JT/DN ports, then you can
simply run your job from the machine itself.

On Fri, Apr 20, 2012 at 6:31 PM, Arindam Choudhury
 wrote:
> "If you are allowed a remote connection to the cluster's service ports,
> then you can directly submit your jobs from your local CLI. Just make
> sure your local configuration points to the right locations."
>
> Can you elaborate in details please?
>
> On Fri, Apr 20, 2012 at 2:20 PM, Harsh J  wrote:
>
>> If you are allowed a remote connection to the cluster's service ports,
>> then you can directly submit your jobs from your local CLI. Just make
>> sure your local configuration points to the right locations.
>>
>> Otherwise, perhaps you can choose to use Apache Oozie (Incubating)
>> (http://incubator.apache.org/oozie/) It does provide a REST interface
>> that launches jobs up for you over the supplied clusters, but its more
>> oriented towards workflow management or perhaps HUE:
>> https://github.com/cloudera/hue
>>
>> On Fri, Apr 20, 2012 at 5:37 PM, Arindam Choudhury
>>  wrote:
>> > Hi,
>> >
>> > Do hadoop have any web service or other interface so I can submit jobs
>> from
>> > remote machine?
>> >
>> > Thanks,
>> > Arindam
>>
>>
>>
>> --
>> Harsh J
>>



-- 
Harsh J


Re: remote job submission

2012-04-20 Thread Arindam Choudhury
Sorry. But can you give me an example?

On Fri, Apr 20, 2012 at 3:08 PM, Harsh J  wrote:

> Arindam,
>
> If your machine can access the clusters' NN/JT/DN ports, then you can
> simply run your job from the machine itself.
>
> On Fri, Apr 20, 2012 at 6:31 PM, Arindam Choudhury
>  wrote:
> > "If you are allowed a remote connection to the cluster's service ports,
> > then you can directly submit your jobs from your local CLI. Just make
> > sure your local configuration points to the right locations."
> >
> > Can you elaborate in details please?
> >
> > On Fri, Apr 20, 2012 at 2:20 PM, Harsh J  wrote:
> >
> >> If you are allowed a remote connection to the cluster's service ports,
> >> then you can directly submit your jobs from your local CLI. Just make
> >> sure your local configuration points to the right locations.
> >>
> >> Otherwise, perhaps you can choose to use Apache Oozie (Incubating)
> >> (http://incubator.apache.org/oozie/) It does provide a REST interface
> >> that launches jobs up for you over the supplied clusters, but its more
> >> oriented towards workflow management or perhaps HUE:
> >> https://github.com/cloudera/hue
> >>
> >> On Fri, Apr 20, 2012 at 5:37 PM, Arindam Choudhury
> >>  wrote:
> >> > Hi,
> >> >
> >> > Do hadoop have any web service or other interface so I can submit jobs
> >> from
> >> > remote machine?
> >> >
> >> > Thanks,
> >> > Arindam
> >>
> >>
> >>
> >> --
> >> Harsh J
> >>
>
>
>
> --
> Harsh J
>


Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

2012-04-20 Thread Sky

Thanks! That helped!



-Original Message- 
From: Michael Segel

Sent: Thursday, April 19, 2012 9:38 PM
To: common-user@hadoop.apache.org
Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce 
implementation


If the file is small enough you could read it in to a java object like a 
list and write your own input format that takes a list object as its input 
and then lets you specify the number of mappers.


On Apr 19, 2012, at 11:34 PM, Sky wrote:

My file for the input to the mapper is very small - all it has is urls to a
list of manifests. The task for the mappers is to fetch each manifest, then
fetch files using the urls from the manifests and process them.
Besides passing around lists of files, I am not really accessing the disk.
It should be RAM, network, and CPU (unzip, parse xml, extract attributes).


So is my only choice to break the input file and submit multiple files (if 
I have 15 cores, I should split the file with urls to 15 files? also how 
does it look in code?)? The two drawbacks are - some cores might finish 
early and stay idle, and I don’t know how to deal with dynamically 
increasing/decreasing cores.


Thx
- Sky

-Original Message- From: Michael Segel
Sent: Thursday, April 19, 2012 8:49 PM
To: common-user@hadoop.apache.org
Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce 
implementation


How 'large' or rather in this case small is your file?

If you're on a default system, the block sizes are 64MB. So if your file 
~<= 64MB, you end up with 1 block, and you will only have 1 mapper.



On Apr 19, 2012, at 10:10 PM, Sky wrote:

Thanks for your reply.  After I sent my email, I found a fundamental 
defect - in my understanding of how MR is distributed. I discovered that 
even though I was firing off 15 COREs, the map job - which is the most 
expensive part of my processing was run only on 1 core.


To start my map job, I was creating a single file with following data:
1 storage:/root/1.manif.txt
2 storage:/root/2.manif.txt
3 storage:/root/3.manif.txt
...
4000 storage:/root/4000.manif.txt

I thought that each of the available COREs will be assigned a map job 
from top down from the same file one at a time, and as soon as one CORE 
is done, it would get the next map job. However, it looks like I need to
split the file into a number of smaller files myself. Now while that’s clearly trivial
to do, I am not sure how I can detect at runtime how many splits I need 
to do, and also to deal with adding new CORES at runtime. Any 
suggestions? (it doesn't have to be a file, it can be a list, etc).


This all would be much easier to debug, if somehow I could get my log4j 
logs for my mappers and reducers. I can see log4j for my main launcher, 
but not sure how to enable it for mappers and reducers.


Thx
- Akash


-Original Message- From: Robert Evans
Sent: Thursday, April 19, 2012 2:08 PM
To: common-user@hadoop.apache.org
Subject: Re: Help me with architecture of a somewhat non-trivial 
mapreduce implementation


From what I can see your implementation seems OK, especially from a
performance perspective. Depending on what storage: is, it is likely to be
your bottleneck, not the hadoop computations.


Because you are writing files directly instead of relying on Hadoop to do 
it for you, you may need to deal with error cases that Hadoop will 
normally hide from you, and you will not be able to turn on speculative 
execution. Just be aware that a map or reduce task may have problems in 
the middle, and be relaunched.  So when you are writing out your updated 
manifest be careful to not replace the old one until the new one is 
completely ready and will not fail, or you may lose data.  You may also 
need to be careful in your reduce if you are writing directly to the file 
there too, but because it is not a read modify write, but just a write it 
is not as critical.
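
To make that last point concrete, a small sketch of the write-then-swap idea
(assuming the store is reachable through a Hadoop FileSystem implementation;
the class name, paths and attempt-id handling here are made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ManifestWriter {
  /** Write the rebuilt manifest to a temp file, then swap it in. */
  public static void replaceManifest(Configuration conf, Path manifest,
                                     byte[] updatedContent, String attemptId)
      throws IOException {
    FileSystem fs = manifest.getFileSystem(conf);
    // Include the task attempt id so re-launched attempts never collide.
    Path tmp = new Path(manifest.getParent(),
        manifest.getName() + "." + attemptId + ".tmp");
    FSDataOutputStream out = fs.create(tmp, true);
    try {
      out.write(updatedContent);
    } finally {
      out.close();
    }
    // Only replace the old manifest once the new one is completely written.
    fs.delete(manifest, false);
    fs.rename(tmp, manifest);
  }
}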


--Bobby Evans

On 4/18/12 4:56 PM, "Sky USC"  wrote:




Please help me architect the design of my first significant MR task 
beyond "word count". My program works well. but I am trying to optimize 
performance to maximize use of available computing resources. I have 3 
questions at the bottom.


Project description in an abstract sense (written in java):
* I have MM number of MANIFEST files available on 
storage:/root/1.manif.txt to 4000.manif.txt
  * Each MANIFEST in turn contains a variable number "EE" of URLs to 
EBOOKS (range could be 1 - 50,000 EBOOKS urls per MANIFEST) -- stored 
on storage:/root/1.manif/1223.folder/5443.Ebook.ebk

So we are talking about millions of ebooks

My task is to:
1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: 
publisher, year, ebook-version).
2. Update each of the EBOOK entry record in the manifest - with the 3 
attributes (eg: ebook 1334 -> publisher=aaa year=bbb, ebook-version=2.01)
3. Create a output file such that the named 
"__"  contains a list of all "ebook urls" 
that met that criteria.

example:
File "st

Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

2012-04-20 Thread Robert Evans
You could also use the NLineInputFormat which will launch 1 mapper for every N 
(configurable) lines of input.
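
A rough sketch of wiring that up with the old (mapred) API - the property name
below is the one used in the 1.x-era releases, so double-check it against your
version:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineSetup {
  public static JobConf configure(JobConf conf) {
    conf.setInputFormat(NLineInputFormat.class);
    // One map task per N lines of the input file; with 1 line per split,
    // every manifest URL line becomes its own map task.
    conf.setInt("mapred.line.input.format.linespermap", 1);
    return conf;
  }
}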


On 4/20/12 9:48 AM, "Sky"  wrote:

Thanks! That helped!



-Original Message-
From: Michael Segel
Sent: Thursday, April 19, 2012 9:38 PM
To: common-user@hadoop.apache.org
Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce
implementation

If the file is small enough you could read it in to a java object like a
list and write your own input format that takes a list object as its input
and then lets you specify the number of mappers.

On Apr 19, 2012, at 11:34 PM, Sky wrote:

> My file for the input to mapper is very small - as all it has is urls to
> list of manifests. The task for mappers is to fetch each manifest, and
> then fetch files using urls from the manifests and then process them.
> Besides passing around lists of files, I am not really accessing the disk.
> It should be RAM, network, and CPU (unzip, parsexml,extract attributes).
>
> So is my only choice to break the input file and submit multiple files (if
> I have 15 cores, I should split the file with urls to 15 files? also how
> does it look in code?)? The two drawbacks are - some cores might finish
> early and stay idle, and I don't know how to deal with dynamically
> increasing/decreasing cores.
>
> Thx
> - Sky
>
> -Original Message- From: Michael Segel
> Sent: Thursday, April 19, 2012 8:49 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce
> implementation
>
> How 'large' or rather in this case small is your file?
>
> If you're on a default system, the block sizes are 64MB. So if your file
> ~<= 64MB, you end up with 1 block, and you will only have 1 mapper.
>
>
> On Apr 19, 2012, at 10:10 PM, Sky wrote:
>
>> Thanks for your reply.  After I sent my email, I found a fundamental
>> defect - in my understanding of how MR is distributed. I discovered that
>> even though I was firing off 15 COREs, the map job - which is the most
>> expensive part of my processing was run only on 1 core.
>>
>> To start my map job, I was creating a single file with following data:
>> 1 storage:/root/1.manif.txt
>> 2 storage:/root/2.manif.txt
>> 3 storage:/root/3.manif.txt
>> ...
>> 4000 storage:/root/4000.manif.txt
>>
>> I thought that each of the available COREs will be assigned a map job
>> from top down from the same file one at a time, and as soon as one CORE
>> is done, it would get the next map job. However, it looks like I need to
>> split the file into the number of times. Now while that's clearly trivial
>> to do, I am not sure how I can detect at runtime how many splits I need
>> to do, and also to deal with adding new CORES at runtime. Any
>> suggestions? (it doesn't have to be a file, it can be a list, etc).
>>
>> This all would be much easier to debug, if somehow I could get my log4j
>> logs for my mappers and reducers. I can see log4j for my main launcher,
>> but not sure how to enable it for mappers and reducers.
>>
>> Thx
>> - Akash
>>
>>
>> -Original Message- From: Robert Evans
>> Sent: Thursday, April 19, 2012 2:08 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Help me with architecture of a somewhat non-trivial
>> mapreduce implementation
>>
>> From what I can see your implementation seems OK, especially from a
>> performance perspective. Depending on what storage: is it is likely to be
>> your bottlekneck, not the hadoop computations.
>>
>> Because you are writing files directly instead of relying on Hadoop to do
>> it for you, you may need to deal with error cases that Hadoop will
>> normally hide from you, and you will not be able to turn on speculative
>> execution. Just be aware that a map or reduce task may have problems in
>> the middle, and be relaunched.  So when you are writing out your updated
>> manifest be careful to not replace the old one until the new one is
>> completely ready and will not fail, or you may lose data.  You may also
>> need to be careful in your reduce if you are writing directly to the file
>> there too, but because it is not a read modify write, but just a write it
>> is not as critical.
>>
>> --Bobby Evans
>>
>> On 4/18/12 4:56 PM, "Sky USC"  wrote:
>>
>>
>>
>>
>> Please help me architect the design of my first significant MR task
>> beyond "word count". My program works well. but I am trying to optimize
>> performance to maximize use of available computing resources. I have 3
>> questions at the bottom.
>>
>> Project description in an abstract sense (written in java):
>> * I have MM number of MANIFEST files available on
>> storage:/root/1.manif.txt to 4000.manif.txt
>>   * Each MANIFEST in turn contains varilable number "EE" of URLs to
>> EBOOKS (range could be 1 - 50,000 EBOOKS urls per MANIFEST) -- stored
>> on storage:/root/1.manif/1223.folder/5443.Ebook.ebk
>> So we are talking about millions of ebooks
>>
>> My task is to:
>> 1. Fetch each ebook, and obtain a set of 3 attr

How to add debugging to map- red code

2012-04-20 Thread Mapred Learn
Hi,
I'm trying to find out the best way to add debugging to map-reduce code.
I have System.out.println() statements that I keep commenting and
uncommenting so as not to increase the stdout size.

But the problem is that any time I need to debug, I have to re-compile.

Is there a way I can define log levels using log4j in map-reduce code and set
the log level as a conf option?

Thanks,
JJ

Sent from my iPhone

Re: remote job submission

2012-04-20 Thread Robert Evans
You can use Oozie to do it.
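
For example, a sketch using the Oozie Java client (which wraps Oozie's REST
endpoint); the URL, application path and property names are placeholders, so
check the Oozie docs for your release:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitViaOozie {
  public static void main(String[] args) throws Exception {
    // Oozie listens on port 11000 by default; this client wraps its REST API.
    OozieClient oozie =
        new OozieClient("http://oozie-host.example.com:11000/oozie");

    Properties props = oozie.createConfiguration();
    props.setProperty(OozieClient.APP_PATH,
        "hdfs://namenode.example.com:8020/user/me/my-workflow");
    props.setProperty("jobTracker", "jobtracker.example.com:8021");
    props.setProperty("nameNode", "hdfs://namenode.example.com:8020");

    String jobId = oozie.run(props);   // submit and start the workflow
    System.out.println("submitted workflow " + jobId);
  }
}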


On 4/20/12 8:45 AM, "Arindam Choudhury"  wrote:

Sorry. But I can you give me a example.

On Fri, Apr 20, 2012 at 3:08 PM, Harsh J  wrote:

> Arindam,
>
> If your machine can access the clusters' NN/JT/DN ports, then you can
> simply run your job from the machine itself.
>
> On Fri, Apr 20, 2012 at 6:31 PM, Arindam Choudhury
>  wrote:
> > "If you are allowed a remote connection to the cluster's service ports,
> > then you can directly submit your jobs from your local CLI. Just make
> > sure your local configuration points to the right locations."
> >
> > Can you elaborate in details please?
> >
> > On Fri, Apr 20, 2012 at 2:20 PM, Harsh J  wrote:
> >
> >> If you are allowed a remote connection to the cluster's service ports,
> >> then you can directly submit your jobs from your local CLI. Just make
> >> sure your local configuration points to the right locations.
> >>
> >> Otherwise, perhaps you can choose to use Apache Oozie (Incubating)
> >> (http://incubator.apache.org/oozie/) It does provide a REST interface
> >> that launches jobs up for you over the supplied clusters, but its more
> >> oriented towards workflow management or perhaps HUE:
> >> https://github.com/cloudera/hue
> >>
> >> On Fri, Apr 20, 2012 at 5:37 PM, Arindam Choudhury
> >>  wrote:
> >> > Hi,
> >> >
> >> > Do hadoop have any web service or other interface so I can submit jobs
> >> from
> >> > remote machine?
> >> >
> >> > Thanks,
> >> > Arindam
> >>
> >>
> >>
> >> --
> >> Harsh J
> >>
>
>
>
> --
> Harsh J
>



Re: How to add debugging to map- red code

2012-04-20 Thread Harsh J
Yes, this is possible, and there are two ways to do it.

1. Use a distro/release that carries the
https://issues.apache.org/jira/browse/MAPREDUCE-336 fix. This will let
you avoid work (see 2, which is the same as your idea).

2. Configure your implementation's logger object's level in the
setup/setConf methods of the task, by looking at some conf prop to
decide the level. This will work just as well - and will also avoid
changing Hadoop's own Child log levels, unlike method (1).
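
A sketch of (2), using a made-up property name ("my.map.log.level") that you
would pass in at submit time:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class DebuggableMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final Logger LOG = Logger.getLogger(DebuggableMapper.class);

  @Override
  protected void setup(Context context) {
    // Read the desired level from the job conf instead of recompiling.
    String level = context.getConfiguration().get("my.map.log.level", "INFO");
    LOG.setLevel(Level.toLevel(level));
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    LOG.debug("record: " + value);  // emitted only when the level allows it
    context.write(new Text(value), new LongWritable(1));
  }
}

Then something like "hadoop jar myjob.jar MyDriver -D my.map.log.level=DEBUG
in out" (assuming the driver goes through ToolRunner/GenericOptionsParser)
flips the level without recompiling.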

On Fri, Apr 20, 2012 at 8:47 PM, Mapred Learn  wrote:
> Hi,
> I m trying to find out best way to add debugging in map- red code.
> I have System.out.println() statements that I keep on commenting and 
> uncommenting so as not to increase stdout size
>
> But problem is anytime I need debug, I Hv to re-compile.
>
> If there a way, I can define log levels using log4j in map-red code and 
> define log level as conf option ?
>
> Thanks,
> JJ
>
> Sent from my iPhone



-- 
Harsh J


RE: remote job submission

2012-04-20 Thread Amith D K
I don't know your use case; if it is just for testing and ssh across the
machines is available, then you can write a script that uses ssh to run the
jobs through the CLI. You can check ssh usage.

Or else use Oozie.

From: Robert Evans [ev...@yahoo-inc.com]
Sent: Friday, April 20, 2012 11:17 PM
To: common-user@hadoop.apache.org
Subject: Re: remote job submission

You can use Oozie to do it.


On 4/20/12 8:45 AM, "Arindam Choudhury"  wrote:

Sorry. But I can you give me a example.

On Fri, Apr 20, 2012 at 3:08 PM, Harsh J  wrote:

> Arindam,
>
> If your machine can access the clusters' NN/JT/DN ports, then you can
> simply run your job from the machine itself.
>
> On Fri, Apr 20, 2012 at 6:31 PM, Arindam Choudhury
>  wrote:
> > "If you are allowed a remote connection to the cluster's service ports,
> > then you can directly submit your jobs from your local CLI. Just make
> > sure your local configuration points to the right locations."
> >
> > Can you elaborate in details please?
> >
> > On Fri, Apr 20, 2012 at 2:20 PM, Harsh J  wrote:
> >
> >> If you are allowed a remote connection to the cluster's service ports,
> >> then you can directly submit your jobs from your local CLI. Just make
> >> sure your local configuration points to the right locations.
> >>
> >> Otherwise, perhaps you can choose to use Apache Oozie (Incubating)
> >> (http://incubator.apache.org/oozie/) It does provide a REST interface
> >> that launches jobs up for you over the supplied clusters, but its more
> >> oriented towards workflow management or perhaps HUE:
> >> https://github.com/cloudera/hue
> >>
> >> On Fri, Apr 20, 2012 at 5:37 PM, Arindam Choudhury
> >>  wrote:
> >> > Hi,
> >> >
> >> > Do hadoop have any web service or other interface so I can submit jobs
> >> from
> >> > remote machine?
> >> >
> >> > Thanks,
> >> > Arindam
> >>
> >>
> >>
> >> --
> >> Harsh J
> >>
>
>
>
> --
> Harsh J
>



Re: Accessing global Counters

2012-04-20 Thread Jagat
Hi

You can create your own counters like

enum CountFruits {
Apple,
Mango,
Banana
}


And in your mapper class, when you see the condition to increment, you can use
the Reporter incrCounter method to do so.

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html#incrCounter(java.lang.Enum,%20long)

e.g
// I saw Apple increment it by one
reporter.incrCounter(CountFruits.Apple,1);

Now you can access them using job.getCounters

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#getCounters()
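
As a sketch, reading the totals back in the driver after the job finishes
(old mapred API; the enum is re-declared here only to keep the example
self-contained - in a real job the mapper and driver must share the same
enum class):

import java.io.IOException;
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class FruitDriver {
  enum CountFruits { Apple, Mango, Banana }

  public static void runAndReport(JobConf conf) throws IOException {
    RunningJob job = JobClient.runJob(conf);              // blocks until done
    Counters counters = job.getCounters();
    long apples = counters.getCounter(CountFruits.Apple); // summed over all tasks
    System.out.println("Apples counted: " + apples);
  }
}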

Hope this helps

Regards,

Jagat Singh


On Fri, Apr 20, 2012 at 9:43 PM, Gayatri Rao  wrote:

> Hi All,
>
> Is there a way for me to set global counters in Mapper and access them from
> reducer?
> Could you suggest how I can acheve this?
>
> Thanks
> Gayatri
>


RE: Accessing global Counters

2012-04-20 Thread Amith D K
Yes, you can use a user-defined counter as Jagat suggested.

A counter can be an enum as Jagat described, or any string; the latter are
called dynamic counters.

It is easier to use enum counters than dynamic counters; in the end it depends
on your use case :)
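
For reference, a dynamic counter needs no enum at all - you just name the
group and counter with strings (the names below are arbitrary):

// old mapred API, inside map() or reduce()
reporter.incrCounter("Fruits", "Apple", 1);

// new mapreduce API equivalent
context.getCounter("Fruits", "Apple").increment(1);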

Amith

From: Jagat [jagatsi...@gmail.com]
Sent: Saturday, April 21, 2012 12:25 AM
To: common-user@hadoop.apache.org
Subject: Re: Accessing global Counters

Hi

You can create your own counters like

enum CountFruits {
Apple,
Mango,
Banana
}


And in your mapper class when you see condition to increment , you can use
Reporter incrCounter method to do the same.

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html#incrCounter(java.lang.Enum,%20long)

e.g
// I saw Apple increment it by one
reporter.incrCounter(CountFruits.Apple,1);

Now you can access them using job.getCounters

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#getCounters()

Hope this helps

Regards,

Jagat Singh


On Fri, Apr 20, 2012 at 9:43 PM, Gayatri Rao  wrote:

> Hi All,
>
> Is there a way for me to set global counters in Mapper and access them from
> reducer?
> Could you suggest how I can acheve this?
>
> Thanks
> Gayatri
>


Re: Accessing global Counters

2012-04-20 Thread Robert Evans
There was a discussion about this several months ago

http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201112.mbox/%3CCADYHM8xiw8_bF=zqe-bagdfz6r3tob0aof9viozgtzeqgkp...@mail.gmail.com%3E

The conclusion is that if you want to read them from the reducer you are going 
to have to do something special until someone finds time to implement it as 
part of.

https://issues.apache.org/jira/browse/MAPREDUCE-3520

--Bobby Evans


On 4/20/12 11:36 AM, "Amith D K"  wrote:

Yes U can use user defined counter as Jagat suggeted.

Counter can be enum as Jagat described or any string which are called dynamic 
counters.

It is easier to use Enum counter than dynamic counters, finally it depends on 
your use case :)

Amith

From: Jagat [jagatsi...@gmail.com]
Sent: Saturday, April 21, 2012 12:25 AM
To: common-user@hadoop.apache.org
Subject: Re: Accessing global Counters

Hi

You can create your own counters like

enum CountFruits {
Apple,
Mango,
Banana
}


And in your mapper class when you see condition to increment , you can use
Reporter incrCounter method to do the same.

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html#incrCounter(java.lang.Enum,%20long)

e.g
// I saw Apple increment it by one
reporter.incrCounter(CountFruits.Apple,1);

Now you can access them using job.getCounters

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#getCounters()

Hope this helps

Regards,

Jagat Singh


On Fri, Apr 20, 2012 at 9:43 PM, Gayatri Rao  wrote:

> Hi All,
>
> Is there a way for me to set global counters in Mapper and access them from
> reducer?
> Could you suggest how I can acheve this?
>
> Thanks
> Gayatri
>



Re: How to add debugging to map- red code

2012-04-20 Thread Mark question
I'm interested in this too, but could you tell me where to apply the patch,
and is the following the right command to apply it:

 
patch < MAPREDUCE-336_0_20090818.patch

Thank you,
Mark

On Fri, Apr 20, 2012 at 8:28 AM, Harsh J  wrote:

> Yes this is possible, and there's two ways to do this.
>
> 1. Use a distro/release that carries the
> https://issues.apache.org/jira/browse/MAPREDUCE-336 fix. This will let
> you avoid work (see 2, which is same as your idea)
>
> 2. Configure your implementation's logger object's level in the
> setup/setConf methods of the task, by looking at some conf prop to
> decide the level. This will work just as well - and will also avoid
> changing Hadoop's own Child log levels, unlike the (1) method.
>
> On Fri, Apr 20, 2012 at 8:47 PM, Mapred Learn 
> wrote:
> > Hi,
> > I m trying to find out best way to add debugging in map- red code.
> > I have System.out.println() statements that I keep on commenting and
> uncommenting so as not to increase stdout size
> >
> > But problem is anytime I need debug, I Hv to re-compile.
> >
> > If there a way, I can define log levels using log4j in map-red code and
> define log level as conf option ?
> >
> > Thanks,
> > JJ
> >
> > Sent from my iPhone
>
>
>
> --
> Harsh J
>


RE: Hive Thrift help

2012-04-20 Thread Michael Wang
Thanks folks for your help.
I tried to use Hive to analyze the apache log. It is fine if I just do
select * from apachelog, and I get the results. But if I do anything like
count or group by, it just shows the "map = 0%,  reduce = 0%" message again
and again, endlessly. I had to stop it. Any ideas? Thank you!


CREATE TABLE apachelog (
ipaddress STRING, 
identd STRING, 
user STRING,
finishtime STRING,
requestline string, 
returncode INT, 
size INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
WITH SERDEPROPERTIES (
'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
'quote.delim'='("|\\[|\\])',
'field.delim'=' ',
'serialization.null.format'='-')
STORED AS TEXTFILE;


hive> select count(*) from apachelog;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Execution log at: 
/tmp/mwang/mwang_20120420141818_2208abfb-1840-49f4-b0b4-0d44d90fc121.log
Job running in-process (local Hadoop)
2012-04-20 14:19:06,691 null map = 0%,  reduce = 0%
2012-04-20 14:20:06,882 null map = 0%,  reduce = 0%
2012-04-20 14:21:07,073 null map = 0%,  reduce = 0%
2012-04-20 14:22:07,258 null map = 0%,  reduce = 0%
2012-04-20 14:23:07,418 null map = 0%,  reduce = 0%
2012-04-20 14:24:07,579 null map = 0%,  reduce = 0%
2012-04-20 14:25:07,738 null map = 0%,  reduce = 0%
2012-04-20 14:26:07,903 null map = 0%,  reduce = 0%
2012-04-20 14:27:08,075 null map = 0%,  reduce = 0%
2012-04-20 14:28:08,241 null map = 0%,  reduce = 0%
2012-04-20 14:29:08,397 null map = 0%,  reduce = 0%
2012-04-20 14:30:08,552 null map = 0%,  reduce = 0%
2012-04-20 14:31:08,712 null map = 0%,  reduce = 0%
2012-04-20 14:32:08,893 null map = 0%,  reduce = 0%
2012-04-20 14:33:09,056 null map = 0%,  reduce = 0%
2012-04-20 14:34:09,223 null map = 0%,  reduce = 0%
2012-04-20 14:35:09,396 null map = 0%,  reduce = 0%


-Original Message-
From: Edward Capriolo [mailto:edlinuxg...@gmail.com] 
Sent: Monday, April 16, 2012 4:55 PM
To: common-user@hadoop.apache.org
Subject: Re: Hive Thrift help

You can NOT connect to hive thrift to confirm its status. Thrift is
thrift, not http. But you are right to say HiveServer does not produce
any output by default.

If

netstat -nl | grep 10000

shows the port listening, it is up.


On Mon, Apr 16, 2012 at 5:18 PM, Rahul Jain  wrote:
> I am assuming you read thru:
>
> https://cwiki.apache.org/Hive/hiveserver.html
>
> The server comes up on port 10,000 by default, did you verify that it is
> actually listening on the port ?  You can also connect to hive server using
> web browser to confirm its status.
>
> -Rahul
>
> On Mon, Apr 16, 2012 at 1:53 PM, Michael Wang 
> wrote:
>
>> we need to connect to HIVE from Microstrategy reports, and it requires the
>> Hive Thrift server. But I
>> tried to start it, and it just hangs as below.
>> # hive --service hiveserver
>> Starting Hive Thrift Server
>> Any ideas?
>> Thanks,
>> Michael
>>
>>




Re: Accessing global Counters

2012-04-20 Thread Michel Segel
Actually it's easier to use dynamic counters...

Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 20, 2012, at 11:36 AM, Amith D K  wrote:

> Yes U can use user defined counter as Jagat suggeted.
> 
> Counter can be enum as Jagat described or any string which are called dynamic 
> counters.
> 
> It is easier to use Enum counter than dynamic counters, finally it depends on 
> your use case :)
> 
> Amith
> 
> From: Jagat [jagatsi...@gmail.com]
> Sent: Saturday, April 21, 2012 12:25 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Accessing global Counters
> 
> Hi
> 
> You can create your own counters like
> 
> enum CountFruits {
> Apple,
> Mango,
> Banana
> }
> 
> 
> And in your mapper class when you see condition to increment , you can use
> Reporter incrCounter method to do the same.
> 
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html#incrCounter(java.lang.Enum,%20long)
> 
> e.g
> // I saw Apple increment it by one
> reporter.incrCounter(CountFruits.Apple,1);
> 
> Now you can access them using job.getCounters
> 
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#getCounters()
> 
> Hope this helps
> 
> Regards,
> 
> Jagat Singh
> 
> 
> On Fri, Apr 20, 2012 at 9:43 PM, Gayatri Rao  wrote:
> 
>> Hi All,
>> 
>> Is there a way for me to set global counters in Mapper and access them from
>> reducer?
>> Could you suggest how I can acheve this?
>> 
>> Thanks
>> Gayatri
>> 
> 


Feedback on real world production experience with Flume

2012-04-20 Thread Karl Hennig
I am investigating automated methods of moving our data from the web tier into 
HDFS for processing, a process that's performed periodically.

I am looking for feedback from anyone who has actually used Flume in a 
production setup (redundant, failover) successfully.  I understand it is now 
being largely rearchitected during its incubation as Apache Flume-NG, so I 
don't have full confidence in the old, stable releases.

The other option would be to write our own tools.  What methods are you using 
for these kinds of tasks?  Did you write your own or does Flume (or something 
else) work for you?

I'm also on the Flume mailing list, but I wanted to ask these questions here 
because I'm interested in Flume _and_ alternatives.

Thank you!



Re: Accessing global Counters

2012-04-20 Thread JAX
No, reducers can't access mapper counters.
---> Maybe there's a way to put counters into the distributed cache as an
intermediate step???

Jay Vyas 
MMSB
UCHC

On Apr 20, 2012, at 1:24 PM, Robert Evans  wrote:

> There was a discussion about this several months ago
> 
> http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201112.mbox/%3CCADYHM8xiw8_bF=zqe-bagdfz6r3tob0aof9viozgtzeqgkp...@mail.gmail.com%3E
> 
> The conclusion is that if you want to read them from the reducer you are 
> going to have to do something special until someone finds time to implement 
> it as part of.
> 
> https://issues.apache.org/jira/browse/MAPREDUCE-3520
> 
> --Bobby Evans
> 
> 
> On 4/20/12 11:36 AM, "Amith D K"  wrote:
> 
> Yes U can use user defined counter as Jagat suggeted.
> 
> Counter can be enum as Jagat described or any string which are called dynamic 
> counters.
> 
> It is easier to use Enum counter than dynamic counters, finally it depends on 
> your use case :)
> 
> Amith
> 
> From: Jagat [jagatsi...@gmail.com]
> Sent: Saturday, April 21, 2012 12:25 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Accessing global Counters
> 
> Hi
> 
> You can create your own counters like
> 
> enum CountFruits {
> Apple,
> Mango,
> Banana
> }
> 
> 
> And in your mapper class when you see condition to increment , you can use
> Reporter incrCounter method to do the same.
> 
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html#incrCounter(java.lang.Enum,%20long)
> 
> e.g
> // I saw Apple increment it by one
> reporter.incrCounter(CountFruits.Apple,1);
> 
> Now you can access them using job.getCounters
> 
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#getCounters()
> 
> Hope this helps
> 
> Regards,
> 
> Jagat Singh
> 
> 
> On Fri, Apr 20, 2012 at 9:43 PM, Gayatri Rao  wrote:
> 
>> Hi All,
>> 
>> Is there a way for me to set global counters in Mapper and access them from
>> reducer?
>> Could you suggest how I can acheve this?
>> 
>> Thanks
>> Gayatri
>> 
> 


Re: remote job submission

2012-04-20 Thread JAX
Re: Arindam's question on "how to submit a job remotely".

Here are my follow-up questions - hope this helps to guide the discussion:

1) Normally - what is the "job client"? Do you guys typically use the namenode
as the client?

2) In the case where the client != name node, how does the client know how
to start up the task trackers?

UCHC

On Apr 20, 2012, at 11:19 AM, Amith D K  wrote:

> I dont know your use case if its for test and
> ssh across the machine are disabled then u write a script that can do ssh run 
> the jobs using cli for running your jobs. U can check ssh usage.
> 
> Or else use Ooze
> 
> From: Robert Evans [ev...@yahoo-inc.com]
> Sent: Friday, April 20, 2012 11:17 PM
> To: common-user@hadoop.apache.org
> Subject: Re: remote job submission
> 
> You can use Oozie to do it.
> 
> 
> On 4/20/12 8:45 AM, "Arindam Choudhury"  wrote:
> 
> Sorry. But I can you give me a example.
> 
> On Fri, Apr 20, 2012 at 3:08 PM, Harsh J  wrote:
> 
>> Arindam,
>> 
>> If your machine can access the clusters' NN/JT/DN ports, then you can
>> simply run your job from the machine itself.
>> 
>> On Fri, Apr 20, 2012 at 6:31 PM, Arindam Choudhury
>>  wrote:
>>> "If you are allowed a remote connection to the cluster's service ports,
>>> then you can directly submit your jobs from your local CLI. Just make
>>> sure your local configuration points to the right locations."
>>> 
>>> Can you elaborate in details please?
>>> 
>>> On Fri, Apr 20, 2012 at 2:20 PM, Harsh J  wrote:
>>> 
 If you are allowed a remote connection to the cluster's service ports,
 then you can directly submit your jobs from your local CLI. Just make
 sure your local configuration points to the right locations.
 
 Otherwise, perhaps you can choose to use Apache Oozie (Incubating)
 (http://incubator.apache.org/oozie/) It does provide a REST interface
 that launches jobs up for you over the supplied clusters, but its more
 oriented towards workflow management or perhaps HUE:
 https://github.com/cloudera/hue
 
 On Fri, Apr 20, 2012 at 5:37 PM, Arindam Choudhury
  wrote:
> Hi,
> 
> Do hadoop have any web service or other interface so I can submit jobs
 from
> remote machine?
> 
> Thanks,
> Arindam
 
 
 
 --
 Harsh J
 
>> 
>> 
>> 
>> --
>> Harsh J
>> 
> 


Reporter vs context

2012-04-20 Thread JAX
Hi guys: I notice that there's been some chatter about the "Reporter" in the
context of counters. Forgive my ignorance here, as I've never seen Reporters
used in real code.

What is the difference between the use of the Context and Reporter objects,
and how are they related? Is there any overlap in their functionality?


Jay Vyas 
MMSB
UCHC

Re: remote job submission

2012-04-20 Thread Harsh J
Hi,

A JobClient is something that facilitates validating your job
configuration and shipping necessities to the cluster and notifying
the JobTracker of that new job. Afterwards, its responsibility may
merely be to monitor progress via reports from
JobTracker(MR1)/ApplicationMaster(MR2).

A client need not concern itself with, nor be aware of, TaskTrackers
(or NodeManagers). These are non-permanent members of a cluster and do
not carry (critical) persistent state. The scheduling
of job and its tasks is taken care of from the JobTracker in MR1 (or
the MR Application's ApplicationMaster in MR2). The only thing a
JobClient running user needs to ensure is that he has access to the
NameNode (For creating staging files - job jar, job xml, etc.), the
DataNodes (for actually writing the previous files to DFS for the
JobTracker to pick up) and the JobTracker/Scheduler (for the protocol
communication required to notify the cluster of a job whose resources
are now ready to launch - and also for monitoring progress).
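
A small illustration of that client-side flow (old mapred API; submitJob()
returns immediately and the client then just polls for progress reports):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitAndWatch {
  public static void submitAndWatch(JobConf conf) throws Exception {
    JobClient client = new JobClient(conf);      // talks to the JobTracker
    RunningJob running = client.submitJob(conf); // stages job files, notifies JT
    while (!running.isComplete()) {
      System.out.printf("map %.0f%% reduce %.0f%%%n",
          running.mapProgress() * 100, running.reduceProgress() * 100);
      Thread.sleep(5000);
    }
    System.out.println(running.isSuccessful() ? "job succeeded" : "job failed");
  }
}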

On Sat, Apr 21, 2012 at 5:36 AM, JAX  wrote:
> RE anirunds question on "how to submit a job remotely".
>
> Here are my follow up questions - hope this helps to guide the discussion:
>
> 1) Normally - what is the "job client"? Do you guys typically use the 
> namenode as the client?
>
> 2) In the case where the client != name node  how does the client know 
> how to start up the task trackers ?
>
> UCHC
>
> On Apr 20, 2012, at 11:19 AM, Amith D K  wrote:
>
>> I dont know your use case if its for test and
>> ssh across the machine are disabled then u write a script that can do ssh 
>> run the jobs using cli for running your jobs. U can check ssh usage.
>>
>> Or else use Ooze
>> 
>> From: Robert Evans [ev...@yahoo-inc.com]
>> Sent: Friday, April 20, 2012 11:17 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: remote job submission
>>
>> You can use Oozie to do it.
>>
>>
>> On 4/20/12 8:45 AM, "Arindam Choudhury"  wrote:
>>
>> Sorry. But I can you give me a example.
>>
>> On Fri, Apr 20, 2012 at 3:08 PM, Harsh J  wrote:
>>
>>> Arindam,
>>>
>>> If your machine can access the clusters' NN/JT/DN ports, then you can
>>> simply run your job from the machine itself.
>>>
>>> On Fri, Apr 20, 2012 at 6:31 PM, Arindam Choudhury
>>>  wrote:
 "If you are allowed a remote connection to the cluster's service ports,
 then you can directly submit your jobs from your local CLI. Just make
 sure your local configuration points to the right locations."

 Can you elaborate in details please?

 On Fri, Apr 20, 2012 at 2:20 PM, Harsh J  wrote:

> If you are allowed a remote connection to the cluster's service ports,
> then you can directly submit your jobs from your local CLI. Just make
> sure your local configuration points to the right locations.
>
> Otherwise, perhaps you can choose to use Apache Oozie (Incubating)
> (http://incubator.apache.org/oozie/) It does provide a REST interface
> that launches jobs up for you over the supplied clusters, but its more
> oriented towards workflow management or perhaps HUE:
> https://github.com/cloudera/hue
>
> On Fri, Apr 20, 2012 at 5:37 PM, Arindam Choudhury
>  wrote:
>> Hi,
>>
>> Do hadoop have any web service or other interface so I can submit jobs
> from
>> remote machine?
>>
>> Thanks,
>> Arindam
>
>
>
> --
> Harsh J
>
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>



-- 
Harsh J


Re: Accessing global Counters

2012-04-20 Thread Harsh J
Currently the DistributedCache is populated pre-Job run, hence both
Map and Reduce phases carry the same items. With MR2, the approach
Robert describes above should work better instead.

On Sat, Apr 21, 2012 at 5:21 AM, JAX  wrote:
> No reducers can't access mapper counters.
> ---> maybe theres a way to intermediately put counters in the distributed 
> cache???
>
> Jay Vyas
> MMSB
> UCHC
>
> On Apr 20, 2012, at 1:24 PM, Robert Evans  wrote:
>
>> There was a discussion about this several months ago
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201112.mbox/%3CCADYHM8xiw8_bF=zqe-bagdfz6r3tob0aof9viozgtzeqgkp...@mail.gmail.com%3E
>>
>> The conclusion is that if you want to read them from the reducer you are 
>> going to have to do something special until someone finds time to implement 
>> it as part of.
>>
>> https://issues.apache.org/jira/browse/MAPREDUCE-3520
>>
>> --Bobby Evans
>>
>>
>> On 4/20/12 11:36 AM, "Amith D K"  wrote:
>>
>> Yes U can use user defined counter as Jagat suggeted.
>>
>> Counter can be enum as Jagat described or any string which are called 
>> dynamic counters.
>>
>> It is easier to use Enum counter than dynamic counters, finally it depends 
>> on your use case :)
>>
>> Amith
>> 
>> From: Jagat [jagatsi...@gmail.com]
>> Sent: Saturday, April 21, 2012 12:25 AM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Accessing global Counters
>>
>> Hi
>>
>> You can create your own counters like
>>
>> enum CountFruits {
>> Apple,
>> Mango,
>> Banana
>> }
>>
>>
>> And in your mapper class when you see condition to increment , you can use
>> Reporter incrCounter method to do the same.
>>
>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html#incrCounter(java.lang.Enum,%20long)
>>
>> e.g
>> // I saw Apple increment it by one
>> reporter.incrCounter(CountFruits.Apple,1);
>>
>> Now you can access them using job.getCounters
>>
>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#getCounters()
>>
>> Hope this helps
>>
>> Regards,
>>
>> Jagat Singh
>>
>>
>> On Fri, Apr 20, 2012 at 9:43 PM, Gayatri Rao  wrote:
>>
>>> Hi All,
>>>
>>> Is there a way for me to set global counters in Mapper and access them from
>>> reducer?
>>> Could you suggest how I can acheve this?
>>>
>>> Thanks
>>> Gayatri
>>>
>>



-- 
Harsh J


Re: Reporter vs context

2012-04-20 Thread Harsh J
Context is what the new MR API offers, and it wraps over a Reporter
object, and provides other helpful functions and data you'd require
within a task (lives up to its name).

Reporter was the raw object provided in the old MR API that lets one
report progress, set status, etc. In the new API, you access the same
reporting facilities via the Context wrapper class instead.
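
A rough side-by-side of the two APIs, trimmed down to the reporting calls
(the group/counter names are just examples):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Old API: the Reporter arrives as an argument to map().
class OldApiMapper extends org.apache.hadoop.mapred.MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
                  org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> out,
                  org.apache.hadoop.mapred.Reporter reporter) throws IOException {
    reporter.setStatus("working on " + key);
    reporter.incrCounter("Demo", "Records", 1);
  }
}

// New API: the Context wraps the same reporting calls (plus config, output, etc.).
class NewApiMapper
    extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.setStatus("working on " + key);
    context.getCounter("Demo", "Records").increment(1);
  }
}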

On Sat, Apr 21, 2012 at 6:04 AM, JAX  wrote:
> Hi guys : I notice that there's been some chatter about the "Reporter" in 
> context of counters Forgive my ignorance here as I've never seen 
> Reporters used in real code.
>
> What is the difference between the use of our Context, and Reporter objects- 
> and how are they related? Is there any overlap in their functionality.?
>
>
> Jay Vyas
> MMSB
> UCHC



-- 
Harsh J