Re: Hadoop : java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

2012-08-02 Thread Robert Evans
The default text input format has a LongWritable key that is the byte
offset into the file.  The value is the full line of text.
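
In other words, the mapper's input key type has to be LongWritable (or the job
has to use an input format, such as KeyValueTextInputFormat, that actually
produces Text keys).  A minimal sketch of the first option, assuming the
comma-separated patent data quoted below; the class name is made up for
illustration:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a mapper whose input key type matches the LongWritable offset
// produced by the default TextInputFormat.
public class YearClaimsMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        String year = fields[1];
        String claims = fields[8];
        // Skip the header row (quoted column names) and records with an empty claims value.
        if (claims.length() > 0 && !claims.startsWith("\"")) {
            context.write(new Text(year), new Text(claims));
        }
    }
}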

On 8/2/12 2:59 PM, "Harit Himanshu"  wrote:

>StackOverflow link -
>http://stackoverflow.com/questions/11784729/hadoop-java-lang-classcastexception-org-apache-hadoop-io-longwritable-cannot
>
>--
>
>My program looks like
>
>public class TopKRecord extends Configured implements Tool {
>
>public static class MapClass extends Mapper<Text, Text, Text, Text> {
>
>public void map(Text key, Text value, Context context) throws
>IOException, InterruptedException {
>// your map code goes here
>String[] fields = value.toString().split(",");
>String year = fields[1];
>String claims = fields[8];
>
>if (claims.length() > 0 && (!claims.startsWith("\""))) {
>context.write(new Text(year.toString()), new
>Text(claims.toString()));
>}
>}
>}
>   public int run(String args[]) throws Exception {
>Job job = new Job();
>job.setJarByClass(TopKRecord.class);
>
>job.setMapperClass(MapClass.class);
>
>FileInputFormat.setInputPaths(job, new Path(args[0]));
>FileOutputFormat.setOutputPath(job, new Path(args[1]));
>
>job.setJobName("TopKRecord");
>job.setMapOutputValueClass(Text.class);
>job.setNumReduceTasks(0);
>boolean success = job.waitForCompletion(true);
>return success ? 0 : 1;
>}
>
>public static void main(String args[]) throws Exception {
>int ret = ToolRunner.run(new TopKRecord(), args);
>System.exit(ret);
>}
>}
>
>The data looks like
>
>"PATENT","GYEAR","GDATE","APPYEAR","COUNTRY","POSTATE","ASSIGNEE","ASSCODE
>","CLAIMS","NCLASS","CAT","SUBCAT","CMADE","CRECEIVE","RATIOCIT","GENERAL"
>,"ORIGINAL","FWDAPLAG","BCKGTLAG","SELFCTUB","SELFCTLB","SECDUPBD","SECDLW
>BD"
>3070801,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,
>3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,
>3070803,1963,1096,,"US","IL",,1,,2,6,63,,9,,0.3704,,,
>3070804,1963,1096,,"US","OH",,1,,2,6,63,,3,,0.6667,,,
>
>On running this program I see the following on console
>
>12/08/02 12:43:34 INFO mapred.JobClient: Task Id :
>attempt_201208021025_0007_m_00_0, Status : FAILED
>java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
>be cast to org.apache.hadoop.io.Text
>at com.hadoop.programs.TopKRecord$MapClass.map(TopKRecord.java:26)
>at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:396)
>at 
>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.
>java:1121)
>at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
>I believe that the class types are mapped correctly (see the Mapper class
>javadoc).
>
>Please let me know what I am doing wrong here?
>
>
>Thank you
>
>+ Harit Himanshu



Re: Issue with Hadoop Streaming

2012-08-02 Thread Robert Evans
http://www.mail-archive.com/core-user@hadoop.apache.org/msg07382.html
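
Streaming jobs can usually select an input format with the -inputformat option
on the hadoop jar command line, and NLineInputFormat hands each map task a
fixed number of input lines (one by default).  As a rough sketch, the
equivalent driver-side setup in Java, assuming the old-API class and
configuration key; the helper class name is made up:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class OneLinePerMapperConf {
    // Hypothetical helper: make each map task receive exactly one line of the
    // file-of-filenames, so each mapper (here, the Perl script) handles one file.
    public static JobConf configure(JobConf conf, String fileOfFilenames) {
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", 1);
        FileInputFormat.setInputPaths(conf, new Path(fileOfFilenames));
        return conf;
    }
}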




From: Devi Kumarappan <kpala...@att.net>
Reply-To: mapreduce-u...@hadoop.apache.org
Date: Thursday, August 2, 2012 3:03 PM
To: common-user@hadoop.apache.org, mapreduce-u...@hadoop.apache.org
Subject: Re: Issue with Hadoop Streaming

My mapper is a Perl script and it is not in Java. So how do I specify the
NLineInputFormat?


From: Robert Evans <ev...@yahoo-inc.com>
To: mapreduce-u...@hadoop.apache.org, common-user@hadoop.apache.org
Sent: Thu, August 2, 2012 12:59:50 PM
Subject: Re: Issue with Hadoop Streaming

It depends on the input format you use.  You probably want to look at using 
NLineInputFormat

From: Devi Kumarappan <kpala...@att.net>
Reply-To: mapreduce-u...@hadoop.apache.org
Date: Wednesday, August 1, 2012 8:09 PM
To: common-user@hadoop.apache.org, mapreduce-u...@hadoop.apache.org
Subject: Issue with Hadoop Streaming

I am trying to run Hadoop streaming using a Perl script as the mapper and with no
reducer. My requirement is for the mapper to run on one file at a time, since
I have to do pattern processing on the entire contents of one file at a time
and the file size is small.

The Hadoop streaming manual suggests the following solution:

*  Generate a file containing the full HDFS path of the input files. Each map 
task would get one file name as input.
*  Create a mapper script which, given a filename, will get the file to local 
disk, gzip the file and put it back in the desired output directory.

I am running the following command.

hadoop jar 
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar 
-input /user/devi/file.txt -output /user/devi/s_output -mapper "/usr/bin/perl 
/home/devi/Perl/crash_parser.pl"



/user/devi/file.txt contains the following two lines.

/user/devi/s_input/a.txt
/user/devi/s_input/b.txt

When this runs, instead of spawning two mappers for a.txt and b.txt as per the
document, only one mapper is being spawned and the perl script gets the 
/user/devi/s_input/a.txt and /user/devi/s_input/b.txt as the inputs.



How could I make the mapper perl script to run using only one file at a time ?



Appreciate your help, Thanks, Devi






Re: Issue with Hadoop Streaming

2012-08-02 Thread Robert Evans
It depends on the input format you use.  You probably want to look at using 
NLineInputFormat

From: Devi Kumarappan <kpala...@att.net>
Reply-To: mapreduce-u...@hadoop.apache.org
Date: Wednesday, August 1, 2012 8:09 PM
To: common-user@hadoop.apache.org, mapreduce-u...@hadoop.apache.org
Subject: Issue with Hadoop Streaming

I am trying to run Hadoop streaming using a Perl script as the mapper and with no
reducer. My requirement is for the mapper to run on one file at a time, since
I have to do pattern processing on the entire contents of one file at a time
and the file size is small.

The Hadoop streaming manual suggests the following solution:

 *   Generate a file containing the full HDFS path of the input files. Each map 
task would get one file name as input.
 *   Create a mapper script which, given a filename, will get the file to local 
disk, gzip the file and put it back in the desired output directory.

I am running the following command.

hadoop jar 
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar 
-input /user/devi/file.txt -output /user/devi/s_output -mapper "/usr/bin/perl 
/home/devi/Perl/crash_parser.pl"



/user/devi/file.txt contains the following two lines.

/user/devi/s_input/a.txt
/user/devi/s_input/b.txt

When this runs, instead of spawning two mappers for a.txt and b.txt as per the
document, only one mapper is being spawned and the perl script gets the 
/user/devi/s_input/a.txt and /user/devi/s_input/b.txt as the inputs.



How could I make the mapper perl script to run using only one file at a time ?



Appreciate your help, Thanks, Devi






Re: Using REST to get ApplicationMaster info (Issue solved)

2012-07-27 Thread Robert Evans
Yes you are right.  If it is true by default we probably want to update
the documentation for the web services to indicate this.  Could you file a
JIRA for improving that documentation?

Thanks,

Bobby

On 7/27/12 3:11 AM, "Prajakta Kalmegh"  wrote:

>:) Yes, you are right. The yarn.acl.enable property in yarn-default.xml is
>set to true. If the property is true by default, then this makes it mandatory
>for users either to specify a value for the hadoop.http.staticuser.user
>property explicitly or to set the ACLs to false. Am I right to assume
>this?
>
>Regards,
>Prajakta
>
>
>
>On Thu, Jul 26, 2012 at 11:59 PM, Robert Evans 
>wrote:
>
>> OK I think I understand it now.  You probably have ACLs enabled, but no
>> web filter on the RM to let you sign in as a given user.  As such the
>> default filter is making you be Dr. Who, or whomever else it is, but the
>> ACL check in the web service is rejecting Dr Who, because that is not
>>the
>> correct user.  You will probably run into this issue again if anyone
>>else
>> but you runs something.  You could fix this by either disabling the ACL
>> check, which makes a lot of sense for a cluster without security, or you
>> could implement a servlet Filter for the RM that would let you sign on
>>as
>> a given user.
>>
>> --Bobby Evans
>>
>>
>> On 7/26/12 12:48 AM, "Prajakta Kalmegh"  wrote:
>>
>> >Hi Bobby
>> >
>> >Thanks for the reply. My REST calls are working fine since I set the
>> >'hadoop.http.staticuser.user' property to 'prajakta' instead of Dr.Who
>>in
>> >core-site.xml . I didn't get time to figure out the reason behind it
>>as I
>> >just moved on to further coding :)
>> >
>> >Thanks,
>> >Prajakta
>> >
>> >
>> >
>> >On Thu, Jul 26, 2012 at 1:40 AM, Robert Evans 
>> wrote:
>> >
>> >> Hmm, that is very odd.  It only checks the user if security is
>>enabled
>> >>to
>> >> warn the user about potentially accessing something unsafe.  I am not
>> >>sure
>> >> why that would cause an issue.
>> >>
>> >> --Bobby Evans
>> >>
>> >> On 7/9/12 6:07 AM, "Prajakta Kalmegh"  wrote:
>> >>
>> >> >Hi Robert
>> >> >
>> >> >I figured out the problem just now. To avoid the below error, I had
>>to
>> >>set
>> >> >the 'hadoop.http.staticuser.user' property in core-site.xml
>>(defaults
>> >>to
>> >> >dr.who). I can now get runtime data from AppMaster using *curl* as
>> >>well as
>> >> >in GUI.
>> >> >
>> >> >I wonder if we have to set this property even when we are not
>> >>specifying
>> >> >the yarn web-proxy address (when it runs as part of RM by default)
>>as
>> >> >well.
>> >> >If yes, was it documented somewhere which I failed to see? :(
>> >> >
>> >> >Anyways, thanks for your response so far.
>> >> >
>> >> >Regards,
>> >> >Prajakta
>> >> >
>> >> >
>> >> >
>> >> >On Mon, Jul 9, 2012 at 3:29 PM, Prajakta Kalmegh
>>
>> >> >wrote:
>> >> >
>> >> >> Hi Robert
>> >> >>
>> >> >> I started the proxyserver explicitly by specifying a value for the
>> >> >> yarn.web-proxy.address in yarn-site.xml. The proxyserver did start
>> >>and I
>> >> >> tried getting the JSON response using the following command :
>> >> >>
>> >> >> curl --compressed -H "Accept: application/json" -X GET "
>> >> >>
>> >> >>
>> >>
>> >>
>>
>>http://localhost:8090/proxy/application_1341823967331_0001/ws/v1/mapreduc
>> >> >>e/jobs/job_1341823967331_0001
>> >> >> "
>> >> >>
>> >> >> However, it refused connection and below is the excerpt from the
>> >> >> Proxyserver logs:
>> >> >> -
>> >> >> 2012-07-09 14:26:40,402 INFO org.mortbay.log: Extract
>> >> >>jar:file:/home/prajakta/Projects/IRL/hadoop-common/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-yarn-common-3.0.0-SNAPSHOT.jar!/webapps/proxy

Re: Using REST to get ApplicationMaster info (Issue solved)

2012-07-26 Thread Robert Evans
OK I think I understand it now.  You probably have ACLs enabled, but no
web filter on the RM to let you sign in as a given user.  As such the
default filter is making you be Dr. Who, or whomever else it is, but the
ACL check in the web service is rejecting Dr Who, because that is not the
correct user.  You will probably run into this issue again if anyone else
but you runs something.  You could fix this by either disabling the ACL
check, which makes a lot of sense for a cluster without security, or you
could implement a servlet Filter for the RM that would let you sign on as
a given user.

--Bobby Evans


On 7/26/12 12:48 AM, "Prajakta Kalmegh"  wrote:

>Hi Bobby
>
>Thanks for the reply. My REST calls are working fine since I set the
>'hadoop.http.staticuser.user' property to 'prajakta' instead of Dr.Who in
>core-site.xml . I didn't get time to figure out the reason behind it as I
>just moved on to further coding :)
>
>Thanks,
>Prajakta
>
>
>
>On Thu, Jul 26, 2012 at 1:40 AM, Robert Evans  wrote:
>
>> Hmm, that is very odd.  It only checks the user if security is enabled
>>to
>> warn the user about potentially accessing something unsafe.  I am not
>>sure
>> why that would cause an issue.
>>
>> --Bobby Evans
>>
>> On 7/9/12 6:07 AM, "Prajakta Kalmegh"  wrote:
>>
>> >Hi Robert
>> >
>> >I figured out the problem just now. To avoid the below error, I had to
>>set
>> >the 'hadoop.http.staticuser.user' property in core-site.xml (defaults
>>to
>> >dr.who). I can now get runtime data from AppMaster using *curl* as
>>well as
>> >in GUI.
>> >
>> >I wonder if we have to set this property even when we are not
>>specifying
>> >the yarn web-proxy address (when it runs as part of RM by default) as
>> >well.
>> >If yes, was it documented somewhere which I failed to see? :(
>> >
>> >Anyways, thanks for your response so far.
>> >
>> >Regards,
>> >Prajakta
>> >
>> >
>> >
>> >On Mon, Jul 9, 2012 at 3:29 PM, Prajakta Kalmegh 
>> >wrote:
>> >
>> >> Hi Robert
>> >>
>> >> I started the proxyserver explicitly by specifying a value for the
>> >> yarn.web-proxy.address in yarn-site.xml. The proxyserver did start
>>and I
>> >> tried getting the JSON response using the following command :
>> >>
>> >> curl --compressed -H "Accept: application/json" -X GET "
>> >>
>> >>
>>
>>http://localhost:8090/proxy/application_1341823967331_0001/ws/v1/mapreduc
>> >>e/jobs/job_1341823967331_0001
>> >> "
>> >>
>> >> However, it refused connection and below is the excerpt from the
>> >> Proxyserver logs:
>> >> -
>> >> 2012-07-09 14:26:40,402 INFO org.mortbay.log: Extract
>> >>
>>
>>>>jar:file:/home/prajakta/Projects/IRL/hadoop-common/hadoop-dist/target/h
>>>>ad
>>
>>>>oop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-yarn-common-3.0.0-SNAP
>>>>SH
>> >>OT.jar!/webapps/proxy
>> >> to /tmp/Jetty_localhost_8090_proxy.ak3o30/webapp
>> >> 2012-07-09 14:26:40,992 INFO org.mortbay.log: Started
>> >> SelectChannelConnector@localhost:8090
>> >> 2012-07-09 14:26:40,993 INFO
>> >> org.apache.hadoop.yarn.service.AbstractService:
>> >> Service:org.apache.hadoop.yarn.server.webproxy.WebAppProxy is
>>started.
>> >> 2012-07-09 14:26:40,993 INFO
>> >> org.apache.hadoop.yarn.service.AbstractService:
>> >> Service:org.apache.hadoop.yarn.server.webproxy.WebAppProxyServer is
>> >>started.
>> >> 2012-07-09 14:33:26,039 INFO
>> >> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: dr.who is
>> >> accessing unchecked
>> >> http://prajakta:44314/ws/v1/mapreduce/jobs/job_1341823967331_0001
>>which
>> >> is the app master GUI of application_1341823967331_0001 owned by
>> >>prajakta
>> >> 2012-07-09 14:33:29,277 INFO
>> >> org.apache.commons.httpclient.HttpMethodDirector: I/O exception
>> >> (org.apache.commons.httpclient.NoHttpResponseException) caught when
>> >> processing request: The server prajakta failed to respond
>> >> 2012-07-09 14:33:29,277 INFO
>> >> org.apache.commons.httpclient.HttpMethodDirector: Retrying request
>> >> 2012-07-09 14:33:29,284 WARN org.mortbay.log:

Re: Using REST to get ApplicationMaster info (Issue solved)

2012-07-25 Thread Robert Evans
Hmm, that is very odd.  It only checks the user if security is enabled to
warn the user about potentially accessing something unsafe.  I am not sure
why that would cause an issue.

--Bobby Evans

On 7/9/12 6:07 AM, "Prajakta Kalmegh"  wrote:

>Hi Robert
>
>I figured out the problem just now. To avoid the below error, I had to set
>the 'hadoop.http.staticuser.user' property in core-site.xml (defaults to
>dr.who). I can now get runtime data from AppMaster using *curl* as well as
>in GUI.
>
>I wonder if we have to set this property even when we are not specifying
>the yarn web-proxy address (when it runs as part of RM by default) as
>well.
>If yes, was it documented somewhere which I failed to see? :(
>
>Anyways, thanks for your response so far.
>
>Regards,
>Prajakta
>
>
>
>On Mon, Jul 9, 2012 at 3:29 PM, Prajakta Kalmegh 
>wrote:
>
>> Hi Robert
>>
>> I started the proxyserver explicitly by specifying a value for the
>> yarn.web-proxy.address in yarn-site.xml. The proxyserver did start and I
>> tried getting the JSON response using the following command :
>>
>> curl --compressed -H "Accept: application/json" -X GET "
>> 
>>http://localhost:8090/proxy/application_1341823967331_0001/ws/v1/mapreduc
>>e/jobs/job_1341823967331_0001
>> "
>>
>> However, it refused connection and below is the excerpt from the
>> Proxyserver logs:
>> -
>> 2012-07-09 14:26:40,402 INFO org.mortbay.log: Extract
>> 
>>jar:file:/home/prajakta/Projects/IRL/hadoop-common/hadoop-dist/target/had
>>oop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-yarn-common-3.0.0-SNAPSH
>>OT.jar!/webapps/proxy
>> to /tmp/Jetty_localhost_8090_proxy.ak3o30/webapp
>> 2012-07-09 14:26:40,992 INFO org.mortbay.log: Started
>> SelectChannelConnector@localhost:8090
>> 2012-07-09 14:26:40,993 INFO
>> org.apache.hadoop.yarn.service.AbstractService:
>> Service:org.apache.hadoop.yarn.server.webproxy.WebAppProxy is started.
>> 2012-07-09 14:26:40,993 INFO
>> org.apache.hadoop.yarn.service.AbstractService:
>> Service:org.apache.hadoop.yarn.server.webproxy.WebAppProxyServer is
>>started.
>> 2012-07-09 14:33:26,039 INFO
>> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: dr.who is
>> accessing unchecked
>> http://prajakta:44314/ws/v1/mapreduce/jobs/job_1341823967331_0001 which
>> is the app master GUI of application_1341823967331_0001 owned by
>>prajakta
>> 2012-07-09 14:33:29,277 INFO
>> org.apache.commons.httpclient.HttpMethodDirector: I/O exception
>> (org.apache.commons.httpclient.NoHttpResponseException) caught when
>> processing request: The server prajakta failed to respond
>> 2012-07-09 14:33:29,277 INFO
>> org.apache.commons.httpclient.HttpMethodDirector: Retrying request
>> 2012-07-09 14:33:29,284 WARN org.mortbay.log:
>> 
>>/proxy/application_1341823967331_0001/ws/v1/mapreduce/jobs/job_1341823967
>>331_0001:
>> java.net.SocketException: Connection reset
>> 2012-07-09 14:37:33,834 INFO
>> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: dr.who is
>> accessing unchecked
>> 
>> http://prajakta:19888/jobhistory/job/job_1341823967331_0001/jobhistory/job/job_1341823967331_0001
>> which is the app master GUI of application_1341823967331_0001 owned by
>> prajakta
>> ---
>>
>> I am not sure why http request object is setting my remoteUser to
>>dr.who.
>> :(
>>
>> I gather from <https://issues.apache.org/jira/browse/MAPREDUCE-2858>
>>that
>> this warning is posted only in case where security is disabled. I assume
>> that the proxy server is not disabled if security is disabled.
>>
>> Any idea what could be the reason for this I/O exception? Am I missing
>> setting any property for proper access. Please let me know.
>>
>> Regards,
>> Prajakta
>>
>>
>>
>>
>>
>>
>> On Fri, Jul 6, 2012 at 10:59 PM, Prajakta Kalmegh
>>wrote:
>>
>>> I am using hadoop trunk (forked from github). It supports RESTful APIs
>>>as
>>> I am able to retrieve JSON objects for RM (cluster/nodes info)+
>>> Historyserver. The only issue is with AppMaster REST API.
>>>
>>> Regards,
>>> Prajakta
>>>
>>>
>>>
>>> On Fri, Jul 6, 2012 at 10:55 PM, Robert Evans
>>>wrote:
>>>
>>>> What version of hadoop are you using?  It could be that the version
>>>>you
>>>> have does not have the RESTful APIs in it yet, and the proxy is
>>>>wor

Re: (Repost) Using REST to get ApplicationMaster info

2012-07-25 Thread Robert Evans
I am sorry it has taken me so long to respond.  Work has been crazy :).

I really am at a loss right now why you are getting the connection refused
error. The error is happening between the RM and the AM.  The "dr.who" is
something you can ignore; it is the default name that is given to a web
user when security is disabled.  You probably want to check the logs for
the AM to see if there is anything in there, but beyond that I am at a
loss.

Sorry,

Bobby Evans

On 7/9/12 4:59 AM, "Prajakta Kalmegh"  wrote:

>Hi Robert
>
>I started the proxyserver explicitly by specifying a value for the
>yarn.web-proxy.address in yarn-site.xml. The proxyserver did start and I
>tried getting the JSON response using the following command :
>
>curl --compressed -H "Accept: application/json" -X GET "
>http://localhost:8090/proxy/application_1341823967331_0001/ws/v1/mapreduce
>/jobs/job_1341823967331_0001
>"
>
>However, it refused connection and below is the excerpt from the
>Proxyserver logs:
>-
>2012-07-09 14:26:40,402 INFO org.mortbay.log: Extract
>jar:file:/home/prajakta/Projects/IRL/hadoop-common/hadoop-dist/target/hado
>op-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-yarn-common-3.0.0-SNAPSHOT
>.jar!/webapps/proxy
>to /tmp/Jetty_localhost_8090_proxy.ak3o30/webapp
>2012-07-09 14:26:40,992 INFO org.mortbay.log: Started
>SelectChannelConnector@localhost:8090
>2012-07-09 14:26:40,993 INFO
>org.apache.hadoop.yarn.service.AbstractService:
>Service:org.apache.hadoop.yarn.server.webproxy.WebAppProxy is started.
>2012-07-09 14:26:40,993 INFO
>org.apache.hadoop.yarn.service.AbstractService:
>Service:org.apache.hadoop.yarn.server.webproxy.WebAppProxyServer is
>started.
>2012-07-09 14:33:26,039 INFO
>org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: dr.who is
>accessing unchecked
>http://prajakta:44314/ws/v1/mapreduce/jobs/job_1341823967331_0001 which is
>the app master GUI of application_1341823967331_0001 owned by prajakta
>2012-07-09 14:33:29,277 INFO
>org.apache.commons.httpclient.HttpMethodDirector: I/O exception
>(org.apache.commons.httpclient.NoHttpResponseException) caught when
>processing request: The server prajakta failed to respond
>2012-07-09 14:33:29,277 INFO
>org.apache.commons.httpclient.HttpMethodDirector: Retrying request
>2012-07-09 14:33:29,284 WARN org.mortbay.log:
>/proxy/application_1341823967331_0001/ws/v1/mapreduce/jobs/job_13418239673
>31_0001:
>java.net.SocketException: Connection reset
>2012-07-09 14:37:33,834 INFO
>org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: dr.who is
>accessing unchecked
>http://prajakta:19888/jobhistory/job/job_1341823967331_0001/jobhistory/job/job_1341823967331_0001
>which is the app master GUI of application_1341823967331_0001 owned by
>prajakta
>---
>
>I am not sure why http request object is setting my remoteUser to dr.who.
>:(
>
>I gather from <https://issues.apache.org/jira/browse/MAPREDUCE-2858> that
>this warning is posted only in case where security is disabled. I assume
>that the proxy server is not disabled if security is disabled.
>
>Any idea what could be the reason for this I/O exception? Am I missing
>setting any property for proper access. Please let me know.
>
>Regards,
>Prajakta
>
>
>
>
>
>
>On Fri, Jul 6, 2012 at 10:59 PM, Prajakta Kalmegh
>wrote:
>
>> I am using hadoop trunk (forked from github). It supports RESTful APIs
>>as
>> I am able to retrieve JSON objects for RM (cluster/nodes info)+
>> Historyserver. The only issue is with AppMaster REST API.
>>
>> Regards,
>> Prajakta
>>
>>
>>
>> On Fri, Jul 6, 2012 at 10:55 PM, Robert Evans 
>>wrote:
>>
>>> What version of hadoop are you using?  It could be that the version you
>>> have does not have the RESTful APIs in it yet, and the proxy is working
>>> just fine.
>>>
>>> --Bobby Evans
>>>
>>> On 7/6/12 12:06 PM, "Prajakta Kalmegh"  wrote:
>>>
>>> >Robert , Thanks for the response. If I do not provide any explicit
>>> >configuration for the proxy server, do I still need to start it using
>>>the
>>> >'yarn start proxy server'? I am currently not doing it.
>>> >
>>> >Also, I am able to access the html page for proxy using the
>>> ><http://localhost:8088/proxy/{appid}/mapreduce/jobs> URL. (Note this
>>>url
>>> >does not have the '/ws/v1/ part in it. I get the html response when I
>>> >query
>>> >for this URL in runtime.
>>> >
>>> >So I assume the proxy server must be starting fine since I am able 

Re: (Repost) Using REST to get ApplicationMaster info

2012-07-06 Thread Robert Evans
What version of hadoop are you using?  It could be that the version you
have does not have the RESTful APIs in it yet, and the proxy is working
just fine.

--Bobby Evans

On 7/6/12 12:06 PM, "Prajakta Kalmegh"  wrote:

>Robert , Thanks for the response. If I do not provide any explicit
>configuration for the proxy server, do I still need to start it using the
>'yarn start proxy server'? I am currently not doing it.
>
>Also, I am able to access the html page for proxy using the
><http://localhost:8088/proxy/{appid}/mapreduce/jobs> URL. (Note this url
>does not have the '/ws/v1/ part in it. I get the html response when I
>query
>for this URL in runtime.
>
>So I assume the proxy server must be starting fine since I am able to
>access this URL. I will try logging more details tomorrow from my office
>machine and will let you know the result.
>
>Regards,
>Prajakta
>
>
>
>On Fri, Jul 6, 2012 at 10:22 PM, Robert Evans  wrote:
>
>> Sorry I did not respond sooner.  The default behavior is to have the
>>proxy
>> server run as part of the RM.  I am not really sure why it is not doing
>> this in your case.  If you set the config yourself to be a URI that is
>> different from that of the RM then you need to launch a standalone proxy
>> server.  You can do this by running
>>
>> yarn start proxy server
>>
>> Without sitting down with you it is going to be somewhat difficult to
>> debug why this is happening.  However, in retrospect it would be nice to
>> add in some extra logging to help indicate why the proxy server is not
>> functioning as desired.  If you could file a JIRA to add in the logging
>>I
>> would be happy to provide a patch to you and we can try and debug the
>> issue further.  Please file it under the MAPREDUCE JIRA project.
>>
>> --Bobby
>>
>> On 7/6/12 3:29 AM, "Prajakta Kalmegh"  wrote:
>>
>> >Re-posting as I haven't got a solution yet. Sorry for spamming. I
>>won't be
>> >able to proceed in my code until I get a JSON response using AppMaster
>> >REST
>> >URL. :(
>> >
>> >Thanks,
>> >Prajakta
>> >
>> >
>> >On Wed, Jul 4, 2012 at 5:55 PM, Prajakta Kalmegh 
>> >wrote:
>> >
>> >> Hi Robert/Harsh
>> >>
>> >> Thanks for your reply.
>> >>
>> >> My RM is starting just fine. The problem is with the use of
>> >>http://<proxy http address:port>/proxy/{appid}/ws/v1/mapreduce
>> >> to get the JSON response.
>> >>
>> >> As I said before, I had not configured the yarn.web-proxy.address
>> >>property in yarn-site.xml. I assumed it will use the RM's
>> >>yarn.resourcemanager.webapp.address property value as default.
>>However,
>> >>it gives me a '404-Page not found error'.  Today I tried specifying a
>> >>value explicitly for the yarn.web-proxy.address property.
>> >>
>> >> On running the wordcount example, it even gives a url
>> >><http://localhost:8090>/proxy/{appid}/> to track the App Mast info.
>> >>However, I am still not able to get a json response.
>> >>
>> >> Also, I tried to get the data from historyserver instead of runtime
>> >>using the instructions given on page
>> >><
>> http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yar
>> >>n-site/HistoryServerRest.html>
>> >>
>> >> HistoryServer REST response does not give me jobids corresponding to
>>an
>> >>application. It just lists all the jobs run until now. By the way, the
>> >>documentation does say
>> >>
>> >> --
>> >>
>> >> "Both of the following URI's give you the history server information,
>> >>from an application id identified by the appid value.
>> >>   * http://<history server http address:port>/ws/v1/history
>> >>   * http://<history server http address:port>/ws/v1/history/info"
>> >> -
>> >>
>> >> But there is no provision to specify the application id with these
>>REST
>> >>URLs.
>> >>
>> >> Any idea how I can get the Application Master REST working and also
>> >>linking jobids to application id using the HistoryServerREST API?
>> >>
>> >> Any help is appreciated. Thanks in advance.
>> >> Regards,
>> >> Prajakta
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, Jun 29, 2012 at 8:55 PM, Robert Evans 
>> >>wrote:

Re: (Repost) Using REST to get ApplicationMaster info

2012-07-06 Thread Robert Evans
Sorry I did not respond sooner.  The default behavior is to have the proxy
server run as part of the RM.  I am not really sure why it is not doing
this in your case.  If you set the config yourself to be a URI that is
different from that of the RM then you need to launch a standalone proxy
server.  You can do this by running

yarn start proxy server

Without sitting down with you it is going to be somewhat difficult to
debug why this is happening.  However, in retrospect it would be nice to
add in some extra logging to help indicate why the proxy server is not
functioning as desired.  If you could file a JIRA to add in the logging I
would be happy to provide a patch to you and we can try and debug the
issue further.  Please file it under the MAPREDUCE JIRA project.

--Bobby

On 7/6/12 3:29 AM, "Prajakta Kalmegh"  wrote:

>Re-posting as I haven't got a solution yet. Sorry for spamming. I won't be
>able to proceed in my code until I get a JSON response using AppMaster
>REST
>URL. :(
>
>Thanks,
>Prajakta
>
>
>On Wed, Jul 4, 2012 at 5:55 PM, Prajakta Kalmegh 
>wrote:
>
>> Hi Robert/Harsh
>>
>> Thanks for your reply.
>>
>> My RM is starting just fine. The problem is with the use of
>>http://<proxy http address:port>/proxy/{appid}/ws/v1/mapreduce
>> to get the JSON response.
>>
>> As I said before, I had not configured the yarn.web-proxy.address
>>property in yarn-site.xml. I assumed it will use the RM's
>>yarn.resourcemanager.webapp.address property value as default. However,
>>it gives me a '404-Page not found error'.  Today I tried specifying a
>>value explicitly for the yarn.web-proxy.address property.
>>
>> On running the wordcount example, it even gives a url
>><http://localhost:8090>/proxy/{appid}/> to track the App Mast info.
>>However, I am still not able to get a json response.
>>
>> Also, I tried to get the data from historyserver instead of runtime
>>using the instructions given on page
>><http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yar
>>n-site/HistoryServerRest.html>
>>
>> HistoryServer REST response does not give me jobids corresponding to an
>>application. It just lists all the jobs run until now. By the way, the
>>documentation does say
>>
>> --
>>
>> "Both of the following URI's give you the history server information,
>>from an application id identified by the appid value.
>>   * http://<history server http address:port>/ws/v1/history
>>   * http://<history server http address:port>/ws/v1/history/info"
>> -
>>
>> But there is no provision to specify the application id with these REST
>>URLs.
>>
>> Any idea how I can get the Application Master REST working and also
>>linking jobids to application id using the HistoryServerREST API?
>>
>> Any help is appreciated. Thanks in advance.
>> Regards,
>> Prajakta
>>
>>
>>
>>
>> On Fri, Jun 29, 2012 at 8:55 PM, Robert Evans 
>>wrote:
>>
>>> Please don't file that JIRA.  The proxy server is intended to front the
>>> web server for all calls to the AM.  This is so you only have to go to
>>>a
>>> single location to get to any AM's web service.  The proxy server is a
>>> very simple proxy and just forwards the extra part of the path on to
>>>the
>>> AM.
>>>
>>> If you are having issues with this please include the version you are
>>> having problems with.  Also please look at the logs for the RM on
>>>startup
>>> to see if there is anything there indicating why it is not starting up.
>>>
>>> --Bobby Evans
>>>
>>> On 6/28/12 9:46 AM, "Harsh J"  wrote:
>>>
>>> >As far as I can tell, the MR WebApp, as the name itself indicates on
>>> >its doc page, starts only at the MR AM (which may be running at any
>>> >NM), and it starts as an ephemeral port logged at in the AM logs
>>> >usually as:
>>> >
>>> >INFO Web app /mapreduce started at [PORT]
>>> >
>>> >That it starts its own server with an ephemeral access point makes
>>> >sense, since each job uses its own AM and having a common location may
>>> >not work with the form of REST API documented at your link. Can you
>>> >please file a JIRA to fix the doc and remove the proxy server refs,
>>> >which are misleading?
>>> >
>>> >Do correct me if I'm wrong.
>>> >
>>> >On Thu, Jun 28, 2012 at 6:13 PM, Prajakta Kalmegh 
>>> >wrote:
>>> >> Hi
>>> >>
>>> >>

Re: Which hadoop version shoul I install in a production environment

2012-07-03 Thread Robert Evans
0.23 is also somewhat of an Alpha/Beta quality and I would not want to run
it in production just yet.  0.23 was renamed 2.0 and development work has
continued on both lines.  New features have been going into 2.0 and 0.23
has been left only for stabilization.  Hopefully we will have 0.23.3 to
the point that I feel comfortable calling a release vote on it soon, but
not quite yet.

--Bobby Evans

On 7/3/12 3:19 PM, "Pablo Musa"  wrote:

>Hi guys,
>I am sorry to bother you, but I have a cluster already configured and
>running
>with the following packages (cdh3):
>hadoop-0.20.noarch  0.20.2+923.256-1
>hadoop-hbase.noarch 0.90.6+84.29-1
>hadoop-zookeeper.noarch 3.3.5+19.1-1
>
>I am having trouble with the HBase regionserver crashing (2-3 nodes per day)
>in an 8-node cluster. I thought it was the GC problem, but I applied the fix
>and they still crash.
>
>I was wondering if updating the whole system (hadoop + hbase + zookeeper
>+ mapred)
>would fix my problems. Besides there are a lot of fixes and features
>implemented
>since 0.20.
>
>Searching around I found all the different versions released and opted
>for 0.23-cdh4. I applied this new version to an old cdh3 dev environment 2
>months ago, started doing other stuff, and came back to finally try it in
>production. Unfortunately, cdh4 has moved to 2.0 and I could not find the
>0.23 packages anymore. Besides, the 2.0 release is an alpha version and thus
>not ready for production.
>Even worse, there is no 0.23 release in hadoop site:
>http://hadoop.apache.org/common/releases.html
>
>I read a few links, such as:
>http://www.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0/
>http://www.dbms2.com/2012/06/19/distributions-cdh-4-hdp-1-hadoop-2-0/
>and could not reach to a conclusion for a very simple question:
>Which is the latest stable hadoop version to install in a production
>environment with package manager support?
>
>Do you guys suggest any other manager than cdh?
>
>Sorry for the bad english and thanks for the help ;)
>
>Abs,
>Pablo



Re: bug in streaming?

2012-07-03 Thread Robert Evans
Yang,

I would not call it a bug, I would call it a potential optimization.  The
default input format for streaming will try to create one mapper per
block, but if there is only one block it will create two mappers for it.
You can override the streaming input format to get a different behavior.
The reality is that Hadoop does not do a very good job with small inputs.
We are going to be much slower than running without Hadoop for such small
jobs.  We have created the UberAM in Hadoop 2.0 to be able to address some
of this.  There is still a lot of work to be done and I am not sure what
priority it is for the different developers.  Feel free to file a JIRA if
you want to.  

--Bobby Evans

On 7/2/12 4:16 PM, "Yang"  wrote:

>if I set input to a file that contains just 1 line (which does not even
>contain "\n")
>
>and the mapper is
>
>-mapper " bash -c './a.sh' "
>
>and a.sh is
>
>echo -n "|"
>cat
>echo -n "|"
>
>
>
>
>I see 2 part- files generated in the output, which means 2 mappers were
>invoked, and one mapper consumed empty input , producing an output
>'||'
>but given such a small input file,  we should definitely see only one
>mapper.
>
>this looks like a bug
>
>Yang



Re: Dealing with changing file format

2012-07-02 Thread Robert Evans
There are several different ways.  One of the ways is to use something
like HCatalog to track the format and location of the dataset.  This may
be overkill for your problem, but it will grow with you.  Another is to
store the schema with the data when it is written out.  Your code may need
to dynamically adjust to whether the field is there or not.
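
As a small illustration of the second approach, a sketch (hypothetical class,
assuming the comma-separated layout described in the question below) of
record-parsing code that adjusts to whether the later-added field is present:

// Hypothetical value object tolerating both the old 3-field and new 4-field layouts.
public class Record {
    final String field1;
    final String field2;
    final String field3;
    final String field4; // added later; may be absent in older data

    Record(String line) {
        // -1 keeps trailing empty fields instead of silently dropping them.
        String[] parts = line.split(",", -1);
        field1 = parts[0];
        field2 = parts[1];
        field3 = parts[2];
        field4 = parts.length > 3 ? parts[3] : null; // adjust when the field is there
    }
}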

--Bobby Evans

On 7/2/12 4:09 PM, "Mohit Anchlia"  wrote:

>I am wondering what's the right way to go about designing reading input
>and
>output where file format may change over period. For instance we might
>start with "field1,field2,field3" but at some point we add new field4 in
>the input. What's the best way to deal with such scenarios? Keep a catalog
>of changes that is timestamped?



Re: Using REST to get ApplicationMaster info

2012-06-29 Thread Robert Evans
Please don't file that JIRA.  The proxy server is intended to front the
web server for all calls to the AM.  This is so you only have to go to a
single location to get to any AM's web service.  The proxy server is a
very simple proxy and just forwards the extra part of the path on to the
AM.

If you are having issues with this please include the version you are
having problems with.  Also please look at the logs for the RM on startup
to see if there is anything there indicating why it is not starting up.

--Bobby Evans

On 6/28/12 9:46 AM, "Harsh J"  wrote:

>As far as I can tell, the MR WebApp, as the name itself indicates on
>its doc page, starts only at the MR AM (which may be running at any
>NM), and it starts as an ephemeral port logged at in the AM logs
>usually as:
>
>INFO Web app /mapreduce started at [PORT]
>
>That it starts its own server with an ephemeral access point makes
>sense, since each job uses its own AM and having a common location may
>not work with the form of REST API documented at your link. Can you
>please file a JIRA to fix the doc and remove the proxy server refs,
>which are misleading?
>
>Do correct me if I'm wrong.
>
>On Thu, Jun 28, 2012 at 6:13 PM, Prajakta Kalmegh 
>wrote:
>> Hi
>>
>> I am trying to get the ApplicationMaster info using the
>> http://<proxy http address:port>/proxy/{appid}/ws/v1/mapreduce/info link as described on the
>> <http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html>
>> page.
>>
>> I am able to access and retrieve JSON response for other modules
>> (ResourceManager, NodeManager and HistoryServer). However, I am getting
>> 'Page not found' when I try to use my ResourceManager Http address to
>> access the ApplicationMaster info. I am using <
>> http://localhost:8088/proxy/{appid}/ws/v1/mapreduce/info> to retrieve
>>JSON
>> response.
>>
>> The instructions say "The application master should be accessed via the
>> proxy. This proxy is configurable to run either on the resource manager
>>or
>> on a separate host."
>>
>> My yarn-default.xml contains:
>>  <property>
>>    <description>The address for the web proxy as HOST:PORT, if this is not
>>    given then the proxy will run as part of the RM</description>
>>    <name>yarn.web-proxy.address</name>
>>    <value></value>
>>  </property>
>>
>> and I did not set a value explicitly in yarn-site.xml.  Any idea how I
>>can
>> get this working? Thanks in advance.
>>
>> Regards,
>> Prajakta
>
>
>
>-- 
>Harsh J



Re: Is it possible to implement transpose with PigLatin/any other MR language?

2012-06-22 Thread Robert Evans
@Subit

You can do it.  Here is some pseudo code for it in map/reduce.  It abuses 
Map/Reduce a little to be more performant.  But it is definitely doable.  At
the end you will get a file for each reducer you have configured.  If you want 
a single file you can concatenate all of the files together ordered by the name 
of the file.  You should be able to do it in pig too, but you will need an 
input format that will give you the offset, and you will need to possibly have 
the reducer sort by the offset internally in the bag it is handed.  This may 
cause pig to have performance issues if it cannot keep the entire bag in memory 
to sort, which is why I did it in MR instead.

//Assuming a text input format where the key is the byte offset into the original
//input file, and there is only one input file. If there is more than one input
//file you need a way to include the ordering of the input files in the offset.
map(LongWritable offset, Text line) {
    String[] parts = line.toString().split(",");
    for (int i = 0; i < parts.length; i++) {
        collect(new ColumnOffsetKey(offset, i), new Text(parts[i]));
    }
}

//We need to know the max number of columns ahead of time to get total order
//partitioning to work.
int partition(ColumnOffsetKey key) {
    return (int) (((double) key.column / maxColumns) * numPartitions);
}

//You probably want to put in a binary comparator for performance reasons.
//First sort by column (which will become the new row), then by offset, which
//tells us the new column ordering.
int compare(ColumnOffsetKey key1, ColumnOffsetKey key2) {
    if (key1.column > key2.column) {
        return 1;
    } else if (key1.column < key2.column) {
        return -1;
    } else if (key1.offset > key2.offset) {
        return 1;
    } else if (key1.offset < key2.offset) {
        return -1;
    }
    return 0;
}

StringBuffer currentRow = null;
long currentRowNum = -1;

reduce(ColumnOffsetKey key, Iterable<Text> values) {
    //This is a bit ugly because we need to detect changes to the row; there is
    //probably a cleaner way to do this.
    for (Text part : values) {
        if (currentRowNum != key.column) {
            //Output the current row if needed.
            if (currentRow != null) {
                collect(null, currentRow);
            }
            currentRow = new StringBuffer();
            currentRow.append(part);
            currentRowNum = key.column;
        } else {
            currentRow.append(',');
            currentRow.append(part);
        }
    }
}

//This is called at the end of the reducer in the new API (cleanup(), or
//something like it; I don't remember the exact method name off the top of my head).
cleanup() {
    if (currentRow != null) {
        collect(null, currentRow);
    }
}

On 6/22/12 5:35 AM, "Subir S"  wrote:

Thank you for the inputs!

@Norbert,
 But a GROUP BY column number clause also does not guarantee that the order of
columns is preserved. Even the row number would need to be known so that, at
the end, we can sort each row based on the row number using a nested FOREACH.
But after that FOREACH, since the sort order is not preserved, the data may
again be in the wrong order in the row for other operations.

To me it seems like it is not possible to do this in MR.


On Fri, Jun 22, 2012 at 12:56 AM, Robert Evans  wrote:

> That may be true if you have multiple reducers; I have not read through the
> code very closely.  You can run it with a single reducer, or you can
> write a custom partitioner to do it.  You only need to know the length of
> the column, and then you can divide them up appropriately, kind of like how
> the total order partitioner does it.
>
> --Bobby Evans
>
> On 6/21/12 1:15 PM, "Norbert Burger"  wrote:
>
> While it may be fine for many cases, If I'm reading the Nectar code
> correctly, that transpose doesn't guarantee anything about the order of
> rows within each column.  In other words, transposing:
>
> a - b -c
> d - e - f
> g - h - i
>
> may give you different permutations of "a - d - g" as the first row,
> depending on shuffle order.  You can trivially avoid this with one
> mapper/reducer, but then you're not exploiting the framework.  Note that
> you can accomplish same with a higher-level language like PIg by using a
> UDF like LinkedIn's Enumerate [1] to tag each column, and then simply
> GROUPing BY column number.
>
> [1]
>
> https://raw.github.com/linkedin/datafu/master/src/java/datafu/pig/bags/Enumerate.java
>
> Norbert
>
> On Thu, Jun 21, 2012 at 5:00 AM, madhu phatak 
> wrote:
>
> > Hi,
> >  Its possible in Map/Reduce. Look into the code here
> >
> >
> https://github.com/zinnia-phatak-dev/Nectar/tree/master/Nectar-regression/src/main/java/com/zinnia/nectar/regression/hadoop/primitive/mapreduce
> >
> >
> >
> > 2012/6/21 Subir S 
> >
> > > Hi,
> > >
> > > Is it possible to implement transpose operation of rows in

Re: Is it possible to implement transpose with PigLatin/any other MR language?

2012-06-21 Thread Robert Evans
That may be true if you have multiple reducers; I have not read through the code
very closely.  You can run it with a single reducer, or you can write a
custom partitioner to do it.  You only need to know the length of the column,
and then you can divide them up appropriately, kind of like how the total order 
partitioner does it.

--Bobby Evans

On 6/21/12 1:15 PM, "Norbert Burger"  wrote:

While it may be fine for many cases, If I'm reading the Nectar code
correctly, that transpose doesn't guarantee anything about the order of
rows within each column.  In other words, transposing:

a - b -c
d - e - f
g - h - i

may give you different permutations of "a - d - g" as the first row,
depending on shuffle order.  You can trivially avoid this with one
mapper/reducer, but then you're not exploiting the framework.  Note that
you can accomplish same with a higher-level language like PIg by using a
UDF like LinkedIn's Enumerate [1] to tag each column, and then simply
GROUPing BY column number.

[1]
https://raw.github.com/linkedin/datafu/master/src/java/datafu/pig/bags/Enumerate.java

Norbert

On Thu, Jun 21, 2012 at 5:00 AM, madhu phatak  wrote:

> Hi,
>  Its possible in Map/Reduce. Look into the code here
>
> https://github.com/zinnia-phatak-dev/Nectar/tree/master/Nectar-regression/src/main/java/com/zinnia/nectar/regression/hadoop/primitive/mapreduce
>
>
>
> 2012/6/21 Subir S 
>
> > Hi,
> >
> > Is it possible to implement transpose operation of rows into columns and
> > vice versa...
> >
> >
> > i.e.
> >
> > col1 col2 col3
> > col4 col5 col6
> > col7 col8 col9
> > col10 col11 col12
> >
> > can this be converted to
> >
> > col1 col4 col7 col10
> > col2 col5 col8 col11
> > col3 col6 col9 col12
> >
> > Is this even possible with map reduce? If yes, which language helps to
> > achieve this faster?
> >
> > Thanks
> >
>
>
>
> --
> https://github.com/zinnia-phatak-dev/Nectar
>



Re: Yahoo Hadoop Tutorial with new APIs?

2012-06-04 Thread Robert Evans
I am happy to announce that I was able to get the license on the Yahoo! Hadoop 
tutorial updated from Creative Commons Attribution 3.0 Unported License to 
Apache 2.0.  I have filed HADOOP-8477 
<https://issues.apache.org/jira/browse/HADOOP-8477> to pull the tutorial into 
the Hadoop project, and to update it accordingly.  I am going to be very busy 
the next little while and I am hoping that those in the community that want 
this can help drive it and possibly break it down into subtasks to get the 
tutorial up to date.  I am very happy to help out, but like I said it may be a 
while before I am able to do much on this.

--Bobby Evans

On 4/4/12 4:43 PM, "Marcos Ortiz"  wrote:

 Ok, Robert, I will be waiting for you then. There are many folks that use this
tutorial, so I think this is a good effort in favor of the Hadoop community. It
would be nice if Yahoo! donated this work, because I have some ideas behind
this, for example releasing a Spanish version of the tutorial.
 Regards and best wishes

 On 04/04/2012 05:29 PM, Robert Evans wrote:
I am dropping the cross posts and
leaving this on common-user with the others BCCed.

 Marcos,

 That is a great idea to be able to update the tutorial, especially if the 
community is interested in helping to do so.  We are looking into the best way 
to do this.  The idea right now is to donate this to the Hadoop project so that 
the community can keep it up to date, but we need some time to jump through all 
of the corporate hoops to get this to happen.  We have a lot going on right 
now, so if you don't see any progress on this please feel free to ping me and 
bug me about it.

 --
 Bobby Evans


 On 4/4/12 8:15 AM, "Jagat Singh"  wrote:


Hello Marcos

  Yes, the Yahoo tutorials are pretty old, but they still explain the concepts of
Map Reduce and HDFS beautifully. The way in which the tutorials have been divided
into sub-sections, each building on the previous one, is awesome. I remember when
I started I was dug in there for many days. The tutorials are now lagging from the
new API point of view.

  Let's have a documentation session one day; I would love to volunteer to
update those tutorials if the people at Yahoo take input from the outside world :)

  Regards,

  Jagat

 - Original Message -
 From: Marcos Ortiz
 Sent: 04/04/12 08:32 AM
 To: common-user@hadoop.apache.org, hdfs-u...@hadoop.apache.org, mapreduce-u...@hadoop.apache.org
 Subject: Yahoo Hadoop Tutorial with new APIs?

 Regards to all the list.
  There are many people that use the Hadoop Tutorial released by Yahoo at
http://developer.yahoo.com/hadoop/tutorial/ (for example,
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining).
 The main issue here is that this tutorial is written with the old APIs
(Hadoop 0.18, I think).
  Is there a project to update this tutorial to the new APIs, to Hadoop 1.0.2
or YARN (Hadoop 0.23)?

  Best wishes
  -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI 
http://marcosluis2186.posterous.com
  http://www.uci.cu/








Re: Pragmatic cluster backup strategies?

2012-05-30 Thread Robert Evans
I am not an expert on the trash so you probably want to verify everything I am 
about to say.  I believe that trash acts oddly when you try to use it to delete 
a trash directory.  Quotas can potentially get off when doing this, but I think 
it still deletes the directory.  Trash is a nice feature, but I wouldn't trust 
it as a true backup.  I just don't think it is mature enough for something like 
that.  There are enough issues with quotas that sadly most of our users almost 
always add -skipTrash all the time.

Where I work we do a combination of several different things depending on the 
project and their requirements.  In some cases where there are government 
regulations involved we do regular tape backups.  In other cases we keep the 
original data around for some time and can re-import it to HDFS if necessary.  
In other cases we will copy the data, to multiple Hadoop clusters.  This is 
usually for the case where we want to do Hot/Warm failover between clusters.  
Now we may be different from most other users because we do run lots of 
different projects on lots of different clusters.

--Bobby Evans

On 5/30/12 1:31 AM, "Darrell Taylor"  wrote:

Will "hadoop fs -rm -rf" move everything to the the /trash directory or
will it delete that as well?

I was thinking along the lines of what you suggest, keep the original
source of the data somewhere and then reprocess it all in the event of a
problem.

What do other people do?  Do you run another cluster?  Do you backup
specific parts of the cluster?  Some form of offsite SAN?

On Tue, May 29, 2012 at 6:02 PM, Robert Evans  wrote:

> Yes you will have redundancy, so no single point of hardware failure can
> wipe out your data, short of a major catastrophe.  But you can still have
> an errant or malicious "hadoop fs -rm -rf" shut you down.  If you still
> have the original source of your data somewhere else you may be able to
> recover, by reprocessing the data, but if this cluster is your single
> repository for all your data you may have a problem.
>
> --Bobby Evans
>
> On 5/29/12 11:40 AM, "Michael Segel"  wrote:
>
> Hi,
> That's not a back up strategy.
> You could still have joe luser take out a key file or directory. What do
> you do then?
>
> On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:
>
> > Hi,
> >
> > We are about to build a 10 machine cluster with 40Tb of storage,
> obviously
> > as this gets full actually trying to create an offsite backup becomes a
> > problem unless we build another 10 machine cluster (too expensive right
> > now).  Not sure if it will help but we have planned the cabinet into an
> > upper and lower half with separate redundant power, then we plan to put
> > half of the cluster in the top, half in the bottom, effectively 2 racks,
> so
> > in theory we could lose half the cluster and still have the copies of all
> > the blocks with a replication factor of 3?  Apart from the data centre
> > burning down or some other disaster that would render the machines
> totally
> > unrecoverable, is this approach good enough?
> >
> > I realise this is a very open question and everyone's circumstances are
> > different, but I'm wondering what other peoples experiences/opinions are
> > for backing up cluster data?
> >
> > Thanks
> > Darrell.
>
>
>



Re: How to mapreduce in the scenario

2012-05-29 Thread Robert Evans
Yes you can do it.  In pig you would write something like

A = load 'a.txt' as (id, name, age, ...);
B = load 'b.txt' as (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C into 'c.txt';

Hive can do it similarly too.  Or you could write your own directly in
map/reduce or using the data_join jar.
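
For reference, a rough reduce-side join sketch in plain map/reduce, assuming the
comma-separated a.txt/b.txt layouts below; the class names and the "A"/"B" tags
are made up for illustration, and the two mappers would be wired to their inputs
with MultipleInputs in the driver:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinByIdSketch {
    // Tags each a.txt record with "A" and keys it by id.
    public static class AMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            int comma = line.indexOf(',');
            context.write(new Text(line.substring(0, comma)),
                          new Text("A\t" + line.substring(comma + 1)));
        }
    }

    // Tags each b.txt record with "B" and keys it by id.
    public static class BMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            int comma = line.indexOf(',');
            context.write(new Text(line.substring(0, comma)),
                          new Text("B\t" + line.substring(comma + 1)));
        }
    }

    // Joins the two sides for each id (inner join: emits nothing if a side is empty).
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text id, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> aSide = new ArrayList<String>();
            List<String> bSide = new ArrayList<String>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("A\t")) {
                    aSide.add(s.substring(2));
                } else {
                    bSide.add(s.substring(2));
                }
            }
            for (String a : aSide) {
                for (String b : bSide) {
                    context.write(id, new Text(a + "," + b));
                }
            }
        }
    }
}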

--Bobby Evans

On 5/29/12 4:08 AM, "lzg"  wrote:

Hi,

I wonder if Hadoop can effectively solve a question like the following:

==
input file: a.txt, b.txt
result: c.txt

a.txt:
id1,name1,age1,...
id2,name2,age2,...
id3,name3,age3,...
id4,name4,age4,...

b.txt:
id1,address1,...
id2,address2,...
id3,address3,...

c.txt
id1,name1,age1,address1,...
id2,name2,age2,address2,...


I know that it can be done well by database.
But I want to handle it with hadoop if possible.
Can hadoop meet the requirement?

Any suggestion can help me. Thank you very much!

Best Regards,

Gump




Re: Pragmatic cluster backup strategies?

2012-05-29 Thread Robert Evans
Yes you will have redundancy, so no single point of hardware failure can wipe 
out your data, short of a major catastrophe.  But you can still have an errant 
or malicious "hadoop fs -rm -rf" shut you down.  If you still have the original 
source of your data somewhere else you may be able to recover, by reprocessing 
the data, but if this cluster is your single repository for all your data you 
may have a problem.

--Bobby Evans

On 5/29/12 11:40 AM, "Michael Segel"  wrote:

Hi,
That's not a back up strategy.
You could still have joe luser take out a key file or directory. What do you do 
then?

On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:

> Hi,
>
> We are about to build a 10 machine cluster with 40Tb of storage, obviously
> as this gets full actually trying to create an offsite backup becomes a
> problem unless we build another 10 machine cluster (too expensive right
> now).  Not sure if it will help but we have planned the cabinet into an
> upper and lower half with separate redundant power, then we plan to put
> half of the cluster in the top, half in the bottom, effectively 2 racks, so
> in theory we could lose half the cluster and still have the copies of all
> the blocks with a replication factor of 3?  Apart from the data centre
> burning down or some other disaster that would render the machines totally
> unrecoverable, is this approach good enough?
>
> I realise this is a very open question and everyone's circumstances are
> different, but I'm wondering what other peoples experiences/opinions are
> for backing up cluster data?
>
> Thanks
> Darrell.




Re: Stream data processing

2012-05-22 Thread Robert Evans
If you want the results to come out instantly Map/Reduce is not the proper 
choice.  Map/Reduce is designed for batch processing.  It can do small batches, 
but the overhead of launching the map/reduce jobs can be very high compared to
the amount of processing you are doing.  I personally would look into using 
either Storm, S4, or some other realtime stream processing framework.  From 
what you have said it sounds like you probably want to use Storm, as it can be 
used to guarantee that each event is processed once and only once.  You can 
also store your results into HDFS if you want, perhaps through HBASE, if you 
need to do further processing on the data.

--Bobby Evans

On 5/22/12 5:02 AM, "Zhiwei Lin"  wrote:

Hi Robert,
Thank you.
How quickly do you have to get the result out once the new data is added?
If possible, I hope to get the result instantly.

How far back in time do you have to look for  from the occurrence of
?
The time slot is not constant. It depends on the "last" occurrence of 
in front of .  So, I need to look up the history to get the last 
in this case.

Do you have to do this for all combinations of values or is it just a small
subset of values?
I think this depends on the time of last occurrence of  in the history.
If  rarely occurred, then the early stage data has to be taken into
account.

Definitely, I think HDFS is a good place to store the data I have (the size
of daily log is above 1GB). But I am not sure if Map/Reduce can help to
handle the stated problem.

Zhiwei


On 21 May 2012 22:07, Robert Evans  wrote:

> Zhiwei,
>
> How quickly do you have to get the result out once the new data is added?
>  How far back in time do you have to look for  from the occurrence of
> ?  Do you have to do this for all combinations of values or is it just
> a small subset of values?
>
> --Bobby Evans
>
> On 5/21/12 3:01 PM, "Zhiwei Lin"  wrote:
>
> I have large volume of stream log data. Each data record contains a time
> stamp, which is very important to the analysis.
> For example, I have data format like this:
> (1) 20:30:21 01/April/2012A.
> (2) 20:30:51 01/April/2012.
> (3) 21:30:21 01/April/2012.
>
> Moreover, new data comes every few minutes.
> I have to calculate the probability of the occurrence "" given the
> occurrence of "" (where  occurs earlier than ). So, it is
> really time-dependant.
>
> I wonder if Hadoop  is the right platform for this job? Is there any
> package available for this kind of work?
>
> Thank you.
>
> Zhiwei
>
>


--

Best wishes.

Zhiwei



Re: Stream data processing

2012-05-21 Thread Robert Evans
Zhiwei,

How quickly do you have to get the result out once the new data is added?  How 
far back in time do you have to look for  from the occurrence of ?  Do 
you have to do this for all combinations of values or is it just a small subset 
of values?

--Bobby Evans

On 5/21/12 3:01 PM, "Zhiwei Lin"  wrote:

I have large volume of stream log data. Each data record contains a time
stamp, which is very important to the analysis.
For example, I have data format like this:
(1) 20:30:21 01/April/2012A.
(2) 20:30:51 01/April/2012.
(3) 21:30:21 01/April/2012.

Moreover, new data comes every few minutes.
I have to calculate the probability of the occurrence "" given the
occurrence of "" (where  occurs earlier than ). So, it is
really time-dependant.

I wonder if Hadoop  is the right platform for this job? Is there any
package available for this kind of work?

Thank you.

Zhiwei



Re: Transfer archives (or any file) from Mapper to Reducer?

2012-05-21 Thread Robert Evans
Be careful putting them in HDFS.  It does not scale very well, as the number of 
file opens will be on the order of Number of Mappers * Number of Reducers.  You 
can quickly do a denial of service on the namenode if you have a lot of mappers 
and reducers.

--Bobby Evans
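
If the archives are small enough to ship through the shuffle as key/value pairs, as Harsh suggests below, one possible sketch looks like this (the local archive path, HDFS output path, and class names are made up, and it assumes each archive fits comfortably in memory):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ArchiveShipper {

  // After the external script has produced an archive on the local disk,
  // ship its bytes through the shuffle under a well-known key.
  public static class ArchiveMapper extends Mapper<Object, Text, Text, BytesWritable> {
    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      File archive = new File("/tmp/result.zip");   // wherever the script wrote it
      byte[] bytes = new byte[(int) archive.length()];
      DataInputStream in = new DataInputStream(new FileInputStream(archive));
      in.readFully(bytes);
      in.close();
      ctx.write(new Text("archive"), new BytesWritable(bytes));
    }
  }

  // With a single reducer, every shipped archive arrives here and can be
  // written back out to HDFS.
  public static class ArchiveReducer extends Reducer<Text, BytesWritable, Text, Text> {
    public void reduce(Text key, Iterable<BytesWritable> values, Context ctx)
        throws IOException, InterruptedException {
      FileSystem fs = FileSystem.get(ctx.getConfiguration());
      int i = 0;
      for (BytesWritable bw : values) {
        Path out = new Path("/user/lehel/archives/archive-" + (i++) + ".zip");
        FSDataOutputStream os = fs.create(out);
        os.write(bw.getBytes(), 0, bw.getLength());
        os.close();
        ctx.write(key, new Text(out.toString()));  // record where it landed
      }
    }
  }
}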

On 5/21/12 4:02 AM, "Harsh J"  wrote:

Biro,

I guess you could write these archives onto HDFS, and have your
reducers read it from a location there, but this method may be a bit
ugly. See 
http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
for properly writing files from tasks onto a DFS, or look at
MultipleOutputs API class.

Depending on how large these files are, you can also perhaps ship them
in via the KV pairs itself. A custom key or sort comparator can
further ensure that they are delivered in the first iteration of the
reducer - if the file is required before regular reduce() ops can
begin.

On Mon, May 21, 2012 at 1:42 PM, biro lehel  wrote:
> Dear all,
>
> In my Mapper, I run a script that processes my set of input text files, 
> creates from them some other text files (this is done locally on the FS on my 
> nodes), and as a result, each MapTask will produce an archive as a result. My 
> issue is, that I'm looking for a way for the Reducer to "take" these archives 
> as some kind of an input. I understood that the communication between 
> Mapper-Reducer is done through the means of the key-value pairs in the 
> Context, but what I would need is the transferring of these archive files to 
> the respective Reducer (I would probably have one single Reducer, so all the 
> files should be transferred/copied there somehow).
>
> Is this possible? Is there a way to transfer files from Mapper to Reducer? If 
> not, what is the best approach in scenarios like mine? Any suggestions would 
> be greatly appreciated.
>
> Thank you in advance,
> Lehel.
>
>
>



--
Harsh J



Re: freeze a mapreduce job

2012-05-11 Thread Robert Evans
There is an idle timeout for map/reduce tasks.  If a task makes no progress for 
10 min (Default) the AM will kill it on 2.0 and the JT will kill it on 1.0.  
But I don't know of anything associated with a Job, other than in 0.23: if the 
AM does not heartbeat back in for too long, I believe that the RM may kill it 
and retry, but I don't know for sure.

--Bobby Evans
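
For the scheduler-based freezing that Harsh describes below, the main prerequisite is that the job lands in a dedicated pool or queue you can later constrict. A minimal sketch (assuming the MR1 FairScheduler and that your scheduler config honours the mapred.fairscheduler.pool property; the pool and job names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FreezablePoolJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Route this job to a dedicated FairScheduler pool so it can later be
    // throttled (maxMaps/maxReduces in the allocations file) without
    // touching any other job on the cluster.
    conf.set("mapred.fairscheduler.pool", "long-running");

    Job job = new Job(conf, "multi-day job");
    job.setJarByClass(FreezablePoolJob.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // ... the usual mapper/reducer setup goes here ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}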

On 5/11/12 10:53 AM, "Harsh J"  wrote:

Am not aware of a job-level timeout or idle monitor.

On Fri, May 11, 2012 at 7:33 PM, Shi Yu  wrote:
> Is there any risk in suppressing a job for too long in FS? I guess there are
> some parameters to control the waiting time of a job (such as a timeout,
> etc.); for example, if a job is kept idle for more than 24 hours, is there
> a configuration deciding whether to kill or keep that job?
>
> Shi
>
>
> On 5/11/2012 6:52 AM, Rita wrote:
>>
>> thanks.  I think I will investigate capacity scheduler.
>>
>>
>> On Fri, May 11, 2012 at 7:26 AM, Michael
>> Segelwrote:
>>
>>> Just a quick note...
>>>
>>> If your task is currently occupying a slot,  the only way to release the
>>> slot is to kill the specific task.
>>> If you are using FS, you can move the task to another queue and/or you
>>> can
>>> lower the job's priority which will cause new tasks to spawn  slower than
>>> other jobs so you will eventually free up the cluster.
>>>
>>> There isn't a way to 'freeze' or stop a job mid state.
>>>
>>> Is the issue that the job has a large number of slots, or is it an issue
>>> of the individual tasks taking a  long time to complete?
>>>
>>> If its the latter, you will probably want to go to a capacity scheduler
>>> over the fair scheduler.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On May 11, 2012, at 6:08 AM, Harsh J wrote:
>>>
 I do not know about the per-host slot control (that is most likely not
 supported, or not yet anyway - and perhaps feels wrong to do), but the
 rest of the needs can be doable if you use schedulers and
 queues/pools.

 If you use FairScheduler (FS), ensure that this job always goes to a
 special pool and when you want to freeze the pool simply set the
 pool's maxMaps and maxReduces to 0. Likewise, control max simultaneous
 tasks as you wish, to constrict instead of freeze. When you make
 changes to the FairScheduler configs, you do not need to restart the
 JT, and you may simply wait a few seconds for FairScheduler to refresh
 its own configs.

 More on FS at
>>>
>>> http://hadoop.apache.org/common/docs/current/fair_scheduler.html

 If you use CapacityScheduler (CS), then I believe you can do this by
 again making sure the job goes to a specific queue, and when needed to
 freeze it, simply set the queue's maximum-capacity to 0 (percentage)
 or to constrict it, choose a lower, positive percentage value as you
 need. You can also refresh CS to pick up config changes by refreshing
 queues via mradmin.

 More on CS at
>>>
>>> http://hadoop.apache.org/common/docs/current/capacity_scheduler.html

 Either approach will not freeze/constrict the job immediately, but
 should certainly prevent it from progressing. Meaning, their existing
 running tasks during the time of changes made to scheduler config will
 continue to run till completion but further tasks scheduling from
 those jobs shall begin seeing effect of the changes made.

 P.s. A better solution would be to make your job not take as many
 days, somehow? :-)

 On Fri, May 11, 2012 at 4:13 PM, Rita  wrote:
>
> I have a rather large map reduce job which takes few days. I was
>>>
>>> wondering
>
> if its possible for me to freeze the job or make the job less
>>>
>>> intensive. Is
>
> it possible to reduce the number of slots per host and then I can
>>>
>>> increase
>
> them overnight?
>
>
> tia
>
> --
> --- Get your facts first, then you can distort them as you please.--



 --
 Harsh J

>>>
>>
>



--
Harsh J



Re: Need to improve documentation for v 0.23.x ( v 2.x)

2012-05-09 Thread Robert Evans
Yes I am the one that said I would look into releasing the Yahoo documentation. 
 Thanks for reminding me.  I have been a bit distracted by some 
"restructuring" that has happened recently, but I will get on that.

--Bobby

On 5/8/12 12:30 PM, "Arun C Murthy"  wrote:

Thanks for offering to provide doc-patches Jagat!

On May 7, 2012, at 9:04 AM, Jagat wrote:

> Hello Bobby,
>
> Yes, I will file a couple of JIRAs and work on them in the coming few days to
> write how to set things up, and include the basics which were very good in the
> old documentation; there are many good things which we can take up from the
> 1.0.2 documentation also.
>
> A few days back the Yahoo team was saying that they could donate documentation to
> the community which, although old, can be improved on and included within Apache.
> Maybe we can get some update confirming permission to use the same.
>
> One of the project which is good in documentation ( wiki ) is Apache Mahout
> we can learn from them , lots of extensible references , presentations ,
> tutorials all at one place at wiki to refer.
>
>
>
> On Mon, May 7, 2012 at 9:19 PM, Robert Evans  wrote:
>
>> I agree that better documentation is almost always needed.  The problem is
>> in finding the time to really make this happen.  If you or anyone else here
>> wants to help out with this effort please feel free to file JIRAs and
>> submit patches to improve the documentation.  Even if all the patch is, is
>> a copy/paste of information from the 1.0.2 documentation that is still
>> relevant for 2.0.
>>
>> --Bobby Evans
>>
>> On 5/4/12 2:21 PM, "Jagat"  wrote:
>>
>> Hello All,
>>
>> As Apache Hadoop community is ready to release the next 2.0 alpha version
>> of Hadoop , i would like to bring attention towards need to make better
>> documentation of the tutorials and examples for the same.
>>
>> Just one short example
>>
>> See the Single Node Setup tutorials for v1.x
>> (http://hadoop.apache.org/common/docs/r1.0.2/single_node_setup.html) and
>> v0.23
>> (http://hadoop.apache.org/common/docs/r0.23.1/hadoop-yarn/hadoop-yarn-site/SingleCluster.html);
>> you could say the 0.23 author was in a hurry, assuming that the reader
>> already knows what to do and where to do it.
>>
>> We should spend some time on documentation , with so many beautiful
>> features coming it would be great if you guys plan some special hackathon
>> meetings to improve its documentation , code examples so that people can
>> understand how to use them effectively.
>>
>> At present only two people can understand 0.23 , those who wrote the code
>> and the other one is java compiler who is verifying its code :)
>>
>> *Tom White , *I request if you are reading this message , please pick-up
>> your pen again to write Hadoop Definitive Guide edition 4th dedicated to
>> next release for greater benefit of community.
>>
>> Thanks
>>
>>

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/





Re: Need to improve documentation for v 0.23.x ( v 2.x)

2012-05-07 Thread Robert Evans
I agree that better documentation is almost always needed.  The problem is in 
finding the time to really make this happen.  If you or anyone else here wants 
to help out with this effort please feel free to file JIRAs and submit patches 
to improve the documentation.  Even if all the patch is, is a copy/paste of 
information from the 1.0.2 documentation that is still relevant for 2.0.

--Bobby Evans

On 5/4/12 2:21 PM, "Jagat"  wrote:

Hello All,

As Apache Hadoop community is ready to release the next 2.0 alpha version
of Hadoop , i would like to bring attention towards need to make better
documentation of the tutorials and examples for the same.

Just one short example

See the Single Node Setup tutorials for v1.x
(http://hadoop.apache.org/common/docs/r1.0.2/single_node_setup.html) and
v0.23
(http://hadoop.apache.org/common/docs/r0.23.1/hadoop-yarn/hadoop-yarn-site/SingleCluster.html);
you could say the 0.23 author was in a hurry, assuming that the reader
already knows what to do and where to do it.

We should spend some time on documentation , with so many beautiful
features coming it would be great if you guys plan some special hackathon
meetings to improve its documentation , code examples so that people can
understand how to use them effectively.

At present only two people can understand 0.23 , those who wrote the code
and the other one is java compiler who is verifying its code :)

*Tom White , *I request if you are reading this message , please pick-up
your pen again to write Hadoop Definitive Guide edition 4th dedicated to
next release for greater benefit of community.

Thanks



Re: cannot use a map side join to merge the output of multiple map side joins

2012-05-07 Thread Robert Evans
I believe that you are correct about the split processing.  It orders the 
splits by size so that the largest splits are processed first.  This allows for 
the smaller splits to potentially fill in the gaps.  As far as a fix is 
concerned I think overriding the file name in the file output committer is a 
much more straightforward solution to the issue.

--Bobby Evans

On 5/5/12 10:50 AM, "Jim Donofrio"  wrote:

I am trying to use a map side join to merge the output of multiple map
side joins. This is failing because of the below code in
JobClient.writeOldSplits which reorders the splits from largest to
smallest. Why is that done, is that so that the largest split which will
take the longest gets processed first?

Each map side join then fails to name its part-* files with the same
number as the incoming partition, so files named part-0 that go
into the first map side join get output as part-00010, while another
one of the first level map side joins sends files named part-0 to
part-5. The second level map side join then does not get the input
splits in partitioner order from each first level map side join output
directory.

I can think of only 2 fixes: add some conf property to allow turning off
the below sorting, OR extend FileOutputCommitter to rename the outputs of
the first level map side join to merge_part-<original partition
number>. Any other solutions?

    // sort the splits into order based on size, so that the biggest
    // go first
    Arrays.sort(splits, new Comparator<org.apache.hadoop.mapred.InputSplit>() {
      public int compare(org.apache.hadoop.mapred.InputSplit a,
                         org.apache.hadoop.mapred.InputSplit b) {
        try {
          long left = a.getLength();
          long right = b.getLength();
          if (left == right) {
            return 0;
          } else if (left < right) {
            return 1;
          } else {
            return -1;
          }



Re: hadoop streaming using a java program as mapper

2012-05-02 Thread Robert Evans
Do you have the error message from running java?  You can use myMapper.sh to 
help you debug what is happening and log it.  Stderr of myMapper.sh is 
logged and you can get to it.  You can run shell commands like find, ls, and 
you can probably look at any error messages that java produced while trying to 
run.  Things like class not found exceptions.

--Bobby Evans


On 5/2/12 12:25 AM, "Boyu Zhang"  wrote:

Yes, I did, the myMapper.sh is executed, the problem is inside this
myMapper.sh, it calls a java program named myJava, the myJava did not get
executed on slaves, and I shipped myJava.class too.

Thanks,
Boyu

On Wed, May 2, 2012 at 1:20 AM, 黄 山  wrote:

> have you shipped myMapper.sh to each node?
>
> thuhuang...@gmail.com
>
>
>
> On 2012-5-2, at 1:17 PM, Boyu Zhang wrote:
>
> > Hi All,
> >
> > I am in a little bit strange situation, I am using Hadoop streaming to
> run
> > a bash shell program myMapper.sh, and in the myMapper.sh, it calls a java
> > program, then a R program, then output intermediate key, values. I used
> > -file option to ship the java and R files, but the java program was not
> > executed by the streaming. The myMapper.sh has something like this:
> >
> > java myJava arguments
> >
> > And in the streaming command, I use something like this:
> >
> > hadoop jar /opt/hadoop/hadoop-0.20.2-streaming.jar -D
> mapred.reduce.tasks=0
> > -input /user/input -output /user/output7 -mapper ./myMapper.sh -file
> > myJava.class  -verbose
> >
> > And the myJava program is not run when I execute like this, and if I go
> to
> > the actual slave node to check the files, the myMapper.sh is shipped to
> the
> > slave node, but the myJava.class is not, it is inside the job.jar file.
> >
> > Can someone provide some insights on how to run a java program through
> > hadoop streaming? Thanks!
> >
> > Boyu
>
>



Re: Node-wide Combiner

2012-04-30 Thread Robert Evans
Do you mean that when multiple map tasks run on the same node, there is a 
combiner that will run across all of their output?  There is nothing for that 
right now.  It seems like it could be somewhat difficult to get right given the 
current architecture.

--Bobby Evans


On 4/27/12 11:13 PM, "Superymk"  wrote:

Hi all,

I am a newbie in Hadoop and I like the system. I have one question: Is
there a node-wide combiner or something similar in Hadoop? I think it
can reduce the number of intermediate results in further. Any hint?

Thanks a lot!

Superymk



Re: KMeans clustering on Hadoop infrastructure

2012-04-30 Thread Robert Evans
You are likely going to get more help from talking to the Mahout mailing list.

https://cwiki.apache.org/confluence/display/MAHOUT/Mailing+Lists,+IRC+and+Archives

--Bobby Evans

On 4/28/12 7:45 AM, "Lukáš Kryške"  wrote:






Hello,
I am successfully running the K-Means clustering sample from the 'Mahout In Action' 
book (example in Chapter 7.3) in my Hadoop environment. Now I need to extend the 
program to take the vectors from a file located in my HDFS. I need to process 
clustering of millions or billions of vectors which are represented by 
comma-separated values in a .txt file in HDFS. Data are stored in this pattern:
x1,y1
x2,y2
...
xn,yn
As I understood from the book, I need to transform my .txt file with vectors 
into Hadoop's SequenceFile first - how do I do it most efficiently? And how do I 
tell the KMeansDriver that the input path contains a SequenceFile with vectors?

Thanks for help.

Best Regards,
Lukas Kryske




Re: reducers and data locality

2012-04-27 Thread Robert Evans
Also generating random keys/partitions can be problematic, although the 
problems are rare.  A mapper can be restarted after it finishes successfully if 
the machine it was on goes down or has other problems so that the reducers are 
not able to get that mapper's output data.  If this happens while some of the 
reducers have finished fetching it, but not all of them, and the new mapper 
partitions things differently some records may show up twice in your output and 
others not at all.

If you do something like random for the partitioning make sure that you use a 
constant seed so that it is deterministic.

--Bobby Evans
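
As a small illustration of that last point, here is a sketch of a mapper that derives the shard deterministically from the record instead of from an unseeded Random (the shard count and class name are made up):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Instead of drawing the shard from an unseeded Random, derive it from the
// record itself: a re-executed mapper then produces exactly the same
// (shard, line) pairs, so no record is duplicated or lost on a retry.
public class ShardingMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  private static final int NUM_SHARDS = 32;   // illustrative shard count
  private final IntWritable shard = new IntWritable();

  @Override
  public void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    // ... per-line processing goes here ...
    // Text.hashCode() hashes the bytes, so it is stable across JVMs and reruns.
    shard.set((line.hashCode() & Integer.MAX_VALUE) % NUM_SHARDS);
    ctx.write(shard, line);
  }
}

With IntWritable keys the default HashPartitioner already routes each shard number to a fixed reducer; a custom Partitioner, as Bejoy suggests below, is where host-aware routing would go if you really need it.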

On 4/27/12 4:24 AM, "Bejoy KS"  wrote:

Hi Mete

A custom Paritioner class can control the flow of keys to the desired reducer. 
It gives you more control on which key to which reducer.


Regards
Bejoy KS

Sent from handheld, please excuse typos.

-Original Message-
From: mete 
Date: Fri, 27 Apr 2012 09:19:21
To: 
Reply-To: common-user@hadoop.apache.org
Subject: reducers and data locality

Hello folks,

I have a lot of input splits (10k-50k - 128 mb blocks) which contains text
files. I need to process those line by line, then copy the result into
roughly equal size of "shards".

So i generate a random key (from a range of [0:numberOfShards]) which is
used to route the map output to different reducers and the size is more
less equal.

I know that this is not really efficient and i was wondering if i could
somehow control how keys are routed.
For example could i generate the randomKeys with hostname prefixes and
control which keys are sent to each reducer? What do you think?

Kind regards
Mete




Re: Text Analysis

2012-04-25 Thread Robert Evans
Hadoop itself is the core Map/Reduce and HDFS functionality.  The higher level 
algorithms like sentiment analysis are often done by others.  Cloudera has a 
video from HadoopWorld 2010 about it

http://www.cloudera.com/resource/hw10_video_sentiment_analysis_powered_by_hadoop/

And there are likely to be other tools like R that can help you out with it.  I 
am not really sure if mahout offers sentiment analysis or not, but you might 
want to look there too http://mahout.apache.org/

--Bobby Evans


On 4/25/12 7:50 AM, "karanveer.si...@barclays.com" 
 wrote:

Hi,

I wanted to know if there are any existing API's within Hadoop for us to do 
some text analysis like sentiment analysis, etc. OR are we to rely on tools 
like R, etc. for this.


Regards,
Karanveer








Re: isSplitable() problem

2012-04-24 Thread Robert Evans
The current code guarantees that they will be received in order.  There are some 
patches that are likely to go in soon that would allow for the JVM itself to be 
reused.  In those cases I believe that the mapper class would be recreated, so 
the only thing you would have to worry about would be static values that are 
updated while processing the data.

-- Bobby Evans
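
For reference, a rough sketch of the cleanup()-based aggregation discussed below (assuming one whole file per unsplit input split and that all of its lines fit in memory; the class name is made up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Buffers every line of the (unsplit) file and emits one combined record when
// the framework calls cleanup() after the last input record of the split.
public class WholeSplitMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
  private final List<String> lines = new ArrayList<String>();

  @Override
  public void map(LongWritable offset, Text line, Context ctx) {
    lines.add(line.toString());   // lines arrive in file order within a split
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // Runs exactly once, after the last map() call for this split.
    StringBuilder joined = new StringBuilder();
    for (String l : lines) {
      joined.append(l).append('\n');
    }
    ctx.write(NullWritable.get(), new Text(joined.toString()));
  }
}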

On 4/24/12 4:45 AM, "Dan Drew"  wrote:

I have chosen to use Jay's suggestion as a quick workaround and am pleased
to report that it seems to work well on small test inputs.

My question now is, are the mappers guaranteed to receive the file's lines
in order?

Browsing the source suggests this is so, but I just want to make sure as my
understanding of Hadoop is transubstantial.

Thank you for your patience in answering my questions.

On 23 April 2012 14:28, Harsh J  wrote:

> Jay,
>
> On Mon, Apr 23, 2012 at 6:43 PM, JAX  wrote:
> > Curious : Seems like you could aggregate the results in the mapper as a
> local variable or list of strings--- is there a way to know that your
> mapper has just read the LAST line of an input split?
>
> True. Can be one way to do it (unless aggregation of 'records' needs
> to happen live, and you don't wish to store it all in memory).
>
> > Is there a "cleanup" or "finalize" method in mappers that is run at the
> end of a whole steam read to support these sort of chunked, in memor map/r
> operations?
>
> Yes there is. See:
>
> Old API:
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Mapper.html
> (See Closeable's close())
>
> New API:
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.html#cleanup(org.apache.hadoop.mapreduce.Mapper.Context)
>
>
> --
> Harsh J
>



Re: Accessing global Counters

2012-04-20 Thread Robert Evans
There was a discussion about this several months ago

http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201112.mbox/%3CCADYHM8xiw8_bF=zqe-bagdfz6r3tob0aof9viozgtzeqgkp...@mail.gmail.com%3E

The conclusion is that if you want to read them from the reducer you are going 
to have to do something special until someone finds time to implement it as 
part of.

https://issues.apache.org/jira/browse/MAPREDUCE-3520

--Bobby Evans
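
For completeness, a small sketch of the usual enum-counter pattern in the new API (the enum and class names are made up): counters incremented in the tasks are read back in the driver after the job finishes, which is the part that works today; reading them live inside a reducer is what the JIRA above covers.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CounterExample {

  // User-defined counters show up under this enum's class name in the job UI.
  public enum Records { GOOD, BAD }

  public static class CountingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      if (value.toString().contains(",")) {
        ctx.getCounter(Records.GOOD).increment(1);
      } else {
        ctx.getCounter(Records.BAD).increment(1);
      }
    }
  }

  // In the driver, after job.waitForCompletion(true), the aggregated totals
  // from all tasks are available:
  static void printCounters(Job job) throws Exception {
    Counter good = job.getCounters().findCounter(Records.GOOD);
    Counter bad = job.getCounters().findCounter(Records.BAD);
    System.out.println("good=" + good.getValue() + " bad=" + bad.getValue());
  }
}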


On 4/20/12 11:36 AM, "Amith D K"  wrote:

Yes, you can use a user-defined counter as Jagat suggested.

Counters can be enums as Jagat described, or any string; the latter are called 
dynamic counters.

It is easier to use enum counters than dynamic counters; finally, it depends on 
your use case :)

Amith

From: Jagat [jagatsi...@gmail.com]
Sent: Saturday, April 21, 2012 12:25 AM
To: common-user@hadoop.apache.org
Subject: Re: Accessing global Counters

Hi

You can create your own counters like

enum CountFruits {
Apple,
Mango,
Banana
}


And in your mapper class when you see condition to increment , you can use
Reporter incrCounter method to do the same.

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html#incrCounter(java.lang.Enum,%20long)

e.g
// I saw Apple increment it by one
reporter.incrCounter(CountFruits.Apple,1);

Now you can access them using job.getCounters

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#getCounters()

Hope this helps

Regards,

Jagat Singh


On Fri, Apr 20, 2012 at 9:43 PM, Gayatri Rao  wrote:

> Hi All,
>
> Is there a way for me to set global counters in Mapper and access them from
> reducer?
> Could you suggest how I can achieve this?
>
> Thanks
> Gayatri
>



Re: remote job submission

2012-04-20 Thread Robert Evans
You can use Oozie to do it.
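
As an illustration of what "point your local configuration at the cluster" means in Harsh's reply below, a minimal sketch (MR1-era property names; the host names and paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Submitting from a remote machine only needs a client-side config that
// points at the cluster's NameNode and JobTracker; the job jar is shipped
// to the cluster for you.
public class RemoteSubmit {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode.example.com:8020");   // placeholder host
    conf.set("mapred.job.tracker", "jobtracker.example.com:8021");     // placeholder host

    Job job = new Job(conf, "remotely submitted job");
    job.setJarByClass(RemoteSubmit.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}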


On 4/20/12 8:45 AM, "Arindam Choudhury"  wrote:

Sorry. But can you give me an example?

On Fri, Apr 20, 2012 at 3:08 PM, Harsh J  wrote:

> Arindam,
>
> If your machine can access the clusters' NN/JT/DN ports, then you can
> simply run your job from the machine itself.
>
> On Fri, Apr 20, 2012 at 6:31 PM, Arindam Choudhury
>  wrote:
> > "If you are allowed a remote connection to the cluster's service ports,
> > then you can directly submit your jobs from your local CLI. Just make
> > sure your local configuration points to the right locations."
> >
> > Can you elaborate in details please?
> >
> > On Fri, Apr 20, 2012 at 2:20 PM, Harsh J  wrote:
> >
> >> If you are allowed a remote connection to the cluster's service ports,
> >> then you can directly submit your jobs from your local CLI. Just make
> >> sure your local configuration points to the right locations.
> >>
> >> Otherwise, perhaps you can choose to use Apache Oozie (Incubating)
> >> (http://incubator.apache.org/oozie/) It does provide a REST interface
> >> that launches jobs up for you over the supplied clusters, but its more
> >> oriented towards workflow management or perhaps HUE:
> >> https://github.com/cloudera/hue
> >>
> >> On Fri, Apr 20, 2012 at 5:37 PM, Arindam Choudhury
> >>  wrote:
> >> > Hi,
> >> >
> >> > Do hadoop have any web service or other interface so I can submit jobs
> >> from
> >> > remote machine?
> >> >
> >> > Thanks,
> >> > Arindam
> >>
> >>
> >>
> >> --
> >> Harsh J
> >>
>
>
>
> --
> Harsh J
>



Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

2012-04-20 Thread Robert Evans
You could also use the NLineInputFormat which will launch 1 mapper for every N 
(configurable) lines of input.
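
A small sketch of wiring that up in the new API (assuming your release ships org.apache.hadoop.mapreduce.lib.input.NLineInputFormat; older releases have an old-API equivalent under org.apache.hadoop.mapred.lib driven by mapred.line.input.format.linespermap; the class name and line count are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// One map task per N input lines, so a single small file listing manifest
// URLs still fans out across the cluster instead of becoming one map task.
public class ManifestDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "process manifests");
    job.setJarByClass(ManifestDriver.class);

    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.addInputPath(job, new Path(args[0]));
    NLineInputFormat.setNumLinesPerSplit(job, 1);   // one manifest URL per mapper

    // job.setMapperClass(...) would point at the manifest-fetching mapper.
    job.setNumReduceTasks(0);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}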


On 4/20/12 9:48 AM, "Sky"  wrote:

Thanks! That helped!



-Original Message-
From: Michael Segel
Sent: Thursday, April 19, 2012 9:38 PM
To: common-user@hadoop.apache.org
Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce
implementation

If the file is small enough you could read it in to a java object like a
list and write your own input format that takes a list object as its input
and then lets you specify the number of mappers.

On Apr 19, 2012, at 11:34 PM, Sky wrote:

> My file for the input to mapper is very small - as all it has is urls to
> list of manifests. The task for mappers is to fetch each manifest, and
> then fetch files using urls from the manifests and then process them.
> Besides passing around lists of files, I am not really accessing the disk.
> It should be RAM, network, and CPU (unzip, parsexml,extract attributes).
>
> So is my only choice to break the input file and submit multiple files (if
> I have 15 cores, I should split the file with urls to 15 files? also how
> does it look in code?)? The two drawbacks are - some cores might finish
> early and stay idle, and I don't know how to deal with dynamically
> increasing/decreasing cores.
>
> Thx
> - Sky
>
> -Original Message- From: Michael Segel
> Sent: Thursday, April 19, 2012 8:49 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce
> implementation
>
> How 'large' or rather in this case small is your file?
>
> If you're on a default system, the block sizes are 64MB. So if your file
> ~<= 64MB, you end up with 1 block, and you will only have 1 mapper.
>
>
> On Apr 19, 2012, at 10:10 PM, Sky wrote:
>
>> Thanks for your reply.  After I sent my email, I found a fundamental
>> defect - in my understanding of how MR is distributed. I discovered that
>> even though I was firing off 15 COREs, the map job - which is the most
>> expensive part of my processing was run only on 1 core.
>>
>> To start my map job, I was creating a single file with following data:
>> 1 storage:/root/1.manif.txt
>> 2 storage:/root/2.manif.txt
>> 3 storage:/root/3.manif.txt
>> ...
>> 4000 storage:/root/4000.manif.txt
>>
>> I thought that each of the available COREs will be assigned a map job
>> from top down from the same file one at a time, and as soon as one CORE
>> is done, it would get the next map job. However, it looks like I need to
>> split the file into the number of times. Now while that's clearly trivial
>> to do, I am not sure how I can detect at runtime how many splits I need
>> to do, and also to deal with adding new CORES at runtime. Any
>> suggestions? (it doesn't have to be a file, it can be a list, etc).
>>
>> This all would be much easier to debug, if somehow I could get my log4j
>> logs for my mappers and reducers. I can see log4j for my main launcher,
>> but not sure how to enable it for mappers and reducers.
>>
>> Thx
>> - Akash
>>
>>
>> -Original Message- From: Robert Evans
>> Sent: Thursday, April 19, 2012 2:08 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Help me with architecture of a somewhat non-trivial
>> mapreduce implementation
>>
>> From what I can see your implementation seems OK, especially from a
>> performance perspective. Depending on what storage: is, it is likely to be
>> your bottleneck, not the hadoop computations.
>>
>> Because you are writing files directly instead of relying on Hadoop to do
>> it for you, you may need to deal with error cases that Hadoop will
>> normally hide from you, and you will not be able to turn on speculative
>> execution. Just be aware that a map or reduce task may have problems in
>> the middle, and be relaunched.  So when you are writing out your updated
>> manifest be careful to not replace the old one until the new one is
>> completely ready and will not fail, or you may lose data.  You may also
>> need to be careful in your reduce if you are writing directly to the file
>> there too, but because it is not a read modify write, but just a write it
>> is not as critical.
>>
>> --Bobby Evans
>>
>> On 4/18/12 4:56 PM, "Sky USC"  wrote:
>>
>>
>>
>>
>> Please help me architect the design of my first significant MR task
>> beyond "word count". My program works well. but I am trying to optimize
>> performance to maximize use of available computing resources. I have 3

Re: Multiple data centre in Hadoop

2012-04-19 Thread Robert Evans
If you want to start an open source project for this I am sure that there are 
others with the same problem that might be very wiling to help out. :)

--Bobby Evans

On 4/19/12 4:31 PM, "Michael Segel"  wrote:

I don't know of any open source solution in doing this...
And yeah its something one can't talk about  ;-)


On Apr 19, 2012, at 4:28 PM, Robert Evans wrote:

> Where I work  we have done some things like this, but none of them are open 
> source, and I have not really been directly involved with the details of it.  
> I can guess about what it would take, but that is all it would be at this 
> point.
>
> --Bobby
>
>
> On 4/17/12 5:46 PM, "Abhishek Pratap Singh"  wrote:
>
> Thanks bobby, I m looking for something like this. Now the question is
> what is the best strategy to do Hot/Hot or Hot/Warm.
> I need to consider the CPU and Network bandwidth, also needs to decide from
> which layer this replication should start.
>
> Regards,
> Abhishek
>
> On Mon, Apr 16, 2012 at 7:08 AM, Robert Evans  wrote:
>
>> Hi Abhishek,
>>
>> Manu is correct about High Availability within a single colo.  I realize
>> that in some cases you have to have fail over between colos.  I am not
>> aware of any turn key solution for things like that, but generally what you
>> want to do is to run two clusters, one in each colo, either hot/hot or
>> hot/warm, and I have seen both depending on how quickly you need to fail
>> over.  In hot/hot the input data is replicated to both clusters and the
>> same software is run on both.  In this case though you have to be fairly
>> sure that your processing is deterministic, or the results could be
>> slightly different (i.e. No generating if random ids).  In hot/warm the
>> data is replicated from one colo to the other at defined checkpoints.  The
>> data is only processed on one of the grids, but if that colo goes down the
>> other one can take up the processing from where ever the last checkpoint
>> was.
>>
>> I hope that helps.
>>
>> --Bobby
>>
>> On 4/12/12 5:07 AM, "Manu S"  wrote:
>>
>> Hi Abhishek,
>>
>> 1. Use multiple directories for *dfs.name.dir* & *dfs.data.dir* etc
>> * Recommendation: write to *two local directories on different
>> physical volumes*, and to an *NFS-mounted* directory
>> - Data will be preserved even in the event of a total failure of the
>> NameNode machines
>> * Recommendation: *soft-mount the NFS* directory
>> - If the NFS mount goes offline, this will not cause the NameNode
>> to fail
>>
>> 2. *Rack awareness*
>>
>> https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf
>>
>> On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh
>> wrote:
>>
>>> Thanks Robert.
>>> Is there a best practice or design than can address the High Availability
>>> to certain extent?
>>>
>>> ~Abhishek
>>>
>>> On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans 
>>> wrote:
>>>
>>>> No it does not. Sorry
>>>>
>>>>
>>>> On 4/11/12 1:44 PM, "Abhishek Pratap Singh" 
>> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> Just wanted if hadoop supports more than one data centre. This is
>>> basically
>>>> for DR purposes and High Availability where one centre goes down other
>>> can
>>>> bring up.
>>>>
>>>>
>>>> Regards,
>>>> Abhishek
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Thanks & Regards
>> 
>> *Manu S*
>> SI Engineer - OpenSource & HPC
>> Wipro Infotech
>> Mob: +91 8861302855Skype: manuspkd
>> www.opensourcetalk.co.in
>>
>>
>




Re: Multiple data centre in Hadoop

2012-04-19 Thread Robert Evans
Where I work  we have done some things like this, but none of them are open 
source, and I have not really been directly involved with the details of it.  I 
can guess about what it would take, but that is all it would be at this point.

--Bobby


On 4/17/12 5:46 PM, "Abhishek Pratap Singh"  wrote:

Thanks bobby, I m looking for something like this. Now the question is
what is the best strategy to do Hot/Hot or Hot/Warm.
I need to consider the CPU and Network bandwidth, also needs to decide from
which layer this replication should start.

Regards,
Abhishek

On Mon, Apr 16, 2012 at 7:08 AM, Robert Evans  wrote:

> Hi Abhishek,
>
> Manu is correct about High Availability within a single colo.  I realize
> that in some cases you have to have fail over between colos.  I am not
> aware of any turn key solution for things like that, but generally what you
> want to do is to run two clusters, one in each colo, either hot/hot or
> hot/warm, and I have seen both depending on how quickly you need to fail
> over.  In hot/hot the input data is replicated to both clusters and the
> same software is run on both.  In this case though you have to be fairly
> sure that your processing is deterministic, or the results could be
> slightly different (i.e. No generating if random ids).  In hot/warm the
> data is replicated from one colo to the other at defined checkpoints.  The
> data is only processed on one of the grids, but if that colo goes down the
> other one can take up the processing from where ever the last checkpoint
> was.
>
> I hope that helps.
>
> --Bobby
>
> On 4/12/12 5:07 AM, "Manu S"  wrote:
>
> Hi Abhishek,
>
> 1. Use multiple directories for *dfs.name.dir* & *dfs.data.dir* etc
> * Recommendation: write to *two local directories on different
> physical volumes*, and to an *NFS-mounted* directory
> - Data will be preserved even in the event of a total failure of the
> NameNode machines
> * Recommendation: *soft-mount the NFS* directory
> - If the NFS mount goes offline, this will not cause the NameNode
> to fail
>
> 2. *Rack awareness*
>
> https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf
>
> On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh
> wrote:
>
> > Thanks Robert.
> > Is there a best practice or design than can address the High Availability
> > to certain extent?
> >
> > ~Abhishek
> >
> > On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans 
> > wrote:
> >
> > > No it does not. Sorry
> > >
> > >
> > > On 4/11/12 1:44 PM, "Abhishek Pratap Singh" 
> wrote:
> > >
> > > Hi All,
> > >
> > > Just wanted if hadoop supports more than one data centre. This is
> > basically
> > > for DR purposes and High Availability where one centre goes down other
> > can
> > > bring up.
> > >
> > >
> > > Regards,
> > > Abhishek
> > >
> > >
> >
>
>
>
> --
> Thanks & Regards
> 
> *Manu S*
> SI Engineer - OpenSource & HPC
> Wipro Infotech
> Mob: +91 8861302855Skype: manuspkd
> www.opensourcetalk.co.in
>
>



Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

2012-04-19 Thread Robert Evans
From what I can see your implementation seems OK, especially from a 
performance perspective. Depending on what storage: is, it is likely to be your 
bottleneck, not the hadoop computations.

Because you are writing files directly instead of relying on Hadoop to do it 
for you, you may need to deal with error cases that Hadoop will normally hide 
from you, and you will not be able to turn on speculative execution.  Just be 
aware that a map or reduce task may have problems in the middle, and be 
relaunched.  So when you are writing out your updated manifest be careful to 
not replace the old one until the new one is completely ready and will not 
fail, or you may lose data.  You may also need to be careful in your reduce if 
you are writing directly to the file there too, but because it is not a read 
modify write, but just a write it is not as critical.

--Bobby Evans

On 4/18/12 4:56 PM, "Sky USC"  wrote:




Please help me architect the design of my first significant MR task beyond 
"word count". My program works well. but I am trying to optimize performance to 
maximize use of available computing resources. I have 3 questions at the bottom.

Project description in an abstract sense (written in java):
* I have MM number of MANIFEST files available on storage:/root/1.manif.txt to 
4000.manif.txt
 * Each MANIFEST in turn contains a variable number "EE" of URLs to EBOOKS 
(range could be 1 - 50,000 EBOOKS urls per MANIFEST) -- stored on 
storage:/root/1.manif/1223.folder/5443.Ebook.ebk
So we are talking about millions of ebooks

My task is to:
1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: 
publisher, year, ebook-version).
2. Update each of the EBOOK entry record in the manifest - with the 3 
attributes (eg: ebook 1334 -> publisher=aaa year=bbb, ebook-version=2.01)
3. Create a output file such that the named 
"__"  contains a list of all "ebook urls" that 
met that criteria.
example:
File "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" contains:
storage:/root/1.manif/1223.folder/2143.Ebook.ebk
storage:/root/2.manif/2133.folder/5449.Ebook.ebk
storage:/root/2.manif/2133.folder/5450.Ebook.ebk
etc..

and File "storage:/root/summary/PENGUIN_2001_3.12.txt" contains:
storage:/root/19.manif/2223.folder/4343.Ebook.ebk
storage:/root/13.manif/9733.folder/2149.Ebook.ebk
storage:/root/21.manif/3233.folder/1110.Ebook.ebk

etc

4. finally, I also want to output statistics such that:
__  
PENGUIN_2001_3.12 250,111
RANDOMHOUSE_1999_2.01  11,322
etc

Here is how I implemented:
* My launcher gets list of MM manifests
* My Mapper gets one manifest.
 --- It reads the manifest, within a WHILE loop,
--- fetches each EBOOK,  and obtain attributes from each ebook,
--- updates the manifest for that ebook
--- context.write(new Text("RANDOMHOUSE_1999_2.01"), new 
Text("storage:/root/1.manif/1223.folder/2143.Ebook.ebk"))
 --- Once all ebooks in the manifest are read, it saves the updated Manifest, 
and exits
* My Reducer gets the "RANDOMHOUSE_1999_2.01" and a list of ebooks urls.
 --- It writes a new file "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" 
with all the storage urls for the ebooks
 --- It also does a context.write(new Text("RANDOMHOUSE_1999_2.01"), new 
IntWritable(SUM_OF_ALL_EBOOK_URLS_FROM_THE_LIST))

As I mentioned, its working. I launch it on 15 elastic instances. I have three 
questions:
1. Is this the best way to implement the MR logic?
2. I dont know if each of the instances is getting one task or multiple tasks 
simultaneously for the MAP portion. If it is not getting multiple MAP tasks, 
should I go with the route of "multithreaded" reading of ebooks from each 
manifest? Its not efficient to read just one ebook at a time per machine. Is 
"Context.write()" threadsafe?
3. I can see log4j logs for main program, but no visibility into logs for 
Mapper or Reducer. Any idea?






Re: Multiple data centre in Hadoop

2012-04-16 Thread Robert Evans
Hi Abhishek,

Manu is correct about High Availability within a single colo.  I realize that 
in some cases you have to have fail over between colos.  I am not aware of any 
turn key solution for things like that, but generally what you want to do is to 
run two clusters, one in each colo, either hot/hot or hot/warm, and I have seen 
both depending on how quickly you need to fail over.  In hot/hot the input data 
is replicated to both clusters and the same software is run on both.  In this 
case though you have to be fairly sure that your processing is deterministic, 
or the results could be slightly different (i.e. No generating if random ids).  
In hot/warm the data is replicated from one colo to the other at defined 
checkpoints.  The data is only processed on one of the grids, but if that colo 
goes down the other one can take up the processing from where ever the last 
checkpoint was.

I hope that helps.

--Bobby

On 4/12/12 5:07 AM, "Manu S"  wrote:

Hi Abhishek,

1. Use multiple directories for *dfs.name.dir* & *dfs.data.dir* etc
* Recommendation: write to *two local directories on different
physical volumes*, and to an *NFS-mounted* directory
- Data will be preserved even in the event of a total failure of the
NameNode machines
* Recommendation: *soft-mount the NFS* directory
- If the NFS mount goes offline, this will not cause the NameNode
to fail

2. *Rack awareness*
https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf

On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh
wrote:

> Thanks Robert.
> Is there a best practice or design than can address the High Availability
> to certain extent?
>
> ~Abhishek
>
> On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans 
> wrote:
>
> > No it does not. Sorry
> >
> >
> > On 4/11/12 1:44 PM, "Abhishek Pratap Singh"  wrote:
> >
> > Hi All,
> >
> > Just wanted if hadoop supports more than one data centre. This is
> basically
> > for DR purposes and High Availability where one centre goes down other
> can
> > bring up.
> >
> >
> > Regards,
> > Abhishek
> >
> >
>



--
Thanks & Regards

*Manu S*
SI Engineer - OpenSource & HPC
Wipro Infotech
Mob: +91 8861302855Skype: manuspkd
www.opensourcetalk.co.in



Re: Multiple data centre in Hadoop

2012-04-11 Thread Robert Evans
No it does not. Sorry


On 4/11/12 1:44 PM, "Abhishek Pratap Singh"  wrote:

Hi All,

Just wanted if hadoop supports more than one data centre. This is basically
for DR purposes and High Availability where one centre goes down other can
bring up.


Regards,
Abhishek



Re: Hadoop streaming or pipes ..

2012-04-05 Thread Robert Evans
It is a regular process, unless you explicitly say you want it to be java, 
which would be a bit odd to do, but possible.

--Bobby

On 4/5/12 3:14 PM, "Mark question"  wrote:

Thanks for the response Robert ..  so the overhead will be in read/write
and communication. But is the new process spawned a JVM or a regular
process?

Thanks,
Mark

On Thu, Apr 5, 2012 at 12:49 PM, Robert Evans  wrote:

> Both streaming and pipes do very similar things.  They will fork/exec a
> separate process that is running whatever you want it to run.  The JVM that
> is running hadoop then communicates with this process to send the data over
> and get the processing results back.  The difference between streaming and
> pipes is that streaming uses stdin/stdout for this communication so
> preexisting processing like grep, sed and awk can be used here.  Pipes uses
> a custom protocol with a C++ library to communicate.  The C++ library is
> tagged with SWIG compatible data so that it can be wrapped to have APIs in
> other languages like python or perl.
>
> I am not sure what the performance difference is between the two, but in
> my own work I have seen a significant performance penalty from using either
> of them, because there is a somewhat large overhead of sending all of the
> data out to a separate process just to read it back in again.
>
> --Bobby Evans
>
>
> On 4/5/12 1:54 PM, "Mark question"  wrote:
>
> Hi guys,
>  quick question:
>   Are there any performance gains from hadoop streaming or pipes over
> Java? From what I've read, it's only to ease testing by using your favorite
> language. So I guess it is eventually translated to bytecode then executed.
> Is that true?
>
> Thank you,
> Mark
>
>



Re: Hadoop streaming or pipes ..

2012-04-05 Thread Robert Evans
Both streaming and pipes do very similar things.  They will fork/exec a 
separate process that is running whatever you want it to run.  The JVM that is 
running hadoop then communicates with this process to send the data over and 
get the processing results back.  The difference between streaming and pipes is 
that streaming uses stdin/stdout for this communication so preexisting 
processing like grep, sed and awk can be used here.  Pipes uses a custom 
protocol with a C++ library to communicate.  The C++ library is tagged with 
SWIG compatible data so that it can be wrapped to have APIs in other languages 
like python or perl.

I am not sure what the performance difference is between the two, but in my own 
work I have seen a significant performance penalty from using either of them, 
because there is a somewhat large overhead of sending all of the data out to a 
separate process just to read it back in again.

--Bobby Evans


On 4/5/12 1:54 PM, "Mark question"  wrote:

Hi guys,
  quick question:
   Are there any performance gains from hadoop streaming or pipes over
Java? From what I've read, it's only to ease testing by using your favorite
language. So I guess it is eventually translated to bytecode then executed.
Is that true?

Thank you,
Mark



Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Robert Evans
I am dropping the cross posts and leaving this on common-user with the others 
BCCed.

Marcos,

That is a great idea to be able to update the tutorial, especially if the 
community is interested in helping to do so.  We are looking into the best way 
to do this.  The idea right now is to donate this to the Hadoop project so that 
the community can keep it up to date, but we need some time to jump through all 
of the corporate hoops to get this to happen.  We have a lot going on right 
now, so if you don't see any progress on this please feel free to ping me and 
bug me about it.

--
Bobby Evans


On 4/4/12 8:15 AM, "Jagat Singh"  wrote:

Hello Marcos

 Yes, the Yahoo tutorials are pretty old but they still explain the concepts of 
Map Reduce and HDFS beautifully. The way in which the tutorials have been divided 
into sub-sections, each building on the previous one, is awesome. I remember when I 
started I was dug in there for many days. The tutorials are lagging now from the 
new API point of view.

 Let's have a documentation session one day; I would love to volunteer to 
update those tutorials if the people at Yahoo take input from the outside world :)

 Regards,

 Jagat

- Original Message -
From: Marcos Ortiz
Sent: 04/04/12 08:32 AM
To: common-user@hadoop.apache.org, 'hdfs-u...@hadoop.apache.org', 
mapreduce-u...@hadoop.apache.org
Subject: Yahoo Hadoop Tutorial with new APIs?

Regards to all the list.
 There are many people that use the Hadoop Tutorial released by Yahoo at 
http://developer.yahoo.com/hadoop/tutorial/ 
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
The main issue here is that this tutorial is written with the old APIs 
(Hadoop 0.18, I think).
 Is there a project to update this tutorial to the new APIs, to Hadoop 1.0.2 
or YARN (Hadoop 0.23)?

 Best wishes
 -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI 
http://marcosluis2186.posterous.com
 http://www.uci.cu/



Re: Temporal query

2012-03-29 Thread Robert Evans
I am not aware of anyone that does this for you directly, but it should not be 
too difficult for you to write what you want using pig or hive.  I am not as 
familiar with Jaql but I assume that you can do it there too.  Although it 
might be simpler to write it using Map/Reduce because we can abuse Map/Reduce 
in ways that the higher level languages disallow so that they can do 
optimizations.

What I would do is in the mapper scan through each entry and look for 
transitions of $value around $threshold, and the time that they occurred.  You 
can then look for 30+ second windows where $value > $threshold within that 
partition and output them to the reducer.  The trick with this is that you need 
to pay special attention to the beginning and end of the partition.  You need 
to also send to the reducer the state at the beginning and end of each 
partition and how long it was in that state.  The reducer can then combine 
these pieces together and see if they meet the 30+ second criteria. If so 
output them with the rest, otherwise don't.  The known times when it is > 30 
seconds can be sent to any reducer, so they can have any key, but for the 
transitions to work correctly you need to send them to a single reducer, so 
they should have a very specific key.  You could also try to divide them up if 
you have to scale very very large, but that would be rather difficult to get 
right.

--Bobby Evans
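
A much-simplified sketch of the mapper half of that idea (assuming each line is "epochSeconds,value", a fixed threshold, and a 30-second window; the single reducer that stitches the boundary runs back together is only hinted at in the comments, and all names are made up):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Scans one split for runs where value > THRESHOLD. Runs fully contained in
// the split that already last 30+ seconds go out under a spreadable per-run
// key; runs touching a split boundary all go to a single "BOUNDARY" key so
// that one reducer can stitch them together with the neighbouring splits and
// re-apply the 30-second test there.
public class ThresholdRunMapper extends Mapper<LongWritable, Text, Text, Text> {
  private static final double THRESHOLD = 100.0;   // illustrative
  private static final long WINDOW_SECS = 30;

  private long runStart = -1;    // -1 means "not currently in a run"
  private long lastTime = -1;
  private boolean runTouchesSplitStart = false;
  private boolean seenAnyRecord = false;

  @Override
  public void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split(",");
    long t = Long.parseLong(parts[0]);
    double v = Double.parseDouble(parts[1]);

    if (v > THRESHOLD) {
      if (runStart < 0) {
        runStart = t;
        runTouchesSplitStart = !seenAnyRecord;   // may continue from the previous split
      }
    } else if (runStart >= 0) {
      emitRun(ctx, runStart, lastTime, runTouchesSplitStart, false);
      runStart = -1;
    }
    lastTime = t;
    seenAnyRecord = true;
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    if (runStart >= 0) {
      // Still above the threshold at the end of the split: the run may
      // continue into the next split, so it must go to the stitching reducer.
      emitRun(ctx, runStart, lastTime, runTouchesSplitStart, true);
    }
  }

  private void emitRun(Context ctx, long start, long end,
                       boolean openAtStart, boolean openAtEnd)
      throws IOException, InterruptedException {
    if (!openAtStart && !openAtEnd) {
      if (end - start >= WINDOW_SECS) {
        ctx.write(new Text("RUN-" + start), new Text(start + "," + end));
      }
    } else {
      ctx.write(new Text("BOUNDARY"),
                new Text(start + "," + end + "," + openAtStart + "," + openAtEnd));
    }
  }
}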


On 3/29/12 4:02 AM, "banermatt"  wrote:



Hello,

I'm developing a log file anomaly detection system on a hadoop cluster.
I'm looking for a way to process a query like: "select all values when
value>threshold for a duration>30 seconds". Do you know a tool which could
help me to process such a query?
I read up on the scripting languages pig, hive and jaql, which seem to have
very similar applications. I tried them but I was not able to do what I
want.

Thank you in advance,

Matthieu

--
View this message in context: 
http://old.nabble.com/Temporal-query-tp33544869p33544869.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: question about processing large zip

2012-03-21 Thread Robert Evans
How are you splitting the zip right now?  Do you have multiple mappers and 
each mapper starts at the beginning of the zip and goes to the point it cares 
about or do you just have one mapper?  If you are doing it the first way you 
may want to increase your replication factor.  Alternatively you could use 
multiple zip files, one per mapper that you want to launch.

--Bobby Evans

On 3/19/12 7:26 PM, "Andrew McNair"  wrote:

Hi,

I have a large (~300 gig) zip of images that I need to process. My
current workflow is to copy the zip to HDFS, use a custom input format
to read the zip entries, do the processing in a map, and then generate
a processing report in the reduce. I'm struggling to tune params right
now with my cluster to make everything run smoothly, but I'm also
worried that I'm missing a better way of processing.

Does anybody have suggestions for how to make the processing of a zip
more parallel? The only other idea I had was uploading the zip as a
sequence file, but that proved incredibly slow (~30 hours on my 3 node
cluster to upload).

Thanks in advance.

-Andrew



Re: Should splittable Gzip be a "core" hadoop feature?

2012-02-29 Thread Robert Evans
If many people are going to use it then by all means put it in.  If there is 
only one person, or a very small handful of people that are going to use it 
then I personally would prefer to see it a separate project.  However, Edward, 
you have convinced me that I am trying to make a logical judgment based only on 
a gut feeling and the response rate to an email chain.  Thanks for that.  What 
I really want to know is how well does this new CompressionCodec perform in 
comparison to the regular gzip codec in various different conditions and what 
type of impact does it have on network traffic and datanode load.  My gut 
feeling is that the speedup is going to be relatively small except when there 
is a lot of computation happening in the mapper and the added load and network 
traffic outweighs the speedup in most cases, but like all performance on a 
complex system gut feelings are almost worthless and hard numbers are what is 
needed to make a judgment call.  Niels, I assume you have tested this on your 
cluster(s).  Can you share with us some of the numbers?

--Bobby Evans

On 2/29/12 11:06 AM, "Edward Capriolo"  wrote:

Too bad we can not up the replication on the first few blocks of the
file or distributed cache it.

The contrib statement is arguable. I could make a case that the
majority of stuff should not be in hadoop-core. NLineInputFormat, for
example, is nice to have; it took a long time to get ported to the new map
reduce format. DBInputFormat and DataDrivenDBInputFormat are sexy for sure but
do not need to be part of core. I could see hadoop as just coming
with TextInputFormat and SequenceFileInputFormat and everything else being
after-market from github.

On Wed, Feb 29, 2012 at 11:31 AM, Robert Evans  wrote:
> I can see a use for it, but I have two concerns about it.  My biggest concern 
> is maintainability.  We have had lots of things get thrown into contrib in 
> the past, very few people use them, and inevitably they start to suffer from 
> bit rot.  I am not saying that it will happen with this, but if you have to 
> ask if people will use it and there has been no overwhelming yes, it makes me 
> nervous about it.  My second concern is with knowing when to use this.  
> Anything that adds this in would have to come with plenty of documentation 
> about how it works, how it is different from the normal gzip format, 
> explanations about what type of a load it might put on data nodes that hold 
> the start of the file, etc.
>
> From both of these I would prefer to see this as a github project for a while 
> first, and once it shows that it has a significant following, or a community 
> with it, then we can pull it in.  But if others disagree I am not going to 
> block it.  I am a -0 on pulling this in now.
>
> --Bobby
>
> On 2/29/12 10:00 AM, "Niels Basjes"  wrote:
>
> Hi,
>
> On Wed, Feb 29, 2012 at 16:52, Edward Capriolo wrote:
> ...
>
>> But being able to generate split info for them and processing them
>> would be good as well. I remember that was a hot thing to do with lzo
>> back in the day. The pain of once overing the gz files to generate the
>> split info is detracting but it is nice to know it is there if you
>> want it.
>>
>
> Note that the solution I created (HADOOP-7076) does not require any
> preprocessing.
> It can split ANY gzipped file as-is.
> The downside is that this effectively costs some additional performance
> because the task has to decompress the first part of the file that is to be
> discarded.
>
> The other two ways of splitting gzipped files either require
> - creating some kind of "compression index" before actually using the file
> (HADOOP-6153)
> - creating a file in a format that is generated in such a way that it is
> really a set of concatenated gzipped files. (HADOOP-7909)
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>



Re: Should splittable Gzip be a "core" hadoop feature?

2012-02-29 Thread Robert Evans
I can see a use for it, but I have two concerns about it.  My biggest concern 
is maintainability.  We have had lots of things get thrown into contrib in the 
past, very few people use them, and inevitably they start to suffer from bit 
rot.  I am not saying that it will happen with this, but if you have to ask if 
people will use it and there has been no overwhelming yes, it makes me nervous 
about it.  My second concern is with knowing when to use this.  Anything that 
adds this in would have to come with plenty of documentation about how it 
works, how it is different from the normal gzip format, explanations about what 
type of a load it might put on data nodes that hold the start of the file, etc.

From both of these I would prefer to see this as a github project for a while 
first, and once it shows that it has a significant following, or a community 
with it, then we can pull it in.  But if others disagree I am not going to 
block it.  I am a -0 on pulling this in now.

--Bobby

On 2/29/12 10:00 AM, "Niels Basjes"  wrote:

Hi,

On Wed, Feb 29, 2012 at 16:52, Edward Capriolo wrote:
...

> But being able to generate split info for them and processing them
> would be good as well. I remember that was a hot thing to do with lzo
> back in the day. The pain of once overing the gz files to generate the
> split info is detracting but it is nice to know it is there if you
> want it.
>

Note that the solution I created (HADOOP-7076) does not require any
preprocessing.
It can split ANY gzipped file as-is.
The downside is that this effectively costs some additional performance
because the task has to decompress the first part of the file that is to be
discarded.

The other two ways of splitting gzipped files either require
- creating some kind of "compression index" before actually using the file
(HADOOP-6153)
- creating a file in a format that is generated in such a way that it is
really a set of concatenated gzipped files. (HADOOP-7909)

--
Best regards / Met vriendelijke groeten,

Niels Basjes
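
As an illustration of the "concatenated gzipped files" approach (HADOOP-7909) mentioned above - a minimal sketch, not the patch itself - gzip members can simply be appended to one another and the result is still one valid .gz file:

import java.io.FileOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class ConcatGzip {
    public static void main(String[] args) throws Exception {
        // Write each input file as its own gzip member inside one output .gz file;
        // "cat part1.gz part2.gz > whole.gz" achieves the same thing at the shell.
        try (FileOutputStream out = new FileOutputStream("concatenated.gz")) {
            for (String part : args) {
                GZIPOutputStream member = new GZIPOutputStream(out);
                Files.copy(Paths.get(part), member);
                member.finish();   // end this member but keep the underlying stream open
            }
        }
    }
}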



Re: Execute a Map/Reduce Job Jar from Another Java Program.

2012-02-03 Thread Robert Evans
It looks like there is something wrong with your configuration: the default 
file system is coming back as the local file system, but you are passing an 
HDFS URI to fs.exists(Path). I cannot tell for sure because I don't have 
access to 
com.amd.kdf.protobuf.SequentialFileDriver.main(SequentialFileDriver.java:64).

If running it works just fine from the command line, you could try doing a 
fork/exec to launch the process and then monitor it.
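
A minimal sketch of that fork/exec approach (the jar name, paths, and class are placeholders; it assumes the "hadoop" launcher script is on the PATH of the launching JVM):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class BatchJobLauncher {
    public static int runHadoopJar(String jar, String input, String output)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("hadoop", "jar", jar, input, output);
        pb.redirectErrorStream(true);               // merge stdout and stderr into one stream
        Process p = pb.start();
        BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = out.readLine()) != null) {
            System.out.println(line);               // or log/parse the job output to monitor it
        }
        return p.waitFor();                         // 0 means the driver exited successfully
    }
}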

--Bobby Evans

On 2/2/12 11:31 PM, "Abees Muhammad"  wrote:

Hi Evans,

Thanks for your reply. I have a MapReduce job jar file, let's call it test.jar. 
I am executing this jar as "hadoop jar test.jar inputPath outPath", and it runs 
successfully. Now I want to execute this job for a batch of files (a batch of 
20 files). For this purpose I have created another Java application; this 
application moves a batch of files from one location in HDFS to another 
location in HDFS. After that, this application needs to execute the M/R job for 
the batch. We will invoke the second application (which will execute the M/R 
job) as a Control-M job, but I don't know how to create the second Java 
application that will invoke the M/R job. The code snippet I used for testing 
the jar which calls the M/R job is

List<String> arguments = new ArrayList<String>();
arguments.add("test.jar");
arguments.add("inputPath");
arguments.add(outputPath);
RunJar.main(arguments.toArray(new String[0]));

I executed this jar as "java -jar M/RJobInvokeApp.jar" but I got this error:

java.lang.IllegalArgumentException: Wrong FS: hdfs://ip 
address:54310/tmp/test-out, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:410)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:56)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:379)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:748)
at 
com.amd.kdf.protobuf.SequentialFileDriver.main(SequentialFileDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
at com.amd.wrapper.main.ParserWrapper.main(ParserWrapper.java:31)






Thanks,
Abees

On 2 February 2012 23:02, Robert Evans  wrote:
What happens?  Is there an exception, does nothing happen?  I am curious.  Also 
how did you launch your other job that is trying to run this one.  The hadoop 
script sets up a lot of environment variables classpath etc to make hadoop work 
properly, and some of that may not be set up correctly to make RunJar work.

--Bobby Evans


On 2/2/12 9:36 AM, "Harsh J" <ha...@cloudera.com> wrote:

Moving to common-user. Common-dev is for project development
discussions, not user help.

Could you elaborate on how you used RunJar? What arguments did you
provide, and is the target jar a runnable one or a regular jar? What
error did you get?

On Thu, Feb 2, 2012 at 8:44 PM, abees muhammad <abees...@gmail.com> wrote:
>
> Hi,
>
> I am a newbie to Hadoop development. I have a Map/Reduce job jar file, and I
> want to execute this jar file programmatically from another Java program. I
> used the following code to execute it.
>
> RunJar.main(String[] args). But the jar file is not executed.
>
> Can you please give me a workaround for this issue.
> --
> View this message in context: 
> http://old.nabble.com/Execute-a-Map-Reduce-Job-Jar-from-Another-Java-Program.-tp33250801p33250801.html
> Sent from the Hadoop core-dev mailing list archive at Nabble.com.
>



--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about





Re: Execute a Map/Reduce Job Jar from Another Java Program.

2012-02-02 Thread Robert Evans
What happens?  Is there an exception, does nothing happen?  I am curious.  Also 
how did you launch your other job that is trying to run this one.  The hadoop 
script sets up a lot of environment variables classpath etc to make hadoop work 
properly, and some of that may not be set up correctly to make RunJar work.

--Bobby Evans

On 2/2/12 9:36 AM, "Harsh J"  wrote:

Moving to common-user. Common-dev is for project development
discussions, not user help.

Could you elaborate on how you used RunJar? What arguments did you
provide, and is the target jar a runnable one or a regular jar? What
error did you get?

On Thu, Feb 2, 2012 at 8:44 PM, abees muhammad  wrote:
>
> Hi,
>
> I am a newbie to Hadoop development. I have a Map/Reduce job jar file, and I
> want to execute this jar file programmatically from another Java program. I
> used the following code to execute it.
>
> RunJar.main(String[] args). But the jar file is not executed.
>
> Can you please give me a workaround for this issue.
> --
> View this message in context: 
> http://old.nabble.com/Execute-a-Map-Reduce-Job-Jar-from-Another-Java-Program.-tp33250801p33250801.html
> Sent from the Hadoop core-dev mailing list archive at Nabble.com.
>



--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about



Re: Why $HADOOP_PREFIX ?

2012-02-01 Thread Robert Evans
I think it comes down to a long history of splitting and then remerging the 
hadoop project.  I could be wrong about a lot of this so take it with a grain 
of salt.  Hadoop originally was, and on 1.0 still is, a single project.  HDFS, 
mapreduce and common are all compiled together into a single jar, hadoop-core.  
In that respect HADOOP_HOME made a lot of sense because it was a single thing, 
with some dependencies that needed to be found by some shell scripts.

Fast forward the projects were split, HADOOP_HOME was deprecated, and 
HADOOP_COMMON_HOME, HADOOP_MAPRED_HOME, and HADOOP_HDFS_HOME were born.  But if 
we install them all into a single tree it is a pain to configure all of these 
to point to the same place, but HADOOP_HOME is deprecated, so HADOOP_PREFIX was 
born.  NOTE: like was stated before all of these are supposed to be hidden from 
the end user and are intended more towards packaging and deploying hadoop.  
Also the process is not done and it is likely to change further.

--Bobby Evans

On 2/1/12 8:10 AM, "praveenesh kumar"  wrote:

Interesting and strange.
but are there any reason for setting $HADOOP_HOME to $HADOOP_PREFIX in
hadoop-conf.sh
and then checking in /bin/hadoop.sh whether $HADOOP_HOME is not equal to ""

I mean if I comment out the export HADOOP_HOME=${HADOOP_PREFIX} in
hadoop-conf.sh, does it make any difference ?

Thanks,
Praveenesh

On Wed, Feb 1, 2012 at 6:04 PM, Prashant Sharma wrote:

> I think you have misunderstood something. AFAIK these
> variables are set automatically when you run a script. Its name is obscure
> for some strange reason. ;).
>
> Warning: $HADOOP_HOME is deprecated is always there. whether the variable
> is set or not. Why?
> Because the hadoop-config is sourced in all scripts. And all it does is
> sets HADOOP_PREFIX as HADOOP_HOME. I think this can be reported as a bug.
>
> -P
>
>
> On Wed, Feb 1, 2012 at 5:46 PM, praveenesh kumar  wrote:
>
> > Does anyone have idea on Why $HADOOP_PREFIX was introduced instead of
> > $HADOOP_HOME in hadoop 0.20.205 ?
> >
> > I believe $HADOOP_HOME was not giving any troubles or is there a
> reason/new
> > feature that require $HADOOP_PREFIX to be added ?
> >
> > Its a kind of funny, but I got habitual of using $HADOOP_HOME. Just
> curious
> > to know for this change.
> > Also, there are some old packages ( I am not referring apache/cloudera/or
> > any hadoop distribution ), that depends on hadoop that still uses
> > $HADOOP_HOME inside. So its kind of weird when you use those packages,
> you
> > still get warning messages even though its suppressed from Hadoop side.
> >
> >
> > Thanks,
> > Praveenesh
> >
>



Re: When to use a combiner?

2012-01-25 Thread Robert Evans
You can use a combiner for average.  You just have to write a separate combiner 
from your reducer.

class MyCombiner {
    // The values are (sum, count) pairs
    public void reduce(Key key, Iterable<Pair<Long, Long>> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        long count = 0;
        for (Pair<Long, Long> pair : values) {
            sum += pair.first;
            count += pair.second;
        }
        context.write(key, new Pair<Long, Long>(sum, count));
    }
}

class MyReducer {
    // The values are (sum, count) pairs
    public void reduce(Key key, Iterable<Pair<Long, Long>> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        long count = 0;
        for (Pair<Long, Long> pair : values) {
            sum += pair.first;
            count += pair.second;
        }
        context.write(key, ((double) sum) / count);
    }
}
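
A minimal sketch of how the two classes above might be wired into a job (it assumes Pair is a Writable type and that the mapper emits (value, 1) pairs):

job.setMapOutputValueClass(Pair.class);
job.setCombinerClass(MyCombiner.class);   // emits partial (sum, count) pairs
job.setReducerClass(MyReducer.class);     // turns the final (sum, count) into the average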

--Bobby Evans


On 1/24/12 4:34 PM, "Raj V"  wrote:

Just to add to Sameer's response - you cannot use a combiner in case you are 
finding the average temperature. The combiner running on each mapper will 
produce the average for that mapper's output and the reducer will find the 
average of the combiner outputs, which in this case will be the average of the 
averages.

You can use a combiner if your reducer function R is like this:

R(S) = R(R(s1), R(s2), ..., R(sn)), where S is the whole set and s1, s2, ..., sn are 
some arbitrary partition of the set S.

Raj






 From: Sameer Farooqui 
 To: common-user@hadoop.apache.org
 Sent: Tuesday, January 24, 2012 12:22 PM
 Subject: Re: When to use a combiner?



Hi Steve,

Yeah, you're right in your suspicions that a combiner may not be useful in your 
use case. It's mainly used to reduce network traffic between the mappers and 
the reducers. Hadoop may apply the combiner zero, one or multiple times to the 
intermediate output from the mapper, so it's hard to accurately predict the CPU 
impact a combiner will have. The reduction in network packets is a lot easier 
to predict and actually see.

From Chuck Lam's 'Hadoop in Action': "A combiner doesn't necessarily improve 
performance. You should monitor the job's behavior to see if the number of 
records outputted by the combiner is meaningfully less than the number of 
records going in. The reduction must justify the extra execution time of 
running a combiner. You can easily check this through the JobTracker's Web UI."

One thing to point out is don't just assume the combiner's ineffectiveness b/c 
it's not reducing the # of unique keys emitted from the Map side. It really 
depends on your specific use case for the combiner and the nature of the 
MapReduce job. For example, imagine your map tasks find the maximum temperature 
for a given year (example from 'Hadoop: The Definitive Guide'), like so:

Node 1's Map output:
(1950, 20)
(1950, 10)
(1950, 40)

Node 2's Map output:
(1950, 0)
(1950, 15)

The reduce function would get this input after the shuffle phase:
(1950, [0, 10, 15, 20, 40])
and the reduce function would output:
(1950, 40)

But if you used a combiner, the reduce function would have gotten smaller input 
to work with after the shuffle phase:
(1950, [40, 15])
and the output from Reduce would be the same.

There are specific use cases like the one above that a combiner makes magical 
performance gains for, but it shouldn't by default be used 100% of the time.
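
A minimal sketch of the job wiring for a case like this (the class names are hypothetical): for a max-style aggregation the reducer class can be reused as the combiner, because max is associative and commutative. This reuse is exactly what does not work for the average example discussed above in this thread, which needs a separate combiner class.

job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);   // reducer doubles as combiner
job.setReducerClass(MaxTemperatureReducer.class);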

Both of the books I mentioned are excellent with tons of real-world tips, so I 
highly recommend them.


Re: Hadoop PIPES job using C++ and binary data results in data locality problem.

2012-01-10 Thread Robert Evans
I think what you want to try to do is to use JNI rather than pipes or 
streaming.  PIPES has known issues and it is my understanding that its use is 
now discouraged.  The ideal way to do this is to use JNI to send your data to 
the C code.  Be aware that moving large amounts of data through JNI has some of 
its own challenges, but most of these can be solved by using a direct ByteBuffer.
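
A minimal sketch of that direct-ByteBuffer handoff (the library and method names are hypothetical, not from this thread):

import java.nio.ByteBuffer;

public class NativeBridge {
    static { System.loadLibrary("mynative"); }   // expects libmynative.so on java.library.path

    // Implemented in C/C++; the native side reads the buffer via GetDirectBufferAddress.
    private static native void processChunk(ByteBuffer buf, int length);

    public static void process(byte[] bytes, int length) {
        ByteBuffer direct = ByteBuffer.allocateDirect(length);   // off-heap, no copy across JNI
        direct.put(bytes, 0, length);
        processChunk(direct, length);
    }
}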

--Bobby Evans

On 1/10/12 10:31 AM, "GorGo"  wrote:



Hi everyone.

I am running C++ code using the PIPES wrapper and I am looking for some
tutorials, examples or any kind of help with regard to using binary data.
My problem is that I am working with large chunks of binary data, and
converting them to text is not an option.
My first question is thus: can I pass large chunks (>128 MB) of binary data
through the PIPES interface?
I have not been able to find documentation on this.

The way I do things now is that I bypass the Hadoop process by opening and
reading the data directly from the C++ code using the HDFS C API. However,
that means that I lose the data locality and causes too much network
overhead to be viable at large scale.

If passing binary data directly is not possible with PIPES, I somehow need
to write my own RecordReader that maintains the data locality but still does
not actually emit the data (I just need to make sure the C++ mapper reads
the same data from a local source when it is spawned).
The RecordReader actually does not need to read the data at all. Generating
a config string that tells the C++ mapper code what to read would be just
fine.

The second question is thus: how do I write my own RecordReader in C++ or
Java?
I also would like information on how Hadoop maintains the data locality
between RecordReaders and the spawned map tasks.

Any information is most welcome.

Regards
   GorGo
--
View this message in context: 
http://old.nabble.com/Hadoop-PIPES-job-using-C%2B%2B-and-binary-data-results-in-data-locality-problem.-tp33112818p33112818.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: increase number of map tasks

2012-01-10 Thread Robert Evans
Similarly there is the NLineInputFormat that does this automatically.  If your 
input is small it will read in the input and make a split for every N lines of 
input.   Then you don't have to reformat your data files.
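
A minimal sketch of the configuration for that (old mapred API in 0.20; MyJob is a placeholder class):

org.apache.hadoop.mapred.JobConf conf = new org.apache.hadoop.mapred.JobConf(MyJob.class);
conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
conf.setInt("mapred.line.input.format.linespermap", 1);   // N = 1 line of input per map task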

--Bobby Evans

On 1/10/12 8:09 AM, "GorGo"  wrote:



Hi.

I am no expert, but you could try this.

Your problem, I guess, is that the record reader reads multiple lines of
work (tasks) and gives them to each mapper, and thus if you only have a few
tasks (lines of work in the input file) Hadoop will not spawn multiple mappers.

You could try this: make each input record, in your now single input file,
an independent file (with only one line) and give as input to your job the
directory with the files (not a single file). For me, this forced the
spawning of multiple mappers.

There is another more correct way that forces the spawning of a map task for
each line but as I was using c++ pipes that was not an option for me.

Hope this helps
   GorGo



sset wrote:
>
> Hello,
>
> In HDFS we have set the block size to 40 bytes. The input data set is as below,
> terminated with line feeds.
>
> data1   (5*8=40 bytes)
> data2
> ..
> ...
> data10
>
>
> But still we see only 2 map tasks spawned, should have been atleast 10 map
> tasks. Each mapper performs complex mathematical computation. Not sure how
> works internally. Line feed does not work. Even with below settings map
> tasks never goes beyound 2, any way to make this spawn 10 tasks. Basically
> it should look like compute grid - computation in parallel.
>
> <property>
>   <name>io.bytes.per.checksum</name>
>   <value>30</value>
>   <description>The number of bytes per checksum.  Must not be larger than
>   io.file.buffer.size.</description>
> </property>
>
> <property>
>   <name>dfs.block.size</name>
>   <value>30</value>
>   <description>The default block size for new files.</description>
> </property>
>
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>10</value>
>   <description>The maximum number of map tasks that will be run
>   simultaneously by a task tracker.
>   </description>
> </property>
>
> This is a single node with a high configuration -> 8 CPUs and 8 GB memory. Hence we are
> taking an example of 10 data items with line feeds. We want to utilize the full power
> of the machine - hence we want at least 10 map tasks - each task needs to perform
> a highly complex mathematical simulation.  At present it looks like file
> data is the only way to specify the number of map tasks, via split size (in
> bytes) - but I would prefer some criterion like line feed or whatever.
>
> How do we get 10 map tasks from above configuration - pls help.
>
> thanks
>
>

--
View this message in context: 
http://old.nabble.com/increase-number-of-map-tasks-tp33107775p33111745.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: dual power for hadoop in datacenter?

2012-01-09 Thread Robert Evans
Be aware that if half of your cluster goes down, depending on the version and 
configuration of Hadoop, there may be a replication storm as Hadoop tries to 
bring everything back up to the proper replication factor.  Your cluster may 
still be unusable in this case.

--Bobby Evans

On 1/7/12 2:55 PM, "Alexander Lorenz"  wrote:

The NN, SN and JT must have separate power adapters; for the entire cluster, dual 
adapters are recommended.
For HBase and ZooKeeper servers / regionservers, dual adapters with 
separate power lines are also recommended.

- Alex

sent via my mobile device

On Jan 7, 2012, at 11:23 AM, Koert Kuipers  wrote:

> what are the thoughts on running a hadoop cluster in a datacenter with
> respect to power? should all the boxes have redundant power supplies and be
> on dual power? or just dual power for the namenode, secondary namenode, and
> hbase master, and then perhaps switch the power source per rack for the
> slaves to provide resilience to a power failure? or even just run
> everything on single power and accept the risk that everything can do down
> at once?



Re: Appmaster error

2012-01-05 Thread Robert Evans
Please don't cross post.

Common is BCCed.  Each container has a vmem limit that is enforced, but not in 
local mode.  If this is for the app master then you can increase this amount 
by setting it when you launch your AM, through 
SubmitApplicationRequest req; ...
req.getApplicationSubmissionContext().getAMContainerSpec().getResource().setMemory(memory_in_mb)

--Bobby Evans

On 1/5/12 12:44 AM, "raghavendhra rahul"  wrote:

Hi,
   I am trying to start a server within the application master's 
container alone. I tried using Runtime.getRuntime().exec("command"), but 
it throws the following exception.
Application application_1325738010393_0003 failed 1 times due to AM Container 
for appattempt_1325738010393_0003_01 exited with  exitCode: 143 due to: 
Container [pid=7212,containerID=container_1325738010393_0003_01_01] is 
running beyond virtual memory limits. Current usage: 118.4mb of 1.0gb physical 
memory used; 2.7gb of 2.1gb virtual memory used. Killing container. Dump of the 
process-tree for container_1325738010393_0003_01_0

When I tried using a single-node YARN cluster everything worked fine, but in a multi-
node setup it throws this exception. Should I increase the size of /tmp in Linux?
Any ideas




Re: Best ways to look-up information?

2011-12-12 Thread Robert Evans
Mark,

Are all of the tables used by all of the processes?  Are all of the tables used 
all of the time or are some used infrequently?  Does the data in these lookup 
tables change a lot or is it very stable?  What is the actual size of the data, 
yes 1 million entries, but is this 1 million 1kB, 100kB, or 1MB entries? Also 
how critical is it that your map/reduce jobs are reproducible later on?  If you 
have a shared resource like HBase it can change after a job runs, or even while 
a job is running, and you may not ever be able to reproduce that exact same 
result again.

Generally the processing that I have done is dominated by that last question so 
we tend to use cache archives to pull in versioned DBs that are optimized for 
reads and we use tools like CDB ( 
http://en.wikipedia.org/wiki/Constant_Data_Base) to do the lookups.  Most of 
these tend to be small enough to fit into memory and so we don't have to worry 
about that too much.  But I have seen other use cases too.
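
A minimal sketch of that cache-archive pattern (the paths are hypothetical; 0.20-era API):

org.apache.hadoop.filecache.DistributedCache.addCacheArchive(
    new java.net.URI("/lookups/geo-2011-12-01.tar.gz#geo"), conf);
org.apache.hadoop.filecache.DistributedCache.createSymlink(conf);
// Inside a task the unpacked archive is reachable through the "geo" symlink in the
// working directory, e.g. new java.io.File("geo/lookup.cdb"), and can be opened with
// whatever read-only store was shipped (a CDB reader, Berkeley DB, etc.).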

--Bobby Evans

On 12/12/11 11:33 AM, "Mark Kerzner"  wrote:

Hi,

I am planning a system to process information with Hadoop, and I will have
a few look-up tables that each processing node will need to query. There
are perhaps 20-50 such tables, and each has on the order of one million
entries. Which is the best mechanism for this look-up? Memcache, HBase,
JavaSpace, Lucene index, anything else?

Thank you,

Mark



Re: Passing data files via the distributed cache

2011-11-28 Thread Robert Evans
There is currently no way to delete the data from the cache when you are done.  
It is garbage collected when the cache starts to fill up (in LRU order if you 
are on a newer release).  The DistributedCache.addCacheFile is modifying the 
JobConf behind the scenes for you.  If you want to dig into the details of what 
it is doing you can look at the source code for it.
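
For reference, a minimal sketch of the usual pattern (file names are placeholders; 0.20-era API): register a file that is already in HDFS and read it back inside the task through its '#' symlink name.

org.apache.hadoop.filecache.DistributedCache.addCacheFile(
    new java.net.URI("/shared/lookup.txt#lookup.txt"), conf);
org.apache.hadoop.filecache.DistributedCache.createSymlink(conf);
// ... later, in the mapper's configure()/setup():
java.io.BufferedReader in = new java.io.BufferedReader(new java.io.FileReader("lookup.txt"));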

--Bobby Evans

On 11/28/11 4:46 AM, "Andy Doddington"  wrote:

Thanks for that link Prashant - very useful.

Two brief follow-up questions:

1) Having put data in the cache, I would like to be a good citizen by deleting 
the data from the cache once
I've finished - how do I do that?
2) Would it be simpler to pass the data as a value in the jobConf object?

Thanks,

Andy D.

On 25 Nov 2011, at 12:14, Prashant Kommireddi wrote:

> I believe you want to ship data to each node in your cluster before MR
> begins so the mappers can access files local to their machine. Hadoop
> tutorial on YDN has some good info on this.
>
> http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
>
> -Prashant Kommireddi
>
> On Fri, Nov 25, 2011 at 1:05 AM, Andy Doddington wrote:
>
>> I have a series of mappers that I would like to be passed data using the
>> distributed cache mechanism. At the
>> moment, I am using HDFS to pass the data, but this seems wasteful to me,
>> since they are all reading the same data.
>>
>> Is there a piece of example code that shows how data files can be placed
>> in the cache and accessed by mappers?
>>
>> Thanks,
>>
>>   Andy Doddington
>>
>>




Re: mapred.map.tasks getting set, but not sure where

2011-11-07 Thread Robert Evans
It seems logical too that launching 4000 map tasks on a 20 node cluster is 
going to have a lot of overhead with it.  20 does not seem like the ideal 
number, but I don't really know the internals of Cassandra that well.  You 
might want to post this question on the Cassandra list to see if they can help 
you identify a way to increase the number of map tasks.

--Bobby Evans

On 11/5/11 9:33 AM, "Brendan W."  wrote:

Yeah, that's my guess now, that somebody must have hacked the Cassandra
libs on me...just wanted to see if there were other possibilities for where
that parameter was being set.

Thanks a lot for the help.

On Fri, Nov 4, 2011 at 2:11 PM, Harsh J  wrote:

> Could just be that Cassandra has changed the way their splits generate?
> Was Cassandra client libs changed at any point? Have you looked at its
> input formats' sources?
>
> On 04-Nov-2011, at 10:05 PM, Brendan W. wrote:
>
> > Plain Java MR, using the Cassandra inputFormat to read out of Cassandra.
> >
> > Perhaps somebody hacked the inputFormat code on me...
> >
> > But what's weird is that the parameter mapred.map.tasks didn't appear in
> > the job confs before at all.  Now it does, with a value of 20 (happens to
> > be the # of machines in the cluster), and that's without the jobs or the
> > mapred-site.xml files themselves changing.
> >
> > The inputSplitSize is set specifically in the jobs, and has not been
> > changed (except I subsequently fiddled with it a little to see if it
> > affected the fact that I was getting 20 splits, and it didn't affect
> > that...just the split size, not the number).
> >
> > After a submit the job, I get a message "TOTAL NUMBER OF SPLIT = 20",
> > before a list of the input splits...sort of looks like a hack but I can't
> > find where it is.
> >
> > On Fri, Nov 4, 2011 at 11:58 AM, Harsh J  wrote:
> >
> >> Brendan,
> >>
> >> Are these jobs (whose split behavior has changed) via Hive/etc. or plain
> >> Java MR?
> >>
> >> In case its the former, do you have users using newer versions of them?
> >>
> >> On 04-Nov-2011, at 8:03 PM, Brendan W. wrote:
> >>
> >>> Hi,
> >>>
> >>> In the jobs running on my cluster of 20 machines, I used to run jobs
> (via
> >>> "hadoop jar ...") that would spawn around 4000 map tasks.  Now when I
> run
> >>> the same jobs, that number is 20; and I notice that in the job
> >>> configuration, the parameter mapred.map.tasks is set to 20, whereas it
> >>> never used to be present at all in the configuration file.
> >>>
> >>> Changing the input split size in the job doesn't affect this--I get the
> >>> size split I ask for, but the *number* of input splits is still capped
> at
> >>> 20--i.e., the job isn't reading all of my data.
> >>>
> >>> The mystery to me is where this parameter could be getting set.  It is
> >> not
> >>> present in the mapred-site.xml file in /conf on any
> machine
> >> in
> >>> the cluster, and it is not being set in the job (I'm running out of the
> >>> same jar I always did; no updates).
> >>>
> >>> Is there *anywhere* else this parameter could possibly be getting set?
> >>> I've stopped and restarted map-reduce on the cluster with no
> >> effect...it's
> >>> getting re-read in from somewhere, but I can't figure out where.
> >>>
> >>> Thanks a lot,
> >>>
> >>> Brendan
> >>
> >>
>
>



Re: mapred.map.tasks getting set, but not sure where

2011-11-04 Thread Robert Evans
In 0.20.2 the JobClient will update mapred.map.tasks to be equal to the number 
of splits returned by the InputFormat.  The input format will usually take 
mapred.map.tasks as a recommendation when deciding on what splits to make.  
That is the only place in the code that I could find that is setting the value 
and could have any impact on the number of mappers launched.  It could be that 
someone changed the number of files that are being read in as input, or that 
the block size of the files being read in is now different.  It could also be 
that someone started compressing the input files, so now they cannot be split.  
If the number of mappers is different it probably means that the input is 
different somehow.
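
A rough paraphrase of the 0.20 FileInputFormat split-size logic (not the Cassandra input format): mapred.map.tasks only sets the "goal" size and can be overridden by the minimum split size and the block size, so it is a hint rather than a hard limit.

long goalSize  = totalSize / Math.max(1, numSplits);   // numSplits comes from mapred.map.tasks
long splitSize = Math.max(minSplitSize, Math.min(goalSize, blockSize));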

--Bobby Evans

On 11/4/11 10:12 AM, "Brendan W."  wrote:

All the same, no change in that...0.20.2.

Other people do have access to this system to change things like conf
files, but nobody's owning up and I have to figure this out.  I have
verified that the mapred.map.tasks property is not getting set in the
mapred-site.xml files on the cluster or in the job.  Just out of other
ideas about where it might be getting set...

Thanks,

Brendan

On Fri, Nov 4, 2011 at 11:04 AM, Robert Evans  wrote:

> What versions of Hadoop were you running with previously, and what version
> are you running with now?
>
> --Bobby Evans
>
> On 11/4/11 9:33 AM, "Brendan W."  wrote:
>
> Hi,
>
> In the jobs running on my cluster of 20 machines, I used to run jobs (via
> "hadoop jar ...") that would spawn around 4000 map tasks.  Now when I run
> the same jobs, that number is 20; and I notice that in the job
> configuration, the parameter mapred.map.tasks is set to 20, whereas it
> never used to be present at all in the configuration file.
>
> Changing the input split size in the job doesn't affect this--I get the
> size split I ask for, but the *number* of input splits is still capped at
> 20--i.e., the job isn't reading all of my data.
>
> The mystery to me is where this parameter could be getting set.  It is not
> present in the mapred-site.xml file in /conf on any machine in
> the cluster, and it is not being set in the job (I'm running out of the
> same jar I always did; no updates).
>
> Is there *anywhere* else this parameter could possibly be getting set?
> I've stopped and restarted map-reduce on the cluster with no effect...it's
> getting re-read in from somewhere, but I can't figure out where.
>
> Thanks a lot,
>
> Brendan
>
>



Re: mapred.map.tasks getting set, but not sure where

2011-11-04 Thread Robert Evans
What versions of Hadoop were you running with previously, and what version are 
you running with now?

--Bobby Evans

On 11/4/11 9:33 AM, "Brendan W."  wrote:

Hi,

In the jobs running on my cluster of 20 machines, I used to run jobs (via
"hadoop jar ...") that would spawn around 4000 map tasks.  Now when I run
the same jobs, that number is 20; and I notice that in the job
configuration, the parameter mapred.map.tasks is set to 20, whereas it
never used to be present at all in the configuration file.

Changing the input split size in the job doesn't affect this--I get the
size split I ask for, but the *number* of input splits is still capped at
20--i.e., the job isn't reading all of my data.

The mystery to me is where this parameter could be getting set.  It is not
present in the mapred-site.xml file in /conf on any machine in
the cluster, and it is not being set in the job (I'm running out of the
same jar I always did; no updates).

Is there *anywhere* else this parameter could possibly be getting set?
I've stopped and restarted map-reduce on the cluster with no effect...it's
getting re-read in from somewhere, but I can't figure out where.

Thanks a lot,

Brendan



Re: Is it possible to run multiple MapReduce against the same HDFS?

2011-10-11 Thread Robert Evans
I am not positive how all of that works and I may get some of this wrong, but I 
believe that the mapreduce user has special privileges in relation to HDFS 
that allow it to become another user and read the data on that user's behalf.  
I think that these privileges are granted by the user when it connects to the 
JT. I am not an expert on how the security in Hadoop works and I am likely to 
have gotten some of this wrong, so if there is someone on the list that wants 
to correct me or confirm what I have said that would be great.

--
Bobby Evans

On 10/10/11 9:56 PM, "Zhenhua (Gerald) Guo"  wrote:

Thanks, Robert.  I will look into hod.

When MapReduce framework accesses data stored in HDFS, which account
is used, the account which MapReduce daemons (e.g. job tracker) run as
or the account of the user who submits the job?  If HDFS and MapReduce
clusters are run with different accounts, can MapReduce cluster be
able to access HDFS directories and files (if authentication in HDFS
is enabled)?

Thanks!

Gerald

On Mon, Oct 10, 2011 at 12:36 PM, Robert Evans  wrote:
> It should be possible to use multiple map/reduce clusters sharing the same 
> HDFS; you can look at HOD, where it launches a JT on demand.  The only chance 
> of collision that I can think of would be if by some odd chance both Job 
> Trackers were started at exactly the same millisecond.   The JT uses the time 
> it was started as part of the job id for all jobs.  Those job ids are assumed 
> to be unique and used to create files/directories in HDFS to store data for 
> that job.
>
> --Bobby Evans
>
> On 10/7/11 12:09 PM, "Zhenhua (Gerald) Guo"  wrote:
>
> I plan to deploy a HDFS cluster which will be shared by multiple
> MapReduce clusters.
> I wonder whether this is possible.  Will it incur any conflicts among
> MapReduce (e.g. different MapReduce clusters try to use the same temp
> directory in HDFS)?
> If it is possible, how should the security parameters be set up (e.g.
> user identity, file permission)?
>
> Thanks,
>
> Gerald
>
>



Re: Is it possible to run multiple MapReduce against the same HDFS?

2011-10-10 Thread Robert Evans
It should be possible to use multiple map/reduce clusters sharing the same 
HDFS; you can look at HOD, where it launches a JT on demand.  The only chance of 
collision that I can think of would be if by some odd chance both Job Trackers 
were started at exactly the same millisecond.   The JT uses the time it was 
started as part of the job id for all jobs.  Those job ids are assumed to be 
unique and used to create files/directories in HDFS to store data for that job.

--Bobby Evans

On 10/7/11 12:09 PM, "Zhenhua (Gerald) Guo"  wrote:

I plan to deploy a HDFS cluster which will be shared by multiple
MapReduce clusters.
I wonder whether this is possible.  Will it incur any conflicts among
MapReduce (e.g. different MapReduce clusters try to use the same temp
directory in HDFS)?
If it is possible, how should the security parameters be set up (e.g.
user identity, file permission)?

Thanks,

Gerald



Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
That is correct.  However, it is a bit more complicated than that.  The Task 
Tracker's in memory index of the distributed cache is keyed off of the path of 
the file and the HDFS creation time of the file.  So if you delete the original 
file off of HDFS, and then recreate it with a new time stamp the distributed 
cache will start downloading the new file.

Also when the distributed cache on a disk fills up unused entries in it are 
deleted.

--Bobby Evans

On 9/27/11 2:32 PM, "Meng Mao"  wrote:

So the proper description of how DistributedCache normally works is:

1. have files to be cached sitting around in HDFS
2. Run Job A, which specifies those files to be put into DistributedCache
space. Each worker node copies the to-be-cached files from HDFS to local
disk, but more importantly, the TaskTracker acknowledges this distribution
and marks somewhere the fact that these files are cached for the first (and
only) time
3. job runs fine
4. Run Job A some time later. TaskTracker simply assumes (by looking at its
memory) that the files are still cached. The tasks on the workers, on this
second call to addCacheFile, don't actually copy the files from HDFS to
local disk again, but instead accept TaskTracker's word that they're still
there. Because the files actually still exist, the workers run fine and the
job finishes normally.

Is that a correct interpretation? If so, the caution, then, must be that if
you accidentally deleted the local disk cache file copies, you either
repopulate them (as well as their crc checksums) or you restart the
TaskTracker?



On Tue, Sep 27, 2011 at 3:03 PM, Robert Evans  wrote:

> Yes, all of the state for the task tracker is in memory.  It never looks at
> the disk to see what is there, it only maintains the state in memory.
>
> --bobby Evans
>
>
> On 9/27/11 1:00 PM, "Meng Mao"  wrote:
>
> I'm not concerned about disk space usage -- the script we used that deleted
> the taskTracker cache path has been fixed not to do so.
>
> I'm curious about the exact behavior of jobs that use DistributedCache
> files. Again, it seems safe from your description to delete files between
> completed runs. How could the job or the taskTracker distinguish between
> the
> files having been deleted and their not having been downloaded from a
> previous run of the job? Is it state in memory that the taskTracker
> maintains?
>
>
> On Tue, Sep 27, 2011 at 1:44 PM, Robert Evans  wrote:
>
> > If you are never ever going to use that file again for any map/reduce
> task
> > in the future then yes you can delete it, but I would not recommend it.
>  If
> > you want to reduce the amount of space that is used by the distributed
> cache
> > there is a config parameter for that.
> >
> > "local.cache.size"  it is the number of bytes per drive that will be used
> > for storing data in the distributed cache.   This is in 0.20 for hadoop I
> am
> > not sure if it has changed at all for trunk.  It is not documented as far
> as
> > I can tell, and it defaults to 10GB.
> >
> > --Bobby Evans
> >
> >
> > On 9/27/11 12:04 PM, "Meng Mao"  wrote:
> >
> > From that interpretation, it then seems like it would be safe to delete
> the
> > files between completed runs? How could it distinguish between the files
> > having been deleted and their not having been downloaded from a previous
> > run?
> >
> > On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans 
> > wrote:
> >
> > > addCacheFile sets a config value in your jobConf that indicates which
> > files
> > > your particular job depends on.  When the TaskTracker is assigned to
> run
> > > part of your job (map task or reduce task), it will download your
> > jobConf,
> > > read it in, and then download the files listed in the conf, if it has
> not
> > > already downloaded them from a previous run.  Then it will set up the
> > > directory structure for your job, possibly adding in symbolic links to
> > these
> > > files in the working directory for your task.  After that it will
> launch
> > > your task.
> > >
> > > --Bobby Evans
> > >
> > > On 9/27/11 11:17 AM, "Meng Mao"  wrote:
> > >
> > > Who is in charge of getting the files there for the first time? The
> > > addCacheFile call in the mapreduce job? Or a manual setup by the
> > > user/operator?
> > >
> > > On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans 
> > > wrote:
> > >
> > > > The problem is the step 4 in the breaking sequence.  Currently the
> > > > TaskTracker never looks at the disk to know if a file 

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
Yes, all of the state for the task tracker is in memory.  It never looks at the 
disk to see what is there, it only maintains the state in memory.

--bobby Evans


On 9/27/11 1:00 PM, "Meng Mao"  wrote:

I'm not concerned about disk space usage -- the script we used that deleted
the taskTracker cache path has been fixed not to do so.

I'm curious about the exact behavior of jobs that use DistributedCache
files. Again, it seems safe from your description to delete files between
completed runs. How could the job or the taskTracker distinguish between the
files having been deleted and their not having been downloaded from a
previous run of the job? Is it state in memory that the taskTracker
maintains?


On Tue, Sep 27, 2011 at 1:44 PM, Robert Evans  wrote:

> If you are never ever going to use that file again for any map/reduce task
> in the future then yes you can delete it, but I would not recommend it.  If
> you want to reduce the amount of space that is used by the distributed cache
> there is a config parameter for that.
>
> "local.cache.size"  it is the number of bytes per drive that will be used
> for storing data in the distributed cache.   This is in 0.20 for hadoop I am
> not sure if it has changed at all for trunk.  It is not documented as far as
> I can tell, and it defaults to 10GB.
>
> --Bobby Evans
>
>
> On 9/27/11 12:04 PM, "Meng Mao"  wrote:
>
> From that interpretation, it then seems like it would be safe to delete the
> files between completed runs? How could it distinguish between the files
> having been deleted and their not having been downloaded from a previous
> run?
>
> On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans 
> wrote:
>
> > addCacheFile sets a config value in your jobConf that indicates which
> files
> > your particular job depends on.  When the TaskTracker is assigned to run
> > part of your job (map task or reduce task), it will download your
> jobConf,
> > read it in, and then download the files listed in the conf, if it has not
> > already downloaded them from a previous run.  Then it will set up the
> > directory structure for your job, possibly adding in symbolic links to
> these
> > files in the working directory for your task.  After that it will launch
> > your task.
> >
> > --Bobby Evans
> >
> > On 9/27/11 11:17 AM, "Meng Mao"  wrote:
> >
> > Who is in charge of getting the files there for the first time? The
> > addCacheFile call in the mapreduce job? Or a manual setup by the
> > user/operator?
> >
> > On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans 
> > wrote:
> >
> > > The problem is the step 4 in the breaking sequence.  Currently the
> > > TaskTracker never looks at the disk to know if a file is in the
> > distributed
> > > cache or not.  It assumes that if it downloaded the file and did not
> > delete
> > > that file itself then the file is still there in its original form.  It
> > does
> > > not know that you deleted those files, or if wrote to the files, or in
> > any
> > > way altered those files.  In general you should not be modifying those
> > > files.  This is not only because it messes up the tracking of those
> > files,
> > > but because other jobs running concurrently with your task may also be
> > using
> > > those files.
> > >
> > > --Bobby Evans
> > >
> > >
> > > On 9/26/11 4:40 PM, "Meng Mao"  wrote:
> > >
> > > Let's frame the issue in another way. I'll describe a sequence of
> Hadoop
> > > operations that I think should work, and then I'll get into what we did
> > and
> > > how it failed.
> > >
> > > Normal sequence:
> > > 1. have files to be cached in HDFS
> > > 2. Run Job A, which specifies those files to be put into
> DistributedCache
> > > space
> > > 3. job runs fine
> > > 4. Run Job A some time later. job runs fine again.
> > >
> > > Breaking sequence:
> > > 1. have files to be cached in HDFS
> > > 2. Run Job A, which specifies those files to be put into
> DistributedCache
> > > space
> > > 3. job runs fine
> > > 4. Manually delete cached files out of local disk on worker nodes
> > > 5. Run Job A again, expect it to push out cache copies as needed.
> > > 6. job fails because the cache copies didn't get distributed
> > >
> > > Should this second sequence have broken?
> > >
> > > On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao  wrote:
> > >
> > > > Hmm, I must have real

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
If you are never ever going to use that file again for any map/reduce task in 
the future then yes you can delete it, but I would not recommend it.  If you 
want to reduce the amount of space that is used by the distributed cache there 
is a config parameter for that.

"local.cache.size"  it is the number of bytes per drive that will be used for 
storing data in the distributed cache.   This is in 0.20 for hadoop I am not 
sure if it has changed at all for trunk.  It is not documented as far as I can 
tell, and it defaults to 10GB.

--Bobby Evans


On 9/27/11 12:04 PM, "Meng Mao"  wrote:

From that interpretation, it then seems like it would be safe to delete the
files between completed runs? How could it distinguish between the files
having been deleted and their not having been downloaded from a previous
run?

On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans  wrote:

> addCacheFile sets a config value in your jobConf that indicates which files
> your particular job depends on.  When the TaskTracker is assigned to run
> part of your job (map task or reduce task), it will download your jobConf,
> read it in, and then download the files listed in the conf, if it has not
> already downloaded them from a previous run.  Then it will set up the
> directory structure for your job, possibly adding in symbolic links to these
> files in the working directory for your task.  After that it will launch
> your task.
>
> --Bobby Evans
>
> On 9/27/11 11:17 AM, "Meng Mao"  wrote:
>
> Who is in charge of getting the files there for the first time? The
> addCacheFile call in the mapreduce job? Or a manual setup by the
> user/operator?
>
> On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans 
> wrote:
>
> > The problem is the step 4 in the breaking sequence.  Currently the
> > TaskTracker never looks at the disk to know if a file is in the
> distributed
> > cache or not.  It assumes that if it downloaded the file and did not
> delete
> > that file itself then the file is still there in its original form.  It
> does
> > not know that you deleted those files, or if wrote to the files, or in
> any
> > way altered those files.  In general you should not be modifying those
> > files.  This is not only because it messes up the tracking of those
> files,
> > but because other jobs running concurrently with your task may also be
> using
> > those files.
> >
> > --Bobby Evans
> >
> >
> > On 9/26/11 4:40 PM, "Meng Mao"  wrote:
> >
> > Let's frame the issue in another way. I'll describe a sequence of Hadoop
> > operations that I think should work, and then I'll get into what we did
> and
> > how it failed.
> >
> > Normal sequence:
> > 1. have files to be cached in HDFS
> > 2. Run Job A, which specifies those files to be put into DistributedCache
> > space
> > 3. job runs fine
> > 4. Run Job A some time later. job runs fine again.
> >
> > Breaking sequence:
> > 1. have files to be cached in HDFS
> > 2. Run Job A, which specifies those files to be put into DistributedCache
> > space
> > 3. job runs fine
> > 4. Manually delete cached files out of local disk on worker nodes
> > 5. Run Job A again, expect it to push out cache copies as needed.
> > 6. job fails because the cache copies didn't get distributed
> >
> > Should this second sequence have broken?
> >
> > On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao  wrote:
> >
> > > Hmm, I must have really missed an important piece somewhere. This is
> from
> > > the MapRed tutorial text:
> > >
> > > "DistributedCache is a facility provided by the Map/Reduce framework to
> > > cache files (text, archives, jars and so on) needed by applications.
> > >
> > > Applications specify the files to be cached via urls (hdfs://) in the
> > > JobConf. The DistributedCache* assumes that the files specified via
> > > hdfs:// urls are already present on the FileSystem.*
> > >
> > > *The framework will copy the necessary files to the slave node before
> any
> > > tasks for the job are executed on that node*. Its efficiency stems from
> > > the fact that the files are only copied once per job and the ability to
> > > cache archives which are un-archived on the slaves."
> > >
> > >
> > > After some close reading, the two bolded pieces seem to be in
> > contradiction
> > > of each other? I'd always that addCacheFile() would perform the 2nd
> > bolded
> > > statement. If that sentence is true, then I still don't have an
> > explanation
> >

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
addCacheFile sets a config value in your jobConf that indicates which files 
your particular job depends on.  When the TaskTracker is assigned to run part 
of your job (map task or reduce task), it will download your jobConf, read it 
in, and then download the files listed in the conf, if it has not already 
downloaded them from a previous run.  Then it will set up the directory 
structure for your job, possibly adding in symbolic links to these files in the 
working directory for your task.  After that it will launch your task.

--Bobby Evans

On 9/27/11 11:17 AM, "Meng Mao"  wrote:

Who is in charge of getting the files there for the first time? The
addCacheFile call in the mapreduce job? Or a manual setup by the
user/operator?

On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans  wrote:

> The problem is the step 4 in the breaking sequence.  Currently the
> TaskTracker never looks at the disk to know if a file is in the distributed
> cache or not.  It assumes that if it downloaded the file and did not delete
> that file itself then the file is still there in its original form.  It does
> not know that you deleted those files, or if wrote to the files, or in any
> way altered those files.  In general you should not be modifying those
> files.  This is not only because it messes up the tracking of those files,
> but because other jobs running concurrently with your task may also be using
> those files.
>
> --Bobby Evans
>
>
> On 9/26/11 4:40 PM, "Meng Mao"  wrote:
>
> Let's frame the issue in another way. I'll describe a sequence of Hadoop
> operations that I think should work, and then I'll get into what we did and
> how it failed.
>
> Normal sequence:
> 1. have files to be cached in HDFS
> 2. Run Job A, which specifies those files to be put into DistributedCache
> space
> 3. job runs fine
> 4. Run Job A some time later. job runs fine again.
>
> Breaking sequence:
> 1. have files to be cached in HDFS
> 2. Run Job A, which specifies those files to be put into DistributedCache
> space
> 3. job runs fine
> 4. Manually delete cached files out of local disk on worker nodes
> 5. Run Job A again, expect it to push out cache copies as needed.
> 6. job fails because the cache copies didn't get distributed
>
> Should this second sequence have broken?
>
> On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao  wrote:
>
> > Hmm, I must have really missed an important piece somewhere. This is from
> > the MapRed tutorial text:
> >
> > "DistributedCache is a facility provided by the Map/Reduce framework to
> > cache files (text, archives, jars and so on) needed by applications.
> >
> > Applications specify the files to be cached via urls (hdfs://) in the
> > JobConf. The DistributedCache* assumes that the files specified via
> > hdfs:// urls are already present on the FileSystem.*
> >
> > *The framework will copy the necessary files to the slave node before any
> > tasks for the job are executed on that node*. Its efficiency stems from
> > the fact that the files are only copied once per job and the ability to
> > cache archives which are un-archived on the slaves."
> >
> >
> > After some close reading, the two bolded pieces seem to be in
> contradiction
> > of each other? I'd always that addCacheFile() would perform the 2nd
> bolded
> > statement. If that sentence is true, then I still don't have an
> explanation
> > of why our job didn't correctly push out new versions of the cache files
> > upon the startup and execution of JobConfiguration. We deleted them
> before
> > our job started, not during.
> >
> > On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans 
> wrote:
> >
> >> Meng Mao,
> >>
> >> The way the distributed cache is currently written, it does not verify
> the
> >> integrity of the cache files at all after they are downloaded.  It just
> >> assumes that if they were downloaded once they are still there and in
> the
> >> proper shape.  It might be good to file a JIRA to add in some sort of
> check.
> >>  Another thing to do is that the distributed cache also includes the
> time
> >> stamp of the original file, just incase you delete the file and then use
> a
> >> different version.  So if you want it to force a download again you can
> copy
> >> it delete the original and then move it back to what it was before.
> >>
> >> --Bobby Evans
> >>
> >> On 9/23/11 1:57 AM, "Meng Mao"  wrote:
> >>
> >> We use the DistributedCache class to distribute a few lookup files for
> our
> >> jobs. We have been aggressively

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
The problem is step 4 in the breaking sequence.  Currently the TaskTracker 
never looks at the disk to know if a file is in the distributed cache or not.  
It assumes that if it downloaded the file and did not delete that file itself 
then the file is still there in its original form.  It does not know that you 
deleted those files, or wrote to the files, or in any way altered those 
files.  In general you should not be modifying those files.  This is not only 
because it messes up the tracking of those files, but because other jobs 
running concurrently with your task may also be using those files.

--Bobby Evans


On 9/26/11 4:40 PM, "Meng Mao"  wrote:

Let's frame the issue in another way. I'll describe a sequence of Hadoop
operations that I think should work, and then I'll get into what we did and
how it failed.

Normal sequence:
1. have files to be cached in HDFS
2. Run Job A, which specifies those files to be put into DistributedCache
space
3. job runs fine
4. Run Job A some time later. job runs fine again.

Breaking sequence:
1. have files to be cached in HDFS
2. Run Job A, which specifies those files to be put into DistributedCache
space
3. job runs fine
4. Manually delete cached files out of local disk on worker nodes
5. Run Job A again, expect it to push out cache copies as needed.
6. job fails because the cache copies didn't get distributed

Should this second sequence have broken?

On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao  wrote:

> Hmm, I must have really missed an important piece somewhere. This is from
> the MapRed tutorial text:
>
> "DistributedCache is a facility provided by the Map/Reduce framework to
> cache files (text, archives, jars and so on) needed by applications.
>
> Applications specify the files to be cached via urls (hdfs://) in the
> JobConf. The DistributedCache* assumes that the files specified via
> hdfs:// urls are already present on the FileSystem.*
>
> *The framework will copy the necessary files to the slave node before any
> tasks for the job are executed on that node*. Its efficiency stems from
> the fact that the files are only copied once per job and the ability to
> cache archives which are un-archived on the slaves."
>
>
> After some close reading, the two bolded pieces seem to be in contradiction
> of each other? I'd always assumed that addCacheFile() would perform the 2nd bolded
> statement. If that sentence is true, then I still don't have an explanation
> of why our job didn't correctly push out new versions of the cache files
> upon the startup and execution of JobConfiguration. We deleted them before
> our job started, not during.
>
> On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans  wrote:
>
>> Meng Mao,
>>
>> The way the distributed cache is currently written, it does not verify the
>> integrity of the cache files at all after they are downloaded.  It just
>> assumes that if they were downloaded once they are still there and in the
>> proper shape.  It might be good to file a JIRA to add in some sort of check.
>>  Another thing to do is that the distributed cache also includes the time
>> stamp of the original file, just in case you delete the file and then use a
>> different version.  So if you want it to force a download again you can copy
>> it delete the original and then move it back to what it was before.
>>
>> --Bobby Evans
>>
>> On 9/23/11 1:57 AM, "Meng Mao"  wrote:
>>
>> We use the DistributedCache class to distribute a few lookup files for our
>> jobs. We have been aggressively deleting failed task attempts' leftover
>> data
>> , and our script accidentally deleted the path to our distributed cache
>> files.
>>
>> Our task attempt leftover data was here [per node]:
>> /hadoop/hadoop-metadata/cache/mapred/local/
>> and our distributed cache path was:
>> hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/
>> We deleted this path by accident.
>>
>> Does this latter path look normal? I'm not that familiar with
>> DistributedCache but I'm up right now investigating the issue so I thought
>> I'd ask.
>>
>> After that deletion, the first 2 jobs to run (which are use the
>> addCacheFile
>> method to distribute their files) didn't seem to push the files out to the
>> cache path, except on one node. Is this expected behavior? Shouldn't
>> addCacheFile check to see if the files are missing, and if so, repopulate
>> them as needed?
>>
>> I'm trying to get a handle on whether it's safe to delete the distributed
>> cache path when the grid is quiet and no jobs are running. That is, if
>> addCacheFile is designed to be robust against the files it's caching not
>> being at each job start.
>>
>>
>



Re: many killed tasks, long execution time

2011-09-23 Thread Robert Evans
Sofia,

Speculative execution is great so long as you are not writing data off to HDFS 
on the side.  If you use a normal output format it can handle putting your 
output in a temporary location with a unique name, and then in the cleanup 
method when all tasks have finished it moves the files to their final location.

For what you have said it sounds almost like you are writing data out from your 
map task to HDFS and then reading that data back in to your reduce task from 
HDFS.   Is that correct?  This may be the cause of your slowdown when you add 
in more reducers.  It could also have something to do with the number of nodes 
that you have.  You said you have a 12 node cluster.  How many reducer slots 
are there per node?  If there is only one then you are adding in extra overhead 
not just to launch a new reducer, but also because the distribution to 
different nodes is uneven.  One node will get two reducers, one right after the 
other, while all the others run just one reducer.

I think it is more likely something like the second issue than the first.

--Bobby Evans
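
For reference, a minimal sketch of what turning speculative execution off for the reduce tasks looks like with the 0.20 JobConf API (the driver class name below is invented, and this is an illustration rather than a drop-in for Sofia's job):

import org.apache.hadoop.mapred.JobConf;

public class RtreeJobDriver {                          // hypothetical driver class
    public static JobConf buildConf() {
        JobConf conf = new JobConf(RtreeJobDriver.class);
        // Stop duplicate (speculative) reduce attempts from racing to create
        // the same side files on HDFS; map-side speculation can stay on.
        conf.setReduceSpeculativeExecution(false);
        // Property form, if you prefer configuration files:
        // mapred.reduce.tasks.speculative.execution=false
        return conf;
    }
}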


On 9/23/11 9:04 AM, "Sofia Georgiakaki"  wrote:

Mr. Bobby, thank you for your reply.
The IOException was related with the speculative execution. In my Reducers I 
create some files written on the HDFS, so in some occasions multiple tasks 
attempted to write the same file. I turned the speculative mode off for the 
reduce tasks, and the problem was solved.

However, the major problem with the long execution time remains. I can assume 
now that all these failed map tasks have to do with the speculative execution 
too, so the source of the problem must be somewhere else.

I noticed that the average time for the map tasks (as well as the time at which, e.g., the 
longest mapper finishes) increases as I increase the number of reducers! Is this 
normal? The input is always the same, as well as the number of map tasks 
(158 map tasks executed on the 12-node cluster. each node has capacity for 4 
map tasks).
In addition, the performance of the Job is ok when the number of reducers are 
in the range 2-12, and then if I increase the reducers further, the performance 
gets worse and worse...

Any ideas would be helpful!
Thank you!





____
From: Robert Evans 
To: "common-user@hadoop.apache.org" ; Sofia 
Georgiakaki 
Sent: Friday, September 23, 2011 4:28 PM
Subject: Re: many killed tasks, long execution time

Can you include the complete stack trace of the IOException you are seeing?

--Bobby Evans

On 9/23/11 2:15 AM, "Sofia Georgiakaki"  wrote:




Good morning!

I would be grateful if anyone could help me about a serious problem that I'm 
facing.
I try to run a hadoop job on a 12-node cluster (has 48 task capacity), and I 
have problems when dealing with big input data (10-20GB) which gets worse when 
I increase the number of reducers.
Many tasks get killed (for example 25 out of the 148 map tasks, and 15 out of 
40 reducers) and the job struggles to finish.

The job is heavy in general, as it builds an Rtree on hdfs.
During the reduce phase, I also create and write some binary files on HDFS 
using FSDataOutputStream, and I noticed that sometimes some tasks fail to 
write correctly to their particular binary file, throwing an IOException when 
they try to execute dataFileOut.write(m_buffer);

I'm using 0.20.203 version and I had also tested the code on 0.20.2 before 
(facing the same problems with killed tasks!)


I would appreciate any advice/idea, as I have to finish my diploma thesis (it 
has taken me a year, I hope not to take longer).

Thank you very much in advance
Sofia



Re: operation of DistributedCache following manual deletion of cached files?

2011-09-23 Thread Robert Evans
Meng Mao,

The way the distributed cache is currently written, it does not verify the 
integrity of the cache files at all after they are downloaded.  It just assumes 
that if they were downloaded once they are still there and in the proper shape. 
 It might be good to file a JIRA to add in some sort of check.  Another thing 
to note is that the distributed cache also includes the time stamp of the 
original file, just in case you delete the file and then use a different 
version.  So if you want to force a download again you can copy it, delete 
the original, and then move it back to what it was before.

--Bobby Evans
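
For anyone following along, a minimal sketch of the normal addCacheFile flow with the 0.20 API (the HDFS path and class name below are made up for illustration):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class CacheSetupSketch {                        // hypothetical helper
    // Called from the job driver: the file must already exist on HDFS, and the
    // framework copies it to each tasktracker's local cache once per job.
    public static void addLookupFile(Configuration conf) throws IOException {
        DistributedCache.addCacheFile(
            URI.create("hdfs://namenode:8020/user/meng/lookup.txt"), conf);
    }

    // Called from a task: returns the local paths the framework materialized.
    public static Path[] localCopies(Configuration conf) throws IOException {
        return DistributedCache.getLocalCacheFiles(conf);
    }
}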

On 9/23/11 1:57 AM, "Meng Mao"  wrote:

We use the DistributedCache class to distribute a few lookup files for our
jobs. We have been aggressively deleting failed task attempts' leftover data,
and our script accidentally deleted the path to our distributed cache
files.

Our task attempt leftover data was here [per node]:
/hadoop/hadoop-metadata/cache/mapred/local/
and our distributed cache path was:
hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/
We deleted this path by accident.

Does this latter path look normal? I'm not that familiar with
DistributedCache but I'm up right now investigating the issue so I thought
I'd ask.

After that deletion, the first 2 jobs to run (which use the addCacheFile
method to distribute their files) didn't seem to push the files out to the
cache path, except on one node. Is this expected behavior? Shouldn't
addCacheFile check to see if the files are missing, and if so, repopulate
them as needed?

I'm trying to get a handle on whether it's safe to delete the distributed
cache path when the grid is quiet and no jobs are running. That is, if
addCacheFile is designed to be robust against the files it's caching not
being at each job start.



Re: many killed tasks, long execution time

2011-09-23 Thread Robert Evans
Can you include the complete stack trace of the IOException you are seeing?

--Bobby Evans

On 9/23/11 2:15 AM, "Sofia Georgiakaki"  wrote:




Good morning!

I would be grateful if anyone could help me about a serious problem that I'm 
facing.
I try to run a hadoop job on a 12-node cluster (has 48 task capacity), and I 
have problems when dealing with big input data (10-20GB) which gets worse when 
I increase the number of reducers.
Many tasks get killed (for example 25 out of the 148 map tasks, and 15 out of 
40 reducers) and the job struggles to finish.

The job is heavy in general, as it builds an Rtree on hdfs.
During the reduce phase, I also create and write some binary files on HDFS 
using FSDataOutputStream, and I noticed that sometimes some tasks fail to 
write correctly to their particular binary file, throwing an IOException when 
they try to execute dataFileOut.write(m_buffer);

I'm using 0.20.203 version and I had also tested the code on 0.20.2 before 
(facing the same problems with killed tasks!)


I would appreciate any advice/idea, as I have to finish my diploma thesis (it 
has taken me a year, I hope not to take longer).

Thank you very much in advance
Sofia



Re: How to get hadoop job information effectively?

2011-09-21 Thread Robert Evans
Not that I know of.  We scrape web pages which is a horrible thing to do.  
There is a JIRA to add in some web service APIs to expose this type of 
information, but it is not going to be available for a while.

--Bobby Evans
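
As a rough sketch of what the client API does expose today, this only reaches the JobStatus/RunningJob level of detail, not the full JobInProgress picture (the class name is invented; the calls are the 0.20 org.apache.hadoop.mapred client API):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

public class JobDashboardSketch {                      // hypothetical
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();                  // picks up mapred-site.xml from the classpath
        JobClient client = new JobClient(conf);
        for (JobStatus status : client.getAllJobs()) {
            RunningJob job = client.getJob(status.getJobID());
            if (job == null) {
                continue;                              // the job may already have been retired
            }
            System.out.println(job.getID() + " " + job.getJobName()
                + " map=" + (int) (job.mapProgress() * 100) + "%"
                + " reduce=" + (int) (job.reduceProgress() * 100) + "%"
                + " complete=" + job.isComplete());
        }
    }
}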

On 9/21/11 1:01 PM, "Benyi Wang"  wrote:

I'm working a project to collect MapReduce job information on an application
level. For example, a DW ETL process may involves several MapReduce jobs, we
want to have a dashboard to show the progress of those jobs for the specific
ETL process.

JobStatus does not provide all information like JobTracker web
page. JobInProgress is used in JobTracker and JobHistory and it is in
JobTracker memory, and seem not exposed to the client side.

The current method I am using is to check history log files and job conf XML
file to extract those information like jobdetailhistory.jsp and
jobhistory.jsp.

Is there a better way to collect the information like JobInProgress?

Thanks.



Re: Can we run job on some datanodes ?

2011-09-21 Thread Robert Evans
Praveen,

If you are doing performance measurements, be aware that having more datanodes 
than tasktrackers will impact the performance as well (I don't really know for 
sure how).  It will not be the same performance as running on a cluster with 
just fewer nodes over all.  Also if you do shut off datanodes as well as task 
trackers you will need to give the cluster a while for re-replication to finish 
before you try to run your performance numbers.

--Bobby Evans


On 9/21/11 8:27 AM, "Harsh J"  wrote:

Praveenesh,

Absolutely right. Just stop them individually :)

On Wed, Sep 21, 2011 at 6:53 PM, praveenesh kumar  wrote:
> Oh wow.. I didn't know that..
> Actually for me datanodes/tasktrackers are running on same machines.
> I mention datanodes because if I delete those machines from the masters list,
> chances are the data will also be lost.
> So I don't want to do that..
> but now I guess by stopping tasktrackers individually... I can decrease the
> strength of my cluster by decreasing the number of nodes that will run the
> tasktracker .. right ?? This way I won't lose my data either.. Right ??
>
>
>
> On Wed, Sep 21, 2011 at 6:39 PM, Harsh J  wrote:
>
>> Praveenesh,
>>
>> TaskTrackers run your jobs' tasks for you, not DataNodes directly. So
>> you can statically control loads on nodes by removing away
>> TaskTrackers from your cluster.
>>
>> i.e, if you "service hadoop-0.20-tasktracker stop" or
>> "hadoop-daemon.sh stop tasktracker" on the specific nodes, jobs won't
>> run there anymore.
>>
>> Is this what you're looking for?
>>
>> (There are ways to achieve the exclusion dynamically, by writing a
>> scheduler, but hard to tell without knowing what you need
>> specifically, and why do you require it?)
>>
>> On Wed, Sep 21, 2011 at 6:32 PM, praveenesh kumar 
>> wrote:
>> > Is there any way that we can run a particular job in a hadoop on subset
>> of
>> > datanodes ?
>> >
>> > My problem is I don't want to use all the nodes to run some job,
>> > I am trying to make Job completion Vs No. of nodes graph for a particular
>> > job.
>> > One way to do it is I can remove datanodes, and then see how much time the
>> job
>> > is taking.
>> >
>> > Just for curiosity sake, want to know is there any other way possible to
>> do
>> > this, without removing datanodes.
>> > I am afraid, if I remove datanodes, I can lose some data blocks that
>> reside
>> > on those machines as I have some files with replication = 1 ?
>> >
>> > Thanks,
>> > Praveenesh
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>



--
Harsh J



Re: Is Hadoop the right platform for my HPC application?

2011-09-13 Thread Robert Evans
Another option to think about is the Hamster project 
(MAPREDUCE-2911) that 
will allow OpenMPI to run on a Hadoop cluster.  It is still very preliminary 
and will probably not be ready until Hadoop 0.23 or 0.24.

There are other processing methodologies being developed to run on top of YARN 
(Which is the resource scheduler put in as part of Hadoop 0.23) 
http://wiki.apache.org/hadoop/PoweredByYarn

So there are even more choices coming depending on your problem.

--Bobby Evans

On 9/13/11 12:54 PM, "Parker Jones"  wrote:



Thank you for the explanations, Bobby.  That helps significantly.

I also read the article below which gave me a better understanding of the 
relative merits of MapReduce/Hadoop vs MPI.  Alberto, you might find it useful 
too.
http://grids.ucs.indiana.edu/ptliupages/publications/CloudsandMR.pdf

There is even a MapReduce API built on top of MPI developed at Sandia.

So many options to choose from :-)

Cheers,
Parker

> From: ev...@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Mon, 12 Sep 2011 14:02:44 -0700
> Subject: Re: Is Hadoop the right platform for my HPC application?
>
> Parker,
>
> The hadoop command itself is just a shell script that sets up your classpath 
> and some environment variables for a JVM.  Hadoop provides a java API and you 
> should be able to use it to write your application, without dealing with the 
> command line.  That being said there is no Map/Reduce C/C++ API.  There is 
> libhdfs.so that will allow you to read/write HDFS files from a C/C++ program, 
> but it actually launches a JVM behind the scenes to handle the actual 
> requests.
>
> As for a way to avoid writing your input data into files, the data has to be 
> distributed to the compute nodes somehow.  You could write a custom input 
> format that does not use any input files, and then have it load the data a 
> different way.  I believe that some people do this to load data from MySQL or 
> some other DB for processing.  Similarly you could do something with the 
> output format to put the data someplace else.
>
> It is hard to say if Hadoop is the right platform without more information 
> about what you are doing.  Hadoop has been used for lots of embarrassingly 
> parallel problems.  The processing is easy, the real question is where is 
> your data coming from, and where are the results going.  Map/Reduce is fast 
> in part because it tries to reduce data movement and move the computation to 
> the data, not the other way round.  Without knowing the expected size of your 
> data or the amount of processing that it will do, it is hard to say.
>
> --Bobby Evans
>
> On 9/12/11 5:09 AM, "Parker Jones"  wrote:
>
>
>
> Hello all,
>
> I have Hadoop up and running and an embarrassingly parallel problem but can't 
> figure out how to arrange the problem.  My apologies in advance if this is 
> obvious and I'm not getting it.
>
> My HPC application isn't a batch program, but runs in a continuous loop (like 
> a server) *outside* of the Hadoop machines, and it should occasionally farm 
> out a large computation to Hadoop and use the results.  However, all the 
> examples I have come across interact with Hadoop via files and the command 
> line.  (Perhaps I am looking at the wrong places?)
>
> So,
> * is Hadoop the right platform for this kind of problem?
> * is it possible to use Hadoop without going through the command line and 
> writing all input data to files?
>
> If so, could someone point me to some examples and documentation.  I am 
> coding in C/C++ in case that is relevant, but examples in any language should 
> be helpful.
>
> Thanks for any suggestions,
> Parker
>
>
>




Re: Is Hadoop the right platform for my HPC application?

2011-09-12 Thread Robert Evans
Parker,

The hadoop command itself is just a shell script that sets up your classpath 
and some environment variables for a JVM.  Hadoop provides a java API and you 
should be able to use it to write your application, without dealing with the 
command line.  That being said there is no Map/Reduce C/C++ API.  There is 
libhdfs.so that will allow you to read/write HDFS files from a C/C++ program, 
but it actually launches a JVM behind the scenes to handle the actual requests.

As for a way to avoid writing your input data into files, the data has to be 
distributed to the compute nodes somehow.  You could write a custom input 
format that does not use any input files, and then have it load the data a 
different way.  I believe that some people do this to load data from MySQL or 
some other DB for processing.  Similarly you could do something with the output 
format to put the data someplace else.

It is hard to say if Hadoop is the right platform without more information 
about what you are doing.  Hadoop has been used for lots of embarrassingly 
parallel problems.  The processing is easy, the real question is where is your 
data coming from, and where are the results going.  Map/Reduce is fast in part 
because it tries to reduce data movement and move the computation to the data, 
not the other way round.  Without knowing the expected size of your data or the 
amount of processing that it will do, it is hard to say.

--Bobby Evans
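
To make that concrete, here is a rough sketch, not a drop-in solution, of driving a job and reading its output entirely from Java, the way a long-running service could; the paths and class name are invented, and the mapper/reducer classes are left out (so this particular job would just run the identity mapper):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EmbeddedSubmitSketch {                    // hypothetical
    public static void runOnce(Configuration conf) throws Exception {
        Job job = new Job(conf, "hpc-batch");          // job name is illustrative
        job.setJarByClass(EmbeddedSubmitSketch.class);
        // Mapper/reducer classes would be set here as in any normal job.
        FileInputFormat.setInputPaths(job, new Path("/service/input"));   // invented paths
        FileOutputFormat.setOutputPath(job, new Path("/service/output"));
        job.waitForCompletion(true);                   // blocks until the cluster finishes

        // Read the results straight back over HDFS, with no command line involved.
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus part : fs.listStatus(new Path("/service/output"))) {
            if (!part.getPath().getName().startsWith("part-")) continue;
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(part.getPath())));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    // hand results back to the long-running service here
                }
            } finally {
                in.close();
            }
        }
    }
}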

On 9/12/11 5:09 AM, "Parker Jones"  wrote:



Hello all,

I have Hadoop up and running and an embarrassingly parallel problem but can't 
figure out how to arrange the problem.  My apologies in advance if this is 
obvious and I'm not getting it.

My HPC application isn't a batch program, but runs in a continuous loop (like a 
server) *outside* of the Hadoop machines, and it should occasionally farm out a 
large computation to Hadoop and use the results.  However, all the examples I 
have come across interact with Hadoop via files and the command line.  (Perhaps 
I am looking at the wrong places?)

So,
* is Hadoop the right platform for this kind of problem?
* is it possible to use Hadoop without going through the command line and 
writing all input data to files?

If so, could someone point me to some examples and documentation.  I am coding 
in C/C++ in case that is relevant, but examples in any language should be 
helpful.

Thanks for any suggestions,
Parker





Re: Distributed cluster filesystem on EC2

2011-08-31 Thread Robert Evans
Dmitry,

It sounds like an interesting idea, but I have not really heard of anyone doing 
it before.  It would make for a good feature to have tiered file systems all 
mapped into the same namespace, but that would be a lot of work and complexity.

The quick solution would be to know what data you want to process before hand 
and then run distcp to copy it from S3 into HDFS before launching the other 
map/reduce jobs.  I don't think there is anything automatic out there.

--Bobby Evans
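
For the copy step itself, a hedged sketch of doing it programmatically (the bucket and paths are invented, the s3n credentials are assumed to be in the configuration, and for really large data the parallel distcp job is the better fit):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class S3StageInSketch {                         // hypothetical
    public static void stageIn(Configuration conf) throws Exception {
        Path src = new Path("s3n://my-bucket/datasets/run-42/");   // invented locations
        Path dst = new Path("hdfs:///staging/run-42/");
        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem dstFs = dst.getFileSystem(conf);
        // Single-process copy; distcp does the same job as a parallel MapReduce job.
        FileUtil.copy(srcFs, src, dstFs, dst, false /* keep the source */, conf);
    }
}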

On 8/29/11 4:56 PM, "Dmitry Pushkarev"  wrote:

Dear hadoop users,

Sorry for the off-topic. We're slowly migrating our hadoop cluster to EC2,
and one thing that I'm trying to explore is whether we can use alternative
scheduling systems like SGE with shared FS for non data intensive tasks,
since they are easier to work with for lay users.

One problem for now is how to create shared cluster filesystem similar to
HDFS, distributed with high-performance, somewhat POSIX compliant (symlinks
and permissions), that will use amazon EC2 local nonpersistent storage.

Idea is to keep original data on S3, then as needed fire up a bunch of
nodes, start shared filesystem, and quickly copy data from S3 to that FS,
run the analysis with SGE, save results and shut down that filesystem.
I tried things like S3FS and similar native S3 implementation but speed is
too bad. Currently I just have a FS on my master node that is shared via NFS
to all the rest, but I pretty much saturate 1GB bandwidth as soon as I start
more than 10 nodes.

Thank you. I'd appreciate any suggestions and links to relevant resources!.


Dmitry



Re: Hadoop JVM Size (Not MapReduce)

2011-08-19 Thread Robert Evans
The hadoop command is just a shell script that sets up the class path before 
calling java.  I think if you set the env var HADOOP_JAVA_OPTS then the options will show up 
on the command line, but you can look at the top of the hadoop shell script to 
be sure.  It has all the env vars it supports listed there at the top.

--Bobby Evans

On 8/19/11 10:34 AM, "Adam Shook"  wrote:

Hello all,

I have a Hadoop related application that is integrated with HDFS and is started 
via the command line with "hadoop jar ..."  The amount of data used by the 
application changes from use case to use case, and I have to adjust the JVM 
that is started using the "hadoop jar" command.  Typically, you just set the 
-Xmx and -Xms variables from the "java -jar" command, but this doesn't seem to 
work.

Does anyone know how I can set it?  Note that this is unrelated to the JVM size 
for map and reduce tasks - there is no MapReduce involved in my application.

Thanks in advance!

--Adam

PS - I imagine I can code my application to be hooked into HDFS or read from 
the Hadoop configuration files by hand - but I would prefer Hadoop to do all 
the work for me!



Re: Hadoop-streaming using binary executable c program

2011-08-02 Thread Robert Evans
What I usually do to debug streaming is to print things to STDERR.  STDERR 
shows up in the logs for the attempt and you should be able to see better what 
is happening.  I am not an expert on perl so I am not sure if you have to pass 
in something special to get your perl script to read from STDIN.  I see you 
opening handles to all of the files on the command line, but I am not sure how 
that works with stdin, because whatever you run through streaming has to read 
from stdin and write to stdout.

cat map1.txt map2.txt map3.txt | ./reducer.pl

--Bobby

On 8/1/11 5:13 PM, "Daniel Yehdego"  wrote:



Hi Bobby,

I have written a small Perl script which does the following job:

Assume we have an output from the mapper

MAP1



MAP2



MAP3



and what the script does is reduce in the following manner :
\t\n
 and the script looks like this:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

# Open one read handle per file named on the command line.
# Note: under Hadoop streaming the reducer receives its records on STDIN and is
# started with no command-line arguments, so @ARGV (and therefore @handles)
# is empty and the loop below never runs -- which would explain empty output.
my @handles = map { open my $h, '<', $_; $h } @ARGV;

while (@handles) {
    # Drop handles that have hit end-of-file, then read one line from each
    # remaining file and print them joined on a single output line.
    @handles = grep { ! eof $_ } @handles;
    my @lines = map { my $v = <$_>; chomp $v; $v } @handles;
    print join(' ', @lines), "\n";
}

close $_ for @handles;

This should work for any inputs from the mapper. But after I use hadoop 
streaming and put the above code as my reducer, the job was successful
but the output files were empty. And I couldn't find out why.

 bin/hadoop jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar
-mapper ./hadoopPknotsRG
-file /data/yehdego/hadoop-0.20.2/pknotsRG
-file /data/yehdego/hadoop-0.20.2/hadoopPknotsRG
-reducer ./reducer.pl
-file /data/yehdego/hadoop-0.20.2/reducer.pl
-input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt
-output /user/yehdego/RFR2-out - verbose

Any help or suggestion is really appreciated. I am just stuck here for the 
weekend.

Regards,

Daniel T. Yehdego
Computational Science Program
University of Texas at El Paso, UTEP
dtyehd...@miners.utep.edu

> From: ev...@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Thu, 28 Jul 2011 07:12:11 -0700
> Subject: Re: Hadoop-streaming using binary executable c program
>
> I am not completely sure what you are getting at.  It looks like the output 
> of your c program is (And this is just a guess)  NOTE: \t stands for the tab 
> character and in streaming it is used to separate the key from the value; \n 
> stands for the newline character and is used to separate individual records.
> \t\n
> \t\n
> \t\n
> ...
>
>
> And you want the output to look like
> \t\n
>
> You could use a reduce to do this, but the issue here is with the shuffle in 
> between the maps and the reduces.  The Shuffle will group by the key to send 
> to the reducers and then sort by the key.  So in reality your map output 
> looks something like
>
> FROM MAP 1:
> \t\n
> \t\n
>
> FROM MAP 2:
> \t\n
> \t\n
>
> FROM MAP 3:
> \t\n
> \t\n
>
> If you send it to a single reducer (The only way to get a single file) Then 
> the input to the reducer will be sorted alphabetically by the RNA, and the 
> order of the input will be lost.  You can work around this by giving each 
> line a unique number that is in the order you want it to be output.  But 
> doing this would require you to write some code.  I would suggest that you do 
> it with a small shell script after all the maps have completed to splice them 
> together.
>
> --
> Bobby
>
> On 7/27/11 2:55 PM, "Daniel Yehdego"  wrote:
>
>
>
> Hi Bobby,
>
> I just want to ask you if there is a way of using a reducer or something like 
> concatenation to glue my outputs from the mapper and output
> them as a single file and segment of the predicted RNA 2D structure?
>
> FYI: I have used a reducer NONE before:
>
> HADOOP_HOME$ bin/hadoop jar
> /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper
> ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file
> /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input
> /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output
> /user/yehdego/RF-out -reducer NONE -verbose
>
> and a sample of my output using the mapper of two different slave nodes looks 
> like this :
>
> AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGC
> and
> [...(((...))).].
>   (-13.46)
>
> GGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU
> .(((.((......)..  (-11.00)
>
> and I want to concatenate and output them as a single predicted RNA sequence 
> structure:
>
> AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGCGGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU
>
> [...(((...))).]..(((.((......)..
>
>
> Regards,
>
> Daniel T. Yehdego
> Computational Science Program
> University of Texas at El Paso, UTEP
> dtyehd...@miners.utep.edu
>
> > From:

Re: Hadoop-streaming using binary executable c program

2011-07-28 Thread Robert Evans
I am not completely sure what you are getting at.  It looks like the output of 
your c program is (And this is just a guess)  NOTE: \t stands for the tab 
character and in streaming it is used to separate the key from the value; \n 
stands for the newline character and is used to separate individual records.
\t\n
\t\n
\t\n
...


And you want the output to look like
\t\n

You could use a reduce to do this, but the issue here is with the shuffle in 
between the maps and the reduces.  The Shuffle will group by the key to send to 
the reducers and then sort by the key.  So in reality your map output looks 
something like

FROM MAP 1:
\t\n
\t\n

FROM MAP 2:
\t\n
\t\n

FROM MAP 3:
\t\n
\t\n

If you send it to a single reducer (The only way to get a single file) Then the 
input to the reducer will be sorted alphabetically by the RNA, and the order of 
the input will be lost.  You can work around this by giving each line a unique 
number that is in the order you want it to be output.  But doing this would 
require you to write some code.  I would suggest that you do it with a small 
shell script after all the maps have completed to splice them together.

--
Bobby
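
As an illustration of the "unique number in original order" idea, here is a sketch in Java rather than streaming, so treat it as a picture of the trick and not a drop-in for the pknotsRG pipeline: TextInputFormat already hands each map record the byte offset of its line as the key, and for a single input file that offset is exactly the ordering number you need, provided the job also sets job.setNumReduceTasks(1).

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (byte offset, processed line); the offset carries the original order
// through the shuffle because LongWritable keys sort numerically.
public class OrderKeepingMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(offset, new Text(process(line.toString())));
    }

    private String process(String sequence) {
        return sequence;                               // placeholder for the real per-line work
    }
}

// With a single reducer the records arrive sorted by offset, so just dropping
// the key yields one output file in the original line order.
class OrderKeepingReducer extends Reducer<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(LongWritable offset, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(NullWritable.get(), value);
        }
    }
}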

On 7/27/11 2:55 PM, "Daniel Yehdego"  wrote:



Hi Bobby,

I just want to ask you if there is a way of using a reducer or something like 
concatenation to glue my outputs from the mapper and output
them as a single file and segment of the predicted RNA 2D structure?

FYI: I have used a reducer NONE before:

HADOOP_HOME$ bin/hadoop jar
/data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper
./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file
/data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input
/user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output
/user/yehdego/RF-out -reducer NONE -verbose

and a sample of my output using the mapper of two different slave nodes looks 
like this :

AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGC
and
[...(((...))).].
  (-13.46)

GGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU
.(((.((......)..  (-11.00)

and I want to concatenate and output them as a single predicted RNA sequence 
structure:

AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGCGGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU

[...(((...))).]..(((.((......)..


Regards,

Daniel T. Yehdego
Computational Science Program
University of Texas at El Paso, UTEP
dtyehd...@miners.utep.edu

> From: dtyehd...@miners.utep.edu
> To: common-user@hadoop.apache.org
> Subject: RE: Hadoop-streaming using binary executable c program
> Date: Tue, 26 Jul 2011 16:23:10 +
>
>
> Good afternoon Bobby,
>
> Thanks so much, now it's working excellently. And the speed is also reasonable. 
> Once again thanks u.
>
> Regards,
>
> Daniel T. Yehdego
> Computational Science Program
> University of Texas at El Paso, UTEP
> dtyehd...@miners.utep.edu
>
> > From: ev...@yahoo-inc.com
> > To: common-user@hadoop.apache.org
> > Date: Mon, 25 Jul 2011 14:47:34 -0700
> > Subject: Re: Hadoop-streaming using binary executable c program
> >
> > This is likely to be slow and it is not ideal.  The ideal would be to 
> > modify pknotsRG to be able to read from stdin, but that may not be possible.
> >
> > The shell script would probably look something like the following
> >
> > #!/bin/sh
> > rm -f temp.txt;
> > while read line
> > do
> >   echo $line >> temp.txt;
> > done
> > exec pknotsRG temp.txt;
> >
> > Place it in a file say hadoopPknotsRG  Then you probably want to run
> >
> > chmod +x hadoopPknotsRG
> >
> > After that you want to test it with
> >
> > hadoop fs -cat 
> > /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 
> > | ./hadoopPknotsRG
> >
> > If that works then you can try it with Hadoop streaming
> >
> > HADOOP_HOME$ bin/hadoop jar 
> > /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper 
> > ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file 
> > /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input 
> > /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
> > /user/yehdego/RF-out -reducer NONE -verbose
> >
> > --Bobby
> >
> > On 7/25/11 3:37 PM, "Daniel Yehdego"  wrote:
> >
> >
> >
> > Good afternoon Bobby,
> >
> > Thanks, you gave me a great help in finding out what the problem was. After 
> > I put the command line you suggested me, I found out that there was a 
> > segmentation error.
> > The binary executable program pknotsRG only reads a file with a sequence in 
> > it. This means, there should be a shell script, as you have said, that will 
> > take the data coming
> > from stdin and write it to a temporary file. Any idea on how to do this job 
> > i

Re: next gen map reduce

2011-07-28 Thread Robert Evans
It has not been introduced yet, if you are referring to MRv2.  It is targeted 
to go into the 0.23 release of Hadoop, but is currently on the MR-279 branch, 
which should hopefully be merged to trunk in about a week.

--Bobby

On 7/28/11 7:31 AM, "real great.."  wrote:

In which Hadoop version is next gen introduced?

--
Regards,
R.V.



Re: Running queries using index on HDFS

2011-07-25 Thread Robert Evans
Sofia,

You can access any HDFS file from a normal java application so long as your 
classpath and some configuration are set up correctly.  That is all that the 
hadoop jar command does.  It is a shell script that sets up the environment for 
java to work with Hadoop.  Look at the example for the Tool Class

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Tool.html

If you delete the JobConf stuff you can then just talk to the FileSystem by 
doing the following

Configuration conf = new Configuration();
Path p = new Path("URI OF FILE TO OPEN");
FileSystem fs = p.getFileSystem(conf);
InputStream in = fs.open(p);

Now you can use in to read your data.  Just be sure to close it when you are 
done.

--Bobby Evans



On 7/25/11 4:40 PM, "Sofia Georgiakaki"  wrote:

Good evening,

I have built an Rtree on HDFS, in order to improve the query performance of 
high-selectivity spatial queries.
The Rtree is composed of a number of hdfs files (each one created by one 
Reducer, so that the number of files is equal to the number of reducers), 
where each file is a subtree of the root of the Rtree.
I am investigating how to use the Rtree efficiently, with respect to the 
locality of each file on hdfs (data-placement).


I would like to ask, if it is possible to read a file which is on hdfs, from a 
java application (not MapReduce).
In case this is not possible (as I believe), either I should download the files 
on the local filesystem (which is not a solution, since the files could be very 
large), or run the queries using Hadoop.
In order to maximise the gain, I should probably process a batch of queries 
during each Job, and run each query on a node that is "near" to the files that 
are involved in handling the specific query.

Can I find the node where each file is located (or at least most of its 
blocks), and run on that node a reducer that handles these queries? Could the 
function DFSClient.getBlockLocations() help?

Thank you in advance,
Sofia
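
On the last question, a minimal sketch of inspecting block placement from a client using the public FileSystem API rather than DFSClient directly (the class name is invented); it only tells you where the replicas live, and getting a reducer scheduled onto that node is a separate problem:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SubtreeLocationSketch {                   // hypothetical
    public static void printHosts(Configuration conf, String file) throws Exception {
        Path p = new Path(file);                       // e.g. one Rtree subtree file on HDFS
        FileSystem fs = p.getFileSystem(conf);
        FileStatus status = fs.getFileStatus(p);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block reports the datanodes that hold a replica of it.
            System.out.println("offset " + block.getOffset() + " -> "
                + java.util.Arrays.toString(block.getHosts()));
        }
    }
}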



Re: Hadoop-streaming using binary executable c program

2011-07-25 Thread Robert Evans
This is likely to be slow and it is not ideal.  The ideal would be to modify 
pknotsRG to be able to read from stdin, but that may not be possible.

The shell script would probably look something like the following

#!/bin/sh
# Spool everything arriving on stdin into a temporary file, then hand that
# file to pknotsRG, since pknotsRG only accepts a file name as input.
rm -f temp.txt;
while read line
do
  echo "$line" >> temp.txt;
done
exec pknotsRG temp.txt;

Place it in a file say hadoopPknotsRG  Then you probably want to run

chmod +x hadoopPknotsRG

After that you want to test it with

hadoop fs -cat 
/user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | 
./hadoopPknotsRG

If that works then you can try it with Hadoop streaming

HADOOP_HOME$ bin/hadoop jar 
/data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper 
./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file 
/data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input 
/user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
/user/yehdego/RF-out -reducer NONE -verbose

--Bobby

On 7/25/11 3:37 PM, "Daniel Yehdego"  wrote:



Good afternoon Bobby,

Thanks, you were a great help in finding out what the problem was. After I 
ran the command line you suggested, I found out that there was a 
segmentation error.
The binary executable program pknotsRG only reads a file with a sequence in it. 
This means, there should be a shell script, as you have said, that will take 
the data coming
from stdin and write it to a temporary file. Any idea on how to do this job in 
a shell script? The thing is I am from a biology background and don't have much 
experience in CS.
Looking forward to hearing from you. Thanks so much.

Regards,

Daniel T. Yehdego
Computational Science Program
University of Texas at El Paso, UTEP
dtyehd...@miners.utep.edu

> From: ev...@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Fri, 22 Jul 2011 12:39:08 -0700
> Subject: Re: Hadoop-streaming using binary executable c program
>
> I would suggest that you do the following to help you debug.
>
> hadoop fs -cat 
> /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | 
> /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -
>
> This is simulating what hadoop streaming is doing.  Here we are taking the 
> first 2 lines out of the input file and feeding them to the stdin of 
> pknotsRG.  The first step is to make sure that you can get your program to 
> run correctly with something like this.  You may need to change the command 
> line to pknotsRG to get it to read the data it is processing from stdin, 
> instead of from a file.  Alternatively you may need to write a shell script 
> that will take the data coming from stdin.  Write it to a file and then call 
> pknotsRG on that temporary file.  Once you have this working then you should 
> try it again with streaming.
>
> --Bobby Evans
>
> On 7/22/11 12:31 PM, "Daniel Yehdego"  wrote:
>
>
>
> Hi Bobby, Thanks for the response.
>
> After I tried the following comannd:
>
> bin/hadoop jar $HADOOP_HOME/hadoop-0.20.2-streaming.jar -mapper 
> /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -  -file 
> /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG  -reducer NONE -input 
> /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
> /user/yehdego/RF-out - verbose
>
> I got a stderr logs :
>
> java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
> with code 139
> at 
> org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
> at 
> org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
> at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
> at 
> org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
>
>
> syslog logs
>
> 2011-07-22 13:02:27,467 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
> Initializing JVM Metrics with processName=MAP, sessionId=
> 2011-07-22 13:02:27,913 INFO org.apache.hadoop.mapred.MapTask: 
> numReduceTasks: 0
> 2011-07-22 13:02:28,149 INFO org.apache.hadoop.streaming.PipeMapRed: 
> PipeMapRed exec 
> [/data/yehdego/hadoop_tmp/dfs/local/taskTracker/jobcache/job_201107181535_0079/attempt_201107181535_0079_m_00_0/work/./pknotsRG]
> 2011-07-22 13:02:28,242 INFO org.apache.hadoop.streaming.PipeMapRed: 
> R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
> 2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
> MROutputThread done
> 2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
> MRErrorThread done
> 2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
> PipeMapRed failed!
> 2011-07-22 13:02:28,361 WARN org.apache.hadoop.mapred.TaskTracker: Error 
> running child
> java.lang.RuntimeException: PipeMapRed.waitOut

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Robert Evans
Tom,

I also forgot to mention that if you are writing to lots of little files it 
could cause issues too.  HDFS is designed to handle relatively few BIG files.  
There is some work to improve this, but it is still a ways off.  So it is 
likely going to be very slow and put a big load on the namenode if you are 
going to create a lot of small files using this method.

--Bobby
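
One hedged alternative to writing the whole thing from scratch is the MultipleTextOutputFormat hooks in the old mapred API, sketched below assuming Text keys and values; the caveats above still apply, so no two tasks may produce the same key/file, and many distinct keys means many small files on the namenode.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class KeyNamedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String leafName) {
        return key.toString();                         // one output file per distinct key
    }

    @Override
    protected Text generateActualKey(Text key, Text value) {
        return null;                                   // drop the key so only the value is written
    }
}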


On 7/25/11 3:30 PM, "Robert Evans"  wrote:

Tom,

That assumes that you will never write to the same file from two different 
mappers or processes.  HDFS currently does not support writing to a single file 
from multiple processes.

--Bobby

On 7/25/11 3:25 PM, "Tom Melendez"  wrote:

Hi Folks,

Just doing a sanity check here.

I have a map-only job, which produces a filename for a key and data as
a value.  I want to write the value (data) into the key (filename) in
the path specified when I run the job.

The value (data) doesn't need any formatting, I can just write it to
HDFS without modification.

So, looking at this link (the Output Formats section):

http://developer.yahoo.com/hadoop/tutorial/module5.html

Looks like I want to:
- create a new output format
- override write, tell it not to call writekey as I don't want that written
- new getRecordWriter method that use the key as the filename and
calls my outputformat

Sound reasonable?

Thanks,

Tom

--
===
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs




Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Robert Evans
Tom,

That assumes that you will never write to the same file from two different 
mappers or processes.  HDFS currently does not support writing to a single file 
from multiple processes.

--Bobby

On 7/25/11 3:25 PM, "Tom Melendez"  wrote:

Hi Folks,

Just doing a sanity check here.

I have a map-only job, which produces a filename for a key and data as
a value.  I want to write the value (data) into the key (filename) in
the path specified when I run the job.

The value (data) doesn't need any formatting, I can just write it to
HDFS without modification.

So, looking at this link (the Output Formats section):

http://developer.yahoo.com/hadoop/tutorial/module5.html

Looks like I want to:
- create a new output format
- override write, tell it not to call writekey as I don't want that written
- new getRecordWriter method that use the key as the filename and
calls my outputformat

Sound reasonable?

Thanks,

Tom

--
===
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs



Re: Hadoop-streaming using binary executable c program

2011-07-22 Thread Robert Evans
I would suggest that you do the following to help you debug.

hadoop fs -cat 
/user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | 
/data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -

This is simulating what hadoop streaming is doing.  Here we are taking the 
first 2 lines out of the input file and feeding them to the stdin of pknotsRG.  
The first step is to make sure that you can get your program to run correctly 
with something like this.  You may need to change the command line to pknotsRG 
to get it to read the data it is processing from stdin, instead of from a file. 
 Alternatively you may need to write a shell script that will take the data 
coming from stdin.  Write it to a file and then call pknotsRG on that temporary 
file.  Once you have this working then you should try it again with streaming.

--Bobby Evans

On 7/22/11 12:31 PM, "Daniel Yehdego"  wrote:



Hi Bobby, Thanks for the response.

After I tried the following comannd:

bin/hadoop jar $HADOOP_HOME/hadoop-0.20.2-streaming.jar -mapper 
/data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -  -file 
/data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG  -reducer NONE -input 
/user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
/user/yehdego/RF-out - verbose

I got a stderr logs :

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 139
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)



syslog logs

2011-07-22 13:02:27,467 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=MAP, sessionId=
2011-07-22 13:02:27,913 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2011-07-22 13:02:28,149 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed 
exec 
[/data/yehdego/hadoop_tmp/dfs/local/taskTracker/jobcache/job_201107181535_0079/attempt_201107181535_0079_m_00_0/work/./pknotsRG]
2011-07-22 13:02:28,242 INFO org.apache.hadoop.streaming.PipeMapRed: 
R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
MROutputThread done
2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
MRErrorThread done
2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed 
failed!
2011-07-22 13:02:28,361 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 139
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
2011-07-22 13:02:28,395 INFO org.apache.hadoop.mapred.TaskRunner: Runnning 
cleanup for the task



Regards,

Daniel T. Yehdego
Computational Science Program
University of Texas at El Paso, UTEP
dtyehd...@miners.utep.edu

> From: ev...@yahoo-inc.com
> To: common-user@hadoop.apache.org; dtyehd...@miners.utep.edu
> Date: Fri, 22 Jul 2011 09:12:18 -0700
> Subject: Re: Hadoop-streaming using binary executable c program
>
> It looks like it tried to run your program and the program exited with a 1 
> not a 0.  What are the stderr logs like for the mappers that were launched? 
> You should be able to access them through the Web GUI.  You might want to add 
> in some stderr log messages to your c program too, to be able to debug how far 
> along it gets before exiting.
>
> --Bobby Evans
>
> On 7/22/11 9:19 AM, "Daniel Yehdego"  wrote:
>
> I am trying to parallelize some very long RNA sequence for the sake of
> predicting their RNA 2D structures. I am using a binary executable c
> program called pknotsRG as my mapper. I tried the following bin/hadoop
> command:
>
> HADOOP_HOME$ bin/hadoop
> jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar
> -mapper /data/yehdego/hadoop-0.20.2/pknotsRG
> -file /data/yehdego/hadoop-0.20.2/pknotsRG
> -input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt
> -output /user/yehdego/RF-out -reducer 

Re: Hadoop-streaming using binary executable c program

2011-07-22 Thread Robert Evans
It looks like it tried to run your program and the program exited with a 1 not 
a 0.  What are the stderr logs like for the mappers that were launched? You 
should be able to access them through the Web GUI.  You might want to add in 
some stderr log messages to your c program too, to be able to debug how far 
along it gets before exiting.

--Bobby Evans

On 7/22/11 9:19 AM, "Daniel Yehdego"  wrote:

I am trying to parallelize some very long RNA sequences for the sake of
predicting their RNA 2D structures. I am using a binary executable c
program called pknotsRG as my mapper. I tried the following bin/hadoop
command:

HADOOP_HOME$ bin/hadoop
jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar
-mapper /data/yehdego/hadoop-0.20.2/pknotsRG
-file /data/yehdego/hadoop-0.20.2/pknotsRG
-input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt
-output /user/yehdego/RF-out -reducer NONE -verbose

but i keep getting the following error message:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
failed with code 1
at
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

FYI: my input file is RF00028_B.bpseqL3G5_seg_Centered_Method.txt which
is a chunk of RNA sequences and the mapper is expected to get the input
and execute the input file line by line and output the predicted
structure for each line of sequence for a specified number of maps. Any
help on this problem is really appreciated. Thanks.




Re: Problem with Hadoop Streaming -file option for Java class files

2011-07-22 Thread Robert Evans
From a practical standpoint, if you just leave off the -mapper you will get an 
IdentityMapper being run in streaming.  I don't believe that -mapper will 
understand something.class as a class file that should be loaded and used as 
the mapper.  I think you need to specify the class, including the package, to 
get it to load, like you did with org.apache.hadoop.mapred.lib.IdentityMapper.  
I am not sure what changes you made to IdentityMapper.java before recompiling, 
but in order to get it on the classpath you probably need to ship it as a jar, 
not as a single file.  I believe that you can use -libjars to ship it and add 
it to the classpath of the JVM, but I am not positive of that.

--Bobby Evans

On 7/22/11 10:18 AM, "Shrish"  wrote:



I am struggling with an issue in hadoop streaming with the "-file" option.

First I tried the very basic example in streaming:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar
contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper
org.apache.hadoop.mapred.lib.IdentityMapper \ -reducer /bin/wc -inputformat
KeyValueTextInputFormat -input gutenberg/* -output gutenberg-outputtstchk22

which worked absolutely fine.

Then I copied the IdentityMapper.java source code and compiled it. Then I placed
this class file in the /home/hadoop folder and executed the following in the
terminal.

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar
contrib/streaming/hadoop-streaming-0.20.203.0.jar -file ~/IdentityMapper.class
-mapper IdentityMapper.class \ -reducer /bin/wc -inputformat
KeyValueTextInputFormat -input gutenberg/* -output gutenberg-outputtstch6

The execution failed with the following error in the stderr file:

java.io.IOException: Cannot run program "IdentityMapper.class":
java.io.IOException: error=2, No such file or directory

Then again I tried it by copying the IdentityMapper.class file in the hadoop
installation and executed the following:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar
contrib/streaming/hadoop-streaming-0.20.203.0.jar -file IdentityMapper.class
-mapper IdentityMapper.class \ -reducer /bin/wc -inputformat
KeyValueTextInputFormat -input gutenberg/* -output gutenberg-outputtstch5

But unfortunately again I got the same error.

It would be great if you can help me with it as I cannot move any further
without overcoming this.


***I am trying this after I tried hadoop-streaming for a different class file
which failed, so to identify if there is something wrong with the class file
itself or with the way I am using it


Thanking you in anticipation




Re: Issue with MR code not scaling correctly with data sizes

2011-07-15 Thread Robert Evans
Please don't cross post.  I put common-user in BCC.

I really don't know for sure what is happening, especially without the code or 
more to go on, and debugging something remotely over e-mail is extremely 
difficult.  You are essentially doing a cross which is going to be very 
expensive no matter what you do. But I do have a few questions for you.


 1.  How large is the IDs file(s) you are using?  Have you updated the amount 
of heap the JVM has and the number of slots to accommodate it?
 2.  How are you storing the IDs in RAM to do the join?
 3.  Have you tried logging in your map/reduce code to verify the number of 
entries you expect are being loaded at each stage?
 4.  Along with that have you looked at the counters for your map./reduce 
program to verify that the number of records are showing flowing through the 
system as expected?

--Bobby

On 7/14/11 5:14 PM, "GOEKE, MATTHEW (AG/1000)"  
wrote:

All,

I have a MR program that I feed in a list of IDs and it generates the unique 
comparison set as a result. Example: if I have a list {1,2,3,4,5} then the 
resulting output would be {2x1, 3x2, 3x1, 4x3, 4x2, 4x1, 5x4, 5x3, 5x2, 5x1} or 
(n^2-n)/2 number of comparisons. My code works just fine on smaller scaled sets 
(I can verify less than 1000 fairly easily) but fails when I try to push the 
set to 10-20k IDs which is annoying when the end goal is 1-10 million.

The flow of the program is:
1) Partition the IDs evenly, based on amount of output per value, into 
a set of keys equal to the number of reduce slots we currently have
2) Use the distributed cache to push the ID file out to the various 
reducers
3) In the setup of the reducer, populate an int array with the values 
from the ID file in distributed cache
4) Output a comparison only if the current ID from the values iterator 
is greater than the current iterator through the int array

I realize that this could be done many other ways but this will be part of an 
oozie workflow so it made sense to just do it in MR for now. My issue is that 
when I try the larger sized ID files it only outputs part of the resulting data set 
and there are no errors to be found. Part of me thinks that I need to tweak 
some site configuration properties, due to the size of data that is spilling to 
disk, but after scanning through all 3 sites I am having issues pinpointing 
anything I think could be causing this. I moved from reading the file from HDFS 
to using the distributed cache for the join read thinking that might solve my 
problem but there seems to be something else I am overlooking.

Any advice is greatly appreciated!

Matt




Re: Giving filename as key to mapper ?

2011-07-15 Thread Robert Evans
To add to that: if you really want the file name to be the key, instead of just 
calling a different API in your map to get it, you will probably need to write 
your own input format.  It should be fairly simple, and you can base it 
on an existing input format.

--Bobby
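
A rough sketch of the simpler route Harsh mentions below, using the new API: pull the file name out of the FileSplit in setup() and emit it as the output key (the key/value choice here is just for illustration, not a prescription).

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text fileName = new Text();

    @Override
    protected void setup(Context context) {
        // Any FileInputFormat-based input hands each task a FileSplit that
        // knows which file (and which byte range of it) the task is reading.
        FileSplit split = (FileSplit) context.getInputSplit();
        fileName.set(split.getPath().getName());
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String word : line.toString().split("\\s+")) {
            if (word.length() > 0) {
                context.write(fileName, new Text(word));   // key = source file, value = word
            }
        }
    }
}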

On 7/15/11 7:40 AM, "Harsh J"  wrote:

You can retrieve the filename in the new API as described here:

http://search-hadoop.com/m/ZOmmJ1PZJqt1/map+input+filename&subj=Retrieving+Filename

In the old API, its available in the configuration instance of the
mapper as key "map.input.file". See the table below this section
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse
for more such goodies.

On Fri, Jul 15, 2011 at 5:44 PM, praveenesh kumar  wrote:
> Hi,
> How can I give filename as key to mapper ?
> I want to know the occurence of word in set of docs, so I want to keep key
> as filename. Is it possible to give input key as filename in map function ?
> Thanks,
> Praveenesh
>



--
Harsh J



Re: Which release to use?

2011-07-15 Thread Robert Evans
Adarsh,

Yahoo! no longer has its own distribution of Hadoop.  It has been merged into 
the 0.20.2XX line, so 0.20.203 is what Yahoo is running internally right now, 
and we are moving towards 0.20.204, which should be out soon.  I am not an 
expert on Cloudera so I cannot really map its releases to the Apache releases, 
but their distro is based on Apache Hadoop with a few bug fixes and maybe a 
few features like append added on top of it; you need to talk to 
Cloudera about the exact details.  For the most part they are all very similar. 
 You need to think most about support; there are several companies that can 
sell you support if you want/need it.  You also need to think about features 
vs. stability.  The 0.20.203 release has been tested on a lot of machines by 
many different groups, but may be missing some features that are needed in some 
situations.

--Bobby


On 7/14/11 11:49 PM, "Adarsh Sharma"  wrote:

Hadoop releases are issued from time to time. But one more thing related to 
hadoop usage:

There are so many providers that provide distributions of Hadoop;

1. Apache Hadoop
2. Cloudera
3. Yahoo

etc.
Which distribution is best among them for production usage?
I think Cloudera's is best among them.


Best Regards,
Adarsh
Owen O'Malley wrote:
> On Jul 14, 2011, at 4:33 PM, Teruhiko Kurosaka wrote:
>
>
>> I'm a newbie and I am confused by the Hadoop releases.
>> I thought 0.21.0 is the latest & greatest release that I
>> should be using but I noticed 0.20.203 has been released
>> lately, and 0.21.X is marked "unstable, unsupported".
>>
>> Should I be using 0.20.203?
>>
>
> Yes, I apologize for confusing release numbering, but the best release to use 
> is 0.20.203.0. It includes security, job limits, and many other improvements 
> over 0.20.2 and 0.21.0. Unfortunately, it doesn't have the new sync support 
> so it isn't suitable for using with HBase. Most large clusters use a separate 
> version of HDFS for HBase.
>
> -- Owen
>
>




Re: large data and hbase

2011-07-11 Thread Robert Evans
Rita,

My understanding is that you do not need to setup map/reduce to use Hbase, but 
I am not an expert on it.  Contacting the Hbase mailing list would probably be 
the best option to get your questions answered.

u...@hbase.apache.org

Their setup page might be able to help you out too

http://hbase.apache.org/book/notsoquick.html

I don't believe that Hbase supports SQL though.  You can use Hive 
(http://hive.apache.org/). It supports a lot of SQL, but it runs the queries as 
batch jobs and requires you to set up map/reduce to use it.

--Bobby Evans

On 7/11/11 6:31 AM, "Rita"  wrote:

I have a dataset which is several terabytes in size. I would like to query
this data using hbase (sql). Would I need to setup mapreduce to use hbase?
Currently the data is stored in hdfs and I am using `hdfs -cat ` to get the
data and pipe it into stdin.


--
--- Get your facts first, then you can distort them as you please.--



Re: Cluster Tuning

2011-07-08 Thread Robert Evans
I doubt it is going to make that much of a difference, even with the hardware 
constraints.  All that the reduce is doing during this period of time is 
downloading the map output data doing a merge sort on it and possibly dumping 
parts of it to disk. It may take up some RAM and if you are swapping a lot then 
it might be a speed bump to keep it from running, but only if you are really on 
the edge of the amount of RAM available to the system.  Looking at how you can 
reduce the data you transfer and tuning the heap size for the various JVMs can 
probably have a bigger impact.

--Bobby Evans
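
For what it's worth, the knob Pony is asking about does exist in the 0.20 line:
mapred.reduce.slowstart.completed.maps controls what fraction of maps must
complete before reducers start fetching. A minimal sketch of setting it in the
driver (the driver class name here is a placeholder):

import org.apache.hadoop.mapred.JobConf;

public class SlowStartExample {                       // placeholder driver class
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SlowStartExample.class);
        // Default is 0.05 (reducers launch once ~5% of maps finish);
        // 1.0 makes them wait until every map has completed.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f);
        // ... set the mapper, reducer, input/output paths, then submit as usual.
    }
}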

On 7/8/11 10:25 AM, "Juan P."  wrote:

Here's another thought. I realized that the reduce operation in my
map/reduce jobs finishes in a flash, but it goes really slowly until the
mappers end. Is there a way to configure the cluster to make the reduce wait
for the map operations to complete? Especially considering my hardware
constraints.

Thanks!
Pony

On Fri, Jul 8, 2011 at 11:41 AM, Juan P.  wrote:

> Hey guys,
> Thanks all of you for your help.
>
> Joey,
> I tweaked my MapReduce to serialize/deserialize only escencial values and
> added a combiner and that helped a lot. Previously I had a domain object
> which was being passed between Mapper and Reducer when I only needed a
> single value.
>
> Esteban,
> I think you underestimate the constraints of my cluster. Adding multiple
> jobs per JVM really kills me in terms of memory. Not to mention that by
> having a single core there's not much to gain in terms of parallelism (other
> than perhaps while a process is waiting on an I/O operation). Still I gave
> it a shot, but even though I kept changing the config I always ended with a
> Java heap space error.
>
> Is it me, or is performance tuning mostly a per-job task? I mean it will, in
> the end, depend on the data you are processing (structure, size, whether
> it's in one file or many, etc.). If my jobs have different sets of data,
> which are in different formats and organized in different  file structures,
> Do you guys recommend moving some of the configuration to Java code?
>
> Thanks!
> Pony
>
> On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex  wrote:
>
>> Are you the Esteban I know?
>>
>>
>>
>> On 07/07/2011, at 15:53, Esteban Gutierrez
>> wrote:
>>
>> > Hi Pony,
>> >
>> > There is a good chance that your boxes are doing some heavy swapping and
>> > that is a killer for Hadoop.  Have you tried
>> > with mapred.job.reuse.jvm.num.tasks=-1 and limiting as much possible the
>> > heap on that boxes?
>> >
>> > Cheers,
>> > Esteban.
>> >
>> > --
>> > Get Hadoop!  http://www.cloudera.com/downloads/
>> >
>> >
>> >
>> > On Thu, Jul 7, 2011 at 1:29 PM, Juan P.  wrote:
>> >
>> >> Hi guys!
>> >>
>> >> I'd like some help fine tuning my cluster. I currently have 20 boxes
>> >> exactly
>> >> alike. Single core machines with 600MB of RAM. No chance of upgrading
>> the
>> >> hardware.
>> >>
>> >> My cluster is made out of 1 NameNode/JobTracker box and 19
>> >> DataNode/TaskTracker boxes.
>> >>
>> >> All my config is default except i've set the following in my
>> >> mapred-site.xml
>> >> in an effort to try and prevent choking my boxes.
>> >> <property>
>> >>   <name>mapred.tasktracker.map.tasks.maximum</name>
>> >>   <value>1</value>
>> >> </property>
>> >>
>> >> I'm running a MapReduce job which reads a Proxy Server log file (2GB),
>> maps
>> >> hosts to each record and then in the reduce task it accumulates the
>> amount
>> >> of bytes received from each host.
>> >>
>> >> Currently it's producing about 65000 keys
>> >>
>> >> The whole job takes forever to complete, especially the reduce part. I've
>> >> tried different tuning configs but I can't bring it down under 20 mins.
>> >>
>> >> Any ideas?
>> >>
>> >> Thanks for your help!
>> >> Pony
>> >>
>>
>
>



Re: Can i safely set dfs.blockreport.intervalMsec to very large value (1 year or more?)

2011-07-08 Thread Robert Evans
Moon Soo Lee

The full block report is used in error cases.  Currently when a datanode 
heartbeats into the namenode, the namenode can send back a list of tasks to be 
performed; this is mostly for deleting blocks.  The namenode just assumes that 
all of these tasks execute successfully.  If any of them fail then the namenode 
is unaware of it.  HDFS-395 adds in an ack to address this.  Creation of new 
blocks is reported to the namenode as it happens, so this is not really an issue. 
So if you set the period to 1 year then you will likely have several blocks in 
your cluster sitting around unused but taking up space.  The full report is also 
likely compensating for other error conditions or even bugs in HDFS that I am 
unaware of, just because of the nature of it.

--Bobby Evans

On 7/7/11 9:02 PM, "moon soo Lee"  wrote:

I have many blocks. Around 50~90m each datanode.

They often do not respond for 1~3 min and I think this is because of the full
scan for the block report.

So if I set dfs.blockreport.intervalMsec to a very large value (1 year or
more?), I expect the problem to clear.

But if I really do that, what happens? Any side effects?



Re: Automatic line number in reducer output

2011-06-10 Thread Robert Evans
In this case you probably want two different classes.  You can have a base 
reducer class that adds in the line count, and then subclass it for the 
combiner; the subclass sets a flag to not output the line numbers.

--Bobby
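
A rough sketch of that two-class idea (the class names and the flag are made
up); the base class numbers the output lines and the subclass used as the
combiner just passes records through:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Base reducer: numbers each output line.
public class LineCountReducer extends Reducer<Text, Text, Text, Text> {
    protected boolean emitLineNumbers = true;   // the combiner subclass flips this flag
    private long lineCount = 0;

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            if (emitLineNumbers) {
                // Reducer: replace the key with a running line number.
                context.write(new Text(Long.toString(lineCount++)), value);
            } else {
                // Combiner: pass records through unchanged so the reducer still sees them.
                context.write(key, value);
            }
        }
    }
}

// Combiner subclass: same logic, no line numbers.
class NoLineNumberCombiner extends LineCountReducer {
    public NoLineNumberCombiner() {
        emitLineNumbers = false;
    }
}

// In the driver:
//   job.setReducerClass(LineCountReducer.class);
//   job.setCombinerClass(NoLineNumberCombiner.class);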


On 6/9/11 12:57 PM, "Shi Yu"  wrote:

Hi,

Thanks for the reply. The line count in the new API works fine now; it was a
bug in my code.  In the new API,

Iterator is changed to Iterable,

but I didn't pay attention to that and was still using Iterator and its hasNext() 
and next() methods. Surprisingly, the wrong code still ran and got output, but the 
line number count did not work and I think it was a null value. After fixing that 
Iterable mistake, the code works fine.

The remaining problem is when combiner and reducer are both implemented, the 
output is like

0   0   value1
1   0   value2
2   0   value3
3   1   value4
4   1   value5

The first column holds counts from the reducer, the second column holds counts from 
the combiner. I want to avoid the line counter in the combiner, so my plan is to create 
another class which is almost the same as the Reducer, but without the line count. 
I think it is doable to set the Combiner and Reducer to different classes in the 
job conf, but I haven't tried it yet.

Best,

Shi


On 6/9/2011 8:49 AM, Robert Evans wrote:

> What exactly is linecount being output as in the new APIs?
>
> --Bobby
>
> On 6/7/11 11:21 AM, "Shi Yu"  wrote:
>
> Hi,
>
> I am wondering if there is any built-in function to automatically add a
> self-incrementing line number in the reducer output (like a relational DB
> auto-key).
>
> I have this problem because in 0.19.2 API, I used a variable linecount
> increasing in the reducer like:
>
>public static class Reduce extends MapReduceBase implements
> Reducer{
>   private long linecount = 0;
>
>   public void reduce(Text key, Iterator  values,
> OutputCollector  output, Reporter reporter) throws
> IOException {
>
>   //.some code here
>   linecount ++;
>   output.collect(new Text(Long.toString(linecount)), var);
>
>  }
>
> }
>
>
> However, I found that this is not working in 0.20.2 API, if I write the
> code like:
>
> public static class Reduce extends
> org.apache.hadoop.mapreduce.Reducer{
>  private long linecount = 0;
>
>  public void reduce (Text key, Iterator  values,
> org.apache.hadoop.mapreduce.Reducer.Context context) throws IOException,
> InterruptedException {
>
>  //some code here
>  linecount ++;
>  context.write(new Text(Long.toString(linecount)),var);
> }
> }
>
> but it does not seem to work anymore.
>
>
> I would also like to know if there are combiner and reducer implemented,
> how to avoid that line number being written twice (cause I only want it
> in reducer, not in combiner). Thanks!
>
>
> Shi
>
>
>
>




Re: Linear scalability question

2011-06-09 Thread Robert Evans
Shantian,

You are correct.  The other big factor in this is the cost of connections 
between the Mappers and the Reducers.  With N mappers and M reducers you will 
make M*N connections between them.  This can be a very large cost as well.  The 
basic tricks you can play are to filter data before doing a join so that less 
data is passed through the network; make your block sizes larger so that more 
data is processed per map and fewer connections are made between mappers and 
reducers; and make sure you have compression enabled between the map/reduce 
phases, for which LZO is usually a good choice.  But for the most part this is a 
very difficult problem to address because fundamentally a join requires 
transferring data with the same key to the same node.  There is some 
experimental work on very large-scale processing, but it is still just that, 
experimental, and it does not actually try to make a join scale linearly.

--Bobby Evans
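
As a concrete starting point for the compression point above, a minimal sketch
of enabling LZO map-output compression from the driver; the driver class name is
a placeholder, and the codec class assumes the external hadoop-lzo package is
installed:

import org.apache.hadoop.mapred.JobConf;

public class JoinJobDriver {                          // placeholder driver class
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JoinJobDriver.class);
        // Compress the intermediate map output before it crosses the network.
        conf.setBoolean("mapred.compress.map.output", true);
        // The codec class comes from the separately installed hadoop-lzo package.
        conf.set("mapred.map.output.compression.codec",
                 "com.hadoop.compression.lzo.LzoCodec");
        // ... set the mapper, reducer, input/output paths, then submit as usual.
    }
}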

On 6/8/11 4:17 PM, "Shantian Purkad"  wrote:

any comments?




From: Shantian Purkad 
To: "common-user@hadoop.apache.org" 
Sent: Tuesday, June 7, 2011 3:53 PM
Subject: Linear scalability question

Hi,

I have a question on the linear scalability of Hadoop.

We have a situation where we have to do reduce-side joins on two big tables 
(10+ TB). This causes a lot of data to be transferred over the network, and the 
network is becoming a bottleneck.

In a few years these tables will have 100+ TB of data and the reduce-side joins 
will demand a lot of data transfer over the network. Since network bandwidth is 
limited and cannot be increased by adding more nodes, Hadoop will no longer be 
linearly scalable in this case.

Is my understanding correct? Am I missing anything here? How do people address 
these kinds of bottlenecks?

Thanks and Regards,
Shantian



Re: DistributedCache

2011-06-09 Thread Robert Evans
I think the issue you are seeing is because the distributed cache is not set up 
by default to create symlinks to the files it pulls over.  If you want to 
access them through symlinks in the local directory, call 
DistributedCache.createSymlink(conf) before submitting your job; otherwise you 
can use getLocalCacheFiles and getLocalCacheArchives to find out where the files 
are.

One thing to be aware of is that the cache archive and cache file URIs may 
optionally end with a #fragment, where the fragment is the name of the symlink 
you want on the compute node.

--Bobby Evans
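
A minimal sketch of the symlink route described above; the HDFS path and link
name are made up:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheSymlinkExample {                    // placeholder driver class
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask the framework to create symlinks in each task's working directory.
        DistributedCache.createSymlink(conf);
        // The "#cities" fragment names the symlink; the HDFS path is made up.
        DistributedCache.addCacheFile(new URI("/user/me/lookup/cities.txt#cities"), conf);
        // ... build and submit the job with this conf as usual.

        // Inside the mapper or reducer the file can then be opened by its link name:
        //   BufferedReader in = new BufferedReader(new FileReader("cities"));
    }
}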


On 6/7/11 8:52 AM, "John Armstrong"  wrote:

On Tue, 7 Jun 2011 09:41:21 -0300, "Juan P." 
wrote:
> Not 100% clear on what you meant. You are saying I should put the file
into
> my HDFS cluster or should I use DistributedCache? If you suggest the
> latter,
> could you address my original question?

I mean that you can certainly get away with putting information into a
known place on HDFS and loading it in each mapper or reducer, but that may
become very inefficient as your problem scales up.  Mostly I was responding
to Shi Yu's question about why the DC is even worth using at all.

As to your question, here's how I do it, which I think I basically lifted
from an example in The Definitive Guide.  There may be better ways, though.

In my setup, I put files into the DC by getting Path objects (which should
be able to reference either HDFS or local filesystem files, though I always
have my files on HDFS to start) and using

  DistributedCache.addCacheFile(path.toUri(), conf);

Then within my mapper or reducer I retrieve all the cached files with

  Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);

IIRC, this is what you were doing.  The problem is this gets all the
cached files, although they are now in a working directory on the local
filesystem.  Luckily, I know the filename of the file I want, so I iterate

  for (Path cachePath : cacheFiles) {
if (cachePath.getName().equals(cachedFilename)) {
  return cachePath;
}
  }

Then I've got the path to the local filesystem copy of the file I want in
hand and I can do whatever I want with it.

hth



Re: Automatic line number in reducer output

2011-06-09 Thread Robert Evans
What exactly is linecount being output as in the new APIs?

--Bobby

On 6/7/11 11:21 AM, "Shi Yu"  wrote:

Hi,

I am wondering if there is any built-in function to automatically add a
self-incrementing line number in the reducer output (like a relational DB
auto-key).

I have this problem because in 0.19.2 API, I used a variable linecount
increasing in the reducer like:

  public static class Reduce extends MapReduceBase implements
Reducer{
 private long linecount = 0;

 public void reduce(Text key, Iterator values,
OutputCollector output, Reporter reporter) throws
IOException {

 //.some code here
 linecount ++;
 output.collect(new Text(Long.toString(linecount)), var);

}

}


However, I found that this is not working in 0.20.2 API, if I write the
code like:

public static class Reduce extends
org.apache.hadoop.mapreduce.Reducer{
private long linecount = 0;

public void reduce (Text key, Iterator values,
org.apache.hadoop.mapreduce.Reducer.Context context) throws IOException,
InterruptedException {

//some code here
linecount ++;
context.write(new Text(Long.toString(linecount)),var);
   }
}

but it does not seem to work anymore.


I would also like to know if there are combiner and reducer implemented,
how to avoid that line number being written twice (cause I only want it
in reducer, not in combiner). Thanks!


Shi





Re: Hadoop project - help needed

2011-05-31 Thread Robert Evans
Parismav,

So you are more or less trying to scrape some data in a distributed way.  Well, 
there are several things that you could do; just be careful, as I am not sure of 
the terms of service for the Flickr APIs, so make sure that you are not violating 
them by downloading too much data.  You probably want to use the map input data 
as command/control for what the mappers do.  I would probably put it in a 
format like

ACCOUNT INFO\tGROUP INFO\n

Then you could use the N-line input format so that each mapper will process one 
line out of the file.  Something like (this is just pseudo code)

Mapper {
  map(Long offset, String line,...) {
String parts = line.split("\t");
openConnection(parts[0]);
GroupData gd = getDataAboutGroup(parts[1]);
...
  }
}

I would probably not bother with a reducer if all you are doing is pulling down 
data.  Also the output format you choose really depends on the type of data you 
are downloading, and how you want to use that data later.  For example if you 
want to download the actual picture then you probably want to use a sequence 
file format or some other binary format, because converting a picture to text 
can be very costly.

--Bobby Evans
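
For the N-line part, a minimal old-API sketch of the driver settings (the driver
class name is a placeholder):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class FlickrScrapeDriver {                     // placeholder driver class
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(FlickrScrapeDriver.class);
        // Hand each mapper exactly one control line (one account/group pair).
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", 1);
        // ... set the mapper, zero reduces, input/output paths, then submit.
    }
}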

On 5/31/11 10:35 AM, "parismav"  wrote:



Hello dear forum,
I am working on a project on Apache Hadoop. I am totally new to this
software and I need some help understanding the basic features!

To sum up, for my project I have configured Hadoop so that it runs 3
datanodes on one machine.
The project's main goal is to use both the Flickr API (flickr.com) libraries
and the Hadoop libraries in Java, so that each one of the 3 datanodes chooses a
Flickr group and returns photos' info from that group.

In order to do that, I have 3 Flickr accounts, each one with a different API
key.

I don't need any help on the Flickr side of the code, of course. But what I
don't understand is how to use the Mapper and Reducer parts of the code.
What input do I have to give the map() function?
Do I have to contain this whole "info downloading" process in the map()
function?

In a few words, how do I convert my code so that it runs distributedly on
Hadoop?
Thank you!
--
View this message in context: 
http://old.nabble.com/Hadoop-project---help-needed-tp31741968p31741968.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Sorting ...

2011-05-26 Thread Robert Evans
Also, if you want something that is fairly fast and a lot less dev work to get 
going, you might want to look at Pig.  It can do a distributed order by that 
is fairly good.

--Bobby Evans

On 5/26/11 2:45 AM, "Luca Pireddu"  wrote:

On May 25, 2011 22:15:50 Mark question wrote:
> I'm using SequenceFileInputFormat, but then what to write in my mappers?
>
>   each mapper is taking a split from the SequenceInputFile then sort its
> split ?! I don't want that..
>
> Thanks,
> Mark
>
> On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu  wrote:
> > On May 25, 2011 01:43:22 Mark question wrote:
> > > Thanks Luca, but what other way to sort a directory of sequence files?
> > >
> > > I don't plan to write a sorting algorithm in mappers/reducers, but
> > > hoping to use the sequenceFile.sorter instead.
> > >
> > > Any ideas?
> > >
> > > Mark
> >


If you want to achieve a global sort, then look at how TeraSort does it:

http://sortbenchmark.org/YahooHadoop.pdf

The idea is to partition the data so that all keys in part[i] are < all keys
in part[i+1].  Each partition is individually sorted, so to read the data in
globally sorted order you simply have to traverse it starting from the first
partition and working your way to the last one.

If your keys are already what you want to sort by, then you don't even need a
mapper (just use the default identity map).



--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel:  +39 0709250452
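
To make the TeraSort-style approach concrete, a rough sketch using the
TotalOrderPartitioner and InputSampler classes that ship in the 0.20 mapred.lib
package, assuming Text keys and values in the sequence files; the paths, reducer
count, and sampling parameters are illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class GlobalSort {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(GlobalSort.class);
        conf.setJobName("global-sort");

        // Identity map and reduce (the old-API defaults): the shuffle sort does the work.
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setOutputKeyClass(Text.class);     // assumes Text keys/values in the sequence files
        conf.setOutputValueClass(Text.class);
        conf.setNumReduceTasks(20);
        conf.setPartitionerClass(TotalOrderPartitioner.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Sample the input to pick partition boundaries so that all keys in
        // part[i] are < all keys in part[i+1].
        TotalOrderPartitioner.setPartitionFile(conf, new Path(args[1] + "_partitions"));
        InputSampler.Sampler<Text, Text> sampler =
            new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10);
        InputSampler.writePartitionFile(conf, sampler);

        JobClient.runJob(conf);
    }
}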



Re: Applications creates bigger output than input?

2011-05-19 Thread Robert Evans
I'm not sure if this has been mentioned or not, but in machine learning with 
text-based documents, the first stage is often a glorified word count.  
Except much of the time they will do N-grams.  So

Map Input:
"Hello this is a test"

Map Output:
"Hello"
"This"
"is"
"a"
"test"
"Hello" "this"
"this" "is"
"is" "a"
"a" "test"
...


You may also be extracting all kinds of other features from the text, but the 
tokenization/n-gram step is not that CPU intensive.

--Bobby Evans
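
A small sketch of that unigram-plus-bigram mapper in the new API (the class
name is my own); the bigram records are what push the map output size past the
input size:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NGramMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String prev = null;
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            context.write(new Text(token), ONE);                  // unigram
            if (prev != null) {
                context.write(new Text(prev + " " + token), ONE); // bigram
            }
            prev = token;
        }
    }
}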

On 5/19/11 3:06 AM, "elton sky"  wrote:

Hello,
I pick up this topic again, because what I am looking for is something not
CPU bound. Augmenting data for ETL and generating an index are good examples.
Neither of them requires too much CPU time on the map side. The main bottleneck
for them is shuffle and merge.

Market basket analysis is CPU intensive in the map phase, for sampling all
possible combinations of items.

I am still looking for more applications which create bigger output and are
not CPU bound.
Any further ideas? I'd appreciate them.


On Tue, May 3, 2011 at 3:10 AM, Steve Loughran  wrote:

> On 30/04/2011 05:31, elton sky wrote:
>
>> Thank you for suggestions:
>>
>> Weblog analysis, market basket analysis and generating search index.
>>
>> I guess for these applications we need more reduces than maps, for
>> handling
>> large intermediate output, isn't it. Besides, the input split for map
>> should
>> be smaller than usual,  because the workload for spill and merge on map's
>> local disk is heavy.
>>
>
> any form of rendering can generate very large images
>
> see: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
>
>
>



Re: current line number as key?

2011-05-18 Thread Robert Evans
You are correct, that there is no easy and efficient way to do this.

You could create a new InputFormat that derives from FileInputFormat that makes 
it so the files do not split, and then have a RecordReader that keeps track of 
line numbers.  But then each file is read by only one mapper.

Alternatively you could assume that the split is going to be done 
deterministically and do two passes: one where you count the number of lines in 
each partition, and a second that then assigns the line numbers based off of the 
output from the first.  But that requires two map passes.

--Bobby Evans
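
A minimal sketch of the first option (an input format that refuses to split and
a reader that counts lines); the class name is made up and it simply wraps
LineRecordReader:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class LineNumberInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // one mapper per file, so line numbers are global within each file
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new RecordReader<LongWritable, Text>() {
            private final LineRecordReader lines = new LineRecordReader();
            private long lineNumber = 0;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext c)
                    throws IOException, InterruptedException {
                lines.initialize(s, c);
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException {
                if (!lines.nextKeyValue()) {
                    return false;
                }
                lineNumber++;       // count lines instead of byte offsets
                return true;
            }

            @Override
            public LongWritable getCurrentKey() {
                return new LongWritable(lineNumber);
            }

            @Override
            public Text getCurrentValue() throws IOException, InterruptedException {
                return lines.getCurrentValue();
            }

            @Override
            public float getProgress() throws IOException, InterruptedException {
                return lines.getProgress();
            }

            @Override
            public void close() throws IOException {
                lines.close();
            }
        };
    }
}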


On 5/18/11 1:53 PM, "Alexandra Anghelescu"  wrote:

Hi,

It is hard to pick up certain lines of a text file, globally I mean.
Remember that the file is split according to its size (byte boundaries), not
lines, so it is possible to keep track of the lines inside a split, but
globally for the whole file, assuming it is split among map tasks... I don't
think it is possible. I am new to Hadoop, but that is my take on it.

Alexandra

On Wed, May 18, 2011 at 2:41 PM, bnonymous  wrote:

>
> Hello,
>
> I'm trying to pick up certain lines of a text file (say the 1st and 110th lines of
> a file with 10^10 lines). I need an InputFormat which gives the Mapper the line
> number as the key.
>
> I tried to implement RecordReader, but I can't get line information from
> InputSplit.
>
> Any solution to this???
>
> Thanks in advance!!!
> --
> View this message in context:
> http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


