Re: How can I record some position of context in Reduce()?
Hi, I am also working on a join using MapReduce. I think instead of finding the position of the table in RawKeyValueIterator, we can modify the context.write method to always write the key as the table name or id. Then we don't need to find the position: we can get the key and value from "reducerContext" before calling reducer.run(reducerContext) in ReduceTask.java, and we can add a join method to the Reducer class in Reducer.java and call reducer.join(reduceContext). I just wonder how you are going to support a NON EQUI join. I am also having the same problem of how to do the join if the datasets can't fit into memory. For now I am cloning keys and values using the following code (note: the original version never added the cloned value to the list; the myValues.add(outValue) line fixes that):

    KEYIN key = context.getCurrentKey();
    KEYIN outKey = null;
    try {
        outKey = (KEYIN) key.getClass().newInstance();
    } catch (Exception e) {}
    ReflectionUtils.copy(context.getConfiguration(), key, outKey);

    Iterable<VALUEIN> values = context.getValues();
    ArrayList<VALUEIN> myValues = new ArrayList<VALUEIN>();
    for (VALUEIN value : values) {
        VALUEIN outValue = null;
        try {
            outValue = (VALUEIN) value.getClass().newInstance();
        } catch (Exception e) {}
        ReflectionUtils.copy(context.getConfiguration(), value, outValue);
        myValues.add(outValue);
    }

If you have found any other solution, please feel free to share. Thank you.

On Thu, Mar 14, 2013 at 1:53 PM, Roth Effy wrote:
> In reduce() we have:
>
> key1 values1
> key2 values2
> ...
> keyn valuesn
>
> so, what i want to do is join all values like a SQL:
>
> select * from values1,values2...valuesn;
>
> if memory is not enough to cache values, how to complete the join operation?
> my idea is clone the reducecontext, but it maybe not easy.
>
> Any help will be appreciated.
>
> 2013/3/13 Roth Effy
>> I want a n:n join as Cartesian product, but the DataJoinReducerBase looks
>> like only support equal join.
>> I want a non-equal join, but I have no idea now.
>>
>> 2013/3/13 Azuryy Yu
>>> you want a n:n join or 1:n join?
>>> On Mar 13, 2013 10:51 AM, "Roth Effy" wrote:
>>> I want to join two table data in reducer. So I need to find the start of the table.
someone said the DataJoinReducerBase can help me,isn't it? 2013/3/13 Azuryy Yu > you cannot use RecordReader in Reducer. > > what's the mean of you want get the record position? I cannot > understand, can you give a simple example? > > > On Wed, Mar 13, 2013 at 9:56 AM, Roth Effy wrote: > >> sorry,I still can't understand how to use recordreader in the >> reduce(),because the input is a RawKeyValueIterator in the class >> reducecontext.so,I'm confused. >> anyway,thank you. >> >> >> 2013/3/12 samir das mohapatra >> >>> Through the RecordReader and FileStatus you can get it. >>> >>> >>> On Tue, Mar 12, 2013 at 4:08 PM, Roth Effy wrote: >>> Hi,everyone, I want to join the k-v pairs in Reduce(),but how to get the record position? Now,what I thought is to save the context status,but class Context doesn't implement a clone construct method. Any help will be appreciated. Thank you very much. >>> >>> >> > >> > -- * * * Thanx and Regards* * Vikas Jadhav*
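The key-tagging idea in Vikas's mail (write the table name or id into the key so the reducer can tell which table each value came from) is the standard reduce-side join. Since the reducer sees all sides of one key together, it also covers Roth's n:n Cartesian case. A toy, Hadoop-free Python simulation (table names and rows are invented for illustration):

```python
from collections import defaultdict
from itertools import product

def map_phase(table_id, rows, key_col):
    # "Mapper": tag every row with the table it came from.
    for row in rows:
        yield row[key_col], (table_id, row)

def reduce_side_join(tables):
    # tables: {table_id: (rows, key_column_index)}
    # "Shuffle": group the tagged rows by join key, keeping sides apart.
    groups = defaultdict(lambda: defaultdict(list))
    for table_id, (rows, key_col) in tables.items():
        for key, (tid, row) in map_phase(table_id, rows, key_col):
            groups[key][tid].append(row)
    # "Reducer": for keys present in every table, emit the Cartesian
    # product of the sides (an inner n:n equi-join).
    for key, sides in groups.items():
        if len(sides) == len(tables):
            for combo in product(*sides.values()):
                yield key, combo

users  = [("u1", "alice"), ("u2", "bob")]
orders = [("u1", "book"), ("u1", "pen"), ("u3", "ink")]
result = list(reduce_side_join({"users": (users, 0), "orders": (orders, 0)}))
# "u1" joins one user row with two order rows; "u2" and "u3" have no match.
```

Note that each key's group must still fit in memory for the product step, which is exactly the limitation being discussed in this thread.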
Distributed cache: how big is too big?
I am researching a Hadoop solution for an existing application that requires a directory structure full of data for processing. To make the Hadoop solution work I need to deploy the data directory to each DN when the job is executed. I know this isn't new and is commonly done with the Distributed Cache. Based on experience, what are the common file sizes deployed in a Distributed Cache? I know smaller is better, but how big is too big? I have read that the larger the cache deployed, the longer the startup latency. I also assume there are other factors that play into this. What I know so far:
- Default local.cache.size = 10Gb
- Range of desirable sizes for a Distributed Cache = 10Kb - 1Gb??
- Distributed Cache is normally not used if larger than = ?
Another option: put the data directories on each DN and provide the location to the TaskTracker? Thanks, John
Re: Problem accessing HDFS from a remote machine
Have you checked the firewall on the namenode? If you are running Ubuntu and the namenode port is 8020, the command is: ufw allow 8020

Thanks and Regards, Rishi Yadav InfoObjects Inc || http://www.infoobjects.com (Big Data Solutions)

On Mon, Apr 8, 2013 at 6:57 PM, Azuryy Yu wrote:
> can you use command "jps" on your localhost to see if there is NameNode
> process running?
>
> On Tue, Apr 9, 2013 at 2:27 AM, Bjorn Jonsson wrote:
>> Yes, the namenode port is not open for your cluster. I had this problem
>> too. First, log into your namenode and do netstat -nap to see what ports are
>> listening. You can do service --status-all to see if the namenode service
>> is running. Basically you need Hadoop to bind to the correct ip (an
>> external one, or at least reachable from your remote machine). So listening
>> on 127.0.0.1 or localhost or some ip for a private network will not be
>> sufficient. Check your /etc/hosts file and /etc/hadoop/conf/*-site.xml
>> files to configure the correct ip/ports.
>>
>> I'm no expert, so my understanding might be limited/wrong...but I hope
>> this helps :)
>>
>> Best,
>> B
>>
>> On Mon, Apr 8, 2013 at 7:29 AM, Saurabh Jain wrote:
>>> [original message snipped]
RE: mr default=local?
Harsh, thanks for the quick reply. While I am a Hadoop newbie, I find I am explaining Hadoop install, config, and job processing to newer newbies; thus the desire and need for more details. John

> From: ha...@cloudera.com
> Date: Tue, 9 Apr 2013 09:16:49 +0530
> Subject: Re: mr default=local?
> To: user@hadoop.apache.org
>
> Hey John,
>
> Sorta unclear on what is prompting this question (to answer it more
> specifically) but my response below:
>
> On Tue, Apr 9, 2013 at 9:05 AM, John Meza wrote:
> > The default mode for hadoop is Standalone, PsuedoDistributed and Fully
> > Distributed modes. It is configured for Psuedo and Fully Distributed via
> > configuration file, but defaults to Standalone otherwise (correct?).
>
> The mapred-default.xml we ship, has "mapred.job.tracker"
> (0.20.x/1.x/0.22.x) set to local, or "mapreduce.framework.name"
> (0.23.x, 2.x, trunk) set to local. This is why, without reconfiguring
> an installation to point to a proper cluster (JT or YARN), you will
> get local job runner activated.
>
> > Question about the -defaulting- mechanism:
> > -Does it get the -default- configuration via one of the config files?
>
> For any Configuration type of invocation:
> 1. First level of defaults come from *-default.xml embedded inside the
> various relevant jar files.
> 2. Configurations further found in a classpath resource XML
> (core,mapred,hdfs,yarn, *-site.xmls) are applied on top of the
> defaults.
> 3. User applications' code may then override this set, with any
> settings of their own, if needed.
>
> > -Or does it get the -default- configuration via hard-coded values?
>
> There may be a few cases of hardcodes, missing documentation and
> presence in *-default.xml, but they should still be configurable via
> (2) and (3).
>
> > -Or another mechanism?
>
> --
> Harsh J
Re: How to configure mapreduce archive size?
Hi, This directory is used as part of the 'DistributedCache' feature (http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#DistributedCache). There is a configuration key "local.cache.size" which controls the amount of data stored under DistributedCache. The default limit is 10GB. However, the files under this cannot be deleted if they are being used. Also, some frameworks on Hadoop could be using DistributedCache transparently to you. So you could check what is being stored here and, based on that, lower the limit of the cache size if you feel that will help. The property needs to be set in mapred-site.xml (mapred-default.xml only documents the shipped default). Thanks Hemanth

On Mon, Apr 8, 2013 at 11:09 PM, wrote:
> Hi,
>
> I am using hadoop which is packaged within hbase-0.94.1. It is hadoop
> 1.0.3. There is some mapreduce job running on my server. After some time, I
> found that my folder /tmp/hadoop-root/mapred/local/archive has 14G size.
>
> How to configure this and limit the size? I do not want to waste my space
> for archive.
>
> Thanks,
>
> Xia
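For reference, a hedged sketch of the override Hemanth describes: local.cache.size is specified in bytes, and user overrides conventionally go into mapred-site.xml on each TaskTracker node (the 5GB value below is just an example, not a recommendation):

```xml
<!-- mapred-site.xml: cap the DistributedCache at roughly 5GB.
     local.cache.size is in bytes; the shipped default is 10GB. -->
<property>
  <name>local.cache.size</name>
  <value>5368709120</value>
</property>
```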
Re:RES: I want to call HDFS REST api to upload a file using httplib.
Really, thanks. But the returned URL is wrong, and localhost is the real URL: I tested successfully with curl using "localhost". Can anybody help me translate the curl to Python httplib? curl -i -X PUT -T "http://:/webhdfs/v1/?op=CREATE" I tested it using python httplib and received the right response, but the file uploaded to HDFS is empty, no data sent!! Is "conn.send(data)" the problem?

-- Original --
From: "MARCOS MEDRADO RUBINELLI";
Date: Mon, Apr 8, 2013 04:22 PM
To: "user@hadoop.apache.org";
Subject: RES: I want to call HDFS REST api to upload a file using httplib.

On your first call, Hadoop will return a URL pointing to a datanode in the Location header of the 307 response. On your second call, you have to use that URL instead of constructing your own. You can see the specific documentation here: http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#CREATE

Regards,
Marcos

I want to call HDFS REST api to upload a file using httplib. My program created the file, but no content is in it.

Here is my code:

    import httplib
    conn = httplib.HTTPConnection("localhost:50070")
    conn.request("PUT", "/webhdfs/v1/levi/4?op=CREATE")
    res = conn.getresponse()
    print res.status, res.reason
    conn.close()
    conn = httplib.HTTPConnection("localhost:50075")
    conn.connect()
    conn.putrequest("PUT", "/webhdfs/v1/levi/4?op=CREATE&user.name=levi")
    conn.endheaders()
    a_file = open("/home/levi/4", "rb")
    a_file.seek(0)
    data = a_file.read()
    conn.send(data)
    res = conn.getresponse()
    print res.status, res.reason
    conn.close()

Here is the return:

    307 TEMPORARY_REDIRECT
    201 Created

OK, the file was created, but no content was sent. When I comment out the conn.send(data), the result is the same, still no content. Maybe the file read or the send is wrong, not sure. Do you know how this happened?
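Following Marcos's instruction, the second PUT has to go to the URL returned in the Location header of the 307 response, not to a hand-built DataNode URL. A sketch in Python 3's http.client (the successor of httplib); it is untested against a live cluster, and the host arguments are placeholders:

```python
import http.client
from urllib.parse import urlsplit

def split_location(location):
    """Split the Location header returned by the NameNode's 307
    into (host:port, path?query) for a second HTTPConnection."""
    parts = urlsplit(location)
    return parts.netloc, "%s?%s" % (parts.path, parts.query)

def create_file(namenode, path, data, user):
    # Step 1: ask the NameNode where to write; send no data yet.
    conn = http.client.HTTPConnection(namenode)
    conn.request("PUT", "%s?op=CREATE&user.name=%s" % (path, user))
    res = conn.getresponse()
    location = res.getheader("Location")   # DataNode URL chosen by HDFS
    conn.close()

    # Step 2: PUT the actual bytes to the URL from the Location header.
    netloc, target = split_location(location)
    conn = http.client.HTTPConnection(netloc)
    conn.request("PUT", target, body=data)
    res = conn.getresponse()
    return res.status                      # expect 201 Created
```

Usage would be something like create_file("localhost:50070", "/webhdfs/v1/levi/4", open("/home/levi/4", "rb").read(), "levi"); the key difference from the posted code is that the DataNode address comes from the redirect, not from a hardcoded port 50075.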
Re: mr default=local?
Hey John,

Sorta unclear on what is prompting this question (to answer it more specifically) but my response below:

On Tue, Apr 9, 2013 at 9:05 AM, John Meza wrote:
> The default mode for hadoop is Standalone, PsuedoDistributed and Fully
> Distributed modes. It is configured for Psuedo and Fully Distributed via
> configuration file, but defaults to Standalone otherwise (correct?).

The mapred-default.xml we ship has "mapred.job.tracker" (0.20.x/1.x/0.22.x) set to local, or "mapreduce.framework.name" (0.23.x, 2.x, trunk) set to local. This is why, without reconfiguring an installation to point to a proper cluster (JT or YARN), you will get local job runner activated.

> Question about the -defaulting- mechanism:
> -Does it get the -default- configuration via one of the config files?

For any Configuration type of invocation:
1. First level of defaults come from *-default.xml embedded inside the various relevant jar files.
2. Configurations further found in a classpath resource XML (core, mapred, hdfs, yarn *-site.xmls) are applied on top of the defaults.
3. User applications' code may then override this set, with any settings of their own, if needed.

> -Or does it get the -default- configuration via hard-coded values?

There may be a few cases of hardcodes, missing documentation and presence in *-default.xml, but they should still be configurable via (2) and (3).

> -Or another mechanism?

--
Harsh J
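The three levels Harsh lists behave like layered maps where a more specific layer shadows the ones beneath it. A toy Python analogy, not Hadoop code (the property values here are invented):

```python
from collections import ChainMap

defaults = {"mapred.job.tracker": "local"}        # 1: *-default.xml inside the jars
site     = {"mapred.job.tracker": "jthost:8021"}  # 2: mapred-site.xml on the classpath
job      = {}                                     # 3: overrides set by the job's own code

# Lookup walks job -> site -> defaults, mirroring Hadoop's Configuration.
conf = ChainMap(job, site, defaults)
assert conf["mapred.job.tracker"] == "jthost:8021"  # site shadows the shipped default

job["mapred.job.tracker"] = "other:9001"            # a user override wins over both
assert conf["mapred.job.tracker"] == "other:9001"
```

With all three layers empty except the defaults, lookups fall through to "local", which is why an unconfigured install runs the local job runner.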
mr default=local?
Hadoop has three modes: Standalone, Pseudo-Distributed and Fully Distributed. It is configured for Pseudo- and Fully Distributed via configuration file, but defaults to Standalone otherwise (correct?). Question about the defaulting mechanism:
- Does it get the default configuration via one of the config files?
- Or does it get the default configuration via hard-coded values?
- Or another mechanism?
Thanks, John
Re: Best format to use
Hey Mark, The Gzip codec creates the extension .gz, not .deflate (which is DeflateCodec). You may want to re-check your settings. Impala questions are best resolved at its current user and developer community at https://groups.google.com/a/cloudera.org/forum/#!forum/impala-user. Impala does currently support LZO (and also Indexed LZO) compressed text files, however, so you may want to try that, as it's splittable (compared to Gzip files). On Tue, Apr 9, 2013 at 5:18 AM, Mark wrote: > Trying to determine what the best format to use for storing daily logs. We > recently switch from snappy (.snappy) to gzip (.deflate) but I'm wondering if > there is something better? Our main clients for these daily logs are pig and > hive using an external table. We were thinking about testing out impala but > we see that it doesn't work with compressed text files. Any suggestions? > > Thanks -- Harsh J
Re: Problem accessing HDFS from a remote machine
can you use command "jps" on your localhost to see if there is NameNode process running?

On Tue, Apr 9, 2013 at 2:27 AM, Bjorn Jonsson wrote:
> Yes, the namenode port is not open for your cluster. I had this problem
> too. First, log into your namenode and do netstat -nap to see what ports are
> listening. You can do service --status-all to see if the namenode service
> is running. Basically you need Hadoop to bind to the correct ip (an
> external one, or at least reachable from your remote machine). So listening
> on 127.0.0.1 or localhost or some ip for a private network will not be
> sufficient. Check your /etc/hosts file and /etc/hadoop/conf/*-site.xml
> files to configure the correct ip/ports.
>
> I'm no expert, so my understanding might be limited/wrong...but I hope
> this helps :)
>
> Best,
> B
>
> On Mon, Apr 8, 2013 at 7:29 AM, Saurabh Jain wrote:
>> [original message snipped]
Re: Best format to use
Impala can work with compressed files, but as compressed SequenceFiles, not as directly compressed text. On Tue, Apr 9, 2013 at 7:48 AM, Mark wrote: > Trying to determine what the best format to use for storing daily logs. We > recently switch from snappy (.snappy) to gzip (.deflate) but I'm wondering > if there is something better? Our main clients for these daily logs are pig > and hive using an external table. We were thinking about testing out impala > but we see that it doesn't work with compressed text files. Any suggestions? > > Thanks
Best format to use
Trying to determine the best format to use for storing daily logs. We recently switched from snappy (.snappy) to gzip (.deflate), but I'm wondering if there is something better? Our main clients for these daily logs are pig and hive using an external table. We were thinking about testing out impala, but we see that it doesn't work with compressed text files. Any suggestions? Thanks
how to install hadoop 2.0.3 in standalone mode
I am new to hadoop and maven. I would like to compile hadoop from the source and install it. I am following the instructions from http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html So far, I have managed to download the hadoop source code and, from the source directory, issued "mvn clean install -Pnative". Next I tried to execute mvn assembly:assembly, but I get the following error:

    Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.3:assembly (default-cli) on project hadoop-main: Error reading assemblies: No assembly descriptors found. -> [Help 1]

Please help so that I can move forward. Also, the above mentioned install link does not mention what the value of $HADOOP_COMMON_HOME/$HADOOP_HDFS_HOME should be.

Thanks in advance,
Jim
Re: Problem accessing HDFS from a remote machine
Yes, the namenode port is not open for your cluster. I had this problem too. First, log into your namenode and do netstat -nap to see what ports are listening. You can do service --status-all to see if the namenode service is running. Basically you need Hadoop to bind to the correct ip (an external one, or at least reachable from your remote machine). So listening on 127.0.0.1 or localhost or some ip for a private network will not be sufficient. Check your /etc/hosts file and /etc/hadoop/conf/*-site.xml files to configure the correct ip/ports.

I'm no expert, so my understanding might be limited/wrong...but I hope this helps :)

Best,
B

On Mon, Apr 8, 2013 at 7:29 AM, Saurabh Jain wrote:
> [original message snipped]
How to configure mapreduce archive size?
Hi, I am using hadoop which is packaged within hbase-0.94.1. It is hadoop 1.0.3. There is some mapreduce job running on my server. After some time, I found that my folder /tmp/hadoop-root/mapred/local/archive has 14G size. How to configure this and limit the size? I do not want to waste my space for archive. Thanks, Xia
Problem accessing HDFS from a remote machine
Hi All,

I have setup a single node cluster (release hadoop-1.0.4). Following is the configuration used -

core-site.xml:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54310</value>
      </property>
    </configuration>

masters:

    localhost

slaves:

    localhost

I am able to successfully format the Namenode and perform file system operations by running the CLIs on the Namenode. But I am receiving the following error when I try to access HDFS from a remote machine -

$ bin/hadoop fs -ls /
Warning: $HADOOP_HOME is deprecated.
13/04/08 07:13:56 INFO ipc.Client: Retrying connect to server: 10.209.10.206/10.209.10.206:54310. Already tried 0 time(s).
13/04/08 07:13:57 INFO ipc.Client: Retrying connect to server: 10.209.10.206/10.209.10.206:54310. Already tried 1 time(s).
13/04/08 07:13:58 INFO ipc.Client: Retrying connect to server: 10.209.10.206/10.209.10.206:54310. Already tried 2 time(s).
13/04/08 07:13:59 INFO ipc.Client: Retrying connect to server: 10.209.10.206/10.209.10.206:54310. Already tried 3 time(s).
13/04/08 07:14:00 INFO ipc.Client: Retrying connect to server: 10.209.10.206/10.209.10.206:54310. Already tried 4 time(s).
13/04/08 07:14:01 INFO ipc.Client: Retrying connect to server: 10.209.10.206/10.209.10.206:54310. Already tried 5 time(s).
13/04/08 07:14:02 INFO ipc.Client: Retrying connect to server: 10.209.10.206/10.209.10.206:54310. Already tried 6 time(s).
13/04/08 07:14:03 INFO ipc.Client: Retrying connect to server: 10.209.10.206/10.209.10.206:54310. Already tried 7 time(s).
13/04/08 07:14:04 INFO ipc.Client: Retrying connect to server: 10.209.10.206/10.209.10.206:54310. Already tried 8 time(s).
13/04/08 07:14:05 INFO ipc.Client: Retrying connect to server: 10.209.10.206/10.209.10.206:54310. Already tried 9 time(s).
Bad connection to FS. command aborted. exception: Call to 10.209.10.206/10.209.10.206:54310 failed on connection exception: java.net.ConnectException: Connection refused

Where 10.209.10.206 is the IP of the server hosting the Namenode and it is also the configured value for "fs.default.name" in the core-site.xml file on the remote machine.

Executing 'bin/hadoop fs -fs hdfs://10.209.10.206:54310 -ls /' also results in the same output.

Also, I am writing a C application using libhdfs to communicate with HDFS. How do we provide credentials while connecting to HDFS?

Thanks
Saurabh
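The replies in this thread point at the likely cause: fs.default.name on the server is bound to localhost, so the NameNode only listens for local clients. A hedged example of the change to core-site.xml on the server, using the IP from the post (then restart the NameNode and re-check with netstat):

```xml
<property>
  <name>fs.default.name</name>
  <!-- Bind to an address reachable from remote machines,
       not localhost, so remote clients can connect. -->
  <value>hdfs://10.209.10.206:54310</value>
</property>
```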
Question about RPM Hadoop disto user and keys (re-posting with subject this time)
Hi all, I'm new to Hadoop and am posting my first message on this list. I have downloaded and installed the hadoop_1.1.1-1_x86_64.deb distro and have a couple of issues which are blocking me from progressing. I'm working through the 'Hadoop - The Definitive Guide' book and am trying to set up a test VM in pseudo-distributed mode using the RPM. The examples in the book allude to (although I don't think they explicitly state) having a single user for everything and creating a passwordless private/public key pair to allow the user to ssh to localhost to control things. I'm guessing this is because the book uses the .zip distribution, which doesn't create any users and therefore assumes running as an already existing locally logged on user. I notice however that the RPM creates 2 users: mapred and hdfs. As a result I'm a bit unclear about the following:

1: Does it matter which user I log in as to perform various actions? e.g. if I want to run start-dfs.sh should I be logged in as 'hdfs'? I did try running start-dfs as root thinking it might drop down to hdfs using a RUN_AS user (like most init.d scripts do) but it didn't work like that. Is there any documentation covering which users should be used to do what when running the RPM distribution?

2: Whilst the RPM creates the hdfs user and specifies /var/lib/hadoop/hdfs as the homedir, it doesn't actually create this directory. This results in an error when logging in as the user. Is this normal?

3: How should I set up ssh keys between the 2 users? Should each user's public key be in the authorized_keys file of the other user (i.e. is communication between the 2 processes bi-directional) or would something simpler suffice?

Hope these questions are clear enough to advise on, please don't hesitate to ask for more info if there's something I've left out.

Cheers,
Edd

--
Web: http://www.eddgrant.com
Email: e...@eddgrant.com
Mobile: +44 (0) 7861 394 543
Re: Parsing the JobTracker Job Logs
That's nice! Thank you very much. Now I am trying to get Flume to work. It should collect all the files (also the log files from the TaskTracker). Best Regards, Christian. 2013/3/28 Arun C Murthy > Use 'rumen', it's part of Hadoop. > > On Mar 19, 2013, at 3:56 AM, Christian Schneider wrote: > > Hi, > how to parse the log files for our jobs? Are there already classes I can > use? > > I need to display some information on a WebInterface (like the native > JobTracker does). > > > I am talking about this kind of files: > > michaela 11:52:59 > /var/log/hadoop-0.20-mapreduce/history/done/michaela.ixcloud.net_1363615430691_/2013/03/19/00 > # cat job_201303181503_0864_1363686587824_christian_wordCountJob_15 > Meta VERSION="1" . > Job JOBID="job_201303181503_0864" JOBNAME="wordCountJob_15" > USER="christian" SUBMIT_TIME="1363686587824" JOBCONF=" > hdfs://carolin\.ixcloud\.net:8020/user/christian/\.staging/job_201303181503_0864/job\.xml" > VIEW_JOB="*" MODIFY_JOB="*" JOB_QUEUE="default" . > Job JOBID="job_201303181503_0864" JOB_PRIORITY="NORMAL" . > Job JOBID="job_201303181503_0864" LAUNCH_TIME="1363686587923" > TOTAL_MAPS="1" TOTAL_REDUCES="1" JOB_STATUS="PREP" . > Task TASKID="task_201303181503_0864_m_02" TASK_TYPE="SETUP" > START_TIME="1363686587923" SPLITS="" . > MapAttempt TASK_TYPE="SETUP" TASKID="task_201303181503_0864_m_02" > TASK_ATTEMPT_ID="attempt_201303181503_0864_m_02_0" > START_TIME="1363686594028" > TRACKER_NAME="tracker_anna\.ixcloud\.net:localhost/127\.0\.0\.1:34657" > HTTP_PORT="50060" .
> MapAttempt TASK_TYPE="SETUP" TASKID="task_201303181503_0864_m_02" > TASK_ATTEMPT_ID="attempt_201303181503_0864_m_02_0" > TASK_STATUS="SUCCESS" FINISH_TIME="1363686595929" > HOSTNAME="/default/anna\.ixcloud\.net" STATE_STRING="setup" > COUNTERS="{(org\.apache\.hadoop\.mapreduce\.FileSystemCounter)(File System > Counters)[(FILE_BYTES_READ)(FILE: Number of bytes > read)(0)][(FILE_BYTES_WRITTEN)(FILE: Number of bytes > written)(152299)][(FILE_READ_OPS)(FILE: Number of read > operations)(0)][(FILE_LARGE_READ_OPS)(FILE: Number of large read > operations)(0)][(FILE_WRITE_OPS)(FILE: Number of write > operations)(0)][(HDFS_BYTES_READ)(HDFS: Number of bytes > read)(0)][(HDFS_BYTES_WRITTEN)(HDFS: Number of bytes > written)(0)][(HDFS_READ_OPS)(HDFS: Number of read > operations)(0)][(HDFS_LARGE_READ_OPS)(HDFS: Number of large read > operations)(0)][(HDFS_WRITE_OPS)(HDFS: Number of write > operations)(1)]}{(org\.apache\.hadoop\.mapreduce\.TaskCounter)(Map-Reduce > Framework)[(SPILLED_RECORDS)(Spilled Records)(0)][(CPU_MILLISECONDS)(CPU > time spent \\(ms\\))(80)][(PHYSICAL_MEMORY_BYTES)(Physical memory > \\(bytes\\) snapshot)(91693056)][(VIRTUAL_MEMORY_BYTES)(Virtual memory > \\(bytes\\) snapshot)(575086592)][(COMMITTED_HEAP_BYTES)(Total committed > heap usage > \\(bytes\\))(62324736)]}nullnullnullnullnullnullnullnullnullnullnullnullnull" > > ... > > > Best Regards, > Christian. > > > -- > Arun C. Murthy > Hortonworks Inc. > http://hortonworks.com/ > > >
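Besides rumen, the line format shown above (a record type followed by KEY="VALUE" pairs, with backslash-escaped characters inside the values) is simple enough to parse directly. A minimal Python sketch, with the handling deliberately simplified (for example, the nested COUNTERS string is kept as one verbatim value):

```python
import re

# Each job-history line looks like: RecordType KEY="VALUE" KEY="VALUE" ... .
# Values may contain \" escapes, so a value is a run of non-quote,
# non-backslash characters or backslash-escaped characters.
_PAIR = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

def parse_history_line(line):
    """Return (record_type, {key: value}) for one job-history line."""
    record_type, _, rest = line.partition(" ")
    fields = {k: v.replace('\\"', '"') for k, v in _PAIR.findall(rest)}
    return record_type, fields

rtype, fields = parse_history_line(
    'Job JOBID="job_201303181503_0864" JOBNAME="wordCountJob_15" '
    'USER="christian" SUBMIT_TIME="1363686587824" .')
```

From here, a small web interface could accumulate these records per JOBID and render them, which is roughly what the native JobTracker UI does with the same files.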
RES: I want to call HDFS REST api to upload a file using httplib.
On your first call, Hadoop will return a URL pointing to a datanode in the Location header of the 307 response. On your second call, you have to use that URL instead of constructing your own. You can see the specific documentation here: http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#CREATE

Regards,
Marcos

I want to call HDFS REST api to upload a file using httplib. My program created the file, but no content is in it.

Here is my code:

    import httplib
    conn = httplib.HTTPConnection("localhost:50070")
    conn.request("PUT", "/webhdfs/v1/levi/4?op=CREATE")
    res = conn.getresponse()
    print res.status, res.reason
    conn.close()
    conn = httplib.HTTPConnection("localhost:50075")
    conn.connect()
    conn.putrequest("PUT", "/webhdfs/v1/levi/4?op=CREATE&user.name=levi")
    conn.endheaders()
    a_file = open("/home/levi/4", "rb")
    a_file.seek(0)
    data = a_file.read()
    conn.send(data)
    res = conn.getresponse()
    print res.status, res.reason
    conn.close()

Here is the return:

    307 TEMPORARY_REDIRECT
    201 Created

OK, the file was created, but no content was sent. When I comment out the conn.send(data), the result is the same, still no content. Maybe the file read or the send is wrong, not sure. Do you know how this happened?