Re: How can I record some position of context in Reduce()?

2013-04-08 Thread Vikas Jadhav
Hi,
I am also working on joins using MapReduce.

I think that instead of finding the position of a table in RawKeyValueIterator,
we can modify the mapper's context.write() call to always include the table name
(or an id) in the key. Then we don't need to find the position; we can get the
key and value from the "reducerContext".

Before calling reducer.run(reducerContext) in ReduceTask.java we could add a
join() method to the Reducer class in Reducer.java and call
reducer.join(reducerContext).
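
For reference, here is a minimal sketch of that tagging idea done purely in user
code, without touching ReduceTask.java (class and field names below are invented
for the example, and it assumes tab-separated text input with one file per table):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Tags every record with the file (table) it came from, so the reducer can
// separate the join sides without knowing any record positions.
public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String tableName;

    @Override
    protected void setup(Context context) {
        // Assumes one input file per table; the file name acts as the table id.
        tableName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumes tab-separated records where the first field is the join key.
        String[] fields = line.toString().split("\t", 2);
        if (fields.length == 2) {
            // Key carries the join key; value carries the source tag plus payload.
            context.write(new Text(fields[0]), new Text(tableName + "\t" + fields[1]));
        }
    }
}

With the tag in place, the reducer receives all records for a join key together
and can tell which table each value came from.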


I just wonder how you are going to support a NON-EQUI join.

I am also having the same problem of how to do the join if the datasets can't
fit into memory.


For now I am cloning the key and values using the following code (deep copies
are needed because Hadoop reuses the same key/value objects across iterations):


// Deep-copy the current key: Hadoop reuses the same key/value instances across
// iterations, so plain references would all end up pointing at the last record.
// Needs java.util.ArrayList and org.apache.hadoop.util.ReflectionUtils.
KEYIN key = context.getCurrentKey();
KEYIN outKey = null;
try {
    outKey = (KEYIN) key.getClass().newInstance();
} catch (Exception e) {
    throw new RuntimeException("Cannot instantiate key class", e);
}
ReflectionUtils.copy(context.getConfiguration(), key, outKey);

// Deep-copy and buffer every value for this key for the same reason.
Iterable<VALUEIN> values = context.getValues();
ArrayList<VALUEIN> myValues = new ArrayList<VALUEIN>();
for (VALUEIN value : values) {
    VALUEIN outValue = null;
    try {
        outValue = (VALUEIN) value.getClass().newInstance();
    } catch (Exception e) {
        throw new RuntimeException("Cannot instantiate value class", e);
    }
    ReflectionUtils.copy(context.getConfiguration(), value, outValue);
    myValues.add(outValue);  // keep the copy; it stays valid after the iterator moves on
}
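
And a rough sketch of how the buffered, tagged values could then be cross-joined
per key, assuming Text values and the tagging mapper sketched above ("tableA" is
an illustrative tag name, and this only works while one key's values fit in memory):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Cross-joins the two tagged sides for one join key. The tag "tableA" must
// match whatever the mapper actually emitted.
public class CrossJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> left = new ArrayList<String>();
        List<String> right = new ArrayList<String>();
        for (Text value : values) {
            // Copy into Strings, because Hadoop reuses the Text instance.
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) continue;
            if ("tableA".equals(parts[0])) {
                left.add(parts[1]);
            } else {
                right.add(parts[1]);
            }
        }
        // Cartesian product of the two sides for this key.
        for (String l : left) {
            for (String r : right) {
                context.write(key, new Text(l + "\t" + r));
            }
        }
    }
}

For values that do not fit in memory, the usual trick is a secondary sort so that
one side arrives first and only that side has to be buffered.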


If you have found any other solution, please feel free to share.

Thank you.







On Thu, Mar 14, 2013 at 1:53 PM, Roth Effy  wrote:

> In reduce() we have:
>
> key1 values1
> key2 values2
> ...
> keyn valuesn
>
> So what I want to do is join all the values, like the SQL:
>
> select * from values1, values2 ... valuesn;
>
> If memory is not enough to cache the values, how do I complete the join operation?
> My idea is to clone the ReduceContext, but that may not be easy.
>
> Any help will be appreciated.
>
>
> 2013/3/13 Roth Effy 
>
>> I want an n:n join as a Cartesian product, but DataJoinReducerBase looks
>> like it only supports equi-joins.
>> I want a non-equi join, but I have no idea how to do it yet.
>>
>>
>> 2013/3/13 Azuryy Yu 
>>
>>> Do you want an n:n join or a 1:n join?
>>> On Mar 13, 2013 10:51 AM, "Roth Effy"  wrote:
>>>
 I want to join data from two tables in the reducer, so I need to find the start
 of each table.
 Someone said the DataJoinReducerBase can help me with this, is that right?


 2013/3/13 Azuryy Yu 

> you cannot use a RecordReader in the Reducer.
>
> What do you mean by wanting to get the record position? I cannot
> understand; can you give a simple example?
>
>
> On Wed, Mar 13, 2013 at 9:56 AM, Roth Effy  wrote:
>
>> Sorry, I still can't understand how to use a RecordReader in
>> reduce(), because the input is a RawKeyValueIterator in the
>> ReduceContext class, so I'm confused.
>> Anyway, thank you.
>>
>>
>> 2013/3/12 samir das mohapatra 
>>
>>> Through the RecordReader and FileStatus you can get it.
>>>
>>>
>>> On Tue, Mar 12, 2013 at 4:08 PM, Roth Effy wrote:
>>>
 Hi everyone,
 I want to join the k-v pairs in reduce(), but how do I get the record
 position?
 For now, what I thought of is to save the context state, but the Context class
 doesn't implement a clone constructor.

 Any help will be appreciated.
 Thank you very much.

>>>
>>>
>>
>

>>
>


-- 
Thanks and Regards,
Vikas Jadhav


Distributed cache: how big is too big?

2013-04-08 Thread John Meza
I am researching a Hadoop solution for an existing application that requires a
directory structure full of data for processing.

To make the Hadoop solution work I need to deploy the data directory to each
DataNode when the job is executed. I know this isn't new and is commonly done
with a Distributed Cache.

Based on experience, what are the common file sizes deployed in a Distributed
Cache? I know smaller is better, but how big is too big? I have read that the
larger the cache deployed, the longer the startup latency. I also assume there
are other factors that play into this.

What I know so far:
- Default local.cache.size = 10 GB
- Range of desirable sizes for a Distributed Cache = 10 KB - 1 GB??
- Distributed Cache is normally not used if larger than = ?

Another option: put the data directories on each DataNode and provide the
location to the TaskTracker?

thanks
John
  

Re: Problem accessing HDFS from a remote machine

2013-04-08 Thread Rishi Yadav
Have you checked the firewall on the namenode?

If you are running Ubuntu and the namenode port is 8020, the command is
-> ufw allow 8020

Thanks and Regards,

Rishi Yadav

InfoObjects Inc || http://www.infoobjects.com *(Big Data Solutions)*

On Mon, Apr 8, 2013 at 6:57 PM, Azuryy Yu  wrote:

> Can you use the "jps" command on your localhost to see if there is a NameNode
> process running?
>
>
> On Tue, Apr 9, 2013 at 2:27 AM, Bjorn Jonsson  wrote:
>
>> Yes, the namenode port is not open for your cluster. I had this problem
>> too. First, log into your namenode and do netstat -nap to see what ports are
>> listening. You can do service --status-all to see if the namenode service
>> is running. Basically you need Hadoop to bind to the correct ip (an
>> external one, or at least reachable from your remote machine). So listening
>> on 127.0.0.1 or localhost or some ip for a private network will not be
>> sufficient. Check your /etc/hosts file and /etc/hadoop/conf/*-site.xml
>> files to configure the correct ip/ports.
>>
>> I'm no expert, so my understanding might be limited/wrong...but I hope
>> this helps :)
>>
>> Best,
>> B
>>
>>
>> On Mon, Apr 8, 2013 at 7:29 AM, Saurabh Jain 
>> wrote:
>>
>>> Hi All,
>>>
>>> I have setup a single node cluster (release hadoop-1.0.4). Following is
>>> the configuration used –
>>>
>>> *core-site.xml :-*
>>>
>>> <property>
>>>   <name>fs.default.name</name>
>>>   <value>hdfs://localhost:54310</value>
>>> </property>
>>>
>>> *masters:-*
>>>
>>> localhost
>>>
>>> *slaves:-*
>>>
>>> localhost
>>>
>>> I am able to successfully format the Namenode and perform file system
>>> operations by running the CLIs on the Namenode.
>>>
>>> But I am receiving the following error when I try to access HDFS from a
>>> *remote machine* –
>>>
>>> $ bin/hadoop fs -ls /
>>>
>>> Warning: $HADOOP_HOME is deprecated.
>>>
>>> 13/04/08 07:13:56 INFO ipc.Client: Retrying connect to server:
>>> 10.209.10.206/10.209.10.206:54310. Already tried 0 time(s).
>>>
>>> 13/04/08 07:13:57 INFO ipc.Client: Retrying connect to server:
>>> 10.209.10.206/10.209.10.206:54310. Already tried 1 time(s).
>>>
>>> 13/04/08 07:13:58 INFO ipc.Client: Retrying connect to server:
>>> 10.209.10.206/10.209.10.206:54310. Already tried 2 time(s).
>>>
>>> 13/04/08 07:13:59 INFO ipc.Client: Retrying connect to server:
>>> 10.209.10.206/10.209.10.206:54310. Already tried 3 time(s).
>>>
>>> 13/04/08 07:14:00 INFO ipc.Client: Retrying connect to server:
>>> 10.209.10.206/10.209.10.206:54310. Already tried 4 time(s).
>>>
>>> 13/04/08 07:14:01 INFO ipc.Client: Retrying connect to server:
>>> 10.209.10.206/10.209.10.206:54310. Already tried 5 time(s).
>>>
>>> 13/04/08 07:14:02 INFO ipc.Client: Retrying connect to server:
>>> 10.209.10.206/10.209.10.206:54310. Already tried 6 time(s).
>>>
>>> 13/04/08 07:14:03 INFO ipc.Client: Retrying connect to server:
>>> 10.209.10.206/10.209.10.206:54310. Already tried 7 time(s).
>>>
>>> 13/04/08 07:14:04 INFO ipc.Client: Retrying connect to server:
>>> 10.209.10.206/10.209.10.206:54310. Already tried 8 time(s).
>>>
>>> 13/04/08 07:14:05 INFO ipc.Client: Retrying connect to server:
>>> 10.209.10.206/10.209.10.206:54310. Already tried 9 time(s).
>>>
>>> Bad connection to FS. command aborted. exception: Call to
>>> 10.209.10.206/10.209.10.206:54310 failed on connection exception:
>>> java.net.ConnectException: Connection refused
>>>
>>> Where 10.209.10.206 is the IP of the server hosting the Namenode and it
>>> is also the configured value for “fs.default.name” in the core-site.xml
>>> file on the remote machine.
>>>
>>> Executing ‘*bin/hadoop fs -fs hdfs://10.209.10.206:54310 -ls /*’ also
>>> results in the same output.
>>>
>>> Also, I am writing a C application using libhdfs to communicate with
>>> HDFS. How do we provide credentials while connecting to HDFS?
>>>
>>> Thanks
>>> Saurabh
>>>
>>
>>
>


RE: mr default=local?

2013-04-08 Thread John Meza
Harsh, thanks for the quick reply. While I am a Hadoop newbie, I find I am
explaining Hadoop install, config, and job processing to newer newbies. Thus the
desire and need for more details.

John
> From: ha...@cloudera.com
> Date: Tue, 9 Apr 2013 09:16:49 +0530
> Subject: Re: mr default=local?
> To: user@hadoop.apache.org
> 
> Hey John,
> 
> Sorta unclear on what is prompting this question (to answer it more
> specifically) but my response below:
> 
> On Tue, Apr 9, 2013 at 9:05 AM, John Meza  wrote:
> > Hadoop has Standalone, PseudoDistributed and Fully Distributed modes. It is
> > configured for Pseudo and Fully Distributed via configuration files, but
> > defaults to Standalone otherwise (correct?).
> 
> The mapred-default.xml we ship has "mapred.job.tracker"
> (0.20.x/1.x/0.22.x) set to local, or "mapreduce.framework.name"
> (0.23.x, 2.x, trunk) set to local. This is why, without reconfiguring
> an installation to point to a proper cluster (JT or YARN), you will
> get the local job runner activated.
> 
> > Question about the -defaulting- mechanism:
> > -Does it get the -default- configuration via one of the config files?
> 
> For any Configuration type of invocation:
> 1. First level of defaults come from *-default.xml embedded inside the
> various relevant jar files.
> 2. Configurations further found in a classpath resource XML
> (core,mapred,hdfs,yarn, *-site.xmls) are applied on top of the
> defaults.
> 3. User applications' code may then override this set, with any
> settings of their own, if needed.
> 
> > -Or does it get the -default- configuration via hard-coded values?
> 
> There may be a few cases of hardcodes, missing documentation and
> presence in *-default.xml, but they should still be configurable via
> (2) and (3).
> 
> > -Or another mechanism?
> 
> --
> Harsh J
  

Re: How to configure mapreduce archive size?

2013-04-08 Thread Hemanth Yamijala
Hi,

This directory is used as part of the 'DistributedCache' feature. (
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#DistributedCache).
There is a configuration key "local.cache.size" which controls the amount
of data stored under DistributedCache. The default limit is 10GB. However,
the files under this cannot be deleted if they are being used. Also, some
frameworks on Hadoop could be using DistributedCache transparently to you.

So you could check what is being stored here and, based on that, lower the
limit of the cache size if you feel that will help. The override belongs in
mapred-site.xml (the default value itself lives in mapred-default.xml).
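
For example, to lower the limit to roughly 2 GB (an illustrative figure, not a
recommendation), the override would look something like:

<property>
  <name>local.cache.size</name>
  <!-- value is in bytes; 2147483648 bytes = 2 GB, chosen only as an example -->
  <value>2147483648</value>
</property>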

Thanks
Hemanth


On Mon, Apr 8, 2013 at 11:09 PM,  wrote:

> Hi,
>
> I am using hadoop which is packaged within hbase-0.94.1. It is hadoop
> 1.0.3. There is a mapreduce job running on my server. After some time, I
> found that my folder /tmp/hadoop-root/mapred/local/archive has 14G size.
>
> How do I configure this and limit the size? I do not want to waste my space
> on the archive.
>
> Thanks,
>
> Xia
>


Re:RES: I want to call HDFS REST api to upload a file using httplib.

2013-04-08 Thread ??????PHP
Really, thanks.
But the returned URL is wrong, and localhost is the real URL, as I tested
successfully with curl using "localhost".
Can anybody help me translate the curl command to Python httplib?

curl -i -X PUT -T <LOCAL_FILE> "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE"

I tested it using python httplib and received the right response, but the file
uploaded to HDFS is empty; no data was sent!!
Is "conn.send(data)" the problem?


-- Original --
From:  "MARCOS MEDRADO RUBINELLI";
Date:  Mon, Apr 8, 2013 04:22 PM
To:  "user@hadoop.apache.org"; 

Subject:  RES: I want to call HDFS REST api to upload a file using httplib.



 On your first call, Hadoop will return a URL pointing to a datanode in the 
Location header of the 307 response. On your second call, you have to use that 
URL instead of constructing  your own. You can see the specific documentation 
here: http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#CREATE
 
 
 Regards,
 Marcos
  
  
 
 
 
I want to call HDFS REST api to upload a file using httplib.
 
My program created the file, but no content is in it.
 
=
 
Here is my code:
import httplib

conn=httplib.HTTPConnection("localhost:50070")
conn.request("PUT","/webhdfs/v1/levi/4?op=CREATE")
res=conn.getresponse()
print res.status,res.reason
conn.close()

conn=httplib.HTTPConnection("localhost:50075")
conn.connect()
conn.putrequest("PUT","/webhdfs/v1/levi/4?op=CREATE&user.name=levi")
conn.endheaders()
a_file=open("/home/levi/4","rb")
a_file.seek(0)
data=a_file.read()
conn.send(data)
res=conn.getresponse()
print res.status,res.reason
conn.close()
==
 
Here is the return:
  
307 TEMPORARY_REDIRECT 201 Created
  
=
 
OK, the file was created, but no content was sent.
 
When I comment out the conn.send(data) line, the result is the same: still no content.
 
Maybe the file read or the send is wrong, not sure.
 
Do you know how this happened?

Re: mr default=local?

2013-04-08 Thread Harsh J
Hey John,

Sorta unclear on what is prompting this question (to answer it more
specifically) but my response below:

On Tue, Apr 9, 2013 at 9:05 AM, John Meza  wrote:
> Hadoop has Standalone, PseudoDistributed and Fully Distributed modes. It is
> configured for Pseudo and Fully Distributed via configuration files, but
> defaults to Standalone otherwise (correct?).

The mapred-default.xml we ship has "mapred.job.tracker"
(0.20.x/1.x/0.22.x) set to local, or "mapreduce.framework.name"
(0.23.x, 2.x, trunk) set to local. This is why, without reconfiguring
an installation to point to a proper cluster (JT or YARN), you will
get the local job runner activated.

> Question about the -defaulting- mechanism:
> -Does it get the -default- configuration via one of the config files?

For any Configuration type of invocation:
1. First level of defaults come from *-default.xml embedded inside the
various relevant jar files.
2. Configurations further found in a classpath resource XML
(core,mapred,hdfs,yarn, *-site.xmls) are applied on top of the
defaults.
3. User applications' code may then override this set, with any
settings of their own, if needed.
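
As a rough illustration of that ordering (the resource name, host and value
below are invented for the example):

import org.apache.hadoop.conf.Configuration;

public class ConfLayersDemo {
    public static void main(String[] args) {
        // (1) A new Configuration starts from the bundled default resources
        //     (core-default.xml; JobConf additionally pulls in mapred-default.xml).
        Configuration conf = new Configuration();

        // (2) Site files found on the classpath (core-site.xml, mapred-site.xml, ...)
        //     are layered on top. Extra resources can also be added explicitly;
        //     "my-site.xml" is just an illustrative name.
        conf.addResource("my-site.xml");

        // (3) Application code overrides both of the above.
        conf.set("mapred.job.tracker", "jt.example.com:8021");

        // Whichever layer set the property last wins.
        System.out.println(conf.get("mapred.job.tracker"));
    }
}

Running it prints the value set in step (3), since the later layer wins.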

> -Or does it get the -default- configuration via hard-coded values?

There may be a few cases of hardcodes, missing documentation and
presence in *-default.xml, but they should still be configurable via
(2) and (3).

> -Or another mechanism?

--
Harsh J


mr default=local?

2013-04-08 Thread John Meza
Hadoop has Standalone, PseudoDistributed and Fully Distributed modes. It is
configured for Pseudo and Fully Distributed via configuration files, but
defaults to Standalone otherwise (correct?).

Question about the -defaulting- mechanism:
-Does it get the -default- configuration via one of the config files?
-Or does it get the -default- configuration via hard-coded values?
-Or another mechanism?

thanks
John
  

Re: Best format to use

2013-04-08 Thread Harsh J
Hey Mark,

The Gzip codec creates the extension .gz, not .deflate (which comes from
the DeflateCodec). You may want to re-check your settings.

Impala questions are best resolved at its current user and developer
community at https://groups.google.com/a/cloudera.org/forum/#!forum/impala-user.
Impala does currently support LZO (and also indexed LZO) compressed
text files, however, so you may want to try that, as it's splittable
(unlike Gzip).
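
If the .deflate files are coming from a MapReduce job's output settings, the
codec that actually writes .gz is selected with something like the following
(Hadoop 1.x property names; Pig and Hive expose their own settings for the same
thing):

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>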

On Tue, Apr 9, 2013 at 5:18 AM, Mark  wrote:
> Trying to determine the best format to use for storing daily logs. We
> recently switched from snappy (.snappy) to gzip (.deflate), but I'm wondering if
> there is something better? Our main clients for these daily logs are pig and
> hive using an external table. We were thinking about testing out impala, but
> we see that it doesn't work with compressed text files. Any suggestions?
>
> Thanks



-- 
Harsh J


Re: Problem accessing HDFS from a remote machine

2013-04-08 Thread Azuryy Yu
Can you use the "jps" command on your localhost to see if there is a NameNode
process running?


On Tue, Apr 9, 2013 at 2:27 AM, Bjorn Jonsson  wrote:

> Yes, the namenode port is not open for your cluster. I had this problem
> too. First, log into your namenode and do netstat -nap to see what ports are
> listening. You can do service --status-all to see if the namenode service
> is running. Basically you need Hadoop to bind to the correct ip (an
> external one, or at least reachable from your remote machine). So listening
> on 127.0.0.1 or localhost or some ip for a private network will not be
> sufficient. Check your /etc/hosts file and /etc/hadoop/conf/*-site.xml
> files to configure the correct ip/ports.
>
> I'm no expert, so my understanding might be limited/wrong...but I hope
> this helps :)
>
> Best,
> B
>
>
> On Mon, Apr 8, 2013 at 7:29 AM, Saurabh Jain wrote:
>
>> Hi All,
>>
>> I have setup a single node cluster (release hadoop-1.0.4). Following is
>> the configuration used –
>>
>> *core-site.xml :-*
>>
>> <property>
>>   <name>fs.default.name</name>
>>   <value>hdfs://localhost:54310</value>
>> </property>
>>
>> *masters:-*
>>
>> localhost
>>
>> *slaves:-*
>>
>> localhost
>>
>> I am able to successfully format the Namenode and perform file system
>> operations by running the CLIs on the Namenode.
>>
>> But I am receiving the following error when I try to access HDFS from a
>> *remote machine* –
>>
>> $ bin/hadoop fs -ls /
>>
>> Warning: $HADOOP_HOME is deprecated.
>>
>> 13/04/08 07:13:56 INFO ipc.Client: Retrying connect to server:
>> 10.209.10.206/10.209.10.206:54310. Already tried 0 time(s).
>>
>> 13/04/08 07:13:57 INFO ipc.Client: Retrying connect to server:
>> 10.209.10.206/10.209.10.206:54310. Already tried 1 time(s).
>>
>> 13/04/08 07:13:58 INFO ipc.Client: Retrying connect to server:
>> 10.209.10.206/10.209.10.206:54310. Already tried 2 time(s).
>>
>> 13/04/08 07:13:59 INFO ipc.Client: Retrying connect to server:
>> 10.209.10.206/10.209.10.206:54310. Already tried 3 time(s).
>>
>> 13/04/08 07:14:00 INFO ipc.Client: Retrying connect to server:
>> 10.209.10.206/10.209.10.206:54310. Already tried 4 time(s).
>>
>> 13/04/08 07:14:01 INFO ipc.Client: Retrying connect to server:
>> 10.209.10.206/10.209.10.206:54310. Already tried 5 time(s).
>>
>> 13/04/08 07:14:02 INFO ipc.Client: Retrying connect to server:
>> 10.209.10.206/10.209.10.206:54310. Already tried 6 time(s).
>>
>> 13/04/08 07:14:03 INFO ipc.Client: Retrying connect to server:
>> 10.209.10.206/10.209.10.206:54310. Already tried 7 time(s).
>>
>> 13/04/08 07:14:04 INFO ipc.Client: Retrying connect to server:
>> 10.209.10.206/10.209.10.206:54310. Already tried 8 time(s).
>>
>> 13/04/08 07:14:05 INFO ipc.Client: Retrying connect to server:
>> 10.209.10.206/10.209.10.206:54310. Already tried 9 time(s).
>>
>> Bad connection to FS. command aborted. exception: Call to
>> 10.209.10.206/10.209.10.206:54310 failed on connection exception:
>> java.net.ConnectException: Connection refused
>>
>> Where 10.209.10.206 is the IP of the server hosting the Namenode and it
>> is also the configured value for “fs.default.name” in the core-site.xml
>> file on the remote machine.
>>
>> Executing ‘*bin/hadoop fs -fs hdfs://10.209.10.206:54310 -ls /*’ also
>> results in the same output.
>>
>> Also, I am writing a C application using libhdfs to communicate with
>> HDFS. How do we provide credentials while connecting to HDFS?
>>
>> Thanks
>> Saurabh
>>
>
>


Re: Best format to use

2013-04-08 Thread Azuryy Yu
Impala can work with compressed files, but as compressed SequenceFiles, not
as directly compressed text.


On Tue, Apr 9, 2013 at 7:48 AM, Mark  wrote:

> Trying to determine the best format to use for storing daily logs. We
> recently switched from snappy (.snappy) to gzip (.deflate), but I'm wondering
> if there is something better? Our main clients for these daily logs are pig
> and hive using an external table. We were thinking about testing out impala,
> but we see that it doesn't work with compressed text files. Any suggestions?
>
> Thanks


Best format to use

2013-04-08 Thread Mark
Trying to determine the best format to use for storing daily logs. We
recently switched from snappy (.snappy) to gzip (.deflate), but I'm wondering if
there is something better? Our main clients for these daily logs are pig and
hive using an external table. We were thinking about testing out impala, but we
see that it doesn't work with compressed text files. Any suggestions?

Thanks

how to install hadoop 2.0.3 in standalone mode

2013-04-08 Thread jim jimm
I am new to hadoop and maven. I would like to compile hadoop from the
source and install it. I am following the instructions from
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html

So far, I have managed to download the hadoop source code and, from the source
directory, issued "mvn clean install -Pnative".
Next I tried to execute mvn assembly:assembly, but I get the following error:

Failed to execute goal
org.apache.maven.plugins:maven-assembly-plugin:2.3:assembly (default-cli)
on project hadoop-main: Error reading assemblies: No assembly descriptors
found. -> [Help 1]

Please help so that I can move forward.

Also, the above-mentioned install link does not mention what the values of
$HADOOP_COMMON_HOME / $HADOOP_HDFS_HOME should be.

Thanks in advance,
Jim


Re: Problem accessing HDFS from a remote machine

2013-04-08 Thread Bjorn Jonsson
Yes, the namenode port is not open for your cluster. I had this problem too.
First, log into your namenode and do netstat -nap to see what ports are
listening. You can do service --status-all to see if the namenode service
is running. Basically you need Hadoop to bind to the correct ip (an
external one, or at least reachable from your remote machine). So listening
on 127.0.0.1 or localhost or some ip for a private network will not be
sufficient. Check your /etc/hosts file and /etc/hadoop/conf/*-site.xml
files to configure the correct ip/ports.
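
For instance, fs.default.name in core-site.xml on the namenode needs to point at
an address the remote machine can actually reach (namenode.example.com below is
only a placeholder):

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:54310</value>
</property>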

I'm no expert, so my understanding might be limited/wrong...but I hope this
helps :)

Best,
B


On Mon, Apr 8, 2013 at 7:29 AM, Saurabh Jain wrote:

> Hi All,
>
> I have setup a single node cluster (release hadoop-1.0.4). Following is the
> configuration used –
>
> *core-site.xml :-*
>
> <property>
>   <name>fs.default.name</name>
>   <value>hdfs://localhost:54310</value>
> </property>
>
> *masters:-*
>
> localhost
>
> *slaves:-*
>
> localhost
>
> I am able to successfully format the Namenode and perform file system
> operations by running the CLIs on the Namenode.
>
> But I am receiving the following error when I try to access HDFS from a
> *remote machine* –
>
> $ bin/hadoop fs -ls /
>
> Warning: $HADOOP_HOME is deprecated.
>
> 13/04/08 07:13:56 INFO ipc.Client: Retrying connect to server:
> 10.209.10.206/10.209.10.206:54310. Already tried 0 time(s).
>
> 13/04/08 07:13:57 INFO ipc.Client: Retrying connect to server:
> 10.209.10.206/10.209.10.206:54310. Already tried 1 time(s).
>
> 13/04/08 07:13:58 INFO ipc.Client: Retrying connect to server:
> 10.209.10.206/10.209.10.206:54310. Already tried 2 time(s).
>
> 13/04/08 07:13:59 INFO ipc.Client: Retrying connect to server:
> 10.209.10.206/10.209.10.206:54310. Already tried 3 time(s).
>
> 13/04/08 07:14:00 INFO ipc.Client: Retrying connect to server:
> 10.209.10.206/10.209.10.206:54310. Already tried 4 time(s).
>
> 13/04/08 07:14:01 INFO ipc.Client: Retrying connect to server:
> 10.209.10.206/10.209.10.206:54310. Already tried 5 time(s).
>
> 13/04/08 07:14:02 INFO ipc.Client: Retrying connect to server:
> 10.209.10.206/10.209.10.206:54310. Already tried 6 time(s).
>
> 13/04/08 07:14:03 INFO ipc.Client: Retrying connect to server:
> 10.209.10.206/10.209.10.206:54310. Already tried 7 time(s).
>
> 13/04/08 07:14:04 INFO ipc.Client: Retrying connect to server:
> 10.209.10.206/10.209.10.206:54310. Already tried 8 time(s).
>
> 13/04/08 07:14:05 INFO ipc.Client: Retrying connect to server:
> 10.209.10.206/10.209.10.206:54310. Already tried 9 time(s).
>
> Bad connection to FS. command aborted. exception: Call to
> 10.209.10.206/10.209.10.206:54310 failed on connection exception:
> java.net.ConnectException: Connection refused
>
> Where 10.209.10.206 is the IP of the server hosting the Namenode and it
> is also the configured value for “fs.default.name” in the core-site.xml
> file on the remote machine.
>
> Executing ‘*bin/hadoop fs -fs hdfs://10.209.10.206:54310 -ls /*’ also
> results in the same output.
>
> Also, I am writing a C application using libhdfs to communicate with HDFS.
> How do we provide credentials while connecting to HDFS?
>
> Thanks
> Saurabh
>


How to configure mapreduce archive size?

2013-04-08 Thread Xia_Yang
Hi,

I am using hadoop which is packaged within hbase-0.94.1. It is hadoop 1.0.3.
There is a mapreduce job running on my server. After some time, I found that
my folder /tmp/hadoop-root/mapred/local/archive has 14G size.

How do I configure this and limit the size? I do not want to waste my space on
the archive.

Thanks,

Xia



Problem accessing HDFS from a remote machine

2013-04-08 Thread Saurabh Jain
Hi All,

I have setup a single node cluster (release hadoop-1.0.4). Following is the
configuration used -

core-site.xml :-


<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>

masters:-
localhost

slaves:-
localhost

I am able to successfully format the Namenode and perform file system
operations by running the CLIs on the Namenode.

But I am receiving the following error when I try to access HDFS from a remote
machine -

$ bin/hadoop fs -ls /
Warning: $HADOOP_HOME is deprecated.

13/04/08 07:13:56 INFO ipc.Client: Retrying connect to server: 
10.209.10.206/10.209.10.206:54310. Already tried 0 time(s).
13/04/08 07:13:57 INFO ipc.Client: Retrying connect to server: 
10.209.10.206/10.209.10.206:54310. Already tried 1 time(s).
13/04/08 07:13:58 INFO ipc.Client: Retrying connect to server: 
10.209.10.206/10.209.10.206:54310. Already tried 2 time(s).
13/04/08 07:13:59 INFO ipc.Client: Retrying connect to server: 
10.209.10.206/10.209.10.206:54310. Already tried 3 time(s).
13/04/08 07:14:00 INFO ipc.Client: Retrying connect to server: 
10.209.10.206/10.209.10.206:54310. Already tried 4 time(s).
13/04/08 07:14:01 INFO ipc.Client: Retrying connect to server: 
10.209.10.206/10.209.10.206:54310. Already tried 5 time(s).
13/04/08 07:14:02 INFO ipc.Client: Retrying connect to server: 
10.209.10.206/10.209.10.206:54310. Already tried 6 time(s).
13/04/08 07:14:03 INFO ipc.Client: Retrying connect to server: 
10.209.10.206/10.209.10.206:54310. Already tried 7 time(s).
13/04/08 07:14:04 INFO ipc.Client: Retrying connect to server: 
10.209.10.206/10.209.10.206:54310. Already tried 8 time(s).
13/04/08 07:14:05 INFO ipc.Client: Retrying connect to server: 
10.209.10.206/10.209.10.206:54310. Already tried 9 time(s).
Bad connection to FS. command aborted. exception: Call to 
10.209.10.206/10.209.10.206:54310 failed on connection exception: 
java.net.ConnectException: Connection refused

Where 10.209.10.206 is the IP of the server hosting the Namenode and it is
also the configured value for "fs.default.name" in the core-site.xml file on
the remote machine.

Executing 'bin/hadoop fs -fs hdfs://10.209.10.206:54310 -ls /' also results in
the same output.

Also, I am writing a C application using libhdfs to communicate with HDFS. How 
do we provide credentials while connecting to HDFS?

Thanks
Saurabh




Question about RPM Hadoop disto user and keys (re-posting with subject this time)

2013-04-08 Thread Edd Grant
Hi all,

I'm new to Hadoop and am posting my first message on this list. I have
downloaded and installed the hadoop_1.1.1-1_x86_64.deb distro and have a
couple of issues which are blocking me from progressing.

I'm working through the 'Hadoop - The Definitive Guide' book and am trying
to set up a test VM in pseudodistributed mode using the RPM. The examples
in the book allude to (although I don't think they explicitly state) having
a single user for everything and creating a passwordless private/public key
pair to allow the user to ssh to localhost to control things. I'm guessing
this is because the book uses the .zip distribution which doesn't create
any users and therefore assumes running as an already existing locally
logged on user.

I notice however that the RPM creates 2 users: mapred and hdfs. As a result
I'm a bit unclear about the following:

1: Does it matter which user I log in as to perform various actions? e.g.
if I want to run start-dfs.sh should I be logged in as 'hdfs'? I did try
running start-dfs as root thinking it might drop down to hdfs using a
RUN_AS user (like most init.d scripts do) but it didn't work like that. Is
there any documentation covering which users should be used to do what when
running the RPM distribution?

2: Whilst the RPM creates the hdfs user and specifies /var/lib/hadoop/hdfs
as the homedir, it doesn't actually create this directory. This results in
an error when logging in as the user. Is this normal?

3: How should I set up ssh keys between the 2 users? Should each user's
public key be in the authorized_keys file of the other user (i.e. is
communication between the 2 processes bi-directional) or would something
simpler suffice?

Hope these questions are clear enough to advise on, please don't hesitate
to ask for more info if there's something I've left out.

Cheers,

Edd

-- 
Web: http://www.eddgrant.com
Email: e...@eddgrant.com
Mobile: +44 (0) 7861 394 543


Re: Parsing the JobTracker Job Logs

2013-04-08 Thread Christian Schneider
That's nice! Thank you very much.

Now I am trying to get Flume to work. It should collect all the files (also the
log files from the TaskTracker).

Best Regards,
Christian.


2013/3/28 Arun C Murthy 

> Use 'rumen', it's part of Hadoop.
>
> On Mar 19, 2013, at 3:56 AM, Christian Schneider wrote:
>
> Hi,
> how to parse the log files for our jobs? Are there already classes I can
> use?
>
> I need to display some information on a WebInterface (like the native
> JobTracker does).
>
>
> I am talking about this kind of files:
>
> michaela 11:52:59
> /var/log/hadoop-0.20-mapreduce/history/done/michaela.ixcloud.net_1363615430691_/2013/03/19/00
> # cat job_201303181503_0864_1363686587824_christian_wordCountJob_15
> Meta VERSION="1" .
> Job JOBID="job_201303181503_0864" JOBNAME="wordCountJob_15"
> USER="christian" SUBMIT_TIME="1363686587824" JOBCONF="
> hdfs://carolin\.ixcloud\.net:8020/user/christian/\.staging/job_201303181503_0864/job\.xml"
> VIEW_JOB="*" MODIFY_JOB="*" JOB_QUEUE="default" .
> Job JOBID="job_201303181503_0864" JOB_PRIORITY="NORMAL" .
> Job JOBID="job_201303181503_0864" LAUNCH_TIME="1363686587923"
> TOTAL_MAPS="1" TOTAL_REDUCES="1" JOB_STATUS="PREP" .
> Task TASKID="task_201303181503_0864_m_02" TASK_TYPE="SETUP"
> START_TIME="1363686587923" SPLITS="" .
> MapAttempt TASK_TYPE="SETUP" TASKID="task_201303181503_0864_m_02"
> TASK_ATTEMPT_ID="attempt_201303181503_0864_m_02_0"
> START_TIME="1363686594028"
> TRACKER_NAME="tracker_anna\.ixcloud\.net:localhost/127\.0\.0\.1:34657"
> HTTP_PORT="50060" .
> MapAttempt TASK_TYPE="SETUP" TASKID="task_201303181503_0864_m_02"
> TASK_ATTEMPT_ID="attempt_201303181503_0864_m_02_0"
> TASK_STATUS="SUCCESS" FINISH_TIME="1363686595929"
> HOSTNAME="/default/anna\.ixcloud\.net" STATE_STRING="setup"
> COUNTERS="{(org\.apache\.hadoop\.mapreduce\.FileSystemCounter)(File System
> Counters)[(FILE_BYTES_READ)(FILE: Number of bytes
> read)(0)][(FILE_BYTES_WRITTEN)(FILE: Number of bytes
> written)(152299)][(FILE_READ_OPS)(FILE: Number of read
> operations)(0)][(FILE_LARGE_READ_OPS)(FILE: Number of large read
> operations)(0)][(FILE_WRITE_OPS)(FILE: Number of write
> operations)(0)][(HDFS_BYTES_READ)(HDFS: Number of bytes
> read)(0)][(HDFS_BYTES_WRITTEN)(HDFS: Number of bytes
> written)(0)][(HDFS_READ_OPS)(HDFS: Number of read
> operations)(0)][(HDFS_LARGE_READ_OPS)(HDFS: Number of large read
> operations)(0)][(HDFS_WRITE_OPS)(HDFS: Number of write
> operations)(1)]}{(org\.apache\.hadoop\.mapreduce\.TaskCounter)(Map-Reduce
> Framework)[(SPILLED_RECORDS)(Spilled Records)(0)][(CPU_MILLISECONDS)(CPU
> time spent \\(ms\\))(80)][(PHYSICAL_MEMORY_BYTES)(Physical memory
> \\(bytes\\) snapshot)(91693056)][(VIRTUAL_MEMORY_BYTES)(Virtual memory
> \\(bytes\\) snapshot)(575086592)][(COMMITTED_HEAP_BYTES)(Total committed
> heap usage
> \\(bytes\\))(62324736)]}nullnullnullnullnullnullnullnullnullnullnullnullnull"
>
> ...
>
>
> Best Regards,
> Christian.
>
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>
>
>


RES: I want to call HDFS REST api to upload a file using httplib.

2013-04-08 Thread MARCOS MEDRADO RUBINELLI
On your first call, Hadoop will return a URL pointing to a datanode in the 
Location header of the 307 response. On your second call, you have to use that 
URL instead of constructing your own. You can see the specific documentation 
here:
http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#CREATE

Regards,
Marcos




I want to call HDFS REST api to upload a file using httplib.

My program created the file, but no content is in it.

=

Here is my code:

import httplib

conn=httplib.HTTPConnection("localhost:50070")
conn.request("PUT","/webhdfs/v1/levi/4?op=CREATE")
res=conn.getresponse()
print res.status,res.reason
conn.close()

conn=httplib.HTTPConnection("localhost:50075")
conn.connect()
conn.putrequest("PUT","/webhdfs/v1/levi/4?op=CREATE&user.name=levi")
conn.endheaders()
a_file=open("/home/levi/4","rb")
a_file.seek(0)
data=a_file.read()
conn.send(data)
res=conn.getresponse()
print res.status,res.reason
conn.close()

==

Here is the return:

307 TEMPORARY_REDIRECT 201 Created

=

OK, the file was created, but no content was sent.

When I comment out the conn.send(data) line, the result is the same: still no content.

Maybe the file read or the send is wrong, not sure.

Do you know how this happened?



[no subject]

2013-04-08 Thread Edd Grant
Hi all,

I'm new to Hadoop and am posting my first message on this list. I have
downloaded and installed the hadoop_1.1.1-1_x86_64.deb distro and have a
couple of issues which are blocking me from progressing.

I'm working through the 'Hadoop - The Definitive Guide' book and am trying
to set up a test VM in pseudodistributed mode using the RPM. The examples
in the book allude to (although I don't think they explicitly state) having
a single user for everything and creating a passwordless private/public key
pair to allow the user to ssh to localhost to control things. I'm guessing
this is because the book uses the .zip distribution which doesn't create
any users and therefore assumes running as an already existing locally
logged on user.

I notice however that the RPM creates 2 users: mapred and hdfs. As a result
I'm a bit unclear about the following:

1: Does it matter which user I log in as to perform various actions? e.g.
if I want to run start-dfs.sh should I be logged in as 'hdfs'? I did try
running start-dfs as root thinking it might drop down to hdfs using a
RUN_AS user (like most init.d scripts do) but it didn't work like that. Is
there any documentation covering which users should be used to do what when
running the RPM distribution?

2: Whilst the RPM creates the hdfs user and specifies /var/lib/hadoop/hdfs
as the homedir, it doesn't actually create this directory. This results in
an error when logging in as the user. Is this normal?

3: How should I set up ssh keys between the 2 users? Should each user's
public key be in the authorized_keys file of the other user (i.e. is
communication between the 2 processes bi-directional) or would something
simpler suffice?

Hope these questions are clear enough to advise on, please don't hesitate
to ask for more info if there's something I've left out.

Cheers,

Edd

-- 
Web: http://www.eddgrant.com
Email: e...@eddgrant.com
Mobile: +44 (0) 7861 394 543