HBase-HDFS CLOSE_WAIT connection problem

2010-10-26 Thread akhilesh.kumar
Hi, We are using hbase-0.20.6 with HDFS (single-node setup). While pushing data into HBase using the Java APIs, lots of TCP CLOSE_WAIT connections crop up. These connections persist for a long time, even for a day or two. The Linux setting for TCP connections is 72 sec., which are overri
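A frequent cause of sockets stuck in CLOSE_WAIT (an assumption on my part, not confirmed in the thread) is HDFS streams that are opened but never closed, so the client never finishes the TCP shutdown. A minimal Java sketch of the defensive pattern, with a hypothetical path:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CloseStreams {
        public static void write(byte[] data) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(new Path("/tmp/example")); // illustrative path
            try {
                out.write(data);
            } finally {
                out.close(); // without this the connection can linger in CLOSE_WAIT
            }
        }
    }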

Re: add a resource (which is in hdfs) to a configuration

2010-10-26 Thread Lance Norskog
Yes, I use this in a batch job driver. There is a common file with global configs, and then a per-job config. The driver command line is: driver -c common-site.xml batchjob.xml On Tue, Oct 26, 2010 at 11:40 AM, Marc Sturlese wrote: > > Thanks, it worked. In case it can help someone else: > >    
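For reference, a minimal sketch of what such a driver might do with the two files (the file names come from the command line above; the -c flag handling itself is the driver's own code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class ConfigLayering {
        public static Configuration load() {
            Configuration conf = new Configuration();
            conf.addResource(new Path("common-site.xml")); // shared defaults first
            conf.addResource(new Path("batchjob.xml"));    // per-job values override them
            return conf;
        }
    }

Resources added later win over earlier ones (unless a property is marked final), which is what makes the common-then-per-job layering work.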

Re: MultipleOutputFormat and MultipleOutputs - which will last?

2010-10-26 Thread Saptarshi Guha
I prefer the latter (MultipleOutputFormat) as I would not have had to change my code. Everything would have stayed in the outputformat. And I hardly need the extra features. Oh well, got to keep up with the times. Cheers Saptarshi On Tue, Oct 26, 2010 at 2:44 AM, Rekha Joshi wrote: > Hi Saptarshi,

Re: Small File Management

2010-10-26 Thread Todd Lipcon
It's worth checking out the "har" tool as well. I would say that HBase is a good fit for binaries so long as the binaries aren't huge. Anything under a few MB should be fine. -Todd On Tue, Oct 26, 2010 at 10:56 AM, Ananth Sarathy wrote: > Thanks, but that's more of a one-time use, not ongoing man
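For anyone searching the archive later: the har tool is driven from the command line, along the lines of hadoop archive -archiveName small-files.har -p /user/me input /user/me/archives (paths invented here), which packs a directory of small files into a single .har archive that MapReduce jobs can still read through the har:// filesystem.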

help with rewriting hadoop java code for new API: getPos() and getCounter()

2010-10-26 Thread Bibek Paudel
[Apologies for cross-posting] Hi all, I am rewriting hadoop java code for the new (0.20.2) API - the code was originally written for versions <= 0.19. 1. What is the equivalent of the getCounter() method? For example, the old code is as follows: //import org.apache.hadoop.mapred.RunningJob; R
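For the archive: in the new API, counters hang off the task context and the Job object rather than RunningJob, and the old RecordReader.getPos() has no direct equivalent (the new RecordReader only exposes a float via getProgress()). A hedged sketch of the counter idiom, with made-up names throughout:

    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Job;

    public class CounterSketch {
        enum MyCounters { RECORDS_SEEN } // hypothetical counter enum

        // Inside a Mapper/Reducer, increment through the task context:
        //   context.getCounter(MyCounters.RECORDS_SEEN).increment(1);

        // In the driver, replacing RunningJob.getCounters().getCounter(...):
        static long recordsSeen(Job completedJob) throws Exception {
            Counter c = completedJob.getCounters().findCounter(MyCounters.RECORDS_SEEN);
            return c.getValue();
        }
    }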

Re: how to find out: which files are related to the current hadoop task

2010-10-26 Thread Shi Yu
Maybe this message can solve your problem as well: @Shi Yu: Yes there are built in functions to get the input file Path in the Mapper (you can use these for counters by putting the file name in the counter name), however there are some issues if you use MultipleInputs to your job. Here's some sam
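The idiom being referred to looks roughly like this in the new API (a sketch; the FileSplit cast is exactly what breaks under MultipleInputs, which wraps the real split in its own type):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Record which input file this task is reading by using its name as a counter name.
            Path file = ((FileSplit) context.getInputSplit()).getPath();
            context.getCounter("Input Files", file.getName()).increment(1);
        }
    }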

how to find out: which files are related to the current hadoop task

2010-10-26 Thread Oleg Ruchovets
Hi, Running a hadoop job which manipulates ~4000 files (the files are gz), and suppose one of these gz files was corrupted. From the web console / log files I can see which task got the exception, but it is really hard to isolate which file was corrupted. Is there a way to know which files were produced by which hado

Re: add a resource (which is in hdfs) to a configuration

2010-10-26 Thread Marc Sturlese
Thanks, it worked. In case it can help someone else: try { Configuration c = new Configuration(); FileSystem fs = FileSystem.get(c); InputStream is = new FSDataInputStream(fs.open(new Path("hdfs://hadoop_cluster/user/me/conf/extra-props.xml"))); c.addResource(is); } catch (IOException e) { // handle the error }

Re: FUSE HDFS significantly slower

2010-10-26 Thread Allen Wittenauer
On Oct 26, 2010, at 11:25 AM, Hazem Mahmoud wrote: > That raises a question that I am currently looking into and would appreciate > any and all advice people have. > > We are replacing our current NetApp solution, which has served us well but we > have outgrown it. > > I am looking at either

Re: FUSE HDFS significantly slower

2010-10-26 Thread Hazem Mahmoud
That raises a question that I am currently looking into and would appreciate any and all advice people have. We are replacing our current NetApp solution, which has served us well but we have outgrown it. I am looking at either upgrading to a bigger and meaner NetApp or possibly going with Had

Re: Small File Management

2010-10-26 Thread Ananth Sarathy
Thanks, but that's more of a one-time use, not ongoing management. Ananth T Sarathy On Tue, Oct 26, 2010 at 12:31 PM, Mark Kerzner wrote: > http://stuartsierra.com/2008/04/24/a-million-little-files > > On Tue, Oct 26, 2010 at 11:28 AM, Ananth Sarathy < > ananth.t.sara...@gmail.com > > wrote: > >

Re: Small File Management

2010-10-26 Thread Ananth Sarathy
Yeah, I had looked into hbase, but they are pretty adamant about not using it for binaries. We use hbase for other stuff, so that would have been our preference. I know that BigTable serves some image tiles for google maps, but image tiles are a lot smaller in general. Ananth T Sarathy On Tue, O

Re: add a resource (which is in hdfs) to a configuration

2010-10-26 Thread Jamie Cockrill
Marc, addResource takes an InputStream, which you could get from a FileSystem instance; however, you'd have something of a chicken-and-egg situation in that you'd need a Configuration to get a FileSystem (via FileSystem.get()), but then you could always just add it on and hit 'reloadConfigurat

Re: Small File Management

2010-10-26 Thread Patrick Angeles
HBase might fit the bill. On Tue, Oct 26, 2010 at 12:28 PM, Ananth Sarathy wrote: > I was wondering if there were any projects out there doing a small file > management layer on top of Hadoop? I know that HDFS is primarily for > map/reduce but I think companies are going to start using hdfs clus

add a resource (which is in hdfs) to a configuration

2010-10-26 Thread Marc Sturlese
Is it possible to add a custom-site.xml resource (which is placed in hdfs) to a Configuration? Something like: Configuration c = new Configuration(); Path p = new Path("hdfs://hadoop_cluster/user/me/conf/extra-props.xml"); c.addResource(p); It doesn't seem to work for me. If I convert 'c' to St

Re: Small File Management

2010-10-26 Thread Allen Wittenauer
On Oct 26, 2010, at 9:28 AM, Ananth Sarathy wrote: > I was wondering if there were any projects out there doing a small file > management layer on top of Hadoop? I know that HDFS is primarily for > map/reduce but I think companies are going to start using hdfs clusters as > storage in the cloud,

Re: Small File Management

2010-10-26 Thread Mark Kerzner
http://stuartsierra.com/2008/04/24/a-million-little-files On Tue, Oct 26, 2010 at 11:28 AM, Ananth Sarathy wrote: > I was wondering if there were any projects out there doing a small file > management layer on top of Hadoop? I know that HDFS is primarily for > map/reduce but I think companies ar

Small File Management

2010-10-26 Thread Ananth Sarathy
I was wondering if there were any projects out there doing a small file management layer on top of Hadoop? I know that HDFS is primarily for map/reduce but I think companies are going to start using hdfs clusters as storage in the cloud, and I was wondering if any work had been done on this. Ananth

Re: CDH3 beta 3

2010-10-26 Thread Patrick Angeles
This is not CDH3-specific... it's related to the Kerberos security patch, so these upgrade issues will pop up in the Y! distribution, and eventually in 0.22 as well. These aren't bugs in the code per se; it's just that the upgrade process going from pre- to post-security is somewhat tricky, and c

GC overhead limit exceeded while running Terrior on Hadoop

2010-10-26 Thread siddharth raghuvanshi
Hi, While running Terrior on Hadoop, I am getting the following error again and again; can someone please point out where the problem is? attempt_201010252225_0001_m_09_2: WARN - Error running child attempt_201010252225_0001_m_09_2: java.lang.OutOfMemoryError: GC overhead limit exceeded att
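A common first step for "GC overhead limit exceeded" in child tasks (a guess at the remedy; the root cause isn't visible in the truncated log) is to raise the per-task JVM heap from its default, e.g. in the job driver:

    import org.apache.hadoop.conf.Configuration;

    public class HeapConfig {
        public static Configuration withBiggerHeap() {
            Configuration conf = new Configuration();
            // 0.20-era property controlling the task JVM options; the size is illustrative.
            conf.set("mapred.child.java.opts", "-Xmx1024m");
            return conf;
        }
    }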

Re: CDH3 beta 3

2010-10-26 Thread Raj V
With a 755 permission the jobtracker could not operate on the directory, and with 775 permission the datanode's log said "Expecting 755, found 775. Exiting". I will do a more careful attempt today. Raj From: Michael Segel To: common-user@hadoop.apache.or

Re: LZO Compression Libraries don't appear to work properly with MultipleOutputs

2010-10-26 Thread ed
Calling close() on the MultipleOutputs objects in the cleanup() method of the reducer fixed the lzo file problem. Thanks! ~Ed On Thu, Oct 21, 2010 at 9:12 PM, ed wrote: > Hi Todd, > > I don't have the code in front of me right now but I was looking over the API > docs and it looks like I forgot to
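For anyone hitting the same truncated-LZO symptom, the shape of the fix is simply closing the MultipleOutputs in cleanup(). A sketch with made-up key/value types, assuming the new-API MultipleOutputs:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class LzoSafeReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, Text>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                mos.write(key, v, "part"); // baseOutputPath is illustrative
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            mos.close(); // skipping this leaves compressed outputs truncated
        }
    }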

Re: FUSE HDFS significantly slower

2010-10-26 Thread Brian Bockelman
In general, unless you run newer kernels and versions of FUSE as that ticket suggests, it is significantly slower in raw throughput. However, we generally don't have a day go by at my site where we don't push FUSE over 30Gbps, as the bandwidth is spread throughout nodes. Additionally, as we ar

RE: Tasktracker volume failure...

2010-10-26 Thread Gokulakannan M
Yes.. This is my scenario.. I have one tasktracker... I configured 10 dirs (volumes) in mapred.local.dir.. if one of the volumes has errors, the tasktracker is not executing further tasks.. I remember a similar scenario is handled in the datanode: when one of the volumes fails, it will mark that volume as
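For context, mapred.local.dir is just a comma-separated list of directories, normally set in mapred-site.xml; a hypothetical programmatic sketch of the same setting:

    import org.apache.hadoop.conf.Configuration;

    public class LocalDirs {
        public static Configuration multiVolume() {
            Configuration conf = new Configuration();
            // Comma-separated volumes; one bad entry is what triggers the
            // failure loop described above.
            conf.set("mapred.local.dir",
                     "/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local");
            return conf;
        }
    }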

Re: Tasktracker volume failure...

2010-10-26 Thread Steve Loughran
On 26/10/10 04:10, Gokulakannan M wrote: Hi, I faced a problem: when a volume configured in *mapred.local.dir* fails, the tasktracker continuously tries to create the directory and fails. Eventually all the running jobs fail and new jobs cannot be executed. I think you can provid

Re: MultipleOutputFormat and MultipleOutputs - which will last?

2010-10-26 Thread Rekha Joshi
Hi Saptarshi, AFAIK, this is an intermediate stage where the old API is supported while the new API evolves. In 0.21, the old API's MultipleOutputFormat is not deprecated: http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/index.html. In future it might be. From a usage perspective, what Mult

Re: Nodes up, Master sees 0 RegionServers

2010-10-26 Thread Bradford Stephens
Ugh, wrong mailing list. Silly GMail. On Mon, Oct 25, 2010 at 11:45 PM, Bradford Stephens wrote: > Hey datamigos, > > I'm having trouble getting a finicky .20.6 cluster to behave. > > The Master, Zookeeper, and RegionServers all seem to be happy -- > except the Master doesn't see any RSs. Doing a