How 'commodity' is 'commodity'

2009-09-29 Thread James Carroll
I work in a call center, which means we have a lot of PCs sitting on agents' desks doing a whole lot of nothing in the middle of the night. It also means that we collect a lot of phone and other data, etc., that all gets rolled out into reports and/or tables that drive reports or other processes. We're p

Re: How 'commodity' is 'commodity'

2009-09-29 Thread 司宪策
This is really an interesting setting :) I want to know the answer too! Xiance On Tue, Sep 29, 2009 at 11:45 AM, James Carroll wrote: > I work in a call center which means we have a lot of PCs sitting on > agents' desks doing a whole lot nothing in the middle of the night. It > also means that w

Re: How 'commodity' is 'commodity'

2009-09-29 Thread Taeho Kang
If your "commodity" pc's don't have a whole lot of storage space, then you would have to run your HDFS datanodes elsewhere. In that case, a lot of data traffic will occur (e.g. sending data from datanodes to where data processing occurs), meaning map reduce performance will be slowed down. It's alw

Re: Native libraries and HDFS

2009-09-29 Thread Stas Oskin
Hi. Thanks for the answers. Just to clarify: 1) There is no impact whatsoever on the NameNode / SecondaryNameNode, or the DataNodes themselves. 2) Only the client applications using Hadoop / HDFS can benefit from these libraries, hence it makes sense to have them installed only on the same nodes as the c

Re: lost+found files prevent DataNode formatting

2009-09-29 Thread Stas Oskin
Hi. Question - will the DataNode ever try to format the directory again after the initial format? Common sense says no, so if I erased them once and they ever come back, it should not impact the DataNode in any way? Thanks again. 2009/9/29 Anthony Urso > Those are created by fsck and will come back.

Distributed cache - are files unique per job?

2009-09-29 Thread Erik Forsberg
Hi! If I distribute files using the Distributed Cache (-archives option), are they guaranteed to be unique per job, or is there a risk that if I distribute a file named A with job 1, job 2 which also distributes a file named A will read job 1's file? I think they are unique per job, just want to

Re: How 'commodity' is 'commodity'

2009-09-29 Thread Steve Loughran
"commodity" really means x86 parts, non-RAID storage, no infiniband-connected storage array, no esoteric OS -just Linux- and commodity gigabit ether, nothing fancy like 10GBE except on a heavy-utilised backbone :) With those kind of configurations, you reduce your capital costs, leaving you m

Re: How 'commodity' is 'commodity'

2009-09-29 Thread 司宪策
Virtualized nodes are a brilliant idea :) This greatly reduces the effort, especially when the PCs are not fully in your control. Xiance On Tue, Sep 29, 2009 at 6:01 PM, Steve Loughran wrote: > > "commodity" really means x86 parts, non-RAID storage, no > infiniband-connected storage array, no es

RE: Best Idea to deal with following situation

2009-09-29 Thread Amogh Vasekar
Along with the partitioner, try to plug in a combiner. It could provide significant performance gains. Not sure about the algorithm you use, but you might have to tweak it a little to facilitate a combiner. Thanks, Amogh -Original Message- From: Chandraprakash Bhagtani [mailto:cpbhagt...@gmail.com
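A minimal sketch of what Amogh suggests, using the old 0.20 JobConf API; MyJob, WordCountMapper, WordCountReducer, and MyPartitioner are hypothetical classes, and the combiner here simply reuses the reducer class (which only works when the reduce function is commutative and associative):

    JobConf conf = new JobConf(MyJob.class);          // org.apache.hadoop.mapred.JobConf
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(WordCountMapper.class);
    conf.setCombinerClass(WordCountReducer.class);    // runs map-side, cuts shuffle traffic
    conf.setReducerClass(WordCountReducer.class);
    conf.setPartitionerClass(MyPartitioner.class);    // custom key -> reducer routing
    JobClient.runJob(conf);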

Re: Final Reminder: NSF, Google, IBM CLuE PI Meeting: October 5, 2009

2009-09-29 Thread Steve Lihn
Can the group make these speeches available online (such as youtube) for the global community? Thx, steve On 9/28/09, Jimmy Lin wrote: > Hi everyone, > > Just a final reminder for this NSF/Google/IBM event next Monday (10/5). > We've put together an exciting program with talks by Luiz André >

Re: dfs create block sticking

2009-09-29 Thread Jason Venner
I had a problem like that with a custom record writer - SOLR-1301 On Mon, Sep 28, 2009 at 11:18 PM, Chandraprakash Bhagtani < cpbhagt...@gmail.com> wrote: > I faced the org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException > exception once. > What I was doing was overriding FileOutp

Re: Distributed cache - are files unique per job?

2009-09-29 Thread Jason Venner
When you use the command-line -archives option, a directory "archives" is created in HDFS under the per-job submission area, to store the archives. So there should be no collisions, as long as no other job tracker is using the same system directory path (conf.get("mapred.system.dir", "/tmp/hadoop/mapred
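For reference, a hedged sketch of how a task can locate its job's localized archives with the old org.apache.hadoop.filecache API (0.18-0.20); the log line is illustrative only:

    // inside a Mapper/Reducer's configure(JobConf job)
    // imports: org.apache.hadoop.fs.Path, org.apache.hadoop.filecache.DistributedCache
    Path[] archives = DistributedCache.getLocalCacheArchives(job);  // per-job local copies
    for (Path p : archives) {
      System.out.println("Localized archive for this job: " + p);
    }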

Re: Final Reminder: NSF, Google, IBM CLuE PI Meeting: October 5, 2009

2009-09-29 Thread Oliver Senn
+1 Steve Lihn wrote: Can the group make these speeches available online (such as youtube) for the global community? Thx, steve On 9/28/09, Jimmy Lin wrote: Hi everyone, Just a final reminder for this NSF/Google/IBM event next Monday (10/5). We've put together an exciting program with talk

Re: Final Reminder: NSF, Google, IBM CLuE PI Meeting: October 5, 2009

2009-09-29 Thread Jason Venner
You can also publish them on www.prohadoop.com, as well as announce your events ;) On Tue, Sep 29, 2009 at 7:58 AM, Oliver Senn wrote: > +1 > > > Steve Lihn wrote: > >> Can the group make these speeches available online (such as youtube) >> for the global community? >> >> Thx, steve >> >> On 9/2

Re: How 'commodity' is 'commodity'

2009-09-29 Thread Edward Capriolo
On Tue, Sep 29, 2009 at 6:09 AM, Xiance SI(司宪策) wrote: > Virtualized nodes is a brilliant idea :) This greatly reduced the efforts, > especially when the PCs are not fully in your control. > Xiance > > On Tue, Sep 29, 2009 at 6:01 PM, Steve Loughran wrote: > >> >> "commodity" really means x86 par

Cloudera version of HBase

2009-09-29 Thread Lajos
Hi y'all, Probably this should be posted on the hbase list, but hopefully someone will know offhand. We're running Cloudera's 0.20.0 hadoop installation, built from their sources. Can we run the stock 0.20.0 hbase dist with that or do we need a specific Cloudera version? If the latter, is ther

Re: How 'commodity' is 'commodity'

2009-09-29 Thread Steve Loughran
Edward Capriolo wrote: In Hadoop terms commodity means "Not super computer". If you look around, most large deployments have DataNodes with dual quad-core processors, 8+GB RAM, and numerous disks; that is hardly the PC you find under your desk. I have 4 cores and 6GB RAM, but only one HDD on the

Read only mode timeout

2009-09-29 Thread Stas Oskin
Hi. After the namenode comes online and finds all the blocks on all datanodes, there is a time-out of about 30 seconds before it accepts writes. Any idea: 1) Why is it so long? 2) How is it possible to make it smaller? Thanks in advance!

Re: Running Hadoop on cluster with NFS booted systems

2009-09-29 Thread Nick Rathke
Hi, Here is the dump. I looked it over and unfortunately it is pretty meaningless to me at this point. Any help deciphering it would be greatly appreciated. I have also now disabled the IB interface on my 2 test systems, unfortunately that had no impact. -Nick Todd Lipcon wrote: Hi Nick

Re: Final Reminder: NSF, Google, IBM CLuE PI Meeting: October 5, 2009

2009-09-29 Thread Jimmy Lin
Thanks for the feedback. I'll look into it... but at the very least slides will be posted online. -Jimmy Oliver Senn wrote: +1 Steve Lihn wrote: Can the group make these speeches available online (such as youtube) for the global community? Thx, steve On 9/28/09, Jimmy Lin wrote: Hi ever

Re: Running Hadoop on cluster with NFS booted systems

2009-09-29 Thread Brian Bockelman
Hey Nick, I believe the mailing list stripped out your attachment. Brian On Sep 29, 2009, at 10:22 AM, Nick Rathke wrote: Hi, Here is the dump. I looked it over and unfortunately it is pretty meaningless to me at this point. Any help deciphering it would be greatly appreciated. I have

Re: How 'commodity' is 'commodity'

2009-09-29 Thread Edward Capriolo
On Tue, Sep 29, 2009 at 11:15 AM, Steve Loughran wrote: > Edward Capriolo wrote: > >> In hadoop terms commodity means "Not super computer". If you look >> around most large deployments have DataNodes with dual quad core >> processors 8+GB ram and numerous disks, that is hardly the PC you find >> u

Re: lost+found files prevent DataNode formatting

2009-09-29 Thread Edward Capriolo
On Tue, Sep 29, 2009 at 5:17 AM, Stas Oskin wrote: > Hi. > > Question - will DataNode ever try to format again the directory after the > initial format? > > Common sense says no, so if I erased them once and they ever come back, it > should not impact DataNode in any way? > > Thanks again. > > 200

Re: Running Hadoop on cluster with NFS booted systems

2009-09-29 Thread Nick Rathke
Thanks. Here it is in all of its glory... -Nick 2009-09-29 09:15:53 Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode): "263851...@qtp0-1" prio=10 tid=0x2aaaf846a000 nid=0x226b in Object.wait() [0x41d24000] java.lang.Thread.State: TIMED_WAITING (on ob

Re: Distributed cache - are files unique per job?

2009-09-29 Thread Allen Wittenauer
On 9/29/09 2:55 AM, "Erik Forsberg" wrote: > If I distribute files using the Distributed Cache (-archives option), > are they guaranteed to be unique per job, or is there a risk that if I > distribute a file named A with job 1, job 2 which also distributes a > file named A will read job 1's fil

Re: Running Hadoop on cluster with NFS booted systems

2009-09-29 Thread Brian Bockelman
Hey Nick, Strange. It appears that the Jetty server has stalled while trying to read from /dev/random. Is it possible that some part of /dev isn't initialized before the datanode is launched? Can you confirm this using "lsof -p " ? I copy/paste a solution I found in a forum via google be
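The fix Brian refers to is most likely the standard one of pointing the JVM's SecureRandom at urandom so the seed read doesn't block; a hedged sketch (the exact wording may differ from what he pasted):

    # in conf/hadoop-env.sh on the affected nodes (assumption: daemons started via bin/hadoop)
    export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.egd=file:/dev/./urandom"
    # or, equivalently, in $JAVA_HOME/jre/lib/security/java.security:
    #   securerandom.source=file:/dev/./urandom
    # (the extra "/./" works around a JDK quirk that otherwise falls back to /dev/random)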

Re: How 'commodity' is 'commodity'

2009-09-29 Thread Brian Bockelman
Hey James, I think this would be a fun project, but be prepared to have the desktop portion not work out in the end. I would recommend focusing on prototyping your application in MapReduce, and consider the fact you might be able to reuse your desktops as sugar-coating (remember there ma

Which instance type on Amazon EC2?

2009-09-29 Thread Kevin Peterson
Has anyone done any extensive testing of what instance types on Amazon EC2 give you the most bang for the buck? Given the normal Hadoop recommendations of beefy machines, I would expect the best performance from the extra-large, but our testing showed otherwise. We did some rough testing while we

Re: Which instance type on Amazon EC2?

2009-09-29 Thread Brian Bockelman
Hey Kevin, From seeing presentations from the HEP field (totally unrelated to Hadoop), I've seen folks claim the large instance is more than 4x better than the small, and less than 2x slower than extra-large. I.e., it provided that application the best bang for its buck. In other words,

Re: Running Hadoop on cluster with NFS booted systems

2009-09-29 Thread Todd Lipcon
Yep, this is a common problem. The fix that Brian outlined helps a lot, but if you are *really* strapped for random bits, you'll still block. This is because even if you've set the random source, it still uses the real /dev/random to grab a seed for the prng, at least on my system. On systems wher

Re: Cloudera version of HBase

2009-09-29 Thread Amandeep Khurana
Lajos, Cloudera recently added HBase to their distribution. Read: http://www.cloudera.com/blog/2009/09/29/hbase-available-in-cdh2/ You can use the 0.20 release of HBase directly. You don't need Cloudera's version specifically. -ak Amandeep Khurana Computer Science Graduate Student University of Cal

Advice on new Datacenter Hadoop Cluster?

2009-09-29 Thread ylx_admin
Hey all, I'm pretty new to hadoop in general and I've been tasked with building out a datacenter cluster of hadoop servers to process logfiles. We currently use Amazon but our heavy usage is starting to justify running our own servers. I'm aiming for less than $1k per box, and of course trying t

Re: Read only mode timeout

2009-09-29 Thread Todd Lipcon
Hi Stas, This is the dfs.safemode.extension parameter. Default is 30 seconds, feel free to reconfigure down on small clusters if 30 seconds is upsetting to you :) -Todd On Tue, Sep 29, 2009 at 8:18 AM, Stas Oskin wrote: > Hi. > > After namenode comes online, and finds all the blocks on all dat
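For reference, a minimal sketch of the setting Todd mentions, assuming a 0.20-style hdfs-site.xml on the NameNode (the value is in milliseconds; 0 means leave safemode as soon as the replication threshold is reached):

    <property>
      <name>dfs.safemode.extension</name>
      <value>0</value>
    </property>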

Re: Advice on new Datacenter Hadoop Cluster?

2009-09-29 Thread Todd Lipcon
Hi Kevin, Less than $1k/box is unrealistic and won't be your best price/performance. Most people building new clusters at this point seem to be leaning towards dual quad core Nehalem with 4x1TB 7200RPM SATA and at least 8G RAM. You're better off starting with a small cluster of these nicer machi

Re: Does hadoop 0.18.3 support thrift ?

2009-09-29 Thread Todd Lipcon
Hi Jeff, The contrib/thriftfs module is probably what you're looking for. I don't know of a lot of people using it, but it does provide a proxy-style thrift access to HDFS. Be aware that performance will not be great since it introduces several more copies in the pipeline, and Thrift was never d

Re: Advice on new Datacenter Hadoop Cluster?

2009-09-29 Thread Amandeep Khurana
Also, if you plan to run HBase as well (now or in the future), you'll need more RAM. Take that into account too. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Sep 29, 2009 at 10:59 AM, Todd Lipcon wrote: > Hi Kevin, > > Less than $1k/box is un

Re: Running Hadoop on cluster with NFS booted systems

2009-09-29 Thread Nick Rathke
Great. I'll look at this fix. Here is what I got based on Brian's info; lsof -p gave me: java 12739 root 50r CHR 1,8 3335 /dev/random java 12739 root 51r CHR 1,9 3325 /dev/urandom . . . . java 12739 root 66r CHR

Re: Which instance type on Amazon EC2?

2009-09-29 Thread Paul Ingles
Hi, I don't have any real benchmarks or testing to speak of specifically for the performance benefits of a larger instance size. However, we have played around a little and for our work (a form of document clustering) the benefits of a larger instance were far outweighed by having more of

TaskTracker: Can not start task tracker because java.lang.RuntimeException: Not a host:port pair: local

2009-09-29 Thread David Been
Running start-all.sh, it logs the above message (only for the tasktracker). I figured out it was essentially the following command, then also tried to pick up a specific config. Full message at the bottom: hadoop --config ../conf tasktracker // start-all.sh uses this command What conf fil

Re: TaskTracker: Can not start task tracker because java.lang.RuntimeException: Not a host:port pair: local

2009-09-29 Thread Todd Lipcon
Hi David, The --config parameter to hadoop expects a configuration directory, not a file. Your file still needs to be named hadoop-site.xml (or in newer versions core-site, hdfs-site, or mapred-site.xml) Hope that helps -Todd On Tue, Sep 29, 2009 at 11:49 AM, David Been wrote: > Running start
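The "Not a host:port pair: local" error means the TaskTracker is still seeing the default mapred.job.tracker value of "local"; a hedged sketch of the entry the --config directory's mapred-site.xml needs (host and port are illustrative):

    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker.example.com:9001</value>
    </property>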

Re: Running Hadoop on cluster with NFS booted systems

2009-09-29 Thread Brian Bockelman
Hey Nick, Try this: cat /proc/sys/kernel/random/entropy_avail Is it a small number (<300)? Basically, one way Linux generates entropy is via input from the keyboard. So, as soon as you log into the NFS booted server, you've given it enough entropy for HDFS to start up. Here's a relevant-

Re: TaskTracker: Can not start task tracker because java.lang.RuntimeException: Not a host:port pair: local

2009-09-29 Thread David Been
Copied my property to mapred-site.xml and it works great!! Thanks, dave On Tue, Sep 29, 2009 at 11:52 AM, Todd Lipcon wrote: > Hi David, > > The --config parameter to hadoop expects a configuration directory, not a > file. Your file still needs to be named hadoop-site.xml (or in newer > versions c

Re: Which instance type on Amazon EC2?

2009-09-29 Thread Ted Dunning
In our experiments, the large instance turned out better, but that was largely due to our need for substantial memory. For many of our jobs, the difference between 4x as many small nodes and large nodes was not substantial. We had less than a 2x gain from extra-large nodes. For small-memory hadoop

Re: Read only mode timeout

2009-09-29 Thread Stas Oskin
Hi. Thanks! :) Is it a NameNode-only setting? Regards. 2009/9/29 Todd Lipcon > Hi Stas, > > This is the dfs.safemode.extension parameter. Default is 30 seconds, feel > free to reconfigure down on small clusters if 30 seconds is upsetting to > you > :) > > -Todd > > On Tue, Sep 29, 2009 at 8:18 A

Re: Running Hadoop on cluster with NFS booted systems

2009-09-29 Thread Nick Rathke
Hi Brian / Todd, -bash-3.2# cat /proc/sys/kernel/random/entropy_avail 128 So I did rngd -r /dev/urandom -o /dev/random -f -t 1 & and it **seems** to be working.. The web page shows the nodes as there and the logs seem to show that the clients have started correctly, but I have not yet tried

Re: Running Hadoop on cluster with NFS booted systems

2009-09-29 Thread Brian Bockelman
Sounds great Nick, Just goes to show that in any software product, for every new user there's approximately one "bug" :) Brian On Sep 29, 2009, at 4:45 PM, Nick Rathke wrote: Hi Brian / Todd, -bash-3.2# cat /proc/sys/kernel/random/entropy_avail 128 So I did rngd -r /dev/urandom -o /dev/

Re: Running Hadoop on cluster with NFS booted systems

2009-09-29 Thread Nick Rathke
Thanks again for the help. I now have a big sign outside my office that reads "Increase your entropy!" :-) .n Brian Bockelman wrote: Sounds great Nick, Just goes to show that in any software product, for every new user there's approximately one "bug" :) Brian On Sep 29, 2009, at 4:45 PM,

Re: Read only mode timeout

2009-09-29 Thread Todd Lipcon
On Tue, Sep 29, 2009 at 1:38 PM, Stas Oskin wrote: > Hi. > > Thanks! :) > > It's NameNode only setting? > > Yes -Todd 2009/9/29 Todd Lipcon > > > Hi Stas, > > > > This is the dfs.safemode.extension parameter. Default is 30 seconds, feel > > free to reconfigure down on small clusters if 30 sec

ask help for hsql conflict problem.

2009-09-29 Thread Jianwu Wang
Hi there, When I have Hadoop running (version 0.20.0, Pseudo-Distributed Mode), I cannot start my own Java application. The exception complains: 'java.sql.SQLException: failed to connect to url "jdbc:hsqldb:hsql://localhost/hsqldb"'. I have to stop Hadoop to start my own Java applicati

Re: ask help for hsql conflict problem.

2009-09-29 Thread Amandeep Khurana
What do you mean by your own Java application? What are you trying to run? Is it a MapReduce job? Secondly, Hadoop talks to a database only when you are trying to read/write data during a job... There is nothing else that it does. The connectors to interface with databases are DBInputFormat and
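For context, a hedged sketch of those connectors with the old mapred.lib.db API (0.19+); the HSQLDB URL, MyJob, and MyRecord (a DBWritable) are hypothetical:

    // imports: org.apache.hadoop.mapred.JobConf,
    //          org.apache.hadoop.mapred.lib.db.DBConfiguration, DBInputFormat
    JobConf job = new JobConf(MyJob.class);
    job.setInputFormat(DBInputFormat.class);
    DBConfiguration.configureDB(job, "org.hsqldb.jdbcDriver",
        "jdbc:hsqldb:hsql://localhost/mydb", "sa", "");
    DBInputFormat.setInput(job, MyRecord.class, "my_table",
        null /* conditions */, "id" /* orderBy */, "id", "value");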

Re: Does hadoop 0.18.3 support thrift ?

2009-09-29 Thread Jeff Zhang
But I cannot find thrift in Hadoop 0.18.3. Can I use another version of Hadoop's thrift for my cluster, which is built upon Hadoop 0.18.3? On Wed, Sep 30, 2009 at 1:56 AM, Todd Lipcon wrote: > Hi Jeff, > > The contrib/thriftfs module is probably what you're looking for. > > I don't know of a l

Re: Does hadoop 0.18.3 support thrift ?

2009-09-29 Thread Todd Lipcon
Ah, right, it was introduced in 0.19. It shouldn't be too tough to compile against 0.18 - the API changes were minimal and it's an external client so internal changes shouldn't have any effect. I'd recommend pulling the source from 0.20, moving it into the contrib directory, and hacking away. If

Re: Does hadoop 0.18.3 support thrift ?

2009-09-29 Thread Edward Capriolo
On Tue, Sep 29, 2009 at 11:20 PM, Todd Lipcon wrote: > Ah, right, it was introduced in 0.19. > > It shouldn't be too tough to compile against 0.18 - the API changes were > minimal and it's an external client so internal changes shouldn't have any > effect. > > I'd recommend pulling the source from

RE: Distributed cache - are files unique per job?

2009-09-29 Thread Amogh Vasekar
I believe the framework checks timestamps on HDFS to mark an already available copy of the file as valid or invalid, since the archived files are not cleaned up until a certain du limit is reached, and no APIs for cleanup are available. There was a thread on this some time back on the list. Amogh

Re: ask help for hsql conflict problem.

2009-09-29 Thread Jianwu Wang
Hi Amandeep, Thanks for your info. My own Java application has its own classes and runs separately. The only thing is that it also tries to create an hsql server to cache some data. That's why I think there may be conflicts between multiple hsql instances. What I want to know about hadoop hsq