I work in a call center, which means we have a lot of PCs sitting on
agents' desks doing a whole lot of nothing in the middle of the night. It
also means that we collect a lot of phone and other data that all
gets rolled up into reports and/or tables that drive reports or other
processes. We're p
This is really an interesting setting :) I want to know the answer too!
Xiance
On Tue, Sep 29, 2009 at 11:45 AM, James Carroll wrote:
> I work in a call center, which means we have a lot of PCs sitting on
> agents' desks doing a whole lot of nothing in the middle of the night. It
> also means that w
If your "commodity" PCs don't have a whole lot of storage space, then you
would have to run your HDFS DataNodes elsewhere. In that case, a lot of data
traffic will occur (e.g. sending data from DataNodes to where the data
processing occurs), meaning MapReduce performance will be slowed down. It's
alw
Hi.
Thanks for answers. Just to clarify:
1) There is no impact whatsoever on the NameNode / SecondaryNameNode, or the
DataNodes themselves.
2) Only the client applications using Hadoop / HDFS can benefit from
these libraries, hence it makes sense to have them installed only on the
same nodes as the c
Hi.
Question - will the DataNode ever try to format the directory again after
the initial format?
Common sense says no, so if I erased them once and they ever come back, it
should not impact the DataNode in any way?
Thanks again.
2009/9/29 Anthony Urso
> Those are created by fsck and will come back.
Hi!
If I distribute files using the Distributed Cache (-archives option),
are they guaranteed to be unique per job, or is there a risk that if I
distribute a file named A with job 1, job 2 which also distributes a
file named A will read job 1's file?
I think they are unique per job, just want to
"commodity" really means x86 parts, non-RAID storage, no
infiniband-connected storage array, no esoteric OS (just Linux), and
commodity gigabit Ethernet, nothing fancy like 10GbE except on a
heavily-utilised backbone :) With those kinds of configurations, you reduce
your capital costs, leaving you m
Virtualized nodes are a brilliant idea :) This greatly reduces the effort,
especially when the PCs are not fully under your control.
Xiance
On Tue, Sep 29, 2009 at 6:01 PM, Steve Loughran wrote:
>
> "commodity" really means x86 parts, non-RAID storage, no
> infiniband-connected storage array, no es
Along with the partitioner, try to plug in a combiner. It can provide
significant performance gains. Not sure about the algorithm you use, but you
might have to tweak it a little to accommodate a combiner.
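Something like this with the old mapred API (class names here are just
placeholders; reusing the reducer as a combiner is only safe when the reduce
is commutative and associative):

  import org.apache.hadoop.mapred.JobConf;

  JobConf conf = new JobConf(MyJob.class);
  conf.setMapperClass(MyMapper.class);
  conf.setCombinerClass(MyReducer.class); // runs map-side, cuts shuffle volume
  conf.setReducerClass(MyReducer.class);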
Thanks,
Amogh
-----Original Message-----
From: Chandraprakash Bhagtani [mailto:cpbhagt...@gmail.com
Can the group make these speeches available online (such as youtube)
for the global community?
Thx, steve
On 9/28/09, Jimmy Lin wrote:
> Hi everyone,
>
> Just a final reminder for this NSF/Google/IBM event next Monday (10/5).
> We've put together an exciting program with talks by Luiz André
>
I had a problem like that with a custom record writer - solr-1301
On Mon, Sep 28, 2009 at 11:18 PM, Chandraprakash Bhagtani <
cpbhagt...@gmail.com> wrote:
> I faced the org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException
> exception once.
> What I was doing was overriding FileOutp
When you use the command-line -archives option,
a directory "archives" is created in HDFS under the per-job submission
area, to store the archives.
So there should be no collisions, as long as no other JobTracker is using
the same system directory path (conf.get("mapred.system.dir",
"/tmp/hadoop/mapred/system")).
+1
Steve Lihn wrote:
Can the group make these speeches available online (such as youtube)
for the global community?
Thx, steve
On 9/28/09, Jimmy Lin wrote:
Hi everyone,
Just a final reminder for this NSF/Google/IBM event next Monday (10/5).
We've put together an exciting program with talk
You can also publish them on www.prohadoop.com, as well as announce your
events ;)
On Tue, Sep 29, 2009 at 7:58 AM, Oliver Senn wrote:
> +1
>
>
> Steve Lihn wrote:
>
>> Can the group make these speeches available online (such as youtube)
>> for the global community?
>>
>> Thx, steve
>>
>> On 9/2
On Tue, Sep 29, 2009 at 6:09 AM, Xiance SI(司宪策) wrote:
> Virtualized nodes are a brilliant idea :) This greatly reduces the effort,
> especially when the PCs are not fully under your control.
> Xiance
>
> On Tue, Sep 29, 2009 at 6:01 PM, Steve Loughran wrote:
>
>>
>> "commodity" really means x86 par
Hi y'all,
Probably this should be posted on the HBase list, but hopefully someone
will know offhand. We're running Cloudera's 0.20.0 Hadoop installation,
built from their sources. Can we run the stock 0.20.0 HBase dist with
that, or do we need a specific Cloudera version? If the latter, is ther
Edward Capriolo wrote:
In Hadoop terms, commodity means "not a supercomputer". If you look
around, most large deployments have DataNodes with dual quad-core
processors, 8+ GB RAM, and numerous disks; that is hardly the PC you find
under your desk.
I have 4 cores and 6GB RAM, but only one HDD on the
Hi.
After the NameNode comes online and finds all the blocks on all DataNodes, I
see a delay of about 30 seconds before it accepts writes.
Any idea:
1) Why is it so long?
2) How can I make it smaller?
Thanks in advance!
Hi,
Here is the dump. I looked it over and unfortunately it is pretty
meaningless to me at this point. Any help deciphering it would be
greatly appreciated.
I have also now disabled the IB interface on my 2 test systems,
unfortunately that had no impact.
-Nick
Todd Lipcon wrote:
Hi Nick
Thanks for the feedback. I'll look into it... but at the very least
slides will be posted online.
-Jimmy
Oliver Senn wrote:
+1
Steve Lihn wrote:
Can the group make these speeches available online (such as youtube)
for the global community?
Thx, steve
On 9/28/09, Jimmy Lin wrote:
Hi ever
Hey Nick,
I believe the mailing list stripped out your attachment.
Brian
On Sep 29, 2009, at 10:22 AM, Nick Rathke wrote:
Hi,
Here is the dump. I looked it over and unfortunately it is pretty
meaningless to me at this point. Any help deciphering it would be
greatly appreciated.
I have
On Tue, Sep 29, 2009 at 11:15 AM, Steve Loughran wrote:
> Edward Capriolo wrote:
>
>> In Hadoop terms, commodity means "not a supercomputer". If you look
>> around, most large deployments have DataNodes with dual quad-core
>> processors, 8+ GB RAM, and numerous disks; that is hardly the PC you find
>> u
On Tue, Sep 29, 2009 at 5:17 AM, Stas Oskin wrote:
> Hi.
>
> Question - will the DataNode ever try to format the directory again after
> the initial format?
>
> Common sense says no, so if I erased them once and they ever come back, it
> should not impact the DataNode in any way?
>
> Thanks again.
>
> 200
Thanks. Here it is in all of its glory...
-Nick
2009-09-29 09:15:53
Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode):
"263851...@qtp0-1" prio=10 tid=0x2aaaf846a000 nid=0x226b in
Object.wait() [0x41d24000]
java.lang.Thread.State: TIMED_WAITING (on ob
On 9/29/09 2:55 AM, "Erik Forsberg" wrote:
> If I distribute files using the Distributed Cache (-archives option),
> are they guaranteed to be unique per job, or is there a risk that if I
> distribute a file named A with job 1, job 2 which also distributes a
> file named A will read job 1's fil
Hey Nick,
Strange. It appears that the Jetty server has stalled while trying to
read from /dev/random. Is it possible that some part of /dev isn't
initialized before the datanode is launched?
Can you confirm this using "lsof -p <pid>"?
I copy/paste a solution I found in a forum via google be
Hey James,
I think this would be a fun project, but be prepared to have the
desktop portion not work out in the end. I would recommend focusing
on prototyping your application in MapReduce, and consider the fact
you might be able to reuse your desktops as sugar-coating (remember
there ma
Has anyone done any extensive testing of what instance types on Amazon EC2
give you the most bang for the buck?
Given the normal Hadoop recommendations of beefy machines, I would expect
the best performance from the extra-large, but our testing showed otherwise.
We did some rough testing while we
Hey Kevin,
From seeing presentations from the HEP field (totally unrelated to
Hadoop), I've seen folks claim the large instance is more than 4x
better than the small, and less than 2x slower than extra-large.
I.e., it provided that application the best bang for its buck.
In other words,
Yep, this is a common problem. The fix that Brian outlined helps a lot, but
if you are *really* strapped for random bits, you'll still block. This is
because even if you've set the random source, it still uses the real
/dev/random to grab a seed for the prng, at least on my system.
On systems wher
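The usual belt-and-braces fix (hadoop-env.sh is just one place to put it; any
JVM option mechanism works) is to force the seed source to urandom as well:

  # the extra /./ sidesteps the JVM's special-casing of file:/dev/urandom
  export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.egd=file:/dev/./urandom"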
Lajos,
Cloudera recently added HBase to their distribution. Read:
http://www.cloudera.com/blog/2009/09/29/hbase-available-in-cdh2/
You can use the 0.20 release of HBase directly. You don't need Cloudera's
version specifically.
-ak
Amandeep Khurana
Computer Science Graduate Student
University of Cal
Hey all,
I'm pretty new to Hadoop in general, and I've been tasked with building out a
datacenter cluster of Hadoop servers to process logfiles. We currently use
Amazon, but our heavy usage is starting to justify running our own servers.
I'm aiming for less than $1k per box, and of course trying t
Hi Stas,
This is the dfs.safemode.extension parameter. Default is 30 seconds, feel
free to reconfigure down on small clusters if 30 seconds is upsetting to you
:)
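For example, in the NameNode's hadoop-site.xml (the value is in milliseconds;
5000 here is just an illustration):

  <property>
    <name>dfs.safemode.extension</name>
    <value>5000</value>
  </property>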
-Todd
On Tue, Sep 29, 2009 at 8:18 AM, Stas Oskin wrote:
> Hi.
>
> After the NameNode comes online and finds all the blocks on all Dat
Hi Kevin,
Less than $1k/box is unrealistic and won't get you the best price/performance.
Most people building new clusters at this point seem to be leaning towards
dual quad-core Nehalem with 4x1TB 7200RPM SATA and at least 8GB RAM.
You're better off starting with a small cluster of these nicer machi
Hi Jeff,
The contrib/thriftfs module is probably what you're looking for.
I don't know of a lot of people using it, but it does provide a proxy-style
thrift access to HDFS.
Be aware that performance will not be great since it introduces several more
copies in the pipeline, and Thrift was never d
Also, if you plan to run HBase as well (now or in the future), you'll need
more RAM. Take that into account too.
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
On Tue, Sep 29, 2009 at 10:59 AM, Todd Lipcon wrote:
> Hi Kevin,
>
> Less than $1k/box is un
Great. I'll look at this fix. Here is what I got based on Brian's info.
lsof -p <pid> gave me:
java  12739 root  50r  CHR  1,8  3335  /dev/random
java  12739 root  51r  CHR  1,9  3325  /dev/urandom
.
.
.
.
java  12739 root  66r  CHR
Hi,
I don't have any real benchmarks or testing to speak of specifically
for the performance benefits of a larger instance size. However, we
have played around a little and for our work (a form of document
clustering) the benefits of a larger instance were far outweighed by
having more of
Running start-all.sh, it logs the above message (only for the tasktracker).
I figured out it was essentially the following command, and then I also
tried to get it to pick up a specific config.
Full message at the bottom.
hadoop --config ../conf tasktracker    // start-all.sh uses this command
What conf fil
Hi David,
The --config parameter to hadoop expects a configuration directory, not a
file. Your file still needs to be named hadoop-site.xml (or in newer
versions core-site, hdfs-site, or mapred-site.xml)
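For instance (paths made up):

  # --config takes the directory holding the *-site.xml files, not a file
  hadoop --config /etc/hadoop/conf tasktracker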
Hope that helps
-Todd
On Tue, Sep 29, 2009 at 11:49 AM, David Been wrote:
> Running start
Hey Nick,
Try this:
cat /proc/sys/kernel/random/entropy_avail
Is it a small number (<300)?
Basically, one way Linux generates entropy is via input from the
keyboard. So, as soon as you log into the NFS booted server, you've
given it enough entropy for HDFS to start up.
Here's a relevant-
Copied my property to mapred-site.xml and it works great!! Thanks
dave
On Tue, Sep 29, 2009 at 11:52 AM, Todd Lipcon wrote:
> Hi David,
>
> The --config parameter to hadoop expects a configuration directory, not a
> file. Your file still needs to be named hadoop-site.xml (or in newer
> versions c
In our experiments, the large instance turned out better, but that was
largely due to our need for substantial memory. For many of our jobs, the
difference between 4x as many small nodes and large nodes was not
substantial. We had less than a 2x gain from extra-large nodes.
For small-memory Hadoop
Hi.
Thanks! :)
Is it a NameNode-only setting?
Regards.
2009/9/29 Todd Lipcon
> Hi Stas,
>
> This is the dfs.safemode.extension parameter. Default is 30 seconds, feel
> free to reconfigure down on small clusters if 30 seconds is upsetting to
> you
> :)
>
> -Todd
>
> On Tue, Sep 29, 2009 at 8:18 A
Hi Brian / Todd,
-bash-3.2# cat /proc/sys/kernel/random/entropy_avail
128
So I did
rngd -r /dev/urandom -o /dev/random -f -t 1 &
and it **seems** to be working... The web page shows the nodes as there,
and the logs seem to show that the clients have started correctly, but I
have not yet tried
Sounds great Nick,
Just goes to show that in any software product, for every new user
there's approximately one "bug" :)
Brian
On Sep 29, 2009, at 4:45 PM, Nick Rathke wrote:
Hi Brian / Todd,
-bash-3.2# cat /proc/sys/kernel/random/entropy_avail
128
So I did
rngd -r /dev/urandom -o /dev/
Thanks again for the help. I now have a big sign outside my office that
reads "Increase your entropy!" :-)
.n
Brian Bockelman wrote:
Sounds great Nick,
Just goes to show that in any software product, for every new user
there's approximately one "bug" :)
Brian
On Sep 29, 2009, at 4:45 PM,
On Tue, Sep 29, 2009 at 1:38 PM, Stas Oskin wrote:
> Hi.
>
> Thanks! :)
>
> Is it a NameNode-only setting?
>
>
Yes
-Todd
2009/9/29 Todd Lipcon
>
> > Hi Stas,
> >
> > This is the dfs.safemode.extension parameter. Default is 30 seconds, feel
> > free to reconfigure down on small clusters if 30 sec
Hi there,
When I have Hadoop running (version 0.20.0, pseudo-distributed
mode), I cannot start my own Java application. The exception complains
that 'java.sql.SQLException: failed to connect to url
"jdbc:hsqldb:hsql://localhost/hsqldb"'. I have to stop Hadoop to start my
own Java applicati
What do you mean by your own Java application? What are you trying to run?
Is it a MapReduce job?
Secondly, Hadoop talks to a database only when you are trying to read/write
data during a job... There is nothing else that it does.
The connectors to interface with databases are DBInputFormat and
DBOutputFormat.
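A rough sketch with the old mapred API (0.19+; the driver class, table, and
field names are made up, and MyRecord would implement Writable and DBWritable):

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.db.DBConfiguration;
  import org.apache.hadoop.mapred.lib.db.DBInputFormat;

  JobConf job = new JobConf(MyDbJob.class);
  DBConfiguration.configureDB(job,
      "org.hsqldb.jdbcDriver", "jdbc:hsqldb:hsql://localhost/hsqldb");
  DBInputFormat.setInput(job, MyRecord.class,
      "calls",   // table
      null,      // conditions (WHERE clause), none here
      "call_id", // orderBy
      "call_id", "agent", "duration"); // fields to read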
But I cannot find thriftfs in Hadoop 0.18.3.
Can I use another version of Hadoop's thriftfs for my cluster, which is
built on Hadoop 0.18.3?
On Wed, Sep 30, 2009 at 1:56 AM, Todd Lipcon wrote:
> Hi Jeff,
>
> The contrib/thriftfs module is probably what you're looking for.
>
> I don't know of a l
Ah, right, it was introduced in 0.19.
It shouldn't be too tough to compile against 0.18 - the API changes were
minimal and it's an external client so internal changes shouldn't have any
effect.
I'd recommend pulling the source from 0.20, moving it into the contrib
directory, and hacking away.
If
On Tue, Sep 29, 2009 at 11:20 PM, Todd Lipcon wrote:
> Ah, right, it was introduced in 0.19.
>
> It shouldn't be too tough to compile against 0.18 - the API changes were
> minimal and it's an external client so internal changes shouldn't have any
> effect.
>
> I'd recommend pulling the source from
I believe the framework checks timestamps on HDFS to mark an already-available
copy of the file as valid or invalid, since the archived files are not cleaned
up until a certain disk-usage limit is reached, and no APIs for cleanup are
available. There was a thread on this some time back on the list.
Amogh
Hi Amandeep,
Thanks for your info. My own Java application has its own classes
and runs separately. The only thing is that it also tries to create an HSQL
server to cache some data. That's why I think there may be conflicts between
multiple HSQL instances.
What I want to know about Hadoop hsq