Re: Data for Testing in Hadoop

2011-01-03 Thread Dave Viner
How about http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1 ? Just the first one (WestburyLab USENET corpus) is 40GB. I suspect you can find different formats and data sizes there. Dave Viner On Mon, Jan 3, 2011 at 11:31 PM, Adarsh Sharma wrote: > Dear all, > > Designing the archit

Data for Testing in Hadoop

2011-01-03 Thread Adarsh Sharma
Dear all, Designing the architecture is very important for Hadoop in production clusters. We are researching running Hadoop on individual nodes and in a cloud environment (VMs). For this, I require some data for testing. Could anyone send me some links to data of different si

Re: How does HDFS handle a failed Datanode during write?

2011-01-03 Thread Dhruba Borthakur
Each packet has an offset in the file that it is supposed to be written to. So there is no harm in resending the same packet twice; the receiving datanode would always write this packet to the correct offset in the destination file. If B crashes during the write, the client does not know whether
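The offset argument above can be illustrated with a small sketch. This is not HDFS code: `write_packet` and the in-memory `bytearray` standing in for a block file are hypothetical names chosen for illustration. It only shows that a write addressed by an explicit offset can be replayed without changing the result.

```python
def write_packet(block: bytearray, offset: int, payload: bytes) -> None:
    """Write payload at a fixed offset, growing the buffer if needed.

    Hypothetical stand-in for a datanode writing one packet to a block file.
    """
    end = offset + len(payload)
    if end > len(block):
        # Pad with zero bytes so the target range exists before writing.
        block.extend(b"\x00" * (end - len(block)))
    block[offset:end] = payload


block = bytearray()
write_packet(block, 0, b"abcd")
write_packet(block, 4, b"efgh")
write_packet(block, 4, b"efgh")  # the same packet resent after a suspected failure
```

Because each packet carries its own offset, the resent packet overwrites the same byte range with the same bytes, so the receiver does not need to know whether it has seen the packet before.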

Re: Entropy Pool and HDFS FS Commands Hanging System

2011-01-03 Thread Konstantin Boudnik
Another way to fix it is to install rng-tools, which will allow you to increase the amount of entropy in your system. -- Take care, Konstantin (Cos) Boudnik On Mon, Jan 3, 2011 at 16:48, Jon Lederman wrote: > Thanks. Will try that. One final question, based on the jstack output I >

Re: monit? daemontools? jsvc? something else?

2011-01-03 Thread Allen Wittenauer
On Jan 3, 2011, at 2:22 AM, Otis Gospodnetic wrote: > I see over on http://search-hadoop.com/?q=monit+daemontools that people *do* > use > tools like monit and daemontools (and a few other ones) to revive their > Hadoop processes when they die. > I'm not a fan of doing this for H

Re: Question regarding a System good candidate for Hadoop?

2011-01-03 Thread Allen Wittenauer
On Jan 1, 2011, at 8:31 PM, Harsh J wrote: > Hi, > > Hadoop should be evaluated if your to-process dataset is large (Large > is relative to the size of the cluster you're going to use -- > basically using at least X amount of data such that all the processing > power of your cluster is utilized

Hadoop India Summit 2011 - Call for Papers now Open

2011-01-03 Thread Basant Verma
Hi Hadoop enthusiasts, Apache Hadoop has become the de-facto platform for developing large-scale data-intensive applications. It has been used actively in academia and Industry for research and data mining. Hadoop Summit provides an opportunity for understanding the latest trends and roadmap

Re: Entropy Pool and HDFS FS Commands Hanging System

2011-01-03 Thread Ted Dunning
On Mon, Jan 3, 2011 at 4:48 PM, Jon Lederman wrote: > Thanks. Will try that. One final question, based on the jstack output I > sent, is it obvious that the system is blocked due to the behavior of > /dev/random? I tried to send you a highlighted markup of your jstack output. The key thing

Re: Entropy Pool and HDFS FS Commands Hanging System

2011-01-03 Thread Jon Lederman
Thanks. Will try that. One final question: based on the jstack output I sent, is it obvious that the system is blocked due to the behavior of /dev/random? That is, can you point me to the part of the output I sent that explicitly or implicitly indicates the blocking? I am trying to understand whethe

Re: Entropy Pool and HDFS FS Commands Hanging System

2011-01-03 Thread Ted Dunning
Try running: dd if=/dev/random bs=1 count=100 of=/dev/null (this will likely hang for a long time). There is no way that I know of to change the behavior of /dev/random except by changing the file itself to point to a different minor device. That would be very bad form. One thing you may be able to do is
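As a non-blocking complement to the dd test above, here is a minimal sketch that reads the kernel's entropy estimate instead of draining the pool. It assumes a Linux system that exposes /proc/sys/kernel/random/entropy_avail; the function names and the threshold of 128 bits are illustrative assumptions, not values taken from this thread. On much newer kernels the reported figure may be far less informative.

```python
from pathlib import Path

# Linux-specific location of the kernel's entropy estimate, in bits.
ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def read_entropy_bits(path: Path = ENTROPY_AVAIL):
    """Return the entropy estimate as an int, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def looks_starved(bits, threshold: int = 128) -> bool:
    """Heuristic only: the threshold is an assumption, not a kernel constant."""
    return bits is not None and bits < threshold


if __name__ == "__main__":
    bits = read_entropy_bits()
    print("entropy_avail:", bits, "starved:", looks_starved(bits))
```

A persistently low reading on an idle machine with no network or input devices would be consistent with reads from /dev/random blocking, though it does not prove that a particular Java thread is waiting on it.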

Re: Entropy Pool and HDFS FS Commands Hanging System

2011-01-03 Thread Jon Lederman
Hi Ted, Could you give me a bit more information on how I can overcome this issue? I am running Hadoop on an embedded processor and networking is turned off to the embedded processor. Is there a quick way to check whether this is in fact blocking on my system? And, are there some variables or

Re: What is the runtime efficiency of secondary sorting?

2011-01-03 Thread Ted Dunning
On Mon, Jan 3, 2011 at 4:00 PM, W.P. McNeill wrote: > ... If I write a combiner like this, is there any advantage to also doing a > secondary sort? > The definitive answer is that it depends. > As for deserialization, the value in my actual application is a Java object > with a floating point

Re: What is the runtime efficiency of secondary sorting?

2011-01-03 Thread W.P. McNeill
The bit about finding the smallest element of a set was an artificially simple version of the problem I concocted for the purposes of explication, but maybe it actually made things less clear. My real problem involves sorting a list of objects with associated real number ranks and then doing some

Re: Entropy Pool and HDFS FS Commands Hanging System

2011-01-03 Thread Ted Dunning
Yes. It is stuck as suggested. See the bolded lines. You can help avoid this by dumping additional entropy into the machine via network traffic. According to the man page for /dev/random you can cheat by writing goo into /dev/urandom, but I have been unable to verify that by experiment. Is it

Re: What is the runtime efficiency of secondary sorting?

2011-01-03 Thread Ted Dunning
As a point of order, you would normally use a combiner with this problem and you wouldn't sort in either the combiner or the reducer. Instead, combiner and reducer would simply scan and keep the smallest item to emit at the end of the scan. As a point of information, most of the rank-based statis
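The scan-and-keep-the-smallest approach described above can be sketched outside Hadoop as plain Python. This is an illustration under stated assumptions, not MapReduce API code: `min_scan`, `combine`, `reduce_all` and the two hand-built input splits are hypothetical, and the sample values reuse sets A and B from the original question.

```python
from collections import defaultdict


def min_scan(values):
    """Single pass over values, keeping only the smallest one seen. No sorting."""
    smallest = None
    for v in values:
        if smallest is None or v < smallest:
            smallest = v
    return smallest


def combine(split):
    """Combiner stand-in: emit one (key, local minimum) pair per key in a split."""
    groups = defaultdict(list)
    for key, value in split:
        groups[key].append(value)
    return [(key, min_scan(vals)) for key, vals in groups.items()]


def reduce_all(combined_outputs):
    """Reducer stand-in: the same scan, applied to the combiners' outputs."""
    groups = defaultdict(list)
    for output in combined_outputs:
        for key, value in output:
            groups[key].append(value)
    return {key: min_scan(vals) for key, vals in groups.items()}


# Two simulated map-side splits of (set name, integer) records.
splits = [
    [("A", 2), ("A", 5), ("B", 6)],
    [("A", 7), ("B", 1), ("B", 9)],
]
result = reduce_all(combine(s) for s in splits)
```

Taking a minimum is associative and commutative, which is why the same function can serve as both combiner and reducer: each combiner forwards at most one value per key, and the reducer scans only those few survivors.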

Re: Entropy Pool and HDFS FS Commands Hanging System

2011-01-03 Thread Jon Lederman
Todd, I have attached the jstack output. Does it appear to be stuck in SecureRandom as you noted as a possibility? I am not sure whether this is indicated in the following output: sh-4.1# jps 4038 JobTracker 4160 Jps 3917 DataNode 4121 TaskTracker 3844 NameNode 3992 SecondaryNameNode sh-4.1

What is the runtime efficiency of secondary sorting?

2011-01-03 Thread W.P. McNeill
Say I have a set of unordered sets of integers: A: {2,5,7} B: {6,1,9} C: {3,8,2,1,6} I want to use map/reduce to emit the smallest integer in each set. If my input data looks like this: A2 A5 A7 B6 B1 ...etc... I could use an identity mapper and a reducer like the following

Re: Job configuration Versus Site Configuration!

2011-01-03 Thread Harsh J
On Tue, Jan 4, 2011 at 3:13 AM, Raj V wrote: > I am sure this question has been asked many times. I have tried searching the > mailing lists and also the faq. > > Is there a way to determine which parameter goes where for administrative > purposes? AFAIK, what you can follow is the naming convention o

Job configuration Versus Site Configuration!

2011-01-03 Thread Raj V
I am sure this question has been asked many times. I have tried searching the mailing lists and also the faq. Given the various configuration parameters, is there something that indicates whether they are site specific or job specific? For example I understand that hadoop.tmp.dir or mapred.job

[Call for Papers] ICSE Software Engineering for Cloud Computing (SECLOUD) Workshop

2011-01-03 Thread Mattmann, Chris A (388J)
(apologies for the cross posting) Please consider submitting a paper to the ICSE 2011 Software Engineering for Cloud Computing (SECLOUD) Workshop to be held Sunday, May 22, 2011, at the Hilton Hawaiian Village Resort in Waikiki, Honolulu, HI. This workshop focuses on identifying the grand chall

Re: How does HDFS handle a failed Datanode during write?

2011-01-03 Thread Thanh Do
Some details can be found here: appendDesign3.pdf Thanh On Mon, Jan 3, 2011 at 2:49 AM, Sean Bigdatafun wrote: > I'd like to understand how H

Re: where can I see those email answers?

2011-01-03 Thread maha
Never mind. I just saw the left tags on the side of the page in question found in "Search Hadoop" site. Thanks all, Maha On Jan 3, 2011, at 11:29 AM, maha wrote: > Hi, > > I remember discussing the following error one time, but when I searched for > it I can only see the question raise

where can I see those email answers?

2011-01-03 Thread maha
Hi, I remember discussing the following error one time, but when I searched for it I can only see the question raised without the responses. Where can I find all the emails within this discussion-group? $jps Error occurred during initialization of VM Could not reserve enough space for obje

Re: documentation of hadoop implementation

2011-01-03 Thread Da Zheng
If you want the slides, you may be able to see them here: http://www.slideshare.net/hadoopusergroup/ordered-record-collection?from=ss_embed. Hope it helps. Da On 01/03/2011 11:53 AM, bharath vissapragada wrote: Any idea ..how to download this?? .. It isn't buffering correctly :/ On Thu, Dec 30, 201

Re: documentation of hadoop implementation

2011-01-03 Thread bharath vissapragada
Any idea ..how to download this?? .. It isn't buffering correctly :/ On Thu, Dec 30, 2010 at 9:00 PM, Mark Kerzner wrote: > Thanks, Da, this makes you a better Googler, and an expert one. > > Cheers, > Mark > > On Thu, Dec 30, 2010 at 9:25 AM, Da Zheng wrote: > >> there is someone else like me w

HNY-2011

2011-01-03 Thread Adarsh Sharma
Dear all, A very, very Happy New Year 2011 to all. May God bless all of us in solving future problems. Thanks and Regards Adarsh Sharma

Hadoop India Summit 2011

2011-01-03 Thread Basant Verma
Hi Hadoop enthusiasts, Apache Hadoop has become the de-facto platform for developing large-scale data-intensive applications. It has been used actively in academia and Industry for research and data mining. Hadoop Summit provides an opportunity for understanding the latest trends and roadmap

monit? daemontools? jsvc? something else?

2011-01-03 Thread Otis Gospodnetic
Hello, I see over on http://search-hadoop.com/?q=monit+daemontools that people *do* use tools like monit and daemontools (and a few other ones) to revive their Hadoop processes when they die. Questions: 1. Is one of these tools better than others for Hadoop? 2. Is there a tool the communi

How does HDFS handle a failed Datanode during write?

2011-01-03 Thread Sean Bigdatafun
I'd like to understand how HDFS handles Datanode failure gracefully. Let's suppose a replication factor of 3 is used in HDFS for this discussion. After 'DataStreamer' receives a list of Datanodes A, B, C for a block, it starts pulling data packets off the 'data queue' and putting them onto 'ack que