Re: RPM spec file for 0.19.1

2009-04-03 Thread Steve Loughran
Ian Soboroff wrote: I created a JIRA (https://issues.apache.org/jira/browse/HADOOP-5615) with a spec file for building a 0.19.1 RPM. I like the idea of Cloudera's RPM file very much. In particular, it has nifty /etc/init.d scripts and RPM is nice for managing updates. However, it's for an

Re: Using HDFS to serve www requests

2009-04-03 Thread Steve Loughran
Snehal Nagmote wrote: can you please explain what exactly adding an NIO bridge means, how it could be done, and what the advantages would be in this case? NIO: Java non-blocking IO. It's a standard API to talk to different filesystems; support has been discussed in JIRA. If the DFS APIs were

Re: Amazon Elastic MapReduce

2009-04-03 Thread Steve Loughran
Brian Bockelman wrote: On Apr 2, 2009, at 3:13 AM, zhang jianfeng wrote: it seems like I would have to pay additional money, so why not configure a Hadoop cluster on EC2 myself? This has already been automated with scripts. Not everyone has a support team or an operations team or enough

Re: Amazon Elastic MapReduce

2009-04-03 Thread Tim Wintle
On Fri, 2009-04-03 at 11:19 +0100, Steve Loughran wrote: True, but this way nobody gets the opportunity to learn how to do it themselves, which can be a tactical error one comes to regret further down the line. By learning the pain of cluster management today, you get to keep it under

best practice: mapred.local vs dfs drives

2009-04-03 Thread Craig Macdonald
Hello all, Following recent hardware discussions, I thought I'd ask a related question. Our cluster nodes have 3 drives: 1x 160GB system/scratch and 2x 500GB DFS drives. The 160GB system drive is partitioned such that 100GB is for job mapred.local space. However, we find that for our
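
One common layout for this trade-off is to let mapred.local.dir span the same large drives as DFS rather than confining shuffle space to the system disk. A sketch of the relevant hadoop-site.xml entries (the property names are the 0.19-era keys; the paths are illustrative, not a recommendation for this particular cluster):

```xml
<configuration>
  <!-- Both properties take comma-separated lists; Hadoop round-robins
       writes across the listed directories. Paths are illustrative. -->
  <property>
    <name>mapred.local.dir</name>
    <value>/disk1/mapred/local,/disk2/mapred/local</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/dfs/data,/disk2/dfs/data</value>
  </property>
</configuration>
```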

Re: RPM spec file for 0.19.1

2009-04-03 Thread Ian Soboroff
If you guys want to spin RPMs for the community, that's great. My main motivation was that I wanted the current version rather than 0.18.3. There is of course (as Steve points out) a larger discussion about if you want RPMs, what should be in them. In particular, some might want to include the

Re: RPM spec file for 0.19.1

2009-04-03 Thread Ian Soboroff
Steve Loughran ste...@apache.org writes: I think from your perspective it makes sense, as it stops anyone getting itchy fingers and doing their own RPMs. Um, what's wrong with that? Ian

Re: RPM spec file for 0.19.1

2009-04-03 Thread Ian Soboroff
Steve Loughran ste...@apache.org writes: -RPM and deb packaging would be nice Indeed. The best thing would be to have the hadoop build system output them, for some sensible subset of systems. -the jdk requirements are too harsh as it should run on openjdk's JRE or jrockit; no need for sun

Re: Amazon Elastic MapReduce

2009-04-03 Thread Stuart Sierra
On Thu, Apr 2, 2009 at 4:13 AM, zhang jianfeng zjf...@gmail.com wrote: it seems like I would have to pay additional money, so why not configure a Hadoop cluster on EC2 myself? This has already been automated with scripts. Personally, I'm excited about this. They're charging a tiny fraction above

Re: Amazon Elastic MapReduce

2009-04-03 Thread Lukáš Vlček
I may be wrong but I would welcome this. As far as I understand the hot topic in cloud computing these days is standardization ... and I would be happy if Hadoop could be considered as a standard for cloud computing architecture. So the more Amazon pushes Hadoop the more it could be accepted by

RE: Amazon Elastic MapReduce

2009-04-03 Thread Ricky Ho
I disagree. This is like arguing that everyone should learn everything themselves, or else they won't know how to do anything. A better situation is having the algorithm designer focus on how to break down their algorithm into Map/Reduce form and test it out immediately, rather than requiring

RE: Hadoop/HDFS for scientific simulation output data analysis

2009-04-03 Thread Tu, Tiankai
Thanks for the comments, Matei. The machines I ran the experiments on have 16 GB of memory each. I don't see how a 64 MB buffer could be huge or bad for memory consumption. In fact, I set it to a much larger value after initial rounds of tests showed abysmal results using the default 64 KB buffer. Also,

How many people are using Hadoop Streaming?

2009-04-03 Thread Ricky Ho
Has anyone benchmarked the performance difference of using Hadoop: 1) Java vs C++, 2) Java vs Streaming? From looking at the Hadoop architecture, since the TaskTracker will fork a separate process anyway to run the user-supplied map() and reduce() functions, I don't see the performance overhead of
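
The streaming contract behind this question is simple enough to sketch: the forked process reads records on stdin and writes tab-separated key/value lines on stdout. A minimal word-count pair in Python (an illustration of the protocol only, not Hadoop's own code; the function names are mine):

```python
def mapper(lines):
    # Emit one "key\tvalue" record per word; Hadoop Streaming expects
    # tab-separated records, one per line, on stdout.
    # A real script would loop: for out in mapper(sys.stdin): print(out)
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Streaming hands the reducer sorted "key\tvalue" lines; unlike the
    # Java API, grouping runs of equal keys is left to the script.
    current, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"
```

The per-record process boundary (serialize, pipe, parse) is where a streaming job pays its overhead relative to in-process Java map/reduce calls.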

Re: How many people are using Hadoop Streaming?

2009-04-03 Thread Owen O'Malley
On Apr 3, 2009, at 9:42 AM, Ricky Ho wrote: Has anyone benchmarked the performance difference of using Hadoop: 1) Java vs C++, 2) Java vs Streaming? Yes, a while ago. When I tested it using sort, Java and C++ were roughly equal and streaming was 10-20% slower. Most of the cost with

Re: Hadoop/HDFS for scientific simulation output data analysis

2009-04-03 Thread Matei Zaharia
Hadoop does checksums for each small chunk of the file (512 bytes by default) and stores a list of checksums for each chunk with the file, rather than storing just one checksum at the end. This makes it easier to read only a part of a file and know that it's not corrupt without having to read and
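
The scheme Matei describes can be sketched in a few lines: a list of per-chunk checksums lets a reader verify just the chunks it touches. This is an illustration of the idea, not Hadoop's implementation (Hadoop keeps a CRC32 per io.bytes.per.checksum bytes, 512 by default; the function names here are mine):

```python
import zlib

CHUNK = 512  # mirrors Hadoop's default io.bytes.per.checksum

def chunk_checksums(data: bytes):
    # One CRC32 per 512-byte chunk, stored alongside the data.
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify_range(data: bytes, checksums, start_chunk, end_chunk):
    # Verify only the chunks covering a requested byte range:
    # no need to re-read or re-checksum the rest of the file.
    for n in range(start_chunk, end_chunk):
        if zlib.crc32(data[n * CHUNK:(n + 1) * CHUNK]) != checksums[n]:
            return False
    return True
```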

RE: How many people are using Hadoop Streaming?

2009-04-03 Thread Ricky Ho
Owen, thanks for your elaboration; the data point is very useful. On your point: in Java you get key1, (value1, value2, ...); key2, (value3, ...); in streaming you get key1 value1, key1 value2

Re: How many people are using Hadoop Streaming?

2009-04-03 Thread Owen O'Malley
On Apr 3, 2009, at 10:35 AM, Ricky Ho wrote: I assume that the keys are still sorted, right? That means I will get all the key1 valueX entries before getting any of the key2 valueY entries, and key2 is always bigger than key1. Yes. -- Owen
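
Because the keys arrive sorted, all lines for one key are adjacent, so a streaming reducer can rebuild the Java-style key -> (value1, value2, ...) grouping in a single pass. A sketch (my own helper, not part of Hadoop):

```python
from itertools import groupby

def grouped(sorted_lines):
    # Streaming reducers see sorted "key\tvalue" lines; since equal keys
    # are adjacent, groupby recovers the Java-style grouping lazily,
    # without buffering the whole input.
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, pairs in groupby(records, key=lambda kv: kv[0]):
        yield key, [value for _, value in pairs]
```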

RE: Hadoop/HDFS for scientific simulation output data analysis

2009-04-03 Thread Tu, Tiankai
Thanks for the update and suggestion, Matei. I can certainly construct an input text file containing all the files of a dataset (http://hadoop.apache.org/core/docs/r0.19.1/streaming.html#How+do+I+process+files%2C+one+per+map%3F), then let the jobtracker dispatch the file names to the maps, and

Re: Hadoop/HDFS for scientific simulation output data analysis

2009-04-03 Thread Owen O'Malley
On Apr 3, 2009, at 1:41 PM, Tu, Tiankai wrote: By the way, what is the largest size, in terms of total bytes, number of files, and number of nodes, in your applications? Thanks. The largest Hadoop application that has been documented is the Yahoo Webmap: 10,000 cores, 500 TB shuffle, 300

why SequenceFile cannot run without native GZipCodec?

2009-04-03 Thread Zheng Shao
I guess the performance will be bad, but we should still be able to read/write the file. Correct? Why do we throw an Exception? Zheng

Re: How many people are using Hadoop Streaming?

2009-04-03 Thread Tim Wintle
On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote: 1) I can pick a language that offers a different programming paradigm (e.g. I may choose a functional language, or logic programming, if it suits the problem better). In fact, I could even choose Erlang for the map() and Prolog for the