Re: HDFS: Flash Application and Available APIs

2008-03-20 Thread Ted Dunning
Reading files from HDFS is very easy since there is a URL-based mechanism for that. On 3/20/08 5:21 PM, "Michael Bieniosek" <[EMAIL PROTECTED]> wrote: > If you want to talk to HDFS from Flash, your best bet is probably to set > up a Java server and talk to it over HTTP. There's a WebDAV se
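
A minimal sketch of the URL-based read Ted describes, assuming the FsUrlStreamHandlerFactory class that Hadoop ships for this purpose (host name and path below are placeholders):

    import java.io.InputStream;
    import java.net.URL;
    import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

    public class UrlCat {
      static {
        // Teach java.net.URL about the hdfs:// scheme.
        // A JVM permits only one stream handler factory, so do this once.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
      }

      public static void main(String[] args) throws Exception {
        InputStream in =
            new URL("hdfs://namenode:9000/user/me/part-00000").openStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) > 0; ) {
          System.out.write(buf, 0, n);   // copy the file to stdout
        }
        in.close();
      }
    }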

Re: HDFS: Flash Application and Available APIs

2008-03-20 Thread Michael Bieniosek
If you want to talk to HDFS from Flash, your best bet is probably to set up a Java server and talk to it over HTTP. There's a WebDAV server patch here: https://issues.apache.org/jira/browse/HADOOP-496 (I worked on this for a while, but never finished it). I think some other people have written
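
A minimal sketch of the "Java server in front of HDFS" approach as a read-only servlet (the class name and the "path" request parameter are invented for illustration; assumes fs.default.name in the server's config points at the cluster):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsProxyServlet extends HttpServlet {
      public void doGet(HttpServletRequest req, HttpServletResponse resp)
          throws IOException {
        Configuration conf = new Configuration();  // picks up hadoop-site.xml
        FileSystem fs = FileSystem.get(conf);
        InputStream in = fs.open(new Path(req.getParameter("path")));
        OutputStream out = resp.getOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) > 0; ) {
          out.write(buf, 0, n);                    // stream the file to the client
        }
        in.close();
      }
    }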

Re: Default Combiner or default combining behaviour?

2008-03-20 Thread Arun C Murthy
On Mar 20, 2008, at 3:56 PM, Otis Gospodnetic wrote: Hi, The MapReduce tutorial mentions Combiners only in passing. Is there a default Combiner or default combining behaviour? No, there is *no* default combiner at all. It has to be explicitly set in the JobConf to take effect. Arun
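
Setting one is a single call during job setup; the reducer class is commonly reused as the combiner when its logic is associative and commutative (MyJob/MyMapper/MyReducer are placeholder names):

    JobConf conf = new JobConf(MyJob.class);
    conf.setMapperClass(MyMapper.class);
    conf.setReducerClass(MyReducer.class);
    // Without this line, no combining happens at all:
    conf.setCombinerClass(MyReducer.class);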

Re: Limiting Total # of TaskTracker threads

2008-03-20 Thread Khalil Honsali
Hi, >The map/reduce tasks are not threads, they are run in separate JVMs which are forked by the tasktracker. I don't understand why. Is this a design choice to support task failures? I think that, on the other hand, running a thread queue (of tasks) per job per JVM would greatly improve performance, since f

RE: Partitioning reduce output by date

2008-03-20 Thread Runping Qi
If you want to output data to different files based on the date or any other part of the value, you may want to check https://issues.apache.org/jira/browse/HADOOP-2906 Runping > -Original Message- > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] > Sent: Thursday, March 20, 2008 4:00 PM > To: core-us
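
A sketch of how the code in that JIRA is meant to be used; it was later committed as org.apache.hadoop.mapred.lib.MultipleTextOutputFormat, so treat the class and method names as assumptions against your Hadoop version:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Routes each record to an output file named after its (date) key.
    public class DateOutputFormat extends MultipleTextOutputFormat<Text, Text> {
      @Override
      protected String generateFileNameForKeyValue(Text key, Text value,
                                                   String name) {
        return key.toString();   // e.g. all "2008-03-20" records share one file
      }
    }

    // In the job setup: conf.setOutputFormat(DateOutputFormat.class);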

Re: Partitioning reduce output by date

2008-03-20 Thread Otis Gospodnetic
Thank you, Doug and Ted, this pointed me in the right direction, which led to a custom OutputFormat and a RecordWriter that opens and closes the DataOutputStream based on the current key (if the current key differs from the previous key, close the previous output and open a new one, then write). As for pa
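
A compressed sketch of the approach Otis describes, written against the 0.16-era OutputFormat interface (class names are invented; error handling omitted):

    import java.io.DataOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.Progressable;

    public class KeySwitchingOutputFormat implements OutputFormat<Text, Text> {
      public RecordWriter<Text, Text> getRecordWriter(final FileSystem fs,
          final JobConf job, final String name, Progressable progress) {
        return new RecordWriter<Text, Text>() {
          private String current;           // key of the file currently open
          private DataOutputStream out;
          public void write(Text key, Text value) throws IOException {
            if (!key.toString().equals(current)) {
              if (out != null) out.close(); // key changed: roll to a new file
              current = key.toString();
              out = fs.create(new Path(job.getOutputPath(),
                                       current + "-" + name));
            }
            out.writeBytes(value.toString() + "\n");
          }
          public void close(Reporter reporter) throws IOException {
            if (out != null) out.close();
          }
        };
      }
      public void checkOutputSpecs(FileSystem fs, JobConf job) { }
    }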

Default Combiner or default combining behaviour?

2008-03-20 Thread Otis Gospodnetic
Hi, The MapReduce tutorial mentions Combiners only in passing. Is there a default Combiner or default combining behaviour? Concretely, I want to make sure that records are not getting combined behind the scenes in some way without my seeing it, causing me to lose data. For instance, if t

Re: Limiting Total # of TaskTracker threads

2008-03-20 Thread Jimmy Wan
On Tue, 18 Mar 2008 19:53:04 -0500, Ted Dunning <[EMAIL PROTECTED]> wrote: I think the original request was to limit the sum of maps and reduces rather than limiting the two parameters independently. Ted, yes, this is exactly what I'm looking for. I just found an issue that seems to state th

Re: using a set of MapFiles - getting the right partition

2008-03-20 Thread Doug Cutting
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/MapFileOutputFormat.html#getEntry(org.apache.hadoop.io.MapFile.Reader[],%20org.apache.hadoop.mapred.Partitioner,%20K,%20V) MapFileOutputFormat#getEntry() does this. Use MapFileOutputFormat#getReaders() to create the readers
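
Put together, the lookup side looks roughly like this fragment (fs and conf are assumed to be in scope, "out" stands in for the job's output directory, and the Partitioner must be the same one the job used; HashPartitioner is the default):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.MapFileOutputFormat;
    import org.apache.hadoop.mapred.Partitioner;
    import org.apache.hadoop.mapred.lib.HashPartitioner;

    MapFile.Reader[] readers =
        MapFileOutputFormat.getReaders(fs, new Path("out"), conf);
    Partitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();
    Text key = new Text("some-key");
    Text val = new Text();
    // getEntry() sends the key through the partitioner to pick the one
    // reader whose partition can contain it, then does the MapFile lookup.
    Writable entry = MapFileOutputFormat.getEntry(readers, partitioner, key, val);
    if (entry == null) {
      // key is not present in the output
    }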

Re: Hadoop streaming cacheArchive

2008-03-20 Thread Norbert Burger
Amareshwari, thanks for your help. This turned out to be user error (when packaging my JAR, I inadvertently included a lib directory, so the libraries actually existed in HDFS as ./lib/lib/perl..., when I was only expecting ./lib/perl...). Thanks again, Norbert On Thu, Mar 20, 2008 at 3:03 AM, Ama

using a set of MapFiles - getting the right partition

2008-03-20 Thread Chris Dyer
Hi all-- I would like to have a reducer generate a MapFile so that in later processes I can look up the values associated with a few keys without processing an entire sequence file. However, if I have N reducers, I will generate N different map files, so to pick the right map file I will need to u

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Chris K Wensel
You can't do this with the contrib/ec2 scripts/AMI, but passing the master's private DNS name to the slaves on boot as 'user-data' works fine. When a slave starts, it contacts the master and joins the cluster. There isn't any need for a slave to rsync from the master, thus removing the depend

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Prasan Ary
Chris, What do you mean when you say boot the slaves with "the master private name"? === Chris K Wensel <[EMAIL PROTECTED]> wrote: I found it much better to start the master first, then boot the slaves with the master's private name. I do not use the start|stop-a

RE: HDFS: Flash Application and Available APIs

2008-03-20 Thread dhruba Borthakur
There is a C-language API to access HDFS. You can find more details at: http://wiki.apache.org/hadoop/LibHDFS If you download the Hadoop source code from http://hadoop.apache.org/core/releases.html, you will see this API in src/c++/libhdfs/hdfs.c Hope this helps, dhruba -Original Mess

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Andreas Kostyrka
Actually, I personally use the following two-part copy technique to copy files to a cluster of boxes:

    tar cf - myfile | dsh -f host-list-file -i -c -M tar xCfv /tmp -

The first tar packages myfile into a tar stream; dsh runs a tar that unpacks it (in the above case all boxes listed in host-li

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Tom White
Yes, this isn't ideal for larger clusters. There's a JIRA issue to address this: https://issues.apache.org/jira/browse/HADOOP-2410. Tom On 20/03/2008, Prasan Ary <[EMAIL PROTECTED]> wrote: > Hi All, > I have been trying to configure Hadoop on EC2 for a large cluster > (100-plus instances). It seems

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Chris K Wensel
I found it much better to start the master first, then boot the slaves with the master's private name. I do not use the start|stop-all scripts, so I do not need to maintain the slaves file, and thus I don't need to push private keys around to support those scripts. This lets me start 20 nodes, t

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Andrey Pankov
Hi, Did you see the hadoop-0.16.0/src/contrib/ec2/bin/start-hadoop script? It already contains such a part:

    echo "Copying private key to slaves"
    for slave in `cat slaves`; do
      scp $SSH_OPTS $PRIVATE_KEY_PATH "[EMAIL PROTECTED]:/root/.ssh/id_rsa"
      ssh $SSH_OPTS "[EMAIL PROTECTED]" "chmod 600 /root/

Hadoop on EC2 for large cluster

2008-03-20 Thread Prasan Ary
Hi All, I have been trying to configure Hadoop on EC2 for a large cluster (100-plus instances). It seems that I have to copy the EC2 private key to all the machines in the cluster so that they can have SSH connections. For now it seems I have to run a script to copy the key file to each of the E

Re: Input file globbing

2008-03-20 Thread Hairong Kuang
Yes, this is a bug. It only occurs when a job's input path contains closures. JobConf.getInputPaths interprets mr/input/glob/2008/02/{02,08} as two input paths: mr/input/glob/2008/02/{02 and 08}. Let's see how to fix it. Hairong On 3/20/08 9:43 AM, "Tom White" <[EMAIL PROTECTED]> wrote:

Input file globbing

2008-03-20 Thread Tom White
I'm trying to use file globbing to select various input paths, like so:

    conf.setInputPath(new Path("mr/input/glob/2008/02/{02,08}"));

But this gives an exception:

    Exception in thread "main" java.io.IOException: Illegal file pattern: Expecting set closure character or end of range, or } for glob
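
Until the closure bug is fixed, one workaround is simply to add the paths individually (a sketch, not an official fix):

    // Equivalent to the {02,08} glob, spelled out explicitly.
    conf.setInputPath(new Path("mr/input/glob/2008/02/02"));
    conf.addInputPath(new Path("mr/input/glob/2008/02/08"));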

Re: MapFile and MapFileOutputFormat

2008-03-20 Thread Doug Cutting
Rong-en Fan wrote: I have two questions regarding MapFile in Hadoop/HDFS. First, when using MapFileOutputFormat as the reducer's output, is there any way to change the index interval (i.e., able to call setIndexInterval() on the output MapFile)? Not at present. It would probably be good to cha

RE: Trash option in hadoop-site.xml configuration.

2008-03-20 Thread dhruba Borthakur
Actually, the fs.trash.interval number has no significance on the client beyond being zero or non-zero: if it is non-zero, the client does a rename (into Trash) instead of a delete. The value specified in fs.trash.interval is used only by the namenode to periodically remove files from Trash; the periodicity is the value specified by
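
For reference, the setting lives in hadoop-site.xml; a typical entry looks like this (60 is an arbitrary example value; the unit is minutes):

    <property>
      <name>fs.trash.interval</name>
      <!-- minutes between trash checkpoints; 0 disables the trash feature -->
      <value>60</value>
    </property>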

MapFile and MapFileOutputFormat

2008-03-20 Thread Rong-en Fan
Hi, I have two questions regarding MapFile in Hadoop/HDFS. First, when using MapFileOutputFormat as the reducer's output, is there any way to change the index interval (i.e., able to call setIndexInterval() on the output MapFile)? Second, is it possible to tell what is the position in the data file fo

Re: Trash option in hadoop-site.xml configuration.

2008-03-20 Thread Taeho Kang
Thank you for the clarification. Here is another question: if two different clients ordered a "move to trash" with different intervals (e.g. client #1 with fs.trash.interval = 60; client #2 with fs.trash.interval = 120), what would happen? Does the namenode keep track of all this info? /Taeho On

Re: Hadoop streaming cacheArchive

2008-03-20 Thread Amareshwari Sriramadasu
Norbert Burger wrote: I'm trying to use the cacheArchive command-line option with the hadoop-0.15.3-streaming.jar. I'm using the option as follows: -cacheArchive hdfs://host:50001/user/root/lib.jar#lib Unfortunately, my Perl scripts fail with an error consistent with not being able to find th
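
For context, a complete streaming invocation of the kind under discussion might look like this (host, port, paths, and script names are placeholders):

    hadoop jar hadoop-0.15.3-streaming.jar \
        -input /user/root/input \
        -output /user/root/output \
        -mapper mapper.pl \
        -reducer reducer.pl \
        -cacheArchive 'hdfs://host:50001/user/root/lib.jar#lib'
    # lib.jar is unpacked in each task's working directory under the
    # symlink "lib", so the scripts can load modules from ./lib/perl/...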