Re: All keys went to single reducer in WordCount program

2009-05-07 Thread Foss User
On Thu, May 7, 2009 at 9:45 PM, jason hadoop wrote: > If you have it available still, via the job tracker web interface, attach > the per-job XML configuration Job Configuration: JobId - job_200905071619_0003 name / value fs.s3n.impl org.apache.hadoop.fs.s3native.NativeS3FileSystem mapred.t

Re: How to write a map() method that needs no input?

2009-05-07 Thread Jothi Padmanabhan
You could write your own InputFormat to fake the split. See src/test/org/apache/hadoop/mapred/EmptyInputFormat.java Jothi On 5/8/09 11:40 AM, "Foss User" wrote: > Sometimes I would like to just execute a certain method in all nodes. > The method does not need inputs. So, there is no need of any
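
EmptyInputFormat emits no records, so map() would never run; to invoke map() exactly once per task, the InputFormat has to fake both the splits and a single dummy record. A minimal sketch against the old org.apache.hadoop.mapred API (the class name NoInputFormat is made up for illustration):

    import java.io.DataInput;
    import java.io.DataOutput;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class NoInputFormat implements InputFormat<NullWritable, NullWritable> {
      // One empty split per requested map task.
      public InputSplit[] getSplits(JobConf job, int numSplits) {
        InputSplit[] splits = new InputSplit[numSplits];
        for (int i = 0; i < numSplits; i++) {
          splits[i] = new EmptySplit();
        }
        return splits;
      }
      // A reader that yields a single dummy record, so map() runs once per task.
      public RecordReader<NullWritable, NullWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) {
        return new RecordReader<NullWritable, NullWritable>() {
          private boolean done = false;
          public boolean next(NullWritable key, NullWritable value) {
            if (done) return false;
            done = true;
            return true;
          }
          public NullWritable createKey() { return NullWritable.get(); }
          public NullWritable createValue() { return NullWritable.get(); }
          public long getPos() { return 0; }
          public float getProgress() { return done ? 1.0f : 0.0f; }
          public void close() { }
        };
      }
      public static class EmptySplit implements InputSplit {
        public long getLength() { return 0; }
        public String[] getLocations() { return new String[0]; }
        public void write(DataOutput out) { }
        public void readFields(DataInput in) { }
      }
    }

Wire it up with conf.setInputFormat(NoInputFormat.class) and conf.setNumMapTasks(n); note that setNumMapTasks is only a hint, so the number of tasks actually launched may differ.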

Re: How to write a map() method that needs no input?

2009-05-07 Thread Todd Lipcon
On Thu, May 7, 2009 at 11:10 PM, Foss User wrote: > Sometimes I would like to just execute a certain method in all nodes. > The method does not need inputs. So, there is no need of any > InputFormat implementation class. So, I would want to just write a > Mapper implementation class with a map()

How to write a map() method that needs no input?

2009-05-07 Thread Foss User
Sometimes I would like to just execute a certain method on all nodes. The method needs no input, so there is no need for an InputFormat implementation class; I want to write only a Mapper implementation class with a map() method. But the problem with the map() method is that it always

Re: Are the API changes in mapred->mapreduce in 0.20.0 usable?

2009-05-07 Thread Brian Ferris
Thanks so much. That did the trick. On May 7, 2009, at 10:34 PM, Jothi Padmanabhan wrote: examples/wordcount has been modified to use the new API. Also, there is a test case in the mapreduce directory that uses the new API. Jothi On 5/8/09 10:59 AM, "Brian Ferris" wrote: I was upgrade

Re: Are the API changes in mapred->mapreduce in 0.20.0 usable?

2009-05-07 Thread Jothi Padmanabhan
examples/wordcount has been modified to use the new API. Also, there is a test case in the mapreduce directory that uses the new API. Jothi On 5/8/09 10:59 AM, "Brian Ferris" wrote: > I was upgraded to 0.20.0 last week and I noticed most everything in > "org.apache.hadoop.mapred.*" has been d
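
A minimal new-API driver skeleton, roughly the shape of the 0.20 examples/wordcount (a sketch from memory, not the shipped example verbatim):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NewApiWordCount {
      public static class TokenMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        protected void map(Object key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (token.length() == 0) continue;
            word.set(token);
            context.write(word, ONE);  // emit (word, 1)
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws java.io.IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(NewApiWordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The driver submits through Job rather than JobClient; that is the piece that changes most when moving off the old API.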

Are the API changes in mapred->mapreduce in 0.20.0 usable?

2009-05-07 Thread Brian Ferris
I upgraded to 0.20.0 last week and noticed that almost everything in "org.apache.hadoop.mapred.*" has been deprecated. However, I've not had any luck getting the new Map-Reduce classes to work. Hadoop Streaming still seems to expect the old API, and it doesn't seem that JobClient ha

Re: How to add user and group to hadoop?

2009-05-07 Thread Wang Zhong
read this doc: http://hadoop.apache.org/core/docs/r0.20.0/hdfs_permissions_guide.html On Fri, May 8, 2009 at 12:56 PM, Starry SHI wrote: > Hi, everyone! I am new to hadoop and recently I have set up a small > hadoop cluster and have several users access to it. However, I notice > that no matter
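
In that era of Hadoop the HDFS client takes its identity from the local OS (roughly whoami and groups) and falls back to DrWho when it cannot determine one, which is what the question below describes. As a client-side override there is the hadoop.job.ugi property, the same attribute the Eclipse thread further down mentions; a hedged sketch with made-up user and group names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AsUser {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "alice" and "users" are placeholders: user name first, then groups.
        conf.set("hadoop.job.ugi", "alice,users");
        FileSystem fs = FileSystem.get(conf);
        fs.mkdirs(new Path("/user/alice"));  // created as alice, not DrWho
      }
    }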

How to add user and group to hadoop?

2009-05-07 Thread Starry SHI
Hi, everyone! I am new to Hadoop and recently set up a small Hadoop cluster with several users accessing it. However, I notice that no matter which user logs in to HDFS and performs operations, the files always belong to the user DrWho in group Supergroup. HDFS seems to provide no access t

Re: Using Hadoop API through python

2009-05-07 Thread Zak Stone
You should consider using Dumbo to run Python jobs with Hadoop Streaming: http://wiki.github.com/klbostee/dumbo Dumbo is already very useful, and it is improving all the time. Zak On Fri, May 8, 2009 at 12:07 AM, Aditya Desai wrote: > Hi All, > Is there any way that I can access the hadoop AP

Re: Using Hadoop API through python

2009-05-07 Thread Amit Saha
On Fri, May 8, 2009 at 9:37 AM, Aditya Desai wrote: > Hi All, > Is there any way that I can access the Hadoop API through Python? I am aware > that Hadoop streaming can be used to create a mapper and reducer in a > different language but have not come across any module that helps me apply > funct

Using Hadoop API through python

2009-05-07 Thread Aditya Desai
Hi All, Is there any way that I can access the Hadoop API through Python? I am aware that Hadoop streaming can be used to create a mapper and reducer in a different language, but I have not come across any module that helps me apply functions to manipulate data or control as is an option in Java. Fir

Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread Grace
Thanks all for your replies. I have run several times with different Java options for Map/Reduce tasks. However, there is not much difference. Following is an example of my test setting: Test A: -Xmx1024m -server -XXlazyUnlocking -XlargePages -XgcPrio:deterministic -XXallocPrefetch -XXallocRedoPr
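
For anyone reproducing such comparisons, the per-task JVM flags go into mapred.child.java.opts; a sketch reusing some of the flags quoted above (each map/reduce task JVM is launched with whatever this property holds):

    import org.apache.hadoop.mapred.JobConf;

    public class TaskJvmOpts {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Every map and reduce task JVM starts with these options.
        conf.set("mapred.child.java.opts",
                 "-Xmx1024m -server -XXlazyUnlocking -XlargePages");
      }
    }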

Re: Is HDFS protocol written from scratch?

2009-05-07 Thread Philip Zeyliger
On Thu, May 7, 2009 at 1:04 PM, Foss User wrote: > On Fri, May 8, 2009 at 1:20 AM, Raghu Angadi > wrote: > > > > > > Philip Zeyliger wrote: > >> > >> It's over TCP/IP, in a custom protocol. See DataXceiver.java. My sense > >> is > >> that it's a custom protocol because Hadoop's IPC mechanism i

Re: HDFS to S3 copy problems

2009-05-07 Thread Andrew Hitchcock
Hi Ken, S3N doesn't work that well with large files. When uploading a file to S3, S3N saves it to local disk during write() and then uploads to S3 during the close(). Close can take a long time for large files, and it doesn't report progress, so the call can time out. As a workaround, I'd recomme

RE: On usig Eclipse IDE

2009-05-07 Thread georgep
Hi Aseem, Thank you, but after "fs.trash.interval" I got something else. Maybe my version is not correct. What is your Eclipse Europa version? George Puri, Aseem wrote: > > George, > In my Eclipse Europa it is showing the attribute > "hadoop.job.ugi". It is after the "fs.trash.in

HDFS to S3 copy problems

2009-05-07 Thread Ken Krugler
Hi all, I have a few large files (4 that are 1.8GB+) I'm trying to copy from HDFS to S3. My micro EC2 cluster is running Hadoop 0.19.1, and has one master/two slaves. I first tried using the hadoop fs -cp command, as in: hadoop fs -cp output// s3n: This seemed to be working, as I could
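
For reference, the programmatic equivalent of that shell copy opens one FileSystem per scheme; all hosts, buckets, and paths below are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class HdfsToS3Copy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
        FileSystem s3n = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
        // Same read-one-FS, write-the-other path that hadoop fs -cp takes,
        // so it hits the same S3N close()-time upload described above.
        FileUtil.copy(hdfs, new Path("/user/me/output/part-00000"),
                      s3n, new Path("/backup/part-00000"),
                      false /* do not delete source */, conf);
      }
    }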

Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread JQ Hadoop
There are a lot of tuning "knobs" for the JRockit JVM when it comes to performance; that tuning can make a huge difference. I'm very interested in whether there are tuning tips for Hadoop. Grace, what are the parameters that you used in your testing? Thanks, JQ On Thu, May 7, 2009 at 11:35 PM, Ste

Re: java.io.EOFException: while trying to read 65557 bytes

2009-05-07 Thread Raghu Angadi
Albert Sunwoo wrote: Thanks for the info! I was hoping to get some more specific information though. In short: we need more info. There are typically 4 machines/processes involved in a write: the client and 3 datanodes writing the replicas. To see what really happened, you need to pro

Re: NullPointerException while trying to copy file

2009-05-07 Thread Todd Lipcon
On Thu, May 7, 2009 at 1:47 PM, Foss User wrote: > > This does not work for me as you are reading the "a.txt" from the DFS > while I want to read the "a.txt" from the local file system. Also, I > do not want to copy the file to the distributed file system. Instead I > want to copy it to LocalFile
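
A minimal sketch of that local-to-local copy, reusing the a.txt name from the original post (FileSystem.getLocal() sidesteps the configured default file system entirely):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.LocalFileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Local file system on both sides; HDFS is never consulted.
        LocalFileSystem local = FileSystem.getLocal(conf);
        FileUtil.copy(local, new Path("a.txt"),
                      local, new Path("b.txt"),
                      false /* keep the source */, conf);
      }
    }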

Re: NullPointerException while trying to copy file

2009-05-07 Thread Foss User
On Fri, May 8, 2009 at 1:59 AM, Todd Lipcon wrote: > On Thu, May 7, 2009 at 1:26 PM, Foss User wrote: > >> I was trying to write a Java code to copy a file from local system to >> a file system (which is also local file system). This is my code. >> >> package in.fossist.examples; >> >> import jav

Re: Is it possible to sort intermediate values and final values?

2009-05-07 Thread Owen O'Malley
On May 7, 2009, at 12:38 PM, Foss User wrote: Where can I find this example? I was not able to find it in the src/examples directory. It is in 0.20. http://svn.apache.org/repos/asf/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/SecondarySort.java -- Owen
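
The core of that example, compressed into a sketch (a composite key modeled loosely on the IntPair in SecondarySort.java, simplified here): sorting compares the whole pair, while the grouping comparator compares only the first half, so one reduce() call sees all values for a first field, in second-field order.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class IntPair implements WritableComparable<IntPair> {
      private int first;
      private int second;
      public void set(int f, int s) { first = f; second = s; }
      public int getFirst() { return first; }
      public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
      }
      public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
      }
      // Full sort order: by first, then by second.
      public int compareTo(IntPair o) {
        if (first != o.first) return first < o.first ? -1 : 1;
        if (second != o.second) return second < o.second ? -1 : 1;
        return 0;
      }
      // Grouping: keys equal on `first` share a single reduce() call.
      public static class FirstGroupingComparator extends WritableComparator {
        public FirstGroupingComparator() { super(IntPair.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
          int l = ((IntPair) a).getFirst();
          int r = ((IntPair) b).getFirst();
          return l == r ? 0 : (l < r ? -1 : 1);
        }
      }
    }

Register it with job.setGroupingComparatorClass(IntPair.FirstGroupingComparator.class); a matching partitioner on the first field is also needed so equal first fields land on the same reducer.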

Re: NullPointerException while trying to copy file

2009-05-07 Thread Todd Lipcon
On Thu, May 7, 2009 at 1:26 PM, Foss User wrote: > I was trying to write a Java code to copy a file from local system to > a file system (which is also local file system). This is my code. > > package in.fossist.examples; > > import java.io.File; > import java.io.IOException; > import org.apache.

NullPointerException while trying to copy file

2009-05-07 Thread Foss User
I was trying to write Java code to copy a file from the local system to a file system (which is also the local file system). This is my code. package in.fossist.examples; import java.io.File; import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; impo

Re: Is HDFS protocol written from scratch?

2009-05-07 Thread Foss User
On Fri, May 8, 2009 at 1:20 AM, Raghu Angadi wrote: > > > Philip Zeyliger wrote: >> >> It's over TCP/IP, in a custom protocol.  See DataXceiver.java.  My sense >> is >> that it's a custom protocol because Hadoop's IPC mechanism isn't optimized >> for large messages. > > yes, and job classes are no

Re: Is HDFS protocol written from scratch?

2009-05-07 Thread Raghu Angadi
Philip Zeyliger wrote: It's over TCP/IP, in a custom protocol. See DataXceiver.java. My sense is that it's a custom protocol because Hadoop's IPC mechanism isn't optimized for large messages. yes, and job classes are not distributed using this. It is a very simple protocol used to read and

RE: .gz input files having less output than uncompressed version

2009-05-07 Thread Malcolm Matalka
This is the result of running gzip on the input files. There appears to be some support, for two reasons: 1) I do get some output in my results. There are 86851 lines in my output file, and they are valid results. 2) In the job task output I pasted, it states: org.apache.hadoop.io.compress.zli

Re: Is it possible to sort intermediate values and final values?

2009-05-07 Thread Foss User
On Thu, May 7, 2009 at 3:10 AM, Owen O'Malley wrote: > > On May 6, 2009, at 12:15 PM, Foss User wrote: > >> Is it possible to sort the intermediate values for each key before >> the pair reaches the reducer? > > Look at the example SecondarySort. Where can I find this example? I was not able to

Re: .gz input files having less output than uncompressed version

2009-05-07 Thread tim robertson
Hi, What input format are you using for the gzipped file? I don't believe there is a GZip input format, although some people have discussed whether it is feasible... Cheers Tim On Thu, May 7, 2009 at 9:05 PM, Malcolm Matalka wrote: > Problem: > > I am comparing two jobs.  They both have the sa
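
For what it's worth, the stock TextInputFormat does consume .gz files: it resolves a codec from the file-name suffix and treats gzipped files as non-splittable (one mapper per file). A sketch of that resolution path, with an invented input path:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class GzipProbe {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("input/data.gz");            // placeholder path
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(p);  // resolved by ".gz" suffix
        // The record reader wraps the raw stream like this before splitting lines.
        InputStream in = codec.createInputStream(fs.open(p));
        System.out.println("first decompressed byte: " + in.read());
        in.close();
      }
    }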

.gz input files having less output than uncompressed version

2009-05-07 Thread Malcolm Matalka
Problem: I am comparing two jobs. They both have the same input content; however, in one job the input file has been gzipped, and in the other it has not. I get far fewer output rows in the gzipped result than in the uncompressed version: Lines in output: Gzipped: 86851 Uncompressed: 65693

RE: PIG and Hive

2009-05-07 Thread Ashish Thusoo
OK, that explains a lot. When we started off Hive, our immediate use case was to do group-bys on data with a lot of skew on the grouping keys. In that scenario it is better to do this in 2 map/reduce jobs, using the first one to randomly distribute data and generating the partial sums follow

RE: PIG and Hive

2009-05-07 Thread Ashish Thusoo
Scott, Namit is actually correct. If you run an EXPLAIN on the query that he sent out, you actually get only 2 map/reduce jobs with Hive, not 5. We have verified that, and it is consistent with what we should expect in this case. We would be very interested to know the exact query that you us

Re: PIG and Hive

2009-05-07 Thread Scott Carey
The work was done 3 months ago, and the exact query I used may not have been the one below - it was functionally the same - two sources, arithmetic aggregation on each, inner-joined by a small set of values. We wrote a hand-coded map/reduce job, a Pig script, and Hive against the same data and performa

Re: Is HDFS protocol written from scratch?

2009-05-07 Thread Philip Zeyliger
It's over TCP/IP, in a custom protocol. See DataXceiver.java. My sense is that it's a custom protocol because Hadoop's IPC mechanism isn't optimized for large messages. -- Philip On Thu, May 7, 2009 at 9:11 AM, Foss User wrote: > I understand that the blocks are transferred between various no

RE: PIG and Hive

2009-05-07 Thread Namit Jain
SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y group by x, y. If you run an EXPLAIN on the above query, you will see that you are performing a Cartesian product followed by the filter. It would be better to rewrite the query as: SELECT count(a.z), count(b.z), a.x, a.

Re: All keys went to single reducer in WordCount program

2009-05-07 Thread jason hadoop
If you have it available still, via the job tracker web interface, attach the per-job XML configuration On Thu, May 7, 2009 at 8:39 AM, Foss User wrote: > On Thu, May 7, 2009 at 8:51 PM, jason hadoop > wrote: > > Most likely the 3rd mapper ran as a speculative execution, and it is > > possible

Is HDFS protocol written from scratch?

2009-05-07 Thread Foss User
I understand that the blocks are transferred between various nodes using the HDFS protocol. I believe even the job classes are distributed as files using the same HDFS protocol. Is this protocol written over TCP/IP from scratch, or is it a protocol that works on top of some other protocol like HTTP,

Is "/room1" in the rack name "/room1/rack1" significant during replication?

2009-05-07 Thread Foss User
I have written a rack awareness script which maps the IP addresses to rack names in this way: 10.31.1.* -> /room1/rack1, 10.31.2.* -> /room1/rack2, 10.31.3.* -> /room1/rack3, 10.31.100.* -> /room2/rack1, 10.31.200.* -> /room2/rack2, 10.31.200.* -> /room2/rack3. I understand that DFS will try to have re
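
The same table can also be supplied as a Java class instead of a script, by pointing topology.node.switch.mapping.impl at an implementation of DNSToSwitchMapping; a hedged sketch covering just the first three subnets above:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.net.DNSToSwitchMapping;

    public class RoomRackMapping implements DNSToSwitchMapping {
      // Hadoop may pass host names rather than IPs; a real implementation
      // would resolve names first. Only three subnets shown for brevity.
      public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>(names.size());
        for (String name : names) {
          if (name.startsWith("10.31.1.")) racks.add("/room1/rack1");
          else if (name.startsWith("10.31.2.")) racks.add("/room1/rack2");
          else if (name.startsWith("10.31.3.")) racks.add("/room1/rack3");
          else racks.add("/default-rack");
        }
        return racks;
      }
    }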

Re: All keys went to single reducer in WordCount program

2009-05-07 Thread Foss User
On Thu, May 7, 2009 at 8:51 PM, jason hadoop wrote: > Most likely the 3rd mapper ran as a speculative execution, and it is > possible that all of your keys hashed to a single partition. Also, if you > don't specify the default is to run a single reduce task. As I mentioned in my first mail, I tri

Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread Steve Loughran
Chris Collins wrote: a couple of years back we did a lot of experimentation between Sun's VM and JRockit. We had initially assumed that JRockit was going to scream since that's what the press were saying. In short, what we discovered was that certain JDK library usage was a little bit faster w

Re: Using multiple FileSystems in hadoop input

2009-05-07 Thread jason hadoop
I have used multiple file systems in jobs, but not used Har as one of them. Worked for me in 0.18. On Wed, May 6, 2009 at 4:07 AM, Tom White wrote: > Hi Ivan, > > I haven't tried this combination, but I think it should work. If it > doesn't, it should be treated as a bug. > > Tom > > On Wed, May 6,
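
For illustration, mixing file systems in one job's input is just a matter of the path URIs; both locations below are placeholders:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class MixedInputs {
      public static void main(String[] args) {
        JobConf conf = new JobConf(MixedInputs.class);
        // Each split is read through whatever FileSystem its path resolves to.
        FileInputFormat.addInputPath(conf,
            new Path("hdfs://nn-host:9000/data/current"));
        FileInputFormat.addInputPath(conf,
            new Path("har://hdfs-nn-host:9000/archives/old.har/data"));
      }
    }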

Re: Large number of map output keys and performance issues.

2009-05-07 Thread jason hadoop
It may simply be that your JVMs are spending their time doing garbage collection instead of running your tasks. My book, in chapter 6, has a section on how to tune your jobs and how to determine what to tune. That chapter is available now as an alpha. On Wed, May 6, 2009 at 1:29 PM, Todd Lipcon

Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread Chris Collins
a couple of years back we did a lot of experimentation between Sun's VM and JRockit. We had initially assumed that JRockit was going to scream since that's what the press were saying. In short, what we discovered was that certain JDK library usage was a little bit faster with JRockit, but

Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread jason hadoop
The way I typically address that is to write a zip file using the zip utilities, commonly for output. HDFS is not optimized for low latency, but for high throughput in bulk operations. 2009/5/7 Edward Capriolo > 2009/5/7 Jeff Hammerbacher : > > Hey, > > > > You can read more about why small fi
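
A sketch of that zip-packing idea, with invented paths and payloads: many logical small files become zip entries inside one HDFS file, which keeps the namenode's file count down.

    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PackSmallFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        ZipOutputStream zip =
            new ZipOutputStream(fs.create(new Path("/packed/batch-0001.zip")));
        for (int i = 0; i < 1000; i++) {
          zip.putNextEntry(new ZipEntry("record-" + i + ".txt"));
          zip.write(("payload for record " + i + "\n").getBytes("UTF-8"));
          zip.closeEntry();
        }
        zip.close();  // one HDFS file, one namenode entry, 1000 logical files
      }
    }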

Re: All keys went to single reducer in WordCount program

2009-05-07 Thread jason hadoop
Most likely the 3rd mapper ran as a speculative execution, and it is possible that all of your keys hashed to a single partition. Also, if you don't specify otherwise, the default is to run a single reduce task. From JobConf: /** * Get configured the number of reduce tasks for this job. Defaults to *
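
To make both points concrete: the default HashPartitioner sends a key to hash(key) mod numReduceTasks, and numReduceTasks defaults to 1, so without setNumReduceTasks(2) everything lands in one reducer. A small demo of the arithmetic (the modulo line mirrors what HashPartitioner.getPartition() computes):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;

    public class PartitionDemo {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setNumReduceTasks(2);  // the default would be a single reducer
        int reducers = conf.getNumReduceTasks();
        for (String word : new String[] {"hello", "world", "hadoop"}) {
          int partition = (new Text(word).hashCode() & Integer.MAX_VALUE) % reducers;
          System.out.println(word + " -> reducer " + partition);
        }
      }
    }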

Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread Edward Capriolo
2009/5/7 Jeff Hammerbacher : > Hey, > > You can read more about why small files are difficult for HDFS at > http://www.cloudera.com/blog/2009/02/02/the-small-files-problem. > > Regards, > Jeff > > 2009/5/7 Piotr Praczyk > >> If You want to use many small files, they are probably having the same >>

Re: PIG and Hive

2009-05-07 Thread Alan Gates
SQL has been on Pig's roadmap for some time; see http://wiki.apache.org/pig/ProposedRoadMap We would like to add SQL support to Pig sometime this year. We don't have an ETA or a JIRA for it yet. Alan. On May 6, 2009, at 11:20 PM, Amr Awadallah wrote: Yiping, (1) Any ETA for when that wi

Re: setGroupingComparatorClass() or setOutputValueGroupingComparator() does not work for Combiner

2009-05-07 Thread Jothi Padmanabhan
OutputValueGroupingComparator is used only at the reducer. AFAIK you cannot have a different comparator for combiners. Jothi On 5/7/09 3:32 PM, "zsongbo" wrote: > Hi all, > I have an application that wants the rules of sorting and grouping to use > different Comparators. > > I had tested 0.
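
For reference, here are the two hooks in question in the 0.20 API, with the 0.19 equivalents noted in comments; the comparator class names are placeholders, and (per the reply above) the grouping one takes effect only on the reduce side:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Job;

    public class GroupingDemo {
      // Placeholder comparators; real ones would impose different orders.
      public static class SortComparator extends WritableComparator {
        public SortComparator() { super(Text.class, true); }
      }
      public static class GroupComparator extends WritableComparator {
        public GroupComparator() { super(Text.class, true); }
      }
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "grouping demo");  // 0.20 API
        job.setSortComparatorClass(SortComparator.class);      // map-output sort
        job.setGroupingComparatorClass(GroupComparator.class); // reduce grouping only
        // 0.19 old-API equivalents on JobConf:
        //   setOutputKeyComparatorClass() / setOutputValueGroupingComparator()
      }
    }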

Re: move tasks to another machine on the fly

2009-05-07 Thread Sharad Agarwal
> Just one more question: does Hadoop handle reassigning failed tasks > to different machines in some way? Yes. If a task fails, it is retried, preferably on a different machine. > > > I saw that sometimes, usually at the end, when there are more > "processing units" available than map() tasks

Re: All keys went to single reducer in WordCount program

2009-05-07 Thread Miles Osborne
with such a small data set, who knows what will happen: you are probably hitting minimal limits of some kind. Repeat this with more data. Miles 2009/5/7 Foss User : > I have two reducers running on two different machines. I ran the > example word count program with some of my own System.out.printl

All keys went to single reducer in WordCount program

2009-05-07 Thread Foss User
I have two reducers running on two different machines. I ran the example word count program with some of my own System.out.println() statements to see what is going on. There were 2 slaves each running datanode as well as tasktracker. There was one namenode and one jobtracker. I know there is a ve

setGroupingComparatorClass() or setOutputValueGroupingComparator() does not work for Combiner

2009-05-07 Thread zsongbo
Hi all, I have an application that wants the rules of sorting and grouping to use different Comparators. I have tested 0.19.1 and 0.20.0 for this function, but it does not work for the Combiner in either. In 0.19.1, I use job.setOutputValueGroupingComparator(), and in 0.20.0, I use job.setGroupingComparatorClass(). This

Hadoop internal details

2009-05-07 Thread monty123
My query is how Hadoop manages map files, files, and other such things. What internal data structure does it use to manage them? Is it a graph or something? Please help. -- View this message in context: http://www.nabble.com/Hadoop-internal-details-tp23423618p23423618.html Sent from the Hadoo

Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread Grace
I am running the test on 0.18.1 and 0.19.1. Both versions have the same issue with the JRockit JVM. It is for the example sort job, to sort 20G of data on 1+2 nodes. Following is the result (version 0.18.1). The sort job running with the JRockit JVM took 260 secs more than with the Sun JVM. ---

RE: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread JQ Hadoop
I believe the JRockit JVM has a slightly higher startup time than the Sun JVM, but that should not make a lot of difference, especially if JVMs are reused in 0.19. Which Hadoop version are you using? What Hadoop job are you running? And what performance do you get? Thanks, JQ -Original Message---

Re: About Hadoop optimizations

2009-05-07 Thread Tom White
On Thu, May 7, 2009 at 6:05 AM, Foss User wrote: > Thanks for your response again. I could not understand a few things in > your reply. So, I want to clarify them. Please find my questions > inline. > > On Thu, May 7, 2009 at 2:28 AM, Todd Lipcon wrote: >> On Wed, May 6, 2009 at 1:46 PM, Foss Use

Re: Hadoop internal details

2009-05-07 Thread Nitay
This is better directed at the Hadoop mailing lists. I've added the hadoop core-user mailing list to your query. Cheers, -n On Thu, May 7, 2009 at 1:11 AM, monty123 wrote: > > My query is how hadoop manages map files, files etc. stuffs. What is the > internal data structure it uses to manage things

Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread Jeff Hammerbacher
Hey, You can read more about why small files are difficult for HDFS at http://www.cloudera.com/blog/2009/02/02/the-small-files-problem. Regards, Jeff 2009/5/7 Piotr Praczyk > If You want to use many small files, they are probably having the same > purpose and struc? > Why not use HBase instead

Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread Piotr Praczyk
If you want to use many small files, they probably have the same purpose and structure. Why not use HBase instead of raw HDFS? Many small files would be packed together and the problem would disappear. cheers Piotr 2009/5/7 Jonathan Cao > There are at least two design choices in Hadoop tha