Does Hadoop compress files?
I'm starting to evaluate Hadoop. We are currently running Sensage and store a lot of log files in our environment. I've been looking at the Hadoop forums and googling (of course), but I haven't been able to learn whether Hadoop HDFS applies any compression to the files we store. On average we're storing about 600 GB a week in log files (more or less), and we generally need to retain about 1.5-2 years of logs. With Sensage compression we can store about 200+ TB of logs in our current environment. As I said, we're starting to evaluate whether Hadoop would be a good replacement for our Sensage environment (or at least augment it). Thanks a bunch!!
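(A rough back-of-the-envelope for scale, assuming HDFS's default 3x replication and no compression: 600 GB/week over ~104 weeks is about 62 TB of raw logs per two-year window, or roughly 187 TB of actual disk once replicated, so the compression question matters a great deal here.)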
Re: losing network interfaces during long running map-reduce jobs
Hi David,

On Fri, Apr 2, 2010 at 6:16 PM, David Howell <dehow...@gmail.com> wrote:
> I'm encountering a completely bizarre failure mode in my Hadoop cluster. A
> week ago, I switched from vanilla Apache Hadoop 0.20.1 to CDH 2. Ever since
> then, my tasktracker/datanode machines have been regularly losing their
> networking during long (> 1 hour) jobs. Restarting the network interface
> brings them back online immediately.

Could you clarify what you mean by losing their networking? Can you ping the node externally? If you access the node via the console (via ILOM, etc.) and run tcpdump or tshark, can you see ethernet broadcast traffic at all? Do you see anything in dmesg on the machine in question?

Thanks
-Todd

--
Todd Lipcon
Software Engineer, Cloudera
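For concreteness, Todd's checks from a console session might look like the following; the interface name eth0 is an assumption, so substitute your own:

    # watch for ethernet broadcast traffic on the suspect interface
    tcpdump -i eth0 -n broadcast

    # check the kernel ring buffer for NIC or driver errors
    dmesg | tail -n 50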
Trouble Submitting Job as another User
Hi,

I am trying to set up a Hadoop cluster so that any of our users can access HDFS and submit jobs, and I am having trouble with this. I added an HDFS path for mapred.system.dir in mapred-site.xml, as suggested in an FAQ. I start/stop the cluster with system user _hadoop. I would like to be able to access HDFS and submit jobs as user ryan (and other users on the system). When I attempt to copy a directory from the local FS to HDFS I get:

10/04/03 14:35:27 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 0 time(s).
. . .

How can I accomplish what I am trying to do?

Thanks in advance,
Ryan

--
RRR
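For illustration, the mapred-site.xml entry Ryan mentions would look something like this (the HDFS path shown is a placeholder, not his actual value):

    <property>
      <name>mapred.system.dir</name>
      <value>/hadoop/mapred/system</value>
    </property>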
Re: Trouble Submitting Job as another User
Did you disable the permissions for HDFS?

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

Abhishek

On Sat, Apr 3, 2010 at 5:36 PM, Ryan Rosario <uclamath...@gmail.com> wrote:
> Hi,
>
> I am trying to set up a Hadoop cluster so that any of our users can access
> HDFS and submit jobs, and I am having trouble with this. I added an HDFS
> path for mapred.system.dir in mapred-site.xml, as suggested in an FAQ. I
> start/stop the cluster with system user _hadoop. I would like to be able
> to access HDFS and submit jobs as user ryan (and other users on the
> system). When I attempt to copy a directory from the local FS to HDFS I
> get:
>
> 10/04/03 14:35:27 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 0 time(s).
> . . .
>
> How can I accomplish what I am trying to do?
>
> Thanks in advance,
> Ryan
> --
> RRR
Re: Trouble Submitting Job as another User
Yes. I just tried that, then stopped mapred and dfs and restarted them. No change.

10/04/03 14:55:03 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 0 time(s).
10/04/03 14:55:04 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 1 time(s).
10/04/03 14:55:05 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 2 time(s).
10/04/03 14:55:06 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 3 time(s).
10/04/03 14:55:07 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 4 time(s).
10/04/03 14:55:08 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 5 time(s).
10/04/03 14:55:09 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 6 time(s).
10/04/03 14:55:10 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 7 time(s).
10/04/03 14:55:11 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 8 time(s).
10/04/03 14:55:12 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 9 time(s).
Bad connection to FS. command aborted.

I am still able to connect to HDFS only as _hadoop.

R.

On Sat, Apr 3, 2010 at 2:49 PM, abhishek sharma <absha...@usc.edu> wrote:
> Did you disable the permissions for HDFS?
>
> <property>
>   <name>dfs.permissions</name>
>   <value>false</value>
> </property>
>
> Abhishek
>
> On Sat, Apr 3, 2010 at 5:36 PM, Ryan Rosario <uclamath...@gmail.com> wrote:
>> Hi,
>>
>> I am trying to set up a Hadoop cluster so that any of our users can
>> access HDFS and submit jobs, and I am having trouble with this. I added
>> an HDFS path for mapred.system.dir in mapred-site.xml, as suggested in an
>> FAQ. I start/stop the cluster with system user _hadoop. I would like to
>> be able to access HDFS and submit jobs as user ryan (and other users on
>> the system). When I attempt to copy a directory from the local FS to HDFS
>> I get:
>>
>> 10/04/03 14:35:27 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 0 time(s).
>> . . .
>>
>> How can I accomplish what I am trying to do?
>>
>> Thanks in advance,
>> Ryan
>> --
>> RRR

--
RRR
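One hedged observation: the retry log shows the client looking for a namenode at localhost:9000, so whichever Hadoop configuration user ryan's shell picks up needs to point fs.default.name at the node where the namenode actually runs. A minimal core-site.xml sketch (hostname and port are placeholders, not values from this thread):

    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode-host:9000</value>
    </property>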
Re: losing network interfaces during long running map-reduce jobs
> Could you clarify what you mean by losing their networking? Can you ping
> the node externally? If you access the node via the console (via ILOM,
> etc.) and run tcpdump or tshark, can you see ethernet broadcast traffic at
> all? Do you see anything in dmesg on the machine in question?
>
> Thanks
> -Todd

My cluster is small and the physical servers are managed by my company's IT department... I just admin the Hadoop install, and I don't have access except through ssh. When one of my nodes goes unresponsive, it doesn't respond to ping, ssh, or any traffic on any port. I've been limited so far to investigating logs after my sysadmin restarts the networking interface, but I haven't seen anything in the dmesg log. I'll have to try looking at the tcpdump output on Monday, once I can get console access again.

My apologies that I'm so sketchy on details right now... so far, I haven't been able to find any evidence of anything going wrong except for the Hadoop log entries when the IOExceptions start.

Thanks,
-David
Re: Does Hadoop compress files?
Hi,

Please check http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Data+Compression

Thanks and Regards,
Sonal
www.meghsoft.com

On Sat, Apr 3, 2010 at 11:15 PM, u235sentinel <u235senti...@gmail.com> wrote:
> I'm starting to evaluate Hadoop. We are currently running Sensage and
> store a lot of log files in our environment. I've been looking at the
> Hadoop forums and googling (of course), but I haven't been able to learn
> whether Hadoop HDFS applies any compression to the files we store. On
> average we're storing about 600 GB a week in log files (more or less), and
> we generally need to retain about 1.5-2 years of logs. With Sensage
> compression we can store about 200+ TB of logs in our current environment.
> As I said, we're starting to evaluate whether Hadoop would be a good
> replacement for our Sensage environment (or at least augment it). Thanks a
> bunch!!
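For reference, the compression settings that page describes can also be set cluster-wide in mapred-site.xml; a sketch using the 0.20-era property names (GzipCodec is just one codec choice):

    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.output.compress</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.GzipCodec</value>
    </property>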
Re: Does Hadoop compress files?
There is a facility in Hadoop to compress intermediate map output and job output. Is your question about reading compressed files themselves into Hadoop? If so, refer to SequenceFileInputFormat (http://developer.yahoo.com/hadoop/tutorial/module4.html): the SequenceFileInputFormat reads special binary files that are specific to Hadoop. These files include many features designed to allow data to be rapidly read into Hadoop mappers. Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text). Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.

On Sat, Apr 3, 2010 at 11:15 PM, u235sentinel <u235senti...@gmail.com> wrote:
> I'm starting to evaluate Hadoop. We are currently running Sensage and
> store a lot of log files in our environment. I've been looking at the
> Hadoop forums and googling (of course), but I haven't been able to learn
> whether Hadoop HDFS applies any compression to the files we store. On
> average we're storing about 600 GB a week in log files (more or less), and
> we generally need to retain about 1.5-2 years of logs. With Sensage
> compression we can store about 200+ TB of logs in our current environment.
> As I said, we're starting to evaluate whether Hadoop would be a good
> replacement for our Sensage environment (or at least augment it). Thanks a
> bunch!!

--
~Rajesh.B
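The per-job equivalent of those settings, as a minimal sketch against the old (0.20) mapred API; the class name CompressionExample is illustrative, and GzipCodec is one choice among the available codecs:

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class CompressionExample {
      public static void configure(JobConf job) {
        // Compress intermediate map output (spills and shuffle traffic).
        job.setCompressMapOutput(true);
        job.setMapOutputCompressorClass(GzipCodec.class);

        // Compress the job's final output files.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // For SequenceFile output, block compression usually gives the
        // best ratio, compressing runs of records rather than each record.
        SequenceFileOutputFormat.setOutputCompressionType(
            job, SequenceFile.CompressionType.BLOCK);
      }
    }

A job would call CompressionExample.configure(job) on its JobConf before submission.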
measuring the split reading time in Hadoop
Hi all,

I wanted to measure the time it takes to read the input split for a map task. For my cluster, I am interested in measuring the overhead of fetching the input to a map task over the network, as opposed to reading it from the local disk. Is there an easy way to instrument some function to log this information (say, in the TaskTracker logs)?

Thanks,
Abhishek
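There is no built-in timer for this, but one way is to wrap the RecordReader and accumulate the time spent in next(). A sketch against the old (0.20) mapred API; TimedTextInputFormat is a hypothetical name, and the total printed in close() lands in the per-task logs that the TaskTracker serves:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class TimedTextInputFormat extends TextInputFormat {
      @Override
      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        final RecordReader<LongWritable, Text> inner =
            super.getRecordReader(split, job, reporter);
        return new RecordReader<LongWritable, Text>() {
          private long readNanos = 0;  // time spent pulling records from the split

          public boolean next(LongWritable key, Text value) throws IOException {
            long t0 = System.nanoTime();
            boolean more = inner.next(key, value);
            readNanos += System.nanoTime() - t0;
            return more;
          }
          public LongWritable createKey() { return inner.createKey(); }
          public Text createValue() { return inner.createValue(); }
          public long getPos() throws IOException { return inner.getPos(); }
          public float getProgress() throws IOException { return inner.getProgress(); }
          public void close() throws IOException {
            inner.close();
            // Ends up in the task's stderr log, collected per attempt.
            System.err.println("split read time (ms): " + readNanos / 1000000L);
          }
        };
      }
    }

Set job.setInputFormat(TimedTextInputFormat.class) to use it; comparing the totals for data-local versus non-local task attempts would then isolate the network-fetch overhead.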