Does Hadoop compress files?

2010-04-03 Thread u235sentinel
I'm starting to evaluate Hadoop.  We are currently running Sensage and 
store a lot of log files in our current environment.  I've been looking 
at the Hadoop forums and googling (of course) but haven't learned whether 
HDFS applies any compression to the files we store.


On average we're storing about 600 GB a week in log files (more or 
less).  Generally we need to store about 1.5 to 2 years of logs.  With 
Sensage compression we can store 200+ TB of logs in our current 
environment.


As I said, we're starting to evaluate whether Hadoop would be a good 
replacement for our Sensage environment (or could at least augment it).


Thanks a bunch!!


Re: losing network interfaces during long running map-reduce jobs

2010-04-03 Thread Todd Lipcon
Hi David,

On Fri, Apr 2, 2010 at 6:16 PM, David Howell dehow...@gmail.com wrote:

 I'm encountering a completely bizarre failure mode in my Hadoop
 cluster. A week ago, I switched from vanilla Apache Hadoop 0.20.1 to
 CDH 2.

 Ever since then, my tasktracker/datanode machines have been regularly
 losing their networking during long (> 1 hour) jobs. Restarting the
 network interface brings them back online immediately.


Could you clarify what you mean by "losing their networking"? Can you ping
the node externally? If you access the node via the console (via ILOM, etc.)
and run tcpdump or tshark, can you see Ethernet broadcast traffic at all? Do
you see anything in dmesg on the machine in question?

Thanks
-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera


Trouble Submitting Job as another User

2010-04-03 Thread Ryan Rosario
Hi,

I am trying to set up a Hadoop cluster so that any of our users can
access HDFS and submit jobs, but I am having trouble with this.
I added an HDFS path for mapred.system.dir in mapred-site.xml as
suggested in an FAQ.

I start/stop the cluster with system user _hadoop.
I would like to be able to access HDFS and submit jobs as user ryan
(and other users on the system). When I attempt to copy a directory
from local FS to HDFS, I get:

10/04/03 14:35:27 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 0 time(s).
.
.
.

How can I accomplish what I am trying to do?

Thanks in advance,
Ryan

-- 
RRR


Re: Trouble Submitting Job as another User

2010-04-03 Thread abhishek sharma
Did you disable the permissions for HDFS?

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
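
One other thing worth checking: those retry messages point at the NameNode
address itself, so the client may not be reaching the NameNode at all
rather than failing on permissions. As a sketch (the host name below is
hypothetical), fs.default.name in core-site.xml should name a NameNode
address that is reachable by every user:

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:9000</value>
</property>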


Abhishek

On Sat, Apr 3, 2010 at 5:36 PM, Ryan Rosario uclamath...@gmail.com wrote:
 Hi,

 I am trying to set up a Hadoop cluster so that any of our users can
 access HDFS and submit jobs, but I am having trouble with this.
 I added an HDFS path for mapred.system.dir in mapred-site.xml as
 suggested in an FAQ.

 I start/stop the cluster with system user _hadoop.
 I would like to be able to access HDFS and submit jobs as user ryan
 (and other users on the system). When I attempt to copy a directory
 from local FS to HDFS, I get:

 10/04/03 14:35:27 INFO ipc.Client: Retrying connect to server:
 localhost/127.0.0.1:9000. Already tried 0 time(s).
 .
 .
 .

 How can I accomplish what I am trying to do?

 Thanks in advance,
 Ryan

 --
 RRR



Re: Trouble Submitting Job as another User

2010-04-03 Thread Ryan Rosario
Yes. I just tried that, then stopped mapred and dfs and restarted them.
No change.

10/04/03 14:55:03 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 0 time(s).
10/04/03 14:55:04 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 1 time(s).
10/04/03 14:55:05 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 2 time(s).
10/04/03 14:55:06 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 3 time(s).
10/04/03 14:55:07 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 4 time(s).
10/04/03 14:55:08 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 5 time(s).
10/04/03 14:55:09 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 6 time(s).
10/04/03 14:55:10 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 7 time(s).
10/04/03 14:55:11 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 8 time(s).
10/04/03 14:55:12 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 9 time(s).
Bad connection to FS. command aborted.

I am still only able to connect to HDFS as _hadoop.

R.

On Sat, Apr 3, 2010 at 2:49 PM, abhishek sharma absha...@usc.edu wrote:
 Did you disable the permissions for HDFS?

  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>


 Abhishek

 On Sat, Apr 3, 2010 at 5:36 PM, Ryan Rosario uclamath...@gmail.com wrote:
 Hi,

 I am trying to set up a Hadoop cluster so that any of our users can
 access HDFS and submit jobs, but I am having trouble with this.
 I added an HDFS path for mapred.system.dir in mapred-site.xml as
 suggested in an FAQ.

 I start/stop the cluster with system user _hadoop.
 I would like to be able to access HDFS and submit jobs as user ryan
 (and other users on the system). When I attempt to copy a directory
 from local FS to HDFS, I get:

 10/04/03 14:35:27 INFO ipc.Client: Retrying connect to server:
 localhost/127.0.0.1:9000. Already tried 0 time(s).
 .
 .
 .

 How can I accomplish what I am trying to do?

 Thanks in advance,
 Ryan

 --
 RRR





-- 
RRR


Re: losing network interfaces during long running map-reduce jobs

2010-04-03 Thread David Howell
 Could you clarify what you mean by "losing their networking"? Can you ping
 the node externally? If you access the node via the console (via ILOM, etc.)
 and run tcpdump or tshark, can you see Ethernet broadcast traffic at all? Do
 you see anything in dmesg on the machine in question?

 Thanks
 -Todd

My cluster is small and the physical servers are managed by my company's
IT department... I just admin the Hadoop install, and I don't have
access except through ssh. When one of my nodes goes unresponsive, it
doesn't respond to ping, ssh, or any traffic on any port. I've been
limited so far to investigating the logs after my sysadmin
restarts the network interface.

But I haven't seen anything in the dmesg log. I'll have to try looking
at the tcpdump output on Monday, once I can get console access again.
My apologies that I'm so sketchy on details right now... so far, I
haven't been able to find any evidence of something going wrong
except for the Hadoop log entries when the IOExceptions start.

Thanks,
-David


Re: Does Hadoop compress files?

2010-04-03 Thread Sonal Goyal
Hi,

Please check
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Data+Compression
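
In short, HDFS does not transparently compress data on its own; compression
comes from the file formats you write and the MapReduce settings you enable.
As a minimal sketch, assuming the 0.20-era property names, map output and
job output compression can be turned on in mapred-site.xml:

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

GzipCodec here is only one option; for large log files a splittable or
faster codec may be a better fit.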

Thanks and Regards,
Sonal
www.meghsoft.com


On Sat, Apr 3, 2010 at 11:15 PM, u235sentinel u235senti...@gmail.com wrote:

 I'm starting to evaluate Hadoop.  We are currently running Sensage and
 store a lot of log files in our current environment.  I've been looking at
 the Hadoop forums and googling (of course) but haven't learned whether
 HDFS applies any compression to the files we store.

 On average we're storing about 600 GB a week in log files (more or
 less).  Generally we need to store about 1.5 to 2 years of logs.  With
 Sensage compression we can store 200+ TB of logs in our current
 environment.

 As I said, we're starting to evaluate whether Hadoop would be a good
 replacement for our Sensage environment (or could at least augment it).

 Thanks a bunch!!



Re: Does Hadoop compress files?

2010-04-03 Thread Rajesh Balamohan
There is a facility in Hadoop to compress intermediate map output and job
output. Or is your question about reading compressed files into Hadoop
itself?

If so, refer to SequenceFileInputFormat
(http://developer.yahoo.com/hadoop/tutorial/module4.html):

The *SequenceFileInputFormat* reads special binary files that are specific
to Hadoop. These files include many features designed to allow data to be
rapidly read into Hadoop mappers. Sequence files are block-compressed and
provide direct serialization and deserialization of several arbitrary data
types (not just text). Sequence files can be generated as the output of
other MapReduce tasks and are an efficient intermediate representation for
data that is passing from one MapReduce job to another.
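
As a minimal sketch of the writing side (the path, key/value types, and
class name below are only examples), a block-compressed sequence file can
be produced with SequenceFile.createWriter:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/logs/sample.seq");  // example path only

    // BLOCK compression compresses runs of records together, which
    // usually gives the best ratio on text-heavy log data.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      writer.append(new LongWritable(1L), new Text("a log line"));
    } finally {
      writer.close();
    }
  }
}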

On Sat, Apr 3, 2010 at 11:15 PM, u235sentinel u235senti...@gmail.com wrote:

 I'm starting to evaluate Hadoop.  We are currently running Sensage and
 store a lot of log files in our current environment.  I've been looking at
 the Hadoop forums and googling (of course) but haven't learned whether
 HDFS applies any compression to the files we store.

 On average we're storing about 600 GB a week in log files (more or
 less).  Generally we need to store about 1.5 to 2 years of logs.  With
 Sensage compression we can store 200+ TB of logs in our current
 environment.

 As I said, we're starting to evaluate whether Hadoop would be a good
 replacement for our Sensage environment (or could at least augment it).

 Thanks a bunch!!




-- 
~Rajesh.B


measuring the split reading time in Hadoop

2010-04-03 Thread abhishek sharma
Hi all,

I wanted to measure the time it takes to read the input split for a map
task. For my cluster, I am interested in measuring the overhead of
fetching the input to a map task over the network, as opposed to
reading it from a local disk.

Is there an easy way to instrument some function to log this
information (say, in the TaskTracker logs)?

Thanks,
Abhishek
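
One low-touch possibility (a sketch, not a built-in hook; the class name
and log message below are hypothetical) is to wrap the job's RecordReader
in the old mapred API and log the accumulated time spent pulling records
from the split. The log line would appear in the task attempt's logs (the
TaskTracker's userlogs) rather than the TaskTracker log proper:

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

// Hypothetical wrapper: delegates to the real reader and accumulates
// the wall-clock time spent in next().
public class TimingRecordReader implements RecordReader<LongWritable, Text> {
  private static final Log LOG = LogFactory.getLog(TimingRecordReader.class);
  private final RecordReader<LongWritable, Text> inner;
  private long readNanos = 0L;

  public TimingRecordReader(RecordReader<LongWritable, Text> inner) {
    this.inner = inner;
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    long start = System.nanoTime();
    boolean more = inner.next(key, value);
    readNanos += System.nanoTime() - start;
    if (!more) {
      // Emitted once the split is exhausted.
      LOG.info("Total split read time: " + (readNanos / 1000000L) + " ms");
    }
    return more;
  }

  public LongWritable createKey() { return inner.createKey(); }
  public Text createValue() { return inner.createValue(); }
  public long getPos() throws IOException { return inner.getPos(); }
  public float getProgress() throws IOException { return inner.getProgress(); }
  public void close() throws IOException { inner.close(); }
}

You would return this from a custom InputFormat's getRecordReader(),
wrapping the reader obtained from, e.g., TextInputFormat. Note that it
times record reading as a whole; comparing the numbers for data-local
versus non-local task attempts would then give the network-fetch overhead.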