Re: Hadoop startup problems ( FileSystem is not ready yet! )

2012-06-16 Thread prasenjit mukherjee
Changing the /etc/hosts line from "127.0.0.1 localhost, prasen-host" to "127.0.0.1 localhost" fixed the problem... On Sat, Jun 16, 2012 at 12:30 PM, prasenjit mukherjee prasen@gmail.com wrote: I started hadoop in single-node/pseudo-distributed mode. Took all the precautionary
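For reference, the fix described above as it would appear in /etc/hosts (the hostname is specific to that machine; only the problem line changes):

    # before ( hadoop failed on startup with "FileSystem is not ready yet!" ):
    #   127.0.0.1   localhost, prasen-host
    # after:
    127.0.0.1   localhost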

How hdfs splits blocks on record boundaries

2012-06-13 Thread prasenjit mukherjee
I have a textfile which doesn't have any newline characters. The records are separated by a special character ( e.g. $ ). If I push a single 5 GB file to hdfs, how will it identify the boundaries on which the file should be split ? What are the options I have in such a scenario so that I can
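For context: HDFS splits the file into blocks at fixed byte offsets and ignores record boundaries entirely; it is the input format's record reader that reads past the end of a split to the next delimiter, so no record is lost or read twice. A minimal map-only streaming sketch, assuming a Hadoop build whose TextInputFormat honours the textinputformat.record.delimiter property (paths and the streaming jar location are placeholders):

    # Records end at '$' instead of '\n'; single quotes keep the shell from
    # expanding the delimiter. Map-only, so the output is just the parsed records.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D textinputformat.record.delimiter='$' \
        -D mapred.reduce.tasks=0 \
        -input /ip/bigfile \
        -output /op/records \
        -mapper /bin/cat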

Re: NameNode per-block memory usage?

2012-01-18 Thread prasenjit mukherjee
Does it mean that on average 1 file has only 2 blocks ( with replication=1 ) ? On 1/18/12, M. C. Srivas mcsri...@gmail.com wrote: Konstantin's paper http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf mentions that on average a file consumes about 600 bytes of memory in
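The arithmetic behind that question, written symbolically because the actual per-object sizes live in the cited paper rather than in this thread: with per-file overhead c_file, per-block overhead c_block and replication r,

    M_{\text{file}} \approx c_{\text{file}} + n_{\text{blocks}} \cdot r \cdot c_{\text{block}}

so the quoted ~600 bytes per file implies roughly 2 blocks per file only under an assumed split of that budget between the file object and its block objects (here r = 1, as in the question).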

Awesome post on Hadoop. Some questions...

2011-12-12 Thread prasenjit mukherjee
Really enthralled by the post : http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/ Great job done. Some related questions: 1. The article says that hdfs always maintains 2 copies in the same rack and the 3rd in a different rack. This only speeds up the hdfs put (
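One way to check what placement a given file actually got, rack by rack (path is a placeholder):

    # fsck lists every block of the file with the datanodes and racks holding
    # its replicas, which answers the placement question for a concrete file.
    hadoop fsck /input/somefile -files -blocks -locations -racks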

Fwd: How to avoid a full table scan for column search. ( HIVE+LUCENE)

2010-03-16 Thread prasenjit mukherjee
forwarding to the hdfs and pig mailing-lists for responses from a wider audience. -- Forwarded message -- From: prasenjit mukherjee prasen@gmail.com Date: Tue, Mar 16, 2010 at 11:47 AM Subject: How to avoid a full table scan for column search. ( HIVE+LUCENE) To: hive-user hive-u

Re: HFile backup while cluster running

2010-03-14 Thread prasenjit mukherjee
This is the kind of use-case I was looking for ( persistent HDFS across ec2 cluster restarts ). Correct me if I am wrong, I probably don't even need to take snapshots if I am bringing down and restarting the entire ec2 cluster. I am using cloudera's hadoop-ec2 launch/terminate cluster scripts to

Re: Parallelizing HTTP calls with Hadoop

2010-03-07 Thread prasenjit mukherjee
Thanks to Mridul, here is an approach he suggested based on Pig, which works fine for me : input_lines = load 'my_s3_list_file' as (location_line:chararray); grp_op = GROUP input_lines BY location_line PARALLEL $NUM_MAPPERS_REQUIRED; actual_result = FOREACH grp_op GENERATE MY_S3_UDF(group);

Re: how to get cluster-ips

2010-03-06 Thread prasenjit mukherjee
On Fri, Mar 5, 2010 at 7:13 PM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: Is there any way ( like hadoop-commandline or files ) to know the IP addresses of all the cluster nodes ( from the master )

Re: how to get cluster-ips

2010-03-06 Thread prasenjit mukherjee
page displays information about live nodes. You can execute commands on slave nodes using bin/slaves.sh – bin/slaves.sh /sbin/ifconfig | grep “inet addr” - Ravi On 3/6/10 9:15 AM, prasenjit mukherjee prasen@gmail.com wrote: I am using ec2 and don't see the slaves in $HADOOP_HOME/conf
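Two sketches that cover this thread's question (the first asks the NameNode for its live datanodes and works even when conf/slaves is empty, which is the EC2 situation above; the second is the slaves.sh route Ravi describes and needs conf/slaves populated):

    # Datanode addresses as registered with the NameNode:
    hadoop dfsadmin -report | grep '^Name:'
    # Or ask every host listed in conf/slaves for its own address:
    bin/slaves.sh /sbin/ifconfig | grep 'inet addr'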

how to get cluster-ips

2010-03-05 Thread prasenjit mukherjee
Is there any way ( like hadoop-commandline or files ) to know the IP addresses of all the cluster nodes ( from the master )

launching time for high-cpu-instance hadoop clusters

2010-02-14 Thread prasenjit mukherjee
I am using hadoop's in-built ec2-scripts to launch hadoop clusters in ec2. For lower instance-types ( up to m2.2xlarge ) the launch scripts work fine. But when I move to either m2.4xlarge or c1.xlarge instances, the same scripts seem to get stuck. For c1.xlarge it gives a connection time-out and for

Re: duplicate tasks getting started/killed

2010-02-10 Thread prasenjit mukherjee
), each attempt's output is first isolated to a path keyed to its attempt id, and only committed when one and only one attempt is complete. On Tue, Feb 9, 2010 at 9:52 PM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: Any thoughts on this problem ? I am using a DEFINE command

Re: duplicate tasks getting started/killed

2010-02-10 Thread prasenjit mukherjee
completion? I believe by default (though I'm not sure for Pig), each attempt's output is first isolated to a path keyed to its attempt id, and only committed when one and only one attempt is complete. On Tue, Feb 9, 2010 at 9:52 PM, prasenjit mukherjee pmukher...@quattrowireless.com wrote

duplicate tasks getting started/killed

2010-02-09 Thread prasenjit mukherjee
Sometimes for the same task I see that a duplicate task gets run on a different machine and gets killed later. Not always, but sometimes. Any reason why duplicate tasks get run? I thought tasks are duplicated only if the first attempt either exits ( exceptions etc. ) or exceeds mapred.task.timeout.

Re: duplicate tasks getting started/killed

2010-02-09 Thread prasenjit mukherjee
Any thoughts on this problem ? I am using a DEFINE command ( in PIG ) and hence the actions are not idempotent, because of which duplicate execution does have an effect on my results. Any way to overcome that ? On Tue, Feb 9, 2010 at 9:26 PM, prasenjit mukherjee pmukher...@quattrowireless.com
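The duplicates are speculative attempts: the framework launches a second copy of a slow-looking task and kills whichever copy finishes second. For non-idempotent, side-effecting tasks the usual remedy is to turn speculation off. A sketch using the 0.20/1.x property names, assuming a driver that goes through ToolRunner so the -D flags reach the job configuration (MyDriver and the paths are placeholders; the same two keys can also be set in the Hadoop configuration a Pig script is launched against):

    hadoop jar myjob.jar MyDriver \
        -D mapred.map.tasks.speculative.execution=false \
        -D mapred.reduce.tasks.speculative.execution=false \
        /input /output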

Re: aws

2010-02-07 Thread prasenjit mukherjee
Not sure I understand. How is it different from using plain ec2 with hadoop-specific AMIs ? On Wed, Feb 3, 2010 at 11:17 PM, Sirota, Peter sir...@amazon.com wrote: Elastic MapReduce uses Hadoop .18.3 with several patches that improve S3N performance/reliability. -Original Message-

Re: Anyone has sample program for Hadoop Streaming using shell scripting for Map/Reduce

2010-01-26 Thread prasenjit mukherjee
This is a sample of what I am trying: a distributed s3-fetch pig script which uses a python script. s3fetch.pig: define CMD `s3fetch.py` SHIP('/root/s3fetch.py'); r1 = LOAD '/ip/s3fetch_input_files' AS (filename:chararray); grp_r1 = GROUP r1 BY filename PARALLEL 5; r2 = FOREACH grp_r1 GENERATE
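Since the original question asked for a streaming sample with a shell script, here is a minimal map-only sketch alongside the Pig route above (the jar location, paths and fetch.sh are placeholders):

    # -file ships the local script to every task; -mapper then runs it from the
    # task's working directory. Map-only, so no reducer is involved.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.reduce.tasks=0 \
        -input /ip/s3fetch_input_files \
        -output /op/s3fetch_out \
        -mapper fetch.sh \
        -file /root/fetch.sh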

Re: do all mappers finish before reducer starts

2010-01-26 Thread prasenjit mukherjee
For algebraic reduce functions, it should be able to start the user reduce functions (3) in parallel as well, even before the mappers complete, right ? On Wed, Jan 27, 2010 at 4:19 AM, Ed Mazur ma...@cs.umass.edu wrote: You're right that the user reduce function cannot be applied until all maps
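What can start before the maps finish is the shuffle: reduce tasks may be launched early to copy map output, governed by the slow-start knob below (0.20/1.x name), but the user reduce function still waits for the last map. Partial aggregation earlier than that is the combiner's job, which is where algebraic functions pay off. A sketch, again assuming a ToolRunner-based driver (MyDriver and the paths are placeholders):

    # Launch reducers once 5% of maps have finished, so copying overlaps the
    # map phase; reduce() itself still runs only after every map completes.
    hadoop jar myjob.jar MyDriver \
        -D mapred.reduce.slowstart.completed.maps=0.05 \
        /input /output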

Techniques to speedup hadoop fs -moveFromLocal command

2010-01-23 Thread prasenjit mukherjee
It is taking ~ 2 minutes for hdfs to do a -moveFromLocal of a 200 MB file on a 12 node cluster ( in ec2 ). I haven't touched any of the default configurations ( like replication-config etc. ). I am using the following command : hadoop fs -moveFromLocal /mnt/file1 /mnt/file2 . /ip/. BTW,
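One thing worth checking before anything fancier: with the default replication of 3 the client writes every byte through a three-datanode pipeline, so lowering replication for the bulk load alone can shorten the copy noticeably. A sketch, assuming the fs shell on this version honours generic -D options (property name is the 0.20/1.x one; paths as in the thread):

    hadoop fs -D dfs.replication=1 -moveFromLocal /mnt/file1 /ip/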

distributing hadoop push

2010-01-23 Thread prasenjit mukherjee
I have hundreds of large files ( ~ 100MB ) in a /mnt/ location which is shared by all my hadoop nodes. Was wondering if I could directly use hadoop distcp file:///mnt/data/tr* /input to parallelize/distribute hadoop push. Hadoop push is indeed becoming a bottleneck for me and any help in this

rmr: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /op. Name node is in safe mode.

2010-01-18 Thread prasenjit mukherjee
hadoop fs -rmr /op That command always fails. I am trying to run sequential hadoop jobs. After the first run all subsequent runs fail while cleaning up ( aka removing the hadoop dir created by the previous run ). What can I do to avoid this ? here is my hadoop version : # hadoop version Hadoop

Re: rmr: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /op. Name node is in safe mode.

2010-01-18 Thread prasenjit mukherjee
, prasenjit mukherjee prasen@gmail.com wrote: hadoop fs -rmr /op That command always fails. I am trying to run sequential hadoop jobs. After the first run all subsequent runs fail while cleaning up ( aka removing the hadoop dir created by previous run ). What can I do to avoid

Re: rmr: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /op. Name node is in safe mode.

2010-01-18 Thread prasenjit mukherjee
That was exactly the reason. Thanks a bunch. On Tue, Jan 19, 2010 at 12:24 PM, Mafish Liu maf...@gmail.com wrote: 2010/1/19 prasenjit mukherjee pmukher...@quattrowireless.com: I run hadoop fs -rmr .. immediately after start-all.sh Does the namenode always start in safemode and after
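A sketch of the workaround that follows from this: block until the NameNode has left safe mode before doing the cleanup.

    # -safemode wait returns only after enough block reports have arrived and
    # the NameNode has left safe mode, so the rmr no longer races the startup.
    hadoop dfsadmin -safemode wait
    hadoop fs -rmr /op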

Re: which hadoop-ec2 is preferred ( cloudera/hadoop ? )

2010-01-17 Thread prasenjit mukherjee
- AWS_ACCESS_KEY_ID - Your AWS Access Key ID - AWS_SECRET_ACCESS_KEY - Your AWS Secret Access Key On Mon, Jan 18, 2010 at 11:12 AM, prasenjit mukherjee prasen@gmail.com wrote: Thanks for the suggestion. Now I am getting the following error with cloudera's distro.* I have set
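For completeness, a sketch of how those two values are typically supplied, assuming the Cloudera EC2 scripts read them from the environment (values are placeholders):

    export AWS_ACCESS_KEY_ID=...          # your AWS Access Key ID
    export AWS_SECRET_ACCESS_KEY=...      # your AWS Secret Access Key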