Changing the /etc/hosts line from:
127.0.0.1 localhost, prasen-host
to
127.0.0.1 localhost
fixed the problem...
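(A hedged note on why: fields in /etc/hosts are whitespace-separated, so with the comma the line defines the aliases "localhost," and "prasen-host", and "localhost," with its trailing comma is not a valid hostname, which is presumably what broke name resolution.)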
On Sat, Jun 16, 2012 at 12:30 PM, prasenjit mukherjee
prasen@gmail.com wrote:
I started hadoop in single-node/pseudo-distributed mode. Took all
the precautionary steps ...
I have a text file which doesn't have any newline characters. The
records are separated by a special character (e.g. $). If I push a
single 5 GB file to hdfs, how will it identify the boundaries on
which the file should be split?
What are the options I have in such a scenario so that I can ...
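For context (hedged, since the details depend on the Hadoop version): HDFS itself splits the file into fixed-size blocks purely by byte offset and knows nothing about record boundaries. Record boundaries are recovered at MapReduce time by the InputFormat's RecordReader: each reader skips the partial record at the start of its split and reads past the end of its split to finish its last record. For a '$'-delimited file, newer releases (0.21+/2.x) let you override the delimiter directly via textinputformat.record.delimiter; plain 0.20-era releases need a custom InputFormat instead. A minimal streaming sketch, with hypothetical paths:

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D textinputformat.record.delimiter='$' \
    -D mapred.reduce.tasks=0 \
    -input /ip/bigfile.txt -output /op/records \
    -mapper /bin/cat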
Does it mean that on an average 1 file has only 2 blocks ( with
replication=1 ) ?
On 1/18/12, M. C. Srivas mcsri...@gmail.com wrote:
Konstantin's paper
http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf
mentions that on average a file consumes about 600 bytes of memory in
the name-node.
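A back-of-the-envelope reading (hedged, using the paper's own estimates): each metadata object, a file inode or a block, costs roughly 150-200 bytes on the name-node, so ~600 bytes per file works out to about three objects, i.e. one inode plus roughly two blocks. Replication barely changes this, since replicas are tracked as small per-replica references rather than as separate block objects. So yes, the figure implies an average file spans only about 1.5-2 blocks.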
Really enthralled to read the post :
http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
Great job done.
Some related questions:
1. The article says that hdfs always maintains 2 copies in the same
rack and the 3rd in a different rack. This only speeds up the hdfs put ( ...
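For reference (hedged against version differences): Hadoop's default placement policy puts the first replica on the writer's own datanode (or a random node for clients outside the cluster), the second on a node in a different rack, and the third on another node in that same remote rack, so two of the three replicas share a rack while the write pipeline crosses racks only once.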
Forwarding to the hdfs and pig mailing lists for responses from a wider audience.
-- Forwarded message --
From: prasenjit mukherjee prasen@gmail.com
Date: Tue, Mar 16, 2010 at 11:47 AM
Subject: How to avoid a full table scan for column search. ( HIVE+LUCENE)
To: hive-user hive-u
This is the kind of use case I was looking for (persistent HDFS
across ec2 cluster restarts).
Correct me if I am wrong: I probably don't even need to take
snapshots if I am bringing down and restarting the entire ec2
cluster. I am using Cloudera's hadoop-ec2 launch/terminate cluster
scripts to ...
Thanks to Mridul, here is an approach he suggested based on Pig,
which works fine for me:
input_lines = LOAD 'my_s3_list_file' AS (location_line:chararray);
-- one group per S3 location, fanned out across $NUM_MAPPERS_REQUIRED reducers
grp_op = GROUP input_lines BY location_line PARALLEL $NUM_MAPPERS_REQUIRED;
-- the UDF performs the actual fetch, once per location, inside the reducers
actual_result = FOREACH grp_op GENERATE MY_S3_UDF(group);
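It would typically be run with the fan-out bound on the command line, something like this (a sketch; the script filename is hypothetical):

  pig -param NUM_MAPPERS_REQUIRED=10 s3fetch_group.pig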
On Fri, Mar 5, 2010 at 7:13 PM, prasenjit mukherjee
pmukher...@quattrowireless.com wrote:
Is there any way ( like hadoop-commandline or files ) to know ip
address of all the cluster nodes ( from master )
The NameNode web UI page displays information about live nodes.
You can execute commands on the slave nodes using bin/slaves.sh:
bin/slaves.sh /sbin/ifconfig | grep "inet addr"
-
Ravi
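Another option (hedged: this was in the stock command set of that era) is to ask the namenode directly; it prints every live datanode along with its address:

  hadoop dfsadmin -report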
On 3/6/10 9:15 AM, prasenjit mukherjee prasen@gmail.com wrote:
I am using ec2 and don't see the slaves file in $HADOOP_HOME/conf.
I am using hadoop's built-in ec2 scripts to launch hadoop clusters in
ec2. For lower instance types (up to m2.2xlarge) the launch scripts
work fine. But when I move to either m2.4xlarge or c1.xlarge
instances, the same scripts seem to get stuck. For c1.xlarge it gives
a connection timeout, and for ...
I believe by default (though I'm not sure for Pig), each attempt's
output is first isolated to a path keyed to its attempt ID, and only
committed when one and only one attempt is complete.
On Tue, Feb 9, 2010 at 9:52 PM, prasenjit mukherjee
pmukher...@quattrowireless.com wrote:
Sometimes for the same task I see that a duplicate task gets run on a
different machine and gets killed later. Not always, but sometimes. Any
reason why duplicate tasks get run? I thought tasks are duplicated
only if the first attempt either exits (exceptions etc.) or exceeds
mapred.task.timeout.
Any thoughts on this problem? I am using a DEFINE command (in PIG),
and hence the actions are not idempotent. Because of this, duplicate
execution does have an effect on my results. Any way to overcome that?
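A hedged sketch of one common fix: the duplicates described above are Hadoop's speculative execution, which launches backup attempts of slow tasks on other nodes, and with non-idempotent side effects it is usually just turned off. The property names below are the 0.20-era ones, and this assumes your Pig version forwards -D properties to the Hadoop job (myscript.pig is a placeholder):

  pig -Dmapred.map.tasks.speculative.execution=false \
      -Dmapred.reduce.tasks.speculative.execution=false \
      myscript.pig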
On Tue, Feb 9, 2010 at 9:26 PM, prasenjit mukherjee
pmukher...@quattrowireless.com
Not sure I understand. How is it different from using plain ec2 with
hadoop-specific AMIs ?
On Wed, Feb 3, 2010 at 11:17 PM, Sirota, Peter sir...@amazon.com wrote:
Elastic MapReduce uses Hadoop 0.18.3 with several patches that improve S3N
performance/reliability.
This is a sample of what I am working on: a distributed s3-fetch Pig
script which uses a Python script.
s3fetch.pig:
-- ship the python fetcher to every task node
DEFINE CMD `s3fetch.py` SHIP('/root/s3fetch.py');
r1 = LOAD '/ip/s3fetch_input_files' AS (filename:chararray);
-- one group per filename, fanned out across 5 reducers
grp_r1 = GROUP r1 BY filename PARALLEL 5;
r2 = FOREACH grp_r1 GENERATE ...
For algebraic reduce functions, it should be able to start the user
reduce functions (3) in parallel as well, even before the mappers complete, right?
On Wed, Jan 27, 2010 at 4:19 AM, Ed Mazur ma...@cs.umass.edu wrote:
You're right that the user reduce function cannot be applied until all
maps ...
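A worked example of the distinction (hedged; this is the generic MapReduce argument, not anything Pig-specific): for an algebraic function like SUM with values [1,2,3] seen by one mapper and [4,5,6] by another, combiners can emit the partial sums 6 and 15 while the maps are still running, but the final reduce that produces 21 cannot run until both partials have arrived, because a key's complete value set isn't known until the last map finishes.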
It is taking ~2 minutes for hdfs to do a -moveFromLocal of a 200 MB file
on a 12-node cluster (in ec2). I haven't touched any of the default
configurations (like the replication config etc.).
I am using the following command: hadoop fs -moveFromLocal /mnt/file1
/mnt/file2 ... /ip/. BTW,
I have hundreds of large files (~100 MB each) in a /mnt/ location which is
shared by all my hadoop nodes. Was wondering if I could directly use hadoop
distcp file:///mnt/data/tr* /input to parallelize/distribute the hadoop push.
Hadoop push is indeed becoming a bottleneck for me, and any help in this
regard would be appreciated.
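One caveat worth hedging on: distcp runs its copies as map tasks on arbitrary nodes, so a file:// source only works if the same path is readable at the same location on every node, which holds here since /mnt is shared. The -m flag caps the number of simultaneous copy maps, e.g.:

  hadoop distcp -m 20 file:///mnt/data/tr* /input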
hadoop fs -rmr /op
That command always fails. I am trying to run sequential hadoop jobs.
After the first run, all subsequent runs fail while cleaning up (i.e.,
removing the hadoop dir created by the previous run). What can I do to
avoid this?
here is my hadoop version :
# hadoop version
Hadoop
That was exactly the reason. Thanks a bunch.
On Tue, Jan 19, 2010 at 12:24 PM, Mafish Liu maf...@gmail.com wrote:
2010/1/19 prasenjit mukherjee pmukher...@quattrowireless.com:
I run hadoop fs -rmr ... immediately after start-all.sh. Does the
namenode always start in safemode, and after ...
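For anyone hitting the same thing, a sketch of the usual workaround (both are stock commands of that era): block until the namenode leaves safe mode before cleaning up:

  hadoop dfsadmin -safemode wait
  hadoop fs -rmr /op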
- AWS_ACCESS_KEY_ID - Your AWS Access Key ID
- AWS_SECRET_ACCESS_KEY - Your AWS Secret Access Key
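For example (the key values are placeholders to fill in):

  export AWS_ACCESS_KEY_ID=<your-access-key-id>
  export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>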
On Mon, Jan 18, 2010 at 11:12 AM, prasenjit mukherjee
prasen@gmail.comwrote:
Thanks for the suggestion. Now I am getting the following error with
Cloudera's distro. I have set ...