Re: where are the old hadoop documentations for v0.22.0 and below ?

2014-07-30 Thread Jane Wayne
harsh, those are just javadocs. i'm talking about the full documentation
(see original post).


On Tue, Jul 29, 2014 at 2:17 PM, Harsh J  wrote:

> Precompiled docs are available in the archived tarballs of these
> releases, which you can find on:
> https://archive.apache.org/dist/hadoop/common/
>
> On Tue, Jul 29, 2014 at 1:36 AM, Jane Wayne 
> wrote:
> > where can i get the old hadoop documentation (e.g. cluster setup, xml
> > configuration params) for hadoop v0.22.0 and below? i downloaded the
> source
> > and binary files but could not find the documentation as part of the
> > archive file.
> >
> > on the home page at http://hadoop.apache.org/, i only see documentation
> > for the following versions.
> > - current, stable, 1.2.1, 2.2.0, 2.4.1, 0.23.11
>
>
>
> --
> Harsh J
>


where are the old hadoop documentations for v0.22.0 and below ?

2014-07-28 Thread Jane Wayne
where can i get the old hadoop documentation (e.g. cluster setup, xml
configuration params) for hadoop v0.22.0 and below? i downloaded the source
and binary files but could not find the documentation as part of the
archive file.

on the home page at http://hadoop.apache.org/, i only see documentation
for the following versions.
- current, stable, 1.2.1, 2.2.0, 2.4.1, 0.23.11


Re: slaves datanodes are not starting, hadoop v2.4.1

2014-07-27 Thread Jane Wayne
never mind, i resolved it. the root cause was unclear/misleading instructions
on the hadoop site.

this is NOT the way to start slave datanode daemons (NOTICE THE SINGULAR
DAEMON).

$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script
hdfs start datanode

this is the correct way to start slave datanode daemons (NOTICE THE PLURAL
DAEMONS).

$HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script
hdfs start datanode
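
in case it helps the next person: the plural script works because it loops over
the hosts listed in $HADOOP_CONF_DIR/slaves via ssh and runs the singular script
on each host. a rough sketch of what it effectively does (not the actual script):

for slave in $(cat "$HADOOP_CONF_DIR/slaves"); do
  ssh "$slave" "$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode"
done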



On Sun, Jul 27, 2014 at 3:11 AM, Jane Wayne 
wrote:

> i am following the instructions to setup a multi-node cluster at
> http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/ClusterSetup.html
> .
>
> my problem is that when i run the script to start up the slave datanodes,
> no slave datanode is started (more on this later).
>
> i have two nodes so far that i am experimenting with
> 1. node1 (this is the namenode)
> 2. node2 (this is the datanode)
>
> on node1 (namenode), i start the namenode daemon as follows. there is no
> problem here.
>
> $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script
> hdfs start namenode
>
> on node1 (namenode), i start the datanode daemon (on node2) as follows.
>
> $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script
> hdfs start datanode
>
> it is at this point that the problems begin:
> a) the datanode daemon is started on node1 (instead of node2)
> b) the datanode daemon is not started on node2 (the only slave defined)
> c) the only job of node1 is to be a namenode, not also a datanode
>
> my $HADOOP_CONF_DIR/slaves has one single entry (line):
> node2
>
> my /etc/hosts file looks like the following
> 127.0.0.1 localhost
> 192.168.0.10 node1
> 192.168.0.11 node2
>
> why does the script to start the datanode slave daemons not start them and
> only start a local slave daemon?
>
> on node2, when i run the script to start the datanode daemon, it is able
> to find node1:8020 and become active as part of the cluster. but this is
> strange to me, because this script should be called/executed from the
> namenode to start all datanodes on the slaves (not the other way around).
>
> any idea what i am doing wrong?
>
>


slaves datanodes are not starting, hadoop v2.4.1

2014-07-27 Thread Jane Wayne
i am following the instructions to setup a multi-node cluster at
http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/ClusterSetup.html
.

my problem is that when i run the script to start up the slave datanodes,
no slave datanode is started (more on this later).

i have two nodes so far that i am experimenting with
1. node1 (this is the namenode)
2. node2 (this is the datanode)

on node1 (namenode), i start the namenode daemon as follows. there is no
problem here.

$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script
hdfs start namenode

on node1 (namenode), i start the datanode daemon (on node2) as follows.

$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script
hdfs start datanode

it is at this point that the problems begin:
a) the datanode daemon is started on node1 (instead of node2)
b) the datanode daemon is not started on node2 (the only slave defined)
c) the only job of node1 is to be a namenode, not also a datanode

my $HADOOP_CONF_DIR/slaves has one single entry (line):
node2

my /etc/hosts file looks like the following
127.0.0.1 localhost
192.168.0.10 node1
192.168.0.11 node2

why does the script to start the datanode slave daemons not start them and
only start a local slave daemon?

on node2, when i run the script to start the datanode daemon, it is able to
find node1:8020 and become active as part of the cluster. but this is
strange to me, because this script should be called/executed from the
namenode to start all datanodes on the slaves (not the other way around).

any idea what i am doing wrong?


hdfs permission is still being checked after being disabled

2014-03-06 Thread Jane Wayne
i am using hadoop v2.3.0.

in my hdfs-site.xml, i have the following property set.

<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
</property>

however, when i try to run a hadoop job, i see the following
AccessControlException.

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
Permission denied: user=hadoopuser, access=EXECUTE,
inode="/tmp":root:supergroup:drwxrwx---

to me, it seems that i have already disabled permission checking, so i
shouldn't get that AccessControlException.
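
a couple of hedged sanity checks (assuming the stock 2.3.0 client tools): print
the dfs.permissions.enabled value that the configuration on a given node actually
resolves to, and look at the /tmp inode directly. note that getconf reads the
config visible to the node you run it on, not necessarily what the namenode loaded.

hdfs getconf -confKey dfs.permissions.enabled
hdfs dfs -ls -d /tmp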

any ideas?


Re: are the job and task tracker monitor webpages gone now in hadoop v2.3.0

2014-03-06 Thread Jane Wayne
ok, the reason why hadoop jobs were not showing up was because i did not
enable mapreduce to be run as a yarn application.
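
for anyone hitting the same thing, the change i mean is the usual
mapreduce.framework.name setting; a minimal sketch of the mapred-site.xml
property (assuming an otherwise stock 2.3.0 config):

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>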


On Thu, Mar 6, 2014 at 11:45 PM, Jane Wayne wrote:

> when i go to the job history server
>
> http://hadoop-cluster:19888/jobhistory
>
> i see no map reduce job there. i ran 3 simple mr jobs successfully. i
> verified by the console output and hdfs output directory.
>
> all i see on the UI is: No data available in table.
>
> any ideas?
>
> unless there is a JobHistoryServer just for MapReduce.. where is that?
>
>
> On Thu, Mar 6, 2014 at 7:14 PM, Vinod Kumar Vavilapalli <
> vino...@apache.org> wrote:
>
>>
>> Yes. JobTracker and TaskTracker are gone from all the 2.x release lines.
>>
>> MapReduce is an application on top of YARN. That is per job - launches,
>> starts and finishes after it is done with its work. Once it is done, you
>> can go look at it in the MapReduce specific JobHistoryServer.
>>
>> +Vinod
>>
>> On Mar 6, 2014, at 1:11 PM, Jane Wayne  wrote:
>>
>> > i recently made the switch from hadoop 0.20.x to hadoop 2.3.0 (yes, big
>> > leap). i was wondering if there is a way to view my jobs now via a web
>> UI?
>> > i used to be able to do this by accessing the following URL
>> >
>> > http://hadoop-cluster:50030/jobtracker.jsp
>> >
>> > however, there is no more job tracker monitoring page here.
>> >
>> > furthermore, i am confused about MapReduce as an application running on
>> top
>> > of YARN. so the documentation says MapReduce is just an application
>> running
>> > on YARN. if that is true, how come i do not see MapReduce as an
>> application
>> > on the ResourceManager web UI?
>> >
>> > http://hadoop-cluster:8088/cluster/apps
>> >
>> > is this because MapReduce is NOT a long-running app? meaning, a
>> MapReduce
>> > job will only show up as an app in YARN when it is running? (please bear
>> > with me, i'm still adjusting to this new design).
>> >
>> > any help/pointer is appreciated.
>>
>>
>> --
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity
>> to
>> which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the reader
>> of this message is not the intended recipient, you are hereby notified
>> that
>> any printing, copying, dissemination, distribution, disclosure or
>> forwarding of this communication is strictly prohibited. If you have
>> received this communication in error, please contact the sender
>> immediately
>> and delete it from your system. Thank You.
>>
>
>


Re: are the job and task tracker monitor webpages gone now in hadoop v2.3.0

2014-03-06 Thread Jane Wayne
when i go to the job history server

http://hadoop-cluster:19888/jobhistory

i see no map reduce job there. i ran 3 simple mr jobs successfully. i
verified by the console output and hdfs output directory.

all i see on the UI is: No data available in table.

any ideas?

unless there is a JobHistoryServer just for MapReduce.. where is that?


On Thu, Mar 6, 2014 at 7:14 PM, Vinod Kumar Vavilapalli
wrote:

>
> Yes. JobTracker and TaskTracker are gone from all the 2.x release lines.
>
> MapReduce is an application on top of YARN. That is per job - launches,
> starts and finishes after it is done with its work. Once it is done, you
> can go look at it in the MapReduce specific JobHistoryServer.
>
> +Vinod
>
> On Mar 6, 2014, at 1:11 PM, Jane Wayne  wrote:
>
> > i recently made the switch from hadoop 0.20.x to hadoop 2.3.0 (yes, big
> > leap). i was wondering if there is a way to view my jobs now via a web
> UI?
> > i used to be able to do this by accessing the following URL
> >
> > http://hadoop-cluster:50030/jobtracker.jsp
> >
> > however, there is no more job tracker monitoring page here.
> >
> > furthermore, i am confused about MapReduce as an application running on
> top
> > of YARN. so the documentation says MapReduce is just an application
> running
> > on YARN. if that is true, how come i do not see MapReduce as an
> application
> > on the ResourceManager web UI?
> >
> > http://hadoop-cluster:8088/cluster/apps
> >
> > is this because MapReduce is NOT a long-running app? meaning, a MapReduce
> > job will only show up as an app in YARN when it is running? (please bear
> > with me, i'm still adjusting to this new design).
> >
> > any help/pointer is appreciated.
>
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>


are the job and task tracker monitor webpages gone now in hadoop v2.3.0

2014-03-06 Thread Jane Wayne
i recently made the switch from hadoop 0.20.x to hadoop 2.3.0 (yes, big
leap). i was wondering if there is a way to view my jobs now via a web UI?
i used to be able to do this by accessing the following URL

http://hadoop-cluster:50030/jobtracker.jsp

however, there is no more job tracker monitoring page here.

furthermore, i am confused about MapReduce as an application running on top
of YARN. so the documentation says MapReduce is just an application running
on YARN. if that is true, how come i do not see MapReduce as an application
on the ResourceManager web UI?

http://hadoop-cluster:8088/cluster/apps

is this because MapReduce is NOT a long-running app? meaning, a MapReduce
job will only show up as an app in YARN when it is running? (please bear
with me, i'm still adjusting to this new design).

any help/pointer is appreciated.


openjdk warning, the vm will try to fix the stack guard

2014-03-06 Thread Jane Wayne
hi,

i have hadoop v2.3.0 installed on CentOS 6.5 64-bit. OpenJDK 64-bit
v1.7 is my java version.

when i attempt to start hadoop, i keep seeing this message below.

OpenJDK 64-Bit Server VM warning: You have loaded library
/usr/local/hadoop-2.3.0/lib/native/libhadoop.so.1.0.0 which might have
disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c
<libfile>', or link it with '-z noexecstack'.


i followed the instruction. here were my steps.
1. sudo yum install -y prelink
2. execstack -c /usr/local/hadoop-2.3.0/lib/native/libhadoop.so.1.0.0

however, this message still keeps popping up. i did some more searching on the
internet, and one user says that, basically, libhadoop.so.1.0.0 is 32-bit,
and to get rid of this message, i will need to recompile it as 64-bit.

is that correct? is there not a 64-bit version of libhadoop.so.1.0.0
available for download?
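
for context, the usual way to get a 64-bit libhadoop is to rebuild the native
code from the source tarball; a sketch, assuming the build prerequisites from
BUILDING.txt (protobuf, cmake, zlib headers, etc.) are installed:

# run from the unpacked hadoop-2.3.0-src directory
mvn package -Pdist,native -DskipTests -Dtar
# then copy the freshly built native libs over the bundled 32-bit ones
cp hadoop-dist/target/hadoop-2.3.0/lib/native/* /usr/local/hadoop-2.3.0/lib/native/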

thanks,


Re: hadoop v0.23.9, namenode -format command results in Could not find or load main class org.apache.hadoop.hdfs.server.namenode.NameNode

2013-08-12 Thread Jane Wayne
thanks, i also tried using HADOOP_PREFIX but that didn't work. I still get
the same error:  Could not find or load main class
org.apache.hadoop.hdfs.server.namenode.NameNode

btw, how do we install hadoop-common and hadoop-hdfs?

also, according to this link,
http://hadoop.apache.org/docs/r0.23.9/hadoop-project-dist/hadoop-common/SingleCluster.html,
there are several other variables to set.

$HADOOP_COMMON_HOME
$HADOOP_HDFS_HOME
$HADOOP_MAPRED_HOME
$YARN_HOME

where do we set these directories? i tried to set these as follows, which
still did not help to get rid of the error message.

$HADOOP_COMMON_HOME=${HADOOP_PREFIX}/share/hadoop/common
$HADOOP_HDFS_HOME=${HADOOP_PREFIX}/share/hadoop/hdfs
$HADOOP_MAPRED_HOME=${HADOOP_PREFIX}/share/hadoop/mapreduce
$YARN_HOME=${HADOOP_PREFIX}/share/hadoop/yarn

interestingly, that link never talks about HADOOP_PREFIX.
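
pulling the thread together, the environment discussed here looks roughly like
the sketch below (assuming the tarball is symlinked at /opt/hadoop, as in the
original post; untested, so treat it as a starting point only):

unset HADOOP_HOME
export HADOOP_PREFIX=/opt/hadoop
export JAVA_HOME=/opt/java
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}/share/hadoop/common
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}/share/hadoop/hdfs
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}/share/hadoop/mapreduce
export YARN_HOME=${HADOOP_PREFIX}/share/hadoop/yarn
export PATH=${JAVA_HOME}/bin:${HADOOP_PREFIX}/bin:${PATH}
${HADOOP_PREFIX}/bin/hdfs namenode -format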



On Sun, Aug 11, 2013 at 3:21 AM, Harsh J  wrote:

> I don't think you ought to be using HADOOP_HOME anymore.
>
> Try "unset HADOOP_HOME" and then "export HADOOP_PREFIX=/opt/hadoop"
> and retry the NN command.
>
> On Sun, Aug 11, 2013 at 8:50 AM, Jane Wayne 
> wrote:
> > hi,
> >
> > i have downloaded and untarred hadoop v0.23.9. i am trying to set up a
> > single node instance to learn this version of hadoop. also, i am following
> > as best as i can, the instructions at
> >
> http://hadoop.apache.org/docs/r0.23.9/hadoop-project-dist/hadoop-common/SingleCluster.html
> > .
> >
> > when i attempt to run ${HADOOP_HOME}/bin/hdfs namenode -format, i get the
> > following error.
> >
> > Error: Could not find or load main class
> > org.apache.hadoop.hdfs.server.namenode.NameNode
> >
> > the instructions in the link above are not complete. they jump right in and
> > say, "assuming you have installed hadoop-common/hadoop-hdfs..." what does
> > this assumption even mean? how do we install hadoop-common and
> hadoop-hdfs?
> >
> > right now, i am running on CentOS 6.4 x64 minimal. my steps are the
> > following.
> >
> > 0. installed jdk 1.7 (Oracle)
> > 1. tar xfz hadoop-0.23.9.tar.gz
> > 2. mv hadoop-0.23.9 /opt
> > 3. ln -s /opt/hadoop-0.23.9 /opt/hadoop
> > 4. export HADOOP_HOME=/opt/hadoop
> > 5. export JAVA_HOME=/opt/java
> > 6. export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${PATH}
> >
> > any help is appreciated.
>
>
>
> --
> Harsh J
>


hadoop v0.23.9, namenode -format command results in Could not find or load main class org.apache.hadoop.hdfs.server.namenode.NameNode

2013-08-10 Thread Jane Wayne
hi,

i have downloaded and untarred hadoop v0.23.9. i am trying to set up a
single node instance to learn this version of hadoop. also, i am following
as best as i can, the instructions at
http://hadoop.apache.org/docs/r0.23.9/hadoop-project-dist/hadoop-common/SingleCluster.html
.

when i attempt to run ${HADOOP_HOME}/bin/hdfs namenode -format, i get the
following error.

Error: Could not find or load main class
org.apache.hadoop.hdfs.server.namenode.NameNode

the instructions in the link above are not complete. they jump right in and
say, "assuming you have installed hadoop-common/hadoop-hdfs..." what does
this assumption even mean? how do we install hadoop-common and hadoop-hdfs?

right now, i am running on CentOS 6.4 x64 minimal. my steps are the
following.

0. installed jdk 1.7 (Oracle)
1. tar xfz hadoop-0.23.9.tar.gz
2. mv hadoop-0.23.9 /opt
3. ln -s /opt/hadoop-0.23.9 /opt/hadoop
4. export HADOOP_HOME=/opt/hadoop
5. export JAVA_HOME=/opt/java
6. export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${PATH}

any help is appreciated.


Re: how to get the time of a hadoop cluster, v0.20.2

2013-05-17 Thread Jane Wayne
and please remember, i stated that although the hadoop cluster uses NTP,
the "server" (the machine that is not a part of the hadoop cluster) cannot
assume to be using NTP (and in fact, doesn't).


On Fri, May 17, 2013 at 10:10 AM, Jane Wayne wrote:

> "if NTP is correclty used"
>
> that's the key statement. in several of our clusters, NTP setup is kludgy.
> note that the professionals administering the cluster are different from
> "us" the engineers. so, there's a lot of red tape to go through to get
> something trivial or not fixed. we have noticed that NTP is not setup
> correctly (using default GMT timezone, for example). without explaining all
> the tedious details, this mismatch of date/time (of the hadoop cluster to
> the server machine) is causing some pains.
>
> i'm not sure i agree with "the local OS time from your server machine will
> be the best estimation." that doesn't make sense.
>
> but what i want to achieve is very simple. as stated before, i just want
> to ask the namenode or jobtracker, "hey, what date/time do you have?"
> unfortunately for me, as niels pointed out, this query is not possible via
> the hadoop api.
>
> thanks for helping, though.
>
> :)
>
>
> On Fri, May 17, 2013 at 10:02 AM, Bertrand Dechoux wrote:
>
>> For hadoop, 'cluster time' is the local OS time. You might want to get the
>> time of the namenode machine but indeed if NTP is correctly used, the
>> local
>> OS time from your server machine will be the best estimation. If you
>> request the time from the namenode machine, you will be penalized by the
>> delay of your request.
>>
>> Regards
>>
>> Bertrand
>>
>>
>> On Fri, May 17, 2013 at 3:17 PM, Niels Basjes  wrote:
>>
>> > Hi,
>> >
>> > > i have another computer (which i have referred to as a server, since
>> it
>> > is
>> > > running tomcat), and this computer is NOT a part of the hadoop cluster
>> > (it
>> > > doesn't run any of the hadoop daemons), but does submit jobs to the
>> > hadoop
>> > > cluster via a JEE webapp interface. i need to check that the time on
>> this
>> > > computer is in sync with the time on the hadoop cluster. when i say
>> > "check
>> > > that the time is in sync", there is a defined tolerance/threshold
>> > > difference in date/time that i am willing to accept (e.g. the
>> date/time
>> > > should be the same down to the minute).
>> >
>> > If you ensure (using NTP) that all your servers have the same time then
>> you
>> > can simply query your local server for the time and you have the correct
>> > answer to your question.
>> >
>> > You are searching for a solution in the Hadoop API (where this does not
>> > exist) when the solution is present at a different level.
>> >
>> > --
>> > Best regards / Met vriendelijke groeten,
>> >
>> > Niels Basjes
>> >
>>
>
>


Re: how to get the time of a hadoop cluster, v0.20.2

2013-05-17 Thread Jane Wayne
"if NTP is correclty used"

that's the key statement. in several of our clusters, NTP setup is kludgy.
note that the professionals administering the cluster are different from
"us" the engineers. so, there's a lot of red tape to go through to get
anything, trivial or not, fixed. we have noticed that NTP is not set up
correctly (using the default GMT timezone, for example). without explaining all
the tedious details, this mismatch of date/time (of the hadoop cluster to
the server machine) is causing some pains.

i'm not sure i agree with "the local OS time from your server machine will
be the best estimation." that doesn't make sense.

but what i want to achieve is very simple. as stated before, i just want to
ask the namenode or jobtracker, "hey, what date/time do you have?"
unfortunately for me, as niels pointed out, this query is not possible via
the hadoop api.

thanks for helping, though.

:)


On Fri, May 17, 2013 at 10:02 AM, Bertrand Dechoux wrote:

> For hadoop, 'cluster time' is the local OS time. You might want to get the
> time of the namenode machine but indeed if NTP is correctly used, the local
> OS time from your server machine will be the best estimation. If you
> request the time from the namenode machine, you will be penalized by the
> delay of your request.
>
> Regards
>
> Bertrand
>
>
> On Fri, May 17, 2013 at 3:17 PM, Niels Basjes  wrote:
>
> > Hi,
> >
> > > i have another computer (which i have referred to as a server, since it
> > is
> > > running tomcat), and this computer is NOT a part of the hadoop cluster
> > (it
> > > doesn't run any of the hadoop daemons), but does submit jobs to the
> > hadoop
> > > cluster via a JEE webapp interface. i need to check that the time on
> this
> > > computer is in sync with the time on the hadoop cluster. when i say
> > "check
> > > that the time is in sync", there is a defined tolerance/threshold
> > > difference in date/time that i am willing to accept (e.g. the date/time
> > > should be the same down to the minute).
> >
> > If you ensure (using NTP) that all your servers have the same time then
> you
> > can simply query your local server for the time and you have the correct
> > answer to your question.
> >
> > You are searching for a solution in the Hadoop API (where this does not
> > exist) when the solution is present at a different level.
> >
> > --
> > Best regards / Met vriendelijke groeten,
> >
> > Niels Basjes
> >
>


Re: how to get the time of a hadoop cluster, v0.20.2

2013-05-17 Thread Jane Wayne
> You are searching for a solution in the Hadoop API (where this does not
exist)

thanks, that's all i needed to know.

cheers.



On Fri, May 17, 2013 at 9:17 AM, Niels Basjes  wrote:

> Hi,
>
> > i have another computer (which i have referred to as a server, since it
> is
> > running tomcat), and this computer is NOT a part of the hadoop cluster
> (it
> > doesn't run any of the hadoop daemons), but does submit jobs to the
> hadoop
> > cluster via a JEE webapp interface. i need to check that the time on this
> > computer is in sync with the time on the hadoop cluster. when i say
> "check
> > that the time is in sync", there is a defined tolerance/threshold
> > difference in date/time that i am willing to accept (e.g. the date/time
> > should be the same down to the minute).
>
> If you ensure (using NTP) that all your servers have the same time then you
> can simply query your local server for the time and you have the correct
> answer to your question.
>
> You are searching for a solution in the Hadoop API (where this does not
> exist) when the solution is present at a different level.
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>


Re: how to get the time of a hadoop cluster, v0.20.2

2013-05-17 Thread Jane Wayne
> What is meant by 'cluster time'?  and What you want to achieve?

let me try to clarify. i have a hadoop cluster (e.g. name node, data nodes,
job tracker, task trackers, etc...). all the nodes in this hadoop cluster
use ntp to sync time.

i have another computer (which i have referred to as a server, since it is
running tomcat), and this computer is NOT a part of the hadoop cluster (it
doesn't run any of the hadoop daemons), but does submit jobs to the hadoop
cluster via a JEE webapp interface. i need to check that the time on this
computer is in sync with the time on the hadoop cluster. when i say "check
that the time is in sync", there is a defined tolerance/threshold
difference in date/time that i am willing to accept (e.g. the date/time
should be the same down to the minute).

so, using niels link, i can get the time on the "server" (the computer that
is running tomcat and not a part of the hadoop cluster). which solves 1/3
of the problem.
how do i get the time of the hadoop cluster? this is 1/3 of the problem.
the last 1/3 of the problem, for me, is to then take the time on the
"server", denote this as A, the time on the hadoop cluster, denote this as
B, and subtract them,

C = | A - B |

and then i want to see if C < threshold.
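
a minimal sketch of that check. since the hadoop api doesn't expose a cluster
time, it uses the http Date header returned by the namenode web ui (port 50070
in 0.20.x) as a stand-in for B; that is a workaround assumption, not an official
api, and "namenode" below is a placeholder hostname:

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.TimeUnit;

public class ClusterTimeCheck {
  public static void main(String[] args) throws Exception {
    long thresholdMs = TimeUnit.MINUTES.toMillis(1);
    long a = System.currentTimeMillis(); // A: clock on the "server" (tomcat box)
    HttpURLConnection conn =
        (HttpURLConnection) new URL("http://namenode:50070/").openConnection();
    conn.connect();
    long b = conn.getHeaderFieldDate("Date", 0L); // B: namenode's clock per its http response
    conn.disconnect();
    long c = Math.abs(a - b); // C = | A - B |
    System.out.println("|A - B| = " + c + " ms, in sync = " + (c < thresholdMs));
  }
}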

by "cluster time", i am assuming, per my understanding, that the hadoop
cluster (all its nodes), somehow has a notion of "the time" (maybe i'm
wrong). now, i know that having the date/time match exactly, down to the
second or millisecond, across all the hadoop nodes is unlikely
(similar to what you have stated). but, at least, the date/time between the
nodes should be the same down to the minute (i think that's reasonably fair
to expect that condition). but even if that's not the case, that's ok,
because that's not really what i'm trying to check (not my goal to ensure
time sync, as my goal is to probe the date/time from the cluster and
compare it to the "server").

so, is there a way to programmatically (via the hadoop API) get the hadoop
cluster's date/time? or can i get the date/time via the hadoop API from
just the name node or job tracker? (preferably the latter).




On Thu, May 16, 2013 at 12:46 PM, Michael Segel
wrote:

> Uhm... sort of...
>
> Niels is essentially correct and for the most of us, just starting an
> NNTPd on a server that sync's with a government clock and then your local
> servers sync to that... will be enough. However... in more detail...
>
> Time is relative. ;-)
>
> Ok... being a bit more serious...
>
> There are two things you have to consider... What is meant by 'cluster
> time'?  and What you want to achieve?
>
> Each machine in the cluster has its own clock. These will still have a
> certain amount of drift throughout the day.
>
> So you can set up your own NTP server. (You can either run NTPd and sync
> to a known government clock) or you can spend money and buy an atomic clock
> for your servers or machine room.
> (See http://www.atomic-clock.galleon.eu.com/ )
>
> Then periodically throughout the day, via cron, have the machines in your
> machine room sync to the local NTP server.
> This way all of your machines will have the same and correct time.
>
> So this will sync the clocks to a degree, but then drift sets in.
>
> Of course you also need to set up a machine to sync from... my vote would
> be the Name node. ;-)
>
> HTH
>
> -Mike
>
>
> On May 16, 2013, at 10:34 AM, Niels Basjes  wrote:
>
> > If you make sure that everything uses NTP then this becomes an irrelevant
> > distinction.
> >
> >
> > On Thu, May 16, 2013 at 4:01 PM, Jane Wayne  >wrote:
> >
> >> yes, but that gets the current time on the server, not the hadoop
> cluster.
> >> i need to be able to probe the date/time of the hadoop cluster.
> >>
> >>
> >> On Tue, May 14, 2013 at 5:09 PM, Niels Basjes  wrote:
> >>
> >>> I made a typo. I meant API (instead of SPI).
> >>>
> >>> Have a look at this for more information:
> >>>
> >>>
> >>
> http://stackoverflow.com/questions/833768/java-code-for-getting-current-time
> >>>
> >>>
> >>> If you have a client that is not under NTP then that should be the way
> to
> >>> fix your issue.
> >>> Once you  have that getting the current time is easy.
> >>>
> >>> Niels Basjes
> >>>
> >>>
> >>>
> >>> On Tue, May 14, 2013 at 5:46 PM, Jane Wayne  >>>> wrote:
> >>>
> >>>> niels,
> >>>>
> >>>> i'm not familiar with the native java spi.

Re: how to get the time of a hadoop cluster, v0.20.2

2013-05-16 Thread Jane Wayne
yes, but that gets the current time on the server, not the hadoop cluster.
i need to be able to probe the date/time of the hadoop cluster.


On Tue, May 14, 2013 at 5:09 PM, Niels Basjes  wrote:

> I made a typo. I meant API (instead of SPI).
>
> Have a look at this for more information:
>
> http://stackoverflow.com/questions/833768/java-code-for-getting-current-time
>
>
> If you have a client that is not under NTP then that should be the way to
> fix your issue.
> Once you  have that getting the current time is easy.
>
> Niels Basjes
>
>
>
> On Tue, May 14, 2013 at 5:46 PM, Jane Wayne  >wrote:
>
> > niels,
> >
> > i'm not familiar with the native java spi. spi = service provider
> > interface? could you let me know if this spi is part of the hadoop
> > api? if so, which package/class?
> >
> > but yes, all nodes on the cluster are using NTP to synchronize time.
> > however, the server (which is not a part of the hadoop cluster)
> > accessing/interfacing with the hadoop cluster cannot be assumed to be
> > using NTP. will this still make a difference? and actually, this is
> > the primary reason why i need to get the date/time of the hadoop
> > cluster (need to check if the date/time of the hadoop cluster is in
> > sync with the server).
> >
> >
> >
> > On Tue, May 14, 2013 at 11:38 AM, Niels Basjes  wrote:
> > > If you have all nodes using NTP then you can simply use the native Java
> > SPI
> > > to get the current system time.
> > >
> > >
> > > On Tue, May 14, 2013 at 4:41 PM, Jane Wayne  > >wrote:
> > >
> > >> hi all,
> > >>
> > >> is there a way to get the current time of a hadoop cluster via the
> > >> api? in particular, getting the time from the namenode or jobtracker
> > >> would suffice.
> > >>
> > >> i looked at JobClient but didn't see anything helpful.
> > >>
> > >
> > >
> > >
> > > --
> > > Best regards / Met vriendelijke groeten,
> > >
> > > Niels Basjes
> >
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>


Re: how to get the time of a hadoop cluster, v0.20.2

2013-05-14 Thread Jane Wayne
niels,

i'm not familiar with the native java spi. spi = service provider
interface? could you let me know if this spi is part of the hadoop
api? if so, which package/class?

but yes, all nodes on the cluster are using NTP to synchronize time.
however, the server (which is not a part of the hadoop cluster)
accessing/interfacing with the hadoop cluster cannot be assumed to be
using NTP. will this still make a difference? and actually, this is
the primary reason why i need to get the date/time of the hadoop
cluster (need to check if the date/time of the hadoop cluster is in
sync with the server).



On Tue, May 14, 2013 at 11:38 AM, Niels Basjes  wrote:
> If you have all nodes using NTP then you can simply use the native Java SPI
> to get the current system time.
>
>
> On Tue, May 14, 2013 at 4:41 PM, Jane Wayne wrote:
>
>> hi all,
>>
>> is there a way to get the current time of a hadoop cluster via the
>> api? in particular, getting the time from the namenode or jobtracker
>> would suffice.
>>
>> i looked at JobClient but didn't see anything helpful.
>>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes


how to get the time of a hadoop cluster, v0.20.2

2013-05-14 Thread Jane Wayne
hi all,

is there a way to get the current time of a hadoop cluster via the
api? in particular, getting the time from the namenode or jobtracker
would suffice.

i looked at JobClient but didn't see anything helpful.


Re: reg hadoop on AWS

2012-10-05 Thread Jane Wayne
i'm on windows using AWS EMR/EC2. i use the ruby client to manipulate AWS EMR.

1. spawn an EMR cluster. this should return a jobflow id (jobflow-id).

ruby elastic-mapreduce --create --name j-med --alive --num-instances
10 --instance-type c1.medium

2. run a job. you need to describe the job parameters in a JSON file.

ruby elastic-mapreduce --jobflow <jobflow-id> --json job.json

the JSON file might look like this.

[
  {
    "Name": "special word count job",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep":
    {
      "MainClass": "com.company.hadoop.SpecialWordCountJob",
      "Jar": "s3n://hadoop-uri/my-hadoop-jar-0.1.jar",
      "Args":
      [
        "",
        "-Dmapred.reduce.slowstart.completed.maps=0.90",
        "",
        "-Dmapred.input.dir=s3n://hadoop-uri/input_data/",
        "",
        "-Dmapred.output.dir=s3n://hadoop-uri/output/",
        "",
        "-Dmapred.reduce.tasks=10"
      ]
    }
  }
]

two lines of code + a JSON configuration file specifying input/output
(and other mapred) parameter values. this assumes you got a AWS
account, downloaded and processed all the access keys, installed ruby,
installed the aws ruby client, etc. there's a series of non-trivial
steps to go through before you can just keep reusing this 2-step
approach (spawn EMR cluster, then submit hadoop job).

imho, i'm not surprised you're asking, because the aws emr
documentation is scattered. the information is there, but you have to
dig and make sense of it yourself. i'm not sure why amazon claims aws
is a billion+ dollar operation, yet their documentation is weak and
not cohesive (and their support forums are horrible and their aws
evangelists are anything but evangelists).

On Fri, Oct 5, 2012 at 7:15 AM, sudha sadhasivam
 wrote:
> Sir
> We tried to set up hadoop on AWS. The procedure is given. We face problems with
> the parameters needed for input and output files. Can somebody provide us 
> with a sample exercise with steps for working on hadoop in AWS?
> thanking you
> Dr G Sudha


Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Jane Wayne
thanks. those issues pointed out do cover the pain points i'm experiencing.

On Wed, Sep 26, 2012 at 3:11 PM, Harsh J  wrote:
> Also read: http://arxiv.org/abs/1209.2191 ;-)
>
> On Thu, Sep 27, 2012 at 12:24 AM, Bertrand Dechoux  wrote:
>> I wouldn't be so surprised. It takes time, energy and money to solve problems
>> and make solutions that would be prod-ready. A few people would consider
>> that the namenode/secondary spof is a limit for Hadoop itself in order to
>> go into a critical production environment. (I am only quoting it and
>> don't want to start a discussion about it.)
>>
>> One paper that I heard about (but didn't have the time to read as of now)
>> might be related to your problem space
>> http://arxiv.org/abs/1110.4198
>> But research paper does not mean prod ready for tomorrow.
>>
>> http://research.google.com/archive/mapreduce.html is from 2004.
>> and http://research.google.com/pubs/pub36632.html (dremel) is from 2010.
>>
>> Regards
>>
>> Bertrand
>>
>> On Wed, Sep 26, 2012 at 8:18 PM, Jane Wayne wrote:
>>
>>> jay,
>>>
>>> thanks. i just needed a sanity check. i hope and expect that one day,
>>> hadoop will mature towards supporting a "shared-something" approach.
>>> the web service call is not a bad idea at all. that way, we can
>>> abstract what that ultimate data store really is.
>>>
>>> i'm just a little surprised that we are still in the same state with
>>> hadoop in regards to this issue (there are probably higher priorities)
>>> and that no research (that i know of) has come out of academia to
>>> mitigate some of these limitations of hadoop (where's all the funding
>>> to hadoop/mapreduce research gone to if this framework is the
>>> fundamental building block of a vast amount of knowledge mining
>>> activities?).
>>>
>>> On Wed, Sep 26, 2012 at 12:40 PM, Jay Vyas  wrote:
>>> > The reason this is so rare is that the nature of map/reduce tasks is that
>>> > they are orthogonal  i.e. the word count, batch image recognition, tera
>>> > sort -- all the things hadoop is famous for are largely orthogonal tasks.
>>> > Its much more rare (i think) to see people using hadoop to do traffic
>>> > simulations or solve protein folding problems... Because those tasks
>>> > require continuous signal integration.
>>> >
>>> > 1) First, try to consider rewriting it so that all communication is
>>> replaced
>>> > by state variables in a reducer, and choose your keys wisely, so that all
>>> > "communication" between machines is obviated by the fact that a single
>>> > reducer is receiving all the information relevant for it to do its task.
>>> >
>>> > 2) If a small amount of state needs to be preserved or cached in real
>>> time
>>> > to optimize the situation where two machines don't have to redo the
>>> > same task (i.e. invoke a web service to get a piece of data, or some
>>> other
>>> > task that needs to be rate limited and not duplicated) then you can use a
>>> > fast key value store (like you suggested) like the ones provided by
>>> basho (
>>> > http://basho.com/) or amazon (Dynamo).
>>> >
>>> > 3) If you really need a lot of message passing, then you might be
>>> > better off using an inherently more integrated tool like GridGain... which
>>> > allows for sophisticated message passing between asynchronously running
>>> > processes, i.e.
>>> >
>>> http://gridgaintech.wordpress.com/2011/01/26/distributed-actors-in-gridgain/
>>> .
>>> >
>>> >
>>> > It seems like there might not be a reliable way to implement a
>>> > sophisticated message passing architecture in hadoop, because the system
>>> is
>>> > inherently so dynamic, and is built for rapid streaming reads/writes,
>>> which
>>> > would be stifled by significant communication overhead.
>>>
>>
>>
>>
>> --
>> Bertrand Dechoux
>
>
>
> --
> Harsh J


Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Jane Wayne
i'll look for myself, but could you please let me know what giraph is?
is it another layer on hadoop like hive/pig or an api like mahout?



On Wed, Sep 26, 2012 at 12:09 PM, Jonathan Bishop  wrote:
> Yes, Giraph seems like the best way to go - it is mainly a vertex
> evaluation with message passing between vertices. Synchronization is
> handled for you.
>
> On Wed, Sep 26, 2012 at 8:36 AM, Jane Wayne wrote:
>
>> hi,
>>
>> i know that some algorithms cannot be parallelized and adapted to the
>> mapreduce paradigm. however, i have noticed that in most cases where i
>> find myself struggling to express an algorithm in mapreduce, the
>> problem is mainly due to no ability to cross-communicate between
>> mappers or reducers.
>>
>> one naive approach i've seen mentioned here and elsewhere, is to use a
>> database to store data for use by all the mappers. however, i have
>> seen many arguments (that i agree with largely) against this approach.
>>
>> in general, my question is this: has anyone tried to implement an
>> algorithm using mapreduce where mappers required cross-communications?
>> how did you solve this limitation of mapreduce?
>>
>> thanks,
>>
>> jane.
>>


Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Jane Wayne
my problem is more general (than graph problems) and doesn't need to
have logic built around synchronization or failure. for example, when
a mapper is finished successfully, it just writes/persists to a
storage location (could be disk, could be database, could be memory,
etc...). when the next input is processed (could be on the same mapper
or different mapper), i just need to do a lookup from the storage
location (that is accessible by all task nodes). if the mapper fails,
this doesn't hurt my processing, although i would like for no failures
(and it's good if hadoop can spawn another task to mitigate).
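
a bare-bones sketch of what i mean. SharedStore here is a made-up interface, and
the in-memory implementation is only so the example compiles; on a real cluster
it would have to be an external service (database, key-value store, etc.) that
every task node can reach:

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

  interface SharedStore {
    String get(String key);
    void put(String key, String value);
  }

  // stand-in so the sketch compiles; a single-JVM map is NOT shared across mappers
  static class InMemoryStore implements SharedStore {
    private final Map<String, String> map = new ConcurrentHashMap<String, String>();
    public String get(String key) { return map.get(key); }
    public void put(String key, String value) { map.put(key, value); }
  }

  private SharedStore store;

  @Override
  protected void setup(Context context) {
    store = new InMemoryStore(); // would be a client for the real external store
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String key = line.toString();
    String previous = store.get(key);     // look up what earlier inputs persisted
    String result = (previous == null) ? key : previous + "|" + key;
    store.put(key, result);               // persist for later inputs (possibly other mappers)
    context.write(new Text(key), new Text(result));
  }
}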



On Wed, Sep 26, 2012 at 11:43 AM, Bertrand Dechoux  wrote:
> The difficulty with data transfer between tasks is handling synchronisation
> and failure.
> You may want to look at graph processing done on top of Hadoop (like
> Giraph).
> That's one way to do it but whether it is relevant or not to you will
> depend on your context.
>
> Regards
>
> Bertrand
>
> On Wed, Sep 26, 2012 at 5:36 PM, Jane Wayne wrote:
>
>> hi,
>>
>> i know that some algorithms cannot be parallelized and adapted to the
>> mapreduce paradigm. however, i have noticed that in most cases where i
>> find myself struggling to express an algorithm in mapreduce, the
>> problem is mainly due to no ability to cross-communicate between
>> mappers or reducers.
>>
>> one naive approach i've seen mentioned here and elsewhere, is to use a
>> database to store data for use by all the mappers. however, i have
>> seen many arguments (that i agree with largely) against this approach.
>>
>> in general, my question is this: has anyone tried to implement an
>> algorithm using mapreduce where mappers required cross-communications?
>> how did you solve this limitation of mapreduce?
>>
>> thanks,
>>
>> jane.
>>
>
>
>
> --
> Bertrand Dechoux


Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines

2012-06-01 Thread Jane Wayne
Sandeep,

How are you guys moving 100 TB into the AWS cloud? Are you using S3 or
EBS? If you are using S3, it does not work like HDFS. Although data is
replicated (I believe within an availability zone) in S3, it is not
the same as HDFS replication. You lose the data locality optimization
feature of Hadoop when you use S3, which runs counter to the "sending
code to data" paradigm of MapReduce. Mind you, traffic in/out of S3
equates to costs incurred as well (when you lose data locality
optimization).

I hear that to get PBs worth of data into AWS, it is not uncommon to
drive a truck with your data on some physical storage device (in fact,
Amazon will help you do this).

Please update us, this is an interesting problem.

Thanks,

On Thu, May 31, 2012 at 2:41 PM, Sandeep Reddy P
 wrote:
> Hi,
> We are getting 100TB of data with replication factor of 3 this goes to
> 300TB of data. We are planning to use hadoop with 65nodes. We want to know
> which option will be better in terms of hardware either physical Machines
> or deploy hadoop on EC2. Is there any document that supports use of
> physical machines.
> Hardware specs:  2 quad core cpu, 32 Gb Ram, 12*1 Tb hard drives , 10Gb
> Ethernet Switches costs $10k for each machine. Is that cheaper to use EC2
> ?? will there be any performance issues??
> --
> Thanks,
> sandeep


Re: how do i view the local file system output of a mapper on cygwin + windows?

2012-04-05 Thread Jane Wayne
i found out what my problem was. apparently, when you iterate over an
Iterable<Type> of values, the same instance of Type is being reused over and
over. for example, in my reducer,

public void reduce(Key key, Iterable<Value> values, Context context)
throws IOException, InterruptedException {
  Iterator<Value> it = values.iterator();
  Value a = it.next();
  Value b = it.next();
}

the variables, a and b of type Value, will be the same object
instance! i suppose this behavior of the iterator is to optimize
iterating so as to avoid the new operator.
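
so if two (or more) values are needed at the same time, they have to be copied
out of the iterator. a small sketch using WritableUtils.clone, with Text standing
in for whatever Writable value class the job really uses (and assuming at least
two values per key):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapreduce.Reducer;

public class CopyingReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Iterator<Text> it = values.iterator();
    // clone each value before holding on to it; without the clone,
    // a and b would end up pointing at the same reused object
    Text a = WritableUtils.clone(it.next(), context.getConfiguration());
    Text b = WritableUtils.clone(it.next(), context.getConfiguration());
    context.write(key, a);
    context.write(key, b);
  }
}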



On Thu, Apr 5, 2012 at 4:55 PM, Jane Wayne  wrote:
> i am currently testing my map reduce job on Windows + Cygwin + Hadoop
> v0.20.205. for some strange reason, the list of values (i.e.
> Iterable values) going into the reducer looks all wrong. i have
> tracked the map reduce process with logging statements (i.e. logged
> the input to the map, logged the output from the map, logged the
> partitioner, logged the input to the reducer). at all stages,
> everything looks correct except at the reducer.
>
> is there any way (using Windows + Cygwin) to view the local map
> outputs before they are shuffled/sorted to the reducer? i need to know
> why the values are incorrect.


how do i view the local file system output of a mapper on cygwin + windows?

2012-04-05 Thread Jane Wayne
i am currently testing my map reduce job on Windows + Cygwin + Hadoop
v0.20.205. for some strange reason, the list of values (i.e.
Iterable values) going into the reducer looks all wrong. i have
tracked the map reduce process with logging statements (i.e. logged
the input to the map, logged the output from the map, logged the
partitioner, logged the input to the reducer). at all stages,
everything looks correct except at the reducer.

is there any way (using Windows + Cygwin) to view the local map
outputs before they are shuffled/sorted to the reducer? i need to know
why the values are incorrect.


Re: how to fine tuning my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)

2012-04-04 Thread Jane Wayne
serge, i specify 15 instances, but only 14 end up being data/task
nodes. 1 instance is reserved as the name node (job tracker).

On Wed, Apr 4, 2012 at 1:17 PM, Serge Blazhievsky
 wrote:
> How many datanodes do you use for your job?
>
> On 4/3/12 8:11 PM, "Jane Wayne"  wrote:
>
>>i don't have the option of setting the map heap size to 2 GB since my
>>real environment is AWS EMR and the constraints are set.
>>
>>http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html this
>>link is where i am currently reading on the meaning of io.sort.factor
>>and io.sort.mb.
>>
>>it seems io.sort.mb tunes the map tasks and io.sort.factor tunes the
>>shuffle/reduce task. am i correct to say then that io.sort.factor is
>>not relevant here (yet, anyway)? since i don't really make it to the
>>reduce phase (except for only a very small data size).
>>
>>in that link above, here is the description for, io.sort.mb:  The
>>cumulative size of the serialization and accounting buffers storing
>>records emitted from the map, in megabytes. there's a paragraph above
>>the table that says this value is simply the threshold that triggers a sort
>>and spill to the disk. furthermore, it says, "If either buffer fills
>>completely while the spill is in progress, the map thread will block,"
>>which is what i believe is happening in my case.
>>
>>this sentence concerns me, "Minimizing the number of spills to disk
>>can decrease map time, but a larger buffer also decreases the memory
>>available to the mapper." to minimize the number of spills, you need a
>>larger buffer; however, this statement seems to suggest to NOT
>>minimize the number of spills; a) you will not decrease map time, b)
>>you will not decrease the memory available to the mapper. so, in your
>>advice below, you say to increase, but i may actually want to decrease
>>the value for io.sort.mb. (if i understood the documentation
>>correctly.)
>>
>>it seems these three map tuning parameters, io.sort.mb,
>>io.sort.record.percent, and io.sort.spill.percent are a pain-point
>>trading off between speed and memory. to me, if you set them high,
>>more serialized data + metadata are stored in memory before a spill
>>(an I/O operation is performed). you also get less merges (less I/O
>>operation?), but the negatives are blocking map operations and more
>>memory requirements. if you set them low, there are more frequent
>>spills (more I/O operations), but less memory requirements. it just
>>seems like no matter what you do, you are stuck: you may stall the
>>mapper if the values are high because of the amount of time required
>>to spill an enormous amount of data; you may stall the mapper if the
>>values are low because of the amount of I/O operations required
>>(spill/merge).
>>
>>i must be understanding something wrong here because everywhere i
>>read, hadoop is supposed to be #1 at sorting. but here, in dealing
>>with the intermediary key-value pairs, in the process of sorting,
>>mappers can stall for any number of reasons.
>>
>>does anyone know any competitive dynamic hadoop clustering service
>>like AWS EMR? the reason why i ask is because AWS EMR does not use
>>HDFS (it uses S3), and therefore, data locality is not possible. also,
>>i have read the TCP protocol is not efficient for network transfers;
>>if the S3 node and task nodes are far, this distance will certainly
>>exacerbate the situation of slow speed. it seems there are a lot of
>>factors working against me.
>>
>>any help is appreciated.
>>
>>On Tue, Apr 3, 2012 at 7:48 AM, Bejoy Ks  wrote:
>>>
>>> Jane,
>>>       From my first look, properties that can help you could be
>>> - Increase io sort factor to 100
>>> - Increase io.sort.mb to 512Mb
>>> - increase map task heap size to 2GB.
>>>
>>> If the task still stalls, try providing lesser input for each mapper.
>>>
>>> Regards
>>> Bejoy KS
>>>
>>> On Tue, Apr 3, 2012 at 2:08 PM, Jane Wayne 
>>>wrote:
>>>
>>> > i have a map reduce job that is generating a lot of intermediate
>>>key-value
>>> > pairs. for example, when i am 1/3 complete with my map phase, i may
>>>have
>>> > generated over 130,000,000 output records (which is about 9
>>>gigabytes). to
>>> > get to the 1/3 complete mark is very fast (less than 10 minutes), but
>>>at
>>> > the 1/3 complete mark, it seems to stall. when i look at the counter
>>>logs,

Re: how to fine tuning my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)

2012-04-03 Thread Jane Wayne
i don't have the option of setting the map heap size to 2 GB since my
real environment is AWS EMR and the constraints are set.

http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html this
link is where i am currently reading on the meaning of io.sort.factor
and io.sort.mb.

it seems io.sort.mb tunes the map tasks and io.sort.factor tunes the
shuffle/reduce task. am i correct to say then that io.sort.factor is
not relevant here (yet, anyway)? since i don't really make it to the
reduce phase (except for only a very small data size).

in that link above, here is the description for, io.sort.mb:  The
cumulative size of the serialization and accounting buffers storing
records emitted from the map, in megabytes. there's a paragraph above
the table that says this value is simply the threshold that triggers a sort
and spill to the disk. furthermore, it says, "If either buffer fills
completely while the spill is in progress, the map thread will block,"
which is what i believe is happening in my case.

this sentence concerns me, "Minimizing the number of spills to disk
can decrease map time, but a larger buffer also decreases the memory
available to the mapper." to minimize the number of spills, you need a
larger buffer; however, this statement seems to suggest to NOT
minimize the number of spills; a) you will not decrease map time, b)
you will not decrease the memory available to the mapper. so, in your
advice below, you say to increase, but i may actually want to decrease
the value for io.sort.mb. (if i understood the documentation
correctly.)

it seems these three map tuning parameters, io.sort.mb,
io.sort.record.percent, and io.sort.spill.percent are a pain-point
trading off between speed and memory. to me, if you set them high,
more serialized data + metadata are stored in memory before a spill
(an I/O operation is performed). you also get less merges (less I/O
operation?), but the negatives are blocking map operations and more
memory requirements. if you set them low, there are more frequent
spills (more I/O operations), but less memory requirements. it just
seems like no matter what you do, you are stuck: you may stall the
mapper if the values are high because of the amount of time required
to spill an enormous amount of data; you may stall the mapper if the
values are low because of the amount of I/O operations required
(spill/merge).
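
for concreteness, the map-side knobs being discussed are set like any other job
property, e.g. on the command line of a 0.20 job that uses ToolRunner /
GenericOptionsParser. the jar, class and bucket names are placeholders and the
numbers are just values to experiment with, not recommendations:

hadoop jar my-job.jar com.example.MyJob \
  -Dio.sort.mb=200 \
  -Dio.sort.factor=100 \
  -Dio.sort.spill.percent=0.90 \
  -Dio.sort.record.percent=0.10 \
  s3n://my-bucket/input/ s3n://my-bucket/output/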

i must be understanding something wrong here because everywhere i
read, hadoop is supposed to be #1 at sorting. but here, in dealing
with the intermediary key-value pairs, in the process of sorting,
mappers can stall for any number of reasons.

does anyone know any competitive dynamic hadoop clustering service
like AWS EMR? the reason why i ask is because AWS EMR does not use
HDFS (it uses S3), and therefore, data locality is not possible. also,
i have read the TCP protocol is not efficient for network transfers;
if the S3 node and task nodes are far, this distance will certainly
exacerbate the situation of slow speed. it seems there are a lot of
factors working against me.

any help is appreciated.

On Tue, Apr 3, 2012 at 7:48 AM, Bejoy Ks  wrote:
>
> Jane,
>       From my first look, properties that can help you could be
> - Increase io sort factor to 100
> - Increase io.sort.mb to 512Mb
> - increase map task heap size to 2GB.
>
> If the task still stalls, try providing lesser input for each mapper.
>
> Regards
> Bejoy KS
>
> On Tue, Apr 3, 2012 at 2:08 PM, Jane Wayne  wrote:
>
> > i have a map reduce job that is generating a lot of intermediate key-value
> > pairs. for example, when i am 1/3 complete with my map phase, i may have
> > generated over 130,000,000 output records (which is about 9 gigabytes). to
> > get to the 1/3 complete mark is very fast (less than 10 minutes), but at
> > the 1/3 complete mark, it seems to stall. when i look at the counter logs,
> > i do not see any logging of spilling yet. however, on the web job UI, i see
> > that FILE_BYTES_WRITTEN and Spilled Records keeps increasing. needless to
> > say, i have to dig deeper to see what is going on.
> >
> > my question is, how do i fine tune my map reduce job with the above
> > properties? namely, the property of generating a lot of intermediate
> > key-value pairs? it seems the I/O operations are negatively impacting the
> > job speed. there are so many map- and reduce-side tuning properties (see
> > Tom White, Hadoop, 2nd edition, pp 181-182), i am a little unsure about
> > just how to approach the tuning parameters. since the slow down is
> > happening during the map-phase/task, i assume i should narrow down on the
> > map-side tuning properties.
> >
> > by the way, i am using the CPU-intensive c1.medium instances of amazon web
> > service's (AWS) elastic map reduce (EMR) on hadoop v0.20. a c

how to fine tuning my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)

2012-04-03 Thread Jane Wayne
i have a map reduce job that is generating a lot of intermediate key-value
pairs. for example, when i am 1/3 complete with my map phase, i may have
generated over 130,000,000 output records (which is about 9 gigabytes). to
get to the 1/3 complete mark is very fast (less than 10 minutes), but at
the 1/3 complete mark, it seems to stall. when i look at the counter logs,
i do not see any logging of spilling yet. however, on the web job UI, i see
that FILE_BYTES_WRITTEN and Spilled Records keeps increasing. needless to
say, i have to dig deeper to see what is going on.

my question is, how do i fine tune my map reduce job with the above
properties? namely, the property of generating a lot of intermediate
key-value pairs? it seems the I/O operations are negatively impacting the
job speed. there are so many map- and reduce-side tuning properties (see
Tom White, Hadoop, 2nd edition, pp 181-182), i am a little unsure about
just how to approach the tuning parameters. since the slow down is
happening during the map-phase/task, i assume i should narrow down on the
map-side tuning properties.

by the way, i am using the CPU-intensive c1.medium instances of amazon web
service's (AWS) elastic map reduce (EMR) on hadoop v0.20. a compute node
has 2 mappers, 1 reducers, and 384 MB JVM memory per task. this instance
type is documented to have moderate I/O performance.

any help on fine tuning my particular map reduce job is appreciated.


Re: how to unit test my RawComparator

2012-03-31 Thread Jane Wayne
chris,

1. thanks, that approach to converting my custom key to byte[] works.

2. on the issue of pass by reference or pass by value, (it's been a while
since i've visited this issue), i'm pretty sure java is pass by value
(regardless of whether the parameters are primitives or objects). when i
put the code into debugger, the ids of byte[] b1 and byte[] b2 are equal.
if this is indeed the same byte array, why not just pass it as one
parameter instead of two? unless in some cases, b1 and b2 are not the same.
this second issue is not terribly too important, because the interface
defines two byte arrays to be passed in, and so there's not much i (we) can
do about it.

thanks for the help!
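
for reference, a rough sketch of a test along the lines chris describes. MyKey
(a Writable with a hypothetical string constructor) and MyKeyComparator (the
RawComparator under test) are placeholders for the real classes; the junk
padding bytes are there to force the comparator to respect the offsets:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class MyKeyComparatorTest {

  private byte[] serialize(int padding, MyKey key) throws Exception {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(baos);
    for (int i = 0; i < padding; i++) {
      dos.writeByte(0xFF); // junk so the comparator must honor s1/s2
    }
    key.write(dos);
    dos.flush();
    return baos.toByteArray();
  }

  @Test
  public void compareRespectsOffsets() throws Exception {
    MyKey smaller = new MyKey("a");
    MyKey larger = new MyKey("b");
    int pad1 = 3, pad2 = 7; // different offsets on purpose
    byte[] b1 = serialize(pad1, smaller);
    byte[] b2 = serialize(pad2, larger);
    MyKeyComparator comparator = new MyKeyComparator();
    int result = comparator.compare(b1, pad1, b1.length - pad1,
                                    b2, pad2, b2.length - pad2);
    assertTrue(result < 0);
  }
}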

On Sat, Mar 31, 2012 at 5:18 PM, Chris White wrote:

> You can serialize your Writables to a ByteArrayOutputStream and then
> get its underlying byte array:
>
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> DataOutputStream dos = new DataOutputStream(baos);
> Writable myWritable = new Text("text");
> myWritable.write(dos);
> byte[] bytes = baos.toByteArray();
>
> I would recommend writing a few bytes to the DataOutputStream first -
> i always forget to respect the offset variables (s1 / s2), and this,
> depending on how well you write your unit test, should allow you to
> test that you are respecting them.
>
> The huge byte arrays store the other Writables in the stream that are
> about to be run through the comparator.
>
> Finally, arrays in java are objects, so you're passing a reference to
> a byte array, not making a copy of the array.
>
> Chris
>
> On Sat, Mar 31, 2012 at 12:23 AM, Jane Wayne 
> wrote:
> > i have a RawComparator that i would like to unit test (using mockito and
> > mrunit testing packages). i want to test the method,
> >
> > public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)
> >
> > how do i convert my custom key into a byte[] array? is there a util class
> > to help me do this?
> >
> > also, when i put the code into the debugger, i notice that the byte[]
> > arrays (b1 and b2) are HUGE (the lengths of each array are huge, in the
> > thousands). what is actually in these byte[] arrays? intuitively, it does
> > not seem like these byte[] arrays only represent my keys.
> >
> > lastly, why are such huge byte[] arrays being passed around? one would
> > think that since Java is pass-by-value, there would be a large overhead
> > with passing such large byte arrays around.
> >
> > your help is appreciated.
>


Re: what is the code for WritableComparator.readVInt and WritableUtils.decodeVIntSize doing?

2012-03-31 Thread Jane Wayne
chris,

thanks. i see now.

internally, i use String instead of Text and so I use
WritableUtils.writeString(...) and not Text.write(...). in the latter
method, i see that it calls WritableUtils.writeVInt(...) before
out.write(byte[], start, length).

tom white uses Text internally to represent strings (which is maybe what i
should do), so his example is correct and works. i think i was just
confusing myself.

thanks for the last paragraph too, that really helped a lot.
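
to summarize what i now understand (for the archives), here is a rough sketch
of a compare() over keys whose leading field is written the way Text writes
itself (a vint length header followed by the utf-8 bytes). this mirrors the
book's example rather than my actual key class.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

public class FirstTextFieldComparator extends WritableComparator {

  protected FirstTextFieldComparator() {
    super(Text.class);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
      // decodeVIntSize: how many bytes the length header itself occupies (1-5)
      // readVInt: the value of that header, i.e. how many text bytes follow
      int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
      int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
      // firstL1/firstL2 = header bytes + text bytes = total size of the field
      return compareBytes(b1, s1, firstL1, b2, s2, firstL2);
    } catch (IOException e) {
      throw new IllegalArgumentException(e);
    }
  }
}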

On Sat, Mar 31, 2012 at 1:17 PM, Chris White wrote:

> A text object is written out as a vint representing the number of bytes and
> then the byte array contents of the text object
>
> Because a vint can be between 1 and 5 bytes in length, the decodeVIntSize
> method examines the first byte of the vint to work out how many bytes to
> skip over before the text bytes start.
>
> readVInt then actually reads the vint bytes to get the length of the
> following byte array.
>
> So when you call the compareBytes method you need to pass in where the
> actual bytes start (s1 + vIntLen) and how many bytes to compare (vint)
> On Mar 31, 2012 12:38 AM, "Jane Wayne"  wrote:
>
> > in tom white's book, Hadoop, The Definitive Guide, in the second edition,
> > on page 99, he shows how to compare the raw bytes of a key with Text
> > fields. he shows an example like the following.
> >
> > int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
> > int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
> >
> > his explanation is that firstL1 is the length of the first String/Text in
> > b1, and firstL2 is the length of the first String/Text in b2. but i'm
> > unsure of what the code is actually doing.
> >
> > what is WritableUtils.decodeVIntSize(...) doing?
> > what is WritableComparator.readVInt(...) doing?
> > why do we have to add the outputs of these 2 methods to get the length of
> > the String/Text?
> >
> > could someone please explain in plain terms what's happening here? it
> seems
> > WritableComparator.readVInt(...) is already getting the length of the
> > byte[] corresponding to the string. it seems
> > WritableUtils.decodeVIntSize(...) is also doing the same thing (from
> > reading the javadoc).
> >
> > when i look at WritableUtils.writeString(...), two things happen. the
> > length of the byte[] is written, followed by writing the byte[] itself.
> why
> > can't we simply do something like the following to get the length?
> >
> > int firstL1 = readInt(b1[s1]);
> > int firstL2 = readInt(b2[s2]);
> >
>


what is the code for WritableComparator.readVInt and WritableUtils.decodeVIntSize doing?

2012-03-30 Thread Jane Wayne
in tom white's book, Hadoop, The Definitive Guide, in the second edition,
on page 99, he shows how to compare the raw bytes of a key with Text
fields. he shows an example like the following.

int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);

his explanation is that firstL1 is the length of the first String/Text in
b1, and firstL2 is the length of the first String/Text in b2. but i'm
unsure of what the code is actually doing.

what is WritableUtils.decodeVIntSize(...) doing?
what is WritableComparator.readVInt(...) doing?
why do we have to add the outputs of these 2 methods to get the length of
the String/Text?

could someone please explain in plain terms what's happening here? it seems
WritableComparator.readVInt(...) is already getting the length of the
byte[] corresponding to the string. it seems
WritableUtils.decodeVIntSize(...) is also doing the same thing (from
reading the javadoc).

when i look at WritableUtils.writeString(...), two things happen. the
length of the byte[] is written, followed by writing the byte[] itself. why
can't we simply do something like the following to get the length?

int firstL1 = readInt(b1[s1]);
int firstL2 = readInt(b2[s2]);


Re: where are my logging output files going to?

2012-03-28 Thread Jane Wayne
what do you mean by an edge node? do you mean any node that is not the
master node (or NameNode or JobTracker node)?

On Wed, Mar 28, 2012 at 3:51 AM, Michel Segel wrote:

> First you really don't want to launch the job from the cluster but from an
> edge node.
>
> To answer your question, in a word, yes: you should keep as consistent a
> set of configuration files across nodes as possible, noting that over time
> this may not be possible as hardware configs may change.
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Mar 27, 2012, at 8:42 PM, Jane Wayne  wrote:
>
> > if i have a hadoop cluster of 10 nodes, do i have to modify the
> > /hadoop/conf/log4j.properties files on ALL 10 nodes to be the same?
> >
> > currently, i ssh into the master node to execute a job. this node is the
> > only place where i have modified the log4j.properties file. i notice that
> > although my log files are being created, nothing is being written to
> them.
> > when i test on cygwin, the logging works, however, when i go to a live
> > cluster (i.e. amazon elastic mapreduce), the logging output on the master
> > node no longer works. i wonder if logging is happening at each slave/task
> > node?
> >
> > could someone explain logging or point me to the documentation discussing
> > this issue?
>


Re: how can i increase the number of mappers?

2012-03-21 Thread Jane Wayne
if anyone is facing the same problem, here's what i did. i took anil's
advice to use NLineInputFormat (because that approach would scale out my
mappers).

however, i am using the new mapreduce package/API in hadoop v0.20.2, and i
notice that the NLineInputFormat from the old package/API (mapred) cannot be
used with a new-API job.

when i took a look at hadoop v1.0.1, there is a NLineInputFormat class for
the new API. i simply copied and pasted this file into my project. i got 4
errors associated with import statements and annotations. when i removed
the 2 import statements and corresponding 2 annotations, the class compiled
successfully. after this modification, the v1.0.1 NLineInputFormat runs fine
on a cluster based on v0.20.2.

one mini-problem solved, many more to go.

thanks for the help.
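
for anyone else doing the same thing, the job setup ends up looking roughly
like the sketch below. i am assuming the copied class keeps the same static
helper as the v1.0.1 original (setNumLinesPerSplit); on v0.20.2 the import
would point at wherever you pasted the class into your own project.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
// ships with the new API from hadoop v1.0.x; on v0.20.2, import your copied class instead
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class MatrixJobSetup {
  public static Job build(Configuration conf) throws Exception {
    Job job = new Job(conf, "matrix job");
    // each split gets 500 lines, so a 10,000-row matrix yields roughly 20 map tasks
    NLineInputFormat.setNumLinesPerSplit(job, 500);
    job.setInputFormatClass(NLineInputFormat.class);
    return job;
  }
}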

On Wed, Mar 21, 2012 at 3:33 AM, Jane Wayne wrote:

> as i understand, that class does not exist for new API in hadoop v0.20.2
> (which is what i am using). if i am mistaken, where is it?
>
> i am looking at hadoop v1.0.1, and there is a NLineInputFormat class. i
> wonder if i can simply copy/paste this into my project.
>
>
> On Wed, Mar 21, 2012 at 2:37 AM, Anil Gupta  wrote:
>
>> Have a look at NLineInputFormat class in Hadoop. That class will solve
>> your purpose.
>>
>> Best Regards,
>> Anil
>>
>> On Mar 20, 2012, at 11:07 PM, Jane Wayne 
>> wrote:
>>
>> > i have a matrix that i am performing operations on. it is 10,000 rows by
>> > 5,000 columns. the total size of the file is just under 30 MB. my HDFS
>> > block size is set to 64 MB. from what i understand, the number of
>> mappers
>> > is roughly equal to the number of HDFS blocks used in the input. i.e.
>> if my
>> > input data spans 1 block, then only 1 mapper is created, if my data
>> spans 2
>> > blocks, then 2 mappers will be created, etc...
>> >
>> > so, with my 1 matrix file of 15 MB, this won't fill up a block of data,
>> and
>> > being as such, only 1 mapper will be called upon the data. is this
>> > understanding correct?
>> >
>> > if so, what i want to happen is for more than one mapper (let's say 10)
>> to
>> > work on the data, even though it remains on 1 block. my analysis (or
>> > map/reduce job) is such that +1 mappers can work on different parts of
>> the
>> > matrix. for example, mapper 1 can work on the first 500 rows, mapper 2
>> can
>> > work on the next 500 rows, etc... how can i set up multiple mappers (+1
>> > mapper) to work on a file that resides only one block (or a file whose
>> size
>> > is smaller than the HDFS block size).
>> >
>> > can i split the matrix into (let's say) 10 files? that will mean 30 MB
>> / 10
>> > = 3 MB per file. then put each 3 MB file onto HDFS ? will this increase
>> the
>> > chance of having multiple mappers work simultaneously on the
>> data/matrix?
>> > if i can increase the number of mappers, i think (pretty sure) my
>> > implementation will improve in speed linearly.
>> >
>> > any help is appreciated.
>>
>
>


Re: how can i increase the number of mappers?

2012-03-21 Thread Jane Wayne
as i understand, that class does not exist for new API in hadoop v0.20.2
(which is what i am using). if i am mistaken, where is it?

i am looking at hadoop v1.0.1, and there is a NLineInputFormat class. i
wonder if i can simply copy/paste this into my project.

On Wed, Mar 21, 2012 at 2:37 AM, Anil Gupta  wrote:

> Have a look at NLineInputFormat class in Hadoop. That class will solve
> your purpose.
>
> Best Regards,
> Anil
>
> On Mar 20, 2012, at 11:07 PM, Jane Wayne  wrote:
>
> > i have a matrix that i am performing operations on. it is 10,000 rows by
> > 5,000 columns. the total size of the file is just under 30 MB. my HDFS
> > block size is set to 64 MB. from what i understand, the number of mappers
> > is roughly equal to the number of HDFS blocks used in the input. i.e. if
> my
> > input data spans 1 block, then only 1 mapper is created, if my data
> spans 2
> > blocks, then 2 mappers will be created, etc...
> >
> > so, with my 1 matrix file of 15 MB, this won't fill up a block of data,
> and
> > being as such, only 1 mapper will be called upon the data. is this
> > understanding correct?
> >
> > if so, what i want to happen is for more than one mapper (let's say 10)
> to
> > work on the data, even though it remains on 1 block. my analysis (or
> > map/reduce job) is such that +1 mappers can work on different parts of
> the
> > matrix. for example, mapper 1 can work on the first 500 rows, mapper 2
> can
> > work on the next 500 rows, etc... how can i set up multiple mappers (+1
> > mapper) to work on a file that resides only one block (or a file whose
> size
> > is smaller than the HDFS block size).
> >
> > can i split the matrix into (let's say) 10 files? that will mean 30 MB /
> 10
> > = 3 MB per file. then put each 3 MB file onto HDFS ? will this increase
> the
> > chance of having multiple mappers work simultaneously on the data/matrix?
> > if i can increase the number of mappers, i think (pretty sure) my
> > implementation will improve in speed linearly.
> >
> > any help is appreciated.
>


how can i increase the number of mappers?

2012-03-20 Thread Jane Wayne
i have a matrix that i am performing operations on. it is 10,000 rows by
5,000 columns. the total size of the file is just under 30 MB. my HDFS
block size is set to 64 MB. from what i understand, the number of mappers
is roughly equal to the number of HDFS blocks used in the input. i.e. if my
input data spans 1 block, then only 1 mapper is created, if my data spans 2
blocks, then 2 mappers will be created, etc...

so, with my 1 matrix file of just under 30 MB, this won't fill up a block of
data, and as such, only 1 mapper will be called upon the data. is this
understanding correct?

if so, what i want to happen is for more than one mapper (let's say 10) to
work on the data, even though it remains on 1 block. my analysis (or
map/reduce job) is such that multiple mappers can work on different parts of
the matrix. for example, mapper 1 can work on the first 500 rows, mapper 2 can
work on the next 500 rows, etc... how can i set up multiple mappers to work on
a file that resides on only one block (or a file whose size is smaller than
the HDFS block size)?

can i split the matrix into (let's say) 10 files? that will mean 30 MB / 10
= 3 MB per file. then put each 3 MB file onto HDFS ? will this increase the
chance of having multiple mappers work simultaneously on the data/matrix?
if i can increase the number of mappers, i think (pretty sure) my
implementation will improve in speed linearly.

any help is appreciated.


Re: is implementing WritableComparable and setting Job.setSortComparatorClass(...) redundant?

2012-03-20 Thread Jane Wayne
thanks chris!
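
just to restate my takeaway for the archives: a key like the sketch below,
which only implements WritableComparable and never registers a RawComparator,
still sorts fine without calling setSortComparatorClass, because the default
WritableComparator deserializes both keys and calls compareTo. the class is
illustrative, not my actual key.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// a minimal key: no registered RawComparator, no setSortComparatorClass needed
public class SimpleKey implements WritableComparable<SimpleKey> {

  private long id;

  public SimpleKey() {}

  public SimpleKey(long id) {
    this.id = id;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(id);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    id = in.readLong();
  }

  @Override
  public int compareTo(SimpleKey other) {
    // hadoop's default WritableComparator deserializes both keys and ends up here
    return (id < other.id) ? -1 : (id == other.id ? 0 : 1);
  }

  @Override
  public int hashCode() {
    return (int) (id ^ (id >>> 32));
  }

  @Override
  public boolean equals(Object o) {
    return (o instanceof SimpleKey) && ((SimpleKey) o).id == id;
  }
}

a raw comparator registered on top of this (like the Text comparator in
chris's reply below) is purely a speed optimization for the sort, not a
correctness requirement.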

On Tue, Mar 20, 2012 at 6:30 AM, Chris White wrote:

> Setting sortComparatorClass will allow you to configure a
> RawComparator implementation (allowing you to do more efficient
> comparisons at the byte level). If you don't set it then hadoop uses
> the WritableComparator by default. This implementation deserializes
> the bytes into instances using your readFields method and then calls
> compareTo to determine key ordering. (look at the source in
> org.apache.hadoop.io.WritableComparator.compare(byte[], int, int,
> byte[], int, int))
>
> So if you don't need to be as efficient as possible, then delegating
> to WritableComparator is probably fine.
>
> Note that you can also configure a RawComparator for your key class
> using a static block to register it with WritableComparator, look at
> the source for Text for an example of this:
>
> /** A WritableComparator optimized for Text keys. */
>  public static class Comparator extends WritableComparator {
>public Comparator() {
>  super(Text.class);
>}
>
>public int compare(byte[] b1, int s1, int l1,
>   byte[] b2, int s2, int l2) {
>  int n1 = WritableUtils.decodeVIntSize(b1[s1]);
>  int n2 = WritableUtils.decodeVIntSize(b2[s2]);
>  return compareBytes(b1, s1+n1, l1-n1, b2, s2+n2, l2-n2);
>}
>  }
>
>  static {
>// register this comparator
>WritableComparator.define(Text.class, new Comparator());
>  }
>
> Chris
>
> On Tue, Mar 20, 2012 at 2:47 AM, Jane Wayne 
> wrote:
> > quick question:
> >
> > i have a key that already implements WritableComparable. this will be the
> > intermediary key passed from the map to the reducer.
> >
> > is it necessary to extend RawComparator and set it on
> > Job.setSortComparatorClass(Class cls) ?
>


Re: does hadoop always respect setNumReduceTasks?

2012-03-14 Thread Jane Wayne
Thanks Lance.
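
For the archives, the kind of helper I had in mind based on Lance's
suggestion is sketched below. It folds the first bytes of the word's MD5
digest into a non-negative int, so collisions are still theoretically
possible, just far less likely in practice than with String.hashCode(). The
class and method names are placeholders.

import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class WordIds {

  private static final Charset UTF8 = Charset.forName("UTF-8");

  // map a word to a stable, non-negative int derived from its MD5 digest
  public static int idOf(String word) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] digest = md5.digest(word.getBytes(UTF8));
      // fold the first 4 bytes of the 16-byte digest into an int
      int id = ((digest[0] & 0xFF) << 24)
             | ((digest[1] & 0xFF) << 16)
             | ((digest[2] & 0xFF) << 8)
             |  (digest[3] & 0xFF);
      // clear the sign bit, the same trick mahout's idToIndex uses
      return id & 0x7FFFFFFF;
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e); // MD5 is available on any standard JVM
    }
  }
}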

On Thu, Mar 8, 2012 at 9:38 PM, Lance Norskog  wrote:

> Instead of String.hashCode() you can use the MD5 hashcode generator.
> This has not "in the wild" created a duplicate. (It has been hacked,
> but that's not relevant here.)
>
> http://snippets.dzone.com/posts/show/3686
>
> I think the Partitioner class guarantees that you will have multiple
> reducers.
>
> On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne 
> wrote:
> > i am wondering if hadoop always respect Job.setNumReduceTasks(int)?
> >
> > as i am emitting items from the mapper, i expect/desire only 1 reducer to
> > get these items because i want to assign each key of the key-value input
> > pair a unique integer id. if i had 1 reducer, i can just keep a local
> > counter (with respect to the reducer instance) and increment it.
> >
> > on my local hadoop cluster, i noticed that most, if not all, my jobs have
> > only 1 reducer, regardless of whether or not i set
> > Job.setNumReduceTasks(int).
> >
> > however, as soon as i moved the code unto amazon's elastic mapreduce
> (emr),
> > i notice that there are multiple reducers. if i set the number of reduce
> > tasks to 1, is this always guaranteed? i ask because i don't know if
> there
> > is a gotcha like the combiner (where it may or may not run at all).
> >
> > also, it looks like this might not be a good idea just having 1 reducer
> (it
> > won't scale). it is most likely better if there are +1 reducers, but in
> > that case, i lose the ability to assign unique numbers to the key-value
> > pairs coming in. is there a design pattern out there that addresses this
> > issue?
> >
> > my mapper/reducer key-value pair signatures looks something like the
> > following.
> >
> > mapper(Text, Text, Text, IntWritable)
> > reducer(Text, IntWritable, IntWritable, Text)
> >
> > the mapper reads a sequence file whose key-value pairs are of type Text
> and
> > Text. i then emit Text (let's say a word) and IntWritable (let's say
> > frequency of the word).
> >
> > the reducer gets the word and its frequencies, and then assigns the word
> an
> > integer id. it emits IntWritable (the id) and Text (the word).
> >
> > i remember seeing code from mahout's API where they assign integer ids to
> > items. the items were already given an id of type long. the conversion
> they
> > make is as follows.
> >
> > public static int idToIndex(long id) {
> >  return 0x7FFF & ((int) id ^ (int) (id >>> 32));
> > }
> >
> > is there something equivalent for Text or a "word"? i was thinking about
> > simply taking the hash value of the string/word, but of course, different
> > strings can map to the same hash value.
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: Partition classes, how to pass in background information

2012-03-14 Thread Jane Wayne
Thanks Chris! That worked!
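
For anyone else hitting this, the shape of what worked for me is roughly the
sketch below. The property name and the key/value types are placeholders, not
my actual ones.

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, IntWritable>
    implements Configurable {

  private Configuration conf;
  private int offset; // the "background information" pulled from the job conf

  @Override
  public void setConf(Configuration conf) {
    // ReflectionUtils.newInstance calls this right after constructing the instance
    this.conf = conf;
    this.offset = conf.getInt("my.partitioner.offset", 0); // placeholder property
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    return ((key.hashCode() + offset) & Integer.MAX_VALUE) % numPartitions;
  }
}

The driver just sets the property on the job's Configuration before calling
job.setPartitionerClass(MyPartitioner.class).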

On Wed, Mar 14, 2012 at 6:06 AM, Chris White wrote:

> If your class implements the configurable interface, hadoop will call the
> setConf method after creating the instance. Look in the source code for
> ReflectionUtils.newInstance for more info
> On Mar 14, 2012 2:31 AM, "Jane Wayne"  wrote:
>
> > i am using the new org.apache.hadoop.mapreduce.Partitioner class.
> however,
> > i need to pass it some background information. how can i do this?
> >
> > in the old, org.apache.hadoop.mapred.Partitioner class (now deprecated),
> > this class extends JobConfigurable, and it seems the "hook" to pass in
> any
> > background data is with the JobConfigurable.configure(JobConf job)
> method.
> >
> > i thought that if i sub-classed org.apache.hadoop.mapreduce.Partitioner,
> i
> > could pass in the background information, however, in the
> > org.apache.hadoop.mapreduce.Job class, it only has a
> > setPartitionerClass(Class) method.
> >
> > all my development has been the new mapreduce package, and i would
> > definitely desire to stick with the new API/package. any help is
> > appreciated.
> >
>


does hadoop always respect setNumReduceTasks?

2012-03-08 Thread Jane Wayne
i am wondering if hadoop always respect Job.setNumReduceTasks(int)?

as i am emitting items from the mapper, i expect/desire only 1 reducer to
get these items because i want to assign each key of the key-value input
pair a unique integer id. if i had 1 reducer, i can just keep a local
counter (with respect to the reducer instance) and increment it.

on my local hadoop cluster, i noticed that most, if not all, my jobs have
only 1 reducer, regardless of whether or not i set
Job.setNumReduceTasks(int).

however, as soon as i moved the code unto amazon's elastic mapreduce (emr),
i notice that there are multiple reducers. if i set the number of reduce
tasks to 1, is this always guaranteed? i ask because i don't know if there
is a gotcha like the combiner (where it may or may not run at all).

also, it looks like this might not be a good idea just having 1 reducer (it
won't scale). it is most likely better if there are +1 reducers, but in
that case, i lose the ability to assign unique numbers to the key-value
pairs coming in. is there a design pattern out there that addresses this
issue?

my mapper/reducer key-value pair signatures looks something like the
following.

mapper(Text, Text, Text, IntWritable)
reducer(Text, IntWritable, IntWritable, Text)

the mapper reads a sequence file whose key-value pairs are of type Text and
Text. i then emit Text (let's say a word) and IntWritable (let's say
frequency of the word).

the reducer gets the word and its frequencies, and then assigns the word an
integer id. it emits IntWritable (the id) and Text (the word).

i remember seeing code from mahout's API where they assign integer ids to
items. the items were already given an id of type long. the conversion they
make is as follows.

public static int idToIndex(long id) {
 return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
}

is there something equivalent for Text or a "word"? i was thinking about
simply taking the hash value of the string/word, but of course, different
strings can map to the same hash value.


Re: can i specify no shuffle and/or no sort in the reducer and no disk space left IOException when there is DFS space remaining

2012-03-07 Thread Jane Wayne
thanks Jie. that worked. instead of part-r-0, i just get part-m-0.
so, that's no problem.

however, i'm going to see if i still get that IOException complaining about
no more free disk space.
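
in case it helps someone searching later, the relevant part of my driver now
looks roughly like the sketch below (Mapper.class stands in for my real
transformation mapper).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class MapOnlyJobSetup {
  public static Job build(Configuration conf) throws Exception {
    Job job = new Job(conf, "map-only job");
    // Mapper.class is the identity mapper; substitute the real transformation mapper
    job.setMapperClass(Mapper.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    // zero reduce tasks: no shuffle and no sort, and the map output is written
    // straight to the job output directory as part-m-* files
    job.setNumReduceTasks(0);
    return job;
  }
}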

On Thu, Mar 8, 2012 at 12:19 AM, Jie Li  wrote:

> You don't need to specify the reducer at all.
>
> Yeah the map output will go to HDFS directly. It's called map-only job.
>
> Jie
>
> On Thursday, March 8, 2012, Jane Wayne wrote:
>
> > Jie,
> >
> > so if i set the number of reduce tasks to 0, do i need to specify the
> > reducer (or should i set it null)? if i don't specify the reducer, and
> just
> > have a mapper, where do all the mapper output key-value pair go to? do
> they
> > get serialized to disk/HDFS automagically?
> >
> > On Thu, Mar 8, 2012 at 12:02 AM, Jie Li >
> > wrote:
> >
> > > Hi Jane,
> > >
> > > The default Reducer (IdentityReducer) would simply read/write
> everything
> > > that goes through it. By default Shuffling would also happen and the
> map
> > > output data is partitioned by the HashPartitioner.
> > >
> > > If you don't need the shuffle/reduce, you need to explicitly set the
> > number
> > > of the reduce tasks to zero via JobConf's setNumReduceTasks(int num).
> > >
> > > Hope that helps.
> > >
> > > Jie
> > >
> > > On Wed, Mar 7, 2012 at 11:28 PM, Jane Wayne  
> > > >wrote:
> > >
> > > > i have a Mapper and Reducer as a part of a job. all my data
> > > transformation
> > > > occurs in the mapper, and there is absolutely nothing that needs to
> be
> > > done
> > > > in the reducer. when i set the reducer on the Job, i simply use the
> > > > Reducer.class.
> > > >
> > > > i notice that after the mapper tasks have reached 100%, then the time
> > > until
> > > > reducing starts is very long. when reducing starts then i get an
> > > > FSError: java.io.IOException: No space left on device. i checked the
> dfs
> > > > health (via web page), and i still have 42.41% DFS remaining. why
> does
> > > this
> > > > occur? i see that eventually 4 attempts are made to call Reducer,
> > > however,
> > > > they all end up with the IOException mentioned. at the bottom is an
> > > output.
> > > > notice that the percentage goes up then back down to 0% before the
> > > > IOException.
> > > >
> > > > also, i want to know if i can just subclass Reducer or do something
> > about
> > > > shuffling and sorting as these steps are not important. i just want
> > each
> > > > record emitted from the Mapper to go straight to disk. is it possible
> > to
> > > do
> > > > this without going through Reducer? i am thinking this is part of the
> > > > problem for taking so long between 100% map and the first sign of
> > reduce.
> > > >
> > > > EXAMPLE OUTPUT
> > > >
> > > > 12/03/07 22:38:45 INFO mapred.JobClient:  map 98% reduce 0%
> > > > 12/03/07 22:39:18 INFO mapred.JobClient:  map 99% reduce 0%
> > > > 12/03/07 22:39:43 INFO mapred.JobClient:  map 100% reduce 0%
> > > > 12/03/07 22:58:14 INFO mapred.JobClient:  map 100% reduce 1%
> > > > 12/03/07 22:58:23 INFO mapred.JobClient:  map 100% reduce 3%
> > > > 12/03/07 22:58:38 INFO mapred.JobClient:  map 100% reduce 6%
> > > > 12/03/07 22:58:57 INFO mapred.JobClient:  map 100% reduce 7%
> > > > 12/03/07 22:59:21 INFO mapred.JobClient:  map 100% reduce 9%
> > > > 12/03/07 23:00:00 INFO mapred.JobClient:  map 100% reduce 10%
> > > > 12/03/07 23:00:09 INFO mapred.JobClient:  map 100% reduce 12%
> > > > 12/03/07 23:00:58 INFO mapred.JobClient:  map 100% reduce 0%
> > > > 12/03/07 23:01:00 INFO mapred.JobClient: Task Id :
> > > > attempt_201203071517_0043_r_00_0, Status : FAILED
> > > > FSError: java.io.IOException: No space left on device
> > > > FSError: java.io.IOException: No space left on device
> > > > FSError: java.io.IOException: No space left on device
> > > > FSError: java.io.IOException: No space left on device
> > > > FSError: java.io.IOException: No space left on device
> > > > FSError: java.io.IOException: No space left on device
> > > > attempt_201203071517_0043_r_00_0: log4j:ERROR Failed to flush writer,
> > > > attempt_201203071517_0043_r_00_0: java.io.IOException: No space left on device
> > > > 12/03/07 23:01:31 INFO mapred.JobClient:  map 100% reduce 1%
> > > > 12/03/07 23:01:34 INFO mapred.JobClient:  map 100% reduce 3%
> > > > 12/03/07 23:01:37 INFO mapred.JobClient:  map 100% reduce 4%
> > > > 12/03/07 23:01:49 INFO mapred.JobClient:  map 100% reduce 6%
> > > > 12/03/07 23:01:55 INFO mapred.JobClient:  map 100% reduce 7%
> > > > 12/03/07 23:02:19 INFO mapred.JobClient:  map 100% reduce 9%
> > > > 12/03/07 23:02:52 INFO mapred.JobClient:  map 100% reduce 0%
> > > > 12/03/07 23:02:54 INFO mapred.JobClient: Task Id :
> > > > attempt_201203071517_0043_r_00_1, Status : FAILED
> > > > FSError: java.io.IOException: No space left on device
> > > > FSError: java.io.IOException: No space left on device
> > > > FSError: java.io.IOException: No space left on device
> > > >
> > >
> >
>


Re: can i specify no shuffle and/or no sort in the reducer and no disk space left IOException when there is DFS space remaining

2012-03-07 Thread Jane Wayne
Jie,

so if i set the number of reduce tasks to 0, do i need to specify the
reducer (or should i set it null)? if i don't specify the reducer, and just
have a mapper, where do all the mapper output key-value pair go to? do they
get serialized to disk/HDFS automagically?

On Thu, Mar 8, 2012 at 12:02 AM, Jie Li  wrote:

> Hi Jane,
>
> The default Reducer (IdentityReducer) would simply read/write everything
> that goes through it. By default Shuffling would also happen and the map
> output data is partitioned by the HashPartitioner.
>
> If you don't need the shuffle/reduce, you need to explicitly set the number
> of the reduce tasks to zero via JobConf's setNumReduceTasks(int num).
>
> Hope that helps.
>
> Jie
>
> On Wed, Mar 7, 2012 at 11:28 PM, Jane Wayne  >wrote:
>
> > i have a Mapper and Reducer as a part of a job. all my data
> transformation
> > occurs in the mapper, and there is absolutely nothing that needs to be
> done
> > in the reducer. when i set the reducer on the Job, i simply use the
> > Reducer.class.
> >
> > i notice that after the mapper tasks have reached 100%, then the time
> until
> > reducing starts is very long. when reducing starts then i get an
> > FSError: java.io.IOException: No space left on device. i checked the dfs
> > health (via web page), and i still have 42.41% DFS remaining. why does
> this
> > occur? i see that eventually 4 attempts are made to call Reducer,
> however,
> > they all end up with the IOException mentioned. at the bottom is an
> output.
> > notice that the percentage goes up then back down to 0% before the
> > IOException.
> >
> > also, i want to know if i can just subclass Reducer or do something about
> > shuffling and sorting as these steps are not important. i just want each
> > record emitted from the Mapper to go straight to disk. is it possible to
> do
> > this without going through Reducer? i am thinking this is part of the
> > problem for taking so long between 100% map and the first sign of reduce.
> >
> > EXAMPLE OUTPUT
> >
> > 12/03/07 22:38:45 INFO mapred.JobClient:  map 98% reduce 0%
> > 12/03/07 22:39:18 INFO mapred.JobClient:  map 99% reduce 0%
> > 12/03/07 22:39:43 INFO mapred.JobClient:  map 100% reduce 0%
> > 12/03/07 22:58:14 INFO mapred.JobClient:  map 100% reduce 1%
> > 12/03/07 22:58:23 INFO mapred.JobClient:  map 100% reduce 3%
> > 12/03/07 22:58:38 INFO mapred.JobClient:  map 100% reduce 6%
> > 12/03/07 22:58:57 INFO mapred.JobClient:  map 100% reduce 7%
> > 12/03/07 22:59:21 INFO mapred.JobClient:  map 100% reduce 9%
> > 12/03/07 23:00:00 INFO mapred.JobClient:  map 100% reduce 10%
> > 12/03/07 23:00:09 INFO mapred.JobClient:  map 100% reduce 12%
> > 12/03/07 23:00:58 INFO mapred.JobClient:  map 100% reduce 0%
> > 12/03/07 23:01:00 INFO mapred.JobClient: Task Id :
> > attempt_201203071517_0043_r_00_0, Status : FAILED
> > FSError: java.io.IOException: No space left on device
> > FSError: java.io.IOException: No space left on device
> > FSError: java.io.IOException: No space left on device
> > FSError: java.io.IOException: No space left on device
> > FSError: java.io.IOException: No space left on device
> > FSError: java.io.IOException: No space left on device
> > attempt_201203071517_0043_r_00_0: log4j:ERROR Failed to flush writer,
> > attempt_201203071517_0043_r_00_0: java.io.IOException: No space left on device
> > 12/03/07 23:01:31 INFO mapred.JobClient:  map 100% reduce 1%
> > 12/03/07 23:01:34 INFO mapred.JobClient:  map 100% reduce 3%
> > 12/03/07 23:01:37 INFO mapred.JobClient:  map 100% reduce 4%
> > 12/03/07 23:01:49 INFO mapred.JobClient:  map 100% reduce 6%
> > 12/03/07 23:01:55 INFO mapred.JobClient:  map 100% reduce 7%
> > 12/03/07 23:02:19 INFO mapred.JobClient:  map 100% reduce 9%
> > 12/03/07 23:02:52 INFO mapred.JobClient:  map 100% reduce 0%
> > 12/03/07 23:02:54 INFO mapred.JobClient: Task Id :
> > attempt_201203071517_0043_r_00_1, Status : FAILED
> > FSError: java.io.IOException: No space left on device
> > FSError: java.io.IOException: No space left on device
> > FSError: java.io.IOException: No space left on device
> >
>


how to get rid of -libjars ?

2012-03-06 Thread Jane Wayne
currently, i have my main jar and then 2 dependent jars. what i do is
1. copy dependent-1.jar to $HADOOP/lib
2. copy dependent-2.jar to $HADOOP/lib

then, when i need to run my job, MyJob inside main.jar, i do the following.

hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar
-Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path

what i want to do is NOT copy the dependent jars to $HADOOP/lib and always
specify -libjars. is there any way around this multi-step procedure? i
really do not want to clutter $HADOOP/lib or specify a comma-delimited list
of jars for -libjars.

any help is appreciated.


Re: is there anyway to detect the file size as am i writing a sequence file?

2012-03-06 Thread Jane Wayne
Thanks Joey. That's what I meant (I've been staring at the screen too
long). :)
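
For the archives, the rolling logic I ended up sketching looks roughly like
the class below. The chunk file naming and the 64 MB threshold are
illustrative, not anything from the SequenceFile API.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RollingSequenceFileWriter {

  private static final long MAX_BYTES = 64L * 1024 * 1024; // roll at ~64 MB

  private final FileSystem fs;
  private final Configuration conf;
  private final Path dir;
  private int part = 0;
  private SequenceFile.Writer writer;

  public RollingSequenceFileWriter(FileSystem fs, Configuration conf, Path dir)
      throws IOException {
    this.fs = fs;
    this.conf = conf;
    this.dir = dir;
    this.writer = newWriter();
  }

  private SequenceFile.Writer newWriter() throws IOException {
    Path path = new Path(dir, "chunk-" + (part++) + ".seq");
    return SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
  }

  public void append(Text key, Text value) throws IOException {
    // getLength() is (more or less) the current size of the file in bytes,
    // so roll to a fresh file once the threshold is crossed
    if (writer.getLength() >= MAX_BYTES) {
      writer.close();
      writer = newWriter();
    }
    writer.append(key, value);
  }

  public void close() throws IOException {
    writer.close();
  }
}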

On Tue, Mar 6, 2012 at 10:00 AM, Joey Echeverria  wrote:

> I think you mean Writer.getLength(). It returns the current position
> in the output stream in bytes (more or less the current size of the
> file).
>
> -Joey
>
> On Tue, Mar 6, 2012 at 9:53 AM, Jane Wayne 
> wrote:
> > hi,
> >
> > i am writing a little util class to recurse into a directory and add all
> > *.txt files into a sequence file (key is the file name, value is the
> > content of the corresponding text file). as i am writing (i.e.
> > SequenceFile.Writer.append(key, value)), is there any way to detect how
> > large the sequence file is?
> >
> > for example, i want to create a new sequence file as soon as the current
> > one exceeds 64 MB.
> >
> > i notice there is a SequenceFile.Writer.getLong() which the javadocs says
> > "returns the current length of the output file," but that is vague. what
> is
> > this Writer.getLong() method? is it the number of bytes, kilobytes,
> > megabytes, or something else?
> >
> > thanks,
>
>
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
>


Re: why does my mapper class read my input file twice?

2012-03-06 Thread Jane Wayne
Harsh,

Thanks. I went into the code of FileInputFormat.addInputPath(Job, Path) and
it is as you stated. That makes sense now. I simply commented out
FileInputFormat.addInputPath(job, input) and
FileOutputFormat.setOutputPath(job, output), and everything automagically
works now.

Thanks a bunch!
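
For completeness, the trimmed run() method in MyJob now looks roughly like
the sketch below (compare with the original quoted underneath): the framework
picks the input/output paths straight out of mapred.input.dir and
mapred.output.dir, so the explicit add/set calls are gone.

@Override
public int run(String[] args) throws Exception {
  Configuration conf = getConf();

  Job job = new Job(conf, "dummy job");
  job.setMapOutputKeyClass(LongWritable.class);
  job.setMapOutputValueClass(Text.class);
  job.setOutputKeyClass(LongWritable.class);
  job.setOutputValueClass(Text.class);

  job.setMapperClass(MyMapper.class);
  job.setReducerClass(MyReducer.class);

  // no FileInputFormat.addInputPath / FileOutputFormat.setOutputPath here:
  // -Dmapred.input.dir and -Dmapred.output.dir already populate the config
  // props that the file input/output formats read, so adding the path again
  // is exactly what caused every line to be processed twice

  job.setJarByClass(MyJob.class);

  return job.waitForCompletion(true) ? 0 : 1;
}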

On Tue, Mar 6, 2012 at 2:06 AM, Harsh J  wrote:

> It's your use of the mapred.input.dir property, which is a reserved
> name in the framework (its what FileInputFormat uses).
>
> You have a config you extract path from:
> Path input = new Path(conf.get("mapred.input.dir"));
>
> Then you do:
> FileInputFormat.addInputPath(job, input);
>
> Which, internally, simply appends a path to a config prop called
> "mapred.input.dir". Hence your job gets launched with two input paths
> (the very same file twice) - one added by the Tool-provided configuration
> (because of your -Dmapred.input.dir) and the other added by you.
>
> Fix the input path line to use a different config:
> Path input = new Path(conf.get("input.path"));
>
> And run job as:
> hadoop jar dummy-0.1.jar dummy.MyJob -Dinput.path=data/dummy.txt
> -Dmapred.output.dir=result
>
> On Tue, Mar 6, 2012 at 9:03 AM, Jane Wayne 
> wrote:
> > i have code that reads in a text file. i notice that each line in the
> text
> > file is somehow being read twice. why is this happening?
> >
> > my mapper class looks like the following:
> >
> > public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
> >
> > private static final Log _log = LogFactory.getLog(MyMapper.class);
> >  @Override
> > public void map(LongWritable key, Text value, Context context) throws
> > IOException, InterruptedException {
> > String s = (new
> > StringBuilder()).append(value.toString()).append("m").toString();
> > context.write(key, new Text(s));
> > _log.debug(key.toString() + " => " + s);
> > }
> > }
> >
> > my reducer class looks like the following:
> >
> > public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
> >
> > private static final Log _log = LogFactory.getLog(MyReducer.class);
> >  @Override
> > public void reduce(LongWritable key, Iterable<Text> values, Context
> > context) throws IOException, InterruptedException {
> > for(Iterator<Text> it = values.iterator(); it.hasNext();) {
> > Text txt = it.next();
> > String s = (new
> > StringBuilder()).append(txt.toString()).append("r").toString();
> > context.write(key, new Text(s));
> > _log.debug(key.toString() + " => " + s);
> > }
> > }
> > }
> >
> > my job class looks like the following:
> >
> > public class MyJob extends Configured implements Tool {
> >
> > public static void main(String[] args) throws Exception {
> > ToolRunner.run(new Configuration(), new MyJob(), args);
> > }
> >
> > @Override
> > public int run(String[] args) throws Exception {
> > Configuration conf = getConf();
> > Path input = new Path(conf.get("mapred.input.dir"));
> >Path output = new Path(conf.get("mapred.output.dir"));
> >
> >Job job = new Job(conf, "dummy job");
> >job.setMapOutputKeyClass(LongWritable.class);
> >job.setMapOutputValueClass(Text.class);
> >job.setOutputKeyClass(LongWritable.class);
> >job.setOutputValueClass(Text.class);
> >
> >job.setMapperClass(MyMapper.class);
> >job.setReducerClass(MyReducer.class);
> >
> >FileInputFormat.addInputPath(job, input);
> >FileOutputFormat.setOutputPath(job, output);
> >
> >job.setJarByClass(MyJob.class);
> >
> >return job.waitForCompletion(true) ? 0 : 1;
> > }
> > }
> >
> > the text file that i am trying to read in looks like the following. as
> you
> > can see, there are 9 lines.
> >
> > T, T
> > T, T
> > T, T
> > F, F
> > F, F
> > F, F
> > F, F
> > T, F
> > F, T
> >
> > the output file that i get after my Job runs looks like the following. as
> > you can see, there are 18 lines. each key is emitted twice from the
> mapper
> > to the reducer.
> >
> > 0   T, Tmr
> > 0   T, Tmr
> > 6   T, Tmr
> > 6   T, Tmr
> > 12  T, Tmr
> > 12  T, Tmr
> > 18  F, Fmr
> > 18  F, Fmr
> > 24  F, Fmr
> > 24  F, Fmr
> > 30  F, Fmr
> > 30  F, Fmr
> > 36  F, Fmr
> > 36  F, Fmr
> > 42  T, Fmr
> > 42  T, Fmr
> > 48  F, Tmr
> > 48  F, Tmr
> >
> > the way i execute my Job is as follows (cygwin + hadoop 0.20.2).
> >
> > hadoop jar dummy-0.1.jar dummy.MyJob -Dmapred.input.dir=data/dummy.txt
> > -Dmapred.output.dir=result
> >
> > originally, this happened when i read in a sequence file, but even for a
> > text file, this problem is still happening. is it the way i have setup my
> > Job?
>
>
>
> --
> Harsh J
>