Re: Fastest HDFS loader

2011-12-20 Thread Konstantin Boudnik
Do you have some strict performance requirement or something? Because 5 GB is
pretty much nothing, really. I'd say copyFromLocal will do just fine.
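
A minimal sketch of that route, assuming a tab-delimited file and hypothetical
paths and column names:

hadoop fs -mkdir /user/hive/external/mytable
hadoop fs -copyFromLocal /local/path/export.tsv /user/hive/external/mytable/
hive -e "CREATE EXTERNAL TABLE mytable (col1 STRING, col2 STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
         LOCATION '/user/hive/external/mytable'"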

Cos

On Tue, Dec 20, 2011 at 10:32PM, Edmon Begoli wrote:
> Hi,
> 
> We are going to be loading a 4-5 GB delimited text file from a RHEL file
> system into HDFS, to be managed as an external table by Hive.
> 
> What is the recommended, fastest loading mechanism?
> 
> Thank you,
> Edmon




Fastest HDFS loader

2011-12-20 Thread Edmon Begoli
Hi,

We are going to be loading a 4-5 GB delimited text file from a RHEL file
system into HDFS, to be managed as an external table by Hive.

What is the recommended, fastest loading mechanism?

Thank you,
Edmon


Re: How to create Output files of about fixed size

2011-12-20 Thread Mapred Learn
Hi Shevek/others,

I tried this.

The first job created about 78 files of 15 MB each.

I tried a second map-only job with IdentityMapper and
-Dmapred.min.split.size=1073741824, but it did not produce 1 GB output files;
I got the same output as above, i.e. 78 files of 15 MB each.

Is there a way to combine these files into about 1 GB each?

Thanks,
-JJ
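
One possible sketch for the merge, assuming plain text data and a 0.20-era
tarball layout; a reduce stage is needed because split-size settings never
combine data across input files, and note the shuffle will re-sort lines by
their first tab-separated field:

hadoop fs -dus /user/jj/first-job-output      # ~78 x 15 MB is roughly 1.2 GB
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=1 \
    -mapper cat -reducer cat \
    -input /user/jj/first-job-output \
    -output /user/jj/merged-output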

On Fri, Oct 28, 2011 at 9:53 AM, Shevek  wrote:

> If you run it as a pure map job, it will do it per split. If you run it as a
> single reducer job, it will do it overall. However, one starts to suspect
> that by the time you've paid that extra cost, you might as well reconsider
> your downstream process and the reason for this subdivision.
>
> S.
>
> On 27 October 2011 23:07, Mapred Learn  wrote:
>
> > Hi Shevek,
> > Thanks for the explanation !
> >
> > Can you point me to some documentation for specifying the size in the
> > output format?
> >
> > If I say the size is 200 MB, then after 200 MB, would it do this per split
> > or overall? I mean, would I end up with a 200 MB and a 50 MB file from the
> > 1st mapper and then, say, a 200 MB and a 10 MB file from the 2nd mapper,
> > and so on? Or would I get only 200 MB files?
> >
> >
> >
> > On Wed, Oct 26, 2011 at 10:48 AM, Shevek  wrote:
> >
> > > You can control the input to a computer program, but not (arbitrarily) how
> > > much output it generates. The only way to generate output files of a fixed
> > > size is to write a custom output format which shifts to a new filename
> > > every time that size is exceeded, but you will still get some small bits
> > > left over. The plumbing in this is pretty ugly, and I would not recommend
> > > it casually.
> > >
> > > You may be able to write a second map-only job which reprocesses the output
> > > from the first job in chunks of X bytes, and just writes them out. Use an
> > > IdentityMapper and set the split size. I have not tried this at home.
> > >
> > > S.
> > >
> > > On 26 October 2011 07:03, Mapred Learn  wrote:
> > >
> > > >
> > > > >
> > > >
> > > > > Hi,
> > > > > I am trying to create output files of fixed size by using:
> > > > > -Dmapred.max.split.size=6442450812 (6 GB)
> > > > >
> > > > > But the problem is that the input data size and metadata vary, and I
> > > > > have to adjust the above value manually to achieve a fixed size.
> > > > >
> > > > > Is there a way I can programmatically determine a split size that
> > > > > would yield fixed-size output files, e.g. 200 MB each?
> > > > >
> > > > > Thanks,
> > > > > JJ
> > > >
> > >
> >
>


network configuration (etc/hosts) ?

2011-12-20 Thread MirrorX

Dear all,

I have been trying for many days to get a simple Hadoop cluster (with 2 nodes)
to work, but I am having trouble configuring the network parameters. I have
properly configured the SSH keys, and the /etc/hosts files are:

master->
127.0.0.1 localhost6.localdomain6 localhost
127.0.1.1 localhost4.localdomain4 master-pc
192.168.7.110 master
192.168.7.157 slave

slave->
127.0.1.1 localhost5.localdomain5 lab-pc
127.0.0.1 localhost3.localdomain3 localhost
192.168.7.110 master
192.168.7.157 slave

I have tried all possible combinations in the /etc/hosts files but I still
cannot make it work. I either get 'too many fetch failures' errors, and by
examining the logs of the slave I see this:
' INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201112210259_0002_r_00_0 0.1112% reduce > copy (1 of 3 at
0.03MB/s)'

or I get errors like:
'INFO mapred.JobClient: Task Id : attempt_201112210308_0001_r_00_0,
Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.'

I have searched through many similar posts on the web but I still cannot find
the solution. Could you please help me?

When I run the same job only on the master it completes fine, and I can
connect via SSH from every node to every node and from each node to itself;
that's why I think there is something wrong with the network configuration.
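
For reference, the kind of hosts layout that is usually recommended for a
two-node setup like this (a sketch; the key point is that neither hostname
should resolve to a 127.x address on either machine):

# /etc/hosts, identical on master and slave
127.0.0.1      localhost
192.168.7.110  master master-pc
192.168.7.157  slave lab-pc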

Thank you in advance for your help.






Re: collecting CPU, mem, iops of hadoop jobs

2011-12-20 Thread Arun C Murthy
Go ahead and open a MR jira (would appreciate a patch too! ;) ).

thanks,
Arun

On Dec 20, 2011, at 2:55 PM, Patai Sangbutsarakum wrote:

> Thanks again Arun, you save me again.. :-)
> 
> This is a great starting point. for CPU and possibly Mem.
> 
> For the IOPS, just would like to ask if the tasknode/datanode collect the 
> number
> or we should dig into OS level.. like /proc/PID_OF_tt/io
> ^hope this make sense
> 
> -P
> 
> On Tue, Dec 20, 2011 at 1:22 PM, Arun C Murthy  wrote:
>> Take a look at the JobHistory files produced for each job.
>> 
>> With 0.20.205 you get CPU (slot millis).
>> With 0.23 (alpha quality) you get CPU and JVM metrics (GC etc.). I believe 
>> you also get Memory, but not IOPS.
>> 
>> Arun
>> 
>> On Dec 20, 2011, at 1:11 PM, Patai Sangbutsarakum wrote:
>> 
>>> Thanks for reply, but I don't think metric exposed to Ganglia would be
>>> what i am really looking for..
>>> 
>>> what i am looking for is some kind of these (but not limit to)
>>> 
>>> Job__
>>> CPU time: 10204 sec.   <--aggregate from all tasknodes
>>> IOPS: 2344  <-- aggregated from all datanode
>>> MEM: 30G   <-- aggregated
>>> 
>>> etc,
>>> 
>>> Job_aaa_bbb
>>> CPU time:
>>> IOPS:
>>> MEM:
>>> 
>>> Sorry for ambiguous question.
>>> Thanks
>>> 
>>> On Tue, Dec 20, 2011 at 12:47 PM, He Chen  wrote:
 You may need Ganglia. It is a cluster monitoring software.
 
 On Tue, Dec 20, 2011 at 2:44 PM, Patai Sangbutsarakum <
 silvianhad...@gmail.com> wrote:
 
> Hi Hadoopers,
> 
> We're running Hadoop 0.20 CentOS5.5. I am finding the way to collect
> CPU time, memory usage, IOPS of each hadoop Job.
> What would be the good starting point ? document ? api ?
> 
> Thanks in advance
> -P
> 
>> 



Re: collecting CPU, mem, iops of hadoop jobs

2011-12-20 Thread Patai Sangbutsarakum
Thanks again Arun, you saved me again :-)

This is a great starting point for CPU and possibly memory.

For IOPS, I would just like to ask whether the tasktracker/datanode collects
the number, or whether we should dig down to the OS level, e.g.
/proc/PID_OF_tt/io. Hope this makes sense.

-P
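
A sketch of sampling that at the OS level for the running task JVMs (assumes a
kernel with per-task I/O accounting and that the task children run
org.apache.hadoop.mapred.Child):

for pid in $(pgrep -f org.apache.hadoop.mapred.Child); do
    echo "== task JVM $pid =="
    cat /proc/$pid/io    # read_bytes / write_bytes are the I/O that actually hit storage
done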

On Tue, Dec 20, 2011 at 1:22 PM, Arun C Murthy  wrote:
> Take a look at the JobHistory files produced for each job.
>
> With 0.20.205 you get CPU (slot millis).
> With 0.23 (alpha quality) you get CPU and JVM metrics (GC etc.). I believe 
> you also get Memory, but not IOPS.
>
> Arun
>
> On Dec 20, 2011, at 1:11 PM, Patai Sangbutsarakum wrote:
>
>> Thanks for reply, but I don't think metric exposed to Ganglia would be
>> what i am really looking for..
>>
>> what i am looking for is some kind of these (but not limit to)
>>
>> Job__
>> CPU time: 10204 sec.   <--aggregate from all tasknodes
>> IOPS: 2344  <-- aggregated from all datanode
>> MEM: 30G   <-- aggregated
>>
>> etc,
>>
>> Job_aaa_bbb
>> CPU time:
>> IOPS:
>> MEM:
>>
>> Sorry for ambiguous question.
>> Thanks
>>
>> On Tue, Dec 20, 2011 at 12:47 PM, He Chen  wrote:
>>> You may need Ganglia. It is a cluster monitoring software.
>>>
>>> On Tue, Dec 20, 2011 at 2:44 PM, Patai Sangbutsarakum <
>>> silvianhad...@gmail.com> wrote:
>>>
 Hi Hadoopers,

 We're running Hadoop 0.20 CentOS5.5. I am finding the way to collect
 CPU time, memory usage, IOPS of each hadoop Job.
 What would be the good starting point ? document ? api ?

 Thanks in advance
 -P

>


Release 1.0.0 RC3 - Ant Build Fails with JSP-Compile Error

2011-12-20 Thread Royston Sellman
Hi,

 

We have just checked out the latest version of the Hadoop source from
http://svn.apache.org/repos/asf/hadoop/common/tags/release-1.0.0-rc3 and we
have attempted to build it using the ant build.xml script. However, we are
getting errors relating to the jsp-compile task.

 

We are getting the following error when we run "ant" inside our
$HADOOP_INSTALL directory. 

 

BUILD FAILED

/opt/hadoop/release-1.0.0-rc3/build.xml:527: jsp-compile doesn't support the
"webxml" attribute

 

Could you please suggest what might be the cause and/or how we can fix this
issue so that we can build Hadoop successfully?

 

Thanks,

Royston



Re: collecting CPU, mem, iops of hadoop jobs

2011-12-20 Thread Arun C Murthy
Take a look at the JobHistory files produced for each job.

With 0.20.205 you get CPU (slot millis).
With 0.23 (alpha quality) you get CPU and JVM metrics (GC etc.). I believe you 
also get Memory, but not IOPS.
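
A quick way to pull those counters out of the history, as a sketch (the job
output path here is an assumption):

# prints job details and counters for a finished job
hadoop job -history all /user/patai/wordcount-out | grep -iE 'SLOTS_MILLIS|CPU|MEMORY'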

Arun

On Dec 20, 2011, at 1:11 PM, Patai Sangbutsarakum wrote:

> Thanks for reply, but I don't think metric exposed to Ganglia would be
> what i am really looking for..
> 
> what i am looking for is some kind of these (but not limit to)
> 
> Job__
> CPU time: 10204 sec.   <--aggregate from all tasknodes
> IOPS: 2344  <-- aggregated from all datanode
> MEM: 30G   <-- aggregated
> 
> etc,
> 
> Job_aaa_bbb
> CPU time:
> IOPS:
> MEM:
> 
> Sorry for ambiguous question.
> Thanks
> 
> On Tue, Dec 20, 2011 at 12:47 PM, He Chen  wrote:
>> You may need Ganglia. It is a cluster monitoring software.
>> 
>> On Tue, Dec 20, 2011 at 2:44 PM, Patai Sangbutsarakum <
>> silvianhad...@gmail.com> wrote:
>> 
>>> Hi Hadoopers,
>>> 
>>> We're running Hadoop 0.20 CentOS5.5. I am finding the way to collect
>>> CPU time, memory usage, IOPS of each hadoop Job.
>>> What would be the good starting point ? document ? api ?
>>> 
>>> Thanks in advance
>>> -P
>>> 



Re: collecting CPU, mem, iops of hadoop jobs

2011-12-20 Thread Patai Sangbutsarakum
Thanks for the reply, but I don't think the metrics exposed to Ganglia are
what I am really looking for.

What I am looking for is something like this (but not limited to):

Job__
CPU time: 10204 sec.   <--aggregate from all tasknodes
IOPS: 2344  <-- aggregated from all datanode
MEM: 30G   <-- aggregated

etc,

Job_aaa_bbb
CPU time:
IOPS:
MEM:

Sorry for the ambiguous question.
Thanks

On Tue, Dec 20, 2011 at 12:47 PM, He Chen  wrote:
> You may need Ganglia. It is a cluster monitoring software.
>
> On Tue, Dec 20, 2011 at 2:44 PM, Patai Sangbutsarakum <
> silvianhad...@gmail.com> wrote:
>
>> Hi Hadoopers,
>>
>> We're running Hadoop 0.20 CentOS5.5. I am finding the way to collect
>> CPU time, memory usage, IOPS of each hadoop Job.
>> What would be the good starting point ? document ? api ?
>>
>> Thanks in advance
>> -P
>>


Re: collecting CPU, mem, iops of hadoop jobs

2011-12-20 Thread He Chen
You may need Ganglia. It is cluster monitoring software.

On Tue, Dec 20, 2011 at 2:44 PM, Patai Sangbutsarakum <
silvianhad...@gmail.com> wrote:

> Hi Hadoopers,
>
> We're running Hadoop 0.20 CentOS5.5. I am finding the way to collect
> CPU time, memory usage, IOPS of each hadoop Job.
> What would be the good starting point ? document ? api ?
>
> Thanks in advance
> -P
>


collecting CPU, mem, iops of hadoop jobs

2011-12-20 Thread Patai Sangbutsarakum
Hi Hadoopers,

We're running Hadoop 0.20 on CentOS 5.5. I am trying to find a way to collect
the CPU time, memory usage, and IOPS of each Hadoop job.
What would be a good starting point? A document? An API?

Thanks in advance
-P


Re: Desperate!!!! Expanding,shrinking cluster or replacing failed nodes.

2011-12-20 Thread Merto Mertek
I followed the same tutorial as you. If I am not wrong, the problem arises
because you first tried to run the node as a single node and then joined it to
the cluster (as Arpit mentioned). After testing that the new node works OK,
try deleting the contents of the /app/hadoop/tmp/ directory (a cleanup sketch
follows the steps below) and then add the node to the cluster. When I set up
the config files on the new node I followed this procedure:

DATANODE
setup config files (look the tutorial)
/usr/local/hadoop/bin/hadoop-daemon.sh start datanode
/usr/local/hadoop/bin/hadoop-daemon.sh start tasktracker
---
MASTER
$hdbin/hadoop dfsadmin -report
nano /usr/local/hadoop/conf/slaves (add a new node)
$hdbin/hadoop dfsadmin -refreshNodes
$hdbin/hadoop namenode restart
$hdbin/hadoop jobtracker restart
($hdbin/hadoop balancer  )
($hdbin/hadoop dfsadmin -report )
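
A sketch of the cleanup step mentioned above, which is what usually clears the
Incompatible namespaceIDs error (the path is an assumption; use whatever
dfs.data.dir / hadoop.tmp.dir points to on the new node, and never do this on
the namenode):

/usr/local/hadoop/bin/hadoop-daemon.sh stop datanode
rm -rf /app/hadoop/tmp/dfs/data/*    # removes the stale VERSION file along with old block data
/usr/local/hadoop/bin/hadoop-daemon.sh start datanode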

Hope it helps..

On 20 December 2011 18:38, Arpit Gupta  wrote:

> On the new nodes you are trying to add make sure the  dfs/data directories
> are empty. You probably have a VERSION file from an older deploy and thus
> causing the incompatible namespaceId error.
>
>
> --
> Arpit
> ar...@hortonworks.com
>
>
> On Dec 20, 2011, at 5:35 AM, Sloot, Hans-Peter wrote:
>
> >
> >
> > But I ran into the : java.io.IOException: Incompatible namespaceIDs
> error every time.
> > Should I config the files :  dfs/data/current/VERSION and
> dfs/name/current/VERSION  and  conf/*site.xml
> > from other existing nodes?
> >
> >
> >
> >
> >
> > -Original Message-
> > From: Harsh J [mailto:ha...@cloudera.com]
> > Sent: dinsdag 20 december 2011 14:30
> > To: common-user@hadoop.apache.org
> > Cc: hdfs-...@hadoop.apache.org
> > Subject: Re: Desperate Expanding,shrinking cluster or replacing
> failed nodes.
> >
> > Hans-Peter,
> >
> > Adding new nodes is simply (assuming network setup is sane and done):
> >
> > - Install/unpack services on new machine.
> > - Deploy a config copy for the services.
> > - Start the services.
> >
> > You should *not* format a NameNode *ever*, after the first time you
> start it up. Formatting loses all data of HDFS, so don't even think about
> that after the first time you use it :)
> >
> > On 20-Dec-2011, at 6:12 PM, Sloot, Hans-Peter wrote:
> >
> >> Hello all,
> >>
> >> I have asked this question a couple of days ago but no one responded.
> >>
> >> I built a 6 node hadoop cluster, guided Michael Noll, starting with a
> single node and expanding it one by one.
> >> Every time I expanded the cluster I ran into error :
> java.io.IOException: Incompatible namespaceIDs
> >>
> >> So now my question is what is the correct procedure for expanding,
> shrinking a cluster?
> >> And how to replace a failed node?
> >>
> >> Can someone  point me to the correct manuals.
> >> I have already looked at the available documents on the wiki and
> hadoop.apache.org but could not find the answers.
> >>
> >> Regards Hans-Peter
> >>
> >>
> >>
> >>
> >>

Re: Configure hadoop scheduler

2011-12-20 Thread Merto Mertek
Thanks both of you, for sharing your time in solving the problem!

Now it works. The main problem was hiding in
"mapreduce.jobtracker.taskScheduler": in version 0.20.204 the right property
name is "mapred.jobtracker.taskScheduler". After this change fair scheduling
is running.
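
For reference, a minimal mapred-site.xml sketch of the two settings involved
(the allocation file path is an assumption):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/usr/local/hadoop/conf/allocation.xml</value>
</property>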

I encountered another problem when the fair scheduler was starting.
The exception was caused by: org.xml.sax.SAXParseException: The processing
instruction target matching "[xX][mM][lL]" is not allowed. The problem was
solved by validating allocation.xml, where I had a few spaces at the
beginning of the file.

@Matei: I have read some papers by you and your colleagues. If you do not
mind, I would like to ask a few questions:
a) Is the LATE scheduler a standalone scheduler, or is it integrated into the
fair scheduler? If it is standalone, where can I find it and which Hadoop
versions does it support?
b) Is the pseudo-code concept for fair scheduling from the paper "Job
scheduling for multi-user mapreduce clusters" still the same in version
0.20.204 as described in appendix A of the paper?
c) Are delay scheduling and task preemption, as mentioned in your paper,
available in version 0.20.204? I've checked the fair scheduler parameters but
did not find such options.





On 20 December 2011 18:03, Prashant Kommireddi  wrote:

> I am guessing you are trying to use the FairScheduler but you have
> specified CapacityScheduler in your configuration. You need to change
> mapreduce.jobtracker.scheduler to FairScheduler.
>
> Sent from my iPhone
>
> On Dec 20, 2011, at 8:51 AM, Merto Mertek  wrote:
>
> > Hi,
> >
> > I am having problems with changing the default hadoop scheduler (i assume
> > that the default scheduler is a FIFO scheduler).
> >
> > I am following the guide located in hadoop/docs directory however I am
> not
> > able to run it.  Link for scheduling administration returns an http error
> > 404 ( http://localhost:50030/scheduler ). In the UI under scheduling
> > information I can see only one queue named "default". mapred-site.xml
> file
> > is accessible because when changing a port for a jobtracker I can see a
> > daemon running with a changed port. Variable $HADOOP_CONFIG_DIR was added
> > to .bashrc, however that did not solve the problem. I tried to rebuild
> > hadoop, manualy place the fair scheduler jar in hadoop/lib and changed
> the
> > hadoop classpath in hadoop-env.sh to point to the lib folder, but without
> > success. The only info of the scheduler that is seen in the jobtracker
> log
> > is the folowing info:
> >
> > Scheduler configured with (memSizeForMapSlotOnJT,
> memSizeForReduceSlotOnJT,
> >> limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
> >>
> >
> > I am working on this several days and running out of ideas... I am
> > wondering how to fix it and where to check currently active scheduler
> > parameters?
> >
> > Config files:
> > mapred-site.xml 
> > allocation.xml 
> > Tried versions: 0.20.203 and 204
> >
> > Thank you
>


Re: Desperate!!!! Expanding,shrinking cluster or replacing failed nodes.

2011-12-20 Thread Arpit Gupta
On the new nodes you are trying to add, make sure the dfs/data directories are
empty. You probably have a VERSION file from an older deploy, and that is what
causes the incompatible namespaceID error.


--
Arpit
ar...@hortonworks.com


On Dec 20, 2011, at 5:35 AM, Sloot, Hans-Peter wrote:

> 
> 
> But I ran into the : java.io.IOException: Incompatible namespaceIDs error 
> every time.
> Should I config the files :  dfs/data/current/VERSION and 
> dfs/name/current/VERSION  and  conf/*site.xml
> from other existing nodes?
> 
> 
> 
> 
> 
> -Original Message-
> From: Harsh J [mailto:ha...@cloudera.com]
> Sent: dinsdag 20 december 2011 14:30
> To: common-user@hadoop.apache.org
> Cc: hdfs-...@hadoop.apache.org
> Subject: Re: Desperate Expanding,shrinking cluster or replacing failed 
> nodes.
> 
> Hans-Peter,
> 
> Adding new nodes is simply (assuming network setup is sane and done):
> 
> - Install/unpack services on new machine.
> - Deploy a config copy for the services.
> - Start the services.
> 
> You should *not* format a NameNode *ever*, after the first time you start it 
> up. Formatting loses all data of HDFS, so don't even think about that after 
> the first time you use it :)
> 
> On 20-Dec-2011, at 6:12 PM, Sloot, Hans-Peter wrote:
> 
>> Hello all,
>> 
>> I have asked this question a couple of days ago but no one responded.
>> 
>> I built a 6 node hadoop cluster, guided Michael Noll, starting with a single 
>> node and expanding it one by one.
>> Every time I expanded the cluster I ran into error : java.io.IOException: 
>> Incompatible namespaceIDs
>> 
>> So now my question is what is the correct procedure for expanding, shrinking 
>> a cluster?
>> And how to replace a failed node?
>> 
>> Can someone  point me to the correct manuals.
>> I have already looked at the available documents on the wiki and 
>> hadoop.apache.org but could not find the answers.
>> 
>> Regards Hans-Peter
>> 
>> 
>> 
>> 
>> 

Custom Writables in MapWritable

2011-12-20 Thread Kyle Renfro
Hadoop 0.22.0-RC0

I have the following reducer:
public static class MergeRecords extends
Reducer

The MapWritables that are handled by the reducer all have Text 'keys'
and contain different 'value' classes including Text, DoubleWritable,
and a custom Writable, MapArrayWritable. The reduce works as expected
if each MapWritable contains both a DoubleWritable and a
MapArrayWritable. The reduce fails with the following exception if
some of the MapWritables contain only a DoubleWritable value:

---
java.lang.IllegalArgumentException: Id 1 exists but maps to
com.realcomp.data.hadoop.record.MapArrayWritable and not
org.apache.hadoop.io.DoubleWritable at
org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:75)
at 
org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:203)
at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:148)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
at 
org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:145)
at 
org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
at 
org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:292)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:168) at
org.apache.hadoop.mapred.ReduceTask.


Digging into the source a little, I stumbled upon the fact that the
default constructor for AbstractMapWritable does not configure itself
to handle DoubleWritable as it does for all the other base Writables.
This looks like an omission to me, and if DoubleWritable were
configured, I would probably never have noticed this problem, as there
would be only one custom class in the MapWritable.
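
For what it is worth, one workaround sketch that keeps the ids consistent is
to pre-register the value classes in a MapWritable subclass and declare that
subclass as the job's value class (this assumes addToMap(Class) is still
protected in this RC, and is untested here):

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.MapWritable;
import com.realcomp.data.hadoop.record.MapArrayWritable;

public class TypedMapWritable extends MapWritable {
  public TypedMapWritable() {
    super();
    // Register every non-default value class in a fixed order so that all
    // instances assign them the same ids during (de)serialization.
    addToMap(DoubleWritable.class);
    addToMap(MapArrayWritable.class);
  }
}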

Question 1:
Should I be able to reduce on MapWritables that contain different
(custom) value classes?

Question 2:
It appears the org.apache.hadoop.io.serializer.WritableSerialization
class reuses the first MapWritable instance for each deserialization.
This is probably a performance optimization, and it explains why I am
getting the exception. Is it possible for me to register my own
serialization class that would allow me to deserialize MapWritables
with different value classes? Are there examples of this available?


Note: I realize I am running off of a release candidate, but I thought
I would ask here first before I go through the trouble of upgrading
the cluster.

thanks,
Kyle


Re: Configure hadoop scheduler

2011-12-20 Thread Prashant Kommireddi
I am guessing you are trying to use the FairScheduler but you have
specified CapacityScheduler in your configuration. You need to change
mapreduce.jobtracker.scheduler to FairScheduler.

Sent from my iPhone

On Dec 20, 2011, at 8:51 AM, Merto Mertek  wrote:

> Hi,
>
> I am having problems with changing the default hadoop scheduler (i assume
> that the default scheduler is a FIFO scheduler).
>
> I am following the guide located in hadoop/docs directory however I am not
> able to run it.  Link for scheduling administration returns an http error
> 404 ( http://localhost:50030/scheduler ). In the UI under scheduling
> information I can see only one queue named "default". mapred-site.xml file
> is accessible because when changing a port for a jobtracker I can see a
> daemon running with a changed port. Variable $HADOOP_CONFIG_DIR was added
> to .bashrc, however that did not solve the problem. I tried to rebuild
> hadoop, manualy place the fair scheduler jar in hadoop/lib and changed the
> hadoop classpath in hadoop-env.sh to point to the lib folder, but without
> success. The only info of the scheduler that is seen in the jobtracker log
> is the folowing info:
>
> Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT,
>> limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
>>
>
> I am working on this several days and running out of ideas... I am
> wondering how to fix it and where to check currently active scheduler
> parameters?
>
> Config files:
> mapred-site.xml 
> allocation.xml 
> Tried versions: 0.20.203 and 204
>
> Thank you


Re: Configure hadoop scheduler

2011-12-20 Thread Matei Zaharia
Are you trying to use the capacity scheduler or the fair scheduler? Your 
mapred-site.xml says to use the capacity scheduler but then points to a fair 
scheduler allocation file. Take a look at 
http://hadoop.apache.org/common/docs/r0.20.204.0/fair_scheduler.html for 
setting up the fair scheduler or 
http://hadoop.apache.org/common/docs/r0.20.204.0/capacity_scheduler.html for 
the capacity scheduler.

It may also be good to remove the  stuff in mapred-site.xml. I'm not 
sure whether it can affect these settings but it's certainly not necessary for 
the scheduler settings.

Matei

On Dec 20, 2011, at 11:51 AM, Merto Mertek wrote:

> Hi,
> 
> I am having problems with changing the default hadoop scheduler (i assume
> that the default scheduler is a FIFO scheduler).
> 
> I am following the guide located in hadoop/docs directory however I am not
> able to run it.  Link for scheduling administration returns an http error
> 404 ( http://localhost:50030/scheduler ). In the UI under scheduling
> information I can see only one queue named "default". mapred-site.xml file
> is accessible because when changing a port for a jobtracker I can see a
> daemon running with a changed port. Variable $HADOOP_CONFIG_DIR was added
> to .bashrc, however that did not solve the problem. I tried to rebuild
> hadoop, manualy place the fair scheduler jar in hadoop/lib and changed the
> hadoop classpath in hadoop-env.sh to point to the lib folder, but without
> success. The only info of the scheduler that is seen in the jobtracker log
> is the folowing info:
> 
> Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT,
>> limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
>> 
> 
> I am working on this several days and running out of ideas... I am
> wondering how to fix it and where to check currently active scheduler
> parameters?
> 
> Config files:
> mapred-site.xml 
> allocation.xml 
> Tried versions: 0.20.203 and 204
> 
> Thank you



Configure hadoop scheduler

2011-12-20 Thread Merto Mertek
Hi,

I am having problems changing the default Hadoop scheduler (I assume the
default scheduler is the FIFO scheduler).

I am following the guide located in the hadoop/docs directory, however I am not
able to get it running. The link for scheduling administration returns an HTTP
404 error ( http://localhost:50030/scheduler ). In the UI, under scheduling
information, I can see only one queue named "default". The mapred-site.xml file
is being picked up, because when I change the jobtracker port I can see the
daemon running on the changed port. The variable $HADOOP_CONFIG_DIR was added
to .bashrc, however that did not solve the problem. I tried to rebuild Hadoop,
manually place the fair scheduler jar in hadoop/lib, and change the Hadoop
classpath in hadoop-env.sh to point to the lib folder, but without success.
The only scheduler-related info that appears in the jobtracker log is the
following:

Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT,
> limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
>

I have been working on this for several days and am running out of ideas. How
can I fix it, and where can I check the currently active scheduler parameters?

Config files:
mapred-site.xml 
allocation.xml 
Tried versions: 0.20.203 and 204

Thank you


Hadoop and Ubuntu / Java

2011-12-20 Thread hadoopman


http://www.omgubuntu.co.uk/2011/12/java-to-be-removed-from-ubuntu-uninstalled-from-user-machines/

I'm curious what this will mean for Hadoop on Ubuntu systems moving
forward. I tried OpenJDK with Hadoop nearly two years ago. Needless
to say, it was a real problem.


Hopefully we can still download it from the Sun/Oracle web site and
still use it. It won't be the same though :/
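
If it comes to that, a manual install sketch (the version number and paths are
assumptions; the self-extracting .bin comes from the Oracle download page):

# unpack the Sun/Oracle JDK outside the package manager's control
cd /usr/lib/jvm && sudo /path/to/jdk-6u30-linux-x64.bin
# then point Hadoop at it in conf/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_30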


RE: Desperate!!!! Expanding,shrinking cluster or replacing failed nodes.

2011-12-20 Thread Sloot, Hans-Peter


But I ran into the java.io.IOException: Incompatible namespaceIDs error every
time.
Should I configure the files dfs/data/current/VERSION, dfs/name/current/VERSION
and conf/*site.xml from other existing nodes?





-Original Message-
From: Harsh J [mailto:ha...@cloudera.com]
Sent: dinsdag 20 december 2011 14:30
To: common-user@hadoop.apache.org
Cc: hdfs-...@hadoop.apache.org
Subject: Re: Desperate Expanding,shrinking cluster or replacing failed 
nodes.

Hans-Peter,

Adding new nodes is simply (assuming network setup is sane and done):

- Install/unpack services on new machine.
- Deploy a config copy for the services.
- Start the services.

You should *not* format a NameNode *ever*, after the first time you start it 
up. Formatting loses all data of HDFS, so don't even think about that after the 
first time you use it :)

On 20-Dec-2011, at 6:12 PM, Sloot, Hans-Peter wrote:

> Hello all,
>
> I have asked this question a couple of days ago but no one responded.
>
> I built a 6 node hadoop cluster, guided Michael Noll, starting with a single 
> node and expanding it one by one.
> Every time I expanded the cluster I ran into error : java.io.IOException: 
> Incompatible namespaceIDs
>
> So now my question is what is the correct procedure for expanding, shrinking 
> a cluster?
> And how to replace a failed node?
>
> Can someone  point me to the correct manuals.
> I have already looked at the available documents on the wiki and 
> hadoop.apache.org but could not find the answers.
>
> Regards Hans-Peter
>
>
>
>
>

Re: Desperate!!!! Expanding,shrinking cluster or replacing failed nodes.

2011-12-20 Thread Harsh J
Hans-Peter,

Adding new nodes is simple (assuming the network setup is sane and done):

- Install/unpack services on new machine.
- Deploy a config copy for the services.
- Start the services.

You should *not* format a NameNode *ever*, after the first time you start it 
up. Formatting loses all data of HDFS, so don't even think about that after the 
first time you use it :)

On 20-Dec-2011, at 6:12 PM, Sloot, Hans-Peter wrote:

> Hello all,
> 
> I have asked this question a couple of days ago but no one responded.
> 
> I built a 6 node hadoop cluster, guided Michael Noll, starting with a single 
> node and expanding it one by one.
> Every time I expanded the cluster I ran into error : java.io.IOException: 
> Incompatible namespaceIDs
> 
> So now my question is what is the correct procedure for expanding, shrinking 
> a cluster?
> And how to replace a failed node?
> 
> Can someone  point me to the correct manuals.
> I have already looked at the available documents on the wiki and 
> hadoop.apache.org but could not find the answers.
> 
> Regards Hans-Peter
> 
> 
> 
> 
> 



Re: Desperate!!!! Expanding,shrinking cluster or replacing failed nodes.

2011-12-20 Thread Dejan Menges
Hi,

There is quick info in section 1.5 of http://wiki.apache.org/hadoop/FAQ

Just briefly: when you add a new node and you are sure the configuration on
that one is fine, before you start anything you need to issue "hadoop
dfsadmin -refreshNodes", after which you need to start the datanode/MR
services, as in the sketch below.
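
In command form, roughly (a sketch; assumes the usual tarball layout and that
the new node's config is already in place):

# on the namenode/jobtracker
hadoop dfsadmin -refreshNodes    # re-reads the dfs.hosts include/exclude lists, if you use them
# on the new node
hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker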

Hope this helps!

Regards,
Dejan

On Tue, Dec 20, 2011 at 1:42 PM, Sloot, Hans-Peter <
hans-peter.sl...@atos.net> wrote:

> Hello all,
>
> I have asked this question a couple of days ago but no one responded.
>
> I built a 6 node hadoop cluster, guided Michael Noll, starting with a
> single node and expanding it one by one.
> Every time I expanded the cluster I ran into error : java.io.IOException:
> Incompatible namespaceIDs
>
> So now my question is what is the correct procedure for expanding,
> shrinking a cluster?
> And how to replace a failed node?
>
> Can someone  point me to the correct manuals.
> I have already looked at the available documents on the wiki and
> hadoop.apache.org but could not find the answers.
>
> Regards Hans-Peter
>
>
>
>
>


Desperate!!!! Expanding,shrinking cluster or replacing failed nodes.

2011-12-20 Thread Sloot, Hans-Peter
Hello all,

I asked this question a couple of days ago but no one responded.

I built a 6-node Hadoop cluster, guided by Michael Noll's tutorial, starting
with a single node and expanding it one by one.
Every time I expanded the cluster I ran into the error: java.io.IOException:
Incompatible namespaceIDs

So now my question is: what is the correct procedure for expanding or shrinking
a cluster?
And how do I replace a failed node?

Can someone point me to the correct manuals?
I have already looked at the available documents on the wiki and
hadoop.apache.org but could not find the answers.

Regards Hans-Peter




