Re: Pragmatic cluster backup strategies?

2012-05-30 Thread alo alt
Hi,

you could set fs.trash.interval to the number of minutes you want deleted data 
to be kept before it is lost forever. Data removed with the fs shell is moved 
into .Trash and only deleted permanently after the configured time.
A second option is to use fuse (hadoop-fuse-dfs) to mount HDFS and back up your 
data over that mount into a storage tier. That is not the best solution, but a 
usable one. 
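
A minimal sketch of both options (the retention value, paths, and mount point are 
hypothetical); the property goes into core-site.xml:

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>  <!-- keep rm'd data in .Trash for 24 hours (value is in minutes) -->
</property>

# an fs-shell delete now lands in the user's trash instead of disappearing
hadoop fs -rm /data/obsolete.log
hadoop fs -ls /user/$(whoami)/.Trash

# fuse-based copy into another storage tier
hadoop-fuse-dfs dfs://namenode:8020 /hdfs_mount
rsync -av /hdfs_mount/data/ /backup-tier/hdfs-data/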

cheers,
 Alex 

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF

On May 30, 2012, at 8:31 AM, Darrell Taylor wrote:

 Will hadoop fs -rm -rf move everything to the /trash directory or
 will it delete that as well?
 
 I was thinking along the lines of what you suggest, keep the original
 source of the data somewhere and then reprocess it all in the event of a
 problem.
 
 What do other people do?  Do you run another cluster?  Do you backup
 specific parts of the cluster?  Some form of offsite SAN?
 
 On Tue, May 29, 2012 at 6:02 PM, Robert Evans ev...@yahoo-inc.com wrote:
 
 Yes you will have redundancy, so no single point of hardware failure can
 wipe out your data, short of a major catastrophe.  But you can still have
 an errant or malicious hadoop fs -rm -rf shut you down.  If you still
 have the original source of your data somewhere else you may be able to
 recover, by reprocessing the data, but if this cluster is your single
 repository for all your data you may have a problem.
 
 --Bobby Evans
 
 On 5/29/12 11:40 AM, Michael Segel michael_se...@hotmail.com wrote:
 
 Hi,
 That's not a backup strategy.
 You could still have joe luser take out a key file or directory. What do
 you do then?
 
 On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:
 
 Hi,
 
 We are about to build a 10 machine cluster with 40Tb of storage,
 obviously
 as this gets full actually trying to create an offsite backup becomes a
 problem unless we build another 10 machine cluster (too expensive right
 now).  Not sure if it will help but we have planned the cabinet into an
 upper and lower half with separate redundant power, then we plan to put
 half of the cluster in the top, half in the bottom, effectively 2 racks,
 so
 in theory we could lose half the cluster and still have the copies of all
 the blocks with a replication factor of 3?  Apart from the data centre
 burning down or some other disaster that would render the machines
 totally
 unrecoverable, is this approach good enough?
 
 I realise this is a very open question and everyone's circumstances are
 different, but I'm wondering what other peoples experiences/opinions are
 for backing up cluster data?
 
 Thanks
 Darrell.
 
 
 



Hadoop BI Usergroup Stuttgart (Germany)

2012-05-30 Thread alo alt
For our German-speaking folks,

we want to start a Hadoop BI user group in Stuttgart (Germany). If you are 
interested, please visit our LinkedIn group 
(http://www.linkedin.com/groups/Hadoop-Germany-4325443) and our Doodle poll 
(http://www.doodle.com/aqwsg4snbwimrsfc). If we see real interest we will call 
for sponsors and speakers later. 

Focus areas:
- Integration of Hadoop / HDFS-based solutions into existing infrastructures
- Exporting data from relational databases into NoSQL / analytics clusters 
(HBase, Hive)
- Statistical analysis (Mahout)
- ISO-compliant approaches to backup and recovery, and HA, using open source





--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF



Re: Writing click stream data to hadoop

2012-05-30 Thread alo alt
I cc'd flume-u...@incubator.apache.org because I don't know whether Mohit is 
subscribed there.

Mohit,

you could use Avro to serialize the data and send it to a Flume Avro source, 
or you could use syslog - both are supported in Flume 1.x. 
http://archive.cloudera.com/cdh/3/flume-ng-1.1.0-cdh3u4/FlumeUserGuide.html

An exec source is also possible; please note that Flume will only start / use the 
command you configured and does not take control of the whole process.
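
A minimal Flume 1.x sketch (agent/source/sink names, the port, and the HDFS path 
are hypothetical), just to show the shape of an Avro-source-to-HDFS setup:

# clickstream.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/clicks
a1.sinks.k1.channel = c1

# start the agent and push a test file at the Avro source
flume-ng agent -n a1 -c conf -f clickstream.conf
flume-ng avro-client -H localhost -p 41414 -F sample-events.txt

Your HTTP front end would then serialize each click with Avro and send it to 
port 41414 instead of the avro-client test shown above.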

- Alex 



--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF

On May 30, 2012, at 4:56 PM, Mohit Anchlia wrote:

 On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote:
 
 Mohit,
 
 Not if you call sync (or hflush/hsync in 2.0) periodically to persist
 your changes to the file. SequenceFile doesn't currently have a
 sync-API inbuilt in it (in 1.0 at least), but you can call sync on the
 underlying output stream instead at the moment. This is possible to do
 in 1.0 (just own the output stream).
 
 Your use case also sounds like you may want to simply use Apache Flume
 (Incubating) [http://incubator.apache.org/flume/] that already does
 provide these features and the WAL-kinda reliability you seek.
 
 
 Thanks Harsh. Does Flume also provide an API on top? I am getting this data
 as HTTP calls; how would I go about using Flume with HTTP calls?
 
 
 On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 We get click data through API calls. I now need to send this data to our
 hadoop environment. I am wondering if I could open one sequence file and
 write to it until it's of certain size. Once it's over the specified
 size I
 can close that file and open a new one. Is this a good approach?
 
 Only thing I worry about is what happens if the server crashes before I
 am
 able to cleanly close the file. Would I lose all previous data?
 
 
 
 --
 Harsh J
 



Event: Meetup in Munich, Thursday May, 24

2012-05-22 Thread alo alt
Folks,

for our folks in Germany / Switzerland / Austria:

Thu, May 24 NoSQL Meetup in Munich, Bavaria, Germany:
http://www.nosqlmunich.de/

eCircle GmbH
Nymphenburger Höfe NY II
Dachauer Str. 63
80335 München

register here: http://www.doodle.com/7e5a6ecizinaznbu
Entry is free

Speakers:
Doug (Hypertable), Christian (HBase NRT) and me (Flume NG / Sqoop)

- Alex

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF



Re: Swiss Hadoop User Group launched ...

2012-05-09 Thread alo alt
That's pretty cool. We have a German LinkedIn / Xing group too:
http://goo.gl/N8pCF

cheers,
 Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF

On May 9, 2012, at 1:17 PM, Jean-Pierre Koenig wrote:

 Hallo Hadoop Users!
 
 To promote knowledge exchange amongst developers, many Hadoop User
 Groups have been founded around the world. In German-speaking
 countries there are currently only two of these groups, located in
 Berlin and Munich, and until now nothing in Switzerland. This gap will
 now be filled. MeMo News, with the support of the ETH Zurich spin-off
 Teralytics, has officially launched the first Swiss Hadoop User
 Group.
 
 The first meeting will be held on 14 May 2012 at 6 PM at ETH Zurich.
 Registration is now available on the project website.
 Details can be found here: bit.ly/LMrh3x
 
 Cheers, Jean-Pierre
 
 -- 
 Jean-Pierre Koenig
 Head of Technology
 
 MeMo News AG
 Sonnenstr. 4
 CH-8280 Kreuzlingen
 
 Tel: +41 71 508 24 86
 Fax: +41 71 671 20 26
 E-Mail: jean-pierre.koe...@memonews.com
 
 http://www.memonews.com
 http://twitter.com/MeMoNewsAG
 http://facebook.com/MeMoNewsAG
 http://xing.com/companies/MeMoNewsAG



Re: HDFS mounting issue using Hadoop-Fuse on Fully Distributed Cluster?

2012-04-26 Thread alo alt
Hi,

I wrote a small writeup about that:
http://mapredit.blogspot.de/2011/11/nfs-exported-hdfs-cdh3.html

As you can see, the FS is mounted as nobody and you are trying to write as root. 
Change the permissions in your HDFS:
hadoop dfs -chmod ... / hadoop dfs -chown ...
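
A minimal sketch (user name and path are hypothetical), run as the HDFS superuser:

su - hdfs
hadoop dfs -mkdir /user/manu
hadoop dfs -chown manu:manu /user/manu
hadoop dfs -chmod 775 /user/manu

After that, the user manu can write into /user/manu through the fuse mount as well.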

- Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Apr 26, 2012, at 12:50 PM, Manu S wrote:

 Dear All,
 
 I have installed *Hadoop-fuse* to mount the HDFS filesystem locally . I
 could mount the HDFS without any issues.But I am not able to do any file
 operations like *delete, copy, move* etc directly. The directory ownership
 automatically changed to *nobody:nobody* while mounting.
 
 *[root@namenode ~]# ls -ld /hdfs_mount/
 drwxr-xr-x 2 root root 4096 Apr 3 16:22 /hdfs_mount/
 
 [root@namenode ~]# hadoop-fuse-dfs dfs://namenode:8020 /hdfs_mount/** **
 INFO fuse_options.c:162 Adding FUSE arg /hdfs_mount/
 
 [root@namenode ~]# ls -ld /hdfs_mount/* *
 drwxrwxr-x 10 nobody nobody 4096 Apr 5 13:22 /hdfs_mount/*
 
 I tried the same with *pseudo-distributed node*,but its working fine. I can
 do any normal file operations after mounting the HDFS.
 
 Appreciate your help on the same.
 
 -- 
 Thanks  Regards
 
 *Manu S*
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in



Re: HDFS mounting issue using Hadoop-Fuse on Fully Distributed Cluster?

2012-04-26 Thread alo alt
Manu,

did you mount hdfs over fstab:

hadoop-fuse-dfs#dfs://namenode.local:PORT /hdfs-mount fuse usetrash,rw 0 0 ?

You could do that with:
mkdir -p /hdfs-mount && chmod 777 /hdfs-mount && \
echo "hadoop-fuse-dfs#dfs://NN.URI:PORT /hdfs-mount fuse usetrash,rw 0 0" >> /etc/fstab && \
mount -a ; mount


- Alex


On Apr 26, 2012, at 2:00 PM, Manu S wrote:

 Thanks a lot Alex.
 
 Actually I didn't tried the NFS option, as I am trying to sort out this 
 hadoop-fuse mounting issue.
 I can't change the ownership of mount directory after hadoop-fuse mount.
 
 [root@namenode ~]# ls -ld /hdfs_mount/
 drwxrwxr-x 11 nobody nobody 4096 Apr  9 12:34 /hdfs_mount/
 
 [root@namenode ~]# chown hdfs /hdfs_mount/
 chown: changing ownership of `/hdfs_mount/': Input/output error
 
 Any ideas?
 
 
 On Thu, Apr 26, 2012 at 4:26 PM, alo alt wget.n...@googlemail.com wrote:
 Hi,
 
 I wrote a small writeup about:
 http://mapredit.blogspot.de/2011/11/nfs-exported-hdfs-cdh3.html
 
 As you see, the FS is mounted as nobody and you try as root. Change the 
 permissions in your hdfs:
 hadoop -dfs chmod / chown 
 
 - Alex
 
 --
 Alexander Lorenz
 http://mapredit.blogspot.com
 
 On Apr 26, 2012, at 12:50 PM, Manu S wrote:
 
 Dear All,
 
 I have installed *Hadoop-fuse* to mount the HDFS filesystem locally . I
 could mount the HDFS without any issues.But I am not able to do any file
 operations like *delete, copy, move* etc directly. The directory ownership
 automatically changed to *nobody:nobody* while mounting.
 
 *[root@namenode ~]# ls -ld /hdfs_mount/
 drwxr-xr-x 2 root root 4096 Apr 3 16:22 /hdfs_mount/
 
 [root@namenode ~]# hadoop-fuse-dfs dfs://namenode:8020 /hdfs_mount/** **
 INFO fuse_options.c:162 Adding FUSE arg /hdfs_mount/
 
 [root@namenode ~]# ls -ld /hdfs_mount/* *
 drwxrwxr-x 10 nobody nobody 4096 Apr 5 13:22 /hdfs_mount/*
 
 I tried the same with *pseudo-distributed node*,but its working fine. I can
 do any normal file operations after mounting the HDFS.
 
 Appreciate your help on the same.
 
 --
 Thanks  Regards
 
 *Manu S*
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in
 
 
 
 
 -- 
 Thanks  Regards
 
 Manu S
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in
 
 
 


--
Alexander Lorenz
http://mapredit.blogspot.com



Re: HDFS mounting issue using Hadoop-Fuse on Fully Distributed Cluster?

2012-04-26 Thread alo alt
Yes, as I wrote: you can't use root as the user for writing; root (or superuser) 
has another context in HDFS. Just change to the hdfs user (su - hdfs) and try again. 
For all users who should have access to the mounted FS you should create a group 
and chown a directory in HDFS to it (maybe /tmp/group or similar)

best,
Alex 


On Apr 26, 2012, at 2:53 PM, Manu S wrote:

 Yeah Alex, I tried. But still I am not able to make it
 
 [root@namenode ~]# echo "hadoop-fuse-dfs#dfs://namenode:8020 /hdfs_mount fuse usetrash,rw 0 0" >> /etc/fstab
 [root@namenode ~]# mount -a
 INFO fuse_options.c:162 Adding FUSE arg /hdfs_mount
 
 [root@namenode ~]# mount | grep fuse
 fuse on /hdfs_mount type fuse 
 (rw,nosuid,nodev,allow_other,default_permissions)
 
 [root@namenode ~]# ls -ld /hdfs_mount/
 drwxrwxr-x 11 nobody nobody 4096 Apr  9 12:34 /hdfs_mount/
 [root@namenode ~]# touch /hdfs_mount/file
 touch: cannot touch `/hdfs_mount/file': Permission denied
 
 
 
 On Thu, Apr 26, 2012 at 6:09 PM, alo alt wget.n...@googlemail.com wrote:
 Manu,
 
 did you mount hdfs over fstab:
 
 hadoop-fuse-dfs#dfs://namenode.local:PORT /hdfs-mount fuse usetrash,rw 0 0 
 ?
 
 You could that do with:
 mkdir -p /hdfs-mount  chmod 777 /hdfs-mount  echo 
 hadoop-fuse-dfs#dfs://NN.URI:PORT /hdfs-mount fuse usetrash,rw 0 0  
 /etc/fstab  mount -a ; mount
 
 
 - Alex
 
 
 On Apr 26, 2012, at 2:00 PM, Manu S wrote:
 
 Thanks a lot Alex.
 
 Actually I didn't tried the NFS option, as I am trying to sort out this 
 hadoop-fuse mounting issue.
 I can't change the ownership of mount directory after hadoop-fuse mount.
 
 [root@namenode ~]# ls -ld /hdfs_mount/
 drwxrwxr-x 11 nobody nobody 4096 Apr  9 12:34 /hdfs_mount/
 
 [root@namenode ~]# chown hdfs /hdfs_mount/
 chown: changing ownership of `/hdfs_mount/': Input/output error
 
 Any ideas?
 
 
 On Thu, Apr 26, 2012 at 4:26 PM, alo alt wget.n...@googlemail.com wrote:
 Hi,
 
 I wrote a small writeup about:
 http://mapredit.blogspot.de/2011/11/nfs-exported-hdfs-cdh3.html
 
 As you see, the FS is mounted as nobody and you try as root. Change the 
 permissions in your hdfs:
 hadoop -dfs chmod / chown 
 
 - Alex
 
 --
 Alexander Lorenz
 http://mapredit.blogspot.com
 
 On Apr 26, 2012, at 12:50 PM, Manu S wrote:
 
 Dear All,
 
 I have installed *Hadoop-fuse* to mount the HDFS filesystem locally . I
 could mount the HDFS without any issues.But I am not able to do any file
 operations like *delete, copy, move* etc directly. The directory ownership
 automatically changed to *nobody:nobody* while mounting.
 
 *[root@namenode ~]# ls -ld /hdfs_mount/
 drwxr-xr-x 2 root root 4096 Apr 3 16:22 /hdfs_mount/
 
 [root@namenode ~]# hadoop-fuse-dfs dfs://namenode:8020 /hdfs_mount/** **
 INFO fuse_options.c:162 Adding FUSE arg /hdfs_mount/
 
 [root@namenode ~]# ls -ld /hdfs_mount/* *
 drwxrwxr-x 10 nobody nobody 4096 Apr 5 13:22 /hdfs_mount/*
 
 I tried the same with *pseudo-distributed node*,but its working fine. I can
 do any normal file operations after mounting the HDFS.
 
 Appreciate your help on the same.
 
 --
 Thanks  Regards
 
 *Manu S*
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in
 
 
 
 
 --
 Thanks  Regards
 
 Manu S
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in
 
 
 
 
 
 --
 Alexander Lorenz
 http://mapredit.blogspot.com
 
 
 
 
 -- 
 Thanks  Regards
 
 Manu S
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in
 
 
 


--
Alexander Lorenz
http://mapredit.blogspot.com



Re: HDFS mounting issue using Hadoop-Fuse on Fully Distributed Cluster?

2012-04-26 Thread alo alt
Manu,

for clarifying:

root has no access to the mounted HDFS. Just follow the howto:

1. create the group and the users on ALL nodes:
groupadd hdfs-user && adduser USERNAME -G hdfs-user

2. sudo into hdfs:
su - hdfs

3. create a directory in hdfs and change the rights:
hadoop fs -mkdir /someone && hadoop fs -chmod 774 /someone && hadoop fs -chgrp hdfs-user /someone

Now the users you created and added to the group are able to write files.
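
And a quick check over the fuse mount, assuming a hypothetical user manu who is 
in hdfs-user:

su - manu
touch /hdfs_mount/someone/testfile   # should succeed now
hadoop fs -ls /someone               # the file shows up in HDFS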

- Alex
 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Apr 26, 2012, at 3:58 PM, alo alt wrote:

 Yes, as I wrote. You can't use root as user for writing, root (or superuser) 
 has another context in hdfs. Just change into hdfs (su - hdfs) and try again. 
 For all user who should have access to the mounted fs you should create a 
 group and chown them in hdfs (maybe /tmp/group or similar)
 
 best,
 Alex 
 
 
 On Apr 26, 2012, at 2:53 PM, Manu S wrote:
 
 Yeah Alex, I tried. But still I am not able to make it
 
 [root@namenode ~]# echo hadoop-fuse-dfs#dfs://namenode:8020 /hdfs_mount 
 fuse usetrash,rw 0 0  /etc/fstab
 [root@namenode ~]# mount -a
 INFO fuse_options.c:162 Adding FUSE arg /hdfs_mount
 
 [root@namenode ~]# mount | grep fuse
 fuse on /hdfs_mount type fuse 
 (rw,nosuid,nodev,allow_other,default_permissions)
 
 [root@namenode ~]# ls -ld /hdfs_mount/
 drwxrwxr-x 11 nobody nobody 4096 Apr  9 12:34 /hdfs_mount/
 [root@namenode ~]# touch /hdfs_mount/file
 touch: cannot touch `/hdfs_mount/file': Permission denied
 
 
 
 On Thu, Apr 26, 2012 at 6:09 PM, alo alt wget.n...@googlemail.com wrote:
 Manu,
 
 did you mount hdfs over fstab:
 
 hadoop-fuse-dfs#dfs://namenode.local:PORT /hdfs-mount fuse usetrash,rw 0 
 0 ?
 
 You could that do with:
 mkdir -p /hdfs-mount  chmod 777 /hdfs-mount  echo 
 hadoop-fuse-dfs#dfs://NN.URI:PORT /hdfs-mount fuse usetrash,rw 0 0  
 /etc/fstab  mount -a ; mount
 
 
 - Alex
 
 
 On Apr 26, 2012, at 2:00 PM, Manu S wrote:
 
 Thanks a lot Alex.
 
 Actually I didn't tried the NFS option, as I am trying to sort out this 
 hadoop-fuse mounting issue.
 I can't change the ownership of mount directory after hadoop-fuse mount.
 
 [root@namenode ~]# ls -ld /hdfs_mount/
 drwxrwxr-x 11 nobody nobody 4096 Apr  9 12:34 /hdfs_mount/
 
 [root@namenode ~]# chown hdfs /hdfs_mount/
 chown: changing ownership of `/hdfs_mount/': Input/output error
 
 Any ideas?
 
 
 On Thu, Apr 26, 2012 at 4:26 PM, alo alt wget.n...@googlemail.com wrote:
 Hi,
 
 I wrote a small writeup about:
 http://mapredit.blogspot.de/2011/11/nfs-exported-hdfs-cdh3.html
 
 As you see, the FS is mounted as nobody and you try as root. Change the 
 permissions in your hdfs:
 hadoop -dfs chmod / chown 
 
 - Alex
 
 --
 Alexander Lorenz
 http://mapredit.blogspot.com
 
 On Apr 26, 2012, at 12:50 PM, Manu S wrote:
 
 Dear All,
 
 I have installed *Hadoop-fuse* to mount the HDFS filesystem locally . I
 could mount the HDFS without any issues.But I am not able to do any file
 operations like *delete, copy, move* etc directly. The directory ownership
 automatically changed to *nobody:nobody* while mounting.
 
 *[root@namenode ~]# ls -ld /hdfs_mount/
 drwxr-xr-x 2 root root 4096 Apr 3 16:22 /hdfs_mount/
 
 [root@namenode ~]# hadoop-fuse-dfs dfs://namenode:8020 /hdfs_mount/** **
 INFO fuse_options.c:162 Adding FUSE arg /hdfs_mount/
 
 [root@namenode ~]# ls -ld /hdfs_mount/* *
 drwxrwxr-x 10 nobody nobody 4096 Apr 5 13:22 /hdfs_mount/*
 
 I tried the same with *pseudo-distributed node*,but its working fine. I can
 do any normal file operations after mounting the HDFS.
 
 Appreciate your help on the same.
 
 --
 Thanks  Regards
 
 *Manu S*
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in
 
 
 
 
 --
 Thanks  Regards
 
 Manu S
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in
 
 
 
 
 
 --
 Alexander Lorenz
 http://mapredit.blogspot.com
 
 
 
 
 -- 
 Thanks  Regards
 
 Manu S
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in
 
 
 
 
 
 --
 Alexander Lorenz
 http://mapredit.blogspot.com
 



Re: Feedback on real world production experience with Flume

2012-04-21 Thread alo alt
Hi,

in my former job: productive use, Germany, web portal. Throughput 600 mb/minute. 
Logfiles from Windows IIS and Apache. Used in the usual way, no own decorators or 
sinks. Simply syslog -> bucketing (1 minute rollover) -> HDFS, split into minutes 
(MMDDHHMM). 

Stable, with some issues (you'll find them on the mailing list), but it works well 
if you know what to do when anything happens. Btw, NG 1.1.0 is more stable than 
flume pre-1.x and runs in some productive environments.

- Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Apr 21, 2012, at 12:14 AM, Karl Hennig wrote:

 I am investigating automated methods of moving our data from the web tier 
 into HDFS for processing, a process that's performed periodically.
 
 I am looking for feedback from anyone who has actually used Flume in a 
 production setup (redundant, failover) successfully.  I understand it is now 
 being largely rearchitected during its incubation as Apache Flume-NG, so I 
 don't have full confidence in the old, stable releases.
 
 The other option would be to write our own tools.  What methods are you using 
 for these kinds of tasks?  Did you write your own or does Flume (or something 
 else) work for you?
 
 I'm also on the Flume mailing list, but I wanted to ask these questions here 
 because I'm interested in Flume _and_ alternatives.
 
 Thank you!
 



Re: Feedback on real world production experience with Flume

2012-04-21 Thread alo alt
We decided: NO product and vendor advertising on Apache mailing lists! 
I do not understand why you would put that closed-source stuff from your employer 
in the room. It has nothing to do with Flume or the use cases!

--
Alexander Lorenz
http://mapredit.blogspot.com

On Apr 21, 2012, at 4:06 PM, M. C. Srivas wrote:

 Karl,
 
 since you did ask for alternatives,  people using MapR prefer to use the
 NFS access to directly deposit data (or access it).  Works seamlessly from
 all Linuxes, Solaris, Windows, AIX and a myriad of other legacy systems
 without having to load any agents on those machines. And it is fully
 automatic HA
 
 Since compression is built-in in MapR, the data gets compressed coming in
 over NFS automatically without much fuss.
 
 Wrt to performance,  can get about 870 MB/s per node if you have 10GigE
 attached (of course, with compression, the effective throughput will
 surpass that based on how good the data can be squeezed).
 
 
 On Fri, Apr 20, 2012 at 3:14 PM, Karl Hennig khen...@baynote.com wrote:
 
 I am investigating automated methods of moving our data from the web tier
 into HDFS for processing, a process that's performed periodically.
 
 I am looking for feedback from anyone who has actually used Flume in a
 production setup (redundant, failover) successfully.  I understand it is
 now being largely rearchitected during its incubation as Apache Flume-NG,
 so I don't have full confidence in the old, stable releases.
 
 The other option would be to write our own tools.  What methods are you
 using for these kinds of tasks?  Did you write your own or does Flume (or
 something else) work for you?
 
 I'm also on the Flume mailing list, but I wanted to ask these questions
 here because I'm interested in Flume _and_ alternatives.
 
 Thank you!
 
 



Re: Hadoop User Group Cologne

2012-03-07 Thread alo alt
Hi,

we set up a German UG a few days ago:
http://mapredit.blogspot.com/2012/03/hadoop-ug-germany.html

We have founded a user group; for now there are groups on XING / LinkedIn and a 
website, which is still really quite new :) If you want to join in, get in touch!

Thanks and see you soon,
 - Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Mar 8, 2012, at 7:48 AM, Christian Bitter wrote:

 Dear all,
 
 I would like to know whether there is already or whether there is interest in 
 establishing some form of user group for hadoop in Cologne / Germany.
 
 Cheers,
 
 Christian



Re: help for snappy

2012-02-26 Thread alo alt
Hi,

https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation#SnappyInstallation-UsingSnappyforMapReduceCompression
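
The short version of what that page describes, as a sketch for mapred-site.xml 
(0.20/1.x MRv1 property names); the jar/class names in the per-job form are 
hypothetical and assume the job uses ToolRunner / GenericOptionsParser:

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

# or per job:
hadoop jar my-job.jar MyJob \
  -D mapred.compress.map.output=true \
  -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  input output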

best,
 Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 27, 2012, at 7:16 AM, hadoop hive wrote:

 Hey folks,
 
 i m using hadoop 0.20.2 + r911707 , please tell me the installation and how
 to use snappy for compression and decompression
 
 Regards
 Vikas Srivastava



Re: help for snappy

2012-02-26 Thread alo alt
Hive?
You are on the wrong list then; for Hive-related questions refer to:
u...@hive.apache.org

--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 27, 2012, at 8:14 AM, hadoop hive wrote:

 hey Alex,
 
 i cant you hive for that.???
 
 On Mon, Feb 27, 2012 at 12:29 PM, alo alt wget.n...@googlemail.com wrote:
 
 After you have installed snappy you have to configure the codec like the
 URL I posted before or you can reference them in your MR jobs. Be sure you
 have the jars in your classpath for.
 
 For storing snappy compressed files in HDFS you should use Pig or Flume.
 
 --
 Alexander Lorenz
 http://mapredit.blogspot.com
 
 On Feb 27, 2012, at 7:28 AM, hadoop hive wrote:
 
 thanks Alex,
 
 i m using Apache hadoop, steps i followed
 
 1:- untar snappy
 2:- entry in mapred site
 
 this can be used like deflate only(like only on overwriting file)
 
 
 
 On Mon, Feb 27, 2012 at 11:50 AM, alo alt wget.n...@googlemail.com
 wrote:
 
 Hi,
 
 
 
 https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation#SnappyInstallation-UsingSnappyforMapReduceCompression
 
 best,
 Alex
 
 --
 Alexander Lorenz
 http://mapredit.blogspot.com
 
 On Feb 27, 2012, at 7:16 AM, hadoop hive wrote:
 
 Hey folks,
 
 i m using hadoop 0.20.2 + r911707 , please tell me the installation and
 how
 to use snappy for compression and decompression
 
 Regards
 Vikas Srivastava
 
 
 
 



Re: HELP - Problem in setting up Hadoop - Multi-Node Cluster

2012-02-09 Thread alo alt
Please use the latest JDK 6.

best,
 Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 9, 2012, at 11:11 AM, hadoop hive wrote:

 Did you check the ssh to localhost? It should be passwordless ssh to localhost:

 public-key => authorized_keys
 
 On Thu, Feb 9, 2012 at 1:06 AM, Robin Mueller-Bady 
 robin.mueller-b...@oracle.com wrote:
 Dear Guruprasad,
 
 it would be very helpful to provide details from your configuration files as 
 well as more details on your setup.
 It seems to be that the connection from slave to master cannot be established 
 (Connection reset by peer).
 Do you use a virtual environment, physical master/slaves or all on one 
 machine ?
 Please paste also the output of kingul2 namenode logs.
 
 Regards,
 
 Robin
 
 
 On 02/08/12 13:06, Guruprasad B wrote:
 Hi,
 
 I am Guruprasad from Bangalore (India). I need help in setting up hadoop
 platform. I am very much new to Hadoop Platform.
 
 I am following the below given articles and I was able to set up
 Single-Node Cluster
 
 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#what-we-want-to-do
 
 Now I am trying to set up 
 Multi-Node Cluster by following the below given
 article.
 
 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
 
 
 Below given is my setup:
 Hadoop : hadoop_0.20.2
 Linux: Ubuntu Linux 10.10
 Java: java-7-oracle
 
 
 I have successfully reached till the topic Starting the multi-node
 cluster in the above given article.
 When I start the HDFS/MapReduce daemons it is getting started and going
 down immediately both in master  slave as well,
 please have a look at the below logs,
 
 hduser@kinigul2:/usr/local/hadoop$ bin/start-dfs.sh
 starting namenode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-kinigul2.out
 master: starting datanode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-kinigul2.out
 slave: starting datanode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-guruL.out
 master: starting secondarynamenode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-kinigul2.out
 
 hduser@kinigul2:/usr/local/hadoop$ jps
 6098 DataNode
 6328 Jps
 5914 NameNode
 6276 SecondaryNameNode
 
 hduser@kinigul2:/usr/local/hadoop$ jps
 6350 Jps
 
 
 I am getting below given error in slave logs:
 
 2012-02-08 21:04:01,641 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call
 to master/
 16.150.98.62:54310
  failed on local exception:
 java.io.IOException: Connection reset by peer
 at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
 at org.apache.hadoop.ipc.Client.call(Client.java:743)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
 at $Proxy4.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
 at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:314)
 at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:291)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:269)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
 Caused by: java.io.IOException: Connection reset by peer
 at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:218)
 at sun.nio.ch.IOUtil.read(IOUtil.java:191)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
 at
 org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
 at
 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
 at java.io.FilterInputStream.read(FilterInputStream.java:133)
 at
 org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
 at java.io.DataInputStream.readInt(DataInputStream.java:387)
 at
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
 at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
 
 
 Can you please tell what could be the reason behind this or point me to
 some pointers?
 
 

Re: What's the best practice of loading logs into hdfs while using hive to do log analytic?

2012-02-07 Thread alo alt
Hi,

a first start with flume:
http://mapredit.blogspot.com/2011/10/centralized-logfile-management-across.html

Facebook's Scribe could also work for you.

- Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 7, 2012, at 11:03 AM, Xiaobin She wrote:

 Hi all,
 
 Sorry if it is not appropriate to send one thread into two maillist.
 **
 I'm tring to use hadoop and hive to do some log analytic jobs.
 
 Our system generate lots of logs every day, for example, it produce about
 370GB logs(including lots of log files) yesterday, and every day the logs
 increases.
 
 And we want to use hadoop and hive to replace our old log analysic system.
 
 We distinguish our logs with logid, we have an log collector which will
 collect logs from clients and then generate log files.
 
 for every logid, there will be one log file every hour, for some logid,
 this hourly log file can be 1~2GB
 
 I have set up an test cluster with hadoop and hive, and I have run some
 test which seems good for us.
 
 For reference, we will create one table in hive for every logid which will
 be partitoned by hour.
 
 Now I have a question, what's the best practice for loading logs files into
 hdfs or hive warehouse dir ?
 
 My first thought is,  at the begining of every hour,  compress the log file
 of the last hour of every logid and then use the hive cmd tool to load
 these compressed log files into hdfs.
 
 using  commands like LOAD DATA LOCAL inpath '$logname' OVERWRITE  INTO
 TABLE $tablename PARTITION (dt='$h') 
 
 I think this can work, and I have run some test on our 3-nodes test
 clusters.
 
 But the problem is, there are lots of logid which means there are lots of
 log files,  so every hour we will have to load lots of files into hdfs
 and there is another problem,  we will run hourly analysis job on these
 hourly collected log files,
 which inroduces the problem, because there are lots of log files, if we
 load these log files at the same time at the begining of every hour, I
 think  there will some network flows and there will be data delivery
 latency problem.
 
 For data delivery latency problem, I mean it will take some time for the
 log files to be copyed into hdfs,  and this will cause our hourly log
 analysis job to start later.
 
 So I want to figure out if we can write or append logs into an compressed
 file which is already located in hdfs, and I have posted an thread in the
 mailist, and from what I have learned, this is not possible.
 
 
 So, what's the best practice of loading logs into hdfs while using hive to
 do log analytic?
 
 Or what's the common methods to handle problem I have describe above?
 
 Can anyone give me some advices?
 
 Thank you very much for your help!



Re: What's the best practice of loading logs into hdfs while using hive to do log analytic?

2012-02-07 Thread alo alt
Yes. 
You can use partitioned tables in Hive and add new partitions without moving the 
data. For Flume you can define small sinks, but you're right, the file in HDFS is 
only closed and fully written when Flume sends the close. Please note, the gzip 
codec has no marker inside, so you have to wait until Flume has closed the file in 
HDFS before you can process it. Snappy would fit, but I have no long-term tests in 
a productive environment. 

For block sizing you're right, but I think you can work around that. 
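
A minimal sketch of the partitioned-table idea (table name, partition column and 
paths are hypothetical), assuming an external Hive table so the data never moves:

hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (line STRING)
         PARTITIONED BY (dt STRING)
         LOCATION '/flume/weblogs'"

# once Flume has closed the files for an hour, just register that directory:
hive -e "ALTER TABLE weblogs ADD PARTITION (dt='2012020714')
         LOCATION '/flume/weblogs/2012020714'"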

--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 7, 2012, at 3:09 PM, Xiaobin She wrote:

 hi Bejoy and Alex,
 
 thank you for your advice.
 
 Actually I have look at Scribe first, and I have heard of Flume.
 
 I look at flume's user guide just now, and flume seems promising, as Bejoy 
 said , the flume collector can dump data into hdfs when the collector buffer 
 reaches a particular size of after a particular time interval, this is good 
 and I think it can solve the problem of data delivery latency.
 
 But what about compress?
 
 from the user's guide of flume, I see that flum supports compression  of log 
 files, but if flume did not wait until the collector has collect one hour of 
 log and then compress it and send it to hdfs, then it will  send part of the 
 one hour log to hdfs, am I right?
 
 so if I want to use thest data in hive (assume I have an external table in 
 hive), I have to specify at least two partiton key while creating table, one 
 for day-month-hour, and one for some other time interval like ten miniutes, 
 then I add hive partition to the existed external table with specified 
 partition key.
 
 Is the above process right ?
 
 If this right, then there could be some other problem, like the ten miniute 
 logs after compress is not big enough to fit the block size of hdfs which may 
 couse lots of small files ( for some of our log id, this may come true), or 
 if I set the time interval to be half an hour, then at the end of hour, it 
 may still cause the data delivery latency problem.
 
 this seems not a very good solution, am I making some mistakes or 
 misunderstanding here?
 
 thank you very much!
 
 
 
 
 
 2012/2/7 alo alt wget.n...@googlemail.com
 Hi,
 
 a first start with flume:
 http://mapredit.blogspot.com/2011/10/centralized-logfile-management-across.html
 
 Facebook's scribe could also be work for you.
 
 - Alex
 
 --
 Alexander Lorenz
 http://mapredit.blogspot.com
 
 On Feb 7, 2012, at 11:03 AM, Xiaobin She wrote:
 
  Hi all,
 
  Sorry if it is not appropriate to send one thread into two maillist.
  **
  I'm tring to use hadoop and hive to do some log analytic jobs.
 
  Our system generate lots of logs every day, for example, it produce about
  370GB logs(including lots of log files) yesterday, and every day the logs
  increases.
 
  And we want to use hadoop and hive to replace our old log analysic system.
 
  We distinguish our logs with logid, we have an log collector which will
  collect logs from clients and then generate log files.
 
  for every logid, there will be one log file every hour, for some logid,
  this hourly log file can be 1~2GB
 
  I have set up an test cluster with hadoop and hive, and I have run some
  test which seems good for us.
 
  For reference, we will create one table in hive for every logid which will
  be partitoned by hour.
 
  Now I have a question, what's the best practice for loading logs files into
  hdfs or hive warehouse dir ?
 
  My first thought is,  at the begining of every hour,  compress the log file
  of the last hour of every logid and then use the hive cmd tool to load
  these compressed log files into hdfs.
 
  using  commands like LOAD DATA LOCAL inpath '$logname' OVERWRITE  INTO
  TABLE $tablename PARTITION (dt='$h') 
 
  I think this can work, and I have run some test on our 3-nodes test
  clusters.
 
  But the problem is, there are lots of logid which means there are lots of
  log files,  so every hour we will have to load lots of files into hdfs
  and there is another problem,  we will run hourly analysis job on these
  hourly collected log files,
  which inroduces the problem, because there are lots of log files, if we
  load these log files at the same time at the begining of every hour, I
  think  there will some network flows and there will be data delivery
  latency problem.
 
  For data delivery latency problem, I mean it will take some time for the
  log files to be copyed into hdfs,  and this will cause our hourly log
  analysis job to start later.
 
  So I want to figure out if we can write or append logs into an compressed
  file which is already located in hdfs, and I have posted an thread in the
  mailist, and from what I have learned, this is not possible.
 
 
  So, what's the best practice of loading logs into hdfs while using hive to
  do log analytic?
 
  Or what's the common methods to handle problem I have describe above?
 
  Can anyone give me some advices?
 
  Thank you very much

Re: working with SAS

2012-02-06 Thread alo alt
Hi,

hadoop runs on a linux box (mostly) and can run in a standalone installation for 
testing only. If you decide to use hadoop with hive or hbase you have to face a 
lot more tasks:

- installation (Whirr and Amazon EC2, for example)
- write your own mapreduce job or use hive / hbase
- set up sqoop with the Teradata driver

You can easily set up parts 1 and 2 with Amazon's EC2; I think you can also book 
Windows Server there. For a single query that is the best option, I think, before 
you install a hadoop cluster.

best,
 Alex 


--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 6, 2012, at 8:11 AM, Ali Jooan Rizvi wrote:

 Hi,
 
 
 
 I would like to know if hadoop will be of help to me? Let me explain you
 guys my scenario:
 
 
 
 I have a windows server based single machine server having 16 Cores and 48
 GB of Physical Memory. In addition, I have 120 GB of virtual memory.
 
 
 
 I am running a query with statistical calculation on large data of over 1
 billion rows, on SAS. In this case, SAS is acting like a database on which
 both source and target tables are residing. For storage, I can keep the
 source and target data on Teradata as well but the query containing a patent
 can only be run on SAS interface.
 
 
 
 The problem is that SAS is taking many days (25 days) to run it (a single
 query with statistical function) and not all cores all the time were used
 and rather merely 5% CPU was utilized on average. However memory utilization
 was high, very high, and that's why large virtual memory was used. 
 
 
 
 Can I have a hadoop interface in place to do it all so that I may end up
 running the query in lesser time that is in 1 or 2 days. Anything squeezing
 my run time will be very helpful. 
 
 
 
 Thanks
 
 
 
 Ali Jooan Rizvi
 



Re: The Common Account for Hadoop

2012-02-06 Thread alo alt
check the rights of .ssh/authorized_keys on the hosts; it has to be readable and 
writable only by the user (and the same goes for the .ssh directory).
Be sure you copied the right key, without line breaks or fragments. If you have a 
lot of boxes you could use BCFG2:
http://docs.bcfg2.org/
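
A minimal sketch for one box (user and host names are hypothetical):

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@slave1   # or append id_rsa.pub by hand
ssh hduser@slave1 'hostname'                     # should not ask for a password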

- Alex 



--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 6, 2012, at 10:55 AM, Bing Li wrote:

 Dear all,
 
 I am just starting to learn Hadoop. According to the book, Hadoop in
 Action, a common account for each server (masters/slaves) must be created.
 
 Moreover, I need to create a public/private rsa key pair as follows.
 
ssh-keygen -t rsa
 
 Then, id_rsa and id_rsa.pub are put under $HOME/.ssh.
 
 After that, the public key is distributed to other nodes and saved in
 @HOME/.ssh/authorized_keys.
 
 According to the book (Page 27), I can login in a remote target with the
 following command.
 
ssh target (I typed IP address here)
 
 However, according to the book, no password is required to sign in the
 target. On my machine, it is required to type password each time.
 
 Any affects for my future to configure Hadoop? What's wrong with my work?
 
 Thanks so much!
 Bing



Re: 10 nodes how to build the topological graph

2012-02-02 Thread alo alt
Hi Rock,

you mean rack awareness.
http://hadoop.apache.org/common/docs/r0.17.2/hdfs_user_guide.html#Rack+Awareness

Here you find examples:
http://wiki.apache.org/hadoop/topology_rack_awareness_scripts
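
A minimal sketch of such a script (IP ranges and rack names are hypothetical); 
point topology.script.file.name in core-site.xml at it:

#!/bin/bash
# prints one rack id per argument (IP or hostname) handed in by the namenode
for node in "$@"; do
  case $node in
    10.0.1.*) echo -n "/rack1 " ;;
    10.0.2.*) echo -n "/rack2 " ;;
    *)        echo -n "/default-rack " ;;
  esac
done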

best,
 Alex 


--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 3, 2012, at 8:03 AM, Jinyan Xu wrote:

 Hi ,
 
 I have 10 machines to build a small cluster, how to build the topological 
 graph between these machines?
 
 I mean how to build the rack use 10 machines.
 
 Thanks !
 
 -Rock
 
 
 The information and any attached documents contained in this message
 may be confidential and/or legally privileged. The message is
 intended solely for the addressee(s). If you are not the intended
 recipient, you are hereby notified that any use, dissemination, or
 reproduction is strictly prohibited and may be unlawful. If you are
 not the intended recipient, please contact the sender immediately by
 return e-mail and destroy all copies of the original message.



Re: Best Linux Operating system used for Hadoop

2012-01-28 Thread alo alt
But Fedora is unstable and not useful for an enterprise stack. I work within the 
project :) Really nice for a PC, but the high update and fix rate makes it 
unusable in a big data environment. 

- Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 28, 2012, at 2:53 AM, Masoud wrote:

 Hi,
 
 I suggest you Fedora, in my opinion its more powerful than other distribution.
 i have run hadoop on it without any problem,
 
 good luck
 
 On 01/27/2012 06:15 PM, Sujit Dhamale wrote:
 Hi All,
 I am new to Hadoop,
 Can any one tell me which is the best Linux Operating system used for
 installing  running Hadoop. ??
 now a day i am using Ubuntu 11.4 and install Hadoop on it but it
 crashes number of times .
 
 can some please help me out ???
 
 
 Kind regards
 Sujit Dhamale
 
 



Re: Best Linux Operating system used for Hadoop

2012-01-27 Thread alo alt
I suggest CentOS 5.7 / RHEL 5.7

CentOS 6.2 runs also stable

- Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 27, 2012, at 10:15 AM, Sujit Dhamale wrote:

 Hi All,
 I am new to Hadoop,
 Can any one tell me which is the best Linux Operating system used for
 installing  running Hadoop. ??
 now a day i am using Ubuntu 11.4 and install Hadoop on it but it
 crashes number of times .
 
 can some please help me out ???
 
 
 Kind regards
 Sujit Dhamale



Re: Connect to HDFS running on a different Hadoop-Version

2012-01-25 Thread alo alt
BigInsights is an IBM-related product, based on a fork of hadoop, I think. Mixing 
totally different stacks makes no sense, and will not work, I guess.

- Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 25, 2012, at 1:12 PM, Harsh J wrote:

 Hello Romeo,
 
 Inline…
 
 On Wed, Jan 25, 2012 at 4:07 PM, Romeo Kienzler ro...@ormium.de wrote:
 Dear List,
 
 we're trying to use a central HDFS storage in order to be accessed from
 various other Hadoop-Distributions.
 
 The HDFS you've setup, what 'distribution' is that from? You will have
 to use that particular version's jar across all client applications
 you use, else you'll run into RPC version incompatibilities.
 
 Do you think this is possible? We're having trouble, but not related to
 different RPC-Versions.
 
 It should be possible _most of the times_ by replacing jars at the
 client end to use the one that runs your cluster, but there may be
 minor API incompatibilities between certain versions that can get in
 the way. Purely depends on your client application and its
 implementation. If it sticks to using the publicly supported APIs, you
 are mostly fine.
 
 When trying to access a Cloudera CDH3 Update 2 (cdh3u2) HDFS from
 BigInsights 1.3 we're getting this error:
 
 BigInsights runs off IBM's own patched Hadoop sources if I am right,
 and things can get a bit tricky there. See the following points:
 
 Bad connection to FS. Command aborted. Exception: Call to
 localhost.localdomain/127.0.0.1:50070 failed on local exception:
 java.io.EOFException
 java.io.IOException: Call to localhost.localdomain/127.0.0.1:50070 failed on
 local exception: java.io.EOFException
 
 This is surely an RPC issue. The call tries to read off a field, but
 gets no response, EOFs and dies. We have more descriptive error
 messages with the 0.23 version onwards, but the problem here is that
 your IBM client jar is not the same as your cluster's jar. The mixture
 won't work.
 
 com.ibm.biginsights.hadoop.patch.PatchedDistributedFileSystem.initialize(PatchedDistributedFileSystem.java:19)
 
 ^^ This is what am speaking of. Your client (BigInsights? Have not
 used it really…) is using an IBM jar with their supplied
 'PatchDistributedFileSystem', and that is probably incompatible with
 the cluster's HDFS RPC protocols. I do not know enough about IBM's
 custom stuff to know for sure it would work if you replace it with
 your clusters' jar.
 
 But we've already replaced the client hadoop-common.jar's with the Cloudera
 ones.
 
 Apparently not. Your strace shows that com.ibm.* classes are still
 being pulled. My guess is that BigInsights would not work with
 anything non IBM, but I have not used it to know for sure.
 
 If they have a user community, you can ask there if there is a working
 way to have BigInsights run against Apache/CDH/etc. distributions.
 For CDH specific questions, you may ask at
 https://groups.google.com/a/cloudera.org/group/cdh-user/topics instead
 of the Apache lists here.
 
 -- 
 Harsh J
 Customer Ops. Engineer, Cloudera



Re: JobTracker url shwoing less no of nodes available

2012-01-24 Thread alo alt
+common user BCC

please post to the correct mailing lists. Added common users.

That means that some DN daemons are not running. The first place to look are the 
logs of the DNs. What do they say?

- Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 24, 2012, at 7:55 AM, hadoop hive wrote:

 HI Folks,
 
 i got a problem in my job tracker Url, its not Showing the actual no of DN 
 present in Cluster.
 
 any suggestion wats wrong with this, 
 
 
 regards
 Vikas Srivastava



Re: MapReduce job failing when a node of cluster is rebooted

2011-12-27 Thread alo alt
Is the DN you've just rebooted connecting to the NN? Most likely the
datanode daemon isn't running; check it:
ps waux |grep DataNode |grep -v grep
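
And as a cross-check from the namenode side (log path depends on your install):

hadoop dfsadmin -report                    # lists the datanodes the NN sees
tail -n 50 logs/hadoop-*-datanode-*.log    # on the rebooted box, if the daemon died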

- Alex

On Tue, Dec 27, 2011 at 2:44 PM, Rajat Goel rajatgoe...@gmail.com wrote:
 Yes. Hdfs and Mapred related dirs are set outside of /tmp.

 On Tue, Dec 27, 2011 at 6:48 PM, alo alt wget.n...@googlemail.com wrote:

 Hi,

 did you set the hdfs-related dirs outside of /tmp? Most *ux systems
 clean them up on reboot.

 - Alex

 On Tue, Dec 27, 2011 at 2:09 PM, Rajat Goel rajatgoe...@gmail.com wrote:
  Hi,
 
  I have a 7-node setup (1 - Namenode/JobTracker, 6 -
 Datanodes/TaskTrackers)
  running Hadoop version 0.20.203.
 
  I performed the following test:
  Initially cluster is running smoothly. Just before launching a MapReduce
  job (about one or two minutes before), I shutdown one of the data nodes
  (rebooted the machine). Then my MapReduce job starts but immediately
 fails
  with following messages on stderr:
 
  WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
  use org.apache.hadoop.log.metrics.EventCounter in all the
 log4j.properties
  files.
  WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
  use org.apache.hadoop.log.metrics.EventCounter in all the
 log4j.properties
  files.
  WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
  use org.apache.hadoop.log.metrics.EventCounter in all the
 log4j.properties
  files.
  WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
  use org.apache.hadoop.log.metrics.EventCounter in all the
 log4j.properties
  files.
  NOTICE: Configuration: /device.map    /region.map    /url.map
  /data/output/2011/12/26/08
   PS:192.168.100.206:1    3600    true    Notice
  11/12/26 09:10:26 WARN mapred.JobClient: Use GenericOptionsParser for
  parsing the arguments. Applications should implement Tool for the same.
  11/12/26 09:10:26 INFO input.FileInputFormat: Total input paths to
 process
  : 24
  11/12/26 09:10:37 INFO hdfs.DFSClient: Exception in
 createBlockOutputStream
  java.io.IOException: Bad connect ack with firstBadLink as
  192.168.100.5:50010
  11/12/26 09:10:37 INFO hdfs.DFSClient: Abandoning block
  blk_-6309642664478517067_35619
  11/12/26 09:10:37 INFO hdfs.DFSClient: Waiting to find target node:
  192.168.100.7:50010
  11/12/26 09:10:44 INFO hdfs.DFSClient: Exception in
 createBlockOutputStream
  java.net.NoRouteToHostException: No route to host
  11/12/26 09:10:44 INFO hdfs.DFSClient: Abandoning block
  blk_4129088682008611797_35619
  11/12/26 09:10:53 INFO hdfs.DFSClient: Exception in
 createBlockOutputStream
  java.io.IOException: Bad connect ack with firstBadLink as
  192.168.100.5:50010
  11/12/26 09:10:53 INFO hdfs.DFSClient: Abandoning block
  blk_3596375242483863157_35619
  11/12/26 09:11:01 INFO hdfs.DFSClient: Exception in
 createBlockOutputStream
  java.io.IOException: Bad connect ack with firstBadLink as
  192.168.100.5:50010
  11/12/26 09:11:01 INFO hdfs.DFSClient: Abandoning block
  blk_724369205729364853_35619
  11/12/26 09:11:07 WARN hdfs.DFSClient: DataStreamer Exception:
  java.io.IOException: Unable to create new block.
     at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3002)
     at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
     at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
 
  11/12/26 09:11:07 WARN hdfs.DFSClient: Error Recovery for block
  blk_724369205729364853_35619 bad datanode[1] nodes == null
  11/12/26 09:11:07 WARN hdfs.DFSClient: Could not get block locations.
  Source file
 
 /data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292/job.split
  - Aborting...
  11/12/26 09:11:07 INFO mapred.JobClient: Cleaning up the staging area
 
 hdfs://machine-100-205:9000/data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292
  Exception in thread main java.io.IOException: Bad connect ack with
  firstBadLink as 192.168.100.5:50010
     at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3068)
     at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2983)
     at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
     at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
  11/12/26 09:11:07 ERROR hdfs.DFSClient: Exception closing file
 
 /data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292/job.split
  : java.io.IOException: Bad connect ack with firstBadLink as
  192.168.100.5:50010
  java.io.IOException: Bad connect ack with firstBadLink as
  192.168.100.5:50010
     at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3068)
     at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2983

Re: Task process exit with nonzero status of 134

2011-12-27 Thread alo alt
Anthony,

personally I haven't tested it yet; jdk7 has bugs already. It was only a
hint, to see if the error still occurs.
I would focus on memory issues: is the installed RAM okay? No errors?
My next step would be to downgrade to one JDK release earlier to check for a
bug. Did you update the OS before this started?

- Alex

On Tue, Dec 27, 2011 at 2:54 PM, anthony garnier sokar6...@hotmail.com wrote:

 Alex

 -XX:+UseCompressedOops option is the default in 1.6.0_24 and above on 64 bit 
 JVMs (http://wiki.apache.org/hadoop/HadoopJavaVersions)
 Anyway, I tested it but same result.
 Is it wise to test hadoop with the new jdk7_2 ?

 Anthony


 Date: Tue, 27 Dec 2011 13:47:03 +0100
 Subject: Re: Task process exit with nonzero status of 134
 From: wget.n...@googlemail.com

 To: sokar6...@hotmail.com

 Anthony,

 134 depends mostly on JRE (Bug) or defect RAM. _30 is the newest
 update, could be a bug inside. Can you test SE 7u2?
 Todd mentioned in a older post to use -XX:+UseCompressedOops
 (hadoop-env.sh). Another option could be to take a closer look at
 garbage collection with compressed option.

 - Alex

 On Tue, Dec 27, 2011 at 1:20 PM, anthony garnier sokar6...@hotmail.com 
 wrote:
  Alex,
 
  Memory available on namenode / Jobtracker :
  Tasks: 435 total,   1 running, 434 sleeping,   0 stopped,   0 zombie
  Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
  0.0%st
  Mem:     15360M total,    11609M used,     3750M free,      311M buffers
  Swap:     2047M total,        1M used,     2046M free,     8833M cached
 
  On datanode / Tasktracker :
  top - 13:15:27 up 6 days, 21:11,  1 user,  load average: 0.03, 0.28, 0.26
  Tasks: 377 total,   1 running, 376 sleeping,   0 stopped,   0 zombie
  Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,
  0.0%st
  Mem:     72373M total,     4321M used,    68051M free,      348M buffers
  Swap:     2047M total,        0M used,     2047M free,     2771M cached
 
  src/mapred/mapred-default.xml :
  property
    namemapred.child.java.opts/name
    value-Xmx200m/value
  /property
 
 
  So there should be enough memory
 
  Anthony
 
 
  Date: Tue, 27 Dec 2011 11:58:46 +0100
 
  Subject: Re: Task process exit with nonzero status of 134
  From: wget.n...@googlemail.com
  To: sokar6...@hotmail.com
 
 
  Anthony,
 
  How much memory you have available? Did the system going into swap?
 
  - Check mapred.map.child.java.opts (mapred.xml) for given MaxSize (xmx).
  - what says top -Hc?
 
  - Alex
 
  On Tue, Dec 27, 2011 at 11:49 AM, anthony garnier sokar6...@hotmail.com
  wrote:
   Hi,
  
   I got Nothing in the dmesg
   I've checked the Tasktracker and this is what I got :
  
   /
   STARTUP_MSG: Starting TaskTracker
   STARTUP_MSG:   host = ylal2960.inetpsa.com/10.68.217.86
   STARTUP_MSG:   args = []
   STARTUP_MSG:   version = 0.20.203.0
   STARTUP_MSG:   build =
  
   http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203
   -r 1099333; compiled by 'oom' on Wed May  4 07:57:50 PDT 2011
   /
   2011-12-23 15:11:02,275 INFO
   org.apache.hadoop.metrics2.impl.MetricsConfig:
   loaded properties from hadoop-metrics2.properties
   2011-12-23 15:11:02,330 INFO
   org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
   MetricsSystem,sub=Stats registered.
   2011-12-23 15:11:02,331 INFO
   org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
   period
   at 10 second(s).
   2011-12-23 15:11:02,331 INFO
   org.apache.hadoop.metrics2.impl.MetricsSystemImpl: TaskTracker metrics
   system started
   2011-12-23 15:11:02,597 INFO
   org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
   ugi
   registered.
   2011-12-23 15:11:02,738 INFO org.mortbay.log: Logging to
   org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
   org.mortbay.log.Slf4jLog
   2011-12-23 15:11:02,803 INFO org.apache.hadoop.http.HttpServer: Added
   global
   filtersafety
   (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
   2011-12-23 15:11:02,827 INFO org.apache.hadoop.mapred.TaskLogsTruncater:
   Initializing logs' truncater with mapRetainSize=-1 and
   reduceRetainSize=-1
   2011-12-23 15:11:02,832 INFO org.apache.hadoop.mapred.TaskTracker:
   Starting
   tasktracker with owner as root
   2011-12-23 15:11:02,870 INFO
   org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
   jvm
   registered.
   2011-12-23 15:11:02,871 INFO
   org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
   TaskTrackerMetrics registered.
   2011-12-23 15:11:02,897 INFO org.apache.hadoop.ipc.Server: Starting
   SocketReader
   2011-12-23 15:11:02,900 INFO
   org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
   RpcDetailedActivityForPort58709 registered.
   2011-12-23 15:11:02,900 INFO
   org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
   

Re: MapReduce job failing when a node of cluster is rebooted

2011-12-27 Thread alo alt
Hi,

did you set the HDFS-related dirs (dfs.name.dir, dfs.data.dir) outside of
/tmp? Most *nix systems clean /tmp up on reboot.
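
For example, a minimal hdfs-site.xml sketch (the paths are only placeholders,
point them at any directory on a persistent disk):

<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop/dfs/data</value>
</property>

hadoop.tmp.dir (core-site.xml) and mapred.local.dir (mapred-site.xml) also
default to locations under /tmp, so move them as well.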

- Alex

On Tue, Dec 27, 2011 at 2:09 PM, Rajat Goel rajatgoe...@gmail.com wrote:
 Hi,

 I have a 7-node setup (1 - Namenode/JobTracker, 6 - Datanodes/TaskTrackers)
 running Hadoop version 0.20.203.

 I performed the following test:
 Initially cluster is running smoothly. Just before launching a MapReduce
 job (about one or two minutes before), I shutdown one of the data nodes
 (rebooted the machine). Then my MapReduce job starts but immediately fails
 with following messages on stderr:

 WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
 use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
 files.
 WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
 use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
 files.
 WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
 use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
 files.
 WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
 use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
 files.
 NOTICE: Configuration: /device.map    /region.map    /url.map
 /data/output/2011/12/26/08
  PS:192.168.100.206:1    3600    true    Notice
 11/12/26 09:10:26 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
 11/12/26 09:10:26 INFO input.FileInputFormat: Total input paths to process
 : 24
 11/12/26 09:10:37 INFO hdfs.DFSClient: Exception in createBlockOutputStream
 java.io.IOException: Bad connect ack with firstBadLink as
 192.168.100.5:50010
 11/12/26 09:10:37 INFO hdfs.DFSClient: Abandoning block
 blk_-6309642664478517067_35619
 11/12/26 09:10:37 INFO hdfs.DFSClient: Waiting to find target node:
 192.168.100.7:50010
 11/12/26 09:10:44 INFO hdfs.DFSClient: Exception in createBlockOutputStream
 java.net.NoRouteToHostException: No route to host
 11/12/26 09:10:44 INFO hdfs.DFSClient: Abandoning block
 blk_4129088682008611797_35619
 11/12/26 09:10:53 INFO hdfs.DFSClient: Exception in createBlockOutputStream
 java.io.IOException: Bad connect ack with firstBadLink as
 192.168.100.5:50010
 11/12/26 09:10:53 INFO hdfs.DFSClient: Abandoning block
 blk_3596375242483863157_35619
 11/12/26 09:11:01 INFO hdfs.DFSClient: Exception in createBlockOutputStream
 java.io.IOException: Bad connect ack with firstBadLink as
 192.168.100.5:50010
 11/12/26 09:11:01 INFO hdfs.DFSClient: Abandoning block
 blk_724369205729364853_35619
 11/12/26 09:11:07 WARN hdfs.DFSClient: DataStreamer Exception:
 java.io.IOException: Unable to create new block.
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3002)
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)

 11/12/26 09:11:07 WARN hdfs.DFSClient: Error Recovery for block
 blk_724369205729364853_35619 bad datanode[1] nodes == null
 11/12/26 09:11:07 WARN hdfs.DFSClient: Could not get block locations.
 Source file
 /data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292/job.split
 - Aborting...
 11/12/26 09:11:07 INFO mapred.JobClient: Cleaning up the staging area
 hdfs://machine-100-205:9000/data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292
 Exception in thread main java.io.IOException: Bad connect ack with
 firstBadLink as 192.168.100.5:50010
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3068)
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2983)
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
 11/12/26 09:11:07 ERROR hdfs.DFSClient: Exception closing file
 /data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292/job.split
 : java.io.IOException: Bad connect ack with firstBadLink as
 192.168.100.5:50010
 java.io.IOException: Bad connect ack with firstBadLink as
 192.168.100.5:50010
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3068)
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2983)
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)


 - In the above logs, 192.168.100.5 is the machine I rebooted.
 - JobTracker's log file doesn't have any logs in the above time period.
 - NameNode's log file doesn't have any exceptions or any messages related
 to the above error logs.
 - All nodes can access each other via IP or hostnames.
 - 

Re: Task process exit with nonzero status of 134

2011-12-23 Thread alo alt
Hi,

take a look into the logs for the failed attempt on your TaskTracker.
Also check the system logs with dmesg or /var/log/kern*; it could be a
kernel-level kill or a hard JVM crash. Exit status 134 is 128 + 6, i.e. the
child JVM died on SIGABRT.
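
A rough sketch of where to look (log dir and attempt id are placeholders,
adjust to your install):

ls $HADOOP_HOME/logs/userlogs/<attempt_id>/   # stdout, stderr, syslog of the attempt
dmesg | tail -50                              # kernel-level kills, OOM killer, segfaults
grep -i -e 'killed process' -e segfault /var/log/kern* /var/log/messages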

- Alex

On Fri, Dec 23, 2011 at 3:32 PM, anthony garnier sokar6...@hotmail.com wrote:

 Hi folks,

 I've just done a fresh install of Hadoop. The Namenode and Datanode are up,
 and the Task/Job Tracker too, but when I run the MapReduce wordcount example I
 got this error on the TaskTracker:

 2011-12-23 15:11:52,679 INFO org.apache.hadoop.mapred.JvmManager: JVM : 
 jvm_201112231511_0001_m_-1653678851 exited with exit code 134. Number of 
 tasks it ran: 0
 2011-12-23 15:11:52,681 WARN org.apache.hadoop.mapred.TaskRunner: 
 attempt_201112231511_0001_m_02_0 : Child Error
 java.io.IOException: Task process exit with nonzero status of 134.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

 And the JobTracker is stuck:
 # hadoop jar hadoop-examples*.jar wordcount input/test.txt output/
 11/12/23 15:11:48 INFO input.FileInputFormat: Total input paths to process : 1
 11/12/23 15:11:49 INFO mapred.JobClient: Running job: job_201112231511_0001
 11/12/23 15:11:50 INFO mapred.JobClient:  map 0% reduce 0%


 I'm running hadoop 0.20.203.0, java 1.6.0 rev 25

 I've done some googling; apparently the JVM crashed hard (maybe out of memory).
 Does someone have any hint?

 Regards,

 Anthony Garnier
 /DSIN/ASTI/ETSO
 IT Center
 PSA Peugeot Citroen
 Bessoncourt 90160







-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you
really need to.


Re: 1gig or 10gig network for cluster?

2011-12-23 Thread alo alt
Hi,

Recommended or optimal?
10G is the best for optimal rack awareness. If you plan to grow
seriously, start with the best you can afford. It depends on your
available investment, I think.

- Alex


On Fri, Dec 23, 2011 at 6:23 PM, Koert Kuipers ko...@tresata.com wrote:
 For a hadoop cluster that starts medium size (50 nodes) but could grow to
 hundreds of nodes, what is the recommended network in the rack? 1gig or 10gig?
 We have machines with 8 cores, 4 x 1TB drives (could grow to 8 x 1TB drives),
 48 Gb ram per node.
 We expect balanced usage of the cluster (both storage and computations).
 Thanks for your input,
 Koert



-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you
really need to.


Re: Hadoop configuration

2011-12-22 Thread alo alt
Hi,

Apache:
http://hadoop.apache.org/common/docs/current/cluster_setup.html

RHEL / CentOS:
http://mapredit.blogspot.com/p/get-hadoop-cluster-running-in-20.html

Ubuntu:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
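
If it helps, the bare minimum for a small cluster looks roughly like this
(hostnames, ports and values are placeholders for a sketch, not a drop-in
config):

core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>

mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>

hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

List both hostnames in conf/slaves and in /etc/hosts on every node; the
"Too many fetch-failures" error is very often a hostname resolution problem
between the TaskTrackers.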


- Alex

On Thu, Dec 22, 2011 at 10:24 AM, Humayun kabir humayun0...@gmail.com wrote:
 someone please help me to configure hadoop such as core-site.xml,
 hdfs-site.xml, mapred-site.xml etc.
 Please provide some examples; it is badly needed because I run a 2-node
 cluster. When I run the wordcount example it fails with too many fetch
 failures.



-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you
really need to.


Re: measuring network throughput

2011-12-22 Thread alo alt
Rita,

Ganglia gives you throughput graphs, much like Nagios. Could that help?

- Alex

On Thu, Dec 22, 2011 at 1:58 PM, Rita rmorgan...@gmail.com wrote:
 Is there a tool or a method to measure the throughput of the cluster at a
 given time? It would be a great feature to add





 --
 --- Get your facts first, then you can distort them as you please.--



-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you
really need to.


Re: measuring network throughput

2011-12-22 Thread alo alt
Yes, I know ;) You can grab and extend the metrics as you like. Here's a
post from Sematext:
http://blog.sematext.com/2011/07/31/extending-hadoop-metrics/
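
If you want the raw numbers without going through Ganglia, here is a minimal
sketch for conf/hadoop-metrics2.properties (on releases that ship metrics2,
e.g. 0.20.203+; the filenames are placeholders) that dumps the metrics to
flat files you can post-process:

*.period=10
*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
datanode.sink.file.filename=datanode-metrics.out
tasktracker.sink.file.filename=tasktracker-metrics.out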

- Alex

On Thu, Dec 22, 2011 at 2:45 PM, Rita rmorgan...@gmail.com wrote:
 Yes, I think they can graph it for you. However, I am looking for raw data
 because I would like to create something custom



 On Thu, Dec 22, 2011 at 8:19 AM, alo alt wget.n...@googlemail.com wrote:

 Rita,

 Ganglia gives you throughput graphs, much like Nagios. Could that help?

 - Alex

 On Thu, Dec 22, 2011 at 1:58 PM, Rita rmorgan...@gmail.com wrote:
  Is there a tool or a method to measure the throughput of the cluster at a
  given time? It would be a great feature to add
 
 
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--



 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 P Think of the environment: please don't print this email unless you
 really need to.




 --
 --- Get your facts first, then you can distort them as you please.--



-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you
really need to.


Re: Regarding a Multi user environment

2011-12-18 Thread alo alt
Hi,

if I understood correctly, you want users other than root to be able to stop /
start the entire cluster?
It's possible via sudoers (visudo). Create a group, add the users
(they must exist on the system) and give it the right to run bin/start-all.sh
and bin/stop-all.sh.
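
A sketch for visudo, assuming Hadoop lives in /usr/local/hadoop and the
daemons run as a "hadoop" user (adjust group name, user and paths to your
setup):

%hadoopops ALL=(hadoop) NOPASSWD: /usr/local/hadoop/bin/start-all.sh, /usr/local/hadoop/bin/stop-all.sh

Members of the hadoopops group can then run, for example,
sudo -u hadoop /usr/local/hadoop/bin/start-all.sh.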

hope it helps,
 - Alex

On Sat, Dec 17, 2011 at 9:51 AM, ashutosh pangasa
ashutoshpang...@gmail.com wrote:
 I have set up hadoop for a multi user environment. Different users are able
 to submit map reduce jobs on a cluster.

 What I am trying to do is to see if different users can start or stop the
 cluster as well.

 Is it possible in Hadoop? If yes, how can we do it?



-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you
really need to.


Re: Dynamically adding nodes in Hadoop

2011-12-17 Thread alo alt
Hi,

in the slaves file too. Adding it to /etc/hosts on all nodes is also
recommended to avoid DNS issues. After adding the node to slaves, start its
daemons and it should quickly appear in the web UI. If you don't need the
nodes all the time you can set up an exclude file and refresh your cluster
(http://wiki.apache.org/hadoop/FAQ#I_want_to_make_a_large_cluster_smaller_by_taking_out_a_bunch_of_nodes_simultaneously._How_can_this_be_done.3F)
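
On the new node itself something like this is enough (paths assumed):

$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

For the exclude route, after editing the file referenced by dfs.hosts.exclude
tell the NameNode to re-read it:

$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes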

- Alex

On Sat, Dec 17, 2011 at 12:06 PM, madhu phatak phatak@gmail.com wrote:
 Hi,
  I am trying to add nodes dynamically to a running hadoop cluster. I started
 the TaskTracker and DataNode on the node. It works fine. But when some node
 tries to fetch values (for the reduce phase) it fails with an unknown host
 exception. When I add a node to a running cluster, do I have to add its
 hostname to the /etc/hosts file on all nodes (slaves + master)? Or is there
 some other way?


 --
 Join me at http://hadoopworkshop.eventbrite.com/



-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you
really need to.


Re: Choosing IO intensive and CPU intensive workloads

2011-12-09 Thread alo alt
Hmm, the Pi or WordCount workloads could be useful. Sorry, I always keep
such topics as links:

http://developer.yahoo.com/hadoop/tutorial/module3.html#running
= wordcount

I think some examples are included by default, like Pi:

cd /usr/lib/hadoop-0.20/
hadoop jar hadoop-examples.jar pi 10 1000
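
For the IO side, TestDFSIO ships in the test jar (a sketch; the exact jar
name depends on your release):

hadoop jar hadoop-test.jar TestDFSIO -write -nrFiles 10 -fileSize 100
hadoop jar hadoop-test.jar TestDFSIO -read -nrFiles 10 -fileSize 100
hadoop jar hadoop-examples.jar wordcount /input /output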


- Alex

On Fri, Dec 9, 2011 at 10:08 AM, ArunKumar arunk...@gmail.com wrote:

 Alex,

 Thanks for the link.
 I have boxes with, say, 30 - 50 GB of free space. Obviously I can't run a
 full Terasort. What reasonable input size do I need to take to see the
 behaviour when Terasort and TestDFSIO are run?
 Is there any benchmark for a mixed workload?

 Arun

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Choosing-IO-and-CPU-intensive-workloads-tp3572282p3572416.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.




-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you
really need to.


Re: Choosing IO intensive and CPU intensive workloads

2011-12-08 Thread alo alt
Hi Arun,

Michael has written up a good tutorial about this, including stress testing and IO:
http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
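
If a full 1 TB terasort is too big for your disks you can scale it down;
teragen writes 100-byte rows, so e.g. 100,000,000 rows is roughly 10 GB
(a sketch, sizes and paths are placeholders):

hadoop jar hadoop-examples.jar teragen 100000000 /benchmarks/tera-in
hadoop jar hadoop-examples.jar terasort /benchmarks/tera-in /benchmarks/tera-out
hadoop jar hadoop-examples.jar teravalidate /benchmarks/tera-out /benchmarks/tera-report

Keep in mind that HDFS replication (3x by default) multiplies the space the
generated input takes.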

- Alex

On Fri, Dec 9, 2011 at 8:24 AM, ArunKumar arunk...@gmail.com wrote:

 Hi guys !

 I want to see the behavior of a single node of a Hadoop cluster when an IO-
 intensive / CPU-intensive workload, or a mix of both, is submitted to that
 single node alone.
 These workloads must stress the nodes.
 I see that TestDFSIO benchmark is good for IO intensive workload.
 1. Which benchmarks do I need to use for this?
 2. What amount of input data will be fair enough for seeing the behavior
 under these workloads for each type of box, if I have boxes with:
  B1: 4 GB RAM, dual core, 150-250 GB disk,
  B2: 1 GB RAM, 50-80 GB disk.

 Arun

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Choosing-IO-intensive-and-CPU-intensive-workloads-tp3572282p3572282.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.




-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you
really need to.


Re: Hadoop Comic

2011-12-07 Thread alo alt
Hi,

https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1&hl=en_US

- alex

On Wed, Dec 7, 2011 at 10:47 AM, shreya@cognizant.com wrote:

 Hi,



 Can someone please send me the Hadoop comic.

 Saw references about it in the mailing list.



 Regards,

 Shreya


 This e-mail and any files transmitted with it are for the sole use of the
 intended recipient(s) and may contain confidential and privileged
 information.
 If you are not the intended recipient, please contact the sender by reply
 e-mail and destroy all copies of the original message.
 Any unauthorized review, use, disclosure, dissemination, forwarding,
 printing or copying of this email or any action taken in reliance on this
 e-mail is strictly prohibited and may be unlawful.




-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you
really need to.


Re: Warning: $HADOOP_HOME is deprecated

2011-12-07 Thread alo alt
Hi,

looks like a bug in 0.20.205:
https://issues.apache.org/jira/browse/HADOOP-7816
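
As a workaround, the variable has to be non-empty and visible before
bin/hadoop runs, e.g. in the shell profile of the user running the commands,
on every node (a sketch):

export HADOOP_HOME_WARN_SUPPRESS="TRUE"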

- Alex

On Wed, Dec 7, 2011 at 11:37 AM, praveenesh kumar praveen...@gmail.com wrote:

 How to avoid the "Warning: $HADOOP_HOME is deprecated" messages on hadoop
 0.20.205?

 I tried adding export HADOOP_HOME_WARN_SUPPRESS=  in hadoop-env.sh on the
 Namenode.

 But it's still coming. Am I doing the right thing?

 Thanks,
 Praveenesh




-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you
really need to.


Re: Automate Hadoop installation

2011-12-05 Thread alo alt
Hi,

to deploy software I suggest pulp:
https://fedorahosted.org/pulp/wiki/HowTo

For a package-based distro (Debian, RedHat, CentOS) you can build Apache's
Hadoop, package it and deploy it. Configs, as Cos says, go over Puppet. If you
use RedHat / CentOS, take a look at Spacewalk.
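
Until the config management is in place, a plain shell loop is often enough
to keep the configs in sync (a sketch; install path and slaves file location
are assumptions):

for host in $(grep -v '^#' /usr/local/hadoop/conf/slaves); do
  rsync -a /usr/local/hadoop/conf/ "$host:/usr/local/hadoop/conf/"
done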

best,
 Alex


On Mon, Dec 5, 2011 at 8:20 PM, Konstantin Boudnik c...@apache.org wrote:

 There's that great project called BigTop (in the Apache Incubator) which
 provides for building the Hadoop stack.

 Part of what it provides is a set of Puppet recipes which will allow you
 to do exactly what you're looking for, with perhaps some minor corrections.

 Seriously, look at Puppet - otherwise it will be a living nightmare of
 configuration mismanagement.

 Cos

 On Mon, Dec 05, 2011 at 04:02PM, praveenesh kumar wrote:
  Hi all,
 
  Can anyone guide me how to automate the hadoop installation/configuration
  process?
  I want to install hadoop on 10-20 nodes, which may even grow to 50-100
  nodes.
  I know we can use some configuration tools like Puppet or shell scripts.
  Has anyone done it?
 
  How can we do hadoop installations on so many machines in parallel? What
  are the best practices for this?
 
  Thanks,
  Praveenesh




-- 
Alexander Lorenz
http://mapredit.blogspot.com

P Think of the environment: please don't print this email unless you
really need to.