Re: working with SAS
Hi, Hadoop runs on Linux boxes (mostly) and can run in a standalone installation for testing only. If you decide to use Hadoop with Hive or HBase you have to face a lot more tasks:
- installation (Whirr and Amazon EC2, for example)
- writing your own MapReduce job, or using Hive / HBase
- setting up Sqoop with the Teradata driver
You can easily set up parts 1 and 2 with Amazon's EC2; I think you can also book Windows Server there. For a single query that is, I think, the best option before you install a Hadoop cluster. best, Alex -- Alexander Lorenz http://mapredit.blogspot.com

On Feb 6, 2012, at 8:11 AM, Ali Jooan Rizvi wrote: Hi, I would like to know if Hadoop will be of help to me? Let me explain my scenario: I have a single Windows Server machine with 16 cores and 48 GB of physical memory. In addition, I have 120 GB of virtual memory. I am running a query with statistical calculations on a large dataset of over 1 billion rows, on SAS. In this case, SAS is acting like a database on which both the source and target tables reside. For storage, I can keep the source and target data on Teradata as well, but the query, which contains patented logic, can only be run through the SAS interface. The problem is that SAS is taking many days (25 days) to run it (a single query with a statistical function), and not all cores were used all the time; merely 5% CPU was utilized on average. However, memory utilization was very high, and that's why so much virtual memory was used. Can I have a Hadoop setup in place to do it all, so that I may end up running the query in less time, say 1 or 2 days? Anything squeezing my run time will be very helpful. Thanks Ali Jooan Rizvi
Re: working with SAS
Also, you will not necessarily need vertical (scale-up) systems to speed things up (it totally depends on your query). Give a thought to commodity hardware (much cheaper); with Hadoop being suited to it, *I hope* your infrastructure can be cheaper in terms of price-to-performance ratio. Having said that, I do not mean you have to throw away your existing infrastructure, because it is ideal for certain requirements. Your solution could be to write a MapReduce job which does what the query is supposed to do and run it on a cluster; of what size? It depends (on how fast you want things done, and on scale). In case your query is ad hoc and has to be run frequently, you might want to consider HBase and Hive as solutions, with a lot of expensive vertical nodes ;). BTW, is your query iterative? A few more details on your type of query may attract people with more wisdom to help. HTH
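To make the MapReduce suggestion above a bit more concrete, here is a minimal sketch (using the org.apache.hadoop.mapreduce API) of a job that computes a per-group mean over tab-separated rows. The class names, field positions and the mean statistic are illustrative placeholders only; the actual patented SAS calculation is not described in the thread.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupMean {

  // Emits (group, value) for every input row; field positions 0 and 3 are hypothetical.
  public static class GroupMapper extends Mapper<Object, Text, Text, DoubleWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length > 3) {
        context.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[3])));
      }
    }
  }

  // Averages all values seen for a group; the real statistic would go here.
  public static class MeanReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0;
      long count = 0;
      for (DoubleWritable v : values) {
        sum += v.get();
        count++;
      }
      context.write(key, new DoubleWritable(sum / count));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "group mean");
    job.setJarByClass(GroupMean.class);
    job.setMapperClass(GroupMapper.class);
    job.setReducerClass(MeanReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Whether something like this helps depends, as noted above, on whether the statistic can be decomposed into independent per-group or per-partition computations.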
Can I write to a compressed file which is located in HDFS?
hi all, I'm testing Hadoop and Hive, and I want to use them for log analysis. Here I have a question: can I write/append logs to a compressed file which is located in HDFS? Our system generates lots of log files every day; I can't compress these logs every hour and then put them into HDFS. But what if I want to write logs into files that are already in HDFS and are compressed? If these files were not compressed, then this job seems easy, but how do I write or append logs into a compressed log? Can I do that? Can anyone give me some advice or some examples? Thank you very much! xiaobin
Re: Can I write to a compressed file which is located in HDFS?
Sorry, that sentence is wrong: "I can't compress these logs every hour and then put them into HDFS." It should be: "I can compress these logs every hour and then put them into HDFS."
The Common Account for Hadoop
Dear all, I am just starting to learn Hadoop. According to the book Hadoop in Action, a common account must be created on each server (masters/slaves). Moreover, I need to create a public/private RSA key pair as follows: ssh-keygen -t rsa Then, id_rsa and id_rsa.pub are put under $HOME/.ssh. After that, the public key is distributed to the other nodes and saved in $HOME/.ssh/authorized_keys. According to the book (page 27), I can log in to a remote target with the following command: ssh target (I typed the IP address here) However, according to the book, no password should be required to sign in to the target. On my machine, I am required to type the password each time. Will this affect my configuring of Hadoop later? What's wrong with my setup? Thanks so much! Bing
Re: The Common Account for Hadoop
Check the permissions of .ssh/authorized_keys on the hosts; it has to be readable and writable only by the user (and the same goes for the directory). Be sure you copied the right key, without line breaks or fragments. If you have a lot of boxes you could use Bcfg2: http://docs.bcfg2.org/ - Alex -- Alexander Lorenz http://mapredit.blogspot.com
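For reference, the key setup and the permissions sshd usually insists on look roughly like this (a sketch assuming the default OpenSSH layout; the chmod commands apply on the target host):

ssh-keygen -t rsa
ssh-copy-id user@target          # or manually append id_rsa.pub to ~/.ssh/authorized_keys on the target
chmod 700 ~/.ssh                 # the directory must not be group- or world-writable
chmod 600 ~/.ssh/authorized_keys
ssh target                       # should now log in without a password prompt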
Re: The Common Account for Hadoop
Hi, Alex, Thanks so much for your help! I noticed that I didn't put the RSA key into the account's home directory. Best regards, Bing
Re: Can I write to a compressed file which is located in HDFS?
Hi If you have enough log files to fill at least one block in an hour, you can go ahead as follows:
- Run a scheduled job every hour that compresses the log files for that hour and stores them onto HDFS (you can use LZO or even Snappy to compress).
- If your Hive jobs do more frequent analysis on this data, store it PARTITIONED BY (Date, Hour). While loading into HDFS, also follow a directory / sub-directory structure. Once the data is in HDFS, issue an ALTER TABLE ... ADD PARTITION statement on the corresponding Hive table.
- In the Hive DDL, use the appropriate input format (Hive has an Apache log input format already).
Regards Bejoy K S From handheld, Please excuse typos.
Re: Can I write to a compressed file which is located in HDFS?
hi Bejoy, thank you for your reply. Actually I have set up a test cluster which has one namenode/jobtracker and two datanode/tasktracker machines, and I have run a test on this cluster. I fetch the log file of one of our modules from the log collector machines by rsync, and then I use the Hive command line tool to load this log file into the Hive warehouse, which simply copies the file from the local filesystem to HDFS. And I have run some analysis on this data with Hive; all of this runs well. But now I want to avoid the fetch step which uses rsync, and write the logs into HDFS files directly from the servers which generate these logs. And it seems easy to do this job if the file located in HDFS is not compressed. But how do I write or append logs to a file that is compressed and located in HDFS? Is this possible? Or is this a bad practice? Thanks!
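Appending to a gzip file that is already in HDFS is not really supported in this Hadoop version; the usual workaround is to roll a new compressed file per time window (per hour here) and write it through the compression codec API. A minimal sketch, with made-up local and HDFS paths:

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class HourlyLogUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Roll one new compressed file per hour instead of appending to an old one.
    Path dst = new Path("/logs/app/2012-02-06/14/app.log.gz");
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    InputStream in = new FileInputStream("/var/log/app/app-2012-02-06-14.log");
    OutputStream out = codec.createOutputStream(fs.create(dst));

    // Copy the local log into HDFS, compressing on the fly; the last argument closes both streams.
    IOUtils.copyBytes(in, out, conf, true);
  }
}

The same idea should work with other codecs mentioned earlier in the thread (LZO, Snappy) if the corresponding codec classes and native libraries are available in your build.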
Re: working with SAS
Both responses assume replacing SAS with a Hadoop cluster. I would agree that going to EC2 might make sense in terms of a PoC before investing in a physical cluster, but we need to know more about the underlying problem. First, can the problem be broken down into something that can be accomplished in parallel sub-tasks? Second... how much data? It could be a good use case for Whirr... Sent from a remote device. Please excuse any typos... Mike Segel
Re: Can I write to a compressed file which is located in HDFS?
Hi, You may want to have a look at the Flume project from Cloudera. I use it for writing data into HDFS. https://ccp.cloudera.com/display/SUPPORT/Downloads dave
Re: Can I write to a compressed file which is located in HDFS?
Hi I agree with David on that point; you can achieve step 1 of my previous response with Flume, i.e. load the real-time inflow of data into HDFS in compressed format. You can specify a time interval or a data size in the Flume collector that determines when to flush data onto HDFS. Regards Bejoy K S From handheld, Please excuse typos.
HDFS Files Seem to be Stored in the Wrong Location?
Hi, I have a pseudo-distributed Hadoop cluster setup, and I'm currently hoping to put about 100 gigs of files on it to play around with. I got a unix box at work no one else is using for this, and running a df -h, I get:

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.9G  2.4G  5.2G  31% /
none                  3.8G     0  3.8G   0% /dev/shm
/dev/sdb              414G  210M  393G   1% /mnt

Alright, so /mnt looks quite big and seems like a good place to store my hdfs files. I go ahead and create a folder named hadoop-data there and set the following in hdfs-site.xml:

<property>
  <!-- where hadoop stores its files (datanodes only) -->
  <name>dfs.name.dir</name>
  <value>/mnt/hadoop-data</value>
</property>

After a bit of troubleshooting, I restart the cluster and try to put a couple of test files onto HDFS. Doing an ls of hadoop-data, I see:

$ ls
current  image  in_use.lock  previous.checkpoint

OK, things look good. Time to try uploading some real data. Now, here's where the problem arises. If I add a 10mb dummy file to hadoop-data through regular unix and run df -h, I see that the used space of /mnt goes up exactly 10mb. But, when I start running a big dump of data through:

hadoop fs -put ~/hadoop_playground/data2/data2/ /data/

I notice that running df -h seems to put the data in completely the wrong location! Note that below, only the usage of /dev/sda1 has increased. /mnt has not moved.

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.9G  3.4G  4.2G  45% /
none                  3.8G     0  3.8G   0% /dev/shm
/dev/sdb              414G  210M  393G   1% /mnt

So, what gives? Anyone have any clue how my files are seemingly both put in the hadoop-data folder, but take up space elsewhere? I could see this likely being a Unix issue, but I figured I'd ask here just in case it's not, since I'm pretty stumped. Cheers, Eli
Re: HDFS Files Seem to be Stored in the Wrong Location?
You need your dfs.data.dir configured to the bigger disks for data. That config targets the datanodes. The one you've overridden is for the namenode's metadata, and hence the default dfs.data.dir config is writing to /tmp on your root disk (which is a bad thing; it gets wiped after a reboot). -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
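Concretely, something along these lines in hdfs-site.xml (the path is just an example) sends the datanode block storage to the large /mnt disk, while dfs.name.dir stays reserved for the namenode metadata:

<property>
  <name>dfs.data.dir</name>
  <value>/mnt/hadoop-data/dfs/data</value>
</property>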
Tom White's book, 2nd ed. Which API?
I have the first edition of Tom White's O'Reilly Hadoop book and I was curious about the second edition. I realize it adds new sections on some of the wrapper tools, like Hive, but as far as the core Hadoop documentation is concerned, I'm wondering if there is much difference? In particular, I was curious if it teaches the .20 API? The first edition explicitly taught .19 because .20 wasn't quite vetted at the time he wrote it. He even explains that in the book. Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't scratch. All together this implies: He scratched the itch from the scratch that itched but would never itch the scratch from the itch that scratched. -- Keith Wiley
Re: HDFS Files Seem to be Stored in the Wrong Location?
Ah, crud. Typo on my part. Don't know how I didn't notice that. Thanks!
Re: Tom White's book, 2nd ed. Which API?
I have access to a Safari Books Online account and according to a quick scan: "What's New in the Second Edition? The second edition has two new chapters on Hive and Sqoop (Chapters 12 and 15), a new section covering Avro (in Chapter 4), an introduction to the new security features in Hadoop (in Chapter 9), and a new case study on analyzing massive network graphs using Hadoop (in Chapter 16). This edition continues to describe the 0.20 release series of Apache Hadoop, since this was the latest stable release at the time of writing. New features from later releases are occasionally mentioned in the text, however, with reference to the version that they were introduced in." You could also check out Amazon's Look Inside functionality to check a few key pages once you find the second edition. Hope this is of some help.
Re: Tom White's book, 2nd ed. Which API?
The second edition of Tom White's *Hadoop: The Definitive Guide* (http://www.librarything.com/work/book/72181963) uses the old API for its examples, though it does contain a brief two-page overview of the new API. The first edition is all old API.
Re: Tom White's book, 2nd ed. Which API?
If you're looking to buy the 2nd edition you might want to wait; the third edition is in the works now. Regards, Rick
Re: Are Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other?
I assume you have seen the following information on the Hadoop wiki, http://wiki.apache.org/hadoop/GangliaMetrics So do you use GangliaContext31 in hadoop-metrics2.properties? We use Ganglia 3.2 with Hadoop 0.20.205 and it works fine. (I remember gmetad sometimes going down due to a buffer overflow problem when Hadoop starts pumping in the metrics, but restarting works; let me know if you face the same problem.) --Suhas

"Additionally, the Ganglia protocol changed significantly between Ganglia 3.0 and Ganglia 3.1 (i.e., Ganglia 3.1 is not compatible with Ganglia 3.0 clients). This caused Hadoop to not work with Ganglia 3.1; there is a patch available for this, HADOOP-4675. As of November 2010, this patch has been rolled into the mainline for 0.20.2 and later. To use the Ganglia 3.1 protocol in place of the 3.0, substitute org.apache.hadoop.metrics.ganglia.GangliaContext31 for org.apache.hadoop.metrics.ganglia.GangliaContext in the hadoop-metrics.properties lines above."

On Fri, Feb 3, 2012 at 1:07 PM, Merto Mertek masmer...@gmail.com wrote: I spent a lot of time trying to figure it out, however I did not find a solution. Problems in the logs pointed me to some bugs in the rrdupdate tool; I tried to solve it with different versions of ganglia and rrdtool but the error is the same. A segmentation fault appears after the following lines, if I run gmetad in debug mode:
Created rrd /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.publish_max_time.rrd
Created rrd /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.snapshot_max_time.rrd
which I suppose are generated from MetricsSystemImpl.java (is there any way just to disable these two metrics?). In /var/log/messages there are a lot of errors:
xxx gmetad[15217]: RRD_update (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.publish_imax_time.rrd): converting '4.9E-324' to float: Numerical result out of range
xxx gmetad[15217]: RRD_update (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.snapshot_imax_time.rrd): converting '4.9E-324' to float: Numerical result out of range
so probably there are some converting issues? Where should I look for the solution? Would you rather suggest using ganglia 3.0.x with the old protocol and leaving version 3.1 for further releases? Any help is really appreciated...

On 1 February 2012 04:04, Merto Mertek masmer...@gmail.com wrote: I would be glad to hear that too. I've set up the following: Hadoop 0.20.205, Ganglia front-end 3.1.7, Ganglia back-end (gmetad) 3.1.7, RRDTool (http://www.rrdtool.org/) 1.4.5 - I had some troubles installing 1.4.4. Ganglia works just in case hadoop is not running, so metrics are not published to the gmetad node (configured with the new hadoop-metrics2.properties). When hadoop is started, a segmentation fault appears in the gmetad daemon:
sudo gmetad -d 2
...
Updating host xxx, metric dfs.FSNamesystem.BlocksTotal
Updating host xxx, metric bytes_in
Updating host xxx, metric bytes_out
Updating host xxx, metric metricssystem.MetricsSystem.publish_max_time
Created rrd /var/lib/ganglia/rrds/hdcluster/hadoopmaster/metricssystem.MetricsSystem.publish_max_time.rrd
Segmentation fault
And some info from the apache log: http://pastebin.com/nrqKRtKJ Can someone suggest a ganglia version that is tested with hadoop 0.20.205? I will try to sort it out, however it seems a not so trivial problem. Thank you

On 2 December 2011 12:32, praveenesh kumar praveen...@gmail.com wrote: or Do I have to apply some hadoop patch for this ? Thanks, Praveenesh
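For reference, the hadoop-metrics.properties lines the wiki excerpt refers to typically look like the following once GangliaContext31 is substituted in (the gmond host:port is an example value; the same pattern applies to the other metric groups such as jvm and rpc):

dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=gmond-host:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=gmond-host:8649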
Re: Are Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other?
Hello, I also face this issue when using GangliaContext31 and hadoop-1.0.0, and ganglia 3.1.7 (I also tried 3.1.2). I continuously get buffer overflows as soon as I restart the gmetad. Regards Mete
Re: How to Set the Value of hadoop.tmp.dir?
Hi Bing What is your value for dfs.name.dir and dfs.data.dir? I believe they are still pointing to /tmp. Better to change them to another location, as /tmp gets wiped on every reboot.

--Original Message-- From: Bing Li To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: How to Set the Value of hadoop.tmp.dir? Sent: Feb 7, 2012 02:04 Dear all, I am a new Hadoop learner. The version I use is 1.0.0. I tried to set a new value instead of /tmp for the parameter hadoop.tmp.dir in core-site.xml, hdfs-site.xml and mapred-site.xml. Do I need to do that in all of the above XMLs? However, when I execute the format command, I was asked if I need to reformat the filesystem in /tmp/hadoop-myname/dfs/name. Why is the path still /tmp? Thanks, Bing

Regards Bejoy K S From handheld, Please excuse typos.
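As a minimal example, if the goal is simply to move everything off /tmp, setting hadoop.tmp.dir once in core-site.xml is usually enough, because dfs.name.dir and dfs.data.dir default to directories underneath it; they can also be set explicitly in hdfs-site.xml as suggested above. The path below is illustrative:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/myname/hadoop-tmp</value>
</property>

For a fresh setup, re-run hadoop namenode -format afterwards so the new location is actually used.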
Re: Tom White's book, 2nd ed. Which API?
Or get O'Reilly Safari, which would get you both?
Re: Are Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other?
Yes, I am encountering the same problems, and like Mete said, a few seconds after restarting a segmentation fault appears. Here is my conf: http://pastebin.com/VgBjp08d And here is some info from /var/log/messages (ubuntu server 10.10):
kernel: [424447.140641] gmetad[26115] general protection ip:7f7762428fdb sp:7f776362d370 error:0 in libgcc_s.so.1[7f776241a000+15000]
When I compiled gmetad I used the following command:
./configure --with-gmetad --sysconfdir=/etc/ganglia CPPFLAGS=-I/usr/local/rrdtool-1.4.7/include CFLAGS=-I/usr/local/rrdtool-1.4.7/include LDFLAGS=-L/usr/local/rrdtool-1.4.7/lib
The same was tried with rrdtool 1.4.5. My current ganglia version is 3.2.0, and like Mete I tried it with version 3.1.7 but without success. Hope we will sort out a solution soon.. thank you
Re: Are Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other?
Hey Merto, I've been digging into this problem since Sunday, and believe I may have root-caused it. I'm using ganglia-3.2.0, rrdtool-1.4.5 and http://svn.apache.org/viewvc/hadoop/common/branches/branch-1/ (which I believe should be running essentially the identical relevant code as 0.20.205). While I test out one of my potential fixes, I would appreciate it if you could confirm my understanding of the behavior you are seeing:
- When you start gmetad and Hadoop is not emitting metrics, everything is peachy.
- When you start Hadoop (and it thus starts emitting metrics), gmetad cores. On my MacBook Pro, it's a SIGABRT due to a buffer overflow. I believe this is happening for everyone.
What I would like for you to try out are the following 2 scenarios:
- Once gmetad cores, if you start it up again, does it core again? Does this process repeat ad infinitum? On my MBP, the core is a one-time thing, and restarting gmetad after the first core makes things run perfectly smoothly. I know others are saying this core occurs continuously, but they were all using ganglia-3.1.x, and I'm interested in how ganglia-3.2.0 behaves for you.
- If you start Hadoop first (so gmetad is not running when the first batch of Hadoop metrics are emitted) and THEN start gmetad after a few seconds, do you still see gmetad coring? On my MBP, this sequence works perfectly fine, and there are no gmetad cores whatsoever.
Bear in mind that this only addresses the gmetad coring issue - the warnings emitted about '4.9E-324' being out of range will continue, but I know what's causing that as well (and hope that my patch fixes it for free). Varun
Re: Tom White's book, 2nd ed. Which API?
Thanks everyone. I knew about the upcoming third edition. I'm not sure I want to wait until May to learn the new API (which is pretty old by now, actually). I'd like to find a resource that goes through the new API. I realize Tom White's examples are offered with the new API online; I was just hoping for something more explicitly instructional. I'll figure something out. Cheers! Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Luminous beings are we, not this crude matter. -- Yoda
Re: Are Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other?
I have tried to run it but it keeps crashing..
- When you start gmetad and Hadoop is not emitting metrics, everything is peachy.
Right, running just ganglia without running hadoop jobs seems stable for at least a day.
- When you start Hadoop (and it thus starts emitting metrics), gmetad cores.
True, with the following error: *** stack smashing detected ***: gmetad terminated \n Segmentation fault
- Once gmetad cores, if you start it up again, does it core again? Does this process repeat ad infinitum?
It cores every time I run it. The difference is just that sometimes a segmentation fault appears instantly, and sometimes it appears after a random time... let's say after a minute of running gmetad and collecting data.
- If you start Hadoop first (so gmetad is not running when the first batch of Hadoop metrics are emitted) and THEN start gmetad after a few seconds, do you still see gmetad coring?
Yes. I have tested this scenario with 2 worker nodes, so two gmonds plus the head gmond on the server where gmetad is located. I have checked and all of them are version 3.2.0. Hope it helps.
Re: Hadoop does not start on Windows XP
Hi Ron, Thank you. I deleted the Hadoop directory from my Windows folder, then untarred/unzipped it under Cygwin into the directory d:\hadoop (for example, here is a path: D:\Hadoop\hadoop-1.0.0\bin\ ). Now I could start Hadoop:
$ bin/hadoop start-all.sh
The above worked. But a similar problem persists:
$ bin/hadoop fs -ls
bin/hadoop: line 321: c:\Program: command not found
Found 26 items
$ bin/hadoop dfs -mkdir urls
cygwin warning: MS-DOS style path detected: D:\Hadoop\hadoop-1.0.0\/build/native Preferred POSIX equivalent is: /cygdrive/d/Hadoop/hadoop-1.0.0/build/native CYGWIN environment variable option nodosfilewarning turns off this warning. Consult the user's guide for more details about POSIX paths: http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
bin/hadoop: line 321: c:\Program: command not found
How could I possibly get rid of the error? Btw,
$ bin=`dirname .`
/cygdrive/d/Hadoop/hadoop-1.0.0
$ echo $bin
.
/cygdrive/d/Hadoop/hadoop-1.0.0
$ pwd
/cygdrive/d/Hadoop/hadoop-1.0.0
Thanks, Jay

From: Ronald Petty ronald.pe...@gmail.com To: common-user@hadoop.apache.org; Jay su1...@yahoo.com Sent: Sunday, February 5, 2012 4:28 PM Subject: Re: Hadoop does not start on Windows XP Jay, What does the following give you on the command line? bin=`dirname $0` // also try `dirname .` echo $bin Regards. Ron

On Sat, Feb 4, 2012 at 10:56 PM, Jay su1...@yahoo.com wrote: Hi, In Windows XP I installed Cygwin and tried to run Hadoop: W1234@W19064-00 /cygdrive/d/Profiles/w1234/My Documents/Hadoop/hadoop1.0/hadoop-1.0.0
$ bin/hadoop start-all.sh
bin/hadoop: line 2: $'\r': command not found
bin/hadoop: line 17: $'\r': command not found
bin/hadoop: line 18: $'\r': command not found
bin/hadoop: line 49: $'\r': command not found
: No such file or directoryn
bin/hadoop: line 52: $'\r': command not found
bin/hadoop: line 60: syntax error near unexpected token `$'in\r''
'in/hadoop: line 60: `case `uname` in
$
I have this in the file hadoop-env.sh: export JAVA_HOME=c:\\Program\ Files\\Java\\jdk1.7.0_02 How could I possibly fix it? Thanks a lot!
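One thing worth checking: the "c:\Program: command not found" error is usually caused by the space in "Program Files" inside JAVA_HOME, which the bin/hadoop shell script does not cope with. A commonly suggested workaround (untested here) is to point hadoop-env.sh at a path without spaces, either by installing the JDK somewhere like C:\Java or by using the 8.3 short name, e.g.:

export JAVA_HOME=/cygdrive/c/Progra~1/Java/jdk1.7.0_02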
Re: Does Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other ?
Same as Merto's situation here: it always overflows a short time after the restart. Without the hadoop metrics enabled everything is smooth. Regards Mete

On Tue, Feb 7, 2012 at 4:58 AM, Merto Mertek masmer...@gmail.com wrote: I have tried to run it but it keeps crashing..

- When you start gmetad and Hadoop is not emitting metrics, everything is peachy.
Right, running just ganglia without running hadoop jobs seems stable for at least a day..

- When you start Hadoop (and it thus starts emitting metrics), gmetad cores.
True, with the following error: *** stack smashing detected ***: gmetad terminated \n Segmentation fault

- On my MacBookPro, it's a SIGABRT due to a buffer overflow. I believe this is happening for everyone. What I would like for you to try out are the following 2 scenarios:
- Once gmetad cores, if you start it up again, does it core again? Does this process repeat ad infinitum?
- On my MBP, the core is a one-time thing, and restarting gmetad after the first core makes things run perfectly smoothly.
- I know others are saying this core occurs continuously, but they were all using ganglia-3.1.x, and I'm interested in how ganglia-3.2.0 behaves for you.
It cores every time I run it. The difference is just that sometimes a segmentation fault appears instantly, and sometimes it appears after a random time... let's say after a minute of running gmetad and collecting data.

- If you start Hadoop first (so gmetad is not running when the first batch of Hadoop metrics is emitted) and THEN start gmetad after a few seconds, do you still see gmetad coring?
Yes

- On my MBP, this sequence works perfectly fine, and there are no gmetad cores whatsoever.
I have tested this scenario with 2 worker nodes, so two gmonds plus the head gmond on the server where gmetad is located. I have checked and all of them are version 3.2.0. Hope it helps..

Bear in mind that this only addresses the gmetad coring issue - the warnings about '4.9E-324' being out of range will continue, but I know what's causing that as well (and hope that my patch fixes it for free). Varun

On Mon, Feb 6, 2012 at 2:39 PM, Merto Mertek masmer...@gmail.com wrote: Yes, I am encountering the same problems, and like Mete said, a few seconds after restarting a segmentation fault appears.. here is my conf.. http://pastebin.com/VgBjp08d And here is some info from /var/log/messages (ubuntu server 10.10): kernel: [424447.140641] gmetad[26115] general protection ip:7f7762428fdb sp:7f776362d370 error:0 in libgcc_s.so.1[7f776241a000+15000] When I compiled gmetad I used the following command: ./configure --with-gmetad --sysconfdir=/etc/ganglia CPPFLAGS=-I/usr/local/rrdtool-1.4.7/include CFLAGS=-I/usr/local/rrdtool-1.4.7/include LDFLAGS=-L/usr/local/rrdtool-1.4.7/lib The same was tried with rrdtool 1.4.5. My current ganglia version is 3.2.0, and like Mete I tried it with version 3.1.7 but without success.. Hope we will sort it out soon; any solution is welcome.. thank you

On 6 February 2012 20:09, mete efk...@gmail.com wrote: Hello, I also face this issue when using GangliaContext31 and hadoop-1.0.0, and ganglia 3.1.7 (also tried 3.1.2). I continuously get buffer overflows as soon as I restart the gmetad. Regards Mete

On Mon, Feb 6, 2012 at 7:42 PM, Vitthal Suhas Gogate gog...@hortonworks.com wrote: I assume you have seen the following information on the Hadoop twiki: http://wiki.apache.org/hadoop/GangliaMetrics So do you use GangliaContext31 in hadoop-metrics2.properties?
We use Ganglia 3.2 with Hadoop 0.20.205 and it works fine (I remember seeing gmetad sometimes go down due to a buffer overflow problem when hadoop starts pumping in the metrics.. but restarting works.. let me know if you face the same problem?) --Suhas

Additionally, the Ganglia protocol changed significantly between Ganglia 3.0 and Ganglia 3.1 (i.e., Ganglia 3.1 is not compatible with Ganglia 3.0 clients). This caused Hadoop not to work with Ganglia 3.1; there is a patch available for this, HADOOP-4675. As of November 2010, this patch has been rolled into the mainline for 0.20.2 and later. To use the Ganglia 3.1 protocol in place of the 3.0 one, substitute org.apache.hadoop.metrics.ganglia.GangliaContext31 for org.apache.hadoop.metrics.ganglia.GangliaContext in the hadoop-metrics.properties lines above.

On Fri, Feb 3, 2012 at 1:07 PM, Merto Mertek masmer...@gmail.com wrote: I spent a lot of time trying to figure it out, but I did not find a solution. Problems in the logs pointed me to some bugs in the rrdupdate tool, however I tried to solve it
The Mapper does not run from JobControl
Using Hadoop version 0.20.. I am creating a chain of jobs, job1 and job2 (whose mappers are in x.jar; there is no reducer), with a dependency between them, and submitting them to the hadoop cluster using JobControl. Note that I have called setJarByClass, and getJar returns the correct jar file when checked before submission. Submission goes through and there seem to be no errors in the user logs or the jobtracker. But I don't see my Mapper getting executed (no sysouts or log output); instead, default output seems to be arriving in the output folder (the input file is read as-is and written out). I am able to run the job directly using x.jar, but I am really out of clues as to why it is not running with JobControl. -- View this message in context: http://old.nabble.com/The-Mapper-does-not-run-from-JobControl-tp33276757p33276757.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
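For reference, below is a minimal sketch of the old-API (org.apache.hadoop.mapred) JobControl pattern being described, assuming two map-only jobs. The class names (ChainDriver, UpperMapper) and paths (/in, /tmp/stage, /out) are placeholders, not the poster's actual code. One observation: the input appearing "as is" in the output folder is exactly what the default identity mapper produces, so it is worth double-checking that setMapperClass is called on the same JobConf instances that are wrapped by the jobcontrol Job objects.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainDriver {

  // Trivial map-only mapper used by both stages in this sketch.
  public static class UpperMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<NullWritable, Text> out, Reporter r) throws IOException {
      out.collect(NullWritable.get(), new Text(line.toString().toUpperCase()));
    }
  }

  private static JobConf makeConf(String name, String in, String out) {
    JobConf conf = new JobConf(ChainDriver.class);   // sets the job jar by class
    conf.setJobName(name);
    conf.setMapperClass(UpperMapper.class);          // without this, the identity mapper runs
    conf.setNumReduceTasks(0);                       // map-only
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(in));
    FileOutputFormat.setOutputPath(conf, new Path(out));
    return conf;
  }

  public static void main(String[] args) throws Exception {
    Job j1 = new Job(makeConf("job1", "/in", "/tmp/stage"));
    Job j2 = new Job(makeConf("job2", "/tmp/stage", "/out"));
    j2.addDependingJob(j1);                          // job2 runs only after job1 succeeds

    JobControl jc = new JobControl("chain");
    jc.addJob(j1);
    jc.addJob(j2);

    Thread t = new Thread(jc);                       // JobControl is a Runnable; run it in a thread
    t.start();
    while (!jc.allFinished()) {
      Thread.sleep(1000);
    }
    jc.stop();
    System.exit(jc.getFailedJobs().isEmpty() ? 0 : 1);
  }
}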
Re: Can I write to an compressed file which is located in hdfs?
Hi, AFAIK I don't think it is possible to append into a compressed file. If you have files in a dir in hdfs and you need to compress them (like files for an hour), you can use MapReduce to do that by setting mapred.output.compress=true and mapred.output.compression.codec='theCodecYouPrefer'. You'd get the blocks compressed in the output dir. To write a compressed file, you can use the API to read from standard input like:
- get the hadoop conf
- register the required compression codec
- write to a CompressionOutputStream
You should find a well-detailed explanation of this in the book 'Hadoop - The Definitive Guide' by Tom White. Regards Bejoy K S From handheld, Please excuse typos.

-Original Message- From: Xiaobin She xiaobin...@gmail.com Date: Tue, 7 Feb 2012 14:24:01 To: common-user@hadoop.apache.org; bejoy.had...@gmail.com; David Sinclair dsincl...@chariotsolutions.com Subject: Re: Can I write to an compressed file which is located in hdfs?

hi Bejoy and David, thank you for your help. So I can't directly write or append logs into a compressed file in hdfs, right? Can I compress a file which is already in hdfs and has not been compressed? If I can, how can I do that? Thanks!

2012/2/6 bejoy.had...@gmail.com Hi, I agree with David on the point; you can achieve step 1 of my previous response with flume, i.e. load the real-time inflow of data in compressed format into hdfs. You can specify a time interval or data size in the flume collector that determines when to flush data onto hdfs. Regards Bejoy K S From handheld, Please excuse typos.

-Original Message- From: David Sinclair dsincl...@chariotsolutions.com Date: Mon, 6 Feb 2012 09:06:00 To: common-user@hadoop.apache.org Cc: bejoy.had...@gmail.com Subject: Re: Can I write to an compressed file which is located in hdfs?

Hi, you may want to have a look at the Flume project from Cloudera. I use it for writing data into HDFS. https://ccp.cloudera.com/display/SUPPORT/Downloads dave

2012/2/6 Xiaobin She xiaobin...@gmail.com hi Bejoy, thank you for your reply. Actually I have set up a test cluster which has one namenode/jobtracker and two datanode/tasktracker nodes, and I have run a test on this cluster. I fetch the log file of one of our modules from the log collector machines by rsync, and then I use the hive command line tool to load this log file into the hive warehouse, which simply copies the file from the local filesystem to hdfs. And I have run some analysis on this data with hive; all of this ran well. But now I want to avoid the fetch step, which uses rsync, and write the logs into hdfs files directly from the servers which generate these logs. And it seems easy to do this job if the file located in hdfs is not compressed. But how to write or append logs to a file that is compressed and located in hdfs? Is this possible? Or is this a bad practice? Thanks!

2012/2/6 bejoy.had...@gmail.com Hi, if you have enough log files to reach at least one block size in an hour, you can go ahead as follows:
- run a scheduled job every hour that compresses the log files for that hour and stores them onto hdfs (can use LZO or even Snappy to compress)
- if your hive does more frequent analysis on this data, store it as PARTITIONED BY (Date, Hour). While loading into hdfs also follow a directory - sub-dir structure. Once data is in hdfs, issue an Alter Table Add Partition statement on the corresponding hive table.
- in Hive DDL use the appropriate input format (Hive has an Apache log input format already)
Regards Bejoy K S From handheld, Please excuse typos.
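As a concrete illustration of the codec steps Bejoy lists above, here is a minimal sketch (not from the thread) that streams standard input into a new gzip-compressed file in HDFS. The output path and the choice of GzipCodec are assumptions for illustration, and note that this creates a new file each time rather than appending to an existing one:

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressedHdfsWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();            // picks up core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    Path out = new Path("/logs/2012-02-07/app.log.gz");  // hypothetical target path
    OutputStream raw = fs.create(out);                    // creates a new file (no append)
    OutputStream compressed = codec.createOutputStream(raw);

    // Copy stdin into the compressed HDFS file; the final 'true' closes both streams.
    IOUtils.copyBytes(System.in, compressed, 4096, true);
  }
}

Something like this could then be fed from the log-producing host, e.g. tail -F app.log | hadoop jar mywriter.jar CompressedHdfsWriter (names are hypothetical), though for continuous ingestion the Flume suggestion above is probably the better fit.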
-Original Message- From: Xiaobin She xiaobin...@gmail.com Date: Mon, 6 Feb 2012 16:41:50 To: common-user@hadoop.apache.org; 佘晓彬 xiaobin...@gmail.com Reply-To: common-user@hadoop.apache.org Subject: Re: Can I write to an compressed file which is located in hdfs?

Sorry, this sentence is wrong: "I can't compress these logs every hour and then put them into hdfs." It should be: "I can compress these logs every hour and then put them into hdfs."

2012/2/6 Xiaobin She xiaobin...@gmail.com hi all, I'm testing hadoop and hive, and I want to use them in log analysis. Here I have a question: can I write/append logs to a compressed file which is located in hdfs? Our system generates lots of log files every day; I can't compress these logs every hour and then put them into hdfs. But what if I want to write logs into files that are already in hdfs and are compressed? If these files were not compressed, then this job would seem easy, but how do I write or append logs into a compressed log? Can I do that? Can anyone give me some advice or some examples? Thank you very much! xiaobin
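For Bejoy's other suggestion earlier in the thread, compressing files that are already sitting uncompressed in hdfs, a map-only pass-through job with compressed output is one way to do it. This is only a rough sketch under the old 0.20 API; the paths and the GzipCodec choice are placeholders, and the small mapper exists just to drop the byte-offset key so the output lines match the input:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CompressExistingLogs {

  // Emits only the line; a NullWritable key keeps TextOutputFormat from
  // writing the byte offset into the compressed output.
  public static class ValueOnlyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<NullWritable, Text> out, Reporter r) throws IOException {
      out.collect(NullWritable.get(), line);
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CompressExistingLogs.class);
    conf.setJobName("compress-hourly-logs");
    conf.setMapperClass(ValueOnlyMapper.class);
    conf.setNumReduceTasks(0);                         // map-only copy
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path("/logs/raw/2012-02-06/16"));
    FileOutputFormat.setOutputPath(conf, new Path("/logs/gz/2012-02-06/16"));

    // Same effect as setting mapred.output.compress=true and
    // mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

    JobClient.runJob(conf);
  }
}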