Re: working with SAS

2012-02-06 Thread alo alt
Hi,

Hadoop mostly runs on Linux boxes and can run as a standalone installation for testing only. If you decide to use Hadoop with Hive or HBase, you have to take on a number of additional tasks:

- installation (with Whirr on Amazon EC2, for example)
- writing your own MapReduce job, or using Hive / HBase
- setting up Sqoop with the Teradata driver

You can easily cover parts 1 and 2 with Amazon's EC2, and I think you can also book Windows Server instances there. For a single query, I think that is the better option before you install a Hadoop cluster.

best,
 Alex 


--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 6, 2012, at 8:11 AM, Ali Jooan Rizvi wrote:

 Hi,
 
 
 
 I would like to know if hadoop will be of help to me? Let me explain you
 guys my scenario:
 
 
 
 I have a windows server based single machine server having 16 Cores and 48
 GB of Physical Memory. In addition, I have 120 GB of virtual memory.
 
 
 
 I am running a query with statistical calculation on large data of over 1
 billion rows, on SAS. In this case, SAS is acting like a database on which
 both source and target tables are residing. For storage, I can keep the
 source and target data on Teradata as well but the query containing a patent
 can only be run on SAS interface.
 
 
 
 The problem is that SAS is taking many days (25 days) to run it (a single
 query with statistical function) and not all cores all the time were used
 and rather merely 5% CPU was utilized on average. However memory utilization
 was high, very high, and that's why large virtual memory was used. 
 
 
 
 Can I have a hadoop interface in place to do it all so that I may end up
 running the query in lesser time that is in 1 or 2 days. Anything squeezing
 my run time will be very helpful. 
 
 
 
 Thanks
 
 
 
 Ali Jooan Rizvi
 



Re: working with SAS

2012-02-06 Thread Prashant Sharma
+ You will not necessarily need vertical (scale-up) systems to speed things up (it depends entirely on your query). Consider commodity hardware (much cheaper), which Hadoop is well suited to; *I hope* your infrastructure can then be cheaper in terms of price-to-performance ratio. Having said that, I do not mean you have to throw away your existing infrastructure, because it is ideal for certain requirements.

Your solution could be to write a MapReduce job which does what the query is supposed to do and run it on a cluster whose size depends on how fast you want things done and on scale (a minimal sketch of such a job is included below). In case your query is ad hoc and has to be run frequently, you might want to consider HBase and Hive as solutions, with a lot of expensive vertical nodes ;).
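
A minimal, hypothetical sketch of such a job (not from the original thread; the class name, input layout and paths are assumptions): it computes a per-group mean over tab-separated rows of the form "groupKey<TAB>value", using the org.apache.hadoop.mapreduce API available from 0.20.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupMean {

  // Emits (groupKey, value) for every input row.
  public static class ParseMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length >= 2) {
        context.write(new Text(fields[0]),
                      new DoubleWritable(Double.parseDouble(fields[1])));
      }
    }
  }

  // Aggregates all values of a group and writes the mean.
  public static class MeanReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values,
        Context context) throws IOException, InterruptedException {
      double sum = 0;
      long count = 0;
      for (DoubleWritable v : values) {
        sum += v.get();
        count++;
      }
      context.write(key, new DoubleWritable(sum / count));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "group mean");
    job.setJarByClass(GroupMean.class);
    job.setMapperClass(ParseMapper.class);
    job.setReducerClass(MeanReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

More involved statistics would follow the same pattern; the cluster size then mainly trades off against how quickly the roughly 1 billion rows need to be processed.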

BTW, is your query iterative? A few more details on the type of query might attract people with more wisdom to help.

HTH


On Mon, Feb 6, 2012 at 1:46 PM, alo alt wget.n...@googlemail.com wrote:

 Hi,

 hadoop is running on a linux box (mostly) and can run in a standalone
 installation for testing only. If you decide to use hadoop with hive or
 hbase you have to face a lot of more tasks:

 - installation (whirr and Amazone EC2 as example)
 - write your own mapreduce job or use hive / hbase
 - setup sqoop with the terradata-driver

 You can easy setup part 1 and 2 with Amazon's EC2, I think you can also
 book Windows Server there. For a single query the best option I think
 before you install a hadoop cluster.

 best,
  Alex


 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 On Feb 6, 2012, at 8:11 AM, Ali Jooan Rizvi wrote:

  Hi,
 
 
 
  I would like to know if hadoop will be of help to me? Let me explain you
  guys my scenario:
 
 
 
  I have a windows server based single machine server having 16 Cores and
 48
  GB of Physical Memory. In addition, I have 120 GB of virtual memory.
 
 
 
  I am running a query with statistical calculation on large data of over 1
  billion rows, on SAS. In this case, SAS is acting like a database on
 which
  both source and target tables are residing. For storage, I can keep the
  source and target data on Teradata as well but the query containing a
 patent
  can only be run on SAS interface.
 
 
 
  The problem is that SAS is taking many days (25 days) to run it (a single
  query with statistical function) and not all cores all the time were used
  and rather merely 5% CPU was utilized on average. However memory
 utilization
  was high, very high, and that's why large virtual memory was used.
 
 
 
  Can I have a hadoop interface in place to do it all so that I may end up
  running the query in lesser time that is in 1 or 2 days. Anything
 squeezing
  my run time will be very helpful.
 
 
 
  Thanks
 
 
 
  Ali Jooan Rizvi
 




Can I write to an compressed file which is located in hdfs?

2012-02-06 Thread Xiaobin She
hi all,

I'm testing Hadoop and Hive, and I want to use them for log analysis.

I have a question: can I write/append logs to a compressed file that is located in HDFS?

Our system generates lots of log files every day; I can't compress these logs every hour and then put them into HDFS.

But what if I want to write logs into files that are already in HDFS and are compressed?

If these files were not compressed, this job would seem easy, but how do I write or append logs to a compressed file?

Can I do that?

Can anyone give me some advice or some examples?

Thank you very much!

xiaobin


Re: Can I write to an compressed file which is located in hdfs?

2012-02-06 Thread Xiaobin She
sorry, this sentence is wrong,

I can't compress these logs every hour and then put them into hdfs.

it should be

I can compress these logs every hour and then put them into hdfs.




2012/2/6 Xiaobin She xiaobin...@gmail.com


 hi all,

 I'm testing hadoop and hive, and I want to use them in log analysis.

 Here I have a question, can I write/append log to  an compressed file
 which is located in hdfs?

 Our system generate lots of log files every day, I can't compress these
 logs every hour and them put them into hdfs.

 But what if I want to write logs into files that was already in the hdfs
 and was compressed?

 Is these files were not compressed, then this job seems easy, but how to
 write or append logs into an compressed log?

 Can I do that?

 Can anyone give me some advices or give me some examples?

 Thank you very much!

 xiaobin



The Common Account for Hadoop

2012-02-06 Thread Bing Li
Dear all,

I am just starting to learn Hadoop. According to the book, Hadoop in
Action, a common account for each server (masters/slaves) must be created.

Moreover, I need to create a public/private rsa key pair as follows.

ssh-keygen -t rsa

Then, id_rsa and id_rsa.pub are put under $HOME/.ssh.

After that, the public key is distributed to the other nodes and saved in
$HOME/.ssh/authorized_keys.

According to the book (page 27), I can log in to a remote target with the
following command.

ssh target (I typed the IP address here)

However, according to the book, no password should be required to sign in to the
target. On my machine, I am required to type a password each time.

Will this affect my configuring Hadoop later? What's wrong with my setup?

Thanks so much!
Bing


Re: The Common Account for Hadoop

2012-02-06 Thread alo alt
Check the permissions of .ssh/authorized_keys on the hosts; it has to be readable and writable only by the user (the .ssh directory as well).
Be sure you copied the right key, without line breaks or fragments. If you have a lot of boxes you could use Bcfg2:
http://docs.bcfg2.org/

- Alex 



--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 6, 2012, at 10:55 AM, Bing Li wrote:

 Dear all,
 
 I am just starting to learn Hadoop. According to the book, Hadoop in
 Action, a common account for each server (masters/slaves) must be created.
 
 Moreover, I need to create a public/private rsa key pair as follows.
 
ssh-keygen -t rsa
 
 Then, id_rsa and id_rsa.pub are put under $HOME/.ssh.
 
 After that, the public key is distributed to other nodes and saved in
 @HOME/.ssh/authorized_keys.
 
 According to the book (Page 27), I can login in a remote target with the
 following command.
 
ssh target (I typed IP address here)
 
 However, according to the book, no password is required to sign in the
 target. On my machine, it is required to type password each time.
 
 Any affects for my future to configure Hadoop? What's wrong with my work?
 
 Thanks so much!
 Bing



Re: The Common Account for Hadoop

2012-02-06 Thread Bing Li
Hi, Alex,

Thanks so much for your help!

I noticed that I didn't put the RSA key to the account's home directory.

Best regards,
Bing

On Mon, Feb 6, 2012 at 6:19 PM, alo alt wget.n...@googlemail.com wrote:

 check the rights of .ssh/authorized_keys on the hosts, have to be only
 read- and writable for the user (including directory)
 Be sure you copied the right key without line-breaks and fragments.  If
 you have a lot of boxes you could use BCFG2:
 http://docs.bcfg2.org/

 - Alex



 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 On Feb 6, 2012, at 10:55 AM, Bing Li wrote:

  Dear all,
 
  I am just starting to learn Hadoop. According to the book, Hadoop in
  Action, a common account for each server (masters/slaves) must be
 created.
 
  Moreover, I need to create a public/private rsa key pair as follows.
 
 ssh-keygen -t rsa
 
  Then, id_rsa and id_rsa.pub are put under $HOME/.ssh.
 
  After that, the public key is distributed to other nodes and saved in
  @HOME/.ssh/authorized_keys.
 
  According to the book (Page 27), I can login in a remote target with the
  following command.
 
 ssh target (I typed IP address here)
 
  However, according to the book, no password is required to sign in the
  target. On my machine, it is required to type password each time.
 
  Any affects for my future to configure Hadoop? What's wrong with my work?
 
  Thanks so much!
  Bing




Re: Can I write to an compressed file which is located in hdfs?

2012-02-06 Thread bejoy . hadoop
Hi
If your log files amount to at least one block size per hour, you can go ahead as follows:
- run a scheduled job every hour that compresses the log files for that hour and stores them in HDFS (you can use LZO or even Snappy for compression); a sketch of this step follows after this list
- if your Hive analysis on this data is frequent, store it PARTITIONED BY (Date, Hour). While loading into HDFS, follow a matching directory / sub-directory structure. Once the data is in HDFS, issue an ALTER TABLE ... ADD PARTITION statement on the corresponding Hive table.
- in the Hive DDL use the appropriate input format (Hive already ships an Apache log input format)
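
To make the first step concrete, here is a minimal, hypothetical Java sketch (not part of the original reply; the class name, paths and partition layout are assumptions) that gzips one hour's local log file into an HDFS directory laid out to match the Hive partitions:

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class HourlyLogLoader {
  public static void main(String[] args) throws Exception {
    String localLog = args[0];   // e.g. /var/log/app/app.2012-02-06-13.log
    String date = args[1];       // e.g. 2012-02-06
    String hour = args[2];       // e.g. 13

    Configuration conf = new Configuration();
    FileSystem localFs = FileSystem.getLocal(conf);
    FileSystem hdfs = FileSystem.get(conf);

    // Target path mirrors the Hive partition layout (date=.../hour=...).
    Path target = new Path("/warehouse/logs/date=" + date + "/hour=" + hour
        + "/" + new Path(localLog).getName() + ".gz");

    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    InputStream in = localFs.open(new Path(localLog));
    OutputStream out = codec.createOutputStream(hdfs.create(target));
    try {
      IOUtils.copyBytes(in, out, conf, false);   // stream and compress into HDFS
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
    // A following "ALTER TABLE logs ADD PARTITION (date='...', hour='...')"
    // in Hive would then register the new directory with the table.
  }
}

Gzip keeps the sketch short; as noted above, LZO or Snappy could be used instead by swapping in the corresponding codec class.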


Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Xiaobin She xiaobin...@gmail.com
Date: Mon, 6 Feb 2012 16:41:50 
To: common-user@hadoop.apache.org; 佘晓彬xiaobin...@gmail.com
Reply-To: common-user@hadoop.apache.org
Subject: Re: Can I write to an compressed file which is located in hdfs?

sorry, this sentence is wrong,

I can't compress these logs every hour and them put them into hdfs.

it should be

I can  compress these logs every hour and them put them into hdfs.




2012/2/6 Xiaobin She xiaobin...@gmail.com


 hi all,

 I'm testing hadoop and hive, and I want to use them in log analysis.

 Here I have a question, can I write/append log to  an compressed file
 which is located in hdfs?

 Our system generate lots of log files every day, I can't compress these
 logs every hour and them put them into hdfs.

 But what if I want to write logs into files that was already in the hdfs
 and was compressed?

 Is these files were not compressed, then this job seems easy, but how to
 write or append logs into an compressed log?

 Can I do that?

 Can anyone give me some advices or give me some examples?

 Thank you very much!

 xiaobin




Re: Can I write to an compressed file which is located in hdfs?

2012-02-06 Thread Xiaobin She
hi Bejoy,

thank you for your reply.

Actually I have set up a test cluster which has one namenode/jobtracker and two datanode/tasktrackers, and I have run a test on this cluster.

I fetch the log file of one of our modules from the log collector machines with rsync, and then I use the Hive command-line tool to load this log file into the Hive warehouse, which simply copies the file from the local filesystem to HDFS.

I have run some analysis on this data with Hive, and all of it runs well.

But now I want to avoid the fetch step, which uses rsync, and write the logs into HDFS files directly from the servers that generate them.

It seems easy to do this if the file located in HDFS is not compressed.

But how do I write or append logs to a file that is compressed and located in HDFS?

Is this possible?

Or is this bad practice?

Thanks!



2012/2/6 bejoy.had...@gmail.com

 Hi
 If you have log files enough to become at least one block size in an
 hour. You can go ahead as
 - run a scheduled job every hour that compresses the log files for that
 hour and stores them on to hdfs (can use LZO or even Snappy to compress)
 - if your hive does more frequent analysis on this data store it as
 PARTITIONED BY (Date,Hour) . While loading into hdfs also follow a
 directory - sub dir structure. Once data is in hdfs issue a Alter Table Add
 Partition statement on corresponding hive table.
 -in Hive DDL use the appropriate Input format (Hive has some ApacheLog
 Input Format already)


 Regards
 Bejoy K S

 From handheld, Please excuse typos.

 -Original Message-
 From: Xiaobin She xiaobin...@gmail.com
 Date: Mon, 6 Feb 2012 16:41:50
 To: common-user@hadoop.apache.org; 佘晓彬xiaobin...@gmail.com
 Reply-To: common-user@hadoop.apache.org
 Subject: Re: Can I write to an compressed file which is located in hdfs?

 sorry, this sentence is wrong,

 I can't compress these logs every hour and them put them into hdfs.

 it should be

 I can  compress these logs every hour and them put them into hdfs.




 2012/2/6 Xiaobin She xiaobin...@gmail.com

 
  hi all,
 
  I'm testing hadoop and hive, and I want to use them in log analysis.
 
  Here I have a question, can I write/append log to  an compressed file
  which is located in hdfs?
 
  Our system generate lots of log files every day, I can't compress these
  logs every hour and them put them into hdfs.
 
  But what if I want to write logs into files that was already in the hdfs
  and was compressed?
 
  Is these files were not compressed, then this job seems easy, but how to
  write or append logs into an compressed log?
 
  Can I do that?
 
  Can anyone give me some advices or give me some examples?
 
  Thank you very much!
 
  xiaobin
 




Re: working with SAS

2012-02-06 Thread Michel Segel
Both responses assume replacing SAS with a Hadoop cluster.
I would agree that going to EC2 might make sense as a proof of concept before investing in a physical cluster, but we need to know more about the underlying problem.

First, can the problem be broken down into something that can be accomplished in parallel sub-tasks? Second, how much data? It could be a good use case for Whirr...

Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 6, 2012, at 2:32 AM, Prashant Sharma prashan...@imaginea.com wrote:

 + you will not necessarily need vertical systems for speeding up
 things(totally depends on your query) . Give a thought of having commodity
 hardware(much cheaper) and hadoop being suited for them, *I hope* your
 infrastructure can be cheaper in terms of price to performance ratio.
 Having said that, I do not mean you have to throw away you existing
 infrastructure, because it is ideal for certain requirements.
 
 your solution can be like writing a mapreduce job which does what query is
 supposed to do and run it on a cluster of size ? depends! (how fast you
 want things be done? and scale). Incase your querry is adhoc and have to be
 run frequently. You might wanna consider HBASE and HIVE as solutions with a
 lot of expensive vertical nodes ;).
 
 BTW Is your querry iterative? A little more details on your type of querry
 can attract guy's with more wisdom to help.
 
 HTH
 
 
 On Mon, Feb 6, 2012 at 1:46 PM, alo alt wget.n...@googlemail.com wrote:
 
 Hi,
 
 hadoop is running on a linux box (mostly) and can run in a standalone
 installation for testing only. If you decide to use hadoop with hive or
 hbase you have to face a lot of more tasks:
 
 - installation (whirr and Amazone EC2 as example)
 - write your own mapreduce job or use hive / hbase
 - setup sqoop with the terradata-driver
 
 You can easy setup part 1 and 2 with Amazon's EC2, I think you can also
 book Windows Server there. For a single query the best option I think
 before you install a hadoop cluster.
 
 best,
 Alex
 
 
 --
 Alexander Lorenz
 http://mapredit.blogspot.com
 
 On Feb 6, 2012, at 8:11 AM, Ali Jooan Rizvi wrote:
 
 Hi,
 
 
 
 I would like to know if hadoop will be of help to me? Let me explain you
 guys my scenario:
 
 
 
 I have a windows server based single machine server having 16 Cores and
 48
 GB of Physical Memory. In addition, I have 120 GB of virtual memory.
 
 
 
 I am running a query with statistical calculation on large data of over 1
 billion rows, on SAS. In this case, SAS is acting like a database on
 which
 both source and target tables are residing. For storage, I can keep the
 source and target data on Teradata as well but the query containing a
 patent
 can only be run on SAS interface.
 
 
 
 The problem is that SAS is taking many days (25 days) to run it (a single
 query with statistical function) and not all cores all the time were used
 and rather merely 5% CPU was utilized on average. However memory
 utilization
 was high, very high, and that's why large virtual memory was used.
 
 
 
 Can I have a hadoop interface in place to do it all so that I may end up
 running the query in lesser time that is in 1 or 2 days. Anything
 squeezing
 my run time will be very helpful.
 
 
 
 Thanks
 
 
 
 Ali Jooan Rizvi
 
 
 


Re: Can I write to an compressed file which is located in hdfs?

2012-02-06 Thread David Sinclair
Hi,

You may want to have a look at the Flume project from Cloudera. I use it
for writing data into HDFS.

https://ccp.cloudera.com/display/SUPPORT/Downloads

dave

2012/2/6 Xiaobin She xiaobin...@gmail.com

 hi Bejoy ,

 thank you for your reply.

 actually I have set up an test cluster which has one namenode/jobtracker
 and two datanode/tasktracker, and I have make an test on this cluster.

 I fetch the log file of one of our modules from the log collector machines
 by rsync, and then I use hive command line tool to load this log file into
 the hive warehouse which  simply copy the file from the local filesystem to
 hdfs.

 And I have run some analysis on these data with hive, all this run well.

 But now I want to avoid the fetch section which use rsync, and write the
 logs into hdfs files directly from the servers which generate these logs.

 And it seems easy to do this job if the file locate in the hdfs is not
 compressed.

 But how to write or append logs to an file that is compressed and located
 in hdfs?

 Is this possible?

 Or is this an bad practice?

 Thanks!



 2012/2/6 bejoy.had...@gmail.com

  Hi
  If you have log files enough to become at least one block size in an
  hour. You can go ahead as
  - run a scheduled job every hour that compresses the log files for that
  hour and stores them on to hdfs (can use LZO or even Snappy to compress)
  - if your hive does more frequent analysis on this data store it as
  PARTITIONED BY (Date,Hour) . While loading into hdfs also follow a
  directory - sub dir structure. Once data is in hdfs issue a Alter Table
 Add
  Partition statement on corresponding hive table.
  -in Hive DDL use the appropriate Input format (Hive has some ApacheLog
  Input Format already)
 
 
  Regards
  Bejoy K S
 
  From handheld, Please excuse typos.
 
  -Original Message-
  From: Xiaobin She xiaobin...@gmail.com
  Date: Mon, 6 Feb 2012 16:41:50
  To: common-user@hadoop.apache.org; 佘晓彬xiaobin...@gmail.com
  Reply-To: common-user@hadoop.apache.org
  Subject: Re: Can I write to an compressed file which is located in hdfs?
 
  sorry, this sentence is wrong,
 
  I can't compress these logs every hour and them put them into hdfs.
 
  it should be
 
  I can  compress these logs every hour and them put them into hdfs.
 
 
 
 
  2012/2/6 Xiaobin She xiaobin...@gmail.com
 
  
   hi all,
  
   I'm testing hadoop and hive, and I want to use them in log analysis.
  
   Here I have a question, can I write/append log to  an compressed file
   which is located in hdfs?
  
   Our system generate lots of log files every day, I can't compress these
   logs every hour and them put them into hdfs.
  
   But what if I want to write logs into files that was already in the
 hdfs
   and was compressed?
  
   Is these files were not compressed, then this job seems easy, but how
 to
   write or append logs into an compressed log?
  
   Can I do that?
  
   Can anyone give me some advices or give me some examples?
  
   Thank you very much!
  
   xiaobin
  
 
 



Re: Can I write to an compressed file which is located in hdfs?

2012-02-06 Thread bejoy . hadoop
Hi
I agree with David on that point; you can achieve step 1 of my previous response with Flume, i.e. load the real-time inflow of data into HDFS in compressed format. You can specify a time interval or data size in the Flume collector that determines when to flush data to HDFS. (A rough sketch of the general 'write compressed data as it arrives' idea, without Flume, follows below.)
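
For illustration only (this sketch is not from the original reply; the paths, key/value types and codec are assumptions): if Flume is not used, the closest hand-rolled alternative is to stream records into a block-compressed SequenceFile in HDFS as they arrive and roll to a new file periodically, rather than appending to an already-compressed file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class CompressedLogWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One file per roll interval; a new path is chosen each time we roll.
    Path file = new Path("/logs/app/part-" + System.currentTimeMillis() + ".seq");

    // BLOCK compression compresses batches of records, so records can be
    // written continuously while the file stays splittable for MapReduce/Hive.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, file, LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK, new DefaultCodec());
    try {
      for (int i = 0; i < 1000; i++) {   // stand-in for an incoming log stream
        writer.append(new LongWritable(System.currentTimeMillis()),
                      new Text("log line " + i));
      }
    } finally {
      writer.close();
    }
  }
}

Whether this is preferable to the Flume route depends on how much delivery and retry logic you want to maintain yourself.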

Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: David Sinclair dsincl...@chariotsolutions.com
Date: Mon, 6 Feb 2012 09:06:00 
To: common-user@hadoop.apache.org
Cc: bejoy.had...@gmail.com
Subject: Re: Can I write to an compressed file which is located in hdfs?

Hi,

You may want to have a look at the Flume project from Cloudera. I use it
for writing data into HDFS.

https://ccp.cloudera.com/display/SUPPORT/Downloads

dave

2012/2/6 Xiaobin She xiaobin...@gmail.com

 hi Bejoy ,

 thank you for your reply.

 actually I have set up an test cluster which has one namenode/jobtracker
 and two datanode/tasktracker, and I have make an test on this cluster.

 I fetch the log file of one of our modules from the log collector machines
 by rsync, and then I use hive command line tool to load this log file into
 the hive warehouse which  simply copy the file from the local filesystem to
 hdfs.

 And I have run some analysis on these data with hive, all this run well.

 But now I want to avoid the fetch section which use rsync, and write the
 logs into hdfs files directly from the servers which generate these logs.

 And it seems easy to do this job if the file locate in the hdfs is not
 compressed.

 But how to write or append logs to an file that is compressed and located
 in hdfs?

 Is this possible?

 Or is this an bad practice?

 Thanks!



 2012/2/6 bejoy.had...@gmail.com

  Hi
  If you have log files enough to become at least one block size in an
  hour. You can go ahead as
  - run a scheduled job every hour that compresses the log files for that
  hour and stores them on to hdfs (can use LZO or even Snappy to compress)
  - if your hive does more frequent analysis on this data store it as
  PARTITIONED BY (Date,Hour) . While loading into hdfs also follow a
  directory - sub dir structure. Once data is in hdfs issue a Alter Table
 Add
  Partition statement on corresponding hive table.
  -in Hive DDL use the appropriate Input format (Hive has some ApacheLog
  Input Format already)
 
 
  Regards
  Bejoy K S
 
  From handheld, Please excuse typos.
 
  -Original Message-
  From: Xiaobin She xiaobin...@gmail.com
  Date: Mon, 6 Feb 2012 16:41:50
  To: common-user@hadoop.apache.org; 佘晓彬xiaobin...@gmail.com
  Reply-To: common-user@hadoop.apache.org
  Subject: Re: Can I write to an compressed file which is located in hdfs?
 
  sorry, this sentence is wrong,
 
  I can't compress these logs every hour and them put them into hdfs.
 
  it should be
 
  I can  compress these logs every hour and them put them into hdfs.
 
 
 
 
  2012/2/6 Xiaobin She xiaobin...@gmail.com
 
  
   hi all,
  
   I'm testing hadoop and hive, and I want to use them in log analysis.
  
   Here I have a question, can I write/append log to  an compressed file
   which is located in hdfs?
  
   Our system generate lots of log files every day, I can't compress these
   logs every hour and them put them into hdfs.
  
   But what if I want to write logs into files that was already in the
 hdfs
   and was compressed?
  
   Is these files were not compressed, then this job seems easy, but how
 to
   write or append logs into an compressed log?
  
   Can I do that?
  
   Can anyone give me some advices or give me some examples?
  
   Thank you very much!
  
   xiaobin
  
 
 




HDFS Files Seem to be Stored in the Wrong Location?

2012-02-06 Thread Eli Finkelshteyn

Hi,
I have a pseudo-distributed Hadoop cluster setup, and I'm currently 
hoping to put about 100 gigs of files on it to play around with. I got a 
unix box at work no one else is using for this, and running a df -h, I get:

FilesystemSize  Used Avail Use% Mounted on
/dev/sda1 7.9G  2.4G  5.2G  31% /
none  3.8G 0  3.8G   0% /dev/shm
/dev/sdb  414G  210M  393G   1% /mnt

Alright, so /mnt looks quite big and seems like a good place to store my 
hdfs files. I go ahead and create a folder named hadoop-data there and 
set the following in hdfs-site.xml:


<property>
  <!-- where hadoop stores its files (datanodes only) -->
  <name>dfs.name.dir</name>
  <value>/mnt/hadoop-data</value>
</property>

After a bit of troubleshooting, I restart the cluster and try to put a 
couple of test files onto HDFS. Doing an ls of hadoop-data, I see:


$ ls
current  image  in_use.lock  previous.checkpoint

OK, things look good. Time to try uploading some real data. Now, here's 
where the problem arises. If I add a 10mb dummy file to hadoop-data 
through regular unix and run df -h, I see that the used space of /mnt 
goes up exactly 10mb. But, when I start running a big dump of data through:


hadoop fs -put ~/hadoop_playground/data2/data2/ /data/

I notice from df -h that the data seems to end up in completely the wrong location! Note that below, only the usage of /dev/sda1 has increased; /mnt has not moved.


FilesystemSize  Used Avail Use% Mounted on
/dev/sda1 7.9G  3.4G  4.2G  45% /
none  3.8G 0  3.8G   0% /dev/shm
/dev/sdb  414G  210M  393G   1% /mnt

So, what gives? Anyone have any clue how my files are seemingly both put 
in the hadoop-data folder, but take up space elsewhere? I could see this 
likely being a Unix issue, but I figured I'd ask here just in case it's 
not, since I'm pretty stumped.


Cheers,
Eli


Re: HDFS Files Seem to be Stored in the Wrong Location?

2012-02-06 Thread Harsh J
You need your dfs.data.dir configured to the bigger disks for data.
That config targets the datanodes.

The one you've overridden is for the namenode's metadata, and hence the
default dfs.data.dir config is writing to /tmp on your root disk
(which is a bad thing; it gets wiped after a reboot).

On Mon, Feb 6, 2012 at 9:51 PM, Eli Finkelshteyn iefin...@gmail.com wrote:
 Hi,
 I have a pseudo-distributed Hadoop cluster setup, and I'm currently hoping
 to put about 100 gigs of files on it to play around with. I got a unix box
 at work no one else is using for this, and running a df -h, I get:
 Filesystem            Size  Used Avail Use% Mounted on
 /dev/sda1             7.9G  2.4G  5.2G  31% /
 none                  3.8G     0  3.8G   0% /dev/shm
 /dev/sdb              414G  210M  393G   1% /mnt

 Alright, so /mnt looks quite big and seems like a good place to store my
 hdfs files. I go ahead and create a folder named hadoop-data there and set
 the following in hdfs-site.xml:

 property
 !-- where hadoop stores its files (datanodes only) --
 namedfs.name.dir/name
 value/mnt/hadoop-data/value
 /property

 After a bit of troubleshooting, I restart the cluster and try to put a
 couple of test files onto HDFS. Doing an ls of hadoop-data, I see:

 $ ls
 current  image  in_use.lock  previous.checkpoint

 OK, things look good. Time to try uploading some real data. Now, here's
 where the problem arises. If I add a 10mb dummy file to hadoop-data through
 regular unix and run df -h, I see that the used space of /mnt goes up
 exactly 10mb. But, when I start running a big dump of data through:

 hadoop fs -put ~/hadoop_playground/data2/data2/ /data/

 I notice that running df -h seems to put the data in completely the wrong
 location! Note that below, only the usage of /dev/sda1 has increased. /mnt
 has not moved.

 Filesystem            Size  Used Avail Use% Mounted on
 /dev/sda1             7.9G  3.4G  4.2G  45% /
 none                  3.8G     0  3.8G   0% /dev/shm
 /dev/sdb              414G  210M  393G   1% /mnt

 So, what gives? Anyone have any clue how my files are seemingly both put in
 the hadoop-data folder, but take up space elsewhere? I could see this likely
 being a Unix issue, but I figured I'd ask here just in case it's not, since
 I'm pretty stumped.

 Cheers,
 Eli



-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about


Tom White's book, 2nd ed. Which API?

2012-02-06 Thread Keith Wiley
I have the first edition of Tom White's O'Reilly Hadoop book and I was curious 
about the second edition.  I realize it adds new sections on some of the 
wrapper tools, like Hive, but as far as the core Hadoop documentation is 
concerned, I'm wondering if there is much difference?  In particular, I was 
curious if it teaches the .20 API?  The first edition explicitly taught .19 
because .20 wasn't quite vetted at the time he wrote it.  He even explains that 
in the book.

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.commusic.keithwiley.com

You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched.
   --  Keith Wiley




Re: HDFS Files Seem to be Stored in the Wrong Location?

2012-02-06 Thread Eli Finkelshteyn

Ah, crud. Typo on my part. Don't know how I didn't notice that. Thanks!

On 2/6/12 11:30 AM, Harsh J wrote:

You need your dfs.data.dir configured to the bigger disks for data.
That config targets the datanodes.

The one you've overriden is for the namenode's metadata, and hence the
default dfs.data.dir config is writing to /tmp on your root disk
(which is a bad thing, gets wiped after a reboot).

On Mon, Feb 6, 2012 at 9:51 PM, Eli Finkelshteyniefin...@gmail.com  wrote:

Hi,
I have a pseudo-distributed Hadoop cluster setup, and I'm currently hoping
to put about 100 gigs of files on it to play around with. I got a unix box
at work no one else is using for this, and running a df -h, I get:
FilesystemSize  Used Avail Use% Mounted on
/dev/sda1 7.9G  2.4G  5.2G  31% /
none  3.8G 0  3.8G   0% /dev/shm
/dev/sdb  414G  210M  393G   1% /mnt

Alright, so /mnt looks quite big and seems like a good place to store my
hdfs files. I go ahead and create a folder named hadoop-data there and set
the following in hdfs-site.xml:

property
!-- where hadoop stores its files (datanodes only) --
namedfs.name.dir/name
value/mnt/hadoop-data/value
/property

After a bit of troubleshooting, I restart the cluster and try to put a
couple of test files onto HDFS. Doing an ls of hadoop-data, I see:

$ ls
current  image  in_use.lock  previous.checkpoint

OK, things look good. Time to try uploading some real data. Now, here's
where the problem arises. If I add a 10mb dummy file to hadoop-data through
regular unix and run df -h, I see that the used space of /mnt goes up
exactly 10mb. But, when I start running a big dump of data through:

hadoop fs -put ~/hadoop_playground/data2/data2/ /data/

I notice that running df -h seems to put the data in completely the wrong
location! Note that below, only the usage of /dev/sda1 has increased. /mnt
has not moved.

FilesystemSize  Used Avail Use% Mounted on
/dev/sda1 7.9G  3.4G  4.2G  45% /
none  3.8G 0  3.8G   0% /dev/shm
/dev/sdb  414G  210M  393G   1% /mnt

So, what gives? Anyone have any clue how my files are seemingly both put in
the hadoop-data folder, but take up space elsewhere? I could see this likely
being a Unix issue, but I figured I'd ask here just in case it's not, since
I'm pretty stumped.

Cheers,
Eli







Re: Tom White's book, 2nd ed. Which API?

2012-02-06 Thread zep
On Monday, February 06, 2012 11:36:10 AM, Keith Wiley wrote:
 I have the first edition of Tom White's O'Reilly Hadoop book and I was 
 curious about the second edition.  I realize it adds new sections on some of 
 the wrapper tools, like Hive, but as far as the core Hadoop documentation is 
 concerned, I'm wondering if there is much difference?  In particular, I was 
 curious if it teaches the .20 API?  The first edition explicitly taught .19 
 because .20 wasn't quite vetted at the time he wrote it.  He even explains 
 that in the book.

 Thanks.


I have access to a safaribooksonline account and according to a quick 
scan:

What’s New in the Second Edition?
The second edition has two new chapters on Hive and Sqoop (Chapters 12 
and 15), a
new section covering Avro (in Chapter 4), an introduction to the new 
security features
in Hadoop (in Chapter 9), and a new case study on analyzing massive 
network graphs
using Hadoop (in Chapter 16).
This edition continues to describe the 0.20 release series of Apache 
Hadoop, since this
was the latest stable release at the time of writing. New features from 
later releases are
occasionally mentioned in the text, however, with reference to the 
version that they
were introduced in.

you could also check out amazon's look inside functionality to check 
a few key pages once you find the second edition.

hope this is of some help.



Re: Tom White's book, 2nd ed. Which API?

2012-02-06 Thread W.P. McNeill
The second edition of Tom White's *Hadoop: The Definitive Guide* (http://www.librarything.com/work/book/72181963) uses the old API for its examples, though it does contain a brief two-page overview of the new API.

The first edition is all old API.
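
To illustrate what the difference amounts to, here is a small, illustrative comparison of an identity mapper in each API (the class names are made up, not taken from the book):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// "Old" API (org.apache.hadoop.mapred), used by the 1st and 2nd edition examples.
class OldApiMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {
  public void map(LongWritable key, Text value,
      OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    output.collect(key, value);   // results go through an OutputCollector
  }
}

// "New" context-object API (org.apache.hadoop.mapreduce), available from 0.20.
class NewApiMapper
    extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(key, value);    // results go through the Context object
  }
}

The main shift is that OutputCollector and Reporter are folded into a single Context object, and Mapper/Reducer become abstract classes instead of interfaces.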


Re: Tom White's book, 2nd ed. Which API?

2012-02-06 Thread Richard Nadeau
If you're looking to buy the 2nd edition you might want to wait, the third
edition is in the works now.

Regards,
Rick
On Feb 6, 2012 10:24 AM, W.P. McNeill bill...@gmail.com wrote:

 The second edition of Tom White's *Hadoop: The Definitive
 Guidehttp://www.librarything.com/work/book/72181963
 * uses the old API for its examples, though it does contain a brief
 two-page overview of the new API.

 The first edition is all old API.



Re: Does Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other ?

2012-02-06 Thread Vitthal Suhas Gogate
I assume you have seen the following information on Hadoop twiki,
http://wiki.apache.org/hadoop/GangliaMetrics

So do you use GangliaContext31 in hadoop-metrics2.properties?

We use Ganglia 3.2 with Hadoop 0.20.205 and it works fine (I remember gmetad sometimes going down due to a buffer overflow problem when Hadoop starts pumping in the metrics, but restarting works; let me know if you face the same problem).

--Suhas

Additionally, the Ganglia protocol change significantly between Ganglia 3.0
and Ganglia 3.1 (i.e., Ganglia 3.1 is not compatible with Ganglia 3.0
clients). This caused Hadoop to not work with Ganglia 3.1; there is a patch
available for this, HADOOP-4675. As of November 2010, this patch has been
rolled into the mainline for 0.20.2 and later. To use the Ganglia 3.1
protocol in place of the 3.0, substitute
org.apache.hadoop.metrics.ganglia.GangliaContext31 for
org.apache.hadoop.metrics.ganglia.GangliaContext in the
hadoop-metrics.properties lines above.

On Fri, Feb 3, 2012 at 1:07 PM, Merto Mertek masmer...@gmail.com wrote:

 I spent a lot of time to figure it out however i did not find a solution.
 Problems from the logs pointed me for some bugs in rrdupdate tool, however
 i tried to solve it with different versions of ganglia and rrdtool but the
 error is the same. Segmentation fault appears after the following lines, if
 I run gmetad in debug mode...

 Created rrd

 /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.publish_max_time.rrd
 Created rrd

 /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.snapshot_max_time.rrd
 

 which I suppose are generated from MetricsSystemImpl.java (Is there any way
 just to disable this two metrics?)

 From the /var/log/messages there are a lot of errors:

 xxx gmetad[15217]: RRD_update

 (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.publish_imax_time.rrd):
 converting  '4.9E-324' to float: Numerical result out of range
 xxx gmetad[15217]: RRD_update

 (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.snapshot_imax_time.rrd):
 converting  '4.9E-324' to float: Numerical result out of range

 so probably there are some converting issues ? Where should I look for the
 solution? Would you rather suggest to use ganglia 3.0.x with the old
 protocol and leave the version 3.1 for further releases?

 any help is realy appreciated...

 On 1 February 2012 04:04, Merto Mertek masmer...@gmail.com wrote:

  I would be glad to hear that too.. I've setup the following:
 
  Hadoop 0.20.205
  Ganglia Front  3.1.7
  Ganglia Back *(gmetad)* 3.1.7
  RRDTool http://www.rrdtool.org/ 1.4.5. - i had some troubles
  installing 1.4.4
 
  Ganglia works just in case hadoop is not running, so metrics are not
  publshed to gmetad node (conf with new hadoop-metrics2.proprieties). When
  hadoop is started, a segmentation fault appears in gmetad deamon:
 
  sudo gmetad -d 2
  ...
  Updating host xxx, metric dfs.FSNamesystem.BlocksTotal
  Updating host xxx, metric bytes_in
  Updating host xxx, metric bytes_out
  Updating host xxx, metric metricssystem.MetricsSystem.publish_max_time
  Created rrd
 
 /var/lib/ganglia/rrds/hdcluster/hadoopmaster/metricssystem.MetricsSystem.publish_max_time.rrd
  Segmentation fault
 
  And some info from the apache log http://pastebin.com/nrqKRtKJ..
 
  Can someone suggest a ganglia version that is tested with hadoop
 0.20.205?
  I will try to sort it out however it seems a not so tribial problem..
 
  Thank you
 
 
 
 
 
  On 2 December 2011 12:32, praveenesh kumar praveen...@gmail.com wrote:
 
  or Do I have to apply some hadoop patch for this ?
 
  Thanks,
  Praveenesh
 
 
 



Re: Does Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other ?

2012-02-06 Thread mete
Hello,
i also face this issue when using GangliaContext31 and hadoop-1.0.0, and
ganglia 3.1.7 (also tried 3.1.2). I continuously get buffer overflows as
soon as i restart the gmetad.
Regards
Mete

On Mon, Feb 6, 2012 at 7:42 PM, Vitthal Suhas Gogate 
gog...@hortonworks.com wrote:

 I assume you have seen the following information on Hadoop twiki,
 http://wiki.apache.org/hadoop/GangliaMetrics

 So do you use GangliaContext31 in hadoop-metrics2.properties?

 We use Ganglia 3.2 with Hadoop 20.205  and works fine (I remember seeing
 gmetad sometime goes down due to buffer overflow problem when hadoop starts
 pumping in the metrics.. but restarting works.. let me know if you face
 same problem?

 --Suhas

 Additionally, the Ganglia protocol change significantly between Ganglia 3.0
 and Ganglia 3.1 (i.e., Ganglia 3.1 is not compatible with Ganglia 3.0
 clients). This caused Hadoop to not work with Ganglia 3.1; there is a patch
 available for this, HADOOP-4675. As of November 2010, this patch has been
 rolled into the mainline for 0.20.2 and later. To use the Ganglia 3.1
 protocol in place of the 3.0, substitute
 org.apache.hadoop.metrics.ganglia.GangliaContext31 for
 org.apache.hadoop.metrics.ganglia.GangliaContext in the
 hadoop-metrics.properties lines above.

 On Fri, Feb 3, 2012 at 1:07 PM, Merto Mertek masmer...@gmail.com wrote:

  I spent a lot of time to figure it out however i did not find a solution.
  Problems from the logs pointed me for some bugs in rrdupdate tool,
 however
  i tried to solve it with different versions of ganglia and rrdtool but
 the
  error is the same. Segmentation fault appears after the following lines,
 if
  I run gmetad in debug mode...
 
  Created rrd
 
 
 /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.publish_max_time.rrd
  Created rrd
 
 
 /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.snapshot_max_time.rrd
  
 
  which I suppose are generated from MetricsSystemImpl.java (Is there any
 way
  just to disable this two metrics?)
 
  From the /var/log/messages there are a lot of errors:
 
  xxx gmetad[15217]: RRD_update
 
 
 (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.publish_imax_time.rrd):
  converting  '4.9E-324' to float: Numerical result out of range
  xxx gmetad[15217]: RRD_update
 
 
 (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.snapshot_imax_time.rrd):
  converting  '4.9E-324' to float: Numerical result out of range
 
  so probably there are some converting issues ? Where should I look for
 the
  solution? Would you rather suggest to use ganglia 3.0.x with the old
  protocol and leave the version 3.1 for further releases?
 
  any help is realy appreciated...
 
  On 1 February 2012 04:04, Merto Mertek masmer...@gmail.com wrote:
 
   I would be glad to hear that too.. I've setup the following:
  
   Hadoop 0.20.205
   Ganglia Front  3.1.7
   Ganglia Back *(gmetad)* 3.1.7
   RRDTool http://www.rrdtool.org/ 1.4.5. - i had some troubles
   installing 1.4.4
  
   Ganglia works just in case hadoop is not running, so metrics are not
   publshed to gmetad node (conf with new hadoop-metrics2.proprieties).
 When
   hadoop is started, a segmentation fault appears in gmetad deamon:
  
   sudo gmetad -d 2
   ...
   Updating host xxx, metric dfs.FSNamesystem.BlocksTotal
   Updating host xxx, metric bytes_in
   Updating host xxx, metric bytes_out
   Updating host xxx, metric metricssystem.MetricsSystem.publish_max_time
   Created rrd
  
 
 /var/lib/ganglia/rrds/hdcluster/hadoopmaster/metricssystem.MetricsSystem.publish_max_time.rrd
   Segmentation fault
  
   And some info from the apache log http://pastebin.com/nrqKRtKJ..
  
   Can someone suggest a ganglia version that is tested with hadoop
  0.20.205?
   I will try to sort it out however it seems a not so tribial problem..
  
   Thank you
  
  
  
  
  
   On 2 December 2011 12:32, praveenesh kumar praveen...@gmail.com
 wrote:
  
   or Do I have to apply some hadoop patch for this ?
  
   Thanks,
   Praveenesh
  
  
  
 



Re: How to Set the Value of hadoop.tmp.dir?

2012-02-06 Thread bejoy . hadoop
Hi Bing
What are your values for dfs.name.dir and dfs.data.dir? I believe they are still pointing under /tmp. It is better to change them to another location, as /tmp gets wiped on every reboot.

--Original Message--
From: Bing Li
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
ReplyTo: bing...@asu.edu
Subject: How to Set the Value of hadoop.tmp.dir?
Sent: Feb 7, 2012 02:04

Dear all,

I am a new Hadoop learner. The version I used is 1.0.0.

I tried to set a new value (instead of /tmp) for the parameter hadoop.tmp.dir in core-site.xml, hdfs-site.xml and mapred-site.xml. Do I need to do that in all of the above XML files?

However, when I execute the format command, I am asked if I want to reformat the filesystem in /tmp/hadoop-myname/dfs/name. Why is the path still /tmp?

Thanks,
Bing



Regards
Bejoy K S

From handheld, Please excuse typos.


Re: Tom White's book, 2nd ed. Which API?

2012-02-06 Thread Russell Jurney
Or get O'Reilly Safari, which would get you both?

On Feb 6, 2012, at 9:34 AM, Richard Nadeau strout...@gmail.com wrote:

 If you're looking to buy the 2nd edition you might want to wait, the third
 edition is in the works now.
 
 Regards,
 Rick
 On Feb 6, 2012 10:24 AM, W.P. McNeill bill...@gmail.com wrote:
 
 The second edition of Tom White's *Hadoop: The Definitive
 Guidehttp://www.librarything.com/work/book/72181963
 * uses the old API for its examples, though it does contain a brief
 two-page overview of the new API.
 
 The first edition is all old API.
 


Re: Does Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other ?

2012-02-06 Thread Merto Mertek
Yes, I am encountering the same problems, and like Mete said, a few seconds after restarting a segmentation fault appears. Here is my conf:
http://pastebin.com/VgBjp08d

And here are some info from /var/log/messages (ubuntu server 10.10):

kernel: [424447.140641] gmetad[26115] general protection ip:7f7762428fdb
 sp:7f776362d370 error:0 in libgcc_s.so.1[7f776241a000+15000]


When I compiled gmetad I used the following command:

./configure --with-gmetad --sysconfdir=/etc/ganglia
 CPPFLAGS=-I/usr/local/rrdtool-1.4.7/include
 CFLAGS=-I/usr/local/rrdtool-1.4.7/include
 LDFLAGS=-L/usr/local/rrdtool-1.4.7/lib


I also tried the same with rrdtool 1.4.5. My current Ganglia version is 3.2.0, and like Mete I tried version 3.1.7 as well, but without success.

Hope we will sort out a solution soon.
thank you


On 6 February 2012 20:09, mete efk...@gmail.com wrote:

 Hello,
 i also face this issue when using GangliaContext31 and hadoop-1.0.0, and
 ganglia 3.1.7 (also tried 3.1.2). I continuously get buffer overflows as
 soon as i restart the gmetad.
 Regards
 Mete

 On Mon, Feb 6, 2012 at 7:42 PM, Vitthal Suhas Gogate 
 gog...@hortonworks.com wrote:

  I assume you have seen the following information on Hadoop twiki,
  http://wiki.apache.org/hadoop/GangliaMetrics
 
  So do you use GangliaContext31 in hadoop-metrics2.properties?
 
  We use Ganglia 3.2 with Hadoop 20.205  and works fine (I remember seeing
  gmetad sometime goes down due to buffer overflow problem when hadoop
 starts
  pumping in the metrics.. but restarting works.. let me know if you face
  same problem?
 
  --Suhas
 
  Additionally, the Ganglia protocol change significantly between Ganglia
 3.0
  and Ganglia 3.1 (i.e., Ganglia 3.1 is not compatible with Ganglia 3.0
  clients). This caused Hadoop to not work with Ganglia 3.1; there is a
 patch
  available for this, HADOOP-4675. As of November 2010, this patch has been
  rolled into the mainline for 0.20.2 and later. To use the Ganglia 3.1
  protocol in place of the 3.0, substitute
  org.apache.hadoop.metrics.ganglia.GangliaContext31 for
  org.apache.hadoop.metrics.ganglia.GangliaContext in the
  hadoop-metrics.properties lines above.
 
  On Fri, Feb 3, 2012 at 1:07 PM, Merto Mertek masmer...@gmail.com
 wrote:
 
   I spent a lot of time to figure it out however i did not find a
 solution.
   Problems from the logs pointed me for some bugs in rrdupdate tool,
  however
   i tried to solve it with different versions of ganglia and rrdtool but
  the
   error is the same. Segmentation fault appears after the following
 lines,
  if
   I run gmetad in debug mode...
  
   Created rrd
  
  
 
 /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.publish_max_time.rrd
   Created rrd
  
  
 
 /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.snapshot_max_time.rrd
   
  
   which I suppose are generated from MetricsSystemImpl.java (Is there any
  way
   just to disable this two metrics?)
  
   From the /var/log/messages there are a lot of errors:
  
   xxx gmetad[15217]: RRD_update
  
  
 
 (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.publish_imax_time.rrd):
   converting  '4.9E-324' to float: Numerical result out of range
   xxx gmetad[15217]: RRD_update
  
  
 
 (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.snapshot_imax_time.rrd):
   converting  '4.9E-324' to float: Numerical result out of range
  
   so probably there are some converting issues ? Where should I look for
  the
   solution? Would you rather suggest to use ganglia 3.0.x with the old
   protocol and leave the version 3.1 for further releases?
  
   any help is realy appreciated...
  
   On 1 February 2012 04:04, Merto Mertek masmer...@gmail.com wrote:
  
I would be glad to hear that too.. I've setup the following:
   
Hadoop 0.20.205
Ganglia Front  3.1.7
Ganglia Back *(gmetad)* 3.1.7
RRDTool http://www.rrdtool.org/ 1.4.5. - i had some troubles
installing 1.4.4
   
Ganglia works just in case hadoop is not running, so metrics are not
publshed to gmetad node (conf with new hadoop-metrics2.proprieties).
  When
hadoop is started, a segmentation fault appears in gmetad deamon:
   
sudo gmetad -d 2
...
Updating host xxx, metric dfs.FSNamesystem.BlocksTotal
Updating host xxx, metric bytes_in
Updating host xxx, metric bytes_out
Updating host xxx, metric
 metricssystem.MetricsSystem.publish_max_time
Created rrd
   
  
 
 /var/lib/ganglia/rrds/hdcluster/hadoopmaster/metricssystem.MetricsSystem.publish_max_time.rrd
Segmentation fault
   
And some info from the apache log http://pastebin.com/nrqKRtKJ..
   
Can someone suggest a ganglia version that is tested with hadoop
   0.20.205?
I will try to sort it out however it seems a not so tribial problem..
   
Thank you
   
   
   
   
   
On 2 December 2011 12:32, praveenesh kumar praveen...@gmail.com
  wrote:
   
or Do I have to apply some hadoop patch for this ?
   

Re: Does Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other ?

2012-02-06 Thread Varun Kapoor
Hey Merto,

I've been digging into this problem since Sunday, and believe I may have
root-caused it.

I'm using ganglia-3.2.0, rrdtool-1.4.5 and
http://svn.apache.org/viewvc/hadoop/common/branches/branch-1/ (which I
believe should be running essentially the identical relevant code as
0.20.205).

While I test out one of my potential fixes, I would appreciate if you could
confirm my understanding of the behavior you are seeing:

- When you start gmetad and Hadoop is not emitting metrics, everything is peachy.
- When you start Hadoop (and it thus starts emitting metrics), gmetad cores.
  - On my MacBook Pro, it's a SIGABRT due to a buffer overflow.

I believe this is happening for everyone. What I would like for you to try out are the following 2 scenarios:

- Once gmetad cores, if you start it up again, does it core again? Does this process repeat ad infinitum?
  - On my MBP, the core is a one-time thing, and restarting gmetad after the first core makes things run perfectly smoothly.
  - I know others are saying this core occurs continuously, but they were all using ganglia-3.1.x, and I'm interested in how ganglia-3.2.0 behaves for you.
- If you start Hadoop first (so gmetad is not running when the first batch of Hadoop metrics are emitted) and THEN start gmetad after a few seconds, do you still see gmetad coring?
  - On my MBP, this sequence works perfectly fine, and there are no gmetad cores whatsoever.

Bear in mind that this only addresses the gmetad coring issue - the
warnings emitted about '4.9E-324' being out of range will continue, but I
know what's causing that as well (and hope that my patch fixes it for free).

Varun
On Mon, Feb 6, 2012 at 2:39 PM, Merto Mertek masmer...@gmail.com wrote:

 Yes I am encoutering the same problems and like Mete said  few seconds
 after restarting a segmentation fault appears.. here is my conf..
 http://pastebin.com/VgBjp08d

 And here are some info from /var/log/messages (ubuntu server 10.10):

 kernel: [424447.140641] gmetad[26115] general protection ip:7f7762428fdb
  sp:7f776362d370 error:0 in libgcc_s.so.1[7f776241a000+15000]
 

 When I compiled gmetad I used the following command:

 ./configure --with-gmetad --sysconfdir=/etc/ganglia
  CPPFLAGS=-I/usr/local/rrdtool-1.4.7/include
  CFLAGS=-I/usr/local/rrdtool-1.4.7/include
  LDFLAGS=-L/usr/local/rrdtool-1.4.7/lib
 

 The same was tried with rrdtool 1.4.5. My current ganglia version is 3.2.0
 and like Mete I tried it with version 3.1.7 but without success..

 Hope we will sort it out soon any solution..
 thank you


 On 6 February 2012 20:09, mete efk...@gmail.com wrote:

  Hello,
  i also face this issue when using GangliaContext31 and hadoop-1.0.0, and
  ganglia 3.1.7 (also tried 3.1.2). I continuously get buffer overflows as
  soon as i restart the gmetad.
  Regards
  Mete
 
  On Mon, Feb 6, 2012 at 7:42 PM, Vitthal Suhas Gogate 
  gog...@hortonworks.com wrote:
 
   I assume you have seen the following information on Hadoop twiki,
   http://wiki.apache.org/hadoop/GangliaMetrics
  
   So do you use GangliaContext31 in hadoop-metrics2.properties?
  
   We use Ganglia 3.2 with Hadoop 20.205  and works fine (I remember
 seeing
   gmetad sometime goes down due to buffer overflow problem when hadoop
  starts
   pumping in the metrics.. but restarting works.. let me know if you face
   same problem?
  
   --Suhas
  
   Additionally, the Ganglia protocol change significantly between Ganglia
  3.0
   and Ganglia 3.1 (i.e., Ganglia 3.1 is not compatible with Ganglia 3.0
   clients). This caused Hadoop to not work with Ganglia 3.1; there is a
  patch
   available for this, HADOOP-4675. As of November 2010, this patch has
 been
   rolled into the mainline for 0.20.2 and later. To use the Ganglia 3.1
   protocol in place of the 3.0, substitute
   org.apache.hadoop.metrics.ganglia.GangliaContext31 for
   org.apache.hadoop.metrics.ganglia.GangliaContext in the
   hadoop-metrics.properties lines above.
  
   On Fri, Feb 3, 2012 at 1:07 PM, Merto Mertek masmer...@gmail.com
  wrote:
  
I spent a lot of time to figure it out however i did not find a
  solution.
Problems from the logs pointed me for some bugs in rrdupdate tool,
   however
i tried to solve it with different versions of ganglia and rrdtool
 but
   the
error is the same. Segmentation fault appears after the following
  lines,
   if
I run gmetad in debug mode...
   
Created rrd
   
   
  
 
 /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.publish_max_time.rrd
Created rrd
   
   
  
 
 /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.snapshot_max_time.rrd

   
which I suppose are generated from MetricsSystemImpl.java (Is there
 any
   way
just to disable this two metrics?)
   
From the /var/log/messages there are a lot of errors:
   
xxx gmetad[15217]: RRD_update
   
   
  
 
 

Re: Tom White's book, 2nd ed. Which API?

2012-02-06 Thread Keith Wiley
Thanks everyone.  I knew about the upcoming third edition.  I'm not sure I want 
to wait until May to learn the new API (pretty old actually).  I'd like to 
find a resource that goes through the new API.  I realize Tom White's examples 
are offered with the new API online; I was just hoping for something more explicitly instructional.

I'll figure something out.

Cheers!

On Feb 6, 2012, at 09:34 , Richard Nadeau wrote:

 If you're looking to buy the 2nd edition you might want to wait, the third
 edition is in the works now.
 
 Regards,
 Rick
 On Feb 6, 2012 10:24 AM, W.P. McNeill bill...@gmail.com wrote:
 
 The second edition of Tom White's *Hadoop: The Definitive
 Guidehttp://www.librarything.com/work/book/72181963
 * uses the old API for its examples, though it does contain a brief
 two-page overview of the new API.
 
 The first edition is all old API.
 



Keith Wiley kwi...@keithwiley.com keithwiley.commusic.keithwiley.com

Luminous beings are we, not this crude matter.
   --  Yoda




Re: Does Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other ?

2012-02-06 Thread Merto Mertek
I have tried to run it, but it keeps crashing.

  - When you start gmetad and Hadoop is not emitting metrics, everything
   is peachy.


Right, running just ganglia without running hadoop jobs seems stable for at
least a day..


   - When you start Hadoop (and it thus starts emitting metrics), gmetad
   cores.


True, with a  following error : *** stack smashing detected ***: gmetad
terminated \n Segmentation fault

 - On my MacBookPro, it's a SIGABRT due to a buffer overflow.

 I believe this is happening for everyone. What I would like for you to try
 out are the following 2 scenarios:

   - Once gmetad cores, if you start it up again, does it core again? Does
   this process repeat ad infinitum?

 - On my MBP, the core is a one-time thing, and restarting gmetad
  after the first core makes things run perfectly smoothly.
 - I know others are saying this core occurs continuously, but they
 were all using ganglia-3.1.x, and I'm interested in how
 ganglia-3.2.0
 behaves for you.


It cores every time I run it. The difference is just that sometimes the segmentation fault appears instantly, and sometimes it appears after a random time, let's say after a minute of running gmetad and collecting data.


 - If you start Hadoop first (so gmetad is not running when the
   first batch of Hadoop metrics are emitted) and THEN start gmetad after a
   few seconds, do you still see gmetad coring?


Yes


  - On my MBP, this sequence works perfectly fine, and there are no
  gmetad cores whatsoever.


I have tested this scenario with 2 worker nodes, so two gmonds plus the head gmond on the server where gmetad is located. I have checked, and all of them are version 3.2.0.

Hope it helps..




 Bear in mind that this only addresses the gmetad coring issue - the
 warnings emitted about '4.9E-324' being out of range will continue, but I
 know what's causing that as well (and hope that my patch fixes it for
 free).

 Varun
 On Mon, Feb 6, 2012 at 2:39 PM, Merto Mertek masmer...@gmail.com wrote:

  Yes, I am encountering the same problems and, like Mete said, a few seconds
  after restarting a segmentation fault appears.. here is my conf..
  http://pastebin.com/VgBjp08d
 
  And here are some info from /var/log/messages (ubuntu server 10.10):
 
  kernel: [424447.140641] gmetad[26115] general protection ip:7f7762428fdb
   sp:7f776362d370 error:0 in libgcc_s.so.1[7f776241a000+15000]
  
 
  When I compiled gmetad I used the following command:
 
  ./configure --with-gmetad --sysconfdir=/etc/ganglia
   CPPFLAGS=-I/usr/local/rrdtool-1.4.7/include
   CFLAGS=-I/usr/local/rrdtool-1.4.7/include
   LDFLAGS=-L/usr/local/rrdtool-1.4.7/lib
  
 
  The same was tried with rrdtool 1.4.5. My current ganglia version is
 3.2.0
  and like Mete I tried it with version 3.1.7 but without success..
 
  Hope we will sort out a solution soon..
  thank you
 
 
  On 6 February 2012 20:09, mete efk...@gmail.com wrote:
 
   Hello,
   i also face this issue when using GangliaContext31 and hadoop-1.0.0,
 and
   ganglia 3.1.7 (also tried 3.1.2). I continuously get buffer overflows
 as
   soon as i restart the gmetad.
   Regards
   Mete
  
   On Mon, Feb 6, 2012 at 7:42 PM, Vitthal Suhas Gogate 
   gog...@hortonworks.com wrote:
  
I assume you have seen the following information on Hadoop twiki,
http://wiki.apache.org/hadoop/GangliaMetrics
   
So do you use GangliaContext31 in hadoop-metrics2.properties?
   
We use Ganglia 3.2 with Hadoop 20.205 and it works fine. (I remember seeing
gmetad sometimes go down due to a buffer overflow problem when hadoop starts
pumping in the metrics.. but restarting works.. let me know if you face the
same problem?)
   
--Suhas
   
Additionally, the Ganglia protocol changed significantly between Ganglia 3.0
and Ganglia 3.1 (i.e., Ganglia 3.1 is not compatible with Ganglia 3.0
clients). This caused Hadoop to not work with Ganglia 3.1; there is a patch
available for this, HADOOP-4675. As of November 2010, this patch has been
rolled into the mainline for 0.20.2 and later. To use the Ganglia 3.1
protocol in place of the 3.0, substitute
org.apache.hadoop.metrics.ganglia.GangliaContext31 for
org.apache.hadoop.metrics.ganglia.GangliaContext in the
hadoop-metrics.properties lines above.
   
On Fri, Feb 3, 2012 at 1:07 PM, Merto Mertek masmer...@gmail.com
   wrote:
   
  I spent a lot of time to figure it out, however I did not find a solution.
  Problems from the logs pointed me to some bugs in the rrdupdate tool;
  however, I tried to solve it with different versions of ganglia and
  rrdtool, but the error is the same. A segmentation fault appears after the
  following lines if I run gmetad in debug mode...

  Created rrd
  /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.publish_max_time.rrd
  Created rrd

Re: Hadoop does not start on Windows XP

2012-02-06 Thread Jay
Hi Ron,

Thank you. I deleted the Hadoop directory from my Windows folder.
The untar/unzip on Cygwin created the directory d:\hadoop

(Ex: here is a path: D:\Hadoop\hadoop-1.0.0\bin\ )

Now I could start Hadoop:
$ bin/hadoop start-all.sh

Above worked.
But similar problem persists:
$ bin/hadoop fs -ls
bin/hadoop: line 321: c:\Program: command not found
Found 26 items

$ bin/hadoop dfs -mkdir urls
cygwin warning:
  MS-DOS style path detected: D:\Hadoop\hadoop-1.0.0\/build/native
  Preferred POSIX equivalent is: /cygdrive/d/Hadoop/hadoop-1.0.0/build/native
  CYGWIN environment variable option nodosfilewarning turns off this warning.
  Consult the user's guide for more details about POSIX paths:
    http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
bin/hadoop: line 321: c:\Program: command not found

How could I possibly get rid of the error?
Btw,

$ bin=`dirname .`

/cygdrive/d/Hadoop/hadoop-1.0.0
$ echo $bin
.
/cygdrive/d/Hadoop/hadoop-1.0.0
$ pwd
/cygdrive/d/Hadoop/hadoop-1.0.0


Thanks,
Jay




 From: Ronald Petty ronald.pe...@gmail.com
To: common-user@hadoop.apache.org; Jay su1...@yahoo.com 
Sent: Sunday, February 5, 2012 4:28 PM
Subject: Re: Hadoop does not start on Windows XP
 
Jay,

What does the following give you on the command line?

bin=`dirname $0`   //also try =`dirname .`

echo $bin

Regards.

Ron

On Sat, Feb 4, 2012 at 10:56 PM, Jay su1...@yahoo.com wrote:

 Hi,

 In Windows XP I installed Cygwin and tried to run Hadoop:

 W1234@W19064-00 /cygdrive/d/Profiles/w1234/My
 Documents/Hadoop/hadoop1.0/hadoop-1.0.0
 $ bin/hadoop start-all.sh
 bin/hadoop: line 2: $'\r': command not found
 bin/hadoop: line 17: $'\r': command not found
 bin/hadoop: line 18: $'\r': command not found
 bin/hadoop: line 49: $'\r': command not found
 : No such file or directoryn
 bin/hadoop: line 52: $'\r': command not found
 bin/hadoop: line 60: syntax error near unexpected token `$'in\r''
 'in/hadoop: line 60: `case `uname` in
 $


 I have this in the file hadoop-env.sh

 export JAVA_HOME=c:\\Program\ Files\\Java\\jdk1.7.0_02

 How could I possibly fix it?

 Thanks a lot!


Re: Does Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other ?

2012-02-06 Thread mete
Same as Merto's situation here, it always overflows a short time after the
restart. Without the hadoop metrics enabled everything is smooth.
Regards

Mete

On Tue, Feb 7, 2012 at 4:58 AM, Merto Mertek masmer...@gmail.com wrote:

 I have tried to run it but it keeps crashing..

  - When you start gmetad and Hadoop is not emitting metrics, everything
is peachy.
 

 Right, running just ganglia without running hadoop jobs seems stable for at
 least a day..


- When you start Hadoop (and it thus starts emitting metrics), gmetad
cores.
 

 True, with the following error: *** stack smashing detected ***: gmetad
 terminated \n Segmentation fault

 - On my MacBookPro, it's a SIGABRT due to a buffer overflow.
 
  I believe this is happening for everyone. What I would like for you to
 try
  out are the following 2 scenarios:
 
- Once gmetad cores, if you start it up again, does it core again? Does
this process repeat ad infinitum?
 
 - On my MBP, the core is a one-time thing, and restarting gmetad
   after the first core makes things run perfectly smoothly.
  - I know others are saying this core occurs continuously, but
 they
  were all using ganglia-3.1.x, and I'm interested in how
  ganglia-3.2.0
  behaves for you.
 

 It cores every time I run it. The difference is just that sometimes the
 segmentation fault appears instantly, and sometimes it appears after a
 random time... let's say after a minute of running gmetad and collecting
 data.


  - If you start Hadoop first (so gmetad is not running when the
first batch of Hadoop metrics are emitted) and THEN start gmetad after
 a
few seconds, do you still see gmetad coring?
 

 Yes


   - On my MBP, this sequence works perfectly fine, and there are no
   gmetad cores whatsoever.
 

 I have tested this scenario with two worker nodes, so two gmonds plus the head
 gmond on the server where gmetad is located. I have checked and all of them
 are version 3.2.0.

 Hope it helps..



 
  Bear in mind that this only addresses the gmetad coring issue - the
  warnings emitted about '4.9E-324' being out of range will continue, but I
  know what's causing that as well (and hope that my patch fixes it for
  free).
 
  Varun
  On Mon, Feb 6, 2012 at 2:39 PM, Merto Mertek masmer...@gmail.com
 wrote:
 
   Yes, I am encountering the same problems and, like Mete said, a few seconds
   after restarting a segmentation fault appears.. here is my conf..
   http://pastebin.com/VgBjp08d
  
   And here are some info from /var/log/messages (ubuntu server 10.10):
  
   kernel: [424447.140641] gmetad[26115] general protection
 ip:7f7762428fdb
sp:7f776362d370 error:0 in libgcc_s.so.1[7f776241a000+15000]
   
  
   When I compiled gmetad I used the following command:
  
   ./configure --with-gmetad --sysconfdir=/etc/ganglia
CPPFLAGS=-I/usr/local/rrdtool-1.4.7/include
CFLAGS=-I/usr/local/rrdtool-1.4.7/include
LDFLAGS=-L/usr/local/rrdtool-1.4.7/lib
   
  
   The same was tried with rrdtool 1.4.5. My current ganglia version is
  3.2.0
   and like Mete I tried it with version 3.1.7 but without success..
  
   Hope we will sort out a solution soon..
   thank you
  
  
   On 6 February 2012 20:09, mete efk...@gmail.com wrote:
  
Hello,
i also face this issue when using GangliaContext31 and hadoop-1.0.0,
  and
ganglia 3.1.7 (also tried 3.1.2). I continuously get buffer overflows
  as
soon as i restart the gmetad.
Regards
Mete
   
On Mon, Feb 6, 2012 at 7:42 PM, Vitthal Suhas Gogate 
gog...@hortonworks.com wrote:
   
 I assume you have seen the following information on Hadoop twiki,
 http://wiki.apache.org/hadoop/GangliaMetrics

 So do you use GangliaContext31 in hadoop-metrics2.properties?

 We use Ganglia 3.2 with Hadoop 20.205 and it works fine. (I remember seeing
 gmetad sometimes go down due to a buffer overflow problem when hadoop starts
 pumping in the metrics.. but restarting works.. let me know if you face the
 same problem?)

 --Suhas

 Additionally, the Ganglia protocol changed significantly between Ganglia 3.0
 and Ganglia 3.1 (i.e., Ganglia 3.1 is not compatible with Ganglia 3.0
 clients). This caused Hadoop to not work with Ganglia 3.1; there is a patch
 available for this, HADOOP-4675. As of November 2010, this patch has been
 rolled into the mainline for 0.20.2 and later. To use the Ganglia 3.1
 protocol in place of the 3.0, substitute
 org.apache.hadoop.metrics.ganglia.GangliaContext31 for
 org.apache.hadoop.metrics.ganglia.GangliaContext in the
 hadoop-metrics.properties lines above.

 On Fri, Feb 3, 2012 at 1:07 PM, Merto Mertek masmer...@gmail.com
wrote:

   I spent a lot of time to figure it out, however I did not find a solution.
   Problems from the logs pointed me to some bugs in the rrdupdate tool;
   however, I tried to solve it 

The Mapper does not run from JobControl

2012-02-06 Thread prajor

Using Hadoop version 0.20.. I am creating a chain of jobs, job1 and job2
(whose mappers are in x.jar; there is no reducer), with a dependency, and
submitting it to the hadoop cluster using JobControl. Note I have called
setJarByClass, and getJar gives the correct jar file when checked before
submission.

Submission goes through and there seem to be no errors in the user logs or
the jobtracker. But I don't see my Mapper getting executed (no sysouts or log
output); the default output seems to be coming to the output folder (the
input file is read as-is and written out). I am able to run the job directly
using x.jar, but I am really out of clues as to why it is not running with
JobControl.
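
For reference, a rough sketch of how a two-job, map-only chain is typically
wired up with the old-API JobControl is below; IdentityMapper and the /in,
/tmp/stage1, /out paths are placeholders rather than the actual jobs in x.jar,
and one thing worth checking is that JobControl.run() is driven in its own
thread and polled until allFinished():

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class ChainRunner {

  // Build a map-only job; IdentityMapper stands in for the real mappers in x.jar.
  private static JobConf mapOnlyJob(String name, String in, String out) {
    JobConf conf = new JobConf(ChainRunner.class);   // ships this class's jar
    conf.setJobName(name);
    conf.setMapperClass(IdentityMapper.class);
    conf.setNumReduceTasks(0);                       // no reducer
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(in));
    FileOutputFormat.setOutputPath(conf, new Path(out));
    return conf;
  }

  public static void main(String[] args) throws Exception {
    Job job1 = new Job(mapOnlyJob("job1", "/in", "/tmp/stage1"));
    Job job2 = new Job(mapOnlyJob("job2", "/tmp/stage1", "/out"));
    job2.addDependingJob(job1);                      // job2 waits for job1

    JobControl control = new JobControl("chain");
    control.addJob(job1);
    control.addJob(job2);

    Thread runner = new Thread(control);             // run() blocks, so thread it
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    System.out.println("Failed jobs: " + control.getFailedJobs());
    control.stop();
  }
}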
-- 
View this message in context: 
http://old.nabble.com/The-Mapper-does-not-run-from-JobControl-tp33276757p33276757.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Can I write to an compressed file which is located in hdfs?

2012-02-06 Thread bejoy . hadoop
Hi
AFAIK it is not possible to append to a compressed file.

If you have files in an hdfs dir and you need to compress them (like the
files for an hour), you can use MapReduce to do that by setting
mapred.output.compress=true and
mapred.output.compression.codec='theCodecYouPrefer'.
You'd get the blocks compressed in the output dir.
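
As a rough sketch (old API; /logs/raw, /logs/gz and GzipCodec are placeholder
choices, not a recommendation), such a pass-through compression job could look
like:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CompressExistingLogs {

  // Pass-through mapper: emit each line unchanged, with a null key so the
  // default TextOutputFormat writes only the line itself.
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<NullWritable, Text> out, Reporter reporter)
        throws IOException {
      out.collect(NullWritable.get(), line);
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CompressExistingLogs.class);
    conf.setJobName("compress-existing-logs");
    conf.setMapperClass(LineMapper.class);
    conf.setNumReduceTasks(0);                        // map-only rewrite
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(Text.class);
    FileOutputFormat.setCompressOutput(conf, true);   // mapred.output.compress
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    FileInputFormat.setInputPaths(conf, new Path("/logs/raw"));  // placeholder
    FileOutputFormat.setOutputPath(conf, new Path("/logs/gz"));  // placeholder
    JobClient.runJob(conf);
  }
}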

You can use the API to read from standard input like
-get hadoop conf
-register the required compression codec
-write to CompressionOutputStream.
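
A minimal sketch of those three steps, assuming gzip as the codec and a
made-up target path:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressedHdfsWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();            // 1. get the hadoop conf
    FileSystem fs = FileSystem.get(conf);
    GzipCodec codec =                                    // 2. set up the codec
        ReflectionUtils.newInstance(GzipCodec.class, conf);
    Path out = new Path("/logs/2012-02-06/app.log.gz");  // placeholder path
    CompressionOutputStream cos =                        // 3. write through the codec
        codec.createOutputStream(fs.create(out));
    InputStream in = System.in;                          // e.g. pipe the log data in
    IOUtils.copyBytes(in, cos, 4096, false);
    cos.finish();                                        // flush the gzip trailer
    cos.close();
    in.close();
  }
}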

You should get a well detailed explanation on the same from the book 'Hadoop - 
The definitive guide' by Tom White. 

Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Xiaobin She xiaobin...@gmail.com
Date: Tue, 7 Feb 2012 14:24:01 
To: common-user@hadoop.apache.org; bejoy.had...@gmail.com; David 
Sinclairdsincl...@chariotsolutions.com
Subject: Re: Can I write to an compressed file which is located in hdfs?

hi Bejoy and David,

thank you for you help.

So I can't directly write logs or append logs to a compressed file in
hdfs, right?

Can I compress a file which is already in hdfs and has not been compressed?

If I can , how can I do that?

Thanks!



2012/2/6 bejoy.had...@gmail.com

 Hi
   I agree with David on the point; you can achieve step 1 of my
 previous response with flume, i.e. load the real-time inflow of data in
 compressed format into hdfs. You can specify a time interval or data size
 in the flume collector that determines when to flush data on to hdfs.

 Regards
 Bejoy K S

 From handheld, Please excuse typos.

 -Original Message-
 From: David Sinclair dsincl...@chariotsolutions.com
 Date: Mon, 6 Feb 2012 09:06:00
 To: common-user@hadoop.apache.org
 Cc: bejoy.had...@gmail.com
 Subject: Re: Can I write to an compressed file which is located in hdfs?

 Hi,

 You may want to have a look at the Flume project from Cloudera. I use it
 for writing data into HDFS.

 https://ccp.cloudera.com/display/SUPPORT/Downloads

 dave

 2012/2/6 Xiaobin She xiaobin...@gmail.com

  hi Bejoy ,
 
  thank you for your reply.
 
  actually I have set up a test cluster which has one namenode/jobtracker
  and two datanodes/tasktrackers, and I have run a test on this cluster.
 
  I fetch the log file of one of our modules from the log collector machines
  by rsync, and then I use the hive command line tool to load this log file
  into the hive warehouse, which simply copies the file from the local
  filesystem to hdfs.

  And I have run some analysis on this data with hive; all of it ran well.
 
  But now I want to avoid the fetch step which uses rsync, and write the
  logs into hdfs files directly from the servers which generate these logs.
 
  And it seems easy to do this job if the file located in hdfs is not
  compressed.

  But how do I write or append logs to a file that is compressed and located
  in hdfs?

  Is this possible?

  Or is this a bad practice?
 
  Thanks!
 
 
 
  2012/2/6 bejoy.had...@gmail.com
 
   Hi
   If you have enough log files to reach at least one block size in an hour,
   you can go ahead as follows:
   - run a scheduled job every hour that compresses the log files for that
   hour and stores them on to hdfs (you can use LZO or even Snappy to
   compress)
   - if your hive does more frequent analysis on this data, store it as
   PARTITIONED BY (Date, Hour). While loading into hdfs also follow a
   directory/sub-dir structure. Once the data is in hdfs, issue an ALTER TABLE
   ADD PARTITION statement on the corresponding hive table.
   - in the Hive DDL use the appropriate input format (Hive has an ApacheLog
   input format already)
  
  
   Regards
   Bejoy K S
  
   From handheld, Please excuse typos.
  
   -Original Message-
   From: Xiaobin She xiaobin...@gmail.com
   Date: Mon, 6 Feb 2012 16:41:50
   To: common-user@hadoop.apache.org; 佘晓彬 xiaobin...@gmail.com
   Reply-To: common-user@hadoop.apache.org
   Subject: Re: Can I write to an compressed file which is located in
 hdfs?
  
   sorry, this sentence is wrong,

   I can't compress these logs every hour and then put them into hdfs.

   it should be

   I can compress these logs every hour and then put them into hdfs.
  
  
  
  
   2012/2/6 Xiaobin She xiaobin...@gmail.com
  
   
hi all,
   
I'm testing hadoop and hive, and I want to use them in log analysis.
   
Here I have a question: can I write/append logs to a compressed file
which is located in hdfs?
   
Our system generates lots of log files every day; I can't compress these
logs every hour and then put them into hdfs.
   
But what if I want to write logs into files that are already in hdfs
and are compressed?
   
If these files were not compressed, then this job would seem easy, but how do I
write or append logs into a compressed log?
   
Can I do that?
   
Can anyone give me some advice or some examples?
   
Thank you very much!
   
xiaobin