Re: Using google cloud storage for spark big data

2014-05-05 Thread Akhil Das
Hi Aureliano,

You might want to check this script out,
https://github.com/sigmoidanalytics/spark_gce
Let me know if you need any help around that.

Thanks
Best Regards


On Tue, Apr 22, 2014 at 7:12 PM, Aureliano Buendia buendia...@gmail.com wrote:




 On Tue, Apr 22, 2014 at 10:50 AM, Andras Nemeth 
 andras.nem...@lynxanalytics.com wrote:

 We don't have anything fancy. It's basically a very thin layer of
 Google specifics on top of a standalone cluster. We created two
 disk snapshots, one for the master and one for the workers. The snapshots
 contain initialization scripts so that the master/worker daemons are
 started on boot. So if I want a cluster I just create a new instance (with
 a fixed name) using the master snapshot for the master. When it is up I
 start as many slave instances as I need using the slave snapshot. By the
 time the machines are up the cluster is ready to be used.


 This sounds a lot simpler than the existing spark-ec2 script.
 Does the Google Compute Engine API make this possible in a simpler way
 than the EC2 API? Does your script do everything spark-ec2 does?

 Also, any plans to make this open source?


 Andras



 On Mon, Apr 21, 2014 at 10:04 PM, Mayur Rustagi 
 mayur.rust...@gmail.com wrote:

 Okay, just commented on another thread :)
 I have one that I use internally. I can give it out, but will need some
 support from you to fix bugs etc. Let me know if you are interested.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Fri, Apr 18, 2014 at 9:08 PM, Aureliano Buendia buendia...@gmail.com
  wrote:

 Thanks, Andras. What approach did you use to set up a Spark cluster on
 Google Compute Engine? Currently, there is no production-ready official
 support for an equivalent of spark-ec2 on GCE. Did you roll your own?


 On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth 
 andras.nem...@lynxanalytics.com wrote:

 Hello!

 On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia 
 buendia...@gmail.com wrote:

 Hi,

 Google has published a new connector for Hadoop: Google Cloud
 Storage, which is an equivalent of Amazon S3:


 googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html

 This is actually about Cloud Datastore and not Cloud Storage (yeah,
 quite confusing naming ;) ). But they have already had a Cloud Storage
 connector for a while, also linked from your article:
 https://developers.google.com/hadoop/google-cloud-storage-connector




 How can spark be configured to use this connector?

 It can be done, but in a somewhat hacky way. The problem is that, for some
 reason, Google does not officially publish the library jar on its own; it
 gets installed as part of a Hadoop on Google Cloud installation. So the
 official way would be (we did not try that) to have a Hadoop on Google
 Cloud installation and run Spark on top of that.

 The other option - the one we did try, and which works fine for us - is to
 snatch the jar
 (https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar)
 and make sure it's shipped to your workers (e.g. with setJars on SparkConf
 when you create your SparkContext). Then create a core-site.xml file and
 make sure it is on the classpath both in your driver and on your cluster
 (e.g. by making sure it ends up in one of the jars you send with setJars
 above), with this content (with YOUR_* replaced):
 <configuration>
   <property>
     <name>fs.gs.impl</name>
     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
   </property>
   <property>
     <name>fs.gs.project.id</name>
     <value>YOUR_PROJECT_ID</value>
   </property>
   <property>
     <name>fs.gs.system.bucket</name>
     <value>YOUR_FAVORITE_BUCKET</value>
   </property>
 </configuration>
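
 For reference, a rough sketch of the driver side (the jar paths and master
 URL below are placeholders, not taken from our actual setup) could look
 something like this:

   import org.apache.spark.{SparkConf, SparkContext}

   // Ship the GCS connector jar plus your application jar (which carries
   // core-site.xml at the root of its classpath) to the workers.
   val conf = new SparkConf()
     .setAppName("gcs-example")
     .setMaster("spark://your-master:7077")   // placeholder master URL
     .setJars(Seq(
       "/path/to/gcs-connector-1.2.4.jar",    // the jar snatched above
       "/path/to/your-app.jar"                // contains core-site.xml
     ))
   val sc = new SparkContext(conf)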

 From this point on you can simply use gs://... filenames to read/write
 data on Cloud Storage.
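
 For example (bucket and file names here are made up):

   // Read text files from a GCS bucket, transform them, and write back.
   val lines = sc.textFile("gs://YOUR_FAVORITE_BUCKET/input/*.txt")
   val upper = lines.map(_.toUpperCase)
   upper.saveAsTextFile("gs://YOUR_FAVORITE_BUCKET/output/upper")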

 Note that you should run your cluster and driver program on Google
 Compute Engine for this to work as is. It's probably possible to configure
 access from the outside as well, but we didn't do that.

 Hope this helps,
 Andras
