Hi Aureliano,
You might want to check out this script:
https://github.com/sigmoidanalytics/spark_gce
Let me know if you need any help around that.
Thanks
Best Regards
On Tue, Apr 22, 2014 at 7:12 PM, Aureliano Buendia buendia...@gmail.com wrote:
On Tue, Apr 22, 2014 at 10:50 AM, Andras Nemeth
andras.nem...@lynxanalytics.com wrote:
We don't have anything fancy. It's basically a very thin layer of
Google specifics on top of a standalone cluster. We basically created two
disk snapshots, one for the master and one for the workers. The snapshots
contain initialization scripts so that the master/worker daemons are
started on boot. So if I want a cluster I just create a new instance (with
a fixed name) using the master snapshot for the master. When it is up I
start as many slave instances as I need using the slave snapshot. By the
time the machines are up the cluster is ready to be used.
This sounds a lot simpler than the existing spark-ec2 script.
Does the Google Compute Engine API make this simpler to do than the
EC2 API does? Does your script do everything spark-ec2 does?
Also, any plans to make this open source?
Andras
On Mon, Apr 21, 2014 at 10:04 PM, Mayur Rustagi
mayur.rust...@gmail.com wrote:
Okay just commented on another thread :)
I have one that I use internally. Can give it out but will need some
support from you to fix bugs etc. Let me know if you are interested.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Apr 18, 2014 at 9:08 PM, Aureliano Buendia buendia...@gmail.com
wrote:
Thanks, Andras. What approach did you use to set up a Spark cluster on
Google Compute Engine? Currently, there is no production-ready official
support for an equivalent of spark-ec2 on GCE. Did you roll your own?
On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth
andras.nem...@lynxanalytics.com wrote:
Hello!
On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia
buendia...@gmail.com wrote:
Hi,
Google has published a new connector for Hadoop: Google Cloud
Storage, which is an equivalent of Amazon S3:
googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
This is actually about Cloud Datastore and not Cloud Storage (yeah,
quite confusing naming ;) ). But they have had a Cloud Storage
connector for a while, and it is also linked from your article:
https://developers.google.com/hadoop/google-cloud-storage-connector
How can spark be configured to use this connector?
It can be done, but in a somewhat hacky way. The problem is that for some
reason Google does not officially publish the library jar on its own; you
get it installed as part of a Hadoop on Google Cloud installation. So the
official way (we did not try it) would be to have a Hadoop on Google
Cloud installation and run Spark on top of that.
The other option, which we did try and which works fine for us, is to
snatch the jar:
https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar
and make sure it's shipped to your workers (e.g. with setJars on SparkConf
when you create your SparkContext). Then create a core-site.xml file and
make sure it is on the classpath both in your driver and in your cluster
(e.g. by putting it in one of the jars you send with setJars above).
It should have this content (with YOUR_* replaced):
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>YOUR_PROJECT_ID</value>
  </property>
  <property>
    <name>fs.gs.system.bucket</name>
    <value>YOUR_FAVORITE_BUCKET</value>
  </property>
</configuration>
From this point on you can simply use gs://... filenames to read/write
data on Cloud Storage.
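If you would rather generate the core-site.xml than write it by hand, here is a minimal sketch in Python using only the standard library. The function name build_core_site and the choice to write the file to the current directory are my own assumptions for illustration, not something we use in our setup. The three property names are the ones listed above.

```python
import xml.etree.ElementTree as ET


def build_core_site(project_id, system_bucket):
    """Return core-site.xml text holding the three GCS connector properties.

    build_core_site is a hypothetical helper name; only the property names
    and the GoogleHadoopFileSystem class name come from the thread.
    """
    settings = [
        ("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"),
        ("fs.gs.project.id", project_id),
        ("fs.gs.system.bucket", system_bucket),
    ]
    root = ET.Element("configuration")
    for name, value in settings:
        # Each setting becomes <property><name>..</name><value>..</value></property>
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Replace the placeholders with your own project id and bucket.
    with open("core-site.xml", "w") as f:
        f.write(build_core_site("YOUR_PROJECT_ID", "YOUR_FAVORITE_BUCKET"))
```

The resulting file still has to end up on the classpath of the driver and the workers, as described above; the script only produces the file.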
Note that you should run your cluster and driver program on Google
Compute Engine for this to work as is. Probably it's possible to configure
access from the outside too but we didn't do that.
Hope this helps,
Andras