Hi Aureliano,

You might want to check out this script: https://github.com/sigmoidanalytics/spark_gce

Let me know if you need any help with it.
Thanks
Best Regards

On Tue, Apr 22, 2014 at 7:12 PM, Aureliano Buendia <buendia...@gmail.com> wrote:

> On Tue, Apr 22, 2014 at 10:50 AM, Andras Nemeth
> <andras.nem...@lynxanalytics.com> wrote:
>
>> We don't have anything fancy. It's basically a very thin layer of
>> Google specifics on top of a standalone cluster. We created two
>> disk snapshots, one for the master and one for the workers. The snapshots
>> contain initialization scripts so that the master/worker daemons are
>> started on boot. So if I want a cluster, I just create a new instance (with
>> a fixed name) using the master snapshot for the master. When it is up, I
>> start as many slave instances as I need using the slave snapshot. By the
>> time the machines are up, the cluster is ready to be used.
>
> This sounds a lot simpler than the existing spark-ec2 script.
> Does the Google Compute Engine API make this simpler than the
> EC2 API does? Does your script do everything spark-ec2 does?
>
> Also, any plans to make this open source?
>
>> Andras
>>
>> On Mon, Apr 21, 2014 at 10:04 PM, Mayur Rustagi
>> <mayur.rust...@gmail.com> wrote:
>>
>>> Okay just commented on another thread :)
>>> I have one that I use internally. Can give it out but will need some
>>> support from you to fix bugs etc. Let me know if you are interested.
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>> On Fri, Apr 18, 2014 at 9:08 PM, Aureliano Buendia
>>> <buendia...@gmail.com> wrote:
>>>
>>>> Thanks, Andras. What approach did you use to set up a Spark cluster on
>>>> Google Compute Engine? Currently, there is no production-ready official
>>>> support for an equivalent of spark-ec2 on GCE. Did you roll your own?
>>>>
>>>> On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth
>>>> <andras.nem...@lynxanalytics.com> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia
>>>>> <buendia...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Google has published a new connector for Hadoop: Google Cloud
>>>>>> Storage, which is an equivalent of Amazon S3:
>>>>>>
>>>>>> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>>>>>
>>>>> This is actually about Cloud Datastore and not Cloud Storage (yeah,
>>>>> quite confusing naming ;) ). But they have had a Cloud Storage
>>>>> connector for a while, also linked from your article:
>>>>> https://developers.google.com/hadoop/google-cloud-storage-connector
>>>>>
>>>>>> How can Spark be configured to use this connector?
>>>>>
>>>>> Yes, it can, but in a somewhat hacky way. The problem is that for some
>>>>> reason Google does not officially publish the library jar alone; you get
>>>>> it installed as part of a Hadoop on Google Cloud installation. So the
>>>>> official way would be (we did not try that) to have a Hadoop on Google
>>>>> Cloud installation and run Spark on top of that.
>>>>>
>>>>> The other option - that we did try and which works fine for us - is to
>>>>> snatch the jar
>>>>> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar
>>>>> and make sure it's shipped to your workers (e.g. with setJars on SparkConf
>>>>> when you create your SparkContext). Then create a core-site.xml file which
>>>>> you make sure is on the classpath both in your driver and your cluster
>>>>> (e.g. you can make sure it ends up in one of the jars you send with
>>>>> setJars above) with this content (with YOUR_* replaced):
>>>>>
>>>>> <configuration>
>>>>>   <property><name>fs.gs.impl</name><value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value></property>
>>>>>   <property><name>fs.gs.project.id</name><value>YOUR_PROJECT_ID</value></property>
>>>>>   <property><name>fs.gs.system.bucket</name><value>YOUR_FAVORITE_BUCKET</value></property>
>>>>> </configuration>
>>>>>
>>>>> From this point on you can simply use gs://... filenames to read/write
>>>>> data on Cloud Storage.
>>>>>
>>>>> Note that you should run your cluster and driver program on Google
>>>>> Compute Engine for this to work as is. It's probably possible to configure
>>>>> access from the outside too, but we didn't do that.
>>>>>
>>>>> Hope this helps,
>>>>> Andras
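
For anyone who would rather generate the core-site.xml described in Andras's message than write it by hand, here is a minimal Python sketch. It only produces the three properties quoted in the thread; the function name build_core_site and the constant GCS_FS_CLASS are hypothetical names made up for this illustration, and the project ID and bucket remain the placeholders from the thread.

```python
import xml.etree.ElementTree as ET

# File system class name as quoted in the thread.
GCS_FS_CLASS = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"


def build_core_site(project_id, system_bucket):
    """Return core-site.xml text holding the three GCS connector properties.

    build_core_site is a hypothetical helper, not part of Spark, Hadoop,
    or the GCS connector.
    """
    props = [
        ("fs.gs.impl", GCS_FS_CLASS),
        ("fs.gs.project.id", project_id),
        ("fs.gs.system.bucket", system_bucket),
    ]
    root = ET.Element("configuration")
    for name, value in props:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholders from the thread; replace them with real values.
    print(build_core_site("YOUR_PROJECT_ID", "YOUR_FAVORITE_BUCKET"))
```

The printed XML still has to end up on the classpath of both the driver and the workers, as described above; this sketch does not handle that step.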