Thanks, Andras. What approach did you use to set up a Spark cluster on Google Compute Engine? Currently, there is no production-ready official support for an equivalent of spark-ec2 on GCE. Did you roll your own?

On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth <andras.nem...@lynxanalytics.com> wrote:

> Hello!
>
> On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia <buendia...@gmail.com> wrote:
>
>> Hi,
>>
>> Google has published a new connector for hadoop: google cloud storage,
>> which is an equivalent of amazon s3:
>>
>> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>>
> This is actually about Cloud Datastore and not Cloud Storage (yeah, quite
> confusing naming ;) ). But they do already have for a while a cloud storage
> connector, also linked from your article:
> https://developers.google.com/hadoop/google-cloud-storage-connector
>
>> How can spark be configured to use this connector?
>>
> Yes, it can, but in a somewhat hacky way. The problem is that for some
> reason Google does not officially publish the library jar alone, you get it
> installed as part of a Hadoop on Google Cloud installation. So, the
> official way would be (we did not try that) to have a Hadoop on Google
> Cloud installation and run spark on top of that.
>
> The other option - that we did try and which works fine for us - is to
> snatch the jar:
> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar,
> make sure it's shipped to your workers (e.g. with setJars on SparkConf when
> you create your SparkContext). Then create a core-site.xml file which you
> make sure is on the classpath both in your driver and your cluster (e.g.
> you can make sure it ends up in one of the jars you send with setJars
> above) with this content (with YOUR_* replaced):
>
> <configuration>
>   <property><name>fs.gs.impl</name><value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value></property>
>   <property><name>fs.gs.project.id</name><value>YOUR_PROJECT_ID</value></property>
>   <property><name>fs.gs.system.bucket</name><value>YOUR_FAVORITE_BUCKET</value></property>
> </configuration>
>
> From this point on you can simply use gs://... filenames to read/write
> data on Cloud Storage.
>
> Note that you should run your cluster and driver program on Google Compute
> Engine for this to work as is. Probably it's possible to configure access
> from the outside too but we didn't do that.
>
> Hope this helps,
> Andras
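
For anyone who wants to script the core-site.xml step from the quoted message, here is a minimal sketch in Python. It only generates the file content; the three property names and the GoogleHadoopFileSystem class name are taken from Andras's message, while the function name build_core_site and the choice of Python are my own assumptions. It does not verify that the connector accepts these settings.

```python
import xml.etree.ElementTree as ET


def build_core_site(project_id: str, system_bucket: str) -> str:
    """Return core-site.xml text holding the three properties from the quoted message.

    build_core_site is a hypothetical helper name, not part of any library.
    """
    properties = {
        "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "fs.gs.project.id": project_id,
        "fs.gs.system.bucket": system_bucket,
    }
    root = ET.Element("configuration")
    for name, value in properties.items():
        # Each setting becomes <property><name>..</name><value>..</value></property>.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values; replace them with your own project ID and bucket.
    print(build_core_site("YOUR_PROJECT_ID", "YOUR_FAVORITE_BUCKET"))
```

The printed text would be saved as core-site.xml and placed on the classpath of the driver and the workers, for example inside one of the jars passed to setJars as described in the quoted message.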