
On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia <buendia...@gmail.com> wrote:

> Hi,
> Google has published a new connector for hadoop: google cloud storage,
> which is an equivalent of amazon s3:
> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
This is actually about Cloud Datastore and not Cloud Storage (yeah, quite
confusing naming ;) ). But they have had a Cloud Storage connector for a
while already, also linked from your article:

> How can spark be configured to use this connector?
Yes, it can, but in a somewhat hacky way. The problem is that for some
reason Google does not officially publish the library jar on its own; you
get it installed as part of a Hadoop on Google Cloud installation. So the
official way would be (we did not try that) to have a Hadoop on Google
Cloud installation and run Spark on top of that.

The other option - which we did try and which works fine for us - is to
grab the jar from
https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar
and make sure it is shipped to your workers (e.g. with setJars on SparkConf
when you create your SparkContext). Then create a core-site.xml file and
make sure it is on the classpath of both your driver and your cluster (e.g.
by putting it in one of the jars you send with setJars above), with
this content (with YOUR_* replaced):
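
As a rough illustration of what such a file might look like, here is a
small Python sketch that generates a core-site.xml and packs it into a jar
that could be passed to setJars. The property names (fs.gs.impl,
fs.gs.project.id, fs.gs.system.bucket), the file system class name, and
both helper functions are assumptions for illustration, not taken from
this message; check them against the documentation of the connector
version you actually use.

```python
import xml.etree.ElementTree as ET
import zipfile


def build_core_site(project_id, system_bucket):
    """Return core-site.xml text that maps gs:// to the GCS connector.

    The property names and class name below are assumptions; verify them
    against the connector's own documentation.
    """
    props = {
        "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "fs.gs.project.id": project_id,
        "fs.gs.system.bucket": system_bucket,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def write_config_jar(jar_path, xml_text):
    """Store core-site.xml at the root of a jar (a jar is a zip archive),
    so the file is on the classpath wherever the jar is loaded."""
    with zipfile.ZipFile(jar_path, "w") as jar:
        jar.writestr("core-site.xml", xml_text)
```

Calling build_core_site("YOUR_PROJECT_ID", "YOUR_BUCKET") and passing the
result to write_config_jar("gcs-conf.jar", ...) would give you a jar to list
next to the connector jar in setJars.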



From this point on you can simply use gs://... filenames to read/write data
on Cloud Storage.

Note that you should run your cluster and driver program on Google Compute
Engine for this to work as is. It is probably possible to configure access
from outside too, but we did not try that.

Hope this helps,
