Re: Using google cloud storage for spark big data

Mayur Rustagi Mon, 21 Apr 2014 13:05:40 -0700

Okay just commented on another thread :)
I have one that I use internally. Can give it out but will need some
support from you to fix bugs etc. Let me know if you are interested.


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Fri, Apr 18, 2014 at 9:08 PM, Aureliano Buendia <buendia...@gmail.com> wrote:

> Thanks, Andras. What approach did you use to set up a spark cluster on
> google compute engine? Currently, there is no production-ready official
> support for an equivalent of spark-ec2 on gce. Did you roll your own?
>
>
> On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth <
> andras.nem...@lynxanalytics.com> wrote:
>
>> Hello!
>>
>> On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia 
>> <buendia...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Google has published a new connector for hadoop: google cloud storage,
>>> which is an equivalent of amazon s3:
>>>
>>>
>>> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>>>
>> This is actually about Cloud Datastore and not Cloud Storage (yeah, quite
>> confusing naming ;) ). But they have had a Cloud Storage connector for a
>> while, also linked from your article:
>> https://developers.google.com/hadoop/google-cloud-storage-connector
>>
>>
>>>
>>>
>>> How can spark be configured to use this connector?
>>>
>> Yes, it can, but in a somewhat hacky way. The problem is that for some
>> reason Google does not officially publish the library jar alone; you get it
>> installed as part of a Hadoop on Google Cloud installation. So, the
>> official way would be (we did not try that) to have a Hadoop on Google
>> Cloud installation and run spark on top of that.
>>
>> The other option - that we did try and which works fine for us - is to
>> snatch the jar:
>> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar,
>> make sure it's shipped to your workers (e.g. with setJars on SparkConf when
>> you create your SparkContext). Then create a core-site.xml file which you
>> make sure is on the classpath both in your driver and your cluster (e.g.
>> you can make sure it ends up in one of the jars you send with setJars
>> above) with this content (with YOUR_* replaced):
>> <configuration>
>>   <property><name>fs.gs.impl</name><value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value></property>
>>   <property><name>fs.gs.project.id</name><value>YOUR_PROJECT_ID</value></property>
>>   <property><name>fs.gs.system.bucket</name><value>YOUR_FAVORITE_BUCKET</value></property>
>> </configuration>
>>
>> From this point on you can simply use gs://... filenames to read/write
>> data on Cloud Storage.
>>
>> Note that you should run your cluster and driver program on Google
>> Compute Engine for this to work as is. Probably it's possible to configure
>> access from the outside too but we didn't do that.
>>
>> Hope this helps,
>> Andras
>>
>>
>>
>>
>>
>
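
For anyone following Andras's recipe above, here is a minimal sketch (not something from the thread itself) of how the core-site.xml with the three fs.gs.* properties could be generated and packed into a small jar, so that it lands on the classpath when that jar is shipped with setJars. It uses only the Python standard library. The function names, the jar file name and the placeholder project/bucket values are all made up for illustration; the property names and the GoogleHadoopFileSystem class name are the ones quoted in the message.

```python
import xml.etree.ElementTree as ET
import zipfile


def build_core_site(project_id, system_bucket):
    """Return core-site.xml text holding the three GCS connector properties.

    Hypothetical helper; the property names are taken from the quoted message.
    """
    props = {
        "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "fs.gs.project.id": project_id,
        "fs.gs.system.bucket": system_bucket,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def write_config_jar(jar_path, core_site_xml):
    """Write a jar whose only entry is core-site.xml at the archive root.

    A jar is a zip archive, and an entry at the root is visible as a
    classpath resource once the jar is on the classpath.
    """
    with zipfile.ZipFile(jar_path, "w", zipfile.ZIP_DEFLATED) as jar:
        jar.writestr("core-site.xml", core_site_xml)


if __name__ == "__main__":
    # Placeholder values; substitute your own project id and bucket.
    print(build_core_site("YOUR_PROJECT_ID", "YOUR_FAVORITE_BUCKET"))
```

Calling write_config_jar("gcs-conf.jar", ...) with that XML would give a second jar to list next to the connector jar in setJars; whether your deployment picks it up on the driver as well depends on how the driver classpath is set, so treat this as a starting point only.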
