Re: Using google cloud storage for spark big data

Akhil Das Mon, 05 May 2014 07:26:32 -0700

Hi Aureliano,

You might want to check this script out,
https://github.com/sigmoidanalytics/spark_gce
Let me know if you need any help around that.


Thanks
Best Regards


On Tue, Apr 22, 2014 at 7:12 PM, Aureliano Buendia <buendia...@gmail.com> wrote:

>
>
>
> On Tue, Apr 22, 2014 at 10:50 AM, Andras Nemeth <
> andras.nem...@lynxanalytics.com> wrote:
>
>> We don't have anything fancy. It's basically a very thin layer of
>> Google specifics on top of a standalone cluster. We created two
>> disk snapshots, one for the master and one for the workers. The snapshots
>> contain initialization scripts so that the master/worker daemons are
>> started on boot. So if I want a cluster I just create a new instance (with
>> a fixed name) using the master snapshot for the master. When it is up I
>> start as many slave instances as I need using the slave snapshot. By the
>> time the machines are up the cluster is ready to be used.
>>
>>
> This sounds a lot simpler than the existing spark-ec2 script.
> Does the Google Compute Engine API make this simpler than the
> EC2 API does? Does your script do everything spark-ec2 does?
>
> Also, any plans to make this open source?
>
>
>> Andras
>>
>>
>>
>> On Mon, Apr 21, 2014 at 10:04 PM, Mayur Rustagi
>> <mayur.rust...@gmail.com> wrote:
>>
>>> Okay just commented on another thread :)
>>> I have one that I use internally. Can give it out but will need some
>>> support from you to fix bugs etc. Let me know if you are interested.
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>>
>>>
>>> On Fri, Apr 18, 2014 at 9:08 PM, Aureliano Buendia
>>> <buendia...@gmail.com> wrote:
>>>
>>>> Thanks, Andras. What approach did you use to set up a Spark cluster on
>>>> Google Compute Engine? Currently, there is no production-ready official
>>>> support for an equivalent of spark-ec2 on GCE. Did you roll your own?
>>>>
>>>>
>>>> On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth <
>>>> andras.nem...@lynxanalytics.com> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia <
>>>>> buendia...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Google has published a new connector for Hadoop: Google Cloud
>>>>>> Storage, which is an equivalent of Amazon S3:
>>>>>>
>>>>>>
>>>>>> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>>>>>>
>>>>> This is actually about Cloud Datastore and not Cloud Storage (yeah,
>>>>> quite confusing naming ;) ). But they have had a Cloud Storage
>>>>> connector for a while, also linked from your article:
>>>>> https://developers.google.com/hadoop/google-cloud-storage-connector
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> How can spark be configured to use this connector?
>>>>>>
>>>>> Yes, it can, but in a somewhat hacky way. The problem is that for some
>>>>> reason Google does not officially publish the library jar on its own; you
>>>>> get it installed as part of a Hadoop on Google Cloud installation. So the
>>>>> official way would be (we did not try that) to have a Hadoop on Google
>>>>> Cloud installation and run Spark on top of that.
>>>>>
>>>>> The other option, which we did try and which works fine for us, is to
>>>>> snatch the jar:
>>>>> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar
>>>>> and make sure it's shipped to your workers (e.g. with setJars on SparkConf
>>>>> when you create your SparkContext). Then create a core-site.xml file and
>>>>> make sure it is on the classpath of both your driver and your cluster (e.g.
>>>>> by putting it in one of the jars you send with setJars above), with this
>>>>> content (with YOUR_* replaced):
>>>>> <configuration>
>>>>>   <property>
>>>>>     <name>fs.gs.impl</name>
>>>>>     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
>>>>>   </property>
>>>>>   <property>
>>>>>     <name>fs.gs.project.id</name>
>>>>>     <value>YOUR_PROJECT_ID</value>
>>>>>   </property>
>>>>>   <property>
>>>>>     <name>fs.gs.system.bucket</name>
>>>>>     <value>YOUR_FAVORITE_BUCKET</value>
>>>>>   </property>
>>>>> </configuration>
>>>>>
>>>>> From this point on you can simply use gs://... filenames to read/write
>>>>> data on Cloud Storage.
>>>>>
>>>>> Note that you should run your cluster and driver program on Google
>>>>> Compute Engine for this to work as is. Probably it's possible to configure
>>>>> access from the outside too but we didn't do that.
>>>>>
>>>>> Hope this helps,
>>>>> Andras
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
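
As an illustrative sketch (not part of the original thread), the
core-site.xml that Andras describes above could be generated with Python's
standard library rather than written by hand. The function name
build_core_site is hypothetical; the three property names and the
GoogleHadoopFileSystem class name are taken from his message, and the
placeholder values must be replaced with your own.

```python
import xml.etree.ElementTree as ET


def build_core_site(project_id: str, system_bucket: str) -> str:
    """Return core-site.xml text holding the three properties quoted in the thread.

    build_core_site is a hypothetical helper name, not part of any library.
    """
    properties = {
        "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "fs.gs.project.id": project_id,
        "fs.gs.system.bucket": system_bucket,
    }
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        # Names and values are written without surrounding whitespace.
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values from the thread; substitute your own project and bucket.
    print(build_core_site("YOUR_PROJECT_ID", "YOUR_FAVORITE_BUCKET"))
```

The generated text would still need to be saved as core-site.xml and placed
on the driver and worker classpath, as the message above explains.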
