Re: Using google cloud storage for spark big data

2014-05-05 Thread Akhil Das
Hi Aureliano,

You might want to check this script out,
https://github.com/sigmoidanalytics/spark_gce
Let me know if you need any help around that.

Thanks
Best Regards


On Tue, Apr 22, 2014 at 7:12 PM, Aureliano Buendia buendia...@gmail.com wrote:




 On Tue, Apr 22, 2014 at 10:50 AM, Andras Nemeth 
 andras.nem...@lynxanalytics.com wrote:

 We don't have anything fancy. It's basically a very thin layer of
 Google specifics on top of a standalone cluster. We created two
 disk snapshots, one for the master and one for the workers. The snapshots
 contain initialization scripts so that the master/worker daemons are
 started on boot. So if I want a cluster I just create a new instance (with
 a fixed name) using the master snapshot for the master. When it is up I
 start as many slave instances as I need using the slave snapshot. By the
 time the machines are up the cluster is ready to be used.


 This sounds a lot simpler than the existing spark-ec2 script.
 Does the Google Compute Engine API make this possible in a simpler way
 than the EC2 API? Does your script do everything spark-ec2 does?

 Also, any plans to make this open source?


 Andras



 On Mon, Apr 21, 2014 at 10:04 PM, Mayur Rustagi 
 mayur.rust...@gmail.com wrote:

 Okay, just commented on another thread :)
 I have one that I use internally. I can give it out, but will need some
 support from you to fix bugs etc. Let me know if you are interested.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Fri, Apr 18, 2014 at 9:08 PM, Aureliano Buendia buendia...@gmail.com
  wrote:

 Thanks, Andras. What approach did you use to set up a Spark cluster on
 Google Compute Engine? Currently, there is no production-ready official
 support for an equivalent of spark-ec2 on GCE. Did you roll your own?


 On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth 
 andras.nem...@lynxanalytics.com wrote:

 Hello!

 On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia 
 buendia...@gmail.com wrote:

 Hi,

 Google has published a new connector for Hadoop: Google Cloud
 Storage, which is an equivalent of Amazon S3:


 googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html

 This is actually about Cloud Datastore and not Cloud Storage (yeah,
 quite confusing naming ;) ). But they have already had a Cloud Storage
 connector for a while, also linked from your article:
 https://developers.google.com/hadoop/google-cloud-storage-connector




 How can spark be configured to use this connector?

 It can be done, but in a somewhat hacky way. The problem is that, for some
 reason, Google does not officially publish the library jar on its own; it
 gets installed as part of a Hadoop on Google Cloud installation. So the
 official way would be (we did not try that) to have a Hadoop on Google
 Cloud installation and run Spark on top of that.

 The other option - the one we did try, and which works fine for us - is to
 snatch the jar
 (https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar)
 and make sure it's shipped to your workers (e.g. with setJars on SparkConf
 when you create your SparkContext). Then create a core-site.xml file and
 make sure it is on the classpath both in your driver and on your cluster
 (e.g. by making sure it ends up in one of the jars you send with setJars
 above), with this content (with YOUR_* replaced):
 <configuration>
   <property>
     <name>fs.gs.impl</name>
     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
   </property>
   <property>
     <name>fs.gs.project.id</name>
     <value>YOUR_PROJECT_ID</value>
   </property>
   <property>
     <name>fs.gs.system.bucket</name>
     <value>YOUR_FAVORITE_BUCKET</value>
   </property>
 </configuration>
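
 For reference, a rough sketch of the driver side (the jar paths and master
 URL below are placeholders, not taken from our actual setup) could look
 something like this:

   import org.apache.spark.{SparkConf, SparkContext}

   // Ship the GCS connector jar plus your application jar (which carries
   // core-site.xml at the root of its classpath) to the workers.
   val conf = new SparkConf()
     .setAppName("gcs-example")
     .setMaster("spark://your-master:7077")   // placeholder master URL
     .setJars(Seq(
       "/path/to/gcs-connector-1.2.4.jar",    // the jar snatched above
       "/path/to/your-app.jar"                // contains core-site.xml
     ))
   val sc = new SparkContext(conf)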

 From this point on you can simply use gs://... filenames to read/write
 data on Cloud Storage.
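
 For example (bucket and file names here are made up):

   // Read text files from a GCS bucket, transform them, and write back.
   val lines = sc.textFile("gs://YOUR_FAVORITE_BUCKET/input/*.txt")
   val upper = lines.map(_.toUpperCase)
   upper.saveAsTextFile("gs://YOUR_FAVORITE_BUCKET/output/upper")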

 Note that you should run your cluster and driver program on Google
 Compute Engine for this to work as is. It's probably possible to configure
 access from the outside as well, but we didn't do that.

 Hope this helps,
 Andras
