Re: Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
I'm using dataframes, types are all doubles and I'm only extracting what I
need.

The caveat on these is that I am porting an existing system for a client,
and for their business it's likely to be cheaper to throw hardware (in AWS)
at the problem for a couple of hours than to re-engineer their algorithms.

cheers


On 7 June 2016 at 21:54, Jörn Franke  wrote:

> Before hardware optimization there is always software optimization.
> Are you using dataset / dataframe? Are you using the  right data types (
> eg int where int is appropriate , try to avoid string and char etc)
> Do you extract only the stuff needed? What are the algorithm parameters?
>
> > On 07 Jun 2016, at 13:09, Franc Carter  wrote:
> >
> >
> > Hi,
> >
> > I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and
> am interested in how it might be best to scale it - e.g more cpus per
> instances, more memory per instance, more instances etc.
> >
> > I'm currently using 32 m3.xlarge instances for for a training set with
> 2.5 million rows, 1300 columns and a total size of 31GB (parquet)
> >
> > thanks
> >
> > --
> > Franc
>



-- 
Franc


Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
Hi,

I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and am
interested in how best to scale it - e.g. more CPUs per instance, more
memory per instance, more instances etc.

I'm currently using 32 m3.xlarge instances for a training set with 2.5
million rows, 1300 columns and a total size of 31GB (parquet).
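
For reference, a minimal sketch of this kind of training run with pyspark.ml on
Spark 1.6.x - the parquet path, the 'label' column name and the parameter values
are illustrative assumptions, not the actual job:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

sc = SparkContext(appName="rf-scaling-sketch")   # executor count/memory are set at submit time
sqlContext = SQLContext(sc)

# hypothetical input: ~2.5M rows of double columns plus a 'label' column
df = sqlContext.read.parquet("s3://my-bucket/training.parquet")

# assemble the feature columns into the single vector column pyspark.ml expects
feature_cols = [c for c in df.columns if c != "label"]
features = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

# algorithm parameters often matter as much as hardware: more/deeper trees cost more
rf = RandomForestRegressor(labelCol="label", featuresCol="features",
                           numTrees=100, maxDepth=10, maxBins=32)
model = rf.fit(features)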

thanks

-- 
Franc


Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
Thanks - I'll give that a try

cheers
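
A minimal sketch of the workaround Felix describes below - unpacking the
graphframes Python package out of the jar that --packages downloads, into the
directory pyspark is launched from. The ivy cache location and jar name pattern
are assumptions:

import glob
import os
import zipfile

# guess at where --packages leaves the resolved jar; adjust the path if needed
jars = glob.glob(os.path.expanduser("~/.ivy2/jars/*graphframes*.jar"))

with zipfile.ZipFile(jars[0]) as zf:
    for name in zf.namelist():
        # the Python sources ship inside the jar under graphframes/
        if name.startswith("graphframes/"):
            zf.extract(name, ".")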

On 20 March 2016 at 09:42, Felix Cheung  wrote:

> You are running pyspark in Spark client deploy mode. I have ran into the
> same error as well and I'm not sure if this is graphframes specific - the
> python process can't find the graphframes Python code when it is loaded as
> a Spark package.
>
> To workaround this, I extract the graphframes Python directory locally
> where I run pyspark into a directory called graphframes.
>
>
>
>
>
>
> On Thu, Mar 17, 2016 at 10:11 PM -0700, "Franc Carter" <
> franc.car...@gmail.com> wrote:
>
>
> I'm having trouble with that for pyspark, yarn and graphframes. I'm using:-
>
> pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5
>
> which starts and gives me a REPL, but when I try
>
>from graphframes import *
>
> I get
>
>   No module names graphframes
>
> without '--master yarn' it works as expected
>
> thanks
>
>
> On 18 March 2016 at 12:59, Felix Cheung  wrote:
>
> For some, like graphframes that are Spark packages, you could also use
> --packages in the command line of spark-submit or pyspark. See
> http://spark.apache.org/docs/latest/submitting-applications.html
>
> _
> From: Jakob Odersky 
> Sent: Thursday, March 17, 2016 6:40 PM
> Subject: Re: installing packages with pyspark
> To: Ajinkya Kale 
> Cc: 
>
>
>
> Hi,
> regarding 1, packages are resolved locally. That means that when you
> specify a package, spark-submit will resolve the dependencies and
> download any jars on the local machine, before shipping* them to the
> cluster. So, without a priori knowledge of dataproc clusters, it
> should be no different to specify packages.
>
> Unfortunatly I can't help with 2.
>
> --Jakob
>
> *shipping in this case means making them available via the network
>
> On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale 
> wrote:
> > Hi all,
> >
> > I had couple of questions.
> > 1. Is there documentation on how to add the graphframes or any other
> package
> > for that matter on the google dataproc managed spark clusters ?
> >
> > 2. Is there a way to add a package to an existing pyspark context
> through a
> > jupyter notebook ?
> >
> > --aj
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>
>
> --
> Franc
>



-- 
Franc


Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
I'm having trouble with that for pyspark, yarn and graphframes. I'm using:-

pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5

which starts and gives me a REPL, but when I try

   from graphframes import *

I get

  No module named graphframes

without '--master yarn' it works as expected

thanks


On 18 March 2016 at 12:59, Felix Cheung  wrote:

> For some, like graphframes that are Spark packages, you could also use
> --packages in the command line of spark-submit or pyspark. See
> http://spark.apache.org/docs/latest/submitting-applications.html
>
> _
> From: Jakob Odersky 
> Sent: Thursday, March 17, 2016 6:40 PM
> Subject: Re: installing packages with pyspark
> To: Ajinkya Kale 
> Cc: 
>
>
>
> Hi,
> regarding 1, packages are resolved locally. That means that when you
> specify a package, spark-submit will resolve the dependencies and
> download any jars on the local machine, before shipping* them to the
> cluster. So, without a priori knowledge of dataproc clusters, it
> should be no different to specify packages.
>
> Unfortunatly I can't help with 2.
>
> --Jakob
>
> *shipping in this case means making them available via the network
>
> On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale 
> wrote:
> > Hi all,
> >
> > I had couple of questions.
> > 1. Is there documentation on how to add the graphframes or any other
> package
> > for that matter on the google dataproc managed spark clusters ?
> >
> > 2. Is there a way to add a package to an existing pyspark context
> through a
> > jupyter notebook ?
> >
> > --aj
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>


-- 
Franc


Re: filter by dict() key in pySpark

2016-02-24 Thread Franc Carter
A colleague found how to do this; the approach was to use a udf().

cheers
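
A minimal sketch of the udf() approach, assuming DF1 has a map-typed column
called 'params' as in the original question:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# returns True when the map/dict column contains the key 'name'
has_name = udf(lambda params: params is not None and 'name' in params, BooleanType())

DF2 = DF1.filter(has_name(DF1['params']))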

On 21 February 2016 at 22:41, Franc Carter  wrote:

>
> I have a DataFrame that has a Python dict() as one of the columns. I'd
> like to filter he DataFrame for those Rows that where the dict() contains a
> specific value. e.g something like this:-
>
> DF2 = DF1.filter('name' in DF1.params)
>
> but that gives me this error
>
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
>
> How do I express this correctly ?
>
> thanks
>
> --
> Franc
>



-- 
Franc


filter by dict() key in pySpark

2016-02-21 Thread Franc Carter
I have a DataFrame that has a Python dict() as one of the columns. I'd like
to filter the DataFrame for those Rows where the dict() contains a
specific value, e.g. something like this:-

DF2 = DF1.filter('name' in DF1.params)

but that gives me this error

ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
for 'or', '~' for 'not' when building DataFrame boolean expressions.

How do I express this correctly ?

thanks

-- 
Franc


Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
Yes, I didn't work out how to solve that - sorry


On 3 February 2016 at 22:37, Devesh Raj Singh 
wrote:

> Hi,
>
> but "withColumn" will only add once, if i want to add columns to the same
> dataframe in a loop it will keep overwriting the added column and in the
> end the last added column( in the loop) will be the added column. like in
> my code above.
>
> On Wed, Feb 3, 2016 at 5:05 PM, Franc Carter 
> wrote:
>
>>
>> I had problems doing this as well - I ended up using 'withColumn', it's
>> not particularly graceful but it worked (1.5.2 on AWS EMR)
>>
>> cheerd
>>
>> On 3 February 2016 at 22:06, Devesh Raj Singh 
>> wrote:
>>
>>> Hi,
>>>
>>> i am trying to create dummy variables in sparkR by creating new columns
>>> for categorical variables. But it is not appending the columns
>>>
>>>
>>> df <- createDataFrame(sqlContext, iris)
>>> class(dtypes(df))
>>>
>>> cat.column<-vector(mode="character",length=nrow(df))
>>> cat.column<-collect(select(df,df$Species))
>>> lev<-length(levels(as.factor(unlist(cat.column))))
>>> varb.names<-vector(mode="character",length=lev)
>>> for (i in 1:lev){
>>>
>>>   varb.names[i]<-paste0(colnames(cat.column),i)
>>>
>>> }
>>>
>>> for (j in 1:lev)
>>>
>>> {
>>>
>>>dummy.df.new <- withColumn(df, paste0(colnames(cat.column), j),
>>>  ifelse(df$Species == levels(as.factor(unlist(cat.column)))[j], 1, 0))
>>>
>>> }
>>>
>>> I am getting the below output for
>>>
>>> head(dummy.df.new)
>>>
>>> output:
>>>
>>>   Sepal_Length Sepal_Width Petal_Length Petal_Width Species Species1
>>> 1  5.1 3.5  1.4 0.2  setosa1
>>> 2  4.9 3.0  1.4 0.2  setosa1
>>> 3  4.7 3.2  1.3 0.2  setosa1
>>> 4  4.6 3.1  1.5 0.2  setosa1
>>> 5  5.0 3.6  1.4 0.2  setosa1
>>> 6  5.4 3.9  1.7 0.4  setosa1
>>>
>>> Problem: Species2 and Species3 column are not getting added to the
>>> dataframe
>>>
>>> --
>>> Warm regards,
>>> Devesh.
>>>
>>
>>
>>
>> --
>> Franc
>>
>
>
>
> --
> Warm regards,
> Devesh.
>



-- 
Franc


Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
I had problems doing this as well - I ended up using 'withColumn', it's not
particularly graceful but it worked (1.5.2 on AWS EMR)

cheers

On 3 February 2016 at 22:06, Devesh Raj Singh 
wrote:

> Hi,
>
> i am trying to create dummy variables in sparkR by creating new columns
> for categorical variables. But it is not appending the columns
>
>
> df <- createDataFrame(sqlContext, iris)
> class(dtypes(df))
>
> cat.column<-vector(mode="character",length=nrow(df))
> cat.column<-collect(select(df,df$Species))
> lev<-length(levels(as.factor(unlist(cat.column))))
> varb.names<-vector(mode="character",length=lev)
> for (i in 1:lev){
>
>   varb.names[i]<-paste0(colnames(cat.column),i)
>
> }
>
> for (j in 1:lev)
>
> {
>
>dummy.df.new <- withColumn(df, paste0(colnames(cat.column), j),
>  ifelse(df$Species == levels(as.factor(unlist(cat.column)))[j], 1, 0))
>
> }
>
> I am getting the below output for
>
> head(dummy.df.new)
>
> output:
>
>   Sepal_Length Sepal_Width Petal_Length Petal_Width Species Species1
> 1  5.1 3.5  1.4 0.2  setosa1
> 2  4.9 3.0  1.4 0.2  setosa1
> 3  4.7 3.2  1.3 0.2  setosa1
> 4  4.6 3.1  1.5 0.2  setosa1
> 5  5.0 3.6  1.4 0.2  setosa1
> 6  5.4 3.9  1.7 0.4  setosa1
>
> Problem: Species2 and Species3 column are not getting added to the
> dataframe
>
> --
> Warm regards,
> Devesh.
>



-- 
Franc


Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
Thanks

cheers
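
A minimal sketch of the window-function approach suggested below, assuming a
DataFrame df with the ID, Year and Value columns from the example:

from pyspark.sql import Window
from pyspark.sql.functions import col, lag

# previous year's Value within each ID, ordered by Year
w = Window.partitionBy("ID").orderBy("Year")

deltas = (df.withColumn("Value", col("Value") - lag("Value", 1).over(w))
            .where(col("Value").isNotNull())   # drop each ID's first year (no previous row)
            .select("ID", "Year", "Value"))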

On 10 January 2016 at 22:35, Blaž Šnuderl  wrote:

> This can be done using spark.sql and window functions. Take a look at
> https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
>
> On Sun, Jan 10, 2016 at 11:07 AM, Franc Carter 
> wrote:
>
>>
>> Sure, for a dataframe that looks like this
>>
>> ID Year Value
>>  1 2012   100
>>  1 2013   102
>>  1 2014   106
>>  2 2012   110
>>  2 2013   118
>>  2 2014   128
>>
>> I'd like to get back
>>
>> ID Year Value
>>  1 2013 2
>>  1 2014 4
>>  2 2013 8
>>  2 201410
>>
>> i.e the Value for an ID,Year combination is the Value for the ID,Year
>> minus the Value for the ID,Year-1
>>
>> thanks
>>
>>
>>
>>
>>
>>
>> On 10 January 2016 at 20:51, Femi Anthony  wrote:
>>
>>> Can you clarify what you mean with an actual example ?
>>>
>>> For example, if your data frame looks like this:
>>>
>>> ID  Year   Value
>>> 12012   100
>>> 22013   101
>>> 32014   102
>>>
>>> What's your desired output ?
>>>
>>> Femi
>>>
>>>
>>> On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter 
>>> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> I have a DataFrame with the columns
>>>>
>>>>  ID,Year,Value
>>>>
>>>> I'd like to create a new Column that is Value2-Value1 where the
>>>> corresponding Year2=Year-1
>>>>
>>>> At the moment I am creating  a new DataFrame with renamed columns and
>>>> doing
>>>>
>>>>DF.join(DF2, . . . .)
>>>>
>>>>  This looks cumbersome to me, is there abtter way ?
>>>>
>>>> thanks
>>>>
>>>>
>>>> --
>>>> Franc
>>>>
>>>
>>>
>>>
>>> --
>>> http://www.femibyte.com/twiki5/bin/view/Tech/
>>> http://www.nextmatrix.com
>>> "Great spirits have always encountered violent opposition from mediocre
>>> minds." - Albert Einstein.
>>>
>>
>>
>>
>> --
>> Franc
>>
>
>


-- 
Franc


Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
Sure, for a dataframe that looks like this

ID Year Value
 1 2012   100
 1 2013   102
 1 2014   106
 2 2012   110
 2 2013   118
 2 2014   128

I'd like to get back

ID Year Value
 1 2013 2
 1 2014 4
 2 2013 8
 2 201410

i.e the Value for an ID,Year combination is the Value for the ID,Year minus
the Value for the ID,Year-1

thanks






On 10 January 2016 at 20:51, Femi Anthony  wrote:

> Can you clarify what you mean with an actual example ?
>
> For example, if your data frame looks like this:
>
> ID  Year   Value
> 12012   100
> 22013   101
> 32014   102
>
> What's your desired output ?
>
> Femi
>
>
> On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter 
> wrote:
>
>>
>> Hi,
>>
>> I have a DataFrame with the columns
>>
>>  ID,Year,Value
>>
>> I'd like to create a new Column that is Value2-Value1 where the
>> corresponding Year2=Year-1
>>
>> At the moment I am creating  a new DataFrame with renamed columns and
>> doing
>>
>>DF.join(DF2, . . . .)
>>
>>  This looks cumbersome to me, is there abtter way ?
>>
>> thanks
>>
>>
>> --
>> Franc
>>
>
>
>
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>



-- 
Franc


Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
Got it, I needed to use the when/otherwise construct - code below

from pyspark.sql.functions import next_day, datediff, when

def getSunday(day):
    day = day.cast("date")
    sun = next_day(day, "Sunday")
    n = datediff(sun, day)
    x = when(n == 7, day).otherwise(sun)
    return x
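
A quick usage sketch - the DataFrame and its 'day' column are assumptions for
illustration, not from the thread:

# hypothetical: add a week-ending column to a DataFrame that has a 'day' date column
df2 = df.withColumn("week_ending", getSunday(df["day"]))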


On 10 January 2016 at 08:41, Franc Carter  wrote:

>
> My Python is not particularly good, so I'm afraid I don't understand what
> that mean
>
> cheers
>
>
> On 9 January 2016 at 14:45, Franc Carter  wrote:
>
>>
>> Hi,
>>
>> I'm trying to write a short function that returns the last sunday of the
>> week of a given date, code below
>>
>> def getSunday(day):
>>
>> day = day.cast("date")
>>
>> sun = next_day(day, "Sunday")
>>
>> n = datediff(sun,day)
>>
>> if (n == 7):
>>
>> return day
>>
>> else:
>>
>> return sun
>>
>>
>> this gives me
>>
>> ValueError: Cannot convert column into bool:
>>
>>
>> Can someone point out what I am doing wrong
>>
>> thanks
>>
>>
>> --
>> Franc
>>
>
>
>
> --
> Franc
>



-- 
Franc


pyspark: calculating row deltas

2016-01-09 Thread Franc Carter
Hi,

I have a DataFrame with the columns

 ID,Year,Value

I'd like to create a new Column that is Value2-Value1 where the
corresponding Year2=Year-1

At the moment I am creating a new DataFrame with renamed columns and doing

   DF.join(DF2, . . . .)

 This looks cumbersome to me, is there a better way?

thanks


-- 
Franc


Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
My Python is not particularly good, so I'm afraid I don't understand what
that means.

cheers


On 9 January 2016 at 14:45, Franc Carter  wrote:

>
> Hi,
>
> I'm trying to write a short function that returns the last sunday of the
> week of a given date, code below
>
> def getSunday(day):
>
> day = day.cast("date")
>
> sun = next_day(day, "Sunday")
>
> n = datediff(sun,day)
>
> if (n == 7):
>
> return day
>
> else:
>
> return sun
>
>
> this gives me
>
> ValueError: Cannot convert column into bool:
>
>
> Can someone point out what I am doing wrong
>
> thanks
>
>
> --
> Franc
>



-- 
Franc


pyspark: conditionals inside functions

2016-01-08 Thread Franc Carter
Hi,

I'm trying to write a short function that returns the last Sunday of the
week of a given date, code below

def getSunday(day):
    day = day.cast("date")
    sun = next_day(day, "Sunday")
    n = datediff(sun, day)
    if (n == 7):
        return day
    else:
        return sun


this gives me

ValueError: Cannot convert column into bool:


Can someone point out what I am doing wrong

thanks


-- 
Franc


Re: number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
Thanks, that works

cheers

On 26 December 2015 at 16:53, Felix Cheung 
wrote:

> The equivalent for spark-submit --num-executors should be
> spark.executor.instances
> When use in SparkConf?
> http://spark.apache.org/docs/latest/running-on-yarn.html
>
> Could you try setting that with sparkR.init()?
>
>
> _________
> From: Franc Carter 
> Sent: Friday, December 25, 2015 9:23 PM
> Subject: number of executors in sparkR.init()
> To: 
>
>
>
> Hi,
>
> I'm having trouble working out how to get the number of executors set when
> using sparkR.init().
>
> If I start sparkR with
>
>   sparkR  --master yarn --num-executors 6
>
> then I get 6 executors
>
> However, if start sparkR with
>
>   sparkR
>
> followed by
>
>   sc <- sparkR.init(master="yarn-client",
> sparkEnvir=list(spark.num.executors='6'))
>
> then I only get 2 executors.
>
> Can anyone point me in the direction of what I might doing wrong ? I need
> to initialise this was so that rStudio can hook in to SparkR
>
> thanks
>
> --
> Franc
>
>
>


-- 
Franc


number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
Hi,

I'm having trouble working out how to get the number of executors set when
using sparkR.init().

If I start sparkR with

  sparkR  --master yarn --num-executors 6

then I get 6 executors

However, if I start sparkR with

  sparkR

followed by

  sc <- sparkR.init(master="yarn-client",
sparkEnvir=list(spark.num.executors='6'))

then I only get 2 executors.

Can anyone point me in the direction of what I might be doing wrong? I need
to initialise this way so that RStudio can hook in to SparkR.

thanks

-- 
Franc


Re: SparkR csv without headers

2015-08-20 Thread Franc Carter
Thanks - works nicely

cheers

On Fri, Aug 21, 2015 at 12:43 PM, Sun, Rui  wrote:

> Hi,
>
>
>
> You can create a DataFrame using load.df() with a specified schema.
>
>
>
> Something like:
>
> schema <- structType(structField(“a”, “string”), structField(“b”,
> integer), …)
>
> read.df ( …, schema = schema)
>
>
>
> *From:* Franc Carter [mailto:franc.car...@rozettatech.com]
> *Sent:* Wednesday, August 19, 2015 1:48 PM
> *To:* user@spark.apache.org
> *Subject:* SparkR csv without headers
>
>
>
>
>
> Hi,
>
>
>
> Does anyone have an example of how to create a DataFrame in SparkR  which
> specifies the column names - the csv files I have do not have column names
> in the first row. I can get read a csv nicely with
> com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2,
> C3 etc
>
>
>
>
>
> thanks
>
>
>
> --
>
> *Franc Carter* | Systems Architect | RoZetta Technology
>
> L4. 55 Harrington Street,  THE ROCKS,  NSW, 2000
>
> PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA
>
> *T*  +61 2 8355 2515 | www.rozettatechnology.com
>
>
>
>
>



-- 

*Franc Carter* | Systems Architect | RoZetta Technology

L4. 55 Harrington Street,  THE ROCKS,  NSW, 2000

PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA

*T*  +61 2 8355 2515 | www.rozettatechnology.com


SparkR csv without headers

2015-08-18 Thread Franc Carter
Hi,

Does anyone have an example of how to create a DataFrame in SparkR which
specifies the column names - the csv files I have do not have column names
in the first row. I can read a csv nicely with
com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2,
C3 etc


thanks

-- 

*Franc Carter* | Systems Architect | RoZetta Technology

L4. 55 Harrington Street,  THE ROCKS,  NSW, 2000

PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA

*T*  +61 2 8355 2515 | www.rozettatechnology.com


subscribe

2015-08-05 Thread Franc Carter
subscribe


Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Franc Carter
One issue is that 'big' becomes 'not so big' reasonably quickly. A couple
of TeraBytes is not that challenging (depending on the algorithm) these
days, whereas 5 years ago it was a big challenge. We have a bit over a
PetaByte (not using Spark) and using a distributed system is the only
viable way to get reasonable performance for reasonable cost.

cheers

On Tue, Mar 31, 2015 at 4:55 AM, Steve Loughran 
wrote:

>
>  On 30 Mar 2015, at 13:27, jay vyas  wrote:
>
>  Just the same as spark was disrupting the hadoop ecosystem by changing
> the assumption that "you can't rely on memory in distributed
> analytics"...now maybe we are challenging the assumption that "big data
> analytics need to distributed"?
>
> I've been asking the same question lately and seen similarly that spark
> performs quite reliably and well on local single node system even for an
> app which I ran for a streaming app which I ran for ten days in a row...  I
> almost felt guilty that I never put it on a cluster!
>
>
>  Modern machines can be pretty powerful: 16 physical cores HT'd to 32,
> 384+MB, GPU, giving you lots of compute. What you don't get is the storage
> capacity to match, and especially, the IO bandwidth. RAID-0 striping 2-4
> HDDs gives you some boost, but if you are reading, say, a 4 GB file from
> HDFS broken in to 256MB blocks, you have that data  replicated into (4*4*3)
> blocks: 48. Algorithm and capacity permitting, you've just massively
> boosted your load time. Downstream, if data can be thinned down, then you
> can start looking more at things you can do on a single host : a machine
> that can be in your Hadoop cluster. Ask YARN nicely and you can get a
> dedicated machine for a couple of days (i.e. until your Kerberos tokens
> expire).
>
>


-- 

*Franc Carter* | Systems Architect | RoZetta Technology

L4. 55 Harrington Street,  THE ROCKS,  NSW, 2000

PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA

*T*  +61 2 8355 2515 | www.rozettatechnology.com


Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Franc Carter
ise$DefaultPromise.ready(Promise.scala:219)
> >> at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> >> at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> >> at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> >> at scala.concurrent.Await$.result(package.scala:107)
> >> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
> >> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
> >> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
> >> ... 7 more
> >>
> >>
> >>
> >>
> >> -Original Message-
> >> From: Patrick Wendell [mailto:pwend...@gmail.com]
> >> Sent: Monday, February 23, 2015 12:17 AM
> >> To: Oleg Shirokikh
> >> Subject: Re: Submitting jobs to Spark EC2 cluster remotely
> >>
> >> The reason is that the file needs to be in a globally visible
> >> filesystem where the master node can download. So it needs to be on
> >> s3, for instance, rather than on your local filesystem.
> >>
> >> - Patrick
> >>
> >> On Sun, Feb 22, 2015 at 11:55 PM, olegshirokikh 
> wrote:
> >>> I've set up the EC2 cluster with Spark. Everything works, all
> >>> master/slaves are up and running.
> >>>
> >>> I'm trying to submit a sample job (SparkPi). When I ssh to cluster
> >>> and submit it from there - everything works fine. However when
> >>> driver is created on a remote host (my laptop), it doesn't work.
> >>> I've tried both modes for
> >>> `--deploy-mode`:
> >>>
> >>> **`--deploy-mode=client`:**
> >>>
> >>> From my laptop:
> >>>
> >>> ./bin/spark-submit --master
> >>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077
> >>> --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
> >>>
> >>> Results in the following indefinite warnings/errors:
> >>>
> >>>>  WARN TaskSchedulerImpl: Initial job has not accepted any
> >>>> resources; check your cluster UI to ensure that workers are
> >>>> registered and have sufficient memory 15/02/22 18:30:45
> >>>
> >>>> ERROR SparkDeploySchedulerBackend: Asked to remove non-existent
> >>>> executor 0
> >>>> 15/02/22 18:30:45
> >>>
> >>>> ERROR SparkDeploySchedulerBackend: Asked to remove non-existent
> >>>> executor 1
> >>>
> >>> ...and failed drivers - in Spark Web UI "Completed Drivers" with
> >>> "State=ERROR" appear.
> >>>
> >>> I've tried to pass limits for cores and memory to submit script but
> >>> it didn't help...
> >>>
> >>> **`--deploy-mode=cluster`:**
> >>>
> >>> From my laptop:
> >>>
> >>> ./bin/spark-submit --master
> >>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077
> >>> --deploy-mode cluster --class SparkPi
> >>> ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
> >>>
> >>> The result is:
> >>>
> >>>>  Driver successfully submitted as driver-20150223023734-0007 ...
> >>>> waiting before polling master for driver state ... polling master
> >>>> for driver state State of driver-20150223023734-0007 is ERROR
> >>>> Exception from cluster was: java.io.FileNotFoundException: File
> >>>> file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.1
> >>>> 0 -0.0.1.jar does not exist. java.io.FileNotFoundException: File
> >>>>
> file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
> >>>> does not exist.   at
> >>>>
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
> >>>>   at
> >>>>
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
> >>>>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
> at
> >>>> org.apache.spark.deploy.worker.DriverRunner.org
> $apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
> >>>>   at
> >>>> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunne
> >>>> r
> >>>> .scala:75)
> >>>
> >>>  So, I'd appreciate any pointers on what is going wrong and some
> >>> guidance how to deploy jobs from remote client. Thanks.
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>> http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-jobs-
> >>> t o-Spark-EC2-cluster-remotely-tp21762.html
> >>> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >>>
> >>> 
> >>> - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
> >>> additional commands, e-mail: user-h...@spark.apache.org
> >>>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

*Franc Carter* | Systems Architect | RoZetta Technology

L4. 55 Harrington Street,  THE ROCKS,  NSW, 2000

PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA

*T*  +61 2 8355 2515 | www.rozettatechnology.com


Re: spark, reading from s3

2015-02-12 Thread Franc Carter
Check that your timezone is correct as well; an incorrect timezone can make
it look like your time is correct when it is actually skewed.

cheers

On Fri, Feb 13, 2015 at 5:51 AM, Kane Kim  wrote:

> The thing is that my time is perfectly valid...
>
> On Tue, Feb 10, 2015 at 10:50 PM, Akhil Das 
> wrote:
>
>> Its with the timezone actually, you can either use an NTP to maintain
>> accurate system clock or you can adjust your system time to match with the
>> AWS one. You can do it as:
>>
>> telnet s3.amazonaws.com 80
>> GET / HTTP/1.0
>>
>>
>> [image: Inline image 1]
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Feb 11, 2015 at 6:43 AM, Kane Kim  wrote:
>>
>>> I'm getting this warning when using s3 input:
>>> 15/02/11 00:58:37 WARN RestStorageService: Adjusted time offset in
>>> response to
>>> RequestTimeTooSkewed error. Local machine and S3 server disagree on the
>>> time by approximately 0 seconds. Retrying connection.
>>>
>>> After that there are tons of 403/forbidden errors and then job fails.
>>> It's sporadic, so sometimes I get this error and sometimes not, what
>>> could be the issue?
>>> I think it could be related to network connectivity?
>>>
>>
>>
>


-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  |
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
I forgot to mention that if you do decide to use Cassandra I'd highly
recommend jumping on the Cassandra mailing list; if we had taken in some of
the advice on that list things would have been considerably smoother.

cheers

On Wed, Feb 11, 2015 at 8:12 PM, Christian Betz <
christian.b...@performance-media.de> wrote:

>   Hi
>
>  Regarding the Cassandra Data model, there's an excellent post on the
> ebay tech blog:
> http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/.
> There's also a slideshare for this somewhere.
>
>  Happy hacking
>
>  Chris
>
>   From: Franc Carter 
> Date: Wednesday, 11 February 2015 10:03
> To: Paolo Platter 
> Cc: Mike Trienis , "user@spark.apache.org" <
> user@spark.apache.org>
> Subject: Re: Datastore HDFS vs Cassandra
>
>
> One additional comment I would make is that you should be careful with
> Updates in Cassandra, it does support them but large amounts of Updates
> (i.e changing existing keys) tends to cause fragmentation. If you are
> (mostly) adding new keys (e.g new records in the the time series) then
> Cassandra can be excellent
>
>  cheers
>
>
> On Wed, Feb 11, 2015 at 6:13 PM, Paolo Platter 
> wrote:
>
>>   Hi Mike,
>>
>> I developed a Solution with cassandra and spark, using DSE.
>> The main difficult is about cassandra, you need to understand very well
>> its data model and its Query patterns.
>> Cassandra has better performance than hdfs and it has DR and stronger
>> availability.
>> Hdfs is a filesystem, cassandra is a dbms.
>> Cassandra supports full CRUD without acid.
>> Hdfs is more flexible than cassandra.
>>
>> In my opinion, if you have a real time series, go with Cassandra paying
>> attention at your reporting data access patterns.
>>
>> Paolo
>>
>> Sent from my Windows Phone
>>  --
>> From: Mike Trienis 
>> Sent: 11/02/2015 05:59
>> To: user@spark.apache.org
>> Subject: Datastore HDFS vs Cassandra
>>
>>   Hi,
>>
>> I am considering implement Apache Spark on top of Cassandra database after
>> listing to related talk and reading through the slides from DataStax. It
>> seems to fit well with our time-series data and reporting requirements.
>>
>>
>> http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data
>>
>> Does anyone have any experiences using Apache Spark and Cassandra,
>> including
>> limitations (and or) technical difficulties? How does Cassandra compare
>> with
>> HDFS and what use cases would make HDFS more suitable?
>>
>> Thanks, Mike.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Datastore-HDFS-vs-Cassandra-tp21590.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -----
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
>  --
>
> *Franc Carter* | Systems Architect | Rozetta Technology
>
> franc.car...@rozettatech.com  |
> www.rozettatechnology.com
>
> Tel: +61 2 8355 2515
>
> Level 4, 55 Harrington St, The Rocks NSW 2000
>
> PO Box H58, Australia Square, Sydney NSW 1215
>
> AUSTRALIA
>
>


-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  |
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
One additional comment I would make is that you should be careful with
Updates in Cassandra. It does support them, but large amounts of Updates
(i.e. changing existing keys) tend to cause fragmentation. If you are
(mostly) adding new keys (e.g. new records in the time series) then
Cassandra can be excellent.

cheers


On Wed, Feb 11, 2015 at 6:13 PM, Paolo Platter 
wrote:

>   Hi Mike,
>
> I developed a Solution with cassandra and spark, using DSE.
> The main difficult is about cassandra, you need to understand very well
> its data model and its Query patterns.
> Cassandra has better performance than hdfs and it has DR and stronger
> availability.
> Hdfs is a filesystem, cassandra is a dbms.
> Cassandra supports full CRUD without acid.
> Hdfs is more flexible than cassandra.
>
> In my opinion, if you have a real time series, go with Cassandra paying
> attention at your reporting data access patterns.
>
> Paolo
>
> Sent from my Windows Phone
>  --
> From: Mike Trienis 
> Sent: 11/02/2015 05:59
> To: user@spark.apache.org
> Subject: Datastore HDFS vs Cassandra
>
>   Hi,
>
> I am considering implement Apache Spark on top of Cassandra database after
> listing to related talk and reading through the slides from DataStax. It
> seems to fit well with our time-series data and reporting requirements.
>
>
> http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data
>
> Does anyone have any experiences using Apache Spark and Cassandra,
> including
> limitations (and or) technical difficulties? How does Cassandra compare
> with
> HDFS and what use cases would make HDFS more suitable?
>
> Thanks, Mike.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Datastore-HDFS-vs-Cassandra-tp21590.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -----
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  |
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Re: How to create spark AMI in AWS

2015-02-09 Thread Franc Carter
Hi,

I'm very new to Spark, but experienced with AWS - so take that into
account with my suggestions.

I started with an AWS base image and then added the pre-built Spark-1.2. I
then made a 'Master' version and a 'Worker' version and made
AMIs for them.

The Master comes up with a static IP and the Worker image has this baked
in. I haven't completed everything I am planning to do, but so far I can
bring up the Master and a bunch of Workers inside an ASG and run Spark
code successfully.

cheers


On Mon, Feb 9, 2015 at 10:06 PM, Guodong Wang  wrote:

> Hi guys,
>
> I want to launch spark cluster in AWS. And I know there is a spark_ec2.py
> script.
>
> I am using the AWS service in China. But I can not find the AMI in the
> region of China.
>
> So, I have to build one. My question is
> 1. Where is the bootstrap script to create the Spark AMI? Is it here(
> https://github.com/mesos/spark-ec2/blob/branch-1.3/create_image.sh) ?
> 2. What is the base image of the Spark AMI? Eg, the base image of this (
> https://github.com/mesos/spark-ec2/blob/branch-1.3/ami-list/us-west-1/hvm)
> 3. Shall I install scala during building the AMI?
>
>
> Thanks.
>
> Guodong
>



-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  |
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Franc Carter
AMIs are specific to an AWS region, so the AMI id of the Spark AMI in
us-west will be different, if it exists at all. I can't remember where, but I
recall seeing somewhere that the AMI was only in us-east.

cheers

On Mon, Jan 26, 2015 at 8:47 PM, Håkan Jonsson  wrote:

> Thanks,
>
> I also use Spark 1.2 with prebuilt for Hadoop 2.4. I launch both 1.1 and
> 1.2 with the same command:
>
> ./spark-ec2 -k foo -i bar.pem launch mycluster
>
> By default this launches in us-east-1. I tried changing the the region
> using:
>
> -r us-west-1 but that had the same result:
>
> Could not resolve AMI at:
> https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm
>
> Looking up
> https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm in a
> browser results in the same AMI ID as yours. If I search for ami-7a320f3f
> AMI in AWS, I can't find any such image. I tried searching in all regions I
> could find in the AWS console.
>
> The AMI for 1.1 is spark.ami.pvm.v9 (ami-5bb18832). I can find that AMI in
> us-west-1.
>
> Strange. Not sure what to do.
>
> /Håkan
>
>
> On Mon Jan 26 2015 at 9:02:42 AM Charles Feduke 
> wrote:
>
> I definitely have Spark 1.2 running within EC2 using the spark-ec2
> scripts. I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later.
>
> What parameters are you using when you execute spark-ec2?
>
>
> I am launching in the us-west-1 region (ami-7a320f3f) which may explain
> things.
>
> On Mon Jan 26 2015 at 2:40:01 AM hajons  wrote:
>
> Hi,
>
> When I try to launch a standalone cluster on EC2 using the scripts in the
> ec2 directory for Spark 1.2, I get the following error:
>
> Could not resolve AMI at:
> https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm
>
> It seems there is not yet any AMI available on EC2. Any ideas when there
> will be one?
>
> This works without problems for version 1.1. Starting up a cluster using
> these scripts is so simple and straightforward, so I am really missing it
> on
> 1.2.
>
> /Håkan
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/No-AMI-for-Spark-1-2-using-ec2-scripts-tp21362.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  |
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Re: Reading from a centralized store

2015-01-06 Thread Franc Carter
Ah, so it's rdd specific - that would make sense. For those systems where
it is possible to extract sensible subsets, the rdds do so. My use case,
which is probably biasing my thinking, is DynamoDB, which I don't think can
efficiently extract records from M-to-N.

cheers

On Wed, Jan 7, 2015 at 6:59 AM, Cody Koeninger  wrote:

> No, most rdds partition input data appropriately.
>
> On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter  > wrote:
>
>>
>> One more question, to be clarify. Will every node pull in all the data ?
>>
>> thanks
>>
>> On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger 
>> wrote:
>>
>>> If you are not co-locating spark executor processes on the same machines
>>> where the data is stored, and using an rdd that knows about which node to
>>> prefer scheduling a task on, yes, the data will be pulled over the network.
>>>
>>> Of the options you listed, S3 and DynamoDB cannot have spark running on
>>> the same machines. Cassandra can be run on the same nodes as spark, and
>>> recent versions of the spark cassandra connector implement preferred
>>> locations.  You can run an rdbms on the same nodes as spark, but JdbcRDD
>>> doesn't implement preferred locations.
>>>
>>> On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter <
>>> franc.car...@rozettatech.com> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to understand how a Spark Cluster behaves when the data it
>>>> is processing resides on a centralized/remote store (S3, Cassandra,
>>>> DynamoDB, RDBMS etc).
>>>>
>>>> Does every node in the cluster retrieve all the data from the central
>>>> store ?
>>>>
>>>> thanks
>>>>
>>>> --
>>>>
>>>> *Franc Carter* | Systems Architect | Rozetta Technology
>>>>
>>>> franc.car...@rozettatech.com  |
>>>> www.rozettatechnology.com
>>>>
>>>> Tel: +61 2 8355 2515
>>>>
>>>> Level 4, 55 Harrington St, The Rocks NSW 2000
>>>>
>>>> PO Box H58, Australia Square, Sydney NSW 1215
>>>>
>>>> AUSTRALIA
>>>>
>>>>
>>>
>>
>>
>> --
>>
>> *Franc Carter* | Systems Architect | Rozetta Technology
>>
>> franc.car...@rozettatech.com  |
>> www.rozettatechnology.com
>>
>> Tel: +61 2 8355 2515
>>
>> Level 4, 55 Harrington St, The Rocks NSW 2000
>>
>> PO Box H58, Australia Square, Sydney NSW 1215
>>
>> AUSTRALIA
>>
>>
>


-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  |
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Re: Reading from a centralized store

2015-01-06 Thread Franc Carter
One more question, to clarify. Will every node pull in all the data?

thanks

On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger  wrote:

> If you are not co-locating spark executor processes on the same machines
> where the data is stored, and using an rdd that knows about which node to
> prefer scheduling a task on, yes, the data will be pulled over the network.
>
> Of the options you listed, S3 and DynamoDB cannot have spark running on
> the same machines. Cassandra can be run on the same nodes as spark, and
> recent versions of the spark cassandra connector implement preferred
> locations.  You can run an rdbms on the same nodes as spark, but JdbcRDD
> doesn't implement preferred locations.
>
> On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter  > wrote:
>
>>
>> Hi,
>>
>> I'm trying to understand how a Spark Cluster behaves when the data it is
>> processing resides on a centralized/remote store (S3, Cassandra, DynamoDB,
>> RDBMS etc).
>>
>> Does every node in the cluster retrieve all the data from the central
>> store ?
>>
>> thanks
>>
>> --
>>
>> *Franc Carter* | Systems Architect | Rozetta Technology
>>
>> franc.car...@rozettatech.com  |
>> www.rozettatechnology.com
>>
>> Tel: +61 2 8355 2515
>>
>> Level 4, 55 Harrington St, The Rocks NSW 2000
>>
>> PO Box H58, Australia Square, Sydney NSW 1215
>>
>> AUSTRALIA
>>
>>
>


-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  |
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Re: Reading from a centralized store

2015-01-05 Thread Franc Carter
Thanks, that's what I suspected.

cheers

On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger  wrote:

> If you are not co-locating spark executor processes on the same machines
> where the data is stored, and using an rdd that knows about which node to
> prefer scheduling a task on, yes, the data will be pulled over the network.
>
> Of the options you listed, S3 and DynamoDB cannot have spark running on
> the same machines. Cassandra can be run on the same nodes as spark, and
> recent versions of the spark cassandra connector implement preferred
> locations.  You can run an rdbms on the same nodes as spark, but JdbcRDD
> doesn't implement preferred locations.
>
> On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter  > wrote:
>
>>
>> Hi,
>>
>> I'm trying to understand how a Spark Cluster behaves when the data it is
>> processing resides on a centralized/remote store (S3, Cassandra, DynamoDB,
>> RDBMS etc).
>>
>> Does every node in the cluster retrieve all the data from the central
>> store ?
>>
>> thanks
>>
>> --
>>
>> *Franc Carter* | Systems Architect | Rozetta Technology
>>
>> franc.car...@rozettatech.com  |
>> www.rozettatechnology.com
>>
>> Tel: +61 2 8355 2515
>>
>> Level 4, 55 Harrington St, The Rocks NSW 2000
>>
>> PO Box H58, Australia Square, Sydney NSW 1215
>>
>> AUSTRALIA
>>
>>
>


-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  |
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Reading from a centralized store

2015-01-05 Thread Franc Carter
Hi,

I'm trying to understand how a Spark Cluster behaves when the data it is
processing resides on a centralized/remote store (S3, Cassandra, DynamoDB,
RDBMS etc).

Does every node in the cluster retrieve all the data from the central store
?

thanks

-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  |
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA