Re: Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
I'm using DataFrames, the types are all doubles, and I'm only extracting what I
need.

The caveat is that I am porting an existing system for a client,
and for their business it's likely to be cheaper to throw hardware (in AWS)
at the problem for a couple of hours than to re-engineer their algorithms.

cheers


On 7 June 2016 at 21:54, Jörn Franke <jornfra...@gmail.com> wrote:

> Before hardware optimization there is always software optimization.
> Are you using Dataset / DataFrame? Are you using the right data types
> (e.g. int where int is appropriate; try to avoid string and char, etc.)?
> Do you extract only the data needed? What are the algorithm parameters?
>
> > On 07 Jun 2016, at 13:09, Franc Carter <franc.car...@gmail.com> wrote:
> >
> >
> > Hi,
> >
> > I am training a RandomForest regression model on Spark 1.6.1 (EMR) and
> > am interested in how best to scale it - e.g. more CPUs per
> > instance, more memory per instance, more instances, etc.
> >
> > I'm currently using 32 m3.xlarge instances for a training set with
> > 2.5 million rows, 1300 columns and a total size of 31GB (Parquet).
> >
> > thanks
> >
> > --
> > Franc
>



-- 
Franc
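
As a rough illustration of the DataFrame-based setup Jörn describes, here is a
minimal pyspark sketch assuming Spark 1.6's ml API and the sc provided by the
shell; the input path, the "label" column name and the parameter values are
placeholders rather than the poster's actual job:

    from pyspark.sql import SQLContext
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import RandomForestRegressor

    sqlContext = SQLContext(sc)

    # Parquet preserves the double column types, so nothing needs casting here
    df = sqlContext.read.parquet("s3://bucket/training.parquet")  # placeholder path

    feature_cols = [c for c in df.columns if c != "label"]

    # assemble the ~1300 numeric columns into a single feature vector
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    train = assembler.transform(df).select("label", "features")

    rf = RandomForestRegressor(labelCol="label", featuresCol="features",
                               numTrees=100, maxDepth=10)  # placeholder parameters
    model = rf.fit(train)

Adding instances mostly helps once the data is well partitioned; the algorithm
parameters (numTrees, maxDepth, maxBins) often dominate the runtime, which is
the "software before hardware" point above.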


Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
Hi,

I am training a RandomForest regression model on Spark 1.6.1 (EMR) and am
interested in how best to scale it - e.g. more CPUs per instance, more
memory per instance, more instances, etc.

I'm currently using 32 m3.xlarge instances for a training set with 2.5
million rows, 1300 columns and a total size of 31GB (Parquet).

thanks

-- 
Franc


Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
Thanks - I'll give that a try

cheers

On 20 March 2016 at 09:42, Felix Cheung <felixcheun...@hotmail.com> wrote:

> You are running pyspark in Spark client deploy mode. I have run into the
> same error as well and I'm not sure if this is graphframes-specific - the
> Python process can't find the graphframes Python code when it is loaded as
> a Spark package.
>
> To work around this, I extract the graphframes Python directory locally,
> where I run pyspark, into a directory called graphframes.
>
>
>
>
>
>
> On Thu, Mar 17, 2016 at 10:11 PM -0700, "Franc Carter" <
> franc.car...@gmail.com> wrote:
>
>
> I'm having trouble with that for pyspark, yarn and graphframes. I'm using:-
>
> pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5
>
> which starts and gives me a REPL, but when I try
>
>from graphframes import *
>
> I get
>
>   No module named graphframes
>
> without '--master yarn' it works as expected
>
> thanks
>
>
> On 18 March 2016 at 12:59, Felix Cheung <felixcheun...@hotmail.com> wrote:
>
> For some, like graphframes that are Spark packages, you could also use
> --packages in the command line of spark-submit or pyspark. See
> http://spark.apache.org/docs/latest/submitting-applications.html
>
> _
> From: Jakob Odersky <ja...@odersky.com>
> Sent: Thursday, March 17, 2016 6:40 PM
> Subject: Re: installing packages with pyspark
> To: Ajinkya Kale <kaleajin...@gmail.com>
> Cc: <user@spark.apache.org>
>
>
>
> Hi,
> regarding 1, packages are resolved locally. That means that when you
> specify a package, spark-submit will resolve the dependencies and
> download any jars on the local machine, before shipping* them to the
> cluster. So, without a priori knowledge of dataproc clusters, it
> should be no different to specify packages.
>
> Unfortunately I can't help with 2.
>
> --Jakob
>
> *shipping in this case means making them available via the network
>
> On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale <kaleajin...@gmail.com>
> wrote:
> > Hi all,
> >
> > I had a couple of questions.
> > 1. Is there documentation on how to add the graphframes or any other
> package
> > for that matter on the google dataproc managed spark clusters ?
> >
> > 2. Is there a way to add a package to an existing pyspark context
> through a
> > jupyter notebook ?
> >
> > --aj
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>
>
> --
> Franc
>



-- 
Franc
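
One way Felix's workaround can be wired up, sketched under the assumption that
the Python code inside the graphframes jar has been extracted and zipped locally
(the zip path below is hypothetical); sc.addPyFile ships it to the YARN workers
as well as making it importable on the driver:

    # started with: pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5
    sc.addPyFile("/home/hadoop/graphframes.zip")  # hypothetical path to the zipped Python package

    from graphframes import GraphFrame

    v = sqlContext.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
    e = sqlContext.createDataFrame([("a", "b", "knows")], ["src", "dst", "relationship"])
    g = GraphFrame(v, e)
    g.inDegrees.show()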


Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
I'm having trouble with that for pyspark, yarn and graphframes. I'm using:-

pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5

which starts and gives me a REPL, but when I try

   from graphframes import *

I get

>   No module named graphframes

without '--master yarn' it works as expected

thanks


On 18 March 2016 at 12:59, Felix Cheung  wrote:

> For some, like graphframes that are Spark packages, you could also use
> --packages in the command line of spark-submit or pyspark. See
> http://spark.apache.org/docs/latest/submitting-applications.html
>
> _
> From: Jakob Odersky 
> Sent: Thursday, March 17, 2016 6:40 PM
> Subject: Re: installing packages with pyspark
> To: Ajinkya Kale 
> Cc: 
>
>
>
> Hi,
> regarding 1, packages are resolved locally. That means that when you
> specify a package, spark-submit will resolve the dependencies and
> download any jars on the local machine, before shipping* them to the
> cluster. So, without a priori knowledge of dataproc clusters, it
> should be no different to specify packages.
>
> Unfortunately I can't help with 2.
>
> --Jakob
>
> *shipping in this case means making them available via the network
>
> On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale 
> wrote:
> > Hi all,
> >
> > I had a couple of questions.
> > 1. Is there documentation on how to add the graphframes or any other
> package
> > for that matter on the google dataproc managed spark clusters ?
> >
> > 2. Is there a way to add a package to an existing pyspark context
> through a
> > jupyter notebook ?
> >
> > --aj
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>


-- 
Franc


Re: filter by dict() key in pySpark

2016-02-24 Thread Franc Carter
A colleague found how to do this, the approach was to use a udf()

cheers

On 21 February 2016 at 22:41, Franc Carter <franc.car...@gmail.com> wrote:

>
> I have a DataFrame that has a Python dict() as one of the columns. I'd
> like to filter the DataFrame for those rows where the dict() contains a
> specific key, e.g. something like this:
>
> DF2 = DF1.filter('name' in DF1.params)
>
> but that gives me this error
>
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
>
> How do I express this correctly ?
>
> thanks
>
> --
> Franc
>



-- 
Franc
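
A sketch of what the udf() approach might look like, assuming params is a
map-type column and the goal is to keep rows whose dict contains the key 'name':

    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    # a plain Python predicate, wrapped so it runs per row instead of on the Column object
    has_name = udf(lambda params: params is not None and 'name' in params, BooleanType())

    DF2 = DF1.filter(has_name(DF1.params))

The original expression fails because the result of 'name' in DF1.params has to
be converted to a Python boolean while DF1.params is still a Column; the udf
defers the membership test to each row.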


filter by dict() key in pySpark

2016-02-21 Thread Franc Carter
I have a DataFrame that has a Python dict() as one of the columns. I'd like
to filter the DataFrame for those rows where the dict() contains a
specific key, e.g. something like this:

DF2 = DF1.filter('name' in DF1.params)

but that gives me this error

ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
for 'or', '~' for 'not' when building DataFrame boolean expressions.

How do I express this correctly ?

thanks

-- 
Franc


Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
Yes, I didn't work out how to solve that - sorry


On 3 February 2016 at 22:37, Devesh Raj Singh <raj.deves...@gmail.com>
wrote:

> Hi,
>
> but "withColumn" will only add once, if i want to add columns to the same
> dataframe in a loop it will keep overwriting the added column and in the
> end the last added column( in the loop) will be the added column. like in
> my code above.
>
> On Wed, Feb 3, 2016 at 5:05 PM, Franc Carter <franc.car...@gmail.com>
> wrote:
>
>>
>> I had problems doing this as well - I ended up using 'withColumn', it's
>> not particularly graceful but it worked (1.5.2 on AWS EMR)
>>
>> cheerd
>>
>> On 3 February 2016 at 22:06, Devesh Raj Singh <raj.deves...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to create dummy variables in SparkR by creating new columns
>>> for categorical variables, but it is not appending the columns.
>>>
>>>
>>> df <- createDataFrame(sqlContext, iris)
>>> class(dtypes(df))
>>>
>>> cat.column<-vector(mode="character",length=nrow(df))
>>> cat.column<-collect(select(df,df$Species))
>>> lev<-length(levels(as.factor(unlist(cat.column))))
>>> varb.names<-vector(mode="character",length=lev)
>>> for (i in 1:lev){
>>>
>>>   varb.names[i]<-paste0(colnames(cat.column),i)
>>>
>>> }
>>>
>>> for (j in 1:lev)
>>>
>>> {
>>>
>>>    dummy.df.new<-withColumn(df, paste0(colnames(cat.column),j),
>>>      ifelse(df$Species==levels(as.factor(unlist(cat.column)))[j],1,0))
>>>
>>> }
>>>
>>> I am getting the below output for
>>>
>>> head(dummy.df.new)
>>>
>>> output:
>>>
>>>   Sepal_Length Sepal_Width Petal_Length Petal_Width Species Species1
>>> 1          5.1         3.5          1.4         0.2  setosa        1
>>> 2          4.9         3.0          1.4         0.2  setosa        1
>>> 3          4.7         3.2          1.3         0.2  setosa        1
>>> 4          4.6         3.1          1.5         0.2  setosa        1
>>> 5          5.0         3.6          1.4         0.2  setosa        1
>>> 6          5.4         3.9          1.7         0.4  setosa        1
>>>
>>> Problem: Species2 and Species3 column are not getting added to the
>>> dataframe
>>>
>>> --
>>> Warm regards,
>>> Devesh.
>>>
>>
>>
>>
>> --
>> Franc
>>
>
>
>
> --
> Warm regards,
> Devesh.
>



-- 
Franc


Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
I had problems doing this as well - I ended up using 'withColumn', it's not
particularly graceful but it worked (1.5.2 on AWS EMR)

cheerd

On 3 February 2016 at 22:06, Devesh Raj Singh 
wrote:

> Hi,
>
> I am trying to create dummy variables in SparkR by creating new columns
> for categorical variables, but it is not appending the columns.
>
>
> df <- createDataFrame(sqlContext, iris)
> class(dtypes(df))
>
> cat.column<-vector(mode="character",length=nrow(df))
> cat.column<-collect(select(df,df$Species))
> lev<-length(levels(as.factor(unlist(cat.column))))
> varb.names<-vector(mode="character",length=lev)
> for (i in 1:lev){
>
>   varb.names[i]<-paste0(colnames(cat.column),i)
>
> }
>
> for (j in 1:lev)
>
> {
>
>    dummy.df.new<-withColumn(df, paste0(colnames(cat.column),j),
>      ifelse(df$Species==levels(as.factor(unlist(cat.column)))[j],1,0))
>
> }
>
> I am getting the below output for
>
> head(dummy.df.new)
>
> output:
>
>   Sepal_Length Sepal_Width Petal_Length Petal_Width Species Species1
> 1          5.1         3.5          1.4         0.2  setosa        1
> 2          4.9         3.0          1.4         0.2  setosa        1
> 3          4.7         3.2          1.3         0.2  setosa        1
> 4          4.6         3.1          1.5         0.2  setosa        1
> 5          5.0         3.6          1.4         0.2  setosa        1
> 6          5.4         3.9          1.7         0.4  setosa        1
>
> Problem: Species2 and Species3 column are not getting added to the
> dataframe
>
> --
> Warm regards,
> Devesh.
>



-- 
Franc


Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
Sure, for a dataframe that looks like this

ID Year Value
 1 2012   100
 1 2013   102
 1 2014   106
 2 2012   110
 2 2013   118
 2 2014   128

I'd like to get back

ID Year Value
 1 2013 2
 1 2014 4
 2 2013 8
 2 2014    10

i.e. the Value for an (ID, Year) combination is the Value for that (ID, Year)
minus the Value for (ID, Year-1).

thanks






On 10 January 2016 at 20:51, Femi Anthony <femib...@gmail.com> wrote:

> Can you clarify what you mean with an actual example ?
>
> For example, if your data frame looks like this:
>
> ID  Year   Value
> 1   2012   100
> 2   2013   101
> 3   2014   102
>
> What's your desired output ?
>
> Femi
>
>
> On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter <franc.car...@gmail.com>
> wrote:
>
>>
>> Hi,
>>
>> I have a DataFrame with the columns
>>
>>  ID,Year,Value
>>
>> I'd like to create a new Column that is Value2-Value1 where the
>> corresponding Year2=Year-1
>>
>> At the moment I am creating  a new DataFrame with renamed columns and
>> doing
>>
>>DF.join(DF2, . . . .)
>>
> >>  This looks cumbersome to me, is there a better way?
>>
>> thanks
>>
>>
>> --
>> Franc
>>
>
>
>
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>



-- 
Franc


Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
Thanks

cheers

On 10 January 2016 at 22:35, Blaž Šnuderl <snud...@gmail.com> wrote:

> This can be done using spark.sql and window functions. Take a look at
> https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
>
> On Sun, Jan 10, 2016 at 11:07 AM, Franc Carter <franc.car...@gmail.com>
> wrote:
>
>>
>> Sure, for a dataframe that looks like this
>>
>> ID Year Value
>>  1 2012   100
>>  1 2013   102
>>  1 2014   106
>>  2 2012   110
>>  2 2013   118
>>  2 2014   128
>>
>> I'd like to get back
>>
>> ID Year Value
>>  1 2013 2
>>  1 2014 4
>>  2 2013 8
>>  2 2014    10
>>
>> i.e. the Value for an (ID, Year) combination is the Value for that (ID, Year)
>> minus the Value for (ID, Year-1).
>>
>> thanks
>>
>>
>>
>>
>>
>>
>> On 10 January 2016 at 20:51, Femi Anthony <femib...@gmail.com> wrote:
>>
>>> Can you clarify what you mean with an actual example ?
>>>
>>> For example, if your data frame looks like this:
>>>
>>> ID  Year   Value
>>> 1   2012   100
>>> 2   2013   101
>>> 3   2014   102
>>>
>>> What's your desired output ?
>>>
>>> Femi
>>>
>>>
>>> On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter <franc.car...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> I have a DataFrame with the columns
>>>>
>>>>  ID,Year,Value
>>>>
>>>> I'd like to create a new Column that is Value2-Value1 where the
>>>> corresponding Year2=Year-1
>>>>
>>>> At the moment I am creating  a new DataFrame with renamed columns and
>>>> doing
>>>>
>>>>DF.join(DF2, . . . .)
>>>>
> >>>>  This looks cumbersome to me, is there a better way?
>>>>
>>>> thanks
>>>>
>>>>
>>>> --
>>>> Franc
>>>>
>>>
>>>
>>>
>>> --
>>> http://www.femibyte.com/twiki5/bin/view/Tech/
>>> http://www.nextmatrix.com
>>> "Great spirits have always encountered violent opposition from mediocre
>>> minds." - Albert Einstein.
>>>
>>
>>
>>
>> --
>> Franc
>>
>
>


-- 
Franc
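
A sketch of the window-function version Blaž points to, assuming a DataFrame df
with the ID, Year and Value columns from the example above (lag and Window are
available in pyspark from 1.4 onwards):

    from pyspark.sql import Window
    from pyspark.sql.functions import col, lag

    w = Window.partitionBy("ID").orderBy("Year")

    deltas = (df.withColumn("Delta", col("Value") - lag("Value", 1).over(w))
                .where(col("Delta").isNotNull())   # drops the first year of each ID
                .select("ID", "Year", "Delta"))

This avoids the self-join on renamed columns entirely.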


Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
My Python is not particularly good, so I'm afraid I don't understand what
that means.

cheers


On 9 January 2016 at 14:45, Franc Carter <franc.car...@gmail.com> wrote:

>
> Hi,
>
> I'm trying to write a short function that returns the last sunday of the
> week of a given date, code below
>
> def getSunday(day):
>
> day = day.cast("date")
>
> sun = next_day(day, "Sunday")
>
> n = datediff(sun,day)
>
> if (n == 7):
>
> return day
>
> else:
>
> return sun
>
>
> this gives me
>
> ValueError: Cannot convert column into bool:
>
>
> Can someone point out what I am doing wrong
>
> thanks
>
>
> --
> Franc
>



-- 
Franc


pyspark: calculating row deltas

2016-01-09 Thread Franc Carter
Hi,

I have a DataFrame with the columns

 ID,Year,Value

I'd like to create a new Column that is Value2-Value1 where the
corresponding Year2=Year-1

At the moment I am creating  a new DataFrame with renamed columns and doing

   DF.join(DF2, . . . .)

 This looks cumbersome to me, is there a better way?

thanks


-- 
Franc


Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
Got it, I needed to use the when/otherwise construct - code below

from pyspark.sql.functions import next_day, datediff, when

def getSunday(day):
    # work on the column as a date
    day = day.cast("date")
    # next_day gives the following Sunday (7 days away if 'day' is already a Sunday)
    sun = next_day(day, "Sunday")
    n = datediff(sun, day)
    # column-level conditional instead of a Python if/else
    x = when(n == 7, day).otherwise(sun)
    return x


On 10 January 2016 at 08:41, Franc Carter <franc.car...@gmail.com> wrote:

>
> My Python is not particularly good, so I'm afraid I don't understand what
> that means.
>
> cheers
>
>
> On 9 January 2016 at 14:45, Franc Carter <franc.car...@gmail.com> wrote:
>
>>
>> Hi,
>>
>> I'm trying to write a short function that returns the last sunday of the
>> week of a given date, code below
>>
>> def getSunday(day):
>>
>> day = day.cast("date")
>>
>> sun = next_day(day, "Sunday")
>>
>> n = datediff(sun,day)
>>
>> if (n == 7):
>>
>> return day
>>
>> else:
>>
>> return sun
>>
>>
>> this gives me
>>
>> ValueError: Cannot convert column into bool:
>>
>>
>> Can someone point out what I am doing wrong
>>
>> thanks
>>
>>
>> --
>> Franc
>>
>
>
>
> --
> Franc
>



-- 
Franc


pyspark: conditionals inside functions

2016-01-08 Thread Franc Carter
Hi,

I'm trying to write a short function that returns the last sunday of the
week of a given date, code below

def getSunday(day):
    day = day.cast("date")
    sun = next_day(day, "Sunday")
    n = datediff(sun, day)
    if (n == 7):
        return day
    else:
        return sun


this gives me

ValueError: Cannot convert column into bool:


Can someone point out what I am doing wrong

thanks


-- 
Franc


number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
Hi,

I'm having trouble working out how to get the number of executors set when
using sparkR.init().

If I start sparkR with

  sparkR  --master yarn --num-executors 6

then I get 6 executors

However, if start sparkR with

  sparkR

followed by

  sc <- sparkR.init(master="yarn-client",
sparkEnvir=list(spark.num.executors='6'))

then I only get 2 executors.

Can anyone point me in the direction of what I might be doing wrong? I need
to initialise it this way so that RStudio can hook in to SparkR.

thanks

-- 
Franc


Re: number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
Thanks, that works

cheers

On 26 December 2015 at 16:53, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> The equivalent of spark-submit --num-executors, when used in SparkConf,
> should be spark.executor.instances:
> http://spark.apache.org/docs/latest/running-on-yarn.html
>
> Could you try setting that with sparkR.init()?
>
>
> _____
> From: Franc Carter <franc.car...@gmail.com>
> Sent: Friday, December 25, 2015 9:23 PM
> Subject: number of executors in sparkR.init()
> To: <user@spark.apache.org>
>
>
>
> Hi,
>
> I'm having trouble working out how to get the number of executors set when
> using sparkR.init().
>
> If I start sparkR with
>
>   sparkR  --master yarn --num-executors 6
>
> then I get 6 executors
>
> However, if start sparkR with
>
>   sparkR
>
> followed by
>
>   sc <- sparkR.init(master="yarn-client",
> sparkEnvir=list(spark.num.executors='6'))
>
> then I only get 2 executors.
>
> Can anyone point me in the direction of what I might be doing wrong? I need
> to initialise it this way so that RStudio can hook in to SparkR.
>
> thanks
>
> --
> Franc
>
>
>


-- 
Franc
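
For comparison, the property Felix points to (spark.executor.instances, the YARN
equivalent of --num-executors) set programmatically from pyspark; this is only a
sketch of the same idea in Python, the SparkR route being the sparkEnvir list
passed to sparkR.init() as he suggests:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-client")
            .set("spark.executor.instances", "6"))  # equivalent of --num-executors 6
    sc = SparkContext(conf=conf)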


Re: SparkR csv without headers

2015-08-20 Thread Franc Carter
Thanks - works nicely

cheers

On Fri, Aug 21, 2015 at 12:43 PM, Sun, Rui rui@intel.com wrote:

 Hi,



 You can create a DataFrame using load.df() with a specified schema.



 Something like:

 schema <- structType(structField("a", "string"), structField("b",
 "integer"), …)

 read.df(…, schema = schema)



 *From:* Franc Carter [mailto:franc.car...@rozettatech.com]
 *Sent:* Wednesday, August 19, 2015 1:48 PM
 *To:* user@spark.apache.org
 *Subject:* SparkR csv without headers





 Hi,



 Does anyone have an example of how to create a DataFrame in SparkR which
 specifies the column names - the csv files I have do not have column names
 in the first row. I can read a csv nicely with
 com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2,
 C3, etc.





 thanks



 --

 *Franc Carter* | Systems Architect | RoZetta Technology

 L4. 55 Harrington Street, THE ROCKS, NSW, 2000

 PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA

 *T* +61 2 8355 2515 | www.rozettatechnology.com

 DISCLAIMER: The contents of this email, inclusive of attachments, may be legally
 privileged and confidential. Any unauthorised use of the contents is
 expressly prohibited.








-- 

*Franc Carter* | Systems Architect | RoZetta Technology

L4. 55 Harrington Street, THE ROCKS, NSW, 2000

PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA

*T* +61 2 8355 2515 | www.rozettatechnology.com

DISCLAIMER: The contents of this email, inclusive of attachments, may be legally
privileged and confidential. Any unauthorised use of the contents is
expressly prohibited.


SparkR csv without headers

2015-08-18 Thread Franc Carter
Hi,

Does anyone have an example of how to create a DataFrame in SparkR which
specifies the column names - the csv files I have do not have column names
in the first row. I can read a csv nicely with
com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2,
C3, etc.


thanks

-- 

*Franc Carter* | Systems Architect | RoZetta Technology

L4. 55 Harrington Street, THE ROCKS, NSW, 2000

PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA

*T* +61 2 8355 2515 | www.rozettatechnology.com

DISCLAIMER: The contents of this email, inclusive of attachments, may be legally
privileged and confidential. Any unauthorised use of the contents is
expressly prohibited.


Column operation on Spark RDDs.

2015-06-04 Thread Carter
Hi, I have an RDD with MANY columns (e.g., hundreds), and most of my operations
are on columns, e.g., I need to create many intermediate variables from
different columns. What is the most efficient way to do this?

For example, if my dataRDD[Array[String]] is like below: 

123, 523, 534, ..., 893 
536, 98, 1623, ..., 98472 
537, 89, 83640, ..., 9265 
7297, 98364, 9, ..., 735 
.. 
29, 94, 956, ..., 758 

I will need to create a new column or a variable as newCol1 =
2ndCol+19thCol, and another new column based on newCol1 and the existing
columns: newCol2 = function(newCol1, 34thCol), what is the best way of doing
this?

I have been thinking of using an index for the intermediate variables and the
dataRDD, and then joining them together on the index to do my calculation:
var dataRDD = sc.textFile("/test.csv").map(_.split(","))
val dt = dataRDD.zipWithIndex.map(_.swap)
val newCol1 = dataRDD.map(x => x(1)+x(18)).zipWithIndex.map(_.swap)
val newCol2 = newCol1.join(dt).map(x => function(...))

Is there a better way of doing this?

Thank you very much!












--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Column-operation-on-Spark-RDDs-tp23165.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to add a column to a spark RDD with many columns?

2015-05-02 Thread Carter
Thanks for your reply! It is what I am after.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22740.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to add a column to a spark RDD with many columns?

2015-04-30 Thread Carter
Hi all,

I have an RDD with *MANY* columns (e.g., *hundreds*); how do I add one more
column at the end of this RDD?

For example, if my RDD is like below:

123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
..
29, 94, 956, ..., 758

how can I efficiently add a column to it, whose value is the sum of the 2nd
and the 200th columns?

Thank you very much.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Franc Carter
One issue is that 'big' becomes 'not so big' reasonably quickly. A couple
of terabytes is not that challenging (depending on the algorithm) these
days, whereas 5 years ago it was a big challenge. We have a bit over a
petabyte (not using Spark), and using a distributed system is the only
viable way to get reasonable performance for reasonable cost.

cheers

On Tue, Mar 31, 2015 at 4:55 AM, Steve Loughran ste...@hortonworks.com
wrote:


  On 30 Mar 2015, at 13:27, jay vyas jayunit100.apa...@gmail.com wrote:

  Just the same as Spark was disrupting the Hadoop ecosystem by changing
 the assumption that you can't rely on memory in distributed
 analytics... now maybe we are challenging the assumption that big data
 analytics need to be distributed?

 I've been asking the same question lately and seen similarly that Spark
 performs quite reliably and well on a local single-node system, even for a
 streaming app which I ran for ten days in a row... I
 almost felt guilty that I never put it on a cluster!


  Modern machines can be pretty powerful: 16 physical cores HT'd to 32,
 384+GB RAM, GPU, giving you lots of compute. What you don't get is the storage
 capacity to match, and especially, the IO bandwidth. RAID-0 striping 2-4
 HDDs gives you some boost, but if you are reading, say, a 4 GB file from
 HDFS broken into 256MB blocks, you have that data replicated into (4*4*3)
 blocks: 48. Algorithm and capacity permitting, you've just massively
 boosted your load time. Downstream, if data can be thinned down, then you
 can start looking more at things you can do on a single host: a machine
 that can be in your Hadoop cluster. Ask YARN nicely and you can get a
 dedicated machine for a couple of days (i.e. until your Kerberos tokens
 expire).




-- 

*Franc Carter* | Systems Architect | RoZetta Technology

L4. 55 Harrington Street, THE ROCKS, NSW, 2000

PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA

*T* +61 2 8355 2515 | www.rozettatechnology.com

DISCLAIMER: The contents of this email, inclusive of attachments, may be legally
privileged and confidential. Any unauthorised use of the contents is
expressly prohibited.


Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Franc Carter
/errors:
 
   WARN TaskSchedulerImpl: Initial job has not accepted any
  resources; check your cluster UI to ensure that workers are
  registered and have sufficient memory 15/02/22 18:30:45
 
  ERROR SparkDeploySchedulerBackend: Asked to remove non-existent
  executor 0
  15/02/22 18:30:45
 
  ERROR SparkDeploySchedulerBackend: Asked to remove non-existent
  executor 1
 
  ...and failed drivers - in Spark Web UI Completed Drivers with
  State=ERROR appear.
 
  I've tried to pass limits for cores and memory to submit script but
  it didn't help...
 
  **`--deploy-mode=cluster`:**
 
  From my laptop:
 
  ./bin/spark-submit --master
  spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077
  --deploy-mode cluster --class SparkPi
  ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
 
  The result is:
 
   Driver successfully submitted as driver-20150223023734-0007 ...
  waiting before polling master for driver state ... polling master
  for driver state State of driver-20150223023734-0007 is ERROR
  Exception from cluster was: java.io.FileNotFoundException: File
  file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.1
  0 -0.0.1.jar does not exist. java.io.FileNotFoundException: File
 
 file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
  does not exist.   at
 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at
 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
 at
  org.apache.spark.deploy.worker.DriverRunner.org
 $apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
at
  org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunne
  r
  .scala:75)
 
   So, I'd appreciate any pointers on what is going wrong and some
  guidance how to deploy jobs from remote client. Thanks.
 
 
 
  --
  View this message in context:
  http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-jobs-
  t o-Spark-EC2-cluster-remotely-tp21762.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
  
  - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
  additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 

*Franc Carter* | Systems Architect | RoZetta Technology

L4. 55 Harrington Street, THE ROCKS, NSW, 2000

PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA

*T* +61 2 8355 2515 | www.rozettatechnology.com

DISCLAIMER: The contents of this email, inclusive of attachments, may be legally
privileged and confidential. Any unauthorised use of the contents is
expressly prohibited.


Re: spark, reading from s3

2015-02-12 Thread Franc Carter
Check that your timezone is correct as well; an incorrect timezone can make
it look like your time is correct when it is actually skewed.

cheers

On Fri, Feb 13, 2015 at 5:51 AM, Kane Kim kane.ist...@gmail.com wrote:

 The thing is that my time is perfectly valid...

 On Tue, Feb 10, 2015 at 10:50 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 It's the timezone actually; you can either use NTP to maintain an
 accurate system clock, or you can adjust your system time to match the
 AWS one. You can do it as:

 telnet s3.amazonaws.com 80
 GET / HTTP/1.0


 [image: Inline image 1]

 Thanks
 Best Regards

 On Wed, Feb 11, 2015 at 6:43 AM, Kane Kim kane.ist...@gmail.com wrote:

 I'm getting this warning when using s3 input:
 15/02/11 00:58:37 WARN RestStorageService: Adjusted time offset in
 response to
 RequestTimeTooSkewed error. Local machine and S3 server disagree on the
 time by approximately 0 seconds. Retrying connection.

 After that there are tons of 403/forbidden errors and then job fails.
 It's sporadic, so sometimes I get this error and sometimes not, what
 could be the issue?
 I think it could be related to network connectivity?






-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  franc.car...@rozettatech.com|
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


How to sum up the values in the columns of a dataset in Scala?

2015-02-12 Thread Carter

I am new to Scala. I have a dataset with many columns, each column has a
column name. Given several column names (these column names are not fixed,
they are generated dynamically), I need to sum up the values of these
columns. Is there an efficient way of doing this?

I worked out a way by using for loop, but I don't think it is efficient:

val AllLabels = List("ID", "val1", "val2", "val3", "val4")
val lbla = List("val1", "val3", "val4")
val index_lbla = lbla.map(x => AllLabels.indexOf(x))

val dataRDD = sc.textFile("../test.csv").map(_.split(","))

dataRDD.map(x =>
 {
  var sum = 0.0
  for (i <- index_lbla)
    sum = sum + x(i).toDouble
  sum
 }
).collect

The test.csv looks like below (without column names):

ID, val1, val2, val3, val4
 A, 123, 523, 534, 893
 B, 536, 98, 1623, 98472
 C, 537, 89, 83640, 9265
 D, 7297, 98364, 9, 735
 ...

Your help is very much appreciated!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-sum-up-the-values-in-the-columns-of-a-dataset-in-Scala-tp21639.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
One additional comment I would make is that you should be careful with
updates in Cassandra; it does support them, but large amounts of updates
(i.e. changing existing keys) tend to cause fragmentation. If you are
(mostly) adding new keys (e.g. new records in the time series) then
Cassandra can be excellent.

cheers


On Wed, Feb 11, 2015 at 6:13 PM, Paolo Platter paolo.plat...@agilelab.it
wrote:

   Hi Mike,

>  I developed a solution with Cassandra and Spark, using DSE.
>  The main difficulty is with Cassandra: you need to understand very well
>  its data model and its query patterns.
 Cassandra has better performance than hdfs and it has DR and stronger
 availability.
 Hdfs is a filesystem, cassandra is a dbms.
 Cassandra supports full CRUD without acid.
 Hdfs is more flexible than cassandra.

 In my opinion, if you have a real time series, go with Cassandra paying
 attention at your reporting data access patterns.

 Paolo

 Inviata dal mio Windows Phone
  --
 Da: Mike Trienis mike.trie...@orcsol.com
 Inviato: ‎11/‎02/‎2015 05:59
 A: user@spark.apache.org
 Oggetto: Datastore HDFS vs Cassandra

   Hi,

 I am considering implementing Apache Spark on top of a Cassandra database after
 listening to a related talk and reading through the slides from DataStax. It
 seems to fit well with our time-series data and reporting requirements.


 http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data

 Does anyone have any experiences using Apache Spark and Cassandra,
 including
 limitations (and or) technical difficulties? How does Cassandra compare
 with
 HDFS and what use cases would make HDFS more suitable?

 Thanks, Mike.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Datastore-HDFS-vs-Cassandra-tp21590.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  franc.car...@rozettatech.com|
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
I forgot to mention that if you do decide to use Cassandra I'd highly
recommend jumping on the Cassandra mailing list; if we had taken in some of
the advice on that list, things would have been considerably smoother.

cheers

On Wed, Feb 11, 2015 at 8:12 PM, Christian Betz 
christian.b...@performance-media.de wrote:

   Hi

  Regarding the Cassandra Data model, there's an excellent post on the
 ebay tech blog:
 http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/.
 There's also a slideshare for this somewhere.

  Happy hacking

  Chris

   Von: Franc Carter franc.car...@rozettatech.com
 Datum: Mittwoch, 11. Februar 2015 10:03
 An: Paolo Platter paolo.plat...@agilelab.it
 Cc: Mike Trienis mike.trie...@orcsol.com, user@spark.apache.org 
 user@spark.apache.org
 Betreff: Re: Datastore HDFS vs Cassandra


 One additional comment I would make is that you should be careful with
 Updates in Cassandra, it does support them but large amounts of Updates
 (i.e changing existing keys) tends to cause fragmentation. If you are
 (mostly) adding new keys (e.g new records in the the time series) then
 Cassandra can be excellent

  cheers


 On Wed, Feb 11, 2015 at 6:13 PM, Paolo Platter paolo.plat...@agilelab.it
 wrote:

   Hi Mike,

 I developed a Solution with cassandra and spark, using DSE.
 The main difficult is about cassandra, you need to understand very well
 its data model and its Query patterns.
 Cassandra has better performance than hdfs and it has DR and stronger
 availability.
 Hdfs is a filesystem, cassandra is a dbms.
 Cassandra supports full CRUD without acid.
 Hdfs is more flexible than cassandra.

 In my opinion, if you have a real time series, go with Cassandra paying
 attention at your reporting data access patterns.

 Paolo

 Inviata dal mio Windows Phone
  --
 Da: Mike Trienis mike.trie...@orcsol.com
 Inviato: ?11/?02/?2015 05:59
 A: user@spark.apache.org
 Oggetto: Datastore HDFS vs Cassandra

   Hi,

 I am considering implement Apache Spark on top of Cassandra database after
 listing to related talk and reading through the slides from DataStax. It
 seems to fit well with our time-series data and reporting requirements.


 http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data

 Does anyone have any experiences using Apache Spark and Cassandra,
 including
 limitations (and or) technical difficulties? How does Cassandra compare
 with
 HDFS and what use cases would make HDFS more suitable?

 Thanks, Mike.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Datastore-HDFS-vs-Cassandra-tp21590.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




  --

 *Franc Carter* | Systems Architect | Rozetta Technology

 franc.car...@rozettatech.com  franc.car...@rozettatech.com|
 www.rozettatechnology.com

 Tel: +61 2 8355 2515

 Level 4, 55 Harrington St, The Rocks NSW 2000

 PO Box H58, Australia Square, Sydney NSW 1215

 AUSTRALIA




-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  franc.car...@rozettatech.com|
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Franc Carter
AMIs are specific to an AWS region, so the AMI ID of the Spark AMI in
us-west will be different, if it exists at all. I can't remember where, but I
have a memory of seeing somewhere that the AMI was only in us-east.

cheers

On Mon, Jan 26, 2015 at 8:47 PM, Håkan Jonsson haj...@gmail.com wrote:

 Thanks,

 I also use Spark 1.2 with prebuilt for Hadoop 2.4. I launch both 1.1 and
 1.2 with the same command:

 ./spark-ec2 -k foo -i bar.pem launch mycluster

 By default this launches in us-east-1. I tried changing the the region
 using:

 -r us-west-1 but that had the same result:

 Could not resolve AMI at:
 https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm

 Looking up
 https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm in a
 browser results in the same AMI ID as yours. If I search for ami-7a320f3f
 AMI in AWS, I can't find any such image. I tried searching in all regions I
 could find in the AWS console.

 The AMI for 1.1 is spark.ami.pvm.v9 (ami-5bb18832). I can find that AMI in
 us-west-1.

 Strange. Not sure what to do.

 /Håkan


 On Mon Jan 26 2015 at 9:02:42 AM Charles Feduke charles.fed...@gmail.com
 wrote:

 I definitely have Spark 1.2 running within EC2 using the spark-ec2
 scripts. I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later.

 What parameters are you using when you execute spark-ec2?


 I am launching in the us-west-1 region (ami-7a320f3f) which may explain
 things.

 On Mon Jan 26 2015 at 2:40:01 AM hajons haj...@gmail.com wrote:

 Hi,

 When I try to launch a standalone cluster on EC2 using the scripts in the
 ec2 directory for Spark 1.2, I get the following error:

 Could not resolve AMI at:
 https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm

 It seems there is not yet any AMI available on EC2. Any ideas when there
 will be one?

 This works without problems for version 1.1. Starting up a cluster using
 these scripts is so simple and straightforward, so I am really missing it
 on
 1.2.

 /Håkan





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/No-AMI-for-Spark-1-2-using-ec2-scripts-tp21362.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  franc.car...@rozettatech.com|
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Does DecisionTree model in MLlib deal with missing values?

2015-01-10 Thread Carter
Hi, I am new to MLlib in Spark. Can the DecisionTree model in MLlib deal
with missing values? If so, what data structure should I use for the input?

Moreover, my data has categorical features, but LabeledPoint requires the
double data type; in this case, what can I do?

Thank you very much.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-DecisionTree-model-in-MLlib-deal-with-missing-values-tp21080.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
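
On the categorical-feature part of the question, a minimal sketch of how MLlib's
categoricalFeaturesInfo is normally used: categories are encoded as 0-based
double indices inside the LabeledPoint features, and the map tells the tree how
many categories each such feature has (the data and encoding below are made up).
The 1.x MLlib trees do not impute missing values for you, so rows generally need
to be cleaned or imputed beforehand.

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    # feature 0 is categorical with 3 categories, encoded as 0.0 / 1.0 / 2.0;
    # feature 1 is an ordinary continuous value
    data = sc.parallelize([
        LabeledPoint(1.0, Vectors.dense(0.0, 3.5)),
        LabeledPoint(0.0, Vectors.dense(2.0, 1.2)),
        LabeledPoint(1.0, Vectors.dense(1.0, 4.8)),
    ])

    model = DecisionTree.trainClassifier(
        data,
        numClasses=2,
        categoricalFeaturesInfo={0: 3},  # feature index -> number of categories
        impurity='gini',
        maxDepth=5)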



Re: Reading from a centralized stored

2015-01-06 Thread Franc Carter
Ah, so it's RDD-specific - that would make sense. For those systems where
it is possible to extract sensible subsets, the RDDs do so. My use case,
which is probably biasing my thinking, is DynamoDB, which I don't think can
efficiently extract records from M-to-N.

cheers

On Wed, Jan 7, 2015 at 6:59 AM, Cody Koeninger c...@koeninger.org wrote:

 No, most rdds partition input data appropriately.

 On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter franc.car...@rozettatech.com
  wrote:


 One more question, to clarify: will every node pull in all the data?

 thanks

 On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger c...@koeninger.org
 wrote:

 If you are not co-locating spark executor processes on the same machines
 where the data is stored, and using an rdd that knows about which node to
 prefer scheduling a task on, yes, the data will be pulled over the network.

 Of the options you listed, S3 and DynamoDB cannot have spark running on
 the same machines. Cassandra can be run on the same nodes as spark, and
 recent versions of the spark cassandra connector implement preferred
 locations.  You can run an rdbms on the same nodes as spark, but JdbcRDD
 doesn't implement preferred locations.

 On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter 
 franc.car...@rozettatech.com wrote:


 Hi,

 I'm trying to understand how a Spark Cluster behaves when the data it
 is processing resides on a centralized/remote store (S3, Cassandra,
 DynamoDB, RDBMS etc).

 Does every node in the cluster retrieve all the data from the central
 store ?

 thanks

 --

 *Franc Carter* | Systems Architect | Rozetta Technology

 franc.car...@rozettatech.com  franc.car...@rozettatech.com|
 www.rozettatechnology.com

 Tel: +61 2 8355 2515

 Level 4, 55 Harrington St, The Rocks NSW 2000

 PO Box H58, Australia Square, Sydney NSW 1215

 AUSTRALIA





 --

 *Franc Carter* | Systems Architect | Rozetta Technology

 franc.car...@rozettatech.com  franc.car...@rozettatech.com|
 www.rozettatechnology.com

 Tel: +61 2 8355 2515

 Level 4, 55 Harrington St, The Rocks NSW 2000

 PO Box H58, Australia Square, Sydney NSW 1215

 AUSTRALIA





-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  franc.car...@rozettatech.com|
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Re: Reading from a centralized stored

2015-01-05 Thread Franc Carter
Thanks, that's what I suspected.

cheers

On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger c...@koeninger.org wrote:

 If you are not co-locating spark executor processes on the same machines
 where the data is stored, and using an rdd that knows about which node to
 prefer scheduling a task on, yes, the data will be pulled over the network.

 Of the options you listed, S3 and DynamoDB cannot have spark running on
 the same machines. Cassandra can be run on the same nodes as spark, and
 recent versions of the spark cassandra connector implement preferred
 locations.  You can run an rdbms on the same nodes as spark, but JdbcRDD
 doesn't implement preferred locations.

 On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter franc.car...@rozettatech.com
  wrote:


 Hi,

 I'm trying to understand how a Spark Cluster behaves when the data it is
 processing resides on a centralized/remote store (S3, Cassandra, DynamoDB,
 RDBMS etc).

 Does every node in the cluster retrieve all the data from the central
 store ?

 thanks

 --

 *Franc Carter* | Systems Architect | Rozetta Technology

 franc.car...@rozettatech.com  franc.car...@rozettatech.com|
 www.rozettatechnology.com

 Tel: +61 2 8355 2515

 Level 4, 55 Harrington St, The Rocks NSW 2000

 PO Box H58, Australia Square, Sydney NSW 1215

 AUSTRALIA





-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  franc.car...@rozettatech.com|
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Reading from a centralized stored

2015-01-05 Thread Franc Carter
Hi,

I'm trying to understand how a Spark Cluster behaves when the data it is
processing resides on a centralized/remote store (S3, Cassandra, DynamoDB,
RDBMS etc).

Does every node in the cluster retrieve all the data from the central store
?

thanks

-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  franc.car...@rozettatech.com|
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


How to get the help or explanation for the functions in Spark shell?

2014-06-08 Thread Carter
Hi All,

I am new to Spark. 

In the Spark shell, how can I get the help or explanation for those
functions that I can use for a variable or RDD? For example, after I input an
RDD's name with a dot (.) at the end, if I press the Tab key, a list of
functions that I can use for this RDD is displayed, but I don't know how
to use these functions.

Your help is greatly appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to get the help or explanation for the functions in Spark shell?

2014-06-08 Thread Carter
Thank you very much Gerard.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191p7193.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Carter
Hi All,

I just downloaded the Scala IDE for Eclipse. After I created a Spark project
and clicked Run, there was an error on the line of code "import
org.apache.spark.SparkContext": "object apache is not a member of package
org". I guess I need to import the Spark dependency into the Scala IDE for
Eclipse; can anyone tell me how to do it? Thanks a lot.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Carter
Thanks  a lot Krishna, this works for me.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7223.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Carter
Thanks for your reply Wei, will try this.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7224.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: K-nearest neighbors search in Spark

2014-05-27 Thread Carter
Any suggestion is very much appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


K-nearest neighbors search in Spark

2014-05-26 Thread Carter
Hi all,

I want to implement a basic K-nearest neighbors search in Spark, but I am
totally new to Scala so I don't know where to start. My data consists of
millions of points. For each point, I need to compute its Euclidean distance
to the other points, and return the top-K points that are closest to it. The
data.txt is in a comma-separated format like this:

ID, X, Y
1, 68, 93
2, 12, 90
3, 45, 76
...
100, 86, 54

Could you please tell me what data structure I should use, and how to
implement this algorithm in Scala (*some sample code is greatly appreciated*).

Thank you very much.

Regards,
Carter



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: sbt/sbt run command returns a JVM problem

2014-05-05 Thread Carter
Hi, I still have over 1 GB left for my program.

Date: Sun, 4 May 2014 19:14:30 -0700
From: ml-node+s1001560n5340...@n3.nabble.com
To: gyz...@hotmail.com
Subject: Re: sbt/sbt run command returns a JVM problem



The total memory of your machine is 2 GB, right? Then how much memory is
left free? Wouldn't Ubuntu take up quite a big portion of the 2 GB?
Just a guess!


On Sat, May 3, 2014 at 8:15 PM, Carter [hidden email] wrote:

Hi, thanks for all your help.

I tried your setting in the sbt file, but the problem is still there.



The Java setting in my sbt file is:

java \

  -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \

  -jar ${JAR} \

  $@



I have tried to set these 3 parameters bigger and smaller, but nothing

works. Did I change the right thing?



Thank you very much.







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sbt-sbt-run-command-returns-a-JVM-problem-tp5157p5267.html


Sent from the Apache Spark User List mailing list archive at Nabble.com.














  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sbt-sbt-run-command-returns-a-JVM-problem-tp5157p5412.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: sbt/sbt run command returns a JVM problem

2014-05-04 Thread Carter
Hi Michael,

The log after I typed last is as below:
 last
scala.tools.nsc.MissingRequirementError: object scala not found.
at
scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655)
at
scala.tools.nsc.symtab.Definitions$definitions$.getModule(Definitions.scala:605)
at
scala.tools.nsc.symtab.Definitions$definitions$.ScalaPackage(Definitions.scala:145)
at
scala.tools.nsc.symtab.Definitions$definitions$.ScalaPackageClass(Definitions.scala:146)
at
scala.tools.nsc.symtab.Definitions$definitions$.AnyClass(Definitions.scala:176)
at
scala.tools.nsc.symtab.Definitions$definitions$.init(Definitions.scala:814)
at scala.tools.nsc.Global$Run.init(Global.scala:697)
at sbt.compiler.Eval$$anon$1.init(Eval.scala:53)
at sbt.compiler.Eval.run$1(Eval.scala:53)
at sbt.compiler.Eval.unlinkAll$1(Eval.scala:56)
at sbt.compiler.Eval.eval(Eval.scala:62)
at sbt.EvaluateConfigurations$.evaluateSetting(Build.scala:104)
at sbt.BuiltinCommands$$anonfun$set$1.apply(Main.scala:212)
at sbt.BuiltinCommands$$anonfun$set$1.apply(Main.scala:209)
at
sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
at
sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
at
sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
at
sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
at sbt.Command$.process(Command.scala:90)
at 
sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
at 
sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
at sbt.State$$anon$2.process(State.scala:171)
at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.MainLoop$.next(MainLoop.scala:71)
at sbt.MainLoop$.run(MainLoop.scala:64)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:53)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:50)
at sbt.Using.apply(Using.scala:25)
at sbt.MainLoop$.runWithNewLog(MainLoop.scala:50)
at sbt.MainLoop$.runAndClearLast(MainLoop.scala:33)
at sbt.MainLoop$.runLoggedLoop(MainLoop.scala:17)
at sbt.MainLoop$.runLogged(MainLoop.scala:13)
at sbt.xMain.run(Main.scala:26)
at xsbt.boot.Launch$.run(Launch.scala:55)
at xsbt.boot.Launch$$anonfun$explicit$1.apply(Launch.scala:45)
at xsbt.boot.Launch$.launch(Launch.scala:60)
at xsbt.boot.Launch$.apply(Launch.scala:16)
at xsbt.boot.Boot$.runImpl(Boot.scala:31)
at xsbt.boot.Boot$.main(Boot.scala:20)
at xsbt.boot.Boot.main(Boot.scala)
[error] scala.tools.nsc.MissingRequirementError: object scala not found.
[error] Use 'last' for the full log.


And my sbt file is like below (my sbt launcher is sbt-launch-0.12.4.jar in
the same folder):
#!/bin/bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the License); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an AS IS BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This script launches sbt for this project. If present it uses the system 
# version of sbt. If there is no system version of sbt it attempts to
download
# sbt locally.
SBT_VERSION=`awk -F "=" '/sbt\.version/ {print $2}' ./project/build.properties`
URL1=http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
URL2=http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
JAR=sbt/sbt-launch-${SBT_VERSION}.jar

# Download sbt launch jar if it hasn't been downloaded yet
if [ ! -f ${JAR} ]; then
  # Download
  printf "Attempting to fetch sbt\n"
  if hash curl 2>/dev/null; then
    curl --progress-bar ${URL1} > ${JAR} || curl --progress-bar ${URL2} > ${JAR}
  elif hash wget 2>/dev/null; then
    wget --progress=bar ${URL1} -O ${JAR} || wget --progress=bar ${URL2} -O ${JAR}
  else
    printf "You do not have curl or wget installed, please install sbt manually from http://www.scala-sbt.org/\n"
    exit -1
  fi
fi
if [ ! -f ${JAR} ]; then
  # We failed to download
  printf Our attempt to 

Re: sbt/sbt run command returns a JVM problem

2014-05-03 Thread Carter
Hi, thanks for all your help.
I tried your setting in the sbt file, but the problem is still there.

The Java setting in my sbt file is:
java \
  -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \
  -jar ${JAR} \
  $@

I have tried to set these 3 parameters bigger and smaller, but nothing
works. Did I change the right thing?

Thank you very much.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sbt-sbt-run-command-returns-a-JVM-problem-tp5157p5267.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: sbt/sbt run command returns a JVM problem

2014-05-03 Thread Carter
Hi Michael,

Thank you very much for your reply.

Sorry, I am not very familiar with sbt. Could you tell me where to set the
Java options for the sbt fork for my program? I brought up the sbt console
and ran set javaOptions += "-Xmx1G" in it, but it returned an error:
[error] scala.tools.nsc.MissingRequirementError: object scala not found.
[error] Use 'last' for the full log.

Is this the right way to set the java option? Thank you very much.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sbt-sbt-run-command-returns-a-JVM-problem-tp5157p5294.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


sbt/sbt run command returns a JVM problem

2014-05-01 Thread Carter
Hi, I have a very simple spark program written in Scala:
/*** testApp.scala ***/
object testApp {
  def main(args: Array[String]) {
    println("Hello! World!")
  }
}
Then I use the following command to compile it:
$ sbt/sbt package
The compilation finished successfully and I got a JAR file.
But when I use this command to run it:
$ sbt/sbt run
it returned an error with JVM:
[info] Error occurred during initialization of VM 
[info] Could not reserve enough space for object heap 
[error] Error: Could not create the Java Virtual Machine. 
[error] Error: A fatal exception has occurred. Program will exit. 
java.lang.RuntimeException: Nonzero exit code returned from runner: 1   
at scala.sys.package$.error(package.scala:27)

My machine has 2 GB of memory and runs Ubuntu 11.04. I also tried to change
the Java parameter settings (e.g., -Xmx, -Xms, -XX:MaxPermSize,
-XX:ReservedCodeCacheSize) in the file sbt/sbt, but it looks like none of the
changes work. Can anyone help me out with this problem? Thank you very much.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sbt-sbt-run-command-returns-a-JVM-problem-tp5157.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Need help about how hadoop works.

2014-04-24 Thread Carter
Thanks Mayur.

So without Hadoop and any other distributed file systems, by running:
 val doc = sc.textFile("/home/scalatest.txt", 5)
 doc.count
we only get parallelization within the computer where the file is
loaded, but not parallelization across the computers in the cluster
(Spark cannot automatically duplicate the file to the other computers in
the cluster); is this understanding correct? Thank you.

 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638p4734.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Need help about how hadoop works.

2014-04-24 Thread Carter
Thank you very much for your help Prashant.

Sorry, I still have another question about your answer: "however if the
file (/home/scalatest.txt) is present on the same path on all systems it
will be processed on all nodes."

When presenting the file at the same path on all nodes, do we just simply
copy the same file to all nodes, or do we need to split the original file
into different parts (each part still with the same file name
scalatest.txt), and copy each part to a different node for
parallelization?

Thank you very much.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638p4738.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: Need help about how hadoop works.

2014-04-24 Thread Carter
Thank you very much Prashant.
 
Date: Thu, 24 Apr 2014 01:24:39 -0700
From: ml-node+s1001560n4739...@n3.nabble.com
To: gyz...@hotmail.com
Subject: Re: Need help about how hadoop works.



It is the same file, and the Hadoop library that we use for splitting takes
care of assigning the right split to each node.

Prashant Sharma




On Thu, Apr 24, 2014 at 1:36 PM, Carter [hidden email] wrote:


Thank you very much for your help Prashant.



Sorry I still have another question about your answer: however if the

file(/home/scalatest.txt) is present on the same path on all systems it

will be processed on all nodes.



When presenting the file to the same path on all nodes, do we just simply

copy the same file to all nodes, or do we need to split the original file

into different parts (each part is still with the same file name

scalatest.txt), and copy each part to a different node for

parallelization?



Thank you very much.







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638p4738.html



Sent from the Apache Spark User List mailing list archive at Nabble.com.














  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638p4746.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Need help about how hadoop works.

2014-04-23 Thread Carter
Hi, I am a beginner with Hadoop and Spark, and want some help in understanding
how Hadoop works.

Suppose we have a cluster of 5 computers and install Spark on the cluster
WITHOUT Hadoop, and then we run this code on one computer:
val doc = sc.textFile("/home/scalatest.txt", 5)
doc.count
Can the count task be distributed to all 5 computers? Or is it only
run by 5 parallel threads on the current computer?

On the other hand, if we install Hadoop on the cluster and upload the data
into HDFS, when running the same code will this count task be done by 25
threads?

Thank you very much for your help. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.