Re: Advice on Scaling RandomForest
I'm using dataframes, types are all doubles and I'm only extracting what I need. The caveat on these is that I am porting an existing system for a client, and for their business it's likely to be cheaper to throw hardware (in AWS) at the problem for a couple of hours than re-engineer their algorithms. cheers On 7 June 2016 at 21:54, Jörn Franke wrote: > Before hardware optimization there is always software optimization. > Are you using dataset / dataframe? Are you using the right data types ( > e.g. int where int is appropriate, try to avoid string and char etc) > Do you extract only the stuff needed? What are the algorithm parameters? > > > On 07 Jun 2016, at 13:09, Franc Carter wrote: > > > > > > Hi, > > > > I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and > am interested in how it might be best to scale it - e.g. more CPUs per > instance, more memory per instance, more instances etc. > > > > I'm currently using 32 m3.xlarge instances for a training set with > 2.5 million rows, 1300 columns and a total size of 31GB (parquet) > > > > thanks > > > > -- > > Franc > -- Franc
Advice on Scaling RandomForest
Hi, I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and am interested in how it might be best to scale it - e.g. more CPUs per instance, more memory per instance, more instances etc. I'm currently using 32 m3.xlarge instances for a training set with 2.5 million rows, 1300 columns and a total size of 31GB (parquet) thanks -- Franc
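For reference, a minimal sketch of the knobs being discussed, using the DataFrame-based ML API available in Spark 1.6 (the parameter values are illustrative only, and trainingDF is a hypothetical DataFrame with 'features'/'label' columns):

from pyspark.ml.regression import RandomForestRegressor

# illustrative settings -- these are the "algorithm parameters" being asked about
rf = RandomForestRegressor(featuresCol="features", labelCol="label",
                           numTrees=100, maxDepth=10, maxBins=32,
                           subsamplingRate=0.8, featureSubsetStrategy="onethird")
model = rf.fit(trainingDF)

Fewer, deeper trees stress per-executor memory, while more trees parallelise across more instances - which is why the training parameters matter before choosing hardware.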
Re: installing packages with pyspark
Thanks - I'll give that a try cheers On 20 March 2016 at 09:42, Felix Cheung wrote: > You are running pyspark in Spark client deploy mode. I have run into the > same error as well and I'm not sure if this is graphframes specific - the > python process can't find the graphframes Python code when it is loaded as > a Spark package. > > To work around this, I extract the graphframes Python directory locally > where I run pyspark into a directory called graphframes. > > On Thu, Mar 17, 2016 at 10:11 PM -0700, "Franc Carter" < > franc.car...@gmail.com> wrote: > > > I'm having trouble with that for pyspark, yarn and graphframes. I'm using:- > > pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5 > > which starts and gives me a REPL, but when I try > > from graphframes import * > > I get > > No module named graphframes > > without '--master yarn' it works as expected > > thanks > > > On 18 March 2016 at 12:59, Felix Cheung wrote: > > For some, like graphframes that are Spark packages, you could also use > --packages in the command line of spark-submit or pyspark. See > http://spark.apache.org/docs/latest/submitting-applications.html > > _ > From: Jakob Odersky > Sent: Thursday, March 17, 2016 6:40 PM > Subject: Re: installing packages with pyspark > To: Ajinkya Kale > Cc: > > > > Hi, > regarding 1, packages are resolved locally. That means that when you > specify a package, spark-submit will resolve the dependencies and > download any jars on the local machine, before shipping* them to the > cluster. So, without a priori knowledge of dataproc clusters, it > should be no different to specify packages. > > Unfortunately I can't help with 2. > > --Jakob > > *shipping in this case means making them available via the network > > On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale > wrote: > > Hi all, > > > > I had a couple of questions. > > 1. Is there documentation on how to add the graphframes or any other > package > > for that matter on the google dataproc managed spark clusters ? > > > > 2. Is there a way to add a package to an existing pyspark context > through a > > jupyter notebook ? > > > > --aj > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- > Franc > -- Franc
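For reference, a sketch of Felix's workaround (the ivy cache path and jar name are assumptions - adjust to wherever --packages actually cached the jar; the Python sources of a Spark package are bundled inside its jar):

# extract the bundled Python package next to where pyspark is launched
unzip ~/.ivy2/jars/graphframes_graphframes-0.1.0-spark1.5.jar 'graphframes/*' -d .

# then start pyspark as before; the local graphframes/ directory is importable
pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5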
Re: installing packages with pyspark
I'm having trouble with that for pyspark, yarn and graphframes. I'm using:- pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5 which starts and gives me a REPL, but when I try from graphframes import * I get No module named graphframes without '--master yarn' it works as expected thanks On 18 March 2016 at 12:59, Felix Cheung wrote: > For some, like graphframes that are Spark packages, you could also use > --packages in the command line of spark-submit or pyspark. See > http://spark.apache.org/docs/latest/submitting-applications.html > > _ > From: Jakob Odersky > Sent: Thursday, March 17, 2016 6:40 PM > Subject: Re: installing packages with pyspark > To: Ajinkya Kale > Cc: > > > > Hi, > regarding 1, packages are resolved locally. That means that when you > specify a package, spark-submit will resolve the dependencies and > download any jars on the local machine, before shipping* them to the > cluster. So, without a priori knowledge of dataproc clusters, it > should be no different to specify packages. > > Unfortunately I can't help with 2. > > --Jakob > > *shipping in this case means making them available via the network > > On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale > wrote: > > Hi all, > > > > I had a couple of questions. > > 1. Is there documentation on how to add the graphframes or any other > package > > for that matter on the google dataproc managed spark clusters ? > > > > 2. Is there a way to add a package to an existing pyspark context > through a > > jupyter notebook ? > > > > --aj > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Franc
Re: filter by dict() key in pySpark
A colleague found how to do this, the approach was to use a udf() cheers On 21 February 2016 at 22:41, Franc Carter wrote: > > I have a DataFrame that has a Python dict() as one of the columns. I'd > like to filter the DataFrame for those Rows where the dict() contains a > specific value. e.g. something like this:- > > DF2 = DF1.filter('name' in DF1.params) > > but that gives me this error > > ValueError: Cannot convert column into bool: please use '&' for 'and', '|' > for 'or', '~' for 'not' when building DataFrame boolean expressions. > > How do I express this correctly ? > > thanks > > -- > Franc > -- Franc
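For reference, a minimal sketch of the udf() approach (assuming the dict arrives as a MapType column named params, as in the original post, and 'name' is the key being tested):

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# a MapType column is passed to a Python udf as a plain dict
has_name = udf(lambda params: params is not None and 'name' in params,
               BooleanType())
DF2 = DF1.filter(has_name(DF1.params))

The original expression fails because Python's in operator forces the Column into a boolean immediately, instead of building a Column expression.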
filter by dict() key in pySpark
I have a DataFrame that has a Python dict() as one of the columns. I'd like to filter the DataFrame for those Rows where the dict() contains a specific value. e.g. something like this:- DF2 = DF1.filter('name' in DF1.params) but that gives me this error ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. How do I express this correctly ? thanks -- Franc
Re: sparkR not able to create /append new columns
Yes, I didn't work out how to solve that - sorry On 3 February 2016 at 22:37, Devesh Raj Singh wrote: > Hi, > > but "withColumn" will only add once, if i want to add columns to the same > dataframe in a loop it will keep overwriting the added column and in the > end the last added column( in the loop) will be the added column. like in > my code above. > > On Wed, Feb 3, 2016 at 5:05 PM, Franc Carter > wrote: > >> >> I had problems doing this as well - I ended up using 'withColumn', it's >> not particularly graceful but it worked (1.5.2 on AWS EMR) >> >> cheers >> >> On 3 February 2016 at 22:06, Devesh Raj Singh >> wrote: >> >>> Hi, >>> >>> i am trying to create dummy variables in sparkR by creating new columns >>> for categorical variables. But it is not appending the columns >>> >>> df <- createDataFrame(sqlContext, iris) >>> class(dtypes(df)) >>> >>> cat.column<-vector(mode="character",length=nrow(df)) >>> cat.column<-collect(select(df,df$Species)) >>> lev<-length(levels(as.factor(unlist(cat.column)))) >>> varb.names<-vector(mode="character",length=lev) >>> for (i in 1:lev){ >>> varb.names[i]<-paste0(colnames(cat.column),i) >>> } >>> >>> for (j in 1:lev) >>> { >>> dummy.df.new<-withColumn(df,paste0(colnames(cat.column),j), >>> ifelse(df$Species==levels(as.factor(unlist(cat.column)))[j],1,0)) >>> } >>> >>> I am getting the below output for >>> >>> head(dummy.df.new) >>> >>> output: >>> >>> Sepal_Length Sepal_Width Petal_Length Petal_Width Species Species1 >>> 1 5.1 3.5 1.4 0.2 setosa 1 >>> 2 4.9 3.0 1.4 0.2 setosa 1 >>> 3 4.7 3.2 1.3 0.2 setosa 1 >>> 4 4.6 3.1 1.5 0.2 setosa 1 >>> 5 5.0 3.6 1.4 0.2 setosa 1 >>> 6 5.4 3.9 1.7 0.4 setosa 1 >>> >>> Problem: Species2 and Species3 column are not getting added to the >>> dataframe >>> >>> -- >>> Warm regards, >>> Devesh. >>> >> >> -- >> Franc >> > > -- > Warm regards, > Devesh. > -- Franc
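For reference, a minimal SparkR sketch of one way around the overwriting (a variant of the code above, assuming SparkR's column-aware ifelse from 1.5+): accumulate into the same DataFrame instead of calling withColumn on the original df in every iteration.

dummy.df.new <- df
for (j in 1:lev) {
  # each pass adds its column to the result of the previous pass
  dummy.df.new <- withColumn(dummy.df.new,
                             paste0(colnames(cat.column), j),
                             ifelse(dummy.df.new$Species == levels(as.factor(unlist(cat.column)))[j], 1, 0))
}

Because the loop starts from the accumulated DataFrame each time, Species1..Species3 all survive to the end.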
Re: sparkR not able to create /append new columns
I had problems doing this as well - I ended up using 'withColumn', it's not particularly graceful but it worked (1.5.2 on AWS EMR) cheers On 3 February 2016 at 22:06, Devesh Raj Singh wrote: > Hi, > > i am trying to create dummy variables in sparkR by creating new columns > for categorical variables. But it is not appending the columns > > df <- createDataFrame(sqlContext, iris) > class(dtypes(df)) > > cat.column<-vector(mode="character",length=nrow(df)) > cat.column<-collect(select(df,df$Species)) > lev<-length(levels(as.factor(unlist(cat.column)))) > varb.names<-vector(mode="character",length=lev) > for (i in 1:lev){ > varb.names[i]<-paste0(colnames(cat.column),i) > } > > for (j in 1:lev) > { > dummy.df.new<-withColumn(df,paste0(colnames(cat.column),j), > ifelse(df$Species==levels(as.factor(unlist(cat.column)))[j],1,0)) > } > > I am getting the below output for > > head(dummy.df.new) > > output: > > Sepal_Length Sepal_Width Petal_Length Petal_Width Species Species1 > 1 5.1 3.5 1.4 0.2 setosa 1 > 2 4.9 3.0 1.4 0.2 setosa 1 > 3 4.7 3.2 1.3 0.2 setosa 1 > 4 4.6 3.1 1.5 0.2 setosa 1 > 5 5.0 3.6 1.4 0.2 setosa 1 > 6 5.4 3.9 1.7 0.4 setosa 1 > > Problem: Species2 and Species3 column are not getting added to the > dataframe > > -- > Warm regards, > Devesh. > -- Franc
Re: pyspark: calculating row deltas
Thanks cheers On 10 January 2016 at 22:35, Blaž Šnuderl wrote: > This can be done using spark.sql and window functions. Take a look at > https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html > > On Sun, Jan 10, 2016 at 11:07 AM, Franc Carter > wrote: > >> >> Sure, for a dataframe that looks like this >> >> ID Year Value >> 1 2012 100 >> 1 2013 102 >> 1 2014 106 >> 2 2012 110 >> 2 2013 118 >> 2 2014 128 >> >> I'd like to get back >> >> ID Year Value >> 1 2013 2 >> 1 2014 4 >> 2 2013 8 >> 2 2014 10 >> >> i.e. the Value for an ID,Year combination is the Value for the ID,Year >> minus the Value for the ID,Year-1 >> >> thanks >> >> >> >> >> >> >> On 10 January 2016 at 20:51, Femi Anthony wrote: >> >>> Can you clarify what you mean with an actual example ? >>> >>> For example, if your data frame looks like this: >>> >>> ID Year Value >>> 1 2012 100 >>> 2 2013 101 >>> 3 2014 102 >>> >>> What's your desired output ? >>> >>> Femi >>> >>> >>> On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter >>> wrote: >>> >>>> >>>> Hi, >>>> >>>> I have a DataFrame with the columns >>>> >>>> ID,Year,Value >>>> >>>> I'd like to create a new Column that is Value2-Value1 where the >>>> corresponding Year2=Year-1 >>>> >>>> At the moment I am creating a new DataFrame with renamed columns and >>>> doing >>>> >>>>DF.join(DF2, . . . .) >>>> >>>> This looks cumbersome to me - is there a better way? >>>> >>>> thanks >>>> >>>> >>>> -- >>>> Franc >>>> >>> >>> >>> >>> -- >>> http://www.femibyte.com/twiki5/bin/view/Tech/ >>> http://www.nextmatrix.com >>> "Great spirits have always encountered violent opposition from mediocre >>> minds." - Albert Einstein. >>> >> >> >> >> -- >> Franc >> > > -- Franc
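For reference, a minimal sketch of the window-function approach from that link (assuming the DataFrame is named df; on Spark 1.x window functions need a HiveContext):

from pyspark.sql import Window
from pyspark.sql.functions import col, lag

w = Window.partitionBy("ID").orderBy("Year")
deltas = (df.withColumn("prev", lag("Value").over(w))
            .where(col("prev").isNotNull())
            .select("ID", "Year", (col("Value") - col("prev")).alias("Value")))

lag("Value") picks up the previous Year's Value within each ID, so the subtraction gives the per-year delta without a self-join.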
Re: pyspark: calculating row deltas
Sure, for a dataframe that looks like this

ID Year Value
1 2012 100
1 2013 102
1 2014 106
2 2012 110
2 2013 118
2 2014 128

I'd like to get back

ID Year Value
1 2013 2
1 2014 4
2 2013 8
2 2014 10

i.e. the Value for an ID,Year combination is the Value for the ID,Year minus the Value for the ID,Year-1 thanks On 10 January 2016 at 20:51, Femi Anthony wrote: > Can you clarify what you mean with an actual example ? > > For example, if your data frame looks like this: > > ID Year Value > 1 2012 100 > 2 2013 101 > 3 2014 102 > > What's your desired output ? > > Femi > > > On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter > wrote: > >> >> Hi, >> >> I have a DataFrame with the columns >> >> ID,Year,Value >> >> I'd like to create a new Column that is Value2-Value1 where the >> corresponding Year2=Year-1 >> >> At the moment I am creating a new DataFrame with renamed columns and >> doing >> >>DF.join(DF2, . . . .) >> >> This looks cumbersome to me - is there a better way? >> >> thanks >> >> >> -- >> Franc >> > > > > -- > http://www.femibyte.com/twiki5/bin/view/Tech/ > http://www.nextmatrix.com > "Great spirits have always encountered violent opposition from mediocre > minds." - Albert Einstein. > -- Franc
Re: pyspark: conditionals inside functions
Got it, I needed to use the when/otherwise construct - code below

def getSunday(day):
    day = day.cast("date")
    sun = next_day(day, "Sunday")
    n = datediff(sun, day)
    x = when(n == 7, day).otherwise(sun)
    return x

(The if/else version below fails because n is a Column expression rather than a Python boolean; when/otherwise builds the conditional into the expression itself.)

On 10 January 2016 at 08:41, Franc Carter wrote: > > My Python is not particularly good, so I'm afraid I don't understand what > that means > > cheers > > > On 9 January 2016 at 14:45, Franc Carter wrote: > >> >> Hi, >> >> I'm trying to write a short function that returns the last Sunday of the >> week of a given date, code below >> >> def getSunday(day): >> >> day = day.cast("date") >> >> sun = next_day(day, "Sunday") >> >> n = datediff(sun,day) >> >> if (n == 7): >> >> return day >> >> else: >> >> return sun >> >> >> this gives me >> >> ValueError: Cannot convert column into bool: >> >> >> Can someone point out what I am doing wrong >> >> thanks >> >> >> -- >> Franc >> > > -- > Franc > -- Franc
pyspark: calculating row deltas
Hi, I have a DataFrame with the columns ID,Year,Value I'd like to create a new Column that is Value2-Value1 where the corresponding Year2=Year-1 At the moment I am creating a new DataFrame with renamed columns and doing DF.join(DF2, . . . .) This looks cumbersome to me - is there a better way? thanks -- Franc
Re: pyspark: conditionals inside functions
My Python is not particularly good, so I'm afraid I don't understand what that means cheers On 9 January 2016 at 14:45, Franc Carter wrote: > > Hi, > > I'm trying to write a short function that returns the last Sunday of the > week of a given date, code below > > def getSunday(day): > > day = day.cast("date") > > sun = next_day(day, "Sunday") > > n = datediff(sun,day) > > if (n == 7): > > return day > > else: > > return sun > > > this gives me > > ValueError: Cannot convert column into bool: > > > Can someone point out what I am doing wrong > > thanks > > > -- > Franc > -- Franc
pyspark: conditionals inside functions
Hi, I'm trying to write a short function that returns the last Sunday of the week of a given date, code below

def getSunday(day):
    day = day.cast("date")
    sun = next_day(day, "Sunday")
    n = datediff(sun, day)
    if (n == 7):
        return day
    else:
        return sun

this gives me

ValueError: Cannot convert column into bool:

Can someone point out what I am doing wrong? thanks -- Franc
Re: number of executors in sparkR.init()
Thanks, that works cheers On 26 December 2015 at 16:53, Felix Cheung wrote: > The equivalent for spark-submit --num-executors should be > spark.executor.instances > when used in SparkConf. > http://spark.apache.org/docs/latest/running-on-yarn.html > > Could you try setting that with sparkR.init()? > > > _____ > From: Franc Carter > Sent: Friday, December 25, 2015 9:23 PM > Subject: number of executors in sparkR.init() > To: > > > > Hi, > > I'm having trouble working out how to get the number of executors set when > using sparkR.init(). > > If I start sparkR with > > sparkR --master yarn --num-executors 6 > > then I get 6 executors > > However, if start sparkR with > > sparkR > > followed by > > sc <- sparkR.init(master="yarn-client", > sparkEnvir=list(spark.num.executors='6')) > > then I only get 2 executors. > > Can anyone point me in the direction of what I might be doing wrong? I need > to initialise this way so that RStudio can hook in to SparkR > > thanks > > -- > Franc > > > -- Franc
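For reference, the setting that worked, as a minimal sketch (the value is passed as a string, and yarn-client mode is assumed as in the original post):

sc <- sparkR.init(master = "yarn-client",
                  sparkEnvir = list(spark.executor.instances = "6"))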
number of executors in sparkR.init()
Hi, I'm having trouble working out how to get the number of executors set when using sparkR.init(). If I start sparkR with sparkR --master yarn --num-executors 6 then I get 6 executors However, if start sparkR with sparkR followed by sc <- sparkR.init(master="yarn-client", sparkEnvir=list(spark.num.executors='6')) then I only get 2 executors. Can anyone point me in the direction of what I might be doing wrong? I need to initialise this way so that RStudio can hook in to SparkR thanks -- Franc
Re: SparkR csv without headers
Thanks - works nicely cheers On Fri, Aug 21, 2015 at 12:43 PM, Sun, Rui wrote: > Hi, > > You can create a DataFrame using load.df() with a specified schema. > > Something like: > > schema <- structType(structField("a", "string"), structField("b", "integer"), ...) > > read.df(..., schema = schema) > > *From:* Franc Carter [mailto:franc.car...@rozettatech.com] > *Sent:* Wednesday, August 19, 2015 1:48 PM > *To:* user@spark.apache.org > *Subject:* SparkR csv without headers > > > > Hi, > > Does anyone have an example of how to create a DataFrame in SparkR which > specifies the column names - the csv files I have do not have column names > in the first row. I can read a csv nicely with > com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2, > C3 etc > > thanks > > -- > > Franc Carter | Systems Architect | RoZetta Technology > -- Franc Carter | Systems Architect | RoZetta Technology
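For reference, a fuller sketch of Rui's suggestion (the column names and types are illustrative; read.df is the SparkR entry point and spark-csv the source used in the original post):

schema <- structType(structField("id", "string"),
                     structField("x", "integer"),
                     structField("y", "double"))
df <- read.df(sqlContext, "data.csv",
              source = "com.databricks.spark.csv",
              header = "false", schema = schema)

The structField names become the DataFrame's column names, replacing the C1, C2, C3 defaults.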
SparkR csv without headers
Hi, Does anyone have an example of how to create a DataFrame in SparkR which specifies the column names - the csv files I have do not have column names in the first row. I can read a csv nicely with com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2, C3 etc thanks -- Franc Carter | Systems Architect | RoZetta Technology
Column operation on Spark RDDs.
Hi, I have an RDD with MANY columns (e.g., hundreds), and most of my operation is on columns, e.g., I need to create many intermediate variables from different columns. What is the most efficient way to do this? For example, if my dataRDD[Array[String]] is like below:

123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
..
29, 94, 956, ..., 758

I will need to create a new column or a variable as newCol1 = 2ndCol+19thCol, and another new column based on newCol1 and the existing columns: newCol2 = function(newCol1, 34thCol). What is the best way of doing this? I have been thinking of using an index for the intermediate variables and the dataRDD, and then joining them together on the index to do my calculation:

var dataRDD = sc.textFile("/test.csv").map(_.split(","))
val dt = dataRDD.zipWithIndex.map(_.swap)
val newCol1 = dataRDD.map(x => x(1)+x(18)).zipWithIndex.map(_.swap)
val newCol2 = newCol1.join(dt).map(x => function(.))

Is there a better way of doing this? Thank you very much! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Column-operation-on-Spark-RDDs-tp23165.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
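For reference, a sketch that avoids the zipWithIndex/join entirely by deriving the new columns in a single pass per row (f stands in for the user's own newCol2 logic and is hypothetical; the parse assumes all fields are numeric):

def f(a: Double, b: Double): Double = a + b  // placeholder for the real function

val dataRDD = sc.textFile("/test.csv")
  .map(_.split(",").map(_.trim.toDouble))

val withNewCols = dataRDD.map { x =>
  val newCol1 = x(1) + x(18)      // 2nd + 19th column (0-based indexing)
  val newCol2 = f(newCol1, x(33)) // newCol1 combined with the 34th column
  x :+ newCol1 :+ newCol2
}

Because each new value depends only on its own row, a map keeps everything local to the partition - no shuffle, unlike the join.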
Re: How to add a column to a spark RDD with many columns?
Thanks for your reply! It is what I am after. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22740.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
How to add a column to a spark RDD with many columns?
Hi all, I have an RDD with *MANY* columns (e.g., *hundreds*), how do I add one more column at the end of this RDD? For example, if my RDD is like below:

123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
..
29, 94, 956, ..., 758

how can I efficiently add a column to it, whose value is the sum of the 2nd and the 200th columns? Thank you very much. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
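For reference, a minimal sketch (assuming an RDD[Array[String]] named dataRDD; the 2nd and 200th columns are 0-based indices 1 and 199):

val withSum = dataRDD.map { x =>
  x :+ (x(1).toDouble + x(199).toDouble).toString
}

A plain map is enough here - appending a value derived from the same row never needs a shuffle.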
Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?
One issue is that 'big' becomes 'not so big' reasonably quickly. A couple of TeraBytes is not that challenging (depending on the algorithm) these days, whereas 5 years ago it was a big challenge. We have a bit over a PetaByte (not using Spark) and using a distributed system is the only viable way to get reasonable performance for reasonable cost. cheers On Tue, Mar 31, 2015 at 4:55 AM, Steve Loughran wrote: > > On 30 Mar 2015, at 13:27, jay vyas wrote: > > Just the same as spark was disrupting the hadoop ecosystem by changing > the assumption that "you can't rely on memory in distributed > analytics"...now maybe we are challenging the assumption that "big data > analytics need to distributed"? > > I've been asking the same question lately and seen similarly that spark > performs quite reliably and well on local single node system even for an > app which I ran for a streaming app which I ran for ten days in a row... I > almost felt guilty that I never put it on a cluster! > > > Modern machines can be pretty powerful: 16 physical cores HT'd to 32, > 384+GB, GPU, giving you lots of compute. What you don't get is the storage > capacity to match, and especially, the IO bandwidth. RAID-0 striping 2-4 > HDDs gives you some boost, but if you are reading, say, a 4 GB file from > HDFS broken in to 256MB blocks, you have that data replicated into (4*4*3) > blocks: 48. Algorithm and capacity permitting, you've just massively > boosted your load time. Downstream, if data can be thinned down, then you > can start looking more at things you can do on a single host : a machine > that can be in your Hadoop cluster. Ask YARN nicely and you can get a > dedicated machine for a couple of days (i.e. until your Kerberos tokens > expire). > > -- Franc Carter | Systems Architect | RoZetta Technology
Re: FW: Submitting jobs to Spark EC2 cluster remotely
ise$DefaultPromise.ready(Promise.scala:219) > >> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > >> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > >> at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > >> at scala.concurrent.Await$.result(package.scala:107) > >> at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127) > >> at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) > >> at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59) > >> ... 7 more > >> > >> > >> > >> -----Original Message----- > >> From: Patrick Wendell [mailto:pwend...@gmail.com] > >> Sent: Monday, February 23, 2015 12:17 AM > >> To: Oleg Shirokikh > >> Subject: Re: Submitting jobs to Spark EC2 cluster remotely > >> > >> The reason is that the file needs to be in a globally visible > >> filesystem where the master node can download. So it needs to be on > >> s3, for instance, rather than on your local filesystem. > >> > >> - Patrick > >> > >> On Sun, Feb 22, 2015 at 11:55 PM, olegshirokikh > wrote: > >>> I've set up the EC2 cluster with Spark. Everything works, all > >>> master/slaves are up and running. > >>> > >>> I'm trying to submit a sample job (SparkPi). When I ssh to cluster > >>> and submit it from there - everything works fine. However when > >>> driver is created on a remote host (my laptop), it doesn't work. > >>> I've tried both modes for > >>> `--deploy-mode`: > >>> > >>> **`--deploy-mode=client`:** > >>> > >>> From my laptop: > >>> > >>> ./bin/spark-submit --master > >>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 > >>> --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar > >>> > >>> Results in the following indefinite warnings/errors: > >>> > >>>> WARN TaskSchedulerImpl: Initial job has not accepted any > >>>> resources; check your cluster UI to ensure that workers are > >>>> registered and have sufficient memory 15/02/22 18:30:45 > >>> > >>>> ERROR SparkDeploySchedulerBackend: Asked to remove non-existent > >>>> executor 0 > >>>> 15/02/22 18:30:45 > >>> > >>>> ERROR SparkDeploySchedulerBackend: Asked to remove non-existent > >>>> executor 1 > >>> > >>> ...and failed drivers - in Spark Web UI "Completed Drivers" with > >>> "State=ERROR" appear. > >>> > >>> I've tried to pass limits for cores and memory to submit script but > >>> it didn't help... > >>> > >>> **`--deploy-mode=cluster`:** > >>> > >>> From my laptop: > >>> > >>> ./bin/spark-submit --master > >>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 > >>> --deploy-mode cluster --class SparkPi > >>> ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar > >>> > >>> The result is: > >>> > >>>> Driver successfully submitted as driver-20150223023734-0007 ... > >>>> waiting before polling master for driver state ... polling master > >>>> for driver state State of driver-20150223023734-0007 is ERROR > >>>> Exception from cluster was: java.io.FileNotFoundException: File > >>>> file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.1 > >>>> 0 -0.0.1.jar does not exist. java.io.FileNotFoundException: File > >>>> file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar > >>>> does not exist.
at > >>>> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397) > >>>> at > >>>> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) > >>>> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) at > >>>> org.apache.spark.deploy.worker.DriverRunner.org $apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150) > >>>> at > >>>> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunne > >>>> r > >>>> .scala:75) > >>> > >>> So, I'd appreciate any pointers on what is going wrong and some > >>> guidance how to deploy jobs from remote client. Thanks. > >>> > >>> > >>> > >>> -- > >>> View this message in context: > >>> http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-jobs- > >>> t o-Spark-EC2-cluster-remotely-tp21762.html > >>> Sent from the Apache Spark User List mailing list archive at Nabble.com. > >>> > >>> > >>> - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For > >>> additional commands, e-mail: user-h...@spark.apache.org > >>> > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Franc Carter | Systems Architect | RoZetta Technology
How to sum up the values in the columns of a dataset in Scala?
I am new to Scala. I have a dataset with many columns, and each column has a column name. Given several column names (these column names are not fixed - they are generated dynamically), I need to sum up the values of these columns. Is there an efficient way of doing this? I worked out a way using a for loop, but I don't think it is efficient:

val AllLabels = List("ID", "val1", "val2", "val3", "val4")
val lbla = List("val1", "val3", "val4")
val index_lbla = lbla.map(x => AllLabels.indexOf(x))

val dataRDD = sc.textFile("../test.csv").map(_.split(","))

dataRDD.map(x => {
  var sum = 0.0
  for (i <- index_lbla) sum = sum + x(i).toDouble
  sum
}).collect

The test.csv looks like below (without column names):

"ID", "val1", "val2", "val3", "val4"
A, 123, 523, 534, 893
B, 536, 98, 1623, 98472
C, 537, 89, 83640, 9265
D, 7297, 98364, 9, 735
...

Your help is very much appreciated! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-sum-up-the-values-in-the-columns-of-a-dataset-in-Scala-tp21639.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
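For reference, a slightly more idiomatic sketch of the same computation (same index_lbla and dataRDD as above):

val sums = dataRDD.map(x => index_lbla.map(i => x(i).trim.toDouble).sum).collect()

Efficiency-wise the for loop and the map/sum are equivalent here - both are a single pass over each row inside the same map stage.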
Re: spark, reading from s3
Check that your timezone is correct as well; an incorrect timezone can make it look like your time is correct when it is actually skewed. cheers On Fri, Feb 13, 2015 at 5:51 AM, Kane Kim wrote: > The thing is that my time is perfectly valid... > > On Tue, Feb 10, 2015 at 10:50 PM, Akhil Das > wrote: > >> It's the timezone actually, you can either use NTP to maintain an >> accurate system clock or you can adjust your system time to match with the >> AWS one. You can do it as: >> >> telnet s3.amazonaws.com 80 >> GET / HTTP/1.0 >> >> >> [image: Inline image 1] >> >> Thanks >> Best Regards >> >> On Wed, Feb 11, 2015 at 6:43 AM, Kane Kim wrote: >> >>> I'm getting this warning when using s3 input: >>> 15/02/11 00:58:37 WARN RestStorageService: Adjusted time offset in >>> response to >>> RequestTimeTooSkewed error. Local machine and S3 server disagree on the >>> time by approximately 0 seconds. Retrying connection. >>> >>> After that there are tons of 403/forbidden errors and then job fails. >>> It's sporadic, so sometimes I get this error and sometimes not, what >>> could be the issue? >>> I think it could be related to network connectivity? >>> >> >> > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: Datastore HDFS vs Cassandra
I forgot to mention that if you do decide to use Cassandra I'd highly recommend jumping on the Cassandra mailing list; if we had taken in some of the advice on that list things would have been considerably smoother cheers On Wed, Feb 11, 2015 at 8:12 PM, Christian Betz < christian.b...@performance-media.de> wrote: > Hi > > Regarding the Cassandra Data model, there's an excellent post on the > ebay tech blog: > http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/. > There's also a slideshare for this somewhere. > > Happy hacking > > Chris > > From: Franc Carter > Date: Wednesday, 11 February 2015 10:03 > To: Paolo Platter > Cc: Mike Trienis , "user@spark.apache.org" < > user@spark.apache.org> > Subject: Re: Datastore HDFS vs Cassandra > > > One additional comment I would make is that you should be careful with > Updates in Cassandra, it does support them but large amounts of Updates > (i.e. changing existing keys) tend to cause fragmentation. If you are > (mostly) adding new keys (e.g. new records in the time series) then > Cassandra can be excellent > > cheers > > > On Wed, Feb 11, 2015 at 6:13 PM, Paolo Platter > wrote: > >> Hi Mike, >> >> I developed a Solution with cassandra and spark, using DSE. >> The main difficult is about cassandra, you need to understand very well >> its data model and its Query patterns. >> Cassandra has better performance than hdfs and it has DR and stronger >> availability. >> Hdfs is a filesystem, cassandra is a dbms. >> Cassandra supports full CRUD without acid. >> Hdfs is more flexible than cassandra. >> >> In my opinion, if you have a real time series, go with Cassandra paying >> attention at your reporting data access patterns. >> >> Paolo >> >> Sent from my Windows Phone >> -- >> From: Mike Trienis >> Sent: 11/02/2015 05:59 >> To: user@spark.apache.org >> Subject: Datastore HDFS vs Cassandra >> >> Hi, >> >> I am considering implementing Apache Spark on top of the Cassandra database after >> listening to a related talk and reading through the slides from DataStax. It >> seems to fit well with our time-series data and reporting requirements. >> >> >> http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data >> >> Does anyone have any experiences using Apache Spark and Cassandra, >> including >> limitations (and or) technical difficulties? How does Cassandra compare >> with >> HDFS and what use cases would make HDFS more suitable? >> >> Thanks, Mike. >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Datastore-HDFS-vs-Cassandra-tp21590.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > > > -- > > Franc Carter | Systems Architect | Rozetta Technology > > franc.car...@rozettatech.com | > www.rozettatechnology.com > > Tel: +61 2 8355 2515 > > Level 4, 55 Harrington St, The Rocks NSW 2000 > > PO Box H58, Australia Square, Sydney NSW 1215 > > AUSTRALIA > > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: Datastore HDFS vs Cassandra
One additional comment I would make is that you should be careful with Updates in Cassandra. It does support them, but large amounts of Updates (i.e. changing existing keys) tend to cause fragmentation. If you are (mostly) adding new keys (e.g. new records in the time series) then Cassandra can be excellent cheers On Wed, Feb 11, 2015 at 6:13 PM, Paolo Platter wrote: > Hi Mike, > > I developed a Solution with cassandra and spark, using DSE. > The main difficult is about cassandra, you need to understand very well > its data model and its Query patterns. > Cassandra has better performance than hdfs and it has DR and stronger > availability. > Hdfs is a filesystem, cassandra is a dbms. > Cassandra supports full CRUD without acid. > Hdfs is more flexible than cassandra. > > In my opinion, if you have a real time series, go with Cassandra paying > attention at your reporting data access patterns. > > Paolo > > Sent from my Windows Phone > -- > From: Mike Trienis > Sent: 11/02/2015 05:59 > To: user@spark.apache.org > Subject: Datastore HDFS vs Cassandra > > Hi, > > I am considering implementing Apache Spark on top of the Cassandra database after > listening to a related talk and reading through the slides from DataStax. It > seems to fit well with our time-series data and reporting requirements. > > > http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data > > Does anyone have any experiences using Apache Spark and Cassandra, > including > limitations (and or) technical difficulties? How does Cassandra compare > with > HDFS and what use cases would make HDFS more suitable? > > Thanks, Mike. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Datastore-HDFS-vs-Cassandra-tp21590.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > ----- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: How to create spark AMI in AWS
Hi, I'm very new to Spark, but experienced with AWS - so take that into account with my suggestions. I started with an AWS base image and then added the pre-built Spark-1.2. I then made a 'Master' version and a 'Worker' version, and made AMIs for them. The Master comes up with a static IP and the Worker image has this baked in. I haven't completed everything I am planning to do, but so far I can bring up the Master and a bunch of Workers inside an ASG and run Spark code successfully. cheers On Mon, Feb 9, 2015 at 10:06 PM, Guodong Wang wrote: > Hi guys, > > I want to launch spark cluster in AWS. And I know there is a spark_ec2.py > script. > > I am using the AWS service in China. But I can not find the AMI in the > region of China. > > So, I have to build one. My question is > 1. Where is the bootstrap script to create the Spark AMI? Is it here( > https://github.com/mesos/spark-ec2/blob/branch-1.3/create_image.sh) ? > 2. What is the base image of the Spark AMI? Eg, the base image of this ( > https://github.com/mesos/spark-ec2/blob/branch-1.3/ami-list/us-west-1/hvm) > 3. Shall I install scala during building the AMI? > > > Thanks. > > Guodong > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: No AMI for Spark 1.2 using ec2 scripts
AMIs are specific to an AWS region, so the ami-id of the Spark AMI in us-west will be different if it exists at all. I can't remember where, but I recall seeing that the AMI was only in us-east cheers On Mon, Jan 26, 2015 at 8:47 PM, Håkan Jonsson wrote: > Thanks, > > I also use Spark 1.2 with prebuilt for Hadoop 2.4. I launch both 1.1 and > 1.2 with the same command: > > ./spark-ec2 -k foo -i bar.pem launch mycluster > > By default this launches in us-east-1. I tried changing the the region > using: > > -r us-west-1 but that had the same result: > > Could not resolve AMI at: > https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm > > Looking up > https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm in a > browser results in the same AMI ID as yours. If I search for ami-7a320f3f > AMI in AWS, I can't find any such image. I tried searching in all regions I > could find in the AWS console. > > The AMI for 1.1 is spark.ami.pvm.v9 (ami-5bb18832). I can find that AMI in > us-west-1. > > Strange. Not sure what to do. > > /Håkan > > > On Mon Jan 26 2015 at 9:02:42 AM Charles Feduke > wrote: > > I definitely have Spark 1.2 running within EC2 using the spark-ec2 > scripts. I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later. > > What parameters are you using when you execute spark-ec2? > > > I am launching in the us-west-1 region (ami-7a320f3f) which may explain > things. > > On Mon Jan 26 2015 at 2:40:01 AM hajons wrote: > > Hi, > > When I try to launch a standalone cluster on EC2 using the scripts in the > ec2 directory for Spark 1.2, I get the following error: > > Could not resolve AMI at: > https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm > > It seems there is not yet any AMI available on EC2. Any ideas when there > will be one? > > This works without problems for version 1.1. Starting up a cluster using > these scripts is so simple and straightforward, so I am really missing it > on > 1.2. > > /Håkan > > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/No-AMI-for-Spark-1-2-using-ec2-scripts-tp21362.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Does DecisionTree model in MLlib deal with missing values?
Hi, I am new to the MLlib in Spark. Can the DecisionTree model in MLlib deal with missing values? If so, what data structure should I use for the input? Moreover, my data has categorical features, but LabeledPoint requires the "double" data type - what can I do in this case? Thank you very much. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-DecisionTree-model-in-MLlib-deal-with-missing-values-tp21080.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
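For reference: MLlib's decision trees have no built-in handling for missing values (the usual advice is to impute or drop them before training), while categorical features are supported by encoding each category as a 0-based double index and declaring the arity in categoricalFeaturesInfo. A minimal sketch with hypothetical data:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// feature 0 is categorical with 3 levels, encoded as 0.0 / 1.0 / 2.0
val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 5.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 3.4))
))
val categoricalFeaturesInfo = Map(0 -> 3)  // feature index -> number of categories
val model = DecisionTree.trainClassifier(data, 2, categoricalFeaturesInfo,
                                         "gini", 5, 32)

The doubles for a categorical feature are just indices; the tree treats them as unordered categories rather than as numbers.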
Re: Reading from a centralized store
Ah, so it's RDD specific - that would make sense. For those systems where it is possible to extract sensible subsets, the RDDs do so. My use case, which is probably biasing my thinking, is DynamoDB, which I don't think can efficiently extract records from M-to-N cheers On Wed, Jan 7, 2015 at 6:59 AM, Cody Koeninger wrote: > No, most rdds partition input data appropriately. > > On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter > wrote: > >> >> One more question, to clarify: will every node pull in all the data? >> >> thanks >> >> On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger >> wrote: >> >>> If you are not co-locating spark executor processes on the same machines >>> where the data is stored, and using an rdd that knows about which node to >>> prefer scheduling a task on, yes, the data will be pulled over the network. >>> >>> Of the options you listed, S3 and DynamoDB cannot have spark running on >>> the same machines. Cassandra can be run on the same nodes as spark, and >>> recent versions of the spark cassandra connector implement preferred >>> locations. You can run an rdbms on the same nodes as spark, but JdbcRDD >>> doesn't implement preferred locations. >>> >>> On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter < >>> franc.car...@rozettatech.com> wrote: >>> >>>> >>>> Hi, >>>> >>>> I'm trying to understand how a Spark Cluster behaves when the data it >>>> is processing resides on a centralized/remote store (S3, Cassandra, >>>> DynamoDB, RDBMS etc). >>>> >>>> Does every node in the cluster retrieve all the data from the central >>>> store ? >>>> >>>> thanks >>>> >>>> -- >>>> >>>> Franc Carter | Systems Architect | Rozetta Technology >>>> >>>> franc.car...@rozettatech.com | >>>> www.rozettatechnology.com >>>> >>>> Tel: +61 2 8355 2515 >>>> >>>> Level 4, 55 Harrington St, The Rocks NSW 2000 >>>> >>>> PO Box H58, Australia Square, Sydney NSW 1215 >>>> >>>> AUSTRALIA >>>> >>>> >>> >> >> >> -- >> >> Franc Carter | Systems Architect | Rozetta Technology >> >> franc.car...@rozettatech.com | >> www.rozettatechnology.com >> >> Tel: +61 2 8355 2515 >> >> Level 4, 55 Harrington St, The Rocks NSW 2000 >> >> PO Box H58, Australia Square, Sydney NSW 1215 >> >> AUSTRALIA >> >> > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: Reading from a centralized store
One more question, to clarify: will every node pull in all the data? thanks On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger wrote: > If you are not co-locating spark executor processes on the same machines > where the data is stored, and using an rdd that knows about which node to > prefer scheduling a task on, yes, the data will be pulled over the network. > > Of the options you listed, S3 and DynamoDB cannot have spark running on > the same machines. Cassandra can be run on the same nodes as spark, and > recent versions of the spark cassandra connector implement preferred > locations. You can run an rdbms on the same nodes as spark, but JdbcRDD > doesn't implement preferred locations. > > On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter > wrote: > >> >> Hi, >> >> I'm trying to understand how a Spark Cluster behaves when the data it is >> processing resides on a centralized/remote store (S3, Cassandra, DynamoDB, >> RDBMS etc). >> >> Does every node in the cluster retrieve all the data from the central >> store ? >> >> thanks >> >> -- >> >> Franc Carter | Systems Architect | Rozetta Technology >> >> franc.car...@rozettatech.com | >> www.rozettatechnology.com >> >> Tel: +61 2 8355 2515 >> >> Level 4, 55 Harrington St, The Rocks NSW 2000 >> >> PO Box H58, Australia Square, Sydney NSW 1215 >> >> AUSTRALIA >> >> > -- Franc Carter | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: Reading from a centralized store
Thanks, that's what I suspected. cheers On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger wrote: > If you are not co-locating spark executor processes on the same machines > where the data is stored, and using an rdd that knows about which node to > prefer scheduling a task on, yes, the data will be pulled over the network. > > Of the options you listed, S3 and DynamoDB cannot have spark running on > the same machines. Cassandra can be run on the same nodes as spark, and > recent versions of the spark cassandra connector implement preferred > locations. You can run an rdbms on the same nodes as spark, but JdbcRDD > doesn't implement preferred locations. > > On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter > wrote: > >> >> Hi, >> >> I'm trying to understand how a Spark Cluster behaves when the data it is >> processing resides on a centralized/remote store (S3, Cassandra, DynamoDB, >> RDBMS etc). >> >> Does every node in the cluster retrieve all the data from the central >> store ? >> >> thanks >> >> -- >> >> *Franc Carter* | Systems Architect | Rozetta Technology >> >> franc.car...@rozettatech.com | >> www.rozettatechnology.com >> >> Tel: +61 2 8355 2515 >> >> Level 4, 55 Harrington St, The Rocks NSW 2000 >> >> PO Box H58, Australia Square, Sydney NSW 1215 >> >> AUSTRALIA >> >> > -- *Franc Carter* | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Reading from a centralized store
Hi, I'm trying to understand how a Spark Cluster behaves when the data it is processing resides on a centralized/remote store (S3, Cassandra, DynamoDB, RDBMS etc). Does every node in the cluster retrieve all the data from the central store ? thanks -- *Franc Carter* | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: How to compile a Spark project in Scala IDE for Eclipse?
Thanks for your reply Wei, will try this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7224.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How to compile a Spark project in Scala IDE for Eclipse?
Thanks a lot Krishna, this works for me. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7223.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
How to compile a Spark project in Scala IDE for Eclipse?
Hi All, I just downloaded the Scala IDE for Eclipse. After I created a Spark project and clicked "Run", there was an error on this line of code "import org.apache.spark.SparkContext": "object apache is not a member of package org". I guess I need to import the Spark dependency into Scala IDE for Eclipse - can anyone tell me how to do it? Thanks a lot. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
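For reference, one common fix, assuming an sbt-based project: declare the Spark dependency and regenerate the Eclipse project files (e.g. with the sbteclipse plugin), or alternatively add the prebuilt Spark assembly jar to the project's Java Build Path. The version below is illustrative:

// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

Once the jar is on the build path, import org.apache.spark.SparkContext resolves.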
Re: How to get the help or explanation for the functions in Spark shell?
Thank you very much Gerard. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191p7193.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
How to get the help or explanation for the functions in Spark shell?
Hi All, I am new to Spark. In the Spark shell, how can I get the help or explanation for those functions that I can use on a variable or RDD? For example, after I input an RDD's name with a dot (.) at the end, if I press the Tab key, a list of functions that I can use for this RDD will be displayed, but I don't know how to use these functions. Your help is greatly appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: K-nearest neighbors search in Spark
Hi Andrew, Thank you for your info. I will have a look at these links. Thanks, Carter Date: Tue, 27 May 2014 09:06:02 -0700 From: ml-node+s1001560n6436...@n3.nabble.com To: gyz...@hotmail.com Subject: Re: K-nearest neighbors search in Spark Hi Carter, In Spark 1.0 there will be an implementation of k-means available as part of MLLib. You can see the documentation for that below (until 1.0 is fully released). https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/mllib-clustering.html Maybe diving into the source here will help get you started? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala Cheers, Andrew On Tue, May 27, 2014 at 4:10 AM, Carter <[hidden email]> wrote: Any suggestion is very much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6465.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: K-nearest neighbors search in Spark
Hi Krishna, Thank you very much for your code. I will use it as a good start point. Thanks, Carter Date: Tue, 27 May 2014 16:42:39 -0700 From: ml-node+s1001560n6455...@n3.nabble.com To: gyz...@hotmail.com Subject: Re: K-nearest neighbors search in Spark Carter, Just as a quick & simple starting point for Spark. (caveats - lots of improvements reqd for scaling, graceful and efficient handling of RDD et al):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import scala.collection.immutable.ListMap
import scala.collection.immutable.SortedMap

object TopK {
  //
  def getCurrentDirectory = new java.io.File(".").getCanonicalPath
  //
  def distance(x1: List[Int], x2: List[Int]): Double = {
    val dist: Double = math.sqrt(math.pow(x1(1)-x2(1), 2) + math.pow(x1(2)-x2(2), 2))
    dist
  }
  //
  def main(args: Array[String]): Unit = {
    // println(getCurrentDirectory)
    val sc = new SparkContext("local", "TopK", "spark://USS-Defiant.local:7077")
    println(s"Running Spark Version ${sc.version}")
    val file = sc.textFile("data01.csv")
    //
    val data = file
      .map(line => line.split(","))
      .map(x1 => List(x1(0).toInt, x1(1).toInt, x1(2).toInt))
    // val data1 = data.collect
    println("data")
    for (d <- data) {
      println(d)
      println(d(0))
    }
    //
    val distList = for (d <- data) yield { d(0) }
    // for (d <- distList) (println(d))
    val zipList = for (a <- distList.collect; b <- distList.collect) yield { List(a, b) }
    zipList.foreach(println(_))
    //
    val dist = for (l <- zipList) yield {
      println(s"${l(0)} = ${l(1)}")
      val x1a: Array[List[Int]] = data.filter(d => d(0) == l(0)).collect
      val x2a: Array[List[Int]] = data.filter(d => d(0) == l(1)).collect
      val x1: List[Int] = x1a(0)
      val x2: List[Int] = x2a(0)
      val dist = distance(x1, x2)
      Map(dist -> l)
    }
    dist.foreach(println(_))
    // sort this for topK
    //
  }
}

data01.csv

1,68,93
2,12,90
3,45,76
4,86,54

HTH. Cheers On Tue, May 27, 2014 at 4:10 AM, Carter <[hidden email]> wrote: Any suggestion is very much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6464.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: K-nearest neighbors search in Spark
Any suggestion is very much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
K-nearest neighbors search in Spark
Hi all, I want to implement a basic K-nearest neighbors search in Spark, but I am totally new to Scala so don't know where to start. My data consists of millions of points. For each point, I need to compute its Euclidean distance to the other points, and return the top-K points that are closest to it. The data.txt is in the comma-separated format like this:

ID, X, Y
1, 68, 93
2, 12, 90
3, 45, 76
...
100, 86, 54

Could you please tell me what data structure I should use, and how to implement this algorithm in Scala (*some sample code is greatly appreciated*). Thank you very much. Regards, Carter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
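For reference, a naive brute-force sketch in the spirit of the thread (cartesian generates all O(n²) pairs, so this is only workable for modest sizes; the file name and k are illustrative):

val points = sc.textFile("data.txt")
  .filter(!_.startsWith("ID"))            // skip the header row
  .map(_.split(",").map(_.trim))
  .map(a => (a(0).toInt, (a(1).toDouble, a(2).toDouble)))

val k = 5
val knn = points.cartesian(points)
  .filter { case ((id1, _), (id2, _)) => id1 != id2 }
  .map { case ((id1, (x1, y1)), (id2, (x2, y2))) =>
    (id1, (id2, math.sqrt(math.pow(x1 - x2, 2) + math.pow(y1 - y2, 2))))
  }
  .groupByKey()
  .mapValues(_.toSeq.sortBy(_._2).take(k))  // k nearest neighbours per point

For millions of points the all-pairs approach won't scale; some form of spatial partitioning or locality-sensitive hashing would be needed.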
Re: "sbt/sbt run" command returns a JVM problem
Hi Akhil, Thanks for your reply. I have tried this option with different values, but it still doesn't work. The Java version I am using is jre1.7.0_55; does the Java version matter for this problem? Thanks.
RE: "sbt/sbt run" command returns a JVM problem
Hi, I still have over 1G left for my program.

Date: Sun, 4 May 2014 19:14:30 -0700
From: ml-node+s1001560n5340...@n3.nabble.com
To: gyz...@hotmail.com
Subject: Re: "sbt/sbt run" command returns a JVM problem

The total memory of your machine is 2G, right? Then how much memory is left free? Wouldn't Ubuntu take up quite a big portion of the 2G? Just a guess!

On Sat, May 3, 2014 at 8:15 PM, Carter <[hidden email]> wrote:
> Hi, thanks for all your help. I tried your setting in the sbt file, but the problem is still there. The Java setting in my sbt file is:
>
> java \
>   -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \
>   -jar ${JAR} \
>   "$@"
>
> I have tried setting these 3 parameters bigger and smaller, but nothing works. Did I change the right thing? Thank you very much.
Re: "sbt/sbt run" command returns a JVM problem
Hi Michael,

The log after I typed "last" is as below:

> last
scala.tools.nsc.MissingRequirementError: object scala not found.
	at scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655)
	at scala.tools.nsc.symtab.Definitions$definitions$.getModule(Definitions.scala:605)
	at scala.tools.nsc.symtab.Definitions$definitions$.ScalaPackage(Definitions.scala:145)
	at scala.tools.nsc.symtab.Definitions$definitions$.ScalaPackageClass(Definitions.scala:146)
	at scala.tools.nsc.symtab.Definitions$definitions$.AnyClass(Definitions.scala:176)
	at scala.tools.nsc.symtab.Definitions$definitions$.init(Definitions.scala:814)
	at scala.tools.nsc.Global$Run.<init>(Global.scala:697)
	at sbt.compiler.Eval$$anon$1.<init>(Eval.scala:53)
	at sbt.compiler.Eval.run$1(Eval.scala:53)
	at sbt.compiler.Eval.unlinkAll$1(Eval.scala:56)
	at sbt.compiler.Eval.eval(Eval.scala:62)
	at sbt.EvaluateConfigurations$.evaluateSetting(Build.scala:104)
	at sbt.BuiltinCommands$$anonfun$set$1.apply(Main.scala:212)
	at sbt.BuiltinCommands$$anonfun$set$1.apply(Main.scala:209)
	at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
	at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
	at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
	at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
	at sbt.Command$.process(Command.scala:90)
	at sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
	at sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
	at sbt.State$$anon$2.process(State.scala:171)
	at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
	at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
	at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
	at sbt.MainLoop$.next(MainLoop.scala:71)
	at sbt.MainLoop$.run(MainLoop.scala:64)
	at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:53)
	at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:50)
	at sbt.Using.apply(Using.scala:25)
	at sbt.MainLoop$.runWithNewLog(MainLoop.scala:50)
	at sbt.MainLoop$.runAndClearLast(MainLoop.scala:33)
	at sbt.MainLoop$.runLoggedLoop(MainLoop.scala:17)
	at sbt.MainLoop$.runLogged(MainLoop.scala:13)
	at sbt.xMain.run(Main.scala:26)
	at xsbt.boot.Launch$.run(Launch.scala:55)
	at xsbt.boot.Launch$$anonfun$explicit$1.apply(Launch.scala:45)
	at xsbt.boot.Launch$.launch(Launch.scala:60)
	at xsbt.boot.Launch$.apply(Launch.scala:16)
	at xsbt.boot.Boot$.runImpl(Boot.scala:31)
	at xsbt.boot.Boot$.main(Boot.scala:20)
	at xsbt.boot.Boot.main(Boot.scala)
[error] scala.tools.nsc.MissingRequirementError: object scala not found.
[error] Use 'last' for the full log.

And my sbt file is like below (my sbt launcher is "sbt-launch-0.12.4.jar" in the same folder):

#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This script launches sbt for this project. If present it uses the system
# version of sbt. If there is no system version of sbt it attempts to download
# sbt locally.
SBT_VERSION=`awk -F "=" '/sbt\.version/ {print $2}' ./project/build.properties`
URL1=http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
URL2=http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
JAR=sbt/sbt-launch-${SBT_VERSION}.jar

# Download sbt launch jar if it hasn't been downloaded yet
if [ ! -f ${JAR} ]; then
  # Download
  printf "Attempting to fetch sbt\n"
  if hash curl 2>/dev/null; then
    curl --progress-bar ${URL1} > ${JAR} || curl --progress-bar ${URL2} > ${JAR}
  elif hash wget 2>/dev/null; then
    wget --progress=bar ${URL1} -O ${JAR} || wget --progress=bar ${URL2} -O ${JAR}
  else
    printf "You do not have curl or wget installed, please install sbt manually from http://www.scala-sbt.org/\n"
    exit -1
  fi
fi
if [ ! -f ${JAR} ]; then
  # We failed to download
  printf "Our at
Re: "sbt/sbt run" command returns a JVM problem
Hi Michael, Thank you very much for your reply. Sorry, I am not very familiar with sbt. Could you tell me where to set the Java options for the sbt fork for my program? I brought up the sbt console and ran

set javaOptions += "-Xmx1G"

in it, but it returned an error:

[error] scala.tools.nsc.MissingRequirementError: object scala not found.
[error] Use 'last' for the full log.

Is this the right way to set the Java options? Thank you very much.
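For reference, the usual place for this setting is the project's build.sbt rather than an interactive `set` in a broken session. A minimal sketch, illustrative and not the poster's actual build file:

// build.sbt - illustrative sketch
// fork := true makes `sbt run` launch the program in a separate JVM,
// so the javaOptions below apply to the forked program, not to sbt itself.
fork := true

javaOptions += "-Xmx1g"

(In .sbt files of that era, settings had to be separated by blank lines, as above.)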
Re: "sbt/sbt run" command returns a JVM problem
Hi, thanks for all your help. I tried your setting in the sbt file, but the problem is still there. The Java setting in my sbt file is:

java \
  -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \
  -jar ${JAR} \
  "$@"

I have tried setting these 3 parameters bigger and smaller, but nothing works. Did I change the right thing? Thank you very much.
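As a sanity check on these numbers (my arithmetic, not from the thread): 1200m of heap plus 350m of PermGen plus 256m of code cache is 1806 MB, roughly 1.8G, that the JVM tries to reserve up front, before the OS, the desktop environment, and sbt's own JVM are counted. On a 2G machine that is a plausible cause of "Could not reserve enough space for object heap", which is consistent with the question upthread about how much memory is actually free.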
"sbt/sbt run" command returns a JVM problem
Hi, I have a very simple Spark program written in Scala:

/*** testApp.scala ***/
object testApp {
  def main(args: Array[String]) {
    println("Hello! World!")
  }
}

Then I use the following command to compile it:

$ sbt/sbt package

The compilation finished successfully and I got a JAR file. But when I use this command to run it:

$ sbt/sbt run

it returned an error from the JVM:

[info] Error occurred during initialization of VM
[info] Could not reserve enough space for object heap
[error] Error: Could not create the Java Virtual Machine.
[error] Error: A fatal exception has occurred. Program will exit.
java.lang.RuntimeException: Nonzero exit code returned from runner: 1
	at scala.sys.package$.error(package.scala:27)

My machine has 2G of memory and runs Ubuntu 11.04. I also tried to change the Java parameter settings (e.g., -Xmx, -Xms, -XX:MaxPermSize, -XX:ReservedCodeCacheSize) in the file sbt/sbt, but it looks like none of the changes work. Can anyone help me out with this problem? Thank you very much.
RE: Need help about how hadoop works.
Thank you very much Prashant.

Date: Thu, 24 Apr 2014 01:24:39 -0700
From: ml-node+s1001560n4739...@n3.nabble.com
To: gyz...@hotmail.com
Subject: Re: Need help about how hadoop works.

It is the same file, and the Hadoop library that we use for splitting takes care of assigning the right split to each node.

Prashant Sharma

On Thu, Apr 24, 2014 at 1:36 PM, Carter <[hidden email]> wrote:
> Thank you very much for your help Prashant. Sorry, I still have another question about your answer: "however if the file ("/home/scalatest.txt") is present on the same path on all systems it will be processed on all nodes." When putting the file at the same path on all nodes, do we simply copy the same file to all nodes, or do we need to split the original file into different parts (each part still with the same file name "scalatest.txt") and copy each part to a different node for parallelization? Thank you very much.
Re: Need help about how hadoop works.
Thank you very much for your help Prashant. Sorry, I still have another question about your answer: "however if the file ("/home/scalatest.txt") is present on the same path on all systems it will be processed on all nodes." When putting the file at the same path on all nodes, do we simply copy the same file to all nodes, or do we need to split the original file into different parts (each part still with the same file name "scalatest.txt") and copy each part to a different node for parallelization? Thank you very much.
Re: Need help about how hadoop works.
Thanks Mayur. So without Hadoop or any other distributed file system, by running:

val doc = sc.textFile("/home/scalatest.txt", 5)
doc.count

we can only get parallelization within the computer where the file is loaded, but not parallelization across the computers in the cluster (Spark cannot automatically duplicate the file to the other computers in the cluster) - is this understanding correct? Thank you.
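To make the distinction concrete, here is a small sketch; the paths and partition counts are illustrative assumptions, not from the thread:

// Without a distributed filesystem: every node that runs a task must be
// able to read this exact path, so the file has to exist on each worker
// (or the job runs only where the file is available).
val localDoc = sc.textFile("/home/scalatest.txt", 5) // 5 = minimum partitions
println(localDoc.count())

// With HDFS (hypothetical path): the blocks are already spread across the
// cluster and tasks are scheduled near their blocks, so the count is
// distributed automatically.
val hdfsDoc = sc.textFile("hdfs:///user/carter/scalatest.txt", 5)
println(hdfsDoc.count())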
Need help about how hadoop works.
Hi, I am a beginner with Hadoop and Spark, and want some help in understanding how Hadoop works. Suppose we have a cluster of 5 computers and install Spark on the cluster WITHOUT Hadoop, and then we run this code on one computer:

val doc = sc.textFile("/home/scalatest.txt", 5)
doc.count

Can the "count" task be distributed to all 5 computers? Or is it only run by 5 parallel threads on the current computer? On the other hand, if we install Hadoop on the cluster and upload the data into HDFS, when running the same code will this "count" task be done by 25 threads? Thank you very much for your help.