Re: Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
ng and char etc) > Do you extract only the stuff needed? What are the algorithm parameters? > On 07 Jun 2016, at 13:09, Franc Carter <franc.car...@gmail.com> wrote: > Hi, I am training a RandomForest Regression Model on Spark-1

Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
Hi, I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and am interested in how it might be best to scale it - e.g. more CPUs per instance, more memory per instance, more instances etc. I'm currently using 32 m3.xlarge instances for a training set with 2.5 million rows, 1300
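For context, a minimal pyspark.mllib sketch of the kind of training job described here; the thread doesn't show the actual code or API, so the input path, the CSV parsing and all parameter values below are illustrative assumptions. numTrees, maxDepth/maxBins and the number of input partitions are the settings that interact most directly with the instance count and memory being asked about.

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import RandomForest

    sc = SparkContext(appName="rf-regression-sketch")

    # Hypothetical input: one CSV record per line, label in the last field.
    def to_point(line):
        fields = [float(x) for x in line.split(",")]
        return LabeledPoint(fields[-1], fields[:-1])

    # 128 partitions is a placeholder; more partitions means more parallelism per pass.
    training = sc.textFile("s3://some-bucket/training.csv", 128).map(to_point).cache()

    # Placeholder parameters - tune these alongside executor count and memory.
    model = RandomForest.trainRegressor(training,
                                        categoricalFeaturesInfo={},
                                        numTrees=100,
                                        featureSubsetStrategy="auto",
                                        impurity="variance",
                                        maxDepth=10,
                                        maxBins=32,
                                        seed=42)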

Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
cess can't find the graphframes Python code when it is loaded as > a Spark package. > To work around this, I extract the graphframes Python directory locally > where I run pyspark into a directory called graphframes. > On Thu, Mar 17, 2016 at 10:11 PM -0700,

Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
I'm having trouble with that for pyspark, yarn and graphframes. I'm using: pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5 which starts and gives me a REPL, but when I try from graphframes import * I get No module named graphframes. Without '--master yarn' it
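For reference, a sketch of the workaround described in the first reply above (extracting the graphframes Python package into the directory where pyspark is launched); the paths and the sys.path line are illustrative assumptions, not something spelled out in the thread:

    # launched as in the thread (from a shell):
    #   pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5
    import sys, os

    # directory containing the extracted graphframes/ Python package
    sys.path.insert(0, os.path.abspath("."))

    from graphframes import *   # the failing import should now resolve on the driver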

Re: filter by dict() key in pySpark

2016-02-24 Thread Franc Carter
A colleague found how to do this, the approach was to use a udf(). cheers On 21 February 2016 at 22:41, Franc Carter <franc.car...@gmail.com> wrote: > I have a DataFrame that has a Python dict() as one of the columns. I'd like to filter the DataFrame for those Rows that
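The thread doesn't include the colleague's udf, but a minimal pyspark sketch of that approach might look like the following (DF1 and its dict-valued 'params' column come from the original question below; the helper name contains_name is made up):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    # True when the dict in 'params' has the key 'name'
    contains_name = udf(lambda params: params is not None and 'name' in params,
                        BooleanType())

    DF2 = DF1.filter(contains_name(DF1.params))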

filter by dict() key in pySpark

2016-02-21 Thread Franc Carter
I have a DataFrame that has a Python dict() as one of the columns. I'd like to filter the DataFrame for those Rows where the dict() contains a specific value, e.g. something like this: DF2 = DF1.filter('name' in DF1.params) but that gives me this error: ValueError: Cannot convert column

Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
dded column and in the > end the last added column (in the loop) will be the added column, like in > my code above. > On Wed, Feb 3, 2016 at 5:05 PM, Franc Carter <franc.car...@gmail.com> wrote: > I had problems doing this as well - I ended up using 'wit

Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
I had problems doing this as well - I ended up using 'withColumn'; it's not particularly graceful but it worked (1.5.2 on AWS EMR). cheers On 3 February 2016 at 22:06, Devesh Raj Singh wrote: > Hi, > I am trying to create dummy variables in sparkR by creating new

Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
00 > 2 2013 101 > 3 2014 102 > What's your desired output? > Femi > On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter <franc.car...@gmail.com> wrote: > Hi, I have a DataFrame with the columns

Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
Thanks cheers On 10 January 2016 at 22:35, Blaž Šnuderl <snud...@gmail.com> wrote: > This can be done using spark.sql and window functions. Take a look at > https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html > > On Sun, Jan 10, 2016 at 11:0

Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
My Python is not particularly good, so I'm afraid I don't understand what that means. cheers On 9 January 2016 at 14:45, Franc Carter <franc.car...@gmail.com> wrote: > Hi, > I'm trying to write a short function that returns the last Sunday of the week of a given dat

pyspark: calculating row deltas

2016-01-09 Thread Franc Carter
Hi, I have a DataFrame with the columns ID, Year, Value. I'd like to create a new Column that is Value2-Value1 where the corresponding Year2=Year-1. At the moment I am creating a new DataFrame with renamed columns and doing DF.join(DF2, . . . .). This looks cumbersome to me, is there
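A minimal pyspark sketch of the window-function approach suggested in the reply above, using the ID/Year/Value columns described here; it assumes there is one row per ID for each consecutive Year, and the Delta column name is a placeholder:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, lag

    # Previous year's Value within each ID, ordered by Year
    w = Window.partitionBy("ID").orderBy("Year")
    withDelta = DF.withColumn("Delta", col("Value") - lag("Value", 1).over(w))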

Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
Got it, I needed to use the when/otherwise construct - code below (it needs from pyspark.sql.functions import next_day, datediff, when):

    def getSunday(day):
        day = day.cast("date")
        sun = next_day(day, "Sunday")
        n = datediff(sun, day)
        x = when(n == 7, day).otherwise(sun)
        return x

pyspark: conditionals inside functions

2016-01-08 Thread Franc Carter
Hi, I'm trying to write a short function that returns the last Sunday of the week of a given date, code below:

    def getSunday(day):
        day = day.cast("date")
        sun = next_day(day, "Sunday")
        n = datediff(sun, day)
        if (n == 7):
            return day
        else:
            return sun

this

number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
Hi, I'm having trouble working out how to get the number of executors set when using sparkR.init(). If I start sparkR with sparkR --master yarn --num-executors 6 then I get 6 executors. However, if I start sparkR with sparkR followed by sc <- sparkR.init(master="yarn-client",

Re: number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
> Could you try setting that with sparkR.init()? > From: Franc Carter <franc.car...@gmail.com> > Sent: Friday, December 25, 2015 9:23 PM > Subject: number of executors in sparkR.init() > To: <user@spark.apache.org>

Re: SparkR csv without headers

2015-08-20 Thread Franc Carter
( …, schema = schema) From: Franc Carter Sent: Wednesday, August 19, 2015 1:48 PM To: user@spark.apache.org Subject: SparkR csv without headers Hi, Does anyone have an example of how to create a DataFrame in SparkR which specifies the column

SparkR csv without headers

2015-08-18 Thread Franc Carter

Column operation on Spark RDDs.

2015-06-04 Thread Carter
Hi, I have an RDD with MANY columns (e.g., hundreds), and most of my operations are on columns; e.g., I need to create many intermediate variables from different columns. What is the most efficient way to do this? For example, if my data RDD[Array[String]] is like below: 123, 523, 534, ..., 893

Re: How to add a column to a spark RDD with many columns?

2015-05-02 Thread Carter
Thanks for your reply! It is what I am after. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22740.html

How to add a column to a spark RDD with many columns?

2015-04-30 Thread Carter
Hi all, I have an RDD with *MANY* columns (e.g., *hundreds*); how do I add one more column at the end of this RDD? For example, if my RDD is like below: 123, 523, 534, ..., 893 536, 98, 1623, ..., 98472 537, 89, 83640, ..., 9265 7297, 98364, 9, ..., 735 .. 29, 94,
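The actual reply isn't shown in this archive, but the usual approach is a plain map that appends the new field to each row. A sketch in pyspark (the original looks like Scala, where the equivalent is row :+ newValue; the path, delimiter and derived value below are invented for illustration):

    # assumes an existing SparkContext named sc
    rows = sc.textFile("hdfs:///some/path/data.csv") \
             .map(lambda line: [f.strip() for f in line.split(",")])

    def extra_column(row):
        return str(int(row[0]) + int(row[1]))   # placeholder derivation of the new value

    with_extra = rows.map(lambda row: row + [extra_column(row)])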

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Franc Carter

Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Franc Carter

Re: spark, reading from s3

2015-02-12 Thread Franc Carter
by approximately 0 seconds. Retrying connection. After that there are tons of 403/forbidden errors and then the job fails. It's sporadic, so sometimes I get this error and sometimes not. What could be the issue? I think it could be related to network connectivity?

How to sum up the values in the columns of a dataset in Scala?

2015-02-12 Thread Carter
I am new to Scala. I have a dataset with many columns, each column has a column name. Given several column names (these column names are not fixed, they are generated dynamically), I need to sum up the values of these columns. Is there an efficient way of doing this? I worked out a way by using

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
...@performance-media.de wrote: Hi, regarding the Cassandra data model, there's an excellent post on the eBay tech blog: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/. There's also a slideshare for this somewhere. Happy hacking, Chris. From: Franc Carter

Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Franc Carter

Does DecisionTree model in MLlib deal with missing values?

2015-01-10 Thread Carter
Hi, I am new to the MLlib in Spark. Can the DecisionTree model in MLlib deal with missing values? If so, what data structure should I use for the input? Moreover, my data has categorical features, but the LabeledPoint requires double data type, in this case what can I do? Thank you very much.
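On the categorical-feature part of the question: the usual MLlib pattern is to map each category to an index 0..k-1, store that index as a double inside the LabeledPoint, and declare the feature through categoricalFeaturesInfo (missing values have no special handling in MLlib's trees, so they are typically imputed or filtered out first). A minimal pyspark sketch, where the colour feature, class count and parameters are invented for illustration:

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    # Hypothetical record: (label, a 3-valued categorical feature, a numeric feature).
    # Categories are mapped to indices 0..k-1 and stored as doubles in the feature vector.
    CATEGORY_INDEX = {"red": 0.0, "green": 1.0, "blue": 2.0}

    def to_point(record):
        label, colour, x = record
        return LabeledPoint(label, [CATEGORY_INDEX[colour], x])

    # assumes an existing SparkContext named sc
    data = sc.parallelize([(1.0, "red", 3.2), (0.0, "blue", 1.5)]).map(to_point)

    # Feature 0 is declared categorical with 3 values; feature 1 is continuous.
    model = DecisionTree.trainClassifier(data,
                                         numClasses=2,
                                         categoricalFeaturesInfo={0: 3},
                                         impurity="gini",
                                         maxDepth=5,
                                         maxBins=32)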

Re: Reading from a centralized stored

2015-01-06 Thread Franc Carter
AM, Cody Koeninger c...@koeninger.org wrote: No, most RDDs partition input data appropriately. On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter franc.car...@rozettatech.com wrote: One more question, to clarify: will every node pull in all the data? thanks On Tue, Jan 6, 2015 at 12:56 PM

Re: Reading from a centralized stored

2015-01-05 Thread Franc Carter
on the same nodes as spark, but JdbcRDD doesn't implement preferred locations. On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter franc.car...@rozettatech.com wrote: Hi, I'm trying to understand how a Spark Cluster behaves when the data it is processing resides on a centralized/remote store (S3

Reading from a centralized stored

2015-01-05 Thread Franc Carter
Hi, I'm trying to understand how a Spark cluster behaves when the data it is processing resides on a centralized/remote store (S3, Cassandra, DynamoDB, RDBMS etc). Does every node in the cluster retrieve all the data from the central store? thanks

How to get the help or explanation for the functions in Spark shell?

2014-06-08 Thread Carter
Hi All, I am new to Spark. In the Spark shell, how can I get the help or explanation for those functions that I can use for a variable or RDD? For example, after I input a RDD's name with a dot (.) at the end, if I press the Tab key, a list of functions that I can use for this RDD will be

Re: How to get the help or explanation for the functions in Spark shell?

2014-06-08 Thread Carter
Thank you very much Gerard. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191p7193.html

How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Carter
Hi All, I just downloaded the Scala IDE for Eclipse. After I created a Spark project and clicked Run, there was an error on the line import org.apache.spark.SparkContext: "object apache is not a member of package org". I guess I need to import the Spark dependency into Scala IDE for

Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Carter
Thanks a lot Krishna, this works for me. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7223.html

Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Carter
Thanks for your reply Wei, will try this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7224.html

Re: K-nearest neighbors search in Spark

2014-05-27 Thread Carter
Any suggestion is very much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html

K-nearest neighbors search in Spark

2014-05-26 Thread Carter
much. Regards, Carter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393.html

RE: sbt/sbt run command returns a JVM problem

2014-05-05 Thread Carter
? wouldn't Ubuntu take up quite a big portion of 2G? Just a guess! On Sat, May 3, 2014 at 8:15 PM, Carter [hidden email] wrote: Hi, thanks for all your help. I tried your setting in the sbt file, but the problem is still there. The Java setting in my sbt file is: java \ -Xmx1200m

Re: sbt/sbt run command returns a JVM problem

2014-05-04 Thread Carter
Hi Michael, The log after I typed last is as below:

    last
    scala.tools.nsc.MissingRequirementError: object scala not found.
        at scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655)
        at

Re: sbt/sbt run command returns a JVM problem

2014-05-03 Thread Carter
Hi, thanks for all your help. I tried your setting in the sbt file, but the problem is still there. The Java setting in my sbt file is:

    java \
      -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \
      -jar ${JAR} \
      $@

I have tried to set these 3 parameters bigger and smaller, but

Re: sbt/sbt run command returns a JVM problem

2014-05-03 Thread Carter
Hi Michael, Thank you very much for your reply. Sorry I am not very familiar with sbt. Could you tell me where to set the Java option for the sbt fork for my program? I brought up the sbt console, and run set javaOptions += -Xmx1G in it, but it returned an error: [error]

sbt/sbt run command returns a JVM problem

2014-05-01 Thread Carter
Hi, I have a very simple spark program written in Scala:

    /*** testApp.scala ***/
    object testApp {
      def main(args: Array[String]) {
        println("Hello! World!")
      }
    }

Then I use the following command to compile it: $ sbt/sbt package The compilation finished successfully and I got a JAR file. But

Re: Need help about how hadoop works.

2014-04-24 Thread Carter
Thanks Mayur. So without Hadoop and any other distributed file systems, by running: val doc = sc.textFile("/home/scalatest.txt", 5) and doc.count, we can only get parallelization within the computer where the file is loaded, but not the parallelization within the computers in the cluster (Spark

Re: Need help about how hadoop works.

2014-04-24 Thread Carter
Thank you very much for your help Prashant. Sorry I still have another question about your answer: however if the file(/home/scalatest.txt) is present on the same path on all systems it will be processed on all nodes. When presenting the file to the same path on all nodes, do we just simply copy

RE: Need help about how hadoop works.

2014-04-24 Thread Carter
split to each node. Prashant Sharma On Thu, Apr 24, 2014 at 1:36 PM, Carter [hidden email] wrote: Thank you very much for your help Prashant. Sorry I still have another question about your answer: however if the file (/home/scalatest.txt) is present on the same path on all systems

Need help about how hadoop works.

2014-04-23 Thread Carter
Hi, I am a beginner of Hadoop and Spark, and want some help in understanding how Hadoop works. If we have a cluster of 5 computers, and install Spark on the cluster WITHOUT Hadoop, and then we run this code on one computer: val doc = sc.textFile("/home/scalatest.txt", 5) and doc.count. Can the count task