Re: Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
…ng and char etc.) > Do you extract only the stuff needed? What are the algorithm parameters? > On 07 Jun 2016, at 13:09, Franc Carter <franc.car...@gmail.com> wrote: > Hi, I am training a RandomForest Regression Model on Spark-1…

Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
Hi, I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and am interested in how it might be best to scale it - e.g. more CPUs per instance, more memory per instance, more instances etc. I'm currently using 32 m3.xlarge instances for a training set with 2.5 million rows, 1300
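
A minimal sketch of this kind of training run, assuming the RDD-based MLlib API on Spark 1.6 (the input path and parameter values below are illustrative, not taken from the thread); numTrees, maxDepth and maxBins are the main levers that drive memory and CPU cost per executor:

    from pyspark import SparkContext
    from pyspark.mllib.tree import RandomForest
    from pyspark.mllib.util import MLUtils

    sc = SparkContext(appName="rf-regression-sketch")

    # Hypothetical input path; LibSVM is just one convenient labelled format.
    data = MLUtils.loadLibSVMFile(sc, "s3n://my-bucket/training.libsvm")
    train, test = data.randomSplit([0.8, 0.2], seed=42)

    model = RandomForest.trainRegressor(
        train,
        categoricalFeaturesInfo={},     # {} => treat all features as continuous
        numTrees=100,                   # more trees => more work, lower variance
        featureSubsetStrategy="auto",
        impurity="variance",
        maxDepth=10,                    # deeper trees grow cost very quickly
        maxBins=32)

    predictions = model.predict(test.map(lambda p: p.features))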

Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
…process can't find the graphframes Python code when it is loaded as a Spark package. > To work around this, I extract the graphframes Python directory locally, where I run pyspark, into a directory called graphframes. > On Thu, Mar 17, 2016 at 10:11 PM -0700,
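
A rough sketch of that workaround (the jar path below is an illustrative guess, not from the thread): the artifact that --packages downloads bundles the graphframes Python package, so extracting it next to the driver script makes the import resolve:

    import sys
    import zipfile

    # Illustrative location only - the jar lands wherever Ivy caches --packages artifacts.
    jar = "/home/hadoop/.ivy2/jars/graphframes_graphframes-0.1.0-spark1.5.jar"

    # Pull the bundled Python package out of the jar into a local directory ...
    with zipfile.ZipFile(jar) as z:
        z.extractall("graphframes_pkg")

    # ... and make that directory importable before importing graphframes.
    sys.path.insert(0, "graphframes_pkg")
    from graphframes import *

On YARN the executors typically need the same code as well (e.g. via --py-files or sc.addPyFile); the sketch above only covers the driver-side import that the workaround describes.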

Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
I'm having trouble with that for pyspark, yarn and graphframes. I'm using:- pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5 which starts and gives me a REPL, but when I try 'from graphframes import *' I get 'No module named graphframes'. Without '--master yarn' it

Re: filter by dict() key in pySpark

2016-02-24 Thread Franc Carter
A colleague found how to do this, the approach was to use a udf(). Cheers. On 21 February 2016 at 22:41, Franc Carter <franc.car...@gmail.com> wrote: > I have a DataFrame that has a Python dict() as one of the columns. I'd like to filter the DataFrame for those Rows that
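
A minimal sketch of the udf() approach (assuming a map-typed column named 'params' and the key 'name' from the original question; not code from the thread):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    # True when the dict/map column contains the key we are filtering on.
    has_name = udf(lambda params: params is not None and 'name' in params,
                   BooleanType())

    DF2 = DF1.filter(has_name(DF1['params']))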

filter by dict() key in pySpark

2016-02-21 Thread Franc Carter
I have a DataFrame that has a Python dict() as one of the columns. I'd like to filter the DataFrame for those Rows where the dict() contains a specific value. e.g. something like this:- DF2 = DF1.filter('name' in DF1.params) but that gives me this error ValueError: Cannot convert column

Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
…added column and in the end the last added column (in the loop) will be the added column, like in my code above. > On Wed, Feb 3, 2016 at 5:05 PM, Franc Carter <franc.car...@gmail.com> wrote: >> I had problems doing this as well - I ended up using 'withColumn…

Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
I had problems doing this as well - I ended up using 'withColumn', it's not particularly graceful but it worked (1.5.2 on AWS EMR). Cheers. On 3 February 2016 at 22:06, Devesh Raj Singh wrote: > Hi, > I am trying to create dummy variables in sparkR by creating new

Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
…00 > 2 2013 101 > 3 2014 102 > What's your desired output? > Femi > On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter <franc.car...@gmail.com> wrote: >> Hi, I have a DataFrame with the columns…

Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
Thanks, cheers. On 10 January 2016 at 22:35, Blaž Šnuderl <snud...@gmail.com> wrote: > This can be done using spark.sql and window functions. Take a look at > https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html > On Sun, Jan 10, 2016 at 11:0
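
Following the blog post above, a minimal sketch of the window-function version, assuming the ID/Year/Value columns from the original question and a hypothetical output column 'Delta':

    from pyspark.sql import Window
    from pyspark.sql.functions import col, lag

    # Compare each row with the previous Year's row for the same ID.
    w = Window.partitionBy('ID').orderBy('Year')

    DF2 = DF1.withColumn('Delta', col('Value') - lag('Value', 1).over(w))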

Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
My Python is not particularly good, so I'm afraid I don't understand what that means. Cheers. On 9 January 2016 at 14:45, Franc Carter <franc.car...@gmail.com> wrote: > Hi, > I'm trying to write a short function that returns the last Sunday of the week of a given dat

pyspark: calculating row deltas

2016-01-09 Thread Franc Carter
Hi, I have a DataFrame with the columns ID,Year,Value. I'd like to create a new Column that is Value2-Value1, where the corresponding Year2=Year-1. At the moment I am creating a new DataFrame with renamed columns and doing DF.join(DF2, . . . .). This looks cumbersome to me, is there

Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
Got it, I needed to use the when/otherwise construct - code below

    def getSunday(day):
        day = day.cast("date")
        sun = next_day(day, "Sunday")
        n = datediff(sun, day)
        x = when(n == 7, day).otherwise(sun)
        return x

On 10 January 2016 at 08:41, Franc Carter <…
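
For completeness, a hedged version of that snippet with the imports it relies on and an illustrative call (the DataFrame and the 'day' column name are assumptions, not from the thread):

    from pyspark.sql.functions import datediff, next_day, when

    def getSunday(day):
        # 'day' is a Column, so everything below stays in Column expressions -
        # which is why when/otherwise works where a plain Python 'if' cannot.
        day = day.cast("date")
        sun = next_day(day, "Sunday")
        n = datediff(sun, day)
        return when(n == 7, day).otherwise(sun)

    # Illustrative usage:
    DF2 = DF1.withColumn("sunday", getSunday(DF1["day"]))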

pyspark: conditionals inside functions

2016-01-08 Thread Franc Carter
Hi, I'm trying to write a short function that returns the last Sunday of the week of a given date, code below

    def getSunday(day):
        day = day.cast("date")
        sun = next_day(day, "Sunday")
        n = datediff(sun, day)
        if (n == 7):
            return day
        else:
            return sun

this

number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
Hi, I'm having trouble working out how to get the number of executors set when using sparkR.init(). If I start sparkR with sparkR --master yarn --num-executors 6 then I get 6 executors. However, if I start sparkR with sparkR followed by sc <- sparkR.init(master="yarn-client",

Re: number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
…Could you try setting that with sparkR.init()? > From: Franc Carter <franc.car...@gmail.com> > Sent: Friday, December 25, 2015 9:23 PM > Subject: number of executors in sparkR.init() > To: <user@spark.apache.org>

Re: SparkR csv without headers

2015-08-20 Thread Franc Carter
(…, schema = schema) From: Franc Carter [mailto:franc.car...@rozettatech.com] Sent: Wednesday, August 19, 2015 1:48 PM To: user@spark.apache.org Subject: SparkR csv without headers Hi, Does anyone have an example of how to create a DataFrame in SparkR which specifies the column

SparkR csv without headers

2015-08-18 Thread Franc Carter

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Franc Carter
…time. Downstream, if data can be thinned down, then you can start looking more at things you can do on a single host: a machine that can be in your Hadoop cluster. Ask YARN nicely and you can get a dedicated machine for a couple of days (i.e. until your Kerberos tokens expire). -- Franc…

Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Franc Carter

Re: spark, reading from s3

2015-02-12 Thread Franc Carter
…by approximately 0 seconds. Retrying connection. After that there are tons of 403/forbidden errors and then the job fails. It's sporadic, so sometimes I get this error and sometimes not; what could be the issue? I think it could be related to network connectivity? -- Franc Carter | Systems…

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
...@performance-media.de wrote: Hi, Regarding the Cassandra data model, there's an excellent post on the eBay tech blog: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/. There's also a slideshare for this somewhere. Happy hacking, Chris. From: Franc Carter…

Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Franc Carter

Re: Reading from a centralized stored

2015-01-06 Thread Franc Carter
…AM, Cody Koeninger <c...@koeninger.org> wrote: No, most RDDs partition input data appropriately. On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter <franc.car...@rozettatech.com> wrote: One more question, to clarify: will every node pull in all the data? Thanks. On Tue, Jan 6, 2015 at 12:56 PM…
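
A small illustration of that point (bucket and path are hypothetical): the input is split into partitions, and each task reads only the split(s) backing its own partition rather than every node pulling the whole dataset:

    # Each partition maps to a slice of the input in S3; a task reads only
    # the slice for the partition it is processing.
    rdd = sc.textFile("s3n://my-bucket/events/", minPartitions=64)
    print(rdd.getNumPartitions())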

Re: Reading from a centralized stored

2015-01-05 Thread Franc Carter
on the same nodes as spark, but JdbcRDD doesn't implement preferred locations. On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter franc.car...@rozettatech.com wrote: Hi, I'm trying to understand how a Spark Cluster behaves when the data it is processing resides on a centralized/remote store (S3

Reading from a centralized stored

2015-01-05 Thread Franc Carter
Hi, I'm trying to understand how a Spark Cluster behaves when the data it is processing resides on a centralized/remote store (S3, Cassandra, DynamoDB, RDBMS etc.). Does every node in the cluster retrieve all the data from the central store? Thanks. -- Franc Carter | Systems Architect