Re: dataframe udf function will be executed twice when filter on new column created by withColumn

2016-05-11 Thread James Hammerton
This may be related to: https://issues.apache.org/jira/browse/SPARK-13773 Regards, James On 11 May 2016 at 15:49, Ted Yu wrote: > In master branch, behavior is the same. > > Suggest opening a JIRA if you haven't done so. > > On Wed, May 11, 2016 at 6:55 AM, Tony Jin

Re: Error from reading S3 in Scala

2016-05-04 Thread James Hammerton
On 3 May 2016 at 17:22, Gourav Sengupta wrote: > Hi, > > The best thing to do is start the EMR clusters with proper permissions in > the roles that way you do not need to worry about the keys at all. > > Another thing, why are we using s3a:// instead of s3:// ? >
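When IAM roles are not available, S3 credentials can be supplied through the Hadoop configuration instead; a minimal sketch assuming a 1.6-era SparkContext `sc`, with an illustrative bucket path and credentials taken from environment variables:

```scala
// Configure the s3a filesystem with explicit credentials
// (fs.s3a.* are the standard Hadoop S3A configuration keys).
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val lines = sc.textFile("s3a://my-bucket/path/data.csv")
```

With EMR roles, as suggested in the quoted reply, none of this is needed because the instance profile supplies credentials automatically.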

Re: ML Random Forest Classifier

2016-04-13 Thread James Hammerton
categoricalFeatures: Map[Int, Int], numClasses: Int, numFeatures: Int > = -1): RandomForestClassificationModel = { > RandomForestClassificationModel.fromOld(oldModel, parent, > categoricalFeatures, numClasses, numFeatures) > } > > > def toOld(newModel: RandomForest

Re: ML Random Forest Classifier

2016-04-11 Thread James Hammerton
tegoricalFeatures, numClasses, numFeatures) > > } > > > def toOld(newModel: RandomForestClassificationModel): > OldRandomForestModel = { > > newModel.toOld > > } > > } > Regards, James On 11 April 2016 at 10:36, James Hammerton <ja...@gluru.co>

Re: ML Random Forest Classifier

2016-04-11 Thread James Hammerton
There are methods for converting the dataframe based random forest models to the old RDD based models and vice versa. Perhaps using these will help given that the old models can be saved and loaded? In order to use them however you will need to write code in the org.apache.spark.ml package. I've
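A sketch of the wrapper described above, following the signatures quoted elsewhere in this thread. The object name is illustrative; it must live under the org.apache.spark.ml package because the fromOld/toOld conversions are package-private:

```scala
package org.apache.spark.ml.classification

import org.apache.spark.mllib.tree.model.{RandomForestModel => OldRandomForestModel}

// Illustrative wrapper exposing the package-private conversions between
// the new DataFrame-based model and the old RDD-based model.
object RandomForestConversions {
  def fromOld(oldModel: OldRandomForestModel,
              parent: RandomForestClassifier,
              categoricalFeatures: Map[Int, Int],
              numClasses: Int,
              numFeatures: Int = -1): RandomForestClassificationModel =
    RandomForestClassificationModel.fromOld(
      oldModel, parent, categoricalFeatures, numClasses, numFeatures)

  def toOld(newModel: RandomForestClassificationModel): OldRandomForestModel =
    newModel.toOld
}
```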

Logistic regression throwing errors

2016-04-01 Thread James Hammerton
Hi, On a particular .csv data set - which I can use in WEKA's logistic regression implementation without any trouble, I'm getting errors like the following: 16/04/01 18:04:18 ERROR LBFGS: Failure! Resetting history: > breeze.optimize.FirstOrderException: Line search failed These errors cause

Re: Work out date column in CSV more than 6 months old (datediff or something)

2016-03-22 Thread James Hammerton
On 22 March 2016 at 10:57, Mich Talebzadeh wrote: > Thanks Silvio. > > The problem I have is that somehow string comparison does not work. > > Case in point > > val df = > sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", >

Re: Find all invoices more than 6 months from csv file

2016-03-22 Thread James Hammerton
On 21 March 2016 at 17:57, Mich Talebzadeh wrote: > > Hi, > > For test purposes I am reading a simple csv file as follows: > > val df = > sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", > "true").option("header", "true").load("/data/stg/table2")
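One way to express the 6-month check without relying on string comparison is to parse the column as a date and compare day counts; a sketch using the 1.6-era functions API, with an illustrative column name:

```scala
import org.apache.spark.sql.functions._

// Treat the column as a date, then compare day counts; ~183 days ≈ 6 months.
val oldInvoices = df.filter(
  datediff(current_date(), to_date(col("Invoice Date"))) > 183)
```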

Add org.apache.spark.mllib model .predict() method to models in org.apache.spark.ml?

2016-03-22 Thread James Hammerton
Hi, The machine learning models in org.apache.spark.mllib have a .predict() method that can be applied to a Vector to return a prediction. However, this method does not appear on the new models in org.apache.spark.ml; instead you have to wrap a Vector in a DataFrame to get a prediction out. This
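The DataFrame wrapping referred to above can be as small as one row; a sketch assuming an already-fitted ml model `model` and Spark 1.6's mllib Vectors:

```scala
import org.apache.spark.mllib.linalg.Vectors

// A one-row DataFrame carrying the feature vector under the column
// name the model expects ("features" is the conventional default).
val single = sqlContext.createDataFrame(
  Seq(Tuple1(Vectors.dense(1.0, 0.5, 3.2)))).toDF("features")

val prediction = model.transform(single)
  .select("prediction").head().getDouble(0)
```

This works, but is clearly heavier than a direct .predict(vector) call for single-instance scoring.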

Re: best way to do deep learning on spark ?

2016-03-20 Thread James Hammerton
In the meantime there is also deeplearning4j which integrates with Spark (for both Java and Scala): http://deeplearning4j.org/ Regards, James On 17 March 2016 at 02:32, Ulanov, Alexander wrote: > Hi Charles, > > > > There is an implementation of multilayer perceptron

Saving the DataFrame based RandomForestClassificationModels

2016-03-18 Thread James Hammerton
Hi, If you train a org.apache.spark.ml.classification.RandomForestClassificationModel, you can't save it - attempts to do so yield the following error: 16/03/18 14:12:44 INFO SparkContext: Successfully stopped SparkContext > Exception in thread "main" java.lang.UnsupportedOperationException: >
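Pending a fix, one workaround consistent with the conversion discussion in the "ML Random Forest Classifier" threads above is to drop down to the old mllib model, which does support save/load; a sketch assuming a toOld-style conversion is available (path is illustrative):

```scala
import org.apache.spark.mllib.tree.model.RandomForestModel

// oldModel obtained via a toOld-style conversion from the ml model.
oldModel.save(sc, "/models/rf")
val restored = RandomForestModel.load(sc, "/models/rf")
```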

Best way to process values for key in sorted order

2016-03-15 Thread James Hammerton
Hi, I need to process some events in a specific order based on a timestamp, for each user in my data. I had implemented this by using the dataframe sort method to sort by user id and then sort by the timestamp secondarily, then do a groupBy().mapValues() to process the events for each user.
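An alternative to a full global sort is to co-locate each user's events and sort only within partitions, which avoids a total ordering across users; a sketch with illustrative column names:

```scala
import org.apache.spark.sql.functions.col

// Hash-partition by user, then sort each partition so every user's
// events are contiguous and in timestamp order.
val ordered = df.repartition(col("userId"))
  .sortWithinPartitions(col("userId"), col("timestamp"))
```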

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-09 Thread James Hammerton
Hi Ted, Finally got round to creating this: https://issues.apache.org/jira/browse/SPARK-13773 I hope you don't mind me selecting you as the shepherd for this ticket. Regards, James On 7 March 2016 at 17:50, James Hammerton <ja...@gluru.co> wrote: > Hi Ted, > > Thanks for g

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-07 Thread James Hammerton
to select Spark as the Project. > > Cheers > > On Mon, Mar 7, 2016 at 2:54 AM, James Hammerton <ja...@gluru.co> wrote: > >> Hi, >> >> So I managed to isolate the bug and I'm ready to try raising a JIRA >> issue. I joined the Apache Jira project so I can c

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-07 Thread James Hammerton
Infrastructure. There doesn't seem to be an option for me to raise an issue for Spark?! Regards, James On 4 March 2016 at 14:03, James Hammerton <ja...@gluru.co> wrote: > Sure thing, I'll see if I can isolate this. > > Regards. > > James > > On 4 March 2016 at 12:24,

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-04 Thread James Hammerton
Sure thing, I'll see if I can isolate this. Regards. James On 4 March 2016 at 12:24, Ted Yu <yuzhih...@gmail.com> wrote: > If you can reproduce the following with a unit test, I suggest you open a > JIRA. > > Thanks > > On Mar 4, 2016, at 4:01 AM, James Hammerton <ja

Re: How to control the number of parquet files getting created under a partition ?

2016-03-02 Thread James Hammerton
Hi, Based on the behaviour I've seen using parquet, the number of partitions in the DataFrame will determine the number of files in each parquet partition. I.e. when you use "PARTITION BY" you're actually partitioning twice, once via the partitions spark has created internally and then again
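Given that, the number of files under each partitionBy directory can be controlled by repartitioning before the write; a sketch (path and column names are illustrative):

```scala
import org.apache.spark.sql.functions.col

// Hash into 4 partitions by date, so at most 4 tasks write
// into each "date=..." output directory.
df.repartition(4, col("date"))
  .write
  .partitionBy("date")
  .parquet("/data/out")
```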

Count job stalling at shuffle stage on 3.4TB input (but only 5.3GB shuffle write)

2016-02-23 Thread James Hammerton
Hi, I have been having problems processing a 3.4TB data set - uncompressed tab separated text - containing object creation/update events from our system, one event per line. I decided to see what happens with a count of the number of events (= number of lines in the text files) and a count of

Re: Is this likely to cause any problems?

2016-02-19 Thread James Hammerton
> using the spark-ec2 script rather than EMR? > > On Thu, Feb 18, 2016 at 11:39 AM, James Hammerton <ja...@gluru.co> wrote: > >> I have now... So far I think the issues I've had are not related to >> this, but I wanted to be sure in case it should be something that ne

Re: Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
hih...@gmail.com> wrote: > Have you seen this ? > > HADOOP-10988 > > Cheers > > On Thu, Feb 18, 2016 at 3:39 AM, James Hammerton <ja...@gluru.co> wrote: > >> HI, >> >> I am seeing warnings like this in the logs when I run Spark jobs: >> >> O

Re: Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
t curiosity why are you not using EMR to start your SPARK > cluster? > > > Regards, > Gourav > > On Thu, Feb 18, 2016 at 12:23 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Have you seen this ? >> >> HADOOP-10988 >> >> Cheers >> >

Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
Hi, I am seeing warnings like this in the logs when I run Spark jobs: OpenJDK 64-Bit Server VM warning: You have loaded library /root/ephemeral-hdfs/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you
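The JVM's suggested remedy for this warning is to clear the executable-stack flag on the native library (or rebuild it linked with '-z noexecstack'); a config-style sketch using the path from the warning:

```
execstack -c /root/ephemeral-hdfs/lib/native/libhadoop.so.1.0.0
```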