Mean over window with minimum number of rows

2018-10-18 Thread Sumona Routh
Hi all, Before I go the route of rolling my own UDAF: I'm calculating a mean over the last 5 rows, so I have the following window defined: Window.partitionBy(person).orderBy(timestamp).rowsBetween(-4, Window.currentRow) Then I calculate the mean over that window. Within each partition, I'd like the
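A minimal sketch of one way to get this without a custom UDAF, assuming a DataFrame df with columns person, timestamp, and value (names are illustrative): compute a row count over the same window and null out the mean whenever fewer than 5 rows were available.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{avg, col, count, when}

    val spark = SparkSession.builder.appName("window-min-rows").getOrCreate()
    import spark.implicits._

    // Illustrative data: (person, timestamp, value)
    val df = Seq(("a", 1L, 10.0), ("a", 2L, 20.0), ("a", 3L, 30.0))
      .toDF("person", "timestamp", "value")

    val w = Window.partitionBy("person").orderBy("timestamp")
      .rowsBetween(-4, Window.currentRow)

    val result = df
      .withColumn("n", count("value").over(w))                   // rows actually present in the window
      .withColumn("mean", avg("value").over(w))
      .withColumn("last5Mean", when(col("n") >= 5, col("mean"))) // null until 5 rows have accumulated
      .drop("n", "mean")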

Databricks 1/2 day certification course at Spark Summit

2018-05-25 Thread Sumona Routh
Hi all, My company just now approved for some of us to go to Spark Summit in SF this year. Unfortunately, the day-long workshops on Monday are sold out now. We are considering what we might do instead. Have others done the 1/2-day certification course before? Is it worth considering? Does it

Re: DataFrameReader read from S3 org.apache.spark.sql.AnalysisException: Path does not exist

2017-07-13 Thread Sumona Routh
Yong Zhang <java8...@hotmail.com> wrote: > Can't you just catch that exception and return an empty dataframe? > Yong > ------ > *From:* Sumona Routh <sumos...@gmail.com> > *Sent:* Wednesday, July 12, 2017 4:36 PM

DataFrameReader read from S3 org.apache.spark.sql.AnalysisException: Path does not exist

2017-07-12 Thread Sumona Routh
Hi there, I'm trying to read a list of paths from S3 into a dataframe for a window of time using the following: sparkSession.read.parquet(listOfPaths:_*) In some cases, the path may not be there because there is no data, which is an acceptable scenario. However, Spark throws an
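One workaround alongside the catch-the-exception suggestion in the reply above: filter the list down to paths that actually exist before calling the reader. A sketch, assuming an S3 filesystem configured under the Hadoop s3a scheme (paths are illustrative):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("s3-read").getOrCreate()
    val listOfPaths = Seq("s3a://bucket/day=2017-07-10", "s3a://bucket/day=2017-07-11")

    // Drop the paths that do not exist so read.parquet never sees a missing one.
    val existing = listOfPaths.filter { p =>
      val path = new Path(p)
      path.getFileSystem(spark.sparkContext.hadoopConfiguration).exists(path)
    }

    val df = if (existing.nonEmpty) spark.read.parquet(existing: _*)
             else spark.emptyDataFrame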

Re: Random Forest hangs without trace of error

2017-05-30 Thread Sumona Routh
Hi Morten, Were you able to resolve your issue with RandomForest? I am having similar issues with a newly trained model (one that does have a larger number of trees and a smaller minInstancesPerNode, by design, to produce the best-performing model). I wanted to get some feedback on how you solved
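For reference, a sketch of the parameters mentioned above on the ml API; the values are purely illustrative, not a recommendation:

    import org.apache.spark.ml.classification.RandomForestClassifier

    // More trees plus a low minInstancesPerNode (the combination described
    // above) grows model size and training time quickly.
    val rf = new RandomForestClassifier()
      .setNumTrees(200)
      .setMinInstancesPerNode(1)
      .setMaxDepth(10)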

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Sumona Routh
Hi Sam, I would absolutely be interested in reading a blog write-up of how you are doing this. We have pieced together a relatively decent pipeline ourselves, in Jenkins, but have many kinks to work out. We also have some new requirements to start running side-by-side comparisons of different

Re: Dataframes na fill with empty list

2017-04-11 Thread Sumona Routh
line which doesn't compile is what I would want to do (after outer joining, of course; it's not necessary except in that particular case where a null could be populated in that field). Thanks, Sumona On Tue, Apr 11, 2017 at 9:50 AM Sumona Routh <sumos...@gmail.com> wrote: > The seq

Re: Dataframes na fill with empty list

2017-04-11 Thread Sumona Routh
; .na.fill(0, Seq("numeric_field1","numeric_field2")) > .na.fill("", Seq( > "text_field1","text_field2","text_field3")) > Notice that you have to differentiate those fields that are meant to be > filled with an int, from those that require a differ

Dataframes na fill with empty list

2017-04-10 Thread Sumona Routh
Hi there, I have two dataframes that each have some columns which are of list type (arrays generated by the collect_list function, actually). I need to outer join these two dfs; however, by nature of an outer join I am sometimes left with null values. Normally I would use df.na.fill(...), however it
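df.na.fill does not accept array columns, which is presumably the non-compiling line referenced in the follow-up. A sketch of one workaround under that assumption: coalesce each nullable array column with a typed empty-array literal after the outer join (column names and element type are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{array, coalesce, col}

    val spark = SparkSession.builder.appName("fill-empty-list").getOrCreate()
    import spark.implicits._

    val left  = Seq(("k1", Seq(1, 2)), ("k2", Seq(3))).toDF("key", "list_field")
    val right = Seq("k1", "k3").toDF("key")

    // The outer join leaves list_field null for keys present on one side only.
    val joined = right.join(left, Seq("key"), "left_outer")

    // na.fill cannot target array columns; coalesce with a typed empty array instead.
    val filled = joined.withColumn(
      "list_field",
      coalesce(col("list_field"), array().cast("array<int>")))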

Re: Can't load a RandomForestClassificationModel in Spark job

2017-01-12 Thread Sumona Routh
s. > Best > Ayan > On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh <sumos...@gmail.com> wrote: > Hi all, > I've been working with Spark mllib 2.0.2 RandomForestClassificationModel. > I encountered two frustrating issues and would really appreciate

Can't load a RandomForestClassificationModel in Spark job

2017-01-12 Thread Sumona Routh
Hi all, I've been working with Spark mllib 2.0.2 RandomForestClassificationModel. I encountered two frustrating issues and would really appreciate some advice: 1) RandomForestClassificationModel is effectively not serializable (I assume it's referencing something that can't be serialized, since
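A sketch of the usage pattern that sidesteps the serialization issue: load the model once on the driver and score through transform, which distributes the work without ever putting the model object inside a closure. The path and input DataFrame are assumptions:

    import org.apache.spark.ml.classification.RandomForestClassificationModel
    import org.apache.spark.sql.DataFrame

    // Load on the driver; avoid referencing `model` inside rdd.map { ... } closures.
    val model = RandomForestClassificationModel.load("s3a://bucket/models/rf")  // illustrative path

    // featuresDf is assumed to carry the "features" vector column the model was trained on.
    def score(featuresDf: DataFrame): DataFrame = model.transform(featuresDf)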

Re: Upgrade from 1.2 to 1.6 - parsing flat files in working directory

2016-07-26 Thread Sumona Routh
Can anyone provide some guidance on how to get files on the classpath for our Spark job? This used to work in 1.2, however after upgrading we are getting nulls when attempting to load resources. Thanks, Sumona On Thu, Jul 21, 2016 at 4:43 PM Sumona Routh <sumos...@gmail.com> wrote:

Upgrade from 1.2 to 1.6 - parsing flat files in working directory

2016-07-21 Thread Sumona Routh
Hi all, We are running into a classpath issue when we upgrade our application from 1.2 to 1.6. In 1.2, we load properties from a flat file (from the working directory of the spark-submit script) using the classloader resource approach. This was executed up front (by the driver) before any processing
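One way to stop depending on the working directory altogether is to ship the file with spark-submit --files and resolve it through SparkFiles once the context is up. A sketch with an illustrative file name:

    // Submitted with:  spark-submit --files app.properties --class ... app.jar
    import java.io.FileInputStream
    import java.util.Properties
    import org.apache.spark.SparkFiles

    val props = new Properties()
    // SparkFiles.get returns the absolute local path of a file shipped via --files.
    props.load(new FileInputStream(SparkFiles.get("app.properties")))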

Spark UI shows finished when job had an error

2016-06-17 Thread Sumona Routh
Hi there, Our Spark job had an error (specifically, the Cassandra table definition did not match what was in Cassandra), which threw an exception that logged out to our spark-submit log. However, the UI never showed any failed stage or job. It appeared as if the job finished without error, which is

Re: Spark UI standalone "crashes" after an application finishes

2016-03-01 Thread Sumona Routh
> and so you still need to set a big java heap for master. > > -- Original Message -- > *From:* "Shixiong(Ryan) Zhu" <shixi...@databricks.com> > *Sent:* March 2016
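The heap suggestion above would typically land in conf/spark-env.sh on the master; a sketch with an illustrative value:

    # conf/spark-env.sh -- heap for the standalone master/worker daemons (default 1g)
    SPARK_DAEMON_MEMORY=4g

The master can also be told to retain fewer finished applications in its UI via spark.deploy.retainedApplications (default 200) in conf/spark-defaults.conf.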

Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Sumona Routh
Hi there, I've been doing some performance tuning of our Spark application, which is using Spark 1.2.1 standalone. I have been using the spark metrics to graph out details as I run the jobs, as well as the UI to review the tasks and stages. I notice that after my application completes, or is near

Re: SparkListener onApplicationEnd processing an RDD throws exception because of stopped SparkContext

2016-02-22 Thread Sumona Routh
SparkContext. In general, SparkListener is > used to monitor the job progress and collect job information, and you should > not submit jobs there. Why not submit your jobs in the main thread? > On Wed, Feb 17, 2016 at 7:11 AM, Sumona Routh <sumos...@gmail.com> wrote: >
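A sketch of the shape the reply recommends: keep the audit write in the main thread, after the primary jobs and before stopping the context, rather than inside onApplicationEnd. saveToCassandra comes from the spark-cassandra-connector; keyspace, table, and data are illustrative:

    import com.datastax.spark.connector._   // spark-cassandra-connector
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("post-processing"))
    try {
      // ... primary processing, accumulating audit records along the way ...
      val auditRdd = sc.parallelize(Seq(("job-1", "OK")))
      auditRdd.saveToCassandra("my_keyspace", "audit", SomeColumns("job_id", "status"))
    } finally {
      sc.stop()   // only after the final save has completed
    }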

Re: SparkListener onApplicationEnd processing an RDD throws exception because of stopped SparkContext

2016-02-17 Thread Sumona Routh
Can anyone provide some insight into the flow of SparkListeners, specifically onApplicationEnd? I'm having issues with the SparkContext being stopped before my final processing can complete. Thanks! Sumona On Mon, Feb 15, 2016 at 8:59 AM Sumona Routh <sumos...@gmail.com> wrote: > Hi t

SparkListener onApplicationEnd processing an RDD throws exception because of stopped SparkContext

2016-02-15 Thread Sumona Routh
Hi there, I am trying to implement a listener that performs as a post-processor, storing data about what was processed or erred. For this, I use an RDD that may or may not change during the course of the application. My thought was to use onApplicationEnd and then a saveToCassandra call to

SparkListener - why is org.apache.spark.scheduler.JobFailed in scala private?

2016-02-10 Thread Sumona Routh
Hi there, I am trying to create a listener for my Spark job to do some additional notifications for failures using this Scala API: https://spark.apache.org/docs/1.2.1/api/scala/#org.apache.spark.scheduler.JobResult . My idea was to write something like this: override def onJobEnd(jobEnd:
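Since JobFailed is private[spark] but JobSucceeded is public, one workaround is to match on the success case and treat any other JobResult as a failure. A sketch; the notification helper is hypothetical:

    import org.apache.spark.scheduler.{JobSucceeded, SparkListener, SparkListenerJobEnd}

    class FailureNotifyingListener extends SparkListener {
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        jobEnd.jobResult match {
          case JobSucceeded => ()                                         // nothing to report
          case other        => notifyFailure(jobEnd.jobId, other.toString) // JobFailed is private; match the remainder
        }

      private def notifyFailure(jobId: Int, detail: String): Unit =
        println(s"Job $jobId failed: $detail")                            // hypothetical notification hook
    }

    // Registered with: sc.addSparkListener(new FailureNotifyingListener())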