Use SparkContext in Web Application

2018-09-30 Thread Girish Vasmatkar
Hi All, we are very early into our Spark days, so the following may sound like a novice question :) I will try to keep this as short as possible. We are trying to use Spark to introduce a recommendation engine that can be used to provide product recommendations and need help on some design decision

Re: Pyspark Partitioning

2018-09-30 Thread ayan guha
Hi, there is a set of functions which can be used with the construct Over (partition by col order by col). You can search for rank and window functions in the Spark documentation. On Mon, 1 Oct 2018 at 5:29 am, Riccardo Ferrari wrote: > Hi Dimitris, > > I believe the methods partitionBy >

Re: Pyspark Partitioning

2018-09-30 Thread Riccardo Ferrari
Hi Dimitris, I believe the methods partitionBy and mapPartitions are specific to RDDs while you're talking about DataFrames

Pyspark Partitioning

2018-09-30 Thread dimitris plakas
Hello everyone, I am trying to split a DataFrame into partitions and I want to apply a custom function to every partition. More precisely, I have a DataFrame like the one below:

Group_Id | Id  | Points
1        | id1 | Point1
2        | id2 | Point2

I want to have a partition for every Group_Id
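The intent described in this question (one partition per Group_Id, with a custom function applied to each) can be sketched in plain Python. This is only an illustration of the semantics, not Spark code: the third row, `partition_by_group`, and `custom_function` are hypothetical and do not come from the thread.

```python
from collections import defaultdict

# Rows mirroring the Group_Id | Id | Points layout from the question.
rows = [
    (1, "id1", "Point1"),
    (2, "id2", "Point2"),
    (1, "id3", "Point3"),  # hypothetical extra row so one group has two members
]


def partition_by_group(rows):
    """Bucket rows so that every Group_Id ends up in its own partition."""
    partitions = defaultdict(list)
    for group_id, row_id, points in rows:
        partitions[group_id].append((row_id, points))
    return dict(partitions)


def custom_function(group_id, members):
    """Hypothetical per-partition function: count the rows in the group."""
    return (group_id, len(members))


# Apply the custom function once per partition, in Group_Id order.
results = [
    custom_function(group_id, members)
    for group_id, members in sorted(partition_by_group(rows).items())
]
print(results)
```

On a real DataFrame the same shape is usually expressed as a group-by on Group_Id followed by a per-group function, which is the direction the replies in this thread point toward.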

Re: Watermarking without aggregation with Structured Streaming

2018-09-30 Thread peay
Thanks for the pointers. I guess right now the only workaround would be to apply a "dummy" aggregation (e.g., group by the timestamp itself) only to have the stateful processing logic kick in and apply the filtering? For my purposes, an alternative solution to pushing it out to the source would

Re: Watermarking without aggregation with Structured Streaming

2018-09-30 Thread Jungtaek Lim
The purpose of a watermark is to set a limit on handling late records, because state would otherwise grow without bound. In other cases (non-stateful operations), it is pretty normal to handle all records even if they're pretty late. Btw, there were some comments regarding this: while Spark delegates to filter out late re

Re: Watermarking without aggregation with Structured Streaming

2018-09-30 Thread chandan prakash
Interesting question. I do not think a watermark is currently supported without any aggregation operation/groupBy. Reason: the watermark in Structured Streaming is used to limit the size of the state needed to keep intermediate information in memory. And state only comes into the picture in case of statefu
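The filtering that this thread asks about (dropping records that fall behind a watermark, without any aggregation) can be modelled in plain Python. This is a simplified sketch, not Spark's implementation: `filter_late`, the sample events, and the five-minute delay are hypothetical, and the watermark here advances per record, whereas Structured Streaming advances it between micro-batches.

```python
from datetime import datetime, timedelta


def filter_late(events, delay):
    """Split (event_time, payload) pairs into kept and dropped payloads.

    The watermark is the largest event time seen so far minus `delay`;
    a record older than the watermark counts as late and is dropped.
    """
    max_event_time = None
    kept, dropped = [], []
    for event_time, payload in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - delay
        if event_time < watermark:
            dropped.append(payload)
        else:
            kept.append(payload)
    return kept, dropped


# Hypothetical sample stream, listed in arrival order.
events = [
    (datetime(2018, 9, 30, 12, 0), "a"),
    (datetime(2018, 9, 30, 12, 10), "b"),
    (datetime(2018, 9, 30, 12, 2), "c"),  # arrives 8 minutes behind "b"
    (datetime(2018, 9, 30, 12, 7), "d"),  # arrives 3 minutes behind "b"
]

kept, dropped = filter_late(events, timedelta(minutes=5))
print(kept, dropped)
```

Note that the only state this model keeps is a single timestamp, which is why the filtering itself does not need an aggregation; whether Spark exposes it for non-stateful queries is the open question in the thread.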

Re: [Structured Streaming SPARK-23966] Why non-atomic rename is problem in State Store ?

2018-09-30 Thread chandan prakash
Can anyone clear up the doubts on the questions asked here? Regards, Chandan On Sat, Aug 11, 2018 at 10:03 PM chandan prakash wrote: > Hi All, > I was going through this pull request about new CheckpointFileManager > abstraction in structured streaming coming in 2.4: > https://issues.apache.or