How to update structured streaming apps gracefully

2018-12-17 Thread Yuta Morisawa
Hi, I'm trying to update my structured streaming application, but I have no idea how to do it gracefully. Should I stop it, replace the jar file, and then restart it? My understanding is that in that case all the state will be recovered if I use checkpoints. Is this correct? Thank you
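That is indeed the usual pattern: stop the query, deploy the new jar, restart against the same checkpoint directory. A minimal sketch of the restart side, assuming a Kafka source; the broker address, topic, and paths are placeholders:

    # Sketch: restart the upgraded jar's query against the existing
    # checkpoint; offsets and state are recovered from checkpointLocation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("upgraded-app").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load())

    query = (events.writeStream
             .format("parquet")
             .option("path", "/data/out")
             # Same directory the old version used.
             .option("checkpointLocation", "/checkpoints/my-query")
             .start())

    query.awaitTermination()

Recovery across restarts holds as long as the changes to the query stay within the compatibility rules in the Structured Streaming guide; sources, sinks, and stateful operations generally cannot change arbitrarily between versions.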

How to clean up logs-dirs and local-dirs of running spark streaming in yarn cluster mode

2018-12-17 Thread shyla deshpande
I get the error "1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers". Is there a way to clean up these directories while the Spark streaming application is running? Thanks
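The "dirs are bad" message usually comes from YARN's disk health checker tripping on disk utilization (yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, 90% by default), so freeing space on those mounts is what actually clears it; YARN has no safe way to purge a live container's directories for you. What can help going forward is Spark's built-in executor log rolling. A sketch with illustrative values, set here via the session builder but equally valid in spark-defaults.conf:

    # Sketch: cap executor log growth with Spark's log-rolling properties.
    # Values are illustrative, not recommendations.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("streaming-app")
             # Roll executor logs once they reach ~128 MB...
             .config("spark.executor.logs.rolling.strategy", "size")
             .config("spark.executor.logs.rolling.maxSize", str(128 * 1024 * 1024))
             # ...and keep only the newest five rolled files.
             .config("spark.executor.logs.rolling.maxRetainedFiles", "5")
             .getOrCreate())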

Spark 2.2.1 - Operation not allowed: alter table replace columns

2018-12-17 Thread Nirav Patel
I see that a similar issue is fixed for the `ALTER TABLE table_name ADD COLUMNS(..)` statement: https://issues.apache.org/jira/browse/SPARK-19261. Is it also fixed for `REPLACE COLUMNS` in any subsequent version? Thanks
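For reference, SPARK-19261 was fixed in 2.2.0, which is why ADD COLUMNS works on 2.2.1 while REPLACE COLUMNS still raises "Operation not allowed". A sketch of the difference and one workaround; the table and column names are hypothetical:

    # ADD COLUMNS works on 2.2.x thanks to the SPARK-19261 fix:
    spark.sql("ALTER TABLE mydb.events ADD COLUMNS (country STRING)")

    # REPLACE COLUMNS still fails on 2.2.1 with "Operation not allowed":
    # spark.sql("ALTER TABLE mydb.events REPLACE COLUMNS (id BIGINT, name STRING)")

    # Workaround sketch: rewrite the table with the desired schema.
    df = spark.table("mydb.events").selectExpr("CAST(id AS BIGINT) AS id", "name")
    df.write.mode("overwrite").saveAsTable("mydb.events_v2")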

Re: Need help with SparkSQL Query

2018-12-17 Thread Ramandeep Singh Nanda
You can use analytic (window) functions in Spark SQL. Something like: select * from (select id, row_number() over (partition by id order by timestamp) as rn from root) where rn = 1
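The same idea as a runnable snippet; this assumes a temp view named root and adds the isValid filter from the original question, which the one-liner above leaves out:

    # Sketch: earliest valid record per id via row_number().
    earliest = spark.sql("""
        SELECT * FROM (
            SELECT *,
                   row_number() OVER (PARTITION BY id ORDER BY timestamp) AS rn
            FROM root
            WHERE isValid
        ) t
        WHERE rn = 1
    """).drop("rn")
    earliest.show()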

Re: Need help with SparkSQL Query

2018-12-17 Thread Patrick McCarthy
Untested, but something like the below should work:

    from pyspark.sql import functions as F
    from pyspark.sql import window as W

    (record
     .withColumn('ts_rank',
                 F.dense_rank().over(W.Window.partitionBy('id').orderBy('timestamp')))
     .filter(F.col('ts_rank') == 1)
     .drop('ts_rank'))
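A quick way to sanity-check that snippet on toy data; column names are taken from the question's schema:

    # Toy input for the snippet above.
    record = spark.createDataFrame(
        [(1, 100, True), (1, 50, True), (2, 70, False), (2, 90, True)],
        ["id", "timestamp", "isValid"],
    )
    # The snippet keeps each id's earliest row: (1, 50) and (2, 70);
    # the earliest row for id 2 is not valid, so for "earliest valid
    # record" filter first: record.filter("isValid")...
    # Also note dense_rank() keeps all rows tied on the earliest
    # timestamp; row_number() would keep exactly one row per id.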

Need help with SparkSQL Query

2018-12-17 Thread Nikhil Goyal
Hi guys, I have a dataframe of type Record (id: Long, timestamp: Long, isValid: Boolean, other metrics). The schema looks like this:

    root
     |-- id: long (nullable = true)
     |-- timestamp: long (nullable = true)
     |-- isValid: boolean (nullable = true)
     ...

I need to find the earliest valid record for each id.

Spark App Writes Nothing to HDFS

2018-12-17 Thread Soheil Pourbafrani
Hi, I submit an app on a Spark 2 cluster using the standalone scheduler in client mode. The app saves the results of the processing to HDFS. There are no errors in the output logs and the app finishes successfully, but the problem is that it just creates a _SUCCESS marker and an empty part-00000 file in the save directory.
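Without the code it's hard to be specific, but a pipeline whose final DataFrame is empty will still finish "successfully" and leave exactly this pattern: a _SUCCESS marker plus empty part files. A hedged debugging sketch; result and the output path are placeholders for the app's own names:

    # Sketch: confirm there is data at the point of the write.
    n = result.count()
    print("rows about to be written:", n)
    if n == 0:
        # Work backwards through the pipeline (wrong input path, a filter
        # that drops everything, mismatched join keys, ...).
        result.explain()
    else:
        result.write.mode("overwrite").parquet("hdfs:///user/app/output")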

MLlib / Kalman

2018-12-17 Thread robin . east
Pretty sure there is nothing in MLlib. This seems to be the most comprehensive coverage of implementing one in Spark: https://dzone.com/articles/kalman-filter-with-apache-spark-streaming-and-kafk. I've skimmed it but not read it in detail; it looks useful.
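For anyone who only needs the core recursion rather than the article's full Spark-plus-Kafka pipeline, a one-dimensional Kalman filter is a few lines of plain Python (no MLlib involved) and easy to wrap in a map over a stream; q and r below are assumed process and measurement noise variances:

    # Minimal 1-D Kalman filter sketch: predict, then update with gain k.
    def kalman_1d(measurements, q=1e-5, r=0.01, x0=0.0, p0=1.0):
        x, p = x0, p0            # state estimate and its variance
        estimates = []
        for z in measurements:
            p = p + q            # predict: constant-state model, variance grows
            k = p / (p + r)      # Kalman gain
            x = x + k * (z - x)  # update estimate toward measurement z
            p = (1 - k) * p      # updated variance
            estimates.append(x)
        return estimates

    # Example: smoothing noisy readings of a value near 1.0.
    print(kalman_1d([0.9, 1.1, 1.05, 0.95, 1.0]))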

MLlib / Kalman

2018-12-17 Thread Laurent Thiebaud
Hi everybody, Is there any built-in implementation of a Kalman filter in Spark MLlib? Or any other filter to achieve the same result? What's the state of the art here? Thanks.