How to update structured streaming apps gracefully

2018-12-17 Thread Yuta Morisawa

Hi

I'm trying to update my Structured Streaming application, but I'm not
sure how to do it gracefully.

Should I stop it, replace the jar file, and then restart it?
My understanding is that, in that case, all state will be recovered as
long as I use checkpoints.

Is this correct?
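
For context, a minimal sketch of the pattern I have in mind (the source,
sink and paths are hypothetical; the key part is the checkpointLocation,
which is what lets a restarted jar resume with its state):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-streaming-app").getOrCreate()

# Hypothetical Kafka source; any streaming source works the same way.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .start())

# To upgrade: call query.stop() (or stop the driver), swap the jar, and
# resubmit. Because offsets and state live in the checkpoint, the new run
# resumes where the old one stopped, subject to the documented limits on
# changing a query between restarts.
query.awaitTermination()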

Thank you,





How to clean up logs-dirs and local-dirs of running spark streaming in yarn cluster mode

2018-12-17 Thread shyla deshpande
I am getting this error:
1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad:
/var/log/hadoop-yarn/containers

Is there a way to clean up these directories while the Spark Streaming
application is running?
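
For reference, the "bad" status comes from the NodeManager disk health
checker, which most commonly flags a directory once disk usage crosses
its threshold. A hedged yarn-site.xml sketch of the knobs involved
(property names are from stock Hadoop; the values are only examples,
verify against your distribution):

<property>
  <!-- Utilization percentage above which a local-dir/log-dir is marked bad (default 90). -->
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>95.0</value>
</property>
<property>
  <!-- Target size of the localized-files cache; the NodeManager cleans up beyond this. -->
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>5120</value>
</property>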

Thanks


Spark 2.2.1 - Operation not allowed: alter table replace columns

2018-12-17 Thread Nirav Patel
I see that a similar issue was fixed for the `ALTER TABLE table_name ADD
COLUMNS(..)` statement.

https://issues.apache.org/jira/browse/SPARK-19261

Is it also fixed for `REPLACE COLUMNS` in any subsequent version?
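
For context, this is the kind of statement in question (table and column
names are made up), which 2.2.1 rejects with "Operation not allowed":

spark.sql("""
    ALTER TABLE my_db.my_table
    REPLACE COLUMNS (id INT, name STRING, created_at TIMESTAMP)
""")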

Thanks



Re: Need help with SparkSQL Query

2018-12-17 Thread Ramandeep Singh Nanda
You can use analytic (window) functions in Spark SQL.

Something like: select * from (select *, row_number() over (partition by id
order by timestamp) as rn from root) where rn = 1
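
A hedged, runnable sketch of the same idea in PySpark (the temp-view name
is made up; the isValid filter reflects the "earliest valid record"
requirement from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# `record` is the dataframe from the original question.
record.createOrReplaceTempView("records")

earliest = spark.sql("""
    SELECT * FROM (
        SELECT *,
               row_number() OVER (PARTITION BY id ORDER BY timestamp) AS rn
        FROM records
        WHERE isValid
    ) ranked
    WHERE rn = 1
""").drop("rn")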

On Mon, Dec 17, 2018 at 4:03 PM Nikhil Goyal  wrote:

> Hi guys,
>
> I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
> Boolean,  other metrics)
>
> Schema looks like this:
> root
>  |-- id: long (nullable = true)
>  |-- timestamp: long (nullable = true)
>  |-- isValid: boolean (nullable = true)
> .
>
> I need to find the earliest valid record per id. In RDD world I can do
> groupBy 'id' and find the earliest one but I am not sure how I can do it in
> SQL. Since I am doing this in PySpark I cannot really use DataSet API for
> this.
>
> One thing I can do is groupBy 'id', find the earliest timestamp available
> and then join with the original dataframe to get the right record (all the
> metrics).
>
> Or I can create a single column with all the records and then implement a
> UDAF in scala and use it in pyspark.
>
> Both solutions don't seem to be straight forward. Is there a simpler
> solution to this?
>
> Thanks
> Nikhil
>


-- 
Regards,
Ramandeep Singh
http://orastack.com
+13474792296
ramannan...@gmail.com


Re: Need help with SparkSQL Query

2018-12-17 Thread Patrick McCarthy
Untested, but something like the below should work:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

(record
 # rank records within each id by timestamp, keep only the earliest
 .withColumn('ts_rank',
             F.dense_rank().over(Window.partitionBy('id').orderBy('timestamp')))
 .filter(F.col('ts_rank') == 1)
 .drop('ts_rank')
)


On Mon, Dec 17, 2018 at 4:04 PM Nikhil Goyal  wrote:

> Hi guys,
>
> I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
> Boolean,  other metrics)
>
> Schema looks like this:
> root
>  |-- id: long (nullable = true)
>  |-- timestamp: long (nullable = true)
>  |-- isValid: boolean (nullable = true)
> .
>
> I need to find the earliest valid record per id. In RDD world I can do
> groupBy 'id' and find the earliest one but I am not sure how I can do it in
> SQL. Since I am doing this in PySpark I cannot really use DataSet API for
> this.
>
> One thing I can do is groupBy 'id', find the earliest timestamp available
> and then join with the original dataframe to get the right record (all the
> metrics).
>
> Or I can create a single column with all the records and then implement a
> UDAF in scala and use it in pyspark.
>
> Both solutions don't seem to be straight forward. Is there a simpler
> solution to this?
>
> Thanks
> Nikhil
>


Need help with SparkSQL Query

2018-12-17 Thread Nikhil Goyal
Hi guys,

I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
Boolean,  other metrics)

Schema looks like this:
root
 |-- id: long (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- isValid: boolean (nullable = true)
.

I need to find the earliest valid record per id. In RDD world I can do
groupBy 'id' and find the earliest one but I am not sure how I can do it in
SQL. Since I am doing this in PySpark I cannot really use DataSet API for
this.

One thing I can do is groupBy 'id', find the earliest timestamp available
and then join with the original dataframe to get the right record (all the
metrics).

Or I can create a single column with all the records and then implement a
UDAF in scala and use it in pyspark.

Neither solution seems straightforward. Is there a simpler solution to this?
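
For concreteness, a rough sketch of the groupBy-plus-join approach I
described (untested; note that ties on the minimum timestamp would return
more than one row per id):

from pyspark.sql import functions as F

# `record` is the dataframe described above.
valid = record.filter(F.col("isValid"))
earliest_ts = valid.groupBy("id").agg(F.min("timestamp").alias("timestamp"))
earliest = valid.join(earliest_ts, on=["id", "timestamp"], how="inner")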

Thanks
Nikhil


Spark App Write nothing on HDFS

2018-12-17 Thread Soheil Pourbafrani
Hi, I submit an app to a Spark 2 cluster using the standalone scheduler in
client mode.
The app saves the results of the processing to HDFS. There are no errors in
the output logs and the app finishes successfully.
But the problem is that it only creates a _SUCCESS file and an empty part-0
file in the output directory! I checked the app logic in local mode and it
works correctly.

Here are the related parts of the logs:

18/12/17 20:57:10 INFO scheduler.TaskSetManager: Finished task 1.0 in stage
78.0 (TID 350) in 13 ms on localhost (executor driver) (8/9)
18/12/17 20:57:10 INFO storage.ShuffleBlockFetcherIterator: Getting 0
non-empty blocks out of 9 blocks
18/12/17 20:57:10 INFO storage.ShuffleBlockFetcherIterator: Started 0
remote fetches in 0 ms
18/12/17 20:57:10 INFO executor.Executor: Finished task 8.0 in stage 78.0
(TID 357). 1779 bytes result sent to driver
18/12/17 20:57:10 INFO scheduler.TaskSetManager: Finished task 8.0 in stage
78.0 (TID 357) in 6 ms on localhost (executor driver) (9/9)
18/12/17 20:57:10 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 78.0,
whose tasks have all completed, from pool
18/12/17 20:57:10 INFO scheduler.DAGScheduler: ShuffleMapStage 78 (sortBy
at SparkSum.scala:260) finished in 0.016 s
18/12/17 20:57:10 INFO scheduler.DAGScheduler: looking for newly runnable
stages
18/12/17 20:57:10 INFO scheduler.DAGScheduler: running: Set()
18/12/17 20:57:10 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 79)
18/12/17 20:57:10 INFO scheduler.DAGScheduler: failed: Set()
18/12/17 20:57:10 INFO scheduler.DAGScheduler: Submitting ResultStage 79
(MapPartitionsRDD[240] at saveAsTextFile at SparkSum.scala:261), which has
no missing parents
18/12/17 20:57:10 INFO memory.MemoryStore: Block broadcast_46 stored as
values in memory (estimated size 73.0 KB, free 1987.9 MB)
18/12/17 20:57:10 INFO memory.MemoryStore: Block broadcast_46_piece0 stored
as bytes in memory (estimated size 26.6 KB, free 1987.9 MB)
18/12/17 20:57:10 INFO storage.BlockManagerInfo: Added broadcast_46_piece0
in memory on 172.16.20.4:40007 (size: 26.6 KB, free: 2004.4 MB)
18/12/17 20:57:10 INFO spark.SparkContext: Created broadcast 46 from
broadcast at DAGScheduler.scala:996
18/12/17 20:57:10 INFO scheduler.DAGScheduler: Submitting 1 missing tasks
from ResultStage 79 (MapPartitionsRDD[240] at saveAsTextFile at
SparkSum.scala:261)
18/12/17 20:57:10 INFO scheduler.TaskSchedulerImpl: Adding task set 79.0
with 1 tasks
18/12/17 20:57:10 INFO scheduler.TaskSetManager: Starting task 0.0 in stage
79.0 (TID 358, localhost, executor driver, partition 0, PROCESS_LOCAL, 5812
bytes)
18/12/17 20:57:10 INFO executor.Executor: Running task 0.0 in stage 79.0
(TID 358)
18/12/17 20:57:10 INFO storage.ShuffleBlockFetcherIterator: Getting 0
non-empty blocks out of 9 blocks
18/12/17 20:57:10 INFO storage.ShuffleBlockFetcherIterator: Started 0
remote fetches in 0 ms
18/12/17 20:57:10 INFO output.FileOutputCommitter: File Output Committer
Algorithm version is 1
18/12/17 20:57:10 INFO output.FileOutputCommitter: Saved output of task
'attempt_20181217205709_0079_m_00_358' to
hdfs://master:9000/home/hduser/soheil/res/_temporary/0/task_20181217205709_0079_m_00
18/12/17 20:57:10 INFO mapred.SparkHadoopMapRedUtil:
attempt_20181217205709_0079_m_00_358: Committed
18/12/17 20:57:10 INFO executor.Executor: Finished task 0.0 in stage 79.0
(TID 358). 1722 bytes result sent to driver
18/12/17 20:57:10 INFO scheduler.TaskSetManager: Finished task 0.0 in stage
79.0 (TID 358) in 75 ms on localhost (executor driver) (1/1)
18/12/17 20:57:10 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 79.0,
whose tasks have all completed, from pool
18/12/17 20:57:10 INFO scheduler.DAGScheduler: ResultStage 79
(saveAsTextFile at SparkSum.scala:261) finished in 0.076 s
18/12/17 20:57:10 INFO scheduler.DAGScheduler: Job 5 finished:
saveAsTextFile at SparkSum.scala:261, took 0.150206 s
Stopping ...
Elapsed Time: 218 Secs

18/12/17 20:57:10 INFO spark.SparkContext: Invoking stop() from shutdown
hook
18/12/17 20:57:10 INFO server.ServerConnector: Stopped
ServerConnector@5cc5b667{HTTP/1.1}{0.0.0.0:4040}
18/12/17 20:57:10 INFO handler.ContextHandler: Stopped
o.s.j.s.ServletContextHandler@5ae76500{/stages/stage/kill,null,UNAVAILABLE}
18/12/17 20:57:10 INFO handler.ContextHandler: Stopped
o.s.j.s.ServletContextHandler@2fd1731c{/jobs/job/kill,null,UNAVAILABLE}
18/12/17 20:57:10 INFO handler.ContextHandler: Stopped
o.s.j.s.ServletContextHandler@5ae81e1{/api,null,UNAVAILABLE}
18/12/17 20:57:10 INFO handler.ContextHandler: Stopped
o.s.j.s.ServletContextHandler@59fc684e{/,null,UNAVAILABLE}
18/12/17 20:57:10 INFO handler.ContextHandler: Stopped
o.s.j.s.ServletContextHandler@46c670a6{/static,null,UNAVAILABLE}
18/12/17 20:57:10 INFO handler.ContextHandler: Stopped
o.s.j.s.ServletContextHandler@782a4fff
{/executors/threadDump/json,null,UNAVAILABLE}
18/12/17 20:57:10 INFO handler.ContextHandler: Stopped
o.s.j.s.ServletContextHandler@30ed9c6c
{/executors/threadDump,null,UNAVAILABLE}
18/12/17 20:57:10

Mllib / kalman

2018-12-17 Thread robin . east
Pretty sure there is nothing in MLlib. This seems to be the most comprehensive 
coverage of implementing one in Spark: 
https://dzone.com/articles/kalman-filter-with-apache-spark-streaming-and-kafk. 
I’ve skimmed it rather than read it in detail, but it looks useful.
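
For a rough sense of what such an implementation boils down to, here is a
minimal standalone sketch of a scalar Kalman filter in plain Python
(random-walk state model; q, r and the initial state are assumptions to
tune). It is not an MLlib API, just the textbook recursion you would apply
per key:

def kalman_1d(measurements, q=1e-4, r=0.1, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a random-walk state with process noise q
    and measurement noise r. Returns the filtered estimates."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # predict: state unchanged, uncertainty grows by the process noise
        p = p + q
        # update: blend prediction and measurement by the Kalman gain
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates

# e.g. kalman_1d([1.2, 0.9, 1.1, 1.0]) smooths a noisy constant signal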


On Mon, 17 Dec 2018 at 14:00, Laurent Thiebaud wrote:

> 
> 
> Hi everybody,
> 
> 
> Is there any built-in implementation of Kalman filter with spark mllib? Or
> any other filter to achieve the same result? What's the state of the art
> about it?
> 
> 
> Thanks.
>

Mllib / kalman

2018-12-17 Thread Laurent Thiebaud
Hi everybody,

Is there any built-in implementation of the Kalman filter in Spark MLlib? Or
any other filter that achieves the same result? What's the state of the art
on this?

Thanks.