JDBC Hook in Spark Batch Application

2020-12-23 Thread lec ssmi
Hi guys, I have some Spark programs that perform database connection operations. I want to capture the connection information, such as the JDBC connection properties, without being too intrusive to the code. Any good ideas? Can a Java agent do it?
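
For reference, a Java agent that instruments the JDBC classes is one possible route for JVM jobs; for PySpark jobs, a lighter option is to wrap DataFrameReader.jdbc before the job code runs. A minimal sketch, assuming the programs go through the DataFrame JDBC reader (the logger name and wrapper are hypothetical):

import logging
from pyspark.sql import DataFrameReader

log = logging.getLogger("jdbc-hook")

_original_jdbc = DataFrameReader.jdbc

def _logged_jdbc(self, *args, **kwargs):
    # Capture whatever connection info is passed (url, table, properties, ...),
    # then delegate to the real reader unchanged.
    log.info("DataFrameReader.jdbc called with args=%s kwargs=%s", args, kwargs)
    return _original_jdbc(self, *args, **kwargs)

DataFrameReader.jdbc = _logged_jdbc  # install the hook once, early in the driver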

Re: [Spark Structured Streaming] Not working while worker node is on different machine

2020-12-23 Thread lec ssmi
Any more details about it? bannya wrote on Fri, Dec 18, 2020 at 11:25 AM: > Hi, I have a Spark Structured Streaming application that is reading data from a Kafka topic (16 partitions). I am using standalone mode. I have two worker nodes; one node is on the same machine as the master and the other one is
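
For context, a minimal sketch of the kind of job described (broker, topic, and paths are placeholders; the real application is not shown in the thread). In a multi-machine standalone cluster, the checkpoint and output locations have to be reachable from every node:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

df = (spark.readStream
      .format("kafka")                                   # needs the spark-sql-kafka package
      .option("kafka.bootstrap.servers", "broker1:9092") # placeholder broker
      .option("subscribe", "some_topic")                 # the 16-partition topic above
      .load())

query = (df.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("console")
         .option("checkpointLocation", "/shared/checkpoints/demo")  # must be on shared storage
         .start())
query.awaitTermination()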

Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
Thanks Jungtaek. OK, I got it. I'll test it and check whether the loss of efficiency is acceptable. On Wed, Dec 23, 2020 at 11:29 PM, Jungtaek Lim wrote: > Please refer to my previous answer -

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Sean Owen
Why do you want to use this function instead of the built-in stddev function? On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh wrote: > Hi, this is a shot in the dark, so to speak. I would like to use the standard deviation (std) offered by numpy in PySpark. I am using SQL for now.
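
For comparison, a sketch of the built-in route (a live SparkSession named spark is assumed; the table and column names are made up). Spark SQL already ships stddev/stddev_samp (sample, ddof=1) and stddev_pop (population, ddof=0), so a plain standard deviation needs no UDF; note that numpy.std defaults to ddof=0, the population flavour.

spark.sql("""
    SELECT Customer_ID,
           stddev_samp(amount) AS std_amount_sample,
           stddev_pop(amount)  AS std_amount_population
    FROM   sales
    GROUP  BY Customer_ID
""").show()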

Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread Jungtaek Lim
Please refer to my previous answer - https://lists.apache.org/thread.html/r7dfc9e47cd9651fb974f97dde756013fd0b90e49d4f6382d7a3d68f7%40%3Cuser.spark.apache.org%3E We probably want to add it to the SS guide doc. We didn't need it before, as it just didn't work with an eventually consistent model, and now it

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Mich Talebzadeh
OK, thanks for the tip. I found this link from Databricks useful for Python: User-defined functions - Python — Databricks Documentation

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Peyman Mohajerian
https://stackoverflow.com/questions/43484269/how-to-register-udf-to-use-in-sql-and-dataframe On Wed, Dec 23, 2020 at 12:52 PM Mich Talebzadeh wrote: > Hi, this is a shot in the dark, so to speak. I would like to use the standard deviation (std) offered by numpy in PySpark. I am using
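
For reference, a sketch of the registration pattern that link describes, applied to a numpy-based std (a live SparkSession named spark is assumed; the table and column names are made up). A plain Python UDF works row by row, so here it is only useful on an array column produced by an aggregate such as collect_list:

import numpy as np
from pyspark.sql.types import DoubleType

def np_std(values):
    # 'values' is an array of numbers collected per group
    return float(np.std(values, ddof=1)) if values else None

spark.udf.register("np_std", np_std, DoubleType())   # makes it callable from SQL
spark.sql("SELECT np_std(collect_list(amount)) AS std_amount FROM sales").show()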

Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
Does it work with the standard AWS S3 solution and its new consistency model? On Wed, Dec 23, 2020 at 6:48 PM, David Morin wrote: > Thanks. My Spark applications run on nodes based on Docker images but

Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Mich Talebzadeh
Hi, this is a shot in the dark, so to speak. I would like to use the standard deviation (std) offered by numpy in PySpark. I am using SQL for now. The code is as below: sqltext = f""" SELECT rs.Customer_ID , rs.Number_of_orders , rs.Total_customer_amount
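
One way to plug numpy into that kind of per-customer aggregation is a grouped-aggregate pandas UDF (Spark 3.x, requires pyarrow). This is a sketch only; the table and column names below are guessed from the truncated query, and ddof=1 is used to match Spark's own stddev_samp:

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def np_std(amount: pd.Series) -> float:      # Series -> scalar: a grouped-aggregate UDF
    return float(np.std(amount, ddof=1))

(spark.table("sales")
      .groupBy("Customer_ID")
      .agg(np_std("amount").alias("std_amount"))
      .show())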

Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
Thanks. My Spark applications run on nodes based on Docker images, but this is standalone mode (1 driver, n workers). Can we use S3 directly with a consistency add-on like S3Guard (s3a) or AWS consistent view?

Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread Lalwani, Jayesh
Yes. It is necessary to have a distributed file system because all the workers need to read/write to the checkpoint. The distributed file system has to be immediately consistent: when one node writes to it, the other nodes should be able to read it immediately. The solutions/workarounds depend
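
In code terms, the requirement boils down to checkpointLocation pointing at storage the driver and every executor can read and write consistently. A minimal sketch, assuming an existing streaming DataFrame df; the bucket and paths are placeholders, and whether plain S3 is consistent enough is exactly what this thread is discussing:

query = (df.writeStream
         .format("parquet")
         .option("path", "s3a://some-bucket/output/")                     # placeholder output path
         .option("checkpointLocation", "s3a://some-bucket/checkpoints/")  # or an HDFS path all nodes see
         .start())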

Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
Hello, I have an issue with my PySpark job related to checkpoints. Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 16997.0 failed 4 times, most recent failure: Lost task 3.3 in stage 16997.0 (TID 206609, 10.XXX, executor 4):