How to work around NoOffsetForPartitionException when using Spark Streaming

2018-06-01 Thread Martin Peng
Hi, we see the below exception when using Spark Kafka streaming 0.10 on a normal Kafka topic. Not sure why the offset is missing in ZooKeeper, but since Spark Streaming overrides the offset reset policy to none in the code, I cannot set the reset policy to latest (I don't really care about data loss now). Is there any…
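A common workaround (a sketch, not from the thread itself): the 0.10 direct stream only forces auto.offset.reset to none on the executor consumers, so the driver-side consumer still honors a user-supplied reset policy; alternatively, seed explicit starting offsets so no reset is ever needed. Broker, group, topic, and partition values below are hypothetical.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-demo"), Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",              // hypothetical
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-group",                          // hypothetical
  "auto.offset.reset" -> "latest"                    // honored by the driver consumer
)

// Seeding explicit offsets sidesteps the reset policy entirely:
// the driver seeks here instead of asking Kafka where to start.
val fromOffsets = Map(new TopicPartition("my-topic", 0) -> 0L)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent,
  Subscribe[String, String](Seq("my-topic"), kafkaParams, fromOffsets))
```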

Help explaining explain() after DataFrame join reordering

2018-06-01 Thread Mohamed Nadjib MAMI
Dear Sparkers, I'm loading data into DataFrames from 5 sources (using the official connectors): Parquet, MongoDB, Cassandra, MySQL, and CSV. I'm then joining those DataFrames in two different orders: - mongo * cassandra * jdbc * parquet * csv (random order). - parquet * csv * cassandra * jdbc *…
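For reference, explain(true) prints the parsed, analyzed, optimized, and physical plans, which is usually what is needed to see whether Catalyst reordered the joins. A minimal sketch with in-memory stand-ins for the five sources (names and the join key are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("explain-demo").getOrCreate()
import spark.implicits._

// Stand-ins for the five sources; swap in the real connectors.
val mongo     = Seq((1, "m")).toDF("id", "m")
val cassandra = Seq((1, "c")).toDF("id", "c")
val jdbc      = Seq((1, "j")).toDF("id", "j")
val parquet   = Seq((1, "p")).toDF("id", "p")
val csv       = Seq((1, "v")).toDF("id", "v")

// Same logical result, two join orders.
val order1 = mongo.join(cassandra, "id").join(jdbc, "id").join(parquet, "id").join(csv, "id")
val order2 = parquet.join(csv, "id").join(cassandra, "id").join(jdbc, "id").join(mongo, "id")

order1.explain(true)  // extended: logical, optimized, and physical plans
order2.explain(true)  // compare the physical plans to spot any reordering
```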

Append In-Place to S3

2018-06-01 Thread Benjamin Kim
I have a situation where I'm trying to add only new rows to an existing data set that lives in S3 as gzipped Parquet files, looping and appending for each hour of the day. First, I create a DF from the existing data, then I use a query to create another DF with the data that is new. Here is the…
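One way to express "only new rows" is a left anti join on the key before appending; a sketch, with hypothetical S3 paths and an assumed key column `id`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Existing gzipped Parquet data in S3 (path hypothetical).
val existing = spark.read.parquet("s3a://bucket/table/")

// This hour's candidate rows (source hypothetical).
val incoming = spark.read.parquet("s3a://bucket/staging/")

// Keep only rows whose id is not already present.
val newRows = incoming.join(existing, Seq("id"), "left_anti")

// Materialize before appending to the same path we read from,
// so the write job does not re-list the source mid-append.
newRows.cache().count()
newRows.write
  .mode("append")
  .option("compression", "gzip")
  .parquet("s3a://bucket/table/")
```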

Re: Spark structured streaming generate output path runtime

2018-06-01 Thread Lalwani, Jayesh
This will not work the way you have implemented it. The code that you have here will be called only once, before the streaming query is started; once the streaming query starts, this code is not called. What I would do is: 1. Implement a udf that calculates the floor of the timestamp. 2. Add a column in…
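A sketch of that recipe, using the built-in date_format in place of a custom udf (paths and the rate source are placeholders): derive the minute bucket from each row's own timestamp and let partitionBy choose the directory per record.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()

// Placeholder streaming source; it emits a `timestamp` column.
val events = spark.readStream.format("rate").load()

// The bucket is computed per row at write time, not once at query start.
val withMinute = events.withColumn(
  "minute", date_format(col("timestamp"), "yyyy-MM-dd-HH-mm"))

val query = withMinute.writeStream
  .format("parquet")
  .option("path", "s3a://bucket/out")        // hypothetical
  .option("checkpointLocation", "/tmp/chk")  // hypothetical
  .partitionBy("minute")                     // one directory per minute
  .start()
```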

Spark structured streaming generate output path runtime

2018-06-01 Thread Swapnil Chougule
Hi, I want to generate the output directory at runtime for my data. The directory name is derived from the current timestamp; let's say data for the same minute should go into the same directory. I tried the following snippet but it didn't work: all data is being written into the same directory (created with respect to the initial…
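For contrast, this is the shape of the attempt that fails (reconstructed from the description, not the original snippet): the path is computed once when the query is defined, so every micro-batch writes to that first directory.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val events = spark.readStream.format("rate").load()  // placeholder source

// Evaluated exactly once, before start(); the directory never changes.
val outPath = "s3a://bucket/out/" +
  LocalDateTime.now.format(DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm"))

val query = events.writeStream
  .format("parquet")
  .option("path", outPath)                   // frozen at query definition
  .option("checkpointLocation", "/tmp/chk")  // hypothetical
  .start()
```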

[Spark SQL] Is it possible to do stream to stream inner join without event time?

2018-06-01 Thread Becket Qin
Hi, I am new to Spark and I'm trying to run a few queries from TPC-H using Spark SQL. According to the documentation here, it is OPTIONAL to have a watermark defined in the…
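Indeed, an inner stream-stream join runs without any watermark; the trade-off is that Spark must retain all past rows of both sides as state, since it cannot tell when a match can no longer arrive. A minimal sketch with two placeholder rate streams:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()

// Two placeholder streams sharing a join key.
val left = spark.readStream.format("rate").load()
  .select((col("value") % 10).as("key"), col("value").as("leftVal"))
val right = spark.readStream.format("rate").load()
  .select((col("value") % 10).as("key"), col("value").as("rightVal"))

// Inner join, no watermark: legal, but state grows without bound.
val joined = left.join(right, "key")

val query = joined.writeStream
  .format("console")
  .outputMode("append")  // stream-stream joins require append mode
  .start()
```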

[Spark SQL] Efficiently calculating Weight of Evidence in PySpark

2018-06-01 Thread Aakash Basu
Hi guys, can anyone please let me know if you have any clue about this problem I posted on Stack Overflow: https://stackoverflow.com/questions/50638911/how-to-efficiently-calculate-woe-in-pyspark Thanks, Aakash.
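For anyone landing here: Weight of Evidence per category is ln((events_i / total events) / (non-events_i / total non-events)), which reduces to one groupBy plus a global aggregate. A sketch of that algebra in Scala (the question is about PySpark, but these DataFrame calls translate one-for-one; the data and column names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical data: a categorical feature and a binary label.
val df = Seq(("a", 1), ("a", 0), ("b", 1), ("b", 0), ("b", 0), ("c", 1), ("c", 0))
  .toDF("feature", "label")

// Global event / non-event totals (a single-row DataFrame).
val totals = df.agg(
  sum(col("label")).as("totalEvents"),
  sum(lit(1) - col("label")).as("totalNonEvents"))

// Per-category counts in one pass over the data.
val perBin = df.groupBy("feature").agg(
  sum(col("label")).as("events"),
  sum(lit(1) - col("label")).as("nonEvents"))

// WoE = ln( (events/totalEvents) / (nonEvents/totalNonEvents) )
val woe = perBin.crossJoin(totals).withColumn("woe",
  log((col("events") / col("totalEvents")) /
      (col("nonEvents") / col("totalNonEvents"))))

woe.select("feature", "woe").show()
```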