Live Code Reviews, Coding, and Dev Tools

2018-07-24 Thread Holden Karau
Tomorrow afternoon @ 3pm Pacific I'll be doing some dev tools poking for Beam and Spark - https://www.youtube.com/watch?v=6cTmC_fP9B0 for mention-bot. On Friday I'll be doing my normal code reviews - https://www.youtube.com/watch?v=O4rRx-3PTiM. On Monday, July 30th @ 9:30am, I'll be doing some more

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-24 Thread Karthik Reddy Vadde
On Thu, Jul 12, 2018 at 10:23 AM Arun Mahadevan wrote: > Yes, ForeachWriter [1] could be an option if you want to write to different sinks. You can put your custom logic there to split the data into different sinks. The drawback here is that you cannot plug in existing sinks like Kafka and you
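For reference, a minimal sketch of the ForeachWriter approach in PySpark (the foreach-with-object form needs Spark 2.4+); the "category" routing column, the split condition, and the file paths are illustrative assumptions, not details from the thread:

    class SplitSinkWriter:
        def open(self, partition_id, epoch_id):
            # Open one output per sink for this partition/epoch (paths are placeholders).
            self.sink_a = open("/tmp/sink_a_%d_%d.txt" % (partition_id, epoch_id), "w")
            self.sink_b = open("/tmp/sink_b_%d_%d.txt" % (partition_id, epoch_id), "w")
            return True

        def process(self, row):
            # Custom split logic: pick the sink from a hypothetical "category" column.
            target = self.sink_a if row["category"] == "a" else self.sink_b
            target.write(str(row) + "\n")

        def close(self, error):
            self.sink_a.close()
            self.sink_b.close()

    query = df.writeStream.foreach(SplitSinkWriter()).start()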

How to read JSON data from Kafka and store to HDFS with Spark Structured Streaming?

2018-07-24 Thread dddaaa
I'm trying to read JSON messages from Kafka and store them in HDFS with Spark Structured Streaming. I followed the example here: https://spark.apache.org/docs/2.1.0/structured-streaming-kafka-integration.html and when my code looks like this: df = spark \ .read \
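For reference, a minimal PySpark sketch of the streaming variant (note readStream rather than the batch read shown in the question); the broker address, topic name, two-field schema, and HDFS paths are placeholder assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Structured Streaming cannot infer a JSON schema; it must be supplied.
    schema = StructType([StructField("id", StringType()),
                         StructField("payload", StringType())])

    df = (spark.readStream                                    # streaming read, not batch .read
            .format("kafka")
            .option("kafka.bootstrap.servers", "host1:9092")  # placeholder broker
            .option("subscribe", "mytopic")                   # placeholder topic
            .load()
            .select(from_json(col("value").cast("string"), schema).alias("data"))
            .select("data.*"))

    query = (df.writeStream
               .format("parquet")
               .option("path", "hdfs:///data/json_out")           # placeholder HDFS path
               .option("checkpointLocation", "hdfs:///data/chk")  # required for file sinks
               .start())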

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-24 Thread kant kodali
I am not seeing what else I would be increasing in this case if I were to write a set of rows to two different topics; that means for every row there will be two I/Os to two different topics. If foreachBatch caches, great! But again, I would try to cache as little as possible. If I duplicate rows

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-24 Thread Silvio Fiorito
Any way you go you’ll increase something... ☺ Even with foreachBatch you would have to cache the DataFrame before submitting each batch to each topic to avoid recomputing it (see
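A short sketch of that pattern with foreachBatch (Spark 2.4+), assuming the DataFrame already has Kafka-compatible key/value columns; the broker address and topic names are placeholders:

    def write_to_both_topics(batch_df, batch_id):
        batch_df.persist()  # cache once so the two writes below don't recompute the batch
        for topic in ("topic_a", "topic_b"):
            (batch_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
                     .write
                     .format("kafka")
                     .option("kafka.bootstrap.servers", "host1:9092")
                     .option("topic", topic)
                     .save())
        batch_df.unpersist()

    query = df.writeStream.foreachBatch(write_to_both_topics).start()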

Re: Arrow type issue with Pandas UDF

2018-07-24 Thread Gourav Sengupta
super duper :) On Tue, Jul 24, 2018 at 7:11 PM, Patrick McCarthy <pmccar...@dstillery.com.invalid> wrote: > Thanks Bryan. I think it was ultimately groupings that were too large; after setting spark.sql.shuffle.partitions to a much higher number I was able to get the UDF to execute. On

Re: Arrow type issue with Pandas UDF

2018-07-24 Thread Patrick McCarthy
Thanks Bryan. I think it was ultimately groupings that were too large; after setting spark.sql.shuffle.partitions to a much higher number I was able to get the UDF to execute. On Fri, Jul 20, 2018 at 12:45 AM, Bryan Cutler wrote: > Hi Patrick, it looks like it's failing in Scala before it
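For context, a grouped-map Pandas UDF sketch showing where that shuffle-partition setting comes in; the value 2000, the schema, and the demean logic are illustrative assumptions, not the numbers from the thread:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # More shuffle partitions spreads the groups across more tasks, which can
    # help when the per-task data is too large for Arrow (2000 is illustrative).
    spark.conf.set("spark.sql.shuffle.partitions", "2000")

    @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
    def demean(pdf):
        # Subtract the per-group mean from each value.
        return pdf.assign(v=pdf.v - pdf.v.mean())

    result = df.groupby("id").apply(demean)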

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-24 Thread kant kodali
@Silvio I thought about duplicating rows but dropped the idea because of the extra memory. foreachBatch sounds interesting! On Mon, Jul 23, 2018 at 6:51 AM, Silvio Fiorito <silvio.fior...@granturing.com> wrote: > Using the current Kafka sink that supports routing based on the topic column, you could just
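A sketch of the topic-column routing Silvio describes: duplicate the rows, tag each copy with its destination, and let the Kafka sink route on the "topic" column (it does this when no fixed topic option is set). Topic names, broker, and checkpoint path are placeholders:

    from pyspark.sql.functions import lit

    # One tagged copy of the rows per destination topic.
    tagged = (df.withColumn("topic", lit("topic_a"))
                .union(df.withColumn("topic", lit("topic_b"))))

    query = (tagged.selectExpr("topic", "CAST(value AS STRING) AS value")
                   .writeStream
                   .format("kafka")
                   .option("kafka.bootstrap.servers", "host1:9092")
                   .option("checkpointLocation", "/tmp/chk/route")
                   .start())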

Re: Where can I read the Kafka offsets in SparkSQL application

2018-07-24 Thread Gourav Sengupta
Hi, can you see whether using the checkpointLocation option would work, in case you are using structured streaming? Regards, Gourav Sengupta On Tue, Jul 24, 2018 at 12:30 PM, John, Vishal (Agoda) <vishal.j...@agoda.com.invalid> wrote: > Hello all, I have to read data from Kafka
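A minimal sketch of that suggestion, assuming streaming_df is the Kafka source DataFrame from the question (sink format and paths are placeholders); with a checkpointLocation set, a restarted query resumes from the offsets stored in the checkpoint:

    query = (streaming_df.writeStream
                .format("parquet")
                .option("path", "hdfs:///output")                        # placeholder output path
                .option("checkpointLocation", "hdfs:///chk/this-query")  # offsets tracked here
                .start())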

Where can I read the Kafka offsets in SparkSQL application

2018-07-24 Thread John, Vishal (Agoda)
Hello all, I have to read data from a Kafka topic at regular intervals. I create the DataFrame as shown below. I don’t want to start reading from the beginning on each run; at the same time, I don’t want to miss the messages between run intervals. val queryDf = sqlContext.read
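For batch reads at intervals, one hedged sketch is to persist the last-read offsets yourself and pass them back via startingOffsets/endingOffsets (the offsets JSON shape follows the Kafka integration docs; the topic, partitions, values, and broker are placeholders):

    import json

    # Offsets saved at the end of the previous run (per topic, per partition).
    start = json.dumps({"mytopic": {"0": 1234, "1": 5678}})

    batch_df = (spark.read
                     .format("kafka")
                     .option("kafka.bootstrap.servers", "host1:9092")
                     .option("subscribe", "mytopic")
                     .option("startingOffsets", start)
                     .option("endingOffsets", "latest")
                     .load())

    # After processing, record the max offset per partition for the next run.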