Re: Synonym handling replacement issue with UDF in Apache Spark

2017-05-03 Thread JayeshLalwani
You need to understand how a join works to make sense of this. Logically, a join takes the Cartesian product of the two tables and then filters for the rows that satisfy the contains UDF. So, let's say you have this input: Allen Armstrong nishanth hemanth Allen shivu Armstrong nishanth shree shivu DeWALT
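The "Cartesian product, then filter" model above can be sketched in plain Python (not how Spark physically executes a join, just its logical semantics; the row data and the contains check are illustrative stand-ins for the poster's synonym UDF):

```python
from itertools import product

# Illustrative stand-ins for the two tables being joined.
input_rows = ["Allen Armstrong", "nishanth hemanth Allen", "shree shivu DeWALT"]
synonyms = [("Allen", "Alan"), ("Armstrong", "Neil")]

def contains_udf(text, word):
    # The join condition: keep a pair only if the row mentions the word.
    return word in text.split()

# Step 1: Cartesian product of the two "tables".
pairs = product(input_rows, synonyms)
# Step 2: filter down to the pairs that satisfy the UDF.
joined = [(row, syn) for row, (word, syn) in pairs if contains_udf(row, word)]
# Each input row can match several synonym rows, which is why a UDF-based
# join can multiply rows rather than replace words in place.
```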

Re: Spark-SQL collect function

2017-05-03 Thread JayeshLalwani
In any distributed application, you scale up by splitting execution across multiple machines. The way Spark does this is by slicing the data into partitions and spreading them across multiple machines. Logically, an RDD is exactly that: data split up and spread around on multiple machines. When you
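The slicing described above can be sketched in plain Python. This toy version keeps all partitions in one process; in a real cluster each partition would live on a different executor:

```python
def partition(data, num_partitions):
    # Round-robin placement of records into partitions, roughly what a
    # simple partitioner does.
    parts = [[] for _ in range(num_partitions)]
    for i, item in enumerate(data):
        parts[i % num_partitions].append(item)
    return parts

# A map over the "RDD" runs independently on every partition, which is
# what lets Spark parallelize the work across machines.
parts = partition(range(10), 3)
mapped = [[x * x for x in p] for p in parts]
```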

Re: In-order processing using spark streaming

2017-05-03 Thread JayeshLalwani
Option A: If you can get all the messages in a session into the same Spark partition, you can use df.mapPartitions to process the whole partition. This lets you control the order in which the messages are processed within the partition. This will work if messages are posted to Kafka in
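Option A can be sketched in plain Python as a stand-in for the function you would pass to mapPartitions. The message fields (offset, payload) and the per-message work are assumptions for illustration:

```python
def process_partition(messages):
    # All messages for a session are in this one partition, so sorting by
    # Kafka offset restores the order they were produced in.
    for msg in sorted(messages, key=lambda m: m["offset"]):
        yield msg["payload"].upper()   # placeholder per-message work

# One partition's worth of messages, arriving out of order.
partition_msgs = [
    {"offset": 2, "payload": "pay"},
    {"offset": 1, "payload": "add to cart"},
    {"offset": 3, "payload": "ship"},
]
result = list(process_partition(partition_msgs))
```

Note this only guarantees order within a partition, which is why the messages for a session must all land in the same partition (e.g. by keying Kafka messages on the session id).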

Refreshing a persisted RDD

2017-05-03 Thread JayeshLalwani
We have a Structured Streaming application that gets accounts from Kafka into a streaming data frame. We have a blacklist of accounts stored in S3, and we want to filter out all the accounts that are blacklisted. So, we are loading the blacklisted accounts into a batch data frame and joining it
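The stream-vs-batch filter being described behaves like a left anti join: keep streaming rows whose account does not appear in the blacklist. A minimal pure-Python sketch of that semantics (in Spark this would be something like stream_df.join(blacklist_df, "account_id", "left_anti"); the account ids here are made up):

```python
# Batch side: the blacklist, loaded once (e.g. from S3).
blacklist = {"acct-2", "acct-9"}

# Streaming side: one micro-batch of incoming account ids.
stream_batch = ["acct-1", "acct-2", "acct-3"]

# Anti-join semantics: keep only rows with no match in the blacklist.
allowed = [a for a in stream_batch if a not in blacklist]
```

The catch the poster is running into is the "refresh" part: a batch data frame loaded once is effectively a snapshot, so changes to the S3 blacklist are not picked up without reloading it.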

Structured Streaming: How to handle bad input

2017-02-23 Thread JayeshLalwani
What is a good way to make a Structured Streaming application deal with bad input? Right now, the problem is that bad input kills the Structured Streaming application. This is highly undesirable, because a Structured Streaming application has to be always on. For example, here is a very simple
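One common pattern for this problem is to parse each record defensively and route failures to a side channel instead of letting the exception kill the query. A pure-Python sketch of the idea (in Structured Streaming the analogous tool is from_json, which yields null for unparseable rows, combined with a filter that splits good rows from quarantined ones):

```python
import json

def parse(record):
    # Never let a malformed record throw: tag it and move on.
    try:
        return ("ok", json.loads(record))
    except ValueError:
        return ("bad", record)          # quarantine instead of crashing

records = ['{"a": 1}', 'not json', '{"a": 2}']
good = [v for tag, v in map(parse, records) if tag == "ok"]
bad = [v for tag, v in map(parse, records) if tag == "bad"]
# The stream keeps running; bad records are preserved for inspection.
```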