[Structured Streaming] Avoiding multiple streaming queries

2018-02-12 Thread Priyank Shrivastava
I have a structured streaming query which sinks to Kafka. This query has complex aggregation logic. I would like to sink the output DF of this query to multiple Kafka topics, each partitioned on a different 'key' column. I don't want to have a separate Kafka sink for each of the different
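One pattern that may avoid separate sinks: when no fixed "topic" option is set, Spark's Kafka sink routes each row to the topic named in its "topic" column, so a single query can fan out to many topics. A minimal sketch, assuming hypothetical column and topic names (aggregated_df stands in for the output of the aggregation):

    from pyspark.sql import functions as F

    out = (aggregated_df
           # route each row to a topic, and pick the matching key column per topic
           .withColumn("topic", F.when(F.col("key_a").isNotNull(), F.lit("topic_a"))
                                 .otherwise(F.lit("topic_b")))
           .selectExpr("topic",
                       "CAST(key_a AS STRING) AS key",
                       "to_json(struct(*)) AS value"))

    query = (out.writeStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("checkpointLocation", "/tmp/ckpt")  # required for streaming sinks
             .start())

This keeps one streaming query and one checkpoint, at the cost of encoding the routing logic in the DataFrame itself.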

Re: Spark on K8s with Romana

2018-02-12 Thread Yinan Li
We actually moved away from using the driver pod IP because of https://github.com/apache-spark-on-k8s/spark/issues/482. The way this currently works is that the driver URL is constructed from the value of "spark.driver.host", which is set to the DNS name of the headless driver service in the
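A hedged sketch of forcing an IP in place of the DNS name via the standard Spark properties; whether the apache-spark-on-k8s fork honors an explicit override of the headless-service value is an assumption, and the IP below is a placeholder:

    # Advertise an explicit IP to executors instead of the service DNS name.
    # 10.0.0.5 stands in for the driver pod IP (hypothetical value).
    spark-submit \
      --conf spark.driver.host=10.0.0.5 \
      --conf spark.driver.bindAddress=0.0.0.0 \
      ...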

Spark on K8s with Romana

2018-02-12 Thread Jenna Hoole
So, we've run into something interesting. In our case, we've got some proprietary networking HW which is very feature-limited in the TCP/IP space, so with Romana, executors can't seem to find the driver using the hostname-lookup method it's attempting. Is there any way to make it use the IP instead? Thanks,

org.apache.kafka.clients.consumer.OffsetOutOfRangeException

2018-02-12 Thread Mina Aslani
Hi, I am getting the error below: Caused by: org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {topic1-0=304337} as soon as I submit a Spark app to my cluster. I am using the below dependency name:
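This error usually means the offsets Spark asked for have already been deleted by Kafka retention (or a stale checkpoint points at them). A minimal sketch for the structured streaming Kafka source, assuming a hypothetical broker and the topic1 from the error (for a DStream consumer, the equivalent knob is the auto.offset.reset kafkaParam):

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "topic1")
          .option("startingOffsets", "earliest")  # where to begin when no offsets are stored
          .option("failOnDataLoss", "false")      # skip ranges lost to retention instead of failing
          .load())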

Re: optimize hive query to move a subset of data from one partition table to another table

2018-02-12 Thread devjyoti patra
Can you try running your query with a static literal for the date filter (join_date >= SOME 2 MONTH OLD DATE)? I cannot think of any reason why this query should create more than 60 tasks. On 12 Feb 2018 6:26 am, "amit kumar singh" wrote: Hi create table emp as select *
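A sketch of what that suggestion looks like; since the full query is truncated above, the source table and column names are hypothetical:

    # Dynamic filter that can get in the way of partition pruning:
    #   WHERE join_date >= date_sub(current_date(), 60)
    # Static literal instead:
    spark.sql("""
        CREATE TABLE emp AS
        SELECT * FROM emp_source
        WHERE join_date >= '2017-12-12'   -- static ~2-month-old date literal
    """)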

Re: Schema - DataTypes.NullType

2018-02-12 Thread Jean Georges Perrin
Thanks Nicholas. It makes sense. Now that I have a hint, I can play with it too! jg > On Feb 11, 2018, at 19:15, Nicholas Hakobian > wrote: > > I spent a few minutes poking around in the source code and found this: > > The data type representing None, used
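The class being quoted is PySpark's NullType, "the data type representing None". A minimal sketch of it in a schema; note that a NullType column can only hold None:

    from pyspark.sql.types import StructType, StructField, StringType, NullType

    schema = StructType([
        StructField("name", StringType()),
        StructField("placeholder", NullType()),  # every value in this column must be None
    ])
    df = spark.createDataFrame([("a", None), ("b", None)], schema)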

Re: Efficient way to compare the current row with previous row contents

2018-02-12 Thread Georg Heiler
See https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html and https://stackoverflow.com/questions/42448564/spark-sql-window-function-with-complex-condition for a more involved example. KhajaAsmath Mohammed wrote on Mon., 12 Feb. 2018 at
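A minimal sketch of the lag() pattern those links describe: compare each row with the previous one within a partition (column names hypothetical):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("id").orderBy("event_time")
    df2 = (df
           .withColumn("prev_value", F.lag("value").over(w))        # previous row's value
           .withColumn("changed", F.col("value") != F.col("prev_value")))

On the streaming question below: as of Spark 2.2, row-ordered window functions like lag are supported on batch DataFrames but not in structured streaming queries.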

Re: Efficient way to compare the current row with previous row contents

2018-02-12 Thread KhajaAsmath Mohammed
I am also looking for the same answer. Will this work in a streaming application too? Sent from my iPhone > On Feb 12, 2018, at 8:12 AM, Debabrata Ghosh wrote: > > Georg - Thanks! Will you be able to help me with a few examples please? > > Thanks in advance again! >

Re: Efficient way to compare the current row with previous row contents

2018-02-12 Thread Debabrata Ghosh
Georg - Thanks! Will you be able to help me with a few examples please? Thanks in advance again! Cheers, D On Mon, Feb 12, 2018 at 6:03 PM, Georg Heiler wrote: > You should look into window functions for Spark SQL. > Debabrata Ghosh wrote

Spark sortByKey is not lazy evaluated

2018-02-12 Thread sandudi
Hey people, sortByKey is not lazily evaluated in Spark because it uses a range partitioner, but I have one confusion: does it load the whole data from the source, or only sample data? So if I use sortByKey, will it pull the whole data into memory or only a sample, so that the second time, when I run the real action, it will load the
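A small sketch to see this for yourself: sortByKey eagerly runs a sampling job (the range partitioner samples the keys to compute partition boundaries), but the full shuffle and sort are deferred until an action runs:

    rdd = sc.parallelize([(k, 1) for k in range(1000000)], 8)
    sorted_rdd = rdd.sortByKey()   # check the Spark UI: only a sampling job runs here
    sorted_rdd.count()             # the full shuffle/sort happens on this action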

Re: Efficient way to compare the current row with previous row contents

2018-02-12 Thread Georg Heiler
You should look into window functions for Spark SQL. Debabrata Ghosh wrote on Mon., 12 Feb. 2018 at 13:10: > Hi, > Greetings! > > I needed some efficient way in pyspark to execute a > comparison (on all the attributes) between the

Efficient way to compare the current row with previous row contents

2018-02-12 Thread Debabrata Ghosh
Hi, Greetings! I needed an efficient way in pyspark to execute a comparison (on all the attributes) between the current row and the previous row. My intent here is to leverage the distributed framework of Spark to the best extent so that I can achieve a good

[pyspark] structured streaming deployment & monitoring recommendation

2018-02-12 Thread Bram
Hi, I have questions regarding a Spark structured streaming deployment recommendation. I have roughly 100 Kafka topics that can be processed using a similar code block. I am using pyspark 2.2.1. Here is the idea: TOPIC_LIST = ["topic_a","topic_b","topic_c"] stream = {} for t in TOPIC_LIST:
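A sketch of one way to complete that loop, starting one query per topic and then blocking on all of them; the sink format, paths, and broker are hypothetical placeholders:

    TOPIC_LIST = ["topic_a", "topic_b", "topic_c"]
    stream = {}
    for t in TOPIC_LIST:
        df = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", t)
              .load())
        stream[t] = (df.writeStream
                     .format("parquet")
                     .option("path", "/data/%s" % t)
                     .option("checkpointLocation", "/ckpt/%s" % t)  # one checkpoint per query
                     .start())
    spark.streams.awaitAnyTermination()   # monitor all queries in this application

Alternatively, if the processing really is identical across topics, a single query can subscribe to a comma-separated topic list ("subscribe", "topic_a,topic_b,...") instead of running 100 queries.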