How spark decides whether to do BroadcastHashJoin or SortMergeJoin

2016-07-20 Thread raaggarw
Hi, How does Spark decide/optimize internally when it needs to do a BroadcastHashJoin vs. a SortMergeJoin? Is there any way we can guide it from outside or through options as to which join to use? Because in my case, when I am trying to do a join, Spark makes that join a BroadcastHashJoin internally and when

OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-06-09 Thread raaggarw
Hi, I was trying to port my code from Spark 1.5.2 to Spark 2.0; however, I faced some OutOfMemory issues. On drilling down, I could see that the OOM is caused by a join, because removing the join fixes the issue. I then created a small Spark app to reproduce this: (48 cores, 300 GB RAM - divided among 4

Re: Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-05 Thread raaggarw
Thanks. So, 1) For joins (stream-batch): are all types of joins supported (I mean inner, left outer, etc.) or only specific ones? Also, what is the timeline for complete support, I mean stream-stream joins? 2) So now outputMode is exposed via DataFrameWriter but will work in specific cases as you

Re: Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-05 Thread raaggarw
I accidentally deleted the original post, so I am just pasting the response from Tathagata Das: Join is supported, but only stream-batch joins. Output modes were added late last week; currently append mode is supported for non-aggregation queries and complete mode for aggregation

Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-05 Thread raaggarw
Hi, I am Ravi, Computer Scientist @ Adobe Systems. We have been actively using Spark for our internal projects. Recently we had a need for ETL on streaming data, so we were exploring Spark 2.0 for that. *But as far as I could see, the streaming DataFrames do not support basic operations like joins,
