Re: [pyspark 2.4] broadcasting DataFrame throws error

2020-09-18 Thread Amit Joshi
Hi Rishi, Maybe you have already done these steps. Can you check the size of the dataframe you are trying to broadcast using logInfo(SizeEstimator.estimate(df)) and adjust the driver accordingly? There is one more issue I found in Spark 2: broadcast does not work on cached data. It is poss
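[Editor's note] A minimal sketch of the size check suggested above, assuming a spark-shell session where `spark` is predefined; the table name is hypothetical, and SizeEstimator reports the JVM object-graph size, so treat the result as a rough proxy only:

    // Sketch only: logging a rough size estimate before broadcasting.
    import org.apache.spark.util.SizeEstimator

    val df = spark.table("dim_table")                // hypothetical DataFrame you intend to broadcast
    val estimatedBytes = SizeEstimator.estimate(df)  // size of the object graph, not exact data size
    println(s"Estimated size of df: $estimatedBytes bytes")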

how to integrate hbase and hive in spark3.0.1?

2020-09-18 Thread ??????
hello, I am using Spark 3.0.1 and I want to integrate Hive and HBase, but I don't know which Hive and HBase versions to choose. I have re-compiled the Spark source and installed Spark 3.0.1 with Hive and Hadoop, but I encountered the error below. Can anyone help? root@namenode bin]# ./spark-sql 20/09/18 23:3
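[Editor's note] The error in the thread is truncated, so the cause is unclear. One common setup, sketched here under explicit assumptions: the HBase-backed table is created in Hive itself (Spark's own DDL parser generally rejects the STORED BY storage-handler clause), the hive-hbase-handler and HBase client jars are on the Spark classpath (e.g. via --jars), and hive-site.xml/hbase-site.xml are visible to Spark. All table and column names below are hypothetical.

    // Hive-side DDL (run in beeline/Hive CLI, shown only as a comment):
    //   CREATE EXTERNAL TABLE hbase_users (key STRING, name STRING)
    //   STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    //   WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name')
    //   TBLPROPERTIES ('hbase.table.name' = 'users');

    import org.apache.spark.sql.SparkSession

    // Sketch only: reading the Hive-registered, HBase-backed table from Spark.
    val spark = SparkSession.builder()
      .appName("hive-hbase-read")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SELECT key, name FROM hbase_users LIMIT 10").show()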

Re: [pyspark 2.4] broadcasting DataFrame throws error

2020-09-18 Thread Rishi Shah
Thanks Amit. I have tried increasing driver memory, and also tried increasing the max result size returned to the driver. Nothing works. I believe Spark is not able to determine that the result to be broadcast is small enough because the input data is huge? When I tried this in 2 stages, write out
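[Editor's note] A hedged sketch of the knobs usually tried for this kind of broadcast failure: an explicit broadcast hint plus the auto-broadcast threshold. Values are illustrative, the tiny DataFrames stand in for the real inputs, and spark.driver.memory / spark.driver.maxResultSize must be set at submit time (spark-submit --conf), not after the driver has started.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("broadcast-debug").getOrCreate()
    import spark.implicits._

    // Illustrative data; in the thread the broadcast side is derived from a huge input.
    val factDf = Seq((1, "a"), (2, "b")).toDF("id", "v")
    val dimDf  = Seq((1, "x"), (2, "y")).toDF("id", "w")

    // Illustrative threshold (100 MB); raise or lower to taste.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)

    val joined = factDf.join(broadcast(dimDf), Seq("id"))  // explicit broadcast hint
    joined.explain()  // confirm a BroadcastExchange/BroadcastHashJoin actually appears in the plan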

Spark : Very simple query failing [Needed help please]

2020-09-18 Thread Debabrata Ghosh
Hi, I need some help with the attached Spark problem, please. I am running the following query: >>> df_location = spark.sql("""select dt from ql_raw_zone.ext_ql_location where (lat between 41.67 and 45.82) and (lon between -86.74 and -82.42) and year=2020 and month=9 and da
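[Editor's note] The error and the final predicate are truncated in the thread. A sketch of the same query shape, with an explain() call that often helps locate where a seemingly simple query fails (e.g. whether year/month partition pruning is applied); the truncated day predicate is omitted rather than guessed.

    // Sketch only; table and column names come from the thread.
    val dfLocation = spark.sql("""
      SELECT dt
      FROM ql_raw_zone.ext_ql_location
      WHERE (lat BETWEEN 41.67 AND 45.82)
        AND (lon BETWEEN -86.74 AND -82.42)
        AND year = 2020 AND month = 9
    """)
    dfLocation.explain(true)  // inspect parsed/analyzed/physical plans for the failure point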

Spark streaming job not able to launch more number of executors

2020-09-18 Thread Vibhor Banga ( Engineering - VS)
Hi all, We have a Spark Streaming job which reads from two Kafka topics with 10 partitions each, and we are running the streaming job with 3 concurrent microbatches (so 20 partitions in total and a concurrency of 3). We have the following question: in our processing DAG, we do an rdd.persist() at one stage,
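[Editor's note] A hedged sketch of the settings that typically bound how many executors a streaming job gets. The values are illustrative, not recommendations, and they should be supplied at submit time (spark-submit --conf) rather than after the context exists.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch only: static allocation sized to the Kafka partition count in the thread.
    val conf = new SparkConf()
      .setAppName("kafka-streaming")
      .set("spark.executor.instances", "20")           // e.g. one executor per input partition
      .set("spark.dynamicAllocation.enabled", "false") // or enable it with the shuffle service instead
      .set("spark.streaming.concurrentJobs", "3")      // matches the 3 concurrent microbatches

    val ssc = new StreamingContext(conf, Seconds(10))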

Pre query execution hook for custom datasources

2020-09-18 Thread Shubham Chaurasia
Hi, In our custom datasource implementation, we want to inject some query-level information. For example - scala> val df = spark.sql("some query") // uses custom datasource under the hood through Session Extensions. scala> df.count // here we want some kind of pre-execution hook just before
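[Editor's note] QueryExecutionListener only fires after execution, so one hedged option for a pre-execution view is to inject a rule through SparkSessionExtensions that observes the logical plan before physical planning. Class and rule names below are illustrative, and an optimizer rule may fire more than once per query.

    import org.apache.spark.sql.SparkSessionExtensions
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Sketch only: a rule that sees (and could tag) the plan before execution.
    class PreExecutionTagRule extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = {
        logInfo(s"About to plan/execute:\n$plan")
        plan  // return the plan unchanged (or annotated) so execution proceeds normally
      }
    }

    class MyExtensions extends (SparkSessionExtensions => Unit) {
      override def apply(ext: SparkSessionExtensions): Unit = {
        ext.injectOptimizerRule(_ => new PreExecutionTagRule)
      }
    }

    // Registered with: --conf spark.sql.extensions=<fully qualified name of MyExtensions>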