Re: Save Spark dataframe as dynamic partitioned table in Hive

2020-04-23 Thread ZHANG Wei
AFAICT, we can use spark.sql(s"select $name ..."), where name is a value in the Scala context [1]. -- Cheers, -z [1] https://docs.scala-lang.org/overviews/core/string-interpolation.html On Fri, 17 Apr 2020 00:10:59 +0100 Mich Talebzadeh wrote: > Thanks Patrick, > > The partition broadcastId is static
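
A minimal Scala sketch of the suggestion above, using string interpolation to splice a Scala value into the SQL text. The table names and column selection (target_table, staging_table, col1, col2) are hypothetical stand-ins, not taken from the original thread:

    // Hypothetical names; broadcastId is the (static) partition value from the thread
    val broadcastId = "some-id"  // a value in the Scala context
    // The s"..." interpolator splices the value into the SQL string before Spark parses it
    spark.sql(s"INSERT INTO target_table PARTITION (broadcastId = '$broadcastId') " +
              s"SELECT col1, col2 FROM staging_table")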

30000 partitions vs 1000 partitions with Coalescing

2020-04-23 Thread dev nan
I would like to know why it is faster to write out an RDD that has 30,000 partitions as 30,000 files sized 1 KB-2 MB each, rather than coalescing it to 1,000 partitions and writing out 1,000 S3 files of roughly 26 MB each, or even 100 partitions and 100 S3 files of 260 MB each. The coalescing takes a long time.
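
For context: coalesce without a shuffle folds partitions into fewer tasks, which also narrows the parallelism of the upstream computation in the same stage, whereas repartition shuffles but keeps the upstream at full width. A minimal sketch of the variants discussed (the S3 paths are hypothetical placeholders):

    // Full width: 30,000 tasks compute and write in parallel
    rdd.saveAsTextFile("s3://bucket/out-30000/")

    // coalesce(1000) avoids a shuffle, but the upstream computation is then
    // also executed by only 1,000 tasks, which can make the whole job slower
    rdd.coalesce(1000).saveAsTextFile("s3://bucket/out-coalesced/")

    // repartition(1000) (coalesce with shuffle = true) keeps 30,000-way upstream
    // parallelism, at the cost of a shuffle before the 1,000 write tasks
    rdd.repartition(1000).saveAsTextFile("s3://bucket/out-repartitioned/")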

Re: Re: Can I collect Dataset[Row] to driver without converting it to Array[Row]?

2020-04-23 Thread maqy
Hi Jinxin, Thanks for your suggestions, I will try foreachPartition later. Best regards, maqy From: Tang Jinxin Sent: April 23, 2020, 7:31 To: maqy Cc: Andrew Melo; user@spark.apache.org Subject: Re: Can I collect Dataset[Row] to driver without converting it to Array[Row]? Hi maqy, Thanks for your
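
A minimal sketch of the foreachPartition approach mentioned above, which processes rows on the executors instead of materializing the whole Dataset as an Array[Row] on the driver; toLocalIterator is an alternative when rows really must reach the driver. The Dataset ds and the per-row work are hypothetical placeholders:

    import org.apache.spark.sql.Row

    // Runs on the executors, one partition at a time; nothing is collected to the driver
    ds.foreachPartition { rows: Iterator[Row] =>
      rows.foreach { row =>
        // hypothetical per-row work, e.g. writing to an external sink
        println(row)
      }
    }

    // Alternative: stream rows to the driver incrementally instead of collect()
    val it = ds.toLocalIterator()
    while (it.hasNext) {
      println(it.next())
    }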

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-23 Thread Wenchen Fan
Yea, please report the bug on a supported Spark version like 2.4. On Thu, Apr 23, 2020 at 3:40 PM Dhrubajyoti Hati wrote: > FYI, we are using Spark 2.2.0. Should the change be present in this Spark > version? I wanted to check before opening a JIRA ticket. > > Regards, Dhrubajyoti Hati.

Re: Spark hangs while reading from JDBC - does nothing / Removing guesswork from troubleshooting

2020-04-23 Thread ZHANG Wei
That's not deadlocked. The threads are just trying to acquire the same monitor lock, and there are 3 of them: one has acquired it, and the others are waiting for the lock to be released. It's a common scenario. You have to check the monitor lock object from the call stack source code. There should be some operations after
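
A minimal Scala sketch of the scenario described, assuming three threads contending on one shared monitor; this illustrates the pattern, not the code behind the reported thread dump:

    object MonitorDemo {
      private val monitor = new Object  // the shared monitor lock

      def main(args: Array[String]): Unit = {
        // One thread holds the monitor; the other two show up as BLOCKED
        // in a jstack thread dump -- contended, not deadlocked
        val threads = (1 to 3).map { i =>
          new Thread(new Runnable {
            override def run(): Unit = monitor.synchronized {
              println(s"worker-$i acquired the monitor")
              Thread.sleep(5000) // hold the lock; the others wait, then proceed in turn
            }
          }, s"worker-$i")
        }
        threads.foreach(_.start())
        threads.foreach(_.join())
      }
    }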

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-23 Thread Dhrubajyoti Hati
FYI, we are using Spark 2.2.0. Should the change be present in this Spark version? I wanted to check before opening a JIRA ticket. Regards, Dhrubajyoti Hati. On Thu, Apr 23, 2020 at 10:12 AM Wenchen Fan wrote: > This looks like a bug: the path filter doesn't work for Hive table > reading. Can
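
For context, a hedged sketch of the kind of path filter under discussion: a custom Hadoop PathFilter that skips hidden and temporary files inside partitions. Whether Spark applies it when reading Hive tables is exactly the suspected bug in this thread, so treat the configuration key and the approach as assumptions to verify rather than a confirmed fix:

    import org.apache.hadoop.fs.{Path, PathFilter}

    // Skip hidden files/dirs (leading '.' or '_') and *.tmp files
    class HiddenTmpPathFilter extends PathFilter {
      override def accept(path: Path): Boolean = {
        val name = path.getName
        !name.startsWith(".") && !name.startsWith("_") && !name.endsWith(".tmp")
      }
    }

    // Assumption: registering the filter via the Hadoop input-format configuration;
    // per this thread, it may be ignored for Hive table reads on Spark 2.2.0
    spark.sparkContext.hadoopConfiguration.set(
      "mapreduce.input.pathFilter.class",
      classOf[HiddenTmpPathFilter].getName)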

Re: Cross Region Apache Spark Setup

2020-04-23 Thread Stone Zhong
Thank you, Wei. I will look into #1. With option 2, it seems the complexity is pushed to the application -- the application needs to write multiple queries and merge the final result. Regards, Stone On Mon, Apr 20, 2020 at 7:39 AM ZHANG Wei wrote: > There might be 3 options: > > 1. Just as you expect,
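
A minimal sketch of what option 2 implies for the application: run one query per region and merge the results on one side. The bucket paths, table layout, and aggregation are hypothetical placeholders:

    // One read per region (hypothetical per-region buckets)
    val usEast = spark.read.parquet("s3://bucket-us-east/events/")
    val euWest = spark.read.parquet("s3://bucket-eu-west/events/")

    // Merge the per-region results, then aggregate once over the combined data
    val merged = usEast.union(euWest)
    merged.groupBy("event_type").count().show()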