Re: spark-sql force parallel union

2018-11-20 Thread kathleen li
You might first write code to construct the query statement with "union all", like below:

scala> val query = "select * from dfv1 union all select * from dfv2 union all select * from dfv3"
query: String = select * from dfv1 union all select * from dfv2 union all select * from dfv3

then write a loop to build the statement for however many views you need to union.
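
A minimal Scala sketch of that loop, assuming a SparkSession named spark and hypothetical temp views dfv1 through dfv3 already registered:

    // Build one "union all" statement over N temp views, then run it once.
    val viewNames = (1 to 3).map(i => s"dfv$i")   // hypothetical view names
    val query = viewNames
      .map(name => s"select * from $name")
      .mkString(" union all ")
    // query: select * from dfv1 union all select * from dfv2 union all select * from dfv3
    spark.sql(query).show()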

Re: [Spark SQL] [Spark 2.4.0] v1 -> struct(v1.e) fails

2018-11-19 Thread kathleen li
How about this:

df.select(expr("transform(b, v1 -> struct(v1))")).show()

The resulting column in the output is named:
transform(b, lambdafunction(named_struct(v1, namedlambdavariable()), namedlambdavariable()))
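
A self-contained sketch of the same call, assuming Spark 2.4+ (where the transform higher-order function was added) and a hypothetical DataFrame with an array column b:

    import spark.implicits._
    import org.apache.spark.sql.functions.expr

    // Hypothetical data: an id column a and an array column b.
    val df = Seq((1, Seq(10, 20)), (2, Seq(30))).toDF("a", "b")

    // Wrap each array element v1 in a single-field struct.
    df.select(expr("transform(b, v1 -> struct(v1))")).show(truncate = false)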

Re: Reply: Executor hang

2018-10-07 Thread kathleen li
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> How can I disable whole-stage code gen?
>
> Thanks and Regards,
> Tony
>
> From: ka
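
For the question quoted above: whole-stage code generation can be turned off through a session configuration. A sketch (spark.sql.codegen.wholeStage is an internal but long-standing setting, true by default):

    // Disable whole-stage codegen for the current session.
    spark.conf.set("spark.sql.codegen.wholeStage", "false")

    // Or at submit time:
    //   spark-submit --conf spark.sql.codegen.wholeStage=false ...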

Re: Executor hang

2018-10-07 Thread kathleen li
It seems you have a data skew issue: the shuffle read size for executor 4 is almost 2 times that of the other executors, and its GC time of 11s is almost 15 to 20 times that of the others. Kathleen Sent from my iPhone > On Oct 7, 2018, at 5:24 AM, 阎志涛 wrote: > > Hi, All, > I am running Spark 2.1 on Hadoop 2.7.2
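
The reply diagnoses skew rather than prescribing a fix. One common mitigation, not taken from this thread, is to salt the hot key so the shuffle spreads it across partitions; a sketch against a hypothetical DataFrame df with key and value columns:

    import org.apache.spark.sql.functions._

    // Spread each key across 10 salted buckets, aggregate per bucket,
    // then combine the partial results back per original key.
    val partial = df
      .withColumn("salt", (rand() * 10).cast("int"))
      .groupBy(col("key"), col("salt"))
      .agg(sum("value").as("partial_sum"))

    val totals = partial
      .groupBy("key")
      .agg(sum("partial_sum").as("total"))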

Re: How to do a broadcast join using raw Spark SQL 2.3.1 or 2.3.2?

2018-10-03 Thread kathleen li
Not sure what you mean by "raw" Spark SQL, but there is one parameter that controls whether the optimizer chooses a broadcast join automatically: spark.sql.autoBroadcastJoinThreshold. You can read the Spark docs on that setting, and use explain to check whether your join uses a broadcast.
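
Two concrete ways to steer this, sketched against hypothetical tables large_table and small_table (the hint form is supported in Spark SQL 2.2+):

    // Raise the auto-broadcast size threshold (in bytes); -1 disables it.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

    // Or hint the join explicitly in raw SQL, then verify with explain():
    spark.sql("""
      SELECT /*+ BROADCAST(s) */ *
      FROM   large_table l
      JOIN   small_table s ON l.id = s.id
    """).explain()   // look for BroadcastHashJoin in the physical plan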

Re: Text from pdf spark

2018-09-28 Thread kathleen li
The error message is "file not found". Are you able to access the file with the following command line, as the user who submitted the job? hdfs dfs -ls /tmp/sample.pdf Sent from my iPhone > On Sep 28, 2018, at 12:10 PM, Joel D wrote: > > I'm trying to extract text from pdf files in hdfs
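
A programmatic version of the same check, assuming the path lives on the cluster's default HDFS filesystem:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Verify the file is visible to the user the Spark job runs as.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val pdfPath = new Path("/tmp/sample.pdf")
    println("exists: " + fs.exists(pdfPath))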

Re: Given events with start and end times, how to count the number of simultaneous events using Spark?

2018-09-26 Thread kathleen li
You can use a Spark SQL window function, something like:

df.createOrReplaceTempView("dfv")

select count(eventid) over (partition by start_time, end_time order by start_time) from dfv

Sent from my iPhone > On Sep 26, 2018, at 11:32 AM, Debajyoti Roy wrote: > > The problem statement and an
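
A runnable sketch of that suggestion, with hypothetical event data:

    import spark.implicits._

    val events = Seq(
      ("e1", "2018-09-26 10:00", "2018-09-26 11:00"),
      ("e2", "2018-09-26 10:00", "2018-09-26 11:00"),
      ("e3", "2018-09-26 12:00", "2018-09-26 13:00")
    ).toDF("eventid", "start_time", "end_time")
    events.createOrReplaceTempView("dfv")

    spark.sql("""
      select eventid,
             count(eventid) over (partition by start_time, end_time
                                  order by start_time) as concurrent
      from dfv
    """).show()

Note that, as written, this counts events sharing identical start/end pairs; intervals that overlap without matching exactly would need a different formulation.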

Re: [SparkSQL] Count Distinct issue

2018-09-17 Thread kathleen li
Hi, I can't reproduce your issue:

scala> spark.sql("select distinct * from dfv").show()
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| a| b| c| d| e| f| g| h| i| j| k| l| m| n| o| p|
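
A minimal reproduction attempt along the same lines, with hypothetical single-letter columns a through p:

    import spark.implicits._

    val df = Seq(
      (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16),
      (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
    ).toDF("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p")

    df.createOrReplaceTempView("dfv")
    spark.sql("select distinct * from dfv").show()   // duplicates collapse to one row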

What is the best way for Spark to read HDF5@scale?

2018-09-14 Thread kathleen li
Hi, Is there any Spark connector for HDF5? The following link does not work anymore: https://www.hdfgroup.org/downloads/spark-connector/ Thanks, Kathleen