Re: spark-sql force parallel union

2018-11-20 Thread kathleen li
You might first write code to construct the query statement with "union all", like below:

scala> val query = "select * from dfv1 union all select * from dfv2 union all select * from dfv3"
query: String = select * from dfv1 union all select * from dfv2 union all select * from dfv3

then write a loop to build the statement for however many views you need to union.
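
A minimal Scala sketch of that loop, assuming a SparkSession named spark and hypothetical temp views dfv1 through dfv3 already registered:

    // Build one "union all" statement over N temp views, then run it once.
    val viewNames = (1 to 3).map(i => s"dfv$i")   // hypothetical view names
    val query = viewNames
      .map(name => s"select * from $name")
      .mkString(" union all ")
    // query: select * from dfv1 union all select * from dfv2 union all select * from dfv3
    spark.sql(query).show()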

Re: [Spark SQL] [Spark 2.4.0] v1 -> struct(v1.e) fails

2018-11-19 Thread kathleen li
How about this:

df.select(expr("transform(b, v1 -> struct(v1))")).show()

The resulting column in the output is named:
transform(b, lambdafunction(named_struct(v1, namedlambdavariable()), namedlambdavariable()))
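
A self-contained sketch of the same call, assuming Spark 2.4+ (where the transform higher-order function was added) and a hypothetical DataFrame with an array column b:

    import spark.implicits._
    import org.apache.spark.sql.functions.expr

    // Hypothetical data: an id column a and an array column b.
    val df = Seq((1, Seq(10, 20)), (2, Seq(30))).toDF("a", "b")

    // Wrap each array element v1 in a single-field struct.
    df.select(expr("transform(b, v1 -> struct(v1))")).show(truncate = false)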

Re: Reply: Executor hang

2018-10-07 Thread kathleen li
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> How can I disable whole-stage code gen?
>
> Thanks and Regards,
> Tony
>
> From: ka
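
For the question quoted above: whole-stage code generation can be turned off through a session configuration. A sketch (spark.sql.codegen.wholeStage is an internal but long-standing setting, true by default):

    // Disable whole-stage codegen for the current session.
    spark.conf.set("spark.sql.codegen.wholeStage", "false")

    // Or at submit time:
    //   spark-submit --conf spark.sql.codegen.wholeStage=false ...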

Re: Executor hang

2018-10-07 Thread kathleen li
It seems you have a data skew issue: the shuffle read size for executor 4 is almost 2 times that of the other executors, and its GC time of 11s is almost 15 to 20 times that of the others. Kathleen Sent from my iPhone > On Oct 7, 2018, at 5:24 AM, 阎志涛 wrote: > > Hi, All, > I am running Spark 2.1 on Hadoop 2.7.2
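
The reply diagnoses skew rather than prescribing a fix. One common mitigation, not taken from this thread, is to salt the hot key so the shuffle spreads it across partitions; a sketch against a hypothetical DataFrame df with key and value columns:

    import org.apache.spark.sql.functions._

    // Spread each key across 10 salted buckets, aggregate per bucket,
    // then combine the partial results back per original key.
    val partial = df
      .withColumn("salt", (rand() * 10).cast("int"))
      .groupBy(col("key"), col("salt"))
      .agg(sum("value").as("partial_sum"))

    val totals = partial
      .groupBy("key")
      .agg(sum("partial_sum").as("total"))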

Re: How to do a broadcast join using raw Spark SQL 2.3.1 or 2.3.2?

2018-10-03 Thread kathleen li
Not sure what you mean by "raw" Spark SQL, but there is one parameter that controls whether the optimizer chooses a broadcast join automatically: spark.sql.autoBroadcastJoinThreshold. You can read the Spark docs on that setting, and use explain to check whether your join uses a broadcast.
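
Two concrete ways to steer this, sketched against hypothetical tables large_table and small_table (the hint form is supported in Spark SQL 2.2+):

    // Raise the auto-broadcast size threshold (in bytes); -1 disables it.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

    // Or hint the join explicitly in raw SQL, then verify with explain():
    spark.sql("""
      SELECT /*+ BROADCAST(s) */ *
      FROM   large_table l
      JOIN   small_table s ON l.id = s.id
    """).explain()   // look for BroadcastHashJoin in the physical plan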

Re: Text from pdf spark

2018-09-28 Thread kathleen li
The error message is "file not found". Are you able to access the file with the following command line, as the user who submitted the job? hdfs dfs -ls /tmp/sample.pdf Sent from my iPhone > On Sep 28, 2018, at 12:10 PM, Joel D wrote: > > I'm trying to extract text from pdf files in hdfs
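
A programmatic version of the same check, assuming the path lives on the cluster's default HDFS filesystem:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Verify the file is visible to the user the Spark job runs as.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val pdfPath = new Path("/tmp/sample.pdf")
    println("exists: " + fs.exists(pdfPath))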

Re: Given events with start and end times, how to count the number of simultaneous events using Spark?

2018-09-26 Thread kathleen li
You can use a Spark SQL window function, something like:

df.createOrReplaceTempView("dfv")

select count(eventid) over (partition by start_time, end_time order by start_time) from dfv

Sent from my iPhone > On Sep 26, 2018, at 11:32 AM, Debajyoti Roy wrote: > > The problem statement and an
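
A runnable sketch of that suggestion, with hypothetical event data:

    import spark.implicits._

    val events = Seq(
      ("e1", "2018-09-26 10:00", "2018-09-26 11:00"),
      ("e2", "2018-09-26 10:00", "2018-09-26 11:00"),
      ("e3", "2018-09-26 12:00", "2018-09-26 13:00")
    ).toDF("eventid", "start_time", "end_time")
    events.createOrReplaceTempView("dfv")

    spark.sql("""
      select eventid,
             count(eventid) over (partition by start_time, end_time
                                  order by start_time) as concurrent
      from dfv
    """).show()

Note that, as written, this counts events sharing identical start/end pairs; intervals that overlap without matching exactly would need a different formulation.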

Re: [SparkSQL] Count Distinct issue

2018-09-17 Thread kathleen li
Hi, I can't reproduce your issue:

scala> spark.sql("select distinct * from dfv").show()
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| a| b| c| d| e| f| g| h| i| j| k| l| m| n| o| p|
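
A minimal reproduction attempt along the same lines, with hypothetical single-letter columns a through p:

    import spark.implicits._

    val df = Seq(
      (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16),
      (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
    ).toDF("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p")

    df.createOrReplaceTempView("dfv")
    spark.sql("select distinct * from dfv").show()   // duplicates collapse to one row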

What is the best way for Spark to read HDF5@scale?

2018-09-14 Thread kathleen li
Hi, Is there any Spark connector for HDF5? The following link does not work anymore: https://www.hdfgroup.org/downloads/spark-connector/ Thanks, Kathleen