please provide the detailed way to generate the tpcds data in TPCDSQueryBenchmark

2023-05-15 Thread zhangliyun
hi, i want to set up a TPC-DS benchmark to test the performance of some Spark features. i saw in TPCDSQueryBenchmark that it needs the --data-location argument passed to the class; my question is how to generate the tpcds data for this benchmark. ``` /** * Benchmark to measure TPCDS query performance. *
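A sketch of the usual generation route, assuming the databricks/spark-sql-perf package and a locally built tpcds-kit (dsdgen). Class and argument names follow that project's README, not Spark itself; paths and scale factor are illustrative:

```scala
// Run in spark-shell with the spark-sql-perf jar on the classpath.
import com.databricks.spark.sql.perf.tpcds.TPCDSTables

val tables = new TPCDSTables(
  spark.sqlContext,
  dsdgenDir = "/opt/tpcds-kit/tools",   // assumed: directory where dsdgen was compiled
  scaleFactor = "1",                    // dataset size in GB
  useDoubleForDecimal = false,
  useStringForDate = false)

tables.genData(
  location = "hdfs:///tpcds/sf1",       // later passed to the benchmark as --data-location
  format = "parquet",
  overwrite = true,
  partitionTables = true,
  clusterByPartitionColumns = true,
  filterOutNullPartitionValues = false,
  tableFilter = "",                     // empty string = generate all tables
  numPartitions = 100)                  // number of dsdgen input tasks
```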

please help: why is an error thrown in org.apache.spark.sql.catalyst.expressions.BindReferences in spark3

2023-03-26 Thread zhangliyun
hi all, i have a query ``` spark.sql("select distinct cust_id, cast(b.device_name as varchar(200)) as devc_name_cast, prmry_reside_cntry_code from (select * from ${model_db}.crs_recent_30d_SF_dim_cust_info where dt='${today}') a join fact_rsk_magnes_txn b on
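The preview cuts off before the error itself, but the visible cast to varchar(200) is one area Spark 3.1 changed: CHAR/VARCHAR length semantics are now enforced (SPARK-33480). Purely as a hypothesis about the cause, a hedged first thing to try:

```scala
// Hypothesis only, not a confirmed diagnosis: restore the pre-3.1
// "treat char/varchar as plain string" behavior and re-run the query.
spark.conf.set("spark.sql.legacy.charVarcharAsString", "true")
```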

Re:Re: please help: why is there a big difference in partitionFilter between spark 2.4.2 and spark 3.1.3

2023-03-06 Thread zhangliyun
… SPARK-27638. You can set spark.sql.legacy.typeCoercion.datetimeToString.enabled to true to restore the old behavior. On Mon, Mar 6, 2023 at 10:27 AM zhangliyun wrote: Hi all, i have a spark sql query; in spark 2.4.2 it ran correctly, but when i upgraded to spark 3.1.3 it has a problem.
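The switch named in the reply, applied before running the query:

```scala
// Straight from the reply above (SPARK-27638): restore the Spark 2.4
// date-to-string coercion so the partition filter prunes as it used to.
spark.conf.set("spark.sql.legacy.typeCoercion.datetimeToString.enabled", "true")
```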

please help: why is there a big difference in partitionFilter between spark 2.4.2 and spark 3.1.3

2023-03-05 Thread zhangliyun
Hi all, i have a spark sql query; in spark 2.4.2 it ran correctly, but when i upgraded to spark 3.1.3 it has a problem. the sql: ``` select * from eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly where dt >= date_sub('${today}',30); ``` it will load the data of the past

are there any metrics to show executor usage of memory or CPU

2020-05-15 Thread zhangliyun
Hi all: i want to ask a question about metrics that show whether an executor has fully used its memory. i always see the following in the log, and i guess it means i did not fully use the executor's memory, but i don't want to open the log to check; are there any metrics to show it? my
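A rough sketch of one way to poll this from a spark-shell session (sc is the SparkContext); for richer numbers (task CPU time, on/off-heap memory) the REST endpoint /api/v1/applications/&lt;app-id&gt;/executors exposes what the Executors UI tab shows:

```scala
// getExecutorMemoryStatus reports (max storage memory, remaining storage
// memory) per executor address; the difference approximates cached usage.
sc.getExecutorMemoryStatus.foreach { case (executor, (maxBytes, remainingBytes)) =>
  val usedMb = (maxBytes - remainingBytes) / (1024.0 * 1024.0)
  val maxMb  = maxBytes / (1024.0 * 1024.0)
  println(f"$executor%-25s storage used: $usedMb%8.1f MB of $maxMb%8.1f MB")
}
```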

Re:Re: Re:Re: Screen Shot 2020-05-11 at 5.28.03 AM

2020-05-11 Thread zhangliyun
… > org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97) > jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222) > Cheers, -z > From: zhangliyun Sent: Monday, May 11, 2020 9:44

a spark job hangs for 20 hours; the stage page shows 2 tasks not finished, but the task page shows all tasks finished or failed

2020-05-06 Thread zhangliyun
Hi all: i want to ask a question. it seems that my spark job has hung for 20+ hours. the spark history log shows 8999 completed tasks while 2 are not finished, but when i go to the tasks page i cannot find any running tasks; all tasks are failed or successful. I guess it seems that all

Re:Re: is there any tool to visualize the spark physical plan or spark plan

2020-05-01 Thread zhangliyun
… `SparkPlanGraph` has a `makeDotFile` method where you can write out a `.dot` file and visualize it with Graphviz tools, e.g. http://www.webgraphviz.com/ Thanks, Manu On Thu, Apr 30, 2020 at 3:21 PM zhangliyun wrote: Hi all i want to ask a question: is there any tool to visualize the spark physical

Re:Re: is there any tool to visualize the spark physical plan or spark plan

2020-04-30 Thread zhangliyun
…sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L306. `SparkPlanGraph` has a `makeDotFile` method where you can write out a `.dot` file and visualize it with Graphviz tools, e.g. http://www.webgraphviz.com/ Thanks, Manu On Thu, Apr 30, 2020 at 3:21 PM zhangliyun wrote
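A minimal sketch of the approach Manu describes, with the caveat that `SparkPlanGraph` and `SparkPlanInfo.fromSparkPlan` are Spark internals (private[sql]/private[execution]); the helper below is hypothetical, has to be compiled inside the matching package, and the signatures may shift between versions:

```scala
package org.apache.spark.sql.execution.ui

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.SparkPlanInfo

// Hypothetical helper, not a Spark API: dump a DataFrame's physical plan
// as a .dot file, then paste it into e.g. http://www.webgraphviz.com/.
object PlanToDot {
  def write(df: DataFrame, path: String): Unit = {
    val info  = SparkPlanInfo.fromSparkPlan(df.queryExecution.executedPlan)
    val graph = SparkPlanGraph(info)
    // makeDotFile takes a nodeId -> metrics-text map; empty is fine for layout only.
    Files.write(Paths.get(path), graph.makeDotFile(Map.empty).getBytes("UTF-8"))
  }
}
```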

is there any tool to visualize the spark physical plan or spark plan

2020-04-30 Thread zhangliyun
Hi all, i want to ask a question: is there any tool to visualize the spark physical plan or spark plan? sometimes the physical plan is very long, so it is difficult to view. Best Regards KellyZhang

How to estimate the rdd size before the rdd result is written to disk

2019-12-19 Thread zhangliyun
Hi all: i want to ask a question about how to estimate an rdd's size (in bytes) when it is not yet saved to disk, because the job takes a long time if the output is very huge and the output partition number is small. the following steps are how i might solve this problem: 1. sample 0.01's
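A sketch of the sampling idea from the question, using Spark's SizeEstimator. Caveat: it measures deserialized JVM object size, which can differ substantially from the serialized/compressed size eventually written to disk:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator

// Estimate in-memory bytes of a small sample and extrapolate to the full RDD.
def approxBytes[T](rdd: RDD[T], fraction: Double = 0.01): Long = {
  val sampleBytes = rdd
    .sample(withReplacement = false, fraction)
    .map(rec => SizeEstimator.estimate(rec.asInstanceOf[AnyRef]))
    .sum()                              // numeric-RDD implicit gives sum() on RDD[Long]
  (sampleBytes / fraction).toLong
}
```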

Fw:Re:Re: A question about rdd bytes size

2019-12-02 Thread zhangliyun
Forwarded message From: "zhangliyun" Date: 2019-12-03 05:56:55 To: "Wenchen Fan" Subject: Re:Re: A question about rdd bytes size Hi Fan: thanks for the reply. I agree that how the data is stored decides the total bytes of the table file. In my

A question about rdd bytes size

2019-12-01 Thread zhangliyun
Hi: I want to get the total bytes of a DataFrame with the following function, but when I insert the DataFrame into hive, I find the value of the function is different from spark.sql.statistics.totalSize. The spark.sql.statistics.totalSize is less than the result of the following function
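A hedged sketch of the comparison being made: Catalyst's sizeInBytes estimates the deserialized in-memory size, while spark.sql.statistics.totalSize is a table property recording on-disk bytes, usually smaller after compression and columnar encoding. The table name is illustrative:

```scala
// In-memory size estimate from the optimized logical plan (Spark 2.3+).
val df = spark.table("db.some_table")
println(df.queryExecution.optimizedPlan.stats.sizeInBytes)

// One way to inspect the catalog's recorded on-disk size after the insert.
spark.sql("SHOW TBLPROPERTIES db.some_table").show(truncate = false)
```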

A question about skew join hint

2019-11-04 Thread zhangliyun
Hi all: i saw the skewed join hint optimization in https://docs.azuredatabricks.net/delta/join-performance/skew-join.html. it is a great feature to help users avoid the problems brought by skewed data. My questions: 1. which version will have this? i have not found the feature in the
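One caveat worth noting: that SKEW hint is Databricks-only. In open-source Spark 3.0+ the closest equivalent ships with Adaptive Query Execution; a sketch of the relevant configuration:

```scala
// AQE-based skew-join handling (open-source Spark 3.0+), not the SKEW hint.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
// A sort-merge join partition is treated as skewed when it exceeds BOTH limits:
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```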

Re:Re: A question about broadcast nested loop join

2019-10-23 Thread zhangliyun
… generated by using a NOT IN (subquery); if you are OK with slightly different NULL semantics then you could use NOT EXISTS (subquery). The latter should perform a lot better. On Wed, Oct 23, 2019 at 12:02 PM zhangliyun wrote: Hi all: i want to ask a question about broadcast nestloop join
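A sketch of the suggested rewrite, reusing the testdata/key1 names from the "Question about in subquery" thread below. NOT IN must honor three-valued NULL semantics, which forces a BroadcastNestedLoopJoin; NOT EXISTS can plan as an ordinary left anti join:

```scala
// NOT EXISTS form of the same anti-join, which plans without the
// BroadcastNestedLoopJoin that NOT IN produces.
spark.sql("""
  SELECT *
  FROM testdata a
  WHERE NOT EXISTS (SELECT 1 FROM testdata b WHERE b.key1 = a.key1)
""").explain()
```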

Re:Re: A question about broadcast nested loop join

2019-10-23 Thread zhangliyun
… OOM happens. Maybe there is an algorithm to implement left/right join in a distributed environment without broadcast, but currently Spark is only able to deal with it using broadcast. On Wed, Oct 23, 2019 at 6:02 PM zhangliyun wrote: Hi all: i want to ask a question about broadcast nestloop

A question about broadcast nested loop join

2019-10-23 Thread zhangliyun
Hi all: i want to ask a question about broadcast nestloop join. from google i know that left outer/semi join and right outer/semi join will use broadcast nestloop, and in some cases, when the input data is very small, it is suitable to use. so here, how do we define the input data as very
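As for "how small is small": a sketch of the knob that defines it. A join side whose estimated plan size falls below this threshold (default 10 MB) becomes the broadcast side; -1 disables broadcasting entirely:

```scala
// "Very small" is governed by this threshold; shown here at its default.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
```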

Question about in subquery

2019-10-22 Thread zhangliyun
Hi all: i used an in subquery like the following in spark 2.3.1 {code} set spark.sql.autoBroadcastJoinThreshold=-1; explain select * from testdata where key1 not in (select key1 from testdata as b); == Physical Plan == BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#60 = key1#62) ||

Re:Re: Please help view the problem of spark dynamic partition

2019-08-23 Thread zhangliyun
… Johann, Uwe Reimann On 23.08.2019 at 09:43, zhangliyun wrote: Hi all: when i use the spark dynamic partition feature, i met a problem with hdfs quota. I found that it is very easy to hit the quota problem (exceeding the max quota of a directory). I have generated an unpartitioned

Please help view the problem of spark dynamic partition

2019-08-23 Thread zhangliyun
Hi all: when i use the spark dynamic partition feature, i met a problem with hdfs quota. I found that it is very easy to hit the quota problem (exceeding the max quota of a directory). I have generated an unpartitioned table 'bsl12.email_edge_lyh_mth1' which contains 584M records and will
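A hedged sketch of a common mitigation: with dynamic partition inserts, every task can open a file in every partition, so the file count explodes and the HDFS namespace quota is hit. Shuffling by the partition column first makes each task own whole partitions. The column name "dt" and the target table are assumptions for illustration:

```scala
import org.apache.spark.sql.functions.col

// One file per partition per owning task instead of one per task per partition.
spark.table("bsl12.email_edge_lyh_mth1")
  .repartition(col("dt"))                          // assumed dynamic partition column
  .write
  .mode("overwrite")
  .insertInto("bsl12.email_edge_lyh_mth1_part")    // hypothetical partitioned target
```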

Please help with the question of repartition for a dataset from a partitioned hive table

2019-08-21 Thread zhangliyun
Hi All: i have a question about the repartition api and sparksql partitions. I have a table whose partition key is day ``` ./bin/spark-sql -e "CREATE TABLE t_original_partitioned_spark (cust_id int, loss double) PARTITIONED BY (day STRING) location
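A hedged sketch of the distinction in play: Dataset.repartition changes the in-memory shuffle partitioning of the data; it does not read or rewrite the Hive directory-level partitions. Reading the table above and repartitioning on its partition key:

```scala
import org.apache.spark.sql.functions.col

val ds = spark.table("t_original_partitioned_spark")
println(ds.rdd.getNumPartitions)        // driven by input file splits, not by `day`
val byDay = ds.repartition(col("day"))  // shuffle so each task holds whole day values
println(byDay.rdd.getNumPartitions)     // = spark.sql.shuffle.partitions (default 200)
```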

Re:Re: How to force a sort-merge join to a broadcast join

2019-07-29 Thread zhangliyun
… On 29 July 2019 at 07:12:30, zhangliyun (kelly...@126.com) wrote: Hi all: i want to ask a question about broadcast join in spark sql. ``` select A.*, B.nsf_cards_ratio * 1.00 / A.nsf_on_entry as nsf_ratio_to_pop from B left join A on trim(A.country) = trim(B.cntry_code); ``` here A

How to force a sort-merge join to a broadcast join

2019-07-28 Thread zhangliyun
Hi all: i want to ask a question about broadcast join in spark sql. ``` select A.*, B.nsf_cards_ratio * 1.00 / A.nsf_on_entry as nsf_ratio_to_pop from B left join A on trim(A.country) = trim(B.cntry_code); ``` here A is a small table of only 8 rows, but somehow the statistics of table A
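A hedged sketch of forcing the broadcast explicitly when statistics over-estimate the small side, rather than relying on autoBroadcastJoinThreshold. The table handles dfA/dfB stand for A and B above and are illustrative:

```scala
import org.apache.spark.sql.functions.{broadcast, trim}

val dfA = spark.table("A")   // the 8-row side
val dfB = spark.table("B")
val joined = dfB.join(
  broadcast(dfA),            // pin A as the broadcast side regardless of stats
  trim(dfA("country")) === trim(dfB("cntry_code")),
  "left")

// SQL equivalent: the BROADCAST (a.k.a. MAPJOIN/BROADCASTJOIN) hint, since Spark 2.2.
spark.sql("""
  SELECT /*+ BROADCAST(A) */
         A.*, B.nsf_cards_ratio * 1.00 / A.nsf_on_entry AS nsf_ratio_to_pop
  FROM B LEFT JOIN A ON trim(A.country) = trim(B.cntry_code)
""")
```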

Re:RE: How to build a single jar for a single project in spark

2019-03-27 Thread zhangliyun
… Date: 2019-03-26 14:34:20 Subject: RE: How to build a single jar for a single project in spark You can try this: https://spark.apache.org/docs/latest/building-spark.html#building-submodules-individually Thanks, Gerry From: zhangliyun Sent: March 26, 2019 16:50 To: dev@spark.apache.org
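Following the linked doc's "Building submodules individually" section, one module builds like this (the module name and Scala suffix below are version-dependent examples, e.g. _2.11 on Spark 2.4 branches):

```
./build/mvn -pl :spark-sql_2.12 clean install -DskipTests
```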