Github user jameszhouyi commented on the pull request:

    https://github.com/apache/spark/pull/5688#issuecomment-104464124

@viirya, please see the query details below, which use script transform:

ADD FILE ${env:BIG_BENCH_QUERIES_DIR}/Resources/bigbenchqueriesmr.jar;

--CREATE RESULT TABLE. Store query result externally in output_dir/qXXresult/
DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
CREATE TABLE ${hiveconf:RESULT_TABLE} (
  pid1 BIGINT,
  pid2 BIGINT,
  cnt  BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS ${env:BIG_BENCH_hive_default_fileformat_result_table} LOCATION '${hiveconf:RESULT_DIR}';

-- the real query part
--Find the most frequent ones
INSERT INTO TABLE ${hiveconf:RESULT_TABLE}
SELECT pid1, pid2, COUNT (*) AS cnt
FROM (
  --Make items basket
  FROM (
    -- Joining two tables
    SELECT s.ss_ticket_number AS oid , s.ss_item_sk AS pid
    FROM store_sales s
    INNER JOIN item i ON (s.ss_item_sk = i.i_item_sk)
    WHERE i.i_category_id in (${hiveconf:q01_i_category_id_IN})
    AND s.ss_store_sk in (${hiveconf:q01_ss_store_sk_IN})
    CLUSTER BY oid
  ) q01_map_output
  REDUCE q01_map_output.oid, q01_map_output.pid
  USING '${env:BIG_BENCH_JAVA} ${env:BIG_BENCH_java_child_process_xmx} -cp bigbenchqueriesmr.jar de.bankmark.bigbench.queries.q01.Red -ITEM_SET_MAX ${hiveconf:q01_NPATH_ITEM_SET_MAX} '
  AS (pid1 BIGINT, pid2 BIGINT)
) q01_temp_basket
GROUP BY pid1, pid2
HAVING COUNT (pid1) > ${hiveconf:q01_COUNT_pid1_greater}
CLUSTER BY pid1 ,cnt ,pid2
;
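
For anyone reproducing this without the BigBench jar: the source of de.bankmark.bigbench.queries.q01.Red is not included in this comment, so the following is only a rough Python stand-in for the external process invoked by REDUCE ... USING. It relies on the default script-transform protocol, in which input rows arrive on stdin as tab-separated columns and output rows are written to stdout the same way. The function name reduce_baskets, the pairing logic (every pair of distinct items sharing an oid), and the reading of -ITEM_SET_MAX as a cap on basket size are all assumptions for illustration, not the behavior of the real reducer.

```python
import sys
from itertools import combinations, groupby


def reduce_baskets(lines, item_set_max=500):
    """Yield 'pid1<TAB>pid2' rows for each pair of items that share an oid.

    Assumes each input row is 'oid<TAB>pid' and that rows with the same oid
    are contiguous, which CLUSTER BY oid provides within one reducer.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for _oid, group in groupby(rows, key=lambda row: row[0]):
        # Deduplicate and sort so each pair is emitted once, smaller pid first.
        items = sorted({int(row[1]) for row in group})
        # Hypothetical reading of -ITEM_SET_MAX: skip oversized baskets.
        if len(items) > item_set_max:
            continue
        for pid1, pid2 in combinations(items, 2):
            yield f"{pid1}\t{pid2}"


def main():
    # Entry point when the file is used as the USING script.
    for row in reduce_baskets(sys.stdin):
        print(row)
```

To try it as the transform script, call main() at the end of the file and reference the file in the USING clause in place of the Java command. The GROUP BY pid1, pid2 in the outer query then counts how often each emitted pair occurs.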