wuchang created SPARK-19647:
-------------------------------

             Summary: Spark query against Hive is extremely slow even when the result data is small
                 Key: SPARK-19647
                 URL: https://issues.apache.org/jira/browse/SPARK-19647
             Project: Spark
          Issue Type: Question
          Components: PySpark
    Affects Versions: 2.0.2
            Reporter: wuchang
            Priority: Critical
I am using Spark 2.0.0 to query a Hive table. My SQL is:

    select * from app.abtestmsg_v limit 10

Yes, I want to get the first 10 records from the view app.abtestmsg_v. When I run this SQL in spark-shell it is very fast, taking about 2 seconds. But the problem comes when I try to run the same query from my Python code. I wrote a very simple pyspark program:

    from pyspark.sql import HiveContext

    # sc is the SparkContext provided by the pyspark shell
    hc = HiveContext(sc)
    hc.setConf("hive.exec.orc.split.strategy", "ETL")
    # the value must be the string "false"; a bare false is a Python NameError
    hc.setConf("hive.security.authorization.enabled", "false")
    zj_sql = 'select * from app.abtestmsg_v limit 10'
    zj_df = hc.sql(zj_sql)
    zj_df.collect()

From the info log I find that although I use "limit 10" to tell Spark that I just want the first 10 records, Spark still scans and reads all files of the view (in my case, the source data of this view is about 100 files, each about 1 GB in size). So there are nearly 100 tasks, each task reads one file, and all the tasks are executed serially. It takes nearly 15 minutes to finish these 100 tasks, but all I want is the first 10 records.

So I don't know what to do and what is wrong. Could anybody give me some suggestions?
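For reference, a minimal sketch of how the chosen plan can be inspected, assuming the same view name and a Spark 2.0 build with Hive support (the app name is made up for illustration):

    from pyspark.sql import SparkSession

    # Spark 2.0 entry point with Hive support; the app name is arbitrary.
    spark = SparkSession.builder \
        .appName("limit-scan-check") \
        .enableHiveSupport() \
        .getOrCreate()

    df = spark.sql("select * from app.abtestmsg_v limit 10")

    # Print the logical and physical plans. If the physical plan shows a
    # CollectLimit sitting on top of a full scan of the view's source
    # files, every split is still read before the limit is applied.
    df.explain(True)

    # take(10) requests only 10 rows; whether the scan can stop early
    # depends on the physical plan Spark chose for the view.
    rows = df.take(10)
    print(rows)

This only shows how to see what Spark plans for the query; it is not a claim about why the view in question scans all 100 files.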