Take a look at this Jira: https://issues.apache.org/jira/browse/SPARK-6910
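
That Jira tracks pushing partition-filter predicates down to the Hive
metastore, which is exactly the problem described below: the planner
currently enumerates every partition instead of only the ones the filter
selects. Until it lands, one workaround that is sometimes suggested is to
disable the metastore-Parquet conversion so the query takes the Hive SerDe
path, which reads only the pruned partitions. A minimal sketch, assuming an
existing HiveContext named hiveContext (I have not verified this on 1.3.1):

    // Sketch only: hiveContext is assumed to be an existing
    // org.apache.spark.sql.hive.HiveContext. Turning the conversion off
    // avoids ParquetRelation2's up-front listing of every partition.
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
    hiveContext.sql("select count(*) from api_search where pdate='2015-05-23'").show()

The trade-off is losing the native Parquet reader and schema merging, so
treat it as a stopgap until the metastore-side pruning from that Jira is
available.

Yong
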
> Date: Mon, 1 Jun 2015 12:26:16 -0700
> From: oke...@gmail.com
> To: user@spark.apache.org
> Subject: SparkSQL's performance gets degraded depending on number of
> partitions of Hive tables..is it normal?
>
> Hello,
> I posted this question a while back but am posting it again to get your
> attention.
>
> I am using SparkSQL 1.3.1 and Hive 0.13.1 on AWS & YARN (tested under both
> 1.3.0 & 1.3.1).
> My Hive table is partitioned.
> I noticed that the query response time degrades with the number of
> partitions, even though the query targets only a small subset of them.
> TRACE-level logs (the ThriftServer's) showed that it runs commands like
> getFileInfo, getListing, and getBlockLocations for each and every
> partition (and getBlockLocations for each and every file), even for
> partitions that are not part of the query.
>
> I don't know why that is necessary. Is it a bug in SparkSQL? Is there a
> way to avoid it?
> Below are the details of this issue, including logs.
>
> Thanks,
>
> ----------
>
> My Hive table, an external table, is partitioned by date and hour.
> I expected that a query over certain partitions would read only the data
> files of those partitions.
> I turned on TRACE-level logging for the ThriftServer since the query
> response time even for narrowed partitions was very long.
> And I found that all the available partitions are checked during some
> steps.
>
> The logs showed an execution flow such as:
> ==
> Step 1: Contact the Hive metastore to get partition info (cmd:
> get_partitions)
> Step 2: Come up with an execution rule
> Step 3: Contact the NameNode to make at least 4 calls (the data is in
> HDFS) for all available partitions of the table: getFileInfo once,
> getListing once, and then repeat them again, for each partition
> Step 4: Contact the NameNode to find the block locations of all the
> partitions
> Step 5: Contact a DataNode for each file of all the partitions
> Step 6: Contact the NameNode again for all the partitions
> Step 7: SparkSQL generates an optimized plan
> Step 8: Contact the corresponding DataNodes for the narrowed partitions
> (it seems)
> And more.....
> ===
>
> Why should Steps 3, 4, 5, and 6 check all partitions?
> After removing partitions from the table, the query was much quicker
> while processing the same volume of data.
>
> I don't know whether this is normal, a Hive issue, a SparkSQL issue, or a
> configuration issue on my side.
> I added some logs below for some of the steps.
>
> I appreciate any advice.
>
> Thanks a lot,
> Okehee
>
> ==== some logs of some steps
>
> Query: select count(*) from api_search where pdate='2015-05-23';
>
> Step 2:
> 2015-05-25 16:37:43 TRACE HiveContext$$anon$3:67 -
> === Applying Rule
> org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates ===
> !'Project [COUNT(1) AS _c0#25L]                      Aggregate [], [COUNT(1) AS _c0#25L]
>  Filter (pdate#26 = 2015-05-23)                      Filter (pdate#26 = 2015-05-23)
>   MetastoreRelation api_hdfs_perf, api_search, None   MetastoreRelation api_hdfs_perf, api_search, None
> ..
>
> Step 3:
> 2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call -> /10.128.193.211:9000: getFileInfo {src: "/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00"}
> 2015-05-25 16:37:44 DEBUG Client:424 - The ping interval is 60000 ms.
> 2015-05-25 16:37:44 DEBUG Client:693 - Connecting to /10.128.193.211:9000
> 2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh sending #151
> 2015-05-25 16:37:44 DEBUG Client:944 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh: starting, having connections 2
> 2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh got value #151
> 2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took 13ms
> 2015-05-25 16:37:44 TRACE ProtobufRpcEngine:250 - 84: Response <- /10.128.193.211:9000: getFileInfo {fs { fileType: IS_DIR path: "" length: 0 permission { perm: 493 } owner: "hadoop" group: "supergroup" modification_time: 1432364487906 access_time: 0 block_replication: 0 blocksize: 0 fileId: 100602 childrenNum: 2 }}
> 2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call -> /10.128.193.211:9000: getFileInfo {src: "/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00"}
> 2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh sending #152
> 2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh got value #152
> 2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took 2ms.
> ......
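
The pair of getFileInfo/getListing probes above repeats for every partition
directory, so a table with N partitions pays on the order of 4N NameNode
round trips before planning even finishes. As a rough sketch, each
per-partition probe corresponds to Hadoop FileSystem calls like these (the
path is the one from the log; the setup around it is assumed):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Assumes the cluster's core-site.xml is on the classpath, so this
    // resolves to the HDFS NameNode rather than the local filesystem.
    val fs = FileSystem.get(new Configuration())
    val dir = new Path("/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00")
    val status = fs.getFileStatus(dir)  // one getFileInfo RPC to the NameNode
    val files = fs.listStatus(dir)      // one getListing RPC to the NameNode

Each call is cheap on its own (2-13 ms in the log), but multiplied across
every date/hour partition the total dominates the response time.
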
> Step 4:
> 2015-05-25 16:37:47 TRACE ProtobufRpcEngine:206 - 89: Call -> /10.128.193.211:9000: getBlockLocations {src: "/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-00001.parquet" offset: 0 length: 1342177280}
> ...
>
> Step 5:
> 2015-05-25 16:37:48 DEBUG DFSClient:951 - Connecting to datanode 10.191.137.197:9200
> 2015-05-25 16:37:48 TRACE BlockReaderFactory:653 - BlockReaderFactory(fileName=/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-00002.parquet, block=BP-1843960649-10.128.193.211-1427923845046:blk_1073758677_981812): trying to create a remote block reader from a TCP socket
> ...
>
> Step 6:
> 2015-05-25 16:37:56 TRACE ProtobufRpcEngine:206 - 84: Call -> /10.128.193.211:9000: getFileInfo {src: "/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00"}
> ...
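
Steps 4-6 are the per-file follow-ups: a getBlockLocations RPC for every
data file, plus a DataNode connection per file (apparently to read the
Parquet footers). Continuing the sketch above with the same imports, Step 4
amounts to something like this (the file path is the one from the log):

    val fs = FileSystem.get(new Configuration())
    val f = new Path("/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-00001.parquet")
    val st = fs.getFileStatus(f)
    // one getBlockLocations RPC per file, including files the query never reads:
    val locs = fs.getFileBlockLocations(st, 0, st.getLen)

With hourly partitions and several files per partition, that is thousands
of RPCs spent on data outside the queried pdate.
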
> Step 7:
> === Applying Rule
> org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions ===
> == Optimized Logical Plan ==
> Aggregate [], [COUNT(1) AS _c0#25L]
>  Project []
>   Filter (pdate#111 = 2015-05-23)
>    Relation[timestamp#84,request_id#85,request_timestamp#86,response_timestamp#87,request_query_url#88,request_query_params#89,response_status#90,q#91,session_id#92,partner_id#93,partner_name#94,partner_ip#95,partner_useragent#96,search_id#97,user_id#98,client_ip#99,client_country#100,client_useragent#101,client_platform#102,search_appids#103,search_topeditionidsbyappid#104,query_categories#105,ad_log#106,ad_log_app_editions#107,ad_log_app_id#108,trace#109,trace_annotations#110,pdate#111,phour#112]
>    ParquetRelation2(ArrayBuffer(hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00,
>    hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=01,
>    hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=02,
>    hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=03,
>    hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=04,
>    hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=05,
>    hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=06,
>    hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=07,
>    hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=08,
>    hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=09,
>    hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=10,
>    ....
> ...
>
> Step 8:
> 2015-05-25 16:38:06 TRACE ProtobufRpcEngine:206 - 51: Call -> /10.128.193.211:9000: getBlockLocations {src: "/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-23/phour=01/part-r-00001.parquet" offset: 0 length: 23423005}
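
Step 8 is the behavior we actually want: getBlockLocations issued only
under pdate=2015-05-23. After applying a fix or workaround, a quick way to
check whether pruning took effect is to look at the path list in the plan,
as in the Step 7 output above. A sketch, again assuming an existing
hiveContext:

    hiveContext.sql("explain extended select count(*) from api_search where pdate='2015-05-23'")
      .collect().foreach(println)

If only pdate=2015-05-23 directories appear in the relation's path list,
the scan is pruned; if every date/hour shows up (as in Step 7), it is not.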