Take a look at this Jira: https://issues.apache.org/jira/browse/SPARK-6910
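
That JIRA is about pushing the partition predicate down to the metastore, which is exactly the all-partition scanning you are seeing. For reference, below is a minimal sketch in Scala of the two settings usually brought up for this; please verify them against your own setup, and note that the metastore-pruning option only exists once the SPARK-6910 change lands, so not in 1.3.x.

  // Sketch only; assumes an existing SparkContext `sc` and the same table as in
  // the original query.
  import org.apache.spark.sql.hive.HiveContext

  val sqlContext = new HiveContext(sc)

  // On 1.3.x: skip the Parquet conversion so the Hive read path prunes
  // partitions via the metastore (at the cost of Parquet-specific optimizations).
  sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

  // Once the SPARK-6910 change is available, partition predicates can be
  // pushed to the metastore instead:
  // sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")

  sqlContext.sql("select count(*) from api_search where pdate='2015-05-23'").show()
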
Yong

> Date: Mon, 1 Jun 2015 12:26:16 -0700
> From: oke...@gmail.com
> To: user@spark.apache.org
> Subject: SparkSQL's performance gets degraded depending on number of 
> partitions of Hive tables..is it normal?
> 
> 
> Hello, 
> I posted this question a while back but am posting it again to get your
> attention.
> 
> I am using SparkSQL 1.3.1 and Hive 0.13.1 on AWS & YARN (tested under both
> 1.3.0 & 1.3.1). 
> My Hive table is partitioned.
> I noticed that the query response time degrades with the number of partitions,
> even though the query targets only a small subset of them. TRACE-level logs
> (the ThriftServer's) showed that it runs commands like getFileInfo, getListing,
> and getBlockLocations for every partition (it also runs getBlockLocations for
> every file), even for partitions that are not part of the query.
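> (For reference, the per-partition RPCs in the trace map to plain HDFS client
> calls roughly as in the sketch below; this is only an illustration of that
> mapping, not how Spark actually issues them.)
> 
>   import org.apache.hadoop.conf.Configuration
>   import org.apache.hadoop.fs.{FileSystem, Path}
> 
>   // Illustration of what each per-partition RPC in the trace corresponds to;
>   // Spark/Parquet issue these through their own listing code, not literally like this.
>   val fs  = FileSystem.get(new Configuration())
>   val dir = new Path("/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00")
>   val status = fs.getFileStatus(dir)            // shows up as getFileInfo
>   val files  = fs.listStatus(dir)               // shows up as getListing
>   for (f <- files) {
>     fs.getFileBlockLocations(f, 0, f.getLen)    // shows up as getBlockLocations
>   }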
> 
> I don't know why this is necessary. Is it a bug in SparkSQL? Is there a way to
> avoid it?
> The details of the issue, including logs, are below.
> 
> Thanks,
> 
> 
> ----------
> 
> My Hive table is an external table partitioned by date and hour (a rough
> sketch of the DDL is below). 
> I expected that a query restricted to certain partitions would read only the
> data files of those partitions. 
> I turned on TRACE-level logging for the ThriftServer since the query response
> time was very long even for narrowed partitions. 
> I found that all available partitions are checked during some of the steps. 
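> (For context, the table layout roughly matches the sketch below; the column
> list is trimmed and the types are guesses, so treat the DDL as illustrative
> only. It assumes a HiveContext named sqlContext.) 
> 
>   // Illustrative DDL only; the real column list is much longer (see the
>   // optimized plan further down) and the types are assumptions.
>   sqlContext.sql("""
>     CREATE EXTERNAL TABLE IF NOT EXISTS api_hdfs_perf.api_search (
>       `timestamp` STRING,
>       request_id STRING
>     )
>     PARTITIONED BY (pdate STRING, phour STRING)
>     STORED AS PARQUET
>     LOCATION 'hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search'
>   """)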
> 
> The logs showed an execution flow such as: 
> == 
> Step 1: Contacted the Hive metastore to get partition info (cmd:
> get_partitions) 
> 
> Step 2: Came up with an execution rule 
> 
> Step 3: Contacted the NameNode (the data is in HDFS), making at least 4 calls
> for every available partition of the table: 
>    getFileInfo once, getListing once, and then the same again for each
> partition. 
> 
> Step 4: Contacted the NameNode to get the block locations of all the partitions 
> 
> Step 5: Contacted a DataNode for each file of every partition 
> 
> Step 6: Contacted the NameNode again for all the partitions 
> 
> Step 7: SparkSQL generated an optimized plan 
> 
> Step 8: Contacted the corresponding DataNodes for (it seems) only the narrowed
> partitions 
> And more... 
> === 
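> (As a side note on Step 1: the metastore itself can return a filtered
> partition list. The sketch below contrasts the unfiltered get_partitions call
> with a filtered listing via the Hive metastore client; it is only an
> illustration, not something SparkSQL 1.3 does on its own.) 
> 
>   import org.apache.hadoop.hive.conf.HiveConf
>   import org.apache.hadoop.hive.metastore.HiveMetaStoreClient
> 
>   // Illustration only, against the Hive 0.13 metastore client API.
>   val client = new HiveMetaStoreClient(new HiveConf())
>   // What Step 1 effectively does: fetch every partition of the table.
>   val all = client.listPartitions("api_hdfs_perf", "api_search", Short.MaxValue)
>   // What a pushed-down predicate would allow: fetch only the matching ones.
>   val matching = client.listPartitionsByFilter(
>     "api_hdfs_perf", "api_search", "pdate = \"2015-05-23\"", Short.MaxValue)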
> 
> Why do Steps 3, 4, 5, and 6 check all partitions? 
> After removing partitions from the table, the query was much quicker while
> processing the same volume of data. 
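> (To illustrate the comparison: reading the whole table root versus a single
> partition directory with the Parquet reader shows the two access patterns.
> The paths are taken from the logs, and this is only a sketch, assuming a
> HiveContext named sqlContext.) 
> 
>   // Reading the table root: every pdate=*/phour=* directory is listed and its
>   // file statuses fetched before anything can be pruned.
>   val whole = sqlContext.parquetFile(
>     "hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search")
> 
>   // Reading one partition directory: only that day's files are touched.
>   val oneDay = sqlContext.parquetFile(
>     "hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-23")
>   oneDay.count()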
> 
> I don't know whether this is normal, a Hive issue, a SparkSQL issue, or an
> issue with my configuration. 
> I added some logs for some of the steps below. 
> 
> I would appreciate any advice. 
> 
> Thanks a lot, 
> Okehee 
> 
> ==== Some logs from some of the steps 
> 
> Query: select count(*) from api_search where pdate='2015-05-23'; 
> 
> Step 2: 
> 
> 2015-05-25 16:37:43 TRACE HiveContext$$anon$3:67 - 
> 
> === Applying Rule
> org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates === 
> 
> !'Project [COUNT(1) AS _c0#25L]                        Aggregate [],
> [COUNT(1) AS _c0#25L] 
> 
>   Filter (pdate#26 = 2015-05-23)                        Filter (pdate#26 =
> 2015-05-23) 
> 
>    MetastoreRelation api_hdfs_perf, api_search, None     MetastoreRelation
> api_hdfs_perf, api_search, None 
> .. 
> 
> Step 3: 
> 
> 2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call ->
> /10.128.193.211:9000: getFileInfo {src:
> "/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00"} 
> 
> 2015-05-25 16:37:44 DEBUG Client:424 - The ping interval is 60000 ms. 
> 
> 2015-05-25 16:37:44 DEBUG Client:693 - Connecting to /10.128.193.211:9000 
> 
> 2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection
> to /10.128.193.211:9000 from ogoh sending #151 
> 
> 2015-05-25 16:37:44 DEBUG Client:944 - IPC Client (2100771791) connection to
> /10.128.193.211:9000 from ogoh: starting, having connections 2 
> 
> 2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection
> to /10.128.193.211:9000 from ogoh got value #151 
> 
> 2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took
> 13ms 
> 
> 2015-05-25 16:37:44 TRACE ProtobufRpcEngine:250 - 84: Response <-
> /10.128.193.211:9000: getFileInfo {fs { fileType: IS_DIR path: "" length: 0
> permission { perm: 493 } owner: "hadoop" group: "supergroup"
> modification_time: 1432364487906 access_time: 0 block_replication: 0
> blocksize: 0 fileId: 100602 childrenNum: 2 }} 
> 
> 2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call ->
> /10.128.193.211:9000: getFileInfo {src:
> "/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00"} 
> 
> 2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection
> to /10.128.193.211:9000 from ogoh sending #152 
> 
> 2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection
> to /10.128.193.211:9000 from ogoh got value #152 
> 
> 2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took
> 2ms. 
> ...... 
> 
> 
> Step 4: 
> 
> 2015-05-25 16:37:47 TRACE ProtobufRpcEngine:206 - 89: Call ->
> /10.128.193.211:9000: getBlockLocations {src:
> "/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-00001.parquet"
> offset: 0 length: 1342177280} 
> ... 
> 
> 
> Step 5: 
> 
> 2015-05-25 16:37:48 DEBUG DFSClient:951 - Connecting to datanode
> 10.191.137.197:9200 
> 
> 2015-05-25 16:37:48 TRACE BlockReaderFactory:653 -
> BlockReaderFactory(fileName=/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-00002.parquet,
> block=BP-1843960649-10.128.193.211-1427923845046:blk_1073758677_981812):
> trying to create a remote block reader from a TCP socket 
> ... 
> 
> Step 6: 
> 
> 2015-05-25 16:37:56 TRACE ProtobufRpcEngine:206 - 84: Call ->
> /10.128.193.211:9000: getFileInfo {src:
> "/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00"} 
> ... 
> 
> 
> Step 7: 
> 
> === Applying Rule
> org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions === 
> 
> == Optimized Logical Plan == 
> 
> Aggregate [], [COUNT(1) AS _c0#25L] 
> 
>  Project [] 
> 
>   Filter (pdate#111 = 2015-05-23) 
> 
>   
> Relation[timestamp#84,request_id#85,request_timestamp#86,response_timestamp#87,request_query_url#88,request_query_params#89,response_status#90,q#91,session_id#92,partner_id#93,partner_name#94,partner_ip#95,partner_useragent#96,search_id#97,user_id#98,client_ip#99,client_country#100,client_useragent#101,client_platform#102,search_appids#103,search_topeditionidsbyappid#104,query_categories#105,ad_log#106,ad_log_app_editions#107,ad_log_app_id#108,trace#109,trace_annotations#110,pdate#111,phour#112]
> ParquetRelation2(ArrayBuffer(hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00,
> hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=01,
> hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=02,
> hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=03,
> hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=04,
> hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=05,
> hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=06,
> hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=07,
> hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=08,
> hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=09,
> hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=10,
>  
> .... 
> ... 
> 
> 
> Step 8: 
> 
> 2015-05-25 16:38:06 TRACE ProtobufRpcEngine:206 - 51: Call ->
> /10.128.193.211:9000: getBlockLocations {src:
> "/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-23/phour=01/part-r-00001.parquet"
> offset: 0 length: 23423005} 
> 
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-s-performance-gets-degraded-depending-on-number-of-partitions-of-Hive-tables-is-it-normal-tp23100.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
                                          
