Hello,

I am using Spark SQL 1.3.0 and Hive 0.13.1 on AWS, running on YARN.

My Hive table is an external table partitioned by date and hour.
I expected that a query restricted to certain partitions would read only the
data files of those partitions.
I turned on TRACE-level logging for the Thrift server because the query
response time was very long even when the query was narrowed to a few
partitions, and I found that all available partitions are checked during some
of the steps.
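
For context, the table definition looks roughly like this (a simplified
sketch: the partition column types are my assumption, and the full column list
appears in the Step 7 plan below):

  CREATE EXTERNAL TABLE api_hdfs_perf.api_search (
    `timestamp` string,
    request_id string
    -- ... remaining columns omitted; see the Step 7 plan below ...
  )
  PARTITIONED BY (pdate string, phour string)
  STORED AS PARQUET
  LOCATION '/user/datawarehouse/api_hdfs_perf/api_search';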

The logs showed an execution flow like this:
==
Step 1: Contacted the Hive metastore to get partition info (cmd:
get_partitions)

Step 2: Came up with an execution rule

Step 3: Contacted the NameNode, making at least 4 calls (the data is in HDFS)
for every available partition of the table:
   getFileInfo once, getListing once, and then both of them again, for each
partition.

Step 4: Contacted the NameNode to find the block locations of all the partitions

Step 5: Contacted a DataNode for each file in all the partitions

Step 6: Contacted the NameNode again for all the partitions

Step 7: Spark SQL generated an optimized plan (these plans can also be printed
directly; see the sketch after this list)

Step 8: Contacted the corresponding DataNodes for only the narrowed partitions
(it seems)
And more.....
===
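
The plans shown in Steps 2 and 7 can, I believe, also be printed directly,
without TRACE logging, by running the query under EXPLAIN EXTENDED in the same
Thrift server session; it should show the parsed, analyzed, optimized, and
physical plans:

  EXPLAIN EXTENDED
  SELECT COUNT(*) FROM api_search WHERE pdate = '2015-05-23';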

Why do Steps 3, 4, 5, and 6 check all partitions?
After removing partitions from the table, the query was much quicker while
processing the same volume of data.

I don't know whether this is normal, a Hive issue, a Spark SQL issue, or a
configuration issue on my side.
I added some logs below for some of the steps.
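
Since Step 7 shows the MetastoreRelation being converted by the
HiveMetastoreCatalog$ParquetConversions rule into a ParquetRelation2 that
lists every partition directory, one thing I plan to try is turning that
conversion off so the table goes through the Hive SerDe read path instead, to
see whether the all-partition metadata calls go away (I am not sure yet
whether this setting is actually related):

  -- Run in the same Thrift server session before the query.
  -- When false, Spark SQL uses the Hive SerDe for Parquet tables
  -- instead of its built-in ParquetRelation2 support.
  SET spark.sql.hive.convertMetastoreParquet=false;

  SELECT COUNT(*) FROM api_search WHERE pdate = '2015-05-23';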

I would appreciate any advice.

Thanks a lot,
Okehee

==== Some logs for some of the steps

Query: select count(*) from api_search where pdate='2015-05-23';

Step 2:

2015-05-25 16:37:43 TRACE HiveContext$$anon$3:67 -

=== Applying Rule
org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates ===

!'Project [COUNT(1) AS _c0#25L]                        Aggregate [], [COUNT(1) AS _c0#25L]
  Filter (pdate#26 = 2015-05-23)                        Filter (pdate#26 = 2015-05-23)
   MetastoreRelation api_hdfs_perf, api_search, None     MetastoreRelation api_hdfs_perf, api_search, None
..

Step 3:

2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call ->
/10.128.193.211:9000: getFileInfo {src:
"/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00"}

2015-05-25 16:37:44 DEBUG Client:424 - The ping interval is 60000 ms.

2015-05-25 16:37:44 DEBUG Client:693 - Connecting to /10.128.193.211:9000

2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh sending #151

2015-05-25 16:37:44 DEBUG Client:944 - IPC Client (2100771791) connection to
/10.128.193.211:9000 from ogoh: starting, having connections 2

2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh got value #151

2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took
13ms

2015-05-25 16:37:44 TRACE ProtobufRpcEngine:250 - 84: Response <-
/10.128.193.211:9000: getFileInfo {fs { fileType: IS_DIR path: "" length: 0
permission { perm: 493 } owner: "hadoop" group: "supergroup"
modification_time: 1432364487906 access_time: 0 block_replication: 0
blocksize: 0 fileId: 100602 childrenNum: 2 }}

2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call ->
/10.128.193.211:9000: getFileInfo {src:
"/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00"}

2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh sending #152

2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh got value #152

2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took
2ms.
......


Step 4:

2015-05-25 16:37:47 TRACE ProtobufRpcEngine:206 - 89: Call ->
/10.128.193.211:9000: getBlockLocations {src:
"/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-00001.parquet"
offset: 0 length: 1342177280}
...


Step 5:

2015-05-25 16:37:48 DEBUG DFSClient:951 - Connecting to datanode
10.191.137.197:9200

2015-05-25 16:37:48 TRACE BlockReaderFactory:653 -
BlockReaderFactory(fileName=/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-00002.parquet,
block=BP-1843960649-10.128.193.211-1427923845046:blk_1073758677_981812):
trying to create a remote block reader from a TCP socket
...

Step 6:

2015-05-25 16:37:56 TRACE ProtobufRpcEngine:206 - 84: Call ->
/10.128.193.211:9000: getFileInfo {src:
"/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00"}
...


Step 7:

=== Applying Rule
org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions ===

== Optimized Logical Plan ==

Aggregate [], [COUNT(1) AS _c0#25L]

 Project []

  Filter (pdate#111 = 2015-05-23)

  
Relation[timestamp#84,request_id#85,request_timestamp#86,response_timestamp#87,request_query_url#88,request_query_params#89,response_status#90,q#91,session_id#92,partner_id#93,partner_name#94,partner_ip#95,partner_useragent#96,search_id#97,user_id#98,client_ip#99,client_country#100,client_useragent#101,client_platform#102,search_appids#103,search_topeditionidsbyappid#104,query_categories#105,ad_log#106,ad_log_app_editions#107,ad_log_app_id#108,trace#109,trace_annotations#110,pdate#111,phour#112]
ParquetRelation2(ArrayBuffer(hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00,
hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=01,
hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=02,
hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=03,
hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=04,
hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=05,
hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=06,
hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=07,
hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=08,
hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=09,
hdfs://10.128.193.211:9000/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=10,
....
...


Step 8:

2015-05-25 16:38:06 TRACE ProtobufRpcEngine:206 - 51: Call ->
/10.128.193.211:9000: getBlockLocations {src:
"/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-23/phour=01/part-r-00001.parquet"
offset: 0 length: 23423005}



......


