Hi all,

We are using a Parquet Hive table, and we are upgrading to Spark 1.3. But we find that a simple COUNT(*) query is much (100x) slower than in Spark 1.2.
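For reference, this is roughly the shape of what we run from spark-shell (sqlContext is the one the shell provides; the table name below is only illustrative):

    // spark-shell session; sqlContext is predefined by the shell
    // "myparquettable" stands in for our actual Parquet Hive table
    val counted = sqlContext.sql("SELECT COUNT(*) FROM myparquettable")
    counted.collect()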
I find that most of the time is spent on the driver getting HDFS blocks. A large number of logs like the ones below are printed:

15/03/30 23:03:43 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 2097ms
15/03/30 23:03:43 DEBUG DFSClient: newInfo = LocatedBlocks{ fileLength=77153436 underConstruction=false blocks=[LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.172:50010, 10.152.116.169:50010, 10.153.125.184:50010]}] lastLocatedBlock=LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.169:50010, 10.153.125.184:50010, 10.152.116.172:50010]} isLastBlockComplete=true}
15/03/30 23:03:43 DEBUG DFSClient: Connecting to datanode 10.152.116.172:50010

Comparing this with the logs from Spark 1.2: the number of getBlockLocations calls is similar, but each call took only 20~30 ms there (versus 2000~3000 ms now), and Spark 1.2 did not print the detailed LocatedBlocks info.

Another finding: if I read the Parquet file directly from spark-shell with the Scala code below, it looks fine, and the computation returns the result as quickly as before.

    sqlContext.parquetFile("data/myparquettable")

Any idea about it? Thank you!

--
郑旭东
Zheng, Xudong