Hi all,

We are using a Parquet Hive table, and we are upgrading to Spark 1.3. But we
find that even a simple COUNT(*) query is much slower (about 100x) than on
Spark 1.2.
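For reference, the slow query is just a plain HiveQL COUNT through the
HiveContext, roughly like below (the table name is a placeholder for ours):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc) // sc is the SparkContext from spark-shell
// COUNT(*) over the Parquet-backed Hive table; "myparquettable" stands in for our table
sqlContext.sql("SELECT COUNT(*) FROM myparquettable").collect()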

I found that most of the time is spent on the driver getting HDFS block
locations. A large number of log lines like the ones below are printed:

15/03/30 23:03:43 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 2097ms
15/03/30 23:03:43 DEBUG DFSClient: newInfo = LocatedBlocks{
  fileLength=77153436
  underConstruction=false
  blocks=[LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275;
    getBlockSize()=77153436; corrupt=false; offset=0;
    locs=[10.152.116.172:50010, 10.152.116.169:50010, 10.153.125.184:50010]}]
  lastLocatedBlock=LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275;
    getBlockSize()=77153436; corrupt=false; offset=0;
    locs=[10.152.116.169:50010, 10.153.125.184:50010, 10.152.116.172:50010]}
  isLastBlockComplete=true}
15/03/30 23:03:43 DEBUG DFSClient: Connecting to datanode 10.152.116.172:50010
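
(In case anyone wants to reproduce this: the lines above appear once DEBUG
logging is enabled for the HDFS client classes, e.g. with something like the
following in Spark's conf/log4j.properties; adjust to your own logging setup.)

log4j.logger.org.apache.hadoop.hdfs.DFSClient=DEBUG
log4j.logger.org.apache.hadoop.ipc.ProtobufRpcEngine=DEBUG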


Comparing with the logs printed by Spark 1.2: although the number of
getBlockLocations calls is similar, each call there only cost 20~30 ms
(versus 2000~3000 ms now), and Spark 1.2 did not print the detailed
LocatedBlocks info.

Another finding is that if I read the Parquet file directly via Scala code
from spark-shell, as below, it looks fine, and the computation returns the
result as quickly as before.

sqlContext.parquetFile("data/myparquettable")
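
(For example, running the same count over the loaded data returns quickly; a
minimal sketch, using the same placeholder path as above:)

// Reading the Parquet files directly, bypassing the Hive table path
sqlContext.parquetFile("data/myparquettable").count()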


Any ideas about this? Thank you!


-- 
郑旭东
Zheng, Xudong
