I am finding a strange issue with Spark SQL where "select count(*) " returns wrong row counts for certain tables.
I am using TPCDS tables, so here are the actual counts: Row counts in Spark SQL tables Spark SQL tables raw generated count(*) Parquet count(*) Text files call_center 6 0 0 catalog_page 11718 11515 11515 catalog_returns 144067 138352 138352 catalog_sales 1441548 1427257 1427257 customer 100000 93063 93063 customer_address 50000 48444 48444 customer_demographics 1920800 1920800 1920800 date_dim 73049 73049 73049 household_demographics 7200 7200 7200 income_band 20 20 20 inventory 11745000 11158087 11158087 item 18000 17917 17917 promotion 300 289 289 reason 35 35 35 ship_mode 20 20 20 store 12 3 3 store_returns 287514 267471 267471 store_sales 2880404 2620573 2620573 time_dim 86400 86400 86400 warehouse 5 4 4 web_page 60 21 21 web_returns 71763 65384 65384 web_sales 719384 719025 719025 web_site 30 25 25 call_center returned 0 count :( The code used to do the count is fairly simple: df.registerTempTable(tablename) println("registered tempTable") val rc=sqlContext.sql("select count(*) from "+tablename) rc.map(t => "row count table "+tablename+": "+ t(0)).collect ().foreach(println) This made many tpcds-derived queries return WRONG results. Wanted to know if anything key is missing here. Jesse JESSE CHEN Big Data Performance | IBM Analytics Office: 408 463 2296 Mobile: 408 828 9068 Email: jfc...@us.ibm.com