I am finding a strange issue with Spark SQL where "select count(*) "
returns wrong row counts for certain tables.

I am using TPCDS tables, so here are the actual counts:

                                                                                
      
                            Row counts in     Spark SQL tables      Spark SQL 
tables  
                            raw generated     count(*) Parquet      count(*) 
Text     
                            files                                               
      
                                                                                
      
 call_center                                6                     0             
    0 
                                                                                
      
 catalog_page                           11718                 11515             
11515 
                                                                                
      
 catalog_returns                       144067                138352            
138352 
                                                                                
      
 catalog_sales                        1441548               1427257           
1427257 
                                                                                
      
 customer                              100000                 93063             
93063 
                                                                                
      
 customer_address                       50000                 48444             
48444 
                                                                                
      
 customer_demographics                1920800               1920800           
1920800 
                                                                                
      
 date_dim                               73049                 73049             
73049 
                                                                                
      
 household_demographics                  7200                  7200             
 7200 
                                                                                
      
 income_band                               20                    20             
   20 
                                                                                
      
 inventory                           11745000              11158087          
11158087 
                                                                                
      
 item                                   18000                 17917             
17917 
                                                                                
      
 promotion                                300                   289             
  289 
                                                                                
      
 reason                                    35                    35             
   35 
                                                                                
      
 ship_mode                                 20                    20             
   20 
                                                                                
      
 store                                     12                     3             
    3 
                                                                                
      
 store_returns                         287514                267471            
267471 
                                                                                
      
 store_sales                          2880404               2620573           
2620573 
                                                                                
      
 time_dim                               86400                 86400             
86400 
                                                                                
      
 warehouse                                  5                     4             
    4 
                                                                                
      
 web_page                                  60                    21             
   21 
                                                                                
      
 web_returns                            71763                 65384             
65384 
                                                                                
      
 web_sales                             719384                719025            
719025 
                                                                                
      
 web_site                                  30                    25             
   25 
                                                                                
      



call_center returned 0 count :(

The code used to do the count is fairly simple:

      df.registerTempTable(tablename)
      println("registered tempTable")
      val rc=sqlContext.sql("select count(*) from "+tablename)
      rc.map(t => "row count table "+tablename+": "+ t(0)).collect
().foreach(println)

This made many tpcds-derived queries return WRONG results.

Wanted to know if anything key is missing here.

Jesse

                                                                                
                                                              
                                                                                
                                                              
                                                                                
                                                              
                                                                                
                                                              
                                                                                
                                                              
                                                                                
                                                              
                                   JESSE CHEN                                   
                                                              
                                   Big Data Performance | IBM Analytics         
                                                              
                                                                                
                                                              
                                   Office:  408 463 2296                        
                                                              
                                   Mobile: 408 828 9068                         
                                                              
                                   Email:   jfc...@us.ibm.com                   
                                                              
                                                                                
                                                              
                                                                                
                                                              


Reply via email to