This works fine scala> sql("use oraclehadoop") res1: org.apache.spark.sql.DataFrame = [result: string] scala> sql("select count(1) from sales").show +-------+ | _c0| +-------+ |4991761| +-------+
You can do "select count(*) from tablename") as it is not dynamic sql. Does it actually work? Since count(*) returns one value you don't really need to map it. A simple show() will work. Alternatively scala> var sqltext : String = "" sqltext: String = "" scala> sqltext = "select count(*) from sales" sqltext: String = select count(*) from sales scala> sql(sqltext) res3: org.apache.spark.sql.DataFrame = [_c0: bigint] HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 3 March 2016 at 02:38, Jesse F Chen <jfc...@us.ibm.com> wrote: > I am finding a strange issue with Spark SQL where "select count(*) " > returns wrong row counts for certain tables. > > I am using TPCDS tables, so here are the actual counts: > > > Row counts in > raw generated files Spark SQL tables count(*) Parquet Spark SQL tables > count(*) Text > call_center > 6 > 0 > 0 > catalog_page > 11718 > 11515 > 11515 > catalog_returns > 144067 > 138352 > 138352 > catalog_sales > 1441548 > 1427257 > 1427257 > customer > 100000 > 93063 > 93063 > customer_address > 50000 > 48444 > 48444 > customer_demographics > 1920800 > 1920800 > 1920800 > date_dim > 73049 > 73049 > 73049 > household_demographics > 7200 > 7200 > 7200 > income_band > 20 > 20 > 20 > inventory > 11745000 > 11158087 > 11158087 > item > 18000 > 17917 > 17917 > promotion > 300 > 289 > 289 > reason > 35 > 35 > 35 > ship_mode > 20 > 20 > 20 > store > 12 > 3 > 3 > store_returns > 287514 > 267471 > 267471 > store_sales > 2880404 > 2620573 > 2620573 > time_dim > 86400 > 86400 > 86400 > warehouse > 5 > 4 > 4 > web_page > 60 > 21 > 21 > web_returns > 71763 > 65384 > 65384 > web_sales > 719384 > 719025 > 719025 > web_site > 30 > 25 > 25 > > call_center returned 0 count :( > > The code used to do the count is fairly simple: > > df.registerTempTable(tablename) > println("registered tempTable") > val rc=sqlContext.sql("select count(*) from "+tablename) > rc.map(t => "row count table "+tablename+": "+ > t(0)).collect().foreach(println) > > This made many tpcds-derived queries return WRONG results. > > Wanted to know if anything key is missing here. > > Jesse > > > *JESSE CHEN* > Big Data Performance | IBM Analytics > > Office: 408 463 2296 > Mobile: 408 828 9068 > Email: jfc...@us.ibm.com > >