This works fine

scala> sql("use oraclehadoop")
res1: org.apache.spark.sql.DataFrame = [result: string]
scala> sql("select count(1) from sales").show
+-------+
|    _c0|
+-------+
|4991761|
+-------+

You can do "select count(*) from tablename") as it is not dynamic sql. Does
it actually work?

Since count(*) returns one value you don't really need to map it. A simple
show() will work.

Alternatively

scala> var sqltext : String = ""
sqltext: String = ""
scala> sqltext = "select count(*) from sales"
sqltext: String = select count(*) from sales
scala> sql(sqltext)
res3: org.apache.spark.sql.DataFrame = [_c0: bigint]


HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 3 March 2016 at 02:38, Jesse F Chen <jfc...@us.ibm.com> wrote:

> I am finding a strange issue with Spark SQL where "select count(*) "
> returns wrong row counts for certain tables.
>
> I am using TPCDS tables, so here are the actual counts:
>
>
> Row counts in
> raw generated files Spark SQL tables count(*) Parquet Spark SQL tables
> count(*) Text
> call_center
> 6
> 0
> 0
> catalog_page
> 11718
> 11515
> 11515
> catalog_returns
> 144067
> 138352
> 138352
> catalog_sales
> 1441548
> 1427257
> 1427257
> customer
> 100000
> 93063
> 93063
> customer_address
> 50000
> 48444
> 48444
> customer_demographics
> 1920800
> 1920800
> 1920800
> date_dim
> 73049
> 73049
> 73049
> household_demographics
> 7200
> 7200
> 7200
> income_band
> 20
> 20
> 20
> inventory
> 11745000
> 11158087
> 11158087
> item
> 18000
> 17917
> 17917
> promotion
> 300
> 289
> 289
> reason
> 35
> 35
> 35
> ship_mode
> 20
> 20
> 20
> store
> 12
> 3
> 3
> store_returns
> 287514
> 267471
> 267471
> store_sales
> 2880404
> 2620573
> 2620573
> time_dim
> 86400
> 86400
> 86400
> warehouse
> 5
> 4
> 4
> web_page
> 60
> 21
> 21
> web_returns
> 71763
> 65384
> 65384
> web_sales
> 719384
> 719025
> 719025
> web_site
> 30
> 25
> 25
>
> call_center returned 0 count :(
>
> The code used to do the count is fairly simple:
>
> df.registerTempTable(tablename)
> println("registered tempTable")
> val rc=sqlContext.sql("select count(*) from "+tablename)
> rc.map(t => "row count table "+tablename+": "+
> t(0)).collect().foreach(println)
>
> This made many tpcds-derived queries return WRONG results.
>
> Wanted to know if anything key is missing here.
>
> Jesse
>
>
> *JESSE CHEN*
> Big Data Performance | IBM Analytics
>
> Office: 408 463 2296
> Mobile: 408 828 9068
> Email: jfc...@us.ibm.com
>
>

Reply via email to