Resolution:
After realizing that the SerDe (OpenCSV) was causing all the fields to be
defined as String type, I modified the Hive load statement to use the
default serializer. I was also able to modify the CSV input file to use a
different delimiter. Although this is a workaround, I am able to proceed.
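For reference, a rough sketch of the two table definitions involved (the
table names, column names, and '|' delimiter are hypothetical). The OpenCSV
SerDe exposes every column as STRING regardless of the declared types,
while the default delimited format honors them:

sql("""CREATE TABLE source_csv (c1 STRING, c2 STRING, c3 INT)
       ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'""")

sql("""CREATE TABLE source_delim (c1 STRING, c2 STRING, c3 INT)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'""")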
Hi
I am in the process of migrating some logic in Pig scripts to Spark SQL. As
part of this process, I am creating a few Select...Group By queries and
registering them as tables using the SchemaRDD.registerAsTable feature.
When using such a registered table in a subsequent Select...Group By
query, a ClassCastException is thrown.
Using SUM on a string should automatically cast the column. Also, you can
use CAST to change the datatype:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions
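For example, the cast can be applied directly inside the aggregate (a
sketch using the column names from the query later in this thread; double
is an assumed target type):

val aggRdd = sql("select c1, c2, sum(cast(c3 as double)) from sourceRDD group by c1, c2")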
What version of Spark are you running? This could be
Thanks Michael. Should the cast be done in the source RDD or while doing
the SUM?
To give a better picture here is the code sequence:
val sourceRdd = sql("select ... from source-hive-table")
sourceRdd.registerAsTable("sourceRDD")
val aggRdd = sql("select c1, c2, sum(c3) from sourceRDD group by c1, c2")
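Alternatively, the cast can be pushed into the source query so that the
registered table already carries a numeric c3 and the SUM needs no cast (a
sketch; double is an assumed target type):

val sourceRdd = sql("select c1, c2, cast(c3 as double) as c3 from source-hive-table")
sourceRdd.registerAsTable("sourceRDD")
val aggRdd = sql("select c1, c2, sum(c3) from sourceRDD group by c1, c2")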
Which version of Spark are you running?
Sorry. It's 1.1.0.
After digging a bit more into this, it seems like the OpenCSV deserializer
converts all the columns to a String type. This may be throwing the
execution off. Planning to create a class and map the rows to this custom
class. Will keep this thread updated.
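A minimal sketch of that approach on Spark 1.1 (the Record class, the file
path, and the comma delimiter are all assumptions):

case class Record(c1: String, c2: String, c3: Double)

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// createSchemaRDD implicitly converts an RDD of case classes to a SchemaRDD
import sqlContext.createSchemaRDD

val rows = sc.textFile("hdfs:///path/to/input.csv")  // hypothetical path
  .map(_.split(","))
  .map(f => Record(f(0), f(1), f(2).toDouble))
rows.registerAsTable("sourceRDD")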
This is a bit strange. When I print the schema for the RDD, it reflects the
correct data type for each column. But doing any kind of mathematical
calculation seems to result in ClassCastException. Here is a sample that
results in the exception:
select c1, c2,
...
cast(c18 as int) * cast(c21 as int)