FYI, the latest Hive 0.14/Parquet will have column renaming support.
Jianshi

On Wed, Dec 10, 2014 at 3:37 AM, Michael Armbrust <mich...@databricks.com> wrote:

> You might also try out the recently added support for views.
>
> On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>
>> Ah... I see. Thanks for pointing it out.
>>
>> Then it means we cannot mount an external table using customized column
>> names. Hmm...
>>
>> Then the only option left is to use a subquery to add a bunch of column
>> aliases. I'll try it later.
>>
>> Thanks,
>> Jianshi
>>
>> On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> This is by Hive's design. From the Hive documentation:
>>>
>>>> The column change command will only modify Hive's metadata, and will not
>>>> modify data. Users should make sure the actual data layout of the
>>>> table/partition conforms with the metadata definition.
>>>
>>> On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>>>
>>>> Ok, found another possible bug in Hive.
>>>>
>>>> My current solution is to use ALTER TABLE CHANGE to rename the column
>>>> names.
>>>>
>>>> The problem is that after renaming the column names, the values of the
>>>> columns became all NULL.
>>>>
>>>> Before renaming:
>>>>
>>>>   scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
>>>>   res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
>>>>
>>>> Execute the renaming:
>>>>
>>>>   scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
>>>>   res13: org.apache.spark.sql.SchemaRDD =
>>>>   SchemaRDD[972] at RDD at SchemaRDD.scala:108
>>>>   == Query Plan ==
>>>>   <Native command: executed by Hive>
>>>>
>>>> After renaming:
>>>>
>>>>   scala> sql("select cre_ts from pmt limit 1").collect
>>>>   res16: Array[org.apache.spark.sql.Row] = Array([null])
>>>>
>>>> I created a JIRA for it:
>>>>
>>>>   https://issues.apache.org/jira/browse/SPARK-4781
>>>>
>>>> Jianshi
>>>>
>>>> On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>>>>
>>>>> Hmm... another issue I found with this approach is that ANALYZE TABLE
>>>>> ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
>>>>> later broadcast joins and such will fail...
>>>>>
>>>>> Any idea how to fix this issue?
>>>>>
>>>>> Jianshi
>>>>>
>>>>> On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>>>>>
>>>>>> Very interesting: the line doing the drop table throws an exception.
>>>>>> After removing it, everything works.
>>>>>>
>>>>>> Jianshi
>>>>>>
>>>>>> On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>>>>>>
>>>>>>> Here's the solution I got after talking with Liancheng:
>>>>>>>
>>>>>>> 1) Use backquotes `..` to wrap all illegal characters:
>>>>>>>
>>>>>>>   val rdd = parquetFile(file)
>>>>>>>   val schema = rdd.schema.fields.map(f =>
>>>>>>>     s"`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}"
>>>>>>>   ).mkString(",\n")
>>>>>>>
>>>>>>>   val ddl_13 = s"""
>>>>>>>     |CREATE EXTERNAL TABLE $name (
>>>>>>>     |  $schema
>>>>>>>     |)
>>>>>>>     |STORED AS PARQUET
>>>>>>>     |LOCATION '$file'
>>>>>>>   """.stripMargin
>>>>>>>
>>>>>>>   sql(ddl_13)
>>>>>>>
>>>>>>> 2) Create a new schema and do applySchema to generate a new
>>>>>>> SchemaRDD; I had to drop and re-register the table:
>>>>>>>
>>>>>>>   val t = table(name)
>>>>>>>   val newSchema = StructType(t.schema.fields.map(s =>
>>>>>>>     s.copy(name = s.name.replaceAll(".*?::", ""))))
>>>>>>>   sql(s"drop table $name")
>>>>>>>   applySchema(t, newSchema).registerTempTable(name)
>>>>>>>
>>>>>>> I'm testing it for now.
>>>>>>>
>>>>>>> Thanks for the help!
>>>>>>>
>>>>>>> Jianshi
>>>>>>>
>>>>>>> On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I had to use Pig for some preprocessing and to generate Parquet
>>>>>>>> files for Spark to consume.
>>>>>>>>
>>>>>>>> However, due to Pig's limitations, the generated schema contains
>>>>>>>> Pig's identifiers, e.g.
>>>>>>>>
>>>>>>>>   sorted::id, sorted::cre_ts, ...
>>>>>>>>
>>>>>>>> I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
>>>>>>>>
>>>>>>>>   create external table pmt (
>>>>>>>>     sorted::id bigint
>>>>>>>>   )
>>>>>>>>   stored as parquet
>>>>>>>>   location '...'
>>>>>>>>
>>>>>>>> Obviously it didn't work. I also tried removing the identifier
>>>>>>>> sorted::, but the resulting rows contain only nulls.
>>>>>>>>
>>>>>>>> Any idea how to create a table in HiveContext from these Parquet
>>>>>>>> files?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jianshi
>>>>>>>> --
>>>>>>>> Jianshi Huang
>>>>>>>>
>>>>>>>> LinkedIn: jianshi
>>>>>>>> Twitter: @jshuang
>>>>>>>> Github & Blog: http://huangjs.github.com/
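The workarounds discussed in the thread all come down to one string transformation: strip Pig's `relation::` prefix from each column name, and backquote the raw name wherever Hive must parse it. A minimal plain-Scala sketch of that logic, assuming the subquery-alias approach Jianshi mentioned (the `PigColumnNames` object and its method names are hypothetical; the table and column names are taken from the thread):

```scala
object PigColumnNames {
  // Strip a Pig relation prefix like "sorted::" from a column name,
  // using the same reluctant regex as the applySchema workaround.
  def stripPrefix(name: String): String = name.replaceAll(".*?::", "")

  // Backquote the raw Pig name so Hive can parse it, and alias it
  // to the cleaned name.
  def aliased(name: String): String = s"`$name` AS ${stripPrefix(name)}"

  // Build a SELECT exposing the cleaned names. Unlike ALTER TABLE
  // CHANGE, this touches neither the metastore nor the Parquet data,
  // so name-based column resolution keeps working.
  def cleanSelect(table: String, columns: Seq[String]): String =
    columns.map(aliased).mkString("SELECT ", ", ", s" FROM $table")
}
```

For example, `PigColumnNames.cleanSelect("pmt", Seq("sorted::id", "sorted::cre_ts"))` yields a query selecting `id` and `cre_ts` from `pmt`, which could back a view or a registered temp table.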