FYI, the latest Hive 0.14/Parquet will have column renaming support.
Jianshi

On Wed, Dec 10, 2014 at 3:37 AM, Michael Armbrust <mich...@databricks.com> wrote:

> You might also try out the recently added support for views.
>
> On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>
>> Ah... I see. Thanks for pointing it out.
>>
>> Then it means we cannot mount an external table using customized column
>> names. Hmm...
>>
>> Then the only option left is to use a subquery to add a bunch of column
>> aliases. I'll try it later.
>>
>> Thanks,
>> Jianshi
>>
>> On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> This is by Hive's design. From the Hive documentation:
>>>
>>>> The column change command will only modify Hive's metadata, and will not
>>>> modify data. Users should make sure the actual data layout of the
>>>> table/partition conforms with the metadata definition.
>>>
>>> On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>>>
>>>> Ok, found another possible bug in Hive.
>>>>
>>>> My current solution is to use ALTER TABLE CHANGE to rename the column
>>>> names.
>>>>
>>>> The problem is that after renaming the column names, the values of the
>>>> columns became all NULL.
>>>>
>>>> Before renaming:
>>>>
>>>>   scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
>>>>   res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
>>>>
>>>> Execute the renaming:
>>>>
>>>>   scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
>>>>   res13: org.apache.spark.sql.SchemaRDD =
>>>>   SchemaRDD[972] at RDD at SchemaRDD.scala:108
>>>>   == Query Plan ==
>>>>   <Native command: executed by Hive>
>>>>
>>>> After renaming:
>>>>
>>>>   scala> sql("select cre_ts from pmt limit 1").collect
>>>>   res16: Array[org.apache.spark.sql.Row] = Array([null])
>>>>
>>>> I created a JIRA for it:
>>>>
>>>>   https://issues.apache.org/jira/browse/SPARK-4781
>>>>
>>>> Jianshi
>>>>
>>>> On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>>>>
>>>>> Hmm... another issue I found with this approach is that ANALYZE TABLE
>>>>> ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
>>>>> later broadcast joins and such will fail...
>>>>>
>>>>> Any idea how to fix this issue?
>>>>>
>>>>> Jianshi
>>>>>
>>>>> On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>>>>>
>>>>>> Very interesting: the line doing the drop table throws an exception.
>>>>>> After removing it, everything works.
>>>>>>
>>>>>> Jianshi
>>>>>>
>>>>>> On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>>>>>>
>>>>>>> Here's the solution I got after talking with Liancheng:
>>>>>>>
>>>>>>> 1) Use backquotes `..` to wrap all illegal characters:
>>>>>>>
>>>>>>>   val rdd = parquetFile(file)
>>>>>>>   val schema = rdd.schema.fields.map(f =>
>>>>>>>     s"`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}"
>>>>>>>   ).mkString(",\n")
>>>>>>>
>>>>>>>   val ddl_13 = s"""
>>>>>>>     |CREATE EXTERNAL TABLE $name (
>>>>>>>     |  $schema
>>>>>>>     |)
>>>>>>>     |STORED AS PARQUET
>>>>>>>     |LOCATION '$file'
>>>>>>>   """.stripMargin
>>>>>>>
>>>>>>>   sql(ddl_13)
>>>>>>>
>>>>>>> 2) Create a new schema and do applySchema to generate a new
>>>>>>> SchemaRDD; I had to drop and re-register the table:
>>>>>>>
>>>>>>>   val t = table(name)
>>>>>>>   val newSchema = StructType(t.schema.fields.map(s =>
>>>>>>>     s.copy(name = s.name.replaceAll(".*?::", ""))))
>>>>>>>   sql(s"drop table $name")
>>>>>>>   applySchema(t, newSchema).registerTempTable(name)
>>>>>>>
>>>>>>> I'm testing it for now.
>>>>>>>
>>>>>>> Thanks for the help!
>>>>>>>
>>>>>>> Jianshi
>>>>>>>
>>>>>>> On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I had to use Pig for some preprocessing and to generate Parquet
>>>>>>>> files for Spark to consume.
>>>>>>>>
>>>>>>>> However, due to Pig's limitations, the generated schema contains
>>>>>>>> Pig's identifiers, e.g.
>>>>>>>>
>>>>>>>>   sorted::id, sorted::cre_ts, ...
>>>>>>>>
>>>>>>>> I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
>>>>>>>>
>>>>>>>>   create external table pmt (
>>>>>>>>     sorted::id bigint
>>>>>>>>   )
>>>>>>>>   stored as parquet
>>>>>>>>   location '...'
>>>>>>>>
>>>>>>>> Obviously it didn't work. I also tried removing the identifier
>>>>>>>> sorted::, but the resulting rows contain only nulls.
>>>>>>>>
>>>>>>>> Any idea how to create a table in HiveContext from these Parquet
>>>>>>>> files?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jianshi
>>>>>>>> --
>>>>>>>> Jianshi Huang
>>>>>>>>
>>>>>>>> LinkedIn: jianshi
>>>>>>>> Twitter: @jshuang
>>>>>>>> Github & Blog: http://huangjs.github.com/
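The workarounds discussed in the thread all come down to one string transformation: strip Pig's `relation::` prefix from each column name, and backquote the raw name wherever Hive must parse it. A minimal plain-Scala sketch of that logic, assuming the subquery-alias approach Jianshi mentioned (the `PigColumnNames` object and its method names are hypothetical; the table and column names are taken from the thread):

```scala
object PigColumnNames {
  // Strip a Pig relation prefix like "sorted::" from a column name,
  // using the same reluctant regex as the applySchema workaround.
  def stripPrefix(name: String): String = name.replaceAll(".*?::", "")

  // Backquote the raw Pig name so Hive can parse it, and alias it
  // to the cleaned name.
  def aliased(name: String): String = s"`$name` AS ${stripPrefix(name)}"

  // Build a SELECT exposing the cleaned names. Unlike ALTER TABLE
  // CHANGE, this touches neither the metastore nor the Parquet data,
  // so name-based column resolution keeps working.
  def cleanSelect(table: String, columns: Seq[String]): String =
    columns.map(aliased).mkString("SELECT ", ", ", s" FROM $table")
}
```

For example, `PigColumnNames.cleanSelect("pmt", Seq("sorted::id", "sorted::cre_ts"))` yields a query selecting `id` and `cre_ts` from `pmt`, which could back a view or a registered temp table.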