Hi Michael,
Here <https://issues.apache.org/jira/browse/SPARK-7084> is the JIRA issue
and PR <https://github.com/apache/spark/pull/5654> for the same. Please
have a look.




Regards,
Madhukara Phatak
http://datamantra.io/

On Thu, Apr 23, 2015 at 1:22 PM, madhu phatak <phatak....@gmail.com> wrote:

> Hi,
>  Hive table creation needs an extra step from 1.3 onwards. You can use the
> following template:
>
>  df.registerTempTable(tableName)
>
>  hc.sql(s"create table $tableName as select * from $tableName")
>
> This will save the table in Hive with the given tableName.
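>
> A fuller, self-contained sketch of the same template (the SparkContext sc, the
> parquet path, and the table name below are just placeholders; hc is assumed to
> be a HiveContext):
>
>  import org.apache.spark.sql.hive.HiveContext
>
>  val hc = new HiveContext(sc)                    // sc: an existing SparkContext
>  val df = hc.parquetFile("/data/input.parquet")  // any DataFrame works here
>
>  val tableName = "my_table"
>  df.registerTempTable(tableName)                 // temp table, visible only to this context
>  // CTAS through HiveQL persists it as a real Hive table with the same name
>  hc.sql(s"create table $tableName as select * from $tableName")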
>
>
> Regards,
> Madhukara Phatak
> http://datamantra.io/
>
> On Thu, Apr 23, 2015 at 4:00 AM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Sorry for the confusion.  We should be more clear about the semantics in
>> the documentation. (PRs welcome :) )
>>
>> .saveAsTable does not create a Hive table, but instead creates a Spark
>> Data Source table.  Here the metadata is persisted into Hive, but Hive
>> cannot read the tables (as this API supports MLlib vectors, schema
>> discovery, and other things that Hive does not).  If you want to create a
>> Hive table, use HiveQL and run a CREATE TABLE AS SELECT ...
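>>
>> As a rough sketch of the difference (the table names and the DataFrame df are
>> hypothetical; hc is assumed to be a HiveContext):
>>
>>   // Persists a Spark Data Source table: the metadata lands in the metastore,
>>   // but the Hive CLI cannot read the data back.
>>   df.saveAsTable("spark_ds_table")
>>
>>   // A HiveQL CTAS instead produces a table that Hive itself can read.
>>   df.registerTempTable("df_tmp")
>>   hc.sql("CREATE TABLE hive_readable_table AS SELECT * FROM df_tmp")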
>>
>> On Wed, Apr 22, 2015 at 12:50 AM, Ophir Cohen <oph...@gmail.com> wrote:
>>
>>> I wrote a few mails here regarding this issue.
>>> After further investigation I think there is a bug in Spark 1.3 when
>>> saving Hive tables.
>>>
>>> (hc is a HiveContext)
>>>
>>> 1. Verify the needed configuration exists:
>>> scala> hc.sql("set hive.exec.compress.output").collect
>>> res4: Array[org.apache.spark.sql.Row] =
>>> Array([hive.exec.compress.output=true])
>>> scala> hc.sql("set
>>> mapreduce.output.fileoutputformat.compress.codec").collect
>>> res5: Array[org.apache.spark.sql.Row] =
>>> Array([mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec])
>>> scala> hc.sql("set
>>> mapreduce.output.fileoutputformat.compress.type").collect
>>> res6: Array[org.apache.spark.sql.Row] =
>>> Array([mapreduce.output.fileoutputformat.compress.type=BLOCK])
>>> 2. Load the DataFrame and save it as a table (the path points to an existing file):
>>> val saDF = hc.parquetFile(path)
>>> saDF.count
>>>
>>> (count yields 229764, i.e. the RDD exists)
>>> saDF.saveAsTable("test_hive_ms")
>>>
>>> Now for a few interesting outputs:
>>> 1. Querying from the Hive CLI, the table exists but has the wrong output
>>> format:
>>> Failed with exception java.io.IOException:java.io.IOException: hdfs://
>>> 10.166.157.97:9000/user/hive/warehouse/test_hive_ms/part-r-00001.parquet
>>> not a SequenceFile
>>> 2. Looking at the output files shows that the files are '.parquet' and not
>>> '.snappy'.
>>> 3. Looking at the saveAsTable output shows that it actually stores the
>>> table with both the wrong output format and no compression:
>>> 15/04/22 07:16:54 INFO metastore.HiveMetaStore: 0: create_table:
>>> Table(tableName:test_hive_ms, dbName:default, owner:hadoop,
>>> createTime:1429687014, lastAccessTime:0, retention:0,
>>> sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>,
>>> comment:from deserializer)], location:null,
>>> inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
>>> outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
>>> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
>>> serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,
>>> parameters:{serialization.format=1, path=hdfs://
>>> 10.166.157.97:9000/user/hive/warehouse/test_hive_ms}),
>>> bucketCols:[], sortCols:[], parameters:{},
>>> skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
>>> skewedColValueLocationMaps:{})), partitionKeys:[],
>>> parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"ADJDATE","type":"long","nullable":true,"metadata":{}},{"name":"sid","type":"integer","nullable":true,"metadata":{}},{"name":"ADJTYPE","type":"integer","nullable":true,"metadata":{}},{"name":"ENDADJDATE","type":"long","nullable":true,"metadata":{}},{"name":"ADJFACTOR","type":"double","nullable":true,"metadata":{}},{"name":"CUMADJFACTOR","type":"double","nullable":true,"metadata":{}}]},
>>> EXTERNAL=FALSE, spark.sql.sources.schema.numParts=1,
>>> spark.sql.sources.provider=org.apache.spark.sql.parquet},
>>> viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
>>>
>>> So, the question is: am I missing some configuration here, or should I
>>> open a bug?
>>>
>>> Thanks,
>>> Ophir
>>>
>>>
>>
>
