If it works under Hive, have you tried creating the DataFrame from the Hive
table directly in Spark? That should work, right?
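
Something like this minimal sketch, assuming the ORC files are already
registered as a Hive table (the table name here is hypothetical) and the
SparkSession has Hive support enabled:

import org.apache.spark.sql.SparkSession

// Assumes a Hive metastore is configured; "orc_files_test" is a
// hypothetical table name.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// The schema comes from the metastore, so Hive's handling of files
// that are missing newer columns applies.
val df = spark.table("orc_files_test")
df.show()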


Yong


________________________________
From: Begar, Veena <veena.be...@hpe.com>
Sent: Wednesday, February 15, 2017 10:16 AM
To: Yong Zhang; smartzjp; user@spark.apache.org
Subject: RE: How to specify default value for StructField?


Thanks Yong.



I know about the schema-merging option.

Using Hive we can read Avro files having different schemas, and we can do the
same in Spark.

Similarly, we can read ORC files having different schemas in Hive, but we
can't do the same in Spark using a dataframe. How can we do it using a
dataframe?
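
The manual workaround I know of is to read each schema group separately, add
the missing column as null, and union the results. A sketch, with hypothetical
split paths:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Hypothetical paths: one folder per schema version.
val dfOld = spark.read.format("orc").load("/user/hos/orc_f1_only")
  .withColumn("f2", lit(null).cast(StringType))
  .select("f1", "f2")
val dfNew = spark.read.format("orc").load("/user/hos/orc_f1_f2")
  .select("f1", "f2")

// union matches columns by position, hence the explicit selects above.
val df = dfNew.union(dfOld)

But I'd like Spark to handle this for me, the way Hive does.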



Thanks.

From: Yong Zhang [mailto:java8...@hotmail.com]
Sent: Tuesday, February 14, 2017 8:31 PM
To: Begar, Veena <veena.be...@hpe.com>; smartzjp <zjp_j...@163.com>; 
user@spark.apache.org
Subject: Re: How to specify default value for StructField?



You may be looking for something like "spark.sql.parquet.mergeSchema" for
ORC. Unfortunately, I don't think it is available, unless someone tells me I am
wrong.
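
For comparison, this is roughly how schema merging looks on the Parquet side
(the path here is hypothetical):

// Merges the schemas of all Parquet part-files under the path.
// There is no equivalent option for the ORC reader today.
val df = spark.read
  .option("mergeSchema", "true")
  .parquet("/user/hos/parquet_files_test_together")
df.printSchema()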

You can create a JIRA to request this feature, but we all know that Parquet is
the first-class citizen format in Spark.



Yong



________________________________

From: Begar, Veena <veena.be...@hpe.com>
Sent: Tuesday, February 14, 2017 10:37 AM
To: smartzjp; user@spark.apache.org
Subject: RE: How to specify default value for StructField?



Thanks, it didn't work, because the folder has files with 2 different schemas.
It fails with the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`f2`' given input 
columns: [f1];


-----Original Message-----
From: smartzjp [mailto:zjp_j...@163.com]
Sent: Tuesday, February 14, 2017 10:32 AM
To: Begar, Veena <veena.be...@hpe.com>; user@spark.apache.org
Subject: Re: How to specify default value for StructField?

You can try the code below.

val df = spark.read.format("orc").load("/user/hos/orc_files_test_together")
df.select("f1", "f2").show





On 2017/2/14, 6:54 AM, "vbegar" <user-return-67879-zjp_jdev=163....@spark.apache.org
on behalf of veena.be...@hpe.com> wrote:

>Hello,
>
>I specified a StructType like this:
>
>val mySchema = StructType(Array(StructField("f1", StringType, true),
>  StructField("f2", StringType, true)))
>
>I have many ORC files stored in the HDFS location:
>/user/hos/orc_files_test_together
>
>These files use different schemas: some of them have only the f1 column
>and others have both f1 and f2 columns.
>
>I read the data from these files into a dataframe:
>val df = spark.read.format("orc").schema(mySchema)
>  .load("/user/hos/orc_files_test_together")
>
>But now, when I run the following command to see the data, it fails:
>df.show
>
>The error message says the "f2" column doesn't exist.
>
>Since I have specified the nullable attribute as true for the f2 column,
>why does it fail?
>
>Or, is there any way to specify a default value for a StructField?
>
>In an Avro schema, we can specify the default value this way, and then we can
>read Avro files in a folder which has 2 different schemas (either only the f1
>column or both f1 and f2 columns):
>
>{
>   "type": "record",
>   "name": "myrecord",
>   "fields":
>   [
>      {
>         "name": "f1",
>         "type": "string",
>         "default": ""
>      },
>      {
>         "name": "f2",
>         "type": "string",
>         "default": ""
>      }
>   ]
>}
>
>Wondering why it doesn't work with ORC files.
>
>thanks.
>
>
>
