[jira] [Comment Edited] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file

2019-10-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951228#comment-16951228
 ] 

Dongjoon Hyun edited comment on SPARK-10848 at 10/14/19 6:23 PM:
-

Hi, [~purijatin]. If you are sure that the non-nullable schema is consistent 
with your data, there is a workaround. Please try the following in the above 
example.
{code}
scala> spark.createDataFrame(dfDFfromFile.rdd, mySchema)
res9: org.apache.spark.sql.DataFrame = [OrderID: bigint, CustomerID: int ... 5 more fields]
{code}
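
To see what the workaround actually changes, one can compare the nullability flags before and after (a sketch against the same session, reusing the {{mySchema}} and {{dfDFfromFile}} values from the reproduction below; exact res numbers will vary):
{code}
scala> // the file-based read forces every field to nullable = true
scala> dfDFfromFile.schema.fields.map(_.nullable).distinct
res10: Array[Boolean] = Array(true)

scala> // re-wrapping the underlying RDD with the intended schema restores the flags
scala> spark.createDataFrame(dfDFfromFile.rdd, mySchema).schema("OrderID").nullable
res11: Boolean = false
{code}
The usual caveat applies: createDataFrame asserts the supplied schema rather than validating it, so a null in a field declared non-nullable may only fail later at runtime.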


was (Author: dongjoon):
Hi, [~purijatin]. If you are sure the schema, there is a workaround. Please try 
the following in the above example.
{code}
scala> spark.createDataFrame(dfDFfromFile.rdd, mySchema)
res9: org.apache.spark.sql.DataFrame = [OrderID: bigint, CustomerID: int ... 5 more fields]
{code}

> Applied JSON Schema Works for json RDD but not when loading json file
> -
>
> Key: SPARK-10848
> URL: https://issues.apache.org/jira/browse/SPARK-10848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Miklos Christine
>Priority: Minor
>
> Using a defined schema to load a JSON RDD works as expected, but loading the 
> JSON records from a file does not apply the supplied schema: in particular, 
> the nullable flag is not honored. Loading from a file sets nullable=true on 
> all fields regardless of the applied schema. 
> Code to reproduce:
> {code}
> import org.apache.spark.sql.types._
> val jsonRdd = sc.parallelize(List(
>   """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
> "ProductCode": "WQT648", "Qty": 5}""",
>   """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
> "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
> "expressDelivery":true}"""))
> val mySchema = StructType(Array(
>   StructField(name="OrderID"   , dataType=LongType, nullable=false),
>   StructField("CustomerID", IntegerType, false),
>   StructField("OrderDate", DateType, false),
>   StructField("ProductCode", StringType, false),
>   StructField("Qty", IntegerType, false),
>   StructField("Discount", FloatType, true),
>   StructField("expressDelivery", BooleanType, true)
> ))
> val myDF = sqlContext.read.schema(mySchema).json(jsonRdd)
> val schema1 = myDF.printSchema
> val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json")
> val schema2 = dfDFfromFile.printSchema
> {code}
> Orders.json
> {code}
> {"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": 
> "WQT648", "Qty": 5}
> {"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", "ProductCode": 
> "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}
> {code}
> The behavior should be consistent. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file

2017-11-24 Thread Nurdin Premji (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265406#comment-16265406
 ] 

Nurdin Premji edited comment on SPARK-10848 at 11/24/17 4:26 PM:
-

I've also encountered this issue with Spark 2.2.0.

Please see this stack overflow question: 
https://stackoverflow.com/questions/47443483/how-do-i-apply-schema-with-nullable-false-to-json-reading

@Miklos are you able to re-open this ticket? I cannot.

Thank you,
-Nurdin.


was (Author: nurdin.premji):
I've also encountered this issue with spark 2.2.0.

Please see this stack overflow question: 
https://stackoverflow.com/questions/47443483/how-do-i-apply-schema-with-nullable-false-to-json-reading/47448437#47448437

@Miklos are you able to re-open this ticket? I cannot.

Thank you,
-Nurdin.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file

2017-11-22 Thread Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263867#comment-16263867
 ] 

Amit edited comment on SPARK-10848 at 11/23/17 6:22 AM:


This issue still persists in Spark 2.1.0.
I tried the steps below in Spark 2.1.0 and got the same result as described in 
the question. Please reopen the JIRA so it can be tracked.


{code:java}
import org.apache.spark.sql.types._

{code}


{code:java}
val jsonRdd = sc.parallelize(List(
  """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
"ProductCode": "WQT648", "Qty": 5}""",
  """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
"ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
"expressDelivery":true}"""))

{code}


{code:java}
val mySchema = StructType(Array(
  StructField(name="OrderID"   , dataType=LongType, nullable=false),
  StructField("CustomerID", IntegerType, false),
  StructField("OrderDate", DateType, false),
  StructField("ProductCode", StringType, false),
  StructField("Qty", IntegerType, false),
  StructField("Discount", FloatType, true),
  StructField("expressDelivery", BooleanType, true)
))

val myDF = spark.read.schema(mySchema).json(jsonRdd)
val schema1 = myDF.printSchema

val dfDFfromFile = spark.read.schema(mySchema).json("csvdatatest/Orders.json")
val schema2 = dfDFfromFile.printSchema
{code}
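
Per the original description, the mismatch shows up in the two printSchema calls above: the RDD-based read keeps the supplied nullability, while the file-based read reports nullable = true on every field. Schematically (a sketch of the expected output shape, abbreviated to the first two fields):
{code}
// schema1 (RDD source) -- supplied flags kept
root
 |-- OrderID: long (nullable = false)
 |-- CustomerID: integer (nullable = false)
 ...

// schema2 (file source) -- nullability forced to true
root
 |-- OrderID: long (nullable = true)
 |-- CustomerID: integer (nullable = true)
 ...
{code}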



was (Author: amit1990):
This issue is still persistent in Spark 2.1.0.
I tried below steps and in Spark 2.1.0, it giving the same result as in the 
question, Please reopen the JIRA to get it tracked.

import  org.apache.spark.sql.types._


{code:java}
val jsonRdd = sc.parallelize(List(
  """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
"ProductCode": "WQT648", "Qty": 5}""",
  """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
"ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
"expressDelivery":true}"""))

{code}


{code:java}
val mySchema = StructType(Array(
  StructField(name="OrderID"   , dataType=LongType, nullable=false),
  StructField("CustomerID", IntegerType, false),
  StructField("OrderDate", DateType, false),
  StructField("ProductCode", StringType, false),
  StructField("Qty", IntegerType, false),
  StructField("Discount", FloatType, true),
  StructField("expressDelivery", BooleanType, true)
))

val myDF = spark.read.schema(mySchema).json(jsonRdd)
val schema1 = myDF.printSchema

val dfDFfromFile = spark.read.schema(mySchema).json("csvdatatest/Orders.json")
val schema2 = dfDFfromFile.printSchema
{code}








[jira] [Comment Edited] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file

2017-11-22 Thread Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263867#comment-16263867
 ] 

Amit edited comment on SPARK-10848 at 11/23/17 6:21 AM:


This issue still persists in Spark 2.1.0.
I tried the steps below in Spark 2.1.0 and got the same result as described in 
the question. Please reopen the JIRA so it can be tracked.

import org.apache.spark.sql.types._


{code:java}
val jsonRdd = sc.parallelize(List(
  """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
"ProductCode": "WQT648", "Qty": 5}""",
  """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
"ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
"expressDelivery":true}"""))

{code}


{code:java}
val mySchema = StructType(Array(
  StructField(name="OrderID"   , dataType=LongType, nullable=false),
  StructField("CustomerID", IntegerType, false),
  StructField("OrderDate", DateType, false),
  StructField("ProductCode", StringType, false),
  StructField("Qty", IntegerType, false),
  StructField("Discount", FloatType, true),
  StructField("expressDelivery", BooleanType, true)
))

val myDF = spark.read.schema(mySchema).json(jsonRdd)
val schema1 = myDF.printSchema

val dfDFfromFile = spark.read.schema(mySchema).json("csvdatatest/Orders.json")
val schema2 = dfDFfromFile.printSchema
{code}



was (Author: amit1990):
This issue is still persistent in Spark 2.1.0.
I tried below steps and in Spark 2.1.0, it giving the same result as in the 
question, Please reopen the JIRA to get it tracked.

import  org.apache.spark.sql.types._

val jsonRdd = sc.parallelize(List(
  """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
"ProductCode": "WQT648", "Qty": 5}""",
  """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
"ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
"expressDelivery":true}"""))

val mySchema = StructType(Array(
  StructField(name="OrderID"   , dataType=LongType, nullable=false),
  StructField("CustomerID", IntegerType, false),
  StructField("OrderDate", DateType, false),
  StructField("ProductCode", StringType, false),
  StructField("Qty", IntegerType, false),
  StructField("Discount", FloatType, true),
  StructField("expressDelivery", BooleanType, true)
))

val myDF = spark.read.schema(mySchema).json(jsonRdd)
val schema1 = myDF.printSchema

val dfDFfromFile = spark.read.schema(mySchema).json("csvdatatest/Orders.json")
val schema2 = dfDFfromFile.printSchema



