[jira] [Commented] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column

2020-03-15 Thread Kyrill Alyoshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059823#comment-17059823
 ] 

Kyrill Alyoshin commented on SPARK-31074:
-----------------------------------------

Hmm... If you can't reproduce it, then it is likely resolved. The issues do 
sound similar.

> Avro serializer should not fail when a nullable Spark field is written to a 
> non-null Avro column
> ----------------------------------------------------------------------------
>
> Key: SPARK-31074
> URL: https://issues.apache.org/jira/browse/SPARK-31074
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Kyrill Alyoshin
>Priority: Major
>
> Spark StructType schemas are strongly biased towards _nullable_ fields. In 
> fact, this is what _Encoders.bean()_ does - any non-primitive field is 
> automatically _nullable_. When we attempt to serialize dataframes into 
> *user-supplied* Avro schemas where the corresponding fields are marked as 
> _non-null_ (i.e., they are not of _union_ type), the attempt fails with the 
> following exception:
> {code:java}
> Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string"
>   at org.apache.avro.Schema.getTypes(Schema.java:299)
>   at org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229)
>   at org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209)
> {code}
> This seems rather draconian. We should certainly be able to write a field of 
> the same type and name into a non-nullable Avro column, as long as its value 
> is not null. In fact, the problem is so *severe* that it is not clear what 
> can be done at all when the Avro schema is given to you as part of an API 
> contract (i.e., it cannot be changed).
> This is an important issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Comment Edited] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column

2020-03-11 Thread Kyrill Alyoshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057505#comment-17057505
 ] 

Kyrill Alyoshin edited comment on SPARK-31074 at 3/12/20, 1:01 AM:
-------------------------------------------------------------------

Here you go.

Avro schema:
{code:java}
{
  "type": "record",
  "namespace": "com.domain.em",
  "name": "PracticeDiff",
  "fields": [
    {
      "name": "practiceId",
      "type": "string"
    },
    {
      "name": "value",
      "type": "string"
    },
    {
      "name": "checkedValue",
      "type": "string"
    }
  ]
}
{code}

Java code:
{code:java}
package com.domain.em;

public final class PracticeDiff {

    private String practiceId;
    private String value;
    private String checkedValue;

    public String getPracticeId() {
        return practiceId;
    }

    public String getValue() {
        return value;
    }

    public String getCheckedValue() {
        return checkedValue;
    }
}
{code}
Thank you!


was (Author: kyrill007):
Here you go.

Avro schema:
{code:java}
{
  "type": "record",
  "namespace": "com.domain.em",
  "name": "PracticeDiff",
  "fields": [
    {
      "name": "practiceId",
      "type": "string"
    },
    {
      "name": "cisValue",
      "type": "string"
    },
    {
      "name": "checkedValue",
      "type": "string"
    }
  ]
}
{code}

Java code:
{code:java}
package com.domain.em;

public final class PracticeDiff {

    private String practiceId;
    private String value;
    private String checkedValue;

    public String getPracticeId() {
        return practiceId;
    }

    public String getValue() {
        return value;
    }

    public String getCheckedValue() {
        return checkedValue;
    }
}
{code}
Thank you!










[jira] [Commented] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column

2020-03-11 Thread Kyrill Alyoshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056972#comment-17056972
 ] 

Kyrill Alyoshin commented on SPARK-31074:
-----------------------------------------

Yes:
 # Create a simple Avro schema file with two properties in it, '*f1*' and 
'*f2*' - their types can be strings.
 # Create a Spark dataframe with two *nullable* fields, '*f1*' and '*f2*', of 
String type.
 # Write the dataframe out to a file using the Avro schema created in step 1, 
supplied through the '{{avroSchema}}' option (sketched below).
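Roughly, in code - a minimal, hypothetical sketch of the three steps (class 
name, data, and output path are invented; assumes Spark 2.4.x with the 
external spark-avro module on the classpath):
{code:java}
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class Spark31074Repro {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

        // Step 2: a dataframe whose f1/f2 columns are nullable strings.
        StructType schema = new StructType()
            .add("f1", DataTypes.StringType, true)
            .add("f2", DataTypes.StringType, true);
        Dataset<Row> df = spark.createDataFrame(
            Arrays.asList(RowFactory.create("a", "b")), schema);

        // Steps 1 and 3: f1/f2 are plain (non-union) strings in the Avro schema.
        String avroSchema = "{\"type\": \"record\", \"name\": \"Rec\", \"fields\": ["
            + "{\"name\": \"f1\", \"type\": \"string\"},"
            + "{\"name\": \"f2\", \"type\": \"string\"}]}";

        df.write()
            .format("avro")
            .option("avroSchema", avroSchema)
            .save("/tmp/spark-31074-repro");   // fails: Not a union: "string"
    }
}
{code}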







[jira] [Comment Edited] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column

2020-03-09 Thread Kyrill Alyoshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054859#comment-17054859
 ] 

Kyrill Alyoshin edited comment on SPARK-31074 at 3/9/20, 11:57 AM:
-------------------------------------------------------------------

The first issue was about controlling _nullability_ in the Spark schema 
generated through the bean encoder. This issue is about allowing nullable 
Spark schema fields to be written to an Avro schema where they are declared as 
_non-null_. (Of course, we assume that the Spark values will never actually be 
_null_.)

The first issue is rather narrow and applies to the Java bean encoder only. 
This issue applies to all nullable columns in a Spark schema - a column can be 
_nullable_ simply because the datasource returned it as such (without any 
encoders involved).

There is a subtle difference here, but the issues are related.
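
As a hypothetical illustration of that last point (the file path and the 
exact output are invented), a schema inferred from a plain datasource read 
comes back nullable with no encoder in the picture at all:
{code:java}
import org.apache.spark.sql.SparkSession;

public class NullableFromSource {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();
        spark.read().json("/data/practice-diffs.json").printSchema();
        // root
        //  |-- checkedValue: string (nullable = true)
        //  |-- practiceId: string (nullable = true)
        //  |-- value: string (nullable = true)
    }
}
{code}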


was (Author: kyrill007):
The first issue was about controlling _nullability_ in the Spark schema. This 
issue is about allowing nullable Spark schema fields to be written to an Avro 
schema where they are declared as _non-null_. Of course, we assume that the 
Spark values will never actually be _null_. There is a subtle difference here, 
but the issues are related.










[jira] [Commented] (SPARK-31071) Spark Encoders.bean() should allow marking non-null fields in its Spark schema

2020-03-08 Thread Kyrill Alyoshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054521#comment-17054521
 ] 

Kyrill Alyoshin commented on SPARK-31071:
-----------------------------------------

_javax.annotation.Nonnull_ seems like a good choice. You already include 
jsr305-1.3.9.jar with the Spark distribution (I am using 2.4.4), so this would 
not even add a new dependency.
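
A hypothetical sketch of how an annotated bean might look under this proposal 
(this is the *suggested* behavior, not something current Spark honors; the 
class is illustrative):
{code:java}
import javax.annotation.Nonnull;

public class PracticeDiff {

    private String practiceId;

    @Nonnull   // proposed: Encoders.bean() would map this to nullable = false
    public String getPracticeId() {
        return practiceId;
    }

    public void setPracticeId(String practiceId) {
        this.practiceId = practiceId;
    }
}
{code}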

> Spark Encoders.bean() should allow marking non-null fields in its Spark schema
> ------------------------------------------------------------------------------
>
> Key: SPARK-31071
> URL: https://issues.apache.org/jira/browse/SPARK-31071
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kyrill Alyoshin
>Priority: Major
>
> Spark's _Encoders.bean()_ method should allow the generated StructType 
> schema fields to be *non-nullable*.
> Currently, any non-primitive type is automatically _nullable_. This is 
> hard-coded in the _org.apache.spark.sql.catalyst.JavaTypeInference_ class. 
> It can lead to rather interesting situations... For example, let's say I 
> want to save a dataframe in Avro format with my own, non-Spark-generated 
> Avro schema. Let's also say that my Avro schema has a field that is non-null 
> (i.e., not a union type). Well, it appears *impossible* to store a dataframe 
> using such an Avro schema, since Spark assumes the field is nullable (as it 
> is in its own schema), which conflicts with the Avro schema's semantics and 
> throws an exception.
> I propose changing the _JavaTypeInference_ class to observe the JSR-305 
> _Nonnull_ annotation (and its children) on the provided bean class during 
> StructType schema generation. This would give bean creators much better 
> control over the resulting Spark schema.






[jira] [Created] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column

2020-03-06 Thread Kyrill Alyoshin (Jira)
Kyrill Alyoshin created SPARK-31074:
-----------------------------------

 Summary: Avro serializer should not fail when a nullable Spark 
field is written to a non-null Avro column
 Key: SPARK-31074
 URL: https://issues.apache.org/jira/browse/SPARK-31074
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Kyrill Alyoshin


Spark StructType schemas are strongly biased towards _nullable_ fields. In 
fact, this is what _Encoders.bean()_ does - any non-primitive field is 
automatically _nullable_. When we attempt to serialize dataframes into 
*user-supplied* Avro schemas where the corresponding fields are marked as 
_non-null_ (i.e., they are not of _union_ type), the attempt fails with the 
following exception:

{code:java}
Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string"
    at org.apache.avro.Schema.getTypes(Schema.java:299)
    at org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229)
    at org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209)
{code}
This seems rather draconian. We should certainly be able to write a field of 
the same type and name into a non-nullable Avro column, as long as its value 
is not null. In fact, the problem is so *severe* that it is not clear what can 
be done at all when the Avro schema is given to you as part of an API contract 
(i.e., it cannot be changed).

This is an important issue.
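
For concreteness, here is the distinction at the heart of this, sketched with 
Avro's own SchemaBuilder (record and field names are illustrative): the 
serializer currently pairs a nullable Spark field only with a _union_ Avro 
column containing "null"; a plain column triggers the "Not a union" failure 
above.
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class UnionVsNonNull {

    public static void main(String[] args) {
        // Non-null column, as a fixed API contract might mandate: "string"
        Schema contract = SchemaBuilder.record("PracticeDiff")
            .namespace("com.domain.em").fields()
            .requiredString("practiceId")
            .endRecord();

        // Union column, the only shape Spark 2.4.x accepts for a nullable
        // field: ["null", "string"]
        Schema sparkFriendly = SchemaBuilder.record("PracticeDiff")
            .namespace("com.domain.em").fields()
            .optionalString("practiceId")
            .endRecord();

        System.out.println(contract.getField("practiceId").schema().getType());      // STRING
        System.out.println(sparkFriendly.getField("practiceId").schema().getType()); // UNION
    }
}
{code}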






[jira] [Updated] (SPARK-31071) Spark Encoders.bean() should allow marking non-null fields in its Spark schema

2020-03-06 Thread Kyrill Alyoshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyrill Alyoshin updated SPARK-31071:

Summary: Spark Encoders.bean() should allow marking non-null fields in its 
Spark schema  (was: Spark Encoders.bean() should allow setting non-null fields 
in its Spark schema)







[jira] [Created] (SPARK-31071) Spark Encoders.bean() should allow setting non-null fields in its Spark schema

2020-03-06 Thread Kyrill Alyoshin (Jira)
Kyrill Alyoshin created SPARK-31071:
-----------------------------------

 Summary: Spark Encoders.bean() should allow setting non-null 
fields in its Spark schema
 Key: SPARK-31071
 URL: https://issues.apache.org/jira/browse/SPARK-31071
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Kyrill Alyoshin


Spark's _Encoders.bean()_ method should allow the generated StructType schema 
fields to be *non-nullable*.

Currently, any non-primitive type is automatically _nullable_. This is 
hard-coded in the _org.apache.spark.sql.catalyst.JavaTypeInference_ class. It 
can lead to rather interesting situations... For example, let's say I want to 
save a dataframe in Avro format with my own, non-Spark-generated Avro schema. 
Let's also say that my Avro schema has a field that is non-null (i.e., not a 
union type). Well, it appears *impossible* to store a dataframe using such an 
Avro schema, since Spark assumes the field is nullable (as it is in its own 
schema), which conflicts with the Avro schema's semantics and throws an 
exception.

I propose changing the _JavaTypeInference_ class to observe the JSR-305 
_Nonnull_ annotation (and its children) on the provided bean class during 
StructType schema generation. This would give bean creators much better 
control over the resulting Spark schema. The sketch below shows the 
nullability the encoder produces today.
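
A minimal sketch of the current behavior (bean and property names are 
invented), showing that every non-primitive property comes back nullable:
{code:java}
import org.apache.spark.sql.Encoders;

public class ShowBeanSchema {

    public static class Foo {
        private String name;
        private int count;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getCount() { return count; }
        public void setCount(int count) { this.count = count; }
    }

    public static void main(String[] args) {
        Encoders.bean(Foo.class).schema().printTreeString();
        // root
        //  |-- count: integer (nullable = false)   <- primitive int
        //  |-- name: string (nullable = true)      <- non-primitive, always nullable
    }
}
{code}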


