[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values

2015-07-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633000#comment-14633000
 ] 

Reynold Xin commented on SPARK-1649:


[~yhuai] / [~liancheng], does this ticket still apply? If not, please close it. 
Thanks.


 Figure out Nullability semantics for Array elements and Map values
 --

 Key: SPARK-1649
 URL: https://issues.apache.org/jira/browse/SPARK-1649
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Andre Schumacher
Priority: Critical

 For the underlying storage layer it would simplify things such as schema 
 conversions and predicate filter determination to record in the data type 
 itself whether a column can be nullable. So the DataType type could look 
 like this:
 abstract class DataType(nullable: Boolean = true)
 Concrete subclasses could then override the nullable val. Mostly this could 
 be left at the default, but when types are contained in nested types one 
 could optimize for, e.g., arrays whose elements are nullable versus arrays 
 whose elements are not.
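
To make the proposal above concrete, here is a minimal, self-contained Scala sketch of a nullability-carrying type hierarchy. The names below are illustrative only and are not the actual Spark SQL API:

{code:title=NullableDataTypeSketch.scala|borderStyle=solid}
// Illustrative sketch only: nullability is recorded on the type itself,
// so nested types can state whether their elements or values may be null.
abstract class DataType(val nullable: Boolean = true)

// Concrete types override the default where it makes sense.
case object NonNullIntType extends DataType(nullable = false)
case object NullableIntType extends DataType() // uses the default: nullable = true

// A nested type whose element nullability is part of the type itself.
case class ArrayTypeSketch(elementType: DataType) extends DataType()

object NullableDataTypeSketch extends App {
  val dense  = ArrayTypeSketch(NonNullIntType)   // elements can never be null
  val sparse = ArrayTypeSketch(NullableIntType)  // elements may be null
  println(dense.elementType.nullable)            // prints: false
}
{code}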





[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values

2014-07-28 Thread Robbie Russo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077115#comment-14077115
 ] 

Robbie Russo commented on SPARK-1649:
-

Thrift also supports null values in a map, and this makes any Thrift-generated 
Parquet file that contains a map unreadable by Spark SQL, due to the following 
code in parquet-thrift that generates the schema for maps:

{code:title=parquet.thrift.ThriftSchemaConverter.java|borderStyle=solid}
  @Override
  public void visit(ThriftType.MapType mapType) {
    final ThriftField mapKeyField = mapType.getKey();
    final ThriftField mapValueField = mapType.getValue();

    //save env for map
    String mapName = currentName;
    Type.Repetition mapRepetition = currentRepetition;

    //========= handle key
    currentFieldPath.push(mapKeyField);
    currentName = "key";
    currentRepetition = REQUIRED;
    mapKeyField.getType().accept(this);
    Type keyType = currentType; //currentType is the already converted type
    currentFieldPath.pop();

    //========= handle value
    currentFieldPath.push(mapValueField);
    currentName = "value";
    currentRepetition = OPTIONAL;
    mapValueField.getType().accept(this);
    Type valueType = currentType;
    currentFieldPath.pop();

    if (keyType == null && valueType == null) {
      currentType = null;
      return;
    }

    if (keyType == null && valueType != null)
      throw new ThriftProjectionException("key of map is not specified in projection: " + currentFieldPath);

    //restore Env
    currentName = mapName;
    currentRepetition = mapRepetition;
    currentType = ConversionPatterns.mapType(currentRepetition, currentName,
        keyType,
        valueType);
  }
{code}

This causes an error on the Spark side when we reach this step in the 
toDataType function, which asserts that both the key and the value have 
repetition level REQUIRED:

{code:title=org.apache.spark.sql.parquet.ParquetTypes.scala|borderStyle=solid}
    case ParquetOriginalType.MAP => {
      assert(
        !groupType.getFields.apply(0).isPrimitive,
        "Parquet Map type malformatted: expected nested group for map!")
      val keyValueGroup = groupType.getFields.apply(0).asGroupType()
      assert(
        keyValueGroup.getFieldCount == 2,
        "Parquet Map type malformatted: nested group should have 2 (key, value) fields!")
      val keyType = toDataType(keyValueGroup.getFields.apply(0))
      println("here")
      assert(keyValueGroup.getFields.apply(0).getRepetition == Repetition.REQUIRED)
      val valueType = toDataType(keyValueGroup.getFields.apply(1))
      assert(keyValueGroup.getFields.apply(1).getRepetition == Repetition.REQUIRED)
      new MapType(keyType, valueType)
    }
{code}

For now I have modified parquet-thrift to use repetition REQUIRED, just so 
Spark SQL can read the Parquet files, since we don't actually use null values 
in our maps. However, it would be preferable to use parquet-thrift and Spark 
SQL out of the box and have them work nicely together with our existing Thrift 
data types, without having to modify dependencies.
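
To illustrate the mismatch rather than propose a patch, here is a small self-contained Scala sketch with hypothetical names (these are not the actual Parquet or Spark classes): parquet-thrift emits the map value with OPTIONAL repetition, the Spark-side converter asserts REQUIRED, and a nullability-aware converter could instead record OPTIONAL as "the value may be null":

{code:title=RepetitionSketch.scala|borderStyle=solid}
// Hypothetical stand-ins, not the actual Parquet or Spark classes.
sealed trait Repetition
case object REQUIRED extends Repetition
case object OPTIONAL extends Repetition

// Minimal stand-in for the converted map value on the Spark side.
case class ConvertedMapValue(isNullable: Boolean)

object RepetitionSketch extends App {
  // Current behaviour described above: anything other than REQUIRED fails the assert.
  def strictConvert(valueRepetition: Repetition): ConvertedMapValue = {
    assert(valueRepetition == REQUIRED, "Parquet Map type malformatted!")
    ConvertedMapValue(isNullable = false)
  }

  // A nullability-aware alternative: OPTIONAL is accepted and recorded as nullable.
  def relaxedConvert(valueRepetition: Repetition): ConvertedMapValue =
    ConvertedMapValue(isNullable = valueRepetition == OPTIONAL)

  println(relaxedConvert(OPTIONAL)) // ConvertedMapValue(true)
  // strictConvert(OPTIONAL) would throw an AssertionError, matching the failure above.
}
{code}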



[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values

2014-07-28 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077127#comment-14077127
 ] 

Yin Huai commented on SPARK-1649:
-

[~rrusso2007] Can you open a JIRA for the issue of reading Parquet datasets?



[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values

2014-07-28 Thread Robbie Russo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077150#comment-14077150
 ] 

Robbie Russo commented on SPARK-1649:
-

Just opened https://issues.apache.org/jira/browse/SPARK-2721



[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values

2014-07-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059570#comment-14059570
 ] 

Yin Huai commented on SPARK-1649:
-

My PR for SPARK-2179 (https://github.com/apache/spark/pull/1346) introduces the 
containsNull field to ArrayType. For Parquet, we still do not support null 
values inside a Parquet array.

For the key and value of MapType, [~marmbrus] and I discussed it. We think it 
is not semantically clear what a null means when it appears in the key or value 
field (considering that a null is used to indicate a missing data value). So we 
decided that the key and value in a MapType should not contain any null value, 
and we will not introduce containsNull to MapType.
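
As a reference point, here is a small self-contained Scala sketch of the resulting shape of the types (simplified stand-ins, not the actual Catalyst classes): ArrayType carries a containsNull flag, while MapType does not, since its keys and values are assumed to never be null:

{code:title=ContainsNullSketch.scala|borderStyle=solid}
// Simplified stand-ins for the Catalyst data types discussed above.
sealed abstract class DataType
case object IntegerType extends DataType
case object StringType extends DataType

// ArrayType records whether its elements may be null (per the SPARK-2179 PR).
case class ArrayType(elementType: DataType, containsNull: Boolean = true) extends DataType

// MapType has no such flag here: keys and values are assumed to never be null.
case class MapType(keyType: DataType, valueType: DataType) extends DataType

object ContainsNullSketch extends App {
  val denseInts  = ArrayType(IntegerType, containsNull = false) // elements never null
  val sparseInts = ArrayType(IntegerType)                       // elements may be null
  val attributes = MapType(StringType, StringType)              // no null keys or values
  println(Seq(denseInts, sparseInts, attributes).mkString("\n"))
}
{code}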



[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values

2014-04-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984909#comment-13984909
 ] 

Michael Armbrust commented on SPARK-1649:
-

Oh, I see. I forgot that we would also need this inside of ArrayType. Also, for 
MapType it seems like it only matters for the value, not the key, as I'm not 
sure we would allow null keys.

This is something we need to consider. However, I think I'm going to change the 
title to something less prescriptive. Could we, for now, just say that null 
values are not supported in arrays in Parquet files?
