[ https://issues.apache.org/jira/browse/SPARK-13323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147851#comment-15147851 ]
Hyukjin Kwon commented on SPARK-13323:
--------------------------------------

[~davies] Yes, it's complicated, but dealing with numeric precedence is not that much work. The problem is that it can't find compatible types. Namely, if the types in the following rows differ from the types in the first row, it simply fails to infer types, unlike the CSV and JSON type inference, which do not fail in this case.

> Type cast support in type inference during merging types.
> ---------------------------------------------------------
>
>                 Key: SPARK-13323
>                 URL: https://issues.apache.org/jira/browse/SPARK-13323
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> As described in {{types.py}}, there is a todo: {{TODO: type cast (such as int -> long)}}.
> Currently, PySpark infers types but does not try to find compatible types
> when the given types differ while merging schemas.
> I think this can be done by mirroring
> {{HiveTypeCoercion.findTightestCommonTypeOfTwo}} for numbers and, when either
> of the two types is {{StringType}}, just converting them to string.
> It looks like the possible leaf data types are as below:
> {code}
> # Mapping Python types to Spark SQL DataType
> _type_mappings = {
>     type(None): NullType,
>     bool: BooleanType,
>     int: LongType,
>     float: DoubleType,
>     str: StringType,
>     bytearray: BinaryType,
>     decimal.Decimal: DecimalType,
>     datetime.date: DateType,
>     datetime.datetime: TimestampType,
>     datetime.time: TimestampType,
> }
> {code}
> and they convert to strings quite well, as below:
> {code}
> >>> print str(None)
> None
> >>> print str(True)
> True
> >>> print str(float(0.1))
> 0.1
> >>> str(bytearray([255]))
> '\xff'
> >>> str(decimal.Decimal())
> '0'
> >>> str(datetime.date(1,1,1))
> '0001-01-01'
> >>> str(datetime.datetime(1,1,1))
> '0001-01-01 00:00:00'
> >>> str(datetime.time(1,1,1))
> '01:01:01'
> {code}
> I tried to find an existing issue for this but couldn't.
> Please mark this as a duplicate if one already exists.
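
To make the proposal in the issue description concrete, here is a minimal sketch of how two inferred leaf types might be merged: widen along a numeric precedence list, and fall back to a string type when either side is a string. This is not PySpark's implementation. The function name `merge_leaf_types`, the `NUMERIC_PRECEDENCE` list, and the use of plain strings in place of the Spark SQL type classes are all assumptions made for illustration, and the precedence order is only assumed to resemble the one used by {{HiveTypeCoercion}}. Decimal handling is left out on purpose.

```python
# Hypothetical stand-ins for Spark SQL type names; plain strings keep the
# sketch runnable without PySpark installed.
NUMERIC_PRECEDENCE = [
    "ByteType", "ShortType", "IntegerType",
    "LongType", "FloatType", "DoubleType",
]


def merge_leaf_types(a, b):
    """Return a type name compatible with both a and b (illustrative only)."""
    if a == b:
        return a
    # NullType carries no information, so the other side wins.
    if a == "NullType":
        return b
    if b == "NullType":
        return a
    # Two numeric types: widen to whichever comes later in the precedence list.
    if a in NUMERIC_PRECEDENCE and b in NUMERIC_PRECEDENCE:
        return max(a, b, key=NUMERIC_PRECEDENCE.index)
    # If either side is a string, fall back to StringType.
    if "StringType" in (a, b):
        return "StringType"
    # Anything else is still treated as incompatible in this sketch.
    raise TypeError("Can not merge type %s and %s" % (a, b))
```

With this sketch, a column whose first row was inferred as `LongType` and whose later row was inferred as `DoubleType` would merge to `DoubleType` instead of failing, and any type paired with `StringType` would merge to `StringType`. Pairs such as `BooleanType` and `DateType` still raise, since the proposal only covers numbers and strings.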