[ https://issues.apache.org/jira/browse/SPARK-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Davies Liu updated SPARK-6917:
------------------------------
    Assignee: Yin Huai  (was: Davies Liu)

> Broken data returned to PySpark dataframe if any large numbers used in Scala land
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-6917
>                 URL: https://issues.apache.org/jira/browse/SPARK-6917
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.3.0
>         Environment: Spark 1.3, Python 2.7.6, Scala 2.10
>            Reporter: Harry Brundage
>            Assignee: Yin Huai
>            Priority: Critical
>         Attachments: part-r-00001.parquet
>
>
> When trying to access data stored in a Parquet file with an INT96 column
> (read: TimestampType() encoded for Impala), if the INT96 column is included
> in the fetched data, other, smaller numeric types come back broken.
> {code}
> In [1]: sql.parquetFile("/Users/hornairs/Downloads/part-r-00001.parquet").select('int_col', 'long_col').first()
> Out[1]: Row(int_col=Decimal('1'), long_col=Decimal('10'))
> In [2]: sql.parquetFile("/Users/hornairs/Downloads/part-r-00001.parquet").first()
> Out[2]: Row(long_col={u'__class__': u'scala.runtime.BoxedUnit'}, str_col=u'Hello!', int_col={u'__class__': u'scala.runtime.BoxedUnit'}, date_col=datetime.datetime(1, 12, 31, 19, 0, tzinfo=<DstTzInfo 'America/Toronto' EDT-1 day, 19:00:00 DST>))
> {code}
> Note the {{\{u'__class__': u'scala.runtime.BoxedUnit'}}} values being
> returned for the {{int_col}} and {{long_col}} columns in the second loop
> above. This only happens if I select the {{date_col}} which is stored as
> {{INT96}}.
> I don't know much about Scala boxing, but I assume that somehow by including
> numeric columns that are bigger than a machine word I trigger some different,
> slower execution path somewhere that boxes stuff and causes this problem.
> If anyone could give me any pointers on where to get started fixing this I'd
> be happy to dive in!
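
For anyone trying to confirm whether they are hitting the same symptom, here is a minimal sketch of a check for the placeholder values shown in Out[2]. It assumes the row has already been turned into a plain dict (for example with PySpark's `Row.asDict()`); the helper name `find_boxed_unit_fields` is hypothetical and not part of Spark. It only detects the broken values, it does not fix them.

```python
# Hypothetical helper (not part of Spark): report which fields of a row
# came back as the BoxedUnit placeholder instead of a real value.
BOXED_UNIT = 'scala.runtime.BoxedUnit'


def find_boxed_unit_fields(row_dict):
    """Return the sorted names of fields whose value is a dict of the form
    {'__class__': 'scala.runtime.BoxedUnit'}, as seen in the report."""
    broken = []
    for name, value in row_dict.items():
        if isinstance(value, dict) and value.get('__class__') == BOXED_UNIT:
            broken.append(name)
    return sorted(broken)


# Values copied from Out[2] in the report (date_col omitted for brevity).
example_row = {
    'long_col': {'__class__': 'scala.runtime.BoxedUnit'},
    'str_col': 'Hello!',
    'int_col': {'__class__': 'scala.runtime.BoxedUnit'},
}
print(find_boxed_unit_fields(example_row))  # ['int_col', 'long_col']
```

With a live DataFrame this would be called as `find_boxed_unit_fields(df.first().asDict())`, where `df` is whatever DataFrame was loaded from the Parquet file.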