Harry Brundage created SPARK-6917:
-------------------------------------

             Summary: Broken data returned to PySpark dataframe if any large numbers used in Scala land
                 Key: SPARK-6917
                 URL: https://issues.apache.org/jira/browse/SPARK-6917
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 1.3.0
         Environment: Spark 1.3, Python 2.7.6, Scala 2.10
            Reporter: Harry Brundage
When trying to access data stored in a Parquet file with an INT96 column (read: TimestampType() encoded for Impala), if the INT96 column is included in the fetched data, other, smaller numeric columns come back broken:

{code}
In [1]: sql.parquetFile("/Users/hornairs/Downloads/part-r-00001.parquet").select('int_col', 'long_col').first()
Out[1]: Row(int_col=Decimal('1'), long_col=Decimal('10'))

In [2]: sql.parquetFile("/Users/hornairs/Downloads/part-r-00001.parquet").first()
Out[2]: Row(long_col={u'__class__': u'scala.runtime.BoxedUnit'}, str_col=u'Hello!', int_col={u'__class__': u'scala.runtime.BoxedUnit'}, date_col=datetime.datetime(1, 12, 31, 19, 0, tzinfo=<DstTzInfo 'America/Toronto' EDT-1 day, 19:00:00 DST>))
{code}

Note the {{\{u'__class__': u'scala.runtime.BoxedUnit'}}} values returned for the {{int_col}} and {{long_col}} columns in the second query above. This only happens when I select {{date_col}}, which is stored as {{INT96}}. I don't know much about Scala boxing, but I assume that including a numeric column wider than a machine word triggers a different, slower execution path that boxes values and causes this problem.

If anyone could give me pointers on where to get started fixing this, I'd be happy to dive in!
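For anyone who wants to reproduce this outside an interactive session, here is a minimal standalone sketch of the steps above. It assumes Spark 1.3 and a Parquet file with the same schema as mine ({{int_col}}/{{long_col}} as plain numeric columns plus {{date_col}} stored as {{INT96}}); the file path and app name are placeholders.

{code}
# Minimal reproduction sketch for SPARK-6917 (Spark 1.3, Python 2.7).
# Assumes a Parquet file with an INT96 timestamp column ("date_col")
# alongside smaller numeric columns ("int_col", "long_col").
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="spark-6917-repro")
sql = SQLContext(sc)

df = sql.parquetFile("/path/to/part-r-00001.parquet")  # placeholder path

# Without the INT96 column, the numeric values come back intact.
print(df.select('int_col', 'long_col').first())

# With the INT96 column included, the same numeric columns arrive in
# Python as {u'__class__': u'scala.runtime.BoxedUnit'} dictionaries.
print(df.first())

sc.stop()
{code}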