[ https://issues.apache.org/jira/browse/SPARK-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546160#comment-14546160 ]
Davies Liu edited comment on SPARK-6917 at 5/15/15 8:58 PM:
------------------------------------------------------------

[~yhuai] It's a bug in SQL or the Parquet library:

{code}
scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-00001.parquet")
res1: org.apache.spark.sql.DataFrame = [long_col: decimal(18,0), str_col: string, int_col: decimal(18,0), date_col: timestamp]

scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-00001.parquet").first()
res2: org.apache.spark.sql.Row = [(),Hello!,(),0001-12-31 16:00:00.0]

scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-00001.parquet").select("long_col").first()
res3: org.apache.spark.sql.Row = [10]

scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-00001.parquet").select("long_col", "date_col").first()
res4: org.apache.spark.sql.Row = [(),0001-12-31 16:00:00.0]

scala> sqlContext.parquetFile("/Users/davies/Downloads/part-r-00001.parquet").select("date_col").first()
res5: org.apache.spark.sql.Row = [0001-12-31 16:00:00.0]
{code}

> Broken data returned to PySpark dataframe if any large numbers used in Scala land
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-6917
>                 URL: https://issues.apache.org/jira/browse/SPARK-6917
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.3.0
>        Environment: Spark 1.3, Python 2.7.6, Scala 2.10
>           Reporter: Harry Brundage
>           Assignee: Davies Liu
>           Priority: Critical
>        Attachments: part-r-00001.parquet
>
> When trying to access data stored in a Parquet file with an INT96 column (read: TimestampType() encoded for Impala), if the INT96 column is included in the fetched data, other, smaller numeric types come back broken.
> {code}
> In [1]: sql.parquetFile("/Users/hornairs/Downloads/part-r-00001.parquet").select('int_col', 'long_col').first()
> Out[1]: Row(int_col=Decimal('1'), long_col=Decimal('10'))
>
> In [2]: sql.parquetFile("/Users/hornairs/Downloads/part-r-00001.parquet").first()
> Out[2]: Row(long_col={u'__class__': u'scala.runtime.BoxedUnit'}, str_col=u'Hello!', int_col={u'__class__': u'scala.runtime.BoxedUnit'}, date_col=datetime.datetime(1, 12, 31, 19, 0, tzinfo=<DstTzInfo 'America/Toronto' EDT-1 day, 19:00:00 DST>))
> {code}
>
> Note the {{\{u'__class__': u'scala.runtime.BoxedUnit'}}} values returned for the {{int_col}} and {{long_col}} columns in the second call above. This only happens if I select the {{date_col}}, which is stored as {{INT96}}.
>
> I don't know much about Scala boxing, but I assume that by including numeric columns bigger than a machine word I trigger some different, slower execution path that boxes values and causes this problem.
>
> If anyone could give me any pointers on where to get started fixing this, I'd be happy to dive in!
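For background on the INT96 encoding the report mentions: Impala-style Parquet timestamps are 12 little-endian bytes, packing nanoseconds-of-day (8 bytes) followed by a Julian day number (4 bytes). The sketch below is a minimal standalone decoder to illustrate that layout; the helper name `decode_int96_timestamp` is hypothetical and is not Spark's or Parquet's actual reader code.

```python
import struct
from datetime import datetime, timedelta, timezone

# Julian day number corresponding to the Unix epoch, 1970-01-01.
JULIAN_EPOCH = 2440588

def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a 12-byte Impala-style Parquet INT96 timestamp.

    Layout (little-endian): 8 bytes of nanoseconds since midnight,
    then a 4-byte Julian day number.
    """
    nanos_of_day, julian_day = struct.unpack("<qI", raw)
    days_since_epoch = julian_day - JULIAN_EPOCH
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=days_since_epoch,
                        microseconds=nanos_of_day // 1000))

# Example: Julian day 2440588 with 0 nanoseconds is the Unix epoch.
raw = struct.pack("<qI", 0, JULIAN_EPOCH)
print(decode_int96_timestamp(raw))  # 1970-01-01 00:00:00+00:00
```

The very old dates in the transcripts above (year 1 AD) suggest the reader is misinterpreting these 12 bytes rather than the boxing itself corrupting the timestamp.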