[ https://issues.apache.org/jira/browse/SPARK-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xin Wu updated SPARK-11657:
---------------------------
    Comment: was deleted

(was: I tried this sample data in local file mode, and it seems to work for me. Have you tried it this way?
{code}
scala> val data = sqlContext.read.parquet("/root/sample")
[Stage 0:> (0 + 8) / 8]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: string, clusterData: array<string>, dpid: int]

scala> data.take(1)
15/11/11 08:26:29 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
res0: Array[org.apache.spark.sql.Row] = Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])

scala> data.collect()
15/11/11 08:26:53 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
res1: Array[org.apache.spark.sql.Row] = Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])

scala> data.show(false)
15/11/11 08:26:57 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+-----------+------------------------+----------------------------------------+----+
|clusterSize|clusterName             |clusterData                             |dpid|
+-----------+------------------------+----------------------------------------+----+
|1          |6A01CACD56169A947F000101|[77512098164594606101815510825479776971]|813 |
+-----------+------------------------+----------------------------------------+----+
{code})


> Bad Dataframe data read from parquet
> ------------------------------------
>
>                 Key: SPARK-11657
>                 URL: https://issues.apache.org/jira/browse/SPARK-11657
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.5.1, 1.5.2
>         Environment: EMR (yarn)
>            Reporter: Virgil Palanciuc
>            Priority: Critical
>         Attachments: sample.tgz
>
>
> I get strange behaviour when reading parquet data:
> {code}
> scala> val data = sqlContext.read.parquet("hdfs:///sample")
> data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: string, clusterData: array<string>, dpid: int]
>
> scala> data.take(1)  /// this returns garbage
> res0: Array[org.apache.spark.sql.Row] = Array([1,56169A947F000101????????,WrappedArray(164594606101815510825479776971????????),813])
>
> scala> data.collect()  /// this works
> res1: Array[org.apache.spark.sql.Row] = Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
> {code}
> I've attached the "hdfs:///sample" directory to this bug report.
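For anyone trying to reproduce this against the attached data, the snippet below compares the row returned by take(1) with the one from collect(), and tries a defensive-copy workaround. This is only a sketch: it assumes the garbage comes from take() handing back rows that still alias a reused read buffer (which collect() apparently avoids), and the Row.copy()-based workaround is an untested suggestion, not a confirmed fix.
{code}
// Paste into a Spark 1.5.x spark-shell (sqlContext is predefined there).
val data = sqlContext.read.parquet("hdfs:///sample")

// Compare the first row from each code path; on an affected build the
// reporter sees take(1) return corrupted values while collect() is correct.
val viaTake    = data.take(1).head
val viaCollect = data.collect().head
println(viaTake == viaCollect)   // false if the bug reproduces

// Assumed workaround: drop to the RDD and copy each Row before collecting,
// so no returned row can alias a buffer that the reader reuses.
val copied = data.rdd.map(_.copy()).take(1).head
println(copied == viaCollect)    // expected true if buffer reuse is the cause
{code}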