[ https://issues.apache.org/jira/browse/SPARK-20185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953183#comment-15953183 ]
Hyukjin Kwon edited comment on SPARK-20185 at 4/3/17 9:28 AM:
--------------------------------------------------------------

{{codec}} or {{compression}} is an option for writing out, as documented. The workaround does not seem difficult, and Hadoop's behaviour looks sensible to me as well.

was (Author: hyukjin.kwon):

{{codec}} or {{compression}} is an option for writing out, as documented. The workaround does not seem difficult, and the behaviour looks reasonable to me as well.

> CSV decompressed incorrectly with an extension other than 'gz'
> --------------------------------------------------------------
>
>                 Key: SPARK-20185
>                 URL: https://issues.apache.org/jira/browse/SPARK-20185
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>            Reporter: Ran Mingxuan
>            Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> With the code below:
>
>     val start_time = System.currentTimeMillis()
>     val gzFile = spark.read
>       .format("com.databricks.spark.csv")
>       .option("header", "false")
>       .option("inferSchema", "false")
>       .option("codec", "gzip")
>       .load("/foo/someCsvFile.gz.bak")
>     gzFile.repartition(1).write.mode("overwrite").parquet("/foo/")
>
> I got an error even though I specified the codec:
>
> WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 17/03/23 15:44:55 WARN ipc.Client: Exception encountered while connecting to the server:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby.
> Visit https://s.apache.org/sbnn-error
> 17/03/23 15:44:58 ERROR executor.Executor: Exception in task 2.0 in stage 12.0 (TID 977)
> java.lang.NullPointerException
>         at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:109)
>         at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
>         at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
>         at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
>
> I had to extend GzipCodec with the extra extension to make my code run:
>
>     import org.apache.hadoop.io.compress.GzipCodec
>
>     class BakGzipCodec extends GzipCodec {
>       override def getDefaultExtension(): String = ".gz.bak"
>     }
>
> I suppose the file loader should pick the codec based on the option first, and only then fall back to the file extension.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
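[Editor's note] Hadoop's CompressionCodecFactory selects a codec by matching the file name against each registered codec's default extension, so the reporter's subclass only takes effect once it is registered via the `io.compression.codecs` Hadoop property. A minimal, untested sketch of the full workaround (the class name `BakGzipCodec` and the SparkSession value `spark` are illustrative, not from Spark itself):

```scala
import org.apache.hadoop.io.compress.GzipCodec

// Gzip codec that claims the ".gz.bak" suffix, so Hadoop's
// CompressionCodecFactory will match files named *.gz.bak.
class BakGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".gz.bak"
}

// Register the custom codec ahead of the built-ins before reading
// (the class must be on the executors' classpath):
spark.sparkContext.hadoopConfiguration.set(
  "io.compression.codecs",
  "BakGzipCodec," +
    "org.apache.hadoop.io.compress.GzipCodec," +
    "org.apache.hadoop.io.compress.DefaultCodec")

val gzFile = spark.read
  .option("header", "false")
  .csv("/foo/someCsvFile.gz.bak")
```

This is a configuration-level workaround; as the comment notes, Spark's `codec`/`compression` option applies to writes, so on the read path the extension-based lookup is what decides decompression.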