[ https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jianshi Huang updated SPARK-6533:
---------------------------------
Description:
If spark.sql.parquet.useDataSourceApi is not set to false (which is the default), loading Parquet files with a file pattern throws errors.

*\*Wildcard*
{noformat}
scala> val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0*")
15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
java.io.FileNotFoundException: File does not exist: hdfs://.../source=live/date=2014-06-0*
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
	at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
	at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
	at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:388)
	at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
{noformat}

And *\[abc\]*
{noformat}
val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0[12]")
java.lang.IllegalArgumentException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12]
	at java.net.URI.create(URI.java:859)
	at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268)
	at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
	at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:388)
	at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
	... 49 elided
Caused by: java.net.URISyntaxException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12]
	at java.net.URI$Parser.fail(URI.java:2829)
	at java.net.URI$Parser.checkChars(URI.java:3002)
	at java.net.URI$Parser.parseHierarchical(URI.java:3086)
	at java.net.URI$Parser.parse(URI.java:3034)
	at java.net.URI.<init>(URI.java:595)
	at java.net.URI.create(URI.java:857)
{noformat}

If spark.sql.parquet.useDataSourceApi is not enabled, we lose partition discovery, schema evolution, etc., but being able to specify a file pattern is also very important to applications. Please add this important feature.

Jianshi

Issue Type: Improvement (was: Bug)
Summary: Allow using wildcard and other file pattern in Parquet DataSource (was: Cannot use wildcard and other file pattern in sqlContext.parquetFile if spark.sql.parquet.useDataSourceApi is not set to false)

> Allow using wildcard and other file pattern in Parquet DataSource
> -----------------------------------------------------------------
>
>                 Key: SPARK-6533
>                 URL: https://issues.apache.org/jira/browse/SPARK-6533
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0, 1.3.1
>            Reporter: Jianshi Huang
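Until the data source API resolves patterns itself, one possible workaround is to expand the glob on the client with Hadoop's FileSystem API and pass the resulting concrete paths to parquetFile, which is variadic in Spark 1.3. This is only a sketch: it assumes a spark-shell session where `sc` and `sqlContext` are in scope, and the `hdfs://namenode/warehouse/...` path below is a placeholder, not the reporter's actual location.

```scala
import org.apache.hadoop.fs.Path

// Hadoop's globStatus already understands *, ?, and [..] patterns, so the
// pattern can be expanded client-side before Spark sees any path. The
// pattern string here is a hypothetical example.
val pattern = new Path("hdfs://namenode/warehouse/source=live/date=2014-06-0*")
val fs = pattern.getFileSystem(sc.hadoopConfiguration)
val matched = fs.globStatus(pattern).map(_.getPath.toString)

// Splat the expanded list into the varargs overload of parquetFile.
val qp = sqlContext.parquetFile(matched: _*)
```

Because globStatus goes through Hadoop's glob parser rather than java.net.URI, this avoids both failure modes above: the wildcard is never treated as a literal file name, and the brackets never reach the URI parser.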
--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org