[ https://issues.apache.org/jira/browse/SPARK-11544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995476#comment-14995476 ]
Dilip Biswal edited comment on SPARK-11544 at 11/8/15 4:52 AM:
---------------------------------------------------------------
I would like to work on this issue.

was (Author: dkbiswal):
I am looking into this issue.

> sqlContext doesn't use PathFilter
> ---------------------------------
>
>                 Key: SPARK-11544
>                 URL: https://issues.apache.org/jira/browse/SPARK-11544
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>         Environment: AWS EMR 4.1.0, Spark 1.5.0
>            Reporter: Frank Dai
>
> When sqlContext reads JSON files, it doesn't use the {{PathFilter}} configured on the
> underlying SparkContext:
> {code:java}
> val sc = new SparkContext(conf)
> sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
>   classOf[TmpFileFilter], classOf[PathFilter])
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> {code}
> The definition of {{TmpFileFilter}} is:
> {code:title=TmpFileFilter.scala|borderStyle=solid}
> import org.apache.hadoop.fs.{Path, PathFilter}
>
> class TmpFileFilter extends PathFilter {
>   override def accept(path: Path): Boolean = !path.getName.endsWith(".tmp")
> }
> {code}
> When using {{sqlContext}} to read JSON files, e.g.
> {{sqlContext.read.schema(mySchema).json(s3Path)}}, Spark throws an
> exception:
> {quote}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> s3://chef-logstash-access-backup/2015/10/21/00/logstash-172.18.68.59-s3.1445388158944.gz.tmp
> {quote}
> It seems {{sqlContext}} can see {{.tmp}} files while {{sc}} cannot, which
> causes the above exception.
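Until the bug is fixed, one possible workaround is to apply the filter manually: list the input files with the Hadoop {{FileSystem}} API, drop the {{.tmp}} files with the same {{PathFilter}}, and hand only the surviving paths to {{sqlContext}}. This is only a sketch, not a confirmed fix; the names {{sc}}, {{sqlContext}}, {{mySchema}}, {{s3Path}}, and {{TmpFileFilter}} are taken from the report above, and the listing step is an assumption about the bucket layout (a flat directory of files, no recursion).

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the FileSystem for the S3 path using the existing Hadoop config.
val fs = FileSystem.get(new java.net.URI(s3Path), sc.hadoopConfiguration)

// List the directory, applying the same PathFilter that sqlContext ignores;
// FileSystem.listStatus(Path, PathFilter) does the filtering server-side.
val jsonPaths = fs
  .listStatus(new Path(s3Path), new TmpFileFilter())
  .map(_.getPath.toString)

// textFile accepts a comma-separated list of paths, and the filtered lines
// can then be parsed with the schema from the report.
val lines = sc.textFile(jsonPaths.mkString(","))
val df = sqlContext.read.schema(mySchema).json(lines)
```

Since {{sc.textFile}} goes through the old Hadoop input path, it also honours the configured {{pathFilter.class}}, so the manual filtering is belt-and-braces here.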