[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13531

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13531#discussion_r66539899

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala ---
    @@ -340,6 +340,40 @@ class FileSourceStrategySuite extends QueryTest with SharedSQLContext with Predi
         }
       }

    +  test("SPARK-15654 do not split non-splittable files") {
    --- End diff --

    Updated
[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13531#discussion_r66538048

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala ---
    @@ -340,6 +340,40 @@ class FileSourceStrategySuite extends QueryTest with SharedSQLContext with Predi
         }
       }

    +  test("SPARK-15654 do not split non-splittable files") {
    --- End diff --

    Should we also test bin-packing of the gzipped file?
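To make the bin-packing question concrete, here is a minimal, self-contained sketch of how a planner might cut files into chunks and pack them into partitions while keeping non-splittable files whole. It is a simplified model, not the code from this PR: the types `InputFile` and `Chunk`, the object `BinPackingSketch`, and the parameters `maxSplitBytes` and `openCostInBytes` are names assumed for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical model types for this sketch only.
final case class InputFile(path: String, length: Long, splittable: Boolean)
final case class Chunk(path: String, start: Long, length: Long)

object BinPackingSketch {
  // Cuts splittable files into chunks of at most maxSplitBytes, keeps
  // non-splittable files as a single chunk, then packs chunks into partitions.
  def plan(
      files: Seq[InputFile],
      maxSplitBytes: Long,
      openCostInBytes: Long): Seq[Seq[Chunk]] = {
    require(maxSplitBytes > 0, "maxSplitBytes must be positive")

    val chunks = files.flatMap { f =>
      if (f.splittable) {
        (0L until f.length by maxSplitBytes).map { offset =>
          Chunk(f.path, offset, math.min(maxSplitBytes, f.length - offset))
        }
      } else {
        // A non-splittable file (for example gzip) must be read from offset 0
        // to its end by a single task, however large it is.
        Seq(Chunk(f.path, 0L, f.length))
      }
    }.sortBy(_.length)(Ordering[Long].reverse) // largest chunks first

    val partitions = ArrayBuffer.empty[Seq[Chunk]]
    val current = ArrayBuffer.empty[Chunk]
    var currentSize = 0L

    // Emits the partition being built (if any) and starts a new one.
    def closePartition(): Unit = {
      if (current.nonEmpty) partitions += current.toList
      current.clear()
      currentSize = 0L
    }

    chunks.foreach { c =>
      // Start a new partition when this chunk would overflow the current one.
      if (currentSize + c.length > maxSplitBytes) closePartition()
      // Each chunk also pays a fixed, estimated cost for opening the file.
      currentSize += c.length + openCostInBytes
      current += c
    }
    closePartition()
    partitions.toList
  }
}
```

A regression test along these lines would check two things: a large gzipped file shows up as exactly one chunk covering its full length, and several small gzipped files can still share a partition.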
[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13531#discussion_r66533473

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -298,6 +309,28 @@ trait FileFormat {
     }

     /**
    + * The base class file format that is based on text file.
    + */
    +abstract class TextBasedFileFormat extends FileFormat {
    +  private var codecFactory: CompressionCodecFactory = null
    +  override def isSplitable(
    +      sparkSession: SparkSession,
    +      options: Map[String, String],
    +      path: Path): Boolean = {
    +    if (codecFactory == null) {
    +      synchronized {
    --- End diff --

    I am not sure whether we need "synchronized" here. Do we want to ensure that FileFormat is thread safe?
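For context on the thread-safety question: on the JVM, the double-checked locking pattern shown in the diff is only safe when the lazily initialized field is volatile, and the snippet under review declares `codecFactory` as a plain `var`. The sketch below contrasts two ways of getting one-time, thread-safe initialization in Scala. It is an illustration under assumptions, not the code from this PR: `ExpensiveFactory` is a hypothetical stand-in for the codec factory, and `LazyInitSketch` and its members are made-up names.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for an object that is expensive to build.
final class ExpensiveFactory(val id: Int)

object LazyInitSketch {
  // Counts constructions so a caller can see how often the factory is built.
  val constructions = new AtomicInteger(0)

  // Variant 1: explicit double-checked locking. @volatile guarantees that a
  // thread observing a non-null reference also sees a fully built object.
  final class DoubleChecked {
    @volatile private var factory: ExpensiveFactory = null

    def get(): ExpensiveFactory = {
      var local = factory
      if (local == null) {
        synchronized {
          local = factory
          if (local == null) {
            local = new ExpensiveFactory(constructions.incrementAndGet())
            factory = local
          }
        }
      }
      local
    }
  }

  // Variant 2: a lazy val, which Scala initializes at most once and in a
  // thread-safe way without hand-written locking.
  final class WithLazyVal {
    lazy val factory: ExpensiveFactory =
      new ExpensiveFactory(constructions.incrementAndGet())
  }

  // Calls `get` from several threads and returns the distinct instances seen.
  def distinctInstances(get: () => ExpensiveFactory, threads: Int): Set[ExpensiveFactory] = {
    val results = new Array[ExpensiveFactory](threads)
    val workers = (0 until threads).map { i =>
      new Thread(() => results(i) = get())
    }
    workers.foreach(_.start())
    workers.foreach(_.join()) // join makes each thread's write visible here
    results.toSet
  }
}
```

If the object is cheap to build and building it twice is harmless, dropping the lock entirely is also a defensible choice; the two variants above matter when exactly-once construction is required.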
[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13531#discussion_r66478617

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -298,6 +309,28 @@ trait FileFormat {
     }

     /**
    + * The base class file format that is based on text file.
    + */
    +abstract class TextBasedFileFormat extends FileFormat {
    +  private var codecFactory: CompressionCodecFactory = null
    +  override def isSplitable(
    +      sparkSession: SparkSession,
    +      options: Map[String, String],
    +      path: Path): Boolean = {
    +    if (codecFactory == null) {
    +      synchronized {
    +        if (codecFactory == null) {
    +          codecFactory = new CompressionCodecFactory(
    --- End diff --

    There could be "io.compression.codecs" in options, and it is used by getCodec(). Since all other APIs take options, it's better to have them here too.
[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13531#discussion_r66451629

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -298,6 +309,28 @@ trait FileFormat {
     }

     /**
    + * The base class file format that is based on text file.
    + */
    +abstract class TextBasedFileFormat extends FileFormat {
    +  private var codecFactory: CompressionCodecFactory = null
    +  override def isSplitable(
    +      sparkSession: SparkSession,
    +      options: Map[String, String],
    +      path: Path): Boolean = {
    +    if (codecFactory == null) {
    +      synchronized {
    +        if (codecFactory == null) {
    +          codecFactory = new CompressionCodecFactory(
    --- End diff --

    Sorry for the confusion. It seems we can use `sparkSession.sessionState.newHadoopConf()` instead of `sparkSession.sessionState.newHadoopConfWithOptions(options)` (I checked that the `FileSourceStrategySuite` test passes without passing `options` to `CompressionCodecFactory`). So, do we need `options` in the arguments of `isSplitable`?
    ```
    if (codecFactory == null) {
      codecFactory = new CompressionCodecFactory(sparkSession.sessionState.newHadoopConf())
    }
    ```
[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13531#discussion_r66383788

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -298,6 +309,28 @@ trait FileFormat {
     }

     /**
    + * The base class file format that is based on text file.
    + */
    +abstract class TextBasedFileFormat extends FileFormat {
    +  private var codecFactory: CompressionCodecFactory = null
    +  override def isSplitable(
    +      sparkSession: SparkSession,
    +      options: Map[String, String],
    +      path: Path): Boolean = {
    +    if (codecFactory == null) {
    +      synchronized {
    +        if (codecFactory == null) {
    +          codecFactory = new CompressionCodecFactory(
    --- End diff --

    Sorry, I do not understand your question or suggestion.
[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13531#discussion_r66016111

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -298,6 +309,28 @@ trait FileFormat {
     }

     /**
    + * The base class file format that is based on text file.
    + */
    +abstract class TextBasedFileFormat extends FileFormat {
    +  private var codecFactory: CompressionCodecFactory = null
    +  override def isSplitable(
    +      sparkSession: SparkSession,
    +      options: Map[String, String],
    +      path: Path): Boolean = {
    +    if (codecFactory == null) {
    +      synchronized {
    +        if (codecFactory == null) {
    +          codecFactory = new CompressionCodecFactory(
    --- End diff --

    Do we need to pass `options` into `CompressionCodecFactory` here? It seems `getCodec` only checks the suffix of the input path.
    https://github.com/apache/hadoop-common/blob/HADOOP-3628/src/core/org/apache/hadoop/io/compress/CompressionCodecFactory.java#L153
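The suffix-based lookup mentioned above can be modelled in a few lines. The sketch below is a self-contained approximation, not Hadoop's or Spark's actual classes: `Codec`, `SplitCheckSketch`, and the fixed two-entry registry are assumptions made for illustration. In a real deployment the set of registered codecs comes from configuration, which is why the choice of configuration passed to the factory can matter.

```scala
// Hypothetical, simplified model of a compression codec: a file suffix plus a
// flag saying whether a compressed stream can be read from an arbitrary offset.
final case class Codec(name: String, suffix: String, splittable: Boolean)

object SplitCheckSketch {
  // Illustrative registry. gzip streams cannot be split; bzip2 streams can.
  val codecs: Seq[Codec] = Seq(
    Codec("gzip", ".gz", splittable = false),
    Codec("bzip2", ".bz2", splittable = true)
  )

  // Picks the codec whose suffix matches the end of the file name, if any.
  def codecFor(path: String): Option[Codec] =
    codecs.find(c => path.endsWith(c.suffix))

  // A file may be split when it is uncompressed or its codec is splittable.
  def isSplitable(path: String): Boolean =
    codecFor(path).forall(_.splittable)
}
```

With this model, `SplitCheckSketch.isSplitable("part-0.csv.gz")` is false, while an uncompressed or bzip2-compressed path is reported as splittable.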
[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...
GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/13531

    [SPARK-15654] [SQL] fix non-splitable files for text based file formats

    ## What changes were proposed in this pull request?

    This PR is based on #13442 and fixes the bug with non-splittable files for all text-based file formats.

    Closes #13442

    ## How was this patch tested?

    Added regression tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark fix_split

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13531.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13531

commit 975a1423e6650dae8eb9bf5f22b986ade39702a4
Author: Davies Liu
Date:   2016-06-06T19:13:07Z

    fix non-splitable files for text based file formats

commit dad01936a4e891da8a3dfafd1a7e06d5952b52c2
Author: Davies Liu
Date:   2016-06-06T20:14:50Z

    add regression test