[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery
[ https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325749#comment-15325749 ]

Yin Huai commented on SPARK-13207:
----------------------------------

Created https://issues.apache.org/jira/browse/SPARK-15895

> _SUCCESS should not break partition discovery
> ---------------------------------------------
>
>                 Key: SPARK-13207
>                 URL: https://issues.apache.org/jira/browse/SPARK-13207
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>              Labels: backport-needed
>             Fix For: 1.6.2, 2.0.0
>
> Partition discovery will fail in the following case:
> {code}
> test("_SUCCESS should not break partitioning discovery") {
>   withTempPath { dir =>
>     val tablePath = new File(dir, "table")
>     val df = (1 to 3).map(i => (i, i, i, i)).toDF("a", "b", "c", "d")
>     df.write
>       .format("parquet")
>       .partitionBy("b", "c", "d")
>       .save(tablePath.getCanonicalPath)
>     Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1", "_SUCCESS"))
>     Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1/c=1", "_SUCCESS"))
>     Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1/c=1/d=1", "_SUCCESS"))
>     checkAnswer(sqlContext.read.format("parquet").load(tablePath.getCanonicalPath), df)
>   }
> }
> {code}
> Because {{_SUCCESS}} is in the inner partition directories, partition discovery will fail.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
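The failure mode in the description above can be illustrated with a simplified model of Hive-style partition discovery (a hypothetical Python sketch, not Spark's actual implementation): partition columns are parsed from the `key=value` directories above each leaf file, and discovery asserts that every file agrees on the same column list. A `_SUCCESS` marker inside `b=1/` makes that directory look like a leaf with only one partition column.

```python
# Simplified, hypothetical model of Hive-style partition discovery
# (not Spark's actual code): each leaf file's parent directories are
# parsed into key=value partition columns, and all files must agree.
import re

PARTITION_RE = re.compile(r"^([^=/]+)=([^/]+)$")

def partition_columns(file_path, base):
    """Ordered partition column names between base and the file."""
    cols = []
    rel = file_path[len(base):].strip("/")
    for segment in rel.split("/")[:-1]:   # every dir level above the file
        m = PARTITION_RE.match(segment)
        if m:
            cols.append(m.group(1))
    return cols

def discover(files, base):
    """Fail, like the Spark assertion, if files disagree on columns."""
    column_lists = {tuple(partition_columns(f, base)) for f in files}
    if len(column_lists) > 1:
        raise AssertionError(
            f"Conflicting partition column names detected: {sorted(column_lists)}")
    return column_lists.pop()

base = "/tmp/table"
data_files = [
    f"{base}/b=1/c=1/d=1/part-00000.parquet",
    f"{base}/b=2/c=2/d=2/part-00000.parquet",
]
print(discover(data_files, base))  # ('b', 'c', 'd')

# A _SUCCESS marker in an inner partition dir parses to just ('b',),
# conflicting with ('b', 'c', 'd') from the data files:
try:
    discover(data_files + [f"{base}/b=1/_SUCCESS"], base)
except AssertionError as e:
    print("discovery failed:", e)
```

The sketch only models the column-agreement assertion; real discovery also resolves partition value types and basePath handling.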
[ https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325744#comment-15325744 ]

Yin Huai commented on SPARK-13207:
----------------------------------

Oh, I only looked at the title. OK, it is not fixed. Can you create a JIRA?
[ https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325718#comment-15325718 ]

Simeon Simeonov commented on SPARK-13207:
-----------------------------------------

[~yhuai] The PR associated with that ticket explicitly calls out {{_metadata}} and {{_common_metadata}} as not excluded. I am wondering why that PR would fix this issue... Can you add a test to demonstrate that it is fixed?
[ https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325337#comment-15325337 ]

Yin Huai commented on SPARK-13207:
----------------------------------

Hey [~simeons], sorry for the late reply. SPARK-15454 has fixed this issue.
[ https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305703#comment-15305703 ]

Simeon Simeonov commented on SPARK-13207:
-----------------------------------------

[~yhuai] I see the same problem with common metadata files. Do we need another JIRA issue for those?

For example, the following S3 directory structure:

{code}
2016-05-28 20:08:18      41207 ss/tests/partitioning/placements/par_ts=201605260400/_common_metadata
2016-05-28 20:08:17    3981760 ss/tests/partitioning/placements/par_ts=201605260400/_metadata
2016-05-28 20:06:10   26149863 ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=engagement/part-r-1-85d6c96a-4384-4e43-a575-ac71e147f349.gz.parquet
2016-05-28 20:06:42   32882968 ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=error/part-r-2-85d6c96a-4384-4e43-a575-ac71e147f349.gz.parquet
2016-05-28 20:05:49    1700553 ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=late/part-r-0-85d6c96a-4384-4e43-a575-ac71e147f349.gz.parquet
2016-05-28 20:04:21      41207 ss/tests/partitioning/placements/par_ts=201605270400/_common_metadata
2016-05-28 20:04:20    4120453 ss/tests/partitioning/placements/par_ts=201605270400/_metadata
2016-05-28 20:02:37   21471845 ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=engagement/part-r-00028-ad451b48-e48c-46eb-9e5f-c1bc4c663a4c.gz.parquet
2016-05-28 20:03:12   29981797 ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=error/part-r-00029-ad451b48-e48c-46eb-9e5f-c1bc4c663a4c.gz.parquet
2016-05-28 20:02:29    1525027 ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=late/part-r-00025-ad451b48-e48c-46eb-9e5f-c1bc4c663a4c.gz.parquet
{code}

generates the following partition discovery exception when loading {{/ss/tests/partitioning/placements}}:

{code}
java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
Partition column name list #0: par_ts, par_job, par_cat
Partition column name list #1: par_ts
For partitioned table directories, data files should only live in leaf directories. And directories at the same level should have the same partition column name. Please check the following directories for unexpected files or inconsistent partition column names:
dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605270400
dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605260400
dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=late
dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=late
dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=error
dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=error
dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=engagement
dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=engagement
	at scala.Predef$.assert(Predef.scala:179)
	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:246)
	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:115)
	at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:621)
	at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:526)
	at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:525)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:524)
	at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionColumns$1.apply(interfaces.scala:578)
	at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionColumns$1.apply(interfaces.scala:578)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.sql.sources.HadoopFsRelation.partitionColumns(interfaces.scala:578)
	at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:637)
	at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
	at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
{code}
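The two conflicting column lists in the error above come directly from the layout: `_metadata`/`_common_metadata` sit one level below the table root, so they parse to a single partition column, while the data files parse to three. A hypothetical Python sketch (not Spark code; file names abbreviated) reproduces the two lists:

```python
# Hypothetical illustration of why the S3 layout above trips the
# "Conflicting partition column names" assertion: leaf files at
# different depths parse to different partition column lists.
def columns_for(path):
    # keep only "key=value" directory segments above the file name
    return tuple(seg.split("=")[0] for seg in path.split("/")[:-1] if "=" in seg)

paths = [
    "placements/par_ts=201605260400/_common_metadata",
    "placements/par_ts=201605260400/_metadata",
    "placements/par_ts=201605260400/par_job=load/par_cat=late/part-r-0.gz.parquet",
]
print(sorted({columns_for(p) for p in paths}))
# [('par_ts',), ('par_ts', 'par_job', 'par_cat')]
```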
[ https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193730#comment-15193730 ]

Apache Spark commented on SPARK-13207:
--------------------------------------

User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/11697
[ https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193526#comment-15193526 ]

Yin Huai commented on SPARK-13207:
----------------------------------

It would be good to backport this to the 1.6 branch.
[ https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133540#comment-15133540 ]

Apache Spark commented on SPARK-13207:
--------------------------------------

User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/11088
[ https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133529#comment-15133529 ]

Yin Huai commented on SPARK-13207:
----------------------------------

It would be better to let partition discovery ignore files/dirs starting with "_" or ".". But we would need to change Parquet to not rely on leaf files to get those metadata files (see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L421-L422). I am thinking we can get a simple fix into master and the 1.6 branch, and then have a better fix in master.
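The direction described above can be sketched as follows. This is a hypothetical Python model of the proposed filtering, not the actual Spark patch: hide names starting with "_" or "." from partition parsing, while keeping {{_metadata}}/{{_common_metadata}} paths aside so a Parquet reader could still locate them.

```python
# Hypothetical sketch of the proposed fix direction (not the actual
# Spark patch): exclude "_"/"." prefixed names from partition-value
# parsing, but keep Parquet summary files available separately.
from posixpath import basename

def is_hidden(path):
    name = basename(path)
    return name.startswith("_") or name.startswith(".")

def split_for_discovery(paths):
    # visible paths feed partition discovery; metadata paths feed Parquet
    visible = [p for p in paths if not is_hidden(p)]
    metadata = [p for p in paths
                if basename(p) in ("_metadata", "_common_metadata")]
    return visible, metadata

paths = [
    "table/b=1/_SUCCESS",
    "table/b=1/_metadata",
    "table/b=1/c=1/d=1/part-00000.parquet",
    "table/b=1/c=1/d=1/.part-00000.parquet.crc",
]
visible, metadata = split_for_discovery(paths)
print(visible)   # ['table/b=1/c=1/d=1/part-00000.parquet']
print(metadata)  # ['table/b=1/_metadata']
```

With only the data file visible, all leaves agree on the partition columns ('b', 'c', 'd'), so the assertion in the description no longer fires.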