[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery

2016-06-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325749#comment-15325749
 ] 

Yin Huai commented on SPARK-13207:
--

Created https://issues.apache.org/jira/browse/SPARK-15895

> _SUCCESS should not break partition discovery
> -
>
> Key: SPARK-13207
> URL: https://issues.apache.org/jira/browse/SPARK-13207
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Yin Huai
> Assignee: Yin Huai
> Labels: backport-needed
> Fix For: 1.6.2, 2.0.0
>
>
> Partition discovery fails in the following case:
> {code}
> // Assumes java.io.File and com.google.common.io.Files are imported, and that the
> // enclosing test suite provides withTempPath, checkAnswer, and sqlContext.
> test("_SUCCESS should not break partitioning discovery") {
>   withTempPath { dir =>
>     val tablePath = new File(dir, "table")
>     val df = (1 to 3).map(i => (i, i, i, i)).toDF("a", "b", "c", "d")
>     df.write
>       .format("parquet")
>       .partitionBy("b", "c", "d")
>       .save(tablePath.getCanonicalPath)
>     // Drop a _SUCCESS marker into every level of one partition path.
>     Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1", "_SUCCESS"))
>     Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1/c=1", "_SUCCESS"))
>     Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1/c=1/d=1", "_SUCCESS"))
>     checkAnswer(sqlContext.read.format("parquet").load(tablePath.getCanonicalPath), df)
>   }
> }
> {code}
> Because {{_SUCCESS}} is in the inner partitioning dirs, partition discovery
> will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery

2016-06-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325744#comment-15325744
 ] 

Yin Huai commented on SPARK-13207:
--

Oh, I only looked at the title. OK, it is not fixed. Can you create a JIRA?




[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery

2016-06-10 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325718#comment-15325718
 ] 

Simeon Simeonov commented on SPARK-13207:
-

[~yhuai] The PR associated with that ticket explicitly calls out {{_metadata}} 
and {{_common_metadata}} as not excluded, so I am wondering how that PR would 
fix this issue. Can you add a test to demonstrate that this is fixed?




[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery

2016-06-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325337#comment-15325337
 ] 

Yin Huai commented on SPARK-13207:
--

Hey [~simeons], sorry for the late reply. SPARK-15454 has fixed this issue.




[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery

2016-05-28 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305703#comment-15305703
 ] 

Simeon Simeonov commented on SPARK-13207:
-

[~yhuai] I see the same problem with common metadata files. Do we need another 
JIRA issue for those? 

For example, the following S3 directory structure:

{code}
2016-05-28 20:08:18      41207 ss/tests/partitioning/placements/par_ts=201605260400/_common_metadata
2016-05-28 20:08:17    3981760 ss/tests/partitioning/placements/par_ts=201605260400/_metadata
2016-05-28 20:06:10   26149863 ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=engagement/part-r-1-85d6c96a-4384-4e43-a575-ac71e147f349.gz.parquet
2016-05-28 20:06:42   32882968 ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=error/part-r-2-85d6c96a-4384-4e43-a575-ac71e147f349.gz.parquet
2016-05-28 20:05:49    1700553 ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=late/part-r-0-85d6c96a-4384-4e43-a575-ac71e147f349.gz.parquet
2016-05-28 20:04:21      41207 ss/tests/partitioning/placements/par_ts=201605270400/_common_metadata
2016-05-28 20:04:20    4120453 ss/tests/partitioning/placements/par_ts=201605270400/_metadata
2016-05-28 20:02:37   21471845 ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=engagement/part-r-00028-ad451b48-e48c-46eb-9e5f-c1bc4c663a4c.gz.parquet
2016-05-28 20:03:12   29981797 ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=error/part-r-00029-ad451b48-e48c-46eb-9e5f-c1bc4c663a4c.gz.parquet
2016-05-28 20:02:29    1525027 ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=late/part-r-00025-ad451b48-e48c-46eb-9e5f-c1bc4c663a4c.gz.parquet
{code}

generates the following partition discovery exception when loading 
{{/ss/tests/partitioning/placements}}:

{code}
java.lang.AssertionError: assertion failed: Conflicting partition column names detected:

Partition column name list #0: par_ts, par_job, par_cat
Partition column name list #1: par_ts

For partitioned table directories, data files should only live in leaf directories.
And directories at the same level should have the same partition column name.
Please check the following directories for unexpected files or inconsistent partition column names:

    dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605270400
    dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605260400
    dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=late
    dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=late
    dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=error
    dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=error
    dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=engagement
    dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=engagement
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:246)
    at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:115)
    at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:621)
    at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:526)
    at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:525)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:524)
    at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionColumns$1.apply(interfaces.scala:578)
    at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionColumns$1.apply(interfaces.scala:578)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.sources.HadoopFsRelation.partitionColumns(interfaces.scala:578)
    at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:637)
    at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
    at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
{code}
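
To illustrate why the listing above produces two different column name lists, here is a minimal, Spark-free sketch. It is not the code in PartitioningUtils; the object and method names are hypothetical. It only assumes that every directory that directly contains a file contributes one list of partition column names, taken from the "name=value" segments of its path.

```scala
object PartitionColumnsSketch {
  /** Extracts partition column names from the "name=value" segments of a directory path. */
  def partitionColumns(dir: String): List[String] =
    dir.split('/').toList.filter(_.contains("=")).map(_.takeWhile(_ != '='))

  /** Returns the distinct column name lists over all directories that directly hold a file. */
  def columnLists(filePaths: Seq[String]): Set[List[String]] =
    filePaths.map { path =>
      val idx = path.lastIndexOf('/')
      // Everything before the last '/' is the parent directory of the file.
      partitionColumns(if (idx >= 0) path.substring(0, idx) else "")
    }.toSet

  // Two entries taken from the S3 listing in this comment.
  val samplePaths: Seq[String] = Seq(
    "ss/tests/partitioning/placements/par_ts=201605260400/_metadata",
    "ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=late/" +
      "part-r-0-85d6c96a-4384-4e43-a575-ac71e147f349.gz.parquet"
  )
}
```

Under that assumption, `PartitionColumnsSketch.columnLists(PartitionColumnsSketch.samplePaths)` contains both `List("par_ts")` and `List("par_ts", "par_job", "par_cat")`, which corresponds to the two conflicting lists reported in the assertion message.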


[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery

2016-03-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193730#comment-15193730
 ] 

Apache Spark commented on SPARK-13207:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/11697

> _SUCCESS should not break partition discovery
> -
>
> Key: SPARK-13207
> URL: https://issues.apache.org/jira/browse/SPARK-13207
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Yin Huai
> Assignee: Yin Huai
> Labels: backport-needed
> Fix For: 2.0.0



[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery

2016-03-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193526#comment-15193526
 ] 

Yin Huai commented on SPARK-13207:
--

It would be good to backport this to the 1.6 branch.



[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery

2016-02-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133540#comment-15133540
 ] 

Apache Spark commented on SPARK-13207:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/11088

> _SUCCESS should not break partition discovery
> -
>
> Key: SPARK-13207
> URL: https://issues.apache.org/jira/browse/SPARK-13207
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Yin Huai
> Assignee: Yin Huai



[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery

2016-02-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133529#comment-15133529
 ] 

Yin Huai commented on SPARK-13207:
--

It would be better to let partition discovery ignore files and directories 
whose names start with "_" or ".". However, we would first need to change the 
Parquet data source so that it does not rely on the leaf-file listing to find 
its metadata files (see 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L421-L422).
I am thinking we can get a simple fix into master and the 1.6 branch, and then 
have a better fix in master.
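
As a rough illustration of the rule proposed above, here is a minimal sketch. It is not the actual Spark change; the object, the method names, and the allow-list are hypothetical. It assumes that partition discovery skips any name starting with "_" or ".", while the Parquet reader still gets to see its {{_metadata}} and {{_common_metadata}} summary files.

```scala
object HiddenPathFilter {
  // Hypothetical allow-list: Parquet summary files that should stay visible
  // to the Parquet reader even though their names start with "_".
  val parquetSummaryFiles: Set[String] = Set("_metadata", "_common_metadata")

  /** True if partition discovery should skip a file or directory with this name. */
  def ignoredByDiscovery(name: String): Boolean =
    name.startsWith("_") || name.startsWith(".")

  /** True if a file with this name should still be handed to the Parquet reader. */
  def visibleToParquet(name: String): Boolean =
    !ignoredByDiscovery(name) || parquetSummaryFiles.contains(name)
}
```

With this split, `_SUCCESS` is ignored by both checks, while `_metadata` is ignored by discovery but remains visible to the Parquet reader. Keeping the two decisions in separate functions reflects the concern in the comment that the metadata files are currently found through the same leaf-file listing.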
