[jira] [Updated] (SPARK-13046) Partitioning looks broken in 1.6
[ https://issues.apache.org/jira/browse/SPARK-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-13046:

Description:
Hello,

I have a list of files in s3:
{code}
s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
{code}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same for the three lines) would correctly identify 2 pairs of key/value, one `date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15

That is to say, the partitioning code now fails to identify date_received=2016-01-13 as a key/value pair.

I can see that there has been some activity on spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala recently, so that seems related (especially the commits https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b and https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14 ).

If I read correctly the tests added in those commits:
- they don't seem to actually test the return value, only that it doesn't crash
- they only test cases where the s3 path contains 1 key/value pair (which otherwise would catch the bug)

This is problematic for us as we're trying to migrate all of our spark services to 1.6.0 and this bug is a real blocker. I know it's possible to force a 'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.
was:
Hello,

I have a list of files in s3:
s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same for the three lines) would correctly identify 2 pairs of key/value, one `date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15

That is to say, the partitioning code now fails to identify date_received=2016-01-13 as a key/value pair.

I can see that there has been some activity on spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala recently, so that seems related (especially the commits https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b and https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14 ).

If I read correctly the tests added in those commits:
- they don't seem to actually test the return value, only that it doesn't crash
- they only test cases where the s3 path contains 1 key/value pair (which otherwise would catch the bug)

This is problematic for us as we're trying to migrate all of our spark services to 1.6.0 and this bug is a real blocker. I know it's possible to force a 'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.
> Partitioning looks broken in 1.6
>
>                 Key: SPARK-13046
>                 URL: https://issues.apache.org/jira/browse/SPARK-13046
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Julien Baley
>
> Hello,
> I have a list of files in s3:
> {code}
> s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
> s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
> s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
> {code}
> Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same for the three lines) would correctly identify 2 pairs of key/value, one `date_received` and one `fingerprint`.
> From 1.6.0, I get the following exception:
> assertion failed: Conflicting directory structures detected. Suspicious paths
> s3://bucket/some_path/date_received=2016-01-13
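To make the expected behaviour in the report concrete, here is a minimal Python sketch of Hive-style partition discovery over a two-level layout like the one described. It is not Spark's `PartitioningUtils` code; the function names `parse_partition_path` and `discover_partitions` are hypothetical, and the consistency check only loosely imitates the "Conflicting directory structures" assertion. The point is to show a check on the returned key/value pairs for paths with more than one partition column, which is the kind of test the reporter says appears to be missing.

```python
def parse_partition_path(path, base_path):
    """Return the ordered (key, value) pairs encoded in the directories
    below base_path, e.g. 'k1=v1/k2=v2' -> [('k1', 'v1'), ('k2', 'v2')].

    Hypothetical helper for illustration; not part of Spark.
    """
    base = base_path.rstrip("/")
    if path != base and not path.startswith(base + "/"):
        raise ValueError("%s is not under %s" % (path, base_path))
    relative = path[len(base):].strip("/")
    pairs = []
    for segment in relative.split("/"):
        if not segment:
            continue
        key, sep, value = segment.partition("=")
        if not sep or not key or not value:
            raise ValueError("not a key=value directory: %r" % segment)
        pairs.append((key, value))
    return pairs


def discover_partitions(leaf_dirs, base_path):
    """Parse every leaf directory and require that all of them expose the
    same partition columns in the same order (an assumed, simplified rule)."""
    parsed = [parse_partition_path(p, base_path) for p in leaf_dirs]
    layouts = {tuple(key for key, _ in pairs) for pairs in parsed}
    if len(layouts) > 1:
        raise ValueError("Conflicting directory structures detected: %s"
                         % sorted(layouts))
    return parsed


if __name__ == "__main__":
    base = "s3://bucket/some_path/"
    dirs = [
        base + "date_received=2016-01-13/fingerprint=2f6a09d370b4021d",
        base + "date_received=2016-01-14/fingerprint=2f6a09d370b4021d",
        base + "date_received=2016-01-15/fingerprint=2f6a09d370b4021d",
    ]
    for pairs in discover_partitions(dirs, base):
        print(pairs)
```

With the three directories from the report and `s3://bucket/some_path/` as the base, each path should yield two pairs, one for `date_received` and one for `fingerprint`.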
[jira] [Updated] (SPARK-13046) Partitioning looks broken in 1.6
[ https://issues.apache.org/jira/browse/SPARK-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-13046:

Description:
Hello,

I have a list of files in s3:
{code}
s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
{code}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same for the three lines) would correctly identify 2 pairs of key/value, one `date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
{code}
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15
{code}

That is to say, the partitioning code now fails to identify date_received=2016-01-13 as a key/value pair.

I can see that there has been some activity on spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala recently, so that seems related (especially the commits https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b and https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14 ).

If I read correctly the tests added in those commits:
- they don't seem to actually test the return value, only that it doesn't crash
- they only test cases where the s3 path contains 1 key/value pair (which otherwise would catch the bug)

This is problematic for us as we're trying to migrate all of our spark services to 1.6.0 and this bug is a real blocker. I know it's possible to force a 'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.
was:
Hello,

I have a list of files in s3:
{code}
s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
{code}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same for the three lines) would correctly identify 2 pairs of key/value, one `date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15

That is to say, the partitioning code now fails to identify date_received=2016-01-13 as a key/value pair.

I can see that there has been some activity on spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala recently, so that seems related (especially the commits https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b and https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14 ).

If I read correctly the tests added in those commits:
- they don't seem to actually test the return value, only that it doesn't crash
- they only test cases where the s3 path contains 1 key/value pair (which otherwise would catch the bug)

This is problematic for us as we're trying to migrate all of our spark services to 1.6.0 and this bug is a real blocker. I know it's possible to force a 'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.
[jira] [Updated] (SPARK-13046) Partitioning looks broken in 1.6
[ https://issues.apache.org/jira/browse/SPARK-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Baley updated SPARK-13046:

Description:
Hello,

I have a list of files in s3:
s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same for the three lines) would correctly identify 2 pairs of key/value, one `date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15

That is to say, the partitioning code now fails to identify date_received=2016-01-13 as a key/value pair.

I can see that there has been some activity on spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala recently, so that seems related (especially the commits https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b and https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14 ).

If I read correctly the tests added in those commits:
- they don't seem to actually test the return value, only that it doesn't crash
- they only test cases where the s3 path contains 1 key/value pair (which otherwise would catch the bug)

This is problematic for us as we're trying to migrate all of our spark services to 1.6.0 and this bug is a real blocker. I know it's possible to force a 'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.
was:
Hello,

I have a list of files in s3:
s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same for the three lines) would correctly identify 2 pairs of key/value, one `date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15

That is to say, the partitioning code now fails to identify date_received=2016-01-13 as a key/value pair.

I can see that there has been some activity on spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala recently, so that seems related (especially the commits https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b and https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14 ).

If I read correctly the tests added in those commits:
- they don't seem to actually test the return value, only that it doesn't crash
- they only test cases where the s3 path contains 1 key/value pair.

This is problematic for us as we're trying to migrate all of our spark services to 1.6.0 and this bug is a real blocker. I know it's possible to force a 'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.