[jira] [Updated] (SPARK-13046) Partitioning looks broken in 1.6

2016-02-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13046:

Description: 
Hello,

I have a list of files in s3:

{code}
s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
{code}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same 
for the three lines) would correctly identify 2 pairs of key/value, one 
`date_received` and one `fingerprint`.
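
For illustration, the key/value discovery described above can be sketched in plain Python. This is a minimal sketch and not the code in Spark's PartitioningUtils: the function name `parse_partition_dirs` is hypothetical, and it assumes every directory between the base path and the leaf directory has the form `key=value`.

```python
def parse_partition_dirs(leaf_dir, base_path):
    """Return the (column, value) pairs encoded in the directories
    between base_path and leaf_dir, in path order."""
    base = base_path.rstrip("/")
    leaf = leaf_dir.rstrip("/")
    if not leaf.startswith(base + "/"):
        raise ValueError(f"{leaf_dir!r} is not under {base_path!r}")
    pairs = []
    for segment in leaf[len(base) + 1:].split("/"):
        column, sep, value = segment.partition("=")
        if not sep or not column:
            # Assumption: anything that is not key=value is rejected.
            raise ValueError(f"{segment!r} is not a key=value directory")
        pairs.append((column, value))
    return pairs


leaf = "s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d"
print(parse_partition_dirs(leaf, "s3://bucket/some_path/"))
# [('date_received', '2016-01-13'), ('fingerprint', '2f6a09d370b4021d')]
```

Both directory levels yield one pair each, which is the behaviour reported for 1.5.2.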

From 1.6.0, I get the following exception:
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15

That is to say, the partitioning code now fails to identify 
date_received=2016-01-13 as a key/value pair.
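
One way to picture this symptom is the sketch below. It assumes the check works by removing trailing `key=value` directories from each leaf directory and then requiring that a single base path remains; that is an assumption for illustration, not a description of Spark's implementation. The names `strip_partition_dirs`, `find_conflicts` and `max_levels` are hypothetical.

```python
def strip_partition_dirs(leaf_dir, max_levels=None):
    """Drop trailing key=value directories (at most max_levels of them,
    or all of them when max_levels is None) and return the remainder."""
    segments = leaf_dir.rstrip("/").split("/")
    stripped = 0
    while (segments and "=" in segments[-1]
           and (max_levels is None or stripped < max_levels)):
        segments.pop()
        stripped += 1
    return "/".join(segments)


def find_conflicts(leaf_dirs, max_levels=None):
    """Return the distinct base paths if there is more than one, else []."""
    bases = sorted({strip_partition_dirs(d, max_levels) for d in leaf_dirs})
    return bases if len(bases) > 1 else []


leaf_dirs = [
    f"s3://bucket/some_path/date_received=2016-01-{day}/fingerprint=2f6a09d370b4021d"
    for day in ("13", "14", "15")
]

# Both levels recognised: one common base path, no conflict.
print(find_conflicts(leaf_dirs))  # []

# Only the innermost level recognised: three different "base paths".
for path in find_conflicts(leaf_dirs, max_levels=1):
    print(path)
# s3://bucket/some_path/date_received=2016-01-13
# s3://bucket/some_path/date_received=2016-01-14
# s3://bucket/some_path/date_received=2016-01-15
```

Under that assumption, the three suspicious paths in the exception are what remains when only `fingerprint` is treated as a partition directory.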

I can see that there has been some recent activity on
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala,
which seems related (especially the commits
https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b
and
https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14
).
If I read the tests added in those commits correctly:
- they don't seem to check the return value, only that the code doesn't crash
- they only cover cases where the s3 path contains 1 key/value pair (a case with 2 pairs would otherwise have caught the bug)

This is problematic for us as we're trying to migrate all of our spark services 
to 1.6.0 and this bug is a real blocker. I know it's possible to force a 
'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.

  was:
Hello,

I have a list of files in s3:

s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same 
for the three lines) would correctly identify 2 pairs of key/value, one 
`date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15

That is to say, the partitioning code now fails to identify 
date_received=2016-01-13 as a key/value pair.

I can see that there has been some recent activity on
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala,
which seems related (especially the commits
https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b
and
https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14
).
If I read the tests added in those commits correctly:
- they don't seem to check the return value, only that the code doesn't crash
- they only cover cases where the s3 path contains 1 key/value pair (a case with 2 pairs would otherwise have caught the bug)

This is problematic for us as we're trying to migrate all of our spark services 
to 1.6.0 and this bug is a real blocker. I know it's possible to force a 
'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.


> Partitioning looks broken in 1.6
> 
>
> Key: SPARK-13046
> URL: https://issues.apache.org/jira/browse/SPARK-13046
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Reporter: Julien Baley
>
> Hello,
> I have a list of files in s3:
> {code}
> s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
> s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
> s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
> {code}
> Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same 
> for the three lines) would correctly identify 2 pairs of key/value, one 
> `date_received` and one `fingerprint`.
> From 1.6.0, I get the following exception:
> assertion failed: Conflicting directory structures detected. Suspicious paths
> s3://bucket/some_path/date_received=2016-01-13
> 

[jira] [Updated] (SPARK-13046) Partitioning looks broken in 1.6

2016-02-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13046:

Description: 
Hello,

I have a list of files in s3:

{code}
s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
{code}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same 
for the three lines) would correctly identify 2 pairs of key/value, one 
`date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
{code}
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15
{code}

That is to say, the partitioning code now fails to identify 
date_received=2016-01-13 as a key/value pair.

I can see that there has been some recent activity on
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala,
which seems related (especially the commits
https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b
and
https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14
).
If I read the tests added in those commits correctly:
- they don't seem to check the return value, only that the code doesn't crash
- they only cover cases where the s3 path contains 1 key/value pair (a case with 2 pairs would otherwise have caught the bug)

This is problematic for us as we're trying to migrate all of our spark services 
to 1.6.0 and this bug is a real blocker. I know it's possible to force a 
'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.

  was:
Hello,

I have a list of files in s3:

{code}
s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
{code}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same 
for the three lines) would correctly identify 2 pairs of key/value, one 
`date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15

That is to say, the partitioning code now fails to identify 
date_received=2016-01-13 as a key/value pair.

I can see that there has been some recent activity on
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala,
which seems related (especially the commits
https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b
and
https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14
).
If I read the tests added in those commits correctly:
- they don't seem to check the return value, only that the code doesn't crash
- they only cover cases where the s3 path contains 1 key/value pair (a case with 2 pairs would otherwise have caught the bug)

This is problematic for us as we're trying to migrate all of our spark services 
to 1.6.0 and this bug is a real blocker. I know it's possible to force a 
'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.



[jira] [Updated] (SPARK-13046) Partitioning looks broken in 1.6

2016-01-27 Thread Julien Baley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Baley updated SPARK-13046:
Description: 
Hello,

I have a list of files in s3:

s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same 
for the three lines) would correctly identify 2 pairs of key/value, one 
`date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15

That is to say, the partitioning code now fails to identify 
date_received=2016-01-13 as a key/value pair.

I can see that there has been some recent activity on
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala,
which seems related (especially the commits
https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b
and
https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14
).
If I read the tests added in those commits correctly:
- they don't seem to check the return value, only that the code doesn't crash
- they only cover cases where the s3 path contains 1 key/value pair (a case with 2 pairs would otherwise have caught the bug)

This is problematic for us as we're trying to migrate all of our spark services 
to 1.6.0 and this bug is a real blocker. I know it's possible to force a 
'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.

  was:
Hello,

I have a list of files in s3:

s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some parquet files}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same 
for the three lines) would correctly identify 2 pairs of key/value, one 
`date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15

That is to say, the partitioning code now fails to identify 
date_received=2016-01-13 as a key/value pair.

I can see that there has been some recent activity on
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala,
which seems related (especially the commits
https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b
and
https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14
).
If I read the tests added in those commits correctly:
- they don't seem to check the return value, only that the code doesn't crash
- they only cover cases where the s3 path contains 1 key/value pair.

This is problematic for us as we're trying to migrate all of our spark services 
to 1.6.0 and this bug is a real blocker. I know it's possible to force a 
'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.

