[jira] [Commented] (SPARK-28558) DatasetWriter partitionBy is changing the group file permissions in 2.4 for parquets

2019-10-07 Thread Nicolas Laduguie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945755#comment-16945755
 ] 

Nicolas Laduguie commented on SPARK-28558:
--

Hi everybody, is there any news on a resolution for this issue?

> DatasetWriter partitionBy is changing the group file permissions in 2.4 for 
> parquets
> 
>
> Key: SPARK-28558
> URL: https://issues.apache.org/jira/browse/SPARK-28558
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.3
> Environment: Hadoop 2.7
> Scala 2.11
> Tested:
>  * Spark 2.3.3 - Works
>  * Spark 2.4.x - All have the same issue
> Reporter: Stephen Pearson
> Priority: Minor
> Attachments: image-2019-09-24-09-20-07-225.png
>
>
> When writing a parquet using partitionBy the group file permissions are being 
> changed as shown below. This causes members of the group to get 
> "org.apache.hadoop.security.AccessControlException: Open failed for file 
> error: Permission denied (13)"
> This worked in 2.3. I found a workaround which was to set 
> "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2" which gives 
> the correct behaviour
>  
> Code I used to reproduce issue:
> {quote}Seq(("H", 1), ("I", 2))
>  .toDF("Letter", "Number")
>  .write
>  .partitionBy("Letter")
>  .parquet(...){quote}
>  
> {quote}sparktesting$ tree -dp
> ├── [drwxrws---]  letter_testing2.3-defaults
> │   ├── [drwxrws---]  Letter=H
> │   └── [drwxrws---]  Letter=I
> ├── [drwxrws---]  letter_testing2.4-defaults
> │   ├── [drwxrwS---]  Letter=H
> │   └── [drwxrwS---]  Letter=I
> └── [drwxrws---]  letter_testing2.4-file-writer2
>     ├── [drwxrws---]  Letter=H
>     └── [drwxrws---]  Letter=I
> {quote}
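
For readers unfamiliar with the notation in the listing above: in `ls`/`tree` output a lower-case "s" in the group execute position means setgid plus group execute, while an upper-case "S" means setgid is set but group execute is missing. The sketch below is only an illustration of that notation, not Spark or Hadoop code. The object and method names are made up for the example, and the octal modes 2770 and 2760 are inferred from the listing.

```scala
// Illustration only: render a numeric POSIX mode the way `ls -l` / `tree -p`
// print it (without the leading file-type letter).
object ModeString {
  // Render one 3-bit class (user, group or other). When the matching special
  // bit is set, the execute slot shows the special letter: lower case if
  // execute is also granted, upper case if it is not.
  private def triple(bits: Int, special: Boolean, specialChar: Char): String = {
    val r = if ((bits & 4) != 0) 'r' else '-'
    val w = if ((bits & 2) != 0) 'w' else '-'
    val canExec = (bits & 1) != 0
    val x =
      if (special) { if (canExec) specialChar else specialChar.toUpper }
      else if (canExec) 'x' else '-'
    s"$r$w$x"
  }

  def render(mode: Int): String = {
    val setuid = (mode & 0x800) != 0 // octal 4000
    val setgid = (mode & 0x400) != 0 // octal 2000
    val sticky = (mode & 0x200) != 0 // octal 1000
    triple((mode >> 6) & 7, setuid, 's') +
      triple((mode >> 3) & 7, setgid, 's') +
      triple(mode & 7, sticky, 't')
  }

  def main(args: Array[String]): Unit = {
    // Scala has no octal literals, so the modes are parsed from strings.
    val fromSpark23 = Integer.parseInt("2770", 8) // setgid + rwxrwx---
    val fromSpark24 = Integer.parseInt("2760", 8) // setgid + rwxrw----
    println("d" + render(fromSpark23))
    println("d" + render(fromSpark24))
  }
}
```

Without the group execute bit, group members cannot traverse the partition directory, which is consistent with the "Permission denied" error quoted in the description.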



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28558) DatasetWriter partitionBy is changing the group file permissions in 2.4 for parquets

2019-09-24 Thread Nicolas Laduguie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936719#comment-16936719
 ] 

Nicolas Laduguie commented on SPARK-28558:
--

I debugged Spark, and my understanding is the following:
 * When you "partitionBy" a dataset in Spark, it generates a path in which to 
store the file (Parquet in my case), including the partition folders.
 * When Spark saves the file, it writes the file directly and does not first 
create the intermediate partition folders if they don't exist.
 * The permissions are calculated for the file (so 666 by default) and the 
umask is applied to these permissions (for example, 664 if the umask is 002).
 * The file is saved with those permissions, and the partition folders are 
also created with those permissions.

This is probably why we can't get the right permissions applied to the 
partition folders.
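
The arithmetic behind this hypothesis can be sketched as follows. It is a standalone illustration, not the Spark or Hadoop code path: the object and method names are invented for the example, and the base modes 666 (files) and 777 (directories) are the usual defaults that a umask is applied to.

```scala
// Illustration only: how a umask turns a base mode into an effective mode.
object UmaskDemo {
  def octal(s: String): Int = Integer.parseInt(s, 8)
  def show(mode: Int): String = Integer.toOctalString(mode)

  // A umask clears the bits it names from the base mode.
  def applyUmask(base: Int, umask: Int): Int = base & ~umask

  // True if the group class has the execute (traverse) bit.
  def groupCanTraverse(mode: Int): Boolean = (mode & octal("010")) != 0

  def main(args: Array[String]): Unit = {
    val umask = octal("002")
    val fileMode = applyUmask(octal("666"), umask) // what a file should get
    val dirMode = applyUmask(octal("777"), umask)  // what a directory should get

    println(s"file mode: ${show(fileMode)}")
    println(s"dir mode:  ${show(dirMode)}")

    // If a directory is created with the file mode, the execute bits are
    // missing, so group members cannot enter it.
    println(s"group can traverse dir created with file mode: ${groupCanTraverse(fileMode)}")
    println(s"group can traverse dir created with dir mode:  ${groupCanTraverse(dirMode)}")
  }
}
```

Because the file base mode has no execute bits, no umask value can add them, which would also explain why changing the umask only affects the R and W permissions.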




[jira] [Commented] (SPARK-28558) DatasetWriter partitionBy is changing the group file permissions in 2.4 for parquets

2019-09-24 Thread Nicolas Laduguie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936504#comment-16936504
 ] 

Nicolas Laduguie commented on SPARK-28558:
--

FYI, here is a screenshot showing the permissions of the partition folders 
named "dt=*", which are 766, and the permissions of the table folder, which 
are 755.

!image-2019-09-24-09-20-07-225.png!




[jira] [Commented] (SPARK-28558) DatasetWriter partitionBy is changing the group file permissions in 2.4 for parquets

2019-09-24 Thread Nicolas Laduguie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936485#comment-16936485
 ] 

Nicolas Laduguie commented on SPARK-28558:
--

Hi [~spearson] [~holden], we are using MapR 6.0.1 (Hadoop 2.7.4), and the 
workaround you suggested 
(spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2) didn't work 
for me.

In fact, I can change the permissions of the partition folders slightly with 
the "spark.hadoop.fs.permissions.umask-mode" configuration, but it only affects 
the R and W permissions, not the X permission.

To describe my situation in a bit more detail: the table folder is created 
correctly by our Spark Structured Streaming job, that is, with the correct 
permissions (775).

The problem lies with the partition folders, which don't receive the correct 
X permission. They always stay at 764 or 766.




[jira] [Commented] (SPARK-28558) DatasetWriter partitionBy is changing the group file permissions in 2.4 for parquets

2019-09-23 Thread Stephen Pearson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936349#comment-16936349
 ] 

Stephen Pearson commented on SPARK-28558:
-

[~holden] I am using MapR 5.1.0

 

[~nladuguie] have you tried setting the config below? It worked as a 
workaround for us (assuming the behaviour change doesn't impact your processes).

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
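
As a hedged sketch, the same setting can also be applied when building the session instead of on the command line. The example reuses the reproduction snippet from the issue description. It assumes Spark SQL is on the classpath and that the job is launched with spark-submit. The object name and application name are illustrative, and the output path is taken from the first program argument because the original snippet elides it.

```scala
import org.apache.spark.sql.SparkSession

// Illustration only: partitioned Parquet write with the committer workaround.
object CommitterWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-permissions-check") // illustrative name
      // "spark.hadoop." properties are passed through to the Hadoop configuration.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    import spark.implicits._

    // Same data and partitioning as the reproduction in the issue description.
    Seq(("H", 1), ("I", 2))
      .toDF("Letter", "Number")
      .write
      .partitionBy("Letter")
      .parquet(args(0)) // output path supplied by the caller

    spark.stop()
  }
}
```

Note that algorithm version 2 changes commit behaviour: task output is moved to the final destination when each task commits rather than when the job commits, which is the behaviour change referred to above.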




[jira] [Commented] (SPARK-28558) DatasetWriter partitionBy is changing the group file permissions in 2.4 for parquets

2019-09-23 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936200#comment-16936200
 ] 

holdenk commented on SPARK-28558:
-

What storage system are y'all using [~nladuguie] & [~spearson] ?




[jira] [Commented] (SPARK-28558) DatasetWriter partitionBy is changing the group file permissions in 2.4 for parquets

2019-09-23 Thread Nicolas Laduguie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935848#comment-16935848
 ] 

Nicolas Laduguie commented on SPARK-28558:
--

Hi,

I am also facing this issue.

From my point of view, this is not just a "minor" bug, because it modifies the 
permissions of partition directories, which prevents users from accessing the 
data.

The priority of this issue should be raised.
