[jira] [Updated] (SPARK-44166) Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite

2023-06-24 Thread Pralabh Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pralabh Kumar updated SPARK-44166:
--
Description: 
Currently, in InsertIntoHiveTable.scala there is no way to pass
dynamicPartitionOverwrite = true when calling saveAsHiveFile. When
dynamicPartitionOverwrite is true, Spark uses its built-in FileCommitProtocol
instead of the Hadoop FileOutputCommitter, which is more performant.
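
For context, this is the kind of statement the change targets (a minimal
sketch from a spark-shell with Hive support enabled; the table and column
names are hypothetical):
{code:java}
// Hypothetical example of the affected write path: an INSERT OVERWRITE with
// dynamic partition values into a partitioned Hive table.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql(
  """INSERT OVERWRITE TABLE sales PARTITION (dt)
    |SELECT amount, dt FROM staging_sales""".stripMargin)
{code}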

 

Here is the proposed solution.

When inserting overwrite into a Hive table:

 

Current code 

 
{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options)
{code}
 

 

Proposed code: introduce an enableDynamicPartitionOverwrite flag, backed by a
new configuration in HiveUtils.
{code:java}
val USE_FILECOMMITPROTOCOL_DYNAMIC_PARTITION_OVERWRITE =
  buildConf("spark.sql.hive.filecommit.dynamicPartitionOverwrite")
    .booleanConf
    .createWithDefault(false) // assumed default; keeps today's behaviour unless enabled
{code}
 
{code:java}
val enableDynamicPartitionOverwrite =
  SQLConf.get.getConf(HiveUtils.USE_FILECOMMITPROTOCOL_DYNAMIC_PARTITION_OVERWRITE)
logWarning(s"enableDynamicPartitionOverwrite: $enableDynamicPartitionOverwrite")
{code}
 

 

Now, if enableDynamicPartitionOverwrite is true, numDynamicPartitions > 0, and
overwrite is true, pass dynamicPartitionOverwrite = true:

 
{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options,
  dynamicPartitionOverwrite =
    enableDynamicPartitionOverwrite && numDynamicPartitions > 0 && overwrite)
{code}
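
For the call above to compile, saveAsHiveFile needs to accept the new
argument. A minimal sketch of the signature change (parameter names mirror the
call above; the types and return type are simplified assumptions, not the
exact Spark source):
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.BucketSpec
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.FileFormat

trait SaveAsHiveFileSketch {
  // Sketch only: adds dynamicPartitionOverwrite with a default that preserves
  // the current behaviour; Set[String] stands for the written partitions.
  protected def saveAsHiveFile(
      sparkSession: SparkSession,
      plan: SparkPlan,
      hadoopConf: Configuration,
      fileFormat: FileFormat,
      outputLocation: String,
      partitionAttributes: Seq[Attribute] = Nil,
      bucketSpec: Option[BucketSpec] = None,
      options: Map[String, String] = Map.empty,
      dynamicPartitionOverwrite: Boolean = false): Set[String]
}
{code}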
 

 

In saveAsHiveFile:
{code:java}
val committer = FileCommitProtocol.instantiate(
      sparkSession.sessionState.conf.fileCommitProtocolClass,
      jobId = java.util.UUID.randomUUID().toString,
      outputPath = outputLocation,
      dynamicPartitionOverwrite = dynamicPartitionOverwrite) {code}
This will, in turn, construct the configured commit protocol
(SQLHadoopMapReduceCommitProtocol by default) with dynamicPartitionOverwrite
set to true:

 
{code:java}
class SQLHadoopMapReduceCommitProtocol(
    jobId: String,
    path: String,
    dynamicPartitionOverwrite: Boolean = false)
  extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite)
{code}
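
For reference, fileCommitProtocolClass above comes from an existing SQL
configuration whose default is, to the best of my knowledge, already
SQLHadoopMapReduceCommitProtocol, so the flag should flow straight through
FileCommitProtocol.instantiate (quick check from a spark-shell):
{code:java}
// Default of spark.sql.sources.commitProtocolClass in current Spark releases
// (printed value is an expectation; please verify on your build).
println(spark.conf.get("spark.sql.sources.commitProtocolClass"))
// org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
{code}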
 

 

 

  was:
Currently, in InsertIntoHiveTable.scala there is no way to pass
dynamicPartitionOverwrite = true when calling saveAsHiveFile. When
dynamicPartitionOverwrite is true, Spark uses its built-in FileCommitProtocol
instead of the Hadoop FileOutputCommitter, which is more performant.

 

Here is the proposed solution.

When inserting overwrite into a Hive table:

 

Current code 

 
{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options)
       {code}
 

 

Proposed code: introduce an enableDynamicPartitionOverwrite flag.

 
{code:java}
val enableDynamicPartitionOverwrite =
  SQLConf.get.getConf(HiveUtils.USE_FILECOMMITPROTOCOL_DYNAMIC_PARTITION_OVERWRITE)
logWarning(s"enableDynamicPartitionOverwrite: $enableDynamicPartitionOverwrite")
{code}
 

 

Now, if enableDynamicPartitionOverwrite is true, numDynamicPartitions > 0, and
overwrite is true, pass dynamicPartitionOverwrite = true:

 
{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options,
  dynamicPartitionOverwrite =
    enableDynamicPartitionOverwrite && numDynamicPartitions > 0 && overwrite)
{code}
 

 

In saveAsHiveFile:
{code:java}
val committer = FileCommitProtocol.instantiate(
      sparkSession.sessionState.conf.fileCommitProtocolClass,
      jobId = java.util.UUID.randomUUID().toString,
      outputPath = outputLocation,
      dynamicPartitionOverwrite = dynamicPartitionOverwrite) {code}
This will, in turn, construct the configured commit protocol
(SQLHadoopMapReduceCommitProtocol by default) with dynamicPartitionOverwrite
set to true:

 
{code:java}
class SQLHadoopMapReduceCommitProtocol(
jobId: String,
path: String,
dynamicPartitionOverwrite: Boolean = false)
  extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite) 
{code}
 

 

 


> Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite
> ---
>
> Key: SPARK-44166
> URL: https://issues.apache.org/jira/browse/SPARK-44166
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Pralabh Kumar
> 

[jira] [Updated] (SPARK-44166) Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite

2023-06-24 Thread Pralabh Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pralabh Kumar updated SPARK-44166:
--
Description: 
Currently, in InsertIntoHiveTable.scala there is no way to pass
dynamicPartitionOverwrite = true when calling saveAsHiveFile. When
dynamicPartitionOverwrite is true, Spark uses its built-in FileCommitProtocol
instead of the Hadoop FileOutputCommitter, which is more performant.

 

Here is the proposed solution.

When inserting overwrite into a Hive table:

 

Current code 

 
{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options)
       {code}
 

 

Proposed code: introduce an enableDynamicPartitionOverwrite flag.

 
{code:java}
val enableDynamicPartitionOverwrite =
  SQLConf.get.getConf(HiveUtils.USE_FILECOMMITPROTOCOL_DYNAMIC_PARTITION_OVERWRITE)
logWarning(s"enableDynamicPartitionOverwrite: $enableDynamicPartitionOverwrite")
{code}
 

 

Now, if enableDynamicPartitionOverwrite is true, numDynamicPartitions > 0, and
overwrite is true, pass dynamicPartitionOverwrite = true:

 
{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options,
  dynamicPartitionOverwrite =
    enableDynamicPartitionOverwrite && numDynamicPartitions > 0 && overwrite)
{code}
 

 

In saveAsHiveFile:
{code:java}
val committer = FileCommitProtocol.instantiate(
      sparkSession.sessionState.conf.fileCommitProtocolClass,
      jobId = java.util.UUID.randomUUID().toString,
      outputPath = outputLocation,
      dynamicPartitionOverwrite = dynamicPartitionOverwrite) {code}
This will, in turn, construct the configured commit protocol
(SQLHadoopMapReduceCommitProtocol by default) with dynamicPartitionOverwrite
set to true:

 
{code:java}
class SQLHadoopMapReduceCommitProtocol(
jobId: String,
path: String,
dynamicPartitionOverwrite: Boolean = false)
  extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite) 
{code}
 

 

 

  was:
Currently, in InsertIntoHiveTable.scala there is no way to pass
dynamicPartitionOverwrite = true when calling saveAsHiveFile. When
dynamicPartitionOverwrite is true, Spark uses its built-in FileCommitProtocol
instead of the Hadoop FileOutputCommitter, which is more performant.

 

Here is the proposed solution.

When inserting overwrite into a Hive table:

 

Current code 

 
{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options)
       {code}
 

 

Proposed code. 

 

 

 

 


> Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite
> ---
>
> Key: SPARK-44166
> URL: https://issues.apache.org/jira/browse/SPARK-44166
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Pralabh Kumar
>Priority: Minor
>
> Currently in InsertIntoHiveTable.scala , there is no way to pass 
> dynamicPartitionOverwrite to true , when calling  saveAsHiveFile . When 
> dynamicPartitioOverwrite is true , spark will use  built-in 
> FileCommitProtocol instead of Hadoop FileOutputCommitter , which is more 
> performant. 
>  
> Here is the solution . 
> When inserting overwrite into Hive table
>  
> Current code 
>  
> {code:java}
> val writtenParts = saveAsHiveFile(
>   sparkSession = sparkSession,
>   plan = child,
>   hadoopConf = hadoopConf,
>   fileFormat = fileFormat,
>   outputLocation = tmpLocation.toString,
>   partitionAttributes = partitionColumns,
>   bucketSpec = bucketSpec,
>   options = options)
>        {code}
>  
>  
> Proposed code.  
> enableDynamicPartitionOverwrite 
>  
> {code:java}
>  val enableDynamicPartitionOverwrite =
>       
> SQLConf.get.getConf(HiveUtils.USE_FILECOMMITPROTOCOL_DYNAMIC_PARTITION_OVERWRITE)
>     logWarning(s"enableDynamicPartitionOverwrite: 
> $enableDynamicPartitionOverwrite"){code}
>  
>  
> Now if enableDynamicPartitionOverwrite is true and numDynamicPartitions > 0 
> and overwrite is true , pass dynamicPartitionOverwrite true. 
>  
> {code:java}
> val writtenParts = saveAsHiveFile( sparkSession = sparkSession, plan = child, 
> hadoopConf = hadoopConf, fileFormat = fileFormat, outputLocation = 
> tmpLocation.toString, partitionAttributes = partitionColumns, bucketSpec = 
> bucketSpec, options = options, dynamicPartitionOverwrite =
>         enableDynamicPartitionOverwrit

[jira] [Updated] (SPARK-44166) Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite

2023-06-24 Thread Pralabh Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pralabh Kumar updated SPARK-44166:
--
Description: 
Currently, in InsertIntoHiveTable.scala there is no way to pass
dynamicPartitionOverwrite = true when calling saveAsHiveFile. When
dynamicPartitionOverwrite is true, Spark uses its built-in FileCommitProtocol
instead of the Hadoop FileOutputCommitter, which is more performant.

 

Here is the proposed solution.

When inserting overwrite into a Hive table:

 

Current code 

 
{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options)
       {code}
 

 

Proposed code. 

 

 

 

 

  was:
Currently, in InsertIntoHiveTable.scala there is no way to pass
dynamicPartitionOverwrite = true when calling saveAsHiveFile. When
dynamicPartitionOverwrite is true, Spark uses its built-in FileCommitProtocol
instead of the Hadoop FileOutputCommitter, which is more performant.


> Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite
> ---
>
> Key: SPARK-44166
> URL: https://issues.apache.org/jira/browse/SPARK-44166
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Pralabh Kumar
>Priority: Minor
>
> Currently in InsertIntoHiveTable.scala , there is no way to pass 
> dynamicPartitionOverwrite to true , when calling  saveAsHiveFile . When 
> dynamicPartitioOverwrite is true , spark will use 
> built-in FileCommitProtocol instead of Hadoop FileOutputCommitter , which is 
> more performant. 
>  
> Here is the solution . 
> When inserting overwrite into Hive table
>  
> Current code 
>  
> {code:java}
> val writtenParts = saveAsHiveFile(
>   sparkSession = sparkSession,
>   plan = child,
>   hadoopConf = hadoopConf,
>   fileFormat = fileFormat,
>   outputLocation = tmpLocation.toString,
>   partitionAttributes = partitionColumns,
>   bucketSpec = bucketSpec,
>   options = options)
>        {code}
>  
>  
> Proposed code. 
>  
>  
>  
>  






[jira] [Created] (SPARK-44166) Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite

2023-06-24 Thread Pralabh Kumar (Jira)
Pralabh Kumar created SPARK-44166:
-

 Summary: Enable dynamicPartitionOverwrite in SaveAsHiveFile for 
insert overwrite
 Key: SPARK-44166
 URL: https://issues.apache.org/jira/browse/SPARK-44166
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Pralabh Kumar


Currently, in InsertIntoHiveTable.scala there is no way to pass
dynamicPartitionOverwrite = true when calling saveAsHiveFile. When
dynamicPartitionOverwrite is true, Spark uses its built-in FileCommitProtocol
instead of the Hadoop FileOutputCommitter, which is more performant.






[jira] [Commented] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception

2023-05-09 Thread Pralabh Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720979#comment-17720979
 ] 

Pralabh Kumar commented on SPARK-43235:
---

Can anyone please look into this? If it looks OK, I can create a PR for it.

> ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE 
> if isPublic throws exception
> --
>
> Key: SPARK-43235
> URL: https://issues.apache.org/jira/browse/SPARK-43235
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Pralabh Kumar
>Priority: Minor
>
> Hi Spark Team.
> Currently, the *ClientDistributedCacheManager* *getVisibility* method checks
> whether a resource's visibility can be set to private or public.
> In order to set *LocalResourceVisibility.PUBLIC*, isPublic checks the
> permissions of all ancestor directories of the executable directory. It walks
> up to the root directory to check the permissions of every parent
> (ancestorsHaveExecutePermissions).
> checkPermissionOfOther calls getFileStatus and checks the permissions on the
> returned FileStatus.
> If getFileStatus throws an exception, spark-submit fails; the visibility is
> not set to PRIVATE.
> if (isPublic(conf, uri, statCache)) {
>   LocalResourceVisibility.PUBLIC
> } else {
>   LocalResourceVisibility.PRIVATE
> }
> Generally, if the user doesn't have permission to check the root folder
> (specifically on cloud file systems such as GCS, for the buckets), the method
> throws an IOException ("Error accessing Bucket").
>  
> *Ideally, if there is an error in isPublic, which means Spark isn't able to
> determine the execute permission of all the parent directories, it should
> fall back to LocalResourceVisibility.PRIVATE. However, it currently lets the
> exception from isPublic propagate, and hence spark-submit fails.*
>  
>  






[jira] [Updated] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception

2023-04-30 Thread Pralabh Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pralabh Kumar updated SPARK-43235:
--
Description: 
Hi Spark Team.

Currently, the *ClientDistributedCacheManager* *getVisibility* method checks
whether a resource's visibility can be set to private or public.

In order to set *LocalResourceVisibility.PUBLIC*, isPublic checks the
permissions of all ancestor directories of the executable directory. It walks
up to the root directory to check the permissions of every parent
(ancestorsHaveExecutePermissions).

checkPermissionOfOther calls getFileStatus and checks the permissions on the
returned FileStatus.

If getFileStatus throws an exception, spark-submit fails; the visibility is
not set to PRIVATE.

if (isPublic(conf, uri, statCache)) {
  LocalResourceVisibility.PUBLIC
} else {
  LocalResourceVisibility.PRIVATE
}

Generally, if the user doesn't have permission to check the root folder
(specifically on cloud file systems such as GCS, for the buckets), the method
throws an IOException ("Error accessing Bucket").

*Ideally, if there is an error in isPublic, which means Spark isn't able to
determine the execute permission of all the parent directories, it should fall
back to LocalResourceVisibility.PRIVATE. However, it currently lets the
exception from isPublic propagate, and hence spark-submit fails.*
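
A minimal sketch of the proposed behaviour (not the actual patch), assuming
the existing getVisibility context where conf, uri, statCache, isPublic and
logWarning are in scope:
{code:java}
import java.io.IOException

// Sketch only: fall back to PRIVATE instead of failing spark-submit when the
// ancestor-permission probe throws (e.g. "Error accessing Bucket" on GCS).
val visibility =
  try {
    if (isPublic(conf, uri, statCache)) {
      LocalResourceVisibility.PUBLIC
    } else {
      LocalResourceVisibility.PRIVATE
    }
  } catch {
    case e: IOException =>
      logWarning(s"Could not determine visibility of $uri; assuming PRIVATE", e)
      LocalResourceVisibility.PRIVATE
  }
{code}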

 

 

  was:
Hi Spark Team.

Currently, the *ClientDistributedCacheManager* *getVisibility* method checks
whether a resource's visibility can be set to private or public.

In order to set *LocalResourceVisibility.PUBLIC*, isPublic checks the
permissions of all ancestor directories of the executable directory. It walks
up to the root directory to check the permissions of every parent
(ancestorsHaveExecutePermissions).

checkPermissionOfOther calls getFileStatus and checks the permissions on the
returned FileStatus.

If getFileStatus throws an exception, spark-submit fails; the visibility is
not set to PRIVATE.

if (isPublic(conf, uri, statCache)) {
  LocalResourceVisibility.PUBLIC
} else {
  LocalResourceVisibility.PRIVATE
}

Generally, if the user doesn't have permission to check the root folder
(specifically on cloud file systems such as GCS, for the buckets), the method
throws an IOException ("Error accessing Bucket").

*Ideally, if there is an error in isPublic, which means Spark isn't able to
determine the execute permission of all the parent directories, it should fall
back to LocalResourceVisibility.PRIVATE. However, it currently lets the
exception from isPublic propagate, and hence spark-submit fails.*

 

 


> ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE 
> if isPublic throws exception
> --
>
> Key: SPARK-43235
> URL: https://issues.apache.org/jira/browse/SPARK-43235
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Pralabh Kumar
>Priority: Minor
>
> Hi Spark Team .
> Currently *ClientDistributedCacheManager* *getVisibility* methods checks 
> whether resource visibility can be set to private or public. 
> In order to set  *LocalResourceVisibility.PUBLIC* ,isPublic checks permission 
> of all the ancestors directories for the executable directory . It goes till 
> the root folder to check permission of all the parents 
> (ancestorsHaveExecutePermissions) 
> checkPermissionOfOther calls  FileStatus getFileStatus to check the 
> permission .
> If the   FileStatus getFileStatus throws exception Spark Submit fails . It 
> didn't sets the permission to Private.
> if (isPublic(conf, uri, statCache))
> { LocalResourceVisibility.PUBLIC }
> else
> { LocalResourceVisibility.PRIVATE }
> Generally if the user doesn't have permission to check for root folder 
> (specifically in case of cloud file system(GCS)  (for the buckets)  , methods 
> throws error IOException(Error accessing Bucket).
>  
> *Ideally if there is an error in isPublic , which means Spark isn't able to 
> determine the execution permission of all the parents directory , it should 
> set the LocalResourceVisibility.PRIVATE.  However, it currently throws an 
> exception in isPublic and hence Spark Submit fails*
>  
>  






[jira] [Commented] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception

2023-04-30 Thread Pralabh Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718163#comment-17718163
 ] 

Pralabh Kumar commented on SPARK-43235:
---

Gentle ping for review. I can create a PR for this.

> ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE 
> if isPublic throws exception
> --
>
> Key: SPARK-43235
> URL: https://issues.apache.org/jira/browse/SPARK-43235
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Pralabh Kumar
>Priority: Minor
>
> Hi Spark Team.
> Currently, the *ClientDistributedCacheManager* *getVisibility* method checks
> whether a resource's visibility can be set to private or public.
> In order to set *LocalResourceVisibility.PUBLIC*, isPublic checks the
> permissions of all ancestor directories of the executable directory. It walks
> up to the root directory to check the permissions of every parent
> (ancestorsHaveExecutePermissions).
> checkPermissionOfOther calls getFileStatus and checks the permissions on the
> returned FileStatus.
> If getFileStatus throws an exception, spark-submit fails; the visibility is
> not set to PRIVATE.
> if (isPublic(conf, uri, statCache)) {
>   LocalResourceVisibility.PUBLIC
> } else {
>   LocalResourceVisibility.PRIVATE
> }
> Generally, if the user doesn't have permission to check the root folder
> (specifically on cloud file systems such as GCS, for the buckets), the method
> throws an IOException ("Error accessing Bucket").
>  
> *Ideally, if there is an error in isPublic, which means Spark isn't able to
> determine the execute permission of all the parent directories, it should
> fall back to LocalResourceVisibility.PRIVATE. However, it currently lets the
> exception from isPublic propagate, and hence spark-submit fails.*
>  
>  






[jira] [Commented] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception

2023-04-28 Thread Pralabh Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717566#comment-17717566
 ] 

Pralabh Kumar commented on SPARK-43235:
---

[~gurwls223] Can you please look into this?

> ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE 
> if isPublic throws exception
> --
>
> Key: SPARK-43235
> URL: https://issues.apache.org/jira/browse/SPARK-43235
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Pralabh Kumar
>Priority: Minor
>
> Hi Spark Team.
> Currently, the *ClientDistributedCacheManager* *getVisibility* method checks
> whether a resource's visibility can be set to private or public.
> In order to set *LocalResourceVisibility.PUBLIC*, isPublic checks the
> permissions of all ancestor directories of the executable directory. It walks
> up to the root directory to check the permissions of every parent
> (ancestorsHaveExecutePermissions).
> checkPermissionOfOther calls getFileStatus and checks the permissions on the
> returned FileStatus.
> If getFileStatus throws an exception, spark-submit fails; the visibility is
> not set to PRIVATE.
> if (isPublic(conf, uri, statCache)) {
>   LocalResourceVisibility.PUBLIC
> } else {
>   LocalResourceVisibility.PRIVATE
> }
> Generally, if the user doesn't have permission to check the root folder
> (specifically on cloud file systems such as GCS, for the buckets), the method
> throws an IOException ("Error accessing Bucket").
>  
> *Ideally, if there is an error in isPublic, which means Spark isn't able to
> determine the execute permission of all the parent directories, it should
> fall back to LocalResourceVisibility.PRIVATE. However, it currently lets the
> exception from isPublic propagate, and hence spark-submit fails.*
>  
>  






[jira] [Created] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception

2023-04-22 Thread Pralabh Kumar (Jira)
Pralabh Kumar created SPARK-43235:
-

 Summary: ClientDistributedCacheManager doesn't set the 
LocalResourceVisibility.PRIVATE if isPublic throws exception
 Key: SPARK-43235
 URL: https://issues.apache.org/jira/browse/SPARK-43235
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Pralabh Kumar


Hi Spark Team.

Currently, the *ClientDistributedCacheManager* *getVisibility* method checks
whether a resource's visibility can be set to private or public.

In order to set *LocalResourceVisibility.PUBLIC*, isPublic checks the
permissions of all ancestor directories of the executable directory. It walks
up to the root directory to check the permissions of every parent
(ancestorsHaveExecutePermissions).

checkPermissionOfOther calls getFileStatus and checks the permissions on the
returned FileStatus.

If getFileStatus throws an exception, spark-submit fails; the visibility is
not set to PRIVATE.

if (isPublic(conf, uri, statCache)) {
  LocalResourceVisibility.PUBLIC
} else {
  LocalResourceVisibility.PRIVATE
}

Generally, if the user doesn't have permission to check the root folder
(specifically on cloud file systems such as GCS, for the buckets), the method
throws an IOException ("Error accessing Bucket").

*Ideally, if there is an error in isPublic, which means Spark isn't able to
determine the execute permission of all the parent directories, it should fall
back to LocalResourceVisibility.PRIVATE. However, it currently lets the
exception from isPublic propagate, and hence spark-submit fails.*

 

 






[jira] [Commented] (SPARK-36728) Can't create datetime object from anything other then year column Pyspark - koalas

2023-01-15 Thread Pralabh Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677198#comment-17677198
 ] 

Pralabh Kumar commented on SPARK-36728:
---

[~gurwls223] I think this can be closed, as it was fixed as part of SPARK-36742.

> Can't create datetime object from anything other then year column Pyspark - 
> koalas
> --
>
> Key: SPARK-36728
> URL: https://issues.apache.org/jira/browse/SPARK-36728
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
> Attachments: pyspark_date.txt, pyspark_date2.txt
>
>
> If I create a datetime object it must be from columns named year.
>  
> df = ps.DataFrame({'year': [2015, 2016],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                    'hour': [2, 3],
>                    'minute': [10, 30],
>                    'second': [21, 25]})
> df.info()
> Int64Index: 2 entries, 1 to 0
> Data columns (total 6 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      int64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   hour    2 non-null      int64
>  4   minute  2 non-null      int64
>  5   second  2 non-null      int64
> dtypes: int64(6)
> df['date'] = ps.to_datetime(df[['year', 'month', 'day']])
> df.info()
> Int64Index: 2 entries, 1 to 0
> Data columns (total 7 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      int64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   hour    2 non-null      int64
>  4   minute  2 non-null      int64
>  5   second  2 non-null      int64
>  6   date    2 non-null      datetime64
> dtypes: datetime64(1), int64(6)
> df_test = ps.DataFrame({'testyear': [2015, 2016],
>                         'testmonth': [2, 3],
>                         'testday': [4, 5],
>                         'hour': [2, 3],
>                         'minute': [10, 30],
>                         'second': [21, 25]})
> df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
> ---------------------------------------------------------------------------
> KeyError                                  Traceback (most recent call last)
> /tmp/ipykernel_73/904491906.py in <module>
> ----> 1 df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
> /opt/spark/python/pyspark/pandas/frame.py in __getitem__(self, key)
>   11853             return self.loc[:, key]
>   11854         elif is_list_like(key):
> > 11855             return self.loc[:, list(key)]
>   11856         raise NotImplementedError(key)
>   11857
> /opt/spark/python/pyspark/pandas/indexing.py in __getitem__(self, key)
>     476                 returns_series,
>     477                 series_name,
> --> 478             ) = self._select_cols(cols_sel)
>     479
>     480             if cond is None and limit is None and returns_series:
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols(self, cols_sel, missing_keys)
>     322             return self._select_cols_else(cols_sel, missing_keys)
>     323         elif is_list_like(cols_sel):
> --> 324             return self._select_cols_by_iterable(cols_sel, missing_keys)
>     325         else:
>     326             return self._select_cols_else(cols_sel, missing_keys)
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols_by_iterable(self, cols_sel, missing_keys)
>    1352                 if not found:
>    1353                     if missing_keys is None:
> -> 1354                         raise KeyError("['{}'] not in index".format(name_like_string(key)))
>    1355                     else:
>    1356                         missing_keys.append(key)
> KeyError: "['testyear'] not in index"
> df_test
>    testyear  testmonth  testday  hour  minute  second
> 0      2015          2        4     2      10      21
> 1      2016          3        5     3      30      25






[jira] [Commented] (SPARK-36728) Can't create datetime object from anything other then year column Pyspark - koalas

2023-01-15 Thread Pralabh Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677187#comment-17677187
 ] 

Pralabh Kumar commented on SPARK-36728:
---

I think this issue is not reproducible on Spark 3.4. Please confirm.

> Can't create datetime object from anything other then year column Pyspark - 
> koalas
> --
>
> Key: SPARK-36728
> URL: https://issues.apache.org/jira/browse/SPARK-36728
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
> Attachments: pyspark_date.txt, pyspark_date2.txt
>
>
> If I create a datetime object it must be from columns named year.
>  
> df = ps.DataFrame({'year': [2015, 2016],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                    'hour': [2, 3],
>                    'minute': [10, 30],
>                    'second': [21, 25]})
> df.info()
> Int64Index: 2 entries, 1 to 0
> Data columns (total 6 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      int64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   hour    2 non-null      int64
>  4   minute  2 non-null      int64
>  5   second  2 non-null      int64
> dtypes: int64(6)
> df['date'] = ps.to_datetime(df[['year', 'month', 'day']])
> df.info()
> Int64Index: 2 entries, 1 to 0
> Data columns (total 7 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      int64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   hour    2 non-null      int64
>  4   minute  2 non-null      int64
>  5   second  2 non-null      int64
>  6   date    2 non-null      datetime64
> dtypes: datetime64(1), int64(6)
> df_test = ps.DataFrame({'testyear': [2015, 2016],
>                         'testmonth': [2, 3],
>                         'testday': [4, 5],
>                         'hour': [2, 3],
>                         'minute': [10, 30],
>                         'second': [21, 25]})
> df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
> ---------------------------------------------------------------------------
> KeyError                                  Traceback (most recent call last)
> /tmp/ipykernel_73/904491906.py in <module>
> ----> 1 df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
> /opt/spark/python/pyspark/pandas/frame.py in __getitem__(self, key)
>   11853             return self.loc[:, key]
>   11854         elif is_list_like(key):
> > 11855             return self.loc[:, list(key)]
>   11856         raise NotImplementedError(key)
>   11857
> /opt/spark/python/pyspark/pandas/indexing.py in __getitem__(self, key)
>     476                 returns_series,
>     477                 series_name,
> --> 478             ) = self._select_cols(cols_sel)
>     479
>     480             if cond is None and limit is None and returns_series:
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols(self, cols_sel, missing_keys)
>     322             return self._select_cols_else(cols_sel, missing_keys)
>     323         elif is_list_like(cols_sel):
> --> 324             return self._select_cols_by_iterable(cols_sel, missing_keys)
>     325         else:
>     326             return self._select_cols_else(cols_sel, missing_keys)
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols_by_iterable(self, cols_sel, missing_keys)
>    1352                 if not found:
>    1353                     if missing_keys is None:
> -> 1354                         raise KeyError("['{}'] not in index".format(name_like_string(key)))
>    1355                     else:
>    1356                         missing_keys.append(key)
> KeyError: "['testyear'] not in index"
> df_test
>    testyear  testmonth  testday  hour  minute  second
> 0      2015          2        4     2      10      21
> 1      2016          3        5     3      30      25


