[jira] [Updated] (SPARK-44166) Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite
[ https://issues.apache.org/jira/browse/SPARK-44166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pralabh Kumar updated SPARK-44166:
--
Description:

Currently in InsertIntoHiveTable.scala there is no way to pass dynamicPartitionOverwrite as true when calling saveAsHiveFile. When dynamicPartitionOverwrite is true, Spark uses its built-in FileCommitProtocol instead of Hadoop's FileOutputCommitter, which is more performant.

Here is the proposed solution for insert overwrite into a Hive table.

Current code:

{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options)
{code}

Proposed code: introduce a conf flag and read it as enableDynamicPartitionOverwrite:

{code:java}
val USE_FILECOMMITPROTOCOL_DYNAMIC_PARTITION_OVERWRITE =
  buildConf("spark.sql.hive.filecommit.dynamicPartitionOverwrite")
{code}

{code:java}
val enableDynamicPartitionOverwrite =
  SQLConf.get.getConf(HiveUtils.USE_FILECOMMITPROTOCOL_DYNAMIC_PARTITION_OVERWRITE)
logWarning(s"enableDynamicPartitionOverwrite: $enableDynamicPartitionOverwrite")
{code}

Now, if enableDynamicPartitionOverwrite is true, numDynamicPartitions > 0, and overwrite is true, pass dynamicPartitionOverwrite as true:
{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options,
  dynamicPartitionOverwrite =
    enableDynamicPartitionOverwrite && numDynamicPartitions > 0 && overwrite)
{code}

In saveAsHiveFile:

{code:java}
val committer = FileCommitProtocol.instantiate(
  sparkSession.sessionState.conf.fileCommitProtocolClass,
  jobId = java.util.UUID.randomUUID().toString,
  outputPath = outputLocation,
  dynamicPartitionOverwrite = dynamicPartitionOverwrite)
{code}

This will internally be invoked with dynamicPartitionOverwrite set to true:

{code:java}
class SQLHadoopMapReduceCommitProtocol(
    jobId: String,
    path: String,
    dynamicPartitionOverwrite: Boolean = false)
  extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite)
{code}

> Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite
> -----------------------------------------------------------------------
>
> Key: SPARK-44166
> URL: https://issues.apache.org/jira/browse/SPARK-44166
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.1
> Reporter: Pralabh Kumar
>
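Taken together, the proposal gates the committer choice on three conditions. Below is a minimal sketch of that decision logic, written in Python purely for illustration (the actual change is Scala inside InsertIntoHiveTable.scala; the flag here stands in for the proposed spark.sql.hive.filecommit.dynamicPartitionOverwrite conf):

```python
# Sketch only: models the proposed gating logic, not Spark's actual internals.

def should_use_dynamic_overwrite(enable_flag: bool,
                                 num_dynamic_partitions: int,
                                 overwrite: bool) -> bool:
    """Mirror of: enableDynamicPartitionOverwrite && numDynamicPartitions > 0 && overwrite."""
    return enable_flag and num_dynamic_partitions > 0 and overwrite

def pick_committer(dynamic_partition_overwrite: bool) -> str:
    # With dynamicPartitionOverwrite = true, FileCommitProtocol.instantiate builds a
    # committer that stages files and moves whole partitions on commit, instead of
    # relying on Hadoop FileOutputCommitter's rename behaviour.
    return ("built-in FileCommitProtocol" if dynamic_partition_overwrite
            else "Hadoop FileOutputCommitter")

# INSERT OVERWRITE with dynamic partitions and the conf enabled:
print(pick_committer(should_use_dynamic_overwrite(True, 2, True)))
# -> built-in FileCommitProtocol
```

Note that any one condition being false (flag disabled, no dynamic partitions, or a plain INSERT) keeps the existing committer path, so the change is opt-in.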
[jira] [Created] (SPARK-44166) Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite
Pralabh Kumar created SPARK-44166:
-
Summary: Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite
Key: SPARK-44166
URL: https://issues.apache.org/jira/browse/SPARK-44166
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.4.1
Reporter: Pralabh Kumar

Currently in InsertIntoHiveTable.scala there is no way to pass dynamicPartitionOverwrite as true when calling saveAsHiveFile. When dynamicPartitionOverwrite is true, Spark uses its built-in FileCommitProtocol instead of Hadoop's FileOutputCommitter, which is more performant.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception
[ https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720979#comment-17720979 ] Pralabh Kumar commented on SPARK-43235:
---
Can anyone please look into this? If it is OK, I can create a PR for it.

> ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE
> if isPublic throws exception
> --
>
> Key: SPARK-43235
> URL: https://issues.apache.org/jira/browse/SPARK-43235
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Pralabh Kumar
> Priority: Minor
>
> Hi Spark Team.
> Currently the *ClientDistributedCacheManager* *getVisibility* method checks
> whether resource visibility can be set to private or public.
> In order to set *LocalResourceVisibility.PUBLIC*, isPublic checks the permissions
> of all ancestor directories of the executable directory. It walks up to the root
> folder to check the permissions of all parents (ancestorsHaveExecutePermissions).
> checkPermissionOfOther calls FileStatus getFileStatus to check the permission.
> If getFileStatus throws an exception, Spark Submit fails; the visibility is
> never set to PRIVATE.
>
> if (isPublic(conf, uri, statCache)) {
>   LocalResourceVisibility.PUBLIC
> } else {
>   LocalResourceVisibility.PRIVATE
> }
>
> Generally, if the user doesn't have permission to check the root folder
> (specifically on cloud file systems such as GCS, for buckets), the method
> throws IOException("Error accessing Bucket").
>
> *Ideally, if there is an error in isPublic, which means Spark isn't able to
> determine the execute permission of all parent directories, it should set
> LocalResourceVisibility.PRIVATE. However, it currently throws an exception
> in isPublic and hence Spark Submit fails.*
>
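The fix the report asks for is a fallback: if the public-visibility check itself fails, treat the resource as private instead of aborting spark-submit. A small Python sketch of that error-handling shape (illustrative only; the real code is Scala in ClientDistributedCacheManager, and `is_public` here stands in for the isPublic/ancestorsHaveExecutePermissions chain):

```python
# Sketch only: the proposed "default to PRIVATE on error" fallback.
from enum import Enum

class Visibility(Enum):
    PUBLIC = "PUBLIC"
    PRIVATE = "PRIVATE"

def get_visibility(is_public) -> Visibility:
    """If the permission walk up to the root raises (e.g. an IOException on a
    GCS bucket the user cannot stat), fall back to PRIVATE rather than fail."""
    try:
        return Visibility.PUBLIC if is_public() else Visibility.PRIVATE
    except OSError:
        # Cannot determine the ancestors' execute permissions: assume PRIVATE.
        return Visibility.PRIVATE

def denied():
    # Models getFileStatus failing, e.g. "Error accessing Bucket" on GCS.
    raise OSError("Error accessing Bucket")

print(get_visibility(denied).value)  # PRIVATE, instead of a failed spark-submit
```

Defaulting to PRIVATE is the conservative choice: a private resource is merely re-localized per user, whereas wrongly marking it PUBLIC could expose it.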
[jira] [Updated] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception
[ https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pralabh Kumar updated SPARK-43235:
--
Description:

Hi Spark Team.

Currently the *ClientDistributedCacheManager* *getVisibility* method checks whether resource visibility can be set to private or public.

In order to set *LocalResourceVisibility.PUBLIC*, isPublic checks the permissions of all ancestor directories of the executable directory. It walks up to the root folder to check the permissions of all parents (ancestorsHaveExecutePermissions).

checkPermissionOfOther calls FileStatus getFileStatus to check the permission.

If getFileStatus throws an exception, Spark Submit fails; the visibility is never set to PRIVATE.

if (isPublic(conf, uri, statCache)) {
  LocalResourceVisibility.PUBLIC
} else {
  LocalResourceVisibility.PRIVATE
}

Generally, if the user doesn't have permission to check the root folder (specifically on cloud file systems such as GCS, for buckets), the method throws IOException("Error accessing Bucket").

*Ideally, if there is an error in isPublic, which means Spark isn't able to determine the execute permission of all parent directories, it should set LocalResourceVisibility.PRIVATE. However, it currently throws an exception in isPublic and hence Spark Submit fails.*
[jira] [Commented] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception
[ https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718163#comment-17718163 ] Pralabh Kumar commented on SPARK-43235:
---
Gentle ping to review. I can create a PR for the same.
[jira] [Commented] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception
[ https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717566#comment-17717566 ] Pralabh Kumar commented on SPARK-43235:
---
[~gurwls223] Can you please look into this?
[jira] [Created] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception
Pralabh Kumar created SPARK-43235:
-
Summary: ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception
Key: SPARK-43235
URL: https://issues.apache.org/jira/browse/SPARK-43235
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Pralabh Kumar

Hi Spark Team.

Currently the *ClientDistributedCacheManager* *getVisibility* method checks whether resource visibility can be set to private or public.

In order to set *LocalResourceVisibility.PUBLIC*, isPublic checks the permissions of all ancestor directories of the executable directory. It walks up to the root folder to check the permissions of all parents (ancestorsHaveExecutePermissions).

checkPermissionOfOther calls FileStatus getFileStatus to check the permission.

If getFileStatus throws an exception, Spark Submit fails; the visibility is never set to PRIVATE.

if (isPublic(conf, uri, statCache)) {
  LocalResourceVisibility.PUBLIC
} else {
  LocalResourceVisibility.PRIVATE
}

Generally, if the user doesn't have permission to check the root folder (specifically on cloud file systems such as GCS, for buckets), the method throws IOException("Error accessing Bucket").

*Ideally, if there is an error in isPublic, which means Spark isn't able to determine the execute permission of all parent directories, it should set LocalResourceVisibility.PRIVATE. However, it currently throws an exception in isPublic and hence Spark Submit fails.*
[jira] [Commented] (SPARK-36728) Can't create datetime object from anything other then year column Pyspark - koalas
[ https://issues.apache.org/jira/browse/SPARK-36728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677198#comment-17677198 ] Pralabh Kumar commented on SPARK-36728:
---
[~gurwls223] I think this can be closed, as it was fixed as part of SPARK-36742.

> Can't create datetime object from anything other then year column Pyspark -
> koalas
> --
>
> Key: SPARK-36728
> URL: https://issues.apache.org/jira/browse/SPARK-36728
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Bjørn Jørgensen
> Priority: Major
> Attachments: pyspark_date.txt, pyspark_date2.txt
>
> If I create a datetime object, it must be from columns named year, month, day.
>
> {code:python}
> df = ps.DataFrame({'year': [2015, 2016],
>                    'month': [2, 3], 'day': [4, 5],
>                    'hour': [2, 3], 'minute': [10, 30],
>                    'second': [21, 25]})
> df.info()
> {code}
> Int64Index: 2 entries, 1 to 0
> Data columns (total 6 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      int64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   hour    2 non-null      int64
>  4   minute  2 non-null      int64
>  5   second  2 non-null      int64
> dtypes: int64(6)
>
> {code:python}
> df['date'] = ps.to_datetime(df[['year', 'month', 'day']])
> df.info()
> {code}
> Int64Index: 2 entries, 1 to 0
> Data columns (total 7 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      int64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   hour    2 non-null      int64
>  4   minute  2 non-null      int64
>  5   second  2 non-null      int64
>  6   date    2 non-null      datetime64
> dtypes: datetime64(1), int64(6)
>
> {code:python}
> df_test = ps.DataFrame({'testyear': [2015, 2016],
>                         'testmonth': [2, 3], 'testday': [4, 5],
>                         'hour': [2, 3], 'minute': [10, 30],
>                         'second': [21, 25]})
> df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
> {code}
> The call fails while selecting the columns:
> {code}
> KeyError                                  Traceback (most recent call last)
> /tmp/ipykernel_73/904491906.py in <module>
> ----> 1 df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
> /opt/spark/python/pyspark/pandas/frame.py in __getitem__(self, key)
> /opt/spark/python/pyspark/pandas/indexing.py in __getitem__(self, key)
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols(self, cols_sel, missing_keys)
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols_by_iterable(self, cols_sel, missing_keys)
> KeyError: "['testyear'] not in index"
> {code}
>
> df_test:
>   testyear  testmonth  testday  hour  minute  second
> 0     2015          2        4     2      10      21
> 1     2016          3        5     3      30      25
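For context on the KeyError above: pandas-style to_datetime assembles a date only from columns with the reserved names year/month/day (plus optional hour/minute/second), so columns like testyear must be renamed first (and the reported snippet also selects from df rather than df_test, which is why the KeyError fires in indexing rather than in to_datetime itself). A stdlib-only sketch of the rename-then-assemble idea (pandas / pyspark.pandas are not used here; the RENAME mapping is hypothetical):

```python
from datetime import datetime

# Rows with the report's non-reserved column names.
rows = [
    {"testyear": 2015, "testmonth": 2, "testday": 4},
    {"testyear": 2016, "testmonth": 3, "testday": 5},
]

# Equivalent of df_test.rename(columns=...) before calling to_datetime.
RENAME = {"testyear": "year", "testmonth": "month", "testday": "day"}

def assemble_date(row: dict) -> datetime:
    # Map arbitrary column names onto the reserved ones, then build the date.
    renamed = {RENAME.get(k, k): v for k, v in row.items()}
    return datetime(renamed["year"], renamed["month"], renamed["day"])

dates = [assemble_date(r) for r in rows]
print(dates[0])  # 2015-02-04 00:00:00
```

In pyspark.pandas the same workaround would be renaming the columns to year/month/day before calling ps.to_datetime on the selection.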
[jira] [Commented] (SPARK-36728) Can't create datetime object from anything other then year column Pyspark - koalas
[ https://issues.apache.org/jira/browse/SPARK-36728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677187#comment-17677187 ] Pralabh Kumar commented on SPARK-36728:
---
I think this issue is not reproducible on Spark 3.4. Please confirm.