[jira] [Assigned] (SPARK-46641) Add maxBytesPerTrigger threshold option
Dongjoon Hyun reassigned SPARK-46641:
Assignee: Maksim Konstantinov

> Add maxBytesPerTrigger threshold option
> Key: SPARK-46641
> URL: https://issues.apache.org/jira/browse/SPARK-46641
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Maksim Konstantinov
> Assignee: Maksim Konstantinov
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0

--
This message was sent by Atlassian Jira (v8.20.10#820010)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46641) Add maxBytesPerTrigger threshold option
Dongjoon Hyun resolved SPARK-46641:
Resolution: Fixed
Issue resolved by pull request 44636
[https://github.com/apache/spark/pull/44636]

> Add maxBytesPerTrigger threshold option
> Key: SPARK-46641
> URL: https://issues.apache.org/jira/browse/SPARK-46641
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Maksim Konstantinov
> Assignee: Maksim Konstantinov
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
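The option resolved above is a byte-based analogue of the existing maxFilesPerTrigger soft limit. As an illustration only (the helper name and greedy admission policy below are hypothetical, not Spark's actual Scala implementation), a byte-budget admission step for one micro-batch could look like:

```python
def select_batch(files, max_bytes_per_trigger):
    """Greedily admit files into a micro-batch until the byte budget is hit.

    `files` is a list of (path, size_bytes) tuples. At least one file is
    always admitted, so a single oversized file cannot stall the stream.
    """
    batch, total = [], 0
    for path, size in files:
        if batch and total + size > max_bytes_per_trigger:
            break
        batch.append(path)
        total += size
    return batch

example = [("a.json", 40), ("b.json", 50), ("c.json", 30)]
first_batch = select_batch(example, 100)  # admits a.json and b.json (90 bytes)
```

Treating the threshold as a soft cap, as Spark's file-count limit does, is what keeps a file larger than the budget from blocking progress forever.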
[jira] [Updated] (SPARK-35303) Enable pinned thread mode by default
ASF GitHub Bot updated SPARK-35303:
Labels: pull-request-available (was: )

> Enable pinned thread mode by default
> Key: SPARK-35303
> URL: https://issues.apache.org/jira/browse/SPARK-35303
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.2.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.2.0
>
> Pinned thread mode was added in SPARK-22340. We should enable it by default
> to map each Python thread to a JVM thread, in order to prevent potential
> issues such as thread-local inheritance.
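For background on why the thread mapping matters: in plain Python, `threading.local` values are not inherited by child threads, which is the kind of gap pinned thread mode (and PySpark's InheritableThread) has to work around. A minimal stdlib-only demonstration (illustrative, not PySpark code):

```python
import threading

local = threading.local()
local.tag = "parent-value"

seen = {}

def child():
    # threading.local gives each thread its own namespace, so the child
    # does NOT see the parent's value -- the inheritance gap described above.
    seen["child_tag"] = getattr(local, "tag", None)

t = threading.Thread(target=child)
t.start()
t.join()

parent_tag = local.tag         # "parent-value"
child_tag = seen["child_tag"]  # None: the value was not inherited
```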
[jira] [Updated] (SPARK-35946) Respect Py4J server in InheritableThread API
ASF GitHub Bot updated SPARK-35946:
Labels: pull-request-available (was: )

> Respect Py4J server in InheritableThread API
> Key: SPARK-35946
> URL: https://issues.apache.org/jira/browse/SPARK-35946
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.2.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.2.0
>
> Currently we set the environment variables at the client side of Py4J
> (python/pyspark/util.py). If the Py4J gateway is created somewhere else
> (e.g., Zeppelin), it could introduce a breakage at:
> {code}
> from pyspark import SparkContext
> jvm = SparkContext._jvm
> thread_connection = jvm._gateway_client.get_thread_connection()
> # ^ the MLlibMLflowIntegrationSuite test suite failed at this line
> # `AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'`
> {code}
[jira] [Commented] (SPARK-47014) Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession
Hudson commented on SPARK-47014:
User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/45073

> Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession
> Key: SPARK-47014
> URL: https://issues.apache.org/jira/browse/SPARK-47014
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Xinrong Meng
> Priority: Major
>
> Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession
[jira] [Created] (SPARK-47014) Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession
Xinrong Meng created SPARK-47014:
Summary: Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession
Key: SPARK-47014
URL: https://issues.apache.org/jira/browse/SPARK-47014
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Xinrong Meng

Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession
[jira] [Updated] (SPARK-47013) Document the config spark.sql.streaming.minBatchesToRetain
Lingeshwaran Radhakrishnan updated SPARK-47013:
Summary: Document the config spark.sql.streaming.minBatchesToRetain (was: Add the config spark.sql.streaming.minBatchesToRetain to the docs)

> Document the config spark.sql.streaming.minBatchesToRetain
> Key: SPARK-47013
> URL: https://issues.apache.org/jira/browse/SPARK-47013
> Project: Spark
> Issue Type: Documentation
> Components: Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Lingeshwaran Radhakrishnan
> Priority: Major
>
> Add the config spark.sql.streaming.minBatchesToRetain to the [streaming
> docs|https://spark.apache.org/docs/latest/configuration.html#spark-streaming]
> page. It controls the minimum number of batches that must be retained and
> made recoverable.
> This would also help control the lifecycle of the state files held in the
> checkpoint folder, i.e., state files are cleaned up based on the config
> spark.sql.streaming.minBatchesToRetain.
[jira] [Updated] (SPARK-47013) Document the config spark.sql.streaming.minBatchesToRetain
Lingeshwaran Radhakrishnan updated SPARK-47013:
Priority: Minor (was: Major)

> Document the config spark.sql.streaming.minBatchesToRetain
> Key: SPARK-47013
> URL: https://issues.apache.org/jira/browse/SPARK-47013
> Project: Spark
> Issue Type: Documentation
> Components: Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Lingeshwaran Radhakrishnan
> Priority: Minor
>
> Add the config spark.sql.streaming.minBatchesToRetain to the [streaming
> docs|https://spark.apache.org/docs/latest/configuration.html#spark-streaming]
> page. It controls the minimum number of batches that must be retained and
> made recoverable.
> This would also help control the lifecycle of the state files held in the
> checkpoint folder, i.e., state files are cleaned up based on the config
> spark.sql.streaming.minBatchesToRetain.
[jira] [Created] (SPARK-47013) Add the config spark.sql.streaming.minBatchesToRetain to the docs
Lingeshwaran Radhakrishnan created SPARK-47013:
Summary: Add the config spark.sql.streaming.minBatchesToRetain to the docs
Key: SPARK-47013
URL: https://issues.apache.org/jira/browse/SPARK-47013
Project: Spark
Issue Type: Documentation
Components: Structured Streaming
Affects Versions: 3.5.0
Reporter: Lingeshwaran Radhakrishnan

Add the config spark.sql.streaming.minBatchesToRetain to the [streaming docs|https://spark.apache.org/docs/latest/configuration.html#spark-streaming] page. It controls the minimum number of batches that must be retained and made recoverable.
This would also help control the lifecycle of the state files held in the checkpoint folder, i.e., state files are cleaned up based on the config spark.sql.streaming.minBatchesToRetain.
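As a sketch of the retention semantics this config documents (pure Python with a hypothetical helper name; the real cleanup lives in Spark's state store maintenance code), keeping the newest N batches recoverable means everything older is eligible for deletion:

```python
def batches_to_delete(committed_batch_ids, min_batches_to_retain):
    """Return the batch ids whose state files may be cleaned up, keeping
    at least the newest `min_batches_to_retain` batches recoverable."""
    if min_batches_to_retain <= 0:
        return []
    ordered = sorted(committed_batch_ids)
    if len(ordered) <= min_batches_to_retain:
        return []  # nothing old enough to prune yet
    return ordered[:-min_batches_to_retain]

# With 5 committed batches and a retention floor of 2, batches 0-2 can go.
doomed = batches_to_delete([0, 1, 2, 3, 4], 2)
```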
[jira] [Created] (SPARK-47012) Built-in SQL Function Support - Collate
Aleksandar Tomic created SPARK-47012:
Summary: Built-in SQL Function Support - Collate
Key: SPARK-47012
URL: https://issues.apache.org/jira/browse/SPARK-47012
Project: Spark
Issue Type: Task
Components: Spark Core
Affects Versions: 4.0.0
Reporter: Aleksandar Tomic
[jira] [Resolved] (SPARK-47002) Enforce that 'AnalyzeResult' 'orderBy' field is a list of pyspark.sql.functions.OrderingColumn
Takuya Ueshin resolved SPARK-47002:
Fix Version/s: 4.0.0
Resolution: Fixed
Issue resolved by pull request 45062
[https://github.com/apache/spark/pull/45062]

> Enforce that 'AnalyzeResult' 'orderBy' field is a list of pyspark.sql.functions.OrderingColumn
> Key: SPARK-47002
> URL: https://issues.apache.org/jira/browse/SPARK-47002
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Daniel
> Assignee: Daniel
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-47002) Enforce that 'AnalyzeResult' 'orderBy' field is a list of pyspark.sql.functions.OrderingColumn
Takuya Ueshin reassigned SPARK-47002:
Assignee: Daniel

> Enforce that 'AnalyzeResult' 'orderBy' field is a list of pyspark.sql.functions.OrderingColumn
> Key: SPARK-47002
> URL: https://issues.apache.org/jira/browse/SPARK-47002
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Daniel
> Assignee: Daniel
> Priority: Major
> Labels: pull-request-available
[jira] [Assigned] (SPARK-47011) Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight`
Dongjoon Hyun reassigned SPARK-47011:
Assignee: Dongjoon Hyun

> Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight`
> Key: SPARK-47011
> URL: https://issues.apache.org/jira/browse/SPARK-47011
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Resolved] (SPARK-47011) Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight`
Dongjoon Hyun resolved SPARK-47011:
Fix Version/s: 4.0.0
Resolution: Fixed
Issue resolved by pull request 45070
[https://github.com/apache/spark/pull/45070]

> Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight`
> Key: SPARK-47011
> URL: https://issues.apache.org/jira/browse/SPARK-47011
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-46690) Support profiling on FlatMapCoGroupsInBatchExec
Xinrong Meng reassigned SPARK-46690:
Assignee: Xinrong Meng

> Support profiling on FlatMapCoGroupsInBatchExec
> Key: SPARK-46690
> URL: https://issues.apache.org/jira/browse/SPARK-46690
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Takuya Ueshin
> Assignee: Xinrong Meng
> Priority: Major
[jira] [Resolved] (SPARK-46690) Support profiling on FlatMapCoGroupsInBatchExec
Xinrong Meng resolved SPARK-46690:
Resolution: Done
Resolved by https://github.com/apache/spark/pull/45050

> Support profiling on FlatMapCoGroupsInBatchExec
> Key: SPARK-46690
> URL: https://issues.apache.org/jira/browse/SPARK-46690
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Takuya Ueshin
> Assignee: Xinrong Meng
> Priority: Major
[jira] [Resolved] (SPARK-46689) Support profiling on FlatMapGroupsInBatchExec
Xinrong Meng resolved SPARK-46689:
Resolution: Done
Resolved by https://github.com/apache/spark/pull/45050

> Support profiling on FlatMapGroupsInBatchExec
> Key: SPARK-46689
> URL: https://issues.apache.org/jira/browse/SPARK-46689
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Takuya Ueshin
> Assignee: Xinrong Meng
> Priority: Major
> Labels: pull-request-available
[jira] [Assigned] (SPARK-46689) Support profiling on FlatMapGroupsInBatchExec
Xinrong Meng reassigned SPARK-46689:
Assignee: Xinrong Meng

> Support profiling on FlatMapGroupsInBatchExec
> Key: SPARK-46689
> URL: https://issues.apache.org/jira/browse/SPARK-46689
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Takuya Ueshin
> Assignee: Xinrong Meng
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-47011) Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight`
Dongjoon Hyun updated SPARK-47011:
Parent: SPARK-44111
Issue Type: Sub-task (was: Task)

> Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight`
> Key: SPARK-47011
> URL: https://issues.apache.org/jira/browse/SPARK-47011
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-47011) Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight`
Dongjoon Hyun created SPARK-47011:
Summary: Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight`
Key: SPARK-47011
URL: https://issues.apache.org/jira/browse/SPARK-47011
Project: Spark
Issue Type: Task
Components: MLlib
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-47011) Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight`
ASF GitHub Bot updated SPARK-47011:
Labels: pull-request-available (was: )

> Remove deprecated `BinaryClassificationMetrics.scoreLabelsWeight`
> Key: SPARK-47011
> URL: https://issues.apache.org/jira/browse/SPARK-47011
> Project: Spark
> Issue Type: Task
> Components: MLlib
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-47010) Kubernetes: support csi driver for volume type
Oleg Frenkel created SPARK-47010:
Summary: Kubernetes: support csi driver for volume type
Key: SPARK-47010
URL: https://issues.apache.org/jira/browse/SPARK-47010
Project: Spark
Issue Type: New Feature
Components: Kubernetes
Affects Versions: 3.5.0
Reporter: Oleg Frenkel

Today Spark supports the following types of Kubernetes [volumes|https://kubernetes.io/docs/concepts/storage/volumes/]: hostPath, emptyDir, nfs and persistentVolumeClaim. In our case, the Kubernetes cluster is multi-tenant and we cannot make cluster-wide changes when deploying our application to the cluster. Our application requires a static shared file system. So we cannot use hostPath (we don't control the hosting VMs) or persistentVolumeClaim (it requires a cluster-wide change when deploying the PV), and our security department does not allow nfs. What would help in our case is the use of a csi driver (taken from here: https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/master/deploy/example/e2e_usage.md#option3-inline-volume):
{code:yaml}
kind: Pod
apiVersion: v1
metadata:
  name: nginx-azurefile-inline-volume
spec:
  nodeSelector:
    "kubernetes.io/os": linux
  containers:
    - image: mcr.microsoft.com/oss/nginx/nginx:1.19.5
      name: nginx-azurefile
      command:
        - "/bin/bash"
        - "-c"
        - set -euo pipefail; while true; do echo $(date) >> /mnt/azurefile/outfile; sleep 1; done
      volumeMounts:
        - name: persistent-storage
          mountPath: "/mnt/azurefile"
          readOnly: false
  volumes:
    - name: persistent-storage
      csi:
        driver: file.csi.azure.com
        volumeAttributes:
          shareName: EXISTING_SHARE_NAME  # required
          secretName: azure-secret  # required
          mountOptions: "dir_mode=0777,file_mode=0777,cache=strict,actimeo=30,nosharesock"  # optional
{code}
[jira] [Created] (SPARK-47009) Create table with collation
Stefan Kandic created SPARK-47009:
Summary: Create table with collation
Key: SPARK-47009
URL: https://issues.apache.org/jira/browse/SPARK-47009
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Stefan Kandic

Add support for creating tables with columns containing non-default collated data.
[jira] [Created] (SPARK-47008) Spark to support S3 Express One Zone Storage
Steve Loughran created SPARK-47008:
Summary: Spark to support S3 Express One Zone Storage
Key: SPARK-47008
URL: https://issues.apache.org/jira/browse/SPARK-47008
Project: Spark
Issue Type: Sub-task
Components: Spark Core
Affects Versions: 3.5.1
Reporter: Steve Loughran

Hadoop 3.4.0 adds support for AWS S3 Express One Zone Storage. Most of this is transparent. However, one aspect which can surface as an issue is that these stores report prefixes in a listing when there are pending uploads, *even when there are no files underneath*.
This leads to a situation where a listStatus of a path returns a list of file status entries which appears to contain one or more directories, but a listStatus on that path raises a FileNotFoundException: there is nothing there.
HADOOP-18996 handles this in all of the Hadoop code, including FileInputFormat. A filesystem can now be probed for inconsistent directory listings through {{fs.hasPathCapability(path, "fs.capability.directory.listing.inconsistent")}}. If true, then treewalking code SHOULD NOT report a failure if, when walking into a subdirectory, a list/getFileStatus on that directory raises a FileNotFoundException.
Although most of this is handled in the Hadoop code, there are some places where treewalking is done inside Spark. These need to be identified and made resilient to failure on the recurse down the tree:
* SparkHadoopUtil list methods, especially listLeafStatuses used by OrcFileOperator
* org.apache.spark.util.Utils#fetchHcfsFile

{{org.apache.hadoop.fs.FileUtil.maybeIgnoreMissingDirectory()}} can assist here, or the logic can be replicated. Using the Hadoop implementation would be better from a maintenance perspective.
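The resilient treewalk the ticket asks for can be sketched in pure Python (illustrative only; the real fix would live in the Scala SparkHadoopUtil/Utils code and could delegate to FileUtil.maybeIgnoreMissingDirectory). The fake filesystem below simulates the S3 Express behaviour: a prefix that appears in its parent's listing but raises FileNotFoundError when listed directly.

```python
def list_leaf_files(fs_list, root, inconsistent_listing=True):
    """Recursively collect leaf files, tolerating directories that vanish
    between the parent listing and the recursive list call (as S3 Express
    can report prefixes for pending uploads with nothing underneath)."""
    leaves = []
    try:
        entries = fs_list(root)
    except FileNotFoundError:
        if inconsistent_listing:
            return []  # phantom prefix: treat as empty rather than fail
        raise
    for name, is_dir in entries:
        if is_dir:
            leaves.extend(list_leaf_files(fs_list, name, inconsistent_listing))
        else:
            leaves.append(name)
    return leaves

# "ghost" shows up in the listing of "/" but cannot itself be listed.
tree = {"/": [("a.txt", False), ("ghost", True)]}

def fake_list(path):
    if path not in tree:
        raise FileNotFoundError(path)
    return tree[path]

files = list_leaf_files(fake_list, "/")  # the walk survives the phantom dir
```

With `inconsistent_listing=False` the same walk would propagate the FileNotFoundException, which is today's failure mode.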
[jira] [Updated] (SPARK-47007) Add SortMap function
ASF GitHub Bot updated SPARK-47007:
Labels: pull-request-available (was: )

> Add SortMap function
> Key: SPARK-47007
> URL: https://issues.apache.org/jira/browse/SPARK-47007
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Stefan Kandic
> Priority: Major
> Labels: pull-request-available
>
> In order to properly support GROUP BY on a map type, we first need the
> ability to sort the map so that the comparisons can be done later.
[jira] [Assigned] (SPARK-39910) DataFrameReader API cannot read files from hadoop archives (.har)
Wenchen Fan reassigned SPARK-39910:
Assignee: Christophe Préaud

> DataFrameReader API cannot read files from hadoop archives (.har)
> Key: SPARK-39910
> URL: https://issues.apache.org/jira/browse/SPARK-39910
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
> Reporter: Christophe Préaud
> Assignee: Christophe Préaud
> Priority: Minor
> Labels: DataFrameReader, pull-request-available
>
> Reading a file from a Hadoop archive using the DataFrameReader API returns
> an empty Dataset:
> {code:java}
> scala> val df = spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
> df: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> df.count
> res7: Long = 0
> {code}
> On the other hand, reading the same file from the same Hadoop archive, but
> using the RDD API, yields the correct result:
> {code:java}
> scala> val df = sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value")
> df: org.apache.spark.sql.DataFrame = [value: string]
> scala> df.count
> res8: Long = 5589
> {code}
[jira] [Resolved] (SPARK-39910) DataFrameReader API cannot read files from hadoop archives (.har)
Wenchen Fan resolved SPARK-39910:
Fix Version/s: 3.5.1, 4.0.0
Resolution: Fixed
Issue resolved by pull request 43463
[https://github.com/apache/spark/pull/43463]

> DataFrameReader API cannot read files from hadoop archives (.har)
> Key: SPARK-39910
> URL: https://issues.apache.org/jira/browse/SPARK-39910
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
> Reporter: Christophe Préaud
> Assignee: Christophe Préaud
> Priority: Minor
> Labels: DataFrameReader, pull-request-available
> Fix For: 3.5.1, 4.0.0
>
> Reading a file from a Hadoop archive using the DataFrameReader API returns
> an empty Dataset:
> {code:java}
> scala> val df = spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
> df: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> df.count
> res7: Long = 0
> {code}
> On the other hand, reading the same file from the same Hadoop archive, but
> using the RDD API, yields the correct result:
> {code:java}
> scala> val df = sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value")
> df: org.apache.spark.sql.DataFrame = [value: string]
> scala> df.count
> res8: Long = 5589
> {code}
[jira] [Created] (SPARK-47007) Add SortMap function
Stefan Kandic created SPARK-47007:
Summary: Add SortMap function
Key: SPARK-47007
URL: https://issues.apache.org/jira/browse/SPARK-47007
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Stefan Kandic

In order to properly support GROUP BY on a map type, we first need the ability to sort the map so that the comparisons can be done later.
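The motivation above (making map values comparable so they can be grouped) can be illustrated in pure Python by canonicalizing each map to its key-sorted entries. This is a sketch of the idea only, not the Catalyst expression the ticket proposes:

```python
def sort_map(m):
    """Return the map's entries as a tuple sorted by key, giving maps a
    canonical, hashable form so equal maps compare (and group) equal
    regardless of insertion order."""
    return tuple(sorted(m.items()))

# Two rows carry the same map in different insertion orders; a third differs.
rows = [{"b": 2, "a": 1}, {"a": 1, "b": 2}, {"a": 9}]

groups = {}
for m in rows:
    groups.setdefault(sort_map(m), []).append(m)

group_sizes = sorted(len(v) for v in groups.values())  # [1, 2]
```

Without the canonical form, the maps would have no stable sort key to group on, which is exactly the gap SortMap fills for the SQL engine.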
[jira] [Resolved] (SPARK-46999) ExpressionWithUnresolvedIdentifier should include other expressions in the expression tree
Wenchen Fan resolved SPARK-46999:
Fix Version/s: 4.0.0
Assignee: Wenchen Fan
Resolution: Fixed

> ExpressionWithUnresolvedIdentifier should include other expressions in the expression tree
> Key: SPARK-46999
> URL: https://issues.apache.org/jira/browse/SPARK-46999
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Wenchen Fan
> Assignee: Wenchen Fan
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Resolved] (SPARK-46993) Allow session variables in more places such as from_json for schema
Wenchen Fan resolved SPARK-46993:
Fix Version/s: 4.0.0
Assignee: Serge Rielau
Resolution: Fixed

> Allow session variables in more places such as from_json for schema
> Key: SPARK-46993
> URL: https://issues.apache.org/jira/browse/SPARK-46993
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.4.2
> Reporter: Serge Rielau
> Assignee: Serge Rielau
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> It appears we do not allow session variables to provide a schema for
> from_json(). This is likely a generic restriction regarding constant folding.
[jira] [Updated] (SPARK-47006) Refactor refill() method to isExhausted() in NioBufferedFileInputStream
Yang Jie updated SPARK-47006:
Description: Currently, in NioBufferedFileInputStream, the refill() method is always invoked in a negated context (!refill()), which can be confusing and counter-intuitive. We can refactor the method so that it's no longer necessary to invert the result of the method call.

> Refactor refill() method to isExhausted() in NioBufferedFileInputStream
> Key: SPARK-47006
> URL: https://issues.apache.org/jira/browse/SPARK-47006
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Yang Jie
> Priority: Minor
>
> Currently, in NioBufferedFileInputStream, the refill() method is always
> invoked in a negated context (!refill()), which can be confusing and
> counter-intuitive. We can refactor the method so that it's no longer
> necessary to invert the result of the method call.
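The readability argument can be seen in a small pure-Python analogue (illustrative only; the real class is Java's NioBufferedFileInputStream, and the method names below are paraphrases):

```python
class BufferedReader:
    """Minimal sketch contrasting the two styles: a refill() that callers
    must negate versus a positively named is_exhausted() probe."""

    def __init__(self, chunks):
        self._chunks = list(chunks)  # pending data blocks, oldest first
        self._buf = b""
        self._pos = 0

    def _refill_if_needed(self):
        # Pull the next chunk into the buffer once the current one is spent.
        if self._pos == len(self._buf) and self._chunks:
            self._buf = self._chunks.pop(0)
            self._pos = 0

    def is_exhausted(self):
        # Reads naturally at call sites: `if reader.is_exhausted(): ...`
        # instead of the inverted `if not reader.refill(): ...`.
        self._refill_if_needed()
        return self._pos == len(self._buf)

    def read_byte(self):
        if self.is_exhausted():
            return -1  # mirrors InputStream's end-of-stream convention
        b = self._buf[self._pos]
        self._pos += 1
        return b

r = BufferedReader([b"hi"])
out = [r.read_byte(), r.read_byte(), r.read_byte()]  # [104, 105, -1]
```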
[jira] [Created] (SPARK-47006) Refactor refill() method to isExhausted() in NioBufferedFileInputStream
Yang Jie created SPARK-47006:
Summary: Refactor refill() method to isExhausted() in NioBufferedFileInputStream
Key: SPARK-47006
URL: https://issues.apache.org/jira/browse/SPARK-47006
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 4.0.0
Reporter: Yang Jie