[jira] [Commented] (SPARK-24858) Avoid unnecessary parquet footer reads
[ https://issues.apache.org/jira/browse/SPARK-24858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548880#comment-16548880 ] Apache Spark commented on SPARK-24858: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21814 > Avoid unnecessary parquet footer reads > -- > > Key: SPARK-24858 > URL: https://issues.apache.org/jira/browse/SPARK-24858 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Currently the same Parquet footer is read twice in the function > `buildReaderWithPartitionValues` of ParquetFileFormat if filter push down is > enabled. > Fix it with simple changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
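For context, a minimal sketch of the idea behind the fix (illustrative only, not the code in the PR above): read each file's footer once via parquet-hadoop and reuse the resulting ParquetMetadata for both the filter-pushdown path and reader construction, instead of calling readFooter in each place.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.metadata.ParquetMetadata

// Read the footer a single time per file and pass the result to both
// consumers, so enabling filter pushdown does not trigger a second read.
def readFooterOnce(conf: Configuration, filePath: Path): ParquetMetadata =
  ParquetFileReader.readFooter(conf, filePath, SKIP_ROW_GROUPS)
{code}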
[jira] [Assigned] (SPARK-24858) Avoid unnecessary parquet footer reads
[ https://issues.apache.org/jira/browse/SPARK-24858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24858: Assignee: Apache Spark > Avoid unnecessary parquet footer reads > -- > > Key: SPARK-24858 > URL: https://issues.apache.org/jira/browse/SPARK-24858 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Currently the same Parquet footer is read twice in the function > `buildReaderWithPartitionValues` of ParquetFileFormat if filter push down is > enabled. > Fix it with simple changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24858) Avoid unnecessary parquet footer reads
[ https://issues.apache.org/jira/browse/SPARK-24858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24858: Assignee: (was: Apache Spark) > Avoid unnecessary parquet footer reads > -- > > Key: SPARK-24858 > URL: https://issues.apache.org/jira/browse/SPARK-24858 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Currently the same Parquet footer is read twice in the function > `buildReaderWithPartitionValues` of ParquetFileFormat if filter push down is > enabled. > Fix it with simple changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24858) Avoid unnecessary parquet footer reads
Gengliang Wang created SPARK-24858: -- Summary: Avoid unnecessary parquet footer reads Key: SPARK-24858 URL: https://issues.apache.org/jira/browse/SPARK-24858 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Gengliang Wang Currently the same Parquet footer is read twice in the function `buildReaderWithPartitionValues` of ParquetFileFormat if filter push down is enabled. Fix it with simple changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548878#comment-16548878 ] Maxim Gekk commented on SPARK-24849: [~maropu] This is part of my work on a customer's issue. There are multiple folders of AVRO files with fairly wide and nested schemas. I need to programmatically create tables on top of each folder. To do that, I read a file in a folder via the Scala API, take its schema, convert it to a DDL string (this is where I need the changes) and put the string into a SQL CREATE TABLE statement. > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new methods which convert a value of StructType to a > schema in DDL format. It should be possible to use the resulting string in new > table creation by simply copy-pasting the new method's output. The existing > methods simpleString(), catalogString() and sql() put ':' between a top-level > field name and its type, and wrap the whole result in the *struct* word > {code} > ds.schema.catalogString > struct<...> {code} > Output of the new method should be > {code} > metaData struct<...> {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
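A rough sketch of the kind of helper being requested (the method name and exact output format are illustrative, not the final API): render each top-level field as "name TYPE" so the string can be pasted directly into a CREATE TABLE statement.

{code:scala}
import org.apache.spark.sql.types.StructType

// Illustrative only: join top-level fields as "<name> <SQL type>",
// e.g. "metaData STRUCT<...>", instead of catalogString's "struct<name:type,...>".
def toDDLString(schema: StructType): String =
  schema.fields.map(f => s"${f.name} ${f.dataType.sql}").mkString(", ")
{code}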
[jira] [Created] (SPARK-24857) Need sample code to test a Spark Streaming job on Kubernetes that writes data to a remote HDFS file system
kumpatla murali krishna created SPARK-24857: --- Summary: Need sample code to test a Spark Streaming job on Kubernetes that writes data to a remote HDFS file system Key: SPARK-24857 URL: https://issues.apache.org/jira/browse/SPARK-24857 Project: Spark Issue Type: Test Components: Kubernetes, Spark Submit Affects Versions: 2.3.1 Reporter: kumpatla murali krishna The following submission ./bin/spark-submit --master k8s://https://api.kubernates.aws.phenom.local --deploy-mode cluster --name spark-pi --class com.phenom.analytics.executor.SummarizationJobExecutor --conf spark.executor.instances=5 --conf spark.kubernetes.container.image=phenommurali/spark_new --jars hdfs://test-dev.com:8020/user/spark/jobs/Test_jar_without_jars.jar fails with the following error: Normal SuccessfulMountVolume 2m kubelet, ip-x.ec2.internal MountVolume.SetUp succeeded for volume "download-files-volume" Warning FailedMount 2m kubelet, ip-.ec2.internal MountVolume.SetUp failed for volume "spark-init-properties" : configmaps "spark-pi-b5be4308783c3c479c6bf2f9da9b49dc-init-config" not found -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12126) JDBC datasource processes filters only commonly pushed down.
[ https://issues.apache.org/jira/browse/SPARK-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548809#comment-16548809 ] Hyukjin Kwon commented on SPARK-12126: -- See the comment in the PR I left. > JDBC datasource processes filters only commonly pushed down. > > > Key: SPARK-12126 > URL: https://issues.apache.org/jira/browse/SPARK-12126 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Major > > As suggested > [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646], > Currently JDBC datasource only processes the filters pushed down from > {{DataSourceStrategy}}. > Unlike ORC or Parquet, this can process pretty a lot of filters (for example, > a + b > 3) since it is just about string parsing. > As > [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526], > using {{CatalystScan}} trait might be one of solutions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
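To make the limitation concrete, here is a hedged illustration (the JDBC URL and table name are hypothetical placeholders): a predicate over an arithmetic expression cannot be represented as an org.apache.spark.sql.sources.Filter, so today it is evaluated inside Spark rather than pushed to the database as a WHERE clause such as "a + b > 3".

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("jdbc-pushdown-example").getOrCreate()

// Hypothetical JDBC source; url and dbtable are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com/sales")
  .option("dbtable", "t")
  .load()

// This filter is not expressible as a sources.Filter, so it stays in
// Spark's physical plan instead of being pushed down as "WHERE a + b > 3".
df.filter(col("a") + col("b") > 3).explain()
{code}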
[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548784#comment-16548784 ] Jiang Xingbo commented on SPARK-24375: -- {quote}Is the 'barrier' logic pluggable ? Instead of only being a global sync point. {quote} The barrier() function is quite like [MPI_Barrier|https://www.mpich.org/static/docs/v3.2.1/www/www3/MPI_Barrier.html] function in MPI, the major purpose is to provide a way to do global sync between barrier tasks. I'm not sure whether we have plan to support pluggable logic for now, do you have a case in hand that require pluggable barrier() ? {quote}Dynamic resource allocation (dra) triggers allocation of additional resources based on pending tasks - hence the comment We may add a check of total available slots before scheduling tasks from a barrier stage taskset. does not necessarily work in that context. {quote} Support running barrier stage with dynamic resource allocation is a Non-Goal here, however, we can improve the behavior to integrate better with DRA in Spark 3.0 . {quote}Currently DRA in spark uniformly allocates resources - are we envisioning changes as part of this effort to allocate heterogenous executor resources based on pending tasks (atleast initially for barrier support for gpu's) ? {quote} There is another ongoing SPIP SPARK-24615 to add accelerator-aware task scheduling for Spark, I think we shall deal with the above issue within that topic. {quote}In face of exceptions, some tasks will wait on barrier 2 and others on barrier 1 : causing issues.{quote} It's not desired behavior to catch exception thrown by TaskContext.barrier() silently. However, in case this really happens, we can detect that because we have `epoch` both in driver side and executor side, more details will go to the design doc of BarrierTaskContext.barrier() SPARK-24581 {quote}Can you elaborate more on leveraging TaskContext.localProperties ? Is it expected to be sync'ed after 'barrier' returns ? What gaurantees are we expecting to provide ?{quote} We update the localProperties in driver and in executors you shall be able to fetch the updated values through TaskContext, it should not couple with `barrier()` function. > Design sketch: support barrier scheduling in Apache Spark > - > > Key: SPARK-24375 > URL: https://issues.apache.org/jira/browse/SPARK-24375 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > This task is to outline a design sketch for the barrier scheduling SPIP > discussion. It doesn't need to be a complete design before the vote. But it > should at least cover both Scala/Java and PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
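For readers following the discussion, a hedged sketch of how the proposed barrier API is expected to be used (class and method names such as BarrierTaskContext follow the SPIP direction and may change before the final release):

{code:scala}
import org.apache.spark.{BarrierTaskContext, SparkContext}

// Every task in the barrier stage blocks at barrier() until all tasks in the
// stage have reached it -- a global sync point, analogous to MPI_Barrier.
def runBarrierStage(sc: SparkContext): Unit = {
  sc.parallelize(1 to 100, numSlices = 4)
    .barrier()
    .mapPartitions { iter =>
      val context = BarrierTaskContext.get()
      context.barrier()   // wait for all tasks in the stage before proceeding
      iter
    }
    .count()
}
{code}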
[jira] [Created] (SPARK-24856) Spark needs to upgrade Guava to work with gRPC
alibaltschun created SPARK-24856: Summary: Spark needs to upgrade Guava to work with gRPC Key: SPARK-24856 URL: https://issues.apache.org/jira/browse/SPARK-24856 Project: Spark Issue Type: Dependency upgrade Components: Input/Output, Spark Core Affects Versions: 2.3.1 Reporter: alibaltschun Hello, I have a problem loading a Spark model while using gRPC dependencies. I posted the question on StackOverflow, and the answer was that Spark uses an old version of Guava while gRPC requires Guava 20+. That means Spark needs to upgrade its Guava version to fix this issue. Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
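Until Spark's Guava dependency is modernized, a common user-side workaround is to shade Guava inside the application jar. A hedged sketch for an sbt-assembly build (this assumes the application is packaged with sbt-assembly; it is a user-side workaround, not something Spark provides):

{code:scala}
// In build.sbt: relocate the application's newer Guava so it cannot clash
// with the older Guava already on Spark's classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shadedguava.com.google.common.@1").inAll
)
{code}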
[jira] [Resolved] (SPARK-24840) do not use dummy filter to switch codegen on/off
[ https://issues.apache.org/jira/browse/SPARK-24840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24840. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21795 [https://github.com/apache/spark/pull/21795] > do not use dummy filter to switch codegen on/off > > > Key: SPARK-24840 > URL: https://issues.apache.org/jira/browse/SPARK-24840 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23967) Description add native sql show in SQL page.
[ https://issues.apache.org/jira/browse/SPARK-23967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548754#comment-16548754 ] guoxiaolongzte commented on SPARK-23967: I don't quite understand what you mean. Could you explain it in more detail? > Description add native sql show in SQL page. > > > Key: SPARK-23967 > URL: https://issues.apache.org/jira/browse/SPARK-23967 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.0 >Reporter: JieFang.He >Priority: Minor > > Show the native SQL in the Description column on the SQL page for better observation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24701) SparkMaster WebUI allow all appids to be shown in detail on port 4040 rather than different ports per app
[ https://issues.apache.org/jira/browse/SPARK-24701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548753#comment-16548753 ] guoxiaolongzte commented on SPARK-24701: I don't quite understand what you mean. Could you explain it in more detail? A screenshot would help. > SparkMaster WebUI allow all appids to be shown in detail on port 4040 rather > than different ports per app > - > > Key: SPARK-24701 > URL: https://issues.apache.org/jira/browse/SPARK-24701 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.1 >Reporter: t oo >Priority: Major > Labels: master, security, ui, web, web-ui > > Right now the details for all application IDs are shown on a different port per app > ID, i.e. 4040, 4041, 4042, etc. This is problematic for environments with > tight firewall settings. Proposing to allow 4040?appid=1, 4040?appid=2, > 4040?appid=3, etc. for the Master web UI, just like the History web UI > does. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23357) 'SHOW TABLE EXTENDED LIKE pattern=STRING' add ‘Partitioned’ display similar to hive, and partition is empty, also need to show empty partition field []
[ https://issues.apache.org/jira/browse/SPARK-23357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte resolved SPARK-23357. Resolution: Won't Fix > 'SHOW TABLE EXTENDED LIKE pattern=STRING' add ‘Partitioned’ display similar > to hive, and partition is empty, also need to show empty partition field [] > > > Key: SPARK-23357 > URL: https://issues.apache.org/jira/browse/SPARK-23357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Minor > Attachments: 1.png, 2.png, 3.png, 4.png, 5.png > > > 'SHOW TABLE EXTENDED LIKE pattern=STRING' add ‘Partitioned’ display similar > to hive, and partition is empty, also need to show empty partition field [] . > hive: > !3.png! > sparkSQL Non-partitioned table fix before: > !1.png! > sparkSQL partitioned table fix before: > !2.png! > sparkSQL Non-partitioned table fix after: > !4.png! > sparkSQL partitioned table fix after: > !5.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24851) Map a Stage ID to it's Associated Job ID in UI
[ https://issues.apache.org/jira/browse/SPARK-24851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-24851: -- Target Version/s: (was: 2.3.1) > Map a Stage ID to it's Associated Job ID in UI > -- > > Key: SPARK-24851 > URL: https://issues.apache.org/jira/browse/SPARK-24851 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Parth Gandhi >Priority: Trivial > > It would be nice to have a field in Stage Page UI which would show mapping of > the current stage id to the job id's to which that stage belongs to. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
[ https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22151: -- Fix Version/s: (was: 2.4.0) > PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly > -- > > Key: SPARK-22151 > URL: https://issues.apache.org/jira/browse/SPARK-22151 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: Thomas Graves >Assignee: Parth Gandhi >Priority: Major > > Running in yarn cluster mode and trying to set pythonpath via > spark.yarn.appMasterEnv.PYTHONPATH doesn't work. > the yarn Client code looks at the env variables: > val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath) > But when you set spark.yarn.appMasterEnv it puts it into the local env. > So the python path set in spark.yarn.appMasterEnv isn't properly set. > You can work around if you are running in cluster mode by setting it on the > client like: > PYTHONPATH=./addon/python/ spark-submit -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
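A hedged sketch of the direction a fix could take (illustrative only, not the merged change): in the YARN Client, also consult the value configured via spark.yarn.appMasterEnv.PYTHONPATH rather than only the submitting process's environment.

{code:scala}
import java.io.File

import org.apache.spark.SparkConf

// Illustrative helper: merge the client-side PYTHONPATH, the value set via
// spark.yarn.appMasterEnv.PYTHONPATH, and Spark's own python path entries.
def buildPythonPath(sparkConf: SparkConf, pythonPath: Seq[String]): String =
  (sys.env.get("PYTHONPATH") ++
    sparkConf.getOption("spark.yarn.appMasterEnv.PYTHONPATH") ++
    pythonPath).mkString(File.pathSeparator)
{code}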
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548662#comment-16548662 ] Saisai Shao commented on SPARK-24615: - Hi [~tgraves] I'm rewriting the design doc based on the comments mentioned above, so temporarily make it inaccessible, sorry about it, I will reopen it. I think it is hard to control the memory usage per stage/task, because task is running in the executor which shared within a JVM. For CPU, yes I think we can do it, but I'm not sure the usage scenario of it. For the requirement of using different types of machine, what I can think of is leveraging dynamic resource allocation. For example, if user wants run some MPI jobs with barrier enabled, then Spark could allocate some new executors with accelerator resource via cluster manager (for example using node label if it is running on YARN). But I will not target this as a goal in this design, since a) it is a non-goal for barrier scheduler currently; b) it makes the design too complex, would be better to separate to another work. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548659#comment-16548659 ] Hyukjin Kwon commented on SPARK-24853: -- I don't think we need an API just for consistency. > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2 >Reporter: nirav patel >Priority: Major > > Can we add an overloaded version of withColumn or withColumnRenamed that accepts > a Column instead of a String? That way I can specify a fully qualified name when there > are duplicate column names, e.g. if I have 2 columns with the same name as a > result of a join and I want to rename one of the fields, I can do it with this > new API. > > This would be similar to the drop API, which supports both String and Column. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. > > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
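For context, the duplicate-column situation can already be untangled today with select and dataset-qualified Column references; the proposed overloads would only make this more direct. A small illustration with hypothetical DataFrames df1 and df2:

{code:scala}
import org.apache.spark.sql.DataFrame

// Both inputs have an "id" column; after the join, withColumnRenamed("id", ...)
// is ambiguous, but qualified Column references can disambiguate today.
def disambiguate(df1: DataFrame, df2: DataFrame): DataFrame = {
  val joined = df1.join(df2, df1("id") === df2("id"))
  joined.select(df1("id").as("left_id"), df2("id").as("right_id"))
}
{code}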
[jira] [Resolved] (SPARK-24854) Gather all options into AvroOptions
[ https://issues.apache.org/jira/browse/SPARK-24854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24854. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21810 [https://github.com/apache/spark/pull/21810] > Gather all options into AvroOptions > --- > > Key: SPARK-24854 > URL: https://issues.apache.org/jira/browse/SPARK-24854 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Need to gather all Avro options into a class like in another datasources - > JSONOptions and CSVOptions. The map inside of the class should be case > insensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24854) Gather all options into AvroOptions
[ https://issues.apache.org/jira/browse/SPARK-24854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24854: Assignee: Maxim Gekk > Gather all options into AvroOptions > --- > > Key: SPARK-24854 > URL: https://issues.apache.org/jira/browse/SPARK-24854 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Need to gather all Avro options into a class like in another datasources - > JSONOptions and CSVOptions. The map inside of the class should be case > insensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24855) Built-in AVRO support should support specified schema on write
[ https://issues.apache.org/jira/browse/SPARK-24855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-24855: --- Assignee: Brian Lindblom > Built-in AVRO support should support specified schema on write > -- > > Key: SPARK-24855 > URL: https://issues.apache.org/jira/browse/SPARK-24855 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Brian Lindblom >Assignee: Brian Lindblom >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > spark-avro appears to have been brought in from an upstream project, > [https://github.com/databricks/spark-avro.] I opened a PR a while ago to > enable support for 'forceSchema', which allows us to specify an AVRO schema > with which to write our records to handle some use cases we have. I didn't > get this code merged but would like to add this feature to the AVRO > reader/writer code that was brought in. The PR is here and I will follow up > with a more formal PR/Patch rebased on spark master branch: > https://github.com/databricks/spark-avro/pull/222 > > This change allows us to specify a schema, which should be compatible with > the schema generated by spark-avro from the dataset definition. This allows > a user to do things like specify default values, change union ordering, or... > in the case where you're reading in an AVRO data set, doing some sort of > in-line field cleansing, then writing out with the original schema, preserve > that original schema in the output container files. I've had several use > cases where this behavior was desired and there were several other asks for > this in the spark-avro project. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24855) Built-in AVRO support should support specified schema on write
[ https://issues.apache.org/jira/browse/SPARK-24855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Lindblom updated SPARK-24855: --- Description: spark-avro appears to have been brought in from an upstream project, [https://github.com/databricks/spark-avro.] I opened a PR a while ago to enable support for 'forceSchema', which allows us to specify an AVRO schema with which to write our records to handle some use cases we have. I didn't get this code merged but would like to add this feature to the AVRO reader/writer code that was brought in. The PR is here and I will follow up with a more formal PR/Patch rebased on spark master branch: https://github.com/databricks/spark-avro/pull/222 This change allows us to specify a schema, which should be compatible with the schema generated by spark-avro from the dataset definition. This allows a user to do things like specify default values, change union ordering, or... in the case where you're reading in an AVRO data set, doing some sort of in-line field cleansing, then writing out with the original schema, preserve that original schema in the output container files. I've had several use cases where this behavior was desired and there were several other asks for this in the spark-avro project. was: spark-avro appears to have been brought in from an upstream project, [https://github.com/databricks/spark-avro.] I opened a PR a while ago to enable support for 'forceSchema', which allows us to specify an AVRO schema with which to write our records to handle some use cases we have. I didn't get this code merged but would like to add this feature to the AVRO reader/writer code that was brought in. The PR is here and I will follow up with a more formal PR/Patch rebased on spark master branch. This change allows us to specify a schema, which should be compatible with the schema generated by spark-avro from the dataset definition. This allows a user to do things like specify default values, change union ordering, or... in the case where you're reading in an AVRO data set, doing some sort of in-line field cleansing, then writing out with the original schema, preserve that original schema in the output container files. I've had several use cases where this behavior was desired and there were several other asks for this in the spark-avro project. > Built-in AVRO support should support specified schema on write > -- > > Key: SPARK-24855 > URL: https://issues.apache.org/jira/browse/SPARK-24855 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Brian Lindblom >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > spark-avro appears to have been brought in from an upstream project, > [https://github.com/databricks/spark-avro.] I opened a PR a while ago to > enable support for 'forceSchema', which allows us to specify an AVRO schema > with which to write our records to handle some use cases we have. I didn't > get this code merged but would like to add this feature to the AVRO > reader/writer code that was brought in. The PR is here and I will follow up > with a more formal PR/Patch rebased on spark master branch: > https://github.com/databricks/spark-avro/pull/222 > > This change allows us to specify a schema, which should be compatible with > the schema generated by spark-avro from the dataset definition. This allows > a user to do things like specify default values, change union ordering, or... 
> in the case where you're reading in an AVRO data set, doing some sort of > in-line field cleansing, then writing out with the original schema, preserve > that original schema in the output container files. I've had several use > cases where this behavior was desired and there were several other asks for > this in the spark-avro project. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24855) Built-in AVRO support should support specified schema on write
Brian Lindblom created SPARK-24855: -- Summary: Built-in AVRO support should support specified schema on write Key: SPARK-24855 URL: https://issues.apache.org/jira/browse/SPARK-24855 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Brian Lindblom spark-avro appears to have been brought in from an upstream project, [https://github.com/databricks/spark-avro.] I opened a PR a while ago to enable support for 'forceSchema', which allows us to specify an AVRO schema with which to write our records to handle some use cases we have. I didn't get this code merged but would like to add this feature to the AVRO reader/writer code that was brought in. The PR is here and I will follow up with a more formal PR/Patch rebased on spark master branch. This change allows us to specify a schema, which should be compatible with the schema generated by spark-avro from the dataset definition. This allows a user to do things like specify default values, change union ordering, or... in the case where you're reading in an AVRO data set, doing some sort of in-line field cleansing, then writing out with the original schema, preserve that original schema in the output container files. I've had several use cases where this behavior was desired and there were several other asks for this in the spark-avro project. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
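A hedged usage sketch of the proposed behavior (the option name 'forceSchema' comes from the databricks/spark-avro pull request referenced above; the built-in Avro source may end up with a different option name):

{code:scala}
import org.apache.spark.sql.DataFrame

// Write records using a caller-supplied Avro schema (as a JSON string) instead
// of the schema derived from the Dataset, preserving defaults and union
// ordering from the original container files.
def writeWithSchema(df: DataFrame, avroSchemaJson: String, path: String): Unit =
  df.write
    .format("avro")
    .option("forceSchema", avroSchemaJson)  // hypothetical option name
    .save(path)
{code}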
[jira] [Commented] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548578#comment-16548578 ] Apache Spark commented on SPARK-24801: -- User 'countmdm' has created a pull request for this issue: https://github.com/apache/spark/pull/21811 > Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can > waste a lot of memory > --- > > Key: SPARK-24801 > URL: https://issues.apache.org/jira/browse/SPARK-24801 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Priority: Major > > I recently analyzed another Yarn NM heap dump with jxray > ([www.jxray.com),|http://www.jxray.com),/] and found that 81% of memory is > wasted by empty (all zeroes) byte[] arrays. Most of these arrays are > referenced by > {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in > turn come from > {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is > the full reference chain that leads to the problematic arrays: > {code:java} > 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%) > ↖org.apache.spark.network.util.ByteArrayWritableChannel.data > ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel > ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg > ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next} > ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry > ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer > ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code} > > Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that > byteChannel is always initialized eagerly in the constructor: > {code:java} > this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code} > So I think to address the problem of empty byte[] arrays flooding the memory, > we should initialize {{byteChannel}} lazily, upon the first use. As far as I > can see, it's used only in one method, {{private void nextChunk()}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
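A Scala sketch of the lazy-initialization idea (the real EncryptedMessage lives in Spark's Java network-common module, so this only illustrates the pattern, not the actual patch):

{code:scala}
import org.apache.spark.network.util.ByteArrayWritableChannel

// Allocate the buffer only when the first chunk is actually produced, so
// queued-but-idle messages no longer pin a large all-zero byte[].
class EncryptedMessageSketch(maxOutboundBlockSize: Int) {
  private lazy val byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize)

  def nextChunk(): Unit = {
    byteChannel.reset()   // first call triggers the allocation
    // ... encrypt and copy the next block into byteChannel ...
  }
}
{code}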
[jira] [Assigned] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24801: Assignee: (was: Apache Spark) > Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can > waste a lot of memory > --- > > Key: SPARK-24801 > URL: https://issues.apache.org/jira/browse/SPARK-24801 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Priority: Major > > I recently analyzed another Yarn NM heap dump with jxray > ([www.jxray.com),|http://www.jxray.com),/] and found that 81% of memory is > wasted by empty (all zeroes) byte[] arrays. Most of these arrays are > referenced by > {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in > turn come from > {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is > the full reference chain that leads to the problematic arrays: > {code:java} > 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%) > ↖org.apache.spark.network.util.ByteArrayWritableChannel.data > ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel > ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg > ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next} > ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry > ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer > ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code} > > Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that > byteChannel is always initialized eagerly in the constructor: > {code:java} > this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code} > So I think to address the problem of empty byte[] arrays flooding the memory, > we should initialize {{byteChannel}} lazily, upon the first use. As far as I > can see, it's used only in one method, {{private void nextChunk()}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24801: Assignee: Apache Spark > Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can > waste a lot of memory > --- > > Key: SPARK-24801 > URL: https://issues.apache.org/jira/browse/SPARK-24801 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Assignee: Apache Spark >Priority: Major > > I recently analyzed another Yarn NM heap dump with jxray > ([www.jxray.com),|http://www.jxray.com),/] and found that 81% of memory is > wasted by empty (all zeroes) byte[] arrays. Most of these arrays are > referenced by > {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in > turn come from > {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is > the full reference chain that leads to the problematic arrays: > {code:java} > 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%) > ↖org.apache.spark.network.util.ByteArrayWritableChannel.data > ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel > ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg > ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next} > ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry > ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer > ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code} > > Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that > byteChannel is always initialized eagerly in the constructor: > {code:java} > this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code} > So I think to address the problem of empty byte[] arrays flooding the memory, > we should initialize {{byteChannel}} lazily, upon the first use. As far as I > can see, it's used only in one method, {{private void nextChunk()}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21261) SparkSQL regexpExpressions example
[ https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21261: - Assignee: Sean Owen > SparkSQL regexpExpressions example > --- > > Key: SPARK-21261 > URL: https://issues.apache.org/jira/browse/SPARK-21261 > Project: Spark > Issue Type: Documentation > Components: Examples >Affects Versions: 2.1.1 >Reporter: zhangxin >Assignee: Sean Owen >Priority: Major > Fix For: 2.4.0 > > > The follow execute result. > scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') > """).show > +--+ > |regexp_replace(100-200, (d+), num)| > +--+ > | 100-200| > +--+ > scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') > """).show > +---+ > |regexp_replace(100-200, (\d+), num)| > +---+ > |num-num| > +---+ > Add Comment -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21261) SparkSQL regexpExpressions example
[ https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21261. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21808 [https://github.com/apache/spark/pull/21808] > SparkSQL regexpExpressions example > --- > > Key: SPARK-21261 > URL: https://issues.apache.org/jira/browse/SPARK-21261 > Project: Spark > Issue Type: Documentation > Components: Examples >Affects Versions: 2.1.1 >Reporter: zhangxin >Assignee: Sean Owen >Priority: Major > Fix For: 2.4.0 > > > The follow execute result. > scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') > """).show > +--+ > |regexp_replace(100-200, (d+), num)| > +--+ > | 100-200| > +--+ > scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') > """).show > +---+ > |regexp_replace(100-200, (\d+), num)| > +---+ > |num-num| > +---+ > Add Comment -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21261) SparkSQL regexpExpressions example
[ https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21261: -- Priority: Minor (was: Major) > SparkSQL regexpExpressions example > --- > > Key: SPARK-21261 > URL: https://issues.apache.org/jira/browse/SPARK-21261 > Project: Spark > Issue Type: Documentation > Components: Examples >Affects Versions: 2.1.1 >Reporter: zhangxin >Assignee: Sean Owen >Priority: Minor > Fix For: 2.4.0 > > > The follow execute result. > scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') > """).show > +--+ > |regexp_replace(100-200, (d+), num)| > +--+ > | 100-200| > +--+ > scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') > """).show > +---+ > |regexp_replace(100-200, (\d+), num)| > +---+ > |num-num| > +---+ > Add Comment -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24814) Relationship between catalog and datasources
[ https://issues.apache.org/jira/browse/SPARK-24814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-24814: -- Description: This is somewhat related, though not identical to, [~rdblue]'s SPIP on datasources and catalogs. Here are the requirements (IMO) for fully implementing V2 datasources and their relationships to catalogs: # The global catalog should be configurable (the default can be HMS, but it should be overridable). # The default catalog (or an explicitly specified catalog in a query, once multiple catalogs are supported) can determine the V2 datasource to use for reading and writing the data. # Conversely, a V2 datasource can determine which catalog to use for resolution (e.g., if the user issues {{spark.read.format("acmex").table("mytable")}}, the acmex datasource would decide which catalog to use for resolving “mytable”). was: This is somewhat related, though not identical to, Ryan Blue's SPIP on datasources and catalogs. Here are the requirements (IMO) for fully implementing V2 datasources and their relationships to catalogs: # The global catalog should be configurable (the default can be HMS, but it should be overridable). # The default catalog (or an explicitly specified catalog in a query, once multiple catalogs are supported) can determine the V2 datasource to use for reading and writing the data. # Conversely, a V2 datasource can determine which catalog to use for resolution (e.g., if the user issues {{spark.read.format("acmex").table("mytable")}}, the acmex datasource would decide which catalog to use for resolving “mytable”). > Relationship between catalog and datasources > > > Key: SPARK-24814 > URL: https://issues.apache.org/jira/browse/SPARK-24814 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Major > > This is somewhat related, though not identical to, [~rdblue]'s SPIP on > datasources and catalogs. > Here are the requirements (IMO) for fully implementing V2 datasources and > their relationships to catalogs: > # The global catalog should be configurable (the default can be HMS, but it > should be overridable). > # The default catalog (or an explicitly specified catalog in a query, once > multiple catalogs are supported) can determine the V2 datasource to use for > reading and writing the data. > # Conversely, a V2 datasource can determine which catalog to use for > resolution (e.g., if the user issues > {{spark.read.format("acmex").table("mytable")}}, the acmex datasource would > decide which catalog to use for resolving “mytable”). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support
[ https://issues.apache.org/jira/browse/SPARK-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548452#comment-16548452 ] Parth Gandhi edited comment on SPARK-18186 at 7/18/18 9:54 PM: --- Hi [~lian cheng], [~yhuai], there has been an issue lately with the library sketches-hive([https://github.com/DataSketches/sketches-hive)] that builds and runs a hive udaf on top of Spark SQL. In their method getNewAggregationBuffer() [https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/DataToSketchUDAF.java#L106,] they are initializing different state objects for modes Partial1 and Partial2. Their code used to work well with Spark 2.1 when Spark had support for mode "Complete". However, after it started supporting partial aggregation in Spark 2.2 onwards, their code gives an issue when partial merge is invoked here [https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56], as the wrong state object is being passed in the merge function. I was just trying to understand the PR and wondering why did Spark stop supporting Complete mode in Hive UDAF or is there a way to still run in Complete mode which I am not aware of. Thank you. was (Author: pgandhi): Hi, there has been an issue lately with the library sketches-hive([https://github.com/DataSketches/sketches-hive)] that builds and runs a hive udaf on top of Spark SQL. In their method getNewAggregationBuffer() [https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/DataToSketchUDAF.java#L106,] they are initializing different state objects for modes Partial1 and Partial2. Their code used to work well with Spark 2.1 when Spark had support for mode "Complete". However, after it started supporting partial aggregation in Spark 2.2 onwards, their code gives an issue when partial merge is invoked here [https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56], as the wrong state object is being passed in the merge function. I was just trying to understand the PR and wondering why did Spark stop supporting Complete mode in Hive UDAF or is there a way to still run in Complete mode which I am not aware of. Thank you. > Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation > support > > > Key: SPARK-18186 > URL: https://issues.apache.org/jira/browse/SPARK-18186 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Major > Fix For: 2.2.0 > > > Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any > query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} > without partial aggregation. > This issue can be fixed by migrating {{HiveUDAFFunction}} to > {{TypedImperativeAggregate}}, which already provides partial aggregation > support for aggregate functions that may use arbitrary Java objects as > aggregation states. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support
[ https://issues.apache.org/jira/browse/SPARK-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548452#comment-16548452 ] Parth Gandhi commented on SPARK-18186: -- Hi, there has been an issue lately with the library sketches-hive([https://github.com/DataSketches/sketches-hive)] that builds and runs a hive udaf on top of Spark SQL. In their method getNewAggregationBuffer() [https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/DataToSketchUDAF.java#L106,] they are initializing different state objects for modes Partial1 and Partial2. Their code used to work well with Spark 2.1 when Spark had support for mode "Complete". However, after it started supporting partial aggregation in Spark 2.2 onwards, their code gives an issue when partial merge is invoked here [https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56], as the wrong state object is being passed in the merge function. I was just trying to understand the PR and wondering why did Spark stop supporting Complete mode in Hive UDAF or is there a way to still run in Complete mode which I am not aware of. Thank you. > Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation > support > > > Key: SPARK-18186 > URL: https://issues.apache.org/jira/browse/SPARK-18186 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Major > Fix For: 2.2.0 > > > Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any > query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} > without partial aggregation. > This issue can be fixed by migrating {{HiveUDAFFunction}} to > {{TypedImperativeAggregate}}, which already provides partial aggregation > support for aggregate functions that may use arbitrary Java objects as > aggregation states. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24854) Gather all options into AvroOptions
[ https://issues.apache.org/jira/browse/SPARK-24854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548441#comment-16548441 ] Apache Spark commented on SPARK-24854: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21810 > Gather all options into AvroOptions > --- > > Key: SPARK-24854 > URL: https://issues.apache.org/jira/browse/SPARK-24854 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to gather all Avro options into a class like in another datasources - > JSONOptions and CSVOptions. The map inside of the class should be case > insensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24854) Gather all options into AvroOptions
[ https://issues.apache.org/jira/browse/SPARK-24854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24854: Assignee: Apache Spark > Gather all options into AvroOptions > --- > > Key: SPARK-24854 > URL: https://issues.apache.org/jira/browse/SPARK-24854 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Need to gather all Avro options into a class like in another datasources - > JSONOptions and CSVOptions. The map inside of the class should be case > insensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24854) Gather all options into AvroOptions
[ https://issues.apache.org/jira/browse/SPARK-24854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24854: Assignee: (was: Apache Spark) > Gather all options into AvroOptions > --- > > Key: SPARK-24854 > URL: https://issues.apache.org/jira/browse/SPARK-24854 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to gather all Avro options into a class like in another datasources - > JSONOptions and CSVOptions. The map inside of the class should be case > insensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24854) Gather all options into AvroOptions
Maxim Gekk created SPARK-24854: -- Summary: Gather all options into AvroOptions Key: SPARK-24854 URL: https://issues.apache.org/jira/browse/SPARK-24854 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Need to gather all Avro options into a class like in another datasources - JSONOptions and CSVOptions. The map inside of the class should be case insensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
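A minimal sketch of what such a class could look like, modeled on JSONOptions/CSVOptions (the field names here are illustrative, not the merged API):

{code:scala}
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

// Gather the datasource options behind a case-insensitive map so that, e.g.,
// "avroSchema" and "avroschema" resolve to the same setting.
class AvroOptions(@transient private val parameters: CaseInsensitiveMap[String])
  extends Serializable {

  def this(parameters: Map[String, String]) = this(CaseInsensitiveMap(parameters))

  // Optional user-provided Avro schema (JSON string) -- illustrative field.
  val schema: Option[String] = parameters.get("avroSchema")

  // Whether to also read files without the .avro extension -- illustrative field.
  val ignoreExtension: Boolean =
    parameters.get("ignoreExtension").map(_.toBoolean).getOrElse(false)
}
{code}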
[jira] [Comment Edited] (SPARK-23908) High-order function: transform(array, function) → array
[ https://issues.apache.org/jira/browse/SPARK-23908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548427#comment-16548427 ] Herman van Hovell edited comment on SPARK-23908 at 7/18/18 9:30 PM: Yeah I am, sorry for the hold up. I'll try to have something out ASAP. BTW: I don't see a target version set, the affected version is (which is a bit weird for a feature). was (Author: hvanhovell): Yeah I am, sorry for the hold up. I'll try to have something out ASAP. > High-order function: transform(array, function) → array > --- > > Key: SPARK-23908 > URL: https://issues.apache.org/jira/browse/SPARK-23908 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Herman van Hovell >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array that is the result of applying function to each element of > array: > {noformat} > SELECT transform(ARRAY [], x -> x + 1); -- [] > SELECT transform(ARRAY [5, 6], x -> x + 1); -- [6, 7] > SELECT transform(ARRAY [5, NULL, 6], x -> COALESCE(x, 0) + 1); -- [6, 1, 7] > SELECT transform(ARRAY ['x', 'abc', 'z'], x -> x || '0'); -- ['x0', 'abc0', > 'z0'] > SELECT transform(ARRAY [ARRAY [1, NULL, 2], ARRAY[3, NULL]], a -> filter(a, x > -> x IS NOT NULL)); -- [[1, 2], [3]] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23908) High-order function: transform(array, function) → array
[ https://issues.apache.org/jira/browse/SPARK-23908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548427#comment-16548427 ] Herman van Hovell commented on SPARK-23908: --- Yeah I am, sorry for the hold up. I'll try to have something out ASAP. > High-order function: transform(array, function) → array > --- > > Key: SPARK-23908 > URL: https://issues.apache.org/jira/browse/SPARK-23908 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Herman van Hovell >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array that is the result of applying function to each element of > array: > {noformat} > SELECT transform(ARRAY [], x -> x + 1); -- [] > SELECT transform(ARRAY [5, 6], x -> x + 1); -- [6, 7] > SELECT transform(ARRAY [5, NULL, 6], x -> COALESCE(x, 0) + 1); -- [6, 1, 7] > SELECT transform(ARRAY ['x', 'abc', 'z'], x -> x || '0'); -- ['x0', 'abc0', > 'z0'] > SELECT transform(ARRAY [ARRAY [1, NULL, 2], ARRAY[3, NULL]], a -> filter(a, x > -> x IS NOT NULL)); -- [[1, 2], [3]] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23908) High-order function: transform(array, function) → array
[ https://issues.apache.org/jira/browse/SPARK-23908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548421#comment-16548421 ] Frederick Reiss commented on SPARK-23908: - This Jira is marked as "in progress" with the target set to a previous release of Spark. Are you working on this, [~hvanhovell]? > High-order function: transform(array, function) → array > --- > > Key: SPARK-23908 > URL: https://issues.apache.org/jira/browse/SPARK-23908 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Herman van Hovell >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array that is the result of applying function to each element of > array: > {noformat} > SELECT transform(ARRAY [], x -> x + 1); -- [] > SELECT transform(ARRAY [5, 6], x -> x + 1); -- [6, 7] > SELECT transform(ARRAY [5, NULL, 6], x -> COALESCE(x, 0) + 1); -- [6, 1, 7] > SELECT transform(ARRAY ['x', 'abc', 'z'], x -> x || '0'); -- ['x0', 'abc0', > 'z0'] > SELECT transform(ARRAY [ARRAY [1, NULL, 2], ARRAY[3, NULL]], a -> filter(a, x > -> x IS NOT NULL)); -- [[1, 2], [3]] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
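The examples quoted above are Presto syntax. Assuming the Presto-style lambda syntax carries over to Spark SQL once this sub-task lands, the equivalent call from a Spark session would look roughly like the sketch below; the result comment mirrors the Presto example and is the expected behaviour, not output from a particular build. It assumes a running SparkSession named {{spark}} (e.g. the spark-shell).

{code:scala}
// Hedged sketch of the intended behaviour, not the committed implementation.
spark.sql("SELECT transform(array(5, NULL, 6), x -> coalesce(x, 0) + 1) AS out").show(false)
// Expected, mirroring the Presto example above:
// +---------+
// |out      |
// +---------+
// |[6, 1, 7]|
// +---------+
{code}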
[jira] [Resolved] (SPARK-24129) Add option to pass --build-arg's to docker-image-tool.sh
[ https://issues.apache.org/jira/browse/SPARK-24129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24129. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21202 [https://github.com/apache/spark/pull/21202] > Add option to pass --build-arg's to docker-image-tool.sh > > > Key: SPARK-24129 > URL: https://issues.apache.org/jira/browse/SPARK-24129 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Devaraj K >Assignee: Devaraj K >Priority: Minor > Fix For: 2.4.0 > > > When we are working behind the firewall, we may need to pass the proxy > details as part of the docker --build-arg parameters to build the image. But > docker-image-tool.sh doesn't provide option to pass the proxy details or the > --build-arg to the docker command. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24129) Add option to pass --build-arg's to docker-image-tool.sh
[ https://issues.apache.org/jira/browse/SPARK-24129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-24129: - Assignee: Devaraj K > Add option to pass --build-arg's to docker-image-tool.sh > > > Key: SPARK-24129 > URL: https://issues.apache.org/jira/browse/SPARK-24129 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Devaraj K >Assignee: Devaraj K >Priority: Minor > Fix For: 2.4.0 > > > When we are working behind the firewall, we may need to pass the proxy > details as part of the docker --build-arg parameters to build the image. But > docker-image-tool.sh doesn't provide option to pass the proxy details or the > --build-arg to the docker command. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure
[ https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp resolved SPARK-24825. - Resolution: Fixed PR pushed, builds green, and now we have slightly more spammy build logs! :) thanks [~mcheah] > [K8S][TEST] Kubernetes integration tests don't trace the maven project > dependency structure > --- > > Key: SPARK-24825 > URL: https://issues.apache.org/jira/browse/SPARK-24825 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Assignee: Matt Cheah >Priority: Critical > > The Kubernetes integration tests will currently fail if maven installation is > not performed first, because the integration test build believes it should be > pulling the Spark parent artifact from maven central. However, this is > incorrect because the integration test should be building the Spark parent > pom as a dependency in the multi-module build, and the integration test > should just use the dynamically built artifact. Or to put it another way, the > integration test builds should never be pulling Spark dependencies from maven > central. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
[ https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24852: -- Shepherd: Joseph K. Bradley > Have spark.ml training use updated `Instrumentation` APIs. > -- > > Key: SPARK-24852 > URL: https://issues.apache.org/jira/browse/SPARK-24852 > Project: Spark > Issue Type: Story > Components: ML >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Major > > Port spark.ml code to use the new methods on the `Instrumentation` class and > remove the old methods & constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
[ https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-24852: - Assignee: Bago Amirbekian > Have spark.ml training use updated `Instrumentation` APIs. > -- > > Key: SPARK-24852 > URL: https://issues.apache.org/jira/browse/SPARK-24852 > Project: Spark > Issue Type: Story > Components: ML >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Major > > Port spark.ml code to use the new methods on the `Instrumentation` class and > remove the old methods & constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
[ https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-22151: - Assignee: Parth Gandhi > PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly > -- > > Key: SPARK-22151 > URL: https://issues.apache.org/jira/browse/SPARK-22151 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: Thomas Graves >Assignee: Parth Gandhi >Priority: Major > Fix For: 2.4.0 > > > Running in yarn cluster mode and trying to set pythonpath via > spark.yarn.appMasterEnv.PYTHONPATH doesn't work. > the yarn Client code looks at the env variables: > val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath) > But when you set spark.yarn.appMasterEnv it puts it into the local env. > So the python path set in spark.yarn.appMasterEnv isn't properly set. > You can work around if you are running in cluster mode by setting it on the > client like: > PYTHONPATH=./addon/python/ spark-submit -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
[ https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-22151: -- Fix Version/s: 2.4.0 > PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly > -- > > Key: SPARK-22151 > URL: https://issues.apache.org/jira/browse/SPARK-22151 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: Thomas Graves >Assignee: Parth Gandhi >Priority: Major > Fix For: 2.4.0 > > > Running in yarn cluster mode and trying to set pythonpath via > spark.yarn.appMasterEnv.PYTHONPATH doesn't work. > the yarn Client code looks at the env variables: > val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath) > But when you set spark.yarn.appMasterEnv it puts it into the local env. > So the python path set in spark.yarn.appMasterEnv isn't properly set. > You can work around if you are running in cluster mode by setting it on the > client like: > PYTHONPATH=./addon/python/ spark-submit -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
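For reference, a sketch of the configuration users expect to work in yarn-cluster mode, which this fix is about making effective. Setting it programmatically is shown purely as an illustration and assumes nothing beyond the public SparkConf API; the path is a placeholder.

{code:scala}
import org.apache.spark.SparkConf

// Intent: have the YARN application master pick up this PYTHONPATH.
// Per the report above, the YARN Client builds the python path from the
// submitter's sys.env, so a value set only this way was not picked up in
// cluster mode before the fix.
val conf = new SparkConf()
  .setAppName("pythonpath-example")
  .set("spark.yarn.appMasterEnv.PYTHONPATH", "./addon/python/")
{code}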
[jira] [Updated] (SPARK-24677) TaskSetManager not updating successfulTaskDurations for old stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24677: -- Fix Version/s: 2.2.3 > TaskSetManager not updating successfulTaskDurations for old stage attempts > -- > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > Fix For: 2.2.3, 2.4.0, 2.3.3 > > > When introducing SPARK-23433 , maybe cause stop sparkcontext. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
nirav patel created SPARK-24853: --- Summary: Support Column type for withColumn and withColumnRenamed apis Key: SPARK-24853 URL: https://issues.apache.org/jira/browse/SPARK-24853 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.2 Reporter: nirav patel Can we add overloaded versions of withColumn and withColumnRenamed that accept a Column instead of a String? That way a fully qualified column name can be given when there are duplicate column names, e.g. if a join produces two columns with the same name and I want to rename one of them, I could do it with the new API. This would be similar to the drop API, which supports both String and Column. The proposed overloads: def withColumn(colName: Column, col: Column): DataFrame Returns a new Dataset by adding a column or replacing the existing column that has the same name. def withColumnRenamed(existingName: Column, newName: Column): DataFrame Returns a new Dataset with a column renamed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
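The overloads above are the reporter's proposal, not an existing API. For comparison, a sketch of how the ambiguity is handled today with the existing string-based API and dataset aliases; the data, column names and aliases are made up for illustration.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("rename-example").getOrCreate()
import spark.implicits._

// Two frames that both carry a column named "value".
val left  = Seq((1, "a")).toDF("id", "value").as("l")
val right = Seq((1, "b")).toDF("id", "value").as("r")

// Today's workaround: disambiguate through aliases and re-select, because
// withColumnRenamed only accepts the (ambiguous) string name.
val joined = left.join(right, Seq("id"))
  .select(col("id"), col("l.value").as("left_value"), col("r.value").as("right_value"))

joined.show()
{code}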
[jira] [Comment Edited] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548140#comment-16548140 ] Kyle Prifogle edited comment on SPARK-12449 at 7/18/18 6:44 PM: What happened to this initiative? I came here trying to figure out why ".limit(10)" seemed to scan the entire table. was (Author: kprifogle1): What happened to this initiative? I came here trying to figure out why ".limit(10)" seemed to scan the entire table. Is slow down in some of this (seemingly critical) work an indication that the breaks have been put on open source spark and that databricks run time is the only future? > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler >Priority: Major > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24851) Map a Stage ID to it's Associated Job ID in UI
[ https://issues.apache.org/jira/browse/SPARK-24851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24851: Assignee: Apache Spark > Map a Stage ID to it's Associated Job ID in UI > -- > > Key: SPARK-24851 > URL: https://issues.apache.org/jira/browse/SPARK-24851 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Parth Gandhi >Assignee: Apache Spark >Priority: Trivial > > It would be nice to have a field in Stage Page UI which would show mapping of > the current stage id to the job id's to which that stage belongs to. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24851) Map a Stage ID to it's Associated Job ID in UI
[ https://issues.apache.org/jira/browse/SPARK-24851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548236#comment-16548236 ] Apache Spark commented on SPARK-24851: -- User 'pgandhi999' has created a pull request for this issue: https://github.com/apache/spark/pull/21809 > Map a Stage ID to it's Associated Job ID in UI > -- > > Key: SPARK-24851 > URL: https://issues.apache.org/jira/browse/SPARK-24851 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Parth Gandhi >Priority: Trivial > > It would be nice to have a field in Stage Page UI which would show mapping of > the current stage id to the job id's to which that stage belongs to. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24851) Map a Stage ID to it's Associated Job ID in UI
[ https://issues.apache.org/jira/browse/SPARK-24851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24851: Assignee: (was: Apache Spark) > Map a Stage ID to it's Associated Job ID in UI > -- > > Key: SPARK-24851 > URL: https://issues.apache.org/jira/browse/SPARK-24851 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Parth Gandhi >Priority: Trivial > > It would be nice to have a field in Stage Page UI which would show mapping of > the current stage id to the job id's to which that stage belongs to. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24677) TaskSetManager not updating successfulTaskDurations for old stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-24677. --- Resolution: Fixed Fix Version/s: 2.4.0 2.3.3 > TaskSetManager not updating successfulTaskDurations for old stage attempts > -- > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > Fix For: 2.3.3, 2.4.0 > > > When introducing SPARK-23433 , maybe cause stop sparkcontext. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24677) TaskSetManager not updating successfulTaskDurations for old stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-24677: - Assignee: dzcxzl > TaskSetManager not updating successfulTaskDurations for old stage attempts > -- > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > > When introducing SPARK-23433 , maybe cause stop sparkcontext. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24677) TaskSetManager not updating successfulTaskDurations for old stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24677: -- Summary: TaskSetManager not updating successfulTaskDurations for old stage attempts (was: Avoid NoSuchElementException from MedianHeap) > TaskSetManager not updating successfulTaskDurations for old stage attempts > -- > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Priority: Critical > > When introducing SPARK-23433 , maybe cause stop sparkcontext. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24677) Avoid NoSuchElementException from MedianHeap
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548210#comment-16548210 ] Thomas Graves edited comment on SPARK-24677 at 7/18/18 6:22 PM: This is really that it isn't updating successfulTaskDurations. In this case one of the older stage attempts (that is a zombie) marked the task as successful but then the newest stage attempt checked to see if it needed to speculate was (Author: tgraves): In this case one of the older stage attempts (that is a zombie) marked the task as successful but then the newest stage attempt checked to see if it needed to speculate > Avoid NoSuchElementException from MedianHeap > > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Priority: Critical > > When introducing SPARK-23433 , maybe cause stop sparkcontext. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24677) Avoid NoSuchElementException from MedianHeap
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548210#comment-16548210 ] Thomas Graves commented on SPARK-24677: --- In this case one of the older stage attempts (that is a zombie) marked the task as successful but then the newest stage attempt checked to see if it needed to speculate > Avoid NoSuchElementException from MedianHeap > > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Priority: Critical > > When introducing SPARK-23433 , maybe cause stop sparkcontext. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
[ https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548137#comment-16548137 ] Apache Spark commented on SPARK-24852: -- User 'MrBago' has created a pull request for this issue: https://github.com/apache/spark/pull/21799 > Have spark.ml training use updated `Instrumentation` APIs. > -- > > Key: SPARK-24852 > URL: https://issues.apache.org/jira/browse/SPARK-24852 > Project: Spark > Issue Type: Story > Components: ML >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Priority: Major > > Port spark.ml code to use the new methods on the `Instrumentation` class and > remove the old methods & constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548140#comment-16548140 ] Kyle Prifogle commented on SPARK-12449: --- What happened to this initiative? I came here trying to figure out why ".limit(10)" seemed to scan the entire table. Is slow down in some of this (seemingly critical) work an indication that the breaks have been put on open source spark and that databricks run time is the only future? > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler >Priority: Major > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
[ https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24852: Assignee: (was: Apache Spark) > Have spark.ml training use updated `Instrumentation` APIs. > -- > > Key: SPARK-24852 > URL: https://issues.apache.org/jira/browse/SPARK-24852 > Project: Spark > Issue Type: Story > Components: ML >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Priority: Major > > Port spark.ml code to use the new methods on the `Instrumentation` class and > remove the old methods & constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
[ https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24852: Assignee: Apache Spark > Have spark.ml training use updated `Instrumentation` APIs. > -- > > Key: SPARK-24852 > URL: https://issues.apache.org/jira/browse/SPARK-24852 > Project: Spark > Issue Type: Story > Components: ML >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Assignee: Apache Spark >Priority: Major > > Port spark.ml code to use the new methods on the `Instrumentation` class and > remove the old methods & constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
Bago Amirbekian created SPARK-24852: --- Summary: Have spark.ml training use updated `Instrumentation` APIs. Key: SPARK-24852 URL: https://issues.apache.org/jira/browse/SPARK-24852 Project: Spark Issue Type: Story Components: ML Affects Versions: 2.4.0 Reporter: Bago Amirbekian Port spark.ml code to use the new methods on the `Instrumentation` class and remove the old methods & constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12126) JDBC datasource processes filters only commonly pushed down.
[ https://issues.apache.org/jira/browse/SPARK-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548135#comment-16548135 ] Kyle Prifogle commented on SPARK-12126: --- Whats the hold up on this? I've noticed that the PR has been closed. In the case of pushing down `limit` it seems fairly straightforward to modify the query to append a limit before executing it. > JDBC datasource processes filters only commonly pushed down. > > > Key: SPARK-12126 > URL: https://issues.apache.org/jira/browse/SPARK-12126 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Major > > As suggested > [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646], > Currently JDBC datasource only processes the filters pushed down from > {{DataSourceStrategy}}. > Unlike ORC or Parquet, this can process pretty a lot of filters (for example, > a + b > 3) since it is just about string parsing. > As > [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526], > using {{CatalystScan}} trait might be one of solutions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
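Until a limit is pushed down automatically, the usual workaround is to hand it to the database yourself through the dbtable subquery, which the JDBC source already supports. The sketch below assumes a SparkSession named {{spark}} and the matching JDBC driver on the classpath; the URL, table and credentials are placeholders.

{code:scala}
// Workaround sketch: let the database apply the LIMIT by reading from a
// subquery instead of relying on .limit() being pushed down.
val limited = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "(SELECT * FROM events LIMIT 10) AS t")
  .option("user", "username")
  .option("password", "password")
  .load()

limited.show()
{code}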
[jira] [Commented] (SPARK-21261) SparkSQL regexpExpressions example
[ https://issues.apache.org/jira/browse/SPARK-21261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548119#comment-16548119 ] Apache Spark commented on SPARK-21261: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/21808 > SparkSQL regexpExpressions example > --- > > Key: SPARK-21261 > URL: https://issues.apache.org/jira/browse/SPARK-21261 > Project: Spark > Issue Type: Documentation > Components: Examples >Affects Versions: 2.1.1 >Reporter: zhangxin >Priority: Major > > The follow execute result. > scala> spark.sql(""" select regexp_replace('100-200', '(\d+)', 'num') > """).show > +--+ > |regexp_replace(100-200, (d+), num)| > +--+ > | 100-200| > +--+ > scala> spark.sql(""" select regexp_replace('100-200', '(\\d+)', 'num') > """).show > +---+ > |regexp_replace(100-200, (\d+), num)| > +---+ > |num-num| > +---+ > Add Comment -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
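Restating the example above in an easier-to-read form: inside the SQL string literal a single backslash is consumed as an escape, so the pattern that reaches the regexp engine is {{(d+)}}; doubling the backslash keeps {{\d}} intact. Assumes a spark-shell session with a SparkSession named {{spark}} and default parser settings.

{code:scala}
// With a single backslash the SQL parser drops it, the pattern becomes '(d+)'
// and nothing matches, so the input comes back unchanged.
spark.sql("""SELECT regexp_replace('100-200', '(\d+)', 'num') AS r""").show(false)
// expected: 100-200

// Doubling the backslash leaves '\d' for the regexp engine, which matches digits.
spark.sql("""SELECT regexp_replace('100-200', '(\\d+)', 'num') AS r""").show(false)
// expected: num-num
{code}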
[jira] [Created] (SPARK-24851) Map a Stage ID to it's Associated Job ID in UI
Parth Gandhi created SPARK-24851: Summary: Map a Stage ID to it's Associated Job ID in UI Key: SPARK-24851 URL: https://issues.apache.org/jira/browse/SPARK-24851 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.1, 2.3.0 Reporter: Parth Gandhi It would be nice to have a field in the Stage page UI that shows the mapping from the current stage ID to the job IDs that the stage belongs to. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548087#comment-16548087 ] Apache Spark commented on SPARK-24536: -- User 'mauropalsgraaf' has created a pull request for this issue: https://github.com/apache/spark/pull/21807 > Query with nonsensical LIMIT hits AssertionError > > > Key: SPARK-24536 > URL: https://issues.apache.org/jira/browse/SPARK-24536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Behm >Priority: Trivial > Labels: beginner, spree > > SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT) > fails in the QueryPlanner with: > {code} > java.lang.AssertionError: assertion failed: No plan for GlobalLimit null > {code} > I think this issue should be caught earlier during semantic analysis. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24536: Assignee: (was: Apache Spark) > Query with nonsensical LIMIT hits AssertionError > > > Key: SPARK-24536 > URL: https://issues.apache.org/jira/browse/SPARK-24536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Behm >Priority: Trivial > Labels: beginner, spree > > SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT) > fails in the QueryPlanner with: > {code} > java.lang.AssertionError: assertion failed: No plan for GlobalLimit null > {code} > I think this issue should be caught earlier during semantic analysis. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24536: Assignee: Apache Spark > Query with nonsensical LIMIT hits AssertionError > > > Key: SPARK-24536 > URL: https://issues.apache.org/jira/browse/SPARK-24536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Behm >Assignee: Apache Spark >Priority: Trivial > Labels: beginner, spree > > SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT) > fails in the QueryPlanner with: > {code} > java.lang.AssertionError: assertion failed: No plan for GlobalLimit null > {code} > I think this issue should be caught earlier during semantic analysis. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
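A sketch of the kind of up-front check the reporter suggests, purely illustrative and not Spark's actual analysis code; it assumes the limit expression has already been folded to a plain value.

{code:scala}
// Illustrative only: reject a LIMIT whose (already folded) value is null or
// negative during analysis, instead of failing later in the planner.
def validateLimitValue(folded: Any): Unit = folded match {
  case n: Int if n >= 0 => // ok
  case null =>
    throw new IllegalArgumentException("The limit expression must not be null")
  case n: Int =>
    throw new IllegalArgumentException(s"The limit expression must be non-negative, got $n")
  case other =>
    throw new IllegalArgumentException(s"The limit expression must be an integer, got $other")
}
{code}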
[jira] [Commented] (SPARK-24846) Stabilize expression cannonicalization
[ https://issues.apache.org/jira/browse/SPARK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548041#comment-16548041 ] Apache Spark commented on SPARK-24846: -- User 'gvr' has created a pull request for this issue: https://github.com/apache/spark/pull/21806 > Stabilize expression cannonicalization > -- > > Key: SPARK-24846 > URL: https://issues.apache.org/jira/browse/SPARK-24846 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Herman van Hovell >Priority: Major > Labels: spree > > Spark plan canonicalization is can be non-deterministic between different > versions of spark due to the fact that {{ExprId}} uses a UUID. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24846) Stabilize expression cannonicalization
[ https://issues.apache.org/jira/browse/SPARK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24846: Assignee: Apache Spark > Stabilize expression cannonicalization > -- > > Key: SPARK-24846 > URL: https://issues.apache.org/jira/browse/SPARK-24846 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Herman van Hovell >Assignee: Apache Spark >Priority: Major > Labels: spree > > Spark plan canonicalization is can be non-deterministic between different > versions of spark due to the fact that {{ExprId}} uses a UUID. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24846) Stabilize expression cannonicalization
[ https://issues.apache.org/jira/browse/SPARK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24846: Assignee: (was: Apache Spark) > Stabilize expression cannonicalization > -- > > Key: SPARK-24846 > URL: https://issues.apache.org/jira/browse/SPARK-24846 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Herman van Hovell >Priority: Major > Labels: spree > > Spark plan canonicalization is can be non-deterministic between different > versions of spark due to the fact that {{ExprId}} uses a UUID. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24850) Query plan string representation grows exponentially on queries with recursive cached datasets
[ https://issues.apache.org/jira/browse/SPARK-24850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24850: Assignee: (was: Apache Spark) > Query plan string representation grows exponentially on queries with > recursive cached datasets > -- > > Key: SPARK-24850 > URL: https://issues.apache.org/jira/browse/SPARK-24850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Onur Satici >Priority: Major > > As of [https://github.com/apache/spark/pull/21018], InMemoryRelation includes > its cacheBuilder when logging query plans. This CachedRDDBuilder includes the > cachedPlan, so calling treeString on InMemoryRelation will log the cachedPlan > in the cacheBuilder. > Given the sample dataset: > {code:java} > $ cat test.csv > A,B > 0,0{code} > If the query plan has multiple cached datasets that depend on each other: > {code:java} > var df_cached = spark.read.format("csv").option("header", > "true").load("test.csv").cache() > 0 to 1 foreach { _ => > df_cached = df_cached.join(spark.read.format("csv").option("header", > "true").load("test.csv"), "A").cache() > } > df_cached.explain > {code} > results in: > {code:java} > == Physical Plan == > InMemoryTableScan [A#10, B#11, B#35, B#87] > +- InMemoryRelation [A#10, B#11, B#35, B#87], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(2) Project [A#10, B#11, B#35, B#87] > +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight > :- *(2) Filter isnotnull(A#10) > : +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)] > : +- InMemoryRelation [A#10, B#11, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(2) Project [A#10, B#11, B#35] > +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight > :- *(2) Filter isnotnull(A#10) > : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] > : +- InMemoryRelation [A#10, B#11], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > +- *(1) Filter isnotnull(A#34) > +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] > +- InMemoryRelation [A#34, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > ,None) > : +- *(2) Project [A#10, B#11, B#35] > : +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight > : :- *(2) Filter isnotnull(A#10) > : : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] > : : +- InMemoryRelation [A#10, B#11], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : : +- *(1) FileScan csv 
[A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > : +- *(1) Filter isnotnull(A#34) > : +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] > : +- InMemoryRelation [A#34, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > +- *(1) Filter isnotnull(A#86) > +- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)] > +- InMemoryRelation [A#86, B#87], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV,
[jira] [Commented] (SPARK-24850) Query plan string representation grows exponentially on queries with recursive cached datasets
[ https://issues.apache.org/jira/browse/SPARK-24850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547996#comment-16547996 ] Apache Spark commented on SPARK-24850: -- User 'onursatici' has created a pull request for this issue: https://github.com/apache/spark/pull/21805 > Query plan string representation grows exponentially on queries with > recursive cached datasets > -- > > Key: SPARK-24850 > URL: https://issues.apache.org/jira/browse/SPARK-24850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Onur Satici >Priority: Major > > As of [https://github.com/apache/spark/pull/21018], InMemoryRelation includes > its cacheBuilder when logging query plans. This CachedRDDBuilder includes the > cachedPlan, so calling treeString on InMemoryRelation will log the cachedPlan > in the cacheBuilder. > Given the sample dataset: > {code:java} > $ cat test.csv > A,B > 0,0{code} > If the query plan has multiple cached datasets that depend on each other: > {code:java} > var df_cached = spark.read.format("csv").option("header", > "true").load("test.csv").cache() > 0 to 1 foreach { _ => > df_cached = df_cached.join(spark.read.format("csv").option("header", > "true").load("test.csv"), "A").cache() > } > df_cached.explain > {code} > results in: > {code:java} > == Physical Plan == > InMemoryTableScan [A#10, B#11, B#35, B#87] > +- InMemoryRelation [A#10, B#11, B#35, B#87], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(2) Project [A#10, B#11, B#35, B#87] > +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight > :- *(2) Filter isnotnull(A#10) > : +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)] > : +- InMemoryRelation [A#10, B#11, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(2) Project [A#10, B#11, B#35] > +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight > :- *(2) Filter isnotnull(A#10) > : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] > : +- InMemoryRelation [A#10, B#11], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > +- *(1) Filter isnotnull(A#34) > +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] > +- InMemoryRelation [A#34, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > ,None) > : +- *(2) Project [A#10, B#11, B#35] > : +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight > : :- *(2) Filter isnotnull(A#10) > : : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] > : : +- InMemoryRelation [A#10, B#11], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: 
InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > : +- *(1) Filter isnotnull(A#34) > : +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] > : +- InMemoryRelation [A#34, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > +- *(1) Filter isnotnull(A#86) > +- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)] > +- InMemoryRelation [A#86, B#87], > CachedRDDBuilder(true,1
[jira] [Assigned] (SPARK-24850) Query plan string representation grows exponentially on queries with recursive cached datasets
[ https://issues.apache.org/jira/browse/SPARK-24850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24850: Assignee: Apache Spark > Query plan string representation grows exponentially on queries with > recursive cached datasets > -- > > Key: SPARK-24850 > URL: https://issues.apache.org/jira/browse/SPARK-24850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Onur Satici >Assignee: Apache Spark >Priority: Major > > As of [https://github.com/apache/spark/pull/21018], InMemoryRelation includes > its cacheBuilder when logging query plans. This CachedRDDBuilder includes the > cachedPlan, so calling treeString on InMemoryRelation will log the cachedPlan > in the cacheBuilder. > Given the sample dataset: > {code:java} > $ cat test.csv > A,B > 0,0{code} > If the query plan has multiple cached datasets that depend on each other: > {code:java} > var df_cached = spark.read.format("csv").option("header", > "true").load("test.csv").cache() > 0 to 1 foreach { _ => > df_cached = df_cached.join(spark.read.format("csv").option("header", > "true").load("test.csv"), "A").cache() > } > df_cached.explain > {code} > results in: > {code:java} > == Physical Plan == > InMemoryTableScan [A#10, B#11, B#35, B#87] > +- InMemoryRelation [A#10, B#11, B#35, B#87], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(2) Project [A#10, B#11, B#35, B#87] > +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight > :- *(2) Filter isnotnull(A#10) > : +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)] > : +- InMemoryRelation [A#10, B#11, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(2) Project [A#10, B#11, B#35] > +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight > :- *(2) Filter isnotnull(A#10) > : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] > : +- InMemoryRelation [A#10, B#11], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > +- *(1) Filter isnotnull(A#34) > +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] > +- InMemoryRelation [A#34, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > ,None) > : +- *(2) Project [A#10, B#11, B#35] > : +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight > : :- *(2) Filter isnotnull(A#10) > : : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] > : : +- InMemoryRelation [A#10, B#11], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : : +- *(1) 
FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > : +- *(1) Filter isnotnull(A#34) > : +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] > : +- InMemoryRelation [A#34, B#35], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, > Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > ,None) > : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: > InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > false])) > +- *(1) Filter isnotnull(A#86) > +- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)] > +- InMemoryRelation [A#86, B#87], > CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 > replicas),*(1) FileScan csv [A#10,B#11] Batch
[jira] [Created] (SPARK-24850) Query plan string representation grows exponentially on queries with recursive cached datasets
Onur Satici created SPARK-24850: --- Summary: Query plan string representation grows exponentially on queries with recursive cached datasets Key: SPARK-24850 URL: https://issues.apache.org/jira/browse/SPARK-24850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Onur Satici As of [https://github.com/apache/spark/pull/21018], InMemoryRelation includes its cacheBuilder when logging query plans. This CachedRDDBuilder includes the cachedPlan, so calling treeString on InMemoryRelation will log the cachedPlan in the cacheBuilder. Given the sample dataset: {code:java} $ cat test.csv A,B 0,0{code} If the query plan has multiple cached datasets that depend on each other: {code:java} var df_cached = spark.read.format("csv").option("header", "true").load("test.csv").cache() 0 to 1 foreach { _ => df_cached = df_cached.join(spark.read.format("csv").option("header", "true").load("test.csv"), "A").cache() } df_cached.explain {code} results in: {code:java} == Physical Plan == InMemoryTableScan [A#10, B#11, B#35, B#87] +- InMemoryRelation [A#10, B#11, B#35, B#87], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) Project [A#10, B#11, B#35, B#87] +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight :- *(2) Filter isnotnull(A#10) : +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)] : +- InMemoryRelation [A#10, B#11, B#35], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) Project [A#10, B#11, B#35] +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight :- *(2) Filter isnotnull(A#10) : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] : +- InMemoryRelation [A#10, B#11], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false])) +- *(1) Filter isnotnull(A#34) +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] +- InMemoryRelation [A#34, B#35], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) : +- *(2) Project [A#10, B#11, B#35] : +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight : :- *(2) Filter isnotnull(A#10) : : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)] : : +- InMemoryRelation [A#10, B#11], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) : : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false])) : +- *(1) Filter isnotnull(A#34) : +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)] : +- 
InMemoryRelation [A#34, B#35], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false])) +- *(1) Filter isnotnull(A#86) +- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)] +- InMemoryRelation [A#86, B#87], CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ,None) +- *(2) Project [A#10, B#11, B#35, B#87] +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight :- *(2) Filter isnotnull(A#10) : +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(
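For readers who want to observe the growth reported in SPARK-24850 without wading through the plan dump above, the following sketch prints how quickly the physical plan's string representation grows as self-referencing cache layers are added. It assumes a local SparkSession and the same two-column test.csv as the report; the loop depth and the use of queryExecution.executedPlan.treeString are this sketch's own choices, not taken from the ticket.

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch: measure how the plan string grows as cached datasets are layered
// on top of each other (illustrating the behavior described in SPARK-24850).
object PlanStringGrowth {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-24850-sketch").getOrCreate()
    var df = spark.read.format("csv").option("header", "true").load("test.csv").cache()
    (1 to 3).foreach { i =>
      df = df.join(spark.read.format("csv").option("header", "true").load("test.csv"), "A").cache()
      // treeString is what explain() ultimately prints; its length should jump
      // sharply per iteration if the cachedPlan is logged recursively.
      println(s"iteration $i: plan string length = ${df.queryExecution.executedPlan.treeString.length}")
    }
    spark.stop()
  }
}
{code}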
[jira] [Commented] (SPARK-24268) DataType in error messages are not coherent
[ https://issues.apache.org/jira/browse/SPARK-24268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547939#comment-16547939 ] Apache Spark commented on SPARK-24268: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21804 > DataType in error messages are not coherent > --- > > Key: SPARK-24268 > URL: https://issues.apache.org/jira/browse/SPARK-24268 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > > In SPARK-22893 there was a tentative to unify the way dataTypes are reported > in error messages. There, we decided to use always {{dataType.simpleString}}. > Unfortunately, we missed many places where this still needed to be fixed. > Moreover, it turns out that the right method to use is not {{simpleString}}, > but we should use {{catalogString}} instead (for further details please check > the discussion in the PR https://github.com/apache/spark/pull/21321). > So we should update all the missing places in order to provide error messages > coherently throughout the project. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547919#comment-16547919 ] Apache Spark commented on SPARK-24849: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21803 > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new methods which should convert a value of StructType to a > schema in DDL format . It should be possible to use the former string in new > table creation by just copy-pasting of new method results. The existing > methods simpleString(), catalogString() and sql() put ':' between top level > field name and its type, and wrap by the *struct* word > {code} > ds.schema.catalogString > struct {code} > Output of new method should be > {code} > metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24849: Assignee: Apache Spark > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Need to add new methods which should convert a value of StructType to a > schema in DDL format . It should be possible to use the former string in new > table creation by just copy-pasting of new method results. The existing > methods simpleString(), catalogString() and sql() put ':' between top level > field name and its type, and wrap by the *struct* word > {code} > ds.schema.catalogString > struct {code} > Output of new method should be > {code} > metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24849: Assignee: (was: Apache Spark) > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new methods which should convert a value of StructType to a > schema in DDL format . It should be possible to use the former string in new > table creation by just copy-pasting of new method results. The existing > methods simpleString(), catalogString() and sql() put ':' between top level > field name and its type, and wrap by the *struct* word > {code} > ds.schema.catalogString > struct {code} > Output of new method should be > {code} > metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24628) Typos of the example code in docs/mllib-data-types.md
[ https://issues.apache.org/jira/browse/SPARK-24628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-24628: - Assignee: Weizhe Huang > Typos of the example code in docs/mllib-data-types.md > - > > Key: SPARK-24628 > URL: https://issues.apache.org/jira/browse/SPARK-24628 > Project: Spark > Issue Type: Documentation > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Weizhe Huang >Assignee: Weizhe Huang >Priority: Minor > Fix For: 2.4.0 > > Original Estimate: 10m > Remaining Estimate: 10m > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24628) Typos of the example code in docs/mllib-data-types.md
[ https://issues.apache.org/jira/browse/SPARK-24628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24628. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21612 [https://github.com/apache/spark/pull/21612] > Typos of the example code in docs/mllib-data-types.md > - > > Key: SPARK-24628 > URL: https://issues.apache.org/jira/browse/SPARK-24628 > Project: Spark > Issue Type: Documentation > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Weizhe Huang >Assignee: Weizhe Huang >Priority: Minor > Fix For: 2.4.0 > > Original Estimate: 10m > Remaining Estimate: 10m > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes
[ https://issues.apache.org/jira/browse/SPARK-24093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24093. --- Resolution: Won't Fix > Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to > outside of the classes > --- > > Key: SPARK-24093 > URL: https://issues.apache.org/jira/browse/SPARK-24093 > Project: Spark > Issue Type: Wish > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Weiqing Yang >Priority: Minor > > To allow third parties to obtain information about the streaming writer, for > example the "writer" and the "topic" that streaming data is written into, this > jira proposes making the relevant fields of KafkaStreamWriter and > InternalRowMicroBatchWriter visible outside of those classes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24804) There are duplicate words in the title in the DatasetSuite
[ https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24804. --- Resolution: Fixed Assignee: hantiantian Fix Version/s: 2.4.0 This is too trivial for a Jira [~hantiantian], but OK for a first contribution. Resolved by https://github.com/apache/spark/pull/21767 > There are duplicate words in the title in the DatasetSuite > -- > > Key: SPARK-24804 > URL: https://issues.apache.org/jira/browse/SPARK-24804 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: hantiantian >Assignee: hantiantian >Priority: Trivial > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547906#comment-16547906 ] Takeshi Yamamuro commented on SPARK-24849: -- What is this new func used for? Is this the sub-ticket of another work? > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new methods which should convert a value of StructType to a > schema in DDL format . It should be possible to use the former string in new > table creation by just copy-pasting of new method results. The existing > methods simpleString(), catalogString() and sql() put ':' between top level > field name and its type, and wrap by the *struct* word > {code} > ds.schema.catalogString > struct {code} > Output of new method should be > {code} > metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547903#comment-16547903 ] Thomas Graves commented on SPARK-24615: --- Did the design doc permissions change? I can't seem to access it now. A few overall concerns. We are now making accelerator configurations available per stage, but what about CPU and memory? If we are going to start making things configurable at the stage/RDD level, it would be nice to be consistent; people have asked for this ability in the past. What about the case where, to run some ML algorithm, you would want machines of different types? For instance, TensorFlow with a parameter server might want GPU nodes for the workers, but the parameter server would just need a CPU. This would also apply to the barrier scheduler, so I might cross-post there. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator cards (GPU, FPGA, TPU) are > predominant compared to CPUs. To make the current Spark architecture work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule tasks onto the executors that are > equipped with accelerators. > Spark's current scheduler schedules tasks based on the locality of the data > plus the availability of CPUs. This introduces some problems when scheduling > tasks that require accelerators. > # CPU cores usually outnumber accelerators on a node, so using CPU cores > to schedule accelerator-required tasks introduces a mismatch. > # In a cluster, we can always assume that every node has CPUs, but this is > not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator-required or not) > requires the scheduler to schedule tasks in a smarter way. > So here we propose to improve the current scheduler to support heterogeneous > tasks (accelerator-required or not). This can be part of the work on Project > Hydrogen. > Details are attached in the Google doc. It doesn't cover all the implementation > details, just highlights the parts that should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24796) Support GROUPED_AGG_PANDAS_UDF in Pivot
[ https://issues.apache.org/jira/browse/SPARK-24796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547901#comment-16547901 ] Xiao Li commented on SPARK-24796: - [~icexelloss] Thank you! > Support GROUPED_AGG_PANDAS_UDF in Pivot > --- > > Key: SPARK-24796 > URL: https://issues.apache.org/jira/browse/SPARK-24796 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > Currently, Grouped AGG PandasUDF is not supported in Pivot. It is nice to > support it. > {code} > # create input dataframe > from pyspark.sql import Row > data = [ > Row(id=123, total=200.0, qty=3, name='item1'), > Row(id=124, total=1500.0, qty=1, name='item2'), > Row(id=125, total=203.5, qty=2, name='item3'), > Row(id=126, total=200.0, qty=500, name='item1'), > ] > df = spark.createDataFrame(data) > from pyspark.sql.functions import pandas_udf, PandasUDFType > @pandas_udf('double', PandasUDFType.GROUPED_AGG) > def pandas_avg(v): >return v.mean() > from pyspark.sql.functions import col, sum > > applied_df = > df.groupby('id').pivot('name').agg(pandas_avg('total').alias('mean')) > applied_df.show() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24849) Convert StructType to DDL string
[ https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547876#comment-16547876 ] Maxim Gekk commented on SPARK-24849: I am working on the ticket. > Convert StructType to DDL string > > > Key: SPARK-24849 > URL: https://issues.apache.org/jira/browse/SPARK-24849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Need to add new methods which should convert a value of StructType to a > schema in DDL format . It should be possible to use the former string in new > table creation by just copy-pasting of new method results. The existing > methods simpleString(), catalogString() and sql() put ':' between top level > field name and its type, and wrap by the *struct* word > {code} > ds.schema.catalogString > struct {code} > Output of new method should be > {code} > metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24849) Convert StructType to DDL string
Maxim Gekk created SPARK-24849: -- Summary: Convert StructType to DDL string Key: SPARK-24849 URL: https://issues.apache.org/jira/browse/SPARK-24849 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk We need to add new methods which convert a value of StructType to a schema in DDL format. It should be possible to use the resulting string in new table creation by simply copy-pasting the output of the new methods. The existing methods simpleString(), catalogString() and sql() put ':' between a top-level field name and its type, and wrap the result in the *struct* word {code} ds.schema.catalogString struct {code} Output of the new method should be {code} metaData struct {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
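To make the requested output concrete, here is a minimal sketch of the kind of StructType-to-DDL conversion SPARK-24849 asks for. The helper name toDDL and the sample schema are this sketch's own inventions; it simply renders each top-level field as "name type" using catalogString for the (possibly nested) field type, and a real implementation may differ, for example in how field names are quoted.

{code:java}
import org.apache.spark.sql.types._

// Illustrative StructType -> DDL-style string conversion (not Spark's implementation).
object DdlSketch {
  def toDDL(schema: StructType): String =
    schema.fields
      .map(f => s"${f.name} ${f.dataType.catalogString}") // "name type" per top-level field
      .mkString(", ")

  def main(args: Array[String]): Unit = {
    val schema = StructType(Seq(
      StructField("metaData", StructType(Seq(
        StructField("eventId", StringType),
        StructField("ts", TimestampType))))))
    println(schema.catalogString) // struct<metaData:struct<eventId:string,ts:timestamp>>
    println(toDDL(schema))        // metaData struct<eventId:string,ts:timestamp>
  }
}
{code}

The second println shows the shape the ticket wants: no wrapping struct word and a space instead of ':' at the top level, so the string can be pasted directly into a CREATE TABLE column list.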
[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547873#comment-16547873 ] Li Yuanjian commented on SPARK-24295: - Thanks for your detailed explanation. You can check SPARK-17604; it seems to describe the same requirement of purging aged compact files. The small difference is that we need the purge logic in FileStreamSinkLog while that jira covers the source-side metadata log, but I think the strategy can be reused. Also cc the original author [~jerryshao2015] of SPARK-17604. > Purge Structured streaming FileStreamSinkLog metadata compact file data. > > > Key: SPARK-24295 > URL: https://issues.apache.org/jira/browse/SPARK-24295 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Iqbal Singh >Priority: Major > > FileStreamSinkLog metadata logs are concatenated into a single compact file > after a defined compact interval. > For long-running jobs, the compact file can grow to tens of GBs, causing > slowness when reading data from the FileStreamSinkLog dir, as Spark defaults > to the "_spark_metadata" dir for the read. > We need functionality to purge the compact file. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24844) spark REST API need to add ipFilter
[ https://issues.apache.org/jira/browse/SPARK-24844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-24844: - Priority: Minor (was: Blocker) > spark REST API need to add ipFilter > --- > > Key: SPARK-24844 > URL: https://issues.apache.org/jira/browse/SPARK-24844 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 > Environment: all server >Reporter: daijiacheng >Priority: Minor > > Spark has a hidden REST API which handles application submission, status > checking and cancellation. However, it does not allow restricting access by > IP, so when I enable this feature my server may be attacked. An ipFilter is > needed to filter out unwanted IPs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24844) spark REST API need to add ipFilter
[ https://issues.apache.org/jira/browse/SPARK-24844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547831#comment-16547831 ] Takeshi Yamamuro commented on SPARK-24844: -- The 'Blocker' priority tag is reserved for committers. > spark REST API need to add ipFilter > --- > > Key: SPARK-24844 > URL: https://issues.apache.org/jira/browse/SPARK-24844 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 > Environment: all server >Reporter: daijiacheng >Priority: Minor > > Spark has a hidden REST API which handles application submission, status > checking and cancellation. However, it does not allow restricting access by > IP, so when I enable this feature my server may be attacked. An ipFilter is > needed to filter out unwanted IPs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547781#comment-16547781 ] Apache Spark commented on SPARK-23928: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/21802 > High-order function: shuffle(x) → array > --- > > Key: SPARK-23928 > URL: https://issues.apache.org/jira/browse/SPARK-23928 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Generate a random permutation of the given array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24843) Spark2 job (in cluster mode) is unable to execute steps in HBase (error# java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/CompatibilityFactory)
[ https://issues.apache.org/jira/browse/SPARK-24843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547774#comment-16547774 ] Manish commented on SPARK-24843: Thanks Wang. I am setting it using the export command before running spark2-submit. It works perfectly fine in client mode but not in cluster mode. Any leads would be very helpful. {color:#205081}export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf:/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/hbase-common-1.2.0-cdh5.11.1.jar:/home/svc-cop-realtime-d/scala1/jar/lib/hbase-rdd_2.11-0.8.0.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/hbase-hadoop2-compat-1.2.0-cdh5.11.1.jar{color} > Spark2 job (in cluster mode) is unable to execute steps in HBase (error# > java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/CompatibilityFactory) > -- > > Key: SPARK-24843 > URL: https://issues.apache.org/jira/browse/SPARK-24843 > Project: Spark > Issue Type: Bug > Components: Build, Java API >Affects Versions: 2.1.0 >Reporter: Manish >Priority: Major > > I am running a Spark2 streaming job that does processing in HBase. It works > perfectly fine with client deploy mode but does not work with cluster deploy > mode. Below is the error message: > |{color:#ff}_User class threw exception: java.lang.NoClassDefFoundError: > org/apache/hadoop/hbase/CompatibilityFactory_{color}| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24848) When a stage fails onStageCompleted is called before onTaskEnd
[ https://issues.apache.org/jira/browse/SPARK-24848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yavgeni Hotimsky updated SPARK-24848: - Description: It seems that when a stage fails because one of its tasks failed too many times, the onStageCompleted callback of the SparkListener is called before the onTaskEnd listener for the failing task. We're using structured streaming in this case. We noticed this because we built a listener to track the precise number of active tasks to be exported as a metric and were using the stage callback to maintain a map from stage ids to some metadata extracted from the jobGroupId. The onStageCompleted listener was removing from the map to prevent unbounded memory usage, and in this case I could see the onTaskEnd callback was being called after the onStageCompleted callback, so it couldn't find the stageId in the map. We worked around it by replacing the map with a timed cache. was: It seems that when a stage fails because one of its tasks failed too many times, the onStageCompleted callback of the SparkListener is called before the onTaskEnd listener for the failing task. We're using structured streaming in this case. We noticed this because we built a listener to track the precise number of active tasks per one of my processes to be exported as a metric and were using the stage callback to maintain a map from stage ids to some metadata extracted from the jobGroupId. The onStageCompleted listener was removing from the map to prevent unbounded memory, and in this case I could see the onTaskEnd callback was being called after the onStageCompleted callback, so it couldn't find the stageId in the map. We worked around it by replacing the map with a timed cache. > When a stage fails onStageCompleted is called before onTaskEnd > -- > > Key: SPARK-24848 > URL: https://issues.apache.org/jira/browse/SPARK-24848 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yavgeni Hotimsky >Priority: Minor > > It seems that when a stage fails because one of its tasks failed too many > times, the onStageCompleted callback of the SparkListener is called before the > onTaskEnd listener for the failing task. We're using structured streaming in > this case. > We noticed this because we built a listener to track the precise number of > active tasks to be exported as a metric and were using the stage callback to > maintain a map from stage ids to some metadata extracted from the jobGroupId. > The onStageCompleted listener was removing from the map to prevent unbounded > memory usage, and in this case I could see the onTaskEnd callback was being > called after the onStageCompleted callback, so it couldn't find the stageId in > the map. We worked around it by replacing the map with a timed cache. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24848) When a stage fails onStageCompleted is called before onTaskEnd
Yavgeni Hotimsky created SPARK-24848: Summary: When a stage fails onStageCompleted is called before onTaskEnd Key: SPARK-24848 URL: https://issues.apache.org/jira/browse/SPARK-24848 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Yavgeni Hotimsky It seems that when a stage fails because one of its tasks failed too many times, the onStageCompleted callback of the SparkListener is called before the onTaskEnd listener for the failing task. We're using structured streaming in this case. We noticed this because we built a listener to track the precise number of active tasks per one of my processes to be exported as a metric and were using the stage callback to maintain a map from stage ids to some metadata extracted from the jobGroupId. The onStageCompleted listener was removing from the map to prevent unbounded memory, and in this case I could see the onTaskEnd callback was being called after the onStageCompleted callback, so it couldn't find the stageId in the map. We worked around it by replacing the map with a timed cache. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
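To illustrate the setup described in SPARK-24848, here is a minimal sketch of a listener that keeps a stage-id-to-metadata map and an active-task counter. Class and field names are invented for the sketch and metric export is reduced to a println; it relies on the assumption the report says is violated on stage failure, namely that every onTaskEnd for a stage arrives before that stage's onStageCompleted, which is why the lookup in onTaskEnd can miss and why the reporter moved to a timed cache.

{code:java}
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap
import org.apache.spark.scheduler._

// Illustrative listener for the scenario in SPARK-24848 (not the reporter's code).
class ActiveTaskListener extends SparkListener {
  private val stageMeta = TrieMap.empty[Int, String]   // stageId -> metadata from the job group
  private val activeTasks = new AtomicInteger(0)

  override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
    val jobGroup = Option(event.properties)
      .map(_.getProperty("spark.jobGroup.id", "unknown")).getOrElse("unknown")
    stageMeta.put(event.stageInfo.stageId, jobGroup)
  }

  override def onTaskStart(event: SparkListenerTaskStart): Unit =
    activeTasks.incrementAndGet()

  override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
    activeTasks.decrementAndGet()
    stageMeta.get(event.stageId) match {
      case Some(meta) => println(s"active=${activeTasks.get()} group=$meta") // stand-in for metric export
      case None       => () // stage entry already removed by onStageCompleted (the reported ordering)
    }
  }

  // On a failed stage this can run before the last onTaskEnd, so the entry
  // disappears too early; a time-based cache avoids losing the metadata.
  override def onStageCompleted(event: SparkListenerStageCompleted): Unit =
    stageMeta.remove(event.stageInfo.stageId)
}
{code}

Registration would be the usual spark.sparkContext.addSparkListener(new ActiveTaskListener).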
[jira] [Created] (SPARK-24847) ScalaReflection#schemaFor occasionally fails to detect schema for Seq of type alias
Ahmed Mahran created SPARK-24847: Summary: ScalaReflection#schemaFor occasionally fails to detect schema for Seq of type alias Key: SPARK-24847 URL: https://issues.apache.org/jira/browse/SPARK-24847 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Ahmed Mahran org.apache.spark.sql.catalyst.ScalaReflection#schemaFor occasionally fails to detect schema for Seq of type alias (and it occasionally succeeds). {code:java} object Types { type Alias1 = Long type Alias2 = Int type Alias3 = Int } case class B(b1: Alias1, b2: Seq[Alias2], b3: Option[Alias3]) case class A(a1: B, a2: Int) {code} {code} import sparkSession.implicits._ val seq = Seq( A(B(2L, Seq(3), Some(1)), 1), A(B(3L, Seq(2), Some(2)), 2) ) val ds = sparkSession.createDataset(seq) {code} {code:java} java.lang.UnsupportedOperationException: Schema for type Seq[Types.Alias2] is not supported at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:780) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:715) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:714) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:381) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:380) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:380) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:150) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:150) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:391) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:380) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:380) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:150) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:150) at org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:138) at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:72) at org.apache.spark.sql.Encoders$.product(Encoders.scala:275) at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:248) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
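For convenience, here is a self-contained version of the reproduction above. It only adds the pieces the snippet assumes (a local SparkSession, the import of Types, and a main method) and changes nothing about the types involved.

{code:java}
import org.apache.spark.sql.SparkSession

object Types {
  type Alias1 = Long
  type Alias2 = Int
  type Alias3 = Int
}
import Types._

case class B(b1: Alias1, b2: Seq[Alias2], b3: Option[Alias3])
case class A(a1: B, a2: Int)

object Spark24847Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-24847-repro").getOrCreate()
    import spark.implicits._
    val seq = Seq(
      A(B(2L, Seq(3), Some(1)), 1),
      A(B(3L, Seq(2), Some(2)), 2))
    // Per the report, this intermittently fails with:
    // UnsupportedOperationException: Schema for type Seq[Types.Alias2] is not supported
    val ds = spark.createDataset(seq)
    ds.show()
    spark.stop()
  }
}
{code}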
[jira] [Updated] (SPARK-18600) BZ2 CRC read error needs better reporting
[ https://issues.apache.org/jira/browse/SPARK-18600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-18600: -- Labels: spree (was: ) > BZ2 CRC read error needs better reporting > - > > Key: SPARK-18600 > URL: https://issues.apache.org/jira/browse/SPARK-18600 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Charles R Allen >Priority: Minor > Labels: spree > > {code} > 16/11/25 20:05:03 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 148 > in stage 5.0 failed 1 times, most recent failure: Lost task 148.0 in stage > 5.0 (TID 5945, localhost): org.apache.spark.SparkException: Task failed while > writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: com.univocity.parsers.common.TextParsingException: > java.lang.IllegalStateException - Error reading from input > Parser Configuration: CsvParserSettings: > Auto configuration enabled=true > Autodetect column delimiter=false > Autodetect quotes=false > Column reordering enabled=true > Empty value=null > Escape unquoted values=false > Header extraction enabled=null > Headers=[INTERVALSTARTTIME_GMT, INTERVALENDTIME_GMT, OPR_DT, OPR_HR, > NODE_ID_XML, NODE_ID, NODE, MARKET_RUN_ID, LMP_TYPE, XML_DATA_ITEM, > PNODE_RESMRID, GRP_TYPE, POS, VALUE, OPR_INTERVAL, GROUP] > Ignore leading whitespaces=false > Ignore trailing whitespaces=false > Input buffer size=128 > Input reading on separate thread=false > Keep escape sequences=false > Line separator detection enabled=false > Maximum number of characters per column=100 > Maximum number of columns=20480 > Normalize escaped line separators=true > Null value= > Number of records to read=all > Row processor=none > RowProcessor error handler=null > Selected fields=none > Skip empty lines=true > Unescaped quote handling=STOP_AT_DELIMITERFormat configuration: > CsvFormat: > Comment character=\0 > Field delimiter=, > Line separator (normalized)=\n > Line separator sequence=\n > Quote character=" > Quote escape character=\ > Quote escape escape character=null > Internal state when error was thrown: line=27089, column=13, record=27089, > charIndex=4451456, headers=[INTERVALSTARTTIME_GMT, INTERVALENDTIME_GMT, > OPR_DT, OPR_HR, NODE_ID_XML, NODE_ID, NODE, MARKET_RUN_ID, LMP_TYPE, > XML_DATA_ITEM, PNODE_RESMRID, GRP_TYPE, POS, VALUE, OPR_INTERVAL, GROUP] > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:302) > at > com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:431) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:148) > at > 
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:131) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execu
[jira] [Updated] (SPARK-23612) Specify formats for individual DateType and TimestampType columns in schemas
[ https://issues.apache.org/jira/browse/SPARK-23612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-23612: -- Labels: DataType date spree sql (was: DataType date sql) > Specify formats for individual DateType and TimestampType columns in schemas > > > Key: SPARK-23612 > URL: https://issues.apache.org/jira/browse/SPARK-23612 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Patrick Young >Priority: Minor > Labels: DataType, date, spree, sql > > [https://github.com/apache/spark/blob/407f67249639709c40c46917700ed6dd736daa7d/python/pyspark/sql/types.py#L162-L200] > It would be very helpful if it were possible to specify the format for > individual columns in a schema when reading csv files, rather than one format: > {code:java|title=Bar.python|borderStyle=solid} > # Currently can only do something like: > spark.read.option("dateFormat", "yyyyMMdd").csv(...) > # Would like to be able to do something like: > schema = StructType([ > StructField("date1", DateType(format="MM/dd/yyyy"), True), > StructField("date2", DateType(format="yyyyMMdd"), True) > ]) > spark.read.schema(schema).csv(...) > {code} > Thanks for any help, input! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
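Until something like the proposed per-column format option exists, a common workaround is to read the date columns as strings and convert each one with its own format. The sketch below is in Scala rather than PySpark (the PySpark equivalent uses the same functions); the file path, column names, and format strings are made up for illustration, while to_date(column, format) is an existing Spark SQL function.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

// Workaround sketch: per-column date parsing after reading the dates as strings.
object PerColumnDateFormats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("per-column-date-formats").getOrCreate()
    val raw = spark.read.option("header", "true").csv("/tmp/dates.csv") // date1, date2 arrive as strings
    val parsed = raw
      .withColumn("date1", to_date(col("date1"), "MM/dd/yyyy"))
      .withColumn("date2", to_date(col("date2"), "yyyyMMdd"))
    parsed.printSchema() // date1 and date2 are now DateType
    spark.stop()
  }
}
{code}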
[jira] [Updated] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators
[ https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-24838: -- Labels: spree (was: ) > Support uncorrelated IN/EXISTS subqueries for more operators > - > > Key: SPARK-24838 > URL: https://issues.apache.org/jira/browse/SPARK-24838 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Qifan Pu >Priority: Major > Labels: spree > > Currently, CheckAnalysis allows IN/EXISTS subquery only for filter operators. > Running a query: > {{select name in (select * from valid_names)}} > {{from all_names}} > returns error: > {code:java} > Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries > can only be used in a Filter > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24846) Stabilize expression canonicalization
Herman van Hovell created SPARK-24846: - Summary: Stabilize expression canonicalization Key: SPARK-24846 URL: https://issues.apache.org/jira/browse/SPARK-24846 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Herman van Hovell Spark plan canonicalization can be non-deterministic between different versions of Spark due to the fact that {{ExprId}} uses a UUID. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-24536: -- Labels: beginner spree (was: beginner) > Query with nonsensical LIMIT hits AssertionError > > > Key: SPARK-24536 > URL: https://issues.apache.org/jira/browse/SPARK-24536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Behm >Priority: Trivial > Labels: beginner, spree > > SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT) > fails in the QueryPlanner with: > {code} > java.lang.AssertionError: assertion failed: No plan for GlobalLimit null > {code} > I think this issue should be caught earlier during semantic analysis. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
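To make the "caught earlier during semantic analysis" suggestion in SPARK-24536 concrete, here is a standalone sketch of such a check. It is not Spark's actual CheckAnalysis code: the object and method names are invented, and it throws IllegalArgumentException instead of Spark's analysis error type as a simplification. It rejects a LIMIT expression that is non-foldable, non-integer, null, or negative before planning ever sees it, which would cover the CAST(NULL AS INT) case above via the null branch.

{code:java}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.IntegerType

// Standalone sketch of an analysis-time LIMIT sanity check (illustrative only).
object LimitCheckSketch {
  def validateLimit(limitExpr: Expression): Unit = {
    if (!limitExpr.foldable) {
      throw new IllegalArgumentException(
        s"The limit expression must evaluate to a constant value, but got: ${limitExpr.sql}")
    }
    if (limitExpr.dataType != IntegerType) {
      throw new IllegalArgumentException(
        s"The limit expression must be integer type, but got: ${limitExpr.dataType.catalogString}")
    }
    limitExpr.eval() match {
      case null =>
        throw new IllegalArgumentException("The evaluated limit expression must not be null")
      case v: Int if v < 0 =>
        throw new IllegalArgumentException(s"The limit expression must be >= 0, but got: $v")
      case _ => // ok
    }
  }
}
{code}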