[jira] [Commented] (SPARK-27827) File does not exist notice is misleading in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-27827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857297#comment-16857297 ] zhoukang commented on SPARK-27827: -- I just tested this in a 2.3 cluster [~dongjoon] > File does not exist notice is misleading in FileScanRDD > --- > > Key: SPARK-27827 > URL: https://issues.apache.org/jira/browse/SPARK-27827 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: zhoukang >Priority: Minor > > When we encounter the error below, we try "refresh table" and expect the > error not to be thrown again. > {code:java} > Error: java.lang.IllegalStateException: Can't overwrite cause with > java.io.FileNotFoundException: File does not exist: > /user/s_xdata/kuduhive_warehouse/info_dev/dws_quality_time_dictionary/part-3-92c84bf9-99c0-49d9-8cdf-78b1844d75c3.snappy.parquet > It is possible the underlying files have been updated. You can explicitly > invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in > SQL or by recreating the Dataset/DataFrame involved. (state=,code=0) > {code} > The cause is that the 'InMemoryFileIndex' is cached in 'HiveMetaStoreCatalog', and the > refresh command only invalidates the table for the current session. The notice is > misleading when we have a long-running thriftserver. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
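The interaction described in the report — a file listing cached at the catalog level while REFRESH TABLE only clears session-level state — can be modeled with a small, purely illustrative Python sketch. These are not Spark's actual classes; all names here are hypothetical:

```python
# Illustrative model of the reported behavior: a catalog-level file index
# shared across sessions (standing in for the InMemoryFileIndex cached in
# HiveMetaStoreCatalog), while REFRESH TABLE only drops per-session state.
shared_file_index = {}  # table name -> cached file listing


class Session:
    def __init__(self):
        self.relation_cache = {}  # per-session cached table state

    def scan(self, table, filesystem):
        # A session-level cache miss falls through to the shared file index,
        # which may still hold a stale listing of deleted files.
        if table not in shared_file_index:
            shared_file_index[table] = list(filesystem[table])
        self.relation_cache[table] = shared_file_index[table]
        return self.relation_cache[table]

    def refresh_table(self, table):
        # Models the reported problem: only this session's entry is dropped;
        # the shared index keeps serving the stale listing.
        self.relation_cache.pop(table, None)


fs = {"t": ["part-0.parquet"]}
session = Session()
assert session.scan("t", fs) == ["part-0.parquet"]

fs["t"] = ["part-1.parquet"]   # underlying files rewritten
session.refresh_table("t")
# Even after REFRESH TABLE, the shared cache still reports the old file,
# so the FileNotFoundException (and its misleading notice) would recur.
assert session.scan("t", fs) == ["part-0.parquet"]
```

This is only a model of why the suggested "REFRESH TABLE" advice does not help in a long-running thriftserver: the stale entry outlives the session-level invalidation.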
[jira] [Updated] (SPARK-27827) File does not exist notice is misleading in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-27827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-27827: - Affects Version/s: (was: 2.4.3)
[jira] [Commented] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue
[ https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857291#comment-16857291 ] zhoukang commented on SPARK-27068: -- [~srowen] Here is a use case from our cluster. We have a long-running Spark SQL thriftserver that users use as an ad-hoc query engine and also for an online BI service. The number of failures is not large, but the total query count increases quickly, as shown in the image below. When we want to find the root cause of a failed query, it is currently not very convenient. !屏幕快照 2019-06-06 下午1.12.04.png! > Support failed jobs ui and completed jobs ui use different queue > > > Key: SPARK-27068 > URL: https://issues.apache.org/jira/browse/SPARK-27068 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.0 >Reporter: zhoukang >Priority: Major > Attachments: 屏幕快照 2019-06-06 下午1.12.04.png > > > For some long-running jobs, we may want to check the cause of some failed > jobs. But most jobs have completed and the failed jobs UI may disappear; we could use > different queues for these two kinds of jobs.
[jira] [Updated] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue
[ https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-27068: - Attachment: 屏幕快照 2019-06-06 下午1.12.04.png
[jira] [Updated] (SPARK-27965) Add extractors for logical transforms
[ https://issues.apache.org/jira/browse/SPARK-27965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27965: -- Issue Type: Improvement (was: Bug) > Add extractors for logical transforms > - > > Key: SPARK-27965 > URL: https://issues.apache.org/jira/browse/SPARK-27965 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > > Extractors can be used to make any Transform class appear like a case class > to Spark internals.
[jira] [Commented] (SPARK-27965) Add extractors for logical transforms
[ https://issues.apache.org/jira/browse/SPARK-27965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857232#comment-16857232 ] Dongjoon Hyun commented on SPARK-27965: --- Hi, [~rdblue]. Could you use `Improvement` issue type when you create this kind of issue?
[jira] [Resolved] (SPARK-27964) Create CatalogV2Util
[ https://issues.apache.org/jira/browse/SPARK-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27964. --- Resolution: Fixed Assignee: Ryan Blue Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24813 > Create CatalogV2Util > > > Key: SPARK-27964 > URL: https://issues.apache.org/jira/browse/SPARK-27964 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > Need to move utility functions from test.
[jira] [Updated] (SPARK-27964) Create CatalogV2Util
[ https://issues.apache.org/jira/browse/SPARK-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27964: -- Issue Type: Improvement (was: Bug)
[jira] [Assigned] (SPARK-27964) Create CatalogV2Util
[ https://issues.apache.org/jira/browse/SPARK-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27964: Assignee: Apache Spark
[jira] [Assigned] (SPARK-27964) Create CatalogV2Util
[ https://issues.apache.org/jira/browse/SPARK-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27964: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-27965) Add extractors for logical transforms
[ https://issues.apache.org/jira/browse/SPARK-27965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27965: Assignee: Apache Spark
[jira] [Assigned] (SPARK-27965) Add extractors for logical transforms
[ https://issues.apache.org/jira/browse/SPARK-27965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27965: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-27965) Add extractors for logical transforms
Ryan Blue created SPARK-27965: - Summary: Add extractors for logical transforms Key: SPARK-27965 URL: https://issues.apache.org/jira/browse/SPARK-27965 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue
[jira] [Created] (SPARK-27964) Create CatalogV2Util
Ryan Blue created SPARK-27964: - Summary: Create CatalogV2Util Key: SPARK-27964 URL: https://issues.apache.org/jira/browse/SPARK-27964 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue
[jira] [Updated] (SPARK-27931) Accept 'on' and 'off' as input for boolean data type
[ https://issues.apache.org/jira/browse/SPARK-27931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27931: Description: This ticket contains three things: 1. Accept 'on' and 'off' as input for boolean data type Example: {code:sql} SELECT cast('no' as boolean) AS false; SELECT cast('off' as boolean) AS false; {code} 2. Accept unique prefixes thereof: Example: {code:sql} SELECT cast('of' as boolean) AS false; SELECT cast('fal' as boolean) AS false; {code} 3. Trim the string when cast to boolean type {code:sql} SELECT cast('true ' as boolean) AS true; SELECT cast(' FALSE' as boolean) AS true; {code} More details: [https://www.postgresql.org/docs/devel/datatype-boolean.html] [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/backend/utils/adt/bool.c#L25] [https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48] Other DBs: http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html https://my.vertica.com/docs/5.0/HTML/Master/2983.htm https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138 was: This ticket contains two things: 1. Accept 'on' and 'off' as input for boolean data type Example: {code:sql} SELECT cast('no' as boolean) AS false; SELECT cast('off' as boolean) AS false; {code} 2.
Trim the string when cast to boolean type {code:sql} SELECT cast('true ' as boolean) AS true; SELECT cast(' FALSE' as boolean) AS true; {code} More details: [https://www.postgresql.org/docs/devel/datatype-boolean.html] [https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48] Other DBs: http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html https://my.vertica.com/docs/5.0/HTML/Master/2983.htm https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138
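The PostgreSQL-style semantics proposed in this ticket (trim whitespace, accept 'on'/'off', and accept unique prefixes) can be sketched as a small Python function. This is only an illustrative model of the rules described above, not Spark's or PostgreSQL's implementation, and `parse_bool` is a hypothetical name:

```python
def parse_bool(s):
    """Illustrative model of PostgreSQL-style boolean literal parsing:
    trims whitespace, ignores case, and accepts any unique prefix of
    'true' / 'false' / 'yes' / 'no' / 'on' / 'off' (plus '1' and '0')."""
    v = s.strip().lower()
    if not v:
        raise ValueError("invalid boolean literal: %r" % s)
    # '1' and '0' are accepted exactly; they have no shorter forms.
    if v == "1":
        return True
    if v == "0":
        return False
    truthy, falsy = ("true", "yes", "on"), ("false", "no", "off")
    is_true = any(word.startswith(v) for word in truthy)
    is_false = any(word.startswith(v) for word in falsy)
    # A prefix matching both sides (e.g. 'o' for 'on'/'off') is ambiguous,
    # and a prefix matching neither side is invalid.
    if is_true == is_false:
        raise ValueError("invalid boolean literal: %r" % s)
    return is_true


assert parse_bool("off") is False   # mirrors SELECT cast('off' as boolean)
assert parse_bool("of") is False    # unique prefix of 'off'
assert parse_bool("fal") is False   # unique prefix of 'false'
assert parse_bool(" TRUE ") is True # trimmed and case-insensitive
```

Note how the prefix rule forces 'o' to be rejected: it is a prefix of both 'on' and 'off', so it is not unique.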
[jira] [Commented] (SPARK-27963) Allow dynamic allocation without an external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-27963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857146#comment-16857146 ] Marcelo Vanzin commented on SPARK-27963: FYI I have a WIP patch to implement this that I plan to post soon (although I'll be out for a couple of weeks and won't be able to update it). > Allow dynamic allocation without an external shuffle service > > > Key: SPARK-27963 > URL: https://issues.apache.org/jira/browse/SPARK-27963 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Major > > It would be useful for users to be able to enable dynamic allocation without > the need to provision an external shuffle service. One immediate use case is > the ability to use dynamic allocation on Kubernetes, which doesn't yet have > that service. > This has been suggested before (e.g. > https://github.com/apache/spark/pull/24083, which was attached to the > k8s-specific SPARK-24432), and can actually be done without affecting the > internals of the Spark scheduler (aside from the dynamic allocation code).
[jira] [Created] (SPARK-27963) Allow dynamic allocation without an external shuffle service
Marcelo Vanzin created SPARK-27963: -- Summary: Allow dynamic allocation without an external shuffle service Key: SPARK-27963 URL: https://issues.apache.org/jira/browse/SPARK-27963 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.0.0 Reporter: Marcelo Vanzin
[jira] [Updated] (SPARK-27919) DataSourceV2: Add v2 session catalog
[ https://issues.apache.org/jira/browse/SPARK-27919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27919: -- Affects Version/s: (was: 2.4.3) 3.0.0 > DataSourceV2: Add v2 session catalog > > > Key: SPARK-27919 > URL: https://issues.apache.org/jira/browse/SPARK-27919 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > > When no default catalog is set, the session catalog (v1) is responsible for > table identifiers with no catalog part. When CTAS creates a table with a v2 > provider, a v2 catalog is required and the default catalog is used. But this > may cause Spark to create a table in a catalog that it cannot use to look up > the table. > In this case, a v2 catalog that delegates to the session catalog should be > used instead.
[jira] [Resolved] (SPARK-27857) DataSourceV2: Support ALTER TABLE statements
[ https://issues.apache.org/jira/browse/SPARK-27857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27857. - Resolution: Fixed Assignee: Ryan Blue Fix Version/s: 3.0.0 > DataSourceV2: Support ALTER TABLE statements > > > Key: SPARK-27857 > URL: https://issues.apache.org/jira/browse/SPARK-27857 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > ALTER TABLE statements should be supported for v2 tables.
[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down
[ https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857006#comment-16857006 ] Parshuram V Patki commented on SPARK-24130: --- Do we have any traction on this? > Data Source V2: Join Push Down > -- > > Key: SPARK-24130 > URL: https://issues.apache.org/jira/browse/SPARK-24130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jia Li >Priority: Major > Attachments: Data Source V2 Join Push Down.pdf > > > Spark applications often directly query external data sources such as > relational databases or files. Spark provides Data Sources APIs for > accessing structured data through Spark SQL. The Data Sources APIs in both V1 and > V2 support optimizations such as filter push down and column pruning, which > are a subset of the functionality that can be pushed down to some data sources. > We're proposing to extend the Data Sources APIs with join push down (JPD). Join > push down significantly improves query performance by reducing the amount of > data transfer and exploiting the capabilities of the data sources, such as > index access. > The join push down design document is available > [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].
[jira] [Assigned] (SPARK-27962) Propagate subprocess stdout when subprocess exits with nonzero status in deploy.RRunner
[ https://issues.apache.org/jira/browse/SPARK-27962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27962: Assignee: (was: Apache Spark) > Propagate subprocess stdout when subprocess exits with nonzero status in > deploy.RRunner > --- > > Key: SPARK-27962 > URL: https://issues.apache.org/jira/browse/SPARK-27962 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.4.3 >Reporter: Jeremy Liu >Priority: Minor > > When the R process launched in org.apache.spark.deploy.RRunner terminates > with a nonzero status code, only the status code is passed on in the > SparkUserAppException. > Although the subprocess' stdout is continually piped to System.out, it would > be useful for users without access to the JVM's stdout to also capture the > last few lines of the R process and pass it along in the exception message.
[jira] [Assigned] (SPARK-27962) Propagate subprocess stdout when subprocess exits with nonzero status in deploy.RRunner
[ https://issues.apache.org/jira/browse/SPARK-27962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27962: Assignee: Apache Spark
[jira] [Created] (SPARK-27962) Propagate subprocess stdout when subprocess exits with nonzero status in deploy.RRunner
Jeremy Liu created SPARK-27962: -- Summary: Propagate subprocess stdout when subprocess exits with nonzero status in deploy.RRunner Key: SPARK-27962 URL: https://issues.apache.org/jira/browse/SPARK-27962 Project: Spark Issue Type: Improvement Components: Deploy, Spark Core Affects Versions: 2.4.3 Reporter: Jeremy Liu
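The behavior proposed for SPARK-27962 — mirror the child's stdout while retaining its last few lines so they can be attached to the failure message — can be sketched in Python. RRunner itself is Scala; `run_with_stdout_tail` and the tail size are hypothetical, illustrative choices:

```python
import collections
import subprocess


def run_with_stdout_tail(cmd, tail_lines=10):
    """Run a subprocess, mirroring its stdout line by line, and keep the
    last `tail_lines` lines so a nonzero exit can report them (a sketch of
    the proposed RRunner improvement, not Spark's actual code)."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr, as RRunner sees one stream
        text=True,
    )
    tail = collections.deque(maxlen=tail_lines)  # ring buffer of recent lines
    for line in proc.stdout:
        print(line, end="")  # keep piping to our own stdout, as before
        tail.append(line.rstrip("\n"))
    code = proc.wait()
    if code != 0:
        # Attach the retained tail to the exception message instead of
        # reporting only the bare status code.
        raise RuntimeError(
            "subprocess exited with code %d; last output:\n%s"
            % (code, "\n".join(tail))
        )
    return code


# Hypothetical usage:
# run_with_stdout_tail(["Rscript", "script.R"])
```

The ring buffer keeps memory bounded for long-running children, which is the main design constraint here: the full output has already been streamed, only a short tail is retained for the error message.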
[jira] [Assigned] (SPARK-27760) Spark resources - user configs change .count to be .amount
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27760: Assignee: Thomas Graves (was: Apache Spark) > Spark resources - user configs change .count to be .amount > -- > > Key: SPARK-27760 > URL: https://issues.apache.org/jira/browse/SPARK-27760 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > For the Spark resources, we created the config > spark.\{driver/executor}.resource.\{resourceName}.count > I think we should change .count to be .amount. That more easily allows users > to specify things with a suffix, like memory, in a single config, and they can > combine the value and unit. Without this they would have to specify two > separate configs (like .count and .unit), which seems more of a hassle for the > user. > Note the yarn configs for resources use amount: > spark.yarn.\{executor/driver/am}.resource=, where the amount is the value and unit together. I think that makes a lot of sense. Filed a > separate Jira to add .amount to the yarn configs as well.
[jira] [Assigned] (SPARK-27760) Spark resources - user configs change .count to be .amount
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27760: Assignee: Apache Spark (was: Thomas Graves)
[jira] [Updated] (SPARK-27961) DataSourceV2Relation should not have refresh method
[ https://issues.apache.org/jira/browse/SPARK-27961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27961: --- Description: The newly added `Refresh` method in [PR #24401|https://github.com/apache/spark/pull/24401] prevented me from moving DataSourceV2Relation into catalyst. It calls `case table: FileTable => table.fileIndex.refresh()` while `FileTable` belongs to sql/core. More importantly, [~rdblue] pointed out that DataSourceV2Relation is immutable by design, so it should not have a refresh method. was: The newly added `Refresh` method in PR #24401 prevented me from moving DataSourceV2Relation into catalyst. It calls `case table: FileTable => table.fileIndex.refresh()` while `FileTable` belongs to sql/core. More importantly, [~rdblue] pointed out DataSourceV2Relation is immutable by design, it should not have refresh method.
[jira] [Commented] (SPARK-27961) DataSourceV2Relation should not have refresh method
[ https://issues.apache.org/jira/browse/SPARK-27961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856982#comment-16856982 ] John Zhuge commented on SPARK-27961: [~Gengliang.Wang] [~cloud_fan] Could you help?
[jira] [Created] (SPARK-27961) DataSourceV2Relation should not have refresh method
John Zhuge created SPARK-27961: -- Summary: DataSourceV2Relation should not have refresh method Key: SPARK-27961 URL: https://issues.apache.org/jira/browse/SPARK-27961 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge
[jira] [Commented] (SPARK-27939) Defining a schema with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856975#comment-16856975 ] Johannes Schaffrath commented on SPARK-27939: - Hi Bryan, thank you very much for the detailed information. I just saw that this is also mentioned in the documentation [1], but like you said it is not intuitive. [1] http://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.Row > Defining a schema with VectorUDT > > > Key: SPARK-27939 > URL: https://issues.apache.org/jira/browse/SPARK-27939 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Johannes Schaffrath >Priority: Minor > > When I try to define a dataframe schema which has a VectorUDT field, I run > into an error when the VectorUDT field is not the last element of the > StructType list. > The following example causes the error below: > {code:java} > // from pyspark.sql import functions as F > from pyspark.sql import types as T > from pyspark.sql import Row > from pyspark.ml.linalg import VectorUDT, SparseVector > #VectorUDT should be the last structfield > train_schema = T.StructType([ > T.StructField('features', VectorUDT()), > T.StructField('SALESCLOSEPRICE', T.IntegerType()) > ]) > > train_df = spark.createDataFrame( > [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, > 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: > 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, > 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, > 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: > -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: > 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: > 1.0, 133: 1.0}), SALESCLOSEPRICE=143000), > 
Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, > 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: > 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, > 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: > 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, > 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: > -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, > 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: > 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), > SALESCLOSEPRICE=19), > Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: > 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: > 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, > 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, > 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: > 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, > 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), > SALESCLOSEPRICE=225000) > ], schema=train_schema) > > train_df.printSchema() > train_df.show() > {code} > Error message: > {code:java} > // Fail to execute line 17: ], schema=train_schema) Traceback (most recent > call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in > exec(code, _zcUserQueryNameSpace) File "", line 17, in > File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", > line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, > data), schema) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in > _createFromLocal data = [schema.toInternal(row) for row in data] File > 
"/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in > data = [schema.toInternal(row) for row in data] File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in > toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in > for f, v, c in zip(self.fields, obj, self._needConversion)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 442, in > toInternal return self.dataType.toInternal(obj) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 685, in > toInternal return
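The root cause discussed above is that, in Spark 2.x, `pyspark.sql.Row` built from keyword arguments sorts its field names alphabetically, so the row's field order can differ from the `StructType` order and `toInternal` then hands a `SparseVector` to the wrong field converter. The pure-Python sketch below (no PySpark required; the helper name is illustrative, the real logic lives inside `pyspark.sql.types`) mimics that sorting to show why `SALESCLOSEPRICE` jumps ahead of `features`:

```python
# Sketch: mimic how Spark 2.x's pyspark.sql.Row ordered keyword fields.
# Row(features=..., SALESCLOSEPRICE=...) stored fields in sorted-name order,
# and ASCII uppercase letters sort before lowercase ones.

def row_field_order(**kwargs):
    """Field names in the order a Spark 2.x Row would store them."""
    return sorted(kwargs)  # alphabetical, case-sensitive (ASCII) sort

schema_order = ["features", "SALESCLOSEPRICE"]  # order declared in the StructType
row_order = row_field_order(features=object(), SALESCLOSEPRICE=143000)

# 'S' (0x53) < 'f' (0x66), so SALESCLOSEPRICE comes first in the Row:
assert row_order == ["SALESCLOSEPRICE", "features"]
assert row_order != schema_order  # the mismatch that breaks toInternal()
```

Declaring the `StructType` fields in alphabetical order (or, as the reporter found, putting the `VectorUDT` field last here) avoids the mismatch on Spark 2.x.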
[jira] [Created] (SPARK-27960) DataSourceV2 ORC implementation doesn't handle schemas correctly
Ryan Blue created SPARK-27960: - Summary: DataSourceV2 ORC implementation doesn't handle schemas correctly Key: SPARK-27960 URL: https://issues.apache.org/jira/browse/SPARK-27960 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: Ryan Blue While testing SPARK-27919 (#[24768|https://github.com/apache/spark/pull/24768]), I tried to use the v2 ORC implementation to validate a v2 catalog that delegates to the session catalog. The ORC implementation fails the following test case because it cannot infer a schema (there is no data) but it should be using the schema used to create the table. Test case: {code} test("CreateTable: test ORC source") { spark.conf.set("spark.sql.catalog.session", classOf[V2SessionCatalog].getName) spark.sql(s"CREATE TABLE table_name (id bigint, data string) USING $orc2") val testCatalog = spark.catalog("session").asTableCatalog val table = testCatalog.loadTable(Identifier.of(Array(), "table_name")) assert(table.name == "orc ") // <-- should this be table_name? assert(table.partitioning.isEmpty) assert(table.properties == Map( "provider" -> orc2, "database" -> "default", "table" -> "table_name").asJava) assert(table.schema == new StructType().add("id", LongType).add("data", StringType)) // <-- fail val rdd = spark.sparkContext.parallelize(table.asInstanceOf[InMemoryTable].rows) checkAnswer(spark.internalCreateDataFrame(rdd, table.schema), Seq.empty) } {code} Error: {code} Unable to infer schema for ORC. It must be specified manually.; org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
It must be specified manually.; at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$5(FileTable.scala:61) at scala.Option.getOrElse(Option.scala:138) at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:61) at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:54) at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:67) at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:65) at org.apache.spark.sql.sources.v2.DataSourceV2SQLSuite.$anonfun$new$5(DataSourceV2SQLSuite.scala:82) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27960) DataSourceV2 ORC implementation doesn't handle schemas correctly
[ https://issues.apache.org/jira/browse/SPARK-27960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856955#comment-16856955 ] Ryan Blue commented on SPARK-27960: --- [~Gengliang.Wang], FYI > DataSourceV2 ORC implementation doesn't handle schemas correctly > > > Key: SPARK-27960 > URL: https://issues.apache.org/jira/browse/SPARK-27960 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > > While testing SPARK-27919 > (#[24768|https://github.com/apache/spark/pull/24768]), I tried to use the v2 > ORC implementation to validate a v2 catalog that delegates to the session > catalog. The ORC implementation fails the following test case because it > cannot infer a schema (there is no data) but it should be using the schema > used to create the table. > Test case: > {code} > test("CreateTable: test ORC source") { > spark.conf.set("spark.sql.catalog.session", > classOf[V2SessionCatalog].getName) > spark.sql(s"CREATE TABLE table_name (id bigint, data string) USING $orc2") > val testCatalog = spark.catalog("session").asTableCatalog > val table = testCatalog.loadTable(Identifier.of(Array(), "table_name")) > assert(table.name == "orc ") // <-- should this be table_name? > assert(table.partitioning.isEmpty) > assert(table.properties == Map( > "provider" -> orc2, > "database" -> "default", > "table" -> "table_name").asJava) > assert(table.schema == new StructType().add("id", LongType).add("data", > StringType)) // <-- fail > val rdd = > spark.sparkContext.parallelize(table.asInstanceOf[InMemoryTable].rows) > checkAnswer(spark.internalCreateDataFrame(rdd, table.schema), Seq.empty) > } > {code} > Error: > {code} > Unable to infer schema for ORC. It must be specified manually.; > org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
It > must be specified manually.; > at > org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$5(FileTable.scala:61) > at scala.Option.getOrElse(Option.scala:138) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:61) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:54) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:67) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:65) > at > org.apache.spark.sql.sources.v2.DataSourceV2SQLSuite.$anonfun$new$5(DataSourceV2SQLSuite.scala:82) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21136) Misleading error message for typo in SQL
[ https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21136: Assignee: Apache Spark (was: Yesheng Ma) > Misleading error message for typo in SQL > > > Key: SPARK-21136 > URL: https://issues.apache.org/jira/browse/SPARK-21136 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Daniel Darabos >Assignee: Apache Spark >Priority: Critical > > {code} > scala> spark.sql("select * from a left joinn b on a.id = b.id").show > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'from' expecting {<EOF>, 'WHERE', 'GROUP', 'ORDER', > 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', > 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9) > == SQL == > select * from a left joinn b on a.id = b.id > -^^^ > {code} > The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of > the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in > themselves, a misleading error like this can hinder debugging substantially. > I tried to see if maybe I could fix this. Am I correct to deduce that the > error message originates in ANTLR4, which parses the query based on the > syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out > how that syntax definition works, and why it misattributes the error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21136) Misleading error message for typo in SQL
[ https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21136: Assignee: Yesheng Ma (was: Apache Spark) > Misleading error message for typo in SQL > > > Key: SPARK-21136 > URL: https://issues.apache.org/jira/browse/SPARK-21136 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Daniel Darabos >Assignee: Yesheng Ma >Priority: Critical > > {code} > scala> spark.sql("select * from a left joinn b on a.id = b.id").show > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'from' expecting {<EOF>, 'WHERE', 'GROUP', 'ORDER', > 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', > 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9) > == SQL == > select * from a left joinn b on a.id = b.id > -^^^ > {code} > The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of > the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in > themselves, a misleading error like this can hinder debugging substantially. > I tried to see if maybe I could fix this. Am I correct to deduce that the > error message originates in ANTLR4, which parses the query based on the > syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out > how that syntax definition works, and why it misattributes the error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-24615: - Assignee: Thomas Graves (was: Xingbo Jiang) > SPIP: Accelerator-aware task scheduling for Spark > - > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Thomas Graves >Priority: Major > Labels: Hydrogen, SPIP > Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, > SPIP_ Accelerator-aware scheduling.pdf > > > (The JIRA received a major update on 2019/02/28. Some comments were based on > an earlier version. Please ignore them. New comments start at > [#comment-16778026].) > h2. Background and Motivation > GPUs and other accelerators have been widely used for accelerating special > workloads, e.g., deep learning and signal processing. While users from the AI > community use GPUs heavily, they often need Apache Spark to load and process > large datasets and to handle complex data scenarios like streaming. YARN and > Kubernetes already support GPUs in their recent releases. Although Spark > supports those two cluster managers, Spark itself is not aware of GPUs > exposed by them and hence Spark cannot properly request GPUs and schedule > them for users. This leaves a critical gap to unify big data and AI workloads > and make life simpler for end users. > To make Spark be aware of GPUs, we shall make two major changes at high level: > * At cluster manager level, we update or upgrade cluster managers to include > GPU support. Then we expose user interfaces for Spark to request GPUs from > them. > * Within Spark, we update its scheduler to understand available GPUs > allocated to executors, user task requests, and assign GPUs to tasks properly. 
> Based on the work done in YARN and Kubernetes to support GPUs and some > offline prototypes, we could have necessary features implemented in the next > major release of Spark. You can find a detailed scoping doc here, where we > listed user stories and their priorities. > h2. Goals > * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes. > * No regression on scheduler performance for normal jobs. > h2. Non-goals > * Fine-grained scheduling within one GPU card. > ** We treat one GPU card and its memory together as a non-divisible unit. > * Support TPU. > * Support Mesos. > * Support Windows. > h2. Target Personas > * Admins who need to configure clusters to run Spark with GPU nodes. > * Data scientists who need to build DL applications on Spark. > * Developers who need to integrate DL features on Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368: -- Description: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster model, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, which is mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role of allocating resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to match the resourceName in driver and executor requests. Besides CPU cores and memory, the worker now also considers resources when creating new executors or drivers. Example conf: {code} spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh spark.driver.resource.gpu.count=4 spark.worker.resource.gpu.count=1 {code} In client mode, the driver process is not launched by the worker, so the user can specify a driver resource discovery script. In cluster mode, if the user still specifies a driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because the Spark worker doesn't know how to isolate resources unless we hardcode some resource names, like the GPU support in YARN, which is less ideal. Supporting resource isolation for multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. was: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster model, there could be multiple worker processes on the same node. 
This is an undocumented use of standalone mode, which is mainly for tests. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role of allocating resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to match the resourceName in driver and executor requests and they don't need to specify discovery scripts separately. > Design: Standalone supports GPU scheduling > -- > > Key: SPARK-27368 > URL: https://issues.apache.org/jira/browse/SPARK-27368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design draft: > Scenarios: > * client-mode, worker might create one or more executor processes, from > different Spark applications. > * cluster-mode, worker might create driver process as well. > * local-cluster model, there could be multiple worker processes on the same > node. This is an undocumented use of standalone mode, which is mainly for > tests. > * Resource isolation is not considered here. > Because executor and driver processes on the same node will share the > accelerator resources, the worker must take the role of allocating resources. So > we will add a spark.worker.resource.[resourceName].discoveryScript conf for > workers to discover resources. Users need to match the resourceName in driver > and executor requests. Besides CPU cores and memory, the worker now also > considers resources when creating new executors or drivers. > Example conf: > {code} > spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh > spark.driver.resource.gpu.count=4 > spark.worker.resource.gpu.count=1 > {code} > In client mode, the driver process is not launched by the worker, so the user can > specify a driver resource discovery script. In cluster mode, if the user still > specifies a driver resource discovery script, it is ignored with a warning. 
> Supporting resource isolation is tricky because the Spark worker doesn't know how > to isolate resources unless we hardcode some resource names, like the GPU support > in YARN, which is less ideal. Supporting resource isolation for multiple resource > types is even harder. In the first version, we will implement accelerator > support without resource isolation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
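The design above hinges on the discovery-script conf (spark.worker.resource.[resourceName].discoveryScript). A minimal sketch of such a script follows, assuming the JSON contract Spark 3.0 eventually settled on: the script prints one JSON object with the resource name and its addresses. The fixed address list is a stand-in; a real script would enumerate devices, e.g. via nvidia-smi:

```python
#!/usr/bin/env python3
# Hypothetical GPU discovery script, e.g. referenced as
#   spark.worker.resource.gpu.discoveryScript=/path/to/list_gpus.py
# Assumed output contract (matches Spark 3.0's ResourceInformation JSON):
#   {"name": "gpu", "addresses": ["0", "1"]}
import json

def discover_gpus():
    # Stand-in for real device enumeration (e.g. parsing `nvidia-smi -L`).
    addresses = ["0", "1"]
    return {"name": "gpu", "addresses": addresses}

if __name__ == "__main__":
    print(json.dumps(discover_gpus()))
```

The worker runs the script once, parses the JSON, and can then hand out the listed addresses to executors it launches.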
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368: -- Description: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster model, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, which is mainly for tests. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role of allocating resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to match the resourceName in driver and executor requests and they don't need to specify discovery scripts separately. > Design: Standalone supports GPU scheduling > -- > > Key: SPARK-27368 > URL: https://issues.apache.org/jira/browse/SPARK-27368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design draft: > Scenarios: > * client-mode, worker might create one or more executor processes, from > different Spark applications. > * cluster-mode, worker might create driver process as well. > * local-cluster model, there could be multiple worker processes on the same > node. This is an undocumented use of standalone mode, which is mainly for > tests. > Because executor and driver processes on the same node will share the > accelerator resources, the worker must take the role of allocating resources. So > we will add a spark.worker.resource.[resourceName].discoveryScript conf for > workers to discover resources. Users need to match the resourceName in driver > and executor requests and they don't need to specify discovery scripts > separately. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27760) Spark resources - user configs change .count to be .amount
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-27760: -- Description: For the Spark resources, we created the config spark.\{driver/executor}.resource.\{resourceName}.count I think we should change .count to be .amount. That more easily allows users to specify things with suffix like memory in a single config and they can combine the value and unit. Without this they would have to specify 2 separate configs (like .count and .unit) which seems more of a hassle for the user. Note the yarn configs for resources use amount: spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. I think that makes a lot of sense. Filed a separate Jira to add .amount to the yarn configs as well. was: For the Spark resources, we created the config spark.\{driver/executor}.resource.\{resourceName}.count I think we should change .count to be .amount. That more easily allows users to specify things with suffix like memory in a single config and they can combine the value and unit. Without this they would have to specify 2 separate configs which seems more of a hassle for the user. Note the yarn configs for resources use amount: spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. I think that makes a lot of sense. Filed a separate Jira to add .amount to the yarn configs as well. > Spark resources - user configs change .count to be .amount > -- > > Key: SPARK-27760 > URL: https://issues.apache.org/jira/browse/SPARK-27760 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > For the Spark resources, we created the config > spark.\{driver/executor}.resource.\{resourceName}.count > I think we should change .count to be .amount. 
That more easily allows users > to specify things with suffix like memory in a single config and they can > combine the value and unit. Without this they would have to specify 2 > separate configs (like .count and .unit) which seems more of a hassle for the > user. > Note the yarn configs for resources use amount: > spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. I think that makes a lot of sense. Filed a > separate Jira to add .amount to the yarn configs as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
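The point of the rename is that a single `.amount` value can carry both the quantity and an optional unit (e.g. "4" GPUs or "2g" of a memory-like resource), so no second `.unit` config is needed. A small illustrative parser sketch follows; the helper name and supported suffixes are assumptions for illustration, not Spark's actual config parser:

```python
# Sketch: parse a combined "<value><unit>" amount string into base units,
# showing why one ".amount" key can replace separate ".count"/".unit" keys.
# Supported suffixes (k/m/g, powers of 1024) are an assumption here.
import re

_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def parse_amount(amount: str) -> int:
    """Return the amount in base units; unitless strings are plain counts."""
    match = re.fullmatch(r"(\d+)([kmg]?)", amount.strip().lower())
    if not match:
        raise ValueError(f"malformed amount: {amount!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]

assert parse_amount("4") == 4             # e.g. 4 GPUs, no unit
assert parse_amount("2g") == 2 * 1024**3  # value and unit in one config value
```

With separate `.count` and `.unit` keys, the same information would need two lookups plus validation that both keys are present and consistent.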
[jira] [Updated] (SPARK-27760) Spark resources - user configs change .count to be .amount
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-27760: -- Description: For the Spark resources, we created the config spark.\{driver/executor}.resource.\{resourceName}.count I think we should change .count to be .amount. That more easily allows users to specify things with suffix like memory in a single config and they can combine the value and unit. Without this they would have to specify 2 separate configs which seems more of a hassle for the user. Note the yarn configs for resources use amount: spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. I think that makes a lot of sense. Filed a separate Jira to add .amount to the yarn configs as well. was: For the Spark resources, we created the config spark.\{driver/executor}.resource.\{resourceName}.count I think we should change .count to be .amount. That more easily allows users to specify things with suffix like memory in a single config and they can combine the value and unit. Without this they would have to specify 2 separate configs which seems more of a hassle for the user. > Spark resources - user configs change .count to be .amount > -- > > Key: SPARK-27760 > URL: https://issues.apache.org/jira/browse/SPARK-27760 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > For the Spark resources, we created the config > spark.\{driver/executor}.resource.\{resourceName}.count > I think we should change .count to be .amount. That more easily allows users > to specify things with suffix like memory in a single config and they can > combine the value and unit. Without this they would have to specify 2 > separate configs which seems more of a hassle for the user. 
> Note the yarn configs for resources use amount: > spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. I think that makes a lot of sense. Filed a > separate Jira to add .amount to the yarn configs as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27760) Spark resources - user configs change .count to be .amount
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-27760: - Assignee: Thomas Graves > Spark resources - user configs change .count to be .amount > -- > > Key: SPARK-27760 > URL: https://issues.apache.org/jira/browse/SPARK-27760 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > For the Spark resources, we created the config > spark.\{driver/executor}.resource.\{resourceName}.count > I think we should change .count to be .amount. That more easily allows users > to specify things with suffix like memory in a single config and they can > combine the value and unit. Without this they would have to specify 2 > separate configs which seems more of a hassle for the user. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27760) Spark resources - user configs change .count to be .amount
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-27760: -- Summary: Spark resources - user configs change .count to be .amount (was: Spark resources - user configs change .count to be .amount, and yarn configs should match) > Spark resources - user configs change .count to be .amount > -- > > Key: SPARK-27760 > URL: https://issues.apache.org/jira/browse/SPARK-27760 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > For the Spark resources, we created the config > spark.\{driver/executor}.resource.\{resourceName}.count > I think we should change .count to be .amount. That more easily allows users > to specify things with suffix like memory in a single config and they can > combine the value and unit. Without this they would have to specify 2 > separate configs which seems more of a hassle for the user. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27959) Change YARN resource configs to use .amount
Thomas Graves created SPARK-27959: - Summary: Change YARN resource configs to use .amount Key: SPARK-27959 URL: https://issues.apache.org/jira/browse/SPARK-27959 Project: Spark Issue Type: Story Components: YARN Affects Versions: 3.0.0 Reporter: Thomas Graves We are adding generic resource support into Spark, where we have a suffix for the amount of the resource so that we can support other configs. Spark on YARN has already added configs to request resources via the configs spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. We should change those configs to have a .amount suffix on them to match the Spark configs and to allow future configs to be more easily added. YARN itself already supports tags and attributes, so if we want the user to be able to pass those from Spark at some point, having a suffix makes sense. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
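The extensibility argument above is that a `.amount` suffix turns each resource into a small config namespace, so later suffixes (e.g. something like `.tags` for YARN tags/attributes) can be added without breaking existing keys. A sketch of grouping such suffixed keys; the key layout and helper are illustrative assumptions, not Spark's implementation:

```python
# Sketch (hypothetical key layout): group suffixed resource configs by
# resource name, so ".amount" today and other suffixes later coexist.

def resource_confs(confs, prefix="spark.yarn.executor.resource."):
    """Return {resourceName: {suffix: value}} for keys under `prefix`."""
    out = {}
    for key, value in confs.items():
        if key.startswith(prefix):
            # "gpu.amount" -> resource "gpu", suffix "amount"
            resource, _, suffix = key[len(prefix):].partition(".")
            out.setdefault(resource, {})[suffix] = value
    return out

confs = {
    "spark.yarn.executor.resource.gpu.amount": "2",
    "spark.yarn.executor.resource.fpga.amount": "1",
}
assert resource_confs(confs) == {"gpu": {"amount": "2"}, "fpga": {"amount": "1"}}
```

Without the suffix, the bare `...resource.gpu=<amount>` key leaves nowhere to attach additional per-resource settings later.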
[jira] [Updated] (SPARK-27760) Spark resources - user configs change .count to be .amount, and yarn configs should match
[ https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-27760: -- Description: For the Spark resources, we created the config spark.\{driver/executor}.resource.\{resourceName}.count I think we should change .count to be .amount. That more easily allows users to specify things with suffix like memory in a single config and they can combine the value and unit. Without this they would have to specify 2 separate configs which seems more of a hassle for the user. was: For the Spark resources, we created the config spark.\{driver/executor}.resource.\{resourceName}.count I think we should change .count to be .amount. That more easily allows users to specify things with suffix like memory in a single config and they can combine the value and unit. Without this they would have to specify 2 separate configs which seems more of a hassle for the user. Spark on yarn already had added configs to request resources via the configs spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is value and unit together. We should change those configs to have a .amount suffix on them to match the spark configs and to allow future configs to be more easily added. YARN itself already supports tags and attributes so if we want the user to be able to pass those from spark at some point having a suffix makes sense. > Spark resources - user configs change .count to be .amount, and yarn configs > should match > - > > Key: SPARK-27760 > URL: https://issues.apache.org/jira/browse/SPARK-27760 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > For the Spark resources, we created the config > spark.\{driver/executor}.resource.\{resourceName}.count > I think we should change .count to be .amount. That more easily allows users > to specify things with suffix like memory in a single config and they can > combine the value and unit. 
Without this they would have to specify 2 > separate configs which seems more of a hassle for the user. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
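The single-config argument can be made concrete with a small parser: one `.amount`-style value carries both magnitude and unit, where separate configs would need one key for each. This is an illustrative sketch, not Spark's actual parser; the suffix table and function name are assumptions:

```python
import re

# Illustrative parser for a combined "value + unit" resource amount,
# e.g. "4", "24g", "512m". The suffix table is an assumption for the
# sketch, not Spark's documented unit handling.
_SUFFIXES = {"": 1, "k": 1 << 10, "m": 1 << 20, "g": 1 << 30}

def parse_amount(amount: str) -> int:
    """Parse a combined value+unit string into a plain integer count."""
    match = re.fullmatch(r"(\d+)([kmg]?)", amount.strip().lower())
    if match is None:
        raise ValueError(f"malformed amount: {amount!r}")
    value, unit = match.groups()
    return int(value) * _SUFFIXES[unit]
```

With two separate configs the user would have to keep a value key and a unit key consistent; a single combined amount removes that failure mode.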
[jira] [Resolved] (SPARK-27933) Extracting common purge "behaviour" to the parent StreamExecution
[ https://issues.apache.org/jira/browse/SPARK-27933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27933. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24781 [https://github.com/apache/spark/pull/24781] > Extracting common purge "behaviour" to the parent StreamExecution > - > > Key: SPARK-27933 > URL: https://issues.apache.org/jira/browse/SPARK-27933 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Minor > Fix For: 3.0.0 > > > Extracting the common {{purge}} "behaviour" to the parent {{StreamExecution}}.
[jira] [Assigned] (SPARK-27933) Extracting common purge "behaviour" to the parent StreamExecution
[ https://issues.apache.org/jira/browse/SPARK-27933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27933: - Assignee: Jacek Laskowski
[jira] [Resolved] (SPARK-27364) User-facing APIs for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-27364. --- Resolution: Fixed Fix Version/s: 3.0.0 > User-facing APIs for GPU-aware scheduling > - > > Key: SPARK-27364 > URL: https://issues.apache.org/jira/browse/SPARK-27364 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.0 > > > Design and implement: > * General guidelines for cluster managers to understand resource requests at > application start. The concrete conf/param will be under the design of each > cluster manager. > * APIs to fetch assigned resources from task context.
[jira] [Commented] (SPARK-27364) User-facing APIs for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856895#comment-16856895 ] Thomas Graves commented on SPARK-27364: --- User-facing changes are all committed, so I'm going to close this. A few changes from the above: getResources was just called resources. The driver config for standalone mode takes a JSON file rather than individual address configs (spark.driver.resourceFile).
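The comment notes that standalone mode reads the driver's assigned resources from a JSON file (`spark.driver.resourceFile`). A plausible shape for such a file, plus a reader for it, is sketched below; the exact field names here are assumptions for illustration, not Spark's documented schema:

```python
import json

# Hypothetical resource-file payload: each entry names a component, a
# resource, and the addresses assigned to it. Field names are
# assumptions for this sketch.
RESOURCE_FILE = """
[
  {"id": {"componentName": "spark.driver", "resourceName": "gpu"},
   "addresses": ["0", "1"]}
]
"""

def assigned_addresses(payload: str, resource: str) -> list:
    """Return all addresses assigned to the given resource name."""
    return [addr
            for alloc in json.loads(payload)
            if alloc["id"]["resourceName"] == resource
            for addr in alloc["addresses"]]
```

A task would then see its slice of such addresses through the task-context resources API mentioned above.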
[jira] [Resolved] (SPARK-27521) move data source v2 API to catalyst module
[ https://issues.apache.org/jira/browse/SPARK-27521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27521. - Resolution: Fixed Fix Version/s: 3.0.0 > move data source v2 API to catalyst module > -- > > Key: SPARK-27521 > URL: https://issues.apache.org/jira/browse/SPARK-27521 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > >
[jira] [Commented] (SPARK-25994) SPIP: Property Graphs, Cypher Queries, and Algorithms
[ https://issues.apache.org/jira/browse/SPARK-25994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856827#comment-16856827 ] Ruben Berenguel commented on SPARK-25994: - Hi [~mju] I'd like to lend a hand if you feel like it (I've been following on-and-off the discussions and SPIPs for this, and currently use GraphFrames). Wouldn't mind helping with Python APIs (I'm somewhat familiar with the Python APIs and a bit of the internals, even if I'm not a frequent user of PySpark) > SPIP: Property Graphs, Cypher Queries, and Algorithms > - > > Key: SPARK-25994 > URL: https://issues.apache.org/jira/browse/SPARK-25994 > Project: Spark > Issue Type: Epic > Components: Graph >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Martin Junghanns >Priority: Major > Labels: SPIP > > Copied from the SPIP doc: > {quote} > GraphX was one of the foundational pillars of the Spark project, and is the > current graph component. This reflects the importance of the graphs data > model, which naturally pairs with an important class of analytic function, > the network or graph algorithm. > However, GraphX is not actively maintained. It is based on RDDs, and cannot > exploit Spark 2’s Catalyst query engine. GraphX is only available to Scala > users. > GraphFrames is a Spark package, which implements DataFrame-based graph > algorithms, and also incorporates simple graph pattern matching with fixed > length patterns (called “motifs”). GraphFrames is based on DataFrames, but > has a semantically weak graph data model (based on untyped edges and > vertices). The motif pattern matching facility is very limited by comparison > with the well-established Cypher language. > The Property Graph data model has become quite widespread in recent years, > and is the primary focus of commercial graph data management and of graph > data research, both for on-premises and cloud data management. 
Many users of > transactional graph databases also wish to work with immutable graphs in > Spark. > The idea is to define a Cypher-compatible Property Graph type based on > DataFrames; to replace GraphFrames querying with Cypher; to reimplement > GraphX/GraphFrames algos on the PropertyGraph type. > To achieve this goal, a core subset of Cypher for Apache Spark (CAPS), > reusing existing proven designs and code, will be employed in Spark 3.0. This > graph query processor, like CAPS, will overlay and drive the SparkSQL > Catalyst query engine, using the CAPS graph query planner. > {quote}
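The property graph model the SPIP describes — typed nodes and relationships, each carrying a property map — can be sketched as toy data structures. This is for intuition only, unrelated to the proposed DataFrame-backed implementation; all names are made up for the sketch:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A property-graph node: labels plus a property map."""
    id: int
    labels: frozenset                       # e.g. {"Person"}
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    """A typed, directed edge with its own property map."""
    src: int
    dst: int
    rel_type: str                           # e.g. "KNOWS"
    properties: dict = field(default_factory=dict)

def neighbors(nodes, rels, node_id, rel_type):
    """Toy pattern match: (a)-[:rel_type]->(b), returning the b nodes."""
    by_id = {n.id: n for n in nodes}
    return [by_id[r.dst] for r in rels
            if r.src == node_id and r.rel_type == rel_type]
```

In Cypher the same pattern would be written `MATCH (a)-[:KNOWS]->(b)`; typed edges and labels are exactly what the SPIP notes GraphFrames' untyped model lacks.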
[jira] [Assigned] (SPARK-27958) Stopping a SparkSession should not always stop Spark Context
[ https://issues.apache.org/jira/browse/SPARK-27958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27958: Assignee: Apache Spark > Stopping a SparkSession should not always stop Spark Context > > > Key: SPARK-27958 > URL: https://issues.apache.org/jira/browse/SPARK-27958 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Vinoo Ganesh >Assignee: Apache Spark >Priority: Major > > Creating a ticket to track the discussion here: > [http://mail-archives.apache.org/mod_mbox/spark-dev/201904.mbox/%3CCAO4re1=Nk1E1VwGzSZwQ5x0SY=_heupmed8n5yydccml_t5...@mail.gmail.com%3E] > Right now, stopping a SparkSession stops the underlying SparkContext. This > behavior is not ideal and doesn't really make sense. Stopping a SparkSession > should only stop the SparkContext in the event that it is the only session.
[jira] [Assigned] (SPARK-27958) Stopping a SparkSession should not always stop Spark Context
[ https://issues.apache.org/jira/browse/SPARK-27958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27958: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-27958) Stopping a SparkSession should not always stop Spark Context
[ https://issues.apache.org/jira/browse/SPARK-27958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856810#comment-16856810 ] Apache Spark commented on SPARK-27958: -- User 'vinooganesh' has created a pull request for this issue: https://github.com/apache/spark/pull/24807
[jira] [Commented] (SPARK-27958) Stopping a SparkSession should not always stop Spark Context
[ https://issues.apache.org/jira/browse/SPARK-27958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856808#comment-16856808 ] Vinoo Ganesh commented on SPARK-27958: -- [https://github.com/apache/spark/pull/24807]
[jira] [Commented] (SPARK-27958) Stopping a SparkSession should not always stop Spark Context
[ https://issues.apache.org/jira/browse/SPARK-27958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856803#comment-16856803 ] Vinoo Ganesh commented on SPARK-27958: -- Putting up a PR shortly
[jira] [Created] (SPARK-27958) Stopping a SparkSession should not always stop Spark Context
Vinoo Ganesh created SPARK-27958: Summary: Stopping a SparkSession should not always stop Spark Context Key: SPARK-27958 URL: https://issues.apache.org/jira/browse/SPARK-27958 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Vinoo Ganesh Creating a ticket to track the discussion here: [http://mail-archives.apache.org/mod_mbox/spark-dev/201904.mbox/%3CCAO4re1=Nk1E1VwGzSZwQ5x0SY=_heupmed8n5yydccml_t5...@mail.gmail.com%3E] Right now, stopping a SparkSession stops the underlying SparkContext. This behavior is not ideal and doesn't really make sense. Stopping a SparkSession should only stop the SparkContext in the event that it is the only session.
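The behavior the ticket asks for amounts to reference counting: the shared context tracks its live sessions and shuts down only when the last one stops. A toy Python model of that rule (not Spark's code; the class names are made up for the sketch):

```python
class ToyContext:
    """Stands in for SparkContext: shared, expensive, stoppable."""
    def __init__(self):
        self.stopped = False
        self.sessions = set()

    def stop(self):
        self.stopped = True

class ToySession:
    """Stands in for SparkSession: cheap, many per context."""
    def __init__(self, ctx):
        self.ctx = ctx
        ctx.sessions.add(self)

    def stop(self):
        # Only stop the shared context when this was the last session,
        # instead of unconditionally tearing it down.
        self.ctx.sessions.discard(self)
        if not self.ctx.sessions:
            self.ctx.stop()
```

Under the current behavior the first `ToySession.stop()` would call `ctx.stop()` unconditionally, killing every other session's context along with it.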
[jira] [Resolved] (SPARK-27749) hadoop-3.2 support hive-thriftserver
[ https://issues.apache.org/jira/browse/SPARK-27749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27749. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 3.0.0 > hadoop-3.2 support hive-thriftserver > > > Key: SPARK-27749 > URL: https://issues.apache.org/jira/browse/SPARK-27749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > >
[jira] [Resolved] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
[ https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-20286. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24704 [https://github.com/apache/spark/pull/24704] > dynamicAllocation.executorIdleTimeout is ignored after unpersist > > > Key: SPARK-20286 > URL: https://issues.apache.org/jira/browse/SPARK-20286 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Miguel Pérez >Priority: Major > Fix For: 3.0.0 > > > With dynamic allocation enabled, it seems that executors with cached data > which are unpersisted are still being killed using the > {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of > {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration > ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor > with unpersisted data won't be released until the job ends. > *How to reproduce* > - Set different values for {{dynamicAllocation.executorIdleTimeout}} and > {{dynamicAllocation.cachedExecutorIdleTimeout}} > - Load a file into an RDD and persist it > - Execute an action on the RDD (like a count) so some executors are activated. > - When the action has finished, unpersist the RDD > - The application UI correctly removes the persisted data from the *Storage* > tab, but if you look in the *Executors* tab, you will find that the executors > remain *active* until {{dynamicAllocation.cachedExecutorIdleTimeout}} is > reached.
[jira] [Assigned] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
[ https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-20286: Assignee: Marcelo Vanzin
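The bug reduces to which timeout the allocation logic consults once an executor no longer holds cached blocks. The intended selection can be sketched as a small decision function — an illustration of the rule, not the actual ExecutorAllocationManager code:

```python
import math

def idle_timeout(has_cached_blocks: bool,
                 executor_idle_timeout: float,
                 cached_executor_idle_timeout: float = math.inf) -> float:
    """Pick the idle timeout an executor should be judged against.

    Once an executor's cached blocks are unpersisted, it should fall
    back to the ordinary idle timeout; the reported bug is that the
    (default infinite) cached timeout kept applying after unpersist.
    """
    if has_cached_blocks:
        return cached_executor_idle_timeout
    return executor_idle_timeout
```

With the default infinite cached timeout, any code path that keeps treating an unpersisted executor as "cached" never releases it, matching the symptom in the *Executors* tab.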
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: *Background* Default constraint with column is ANSI standard. Hive 3.0+ has supported default constraints, ref: https://issues.apache.org/jira/browse/HIVE-18726 But Spark SQL has not implemented this feature yet. *Design* Hive is widely used in production environments and is the de facto standard in the big data field. But many versions of Hive are used in production, and features differ between versions. Spark SQL needs to implement default constraints, but there are two points to pay attention to in the design: One is that Spark SQL should reduce coupling with Hive. Another is that default constraints should be compatible with different versions of Hive. We want to save the default constraint metadata into the Hive table's properties, and then restore it from the properties after the client gets the newest metadata. The implementation is the same as for other metadata (e.g. partition, bucket, statistics). Because the default constraint is part of a column, I think we could reuse the metadata of StructField. The default constraint will be cached in the StructField's metadata. *Tasks* This is a big piece of work, so I want to split it into some sub-tasks, as follows: was: Default constraint with column is ANSI standard. Hive 3.0+ has supported default constraints, ref: https://issues.apache.org/jira/browse/HIVE-18726 But Spark SQL has not implemented this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field. But many versions of Hive are used in production, and features differ between versions. Spark SQL needs to implement default constraints, but there are two points to pay attention to in the design: One is that Spark SQL should reduce coupling with Hive. Another is that default constraints should be compatible with different versions of Hive. We want to save the default constraint metadata into the Hive table's properties, and then restore it from the properties after the client gets the newest metadata. The implementation is the same as for other metadata (e.g. partition, bucket, statistics). Because the default constraint is part of a column, I think we could reuse the metadata of StructField. The default constraint will be cached in the StructField's metadata. This is a big piece of work, so I want to split it into some sub-tasks, as follows: > Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Priority: Major > > *Background* > Default constraint with column is ANSI standard. > Hive 3.0+ has supported default constraints, ref: > https://issues.apache.org/jira/browse/HIVE-18726 > But Spark SQL has not implemented this feature yet. > *Design* > Hive is widely used in production environments and is the de facto standard > in the big data field. But many versions of Hive are used in production, and > features differ between versions. > Spark SQL needs to implement default constraints, but there are two points to > pay attention to in the design: > One is that Spark SQL should reduce coupling with Hive. > Another is that default constraints should be compatible with different > versions of Hive. > We want to save the default constraint metadata into the Hive table's > properties, and then restore it from the properties after the client gets the > newest metadata. > The implementation is the same as for other metadata (e.g. > partition, bucket, statistics). > Because the default constraint is part of a column, I think we could reuse > the metadata of StructField. The default constraint will be cached in the > StructField's metadata. 
> *Tasks* > This is a big piece of work, so I want to split it into some sub-tasks, as > follows: >
[jira] [Created] (SPARK-27957) Display default constraint of column when running desc table.
jiaan.geng created SPARK-27957: -- Summary: Display default constraint of column when running desc table. Key: SPARK-27957 URL: https://issues.apache.org/jira/browse/SPARK-27957 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: jiaan.geng This is a sub-task of implementing default constraints. This JIRA covers displaying a column's default constraint when executing {code:java} desc table{code}
[jira] [Commented] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856595#comment-16856595 ] Apache Spark commented on SPARK-27943: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/24372
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: Default constraint with column is ANSI standard. Hive 3.0+ has supported default constraints, ref: https://issues.apache.org/jira/browse/HIVE-18726 But Spark SQL has not implemented this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field. But many versions of Hive are used in production, and features differ between versions. Spark SQL needs to implement default constraints, but there are two points to pay attention to in the design: One is that Spark SQL should reduce coupling with Hive. Another is that default constraints should be compatible with different versions of Hive. We want to save the default constraint metadata into the Hive table's properties, and then restore it from the properties after the client gets the newest metadata. The implementation is the same as for other metadata (e.g. partition, bucket, statistics). Because the default constraint is part of a column, I think we could reuse the metadata of StructField. The default constraint will be cached in the StructField's metadata. This is a big piece of work, so I want to split it into some sub-tasks, as follows: was: Default constraint with column is ANSI standard. Hive 3.0+ has supported default constraints, ref: https://issues.apache.org/jira/browse/HIVE-18726 But Spark SQL has not implemented this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field. But many versions of Hive are used in production, and features differ between versions. Spark SQL needs to implement default constraints, but there are two points to pay attention to in the design: One is that Spark SQL should reduce coupling with Hive. Another is that default constraints should be compatible with different versions of Hive. 
We want to save the default constraint metadata into the Hive table's properties, and then restore it from the properties after the client gets the newest metadata. The implementation is the same as for other metadata (e.g. partition, bucket, statistics). Because This is a big piece of work, so I want to split it into some sub-tasks, as follows:
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: Column-level default constraints are part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and features differ between versions. Spark SQL needs to implement default constraints, with two points to pay attention to in the design: one, Spark SQL should reduce coupling with Hive; two, default constraints should be compatible with different versions of Hive. We want to save the default constraint metadata into the Hive table's properties, and then restore the metadata from those properties after the client gets the latest metadata. The implementation is the same as for other metadata (e.g. partitions, buckets, statistics). Because this is a large piece of work, I want to split it into sub-tasks, as follows:

was: Column-level default constraints are part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and features differ between versions. Spark SQL needs to implement default constraints, with two points to pay attention to in the design: one, Spark SQL should reduce coupling with Hive; two, default constraints should be compatible with different versions of Hive. We want to save the default constraint metadata into the Hive table's properties, and then restore the metadata from those properties after the client gets the latest metadata. The implementation is the same as for other metadata (e.g. partitions, buckets, statistics). This is a large piece of work, so I want to split it into sub-tasks, as follows:

> Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > Column-level default constraints are part of the ANSI SQL standard. > Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and features differ between versions. > Spark SQL needs to implement default constraints, with two points to pay attention to in the design: > one, Spark SQL should reduce coupling with Hive; > two, default constraints should be compatible with different versions of Hive. > We want to save the default constraint metadata into the Hive table's properties, and then restore the metadata from those properties after the client gets the latest metadata. > The implementation is the same as for other metadata (e.g. partitions, buckets, statistics). > Because this is a large piece of work, I want to split it into sub-tasks, as follows:
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
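The save-and-restore round trip through table properties described above can be sketched without any Hive or Spark dependency. This is a minimal illustration, not Spark code; the property key prefix `spark.sql.constraints.default.` and the helper names are hypothetical, chosen only to show how per-column default-constraint metadata survives inside a flat string map such as Hive table properties, independently of the Hive version.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class DefaultConstraintProps {
    // Hypothetical property-key namespace; Spark would choose its own.
    static final String PREFIX = "spark.sql.constraints.default.";

    // Save: one table property per column, e.g. "spark.sql.constraints.default.age" -> "18".
    static Map<String, String> save(Map<String, String> columnDefaults) {
        Map<String, String> props = new HashMap<>();
        for (Map.Entry<String, String> e : columnDefaults.entrySet()) {
            props.put(PREFIX + e.getKey(), e.getValue());
        }
        return props;
    }

    // Restore: scan all table properties and pick out the default-constraint keys,
    // ignoring unrelated Hive-managed properties.
    static Map<String, String> restore(Map<String, String> tableProps) {
        Map<String, String> defaults = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : tableProps.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                defaults.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
        return defaults;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new HashMap<>();
        defaults.put("age", "18");
        Map<String, String> props = save(defaults);
        System.out.println(restore(props));  // prints {age=18}
    }
}
```

Because the metadata travels as plain strings, this is the same mechanism Spark already uses for other catalog metadata stored in table properties, which is what makes it compatible across Hive versions.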
[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution
[ https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856592#comment-16856592 ] Andrey Zinovyev commented on SPARK-27913: - Simple way to reproduce it {code:sql} create external table test_broken_orc(a struct) stored as orc; insert into table test_broken_orc select named_struct("f1", 1); drop table test_broken_orc; create external table test_broken_orc(a struct) stored as orc; select * from test_broken_orc; {code} Last statement fails with exception {noformat} Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49) at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133) at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123) at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51) at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51) at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64) at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230) at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104) {noformat} Also you can remove column or add column in the middle of struct field. 
As far as I understand the current implementation, it supports by-name field resolution only at the top level of the ORC structure. Everything deeper is resolved by index and is expected to match the reader schema exactly.

> Spark SQL's native ORC reader implements its own schema evolution > - > > Key: SPARK-27913 > URL: https://issues.apache.org/jira/browse/SPARK-27913 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.3 > Reporter: Owen O'Malley > Priority: Major > > ORC's reader handles a wide range of schema evolution, but the Spark SQL > native ORC bindings do not provide the desired schema to the ORC reader. This > causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'.
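The failure mode above can be modeled without ORC or Spark. The sketch below uses hypothetical stand-in classes (it is not the ORC code itself): the stored struct is just an array of values with no field names, and nested reads are driven purely by the reader schema's field positions, so a reader schema that grew a field indexes past the end of the stored values, just as `OrcStruct.getFieldValue` does in the stack trace.

```java
import java.util.Arrays;
import java.util.List;

public class NestedResolution {
    // Stand-in for OrcStruct: stored values are just an array, with no field names.
    static class StoredStruct {
        final Object[] fields;
        StoredStruct(Object... fields) { this.fields = fields; }
        Object getFieldValue(int i) { return fields[i]; }  // out-of-bounds when i >= fields.length
    }

    // Nested fields are read purely by position in the *reader* schema,
    // so a reader schema with extra fields indexes past the stored values.
    static Object[] readNested(StoredStruct fileValue, List<String> readerFields) {
        Object[] out = new Object[readerFields.size()];
        for (int i = 0; i < readerFields.size(); i++) {
            out[i] = fileValue.getFieldValue(i);  // no by-name lookup at this level
        }
        return out;
    }

    public static void main(String[] args) {
        StoredStruct fileStruct = new StoredStruct(1);  // the file wrote a one-field struct
        try {
            readNested(fileStruct, Arrays.asList("f1", "f2"));  // reader schema grew a field
        } catch (ArrayIndexOutOfBoundsException e) {
            // mirrors the ArrayIndexOutOfBoundsException from OrcStruct.getFieldValue
            System.out.println("read past stored fields, as in the stack trace above");
        }
    }
}
```

By-name resolution at the nested level (matching `readerFields` against stored field names) would avoid the exception; positional resolution cannot.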
[jira] [Assigned] (SPARK-27798) from_avro can modify variables in other rows in local mode
[ https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27798: Assignee: (was: Apache Spark) > from_avro can modify variables in other rows in local mode > -- > > Key: SPARK-27798 > URL: https://issues.apache.org/jira/browse/SPARK-27798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yosuke Mori >Priority: Blocker > Labels: correctness > Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png > > > Steps to reproduce: > Create a local Dataset (at least two distinct rows) with a binary Avro field. > Use the {{from_avro}} function to deserialize the binary into another column. > Verify that all of the rows incorrectly have the same value. > Here's a concrete example (using Spark 2.4.3). All it does is converts a list > of TestPayload objects into binary using the defined avro schema, then tries > to deserialize using {{from_avro}} with that same schema: > {code:java} > import org.apache.avro.Schema > import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, > GenericRecordBuilder} > import org.apache.avro.io.EncoderFactory > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.avro.from_avro > import org.apache.spark.sql.functions.col > import java.io.ByteArrayOutputStream > object TestApp extends App { > // Payload container > case class TestEvent(payload: Array[Byte]) > // Deserialized Payload > case class TestPayload(message: String) > // Schema for Payload > val simpleSchema = > """ > |{ > |"type": "record", > |"name" : "Payload", > |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ] > |} > """.stripMargin > // Convert TestPayload into avro binary > def generateSimpleSchemaBinary(record: TestPayload, avsc: String): > Array[Byte] = { > val schema = new Schema.Parser().parse(avsc) > val out = new ByteArrayOutputStream() > val writer = new GenericDatumWriter[GenericRecord](schema) > val 
encoder = EncoderFactory.get().binaryEncoder(out, null) > val rootRecord = new GenericRecordBuilder(schema).set("message", > record.message).build() > writer.write(rootRecord, encoder) > encoder.flush() > out.toByteArray > } > val spark: SparkSession = > SparkSession.builder().master("local[*]").getOrCreate() > import spark.implicits._ > List( > TestPayload("one"), > TestPayload("two"), > TestPayload("three"), > TestPayload("four") > ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, > simpleSchema))) > .toDS() > .withColumn("deserializedPayload", from_avro(col("payload"), > simpleSchema)) > .show(truncate = false) > } > {code} > And here is what this program outputs: > {noformat} > +--+---+ > |payload |deserializedPayload| > +--+---+ > |[00 06 6F 6E 65] |[four] | > |[00 06 74 77 6F] |[four] | > |[00 0A 74 68 72 65 65]|[four] | > |[00 08 66 6F 75 72] |[four] | > +--+---+{noformat} > Here, we can see that the avro binary is correctly generated, but the > deserialized version is a copy of the last row. I have not yet verified that > this is an issue in cluster mode as well. > > I dug into a bit more of the code and it seems like the resuse of {{result}} > in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. > I set a breakpoint in {{LocalRelation}} and the {{data}} sequence seem to all > point to the same address in memory - and therefore a mutation in one > variable will cause all of it to mutate. > !Screen Shot 2019-05-21 at 2.39.27 PM.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
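The aliasing the reporter describes can be reproduced without Spark or Avro. The sketch below uses hypothetical names and only mirrors the pattern of reusing one mutable `result` object across rows, as {{AvroDataToCatalyst}} reportedly does: every list slot ends up pointing at the same object, so all rows display the last decoded value, and materializing a fresh copy per row fixes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReusedResultDemo {
    // A mutable row holder, reused across deserialization calls (the buggy pattern).
    static class Row {
        String message;
        @Override public String toString() { return "[" + message + "]"; }
    }

    public static void main(String[] args) {
        List<String> payloads = Arrays.asList("one", "two", "three", "four");

        // Buggy: one shared Row instance is mutated in place and re-appended.
        Row shared = new Row();
        List<Row> buggy = new ArrayList<>();
        for (String p : payloads) {
            shared.message = p;   // overwrite the shared holder
            buggy.add(shared);    // every slot references the same object
        }
        System.out.println(buggy);  // prints [[four], [four], [four], [four]]

        // Fixed: materialize a fresh copy per row before handing it downstream.
        List<Row> fixed = new ArrayList<>();
        for (String p : payloads) {
            Row r = new Row();
            r.message = p;
            fixed.add(r);
        }
        System.out.println(fixed);  // prints [[one], [two], [three], [four]]
    }
}
```

This matches the observation that the heap dump shows all entries of the {{LocalRelation}} {{data}} sequence pointing at the same address.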
[jira] [Assigned] (SPARK-27798) from_avro can modify variables in other rows in local mode
[ https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27798: Assignee: Apache Spark > from_avro can modify variables in other rows in local mode > -- > > Key: SPARK-27798 > URL: https://issues.apache.org/jira/browse/SPARK-27798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yosuke Mori >Assignee: Apache Spark >Priority: Blocker > Labels: correctness > Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png > > > Steps to reproduce: > Create a local Dataset (at least two distinct rows) with a binary Avro field. > Use the {{from_avro}} function to deserialize the binary into another column. > Verify that all of the rows incorrectly have the same value. > Here's a concrete example (using Spark 2.4.3). All it does is converts a list > of TestPayload objects into binary using the defined avro schema, then tries > to deserialize using {{from_avro}} with that same schema: > {code:java} > import org.apache.avro.Schema > import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, > GenericRecordBuilder} > import org.apache.avro.io.EncoderFactory > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.avro.from_avro > import org.apache.spark.sql.functions.col > import java.io.ByteArrayOutputStream > object TestApp extends App { > // Payload container > case class TestEvent(payload: Array[Byte]) > // Deserialized Payload > case class TestPayload(message: String) > // Schema for Payload > val simpleSchema = > """ > |{ > |"type": "record", > |"name" : "Payload", > |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ] > |} > """.stripMargin > // Convert TestPayload into avro binary > def generateSimpleSchemaBinary(record: TestPayload, avsc: String): > Array[Byte] = { > val schema = new Schema.Parser().parse(avsc) > val out = new ByteArrayOutputStream() > val writer = new 
GenericDatumWriter[GenericRecord](schema) > val encoder = EncoderFactory.get().binaryEncoder(out, null) > val rootRecord = new GenericRecordBuilder(schema).set("message", > record.message).build() > writer.write(rootRecord, encoder) > encoder.flush() > out.toByteArray > } > val spark: SparkSession = > SparkSession.builder().master("local[*]").getOrCreate() > import spark.implicits._ > List( > TestPayload("one"), > TestPayload("two"), > TestPayload("three"), > TestPayload("four") > ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, > simpleSchema))) > .toDS() > .withColumn("deserializedPayload", from_avro(col("payload"), > simpleSchema)) > .show(truncate = false) > } > {code} > And here is what this program outputs: > {noformat} > +--+---+ > |payload |deserializedPayload| > +--+---+ > |[00 06 6F 6E 65] |[four] | > |[00 06 74 77 6F] |[four] | > |[00 0A 74 68 72 65 65]|[four] | > |[00 08 66 6F 75 72] |[four] | > +--+---+{noformat} > Here, we can see that the avro binary is correctly generated, but the > deserialized version is a copy of the last row. I have not yet verified that > this is an issue in cluster mode as well. > > I dug into a bit more of the code and it seems like the resuse of {{result}} > in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. > I set a breakpoint in {{LocalRelation}} and the {{data}} sequence seem to all > point to the same address in memory - and therefore a mutation in one > variable will cause all of it to mutate. > !Screen Shot 2019-05-21 at 2.39.27 PM.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27953) Save default constraint with Column into table properties when create Hive table
[ https://issues.apache.org/jira/browse/SPARK-27953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27953: Assignee: Apache Spark

> Save default constraint with Column into table properties when create Hive > table > > > Key: SPARK-27953 > URL: https://issues.apache.org/jira/browse/SPARK-27953 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Assignee: Apache Spark > Priority: Major > > This is a sub-task of implementing default constraints. > This Jira aims to save the default constraint into the Hive table's properties when the table is created.
[jira] [Commented] (SPARK-27953) Save default constraint with Column into table properties when create Hive table
[ https://issues.apache.org/jira/browse/SPARK-27953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856588#comment-16856588 ] Apache Spark commented on SPARK-27953: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/24792

> Save default constraint with Column into table properties when create Hive > table > > > Key: SPARK-27953 > URL: https://issues.apache.org/jira/browse/SPARK-27953 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > This is a sub-task of implementing default constraints. > This Jira aims to save the default constraint into the Hive table's properties when the table is created.
[jira] [Assigned] (SPARK-27953) Save default constraint with Column into table properties when create Hive table
[ https://issues.apache.org/jira/browse/SPARK-27953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27953: Assignee: (was: Apache Spark)

> Save default constraint with Column into table properties when create Hive > table > > > Key: SPARK-27953 > URL: https://issues.apache.org/jira/browse/SPARK-27953 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > This is a sub-task of implementing default constraints. > This Jira aims to save the default constraint into the Hive table's properties when the table is created.
[jira] [Created] (SPARK-27956) Allow subqueries as partition filter
Johannes Mayer created SPARK-27956: -- Summary: Allow subqueries as partition filter Key: SPARK-27956 URL: https://issues.apache.org/jira/browse/SPARK-27956 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Johannes Mayer

Subqueries are not pushed down as partition filters. See the following example:

{code:java} create table user_mayerjoh.tab (c1 string) partitioned by (c2 string) stored as parquet; {code}

{code:java} explain select * from user_mayerjoh.tab where c2 < 1;{code}

== Physical Plan == *(1) FileScan parquet user_mayerjoh.tab[c1#22,c2#23] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, *PartitionFilters: [isnotnull(c2#23), (cast(c2#23 as int) < 1)]*, PushedFilters: [], ReadSchema: struct

{code:java} explain select * from user_mayerjoh.tab where c2 < (select 1);{code}

== Physical Plan == +- *(1) FileScan parquet user_mayerjoh.tab[c1#30,c2#31] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, *PartitionFilters: [isnotnull(c2#31)]*, PushedFilters: [], ReadSchema: struct

Is it possible to execute the subquery first and use its result as a partition filter?
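Until scalar subqueries can participate in partition pruning, a common workaround is to evaluate the subquery separately and inline its result as a literal, which the planner does push into PartitionFilters. The sketch below reuses the reporter's table; the trivial `SELECT 1` stands in for whatever the real subquery computes, and in practice a client would substitute the fetched value into the second statement.

```sql
-- Step 1: evaluate the subquery on its own and capture its (single) result.
SELECT 1;

-- Step 2: re-run the main query with that result inlined as a constant;
-- EXPLAIN then shows it in PartitionFilters, as in the first plan above.
EXPLAIN SELECT * FROM user_mayerjoh.tab WHERE c2 < 1;
```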
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856558#comment-16856558 ] Piotr Chowaniec commented on SPARK-18105: - I have a similar issue with Spark 2.3.2. Here is a stack trace: {code:java} org.apache.spark.scheduler.DAGScheduler : ShuffleMapStage 647 (count at Step.java:20) failed in 1.908 s due to org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:62) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.agg_doAggregateWithKeys_1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.agg_doAggregateWithKeys_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Stream is corrupted at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:252) at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:170) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:349) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:336) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:336) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1381) at org.apache.spark.util.Utils$.copyStream(Utils.scala:357) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:436) ... 21 more Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 2010 of input buffer at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39) at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:247) ... 29 more {code} It happens during ETL process that has about 200 steps. It looks like it depends on the input data because we have exceptions only on the production environment (on test and dev machines same process with different data is running without problems). Unfortunately there is no way to use production data on other environment, so we cannot find differences. 
Changing compression codec to Snappy gives: {code:java} o.apache.spark.scheduler.TaskSetManager : Lost task 0.0 in stage 852.3 (TID 308 36, localhost, executor driver): FetchFailed(BlockManagerId(driver, DNS.domena, 33588, None), shuffleId=298, mapId=2, reduceId=3, message= org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:62) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at
[jira] [Updated] (SPARK-27923) List all cases that PostgreSQL throws an exception but Spark SQL is NULL
[ https://issues.apache.org/jira/browse/SPARK-27923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27923: Description: In this ticket, we plan to list all cases that PostgreSQL throws an exception but Spark SQL is NULL. When porting the [boolean.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/boolean.sql] found a case: # Cast unaccepted value to boolean type throws [invalid input syntax|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/boolean.out#L45-L47]. When porting the [case.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/case.sql] found a case: # Division by zero [throws an exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/case.out#L96-L99]. was: # {{SELECT bool 'test' AS error;}} [link|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/boolean.out#L45-L47]. # {{SELECT 1/0 AS error;}} [link|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/case.out#L96-L99]. > List all cases that PostgreSQL throws an exception but Spark SQL is NULL > > > Key: SPARK-27923 > URL: https://issues.apache.org/jira/browse/SPARK-27923 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > In this ticket, we plan to list all cases that PostgreSQL throws an exception > but Spark SQL is NULL. > When porting the > [boolean.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/boolean.sql] > found a case: > # Cast unaccepted value to boolean type throws [invalid input > syntax|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/boolean.out#L45-L47]. 
> When porting the > [case.sql|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/case.sql] > found a case: > # Division by zero [throws an > exception|https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/expected/case.out#L96-L99].
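For concreteness, the two cases listed above read as follows (results noted in comments; the Spark SQL behavior described is the default, non-ANSI behavior):

```sql
-- Casting an unaccepted value to boolean:
SELECT CAST('test' AS BOOLEAN);  -- PostgreSQL: error (invalid input syntax); Spark SQL: NULL

-- Division by zero:
SELECT 1 / 0;                    -- PostgreSQL: error (division by zero); Spark SQL: NULL
```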
[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory
[ https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856470#comment-16856470 ] Iris Shaibsky commented on SPARK-25380: --- We are also facing this on Spark 2.4.2. I see that the PR was merged to master on March 13, but it was not included in the Spark 2.4.3 release. When will this PR be included in a release? Thanks!

> Generated plans occupy over 50% of Spark driver memory > -- > > Key: SPARK-25380 > URL: https://issues.apache.org/jira/browse/SPARK-25380 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.3.1 > Environment: Spark 2.3.1 (AWS emr-5.16.0) > > Reporter: Michael Spector > Priority: Minor > Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png > > > When debugging an OOM exception during a long run of a Spark application (many iterations of the same code), I found that generated plans occupy most of the driver memory. I'm not sure whether this is a memory leak or not, but it would be helpful if old plans could be purged from memory anyway. > Attached are screenshots of the OOM heap dump opened in JVisualVM.
[jira] [Updated] (SPARK-27462) Spark hive can not choose some columns in target table flexibly, when running insert into.
[ https://issues.apache.org/jira/browse/SPARK-27462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27462: --- Issue Type: Sub-task (was: New Feature) Parent: SPARK-27943

> Spark hive can not choose some columns in target table flexibly, when running > insert into. > -- > > Key: SPARK-27462 > URL: https://issues.apache.org/jira/browse/SPARK-27462 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > Spark SQL does not support choosing a subset of the target table's columns when running > {code:java} > insert into tableA select ... from tableB;{code} > This feature is supported by Hive, so I think this syntax should be consistent with Hive. > Hive supports the following forms of 'insert into': > {code:java} > insert into gja_test_spark select * from gja_test; > insert into gja_test_spark(key, value, other) select key, value, other from gja_test; > insert into gja_test_spark(key, value) select value, other from gja_test; > insert into gja_test_spark(key, other) select value, other from gja_test; > insert into gja_test_spark(value, other) select value, other from gja_test;{code}
[jira] [Created] (SPARK-27955) Update default constraint with Column into table properties when alter Hive table
jiaan.geng created SPARK-27955: -- Summary: Update default constraint with Column into table properties when alter Hive table Key: SPARK-27955 URL: https://issues.apache.org/jira/browse/SPARK-27955 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: jiaan.geng This is a sub-task of implementing default constraints. This Jira aims to update the default constraint stored in the Hive table's properties after an ALTER TABLE.
[jira] [Created] (SPARK-27954) Restore default constraint with Column from table properties after get metadata from Hive
jiaan.geng created SPARK-27954: -- Summary: Restore default constraint with Column from table properties after get metadata from Hive Key: SPARK-27954 URL: https://issues.apache.org/jira/browse/SPARK-27954 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: jiaan.geng This is a sub-task of implementing default constraints. This Jira aims to restore the default constraint from the Hive table's properties after fetching metadata from Hive.
[jira] [Created] (SPARK-27953) Save default constraint with Column into table properties when create Hive table
jiaan.geng created SPARK-27953: -- Summary: Save default constraint with Column into table properties when create Hive table Key: SPARK-27953 URL: https://issues.apache.org/jira/browse/SPARK-27953 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: jiaan.geng This is a sub-task of implementing default constraints. This Jira aims to save the default constraint into the Hive table's properties when the table is created.
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: Default constraint with column is ANSI standard. Hive 3.0+ has supported default constraint ref:https://issues.apache.org/jira/browse/HIVE-18726 But Spark SQL implement this feature not yet. Hive is widely used in production environments and is the standard in the field of big data in fact. But Hive exists many version used in production and the feature between each version are different. Spark SQL need to implement default constraint, but there are two points to pay attention to in design: One is Spark SQL should reduce coupling with Hive. Another is default constraint could compatible with different versions of Hive. We want to save the metadata of default constraint into properties of Hive table, and then we restore metadata from the properties after client gets newest metadata. The implement is the same as other metadata (e.g. partition,bucket,statistics). This is a big work, wo I want to split this work into some sub tasks, as follows: was: Default constraint with column is ANSI standard. Hive 3.0+ has supported default constraint ref:https://issues.apache.org/jira/browse/HIVE-18726 But Spark SQL implement this feature not yet. Hive is widely used in production environments and is the standard in the field of big data in fact. But Hive exists many version used in production and the feature between each version are different. Spark SQL need to implement default constraint, but there are two points to pay attention to in design: One is Spark SQL should reduce coupling with Hive. Another is default constraint could compatible with different versions of Hive. We want to save the metadata of default constraint into properties of Hive table, and then we restore metadata from the properties after client gets newest metadata. 
> Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Priority: Major > > Default constraint with column is ANSI standard. > Hive 3.0+ has supported default constraint > ref:https://issues.apache.org/jira/browse/HIVE-18726 > But Spark SQL implement this feature not yet. > Hive is widely used in production environments and is the standard in the > field of big data in fact. But Hive exists many version used in production > and the feature between each version are different. > Spark SQL need to implement default constraint, but there are two points to > pay attention to in design: > One is Spark SQL should reduce coupling with Hive. > Another is default constraint could compatible with different versions of > Hive. > We want to save the metadata of default constraint into properties of Hive > table, and then we restore metadata from the properties after client gets > newest metadata. > The implement is the same as other metadata (e.g. > partition,bucket,statistics). > This is a big work, wo I want to split this work into some sub tasks, as > follows: > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27521) move data source v2 API to catalyst module
[ https://issues.apache.org/jira/browse/SPARK-27521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-27521: Labels: release-notes (was: ) > move data source v2 API to catalyst module > -- > > Key: SPARK-27521 > URL: https://issues.apache.org/jira/browse/SPARK-27521 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Wenchen Fan > Assignee: Wenchen Fan > Priority: Major > Labels: release-notes >
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. Spark SQL needs to implement default constraints, with two points to consider in the design: one is that Spark SQL should reduce its coupling with Hive; the other is that default constraints should be compatible with different versions of Hive. We want to save the default-constraint metadata into the Hive table's properties and restore it from those properties after the client fetches the latest metadata. was: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. Spark SQL needs to implement default constraints, with two points to consider in the design: one is that Spark SQL should reduce its coupling with Hive; the other is that default constraints should be compatible with different versions of Hive. > Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > A default constraint on a column is part of the ANSI SQL standard.
> Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. > Spark SQL needs to implement default constraints, with two points to consider in the design: one is that Spark SQL should reduce its coupling with Hive; the other is that default constraints should be compatible with different versions of Hive. > We want to save the default-constraint metadata into the Hive table's properties and restore it from those properties after the client fetches the latest metadata.
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. Spark SQL needs to implement default constraints, with two points to consider in the design: one is that Spark SQL should reduce its coupling with Hive; the other is that default constraints should be compatible with different versions of Hive. was: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. Spark SQL needs to implement > Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > A default constraint on a column is part of the ANSI SQL standard. > Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ.
> Spark SQL needs to implement default constraints, with two points to consider in the design: one is that Spark SQL should reduce its coupling with Hive; the other is that default constraints should be compatible with different versions of Hive. >
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. Spark SQL needs to implement was: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. > Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > A default constraint on a column is part of the ANSI SQL standard. > Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. > Spark SQL needs to implement
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ. was: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > A default constraint on a column is part of the ANSI SQL standard. > Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Hive is widely used in production environments and is the de facto standard in the big data field, but many Hive versions are used in production and their feature sets differ.
[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Summary: Implement default constraint with Column for Hive table (was: Add default constraint when create hive table) > Implement default constraint with Column for Hive table > --- > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > A default constraint on a column is part of the ANSI SQL standard. > Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. >
[jira] [Updated] (SPARK-27943) Add default constraint when create hive table
[ https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27943: --- Description: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. was: A default constraint on a column is part of the ANSI SQL standard. Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. > Add default constraint when create hive table > - > > Key: SPARK-27943 > URL: https://issues.apache.org/jira/browse/SPARK-27943 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Major > > A default constraint on a column is part of the ANSI SQL standard. > Hive 3.0+ supports default constraints (ref: https://issues.apache.org/jira/browse/HIVE-18726), but Spark SQL does not implement this feature yet. >