[jira] [Commented] (SPARK-28845) Enable spark.sql.execution.sortBeforeRepartition only for retried stages
[ https://issues.apache.org/jira/browse/SPARK-28845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035107#comment-17035107 ] Wenchen Fan commented on SPARK-28845: - I'm a little hesitant to abandon the sort approach completely. If a stage has many tasks, always retrying the entire stage may end up never finishing it and retrying forever. Performance-wise, I think it's better to combine the sort and retry approaches. But as [~XuanYuan] said, this is too difficult and we didn't manage to do it. > Enable spark.sql.execution.sortBeforeRepartition only for retried stages > > > Key: SPARK-28845 > URL: https://issues.apache.org/jira/browse/SPARK-28845 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > For fixing the correctness bug of SPARK-28699, we disable radix sort for the > scenario of repartition in Spark SQL. This will cause a performance > regression. > So for limiting the performance overhead, we'll do the optimizing work by > only enabling the sort for the repartition operation when stage retries happen. > This work depends on SPARK-25341. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
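For context, the behaviour being debated is gated by a single SQL flag today. A minimal sketch of toggling it, assuming a spark-shell session where {{spark}} is in scope (data sizes are illustrative only):

{code:scala}
// With the flag on (its default), each map task locally sorts its rows before the
// round-robin repartition, so a retried stage reproduces the same row-to-partition
// assignment at the cost of the extra (non-radix) sort.
spark.conf.set("spark.sql.execution.sortBeforeRepartition", "true")
val df = spark.range(0, 1000000L).repartition(200)
df.count()

// Turning it off avoids the sort but reintroduces the nondeterminism behind the
// SPARK-28699 correctness issue on stage retry; this ticket proposes paying the
// sort cost only for retried stages.
spark.conf.set("spark.sql.execution.sortBeforeRepartition", "false")
{code}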
[jira] [Resolved] (SPARK-30795) Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()
[ https://issues.apache.org/jira/browse/SPARK-30795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30795. -- Fix Version/s: 3.1.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/27544 > Spark SQL codegen's code() interpolator should treat escapes like Scala's > StringContext.s() > --- > > Key: SPARK-30795 > URL: https://issues.apache.org/jira/browse/SPARK-30795 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0 >Reporter: Kris Mok >Priority: Major > Fix For: 3.1.0 > > > The {{code()}} string interpolator in Spark SQL's code generator should treat > escapes like Scala's builtin {{StringContext.s()}} interpolator, i.e. it > should treat escapes in the code parts, and should not treat escapes in the > input arguments. > For example, > {code} > val arg = "This is an argument." > val str = s"This is string part 1. $arg This is string part 2." > val code = code"This is string part 1. $arg This is string part 2." > assert(code.toString == str) > {code} > We should expect the {{code()}} interpolator produce the same thing as the > {{StringContext.s()}} interpolator, where only escapes in the string parts > should be treated, while the args should be kept verbatim. > But in the current implementation, due to the eager folding of code parts and > literal input args, the escape treatment is incorrectly done on both code > parts and literal args. > That causes a problem when an arg contains escape sequences and wants to > preserve that in the final produced code string. For example, in {{Like}} > expression's codegen, there's an ugly workaround for this bug: > {code} > // We need double escape to avoid > org.codehaus.commons.compiler.CompileException. > // '\\' will cause exception 'Single quote must be backslash-escaped in > character literal'. > // '\"' will cause exception 'Line break in literal not allowed'. > val newEscapeChar = if (escapeChar == '\"' || escapeChar == '\\') { > s"""\\$escapeChar""" > } else { > escapeChar > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25929) Support metrics with tags
[ https://issues.apache.org/jira/browse/SPARK-25929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035074#comment-17035074 ] John Zhuge commented on SPARK-25929: Yeah, I can feel the pain. When I ingest into InfluxDB, I have to use its [Graphite templates|https://github.com/influxdata/influxdb/tree/v1.7.10/services/graphite#templates], e.g., {noformat} "*.*.*.DAGScheduler.*.* application.app_id.executor_id.measurement.type.qty name=DAGScheduler", "*.*.*.ExecutorAllocationManager.*.* application.app_id.executor_id.measurement.type.qty name=ExecutorAllocationManager", "*.*.*.ExternalShuffle.*.* application.app_id.executor_id.measurement.type.qty name=ExternalShuffle", {noformat} Hard to get right. Easily obsolete. Doesn't support multiple versions. > Support metrics with tags > - > > Key: SPARK-25929 > URL: https://issues.apache.org/jira/browse/SPARK-25929 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: John Zhuge >Priority: Major > > For better integration with DBs that support tags/labels, e.g., InfluxDB, > Prometheus, Atlas, etc. > We should continue to support the current Graphite-style metrics. > Dropwizard Metrics v5 supports tags. It has been in RC status since Feb. > Currently > `[5.0.0-rc2|https://github.com/dropwizard/metrics/releases/tag/v5.0.0-rc2]` > is in Maven. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30796) Add parameter position for REGEXP_REPLACE
[ https://issues.apache.org/jira/browse/SPARK-30796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-30796: --- Parent: SPARK-27764 Issue Type: Sub-task (was: New Feature) > Add parameter position for REGEXP_REPLACE > - > > Key: SPARK-30796 > URL: https://issues.apache.org/jira/browse/SPARK-30796 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > > postgresql > {{format: regexp_replace}}(_{{source}}_, _{{pattern}}_, _{{replacement}}_ [, > _{{flags}}_ ]). > reference: [https://www.postgresql.org/docs/11/functions-matching.html] > vertica > REGEXP_REPLACE( _string_, _target_ [, _replacement_ [, _position_ [, > _occurrence_ ... [, _regexp_modifiers_ ] ] ] ] ) > reference: > [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace] > oracle > [https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490] > redshift > https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30796) Add parameter position for REGEXP_REPLACE
[ https://issues.apache.org/jira/browse/SPARK-30796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035066#comment-17035066 ] jiaan.geng commented on SPARK-30796: I'm working on it. > Add parameter position for REGEXP_REPLACE > - > > Key: SPARK-30796 > URL: https://issues.apache.org/jira/browse/SPARK-30796 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > > postgresql > {{format: regexp_replace}}(_{{source}}_, _{{pattern}}_, _{{replacement}}_ [, > _{{flags}}_ ]). > reference: [https://www.postgresql.org/docs/11/functions-matching.html] > vertica > REGEXP_REPLACE( _string_, _target_ [, _replacement_ [, _position_ [, > _occurrence_ ... [, _regexp_modifiers_ ] ] ] ] ) > reference: > [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace] > oracle > [https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490] > redshift > https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30796) Add parameter position for REGEXP_REPLACE
jiaan.geng created SPARK-30796: -- Summary: Add parameter position for REGEXP_REPLACE Key: SPARK-30796 URL: https://issues.apache.org/jira/browse/SPARK-30796 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: jiaan.geng postgresql {{format: regexp_replace}}(_{{source}}_, _{{pattern}}_, _{{replacement}}_ [, _{{flags}}_ ]). reference: [https://www.postgresql.org/docs/11/functions-matching.html] vertica REGEXP_REPLACE( _string_, _target_ [, _replacement_ [, _position_ [, _occurrence_ ... [, _regexp_modifiers_ ] ] ] ] ) reference: [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace] oracle [https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490] redshift https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
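Spark currently accepts only the three-argument form, so the extra parameters above are what the other engines offer. A small sketch of the difference, assuming a spark-shell session (the four-argument call is shown only as a comment because it is a proposal, not an existing Spark API):

{code:scala}
// Supported today: replace every match, scanning from the start of the string.
spark.sql("SELECT regexp_replace('abcabc', 'b', 'X')").show()   // aXcaXc

// Proposed (hypothetical, following the Oracle/Vertica/Redshift signatures above):
//   regexp_replace('abcabc', 'b', 'X', 4)
// would only consider matches starting at or after character position 4, with
// further optional arguments selecting the occurrence and regexp modifier flags.
{code}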
[jira] [Assigned] (SPARK-30722) Document type hints in pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-30722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30722: Assignee: Hyukjin Kwon > Document type hints in pandas UDF > - > > Key: SPARK-30722 > URL: https://issues.apache.org/jira/browse/SPARK-30722 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > We should document the new type hints for pandas UDF introduced at > SPARK-28264. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30722) Document type hints in pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-30722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30722. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27466 [https://github.com/apache/spark/pull/27466] > Document type hints in pandas UDF > - > > Key: SPARK-30722 > URL: https://issues.apache.org/jira/browse/SPARK-30722 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > We should document the new type hints for pandas UDF introduced at > SPARK-28264. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30780) LocalRelation should use emptyRDD if it is empty
[ https://issues.apache.org/jira/browse/SPARK-30780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30780. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27530 [https://github.com/apache/spark/pull/27530] > LocalRelation should use emptyRDD if it is empty > > > Key: SPARK-30780 > URL: https://issues.apache.org/jira/browse/SPARK-30780 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.5 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.0.0 > > > LocalRelation creates an RDD of a single partition when it is empty. This is > somewhat unexpected, and can lead to unnecessary work. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
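A quick way to observe the behaviour described above, assuming a spark-shell session with the session implicits imported (the exact API used to build the empty DataFrame is incidental):

{code:scala}
import spark.implicits._

val empty = Seq.empty[(Int, String)].toDF("id", "name")

// Before the change, an empty LocalRelation was still backed by a single (empty)
// partition, so actions scheduled one no-op task; with the fix it is backed by an
// empty RDD with zero partitions.
println(empty.rdd.getNumPartitions)
{code}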
[jira] [Created] (SPARK-30795) Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()
Kris Mok created SPARK-30795: Summary: Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s() Key: SPARK-30795 URL: https://issues.apache.org/jira/browse/SPARK-30795 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 3.0.0 Reporter: Kris Mok The {{code()}} string interpolator in Spark SQL's code generator should treat escapes like Scala's builtin {{StringContext.s()}} interpolator, i.e. it should treat escapes in the code parts, and should not treat escapes in the input arguments. For example, {code} val arg = "This is an argument." val str = s"This is string part 1. $arg This is string part 2." val code = code"This is string part 1. $arg This is string part 2." assert(code.toString == str) {code} We should expect the {{code()}} interpolator produce the same thing as the {{StringContext.s()}} interpolator, where only escapes in the string parts should be treated, while the args should be kept verbatim. But in the current implementation, due to the eager folding of code parts and literal input args, the escape treatment is incorrectly done on both code parts and literal args. That causes a problem when an arg contains escape sequences and wants to preserve that in the final produced code string. For example, in {{Like}} expression's codegen, there's an ugly workaround for this bug: {code} // We need double escape to avoid org.codehaus.commons.compiler.CompileException. // '\\' will cause exception 'Single quote must be backslash-escaped in character literal'. // '\"' will cause exception 'Line break in literal not allowed'. val newEscapeChar = if (escapeChar == '\"' || escapeChar == '\\') { s"""\\$escapeChar""" } else { escapeChar } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30794) Stage Level scheduling: Add ability to set off heap memory
Thomas Graves created SPARK-30794: - Summary: Stage Level scheduling: Add ability to set off heap memory Key: SPARK-30794 URL: https://issues.apache.org/jira/browse/SPARK-30794 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Thomas Graves For stage level scheduling in ExecutorResourceRequests we support setting heap memory, pyspark memory, and memory overhead. We have no split out off heap memory as its own configuration so we should add it as an option. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
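For reference, a sketch of how the existing executor-level requests are expressed with the stage-level scheduling API, with the setter this ticket asks for shown only as a comment (its name is an assumption, not a committed API):

{code:scala}
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder}

val reqs = new ExecutorResourceRequests()
  .memory("4g")           // executor heap memory
  .memoryOverhead("1g")   // memory overhead
  .pysparkMemory("2g")    // pyspark memory
// reqs.offHeapMemory("2g")  // proposed by this ticket; not available yet

val profile = new ResourceProfileBuilder().require(reqs).build()
// An RDD can then opt into these requests for its stages, e.g. rdd.withResources(profile).
{code}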
[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution
[ https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034875#comment-17034875 ] Giri commented on SPARK-27913: -- This issue doesn't exist in *spark-3.0.0-preview2 or in spark 2.3*. Will this fix be ported to the 2.4.x branch? It appears that the issue is related to Spark not using the schema from the metastore but the one from the ORC files; this causes a schema mismatch and an out-of-bounds exception when OrcDeserializer accesses a field that doesn't exist in the file. I see logs like this: 20/02/11 14:30:38 INFO RecordReaderImpl: Reader schema not provided -- using file schema struct> 20/02/11 14:30:38 INFO RecordReaderImpl: Reader schema not provided -- using file schema struct> > Spark SQL's native ORC reader implements its own schema evolution > - > > Key: SPARK-27913 > URL: https://issues.apache.org/jira/browse/SPARK-27913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Owen O'Malley >Priority: Major > > ORC's reader handles a wide range of schema evolution, but the Spark SQL > native ORC bindings do not provide the desired schema to the ORC reader. This > causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
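Since the description notes the regression appears when moving spark.sql.orc.impl from 'hive' to 'native', one hedged mitigation for affected versions is simply switching that flag back, sketched below (the table name is a placeholder; the setting can also go into spark-defaults.conf):

{code:scala}
// Fall back to the Hive ORC reader, which resolves columns against the metastore
// schema and therefore tolerates the schema evolution described in this ticket.
spark.conf.set("spark.sql.orc.impl", "hive")
spark.table("db.evolved_table").show()
{code}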
[jira] [Commented] (SPARK-28845) Enable spark.sql.execution.sortBeforeRepartition only for retried stages
[ https://issues.apache.org/jira/browse/SPARK-28845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034755#comment-17034755 ] Thomas Graves commented on SPARK-28845: --- [~cloud_fan] [~XuanYuan] I wanted to follow up on this with regards to [https://github.com/apache/spark/pull/25491] It looks like this got closed because it's too difficult, but with SPARK-25341 - do we need the sort at all? I didn't think we did, and if we do I would like to understand why. So then I assume it comes down to performance. > Enable spark.sql.execution.sortBeforeRepartition only for retried stages > > > Key: SPARK-28845 > URL: https://issues.apache.org/jira/browse/SPARK-28845 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > For fixing the correctness bug of SPARK-28699, we disable radix sort for the > scenario of repartition in Spark SQL. This will cause a performance > regression. > So for limiting the performance overhead, we'll do the optimizing work by > only enabling the sort for the repartition operation when stage retries happen. > This work depends on SPARK-25341. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30793) Wrong truncations of timestamps before the epoch to minutes and seconds
Maxim Gekk created SPARK-30793: -- Summary: Wrong truncations of timestamps before the epoch to minutes and seconds Key: SPARK-30793 URL: https://issues.apache.org/jira/browse/SPARK-30793 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Truncations to seconds and minutes of timestamps after the epoch are correct: {code:sql} spark-sql> select date_trunc('SECOND', '2020-02-11 00:01:02.123'), date_trunc('SECOND', '2020-02-11 00:01:02.789'); 2020-02-11 00:01:02 2020-02-11 00:01:02 {code} but truncations of timestamps before the epoch are incorrect: {code:sql} spark-sql> select date_trunc('SECOND', '1960-02-11 00:01:02.123'), date_trunc('SECOND', '1960-02-11 00:01:02.789'); 1960-02-11 00:01:03 1960-02-11 00:01:03 {code} The result must be *1960-02-11 00:01:02 1960-02-11 00:01:02* The same for the MINUTE level: {code:sql} spark-sql> select date_trunc('MINUTE', '1960-02-11 00:01:01'), date_trunc('MINUTE', '1960-02-11 00:01:50'); 1960-02-11 00:02:00 1960-02-11 00:02:00 {code} The result must be 1960-02-11 00:01:00 1960-02-11 00:01:00 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30792) Dataframe .limit() performance improvements
Nathan Grand created SPARK-30792: Summary: Dataframe .limit() performance improvements Key: SPARK-30792 URL: https://issues.apache.org/jira/browse/SPARK-30792 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Nathan Grand It seems that {code:java} .limit(){code} is much less efficient than it could be/one would expect when reading a large dataset from parquet: {code:java} val sample = spark.read.parquet("/Some/Large/Data.parquet").limit(1000) // Do something with sample ...{code} This might take hours, depending on the size of the data. By comparison, {code:java} spark.read.parquet("/Some/Large/Data.parquet").show(1000){code} is essentially instant. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30783) Hive 2.3 profile should exclude hive-service-rpc
[ https://issues.apache.org/jira/browse/SPARK-30783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30783. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27533 [https://github.com/apache/spark/pull/27533] > Hive 2.3 profile should exclude hive-service-rpc > > > Key: SPARK-30783 > URL: https://issues.apache.org/jira/browse/SPARK-30783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > Fix For: 3.0.0 > > Attachments: hive-service-rpc-2.3.6-classes, > spark-hive-thriftserver_2.12-3.0.0-20200207.021914-364-classes > > > hive-service-rpc 2.3.6 and spark sql's thrift server module have duplicate > classes. Leaving hive-service-rpc 2.3.6 in the class path means that spark > can pick up classes defined in hive instead of its thrift server module, > which can cause hard to debug runtime errors due to class loading order and > compilation errors for applications depend on spark. > > If you compare hive-service-rpc 2.3.6's jar > ([https://search.maven.org/remotecontent?filepath=org/apache/hive/hive-service-rpc/2.3.6/hive-service-rpc-2.3.6.jar]) > and spark thrift server's jar (e.g. > [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-hive-thriftserver_2.12/3.0.0-SNAPSHOT/spark-hive-thriftserver_2.12-3.0.0-20200207.021914-364.jar),] > you will see that all of classes provided by hive-service-rpc-2.3.6.jar are > covered by spark thrift server's jar. I am attaching the list of jar contents > for your reference. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
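For applications that hit the duplicate-class problem described above before picking up this fix, a hedged sbt-style sketch of excluding the transitive artifact (the version shown is only an example):

{code:scala}
// build.sbt: keep hive-service-rpc from shadowing the classes that the
// spark-hive-thriftserver jar already provides.
libraryDependencies += ("org.apache.spark" %% "spark-hive-thriftserver" % "3.0.0")
  .exclude("org.apache.hive", "hive-service-rpc")
{code}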
[jira] [Assigned] (SPARK-27545) Update the Documentation for CACHE TABLE and UNCACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27545: --- Assignee: Rakesh Raushan (was: hantiantian) > Update the Documentation for CACHE TABLE and UNCACHE TABLE > -- > > Key: SPARK-27545 > URL: https://issues.apache.org/jira/browse/SPARK-27545 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: hantiantian >Assignee: Rakesh Raushan >Priority: Major > Fix For: 3.0.0 > > > spark-sql> cache table v1 as select * from a; > spark-sql> uncache table v1; > spark-sql> cache table v1 as select * from a; > 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: > 0: get_table : db=apachespark tbl=a > 2019-04-23 14:50:09,038 INFO > org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root > ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a > Error in query: Temporary view 'v1' already exists; > we should document it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30754) Reuse results of floorDiv in calculations of floorMod in DateTimeUtils
[ https://issues.apache.org/jira/browse/SPARK-30754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30754. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27491 [https://github.com/apache/spark/pull/27491] > Reuse results of floorDiv in calculations of floorMod in DateTimeUtils > -- > > Key: SPARK-30754 > URL: https://issues.apache.org/jira/browse/SPARK-30754 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > > A couple methods in DateTimeUtils call Math.floorDiv and Math.floorMod with > the same arguments. In this way, results of Math.floorDiv can be reused in > calculation of Math.floorMod. For example, this optimization can be applied > to the microsToInstant and truncDate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
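The identity being exploited is that floorMod can be derived from an already-computed floorDiv, so only one of the two calls is needed. A tiny sketch (the constant name is illustrative):

{code:scala}
val MICROS_PER_SECOND = 1000000L
val micros = -123456789L

// Instead of calling both Math.floorDiv and Math.floorMod with the same arguments...
val div = Math.floorDiv(micros, MICROS_PER_SECOND)
// ...reuse the quotient: floorMod(x, y) == x - floorDiv(x, y) * y.
val mod = micros - div * MICROS_PER_SECOND

assert(mod == Math.floorMod(micros, MICROS_PER_SECOND))
{code}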
[jira] [Assigned] (SPARK-30754) Reuse results of floorDiv in calculations of floorMod in DateTimeUtils
[ https://issues.apache.org/jira/browse/SPARK-30754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30754: Assignee: Maxim Gekk > Reuse results of floorDiv in calculations of floorMod in DateTimeUtils > -- > > Key: SPARK-30754 > URL: https://issues.apache.org/jira/browse/SPARK-30754 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > A couple methods in DateTimeUtils call Math.floorDiv and Math.floorMod with > the same arguments. In this way, results of Math.floorDiv can be reused in > calculation of Math.floorMod. For example, this optimization can be applied > to the microsToInstant and truncDate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27710) ClassNotFoundException: $line196400984558.$read$ in OuterScopes
[ https://issues.apache.org/jira/browse/SPARK-27710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034532#comment-17034532 ] Jelmer Kuperus edited comment on SPARK-27710 at 2/11/20 3:07 PM: - This also happens in Apache Toree {code:java} case class AttributeRow(categoryId: String, key: String, count: Long, label: String) val mySpark = spark import mySpark.implicits._ spark.read.parquet("/user/jkuperus/foo").as[AttributeRow] .limit(1) .map(r => r) .show() {code} Gives {noformat} StackTrace: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70) at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485) at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485){noformat} was (Author: jelmer): This also happens in Apache Toree {code:java} val mySpark = spark import mySpark.implicits._ spark.read.parquet("/user/jkuperus/foo").as[AttributeRow] .limit(1) .map(r => r) .show() {code} Gives {noformat} StackTrace: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70) at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485) at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485){noformat} > ClassNotFoundException: $line196400984558.$read$ in OuterScopes > --- > > Key: SPARK-27710 > URL: https://issues.apache.org/jira/browse/SPARK-27710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > > My colleague hit the following exception when using Spark in a Zeppelin > notebook: > {code:java} > java.lang.ClassNotFoundException: $line196400984558.$read$ > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.util.Utils$.classForName(Utils.scala:238) > at > org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62) > at > org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485) > at > org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105) > at scala.Option.getOrElse(Option.scala:121) > at > 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105) > at > org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105) > at > org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99) > at > org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.sca
[jira] [Commented] (SPARK-27710) ClassNotFoundException: $line196400984558.$read$ in OuterScopes
[ https://issues.apache.org/jira/browse/SPARK-27710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034532#comment-17034532 ] Jelmer Kuperus commented on SPARK-27710: This also happens in Apache Toree {code:java} val mySpark = spark import mySpark.implicits._ spark.read.parquet("/user/jkuperus/foo").as[AttributeRow] .limit(1) .map(r => r) .show() {code} Gives {noformat} StackTrace: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70) at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485) at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485){noformat} > ClassNotFoundException: $line196400984558.$read$ in OuterScopes > --- > > Key: SPARK-27710 > URL: https://issues.apache.org/jira/browse/SPARK-27710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > > My colleague hit the following exception when using Spark in a Zeppelin > notebook: > {code:java} > java.lang.ClassNotFoundException: $line196400984558.$read$ > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.util.Utils$.classForName(Utils.scala:238) > at > org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62) > at > org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485) > at > org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105) > at > org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105) > at > org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99) > at > org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$class.prepareArguments(objects.scala:98) > at > org.apache.spark.sql.catalyst.expressions.objects.NewInstance.prepareArguments(objects.scala:431) > at > org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:483) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105) > at > org.apache.spark.sql.execution.DeserializeToObjectExec.doConsume(objects.scala:84) > at > org.apache.spark.sql.execu
[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034492#comment-17034492 ] Jorge Machado commented on SPARK-24615: --- Yeah, that was my question. Thanks for the response. I will look at rapid.ai and try to use it inside a partition or so... > SPIP: Accelerator-aware task scheduling for Spark > - > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Thomas Graves >Priority: Major > Labels: Hydrogen, SPIP > Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, > SPIP_ Accelerator-aware scheduling.pdf > > > (The JIRA received a major update on 2019/02/28. Some comments were based on > an earlier version. Please ignore them. New comments start at > [#comment-16778026].) > h2. Background and Motivation > GPUs and other accelerators have been widely used for accelerating special > workloads, e.g., deep learning and signal processing. While users from the AI > community use GPUs heavily, they often need Apache Spark to load and process > large datasets and to handle complex data scenarios like streaming. YARN and > Kubernetes already support GPUs in their recent releases. Although Spark > supports those two cluster managers, Spark itself is not aware of GPUs > exposed by them and hence Spark cannot properly request GPUs and schedule > them for users. This leaves a critical gap to unify big data and AI workloads > and make life simpler for end users. > To make Spark be aware of GPUs, we shall make two major changes at high level: > * At cluster manager level, we update or upgrade cluster managers to include > GPU support. Then we expose user interfaces for Spark to request GPUs from > them. > * Within Spark, we update its scheduler to understand available GPUs > allocated to executors, user task requests, and assign GPUs to tasks properly. > Based on the work done in YARN and Kubernetes to support GPUs and some > offline prototypes, we could have necessary features implemented in the next > major release of Spark. You can find a detailed scoping doc here, where we > listed user stories and their priorities. > h2. Goals > * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes. > * No regression on scheduler performance for normal jobs. > h2. Non-goals > * Fine-grained scheduling within one GPU card. > ** We treat one GPU card and its memory together as a non-divisible unit. > * Support TPU. > * Support Mesos. > * Support Windows. > h2. Target Personas > * Admins who need to configure clusters to run Spark with GPU nodes. > * Data scientists who need to build DL applications on Spark. > * Developers who need to integrate DL features on Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034490#comment-17034490 ] Thomas Graves commented on SPARK-24615: --- This is purely a scheduling feature and Spark will assign GPUs to particular tasks. From there its the users responsibility to look at those assignments and do whatever they want with the GPU. For instance you might pass it into tensor flow on Spark or some other ML/AI framework. Do you mean the actual Dataset operations using GPU? Such as doing df.join.groupby.filter? That isn't supported inside of Spark itself, nor is part of this feature. There was another Jira (SPARK-27396) we added support for adding columnar plugin to Spark that would allow someone to write a plugin that does stuff on the GPU. Nvidia is working on such a plugin but it is not publicly available yet. > SPIP: Accelerator-aware task scheduling for Spark > - > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Thomas Graves >Priority: Major > Labels: Hydrogen, SPIP > Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, > SPIP_ Accelerator-aware scheduling.pdf > > > (The JIRA received a major update on 2019/02/28. Some comments were based on > an earlier version. Please ignore them. New comments start at > [#comment-16778026].) > h2. Background and Motivation > GPUs and other accelerators have been widely used for accelerating special > workloads, e.g., deep learning and signal processing. While users from the AI > community use GPUs heavily, they often need Apache Spark to load and process > large datasets and to handle complex data scenarios like streaming. YARN and > Kubernetes already support GPUs in their recent releases. Although Spark > supports those two cluster managers, Spark itself is not aware of GPUs > exposed by them and hence Spark cannot properly request GPUs and schedule > them for users. This leaves a critical gap to unify big data and AI workloads > and make life simpler for end users. > To make Spark be aware of GPUs, we shall make two major changes at high level: > * At cluster manager level, we update or upgrade cluster managers to include > GPU support. Then we expose user interfaces for Spark to request GPUs from > them. > * Within Spark, we update its scheduler to understand available GPUs > allocated to executors, user task requests, and assign GPUs to tasks properly. > Based on the work done in YARN and Kubernetes to support GPUs and some > offline prototypes, we could have necessary features implemented in the next > major release of Spark. You can find a detailed scoping doc here, where we > listed user stories and their priorities. > h2. Goals > * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes. > * No regression on scheduler performance for normal jobs. > h2. Non-goals > * Fine-grained scheduling within one GPU card. > ** We treat one GPU card and its memory together as a non-divisible unit. > * Support TPU. > * Support Mesos. > * Support Windows. > h2. Target Personas > * Admins who need to configure clusters to run Spark with GPU nodes. > * Data scientists who need to build DL applications on Spark. > * Developers who need to integrate DL features on Spark. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
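To make the task-level assignment concrete, a minimal sketch of a task reading the GPU addresses the scheduler gave it, assuming the cluster was configured with the GPU resource settings (e.g. spark.executor.resource.gpu.amount and a discovery script) so that a "gpu" entry appears in the task's resources:

{code:scala}
import org.apache.spark.TaskContext

spark.sparkContext.parallelize(1 to 8, numSlices = 4).foreachPartition { _ =>
  val tc = TaskContext.get()
  val gpus = tc.resources().get("gpu").map(_.addresses.mkString(",")).getOrElse("none")
  // The application (or the ML/AI framework it calls into) decides what to do with these.
  println(s"partition ${tc.partitionId()} was assigned GPU addresses: $gpus")
}
{code}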
[jira] [Commented] (SPARK-27545) Update the Documentation for CACHE TABLE and UNCACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034479#comment-17034479 ] Rakesh Raushan commented on SPARK-27545: Please assign this to me. Thanks > Update the Documentation for CACHE TABLE and UNCACHE TABLE > -- > > Key: SPARK-27545 > URL: https://issues.apache.org/jira/browse/SPARK-27545 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: hantiantian >Assignee: hantiantian >Priority: Major > Fix For: 3.0.0 > > > spark-sql> cache table v1 as select * from a; > spark-sql> uncache table v1; > spark-sql> cache table v1 as select * from a; > 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: > 0: get_table : db=apachespark tbl=a > 2019-04-23 14:50:09,038 INFO > org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root > ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a > Error in query: Temporary view 'v1' already exists; > we should document it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30791) Dataframe add sameResult and sementicHash method
[ https://issues.apache.org/jira/browse/SPARK-30791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-30791: --- Description: Sometimes, we want to check whether two dataframes are the same. There is already an internal API like: {code:java} df1.queryExecution.logical.sameResult(...) {code} We can make a public API for this: Like: {code:java} df1.sameResult(df2) // return true if dataframe will return the same result df1.semanticHash // return a semantic hashcode, if the two dataframes will return the same results, their semantic hashcodes should be the same.{code} CC [~cloud_fan] [~mengxr] [~liangz] was: Sometimes, we want to check whether two dataframe is the same. There is already an internal API like: {code:java} df1.queryExecution.logical.sameResult(...) {code} We can make a public API for this: Like: {code:java} df1.sameResult(df2) // return true if dataframe will return the same result df1.semanticHash // return a semantic hashcode, if the two dataframe will return the same result, their semantic hashcode should be the same.{code} CC [~cloud_fan] [~mengxr] [~liangz] > Dataframe add sameResult and sementicHash method > > > Key: SPARK-30791 > URL: https://issues.apache.org/jira/browse/SPARK-30791 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Liang Zhang >Priority: Major > > Sometimes, we want to check whether two dataframes are the same. > There is already an internal API like: > {code:java} > df1.queryExecution.logical.sameResult(...) {code} > We can make a public API for this: > Like: > {code:java} > df1.sameResult(df2) // return true if dataframe will return the same result > df1.semanticHash // return a semantic hashcode, if the two dataframes will > return the same results, their semantic hashcodes should be the same.{code} > CC [~cloud_fan] [~mengxr] [~liangz] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
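The internal methods mentioned in the description already exist on the logical plan, so the proposal is mostly about exposing them on Dataset. A sketch of what the check looks like today, assuming a spark-shell session:

{code:scala}
import org.apache.spark.sql.functions.col

val df1 = spark.range(100).filter(col("id") > 3)
val df2 = spark.range(100).filter(col("id") > 3)

// True when the two plans are guaranteed to return the same result.
val same = df1.queryExecution.logical.sameResult(df2.queryExecution.logical)

// Plans that return the same result should produce the same semantic hash.
val h1 = df1.queryExecution.logical.semanticHash()
val h2 = df2.queryExecution.logical.semanticHash()
println(s"sameResult=$same, hashesEqual=${h1 == h2}")
{code}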
[jira] [Commented] (SPARK-30791) Dataframe add sameResult and sementicHash method
[ https://issues.apache.org/jira/browse/SPARK-30791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034475#comment-17034475 ] Weichen Xu commented on SPARK-30791: [~liangz] will work on this. :) > Dataframe add sameResult and sementicHash method > > > Key: SPARK-30791 > URL: https://issues.apache.org/jira/browse/SPARK-30791 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Liang Zhang >Priority: Major > > Sometimes, we want to check whether two dataframe is the same. > There is already an internal API like: > {code:java} > df1.queryExecution.logical.sameResult(...) {code} > We can make a public API for this: > Like: > {code:java} > df1.sameResult(df2) // return true if dataframe will return the same result > df1.semanticHash // return a semantic hashcode, if the two dataframe will > return the same result, their semantic hashcode should be the same.{code} > CC [~cloud_fan] [~mengxr] [~liangz] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30791) Dataframe add sameResult and sementicHash method
[ https://issues.apache.org/jira/browse/SPARK-30791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-30791: -- Assignee: Liang Zhang > Dataframe add sameResult and sementicHash method > > > Key: SPARK-30791 > URL: https://issues.apache.org/jira/browse/SPARK-30791 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Liang Zhang >Priority: Major > > Sometimes, we want to check whether two dataframe is the same. > There is already an internal API like: > {code:java} > df1.queryExecution.logical.sameResult(...) {code} > We can make a public API for this: > Like: > {code:java} > df1.sameResult(df2) // return true if dataframe will return the same result > df1.semanticHash // return a semantic hashcode, if the two dataframe will > return the same result, their semantic hashcode should be the same.{code} > CC [~cloud_fan] [~mengxr] [~liangz] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30791) Dataframe add sameResult and sementicHash method
Weichen Xu created SPARK-30791: -- Summary: Dataframe add sameResult and sementicHash method Key: SPARK-30791 URL: https://issues.apache.org/jira/browse/SPARK-30791 Project: Spark Issue Type: New Feature Components: ML, SQL Affects Versions: 3.0.0 Reporter: Weichen Xu Sometimes, we want to check whether two dataframe is the same. There is already an internal API like: {code:java} df1.queryExecution.logical.sameResult(...) {code} We can make a public API for this: Like: {code:java} df1.sameResult(df2) // return true if dataframe will return the same result df1.semanticHash // return a semantic hashcode, if the two dataframe will return the same result, their semantic hashcode should be the same.{code} CC [~cloud_fan] [~mengxr] [~liangz] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30790) The datatype of map() should be map<null, null>
[ https://issues.apache.org/jira/browse/SPARK-30790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034461#comment-17034461 ] Rakesh Raushan commented on SPARK-30790: Should I expose a legacy configuration for mapType as well? [~hyukjin.kwon] > The datatype of map() should be map<null, null> > -- > > Key: SPARK-30790 > URL: https://issues.apache.org/jira/browse/SPARK-30790 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Minor > > Currently, > spark.sql("select map()") gives {}. > To be consistent with the changes made in SPARK-29462, it should return > map<null, null>. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30790) The datatype of map() should be map<null, null>
Rakesh Raushan created SPARK-30790: -- Summary: The datatype of map() should be map<null, null> Key: SPARK-30790 URL: https://issues.apache.org/jira/browse/SPARK-30790 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Rakesh Raushan Currently, spark.sql("select map()") gives {}. To be consistent with the changes made in SPARK-29462, it should return map<null, null>. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
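A quick way to see the current behaviour, assuming a spark-shell session (the proposed output is what this ticket asks for, mirroring what SPARK-29462 did for array()):

{code:scala}
spark.sql("SELECT map()").printSchema()
// Today the empty map is typed as map<string,string>; the proposal is to report
// map<null,null> instead, consistent with SPARK-29462 making array() return array<null>.
{code}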
[jira] [Updated] (SPARK-27545) Update the Documentation for CACHE TABLE and UNCACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-27545: Summary: Update the Documentation for CACHE TABLE and UNCACHE TABLE (was: Uncache table needs to delete the temporary view created when the cache table is executed.) > Update the Documentation for CACHE TABLE and UNCACHE TABLE > -- > > Key: SPARK-27545 > URL: https://issues.apache.org/jira/browse/SPARK-27545 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: hantiantian >Assignee: hantiantian >Priority: Major > Fix For: 3.0.0 > > > spark-sql> cache table v1 as select * from a; > spark-sql> uncache table v1; > spark-sql> cache table v1 as select * from a; > 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: > 0: get_table : db=apachespark tbl=a > 2019-04-23 14:50:09,038 INFO > org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root > ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a > Error in query: Temporary view 'v1' already exists; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27545) Update the Documentation for CACHE TABLE and UNCACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-27545: Issue Type: Documentation (was: Bug) > Update the Documentation for CACHE TABLE and UNCACHE TABLE > -- > > Key: SPARK-27545 > URL: https://issues.apache.org/jira/browse/SPARK-27545 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: hantiantian >Assignee: hantiantian >Priority: Major > Fix For: 3.0.0 > > > spark-sql> cache table v1 as select * from a; > spark-sql> uncache table v1; > spark-sql> cache table v1 as select * from a; > 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: > 0: get_table : db=apachespark tbl=a > 2019-04-23 14:50:09,038 INFO > org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root > ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a > Error in query: Temporary view 'v1' already exists; > we should document it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27545) Update the Documentation for CACHE TABLE and UNCACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-27545: Description: spark-sql> cache table v1 as select * from a; spark-sql> uncache table v1; spark-sql> cache table v1 as select * from a; 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 0: get_table : db=apachespark tbl=a 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a Error in query: Temporary view 'v1' already exists; we should document it. was: spark-sql> cache table v1 as select * from a; spark-sql> uncache table v1; spark-sql> cache table v1 as select * from a; 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 0: get_table : db=apachespark tbl=a 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a Error in query: Temporary view 'v1' already exists; > Update the Documentation for CACHE TABLE and UNCACHE TABLE > -- > > Key: SPARK-27545 > URL: https://issues.apache.org/jira/browse/SPARK-27545 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: hantiantian >Assignee: hantiantian >Priority: Major > Fix For: 3.0.0 > > > spark-sql> cache table v1 as select * from a; > spark-sql> uncache table v1; > spark-sql> cache table v1 as select * from a; > 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: > 0: get_table : db=apachespark tbl=a > 2019-04-23 14:50:09,038 INFO > org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root > ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a > Error in query: Temporary view 'v1' already exists; > we should document it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27545) Uncache table needs to delete the temporary view created when the cache table is executed.
[ https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27545. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27090 [https://github.com/apache/spark/pull/27090] > Uncache table needs to delete the temporary view created when the cache table > is executed. > -- > > Key: SPARK-27545 > URL: https://issues.apache.org/jira/browse/SPARK-27545 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: hantiantian >Assignee: hantiantian >Priority: Major > Fix For: 3.0.0 > > > spark-sql> cache table v1 as select * from a; > spark-sql> uncache table v1; > spark-sql> cache table v1 as select * from a; > 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: > 0: get_table : db=apachespark tbl=a > 2019-04-23 14:50:09,038 INFO > org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root > ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a > Error in query: Temporary view 'v1' already exists; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27545) Uncache table needs to delete the temporary view created when the cache table is executed.
[ https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27545: --- Assignee: hantiantian > Uncache table needs to delete the temporary view created when the cache table > is executed. > -- > > Key: SPARK-27545 > URL: https://issues.apache.org/jira/browse/SPARK-27545 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: hantiantian >Assignee: hantiantian >Priority: Major > > spark-sql> cache table v1 as select * from a; > spark-sql> uncache table v1; > spark-sql> cache table v1 as select * from a; > 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: > 0: get_table : db=apachespark tbl=a > 2019-04-23 14:50:09,038 INFO > org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root > ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a > Error in query: Temporary view 'v1' already exists; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30326) Raise exception if analyzer exceeds max iterations
[ https://issues.apache.org/jira/browse/SPARK-30326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30326: --- Assignee: Xin Wu > Raise exception if analyzer exceeds max iterations > - > > Key: SPARK-30326 > URL: https://issues.apache.org/jira/browse/SPARK-30326 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xin Wu >Assignee: Xin Wu >Priority: Major > Fix For: 3.0.0 > > > Currently, both the analyzer and the optimizer just log a warning message if rule > execution exceeds the max iterations. They should behave differently: the analyzer > should raise an exception to indicate that the plan is still not fixed after the max > iterations, while the optimizer should just log a warning and keep the current plan. This > is more feasible after SPARK-30138 was introduced. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
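To illustrate the intended difference between the two behaviors, here is a minimal sketch of a generic fixed-point runner; this is not Spark's internal RuleExecutor, just an illustration of "raise on a still-changing plan vs. warn and keep the current plan":
{code:scala}
// Illustrative only: a generic fixed-point runner that fails hard (analyzer-style)
// or merely warns (optimizer-style) once maxIterations is reached.
def runToFixedPoint[T](plan: T, maxIterations: Int, failOnExceed: Boolean)(rule: T => T): T = {
  var current = plan
  var i = 0
  while (i < maxIterations) {
    val next = rule(current)
    if (next == current) return current // fixed point reached, the plan is stable
    current = next
    i += 1
  }
  if (failOnExceed) {
    // Analyzer: a plan that is still changing cannot be trusted, so raise.
    throw new IllegalStateException(s"Plan did not stabilize after $maxIterations iterations")
  } else {
    // Optimizer: the current plan is still valid, so keep it and only warn.
    println(s"WARN: max iterations ($maxIterations) reached; keeping the current plan")
    current
  }
}
{code}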
[jira] [Created] (SPARK-30789) Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE
jiaan.geng created SPARK-30789: -- Summary: Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE Key: SPARK-30789 URL: https://issues.apache.org/jira/browse/SPARK-30789 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: jiaan.geng All of LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE support IGNORE NULLS | RESPECT NULLS. For example: {code:java} LEAD (value_expr [, offset ]) [ IGNORE NULLS | RESPECT NULLS ] OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ){code} {code:java} LAG (value_expr [, offset ]) [ IGNORE NULLS | RESPECT NULLS ] OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ){code} {code:java} NTH_VALUE (expr, offset) [ IGNORE NULLS | RESPECT NULLS ] OVER ( [ PARTITION BY window_partition ] [ ORDER BY window_ordering frame_clause ] ){code} *Oracle:* [https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0] *Redshift* [https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html] *Presto* [https://prestodb.io/docs/current/functions/window.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30789) Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE
[ https://issues.apache.org/jira/browse/SPARK-30789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034347#comment-17034347 ] jiaan.geng commented on SPARK-30789: I will work on this. > Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE > - > > Key: SPARK-30789 > URL: https://issues.apache.org/jira/browse/SPARK-30789 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > All of LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE support IGNORE NULLS | > RESPECT NULLS. For example: > {code:java} > LEAD (value_expr [, offset ]) > [ IGNORE NULLS | RESPECT NULLS ] > OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ){code} > > {code:java} > LAG (value_expr [, offset ]) > [ IGNORE NULLS | RESPECT NULLS ] > OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ){code} > > {code:java} > NTH_VALUE (expr, offset) > [ IGNORE NULLS | RESPECT NULLS ] > OVER > ( [ PARTITION BY window_partition ] > [ ORDER BY window_ordering > frame_clause ] ){code} > > *Oracle:* > [https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0] > *Redshift* > [https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html] > *Presto* > [https://prestodb.io/docs/current/functions/window.html] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
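To make the requested syntax concrete, a hedged usage sketch of what a query could look like once (IGNORE | RESPECT) NULLS is supported; the {{sales(id, ts, amount)}} table is made up for illustration, and the exact accepted form depends on how this sub-task is implemented:
{code:scala}
// Hypothetical usage after this sub-task lands; `sales(id, ts, amount)` is an
// illustrative table, not part of this ticket.
spark.sql("""
  SELECT id,
         LAG(amount, 1)  IGNORE NULLS  OVER (PARTITION BY id ORDER BY ts) AS prev_non_null_amount,
         LEAD(amount, 1) RESPECT NULLS OVER (PARTITION BY id ORDER BY ts) AS next_amount
  FROM sales
""").show()
{code}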
[jira] [Updated] (SPARK-30786) Block replication is not retried on other BlockManagers when it fails on 1 of the peers
[ https://issues.apache.org/jira/browse/SPARK-30786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakhar Jain updated SPARK-30786: - Component/s: Spark Core > Block replication is not retried on other BlockManagers when it fails on 1 of > the peers > --- > > Key: SPARK-30786 > URL: https://issues.apache.org/jira/browse/SPARK-30786 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Prakhar Jain >Priority: Major > > When we cache an RDD with replication > 1, Firstly the RDD block is cached > locally on one of the BlockManager and then it is replicated to > (replication-1) number of BlockManagers. While replicating a block, if > replication fails on one of the peers, it is supposed to retry the > replication on some other peer (based on > "spark.storage.maxReplicationFailures" config). But currently this doesn't > happen because of some issue. > Logs of 1 of the executor which is trying to replicate: > {noformat} > 20/02/10 09:01:47 INFO Executor: Starting executor ID 1 on host > wn11-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net > . > . > . > 20/02/10 09:06:45 INFO Executor: Running task 244.0 in stage 3.0 (TID 550) > 20/02/10 09:06:45 DEBUG BlockManager: Getting local block rdd_13_244 > 20/02/10 09:06:45 DEBUG BlockManager: Block rdd_13_244 was not found > 20/02/10 09:06:45 DEBUG BlockManager: Getting remote block rdd_13_244 > 20/02/10 09:06:45 DEBUG BlockManager: Block rdd_13_244 not found > 20/02/10 09:06:46 INFO MemoryStore: Block rdd_13_244 stored as values in > memory (estimated size 33.3 MB, free 44.2 MB) > 20/02/10 09:06:46 DEBUG BlockManager: Told master about block rdd_13_244 > 20/02/10 09:06:46 DEBUG BlockManager: Put block rdd_13_244 locally took 947 > ms > 20/02/10 09:06:46 DEBUG BlockManager: Level for block rdd_13_244 is > StorageLevel(memory, deserialized, 3 replicas) > 20/02/10 09:06:46 TRACE BlockManager: Trying to replicate rdd_13_244 of > 34908552 bytes to BlockManagerId(2, > wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None) > 20/02/10 09:06:47 TRACE BlockManager: Replicated rdd_13_244 of 34908552 bytes > to BlockManagerId(2, > wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None) > in 205.849858 ms > 20/02/10 09:06:47 TRACE BlockManager: Trying to replicate rdd_13_244 of > 34908552 bytes to BlockManagerId(5, > wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None) > 20/02/10 09:06:47 TRACE BlockManager: Replicated rdd_13_244 of 34908552 bytes > to BlockManagerId(5, > wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None) > in 180.501504 ms > 20/02/10 09:06:47 DEBUG BlockManager: Replicating rdd_13_244 of 34908552 > bytes to 2 peer(s) took 387.381168 ms > 20/02/10 09:06:47 DEBUG BlockManager: block rdd_13_244 replicated to > BlockManagerId(5, > wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None), > BlockManagerId(2, > wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None) > 20/02/10 09:06:47 DEBUG BlockManager: Put block rdd_13_244 remotely took 423 > ms > 20/02/10 09:06:47 DEBUG BlockManager: Putting block rdd_13_244 with > replication took 1371 ms > 20/02/10 09:06:47 DEBUG BlockManager: Getting local block rdd_13_244 > 20/02/10 09:06:47 DEBUG BlockManager: Level for block rdd_13_244 is > StorageLevel(memory, deserialized, 3 replicas) > 20/02/10 09:06:47 INFO Executor: Finished task 244.0 in stage 3.0 (TID 550). 
> 2253 bytes result sent to driver > {noformat} > Logs of other executor where the block is being replicated to: > {noformat} > 20/02/10 09:01:47 INFO Executor: Starting executor ID 5 on host > wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net > . > . > . > 20/02/10 09:06:47 INFO MemoryStore: Will not store rdd_13_244 > 20/02/10 09:06:47 WARN MemoryStore: Not enough space to cache rdd_13_244 in > memory! (computed 4.2 MB so far) > 20/02/10 09:06:47 INFO MemoryStore: Memory use = 4.9 GB (blocks) + 7.3 MB > (scratch space shared across 2 tasks(s)) = 4.9 GB. Storage limit = 4.9 GB. > 20/02/10 09:06:47 DEBUG BlockManager: Put block rdd_13_244 locally took 12 ms > 20/02/10 09:06:47 WARN BlockManager: Block rdd_13_244 could not be removed as > it was not found on disk or in memory > 20/02/10 09:06:47 WARN BlockManager: Putting block rdd_13_244 failed > 20/02/10 09:06:47 DEBUG BlockManager: Putting block rdd_13_244 without > replication took 13 ms > {noformat} > Note here that the block replication failed in Executor-5 with log line "Not > enough space to cache rdd_13_244 in memory!". But Executor-1 shows that block > is successfully replicated to executor-5 - "Replicated rdd_13_244 of 34908552 > bytes to BlockManagerId(5, ...)" - so the failed replication is never retried > on another BlockManager.
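For reference, a minimal sketch of the caching pattern behind these logs, assuming the 3-replica in-memory level shown as "StorageLevel(memory, deserialized, 3 replicas)"; the retry budget for failed replication is governed by the {{spark.storage.maxReplicationFailures}} config mentioned in the report:
{code:scala}
import org.apache.spark.storage.StorageLevel

// Mirror "StorageLevel(memory, deserialized, 3 replicas)" from the logs above.
val memoryThreeReplicas = StorageLevel(
  useDisk = false, useMemory = true, useOffHeap = false,
  deserialized = true, replication = 3)

val rdd = spark.sparkContext.parallelize(1 to 1000000)
rdd.persist(memoryThreeReplicas)
// Triggers caching; per this report, if one peer fails to store its replica,
// the replication is not retried on another BlockManager.
rdd.count()
{code}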
[jira] [Updated] (SPARK-30787) Add Genetic Algorithm optimizer feature to spark-ml
[ https://issues.apache.org/jira/browse/SPARK-30787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] louischoi updated SPARK-30787: -- Target Version/s: (was: 2.4.5) > Add Genetic Algorithm optimizer feature to spark-ml > --- > > Key: SPARK-30787 > URL: https://issues.apache.org/jira/browse/SPARK-30787 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 2.4.5 >Reporter: louischoi >Priority: Minor > Original Estimate: 672h > Remaining Estimate: 672h > > Hi. > It seems that Spark does not have a Genetic Algorithm optimizer. > I think this algorithm fits well in a distributed system like Spark. > It is aimed at solving problems like the Traveling Salesman Problem, graph > partitioning, network topology optimization, etc. > > Is there some reason that Spark does not include this feature? > > Can I work on this? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30788) Support `SimpleDateFormat` and `FastDateFormat` as legacy date/timestamp formatters
Maxim Gekk created SPARK-30788: -- Summary: Support `SimpleDateFormat` and `FastDateFormat` as legacy date/timestamp formatters Key: SPARK-30788 URL: https://issues.apache.org/jira/browse/SPARK-30788 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk To be absolutely sure that Spark 3.0 is compatible with 2.4 when spark.sql.legacy.timeParser.enabled is set to true, we need to support SimpleDateFormat and FastDateFormat as legacy parsers/formatters in TimestampFormatter. Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings: # DateTimeFormat in the CSV/JSON datasources # SimpleDateFormat - used in the JDBC datasource and in partition parsing. # SimpleDateFormat in strict mode (lenient = false). It is used by the date_format, from_unixtime, unix_timestamp and to_unix_timestamp functions. Spark 3.0 should use the same parsers in those cases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
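As a quick illustration of the 2.4-era strict parsing referred to above, a minimal {{SimpleDateFormat}} sketch with lenient disabled; the pattern and inputs are made up, and the actual wiring into TimestampFormatter is not shown:
{code:scala}
import java.text.SimpleDateFormat
import java.util.TimeZone

// Legacy-style strict parsing: with lenient = false, malformed dates throw
// instead of silently rolling over to the next valid date.
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
fmt.setLenient(false)
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))

val ok = fmt.parse("2019-04-23 14:50:09")   // parses fine
// fmt.parse("2019-04-31 14:50:09")         // throws java.text.ParseException (no April 31)
{code}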
[jira] [Created] (SPARK-30787) Add Genetic Algorithm optimizer feature to spark-ml
louischoi created SPARK-30787: - Summary: Add Genetic Algorithm optimizer feature to spark-ml Key: SPARK-30787 URL: https://issues.apache.org/jira/browse/SPARK-30787 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 2.4.5 Reporter: louischoi Hi. It seems that Spark does not have a Genetic Algorithm optimizer. I think this algorithm fits well in a distributed system like Spark. It is aimed at solving problems like the Traveling Salesman Problem, graph partitioning, network topology optimization, etc. Is there some reason that Spark does not include this feature? Can I work on this? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
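To make the parallelism argument concrete, a toy sketch of one genetic-algorithm generation on Spark; nothing here is an existing spark-ml API, and the candidate encoding and fitness function are invented for illustration:
{code:scala}
import scala.util.Random

// Toy GA step: fitness evaluation is embarrassingly parallel, so it maps
// naturally onto an RDD; selection then keeps the fittest candidates.
case class Candidate(genes: Vector[Int])

def fitness(c: Candidate): Double = c.genes.sum.toDouble  // placeholder objective

val population = Seq.fill(1000)(Candidate(Vector.fill(16)(Random.nextInt(2))))

val fittest = spark.sparkContext
  .parallelize(population)
  .map(c => (fitness(c), c))                                  // score candidates in parallel
  .top(100)(Ordering.by((p: (Double, Candidate)) => p._1))    // keep the 100 best as survivors
{code}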
[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034277#comment-17034277 ] Jorge Machado commented on SPARK-24615: --- [~tgraves] thanks for the input. It would be great to have one or two examples on how to use the GPUs within a dataset. I tried to figure out the api but I did not find any useful docs. Any tip? > SPIP: Accelerator-aware task scheduling for Spark > - > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Thomas Graves >Priority: Major > Labels: Hydrogen, SPIP > Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, > SPIP_ Accelerator-aware scheduling.pdf > > > (The JIRA received a major update on 2019/02/28. Some comments were based on > an earlier version. Please ignore them. New comments start at > [#comment-16778026].) > h2. Background and Motivation > GPUs and other accelerators have been widely used for accelerating special > workloads, e.g., deep learning and signal processing. While users from the AI > community use GPUs heavily, they often need Apache Spark to load and process > large datasets and to handle complex data scenarios like streaming. YARN and > Kubernetes already support GPUs in their recent releases. Although Spark > supports those two cluster managers, Spark itself is not aware of GPUs > exposed by them and hence Spark cannot properly request GPUs and schedule > them for users. This leaves a critical gap to unify big data and AI workloads > and make life simpler for end users. > To make Spark be aware of GPUs, we shall make two major changes at high level: > * At cluster manager level, we update or upgrade cluster managers to include > GPU support. Then we expose user interfaces for Spark to request GPUs from > them. > * Within Spark, we update its scheduler to understand available GPUs > allocated to executors, user task requests, and assign GPUs to tasks properly. > Based on the work done in YARN and Kubernetes to support GPUs and some > offline prototypes, we could have necessary features implemented in the next > major release of Spark. You can find a detailed scoping doc here, where we > listed user stories and their priorities. > h2. Goals > * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes. > * No regression on scheduler performance for normal jobs. > h2. Non-goals > * Fine-grained scheduling within one GPU card. > ** We treat one GPU card and its memory together as a non-divisible unit. > * Support TPU. > * Support Mesos. > * Support Windows. > h2. Target Personas > * Admins who need to configure clusters to run Spark with GPU nodes. > * Data scientists who need to build DL applications on Spark. > * Developers who need to integrate DL features on Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
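Regarding the request for examples: as I understand the Spark 3.0 APIs that came out of this SPIP, a task can read the GPU addresses assigned to it via {{TaskContext.resources()}}, roughly as sketched below; the config values and discovery-script path are illustrative assumptions, and the cluster must already expose GPUs through its cluster manager:
{code:scala}
// Illustrative submit-time settings (values are assumptions):
//   --conf spark.executor.resource.gpu.amount=1
//   --conf spark.task.resource.gpu.amount=1
//   --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/getGpus.sh
import org.apache.spark.TaskContext

val ds = spark.range(0, 1000000)
val tagged = ds.rdd.mapPartitions { iter =>
  // Each task sees only the GPU addresses the scheduler assigned to it.
  val gpuAddresses = TaskContext.get().resources()("gpu").addresses
  // Hand `gpuAddresses` to the DL framework of your choice here.
  iter.map(i => (gpuAddresses.mkString(","), i))
}
tagged.take(5).foreach(println)
{code}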
[jira] [Commented] (SPARK-29474) CLI support for Spark-on-Docker-on-Yarn
[ https://issues.apache.org/jira/browse/SPARK-29474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034244#comment-17034244 ] Abhijeet Singh commented on SPARK-29474: Thanks for this feature suggestion [~adam.antal]. Is docker image flag ({{--docker-image}}) intended to support local/offline docker images (tar files)? > CLI support for Spark-on-Docker-on-Yarn > --- > > Key: SPARK-29474 > URL: https://issues.apache.org/jira/browse/SPARK-29474 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, YARN >Affects Versions: 3.0.0 >Reporter: Adam Antal >Priority: Major > > The Docker-on-Yarn feature is stable for a while now in Hadoop. > One can run Spark on Docker using the Docker-on-Yarn feature by providing > runtime environments to the Spark AM and Executor containers similar to this: > {noformat} > --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker > --conf > spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=repo/image:tag > --conf > spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro" > --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker > --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=repo/image:tag > --conf > spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro" > {noformat} > This is not very user friendly. I suggest to add CLI options to specify: > - whether docker image should be used ({{--docker}}) > - which docker image should be used ({{--docker-image}}) > - what docker mounts should be used ({{--docker-mounts}}) > for the AM and executor containers separately. > Let's discuss! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29462) The data type of "array()" should be array<null>
[ https://issues.apache.org/jira/browse/SPARK-29462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29462. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27521 [https://github.com/apache/spark/pull/27521] > The data type of "array()" should be array<null> > > > Key: SPARK-29462 > URL: https://issues.apache.org/jira/browse/SPARK-29462 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > In the current implementation: > > spark.sql("select array()") > res0: org.apache.spark.sql.DataFrame = [array(): array<string>] > The output type should be array<null> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29462) The data type of "array()" should be array<null>
[ https://issues.apache.org/jira/browse/SPARK-29462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29462: Assignee: Hyukjin Kwon > The data type of "array()" should be array<null> > > > Key: SPARK-29462 > URL: https://issues.apache.org/jira/browse/SPARK-29462 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Hyukjin Kwon >Priority: Minor > > In the current implementation: > > spark.sql("select array()") > res0: org.apache.spark.sql.DataFrame = [array(): array<string>] > The output type should be array<null> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
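For a quick check of the behavior being changed, a small sketch; the printed data type is what I would expect before the fix, so treat it as illustrative:
{code:scala}
// The element type of an empty array() is the subject of this ticket:
// string before the fix, null after it.
val df = spark.sql("SELECT array() AS a")
println(df.schema("a").dataType)   // e.g. ArrayType(StringType,false) before the fix
{code}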