[jira] [Comment Edited] (SPARK-20943) Correct BypassMergeSortShuffleWriter's comment
[ https://issues.apache.org/jira/browse/SPARK-20943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033988#comment-16033988 ] CanBin Zheng edited comment on SPARK-20943 at 6/2/17 7:00 AM: -- Look at these two cases: either Aggregator or Ordering is defined but mapSideCombine is false, and both run with BypassMergeSortShuffleWriter. {code} //Has Aggregator defined @Test def testGroupByKeyUsingBypassMergeSort(): Unit = { val data = List("Hello", "World", "Hello", "One", "Two") val rdd = sc.parallelize(data).map((_, 1)).groupByKey(2) rdd.collect() } //Has Ordering defined @Test def testShuffleWithKeyOrderingUsingBypassMergeSort(): Unit = { val data = List("Hello", "World", "Hello", "One", "Two") val rdd = sc.parallelize(data).map((_, 1)) val ord = implicitly[Ordering[String]] val shuffledRDD = new ShuffledRDD[String, Int, Int](rdd, new HashPartitioner(2)).setKeyOrdering(ord) shuffledRDD.collect() } {code} was (Author: canbinzheng): Look at there two cases. {code} //Has Aggregator defined @Test def testGroupByKeyUsingBypassMergeSort(): Unit = { val data = List("Hello", "World", "Hello", "One", "Two") val rdd = sc.parallelize(data).map((_, 1)).groupByKey(2) rdd.collect() } //Has Ordering defined @Test def testShuffleWithKeyOrderingUsingBypassMergeSort(): Unit = { val data = List("Hello", "World", "Hello", "One", "Two") val rdd = sc.parallelize(data).map((_, 1)) val ord = implicitly[Ordering[String]] val shuffledRDD = new ShuffledRDD[String, Int, Int](rdd, new HashPartitioner(2)).setKeyOrdering(ord) shuffledRDD.collect() } {code} > Correct BypassMergeSortShuffleWriter's comment > -- > > Key: SPARK-20943 > URL: https://issues.apache.org/jira/browse/SPARK-20943 > Project: Spark > Issue Type: Improvement > Components: Documentation, Shuffle >Affects Versions: 2.1.1 >Reporter: CanBin Zheng >Priority: Trivial > Labels: starter > > There are some comments written in BypassMergeSortShuffleWriter.java about > when to select this write path, the three required conditions are described > as follows: > 1. no Ordering is specified, and > 2. no Aggregator is specified, and > 3. the number of partitions is less than > spark.shuffle.sort.bypassMergeThreshold > Obviously, the conditions written are partially wrong and misleading, the > right conditions should be: > 1. map-side combine is false, and > 2. the number of partitions is less than > spark.shuffle.sort.bypassMergeThreshold -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
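For reference, the selection logic lives in SortShuffleWriter.shouldBypassMergeSort; the sketch below is a paraphrase from memory (not copied from the source tree) and matches the corrected conditions: bypass only when there is no map-side combine and the partition count is at or below the threshold.
{code}
// Paraphrase of SortShuffleWriter.shouldBypassMergeSort (from memory, may differ
// slightly from the exact source): the bypass writer is chosen when map-side combine
// is off and the partition count is within spark.shuffle.sort.bypassMergeThreshold.
def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  if (dep.mapSideCombine) {
    // Map-side aggregation needs sorting/merging, so the bypass path is ruled out.
    false
  } else {
    val bypassMergeThreshold = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}
{code}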
[jira] [Commented] (SPARK-20939) Do not duplicate user-defined functions while optimizing logical query plans
[ https://issues.apache.org/jira/browse/SPARK-20939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034247#comment-16034247 ] Takeshi Yamamuro commented on SPARK-20939: -- This is not a bug, so I changed the type to "improvement". I feel this optimization still makes sense in `most` cases, so I think we do not have a strong reason to remove this rule. In future releases, if Spark implements functionality to estimate join cardinality, we could make this rule smarter... > Do not duplicate user-defined functions while optimizing logical query plans > > > Key: SPARK-20939 > URL: https://issues.apache.org/jira/browse/SPARK-20939 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 2.1.0 >Reporter: Lovasoa >Priority: Minor > Labels: logical_plan, optimizer > > Currently, while optimizing a query plan, Spark pushes filters down the query > plan tree, so that > {code:title=LogicalPlan} > Join Inner, (a = b) > +- Filter UDF(a) > +- Relation A > +- Relation B > {code} > becomes > {code:title=Optimized LogicalPlan} > Join Inner, (a = b) > +- Filter UDF(a) > +- Relation A > +- Filter UDF(b) > +- Relation B > {code} > In general, it is a good thing to push down filters as it reduces the number > of records that will go through the join. > However, in the case where the filter is a user-defined function (UDF), we > cannot know if the cost of executing the function twice will be higher than > the eventual cost of joining more elements or not. > So I think that the optimizer shouldn't move the user-defined function in the > query plan tree. The user will still be able to duplicate the function if they > want to. > See this question on stackoverflow: > https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
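A minimal way to observe the behaviour described in the ticket (my own sketch, not from the ticket; the UDF and column names are illustrative) is to join two ranges, filter one side with a slow UDF, and inspect the optimized plan for a second UDF filter on the other side:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("udf-pushdown").getOrCreate()
import spark.implicits._

// Stand-in for an expensive user-defined predicate.
val expensive = udf { (x: Long) => Thread.sleep(10); x % 2 == 0 }

val a = spark.range(100).toDF("a")
val b = spark.range(100).toDF("b")

// The optimizer pushes the UDF filter below the join and, through constraint
// propagation, may also infer the same filter on the other join side, so the
// UDF can end up being evaluated against both relations.
a.join(b, $"a" === $"b").filter(expensive($"a")).explain(true)
{code}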
[jira] [Commented] (SPARK-20675) Support Index to skip when retrieval disk structure in CoGroupedRDD
[ https://issues.apache.org/jira/browse/SPARK-20675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034256#comment-16034256 ] lyc commented on SPARK-20675: - What do you mean by `StreamBuffer`? In commit `6d05c1` (as of Jun 1, 2017), there is only `ExternalAppendOnlyMap`, the computation seems to compute only the specific partition, and there does not appear to be any redundancy > Support Index to skip when retrieval disk structure in CoGroupedRDD > > > Key: SPARK-20675 > URL: https://issues.apache.org/jira/browse/SPARK-20675 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: darion yaphet > > CoGroupedRDD's compute() will retrieve each StreamBuffer (a disk structure that > maintains key-value pairs sorted by key) and merge the same keys into one. > So I think adding a sequence index file, or appending an index section at the head of > the temporary shuffle file to seek to the appropriate position, could skip a lot > of unnecessary scanning. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard
Takeshi Yamamuro created SPARK-20964: Summary: Make some keywords reserved along with the ANSI/SQL standard Key: SPARK-20964 URL: https://issues.apache.org/jira/browse/SPARK-20964 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.1 Reporter: Takeshi Yamamuro Priority: Minor The current Spark has many non-reserved words that are essentially reserved in the ANSI/SQL standard (http://developer.mimer.se/validator/sql-reserved-words.tml). https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709 This is because there are many datasources (for instance twitter4j) that unfortunately use reserved keywords for column names (See [~hvanhovell]'s comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). We might fix this issue in future major releases. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20662) Block jobs that have greater than a configured number of tasks
[ https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034282#comment-16034282 ] lyc edited comment on SPARK-20662 at 6/2/17 7:36 AM: - Do you mean `mapreduce.job.running.map.limit`? The conf means `The maximum number of simultaneous map tasks per job. There is no limit if this value is 0 or negative.` This means task concurrency. The behavior seems to be that scheduling stops when the job has that many running tasks and resumes when some tasks finish. This seems doable in `DAGScheduler`; I'd like to give it a try if the idea is accepted. cc [~vanzin] was (Author: lyc): Do you mean `mapreduce.job.running.map.limit`? The conf means `The maximum number of simultaneous map tasks per job. There is no limit if this value is 0 or negative.` This means task concurrency. And the behavior seems to be that stops scheduling tasks when job has that many running tasks, and starts scheduling when some tasks are done. This seems can be done in `DAGScheduler`, I'd like give it a try if the idea is accepted. cc @Marcelo Vanzin > Block jobs that have greater than a configured number of tasks > -- > > Key: SPARK-20662 > URL: https://issues.apache.org/jira/browse/SPARK-20662 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 >Reporter: Xuefu Zhang > > In a shared cluster, it's desirable for an admin to block large Spark jobs. > While there might not be a single metrics defining the size of a job, the > number of tasks is usually a good indicator. Thus, it would be useful for > Spark scheduler to block a job whose number of tasks reaches a configured > limit. By default, the limit could be just infinite, to retain the existing > behavior. > MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be > configured, which blocks a MR job at job submission time. > The proposed configuration is spark.job.max.tasks with a default value -1 > (infinite). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks
[ https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034282#comment-16034282 ] lyc commented on SPARK-20662: - Do you mean `mapreduce.job.running.map.limit`? The conf means `The maximum number of simultaneous map tasks per job. There is no limit if this value is 0 or negative.` This means task concurrency. The behavior seems to be that scheduling stops when the job has that many running tasks and resumes when some tasks finish. This seems doable in `DAGScheduler`; I'd like to give it a try if the idea is accepted. cc @Marcelo Vanzin > Block jobs that have greater than a configured number of tasks > -- > > Key: SPARK-20662 > URL: https://issues.apache.org/jira/browse/SPARK-20662 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 >Reporter: Xuefu Zhang > > In a shared cluster, it's desirable for an admin to block large Spark jobs. > While there might not be a single metrics defining the size of a job, the > number of tasks is usually a good indicator. Thus, it would be useful for > Spark scheduler to block a job whose number of tasks reaches a configured > limit. By default, the limit could be just infinite, to retain the existing > behavior. > MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be > configured, which blocks a MR job at job submission time. > The proposed configuration is spark.job.max.tasks with a default value -1 > (infinite). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
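For concreteness, a purely hypothetical sketch of the submission-time variant proposed in the ticket; the config name and the check are assumptions, and nothing like this exists in Spark today:
{code}
// Hypothetical check in DAGScheduler.submitJob: fail fast if a job would create
// more tasks than an admin-configured limit (spark.job.max.tasks, -1 = unlimited).
val maxTasks = sc.conf.getInt("spark.job.max.tasks", -1)
if (maxTasks > 0 && partitions.length > maxTasks) {
  throw new IllegalArgumentException(
    s"Job requires ${partitions.length} tasks, exceeding spark.job.max.tasks=$maxTasks")
}
{code}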
[jira] [Created] (SPARK-20965) Support PREPARE and EXECUTE statements
Takeshi Yamamuro created SPARK-20965: Summary: Support PREPARE and EXECUTE statements Key: SPARK-20965 URL: https://issues.apache.org/jira/browse/SPARK-20965 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.1 Reporter: Takeshi Yamamuro Priority: Minor Parameterized queries might help some users, so we might support PREPARE and EXECUTE statements by referring to the ANSI/SQL standard (e.g., it is useful for users who frequently run the same queries) {code} PREPARE sqlstmt (int) AS SELECT * FROM t WHERE id = $1; EXECUTE sqlstmt(1); {code} One implementation reference: https://www.postgresql.org/docs/current/static/sql-prepare.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1
[ https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-20958: --- Description: We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 and avro 1.7.7 used by spark-core 2.2.0-rc2. Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons mentioned in [PR #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. Therefore, we don't really have many choices here and have to roll back parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict. was: We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 and avro 1.7.7 used by spark-core 2.2.0-rc2. , Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons mentioned in [PR #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. Therefore, we don't really have many choices here and have to roll back parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict. > Roll back parquet-mr 1.8.2 to parquet-1.8.1 > --- > > Key: SPARK-20958 > URL: https://issues.apache.org/jira/browse/SPARK-20958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on > avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 > and avro 1.7.7 used by spark-core 2.2.0-rc2. > Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro > (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the > reasons mentioned in [PR > #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. > Therefore, we don't really have many choices here and have to roll back > parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20943) Correct BypassMergeSortShuffleWriter's comment
[ https://issues.apache.org/jira/browse/SPARK-20943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034296#comment-16034296 ] Saisai Shao commented on SPARK-20943: - I think the original purpose of the comment is to say that {{Aggregator}} and {{Ordering}} are not used in the map-side shuffle write; the {{Aggregator}} and {{Ordering}} set on a ShuffleRDD will only be used on the shuffle reader side. > Correct BypassMergeSortShuffleWriter's comment > -- > > Key: SPARK-20943 > URL: https://issues.apache.org/jira/browse/SPARK-20943 > Project: Spark > Issue Type: Improvement > Components: Documentation, Shuffle >Affects Versions: 2.1.1 >Reporter: CanBin Zheng >Priority: Trivial > Labels: starter > > There are some comments written in BypassMergeSortShuffleWriter.java about > when to select this write path, the three required conditions are described > as follows: > 1. no Ordering is specified, and > 2. no Aggregator is specified, and > 3. the number of partitions is less than > spark.shuffle.sort.bypassMergeThreshold > Obviously, the conditions written are partially wrong and misleading, the > right conditions should be: > 1. map-side combine is false, and > 2. the number of partitions is less than > spark.shuffle.sort.bypassMergeThreshold -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1
[ https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034310#comment-16034310 ] Cheng Lian commented on SPARK-20958: [~rdblue] I think the root cause here is that we cherry-picked parquet-mr [PR #318|https://github.com/apache/parquet-mr/pull/318] into parquet-mr 1.8.2, which introduced this avro upgrade. I tried to roll parquet-mr back to 1.8.1, but it doesn't work well because this brings back [PARQUET-389|https://issues.apache.org/jira/browse/PARQUET-389] and breaks some test cases involving schema evolution. Would it be possible to have a parquet-mr 1.8.3 or 1.8.2.1 release that has [PR #318|https://github.com/apache/parquet-mr/pull/318] reverted from 1.8.2? I think cherry-picking that PR is also problematic for parquet-mr because it introduces a backward-incompatible dependency change in a maintenance release. > Roll back parquet-mr 1.8.2 to parquet-1.8.1 > --- > > Key: SPARK-20958 > URL: https://issues.apache.org/jira/browse/SPARK-20958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on > avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 > and avro 1.7.7 used by spark-core 2.2.0-rc2. > Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro > (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the > reasons mentioned in [PR > #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. > Therefore, we don't really have many choices here and have to roll back > parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20966) Table data is not sorted by startTime time desc, time is not formatted and redundant code in JDBC/ODBC Server page.
guoxiaolongzte created SPARK-20966: -- Summary: Table data is not sorted by startTime time desc, time is not formatted and redundant code in JDBC/ODBC Server page. Key: SPARK-20966 URL: https://issues.apache.org/jira/browse/SPARK-20966 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.1.1 Reporter: guoxiaolongzte Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks
[ https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034334#comment-16034334 ] Sean Owen commented on SPARK-20662: --- Isn't this better handled by the resource manager? for example, YARN lets you cap these things in a bunch of ways already, and the resource manager is a better place to manage, well, resources. > Block jobs that have greater than a configured number of tasks > -- > > Key: SPARK-20662 > URL: https://issues.apache.org/jira/browse/SPARK-20662 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 >Reporter: Xuefu Zhang > > In a shared cluster, it's desirable for an admin to block large Spark jobs. > While there might not be a single metrics defining the size of a job, the > number of tasks is usually a good indicator. Thus, it would be useful for > Spark scheduler to block a job whose number of tasks reaches a configured > limit. By default, the limit could be just infinite, to retain the existing > behavior. > MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be > configured, which blocks a MR job at job submission time. > The proposed configuration is spark.job.max.tasks with a default value -1 > (infinite). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
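For illustration, application-level caps of the kind referred to above can already be expressed today; the snippet below uses real Spark configs (the values are made up), and on YARN the queue capacities in capacity-scheduler.xml provide a similar cap on the cluster side:
{code}
import org.apache.spark.SparkConf

// Cap how much of the cluster a single application may hold, rather than
// capping how many tasks a job may contain.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.maxExecutors", "50") // upper bound with dynamic allocation
  .set("spark.executor.cores", "2")
  .set("spark.cores.max", "100")                     // standalone/Mesos total-core cap
{code}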
[jira] [Commented] (SPARK-20966) Table data is not sorted by startTime time desc, time is not formatted and redundant code in JDBC/ODBC Server page.
[ https://issues.apache.org/jira/browse/SPARK-20966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034345#comment-16034345 ] Apache Spark commented on SPARK-20966: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/18186 > Table data is not sorted by startTime time desc, time is not formatted and > redundant code in JDBC/ODBC Server page. > --- > > Key: SPARK-20966 > URL: https://issues.apache.org/jira/browse/SPARK-20966 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.1 >Reporter: guoxiaolongzte >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20966) Table data is not sorted by startTime time desc, time is not formatted and redundant code in JDBC/ODBC Server page.
[ https://issues.apache.org/jira/browse/SPARK-20966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20966: Assignee: Apache Spark > Table data is not sorted by startTime time desc, time is not formatted and > redundant code in JDBC/ODBC Server page. > --- > > Key: SPARK-20966 > URL: https://issues.apache.org/jira/browse/SPARK-20966 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.1 >Reporter: guoxiaolongzte >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20966) Table data is not sorted by startTime time desc, time is not formatted and redundant code in JDBC/ODBC Server page.
[ https://issues.apache.org/jira/browse/SPARK-20966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20966: Assignee: (was: Apache Spark) > Table data is not sorted by startTime time desc, time is not formatted and > redundant code in JDBC/ODBC Server page. > --- > > Key: SPARK-20966 > URL: https://issues.apache.org/jira/browse/SPARK-20966 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.1 >Reporter: guoxiaolongzte >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20966) Table data is not sorted by startTime time desc, time is not formatted and redundant code in JDBC/ODBC Server page.
[ https://issues.apache.org/jira/browse/SPARK-20966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-20966: --- Description: Table data is not sorted by startTime time desc in JDBC/ODBC Server page. Time is not formatted in JDBC/ODBC Server page. Redundant code in the ThriftServerSessionPage.scala. > Table data is not sorted by startTime time desc, time is not formatted and > redundant code in JDBC/ODBC Server page. > --- > > Key: SPARK-20966 > URL: https://issues.apache.org/jira/browse/SPARK-20966 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.1 >Reporter: guoxiaolongzte >Priority: Minor > > Table data is not sorted by startTime time desc in JDBC/ODBC Server page. > Time is not formatted in JDBC/ODBC Server page. > Redundant code in the ThriftServerSessionPage.scala. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20967) SharedState.externalCatalog is not really lazy
[ https://issues.apache.org/jira/browse/SPARK-20967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20967: Issue Type: Improvement (was: Bug) > SharedState.externalCatalog is not really lazy > -- > > Key: SPARK-20967 > URL: https://issues.apache.org/jira/browse/SPARK-20967 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20967) SharedState.externalCatalog is not really lazy
Wenchen Fan created SPARK-20967: --- Summary: SharedState.externalCatalog is not really lazy Key: SPARK-20967 URL: https://issues.apache.org/jira/browse/SPARK-20967 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20967) SharedState.externalCatalog is not really lazy
[ https://issues.apache.org/jira/browse/SPARK-20967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034372#comment-16034372 ] Apache Spark commented on SPARK-20967: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/18187 > SharedState.externalCatalog is not really lazy > -- > > Key: SPARK-20967 > URL: https://issues.apache.org/jira/browse/SPARK-20967 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20967) SharedState.externalCatalog is not really lazy
[ https://issues.apache.org/jira/browse/SPARK-20967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20967: Assignee: Wenchen Fan (was: Apache Spark) > SharedState.externalCatalog is not really lazy > -- > > Key: SPARK-20967 > URL: https://issues.apache.org/jira/browse/SPARK-20967 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20967) SharedState.externalCatalog is not really lazy
[ https://issues.apache.org/jira/browse/SPARK-20967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20967: Assignee: Apache Spark (was: Wenchen Fan) > SharedState.externalCatalog is not really lazy > -- > > Key: SPARK-20967 > URL: https://issues.apache.org/jira/browse/SPARK-20967 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034427#comment-16034427 ] Robert Kruszewski commented on SPARK-20952: --- You're right that this needs a bit of clarification. There's a bit more subtlety with respect to the actual thread local and its use. First of all, this makes the behaviour the same as on the driver, where the localProperties on SparkContext are inheritable. Secondly, I believe the issue you're describing will not arise since a) executor tasks are uninterruptible and b) the thread pool used to run them is a cached thread pool and not a ForkJoinPool, hence a given task thread will not inherit from another task thread. Let me know if I am missing something here though. > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal; as a result, when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to an InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for an example of code that uses > thread pools inside tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
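A minimal way to observe the current behaviour (my own sketch, assuming a SparkContext named sc; not from the ticket): spawn a thread inside a task and read TaskContext from it. Today it prints null; with an InheritableThreadLocal the child thread would see the parent task's context.
{code}
import java.util.concurrent.{Callable, Executors}
import org.apache.spark.TaskContext

sc.parallelize(1 to 2, 2).foreach { _ =>
  val pool = Executors.newFixedThreadPool(1)
  try {
    val fromChild = pool.submit(new Callable[String] {
      override def call(): String = String.valueOf(TaskContext.get())
    }).get()
    // Printed to the executor's stdout; currently "null" because the context
    // is stored in a plain ThreadLocal that child threads do not inherit.
    println(s"TaskContext seen from child thread: $fromChild")
  } finally {
    pool.shutdown()
  }
}
{code}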
[jira] [Created] (SPARK-20968) Support separator in Tokenizer
darion yaphet created SPARK-20968: - Summary: Support separator in Tokenizer Key: SPARK-20968 URL: https://issues.apache.org/jira/browse/SPARK-20968 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 2.1.1, 2.0.2, 2.0.0 Reporter: darion yaphet Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20959) Add a parameter to UnsafeExternalSorter to configure filebuffersize
[ https://issues.apache.org/jira/browse/SPARK-20959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034466#comment-16034466 ] caoxuewen commented on SPARK-20959: --- Thanks for modifying the Priority. > Add a parameter to UnsafeExternalSorter to configure filebuffersize > --- > > Key: SPARK-20959 > URL: https://issues.apache.org/jira/browse/SPARK-20959 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Trivial > > Use spark.shuffle.file.buffer to configure fileBufferSizeBytes in > UnsafeExternalSorter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-20799: --- Summary: Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL (was: Unable to infer schema for ORC on S3N when secrets are in the URL) > Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL > - > > Key: SPARK-20799 > URL: https://issues.apache.org/jira/browse/SPARK-20799 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Jork Zijlstra >Priority: Minor > > We are getting the following exception: > {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. > It must be specified manually.{code} > Combining the following factors will cause it: > - Use S3 > - Use format ORC > - Don't apply partitioning on the data > - Embed AWS credentials in the path > The problem is in the PartitioningAwareFileIndex def allFiles() > {code} > leafDirToChildrenFiles.get(qualifiedPath) > .orElse { leafFiles.get(qualifiedPath).map(Array(_)) } > .getOrElse(Array.empty) > {code} > leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the > qualifiedPath contains the path WITH credentials. > So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no > data is read and the schema cannot be defined. > Spark does output the S3xLoginHelper:90 - The Filesystem URI contains login > details. This is insecure and may be unsupported in future., but this should > not mean that it shouldn't work anymore. > Workaround: > Move the AWS credentials from the path to the SparkSession > {code} > SparkSession.builder > .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId}) > .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey}) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-20799: --- Environment: Hadoop 2.8.0 binaries > Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL > - > > Key: SPARK-20799 > URL: https://issues.apache.org/jira/browse/SPARK-20799 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Hadoop 2.8.0 binaries >Reporter: Jork Zijlstra >Priority: Minor > > We are getting the following exception: > {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. > It must be specified manually.{code} > Combining the following factors will cause it: > - Use S3 > - Use format ORC > - Don't apply partitioning on the data > - Embed AWS credentials in the path > The problem is in the PartitioningAwareFileIndex def allFiles() > {code} > leafDirToChildrenFiles.get(qualifiedPath) > .orElse { leafFiles.get(qualifiedPath).map(Array(_)) } > .getOrElse(Array.empty) > {code} > leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the > qualifiedPath contains the path WITH credentials. > So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no > data is read and the schema cannot be defined. > Spark does output the S3xLoginHelper:90 - The Filesystem URI contains login > details. This is insecure and may be unsupported in future., but this should > not mean that it shouldn't work anymore. > Workaround: > Move the AWS credentials from the path to the SparkSession > {code} > SparkSession.builder > .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId}) > .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey}) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows
Perrine Letellier created SPARK-20969: - Summary: last() aggregate function fails returning the right answer with ordered windows Key: SPARK-20969 URL: https://issues.apache.org/jira/browse/SPARK-20969 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Reporter: Perrine Letellier {code} scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "desc") scala> val window = Window.partitionBy("id").orderBy(col("ts").asc) scala> df.withColumn("last", last(col("description")).over(window)).show +---+---+-+-+ | id| ts| description| last| +---+---+-+-+ | i1| 1|desc1|desc2| | i1| 1|desc2|desc2| | i1| 2|desc3|desc3| +---+---+-+-+ {code} However what is expected is the same answer as if asking for `first()` with a window with descending order. {code} scala> val window = Window.partitionBy("id").orderBy(col("ts").desc) scala> df.withColumn("last", first(col("description")).over(window)).show +---+---+-+-+ | id| ts| description| last| +---+---+-+-+ | i1| 2|desc3|desc3| | i1| 1|desc1|desc3| | i1| 1|desc2|desc3| +---+---+-+-+ {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20942) The title style about field is error in the history server web ui.
[ https://issues.apache.org/jira/browse/SPARK-20942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20942. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18170 [https://github.com/apache/spark/pull/18170] > The title style about field is error in the history server web ui. > -- > > Key: SPARK-20942 > URL: https://issues.apache.org/jira/browse/SPARK-20942 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.1 >Reporter: guoxiaolongzte >Priority: Minor > Fix For: 2.2.0 > > Attachments: before.png, fix1.png, fix.png > > > 1.The title style about field is error. > 2.Title text description, 'the application' should be changed to 'this > application'. > 3.Analysis of code: > $('#hisotry-summary [data-toggle="tooltip"]').tooltip(); > hisotry-summary is the spelling error. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20942) The title style about field is error in the history server web ui.
[ https://issues.apache.org/jira/browse/SPARK-20942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20942: - Assignee: guoxiaolongzte > The title style about field is error in the history server web ui. > -- > > Key: SPARK-20942 > URL: https://issues.apache.org/jira/browse/SPARK-20942 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.1 >Reporter: guoxiaolongzte >Assignee: guoxiaolongzte >Priority: Minor > Fix For: 2.2.0 > > Attachments: before.png, fix1.png, fix.png > > > 1.The title style about field is error. > 2.Title text description, 'the application' should be changed to 'this > application'. > 3.Analysis of code: > $('#hisotry-summary [data-toggle="tooltip"]').tooltip(); > hisotry-summary is the spelling error. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20790) ALS with implicit feedback ignores negative values
[ https://issues.apache.org/jira/browse/SPARK-20790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034730#comment-16034730 ] Apache Spark commented on SPARK-20790: -- User 'davideis' has created a pull request for this issue: https://github.com/apache/spark/pull/18188 > ALS with implicit feedback ignores negative values > -- > > Key: SPARK-20790 > URL: https://issues.apache.org/jira/browse/SPARK-20790 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, > 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1 >Reporter: David Eis >Assignee: David Eis > Fix For: 2.2.0 > > > The refactorization that was done in > https://github.com/apache/spark/pull/5314/files introduced a bug, whereby for > implicit feedback negative ratings just get ignored. Prior to that commit > they were not ignored, but the absolute value was used as the confidence and > the preference was set to 0. The preservation of comments and absolute value > indicate that this was unintentional. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20943) Correct BypassMergeSortShuffleWriter's comment
[ https://issues.apache.org/jira/browse/SPARK-20943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034746#comment-16034746 ] CanBin Zheng commented on SPARK-20943: -- [~saisai_shao] I see your point. But I think it's better to change the description; it has confused me for a long time, and someone else may have the same confusion. > Correct BypassMergeSortShuffleWriter's comment > -- > > Key: SPARK-20943 > URL: https://issues.apache.org/jira/browse/SPARK-20943 > Project: Spark > Issue Type: Improvement > Components: Documentation, Shuffle >Affects Versions: 2.1.1 >Reporter: CanBin Zheng >Priority: Trivial > Labels: starter > > There are some comments written in BypassMergeSortShuffleWriter.java about > when to select this write path, the three required conditions are described > as follows: > 1. no Ordering is specified, and > 2. no Aggregator is specified, and > 3. the number of partitions is less than > spark.shuffle.sort.bypassMergeThreshold > Obviously, the conditions written are partially wrong and misleading, the > right conditions should be: > 1. map-side combine is false, and > 2. the number of partitions is less than > spark.shuffle.sort.bypassMergeThreshold -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20960) make ColumnVector public
[ https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034845#comment-16034845 ] Dongjoon Hyun commented on SPARK-20960: --- cc [~mridulm80] > make ColumnVector public > > > Key: SPARK-20960 > URL: https://issues.apache.org/jira/browse/SPARK-20960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan > > ColumnVector is an internal interface in Spark SQL, which is only used for > vectorized parquet reader to represent the in-memory columnar format. > In Spark 2.3 we want to make ColumnVector public, so that we can provide a > more efficient way for data exchanges between Spark and external systems. For > example, we can use ColumnVector to build the columnar read API in data > source framework, we can use ColumnVector to build a more efficient UDF API, > etc. > We also want to introduce a new ColumnVector implementation based on Apache > Arrow(basically just a wrapper over Arrow), so that external systems(like > Python Pandas DataFrame) can build ColumnVector very easily. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows
[ https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Perrine Letellier updated SPARK-20969: -- Description: The column on which `orderBy` is performed is considered as another column on which to partition. {code} scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "desc") scala> val window = Window.partitionBy("id").orderBy(col("ts").asc) scala> df.withColumn("last", last(col("description")).over(window)).show +---+---+-+-+ | id| ts| description| last| +---+---+-+-+ | i1| 1|desc1|desc2| | i1| 1|desc2|desc2| | i1| 2|desc3|desc3| +---+---+-+-+ {code} However what is expected is the same answer as if asking for `first()` with a window with descending order. {code} scala> val window = Window.partitionBy("id").orderBy(col("ts").desc) scala> df.withColumn("last", first(col("description")).over(window)).show +---+---+-+-+ | id| ts| description| last| +---+---+-+-+ | i1| 2|desc3|desc3| | i1| 1|desc1|desc3| | i1| 1|desc2|desc3| +---+---+-+-+ {code} was: {code} scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "desc") scala> val window = Window.partitionBy("id").orderBy(col("ts").asc) scala> df.withColumn("last", last(col("description")).over(window)).show +---+---+-+-+ | id| ts| description| last| +---+---+-+-+ | i1| 1|desc1|desc2| | i1| 1|desc2|desc2| | i1| 2|desc3|desc3| +---+---+-+-+ {code} However what is expected is the same answer as if asking for `first()` with a window with descending order. {code} scala> val window = Window.partitionBy("id").orderBy(col("ts").desc) scala> df.withColumn("last", first(col("description")).over(window)).show +---+---+-+-+ | id| ts| description| last| +---+---+-+-+ | i1| 2|desc3|desc3| | i1| 1|desc1|desc3| | i1| 1|desc2|desc3| +---+---+-+-+ {code} > last() aggregate function fails returning the right answer with ordered > windows > --- > > Key: SPARK-20969 > URL: https://issues.apache.org/jira/browse/SPARK-20969 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Perrine Letellier > > The column on which `orderBy` is performed is considered as another column on > which to partition. > {code} > scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), > ("i1", 2, "desc3"))).toDF("id", "ts", "desc") > scala> val window = Window.partitionBy("id").orderBy(col("ts").asc) > scala> df.withColumn("last", last(col("description")).over(window)).show > +---+---+-+-+ > | id| ts| description| last| > +---+---+-+-+ > | i1| 1|desc1|desc2| > | i1| 1|desc2|desc2| > | i1| 2|desc3|desc3| > +---+---+-+-+ > {code} > However what is expected is the same answer as if asking for `first()` with a > window with descending order. > {code} > scala> val window = Window.partitionBy("id").orderBy(col("ts").desc) > scala> df.withColumn("last", first(col("description")).over(window)).show > +---+---+-+-+ > | id| ts| description| last| > +---+---+-+-+ > | i1| 2|desc3|desc3| > | i1| 1|desc1|desc3| > | i1| 1|desc2|desc3| > +---+---+-+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows
[ https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Perrine Letellier updated SPARK-20969: -- Description: The column on which `orderBy` is performed is considered as another column on which to partition. {code} scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "description") scala> val window = Window.partitionBy("id").orderBy(col("ts").asc) scala> df.withColumn("last", last(col("description")).over(window)).show +---+---+-+-+ | id| ts| description| last| +---+---+-+-+ | i1| 1|desc1|desc2| | i1| 1|desc2|desc2| | i1| 2|desc3|desc3| +---+---+-+-+ {code} However what is expected is the same answer as if asking for `first()` with a window with descending order. {code} scala> val window = Window.partitionBy("id").orderBy(col("ts").desc) scala> df.withColumn("last", first(col("description")).over(window)).show +---+---+-+-+ | id| ts| description| last| +---+---+-+-+ | i1| 2|desc3|desc3| | i1| 1|desc1|desc3| | i1| 1|desc2|desc3| +---+---+-+-+ {code} was: The column on which `orderBy` is performed is considered as another column on which to partition. {code} scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "desc") scala> val window = Window.partitionBy("id").orderBy(col("ts").asc) scala> df.withColumn("last", last(col("description")).over(window)).show +---+---+-+-+ | id| ts| description| last| +---+---+-+-+ | i1| 1|desc1|desc2| | i1| 1|desc2|desc2| | i1| 2|desc3|desc3| +---+---+-+-+ {code} However what is expected is the same answer as if asking for `first()` with a window with descending order. {code} scala> val window = Window.partitionBy("id").orderBy(col("ts").desc) scala> df.withColumn("last", first(col("description")).over(window)).show +---+---+-+-+ | id| ts| description| last| +---+---+-+-+ | i1| 2|desc3|desc3| | i1| 1|desc1|desc3| | i1| 1|desc2|desc3| +---+---+-+-+ {code} > last() aggregate function fails returning the right answer with ordered > windows > --- > > Key: SPARK-20969 > URL: https://issues.apache.org/jira/browse/SPARK-20969 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Perrine Letellier > > The column on which `orderBy` is performed is considered as another column on > which to partition. > {code} > scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), > ("i1", 2, "desc3"))).toDF("id", "ts", "description") > scala> val window = Window.partitionBy("id").orderBy(col("ts").asc) > scala> df.withColumn("last", last(col("description")).over(window)).show > +---+---+-+-+ > | id| ts| description| last| > +---+---+-+-+ > | i1| 1|desc1|desc2| > | i1| 1|desc2|desc2| > | i1| 2|desc3|desc3| > +---+---+-+-+ > {code} > However what is expected is the same answer as if asking for `first()` with a > window with descending order. > {code} > scala> val window = Window.partitionBy("id").orderBy(col("ts").desc) > scala> df.withColumn("last", first(col("description")).over(window)).show > +---+---+-+-+ > | id| ts| description| last| > +---+---+-+-+ > | i1| 2|desc3|desc3| > | i1| 1|desc1|desc3| > | i1| 1|desc2|desc3| > +---+---+-+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
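A possible work-around (my own, not from the ticket): with an ORDER BY, the default window frame only extends up to the current row, so last() returns the latest row seen so far rather than the last row of the partition. Explicitly widening the frame to the whole partition gives the expected answer.
{code}
scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions.{col, last}
scala> val fullWindow = Window.partitionBy("id").orderBy(col("ts").asc).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
scala> df.withColumn("last", last(col("description")).over(fullWindow)).show
{code}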
[jira] [Updated] (SPARK-19236) Add createOrReplaceGlobalTempView
[ https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-19236: -- Fix Version/s: 2.3.0 > Add createOrReplaceGlobalTempView > - > > Key: SPARK-19236 > URL: https://issues.apache.org/jira/browse/SPARK-19236 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Arman Yazdani >Priority: Minor > Fix For: 2.3.0 > > > There are 3 methods for saving a temp tables: > createTempView > createOrReplaceTempView > createGlobalTempView > but there isn't: > createOrReplaceGlobalTempView -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1
[ https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034950#comment-16034950 ] Ryan Blue commented on SPARK-20958: --- I don't think it is a good idea to roll back. Spark doesn't depend on parquet-avro, where the update to Avro 1.8.1 was made, except for tests where it is fine. The backports for Spark in 1.8.2 are worth keeping since there are reasonable work-arounds in user projects. The problem that I've seen on the dev list is when users add parquet-avro to their dependencies and the version gets managed to 1.8.2. That will require Avro 1.8.1 because parquet-avro calls {{getSchema}} on avro-specific objects. But there are a couple reasonable ways to deal with this: 1. Specify a dependency on parquet-avro 1.8.1 that still uses Avro 1.7.x. Parquet is backward-compatible with older binaries, so parquet-avro 1.8.1 works fine with parquet-hadoop 1.8.2. (This is the recommended work-around.) 2. Shade and relocate Avro 1.8.1 in application Jars, so that Spark can use 1.7.x and parquet-avro can use 1.8.1. This was brought up on the dev list, but the user dismissed these work-arounds without trying them. Long-term, we can do a 1.8.3 release to solve this problem, though I think the best solution there would be to stop using {{getSchema}} instead of downgrading the dependency. > Roll back parquet-mr 1.8.2 to parquet-1.8.1 > --- > > Key: SPARK-20958 > URL: https://issues.apache.org/jira/browse/SPARK-20958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on > avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 > and avro 1.7.7 used by spark-core 2.2.0-rc2. > Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro > (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the > reasons mentioned in [PR > #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. > Therefore, we don't really have many choices here and have to roll back > parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
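To make work-around (1) concrete, a hypothetical sbt fragment for an application build (the coordinates are real; pinning them this way is the assumption being illustrated):
{code}
// build.sbt: keep parquet-avro at 1.8.1 (still on Avro 1.7.x) while Spark brings in
// parquet-hadoop 1.8.2; Parquet remains backward-compatible with the older module.
libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.8.1"
dependencyOverrides += "org.apache.parquet" % "parquet-avro" % "1.8.1"
{code}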
[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1
[ https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034961#comment-16034961 ] Dongjoon Hyun commented on SPARK-20958: --- +1 for [~rdblue]. > Roll back parquet-mr 1.8.2 to parquet-1.8.1 > --- > > Key: SPARK-20958 > URL: https://issues.apache.org/jira/browse/SPARK-20958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on > avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 > and avro 1.7.7 used by spark-core 2.2.0-rc2. > Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro > (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the > reasons mentioned in [PR > #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. > Therefore, we don't really have many choices here and have to roll back > parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks
[ https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034964#comment-16034964 ] Marcelo Vanzin commented on SPARK-20662: Yeah, I don't really understand this request. It doesn't matter how many tasks a job creates, what really matters is how many resources the cluster manager allows the application to allocate. If a job has 1 million tasks but the cluster manager allocates a single vcpu for the job, it will take forever, but it won't really bog down the cluster. > Block jobs that have greater than a configured number of tasks > -- > > Key: SPARK-20662 > URL: https://issues.apache.org/jira/browse/SPARK-20662 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 >Reporter: Xuefu Zhang > > In a shared cluster, it's desirable for an admin to block large Spark jobs. > While there might not be a single metrics defining the size of a job, the > number of tasks is usually a good indicator. Thus, it would be useful for > Spark scheduler to block a job whose number of tasks reaches a configured > limit. By default, the limit could be just infinite, to retain the existing > behavior. > MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be > configured, which blocks a MR job at job submission time. > The proposed configuration is spark.job.max.tasks with a default value -1 > (infinite). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20968) Support separator in Tokenizer
[ https://issues.apache.org/jira/browse/SPARK-20968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034993#comment-16034993 ] Nick Pentreath commented on SPARK-20968: Would you mind adding more detail here? What is the use case, with an example of the desired input / output? > Support separator in Tokenizer > -- > > Key: SPARK-20968 > URL: https://issues.apache.org/jira/browse/SPARK-20968 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.0.0, 2.0.2, 2.1.1 >Reporter: darion yaphet >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035012#comment-16035012 ] Michael Armbrust commented on SPARK-19104: -- I'm about to cut RC3 of 2.2 and there is no pull request to fix this. Unfortunately that means it's not going to be fixed in 2.2.0 > CompileException with Map and Case Class in Spark 2.1.0 > > > Key: SPARK-19104 > URL: https://issues.apache.org/jira/browse/SPARK-19104 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Nils Grabbert > > The following code will run with Spark 2.0.2 but not with Spark 2.1.0: > {code} > case class InnerData(name: String, value: Int) > case class Data(id: Int, param: Map[String, InnerData]) > val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + > 100 > val ds = spark.createDataset(data) > {code} > Exception: > {code} > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 63, Column 46: Expression > "ExternalMapToCatalyst_value_isNull1" is not an rvalue > at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) > at > org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639) > > at > org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) > at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984) > > at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) > at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) > at > org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) > at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) > at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) > at > org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) > at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > > at 
org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396) > > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311) > > at org.codehaus.janino
[jira] [Resolved] (SPARK-20967) SharedState.externalCatalog is not really lazy
[ https://issues.apache.org/jira/browse/SPARK-20967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20967. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18187 [https://github.com/apache/spark/pull/18187] > SharedState.externalCatalog is not really lazy > -- > > Key: SPARK-20967 > URL: https://issues.apache.org/jira/browse/SPARK-20967 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20946) simplify the config setting logic in SparkSession.getOrCreate
[ https://issues.apache.org/jira/browse/SPARK-20946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20946. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18172 [https://github.com/apache/spark/pull/18172] > simplify the config setting logic in SparkSession.getOrCreate > - > > Key: SPARK-20946 > URL: https://issues.apache.org/jira/browse/SPARK-20946 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15352) Topology aware block replication
[ https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chopra resolved SPARK-15352. Resolution: Fixed > Topology aware block replication > > > Key: SPARK-15352 > URL: https://issues.apache.org/jira/browse/SPARK-15352 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Mesos, Spark Core, YARN >Reporter: Shubham Chopra >Assignee: Shubham Chopra > > With cached RDDs, Spark can be used for online analytics where it is used to > respond to online queries. But loss of RDD partitions due to node/executor > failures can cause huge delays in such use cases as the data would have to be > regenerated. > Cached RDDs, even when using multiple replicas per block, are not currently > resilient to node failures when multiple executors are started on the same > node. Block replication currently chooses a peer at random, and this peer > could also exist on the same host. > This effort would add topology aware replication to Spark that can be enabled > with pluggable strategies. For ease of development/review, this is being > broken down to three major work-efforts: > 1.Making peer selection for replication pluggable > 2.Providing pluggable implementations for providing topology and topology > aware replication > 3.Pro-active replenishment of lost blocks -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
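To make the pluggable-strategy idea concrete, a self-contained sketch of what a topology-aware peer selector could look like; the {{Peer}} type and the method names below are illustrative only and are not Spark's actual block-manager API:
{code}
// Hypothetical types; Spark's real block-manager interfaces differ.
case class Peer(host: String, port: Int, rack: Option[String])

trait ReplicationPeerSelector {
  /** Order candidate peers so that new replicas prefer racks not yet holding the block. */
  def prioritize(candidates: Seq[Peer], alreadyReplicatedTo: Set[Peer], numReplicas: Int): Seq[Peer]
}

class RackAwareSelector extends ReplicationPeerSelector {
  override def prioritize(candidates: Seq[Peer], alreadyReplicatedTo: Set[Peer],
                          numReplicas: Int): Seq[Peer] = {
    val usedRacks = alreadyReplicatedTo.flatMap(_.rack)
    // Prefer peers on racks that do not yet hold a replica, then fall back to the rest.
    val (newRacks, sameRacks) =
      candidates.partition(p => p.rack.exists(r => !usedRacks.contains(r)))
    (newRacks ++ sameRacks).take(numReplicas)
  }
}
{code}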
[jira] [Commented] (SPARK-15352) Topology aware block replication
[ https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035061#comment-16035061 ] Dongjoon Hyun commented on SPARK-15352: --- Thank you, [~shubhamc]! > Topology aware block replication > > > Key: SPARK-15352 > URL: https://issues.apache.org/jira/browse/SPARK-15352 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Mesos, Spark Core, YARN >Reporter: Shubham Chopra >Assignee: Shubham Chopra > Fix For: 2.2.0 > > > With cached RDDs, Spark can be used for online analytics where it is used to > respond to online queries. But loss of RDD partitions due to node/executor > failures can cause huge delays in such use cases as the data would have to be > regenerated. > Cached RDDs, even when using multiple replicas per block, are not currently > resilient to node failures when multiple executors are started on the same > node. Block replication currently chooses a peer at random, and this peer > could also exist on the same host. > This effort would add topology aware replication to Spark that can be enabled > with pluggable strategies. For ease of development/review, this is being > broken down to three major work-efforts: > 1.Making peer selection for replication pluggable > 2.Providing pluggable implementations for providing topology and topology > aware replication > 3.Pro-active replenishment of lost blocks -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15352) Topology aware block replication
[ https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15352: -- Fix Version/s: 2.2.0 > Topology aware block replication > > > Key: SPARK-15352 > URL: https://issues.apache.org/jira/browse/SPARK-15352 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Mesos, Spark Core, YARN >Reporter: Shubham Chopra >Assignee: Shubham Chopra > Fix For: 2.2.0 > > > With cached RDDs, Spark can be used for online analytics where it is used to > respond to online queries. But loss of RDD partitions due to node/executor > failures can cause huge delays in such use cases as the data would have to be > regenerated. > Cached RDDs, even when using multiple replicas per block, are not currently > resilient to node failures when multiple executors are started on the same > node. Block replication currently chooses a peer at random, and this peer > could also exist on the same host. > This effort would add topology aware replication to Spark that can be enabled > with pluggable strategies. For ease of development/review, this is being > broken down to three major work-efforts: > 1.Making peer selection for replication pluggable > 2.Providing pluggable implementations for providing topology and topology > aware replication > 3.Pro-active replenishment of lost blocks -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035062#comment-16035062 ] Nicholas Chammas commented on SPARK-12661: -- I think we are good to resolve this provided that we've stopped testing with Python 2.6. Any cleanup of 2.6-specific workarounds (tracked in SPARK-20149) can be done separately IMO. > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20955) A lot of duplicated "executorId" strings in "TaskUIData"s
[ https://issues.apache.org/jira/browse/SPARK-20955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20955. -- Resolution: Fixed Fix Version/s: 2.2.0 > A lot of duplicated "executorId" strings in "TaskUIData"s > -- > > Key: SPARK-20955 > URL: https://issues.apache.org/jira/browse/SPARK-20955 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1
[ https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035149#comment-16035149 ] Cheng Lian commented on SPARK-20958: Thanks [~rdblue]! I'm also reluctant to roll it back considering those fixes we wanted so badly... We decided to give this a try because, from the perspective of release management, we'd like to avoid cutting a release with known conflicting dependencies, even transitive ones. For a Spark 2.2 user, it's quite natural to choose parquet-avro 1.8.2, which is part of parquet-mr 1.8.2, which in turn, is a direct dependency of Spark 2.2.0. However, due to PARQUET-389, rolling back is already not an option. Two options I can see here are: # Release Spark 2.2.0 as is with a statement in the release notes saying that users should use parquet-avro 1.8.1 instead of 1.8.2 to avoid the Avro compatibility issue. # Wait for parquet-mr 1.8.3, which hopefully resolves this dependency issue (e.g., by reverting PARQUET-358). > Roll back parquet-mr 1.8.2 to parquet-1.8.1 > --- > > Key: SPARK-20958 > URL: https://issues.apache.org/jira/browse/SPARK-20958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on > avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 > and avro 1.7.7 used by spark-core 2.2.0-rc2. > Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro > (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the > reasons mentioned in [PR > #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. > Therefore, we don't really have many choices here and have to roll back > parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
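For option 1, the release-notes guidance would amount to something like the following build.sbt fragment in a user application (a sketch; the coordinates are the published parquet-avro 1.8.1 artifact):
{code}
// Pin parquet-avro to 1.8.1 so the application stays on the avro 1.7.x line,
// instead of picking up parquet-avro 1.8.2 and its avro 1.8.1 dependency.
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-sql"    % "2.2.0" % "provided",
  "org.apache.parquet" %  "parquet-avro" % "1.8.1"
)
{code}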
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035171#comment-16035171 ] Andrew Ash commented on SPARK-20952: For the localProperties on SparkContext it does 2 things I can see to improve safety: - first, it clones the properties for new threads so changes in the parent thread don't unintentionally affect a child thread: https://github.com/apache/spark/blob/v2.2.0-rc2/core/src/main/scala/org/apache/spark/SparkContext.scala#L330 - second, it clears the properties when they're no longer being used: https://github.com/apache/spark/blob/v2.2.0-rc2/core/src/main/scala/org/apache/spark/SparkContext.scala#L1942 Do we need to do either the defensive cloning or the proactive clearing of taskInfos in executors as is done in the driver? > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
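As a reference for the first point, a minimal sketch of the defensive-cloning idea (not Spark's actual TaskContext code): an {{InheritableThreadLocal}} whose {{childValue}} hands each new child thread its own copy:
{code}
import java.util.Properties
import org.apache.commons.lang3.SerializationUtils

// Each child thread gets a deep copy, so later mutations in the parent thread
// do not unintentionally leak into already-started children (and vice versa).
val localProps = new InheritableThreadLocal[Properties] {
  override def childValue(parent: Properties): Properties =
    if (parent != null) SerializationUtils.clone(parent) else new Properties()
}
{code}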
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035195#comment-16035195 ] Robert Kruszewski commented on SPARK-20952: --- 2 is already happening on executors where the Task will set and unset its TaskContext correctly. Agree we should add 1. > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19236) Add createOrReplaceGlobalTempView
[ https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19236. - Resolution: Fixed > Add createOrReplaceGlobalTempView > - > > Key: SPARK-19236 > URL: https://issues.apache.org/jira/browse/SPARK-19236 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Arman Yazdani >Priority: Minor > Fix For: 2.2.0, 2.3.0 > > > There are 3 methods for saving a temp tables: > createTempView > createOrReplaceTempView > createGlobalTempView > but there isn't: > createOrReplaceGlobalTempView -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19236) Add createOrReplaceGlobalTempView
[ https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-19236: --- Assignee: Xiao Li > Add createOrReplaceGlobalTempView > - > > Key: SPARK-19236 > URL: https://issues.apache.org/jira/browse/SPARK-19236 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Arman Yazdani >Assignee: Xiao Li >Priority: Minor > Fix For: 2.2.0, 2.3.0 > > > There are 3 methods for saving a temp tables: > createTempView > createOrReplaceTempView > createGlobalTempView > but there isn't: > createOrReplaceGlobalTempView -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19236) Add createOrReplaceGlobalTempView
[ https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19236: Fix Version/s: 2.2.0 > Add createOrReplaceGlobalTempView > - > > Key: SPARK-19236 > URL: https://issues.apache.org/jira/browse/SPARK-19236 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Arman Yazdani >Priority: Minor > Fix For: 2.2.0, 2.3.0 > > > There are 3 methods for saving a temp tables: > createTempView > createOrReplaceTempView > createGlobalTempView > but there isn't: > createOrReplaceGlobalTempView -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19236) Add createOrReplaceGlobalTempView
[ https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-19236: --- Assignee: Arman Yazdani (was: Xiao Li) > Add createOrReplaceGlobalTempView > - > > Key: SPARK-19236 > URL: https://issues.apache.org/jira/browse/SPARK-19236 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Arman Yazdani >Assignee: Arman Yazdani >Priority: Minor > Fix For: 2.2.0, 2.3.0 > > > There are 3 methods for saving a temp tables: > createTempView > createOrReplaceTempView > createGlobalTempView > but there isn't: > createOrReplaceGlobalTempView -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1
[ https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-20958. -- Resolution: Won't Fix Thanks everyone. Sounds like we'll just provide directions in the release notes for users of parquet-avro to pin the version 1.8.1. > Roll back parquet-mr 1.8.2 to parquet-1.8.1 > --- > > Key: SPARK-20958 > URL: https://issues.apache.org/jira/browse/SPARK-20958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Labels: release-notes > > We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on > avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 > and avro 1.7.7 used by spark-core 2.2.0-rc2. > Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro > (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the > reasons mentioned in [PR > #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. > Therefore, we don't really have many choices here and have to roll back > parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1
[ https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20958: - Labels: release-notes (was: ) > Roll back parquet-mr 1.8.2 to parquet-1.8.1 > --- > > Key: SPARK-20958 > URL: https://issues.apache.org/jira/browse/SPARK-20958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Labels: release-notes > > We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on > avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 > and avro 1.7.7 used by spark-core 2.2.0-rc2. > Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro > (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the > reasons mentioned in [PR > #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. > Therefore, we don't really have many choices here and have to roll back > parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20914) Javadoc contains code that is invalid
[ https://issues.apache.org/jira/browse/SPARK-20914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20914: -- Priority: Trivial (was: Minor) It's OK if you don't see more like this just now, just open a PR for what you've got > Javadoc contains code that is invalid > - > > Key: SPARK-20914 > URL: https://issues.apache.org/jira/browse/SPARK-20914 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Cristian Teodor >Priority: Trivial > > i was looking over the > [dataset|https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/sql/Dataset.html] > and noticed the code on top that does not make sense in java. > {code} > // To create Dataset using SparkSession >Dataset people = spark.read().parquet("..."); >Dataset department = spark.read().parquet("..."); >people.filter("age".gt(30)) > .join(department, people.col("deptId").equalTo(department("id"))) > .groupBy(department.col("name"), "gender") > .agg(avg(people.col("salary")), max(people.col("age"))); > {code} > invalid parts: > * "age".gt(30) > * department("id") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
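For reference, one way the snippet reads once the invalid parts are replaced with the column-based API. This is a Scala rendering (in the Javadoc's Java form the equivalents would be {{people.col("age").gt(30)}} and {{department.col("id")}}), assuming {{people}} and {{department}} are DataFrames with the columns shown:
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, max}

// All filter/join/group expressions use Column objects obtained from the DataFrames.
def example(people: DataFrame, department: DataFrame): DataFrame =
  people.filter(people("age") > 30)
    .join(department, people("deptId") === department("id"))
    .groupBy(department("name"), people("gender"))
    .agg(avg(people("salary")), max(people("age")))
{code}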
[jira] [Created] (SPARK-20970) Deprecate TaskMetrics._updatedBlockStatuses
Thomas Graves created SPARK-20970: - Summary: Deprecate TaskMetrics._updatedBlockStatuses Key: SPARK-20970 URL: https://issues.apache.org/jira/browse/SPARK-20970 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Reporter: Thomas Graves TaskMetrics._updatedBlockStatuses isn't used anywhere internally by Spark. It could be used by users though since it's exposed by SparkListenerTaskEnd. We made it configurable to turn off the tracking of it since it uses a lot of memory in https://issues.apache.org/jira/browse/SPARK-20923. That config is still true for backwards compatibility. We should turn that to false in the next release and deprecate that API altogether. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035302#comment-16035302 ] Shixiong Zhu commented on SPARK-20952: -- What I'm concerned about is global thread pools, such as https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L128 > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
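The concern can be seen with a toy example: an {{InheritableThreadLocal}} value is copied when a worker thread is created, not when work is submitted, so a long-lived shared pool keeps stale values (an illustrative sketch, not Spark code):
{code}
import java.util.concurrent.Executors

object InheritancePitfall {
  val ctx = new InheritableThreadLocal[String]
  val pool = Executors.newFixedThreadPool(1)

  def main(args: Array[String]): Unit = {
    ctx.set("task-1")
    pool.submit(new Runnable { def run(): Unit = println(ctx.get()) }) // prints "task-1"
    ctx.set("task-2")
    // The single worker thread was created during the first submit, so it still
    // carries the value inherited at that point, not the current one.
    pool.submit(new Runnable { def run(): Unit = println(ctx.get()) }) // prints "task-1", not "task-2"
    pool.shutdown()
  }
}
{code}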
[jira] [Created] (SPARK-20971) Purge the metadata log for FileStreamSource
Shixiong Zhu created SPARK-20971: Summary: Purge the metadata log for FileStreamSource Key: SPARK-20971 URL: https://issues.apache.org/jira/browse/SPARK-20971 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.1.1 Reporter: Shixiong Zhu Currently [FileStreamSource.commit|https://github.com/apache/spark/blob/16186cdcbce1a2ec8f839c550e6b571bf5dc2692/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L258] is empty. We can delete unused metadata logs in this method to reduce the size of log files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20782) Dataset's isCached operator
[ https://issues.apache.org/jira/browse/SPARK-20782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035316#comment-16035316 ] Jacek Laskowski commented on SPARK-20782: - Just stumbled upon {{CatalogImpl.isCached}} that could also be used to implement this feature. > Dataset's isCached operator > --- > > Key: SPARK-20782 > URL: https://issues.apache.org/jira/browse/SPARK-20782 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > It'd be very convenient to have {{isCached}} operator that would say whether > a query is cached in-memory or not. > It'd be as simple as the following snippet: > {code} > // val q2: DataFrame > spark.sharedState.cacheManager.lookupCachedData(q2.queryExecution.logical).isDefined > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
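A sketch of what is available today without a {{Dataset.isCached}} operator: the public {{Catalog}} route works for named (temp) views, while the {{CacheManager}} snippet in the description goes through non-public internals:
{code}
import org.apache.spark.sql.SparkSession

// Works when the Dataset has been registered as a (temp) view and cached by name.
def isCachedByName(spark: SparkSession, viewName: String): Boolean =
  spark.catalog.isCached(viewName)

// Example usage (names are placeholders):
//   df.createOrReplaceTempView("q2"); spark.catalog.cacheTable("q2")
//   isCachedByName(spark, "q2")  // true
{code}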
[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035356#comment-16035356 ] Simeon H.K. Fitch commented on SPARK-7768: -- [~pgrandjean] Once a UDT is registered, the `ExpressionEncoder` class (usually invoked by the functions in `Encoders`) automatically makes use of it. > Make user-defined type (UDT) API public > --- > > Key: SPARK-7768 > URL: https://issues.apache.org/jira/browse/SPARK-7768 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Xiangrui Meng >Priority: Critical > > As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it > would be nice to make the UDT API public in 1.5. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20734) Structured Streaming spark.sql.streaming.schemaInference not handling schema changes
[ https://issues.apache.org/jira/browse/SPARK-20734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20734: - Issue Type: New Feature (was: Bug) > Structured Streaming spark.sql.streaming.schemaInference not handling schema > changes > > > Key: SPARK-20734 > URL: https://issues.apache.org/jira/browse/SPARK-20734 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Ram > > sparkSession.config("spark.sql.streaming.schemaInference", > true).getOrCreate(); > Dataset dataset = > sparkSession.readStream().parquet("file:/files-to-process"); > StreamingQuery streamingQuery = > dataset.writeStream().option("checkpointLocation", > "file:/checkpoint-location") > .outputMode(Append()).start("file:/save-parquet-files"); > streamingQuery.awaitTermination(); > After streaming query started If there's a schema changes on new paruet > files under files-to-process directory. Structured Streaming not writing new > schema changes. Is it possible to handle these schema changes in Structured > Streaming. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20928) Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20928: - Description: Given the current Source API, the minimum possible latency for any record is bounded by the amount of time that it takes to launch a task. This limitation is a result of the fact that {{getBatch}} requires us to know both the starting and the ending offset, before any tasks are launched. In the worst case, the end-to-end latency is actually closer to the average batch time + task launching time. For applications where latency is more important than exactly-once output however, it would be useful if processing could happen continuously. This would allow us to achieve fully pipelined reading and writing from sources such as Kafka. This kind of architecture would make it possible to process records with end-to-end latencies on the order of 1 ms, rather than the 10-100ms that is possible today. One possible architecture here would be to change the Source API to look like the following rough sketch: {code} trait Epoch { def data: DataFrame /** The exclusive starting position for `data`. */ def startOffset: Offset /** The inclusive ending position for `data`. Incrementally updated during processing, but not complete until execution of the query plan in `data` is finished. */ def endOffset: Offset } def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], limits: Limits): Epoch {code} The above would allow us to build an alternative implementation of {{StreamExecution}} that processes continuously with much lower latency and only stops processing when needing to reconfigure the stream (either due to a failure or a user requested change in parallelism. > Continuous Processing Mode for Structured Streaming > --- > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. 
*/ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user requested change in parallelism. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20147) Cloning SessionState does not clone streaming query listeners
[ https://issues.apache.org/jira/browse/SPARK-20147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-20147. -- Resolution: Fixed Assignee: Kunal Khamar Fix Version/s: 2.2.0 Target Version/s: 2.2.0 Fixed by https://github.com/apache/spark/pull/17379 > Cloning SessionState does not clone streaming query listeners > - > > Key: SPARK-20147 > URL: https://issues.apache.org/jira/browse/SPARK-20147 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Kunal Khamar >Assignee: Kunal Khamar > Fix For: 2.2.0 > > > Cloning session should clone StreamingQueryListeners registered on the > StreamingQueryListenerBus. > Similar to SPARK-20048, https://github.com/apache/spark/pull/17379 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20002) Add support for unions between streaming and batch datasets
[ https://issues.apache.org/jira/browse/SPARK-20002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035441#comment-16035441 ] Michael Armbrust commented on SPARK-20002: -- I'm not sure that we will ever support this. The issue is that for batch datasets, we don't track what has been read. Thus it's unclear what should happen when the query is restarted. Instead, I think you can always achieve the same result by just loading both datasets as a stream (even if you don't plan to change one of them). Would that work? > Add support for unions between streaming and batch datasets > --- > > Key: SPARK-20002 > URL: https://issues.apache.org/jira/browse/SPARK-20002 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.0.2 >Reporter: Leon Pham > > Currently unions between streaming datasets and batch datasets are not > supported. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
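A sketch of the suggested workaround, assuming both inputs are file-based and share a schema (the paths and checkpoint location are placeholders; file sources read via {{readStream}} need an explicit schema unless schema inference is enabled):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

def unionBothAsStreams(spark: SparkSession, schema: StructType): Unit = {
  // Read the "static" dataset as a stream too, even if it rarely changes.
  val live   = spark.readStream.schema(schema).parquet("/data/live")
  val static = spark.readStream.schema(schema).parquet("/data/static")

  live.union(static)
    .writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/chk")
    .start()
}
{code}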
[jira] [Updated] (SPARK-19903) PySpark Kafka streaming query ouput append mode not possible
[ https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19903: - Description: PySpark example reads a Kafka stream. There is watermarking set when handling the data window. The defined query uses output Append mode. The PySpark engine reports the error: 'Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets' The Python example: --- {code} import sys from pyspark.sql import SparkSession from pyspark.sql.functions import explode, split, window if __name__ == "__main__": if len(sys.argv) != 4: print(""" Usage: structured_kafka_wordcount.py """, file=sys.stderr) exit(-1) bootstrapServers = sys.argv[1] subscribeType = sys.argv[2] topics = sys.argv[3] spark = SparkSession\ .builder\ .appName("StructuredKafkaWordCount")\ .getOrCreate() # Create DataSet representing the stream of input lines from kafka lines = spark\ .readStream\ .format("kafka")\ .option("kafka.bootstrap.servers", bootstrapServers)\ .option(subscribeType, topics)\ .load()\ .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)") # Split the lines into words, retaining timestamps # split() splits each line into an array, and explode() turns the array into multiple rows words = lines.select( explode(split(lines.value, ' ')).alias('word'), lines.timestamp ) # Group the data by window and word and compute the count of each group windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy( window(words.timestamp, "30 seconds", "30 seconds"), words.word ).count() # Start running the query that prints the running counts to the console query = windowedCounts\ .writeStream\ .outputMode('append')\ .format('console')\ .option("truncate", "false")\ .start() query.awaitTermination() {code} The corresponding example in Zeppelin notebook: {code} %spark.pyspark from pyspark.sql.functions import explode, split, window # Create DataSet representing the stream of input lines from kafka lines = spark\ .readStream\ .format("kafka")\ .option("kafka.bootstrap.servers", "localhost:9092")\ .option("subscribe", "words")\ .load()\ .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)") # Split the lines into words, retaining timestamps # split() splits each line into an array, and explode() turns the array into multiple rows words = lines.select( explode(split(lines.value, ' ')).alias('word'), lines.timestamp ) # Group the data by window and word and compute the count of each group windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy( window(words.timestamp, "30 seconds", "30 seconds"), words.word ).count() # Start running the query that prints the running counts to the console query = windowedCounts\ .writeStream\ .outputMode('append')\ .format('console')\ .option("truncate", "false")\ .start() query.awaitTermination() -- Note that the Scala version of the same example in Zeppelin notebook works fine: import java.sql.Timestamp import org.apache.spark.sql.streaming.ProcessingTime import org.apache.spark.sql.functions._ // Create DataSet representing the stream of input lines from kafka val lines = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "localhost:9092") .option("subscribe", "words") .load() // Split the lines into words, retaining timestamps val words = lines .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)") .as[(String, Timestamp)] .flatMap(line => line._1.split(" ").map(word => (word, line._2))) 
.toDF("word", "timestamp") // Group the data by window and word and compute the count of each group val windowedCounts = words .withWatermark("timestamp", "30 seconds") .groupBy(window($"timestamp", "30 seconds", "30 seconds"), $"word") .count() // Start running the query that prints the windowed word counts to the console val query = windowedCounts.writeStream .outputMode("append") .format("console") .trigger(ProcessingTime("35 seconds")) .option("truncate", "false") .start() query.awaitTermination() {code} was: PySpark example reads a Kafka stream. There is watermarking set when handling the data window. The
[jira] [Updated] (SPARK-19903) Watermark metadata is lost when using resolved attributes
[ https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19903: - Component/s: (was: PySpark) > Watermark metadata is lost when using resolved attributes > - > > Key: SPARK-19903 > URL: https://issues.apache.org/jira/browse/SPARK-19903 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 > Environment: Ubuntu Linux >Reporter: Piotr Nestorow > > PySpark example reads a Kafka stream. There is watermarking set when handling > the data window. The defined query uses output Append mode. > The PySpark engine reports the error: > 'Append output mode not supported when there are streaming aggregations on > streaming DataFrames/DataSets' > The Python example: > --- > {code} > import sys > from pyspark.sql import SparkSession > from pyspark.sql.functions import explode, split, window > if __name__ == "__main__": > if len(sys.argv) != 4: > print(""" > Usage: structured_kafka_wordcount.py > > """, file=sys.stderr) > exit(-1) > bootstrapServers = sys.argv[1] > subscribeType = sys.argv[2] > topics = sys.argv[3] > spark = SparkSession\ > .builder\ > .appName("StructuredKafkaWordCount")\ > .getOrCreate() > # Create DataSet representing the stream of input lines from kafka > lines = spark\ > .readStream\ > .format("kafka")\ > .option("kafka.bootstrap.servers", bootstrapServers)\ > .option(subscribeType, topics)\ > .load()\ > .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)") > # Split the lines into words, retaining timestamps > # split() splits each line into an array, and explode() turns the array > into multiple rows > words = lines.select( > explode(split(lines.value, ' ')).alias('word'), > lines.timestamp > ) > # Group the data by window and word and compute the count of each group > windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy( > window(words.timestamp, "30 seconds", "30 seconds"), words.word > ).count() > # Start running the query that prints the running counts to the console > query = windowedCounts\ > .writeStream\ > .outputMode('append')\ > .format('console')\ > .option("truncate", "false")\ > .start() > query.awaitTermination() > {code} > The corresponding example in Zeppelin notebook: > {code} > %spark.pyspark > from pyspark.sql.functions import explode, split, window > # Create DataSet representing the stream of input lines from kafka > lines = spark\ > .readStream\ > .format("kafka")\ > .option("kafka.bootstrap.servers", "localhost:9092")\ > .option("subscribe", "words")\ > .load()\ > .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)") > # Split the lines into words, retaining timestamps > # split() splits each line into an array, and explode() turns the array into > multiple rows > words = lines.select( > explode(split(lines.value, ' ')).alias('word'), > lines.timestamp > ) > # Group the data by window and word and compute the count of each group > windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy( > window(words.timestamp, "30 seconds", "30 seconds"), words.word > ).count() > # Start running the query that prints the running counts to the console > query = windowedCounts\ > .writeStream\ > .outputMode('append')\ > .format('console')\ > .option("truncate", "false")\ > .start() > query.awaitTermination() > -- > Note that the Scala version of the same example in Zeppelin notebook works > fine: > > import java.sql.Timestamp > import org.apache.spark.sql.streaming.ProcessingTime 
> import org.apache.spark.sql.functions._ > // Create DataSet representing the stream of input lines from kafka > val lines = spark > .readStream > .format("kafka") > .option("kafka.bootstrap.servers", "localhost:9092") > .option("subscribe", "words") > .load() > // Split the lines into words, retaining timestamps > val words = lines > .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS > TIMESTAMP)") > .as[(String, Timestamp)] > .flatMap(line => line._1.split(" ").map(word => (word, line._2))) > .toDF("word", "timestamp") > // Group the data by window and word and comput
[jira] [Updated] (SPARK-19903) Watermark metadata is lost when using resolved attributes
[ https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19903: - Summary: Watermark metadata is lost when using resolved attributes (was: PySpark Kafka streaming query ouput append mode not possible) > Watermark metadata is lost when using resolved attributes > - > > Key: SPARK-19903 > URL: https://issues.apache.org/jira/browse/SPARK-19903 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 > Environment: Ubuntu Linux >Reporter: Piotr Nestorow > > PySpark example reads a Kafka stream. There is watermarking set when handling > the data window. The defined query uses output Append mode. > The PySpark engine reports the error: > 'Append output mode not supported when there are streaming aggregations on > streaming DataFrames/DataSets' > The Python example: > --- > {code} > import sys > from pyspark.sql import SparkSession > from pyspark.sql.functions import explode, split, window > if __name__ == "__main__": > if len(sys.argv) != 4: > print(""" > Usage: structured_kafka_wordcount.py > > """, file=sys.stderr) > exit(-1) > bootstrapServers = sys.argv[1] > subscribeType = sys.argv[2] > topics = sys.argv[3] > spark = SparkSession\ > .builder\ > .appName("StructuredKafkaWordCount")\ > .getOrCreate() > # Create DataSet representing the stream of input lines from kafka > lines = spark\ > .readStream\ > .format("kafka")\ > .option("kafka.bootstrap.servers", bootstrapServers)\ > .option(subscribeType, topics)\ > .load()\ > .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)") > # Split the lines into words, retaining timestamps > # split() splits each line into an array, and explode() turns the array > into multiple rows > words = lines.select( > explode(split(lines.value, ' ')).alias('word'), > lines.timestamp > ) > # Group the data by window and word and compute the count of each group > windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy( > window(words.timestamp, "30 seconds", "30 seconds"), words.word > ).count() > # Start running the query that prints the running counts to the console > query = windowedCounts\ > .writeStream\ > .outputMode('append')\ > .format('console')\ > .option("truncate", "false")\ > .start() > query.awaitTermination() > {code} > The corresponding example in Zeppelin notebook: > {code} > %spark.pyspark > from pyspark.sql.functions import explode, split, window > # Create DataSet representing the stream of input lines from kafka > lines = spark\ > .readStream\ > .format("kafka")\ > .option("kafka.bootstrap.servers", "localhost:9092")\ > .option("subscribe", "words")\ > .load()\ > .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)") > # Split the lines into words, retaining timestamps > # split() splits each line into an array, and explode() turns the array into > multiple rows > words = lines.select( > explode(split(lines.value, ' ')).alias('word'), > lines.timestamp > ) > # Group the data by window and word and compute the count of each group > windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy( > window(words.timestamp, "30 seconds", "30 seconds"), words.word > ).count() > # Start running the query that prints the running counts to the console > query = windowedCounts\ > .writeStream\ > .outputMode('append')\ > .format('console')\ > .option("truncate", "false")\ > .start() > query.awaitTermination() > -- > Note that the Scala version of the same example in Zeppelin 
notebook works > fine: > > import java.sql.Timestamp > import org.apache.spark.sql.streaming.ProcessingTime > import org.apache.spark.sql.functions._ > // Create DataSet representing the stream of input lines from kafka > val lines = spark > .readStream > .format("kafka") > .option("kafka.bootstrap.servers", "localhost:9092") > .option("subscribe", "words") > .load() > // Split the lines into words, retaining timestamps > val words = lines > .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS > TIMESTAMP)") > .as[(String, Timestamp)] > .flatMap(line => line._1.split(" ").map(word => (wo
[jira] [Updated] (SPARK-19903) Watermark metadata is lost when using resolved attributes
[ https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19903: - Target Version/s: 2.3.0 > Watermark metadata is lost when using resolved attributes > - > > Key: SPARK-19903 > URL: https://issues.apache.org/jira/browse/SPARK-19903 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 > Environment: Ubuntu Linux >Reporter: Piotr Nestorow > > PySpark example reads a Kafka stream. There is watermarking set when handling > the data window. The defined query uses output Append mode. > The PySpark engine reports the error: > 'Append output mode not supported when there are streaming aggregations on > streaming DataFrames/DataSets' > The Python example: > --- > {code} > import sys > from pyspark.sql import SparkSession > from pyspark.sql.functions import explode, split, window > if __name__ == "__main__": > if len(sys.argv) != 4: > print(""" > Usage: structured_kafka_wordcount.py > > """, file=sys.stderr) > exit(-1) > bootstrapServers = sys.argv[1] > subscribeType = sys.argv[2] > topics = sys.argv[3] > spark = SparkSession\ > .builder\ > .appName("StructuredKafkaWordCount")\ > .getOrCreate() > # Create DataSet representing the stream of input lines from kafka > lines = spark\ > .readStream\ > .format("kafka")\ > .option("kafka.bootstrap.servers", bootstrapServers)\ > .option(subscribeType, topics)\ > .load()\ > .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)") > # Split the lines into words, retaining timestamps > # split() splits each line into an array, and explode() turns the array > into multiple rows > words = lines.select( > explode(split(lines.value, ' ')).alias('word'), > lines.timestamp > ) > # Group the data by window and word and compute the count of each group > windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy( > window(words.timestamp, "30 seconds", "30 seconds"), words.word > ).count() > # Start running the query that prints the running counts to the console > query = windowedCounts\ > .writeStream\ > .outputMode('append')\ > .format('console')\ > .option("truncate", "false")\ > .start() > query.awaitTermination() > {code} > The corresponding example in Zeppelin notebook: > {code} > %spark.pyspark > from pyspark.sql.functions import explode, split, window > # Create DataSet representing the stream of input lines from kafka > lines = spark\ > .readStream\ > .format("kafka")\ > .option("kafka.bootstrap.servers", "localhost:9092")\ > .option("subscribe", "words")\ > .load()\ > .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)") > # Split the lines into words, retaining timestamps > # split() splits each line into an array, and explode() turns the array into > multiple rows > words = lines.select( > explode(split(lines.value, ' ')).alias('word'), > lines.timestamp > ) > # Group the data by window and word and compute the count of each group > windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy( > window(words.timestamp, "30 seconds", "30 seconds"), words.word > ).count() > # Start running the query that prints the running counts to the console > query = windowedCounts\ > .writeStream\ > .outputMode('append')\ > .format('console')\ > .option("truncate", "false")\ > .start() > query.awaitTermination() > -- > Note that the Scala version of the same example in Zeppelin notebook works > fine: > > import java.sql.Timestamp > import org.apache.spark.sql.streaming.ProcessingTime > 
import org.apache.spark.sql.functions._ > // Create DataSet representing the stream of input lines from kafka > val lines = spark > .readStream > .format("kafka") > .option("kafka.bootstrap.servers", "localhost:9092") > .option("subscribe", "words") > .load() > // Split the lines into words, retaining timestamps > val words = lines > .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS > TIMESTAMP)") > .as[(String, Timestamp)] > .flatMap(line => line._1.split(" ").map(word => (word, line._2))) > .toDF("word", "timestamp") > // Group the data by window and word and compute the co
[jira] [Updated] (SPARK-20065) Empty output files created for aggregation query in append mode
[ https://issues.apache.org/jira/browse/SPARK-20065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20065: - Target Version/s: 2.3.0 > Empty output files created for aggregation query in append mode > --- > > Key: SPARK-20065 > URL: https://issues.apache.org/jira/browse/SPARK-20065 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Silvio Fiorito > > I've got a Kafka topic which I'm querying, running a windowed aggregation, > with a 30 second watermark, 10 second trigger, writing out to Parquet with > append output mode. > Every 10 second trigger generates a file, regardless of whether there was any > data for that trigger, or whether any records were actually finalized by the > watermark. > Is this expected behavior or should it not write out these empty files? > {code} > val df = spark.readStream.format("kafka") > val query = df > .withWatermark("timestamp", "30 seconds") > .groupBy(window($"timestamp", "10 seconds")) > .count() > .select(date_format($"window.start", "HH:mm:ss").as("time"), $"count") > query > .writeStream > .format("parquet") > .option("checkpointLocation", aggChk) > .trigger(ProcessingTime("10 seconds")) > .outputMode("append") > .start(aggPath) > {code} > As the query executes, do a file listing on "aggPath" and you'll see 339 byte > files at a minimum until we arrive at the first watermark and the initial > batch is finalized. Even after that though, as there are empty batches it'll > keep generating empty files every trigger. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks
[ https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035462#comment-16035462 ] Xuefu Zhang commented on SPARK-20662: - [~lyc] I'm talking about mapreduce.job.max.map, which is the maximum number of map tasks that a MR job may have. If a submitted MR job contains more map tasks than that, it will be rejected. Similar to mapreduce.job.max.reduce. [~sowen], [~vanzin], I don't think blocking a ridiculously large job is equivalent to letting it run slowly and forever. The use case I have is: while a YARN queue can be used to limit how many resources are used, a queue can be shared by a team or multiple applications, and it's probably not a good idea to let one job take all resources while starving others. Secondly, many of those users who submit ridiculously large jobs have no idea what they are doing and don't even realize that their jobs are huge. Lastly and more importantly, our application environment has a global timeout, beyond which a job will be killed. If a large job gets killed this way, significant resources are wasted, so blocking such a job at submission time helps preserve resources. BTW, if these scenarios don't apply to a user, there is nothing for him/her to worry about because the default should keep them happy. In addition to spark.job.max.tasks, I'd also propose spark.stage.max.tasks, which limits the number of tasks any stage of a job may contain. The rationale behind this is that spark.job.max.tasks tends to favor jobs with a small number of stages. With both, we can not only cover MR's mapreduce.job.max.map and mapreduce.job.max.reduce, but also control the overall size of a job. > Block jobs that have greater than a configured number of tasks > -- > > Key: SPARK-20662 > URL: https://issues.apache.org/jira/browse/SPARK-20662 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 >Reporter: Xuefu Zhang > > In a shared cluster, it's desirable for an admin to block large Spark jobs. > While there might not be a single metrics defining the size of a job, the > number of tasks is usually a good indicator. Thus, it would be useful for > Spark scheduler to block a job whose number of tasks reaches a configured > limit. By default, the limit could be just infinite, to retain the existing > behavior. > MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be > configured, which blocks a MR job at job submission time. > The proposed configuration is spark.job.max.tasks with a default value -1 > (infinite). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
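To make the proposal concrete, here is how the two limits suggested in the comment above might be set. Note that spark.job.max.tasks and spark.stage.max.tasks are only proposed names from this ticket, not settings that exist in any released Spark version, and the values are arbitrary.
{code}
import org.apache.spark.SparkConf

// Hypothetical settings for the proposed limits; neither key is a real Spark configuration today.
val conf = new SparkConf()
  .set("spark.job.max.tasks", "100000")  // proposed: reject any job with more tasks than this
  .set("spark.stage.max.tasks", "20000") // proposed: reject any single stage larger than this
{code}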
[jira] [Created] (SPARK-20972) rename HintInfo.isBroadcastable to forceBroadcast
Wenchen Fan created SPARK-20972: --- Summary: rename HintInfo.isBroadcastable to forceBroadcast Key: SPARK-20972 URL: https://issues.apache.org/jira/browse/SPARK-20972 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks
[ https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035478#comment-16035478 ] Marcelo Vanzin commented on SPARK-20662: bq. It's probably not a good idea to let one job take all resources while starving others. I'm pretty sure that's why resource managers have queues. What you want here is a client-controlled, opt-in, application-level "nicety config" that tells the application not to submit more than a limited number of tasks at a time. That control already exists: set a maximum number of executors for the app. Number of executors times number of cores = maximum number of concurrently running tasks. > Block jobs that have greater than a configured number of tasks > -- > > Key: SPARK-20662 > URL: https://issues.apache.org/jira/browse/SPARK-20662 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 >Reporter: Xuefu Zhang > > In a shared cluster, it's desirable for an admin to block large Spark jobs. > While there might not be a single metrics defining the size of a job, the > number of tasks is usually a good indicator. Thus, it would be useful for > Spark scheduler to block a job whose number of tasks reaches a configured > limit. By default, the limit could be just infinite, to retain the existing > behavior. > MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be > configured, which blocks a MR job at job submission time. > The proposed configuration is spark.job.max.tasks with a default value -1 > (infinite). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
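A sketch of the existing knobs referred to in the comment above; the numbers are arbitrary examples, not recommendations. With these settings an application can run at most 50 * 4 = 200 tasks concurrently.
{code}
import org.apache.spark.SparkConf

// Cap concurrent tasks by capping executors and cores per executor.
val conf = new SparkConf()
  .set("spark.executor.cores", "4")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.maxExecutors", "50")
// Without dynamic allocation, spark.executor.instances plays the same role:
//   .set("spark.executor.instances", "50")
{code}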
[jira] [Commented] (SPARK-20972) rename HintInfo.isBroadcastable to forceBroadcast
[ https://issues.apache.org/jira/browse/SPARK-20972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035479#comment-16035479 ] Apache Spark commented on SPARK-20972: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/18189 > rename HintInfo.isBroadcastable to forceBroadcast > - > > Key: SPARK-20972 > URL: https://issues.apache.org/jira/browse/SPARK-20972 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20972) rename HintInfo.isBroadcastable to forceBroadcast
[ https://issues.apache.org/jira/browse/SPARK-20972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20972: Assignee: Apache Spark (was: Wenchen Fan) > rename HintInfo.isBroadcastable to forceBroadcast > - > > Key: SPARK-20972 > URL: https://issues.apache.org/jira/browse/SPARK-20972 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20972) rename HintInfo.isBroadcastable to forceBroadcast
[ https://issues.apache.org/jira/browse/SPARK-20972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20972: Assignee: Wenchen Fan (was: Apache Spark) > rename HintInfo.isBroadcastable to forceBroadcast > - > > Key: SPARK-20972 > URL: https://issues.apache.org/jira/browse/SPARK-20972 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks
[ https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035481#comment-16035481 ] Sean Owen commented on SPARK-20662: --- It's not equivalent to blocking the job, but why is that more desirable? Your use case is what resource queues, and things like the capacity scheduler, are for. Yes, you limit the amount of resources a person is entitled to, for just that reason. A job that's blocked for being "too big" during busy hours may be fine to run off hours, but this proposal would mean the job is never runnable at all. The capacity scheduler, in contrast, can let someone use resources when nobody else wants them but preempt when someone else needs them, so it doesn't really cost anyone else. It just doesn't seem like this is a wheel to reinvent in Spark. Possibly in its own standalone resource manager, but if you need functionality like this, you're not likely to get by with a standalone cluster anyway. > Block jobs that have greater than a configured number of tasks > -- > > Key: SPARK-20662 > URL: https://issues.apache.org/jira/browse/SPARK-20662 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 >Reporter: Xuefu Zhang > > In a shared cluster, it's desirable for an admin to block large Spark jobs. > While there might not be a single metrics defining the size of a job, the > number of tasks is usually a good indicator. Thus, it would be useful for > Spark scheduler to block a job whose number of tasks reaches a configured > limit. By default, the limit could be just infinite, to retain the existing > behavior. > MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be > configured, which blocks a MR job at job submission time. > The proposed configuration is spark.job.max.tasks with a default value -1 > (infinite). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks
[ https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035487#comment-16035487 ] Marcelo Vanzin commented on SPARK-20662: BTW if you really, really, really think this is a good idea and you really want it, you can write a listener that just cancels jobs or kills the application whenever a stage with more than x tasks is submitted. No need for any changes in Spark. > Block jobs that have greater than a configured number of tasks > -- > > Key: SPARK-20662 > URL: https://issues.apache.org/jira/browse/SPARK-20662 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 >Reporter: Xuefu Zhang > > In a shared cluster, it's desirable for an admin to block large Spark jobs. > While there might not be a single metrics defining the size of a job, the > number of tasks is usually a good indicator. Thus, it would be useful for > Spark scheduler to block a job whose number of tasks reaches a configured > limit. By default, the limit could be just infinite, to retain the existing > behavior. > MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be > configured, which blocks a MR job at job submission time. > The proposed configuration is spark.job.max.tasks with a default value -1 > (infinite). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
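A rough Scala sketch of the kind of listener suggested above, assuming Spark 2.x. The class name and the limit are made up for illustration, cancelling only the offending stage is one of several possible reactions, and whether cancelling from inside a listener callback is acceptable for a given deployment is not verified here.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted}

// Cancels any stage whose task count exceeds a caller-chosen limit.
class StageSizeLimitListener(sc: SparkContext, maxTasksPerStage: Int) extends SparkListener {
  override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
    val info = event.stageInfo
    if (info.numTasks > maxTasksPerStage) {
      // Harsher alternatives: sc.cancelJobGroup(...) or sc.stop() to kill the whole application.
      sc.cancelStage(info.stageId)
    }
  }
}

// Registration, e.g. right after the SparkContext is created:
// sc.addSparkListener(new StageSizeLimitListener(sc, maxTasksPerStage = 10000))
{code}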
[jira] [Commented] (SPARK-17078) show estimated stats when doing explain
[ https://issues.apache.org/jira/browse/SPARK-17078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035511#comment-16035511 ] Apache Spark commented on SPARK-17078: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/18190 > show estimated stats when doing explain > --- > > Key: SPARK-17078 > URL: https://issues.apache.org/jira/browse/SPARK-17078 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.0.0 >Reporter: Ron Hu >Assignee: Zhenhua Wang > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20737) Mechanism for cleanup hooks, for structured-streaming sinks on executor shutdown.
[ https://issues.apache.org/jira/browse/SPARK-20737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-20737. Resolution: Won't Fix > Mechanism for cleanup hooks, for structured-streaming sinks on executor > shutdown. > - > > Key: SPARK-20737 > URL: https://issues.apache.org/jira/browse/SPARK-20737 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Prashant Sharma > Labels: Kafka > > Add a standard way of cleanup during shutdown of executors for structured > streaming sinks in general and KafkaSink in particular. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks
[ https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035519#comment-16035519 ] Xuefu Zhang commented on SPARK-20662: - I can understand the counterargument here if Spark is targeted at single-user cases. For multiple users in an enterprise deployment, it's good to provide admin knobs. In this case, an admin just wants to block bad jobs, and I don't think the RM meets that goal. This is actually implemented in Hive on Spark; however, I thought it was generic and might be desirable for others as well. In addition, blocking a job at submission is better than killing it after it has started to run. If Spark doesn't think this is useful, then very well. > Block jobs that have greater than a configured number of tasks > -- > > Key: SPARK-20662 > URL: https://issues.apache.org/jira/browse/SPARK-20662 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 >Reporter: Xuefu Zhang > > In a shared cluster, it's desirable for an admin to block large Spark jobs. > While there might not be a single metrics defining the size of a job, the > number of tasks is usually a good indicator. Thus, it would be useful for > Spark scheduler to block a job whose number of tasks reaches a configured > limit. By default, the limit could be just infinite, to retain the existing > behavior. > MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be > configured, which blocks a MR job at job submission time. > The proposed configuration is spark.job.max.tasks with a default value -1 > (infinite). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks
[ https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035525#comment-16035525 ] Marcelo Vanzin commented on SPARK-20662: bq. For multiple users in an enterprise deployment, it's good to provide admin knobs. In this case, an admin just wanted to block bad jobs. Your definition of a bad job is the problem (well, one of the problems). "Number of tasks" is not an indication that a job is large. Each task may be really small. Spark shouldn't be in the job of defining what is a good or bad job, and that doesn't mean it's targeted at single user vs. multi user environments. It's just something that needs to be controlled at a different layer. If the admin is really worried about resource usage, he has control over the RM, and shouldn't rely on applications behaving nicely to enforce those controls. Applications misbehave. Users mess with configuration. Those are all things outside of the admin's control. > Block jobs that have greater than a configured number of tasks > -- > > Key: SPARK-20662 > URL: https://issues.apache.org/jira/browse/SPARK-20662 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 >Reporter: Xuefu Zhang > > In a shared cluster, it's desirable for an admin to block large Spark jobs. > While there might not be a single metrics defining the size of a job, the > number of tasks is usually a good indicator. Thus, it would be useful for > Spark scheduler to block a job whose number of tasks reaches a configured > limit. By default, the limit could be just infinite, to retain the existing > behavior. > MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be > configured, which blocks a MR job at job submission time. > The proposed configuration is spark.job.max.tasks with a default value -1 > (infinite). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20973) insert table fail caused by unable to fetch data definition file from remote hdfs
Yunjian Zhang created SPARK-20973: - Summary: insert table fail caused by unable to fetch data definition file from remote hdfs Key: SPARK-20973 URL: https://issues.apache.org/jira/browse/SPARK-20973 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Yunjian Zhang I implemented my own Hive SerDe to handle special data files, which needs to read a data definition during processing. The process includes: 1. read the definition file location from TBLPROPERTIES 2. read the file content as per step 1 3. initialize the SerDe based on step 2. // DDL of the table is as below: - CREATE EXTERNAL TABLE dw_user_stg_txt_out ROW FORMAT SERDE 'com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe' STORED AS INPUTFORMAT 'com.ebay.dss.gdr.mapred.AbAsAvroInputFormat' OUTPUTFORMAT 'com.ebay.dss.gdr.hive.ql.io.ab.AvroAsAbOutputFormat' LOCATION 'hdfs://${remote_hdfs}/user/data' TBLPROPERTIES ( 'com.ebay.dss.dml.file' = 'hdfs://${remote_hdfs}/dml/user.dml' ) // insert statement insert overwrite table dw_user_stg_txt_out select * from dw_user_stg_txt_avro; // fails with ERROR 17/06/02 15:46:34 ERROR SparkSQLDriver: Failed in [insert overwrite table dw_user_stg_txt_out select * from dw_user_stg_txt_avro] java.lang.RuntimeException: FAILED to get dml file from: hdfs://${remote-hdfs}/dml/user.dml at com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe.initialize(AbvroSerDe.java:109) at org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:160) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:258) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20973) insert table fail caused by unable to fetch data definition file from remote hdfs
[ https://issues.apache.org/jira/browse/SPARK-20973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035597#comment-16035597 ] Yunjian Zhang commented on SPARK-20973: --- I did check the source code and add a patch to fix the insert issue > insert table fail caused by unable to fetch data definition file from remote > hdfs > -- > > Key: SPARK-20973 > URL: https://issues.apache.org/jira/browse/SPARK-20973 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yunjian Zhang > Labels: patch > > I implemented my own hive serde to handle special data files which needs to > read data definition during process. > the process include > 1.read definition file location from TBLPROPERTIES > 2.read file content as per step 1 > 3.init serde base on step 2. > //DDL of the table as below: > - > CREATE EXTERNAL TABLE dw_user_stg_txt_out > ROW FORMAT SERDE 'com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe' > STORED AS > INPUTFORMAT 'com.ebay.dss.gdr.mapred.AbAsAvroInputFormat' > OUTPUTFORMAT 'com.ebay.dss.gdr.hive.ql.io.ab.AvroAsAbOutputFormat' > LOCATION 'hdfs://${remote_hdfs}/user/data' > TBLPROPERTIES ( > 'com.ebay.dss.dml.file' = 'hdfs://${remote_hdfs}/dml/user.dml' > ) > // insert statement > insert overwrite table dw_user_stg_txt_out select * from dw_user_stg_txt_avro; > //fail with ERROR > 17/06/02 15:46:34 ERROR SparkSQLDriver: Failed in [insert overwrite table > dw_user_stg_txt_out select * from dw_user_stg_txt_avro] > java.lang.RuntimeException: FAILED to get dml file from: > hdfs://${remote-hdfs}/dml/user.dml > at > com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe.initialize(AbvroSerDe.java:109) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:160) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:258) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20973) insert table fail caused by unable to fetch data definition file from remote hdfs
[ https://issues.apache.org/jira/browse/SPARK-20973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035597#comment-16035597 ] Yunjian Zhang edited comment on SPARK-20973 at 6/2/17 11:06 PM: I did check the source code and add a patch to fix the insert issue as below, unable to attach file here, so just past the content as well. -- --- a/./workspace1/spark-2.1.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala +++ b/./workspace/git/gdr/spark/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala @@ -57,7 +57,7 @@ private[hive] class SparkHiveWriterContainer( extends Logging with HiveInspectors with Serializable { - + private val now = new Date() private val tableDesc: TableDesc = fileSinkConf.getTableInfo // Add table properties from storage handler to jobConf, so any custom storage @@ -154,6 +154,12 @@ private[hive] class SparkHiveWriterContainer( conf.value.setBoolean("mapred.task.is.map", true) conf.value.setInt("mapred.task.partition", splitID) } + + def newSerializer(tableDesc: TableDesc): Serializer = { +val serializer = tableDesc.getDeserializerClass.newInstance().asInstanceOf[Serializer] +serializer.initialize(null, tableDesc.getProperties) +serializer + } def newSerializer(jobConf: JobConf, tableDesc: TableDesc): Serializer = { val serializer = tableDesc.getDeserializerClass.newInstance().asInstanceOf[Serializer] @@ -162,10 +168,11 @@ private[hive] class SparkHiveWriterContainer( } protected def prepareForWrite() = { -val serializer = newSerializer(jobConf, fileSinkConf.getTableInfo) +val serializer = newSerializer(conf.value, fileSinkConf.getTableInfo) +logInfo("CHECK table deser:" + fileSinkConf.getTableInfo.getDeserializer(conf.value)) val standardOI = ObjectInspectorUtils .getStandardObjectInspector( -fileSinkConf.getTableInfo.getDeserializer.getObjectInspector, + fileSinkConf.getTableInfo.getDeserializer(conf.value).getObjectInspector, ObjectInspectorCopyOption.JAVA) .asInstanceOf[StructObjectInspector] was (Author: daniel.yj.zh...@gmail.com): I did check the source code and add a patch to fix the insert issue > insert table fail caused by unable to fetch data definition file from remote > hdfs > -- > > Key: SPARK-20973 > URL: https://issues.apache.org/jira/browse/SPARK-20973 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yunjian Zhang > Labels: patch > > I implemented my own hive serde to handle special data files which needs to > read data definition during process. > the process include > 1.read definition file location from TBLPROPERTIES > 2.read file content as per step 1 > 3.init serde base on step 2. 
> //DDL of the table as below: > - > CREATE EXTERNAL TABLE dw_user_stg_txt_out > ROW FORMAT SERDE 'com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe' > STORED AS > INPUTFORMAT 'com.ebay.dss.gdr.mapred.AbAsAvroInputFormat' > OUTPUTFORMAT 'com.ebay.dss.gdr.hive.ql.io.ab.AvroAsAbOutputFormat' > LOCATION 'hdfs://${remote_hdfs}/user/data' > TBLPROPERTIES ( > 'com.ebay.dss.dml.file' = 'hdfs://${remote_hdfs}/dml/user.dml' > ) > // insert statement > insert overwrite table dw_user_stg_txt_out select * from dw_user_stg_txt_avro; > //fail with ERROR > 17/06/02 15:46:34 ERROR SparkSQLDriver: Failed in [insert overwrite table > dw_user_stg_txt_out select * from dw_user_stg_txt_avro] > java.lang.RuntimeException: FAILED to get dml file from: > hdfs://${remote-hdfs}/dml/user.dml > at > com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe.initialize(AbvroSerDe.java:109) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:160) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:258) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20974) we should run REPL tests if SQL core has code changes
Wenchen Fan created SPARK-20974: --- Summary: we should run REPL tests if SQL core has code changes Key: SPARK-20974 URL: https://issues.apache.org/jira/browse/SPARK-20974 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20974) we should run REPL tests if SQL core has code changes
[ https://issues.apache.org/jira/browse/SPARK-20974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035614#comment-16035614 ] Apache Spark commented on SPARK-20974: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/18191 > we should run REPL tests if SQL core has code changes > - > > Key: SPARK-20974 > URL: https://issues.apache.org/jira/browse/SPARK-20974 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20974) we should run REPL tests if SQL core has code changes
[ https://issues.apache.org/jira/browse/SPARK-20974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20974: Assignee: Wenchen Fan (was: Apache Spark) > we should run REPL tests if SQL core has code changes > - > > Key: SPARK-20974 > URL: https://issues.apache.org/jira/browse/SPARK-20974 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20974) we should run REPL tests if SQL core has code changes
[ https://issues.apache.org/jira/browse/SPARK-20974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20974: Assignee: Apache Spark (was: Wenchen Fan) > we should run REPL tests if SQL core has code changes > - > > Key: SPARK-20974 > URL: https://issues.apache.org/jira/browse/SPARK-20974 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20974) we should run REPL tests if SQL core has code changes
[ https://issues.apache.org/jira/browse/SPARK-20974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20974. - Resolution: Fixed Fix Version/s: 2.2.0 2.1.2 2.0.3 > we should run REPL tests if SQL core has code changes > - > > Key: SPARK-20974 > URL: https://issues.apache.org/jira/browse/SPARK-20974 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.3, 2.1.2, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19732) DataFrame.fillna() does not work for bools in PySpark
[ https://issues.apache.org/jira/browse/SPARK-19732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-19732. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18164 [https://github.com/apache/spark/pull/18164] > DataFrame.fillna() does not work for bools in PySpark > - > > Key: SPARK-19732 > URL: https://issues.apache.org/jira/browse/SPARK-19732 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Len Frodgers >Priority: Minor > Fix For: 2.3.0 > > > In PySpark, the fillna function of DataFrame inadvertently casts bools to > ints, so fillna cannot be used to fill True/False. > e.g. > `spark.createDataFrame([Row(a=True),Row(a=None)]).fillna(True).collect()` > yields > `[Row(a=True), Row(a=None)]` > It should be a=True for the second Row > The cause is this bit of code: > {code} > if isinstance(value, (int, long)): > value = float(value) > {code} > There needs to be a separate check for isinstance(bool), since in python, > bools are ints too > Additionally there's another anomaly: > Spark (and pyspark) supports filling of bools if you specify the args as a > map: > {code} > fillna({"a": False}) > {code} > , but not if you specify it as > {code} > fillna(False) > {code} > This is because (scala-)Spark has no > {code} > def fill(value: Boolean): DataFrame = fill(value, df.columns) > {code} > method. I find that strange/buggy -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
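Since the description notes that the map form of fill / fillna already accepts booleans, here is a sketch of that workaround on the Scala side; the column name "a" mirrors the example above and the two-row DataFrame is made up for illustration.
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FillNaBoolWorkaround").getOrCreate()

// A two-row DataFrame whose boolean column "a" contains one null.
val df = spark.sql("SELECT CAST(NULL AS BOOLEAN) AS a UNION ALL SELECT true AS a")

// The map variant handles booleans, while a single-value fill(false) overload
// did not exist before the fix for this ticket.
val filled = df.na.fill(Map("a" -> false))
filled.show()
{code}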
[jira] [Assigned] (SPARK-19732) DataFrame.fillna() does not work for bools in PySpark
[ https://issues.apache.org/jira/browse/SPARK-19732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-19732: - Assignee: Ruben Berenguel > DataFrame.fillna() does not work for bools in PySpark > - > > Key: SPARK-19732 > URL: https://issues.apache.org/jira/browse/SPARK-19732 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Len Frodgers >Assignee: Ruben Berenguel >Priority: Minor > Fix For: 2.3.0 > > > In PySpark, the fillna function of DataFrame inadvertently casts bools to > ints, so fillna cannot be used to fill True/False. > e.g. > `spark.createDataFrame([Row(a=True),Row(a=None)]).fillna(True).collect()` > yields > `[Row(a=True), Row(a=None)]` > It should be a=True for the second Row > The cause is this bit of code: > {code} > if isinstance(value, (int, long)): > value = float(value) > {code} > There needs to be a separate check for isinstance(bool), since in python, > bools are ints too > Additionally there's another anomaly: > Spark (and pyspark) supports filling of bools if you specify the args as a > map: > {code} > fillna({"a": False}) > {code} > , but not if you specify it as > {code} > fillna(False) > {code} > This is because (scala-)Spark has no > {code} > def fill(value: Boolean): DataFrame = fill(value, df.columns) > {code} > method. I find that strange/buggy -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20950) Improve diskWriteBufferSize configurable
[ https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20950: -- Summary: Improve diskWriteBufferSize configurable (was: Improve Serializerbuffersize configurable) > Improve diskWriteBufferSize configurable > > > Key: SPARK-20950 > URL: https://issues.apache.org/jira/browse/SPARK-20950 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Trivial > > 1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize > of UnsafeShuffleWriter. > 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in > mergeSpillsWithFileStream function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20950) Improve diskWriteBufferSize configurable
[ https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20950: -- Description: 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of ShuffleExternalSorter. 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in mergeSpillsWithFileStream function. was: 1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize of UnsafeShuffleWriter. 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in mergeSpillsWithFileStream function. > Improve diskWriteBufferSize configurable > > > Key: SPARK-20950 > URL: https://issues.apache.org/jira/browse/SPARK-20950 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Trivial > > 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize > of ShuffleExternalSorter. > 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in > mergeSpillsWithFileStream function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
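A sketch of how the buffer discussed in this ticket might be tuned once the change lands; spark.shuffle.spill.diskWriteBufferSize is the key proposed here, not a setting guaranteed to exist in released versions, and the value is an arbitrary example.
{code}
import org.apache.spark.SparkConf

// Proposed key from this ticket; 1 MB is an arbitrary example value.
val conf = new SparkConf()
  .set("spark.shuffle.spill.diskWriteBufferSize", (1024 * 1024).toString)
{code}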
[jira] [Updated] (SPARK-20025) Driver fail over will not work, if SPARK_LOCAL* env is set.
[ https://issues.apache.org/jira/browse/SPARK-20025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-20025: Target Version/s: 2.3.0 (was: 2.2.0) > Driver fail over will not work, if SPARK_LOCAL* env is set. > --- > > Key: SPARK-20025 > URL: https://issues.apache.org/jira/browse/SPARK-20025 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Prashant Sharma > > In a bare metal system with No DNS setup, spark may be configured with > SPARK_LOCAL* for IP and host properties. > During a driver failover, in cluster deployment mode. SPARK_LOCAL* should be > ignored while auto deploying and should be picked up from target system's > local environment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org