[jira] [Comment Edited] (SPARK-20943) Correct BypassMergeSortShuffleWriter's comment

2017-06-02 Thread CanBin Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033988#comment-16033988
 ] 

CanBin Zheng edited comment on SPARK-20943 at 6/2/17 7:00 AM:
--

Look at these two cases: in each, either an Aggregator or an Ordering is defined but 
mapSideCombine is false, and both run with BypassMergeSortShuffleWriter:
{code}
  // Has Aggregator defined
  @Test
  def testGroupByKeyUsingBypassMergeSort(): Unit = {
    val data = List("Hello", "World", "Hello", "One", "Two")
    val rdd = sc.parallelize(data).map((_, 1)).groupByKey(2)
    rdd.collect()
  }

  // Has Ordering defined
  @Test
  def testShuffleWithKeyOrderingUsingBypassMergeSort(): Unit = {
    val data = List("Hello", "World", "Hello", "One", "Two")
    val rdd = sc.parallelize(data).map((_, 1))
    val ord = implicitly[Ordering[String]]
    val shuffledRDD = new ShuffledRDD[String, Int, Int](rdd, new HashPartitioner(2)).setKeyOrdering(ord)
    shuffledRDD.collect()
  }
{code}
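
For reference, the write-path selection lives in SortShuffleWriter.shouldBypassMergeSort and looks roughly like the 
following (paraphrased and simplified, not the verbatim source):
{code}
import org.apache.spark.{ShuffleDependency, SparkConf}

// Roughly the condition that selects BypassMergeSortShuffleWriter: map-side combine must be
// off and the number of reduce partitions must not exceed the bypass threshold. Whether an
// Aggregator or Ordering is defined does not matter by itself.
def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  if (dep.mapSideCombine) {
    false  // map-side aggregation needs the sort-based path
  } else {
    val bypassMergeThreshold = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}
{code}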


was (Author: canbinzheng):
Look at these two cases.
{code}
  // Has Aggregator defined
  @Test
  def testGroupByKeyUsingBypassMergeSort(): Unit = {
    val data = List("Hello", "World", "Hello", "One", "Two")
    val rdd = sc.parallelize(data).map((_, 1)).groupByKey(2)
    rdd.collect()
  }

  // Has Ordering defined
  @Test
  def testShuffleWithKeyOrderingUsingBypassMergeSort(): Unit = {
    val data = List("Hello", "World", "Hello", "One", "Two")
    val rdd = sc.parallelize(data).map((_, 1))
    val ord = implicitly[Ordering[String]]
    val shuffledRDD = new ShuffledRDD[String, Int, Int](rdd, new HashPartitioner(2)).setKeyOrdering(ord)
    shuffledRDD.collect()
  }
{code}

> Correct BypassMergeSortShuffleWriter's comment
> --
>
> Key: SPARK-20943
> URL: https://issues.apache.org/jira/browse/SPARK-20943
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Shuffle
>Affects Versions: 2.1.1
>Reporter: CanBin Zheng
>Priority: Trivial
>  Labels: starter
>
> There are some comments written in BypassMergeSortShuffleWriter.java about 
> when to select this write path; the three required conditions are described 
> as follows:  
> 1. no Ordering is specified, and
> 2. no Aggregator is specified, and
> 3. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold
> Obviously, the conditions as written are partially wrong and misleading; the 
> right conditions should be:
> 1. map-side combine is false, and
> 2. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20939) Do not duplicate user-defined functions while optimizing logical query plans

2017-06-02 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034247#comment-16034247
 ] 

Takeshi Yamamuro commented on SPARK-20939:
--

This is not a bug, so I changed the type to "improvement".
I feel this optimization still makes sense in `most` cases, so I think we do 
not have a strong reason to remove this rule.
In future releases, if Spark implements functionality to estimate join 
cardinality, we could make this rule smarter...
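
For reference, a minimal sketch that reproduces the duplication described below (not from the JIRA; the UDF and 
DataFrame names are illustrative, and `spark` is assumed to be an active SparkSession):
{code}
import org.apache.spark.sql.functions.{col, udf}

// Stand-in for an expensive user-defined predicate.
val expensive = udf { (x: Long) => Thread.sleep(1); x % 2 == 0 }

val a = spark.range(100).toDF("a")
val b = spark.range(100).toDF("b")

// The filter is applied to one side only; after optimization the plan may show
// the inferred UDF filter on both relations.
val joined = a.filter(expensive(col("a"))).join(b, col("a") === col("b"))
joined.explain(true)
{code}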

> Do not duplicate user-defined functions while optimizing logical query plans
> 
>
> Key: SPARK-20939
> URL: https://issues.apache.org/jira/browse/SPARK-20939
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Lovasoa
>Priority: Minor
>  Labels: logical_plan, optimizer
>
> Currently, while optimizing a query plan, spark pushes filters down the query 
> plan tree, so that 
> {code:title=LogicalPlan}
> Join Inner, (a = b)
> +- Filter UDF(a)
>    +- Relation A
> +- Relation B
> {code}
> becomes 
> {code:title=Optimized LogicalPlan}
> Join Inner, (a = b)
> +- Filter UDF(a)
>    +- Relation A
> +- Filter UDF(b)
>    +- Relation B
> {code}
> In general, it is a good thing to push down filters as it reduces the number 
> of records that will go through the join.
> However, in the case where the filter is a user-defined function (UDF), we 
> cannot know if the cost of executing the function twice will be higher than 
> the eventual cost of joining more elements or not.
> So I think that the optimizer shouldn't move the user-defined function in the 
> query plan tree. The user will still be able to duplicate the function if he 
> wants to.
> See this question on stackoverflow: 
> https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20675) Support Index to skip when retrieval disk structure in CoGroupedRDD

2017-06-02 Thread lyc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034256#comment-16034256
 ] 

lyc commented on SPARK-20675:
-

What do you mean by `StreamBuffer`? As of commit `6d05c1` (Jun 1, 2017), there is 
only `ExternalAppendOnlyMap`, and the computation seems to compute only the 
specific partition, so there is no redundancy.

> Support Index to skip when retrieval disk structure in CoGroupedRDD 
> 
>
> Key: SPARK-20675
> URL: https://issues.apache.org/jira/browse/SPARK-20675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>
> CoGroupedRDD's compute() retrieves each StreamBuffer (a disk structure that 
> maintains key-value pairs sorted by key) and merges the values for the same key into one.
> So I think adding a sequence index file, or appending an index section at the head of 
> the temporary shuffle file so we can seek to the appropriate position, could skip a lot 
> of unnecessary scanning.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard

2017-06-02 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-20964:


 Summary: Make some keywords reserved along with the ANSI/SQL 
standard
 Key: SPARK-20964
 URL: https://issues.apache.org/jira/browse/SPARK-20964
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Takeshi Yamamuro
Priority: Minor


Spark currently treats as non-reserved many words that are essentially reserved in 
the ANSI/SQL standard 
(http://developer.mimer.se/validator/sql-reserved-words.tml). 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709

This is because many datasources (for instance twitter4j) 
unfortunately use reserved keywords for column names (see [~hvanhovell]'s 
comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). We 
might fix this issue in a future major release.






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20662) Block jobs that have greater than a configured number of tasks

2017-06-02 Thread lyc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034282#comment-16034282
 ] 

lyc edited comment on SPARK-20662 at 6/2/17 7:36 AM:
-

Do you mean `mapreduce.job.running.map.limit`? That conf means `The maximum 
number of simultaneous map tasks per job. There is no limit if this value is 0 
or negative.` 

So it limits task concurrency: the behavior seems to be that scheduling of new 
tasks stops when the job has that many running tasks and resumes when some 
tasks finish.

This seems doable in `DAGScheduler`; I'd like to give it a try if the idea is 
accepted.  cc [~vanzin]


was (Author: lyc):
Do you mean `mapreduce.job.running.map.limit`? That conf means `The maximum 
number of simultaneous map tasks per job. There is no limit if this value is 0 
or negative.` 

So it limits task concurrency: the behavior seems to be that scheduling of new 
tasks stops when the job has that many running tasks and resumes when some 
tasks finish.

This seems doable in `DAGScheduler`; I'd like to give it a try if the idea is 
accepted.  cc @Marcelo Vanzin

> Block jobs that have greater than a configured number of tasks
> --
>
> Key: SPARK-20662
> URL: https://issues.apache.org/jira/browse/SPARK-20662
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xuefu Zhang
>
> In a shared cluster, it's desirable for an admin to block large Spark jobs. 
> While there might not be a single metrics defining the size of a job, the 
> number of tasks is usually a good indicator. Thus, it would be useful for 
> Spark scheduler to block a job whose number of tasks reaches a configured 
> limit. By default, the limit could be just infinite, to retain the existing 
> behavior.
> MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be 
> configured, which blocks a MR job at job submission time.
> The proposed configuration is spark.job.max.tasks with a default value -1 
> (infinite).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks

2017-06-02 Thread lyc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034282#comment-16034282
 ] 

lyc commented on SPARK-20662:
-

Do you mean `mapreduce.job.running.map.limit`? That conf means `The maximum 
number of simultaneous map tasks per job. There is no limit if this value is 0 
or negative.` 

So it limits task concurrency: the behavior seems to be that scheduling of new 
tasks stops when the job has that many running tasks and resumes when some 
tasks finish.

This seems doable in `DAGScheduler`; I'd like to give it a try if the idea is 
accepted.  cc @Marcelo Vanzin
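
For illustration only, a rough sketch of what a submission-time check could look like (hypothetical code, not part of 
Spark; `spark.job.max.tasks` is the configuration proposed in this ticket):
{code}
import org.apache.spark.{SparkConf, SparkException}

// Hypothetical sketch: reject a job at submission time if it would run more tasks
// than the proposed spark.job.max.tasks limit allows (-1 means unlimited).
def checkJobSize(totalTasks: Int, conf: SparkConf): Unit = {
  val maxTasks = conf.getInt("spark.job.max.tasks", -1)
  if (maxTasks >= 0 && totalTasks > maxTasks) {
    throw new SparkException(
      s"Job rejected: it would run $totalTasks tasks, exceeding spark.job.max.tasks=$maxTasks")
  }
}
{code}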

> Block jobs that have greater than a configured number of tasks
> --
>
> Key: SPARK-20662
> URL: https://issues.apache.org/jira/browse/SPARK-20662
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xuefu Zhang
>
> In a shared cluster, it's desirable for an admin to block large Spark jobs. 
> While there might not be a single metrics defining the size of a job, the 
> number of tasks is usually a good indicator. Thus, it would be useful for 
> Spark scheduler to block a job whose number of tasks reaches a configured 
> limit. By default, the limit could be just infinite, to retain the existing 
> behavior.
> MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be 
> configured, which blocks a MR job at job submission time.
> The proposed configuration is spark.job.max.tasks with a default value -1 
> (infinite).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20965) Support PREPARE and EXECUTE statements

2017-06-02 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-20965:


 Summary: Support PREPARE and EXECUTE statements
 Key: SPARK-20965
 URL: https://issues.apache.org/jira/browse/SPARK-20965
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Takeshi Yamamuro
Priority: Minor


Parameterized queries might help some users, so we might support PREPARE and 
EXECUTE statements, following the ANSI/SQL standard (e.g., they are useful 
for users who frequently run the same queries).

{code}
PREPARE sqlstmt (int) AS SELECT * FROM t WHERE id = $1;
EXECUTE sqlstmt(1);
{code}

One of implementation references: 
https://www.postgresql.org/docs/current/static/sql-prepare.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-20958:
---
Description: 
We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 and 
avro 1.7.7 used by spark-core 2.2.0-rc2.

Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 
and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons 
mentioned in [PR 
#17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
Therefore, we don't really have many choices here and have to roll back 
parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.

  was:
We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 and 
avro 1.7.7 used by spark-core 2.2.0-rc2.

, Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 and 
1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons 
mentioned in [PR 
#17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
Therefore, we don't really have many choices here and have to roll back 
parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.


> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20943) Correct BypassMergeSortShuffleWriter's comment

2017-06-02 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034296#comment-16034296
 ] 

Saisai Shao commented on SPARK-20943:
-

I think the original purpose of the comment is to say that {{Aggregator}} and 
{{Ordering}} are not used in the map-side shuffle write; the {{Aggregator}} and 
{{Ordering}} set on a ShuffledRDD are only used on the shuffle reader side.

> Correct BypassMergeSortShuffleWriter's comment
> --
>
> Key: SPARK-20943
> URL: https://issues.apache.org/jira/browse/SPARK-20943
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Shuffle
>Affects Versions: 2.1.1
>Reporter: CanBin Zheng
>Priority: Trivial
>  Labels: starter
>
> There are some comments written in BypassMergeSortShuffleWriter.java about 
> when to select this write path; the three required conditions are described 
> as follows:  
> 1. no Ordering is specified, and
> 2. no Aggregator is specified, and
> 3. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold
> Obviously, the conditions as written are partially wrong and misleading; the 
> right conditions should be:
> 1. map-side combine is false, and
> 2. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034310#comment-16034310
 ] 

Cheng Lian commented on SPARK-20958:


[~rdblue] I think the root cause here is that we cherry-picked parquet-mr [PR 
#318|https://github.com/apache/parquet-mr/pull/318] into parquet-mr 1.8.2 and 
introduced this avro upgrade.

I tried to roll parquet-mr back to 1.8.1, but it doesn't work well because it 
brings back 
[PARQUET-389|https://issues.apache.org/jira/browse/PARQUET-389] and breaks some 
test cases involving schema evolution. 

Would it be possible to have a parquet-mr 1.8.3 or 1.8.2.1 release with [PR 
#318|https://github.com/apache/parquet-mr/pull/318] reverted from 1.8.2? I 
think cherry-picking that PR is also problematic for parquet-mr because it 
introduces a backward-incompatible dependency change in a maintenance release.

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20966) Table data is not sorted by startTime time desc, time is not formatted and redundant code in JDBC/ODBC Server page.

2017-06-02 Thread guoxiaolongzte (JIRA)
guoxiaolongzte created SPARK-20966:
--

 Summary: Table data is not sorted by startTime time desc, time is 
not formatted and redundant code in JDBC/ODBC Server page.
 Key: SPARK-20966
 URL: https://issues.apache.org/jira/browse/SPARK-20966
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.1.1
Reporter: guoxiaolongzte
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks

2017-06-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034334#comment-16034334
 ] 

Sean Owen commented on SPARK-20662:
---

Isn't this better handled by the resource manager? For example, YARN lets you 
cap these things in a bunch of ways already, and the resource manager is a 
better place to manage, well, resources.

> Block jobs that have greater than a configured number of tasks
> --
>
> Key: SPARK-20662
> URL: https://issues.apache.org/jira/browse/SPARK-20662
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xuefu Zhang
>
> In a shared cluster, it's desirable for an admin to block large Spark jobs. 
> While there might not be a single metrics defining the size of a job, the 
> number of tasks is usually a good indicator. Thus, it would be useful for 
> Spark scheduler to block a job whose number of tasks reaches a configured 
> limit. By default, the limit could be just infinite, to retain the existing 
> behavior.
> MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be 
> configured, which blocks a MR job at job submission time.
> The proposed configuration is spark.job.max.tasks with a default value -1 
> (infinite).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20966) Table data is not sorted by startTime time desc, time is not formatted and redundant code in JDBC/ODBC Server page.

2017-06-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034345#comment-16034345
 ] 

Apache Spark commented on SPARK-20966:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/18186

> Table data is not sorted by startTime time desc, time is not formatted and 
> redundant code in JDBC/ODBC Server page.
> ---
>
> Key: SPARK-20966
> URL: https://issues.apache.org/jira/browse/SPARK-20966
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: guoxiaolongzte
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20966) Table data is not sorted by startTime time desc, time is not formatted and redundant code in JDBC/ODBC Server page.

2017-06-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20966:


Assignee: Apache Spark

> Table data is not sorted by startTime time desc, time is not formatted and 
> redundant code in JDBC/ODBC Server page.
> ---
>
> Key: SPARK-20966
> URL: https://issues.apache.org/jira/browse/SPARK-20966
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: guoxiaolongzte
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20966) Table data is not sorted by startTime time desc, time is not formatted and redundant code in JDBC/ODBC Server page.

2017-06-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20966:


Assignee: (was: Apache Spark)

> Table data is not sorted by startTime time desc, time is not formatted and 
> redundant code in JDBC/ODBC Server page.
> ---
>
> Key: SPARK-20966
> URL: https://issues.apache.org/jira/browse/SPARK-20966
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: guoxiaolongzte
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20966) Table data is not sorted by startTime time desc, time is not formatted and redundant code in JDBC/ODBC Server page.

2017-06-02 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-20966:
---
Description: 
Table data is not sorted by startTime desc in the JDBC/ODBC Server page.
Time values are not formatted in the JDBC/ODBC Server page.
There is redundant code in ThriftServerSessionPage.scala.

> Table data is not sorted by startTime time desc, time is not formatted and 
> redundant code in JDBC/ODBC Server page.
> ---
>
> Key: SPARK-20966
> URL: https://issues.apache.org/jira/browse/SPARK-20966
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: guoxiaolongzte
>Priority: Minor
>
> Table data is not sorted by startTime desc in the JDBC/ODBC Server page.
> Time values are not formatted in the JDBC/ODBC Server page.
> There is redundant code in ThriftServerSessionPage.scala.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20967) SharedState.externalCatalog is not really lazy

2017-06-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20967:

Issue Type: Improvement  (was: Bug)

> SharedState.externalCatalog is not really lazy
> --
>
> Key: SPARK-20967
> URL: https://issues.apache.org/jira/browse/SPARK-20967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20967) SharedState.externalCatalog is not really lazy

2017-06-02 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-20967:
---

 Summary: SharedState.externalCatalog is not really lazy
 Key: SPARK-20967
 URL: https://issues.apache.org/jira/browse/SPARK-20967
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20967) SharedState.externalCatalog is not really lazy

2017-06-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034372#comment-16034372
 ] 

Apache Spark commented on SPARK-20967:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18187

> SharedState.externalCatalog is not really lazy
> --
>
> Key: SPARK-20967
> URL: https://issues.apache.org/jira/browse/SPARK-20967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20967) SharedState.externalCatalog is not really lazy

2017-06-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20967:


Assignee: Wenchen Fan  (was: Apache Spark)

> SharedState.externalCatalog is not really lazy
> --
>
> Key: SPARK-20967
> URL: https://issues.apache.org/jira/browse/SPARK-20967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20967) SharedState.externalCatalog is not really lazy

2017-06-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20967:


Assignee: Apache Spark  (was: Wenchen Fan)

> SharedState.externalCatalog is not really lazy
> --
>
> Key: SPARK-20967
> URL: https://issues.apache.org/jira/browse/SPARK-20967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-02 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034427#comment-16034427
 ] 

Robert Kruszewski commented on SPARK-20952:
---

You're right that this needs a bit of clarification. There's a bit more 
subtlety with respect to the actual thread local and its use. First of all, 
this makes the behaviour the same as on the driver, where the localProperties 
on SparkContext are inheritable. Secondly, I believe the issue you're 
describing will not arise since a) executor tasks are uninterruptible and b) 
the thread pool used to run them is a cached thread pool and not a ForkJoinPool, 
hence a given task thread will not inherit from another task thread. Let me know 
if I am missing something here though.
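
To make the problem concrete, a minimal sketch (illustrative only; assumes an active SparkContext `sc` inside a Spark 
application):
{code}
import org.apache.spark.TaskContext

// TaskContext is backed by a plain ThreadLocal today, so a thread forked inside a task
// does not see the task's context; with an InheritableThreadLocal it would.
sc.parallelize(1 to 2, 2).foreach { _ =>
  val outer = TaskContext.get()    // set by the executor for the task's thread
  val t = new Thread(new Runnable {
    override def run(): Unit = {
      val inner = TaskContext.get()    // currently null in the forked thread
      println(s"outer=$outer inner=$inner")
    }
  })
  t.start()
  t.join()
}
{code}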

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal; as a result, when you fork a thread inside your 
> executor task, you lose the handle on the original context set by the 
> executor. We should change it to an InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20968) Support separator in Tokenizer

2017-06-02 Thread darion yaphet (JIRA)
darion yaphet created SPARK-20968:
-

 Summary: Support separator in Tokenizer
 Key: SPARK-20968
 URL: https://issues.apache.org/jira/browse/SPARK-20968
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 2.1.1, 2.0.2, 2.0.0
Reporter: darion yaphet
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20959) Add a parameter to UnsafeExternalSorter to configure filebuffersize

2017-06-02 Thread caoxuewen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034466#comment-16034466
 ] 

caoxuewen commented on SPARK-20959:
---

Thanks for modifying the priority.

> Add a parameter to UnsafeExternalSorter to configure filebuffersize
> ---
>
> Key: SPARK-20959
> URL: https://issues.apache.org/jira/browse/SPARK-20959
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> Use spark.shuffle.file.buffer to configure fileBufferSizeBytes in 
> UnsafeExternalSorter. 
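
A rough sketch of what the proposal could look like (hypothetical, not the actual patch; 32k mirrors the documented 
default of spark.shuffle.file.buffer):
{code}
import org.apache.spark.SparkConf

// Hypothetical sketch: derive the sorter's spill-file buffer size from
// spark.shuffle.file.buffer instead of a hard-coded constant.
def fileBufferSizeBytes(conf: SparkConf): Int =
  (conf.getSizeAsKb("spark.shuffle.file.buffer", "32k") * 1024).toInt
{code}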



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL

2017-06-02 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-20799:
---
Summary: Unable to infer schema for ORC/Parquet on S3N when secrets are in 
the URL  (was: Unable to infer schema for ORC on S3N when secrets are in the 
URL)

> Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Jork Zijlstra
>Priority: Minor
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use the ORC format
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in the PartitioningAwareFileIndex def allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no 
> data is read and the schema cannot be defined.
> Spark does output the warning "S3xLoginHelper:90 - The Filesystem URI contains login 
> details. This is insecure and may be unsupported in future.", but that should 
> not mean it stops working.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}
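
For completeness, a minimal sketch of the work-around above (the environment variable names and the bucket/path are 
placeholders):
{code}
import org.apache.spark.sql.SparkSession

// Pass the secrets through Hadoop configuration instead of embedding them in the URL.
val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// The path no longer carries credentials (hypothetical bucket/path).
val df = spark.read.orc("s3n://some-bucket/some/path/")
{code}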



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL

2017-06-02 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-20799:
---
Environment: Hadoop 2.8.0 binaries

> Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Hadoop 2.8.0 binaries
>Reporter: Jork Zijlstra
>Priority: Minor
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use the ORC format
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in the PartitioningAwareFileIndex def allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no 
> data is read and the schema cannot be defined.
> Spark does output the warning "S3xLoginHelper:90 - The Filesystem URI contains login 
> details. This is insecure and may be unsupported in future.", but that should 
> not mean it stops working.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows

2017-06-02 Thread Perrine Letellier (JIRA)
Perrine Letellier created SPARK-20969:
-

 Summary: last() aggregate function fails returning the right 
answer with ordered windows
 Key: SPARK-20969
 URL: https://issues.apache.org/jira/browse/SPARK-20969
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Perrine Letellier


{code}
scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "desc")
scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
scala> df.withColumn("last", last(col("description")).over(window)).show
+---+---+------------+-----+
| id| ts| description| last|
+---+---+------------+-----+
| i1|  1|       desc1|desc2|
| i1|  1|       desc2|desc2|
| i1|  2|       desc3|desc3|
+---+---+------------+-----+
{code}

However what is expected is the same answer as if asking for `first()` with a 
window with descending order.

{code}
scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
scala> df.withColumn("last", first(col("description")).over(window)).show
+---+---+------------+-----+
| id| ts| description| last|
+---+---+------------+-----+
| i1|  2|       desc3|desc3|
| i1|  1|       desc1|desc3|
| i1|  1|       desc2|desc3|
+---+---+------------+-----+
{code}
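
Not part of the report, but a possible work-around, assuming the cause is the default window frame (rows between 
unbounded preceding and current row) that applies once an orderBy is set; widening the frame explicitly makes last() 
see the whole partition (`df` is the DataFrame above, with a `description` column):
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

// With an explicit full frame, last() sees the whole partition instead of only
// the rows up to (and tied with) the current row.
val fullWindow = Window.partitionBy("id")
  .orderBy(col("ts").asc)
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn("last", last(col("description")).over(fullWindow)).show()
{code}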



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20942) The title style about field is error in the history server web ui.

2017-06-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20942.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18170
[https://github.com/apache/spark/pull/18170]

> The title style about field is error in the history server web ui.
> --
>
> Key: SPARK-20942
> URL: https://issues.apache.org/jira/browse/SPARK-20942
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: guoxiaolongzte
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: before.png, fix1.png, fix.png
>
>
> 1.The title style about field is error.
> 2.Title text description, 'the application' should be changed to 'this 
> application'.
> 3.Analysis of code:
> $('#hisotry-summary [data-toggle="tooltip"]').tooltip();
> hisotry-summary is the spelling error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20942) The title style about field is error in the history server web ui.

2017-06-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20942:
-

Assignee: guoxiaolongzte

> The title style about field is error in the history server web ui.
> --
>
> Key: SPARK-20942
> URL: https://issues.apache.org/jira/browse/SPARK-20942
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: guoxiaolongzte
>Assignee: guoxiaolongzte
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: before.png, fix1.png, fix.png
>
>
> 1.The title style about field is error.
> 2.Title text description, 'the application' should be changed to 'this 
> application'.
> 3.Analysis of code:
> $('#hisotry-summary [data-toggle="tooltip"]').tooltip();
> hisotry-summary is the spelling error. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20790) ALS with implicit feedback ignores negative values

2017-06-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034730#comment-16034730
 ] 

Apache Spark commented on SPARK-20790:
--

User 'davideis' has created a pull request for this issue:
https://github.com/apache/spark/pull/18188

> ALS with implicit feedback ignores negative values
> --
>
> Key: SPARK-20790
> URL: https://issues.apache.org/jira/browse/SPARK-20790
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 
> 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>Reporter: David Eis
>Assignee: David Eis
> Fix For: 2.2.0
>
>
> The refactoring that was done in 
> https://github.com/apache/spark/pull/5314/files introduced a bug whereby, for 
> implicit feedback, negative ratings just get ignored. Prior to that commit 
> they were not ignored: the absolute value was used as the confidence and the 
> preference was set to 0. The preservation of comments and of the absolute value 
> indicates that this was unintentional.
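
As an illustration of the pre-refactoring behavior described above, a sketch of the usual implicit-feedback mapping 
(not the actual ALS code; `alpha` is the standard confidence scaling parameter):
{code}
// Sketch only: map a raw implicit rating r to (confidence, preference).
// A negative rating keeps a meaningful confidence via |r| but gets preference 0,
// rather than being dropped entirely.
def confidenceAndPreference(r: Float, alpha: Double): (Double, Double) = {
  val confidence = 1.0 + alpha * math.abs(r)
  val preference = if (r > 0) 1.0 else 0.0
  (confidence, preference)
}
{code}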



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20943) Correct BypassMergeSortShuffleWriter's comment

2017-06-02 Thread CanBin Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034746#comment-16034746
 ] 

CanBin Zheng commented on SPARK-20943:
--

[~saisai_shao]  I see what you mean. But I think it's better to change the description; 
it has confused me for a long time, and someone else may have the same confusion.

> Correct BypassMergeSortShuffleWriter's comment
> --
>
> Key: SPARK-20943
> URL: https://issues.apache.org/jira/browse/SPARK-20943
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Shuffle
>Affects Versions: 2.1.1
>Reporter: CanBin Zheng
>Priority: Trivial
>  Labels: starter
>
> There are some comments written in BypassMergeSortShuffleWriter.java about 
> when to select this write path; the three required conditions are described 
> as follows:  
> 1. no Ordering is specified, and
> 2. no Aggregator is specified, and
> 3. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold
> Obviously, the conditions as written are partially wrong and misleading; the 
> right conditions should be:
> 1. map-side combine is false, and
> 2. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20960) make ColumnVector public

2017-06-02 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034845#comment-16034845
 ] 

Dongjoon Hyun commented on SPARK-20960:
---

cc [~mridulm80]

> make ColumnVector public
> 
>
> Key: SPARK-20960
> URL: https://issues.apache.org/jira/browse/SPARK-20960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> ColumnVector is an internal interface in Spark SQL, which is only used for 
> vectorized parquet reader to represent the in-memory columnar format.
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a 
> more efficient way for data exchanges between Spark and external systems. For 
> example, we can use ColumnVector to build the columnar read API in data 
> source framework, we can use ColumnVector to build a more efficient UDF API, 
> etc.
> We also want to introduce a new ColumnVector implementation based on Apache 
> Arrow (basically just a wrapper over Arrow), so that external systems (like 
> Python Pandas DataFrame) can build ColumnVector very easily.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows

2017-06-02 Thread Perrine Letellier (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Perrine Letellier updated SPARK-20969:
--
Description: 
The column on which `orderBy` is performed is considered as another column on 
which to partition.

{code}
scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "desc")
scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
scala> df.withColumn("last", last(col("description")).over(window)).show
+---+---+------------+-----+
| id| ts| description| last|
+---+---+------------+-----+
| i1|  1|       desc1|desc2|
| i1|  1|       desc2|desc2|
| i1|  2|       desc3|desc3|
+---+---+------------+-----+
{code}

However what is expected is the same answer as if asking for `first()` with a 
window with descending order.

{code}
scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
scala> df.withColumn("last", first(col("description")).over(window)).show
+---+---+------------+-----+
| id| ts| description| last|
+---+---+------------+-----+
| i1|  2|       desc3|desc3|
| i1|  1|       desc1|desc3|
| i1|  1|       desc2|desc3|
+---+---+------------+-----+
{code}

  was:
{code}
scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "desc")
scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
scala> df.withColumn("last", last(col("description")).over(window)).show
+---+---+------------+-----+
| id| ts| description| last|
+---+---+------------+-----+
| i1|  1|       desc1|desc2|
| i1|  1|       desc2|desc2|
| i1|  2|       desc3|desc3|
+---+---+------------+-----+
{code}

However what is expected is the same answer as if asking for `first()` with a 
window with descending order.

{code}
scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
scala> df.withColumn("last", first(col("description")).over(window)).show
+---+---+------------+-----+
| id| ts| description| last|
+---+---+------------+-----+
| i1|  2|       desc3|desc3|
| i1|  1|       desc1|desc3|
| i1|  1|       desc2|desc3|
+---+---+------------+-----+
{code}


> last() aggregate function fails returning the right answer with ordered 
> windows
> ---
>
> Key: SPARK-20969
> URL: https://issues.apache.org/jira/browse/SPARK-20969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Perrine Letellier
>
> The column on which `orderBy` is performed is considered as another column on 
> which to partition.
> {code}
> scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "desc")
> scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
> scala> df.withColumn("last", last(col("description")).over(window)).show
> +---+---+------------+-----+
> | id| ts| description| last|
> +---+---+------------+-----+
> | i1|  1|       desc1|desc2|
> | i1|  1|       desc2|desc2|
> | i1|  2|       desc3|desc3|
> +---+---+------------+-----+
> {code}
> However what is expected is the same answer as if asking for `first()` with a 
> window with descending order.
> {code}
> scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
> scala> df.withColumn("last", first(col("description")).over(window)).show
> +---+---+------------+-----+
> | id| ts| description| last|
> +---+---+------------+-----+
> | i1|  2|       desc3|desc3|
> | i1|  1|       desc1|desc3|
> | i1|  1|       desc2|desc3|
> +---+---+------------+-----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows

2017-06-02 Thread Perrine Letellier (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Perrine Letellier updated SPARK-20969:
--
Description: 
The column on which `orderBy` is performed is considered as another column on 
which to partition.

{code}
scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "description")
scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
scala> df.withColumn("last", last(col("description")).over(window)).show
+---+---+------------+-----+
| id| ts| description| last|
+---+---+------------+-----+
| i1|  1|       desc1|desc2|
| i1|  1|       desc2|desc2|
| i1|  2|       desc3|desc3|
+---+---+------------+-----+
{code}

However what is expected is the same answer as if asking for `first()` with a 
window with descending order.

{code}
scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
scala> df.withColumn("last", first(col("description")).over(window)).show
+---+---+------------+-----+
| id| ts| description| last|
+---+---+------------+-----+
| i1|  2|       desc3|desc3|
| i1|  1|       desc1|desc3|
| i1|  1|       desc2|desc3|
+---+---+------------+-----+
{code}

  was:
The column on which `orderBy` is performed is considered as another column on 
which to partition.

{code}
scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "desc")
scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
scala> df.withColumn("last", last(col("description")).over(window)).show
+---+---+------------+-----+
| id| ts| description| last|
+---+---+------------+-----+
| i1|  1|       desc1|desc2|
| i1|  1|       desc2|desc2|
| i1|  2|       desc3|desc3|
+---+---+------------+-----+
{code}

However what is expected is the same answer as if asking for `first()` with a 
window with descending order.

{code}
scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
scala> df.withColumn("last", first(col("description")).over(window)).show
+---+---+------------+-----+
| id| ts| description| last|
+---+---+------------+-----+
| i1|  2|       desc3|desc3|
| i1|  1|       desc1|desc3|
| i1|  1|       desc2|desc3|
+---+---+------------+-----+
{code}


> last() aggregate function fails returning the right answer with ordered 
> windows
> ---
>
> Key: SPARK-20969
> URL: https://issues.apache.org/jira/browse/SPARK-20969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Perrine Letellier
>
> The column on which `orderBy` is performed is considered as another column on 
> which to partition.
> {code}
> scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "description")
> scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
> scala> df.withColumn("last", last(col("description")).over(window)).show
> +---+---+------------+-----+
> | id| ts| description| last|
> +---+---+------------+-----+
> | i1|  1|       desc1|desc2|
> | i1|  1|       desc2|desc2|
> | i1|  2|       desc3|desc3|
> +---+---+------------+-----+
> {code}
> However what is expected is the same answer as if asking for `first()` with a 
> window with descending order.
> {code}
> scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
> scala> df.withColumn("last", first(col("description")).over(window)).show
> +---+---+------------+-----+
> | id| ts| description| last|
> +---+---+------------+-----+
> | i1|  2|       desc3|desc3|
> | i1|  1|       desc1|desc3|
> | i1|  1|       desc2|desc3|
> +---+---+------------+-----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19236) Add createOrReplaceGlobalTempView

2017-06-02 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19236:
--
Fix Version/s: 2.3.0

> Add createOrReplaceGlobalTempView
> -
>
> Key: SPARK-19236
> URL: https://issues.apache.org/jira/browse/SPARK-19236
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Arman Yazdani
>Priority: Minor
> Fix For: 2.3.0
>
>
> There are 3 methods for saving a temp tables:
> createTempView
> createOrReplaceTempView
> createGlobalTempView
> but there isn't:
> createOrReplaceGlobalTempView



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034950#comment-16034950
 ] 

Ryan Blue commented on SPARK-20958:
---

I don't think it is a good idea to roll back. Spark doesn't depend on 
parquet-avro, where the update to Avro 1.8.1 was made, except for tests where 
it is fine. The backports for Spark in 1.8.2 are worth keeping since there are 
reasonable work-arounds in user projects.

The problem that I've seen on the dev list is when users add parquet-avro to 
their dependencies and the version gets managed to 1.8.2. That will require 
Avro 1.8.1 because parquet-avro calls {{getSchema}} on avro-specific objects. 
But there are a couple reasonable ways to deal with this:

1. Specify a dependency on parquet-avro 1.8.1 that still uses Avro 1.7.x. 
Parquet is backward-compatible with older binaries, so parquet-avro 1.8.1 works 
fine with parquet-hadoop 1.8.2. (This is the recommended work-around.)
2. Shade and relocate Avro 1.8.1 in application Jars, so that Spark can use 
1.7.x and parquet-avro can use 1.8.1.

This was brought up on the dev list, but the user dismissed these work-arounds 
without trying them.

Long-term, we can do a 1.8.3 release to solve this problem, though I think the 
best solution there would be to stop using {{getSchema}} instead of downgrading 
the dependency.
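
To make work-around (1) concrete, a hedged sbt sketch (coordinates and versions are illustrative; adjust to your 
build):
{code}
// Pin parquet-avro to 1.8.1 (built against Avro 1.7.x) while the Spark 2.2 distribution
// provides parquet-hadoop 1.8.2; Parquet stays backward-compatible across these versions.
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-sql"    % "2.2.0" % "provided",
  "org.apache.parquet"  % "parquet-avro" % "1.8.1"
)
{code}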

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034961#comment-16034961
 ] 

Dongjoon Hyun commented on SPARK-20958:
---

+1 for [~rdblue].

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks

2017-06-02 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034964#comment-16034964
 ] 

Marcelo Vanzin commented on SPARK-20662:


Yeah, I don't really understand this request. It doesn't matter how many tasks 
a job creates; what really matters is how many resources the cluster manager 
allows the application to allocate. If a job has 1 million tasks but the 
cluster manager allocates a single vcpu for the job, it will take forever, but 
it won't really bog down the cluster.
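
That said, for an admin who wants a hard cap today, something close to the 
proposed spark.job.max.tasks (quoted below) can be approximated with a 
listener. A rough sketch (class name and threshold are made up, and cancellation 
happens just after submission rather than blocking it outright):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Cancels any job whose total task count across all stages exceeds maxTasks.
class TaskCapListener(sc: SparkContext, maxTasks: Int) extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val totalTasks = jobStart.stageInfos.map(_.numTasks).sum
    if (totalTasks > maxTasks) {
      sc.cancelJob(jobStart.jobId)
    }
  }
}

// Usage: sc.addSparkListener(new TaskCapListener(sc, maxTasks = 100000))
{code}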

> Block jobs that have greater than a configured number of tasks
> --
>
> Key: SPARK-20662
> URL: https://issues.apache.org/jira/browse/SPARK-20662
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xuefu Zhang
>
> In a shared cluster, it's desirable for an admin to block large Spark jobs. 
> While there might not be a single metric defining the size of a job, the 
> number of tasks is usually a good indicator. Thus, it would be useful for 
> Spark scheduler to block a job whose number of tasks reaches a configured 
> limit. By default, the limit could be just infinite, to retain the existing 
> behavior.
> MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be 
> configured, which blocks a MR job at job submission time.
> The proposed configuration is spark.job.max.tasks with a default value -1 
> (infinite).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20968) Support separator in Tokenizer

2017-06-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034993#comment-16034993
 ] 

Nick Pentreath commented on SPARK-20968:


Would you mind adding more detail here? What is the use case, with an example 
of the desired input / output?
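
For what it's worth, a custom separator can already be approximated with 
{{RegexTokenizer}}. A minimal sketch, splitting on commas instead of whitespace 
(the column names and sample data are made up):

{code}
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("separator").getOrCreate()
import spark.implicits._

val df = Seq("a,b,c", "hello,world").toDF("text")

val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")
  .setPattern(",")   // the separator, interpreted as a regex matching the gaps
  .setGaps(true)

tokenizer.transform(df).show(false)
// tokens column: [a, b, c] and [hello, world]
{code}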

> Support separator in Tokenizer
> --
>
> Key: SPARK-20968
> URL: https://issues.apache.org/jira/browse/SPARK-20968
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.2, 2.1.1
>Reporter: darion yaphet
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035012#comment-16035012
 ] 

Michael Armbrust commented on SPARK-19104:
--

I'm about to cut RC3 of 2.2 and there is no pull request to fix this. 
Unfortunately, that means it's not going to be fixed in 2.2.0.
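
For anyone blocked on this in the meantime, one possible workaround (a sketch, 
not an endorsed fix) is to kryo-encode the whole case class so the failing 
ExternalMapToCatalyst codegen path is never generated, at the cost of an opaque 
binary column instead of a structured one:

{code}
import org.apache.spark.sql.{Encoders, SparkSession}

case class InnerData(name: String, value: Int)
case class Data(id: Int, param: Map[String, InnerData])

val spark = SparkSession.builder().master("local[*]").appName("kryo-workaround").getOrCreate()

val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 100))))

// Serialize Data with Kryo instead of mapping its fields to Catalyst columns.
implicit val dataEncoder = Encoders.kryo[Data]
val ds = spark.createDataset(data)
ds.count()
{code}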

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
> 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino

[jira] [Resolved] (SPARK-20967) SharedState.externalCatalog is not really lazy

2017-06-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20967.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18187
[https://github.com/apache/spark/pull/18187]

> SharedState.externalCatalog is not really lazy
> --
>
> Key: SPARK-20967
> URL: https://issues.apache.org/jira/browse/SPARK-20967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20946) simplify the config setting logic in SparkSession.getOrCreate

2017-06-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20946.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18172
[https://github.com/apache/spark/pull/18172]

> simplify the config setting logic in SparkSession.getOrCreate
> -
>
> Key: SPARK-20946
> URL: https://issues.apache.org/jira/browse/SPARK-20946
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15352) Topology aware block replication

2017-06-02 Thread Shubham Chopra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Chopra resolved SPARK-15352.

Resolution: Fixed

> Topology aware block replication
> 
>
> Key: SPARK-15352
> URL: https://issues.apache.org/jira/browse/SPARK-15352
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Mesos, Spark Core, YARN
>Reporter: Shubham Chopra
>Assignee: Shubham Chopra
>
> With cached RDDs, Spark can be used for online analytics where it is used to 
> respond to online queries. But loss of RDD partitions due to node/executor 
> failures can cause huge delays in such use cases as the data would have to be 
> regenerated.
> Cached RDDs, even when using multiple replicas per block, are not currently 
> resilient to node failures when multiple executors are started on the same 
> node. Block replication currently chooses a peer at random, and this peer 
> could also exist on the same host. 
> This effort would add topology aware replication to Spark that can be enabled 
> with pluggable strategies. For ease of development/review, this is being 
> broken down to three major work-efforts:
> 1.Making peer selection for replication pluggable
> 2.Providing pluggable implementations for providing topology and topology 
> aware replication
> 3.Pro-active replenishment of lost blocks
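
As a concrete illustration of point 1 in the description above, a pluggable 
strategy boils down to an implementation of the {{TopologyMapper}} hook added by 
this work. A toy sketch; the hostname-to-rack convention is made up, and the 
config key shown in the comment is as I recall it, so double-check before use:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.TopologyMapper

// Derives a "rack" from the hostname prefix, e.g. "rack1-node07" -> "/rack1".
// Real deployments would query their own inventory/CMDB instead.
class PrefixTopologyMapper(conf: SparkConf) extends TopologyMapper(conf) {
  override def getTopologyForHost(hostname: String): Option[String] =
    hostname.split("-").headOption.map(prefix => s"/$prefix")
}

// Enabled via spark.storage.replication.topologyMapper=com.example.PrefixTopologyMapper
{code}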



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15352) Topology aware block replication

2017-06-02 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035061#comment-16035061
 ] 

Dongjoon Hyun commented on SPARK-15352:
---

Thank you, [~shubhamc]!

> Topology aware block replication
> 
>
> Key: SPARK-15352
> URL: https://issues.apache.org/jira/browse/SPARK-15352
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Mesos, Spark Core, YARN
>Reporter: Shubham Chopra
>Assignee: Shubham Chopra
> Fix For: 2.2.0
>
>
> With cached RDDs, Spark can be used for online analytics where it is used to 
> respond to online queries. But loss of RDD partitions due to node/executor 
> failures can cause huge delays in such use cases as the data would have to be 
> regenerated.
> Cached RDDs, even when using multiple replicas per block, are not currently 
> resilient to node failures when multiple executors are started on the same 
> node. Block replication currently chooses a peer at random, and this peer 
> could also exist on the same host. 
> This effort would add topology aware replication to Spark that can be enabled 
> with pluggable strategies. For ease of development/review, this is being 
> broken down to three major work-efforts:
> 1.Making peer selection for replication pluggable
> 2.Providing pluggable implementations for providing topology and topology 
> aware replication
> 3.Pro-active replenishment of lost blocks



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15352) Topology aware block replication

2017-06-02 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15352:
--
Fix Version/s: 2.2.0

> Topology aware block replication
> 
>
> Key: SPARK-15352
> URL: https://issues.apache.org/jira/browse/SPARK-15352
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Mesos, Spark Core, YARN
>Reporter: Shubham Chopra
>Assignee: Shubham Chopra
> Fix For: 2.2.0
>
>
> With cached RDDs, Spark can be used for online analytics where it is used to 
> respond to online queries. But loss of RDD partitions due to node/executor 
> failures can cause huge delays in such use cases as the data would have to be 
> regenerated.
> Cached RDDs, even when using multiple replicas per block, are not currently 
> resilient to node failures when multiple executors are started on the same 
> node. Block replication currently chooses a peer at random, and this peer 
> could also exist on the same host. 
> This effort would add topology aware replication to Spark that can be enabled 
> with pluggable strategies. For ease of development/review, this is being 
> broken down to three major work-efforts:
> 1.Making peer selection for replication pluggable
> 2.Providing pluggable implementations for providing topology and topology 
> aware replication
> 3.Pro-active replenishment of lost blocks



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2017-06-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035062#comment-16035062
 ] 

Nicholas Chammas commented on SPARK-12661:
--

I think we are good to resolve this provided that we've stopped testing with 
Python 2.6. Any cleanup of 2.6-specific workarounds (tracked in SPARK-20149) 
can be done separately IMO.

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20955) A lot of duplicated "executorId" strings in "TaskUIData"s

2017-06-02 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20955.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> A lot of duplicated "executorId" strings in "TaskUIData"s 
> --
>
> Key: SPARK-20955
> URL: https://issues.apache.org/jira/browse/SPARK-20955
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035149#comment-16035149
 ] 

Cheng Lian commented on SPARK-20958:


Thanks [~rdblue]! I'm also reluctant to roll it back considering those fixes we 
wanted so badly... We decided to give this a try because, from the perspective 
of release management, we'd like to avoid cutting a release with known 
conflicting dependencies, even transitive ones. For a Spark 2.2 user, it's 
quite natural to choose parquet-avro 1.8.2, which is part of parquet-mr 1.8.2, 
which, in turn, is a direct dependency of Spark 2.2.0.

However, due to PARQUET-389, rolling back is not really an option anymore. The 
two options I can see here are:

# Release Spark 2.2.0 as is with a statement in the release notes saying that 
users should use parquet-avro 1.8.1 instead of 1.8.2 to avoid the Avro 
compatibility issue.
# Wait for parquet-mr 1.8.3, which hopefully resolves this dependency issue 
(e.g., by reverting PARQUET-358).

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-02 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035171#comment-16035171
 ] 

Andrew Ash commented on SPARK-20952:


For the localProperties on SparkContext it does 2 things I can see to improve 
safety:

- first, it clones the properties for new threads so changes in the parent 
thread don't unintentionally affect a child thread: 
https://github.com/apache/spark/blob/v2.2.0-rc2/core/src/main/scala/org/apache/spark/SparkContext.scala#L330
- second, it clears the properties when they're no longer being used: 
https://github.com/apache/spark/blob/v2.2.0-rc2/core/src/main/scala/org/apache/spark/SparkContext.scala#L1942

Do we need to do either the defensive cloning or the proactive clearing of 
taskInfos in executors, as is done in the driver?
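
For reference, the driver-side cloning in the first bullet looks roughly like 
this generic sketch with {{java.util.Properties}} (not the actual SparkContext 
code):

{code}
import java.util.Properties

// Each child thread receives its own copy of the parent's properties, so later
// mutations in the parent thread are not visible to the child (and vice versa).
val localProps = new InheritableThreadLocal[Properties] {
  override protected def initialValue(): Properties = new Properties()

  override protected def childValue(parent: Properties): Properties = {
    val copy = new Properties()
    if (parent != null) copy.putAll(parent)
    copy
  }
}
{code}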

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-02 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035195#comment-16035195
 ] 

Robert Kruszewski commented on SPARK-20952:
---

Point 2 is already happening on executors, where the Task will set and unset its 
TaskContext correctly. Agreed, we should add point 1.

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19236) Add createOrReplaceGlobalTempView

2017-06-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19236.
-
Resolution: Fixed

> Add createOrReplaceGlobalTempView
> -
>
> Key: SPARK-19236
> URL: https://issues.apache.org/jira/browse/SPARK-19236
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Arman Yazdani
>Priority: Minor
> Fix For: 2.2.0, 2.3.0
>
>
> There are 3 methods for saving a temp table:
> createTempView
> createOrReplaceTempView
> createGlobalTempView
> but there isn't:
> createOrReplaceGlobalTempView



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19236) Add createOrReplaceGlobalTempView

2017-06-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-19236:
---

Assignee: Xiao Li

> Add createOrReplaceGlobalTempView
> -
>
> Key: SPARK-19236
> URL: https://issues.apache.org/jira/browse/SPARK-19236
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Arman Yazdani
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 2.2.0, 2.3.0
>
>
> There are 3 methods for saving a temp table:
> createTempView
> createOrReplaceTempView
> createGlobalTempView
> but there isn't:
> createOrReplaceGlobalTempView



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19236) Add createOrReplaceGlobalTempView

2017-06-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19236:

Fix Version/s: 2.2.0

> Add createOrReplaceGlobalTempView
> -
>
> Key: SPARK-19236
> URL: https://issues.apache.org/jira/browse/SPARK-19236
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Arman Yazdani
>Priority: Minor
> Fix For: 2.2.0, 2.3.0
>
>
> There are 3 methods for saving a temp table:
> createTempView
> createOrReplaceTempView
> createGlobalTempView
> but there isn't:
> createOrReplaceGlobalTempView



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19236) Add createOrReplaceGlobalTempView

2017-06-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-19236:
---

Assignee: Arman Yazdani  (was: Xiao Li)

> Add createOrReplaceGlobalTempView
> -
>
> Key: SPARK-19236
> URL: https://issues.apache.org/jira/browse/SPARK-19236
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Arman Yazdani
>Assignee: Arman Yazdani
>Priority: Minor
> Fix For: 2.2.0, 2.3.0
>
>
> There are 3 methods for saving a temp table:
> createTempView
> createOrReplaceTempView
> createGlobalTempView
> but there isn't:
> createOrReplaceGlobalTempView



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-20958.
--
Resolution: Won't Fix

Thanks everyone.  Sounds like we'll just provide directions in the release 
notes for users of parquet-avro to pin the version to 1.8.1.

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: release-notes
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20958:
-
Labels: release-notes  (was: )

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: release-notes
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20914) Javadoc contains code that is invalid

2017-06-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20914:
--
Priority: Trivial  (was: Minor)

It's OK if you don't see more like this just now; just open a PR for what 
you've got.

> Javadoc contains code that is invalid
> -
>
> Key: SPARK-20914
> URL: https://issues.apache.org/jira/browse/SPARK-20914
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: Cristian Teodor
>Priority: Trivial
>
> I was looking over the 
> [dataset|https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/sql/Dataset.html]
>  docs and noticed the code at the top that does not make sense in Java.
> {code}
>  // To create Dataset using SparkSession
>Dataset people = spark.read().parquet("...");
>Dataset department = spark.read().parquet("...");
>people.filter("age".gt(30))
>  .join(department, people.col("deptId").equalTo(department("id")))
>  .groupBy(department.col("name"), "gender")
>  .agg(avg(people.col("salary")), max(people.col("age")));
> {code}
> invalid parts:
> * "age".gt(30)
> * department("id")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20970) Deprecate TaskMetrics._updatedBlockStatuses

2017-06-02 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-20970:
-

 Summary: Deprecate TaskMetrics._updatedBlockStatuses
 Key: SPARK-20970
 URL: https://issues.apache.org/jira/browse/SPARK-20970
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Thomas Graves


TaskMetrics._updatedBlockStatuses isn't used anywhere internally by Spark. It 
could still be used by users, though, since it's exposed by SparkListenerTaskEnd. 
In https://issues.apache.org/jira/browse/SPARK-20923 we made the tracking 
configurable because it uses a lot of memory. That config still defaults to true 
for backwards compatibility. We should flip it to false in the next release and 
deprecate the API altogether.
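
For users who want to opt out already, a minimal sketch, assuming the flag added 
by SPARK-20923 is named as below (double-check the ticket for the exact key):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("no-updated-block-status-tracking")
  // Assumed key from SPARK-20923; verify before relying on it.
  .set("spark.taskMetrics.trackUpdatedBlockStatuses", "false")
{code}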





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-02 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035302#comment-16035302
 ] 

Shixiong Zhu commented on SPARK-20952:
--

What I'm concerned about is global thread pools, such as 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L128
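
A tiny standalone sketch of why shared pools are a problem: an 
{{InheritableThreadLocal}} is only captured when a thread is created, so threads 
that already exist in a global pool never see the submitting task's value (and, 
conversely, lazily created pool threads would keep whatever task's value they 
happened to inherit):

{code}
import java.util.concurrent.Executors

object InheritableLocalVsPools {
  val ctx = new InheritableThreadLocal[String]

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(1)
    // Force the pool to create its single worker thread before any "task" runs.
    pool.submit(new Runnable { def run(): Unit = () }).get()

    ctx.set("task-1")  // simulate a task setting its context on the submitting thread
    pool.submit(new Runnable {
      def run(): Unit = println(s"pool thread sees: ${ctx.get}")  // prints null, not task-1
    }).get()

    pool.shutdown()
  }
}
{code}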

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20971) Purge the metadata log for FileStreamSource

2017-06-02 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-20971:


 Summary: Purge the metadata log for FileStreamSource
 Key: SPARK-20971
 URL: https://issues.apache.org/jira/browse/SPARK-20971
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.1.1
Reporter: Shixiong Zhu


Currently 
[FileStreamSource.commit|https://github.com/apache/spark/blob/16186cdcbce1a2ec8f839c550e6b571bf5dc2692/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L258]
 is empty. We can delete unused metadata logs in this method to reduce the size 
of log files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20782) Dataset's isCached operator

2017-06-02 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035316#comment-16035316
 ] 

Jacek Laskowski commented on SPARK-20782:
-

Just stumbled upon {{CatalogImpl.isCached}}, which could also be used to 
implement this feature.
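
For completeness, the public {{spark.catalog.isCached}} works by table/view 
name, so a name-based approximation is already possible (sketch below); the 
proposed operator would skip the temp-view indirection:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("isCached").getOrCreate()
import spark.implicits._

val q2 = Seq(1, 2, 3).toDF("id")
q2.createOrReplaceTempView("q2")
spark.catalog.cacheTable("q2")

println(spark.catalog.isCached("q2"))  // true
{code}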

> Dataset's isCached operator
> ---
>
> Key: SPARK-20782
> URL: https://issues.apache.org/jira/browse/SPARK-20782
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> It'd be very convenient to have an {{isCached}} operator that would say whether 
> a query is cached in-memory or not.
> It'd be as simple as the following snippet:
> {code}
> // val q2: DataFrame
> spark.sharedState.cacheManager.lookupCachedData(q2.queryExecution.logical).isDefined
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2017-06-02 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035356#comment-16035356
 ] 

Simeon H.K. Fitch commented on SPARK-7768:
--

[~pgrandjean] Once a UDT is registered, the `ExpressionEncoder` class (usually 
invoked by the functions in `Encoders`) automatically makes use of it.

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20734) Structured Streaming spark.sql.streaming.schemaInference not handling schema changes

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20734:
-
Issue Type: New Feature  (was: Bug)

> Structured Streaming spark.sql.streaming.schemaInference not handling schema 
> changes
> 
>
> Key: SPARK-20734
> URL: https://issues.apache.org/jira/browse/SPARK-20734
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Ram
>
> sparkSession.config("spark.sql.streaming.schemaInference", 
> true).getOrCreate();
> Dataset dataset = 
> sparkSession.readStream().parquet("file:/files-to-process");
> StreamingQuery streamingQuery =
> dataset.writeStream().option("checkpointLocation", 
> "file:/checkpoint-location")
> .outputMode(Append()).start("file:/save-parquet-files");
> streamingQuery.awaitTermination();
> After the streaming query has started, if there is a schema change in new parquet 
> files under the files-to-process directory, Structured Streaming does not write the 
> new schema changes. Is it possible to handle these schema changes in Structured 
> Streaming?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20928:
-
Description: 
Given the current Source API, the minimum possible latency for any record is 
bounded by the amount of time that it takes to launch a task.  This limitation 
is a result of the fact that {{getBatch}} requires us to know both the starting 
and the ending offset, before any tasks are launched.  In the worst case, the 
end-to-end latency is actually closer to the average batch time + task 
launching time.

For applications where latency is more important than exactly-once output 
however, it would be useful if processing could happen continuously.  This 
would allow us to achieve fully pipelined reading and writing from sources such 
as Kafka.  This kind of architecture would make it possible to process records 
with end-to-end latencies on the order of 1 ms, rather than the 10-100ms that 
is possible today.

One possible architecture here would be to change the Source API to look like 
the following rough sketch:

{code}
  trait Epoch {
def data: DataFrame

/** The exclusive starting position for `data`. */
def startOffset: Offset

/** The inclusive ending position for `data`.  Incrementally updated during 
processing, but not complete until execution of the query plan in `data` is 
finished. */
def endOffset: Offset
  }

  def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], limits: 
Limits): Epoch
{code}

The above would allow us to build an alternative implementation of 
{{StreamExecution}} that processes continuously with much lower latency and 
only stops processing when needing to reconfigure the stream (either due to a 
failure or a user-requested change in parallelism).

> Continuous Processing Mode for Structured Streaming
> ---
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20147) Cloning SessionState does not clone streaming query listeners

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-20147.
--
  Resolution: Fixed
Assignee: Kunal Khamar
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

Fixed by https://github.com/apache/spark/pull/17379

> Cloning SessionState does not clone streaming query listeners
> -
>
> Key: SPARK-20147
> URL: https://issues.apache.org/jira/browse/SPARK-20147
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kunal Khamar
>Assignee: Kunal Khamar
> Fix For: 2.2.0
>
>
> Cloning session should clone StreamingQueryListeners registered on the 
> StreamingQueryListenerBus.
> Similar to SPARK-20048, https://github.com/apache/spark/pull/17379



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20002) Add support for unions between streaming and batch datasets

2017-06-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035441#comment-16035441
 ] 

Michael Armbrust commented on SPARK-20002:
--

I'm not sure that we will ever support this.  The issue is that for batch 
datasets, we don't track what has been read.  Thus it's unclear what should 
happen when the query is restarted.  Instead, I think you can always achieve 
the same result by just loading both datasets as a stream (even if you don't 
plan to change one of them).  Would that work?
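
A rough sketch of that suggestion, with placeholder paths and schema: read the 
"static" directory through {{readStream}} as well, then union the two streaming 
Datasets.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("union-streams").getOrCreate()

// File streams need an explicit schema; a single string column as a placeholder.
val schema = StructType(StructField("value", StringType) :: Nil)

val changing = spark.readStream.schema(schema).json("/data/streaming-input")  // placeholder path
val static   = spark.readStream.schema(schema).json("/data/static-input")     // read as a stream too

val query = changing.union(static)
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()
{code}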

> Add support for unions between streaming and batch datasets
> ---
>
> Key: SPARK-20002
> URL: https://issues.apache.org/jira/browse/SPARK-20002
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Leon Pham
>
> Currently unions between streaming datasets and batch datasets are not 
> supported.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19903) PySpark Kafka streaming query ouput append mode not possible

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19903:
-
Description: 
PySpark example reads a Kafka stream. There is watermarking set when handling 
the data window. The defined query uses output Append mode.

The PySpark engine reports the error:
'Append output mode not supported when there are streaming aggregations on 
streaming DataFrames/DataSets'

The Python example:
---
{code}
import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window

if __name__ == "__main__":
if len(sys.argv) != 4:
print("""
Usage: structured_kafka_wordcount.py  
 
""", file=sys.stderr)
exit(-1)

bootstrapServers = sys.argv[1]
subscribeType = sys.argv[2]
topics = sys.argv[3]

spark = SparkSession\
.builder\
.appName("StructuredKafkaWordCount")\
.getOrCreate()

# Create DataSet representing the stream of input lines from kafka
lines = spark\
.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers", bootstrapServers)\
.option(subscribeType, topics)\
.load()\
.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")

# Split the lines into words, retaining timestamps
# split() splits each line into an array, and explode() turns the array 
into multiple rows
words = lines.select(
explode(split(lines.value, ' ')).alias('word'),
lines.timestamp
)

# Group the data by window and word and compute the count of each group
windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
window(words.timestamp, "30 seconds", "30 seconds"), words.word
).count()

# Start running the query that prints the running counts to the console
query = windowedCounts\
.writeStream\
.outputMode('append')\
.format('console')\
.option("truncate", "false")\
.start()

query.awaitTermination()
{code}

The corresponding example in Zeppelin notebook:
{code}
%spark.pyspark

from pyspark.sql.functions import explode, split, window

# Create DataSet representing the stream of input lines from kafka
lines = spark\
.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("subscribe", "words")\
.load()\
.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")

# Split the lines into words, retaining timestamps
# split() splits each line into an array, and explode() turns the array into 
multiple rows
words = lines.select(
explode(split(lines.value, ' ')).alias('word'),
lines.timestamp
)

# Group the data by window and word and compute the count of each group
windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
window(words.timestamp, "30 seconds", "30 seconds"), words.word
).count()

# Start running the query that prints the running counts to the console
query = windowedCounts\
.writeStream\
.outputMode('append')\
.format('console')\
.option("truncate", "false")\
.start()

query.awaitTermination()
--

Note that the Scala version of the same example in Zeppelin notebook works fine:

import java.sql.Timestamp
import org.apache.spark.sql.streaming.ProcessingTime
import org.apache.spark.sql.functions._

// Create DataSet representing the stream of input lines from kafka
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "words")
.load()

// Split the lines into words, retaining timestamps
val words = lines
.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
.as[(String, Timestamp)]
.flatMap(line => line._1.split(" ").map(word => (word, line._2)))
.toDF("word", "timestamp")

// Group the data by window and word and compute the count of each group
val windowedCounts = words
.withWatermark("timestamp", "30 seconds")
.groupBy(window($"timestamp", "30 seconds", "30 seconds"), $"word")
.count()

// Start running the query that prints the windowed word counts to the console
val query = windowedCounts.writeStream
.outputMode("append")
.format("console")
.trigger(ProcessingTime("35 seconds"))
.option("truncate", "false")
.start()

query.awaitTermination()
{code}


  was:
PySpark example reads a Kafka stream. There is watermarking set when handling 
the data window. The

[jira] [Updated] (SPARK-19903) Watermark metadata is lost when using resolved attributes

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19903:
-
Component/s: (was: PySpark)

> Watermark metadata is lost when using resolved attributes
> -
>
> Key: SPARK-19903
> URL: https://issues.apache.org/jira/browse/SPARK-19903
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: Ubuntu Linux
>Reporter: Piotr Nestorow
>
> PySpark example reads a Kafka stream. There is watermarking set when handling 
> the data window. The defined query uses output Append mode.
> The PySpark engine reports the error:
> 'Append output mode not supported when there are streaming aggregations on 
> streaming DataFrames/DataSets'
> The Python example:
> ---
> {code}
> import sys
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode, split, window
> if __name__ == "__main__":
> if len(sys.argv) != 4:
> print("""
> Usage: structured_kafka_wordcount.py  
>  
> """, file=sys.stderr)
> exit(-1)
> bootstrapServers = sys.argv[1]
> subscribeType = sys.argv[2]
> topics = sys.argv[3]
> spark = SparkSession\
> .builder\
> .appName("StructuredKafkaWordCount")\
> .getOrCreate()
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", bootstrapServers)\
> .option(subscribeType, topics)\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array 
> into multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> {code}
> The corresponding example in Zeppelin notebook:
> {code}
> %spark.pyspark
> from pyspark.sql.functions import explode, split, window
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", "localhost:9092")\
> .option("subscribe", "words")\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array into 
> multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> --
> Note that the Scala version of the same example in Zeppelin notebook works 
> fine:
> 
> import java.sql.Timestamp
> import org.apache.spark.sql.streaming.ProcessingTime
> import org.apache.spark.sql.functions._
> // Create DataSet representing the stream of input lines from kafka
> val lines = spark
> .readStream
> .format("kafka")
> .option("kafka.bootstrap.servers", "localhost:9092")
> .option("subscribe", "words")
> .load()
> // Split the lines into words, retaining timestamps
> val words = lines
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS 
> TIMESTAMP)")
> .as[(String, Timestamp)]
> .flatMap(line => line._1.split(" ").map(word => (word, line._2)))
> .toDF("word", "timestamp")
> // Group the data by window and word and comput

[jira] [Updated] (SPARK-19903) Watermark metadata is lost when using resolved attributes

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19903:
-
Summary: Watermark metadata is lost when using resolved attributes  (was: 
PySpark Kafka streaming query ouput append mode not possible)

> Watermark metadata is lost when using resolved attributes
> -
>
> Key: SPARK-19903
> URL: https://issues.apache.org/jira/browse/SPARK-19903
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: Ubuntu Linux
>Reporter: Piotr Nestorow
>
> PySpark example reads a Kafka stream. There is watermarking set when handling 
> the data window. The defined query uses output Append mode.
> The PySpark engine reports the error:
> 'Append output mode not supported when there are streaming aggregations on 
> streaming DataFrames/DataSets'
> The Python example:
> ---
> {code}
> import sys
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode, split, window
> if __name__ == "__main__":
> if len(sys.argv) != 4:
> print("""
> Usage: structured_kafka_wordcount.py  
>  
> """, file=sys.stderr)
> exit(-1)
> bootstrapServers = sys.argv[1]
> subscribeType = sys.argv[2]
> topics = sys.argv[3]
> spark = SparkSession\
> .builder\
> .appName("StructuredKafkaWordCount")\
> .getOrCreate()
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", bootstrapServers)\
> .option(subscribeType, topics)\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array 
> into multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> {code}
> The corresponding example in Zeppelin notebook:
> {code}
> %spark.pyspark
> from pyspark.sql.functions import explode, split, window
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", "localhost:9092")\
> .option("subscribe", "words")\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array into 
> multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> --
> Note that the Scala version of the same example in Zeppelin notebook works 
> fine:
> 
> import java.sql.Timestamp
> import org.apache.spark.sql.streaming.ProcessingTime
> import org.apache.spark.sql.functions._
> // Create DataSet representing the stream of input lines from kafka
> val lines = spark
> .readStream
> .format("kafka")
> .option("kafka.bootstrap.servers", "localhost:9092")
> .option("subscribe", "words")
> .load()
> // Split the lines into words, retaining timestamps
> val words = lines
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS 
> TIMESTAMP)")
> .as[(String, Timestamp)]
> .flatMap(line => line._1.split(" ").map(word => (wo

[jira] [Updated] (SPARK-19903) Watermark metadata is lost when using resolved attributes

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19903:
-
Target Version/s: 2.3.0

> Watermark metadata is lost when using resolved attributes
> -
>
> Key: SPARK-19903
> URL: https://issues.apache.org/jira/browse/SPARK-19903
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: Ubuntu Linux
>Reporter: Piotr Nestorow
>
> PySpark example reads a Kafka stream. There is watermarking set when handling 
> the data window. The defined query uses output Append mode.
> The PySpark engine reports the error:
> 'Append output mode not supported when there are streaming aggregations on 
> streaming DataFrames/DataSets'
> The Python example:
> ---
> {code}
> import sys
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode, split, window
> if __name__ == "__main__":
> if len(sys.argv) != 4:
> print("""
> Usage: structured_kafka_wordcount.py <bootstrap-servers> <subscribe-type> 
> <topics>
> """, file=sys.stderr)
> exit(-1)
> bootstrapServers = sys.argv[1]
> subscribeType = sys.argv[2]
> topics = sys.argv[3]
> spark = SparkSession\
> .builder\
> .appName("StructuredKafkaWordCount")\
> .getOrCreate()
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", bootstrapServers)\
> .option(subscribeType, topics)\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array 
> into multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> {code}
> The corresponding example in Zeppelin notebook:
> {code}
> %spark.pyspark
> from pyspark.sql.functions import explode, split, window
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", "localhost:9092")\
> .option("subscribe", "words")\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array into 
> multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> --
> Note that the Scala version of the same example in Zeppelin notebook works 
> fine:
> 
> import java.sql.Timestamp
> import org.apache.spark.sql.streaming.ProcessingTime
> import org.apache.spark.sql.functions._
> // Create DataSet representing the stream of input lines from kafka
> val lines = spark
> .readStream
> .format("kafka")
> .option("kafka.bootstrap.servers", "localhost:9092")
> .option("subscribe", "words")
> .load()
> // Split the lines into words, retaining timestamps
> val words = lines
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> .as[(String, Timestamp)]
> .flatMap(line => line._1.split(" ").map(word => (word, line._2)))
> .toDF("word", "timestamp")
> // Group the data by window and word and compute the co
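
For context, a minimal Scala sketch of the workaround usually suggested for 
this class of problem: build the grouping columns from the watermarked Dataset 
itself (or from unresolved column names) rather than from attributes resolved 
against the pre-watermark plan. This is an illustration only, not something 
stated in this report; `words` refers to the Dataset built in the example 
above.

{code}
import org.apache.spark.sql.functions.window

// Group by columns taken from the watermarked Dataset, so the watermark
// metadata attached by withWatermark() is not lost during analysis.
val watermarked = words.withWatermark("timestamp", "30 seconds")
val windowedCounts = watermarked
  .groupBy(window(watermarked("timestamp"), "30 seconds", "30 seconds"),
           watermarked("word"))
  .count()
{code}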

[jira] [Updated] (SPARK-20065) Empty output files created for aggregation query in append mode

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20065:
-
Target Version/s: 2.3.0

> Empty output files created for aggregation query in append mode
> ---
>
> Key: SPARK-20065
> URL: https://issues.apache.org/jira/browse/SPARK-20065
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Silvio Fiorito
>
> I've got a Kafka topic which I'm querying, running a windowed aggregation, 
> with a 30 second watermark, 10 second trigger, writing out to Parquet with 
> append output mode.
> Every 10 second trigger generates a file, regardless of whether there was any 
> data for that trigger, or whether any records were actually finalized by the 
> watermark.
> Is this expected behavior or should it not write out these empty files?
> {code}
> val df = spark.readStream.format("kafka")
> val query = df
>   .withWatermark("timestamp", "30 seconds")
>   .groupBy(window($"timestamp", "10 seconds"))
>   .count()
>   .select(date_format($"window.start", "HH:mm:ss").as("time"), $"count")
> query
>   .writeStream
>   .format("parquet")
>   .option("checkpointLocation", aggChk)
>   .trigger(ProcessingTime("10 seconds"))
>   .outputMode("append")
>   .start(aggPath)
> {code}
> As the query executes, do a file listing on "aggPath" and you'll see 339-byte 
> files at a minimum until we arrive at the first watermark and the initial 
> batch is finalized. Even after that, though, as there are empty batches it 
> will keep generating empty files every trigger.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks

2017-06-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035462#comment-16035462
 ] 

Xuefu Zhang commented on SPARK-20662:
-

[~lyc] I'm talking about mapreduce.job.max.map, which is the maximum number of 
map tasks that a MR job may have. If a submitted MR job contains more map tasks 
than that, it will be rejected. Similar to mapreduce.job.max.reduce.

[~sowen], [~vanzin], I don't think blocking a (perhaps ridiculously) large job 
is equivalent to letting it run slowly forever. The use case I have is: while a 
YARN queue can be used to limit how many resources are used, a queue can be 
shared by a team or multiple applications. It's probably not a good idea to let 
one job take all the resources while starving others. Secondly, many of those 
users who submit ridiculously large jobs have no idea what they are doing and 
don't even realize that their jobs are huge. Lastly and more importantly, our 
application environment has a global timeout, beyond which a job will be 
killed. If a large job gets killed this way, significant resources are wasted. 
Thus, blocking such a job at submission time helps preserve resources.

BTW, if the scenarios don't apply to a user, there is nothing for him/her to 
worry about because the default should keep them happy.

In addition to spark.job.max.tasks, I'd also propose spark.stage.max.tasks, 
which limits the number of tasks any stage of a job may contain. The rationale 
behind this is that spark.job.max.tasks tends to favor jobs with a small number 
of stages. With both, we can not only cover MR's mapreduce.job.max.map and 
mapreduce.job.max.reduce, but also control the overall size of a job.


> Block jobs that have greater than a configured number of tasks
> --
>
> Key: SPARK-20662
> URL: https://issues.apache.org/jira/browse/SPARK-20662
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xuefu Zhang
>
> In a shared cluster, it's desirable for an admin to block large Spark jobs. 
> While there might not be a single metrics defining the size of a job, the 
> number of tasks is usually a good indicator. Thus, it would be useful for 
> Spark scheduler to block a job whose number of tasks reaches a configured 
> limit. By default, the limit could be just infinite, to retain the existing 
> behavior.
> MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be 
> configured, which blocks a MR job at job submission time.
> The proposed configuration is spark.job.max.tasks with a default value -1 
> (infinite).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20972) rename HintInfo.isBroadcastable to forceBroadcast

2017-06-02 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-20972:
---

 Summary: rename HintInfo.isBroadcastable to forceBroadcast
 Key: SPARK-20972
 URL: https://issues.apache.org/jira/browse/SPARK-20972
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks

2017-06-02 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035478#comment-16035478
 ] 

Marcelo Vanzin commented on SPARK-20662:


bq. It's probably not a good idea to let one job takes all resources while 
starving others.

I'm pretty sure that's why resource managers have queues.

What you want here is a client-controlled, opt-in, application-level "nicety 
config" that tells the app not to submit more tasks than a limit at a time. 
That control already exists: set a maximum number of executors for the app. 
Number of executors times number of cores = max number of concurrent tasks.
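
As a concrete illustration of that knob (the app name and numbers below are 
invented, not from this thread): with at most 50 executors of 4 cores each, 
the application can never run more than 50 * 4 = 200 tasks at once, no matter 
how many tasks its stages contain.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative, static-allocation configuration: at most 50 executors with
// 4 cores each, so at most 50 * 4 = 200 tasks ever run concurrently.
val conf = new SparkConf()
  .setAppName("capped-app")
  .set("spark.executor.instances", "50")
  .set("spark.executor.cores", "4")
val sc = new SparkContext(conf)
{code}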

> Block jobs that have greater than a configured number of tasks
> --
>
> Key: SPARK-20662
> URL: https://issues.apache.org/jira/browse/SPARK-20662
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xuefu Zhang
>
> In a shared cluster, it's desirable for an admin to block large Spark jobs. 
> While there might not be a single metrics defining the size of a job, the 
> number of tasks is usually a good indicator. Thus, it would be useful for 
> Spark scheduler to block a job whose number of tasks reaches a configured 
> limit. By default, the limit could be just infinite, to retain the existing 
> behavior.
> MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be 
> configured, which blocks a MR job at job submission time.
> The proposed configuration is spark.job.max.tasks with a default value -1 
> (infinite).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20972) rename HintInfo.isBroadcastable to forceBroadcast

2017-06-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035479#comment-16035479
 ] 

Apache Spark commented on SPARK-20972:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18189

> rename HintInfo.isBroadcastable to forceBroadcast
> -
>
> Key: SPARK-20972
> URL: https://issues.apache.org/jira/browse/SPARK-20972
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20972) rename HintInfo.isBroadcastable to forceBroadcast

2017-06-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20972:


Assignee: Apache Spark  (was: Wenchen Fan)

> rename HintInfo.isBroadcastable to forceBroadcast
> -
>
> Key: SPARK-20972
> URL: https://issues.apache.org/jira/browse/SPARK-20972
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20972) rename HintInfo.isBroadcastable to forceBroadcast

2017-06-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20972:


Assignee: Wenchen Fan  (was: Apache Spark)

> rename HintInfo.isBroadcastable to forceBroadcast
> -
>
> Key: SPARK-20972
> URL: https://issues.apache.org/jira/browse/SPARK-20972
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks

2017-06-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035481#comment-16035481
 ] 

Sean Owen commented on SPARK-20662:
---

It's not equivalent to block the job, but why is that more desirable? Your use 
case is what resource queues are for, and things like the capacity scheduler. 
Yes, you limit the amount of resources a person is entitled to for just that 
reason. A job that's blocked for being "too big" during busy hours may be fine 
to run off hours, but this would mean the job is never runnable, ever. The 
capacity scheduler, in contrast, can let someone use resources when nobody else 
wants them but preempt when someone else needs them, so it doesn't really cost 
anyone else. It just doesn't seem like this is a wheel to reinvent in Spark. 
Possibly in its own standalone resource manager, but if you need functionality 
like this you're not likely to get by with a standalone cluster anyway.

> Block jobs that have greater than a configured number of tasks
> --
>
> Key: SPARK-20662
> URL: https://issues.apache.org/jira/browse/SPARK-20662
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xuefu Zhang
>
> In a shared cluster, it's desirable for an admin to block large Spark jobs. 
> While there might not be a single metrics defining the size of a job, the 
> number of tasks is usually a good indicator. Thus, it would be useful for 
> Spark scheduler to block a job whose number of tasks reaches a configured 
> limit. By default, the limit could be just infinite, to retain the existing 
> behavior.
> MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be 
> configured, which blocks a MR job at job submission time.
> The proposed configuration is spark.job.max.tasks with a default value -1 
> (infinite).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks

2017-06-02 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035487#comment-16035487
 ] 

Marcelo Vanzin commented on SPARK-20662:


BTW if you really, really, really think this is a good idea and you really want 
it, you can write a listener that just cancels jobs or kills the application 
whenever a stage with more than x tasks is submitted.

No need for any changes in Spark.
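
A minimal sketch of that listener approach, assuming a threshold chosen by the 
operator (the class name and limit below are made up for illustration):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted}

// Cancels any stage whose task count exceeds maxTasks. Cancellation is
// best-effort: the stage may already have started running some tasks.
class StageSizeLimiter(sc: SparkContext, maxTasks: Int) extends SparkListener {
  override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
    val info = event.stageInfo
    if (info.numTasks > maxTasks) {
      sc.cancelStage(info.stageId)
    }
  }
}

// Registration, e.g. right after creating the SparkContext:
// sc.addSparkListener(new StageSizeLimiter(sc, maxTasks = 100000))
{code}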

> Block jobs that have greater than a configured number of tasks
> --
>
> Key: SPARK-20662
> URL: https://issues.apache.org/jira/browse/SPARK-20662
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xuefu Zhang
>
> In a shared cluster, it's desirable for an admin to block large Spark jobs. 
> While there might not be a single metrics defining the size of a job, the 
> number of tasks is usually a good indicator. Thus, it would be useful for 
> Spark scheduler to block a job whose number of tasks reaches a configured 
> limit. By default, the limit could be just infinite, to retain the existing 
> behavior.
> MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be 
> configured, which blocks a MR job at job submission time.
> The proposed configuration is spark.job.max.tasks with a default value -1 
> (infinite).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17078) show estimated stats when doing explain

2017-06-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035511#comment-16035511
 ] 

Apache Spark commented on SPARK-17078:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/18190

> show estimated stats when doing explain
> ---
>
> Key: SPARK-17078
> URL: https://issues.apache.org/jira/browse/SPARK-17078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20737) Mechanism for cleanup hooks, for structured-streaming sinks on executor shutdown.

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-20737.

Resolution: Won't Fix

> Mechanism for cleanup hooks, for structured-streaming sinks on executor 
> shutdown.
> -
>
> Key: SPARK-20737
> URL: https://issues.apache.org/jira/browse/SPARK-20737
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Prashant Sharma
>  Labels: Kafka
>
> Add a standard way of cleanup during shutdown of executors for structured 
> streaming sinks in general and KafkaSink in particular.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks

2017-06-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035519#comment-16035519
 ] 

Xuefu Zhang commented on SPARK-20662:
-

I can understand the counterargument here if Spark is targeted at single-user 
cases. For multiple users in an enterprise deployment, it's good to provide 
admin knobs. In this case, an admin just wants to block bad jobs. I don't think 
the RM meets that goal.

This is actually implemented in Hive on Spark. However, I thought this was 
generic and might be desirable for others as well. In addition, blocking a job 
at submission is better than killing it after it has started to run.

If Spark doesn't think this is useful, then very well.

> Block jobs that have greater than a configured number of tasks
> --
>
> Key: SPARK-20662
> URL: https://issues.apache.org/jira/browse/SPARK-20662
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xuefu Zhang
>
> In a shared cluster, it's desirable for an admin to block large Spark jobs. 
> While there might not be a single metrics defining the size of a job, the 
> number of tasks is usually a good indicator. Thus, it would be useful for 
> Spark scheduler to block a job whose number of tasks reaches a configured 
> limit. By default, the limit could be just infinite, to retain the existing 
> behavior.
> MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be 
> configured, which blocks a MR job at job submission time.
> The proposed configuration is spark.job.max.tasks with a default value -1 
> (infinite).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20662) Block jobs that have greater than a configured number of tasks

2017-06-02 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035525#comment-16035525
 ] 

Marcelo Vanzin commented on SPARK-20662:


bq. For multiple users in an enterprise deployment, it's good to provide admin 
knobs. In this case, an admin just wanted to block bad jobs.

Your definition of a bad job is the problem (well, one of the problems). 
"Number of tasks" is not an indication that a job is large. Each task may be 
really small.

Spark shouldn't be in the business of defining what is a good or bad job, and 
that doesn't mean it's targeted at single-user vs. multi-user environments. 
It's just something that needs to be controlled at a different layer. If the 
admin is really worried about resource usage, he has control over the RM, and 
shouldn't rely on applications behaving nicely to enforce those controls. 
Applications misbehave. Users mess with configuration. Those are all things 
outside of the admin's control.

> Block jobs that have greater than a configured number of tasks
> --
>
> Key: SPARK-20662
> URL: https://issues.apache.org/jira/browse/SPARK-20662
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xuefu Zhang
>
> In a shared cluster, it's desirable for an admin to block large Spark jobs. 
> While there might not be a single metrics defining the size of a job, the 
> number of tasks is usually a good indicator. Thus, it would be useful for 
> Spark scheduler to block a job whose number of tasks reaches a configured 
> limit. By default, the limit could be just infinite, to retain the existing 
> behavior.
> MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce to be 
> configured, which blocks a MR job at job submission time.
> The proposed configuration is spark.job.max.tasks with a default value -1 
> (infinite).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20973) insert table fail caused by unable to fetch data definition file from remote hdfs

2017-06-02 Thread Yunjian Zhang (JIRA)
Yunjian Zhang created SPARK-20973:
-

 Summary: insert table fail caused by unable to fetch data 
definition file from remote hdfs 
 Key: SPARK-20973
 URL: https://issues.apache.org/jira/browse/SPARK-20973
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Yunjian Zhang


I implemented my own Hive SerDe to handle special data files; it needs to read 
a data definition file during processing.
The process includes:
1. read the definition file location from TBLPROPERTIES
2. read the file content as per step 1
3. initialize the SerDe based on step 2
//DDL of the table as below:
-
CREATE EXTERNAL TABLE dw_user_stg_txt_out
ROW FORMAT SERDE 'com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe'
STORED AS
  INPUTFORMAT 'com.ebay.dss.gdr.mapred.AbAsAvroInputFormat'
  OUTPUTFORMAT 'com.ebay.dss.gdr.hive.ql.io.ab.AvroAsAbOutputFormat'
LOCATION 'hdfs://${remote_hdfs}/user/data'
TBLPROPERTIES (
  'com.ebay.dss.dml.file' = 'hdfs://${remote_hdfs}/dml/user.dml'
)
// insert statement
insert overwrite table dw_user_stg_txt_out select * from dw_user_stg_txt_avro;
//fail with ERROR
17/06/02 15:46:34 ERROR SparkSQLDriver: Failed in [insert overwrite table 
dw_user_stg_txt_out select * from dw_user_stg_txt_avro]
java.lang.RuntimeException: FAILED to get dml file from: 
hdfs://${remote-hdfs}/dml/user.dml
at 
com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe.initialize(AbvroSerDe.java:109)
at 
org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:160)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:258)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347)
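
For illustration, a hedged sketch (not the reporter's actual SerDe code) of 
why a live Hadoop Configuration matters here: a SerDe that fetches a 
definition file named in TBLPROPERTIES has to resolve the remote HDFS path 
through the job configuration, and a null Configuration leaves no way to do 
that, which matches the error above. The property key and path come from the 
DDL in this report; the helper name is made up.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Reads the definition file referenced by the table property. If conf is null
// (as happens when the serializer is initialized without the job conf), the
// remote HDFS filesystem cannot be resolved and the read fails.
def readDefinition(conf: Configuration, props: java.util.Properties): String = {
  val dmlPath = new Path(props.getProperty("com.ebay.dss.dml.file"))
  val fs = FileSystem.get(dmlPath.toUri, conf)
  val in = fs.open(dmlPath)
  try scala.io.Source.fromInputStream(in).mkString finally in.close()
}
{code}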



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20973) insert table fail caused by unable to fetch data definition file from remote hdfs

2017-06-02 Thread Yunjian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035597#comment-16035597
 ] 

Yunjian Zhang commented on SPARK-20973:
---

I checked the source code and added a patch to fix the insert issue.

> insert table fail caused by unable to fetch data definition file from remote 
> hdfs 
> --
>
> Key: SPARK-20973
> URL: https://issues.apache.org/jira/browse/SPARK-20973
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yunjian Zhang
>  Labels: patch
>
> I implemented my own hive serde to handle special data files which needs to 
> read data definition during process.
> the process include
> 1.read definition file location from TBLPROPERTIES
> 2.read file content as per step 1
> 3.init serde base on step 2.
> //DDL of the table as below:
> -
> CREATE EXTERNAL TABLE dw_user_stg_txt_out
> ROW FORMAT SERDE 'com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe'
> STORED AS
>   INPUTFORMAT 'com.ebay.dss.gdr.mapred.AbAsAvroInputFormat'
>   OUTPUTFORMAT 'com.ebay.dss.gdr.hive.ql.io.ab.AvroAsAbOutputFormat'
> LOCATION 'hdfs://${remote_hdfs}/user/data'
> TBLPROPERTIES (
>   'com.ebay.dss.dml.file' = 'hdfs://${remote_hdfs}/dml/user.dml'
> )
> // insert statement
> insert overwrite table dw_user_stg_txt_out select * from dw_user_stg_txt_avro;
> //fail with ERROR
> 17/06/02 15:46:34 ERROR SparkSQLDriver: Failed in [insert overwrite table 
> dw_user_stg_txt_out select * from dw_user_stg_txt_avro]
> java.lang.RuntimeException: FAILED to get dml file from: 
> hdfs://${remote-hdfs}/dml/user.dml
>   at 
> com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe.initialize(AbvroSerDe.java:109)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:160)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:258)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20973) insert table fail caused by unable to fetch data definition file from remote hdfs

2017-06-02 Thread Yunjian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035597#comment-16035597
 ] 

Yunjian Zhang edited comment on SPARK-20973 at 6/2/17 11:06 PM:


I checked the source code and added a patch to fix the insert issue, as below. 
I was unable to attach a file here, so I just pasted the content.
--
--- a/./workspace1/spark-2.1.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala
+++ b/./workspace/git/gdr/spark/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala
@@ -57,7 +57,7 @@ private[hive] class SparkHiveWriterContainer(
   extends Logging
   with HiveInspectors
   with Serializable {
-
+
   private val now = new Date()
   private val tableDesc: TableDesc = fileSinkConf.getTableInfo
   // Add table properties from storage handler to jobConf, so any custom storage
@@ -154,6 +154,12 @@ private[hive] class SparkHiveWriterContainer(
     conf.value.setBoolean("mapred.task.is.map", true)
     conf.value.setInt("mapred.task.partition", splitID)
   }
+
+  def newSerializer(tableDesc: TableDesc): Serializer = {
+    val serializer = tableDesc.getDeserializerClass.newInstance().asInstanceOf[Serializer]
+    serializer.initialize(null, tableDesc.getProperties)
+    serializer
+  }
 
   def newSerializer(jobConf: JobConf, tableDesc: TableDesc): Serializer = {
     val serializer = tableDesc.getDeserializerClass.newInstance().asInstanceOf[Serializer]
@@ -162,10 +168,11 @@ private[hive] class SparkHiveWriterContainer(
   }
 
   protected def prepareForWrite() = {
-    val serializer = newSerializer(jobConf, fileSinkConf.getTableInfo)
+    val serializer = newSerializer(conf.value, fileSinkConf.getTableInfo)
+    logInfo("CHECK table deser:" + fileSinkConf.getTableInfo.getDeserializer(conf.value))
     val standardOI = ObjectInspectorUtils
       .getStandardObjectInspector(
-        fileSinkConf.getTableInfo.getDeserializer.getObjectInspector,
+        fileSinkConf.getTableInfo.getDeserializer(conf.value).getObjectInspector,
         ObjectInspectorCopyOption.JAVA)
       .asInstanceOf[StructObjectInspector]


was (Author: daniel.yj.zh...@gmail.com):
I did check the source code and add a patch to fix the insert issue

> insert table fail caused by unable to fetch data definition file from remote 
> hdfs 
> --
>
> Key: SPARK-20973
> URL: https://issues.apache.org/jira/browse/SPARK-20973
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yunjian Zhang
>  Labels: patch
>
> I implemented my own hive serde to handle special data files which needs to 
> read data definition during process.
> the process include
> 1.read definition file location from TBLPROPERTIES
> 2.read file content as per step 1
> 3.init serde base on step 2.
> //DDL of the table as below:
> -
> CREATE EXTERNAL TABLE dw_user_stg_txt_out
> ROW FORMAT SERDE 'com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe'
> STORED AS
>   INPUTFORMAT 'com.ebay.dss.gdr.mapred.AbAsAvroInputFormat'
>   OUTPUTFORMAT 'com.ebay.dss.gdr.hive.ql.io.ab.AvroAsAbOutputFormat'
> LOCATION 'hdfs://${remote_hdfs}/user/data'
> TBLPROPERTIES (
>   'com.ebay.dss.dml.file' = 'hdfs://${remote_hdfs}/dml/user.dml'
> )
> // insert statement
> insert overwrite table dw_user_stg_txt_out select * from dw_user_stg_txt_avro;
> //fail with ERROR
> 17/06/02 15:46:34 ERROR SparkSQLDriver: Failed in [insert overwrite table 
> dw_user_stg_txt_out select * from dw_user_stg_txt_avro]
> java.lang.RuntimeException: FAILED to get dml file from: 
> hdfs://${remote-hdfs}/dml/user.dml
>   at 
> com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe.initialize(AbvroSerDe.java:109)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:160)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:258)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20974) we should run REPL tests if SQL core has code changes

2017-06-02 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-20974:
---

 Summary: we should run REPL tests if SQL core has code changes
 Key: SPARK-20974
 URL: https://issues.apache.org/jira/browse/SPARK-20974
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20974) we should run REPL tests if SQL core has code changes

2017-06-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035614#comment-16035614
 ] 

Apache Spark commented on SPARK-20974:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18191

> we should run REPL tests if SQL core has code changes
> -
>
> Key: SPARK-20974
> URL: https://issues.apache.org/jira/browse/SPARK-20974
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20974) we should run REPL tests if SQL core has code changes

2017-06-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20974:


Assignee: Wenchen Fan  (was: Apache Spark)

> we should run REPL tests if SQL core has code changes
> -
>
> Key: SPARK-20974
> URL: https://issues.apache.org/jira/browse/SPARK-20974
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20974) we should run REPL tests if SQL core has code changes

2017-06-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20974:


Assignee: Apache Spark  (was: Wenchen Fan)

> we should run REPL tests if SQL core has code changes
> -
>
> Key: SPARK-20974
> URL: https://issues.apache.org/jira/browse/SPARK-20974
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20974) we should run REPL tests if SQL core has code changes

2017-06-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20974.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.2
   2.0.3

> we should run REPL tests if SQL core has code changes
> -
>
> Key: SPARK-20974
> URL: https://issues.apache.org/jira/browse/SPARK-20974
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19732) DataFrame.fillna() does not work for bools in PySpark

2017-06-02 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-19732.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18164
[https://github.com/apache/spark/pull/18164]

> DataFrame.fillna() does not work for bools in PySpark
> -
>
> Key: SPARK-19732
> URL: https://issues.apache.org/jira/browse/SPARK-19732
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Len Frodgers
>Priority: Minor
> Fix For: 2.3.0
>
>
> In PySpark, the fillna function of DataFrame inadvertently casts bools to 
> ints, so fillna cannot be used to fill True/False.
> e.g. 
> `spark.createDataFrame([Row(a=True),Row(a=None)]).fillna(True).collect()` 
> yields
> `[Row(a=True), Row(a=None)]`
> It should be a=True for the second Row
> The cause is this bit of code: 
> {code}
> if isinstance(value, (int, long)):
> value = float(value)
> {code}
> There needs to be a separate check for isinstance(bool), since in python, 
> bools are ints too
> Additionally there's another anomaly:
> Spark (and pyspark) supports filling of bools if you specify the args as a 
> map: 
> {code}
> fillna({"a": False})
> {code}
> , but not if you specify it as
> {code}
> fillna(False)
> {code}
> This is because (scala-)Spark has no
> {code}
> def fill(value: Boolean): DataFrame = fill(value, df.columns)
> {code}
>  method. I find that strange/buggy



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19732) DataFrame.fillna() does not work for bools in PySpark

2017-06-02 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-19732:
-

Assignee: Ruben Berenguel

> DataFrame.fillna() does not work for bools in PySpark
> -
>
> Key: SPARK-19732
> URL: https://issues.apache.org/jira/browse/SPARK-19732
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Len Frodgers
>Assignee: Ruben Berenguel
>Priority: Minor
> Fix For: 2.3.0
>
>
> In PySpark, the fillna function of DataFrame inadvertently casts bools to 
> ints, so fillna cannot be used to fill True/False.
> e.g. 
> `spark.createDataFrame([Row(a=True),Row(a=None)]).fillna(True).collect()` 
> yields
> `[Row(a=True), Row(a=None)]`
> It should be a=True for the second Row
> The cause is this bit of code: 
> {code}
> if isinstance(value, (int, long)):
> value = float(value)
> {code}
> There needs to be a separate check for isinstance(bool), since in python, 
> bools are ints too
> Additionally there's another anomaly:
> Spark (and pyspark) supports filling of bools if you specify the args as a 
> map: 
> {code}
> fillna({"a": False})
> {code}
> , but not if you specify it as
> {code}
> fillna(False)
> {code}
> This is because (scala-)Spark has no
> {code}
> def fill(value: Boolean): DataFrame = fill(value, df.columns)
> {code}
>  method. I find that strange/buggy



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) Improve diskWriteBufferSize configurable

2017-06-02 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Summary: Improve diskWriteBufferSize configurable  (was: Improve 
Serializerbuffersize configurable)

> Improve diskWriteBufferSize configurable
> 
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> 1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize 
> of UnsafeShuffleWriter.
> 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
> mergeSpillsWithFileStream function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) Improve diskWriteBufferSize configurable

2017-06-02 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Description: 
1. Make the diskWriteBufferSize of ShuffleExternalSorter configurable via 
spark.shuffle.spill.diskWriteBufferSize.
2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize 
them in the mergeSpillsWithFileStream function.

  was:
1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize 
of UnsafeShuffleWriter.
2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function.


> Improve diskWriteBufferSize configurable
> 
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize 
> of ShuffleExternalSorter.
> 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
> mergeSpillsWithFileStream function.
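
A sketch of how the proposed knob would be used, assuming the key is adopted 
under the name given in this description (it is a proposal here, not an 
existing setting, and the value is only an example):

{code}
import org.apache.spark.SparkConf

// Illustrative value: raise the per-spill disk write buffer to 1 MiB.
val conf = new SparkConf()
  .set("spark.shuffle.spill.diskWriteBufferSize", (1024 * 1024).toString)
{code}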



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20025) Driver fail over will not work, if SPARK_LOCAL* env is set.

2017-06-02 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-20025:

Target Version/s: 2.3.0  (was: 2.2.0)

> Driver fail over will not work, if SPARK_LOCAL* env is set.
> ---
>
> Key: SPARK-20025
> URL: https://issues.apache.org/jira/browse/SPARK-20025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Prashant Sharma
>
> In a bare-metal system with no DNS setup, Spark may be configured with 
> SPARK_LOCAL* for IP and host properties.
> During a driver failover in cluster deployment mode, SPARK_LOCAL* should be 
> ignored while auto-deploying and should instead be picked up from the target 
> system's local environment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org