[jira] [Created] (SPARK-15863) Update SQL programming guide for Spark 2.0

2016-06-09 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-15863:
--

 Summary: Update SQL programming guide for Spark 2.0
 Key: SPARK-15863
 URL: https://issues.apache.org/jira/browse/SPARK-15863
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15696) Improve `crosstab` to have a consistent column order

2016-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15696.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> Improve `crosstab` to have a consistent column order 
> -
>
> Key: SPARK-15696
> URL: https://issues.apache.org/jira/browse/SPARK-15696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> Currently, `crosstab` produces **randomly ordered** columns obtained by a plain 
> `distinct`. The documentation of `crosstab`, however, shows the result in 
> sorted order, which differs from the implementation.
> {code}
> scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 
> 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
> +---------+---+---+---+
> |key_value|  3|  2|  1|
> +---------+---+---+---+
> |        2|  1|  0|  2|
> |        1|  0|  1|  1|
> |        3|  1|  1|  0|
> +---------+---+---+---+
> scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, 
> "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", 
> "value").show()
> +---------+---+---+---+
> |key_value|  c|  a|  b|
> +---------+---+---+---+
> |        2|  1|  2|  0|
> |        1|  0|  1|  1|
> |        3|  1|  0|  1|
> +---------+---+---+---+
> {code}
> This issue explicitly constructs the columns in sorted order to improve the 
> user experience. This implementation also matches the result shown in the 
> documentation.
> {code}
> scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 
> 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
> +---------+---+---+---+
> |key_value|  1|  2|  3|
> +---------+---+---+---+
> |        2|  2|  0|  1|
> |        1|  1|  1|  0|
> |        3|  0|  1|  1|
> +---------+---+---+---+
> scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, 
> "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", 
> "value").show()
> +---------+---+---+---+
> |key_value|  a|  b|  c|
> +---------+---+---+---+
> |        2|  2|  0|  1|
> |        1|  1|  1|  0|
> |        3|  0|  1|  1|
> +---------+---+---+---+
> {code}
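
A minimal sketch of the idea (not necessarily the exact patch): derive the output
columns from the sorted distinct values so the column order is deterministic. The
names below are illustrative and assume a SparkSession `spark` as in the examples
above.

{code}
// Same sample data as the second example above.
val df = spark.createDataFrame(
  Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))
).toDF("key", "value")

// Collect the distinct values of the pivoted column and sort them, so the
// crosstab output columns always appear in the same, documented order.
val sortedColumns = df.select("value").distinct()
  .collect()
  .map(_.get(0).toString)
  .sorted
// sortedColumns: Array(a, b, c)
{code}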



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15791) NPE in ScalarSubquery

2016-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15791.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> NPE in ScalarSubquery
> -
>
> Key: SPARK-15791
> URL: https://issues.apache.org/jira/browse/SPARK-15791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Eric Liang
> Fix For: 2.0.0
>
>
> {code}
> Job aborted due to stage failure: Task 0 in stage 146.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 146.0 (TID 48828, 10.0.206.208): 
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.dataType(subquery.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.CaseWhenBase.dataType(conditionalExpressions.scala:103)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:165)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.execution.ProjectExec.output(basicPhysicalOperators.scala:33)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.output(WholeStageCodegenExec.scala:291)
>   at 
> org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:85)
>   at 
> org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:84)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15842) Add support for socket stream.

2016-06-09 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma closed SPARK-15842.
---
Resolution: Not A Problem

> Add support for socket stream.
> --
>
> Key: SPARK-15842
> URL: https://issues.apache.org/jira/browse/SPARK-15842
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>
> Streaming so far has offset-based sources, and all of the available sources, 
> such as the file source and the memory source, need no additional capabilities 
> to serve offsets for any given range.
> A socket stream has a very small buffer at the OS level. Many message queues 
> can keep a message lingering until it is read by the receiving end; ZeroMQ is 
> one such example. A plain socket stream, however, does not support this.
> The challenge here is to buffer data for a configurable amount of time and to 
> settle on strategies for overflow and underflow.
> This JIRA will form the basis for implementing sources that have no native 
> support for retaining a message until it is read. It covers a design doc, if 
> necessary, and the supporting code to implement such sources.
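
A rough sketch of the buffering idea, under stated assumptions: a fixed retention
window, line-oriented data, and made-up names. This is only an illustration of the
design question above, not an actual Spark source implementation.

{code}
import scala.collection.mutable

// Keeps received lines for a configurable retention window so that a range of
// synthetic offsets can be replayed. The overflow/underflow policy (what to do
// when requested offsets have already been evicted) is exactly the open question.
class TimeBoundedBuffer(retentionMs: Long) {
  private case class Entry(offset: Long, data: String, arrivalMs: Long)
  private val entries = mutable.ArrayBuffer.empty[Entry]
  private var nextOffset = 0L

  def append(data: String): Long = synchronized {
    val offset = nextOffset
    entries += Entry(offset, data, System.currentTimeMillis())
    nextOffset += 1
    evictExpired()
    offset
  }

  def getRange(start: Long, end: Long): Seq[String] = synchronized {
    entries.filter(e => e.offset >= start && e.offset < end).map(_.data)
  }

  private def evictExpired(): Unit = {
    val cutoff = System.currentTimeMillis() - retentionMs
    val kept = entries.filter(_.arrivalMs >= cutoff)
    entries.clear()
    entries ++= kept
  }
}
{code}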



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15842) Add support for socket stream.

2016-06-09 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323837#comment-15323837
 ] 

Prashant Sharma commented on SPARK-15842:
-

Thank you for making it clear.

The actual question I had was: "What if we could give exactly-once guarantees only 
for a configurable amount of time?"

In some sense, even a socket stream can have a notion of a per-record offset, by 
introducing some kind of control bit. But it certainly does not support the 
features (such as replaying an arbitrary range of past data) that most message 
queues come with built in. Having this would also require our own mechanism for 
end-to-end exactly-once guarantees, which is non-trivial: the receiver would have 
to be a long-running thread, and we would then have to worry about its failover 
and address challenges like scaling.

This certainly puts it at odds with the current design of Structured Streaming.

Also, anyone who would like to use a socket stream can always deploy Kafka or a 
similar message queue as middleware and get all the guarantees that Structured 
Streaming intends to provide.



> Add support for socket stream.
> --
>
> Key: SPARK-15842
> URL: https://issues.apache.org/jira/browse/SPARK-15842
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>
> Streaming so far has offset-based sources, and all of the available sources, 
> such as the file source and the memory source, need no additional capabilities 
> to serve offsets for any given range.
> A socket stream has a very small buffer at the OS level. Many message queues 
> can keep a message lingering until it is read by the receiving end; ZeroMQ is 
> one such example. A plain socket stream, however, does not support this.
> The challenge here is to buffer data for a configurable amount of time and to 
> settle on strategies for overflow and underflow.
> This JIRA will form the basis for implementing sources that have no native 
> support for retaining a message until it is read. It covers a design doc, if 
> necessary, and the supporting code to implement such sources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15838) CACHE TABLE AS SELECT should not replace the existing Temp Table

2016-06-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-15838.
---
Resolution: Won't Fix

> CACHE TABLE AS SELECT should not replace the existing Temp Table
> 
>
> Key: SPARK-15838
> URL: https://issues.apache.org/jira/browse/SPARK-15838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, {{CACHE TABLE AS SELECT}} replaces the existing temp table if it 
> already exists. This behavior differs from `CREATE TABLE` or `CREATE VIEW`, and 
> it looks risky.
> Better Error Message When Having Database Name in CACHE TABLE AS SELECT



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15862) Better Error Message When Having Database Name in CACHE TABLE AS SELECT

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323828#comment-15323828
 ] 

Apache Spark commented on SPARK-15862:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/13572

> Better Error Message When Having Database Name in CACHE TABLE AS SELECT
> ---
>
> Key: SPARK-15862
> URL: https://issues.apache.org/jira/browse/SPARK-15862
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Minor
>
> The table name in CACHE TABLE AS SELECT should NOT contain a database prefix 
> such as "database.table". Thus, this PR catches this case in the parser and 
> outputs a better error message, instead of reporting that the view already 
> exists.
> In addition, this JIRA addresses a few related issues: 1) refactor the parser 
> to generate table identifiers instead of returning the table name as a string; 
> 2) add test cases for caching and uncaching qualified table names; 3) fix a few 
> test cases that do not drop the temp table at the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15862) Better Error Message When Having Database Name in CACHE TABLE AS SELECT

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15862:


Assignee: Apache Spark

> Better Error Message When Having Database Name in CACHE TABLE AS SELECT
> ---
>
> Key: SPARK-15862
> URL: https://issues.apache.org/jira/browse/SPARK-15862
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Minor
>
> The table name in CACHE TABLE AS SELECT should NOT contain a database prefix 
> such as "database.table". Thus, this PR catches this case in the parser and 
> outputs a better error message, instead of reporting that the view already 
> exists.
> In addition, this JIRA addresses a few related issues: 1) refactor the parser 
> to generate table identifiers instead of returning the table name as a string; 
> 2) add test cases for caching and uncaching qualified table names; 3) fix a few 
> test cases that do not drop the temp table at the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15862) Better Error Message When Having Database Name in CACHE TABLE AS SELECT

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15862:


Assignee: (was: Apache Spark)

> Better Error Message When Having Database Name in CACHE TABLE AS SELECT
> ---
>
> Key: SPARK-15862
> URL: https://issues.apache.org/jira/browse/SPARK-15862
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Minor
>
> The table name in CACHE TABLE AS SELECT should NOT contain a database prefix 
> such as "database.table". Thus, this PR catches this case in the parser and 
> outputs a better error message, instead of reporting that the view already 
> exists.
> In addition, this JIRA addresses a few related issues: 1) refactor the parser 
> to generate table identifiers instead of returning the table name as a string; 
> 2) add test cases for caching and uncaching qualified table names; 3) fix a few 
> test cases that do not drop the temp table at the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15862) Better Error Message When Having Database Name in CACHE TABLE AS SELECT

2016-06-09 Thread Xiao Li (JIRA)
Xiao Li created SPARK-15862:
---

 Summary: Better Error Message When Having Database Name in CACHE 
TABLE AS SELECT
 Key: SPARK-15862
 URL: https://issues.apache.org/jira/browse/SPARK-15862
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li
Priority: Minor


The table name in CACHE TABLE AS SELECT should NOT contain a database prefix such 
as "database.table". Thus, this PR catches this case in the parser and outputs a 
better error message, instead of reporting that the view already exists.

In addition, this JIRA addresses a few related issues: 1) refactor the parser to 
generate table identifiers instead of returning the table name as a string; 2) add 
test cases for caching and uncaching qualified table names; 3) fix a few test 
cases that do not drop the temp table at the end.
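
A hedged illustration of the case in question, assuming a SparkSession `spark` and
made-up table names; the error text in the comments is paraphrased, not Spark's
exact message:

{code}
// Qualified name in CACHE TABLE AS SELECT -- the statement this change rejects early.
spark.sql("CACHE TABLE mydb.cached_t AS SELECT * FROM src")
// Before: the statement fails later with a confusing "view already exists"-style error.
// After:  the parser rejects the qualified name "mydb.cached_t" up front with a
//         message that points at the database prefix.
{code}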



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15838) CACHE TABLE AS SELECT should not replace the existing Temp Table

2016-06-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-15838:

Description: 
-Currently, {{CACHE TABLE AS SELECT}} replaces the existing Temp Table, if 
existed. This behavior is different from `CREATE TABLE` or `CREATE VIEW`. It 
looks risky.-

Better Error Message When Having Database Name in CACHE TABLE AS SELECT

  was:Currently, {{CACHE TABLE AS SELECT}} replaces the existing Temp Table, if 
existed. This behavior is different from `CREATE TABLE` or `CREATE VIEW`. It 
looks risky.


> CACHE TABLE AS SELECT should not replace the existing Temp Table
> 
>
> Key: SPARK-15838
> URL: https://issues.apache.org/jira/browse/SPARK-15838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> -Currently, {{CACHE TABLE AS SELECT}} replaces the existing Temp Table, if 
> existed. This behavior is different from `CREATE TABLE` or `CREATE VIEW`. It 
> looks risky.-
> Better Error Message When Having Database Name in CACHE TABLE AS SELECT



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15838) CACHE TABLE AS SELECT should not replace the existing Temp Table

2016-06-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-15838:

Description: 
Currently, {{CACHE TABLE AS SELECT}} replaces the existing temp table if it 
already exists. This behavior differs from `CREATE TABLE` or `CREATE VIEW`, and it 
looks risky.

Better Error Message When Having Database Name in CACHE TABLE AS SELECT

  was:
-Currently, {{CACHE TABLE AS SELECT}} replaces the existing Temp Table, if 
existed. This behavior is different from `CREATE TABLE` or `CREATE VIEW`. It 
looks risky.-

Better Error Message When Having Database Name in CACHE TABLE AS SELECT


> CACHE TABLE AS SELECT should not replace the existing Temp Table
> 
>
> Key: SPARK-15838
> URL: https://issues.apache.org/jira/browse/SPARK-15838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, {{CACHE TABLE AS SELECT}} replaces the existing temp table if it 
> already exists. This behavior differs from `CREATE TABLE` or `CREATE VIEW`, and 
> it looks risky.
> Better Error Message When Having Database Name in CACHE TABLE AS SELECT



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15861) pyspark mapPartitions with non-generator functions / functors

2016-06-09 Thread Greg Bowyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Bowyer updated SPARK-15861:

Description: 
Hi all, it appears that the method `rdd.mapPartitions` does odd things if it is 
fed a normal subroutine.

For instance, let's say we have the following:

{code}
rows = range(25)
rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
rdd = sc.parallelize(rows)

def to_np(data):
return np.array(list(data))

rdd.mapPartitions(to_np).collect()
...
[array([0, 1, 2, 3, 4]),
 array([5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19]),
 array([20, 21, 22, 23, 24])]

rdd.mapPartitions(to_np, preservePartitioning=True).collect()
...
[array([0, 1, 2, 3, 4]),
 array([5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19]),
 array([20, 21, 22, 23, 24])]
{code}

This basically makes a provided function that returns (rather than yields) act as 
if the end user had called {code}rdd.map{code}.

I think that maybe a check should be put in that calls 
{code}inspect.isgeneratorfunction{code}?

  was:
Hi all, it appears that the method `rdd.mapPartitions` does odd things if it is 
fed a normal subroutine.

For instance, lets say we have the following

{code:python}
rows = range(25)
rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
rdd = sc.parallelize(rows)

def to_np(data):
return np.array(list(data))

rdd.mapPartitions(to_np).collect()
...
[array([0, 1, 2, 3, 4]),
 array([5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19]),
 array([20, 21, 22, 23, 24])]

rdd.mapPartitions(to_np, preservePartitioning=True).collect()
...
[array([0, 1, 2, 3, 4]),
 array([5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19]),
 array([20, 21, 22, 23, 24])]
{code}

This basically makes the provided function that did return act like the end 
user called {code}rdd.map{code}

I think that maybe a check should be put in to call 
{code:python}inspect.isgeneratorfunction{code}
?


> pyspark mapPartitions with non-generator functions / functors
> --
>
> Key: SPARK-15861
> URL: https://issues.apache.org/jira/browse/SPARK-15861
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Greg Bowyer
>Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` does odd things if it 
> is fed a normal subroutine.
> For instance, let's say we have the following:
> {code}
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows)
> def to_np(data):
> return np.array(list(data))
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> rdd.mapPartitions(to_np, preservePartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> {code}
> This basically makes a provided function that returns (rather than yields) act 
> as if the end user had called {code}rdd.map{code}.
> I think that maybe a check should be put in that calls 
> {code}inspect.isgeneratorfunction{code}?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15861) pyspark mapPartitions with non-generator functions / functors

2016-06-09 Thread Greg Bowyer (JIRA)
Greg Bowyer created SPARK-15861:
---

 Summary: pyspark mapPartitions with non-generator functions / functors
 Key: SPARK-15861
 URL: https://issues.apache.org/jira/browse/SPARK-15861
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.1
Reporter: Greg Bowyer
Priority: Minor


Hi all, it appears that the method `rdd.mapPartitions` does odd things if it is 
fed a normal subroutine.

For instance, let's say we have the following:

{code:python}
rows = range(25)
rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
rdd = sc.parallelize(rows)

def to_np(data):
return np.array(list(data))

rdd.mapPartitions(to_np).collect()
...
[array([0, 1, 2, 3, 4]),
 array([5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19]),
 array([20, 21, 22, 23, 24])]

rdd.mapPartitions(to_np, preservePartitioning=True).collect()
...
[array([0, 1, 2, 3, 4]),
 array([5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19]),
 array([20, 21, 22, 23, 24])]
{code}

This basically makes a provided function that returns (rather than yields) act as 
if the end user had called {code}rdd.map{code}.

I think that maybe a check should be put in that calls 
{code:python}inspect.isgeneratorfunction{code}?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15858) "evaluateEachIteration" will fail on trying to run it on a model with 500+ trees

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323797#comment-15323797
 ] 

Apache Spark commented on SPARK-15858:
--

User 'mhmoudr' has created a pull request for this issue:
https://github.com/apache/spark/pull/13590

> "evaluateEachIteration" will fail on trying to run it on a model with 500+ 
> trees 
> -
>
> Key: SPARK-15858
> URL: https://issues.apache.org/jira/browse/SPARK-15858
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Mahmoud Rawas
>
> This line:
> remappedData.zip(predictionAndError).mapPartitions
> causes a stack overflow exception on executors after nearly 300 iterations. 
> Also, with this number of trees, keeping a var RDD leads to needless memory 
> allocation.
> This functionality was tested on version 1.6.1.
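
One generic way to keep a lineage chain like this from growing without bound is to
checkpoint the iteratively re-derived RDD every few iterations. The sketch below is
a common pattern with made-up names, not necessarily what the linked pull request
does; it assumes SparkContext.setCheckpointDir has been called.

{code}
import org.apache.spark.rdd.RDD

// Applies `step` repeatedly, truncating the RDD lineage every `checkpointInterval`
// iterations so the serialized task graph (and the risk of a stack overflow when
// it is traversed) stays bounded.
def iterate[T](initial: RDD[T], numIterations: Int, checkpointInterval: Int = 50)
              (step: RDD[T] => RDD[T]): RDD[T] = {
  var current = initial
  for (i <- 1 to numIterations) {
    current = step(current)
    if (i % checkpointInterval == 0) {
      current.checkpoint()  // marks the RDD for checkpointing
      current.count()       // forces materialization so the checkpoint actually happens
    }
  }
  current
}
{code}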



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15825) sort-merge-join gives invalid results when joining on a tupled key

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15825:


Assignee: Apache Spark

> sort-merge-join gives invalid results when joining on a tupled key
> --
>
> Key: SPARK-15825
> URL: https://issues.apache.org/jira/browse/SPARK-15825
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT
>Reporter: Andres Perez
>Assignee: Apache Spark
>
> {noformat}
>   import org.apache.spark.sql.functions
>   val left = List("0", "1", "2").toDS()
> .map{ k => ((k, 0), "l") }
>   val right = List("0", "1", "2").toDS()
> .map{ k => ((k, 0), "r") }
>   val result = left.toDF("k", "v").as[((String, Int), String)].alias("left")
> .joinWith(right.toDF("k", "v").as[((String, Int), 
> String)].alias("right"), functions.col("left.k") === 
> functions.col("right.k"), "inner")
> .as[(((String, Int), String), ((String, Int), String))]
> {noformat}
> When broadcast joins are enabled, we get the expected output:
> {noformat}
> (((0,0),l),((0,0),r))
> (((1,0),l),((1,0),r))
> (((2,0),l),((2,0),r))
> {noformat}
> However, when broadcast joins are disabled (i.e. setting 
> spark.sql.autoBroadcastJoinThreshold to -1), the result is incorrect:
> {noformat}
> (((2,0),l),((2,-1),))
> (((0,0),l),((0,-313907893),))
> (((1,0),l),((null,-313907893),))
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String with spark.memory.offHeap.enabled=true

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15822:


Assignee: Apache Spark

> segmentation violation in o.a.s.unsafe.types.UTF8String with 
> spark.memory.offHeap.enabled=true
> --
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Assignee: Apache Spark
>Priority: Blocker
>
> Executors fail with segmentation violation while running application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a null pointer exception at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}
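
For reference, the off-heap settings mentioned above, set programmatically (this is
equivalent to passing them via --conf or spark-defaults.conf):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")  // enable off-heap memory, as in the report
  .set("spark.memory.offHeap.size", "512m")     // size used when reproducing the crash
{code}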



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15825) sort-merge-join gives invalid results when joining on a tupled key

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15825:


Assignee: (was: Apache Spark)

> sort-merge-join gives invalid results when joining on a tupled key
> --
>
> Key: SPARK-15825
> URL: https://issues.apache.org/jira/browse/SPARK-15825
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT
>Reporter: Andres Perez
>
> {noformat}
>   import org.apache.spark.sql.functions
>   val left = List("0", "1", "2").toDS()
> .map{ k => ((k, 0), "l") }
>   val right = List("0", "1", "2").toDS()
> .map{ k => ((k, 0), "r") }
>   val result = left.toDF("k", "v").as[((String, Int), String)].alias("left")
> .joinWith(right.toDF("k", "v").as[((String, Int), 
> String)].alias("right"), functions.col("left.k") === 
> functions.col("right.k"), "inner")
> .as[(((String, Int), String), ((String, Int), String))]
> {noformat}
> When broadcast joins are enabled, we get the expected output:
> {noformat}
> (((0,0),l),((0,0),r))
> (((1,0),l),((1,0),r))
> (((2,0),l),((2,0),r))
> {noformat}
> However, when broadcast joins are disabled (i.e. setting 
> spark.sql.autoBroadcastJoinThreshold to -1), the result is incorrect:
> {noformat}
> (((2,0),l),((2,-1),))
> (((0,0),l),((0,-313907893),))
> (((1,0),l),((null,-313907893),))
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15825) sort-merge-join gives invalid results when joining on a tupled key

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323783#comment-15323783
 ] 

Apache Spark commented on SPARK-15825:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/13589

> sort-merge-join gives invalid results when joining on a tupled key
> --
>
> Key: SPARK-15825
> URL: https://issues.apache.org/jira/browse/SPARK-15825
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT
>Reporter: Andres Perez
>
> {noformat}
>   import org.apache.spark.sql.functions
>   val left = List("0", "1", "2").toDS()
> .map{ k => ((k, 0), "l") }
>   val right = List("0", "1", "2").toDS()
> .map{ k => ((k, 0), "r") }
>   val result = left.toDF("k", "v").as[((String, Int), String)].alias("left")
> .joinWith(right.toDF("k", "v").as[((String, Int), 
> String)].alias("right"), functions.col("left.k") === 
> functions.col("right.k"), "inner")
> .as[(((String, Int), String), ((String, Int), String))]
> {noformat}
> When broadcast joins are enabled, we get the expected output:
> {noformat}
> (((0,0),l),((0,0),r))
> (((1,0),l),((1,0),r))
> (((2,0),l),((2,0),r))
> {noformat}
> However, when broadcast joins are disabled (i.e. setting 
> spark.sql.autoBroadcastJoinThreshold to -1), the result is incorrect:
> {noformat}
> (((2,0),l),((2,-1),))
> (((0,0),l),((0,-313907893),))
> (((1,0),l),((null,-313907893),))
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String with spark.memory.offHeap.enabled=true

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15822:


Assignee: (was: Apache Spark)

> segmentation violation in o.a.s.unsafe.types.UTF8String with 
> spark.memory.offHeap.enabled=true
> --
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Priority: Blocker
>
> Executors fail with segmentation violation while running application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a null pointer exception at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String with spark.memory.offHeap.enabled=true

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323781#comment-15323781
 ] 

Apache Spark commented on SPARK-15822:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/13589

> segmentation violation in o.a.s.unsafe.types.UTF8String with 
> spark.memory.offHeap.enabled=true
> --
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Priority: Blocker
>
> Executors fail with segmentation violation while running application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a null pointer exception at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15858) "evaluateEachIteration" will fail on trying to run it on a model with 500+ trees

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15858:


Assignee: Apache Spark

> "evaluateEachIteration" will fail on trying to run it on a model with 500+ 
> trees 
> -
>
> Key: SPARK-15858
> URL: https://issues.apache.org/jira/browse/SPARK-15858
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Mahmoud Rawas
>Assignee: Apache Spark
>
> This line:
> remappedData.zip(predictionAndError).mapPartitions
> causes a stack overflow exception on executors after nearly 300 iterations. 
> Also, with this number of trees, keeping a var RDD leads to needless memory 
> allocation.
> This functionality was tested on version 1.6.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15858) "evaluateEachIteration" will fail on trying to run it on a model with 500+ trees

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15858:


Assignee: (was: Apache Spark)

> "evaluateEachIteration" will fail on trying to run it on a model with 500+ 
> trees 
> -
>
> Key: SPARK-15858
> URL: https://issues.apache.org/jira/browse/SPARK-15858
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Mahmoud Rawas
>
> This line:
> remappedData.zip(predictionAndError).mapPartitions
> causes a stack overflow exception on executors after nearly 300 iterations. 
> Also, with this number of trees, keeping a var RDD leads to needless memory 
> allocation.
> This functionality was tested on version 1.6.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15858) "evaluateEachIteration" will fail on trying to run it on a model with 500+ trees

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323777#comment-15323777
 ] 

Apache Spark commented on SPARK-15858:
--

User 'mhmoudr' has created a pull request for this issue:
https://github.com/apache/spark/pull/13588

> "evaluateEachIteration" will fail on trying to run it on a model with 500+ 
> trees 
> -
>
> Key: SPARK-15858
> URL: https://issues.apache.org/jira/browse/SPARK-15858
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Mahmoud Rawas
>
> This line:
> remappedData.zip(predictionAndError).mapPartitions
> causes a stack overflow exception on executors after nearly 300 iterations. 
> Also, with this number of trees, keeping a var RDD leads to needless memory 
> allocation.
> This functionality was tested on version 1.6.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15860) Metrics for codegen size and perf

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15860:


Assignee: Apache Spark

> Metrics for codegen size and perf
> -
>
> Key: SPARK-15860
> URL: https://issues.apache.org/jira/browse/SPARK-15860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> We should expose codahale metrics for the codegen source text size and how 
> long it takes to compile. The size is particularly interesting, since the JVM 
> does have hard limits on how large methods can get.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15860) Metrics for codegen size and perf

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15860:


Assignee: (was: Apache Spark)

> Metrics for codegen size and perf
> -
>
> Key: SPARK-15860
> URL: https://issues.apache.org/jira/browse/SPARK-15860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Eric Liang
>
> We should expose codahale metrics for the codegen source text size and how 
> long it takes to compile. The size is particularly interesting, since the JVM 
> does have hard limits on how large methods can get.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15860) Metrics for codegen size and perf

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323761#comment-15323761
 ] 

Apache Spark commented on SPARK-15860:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/13586

> Metrics for codegen size and perf
> -
>
> Key: SPARK-15860
> URL: https://issues.apache.org/jira/browse/SPARK-15860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Eric Liang
>
> We should expose codahale metrics for the codegen source text size and how 
> long it takes to compile. The size is particularly interesting, since the JVM 
> does have hard limits on how large methods can get.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15860) Metrics for codegen size and perf

2016-06-09 Thread Eric Liang (JIRA)
Eric Liang created SPARK-15860:
--

 Summary: Metrics for codegen size and perf
 Key: SPARK-15860
 URL: https://issues.apache.org/jira/browse/SPARK-15860
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Eric Liang


We should expose codahale metrics for the codegen source text size and how long 
it takes to compile. The size is particularly interesting, since the JVM does 
have hard limits on how large methods can get.
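
A minimal sketch of the kind of instrumentation being proposed, using the Dropwizard
(codahale) metrics API; the metric names and the wiring are illustrative assumptions,
not the actual Spark implementation:

{code}
import com.codahale.metrics.MetricRegistry

object CodegenMetrics {
  val registry = new MetricRegistry()
  // Distribution of generated source sizes -- interesting because of JVM method-size limits.
  private val sourceSize = registry.histogram("codegen.generatedSourceSize")
  // How long compilation of the generated code takes.
  private val compileTime = registry.timer("codegen.compilationTime")

  def recordCompilation(source: String)(compile: => Unit): Unit = {
    sourceSize.update(source.length)
    val timing = compileTime.time()
    try compile finally timing.stop()
  }
}
{code}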



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15850) Remove function grouping in SparkSession

2016-06-09 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-15850.
---
Resolution: Resolved

> Remove function grouping in SparkSession
> 
>
> Key: SPARK-15850
> URL: https://issues.apache.org/jira/browse/SPARK-15850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> SparkSession does not have that many functions due to better namespacing, and 
> as a result we probably don't need the function grouping. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15853) HDFSMetadataLog.get leaks the input stream

2016-06-09 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-15853.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13583
[https://github.com/apache/spark/pull/13583]

> HDFSMetadataLog.get leaks the input stream
> --
>
> Key: SPARK-15853
> URL: https://issues.apache.org/jira/browse/SPARK-15853
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> HDFSMetadataLog.get doesn't close the input stream.
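
The fix implied here is the usual close-on-exit pattern; a minimal, generic sketch
(the helper below is a stand-in, not the actual HDFSMetadataLog code):

{code}
import java.io.InputStream

// Opens a stream, hands it to the reader, and always closes it afterwards,
// even if reading fails.
def withStream[T](open: => InputStream)(read: InputStream => T): T = {
  val in = open
  try read(in)        // e.g. deserialize the metadata for the requested batch
  finally in.close()  // release the stream on every code path
}
{code}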



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15856) Revert API breaking changes made in DataFrameReader.text and SQLContext.range

2016-06-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323715#comment-15323715
 ] 

Reynold Xin commented on SPARK-15856:
-

cc [~koert]

> Revert API breaking changes made in DataFrameReader.text and SQLContext.range
> -
>
> Key: SPARK-15856
> URL: https://issues.apache.org/jira/browse/SPARK-15856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Lian
>
> In Spark 2.0, after unifying Datasets and DataFrames, we made two API 
> breaking changes:
> # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of 
> {{DataFrame}}
> # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of 
> {{DataFrame}}
> However, these two changes introduced several inconsistencies and problems:
> # {{spark.read.text()}} silently discards partitioned columns when reading a 
> partitioned table in text format since {{Dataset\[String\]}} only contains a 
> single field. Users have to use {{spark.read.format("text").load()}} to work 
> around this, which is pretty confusing and error-prone.
> # All data source shortcut methods in `DataFrameReader` return {{DataFrame}} 
> (aka {{Dataset\[Row\]}}) except for {{DataFrameReader.text()}}.
> # When applying typed operations over Datasets returned by {{spark.range()}}, 
> weird schema changes may happen. Please refer to SPARK-15632 for more details.
> Due to these reasons, we decided to revert these two changes.
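
An illustration of the asymmetry described above, with the return types as reported
in this issue (the path is made up and `spark` is an existing SparkSession):

{code}
import org.apache.spark.sql.{DataFrame, Dataset}

// Before the revert: the shortcut returns a typed Dataset[String] and therefore
// cannot carry partition columns.
val lines: Dataset[String] = spark.read.text("/data/partitioned_logs")

// The workaround mentioned above: the generic form still returns a DataFrame.
val rows: DataFrame = spark.read.format("text").load("/data/partitioned_logs")
{code}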



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323709#comment-15323709
 ] 

Marcelo Vanzin commented on SPARK-15851:


It should be simple to fix it to work with whatever that shell is, right? (In 
case it doesn't already work.)

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile on Windows 7.
> "mvn compile" fails on spark-core because it tries to execute the bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH.
> 2) Change line 350 of core/pom.xml
> 
>   
>   
>   
> 
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323704#comment-15323704
 ] 

Alexander Ulanov commented on SPARK-15851:
--

Sorry for the confusion; I meant the shell that is "/bin/sh". The Windows version 
of it comes with Git.

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile on Windows 7.
> "mvn compile" fails on spark-core because it tries to execute the bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH.
> 2) Change line 350 of core/pom.xml
> 
>   
>   
>   
> 
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323702#comment-15323702
 ] 

Joseph K. Bradley commented on SPARK-15581:
---

Synced with several people in person around the summit; posting notes here for a public 
record.  [~mlnick] [~holdenk] [~sethah] [~yuhaoyan] [~yanboliang] 
[~wangmiao1981]

High priority
* spark.ml parity
** Multiclass logistic regression
** SVM
** Also: FPM, stats
* Python & R expansion
* Improving standard testing
* Improving MLlib as an API/platform, not just a library of algorithms

To discuss
* How should we proceed with deep learning within MLlib (vs. in packages)?
* Breeze dependency

Other features
* Imputer
* Stratified sampling
* Generic bagging

Copy more documentation from the spark.mllib user guide to the spark.ml guide.

Items for improving MLlib development
* Make roadmap JIRA more active; this needs to be updated and curated more 
strictly to be a more useful guide to contributors.
* Be more willing to encourage developers to publish new ML algorithms as Spark 
packages while still discussing priority on JIRA.

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API

[jira] [Assigned] (SPARK-15859) Optimize the Partition Pruning with Disjunction

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15859:


Assignee: Apache Spark

> Optimize the Partition Pruning with Disjunction
> ---
>
> Key: SPARK-15859
> URL: https://issues.apache.org/jira/browse/SPARK-15859
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Apache Spark
>Priority: Critical
>
> Currently we cannot optimize partition pruning for disjunctive predicates, for 
> example:
> {{(part1=2 and col1='abc') or (part1=5 and col1='cde')}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15859) Optimize the Partition Pruning with Disjunction

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323683#comment-15323683
 ] 

Apache Spark commented on SPARK-15859:
--

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/13585

> Optimize the Partition Pruning with Disjunction
> ---
>
> Key: SPARK-15859
> URL: https://issues.apache.org/jira/browse/SPARK-15859
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Critical
>
> Currently we cannot optimize partition pruning for disjunctive predicates, for 
> example:
> {{(part1=2 and col1='abc') or (part1=5 and col1='cde')}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15859) Optimize the Partition Pruning with Disjunction

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15859:


Assignee: (was: Apache Spark)

> Optimize the Partition Pruning with Disjunction
> ---
>
> Key: SPARK-15859
> URL: https://issues.apache.org/jira/browse/SPARK-15859
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Critical
>
> Currently we cannot optimize partition pruning for disjunctive predicates, for 
> example:
> {{(part1=2 and col1='abc') or (part1=5 and col1='cde')}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15794) Should truncate toString() of very wide schemas

2016-06-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-15794.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13537
[https://github.com/apache/spark/pull/13537]

> Should truncate toString() of very wide schemas
> ---
>
> Key: SPARK-15794
> URL: https://issues.apache.org/jira/browse/SPARK-15794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.0.0
>
>
> With very wide tables, e.g. thousands of fields, the output is unreadable and 
> often causes OOMs due to inefficient string processing.
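
As a rough illustration of the problem and of the kind of truncation that helps, here is a minimal Scala sketch; it is not the change merged in the linked PR, and the {{truncatedString}} helper is purely illustrative:

{code}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// A schema with thousands of fields: calling toString() on it yields a huge string.
val wide = StructType((1 to 5000).map(i => StructField(s"c$i", IntegerType)))

// Illustrative helper: show only the first few fields and summarize the rest.
def truncatedString(schema: StructType, maxFields: Int = 20): String = {
  val shown = schema.fields.take(maxFields).map(f => s"${f.name}:${f.dataType.simpleString}")
  val hidden = schema.fields.length - maxFields
  val suffix = if (hidden > 0) s",... $hidden more fields>" else ">"
  shown.mkString("struct<", ",", suffix)
}

println(truncatedString(wide))
// struct<c1:int,c2:int,...,c20:int,... 4980 more fields>
{code}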



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15859) Optimize the Partition Pruning with Disjunction

2016-06-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-15859:
-

 Summary: Optimize the Partition Pruning with Disjunction
 Key: SPARK-15859
 URL: https://issues.apache.org/jira/browse/SPARK-15859
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Critical


Currently we cannot optimize partition pruning for disjunctive predicates, for example:

{{(part1=2 and col1='abc') or (part1=5 and col1='cde')}}
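
To make the intended optimization concrete, here is a minimal, self-contained Scala sketch of the idea (not Catalyst code): from a predicate like the one above, derive a weaker partition-only condition that is safe to use for pruning. The tiny {{Pred}} ADT and {{partitionOnly}} helper are purely illustrative.

{code}
sealed trait Pred
case class Eq(col: String, value: Any) extends Pred
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred

/** Returns a predicate implied by `p` that references only partition columns, if any. */
def partitionOnly(p: Pred, partCols: Set[String]): Option[Pred] = p match {
  case Eq(c, _) => if (partCols.contains(c)) Some(p) else None
  case And(l, r) =>
    // For AND it is enough that one side is partition-only.
    (partitionOnly(l, partCols), partitionOnly(r, partCols)) match {
      case (Some(a), Some(b)) => Some(And(a, b))
      case (a, b)             => a.orElse(b)
    }
  case Or(l, r) =>
    // For OR both sides must contribute, otherwise nothing can be pruned.
    for (a <- partitionOnly(l, partCols); b <- partitionOnly(r, partCols)) yield Or(a, b)
}

val pred = Or(And(Eq("part1", 2), Eq("col1", "abc")), And(Eq("part1", 5), Eq("col1", "cde")))
println(partitionOnly(pred, Set("part1")))
// Some(Or(Eq(part1,2),Eq(part1,5))) -- i.e. only partitions with part1 in (2, 5) need scanning
{code}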



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15855) dataframe.R example fails with "java.io.IOException: No input paths specified in job"

2016-06-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323667#comment-15323667
 ] 

Shivaram Venkataraman commented on SPARK-15855:
---

For the example to work in a distributed setup, the input file needs to be in 
HDFS or some other distributed storage system. The example is designed to 
work out of the box on a single machine. 
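
A minimal sketch of that point, written in Scala for consistency with the other snippets in this thread (the HDFS path and the 2.x-style session are assumptions, not part of the original example):

{code}
import org.apache.spark.sql.SparkSession

// Assumes the sample file was copied to shared storage first, e.g.
//   hdfs dfs -put $SPARK_HOME/examples/src/main/resources/people.json /user/hrt_qa/
val spark = SparkSession.builder().appName("people-json").getOrCreate()

// A local SPARK_HOME path only works when driver and executors share a filesystem;
// in yarn-client/yarn-cluster mode, read from HDFS (hypothetical path below) instead.
val people = spark.read.json("hdfs:///user/hrt_qa/people.json")
people.show()
{code}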

> dataframe.R example fails with "java.io.IOException: No input paths specified 
> in job"
> -
>
> Key: SPARK-15855
> URL: https://issues.apache.org/jira/browse/SPARK-15855
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Steps:
> * Install R on all nodes
> * Run dataframe.R example.
> The example fails in yarn-client and yarn-cluster mode both with below 
> mentioned error message.
> This application fails to find people.json correctly.  {{path <- 
> file.path(Sys.getenv("SPARK_HOME"), 
> "examples/src/main/resources/people.json")}}
> {code}
> [xxx@xxx qa]$ sparkR --master yarn-client examples/src/main/r/dataframe.R
> Loading required package: methods
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:stats’:
> cov, filter, lag, na.omit, predict, sd, var
> The following objects are masked from ‘package:base’:
> colnames, colnames<-, intersect, rank, rbind, sample, subset,
> summary, table, transform
> 16/05/24 22:08:21 INFO SparkContext: Running Spark version 1.6.1
> 16/05/24 22:08:21 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/24 22:08:22 INFO SecurityManager: Changing view acls to: hrt_qa
> 16/05/24 22:08:22 INFO SecurityManager: Changing modify acls to: hrt_qa
> 16/05/24 22:08:22 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(hrt_qa); users 
> with modify permissions: Set(hrt_qa)
> 16/05/24 22:08:22 INFO Utils: Successfully started service 'sparkDriver' on 
> port 35792.
> 16/05/24 22:08:23 INFO Slf4jLogger: Slf4jLogger started
> 16/05/24 22:08:23 INFO Remoting: Starting remoting
> 16/05/24 22:08:23 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkdriveractorsys...@xx.xx.xx.xxx:49771]
> 16/05/24 22:08:23 INFO Utils: Successfully started service 
> 'sparkDriverActorSystem' on port 49771.
> 16/05/24 22:08:23 INFO SparkEnv: Registering MapOutputTracker
> 16/05/24 22:08:23 INFO SparkEnv: Registering BlockManagerMaster
> 16/05/24 22:08:23 INFO DiskBlockManager: Created local directory at 
> /tmp/blockmgr-ffed73ad-3e67-4ae5-8734-9338136d3721
> 16/05/24 22:08:23 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 16/05/24 22:08:24 INFO SparkEnv: Registering OutputCommitCoordinator
> 16/05/24 22:08:24 INFO Server: jetty-8.y.z-SNAPSHOT
> 16/05/24 22:08:24 INFO AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:4040
> 16/05/24 22:08:24 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 16/05/24 22:08:24 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
> http://xx.xx.xx.xxx:4040
> spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
> 16/05/24 22:08:25 INFO Client: Requesting a new application from cluster with 
> 6 NodeManagers
> 16/05/24 22:08:25 INFO Client: Verifying our application has not requested 
> more than the maximum memory capability of the cluster (10240 MB per 
> container)
> 16/05/24 22:08:25 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/05/24 22:08:25 INFO Client: Setting up container launch context for our AM
> 16/05/24 22:08:25 INFO Client: Setting up the launch environment for our AM 
> container
> 16/05/24 22:08:26 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 16/05/24 22:08:26 INFO Client: Using the spark assembly jar on HDFS because 
> you are using HDP, 
> defaultSparkAssembly:hdfs://mycluster/hdp/apps/2.5.0.0-427/spark/spark-hdp-assembly.jar
> 16/05/24 22:08:26 INFO Client: Preparing resources for our AM container
> 16/05/24 22:08:26 INFO YarnSparkHadoopUtil: getting token for namenode: 
> hdfs://mycluster/user/hrt_qa/.sparkStaging/application_1463956206030_0003
> 16/05/24 22:08:26 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 187 for 
> hrt_qa on ha-hdfs:mycluster
> 16/05/24 22:08:28 INFO metastore: Trying to connect to metastore with URI 
> thrift://xxx:9083
> 16/05/24 22:08:28 INFO metastore: Connected to metastore.
> 16/05/24 22:08:28 INFO YarnSparkHadoopUtil: HBase class not found 
> java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration
> 16/05/24 22:08:28 INFO Client: Using 

[jira] [Commented] (SPARK-15509) R MLlib algorithms should support input columns "features" and "label"

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323666#comment-15323666
 ] 

Apache Spark commented on SPARK-15509:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/13584

> R MLlib algorithms should support input columns "features" and "label"
> --
>
> Key: SPARK-15509
> URL: https://issues.apache.org/jira/browse/SPARK-15509
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Joseph K. Bradley
>
> Currently in SparkR, when you load a LibSVM dataset using the sqlContext and 
> then pass it to an MLlib algorithm, the ML wrappers will fail since they will 
> try to create a "features" column, which conflicts with the existing 
> "features" column from the LibSVM loader.  E.g., using the "mnist" dataset 
> from LibSVM:
> {code}
> training <- loadDF(sqlContext, ".../mnist", "libsvm")
> model <- naiveBayes(label ~ features, training)
> {code}
> This fails with:
> {code}
> 16/05/24 11:52:41 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   java.lang.IllegalArgumentException: Output column features already exists.
>   at 
> org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
>   at 
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
>   at 
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>   at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
>   at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
>   at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
>   at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
> {code}
> The same issue appears for the "label" column once you rename the "features" 
> column.
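
For illustration, here is a hedged Scala sketch of the analogous conflict and a rename-based workaround (the SparkR wrappers go through {{RFormula}} in a similar way); the libsvm path is hypothetical and this is not the fix proposed in the pull request:

{code}
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rformula-conflict").getOrCreate()

// LibSVM data already comes with "label" and "features" columns.
val training = spark.read.format("libsvm").load("/data/mnist")

// This fails, because RFormula wants to create output columns named
// "features" and "label", which already exist in the input:
// new RFormula().setFormula("label ~ features").fit(training)

// Workaround sketch: rename the raw columns so RFormula's outputs do not collide.
val renamed = training
  .withColumnRenamed("features", "fv")
  .withColumnRenamed("label", "y")

val prepared = new RFormula()
  .setFormula("y ~ fv")
  .fit(renamed)
  .transform(renamed)   // now carries fresh "features" and "label" columns

prepared.select("features", "label").show(3)
{code}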



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15509) R MLlib algorithms should support input columns "features" and "label"

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15509:


Assignee: Apache Spark

> R MLlib algorithms should support input columns "features" and "label"
> --
>
> Key: SPARK-15509
> URL: https://issues.apache.org/jira/browse/SPARK-15509
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> Currently in SparkR, when you load a LibSVM dataset using the sqlContext and 
> then pass it to an MLlib algorithm, the ML wrappers will fail since they will 
> try to create a "features" column, which conflicts with the existing 
> "features" column from the LibSVM loader.  E.g., using the "mnist" dataset 
> from LibSVM:
> {code}
> training <- loadDF(sqlContext, ".../mnist", "libsvm")
> model <- naiveBayes(label ~ features, training)
> {code}
> This fails with:
> {code}
> 16/05/24 11:52:41 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   java.lang.IllegalArgumentException: Output column features already exists.
>   at 
> org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
>   at 
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
>   at 
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>   at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
>   at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
>   at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
>   at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
> {code}
> The same issue appears for the "label" column once you rename the "features" 
> column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15509) R MLlib algorithms should support input columns "features" and "label"

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15509:


Assignee: (was: Apache Spark)

> R MLlib algorithms should support input columns "features" and "label"
> --
>
> Key: SPARK-15509
> URL: https://issues.apache.org/jira/browse/SPARK-15509
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Joseph K. Bradley
>
> Currently in SparkR, when you load a LibSVM dataset using the sqlContext and 
> then pass it to an MLlib algorithm, the ML wrappers will fail since they will 
> try to create a "features" column, which conflicts with the existing 
> "features" column from the LibSVM loader.  E.g., using the "mnist" dataset 
> from LibSVM:
> {code}
> training <- loadDF(sqlContext, ".../mnist", "libsvm")
> model <- naiveBayes(label ~ features, training)
> {code}
> This fails with:
> {code}
> 16/05/24 11:52:41 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   java.lang.IllegalArgumentException: Output column features already exists.
>   at 
> org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
>   at 
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
>   at 
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>   at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
>   at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
>   at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
>   at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
> {code}
> The same issue appears for the "label" column once you rename the "features" 
> column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15858) "evaluateEachIteration" will fail on trying to run it on a model with 500+ trees

2016-06-09 Thread Mahmoud Rawas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323662#comment-15323662
 ] 

Mahmoud Rawas commented on SPARK-15858:
---

I am working on a solution.

> "evaluateEachIteration" will fail on trying to run it on a model with 500+ 
> trees 
> -
>
> Key: SPARK-15858
> URL: https://issues.apache.org/jira/browse/SPARK-15858
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Mahmoud Rawas
>
> This line:
> {{remappedData.zip(predictionAndError).mapPartitions}}
> causes a stack overflow exception on executors after nearly 300 iterations. Also, 
> with this number of trees, keeping a var RDD leads to needless memory 
> allocation. 
> This functionality was tested on version 1.6.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15841) [SPARK REPL] REPLSuite has incorrect env set for a couple of tests.

2016-06-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-15841.
--
   Resolution: Fixed
 Assignee: Prashant Sharma
Fix Version/s: 2.0.0

> [SPARK REPL] REPLSuite has incorrect env set for a couple of tests.
> ---
>
> Key: SPARK-15841
> URL: https://issues.apache.org/jira/browse/SPARK-15841
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
> Fix For: 2.0.0
>
>
> In ReplSuite, a test that can be exercised well in local mode should not 
> really have to start a local-cluster. Similarly, a test is insufficient if it 
> runs only locally while it is actually fixing a problem related to a distributed run. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15858) "evaluateEachIteration" will fail on trying to run it on a model with 500+ trees

2016-06-09 Thread Mahmoud Rawas (JIRA)
Mahmoud Rawas created SPARK-15858:
-

 Summary: "evaluateEachIteration" will fail on trying to run it on 
a model with 500+ trees 
 Key: SPARK-15858
 URL: https://issues.apache.org/jira/browse/SPARK-15858
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1, 2.0.0
Reporter: Mahmoud Rawas


This line:
{{remappedData.zip(predictionAndError).mapPartitions}}
causes a stack overflow exception on executors after nearly 300 iterations. Also, 
with this number of trees, keeping a var RDD leads to needless memory 
allocation. 
This functionality was tested on version 1.6.1.
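
The underlying issue is a lineage that grows with every iteration. Below is a minimal, self-contained sketch of that pattern and one common mitigation (periodic checkpointing); it is not the GBT code itself, and the checkpoint directory is an assumption:

{code}
import org.apache.spark.{SparkConf, SparkContext}

object LongLineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-sketch").setMaster("local[*]"))
    sc.setCheckpointDir("/tmp/lineage-sketch-checkpoints")  // assumed writable path

    // Re-deriving a var RDD in a loop grows its lineage by one stage per iteration;
    // after a few hundred iterations, serializing that lineage overflows the stack.
    var acc = sc.parallelize(1 to 1000).map(i => (i, 0.0))
    for (iter <- 1 to 500) {
      val update = sc.parallelize(1 to 1000).map(i => (i, iter.toDouble))
      acc = acc.join(update).mapValues { case (a, b) => a + b }
      if (iter % 50 == 0) {
        acc.checkpoint()  // truncate the lineage every 50 iterations
        acc.count()       // force materialization so the checkpoint takes effect
      }
    }
    println(acc.take(3).mkString(", "))
    sc.stop()
  }
}
{code}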



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15856) Revert API breaking changes made in DataFrameReader.text and SQLContext.range

2016-06-09 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-15856:
---
Description: 
In Spark 2.0, after unifying Datasets and DataFrames, we made two API breaking 
changes:

# {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of 
{{DataFrame}}
# {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of 
{{DataFrame}}

However, these two changes introduced several inconsistencies and problems:

# {{spark.read.text()}} silently discards partitioned columns when reading a 
partitioned table in text format since {{Dataset\[String\]}} only contains a 
single field. Users have to use {{spark.read.format("text").load()}} to 
work around this, which is pretty confusing and error-prone.
# All data source shortcut methods in `DataFrameReader` return {{DataFrame}} 
(aka {{Dataset\[Row\]}}) except for {{DataFrameReader.text()}}.
# When applying typed operations over Datasets returned by {{spark.range()}}, 
weird schema changes may happen. Please refer to SPARK-15632 for more details.

Due to these reasons, we decided to revert these two changes.

  was:
In Spark 2.0, after unifying Datasets and DataFrames, we made two API breaking 
changes:

# {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of 
{{DataFrame}}
# {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of 
{{DataFrame}}

However, these two changes introduced several inconsistencies and problems:

# {{spark.read.text()}} silently discards partitioned columns when reading a 
partitioned table in text format since {{Dataset\[String\]}} only contains a 
single field. Users have to use {{spark.read.format("text").load()}} to 
workaround this, which is pretty confusing and error-prone.
# All data source shortcut methods in `DataFrameReader` returns a {{DataFrame}} 
(aka {{Dataset\[Row\]}} except for {{DataFrameReader.text()}}.
# When applying typed operations over Datasets returned by {{spark.range()}}, 
weird schema changes may happen. Please refer to SPARK-15632 for more details.

Due to these reasons, we decided to revert these two changes.


> Revert API breaking changes made in DataFrameReader.text and SQLContext.range
> -
>
> Key: SPARK-15856
> URL: https://issues.apache.org/jira/browse/SPARK-15856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Lian
>
> In Spark 2.0, after unifying Datasets and DataFrames, we made two API 
> breaking changes:
> # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of 
> {{DataFrame}}
> # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of 
> {{DataFrame}}
> However, these two changes introduced several inconsistencies and problems:
> # {{spark.read.text()}} silently discards partitioned columns when reading a 
> partitioned table in text format since {{Dataset\[String\]}} only contains a 
> single field. Users have to use {{spark.read.format("text").load()}} to 
> work around this, which is pretty confusing and error-prone.
> # All data source shortcut methods in `DataFrameReader` return {{DataFrame}} 
> (aka {{Dataset\[Row\]}}) except for {{DataFrameReader.text()}}.
> # When applying typed operations over Datasets returned by {{spark.range()}}, 
> weird schema changes may happen. Please refer to SPARK-15632 for more details.
> Due to these reasons, we decided to revert these two changes.
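
For concreteness, a small Scala sketch of the first inconsistency as the API behaved before the revert (the partitioned directory layout below is hypothetical):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("text-reader-sketch").getOrCreate()

// Suppose the table is laid out as /data/logs/date=2016-06-09/part-00000 etc.

// Before the revert: Dataset[String] with a single "value" field,
// so the "date" partition column is silently dropped.
val asStrings = spark.read.text("/data/logs")
asStrings.printSchema()

// Workaround at the time: go through the generic loader to keep the partition column.
val asRows = spark.read.format("text").load("/data/logs")
asRows.printSchema()   // "value" plus the "date" partition column
{code}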



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15857) Add Caller Context in Spark

2016-06-09 Thread Weiqing Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323655#comment-15323655
 ] 

Weiqing Yang commented on SPARK-15857:
--

I will attach the design doc soon.

> Add Caller Context in Spark
> ---
>
> Key: SPARK-15857
> URL: https://issues.apache.org/jira/browse/SPARK-15857
> Project: Spark
>  Issue Type: New Feature
>Reporter: Weiqing Yang
>
> Hadoop has implemented a log-tracing feature called caller context (JIRAs: 
> HDFS-9184 and YARN-4349). The motivation is to better diagnose and understand 
> how specific applications impact parts of the Hadoop system and what potential 
> problems they may be creating (e.g. overloading the NN). As mentioned in 
> HDFS-9184, for a given HDFS operation it is very helpful to track which 
> upper-level job issued it. The upper-level callers may be specific Oozie tasks, 
> MR jobs, Hive queries, or Spark jobs. 
> Hadoop ecosystem projects like MapReduce, Tez (TEZ-2851), Hive (HIVE-12249, 
> HIVE-12254) and Pig (PIG-4714) have implemented their own caller contexts. Those 
> systems invoke the HDFS and YARN client APIs to set up the caller context, 
> and also expose an API for passing a caller context in.
> Lots of Spark applications run on YARN/HDFS. Spark can also implement 
> its caller context by invoking the HDFS/YARN API, and also expose an API so that 
> its upstream applications can set up their own caller contexts. In the end, the 
> Spark caller context written into the YARN log / HDFS log can be associated with 
> task id, stage id, job id and app id. That also makes it much easier for Spark 
> users to identify tasks, especially if Spark supports a multi-tenant environment 
> in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15856) Revert API breaking changes made in DataFrameReader.text and SQLContext.range

2016-06-09 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-15856:
---
Description: 
In Spark 2.0, after unifying Datasets and DataFrames, we made two API breaking 
changes:

# {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of 
{{DataFrame}}
# {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of 
{{DataFrame}}

However, these two changes introduced several inconsistencies and problems:

# {{spark.read.text()}} silently discards partitioned columns when reading a 
partitioned table in text format since {{Dataset\[String\]}} only contains a 
single field. Users have to use {{spark.read.format("text").load()}} to 
workaround this, which is pretty confusing and error-prone.
# All data source shortcut methods in `DataFrameReader` returns a {{DataFrame}} 
(aka {{Dataset\[Row\]}} except for {{DataFrameReader.text()}}.
# When applying typed operations over Datasets returned by {{spark.range()}}, 
weird schema changes may happen. Please refer to SPARK-15632 for more details.

Due to these reasons, we decided to revert these two changes.

> Revert API breaking changes made in DataFrameReader.text and SQLContext.range
> -
>
> Key: SPARK-15856
> URL: https://issues.apache.org/jira/browse/SPARK-15856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Lian
>
> In Spark 2.0, after unifying Datasets and DataFrames, we made two API 
> breaking changes:
> # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of 
> {{DataFrame}}
> # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of 
> {{DataFrame}}
> However, these two changes introduced several inconsistencies and problems:
> # {{spark.read.text()}} silently discards partitioned columns when reading a 
> partitioned table in text format since {{Dataset\[String\]}} only contains a 
> single field. Users have to use {{spark.read.format("text").load()}} to 
> workaround this, which is pretty confusing and error-prone.
> # All data source shortcut methods in `DataFrameReader` returns a 
> {{DataFrame}} (aka {{Dataset\[Row\]}} except for {{DataFrameReader.text()}}.
> # When applying typed operations over Datasets returned by {{spark.range()}}, 
> weird schema changes may happen. Please refer to SPARK-15632 for more details.
> Due to these reasons, we decided to revert these two changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15857) Add Caller Context in Spark

2016-06-09 Thread Weiqing Yang (JIRA)
Weiqing Yang created SPARK-15857:


 Summary: Add Caller Context in Spark
 Key: SPARK-15857
 URL: https://issues.apache.org/jira/browse/SPARK-15857
 Project: Spark
  Issue Type: New Feature
Reporter: Weiqing Yang


Hadoop has implemented a log-tracing feature called caller context (JIRAs: 
HDFS-9184 and YARN-4349). The motivation is to better diagnose and understand 
how specific applications impact parts of the Hadoop system and what potential 
problems they may be creating (e.g. overloading the NN). As mentioned in 
HDFS-9184, for a given HDFS operation it is very helpful to track which 
upper-level job issued it. The upper-level callers may be specific Oozie tasks, 
MR jobs, Hive queries, or Spark jobs. 

Hadoop ecosystem projects like MapReduce, Tez (TEZ-2851), Hive (HIVE-12249, HIVE-12254) 
and Pig (PIG-4714) have implemented their own caller contexts. Those systems invoke 
the HDFS and YARN client APIs to set up the caller context, and also expose an 
API for passing a caller context in.

Lots of Spark applications run on YARN/HDFS. Spark can also implement 
its caller context by invoking the HDFS/YARN API, and also expose an API so that 
its upstream applications can set up their own caller contexts. In the end, the Spark 
caller context written into the YARN log / HDFS log can be associated with task id, 
stage id, job id and app id. That also makes it much easier for Spark users to identify 
tasks, especially if Spark supports a multi-tenant environment in the future.
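
A rough sketch of what setting such a context could look like on the Spark side, assuming Hadoop 2.8+ where HDFS-9184 added {{org.apache.hadoop.ipc.CallerContext}}; the tag format and helper name are purely illustrative:

{code}
import org.apache.hadoop.ipc.CallerContext

// Hypothetical helper: tag the current thread so that subsequent HDFS RPCs carry
// enough information to map NameNode audit-log entries back to a Spark task.
def setSparkCallerContext(appId: String, jobId: Int, stageId: Int, taskId: Long): Unit = {
  val context = s"SPARK_AppId_${appId}_JobId_${jobId}_StageId_${stageId}_TaskId_$taskId"
  CallerContext.setCurrent(new CallerContext.Builder(context).build())
}

setSparkCallerContext("application_0000000000000_0001", jobId = 3, stageId = 7, taskId = 42L)
{code}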



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12447) Only update AM's internal state when executor is successfully launched by NM

2016-06-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-12447.

   Resolution: Fixed
 Assignee: Saisai Shao  (was: Apache Spark)
Fix Version/s: 2.0.0

> Only update AM's internal state when executor is successfully launched by NM
> 
>
> Key: SPARK-12447
> URL: https://issues.apache.org/jira/browse/SPARK-12447
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.0.0
>
>
> Currently {{YarnAllocator}} updates its managed state, such as 
> {{numExecutorsRunning}}, after a container is allocated but before the executor 
> is successfully launched. 
> This happens when the Spark configuration is wrong (for example, the 
> spark_shuffle aux-service is occasionally not configured in the NM), which makes 
> the executor fail to launch, or when the NM is lost while the NMClient is 
> communicating with it.
> In the current implementation, the state is updated even if the executor fails 
> to launch, which leads to an incorrect AM state. Also, a lingering container is 
> only released after a timeout, which wastes resources.
> So we should update the state only after the executor is successfully launched; 
> otherwise we should release the container ASAP so it can fail fast and 
> retry.
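
A minimal, self-contained Scala sketch of that control flow (the YARN client handles are stubbed out here; this is not {{YarnAllocator}}'s actual code):

{code}
import java.util.concurrent.atomic.AtomicInteger

// Stubs standing in for the YARN NM/RM client APIs.
trait Container { def id: String }
trait NodeManagerClient { def startContainer(c: Container): Unit }   // may throw
trait ResourceManagerClient { def releaseContainer(id: String): Unit }

class AllocatorSketch(nm: NodeManagerClient, rm: ResourceManagerClient) {
  val numExecutorsRunning = new AtomicInteger(0)

  def launchExecutor(container: Container): Unit = {
    try {
      nm.startContainer(container)
      // Count the executor only once the NM has actually accepted the launch.
      numExecutorsRunning.incrementAndGet()
    } catch {
      case e: Exception =>
        // Fail fast: release the container immediately instead of letting it
        // linger until a timeout, so it can be reallocated and retried.
        rm.releaseContainer(container.id)
    }
  }
}
{code}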



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15856) Revert API breaking changes made in DataFrameReader.text and SQLContext.range

2016-06-09 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-15856:
--

 Summary: Revert API breaking changes made in DataFrameReader.text 
and SQLContext.range
 Key: SPARK-15856
 URL: https://issues.apache.org/jira/browse/SPARK-15856
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String with spark.memory.offHeap.enabled=true

2016-06-09 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-15822:
--
Priority: Blocker  (was: Critical)

> segmentation violation in o.a.s.unsafe.types.UTF8String with 
> spark.memory.offHeap.enabled=true
> --
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Priority: Blocker
>
> Executors fail with segmentation violation while running application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this on IBM java on PowerPC box but is recreatable on linux 
> with OpenJDK. On linux with IBM Java 8 we see a null pointer exception at the 
> same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String with spark.memory.offHeap.enabled=true

2016-06-09 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-15822:
--
Description: 
Executors fail with segmentation violation while running application with
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 512m
{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
#
# JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# J 4816 C2 
org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
 (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
{noformat}
We initially saw this on IBM java on PowerPC box but is recreatable on linux 
with OpenJDK. On linux with IBM Java 8 we see a null pointer exception at the 
same code point:
{noformat}
16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
java.lang.NullPointerException
at 
org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at 
org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
at 
org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:785)
{noformat}

  was:
Executors fail with segmentation violation while running application with
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 512m

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
#
# JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# J 4816 C2 
org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
 (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]

We initially saw this on IBM java on PowerPC box but is recreatable on linux 
with OpenJDK. On linux with IBM Java 8 we see a null pointer exception at the 
same code point:

16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
java.lang.NullPointerException
at 
org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
at 

[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323644#comment-15323644
 ] 

Marcelo Vanzin commented on SPARK-15851:


bq. "spark-build-info" can be rewritten as a shell script.

Not sure what you mean? It is a shell script.

Assuming you mean a second script that is a Windows batch script, not sure I'm 
a big fan of the idea. It's small enough that it shouldn't matter much, though.

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core due to trying to execute a bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH
> 2) Change line 350 of core/pom.xml (the XML snippet was stripped by the mailing-list renderer)
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15855) dataframe.R example fails with "java.io.IOException: No input paths specified in job"

2016-06-09 Thread Yesha Vora (JIRA)
Yesha Vora created SPARK-15855:
--

 Summary: dataframe.R example fails with "java.io.IOException: No 
input paths specified in job"
 Key: SPARK-15855
 URL: https://issues.apache.org/jira/browse/SPARK-15855
 Project: Spark
  Issue Type: Bug
  Components: Examples
Affects Versions: 1.6.1
Reporter: Yesha Vora


Steps:
* Install R on all nodes
* Run dataframe.R example.

The example fails in yarn-client and yarn-cluster mode both with below 
mentioned error message.

This application fails to find people.json correctly.  {{path <- 
file.path(Sys.getenv("SPARK_HOME"), "examples/src/main/resources/people.json")}}

{code}
[xxx@xxx qa]$ sparkR --master yarn-client examples/src/main/r/dataframe.R
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

cov, filter, lag, na.omit, predict, sd, var

The following objects are masked from ‘package:base’:

colnames, colnames<-, intersect, rank, rbind, sample, subset,
summary, table, transform

16/05/24 22:08:21 INFO SparkContext: Running Spark version 1.6.1
16/05/24 22:08:21 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/05/24 22:08:22 INFO SecurityManager: Changing view acls to: hrt_qa
16/05/24 22:08:22 INFO SecurityManager: Changing modify acls to: hrt_qa
16/05/24 22:08:22 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hrt_qa); users 
with modify permissions: Set(hrt_qa)
16/05/24 22:08:22 INFO Utils: Successfully started service 'sparkDriver' on 
port 35792.
16/05/24 22:08:23 INFO Slf4jLogger: Slf4jLogger started
16/05/24 22:08:23 INFO Remoting: Starting remoting
16/05/24 22:08:23 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkdriveractorsys...@xx.xx.xx.xxx:49771]
16/05/24 22:08:23 INFO Utils: Successfully started service 
'sparkDriverActorSystem' on port 49771.
16/05/24 22:08:23 INFO SparkEnv: Registering MapOutputTracker
16/05/24 22:08:23 INFO SparkEnv: Registering BlockManagerMaster
16/05/24 22:08:23 INFO DiskBlockManager: Created local directory at 
/tmp/blockmgr-ffed73ad-3e67-4ae5-8734-9338136d3721
16/05/24 22:08:23 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
16/05/24 22:08:24 INFO SparkEnv: Registering OutputCommitCoordinator
16/05/24 22:08:24 INFO Server: jetty-8.y.z-SNAPSHOT
16/05/24 22:08:24 INFO AbstractConnector: Started 
SelectChannelConnector@0.0.0.0:4040
16/05/24 22:08:24 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
16/05/24 22:08:24 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
http://xx.xx.xx.xxx:4040
spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
16/05/24 22:08:25 INFO Client: Requesting a new application from cluster with 6 
NodeManagers
16/05/24 22:08:25 INFO Client: Verifying our application has not requested more 
than the maximum memory capability of the cluster (10240 MB per container)
16/05/24 22:08:25 INFO Client: Will allocate AM container, with 896 MB memory 
including 384 MB overhead
16/05/24 22:08:25 INFO Client: Setting up container launch context for our AM
16/05/24 22:08:25 INFO Client: Setting up the launch environment for our AM 
container
16/05/24 22:08:26 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
16/05/24 22:08:26 INFO Client: Using the spark assembly jar on HDFS because you 
are using HDP, 
defaultSparkAssembly:hdfs://mycluster/hdp/apps/2.5.0.0-427/spark/spark-hdp-assembly.jar
16/05/24 22:08:26 INFO Client: Preparing resources for our AM container
16/05/24 22:08:26 INFO YarnSparkHadoopUtil: getting token for namenode: 
hdfs://mycluster/user/hrt_qa/.sparkStaging/application_1463956206030_0003
16/05/24 22:08:26 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 187 for 
hrt_qa on ha-hdfs:mycluster
16/05/24 22:08:28 INFO metastore: Trying to connect to metastore with URI 
thrift://xxx:9083
16/05/24 22:08:28 INFO metastore: Connected to metastore.
16/05/24 22:08:28 INFO YarnSparkHadoopUtil: HBase class not found 
java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration
16/05/24 22:08:28 INFO Client: Using the spark assembly jar on HDFS because you 
are using HDP, 
defaultSparkAssembly:hdfs://mycluster/hdp/apps/2.5.0.0-427/spark/spark-hdp-assembly.jar
16/05/24 22:08:28 INFO Client: Source and destination file systems are the 
same. Not copying 
hdfs://mycluster/hdp/apps/2.5.0.0-427/spark/spark-hdp-assembly.jar
16/05/24 22:08:29 INFO Client: Uploading resource 
file:/usr/hdp/current/spark-client/examples/src/main/r/dataframe.R -> 
hdfs://mycluster/user/hrt_qa/.sparkStaging/application_1463956206030_0003/dataframe.R
16/05/24 22:08:29 INFO Client: Uploading resource 

[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323638#comment-15323638
 ] 

Alexander Ulanov commented on SPARK-15851:
--

I can do that. However, it seems that "spark-build-info" can be rewritten as a 
shell script. This will remove the need to install bash for Windows users that 
compile Spark with maven. What do you think?

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core due to trying to execute a bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH
> 2) Change line 350 of core/pom.xml (the XML snippet was stripped by the mailing-list renderer)
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15854) Spark History server gets null pointer exception

2016-06-09 Thread Yesha Vora (JIRA)
Yesha Vora created SPARK-15854:
--

 Summary: Spark History server gets null pointer exception
 Key: SPARK-15854
 URL: https://issues.apache.org/jira/browse/SPARK-15854
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Yesha Vora


In Spark 2, the Spark History Server is configured to use FsHistoryProvider. 

The Spark History Server does not show any finished/running applications and gets a 
NullPointerException.

{code}
16/06/03 23:06:40 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx:8020/spark2-history/application_1464912457462_0002.inprogress
16/06/03 23:06:50 INFO FsHistoryProvider: Replaying log path: 
hdfs://xx:8020/spark2-history/application_1464912457462_0002
16/06/03 23:08:27 WARN ServletHandler: Error for /api/v1/applications
java.lang.NoSuchMethodError: 
javax.ws.rs.core.Application.getProperties()Ljava/util/Map;
at 
org.glassfish.jersey.server.ApplicationHandler.(ApplicationHandler.java:331)
at 
org.glassfish.jersey.servlet.WebComponent.(WebComponent.java:392)
at 
org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:177)
at 
org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:369)
at javax.servlet.GenericServlet.init(GenericServlet.java:244)
at 
org.spark_project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:616)
at 
org.spark_project.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:472)
at 
org.spark_project.jetty.servlet.ServletHolder.ensureInstance(ServletHolder.java:767)
at 
org.spark_project.jetty.servlet.ServletHolder.prepare(ServletHolder.java:752)
at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479)
at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.spark_project.jetty.server.Server.handle(Server.java:499)
at 
org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
at 
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at 
org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
at 
org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)
16/06/03 23:08:33 WARN ServletHandler: /api/v1/applications
java.lang.NullPointerException
at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
at 
org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
at 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479)
at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.spark_project.jetty.server.Server.handle(Server.java:499)
at 
org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
at 
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at 
org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
at 
org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)
{code}
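
The NoSuchMethodError above usually points at a JAX-RS version clash on the history server's classpath (Jersey 2 expects the JAX-RS 2.x javax.ws.rs-api, which added Application.getProperties()). This is an assumption about the cause, not a confirmed fix; a small diagnostic snippet, pasteable into a REPL started with the same classpath, shows where the class is being loaded from:

{code}
// Prints which jar provides javax.ws.rs.core.Application and whether it is the
// JAX-RS 2.x version (which declares getProperties()).
val cls = Class.forName("javax.ws.rs.core.Application")
val location = Option(cls.getProtectionDomain.getCodeSource).map(_.getLocation)
println(s"loaded from: ${location.getOrElse("bootstrap/unknown")}")
println(s"has getProperties(): ${cls.getMethods.exists(_.getName == "getProperties")}")
{code}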



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323629#comment-15323629
 ] 

Marcelo Vanzin commented on SPARK-15851:


Adding "bash" explicitly in the pom should be fine. Need to change the sbt 
build also.

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core due to trying to execute a bash script 
> spark-build-info.
> Work around:
> 1)Install win-bash and put in path
> 2)Change line 350 of core/pom.xml
> 
>   
>   
>   
> 
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15853) HDFSMetadataLog.get leaks the input stream

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15853:


Assignee: Shixiong Zhu  (was: Apache Spark)

> HDFSMetadataLog.get leaks the input stream
> --
>
> Key: SPARK-15853
> URL: https://issues.apache.org/jira/browse/SPARK-15853
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> HDFSMetadataLog.get doesn't close the input stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323624#comment-15323624
 ] 

Alexander Ulanov commented on SPARK-15851:
--

This does not work because Ant uses a Java Process to run the executable, which 
returns "not a valid Win32 application". In order to run it, one needs to run 
"bash" and provide the bash file as a parameter. This is the approach I proposed 
as a work-around. For more details please refer to: 
http://stackoverflow.com/questions/20883212/how-can-i-use-ant-exec-to-execute-commands-on-linux

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core due to trying to execute a bash script 
> spark-build-info.
> Work around:
> 1)Install win-bash and put in path
> 2)Change line 350 of core/pom.xml
> 
>   
>   
>   
> 
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15853) HDFSMetadataLog.get leaks the input stream

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15853:


Assignee: Apache Spark  (was: Shixiong Zhu)

> HDFSMetadataLog.get leaks the input stream
> --
>
> Key: SPARK-15853
> URL: https://issues.apache.org/jira/browse/SPARK-15853
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> HDFSMetadataLog.get doesn't close the input stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15830) Spark application should get hive tokens only when it is required

2016-06-09 Thread Yesha Vora (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesha Vora updated SPARK-15830:
---
Affects Version/s: 1.6.1

> Spark application should get hive tokens only when it is required
> -
>
> Key: SPARK-15830
> URL: https://issues.apache.org/jira/browse/SPARK-15830
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Currently, all Spark applications try to get Hive tokens (even if the 
> application does not use them) whenever Hive is installed on the cluster.
> Due to this practice, a Spark application which does not require Hive fails 
> when the Hive service (metastore) is down for some reason.
> Thus, Spark should only try to get Hive tokens when required. It should not 
> fetch a Hive token if the application does not need it.
> Example: The Spark Pi application does not perform any Hive-related actions, but 
> it still fails if the Hive metastore service is down.
> {code}
> 16/06/08 01:18:42 INFO YarnSparkHadoopUtil: getting token for namenode: 
> hdfs://xxx:8020/user/xx/.sparkStaging/application_1465347287950_0001
> 16/06/08 01:18:42 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 7 for 
> xx on xx.xx.xx.xxx:8020
> 16/06/08 01:18:43 INFO metastore: Trying to connect to metastore with URI 
> thrift://xx.xx.xx.xxx:9090
> 16/06/08 01:18:43 WARN metastore: Failed to connect to the MetaStore Server...
> 16/06/08 01:18:43 INFO metastore: Waiting 5 seconds before next connection 
> attempt.
> 16/06/08 01:18:48 INFO metastore: Trying to connect to metastore with URI 
> thrift://xx.xx.xx.xxx:9090
> 16/06/08 01:18:48 WARN metastore: Failed to connect to the MetaStore Server...
> 16/06/08 01:18:48 INFO metastore: Waiting 5 seconds before next connection 
> attempt.
> 16/06/08 01:18:53 INFO metastore: Trying to connect to metastore with URI 
> thrift://xx.xx.xx.xxx:9090
> 16/06/08 01:18:53 WARN metastore: Failed to connect to the MetaStore Server...
> 16/06/08 01:18:53 INFO metastore: Waiting 5 seconds before next connection 
> attempt.
> 16/06/08 01:18:59 WARN Hive: Failed to access metastore. This class should 
> not accessed in runtime.
> org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
> Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
> at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
> at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498){code}
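
A conceptual sketch of the requested behaviour follows. The helper names are hypothetical and this is not Spark's actual YARN credential code: the point is simply that the metastore should only be contacted when the submitted application declares that it needs Hive.

{code}
// appUsesHive would come from inspecting the submitted application's configuration;
// fetchHiveToken is a stand-in for the real delegation-token call.
def obtainHiveTokenIfNeeded(appUsesHive: Boolean)(fetchHiveToken: () => Unit): Unit = {
  if (appUsesHive) {
    fetchHiveToken()  // the metastore is contacted only on this path
  }
  // Applications such as Spark Pi never reach the metastore, so an outage
  // there no longer fails them.
}
{code}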



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15853) HDFSMetadataLog.get leaks the input stream

2016-06-09 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-15853:


 Summary: HDFSMetadataLog.get leaks the input stream
 Key: SPARK-15853
 URL: https://issues.apache.org/jira/browse/SPARK-15853
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


HDFSMetadataLog.get doesn't close the input stream.
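
A hedged sketch of the usual fix pattern, with a toy `deserialize` stand-in rather than the actual HDFSMetadataLog code: close the stream in a finally block so it cannot leak even when deserialization fails.

{code}
import java.io.{FileInputStream, InputStream}

// Toy stand-in for HDFSMetadataLog's real deserialization logic.
def deserialize(in: InputStream): Array[Byte] =
  Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray

def readBatchFile(path: String): Option[Array[Byte]] = {
  val in = new FileInputStream(path)
  try Some(deserialize(in))
  finally in.close()  // previously the stream was left open after get()
}
{code}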



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323620#comment-15323620
 ] 

Marcelo Vanzin commented on SPARK-15851:


[~tgraves] (and [~Dhruve Ashar]) hah we were talking about it Tuesday. :-)

Maybe installing bash (not the cygwin version, something like the one that 
comes with the Git for Windows installer) would help here?

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core due to trying to execute a bash script 
> spark-build-info.
> Work around:
> 1)Install win-bash and put in path
> 2)Change line 350 of core/pom.xml
> 
>   
>   
>   
> 
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15794) Should truncate toString() of very wide schemas

2016-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15794:

Assignee: Eric Liang

> Should truncate toString() of very wide schemas
> ---
>
> Key: SPARK-15794
> URL: https://issues.apache.org/jira/browse/SPARK-15794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
>
> With very wide tables, e.g. thousands of fields, the output is unreadable and 
> often causes OOMs due to inefficient string processing.
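
A hedged sketch of the truncation idea (an illustrative helper, not Spark's actual implementation): cap the number of rendered fields so toString on a very wide schema stays bounded.

{code}
def truncatedString[T](items: Seq[T], sep: String, maxFields: Int = 25): String = {
  if (items.length <= maxFields) items.mkString(sep)
  else (items.take(maxFields).map(_.toString) :+
        s"... ${items.length - maxFields} more fields").mkString(sep)
}

// e.g. truncatedString((1 to 5000).map(i => s"col$i"), ", ")
// renders 25 column names followed by "... 4975 more fields".
{code}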



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15764) Replace n^2 loop in BindReferences

2016-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15764:

Description: 
BindReferences contains a n^2 loop which causes performance issues when 
operating over large schemas: to determine the ordinal of an input, we perform 
a linear scan over the {{input}} array. Because {{input}} can sometimes be a 
List, the call to {{input(ordinal).nullable}} can also be {{O( n )}}.

Instead of performing a linear scan, we can convert the input into an array and 
build a hash map to map from expression ids to ordinals. The greater up-front 
cost of the map construction is offset by the fact that an expression can 
contain multiple attribute references, so the cost of the map construction is 
amortized across a number of lookups.

  was:
BindReferences contains a n^2 loop which causes performance issues when 
operating over large schemas: to determine the ordinal of an input, we perform 
a linear scan over the {{input}} array. Because {{input}} can sometimes be a 
List, the call to {{input(ordinal).nullable}} can also be {{O(n)}}.

Instead of performing a linear scan, we can convert the input into an array and 
build a hash map to map from expression ids to ordinals. The greater up-front 
cost of the map construction is offset by the fact that an expression can 
contain multiple attribute references, so the cost of the map construction is 
amortized across a number of lookups.


> Replace n^2 loop in BindReferences
> --
>
> Key: SPARK-15764
> URL: https://issues.apache.org/jira/browse/SPARK-15764
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> BindReferences contains a n^2 loop which causes performance issues when 
> operating over large schemas: to determine the ordinal of an input, we 
> perform a linear scan over the {{input}} array. Because {{input}} can 
> sometimes be a List, the call to {{input(ordinal).nullable}} can also be {{O( 
> n )}}.
> Instead of performing a linear scan, we can convert the input into an array 
> and build a hash map to map from expression ids to ordinals. The greater 
> up-front cost of the map construction is offset by the fact that an 
> expression can contain multiple attribute references, so the cost of the map 
> construction is amortized across a number of lookups.
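
A minimal sketch of the described approach with simplified stand-in types (not Catalyst's real AttributeSeq/ExprId classes): materialize the input once as an array and build a hash map from expression id to ordinal, so each reference is bound in O(1).

{code}
final case class Attr(exprId: Long, nullable: Boolean)

class AttrSeq(input: Seq[Attr]) {
  // Materialize once so apply() is O(1) even if the caller passed a List.
  private val attrs: Array[Attr] = input.toArray
  // One up-front pass; amortized over the many lookups an expression tree performs.
  private val ordinalOf: Map[Long, Int] =
    attrs.iterator.zipWithIndex.map { case (a, i) => a.exprId -> i }.toMap

  def indexOf(exprId: Long): Int = ordinalOf.getOrElse(exprId, -1)
  def apply(ordinal: Int): Attr = attrs(ordinal)
}
{code}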



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15794) Should truncate toString() of very wide schemas

2016-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15794:

Target Version/s: 2.0.0

> Should truncate toString() of very wide schemas
> ---
>
> Key: SPARK-15794
> URL: https://issues.apache.org/jira/browse/SPARK-15794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>
> With very wide tables, e.g. thousands of fields, the output is unreadable and 
> often causes OOMs due to inefficient string processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15764) Replace n^2 loop in BindReferences

2016-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15764:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-15852

> Replace n^2 loop in BindReferences
> --
>
> Key: SPARK-15764
> URL: https://issues.apache.org/jira/browse/SPARK-15764
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> BindReferences contains a n^2 loop which causes performance issues when 
> operating over large schemas: to determine the ordinal of an input, we 
> perform a linear scan over the {{input}} array. Because {{input}} can 
> sometimes be a List, the call to {{input(ordinal).nullable}} can also be O(n).
> Instead of performing a linear scan, we can convert the input into an array 
> and build a hash map to map from expression ids to ordinals. The greater 
> up-front cost of the map construction is offset by the fact that an 
> expression can contain multiple attribute references, so the cost of the map 
> construction is amortized across a number of lookups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15742) Reduce collections allocations in Catalyst tree transformation methods

2016-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15742:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-15852

> Reduce collections allocations in Catalyst tree transformation methods
> --
>
> Key: SPARK-15742
> URL: https://issues.apache.org/jira/browse/SPARK-15742
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> In Catalyst's TreeNode {{transform}} methods we end up calling 
> {{productIterator.map(...).toArray()}} in a number of places, which is 
> slightly inefficient because it needs to allocate and grow ArrayBuilders. 
> Since we already know the size of the final output ({{productArity}}), we can 
> simply allocate an array up-front and use a while loop to consume the 
> iterator and populate the array.
> For most workloads, this performance difference is negligible but it does 
> make a measurable difference in optimizer performance for queries that 
> operate over very wide schemas (thousands of columns).
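
A hedged sketch of that pattern on a plain Product (toy code, not the actual TreeNode.transform implementation): the array is pre-sized from productArity and filled with a while loop, avoiding the growable ArrayBuilder.

{code}
def mapProductToArray(p: Product)(f: Any => AnyRef): Array[AnyRef] = {
  val arr = new Array[AnyRef](p.productArity)  // exact size known up front
  val it = p.productIterator
  var i = 0
  while (it.hasNext) {
    arr(i) = f(it.next())
    i += 1
  }
  arr
}

// e.g. mapProductToArray(("a", 1, true))(_.toString) returns Array("a", "1", "true")
{code}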



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15748) Replace inefficient foldLeft() call in PartitionStatistics

2016-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15748:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-15852

> Replace inefficient foldLeft() call in PartitionStatistics
> --
>
> Key: SPARK-15748
> URL: https://issues.apache.org/jira/browse/SPARK-15748
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> PartitionStatistics uses foldLeft and list concatenation to flatten an 
> iterator of lists, but this is extremely inefficient compared to simply doing 
> flatMap/flatten because it performs many unnecessary object allocations. 
> Simply replacing this foldLeft by a flatMap results in fair performance gains 
> when constructing PartitionStatistics instances for tables with many columns.
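
A small illustration of the change on toy data (not the real PartitionStatistics code): folding with list concatenation re-copies the accumulator on every step, whereas a single flatMap pass does not.

{code}
def slowFlatten(batches: Iterator[List[Int]]): List[Int] =
  batches.foldLeft(List.empty[Int])(_ ++ _)   // each ++ copies the accumulator

def fastFlatten(batches: Iterator[List[Int]]): List[Int] =
  batches.flatMap(identity).toList            // one pass, no repeated copying

// e.g. fastFlatten(Iterator.fill(3)(List(1, 2))) == List(1, 2, 1, 2, 1, 2)
{code}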



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15762) Cache Metadata.hashCode and use a singleton for Metadata.empty

2016-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15762:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-15852

> Cache Metadata.hashCode and use a singleton for Metadata.empty
> --
>
> Key: SPARK-15762
> URL: https://issues.apache.org/jira/browse/SPARK-15762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> In Spark SQL we should cache Metadata.hashCode and use a singleton for 
> Metadata.empty, since calculating empty metadata hashCodes appears to be a 
> bottleneck according to certain profiler results.
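
A hedged sketch of both ideas on a simplified stand-in class (not Spark's real Metadata): compute the hash once lazily and reuse it, and share a single empty instance.

{code}
final class Meta(val entries: Map[String, Any]) {
  // Computed at most once per instance instead of on every hashCode() call.
  private lazy val cachedHashCode: Int = entries.hashCode()
  override def hashCode(): Int = cachedHashCode
  override def equals(other: Any): Boolean = other match {
    case m: Meta => entries == m.entries
    case _       => false
  }
}

object Meta {
  // A single shared empty instance; Meta.empty no longer allocates per call.
  val empty: Meta = new Meta(Map.empty)
}
{code}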



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15764) Replace n^2 loop in BindReferences

2016-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15764:

Description: 
BindReferences contains a n^2 loop which causes performance issues when 
operating over large schemas: to determine the ordinal of an input, we perform 
a linear scan over the {{input}} array. Because {{input}} can sometimes be a 
List, the call to {{input(ordinal).nullable}} can also be {{O(n)}}.

Instead of performing a linear scan, we can convert the input into an array and 
build a hash map to map from expression ids to ordinals. The greater up-front 
cost of the map construction is offset by the fact that an expression can 
contain multiple attribute references, so the cost of the map construction is 
amortized across a number of lookups.

  was:
BindReferences contains a n^2 loop which causes performance issues when 
operating over large schemas: to determine the ordinal of an input, we perform 
a linear scan over the {{input}} array. Because {{input}} can sometimes be a 
List, the call to {{input(ordinal).nullable}} can also be O(n).

Instead of performing a linear scan, we can convert the input into an array and 
build a hash map to map from expression ids to ordinals. The greater up-front 
cost of the map construction is offset by the fact that an expression can 
contain multiple attribute references, so the cost of the map construction is 
amortized across a number of lookups.


> Replace n^2 loop in BindReferences
> --
>
> Key: SPARK-15764
> URL: https://issues.apache.org/jira/browse/SPARK-15764
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> BindReferences contains a n^2 loop which causes performance issues when 
> operating over large schemas: to determine the ordinal of an input, we 
> perform a linear scan over the {{input}} array. Because {{input}} can 
> sometimes be a List, the call to {{input(ordinal).nullable}} can also be 
> {{O(n)}}.
> Instead of performing a linear scan, we can convert the input into an array 
> and build a hash map to map from expression ids to ordinals. The greater 
> up-front cost of the map construction is offset by the fact that an 
> expression can contain multiple attribute references, so the cost of the map 
> construction is amortized across a number of lookups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15794) Should truncate toString() of very wide schemas

2016-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15794:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-15852

> Should truncate toString() of very wide schemas
> ---
>
> Key: SPARK-15794
> URL: https://issues.apache.org/jira/browse/SPARK-15794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>
> With very wide tables, e.g. thousands of fields, the output is unreadable and 
> often causes OOMs due to inefficient string processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15852) Improve query planning performance for wide nested schema

2016-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15852:
---

 Summary: Improve query planning performance for wide nested schema
 Key: SPARK-15852
 URL: https://issues.apache.org/jira/browse/SPARK-15852
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Eric Liang


This tracks a list of issues to improve query planning (and code generation) 
performance for wide nested schema.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14321) Reduce date format cost in date functions

2016-06-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14321.
-
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.0.0

> Reduce date format cost in date functions
> -
>
> Key: SPARK-14321
> URL: https://issues.apache.org/jira/browse/SPARK-14321
> Project: Spark
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently the code generated is
> {noformat}
> /* 066 */ UTF8String primitive5 = null;
> /* 067 */ if (!isNull4) {
> /* 068 */   try {
> /* 069 */ primitive5 = UTF8String.fromString(new 
> java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
> /* 070 */ new java.util.Date(primitive7 * 1000L)));
> /* 071 */   } catch (java.lang.Throwable e) {
> /* 072 */ isNull4 = true;
> /* 073 */   }
> /* 074 */ }
> {noformat}
> Instantiation of SimpleDateFormat is fairly expensive. It can be created on an 
> as-needed basis. 
> I will share the patch soon.
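
A hedged sketch of one way to avoid per-row instantiation (illustrative only, not the generated code that was eventually produced): SimpleDateFormat is not thread-safe, so a ThreadLocal caches one instance per thread.

{code}
import java.text.SimpleDateFormat
import java.util.Date

val formatter = new ThreadLocal[SimpleDateFormat] {
  override def initialValue(): SimpleDateFormat =
    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
}

def formatSeconds(seconds: Long): String =
  formatter.get().format(new Date(seconds * 1000L))  // reuses the cached formatter
{code}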



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-15851:
-
Fix Version/s: 2.0.0

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core due to trying to execute a bash script 
> spark-build-info.
> Work around:
> 1)Install win-bash and put in path
> 2)Change line 350 of core/pom.xml
> 
>   
>   
>   
> 
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-15851:
-
Target Version/s: 2.0.0
   Fix Version/s: (was: 2.0.0)

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core due to trying to execute a bash script 
> spark-build-info.
> Work around:
> 1)Install win-bash and put in path
> 2)Change line 350 of core/pom.xml
> 
>   
>   
>   
> 
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ...<exec executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-15851:


 Summary: Spark 2.0 does not compile in Windows 7
 Key: SPARK-15851
 URL: https://issues.apache.org/jira/browse/SPARK-15851
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.0.0
 Environment: Windows 7
Reporter: Alexander Ulanov


Spark does not compile in Windows 7.
"mvn compile" fails on spark-core due to trying to execute a bash script 
spark-build-info.

Work around:
1)Install win-bash and put in path
2)Change line 350 of core/pom.xml

  
  
  


Error trace:
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
spark-core_2.11: An Ant BuildException has occured: Execute failed: 
java.io.IOException: Cannot run program 
"C:\dev\spark\core\..\build\spark-build-info" (in directory 
"C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
application
[ERROR] around Ant part ...<exec executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
C:\dev\spark\core\target\antrun\build-main.xml




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15850) Remove function grouping in SparkSession

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15850:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove function grouping in SparkSession
> 
>
> Key: SPARK-15850
> URL: https://issues.apache.org/jira/browse/SPARK-15850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> SparkSession does not have that many functions due to better namespacing, and 
> as a result we probably don't need the function grouping. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15850) Remove function grouping in SparkSession

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323560#comment-15323560
 ] 

Apache Spark commented on SPARK-15850:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13582

> Remove function grouping in SparkSession
> 
>
> Key: SPARK-15850
> URL: https://issues.apache.org/jira/browse/SPARK-15850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> SparkSession does not have that many functions due to better namespacing, and 
> as a result we probably don't need the function grouping. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15850) Remove function grouping in SparkSession

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15850:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove function grouping in SparkSession
> 
>
> Key: SPARK-15850
> URL: https://issues.apache.org/jira/browse/SPARK-15850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> SparkSession does not have that many functions due to better namespacing, and 
> as a result we probably don't need the function grouping. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15850) Remove function grouping in SparkSession

2016-06-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15850:
---

 Summary: Remove function grouping in SparkSession
 Key: SPARK-15850
 URL: https://issues.apache.org/jira/browse/SPARK-15850
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


SparkSession does not have that many functions due to better namespacing, and 
as a result we probably don't need the function grouping. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8426) Add blacklist mechanism for YARN container allocation

2016-06-09 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323522#comment-15323522
 ] 

Kay Ousterhout commented on SPARK-8426:
---

Can we merge this with SPARK-8425? This seems to be more general now (or move 
all of the general stuff to 8425, and leave any YARN-specific stuff, e.g., 
re-allocating bad containers, here).

> Add blacklist mechanism for YARN container allocation
> -
>
> Key: SPARK-8426
> URL: https://issues.apache.org/jira/browse/SPARK-8426
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Priority: Minor
> Attachments: DesignDocforBlacklistMechanism.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15849) FileNotFoundException on _temporary while doing saveAsTable to S3

2016-06-09 Thread Sandeep (JIRA)
Sandeep created SPARK-15849:
---

 Summary: FileNotFoundException on _temporary while doing 
saveAsTable to S3
 Key: SPARK-15849
 URL: https://issues.apache.org/jira/browse/SPARK-15849
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1
 Environment: AWS EC2 with spark on yarn and s3 storage
Reporter: Sandeep


When submitting Spark jobs to a YARN cluster, I occasionally see these error 
messages while doing saveAsTable. I have tried doing this with 
spark.speculation=false, and get the same error. These errors are similar to 
SPARK-2984, but my jobs are writing to S3 (s3n):

Caused by: java.io.FileNotFoundException: File 
s3n://xxx/_temporary/0/task_201606080516_0004_m_79 does not exist.
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
at 
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
... 42 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13268) SQL Timestamp stored as GMT but toString returns GMT-08:00

2016-06-09 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323442#comment-15323442
 ] 

Bo Meng commented on SPARK-13268:
-

Why is this related to Spark? The conversion does not use any Spark function 
and I think the conversion loses the time zone information along the way.

> SQL Timestamp stored as GMT but toString returns GMT-08:00
> --
>
> Key: SPARK-13268
> URL: https://issues.apache.org/jira/browse/SPARK-13268
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ilya Ganelin
>
> There is an issue with how timestamps are displayed/converted to Strings in 
> Spark SQL. The documentation states that the timestamp should be created in 
> the GMT time zone, however, if we do so, we see that the output actually 
> contains a -8 hour offset:
> {code}
> new 
> Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT]").toInstant.toEpochMilli)
> res144: java.sql.Timestamp = 2014-12-31 16:00:00.0
> new 
> Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT-08:00]").toInstant.toEpochMilli)
> res145: java.sql.Timestamp = 2015-01-01 00:00:00.0
> {code}
> This result is confusing, unintuitive, and introduces issues when converting 
> from DataFrames containing timestamps to RDDs which are then saved as text. 
> This has the effect of essentially shifting all dates in a dataset by 1 day. 
> The suggested fix for this is to update the timestamp toString representation 
> to either a) Include timezone or b) Correctly display in GMT.
> This change may well introduce substantial and insidious bugs so I'm not sure 
> how best to resolve this.
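
A small illustration of option (b) from the description (not a Spark change): format the same instant explicitly in GMT instead of relying on Timestamp.toString, which always renders in the JVM's default time zone.

{code}
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.TimeZone

val ts = new Timestamp(1420070400000L)            // 2015-01-01T00:00:00Z
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
fmt.setTimeZone(TimeZone.getTimeZone("GMT"))
println(fmt.format(ts))                           // 2015-01-01 00:00:00, regardless of local zone
{code}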



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15613) Incorrect days to millis conversion

2016-06-09 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323386#comment-15323386
 ] 

Bo Meng commented on SPARK-15613:
-

Does this only happen to 1.6? I have tried on the latest master and it does not 
have this issue.

> Incorrect days to millis conversion 
> 
>
> Key: SPARK-15613
> URL: https://issues.apache.org/jira/browse/SPARK-15613
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: java version "1.8.0_91"
>Reporter: Dmitry Bushev
>
> There is an issue with {{DateTimeUtils.daysToMillis}} implementation. It  
> affects {{DateTimeUtils.toJavaDate}} and ultimately CatalystTypeConverter, 
> i.e. the conversion of a date stored as {{Int}} days from epoch in InternalRow 
> to the {{java.sql.Date}} of the Row returned to the user.
>  
> The issue can be reproduced with this test (all the following tests are in my 
> default timezone Europe/Moscow):
> {code}
> scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) 
> yield days
> res23: scala.collection.immutable.IndexedSeq[Int] = Vector(4108, 4473, 4838, 
> 5204, 5568, 5932, 6296, 6660, 7024, 7388, 8053, 8487, 8851, 9215, 9586, 9950, 
> 10314, 10678, 11042, 11406, 11777, 12141, 12505, 12869, 13233, 13597, 13968, 
> 14332, 14696, 15060)
> {code}
> For example, for {{4108}} day of epoch, the correct date should be 
> {{1981-04-01}}
> {code}
> scala> DateTimeUtils.toJavaDate(4107)
> res25: java.sql.Date = 1981-03-31
> scala> DateTimeUtils.toJavaDate(4108)
> res26: java.sql.Date = 1981-03-31
> scala> DateTimeUtils.toJavaDate(4109)
> res27: java.sql.Date = 1981-04-02
> {code}
> There was previous unsuccessful attempt to work around the problem in 
> SPARK-11415. It seems that issue involves flaws in java date implementation 
> and I don't see how it can be fixed without third-party libraries.
> I was not able to identify the library of choice for Spark. The following 
> implementation uses [JSR-310|http://www.threeten.org/]
> {code}
> def millisToDays(millisUtc: Long): SQLDate = {
>   val instant = Instant.ofEpochMilli(millisUtc)
>   val zonedDateTime = instant.atZone(ZoneId.systemDefault)
>   zonedDateTime.toLocalDate.toEpochDay.toInt
> }
> def daysToMillis(days: SQLDate): Long = {
>   val localDate = LocalDate.ofEpochDay(days)
>   val zonedDateTime = localDate.atStartOfDay(ZoneId.systemDefault)
>   zonedDateTime.toInstant.toEpochMilli
> }
> {code}
> that produces correct results:
> {code}
> scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) 
> yield days
> res37: scala.collection.immutable.IndexedSeq[Int] = Vector()
> scala> new java.sql.Date(daysToMillis(4108))
> res36: java.sql.Date = 1981-04-01
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15613) Incorrect days to millis conversion

2016-06-09 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323386#comment-15323386
 ] 

Bo Meng edited comment on SPARK-15613 at 6/9/16 9:24 PM:
-

Does this only happen to 1.6? I have tried on the latest master and it does not 
have this issue. Have not tried on 1.6.


was (Author: bomeng):
Does this only happen to 1.6? I have tried on the latest master and it does not 
have this issue.

> Incorrect days to millis conversion 
> 
>
> Key: SPARK-15613
> URL: https://issues.apache.org/jira/browse/SPARK-15613
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: java version "1.8.0_91"
>Reporter: Dmitry Bushev
>
> There is an issue with {{DateTimeUtils.daysToMillis}} implementation. It  
> affects {{DateTimeUtils.toJavaDate}} and ultimately CatalystTypeConverter, 
> i.e. the conversion of a date stored as {{Int}} days from epoch in InternalRow 
> to the {{java.sql.Date}} of the Row returned to the user.
>  
> The issue can be reproduced with this test (all the following tests are in my 
> default timezone Europe/Moscow):
> {code}
> scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) 
> yield days
> res23: scala.collection.immutable.IndexedSeq[Int] = Vector(4108, 4473, 4838, 
> 5204, 5568, 5932, 6296, 6660, 7024, 7388, 8053, 8487, 8851, 9215, 9586, 9950, 
> 10314, 10678, 11042, 11406, 11777, 12141, 12505, 12869, 13233, 13597, 13968, 
> 14332, 14696, 15060)
> {code}
> For example, for {{4108}} day of epoch, the correct date should be 
> {{1981-04-01}}
> {code}
> scala> DateTimeUtils.toJavaDate(4107)
> res25: java.sql.Date = 1981-03-31
> scala> DateTimeUtils.toJavaDate(4108)
> res26: java.sql.Date = 1981-03-31
> scala> DateTimeUtils.toJavaDate(4109)
> res27: java.sql.Date = 1981-04-02
> {code}
> There was previous unsuccessful attempt to work around the problem in 
> SPARK-11415. It seems that issue involves flaws in java date implementation 
> and I don't see how it can be fixed without third-party libraries.
> I was not able to identify the library of choice for Spark. The following 
> implementation uses [JSR-310|http://www.threeten.org/]
> {code}
> def millisToDays(millisUtc: Long): SQLDate = {
>   val instant = Instant.ofEpochMilli(millisUtc)
>   val zonedDateTime = instant.atZone(ZoneId.systemDefault)
>   zonedDateTime.toLocalDate.toEpochDay.toInt
> }
> def daysToMillis(days: SQLDate): Long = {
>   val localDate = LocalDate.ofEpochDay(days)
>   val zonedDateTime = localDate.atStartOfDay(ZoneId.systemDefault)
>   zonedDateTime.toInstant.toEpochMilli
> }
> {code}
> that produces correct results:
> {code}
> scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) 
> yield days
> res37: scala.collection.immutable.IndexedSeq[Int] = Vector()
> scala> new java.sql.Date(daysToMillis(4108))
> res36: java.sql.Date = 1981-04-01
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14321) Reduce date format cost in date functions

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323384#comment-15323384
 ] 

Apache Spark commented on SPARK-14321:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/13581

> Reduce date format cost in date functions
> -
>
> Key: SPARK-14321
> URL: https://issues.apache.org/jira/browse/SPARK-14321
> Project: Spark
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Currently the code generated is
> {noformat}
> /* 066 */ UTF8String primitive5 = null;
> /* 067 */ if (!isNull4) {
> /* 068 */   try {
> /* 069 */ primitive5 = UTF8String.fromString(new 
> java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
> /* 070 */ new java.util.Date(primitive7 * 1000L)));
> /* 071 */   } catch (java.lang.Throwable e) {
> /* 072 */ isNull4 = true;
> /* 073 */   }
> /* 074 */ }
> {noformat}
> Instantiation of SimpleDateFormat is fairly expensive. It can be created on an 
> as-needed basis. 
> I will share the patch soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver

2016-06-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323344#comment-15323344
 ] 

Marcelo Vanzin commented on SPARK-14485:


bq. I don't think (a) is especially rare: that's the case anytime data is saved 
to HDFS

I didn't mean rare in general, I meant it should be rare to hit this particular 
case (scheduler thinks the executor is gone *and* a task result arrives later). 
The normal case is the task result arrives while the executor is still alive, 
and the change doesn't really touch that case.

> Task finished cause fetch failure when its executor has already been removed 
> by driver 
> ---
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>Assignee: iward
> Fix For: 2.0.0
>
>
> Currently, when an executor is removed by the driver because of a heartbeat 
> timeout, the driver re-queues the tasks running on that executor and sends a 
> kill command to the cluster to kill the executor.
> However, a task running on that executor may finish and return its result to 
> the driver before the executor is actually killed. In that situation the driver 
> accepts the task-finished event and ignores the speculative and re-queued 
> copies of the task. But since the executor has already been removed by the 
> driver, the result of the finished task cannot be registered, because its 
> *BlockManagerId* has also been removed from *BlockManagerMaster*. As a result, 
> the result data of this stage is not complete, and this later causes a fetch 
> failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> The driver will detect that the stage's result is not complete and submit the 
> missing tasks, but by this time the next stage has already started, because the 
> previous stage was considered finished (all of its tasks completed) even though 
> its result data is not complete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
> 2015-12-31 04:40:13 INFO 
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 11
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at 
> 

[jira] [Issue Comment Deleted] (SPARK-15801) spark-submit --num-executors switch also works without YARN

2016-06-09 Thread Jonathan Taws (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Taws updated SPARK-15801:
--
Comment: was deleted

(was: Indeed, should be enough as it is then. )

> spark-submit --num-executors switch also works without YARN
> ---
>
> Key: SPARK-15801
> URL: https://issues.apache.org/jira/browse/SPARK-15801
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] 
> regarding the SPARK_WORKER_INSTANCES property, I also found that the 
> {{--num-executors}} switch documented in the spark-submit help is partially 
> incorrect. 
> Here's one part of the output (produced by {{spark-submit --help}}): 
> {code}
> YARN-only:
>   --driver-cores NUM  Number of cores used by the driver, only in 
> cluster mode
>   (Default: 1).
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --num-executors NUM Number of executors to launch (Default: 2).
> {code}
> Correct me if I am wrong, but the num-executors switch also works in Spark 
> standalone mode *without YARN*.
> I tried by only launching a master and a worker with 4 executors specified, 
> and they were all successfully spawned. The master switch pointed to the 
> master's url, and not to the yarn value. 
> Here's the exact command : {{spark-submit --master spark://[local 
> machine]:7077 --num-executors 4 --executor-cores 2}}
> By default it is *1* executor per worker in Spark standalone mode without 
> YARN, but this option makes it possible to specify the number of executors (per 
> worker?) if, and only if, the {{--executor-cores}} switch is also set. I do believe 
> it defaults to 2 in YARN mode. 
> I would propose to move the option from the *YARN-only* section to the *Spark 
> standalone and YARN only* section.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15801) spark-submit --num-executors switch also works without YARN

2016-06-09 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323242#comment-15323242
 ] 

Jonathan Taws commented on SPARK-15801:
---

Indeed, should be enough as it is then. 

> spark-submit --num-executors switch also works without YARN
> ---
>
> Key: SPARK-15801
> URL: https://issues.apache.org/jira/browse/SPARK-15801
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] 
> regarding the SPARK_WORKER_INSTANCES property, I also found that the 
> {{--num-executors}} switch documented in the spark-submit help is partially 
> incorrect. 
> Here's one part of the output (produced by {{spark-submit --help}}): 
> {code}
> YARN-only:
>   --driver-cores NUM  Number of cores used by the driver, only in 
> cluster mode
>   (Default: 1).
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --num-executors NUM Number of executors to launch (Default: 2).
> {code}
> Correct me if I am wrong, but the num-executors switch also works in Spark 
> standalone mode *without YARN*.
> I tried this by launching only a master and a worker, with 4 executors 
> specified, and they were all successfully spawned. The {{--master}} switch 
> pointed to the master's URL, not to the {{yarn}} value. 
> Here's the exact command: {{spark-submit --master spark://[local 
> machine]:7077 --num-executors 4 --executor-cores 2}}
> By default there is *1* executor per worker in Spark standalone mode without 
> YARN, but this option makes it possible to specify the number of executors (per 
> worker?) if, and only if, the {{--executor-cores}} switch is also set. I 
> believe it defaults to 2 in YARN mode. 
> I would propose moving the option from the *YARN-only* section to the *Spark 
> standalone and YARN only* section.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15801) spark-submit --num-executors switch also works without YARN

2016-06-09 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323239#comment-15323239
 ] 

Jonathan Taws commented on SPARK-15801:
---

Indeed, should be enough as it is then. 

> spark-submit --num-executors switch also works without YARN
> ---
>
> Key: SPARK-15801
> URL: https://issues.apache.org/jira/browse/SPARK-15801
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] 
> regarding the SPARK_WORKER_INSTANCES property, I also found that the 
> {{--num-executors}} switch documented in the spark-submit help is partially 
> incorrect. 
> Here's one part of the output (produced by {{spark-submit --help}}): 
> {code}
> YARN-only:
>   --driver-cores NUM  Number of cores used by the driver, only in 
> cluster mode
>   (Default: 1).
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --num-executors NUM Number of executors to launch (Default: 2).
> {code}
> Correct me if I am wrong, but the num-executors switch also works in Spark 
> standalone mode *without YARN*.
> I tried this by launching only a master and a worker, with 4 executors 
> specified, and they were all successfully spawned. The {{--master}} switch 
> pointed to the master's URL, not to the {{yarn}} value. 
> Here's the exact command: {{spark-submit --master spark://[local 
> machine]:7077 --num-executors 4 --executor-cores 2}}
> By default there is *1* executor per worker in Spark standalone mode without 
> YARN, but this option makes it possible to specify the number of executors (per 
> worker?) if, and only if, the {{--executor-cores}} switch is also set. I 
> believe it defaults to 2 in YARN mode. 
> I would propose moving the option from the *YARN-only* section to the *Spark 
> standalone and YARN only* section.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation

2016-06-09 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323235#comment-15323235
 ] 

Jonathan Taws edited comment on SPARK-15781 at 6/9/16 8:11 PM:
---

What are our next steps on this ? CC Andrew or someone who knows standalone ? 


was (Author: jonathantaws):
What are our nextsteps on this ? CC Andrew or someone who knows standalone ? 

> Misleading deprecated property in standalone cluster configuration 
> documentation
> 
>
> Key: SPARK-15781
> URL: https://issues.apache.org/jira/browse/SPARK-15781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> I am unsure whether this is regarded as an issue or not, but the 
> [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts]
>  documentation for launching Spark in standalone cluster mode lists the 
> following property:
> |SPARK_WORKER_INSTANCES|  Number of worker instances to run on each 
> machine (default: 1). You can make this more than 1 if you have very 
> large machines and would like multiple Spark worker processes. If you do set 
> this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores 
> per worker, or else each worker will try to use all the cores.| 
> However, once I launch Spark with the spark-submit utility and the property 
> {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following 
> deprecation warning: 
> {code}
> 16/06/06 16:38:28 WARN SparkConf: 
> SPARK_WORKER_INSTANCES was detected (set to '4').
> This is deprecated in Spark 1.0+.
> Please instead use:
>  - ./spark-submit with --num-executors to specify the number of executors
>  - Or set SPARK_EXECUTOR_INSTANCES
>  - spark.executor.instances to configure the number of instances in the spark 
> config.
> {code}
> Is it normal practice to keep deprecated properties documented in the 
> documentation?
> I would have preferred to learn about the --num-executors property directly 
> rather than having to submit my application and discover a deprecation warning. 
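
As a small, illustrative follow-up (not part of the original report), an application that has migrated off SPARK_WORKER_INSTANCES to the replacements named in the warning could read back the executor-related properties it actually received. The property names are standard Spark settings; the fallback strings are placeholders for display only:

{code}
// Illustrative check: print which executor settings the driver ended up with after
// switching from SPARK_WORKER_INSTANCES to --num-executors / spark.executor.instances.
// The master URL is expected to be supplied by spark-submit.
import org.apache.spark.{SparkConf, SparkContext}

object ExecutorSettingsCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("executor-settings-check"))
    val conf = sc.getConf
    println("spark.executor.instances = " + conf.get("spark.executor.instances", "<unset>"))
    println("spark.executor.cores     = " + conf.get("spark.executor.cores", "<unset>"))
    sc.stop()
  }
}
{code}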



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation

2016-06-09 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323235#comment-15323235
 ] 

Jonathan Taws commented on SPARK-15781:
---

What are our nextsteps on this ? CC Andrew or someone who knows standalone ? 

> Misleading deprecated property in standalone cluster configuration 
> documentation
> 
>
> Key: SPARK-15781
> URL: https://issues.apache.org/jira/browse/SPARK-15781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> I am unsure whether this is regarded as an issue or not, but the 
> [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts]
>  documentation for launching Spark in standalone cluster mode lists the 
> following property:
> |SPARK_WORKER_INSTANCES|  Number of worker instances to run on each 
> machine (default: 1). You can make this more than 1 if you have very 
> large machines and would like multiple Spark worker processes. If you do set 
> this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores 
> per worker, or else each worker will try to use all the cores.| 
> However, once I launch Spark with the spark-submit utility and the property 
> {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following 
> deprecation warning: 
> {code}
> 16/06/06 16:38:28 WARN SparkConf: 
> SPARK_WORKER_INSTANCES was detected (set to '4').
> This is deprecated in Spark 1.0+.
> Please instead use:
>  - ./spark-submit with --num-executors to specify the number of executors
>  - Or set SPARK_EXECUTOR_INSTANCES
>  - spark.executor.instances to configure the number of instances in the spark 
> config.
> {code}
> Is it normal practice to keep deprecated properties documented in the 
> documentation?
> I would have preferred to learn about the --num-executors property directly 
> rather than having to submit my application and discover a deprecation warning. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-09 Thread Franklyn Dsouza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franklyn Dsouza updated SPARK-15811:

Shepherd: Davies Liu

> UDFs do not work in Spark 2.0-preview built with scala 2.10
> ---
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Priority: Critical
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns a result. 
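
For comparison, a hypothetical Scala version of the same increment (not part of the report), which could help narrow down whether the hang is specific to the Python UDF path in the Scala 2.10 build:

{code}
// Hypothetical Scala counterpart of the Python reproduction above.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object ScalaUdfCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[4]").appName("2.0 DF").getOrCreate()
    import spark.implicits._

    val addOne = udf((x: Int) => x + 1)                  // same logic as the Python UDF
    val df = Seq(1, 2).toDF("a")                         // rows a=1, a=2
    df.select(addOne($"a").alias("incremented")).show()  // expected output: 2 and 3

    spark.stop()
  }
}
{code}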



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case

2016-06-09 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-15848:
---
Affects Version/s: 1.6.1

> Spark unable to read partitioned table in avro format and column name in 
> upper case
> ---
>
> Key: SPARK-15848
> URL: https://issues.apache.org/jira/browse/SPARK-15848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Zhan Zhang
>
> For external partitioned Hive tables created in Avro format, Spark returns 
> "null" values if column names are in upper case in the Avro schema.
> The same tables return proper data when queried in the Hive client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case

2016-06-09 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323195#comment-15323195
 ] 

Zhan Zhang commented on SPARK-15848:


(The setup commands that create file1.csv / file2.csv and the Hive table DDL were 
truncated in this comment.)

scala> val tbl = sqlContext.table("default.avro_table_uppercase")

scala> tbl.show

+--+--+-++
|student_id|subject_id|marks|year|
+--+--+-++
|  null|  null|  100|2000|
|  null|  null|   20|2000|
|  null|  null|  160|2000|
|  null|  null|  963|2000|
|  null|  null|  142|2000|
|  null|  null|  430|2000|
|  null|  null|   91|2002|
|  null|  null|   28|2002|
|  null|  null|   16|2002|
|  null|  null|   96|2002|
|  null|  null|   14|2002|
|  null|  null|   43|2002|
+--+--+-++
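
A small follow-up sketch (not from the original comment) that could make the case mismatch visible, by printing the column names Spark resolves for this table alongside their lower-cased form:

{code}
scala> val tbl = sqlContext.table("default.avro_table_uppercase")
scala> tbl.printSchema()
scala> tbl.schema.fieldNames.foreach(name => println(s"$name -> ${name.toLowerCase}"))
{code}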

> Spark unable to read partitioned table in avro format and column name in 
> upper case
> ---
>
> Key: SPARK-15848
> URL: https://issues.apache.org/jira/browse/SPARK-15848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Zhan Zhang
>
> For external partitioned Hive tables created in Avro format, Spark returns 
> "null" values if column names are in upper case in the Avro schema.
> The same tables return proper data when queried in the Hive client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case

2016-06-09 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-15848:
--

 Summary: Spark unable to read partitioned table in avro format and 
column name in upper case
 Key: SPARK-15848
 URL: https://issues.apache.org/jira/browse/SPARK-15848
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhan Zhang


For external partitioned Hive tables created in Avro format, Spark returns "null" 
values if column names are in upper case in the Avro schema.
The same tables return proper data when queried in the Hive client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15839) Maven doc JAR generation fails when JAVA_7_HOME is set

2016-06-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15839.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13573
[https://github.com/apache/spark/pull/13573]

> Maven doc JAR generation fails when JAVA_7_HOME is set
> --
>
> Key: SPARK-15839
> URL: https://issues.apache.org/jira/browse/SPARK-15839
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> It looks like the nightly Maven snapshots broke after we set JAVA_7_HOME in 
> the build: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/1573/.
>  It seems that passing {{-javabootclasspath}} to scalac using 
> scala-maven-plugin ends up preventing the Scala library classes from being 
> added to scalac's internal class path, causing compilation errors while 
> building doc-jars.
> There might be a principled fix to this inside of the scala-maven-plugin 
> itself, but for now I propose that we simply omit the -javabootclasspath 
> option during Maven doc-jar generation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


