[jira] [Commented] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637044#comment-14637044
 ] 

Apache Spark commented on SPARK-9254:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/7597

 sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS 
 redirection
 --

 Key: SPARK-9254
 URL: https://issues.apache.org/jira/browse/SPARK-9254
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 The {{curl}} call in the script should use {{--location}} to support 
 HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.






[jira] [Assigned] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9254:
---

Assignee: Cheng Lian  (was: Apache Spark)

 sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS 
 redirection
 --

 Key: SPARK-9254
 URL: https://issues.apache.org/jira/browse/SPARK-9254
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 The {{curl}} call in the script should use {{--location}} to support 
 HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.






[jira] [Assigned] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9254:
---

Assignee: Apache Spark  (was: Cheng Lian)

 sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS 
 redirection
 --

 Key: SPARK-9254
 URL: https://issues.apache.org/jira/browse/SPARK-9254
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Apache Spark

 The {{curl}} call in the script should use {{--location}} to support 
 HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.






[jira] [Resolved] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

2015-07-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-9254.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7597
[https://github.com/apache/spark/pull/7597]

 sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS 
 redirection
 --

 Key: SPARK-9254
 URL: https://issues.apache.org/jira/browse/SPARK-9254
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.5.0


 The {{curl}} call in the script should use {{--location}} to support 
 HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.






[jira] [Updated] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

2015-07-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-9254:
--
Target Version/s: 1.4.2, 1.5.0  (was: 1.5.0)

 sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS 
 redirection
 --

 Key: SPARK-9254
 URL: https://issues.apache.org/jira/browse/SPARK-9254
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
  Labels: backport-needed
 Fix For: 1.5.0


 The {{curl}} call in the script should use {{--location}} to support 
 HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.






[jira] [Commented] (SPARK-9253) Allow to create machines with different AWS credentials than will be used for accessing the S3

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637043#comment-14637043
 ] 

Apache Spark commented on SPARK-9253:
-

User 'ziky90' has created a pull request for this issue:
https://github.com/apache/spark/pull/7596

 Allow to create machines with different AWS credentials than will be used for 
 accessing the S3
 --

 Key: SPARK-9253
 URL: https://issues.apache.org/jira/browse/SPARK-9253
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.4.1
Reporter: Jan Zikeš

 Currently, when you use the `spark_ec2.py` script together with S3, your only 
 option is to use exactly the same AWS credentials both for creating the EC2 
 machines and for accessing S3.
 For security reasons I would very much like to be able to access S3 with 
 different credentials than the ones with which I launch the machines.
 The proposed solution is to add an option to `spark_ec2.py` for passing 
 additional credentials.






[jira] [Assigned] (SPARK-9253) Allow to create machines with different AWS credentials than will be used for accessing the S3

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9253:
---

Assignee: (was: Apache Spark)

 Allow to create machines with different AWS credentials than will be used for 
 accessing the S3
 --

 Key: SPARK-9253
 URL: https://issues.apache.org/jira/browse/SPARK-9253
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.4.1
Reporter: Jan Zikeš

 Currently, when you use the `spark_ec2.py` script together with S3, your only 
 option is to use exactly the same AWS credentials both for creating the EC2 
 machines and for accessing S3.
 For security reasons I would very much like to be able to access S3 with 
 different credentials than the ones with which I launch the machines.
 The proposed solution is to add an option to `spark_ec2.py` for passing 
 additional credentials.






[jira] [Created] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

2015-07-22 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-9254:
-

 Summary: sbt-launch-lib.bash should use `curl --location` to 
support HTTP/HTTPS redirection
 Key: SPARK-9254
 URL: https://issues.apache.org/jira/browse/SPARK-9254
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian


The {{curl}} call in the script should use {{--location}} to support HTTP/HTTPS 
redirection, since target file(s) can be hosted on CDN nodes.






[jira] [Updated] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

2015-07-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9254:

Labels: backport-needed  (was: )

 sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS 
 redirection
 --

 Key: SPARK-9254
 URL: https://issues.apache.org/jira/browse/SPARK-9254
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
  Labels: backport-needed
 Fix For: 1.5.0


 The {{curl}} call in the script should use {{--location}} to support 
 HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.






[jira] [Reopened] (SPARK-9254) sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS redirection

2015-07-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reopened SPARK-9254:
---

Reopening this since we need to backport this fix to branch-1.4.

 sbt-launch-lib.bash should use `curl --location` to support HTTP/HTTPS 
 redirection
 --

 Key: SPARK-9254
 URL: https://issues.apache.org/jira/browse/SPARK-9254
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
  Labels: backport-needed
 Fix For: 1.5.0


 The {{curl}} call in the script should use {{--location}} to support 
 HTTP/HTTPS redirection, since target file(s) can be hosted on CDN nodes.






[jira] [Assigned] (SPARK-9253) Allow to create machines with different AWS credentials than will be used for accessing the S3

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9253:
---

Assignee: Apache Spark

 Allow to create machines with different AWS credentials than will be used for 
 accessing the S3
 --

 Key: SPARK-9253
 URL: https://issues.apache.org/jira/browse/SPARK-9253
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.4.1
Reporter: Jan Zikeš
Assignee: Apache Spark

 Currently, when you use the `spark_ec2.py` script together with S3, your only 
 option is to use exactly the same AWS credentials both for creating the EC2 
 machines and for accessing S3.
 For security reasons I would very much like to be able to access S3 with 
 different credentials than the ones with which I launch the machines.
 The proposed solution is to add an option to `spark_ec2.py` for passing 
 additional credentials.






[jira] [Updated] (SPARK-9192) add initialization phase for nondeterministic expression

2015-07-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-9192:
---
Summary: add initialization phase for nondeterministic expression  (was: 
add initialization phase for expression)

 add initialization phase for nondeterministic expression
 

 Key: SPARK-9192
 URL: https://issues.apache.org/jira/browse/SPARK-9192
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan

 Some expressions have mutable state and need to be initialized first (e.g. 
 Rand, WeekOfYear). Currently we use a `@transient lazy val` so that the state 
 is automatically initialized on first use and reset after a 
 serialize/deserialize round trip.
 However, this approach is rather ugly and accessing a lazy val is not 
 efficient; we should have an explicit initialization phase for expressions.






[jira] [Updated] (SPARK-8364) Add crosstab to SparkR DataFrames

2015-07-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8364:
-
Shepherd: Shivaram Venkataraman

 Add crosstab to SparkR DataFrames
 -

 Key: SPARK-8364
 URL: https://issues.apache.org/jira/browse/SPARK-8364
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 Add `crosstab` to SparkR DataFrames, which takes two column names and returns 
 a local R data.frame. This is similar to `table` in R. However, `table` in 
 SparkR is already used for loading SQL tables as DataFrames. The return type 
 is data.frame instead of table so that `crosstab` stays compatible with 
 Scala/Python.
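 A minimal sketch of the engine-side call the SparkR wrapper would delegate to 
 (Scala shown, since the R binding does not exist yet; the sample data is 
 hypothetical):
 {code}
 // DataFrameStatFunctions.crosstab returns a small DataFrame of pairwise counts;
 // the SparkR wrapper would collect it into a local R data.frame.
 val df = sqlContext.createDataFrame(Seq(
   ("a", 1), ("a", 2), ("b", 1), ("b", 1)
 )).toDF("key", "value")

 val ct = df.stat.crosstab("key", "value")
 ct.show()
 {code}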






[jira] [Updated] (SPARK-9230) SparkR RFormula should support StringType features

2015-07-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9230:
-
Target Version/s: 1.5.0

 SparkR RFormula should support StringType features
 --

 Key: SPARK-9230
 URL: https://issues.apache.org/jira/browse/SPARK-9230
 Project: Spark
  Issue Type: New Feature
  Components: ML, SparkR
Reporter: Eric Liang
Assignee: Eric Liang

 StringType features will need to be encoded using OneHotEncoder to be used 
 for regression. See umbrella design doc 
 https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing
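 A minimal sketch of the encoding step this refers to, using the existing 
 spark.ml transformers (the DataFrame `df` and its "country" column are 
 hypothetical):
 {code}
 import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

 // Map the string column to category indices, then one-hot encode the indices;
 // RFormula would need to do something equivalent for StringType feature columns.
 val indexer = new StringIndexer()
   .setInputCol("country")
   .setOutputCol("countryIndex")
 val encoder = new OneHotEncoder()
   .setInputCol("countryIndex")
   .setOutputCol("countryVec")

 val encoded = encoder.transform(indexer.fit(df).transform(df))
 {code}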






[jira] [Updated] (SPARK-9230) SparkR RFormula should support StringType features

2015-07-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9230:
-
Assignee: Eric Liang

 SparkR RFormula should support StringType features
 --

 Key: SPARK-9230
 URL: https://issues.apache.org/jira/browse/SPARK-9230
 Project: Spark
  Issue Type: New Feature
  Components: ML, SparkR
Reporter: Eric Liang
Assignee: Eric Liang

 StringType features will need to be encoded using OneHotEncoder to be used 
 for regression. See umbrella design doc 
 https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing






[jira] [Updated] (SPARK-9192) add initialization phase for nondeterministic expression

2015-07-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-9192:
---
Description: 
Currently, nondeterministic expressions are broken without an explicit 
initialization phase.

Let me take `MonotonicallyIncreasingID` as an example. This expression needs 
mutable state to remember how many times it has been evaluated, so we use 
`@transient var count: Long` there. Because the field is transient, `count` is 
reset to 0, and **only** to 0, when the expression is serialized and 
deserialized, since deserializing a transient variable yields the default value. 
There is *no way* to use a different initial value for `count` until we add an 
explicit initialization phase.
For now no nondeterministic expression needs this feature, but we may add new 
ones that need a different initial value for their mutable state in the future.

  was:
Some expressions have mutable state and need to be initialized first (e.g. 
Rand, WeekOfYear). Currently we use a `@transient lazy val` so that the state 
is automatically initialized on first use and reset after a 
serialize/deserialize round trip.
However, this approach is rather ugly and accessing a lazy val is not 
efficient; we should have an explicit initialization phase for expressions.


 add initialization phase for nondeterministic expression
 

 Key: SPARK-9192
 URL: https://issues.apache.org/jira/browse/SPARK-9192
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan

 Currently, nondeterministic expressions are broken without an explicit 
 initialization phase.
 Let me take `MonotonicallyIncreasingID` as an example. This expression needs 
 mutable state to remember how many times it has been evaluated, so we use 
 `@transient var count: Long` there. Because the field is transient, `count` is 
 reset to 0, and **only** to 0, when the expression is serialized and 
 deserialized, since deserializing a transient variable yields the default value. 
 There is *no way* to use a different initial value for `count` until we add an 
 explicit initialization phase.
 For now no nondeterministic expression needs this feature, but we may add new 
 ones that need a different initial value for their mutable state in the future.
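 A minimal self-contained sketch of the problem described above (the class is 
 illustrative, not the actual expression implementation):
 {code}
 // A stateful "expression" that counts its evaluations in a transient field.
 class CountingExpression extends Serializable {
   // After a serialize/deserialize round trip this field comes back as 0L (the
   // JVM default for a transient Long); there is no hook for restoring it to any
   // other value, which is why an explicit initialization phase is needed.
   @transient var count: Long = 0L

   def eval(): Long = {
     count += 1
     count
   }
 }
 {code}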






[jira] [Commented] (SPARK-9192) add initialization phase for nondeterministic expression

2015-07-22 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637124#comment-14637124
 ] 

Wenchen Fan commented on SPARK-9192:


hi [~lian cheng], sorry about leaving the description empty at the beginning; 
this issue came out of an offline discussion with rxin. I have updated it now, 
does this make sense to you?

 add initialization phase for nondeterministic expression
 

 Key: SPARK-9192
 URL: https://issues.apache.org/jira/browse/SPARK-9192
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan

 Currently, nondeterministic expressions are broken without an explicit 
 initialization phase.
 Let me take `MonotonicallyIncreasingID` as an example. This expression needs 
 mutable state to remember how many times it has been evaluated, so we use 
 `@transient var count: Long` there. Because the field is transient, `count` is 
 reset to 0, and **only** to 0, when the expression is serialized and 
 deserialized, since deserializing a transient variable yields the default value. 
 There is *no way* to use a different initial value for `count` until we add an 
 explicit initialization phase.
 For now no nondeterministic expression needs this feature, but we may add new 
 ones that need a different initial value for their mutable state in the future.






[jira] [Created] (SPARK-9256) Message delay causes Master crash upon registering application

2015-07-22 Thread Colin Scott (JIRA)
Colin Scott created SPARK-9256:
--

 Summary: Message delay causes Master crash upon registering 
application
 Key: SPARK-9256
 URL: https://issues.apache.org/jira/browse/SPARK-9256
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Colin Scott
Priority: Minor


This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and I 
believe it is only possible to trigger in production when the AppClient and 
Master are on different machines.

As part of initialization, the AppClient 
[registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124]
 with the Master by repeatedly sending a RegisterApplication message until it 
receives a RegisteredApplication response.

If the RegisteredApplication response is delayed by at least 
REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the 
RegisterApplication RPC), it is possible for the Master to receive *two* 
RegisterApplication messages for the same AppClient.

Upon receiving the second RegisterApplication message, the master 
[attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274]
 to persist the ApplicationInfo to disk. Since the file already exists, 
FileSystemPersistenceEngine 
[throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59]
 an IllegalStateException, and the Master crashes.

Incidentally, it appears that there is already a 
[TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266]
 in the code to handle this scenario.

I have a reproducing scenario for this bug on an old version of Spark (1.0.1), 
but upon inspecting the latest version of the code it appears that it is still 
possible to trigger it. Let me know if you would like reproducing steps for 
triggering it on the old version of Spark.
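A minimal sketch of the idempotency guard the existing TODO hints at (the names 
below are illustrative, not the actual Master code): treat a second 
RegisterApplication for an already-known application as a re-send instead of 
persisting it again.

{code}
import scala.collection.mutable

class RegistrationHandler {
  private val registeredAppIds = mutable.Set[String]()

  def onRegisterApplication(appId: String): Unit = {
    if (registeredAppIds.contains(appId)) {
      // Duplicate caused by a delayed or duplicated RPC: just re-acknowledge,
      // and do not persist the ApplicationInfo a second time (which currently throws).
    } else {
      registeredAppIds += appId
      // persist ApplicationInfo, then reply with RegisteredApplication
    }
  }
}
{code}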






[jira] [Commented] (SPARK-1301) Add UI elements to collapse Aggregated Metrics by Executor pane on stage page

2015-07-22 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637299#comment-14637299
 ] 

Ryan Williams commented on SPARK-1301:
--

[~srowen] I read this as referring to the Aggregated Metrics by Executor pane 
on the stage page, which is not in the Executors tab; it causes the more 
commonly accessed per-task table on the stage page to be many screen-heights 
below the fold when the number of executors is large.

A similar argument could be made about the Distribution Across Executors 
table on the RDD page.

 Add UI elements to collapse Aggregated Metrics by Executor pane on stage 
 page
 ---

 Key: SPARK-1301
 URL: https://issues.apache.org/jira/browse/SPARK-1301
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Matei Zaharia
Priority: Minor
  Labels: Starter

 This table is useful but it takes up a lot of space on larger clusters, 
 hiding the more commonly accessed stage page. We could also move the table 
 below if collapsing it is difficult.






[jira] [Commented] (SPARK-4024) Remember user preferences for metrics to show in the UI

2015-07-22 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637306#comment-14637306
 ] 

Ryan Williams commented on SPARK-4024:
--

FWIW it seemed like [~zsxwing] solved some of this in 
https://issues.apache.org/jira/browse/SPARK-4598 / 
https://github.com/apache/spark/pull/7399

 Remember user preferences for metrics to show in the UI
 ---

 Key: SPARK-4024
 URL: https://issues.apache.org/jira/browse/SPARK-4024
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Kay Ousterhout
Priority: Minor

 We should remember the metrics a user has previously chosen to display for 
 each stage, so that the user doesn't need to reselect the interesting metrics 
 each time they open a stage detail page.






[jira] [Updated] (SPARK-9255) Timestamp handling incorrect for Spark 1.4.1 on Linux

2015-07-22 Thread Paul Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Wu updated SPARK-9255:
---
Attachment: timestamp_bug.zip

the project can run without issues. But when it is deployed to the 

 Timestamp handling incorrect for Spark 1.4.1 on Linux
 -

 Key: SPARK-9255
 URL: https://issues.apache.org/jira/browse/SPARK-9255
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
 Environment: Redhat Linux, Java 8.0 and Spark 1.4.1 release.
Reporter: Paul Wu
 Attachments: timestamp_bug.zip


 This is a very strange case involving timestamps. I can run the program on 
 Windows using the dev pom.xml (1.4.1) or the 1.3.1 release downloaded from 
 Apache without issues, but when I run it on the Spark 1.4.1 release, either 
 downloaded from Apache or built with Scala 2.11, on Red Hat Linux, it fails 
 with the following error (the code I used is after this stack trace):
 15/07/22 12:02:50  ERROR Executor 96: Exception in task 0.0 in stage 0.0 (TID 
 0)
 java.util.concurrent.ExecutionException: scala.tools.reflect.ToolBoxError: 
 reflective compilation has failed:
 value  is not a member of TimestampType.this.InternalType
 at 
 org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
 at 
 org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
 at 
 org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
 at 
 org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
 at 
 org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
 at 
 org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
 at 
 org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
 at 
 org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
 at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
 at 
 org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
 at 
 org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:105)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:102)
 at 
 org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:170)
 at 
 org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$9.apply(GeneratedAggregate.scala:261)
 at 
 org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$9.apply(GeneratedAggregate.scala:246)
 at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
 at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: scala.tools.reflect.ToolBoxError: reflective compilation has 
 failed:
 value  is not a member of TimestampType.this.InternalType
 at 
 scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.throwIfErrors(ToolBoxFactory.scala:316)
 at 
 scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.wrapInPackageAndCompile(ToolBoxFactory.scala:198)
 at 
 scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.compile(ToolBoxFactory.scala:252)
 at 
 scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$$anonfun$compile$2.apply(ToolBoxFactory.scala:429)
 at 
 scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$$anonfun$compile$2.apply(ToolBoxFactory.scala:422)
 at 
 

[jira] [Updated] (SPARK-7075) Project Tungsten: Improving Physical Execution

2015-07-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7075:
---
Target Version/s: 1.6.0  (was: )

 Project Tungsten: Improving Physical Execution
 --

 Key: SPARK-7075
 URL: https://issues.apache.org/jira/browse/SPARK-7075
 Project: Spark
  Issue Type: Epic
  Components: Block Manager, Shuffle, Spark Core, SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 Based on our observations, the majority of Spark workloads are not bottlenecked 
 by I/O or network, but rather by CPU and memory. This project focuses on 3 
 areas to improve the efficiency of memory and CPU for Spark applications, to 
 push performance closer to the limits of the underlying hardware.
 *Memory Management and Binary Processing*
 - Avoiding non-transient Java objects (store them in binary format), which 
 reduces GC overhead.
 - Minimizing memory usage through denser in-memory data format, which means 
 we spill less.
 - Better memory accounting (size of bytes) rather than relying on heuristics
 - For operators that understand data types (in the case of DataFrames and 
 SQL), work directly against binary format in memory, i.e. have no 
 serialization/deserialization
 *Cache-aware Computation*
 - Faster sorting and hashing for aggregations, joins, and shuffle
 *Code Generation*
 - Faster expression evaluation and DataFrame/SQL operators
 - Faster serializer
 Several parts of project Tungsten leverage the DataFrame model, which gives 
 us more semantics about the application. We will also retrofit the 
 improvements onto Spark’s RDD API whenever possible.






[jira] [Created] (SPARK-9255) Timestamp handling incorrect for Spark 1.4.1 on Linux

2015-07-22 Thread Paul Wu (JIRA)
Paul Wu created SPARK-9255:
--

 Summary: Timestamp handling incorrect for Spark 1.4.1 on Linux
 Key: SPARK-9255
 URL: https://issues.apache.org/jira/browse/SPARK-9255
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
 Environment: Redhat Linux, Java 8.0 and Spark 1.4.1 release.
Reporter: Paul Wu


This is a very strange case involving timestamps. I can run the program on 
Windows using the dev pom.xml (1.4.1) or the 1.3.1 release downloaded from 
Apache without issues, but when I run it on the Spark 1.4.1 release, either 
downloaded from Apache or built with Scala 2.11, on Red Hat Linux, it fails 
with the following error (the code I used is after this stack trace):

15/07/22 12:02:50  ERROR Executor 96: Exception in task 0.0 in stage 0.0 (TID 0)
java.util.concurrent.ExecutionException: scala.tools.reflect.ToolBoxError: 
reflective compilation has failed:

value  is not a member of TimestampType.this.InternalType
at 
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 
org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at 
org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at 
org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at 
org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at 
org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
at 
org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:105)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:102)
at 
org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:170)
at 
org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$9.apply(GeneratedAggregate.scala:261)
at 
org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$9.apply(GeneratedAggregate.scala:246)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: scala.tools.reflect.ToolBoxError: reflective compilation has failed:

value  is not a member of TimestampType.this.InternalType
at 
scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.throwIfErrors(ToolBoxFactory.scala:316)
at 
scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.wrapInPackageAndCompile(ToolBoxFactory.scala:198)
at 
scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.compile(ToolBoxFactory.scala:252)
at 
scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$$anonfun$compile$2.apply(ToolBoxFactory.scala:429)
at 
scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$$anonfun$compile$2.apply(ToolBoxFactory.scala:422)
at 
scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$withCompilerApi$.liftedTree2$1(ToolBoxFactory.scala:355)
at 
scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$withCompilerApi$.apply(ToolBoxFactory.scala:355)
at 
scala.tools.reflect.ToolBoxFactory$ToolBoxImpl.compile(ToolBoxFactory.scala:422)
at 
scala.tools.reflect.ToolBoxFactory$ToolBoxImpl.eval(ToolBoxFactory.scala:444)
at 

[jira] [Created] (SPARK-9257) Fix the false negative of Aggregate2Sort and FinalAndCompleteAggregate2Sort's missingInput

2015-07-22 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9257:
---

 Summary: Fix the false negative of Aggregate2Sort and 
FinalAndCompleteAggregate2Sort's missingInput
 Key: SPARK-9257
 URL: https://issues.apache.org/jira/browse/SPARK-9257
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai


{code}
sqlContext.sql(
  """
    |SELECT sum(value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin).explain()

== Physical Plan ==
Aggregate2Sort Some(List(key#510)), [key#510], [(sum(CAST(value#511, 
LongType))2,mode=Final,isDistinct=false)], [sum(CAST(value#511, 
LongType))#1435L], [sum(CAST(value#511, LongType))#1435L AS _c0#1426L]
 ExternalSort [key#510 ASC], false
  Exchange hashpartitioning(key#510)
   Aggregate2Sort None, [key#510], [(sum(CAST(value#511, 
LongType))2,mode=Partial,isDistinct=false)], [currentSum#1433L], 
[key#510,currentSum#1433L]
ExternalSort [key#510 ASC], false
 PhysicalRDD [key#510,value#511], MapPartitionsRDD[97] at apply at 
Transformer.scala:22

sqlContext.sql(
  """
    |SELECT sum(distinct value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin).explain()

== Physical Plan ==
!FinalAndCompleteAggregate2Sort [key#510,CAST(value#511, LongType)#1446L], 
[key#510], [(sum(CAST(value#511, 
LongType)#1446L)2,mode=Complete,isDistinct=false)], [sum(CAST(value#511, 
LongType))#1445L], [sum(CAST(value#511, LongType))#1445L AS _c0#1438L]
 Aggregate2Sort Some(List(key#510)), [key#510,CAST(value#511, LongType)#1446L], 
[key#510,CAST(value#511, LongType)#1446L]
  ExternalSort [key#510 ASC,CAST(value#511, LongType)#1446L ASC], false
   Exchange hashpartitioning(key#510)
!Aggregate2Sort None, [key#510,CAST(value#511, LongType) AS CAST(value#511, 
LongType)#1446L], [key#510,CAST(value#511, LongType)#1446L]
 ExternalSort [key#510 ASC,CAST(value#511, LongType) AS CAST(value#511, 
LongType)#1446L ASC], false
  PhysicalRDD [key#510,value#511], MapPartitionsRDD[102] at apply at 
Transformer.scala:22
{code}

For the examples shown above, you can see there is a {{!}} at the beginning of 
the operator's {{simpleString}}, which indicates that its {{missingInput}} is 
not empty. Actually, this is a false negative and we need to fix it.

Also, it would be good to make these two operators' {{simpleString}} more reader 
friendly (so people can tell which are the grouping expressions, which are the 
aggregate functions, and what the mode of each aggregate function is).






[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-07-22 Thread Maruf Aytekin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637337#comment-14637337
 ] 

Maruf Aytekin commented on SPARK-5992:
--

In addition to Charikar's scheme for cosine that [~karlhigley] pointed out, LSH 
schemes for the other well-known similarity/distance measures are as follows (a 
small sketch of the cosine scheme is included after the list):

1. Hamming norm:
A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via 
Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999).
http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf

2. Lp norms:
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni Locality-Sensitive Hashing 
Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual
http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf
http://people.csail.mit.edu/indyk/nips-nn.ps

3. Jaccard distance:
Mining Massive Data Sets chapter#3: 
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

4. Cosine distance and Earth movers distance (EMD):
M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In 
Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002).
http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf
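As a concrete reference point, a minimal sketch of Charikar's sign-random-projection 
scheme for cosine similarity (a standalone helper, not an MLlib API):

{code}
import scala.util.Random

// Each output bit is the sign of the dot product between the input vector and a
// random Gaussian hyperplane; vectors at a small angle agree on most bits.
def cosineLshSignature(v: Array[Double], numBits: Int, seed: Long = 42L): Array[Int] = {
  val rng = new Random(seed)
  Array.fill(numBits) {
    val hyperplane = Array.fill(v.length)(rng.nextGaussian())
    val dot = v.zip(hyperplane).map { case (a, b) => a * b }.sum
    if (dot >= 0) 1 else 0
  }
}
{code}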



 Locality Sensitive Hashing (LSH) for MLlib
 --

 Key: SPARK-5992
 URL: https://issues.apache.org/jira/browse/SPARK-5992
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley

 Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
 great to discuss some possible algorithms here, choose an API, and make a PR 
 for an initial algorithm.






[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities

2015-07-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637316#comment-14637316
 ] 

Joseph K. Bradley commented on SPARK-6885:
--

We can resume this work.  Do you think you'd have time to finish it by the end 
of this week?  Sorry for the rush, but the code cutoff for the next release is 
in ~9 days.  If you don't have time right now, I can send a patch instead.  
Thanks!

 Decision trees: predict class probabilities
 ---

 Key: SPARK-6885
 URL: https://issues.apache.org/jira/browse/SPARK-6885
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Under spark.ml, have DecisionTreeClassifier (currently being added) extend 
 ProbabilisticClassifier.






[jira] [Resolved] (SPARK-9082) Filter using non-deterministic expressions should not be pushed down

2015-07-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-9082.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7446
[https://github.com/apache/spark/pull/7446]

 Filter using non-deterministic expressions should not be pushed down
 

 Key: SPARK-9082
 URL: https://issues.apache.org/jira/browse/SPARK-9082
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Wenchen Fan
 Fix For: 1.5.0


 For example,
 {code}
 val df = sqlContext.range(1, 10).select($"id", rand(0).as('r))
 df.as("a").join(df.filter($"r" < 0.5).as("b"), $"a.id" === $"b.id").explain(true)
 {code}
 The plan is 
 {code}
 == Physical Plan ==
 ShuffledHashJoin [id#55323L], [id#55327L], BuildRight
  Exchange (HashPartitioning 200)
   Project [id#55323L,Rand 0 AS r#55324]
PhysicalRDD [id#55323L], MapPartitionsRDD[42268] at range at console:37
  Exchange (HashPartitioning 200)
   Project [id#55327L,Rand 0 AS r#55325]
Filter (LessThan)
 PhysicalRDD [id#55327L], MapPartitionsRDD[42268] at range at console:37
 {code}
 The rand gets evaluated twice instead of once. 
 This happens because, when we push down predicates, we replace the attribute 
 reference in the predicate with the actual expression.






[jira] [Resolved] (SPARK-9165) Implement code generation for CreateArray, CreateStruct, and CreateNamedStruct

2015-07-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-9165.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7537
[https://github.com/apache/spark/pull/7537]

 Implement code generation for CreateArray, CreateStruct, and CreateNamedStruct
 --

 Key: SPARK-9165
 URL: https://issues.apache.org/jira/browse/SPARK-9165
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Yijie Shen
 Fix For: 1.5.0









[jira] [Updated] (SPARK-6970) Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable

2015-07-22 Thread John Muller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Muller updated SPARK-6970:
---
Affects Version/s: (was: 1.4.1)
   (was: 1.4.0)

 Document what the options: Map[String, String] does on DataFrame.save and 
 DataFrame.saveAsTable
 ---

 Key: SPARK-6970
 URL: https://issues.apache.org/jira/browse/SPARK-6970
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.0
Reporter: John Muller
Priority: Trivial
  Labels: DataFrame
   Original Estimate: 2h
  Remaining Estimate: 2h

 The save options on DataFrames are not easily discerned:
 [ResolvedDataSource.apply|https://github.com/apache/spark/blob/b75b3070740803480d235b0c9a86673721344f30/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala#L222]
   is where the pattern match occurs:
 {code:title=ddl.scala|borderStyle=solid}
 case dataSource: SchemaRelationProvider =>
 dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options), schema)
 {code}
 Implementing classes are currently: TableScanSuite, JSONRelation, and 
 newParquet
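 For illustration, a hedged sketch of how an options map flows from the 
 user-facing save call down to the relation provider (the "path" option and the 
 exact overload should be double-checked against the 1.3/1.4 Scaladoc):
 {code}
 import org.apache.spark.sql.SaveMode

 // Everything in the options map is handed to ResolvedDataSource and ultimately
 // to the data source's createRelation; which keys a source understands is
 // source-specific and currently undocumented, which is what this issue asks to fix.
 df.save("json", SaveMode.ErrorIfExists, Map("path" -> "/tmp/people.json"))
 {code}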






[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2015-07-22 Thread Colin Scott (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637443#comment-14637443
 ] 

Colin Scott commented on SPARK-6028:


Curious: does this new RPC implementation use TCP as its underlying transport 
protocol, or UDP?

(I believe akka-remote uses TCP by default.)

 Provide an alternative RPC implementation based on the network transport 
 module
 ---

 Key: SPARK-6028
 URL: https://issues.apache.org/jira/browse/SPARK-6028
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Priority: Critical

 Network transport module implements a low level RPC interface. We can build a 
 new RPC implementation on top of that to replace Akka's.
 Design document: 
 https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing






[jira] [Created] (SPARK-9259) How to write Python code to send data from Kafka via Spark to HDFS?

2015-07-22 Thread sutanu das (JIRA)
sutanu das created SPARK-9259:
-

 Summary: How to write Python code to send data from Kafka via 
Spark to HDFS?
 Key: SPARK-9259
 URL: https://issues.apache.org/jira/browse/SPARK-9259
 Project: Spark
  Issue Type: Question
  Components: PySpark, Spark Core
Reporter: sutanu das


1. How to write Python code to send data from Kafka via Spark to HDFS?

2. We want to send loglines from Kafka queue to HDFS via Spark - Is there any 
basecode available in Python for sending logfiles via Spark to HDFS?

3. Is there any such config available in Spark to write to HDFS? like hdfs.path 
= name_node:8020/path_2_hdfs (kinda like storm.yaml file in Storm)
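For reference, a minimal sketch of the pipeline being asked about, written against 
the Scala streaming API (pyspark.streaming.kafka exposes the equivalent calls); the 
host names, topic, and HDFS path are placeholders:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaToHdfs")
val ssc = new StreamingContext(conf, Seconds(10))

// Receive (key, message) pairs from the "logs" topic and keep the log lines.
val lines = KafkaUtils
  .createStream(ssc, "zkhost:2181", "log-consumer-group", Map("logs" -> 1))
  .map(_._2)

// Each batch is written as a new directory under the given HDFS prefix.
lines.saveAsTextFiles("hdfs://name_node:8020/path_2_hdfs/loglines")

ssc.start()
ssc.awaitTermination()
{code}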






[jira] [Created] (SPARK-9260) Standalone scheduling can overflow a worker with cores

2015-07-22 Thread Andrew Or (JIRA)
Andrew Or created SPARK-9260:


 Summary: Standalone scheduling can overflow a worker with cores
 Key: SPARK-9260
 URL: https://issues.apache.org/jira/browse/SPARK-9260
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Nishkam Ravi


If the cluster is started with `spark.deploy.spreadOut = false`, then we may 
allocate more cores than are available on a worker. E.g. if a worker has 8 cores 
and an application sets `spark.cores.max = 10`, then we end up with the situation 
shown in the following screenshot:
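A worked version of the example above, with the bound that is missing today 
(variable names are illustrative):
{code}
val workerCoresFree = 8       // cores the worker actually has
val appCoresRemaining = 10    // portion of spark.cores.max still unsatisfied
// Today the scheduler can hand out all 10; it should cap the grant at what is free:
val coresToAssign = math.min(appCoresRemaining, workerCoresFree)  // 8
{code}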






[jira] [Comment Edited] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2015-07-22 Thread Colin Scott (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637443#comment-14637443
 ] 

Colin Scott edited comment on SPARK-6028 at 7/22/15 7:10 PM:
-

Curious: does this new RPC implementation use TCP as its underlying transport 
protocol, or UDP? In other words, does the underlying transport protocol 
guarantee in-order delivery between hosts?
(I believe akka-remote uses TCP by default.)

Thanks!


was (Author: colin_scott):
Curious: does this new RPC implementation use TCP as its underlying transport 
protocol, or UDP?

(I believe akka-remote uses TCP by default.)

 Provide an alternative RPC implementation based on the network transport 
 module
 ---

 Key: SPARK-6028
 URL: https://issues.apache.org/jira/browse/SPARK-6028
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Priority: Critical

 Network transport module implements a low level RPC interface. We can build a 
 new RPC implementation on top of that to replace Akka's.
 Design document: 
 https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing






[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2015-07-22 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637472#comment-14637472
 ] 

Reynold Xin commented on SPARK-6028:


TCP.


 Provide an alternative RPC implementation based on the network transport 
 module
 ---

 Key: SPARK-6028
 URL: https://issues.apache.org/jira/browse/SPARK-6028
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Priority: Critical

 Network transport module implements a low level RPC interface. We can build a 
 new RPC implementation on top of that to replace Akka's.
 Design document: 
 https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing






[jira] [Updated] (SPARK-9224) OnlineLDAOptimizer Performance Improvements

2015-07-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9224:
-
Shepherd: Joseph K. Bradley

 OnlineLDAOptimizer Performance Improvements
 ---

 Key: SPARK-9224
 URL: https://issues.apache.org/jira/browse/SPARK-9224
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Feynman Liang
Assignee: Feynman Liang

 OnlineLDAOptimizer's current implementation can be improved by using in-place 
 updates (instead of reassignment to vars), reducing the number of 
 transpositions, and using an outer product (instead of looping) to collect stats.
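 A small sketch of the "outer product instead of looping" idea using Breeze 
 (shapes and variable names are illustrative):
 {code}
 import breeze.linalg.{DenseMatrix, DenseVector}

 // Per-document topic responsibilities (k topics) and counts of the document's words.
 val phi: DenseVector[Double] = DenseVector(0.1, 0.7, 0.2)
 val wordCounts: DenseVector[Double] = DenseVector(2.0, 1.0, 3.0)

 // One rank-1 update (k x numWords) replaces a nested loop over topics and words.
 val stats: DenseMatrix[Double] = phi * wordCounts.t
 {code}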






[jira] [Updated] (SPARK-9224) OnlineLDAOptimizer Performance Improvements

2015-07-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9224:
-
Assignee: Feynman Liang

 OnlineLDAOptimizer Performance Improvements
 ---

 Key: SPARK-9224
 URL: https://issues.apache.org/jira/browse/SPARK-9224
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Feynman Liang
Assignee: Feynman Liang

 OnlineLDAOptimizer's current implementation can be improved by using in-place 
 updates (instead of reassignment to vars), reducing the number of 
 transpositions, and using an outer product (instead of looping) to collect stats.






[jira] [Updated] (SPARK-3947) Support Scala/Java UDAF

2015-07-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-3947:

Summary: Support Scala/Java UDAF  (was: Support UDAF)

 Support Scala/Java UDAF
 ---

 Key: SPARK-3947
 URL: https://issues.apache.org/jira/browse/SPARK-3947
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Pei-Lun Lee
Assignee: Yin Huai
 Fix For: 1.5.0


 Right now only Hive UDAFs are supported. It would be nice to support UDAFs in a 
 way similar to UDFs registered through SQLContext.registerFunction.
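 A hedged sketch of the shape such a Scala UDAF could take (names and method 
 signatures are purely illustrative of the direction, not the final API):
 {code}
 // Illustrative only: a user-defined aggregate split into the usual
 // initialize / update / merge / evaluate steps.
 trait SimpleUDAF[IN, BUF, OUT] {
   def zero: BUF                        // initial aggregation buffer
   def update(buf: BUF, in: IN): BUF    // fold one input value into the buffer
   def merge(b1: BUF, b2: BUF): BUF     // combine partial buffers across partitions
   def evaluate(buf: BUF): OUT          // produce the final result
 }

 object LongSum extends SimpleUDAF[Long, Long, Long] {
   def zero: Long = 0L
   def update(buf: Long, in: Long): Long = buf + in
   def merge(b1: Long, b2: Long): Long = b1 + b2
   def evaluate(buf: Long): Long = buf
 }
 {code}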






[jira] [Created] (SPARK-9261) StreamingTab calls public APIs in Spark core that expose shaded classes

2015-07-22 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-9261:
-

 Summary: StreamingTab calls public APIs in Spark core that expose 
shaded classes
 Key: SPARK-9261
 URL: https://issues.apache.org/jira/browse/SPARK-9261
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin
Priority: Minor


There's a minor issue in {{StreamingTab}} that has hit me a couple of times 
when building with maven.

It calls methods in {{JettyUtils}} and {{WebUI}} that expose Jetty types 
(namely {{ServletContextHandler}}). Since Jetty is now shaded, it's not safe to 
do that: when running unit tests, the spark-core jar will have the shaded 
version of the APIs while the streaming classes haven't been shaded yet.

This seems, at the lowest level, to be a bug in scalac (I've run into this 
issue in other modules before), since the code shouldn't compile at all, but we 
should avoid that kind of thing in the first place.






[jira] [Commented] (SPARK-9262) Treat Scala compiler warnings as errors

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637581#comment-14637581
 ] 

Apache Spark commented on SPARK-9262:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7598

 Treat Scala compiler warnings as errors
 ---

 Key: SPARK-9262
 URL: https://issues.apache.org/jira/browse/SPARK-9262
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Reynold Xin
Assignee: Reynold Xin

 I've seen a few cases in the past few weeks where the compiler throws warnings 
 that are caused by legitimate bugs. This patch turns warnings into errors, 
 except deprecation warnings.
 Note that ideally we should be able to mark deprecation warnings as errors as 
 well. However, due to the lack of ability to suppress individual warning 
 messages in the Scala compiler, we cannot do that (since we do need to access 
 deprecated APIs in Hadoop).
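For context, a minimal sbt sketch of the blunt way to make warnings fatal; this is not the change in the PR above, and it illustrates the limitation just described, since {{-Xfatal-warnings}} would also turn deprecation warnings into errors.

{code}
// build.sbt sketch (illustrative only, not Spark's actual build change):
scalacOptions ++= Seq(
  "-deprecation",     // keep reporting deprecation warnings
  "-Xfatal-warnings"  // promote (all) warnings to errors, deprecations included
)
{code}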



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9262) Treat Scala compiler warnings as errors

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9262:
---

Assignee: Apache Spark  (was: Reynold Xin)

 Treat Scala compiler warnings as errors
 ---

 Key: SPARK-9262
 URL: https://issues.apache.org/jira/browse/SPARK-9262
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Reynold Xin
Assignee: Apache Spark

 I've seen a few cases in the past few weeks where the compiler throws 
 warnings that are caused by legitimate bugs. This patch turns warnings into 
 errors, except deprecation warnings.
 Note that ideally we should be able to mark deprecation warnings as errors as 
 well. However, due to the lack of ability to suppress individual warning 
 messages in the Scala compiler, we cannot do that (since we do need to access 
 deprecated APIs in Hadoop).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9258) Remove BroadcastLeftSemiJoinHash

2015-07-22 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9258:
--

 Summary: Remove BroadcastLeftSemiJoinHash
 Key: SPARK-9258
 URL: https://issues.apache.org/jira/browse/SPARK-9258
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


We have more join operators than we have resources to optimize. In this 
case, BroadcastLeftSemiJoinHash isn't really necessary. We can still use an 
equi-join operator to do the join, and just not include any values from the 
other side of the join.

We waste a little bit of space by building a hash map rather than a hash set, 
but at the end of the day, unless we are going to spend a lot of time optimizing 
the hash set, our Tungsten hash map will be a lot more efficient than the hash 
set anyway ...
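As an illustration of the idea, a plain-Scala sketch (not Spark's operator code) of a left semi join implemented with the same hashed-relation shape an equi-join uses; the right side is only probed for key existence and contributes no output columns.

{code}
// Illustrative sketch only, with made-up types.
case class Record(key: Int, value: String)

def leftSemiJoin(left: Seq[Record], right: Seq[Record]): Seq[Record] = {
  // Building a key -> row map wastes a little space compared to a pure key set,
  // but it lets the equi-join hash-table machinery be reused, as argued above.
  val hashed: Map[Int, Record] = right.map(r => r.key -> r).toMap
  left.filter(l => hashed.contains(l.key))
}

// leftSemiJoin(Seq(Record(1, "a"), Record(2, "b")), Seq(Record(2, "x")))
//   == Seq(Record(2, "b"))
{code}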





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9259) How to write Python code to send data from Kafka via Spark to HDFS?

2015-07-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14637458#comment-14637458
 ] 

Marcelo Vanzin commented on SPARK-9259:
---

Are you trying to report a bug or to ask questions?

This is a bug tracker. For generic questions, please use the mailing lists:
http://spark.apache.org/community.html

 How to write Python code to send data from Kafka via Spark to HDFS?
 ---

 Key: SPARK-9259
 URL: https://issues.apache.org/jira/browse/SPARK-9259
 Project: Spark
  Issue Type: Question
  Components: PySpark, Spark Core
Reporter: sutanu das

 1. How to write Python code to send data from Kafka via Spark to HDFS?
 2. We want to send loglines from Kafka queue to HDFS via Spark - Is there any 
 basecode available in Python for sending logfiles via Spark to HDFS?
 3. Is there any such config available in Spark to write to HDFS? like 
 hdfs.path = name_node:8020/path_2_hdfs (kinda like storm.yaml file in Storm)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9259) How to write Python code to send data from Kafka via Spark to HDFS?

2015-07-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-9259.
---
Resolution: Invalid

 How to write Python code to send data from Kafka via Spark to HDFS?
 ---

 Key: SPARK-9259
 URL: https://issues.apache.org/jira/browse/SPARK-9259
 Project: Spark
  Issue Type: Question
  Components: PySpark, Spark Core
Reporter: sutanu das

 1. How to write Python code to send data from Kafka via Spark to HDFS?
 2. We want to send loglines from Kafka queue to HDFS via Spark - Is there any 
 basecode available in Python for sending logfiles via Spark to HDFS?
 3. Is there any such config available in Spark to write to HDFS? like 
 hdfs.path = name_node:8020/path_2_hdfs (kinda like storm.yaml file in Storm)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6970) Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable

2015-07-22 Thread John Muller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Muller updated SPARK-6970:
---
Target Version/s:   (was: 1.3.2)

 Document what the options: Map[String, String] does on DataFrame.save and 
 DataFrame.saveAsTable
 ---

 Key: SPARK-6970
 URL: https://issues.apache.org/jira/browse/SPARK-6970
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.0, 1.4.0, 1.4.1
Reporter: John Muller
Priority: Trivial
  Labels: DataFrame
   Original Estimate: 2h
  Remaining Estimate: 2h

 The save options on DataFrames are not easily discerned:
 [ResolvedDataSource.apply|https://github.com/apache/spark/blob/b75b3070740803480d235b0c9a86673721344f30/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala#L222]
   is where the pattern match occurs:
 {code:title=ddl.scala|borderStyle=solid}
 case dataSource: SchemaRelationProvider =>
   dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options), schema)
 {code}
 Implementing classes are currently: TableScanSuite, JSONRelation, and 
 newParquet
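To illustrate where that options map ends up, a hypothetical sketch (this class is invented for illustration; the real providers are the ones listed above): ResolvedDataSource hands the case-insensitive map to the provider, which picks out the keys it understands.

{code}
// Hypothetical options consumer, illustrative only.
class ExampleOptionsConsumer {
  def createRelation(options: Map[String, String]): String = {
    val path  = options.getOrElse("path", sys.error("'path' must be specified"))
    val codec = options.getOrElse("compression", "none")
    s"would read $path with compression=$codec"
  }
}
{code}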



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6970) Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable

2015-07-22 Thread John Muller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Muller updated SPARK-6970:
---
Affects Version/s: 1.4.0
   1.4.1

 Document what the options: Map[String, String] does on DataFrame.save and 
 DataFrame.saveAsTable
 ---

 Key: SPARK-6970
 URL: https://issues.apache.org/jira/browse/SPARK-6970
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.0, 1.4.0, 1.4.1
Reporter: John Muller
Priority: Trivial
  Labels: DataFrame
   Original Estimate: 2h
  Remaining Estimate: 2h

 The save options on DataFrames are not easily discerned:
 [ResolvedDataSource.apply|https://github.com/apache/spark/blob/b75b3070740803480d235b0c9a86673721344f30/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala#L222]
   is where the pattern match occurs:
 {code:title=ddl.scala|borderStyle=solid}
 case dataSource: SchemaRelationProvider =>
   dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options), schema)
 {code}
 Implementing classes are currently: TableScanSuite, JSONRelation, and 
 newParquet



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9258) Remove all semi join physical operator

2015-07-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9258:
---
Description: 
We have 4 semi join operators. In this case, they are not really necessary. 
We can still use an equi-join operator to do the join, and just not include any 
values from the other side of the join.

We waste a little bit of space by building a hash map rather than a hash set, 
but at the end of the day, unless we are going to spend a lot of time optimizing 
the hash set, our Tungsten hash map will be a lot more efficient than the hash 
set anyway. This way, semi-join automatically benefits from all the work we do 
in Tungsten.




  was:
We have more join operators than we have resources to optimize. In this 
case, BroadcastLeftSemiJoinHash isn't really necessary. We can still use an 
equi-join operator to do the join, and just not include any values from the 
other side of the join.

We waste a little bit of space by building a hash map rather than a hash set, 
but at the end of the day, unless we are going to spend a lot of time optimizing 
the hash set, our Tungsten hash map will be a lot more efficient than the hash 
set anyway ...




 Remove all semi join physical operator
 --

 Key: SPARK-9258
 URL: https://issues.apache.org/jira/browse/SPARK-9258
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin

 We have 4 semi join operators. In this case, they are not really 
 necessary. We can still use an equi-join operator to do the join, and just 
 not include any values from the other side of the join.
 We waste a little bit of space by building a hash map rather than a hash 
 set, but at the end of the day, unless we are going to spend a lot of time 
 optimizing the hash set, our Tungsten hash map will be a lot more efficient 
 than the hash set anyway. This way, semi-join automatically benefits from all 
 the work we do in Tungsten.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9024) Unsafe HashJoin

2015-07-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9024.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7480
[https://github.com/apache/spark/pull/7480]

 Unsafe HashJoin
 ---

 Key: SPARK-9024
 URL: https://issues.apache.org/jira/browse/SPARK-9024
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu
 Fix For: 1.5.0


 Create a version of BroadcastJoin that accepts UnsafeRow as inputs, and 
 outputs UnsafeRow as outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9224) OnlineLDAOptimizer Performance Improvements

2015-07-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9224.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7454
[https://github.com/apache/spark/pull/7454]

 OnlineLDAOptimizer Performance Improvements
 ---

 Key: SPARK-9224
 URL: https://issues.apache.org/jira/browse/SPARK-9224
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Feynman Liang
Assignee: Feynman Liang
 Fix For: 1.5.0


 OnlineLDAOptimizer's current implementation can be improved by using in-place 
 updating (instead of reassignment to vars), reducing the number of 
 transpositions, and using an outer product (instead of looping) to collect stats.
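For illustration, a small Breeze sketch of the outer-product idea (not the actual OnlineLDAOptimizer code; the variable names and sizes are made up): one outer product updates a whole k x vocab block of the statistics matrix in place, instead of reassigning vars and looping over entries.

{code}
import breeze.linalg.{DenseMatrix, DenseVector}

val k = 3                                                 // number of topics
val vocab = 5                                             // vocabulary size
val stats = DenseMatrix.zeros[Double](k, vocab)           // accumulated sufficient statistics

val topicWeights = DenseVector(0.2, 0.5, 0.3)             // k weights for one document
val wordCounts   = DenseVector(1.0, 0.0, 2.0, 0.0, 1.0)   // counts over the vocabulary

// In-place update with an outer product: stats(i, j) += topicWeights(i) * wordCounts(j)
stats :+= topicWeights * wordCounts.t
{code}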



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4234) Always do partial aggregation

2015-07-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-4234.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

With the interface introduced by SPARK-4233, we have the capability to always 
do partial aggregations. I am resolving it.

 Always do partial aggregation 
 ---

 Key: SPARK-4234
 URL: https://issues.apache.org/jira/browse/SPARK-4234
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
 Fix For: 1.5.0


 Currently, a UDAF developer only optionally implements a partial aggregation 
 function, which can cause performance issues. We could instead always require 
 developers to provide the partial aggregation function, as Hive does, so that 
 we always get the `mapside` aggregation optimization.
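For illustration, a plain-Scala sketch of map-side partial aggregation (not Spark's aggregation code): when every aggregate supplies a partial step, each partition can be reduced to one small buffer per key before anything is shuffled.

{code}
def partialCounts(partition: Iterator[String]): Map[String, Long] =
  partition.foldLeft(Map.empty[String, Long]) { (acc, key) =>
    acc.updated(key, acc.getOrElse(key, 0L) + 1L)          // map-side combine
  }

def mergePartials(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
  b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

// Aggregate each partition locally, then merge the small partial results.
val partitions = Seq(Seq("a", "b", "a"), Seq("b", "b", "c"))
val total = partitions.map(p => partialCounts(p.iterator)).reduce(mergePartials)
// total == Map("a" -> 2, "b" -> 3, "c" -> 1)
{code}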



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9256) Message delay causes Master crash upon registering application

2015-07-22 Thread Colin Scott (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Scott updated SPARK-9256:
---
Description: 
This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and I 
believe it is only possible to trigger in production when the AppClient and 
Master are on different machines.

As part of initialization, the AppClient 
[registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124]
 with the Master by repeatedly sending a RegisterApplication message until it 
receives a RegisteredApplication response.

If the RegisteredApplication response is delayed by at least 
REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the 
RegisterApplication RPC), it is possible for the Master to receive *two* 
RegisterApplication messages for the same AppClient.

Upon receiving the second RegisterApplication message, the master 
[attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274]
 to persist the ApplicationInfo to disk. Since the file already exists, 
FileSystemPersistenceEngine 
[throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59]
 an IllegalStateException, and the Master crashes.

Incidentally, it appears that there is already a 
[TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266]
 in the code to handle this scenario.

I have a reproducing scenario for this bug on an old version of Spark (1.0.1), 
but upon inspecting the latest version of the code it appears that it is still 
possible to trigger it. Let me know if you would like reproducing steps for 
triggering it on the old version of Spark.

It should be possible to trigger this bug even if the underlying transport 
protocol is TCP, since TCP only guarantees in-order delivery within each 
direction of the connection, not across the two directions.

  was:
This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and I 
believe it is only possible to trigger in production when the AppClient and 
Master are on different machines.

As part of initialization, the AppClient 
[registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124]
 with the Master by repeatedly sending a RegisterApplication message until it 
receives a RegisteredApplication response.

If the RegisteredApplication response is delayed by at least 
REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the 
RegisterApplication RPC), it is possible for the Master to receive *two* 
RegisterApplication messages for the same AppClient.

Upon receiving the second RegisterApplication message, the master 
[attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274]
 to persist the ApplicationInfo to disk. Since the file already exists, 
FileSystemPersistenceEngine 
[throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59]
 an IllegalStateException, and the Master crashes.

Incidentally, it appears that there is already a 
[TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266]
 in the code to handle this scenario.

I have a reproducing scenario for this bug on an old version of Spark (1.0.1), 
but upon inspecting the latest version of the code it appears that it is still 
possible to trigger it. Let me know if you would like reproducing steps for 
triggering it on the old version of Spark.

It should be possible to trigger this bug even if the underlying transport 
protocol is TCP, since TCP only guarantees in-order delivery within each 
direction of the connection, not across the two directions.
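For illustration, a hypothetical sketch of the TODO mentioned above (the names are made up and this is not the Master's real code): treat a duplicate RegisterApplication as idempotent, so the ApplicationInfo is only persisted the first time.

{code}
case class AppInfo(id: String)

class RegistrationHandler(persist: AppInfo => Unit, ack: String => Unit) {
  private val registered = scala.collection.mutable.Map.empty[String, AppInfo]

  def handleRegister(app: AppInfo): Unit = {
    if (registered.contains(app.id)) {
      ack(app.id)                 // duplicate RegisterApplication: just re-acknowledge
    } else {
      registered(app.id) = app
      persist(app)                // persist only on first registration
      ack(app.id)
    }
  }
}
{code}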


 Message delay causes Master crash upon registering application
 --

 Key: SPARK-9256
 URL: https://issues.apache.org/jira/browse/SPARK-9256
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Colin Scott
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and 
 I believe it is only possible to trigger in production when the AppClient and 
 Master are on different machines.
 As part of initialization, the AppClient 
 

[jira] [Created] (SPARK-9262) Treat Scala compiler warnings as errors

2015-07-22 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9262:
--

 Summary: Treat Scala compiler warnings as errors
 Key: SPARK-9262
 URL: https://issues.apache.org/jira/browse/SPARK-9262
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Reynold Xin
Assignee: Reynold Xin


I've seen a few cases in the past few weeks where the compiler throws 
warnings that are caused by legitimate bugs. This patch turns warnings into 
errors, except deprecation warnings.

Note that ideally we should be able to mark deprecation warnings as errors as 
well. However, due to the lack of ability to suppress individual warning 
messages in the Scala compiler, we cannot do that (since we do need to access 
deprecated APIs in Hadoop).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9262) Treat Scala compiler warnings as errors

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9262:
---

Assignee: Reynold Xin  (was: Apache Spark)

 Treat Scala compiler warnings as errors
 ---

 Key: SPARK-9262
 URL: https://issues.apache.org/jira/browse/SPARK-9262
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Reynold Xin
Assignee: Reynold Xin

 I've seen a few cases in the past few weeks where the compiler throws 
 warnings that are caused by legitimate bugs. This patch turns warnings into 
 errors, except deprecation warnings.
 Note that ideally we should be able to mark deprecation warnings as errors as 
 well. However, due to the lack of ability to suppress individual warning 
 messages in the Scala compiler, we cannot do that (since we do need to access 
 deprecated APIs in Hadoop).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9258) Remove all semi join physical operator

2015-07-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9258:
---
Summary: Remove all semi join physical operator  (was: Remove 
BroadcastLeftSemiJoinHash)

 Remove all semi join physical operator
 --

 Key: SPARK-9258
 URL: https://issues.apache.org/jira/browse/SPARK-9258
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin

 We have more join operators than we have resources to optimize. In this 
 case, BroadcastLeftSemiJoinHash isn't really necessary. We can still use an 
 equi-join operator to do the join, and just not include any values from the 
 other side of the join.
 We waste a little bit of space by building a hash map rather than a hash 
 set, but at the end of the day, unless we are going to spend a lot of time 
 optimizing the hash set, our Tungsten hash map will be a lot more efficient 
 than the hash set anyway ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6970) Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable

2015-07-22 Thread John Muller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Muller closed SPARK-6970.
--

Undocumented parts of DataFrames were deprecated

 Document what the options: Map[String, String] does on DataFrame.save and 
 DataFrame.saveAsTable
 ---

 Key: SPARK-6970
 URL: https://issues.apache.org/jira/browse/SPARK-6970
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.0
Reporter: John Muller
Priority: Trivial
  Labels: DataFrame
 Fix For: 1.4.0

   Original Estimate: 2h
  Remaining Estimate: 2h

 The save options on DataFrames are not easily discerned:
 [ResolvedDataSource.apply|https://github.com/apache/spark/blob/b75b3070740803480d235b0c9a86673721344f30/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala#L222]
   is where the pattern match occurs:
 {code:title=ddl.scala|borderStyle=solid}
 case dataSource: SchemaRelationProvider =>
   dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options), schema)
 {code}
 Implementing classes are currently: TableScanSuite, JSONRelation, and 
 newParquet



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6970) Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable

2015-07-22 Thread John Muller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Muller resolved SPARK-6970.

   Resolution: Won't Fix
Fix Version/s: 1.4.0

Resolving as won't fix.  The new DataFrames API deprecated save and 
saveAsTable.  The new write() method also lacks docs; will open a new ticket 
when I have at least a partial patch for that.

 Document what the options: Map[String, String] does on DataFrame.save and 
 DataFrame.saveAsTable
 ---

 Key: SPARK-6970
 URL: https://issues.apache.org/jira/browse/SPARK-6970
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.0
Reporter: John Muller
Priority: Trivial
  Labels: DataFrame
 Fix For: 1.4.0

   Original Estimate: 2h
  Remaining Estimate: 2h

 The save options on DataFrames are not easily discerned:
 [ResolvedDataSource.apply|https://github.com/apache/spark/blob/b75b3070740803480d235b0c9a86673721344f30/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala#L222]
   is where the pattern match occurs:
 {code:title=ddl.scala|borderStyle=solid}
 case dataSource: SchemaRelationProvider =>
   dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options), schema)
 {code}
 Implementing classes are currently: TableScanSuite, JSONRelation, and 
 newParquet



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9256) Message delay causes Master crash upon registering application

2015-07-22 Thread Colin Scott (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Scott updated SPARK-9256:
---
Description: 
This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and I 
believe it is only possible to trigger in production when the AppClient and 
Master are on different machines.

As part of initialization, the AppClient 
[registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124]
 with the Master by repeatedly sending a RegisterApplication message until it 
receives a RegisteredApplication response.

If the RegisteredApplication response is delayed by at least 
REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the 
RegisterApplication RPC), it is possible for the Master to receive *two* 
RegisterApplication messages for the same AppClient.

Upon receiving the second RegisterApplication message, the master 
[attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274]
 to persist the ApplicationInfo to disk. Since the file already exists, 
FileSystemPersistenceEngine 
[throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59]
 an IllegalStateException, and the Master crashes.

Incidentally, it appears that there is already a 
[TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266]
 in the code to handle this scenario.

I have a reproducing scenario for this bug on an old version of Spark (1.0.1), 
but upon inspecting the latest version of the code it appears that it is still 
possible to trigger it. Let me know if you would like reproducing steps for 
triggering it on the old version of Spark.

It should be possible to trigger this bug even if the underlying transport 
protocol is TCP, since TCP only guarantees in-order delivery within each 
direction of the connection, not across the two directions.

  was:
This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and I 
believe it is only possible to trigger in production when the AppClient and 
Master are on different machines.

As part of initialization, the AppClient 
[registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124]
 with the Master by repeatedly sending a RegisterApplication message until it 
receives a RegisteredApplication response.

If the RegisteredApplication response is delayed by at least 
REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the 
RegisterApplication RPC), it is possible for the Master to receive *two* 
RegisterApplication messages for the same AppClient.

Upon receiving the second RegisterApplication message, the master 
[attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274]
 to persist the ApplicationInfo to disk. Since the file already exists, 
FileSystemPersistenceEngine 
[throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59]
 an IllegalStateException, and the Master crashes.

Incidentally, it appears that there is already a 
[TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266]
 in the code to handle this scenario.

I have a reproducing scenario for this bug on an old version of Spark (1.0.1), 
but upon inspecting the latest version of the code it appears that it is still 
possible to trigger it. Let me know if you would like reproducing steps for 
triggering it on the old version of Spark.


 Message delay causes Master crash upon registering application
 --

 Key: SPARK-9256
 URL: https://issues.apache.org/jira/browse/SPARK-9256
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Colin Scott
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 This bug occurs when `spark.deploy.recoveryMode` is set to FILESYSTEM, and 
 I believe it is only possible to trigger in production when the AppClient and 
 Master are on different machines.
 As part of initialization, the AppClient 
 [registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124]
  with the Master by repeatedly sending a 

[jira] [Commented] (SPARK-6802) User Defined Aggregate Function Refactoring

2015-07-22 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14637555#comment-14637555
 ] 

Yin Huai commented on SPARK-6802:
-

We have added Scala/Java UDAF support through SPARK-3947. Is this JIRA for 
Python UDAF?

 User Defined Aggregate Function Refactoring
 ---

 Key: SPARK-6802
 URL: https://issues.apache.org/jira/browse/SPARK-6802
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
 Environment: We use Spark Dataframe, SQL along with json, sql and 
 pandas quite a bit
Reporter: cynepia

 While trying to use custom aggregates in Spark (something which is common in 
 pandas), we realized that custom aggregate functions aren't well supported 
 across various features/functions in Spark beyond what is supported by Hive. 
 There are further discussions on the topic vis-à-vis SPARK-3947, 
 which points to similar improvement tickets opened earlier for refactoring 
 the UDAF area.
 While we refactor the interface for aggregates, it would make sense to keep 
 in consideration the recently added DataFrame, GroupedData, and possibly 
 also sql.dataframe.Column, which looks different from pandas.Series and 
 doesn't currently support any aggregations.
 We would like to get feedback from the folks who are actively looking at this.
 We would be happy to participate and contribute if there are any discussions 
 on the same topic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4233) Simplify the Aggregation Function implementation

2015-07-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-4233.

   Resolution: Fixed
 Assignee: Cheng Hao
Fix Version/s: 1.5.0

 Simplify the Aggregation Function implementation
 

 Key: SPARK-4233
 URL: https://issues.apache.org/jira/browse/SPARK-4233
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
 Fix For: 1.5.0


 Currently, the UDAF implementation is quite complicated, and we have to 
 provide both distinct and non-distinct versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4367) Partial aggregation support the DISTINCT aggregation

2015-07-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-4367.

   Resolution: Fixed
 Assignee: Yin Huai
Fix Version/s: 1.5.0

 Partial aggregation support the DISTINCT aggregation
 

 Key: SPARK-4367
 URL: https://issues.apache.org/jira/browse/SPARK-4367
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Assignee: Yin Huai
 Fix For: 1.5.0


 Most aggregate functions (e.g. average) over distinct values require 
 all of the records in the same group to be shuffled to a single node. 
 However, as part of the optimization, those records can be partially 
 aggregated before shuffling, which probably reduces the overhead of shuffling 
 significantly. 
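For illustration, a plain-Scala sketch of the two-phase idea (not Spark's plan): a distinct aggregate such as COUNT(DISTINCT v) per group can be partially aggregated before the shuffle by de-duplicating (group, value) pairs locally, so far fewer records cross the network.

{code}
val partition = Seq(("g1", 1), ("g1", 1), ("g1", 2), ("g2", 3), ("g2", 3))

// Phase 1 (map side): keep only the distinct (group, value) pairs seen in this partition.
val partial: Set[(String, Int)] = partition.toSet

// Phase 2 (after shuffling the much smaller partial result): count distinct values per group.
val countDistinct: Map[String, Int] =
  partial.groupBy(_._1).map { case (g, pairs) => g -> pairs.map(_._2).size }
// countDistinct == Map("g1" -> 2, "g2" -> 1)
{code}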



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9231) DistributedLDAModel method for top topics per document

2015-07-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9231:
-
Description: 
Helper method in DistributedLDAModel of this form:
{code}
/**
 * For each document, return the top k weighted topics for that document.
 * @return RDD of (doc ID, topic indices, topic weights)
 */
def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
{code}

I believe the above method signature will be Java-friendly.

  was:
Helper method in DistributedLDAModel of this form:
{code}
/** For each document, return the top k weighted topics for that document. */
def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
{code}

I believe the above method signature will be Java-friendly.


 DistributedLDAModel method for top topics per document
 --

 Key: SPARK-9231
 URL: https://issues.apache.org/jira/browse/SPARK-9231
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor
   Original Estimate: 48h
  Remaining Estimate: 48h

 Helper method in DistributedLDAModel of this form:
 {code}
 /**
  * For each document, return the top k weighted topics for that document.
  * @return RDD of (doc ID, topic indices, topic weights)
  */
 def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
 {code}
 I believe the above method signature will be Java-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9223) Support model save/load in Python's LDA

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636386#comment-14636386
 ] 

Apache Spark commented on SPARK-9223:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/7587

 Support model save/load in Python's LDA
 ---

 Key: SPARK-9223
 URL: https://issues.apache.org/jira/browse/SPARK-9223
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8856) Better instrumentation and visualization for physical plan

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636405#comment-14636405
 ] 

Apache Spark commented on SPARK-8856:
-

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7590

 Better instrumentation and visualization for physical plan
 --

 Key: SPARK-8856
 URL: https://issues.apache.org/jira/browse/SPARK-8856
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Reynold Xin
Assignee: Shixiong Zhu

 This is an umbrella ticket to improve physical plan instrumentation and 
 visualization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8856) Better instrumentation and visualization for physical plan

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8856:
---

Assignee: Apache Spark  (was: Shixiong Zhu)

 Better instrumentation and visualization for physical plan
 --

 Key: SPARK-8856
 URL: https://issues.apache.org/jira/browse/SPARK-8856
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 This is an umbrella ticket to improve physical plan instrumentation and 
 visualization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9213) Improve regular expression performance (via joni)

2015-07-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-9213:
--
Target Version/s: 1.6.0  (was: )

 Improve regular expression performance (via joni)
 -

 Key: SPARK-9213
 URL: https://issues.apache.org/jira/browse/SPARK-9213
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Reynold Xin

 I'm creating an umbrella ticket to improve regular expression performance for 
 string expressions. Right now our use of regular expressions is inefficient 
 for two reasons:
 1. Java regex in general is slow.
 2. We have to convert everything from UTF8 encoded bytes into Java String, 
 and then run regex on it, and then convert it back.
 There are libraries in Java that provide regex support directly on UTF8 
 encoded bytes. One prominent example is joni, used in JRuby.
 Note: all regex functions are in 
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
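To illustrate the conversion overhead described above, a small sketch using only java.util.regex (the byte-level joni API is deliberately not shown here): every match first forces a UTF8-bytes-to-String decode, and any extracted result has to be encoded back to bytes.

{code}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.regex.Pattern

val pattern = Pattern.compile("[0-9]+")
val utf8Bytes: Array[Byte] = "order 12345".getBytes(UTF_8)

// Step 1: decode (allocating a String and its char array) just to run the regex.
val decoded = new String(utf8Bytes, UTF_8)
val matcher = pattern.matcher(decoded)
val digits: Option[String] = if (matcher.find()) Some(matcher.group()) else None

// Step 2: encode the result back to UTF-8 bytes for the binary row format.
val resultBytes: Option[Array[Byte]] = digits.map(_.getBytes(UTF_8))
{code}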



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9250) ./dev/change-scala-version.sh should offer guidance what versions are accepted, i.e. 2.10 or 2.11

2015-07-22 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-9250:
--

 Summary: ./dev/change-scala-version.sh should offer guidance what 
versions are accepted, i.e. 2.10 or 2.11
 Key: SPARK-9250
 URL: https://issues.apache.org/jira/browse/SPARK-9250
 Project: Spark
  Issue Type: Improvement
  Components: Build
 Environment: commit c03299a18b4e076cabb4b7833a1e7632c5c0dabe
Reporter: Jacek Laskowski
Priority: Minor


With commit f5b6dc5 there's a new way of building Spark with Scala 2.10 
or 2.11. The help message is not very helpful and could be improved to state 
the accepted versions and their format.

{code}
➜  spark git:(master) ./dev/change-scala-version.sh
Usage: change-scala-version.sh <version>
{code}

Looking inside the script I can see the valid versions - that list could be part of the help.

{code}
VALID_VERSIONS=( 2.10 2.11 )
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7075) Project Tungsten: Improving Physical Execution

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7075:
---

Assignee: Reynold Xin  (was: Apache Spark)

 Project Tungsten: Improving Physical Execution
 --

 Key: SPARK-7075
 URL: https://issues.apache.org/jira/browse/SPARK-7075
 Project: Spark
  Issue Type: Epic
  Components: Block Manager, Shuffle, Spark Core, SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 Based on our observation, the majority of Spark workloads are not bottlenecked 
 by I/O or network, but rather by CPU and memory.
 improve the efficiency of memory and CPU for Spark applications, to push 
 performance closer to the limits of the underlying hardware.
 *Memory Management and Binary Processing*
 - Avoiding non-transient Java objects (store them in binary format), which 
 reduces GC overhead.
 - Minimizing memory usage through denser in-memory data format, which means 
 we spill less.
 - Better memory accounting (size of bytes) rather than relying on heuristics
 - For operators that understand data types (in the case of DataFrames and 
 SQL), work directly against binary format in memory, i.e. have no 
 serialization/deserialization
 *Cache-aware Computation*
 - Faster sorting and hashing for aggregations, joins, and shuffle
 *Code Generation*
 - Faster expression evaluation and DataFrame/SQL operators
 - Faster serializer
 Several parts of project Tungsten leverage the DataFrame model, which gives 
 us more semantics about the application. We will also retrofit the 
 improvements onto Spark’s RDD API whenever possible.
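For illustration, a conceptual sketch of the "binary format instead of Java objects" point (this is not Tungsten code): packing fixed-width records into a single byte buffer avoids per-object GC overhead and makes memory accounting exact, in bytes rather than heuristics.

{code}
import java.nio.ByteBuffer

val recordSize = 12                                   // Int key (4 bytes) + Double value (8 bytes)
val buffer = ByteBuffer.allocate(recordSize * 1000)

def write(key: Int, value: Double): Unit = { buffer.putInt(key); buffer.putDouble(value) }
def read(i: Int): (Int, Double) =
  (buffer.getInt(i * recordSize), buffer.getDouble(i * recordSize + 4))

write(42, 3.14)
write(7, 2.71)
assert(read(1) == (7, 2.71))
val usedBytes = buffer.position()                     // exact size in bytes, no estimation
{code}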



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7075) Project Tungsten: Improving Physical Execution

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636496#comment-14636496
 ] 

Apache Spark commented on SPARK-7075:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/7592

 Project Tungsten: Improving Physical Execution
 --

 Key: SPARK-7075
 URL: https://issues.apache.org/jira/browse/SPARK-7075
 Project: Spark
  Issue Type: Epic
  Components: Block Manager, Shuffle, Spark Core, SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 Based on our observation, the majority of Spark workloads are not bottlenecked 
 by I/O or network, but rather by CPU and memory.
 improve the efficiency of memory and CPU for Spark applications, to push 
 performance closer to the limits of the underlying hardware.
 *Memory Management and Binary Processing*
 - Avoiding non-transient Java objects (store them in binary format), which 
 reduces GC overhead.
 - Minimizing memory usage through denser in-memory data format, which means 
 we spill less.
 - Better memory accounting (size of bytes) rather than relying on heuristics
 - For operators that understand data types (in the case of DataFrames and 
 SQL), work directly against binary format in memory, i.e. have no 
 serialization/deserialization
 *Cache-aware Computation*
 - Faster sorting and hashing for aggregations, joins, and shuffle
 *Code Generation*
 - Faster expression evaluation and DataFrame/SQL operators
 - Faster serializer
 Several parts of project Tungsten leverage the DataFrame model, which gives 
 us more semantics about the application. We will also retrofit the 
 improvements onto Spark’s RDD API whenever possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9131) UDFs change data values

2015-07-22 Thread Luis Guerra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636508#comment-14636508
 ] 

Luis Guerra commented on SPARK-9131:


By the way, the UDF code should be changed to return StringType(). I changed the 
data type to string, since the data type does not matter when using the UDFs.

 UDFs change data values
 ---

 Key: SPARK-9131
 URL: https://issues.apache.org/jira/browse/SPARK-9131
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0, 1.4.1
 Environment: Pyspark 1.4 and 1.4.1
Reporter: Luis Guerra
Priority: Critical
 Attachments: testjson_jira9131.z01, testjson_jira9131.z02, 
 testjson_jira9131.z03, testjson_jira9131.z04, testjson_jira9131.z05, 
 testjson_jira9131.z06, testjson_jira9131.zip


 I am having some trouble when using a custom UDF in DataFrames with PySpark 
 1.4.
 I have rewritten the UDF to simplify the problem and it gets even weirder. 
 The UDFs I am using do absolutely nothing; they just receive some value and 
 output the same value with the same format.
 I show you my code below:
 {code}
 from pyspark.sql.functions import UserDefinedFunction
 from pyspark.sql.types import DateType

 c = a.join(b, a['ID'] == b['ID_new'], 'inner')
 c.filter(c['ID'] == '62698917').show()

 # Identity UDFs: each one simply returns its input value unchanged.
 udf_A = UserDefinedFunction(lambda x: x, DateType())
 udf_B = UserDefinedFunction(lambda x: x, DateType())
 udf_C = UserDefinedFunction(lambda x: x, DateType())

 d = c.select(c['ID'], c['t1'].alias('ta'),
              udf_A(c['t2']).alias('tb'), udf_B(c['t1']).alias('tc'),
              udf_C(c['t2']).alias('td'))
 d.filter(d['ID'] == '62698917').show()
 {code}
 Here are the outputs:
 {code}
 +--------+--------+----------+----------+
 |      ID|  ID_new|        t1|        t2|
 +--------+--------+----------+----------+
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 +--------+--------+----------+----------+

 +--------+----------+----------+----------+----------+
 |      ID|        ta|        tb|        tc|        td|
 +--------+----------+----------+----------+----------+
 |62698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
 |62698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
 |62698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
 |62698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
 |62698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
 |62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
 |62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
 |62698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
 +--------+----------+----------+----------+----------+
 {code}
 The problem here is that the values in columns 'tb', 'tc' and 'td' of dataframe 
 'd' are completely different from the values of 't1' and 't2' in dataframe 'c', 
 even though my UDFs do nothing. It seems as if the values were somehow taken 
 from other records (or simply invented). Results differ between executions 
 (apparently at random).
 Thanks in advance



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9248) Closing curly-braces should always be on their own line

2015-07-22 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-9248:
--

 Summary: Closing curly-braces should always be on their own line
 Key: SPARK-9248
 URL: https://issues.apache.org/jira/browse/SPARK-9248
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa
Priority: Minor


Closing curly-braces should always be on their own line



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9251) do not order by expressions which still need evaluation

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9251:
---

Assignee: (was: Apache Spark)

 do not order by expressions which still need evaluation
 ---

 Key: SPARK-9251
 URL: https://issues.apache.org/jira/browse/SPARK-9251
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9083) If order by clause has non-deterministic expressions, we should add a project to materialize results of these expressions

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636546#comment-14636546
 ] 

Apache Spark commented on SPARK-9083:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/7593

 If order by clause has non-deterministic expressions, we should add a project 
 to materialize results of these expressions 
 --

 Key: SPARK-9083
 URL: https://issues.apache.org/jira/browse/SPARK-9083
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Wenchen Fan

 When an ORDER BY clause has a non-deterministic expression, we actually 
 evaluate it twice: once in the exchange operator, when we try to figure out 
 the range partitioner's boundaries, and once in the sort operator. We should 
 use a project to materialize the result first.
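As a rough plain-Scala analogy of the problem (this is not the Catalyst change itself): if a non-deterministic sort key is evaluated more than once, here inside every comparison, the ordering becomes inconsistent; evaluating it once per row up front ("projecting" it) and sorting by the stored value avoids that.

{code}
val rng = new scala.util.Random()
val rows = Seq("a", "b", "c", "d")

// Problematic: the random key is re-evaluated on each comparison, so the
// comparisons do not agree with each other.
val inconsistent = rows.sortWith((a, b) => rng.nextDouble() < rng.nextDouble())

// Better: materialize the non-deterministic value once per row, then order by it.
val projected = rows.map(r => (r, rng.nextDouble()))
val ordered   = projected.sortBy(_._2).map(_._1)
{code}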



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9251) do not order by expressions which still need evaluation

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636545#comment-14636545
 ] 

Apache Spark commented on SPARK-9251:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/7593

 do not order by expressions which still need evaluation
 ---

 Key: SPARK-9251
 URL: https://issues.apache.org/jira/browse/SPARK-9251
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9083) If order by clause has non-deterministic expressions, we should add a project to materialize results of these expressions

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9083:
---

Assignee: Wenchen Fan  (was: Apache Spark)

 If order by clause has non-deterministic expressions, we should add a project 
 to materialize results of these expressions 
 --

 Key: SPARK-9083
 URL: https://issues.apache.org/jira/browse/SPARK-9083
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Wenchen Fan

 When an ORDER BY clause has a non-deterministic expression, we actually 
 evaluate it twice: once in the exchange operator, when we try to figure out 
 the range partitioner's boundaries, and once in the sort operator. We should 
 use a project to materialize the result first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9083) If order by clause has non-deterministic expressions, we should add a project to materialize results of these expressions

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9083:
---

Assignee: Apache Spark  (was: Wenchen Fan)

 If order by clause has non-deterministic expressions, we should add a project 
 to materialize results of these expressions 
 --

 Key: SPARK-9083
 URL: https://issues.apache.org/jira/browse/SPARK-9083
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Apache Spark

 When an ORDER BY clause has a non-deterministic expression, we actually 
 evaluate it twice: once in the exchange operator, when we try to figure out 
 the range partitioner's boundaries, and once in the sort operator. We should 
 use a project to materialize the result first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9251) do not order by expressions which still need evaluation

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9251:
---

Assignee: Apache Spark

 do not order by expressions which still need evaluation
 ---

 Key: SPARK-9251
 URL: https://issues.apache.org/jira/browse/SPARK-9251
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9223) Support model save/load in Python's LDA

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9223:
---

Assignee: Apache Spark

 Support model save/load in Python's LDA
 ---

 Key: SPARK-9223
 URL: https://issues.apache.org/jira/browse/SPARK-9223
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Assignee: Apache Spark
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9223) Support model save/load in Python's LDA

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9223:
---

Assignee: (was: Apache Spark)

 Support model save/load in Python's LDA
 ---

 Key: SPARK-9223
 URL: https://issues.apache.org/jira/browse/SPARK-9223
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8187) date/time function: date_sub

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636407#comment-14636407
 ] 

Apache Spark commented on SPARK-8187:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7589

 date/time function: date_sub
 

 Key: SPARK-8187
 URL: https://issues.apache.org/jira/browse/SPARK-8187
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Adrian Wang

 date_sub(timestamp startdate, int days): timestamp
 date_sub(timestamp startdate, interval i): timestamp
 date_sub(date date, int days): date
 date_sub(date date, interval i): date



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8856) Better instrumentation and visualization for physical plan

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8856:
---

Assignee: Shixiong Zhu  (was: Apache Spark)

 Better instrumentation and visualization for physical plan
 --

 Key: SPARK-8856
 URL: https://issues.apache.org/jira/browse/SPARK-8856
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Reynold Xin
Assignee: Shixiong Zhu

 This is an umbrella ticket to improve physical plan instrumentation and 
 visualization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8186) date/time function: date_add

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636406#comment-14636406
 ] 

Apache Spark commented on SPARK-8186:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7589

 date/time function: date_add
 

 Key: SPARK-8186
 URL: https://issues.apache.org/jira/browse/SPARK-8186
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Adrian Wang

 date_add(timestamp startdate, int days): timestamp
 date_add(timestamp startdate, interval i): timestamp
 date_add(date date, int days): date
 date_add(date date, interval i): date
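 A minimal usage sketch (hedged: the object name is illustrative, only the day-count
 overload is exercised, and the expected value assumes the usual add-days semantics):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SQLContext

 object DateAddSketch {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setAppName("date_add-sketch").setMaster("local[*]"))
     val sqlContext = new SQLContext(sc)
     // Expected: 2015-07-29, i.e. seven days after 2015-07-22.
     sqlContext.sql("SELECT date_add(CAST('2015-07-22' AS DATE), 7)").show()
     sc.stop()
   }
 }
 {code}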



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8856) Better instrumentation and visualization for physical plan

2015-07-22 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636413#comment-14636413
 ] 

Reynold Xin commented on SPARK-8856:


It's fine to mark it as in progress since it is actually in progress.


 Better instrumentation and visualization for physical plan
 --

 Key: SPARK-8856
 URL: https://issues.apache.org/jira/browse/SPARK-8856
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Reynold Xin
Assignee: Shixiong Zhu

 This is an umbrella ticket to improve physical plan instrumentation and 
 visualization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8856) Better instrumentation and visualization for physical plan

2015-07-22 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636412#comment-14636412
 ] 

Feynman Liang commented on SPARK-8856:
--

Oops, I tagged the wrong JIRA in the PR. Can you please mark it as Open again?

 Better instrumentation and visualization for physical plan
 --

 Key: SPARK-8856
 URL: https://issues.apache.org/jira/browse/SPARK-8856
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Reynold Xin
Assignee: Shixiong Zhu

 This is an umbrella ticket to improve physical plan instrumentation and 
 visualization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9131) UDFs change data values

2015-07-22 Thread Luis Guerra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636490#comment-14636490
 ] 

Luis Guerra commented on SPARK-9131:


Actually, compression reduces the file to 27 MB, which is closer to the limit but 
still over it

 UDFs change data values
 ---

 Key: SPARK-9131
 URL: https://issues.apache.org/jira/browse/SPARK-9131
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0, 1.4.1
 Environment: Pyspark 1.4 and 1.4.1
Reporter: Luis Guerra
Priority: Critical

 I am having some trouble when using a custom UDF in DataFrames with PySpark 
 1.4.
 I have rewritten the UDF to simplify the problem, and it gets even weirder. 
 The UDFs I am using do absolutely nothing: they just receive a value and 
 output the same value in the same format.
 My code is shown below:
 {code}
 c= a.join(b, a['ID'] == b['ID_new'], 'inner')
 c.filter(c['ID'] == '62698917').show()
 udf_A = UserDefinedFunction(lambda x: x, DateType())
 udf_B = UserDefinedFunction(lambda x: x, DateType())
 udf_C = UserDefinedFunction(lambda x: x, DateType())
 d = c.select(c['ID'], c['t1'].alias('ta'), 
 udf_A(vinc_muestra['t2']).alias('tb'), udf_B(vinc_muestra['t1']).alias('tc'), 
 udf_C(vinc_muestra['t2']).alias('td'))
 d.filter(d['ID'] == '62698917').show()
 {code}
 I am showing here the results from the outputs:
 {code}
 +--------+--------+----------+----------+
 |      ID|  ID_new|        t1|        t2|
 +--------+--------+----------+----------+
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 +--------+--------+----------+----------+
 +--------+----------+----------+----------+----------+
 |      ID|        ta|        tb|        tc|        td|
 +--------+----------+----------+----------+----------+
 |62698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
 |62698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
 |62698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
 |62698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
 |62698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
 |62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
 |62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
 |62698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
 +--------+----------+----------+----------+----------+
 {code}
 The problem here is that the values in columns 'tb', 'tc' and 'td' of dataframe 
 'd' are completely different from the values 't1' and 't2' in dataframe 'c', 
 even though my UDFs do nothing. It seems as if the values were somehow taken 
 from other records (or simply invented). Results differ between executions 
 (apparently at random).
 Thanks in advance



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9131) UDFs change data values

2015-07-22 Thread Luis Guerra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636489#comment-14636489
 ] 

Luis Guerra commented on SPARK-9131:


Agreed, I have prepared a dataset.json and it is ready to be uploaded. However, it 
is too large (more than 600 MB). How can I upload it for you? 

 UDFs change data values
 ---

 Key: SPARK-9131
 URL: https://issues.apache.org/jira/browse/SPARK-9131
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0, 1.4.1
 Environment: Pyspark 1.4 and 1.4.1
Reporter: Luis Guerra
Priority: Critical

 I am having some trouble when using a custom UDF in DataFrames with PySpark 
 1.4.
 I have rewritten the UDF to simplify the problem, and it gets even weirder. 
 The UDFs I am using do absolutely nothing: they just receive a value and 
 output the same value in the same format.
 My code is shown below:
 {code}
 c= a.join(b, a['ID'] == b['ID_new'], 'inner')
 c.filter(c['ID'] == '62698917').show()
 udf_A = UserDefinedFunction(lambda x: x, DateType())
 udf_B = UserDefinedFunction(lambda x: x, DateType())
 udf_C = UserDefinedFunction(lambda x: x, DateType())
 d = c.select(c['ID'], c['t1'].alias('ta'), 
 udf_A(vinc_muestra['t2']).alias('tb'), udf_B(vinc_muestra['t1']).alias('tc'), 
 udf_C(vinc_muestra['t2']).alias('td'))
 d.filter(d['ID'] == '62698917').show()
 {code}
 I am showing here the results from the outputs:
 {code}
 +--------+--------+----------+----------+
 |      ID|  ID_new|        t1|        t2|
 +--------+--------+----------+----------+
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 +--------+--------+----------+----------+
 +--------+----------+----------+----------+----------+
 |      ID|        ta|        tb|        tc|        td|
 +--------+----------+----------+----------+----------+
 |62698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
 |62698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
 |62698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
 |62698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
 |62698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
 |62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
 |62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
 |62698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
 +--------+----------+----------+----------+----------+
 {code}
 The problem here is that the values in columns 'tb', 'tc' and 'td' of dataframe 
 'd' are completely different from the values 't1' and 't2' in dataframe 'c', 
 even though my UDFs do nothing. It seems as if the values were somehow taken 
 from other records (or simply invented). Results differ between executions 
 (apparently at random).
 Thanks in advance



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9131) UDFs change data values

2015-07-22 Thread Luis Guerra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Guerra updated SPARK-9131:
---
Attachment: testjson_jira9131.z02
testjson_jira9131.z03
testjson_jira9131.z05
testjson_jira9131.z04
testjson_jira9131.zip
testjson_jira9131.z06
testjson_jira9131.z01

I hope they work fine. I have split them into several files to stay within the 
size limit

 UDFs change data values
 ---

 Key: SPARK-9131
 URL: https://issues.apache.org/jira/browse/SPARK-9131
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0, 1.4.1
 Environment: Pyspark 1.4 and 1.4.1
Reporter: Luis Guerra
Priority: Critical
 Attachments: testjson_jira9131.z01, testjson_jira9131.z02, 
 testjson_jira9131.z03, testjson_jira9131.z04, testjson_jira9131.z05, 
 testjson_jira9131.z06, testjson_jira9131.zip


 I am having some trouble when using a custom UDF in DataFrames with PySpark 
 1.4.
 I have rewritten the UDF to simplify the problem, and it gets even weirder. 
 The UDFs I am using do absolutely nothing: they just receive a value and 
 output the same value in the same format.
 My code is shown below:
 {code}
 c= a.join(b, a['ID'] == b['ID_new'], 'inner')
 c.filter(c['ID'] == '62698917').show()
 udf_A = UserDefinedFunction(lambda x: x, DateType())
 udf_B = UserDefinedFunction(lambda x: x, DateType())
 udf_C = UserDefinedFunction(lambda x: x, DateType())
 d = c.select(c['ID'], c['t1'].alias('ta'), 
 udf_A(vinc_muestra['t2']).alias('tb'), udf_B(vinc_muestra['t1']).alias('tc'), 
 udf_C(vinc_muestra['t2']).alias('td'))
 d.filter(d['ID'] == '62698917').show()
 {code}
 I am showing here the results from the outputs:
 {code}
 +--------+--------+----------+----------+
 |      ID|  ID_new|        t1|        t2|
 +--------+--------+----------+----------+
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 +--------+--------+----------+----------+
 +--------+----------+----------+----------+----------+
 |      ID|        ta|        tb|        tc|        td|
 +--------+----------+----------+----------+----------+
 |62698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
 |62698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
 |62698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
 |62698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
 |62698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
 |62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
 |62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
 |62698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
 +--------+----------+----------+----------+----------+
 {code}
 The problem here is that the values in columns 'tb', 'tc' and 'td' of dataframe 
 'd' are completely different from the values 't1' and 't2' in dataframe 'c', 
 even though my UDFs do nothing. It seems as if the values were somehow taken 
 from other records (or simply invented). Results differ between executions 
 (apparently at random).
 Thanks in advance



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9245) DistributedLDAModel predict top topic per doc-term instance

2015-07-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-9245:


 Summary: DistributedLDAModel predict top topic per doc-term 
instance
 Key: SPARK-9245
 URL: https://issues.apache.org/jira/browse/SPARK-9245
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley


For each (document, term) pair, return the top topic. Note that instances of (doc, 
term) pairs within a document (a.k.a. tokens) are exchangeable, so we should 
provide an estimate per document-term rather than per token.

Synopsis for DistributedLDAModel:
{code}
/** @return RDD of (doc ID, vector of top topic index for each term) */
def topTopicAssignments: RDD[(Long, Vector)]
{code}
Note that using Vector will let us have a sparse encoding which is 
Java-friendly.
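A small local sketch of what the proposed return shape, RDD[(Long, Vector)], could look 
like; it only mocks the output (doc IDs and toy per-term topic indices) and does not 
call any not-yet-implemented API:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object TopTopicShapeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))
    // doc 0: terms 3 and 7 get topics 2 and 5; doc 1: term 1 gets topic 4 (toy values).
    val topTopicAssignments: RDD[(Long, Vector)] = sc.parallelize(Seq(
      (0L, Vectors.sparse(10, Seq((3, 2.0), (7, 5.0)))),
      (1L, Vectors.sparse(10, Seq((1, 4.0))))
    ))
    topTopicAssignments.collect().foreach { case (docId, v) => println(s"$docId -> $v") }
    sc.stop()
  }
}
{code}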



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8861) Add basic instrumentation to each SparkPlan operator

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636411#comment-14636411
 ] 

Apache Spark commented on SPARK-8861:
-

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7590

 Add basic instrumentation to each SparkPlan operator
 

 Key: SPARK-8861
 URL: https://issues.apache.org/jira/browse/SPARK-8861
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 The basic metric can be the number of tuples flowing through. We can 
 add more metrics later.
 In order for this to work, we can add a new accumulators method to 
 SparkPlan that defines the list of accumulators, e.g.
 {code}
   def accumulators: Map[String, Accumulator]
 {code}
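 A hedged, stand-alone sketch of the idea (this is not the SparkPlan API; the operator is
 faked with an RDD map, and the accumulator name is illustrative):
 {code}
 import org.apache.spark.{Accumulator, SparkConf, SparkContext}

 object OperatorMetricSketch {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setAppName("metric-sketch").setMaster("local[*]"))
     // The "operator" declares its metrics up front, keyed by name...
     val numOutputRows: Accumulator[Int] = sc.accumulator(0, "numOutputRows")
     val accumulators: Map[String, Accumulator[Int]] = Map("numOutputRows" -> numOutputRows)
     // ...and bumps them as tuples flow through.
     sc.parallelize(1 to 1000).map { x => numOutputRows += 1; x * 2 }.count()
     println(s"numOutputRows = ${accumulators("numOutputRows").value}")
     sc.stop()
   }
 }
 {code}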



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8861) Add basic instrumentation to each SparkPlan operator

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8861:
---

Assignee: (was: Apache Spark)

 Add basic instrumentation to each SparkPlan operator
 

 Key: SPARK-8861
 URL: https://issues.apache.org/jira/browse/SPARK-8861
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 The basic metric can be the number of tuples flowing through. We can 
 add more metrics later.
 In order for this to work, we can add a new accumulators method to 
 SparkPlan that defines the list of accumulators, e.g.
 {code}
   def accumulators: Map[String, Accumulator]
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8861) Add basic instrumentation to each SparkPlan operator

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8861:
---

Assignee: Apache Spark

 Add basic instrumentation to each SparkPlan operator
 

 Key: SPARK-8861
 URL: https://issues.apache.org/jira/browse/SPARK-8861
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 The basic metric can be the number of tuples flowing through. We can 
 add more metrics later.
 In order for this to work, we can add a new accumulators method to 
 SparkPlan that defines the list of accumulators, e.g.
 {code}
   def accumulators: Map[String, Accumulator]
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9246) DistributedLDAModel predict top docs per topic

2015-07-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-9246:


 Summary: DistributedLDAModel predict top docs per topic
 Key: SPARK-9246
 URL: https://issues.apache.org/jira/browse/SPARK-9246
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley


For each topic, return top documents based on topicDistributions.

Synopsis:
{code}
/**
 * @param maxDocuments  Max docs to return for each topic
 * @return Array over topics of (sorted top docs, corresponding doc-topic 
weights)
 */
def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], Array[Double])]
{code}

Note: We will need to make sure that the above return value format is 
Java-friendly.
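A hedged illustration of the ranking itself, using plain Scala collections in place of 
the model's RDD-based topicDistributions (all values are toy data):
{code}
object TopDocsPerTopicSketch {
  def main(args: Array[String]): Unit = {
    val k = 3                 // number of topics (assumed for the example)
    val maxDocuments = 2      // top docs to keep per topic
    // (docId, topic distribution) pairs standing in for topicDistributions.
    val topicDistributions: Seq[(Long, Array[Double])] = Seq(
      (0L, Array(0.7, 0.2, 0.1)),
      (1L, Array(0.1, 0.8, 0.1)),
      (2L, Array(0.3, 0.3, 0.4)))
    // For each topic, rank documents by their weight on that topic and keep the top few.
    val topDocs: Array[(Array[Long], Array[Double])] = Array.tabulate(k) { topic =>
      val ranked = topicDistributions
        .map { case (docId, dist) => (docId, dist(topic)) }
        .sortBy(-_._2)
        .take(maxDocuments)
      (ranked.map(_._1).toArray, ranked.map(_._2).toArray)
    }
    topDocs.zipWithIndex.foreach { case ((ids, weights), t) =>
      println(s"topic $t: docs=${ids.mkString(",")} weights=${weights.mkString(",")}")
    }
  }
}
{code}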



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-7590) Test Issue to Debug JIRA Problem

2015-07-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian deleted SPARK-7590:
--


 Test Issue to Debug JIRA Problem
 

 Key: SPARK-7590
 URL: https://issues.apache.org/jira/browse/SPARK-7590
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8579) Support arbitrary object in UnsafeRow

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636495#comment-14636495
 ] 

Apache Spark commented on SPARK-8579:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7591

 Support arbitrary object in UnsafeRow
 -

 Key: SPARK-8579
 URL: https://issues.apache.org/jira/browse/SPARK-8579
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.5.0


 It's common to run count(distinct xxx) in SQL; the data type will be a UDT of 
 OpenHashSet, and it would be good if we could use UnsafeRow to reduce memory 
 usage during aggregation.
 The same applies to DecimalType, which could be used inside the grouping key.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7075) Project Tungsten: Improving Physical Execution

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7075:
---

Assignee: Apache Spark  (was: Reynold Xin)

 Project Tungsten: Improving Physical Execution
 --

 Key: SPARK-7075
 URL: https://issues.apache.org/jira/browse/SPARK-7075
 Project: Spark
  Issue Type: Epic
  Components: Block Manager, Shuffle, Spark Core, SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 Based on our observation, the majority of Spark workloads are not bottlenecked 
 by I/O or network, but rather by CPU and memory. This project focuses on 3 areas to 
 improve the efficiency of memory and CPU for Spark applications, to push 
 performance closer to the limits of the underlying hardware.
 *Memory Management and Binary Processing*
 - Avoiding non-transient Java objects (store them in binary format), which 
 reduces GC overhead.
 - Minimizing memory usage through denser in-memory data format, which means 
 we spill less.
 - Better memory accounting (size of bytes) rather than relying on heuristics
 - For operators that understand data types (in the case of DataFrames and 
 SQL), work directly against binary format in memory, i.e. have no 
 serialization/deserialization
 *Cache-aware Computation*
 - Faster sorting and hashing for aggregations, joins, and shuffle
 *Code Generation*
 - Faster expression evaluation and DataFrame/SQL operators
 - Faster serializer
 Several parts of project Tungsten leverage the DataFrame model, which gives 
 us more semantics about the application. We will also retrofit the 
 improvements onto Spark’s RDD API whenever possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9247) Use BytesToBytesMap in unsafe broadcast join

2015-07-22 Thread Davies Liu (JIRA)
Davies Liu created SPARK-9247:
-

 Summary: Use BytesToBytesMap in unsafe broadcast join
 Key: SPARK-9247
 URL: https://issues.apache.org/jira/browse/SPARK-9247
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical


For better performance (both CPU and memory)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9249) local variable assigned but may not be used

2015-07-22 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-9249:
--

 Summary: local variable assigned but may not be used
 Key: SPARK-9249
 URL: https://issues.apache.org/jira/browse/SPARK-9249
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa
Priority: Minor


local variable assigned but may not be used



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9244) Increase some default memory limits

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9244:
---

Assignee: Apache Spark  (was: Matei Zaharia)

 Increase some default memory limits
 ---

 Key: SPARK-9244
 URL: https://issues.apache.org/jira/browse/SPARK-9244
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Assignee: Apache Spark

 There are a few memory limits that people hit often and that we could make 
 higher, especially now that memory sizes have grown.
 - spark.akka.frameSize: This defaults to 10 but is often hit for map output 
 statuses in large shuffles. AFAIK the memory is not fully allocated up-front, 
 so we can just make this larger and still not affect jobs that never send a 
 status that large.
 - spark.executor.memory: Defaults to 512m, which is really small. We can at 
 least increase it to 1g, though this is something users do need to set on 
 their own.
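 A hedged sketch of raising the two settings above for a single application (values are
 illustrative; in practice they usually go in spark-defaults.conf or on the spark-submit
 command line):
 {code}
 import org.apache.spark.SparkConf

 object BiggerLimitsSketch {
   def main(args: Array[String]): Unit = {
     val conf = new SparkConf()
       .set("spark.akka.frameSize", "100")   // MB; the default discussed above is 10
       .set("spark.executor.memory", "2g")   // the default discussed above is 512m
     conf.getAll.foreach { case (k, v) => println(s"$k=$v") }
   }
 }
 {code}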



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9244) Increase some default memory limits

2015-07-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9244:
---

Assignee: Matei Zaharia  (was: Apache Spark)

 Increase some default memory limits
 ---

 Key: SPARK-9244
 URL: https://issues.apache.org/jira/browse/SPARK-9244
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Assignee: Matei Zaharia

 There are a few memory limits that people hit often and that we could make 
 higher, especially now that memory sizes have grown.
 - spark.akka.frameSize: This defaults to 10 but is often hit for map output 
 statuses in large shuffles. AFAIK the memory is not fully allocated up-front, 
 so we can just make this larger and still not affect jobs that never send a 
 status that large.
 - spark.executor.memory: Defaults to 512m, which is really small. We can at 
 least increase it to 1g, though this is something users do need to set on 
 their own.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9232) Duplicate code in JSONRelation

2015-07-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9232.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Duplicate code in JSONRelation
 --

 Key: SPARK-9232
 URL: https://issues.apache.org/jira/browse/SPARK-9232
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Minor
 Fix For: 1.5.0


 The following block appears identically in two places:
 {code}
 var success: Boolean = false
 try {
   success = fs.delete(filesystemPath, true)
 } catch {
   case e: IOException =>
     throw new IOException(
       s"Unable to clear output directory ${filesystemPath.toString} prior"
         + s" to writing to JSON table:\n${e.toString}")
 }
 if (!success) {
   throw new IOException(
     s"Unable to clear output directory ${filesystemPath.toString} prior"
       + s" to writing to JSON table.")
 }
 {code}
 https://github.com/apache/spark/blob/e5d2c37c68ac00a57c2542e62d1c5b4ca267c89e/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L72
 https://github.com/apache/spark/blob/e5d2c37c68ac00a57c2542e62d1c5b4ca267c89e/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L131
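 One hedged way to remove the duplication is to pull the block into a single private
 helper; the object and method names below are illustrative only, not the actual patch:
 {code}
 import java.io.IOException
 import org.apache.hadoop.fs.{FileSystem, Path}

 object JsonRelationCleanupSketch {
   // Delete the output directory, translating failures into the same IOException messages.
   def clearOutputDirectory(fs: FileSystem, filesystemPath: Path): Unit = {
     val success = try {
       fs.delete(filesystemPath, true)
     } catch {
       case e: IOException =>
         throw new IOException(
           s"Unable to clear output directory ${filesystemPath.toString} prior"
             + s" to writing to JSON table:\n${e.toString}")
     }
     if (!success) {
       throw new IOException(
         s"Unable to clear output directory ${filesystemPath.toString} prior"
           + s" to writing to JSON table.")
     }
   }
 }
 {code}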



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9144) Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636367#comment-14636367
 ] 

Apache Spark commented on SPARK-9144:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7585

 Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
 ---

 Key: SPARK-9144
 URL: https://issues.apache.org/jira/browse/SPARK-9144
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen

 Spark has an option called {{spark.localExecution.enabled}}; according to the 
 docs:
 {quote}
 Enables Spark to run certain jobs, such as first() or take() on the driver, 
 without sending tasks to the cluster. This can make certain jobs execute very 
 quickly, but may require shipping a whole partition of data to the driver.
 {quote}
 This feature ends up adding quite a bit of complexity to DAGScheduler, 
 especially in the {{runLocallyWithinThread}} method, but as far as I know 
 nobody uses this feature (I searched the mailing list and haven't seen any 
 recent mentions of the configuration nor stacktraces including the runLocally 
 method).  As a step towards scheduler complexity reduction, I propose that we 
 remove this feature and all code related to it for Spark 1.5. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9244) Increase some default memory limits

2015-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636371#comment-14636371
 ] 

Apache Spark commented on SPARK-9244:
-

User 'mateiz' has created a pull request for this issue:
https://github.com/apache/spark/pull/7586

 Increase some default memory limits
 ---

 Key: SPARK-9244
 URL: https://issues.apache.org/jira/browse/SPARK-9244
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Assignee: Matei Zaharia

 There are a few memory limits that people hit often and that we could make 
 higher, especially now that memory sizes have grown.
 - spark.akka.frameSize: This defaults to 10 but is often hit for map output 
 statuses in large shuffles. AFAIK the memory is not fully allocated up-front, 
 so we can just make this larger and still not affect jobs that never send a 
 status that large.
 - spark.executor.memory: Defaults to 512m, which is really small. We can at 
 least increase it to 1g, though this is something users do need to set on 
 their own.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


