[jira] [Commented] (SPARK-4888) Spark EC2 doesn't mount local disks for i2.8xlarge instances

2015-08-18 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702604#comment-14702604
 ] 

Shivaram Venkataraman commented on SPARK-4888:
--

Yeah, we can definitely do a few things:

1. Remove target versions, as Spark releases shouldn't get blocked on these 
issues.
2. Ping / walk through the 'bugs' here to see if they are still bugs.
3. A number of other open issues are feature requests. For these, I'd check 
with [~nchammas], who is doing a survey on spark-ec2. We could open issues on 
the amplab/spark-ec2 repo for the most popular ones, or something like that.

> Spark EC2 doesn't mount local disks for i2.8xlarge instances
> 
>
> Key: SPARK-4888
> URL: https://issues.apache.org/jira/browse/SPARK-4888
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Priority: Critical
>
> When launching a cluster using {{spark-ec2}} with i2.8xlarge instances, the 
> local disks aren't mounted. The AWS console doesn't show the disks as 
> mounted, either.
> I think the issue is that EC2 won't auto-mount the SSDs. We have some 
> code that handles this for some of the {{r3*}} instance types, and I think 
> the right fix is to extend it to the {{i2}} instance types, too: 
> https://github.com/mesos/spark-ec2/blob/v4/setup-slave.sh#L37
> /cc [~adav], who originally found this issue.






[jira] [Commented] (SPARK-9357) Remove JoinedRow

2015-08-18 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702603#comment-14702603
 ] 

Cheng Hao commented on SPARK-9357:
--

JoinedRow is probably quite efficient for a case like:

{code}
CREATE TABLE a AS SELECT * FROM t1 JOIN t2 ON t1.key = t2.key AND t1.col1 < t2.col1;
{code}
If t1 and t2 are large tables with lots of columns and most records are 
filtered out by t1.col1 < t2.col1, eagerly copying the joined row data would 
be wasted work.

Maybe we can create an n-ary JoinedRow instead of the binary JoinedRow. Any 
thoughts?

> Remove JoinedRow
> 
>
> Key: SPARK-9357
> URL: https://issues.apache.org/jira/browse/SPARK-9357
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>
> JoinedRow was introduced to join two rows together in aggregation (join key 
> and value), joins (left, right), window functions, etc.
> It aims to reduce the amount of data copied, but it incurs branches whenever 
> the row is actually read. Given that all the fields will be read almost all 
> the time (otherwise they get pruned out by the optimizer), the branch 
> predictor cannot do anything about those branches.
> I think a better way is just to remove this thing and materialize the row 
> data directly.
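To make the trade-off concrete, here is a minimal, self-contained Scala sketch; these are not the actual Catalyst classes, just an illustration of the per-field branch described above and of the kind of n-ary variant suggested in the comment.

{code}
// Simplified sketch only; trait and class names are illustrative.
trait SketchRow {
  def numFields: Int
  def get(i: Int): Any
}

final case class ArrayRow(values: Array[Any]) extends SketchRow {
  def numFields: Int = values.length
  def get(i: Int): Any = values(i)
}

// Binary view: every get() pays a branch to decide which side owns field i.
final class BinaryJoinedRow(left: SketchRow, right: SketchRow) extends SketchRow {
  def numFields: Int = left.numFields + right.numFields
  def get(i: Int): Any =
    if (i < left.numFields) left.get(i) else right.get(i - left.numFields)
}

// N-ary variant: one flat view over many rows with precomputed field offsets,
// instead of nesting BinaryJoinedRow(BinaryJoinedRow(a, b), c).
final class MultiJoinedRow(rows: Array[SketchRow]) extends SketchRow {
  private val offsets: Array[Int] = rows.map(_.numFields).scanLeft(0)(_ + _)
  def numFields: Int = offsets.last
  def get(i: Int): Any = {
    var r = 0
    while (i >= offsets(r + 1)) r += 1 // linear scan; binary search would also work
    rows(r).get(i - offsets(r))
  }
}
{code}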






[jira] [Commented] (SPARK-9552) Add force control for killExecutors to avoid false killing for those busy executors

2015-08-18 Thread Jie Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702599#comment-14702599
 ] 

Jie Huang commented on SPARK-9552:
--

[~andrewor14] any comments on this bug?


> Add force control for killExecutors to avoid false killing for those busy 
> executors
> ---
>
> Key: SPARK-9552
> URL: https://issues.apache.org/jira/browse/SPARK-9552
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Jie Huang
>
> With dynamic allocation enabled, busy executors are sometimes killed by 
> mistake: executors that still have tasks assigned to them get killed for 
> having been idle long enough (say 60 seconds). The root cause is that the 
> task-launch listener event is asynchronous.
> For example, tasks are being assigned to some executors, but the listener 
> notification has not been sent out yet. Meanwhile, the dynamic allocation 
> executor-idle timeout (e.g., 60 seconds) expires and triggers a killExecutor 
> event at the same time; the timer expires before the listener event arrives. 
> The task then ends up running on an executor that is being (or has been) 
> killed, which ultimately leads to a task failure.
> Here is the proposal to fix it: add a force flag to killExecutor. If force is 
> not set (i.e., false), we first check whether the executor being killed is 
> idle or busy. If the executor has tasks assigned, we do not kill it and 
> return false (to indicate the kill failed). Dynamic allocation would turn 
> force killing off (i.e., force = false), so an attempt to kill a busy 
> executor fails; the idle timer then remains valid, and when the task 
> assignment event arrives later we can cancel the idle timer accordingly. That 
> way we avoid falsely killing busy executors under dynamic allocation.
> For other usages, end users can decide for themselves whether to use force 
> killing. If that option is turned on, killExecutor performs the kill without 
> any status check.
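A minimal Scala sketch of the proposed check; the class and helper names (hasRunningTasks, doKill) are hypothetical, not the actual Spark API.

{code}
// Hypothetical sketch of the proposed force flag; names are illustrative only.
class ExecutorKillerSketch(hasRunningTasks: String => Boolean) {

  /** Returns true if the executor was killed, false if the request was refused. */
  def killExecutor(executorId: String, force: Boolean = false): Boolean = {
    if (!force && hasRunningTasks(executorId)) {
      // A non-forced kill of a busy executor is refused; the caller (e.g. dynamic
      // allocation) keeps its idle timer and cancels it once the task event arrives.
      false
    } else {
      doKill(executorId)
      true
    }
  }

  private def doKill(executorId: String): Unit = {
    // Placeholder for the real kill path (not part of this sketch).
  }
}
{code}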






[jira] [Commented] (SPARK-7218) Create a real iterator with open/close for Spark SQL

2015-08-18 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702584#comment-14702584
 ] 

Cheng Hao commented on SPARK-7218:
--

Can you give some BKMs (best known methods) for this task?

> Create a real iterator with open/close for Spark SQL
> 
>
> Key: SPARK-7218
> URL: https://issues.apache.org/jira/browse/SPARK-7218
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
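Since the description is empty, here is a purely speculative Scala sketch of what an iterator with an explicit open/close lifecycle might look like; none of these names come from the Spark codebase.

{code}
// Speculative sketch: an iterator whose resource lifecycle is explicit, so an
// operator can acquire files or connections lazily and release them deterministically.
trait OpenCloseIterator[T] extends Iterator[T] with AutoCloseable {
  /** Acquire any resources before the first hasNext/next call. */
  def open(): Unit

  /** Release resources; should be safe even if the iterator was not fully consumed. */
  def close(): Unit
}
{code}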







[jira] [Resolved] (SPARK-10099) Use @deprecated instead of @Deprecated in Scala code

2015-08-18 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10099.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

> Use @deprecated instead of @Deprecated in Scala code
> 
>
> Key: SPARK-10099
> URL: https://issues.apache.org/jira/browse/SPARK-10099
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Tathagata Das
>Priority: Trivial
> Fix For: 1.5.0
>
>
> {code}
> $ find . -name "*.scala" -exec grep -l "@Deprecated" {} \;
> ./core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala
> ./core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
> ./streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala
> {code}
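For context, Scala's {{@deprecated}} takes a deprecation message and a "since" version that surface in compiler warnings; a minimal illustrative snippet (the method names are made up):

{code}
// Illustrative only: oldAnswer/newAnswer are made-up names.
object DeprecationExample {
  @deprecated("Use newAnswer() instead", "1.5.0")
  def oldAnswer(): Int = newAnswer()

  def newAnswer(): Int = 42
}
{code}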






[jira] [Resolved] (SPARK-9967) Rename the SparkConf property to spark.streaming.backpressure.{enable --> enabled}

2015-08-18 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-9967.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

> Rename the SparkConf property to spark.streaming.backpressure.{enable --> 
> enabled}
> --
>
> Key: SPARK-9967
> URL: https://issues.apache.org/jira/browse/SPARK-9967
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
> Fix For: 1.5.0
>
>
> ... to better align with most other Spark parameters, which use "enabled" in 
> their names.
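For reference, after the rename the flag is set like any other SparkConf property; a minimal sketch (the application name is illustrative):

{code}
import org.apache.spark.SparkConf

// After the rename, the property reads consistently with other *.enabled flags.
val conf = new SparkConf()
  .setAppName("backpressure-example") // illustrative name
  .set("spark.streaming.backpressure.enabled", "true")
{code}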






[jira] [Commented] (SPARK-4888) Spark EC2 doesn't mount local disks for i2.8xlarge instances

2015-08-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702575#comment-14702575
 ] 

Sean Owen commented on SPARK-4888:
--

[~shivaram] what do you think of the state of a lot of the old EC2 tickets 
here? As far as I'm concerned, this is a reasonable turning point to go and 
close old ones as no longer relevant. I'd kind of favor not tracking JIRAs for 
this code here at all anymore, since the code isn't in any Apache project now, 
but maybe one step at a time; purging some old stuff is uncontroversial.

> Spark EC2 doesn't mount local disks for i2.8xlarge instances
> 
>
> Key: SPARK-4888
> URL: https://issues.apache.org/jira/browse/SPARK-4888
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Priority: Critical
>
> When launching a cluster using {{spark-ec2}} with i2.8xlarge instances, the 
> local disks aren't mounted. The AWS console doesn't show the disks as 
> mounted, either.
> I think the issue is that EC2 won't auto-mount the SSDs. We have some 
> code that handles this for some of the {{r3*}} instance types, and I think 
> the right fix is to extend it to the {{i2}} instance types, too: 
> https://github.com/mesos/spark-ec2/blob/v4/setup-slave.sh#L37
> /cc [~adav], who originally found this issue.






[jira] [Commented] (SPARK-9427) Add expression functions in SparkR

2015-08-18 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702563#comment-14702563
 ] 

Yu Ishikawa commented on SPARK-9427:


I see. Thank you for letting me know. 

> Add expression functions in SparkR
> --
>
> Key: SPARK-9427
> URL: https://issues.apache.org/jira/browse/SPARK-9427
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> The list of functions to add is based on the SQL functions, and it would be 
> better to add them in a one-shot PR.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala






[jira] [Commented] (SPARK-9911) User guide for MulticlassClassificationEvaluator

2015-08-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702559#comment-14702559
 ] 

Apache Spark commented on SPARK-9911:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/8304

> User guide for MulticlassClassificationEvaluator
> 
>
> Key: SPARK-9911
> URL: https://issues.apache.org/jira/browse/SPARK-9911
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Manoj Kumar
>
> SPARK-7690 adds MulticlassClassificationEvaluator to ML Pipelines, which is 
> not present in MLlib. We need to update the user guide ({{ml-guide#Algorithm 
> Guides}}) to document this feature.






[jira] [Assigned] (SPARK-9911) User guide for MulticlassClassificationEvaluator

2015-08-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9911:
---

Assignee: Manoj Kumar  (was: Apache Spark)

> User guide for MulticlassClassificationEvaluator
> 
>
> Key: SPARK-9911
> URL: https://issues.apache.org/jira/browse/SPARK-9911
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Manoj Kumar
>
> SPARK-7690 adds MulticlassClassificationEvaluator to ML Pipelines, which is 
> not present in MLlib. We need to update the user guide ({{ml-guide#Algorithm 
> Guides}}) to document this feature.






[jira] [Assigned] (SPARK-9911) User guide for MulticlassClassificationEvaluator

2015-08-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9911:
---

Assignee: Apache Spark  (was: Manoj Kumar)

> User guide for MulticlassClassificationEvaluator
> 
>
> Key: SPARK-9911
> URL: https://issues.apache.org/jira/browse/SPARK-9911
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Apache Spark
>
> SPARK-7690 adds MulticlassClassificationEvaluator to ML Pipelines, which is 
> not present in MLlib. We need to update the user guide ({{ml-guide#Algorithm 
> Guides}}) to document this feature.






[jira] [Commented] (SPARK-6813) SparkR style guide

2015-08-18 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702556#comment-14702556
 ] 

Yu Ishikawa commented on SPARK-6813:


We did it. I appreciate your support.

> SparkR style guide
> --
>
> Key: SPARK-6813
> URL: https://issues.apache.org/jira/browse/SPARK-6813
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Yu Ishikawa
> Fix For: 1.5.0
>
>
> We should develop a SparkR style guide document based on some of the 
> guidelines we use and some of the best practices in R.
> Some examples of R style guides are:
> http://r-pkgs.had.co.nz/r.html#style 
> http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html
> A related issue is to work on an automatic style checking tool; 
> https://github.com/jimhester/lintr seems promising.
> We could have an R style guide based on the one from Google [1], and adjust 
> some of its rules based on the discussion in Spark:
> 1. Line length: maximum 100 characters
> 2. No limit on function name length (the API should be similar to the other 
> languages)
> 3. Allow S4 objects/methods






[jira] [Commented] (SPARK-9427) Add expression functions in SparkR

2015-08-18 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702552#comment-14702552
 ] 

Yu Ishikawa commented on SPARK-9427:


Alright.

> Add expression functions in SparkR
> --
>
> Key: SPARK-9427
> URL: https://issues.apache.org/jira/browse/SPARK-9427
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> The list of functions to add is based on the SQL functions, and it would be 
> better to add them in a one-shot PR.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala






[jira] [Updated] (SPARK-10099) Use @deprecated instead of @Deprecated in Scala code

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10099:

Assignee: Tathagata Das  (was: Xiangrui Meng)

> Use @deprecated instead of @Deprecated in Scala code
> 
>
> Key: SPARK-10099
> URL: https://issues.apache.org/jira/browse/SPARK-10099
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Tathagata Das
>Priority: Trivial
>
> {code}
> $ find . -name "*.scala" -exec grep -l "@Deprecated" {} \;
> ./core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala
> ./core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
> ./streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala
> {code}






[jira] [Updated] (SPARK-9427) Add expression functions in SparkR

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-9427:
-
Target Version/s: 1.6.0, 1.5.1  (was: 1.6.0)

> Add expression functions in SparkR
> --
>
> Key: SPARK-9427
> URL: https://issues.apache.org/jira/browse/SPARK-9427
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> The list of functions to add is based on the SQL functions, and it would be 
> better to add them in a one-shot PR.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala






[jira] [Updated] (SPARK-6840) SparkR: private package functions unavailable when using lapplyPartition in package

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6840:
-
Target Version/s: 1.6.0, 1.5.1  (was: 1.5.0)

> SparkR: private package functions unavailable when using lapplyPartition in 
> package
> ---
>
> Key: SPARK-6840
> URL: https://issues.apache.org/jira/browse/SPARK-6840
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Shivaram Venkataraman
>
> I am developing a package that imports SparkR. A function in that package 
> calls lapplyPartition with a function argument that uses, in its body, some 
> functions private to the package. When run, the computation fails because R 
> cannot find the private function (details below). If I fully qualify them as 
> otherpackage:::private.function, the error just moves down to the next 
> private function. This used to work some time ago; I've been working on 
> other stuff for a little while. It should also work under regular R scoping 
> rules. I apologize that I don't have a minimal test case ready, but this was 
> discovered while developing plyrmr, and the list of dependencies is long 
> enough that it's a bit of a burden to make you install it. I think I can put 
> together a toy package to demonstrate the problem, if that helps.
> Error in FUN(part) : could not find function "keys.spark"
> Calls: source ... eval -> eval -> computeFunc ->  -> FUN -> FUN
> Execution halted
> 15/03/19 12:29:16 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> org.apache.spark.SparkException: R computation failed with
>  Error in FUN(part) : could not find function "keys.spark"
> Calls: source ... eval -> eval -> computeFunc ->  -> FUN -> FUN
> Execution halted
>   at edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:80)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)






[jira] [Updated] (SPARK-6834) Failed with error: ‘invalid package name’ Error in as.name(name) : attempt to use zero-length variable name

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6834:
-
Target Version/s: 1.6.0, 1.5.1  (was: 1.5.0)

> Failed with error: ‘invalid package name’ Error in as.name(name) : attempt to 
> use zero-length variable name
> ---
>
> Key: SPARK-6834
> URL: https://issues.apache.org/jira/browse/SPARK-6834
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Shivaram Venkataraman
>
> Context: trying to interface SparkR with foreach. foreach is an abstraction 
> over several parallel backends; this would enable execution on a Spark 
> cluster of more than 50 R packages, including the very popular caret and 
> plyr. Simple foreach examples work; caret unfortunately does not.
> I have a repro, but it is somewhat complex (it is the main model-fitting 
> example from the caret website, though, not something made up on purpose to 
> make SparkR fail). If I find anything more straightforward, I will comment 
> here. Reproduced in an R --vanilla session, but I can't uninstall all of my 
> packages, so I may have missed some deps.
> Reproduce with:
> install.packages(c("caret", "foreach", "devtools", "mlbench", "gbm", 
> "survival", "splines"))
> library(caret)
> library(foreach)
> library(devtools)
> install_github("RevolutionAnalytics/doParallelSpark", subdir = "pkg")
> library(doParallelSpark)
> registerDoParallelSpark()
> library(mlbench)
> data(Sonar)
> set.seed(998)
> inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
> training <- Sonar[ inTraining,]
> testing  <- Sonar[-inTraining,]
> fitControl <- trainControl(## 10-fold CV
>   method = "repeatedcv",
>   number = 10,
>   ## repeated ten times
>   repeats = 10)
> set.seed(825)
> gbmFit1 <- train(Class ~ ., data = training,
>  method = "gbm",
>  trControl = fitControl,
>  ## This last option is actually one
>  ## for gbm() that passes through
>  verbose = FALSE)
> Stack trace
> Failed with error:  ‘invalid package name’
> Failed with error:  ‘invalid package name’
> Failed with error:  ‘invalid package name’
> Failed with error:  ‘invalid package name’
> Error in as.name(name) : attempt to use zero-length variable name
> Calls: source ... withVisible -> eval -> eval -> getNamespace -> as.name
> Execution halted
> 15/03/26 14:32:30 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 5)
> org.apache.spark.SparkException: R computation failed with
>  Failed with error:  ‘invalid package name’
> Failed with error:  ‘invalid package name’
> Failed with error:  ‘invalid package name’
> Failed with error:  ‘invalid package name’
> Error in as.name(name) : attempt to use zero-length variable name
> Calls: source ... withVisible -> eval -> eval -> getNamespace -> as.name
> Execution halted
>   at edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:80)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:32)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/03/26 14:32:30 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 5, 
> localhost): org.apache.spark.SparkException: R computation failed with
>  Failed with error:  ‘invalid package name’
> Failed with error:  ‘invalid package name’
> Failed with error:  ‘invalid package name’
> Failed with error:  ‘invalid package name’
> Error in as.name(name) : attempt to use zero-length variable name
> Calls: source ... withVisible -> eval -> eval -> getNamespace -> as.name
> Execution halted
> edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:80)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:32)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD

[jira] [Updated] (SPARK-6837) SparkR failure in processClosure

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6837:
-
Target Version/s: 1.6.0, 1.5.1  (was: 1.5.0)

> SparkR failure in processClosure
> 
>
> Key: SPARK-6837
> URL: https://issues.apache.org/jira/browse/SPARK-6837
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Shivaram Venkataraman
>
> Sorry, another one I can't reproduce in straight SparkR.
> This is a typical plyrmr example:
> as.data.frame(gapply(input(mtcars), identity))
> Error in get(nodeChar, envir = func.env, inherits = FALSE) : 
> argument "..." is missing, with no default
> Stack trace below. This may have appeared after the introduction of the new 
> way of serializing closures. Using ... in a function alone doesn't reproduce 
> the error, so I thought I'd file this to hear what you guys think while I try 
> to isolate it better.
> > traceback()
> 28: get(nodeChar, envir = func.env, inherits = FALSE) at utils.R#363
> 27: processClosure(node[[i]], oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#339
> 26: processClosure(node[[i]], oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#339
> 25: processClosure(node[[i]], oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#339
> 24: processClosure(node[[i]], oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#324
> 23: processClosure(node[[i]], oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#312
> 22: processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#417
> 21: cleanClosure(obj, checkedFuncs) at utils.R#381
> 20: processClosure(node[[i]], oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#339
> 19: processClosure(node[[i]], oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#339
> 18: processClosure(node[[i]], oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#339
> 17: processClosure(node[[i]], oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#312
> 16: processClosure(node[[i]], oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#339
> 15: processClosure(node[[i]], oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#312
> 14: processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#417
> 13: cleanClosure(obj, checkedFuncs) at utils.R#381
> 12: processClosure(node[[i]], oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#339
> 11: processClosure(node[[i]], oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#312
> 10: processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv) at 
> utils.R#417
> 9: cleanClosure(FUN) at RDD.R#532
> 8: lapplyPartitionsWithIndex(X, function(s, part)
> { FUN(part) }) at generics.R#76
> 7: lapplyPartitionsWithIndex(X, function(s, part) { FUN(part) }
> ) at RDD.R#499
> 6: SparkR::lapplyPartition(rdd, f) at generics.R#66
> 5: SparkR::lapplyPartition(rdd, f)
> 4: as.pipespark(if (is.grouped(.data))
> { if (is.mergeable(.f)) SparkR::lapplyPartition(SparkR::reduceByKey(rdd, 
> f.reduce, 2L), f) else SparkR::lapplyPartition(SparkR::groupByKey(rdd, 2L), 
> f) }
> else SparkR::lapplyPartition(rdd, f), grouped = is.grouped(.data)) at 
> pipespark.R#119
> 3: gapply.pipespark(input(mtcars), identity) at pipe.R#107
> 2: gapply(input(mtcars), identity)
> 1: as.data.frame(gapply(input(mtcars), identity))
> >
> There is a proposed fix for this at 
> https://github.com/amplab-extras/SparkR-pkg/pull/229






[jira] [Commented] (SPARK-4888) Spark EC2 doesn't mount local disks for i2.8xlarge instances

2015-08-18 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702533#comment-14702533
 ] 

Shivaram Venkataraman commented on SPARK-4888:
--

I'm removing the target version for this, as it isn't a change to the Spark 
source code repo.

> Spark EC2 doesn't mount local disks for i2.8xlarge instances
> 
>
> Key: SPARK-4888
> URL: https://issues.apache.org/jira/browse/SPARK-4888
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Priority: Critical
>
> When launching a cluster using {{spark-ec2}} with i2.8xlarge instances, the 
> local disks aren't mounted. The AWS console doesn't show the disks as 
> mounted, either.
> I think the issue is that EC2 won't auto-mount the SSDs. We have some 
> code that handles this for some of the {{r3*}} instance types, and I think 
> the right fix is to extend it to the {{i2}} instance types, too: 
> https://github.com/mesos/spark-ec2/blob/v4/setup-slave.sh#L37
> /cc [~adav], who originally found this issue.






[jira] [Updated] (SPARK-4888) Spark EC2 doesn't mount local disks for i2.8xlarge instances

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-4888:
-
Target Version/s:   (was: 1.5.0)

> Spark EC2 doesn't mount local disks for i2.8xlarge instances
> 
>
> Key: SPARK-4888
> URL: https://issues.apache.org/jira/browse/SPARK-4888
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Priority: Critical
>
> When launching a cluster using {{spark-ec2}} with i2.8xlarge instances, the 
> local disks aren't mounted. The AWS console doesn't show the disks as 
> mounted, either.
> I think the issue is that EC2 won't auto-mount the SSDs. We have some 
> code that handles this for some of the {{r3*}} instance types, and I think 
> the right fix is to extend it to the {{i2}} instance types, too: 
> https://github.com/mesos/spark-ec2/blob/v4/setup-slave.sh#L37
> /cc [~adav], who originally found this issue.






[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)

2015-08-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702528#comment-14702528
 ] 

Reynold Xin commented on SPARK-6192:


Is this one completely done?


> Enhance MLlib's Python API (GSoC 2015)
> --
>
> Key: SPARK-6192
> URL: https://issues.apache.org/jira/browse/SPARK-6192
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Manoj Kumar
>  Labels: gsoc, gsoc2015, mentor
>
> This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme 
> is to enhance MLlib's Python API and bring it on par with the Scala/Java API. 
> The main tasks are:
> 1. For all models in MLlib, provide save/load methods. This also
> includes save/load in Scala.
> 2. Python API for evaluation metrics.
> 3. Python API for streaming ML algorithms.
> 4. Python API for distributed linear algebra.
> 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use
> customized serialization, making MLLibPythonAPI hard to maintain. It
> would be nice to use DataFrames for serialization instead.
> I'll link the JIRAs for each of the tasks.
> Note that this doesn't mean all of these JIRAs are pre-assigned to 
> [~MechCoder]; the TODO list will be dynamic based on the backlog.






[jira] [Updated] (SPARK-6826) `hashCode` support for arbitrary R objects

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6826:
-
Target Version/s:   (was: 1.5.0)

> `hashCode` support for arbitrary R objects
> --
>
> Key: SPARK-6826
> URL: https://issues.apache.org/jira/browse/SPARK-6826
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> From the SparkR JIRA:
> digest::digest looks interesting, but it seems heavier-weight than our 
> requirements call for. One relatively easy way to do this is to serialize the 
> given R object into a string (serialize(object, ascii=T)) and then just call 
> the string hashCode function on it. FWIW, it looks like digest follows a 
> similar strategy, where the md5sum / shasum etc. are calculated on serialized 
> objects.
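For illustration, the same "serialize, then hash the serialized form" idea sketched in Scala; the SparkR version would instead serialize with serialize(object, ascii=T) and hash the resulting string.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SerializedHash {
  // Hash an arbitrary serializable object by hashing its serialized bytes.
  // Illustrative only; not the SparkR implementation.
  def hashCodeOf(obj: java.io.Serializable): Int = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(obj) // serialize the object
    out.close()
    java.util.Arrays.hashCode(buffer.toByteArray) // hash the serialized representation
  }
}
{code}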






[jira] [Updated] (SPARK-6822) lapplyPartition passes empty list to function

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6822:
-
Target Version/s:   (was: 1.5.0)

> lapplyPartition passes empty list to function
> -
>
> Key: SPARK-6822
> URL: https://issues.apache.org/jira/browse/SPARK-6822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Shivaram Venkataraman
>
> I have an RDD containing two elements, as expected and as shown by a collect. 
> When I call lapplyPartition on it with a function that prints its arguments 
> to stderr, the function gets called three times: the first two times with the 
> expected arguments and the third with an empty list as its argument. I was 
> wondering if that's a bug or if there are conditions under which that's 
> possible. I apologize I don't have a simple test case ready yet; I ran into 
> this potential bug while developing a separate package, plyrmr. If you are 
> willing to install it, the test case is very simple. The RDD that creates 
> this problem is the result of a join, but I couldn't replicate the problem 
> using a plain vanilla join.
> Example from Antonio on the SparkR JIRA: I don't have time to try any harder 
> to repro this without plyrmr. For the record, this is the example:
> {code}
> library(plyrmr)
> plyrmr.options(backend = "spark")
> df1 = mtcars[1:4,]
> df2 = mtcars[3:6,]
> w = as.data.frame(gapply(merge(input(df1), input(df2)), identity))
> {code}
> The gapply is implemented with a lapplyPartition in most cases, the merge 
> with a join, and as.data.frame with a collect. The join has an arbitrary 
> argument of 4 partitions; if I turn that down to 2L, the problem disappears. 
> I will check in a version with a workaround in place, but a debugging 
> statement will leave a record in stderr whenever the workaround kicks in, so 
> that we can track it.






[jira] [Updated] (SPARK-7465) DAG visualization: RDD dependencies not always shown

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7465:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> DAG visualization: RDD dependencies not always shown
> 
>
> Key: SPARK-7465
> URL: https://issues.apache.org/jira/browse/SPARK-7465
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Currently if the same RDD appears in multiple stages, the arrow will be drawn 
> only for the first occurrence. It may be too much to show the dependency on 
> every single occurrence of the same RDD (common in MLlib and GraphX), but we 
> should at least show them on hover so the user knows where the RDDs are 
> coming from.






[jira] [Updated] (SPARK-6814) Support sorting for any data type in SparkR

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6814:
-
Target Version/s:   (was: 1.5.0)

> Support sorting for any data type in SparkR
> ---
>
> Key: SPARK-6814
> URL: https://issues.apache.org/jira/browse/SPARK-6814
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Priority: Critical
>
> I get various "return status == 0 is false" and "unimplemented type" errors 
> trying to get data out of any RDD with top() or collect(). The errors are not 
> consistent. I think Spark is installed properly, because some operations do 
> work. I apologize if I'm missing something easy or not providing the right 
> diagnostic info; I'm new to SparkR, and this seems to be the only resource 
> for SparkR issues.
> Some logs:
> {code}
> Browse[1]> top(estep.rdd, 1L)
> Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) : 
>   unimplemented type 'list' in 'orderVector1'
> Calls: do.call ... Reduce ->  -> func -> FUN -> FUN -> order
> Execution halted
> 15/02/13 19:11:57 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
> org.apache.spark.SparkException: R computation failed with
>  Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) : 
>   unimplemented type 'list' in 'orderVector1'
> Calls: do.call ... Reduce ->  -> func -> FUN -> FUN -> order
> Execution halted
>   at edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/02/13 19:11:57 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 14, 
> localhost): org.apache.spark.SparkException: R computation failed with
>  Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) : 
>   unimplemented type 'list' in 'orderVector1'
> Calls: do.call ... Reduce ->  -> func -> FUN -> FUN -> order
> Execution halted
> edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Updated] (SPARK-7471) DAG visualization: show call site information

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7471:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> DAG visualization: show call site information
> -
>
> Key: SPARK-7471
> URL: https://issues.apache.org/jira/browse/SPARK-7471
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> It would be useful to find the line that created the RDD / scope.






[jira] [Resolved] (SPARK-6813) SparkR style guide

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-6813.
--
   Resolution: Fixed
 Assignee: Yu Ishikawa
Fix Version/s: 1.5.0

I'm marking this as resolved, as we've basically made the 1.5 code compliant 
with the style guide. The lint-r job on Jenkins runs on the master branch, 
which is also compliant.

> SparkR style guide
> --
>
> Key: SPARK-6813
> URL: https://issues.apache.org/jira/browse/SPARK-6813
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Yu Ishikawa
> Fix For: 1.5.0
>
>
> We should develop a SparkR style guide document based on some of the 
> guidelines we use and some of the best practices in R.
> Some examples of R style guides are:
> http://r-pkgs.had.co.nz/r.html#style 
> http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html
> A related issue is to work on an automatic style checking tool; 
> https://github.com/jimhester/lintr seems promising.
> We could have an R style guide based on the one from Google [1], and adjust 
> some of its rules based on the discussion in Spark:
> 1. Line length: maximum 100 characters
> 2. No limit on function name length (the API should be similar to the other 
> languages)
> 3. Allow S4 objects/methods






[jira] [Updated] (SPARK-7349) DAG visualization: add legend to explain the content

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7349:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> DAG visualization: add legend to explain the content
> 
>
> Key: SPARK-7349
> URL: https://issues.apache.org/jira/browse/SPARK-7349
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> Right now we have red dots and black dots here and there. It's not clear what 
> they mean.






[jira] [Comment Edited] (SPARK-7178) Improve DataFrame documentation and code samples

2015-08-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702520#comment-14702520
 ] 

Reynold Xin edited comment on SPARK-7178 at 8/19/15 5:37 AM:
-

Closing this one since we will update DataFrame documentation in other tickets.

Also, {{and}}/{{or}} now produce better error messages in Python.



was (Author: rxin):
Closing this one since we will update DataFrame documentation in other tickets.


> Improve DataFrame documentation and code samples
> 
>
> Key: SPARK-7178
> URL: https://issues.apache.org/jira/browse/SPARK-7178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Chris Fregly
>  Labels: dataframe
>
> AND and OR are not straightforward when using the new DataFrame API.
> The current convention, accepted by Pandas users, is to use the bitwise & 
> and | instead of AND and OR. When using these, however, you need to wrap 
> each expression in parentheses to keep the bitwise operators from taking 
> precedence.
> Also, working with StructTypes is a bit confusing. The following link: 
> https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
>  (Python tab) implies that you can work with tuples directly when creating a 
> DataFrame.
> However, the following code errors out unless we explicitly use Rows:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> # The schema is encoded in a string.
> schemaString = "a"
> fields = [StructField(field_name, MapType(StringType(),IntegerType())) for 
> field_name in schemaString.split()]
> schema = StructType(fields)
> df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
> {code}






[jira] [Updated] (SPARK-7041) Avoid writing empty files in BypassMergeSortShuffleWriter

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7041:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> Avoid writing empty files in BypassMergeSortShuffleWriter
> -
>
> Key: SPARK-7041
> URL: https://issues.apache.org/jira/browse/SPARK-7041
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In BypassMergeSortShuffleWriter, we may end up opening disk writer files for 
> empty partitions; this occurs because we manually call {{open()}} after 
> creating the writer, causing serialization and compression streams to be 
> created. These streams may write headers to the output stream, resulting in 
> non-zero-length files being created for partitions that contain no records. 
> This is unnecessary, though, since the disk object writer will automatically 
> open itself when the first write is performed. Removing this eager 
> {{open()}} call and rewriting the consumers to cope with the non-existence 
> of empty files results in a large performance benefit for certain sparse 
> workloads when using sort-based shuffle.
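A toy Scala sketch of the lazy-open idea (this is not Spark's actual disk writer): nothing is opened, so no stream headers are written and no file appears, until the first record arrives.

{code}
import java.io.OutputStream

// Toy illustration only; the real writer also tracks metrics, commits, reverts, etc.
final class LazyPartitionWriter(openStream: () => OutputStream) {
  private var out: OutputStream = null

  def write(record: Array[Byte]): Unit = {
    if (out == null) {
      out = openStream() // open (and emit any codec/serializer headers) on first record only
    }
    out.write(record)
  }

  /** Close only if something was written; empty partitions never create output. */
  def closeIfOpened(): Unit = if (out != null) out.close()
}
{code}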






[jira] [Updated] (SPARK-7463) DAG visualization improvements

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7463:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> DAG visualization improvements
> --
>
> Key: SPARK-7463
> URL: https://issues.apache.org/jira/browse/SPARK-7463
> Project: Spark
>  Issue Type: Umbrella
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is the umbrella JIRA for improvements or bug fixes to the DAG 
> visualization.






[jira] [Updated] (SPARK-7025) Create a Java-friendly input source API

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7025:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> Create a Java-friendly input source API
> ---
>
> Key: SPARK-7025
> URL: https://issues.apache.org/jira/browse/SPARK-7025
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The goal of this ticket is to create a simple input source API that we can 
> maintain and support long term.
> Spark currently has two de facto input source APIs:
> 1. RDD
> 2. Hadoop MapReduce InputFormat
> Neither of the above is ideal:
> 1. RDD: It is hard for Java developers to implement RDD, given the implicit 
> class tags. In addition, the RDD API depends on Scala's runtime library, 
> which does not preserve binary compatibility across Scala versions. If a 
> developer chooses Java to implement an input source, it would be great if 
> that input source could remain binary compatible for years to come.
> 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
> example, it forces key-value semantics, and does not support running 
> arbitrary code on the driver side (an example of why this is useful is 
> broadcast). In addition, it is somewhat awkward to tell developers that in 
> order to implement an input source for Spark, they should learn the Hadoop 
> MapReduce API first.
> So here's the proposal: an InputSource is described by:
> * an array of InputPartition that specifies the data partitioning
> * a RecordReader that specifies how data on each partition can be read
> This interface would be similar to Hadoop's InputFormat, except that there is 
> no explicit key/value separation.
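A rough Scala sketch of the proposed shape; the names InputSource, InputPartition, and RecordReader come from the ticket, but the exact signatures are assumptions, not a committed API.

{code}
// Sketch only: names from the ticket, signatures assumed for illustration.
trait InputPartition extends java.io.Serializable

trait RecordReader[T] extends java.io.Closeable {
  def hasNext: Boolean
  def next(): T
}

trait InputSource[T] extends java.io.Serializable {
  /** Driver side: describe how the data is partitioned. */
  def getPartitions(): Array[InputPartition]

  /** Executor side: read the records of one partition. No key/value split is imposed. */
  def createRecordReader(partition: InputPartition): RecordReader[T]
}
{code}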






[jira] [Updated] (SPARK-7780) The intercept in LogisticRegressionWithLBFGS should not be regularized

2015-08-18 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-7780:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> The intercept in LogisticRegressionWithLBFGS should not be regularized
> --
>
> Key: SPARK-7780
> URL: https://issues.apache.org/jira/browse/SPARK-7780
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: DB Tsai
>
> The intercept in logistic regression represents a prior on the categories and 
> should not be regularized. In MLlib, regularization is handled through 
> `Updater`, and the `Updater` penalizes all the components without excluding 
> the intercept, which results in poor training accuracy when regularization is 
> used.
> The new implementation in the ML framework handles this properly, and we 
> should call the ML implementation from MLlib, since the majority of users are 
> still using the MLlib API.
> Note that both of them do feature scaling to improve convergence, and the 
> only difference is that the ML version doesn't regularize the intercept. As a 
> result, when lambda is zero, they will converge to the same solution.
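For illustration, a minimal sketch of an L2-style gradient step that skips the intercept; the coefficient layout (intercept stored last) and the function name are assumptions, not MLlib's actual Updater API.

{code}
object InterceptAwareUpdate {
  // Apply L2 shrinkage to the weights but leave the intercept untouched.
  // Assumes coefficients = weights ++ Array(intercept); illustrative only.
  def l2StepSkippingIntercept(
      coefficients: Array[Double],
      gradient: Array[Double],
      stepSize: Double,
      regParam: Double): Array[Double] = {
    val n = coefficients.length
    Array.tabulate(n) { i =>
      val isIntercept = i == n - 1
      val shrinkage = if (isIntercept) 0.0 else stepSize * regParam * coefficients(i)
      coefficients(i) - stepSize * gradient(i) - shrinkage
    }
  }
}
{code}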






[jira] [Updated] (SPARK-7348) DAG visualization: add links to RDD page

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7348:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> DAG visualization: add links to RDD page
> 
>
> Key: SPARK-7348
> URL: https://issues.apache.org/jira/browse/SPARK-7348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> It currently has links from the job page to the stage page. It would be nice 
> if it also had links to the corresponding RDD page.






[jira] [Closed] (SPARK-7355) FlakyTest - o.a.s.DriverSuite

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7355.
--
Resolution: Later

> FlakyTest - o.a.s.DriverSuite
> -
>
> Key: SPARK-7355
> URL: https://issues.apache.org/jira/browse/SPARK-7355
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>  Labels: flaky-test
>







[jira] [Closed] (SPARK-7178) Improve DataFrame documentation and code samples

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7178.
--
Resolution: Duplicate

Closing this one since we will update DataFrame documentation in other tickets.


> Improve DataFrame documentation and code samples
> 
>
> Key: SPARK-7178
> URL: https://issues.apache.org/jira/browse/SPARK-7178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Chris Fregly
>  Labels: dataframe
>
> AND and OR are not straightforward when using the new DataFrame API.
> The current convention, accepted by Pandas users, is to use the bitwise & 
> and | instead of AND and OR. When using these, however, you need to wrap 
> each expression in parentheses to keep the bitwise operators from taking 
> precedence.
> Also, working with StructTypes is a bit confusing. The following link: 
> https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
>  (Python tab) implies that you can work with tuples directly when creating a 
> DataFrame.
> However, the following code errors out unless we explicitly use Rows:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> # The schema is encoded in a string.
> schemaString = "a"
> fields = [StructField(field_name, MapType(StringType(),IntegerType())) for 
> field_name in schemaString.split()]
> schema = StructType(fields)
> df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
> {code}






[jira] [Resolved] (SPARK-8496) Do not run slow tests for every PR

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8496.

Resolution: Later

> Do not run slow tests for every PR
> --
>
> Key: SPARK-8496
> URL: https://issues.apache.org/jira/browse/SPARK-8496
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> Right now we have individual test suites like SparkSubmitSuite and 
> DriverSuite that start subprocesses and wait for long timeouts. The total 
> test time on pull requests is usually between 2 - 3 hours, which is very long 
> for quickly iterating during development.
> This issue aims to separate slow tests from fast tests and disable slow tests 
> for the PRB.
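
A minimal sketch of one way to split slow tests from fast ones with ScalaTest tags 
(illustrative only; the tag name, suite, and sbt invocation below are assumptions, 
not necessarily the mechanism this issue ends up using):

{code}
import org.scalatest.{FunSuite, Tag}

// Hypothetical tag; the project's actual tagging scheme may differ.
object SlowTest extends Tag("org.apache.spark.tags.SlowTest")

class ExampleSuite extends FunSuite {
  test("fast sanity check") {
    assert(1 + 1 === 2)
  }

  // Tagged so the pull request builder can skip it.
  test("starts a subprocess and waits for a long timeout", SlowTest) {
    // ... long-running assertions ...
  }
}

// The PRB could then exclude tagged tests with something like:
//   build/sbt "test-only *Suite -- -l org.apache.spark.tags.SlowTest"
{code}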



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7416) Shuffle performance metrics umbrella

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7416.
--
Resolution: Later

> Shuffle performance metrics umbrella
> 
>
> Key: SPARK-7416
> URL: https://issues.apache.org/jira/browse/SPARK-7416
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle
>Reporter: Josh Rosen
>
> This is an umbrella JIRA for discussing improvements / enhancements to 
> Spark's shuffle performance metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8496) Do not run slow tests for every PR

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8496:
---
Target Version/s:   (was: 1.6.0)

> Do not run slow tests for every PR
> --
>
> Key: SPARK-8496
> URL: https://issues.apache.org/jira/browse/SPARK-8496
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> Right now we have individual test suites like SparkSubmitSuite and 
> DriverSuite that start subprocesses and wait for long timeouts. The total 
> test time on pull requests is usually between 2 - 3 hours, which is very long 
> for quickly iterating during development.
> This issue aims to separate slow tests from fast tests and disable slow tests 
> for the PRB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9427) Add expression functions in SparkR

2015-08-18 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702519#comment-14702519
 ] 

Shivaram Venkataraman commented on SPARK-9427:
--

[~yuu.ishik...@gmail.com] I retargeted some of the sub-tasks to 1.5.1. It 
shouldn't affect any of the PRs or the development workflow. It just means that 
we can continue merging the PRs into branch-1.5 and, depending on when the RCs 
get cut, we will update the fix version.

> Add expression functions in SparkR
> --
>
> Key: SPARK-9427
> URL: https://issues.apache.org/jira/browse/SPARK-9427
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> The list of functions to add is based on SQL's functions, and it would be 
> better to add them in a single PR.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9654) Add IndexToString in Pyspark

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9654:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> Add IndexToString in Pyspark
> 
>
> Key: SPARK-9654
> URL: https://issues.apache.org/jira/browse/SPARK-9654
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8496) Do not run slow tests for every PR

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8496:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> Do not run slow tests for every PR
> --
>
> Key: SPARK-8496
> URL: https://issues.apache.org/jira/browse/SPARK-8496
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> Right now we have individual test suites like SparkSubmitSuite and 
> DriverSuite that start subprocesses and wait for long timeouts. The total 
> test time on pull requests is usually between 2 - 3 hours, which is very long 
> for quickly iterating during development.
> This issue aims to separate slow tests from fast tests and disable slow tests 
> for the PRB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9972) Add `struct`, `encode` and `decode` function in SparkR

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-9972:
-
Target Version/s: 1.6.0, 1.5.1  (was: 1.6.0)

> Add `struct`, `encode` and `decode` function in SparkR
> --
>
> Key: SPARK-9972
> URL: https://issues.apache.org/jira/browse/SPARK-9972
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Support the {{struct}} function on a DataFrame in SparkR. However, I think we 
> need to improve the {{collect}} function in SparkR in order to implement 
> {{struct}}.
> - struct
> - encode
> - decode
> - array_contains
> - sort_array



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10106) Add `ifelse` Column function to SparkR

2015-08-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10106:


Assignee: Apache Spark

> Add `ifelse` Column function to SparkR
> --
>
> Key: SPARK-10106
> URL: https://issues.apache.org/jira/browse/SPARK-10106
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yu Ishikawa
>Assignee: Apache Spark
>
> Add a column function on a DataFrame to SparkR that behaves like R's `ifelse`.
> I guess we could implement it with a combination of {{when}} and 
> {{otherwise}}.
> h3. Example
> If {{df$x > 0}} is TRUE, then return 0, otherwise return 1.
> {noformat}
> ifelse(df$x > 0, 0, 1)
> {noformat}
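
For reference, a minimal Scala sketch of the {{when}}/{{otherwise}} combination 
mentioned above; a SparkR `ifelse` would presumably wrap the same Column 
expression. The DataFrame here is made up purely for illustration.

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.when

// Assumes an existing SparkContext `sc`, e.g. in spark-shell.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = Seq(-2, 0, 3, 5).toDF("x")

// Equivalent of R's ifelse(df$x > 0, 0, 1)
val labeled = df.withColumn("label", when(df("x") > 0, 0).otherwise(1))
labeled.show()
{code}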



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10106) Add `ifelse` Column function to SparkR

2015-08-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702515#comment-14702515
 ] 

Apache Spark commented on SPARK-10106:
--

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/8303

> Add `ifelse` Column function to SparkR
> --
>
> Key: SPARK-10106
> URL: https://issues.apache.org/jira/browse/SPARK-10106
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add a column function on a DataFrame to SparkR that behaves like R's `ifelse`.
> I guess we could implement it with a combination of {{when}} and 
> {{otherwise}}.
> h3. Example
> If {{df$x > 0}} is TRUE, then return 0, otherwise return 1.
> {noformat}
> ifelse(df$x > 0, 0, 1)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9857) Add expression functions into SparkR which conflict with the existing R's generic

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-9857:
-
Target Version/s: 1.6.0, 1.5.1  (was: 1.5.0)

> Add expression functions into SparkR which conflict with the existing R's 
> generic
> -
>
> Key: SPARK-9857
> URL: https://issues.apache.org/jira/browse/SPARK-9857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add expression functions into SparkR which conflict with the existing R's 
> generic, like {{coalesce(e: Column*)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10106) Add `ifelse` Column function to SparkR

2015-08-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10106:


Assignee: (was: Apache Spark)

> Add `ifelse` Column function to SparkR
> --
>
> Key: SPARK-10106
> URL: https://issues.apache.org/jira/browse/SPARK-10106
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add a column function on a DataFrame to SparkR that behaves like R's `ifelse`.
> I guess we could implement it with a combination of {{when}} and 
> {{otherwise}}.
> h3. Example
> If {{df$x > 0}} is TRUE, then return 0, otherwise return 1.
> {noformat}
> ifelse(df$x > 0, 0, 1)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10079) Make `column` and `col` functions be S4 functions

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-10079:
--
Target Version/s: 1.6.0, 1.5.1  (was: 1.5.0)

> Make `column` and `col` functions be S4 functions
> -
>
> Key: SPARK-10079
> URL: https://issues.apache.org/jira/browse/SPARK-10079
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> The {{column}} and {{col}} functions in {{R/pkg/R/Column.R}} are currently 
> defined as S3 functions. I think it would be better to define them as S4 
> functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9952) Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9952.

   Resolution: Fixed
Fix Version/s: 1.5.0

> Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs
> --
>
> Key: SPARK-9952
> URL: https://issues.apache.org/jira/browse/SPARK-9952
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.5.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.5.0
>
>
> In Scala, Seq.fill always returns a List. Accessing a list by index is an 
> O(N) operation. Thus, the following code will be really slow (~10 seconds on 
> my machine):
> {code}
> val numItems = 10
> val s = Seq.fill(numItems)(1)
> for (i <- 0 until numItems) s(i)
> {code}
> It turns out that we had a loop like this in DAGScheduler code. In 
> getPreferredLocsInternal, there's a call to {{getCacheLocs(rdd)(partition)}}. 
>  The {{getCacheLocs}} call returns a Seq. If this Seq is a List and the RDD 
> contains many partitions, then indexing into this list will cost 
> O(partitions). Thus, when we loop over our tasks to compute their individual 
> preferred locations we implicitly perform an N^2 loop, reducing scheduling 
> throughput.
> We can easily fix this by replacing Seq with Array in this code.
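
A quick way to see the List-vs-Array indexing cost described above, outside of 
Spark. This is a standalone sketch; the item count and timings are illustrative 
only.

{code}
object ListVsArrayIndexing {
  def time[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val numItems = 100000
    val asList  = Seq.fill(numItems)(1)   // Seq.fill returns a List
    val asArray = Array.fill(numItems)(1)

    time("List indexing, O(N) per access")  { for (i <- 0 until numItems) asList(i) }
    time("Array indexing, O(1) per access") { for (i <- 0 until numItems) asArray(i) }
  }
}
{code}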



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6798) Fix Date serialization in SparkR

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6798:
-
Target Version/s:   (was: 1.5.0)

> Fix Date serialization in SparkR
> 
>
> Key: SPARK-6798
> URL: https://issues.apache.org/jira/browse/SPARK-6798
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Davies Liu
>Priority: Minor
>
> SparkR's date serialization right now sends strings from R to the JVM. We 
> should convert this to integers and also account for time zones correctly by 
> using DateUtils.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9302) collect()/head() failed with JSON of some format

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-9302:
-
Target Version/s:   (was: 1.5.0)

> collect()/head() failed with JSON of some format
> 
>
> Key: SPARK-9302
> URL: https://issues.apache.org/jira/browse/SPARK-9302
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Sun Rui
>
> Reported in the mailing list by Exie :
> {noformat}
> A sample record in raw JSON looks like this:
> {"version": 1,"event": "view","timestamp": 1427846422377,"system":
> "DCDS","asset": "6404476","assetType": "myType","assetCategory":
> "myCategory","extras": [{"name": "videoSource","value": "mySource"},{"name":
> "playerType","value": "Article"},{"name": "duration","value":
> "202088"}],"trackingId": "155629a0-d802-11e4-13ee-6884e43d6000","ipAddress":
> "165.69.2.4","title": "myTitle"}
> > head(mydf)
> Error in as.data.frame.default(x[[i]], optional = TRUE) : 
>   cannot coerce class ""jobj"" to a data.frame
> >
> > show(mydf)
> DataFrame[localEventDtTm:timestamp, asset:string, assetCategory:string, 
> assetType:string, event:string, 
> extras:array<struct<name:string,value:string>>, ipAddress:string, 
> memberId:string, system:string, timestamp:bigint, title:string, 
> trackingId:string, version:bigint]
> >
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9302) Handle complex JSON types in collect()/head()

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-9302:
-
Summary: Handle complex JSON types in collect()/head()  (was: 
collect()/head() failed with JSON of some format)

> Handle complex JSON types in collect()/head()
> -
>
> Key: SPARK-9302
> URL: https://issues.apache.org/jira/browse/SPARK-9302
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Sun Rui
>
> Reported in the mailing list by Exie :
> {noformat}
> A sample record in raw JSON looks like this:
> {"version": 1,"event": "view","timestamp": 1427846422377,"system":
> "DCDS","asset": "6404476","assetType": "myType","assetCategory":
> "myCategory","extras": [{"name": "videoSource","value": "mySource"},{"name":
> "playerType","value": "Article"},{"name": "duration","value":
> "202088"}],"trackingId": "155629a0-d802-11e4-13ee-6884e43d6000","ipAddress":
> "165.69.2.4","title": "myTitle"}
> > head(mydf)
> Error in as.data.frame.default(x[[i]], optional = TRUE) : 
>   cannot coerce class ""jobj"" to a data.frame
> >
> > show(mydf)
> DataFrame[localEventDtTm:timestamp, asset:string, assetCategory:string, 
> assetType:string, event:string, 
> extras:array<struct<name:string,value:string>>, ipAddress:string, 
> memberId:string, system:string, timestamp:bigint, title:string, 
> trackingId:string, version:bigint]
> >
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8469) Application timeline view unreadable with many executors

2015-08-18 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702511#comment-14702511
 ] 

Andrew Or commented on SPARK-8469:
--

I've spoken with a few actual dynamic allocation users about this. All they 
really care about is the number of executors over time, not the specific ones 
(and there can be 5000+). Showing the last N is OK for now, but it doesn't seem 
particularly useful to me: if we show only the last 500, we still end up with a 
jumble of executor added / removed events.

> Application timeline view unreadable with many executors
> 
>
> Key: SPARK-8469
> URL: https://issues.apache.org/jira/browse/SPARK-8469
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Kousuke Saruta
> Attachments: Screen Shot 2015-06-18 at 5.51.21 PM.png
>
>
> This is a problem when using dynamic allocation with many executors; see the 
> attached screenshot. We may want to limit the number of stacked events 
> somehow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9302) collect()/head() failed with JSON of some format

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-9302:
-
Issue Type: Improvement  (was: Bug)

> collect()/head() failed with JSON of some format
> 
>
> Key: SPARK-9302
> URL: https://issues.apache.org/jira/browse/SPARK-9302
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Sun Rui
>
> Reported in the mailing list by Exie :
> {noformat}
> A sample record in raw JSON looks like this:
> {"version": 1,"event": "view","timestamp": 1427846422377,"system":
> "DCDS","asset": "6404476","assetType": "myType","assetCategory":
> "myCategory","extras": [{"name": "videoSource","value": "mySource"},{"name":
> "playerType","value": "Article"},{"name": "duration","value":
> "202088"}],"trackingId": "155629a0-d802-11e4-13ee-6884e43d6000","ipAddress":
> "165.69.2.4","title": "myTitle"}
> > head(mydf)
> Error in as.data.frame.default(x[[i]], optional = TRUE) : 
>   cannot coerce class ""jobj"" to a data.frame
> >
> > show(mydf)
> DataFrame[localEventDtTm:timestamp, asset:string, assetCategory:string, 
> assetType:string, event:string, 
> extras:array<struct<name:string,value:string>>, ipAddress:string, 
> memberId:string, system:string, timestamp:bigint, title:string, 
> trackingId:string, version:bigint]
> >
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9302) collect()/head() failed with JSON of some format

2015-08-18 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702510#comment-14702510
 ] 

Shivaram Venkataraman commented on SPARK-9302:
--

There is an open PR that I think will fix part of this: 
https://github.com/apache/spark/pull/8276. However, this is a new feature 
rather than a bug, so I don't mind removing its target version.

> collect()/head() failed with JSON of some format
> 
>
> Key: SPARK-9302
> URL: https://issues.apache.org/jira/browse/SPARK-9302
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Sun Rui
>
> Reported in the mailing list by Exie :
> {noformat}
> A sample record in raw JSON looks like this:
> {"version": 1,"event": "view","timestamp": 1427846422377,"system":
> "DCDS","asset": "6404476","assetType": "myType","assetCategory":
> "myCategory","extras": [{"name": "videoSource","value": "mySource"},{"name":
> "playerType","value": "Article"},{"name": "duration","value":
> "202088"}],"trackingId": "155629a0-d802-11e4-13ee-6884e43d6000","ipAddress":
> "165.69.2.4","title": "myTitle"}
> > head(mydf)
> Error in as.data.frame.default(x[[i]], optional = TRUE) : 
>   cannot coerce class ""jobj"" to a data.frame
> >
> > show(mydf)
> DataFrame[localEventDtTm:timestamp, asset:string, assetCategory:string, 
> assetType:string, event:string, 
> extras:array<struct<name:string,value:string>>, ipAddress:string, 
> memberId:string, system:string, timestamp:bigint, title:string, 
> trackingId:string, version:bigint]
> >
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10106) Add `ifelse` Column function to SparkR

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10106:

Target Version/s: 1.6.0, 1.5.1  (was: 1.5.0)

> Add `ifelse` Column function to SparkR
> --
>
> Key: SPARK-10106
> URL: https://issues.apache.org/jira/browse/SPARK-10106
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add a column function on a DataFrame to SparkR that behaves like R's `ifelse`.
> I guess we could implement it with a combination of {{when}} and 
> {{otherwise}}.
> h3. Example
> If {{df$x > 0}} is TRUE, then return 0, otherwise return 1.
> {noformat}
> ifelse(df$x > 0, 0, 1)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10043) Add window functions into SparkR

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10043:

Target Version/s: 1.6.0, 1.5.1  (was: 1.6.0)

> Add window functions into SparkR
> 
>
> Key: SPARK-10043
> URL: https://issues.apache.org/jira/browse/SPARK-10043
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add the following window functions to SparkR. I think we should also improve 
> the {{collect}} function in SparkR.
> - lead
> - cumuDist
> - denseRank
> - lag
> - ntile
> - percentRank
> - rank
> - rowNumber



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10043) Add window functions into SparkR

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10043:

Target Version/s: 1.6.0  (was: 1.5.0)

> Add window functions into SparkR
> 
>
> Key: SPARK-10043
> URL: https://issues.apache.org/jira/browse/SPARK-10043
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add the following window functions to SparkR. I think we should also improve 
> the {{collect}} function in SparkR.
> - lead
> - cumuDist
> - denseRank
> - lag
> - ntile
> - percentRank
> - rank
> - rowNumber



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10058) Flaky test: HeartbeatReceiverSuite: normal heartbeat

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10058:

Target Version/s: 1.6.0, 1.5.1  (was: 1.5.0)

> Flaky test: HeartbeatReceiverSuite: normal heartbeat
> 
>
> Key: SPARK-10058
> URL: https://issues.apache.org/jira/browse/SPARK-10058
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Andrew Or
>Priority: Critical
>  Labels: flaky-test
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/HeartbeatReceiverSuite/normal_heartbeat/
> {code}
> Error Message
> 3 did not equal 2
> Stacktrace
> sbt.ForkMain$ForkError: 3 did not equal 2
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply$mcV$sp(HeartbeatReceiverSuite.scala:104)
>   at 
> org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply(HeartbeatReceiverSuite.scala:97)
>   at 
> org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply(HeartbeatReceiverSuite.scala:97)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.HeartbeatReceiverSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HeartbeatReceiverSuite.scala:41)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.HeartbeatReceiverSuite.runTest(HeartbeatReceiverSuite.scala:41)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.HeartbeatReceiverSuite.org$scalatest$BeforeAndAfterAll$$super$run(HeartbeatReceiverSuite.scala:41)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at 
> org.apache.spark.HeartbeatReceiverSuite.run(HeartbeatReceiverSuite.scala:41)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 

[jira] [Closed] (SPARK-7689) Deprecate spark.cleaner.ttl

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7689.
--
Resolution: Later

> Deprecate spark.cleaner.ttl
> ---
>
> Key: SPARK-7689
> URL: https://issues.apache.org/jira/browse/SPARK-7689
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>
> With the introduction of ContextCleaner, I think there's no longer any reason 
> for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except 
> perhaps for super-long-lived Spark REPLs where you're worried about orphaning 
> RDDs or broadcast variables in your REPL history and having them never get 
> cleaned up, although I think this is an uncommon use-case).  I think that 
> this property used to be relevant for Spark Streaming jobs, but I think 
> that's no longer the case since the latest Streaming docs have removed all 
> mentions of {{spark.cleaner.ttl}} (see 
> https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817,
>  for example).
> See 
> http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html
>  for an old, related discussion.  Also, see 
> https://github.com/apache/spark/pull/126, the PR that introduced the new 
> ContextCleaner mechanism.
> We should probably add a deprecation warning to {{spark.cleaner.ttl}} that 
> advises users against using it, since it's an unsafe configuration option 
> that can lead to confusing behavior if it's misused.
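
A minimal sketch of what such a deprecation warning might look like. The helper 
and its placement are assumptions for illustration; only `SparkConf.contains` 
and the `Logging` trait are existing APIs here.

{code}
import org.apache.spark.{Logging, SparkConf}

// Hypothetical helper; in practice the check would live in SparkContext initialization.
object CleanerTtlDeprecation extends Logging {
  def warnIfSet(conf: SparkConf): Unit = {
    if (conf.contains("spark.cleaner.ttl")) {
      logWarning("spark.cleaner.ttl is deprecated: ContextCleaner already cleans up " +
        "unreachable RDDs, shuffles and broadcasts, and a TTL-based cleaner can drop " +
        "metadata that is still in use.")
    }
  }
}
{code}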



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7992) Hide private classes/objects in in generated Java API doc

2015-08-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702500#comment-14702500
 ] 

Reynold Xin commented on SPARK-7992:


[~mengxr] what's going on with this one?


> Hide private classes/objects in in generated Java API doc
> -
>
> Key: SPARK-7992
> URL: https://issues.apache.org/jira/browse/SPARK-7992
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-5610, we found that private classes/objects still show up in the 
> generated Java API doc, e.g., under `org.apache.spark.api.r` we can see
> {code}
> BaseRRDD
> PairwiseRRDD
> RRDD
> SpecialLengths
> StringRRDD
> {code}
> We should update genjavadoc to hide those private classes/methods. The best 
> approach is to find a good mapping from Scala private to Java, and merge it 
> into the main genjavadoc repo. A WIP PR is at 
> https://github.com/typesafehub/genjavadoc/pull/47.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7780) The intercept in LogisticRegressionWithLBFGS should not be regularized

2015-08-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702501#comment-14702501
 ] 

Reynold Xin commented on SPARK-7780:


I'm assuming this is going to 1.6 / 1.5.1?


> The intercept in LogisticRegressionWithLBFGS should not be regularized
> --
>
> Key: SPARK-7780
> URL: https://issues.apache.org/jira/browse/SPARK-7780
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: DB Tsai
>
> The intercept in Logistic Regression represents a prior on the categories and 
> should not be regularized. In MLlib, the regularization is handled through the 
> `Updater`, and the `Updater` penalizes all the components without excluding 
> the intercept, which results in poor training accuracy when regularization is 
> used.
> The new implementation in the ML framework handles this properly, and we 
> should call the ML implementation from MLlib since the majority of users are 
> still using the MLlib API.
> Note that both of them do feature scaling to improve the convergence, and the 
> only difference is that the ML version doesn't regularize the intercept. As a 
> result, when lambda is zero, they will converge to the same solution.
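
To make the point concrete, here is a small sketch of an L2 penalty that skips 
the intercept. It assumes the intercept is stored as the last entry of the 
weight vector, which is just a convention chosen for the example, not MLlib's 
internal layout.

{code}
object InterceptFreeL2 {
  // Returns (penalty value, gradient of the penalty). The last weight (the
  // intercept in this example's layout) contributes nothing to either.
  def penaltyAndGradient(weights: Array[Double], lambda: Double): (Double, Array[Double]) = {
    val grad = new Array[Double](weights.length)
    var penalty = 0.0
    var i = 0
    while (i < weights.length - 1) {  // stop before the intercept
      penalty += 0.5 * lambda * weights(i) * weights(i)
      grad(i) = lambda * weights(i)
      i += 1
    }
    (penalty, grad)                   // grad(last) stays 0.0
  }
}
{code}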



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8115) Remove TestData

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8115:
---
Target Version/s: 1.6.0, 1.5.1  (was: 1.5.0)

> Remove TestData
> ---
>
> Key: SPARK-8115
> URL: https://issues.apache.org/jira/browse/SPARK-8115
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Andrew Or
>Priority: Minor
>
> TestData was from the era when we didn't have easy ways to generate test 
> datasets. Now that we have implicits on Seq + toDF, it would make more sense 
> to put the test datasets closer to the test suites.
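
For illustration, this is roughly what the Seq + toDF approach looks like inside 
a test suite (a sketch; `sqlContext`, the case class, and the table name are 
assumptions, with `sqlContext` expected to come from the suite's shared test 
fixture):

{code}
// Inside a SQL test suite; sqlContext comes from the shared test fixture.
import sqlContext.implicits._

case class Person(id: Int, name: String)

val testData = Seq(Person(1, "alice"), Person(2, "bob")).toDF()
testData.registerTempTable("people")  // the suite can now query it with SQL
{code}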



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8445) MLlib 1.5 Roadmap

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8445.

   Resolution: Fixed
Fix Version/s: 1.5.0

Closing this one since 1.5 is almost over :)

> MLlib 1.5 Roadmap
> -
>
> Key: SPARK-8445
> URL: https://issues.apache.org/jira/browse/SPARK-8445
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.5.0
>
>
> We expect to see many MLlib contributors for the 1.5 release. To scale out 
> the development, we created this master list for MLlib features we plan to 
> have in Spark 1.5. Please view this list as a wish list rather than a 
> concrete plan, because we don't have an accurate estimate of available 
> resources. Due to limited review bandwidth, features appearing on this list 
> will get higher priority during code review. But feel free to suggest new 
> items to the list in comments. We are experimenting with this process. Your 
> feedback would be greatly appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter 
> task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0]
>  rather than a medium/big feature. Based on our experience, mixing the 
> development process with a big feature usually causes long delay in code 
> review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there exist no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if necessary.
> h1. Roadmap (WIP)
> This is NOT [a complete list of MLlib JIRAs for 
> 1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC].
>  We only include umbrella JIRAs and high-level tasks.
> h2. Algorithms and performance
> * LDA improvements (SPARK-5572)
> * Log-linear model for survival analysis (SPARK-8518) -> 1.6
> * Improve GLM's scalability on number of features (SPARK-8520)
> * Tree and ensembles: Move + cleanup code (SPARK-7131), provide class 
> probabilities (SPARK-3727), feature importance (SPARK-5133)
> * Improve GMM scalability and stability (SPARK-5016)
> * Frequent pattern mining improvements (SPARK-6487)
> * R-like stats for ML models (SPARK-7674)
> * Generalize classification threshold to multiclass (SPARK-8069)
> * A/B testing (SPARK-3147)
> h2. Pipeline API
> * more feature transformers (SPARK-8521)
> * k-means (SPARK-7879)
> * naive Bayes (SPARK-8600)
> * TrainValidationSplit for tuning (SPARK-8484)
> * Isotonic regression (SPARK-8671)
> h2. Model persistence
> * more PMML export (SPARK-8545)
> * model save/load (SPARK-4587)
> * pipeline persistence (SPARK-6725)
> h2. Python API for ML
> * List of issues identified during Spark 1.4 QA: (SPARK-7536)
> * Python API for streaming ML algorithms (SPARK-3258)
> * Add missing model methods (SPARK-8633)
> h2. SparkR API for ML
> * MLlib + SparkR integration for 1.5 (RFormula + glm) (SPARK-6805)
> * model.matrix for DataFrames (SPARK-6823)
> h2. Documentation
> * [Search for documentation improvements | 
> https://issues.apache.org/jira/

[jira] [Commented] (SPARK-8521) Feature Transformers in 1.5

2015-08-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702488#comment-14702488
 ] 

Reynold Xin commented on SPARK-8521:


Can we close this one, or move the unfinished items to 1.6?


> Feature Transformers in 1.5
> ---
>
> Key: SPARK-8521
> URL: https://issues.apache.org/jira/browse/SPARK-8521
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This is a list of feature transformers we plan to add in Spark 1.5. Feel free 
> to propose useful transformers that are not on the list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8469) Application timeline view unreadable with many executors

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8469:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> Application timeline view unreadable with many executors
> 
>
> Key: SPARK-8469
> URL: https://issues.apache.org/jira/browse/SPARK-8469
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Kousuke Saruta
> Attachments: Screen Shot 2015-06-18 at 5.51.21 PM.png
>
>
> This is a problem when using dynamic allocation with many executors; see the 
> attached screenshot. We may want to limit the number of stacked events 
> somehow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility

2015-08-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702486#comment-14702486
 ] 

Reynold Xin commented on SPARK-8580:


If you add this at the last minute we can include it in 1.5.0, but I'm not 
going to hold the release for it.


> Add Parquet files generated by different systems to test interoperability and 
> compatibility
> ---
>
> Key: SPARK-8580
> URL: https://issues.apache.org/jira/browse/SPARK-8580
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 
> to improve interoperability with other systems (reading non-standard Parquet 
> files they generate, and generating standard Parquet files), it would be good 
> to have a set of standard test Parquet files generated by various 
> systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old 
> versions of Spark SQL) to ensure compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8966) Design a mechanism to ensure that temporary files created in tasks are cleaned up after failures

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8966:
---
Parent Issue: SPARK-9697  (was: SPARK-9565)

> Design a mechanism to ensure that temporary files created in tasks are 
> cleaned up after failures
> 
>
> Key: SPARK-8966
> URL: https://issues.apache.org/jira/browse/SPARK-8966
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>
> It's important to avoid leaking temporary files, such as spill files created 
> by the external sorter.  Individual operators should still make an effort to 
> clean up their own files / perform their own error handling, but I think that 
> we should add a safety-net mechanism to track file creation on a per-task 
> basis and automatically clean up leaked files.
> During tests, this mechanism should throw an exception when a leak is 
> detected. In production deployments, it should log a warning and clean up the 
> leak itself.  This is similar to the TaskMemoryManager's leak detection and 
> cleanup code.
> We may be able to implement this via a convenience method that registers task 
> completion handlers with TaskContext.
> We might also explore techniques that will cause files to be cleaned up 
> automatically when their file descriptors are closed (e.g. by calling unlink 
> on an open file). These techniques should not be our last line of defense 
> against file resource leaks, though, since they might be platform-specific 
> and may clean up resources later than we'd like.
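
A minimal sketch of the convenience-method idea using the existing 
{{TaskContext.addTaskCompletionListener}} hook. The helper name and behavior 
are assumptions for illustration, not a proposed final design.

{code}
import java.io.File
import org.apache.spark.TaskContext

object TempFileTracker {
  // Creates a temp file and makes sure it is deleted when the enclosing task
  // completes, whether it succeeded or failed.
  def createTrackedTempFile(prefix: String): File = {
    val file = File.createTempFile(prefix, ".tmp")
    val ctx = TaskContext.get()  // null when called outside of a running task
    if (ctx != null) {
      ctx.addTaskCompletionListener { _ => if (file.exists()) file.delete() }
    }
    file
  }
}
{code}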



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8580:
---
Target Version/s: 1.6.0, 1.5.1  (was: 1.5.0)

> Add Parquet files generated by different systems to test interoperability and 
> compatibility
> ---
>
> Key: SPARK-8580
> URL: https://issues.apache.org/jira/browse/SPARK-8580
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 
> to improve interoperability with other systems (reading non-standard Parquet 
> files they generate, and generating standard Parquet files), it would be good 
> to have a set of standard test Parquet files generated by various 
> systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old 
> versions of Spark SQL) to ensure compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8716) Write tests for executor shared cache feature

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8716:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> Write tests for executor shared cache feature
> -
>
> Key: SPARK-8716
> URL: https://issues.apache.org/jira/browse/SPARK-8716
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>
> More specifically, this is the feature that is currently flagged by 
> `spark.files.useFetchCache`.
> This is a complicated feature that has no tests. I cannot say with confidence 
> that it actually works on all cluster managers. In particular, I believe it 
> doesn't work on Mesos because whatever goes into this else case creates its 
> own temp directory per executor: 
> https://github.com/apache/spark/blob/881662e9c93893430756320f51cef0fc6643f681/core/src/main/scala/org/apache/spark/util/Utils.scala#L739.
> It's also not immediately clear that it works in standalone mode, due to the 
> lack of comments. It actually does work there because the Worker happens to 
> set a `SPARK_EXECUTOR_DIRS` variable; the linkage could be documented more 
> explicitly in the code.
> This is difficult to write tests for, but it's still important to do so. 
> Otherwise, semi-related changes in the future may easily break it without 
> anyone noticing.
> Related issues: SPARK-8130, SPARK-6313, SPARK-2713



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8939) YARN EC2 default setting fails with IllegalArgumentException

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8939:
---
Target Version/s: 1.5.1  (was: 1.5.0)

> YARN EC2 default setting fails with IllegalArgumentException
> 
>
> Key: SPARK-8939
> URL: https://issues.apache.org/jira/browse/SPARK-8939
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>
> I just set it up from scratch using the spark-ec2 script. Then I ran
> {code}
> bin/spark-shell --master yarn
> {code}
> which failed with
> {code}
> 15/07/09 03:44:29 ERROR SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Unknown/unsupported param 
> List(--num-executors, , --executor-memory, 6154m, --executor-memory, 6154m, 
> --executor-cores, 2, --name, Spark shell)
> {code}
> This goes away if I provide `--num-executors`, but we should fix the default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8966) Design a mechanism to ensure that temporary files created in tasks are cleaned up after failures

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8966:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> Design a mechanism to ensure that temporary files created in tasks are 
> cleaned up after failures
> 
>
> Key: SPARK-8966
> URL: https://issues.apache.org/jira/browse/SPARK-8966
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>
> It's important to avoid leaking temporary files, such as spill files created 
> by the external sorter.  Individual operators should still make an effort to 
> clean up their own files / perform their own error handling, but I think that 
> we should add a safety-net mechanism to track file creation on a per-task 
> basis and automatically clean up leaked files.
> During tests, this mechanism should throw an exception when a leak is 
> detected. In production deployments, it should log a warning and clean up the 
> leak itself.  This is similar to the TaskMemoryManager's leak detection and 
> cleanup code.
> We may be able to implement this via a convenience method that registers task 
> completion handlers with TaskContext.
> We might also explore techniques that will cause files to be cleaned up 
> automatically when their file descriptors are closed (e.g. by calling unlink 
> on an open file). These techniques should not be our last line of defense 
> against file resource leaks, though, since they might be platform-specific 
> and may clean up resources later than we'd like.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8987) Increase test coverage of DAGScheduler

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8987:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> Increase test coverage of DAGScheduler
> --
>
> Key: SPARK-8987
> URL: https://issues.apache.org/jira/browse/SPARK-8987
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Tests
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>
> DAGScheduler is one of the most monstrous pieces of code in Spark. Every time 
> someone changes something there, something like the following happens:
> (1) Someone pings a committer
> (2) The committer pings a scheduler maintainer
> (3) Scheduler maintainer correctly points out bugs in the patch
> (4) Author of patch fixes bug but introduces more bugs
> (5) Repeat steps 3 - 4 N times
> (6) Other committers / contributors jump in and start debating
> (7) The patch goes stale for months
> All of this happens because no one, including the committers, has high 
> confidence that a particular change doesn't break some corner case in the 
> scheduler. I believe one of the main issues is the lack of sufficient test 
> coverage, which is not a luxury but a necessity for logic as complex as the 
> DAGScheduler.
> As of the writing of this JIRA, DAGScheduler has ~1500 lines, while the 
> DAGSchedulerSuite only has ~900 lines. I would argue that the suite line 
> count should actually be many multiples of that of the original code.
> If you wish to work on this, let me know and I will assign it to you. Anyone 
> is welcome. :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9026) SimpleFutureAction.onComplete should not tie up a separate thread for each callback

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9026:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> SimpleFutureAction.onComplete should not tie up a separate thread for each 
> callback
> ---
>
> Key: SPARK-9026
> URL: https://issues.apache.org/jira/browse/SPARK-9026
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> As [~zsxwing] points out at 
> https://github.com/apache/spark/pull/7276#issuecomment-121097747, 
> SimpleFutureAction currently blocks a separate execution context thread for 
> each callback registered via onComplete:
> {code}
>   override def onComplete[U](func: (Try[T]) => U)(implicit executor: 
> ExecutionContext) {
> executor.execute(new Runnable {
>   override def run() {
> func(awaitResult())
>   }
> })
>   }
> {code}
> We should fix this so that callbacks do not steal threads.
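
One possible shape of a fix, sketched outside of Spark's code base. This is not 
the actual change; it only shows how callbacks can be queued and fired from the 
job-completion path instead of parking a thread per callback in {{awaitResult()}}.

{code}
import scala.collection.mutable.ArrayBuffer
import scala.concurrent.ExecutionContext
import scala.util.Try

class CallbackRegistry[T] {
  private val pending = ArrayBuffer.empty[Try[T] => Any]
  private var result: Option[Try[T]] = None

  // Register a callback; it runs immediately if the result is already known.
  def onComplete[U](func: Try[T] => U)(implicit executor: ExecutionContext): Unit =
    synchronized {
      result match {
        case Some(r) => executor.execute(new Runnable { override def run(): Unit = func(r) })
        case None    => pending += func
      }
    }

  // Called once from the job-completion path (e.g. a job-end listener).
  def complete(value: Try[T])(implicit executor: ExecutionContext): Unit = synchronized {
    result = Some(value)
    pending.foreach { cb =>
      executor.execute(new Runnable { override def run(): Unit = cb(value) })
    }
    pending.clear()
  }
}
{code}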



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9158) PyLint should only fail on error

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9158:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> PyLint should only fail on error
> 
>
> Key: SPARK-9158
> URL: https://issues.apache.org/jira/browse/SPARK-9158
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Davies Liu
>Priority: Critical
>
> It's tedious to fight with warnings from Pylint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9302) collect()/head() failed with JSON of some format

2015-08-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702478#comment-14702478
 ] 

Reynold Xin commented on SPARK-9302:


[~shivaram] is this targeted for 1.5? 

> collect()/head() failed with JSON of some format
> 
>
> Key: SPARK-9302
> URL: https://issues.apache.org/jira/browse/SPARK-9302
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Sun Rui
>
> Reported on the mailing list by Exie:
> {noformat}
> A sample record in raw JSON looks like this:
> {"version": 1,"event": "view","timestamp": 1427846422377,"system":
> "DCDS","asset": "6404476","assetType": "myType","assetCategory":
> "myCategory","extras": [{"name": "videoSource","value": "mySource"},{"name":
> "playerType","value": "Article"},{"name": "duration","value":
> "202088"}],"trackingId": "155629a0-d802-11e4-13ee-6884e43d6000","ipAddress":
> "165.69.2.4","title": "myTitle"}
> > head(mydf)
> Error in as.data.frame.default(x[[i]], optional = TRUE) : 
>   cannot coerce class ""jobj"" to a data.frame
> >
> > show(mydf)
> DataFrame[localEventDtTm:timestamp, asset:string, assetCategory:string, 
> assetType:string, event:string, 
> extras:array<struct<name:string,value:string>>, ipAddress:string,
> memberId:string, system:string, timestamp:bigint, title:string, 
> trackingId:string, version:bigint]
> >
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9226) Change default log level to WARN in python REPL

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9226:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> Change default log level to WARN in python REPL
> ---
>
> Key: SPARK-9226
> URL: https://issues.apache.org/jira/browse/SPARK-9226
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Auberon López
>Priority: Minor
>
> SPARK-7261 provides separate logging properties to be used in the Scala 
> REPL, by default changing the logging level to WARN instead of INFO. The 
> same improvement can be implemented for the Python REPL, which will make 
> using PySpark interactively a cleaner experience that is closer to parity 
> with the Scala shell.
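
Until the default changes, the level can also be lowered per session from
either shell; a minimal sketch (shown here in Scala, but {{setLogLevel}} is
exposed on the Python SparkContext as well):

{code}
// Per-session workaround: override the log4j root level for this context.
sc.setLogLevel("WARN")
{code}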



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9226) Change default log level to WARN in python REPL

2015-08-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702477#comment-14702477
 ] 

Reynold Xin commented on SPARK-9226:


[~alope107] do you want to submit a pull request for this?

> Change default log level to WARN in python REPL
> ---
>
> Key: SPARK-9226
> URL: https://issues.apache.org/jira/browse/SPARK-9226
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Auberon López
>Priority: Minor
>
> SPARK-7261 provides separate logging properties to be used in the Scala 
> REPL, by default changing the logging level to WARN instead of INFO. The 
> same improvement can be implemented for the Python REPL, which will make 
> using PySpark interactively a cleaner experience that is closer to parity 
> with the Scala shell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9226) Change default log level to WARN in python REPL

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9226:
---
Fix Version/s: (was: 1.5.0)

> Change default log level to WARN in python REPL
> ---
>
> Key: SPARK-9226
> URL: https://issues.apache.org/jira/browse/SPARK-9226
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Auberon López
>Priority: Minor
>
> SPARK-7261 provides separate logging properties to be used in the Scala 
> REPL, by default changing the logging level to WARN instead of INFO. The 
> same improvement can be implemented for the Python REPL, which will make 
> using PySpark interactively a cleaner experience that is closer to parity 
> with the Scala shell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9508) Align graphx programming guide with the updated Pregel code

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9508.

   Resolution: Fixed
 Assignee: Alexander Ulanov
Fix Version/s: (was: 1.4.0)
   1.5.0

> Align graphx programming guide with the updated Pregel code
> ---
>
> Key: SPARK-9508
> URL: https://issues.apache.org/jira/browse/SPARK-9508
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>Priority: Minor
> Fix For: 1.5.0
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> SPARK-9436 simplifies the Pregel code. The graphx-programming-guide needs to 
> be updated accordingly, since it still lists the old Pregel code.
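
For context, the guide's Pregel example is the single-source shortest paths
snippet below; it is shown here only to illustrate the API shape the guide has
to keep in sync with, and assumes an existing {{graph: Graph[Long, Double]}}.

{code}
import org.apache.spark.graphx._

val sourceId: VertexId = 42L
// Distance 0.0 at the source, infinity everywhere else.
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist), // vertex program
  triplet => {                                    // send messages along improving edges
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b)                        // merge incoming messages
)
{code}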



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9545) Run Maven tests in pull request builder if title has "[maven-test]" in it

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9545:
---
Target Version/s: 1.6.0  (was: 1.5.0)

> Run Maven tests in pull request builder if title has "[maven-test]" in it
> -
>
> Key: SPARK-9545
> URL: https://issues.apache.org/jira/browse/SPARK-9545
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>
> We have infrastructure now in the build tooling for running maven tests, but 
> it's not actually used anywhere. With a very minor change we can support 
> running maven tests if the pull request title has "maven-test" in it.
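
The gating check itself is tiny; an illustrative sketch only (the helper name
is hypothetical, and the real build tooling is not written in Scala):

{code}
// Trigger the Maven build only when the pull request title opts in.
def shouldRunMavenTests(prTitle: String): Boolean =
  prTitle.toLowerCase.contains("maven-test")
{code}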



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9705) outdated Python 3 and IPython information

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9705.

   Resolution: Fixed
 Assignee: Davies Liu
Fix Version/s: 1.5.0

> outdated Python 3 and IPython information
> -
>
> Key: SPARK-9705
> URL: https://issues.apache.org/jira/browse/SPARK-9705
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Piotr Migdał
>Assignee: Davies Liu
>Priority: Blocker
>  Labels: documentation
> Fix For: 1.5.0
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> https://issues.apache.org/jira/browse/SPARK-4897 adds Python 3.4 support to 
> 1.4.0 and above, but the official docs (1.4.1, and likewise 1.4.0) still say 
> explicitly:
> "Spark 1.4.1 works with Python 2.6 or higher (but not Python 3)."
> Affected:
> https://spark.apache.org/docs/1.4.0/programming-guide.html
> https://spark.apache.org/docs/1.4.1/programming-guide.html
> Some other Python-related details are also outdated, e.g. this line:
> "For example, to launch the IPython Notebook with PyLab plot support:"
> (Since at least IPython 3.0, PyLab/Matplotlib support happens inside the 
> notebook itself, and the "--pylab inline" option has been removed.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10093) Invalid transformations for TungstenProject of struct type

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10093.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

> Invalid transformations for TungstenProject of struct type
> --
>
> Key: SPARK-10093
> URL: https://issues.apache.org/jira/browse/SPARK-10093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.5.0
>
>
> Code to reproduce:
> {code}
> val df = Seq((1,1)).toDF("a", "b")
> df.where($"a" === 1)
>   .select($"a", $"b", struct($"b"))
>   .orderBy("a")
>   .select(struct($"b"))
>   .collect()
> {code}
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 37.0 failed 4 times, most recent failure: Lost task 1.3 in stage 37.0 
> (TID 1118, 10.0.167.218): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree:
> Exchange rangepartitioning(a#197 ASC)
>  ConvertToSafe
>   TungstenProject [_1#195 AS a#197,_2#196 AS b#198,struct(_2#196 AS b#198) AS 
> struct(b)#199]
>Filter (_1#195 = 1)
> LocalTableScan [_1#195,_2#196], [[1,1]]
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:315)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:80)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:46)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:280)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:234)
> 

[jira] [Resolved] (SPARK-10096) CreateStruct/CreateArray/CreateNamedStruct broken with UDFs

2015-08-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10096.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

> CreateStruct/CreateArray/CreateNamedStruct broken with UDFs
> ---
>
> Key: SPARK-10096
> URL: https://issues.apache.org/jira/browse/SPARK-10096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 1.5.0
>
>
> {code}
> val f = udf((a: String) => a)
> val df = sc.parallelize((1,1) :: Nil).toDF("a", "b")
> df.select(struct($"a").as("s")).select(f($"s.a")).collect()
> {code}
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
> stage 9.0 failed 4 times, most recent failure: Lost task 3.3 in stage 9.0 
> (TID 78, 10.0.243.97): java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateStructUnsafe.eval(complexTypeCreator.scala:209)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:76)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:905)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10106) Add `ifelse` Column function to SparkR

2015-08-18 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-10106:

Description: 
Add a column function on a DataFrame like `ifelse` in R to SparkR.
I guess we could implement it as a combination of {{when}} and 
{{otherwise}}.

h3. Example

If {{df$x > 0}} is TRUE, then return 0, otherwise return 1.
{noformat}
ifelse(df$x > 0, 0, 1)
{noformat}

  was:
Add a column function on a DataFrame like `ifelse` in R to SparkR.
I guess we could implement it with a combination with {{when}} and 
{{otherwise}}.

h3. Example

If {{df$x > 0}} is TRUE, then return 0, else return 1.
{noformat}
ifelse(df$x > 0, 0, 1)
{noformat}


> Add `ifelse` Column function to SparkR
> --
>
> Key: SPARK-10106
> URL: https://issues.apache.org/jira/browse/SPARK-10106
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add a column function on a DataFrame like `ifelse` in R to SparkR.
> I guess we could implement it as a combination of {{when}} and 
> {{otherwise}}.
> h3. Example
> If {{df$x > 0}} is TRUE, then return 0, otherwise return 1.
> {noformat}
> ifelse(df$x > 0, 0, 1)
> {noformat}
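
For reference, the existing Scala combination the R wrapper could delegate to
(a sketch, assuming a DataFrame {{df}} with a numeric column {{x}}):

{code}
import org.apache.spark.sql.functions.{lit, when}

// Column-level equivalent of ifelse(df$x > 0, 0, 1):
// when(cond, value).otherwise(other) builds the same conditional column.
val flag = df.select(when(df("x") > 0, lit(0)).otherwise(lit(1)).as("flag"))
{code}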



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10106) Add `ifelse` Column function to SparkR

2015-08-18 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-10106:
---

 Summary: Add `ifelse` Column function to SparkR
 Key: SPARK-10106
 URL: https://issues.apache.org/jira/browse/SPARK-10106
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Yu Ishikawa


Add a column function on a DataFrame like `ifelse` in R to SparkR.
I guess we could implement it as a combination of {{when}} and 
{{otherwise}}.

h3. Example

If {{df$x > 0}} is TRUE, then return 0, else return 1.
{noformat}
ifelse(df$x > 0, 0, 1)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10105) Adding most k frequent words parameter to Word2Vec implementation

2015-08-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10105:


Assignee: Apache Spark

> Adding most k frequent words parameter to Word2Vec implementation
> -
>
> Key: SPARK-10105
> URL: https://issues.apache.org/jira/browse/SPARK-10105
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Antonio Murgia
>Assignee: Apache Spark
>Priority: Minor
>  Labels: mllib, top-k, word2vec
>
> When training Word2Vec on a really big dataset, it is hard to pick the right 
> minCount parameter; it would help to have a parameter that controls how many 
> words end up in the vocabulary.
> Furthermore, the original Word2Vec paper states that the authors only took the 
> most frequent 1M words into account.
>  
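
A rough sketch of what such a cap could do internally: keep only the k most
frequent words before the vocabulary is built. {{maxVocabSize}} below is an
assumed parameter name, not an existing Word2Vec setting.

{code}
import org.apache.spark.rdd.RDD

// Keep only the maxVocabSize most frequent words of the tokenized corpus.
def topKVocab(corpus: RDD[Seq[String]], maxVocabSize: Int): Set[String] =
  corpus.flatMap(identity)
    .map(word => (word, 1L))
    .reduceByKey(_ + _)
    .top(maxVocabSize)(Ordering.by(_._2))
    .map(_._1)
    .toSet
{code}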



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10105) Adding most k frequent words parameter to Word2Vec implementation

2015-08-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702412#comment-14702412
 ] 

Apache Spark commented on SPARK-10105:
--

User 'tmnd1991' has created a pull request for this issue:
https://github.com/apache/spark/pull/8301

> Adding most k frequent words parameter to Word2Vec implementation
> -
>
> Key: SPARK-10105
> URL: https://issues.apache.org/jira/browse/SPARK-10105
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Antonio Murgia
>Priority: Minor
>  Labels: mllib, top-k, word2vec
>
> When training Word2Vec on a really big dataset, it is hard to pick the right 
> minCount parameter; it would help to have a parameter that controls how many 
> words end up in the vocabulary.
> Furthermore, the original Word2Vec paper states that the authors only took the 
> most frequent 1M words into account.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10105) Adding most k frequent words parameter to Word2Vec implementation

2015-08-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10105:


Assignee: (was: Apache Spark)

> Adding most k frequent words parameter to Word2Vec implementation
> -
>
> Key: SPARK-10105
> URL: https://issues.apache.org/jira/browse/SPARK-10105
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Antonio Murgia
>Priority: Minor
>  Labels: mllib, top-k, word2vec
>
> When training Word2Vec on a really big dataset, it is hard to pick the right 
> minCount parameter; it would help to have a parameter that controls how many 
> words end up in the vocabulary.
> Furthermore, the original Word2Vec paper states that the authors only took the 
> most frequent 1M words into account.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10073) Python withColumn for existing column name not consistent with scala

2015-08-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10073:


Assignee: Davies Liu  (was: Apache Spark)

> Python withColumn for existing column name not consistent with scala
> 
>
> Key: SPARK-10073
> URL: https://issues.apache.org/jira/browse/SPARK-10073
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Blocker
>
> The same code as below works in Scala (replacing the old column with the new 
> one).
> {code}
> from pyspark.sql import Row
> df = sc.parallelize([Row(a=1)]).toDF()
> df.withColumn("a", df.a).select("a")
> ---
> AnalysisException Traceback (most recent call last)
>  in ()
>   1 from pyspark.sql import Row
>   2 df = sc.parallelize([Row(a=1)]).toDF()
> > 3 df.withColumn("a", df.a).select("a")
> /home/ubuntu/databricks/spark/python/pyspark/sql/dataframe.py in select(self, 
> *cols)
> 764 [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
> 765 """
> --> 766 jdf = self._jdf.select(self._jcols(*cols))
> 767 return DataFrame(jdf, self.sql_ctx)
> 768 
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /home/ubuntu/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  38 s = e.java_exception.toString()
>  39 if s.startswith('org.apache.spark.sql.AnalysisException: 
> '):
> ---> 40 raise AnalysisException(s.split(': ', 1)[1])
>  41 if s.startswith('java.lang.IllegalArgumentException: '):
>  42 raise IllegalArgumentException(s.split(': ', 1)[1])
> AnalysisException: Reference 'a' is ambiguous, could be: a#894L, a#895L.;
> {code}
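
For comparison, the Scala behaviour the Python side should match (a sketch,
assuming a SQLContext and its implicits are in scope):

{code}
import sqlContext.implicits._

// withColumn with an existing name replaces that column instead of adding a
// duplicate, so the subsequent select("a") resolves without ambiguity.
val df = sc.parallelize(Seq(Tuple1(1))).toDF("a")
df.withColumn("a", df("a")).select("a").show()
{code}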



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10073) Python withColumn for existing column name not consistent with scala

2015-08-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702408#comment-14702408
 ] 

Apache Spark commented on SPARK-10073:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8300

> Python withColumn for existing column name not consistent with scala
> 
>
> Key: SPARK-10073
> URL: https://issues.apache.org/jira/browse/SPARK-10073
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Blocker
>
> The same code as below works in Scala (replacing the old column with the new 
> one).
> {code}
> from pyspark.sql import Row
> df = sc.parallelize([Row(a=1)]).toDF()
> df.withColumn("a", df.a).select("a")
> ---
> AnalysisException Traceback (most recent call last)
>  in ()
>   1 from pyspark.sql import Row
>   2 df = sc.parallelize([Row(a=1)]).toDF()
> > 3 df.withColumn("a", df.a).select("a")
> /home/ubuntu/databricks/spark/python/pyspark/sql/dataframe.py in select(self, 
> *cols)
> 764 [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
> 765 """
> --> 766 jdf = self._jdf.select(self._jcols(*cols))
> 767 return DataFrame(jdf, self.sql_ctx)
> 768 
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /home/ubuntu/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  38 s = e.java_exception.toString()
>  39 if s.startswith('org.apache.spark.sql.AnalysisException: 
> '):
> ---> 40 raise AnalysisException(s.split(': ', 1)[1])
>  41 if s.startswith('java.lang.IllegalArgumentException: '):
>  42 raise IllegalArgumentException(s.split(': ', 1)[1])
> AnalysisException: Reference 'a' is ambiguous, could be: a#894L, a#895L.;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10073) Python withColumn for existing column name not consistent with scala

2015-08-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10073:


Assignee: Apache Spark  (was: Davies Liu)

> Python withColumn for existing column name not consistent with scala
> 
>
> Key: SPARK-10073
> URL: https://issues.apache.org/jira/browse/SPARK-10073
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Blocker
>
> The same code as below works in Scala (replacing the old column with the new 
> one).
> {code}
> from pyspark.sql import Row
> df = sc.parallelize([Row(a=1)]).toDF()
> df.withColumn("a", df.a).select("a")
> ---
> AnalysisException Traceback (most recent call last)
>  in ()
>   1 from pyspark.sql import Row
>   2 df = sc.parallelize([Row(a=1)]).toDF()
> > 3 df.withColumn("a", df.a).select("a")
> /home/ubuntu/databricks/spark/python/pyspark/sql/dataframe.py in select(self, 
> *cols)
> 764 [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
> 765 """
> --> 766 jdf = self._jdf.select(self._jcols(*cols))
> 767 return DataFrame(jdf, self.sql_ctx)
> 768 
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /home/ubuntu/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  38 s = e.java_exception.toString()
>  39 if s.startswith('org.apache.spark.sql.AnalysisException: 
> '):
> ---> 40 raise AnalysisException(s.split(': ', 1)[1])
>  41 if s.startswith('java.lang.IllegalArgumentException: '):
>  42 raise IllegalArgumentException(s.split(': ', 1)[1])
> AnalysisException: Reference 'a' is ambiguous, could be: a#894L, a#895L.;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10087) In some cases, all reducers are scheduled to the same executor

2015-08-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10087:
-
Target Version/s: 1.6.0

> In some cases, all reducers are scheduled to the same executor
> --
>
> Key: SPARK-10087
> URL: https://issues.apache.org/jira/browse/SPARK-10087
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Critical
>
> In some cases, when spark.shuffle.reduceLocality.enabled is enabled, we are 
> scheduling all reducers to the same executor (even though the cluster has 
> plenty of resources). Changing spark.shuffle.reduceLocality.enabled to false 
> resolves the problem. 
> The comments on https://github.com/apache/spark/pull/8280 provide more details 
> on the symptom of this issue.
> The query I was using is
> {code:sql}
> select
>   i_brand_id,
>   i_brand,
>   i_manufact_id,
>   i_manufact,
>   sum(ss_ext_sales_price) ext_price
> from
>   store_sales
>   join item on (store_sales.ss_item_sk = item.i_item_sk)
>   join customer on (store_sales.ss_customer_sk = customer.c_customer_sk)
>   join customer_address on (customer.c_current_addr_sk = 
> customer_address.ca_address_sk)
>   join store on (store_sales.ss_store_sk = store.s_store_sk)
>   join date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> where
>   --ss_date between '1999-11-01' and '1999-11-30'
>   ss_sold_date_sk between 2451484 and 2451513
>   and d_moy = 11
>   and d_year = 1999
>   and i_manager_id = 7
>   and substr(ca_zip, 1, 5) <> substr(s_zip, 1, 5)
> group by
>   i_brand,
>   i_brand_id,
>   i_manufact_id,
>   i_manufact
> order by
>   ext_price desc,
>   i_brand,
>   i_brand_id,
>   i_manufact_id,
>   i_manufact
> limit 100
> {code}
> The dataset is TPC-DS at scale factor 1500. To reproduce the problem, you can 
> just join store_sales with customer and make sure that only one mapper reads 
> the data of customer.
>  
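
The workaround mentioned above, spelled out as a sketch (the configuration key
is the one named in the description; the app name is illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Disable reduce-side locality preferences so reducers are spread across
// executors again.
val conf = new SparkConf()
  .setAppName("reduce-locality-repro")
  .set("spark.shuffle.reduceLocality.enabled", "false")
val sc = new SparkContext(conf)
{code}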



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction

2015-08-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10100:
-
Assignee: Herman van Hovell  (was: Yin Huai)

> AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
> --
>
> Key: SPARK-10100
> URL: https://issues.apache.org/jira/browse/SPARK-10100
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Herman van Hovell
>
> Looks like Max (probably Min) implemented based on AggregateFunction2 is 
> slower than the old MaxFunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction

2015-08-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702398#comment-14702398
 ] 

Yin Huai commented on SPARK-10100:
--

[~hvanhovell] How's the performance?

> AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
> --
>
> Key: SPARK-10100
> URL: https://issues.apache.org/jira/browse/SPARK-10100
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Looks like Max (probably Min) implemented based on AggregateFunction2 is 
> slower than the old MaxFunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10095) Should not use the private field of BigInteger

2015-08-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10095.

   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8286
[https://github.com/apache/spark/pull/8286]

> Should not use the private field of BigInteger
> --
>
> Key: SPARK-10095
> URL: https://issues.apache.org/jira/browse/SPARK-10095
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Minor
> Fix For: 1.5.0
>
>
> In UnsafeRow, we access a private field of BigInteger for better performance, 
> but it does not actually contribute much to end-to-end runtime and makes the 
> code non-portable (it may fail on other JVM implementations).
> So we should use the public API instead.
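
The public-API route referred to above, sketched; the method and parameter
names here are illustrative, not the actual UnsafeRow code:

{code}
import java.math.{BigDecimal => JBigDecimal, BigInteger}

// Rebuild the decimal from its serialized unscaled bytes using only public
// constructors, so it works on any JVM implementation.
def readDecimal(unscaledBytes: Array[Byte], scale: Int): JBigDecimal =
  new JBigDecimal(new BigInteger(unscaledBytes), scale)
{code}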



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9627) SQL job failed if the dataframe with string columns is cached

2015-08-18 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702390#comment-14702390
 ] 

Davies Liu commented on SPARK-9627:
---

I can reproduce it with latest master.

> SQL job failed if the dataframe with string columns is cached
> -
>
> Key: SPARK-9627
> URL: https://issues.apache.org/jira/browse/SPARK-9627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Cheng Lian
>Priority: Blocker
>
> {code}
> r = random.Random()
> def gen(i):
> d = date.today() - timedelta(r.randint(0, 5000))
> cat = str(r.randint(0, 20)) * 5
> c = r.randint(0, 1000)
> price = decimal.Decimal(r.randint(0, 10)) / 100
> return (d, cat, c, price)
> schema = StructType().add('date', DateType()).add('cat', 
> StringType()).add('count', ShortType()).add('price', DecimalType(5, 2))
> #df = sqlContext.createDataFrame(sc.range(1<<24).map(gen), schema)
> #df.show()
> #df.write.parquet('sales4')
> df = sqlContext.read.parquet('sales4')
> df.cache()
> df.count()
> df.show()
> print df.schema
> raw_input()
> r = df.groupBy(df.date, df.cat).agg(sum(df['count'] * df.price))
> print r.explain(True)
> r.show()
> {code}
> {code}
> StructType(List(StructField(date,DateType,true),StructField(cat,StringType,true),StructField(count,ShortType,true),StructField(price,DecimalType(5,2),true)))
> == Parsed Logical Plan ==
> 'Aggregate [date#0,cat#1], [date#0,cat#1,sum((count#2 * price#3)) AS 
> sum((count * price))#70]
>  Relation[date#0,cat#1,count#2,price#3] 
> org.apache.spark.sql.parquet.ParquetRelation@5ec8f315
> == Analyzed Logical Plan ==
> date: date, cat: string, sum((count * price)): decimal(21,2)
> Aggregate [date#0,cat#1], 
> [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * 
> price))#70]
>  Relation[date#0,cat#1,count#2,price#3] 
> org.apache.spark.sql.parquet.ParquetRelation@5ec8f315
> == Optimized Logical Plan ==
> Aggregate [date#0,cat#1], 
> [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * 
> price))#70]
>  InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, 
> StorageLevel(true, true, false, true, 1), (PhysicalRDD 
> [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None
> == Physical Plan ==
> NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) 
> ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, 
> DecimalType(11,2)2,mode=Final,isDistinct=false))
>  TungstenSort [date#0 ASC,cat#1 ASC], false, 0
>   ConvertToUnsafe
>Exchange hashpartitioning(date#0,cat#1)
> NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) 
> ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, 
> DecimalType(11,2)2,mode=Partial,isDistinct=false))
>  TungstenSort [date#0 ASC,cat#1 ASC], false, 0
>   ConvertToUnsafe
>InMemoryColumnarTableScan [date#0,cat#1,count#2,price#3], 
> (InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, 
> StorageLevel(true, true, false, true, 1), (PhysicalRDD 
> [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None)
> Code Generation: true
> == RDD ==
> None
> 15/08/04 23:21:53 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "t.py", line 34, in <module>
> r.show()
>   File "/Users/davies/work/spark/python/pyspark/sql/dataframe.py", line 258, 
> in show
> print(self._jdf.showString(n, truncate))
>   File "/Users/davies/work/spark/python/lib/py4j/java_gateway.py", line 538, 
> in __call__
> self.target_id, self.name)
>   File "/Users/davies/work/spark/python/pyspark/sql/utils.py", line 36, in 
> deco
> return f(*a, **kw)
>   File "/Users/davies/work/spark/python/lib/py4j/protocol.py", line 300, in 
> get_return_value
> format(target_id, '.', name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 
> (TID 10, localhost): java.lang.UnsupportedOperationException: tail of empty 
> list
>   at scala.collection.immutable.Nil$.tail(List.scala:339)
>   at scala.collection.immutable.Nil$.tail(List.scala:334)
>   at scala.reflect.internal.SymbolTable.popPh

[jira] [Commented] (SPARK-9627) SQL job failed if the dataframe with string columns is cached

2015-08-18 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702387#comment-14702387
 ] 

Davies Liu commented on SPARK-9627:
---

The `df.show()` call will succeed, but `df.groupBy(df.date, 
df.cat).agg(sum(df['count'] * df.price))` will fail (you need to press Enter to 
run it, or remove the `raw_input()` line).

> SQL job failed if the dataframe with string columns is cached
> -
>
> Key: SPARK-9627
> URL: https://issues.apache.org/jira/browse/SPARK-9627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Cheng Lian
>Priority: Blocker
>
> {code}
> r = random.Random()
> def gen(i):
> d = date.today() - timedelta(r.randint(0, 5000))
> cat = str(r.randint(0, 20)) * 5
> c = r.randint(0, 1000)
> price = decimal.Decimal(r.randint(0, 10)) / 100
> return (d, cat, c, price)
> schema = StructType().add('date', DateType()).add('cat', 
> StringType()).add('count', ShortType()).add('price', DecimalType(5, 2))
> #df = sqlContext.createDataFrame(sc.range(1<<24).map(gen), schema)
> #df.show()
> #df.write.parquet('sales4')
> df = sqlContext.read.parquet('sales4')
> df.cache()
> df.count()
> df.show()
> print df.schema
> raw_input()
> r = df.groupBy(df.date, df.cat).agg(sum(df['count'] * df.price))
> print r.explain(True)
> r.show()
> {code}
> {code}
> StructType(List(StructField(date,DateType,true),StructField(cat,StringType,true),StructField(count,ShortType,true),StructField(price,DecimalType(5,2),true)))
> == Parsed Logical Plan ==
> 'Aggregate [date#0,cat#1], [date#0,cat#1,sum((count#2 * price#3)) AS 
> sum((count * price))#70]
>  Relation[date#0,cat#1,count#2,price#3] 
> org.apache.spark.sql.parquet.ParquetRelation@5ec8f315
> == Analyzed Logical Plan ==
> date: date, cat: string, sum((count * price)): decimal(21,2)
> Aggregate [date#0,cat#1], 
> [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * 
> price))#70]
>  Relation[date#0,cat#1,count#2,price#3] 
> org.apache.spark.sql.parquet.ParquetRelation@5ec8f315
> == Optimized Logical Plan ==
> Aggregate [date#0,cat#1], 
> [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * 
> price))#70]
>  InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, 
> StorageLevel(true, true, false, true, 1), (PhysicalRDD 
> [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None
> == Physical Plan ==
> NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) 
> ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, 
> DecimalType(11,2)2,mode=Final,isDistinct=false))
>  TungstenSort [date#0 ASC,cat#1 ASC], false, 0
>   ConvertToUnsafe
>Exchange hashpartitioning(date#0,cat#1)
> NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) 
> ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, 
> DecimalType(11,2)2,mode=Partial,isDistinct=false))
>  TungstenSort [date#0 ASC,cat#1 ASC], false, 0
>   ConvertToUnsafe
>InMemoryColumnarTableScan [date#0,cat#1,count#2,price#3], 
> (InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, 
> StorageLevel(true, true, false, true, 1), (PhysicalRDD 
> [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None)
> Code Generation: true
> == RDD ==
> None
> 15/08/04 23:21:53 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "t.py", line 34, in <module>
> r.show()
>   File "/Users/davies/work/spark/python/pyspark/sql/dataframe.py", line 258, 
> in show
> print(self._jdf.showString(n, truncate))
>   File "/Users/davies/work/spark/python/lib/py4j/java_gateway.py", line 538, 
> in __call__
> self.target_id, self.name)
>   File "/Users/davies/work/spark/python/pyspark/sql/utils.py", line 36, in 
> deco
> return f(*a, **kw)
>   File "/Users/davies/work/spark/python/lib/py4j/protocol.py", line 300, in 
> get_return_value
> format(target_id, '.', name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 
> (TID 10, localhost): java.lang.UnsupportedOperationException: tail of empty 
> list
>   at scala.collection.immutable.N

[jira] [Commented] (SPARK-10075) Add `when` expression function in SparkR

2015-08-18 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702382#comment-14702382
 ] 

Shivaram Venkataraman commented on SPARK-10075:
---

Resolved by https://github.com/apache/spark/pull/8266

> Add `when` expression function in SparkR
> 
>
> Key: SPARK-10075
> URL: https://issues.apache.org/jira/browse/SPARK-10075
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
> Fix For: 1.5.0
>
>
> Add {{when}} function into SparkR. Before this issue, we need to implement 
> {{when}}, {{otherwise}} and so on as {{Column}} methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10075) Add `when` expression function in SparkR

2015-08-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10075.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

> Add `when` expression function in SparkR
> 
>
> Key: SPARK-10075
> URL: https://issues.apache.org/jira/browse/SPARK-10075
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
> Fix For: 1.5.0
>
>
> Add {{when}} function into SparkR. Before this issue, we need to implement 
> {{when}}, {{otherwise}} and so on as {{Column}} methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


