[jira] [Assigned] (SPARK-12951) Support spilling in generated aggregate

2016-01-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12951:
--

Assignee: Davies Liu

> Support spilling in generated aggregate
> --
>
> Key: SPARK-12951
> URL: https://issues.apache.org/jira/browse/SPARK-12951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>







[jira] [Resolved] (SPARK-13098) remove GenericInternalRowWithSchema

2016-01-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13098.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10992
[https://github.com/apache/spark/pull/10992]

> remove GenericInternalRowWithSchema
> ---
>
> Key: SPARK-13098
> URL: https://issues.apache.org/jira/browse/SPARK-13098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-12914) Generate TungstenAggregate with grouping keys

2016-01-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12914.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10855
[https://github.com/apache/spark/pull/10855]

> Generate TungstenAggregate with grouping keys
> -
>
> Key: SPARK-12914
> URL: https://issues.apache.org/jira/browse/SPARK-12914
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>







[jira] [Commented] (SPARK-13085) Add scalastyle command used in build testing

2016-01-29 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124703#comment-15124703
 ] 

Alex Bozarth commented on SPARK-13085:
--

Thanks, should we leave this open until that goes through?


> Add scalastyle command used in build testing
> 
>
> Key: SPARK-13085
> URL: https://issues.apache.org/jira/browse/SPARK-13085
> Project: Spark
>  Issue Type: Wish
>  Components: Build, Tests
>Reporter: Charles Allen
>
> As an occasional or new contributor, it is easy to screw up scala style. But 
> looking at the output logs (for example 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50300/consoleFull
>  ) it is not obvious how to fix the scala style tests, even when reading the 
> scala style guide.
> {code}
> 
> Running Scala style checks
> 
> Scalastyle checks failed at following occurrences:
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:22:0:
>  import.ordering.wrongOrderInGroup.message
> [error] (core/compile:scalastyle) errors exist
> [error] Total time: 9 s, completed Jan 28, 2016 2:11:00 PM
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received 
> return code 1
> {code}
> The ask is that the command used to check Scalastyle be printed in the log so a 
> developer does not have to wait for the build process to learn whether a pull 
> request will pass the Scala style checks.






[jira] [Commented] (SPARK-13085) Add scalastyle command used in build testing

2016-01-29 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124699#comment-15124699
 ] 

Marcelo Vanzin commented on SPARK-13085:


This is fixed in the scalastyle repo, but we need a release with the fix to 
update the Spark build. I've asked for one but no reply so far. :-/

> Add scalastyle command used in build testing
> 
>
> Key: SPARK-13085
> URL: https://issues.apache.org/jira/browse/SPARK-13085
> Project: Spark
>  Issue Type: Wish
>  Components: Build, Tests
>Reporter: Charles Allen
>
> As an occasional or new contributor, it is easy to screw up scala style. But 
> looking at the output logs (for example 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50300/consoleFull
>  ) it is not obvious how to fix the scala style tests, even when reading the 
> scala style guide.
> {code}
> 
> Running Scala style checks
> 
> Scalastyle checks failed at following occurrences:
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:22:0:
>  import.ordering.wrongOrderInGroup.message
> [error] (core/compile:scalastyle) errors exist
> [error] Total time: 9 s, completed Jan 28, 2016 2:11:00 PM
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received 
> return code 1
> {code}
> The ask is that the command used to check Scalastyle be printed in the log so a 
> developer does not have to wait for the build process to learn whether a pull 
> request will pass the Scala style checks.






[jira] [Commented] (SPARK-12469) Consistent Accumulators for Spark

2016-01-29 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124635#comment-15124635
 ] 

holdenk commented on SPARK-12469:
-

Also a related JIRA https://issues.apache.org/jira/browse/SPARK-10620

> Consistent Accumulators for Spark
> -
>
> Key: SPARK-12469
> URL: https://issues.apache.org/jira/browse/SPARK-12469
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: holdenk
>
> Tasks executed on Spark workers are unable to modify values from the driver, 
> and accumulators are the one exception for this. Accumulators in Spark are 
> implemented in such a way that when a stage is recomputed (say for cache 
> eviction) the accumulator will be updated a second time. This makes 
> accumulators inside of transformations more difficult to use for things like 
> counting invalid records (one of the primary potential use cases of 
> collecting side information during a transformation). However in some cases 
> this counting during re-evaluation is exactly the behaviour we want (say in 
> tracking total execution time for a particular function). Spark would benefit 
> from a version of accumulators which did not double count even if stages were 
> re-executed.
> Motivating example:
> {code}
> val parseTime = sc.accumulator(0L)
> val parseFailures = sc.accumulator(0L)
> val parsedData = sc.textFile(...).flatMap { line =>
>   val start = System.currentTimeMillis()
>   val parsed = Try(parse(line))
>   if (parsed.isFailure) parseFailures += 1
>   parseTime += System.currentTimeMillis() - start
>   parsed.toOption
> }
> parsedData.cache()
> val resultA = parsedData.map(...).filter(...).count()
> // some intervening code.  Almost anything could happen here -- some of 
> // parsedData may get kicked out of the cache, or an executor where data 
> // was cached might get lost
> val resultB = parsedData.filter(...).map(...).flatMap(...).count()
> // now we look at the accumulators
> {code}
> Here we would want parseFailures to be incremented only once for every 
> line which failed to parse.  Unfortunately, the current Spark accumulator API 
> doesn’t support this parseFailures use case since, if some data had 
> been evicted, it's possible that it will be double counted.
> See the full design document at 
> https://docs.google.com/document/d/1lR_l1g3zMVctZXrcVjFusq2iQVpr4XvRK_UUDsDr6nk/edit?usp=sharing
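
A minimal sketch of the de-duplication idea described above (illustrative only, not 
the API proposed in the design doc): record which (rddId, partitionId) pairs have 
already contributed, so a recomputed partition does not add its update a second time.

{code}
// Illustrative sketch only, not the proposed Spark API: a counter that applies each
// partition's update at most once, so stage re-execution (e.g. after cache eviction)
// does not double count.
class OncePerPartitionCounter {
  private var total = 0L
  private val seen = scala.collection.mutable.Set[(Int, Int)]()  // (rddId, partitionId)

  def add(rddId: Int, partitionId: Int, value: Long): Unit = synchronized {
    if (seen.add((rddId, partitionId))) {
      total += value
    }
  }

  def value: Long = synchronized(total)
}

// A retried partition reports again, but the second report is ignored.
val parseFailures = new OncePerPartitionCounter
parseFailures.add(rddId = 0, partitionId = 3, value = 2)
parseFailures.add(rddId = 0, partitionId = 3, value = 2)  // re-executed partition
assert(parseFailures.value == 2)
{code}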






[jira] [Created] (SPARK-13099) ccjlbr

2016-01-29 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13099:


 Summary: ccjlbr
 Key: SPARK-13099
 URL: https://issues.apache.org/jira/browse/SPARK-13099
 Project: Spark
  Issue Type: Bug
Reporter: Michael Armbrust









[jira] [Resolved] (SPARK-13096) Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky

2016-01-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-13096.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky
> -
>
> Key: SPARK-13096
> URL: https://issues.apache.org/jira/browse/SPARK-13096
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> We have a method in AccumulatorSuite called verifyPeakExecutionMemorySet. 
> This is used in a variety of other test suites, including (but not limited 
> to):
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - sql.SQLQuerySuite
> Lately it's been flaky ever since https://github.com/apache/spark/pull/10835 
> was merged. Note: this was an existing problem even before that patch, but it 
> was uncovered there because previously we never failed the test even if an 
> assertion error was thrown!






[jira] [Resolved] (SPARK-13088) DAG viz does not work with latest version of chrome

2016-01-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-13088.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.6.1
   1.5.3
   1.4.2

> DAG viz does not work with latest version of chrome
> ---
>
> Key: SPARK-13088
> URL: https://issues.apache.org/jira/browse/SPARK-13088
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.4.2, 1.5.3, 1.6.1, 2.0.0
>
> Attachments: Screen Shot 2016-01-29 at 10.54.14 AM.png
>
>
> See screenshot. This is because dagre-d3.js is using a function that chrome 
> no longer supports:
> {code}
> Uncaught TypeError: elem.getTransformToElement is not a function
> {code}
> We need to upgrade it.






[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster

2016-01-29 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124613#comment-15124613
 ] 

Shivaram Venkataraman commented on SPARK-5629:
--

I manually cleaned up some of the issues and pointed them to open issues on 
spark-ec2. I think for some of the issues we should just ping the issue and see 
if its still a relevant issue. Finally I think some of the S3 reading issues 
aren't spark-ec2 issues but more an issue with jets3t etc. 

> Add spark-ec2 action to return info about an existing cluster
> -
>
> Key: SPARK-5629
> URL: https://issues.apache.org/jira/browse/SPARK-5629
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> You can launch multiple clusters using spark-ec2. At some point, you might 
> just want to get some information about an existing cluster.
> Use cases include:
> * Wanting to check something about your cluster in the EC2 web console.
> * Wanting to feed information about your cluster to another tool (e.g. as 
> described in [SPARK-5627]).
> So, in addition to the [existing 
> actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
> * {{launch}}
> * {{destroy}}
> * {{login}}
> * {{stop}}
> * {{start}}
> * {{get-master}}
> * {{reboot-slaves}}
> We add a new action, {{describe}}, which describes an existing cluster if 
> given a cluster name, and all clusters if not.
> Some examples:
> {code}
> # describes all clusters launched by spark-ec2
> spark-ec2 describe
> {code}
> {code}
> # describes cluster-1
> spark-ec2 describe cluster-1
> {code}
> In combination with the proposal in [SPARK-5627]:
> {code}
> # describes cluster-3 in a machine-readable way (e.g. JSON)
> spark-ec2 describe cluster-3 --machine-readable
> {code}
> Parallels in similar tools include:
> * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
> * [{{starcluster 
> listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
>  from MIT StarCluster






[jira] [Resolved] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2016-01-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10620.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 2.0.0
>
> Attachments: accums-and-task-metrics.pdf
>
>
> This task is simply to explore whether the internal bookkeeping done by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is whether we can use a single internal code path and perhaps 
> make this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?






[jira] [Updated] (SPARK-13090) Add initial support for constraint propagation in SparkSQL

2016-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13090:
-
Assignee: Sameer Agarwal

> Add initial support for constraint propagation in SparkSQL
> --
>
> Key: SPARK-13090
> URL: https://issues.apache.org/jira/browse/SPARK-13090
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>
> The goal of this subtask is to add initial support for the basic 
> constraint framework and allow propagating constraints through filter, 
> project, union, intersect, except and joins.
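
For illustration only, a toy sketch of what "propagating constraints" means here, with 
constraints modelled as plain strings rather than Catalyst expressions (this is not the 
actual framework proposed in the sub-task): a Filter guarantees its own predicate in 
addition to whatever its child already guarantees.

{code}
// Toy sketch of constraint propagation, not Catalyst's implementation.
sealed trait Plan { def constraints: Set[String] }
case class Relation(constraints: Set[String]) extends Plan
case class Filter(predicate: String, child: Plan) extends Plan {
  // a Filter guarantees its predicate plus everything its child guarantees
  def constraints: Set[String] = child.constraints + predicate
}

// Filtering "a > 10" over a relation known to satisfy "a IS NOT NULL"
// exposes both constraints to downstream operators.
val plan = Filter("a > 10", Relation(Set("a IS NOT NULL")))
assert(plan.constraints == Set("a IS NOT NULL", "a > 10"))
{code}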






[jira] [Updated] (SPARK-13092) Track constraints in ExpressionSet

2016-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13092:
-
Assignee: Sameer Agarwal

> Track constraints in ExpressionSet
> --
>
> Key: SPARK-13092
> URL: https://issues.apache.org/jira/browse/SPARK-13092
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>
> Create a new ExpressionSet that operates similarly to an AttributeSet for 
> keeping track of constraints. A nice addition to this will be to have it do 
> other types of canonicalization (i.e. don't allow both a = b and b = a).
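
A toy sketch of the canonicalization goal mentioned above (illustrative only, not the 
actual Catalyst ExpressionSet): store symmetric equality constraints under a canonical 
operand order so that a = b and b = a collapse into one entry.

{code}
// Toy sketch, not Catalyst's ExpressionSet: entries are keyed on a canonical form so
// that symmetric constraints such as "a = b" and "b = a" are stored only once.
case class Equals(left: String, right: String) {
  // canonical form: operands in lexicographic order
  def canonical: Equals = if (left <= right) this else Equals(right, left)
}

class ConstraintSet {
  private val entries = scala.collection.mutable.LinkedHashSet[Equals]()
  def add(e: Equals): Unit = entries += e.canonical
  def contains(e: Equals): Boolean = entries.contains(e.canonical)
  def size: Int = entries.size
}

val constraints = new ConstraintSet
constraints.add(Equals("a", "b"))
constraints.add(Equals("b", "a"))  // canonicalizes to the same entry as above
assert(constraints.size == 1 && constraints.contains(Equals("b", "a")))
{code}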






[jira] [Updated] (SPARK-13091) Rewrite/Propagate constraints for Aliases

2016-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13091:
-
Assignee: Sameer Agarwal

> Rewrite/Propagate constraints for Aliases
> -
>
> Key: SPARK-13091
> URL: https://issues.apache.org/jira/browse/SPARK-13091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>
> We'd want to duplicate constraints when there is an alias (i.e. for "SELECT 
> a, a AS b", any constraints on a now apply to b).
> This is a follow-up task based on [~marmbrus]'s suggestion in 
> https://docs.google.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit#heading=h.6hjcndo36qze
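
A minimal sketch of that rewrite, with constraints modelled as strings rather than 
Catalyst expressions (illustrative only, not the actual patch): every constraint 
mentioning the aliased column is duplicated with the alias substituted in.

{code}
// Illustrative sketch only: duplicate constraints across an alias, so that for
// "SELECT a, a AS b" any constraint on a is also emitted for b.
def propagateAlias(constraints: Set[String], column: String, alias: String): Set[String] = {
  val rewritten = constraints.collect {
    case c if c.contains(column) => c.replace(column, alias)
  }
  constraints ++ rewritten
}

// A constraint "a > 10" on the child yields both "a > 10" and "b > 10".
assert(propagateAlias(Set("a > 10"), "a", "b") == Set("a > 10", "b > 10"))
{code}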






[jira] [Commented] (SPARK-12624) When schema is specified, we should give better error message if actual row length doesn't match

2016-01-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124556#comment-15124556
 ] 

Cheng Lian commented on SPARK-12624:


Yes, it should.

> When schema is specified, we should give better error message if actual row 
> length doesn't match
> 
>
> Key: SPARK-12624
> URL: https://issues.apache.org/jira/browse/SPARK-12624
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.6.1, 2.0.0
>
>
> The following code snippet reproduces this issue:
> {code}
> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
> from pyspark.sql.types import Row
> schema = StructType([StructField("a", IntegerType()), StructField("b", 
> StringType())])
> rdd = sc.parallelize(range(10)).map(lambda x: Row(a=x))
> df = sqlContext.createDataFrame(rdd, schema)
> df.show()
> {code}
> An unintuitive {{ArrayIndexOutOfBoundsException}} exception is thrown in this 
> case:
> {code}
> ...
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:227)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getAs(rows.scala:35)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.isNullAt(rows.scala:36)
> ...
> {code}
> We should give a better error message here.
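
A hedged sketch of the kind of up-front check that would produce a clearer message 
(illustrative only, not the actual fix in the linked PR): compare the row length 
against the schema before conversion and fail with an error that names both sizes.

{code}
// Illustrative sketch, not the actual fix: validate a row's length against the
// expected schema size and raise a descriptive error instead of letting an
// ArrayIndexOutOfBoundsException escape later.
def validateRowLength(rowValues: Seq[Any], fieldNames: Seq[String]): Unit = {
  if (rowValues.length != fieldNames.length) {
    throw new IllegalArgumentException(
      s"Row has ${rowValues.length} field(s) but the schema expects " +
        s"${fieldNames.length}: ${fieldNames.mkString(", ")}")
  }
}

// A one-element row against a two-field schema now fails with an actionable message.
scala.util.Try(validateRowLength(Seq(1), Seq("a", "b")))
  .failed.foreach(e => println(e.getMessage))
{code}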






[jira] [Resolved] (SPARK-13076) Rename ClientInterface to HiveClient

2016-01-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13076.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Rename ClientInterface to HiveClient
> 
>
> Key: SPARK-13076
> URL: https://issues.apache.org/jira/browse/SPARK-13076
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>







[jira] [Commented] (SPARK-13085) Add scalastyle command used in build testing

2016-01-29 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124543#comment-15124543
 ] 

Alex Bozarth commented on SPARK-13085:
--

Ah, I get it. I've been confused by this particular build failure myself 
before. import.ordering.wrongOrderInGroup.message is the failure message; I believe 
it's a bug in scalastyle, since the message should be translated into a readable 
form. [~vanzin], it looks like you added that line in the scalastyle project. How 
would we go about reporting the bug to the scalastyle project, since it's not a 
Spark bug?

> Add scalastyle command used in build testing
> 
>
> Key: SPARK-13085
> URL: https://issues.apache.org/jira/browse/SPARK-13085
> Project: Spark
>  Issue Type: Wish
>  Components: Build, Tests
>Reporter: Charles Allen
>
> As an occasional or new contributor, it is easy to screw up scala style. But 
> looking at the output logs (for example 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50300/consoleFull
>  ) it is not obvious how to fix the scala style tests, even when reading the 
> scala style guide.
> {code}
> 
> Running Scala style checks
> 
> Scalastyle checks failed at following occurrences:
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:22:0:
>  import.ordering.wrongOrderInGroup.message
> [error] (core/compile:scalastyle) errors exist
> [error] Total time: 9 s, completed Jan 28, 2016 2:11:00 PM
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received 
> return code 1
> {code}
> The ask is that the command used to check Scalastyle be printed in the log so a 
> developer does not have to wait for the build process to learn whether a pull 
> request will pass the Scala style checks.






[jira] [Resolved] (SPARK-5331) Spark workers can't find tachyon master as spark-ec2 doesn't set spark.tachyonStore.url

2016-01-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-5331.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Fixed by https://github.com/mesos/spark-ec2/pull/125

> Spark workers can't find tachyon master as spark-ec2 doesn't set 
> spark.tachyonStore.url
> ---
>
> Key: SPARK-5331
> URL: https://issues.apache.org/jira/browse/SPARK-5331
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
> Environment: Running on EC2 via modified spark-ec2 scripts (to get 
> dependencies right so tachyon starts)
> Using tachyon 0.5.0 built against hadoop 2.4.1
> Spark 1.2.0 built against tachyon 0.5.0 and hadoop 0.4.1
> Tachyon configured using the template in 0.5.0 but updated with slave list 
> and master variables etc..
>Reporter: Florian Verhein
> Fix For: 1.4.0
>
>
> ps -ef | grep Tachyon 
> shows Tachyon running on the master (and the slave) node with correct setting:
> -Dtachyon.master.hostname=ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com
> However from stderr log on worker running the SparkTachyonPi example:
> 15/01/20 06:00:56 INFO CacheManager: Partition rdd_0_0 not found, computing it
> 15/01/20 06:00:56 INFO : Trying to connect master @ localhost/127.0.0.1:19998
> 15/01/20 06:00:56 ERROR : Failed to connect (1) to master 
> localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
> 15/01/20 06:00:57 ERROR : Failed to connect (2) to master 
> localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
> 15/01/20 06:00:58 ERROR : Failed to connect (3) to master 
> localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
> 15/01/20 06:00:59 ERROR : Failed to connect (4) to master 
> localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
> 15/01/20 06:01:00 ERROR : Failed to connect (5) to master 
> localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
> 15/01/20 06:01:01 WARN TachyonBlockManager: Attempt 1 to create tachyon dir 
> null failed
> java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998 
> after 5 attempts
>   at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
>   at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
>   at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
>   at 
> org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
>   at 
> org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
>   at 
> org.apache.spark.storage.TachyonBlockManager.<init>(TachyonBlockManager.scala:57)
>   at 
> org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:94)
>   at 
> org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:88)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:773)
>   at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
>   at 
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: tachyon.org.apache.thrift.TException: Failed to connect to master 
> localhost/127.0.0.1:19998 after 5 attempts
>   at tachyon.master.MasterClient.connect(MasterClient.java:178)
>  

[jira] [Resolved] (SPARK-8980) Setup cluster with spark-ec2 scripts as non-root user

2016-01-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-8980.
--
Resolution: Won't Fix

Now tracked by https://github.com/amplab/spark-ec2/issues/1

> Setup cluster with spark-ec2 scripts as non-root user
> -
>
> Key: SPARK-8980
> URL: https://issues.apache.org/jira/browse/SPARK-8980
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Mathieu D
>Priority: Minor
>
> The spark-ec2 scripts install everything as root, which is not a best practice.
> Suggestion: use a sudoer instead (ec2-user, available in the AMI, is one).






[jira] [Resolved] (SPARK-9494) 'spark-ec2 launch' fails with anaconda python 3.4

2016-01-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-9494.
--
   Resolution: Duplicate
Fix Version/s: 1.6.0

> 'spark-ec2 launch' fails with anaconda python 3.4
> -
>
> Key: SPARK-9494
> URL: https://issues.apache.org/jira/browse/SPARK-9494
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.1
> Environment: OSX, Anaconda, Python 3.4
>Reporter: Stuart Owen
>Priority: Minor
> Fix For: 1.6.0
>
>
> Command I used to launch:
> {code:none}
> $SPARK_HOME/ec2/spark-ec2 \
> -k spark \
> -i ~/keys/spark.pem \
> -s $NUM_SLAVES \
> --copy-aws-credentials \
> --region=us-east-1 \
> --instance-type=m3.2xlarge \
> --spot-price=0.1 \
> launch $CLUSTER_NAME
> {code}
> Execution log:
> {code:none}
> /Users/stuart/Applications/anaconda/lib/python3.4/imp.py:32: 
> PendingDeprecationWarning: the imp module is deprecated in favour of 
> importlib; see the module's documentation for alternative uses
>   PendingDeprecationWarning)
> Setting up security groups...
> Searching for existing cluster july-event-fix in region us-east-1...
> Spark AMI: ami-35b1885c
> Launching instances...
> Traceback (most recent call last):
>   File 
> "/Users/stuart/Applications/spark-1.4.0-bin-hadoop2.6/ec2/spark_ec2.py", line 
> 1455, in <module>
> main()
>   File 
> "/Users/stuart/Applications/spark-1.4.0-bin-hadoop2.6/ec2/spark_ec2.py", line 
> 1447, in main
> real_main()
>   File 
> "/Users/stuart/Applications/spark-1.4.0-bin-hadoop2.6/ec2/spark_ec2.py", line 
> 1276, in real_main
> (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
>   File 
> "/Users/stuart/Applications/spark-1.4.0-bin-hadoop2.6/ec2/spark_ec2.py", line 
> 566, in launch_cluster
> name = '/dev/sd' + string.letters[i + 1]
> AttributeError: 'module' object has no attribute 'letters'
> /Users/stuart/Applications/anaconda/lib/python3.4/imp.py:32: 
> PendingDeprecationWarning: the imp module is deprecated in favour of 
> importlib; see the module's documentation for alternative uses
>   PendingDeprecationWarning)
> ERROR: Could not find a master for cluster july-event-fix in region us-east-1.
> sys:1: ResourceWarning: unclosed <socket.socket family=AddressFamily.AF_INET, 
> type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.1.2', 55678), 
> raddr=('207.171.162.181', 443)>
> {code}






[jira] [Commented] (SPARK-12624) When schema is specified, we should give better error message if actual row length doesn't match

2016-01-29 Thread Haidar Hadi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124528#comment-15124528
 ] 

Haidar Hadi commented on SPARK-12624:
-

I am using the Scala API and I see the same issue: when the schema does not 
match the row, a generic java.lang.ArrayIndexOutOfBoundsException is 
raised. Will this PR fix the Scala/Java API too?


> When schema is specified, we should give better error message if actual row 
> length doesn't match
> 
>
> Key: SPARK-12624
> URL: https://issues.apache.org/jira/browse/SPARK-12624
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.6.1, 2.0.0
>
>
> The following code snippet reproduces this issue:
> {code}
> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
> from pyspark.sql.types import Row
> schema = StructType([StructField("a", IntegerType()), StructField("b", 
> StringType())])
> rdd = sc.parallelize(range(10)).map(lambda x: Row(a=x))
> df = sqlContext.createDataFrame(rdd, schema)
> df.show()
> {code}
> An unintuitive {{ArrayIndexOutOfBoundsException}} exception is thrown in this 
> case:
> {code}
> ...
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:227)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getAs(rows.scala:35)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.isNullAt(rows.scala:36)
> ...
> {code}
> We should give a better error message here.






[jira] [Resolved] (SPARK-10462) spark-ec2 not creating ephemeral volumes

2016-01-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10462.
---
Resolution: Won't Fix

According to http://www.ec2instances.info/?filter=c4.2xlarge, these machines 
don't have any ephemeral disks, so there is nothing to mount. Closing this here; 
we can open a new issue on amplab/spark-ec2 if required.

> spark-ec2 not creating ephemeral volumes
> 
>
> Key: SPARK-10462
> URL: https://issues.apache.org/jira/browse/SPARK-10462
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.0
>Reporter: Joseph E. Gonzalez
>
> When trying to launch an ec2 cluster with the following:
> ```
> ./ec2/spark-ec2 -r us-west-2 -k mykey -i mykey.pem \
>   --hadoop-major-version=yarn \
>   --spot-price=1.0 \
>   -t c4.2xlarge -s 2 \
>   launch test-dato-yarn
> ```
> None of the nodes had an ephemeral volume and the /mnt was mounted to the 
> root 8G file-system.






[jira] [Commented] (SPARK-13085) Add scalastyle command used in build testing

2016-01-29 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124518#comment-15124518
 ] 

Charles Allen commented on SPARK-13085:
---

I wanted to know what command was failing the build and it was not obvious from 
the build logs.

> Add scalastyle command used in build testing
> 
>
> Key: SPARK-13085
> URL: https://issues.apache.org/jira/browse/SPARK-13085
> Project: Spark
>  Issue Type: Wish
>  Components: Build, Tests
>Reporter: Charles Allen
>
> As an occasional or new contributor, it is easy to screw up scala style. But 
> looking at the output logs (for example 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50300/consoleFull
>  ) it is not obvious how to fix the scala style tests, even when reading the 
> scala style guide.
> {code}
> 
> Running Scala style checks
> 
> Scalastyle checks failed at following occurrences:
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:22:0:
>  import.ordering.wrongOrderInGroup.message
> [error] (core/compile:scalastyle) errors exist
> [error] Total time: 9 s, completed Jan 28, 2016 2:11:00 PM
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received 
> return code 1
> {code}
> The ask is that the command used to check Scalastyle be printed in the log so a 
> developer does not have to wait for the build process to learn whether a pull 
> request will pass the Scala style checks.






[jira] [Resolved] (SPARK-9688) Improve spark-ec2 script to handle users that are not root

2016-01-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-9688.
--
Resolution: Won't Fix

Now tracked at https://github.com/amplab/spark-ec2/issues/1

> Improve spark-ec2 script to handle users that are not root
> --
>
> Key: SPARK-9688
> URL: https://issues.apache.org/jira/browse/SPARK-9688
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 1.4.0, 1.4.1
> Environment: All
>Reporter: Karina Uribe
>  Labels: EC2, aws-ec2, security
>   Original Estimate: 252h
>  Remaining Estimate: 252h
>
> Hi, 
> I was trying to use the spark-ec2 script from Spark to create a new Spark 
> cluster with a user different than root (--user=ec2-user). Unfortunately the 
> part of the script that attempts to copy the templates into the target 
> machines fails because it tries to rsync /etc/* and /root/*.
> This is the full traceback:
> rsync: recv_generator: mkdir "/root/spark-ec2" failed: Permission denied (13)
> *** Skipping any contents from this failed directory ***
> sent 95 bytes  received 17 bytes  224.00 bytes/sec
> total size is 1444  speedup is 12.89
> rsync error: some files/attrs were not transferred (see previous errors) 
> (code 23) at main.c(1039) [sender=3.0.6]
> Traceback (most recent call last):
>   File "/home/ec2-user/spark-1.4.0/ec2/spark_ec2.py", line 1455, in <module>
> main()
>   File "/home/ec2-user/spark-1.4.0/ec2/spark_ec2.py", line 1447, in main
> real_main()
>   File "/home/ec2-user/spark-1.4.0/ec2/spark_ec2.py", line 1283, in real_main
> setup_cluster(conn, master_nodes, slave_nodes, opts, True)
>   File "/home/ec2-user/spark-1.4.0/ec2/spark_ec2.py", line 785, in 
> setup_cluster
> modules=modules
>   File "/home/ec2-user/spark-1.4.0/ec2/spark_ec2.py", line 1049, in 
> deploy_files
> subprocess.check_call(command)
>   File "/usr/lib64/python2.7/subprocess.py", line 540, in check_call
> raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '['rsync', '-rv', '-e', 'ssh -o 
> StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i 
> /home/ec2-user/.ssh/sparkclusterkey_us_east.pem', 
> '/tmp/tmpT4Iw54/', u'ec2-u...@ec2-52-2-96-193.compute-1.amazonaws.com:/']' 
> returned non-zero exit status 23
> Is there a workaround for this? I want to improve security of our operations 
> by avoiding user root on the instances. 






[jira] [Commented] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode

2016-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124473#comment-15124473
 ] 

Apache Spark commented on SPARK-5095:
-

User 'mgummelt' has created a pull request for this issue:
https://github.com/apache/spark/pull/10993

> Support launching multiple mesos executors in coarse grained mesos mode
> ---
>
> Key: SPARK-5095
> URL: https://issues.apache.org/jira/browse/SPARK-5095
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Timothy Chen
>Assignee: Timothy Chen
>
> Currently in coarse grained mesos mode, it's expected that we only launch one 
> Mesos executor that launches one JVM process to launch multiple spark 
> executors.
> However, this becomes a problem when the JVM process launched is larger than 
> an ideal size (30 GB is the recommended value from Databricks), which causes GC 
> problems reported on the mailing list.
> We should support launching multiple executors when large enough resources 
> are available for Spark to use, and these resources are still under the 
> configured limit.
> This is also applicable when users want to specify the number of executors to be 
> launched on each node.






[jira] [Commented] (SPARK-13098) remove GenericInternalRowWithSchema

2016-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124452#comment-15124452
 ] 

Apache Spark commented on SPARK-13098:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10992

> remove GenericInternalRowWithSchema
> ---
>
> Key: SPARK-13098
> URL: https://issues.apache.org/jira/browse/SPARK-13098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Assigned] (SPARK-13098) remove GenericInternalRowWithSchema

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13098:


Assignee: (was: Apache Spark)

> remove GenericInternalRowWithSchema
> ---
>
> Key: SPARK-13098
> URL: https://issues.apache.org/jira/browse/SPARK-13098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Assigned] (SPARK-13098) remove GenericInternalRowWithSchema

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13098:


Assignee: Apache Spark

> remove GenericInternalRowWithSchema
> ---
>
> Key: SPARK-13098
> URL: https://issues.apache.org/jira/browse/SPARK-13098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-13098) remove GenericInternalRowWithSchema

2016-01-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13098:
---

 Summary: remove GenericInternalRowWithSchema
 Key: SPARK-13098
 URL: https://issues.apache.org/jira/browse/SPARK-13098
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan









[jira] [Updated] (SPARK-13084) Utilize @SerialVersionUID to avoid local class incompatibility

2016-01-29 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-13084:
---
Component/s: Spark Core

> Utilize @SerialVersionUID to avoid local class incompatibility
> --
>
> Key: SPARK-13084
> URL: https://issues.apache.org/jira/browse/SPARK-13084
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ted Yu
>
> Here is a related thread:
> http://search-hadoop.com/m/q3RTtSjjdT1BJ4Jr/local+class+incompatible&subj=local+class+incompatible+stream+classdesc+serialVersionUID
> RDD extends Serializable but doesn't have @SerialVersionUID() annotation.
> Adding @SerialVersionUID would overcome local class incompatibility across 
> minor releases.
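
For reference, Scala exposes this through the standard @SerialVersionUID annotation; 
a minimal example on a hypothetical class (not Spark's RDD itself):

{code}
// Pinning the serialVersionUID keeps serialized instances compatible across
// recompilations of the class, as long as the UID itself is left unchanged.
@SerialVersionUID(1L)
class TaskResultStub(val taskId: Long, val status: String) extends Serializable
{code}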






[jira] [Assigned] (SPARK-12299) Remove history serving functionality from standalone Master

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12299:


Assignee: Apache Spark

> Remove history serving functionality from standalone Master
> ---
>
> Key: SPARK-12299
> URL: https://issues.apache.org/jira/browse/SPARK-12299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> The standalone Master currently continues to serve the historical UIs of 
> applications that have completed and enabled event logging. This poses 
> problems, however, if the event log is very large, e.g. SPARK-6270. The 
> Master might OOM or hang while it rebuilds the UI, rejecting applications in 
> the mean time.
> Personally, I have had to make modifications in the code to disable this 
> myself, because I wanted to use event logging in standalone mode for 
> applications that produce a lot of logging.
> Removing this from the Master would simplify the process significantly. This 
> issue supersedes SPARK-12062.






[jira] [Commented] (SPARK-13085) Add scalastyle command used in build testing

2016-01-29 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124392#comment-15124392
 ] 

Alex Bozarth commented on SPARK-13085:
--

The script is located at $SPARK_HOME/dev/scalastyle.
Are you asking to add a line informing the user of the script location, or did 
you just want to know where it was? 

> Add scalastyle command used in build testing
> 
>
> Key: SPARK-13085
> URL: https://issues.apache.org/jira/browse/SPARK-13085
> Project: Spark
>  Issue Type: Wish
>  Components: Build, Tests
>Reporter: Charles Allen
>
> As an occasional or new contributor, it is easy to screw up scala style. But 
> looking at the output logs (for example 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50300/consoleFull
>  ) it is not obvious how to fix the scala style tests, even when reading the 
> scala style guide.
> {code}
> 
> Running Scala style checks
> 
> Scalastyle checks failed at following occurrences:
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:22:0:
>  import.ordering.wrongOrderInGroup.message
> [error] (core/compile:scalastyle) errors exist
> [error] Total time: 9 s, completed Jan 28, 2016 2:11:00 PM
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received 
> return code 1
> {code}
> The ask is that the command used to check Scalastyle be printed in the log so a 
> developer does not have to wait for the build process to learn whether a pull 
> request will pass the Scala style checks.






[jira] [Assigned] (SPARK-12299) Remove history serving functionality from standalone Master

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12299:


Assignee: (was: Apache Spark)

> Remove history serving functionality from standalone Master
> ---
>
> Key: SPARK-12299
> URL: https://issues.apache.org/jira/browse/SPARK-12299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>
> The standalone Master currently continues to serve the historical UIs of 
> applications that have completed and enabled event logging. This poses 
> problems, however, if the event log is very large, e.g. SPARK-6270. The 
> Master might OOM or hang while it rebuilds the UI, rejecting applications in 
> the mean time.
> Personally, I have had to make modifications in the code to disable this 
> myself, because I wanted to use event logging in standalone mode for 
> applications that produce a lot of logging.
> Removing this from the Master would simplify the process significantly. This 
> issue supersedes SPARK-12062.






[jira] [Commented] (SPARK-12299) Remove history serving functionality from standalone Master

2016-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124393#comment-15124393
 ] 

Apache Spark commented on SPARK-12299:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/10991

> Remove history serving functionality from standalone Master
> ---
>
> Key: SPARK-12299
> URL: https://issues.apache.org/jira/browse/SPARK-12299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>
> The standalone Master currently continues to serve the historical UIs of 
> applications that have completed and enabled event logging. This poses 
> problems, however, if the event log is very large, e.g. SPARK-6270. The 
> Master might OOM or hang while it rebuilds the UI, rejecting applications in 
> the mean time.
> Personally, I have had to make modifications in the code to disable this 
> myself, because I wanted to use event logging in standalone mode for 
> applications that produce a lot of logging.
> Removing this from the Master would simplify the process significantly. This 
> issue supersedes SPARK-12062.






[jira] [Commented] (SPARK-9975) Add Normalized Closeness Centrality to Spark GraphX

2016-01-29 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124375#comment-15124375
 ] 

Stavros Kontopoulos commented on SPARK-9975:


Betweenness centrality seems straightforward to add too. What do you think?


> Add Normalized Closeness Centrality to Spark GraphX
> ---
>
> Key: SPARK-9975
> URL: https://issues.apache.org/jira/browse/SPARK-9975
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Kenny Bastani
>Priority: Minor
>  Labels: features
>
> “Closeness centrality” is also defined as a proportion. First, the distance 
> of a vertex from all other vertices in the network is counted. Normalization 
> is achieved by defining closeness centrality as the number of other vertices 
> divided by this sum (De Nooy et al., 2005, p. 127). Because of this 
> normalization, closeness centrality provides a global measure about the 
> position of a vertex in the network, while betweenness centrality is defined 
> with reference to the local position of a vertex. -- Cited from 
> http://arxiv.org/pdf/0911.2719.pdf
> This request is to add normalized closeness centrality as a core graph 
> algorithm in the GraphX library. I implemented this algorithm for a graph 
> processing extension to Neo4j 
> (https://github.com/kbastani/neo4j-mazerunner#supported-algorithms) and I 
> would like to put it up for review for inclusion in Spark. This algorithm 
> is very straightforward and builds on top of the included ShortestPaths 
> (SSSP) algorithm already in the library.
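
A small sketch of the quoted definition (illustrative only, not the proposed GraphX 
implementation): normalized closeness is the number of other vertices divided by the 
sum of shortest-path distances to them.

{code}
// Illustrative computation of the definition quoted above, not the GraphX patch:
// normalized closeness = (number of other vertices) / (sum of shortest-path distances).
def normalizedCloseness(distancesToOthers: Map[Long, Double]): Double = {
  val sumOfDistances = distancesToOthers.values.sum
  if (sumOfDistances == 0.0) 0.0 else distancesToOthers.size / sumOfDistances
}

// A vertex at distances 1, 2 and 3 from the three other vertices: 3 / 6 = 0.5
println(normalizedCloseness(Map(1L -> 1.0, 2L -> 2.0, 3L -> 3.0)))
{code}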






[jira] [Updated] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs

2016-01-29 Thread Mike Seddon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Seddon updated SPARK-13097:

Description: 
To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in 
addition to the existing Double input column type.

https://github.com/apache/spark/pull/10976

A use case for this enhancement is for when a user wants to Binarize many 
similar feature columns at once using the same threshold value.

A real-world example for this would be where the authors of one of the leading 
MNIST handwriting character recognition entries converts 784 grayscale (0-255) 
pixels (28x28 pixel images) to binary if the pixel's grayscale exceeds 127.5: 
(http://arxiv.org/abs/1003.0358). With this modification the user is able to: 
VectorAssembler(784 columns)->Binarizer(127.5)->Classifier as all the pixels 
are of a similar type. 

This approach also allows much easier use of the ParamGridBuilder to test 
multiple threshold values.

I have already written the code and unit tests and have tested it in a multilayer 
perceptron classifier workflow.

  was:
To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in 
addition to the existing Double input column type.

A use case for this enhancement is for when a user wants to Binarize many 
similar feature columns at once using the same threshold value.

A real-world example for this would be where the authors of one of the leading 
MNIST handwriting character recognition entries converts 784 grayscale (0-255) 
pixels (28x28 pixel images) to binary if the pixel's grayscale exceeds 127.5: 
(http://arxiv.org/abs/1003.0358). With this modification the user is able to: 
VectorAssembler(784 columns)->Binarizer(127.5)->Classifier as all the pixels 
are of a similar type. 

This approach also allows much easier use of the ParamGridBuilder to test 
multiple threshold values.

I have already written the code and unit tests and have tested it in a multilayer 
perceptron classifier workflow.


> Extend Binarizer to allow Double AND Vector inputs
> --
>
> Key: SPARK-13097
> URL: https://issues.apache.org/jira/browse/SPARK-13097
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Mike Seddon
>Priority: Minor
>
> To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in 
> addition to the existing Double input column type.
> https://github.com/apache/spark/pull/10976
> A use case for this enhancement is for when a user wants to Binarize many 
> similar feature columns at once using the same threshold value.
> A real-world example for this would be where the authors of one of the 
> leading MNIST handwriting character recognition entries convert 784 
> grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's 
> grayscale exceeds 127.5: (http://arxiv.org/abs/1003.0358). With this 
> modification the user is able to: VectorAssembler(784 
> columns)->Binarizer(127.5)->Classifier as all the pixels are of a similar 
> type. 
> This approach also allows much easier use of the ParamGridBuilder to test 
> multiple threshold values.
> I have already written the code and unit tests and have tested in a 
> Multilayer perceptron classifier workflow.
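
To make the proposed usage concrete, here is a rough sketch of the pipeline described 
above. It shows the proposed behaviour of Binarizer on a Vector column, not the 
current API; the pixel column names and the MLP layer sizes are made up for 
illustration.

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.feature.{Binarizer, VectorAssembler}

// Assemble the 784 pixel columns into a single vector, then binarize them in
// one step. Binarizer accepting a Vector input column is the proposed change.
val assembler = new VectorAssembler()
  .setInputCols((0 until 784).map(i => s"pixel$i").toArray)  // hypothetical column names
  .setOutputCol("pixels")

val binarizer = new Binarizer()
  .setInputCol("pixels")           // a Vector column under this proposal
  .setOutputCol("features")
  .setThreshold(127.5)

val classifier = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 100, 10))  // illustrative layer sizes
  .setFeaturesCol("features")
  .setLabelCol("label")

val pipeline = new Pipeline().setStages(Array(assembler, binarizer, classifier))

// A ParamGridBuilder could then sweep binarizer.threshold over candidate
// values, which is the tuning use case mentioned in the description.
{code}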



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13097:


Assignee: (was: Apache Spark)

> Extend Binarizer to allow Double AND Vector inputs
> --
>
> Key: SPARK-13097
> URL: https://issues.apache.org/jira/browse/SPARK-13097
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Mike Seddon
>Priority: Minor
>
> To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in 
> addition to the existing Double input column type.
> A use case for this enhancement is for when a user wants to Binarize many 
> similar feature columns at once using the same threshold value.
> A real-world example for this would be where the authors of one of the 
> leading MNIST handwriting character recognition entries convert 784 
> grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's 
> grayscale exceeds 127.5: (http://arxiv.org/abs/1003.0358). With this 
> modification the user is able to: VectorAssembler(784 
> columns)->Binarizer(127.5)->Classifier as all the pixels are of a similar 
> type. 
> This approach also allows much easier use of the ParamGridBuilder to test 
> multiple threshold values.
> I have already written the code and unit tests and have tested in a 
> Multilayer perceptron classifier workflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs

2016-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124323#comment-15124323
 ] 

Apache Spark commented on SPARK-13097:
--

User 'seddonm1' has created a pull request for this issue:
https://github.com/apache/spark/pull/10976

> Extend Binarizer to allow Double AND Vector inputs
> --
>
> Key: SPARK-13097
> URL: https://issues.apache.org/jira/browse/SPARK-13097
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Mike Seddon
>Priority: Minor
>
> To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in 
> addition to the existing Double input column type.
> A use case for this enhancement is for when a user wants to Binarize many 
> similar feature columns at once using the same threshold value.
> A real-world example for this would be where the authors of one of the 
> leading MNIST handwriting character recognition entries convert 784 
> grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's 
> grayscale exceeds 127.5: (http://arxiv.org/abs/1003.0358). With this 
> modification the user is able to: VectorAssembler(784 
> columns)->Binarizer(127.5)->Classifier as all the pixels are of a similar 
> type. 
> This approach also allows much easier use of the ParamGridBuilder to test 
> multiple threshold values.
> I have already written the code and unit tests and have tested in a 
> Multilayer perceptron classifier workflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13096) Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky

2016-01-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13096:
--
Affects Version/s: 1.6.0
 Target Version/s: 2.0.0

> Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky
> -
>
> Key: SPARK-13096
> URL: https://issues.apache.org/jira/browse/SPARK-13096
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We have a method in AccumulatorSuite called verifyPeakExecutionMemorySet. 
> This is used in a variety of other test suites, including (but not limited 
> to):
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - sql.SQLQuerySuite
> Lately it's been flaky ever since https://github.com/apache/spark/pull/10835 
> was merged. Note: this was an existing problem even before that patch, but it 
> was uncovered there because previously we never failed the test even if an 
> assertion failed!
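
For context, a rough sketch of the kind of check this helper performs (a sketch 
assuming Spark 2.x, where TaskMetrics exposes peakExecutionMemory; the real helper 
inspects stage accumulators and synchronizes with the listener bus via Spark-internal 
APIs that are omitted here):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch only: run a workload and assert that at least one task reported a
// non-zero peak execution memory.
def verifyPeakExecutionMemorySet(sc: SparkContext)(body: => Unit): Unit = {
  val peaks = scala.collection.mutable.ArrayBuffer.empty[Long]
  sc.addSparkListener(new SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      Option(taskEnd.taskMetrics).foreach(m => peaks += m.peakExecutionMemory)
    }
  })
  body  // run the workload, e.g. an ExternalSorter-backed query
  // Without waiting for event delivery, this assertion can race with the
  // listener bus, which is one plausible source of the flakiness above.
  assert(peaks.exists(_ > 0), "peak execution memory was never set by any task")
}
{code}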



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13097:


Assignee: Apache Spark

> Extend Binarizer to allow Double AND Vector inputs
> --
>
> Key: SPARK-13097
> URL: https://issues.apache.org/jira/browse/SPARK-13097
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Mike Seddon
>Assignee: Apache Spark
>Priority: Minor
>
> To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in 
> addition to the existing Double input column type.
> A use case for this enhancement is for when a user wants to Binarize many 
> similar feature columns at once using the same threshold value.
> A real-world example for this would be where the authors of one of the 
> leading MNIST handwriting character recognition entries convert 784 
> grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's 
> grayscale exceeds 127.5: (http://arxiv.org/abs/1003.0358). With this 
> modification the user is able to: VectorAssembler(784 
> columns)->Binarizer(127.5)->Classifier as all the pixels are of a similar 
> type. 
> This approach also allows much easier use of the ParamGridBuilder to test 
> multiple threshold values.
> I have already written the code and unit tests and have tested in a 
> Multilayer perceptron classifier workflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs

2016-01-29 Thread Mike Seddon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Seddon updated SPARK-13097:

Description: 
To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in 
addition to the existing Double input column type.

A use case for this enhancement is for when a user wants to Binarize many 
similar feature columns at once using the same threshold value.

A real-world example for this would be where the authors of one of the leading 
MNIST handwriting character recognition entries convert 784 grayscale (0-255) 
pixels (28x28 pixel images) to binary if the pixel's grayscale exceeds 127.5: 
(http://arxiv.org/abs/1003.0358). With this modification the user is able to: 
VectorAssembler(784 columns)->Binarizer(127.5)->Classifier as all the pixels 
are of a similar type. 

This approach also allows much easier use of the ParamGridBuilder to test 
multiple threshold values.

I have already written the code and unit tests and have tested in a Multilayer 
perceptron classifier workflow.

  was:
To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in 
addition to the existing Double input column type.

A use case for this enhancement is for when a user wants to Binarize many 
similar feature columns at once using the same threshold value.

A real-world example for this would be where the authors of one of the leading 
MNIST handwriting character recognition entries convert 784 grayscale (0-255) 
pixels (28x28 pixel images) to binary if the pixel's grayscale exceeds 127.5: 
(http://arxiv.org/abs/1003.0358). With this modification the user is able to: 
VectorAssembler->Binarizer(127.5)->Classifier as all the pixels are of a 
similar type. 

This approach also allows much easier use of the ParamGridBuilder to test 
multiple threshold values.

I have already written the code and unit tests and have tested in a Multilayer 
perceptron classifier workflow.


> Extend Binarizer to allow Double AND Vector inputs
> --
>
> Key: SPARK-13097
> URL: https://issues.apache.org/jira/browse/SPARK-13097
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Mike Seddon
>Priority: Minor
>
> To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in 
> addition to the existing Double input column type.
> A use case for this enhancement is for when a user wants to Binarize many 
> similar feature columns at once using the same threshold value.
> A real-world example for this would be where the authors of one of the 
> leading MNIST handwriting character recognition entries convert 784 
> grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's 
> grayscale exceeds 127.5: (http://arxiv.org/abs/1003.0358). With this 
> modification the user is able to: VectorAssembler(784 
> columns)->Binarizer(127.5)->Classifier as all the pixels are of a similar 
> type. 
> This approach also allows much easier use of the ParamGridBuilder to test 
> multiple threshold values.
> I have already written the code and unit tests and have tested in a 
> Multilayer perceptron classifier workflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13096) Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13096:


Assignee: Andrew Or  (was: Apache Spark)

> Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky
> -
>
> Key: SPARK-13096
> URL: https://issues.apache.org/jira/browse/SPARK-13096
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We have a method in AccumulatorSuite called verifyPeakExecutionMemorySet. 
> This is used in a variety of other test suites, including (but not limited 
> to):
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - sql.SQLQuerySuite
> Lately it's been flaky ever since https://github.com/apache/spark/pull/10835 
> was merged. Note: this was an existing problem even before that patch, but it 
> was uncovered there because previously we never failed the test even if an 
> assertion failed!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13096) Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13096:


Assignee: Apache Spark  (was: Andrew Or)

> Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky
> -
>
> Key: SPARK-13096
> URL: https://issues.apache.org/jira/browse/SPARK-13096
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> We have a method in AccumulatorSuite called verifyPeakExecutionMemorySet. 
> This is used in a variety of other test suites, including (but not limited 
> to):
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - sql.SQLQuerySuite
> Lately it's been flaky ever since https://github.com/apache/spark/pull/10835 
> was merged. Note: this was an existing problem even before that patch, but it 
> was uncovered there because previously we never failed the test even if an 
> assertion failed!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13096) Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky

2016-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124306#comment-15124306
 ] 

Apache Spark commented on SPARK-13096:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/10990

> Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky
> -
>
> Key: SPARK-13096
> URL: https://issues.apache.org/jira/browse/SPARK-13096
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We have a method in AccumulatorSuite called verifyPeakExecutionMemorySet. 
> This is used in a variety of other test suites, including (but not limited 
> to):
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - sql.SQLQuerySuite
> Lately it's been flaky ever since https://github.com/apache/spark/pull/10835 
> was merged. Note: this was an existing problem even before that patch, but it 
> was uncovered there because previously we never failed the test even if an 
> assertion failed!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs

2016-01-29 Thread Mike Seddon (JIRA)
Mike Seddon created SPARK-13097:
---

 Summary: Extend Binarizer to allow Double AND Vector inputs
 Key: SPARK-13097
 URL: https://issues.apache.org/jira/browse/SPARK-13097
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Mike Seddon
Priority: Minor


To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in 
addition to the existing Double input column type.

A use case for this enhancement is for when a user wants to Binarize many 
similar feature columns at once using the same threshold value.

A real-world example for this would be where the authors of one of the leading 
MNIST handwriting character recognition entries convert 784 grayscale (0-255) 
pixels (28x28 pixel images) to binary if the pixel's grayscale exceeds 127.5: 
(http://arxiv.org/abs/1003.0358). With this modification the user is able to: 
VectorAssembler->Binarizer(127.5)->Classifier as all the pixels are of a 
similar type. 

This approach also allows much easier use of the ParamGridBuilder to test 
multiple threshold values.

I have already written the code and unit tests and have tested in a Multilayer 
perceptron classifier workflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10658) Could pyspark provide addJars() as scala spark API?

2016-01-29 Thread Tony Cebzanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Cebzanov updated SPARK-10658:
--
Comment: was deleted

(was: I'm also noticing that the %addjar magic doesn't seem to work with 
PySpark (works fine in scala.)  Is that related to this issue?  If so, will 
resolving this issue also allow %addjar to work?)

> Could pyspark provide addJars() as scala spark API? 
> 
>
> Key: SPARK-10658
> URL: https://issues.apache.org/jira/browse/SPARK-10658
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 1.3.1
> Environment: Linux ubuntu 14.01 LTS
>Reporter: ryanchou
>  Labels: features
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> My Spark program was written with the PySpark API, and it uses the spark-csv 
> jar library. 
> I could submit the task with spark-submit and add `--jars` arguments to use 
> the spark-csv jar library, as in the following command:
> ```
> /bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar  xxx.py
> ```
> However, I need to run my unit tests like:
> ```
> py.test -vvs test_xxx.py
> ```
> That cannot add jars via the '--jars' argument.
> Therefore I tried to use the SparkContext.addPyFile() API to add jars in my 
> test_xxx.py, 
> because I saw that addPyFile()'s doc mentions PACKAGES_EXTENSIONS = (.zip, 
> .py, .jar). 
> Does that mean I can add *.jar (jar libraries) by using addPyFile()?
> The code that uses addPyFile() to add jars is below: 
> ```
> self.sc.addPyFile(join(lib_path, "spark-csv_2.10-1.1.0.jar"))
> sqlContext = SQLContext(self.sc)
> self.dataframe = sqlContext.load(
> source="com.databricks.spark.csv",
> header="true",
> path="xxx.csv"
> )
> ```
> However, it doesn't work: sqlContext cannot load the 
> source (com.databricks.spark.csv).
> Eventually I found another way: setting the environment variable 
> SPARK_CLASSPATH to load the jar libraries
> ```
> SPARK_CLASSPATH="/path/xxx.jar:/path/xxx2.jar" py.test -vvs test_xxx.py
> ```
> It could load the jar libraries, and sqlContext could load the source 
> successfully, just as when adding `--jars xxx1.jar` arguments.
> So for using third-party jars in pyspark-written scripts (.py & .egg work 
> well with addPyFile()), `--jars` cannot be used in this situation 
> (py.test -vvs test_xxx.py).
> Have you ever planned to provide an API such as addJars(), as in the Scala 
> API, for adding jars to a Spark program, or is there a better way to add 
> jars that I still haven't found?
> If someone wants to add jars in pyspark-written scripts without using 
> '--jars', could you give us some suggestions?
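
For reference, a short sketch of the existing Scala-side facilities the report 
alludes to (paths are placeholders; this illustrates the current JVM API, not a new 
PySpark addJars()):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Two existing ways to attach jars from Scala without passing --jars on the
// command line: the spark.jars setting (read at context start-up) and
// SparkContext.addJar (ships the jar to executors for subsequent tasks).
val conf = new SparkConf()
  .setAppName("addJars-example")
  .setMaster("local[*]")
  .set("spark.jars", "/path/spark-csv_2.10-1.1.0.jar")  // placeholder path

val sc = new SparkContext(conf)
sc.addJar("/path/commons-csv-1.2.jar")                  // placeholder path
{code}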



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12964) SparkContext.localProperties leaked

2016-01-29 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124296#comment-15124296
 ] 

Shixiong Zhu commented on SPARK-12964:
--

+1 for removing the inheritance.

> SparkContext.localProperties leaked
> ---
>
> Key: SPARK-12964
> URL: https://issues.apache.org/jira/browse/SPARK-12964
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Daniel Darabos
>Priority: Minor
>
> I have a non-deterministic but quite reliable reproduction for a case where 
> {{spark.sql.execution.id}} is leaked. Operations then die with 
> {{spark.sql.execution.id is already set}}. These threads never recover 
> because there is nothing to unset {{spark.sql.execution.id}}. (It's not a 
> case of nested {{withNewExecutionId}} calls.)
> I have figured out why this happens. We are within a {{withNewExecutionId}} 
> block. At some point we call back to user code. (In our case this is a custom 
> data source's {{buildScan}} method.) The user code calls 
> {{scala.concurrent.Await.result}}. Because our thread is a member of a 
> {{ForkJoinPool}} (this is a Play HTTP serving thread) {{Await.result}} starts 
> a new thread. {{SparkContext.localProperties}} is cloned for this thread and 
> then it's ready to serve an HTTP request.
> The first thread then finishes waiting, finishes {{buildScan}}, and leaves 
> {{withNewExecutionId}}, clearing {{spark.sql.execution.id}} in the {{finally}} 
> block. All good. But some time later another HTTP request will be served by 
> the second thread. This thread is "poisoned" with a 
> {{spark.sql.execution.id}}. When it tries to use {{withNewExecutionId}} it 
> fails.
> 
> I don't know who's at fault here. 
>  - I don't like the {{ThreadLocal}} properties anyway. Why not create an 
> Execution object and let it wrap the operation? Then you could have two 
> executions in parallel on the same thread, and other stuff like that. It 
> would be much clearer than storing the execution ID in a kind-of-global 
> variable.
>  - Why do we have to inherit the {{ThreadLocal}} properties? I'm sure there 
> is a good reason, but this is essentially a bug-generator in my view. (It has 
> already generated https://issues.apache.org/jira/browse/SPARK-10563.)
>  - {{Await.result}} --- I never would have thought it starts threads.
>  - We probably shouldn't be calling {{Await.result}} inside {{buildScan}}.
>  - We probably shouldn't call Spark things from HTTP serving threads.
> I'm not sure what could be done on the Spark side, but I thought I should 
> mention this interesting issue. For supporting evidence here is the stack 
> trace when {{localProperties}} is getting cloned. It's contents at that point 
> are:
> {noformat}
> {spark.sql.execution.id=0, spark.rdd.scope.noOverride=true, 
> spark.rdd.scope={"id":"4","name":"ExecutedCommand"}}
> {noformat}
> {noformat}
>   at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:364) 
> [spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0]
>   at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:362) 
> [spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0]
>   at java.lang.ThreadLocal$ThreadLocalMap.(ThreadLocal.java:353) 
> [na:1.7.0_91]
>   at java.lang.ThreadLocal$ThreadLocalMap.(ThreadLocal.java:261) 
> [na:1.7.0_91]
>   at java.lang.ThreadLocal.createInheritedMap(ThreadLocal.java:236) 
> [na:1.7.0_91]   
>   at java.lang.Thread.init(Thread.java:416) [na:1.7.0_91] 
>   
>   at java.lang.Thread.init(Thread.java:349) [na:1.7.0_91] 
>   
>   at java.lang.Thread.(Thread.java:508) [na:1.7.0_91]   
>   
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.(ForkJoinWorkerThread.java:48)
>  [org.scala-lang.scala-library-2.10.5.jar:na]
>   at 
> scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.(ExecutionContextImpl.scala:42)
>  [org.scala-lang.scala-library-2.10.5.jar:na]
>   at 
> scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory.newThread(ExecutionContextImpl.scala:42)
>  [org.scala-lang.scala-library-2.10.5.jar:na]
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.tryCompensate(ForkJoinPool.java:2341) 
> [org.scala-lang.scala-library-2.10.5.jar:na]
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3638) 
> [org.scala-lang.scala-library-2.10.5.jar:na]
>   at 
> scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.blockOn(ExecutionContextImpl.scala:45)
>  [org.scala-lang.scala-library-2.10.5.jar:na]
>   at scala.concurrent.Await$.result(package.scala:107) 
> [org.scala-lang.scala-library-2.10.5.jar:na] 
>   at 
> com.lynxanalyt
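
The thread-inheritance mechanism described in this report can be reproduced outside 
Spark with a short standalone sketch (illustrative only, not Spark code; the property 
name is simply the one from the report):

{code}
import java.util.Properties

// An InheritableThreadLocal whose childValue clones the parent's Properties,
// so a thread created while spark.sql.execution.id is set starts life with a
// snapshot of it.
object InheritanceDemo extends App {
  val localProps = new InheritableThreadLocal[Properties] {
    override def initialValue(): Properties = new Properties()
    override def childValue(parent: Properties): Properties = {
      val cloned = new Properties()
      cloned.putAll(parent)  // snapshot taken at thread-creation time
      cloned
    }
  }

  localProps.get().setProperty("spark.sql.execution.id", "0")

  val child = new Thread(new Runnable {
    def run(): Unit = {
      // Prints "0": the property leaked into a thread that never set it.
      println(localProps.get().getProperty("spark.sql.execution.id"))
    }
  })
  child.start()
  child.join()

  // The parent can clear its own copy, but the child's snapshot keeps it.
  localProps.get().remove("spark.sql.execution.id")
}
{code}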

[jira] [Created] (SPARK-13096) Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky

2016-01-29 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13096:
-

 Summary: Make AccumulatorSuite#verifyPeakExecutionMemorySet less 
flaky
 Key: SPARK-13096
 URL: https://issues.apache.org/jira/browse/SPARK-13096
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Andrew Or
Assignee: Andrew Or


We have a method in AccumulatorSuite called verifyPeakExecutionMemorySet. This 
is used in a variety of other test suites, including (but not limited to):

- ExternalAppendOnlyMapSuite
- ExternalSorterSuite
- sql.SQLQuerySuite

Lately it's been flaky ever since https://github.com/apache/spark/pull/10835 
was merged. Note: this was an existing problem even before that patch, but it 
was uncovered there because previously we never failed the test even if an 
assertion failed!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10658) Could pyspark provide addJars() as scala spark API?

2016-01-29 Thread Tony Cebzanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124298#comment-15124298
 ] 

Tony Cebzanov commented on SPARK-10658:
---

I'm also noticing that the %addjar magic doesn't seem to work with PySpark 
(works fine in scala.)  Is that related to this issue?  If so, will resolving 
this issue also allow %addjar to work?

> Could pyspark provide addJars() as scala spark API? 
> 
>
> Key: SPARK-10658
> URL: https://issues.apache.org/jira/browse/SPARK-10658
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 1.3.1
> Environment: Linux ubuntu 14.01 LTS
>Reporter: ryanchou
>  Labels: features
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> My Spark program was written with the PySpark API, and it uses the spark-csv 
> jar library. 
> I could submit the task with spark-submit and add `--jars` arguments to use 
> the spark-csv jar library, as in the following command:
> ```
> /bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar  xxx.py
> ```
> However, I need to run my unit tests like:
> ```
> py.test -vvs test_xxx.py
> ```
> That cannot add jars via the '--jars' argument.
> Therefore I tried to use the SparkContext.addPyFile() API to add jars in my 
> test_xxx.py, 
> because I saw that addPyFile()'s doc mentions PACKAGES_EXTENSIONS = (.zip, 
> .py, .jar). 
> Does that mean I can add *.jar (jar libraries) by using addPyFile()?
> The code that uses addPyFile() to add jars is below: 
> ```
> self.sc.addPyFile(join(lib_path, "spark-csv_2.10-1.1.0.jar"))
> sqlContext = SQLContext(self.sc)
> self.dataframe = sqlContext.load(
> source="com.databricks.spark.csv",
> header="true",
> path="xxx.csv"
> )
> ```
> However, it doesn't work: sqlContext cannot load the 
> source (com.databricks.spark.csv).
> Eventually I found another way: setting the environment variable 
> SPARK_CLASSPATH to load the jar libraries
> ```
> SPARK_CLASSPATH="/path/xxx.jar:/path/xxx2.jar" py.test -vvs test_xxx.py
> ```
> It could load the jar libraries, and sqlContext could load the source 
> successfully, just as when adding `--jars xxx1.jar` arguments.
> So for using third-party jars in pyspark-written scripts (.py & .egg work 
> well with addPyFile()), `--jars` cannot be used in this situation 
> (py.test -vvs test_xxx.py).
> Have you ever planned to provide an API such as addJars(), as in the Scala 
> API, for adding jars to a Spark program, or is there a better way to add 
> jars that I still haven't found?
> If someone wants to add jars in pyspark-written scripts without using 
> '--jars', could you give us some suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12984) Not able to read CSV file using Spark 1.4.0

2016-01-29 Thread Jai Murugesh Rajasekaran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124284#comment-15124284
 ] 

Jai Murugesh Rajasekaran commented on SPARK-12984:
--

Thanks

> Not able to read CSV file using Spark 1.4.0
> ---
>
> Key: SPARK-12984
> URL: https://issues.apache.org/jira/browse/SPARK-12984
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
> Environment: Unix
> Hadoop 2.7.1.2.3.0.0-2557
> R 3.1.1
> Don't have Internet on the server
>Reporter: Jai Murugesh Rajasekaran
>
> Hi,
> We are trying to read a CSV file
> Downloaded following CSV related package (jar files) and configured using 
> Maven
> 1. spark-csv_2.10-1.2.0.jar
> 2. spark-csv_2.10-1.2.0-sources.jar
> 3. spark-csv_2.10-1.2.0-javadoc.jar
> Trying to execute following script
> > library(SparkR)
> > sc <- sparkR.init(appName="SparkR-DataFrame")
> Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or 
> restart R to create a new Spark Context
> > sqlContext <- sparkRSQL.init(sc)
> > setwd("/home/s/")
> > getwd()
> [1] "/home/s"
> > path <- file.path("Sample.csv")
> > Test <- read.df(sqlContext, path)
> Note: I am able to read the CSV file using regular R functions, but when I 
> tried using SparkR functions I ended up with an error.
> Initiated SparkR
> $ sh -x sparkR -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> Error Messages/Log
> $ sh -x sparkR -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> +++ dirname sparkR
> ++ cd ./..
> ++ pwd
> + export SPARK_HOME=/opt/spark-1.4.0
> + SPARK_HOME=/opt/spark-1.4.0
> + source /opt/spark-1.4.0/bin/load-spark-env.sh
>  dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ FWDIR=/opt/spark-1.4.0
> ++ '[' -z '' ']'
> ++ export SPARK_ENV_LOADED=1
> ++ SPARK_ENV_LOADED=1
>  dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ parent_dir=/opt/spark-1.4.0
> ++ user_conf_dir=/opt/spark-1.4.0/conf
> ++ '[' -f /opt/spark-1.4.0/conf/spark-env.sh ']'
> ++ set -a
> ++ . /opt/spark-1.4.0/conf/spark-env.sh
> +++ export SPARK_HOME=/opt/spark-1.4.0
> +++ SPARK_HOME=/opt/spark-1.4.0
> +++ export YARN_CONF_DIR=/etc/hadoop/conf
> +++ YARN_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> ++ set +a
> ++ '[' -z '' ']'
> ++ ASSEMBLY_DIR2=/opt/spark-1.4.0/assembly/target/scala-2.11
> ++ ASSEMBLY_DIR1=/opt/spark-1.4.0/assembly/target/scala-2.10
> ++ [[ -d /opt/spark-1.4.0/assembly/target/scala-2.11 ]]
> ++ '[' -d /opt/spark-1.4.0/assembly/target/scala-2.11 ']'
> ++ export SPARK_SCALA_VERSION=2.10
> ++ SPARK_SCALA_VERSION=2.10
> + export -f usage
> + [[ -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
>  = *--help ]]
> + [[ -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
>  = *-h ]]
> + exec /opt/spark-1.4.0/bin/spark-submit sparkr-shell-main -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> R version 3.1.1 (2014-07-10) -- "Sock it to Me"
> Copyright (C) 2014 The R Foundation for Statistical Computing
> Platform: x86_64-unknown-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information a

[jira] [Closed] (SPARK-12984) Not able to read CSV file using Spark 1.4.0

2016-01-29 Thread Jai Murugesh Rajasekaran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jai Murugesh Rajasekaran closed SPARK-12984.


> Not able to read CSV file using Spark 1.4.0
> ---
>
> Key: SPARK-12984
> URL: https://issues.apache.org/jira/browse/SPARK-12984
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
> Environment: Unix
> Hadoop 2.7.1.2.3.0.0-2557
> R 3.1.1
> Don't have Internet on the server
>Reporter: Jai Murugesh Rajasekaran
>
> Hi,
> We are trying to read a CSV file
> Downloaded following CSV related package (jar files) and configured using 
> Maven
> 1. spark-csv_2.10-1.2.0.jar
> 2. spark-csv_2.10-1.2.0-sources.jar
> 3. spark-csv_2.10-1.2.0-javadoc.jar
> Trying to execute following script
> > library(SparkR)
> > sc <- sparkR.init(appName="SparkR-DataFrame")
> Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or 
> restart R to create a new Spark Context
> > sqlContext <- sparkRSQL.init(sc)
> > setwd("/home/s/")
> > getwd()
> [1] "/home/s"
> > path <- file.path("Sample.csv")
> > Test <- read.df(sqlContext, path)
> Note: I am able to read the CSV file using regular R functions, but when I 
> tried using SparkR functions I ended up with an error.
> Initiated SparkR
> $ sh -x sparkR -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> Error Messages/Log
> $ sh -x sparkR -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> +++ dirname sparkR
> ++ cd ./..
> ++ pwd
> + export SPARK_HOME=/opt/spark-1.4.0
> + SPARK_HOME=/opt/spark-1.4.0
> + source /opt/spark-1.4.0/bin/load-spark-env.sh
>  dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ FWDIR=/opt/spark-1.4.0
> ++ '[' -z '' ']'
> ++ export SPARK_ENV_LOADED=1
> ++ SPARK_ENV_LOADED=1
>  dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ parent_dir=/opt/spark-1.4.0
> ++ user_conf_dir=/opt/spark-1.4.0/conf
> ++ '[' -f /opt/spark-1.4.0/conf/spark-env.sh ']'
> ++ set -a
> ++ . /opt/spark-1.4.0/conf/spark-env.sh
> +++ export SPARK_HOME=/opt/spark-1.4.0
> +++ SPARK_HOME=/opt/spark-1.4.0
> +++ export YARN_CONF_DIR=/etc/hadoop/conf
> +++ YARN_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> ++ set +a
> ++ '[' -z '' ']'
> ++ ASSEMBLY_DIR2=/opt/spark-1.4.0/assembly/target/scala-2.11
> ++ ASSEMBLY_DIR1=/opt/spark-1.4.0/assembly/target/scala-2.10
> ++ [[ -d /opt/spark-1.4.0/assembly/target/scala-2.11 ]]
> ++ '[' -d /opt/spark-1.4.0/assembly/target/scala-2.11 ']'
> ++ export SPARK_SCALA_VERSION=2.10
> ++ SPARK_SCALA_VERSION=2.10
> + export -f usage
> + [[ -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
>  = *--help ]]
> + [[ -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
>  = *-h ]]
> + exec /opt/spark-1.4.0/bin/spark-submit sparkr-shell-main -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> R version 3.1.1 (2014-07-10) -- "Sock it to Me"
> Copyright (C) 2014 The R Foundation for Statistical Computing
> Platform: x86_64-unknown-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
>

[jira] [Resolved] (SPARK-12984) Not able to read CSV file using Spark 1.4.0

2016-01-29 Thread Jai Murugesh Rajasekaran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jai Murugesh Rajasekaran resolved SPARK-12984.
--
Resolution: Fixed

> Not able to read CSV file using Spark 1.4.0
> ---
>
> Key: SPARK-12984
> URL: https://issues.apache.org/jira/browse/SPARK-12984
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
> Environment: Unix
> Hadoop 2.7.1.2.3.0.0-2557
> R 3.1.1
> Don't have Internet on the server
>Reporter: Jai Murugesh Rajasekaran
>
> Hi,
> We are trying to read a CSV file
> Downloaded following CSV related package (jar files) and configured using 
> Maven
> 1. spark-csv_2.10-1.2.0.jar
> 2. spark-csv_2.10-1.2.0-sources.jar
> 3. spark-csv_2.10-1.2.0-javadoc.jar
> Trying to execute following script
> > library(SparkR)
> > sc <- sparkR.init(appName="SparkR-DataFrame")
> Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or 
> restart R to create a new Spark Context
> > sqlContext <- sparkRSQL.init(sc)
> > setwd("/home/s/")
> > getwd()
> [1] "/home/s"
> > path <- file.path("Sample.csv")
> > Test <- read.df(sqlContext, path)
> Note: I am able to read the CSV file using regular R functions, but when I 
> tried using SparkR functions I ended up with an error.
> Initiated SparkR
> $ sh -x sparkR -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> Error Messages/Log
> $ sh -x sparkR -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> +++ dirname sparkR
> ++ cd ./..
> ++ pwd
> + export SPARK_HOME=/opt/spark-1.4.0
> + SPARK_HOME=/opt/spark-1.4.0
> + source /opt/spark-1.4.0/bin/load-spark-env.sh
>  dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ FWDIR=/opt/spark-1.4.0
> ++ '[' -z '' ']'
> ++ export SPARK_ENV_LOADED=1
> ++ SPARK_ENV_LOADED=1
>  dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ parent_dir=/opt/spark-1.4.0
> ++ user_conf_dir=/opt/spark-1.4.0/conf
> ++ '[' -f /opt/spark-1.4.0/conf/spark-env.sh ']'
> ++ set -a
> ++ . /opt/spark-1.4.0/conf/spark-env.sh
> +++ export SPARK_HOME=/opt/spark-1.4.0
> +++ SPARK_HOME=/opt/spark-1.4.0
> +++ export YARN_CONF_DIR=/etc/hadoop/conf
> +++ YARN_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> ++ set +a
> ++ '[' -z '' ']'
> ++ ASSEMBLY_DIR2=/opt/spark-1.4.0/assembly/target/scala-2.11
> ++ ASSEMBLY_DIR1=/opt/spark-1.4.0/assembly/target/scala-2.10
> ++ [[ -d /opt/spark-1.4.0/assembly/target/scala-2.11 ]]
> ++ '[' -d /opt/spark-1.4.0/assembly/target/scala-2.11 ']'
> ++ export SPARK_SCALA_VERSION=2.10
> ++ SPARK_SCALA_VERSION=2.10
> + export -f usage
> + [[ -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
>  = *--help ]]
> + [[ -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
>  = *-h ]]
> + exec /opt/spark-1.4.0/bin/spark-submit sparkr-shell-main -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> R version 3.1.1 (2014-07-10) -- "Sock it to Me"
> Copyright (C) 2014 The R Foundation for Statistical Computing
> Platform: x86_64-unknown-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R p

[jira] [Commented] (SPARK-12984) Not able to read CSV file using Spark 1.4.0

2016-01-29 Thread Jai Murugesh Rajasekaran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124282#comment-15124282
 ] 

Jai Murugesh Rajasekaran commented on SPARK-12984:
--

Issue resolved
Tried with --jars option & used "spark-csv_2.10-1.3.0.jar" and 
"commons-csv-1.2.jar" which solved the issue

$sparkR --jars 
/home/s/spark-csv_2.10-1.3.0.jar,/home/s/commons-csv-1.2.jar

> Not able to read CSV file using Spark 1.4.0
> ---
>
> Key: SPARK-12984
> URL: https://issues.apache.org/jira/browse/SPARK-12984
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
> Environment: Unix
> Hadoop 2.7.1.2.3.0.0-2557
> R 3.1.1
> Don't have Internet on the server
>Reporter: Jai Murugesh Rajasekaran
>
> Hi,
> We are trying to read a CSV file
> Downloaded following CSV related package (jar files) and configured using 
> Maven
> 1. spark-csv_2.10-1.2.0.jar
> 2. spark-csv_2.10-1.2.0-sources.jar
> 3. spark-csv_2.10-1.2.0-javadoc.jar
> Trying to execute following script
> > library(SparkR)
> > sc <- sparkR.init(appName="SparkR-DataFrame")
> Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or 
> restart R to create a new Spark Context
> > sqlContext <- sparkRSQL.init(sc)
> > setwd("/home/s/")
> > getwd()
> [1] "/home/s"
> > path <- file.path("Sample.csv")
> > Test <- read.df(sqlContext, path)
> Note: I am able to read the CSV file using regular R functions, but when I 
> tried using SparkR functions I ended up with an error.
> Initiated SparkR
> $ sh -x sparkR -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> Error Messages/Log
> $ sh -x sparkR -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> +++ dirname sparkR
> ++ cd ./..
> ++ pwd
> + export SPARK_HOME=/opt/spark-1.4.0
> + SPARK_HOME=/opt/spark-1.4.0
> + source /opt/spark-1.4.0/bin/load-spark-env.sh
>  dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ FWDIR=/opt/spark-1.4.0
> ++ '[' -z '' ']'
> ++ export SPARK_ENV_LOADED=1
> ++ SPARK_ENV_LOADED=1
>  dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ parent_dir=/opt/spark-1.4.0
> ++ user_conf_dir=/opt/spark-1.4.0/conf
> ++ '[' -f /opt/spark-1.4.0/conf/spark-env.sh ']'
> ++ set -a
> ++ . /opt/spark-1.4.0/conf/spark-env.sh
> +++ export SPARK_HOME=/opt/spark-1.4.0
> +++ SPARK_HOME=/opt/spark-1.4.0
> +++ export YARN_CONF_DIR=/etc/hadoop/conf
> +++ YARN_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> ++ set +a
> ++ '[' -z '' ']'
> ++ ASSEMBLY_DIR2=/opt/spark-1.4.0/assembly/target/scala-2.11
> ++ ASSEMBLY_DIR1=/opt/spark-1.4.0/assembly/target/scala-2.10
> ++ [[ -d /opt/spark-1.4.0/assembly/target/scala-2.11 ]]
> ++ '[' -d /opt/spark-1.4.0/assembly/target/scala-2.11 ']'
> ++ export SPARK_SCALA_VERSION=2.10
> ++ SPARK_SCALA_VERSION=2.10
> + export -f usage
> + [[ -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
>  = *--help ]]
> + [[ -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
>  = *-h ]]
> + exec /opt/spark-1.4.0/bin/spark-submit sparkr-shell-main -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> R version 3.1.1 (2014-07-10) -- "Sock it to Me"
> Copyright (C) 2014 The R Foundation for Statistical Computing
> Platform: x86_64-unknown-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' o

[jira] [Commented] (SPARK-13094) Dataset Aggregators do not work with complex types

2016-01-29 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124278#comment-15124278
 ] 

Deenar Toraskar commented on SPARK-13094:
-

[~marmbrus]
Still the same error on nightly snapshot build of 1.6.0 
http://people.apache.org/~pwendell/spark-nightly/spark-branch-1.6-bin/latest/spark-1.6.0-SNAPSHOT-bin-hadoop2.6.tgz


> Dataset Aggregators do not work with complex types
> --
>
> Key: SPARK-13094
> URL: https://issues.apache.org/jira/browse/SPARK-13094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Deenar Toraskar
>
> Dataset aggregators with complex types fail with "Unable to find encoder for 
> type stored in a Dataset", even though Datasets with these complex types are 
> supported.
> val arraySum = new Aggregator[Seq[Float], Seq[Float],
>   Seq[Float]] with Serializable {
>   def zero: Seq[Float] = Nil
>   // The initial value.
>   def reduce(currentSum: Seq[Float], currentRow: Seq[Float]) =
> sumArray(currentSum, currentRow)
>   def merge(sum: Seq[Float], row: Seq[Float]) = sumArray(sum, row)
>   def finish(b: Seq[Float]) = b // Return the final result.
>   def sumArray(a: Seq[Float], b: Seq[Float]): Seq[Float] = {
> (a, b) match {
>   case (Nil, Nil) => Nil
>   case (Nil, row) => row
>   case (sum, Nil) => sum
>   case (sum, row) => (a, b).zipped.map { case (a, b) => a + b }
> }
>   }
> }.toColumn
> :47: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing sqlContext.implicits._  Support for serializing other 
> types will be added in future releases.
>}.toColumn



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13082) sqlCtx.read.json() doesn't work with PythonRDD

2016-01-29 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-13082.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.0.0
   1.6.1

> sqlCtx.read.json() doesn't work with PythonRDD
> --
>
> Key: SPARK-13082
> URL: https://issues.apache.org/jira/browse/SPARK-13082
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: Tested on macosx 10.10 using Spark 1.6
>Reporter: Gaëtan Lehmann
>Assignee: Shixiong Zhu
> Fix For: 1.6.1, 2.0.0
>
>
> This code works without problem:
>   sqlCtx.read.json(sqlCtx.range(10).toJSON())
> but these ones fail with the traceback below:
>   sqlCtx.read.json(sc.parallelize(['{"id":1}']*10))
>   sqlCtx.read.json(sqlCtx.range(10).toJSON().pipe("cat"))
>   sqlCtx.read.json(sqlCtx.range(10).toJSON().map(lambda x: x))
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sqlCtx.read.json(sqlCtx.range(10).toJSON().map(lambda x: x))
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/readwriter.pyc
>  in json(self, path, schema)
> 178 return 
> self._df(self._jreader.json(self._sqlContext._sc._jvm.PythonUtils.toSeq(path)))
> 179 elif isinstance(path, RDD):
> --> 180 return self._df(self._jreader.json(path._jrdd))
> 181 else:
> 182 raise TypeError("path can be only string or RDD")
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 811 answer = self.gateway_client.send_command(command)
> 812 return_value = get_return_value(
> --> 813 answer, self.gateway_client, self.target_id, self.name)
> 814 
> 815 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/utils.pyc in 
> deco(*a, **kw)
>  43 def deco(*a, **kw):
>  44 try:
> ---> 45 return f(*a, **kw)
>  46 except py4j.protocol.Py4JJavaError as e:
>  47 s = e.java_exception.toString()
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 306 raise Py4JJavaError(
> 307 "An error occurred while calling {0}{1}{2}.\n".
> --> 308 format(target_id, ".", name), value)
> 309 else:
> 310 raise Py4JError(
> Py4JJavaError: An error occurred while calling o961.json.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 55.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 55.0 (TID 149, localhost): java.lang.ClassCastException: [B cannot be cast to 
> java.lang.String
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:53)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPo

[jira] [Updated] (SPARK-13055) SQLHistoryListener throws ClassCastException

2016-01-29 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13055:
-
Affects Version/s: (was: 1.5.0)

> SQLHistoryListener throws ClassCastException
> 
>
> Key: SPARK-13055
> URL: https://issues.apache.org/jira/browse/SPARK-13055
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> {code}
> 16/01/27 18:46:28 ERROR ReplayListenerBus: Listener SQLHistoryListener threw 
> an exception
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> java.lang.Long
> at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
> at 
> org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1$$anonfun$5.apply(SQLListener.scala:334)
> at 
> org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1$$anonfun$5.apply(SQLListener.scala:334)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:334)
> at 
> org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:332)
> {code}
> SQLHistoryListener listens on SparkListenerTaskEnd events, which contain 
> non-SQL accumulators as well. We try to cast all accumulators we encounter to 
> Long, resulting in an error like this one.
> Note: this was a problem even before internal accumulators were introduced. 
> If  the task used a user accumulator of a type other than Long, we would 
> still see this.
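
A hedged sketch of the defensive handling this description calls for (the shape of a 
possible fix, not the actual patch):

{code}
// Instead of unconditionally unboxing every accumulator update to Long, keep
// only values that really are numeric and skip the rest (for example user
// accumulators of other types).
def toLongMetric(update: Any): Option[Long] = update match {
  case l: Long   => Some(l)
  case i: Int    => Some(i.toLong)
  case s: String => scala.util.Try(s.toLong).toOption  // history files carry string values
  case _         => None
}
{code}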



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13055) SQLHistoryListener throws ClassCastException

2016-01-29 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-13055.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> SQLHistoryListener throws ClassCastException
> 
>
> Key: SPARK-13055
> URL: https://issues.apache.org/jira/browse/SPARK-13055
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> {code}
> 16/01/27 18:46:28 ERROR ReplayListenerBus: Listener SQLHistoryListener threw 
> an exception
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> java.lang.Long
> at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
> at 
> org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1$$anonfun$5.apply(SQLListener.scala:334)
> at 
> org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1$$anonfun$5.apply(SQLListener.scala:334)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:334)
> at 
> org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:332)
> {code}
> SQLHistoryListener listens on SparkListenerTaskEnd events, which contain 
> non-SQL accumulators as well. We try to cast all accumulators we encounter to 
> Long, resulting in an error like this one.
> Note: this was a problem even before internal accumulators were introduced. 
> If  the task used a user accumulator of a type other than Long, we would 
> still see this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13095) improve performance of hash join with dimension table

2016-01-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13095:
---
Component/s: SQL

> improve performance of hash join with dimension table
> -
>
> Key: SPARK-13095
> URL: https://issues.apache.org/jira/browse/SPARK-13095
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> The join key is usually an integer or long (a unique primary key), so we
> could have a special HashRelation for them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13095) improve performance of hash join with dimension table

2016-01-29 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13095:
--

 Summary: improve performance of hash join with dimension table
 Key: SPARK-13095
 URL: https://issues.apache.org/jira/browse/SPARK-13095
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu
Assignee: Davies Liu


The join key is usually an integer or long (a unique primary key), so we could
have a special HashRelation for them.
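
As a rough illustration of the idea (the class and its shape are assumptions, not
the implementation that was eventually built), a relation specialized for integral
keys can keep the build side in a primitive long-keyed map, so probes avoid boxing
and generic row hashing:

{code}
import scala.collection.mutable

// Hypothetical sketch: a build-side relation keyed directly by a unique Long key.
class LongKeyedRelation[Row](rows: Iterable[(Long, Row)]) {
  private val map = mutable.LongMap(rows.toSeq: _*)   // primitive-keyed, no boxing

  // Probe with the already-extracted key; at most one matching build row.
  def get(key: Long): Option[Row] = map.get(key)
}
{code}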



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13094) Dataset Aggregators do not work with complex types

2016-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13094:
-
Target Version/s: 1.6.1

> Dataset Aggregators do not work with complex types
> --
>
> Key: SPARK-13094
> URL: https://issues.apache.org/jira/browse/SPARK-13094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Deenar Toraskar
>
> Dataset aggregators with complex types fail with unable to find encoder for 
> type stored in a Dataset. Though Datasets with these complex types are 
> supported.
> val arraySum = new Aggregator[Seq[Float], Seq[Float],
>   Seq[Float]] with Serializable {
>   def zero: Seq[Float] = Nil
>   // The initial value.
>   def reduce(currentSum: Seq[Float], currentRow: Seq[Float]) =
> sumArray(currentSum, currentRow)
>   def merge(sum: Seq[Float], row: Seq[Float]) = sumArray(sum, row)
>   def finish(b: Seq[Float]) = b // Return the final result.
>   def sumArray(a: Seq[Float], b: Seq[Float]): Seq[Float] = {
> (a, b) match {
>   case (Nil, Nil) => Nil
>   case (Nil, row) => row
>   case (sum, Nil) => sum
>   case (sum, row) => (a, b).zipped.map { case (a, b) => a + b }
> }
>   }
> }.toColumn
> :47: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing sqlContext.implicits._  Support for serializing other 
> types will be added in future releases.
>}.toColumn



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13094) Dataset Aggregators do not work with complex types

2016-01-29 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124206#comment-15124206
 ] 

Deenar Toraskar commented on SPARK-13094:
-

Downloading it now, will update the JIRA after rerunning my code

> Dataset Aggregators do not work with complex types
> --
>
> Key: SPARK-13094
> URL: https://issues.apache.org/jira/browse/SPARK-13094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Deenar Toraskar
>
> Dataset aggregators with complex types fail with unable to find encoder for 
> type stored in a Dataset. Though Datasets with these complex types are 
> supported.
> val arraySum = new Aggregator[Seq[Float], Seq[Float],
>   Seq[Float]] with Serializable {
>   def zero: Seq[Float] = Nil
>   // The initial value.
>   def reduce(currentSum: Seq[Float], currentRow: Seq[Float]) =
> sumArray(currentSum, currentRow)
>   def merge(sum: Seq[Float], row: Seq[Float]) = sumArray(sum, row)
>   def finish(b: Seq[Float]) = b // Return the final result.
>   def sumArray(a: Seq[Float], b: Seq[Float]): Seq[Float] = {
> (a, b) match {
>   case (Nil, Nil) => Nil
>   case (Nil, row) => row
>   case (sum, Nil) => sum
>   case (sum, row) => (a, b).zipped.map { case (a, b) => a + b }
> }
>   }
> }.toColumn
> :47: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing sqlContext.implicits._  Support for serializing other 
> types will be added in future releases.
>}.toColumn



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12798) Broadcast hash join

2016-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124207#comment-15124207
 ] 

Apache Spark commented on SPARK-12798:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10989

> Broadcast hash join
> ---
>
> Key: SPARK-12798
> URL: https://issues.apache.org/jira/browse/SPARK-12798
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> [~davies] it's minor but you rarely set component on your JIRAs. I think it 
> helps. Just tag with pyspark or sql.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13094) Dataset Aggregators do not work with complex types

2016-01-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124205#comment-15124205
 ] 

Michael Armbrust commented on SPARK-13094:
--

Sorry, I think I was unclear. When I said branch-1.6 I meant what is currently
on GitHub, not the 1.6.0 release.

> Dataset Aggregators do not work with complex types
> --
>
> Key: SPARK-13094
> URL: https://issues.apache.org/jira/browse/SPARK-13094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Deenar Toraskar
>
> Dataset aggregators with complex types fail with unable to find encoder for 
> type stored in a Dataset. Though Datasets with these complex types are 
> supported.
> val arraySum = new Aggregator[Seq[Float], Seq[Float],
>   Seq[Float]] with Serializable {
>   def zero: Seq[Float] = Nil
>   // The initial value.
>   def reduce(currentSum: Seq[Float], currentRow: Seq[Float]) =
> sumArray(currentSum, currentRow)
>   def merge(sum: Seq[Float], row: Seq[Float]) = sumArray(sum, row)
>   def finish(b: Seq[Float]) = b // Return the final result.
>   def sumArray(a: Seq[Float], b: Seq[Float]): Seq[Float] = {
> (a, b) match {
>   case (Nil, Nil) => Nil
>   case (Nil, row) => row
>   case (sum, Nil) => sum
>   case (sum, row) => (a, b).zipped.map { case (a, b) => a + b }
> }
>   }
> }.toColumn
> :47: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing sqlContext.implicits._  Support for serializing other 
> types will be added in future releases.
>}.toColumn



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13082) sqlCtx.real.json() doesn't work with PythonRDD

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13082:


Assignee: (was: Apache Spark)

> sqlCtx.real.json() doesn't work with PythonRDD
> --
>
> Key: SPARK-13082
> URL: https://issues.apache.org/jira/browse/SPARK-13082
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: Tested on macosx 10.10 using Spark 1.6
>Reporter: Gaëtan Lehmann
>
> This code works without problem:
>   sqlCtx.read.json(sqlCtx.range(10).toJSON())
> but these ones fail with the traceback below:
>   sqlCtx.read.json(sc.parallelize(['{"id":1}']*10))
>   sqlCtx.read.json(sqlCtx.range(10).toJSON().pipe("cat"))
>   sqlCtx.read.json(sqlCtx.range(10).toJSON().map(lambda x: x))
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sqlCtx.read.json(sqlCtx.range(10).toJSON().map(lambda x: x))
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/readwriter.pyc
>  in json(self, path, schema)
> 178 return 
> self._df(self._jreader.json(self._sqlContext._sc._jvm.PythonUtils.toSeq(path)))
> 179 elif isinstance(path, RDD):
> --> 180 return self._df(self._jreader.json(path._jrdd))
> 181 else:
> 182 raise TypeError("path can be only string or RDD")
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 811 answer = self.gateway_client.send_command(command)
> 812 return_value = get_return_value(
> --> 813 answer, self.gateway_client, self.target_id, self.name)
> 814 
> 815 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/utils.pyc in 
> deco(*a, **kw)
>  43 def deco(*a, **kw):
>  44 try:
> ---> 45 return f(*a, **kw)
>  46 except py4j.protocol.Py4JJavaError as e:
>  47 s = e.java_exception.toString()
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 306 raise Py4JJavaError(
> 307 "An error occurred while calling {0}{1}{2}.\n".
> --> 308 format(target_id, ".", name), value)
> 309 else:
> 310 raise Py4JError(
> Py4JJavaError: An error occurred while calling o961.json.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 55.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 55.0 (TID 149, localhost): java.lang.ClassCastException: [B cannot be cast to 
> java.lang.String
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:53)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.T

[jira] [Commented] (SPARK-13082) sqlCtx.real.json() doesn't work with PythonRDD

2016-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124189#comment-15124189
 ] 

Apache Spark commented on SPARK-13082:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/10988

> sqlCtx.real.json() doesn't work with PythonRDD
> --
>
> Key: SPARK-13082
> URL: https://issues.apache.org/jira/browse/SPARK-13082
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: Tested on macosx 10.10 using Spark 1.6
>Reporter: Gaëtan Lehmann
>
> This code works without problem:
>   sqlCtx.read.json(sqlCtx.range(10).toJSON())
> but these ones fail with the traceback below:
>   sqlCtx.read.json(sc.parallelize(['{"id":1}']*10))
>   sqlCtx.read.json(sqlCtx.range(10).toJSON().pipe("cat"))
>   sqlCtx.read.json(sqlCtx.range(10).toJSON().map(lambda x: x))
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sqlCtx.read.json(sqlCtx.range(10).toJSON().map(lambda x: x))
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/readwriter.pyc
>  in json(self, path, schema)
> 178 return 
> self._df(self._jreader.json(self._sqlContext._sc._jvm.PythonUtils.toSeq(path)))
> 179 elif isinstance(path, RDD):
> --> 180 return self._df(self._jreader.json(path._jrdd))
> 181 else:
> 182 raise TypeError("path can be only string or RDD")
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 811 answer = self.gateway_client.send_command(command)
> 812 return_value = get_return_value(
> --> 813 answer, self.gateway_client, self.target_id, self.name)
> 814 
> 815 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/utils.pyc in 
> deco(*a, **kw)
>  43 def deco(*a, **kw):
>  44 try:
> ---> 45 return f(*a, **kw)
>  46 except py4j.protocol.Py4JJavaError as e:
>  47 s = e.java_exception.toString()
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 306 raise Py4JJavaError(
> 307 "An error occurred while calling {0}{1}{2}.\n".
> --> 308 format(target_id, ".", name), value)
> 309 else:
> 310 raise Py4JError(
> Py4JJavaError: An error occurred while calling o961.json.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 55.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 55.0 (TID 149, localhost): java.lang.ClassCastException: [B cannot be cast to 
> java.lang.String
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:53)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 

[jira] [Commented] (SPARK-13082) sqlCtx.real.json() doesn't work with PythonRDD

2016-01-29 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124191#comment-15124191
 ] 

Shixiong Zhu commented on SPARK-13082:
--

This one was actually fixed by SPARK-12600. I just sent a PR to backport the
fix to branch-1.6.
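
The failure mode, as the traceback below shows, is that rows arriving from a
PythonRDD reach the JSON schema inference as raw byte arrays and are then cast to
String. A tolerant conversion along these lines (a sketch, not the actual patch)
illustrates the kind of handling the fix needs:

{code}
import java.nio.charset.StandardCharsets

// Hypothetical sketch: convert whatever the Python side handed over into a
// JSON string instead of casting it.
def asJsonString(record: Any): String = record match {
  case s: String      => s
  case b: Array[Byte] => new String(b, StandardCharsets.UTF_8)
  case other          => other.toString
}
{code}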

> sqlCtx.real.json() doesn't work with PythonRDD
> --
>
> Key: SPARK-13082
> URL: https://issues.apache.org/jira/browse/SPARK-13082
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: Tested on macosx 10.10 using Spark 1.6
>Reporter: Gaëtan Lehmann
>
> This code works without problem:
>   sqlCtx.read.json(sqlCtx.range(10).toJSON())
> but these ones fail with the traceback below:
>   sqlCtx.read.json(sc.parallelize(['{"id":1}']*10))
>   sqlCtx.read.json(sqlCtx.range(10).toJSON().pipe("cat"))
>   sqlCtx.read.json(sqlCtx.range(10).toJSON().map(lambda x: x))
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sqlCtx.read.json(sqlCtx.range(10).toJSON().map(lambda x: x))
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/readwriter.pyc
>  in json(self, path, schema)
> 178 return 
> self._df(self._jreader.json(self._sqlContext._sc._jvm.PythonUtils.toSeq(path)))
> 179 elif isinstance(path, RDD):
> --> 180 return self._df(self._jreader.json(path._jrdd))
> 181 else:
> 182 raise TypeError("path can be only string or RDD")
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 811 answer = self.gateway_client.send_command(command)
> 812 return_value = get_return_value(
> --> 813 answer, self.gateway_client, self.target_id, self.name)
> 814 
> 815 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/utils.pyc in 
> deco(*a, **kw)
>  43 def deco(*a, **kw):
>  44 try:
> ---> 45 return f(*a, **kw)
>  46 except py4j.protocol.Py4JJavaError as e:
>  47 s = e.java_exception.toString()
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 306 raise Py4JJavaError(
> 307 "An error occurred while calling {0}{1}{2}.\n".
> --> 308 format(target_id, ".", name), value)
> 309 else:
> 310 raise Py4JError(
> Py4JJavaError: An error occurred while calling o961.json.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 55.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 55.0 (TID 149, localhost): java.lang.ClassCastException: [B cannot be cast to 
> java.lang.String
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:53)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at

[jira] [Assigned] (SPARK-13082) sqlCtx.real.json() doesn't work with PythonRDD

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13082:


Assignee: Apache Spark

> sqlCtx.real.json() doesn't work with PythonRDD
> --
>
> Key: SPARK-13082
> URL: https://issues.apache.org/jira/browse/SPARK-13082
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: Tested on macosx 10.10 using Spark 1.6
>Reporter: Gaëtan Lehmann
>Assignee: Apache Spark
>
> This code works without problem:
>   sqlCtx.read.json(sqlCtx.range(10).toJSON())
> but these ones fail with the traceback below:
>   sqlCtx.read.json(sc.parallelize(['{"id":1}']*10))
>   sqlCtx.read.json(sqlCtx.range(10).toJSON().pipe("cat"))
>   sqlCtx.read.json(sqlCtx.range(10).toJSON().map(lambda x: x))
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sqlCtx.read.json(sqlCtx.range(10).toJSON().map(lambda x: x))
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/readwriter.pyc
>  in json(self, path, schema)
> 178 return 
> self._df(self._jreader.json(self._sqlContext._sc._jvm.PythonUtils.toSeq(path)))
> 179 elif isinstance(path, RDD):
> --> 180 return self._df(self._jreader.json(path._jrdd))
> 181 else:
> 182 raise TypeError("path can be only string or RDD")
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 811 answer = self.gateway_client.send_command(command)
> 812 return_value = get_return_value(
> --> 813 answer, self.gateway_client, self.target_id, self.name)
> 814 
> 815 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/utils.pyc in 
> deco(*a, **kw)
>  43 def deco(*a, **kw):
>  44 try:
> ---> 45 return f(*a, **kw)
>  46 except py4j.protocol.Py4JJavaError as e:
>  47 s = e.java_exception.toString()
> /usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 306 raise Py4JJavaError(
> 307 "An error occurred while calling {0}{1}{2}.\n".
> --> 308 format(target_id, ".", name), value)
> 309 else:
> 310 raise Py4JError(
> Py4JJavaError: An error occurred while calling o961.json.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 55.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 55.0 (TID 149, localhost): java.lang.ClassCastException: [B cannot be cast to 
> java.lang.String
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:53)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:61

[jira] [Created] (SPARK-13094) Dataset Aggregators do not work with complex types

2016-01-29 Thread Deenar Toraskar (JIRA)
Deenar Toraskar created SPARK-13094:
---

 Summary: Dataset Aggregators do not work with complex types
 Key: SPARK-13094
 URL: https://issues.apache.org/jira/browse/SPARK-13094
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Deenar Toraskar


Dataset aggregators with complex types fail with "Unable to find encoder for
type stored in a Dataset", even though Datasets with these complex types are
supported.

val arraySum = new Aggregator[Seq[Float], Seq[Float],
  Seq[Float]] with Serializable {
  def zero: Seq[Float] = Nil
  // The initial value.
  def reduce(currentSum: Seq[Float], currentRow: Seq[Float]) =
sumArray(currentSum, currentRow)
  def merge(sum: Seq[Float], row: Seq[Float]) = sumArray(sum, row)
  def finish(b: Seq[Float]) = b // Return the final result.
  def sumArray(a: Seq[Float], b: Seq[Float]): Seq[Float] = {
(a, b) match {
  case (Nil, Nil) => Nil
  case (Nil, row) => row
  case (sum, Nil) => sum
  case (sum, row) => (a, b).zipped.map { case (a, b) => a + b }
}
  }
}.toColumn

:47: error: Unable to find encoder for type stored in a Dataset.  
Primitive types (Int, String, etc) and Product types (case classes) are 
supported by importing sqlContext.implicits._  Support for serializing other 
types will be added in future releases.
   }.toColumn
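
If upgrading to the current branch-1.6 is not an option, one workaround sketch
(an assumption based on the error message above, not the upstream fix) is to wrap
the Seq[Float] in a case class, since Product types get encoders from
sqlContext.implicits._:

{code}
// Hypothetical workaround: give the buffer and output a Product type so the
// implicit case-class encoder applies. Assumes a spark-shell-like session with
// sqlContext.implicits._ already imported.
import org.apache.spark.sql.expressions.Aggregator

case class FloatVec(values: Seq[Float])

val arraySumWrapped = new Aggregator[FloatVec, FloatVec, FloatVec] with Serializable {
  def zero: FloatVec = FloatVec(Nil)
  def reduce(sum: FloatVec, row: FloatVec): FloatVec = merge(sum, row)
  def merge(a: FloatVec, b: FloatVec): FloatVec = (a.values, b.values) match {
    case (Nil, r) => FloatVec(r)
    case (s, Nil) => FloatVec(s)
    case (s, r)   => FloatVec((s, r).zipped.map(_ + _))
  }
  def finish(b: FloatVec): FloatVec = b
}.toColumn
{code}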



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13093) improve null check in nullSafeCodeGen for unary, binary and ternary expression

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13093:


Assignee: (was: Apache Spark)

> improve null check in nullSafeCodeGen for unary, binary and ternary expression
> --
>
> Key: SPARK-13093
> URL: https://issues.apache.org/jira/browse/SPARK-13093
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13093) improve null check in nullSafeCodeGen for unary, binary and ternary expression

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13093:


Assignee: Apache Spark

> improve null check in nullSafeCodeGen for unary, binary and ternary expression
> --
>
> Key: SPARK-13093
> URL: https://issues.apache.org/jira/browse/SPARK-13093
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13093) improve null check in nullSafeCodeGen for unary, binary and ternary expression

2016-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124165#comment-15124165
 ] 

Apache Spark commented on SPARK-13093:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10987

> improve null check in nullSafeCodeGen for unary, binary and ternary expression
> --
>
> Key: SPARK-13093
> URL: https://issues.apache.org/jira/browse/SPARK-13093
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13093) improve null check in nullSafeCodeGen for unary, binary and ternary expression

2016-01-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13093:
---

 Summary: improve null check in nullSafeCodeGen for unary, binary 
and ternary expression
 Key: SPARK-13093
 URL: https://issues.apache.org/jira/browse/SPARK-13093
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13092) Track constraints in ExpressionSet

2016-01-29 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-13092:
--

 Summary: Track constraints in ExpressionSet
 Key: SPARK-13092
 URL: https://issues.apache.org/jira/browse/SPARK-13092
 Project: Spark
  Issue Type: Sub-task
Reporter: Sameer Agarwal


Create a new ExpressionSet that operates similarly to an AttributeSet for
keeping track of constraints. A nice addition would be to have it perform other
types of canonicalization (i.e. don't allow both a = b and b = a).
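
A minimal, Catalyst-free sketch of that idea, with a caller-supplied canonicalizer
standing in for whatever normalization the real ExpressionSet would do:

{code}
// Hypothetical sketch: deduplicate by canonical form while preserving the
// originally-added expressions, so `a = b` and `b = a` occupy one slot.
class ExpressionSet[E](canonicalize: E => E) {
  private var canonical = Set.empty[E]
  private var original  = Seq.empty[E]

  def add(e: E): Unit = {
    val c = canonicalize(e)
    if (!canonical.contains(c)) {   // semantically-equal duplicates are dropped
      canonical += c
      original :+= e
    }
  }

  def contains(e: E): Boolean = canonical.contains(canonicalize(e))
  def toSeq: Seq[E] = original
}
{code}

With a canonicalizer that, say, sorts the operands of symmetric comparisons,
adding both "a = b" and "b = a" leaves a single entry.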




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13091) Rewrite/Propagate constraints for Aliases

2016-01-29 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-13091:
--

 Summary: Rewrite/Propagate constraints for Aliases
 Key: SPARK-13091
 URL: https://issues.apache.org/jira/browse/SPARK-13091
 Project: Spark
  Issue Type: Sub-task
Reporter: Sameer Agarwal


We'd want to duplicate constraints when there is an alias (i.e. for "SELECT a, 
a AS b", any constraints on a now apply to b)

This is a follow up task based on [~marmbrus]'s suggestion in 
https://docs.google.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit#heading=h.6hjcndo36qze




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13090) Add initial support for constraint propagation in SparkSQL

2016-01-29 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-13090:
--

 Summary: Add initial support for constraint propagation in SparkSQL
 Key: SPARK-13090
 URL: https://issues.apache.org/jira/browse/SPARK-13090
 Project: Spark
  Issue Type: Sub-task
Reporter: Sameer Agarwal


The goal of this subtask is to add initial support for the basic constraint
framework and to allow propagating constraints through filter, project, union,
intersect, except and joins.
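
As an illustration of what "propagating constraints" means for a few of those
operators (strings as stand-in predicates; the real work happens on Catalyst
expressions):

{code}
// Hypothetical sketch of individual propagation rules.
def filterConstraints(child: Set[String], condition: String): Set[String] =
  child + condition          // rows surviving a Filter also satisfy its condition

def unionConstraints(children: Seq[Set[String]]): Set[String] =
  children.reduceOption(_ intersect _).getOrElse(Set.empty)   // only what all children guarantee

def intersectConstraints(left: Set[String], right: Set[String]): Set[String] =
  left union right           // a row kept by INTERSECT came from both sides

def exceptConstraints(left: Set[String]): Set[String] =
  left                       // EXCEPT only keeps rows from the left child
{code}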



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13087) Grouping by a complex expression may lead to incorrect AttributeReferences in aggregations

2016-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13087:
-
Target Version/s: 1.6.1

> Grouping by a complex expression may lead to incorrect AttributeReferences in 
> aggregations
> --
>
> Key: SPARK-13087
> URL: https://issues.apache.org/jira/browse/SPARK-13087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Mark Hamstra
>
> This is a regression from 1.5.
> An example of the failure:
> Working with this table...
> {code}
> 0: jdbc:hive2://10.1.3.203:1> DESCRIBE 
> csd_0ae1abc1_a3af_4c63_95b0_9599faca6c3d;
> +---++--+--+
> |   col_name| data_type  | comment  |
> +---++--+--+
> | c_date| timestamp  | NULL |
> | c_count   | int| NULL |
> | c_location_fips_code  | string | NULL |
> | c_airtemp | float  | NULL |
> | c_dewtemp | float  | NULL |
> | c_pressure| int| NULL |
> | c_rain| float  | NULL |
> | c_snow| float  | NULL |
> +---++--+--+
> {code}
> ...and this query (which isn't necessarily all that sensical or useful, but 
> has been adapted from a similarly failing query that uses a custom UDF where 
> the Spark SQL built-in `day` function has been substituted into this query)...
> {code}
> SELECT day ( c_date )  AS c_date, percentile_approx(c_rain, 0.5) AS 
> c_expr_1256887735 FROM csd_0ae1abc1_a3af_4c63_95b0_9599faca6c3d GROUP BY day 
> ( c_date )  ORDER BY c_date;
> {code}
> Spark 1.5 produces the expected results without error.
> In Spark 1.6, this plan is produced...
> {code}
> Exchange rangepartitioning(c_date#63009 ASC,16), None
> +- SortBasedAggregate(key=[dayofmonth(cast(c_date#63011 as date))#63020], 
> functions=[(hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.Gene
> ricUDAFPercentileApprox@6f211801),c_rain#63017,0.5,false,0,0),mode=Complete,isDistinct=false)],
>  output=[c_date#63009,c_expr_1256887735#63010])
>+- ConvertToSafe
>   +- !Sort [dayofmonth(cast(c_date#63011 as date))#63020 ASC], false, 0
>  +- !TungstenExchange hashpartitioning(dayofmonth(cast(c_date#63011 
> as date))#63020,16), None
> +- ConvertToUnsafe
>+- HiveTableScan [c_date#63011,c_rain#63017], 
> MetastoreRelation default, csd_0ae1abc1_a3af_4c63_95b0_9599faca6c3d, None
> {code}
> ...which fails with a TreeNodeException and stack traces that include this...
> {code}
> Caused by: ! org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 0 in stage 2842.0 failed 4 times, most recent failure: Lost 
> task 0.3 in stage 2842.0 (TID 15007, ip-10-1-1-59.dev.clearstory.com): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: dayofmonth(cast(c_date#63011 as date))#63020
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:86)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:85)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:85)
> at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:62)
> at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:62)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> a

[jira] [Assigned] (SPARK-13088) DAG viz does not work with latest version of chrome

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13088:


Assignee: Apache Spark  (was: Andrew Or)

> DAG viz does not work with latest version of chrome
> ---
>
> Key: SPARK-13088
> URL: https://issues.apache.org/jira/browse/SPARK-13088
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Blocker
> Attachments: Screen Shot 2016-01-29 at 10.54.14 AM.png
>
>
> See screenshot. This is because dagre-d3.js is using a function that chrome 
> no longer supports:
> {code}
> Uncaught TypeError: elem.getTransformToElement is not a function
> {code}
> We need to upgrade it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13088) DAG viz does not work with latest version of chrome

2016-01-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13088:


Assignee: Andrew Or  (was: Apache Spark)

> DAG viz does not work with latest version of chrome
> ---
>
> Key: SPARK-13088
> URL: https://issues.apache.org/jira/browse/SPARK-13088
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: Screen Shot 2016-01-29 at 10.54.14 AM.png
>
>
> See screenshot. This is because dagre-d3.js is using a function that chrome 
> no longer supports:
> {code}
> Uncaught TypeError: elem.getTransformToElement is not a function
> {code}
> We need to upgrade it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13088) DAG viz does not work with latest version of chrome

2016-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124095#comment-15124095
 ] 

Apache Spark commented on SPARK-13088:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/10986

> DAG viz does not work with latest version of chrome
> ---
>
> Key: SPARK-13088
> URL: https://issues.apache.org/jira/browse/SPARK-13088
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: Screen Shot 2016-01-29 at 10.54.14 AM.png
>
>
> See screenshot. This is because dagre-d3.js is using a function that chrome 
> no longer supports:
> {code}
> Uncaught TypeError: elem.getTransformToElement is not a function
> {code}
> We need to upgrade it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13089) spark.ml Naive Bayes user guide

2016-01-29 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-13089:
-

 Summary: spark.ml Naive Bayes user guide
 Key: SPARK-13089
 URL: https://issues.apache.org/jira/browse/SPARK-13089
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Priority: Minor


Add a section in ml-classification.md for the NaiveBayes DataFrame-based API,
plus example code (using include_example to clip code from files in the
examples/ folder).
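
A rough sketch of the kind of snippet the new section would clip in, written
against the spark.ml DataFrame-based API; the toy training data and the
spark-shell-style sqlContext are assumptions for illustration:

{code}
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors

// Tiny placeholder training set with the conventional "label"/"features" columns.
val training = sqlContext.createDataFrame(Seq(
  (0.0, Vectors.dense(1.0, 0.0)),
  (1.0, Vectors.dense(0.0, 1.0))
)).toDF("label", "features")

val model = new NaiveBayes()
  .setSmoothing(1.0)              // Laplace smoothing
  .setModelType("multinomial")
  .fit(training)

model.transform(training).select("label", "prediction").show()
{code}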



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13084) Utilize @SerialVersionUID to avoid local class incompatibility

2016-01-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124068#comment-15124068
 ] 

Sean Owen commented on SPARK-13084:
---

SerialVersionUID is generally a bad idea. It opens up a worse problem: you
claim compatibility when something isn't compatible, because you forgot to
update the field. I don't think it's expected that a serialized RDD is
compatible across different versions. I would be against this in general.

> Utilize @SerialVersionUID to avoid local class incompatibility
> --
>
> Key: SPARK-13084
> URL: https://issues.apache.org/jira/browse/SPARK-13084
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>
> Here is related thread:
> http://search-hadoop.com/m/q3RTtSjjdT1BJ4Jr/local+class+incompatible&subj=local+class+incompatible+stream+classdesc+serialVersionUID
> RDD extends Serializable but doesn't have a @SerialVersionUID() annotation.
> Adding @SerialVersionUID would overcome local class incompatibility across
> minor releases.
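
For concreteness, this is all the proposal amounts to in Scala (illustration only,
not a change that exists in Spark), and it also shows where the risk raised above
comes from:

{code}
// Hypothetical example. The value is arbitrary; once it is pinned, adding or
// changing fields no longer changes the UID, so compatibility now depends on
// someone remembering to bump it whenever the layout really breaks.
@SerialVersionUID(1L)
abstract class VersionedThing extends Serializable {
  // fields...
}
{code}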



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12818) Implement Bloom filter and count-min sketch in DataFrames

2016-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124059#comment-15124059
 ] 

Apache Spark commented on SPARK-12818:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/10985

> Implement Bloom filter and count-min sketch in DataFrames
> -
>
> Key: SPARK-12818
> URL: https://issues.apache.org/jira/browse/SPARK-12818
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
> Attachments: BloomFilterandCount-MinSketchinSpark2.0.pdf
>
>
> This ticket tracks implementing Bloom filter and count-min sketch support in 
> DataFrames. Please see the attached design doc for more information.
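
For a sense of what this looks like from the DataFrame side, a usage sketch along
the lines of the design doc; the exact method names and signatures here are
assumptions about the proposed API, not a documented contract:

{code}
// Hypothetical usage sketch against df.stat.
val df = sqlContext.range(0, 1000)

// Bloom filter over a column: col name, expected item count, false-positive rate.
val bf = df.stat.bloomFilter("id", 1000L, 0.03)
println(bf.mightContain(42L))

// Count-min sketch over a column: col name, eps, confidence, seed.
val cms = df.stat.countMinSketch("id", 0.01, 0.95, 42)
println(cms.estimateCount(42L))
{code}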



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12695) java.lang.ClassCastException: [B cannot be cast to java.lang.String

2016-01-29 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124039#comment-15124039
 ] 

Shixiong Zhu edited comment on SPARK-12695 at 1/29/16 7:25 PM:
---

This is a SQL issue. See SPARK-13082


was (Author: zsxwing):
This is a SQL issue. See SPARK-12695

> java.lang.ClassCastException: [B cannot be cast to java.lang.String
> ---
>
> Key: SPARK-12695
> URL: https://issues.apache.org/jira/browse/SPARK-12695
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: eugeny birukov
> Attachments: exception_SPARK-12695.log
>
>
> I process kinesis stream in python code:
> stream = KinesisUtils.createStream(ssc, appName, streamName, endpointUrl, 
> regionName, InitialPositionInStream.LATEST, 30)
> stream.map(lambda line: str(line)).foreachRDD(process)
> def process(time, rdd):
> sqlContext = SQLContext.getOrCreate(rdd.context)
> t = sqlContext.read.json(rdd)
> run and get exception
> org.apache.spark.SparkException: An exception was raised by Python:
> Traceback (most recent call last):
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", 
> line 65, in call
> r = self.func(t, *rdds)
>   File "/usr/local/spark-1.6.0-bin-hadoop2.4/bin/kinesis_test.py", line 26, 
> in process
> t = sqlContext.read.json(rdd.map(lambda line: str(line)))
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", 
> line 180, in json
> return self._df(self._jreader.json(path._jrdd))
>   File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", 
> line 813, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 
> 45, in deco
> return f(*a, **kw)
>   File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 
> 308, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o165.json.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
> (TID 1, localhost): java.lang.ClassCastException: [B cannot be cast to 
> java.lang.String
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:53)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
> 
> in java same co

[jira] [Comment Edited] (SPARK-12695) java.lang.ClassCastException: [B cannot be cast to java.lang.String

2016-01-29 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124037#comment-15124037
 ] 

Shixiong Zhu edited comment on SPARK-12695 at 1/29/16 7:25 PM:
---

I'm going to mark this one duplicate because SPARK-13082 has a better 
reproducer.


was (Author: zsxwing):
I'm going to mark this one duplicate because SPARK-12695 has a better 
reproducer.

> java.lang.ClassCastException: [B cannot be cast to java.lang.String
> ---
>
> Key: SPARK-12695
> URL: https://issues.apache.org/jira/browse/SPARK-12695
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: eugeny birukov
> Attachments: exception_SPARK-12695.log
>
>
> I process kinesis stream in python code:
> stream = KinesisUtils.createStream(ssc, appName, streamName, endpointUrl, 
> regionName, InitialPositionInStream.LATEST, 30)
> stream.map(lambda line: str(line)).foreachRDD(process)
> def process(time, rdd):
> sqlContext = SQLContext.getOrCreate(rdd.context)
> t = sqlContext.read.json(rdd)
> run and get exception
> org.apache.spark.SparkException: An exception was raised by Python:
> Traceback (most recent call last):
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", 
> line 65, in call
> r = self.func(t, *rdds)
>   File "/usr/local/spark-1.6.0-bin-hadoop2.4/bin/kinesis_test.py", line 26, 
> in process
> t = sqlContext.read.json(rdd.map(lambda line: str(line)))
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", 
> line 180, in json
> return self._df(self._jreader.json(path._jrdd))
>   File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", 
> line 813, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 
> 45, in deco
> return f(*a, **kw)
>   File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 
> 308, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o165.json.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
> (TID 1, localhost): java.lang.ClassCastException: [B cannot be cast to 
> java.lang.String
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:53)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler

[jira] [Resolved] (SPARK-12695) java.lang.ClassCastException: [B cannot be cast to java.lang.String

2016-01-29 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-12695.
--
Resolution: Duplicate

I'm going to mark this one as a duplicate because SPARK-13082 has a better
reproducer.

> java.lang.ClassCastException: [B cannot be cast to java.lang.String
> ---
>
> Key: SPARK-12695
> URL: https://issues.apache.org/jira/browse/SPARK-12695
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: eugeny birukov
> Attachments: exception_SPARK-12695.log
>
>
> I process kinesis stream in python code:
> stream = KinesisUtils.createStream(ssc, appName, streamName, endpointUrl, 
> regionName, InitialPositionInStream.LATEST, 30)
> stream.map(lambda line: str(line)).foreachRDD(process)
> def process(time, rdd):
> sqlContext = SQLContext.getOrCreate(rdd.context)
> t = sqlContext.read.json(rdd)
> run and get exception
> org.apache.spark.SparkException: An exception was raised by Python:
> Traceback (most recent call last):
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", 
> line 65, in call
> r = self.func(t, *rdds)
>   File "/usr/local/spark-1.6.0-bin-hadoop2.4/bin/kinesis_test.py", line 26, 
> in process
> t = sqlContext.read.json(rdd.map(lambda line: str(line)))
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", 
> line 180, in json
> return self._df(self._jreader.json(path._jrdd))
>   File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", 
> line 813, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 
> 45, in deco
> return f(*a, **kw)
>   File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 
> 308, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o165.json.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
> (TID 1, localhost): java.lang.ClassCastException: [B cannot be cast to 
> java.lang.String
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:53)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
> 
> In Java, the same code throws no exceptions:
> streams.map(x -> new String( x ) )
>   .foreachRDD((Jav

[jira] [Commented] (SPARK-12695) java.lang.ClassCastException: [B cannot be cast to java.lang.String

2016-01-29 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124039#comment-15124039
 ] 

Shixiong Zhu commented on SPARK-12695:
--

This is a SQL issue. See SPARK-12695

> java.lang.ClassCastException: [B cannot be cast to java.lang.String
> ---
>
> Key: SPARK-12695
> URL: https://issues.apache.org/jira/browse/SPARK-12695
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: eugeny birukov
> Attachments: exception_SPARK-12695.log
>
>
> I process a Kinesis stream in Python code:
> stream = KinesisUtils.createStream(ssc, appName, streamName, endpointUrl, 
> regionName, InitialPositionInStream.LATEST, 30)
> stream.map(lambda line: str(line)).foreachRDD(process)
> def process(time, rdd):
>     sqlContext = SQLContext.getOrCreate(rdd.context)
>     t = sqlContext.read.json(rdd)
> Run it and get this exception:
> org.apache.spark.SparkException: An exception was raised by Python:
> Traceback (most recent call last):
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", 
> line 65, in call
> r = self.func(t, *rdds)
>   File "/usr/local/spark-1.6.0-bin-hadoop2.4/bin/kinesis_test.py", line 26, 
> in process
> t = sqlContext.read.json(rdd.map(lambda line: str(line)))
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", 
> line 180, in json
> return self._df(self._jreader.json(path._jrdd))
>   File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", 
> line 813, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 
> 45, in deco
> return f(*a, **kw)
>   File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 
> 308, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o165.json.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
> (TID 1, localhost): java.lang.ClassCastException: [B cannot be cast to 
> java.lang.String
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:53)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1122)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
> 
> In Java, the same code throws no exceptions:
> streams.map(x -> new String( x ) )
>   .foreachRDD((JavaRDD rdd) -> {
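
For what it's worth, a Python-side analogue of the Java workaround above would be to decode 
the raw records to strings explicitly before handing the RDD to the JSON reader. The sketch 
below is only illustrative: it reuses the `stream` from the snippet above, assumes the 
records arrive as bytes/bytearray payloads, and has not been verified to avoid the JVM-side 
ClassCastException reported here.
{code}
from pyspark.sql import SQLContext

def process(time, rdd):
    # Decode raw payloads to strings, mirroring the Java `new String(x)` mapping.
    decoded = rdd.map(lambda rec: rec.decode("utf-8")
                      if isinstance(rec, (bytes, bytearray)) else str(rec))
    if decoded.isEmpty():
        return
    sqlContext = SQLContext.getOrCreate(decoded.context)
    t = sqlContext.read.json(decoded)
    t.printSchema()

stream.foreachRDD(process)
{code}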
> 

[jira] [Resolved] (SPARK-12656) Rewrite Intersect physical plan using semi-join

2016-01-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12656.
-
   Resolution: Fixed
 Assignee: Xiao Li
Fix Version/s: 2.0.0

> Rewrite Intersect physical plan using semi-join
> ---
>
> Key: SPARK-12656
> URL: https://issues.apache.org/jira/browse/SPARK-12656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Our current Intersect physical operator simply delegates to RDD.intersect. We 
> should remove the Intersect physical operator and simply transform a logical 
> intersect into a semi-join. This way, we can take advantage of all the 
> benefits of join implementations (e.g. managed memory, code generation, 
> broadcast joins).
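
To make the intended rewrite concrete, here is a rough DataFrame-level sketch of the 
equivalence (the data and column names are made up, this ignores the null-safe equality a 
real INTERSECT rewrite would need, and it is not the actual planner rule):
{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="intersect-as-semijoin-sketch")
sqlContext = SQLContext(sc)

df1 = sqlContext.createDataFrame([(1, "a"), (2, "b"), (2, "b")], ["k", "v"])
df2 = sqlContext.createDataFrame([(2, "b"), (3, "c")], ["k", "v"])

# Built-in set operation.
via_intersect = df1.intersect(df2)

# Equivalent shape: a distinct left semi join on all columns, which can use the
# regular join machinery (managed memory, codegen, broadcast joins).
cond = [df1["k"] == df2["k"], df1["v"] == df2["v"]]
via_semijoin = df1.join(df2, cond, "leftsemi").distinct()

print(sorted(via_intersect.collect()) == sorted(via_semijoin.collect()))  # True
{code}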



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13088) DAG viz does not work with latest version of chrome

2016-01-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13088:
--
Attachment: Screen Shot 2016-01-29 at 10.54.14 AM.png

> DAG viz does not work with latest version of chrome
> ---
>
> Key: SPARK-13088
> URL: https://issues.apache.org/jira/browse/SPARK-13088
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: Screen Shot 2016-01-29 at 10.54.14 AM.png
>
>
> See screenshot. This is because dagre-d3.js is using a function that Chrome 
> no longer supports:
> {code}
> Uncaught TypeError: elem.getTransformToElement is not a function
> {code}
> We need to upgrade it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13088) DAG viz does not work with latest version of chrome

2016-01-29 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13088:
-

 Summary: DAG viz does not work with latest version of chrome
 Key: SPARK-13088
 URL: https://issues.apache.org/jira/browse/SPARK-13088
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.6.0, 1.5.0, 1.4.0, 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker


See screenshot. This is because dagre-d3.js is using a function that Chrome no 
longer supports:
{code}
Uncaught TypeError: elem.getTransformToElement is not a function
{code}
We need to upgrade it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10009) PySpark Param of Vector type can be set with Python array or numpy.array

2016-01-29 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123974#comment-15123974
 ] 

holdenk commented on SPARK-10009:
-

cc [~sethah], you might want to handle this while looking at the other list-type 
conversion work as well.

> PySpark Param of Vector type can be set with Python array or numpy.array
> 
>
> Key: SPARK-10009
> URL: https://issues.apache.org/jira/browse/SPARK-10009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>
> If the type of a Param in a PySpark ML pipeline is Vector, it can currently only 
> be set with a Vector. We also need to support setting it with a Python array or 
> numpy.array. This should be handled in the wrapper (_transfer_params_to_java).
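
For illustration only, the conversion could look roughly like the helper below; the 
function name and placement are hypothetical and not the actual _transfer_params_to_java 
code (in the 1.x line, pyspark.ml params use pyspark.mllib.linalg vectors):
{code}
import numpy as np
from pyspark.mllib.linalg import Vector, Vectors

def _coerce_to_vector(value):
    # Hypothetical helper: pass Vectors through, convert lists/tuples/ndarrays
    # to a DenseVector, and leave anything else untouched.
    if isinstance(value, Vector):
        return value
    if isinstance(value, (list, tuple, np.ndarray)):
        return Vectors.dense([float(x) for x in value])
    return value

print(_coerce_to_vector([0.0, 1.0, 2.0]))
print(_coerce_to_vector(np.array([0.0, 1.0, 2.0])))
print(_coerce_to_vector(Vectors.sparse(3, [1], [1.0])))
{code}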



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13072) simplify and improve murmur3 hash expression codegen

2016-01-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13072.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10974
[https://github.com/apache/spark/pull/10974]

> simplify and improve murmur3 hash expression codegen
> 
>
> Key: SPARK-13072
> URL: https://issues.apache.org/jira/browse/SPARK-13072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13085) Add scalastyle command used in build testing

2016-01-29 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123946#comment-15123946
 ] 

Charles Allen commented on SPARK-13085:
---

{code}
mvn scalastyle:check
{code}

This was able to produce a similar error, but it is not obvious whether it is the same 
command the build uses.

> Add scalastyle command used in build testing
> 
>
> Key: SPARK-13085
> URL: https://issues.apache.org/jira/browse/SPARK-13085
> Project: Spark
>  Issue Type: Wish
>  Components: Build, Tests
>Reporter: Charles Allen
>
> As an occasional or new contributor, it is easy to screw up scala style. But 
> looking at the output logs (for example 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50300/consoleFull
>  ) it is not obvious how to fix the scala style tests, even when reading the 
> scala style guide.
> {code}
> 
> Running Scala style checks
> 
> Scalastyle checks failed at following occurrences:
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:22:0:
>  import.ordering.wrongOrderInGroup.message
> [error] (core/compile:scalastyle) errors exist
> [error] Total time: 9 s, completed Jan 28, 2016 2:11:00 PM
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received 
> return code 1
> {code}
> The ask is that the command used to check Scalastyle be presented in the log, 
> so a developer does not have to wait for the build process to check whether a pull 
> request will pass the Scala style checks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13087) Grouping by a complex expression may lead to incorrect AttributeReferences in aggregations

2016-01-29 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-13087:


 Summary: Grouping by a complex expression may lead to incorrect 
AttributeReferences in aggregations
 Key: SPARK-13087
 URL: https://issues.apache.org/jira/browse/SPARK-13087
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Mark Hamstra


This is a regression from 1.5.

An example of the failure:

Working with this table...
{code}
0: jdbc:hive2://10.1.3.203:1> DESCRIBE csd_0ae1abc1_a3af_4c63_95b0_9599faca6c3d;
+----------------------+------------+----------+
|       col_name       | data_type  | comment  |
+----------------------+------------+----------+
| c_date               | timestamp  | NULL     |
| c_count              | int        | NULL     |
| c_location_fips_code | string     | NULL     |
| c_airtemp            | float      | NULL     |
| c_dewtemp            | float      | NULL     |
| c_pressure           | int        | NULL     |
| c_rain               | float      | NULL     |
| c_snow               | float      | NULL     |
+----------------------+------------+----------+
{code}
...and this query (which isn't necessarily all that sensible or useful, but has 
been adapted from a similarly failing query that used a custom UDF, for which the 
Spark SQL built-in `day` function has been substituted here)...
{code}
SELECT day ( c_date )  AS c_date, percentile_approx(c_rain, 0.5) AS 
c_expr_1256887735 FROM csd_0ae1abc1_a3af_4c63_95b0_9599faca6c3d GROUP BY day ( 
c_date )  ORDER BY c_date;
{code}
Spark 1.5 produces the expected results without error.

In Spark 1.6, this plan is produced...
{code}
Exchange rangepartitioning(c_date#63009 ASC,16), None
+- SortBasedAggregate(key=[dayofmonth(cast(c_date#63011 as date))#63020], functions=[(hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6f211801),c_rain#63017,0.5,false,0,0),mode=Complete,isDistinct=false)], output=[c_date#63009,c_expr_1256887735#63010])
   +- ConvertToSafe
      +- !Sort [dayofmonth(cast(c_date#63011 as date))#63020 ASC], false, 0
         +- !TungstenExchange hashpartitioning(dayofmonth(cast(c_date#63011 as date))#63020,16), None
            +- ConvertToUnsafe
               +- HiveTableScan [c_date#63011,c_rain#63017], MetastoreRelation default, csd_0ae1abc1_a3af_4c63_95b0_9599faca6c3d, None
{code}
...which fails with a TreeNodeException and stack traces that include this...
{code}
Caused by: ! org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 2842.0 failed 4 times, most recent failure: Lost task 0.3 in 
stage 2842.0 (TID 15007, ip-10-1-1-59.dev.clearstory.com): 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: dayofmonth(cast(c_date#63011 as date))#63020
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:86)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:85)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:85)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:62)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:62)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.<init>(Projection.scala:62)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$newMutableProjection$1.apply(SparkPlan.scala:254)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$newMutableProjection$1.apply(SparkPlan.scala:254)
at 
org.apache.spark.sql.execution.Exchange.org$apache$spark$sql$execution$Exc
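{code}

A possible user-level workaround (a hedged sketch only, not verified against this exact 
regression) is to materialize the complex grouping expression as a named column first, so 
the aggregation and sort bind to a plain attribute instead of the repeated dayofmonth(...) 
expression. Assuming a HiveContext, since percentile_approx is a Hive UDAF:
{code}
from pyspark import SparkContext
from pyspark.sql import HiveContext, functions as F

sc = SparkContext(appName="groupby-alias-workaround")
sqlContext = HiveContext(sc)

df = sqlContext.table("csd_0ae1abc1_a3af_4c63_95b0_9599faca6c3d")
result = (df
          .withColumn("c_day", F.dayofmonth("c_date"))   # alias day(c_date) up front
          .groupBy("c_day")
          .agg(F.expr("percentile_approx(c_rain, 0.5)").alias("c_expr_1256887735"))
          .orderBy("c_day"))
result.show()
{code}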

[jira] [Assigned] (SPARK-12913) Reimplement stat functions as declarative function

2016-01-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12913:
--

Assignee: Davies Liu

> Reimplement stat functions as declarative function
> --
>
> Key: SPARK-12913
> URL: https://issues.apache.org/jira/browse/SPARK-12913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> As benchmarked and discussed here: 
> https://github.com/apache/spark/pull/10786/files#r50038294.
> Benefiting from codegen, a declarative aggregate function can be much 
> faster than an imperative one, so we should re-implement all the built-in aggregate 
> functions as declarative ones.
> For skewness and kurtosis, we need to benchmark to make sure that the 
> declarative version is actually faster than the imperative one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10873) can't sort columns on history page

2016-01-29 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-10873.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Thomas Graves
>Assignee: Zhuo Liu
> Fix For: 2.0.0
>
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10873) Change history to use datatables to support sorting columns and searching

2016-01-29 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-10873:
--
Summary: Change history to use datatables to support sorting columns and 
searching  (was: Change history table to use datatables to support sorting 
columns and searching)

> Change history to use datatables to support sorting columns and searching
> -
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Thomas Graves
>Assignee: Zhuo Liu
> Fix For: 2.0.0
>
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10873) Change history table to use datatables to support sorting columns and searching

2016-01-29 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-10873:
--
Summary: Change history table to use datatables to support sorting columns 
and searching  (was: can't sort columns on history page)

> Change history table to use datatables to support sorting columns and 
> searching
> ---
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Thomas Graves
>Assignee: Zhuo Liu
> Fix For: 2.0.0
>
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13033) PySpark ml.regression support export/import

2016-01-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123823#comment-15123823
 ] 

Joseph K. Bradley commented on SPARK-13033:
---

It's now merged!

> PySpark ml.regression support export/import
> ---
>
> Key: SPARK-13033
> URL: https://issues.apache.org/jira/browse/SPARK-13033
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/regression.py. Please refer to the 
> implementation in SPARK-13032. 
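
Once this lands, usage should follow the same save/load pattern as the Scala API and the 
modules covered by SPARK-13032. A hedged sketch only (paths and parameter values are made 
up, and the exact reader/writer surface may differ):
{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
from pyspark.ml.regression import LinearRegression, LinearRegressionModel

sc = SparkContext(appName="ml-regression-persistence-sketch")
sqlContext = SQLContext(sc)

training = sqlContext.createDataFrame(
    [(1.0, Vectors.dense(0.0)), (2.0, Vectors.dense(1.0))], ["label", "features"])

lr = LinearRegression(maxIter=5, regParam=0.01)
model = lr.fit(training)

# Persist both the estimator and the fitted model, then load them back.
lr.save("/tmp/lr_estimator")
model.save("/tmp/lr_model")
lr2 = LinearRegression.load("/tmp/lr_estimator")
model2 = LinearRegressionModel.load("/tmp/lr_model")
{code}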



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13037) PySpark ml.recommendation support export/import

2016-01-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123824#comment-15123824
 ] 

Joseph K. Bradley commented on SPARK-13037:
---

The blocking PR is now merged!

> PySpark ml.recommendation support export/import
> ---
>
> Key: SPARK-13037
> URL: https://issues.apache.org/jira/browse/SPARK-13037
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/recommendation.py. Please refer to the 
> implementation in SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12626) MLlib 2.0 Roadmap

2016-01-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12626:
--
Description: 
This is a master list for MLlib improvements we plan to have in Spark 2.0. 
Please view this list as a wish list rather than a concrete plan, because we 
don't have an accurate estimate of available resources. Due to limited review 
bandwidth, features appearing on this list will get higher priority during code 
review. But feel free to suggest new items to the list in comments. We are 
experimenting with this process. Your feedback would be greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a medium/big feature. Based on our experience, mixing the development 
process with a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on a feature. This is to avoid duplicate work. For small 
features, you don't need to wait to get the JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Remember to add the `@Since("2.0.0")` annotation to new public APIs.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps to improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add a "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if applicable.

h1. Roadmap (*WIP*)

This is NOT [a complete list of MLlib JIRAs for 
2.0|https://issues.apache.org/jira/issues/?filter=12334385]. We only include 
umbrella JIRAs and high-level tasks.

Major efforts in this release:
* `spark.ml`: Achieve feature parity for the `spark.ml` API, relative to the 
`spark.mllib` API.  This includes the Python API.
* Linear algebra: Separate out the linear algebra library as a standalone 
project without a Spark dependency to simplify production deployment.
* Pipelines API: Complete critical improvements to the Pipelines API
* New features: As usual, we expect to expand the feature set of MLlib.  
However, we will prioritize API parity over new features.  _New algorithms 
should be written for `spark.ml`, not `spark.mllib`._

h2. Algorithms and performance

* iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
* estimator interface for GLMs (SPARK-12811)
* extended support for GLM model families and link functions in SparkR 
(SPARK-12566)
* improved model summaries and stats via IRLS (SPARK-9837)

Additional (maybe lower priority):
* robust linear regression with Huber loss (SPARK-3181)
* vector-free L-BFGS (SPARK-10078)
* tree partition by features (SPARK-3717)
* local linear algebra (SPARK-6442)
* weighted instance support (SPARK-9610)
** random forest (SPARK-9478)
** GBT (SPARK-9612)
* locality sensitive hashing (LSH) (SPARK-5992)
* deep learning (SPARK-5575)
** autoencoder (SPARK-10408)
** restricted Boltzmann machine (RBM) (SPARK-4251)
** convolutional neural network (stretch)
* factorization machine (SPARK-7008)
* distributed LU decomposition (SPARK-8514)

h2. Statistics

* bivariate statistics as UDAFs (SPARK-10385)
* R-like statistics for GLMs (SPARK-9835)
* sketch algorithms (cross listed) : approximate quantiles (SPARK-6761), 
count-min sketch (SPARK-6763), Bloom filter (SPARK-12818)

h2. Pipeline API

* pipeline persistence (SPARK-6725)
** trees (SPARK-11888)
** RFormula (SPARK-11891)
** MLC (SPARK-11871)
** PySpark (SPARK-11939) --> *This is now ready for people to take up subtasks!*
* ML attribute API improvements (SPARK-8515)
* predict single instance (SPARK-10413)
* test Kaggle datasets (SPARK-9941)

_There may be other design improvement efforts for Pipelines, to be listed here 
soon.  See (SPARK-5874) for a list of possibilities._

h2. Model persistence

* PMML export
** naive Bayes (SPARK-8546)
** decision tree (SPARK-8542)
* model save/load
** FPGrowth (SPARK-6724)
** PrefixSpan (SPARK-1038
