[jira] [Created] (SPARK-14747) Add assertStreaming/assertNoneStreaming checks in DataFrameWriter

2016-04-20 Thread Liwei Lin (JIRA)
Liwei Lin created SPARK-14747:
-

 Summary: Add assertStreaming/assertNoneStreaming checks in 
DataFrameWriter
 Key: SPARK-14747
 URL: https://issues.apache.org/jira/browse/SPARK-14747
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Liwei Lin
Priority: Minor


If an end user happens to write code that mixes continuous-query-oriented 
methods with non-continuous-query-oriented methods:

{code}
ctx.read
   .format("text")
   .stream("...")  // continuous query
   .write
   .text("...")   // non-continuous query
{code}

He/she would get a somewhat confusing exception:

{quote}
Exception in thread "main" java.lang.AssertionError: assertion failed: No plan 
for FileSource\[./continuous_query_test_input\]
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at ...
{quote}

This JIRA proposes adding checks to the continuous-query-oriented and 
non-continuous-query-oriented methods in `DataFrameWriter`, so that such a mix 
fails with a clear error.
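A minimal sketch of what such checks could look like, written as a standalone helper rather than the actual DataFrameWriter patch (the method names, the error type, and the wiring are illustrative assumptions):

{code}
import org.apache.spark.sql.DataFrame

// Illustrative sketch only: fail fast with a readable message instead of the
// planner assertion shown above. The real patch would live inside DataFrameWriter.
object StreamingChecks {
  // Guard for batch-only methods such as text()/save().
  def assertNotStreaming(df: DataFrame, method: String): Unit =
    if (df.isStreaming) {
      throw new UnsupportedOperationException(
        s"$method can only be called on non-continuous queries")
    }

  // Guard for continuous-query-only methods.
  def assertStreaming(df: DataFrame, method: String): Unit =
    if (!df.isStreaming) {
      throw new UnsupportedOperationException(
        s"$method can only be called on continuous queries")
    }
}
{code}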






[jira] [Commented] (SPARK-14747) Add assertStreaming/assertNoneStreaming checks in DataFrameWriter

2016-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249418#comment-15249418
 ] 

Apache Spark commented on SPARK-14747:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12521

> Add assertStreaming/assertNoneStreaming checks in DataFrameWriter
> -
>
> Key: SPARK-14747
> URL: https://issues.apache.org/jira/browse/SPARK-14747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>
> If an end user happens to write the code mixed with continuous-query-oriented 
> methods and non-continuous-query-oriented methods:
> {code}
> ctx.read
>.format("text")
>.stream("...")  // continuous query
>.write
>.text("...")// non-continuous query
> {code}
> He/she would get somehow a confusing exception:
> {quote}
> Exception in thread "main" java.lang.AssertionError: assertion failed: No 
> plan for FileSource\[./continuous_query_test_input\]
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at ...
> {quote}
> This JIRA proposes to add checks for continuous-query-oriented methods and 
> non-continuous-query-oriented methods in `DataFrameWriter`.






[jira] [Assigned] (SPARK-14747) Add assertStreaming/assertNoneStreaming checks in DataFrameWriter

2016-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14747:


Assignee: Apache Spark

> Add assertStreaming/assertNoneStreaming checks in DataFrameWriter
> -
>
> Key: SPARK-14747
> URL: https://issues.apache.org/jira/browse/SPARK-14747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Assignee: Apache Spark
>Priority: Minor
>
> If an end user happens to write the code mixed with continuous-query-oriented 
> methods and non-continuous-query-oriented methods:
> {code}
> ctx.read
>.format("text")
>.stream("...")  // continuous query
>.write
>.text("...")// non-continuous query
> {code}
> He/she would get somehow a confusing exception:
> {quote}
> Exception in thread "main" java.lang.AssertionError: assertion failed: No 
> plan for FileSource\[./continuous_query_test_input\]
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at ...
> {quote}
> This JIRA proposes to add checks for continuous-query-oriented methods and 
> non-continuous-query-oriented methods in `DataFrameWriter`.






[jira] [Assigned] (SPARK-14747) Add assertStreaming/assertNoneStreaming checks in DataFrameWriter

2016-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14747:


Assignee: (was: Apache Spark)

> Add assertStreaming/assertNoneStreaming checks in DataFrameWriter
> -
>
> Key: SPARK-14747
> URL: https://issues.apache.org/jira/browse/SPARK-14747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>
> If an end user happens to write the code mixed with continuous-query-oriented 
> methods and non-continuous-query-oriented methods:
> {code}
> ctx.read
>.format("text")
>.stream("...")  // continuous query
>.write
>.text("...")// non-continuous query
> {code}
> He/she would get somehow a confusing exception:
> {quote}
> Exception in thread "main" java.lang.AssertionError: assertion failed: No 
> plan for FileSource\[./continuous_query_test_input\]
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at ...
> {quote}
> This JIRA proposes to add checks for continuous-query-oriented methods and 
> non-continuous-query-oriented methods in `DataFrameWriter`.






[jira] [Commented] (SPARK-14594) Improve error messages for RDD API

2016-04-20 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249424#comment-15249424
 ] 

Felix Cheung commented on SPARK-14594:
--

[~mgaido] Do you have the repro steps?

If I run this
{code}
rdd <- SparkR:::parallelize(sc, 1:10)
partitionSum <- SparkR:::lapplyPartition(rdd, function(part) { stop("jkfhk") })
SparkR:::collect(partitionSum)
{code}

then I see this error:

{code}
org.apache.spark.SparkException: R computation failed with
 Error in FUN(part) : jkfhk
Calls: source -> withVisible -> eval -> eval -> computeFunc -> FUN
Execution halted
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:65)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

Any exception in the worker should be captured in this catch 
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala#L172)
 with its message communicated back to R here 
(https://github.com/apache/spark/blob/master/R/pkg/R/backend.R#L112)
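For readers following along, the pattern behind those two links is roughly the following (a simplified, self-contained Scala sketch of the JVM side; the real code is the netty-based RBackendHandler, and the client side is the R code in backend.R):

{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Sketch of the status-plus-message protocol described above: wrap the call in
// try/catch and, on failure, send a non-zero status followed by the exception
// message, so the client can raise a readable error instead of
// "argument is of length zero".
object BackendReplySketch {
  def reply(invoke: () => Any): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    try {
      val result = invoke()
      out.writeInt(0)                              // 0 = success
      out.writeUTF(String.valueOf(result))         // serialized result (simplified)
    } catch {
      case e: Exception =>
        out.writeInt(-1)                           // non-zero = failure
        out.writeUTF(String.valueOf(e.getMessage)) // message surfaced to the client
    }
    out.flush()
    bytes.toByteArray
  }
}
{code}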


> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get the 
> same error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful; it would be better to catch the R 
> exception and show it instead.






[jira] [Resolved] (SPARK-9013) generate MutableProjection directly instead of return a function

2016-04-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9013.

   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.0.0

> generate MutableProjection directly instead of return a function
> 
>
> Key: SPARK-9013
> URL: https://issues.apache.org/jira/browse/SPARK-9013
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 2.0.0
>
>







[jira] [Commented] (SPARK-14746) Support transformations in R source code for Dataset/DataFrame

2016-04-20 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249445#comment-15249445
 ] 

Reynold Xin commented on SPARK-14746:
-

Do you mind linking to those threads?

What are the limitations of pipe?


> Support transformations in R source code for Dataset/DataFrame
> --
>
> Key: SPARK-14746
> URL: https://issues.apache.org/jira/browse/SPARK-14746
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Sun Rui
>
> There is a scenario mentioned several times on the Spark mailing list: users 
> writing Scala/Java Spark applications (not SparkR) want to use R functions in 
> some transformations. Typically this can be achieved by calling pipe() on an 
> RDD; however, pipe() has limitations. So we can support applying an R function, 
> supplied as source code, to a Dataset/DataFrame (thus SparkR is not needed for 
> serializing an R function).
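For context, the pipe() workaround the description refers to looks roughly like this (a sketch; the script path is hypothetical). Every element crosses the process boundary as a line of text, which is where many of the limitations come from:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the existing workaround: stream RDD elements through an external
// R process line by line. "transform.R" is a hypothetical script.
object PipeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pipe-sketch").setMaster("local[1]"))
    val input = sc.parallelize(Seq("1", "2", "3"))
    // Each element is written to the script's stdin as a line of text, and
    // each line the script prints to stdout becomes an output element.
    val output = input.pipe("Rscript transform.R")
    output.collect().foreach(println)
    sc.stop()
  }
}
{code}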






[jira] [Commented] (SPARK-14720) Move the rest of HiveContext to HiveSessionState

2016-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249459#comment-15249459
 ] 

Apache Spark commented on SPARK-14720:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/12522

> Move the rest of HiveContext to HiveSessionState
> 
>
> Key: SPARK-14720
> URL: https://issues.apache.org/jira/browse/SPARK-14720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This will be a major cleanup task. Unfortunately, part of the state will leak 
> to SessionState, which shouldn't know anything about Hive. Part of the effort 
> here is to create a new SparkSession interface (SPARK-13643) and use 
> reflection there to decide which SessionState to use.
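The reflection part could follow the usual pattern below (a self-contained sketch; the trait and class names are made up for illustration and are not the names the patch will use):

{code}
// Sketch of the reflection pattern: pick an implementation class by name at
// runtime so the caller has no compile-time dependency on it. All names here
// are illustrative, not Spark's.
trait SessionStateLike { def describe: String }
class BasicSessionState extends SessionStateLike { def describe = "basic" }
class HiveAwareSessionState extends SessionStateLike { def describe = "hive-aware" }

object ReflectionSketch {
  def instantiate(enableHive: Boolean): SessionStateLike = {
    val className =
      if (enableHive) "HiveAwareSessionState" else "BasicSessionState"
    Class.forName(className).getConstructor().newInstance()
      .asInstanceOf[SessionStateLike]
  }

  def main(args: Array[String]): Unit =
    println(instantiate(enableHive = true).describe) // prints "hive-aware"
}
{code}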






[jira] [Commented] (SPARK-13643) Create SparkSession interface

2016-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249460#comment-15249460
 ] 

Apache Spark commented on SPARK-13643:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/12522

> Create SparkSession interface
> -
>
> Key: SPARK-13643
> URL: https://issues.apache.org/jira/browse/SPARK-13643
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Created] (SPARK-14748) BoundReference should not set ExprCode.code to empty string

2016-04-20 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-14748:
--

 Summary: BoundReference should not set ExprCode.code to empty 
string
 Key: SPARK-14748
 URL: https://issues.apache.org/jira/browse/SPARK-14748
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Sameer Agarwal









[jira] [Assigned] (SPARK-14748) BoundReference should not set ExprCode.code to empty string

2016-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14748:


Assignee: Apache Spark

> BoundReference should not set ExprCode.code to empty string
> ---
>
> Key: SPARK-14748
> URL: https://issues.apache.org/jira/browse/SPARK-14748
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-14748) BoundReference should not set ExprCode.code to empty string

2016-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249469#comment-15249469
 ] 

Apache Spark commented on SPARK-14748:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12523

> BoundReference should not set ExprCode.code to empty string
> ---
>
> Key: SPARK-14748
> URL: https://issues.apache.org/jira/browse/SPARK-14748
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>







[jira] [Assigned] (SPARK-14748) BoundReference should not set ExprCode.code to empty string

2016-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14748:


Assignee: (was: Apache Spark)

> BoundReference should not set ExprCode.code to empty string
> ---
>
> Key: SPARK-14748
> URL: https://issues.apache.org/jira/browse/SPARK-14748
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>







[jira] [Resolved] (SPARK-14051) Implement `Double.NaN==Float.NaN` for consistency

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14051.
---
Resolution: Won't Fix

> Implement `Double.NaN==Float.NaN` for consistency
> -
>
> Key: SPARK-14051
> URL: https://issues.apache.org/jira/browse/SPARK-14051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since SPARK-9079 and SPARK-9145, `NaN = NaN` returns true and works well. The 
> only exception is a direct comparison between `Row(Float.NaN)` and 
> `Row(Double.NaN)`. In the following example, the last two expressions should 
> be *true* and *List([NaN])* for consistency.
> {code}
> scala> 
> Seq((1d,1f),(Double.NaN,Float.NaN)).toDF("a","b").registerTempTable("tmp")
> scala> sql("select a,b,a=b from tmp").collect()
> res1: Array[org.apache.spark.sql.Row] = Array([1.0,1.0,true], [NaN,NaN,true])
> scala> val row_a = sql("select a from tmp").collect()
> row_a: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
> scala> val row_b = sql("select b from tmp").collect()
> row_b: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
> scala> row_a(0) == row_b(0)
> res2: Boolean = true
> scala> List(row_a(0),row_b(0)).distinct
> res3: List[org.apache.spark.sql.Row] = List([1.0])
> scala> row_a(1) == row_b(1)
> res4: Boolean = false
> scala> List(row_a(1),row_b(1)).distinct
> res5: List[org.apache.spark.sql.Row] = List([NaN], [NaN])
> {code}
> Please note the following background truths as of today:
> * Double.NaN != Double.NaN (Scala/Java/IEEE Standard)
> * Float.NaN != Float.NaN (Scala/Java/IEEE Standard)
> * Double.NaN != Float.NaN (Scala/Java/IEEE Standard)
> * Row(Double.NaN) == Row(Double.NaN)
> * Row(Float.NaN) == Row(Float.NaN)
> * *Row(Double.NaN) != Row(Float.NaN)*  <== The problem of this issue.






[jira] [Updated] (SPARK-14614) Add `bround` function

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14614:
--
Assignee: Dongjoon Hyun

> Add `bround` function
> -
>
> Key: SPARK-14614
> URL: https://issues.apache.org/jira/browse/SPARK-14614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> This issue aims to add a `bround` function (aka banker's rounding) by extending 
> the current `round` implementation.
> Hive has supported `bround` since 1.3.0; see the [Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF].
> {code}
> hive> select round(2.5), bround(2.5);
> OK
> 3.0   2.0
> {code}
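For readers unfamiliar with banker's rounding (round half to even), its semantics can be sketched independently of Spark with java.math.BigDecimal; this only illustrates the rounding mode, not Spark's implementation:

{code}
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// round(): ties go away from zero; bround(): ties go to the nearest even digit.
object BroundSketch {
  def round(x: Double): Double =
    new JBigDecimal(x).setScale(0, RoundingMode.HALF_UP).doubleValue()

  def bround(x: Double): Double =
    new JBigDecimal(x).setScale(0, RoundingMode.HALF_EVEN).doubleValue()

  def main(args: Array[String]): Unit = {
    println((round(2.5), bround(2.5))) // (3.0, 2.0)
    println((round(3.5), bround(3.5))) // (4.0, 4.0) -- 4 is already even
  }
}
{code}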






[jira] [Updated] (SPARK-13419) SubquerySuite should use checkAnswer rather than ScalaTest's assertResult

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13419:
--
Assignee: Luciano Resende

> SubquerySuite should use checkAnswer rather than ScalaTest's assertResult
> -
>
> Key: SPARK-13419
> URL: https://issues.apache.org/jira/browse/SPARK-13419
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Luciano Resende
> Fix For: 2.0.0
>
>
> This is blocked by being able to generate SQL for subqueries.






[jira] [Updated] (SPARK-14600) Push predicates through Expand

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14600:
--
Assignee: Wenchen Fan

> Push predicates through Expand
> --
>
> Key: SPARK-14600
> URL: https://issues.apache.org/jira/browse/SPARK-14600
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> A grouping set is analyzed as Aggregate(Expand(Project)). The grouping 
> attributes come from Project but have different meanings in Project (equal to 
> the original grouping expression) and in Expand (either the original grouping 
> expression or null). This does not make sense, because the same attribute 
> produces different results in different operators.
> A better plan shape could be Aggregate(Expand()); we would then need to fix SQL 
> generation.






[jira] [Updated] (SPARK-13929) Use Scala reflection for UDFs

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13929:
--
Assignee: Joan Goyeau

> Use Scala reflection for UDFs
> -
>
> Key: SPARK-13929
> URL: https://issues.apache.org/jira/browse/SPARK-13929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jakob Odersky
>Assignee: Joan Goyeau
>Priority: Minor
> Fix For: 2.0.0
>
>
> {{ScalaReflection}} uses native Java reflection for User Defined Types which 
> would fail if such types are not plain Scala classes that map 1:1 to Java.
> Consider the following extract (from here 
> https://github.com/apache/spark/blob/92024797a4fad594b5314f3f3be5c6be2434de8a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L376
>  ):
> {code}
> case t if Utils.classIsLoadable(className) &&
> Utils.classForName(className).isAnnotationPresent(classOf[SQLUserDefinedType])
>  =>
> val udt = 
> Utils.classForName(className).getAnnotation(classOf[SQLUserDefinedType]).udt().newInstance()
> //...
> {code}
> If {{t}}'s runtime class is actually synthetic (something that doesn't exist 
> in Java and hence uses a dollar sign internally), such as nested classes or 
> package objects, the above code will fail.
> Currently there are no known use cases of synthetic user-defined types (hence 
> the minor priority); however, it would be best practice to remove plain Java 
> reflection and rely on Scala reflection instead.
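A small self-contained illustration of the underlying name mismatch (not Spark code): for a nested type, the Scala-level name and the JVM class name (which is what a plain Class.forName/classForName lookup needs) differ, while Scala reflection resolves the runtime class directly.

{code}
import scala.reflect.runtime.{universe => ru}

object Outer {
  case class Inner(x: Int)
}

object SyntheticNameSketch {
  def main(args: Array[String]): Unit = {
    val mirror = ru.runtimeMirror(getClass.getClassLoader)
    val tpe = ru.typeOf[Outer.Inner]
    // Scala-level name, e.g. "Outer.Inner": what a className string built
    // from the type looks like.
    println(tpe.typeSymbol.fullName)
    // JVM name of the same class, e.g. "Outer$Inner": what Class.forName
    // actually needs. Scala reflection resolves it for us.
    println(mirror.runtimeClass(tpe).getName)
  }
}
{code}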






[jira] [Updated] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4226:
-
Assignee: Herman van Hovell

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = new HiveContext(sc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near-future release (1.3, maybe?).






[jira] [Created] (SPARK-14749) PlannerSuite failed when it runs individually

2016-04-20 Thread Yin Huai (JIRA)
Yin Huai created SPARK-14749:


 Summary: PlannerSuite failed when it runs individually
 Key: SPARK-14749
 URL: https://issues.apache.org/jira/browse/SPARK-14749
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Reporter: Yin Huai
Priority: Minor


If you try {{test-only *PlannerSuite -- -z "count is partially aggregated"}}, 
you will see
{code}
[info] - count is partially aggregated *** FAILED *** (104 milliseconds)
[info]   java.lang.NullPointerException:
[info]   at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate.(TungstenAggregate.scala:76)
[info]   at 
org.apache.spark.sql.execution.aggregate.Utils$.createAggregate(utils.scala:60)
[info]   at 
org.apache.spark.sql.execution.aggregate.Utils$.planAggregateWithoutDistinct(utils.scala:97)
[info]   at 
org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:258)
[info]   at 
org.apache.spark.sql.execution.PlannerSuite.org$apache$spark$sql$execution$PlannerSuite$$testPartialAggregationPlan(PlannerSuite.scala:43)
[info]   at 
org.apache.spark.sql.execution.PlannerSuite$$anonfun$1.apply$mcV$sp(PlannerSuite.scala:58)
[info]   at 
org.apache.spark.sql.execution.PlannerSuite$$anonfun$1.apply(PlannerSuite.scala:56)
[info]   at 
org.apache.spark.sql.execution.PlannerSuite$$anonfun$1.apply(PlannerSuite.scala:56)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:56)
[info]   at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
[info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
[info]   at scala.collection.immutable.List.foreach(List.scala:381)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
[info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
[info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
[info]   at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
[info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
[info]   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:28)
[info]   at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
[info]   at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:28)
[info]   at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
[info]   at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
[info]   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[info]   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[info]   at java.lang.Thread.run(Thread.java:745)
{code}

The cause is that the {{activeContext}} (in the {{SQLContext}} companion object) 
is not set.
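For readers unfamiliar with that machinery, the activeContext is essentially a per-thread "current context" holder; the following self-contained sketch of the pattern (not Spark's actual code) shows why planning code that reads it blindly hits a NullPointerException when nothing has registered itself yet:

{code}
// Sketch of an "active context" holder, the pattern behind SQLContext's
// activeContext; not Spark's actual code.
object ActiveContextSketch {
  final class Ctx(val name: String)

  private val active = new InheritableThreadLocal[Ctx]

  def setActive(ctx: Ctx): Unit = active.set(ctx)
  def getActive: Option[Ctx] = Option(active.get())

  def main(args: Array[String]): Unit = {
    // Code that blindly dereferences active.get() would NPE at this point,
    // which is what the suite hits when it runs on its own.
    println(getActive.map(_.name)) // None: nothing registered yet
    setActive(new Ctx("test-context"))
    println(getActive.map(_.name)) // Some(test-context)
  }
}
{code}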




[jira] [Updated] (SPARK-12917) Add DML support to Spark SQL for HIVE

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12917:
--
Priority: Major  (was: Blocker)

> Add DML support to Spark SQL for HIVE
> -
>
> Key: SPARK-12917
> URL: https://issues.apache.org/jira/browse/SPARK-12917
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hemang Nagar
>
> Spark SQL should be updated to support the DML operations that Hive has 
> supported since 0.14.






[jira] [Updated] (SPARK-13425) Documentation for CSV datasource options

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13425:
--
Priority: Major  (was: Blocker)

> Documentation for CSV datasource options
> 
>
> Key: SPARK-13425
> URL: https://issues.apache.org/jira/browse/SPARK-13425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> As noted in https://github.com/apache/spark/pull/11262#discussion_r53508815,
> the CSV datasource is added in Spark 2.0.0, and therefore its options should 
> be added to the documentation.
> The options can be found 
> [here|https://issues.apache.org/jira/secure/attachment/12779313/Built-in%20CSV%20datasource%20in%20Spark.pdf]
>  in the Parsing Options section.
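Until that documentation lands, the options are the ones passed through DataFrameReader; a minimal sketch follows (the path is a placeholder; header, delimiter and inferSchema are examples of the parsing options in question):

{code}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Sketch of how the CSV parsing options are supplied to the datasource.
object CsvOptionsSketch {
  def readCsv(sqlContext: SQLContext): DataFrame =
    sqlContext.read
      .format("csv")
      .option("header", "true")      // first line contains column names
      .option("delimiter", ",")      // field separator
      .option("inferSchema", "true") // infer column types instead of all strings
      .load("/path/to/data.csv")     // placeholder path
}
{code}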






[jira] [Updated] (SPARK-9278) DataFrameWriter.insertInto inserts incorrect data

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9278:
-
Priority: Critical  (was: Blocker)

> DataFrameWriter.insertInto inserts incorrect data
> -
>
> Key: SPARK-9278
> URL: https://issues.apache.org/jira/browse/SPARK-9278
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Linux, S3, Hive Metastore
>Reporter: Steve Lindemann
>Assignee: Cheng Lian
>Priority: Critical
>
> After creating a partitioned Hive table (stored as Parquet) via the 
> DataFrameWriter.createTable command, subsequent attempts to insert additional 
> data into new partitions of this table result in inserting incorrect data 
> rows. Reordering the columns in the data to be written seems to avoid this 
> issue.
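The reordering workaround mentioned above amounts to selecting the DataFrame's columns in the target table's declared order before calling insertInto, since insertInto matches columns by position (a sketch; the table name is hypothetical):

{code}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

// Sketch of the workaround described above. insertInto matches columns by
// position, so align the DataFrame's column order with the table's first.
// "events" is a hypothetical table name.
object InsertIntoWorkaround {
  def insertAligned(sqlContext: SQLContext, df: DataFrame): Unit = {
    val tableColumnOrder = sqlContext.table("events").columns // declared order
    df.select(tableColumnOrder.map(col): _*)
      .write
      .insertInto("events")
  }
}
{code}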






[jira] [Updated] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4105:
-
Priority: Critical  (was: Blocker)

> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Attachments: JavaObjectToSerialize.java, 
> SparkFailedToUncompressGenerator.scala
>
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here's another occurrence of a similar error:
> {code}
> java.io.IOException: failed t

[jira] [Updated] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14521:
--
Target Version/s: 2.0.0

> StackOverflowError in Kryo when executing TPC-DS
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Priority: Blocker
>
> Build details: Spark build from the master branch (Apr-10)
> Dataset: TPC-DS at 200 GB scale, in Parquet format, stored in Hive.
> Client: $SPARK_HOME/bin/beeline
> Query: TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyway)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}






[jira] [Updated] (SPARK-14171) UDAF aggregates argument object inspector not parsed correctly

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14171:
--
Priority: Major  (was: Blocker)

> UDAF aggregates argument object inspector not parsed correctly
> --
>
> Key: SPARK-14171
> URL: https://issues.apache.org/jira/browse/SPARK-14171
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jianfeng Hu
>
> For example, when using percentile_approx and count distinct together, it 
> raises an error complaining that the argument is not constant. We have a test 
> case to reproduce it. Could you help look into a fix for this? This was working 
> in a previous version (Spark 1.4 + Hive 0.13). Thanks!
> {code}--- 
> a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
> +++ 
> b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
> @@ -148,6 +148,9 @@ class HiveUDFSuite extends QueryTest with 
> TestHiveSingleton with SQLTestUtils {
>  checkAnswer(sql("SELECT percentile_approx(100.0, array(0.9, 0.9)) FROM 
> src LIMIT 1"),
>sql("SELECT array(100, 100) FROM src LIMIT 1").collect().toSeq)
> +
> +checkAnswer(sql("SELECT percentile_approx(key, 0.9), count(distinct 
> key) FROM src LIMIT 1"),
> +  sql("SELECT max(key), 1 FROM src LIMIT 1").collect().toSeq)
> }
>test("UDFIntegerToString") {
> {code}
> When running the test suite, we can see this error:
> {code}
> - Generic UDAF aggregates *** FAILED ***
>   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: 
> hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6e1dc6a7),key#51176,0.9,false,0,0)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:238)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter.org$apache$spark$sql$catalyst$analysis$DistinctAggregationRewriter$$patchAggregateFunctionChildren$1(DistinctAggregationRewriter.scala:148)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:192)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:190)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   ...
>   Cause: java.lang.reflect.InvocationTargetException:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:367)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:365)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
>   ...
>   Cause: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The second 
> argument must be a constant, but double was passed instead.
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox.getEvaluator(GenericUDAFPercentileApprox.java:147)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector$lzycompute(hiveUDFs.scala:598)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector(hiveUDFs.scala:596)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector$lzycompute(hiveUDFs.scala:606)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector(hiveUDFs.scala:606)
>   at org.apache.spark.sql.hive.HiveUDAFFunction.(hiveUDFs.scala:654)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newIns

[jira] [Updated] (SPARK-12878) Dataframe fails with nested User Defined Types

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12878:
--
Priority: Major  (was: Blocker)

> Dataframe fails with nested User Defined Types
> --
>
> Key: SPARK-12878
> URL: https://issues.apache.org/jira/browse/SPARK-12878
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Joao
>
> Spark 1.6.0 crashes when using nested User Defined Types in a DataFrame. 
> In version 1.5.2 the code below worked just fine:
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
> import org.apache.spark.sql.types._
> @SQLUserDefinedType(udt = classOf[AUDT])
> case class A(list:Seq[B])
> class AUDT extends UserDefinedType[A] {
>   override def sqlType: DataType = StructType(Seq(StructField("list", 
> ArrayType(BUDT, containsNull = false), nullable = true)))
>   override def userClass: Class[A] = classOf[A]
>   override def serialize(obj: Any): Any = obj match {
> case A(list) =>
>   val row = new GenericMutableRow(1)
>   row.update(0, new 
> GenericArrayData(list.map(_.asInstanceOf[Any]).toArray))
>   row
>   }
>   override def deserialize(datum: Any): A = {
> datum match {
>   case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq)
> }
>   }
> }
> object AUDT extends AUDT
> @SQLUserDefinedType(udt = classOf[BUDT])
> case class B(text:Int)
> class BUDT extends UserDefinedType[B] {
>   override def sqlType: DataType = StructType(Seq(StructField("num", 
> IntegerType, nullable = false)))
>   override def userClass: Class[B] = classOf[B]
>   override def serialize(obj: Any): Any = obj match {
> case B(text) =>
>   val row = new GenericMutableRow(1)
>   row.setInt(0, text)
>   row
>   }
>   override def deserialize(datum: Any): B = {
> datum match {  case row: InternalRow => new B(row.getInt(0))  }
>   }
> }
> object BUDT extends BUDT
> object Test {
>   def main(args:Array[String]) = {
> val col = Seq(new A(Seq(new B(1), new B(2))),
>   new A(Seq(new B(3), new B(4))))
> val sc = new SparkContext(new 
> SparkConf().setMaster("local[1]").setAppName("TestSpark"))
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> val df = sc.parallelize(1 to 2 zip col).toDF("id","b")
> df.select("b").show()
> df.collect().foreach(println)
>   }
> }
> In the new version (1.6.0) I needed to include the following import:
> import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
> However, Spark crashes at runtime:
> 16/01/18 14:36:22 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to 
> org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:248)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
>   at 
> org.apache.

[jira] [Updated] (SPARK-14526) The catalog of SQLContext should not be case-sensitive

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14526:
--
Target Version/s: 2.0.0

> The catalog of SQLContext should not be case-sensitive 
> ---
>
> Key: SPARK-14526
> URL: https://issues.apache.org/jira/browse/SPARK-14526
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Andrew Or
>Priority: Blocker
>
> {code}
> >>> from pyspark.sql import SQLContext
> >>> ctx = SQLContext(sc)
> >>> ctx.range(10).registerTempTable("t")
> >>> ctx.table("T")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/davies/work/spark/python/pyspark/sql/context.py", line 522, in 
> table
> return DataFrame(self._ssql_ctx.table(tableName), self)
>   File 
> "/Users/davies/work/spark/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py",
>  line 836, in __call__
>   File "/Users/davies/work/spark/python/pyspark/sql/utils.py", line 57, in 
> deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u"Table not found: 'T' does not exist in 
> database 'default';"
> {code}
> Is this a feature or a bug?






[jira] [Created] (SPARK-14750) Make historyServer refer application log in hdfs

2016-04-20 Thread SuYan (JIRA)
SuYan created SPARK-14750:
-

 Summary: Make historyServer refer application log in hdfs
 Key: SPARK-14750
 URL: https://issues.apache.org/jira/browse/SPARK-14750
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.6.1
Reporter: SuYan
 Fix For: 1.6.1


Make the history server reference application logs in HDFS, just like the MR history server does.






[jira] [Commented] (SPARK-7258) spark.ml API taking Graph instead of DataFrame

2016-04-20 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249509#comment-15249509
 ] 

zhengruifeng commented on SPARK-7258:
-

I have only encountered a Graph-like API in Keras, with which I have constructed 
some complex networks.
Keras has two modeling APIs: Sequential (like Pipelines, adding layers one by one) 
and Graph (adding nodes).
Its Graph API looks like this:
{code}
graph = Graph()  
graph.add_input(name='input1', input_shape=(32,))  
graph.add_input(name='input2', input_shape=(32,))  
graph.add_node(Dense(16), name='dense1', input='input1')  
graph.add_node(Dense(4), name='dense2', input='input2')  
graph.add_node(Dense(4), name='dense3', input='dense1')  
graph.add_output(name='output', inputs=['dense2', 'dense3'], merge_mode='sum')  
graph.compile('rmsprop', {'output':'mse'})  
history = graph.fit({'input1':X_train, 'input2':X2_train, 'output':y_train}, 
nb_epoch=10)  
predictions = graph.predict({'input1':X_test, 'input2':X2_test})
{code}

And the API here may perhaps look like this:
{code}
val graph = Graph()
graph.addInput(name="input1", path='...')
graph.addInput(name="input2", path='...')
graph.addSQLNode(name="sql", inputs=["input1", "input2"], sql="select * from  
... join ...")
graph.addNode(name="tfidf", transformer=...)
...
graph.addOutput(name="output")
{code}

> spark.ml API taking Graph instead of DataFrame
> --
>
> Key: SPARK-7258
> URL: https://issues.apache.org/jira/browse/SPARK-7258
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, ML
>Reporter: Joseph K. Bradley
>
> It would be useful to have an API in ML Pipelines for working with Graphs, 
> not just DataFrames.






[jira] [Issue Comment Deleted] (SPARK-7258) spark.ml API taking Graph instead of DataFrame

2016-04-20 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-7258:

Comment: was deleted

(was: I have only encounter Graph-like API in Keras, within which I construct 
some complex networks.
Keras had two modeling API: Sequential (like Pipelines, add layers one by one), 
Graph (add nodes).
Its Graph API looks like this:
{code}
graph = Graph()  
graph.add_input(name='input1', input_shape=(32,))  
graph.add_input(name='input2', input_shape=(32,))  
graph.add_node(Dense(16), name='dense1', input='input1')  
graph.add_node(Dense(4), name='dense2', input='input2')  
graph.add_node(Dense(4), name='dense3', input='dense1')  
graph.add_output(name='output', inputs=['dense2', 'dense3'], merge_mode='sum')  
graph.compile('rmsprop', {'output':'mse'})  
history = graph.fit({'input1':X_train, 'input2':X2_train, 'output':y_train}, 
nb_epoch=10)  
predictions = graph.predict({'input1':X_test, 'input2':X2_test})
{code}

And the perhaps API may look like this:
{code}
val graph = Graph()
graph.addInput(name="input1", path='...')
graph.addInput(name="input2", path='...')
graph.addSQLNode(name="sql", inputs=["input1", "input2"], sql="select * from  
... join ...")
graph.addNode(name="tfidf", transformer=...)
...
graph.addOutput(name="output")
{code})

> spark.ml API taking Graph instead of DataFrame
> --
>
> Key: SPARK-7258
> URL: https://issues.apache.org/jira/browse/SPARK-7258
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, ML
>Reporter: Joseph K. Bradley
>
> It would be useful to have an API in ML Pipelines for working with Graphs, 
> not just DataFrames.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14744) Put examples packaging on a diet

2016-04-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249513#comment-15249513
 ] 

Sean Owen commented on SPARK-14744:
---

+1 and +1 for just removing the Cassandra code

> Put examples packaging on a diet
> 
>
> Key: SPARK-14744
> URL: https://issues.apache.org/jira/browse/SPARK-14744
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Currently the examples bring in a lot of external dependencies, ballooning 
> the size of the Spark distribution packages.
> I'd like to propose two things to slim down these dependencies:
> - make all non-Spark, and also Spark Streaming, dependencies "provided". This 
> means, especially for streaming connectors, that launching examples becomes 
> more like launching real applications (where you need to figure out how to 
> provide those dependencies, e.g. using {{--packages}}).
> - audit examples and remove those that don't provide a lot of value. For 
> example, HBase is working on full-featured Spark bindings, based on code that 
> has already been in use for a while before being merged into HBase. The HBase 
> example in Spark is very bare bones and, in comparison, not really useful and 
> in fact a little misleading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14745) CEP support in Spark Streaming

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14745:
--
Target Version/s:   (was: 2.1.0)

This seems like a great project on top of Spark. What of this _requires_ 
changes in Spark? (Don't set Target)

> CEP support in Spark Streaming
> --
>
> Key: SPARK-14745
> URL: https://issues.apache.org/jira/browse/SPARK-14745
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Mario Briggs
> Attachments: SparkStreamingCEP.pdf
>
>
> Complex Event Processing is an often-used feature in streaming applications. 
> Spark Streaming currently does not have a DSL/API for it. This JIRA is about 
> how/what we can add in Spark Streaming to support CEP out of the box.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14594) Improve error messages for RDD API

2016-04-20 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-14594:

Affects Version/s: (was: 1.6.0)
   1.5.2

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get as 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14594) Improve error messages for RDD API

2016-04-20 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249521#comment-15249521
 ] 

Marco Gaido commented on SPARK-14594:
-

I am using Spark 1.5.2. Maybe the issue is resolved now.

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get as 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14751) SparkR fails on Cassandra map with numeric key

2016-04-20 Thread JIRA
Michał Matłoka created SPARK-14751:
--

 Summary: SparkR fails on Cassandra map with numeric key
 Key: SPARK-14751
 URL: https://issues.apache.org/jira/browse/SPARK-14751
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.6.1
Reporter: Michał Matłoka


Hi,
I have created an issue for the Spark Cassandra connector 
(https://datastax-oss.atlassian.net/projects/SPARKC/issues/SPARKC-366), but 
after a bit of digging it seems this is a better place for it:

{code}
CREATE TABLE test.map (
id text,
somemap map,
PRIMARY KEY (id)
);

insert into test.map(id, somemap) values ('a', { 0 : 12 }); 
{code}
{code}
  sqlContext <- sparkRSQL.init(sc)
  test <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", 
keyspace = "test", table = "map")
  head(test)
{code}
Results in:
{code}
16/04/19 14:47:02 ERROR RBackendHandler: dfToCols on 
org.apache.spark.sql.api.r.SQLUtils failed
Error in readBin(con, raw(), stringLen, endian = "big") :
  invalid 'n' argument
{code}

The problem occurs even for an int key. For a text key it works. Every scenario works 
under Scala & Python.
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14658) when executor lost DAGScheduler may submit one stage twice even if the first running taskset for this stage is not finished

2016-04-20 Thread yixiaohua (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249527#comment-15249527
 ] 

yixiaohua commented on SPARK-14658:
---

https://github.com/apache/spark/pull/12524

> when executor lost DAGScheduler may submit one stage twice even if the first 
> running taskset for this stage is not finished
> --
>
> Key: SPARK-14658
> URL: https://issues.apache.org/jira/browse/SPARK-14658
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: spark1.6.1  hadoop-2.6.0-cdh5.4.2
>Reporter: yixiaohua
>
> 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: 
> DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 57: 
> 57.2,57.1
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> First Time:
> 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at 
> AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, 
> 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, 
> 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, 
> 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, 
> 150, 151, 154, 155, 158, 159, 162, 163, 164, 165, 166, 167, 170, 171, 172, 
> 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, 
> 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239
> 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57)
> 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List()
> 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 
> (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no 
> missing parents
> 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57)
> 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from 
> ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2)
> 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, 
> 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, 
> 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, 
> 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, 
> 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, 
> 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, 
> 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110)
> Second Time:
> 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at 
> AccessController.java:-2) because some of its tasks had failed: 26
> 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57)
> 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List()
> 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 
> (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no 
> missing parents
> 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57)
> 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from 
> ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2)
> 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: Set(26)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14744) Put examples packaging on a diet

2016-04-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249533#comment-15249533
 ] 

Sean Owen commented on SPARK-14744:
---

+1 and +1 to removing the Cassandra code

> Put examples packaging on a diet
> 
>
> Key: SPARK-14744
> URL: https://issues.apache.org/jira/browse/SPARK-14744
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Currently the examples bring in a lot of external dependencies, ballooning 
> the size of the Spark distribution packages.
> I'd like to propose two things to slim down these dependencies:
> - make all non-Spark, and also Spark Streaming, dependencies "provided". This 
> means, especially for streaming connectors, that launching examples becomes 
> more like launching real applications (where you need to figure out how to 
> provide those dependencies, e.g. using {{--packages}}).
> - audit examples and remove those that don't provide a lot of value. For 
> example, HBase is working on full-featured Spark bindings, based on code that 
> has already been in use for a while before being merged into HBase. The HBase 
> example in Spark is very bare bones and, in comparison, not really useful and 
> in fact a little misleading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14283) Avoid sort in randomSplit when possible

2016-04-20 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-14283:
-
Comment: was deleted

(was: [~josephkb] I can work on this.
There should be a version of randomSplit that avoid the local sort which is 
meaningless in ML.
But the calls in ML should be add a extra param to avoid local sort IMO.)

> Avoid sort in randomSplit when possible
> ---
>
> Key: SPARK-14283
> URL: https://issues.apache.org/jira/browse/SPARK-14283
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Joseph K. Bradley
>
> Dataset.randomSplit sorts each partition in order to guarantee an ordering 
> and make randomSplit deterministic given the seed.  Since randomSplit is used 
> a fair amount in ML, it would be great to avoid the sort when possible.
> Are there cases when it could be avoided?
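For context, a minimal sketch of the call pattern this concerns (the DataFrame {{df}}, the split ratios, and the seed are illustrative, not taken from any Spark code):
{code}
// typical ML train/test split; the per-partition sort is what currently makes
// this reproducible for a fixed seed
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
{code}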



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14744) Put examples packaging on a diet

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14744:
--
Comment: was deleted

(was: +1 and +1 to removing the Cassandra code)

> Put examples packaging on a diet
> 
>
> Key: SPARK-14744
> URL: https://issues.apache.org/jira/browse/SPARK-14744
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Currently the examples bring in a lot of external dependencies, ballooning 
> the size of the Spark distribution packages.
> I'd like to propose two things to slim down these dependencies:
> - make all non-Spark, and also Spark Streaming, dependencies "provided". This 
> means, especially for streaming connectors, that launching examples becomes 
> more like launching real applications (where you need to figure out how to 
> provide those dependencies, e.g. using {{--packages}}).
> - audit examples and remove those that don't provide a lot of value. For 
> example, HBase is working on full-featured Spark bindings, based on code that 
> has already been in use for a while before being merged into HBase. The HBase 
> example in Spark is very bare bones and, in comparison, not really useful and 
> in fact a little misleading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14725) Remove HttpServer

2016-04-20 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249557#comment-15249557
 ] 

Saisai Shao commented on SPARK-14725:
-

I think it now forces the use of RPC and there's no other option to configure:

{code}
_conf.getOption("spark.repl.class.outputDir").foreach { path =>
  val replUri = _env.rpcEnv.fileServer.addDirectory("/classes", new 
File(path))
  _conf.set("spark.repl.class.uri", replUri)
}
{code}

Also there's no code creating and launching this {{HttpServer}}.

> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Saisai Shao
>Priority: Minor
>
> {{HttpServer}}, which used to support broadcast variables and jar/file 
> transmission, now seems obsolete. Searching the code, no class actually 
> depends on it except one unit test, so I propose to remove it here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10496) Efficient DataFrame cumulative sum

2016-04-20 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249560#comment-15249560
 ] 

zhengruifeng commented on SPARK-10496:
--

I can have a try and add this new API in DataFrameStatFunctions.
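For reference, a sketch of the window-function approach mentioned in the description (assuming a DataFrame {{df}} with a numeric column {{X}} and an ordering column {{id}}; both names are illustrative):
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// running sum of X in id order; the single unpartitioned window is what makes
// this slow for a large number of rows
val w = Window.orderBy("id")
val withCumSum = df.withColumn("Y", sum("X").over(w))
{code}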

> Efficient DataFrame cumulative sum
> --
>
> Key: SPARK-10496
> URL: https://issues.apache.org/jira/browse/SPARK-10496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Goal: Given a DataFrame with a numeric column X, create a new column Y which 
> is the cumulative sum of X.
> This can be done with window functions, but it is not efficient for a large 
> number of rows.  It could be done more efficiently using a prefix sum/scan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14752) LazilyGenerateOrdering throws NullPointerException with TakeOrderedAndProject

2016-04-20 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14752:


 Summary: LazilyGenerateOrdering throws NullPointerException with 
TakeOrderedAndProject
 Key: SPARK-14752
 URL: https://issues.apache.org/jira/browse/SPARK-14752
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Rajesh Balamohan


codebase: spark master

DataSet: TPC-DS

Client: $SPARK_HOME/bin/beeline

Example query to reproduce the issue:  
select i_item_id from item order by i_item_id limit 10;

Explain plan output
{noformat}
explain select i_item_id from item order by i_item_id limit 10;
+--+--+
|                                      plan                                       |
+--+--+
| == Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[i_item_id#1229 ASC], 
output=[i_item_id#1229])
+- WholeStageCodegen
   :  +- Project [i_item_id#1229]
   : +- Scan HadoopFiles[i_item_id#1229] Format: ORC, PushedFilters: [], 
ReadSchema: struct  |
+--+--+
{noformat}

Exception:
{noformat}
TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
underlying (org.apache.spark.util.BoundedPriorityQueue)
at 
com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
at 
org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1791)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
at java.util.PriorityQueue.offer(PriorityQueue.java:344)
at java.util.PriorityQueue.add(PriorityQueue.java:321)
at 
com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
at 
com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at 
com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14752) LazilyGenerateOrdering throws NullPointerException with TakeOrderedAndProject

2016-04-20 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249576#comment-15249576
 ] 

Rajesh Balamohan commented on SPARK-14752:
--

Changing generatedOrdering in LazilyGeneratedOrdering to 
{noformat}
private[this] lazy val generatedOrdering = GenerateOrdering.generate(ordering)
{noformat} 
solves the issue and the query runs fine. 

> LazilyGenerateOrdering throws NullPointerException with TakeOrderedAndProject
> -
>
> Key: SPARK-14752
> URL: https://issues.apache.org/jira/browse/SPARK-14752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> codebase: spark master
> DataSet: TPC-DS
> Client: $SPARK_HOME/bin/beeline
> Example query to reproduce the issue:  
> select i_item_id from item order by i_item_id limit 10;
> Explain plan output
> {noformat}
> explain select i_item_id from item order by i_item_id limit 10;
> +--+--+
> |                                      plan                                       |
> +--+--+
> | == Physical Plan ==
> TakeOrderedAndProject(limit=10, orderBy=[i_item_id#1229 ASC], 
> output=[i_item_id#1229])
> +- WholeStageCodegen
>:  +- Project [i_item_id#1229]
>: +- Scan HadoopFiles[i_item_id#1229] Format: ORC, PushedFilters: [], 
> ReadSchema: struct  |
> +--+--+
> {noformat}
> Exception:
> {noformat}
> TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1791)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:344)
>   at java.util.PriorityQueue.add(PriorityQueue.java:321)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
> 

[jira] [Created] (SPARK-14753) remove internal flag in Accumulable

2016-04-20 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-14753:
---

 Summary: remove internal flag in Accumulable
 Key: SPARK-14753
 URL: https://issues.apache.org/jira/browse/SPARK-14753
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14750) Make historyServer refer application log in hdfs

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14750:
--
Fix Version/s: (was: 1.6.1)

There's no detail here at all. What is the problem and the proposed solution? 
Otherwise this should be closed.

> Make historyServer refer application log in hdfs
> 
>
> Key: SPARK-14750
> URL: https://issues.apache.org/jira/browse/SPARK-14750
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.1
>Reporter: SuYan
>
> Make the history server point to the application logs in HDFS, just like the MR history server does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14737) Kafka Brokers are down - spark stream should retry

2016-04-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249586#comment-15249586
 ] 

Sean Owen commented on SPARK-14737:
---

If all brokers are down, that's a fatal error. It's correct (IMHO) to fail. Your 
application, however, could recreate a stream.
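A rough sketch of such a restart loop, reusing the direct-stream setup from the report (the batch interval, back-off, and processing body are illustrative, and this is not necessarily the recommended production pattern):
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

import scala.util.control.NonFatal

object ResilientDirectStream {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MyDataStreamProcessor"))
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1,broker2,broker3",
      "auto.offset.reset" -> "largest")
    val topics = Set("Topic1")

    var keepRunning = true
    while (keepRunning) {
      // build a fresh streaming context (and stream) on every attempt
      val ssc = new StreamingContext(sc, Seconds(10))
      try {
        val messages = KafkaUtils.createDirectStream[
          String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
        messages.foreachRDD { rdd => println(s"records in batch: ${rdd.count()}") }
        ssc.start()
        ssc.awaitTermination()   // rethrows the failure if the stream dies
        keepRunning = false      // a clean shutdown was requested
      } catch {
        case NonFatal(e) =>      // e.g. no Kafka broker reachable
          ssc.stop(stopSparkContext = false, stopGracefully = false)
          Thread.sleep(30000)    // back off before rebuilding the stream
      }
    }
    sc.stop()
  }
}
{code}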

> Kafka Brokers are down - spark stream should retry
> --
>
> Key: SPARK-14737
> URL: https://issues.apache.org/jira/browse/SPARK-14737
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.0
> Environment: Suse Linux, Cloudera Enterprise 5.4.8 (#7 built by 
> jenkins on 20151023-1205 git: d7dbdf29ac1d57ae9fb19958502d50dcf4e4fffd), 
> kafka_2.10-0.8.2.2
>Reporter: Faisal
>
> I have a Spark Streaming application that uses direct streaming, listening to 
> a Kafka topic.
> {code}
> HashMap<String, String> kafkaParams = new HashMap<String, String>();
> kafkaParams.put("metadata.broker.list", "broker1,broker2,broker3");
> kafkaParams.put("auto.offset.reset", "largest");
> HashSet<String> topicsSet = new HashSet<String>();
> topicsSet.add("Topic1");
> JavaPairInputDStream<String, String> messages = 
> KafkaUtils.createDirectStream(
> jssc, 
> String.class, 
> String.class,
> StringDecoder.class, 
> StringDecoder.class, 
> kafkaParams, 
> topicsSet
> );
> {code}
> I notice that when I stop/shut down the Kafka brokers, my Spark application 
> also shuts down.
> Here is the Spark execution script:
> {code}
> spark-submit \
> --master yarn-cluster \
> --files /home/siddiquf/spark/log4j-spark.xml
> --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \
> --conf 
> "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \
> --class com.example.MyDataStreamProcessor \
> myapp.jar 
> {code}
> The Spark job is submitted successfully and I can track the application driver 
> and worker/executor nodes.
> Everything works fine; my only concern is that if the Kafka brokers are offline 
> or restarted, my application (controlled by YARN) should not shut down, but it does.
> If this is the expected behavior, then how do we handle such a situation with the 
> least maintenance? Keep in mind that the Kafka cluster is not part of the Hadoop 
> cluster and is managed by a different team, which is why our application needs to 
> be resilient enough.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14742) Redirect spark-ec2 doc to new location

2016-04-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249588#comment-15249588
 ] 

Sean Owen commented on SPARK-14742:
---

That file has already been removed in master. [~shivaram] was referring to some 
page on the ASF site though -- what page was this and what needs to change? I 
agree, we should ideally change it with 2.0.

> Redirect spark-ec2 doc to new location
> --
>
> Key: SPARK-14742
> URL: https://issues.apache.org/jira/browse/SPARK-14742
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453
> We need to redirect this page
> http://spark.apache.org/docs/latest/ec2-scripts.html
> to this page
> https://github.com/amplab/spark-ec2#readme



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14753) remove internal flag in Accumulable

2016-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14753:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove internal flag in Accumulable
> ---
>
> Key: SPARK-14753
> URL: https://issues.apache.org/jira/browse/SPARK-14753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14753) remove internal flag in Accumulable

2016-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14753:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove internal flag in Accumulable
> ---
>
> Key: SPARK-14753
> URL: https://issues.apache.org/jira/browse/SPARK-14753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14753) remove internal flag in Accumulable

2016-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249590#comment-15249590
 ] 

Apache Spark commented on SPARK-14753:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12525

> remove internal flag in Accumulable
> ---
>
> Key: SPARK-14753
> URL: https://issues.apache.org/jira/browse/SPARK-14753
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14679) UI DAG visualization causes OOM generating data

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14679.
---
   Resolution: Fixed
Fix Version/s: 1.6.2
   2.0.0

Issue resolved by pull request 12437
[https://github.com/apache/spark/pull/12437]

> UI DAG visualization causes OOM generating data
> ---
>
> Key: SPARK-14679
> URL: https://issues.apache.org/jira/browse/SPARK-14679
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
> Fix For: 2.0.0, 1.6.2
>
>
> The UI will hit an OutOfMemoryException when generating the DAG visualization 
> data for large Hive table scans. The problem is that data is being duplicated 
> in the output for each RDD like cluster10 here:
> {code}
> digraph G {
>   subgraph clusterstage_1 {
> label="Stage 1";
> subgraph cluster7 {
>   label="TungstenAggregate";
>   9 [label="MapPartitionsRDD [9]\nrun at ThreadPoolExecutor.java:1142"];
> }
> subgraph cluster10 {
>   label="HiveTableScan";
>   7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>   6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>   5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
> }
> subgraph cluster10 {
>   label="HiveTableScan";
>   7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>   6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>   5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
> }
> subgraph cluster8 {
>   label="ConvertToUnsafe";
>   8 [label="MapPartitionsRDD [8]\nrun at ThreadPoolExecutor.java:1142"];
> }
> subgraph cluster10 {
>   label="HiveTableScan";
>   7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>   6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>   5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
> }
>   }
>   8->9;
>   6->7;
>   5->6;
>   7->8;
> }
> {code}
> Hive has a large number of RDDs because it creates an RDD for each partition 
> in the scan returned by the metastore. Each RDD results in another copy of 
> the cluster. The data is built with a StringBuilder and copied into a String, so the 
> memory required gets huge quickly.
> The cause is how the RDDOperationGraph gets generated. For each RDD, a nested 
> chain of RDDOperationCluster is produced and those are merged. But, there is 
> no implementation of equals for RDDOperationCluster, so they are always 
> distinct and accumulated rather than 
> [deduped|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala#L135].
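As a self-contained illustration of that last point (a toy class, not the actual {{RDDOperationCluster}}), adding structural equality is what lets the duplicates collapse:
{code}
// with equals/hashCode defined on (id, name), repeated clusters such as the
// three copies of cluster10 above can be de-duplicated when graphs are merged
class Cluster(val id: String, val name: String) {
  override def equals(other: Any): Boolean = other match {
    case that: Cluster => id == that.id && name == that.name
    case _ => false
  }
  override def hashCode(): Int = (id, name).hashCode()
}

object DedupDemo extends App {
  val clusters = Seq(
    new Cluster("cluster10", "HiveTableScan"),
    new Cluster("cluster10", "HiveTableScan"),
    new Cluster("cluster10", "HiveTableScan"))
  println(clusters.distinct.size)  // 1; with default reference equality it would be 3
}
{code}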



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14679) UI DAG visualization causes OOM generating data

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14679:
--
Assignee: Ryan Blue

> UI DAG visualization causes OOM generating data
> ---
>
> Key: SPARK-14679
> URL: https://issues.apache.org/jira/browse/SPARK-14679
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 1.6.2, 2.0.0
>
>
> The UI will hit an OutOfMemoryException when generating the DAG visualization 
> data for large Hive table scans. The problem is that data is being duplicated 
> in the output for each RDD like cluster10 here:
> {code}
> digraph G {
>   subgraph clusterstage_1 {
> label="Stage 1";
> subgraph cluster7 {
>   label="TungstenAggregate";
>   9 [label="MapPartitionsRDD [9]\nrun at ThreadPoolExecutor.java:1142"];
> }
> subgraph cluster10 {
>   label="HiveTableScan";
>   7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>   6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>   5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
> }
> subgraph cluster10 {
>   label="HiveTableScan";
>   7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>   6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>   5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
> }
> subgraph cluster8 {
>   label="ConvertToUnsafe";
>   8 [label="MapPartitionsRDD [8]\nrun at ThreadPoolExecutor.java:1142"];
> }
> subgraph cluster10 {
>   label="HiveTableScan";
>   7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>   6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>   5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
> }
>   }
>   8->9;
>   6->7;
>   5->6;
>   7->8;
> }
> {code}
> Hive has a large number of RDDs because it creates an RDD for each partition 
> in the scan returned by the metastore. Each RDD results in another copy of 
> the cluster. The data is built with a StringBuilder and copied into a String, so the 
> memory required gets huge quickly.
> The cause is how the RDDOperationGraph gets generated. For each RDD, a nested 
> chain of RDDOperationCluster is produced and those are merged. But, there is 
> no implementation of equals for RDDOperationCluster, so they are always 
> distinct and accumulated rather than 
> [deduped|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala#L135].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14687) Call path.getFileSystem(conf) instead of call FileSystem.get(conf)

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14687:
--
Assignee: Liwei Lin

> Call path.getFileSystem(conf) instead of call FileSystem.get(conf)
> --
>
> Key: SPARK-14687
> URL: https://issues.apache.org/jira/browse/SPARK-14687
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Assignee: Liwei Lin
>Priority: Minor
>
> Generally we should call path.getFileSystem(conf) instead of call 
> FileSystem.get(conf), because the latter is actually called on the 
> DEFAULT_URI (fs.defaultFS), leading to problems under certain situations:
> - if {{fs.defaultFS}} is {{hdfs://clusterA/...}}, but path is 
> {{hdfs://clusterB/...}}: then we'll encounter 
> {{java.lang.IllegalArgumentException (Wrong FS: hdfs://clusterB/..., 
> expected: hdfs://clusterA/...)}}
> - if {{fs.defaultFS}} is not specified, the schema will default to 
> {{file:///}}: then we'll encounter {{java.lang.IllegalArgumentException 
> (Wrong FS: hdfs://..., expected: file:///)}}
> - if {{fs.defaultFS}} is not {{hdfs://...}}, for example {{viewfs://}}(which 
> is used for federated HDFS): then we'll encounter 
> {{java.lang.IllegalArgumentException (Wrong FS: hdfs://..., expected: 
> viewfs:///)}}
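A minimal sketch of the difference (the paths, cluster names, and default FS are illustrative):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()                 // fs.defaultFS may be hdfs://clusterA or file:///
val path = new Path("hdfs://clusterB/data/part-00000")

val fsFromDefault = FileSystem.get(conf)       // bound to fs.defaultFS
// fsFromDefault.open(path) throws IllegalArgumentException ("Wrong FS: ...")
// whenever the path's scheme/authority differs from fs.defaultFS

val fsFromPath = path.getFileSystem(conf)      // bound to the path's own scheme/authority
val in = fsFromPath.open(path)                 // works regardless of fs.defaultFS
{code}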



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14687) Call path.getFileSystem(conf) instead of call FileSystem.get(conf)

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14687.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/12450

> Call path.getFileSystem(conf) instead of call FileSystem.get(conf)
> --
>
> Key: SPARK-14687
> URL: https://issues.apache.org/jira/browse/SPARK-14687
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Assignee: Liwei Lin
>Priority: Minor
> Fix For: 2.0.0
>
>
> Generally we should call path.getFileSystem(conf) instead of call 
> FileSystem.get(conf), because the latter is actually called on the 
> DEFAULT_URI (fs.defaultFS), leading to problems under certain situations:
> - if {{fs.defaultFS}} is {{hdfs://clusterA/...}}, but path is 
> {{hdfs://clusterB/...}}: then we'll encounter 
> {{java.lang.IllegalArgumentException (Wrong FS: hdfs://clusterB/..., 
> expected: hdfs://clusterA/...)}}
> - if {{fs.defaultFS}} is not specified, the schema will default to 
> {{file:///}}: then we'll encounter {{java.lang.IllegalArgumentException 
> (Wrong FS: hdfs://..., expected: file:///)}}
> - if {{fs.defaultFS}} is not {{hdfs://...}}, for example {{viewfs://}}(which 
> is used for federated HDFS): then we'll encounter 
> {{java.lang.IllegalArgumentException (Wrong FS: hdfs://..., expected: 
> viewfs:///)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249628#comment-15249628
 ] 

Paul Shearer commented on SPARK-13973:
--

The problem with this change is that it creates a bug for users who simply want 
the IPython interactive shell, as opposed to the notebook. `ipython` with no 
arguments starts the IPython shell, but `jupyter` with no arguments results in 
the following error:

{noformat}
usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir]
   [--paths] [--json]
   [subcommand]
jupyter: error: one of the arguments --version subcommand --config-dir 
--data-dir --runtime-dir --paths is required
{noformat}

I can't speak for the general Python community but as a data scientist, 
personally I find the IPython notebook only suitable for very basic exploratory 
analysis - any sort of application development is much better served by the 
IPython shell, so I'm always using the shell and rarely the notebook.

It seems like maintaining this old configuration switch is no longer 
sustainable. The change breaks it for my case, and the old state will 
eventually be broken when `ipython notebook` is deprecated. So perhaps 
`IPYTHON=1` should just result in some kind of error message prompting the user 
to switch to the new PYSPARK_DRIVER_PYTHON config style.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14635) Documentation and Examples for TF-IDF only refer to HashingTF

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14635.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12454
[https://github.com/apache/spark/pull/12454]

> Documentation and Examples for TF-IDF only refer to HashingTF
> -
>
> Key: SPARK-14635
> URL: https://issues.apache.org/jira/browse/SPARK-14635
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, the [docs for 
> TF-IDF|http://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf]
>  only refer to using {{HashingTF}} with {{IDF}}. However, {{CountVectorizer}} 
> can also be used. We should probably amend the user guide and examples to 
> show this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14635) Documentation and Examples for TF-IDF only refer to HashingTF

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14635:
--
Assignee: yuhao yang

> Documentation and Examples for TF-IDF only refer to HashingTF
> -
>
> Key: SPARK-14635
> URL: https://issues.apache.org/jira/browse/SPARK-14635
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, the [docs for 
> TF-IDF|http://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf]
>  only refer to using {{HashingTF}} with {{IDF}}. However, {{CountVectorizer}} 
> can also be used. We should probably amend the user guide and examples to 
> show this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14754) Metrics as logs are not coming through slf4j

2016-04-20 Thread Monani Mihir (JIRA)
Monani Mihir created SPARK-14754:


 Summary: Metrics as logs are not coming through slf4j
 Key: SPARK-14754
 URL: https://issues.apache.org/jira/browse/SPARK-14754
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1, 1.6.0, 1.5.2, 1.6.2
Reporter: Monani Mihir
Priority: Minor


Based on Codahale's metrics documentation, *Slf4jSink.scala* should include the *class 
name* so that log4j can print metrics to the log files. The metric name is missing in 
the current Slf4jSink.scala file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14754) Metrics as logs are not coming through slf4j

2016-04-20 Thread Monani Mihir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Monani Mihir updated SPARK-14754:
-
Description: 
Based on Codahale's metrics documentation, *Slf4jSink.scala* should include the *class 
name* so that log4j can print metrics to the log files. The metric name is missing in 
the current Slf4jSink.scala file. 
Refer to this link: 
https://dropwizard.github.io/metrics/3.1.0/manual/core/#man-core-reporters-slf4j


  was:Based on Codahale's metrics documentation, *Slf4jSink.scala* should include the *class 
name* so that log4j can print metrics to the log files. The metric name is missing in 
the current Slf4jSink.scala file.


> Metrics as logs are not coming through slf4j
> 
>
> Key: SPARK-14754
> URL: https://issues.apache.org/jira/browse/SPARK-14754
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.0, 1.6.1, 1.6.2
>Reporter: Monani Mihir
>Priority: Minor
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Based on Codahale's metrics documentation, *Slf4jSink.scala* should include the 
> *class name* so that log4j can print metrics to the log files. The metric name is 
> missing in the current Slf4jSink.scala file. 
> Refer to this link: 
> https://dropwizard.github.io/metrics/3.1.0/manual/core/#man-core-reporters-slf4j



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249628#comment-15249628
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 10:48 AM:


The problem with this change is that it creates a bug for users who simply want 
the IPython interactive shell, as opposed to the notebook. `ipython` with no 
arguments starts the IPython shell, but `jupyter` with no arguments results in 
the following error:

{noformat}
usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir]
   [--paths] [--json]
   [subcommand]
jupyter: error: one of the arguments --version subcommand --config-dir 
--data-dir --runtime-dir --paths is required
{noformat}

I can't speak for the general Python community but as a data scientist, 
personally I find the IPython notebook only suitable for very basic exploratory 
analysis - any sort of application development is much better served by the 
IPython shell, so I'm always using the shell and rarely the notebook. So I 
prefer the old script.

Perhaps the best answer is to stop maintaining an unsustainable backwards 
compatibility. The committed change broke the pyspark startup script in my 
case, and the old startup script will eventually be broken when `ipython 
notebook` is deprecated. So perhaps `IPYTHON=1` should just result in some kind 
of error message prompting the user to switch to the new PYSPARK_DRIVER_PYTHON 
config style. Most Spark users know the installation process is not seamless 
and requires mucking about with environment variables; they might as well be 
told to do it in a way that's convenient for the development team.


was (Author: pshearer):
The problem with this change is that it creates a bug for users who simply want 
the IPython interactive shell, as opposed to the notebook. `ipython` with no 
arguments starts the IPython shell, but `jupyter` with no arguments results in 
the following error:

{noformat}
usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir]
   [--paths] [--json]
   [subcommand]
jupyter: error: one of the arguments --version subcommand --config-dir 
--data-dir --runtime-dir --paths is required
{noformat}

I can't speak for the general Python community but as a data scientist, 
personally I find the IPython notebook only suitable for very basic exploratory 
analysis - any sort of application development is much better served by the 
IPython shell, so I'm always using the shell and rarely the notebook.

It seems like maintaining this old configuration switch is no longer 
sustainable. The change breaks it for my case, and the old state will 
eventually be broken when `ipython notebook` is deprecated. So perhaps 
`IPYTHON=1` should just result in some kind of error message prompting the user 
to switch to the new PYSPARK_DRIVER_PYTHON config style.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> 

[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249641#comment-15249641
 ] 

Sean Owen commented on SPARK-13973:
---

Ah, I had assumed the notebook was the only use case here. Is that 
unreasonable? (I have only ever used these as notebooks.) What does it buy you 
over using pyspark directly? It only matters if you have jupyter installed too. 
How about a change to not overwrite {{PYSPARK_DRIVER_PYTHON}} if it is already set, so 
you can force the choice you want?

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249656#comment-15249656
 ] 

Paul Shearer commented on SPARK-13973:
--

Just to get us on the same page: pyspark is a Python shell with a SparkContext 
pre-loaded. The Python shell can be either the default Python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and access to help/docstrings. It is a separate entity 
from the notebook and long predates it. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory-analysis tool, while the IPython shell is better for power users.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249656#comment-15249656
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 11:17 AM:


Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was 
the problem, so alternatively you could print some alert that this is 
deprecated and tell people how to change it, so you don't have to keep 
supporting it.


was (Author: pshearer):
Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for power users.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249656#comment-15249656
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 11:18 AM:


Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for general 
programming / power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was 
the problem, so alternatively you could print some alert that this is 
deprecated and tell people how to change it, so you don't have to keep 
supporting it.


was (Author: pshearer):
Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was 
the problem, so alternatively you could print some alert that this is 
deprecated and tell people how to change it, so you don't have to keep 
supporting it.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14746) Support transformations in R source code for Dataset/DataFrame

2016-04-20 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249673#comment-15249673
 ] 

Sun Rui commented on SPARK-14746:
-

[~rxin] 
Some links to discussions about calling R code from Scala:
Run External R script from Spark: 
https://mail-archives.apache.org/mod_mbox/spark-user/201603.mbox/%3ccag06he009b4tonqd-rtkhlspiojchgpe2ularb09jhe55xn...@mail.gmail.com%3E
Running synchronized JRI code: 
https://www.mail-archive.com/user@spark.apache.org/msg45753.html

We also have a customer with a similar requirement: their applications are 
written in Scala/Java, but they sometimes need to call R statistical functions 
inside transformations.

There are similar requirements for calling Python code from Scala; one example 
is: https://www.mail-archive.com/user@spark.apache.org/msg49653.html

The limitations of pipe():
1. Only RDD has pipe(); DataFrame does not.
2. pipe() uses text-based communication between the JVM and the external process. 
Users have to manually serialize the data to text on the JVM side (via the 
printRDDElement parameter of pipe()) and deserialize it again in the external 
process. This is hard to use and raises performance concerns compared to binary 
communication (see the sketch after this list).
3. Users have to write separate code in the target language for the external 
process. With this proposal, users can embed the code (for example, R or Python 
code) for the external process directly in the Scala program, which is easier to 
maintain and more readable.
4. When an external process launched by pipe() fails, it is hard to debug because 
there is no detailed error message. With this proposal, the error message can be 
caught, which eases debugging.
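
To make limitation 2 concrete, here is a rough, self-contained sketch of what calling an 
external script through pipe() looks like today. The script name transform.R and the 
tab-separated record format are made up for illustration; the pipe() overload and its 
printRDDElement parameter are the existing RDD API.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of limitation 2: the caller hand-rolls text serialization via printRDDElement,
// and whatever the external process prints comes back only as raw strings.
object PipeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipe-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq(("a", 1.0), ("b", 2.0)))

    val piped = rdd.pipe(
      command = Seq("Rscript", "transform.R"),                 // hypothetical external R script, one process per partition
      printRDDElement = (rec: (String, Double), out: String => Unit) =>
        out(s"${rec._1}\t${rec._2}"))                          // manual, text-based serialization on the JVM side

    piped.collect().foreach(println)                           // results are only RDD[String]; parsing is again manual
    sc.stop()
  }
}
{code}

With the proposed feature, the R source would instead be embedded in the Scala program and the 
serialization would be handled for the user.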


> Support transformations in R source code for Dataset/DataFrame
> --
>
> Key: SPARK-14746
> URL: https://issues.apache.org/jira/browse/SPARK-14746
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Sun Rui
>
> There is a scenario, mentioned several times on the Spark mailing list, in 
> which users write Scala/Java Spark applications (not SparkR) but want to use 
> R functions in some transformations. Typically this can be achieved by 
> calling pipe() on an RDD. However, pipe() has limitations, so we could 
> support applying an R function, supplied as source code, to a 
> Dataset/DataFrame. (Thus SparkR would not be needed for serializing an R function.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory

2016-04-20 Thread Nipun Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249679#comment-15249679
 ] 

Nipun Agarwal commented on SPARK-11293:
---

I was using Apache Spark 1.6 on EMR with Spark Streaming on YARN and saw memory 
leaks in one of the containers. Here are the logs:
{noformat}
16/04/14 13:49:10 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
2942916
16/04/14 13:49:10 INFO executor.Executor: Running task 22.0 in stage 35684.0 
(TID 2942915)
16/04/14 13:49:10 INFO executor.Executor: Running task 23.0 in stage 35684.0 
(TID 2942916)
16/04/14 13:49:10 INFO storage.ShuffleBlockFetcherIterator: Getting 94 
non-empty blocks out of 94 blocks
16/04/14 13:49:10 INFO storage.ShuffleBlockFetcherIterator: Getting 94 
non-empty blocks out of 94 blocks
16/04/14 13:49:10 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote 
fetches in 1 ms
16/04/14 13:49:10 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote 
fetches in 1 ms
16/04/14 13:49:10 INFO storage.MemoryStore: Block input-3-1460583424327 stored 
as values in memory (estimated size 244.7 KB, free 19.3 MB)
16/04/14 13:49:10 INFO receiver.BlockGenerator: Pushed block 
input-3-1460641750200
16/04/14 13:49:10 INFO storage.MemoryStore: 1 blocks selected for dropping
16/04/14 13:49:10 INFO storage.BlockManager: Dropping block 
input-1-1460615659379 from memory
16/04/14 13:49:10 INFO storage.MemoryStore: 1 blocks selected for dropping
16/04/14 13:49:10 INFO storage.BlockManager: Dropping block 
input-1-1460615659380 from memory
16/04/14 13:49:10 INFO memory.TaskMemoryManager: Memory used in task 2942915
16/04/14 13:49:10 INFO memory.TaskMemoryManager: Acquired by 
org.apache.spark.unsafe.map.BytesToBytesMap@34158d5f: 32.3 MB
16/04/14 13:49:10 INFO memory.TaskMemoryManager: 0 bytes of memory were used by 
task 2942915 but are not associated with specific consumers
16/04/14 13:49:10 INFO memory.TaskMemoryManager: 101247172 bytes of memory are 
used for execution and 3603881260 bytes of memory are used for storage
16/04/14 13:49:10 WARN memory.TaskMemoryManager: leak 32.3 MB memory from 
org.apache.spark.unsafe.map.BytesToBytesMap@34158d5f
16/04/14 13:49:10 ERROR executor.Executor: Managed memory leak detected; size = 
33816576 bytes, TID = 2942915
16/04/14 13:49:10 ERROR executor.Executor: Exception in task 22.0 in stage 
35684.0 (TID 2942915)
java.lang.OutOfMemoryError: Unable to acquire 262144 bytes of memory, got 220032
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:91)
at 
org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:735)
at 
org.apache.spark.unsafe.map.BytesToBytesMap.(BytesToBytesMap.java:197)
at 
org.apache.spark.unsafe.map.BytesToBytesMap.(BytesToBytesMap.java:212)
at 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.(UnsafeFixedWidthAggregationMap.java:103)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:483)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/04/14 13:49:10 INFO executor.Executor: Finished task 23.0 in stage 35684.0 
(TID 2942916). 1921 bytes result sent to driver
16/04/14 13:49:10 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
2942927
16/04/14 13:49:10 INFO executor.Executor: Running task 34.0 in stage 35684.0 
(TID 2942927)
16/04/14 13:49:10 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception 
in thread Thread[Executor
{noformat}

[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249656#comment-15249656
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 11:36 AM:


In my development experience, and that of people I know, the notebook is not the only or 
even the main use case for IPython. It is just the most visible one because it makes it 
easy to produce slides, web pages, and such. IPython, without the notebook, is a 
productive general-purpose programming environment for data scientists.

Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for general 
programming / power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was 
the problem, so alternatively you could print some alert that this is 
deprecated and tell people how to change it, so you don't have to keep 
supporting it.


was (Author: pshearer):
In my development experience and those I know, the notebook is not the only or 
even the main use case for IPython. It's just the most visible because it's 
easy to make slides and such. IPython, without the notebook, is a productive 
general-purpose programming environment for data scientists.

Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for general 
programming / power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was 
the problem, so alternatively you could print some alert that this is 
deprecated and tell people how to change it, so you don't have to keep 
supporting it.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249656#comment-15249656
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 11:36 AM:


In my development experience, and that of people I know, the notebook is not the only or 
even the main use case for IPython. It is just the most visible one because it makes it 
easy to produce slides and such. IPython, without the notebook, is a productive 
general-purpose programming environment for data scientists.

Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for general 
programming / power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was 
the problem, so alternatively you could print some alert that this is 
deprecated and tell people how to change it, so you don't have to keep 
supporting it.


was (Author: pshearer):
Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for general 
programming / power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was 
the problem, so alternatively you could print some alert that this is 
deprecated and tell people how to change it, so you don't have to keep 
supporting it.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14752) LazilyGenerateOrdering throws NullPointerException with TakeOrderedAndProject

2016-04-20 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249576#comment-15249576
 ] 

Rajesh Balamohan edited comment on SPARK-14752 at 4/20/16 11:42 AM:


Changing generatedOrdering in LazilyGeneratedOrdering to 
{noformat}
private[this] lazy val generatedOrdering = GenerateOrdering.generate(ordering)
{noformat} 
solves the issue and the query runs fine. I thought I would check for committers' 
opinions before posting the PR for this.
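
For context, here is a small self-contained sketch (hypothetical class names, plain Java 
serialization rather than Kryo) of the pattern behind the suggestion: a transient lazy field 
is not shipped with the object and is regenerated on first use after deserialization, instead 
of coming back as null.

{code}
import java.io._

// Sketch only, not the Spark class: the expensive comparator is @transient lazy,
// so it is rebuilt on the deserializing side rather than serialized.
class LazySketch(val keys: Array[String]) extends Serializable {
  @transient private lazy val comparator: java.util.Comparator[String] =
    java.util.Comparator.naturalOrder[String]()   // stand-in for GenerateOrdering.generate(ordering)
  def compare(a: String, b: String): Int = comparator.compare(a, b)
}

object LazySketchDemo {
  def main(args: Array[String]): Unit = {
    val bytes = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bytes)
    oos.writeObject(new LazySketch(Array("i_item_id")))
    oos.close()
    val copy = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
      .readObject().asInstanceOf[LazySketch]
    println(copy.compare("a", "b"))   // comparator is regenerated lazily here; no NPE
  }
}
{code}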


was (Author: rajesh.balamohan):
Changing generatedOrdering in LazilyGeneratedOrdering to 
{noformat}
private[this] lazy val generatedOrdering = GenerateOrdering.generate(ordering)
{noformat} 
solves the issue and the query runs fine. 

> LazilyGenerateOrdering throws NullPointerException with TakeOrderedAndProject
> -
>
> Key: SPARK-14752
> URL: https://issues.apache.org/jira/browse/SPARK-14752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> codebase: spark master
> DataSet: TPC-DS
> Client: $SPARK_HOME/bin/beeline
> Example query to reproduce the issue:  
> select i_item_id from item order by i_item_id limit 10;
> Explain plan output
> {noformat}
> explain select i_item_id from item order by i_item_id limit 10;
> +--+--+
> | plan |
> +--+--+
> | == Physical Plan ==
> TakeOrderedAndProject(limit=10, orderBy=[i_item_id#1229 ASC], output=[i_item_id#1229])
> +- WholeStageCodegen
>    :  +- Project [i_item_id#1229]
>    :     +- Scan HadoopFiles[i_item_id#1229] Format: ORC, PushedFilters: [], ReadSchema: struct  |
> +--+--+
> {noformat}
> Exception:
> {noformat}
> TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1791)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java

[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249705#comment-15249705
 ] 

Paul Shearer commented on SPARK-13973:
--

Bottom line... I think IPYTHON=1 should either 

(1) mean what it appears to mean - IPython and not necessarily the notebook - 
or 
(2) be removed entirely as too confusing.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249715#comment-15249715
 ] 

Sean Owen commented on SPARK-13973:
---

Yes, that makes sense. So, reading deeper into the script, this behavior only 
matters if you set IPYTHON=1, but that has long since been deprecated (since 1.2). 
You should be setting PYSPARK_DRIVER_PYTHON now, and then this default behavior 
doesn't matter. Actually, the script already respects that variable if it's set, 
but you must not set IPYTHON as well.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249705#comment-15249705
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 12:08 PM:


Bottom line... I think IPYTHON=1 should either 

(1) mean what it appears to mean - IPython and not necessarily the notebook - 
or 
(2) be removed entirely as too confusing.

As the code now stands, the IPython shell user's pyspark is silently broken if 
they happen to have set a deprecated option when using an older version of 
pyspark. Not exactly the definition of backwards compatibility.


was (Author: pshearer):
Bottom line... I think IPYTHON=1 should either 

(1) mean what it appears to mean - IPython and not necessarily the notebook - 
or 
(2) be removed entirely as too confusing.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249705#comment-15249705
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 12:09 PM:


Bottom line... I think IPYTHON=1 should either 

(1) mean what it appears to mean - IPython and not necessarily the notebook - 
or 
(2) be removed entirely as too confusing.

As the code now stands, the IPython shell user's pyspark is silently broken if 
they happen to have set a deprecated option when using an older version of 
pyspark. Fixable yes, but more confusing than necessary, and opposite to the 
intention of backwards compatibility.


was (Author: pshearer):
Bottom line... I think IPYTHON=1 should either 

(1) mean what it appears to mean - IPython and not necessarily the notebook - 
or 
(2) be removed entirely as too confusing.

As the code now stands, the IPython shell user's pyspark is silently broken if 
they happen to have set a deprecated option when using an older version of 
pyspark. Not exactly the definition of backwards compatibility.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14725) Remove HttpServer

2016-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14725:


Assignee: (was: Apache Spark)

> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Saisai Shao
>Priority: Minor
>
> {{HttpServer}}, which used to support broadcast variables and jar/file 
> transmission, now seems obsolete. Searching the code shows that no class 
> depends on it except one unit test, so I propose to remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14725) Remove HttpServer

2016-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14725:


Assignee: Apache Spark

> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> {{HttpServer}}, which used to support broadcast variables and jar/file 
> transmission, now seems obsolete. Searching the code shows that no class 
> depends on it except one unit test, so I propose to remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14725) Remove HttpServer

2016-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249786#comment-15249786
 ] 

Apache Spark commented on SPARK-14725:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/12526

> Remove HttpServer
> -
>
> Key: SPARK-14725
> URL: https://issues.apache.org/jira/browse/SPARK-14725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Saisai Shao
>Priority: Minor
>
> {{HttpServer}}, which used to support broadcast variables and jar/file 
> transmission, now seems obsolete. Searching the code shows that no class 
> depends on it except one unit test, so I propose to remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249787#comment-15249787
 ] 

Sean Owen commented on SPARK-13973:
---

Yeah, it should just be removed for 2.0, I think. Would you like to open a 
JIRA/PR for that? It has been deprecated for a long time, and there is already a 
push to clear out some env variables for 2.0.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14755) Dynamic proxy class cause a ClassNotFoundException on task deserialization

2016-04-20 Thread francisco (JIRA)
francisco created SPARK-14755:
-

 Summary: Dynamic proxy class cause a ClassNotFoundException on 
task deserialization
 Key: SPARK-14755
 URL: https://issues.apache.org/jira/browse/SPARK-14755
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: francisco


Using a component object wrapped by a java.lang.reflect.Proxy causes a 
ClassNotFoundException during task deserialization. 

The ClassNotFoundException refers to the interface of the proxied object. The 
same job run with the same component, not proxied, works fine.

I think this is related to the class loader used in 
ObjectInputStream.resolveProxyClass.

Best regards



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14755) Dynamic proxy class cause a ClassNotFoundException on task deserialization

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14755:
--
Priority: Minor  (was: Major)

Heh, I'm not sure I'd expect that to work. You're defining a class at runtime 
that only exists in the source classloader.
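
For what it's worth, a small sketch (hypothetical trait and handler, not taken from the 
report) of why the interface is the sticking point: a dynamic proxy serializes as its 
interface names plus the handler, so the deserializing side's classloader — in Spark, the 
executor's — must be able to resolve the same interface, which is exactly what 
ObjectInputStream.resolveProxyClass attempts.

{code}
import java.lang.reflect.{InvocationHandler, Method, Proxy}

// A dynamic proxy class is synthesized from interface names at runtime, so those
// interfaces must be resolvable wherever the proxy instance is deserialized.
trait Scorer { def score(x: Double): Double }

object ProxySketch {
  def main(args: Array[String]): Unit = {
    val handler = new InvocationHandler with Serializable {
      def invoke(proxy: AnyRef, method: Method, args: Array[AnyRef]): AnyRef =
        Double.box(args(0).asInstanceOf[Double] * 2)   // toy behaviour behind the proxy
    }
    val proxied = Proxy.newProxyInstance(
      classOf[Scorer].getClassLoader,
      Array[Class[_]](classOf[Scorer]),
      handler).asInstanceOf[Scorer]
    println(proxied.score(21.0))   // works locally; shipping `proxied` in a Spark task also
                                   // requires Scorer to be on the executors' classpath
  }
}
{code}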

> Dynamic proxy class cause a ClassNotFoundException on task deserialization
> --
>
> Key: SPARK-14755
> URL: https://issues.apache.org/jira/browse/SPARK-14755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: francisco
>Priority: Minor
>
> Using a component object wrapped by a java.lang.reflect.Proxy causes a 
> ClassNotFoundException during task deserialization. 
> The ClassNotFoundException refers to the interface of the proxied object. The 
> same job run with the same component, not proxied, works fine.
> I think this is related to the class loader used in 
> ObjectInputStream.resolveProxyClass.
> Best regards



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14756) Use parseLong instead of valueOf

2016-04-20 Thread Azeem Jiva (JIRA)
Azeem Jiva created SPARK-14756:
--

 Summary: Use parseLong instead of valueOf
 Key: SPARK-14756
 URL: https://issues.apache.org/jira/browse/SPARK-14756
 Project: Spark
  Issue Type: Bug
Reporter: Azeem Jiva
Priority: Trivial


Use Long.parseLong, which returns a primitive.
Using a series of append() calls avoids creating an extra StringBuilder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14756) Use parseLong instead of valueOf

2016-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14756:


Assignee: Apache Spark

> Use parseLong instead of valueOf
> 
>
> Key: SPARK-14756
> URL: https://issues.apache.org/jira/browse/SPARK-14756
> Project: Spark
>  Issue Type: Bug
>Reporter: Azeem Jiva
>Assignee: Apache Spark
>Priority: Trivial
>
> Use Long.parseLong, which returns a primitive.
> Using a series of append() calls avoids creating an extra StringBuilder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14756) Use parseLong instead of valueOf

2016-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14756:


Assignee: (was: Apache Spark)

> Use parseLong instead of valueOf
> 
>
> Key: SPARK-14756
> URL: https://issues.apache.org/jira/browse/SPARK-14756
> Project: Spark
>  Issue Type: Bug
>Reporter: Azeem Jiva
>Priority: Trivial
>
> Use Long.parseLong, which returns a primitive.
> Using a series of append() calls avoids creating an extra StringBuilder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14756) Use parseLong instead of valueOf

2016-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249860#comment-15249860
 ] 

Apache Spark commented on SPARK-14756:
--

User 'javawithjiva' has created a pull request for this issue:
https://github.com/apache/spark/pull/12520

> Use parseLong instead of valueOf
> 
>
> Key: SPARK-14756
> URL: https://issues.apache.org/jira/browse/SPARK-14756
> Project: Spark
>  Issue Type: Bug
>Reporter: Azeem Jiva
>Priority: Trivial
>
> Use Long.parseLong, which returns a primitive.
> Using a series of append() calls avoids creating an extra StringBuilder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14369) Implement preferredLocations() for FileScanRDD

2016-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249882#comment-15249882
 ] 

Apache Spark commented on SPARK-14369:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12527

> Implement preferredLocations() for FileScanRDD
> --
>
> Key: SPARK-14369
> URL: https://issues.apache.org/jira/browse/SPARK-14369
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Implement {{FileScanRDD.preferredLocations()}} to add locality support for 
> {{HadoopFsRelation}} based data sources.
> We should avoid extra block location related RPC costs for S3, which doesn't 
> provide valid locality information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14756) Use parseLong instead of valueOf

2016-04-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14756:
--
Issue Type: Improvement  (was: Bug)

> Use parseLong instead of valueOf
> 
>
> Key: SPARK-14756
> URL: https://issues.apache.org/jira/browse/SPARK-14756
> Project: Spark
>  Issue Type: Improvement
>Reporter: Azeem Jiva
>Priority: Trivial
>
> Use Long.parseLong, which returns a primitive.
> Using a series of append() calls avoids creating an extra StringBuilder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14756) Use parseLong instead of valueOf

2016-04-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249893#comment-15249893
 ] 

Sean Owen commented on SPARK-14756:
---

You mean return a primitive where a primitive is required. There are cases in 
the code where a Long object is required and then valueOf makes sense.
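
A tiny illustration of the distinction (not taken from the PR; the literal values are made 
up): parseLong yields a primitive long, valueOf a boxed java.lang.Long, and chained append() 
calls avoid the intermediate concatenation the JIRA description flags.

{code}
// parseLong yields a primitive long; valueOf yields a boxed java.lang.Long.
object ParseVsValueOf {
  def main(args: Array[String]): Unit = {
    val primitive: Long = java.lang.Long.parseLong("42")      // primitive, no boxing
    val boxed: java.lang.Long = java.lang.Long.valueOf("42")  // boxed; fine where an Object is genuinely needed
    println(primitive + boxed.longValue)

    val sb = new java.lang.StringBuilder
    sb.append("answer=").append(primitive)   // chained appends, as the JIRA suggests
    // versus sb.append("answer=" + primitive), which the JIRA flags as building an extra StringBuilder
    println(sb.toString)
  }
}
{code}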

> Use parseLong instead of valueOf
> 
>
> Key: SPARK-14756
> URL: https://issues.apache.org/jira/browse/SPARK-14756
> Project: Spark
>  Issue Type: Bug
>Reporter: Azeem Jiva
>Priority: Trivial
>
> Use Long.parseLong, which returns a primitive.
> Using a series of append() calls avoids creating an extra StringBuilder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249924#comment-15249924
 ] 

Apache Spark commented on SPARK-13973:
--

User 'shearerp' has created a pull request for this issue:
https://github.com/apache/spark/pull/12528

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark solves this issue for 
> me, but I'm not sure whether it's a sustainable fix, as I'm not familiar with the 
> rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249921#comment-15249921
 ] 

Paul Shearer commented on SPARK-13973:
--

Done: https://github.com/apache/spark/pull/12528

I'm a bit new to this, happy to fix any issues.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark solves this issue for 
> me, but I'm not sure whether it's a sustainable fix, as I'm not familiar with the 
> rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14755) Dynamic proxy class causes a ClassNotFoundException on task deserialization

2016-04-20 Thread francisco (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249978#comment-15249978
 ] 

francisco commented on SPARK-14755:
---

So, there is no way to do this without extending/changing the deserialization?

Thanks Sean!

> Dynamic proxy class causes a ClassNotFoundException on task deserialization
> --
>
> Key: SPARK-14755
> URL: https://issues.apache.org/jira/browse/SPARK-14755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: francisco
>Priority: Minor
>
> Using a component object wrapped by a java.lang.reflect.Proxy causes a 
> ClassNotFoundException in task deserialization. 
> The ClassNotFoundException refers to the interface of the proxied object. The 
> same job, executed with the same component not proxied, works fine.
> I think this is related to the class loader used in 
> ObjectInputStream.resolveProxyClass.
> Best regards



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14755) Dynamic proxy class causes a ClassNotFoundException on task deserialization

2016-04-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249991#comment-15249991
 ] 

Sean Owen commented on SPARK-14755:
---

I honestly don't know if there's a way to do it. It's one of the more obscure 
corner cases in the JDK.
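
For reference, the kind of deserialization hook being discussed: a minimal sketch of an ObjectInputStream that resolves classes, including dynamic proxy interfaces, against a caller-supplied class loader. This illustrates the standard JDK mechanism and is not an existing Spark API.

{code}
// Sketch only: route class resolution (including proxy interfaces) through
// a specific class loader instead of the JVM's default resolution.
import java.io.{InputStream, ObjectInputStream, ObjectStreamClass}
import java.lang.reflect.Proxy

class LoaderAwareObjectInputStream(in: InputStream, loader: ClassLoader)
    extends ObjectInputStream(in) {

  override def resolveClass(desc: ObjectStreamClass): Class[_] =
    Class.forName(desc.getName, false, loader)

  override def resolveProxyClass(interfaces: Array[String]): Class[_] = {
    // Load each proxied interface with the supplied loader, then build the proxy class.
    val ifaceClasses = interfaces.map(Class.forName(_, false, loader))
    Proxy.getProxyClass(loader, ifaceClasses: _*)
  }
}
{code}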

> Dynamic proxy class causes a ClassNotFoundException on task deserialization
> --
>
> Key: SPARK-14755
> URL: https://issues.apache.org/jira/browse/SPARK-14755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: francisco
>Priority: Minor
>
> Using a component object wrapped by a java.lang.reflect.Proxy causes a 
> ClassNotFoundException in task deserialization. 
> The ClassNotFoundException refers to the interface of the proxied object. The 
> same job, executed with the same component not proxied, works fine.
> I think this is related to the class loader used in 
> ObjectInputStream.resolveProxyClass.
> Best regards



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14757) Incorrect behavior of Join operation in Spark SQL JOIN : "false" in the left table is joined to "null" on the right table

2016-04-20 Thread JIRA
黄泓 created SPARK-14757:
--

 Summary: Incorrect behavior of Join operation in Spark SQL JOIN : 
"false" in the left table is joined to "null" on the right table
 Key: SPARK-14757
 URL: https://issues.apache.org/jira/browse/SPARK-14757
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: 黄泓


Content of table a:

+----------+
|outgoing_0|
+----------+
|     false|
|      true|
|      null|
+----------+

Content of table b:

+----------+
|outgoing_1|
+----------+
|     false|
|      true|
|      null|
+----------+

After running this query:

select * from a FULL JOIN b ON ( outgoing_0<=>outgoing_1)

I got the following result:

+----------+----------+
|outgoing_0|outgoing_1|
+----------+----------+
|      true|      true|
|     false|     false|
|     false|      null|
|      null|      null|
+----------+----------+

The row with "false" as outgoing_0 and "null" as outgoing_1 is unexpected. The 
operator <=> should match null with null. 

While left "false" is matched with right "null", it is also strange to find 
that the "false" on the right table does not match with "null" on the left 
table (no row with "null" as outgoing_0 and "false" as outgoing_1)
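
For reference, a minimal reproduction sketch of the reported data shape. It is written against the Spark 2.x SparkSession API as an assumption (the report is against 1.6.0); <=> is null-safe equality, so the only expected matches are (false, false), (true, true) and (null, null).

{code}
// Sketch of a reproduction; assumes a local Spark 2.x session.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{BooleanType, StructField, StructType}

object NullSafeJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("null-safe-join").getOrCreate()

    // Build a one-column nullable boolean table and register it as a temp view.
    def boolTable(colName: String, viewName: String): Unit = {
      val schema = StructType(Seq(StructField(colName, BooleanType, nullable = true)))
      val rows = spark.sparkContext.parallelize(Seq(Row(false), Row(true), Row(null)))
      spark.createDataFrame(rows, schema).createOrReplaceTempView(viewName)
    }
    boolTable("outgoing_0", "a")
    boolTable("outgoing_1", "b")

    // Expected: null matched with null only, never with false.
    spark.sql("select * from a FULL JOIN b ON (outgoing_0 <=> outgoing_1)").show()

    spark.stop()
  }
}
{code}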



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14757) Incorrect behavior of Join operation in Spark SQL JOIN : "false" in the left table is joined to "null" on the right table

2016-04-20 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-14757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

黄泓 updated SPARK-14757:
---
Description: 
Content of table a:

|outgoing_0|
| false |
|  true |
|  null  |

a has only one field: outgoing_0 

Content of table b:

|outgoing_1|
| false  |
|  true  |
|  null   |

b has only one field: outgoing_1

After running this query:

select * from a FULL JOIN b ON ( outgoing_0<=>outgoing_1)

I got the following result:

|outgoing_0|outgoing_1|
|  true  |  true  |
| false  | false |
| false  |  null  |
|  null   |  null  |

The row with "false" as outgoing_0 and "null" as outgoing_1 is unexpected. The 
operator <=> should match null with null. 

While left "false" is matched with right "null", it is also strange to find 
that the "false" on the right table does not match with "null" on the left 
table (no row with "null" as outgoing_0 and "false" as outgoing_1)

  was:
Content of table a:

|outgoing_0|
| false |
|  true |
|  null  |

Content of table b:

|outgoing_1|
| false  |
|  true  |
|  null   |

After running this query:

select * from a FULL JOIN b ON ( outgoing_0<=>outgoing_1)

I got the following result:

|outgoing_0|outgoing_1|
|  true  |  true  |
| false  | false |
| false  |  null  |
|  null   |  null  |

The row with "false" as outgoing_0 and "null" as outgoing_1 is unexpected. The 
operator <=> should match null with null. 

While left "false" is matched with right "null", it is also strange to find 
that the "false" on the right table does not match with "null" on the left 
table (no row with "null" as outgoing_0 and "false" as outgoing_1)


> Incorrect behavior of Join operation in Spark SQL JOIN : "false" in the left 
> table is joined to "null" on the right table
> -
>
> Key: SPARK-14757
> URL: https://issues.apache.org/jira/browse/SPARK-14757
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: 黄泓
>
> Content of table a:
> |outgoing_0|
> | false |
> |  true |
> |  null  |
> a has only one field: outgoing_0 
> Content of table b:
> |outgoing_1|
> | false  |
> |  true  |
> |  null   |
> b has only one field: outgoing_1
> After running this query:
> select * from a FULL JOIN b ON ( outgoing_0<=>outgoing_1)
> I got the following result:
> |outgoing_0|outgoing_1|
> |  true  |  true  |
> | false  | false |
> | false  |  null  |
> |  null   |  null  |
> The row with "false" as outgoing_0 and "null" as outgoing_1 is unexpected. 
> The operator <=> should match null with null. 
> While left "false" is matched with right "null", it is also strange to find 
> that the "false" on the right table does not match with "null" on the left 
> table (no row with "null" as outgoing_0 and "false" as outgoing_1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14749) PlannerSuite failed when it runs individually

2016-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250044#comment-15250044
 ] 

Apache Spark commented on SPARK-14749:
--

User 'sbcd90' has created a pull request for this issue:
https://github.com/apache/spark/pull/12532

> PlannerSuite failed when it runs individually
> -
>
> Key: SPARK-14749
> URL: https://issues.apache.org/jira/browse/SPARK-14749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>Priority: Minor
>
> If you try {{test-only *PlannerSuite -- -z "count is partially aggregated"}}, 
> you will see
> {code}
> [info] - count is partially aggregated *** FAILED *** (104 milliseconds)
> [info]   java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.<init>(TungstenAggregate.scala:76)
> [info]   at 
> org.apache.spark.sql.execution.aggregate.Utils$.createAggregate(utils.scala:60)
> [info]   at 
> org.apache.spark.sql.execution.aggregate.Utils$.planAggregateWithoutDistinct(utils.scala:97)
> [info]   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:258)
> [info]   at 
> org.apache.spark.sql.execution.PlannerSuite.org$apache$spark$sql$execution$PlannerSuite$$testPartialAggregationPlan(PlannerSuite.scala:43)
> [info]   at 
> org.apache.spark.sql.execution.PlannerSuite$$anonfun$1.apply$mcV$sp(PlannerSuite.scala:58)
> [info]   at 
> org.apache.spark.sql.execution.PlannerSuite$$anonfun$1.apply(PlannerSuite.scala:56)
> [info]   at 
> org.apache.spark.sql.execution.PlannerSuite$$anonfun$1.apply(PlannerSuite.scala:56)
> [info]   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:56)
> [info]   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
> [info]   at scala.collection.immutable.List.foreach(List.scala:381)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
> [info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
> [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
> [info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
> [info]   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
> [info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:28)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
> [info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:28)
> [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> [info]   at 
> java.util.concurrent.ThreadPoolExecut

[jira] [Assigned] (SPARK-14749) PlannerSuite failed when it runs individually

2016-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14749:


Assignee: Apache Spark

> PlannerSuite failed when it runs individually
> -
>
> Key: SPARK-14749
> URL: https://issues.apache.org/jira/browse/SPARK-14749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Minor
>
> If you try {{test-only *PlannerSuite -- -z "count is partially aggregated"}}, 
> you will see
> {code}
> [info] - count is partially aggregated *** FAILED *** (104 milliseconds)
> [info]   java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.<init>(TungstenAggregate.scala:76)
> [info]   at 
> org.apache.spark.sql.execution.aggregate.Utils$.createAggregate(utils.scala:60)
> [info]   at 
> org.apache.spark.sql.execution.aggregate.Utils$.planAggregateWithoutDistinct(utils.scala:97)
> [info]   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:258)
> [info]   at 
> org.apache.spark.sql.execution.PlannerSuite.org$apache$spark$sql$execution$PlannerSuite$$testPartialAggregationPlan(PlannerSuite.scala:43)
> [info]   at 
> org.apache.spark.sql.execution.PlannerSuite$$anonfun$1.apply$mcV$sp(PlannerSuite.scala:58)
> [info]   at 
> org.apache.spark.sql.execution.PlannerSuite$$anonfun$1.apply(PlannerSuite.scala:56)
> [info]   at 
> org.apache.spark.sql.execution.PlannerSuite$$anonfun$1.apply(PlannerSuite.scala:56)
> [info]   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:56)
> [info]   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
> [info]   at scala.collection.immutable.List.foreach(List.scala:381)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
> [info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
> [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
> [info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
> [info]   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
> [info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:28)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
> [info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:28)
> [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [info]   at 
> java.util.concurrent.ThreadP

[jira] [Assigned] (SPARK-14749) PlannerSuite failed when it runs individually

2016-04-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14749:


Assignee: (was: Apache Spark)

> PlannerSuite failed when it runs individually
> -
>
> Key: SPARK-14749
> URL: https://issues.apache.org/jira/browse/SPARK-14749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>Priority: Minor
>
> If you try {{test-only *PlannerSuite -- -z "count is partially aggregated"}}, 
> you will see
> {code}
> [info] - count is partially aggregated *** FAILED *** (104 milliseconds)
> [info]   java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.<init>(TungstenAggregate.scala:76)
> [info]   at 
> org.apache.spark.sql.execution.aggregate.Utils$.createAggregate(utils.scala:60)
> [info]   at 
> org.apache.spark.sql.execution.aggregate.Utils$.planAggregateWithoutDistinct(utils.scala:97)
> [info]   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:258)
> [info]   at 
> org.apache.spark.sql.execution.PlannerSuite.org$apache$spark$sql$execution$PlannerSuite$$testPartialAggregationPlan(PlannerSuite.scala:43)
> [info]   at 
> org.apache.spark.sql.execution.PlannerSuite$$anonfun$1.apply$mcV$sp(PlannerSuite.scala:58)
> [info]   at 
> org.apache.spark.sql.execution.PlannerSuite$$anonfun$1.apply(PlannerSuite.scala:56)
> [info]   at 
> org.apache.spark.sql.execution.PlannerSuite$$anonfun$1.apply(PlannerSuite.scala:56)
> [info]   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:56)
> [info]   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
> [info]   at scala.collection.immutable.List.foreach(List.scala:381)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
> [info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
> [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
> [info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
> [info]   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
> [info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:28)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
> [info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:28)
> [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(Th

[jira] [Commented] (SPARK-14730) Expose ColumnPruner as feature transformer

2016-04-20 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250050#comment-15250050
 ] 

Benjamin Fradet commented on SPARK-14730:
-

[~jlaskowski], [~yanboliang], is one of you guys working on this?

> Expose ColumnPruner as feature transformer
> --
>
> Key: SPARK-14730
> URL: https://issues.apache.org/jira/browse/SPARK-14730
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Jacek Laskowski
>Priority: Minor
>
> From d...@spark.apache.org:
> {quote}
> Jacek:
> Came across `private class ColumnPruner` with "TODO(ekl) make this a
> public transformer" in scaladoc, cf.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala#L317.
> Why is this private and is there a JIRA for the TODO(ekl)?
> {quote}
> {quote}
> Yanbo Liang:
> This is because ColumnPruner is currently only used by RFormula; we did not 
> expose it as a feature transformer.
> Please feel free to create a JIRA and work on it.
> {quote}
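
For illustration, a minimal sketch of what a public column-pruning transformer could look like, written against the Spark 2.x ML API. The class and constructor shown here are invented for the example; this is not the private ColumnPruner in RFormula.scala.

{code}
// Sketch only: a Transformer that drops a given set of columns.
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

class SimpleColumnPruner(override val uid: String, columnsToPrune: Set[String])
    extends Transformer {

  def this(columnsToPrune: Set[String]) =
    this(Identifiable.randomUID("simpleColumnPruner"), columnsToPrune)

  override def transform(dataset: Dataset[_]): DataFrame = {
    // Keep every column that is not in the prune set.
    val remaining = dataset.columns.filter(c => !columnsToPrune.contains(c))
    dataset.select(remaining.map(dataset.col): _*)
  }

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields.filter(f => !columnsToPrune.contains(f.name)))

  override def copy(extra: ParamMap): SimpleColumnPruner =
    new SimpleColumnPruner(uid, columnsToPrune)
}
{code}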



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-20 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-14740:
-
Component/s: (was: Spark Core)
 PySpark
 MLlib

> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For the original issue raised on StackOverflow please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14757) Incorrect behavior of Join operation in Spark SQL JOIN : "false" in the left table is joined to "null" on the right table

2016-04-20 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-14757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

黄泓 updated SPARK-14757:
---
Description: 
Content of table a:

|outgoing_0|
| false |
|  true |
|  null  |

Content of table b:

|outgoing_1|
| false  |
|  true  |
|  null   |

After running this query:

select * from a FULL JOIN b ON ( outgoing_0<=>outgoing_1)

I got the following result:

|outgoing_0|outgoing_1|
|  true  |  true  |
| false  | false |
| false  |  null  |
|  null   |  null  |

The row with "false" as outgoing_0 and "null" as outgoing_1 is unexpected. The 
operator <=> should match null with null. 

While left "false" is matched with right "null", it is also strange to find 
that the "false" on the right table does not match with "null" on the left 
table (no row with "null" as outgoing_0 and "false" as outgoing_1)

  was:
Content of table a:

+----------+
|outgoing_0|
+----------+
|     false|
|      true|
|      null|
+----------+

Content of table b:

+----------+
|outgoing_1|
+----------+
|     false|
|      true|
|      null|
+----------+

After running this query:

select * from a FULL JOIN b ON ( outgoing_0<=>outgoing_1)

I got the following result:

+----------+----------+
|outgoing_0|outgoing_1|
+----------+----------+
|      true|      true|
|     false|     false|
|     false|      null|
|      null|      null|
+----------+----------+

The row with "false" as outgoing_0 and "null" as outgoing_1 is unexpected. The 
operator <=> should match null with null. 

While left "false" is matched with right "null", it is also strange to find 
that the "false" on the right table does not match with "null" on the left 
table (no row with "null" as outgoing_0 and "false" as outgoing_1)


> Incorrect behavior of Join operation in Spark SQL JOIN : "false" in the left 
> table is joined to "null" on the right table
> -
>
> Key: SPARK-14757
> URL: https://issues.apache.org/jira/browse/SPARK-14757
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: 黄泓
>
> Content of table a:
> |outgoing_0|
> | false |
> |  true |
> |  null  |
> Content of table b:
> |outgoing_1|
> | false  |
> |  true  |
> |  null   |
> After running this query:
> select * from a FULL JOIN b ON ( outgoing_0<=>outgoing_1)
> I got the following result:
> |outgoing_0|outgoing_1|
> |  true  |  true  |
> | false  | false |
> | false  |  null  |
> |  null   |  null  |
> The row with "false" as outgoing_0 and "null" as outgoing_1 is unexpected. 
> The operator <=> should match null with null. 
> While left "false" is matched with right "null", it is also strange to find 
> that the "false" on the right table does not match with "null" on the left 
> table (no row with "null" as outgoing_0 and "false" as outgoing_1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250055#comment-15250055
 ] 

Paul Shearer commented on SPARK-14740:
--

It appears it's accessible through cvModel.bestModel._java_obj.getRegParam(); it 
just needs to be exposed in the right place in the code.
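
For reference, a sketch of why that works: the Python wrapper's _java_obj is a handle to the underlying JVM model, and on the Scala side the fitted model does carry the chosen hyper-parameter. Spark 2.x ML API is assumed, and the cast to LogisticRegressionModel is specific to this example.

{code}
// Sketch: reading the tuned regularization parameter from the JVM-side model.
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.tuning.CrossValidatorModel

def bestRegParam(cvModel: CrossValidatorModel): Double = {
  val best = cvModel.bestModel.asInstanceOf[LogisticRegressionModel]
  best.getRegParam  // the grid value used to fit the winning model is copied onto it
}
{code}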

> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For the original issue raised on StackOverflow please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


