[jira] [Resolved] (SPARK-10009) PySpark Param of Vector type can be set with Python array or numpy.array

2016-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10009.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> PySpark Param of Vector type can be set with Python array or numpy.array
> 
>
> Key: SPARK-10009
> URL: https://issues.apache.org/jira/browse/SPARK-10009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
> Fix For: 2.0.0
>
>
> If the type of a Param in a PySpark ML pipeline is Vector, it can currently only 
> be set with a Vector. We also need to support setting it with a Python array or a 
> numpy.array. This should be handled in the wrapper (_transfer_params_to_java).
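A minimal sketch (PySpark, with a hypothetical helper name) of the kind of coercion 
the Python wrapper could apply before a Vector-typed Param is transferred to the JVM:
{code}
import numpy as np
from pyspark.mllib.linalg import Vector, Vectors

def _coerce_to_vector(value):
    # Hypothetical helper: pass Vectors through unchanged, convert list/tuple/ndarray.
    if isinstance(value, Vector):
        return value
    if isinstance(value, (list, tuple, np.ndarray)):
        return Vectors.dense(value)
    raise TypeError("Cannot coerce %r to a Vector" % (value,))

# Both of these would then be accepted wherever a Vector-typed Param is set:
_coerce_to_vector([0.1, 0.2, 0.3])
_coerce_to_vector(np.array([0.1, 0.2, 0.3]))
{code}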



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2016-03-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209802#comment-15209802
 ] 

Joseph K. Bradley commented on SPARK-4144:
--

I'm wondering: do you have a use case? I have not heard of any other interest in 
this for a long time, so I want to make sure it's still needed.

> Support incremental model training of Naive Bayes classifier
> 
>
> Key: SPARK-4144
> URL: https://issues.apache.org/jira/browse/SPARK-4144
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Streaming
>Reporter: Chris Fregly
>Assignee: Chris Fregly
>
> Per Xiangrui Meng from the following user list discussion:  
> http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E
>
> "For Naive Bayes, we need to update the priors and conditional
> probabilities, which means we should also remember the number of
> observations for the updates."
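A minimal, non-Spark sketch (plain numpy, hypothetical class name) of why remembering 
the observation counts is enough for incremental updates: each batch only adds to the 
running class and feature counts, and the priors and conditional probabilities are 
recomputed from those totals.
{code}
import numpy as np

class IncrementalMultinomialNB(object):
    """Toy sketch: running counts are the only state needed between batches."""

    def __init__(self, num_classes, num_features, smoothing=1.0):
        self.smoothing = smoothing
        self.class_counts = np.zeros(num_classes)                    # observations per class
        self.feature_counts = np.zeros((num_classes, num_features))  # summed feature counts per class

    def update(self, labels, features):
        # labels: iterable of class indices; features: 2-D array of per-example feature counts
        for y, x in zip(labels, features):
            self.class_counts[y] += 1
            self.feature_counts[y] += x

    def log_priors(self):
        return np.log(self.class_counts / self.class_counts.sum())

    def log_conditionals(self):
        smoothed = self.feature_counts + self.smoothing
        return np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
{code}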



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2016-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4144:
-
Assignee: (was: Chris Fregly)

> Support incremental model training of Naive Bayes classifier
> 
>
> Key: SPARK-4144
> URL: https://issues.apache.org/jira/browse/SPARK-4144
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Streaming
>Reporter: Chris Fregly
>
> Per Xiangrui Meng from the following user list discussion:  
> http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E
>
> "For Naive Bayes, we need to update the priors and conditional
> probabilities, which means we should also remember the number of
> observations for the updates."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13911) Having condition and order by cannot both have aggregate functions

2016-03-23 Thread Yang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209801#comment-15209801
 ] 

Yang Wang commented on SPARK-13911:
---

Is there any new progress on this issue? We are facing the same problem. Thanks.

> Having condition and order by cannot both have aggregate functions
> --
>
> Key: SPARK-13911
> URL: https://issues.apache.org/jira/browse/SPARK-13911
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.1, 2.0.0
>Reporter: Cheng Lian
>
> Given the following temporary table:
> {code}
> sqlContext range 10 select ('id as 'a, 'id as 'b) registerTempTable "t"
> {code}
> The following SQL statement can't pass analysis:
> {noformat}
> scala> sqlContext sql "SELECT * FROM t GROUP BY a HAVING COUNT(b) > 0 ORDER 
> BY COUNT(b)" show ()
> org.apache.spark.sql.AnalysisException: expression '`t`.`b`' is neither 
> present in the group by, nor is it an aggregate function. Add to group by or 
> wrap in first() (or first_value) if you don't care which value you get.;
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:36)
>   at org.apache.spark.sql.Dataset$.newDataFrame(Dataset.scala:58)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:784)
>   ... 49 elided
> {noformat}
> The reason is that analysis rule {{ResolveAggregateFunctions}} only handles 
> the first {{Filter}} _or_ {{Sort}} directly above an {{Aggregate}}.
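Until the analyzer handles both clauses, one possible workaround (a sketch only, shown 
in PySpark; not the actual fix) is to compute the aggregate once in a subquery so that 
the outer filter and sort refer to a plain column:
{code}
sqlContext.sql("""
  SELECT a, cnt_b
  FROM (SELECT a, COUNT(b) AS cnt_b FROM t GROUP BY a) tmp
  WHERE cnt_b > 0
  ORDER BY cnt_b
""").show()
{code}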



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13934) SqlParser.parseTableIdentifier cannot recognize table name start with scientific notation

2016-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209796#comment-15209796
 ] 

Apache Spark commented on SPARK-13934:
--

User 'wangyang1992' has created a pull request for this issue:
https://github.com/apache/spark/pull/11929

> SqlParser.parseTableIdentifier cannot recognize table name start with 
> scientific notation
> -
>
> Key: SPARK-13934
> URL: https://issues.apache.org/jira/browse/SPARK-13934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yang Wang
>
> SqlParser.parseTableIdentifier cannot recognize a table name that starts with 
> scientific notation, such as "1e30abcdedfg".
> This bug can be reproduced with the following code:
> {code}
> val conf = new SparkConf().setAppName(s"test").setMaster("local[2]")
> val sc = new SparkContext(conf)
> val hc = new HiveContext(sc)
> val tableName = "1e34abcd"
> hc.sql("select 123").registerTempTable(tableName)
> hc.dropTempTable(tableName)
> {code}
> The last line throws a RuntimeException (java.lang.RuntimeException: [1.1] 
> failure: identifier expected).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13456) Cannot create encoders for case classes defined in Spark shell after upgrading to Scala 2.11

2016-03-23 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209794#comment-15209794
 ] 

Wenchen Fan commented on SPARK-13456:
-

Oh, I see. The problem is paste mode; I'm working on it.

> Cannot create encoders for case classes defined in Spark shell after 
> upgrading to Scala 2.11
> 
>
> Key: SPARK-13456
> URL: https://issues.apache.org/jira/browse/SPARK-13456
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Spark 2.0 started to use Scala 2.11 by default since [PR 
> #10608|https://github.com/apache/spark/pull/10608].  Unfortunately, after 
> this upgrade, Spark fails to create encoders for case classes defined in REPL:
> {code}
> import sqlContext.implicits._
> case class T(a: Int, b: Double)
> val ds = Seq(1 -> T(1, 1D), 2 -> T(2, 2D)).toDS()
> {code}
> Exception thrown:
> {noformat}
> org.apache.spark.sql.AnalysisException: Unable to generate an encoder for 
> inner class `T` without access to the scope that this class was defined in.
> Try moving this class out of its parent class.;
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:565)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:561)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:304)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:353)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:333)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at 

[jira] [Commented] (SPARK-13456) Cannot create encoders for case classes defined in Spark shell after upgrading to Scala 2.11

2016-03-23 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209790#comment-15209790
 ] 

Wenchen Fan commented on SPARK-13456:
-

I tried your example locally and everything works fine. How do you build Spark?

> Cannot create encoders for case classes defined in Spark shell after 
> upgrading to Scala 2.11
> 
>
> Key: SPARK-13456
> URL: https://issues.apache.org/jira/browse/SPARK-13456
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Spark 2.0 started to use Scala 2.11 by default since [PR 
> #10608|https://github.com/apache/spark/pull/10608].  Unfortunately, after 
> this upgrade, Spark fails to create encoders for case classes defined in REPL:
> {code}
> import sqlContext.implicits._
> case class T(a: Int, b: Double)
> val ds = Seq(1 -> T(1, 1D), 2 -> T(2, 2D)).toDS()
> {code}
> Exception thrown:
> {noformat}
> org.apache.spark.sql.AnalysisException: Unable to generate an encoder for 
> inner class `T` without access to the scope that this class was defined in.
> Try moving this class out of its parent class.;
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:565)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:561)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:304)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:353)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:333)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at 

[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-23 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209748#comment-15209748
 ] 

Xin Wu commented on SPARK-13832:


[~jfc...@us.ibm.com] For the above execution issue, I think it is a duplicate of 
SPARK-14096. I think you can close this JIRA and refer to SPARK-14096 for the 
Kryo exception issue. Thanks!

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8632) Poor Python UDF performance because of RDD caching

2016-03-23 Thread Bijay Kumar Pathak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209743#comment-15209743
 ] 

Bijay Kumar Pathak edited comment on SPARK-8632 at 3/24/16 4:47 AM:


I am still having this issue in Spark 1.6.0 on EMR; the job fails with OOM. I have 
a DataFrame with 250 columns and I am applying UDFs to more than 50 of the columns. 
I register the DataFrame as a temp table and apply the UDFs in a hive_context SQL 
statement, after a sort-merge join of two DataFrames (each around 4 GB) and 
multiple broadcast joins against 22 dimension tables.

Below is how I am applying the UDF.

{code:borderStyle=solid}
data_frame.registerTempTable("temp_table")
new_df = hive_context.sql("select python_udf(column_1),python_udf(column_2), 
... , from temp_table")
{code}


was (Author: bijay697):
I am still having this issue in Spark 1.6.0 on EMR; the job fails with OOM. I have 
a DataFrame with 250 columns and I am applying UDFs to more than 50 of the columns. 
I register the DataFrame as a temp table and apply the UDFs in a hive_context SQL 
statement, after a sort-merge join of two DataFrames (each around 4 GB) and 
multiple broadcast joins against 22 dimension tables.

Below is how I am applying the UDF.

{code:title=Bar.python|borderStyle=solid}
data_frame.registerTempTable("temp_table")
new_df = hive_context.sql("select python_udf(column_1),python_udf(column_2), 
... , from temp_table")
{code}

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and I wanted to use a Python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html
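A hypothetical mitigation sketch (PySpark; the UDF, key and column names are made up) 
that narrows the data before it is shipped to Python and joins the result back, so 
only the UDF's input columns are cached and serialized:
{code}
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

clean_udf = F.udf(lambda v: v.strip() if v is not None else None, StringType())

# Assumes "id" is a unique key on data_frame and "column_1" is the UDF input.
narrow = data_frame.select("id", "column_1")
scored = narrow.withColumn("column_1_clean", clean_udf(narrow["column_1"]))
result = data_frame.join(scored.select("id", "column_1_clean"), on="id")
{code}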



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8632) Poor Python UDF performance because of RDD caching

2016-03-23 Thread Bijay Kumar Pathak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209743#comment-15209743
 ] 

Bijay Kumar Pathak edited comment on SPARK-8632 at 3/24/16 4:46 AM:


I am still having this issue in Spark 1.6.0 on EMR; the job fails with OOM. I have 
a DataFrame with 250 columns and I am applying UDFs to more than 50 of the columns. 
I register the DataFrame as a temp table and apply the UDFs in a hive_context SQL 
statement, after a sort-merge join of two DataFrames (each around 4 GB) and 
multiple broadcast joins against 22 dimension tables.

Below is how I am applying the UDF.

{code:title=Bar.python|borderStyle=solid}
data_frame.registerTempTable("temp_table")
new_df = hive_context.sql("select python_udf(column_1),python_udf(column_2), 
... , from temp_table")
{code}


was (Author: bijay697):
I am still having this issue in Spark 1.6.0 on EMR; the job fails with OOM. I have 
a DataFrame with 250 columns and I am applying UDFs to more than 50 of the columns. 
I register the DataFrame as a temp table and apply the UDFs in a hive_context SQL 
statement, after a sort-merge join of two DataFrames (each around 4 GB) and 
multiple broadcast joins against 22 dimension tables.

Below is how I am applying the UDF.

``` python
data_frame.registerTempTable("temp_table")
new_df = hive_context.sql("select python_udf(column_1),python_udf(column_2), 
... , from temp_table")
```


> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and I wanted to use a Python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

2016-03-23 Thread Bijay Kumar Pathak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209743#comment-15209743
 ] 

Bijay Kumar Pathak commented on SPARK-8632:
---

I am still having this issue in Spark 1.6.0 on EMR; the job fails with OOM. I have 
a DataFrame with 250 columns and I am applying UDFs to more than 50 of the columns. 
I register the DataFrame as a temp table and apply the UDFs in a hive_context SQL 
statement, after a sort-merge join of two DataFrames (each around 4 GB) and 
multiple broadcast joins against 22 dimension tables.

Below is how I am applying the UDF.

``` python
data_frame.registerTempTable("temp_table")
new_df = hive_context.sql("select python_udf(column_1),python_udf(column_2), 
... , from temp_table")
```


> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and I wanted to use a Python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10691) Make Logistic, Linear Regression Model evaluate() method public

2016-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209740#comment-15209740
 ] 

Apache Spark commented on SPARK-10691:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/11928

> Make Logistic, Linear Regression Model evaluate() method public
> ---
>
> Key: SPARK-10691
> URL: https://issues.apache.org/jira/browse/SPARK-10691
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Hao Ren
>Assignee: Joseph K. Bradley
>
> The following method in {{LogisticRegressionModel}} is marked as {{private}}, 
> which prevents users from creating a summary on any given data set. Check 
> [here|https://github.com/feynmanliang/spark/blob/d219fa4c216e8f35b71a26921561104d15cd6055/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L272].
> {code}
> // TODO: decide on a good name before exposing to public API
> private[classification] def evaluate(dataset: DataFrame)
> : LogisticRegressionSummary = {
> new BinaryLogisticRegressionSummary(
> this.transform(dataset), 
> $(probabilityCol), 
> $(labelCol))
> }
> {code}
> This method is definitely necessary to test model performance.
> By the way, the name {{evaluate}} is already pretty good for me.
> [~mengxr] Could you check this ? Thx
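For illustration only, a hypothetical PySpark sketch of what a public evaluate() would 
enable; the train_df/test_df DataFrames and the areaUnderROC field are assumptions, 
not the agreed API:
{code}
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=10)
model = lr.fit(train_df)             # train_df assumed to have "label"/"features" columns
summary = model.evaluate(test_df)    # would return a LogisticRegressionSummary on held-out data
print(summary.areaUnderROC)          # assumed binary-classification metric on that summary
{code}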



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10691) Make Logistic, Linear Regression Model evaluate() method public

2016-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10691:


Assignee: Apache Spark  (was: Joseph K. Bradley)

> Make Logistic, Linear Regression Model evaluate() method public
> ---
>
> Key: SPARK-10691
> URL: https://issues.apache.org/jira/browse/SPARK-10691
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Hao Ren
>Assignee: Apache Spark
>
> The following method in {{LogisticRegressionModel}} is marked as {{private}}, 
> which prevents users from creating a summary on any given data set. Check 
> [here|https://github.com/feynmanliang/spark/blob/d219fa4c216e8f35b71a26921561104d15cd6055/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L272].
> {code}
> // TODO: decide on a good name before exposing to public API
> private[classification] def evaluate(dataset: DataFrame)
> : LogisticRegressionSummary = {
> new BinaryLogisticRegressionSummary(
> this.transform(dataset), 
> $(probabilityCol), 
> $(labelCol))
> }
> {code}
> This method is definitely necessary to test model performance.
> By the way, the name {{evaluate}} is already pretty good for me.
> [~mengxr] Could you check this ? Thx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10691) Make Logistic, Linear Regression Model evaluate() method public

2016-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10691:


Assignee: Joseph K. Bradley  (was: Apache Spark)

> Make Logistic, Linear Regression Model evaluate() method public
> ---
>
> Key: SPARK-10691
> URL: https://issues.apache.org/jira/browse/SPARK-10691
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Hao Ren
>Assignee: Joseph K. Bradley
>
> The following method in {{LogisticRegressionModel}} is marked as {{private}}, 
> which prevents users from creating a summary on any given data set. Check 
> [here|https://github.com/feynmanliang/spark/blob/d219fa4c216e8f35b71a26921561104d15cd6055/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L272].
> {code}
> // TODO: decide on a good name before exposing to public API
> private[classification] def evaluate(dataset: DataFrame)
> : LogisticRegressionSummary = {
> new BinaryLogisticRegressionSummary(
> this.transform(dataset), 
> $(probabilityCol), 
> $(labelCol))
> }
> {code}
> This method is definitely necessary to test model performance.
> By the way, the name {{evaluate}} is already pretty good for me.
> [~mengxr] Could you check this ? Thx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10691) Make Logistic, Linear Regression Model evaluate() method public

2016-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10691:
--
Summary: Make Logistic, Linear Regression Model evaluate() method public  
(was: Make LogisticRegressionModel.evaluate() method public)

> Make Logistic, Linear Regression Model evaluate() method public
> ---
>
> Key: SPARK-10691
> URL: https://issues.apache.org/jira/browse/SPARK-10691
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Hao Ren
>Assignee: Joseph K. Bradley
>
> The following method in {{LogisticRegressionModel}} is marked as {{private}}, 
> which prevents users from creating a summary on any given data set. Check 
> [here|https://github.com/feynmanliang/spark/blob/d219fa4c216e8f35b71a26921561104d15cd6055/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L272].
> {code}
> // TODO: decide on a good name before exposing to public API
> private[classification] def evaluate(dataset: DataFrame)
> : LogisticRegressionSummary = {
> new BinaryLogisticRegressionSummary(
> this.transform(dataset), 
> $(probabilityCol), 
> $(labelCol))
> }
> {code}
> This method is definitely necessary to test model performance.
> By the way, the name {{evaluate}} is already pretty good for me.
> [~mengxr] Could you check this ? Thx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10691) Make LogisticRegressionModel.evaluate() method public

2016-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-10691:
-

Assignee: Joseph K. Bradley

> Make LogisticRegressionModel.evaluate() method public
> -
>
> Key: SPARK-10691
> URL: https://issues.apache.org/jira/browse/SPARK-10691
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Hao Ren
>Assignee: Joseph K. Bradley
>
> The following method in {{LogisticRegressionModel}} is marked as {{private}}, 
> which prevents users from creating a summary on any given data set. Check 
> [here|https://github.com/feynmanliang/spark/blob/d219fa4c216e8f35b71a26921561104d15cd6055/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L272].
> {code}
> // TODO: decide on a good name before exposing to public API
> private[classification] def evaluate(dataset: DataFrame)
> : LogisticRegressionSummary = {
> new BinaryLogisticRegressionSummary(
> this.transform(dataset), 
> $(probabilityCol), 
> $(labelCol))
> }
> {code}
> This method is definitely necessary to test model performance.
> By the way, the name {{evaluate}} is already pretty good for me.
> [~mengxr] Could you check this ? Thx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14096) SPARK-SQL CLI returns NPE

2016-03-23 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209737#comment-15209737
 ] 

Xin Wu commented on SPARK-14096:


I simplified the query to:
{code}select * from item order by i_item_id limit 100;{code}
And it fails with this exception:{code}
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
at java.util.PriorityQueue.offer(PriorityQueue.java:344)
at java.util.PriorityQueue.add(PriorityQueue.java:321)
at 
com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
at 
com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
{code}
Removing either the ORDER BY or the LIMIT clause makes the query pass.

> SPARK-SQL CLI returns NPE
> -
>
> Key: SPARK-14096
> URL: https://issues.apache.org/jira/browse/SPARK-14096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> Running TPC-DS query 06 in the spark-sql shell produced the following error in 
> the middle of a stage, while running another query (query 38) succeeded:
> NPE:
> {noformat}
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage 
> 10.0 (TID 622) in 171 ms on localhost (30/200)
> 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting 
> task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:344)
>   at java.util.PriorityQueue.add(PriorityQueue.java:321)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
>   ... 15 more
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all 

[jira] [Closed] (SPARK-14093) org.apache.spark.ml.recommendation.ALSModel.save method cannot be used with HDFS

2016-03-23 Thread Dulaj Rajitha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dulaj Rajitha closed SPARK-14093.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

I was able to solve the issue by using the ALSModel.write().saveImpl(hdfsPath) 
method to save to HDFS.

> org.apache.spark.ml.recommendation.ALSModel.save method cannot be used with 
> HDFS
> 
>
> Key: SPARK-14093
> URL: https://issues.apache.org/jira/browse/SPARK-14093
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML
>Affects Versions: 1.6.0
>Reporter: Dulaj Rajitha
> Fix For: 1.6.0
>
>
> ALSModel.save(path) is not working for HDFS paths and it gives  
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://192.168.1.71/res/als.model, expected: file:/// 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14093) org.apache.spark.ml.recommendation.ALSModel.save method cannot be used with HDFS

2016-03-23 Thread Dulaj Rajitha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209724#comment-15209724
 ] 

Dulaj Rajitha commented on SPARK-14093:
---

I was able to solve the issue by using the ALSModel.write().saveImpl(hdfsPath) 
method to save to HDFS.

> org.apache.spark.ml.recommendation.ALSModel.save method cannot be used with 
> HDFS
> 
>
> Key: SPARK-14093
> URL: https://issues.apache.org/jira/browse/SPARK-14093
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML
>Affects Versions: 1.6.0
>Reporter: Dulaj Rajitha
>
> ALSModel.save(path) is not working for HDFS paths and it gives  
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://192.168.1.71/res/als.model, expected: file:/// 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12183) Remove spark.mllib tree, forest implementations and use spark.ml

2016-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-12183.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11855
[https://github.com/apache/spark/pull/11855]

> Remove spark.mllib tree, forest implementations and use spark.ml
> 
>
> Key: SPARK-12183
> URL: https://issues.apache.org/jira/browse/SPARK-12183
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.0.0
>
>
> This JIRA is for replacing the spark.mllib decision tree and random forest 
> implementations with the one from spark.ml.  The spark.ml one should be used 
> as a wrapper.  This should involve moving the implementation, but should 
> probably not require changing the tests (much).
> This blocks on 1 improvement to spark.mllib which needs to be ported to 
> spark.ml: [SPARK-10064]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14110) PipedRDD to print the command ran on non zero exit

2016-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14110:


Assignee: Apache Spark

> PipedRDD to print the command ran on non zero exit
> --
>
> Key: SPARK-14110
> URL: https://issues.apache.org/jira/browse/SPARK-14110
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Trivial
>
> In case of failure in subprocess launched in PipedRDD, the failure exception 
> reads “Subprocess exited with status XXX”. Debugging this is not easy for 
> users especially if there are multiple pipe() operations in the Spark 
> application. It would be great to see the command of the failed sub-process 
> so that one can identify which pipe() failed.
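For context, a small PySpark sketch (the shell scripts and path are hypothetical) of 
the situation described above: with several chained pipe() stages, "Subprocess exited 
with status XXX" alone does not say which external command failed:
{code}
lines = sc.textFile("hdfs:///logs/input")
cleaned = (lines
           .pipe("./normalize.sh")
           .pipe("./enrich.sh")
           .pipe("./filter.sh"))
cleaned.count()   # a non-zero exit in any stage currently reports only the status code
{code}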



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14110) PipedRDD to print the command ran on non zero exit

2016-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209702#comment-15209702
 ] 

Apache Spark commented on SPARK-14110:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/11927

> PipedRDD to print the command ran on non zero exit
> --
>
> Key: SPARK-14110
> URL: https://issues.apache.org/jira/browse/SPARK-14110
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Trivial
>
> In case of failure in subprocess launched in PipedRDD, the failure exception 
> reads “Subprocess exited with status XXX”. Debugging this is not easy for 
> users especially if there are multiple pipe() operations in the Spark 
> application. It would be great to see the command of the failed sub-process 
> so that one can identify which pipe() failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14110) PipedRDD to print the command ran on non zero exit

2016-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14110:


Assignee: (was: Apache Spark)

> PipedRDD to print the command ran on non zero exit
> --
>
> Key: SPARK-14110
> URL: https://issues.apache.org/jira/browse/SPARK-14110
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Trivial
>
> In case of failure in subprocess launched in PipedRDD, the failure exception 
> reads “Subprocess exited with status XXX”. Debugging this is not easy for 
> users especially if there are multiple pipe() operations in the Spark 
> application. It would be great to see the command of the failed sub-process 
> so that one can identify which pipe() failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14096) SPARK-SQL CLI returns NPE

2016-03-23 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209688#comment-15209688
 ] 

Xin Wu commented on SPARK-14096:


[~jfc...@us.ibm.com] Can you try the query without the ORDER BY? I noticed another 
query that failed with the ORDER BY but succeeded without it.

> SPARK-SQL CLI returns NPE
> -
>
> Key: SPARK-14096
> URL: https://issues.apache.org/jira/browse/SPARK-14096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> Running TPC-DS query 06 in the spark-sql shell produced the following error in 
> the middle of a stage, while running another query (query 38) succeeded:
> NPE:
> {noformat}
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage 
> 10.0 (TID 622) in 171 ms on localhost (30/200)
> 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting 
> task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:344)
>   at java.util.PriorityQueue.add(PriorityQueue.java:321)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
>   ... 15 more
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage 
> 10.0 (TID 623) in 171 ms on localhost (31/200)
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> {noformat}
> query 06 (caused the above NPE):
> {noformat}
>  select  a.ca_state state, count(*) cnt
>  from customer_address a
>  join customer c on a.ca_address_sk = c.c_current_addr_sk
>  join store_sales s on c.c_customer_sk = s.ss_customer_sk
>  join date_dim d on s.ss_sold_date_sk = d.d_date_sk
>  join item i on s.ss_item_sk = i.i_item_sk
>  join (select distinct d_month_seq
> from date_dim
>where d_year = 2001
>   and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq
>  join
>   (select j.i_category, avg(j.i_current_price) as avg_i_current_price
>from item j group by j.i_category) tmp2 on tmp2.i_category = 
> i.i_category
>  where  
>   i.i_current_price > 1.2 * 

[jira] [Commented] (SPARK-12194) Add Sink for reporting Spark Metrics to OpenTSDB

2016-03-23 Thread He Tianyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209686#comment-15209686
 ] 

He Tianyi commented on SPARK-12194:
---

I am also looking for an OpenTSDB sink. Would any committer please take a look at this?

> Add Sink for reporting Spark Metrics to OpenTSDB
> 
>
> Key: SPARK-12194
> URL: https://issues.apache.org/jira/browse/SPARK-12194
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Kapil Singh
>
> Add OpenTSDB Sink to the currently supported metric sinks. Since OpenTSDB is 
> a popular open-source Time Series Database (based on HBase), this will make 
> it convenient for those who want metrics data for time series analysis 
> purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-23 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209687#comment-15209687
 ] 

Xin Wu commented on SPARK-13832:


The analysis issue reported in this JIRA is resolved in Spark 2.0.
As for the Kryo exception during execution, the query can return without the ORDER 
BY, so I am also looking into why the ORDER BY clause triggers this Kryo exception.

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14110) PipedRDD to print the command ran on non zero exit

2016-03-23 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-14110:

Summary: PipedRDD to print the command ran on non zero exit  (was: PipedRDD 
to print the command ran on failure)

> PipedRDD to print the command ran on non zero exit
> --
>
> Key: SPARK-14110
> URL: https://issues.apache.org/jira/browse/SPARK-14110
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Trivial
>
> In case of failure in subprocess launched in PipedRDD, the failure exception 
> reads “Subprocess exited with status XXX”. Debugging this is not easy for 
> users especially if there are multiple pipe() operations in the Spark 
> application. It would be great to see the command of the failed sub-process 
> so that one can identify which pipe() failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14110) PipedRDD to print the command ran on failure

2016-03-23 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-14110:

Description: In case of failure in subprocess launched in PipedRDD, the 
failure exception reads “Subprocess exited with status XXX”. Debugging this is 
not easy for users especially if there are multiple pipe() operations in the 
Spark application. It would be great to see the command of the failed 
sub-process so that one can identify which pipe() failed.  (was: In case of 
failure in PipedRDD, the driver says “Subprocess exited with status XXX”. 
Debugging this is not easy for users especially if there are multiple pipe() 
operations in the Spark application. It would be great to see the command of 
the failed sub-process so that one can identify which pipe() failed.)

> PipedRDD to print the command ran on failure
> 
>
> Key: SPARK-14110
> URL: https://issues.apache.org/jira/browse/SPARK-14110
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Trivial
>
> In case of failure in subprocess launched in PipedRDD, the failure exception 
> reads “Subprocess exited with status XXX”. Debugging this is not easy for 
> users especially if there are multiple pipe() operations in the Spark 
> application. It would be great to see the command of the failed sub-process 
> so that one can identify which pipe() failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14111) Correct output nullability with constraints for logical plans

2016-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14111:


Assignee: Apache Spark

> Correct output nullability with constraints for logical plans
> -
>
> Key: SPARK-14111
> URL: https://issues.apache.org/jira/browse/SPARK-14111
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> We use the output of a logical plan as its schema. Output nullability is important 
> and we should keep it correct.
> With constraints and optimization, we will in fact change the output nullability 
> of logical plans, but we don't reflect such changes in the output attributes, so 
> the output nullability is not correct now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14111) Correct output nullability with constraints for logical plans

2016-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14111:


Assignee: (was: Apache Spark)

> Correct output nullability with constraints for logical plans
> -
>
> Key: SPARK-14111
> URL: https://issues.apache.org/jira/browse/SPARK-14111
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> We use the output of a logical plan as its schema. Output nullability is important 
> and we should keep it correct.
> With constraints and optimization, we will in fact change the output nullability 
> of logical plans, but we don't reflect such changes in the output attributes, so 
> the output nullability is not correct now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14111) Correct output nullability with constraints for logical plans

2016-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209676#comment-15209676
 ] 

Apache Spark commented on SPARK-14111:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/11926

> Correct output nullability with constraints for logical plans
> -
>
> Key: SPARK-14111
> URL: https://issues.apache.org/jira/browse/SPARK-14111
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> We use the output of a logical plan as its schema. Output nullability is important 
> and we should keep it correct.
> With constraints and optimization, we will in fact change the output nullability 
> of logical plans, but we don't reflect such changes in the output attributes, so 
> the output nullability is not correct now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14111) Correct output nullability with constraints for logical plans

2016-03-23 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-14111:
---

 Summary: Correct output nullability with constraints for logical 
plans
 Key: SPARK-14111
 URL: https://issues.apache.org/jira/browse/SPARK-14111
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


We use the output of a logical plan as its schema. Output nullability is important and 
we should keep it correct.

With constraints and optimization, we will in fact change the output nullability of 
logical plans, but we don't reflect such changes in the output attributes, so the 
output nullability is not correct now.
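
A rough sketch of the idea (the helper below is hypothetical, not Spark's actual API), 
using Catalyst's constraints and Attribute.withNullability:

{code}
import org.apache.spark.sql.catalyst.expressions.{Attribute, IsNotNull}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical helper: tighten the nullability of output attributes using the plan's
// IsNotNull constraints, so the reported schema matches what the plan actually produces.
def outputWithConstraints(plan: LogicalPlan): Seq[Attribute] = {
  val notNullIds = plan.constraints.collect { case IsNotNull(a: Attribute) => a.exprId }
  plan.output.map { a =>
    if (a.nullable && notNullIds.contains(a.exprId)) a.withNullability(false) else a
  }
}
{code}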



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13903) Modify output nullability with constraints for Join and Filter operators

2016-03-23 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh closed SPARK-13903.
---
Resolution: Won't Fix

> Modify output nullability with constraints for Join and Filter operators
> 
>
> Key: SPARK-13903
> URL: https://issues.apache.org/jira/browse/SPARK-13903
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> With constraints and optimization, we can make sure some outputs of a Join 
> (or Filter) operator are not nulls. We should modify output nullability 
> accordingly. We can use this information in later execution to avoid null 
> checking.
> Another reason to modify plan output is that we will use the output to 
> determine schema. We should keep correct nullability in the schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14085) Star Expansion for Hash

2016-03-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-14085:

Affects Version/s: (was: 2.0.0)
   1.6.1

> Star Expansion for Hash
> ---
>
> Key: SPARK-14085
> URL: https://issues.apache.org/jira/browse/SPARK-14085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Support star expansion in hash. For example
> {code}
> val structDf = testData2.select("a", "b").as("record")
> structDf.select(hash($"*"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14085) Star Expansion for Hash

2016-03-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-14085.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11904
[https://github.com/apache/spark/pull/11904]

> Star Expansion for Hash
> ---
>
> Key: SPARK-14085
> URL: https://issues.apache.org/jira/browse/SPARK-14085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
> Fix For: 2.0.0
>
>
> Support star expansion in hash. For example
> {code}
> val structDf = testData2.select("a", "b").as("record")
> structDf.select(hash($"*"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14085) Star Expansion for Hash

2016-03-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-14085:

Assignee: Xiao Li

> Star Expansion for Hash
> ---
>
> Key: SPARK-14085
> URL: https://issues.apache.org/jira/browse/SPARK-14085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Support star expansion in hash. For example
> {code}
> val structDf = testData2.select("a", "b").as("record")
> structDf.select(hash($"*"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14068) Pluggable DiskBlockManager

2016-03-23 Thread He Tianyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209624#comment-15209624
 ] 

He Tianyi commented on SPARK-14068:
---

For example:
In my deployment, both SSDs and HDDs are installed on each node. In production 
I've encountered 'No space left on device' many times, since the SSDs do not 
always fit all data blocks during the shuffle phase. 
In this case one may want to implement an 'ssd-first' strategy: use 
SSDs if possible, otherwise fall back to HDDs. 

Generally, strategies may be highly customized per user.
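
A rough sketch of what a pluggable placement strategy could look like; the trait and its 
wiring are hypothetical (DiskBlockManager has no such extension point today):

{code}
import java.io.File

// Hypothetical extension point: given the configured local dirs and a block file name,
// pick the directory the file should be placed in.
trait BlockFilePlacementStrategy {
  def selectDir(localDirs: Array[File], filename: String): File
}

// An "ssd-first" strategy: prefer SSD-backed dirs that still have room, otherwise fall
// back to the remaining (HDD) dirs. Assumes at least one local dir is configured.
class SsdFirstStrategy(isSsd: File => Boolean, minFreeBytes: Long)
  extends BlockFilePlacementStrategy {
  override def selectDir(localDirs: Array[File], filename: String): File = {
    val (ssds, hdds) = localDirs.partition(isSsd)
    val candidates = ssds.filter(_.getUsableSpace > minFreeBytes) ++ hdds
    // Keep the existing hash-based spreading, just restricted to the preferred candidates.
    candidates((filename.hashCode & Integer.MAX_VALUE) % candidates.length)
  }
}
{code}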

> Pluggable DiskBlockManager
> --
>
> Key: SPARK-14068
> URL: https://issues.apache.org/jira/browse/SPARK-14068
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager
>Reporter: He Tianyi
>Priority: Minor
>
> Currently DiskBlockManager places files using a hashing strategy; this can be 
> non-optimal in some scenarios. 
> Maybe we should make it pluggable. That is, DiskBlockManager could be replaced with 
> another implementation with a different strategy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12083) java.lang.IllegalArgumentException: requirement failed: Overflowed precision (q98)

2016-03-23 Thread zhangdekun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209608#comment-15209608
 ] 

zhangdekun commented on SPARK-12083:


In JDBCRDD.scala:
case java.sql.Types.DECIMAL
  if precision != 0 || scale != 0 => DecimalType.bounded(precision, scale)
Maybe the scale should also be set > 0 here.

> java.lang.IllegalArgumentException: requirement failed: Overflowed precision 
> (q98)
> --
>
> Key: SPARK-12083
> URL: https://issues.apache.org/jira/browse/SPARK-12083
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CentOS release 6.6 
>Reporter: Dileep Kumar
>  Labels: perfomance
>
> While running with 10 users we found that one of the executor randomly hangs 
> during q98 execution. The behavior is random in way that it happens at 
> different time but for the same query. Tried to get a stack trace but was not 
> successful in generating the stack trace.
> Here is the last exception that I saw before the hang:
> java.lang.IllegalArgumentException: requirement failed: Overflowed precision
>   at scala.Predef$.require(Predef.scala:233)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:111)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:335)
>   at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.getDecimal(UnsafeRow.java:388)
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getDecimal(JoinedRow.scala:95)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
> ===
> One of the other executor before this had the following exception:
> FetchFailed(BlockManagerId(10, d2412.halxg.cloudera.com, 45956), shuffleId=0, 
> mapId=212, reduceId=492, message=
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> d2412.halxg.cloudera.com/10.20.122.112:45956
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:321)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:306)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:173)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 

[jira] [Commented] (SPARK-14105) Serialization issue for KafkaRDD

2016-03-23 Thread Liyin Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209594#comment-15209594
 ] 

Liyin Tang commented on SPARK-14105:


I tested the PR with the same streaming job, and verified the serialization 
issue is fixed with the PR.
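
For reference, a tiny standalone snippet (illustrative only, no Kafka involved) of the 
slicing behaviour described in the report below: a small message view still shares the 
whole fetch buffer's backing array, which is what a field-walking serializer ends up 
writing for every message.

{code}
import java.nio.ByteBuffer

val fetched = ByteBuffer.allocate(64 * 1024 * 1024) // stand-in for one large fetch response
fetched.position(0)
fetched.limit(1024)
val message = fetched.slice()                       // a 1 KB "message" view

println(message.remaining())    // 1024     -> the logical message size
println(message.array().length) // 67108864 -> the backing array a serializer may write
{code}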

> Serialization issue for KafkaRDD
> 
>
> Key: SPARK-14105
> URL: https://issues.apache.org/jira/browse/SPARK-14105
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2, 1.6.1
>Reporter: Liyin Tang
> Attachments: Screenshot 2016-03-23 09.04.51.png
>
>
> When using DISK or MEMORY to persist a KafkaDirectInputStream, it will 
> serialize the FetchResponse into blocks. The FetchResponse contains the 
> ByteBufferMessageSet, where each Kafka Message is just one slice of the 
> underlying ByteBuffer. 
> When serializing the KafkaRDDIterator, it seems like the entire underlying 
> ByteBuffer in the ByteBufferMessageSet is serialized for each and every 
> message. This causes the block size to easily exceed 2G, and leads to 
> "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" or 
> "FileChannelImpl.map -> exceeds Integer.MAX_VALUE:"
> The consumer fetch size is the default value (1M). I tried to reduce the fetch 
> size, but it caused other errors like errRanOutBeforeEnd. 
> Here are exceptions I got for both Memory and Disk persistent.
> Memory Persistent:
> 16/03/23 15:34:44 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-9,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at 
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
> at 
> org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
> at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
> at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> Disk Persistent: 
> 16/03/23 17:04:00 INFO TaskSetManager: Lost task 42.1 in stage 2.0 (TID 974) 
> on executor i-2878ceb3.inst.aws.airbnb.com: java.lang.RuntimeException 
> (java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
> 

[jira] [Commented] (SPARK-14094) When shutdown spark streaming jobs , the MasterUI stop response, RPC Timeout exception occurs

2016-03-23 Thread Simon Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209592#comment-15209592
 ] 

Simon Lee commented on SPARK-14094:
---

Maybe I forgot to configure these parameters: 
spark.ui.retainedJobs
spark.ui.retainedStages
spark.worker.ui.retainedExecutors
spark.worker.ui.retainedDrivers

Or do these parameters have any relation to the problem? There are several 
streaming applications, and each one generates lots of [Jobs] and 
[Stages].
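
For reference, a minimal sketch of how the application-side settings would be applied 
(the values are illustrative, not recommendations; the spark.worker.ui.* settings are 
worker-side and go into the worker's configuration instead):

{code}
import org.apache.spark.SparkConf

// Cap how many jobs/stages the UI retains so long-running streaming apps don't
// accumulate unbounded UI state.
val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "200")
  .set("spark.ui.retainedStages", "500")
{code}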

> When shutdown spark streaming jobs , the MasterUI stop response,  RPC Timeout 
> exception occurs
> --
>
> Key: SPARK-14094
> URL: https://issues.apache.org/jira/browse/SPARK-14094
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1
> Environment: Standalone
>Reporter: Simon Lee
>
> I have several streaming jobs running; after killing them, the Spark master UI 
> stops responding. In the master log file:
> 16/03/23 16:17:50 WARN ui.JettyUtils: GET / failed: 
> java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
> java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at 
> org.apache.spark.deploy.master.ui.MasterPage.getMasterState(MasterPage.scala:40)
>   at 
> org.apache.spark.deploy.master.ui.MasterPage.render(MasterPage.scala:74)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
>   at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:69)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
>   at 
> org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
>   at 
> org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
>   at 
> org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
>   at 
> org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
>   at 
> org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
>   at 
> org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.spark-project.jetty.server.Server.handle(Server.java:370)
>   at 
> org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
>   at 
> org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
>   at 
> org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
>   at 
> org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>   at 
> org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at 
> org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
>   at 
> org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
>   at 
> org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at 
> org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>   at java.lang.Thread.run(Thread.java:745)
> 16/03/23 16:17:50 WARN servlet.ServletHandler: /
> java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at 
> org.apache.spark.deploy.master.ui.MasterPage.getMasterState(MasterPage.scala:40)
>   at 
> 

[jira] [Created] (SPARK-14110) PipedRDD to print the command ran on failure

2016-03-23 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-14110:
---

 Summary: PipedRDD to print the command ran on failure
 Key: SPARK-14110
 URL: https://issues.apache.org/jira/browse/SPARK-14110
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.1
Reporter: Tejas Patil
Priority: Trivial


In case of failure in PipedRDD, the driver says “Subprocess exited with status 
XXX”. Debugging this is not easy for users especially if there are multiple 
pipe() operations in the Spark application. It would be great to see the 
command of the failed sub-process so that one can identify which pipe() failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14109) HDFSMetadataLog throws AbstractFilesSystem exception with common schemes like s3n

2016-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14109:


Assignee: Tathagata Das  (was: Apache Spark)

> HDFSMetadataLog throws AbstractFilesSystem exception with common schemes like 
> s3n
> -
>
> Key: SPARK-14109
> URL: https://issues.apache.org/jira/browse/SPARK-14109
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> HDFSMetadataLog uses the newer FileContext API to achieve atomic renaming. 
> However, FileContext implementations may not exist for many schemes for which 
> there are FileSystem implementations. In those cases, rather than failing 
> completely, we should fall back to the FileSystem-based implementation and 
> log a warning that there may be file consistency issues if the log 
> directory is concurrently modified.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14109) HDFSMetadataLog throws AbstractFilesSystem exception with common schemes like s3n

2016-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209566#comment-15209566
 ] 

Apache Spark commented on SPARK-14109:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/11925

> HDFSMetadataLog throws AbstractFilesSystem exception with common schemes like 
> s3n
> -
>
> Key: SPARK-14109
> URL: https://issues.apache.org/jira/browse/SPARK-14109
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> HDFSMetadataLog uses the newer FileContext API to achieve atomic renaming. 
> However, FileContext implementations may not exist for many schemes for which 
> there are FileSystem implementations. In those cases, rather than failing 
> completely, we should fall back to the FileSystem-based implementation and 
> log a warning that there may be file consistency issues if the log 
> directory is concurrently modified.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14109) HDFSMetadataLog throws AbstractFilesSystem exception with common schemes like s3n

2016-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14109:


Assignee: Apache Spark  (was: Tathagata Das)

> HDFSMetadataLog throws AbstractFilesSystem exception with common schemes like 
> s3n
> -
>
> Key: SPARK-14109
> URL: https://issues.apache.org/jira/browse/SPARK-14109
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> HDFSMetadataLog uses the newer FileContext API to achieve atomic renaming. 
> However, FileContext implementations may not exist for many schemes for which 
> there are FileSystem implementations. In those cases, rather than failing 
> completely, we should fall back to the FileSystem-based implementation and 
> log a warning that there may be file consistency issues if the log 
> directory is concurrently modified.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14109) HDFSMetadataLog throws AbstractFilesSystem exception with common schemes like s3n

2016-03-23 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-14109:
-

 Summary: HDFSMetadataLog throws AbstractFilesSystem exception with 
common schemes like s3n
 Key: SPARK-14109
 URL: https://issues.apache.org/jira/browse/SPARK-14109
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Tathagata Das
Assignee: Tathagata Das


HDFSMetadataLog uses the newer FileContext API to achieve atomic renaming. However, 
FileContext implementations may not exist for many schemes for which there are 
FileSystem implementations. In those cases, rather than failing completely, 
we should fall back to the FileSystem-based implementation and log a warning that 
there may be file consistency issues if the log directory is concurrently 
modified.
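
A simplified sketch of the fallback idea (the surrounding class and proper logging are 
assumed; this is not the actual patch):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileContext, FileSystem, Path, UnsupportedFileSystemException}

// Try the FileContext API first (needed for atomic rename); if the scheme (e.g. s3n) has
// no AbstractFileSystem implementation, fall back to the FileSystem API and warn.
def openFileManager(path: Path, conf: Configuration): Either[FileContext, FileSystem] = {
  try {
    Left(FileContext.getFileContext(path.toUri, conf))
  } catch {
    case _: UnsupportedFileSystemException =>
      println(s"WARN: no FileContext implementation for ${path.toUri.getScheme}; " +
        "falling back to FileSystem. Concurrent writers may corrupt the metadata log.")
      Right(path.getFileSystem(conf))
  }
}
{code}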



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata

2016-03-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209561#comment-15209561
 ] 

Takeshi Yamamuro commented on SPARK-13656:
--

Okay, I'll make a new ticket with a basic idea later.

> Delete spark.sql.parquet.cacheMetadata
> --
>
> Key: SPARK-13656
> URL: https://issues.apache.org/jira/browse/SPARK-13656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete 
> it to avoid any potential confusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9721) TreeTests.checkEqual should compare predictions on data

2016-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9721:
-
Priority: Minor  (was: Major)

> TreeTests.checkEqual should compare predictions on data
> ---
>
> Key: SPARK-9721
> URL: https://issues.apache.org/jira/browse/SPARK-9721
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.ml tree and ensemble unit tests:
> Modify TreeTests.checkEqual in unit tests to compare predictions on the 
> training data, rather than only comparing the trees themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9721) TreeTests.checkEqual should compare predictions on data

2016-03-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209556#comment-15209556
 ] 

Joseph K. Bradley commented on SPARK-9721:
--

This issue came up when looking at the aggregation mechanisms for random 
forests.  The tree data do not include the mechanisms for aggregating 
predictions from multiple trees, so comparing the tree data does not completely 
test functionality.
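
A rough sketch of the kind of check meant here (the helper name and the "prediction" 
column are assumptions about the test setup, not existing code):

{code}
import org.apache.spark.ml.Transformer
import org.apache.spark.sql.DataFrame

// Hypothetical test helper: two models are considered equal only if they produce the same
// predictions on the training data, not merely structurally equal trees.
def checkSamePredictions(a: Transformer, b: Transformer, data: DataFrame): Unit = {
  val predsA = a.transform(data).select("prediction").collect()
  val predsB = b.transform(data).select("prediction").collect()
  assert(predsA.sameElements(predsB),
    "Models gave different predictions on the training data")
}
{code}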

> TreeTests.checkEqual should compare predictions on data
> ---
>
> Key: SPARK-9721
> URL: https://issues.apache.org/jira/browse/SPARK-9721
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Joseph K. Bradley
>
> In spark.ml tree and ensemble unit tests:
> Modify TreeTests.checkEqual in unit tests to compare predictions on the 
> training data, rather than only comparing the trees themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8674) 2-sample, 2-sided Kolmogorov Smirnov Test

2016-03-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209537#comment-15209537
 ] 

José Pablo Cambronero commented on SPARK-8674:
--

Don't mind at all. Thank you for picking this up. I appreciate it.



> 2-sample, 2-sided Kolmogorov Smirnov Test
> -
>
> Key: SPARK-8674
> URL: https://issues.apache.org/jira/browse/SPARK-8674
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Jose Cambronero
>Assignee: Jose Cambronero
>Priority: Minor
>
> We added functionality to calculate a 2-sample, 2-sided Kolmogorov Smirnov 
> test for 2 RDD[Double]. The calculation provides a test for the null 
> hypothesis that both samples come from the same probability distribution. The 
> implementation seeks to minimize the shuffles necessary. 
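
For intuition, a simplified, non-distributed sketch of just the statistic 
D = sup_x |F1(x) - F2(x)|; the actual implementation works on the distributed data and 
tries to minimize shuffles rather than collecting to the driver:

{code}
import org.apache.spark.rdd.RDD

// Illustrative only: collects both samples, which the real version avoids.
def ks2SampleStatistic(x: RDD[Double], y: RDD[Double]): Double = {
  val xs = x.collect().sorted
  val ys = y.collect().sorted
  val (n, m) = (xs.length.toDouble, ys.length.toDouble)
  var (i, j, d) = (0, 0, 0.0)
  while (i < xs.length && j < ys.length) {
    val v = math.min(xs(i), ys(j))
    while (i < xs.length && xs(i) <= v) i += 1   // advance both empirical CDFs past v
    while (j < ys.length && ys(j) <= v) j += 1
    d = math.max(d, math.abs(i / n - j / m))     // gap between the two ECDFs at v
  }
  d
}
{code}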



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2016-03-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14083:

Description: 
One big advantage of the Dataset API is the type safety, at the cost of 
performance due to heavy reliance on user-defined closures/lambdas. These 
closures are typically slower than expressions because we have more flexibility 
to optimize expressions (known data types, no virtual function calls, etc). In 
many cases, it's actually not going to be very difficult to look into the byte 
code of these closures and figure out what they are trying to do. If we can 
understand them, then we can turn them directly into Catalyst expressions for 
more optimized executions.

Some examples are:

{code}
df.map(_.name)  // equivalent to expression col("name")

ds.groupBy(_.gender)  // equivalent to expression col("gender")

df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
lit(18)

df.map(_.id + 1)  // equivalent to Add(col("age"), lit(1))
{code}

The goal of this ticket is to design a small framework for byte code analysis 
and use that to convert closures/lambdas into Catalyst expressions in order to 
speed up Dataset execution. It is a little bit futuristic, but I believe it is 
very doable. The framework should be easy to reason about (e.g. similar to 
Catalyst).

Note the big emphasis on "small" and "easy to reason about". A patch should 
be rejected if it is too complicated or difficult to reason about.




  was:
One big advantage of the Dataset API is the type safety, at the cost of 
performance due to heavy reliance on user-defined closures/lambdas. These 
closures are typically slower than expressions because we can more flexibility 
to optimize expressions (known data types, no virtual function calls, etc). In 
many cases, it's actually not going to be very difficult to look into the byte 
code of these closures and figure out what they are trying to do. If we can 
understand them, then we can turn them directly into Catalyst expressions for 
more optimized executions.

Some examples are:

{code}
df.map(_.name)  // equivalent to expression col("name")

ds.groupBy(_.gender)  // equivalent to expression col("gender")

df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
lit(18)

df.map(_.id + 1)  // equivalent to Add(col("age"), lit(1))
{code}

The goal of this ticket is to design a small framework for byte code analysis 
and use that to convert closures/lambdas into Catalyst expressions in order to 
speed up Dataset execution. It is a little bit futuristic, but I believe it is 
very doable. The framework should be easy to reason about (e.g. similar to 
Catalyst).

Note that a big emphasis on "small" and "easy to reason about". A patch should 
be rejected if it is too complicated or difficult to reason about.





> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
> lit(18)
> df.map(_.id + 1)  // equivalent to Add(col("age"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note the big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.
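
A deliberately toy sketch of the last step only: the mini expression tree below is 
invented for illustration (it is not Spark's IR), and the hard part, recovering such a 
tree from JVM bytecode, is elided. It only shows that once a closure body is understood, 
mapping it onto Column/Catalyst expressions is mechanical.

{code}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

// Toy IR standing in for "what the bytecode analysis decided the closure does".
sealed trait ClosureExpr
case class GetField(name: String) extends ClosureExpr
case class Const(value: Int) extends ClosureExpr
case class Gt(l: ClosureExpr, r: ClosureExpr) extends ClosureExpr
case class Plus(l: ClosureExpr, r: ClosureExpr) extends ClosureExpr

def toColumn(e: ClosureExpr): Column = e match {
  case GetField(n) => col(n)                     // _.name     -> col("name")
  case Const(v)    => lit(v)
  case Gt(l, r)    => toColumn(l) > toColumn(r)  // _.age > 18 -> col("age") > lit(18)
  case Plus(l, r)  => toColumn(l) + toColumn(r)  // _.id + 1   -> col("id") + lit(1)
}
{code}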



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8674) 2-sample, 2-sided Kolmogorov Smirnov Test

2016-03-23 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209523#comment-15209523
 ] 

yuhao yang commented on SPARK-8674:
---

Hi [~josepablocam], I've sent a PR for the Anderson-Darling test: 
https://github.com/apache/spark/pull/11780. And if you don't mind, I'd like to 
continue the work here as well, so we can finish both in 2.0.

> 2-sample, 2-sided Kolmogorov Smirnov Test
> -
>
> Key: SPARK-8674
> URL: https://issues.apache.org/jira/browse/SPARK-8674
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Jose Cambronero
>Assignee: Jose Cambronero
>Priority: Minor
>
> We added functionality to calculate a 2-sample, 2-sided Kolmogorov Smirnov 
> test for 2 RDD[Double]. The calculation provides a test for the null 
> hypothesis that both samples come from the same probability distribution. The 
> implementation seeks to minimize the shuffles necessary. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13383) Keep broadcast hint after column pruning

2016-03-23 Thread Yong Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209519#comment-15209519
 ] 

Yong Zhang commented on SPARK-13383:


Will this fix be backported to the 1.5.x or 1.6.x branches?

> Keep broadcast hint after column pruning
> 
>
> Key: SPARK-13383
> URL: https://issues.apache.org/jira/browse/SPARK-13383
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> When we do column pruning in the Optimizer, we put an additional Project on top of 
> the logical plan. However, when we have already wrapped a BroadcastHint around the 
> logical plan, the added Project will hide the BroadcastHint from later execution.
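
For context, a minimal example (the DataFrame and column names are made up) of the 
pattern affected: the hint is applied to the small side, and column pruning later inserts 
a Project that, before this fix, could hide the BroadcastHint from join planning.

{code}
import org.apache.spark.sql.functions.broadcast

// Hypothetical DataFrames: smallDf has columns (key, value, ...), largeDf has (key, ...).
// Only a subset of smallDf's columns survives, so the optimizer adds a Project for pruning.
val joined = largeDf.join(broadcast(smallDf), "key").select("key", "value")
{code}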



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14040) Null-safe and equality join produces incorrect result with filtered dataframe

2016-03-23 Thread Sunitha Kambhampati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209518#comment-15209518
 ] 

Sunitha Kambhampati commented on SPARK-14040:
-

PR for SPARK-13801 is trying to resolve the general issue behind the resolution 
of the columns.  I think this issue will be resolved once those changes are in. 
 

> Null-safe and equality join produces incorrect result with filtered dataframe
> -
>
> Key: SPARK-14040
> URL: https://issues.apache.org/jira/browse/SPARK-14040
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Ubuntu Linux 15.10
>Reporter: Denton Cockburn
>
> Initial issue reported here: 
> http://stackoverflow.com/questions/36131942/spark-join-produces-wrong-results
>   val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
>   val a = b.where("c = 1").withColumnRenamed("a", 
> "filta").withColumnRenamed("b", "filtb")
>   a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> 
> b("c"), "left_outer").show
> Produces 2 rows instead of the expected 1.
>   a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === 
> $"b" and $"newc" === b("c"), "left_outer").show
> Also produces 2 rows instead of the expected 1.
> The only one that seemed to work correctly was:
>   a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === 
> b("c"), "left_outer").show
> But that produced a warning for :  
>   WARN Column: Constructing trivially true equals predicate, 'c#18232 = 
> c#18232' 
> As pointed out by commenter zero323:
> "The second behavior looks indeed like a bug related to the fact that you 
> still have a.c in your data. It looks like it is picked downstream before b.c 
> and the evaluated condition is actually a.newc = a.c"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14040) Null-safe and equality join produces incorrect result with filtered dataframe

2016-03-23 Thread Sunitha Kambhampati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209503#comment-15209503
 ] 

Sunitha Kambhampati edited comment on SPARK-14040 at 3/24/16 12:48 AM:
---

Here are my notes on the investigation. 
{noformat}
A smaller repro: 

  test("test nullsafe3 - Wrong results 14040") {
val b1 = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a1 = b1.where("c = 1")
a1.printSchema()
a1.join(b1, a1("c") <=> b1("c"), "left_outer").explain(true)
a1.join(b1, a1("c") <=> b1("c"), "left_outer").show()
  }
{noformat}  

* The ParsedLogicalPlan resolves the columns in the join condition to the *same* 
attribute, causing the issue.

* Note, if you use the === in the join condition, the results are correct 
because there is special casing logic in join in the DataSet to reanalyze and 
thus it avoids the problem.

* One way to fix this is to add the special case for the EqualNullSafe in join 
in DataSet.  I have added it and the testcase works fine. 
* But that said, there is a general fundamental problem that the resolution of 
the column is incorrect when the column names are same.  


was (Author: ksunitha):
Here are my notes on the investigation. 
{noformat}
A smaller repro: 

  test("test nullsafe3 - Wrong results 14040") {
val b1 = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a1 = b1.where("c = 1")
a1.printSchema()
a1.join(b1, a1("c") <=> b1("c"), "left_outer").explain(true)
a1.join(b1, a1("c") <=> b1("c"), "left_outer").show()
  }

The ParsedLogicalPlan resolves the column in the join condition to the same 
causing the issue.

Note, if you use the === , the results are correct because there is special 
casing logic in join in the DataSet to reanalyze and thus it avoids the problem.

One way to fix this is to add the special case for the EqualNullSafe in join in 
DataSet.  I have added it and the testcase works fine. But that said, there is 
a general fundamental problem that the resolution of the column is incorrect 
when the column names are same.  

{noformat}  

> Null-safe and equality join produces incorrect result with filtered dataframe
> -
>
> Key: SPARK-14040
> URL: https://issues.apache.org/jira/browse/SPARK-14040
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Ubuntu Linux 15.10
>Reporter: Denton Cockburn
>
> Initial issue reported here: 
> http://stackoverflow.com/questions/36131942/spark-join-produces-wrong-results
>   val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
>   val a = b.where("c = 1").withColumnRenamed("a", 
> "filta").withColumnRenamed("b", "filtb")
>   a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> 
> b("c"), "left_outer").show
> Produces 2 rows instead of the expected 1.
>   a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === 
> $"b" and $"newc" === b("c"), "left_outer").show
> Also produces 2 rows instead of the expected 1.
> The only one that seemed to work correctly was:
>   a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === 
> b("c"), "left_outer").show
> But that produced a warning for :  
>   WARN Column: Constructing trivially true equals predicate, 'c#18232 = 
> c#18232' 
> As pointed out by commenter zero323:
> "The second behavior looks indeed like a bug related to the fact that you 
> still have a.c in your data. It looks like it is picked downstream before b.c 
> and the evaluated condition is actually a.newc = a.c"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14040) Null-safe and equality join produces incorrect result with filtered dataframe

2016-03-23 Thread Sunitha Kambhampati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209503#comment-15209503
 ] 

Sunitha Kambhampati edited comment on SPARK-14040 at 3/24/16 12:44 AM:
---

Here are my notes on the investigation. 
{noformat}
A smaller repro: 

  test("test nullsafe3 - Wrong results 14040") {
val b1 = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a1 = b1.where("c = 1")
a1.printSchema()
a1.join(b1, a1("c") <=> b1("c"), "left_outer").explain(true)
a1.join(b1, a1("c") <=> b1("c"), "left_outer").show()
  }

The ParsedLogicalPlan resolves the column in the join condition to the same 
causing the issue.

Note, if you use the === , the results are correct because there is special 
casing logic in join in the DataSet to reanalyze and thus it avoids the problem.

One way to fix this is to add the special case for the EqualNullSafe in join in 
DataSet.  I have added it and the testcase works fine. But that said, there is 
a general fundamental problem that the resolution of the column is incorrect 
when the column names are same.  

{noformat}  


was (Author: ksunitha):
Here are my notes on the investigation. 

A smaller repro: 
  test("test nullsafe3 - Wrong results 14040") {
val b1 = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a1 = b1.where("c = 1")
a1.printSchema()
a1.join(b1, a1("c") <=> b1("c"), "left_outer").explain(true)
a1.join(b1, a1("c") <=> b1("c"), "left_outer").show()
  }

The ParsedLogicalPlan resolves the column in the join condition to the same 
causing the issue.

Note, if you use the === , the results are correct because there is special 
casing logic in join in the DataSet to reanalyze and thus it avoids the problem.

One way to fix this is to add the special case for the EqualNullSafe in join in 
DataSet.  I have added it and the testcase works fine. But that said, there is 
a general fundamental problem that the resolution of the column is incorrect 
when the column names are same.

> Null-safe and equality join produces incorrect result with filtered dataframe
> -
>
> Key: SPARK-14040
> URL: https://issues.apache.org/jira/browse/SPARK-14040
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Ubuntu Linux 15.10
>Reporter: Denton Cockburn
>
> Initial issue reported here: 
> http://stackoverflow.com/questions/36131942/spark-join-produces-wrong-results
>   val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
>   val a = b.where("c = 1").withColumnRenamed("a", 
> "filta").withColumnRenamed("b", "filtb")
>   a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> 
> b("c"), "left_outer").show
> Produces 2 rows instead of the expected 1.
>   a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === 
> $"b" and $"newc" === b("c"), "left_outer").show
> Also produces 2 rows instead of the expected 1.
> The only one that seemed to work correctly was:
>   a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === 
> b("c"), "left_outer").show
> But that produced a warning for :  
>   WARN Column: Constructing trivially true equals predicate, 'c#18232 = 
> c#18232' 
> As pointed out by commenter zero323:
> "The second behavior looks indeed like a bug related to the fact that you 
> still have a.c in your data. It looks like it is picked downstream before b.c 
> and the evaluated condition is actually a.newc = a.c"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14040) Null-safe and equality join produces incorrect result with filtered dataframe

2016-03-23 Thread Sunitha Kambhampati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209503#comment-15209503
 ] 

Sunitha Kambhampati commented on SPARK-14040:
-

Here are my notes on the investigation. 

A smaller repro: 
  test("test nullsafe3 - Wrong results 14040") {
val b1 = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a1 = b1.where("c = 1")
a1.printSchema()
a1.join(b1, a1("c") <=> b1("c"), "left_outer").explain(true)
a1.join(b1, a1("c") <=> b1("c"), "left_outer").show()
  }

The ParsedLogicalPlan resolves the column in the join condition to the same 
causing the issue.

Note, if you use the === , the results are correct because there is special 
casing logic in join in the DataSet to reanalyze and thus it avoids the problem.

One way to fix this is to add the special case for the EqualNullSafe in join in 
DataSet.  I have added it and the testcase works fine. But that said, there is 
a general fundamental problem that the resolution of the column is incorrect 
when the column names are same.

> Null-safe and equality join produces incorrect result with filtered dataframe
> -
>
> Key: SPARK-14040
> URL: https://issues.apache.org/jira/browse/SPARK-14040
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Ubuntu Linux 15.10
>Reporter: Denton Cockburn
>
> Initial issue reported here: 
> http://stackoverflow.com/questions/36131942/spark-join-produces-wrong-results
>   val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
>   val a = b.where("c = 1").withColumnRenamed("a", 
> "filta").withColumnRenamed("b", "filtb")
>   a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> 
> b("c"), "left_outer").show
> Produces 2 rows instead of the expected 1.
>   a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === 
> $"b" and $"newc" === b("c"), "left_outer").show
> Also produces 2 rows instead of the expected 1.
> The only one that seemed to work correctly was:
>   a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === 
> b("c"), "left_outer").show
> But that produced a warning for :  
>   WARN Column: Constructing trivially true equals predicate, 'c#18232 = 
> c#18232' 
> As pointed out by commenter zero323:
> "The second behavior looks indeed like a bug related to the fact that you 
> still have a.c in your data. It looks like it is picked downstream before b.c 
> and the evaluated condition is actually a.newc = a.c"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2016-03-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209486#comment-15209486
 ] 

Reynold Xin commented on SPARK-12148:
-

Let's just do the renaming. Thanks!


> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Michael Lawrence
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13710) Spark shell shows ERROR when launching on Windows

2016-03-23 Thread Michel Lemay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209484#comment-15209484
 ] 

Michel Lemay commented on SPARK-13710:
--

Spark's coordinates for Curator did not change: it still depends on Curator 
2.4.0, which in turn depends on ZooKeeper and jline 0.9.94.

The problem is in Scala itself. It went from a local copy of the jline sources in 
2.10.x: https://github.com/scala/scala/tree/2.10.x/src/jline

to a standard Maven dependency in 2.11.x:
https://github.com/scala/scala/blob/2.11.x/build.sbt#L74

I'm not sure exactly how it plays out at runtime, but I suspect it is 
something along these lines:
- the Spark jar is loaded in the JVM
- the Scala runtime is loaded in the JVM, but its jline resources are discarded 
because one copy is already there (this wasn't happening in Scala 2.10.x, since 
the resources were named or located somewhere else)
- Spark tries to initialize the Scala REPL, which fails because it loads the 
wrong jline/jansi

Since I'm no expert in JVM classloading and the like, your guess is probably 
better than mine.
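For what it's worth, a quick way to see which jline/jansi actually wins at runtime 
is to ask the JVM where it loaded the classes from. A small sketch for the 
spark-shell (the class names below are my guesses at the relevant ones, not 
something verified against this setup):

{code}
// Prints the jar each class was loaded from, "<unknown>" when the code source
// is unavailable (e.g. bootstrap classes), or "<not on classpath>" if missing.
def whereIs(className: String): String =
  try {
    val src = Class.forName(className).getProtectionDomain.getCodeSource
    if (src == null) "<unknown>" else src.getLocation.toString
  } catch {
    case _: ClassNotFoundException => "<not on classpath>"
  }

Seq("jline.Terminal", "org.fusesource.jansi.AnsiConsole").foreach { c =>
  println(s"$c -> ${whereIs(c)}")
}
{code}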

> Spark shell shows ERROR when launching on Windows
> -
>
> Key: SPARK-13710
> URL: https://issues.apache.org/jira/browse/SPARK-13710
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message 
> and stacktrace.
> {noformat}
> C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell
> [ERROR] Terminal initialization failed; falling back to unsupported
> java.lang.NoClassDefFoundError: Could not initialize class 
> scala.tools.fusesource_embedded.jansi.internal.Kernel32
> at 
> scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)
> at 
> scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204)
> at 
> scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82)
> at 
> scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101)
> at 
> scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:229)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:221)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:209)
> at 
> scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.(JLineReader.scala:61)
> at 
> scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.(JLineReader.scala:33)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862)
> at 
> scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
> at scala.collection.immutable.Stream.collect(Stream.scala:435)
> at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> 

[jira] [Updated] (SPARK-11892) Model export/import for spark.ml: OneVsRest

2016-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11892:
--
Shepherd:   (was: Joseph K. Bradley)

> Model export/import for spark.ml: OneVsRest
> ---
>
> Key: SPARK-11892
> URL: https://issues.apache.org/jira/browse/SPARK-11892
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Implement read/write for OneVsRest estimator and its model.
> When this is implemented, {{CrossValidatorReader.getUidMap}} should be 
> updated as needed.
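Once this is in, usage should presumably follow the same MLWritable/MLReadable 
pattern as the other spark.ml estimators. A rough sketch of the target API (paths 
are placeholders; in 2.0 the ml vector type moves to org.apache.spark.ml.linalg):

{code}
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest, OneVsRestModel}
import org.apache.spark.mllib.linalg.Vectors

// Tiny placeholder training set with the usual label/features columns.
val training = sqlContext.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.1)),
  (1.0, Vectors.dense(2.0, 1.0)),
  (2.0, Vectors.dense(2.0, 1.3))
)).toDF("label", "features")

val ovr   = new OneVsRest().setClassifier(new LogisticRegression().setMaxIter(10))
val model = ovr.fit(training)

// The write/load calls below are exactly what this ticket adds.
ovr.write.overwrite().save("/tmp/ovr")
model.write.overwrite().save("/tmp/ovrModel")
val sameModel = OneVsRestModel.load("/tmp/ovrModel")
{code}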



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11893) Model export/import for spark.ml: TrainValidationSplit

2016-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11893:
--
Shepherd: Joseph K. Bradley
Assignee: Xusen Yin
Target Version/s: 2.0.0  (was: )

> Model export/import for spark.ml: TrainValidationSplit
> --
>
> Key: SPARK-11893
> URL: https://issues.apache.org/jira/browse/SPARK-11893
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Implement save/write for TrainValidationSplit, similar to CrossValidator.
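Presumably the end state mirrors CrossValidator's persistence. A sketch of what 
saving and re-loading the estimator would look like once this is in (the fitted 
TrainValidationSplitModel would get the same treatment via model.write / 
TrainValidationSplitModel.load):

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val lr = new LinearRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.fitIntercept)
  .build()

val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)

// The save/load pair is what this ticket adds.
tvs.write.overwrite().save("/tmp/tvs")
val restored = TrainValidationSplit.load("/tmp/tvs")
{code}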



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14105) Serialization issue for KafkaRDD

2016-03-23 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated SPARK-14105:
---
Affects Version/s: 1.6.1

> Serialization issue for KafkaRDD
> 
>
> Key: SPARK-14105
> URL: https://issues.apache.org/jira/browse/SPARK-14105
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2, 1.6.1
>Reporter: Liyin Tang
> Attachments: Screenshot 2016-03-23 09.04.51.png
>
>
> When using DISK or MEMORY to persist the KafkaDirectInputStream, Spark 
> serializes the FetchResponse into blocks. The FetchResponse contains the 
> ByteBufferMessageSet, where each Kafka Message is just one slice of the 
> underlying ByteBuffer. 
> When serializing the KafkaRDDIterator, it seems like the entire underlying 
> ByteBuffer in the ByteBufferMessageSet is serialized for each and every 
> message. This easily makes the block size exceed 2G, and leads to 
> "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" or 
> "FileChannelImpl.map -> exceeds Integer.MAX_VALUE:"
> The consumer fetch size is the default value (1M). I tried to reduce the fetch 
> size, but that causes other errors like errRanOutBeforeEnd. 
> Here are the exceptions I got for both memory and disk persistence.
> Memory Persistent:
> 16/03/23 15:34:44 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-9,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at 
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
> at 
> org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
> at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
> at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> Disk Persistent: 
> 16/03/23 17:04:00 INFO TaskSetManager: Lost task 42.1 in stage 2.0 (TID 974) 
> on executor i-2878ceb3.inst.aws.airbnb.com: java.lang.RuntimeException 
> (java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206)
> at 
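(Aside on the mechanism described above: a ByteBuffer slice shares the backing 
array of its parent buffer, so anything that serializes the backing array pays for 
the whole fetch buffer rather than the single message, while copying the slice into 
its own array detaches it. A tiny self-contained illustration in plain java.nio, 
not Spark code:)

{code}
import java.nio.ByteBuffer

// A 16 MB "fetch buffer" standing in for the FetchResponse payload.
val fetch = ByteBuffer.allocate(16 * 1024 * 1024)

// A 100-byte "message" that is just a slice of the big buffer.
fetch.position(1024)
fetch.limit(1024 + 100)
val message = fetch.slice()

println(message.remaining())      // 100
println(message.array().length)   // 16777216 -- what naive serialization drags along

// Copying just the payload detaches it from the shared fetch buffer.
val copy = new Array[Byte](message.remaining())
message.duplicate().get(copy)
println(copy.length)              // 100
{code}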

[jira] [Closed] (SPARK-14099) Cleanup shuffles in ML version of ALS

2016-03-23 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-14099.
---
Resolution: Duplicate

> Cleanup shuffles in ML version of ALS
> -
>
> Key: SPARK-14099
> URL: https://issues.apache.org/jira/browse/SPARK-14099
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: holdenk
>
> See parent for details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14105) Serialization issue for KafkaRDD

2016-03-23 Thread Kevin Long (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209395#comment-15209395
 ] 

Kevin Long commented on SPARK-14105:


Tested on 1.6.1 and the issue is also there.

> Serialization issue for KafkaRDD
> 
>
> Key: SPARK-14105
> URL: https://issues.apache.org/jira/browse/SPARK-14105
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
> Attachments: Screenshot 2016-03-23 09.04.51.png
>
>
> When using DISK or MEMORY to persist the KafkaDirectInputStream, Spark 
> serializes the FetchResponse into blocks. The FetchResponse contains the 
> ByteBufferMessageSet, where each Kafka Message is just one slice of the 
> underlying ByteBuffer. 
> When serializing the KafkaRDDIterator, it seems like the entire underlying 
> ByteBuffer in the ByteBufferMessageSet is serialized for each and every 
> message. This easily makes the block size exceed 2G, and leads to 
> "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" or 
> "FileChannelImpl.map -> exceeds Integer.MAX_VALUE:"
> The consumer fetch size is the default value (1M). I tried to reduce the fetch 
> size, but that causes other errors like errRanOutBeforeEnd. 
> Here are the exceptions I got for both memory and disk persistence.
> Memory Persistent:
> 16/03/23 15:34:44 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-9,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at 
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
> at 
> org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
> at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
> at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> Disk Persistent: 
> 16/03/23 17:04:00 INFO TaskSetManager: Lost task 42.1 in stage 2.0 (TID 974) 
> on executor i-2878ceb3.inst.aws.airbnb.com: java.lang.RuntimeException 
> (java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
> at 

[jira] [Updated] (SPARK-13874) Move docs of streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages

2016-03-23 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13874:
-
Summary: Move docs of streaming-mqtt, streaming-zeromq, streaming-akka, 
streaming-twitter to Spark packages  (was: Move docs of streaming-flume, 
streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark 
packages)

> Move docs of streaming-mqtt, streaming-zeromq, streaming-akka, 
> streaming-twitter to Spark packages
> --
>
> Key: SPARK-13874
> URL: https://issues.apache.org/jira/browse/SPARK-13874
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Shixiong Zhu
>
> Need to move docs for streaming-mqtt, streaming-zeromq, streaming-akka, 
> streaming-twitter to spark packages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14107) PySpark spark.ml GBT algs need seed Param

2016-03-23 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209327#comment-15209327
 ] 

Seth Hendrickson commented on SPARK-14107:
--

I can work on this.

> PySpark spark.ml GBT algs need seed Param
> -
>
> Key: SPARK-14107
> URL: https://issues.apache.org/jira/browse/SPARK-14107
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> [SPARK-13952] made GBTClassifier, Regressor use the seed Param.  We need to 
> modify Python slightly so users can provide it as a named argument to the 
> class constructors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14014) Replace existing analysis.Catalog with SessionCatalog

2016-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209322#comment-15209322
 ] 

Apache Spark commented on SPARK-14014:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/11923

> Replace existing analysis.Catalog with SessionCatalog
> -
>
> Key: SPARK-14014
> URL: https://issues.apache.org/jira/browse/SPARK-14014
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> As of this moment, there exist many catalogs in Spark. For Spark 2.0, we will 
> have two high level catalogs only: SessionCatalog and ExternalCatalog. 
> SessionCatalog (implemented in SPARK-13923) keeps track of temporary 
> functions and tables and delegates other operations to ExternalCatalog.
> At the same time, there's this legacy catalog called `analysis.Catalog` that 
> also tracks temporary functions and tables. The goal is to get rid of this 
> legacy catalog and replace it with SessionCatalog, which is the new thing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13923) Implement SessionCatalog to manage temp functions and tables

2016-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209321#comment-15209321
 ] 

Apache Spark commented on SPARK-13923:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/11923

> Implement SessionCatalog to manage temp functions and tables
> 
>
> Key: SPARK-13923
> URL: https://issues.apache.org/jira/browse/SPARK-13923
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> Today, we have ExternalCatalog, which is dead code. As part of the effort of 
> merging SQLContext/HiveContext we'll parse Hive commands and call the 
> corresponding methods in ExternalCatalog.
> However, this handles only persisted things. We need something in addition to 
> that to handle temporary things. The new catalog is called SessionCatalog and 
> will internally call ExternalCatalog.
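A minimal, self-contained sketch of the split described above, with toy types only 
(the real SessionCatalog/ExternalCatalog interfaces live in the SPARK-13923 patch, 
not here): temporary tables sit in a session-level map and shadow persisted ones, 
and everything else is delegated.

{code}
// Toy illustration of the delegation -- not Spark's actual API.
trait ExternalCatalog {
  def getTable(db: String, table: String): String   // stand-in for a table definition
}

class SessionCatalog(external: ExternalCatalog) {
  private val tempTables = scala.collection.mutable.Map.empty[String, String]

  def createTempTable(name: String, plan: String): Unit =
    tempTables(name) = plan

  // Temp tables shadow persisted ones; persisted lookups are delegated.
  def lookupTable(name: String, db: String = "default"): String =
    tempTables.getOrElse(name, external.getTable(db, name))
}
{code}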



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14108) calling count() on empty dataframe throws java.util.NoSuchElementException

2016-03-23 Thread Krishna Shekhram (JIRA)
Krishna Shekhram created SPARK-14108:


 Summary: calling count() on empty dataframe throws 
java.util.NoSuchElementException
 Key: SPARK-14108
 URL: https://issues.apache.org/jira/browse/SPARK-14108
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
 Environment: Tested in Hadoop 2.7.2 EMR 4.x
Reporter: Krishna Shekhram
Priority: Minor


When calling count() on an empty DataFrame, the Spark code still tries to iterate 
through an empty iterator and throws java.util.NoSuchElementException.

Stacktrace :
java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at 
scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
at scala.collection.IterableLike$class.head(IterableLike.scala:91)
at 
scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108)
at 
scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108)
at 
org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1515)
at 
org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1514)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1514)

Code snippet -- this fails:
{code}
if (this.df != null) {
    long countOfRows = this.df.count();
}
{code}

It works if I first check that the underlying RDD is non-empty:
{code}
if (this.df != null && !this.df.rdd().isEmpty()) {
    long countOfRows = this.df.count();
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14025) Fix streaming job descriptions on the event line

2016-03-23 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-14025.
--
   Resolution: Fixed
 Assignee: Liwei Lin
Fix Version/s: 2.0.0

> Fix streaming job descriptions on the event line
> 
>
> Key: SPARK-14025
> URL: https://issues.apache.org/jira/browse/SPARK-14025
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming, Web UI
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Assignee: Liwei Lin
>Priority: Minor
> Fix For: 2.0.0
>
>
> Removed the extra `...` for each streaming job's description 
> on the event line.
> [Before]
> !https://cloud.githubusercontent.com/assets/15843379/13898653/0a6c1838-ee13-11e5-9761-14bb7b114c13.png!
> [After]
> !https://cloud.githubusercontent.com/assets/15843379/13898650/012b8808-ee13-11e5-92a6-64aff0799c83.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7992) Hide private classes/objects in in generated Java API doc

2016-03-23 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209280#comment-15209280
 ] 

Jakob Odersky edited comment on SPARK-7992 at 3/23/16 10:16 PM:


Hey [~mengxr],
you caught me in a very busy time last week and I'm afraid to say that I 
completely forgot about this. I just took up the issue. Take a look at my 
comment on the PR thread https://github.com/typesafehub/genjavadoc/pull/47. 


was (Author: jodersky):
Hey Xiangrui,
you caught me in a very busy time last week and I'm afraid to say that I 
completely forgot about this. I just took up the issue. Take a look at my 
comment on the PR thread https://github.com/typesafehub/genjavadoc/pull/47. 

> Hide private classes/objects in in generated Java API doc
> -
>
> Key: SPARK-7992
> URL: https://issues.apache.org/jira/browse/SPARK-7992
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>
> After SPARK-5610, we found that private classes/objects still show up in the 
> generated Java API doc, e.g., under `org.apache.spark.api.r` we can see
> {code}
> BaseRRDD
> PairwiseRRDD
> RRDD
> SpecialLengths
> StringRRDD
> {code}
> We should update genjavadoc to hide those private classes/methods. The best 
> approach is to find a good mapping from Scala private to Java, and merge it 
> into the main genjavadoc repo. A WIP PR is at 
> https://github.com/typesafehub/genjavadoc/pull/47.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7992) Hide private classes/objects in in generated Java API doc

2016-03-23 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209280#comment-15209280
 ] 

Jakob Odersky commented on SPARK-7992:
--

Hey Xiangrui,
you caught me in a very busy time last week and I'm afraid to say that I 
completely forgot about this. I just took up the issue. Take a look at my 
comment on the PR thread https://github.com/typesafehub/genjavadoc/pull/47. 

> Hide private classes/objects in in generated Java API doc
> -
>
> Key: SPARK-7992
> URL: https://issues.apache.org/jira/browse/SPARK-7992
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>
> After SPARK-5610, we found that private classes/objects still show up in the 
> generated Java API doc, e.g., under `org.apache.spark.api.r` we can see
> {code}
> BaseRRDD
> PairwiseRRDD
> RRDD
> SpecialLengths
> StringRRDD
> {code}
> We should update genjavadoc to hide those private classes/methods. The best 
> approach is to find a good mapping from Scala private to Java, and merge it 
> into the main genjavadoc repo. A WIP PR is at 
> https://github.com/typesafehub/genjavadoc/pull/47.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14107) PySpark spark.ml GBT algs need seed Param

2016-03-23 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14107:
-

 Summary: PySpark spark.ml GBT algs need seed Param
 Key: SPARK-14107
 URL: https://issues.apache.org/jira/browse/SPARK-14107
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Joseph K. Bradley
Priority: Minor


[SPARK-13952] made GBTClassifier, Regressor use the seed Param.  We need to 
modify Python slightly so users can provide it as a named argument to the class 
constructors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13952) spark.ml GBT algs need to use random seed

2016-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-13952.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11903
[https://github.com/apache/spark/pull/11903]

> spark.ml GBT algs need to use random seed
> -
>
> Key: SPARK-13952
> URL: https://issues.apache.org/jira/browse/SPARK-13952
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.0.0
>
>
> SPARK-12379 copied the GBT implementation from spark.mllib to spark.ml.  
> There was one bug I found: The random seed is not used.  A reasonable fix 
> will be to use the original seed to generate a new seed for each tree trained.
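Conceptually the fix described above is just deriving a deterministic per-tree seed 
from the user-supplied one; a hedged sketch of that idea (illustrative only, not the 
code from PR 11903):

{code}
import scala.util.Random

// Derive one reproducible seed per boosting iteration from the user's seed.
def treeSeeds(baseSeed: Long, numTrees: Int): Array[Long] = {
  val rng = new Random(baseSeed)
  Array.fill(numTrees)(rng.nextLong())
}

// Same base seed -> same per-tree seeds -> reproducible ensembles.
assert(treeSeeds(42L, 5).sameElements(treeSeeds(42L, 5)))
{code}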



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12818) Implement Bloom filter and count-min sketch in DataFrames

2016-03-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209262#comment-15209262
 ] 

Reynold Xin commented on SPARK-12818:
-

I think we have implemented the common ones that make sense. The most useful 
thing now would be to actually try using them; I'm sure you can find bugs or 
missing features, and that can feed back into the roadmap.

> Implement Bloom filter and count-min sketch in DataFrames
> -
>
> Key: SPARK-12818
> URL: https://issues.apache.org/jira/browse/SPARK-12818
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
> Attachments: BloomFilterandCount-MinSketchinSpark2.0.pdf
>
>
> This ticket tracks implementing Bloom filter and count-min sketch support in 
> DataFrames. Please see the attached design doc for more information.
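For anyone who wants to kick the tires as suggested above, the DataFrame-side entry 
points hang off df.stat; a short sketch (argument order as I understand the 2.0 API, 
so double-check against the Scaladoc before relying on it):

{code}
val df = sqlContext.range(0, 100000).toDF("id")

// Bloom filter over a column: expected item count and target false-positive rate.
val bf = df.stat.bloomFilter("id", 100000L, 0.03)
println(bf.mightContain(42L))    // true
println(bf.mightContain(-1L))    // usually false (subject to the 3% FPP)

// Count-min sketch over a column: relative error, confidence, seed.
val cms = df.stat.countMinSketch("id", 0.001, 0.99, 1)
println(cms.estimateCount(42L))  // ~1
{code}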



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks

2016-03-23 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14041:
--
Description: 
Please go through the current example code and list possible duplicates.

To find out all examples of ml/mllib that don't contain "example on": 
{code}grep -rL "example on" /path/to/ml-or-mllib/examples{code}

Duplicates need to be deleted:

* scala/ml
  
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib

** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

* java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml

** None

* python/mllib

** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py

  was:
Please go through the current example code and list possible duplicates.

Duplicates need to be deleted:

* scala/ml
  
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib

** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

* java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml

** None

* python/mllib

** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py


> Locate possible duplicates and group them into subtasks
> ---
>
> Key: SPARK-14041
> URL: https://issues.apache.org/jira/browse/SPARK-14041
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Please go through the current example code and list possible duplicates.
> To find out all examples of ml/mllib that don't contain "example on": 
> {code}grep -rL "example on" /path/to/ml-or-mllib/examples{code}
> Duplicates need to be deleted:
> * scala/ml
>   
> ** CrossValidatorExample.scala
> ** DecisionTreeExample.scala
> ** GBTExample.scala
> ** LinearRegressionExample.scala
> ** LogisticRegressionExample.scala
> ** RandomForestExample.scala
> ** TrainValidationSplitExample.scala
> * scala/mllib
> 
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * python/ml
> ** None
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14105) Serialization issue for KafkaRDD

2016-03-23 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated SPARK-14105:
---
Attachment: Screenshot 2016-03-23 09.04.51.png

Attached a screenshot of the KafkaRDD serialization.

> Serialization issue for KafkaRDD
> 
>
> Key: SPARK-14105
> URL: https://issues.apache.org/jira/browse/SPARK-14105
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
> Attachments: Screenshot 2016-03-23 09.04.51.png
>
>
> When using DISK or MEMORY to persist the KafkaDirectInputStream, Spark 
> serializes the FetchResponse into blocks. The FetchResponse contains the 
> ByteBufferMessageSet, where each Kafka Message is just one slice of the 
> underlying ByteBuffer. 
> When serializing the KafkaRDDIterator, it seems like the entire underlying 
> ByteBuffer in the ByteBufferMessageSet is serialized for each and every 
> message. This easily makes the block size exceed 2G, and leads to 
> "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" or 
> "FileChannelImpl.map -> exceeds Integer.MAX_VALUE:"
> The consumer fetch size is the default value (1M). I tried to reduce the fetch 
> size, but that causes other errors like errRanOutBeforeEnd. 
> Here are the exceptions I got for both memory and disk persistence.
> Memory Persistent:
> 16/03/23 15:34:44 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-9,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at 
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
> at 
> org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
> at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
> at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> Disk Persistent: 
> 16/03/23 17:04:00 INFO TaskSetManager: Lost task 42.1 in stage 2.0 (TID 974) 
> on executor i-2878ceb3.inst.aws.airbnb.com: java.lang.RuntimeException 
> (java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
> at 

[jira] [Updated] (SPARK-14105) Serialization issue for KafkaRDD

2016-03-23 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated SPARK-14105:
---
Component/s: (was: Spark Core)

> Serialization issue for KafkaRDD
> 
>
> Key: SPARK-14105
> URL: https://issues.apache.org/jira/browse/SPARK-14105
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
>
> When using DISK or MEMORY to persist the KafkaDirectInputStream, Spark 
> serializes the FetchResponse into blocks. The FetchResponse contains the 
> ByteBufferMessageSet, where each Kafka Message is just one slice of the 
> underlying ByteBuffer. 
> When serializing the KafkaRDDIterator, it seems like the entire underlying 
> ByteBuffer in the ByteBufferMessageSet is serialized for each and every 
> message. This easily makes the block size exceed 2G, and leads to 
> "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" or 
> "FileChannelImpl.map -> exceeds Integer.MAX_VALUE:"
> The consumer fetch size is the default value (1M). I tried to reduce the fetch 
> size, but that causes other errors like errRanOutBeforeEnd. 
> Here are the exceptions I got for both memory and disk persistence.
> Memory Persistent:
> 16/03/23 15:34:44 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-9,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at 
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
> at 
> org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
> at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
> at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> Disk Persistent: 
> 16/03/23 17:04:00 INFO TaskSetManager: Lost task 42.1 in stage 2.0 (TID 974) 
> on executor i-2878ceb3.inst.aws.airbnb.com: java.lang.RuntimeException 
> (java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
> at 

[jira] [Commented] (SPARK-14105) Serialization issue for KafkaRDD

2016-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209226#comment-15209226
 ] 

Apache Spark commented on SPARK-14105:
--

User 'liyintang' has created a pull request for this issue:
https://github.com/apache/spark/pull/11921

> Serialization issue for KafkaRDD
> 
>
> Key: SPARK-14105
> URL: https://issues.apache.org/jira/browse/SPARK-14105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
>
> When using DISK or MEMORY to persist the KafkaDirectInputStream, Spark 
> serializes the FetchResponse into blocks. The FetchResponse contains the 
> ByteBufferMessageSet, where each Kafka Message is just one slice of the 
> underlying ByteBuffer. 
> When serializing the KafkaRDDIterator, it seems like the entire underlying 
> ByteBuffer in the ByteBufferMessageSet is serialized for each and every 
> message. This easily makes the block size exceed 2G, and leads to 
> "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" or 
> "FileChannelImpl.map -> exceeds Integer.MAX_VALUE:"
> The consumer fetch size is the default value (1M). I tried to reduce the fetch 
> size, but that causes other errors like errRanOutBeforeEnd. 
> Here are the exceptions I got for both memory and disk persistence.
> Memory Persistent:
> 16/03/23 15:34:44 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-9,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at 
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
> at 
> org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
> at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
> at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> Disk Persistent: 
> 16/03/23 17:04:00 INFO TaskSetManager: Lost task 42.1 in stage 2.0 (TID 974) 
> on executor i-2878ceb3.inst.aws.airbnb.com: java.lang.RuntimeException 
> (java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
> at 

[jira] [Assigned] (SPARK-14105) Serialization issue for KafkaRDD

2016-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14105:


Assignee: Apache Spark

> Serialization issue for KafkaRDD
> 
>
> Key: SPARK-14105
> URL: https://issues.apache.org/jira/browse/SPARK-14105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
>Assignee: Apache Spark
>
> When using DISK or MEMORY to persist the KafkaDirectInputStream, Spark 
> serializes the FetchResponse into blocks. The FetchResponse contains the 
> ByteBufferMessageSet, where each Kafka Message is just one slice of the 
> underlying ByteBuffer. 
> When serializing the KafkaRDDIterator, it seems like the entire underlying 
> ByteBuffer in the ByteBufferMessageSet is serialized for each and every 
> message. This easily makes the block size exceed 2G, and leads to 
> "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" or 
> "FileChannelImpl.map -> exceeds Integer.MAX_VALUE:"
> The consumer fetch size is the default value (1M). I tried to reduce the fetch 
> size, but that causes other errors like errRanOutBeforeEnd. 
> Here are the exceptions I got for both memory and disk persistence.
> Memory Persistent:
> 16/03/23 15:34:44 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-9,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at 
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
> at 
> org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
> at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
> at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> Disk Persistent: 
> 16/03/23 17:04:00 INFO TaskSetManager: Lost task 42.1 in stage 2.0 (TID 974) 
> on executor i-2878ceb3.inst.aws.airbnb.com: java.lang.RuntimeException 
> (java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206)
> at 

[jira] [Assigned] (SPARK-14105) Serialization issue for KafkaRDD

2016-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14105:


Assignee: (was: Apache Spark)

> Serialization issue for KafkaRDD
> 
>
> Key: SPARK-14105
> URL: https://issues.apache.org/jira/browse/SPARK-14105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
>
> When using DISK or MEMORY to persist the KafkaDirectInputStream, Spark 
> serializes the FetchResponse into blocks. The FetchResponse contains the 
> ByteBufferMessageSet, where each Kafka Message is just one slice of the 
> underlying ByteBuffer. 
> When serializing the KafkaRDDIterator, it seems like the entire underlying 
> ByteBuffer in the ByteBufferMessageSet is serialized for each and every 
> message. This easily makes the block size exceed 2G, and leads to 
> "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" or 
> "FileChannelImpl.map -> exceeds Integer.MAX_VALUE:"
> The consumer fetch size is the default value (1M). I tried to reduce the fetch 
> size, but that causes other errors like errRanOutBeforeEnd. 
> Here are the exceptions I got for both memory and disk persistence.
> Memory Persistent:
> 16/03/23 15:34:44 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-9,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at 
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
> at 
> org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
> at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
> at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> Disk Persistent: 
> 16/03/23 17:04:00 INFO TaskSetManager: Lost task 42.1 in stage 2.0 (TID 974) 
> on executor i-2878ceb3.inst.aws.airbnb.com: java.lang.RuntimeException 
> (java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
>

[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-23 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209220#comment-15209220
 ] 

Xin Wu commented on SPARK-13832:


[~jfc...@us.ibm.com] I think when using grouping_id(), you need to pass in all 
the columns that appear in the GROUP BY clause. In this case, that is 
grouping_id(i_category, i_class). The result is equivalent to concatenating the 
results of grouping() for each column into a bit vector (a string of ones and 
zeros), i.e. grouping(i_category) followed by grouping(i_class).

So {code}grouping_id(i_category)+grouping_id(i_class){code} is not correct. 
After I changed it to {code}grouping_id(i_category, i_class){code}, the query 
returns results for the text data files. 
I am trying the parquet files now. 
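
For reference, a minimal sketch of the corrected usage from the Scala shell. 
The join and column list are abbreviated from the query quoted below; this is 
illustrative only, not the full TPC-DS statement:

{code}
// Illustrative sketch only: grouping_id() must list every GROUP BY column.
val fixed = sqlContext.sql("""
  SELECT i_category,
         i_class,
         grouping_id(i_category, i_class) AS lochierarchy,
         sum(ss_net_profit) / sum(ss_ext_sales_price) AS gross_margin
  FROM store_sales
  JOIN item ON i_item_sk = ss_item_sk
  GROUP BY i_category, i_class WITH ROLLUP
""")
fixed.show()
{code}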



> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14101) Hive on Spark 1.6.1 - Maven BUILD FAILURE ?

2016-03-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209211#comment-15209211
 ] 

Sean Owen edited comment on SPARK-14101 at 3/23/16 9:30 PM:


Java 7 or 8 is fine, but it needs to be the JDK that the {{java}} (and 
{{javac}}) commands actually resolve to. This is the error that shows it isn't 
being used: {{[ERROR] javac: invalid source release: 1.7}}
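
For example, one quick way to confirm which JDK the build will actually pick 
up before rebuilding (the JDK path below is only an example; adjust it to the 
local install):

{noformat}
$ which java && java -version       # the java the shell resolves to
$ which javac && javac -version     # should report the same JDK version
$ export JAVA_HOME=/usr/java/jdk1.8.0_74    # example path only
$ export PATH="$JAVA_HOME/bin:$PATH"
$ mvn -version                      # "Java home" should now point at that JDK
{noformat}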


was (Author: srowen):
It needs to be what {{java}} invokes. This was the error that shows it isn't 
being used: {{[ERROR] javac: invalid source release: 1.7}}

> Hive on Spark 1.6.1 - Maven BUILD FAILURE ?
> ---
>
> Key: SPARK-14101
> URL: https://issues.apache.org/jira/browse/SPARK-14101
> Project: Spark
>  Issue Type: Test
>Reporter: Shankar
>
> ERROR - BUILD FAILURE
> {noformat}
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
> project spark-test-tags_2.11: Execution scala-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
> [Help 1]
> {noformat}
> My System Details - 
> {noformat}
> CentOS6.5 x64
> jdk1.8.0_74
> apache-maven-3.3.9
> apache-hive-1.2.1-bin
> hadoop-2.7.1
> Scala version 2.11.4
> spark version = spark-1.6.1.tgz
> {noformat}
> Building Command - 
> {noformat}
> cd spark-1.6.1
> ./dev/change-scala-version.sh 2.11
> /home/hadoop/apache-maven-3.3.9/bin/mvn -Pyarn -Phive -Phadoop-2.6 
> -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean package
> {noformat}
> {noformat}
> [INFO] 
> 
> [INFO] Building Spark Project Test Tags 1.6.1
> [INFO] 
> 
> [INFO] 
> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @ 
> spark-test-tags_2.11 ---
> [INFO] Deleting /home/hadoop/apache-spark-s/spark-1.6.1/tags/target
> [INFO] 
> [INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-versions) @ 
> spark-test-tags_2.11 ---
> [INFO] 
> [INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @ 
> spark-test-tags_2.11 ---
> [INFO] Add Source directory: 
> /home/hadoop/apache-spark-s/spark-1.6.1/tags/src/main/scala
> [INFO] Add Test Source directory: 
> /home/hadoop/apache-spark-s/spark-1.6.1/tags/src/test/scala
> [INFO] 
> [INFO] --- maven-dependency-plugin:2.10:build-classpath (default-cli) @ 
> spark-test-tags_2.11 ---
> [INFO] Dependencies classpath:
> /home/hadoop/.m2/repository/org/scala-lang/scala-reflect/2.11.7/scala-reflect-2.11.7.jar:/home/hadoop/.m2/repository/org/scala-lang/modules/scala-xml_2.11/1.0.2/scala-xml_2.11-1.0.2.jar:/home/hadoop/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar:/home/hadoop/.m2/repository/org/scala-lang/scala-library/2.11.7/scala-library-2.11.7.jar:/home/hadoop/.m2/repository/org/scalatest/scalatest_2.11/2.2.1/scalatest_2.11-2.2.1.jar
> [INFO] 
> [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
> spark-test-tags_2.11 ---
> [INFO] 
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
> spark-test-tags_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory 
> /home/hadoop/apache-spark-s/spark-1.6.1/tags/src/main/resources
> [INFO] Copying 3 resources
> [INFO] 
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ 
> spark-test-tags_2.11 ---
> [WARNING] Zinc server is not available at port 3030 - reverting to normal 
> incremental compile
> [INFO] Using incremental compilation
> [INFO] Compiling 3 Java sources to 
> /home/hadoop/apache-spark-s/spark-1.6.1/tags/target/scala-2.11/classes...
> [ERROR] javac: invalid source release: 1.7
> [ERROR] Usage: javac  
> [ERROR] use -help for a list of possible options
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO] 
> [INFO] Spark Project Test Tags  FAILURE [  2.862 
> s]
> [INFO] Spark Project Launcher . SKIPPED
> [INFO] Spark Project Networking ... SKIPPED
> [INFO] Spark Project Shuffle Streaming Service  SKIPPED
> [INFO] Spark Project Unsafe ... SKIPPED
> [INFO] Spark Project Core . SKIPPED
> [INFO] Spark Project Bagel  SKIPPED
> [INFO] Spark Project GraphX ... SKIPPED
> [INFO] Spark Project Streaming  SKIPPED
> [INFO] Spark Project Catalyst . SKIPPED
> [INFO] Spark Project SQL .. SKIPPED
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  

[jira] [Commented] (SPARK-14101) Hive on Spark 1.6.1 - Maven BUILD FAILURE ?

2016-03-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209211#comment-15209211
 ] 

Sean Owen commented on SPARK-14101:
---

It needs to be what {{java}} invokes. This was the error that shows it isn't 
being used: {{[ERROR] javac: invalid source release: 1.7}}

> Hive on Spark 1.6.1 - Maven BUILD FAILURE ?
> ---
>
> Key: SPARK-14101
> URL: https://issues.apache.org/jira/browse/SPARK-14101
> Project: Spark
>  Issue Type: Test
>Reporter: Shankar
>
> ERROR - BUILD FAILURE
> {noformat}
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
> project spark-test-tags_2.11: Execution scala-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
> [Help 1]
> {noformat}
> My System Details - 
> {noformat}
> CentOS6.5 x64
> jdk1.8.0_74
> apache-maven-3.3.9
> apache-hive-1.2.1-bin
> hadoop-2.7.1
> Scala version 2.11.4
> spark version = spark-1.6.1.tgz
> {noformat}
> Building Command - 
> {noformat}
> cd spark-1.6.1
> ./dev/change-scala-version.sh 2.11
> /home/hadoop/apache-maven-3.3.9/bin/mvn -Pyarn -Phive -Phadoop-2.6 
> -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean package
> {noformat}
> {noformat}
> [INFO] 
> 
> [INFO] Building Spark Project Test Tags 1.6.1
> [INFO] 
> 
> [INFO] 
> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @ 
> spark-test-tags_2.11 ---
> [INFO] Deleting /home/hadoop/apache-spark-s/spark-1.6.1/tags/target
> [INFO] 
> [INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-versions) @ 
> spark-test-tags_2.11 ---
> [INFO] 
> [INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @ 
> spark-test-tags_2.11 ---
> [INFO] Add Source directory: 
> /home/hadoop/apache-spark-s/spark-1.6.1/tags/src/main/scala
> [INFO] Add Test Source directory: 
> /home/hadoop/apache-spark-s/spark-1.6.1/tags/src/test/scala
> [INFO] 
> [INFO] --- maven-dependency-plugin:2.10:build-classpath (default-cli) @ 
> spark-test-tags_2.11 ---
> [INFO] Dependencies classpath:
> /home/hadoop/.m2/repository/org/scala-lang/scala-reflect/2.11.7/scala-reflect-2.11.7.jar:/home/hadoop/.m2/repository/org/scala-lang/modules/scala-xml_2.11/1.0.2/scala-xml_2.11-1.0.2.jar:/home/hadoop/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar:/home/hadoop/.m2/repository/org/scala-lang/scala-library/2.11.7/scala-library-2.11.7.jar:/home/hadoop/.m2/repository/org/scalatest/scalatest_2.11/2.2.1/scalatest_2.11-2.2.1.jar
> [INFO] 
> [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
> spark-test-tags_2.11 ---
> [INFO] 
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
> spark-test-tags_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory 
> /home/hadoop/apache-spark-s/spark-1.6.1/tags/src/main/resources
> [INFO] Copying 3 resources
> [INFO] 
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ 
> spark-test-tags_2.11 ---
> [WARNING] Zinc server is not available at port 3030 - reverting to normal 
> incremental compile
> [INFO] Using incremental compilation
> [INFO] Compiling 3 Java sources to 
> /home/hadoop/apache-spark-s/spark-1.6.1/tags/target/scala-2.11/classes...
> [ERROR] javac: invalid source release: 1.7
> [ERROR] Usage: javac  
> [ERROR] use -help for a list of possible options
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO] 
> [INFO] Spark Project Test Tags  FAILURE [  2.862 
> s]
> [INFO] Spark Project Launcher . SKIPPED
> [INFO] Spark Project Networking ... SKIPPED
> [INFO] Spark Project Shuffle Streaming Service  SKIPPED
> [INFO] Spark Project Unsafe ... SKIPPED
> [INFO] Spark Project Core . SKIPPED
> [INFO] Spark Project Bagel  SKIPPED
> [INFO] Spark Project GraphX ... SKIPPED
> [INFO] Spark Project Streaming  SKIPPED
> [INFO] Spark Project Catalyst . SKIPPED
> [INFO] Spark Project SQL .. SKIPPED
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SKIPPED
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project Docker Integration Tests . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark 

[jira] [Commented] (SPARK-14106) history server application cache doesn't detect that apps are completed

2016-03-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209175#comment-15209175
 ] 

Steve Loughran commented on SPARK-14106:


The solution I propose is for {{LoadedAppUI}} to return a completed flag, 
which {{ApplicationCache}} then uses to decide whether to hook up the filter. 
This avoids the cache having to maintain its own logic for deciding when an 
application has finished, logic which could be inconsistent with the app UI 
that was loaded.
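
A rough sketch of the idea (simplified, hypothetical types; not the actual 
Spark classes or signatures):

{code}
// Illustrative sketch only -- the real LoadedAppUI/ApplicationCache carry more state.
trait AppUI {
  def attachFilter(filterClassName: String): Unit
}

case class LoadedAppUI(ui: AppUI, completed: Boolean)

class ApplicationCache {
  /** Attach the stale-check filter only when the application may still change. */
  def register(app: LoadedAppUI): Unit = {
    if (!app.completed) {
      // Hypothetical filter name; the point is that completed apps skip this step.
      app.ui.attachFilter("history.cache.CheckFilter")
    }
  }
}
{code}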

> history server application cache doesn't detect that apps are completed
> ---
>
> Key: SPARK-14106
> URL: https://issues.apache.org/jira/browse/SPARK-14106
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Steve Loughran
>Priority: Minor
>
> The cache in SPARK-7889 hooks up a filter to the web UI of incomplete apps, 
> triggering a probe to see if the app is out of date.
> It looks like the filter is being hooked up to completed apps too, that is: 
> the {{ApplicationCache}} isn't correctly detecting when apps are finished.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13456) Cannot create encoders for case classes defined in Spark shell after upgrading to Scala 2.11

2016-03-23 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski reopened SPARK-13456:
-

In today's Spark 2.0.0-SNAPSHOT:

{code}
scala> :pa
// Entering paste mode (ctrl-D to finish)

import sqlContext.implicits._

case class Token(name: String, productId: Int, score: Double)
val data = Token("aaa", 100, 0.12) ::
  Token("aaa", 200, 0.29) ::
  Token("bbb", 200, 0.53) ::
  Token("bbb", 300, 0.42) :: Nil

val ds = data.toDS


// Exiting paste mode, now interpreting.

java.lang.NullPointerException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at 
org.apache.spark.sql.catalyst.encoders.OuterScopes$.getOuterScope(OuterScopes.scala:64)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:588)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:580)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:67)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:248)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.resolveDeserializer(Analyzer.scala:580)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolve(ExpressionEncoder.scala:321)
  at org.apache.spark.sql.Dataset.(Dataset.scala:197)
  at org.apache.spark.sql.Dataset.(Dataset.scala:164)
  at org.apache.spark.sql.Dataset$.apply(Dataset.scala:53)
  at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:448)
  at 
org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:152)
  ... 47 elided
{code}

> Cannot create encoders for case classes defined in Spark shell after 
> upgrading to Scala 2.11
> 
>
> Key: SPARK-13456
> URL: https://issues.apache.org/jira/browse/SPARK-13456
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Spark 2.0 started to use Scala 2.11 by default since [PR 
> #10608|https://github.com/apache/spark/pull/10608].  Unfortunately, after 
> this upgrade, Spark fails to create encoders for case classes defined in REPL:
> {code}
> import sqlContext.implicits._
> case class T(a: Int, b: Double)
> val ds = Seq(1 -> T(1, 1D), 2 -> T(2, 2D)).toDS()
> {code}
> Exception thrown:
> {noformat}
> org.apache.spark.sql.AnalysisException: Unable to generate an encoder for 
> inner class `T` without access to the scope that this class was defined in.
> Try moving this class out of its parent class.;
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:565)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:561)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:304)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> 

[jira] [Commented] (SPARK-14106) history server application cache doesn't detect that apps are completed

2016-03-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209168#comment-15209168
 ] 

Steve Loughran commented on SPARK-14106:


Is this serious? Not if the probes returned by the history providers are low 
cost: our YARN one is a no-op on incomplete apps and low cost on completed 
ones. The filter is pretty low cost too, but it is gratuitous.

> history server application cache doesn't detect that apps are completed
> ---
>
> Key: SPARK-14106
> URL: https://issues.apache.org/jira/browse/SPARK-14106
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Steve Loughran
>Priority: Minor
>
> The cache in SPARK-7889 hooks up a filter to the web UI of incomplete apps, 
> triggering a probe to see if the app is out of date.
> It looks like the filter is being hooked up to completed apps too, that is: 
> the {{ApplicationCache}} isn't correctly detecting when apps are finished.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14106) history server application cache doesn't detect that apps are completed

2016-03-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209153#comment-15209153
 ] 

Steve Loughran edited comment on SPARK-14106 at 3/23/16 8:56 PM:
-

This surfaced in the logs of my YARN timeline integration tests. There, a 
specific no-op probe is hooked up for completed applications, so there is no 
overhead at all on their checks.

{code}
 contains 32 event(s)
2016-03-23 20:46:06,594 [qtp196369530-25] INFO  server.YarnHistoryProvider 
(Logging.scala:logInfo(58)) - Building Application UI Spark Pi attempt Some(1) 
under /history/application_1458752661638_0002/1
2016-03-23 20:46:06,595 [qtp196369530-25] INFO  server.YarnHistoryProvider 
(Logging.scala:logInfo(58)) - App application_1458752661638_0002 history 
contains 32 events
2016-03-23 20:46:06,597 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Block Manager 
ID":{"Host":"192.168.1.132","Executor 
ID":"driver","Port":40481},"Event":"SparkListenerBlockManagerAdded","Timestamp":1458753291682,"Maximum
 Memory":480116736}
2016-03-23 20:46:06,608 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is 
{"Event":"SparkListenerEnvironmentUpdate","System 
Properties":{"java.vm.info":"mixed 
mode","line.separator":"\n","java.vm.version":"25.74-b02","sun.cpu.isalist":"","java.vm.specification.vendor":"Oracle
 Corporation","os.version":"3.13.0-83-generic","java.l ... }
2016-03-23 20:46:06,614 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"App Attempt 
ID":"1","App 
ID":"application_1458752661638_0002","Timestamp":1458753289143,"User":"stevel","Event":"SparkListenerApplicationStart","Driver
 
Logs":{"stdout":"http://xubunty.cotham.uk:8042/node/containerlogs/container_1458752661638_0002_01_
 ... }
2016-03-23 20:46:06,621 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Executor 
ID":"1","Event":"SparkListenerExecutorAdded","Executor Info":{"Total 
Cores":1,"Log 
Urls":{"stderr":"http://xubunty.cotham.uk:8042/node/containerlogs/container_1458752661638_0002_01_02/stevel/stderr?start=-4096","stdout":"http://xubunty.cotha
 ... }
2016-03-23 20:46:06,624 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is 
{"Timestamp":1458753309182,"Executor 
ID":"2","Event":"SparkListenerExecutorAdded","Executor Info":{"Total 
Cores":1,"Host":"xubunty.cotham.uk","Log 
Urls":{"stdout":"http://xubunty.cotham.uk:8042/node/containerlogs/container_1458752661638_0002_01_03/stev
 ... }
2016-03-23 20:46:06,625 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Block Manager 
ID":{"Executor 
ID":"1","Port":56415,"Host":"xubunty.cotham.uk"},"Event":"SparkListenerBlockManagerAdded","Maximum
 Memory":535953408,"Timestamp":1458753309248}
2016-03-23 20:46:06,626 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Block Manager 
ID":{"Port":49921,"Executor 
ID":"2","Host":"xubunty.cotham.uk"},"Event":"SparkListenerBlockManagerAdded","Maximum
 Memory":535953408,"Timestamp":1458753309295}
2016-03-23 20:46:06,632 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is 
{"Event":"SparkListenerJobStart","Stage Infos":[{"Parent IDs":[],"Stage 
ID":0,"RDD Info":[{"Disk Size":0,"RDD ID":1,"Name":"MapPartitionsRDD","Storage 
Level":{"Use Memory":false,"Deserialized":false,"Replication":1,"Use 
Disk":false,"Use ExternalBlockStore" ... }
2016-03-23 20:46:06,666 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Stage Info":{"Parent 
IDs":[],"Stage ID":0,"RDD Info":[{"Disk Size":0,"RDD 
ID":1,"Name":"MapPartitionsRDD","Storage Level":{"Use 
Memory":false,"Deserialized":false,"Replication":1,"Use Disk":false,"Use 
ExternalBlockStore":false},"Callsite":"map at SparkPi ... }
2016-03-23 20:46:06,678 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Stage ID":0,"Task 
Info":{"Executor ID":"1","Finish Time":0,"Launch Time":1458753311859,"Getting 
Result 
Time":0,"Locality":"PROCESS_LOCAL","Accumulables":[],"Speculative":false,"Failed":false,"Index":0,"Host":"xubunty.cotham.uk","Attempt":0,"Task
 ID":0}," ... }
2016-03-23 20:46:06,682 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Stage ID":0,"Task 
Info":{"Finish Time":0,"Executor ID":"2","Getting Result Time":0,"Launch 
Time":1458753311896,"Locality":"PROCESS_LOCAL","Accumulables":[],"Speculative":false,"Failed":false,"Task
 ID":1,"Host":"xubunty.cotham.uk","Attempt":0,"Index":1}," ... }
2016-03-23 20:46:06,684 [qtp196369530-25] DEBUG 

[jira] [Commented] (SPARK-14106) history server application cache doesn't detect that apps are completed

2016-03-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209153#comment-15209153
 ] 

Steve Loughran commented on SPARK-14106:


This surfaced in the logs of my YARN timeline integration tests. There, a 
specific no-op probe is hooked up for completed applications, so there is no 
overhead at all on their checks.

{code}
 contains 32 event(s)
2016-03-23 20:46:06,594 [qtp196369530-25] INFO  server.YarnHistoryProvider 
(Logging.scala:logInfo(58)) - Building Application UI Spark Pi attempt Some(1) 
under /history/application_1458752661638_0002/1
2016-03-23 20:46:06,595 [qtp196369530-25] INFO  server.YarnHistoryProvider 
(Logging.scala:logInfo(58)) - App application_1458752661638_0002 history 
contains 32 events
2016-03-23 20:46:06,597 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Block Manager 
ID":{"Host":"192.168.1.132","Executor 
ID":"driver","Port":40481},"Event":"SparkListenerBlockManagerAdded","Timestamp":1458753291682,"Maximum
 Memory":480116736}
2016-03-23 20:46:06,608 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is 
{"Event":"SparkListenerEnvironmentUpdate","System 
Properties":{"java.vm.info":"mixed 
mode","line.separator":"\n","java.vm.version":"25.74-b02","sun.cpu.isalist":"","java.vm.specification.vendor":"Oracle
 Corporation","os.version":"3.13.0-83-generic","java.l ... }
2016-03-23 20:46:06,614 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"App Attempt 
ID":"1","App 
ID":"application_1458752661638_0002","Timestamp":1458753289143,"User":"stevel","Event":"SparkListenerApplicationStart","Driver
 
Logs":{"stdout":"http://xubunty.cotham.uk:8042/node/containerlogs/container_1458752661638_0002_01_
 ... }
2016-03-23 20:46:06,621 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Executor 
ID":"1","Event":"SparkListenerExecutorAdded","Executor Info":{"Total 
Cores":1,"Log 
Urls":{"stderr":"http://xubunty.cotham.uk:8042/node/containerlogs/container_1458752661638_0002_01_02/stevel/stderr?start=-4096","stdout":"http://xubunty.cotha
 ... }
2016-03-23 20:46:06,624 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is 
{"Timestamp":1458753309182,"Executor 
ID":"2","Event":"SparkListenerExecutorAdded","Executor Info":{"Total 
Cores":1,"Host":"xubunty.cotham.uk","Log 
Urls":{"stdout":"http://xubunty.cotham.uk:8042/node/containerlogs/container_1458752661638_0002_01_03/stev
 ... }
2016-03-23 20:46:06,625 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Block Manager 
ID":{"Executor 
ID":"1","Port":56415,"Host":"xubunty.cotham.uk"},"Event":"SparkListenerBlockManagerAdded","Maximum
 Memory":535953408,"Timestamp":1458753309248}
2016-03-23 20:46:06,626 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Block Manager 
ID":{"Port":49921,"Executor 
ID":"2","Host":"xubunty.cotham.uk"},"Event":"SparkListenerBlockManagerAdded","Maximum
 Memory":535953408,"Timestamp":1458753309295}
2016-03-23 20:46:06,632 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is 
{"Event":"SparkListenerJobStart","Stage Infos":[{"Parent IDs":[],"Stage 
ID":0,"RDD Info":[{"Disk Size":0,"RDD ID":1,"Name":"MapPartitionsRDD","Storage 
Level":{"Use Memory":false,"Deserialized":false,"Replication":1,"Use 
Disk":false,"Use ExternalBlockStore" ... }
2016-03-23 20:46:06,666 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Stage Info":{"Parent 
IDs":[],"Stage ID":0,"RDD Info":[{"Disk Size":0,"RDD 
ID":1,"Name":"MapPartitionsRDD","Storage Level":{"Use 
Memory":false,"Deserialized":false,"Replication":1,"Use Disk":false,"Use 
ExternalBlockStore":false},"Callsite":"map at SparkPi ... }
2016-03-23 20:46:06,678 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Stage ID":0,"Task 
Info":{"Executor ID":"1","Finish Time":0,"Launch Time":1458753311859,"Getting 
Result 
Time":0,"Locality":"PROCESS_LOCAL","Accumulables":[],"Speculative":false,"Failed":false,"Index":0,"Host":"xubunty.cotham.uk","Attempt":0,"Task
 ID":0}," ... }
2016-03-23 20:46:06,682 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - toSparkEvent payload is {"Stage ID":0,"Task 
Info":{"Finish Time":0,"Executor ID":"2","Getting Result Time":0,"Launch 
Time":1458753311896,"Locality":"PROCESS_LOCAL","Accumulables":[],"Speculative":false,"Failed":false,"Task
 ID":1,"Host":"xubunty.cotham.uk","Attempt":0,"Index":1}," ... }
2016-03-23 20:46:06,684 [qtp196369530-25] DEBUG yarn.YarnTimelineUtils 
(Logging.scala:logDebug(62)) - 

[jira] [Commented] (SPARK-14101) Hive on Spark 1.6.1 - Maven BUILD FAILURE ?

2016-03-23 Thread Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209149#comment-15209149
 ] 

Shankar commented on SPARK-14101:
-

You mean I have to use Java 7 explicitly? 

I have already installed Java 8 and set it in the environment PATH.


> Hive on Spark 1.6.1 - Maven BUILD FAILURE ?
> ---
>
> Key: SPARK-14101
> URL: https://issues.apache.org/jira/browse/SPARK-14101
> Project: Spark
>  Issue Type: Test
>Reporter: Shankar
>
> ERROR - BUILD FAILURE
> {noformat}
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
> project spark-test-tags_2.11: Execution scala-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
> [Help 1]
> {noformat}
> My System Details - 
> {noformat}
> CentOS6.5 x64
> jdk1.8.0_74
> apache-maven-3.3.9
> apache-hive-1.2.1-bin
> hadoop-2.7.1
> Scala version 2.11.4
> spark version = spark-1.6.1.tgz
> {noformat}
> Building Command - 
> {noformat}
> cd spark-1.6.1
> ./dev/change-scala-version.sh 2.11
> /home/hadoop/apache-maven-3.3.9/bin/mvn -Pyarn -Phive -Phadoop-2.6 
> -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean package
> {noformat}
> {noformat}
> [INFO] 
> 
> [INFO] Building Spark Project Test Tags 1.6.1
> [INFO] 
> 
> [INFO] 
> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @ 
> spark-test-tags_2.11 ---
> [INFO] Deleting /home/hadoop/apache-spark-s/spark-1.6.1/tags/target
> [INFO] 
> [INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-versions) @ 
> spark-test-tags_2.11 ---
> [INFO] 
> [INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @ 
> spark-test-tags_2.11 ---
> [INFO] Add Source directory: 
> /home/hadoop/apache-spark-s/spark-1.6.1/tags/src/main/scala
> [INFO] Add Test Source directory: 
> /home/hadoop/apache-spark-s/spark-1.6.1/tags/src/test/scala
> [INFO] 
> [INFO] --- maven-dependency-plugin:2.10:build-classpath (default-cli) @ 
> spark-test-tags_2.11 ---
> [INFO] Dependencies classpath:
> /home/hadoop/.m2/repository/org/scala-lang/scala-reflect/2.11.7/scala-reflect-2.11.7.jar:/home/hadoop/.m2/repository/org/scala-lang/modules/scala-xml_2.11/1.0.2/scala-xml_2.11-1.0.2.jar:/home/hadoop/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar:/home/hadoop/.m2/repository/org/scala-lang/scala-library/2.11.7/scala-library-2.11.7.jar:/home/hadoop/.m2/repository/org/scalatest/scalatest_2.11/2.2.1/scalatest_2.11-2.2.1.jar
> [INFO] 
> [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
> spark-test-tags_2.11 ---
> [INFO] 
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
> spark-test-tags_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory 
> /home/hadoop/apache-spark-s/spark-1.6.1/tags/src/main/resources
> [INFO] Copying 3 resources
> [INFO] 
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ 
> spark-test-tags_2.11 ---
> [WARNING] Zinc server is not available at port 3030 - reverting to normal 
> incremental compile
> [INFO] Using incremental compilation
> [INFO] Compiling 3 Java sources to 
> /home/hadoop/apache-spark-s/spark-1.6.1/tags/target/scala-2.11/classes...
> [ERROR] javac: invalid source release: 1.7
> [ERROR] Usage: javac  
> [ERROR] use -help for a list of possible options
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO] 
> [INFO] Spark Project Test Tags  FAILURE [  2.862 
> s]
> [INFO] Spark Project Launcher . SKIPPED
> [INFO] Spark Project Networking ... SKIPPED
> [INFO] Spark Project Shuffle Streaming Service  SKIPPED
> [INFO] Spark Project Unsafe ... SKIPPED
> [INFO] Spark Project Core . SKIPPED
> [INFO] Spark Project Bagel  SKIPPED
> [INFO] Spark Project GraphX ... SKIPPED
> [INFO] Spark Project Streaming  SKIPPED
> [INFO] Spark Project Catalyst . SKIPPED
> [INFO] Spark Project SQL .. SKIPPED
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SKIPPED
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project Docker Integration Tests . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project YARN Shuffle Service 

[jira] [Created] (SPARK-14106) history server application cache doesn't detect that apps are completed

2016-03-23 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-14106:
--

 Summary: history server application cache doesn't detect that apps 
are completed
 Key: SPARK-14106
 URL: https://issues.apache.org/jira/browse/SPARK-14106
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.0.0
Reporter: Steve Loughran
Priority: Minor


The cache in SPARK-7889 hooks up a filter to the web UI of incomplete apps, 
triggering a probe to see if the app is out of date.

It looks like the filter is being hooked up to completed apps too, that is: the 
{{ApplicationCache}} isn't correctly detecting when apps are finished.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13710) Spark shell shows ERROR when launching on Windows

2016-03-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209134#comment-15209134
 ] 

Sean Owen commented on SPARK-13710:
---

Hm. The Maven coordinates didn't change. I wonder if there is somehow still an 
issue where the new version contains code in the old namespace for backwards 
compatibility or something similar. That is, I wonder why the exclusion above 
fixed it for you?

> Spark shell shows ERROR when launching on Windows
> -
>
> Key: SPARK-13710
> URL: https://issues.apache.org/jira/browse/SPARK-13710
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message 
> and stacktrace.
> {noformat}
> C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell
> [ERROR] Terminal initialization failed; falling back to unsupported
> java.lang.NoClassDefFoundError: Could not initialize class 
> scala.tools.fusesource_embedded.jansi.internal.Kernel32
> at 
> scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)
> at 
> scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204)
> at 
> scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82)
> at 
> scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101)
> at 
> scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:229)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:221)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:209)
> at 
> scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.(JLineReader.scala:61)
> at 
> scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.(JLineReader.scala:33)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862)
> at 
> scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
> at scala.collection.immutable.Stream.collect(Stream.scala:435)
> at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911)
> at org.apache.spark.repl.Main$.doMain(Main.scala:64)
> at org.apache.spark.repl.Main$.main(Main.scala:47)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> 

[jira] [Updated] (SPARK-14105) Serialization issue for KafkaRDD

2016-03-23 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated SPARK-14105:
---
Affects Version/s: 1.5.2
  Component/s: Streaming
   Spark Core

> Serialization issue for KafkaRDD
> 
>
> Key: SPARK-14105
> URL: https://issues.apache.org/jira/browse/SPARK-14105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
>
> When using DISK or MEMORY persistence for KafkaDirectInputStream, Spark 
> serializes the FetchResponse into blocks. The FetchResponse contains the 
> ByteBufferMessageSet, where each Kafka Message is just one slice of the 
> underlying ByteBuffer. 
> When serializing the KafkaRDDIterator, it seems like the entire underlying 
> ByteBuffer in the ByteBufferMessageSet is serialized for each and every 
> message. This can easily push the block size past 2 GB and lead to 
> "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" or 
> "FileChannelImpl.map -> exceeds Integer.MAX_VALUE:"
> The consumer fetch size is the default value (1M). I tried to reduce the 
> fetch size, but that causes other errors such as errRanOutBeforeEnd. 
> Here are exceptions I got for both Memory and Disk persistent.
> Memory Persistent:
> 16/03/23 15:34:44 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-9,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
> at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at 
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
> at 
> org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
> at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
> at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> Disk Persistent: 
> 16/03/23 17:04:00 INFO TaskSetManager: Lost task 42.1 in stage 2.0 (TID 974) 
> on executor i-2878ceb3.inst.aws.airbnb.com: java.lang.RuntimeException 
> (java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
> at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206)
> at 

[jira] [Commented] (SPARK-13710) Spark shell shows ERROR when launching on Windows

2016-03-23 Thread Michel Lemay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209127#comment-15209127
 ] 

Michel Lemay commented on SPARK-13710:
--

Looks like it was renamed to a different package, so no conflict is possible.
{code}
package scala.tools.jline
{code}

> Spark shell shows ERROR when launching on Windows
> -
>
> Key: SPARK-13710
> URL: https://issues.apache.org/jira/browse/SPARK-13710
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message 
> and stacktrace.
> {noformat}
> C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell
> [ERROR] Terminal initialization failed; falling back to unsupported
> java.lang.NoClassDefFoundError: Could not initialize class 
> scala.tools.fusesource_embedded.jansi.internal.Kernel32
> at 
> scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)
> at 
> scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204)
> at 
> scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82)
> at 
> scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101)
> at 
> scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:229)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:221)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:209)
> at 
> scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.(JLineReader.scala:61)
> at 
> scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.(JLineReader.scala:33)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862)
> at 
> scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
> at scala.collection.immutable.Stream.collect(Stream.scala:435)
> at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911)
> at org.apache.spark.repl.Main$.doMain(Main.scala:64)
> at org.apache.spark.repl.Main$.main(Main.scala:47)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:737)
> at 
> 

[jira] [Created] (SPARK-14105) Serialization issue for KafkaRDD

2016-03-23 Thread Liyin Tang (JIRA)
Liyin Tang created SPARK-14105:
--

 Summary: Serialization issue for KafkaRDD
 Key: SPARK-14105
 URL: https://issues.apache.org/jira/browse/SPARK-14105
 Project: Spark
  Issue Type: Bug
Reporter: Liyin Tang


When using DISK or MEMORY persistence for KafkaDirectInputStream, Spark 
serializes the FetchResponse into blocks. The FetchResponse contains the 
ByteBufferMessageSet, where each Kafka Message is just one slice of the 
underlying ByteBuffer. 

When serializing the KafkaRDDIterator, it seems like the entire underlying 
ByteBuffer in the ByteBufferMessageSet is serialized for each and every 
message. This can easily push the block size past 2 GB and lead to 
"java.lang.OutOfMemoryError: Requested array size exceeds VM limit" or 
"FileChannelImpl.map -> exceeds Integer.MAX_VALUE:"

The consumer fetch size is the default value (1M). I tried to reduce the fetch 
size, but that causes other errors such as errRanOutBeforeEnd. 
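
A minimal illustration of the underlying NIO behaviour (a generic heap-buffer 
sketch, not the actual Kafka/Kryo code path): a slice still carries the whole 
backing array, so any serializer that writes the backing array rather than the 
remaining bytes ends up writing the entire fetch for every message.

{code}
// Illustrative sketch only -- not the Kafka or Kryo code path.
import java.nio.ByteBuffer

val fetch = ByteBuffer.allocate(1024 * 1024)   // e.g. one 1 MB fetch response
fetch.position(100)                            // pretend one message lives here
fetch.limit(110)
val message = fetch.slice()                    // a 10-byte view of the fetch

println(message.remaining())                   // 10 -- what the message "is"
println(message.array().length)                // 1048576 -- the shared backing array
{code}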

Here are exceptions I got for both Memory and Disk persistent.
Memory Persistent:
16/03/23 15:34:44 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
thread Thread[Executor task launch worker-9,5,main]
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at 
org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
at 
org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
at 
org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
at 
org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)

Disk Persistent: 
16/03/23 17:04:00 INFO TaskSetManager: Lost task 42.1 in stage 2.0 (TID 974) on 
executor i-2878ceb3.inst.aws.airbnb.com: java.lang.RuntimeException 
(java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
at 
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
at 
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:512)
at 
org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:302)
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Closed] (SPARK-14080) Improve the codegen for Filter

2016-03-23 Thread Bo Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Meng closed SPARK-14080.
---
Resolution: Won't Fix

It would probably be a cosmetic issue, and the Java compiler could optimize 
this redundancy away. 

> Improve the codegen for Filter
> --
>
> Key: SPARK-14080
> URL: https://issues.apache.org/jira/browse/SPARK-14080
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Bo Meng
>Priority: Minor
>
> Currently, the codegen of the null check for Filter sometimes generates code 
> like the following:
> /* 072 */   if (!(!(filter_isNull2))) continue;
> It would be better as:
> /* 072 */   if (filter_isNull2) continue;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14104) All Python param setters should use the `_set` method.

2016-03-23 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209118#comment-15209118
 ] 

Seth Hendrickson commented on SPARK-14104:
--

I am planning to work on this.

> All Python param setters should use the `_set` method.
> --
>
> Key: SPARK-14104
> URL: https://issues.apache.org/jira/browse/SPARK-14104
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> [SPARK-13068|https://issues.apache.org/jira/browse/SPARK-13068] added type 
> conversions and checking for Python ML params. The type checking happens in 
> the {{_set}} method of params. However, some param setters modify the 
> {{_paramMap}} directly and circumvent type checking. All param updates should 
> happen through the {{_set}} method to ensure proper type conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14104) All Python param setters should use the `_set` method.

2016-03-23 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-14104:


 Summary: All Python param setters should use the `_set` method.
 Key: SPARK-14104
 URL: https://issues.apache.org/jira/browse/SPARK-14104
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Seth Hendrickson
Priority: Minor


[SPARK-13068|https://issues.apache.org/jira/browse/SPARK-13068] added type 
conversions and checking for Python ML params. The type checking happens in the 
{{_set}} method of params. However, some param setters modify the {{_paramMap}} 
directly and circumvent type checking. All param updates should happen through 
the {{_set}} method to ensure proper type conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13710) Spark shell shows ERROR when launching on Windows

2016-03-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209114#comment-15209114
 ] 

Sean Owen commented on SPARK-13710:
---

I agree that's a likely explanation. Maybe the conflict only surfaces on Windows. 
Curator is still used by some ZooKeeper-dependent parts of the code, like the 
standalone Master, but I can't quite figure out why it would need jline. It may 
only be needed if, say, you're running its command-line tools, so this exclusion 
may be a good idea.

I wonder whether this has consequences for the 2.10 build -- I'd suspect it's less 
of a problem if that build is using the old jline?

> Spark shell shows ERROR when launching on Windows
> -
>
> Key: SPARK-13710
> URL: https://issues.apache.org/jira/browse/SPARK-13710
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> On Windows, when we launch {{bin\spark-shell.cmd}}, it shows an ERROR message 
> and a stack trace.
> {noformat}
> C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell
> [ERROR] Terminal initialization failed; falling back to unsupported
> java.lang.NoClassDefFoundError: Could not initialize class 
> scala.tools.fusesource_embedded.jansi.internal.Kernel32
> at 
> scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)
> at 
> scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204)
> at 
> scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82)
> at 
> scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101)
> at 
> scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:229)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:221)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:209)
> at 
> scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.<init>(JLineReader.scala:61)
> at 
> scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.<init>(JLineReader.scala:33)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862)
> at 
> scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
> at scala.collection.immutable.Stream.collect(Stream.scala:435)
> at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911)
> at org.apache.spark.repl.Main$.doMain(Main.scala:64)
> at org.apache.spark.repl.Main$.main(Main.scala:47)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

[jira] [Commented] (SPARK-14102) Block `reset` command in SparkShell

2016-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209101#comment-15209101
 ] 

Apache Spark commented on SPARK-14102:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11920

> Block `reset` command in SparkShell
> ---
>
> Key: SPARK-14102
> URL: https://issues.apache.org/jira/browse/SPARK-14102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Spark Shell provides an easy way to use Spark in a Scala environment. This 
> issue proposes to block the `reset` command that is provided by default.
> {code:title=bin/spark-shell|borderStyle=solid}
> scala> sc
> res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@718fad24
> scala> :reset
> scala> sc
> :11: error: not found: value sc
>sc
>^
> {code}
> If we block `reset`, Spark Shell works as follows.
> {code:title=bin/spark-shell|borderStyle=solid}
> scala> :reset
> reset: no such command.  Type :help for help.
> scala> :re
> re is ambiguous: did you mean :replay or :require?
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14102) Block `reset` command in SparkShell

2016-03-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14102.
---
Resolution: Duplicate

I think we decided to leave the stock Scala shell basically as-is here. I'm also not 
sure we can modify the 2.11 version now, since we're no longer forking it.

> Block `reset` command in SparkShell
> ---
>
> Key: SPARK-14102
> URL: https://issues.apache.org/jira/browse/SPARK-14102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Spark Shell provides an easy way to use Spark in a Scala environment. This 
> issue proposes to block the `reset` command that is provided by default.
> {code:title=bin/spark-shell|borderStyle=solid}
> scala> sc
> res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@718fad24
> scala> :reset
> scala> sc
> :11: error: not found: value sc
>sc
>^
> {code}
> If we block `reset`, Spark Shell works as follows.
> {code:title=bin/spark-shell|borderStyle=solid}
> scala> :reset
> reset: no such command.  Type :help for help.
> scala> :re
> re is ambiguous: did you mean :replay or :require?
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


