[jira] [Commented] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit

2016-03-02 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177445#comment-15177445
 ] 

Liang-Chi Hsieh commented on SPARK-13635:
-

[~davies] Can you help update the Assignee field? Thanks!

> Enable LimitPushdown optimizer rule because we have whole-stage codegen for 
> Limit
> -
>
> Key: SPARK-13635
> URL: https://issues.apache.org/jira/browse/SPARK-13635
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> LimitPushdown optimizer rule has been disabled due to no whole-stage codegen 
> for Limit. As we have whole-stage codegen for Limit now, we should enable it.
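
For context, a minimal sketch of a plan shape this rule targets (assuming two
hypothetical DataFrames {{df1}} and {{df2}} with compatible schemas; an
illustration only, not the change itself): with LimitPushdown enabled, the
limit can be pushed beneath the union so each side produces at most 5 rows
before the final limit.

{code}
// Hypothetical DataFrames df1 and df2.
// LimitPushdown pushes LocalLimit into both children of a Union.
val limited = df1.unionAll(df2).limit(5)
limited.explain()  // with the rule enabled, a LocalLimit also appears below the Union
{code}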



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13531) Some DataFrame joins stopped working with UnsupportedOperationException: No size estimation available for objects

2016-03-02 Thread Zuo Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177441#comment-15177441
 ] 

Zuo Wang commented on SPARK-13531:
--

Caused by the commit in https://issues.apache.org/jira/browse/SPARK-13329

> Some DataFrame joins stopped working with UnsupportedOperationException: No 
> size estimation available for objects
> -
>
> Key: SPARK-13531
> URL: https://issues.apache.org/jira/browse/SPARK-13531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: koert kuipers
>Priority: Minor
>
> this is using spark 2.0.0-SNAPSHOT
> dataframe df1:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , obj#135: object, [if (input[0, object].isNullAt) 
> null else input[0, object].get AS x#128]
> +- MapPartitions , createexternalrow(if (isnull(x#9)) null else 
> x#9), [input[0, object] AS obj#135]
>+- WholeStageCodegen
>   :  +- Project [_1#8 AS x#9]
>   : +- Scan ExistingRDD[_1#8]{noformat}
> show:
> {noformat}+---+
> |  x|
> +---+
> |  2|
> |  3|
> +---+{noformat}
> dataframe df2:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true), 
> StructField(y,StringType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , createexternalrow(x#2, if (isnull(y#3)) null else 
> y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get 
> AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, 
> object].get, true) AS y#131]
> +- WholeStageCodegen
>:  +- Project [_1#0 AS x#2,_2#1 AS y#3]
>: +- Scan ExistingRDD[_1#0,_2#1]{noformat}
> show:
> {noformat}+---+---+
> |  x|  y|
> +---+---+
> |  1|  1|
> |  2|  2|
> |  3|  3|
> +---+---+{noformat}
> i run:
> df1.join(df2, Seq("x")).show
> i get:
> {noformat}java.lang.UnsupportedOperationException: No size estimation 
> available for objects.
> at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87){noformat}
> not sure what changed; this ran about a week ago without issues (in our 
> internal unit tests). it is fully reproducible, however when i tried to 
> minimize the issue i could not reproduce it by just creating data frames in 
> the repl with the same contents, so it probably has something to do with the 
> way these are created (from Row objects and StructTypes).
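
As a side note, a minimal sketch of the construction path suspected above
(building a DataFrame from Row objects and an explicit StructType in a shell
with {{sc}} and {{sqlContext}} in scope); the names and values are illustrative
only and are not claimed to reproduce the bug.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Build a DataFrame from Row objects and an explicit schema.
val schema = StructType(Seq(StructField("x", IntegerType, nullable = true)))
val df1 = sqlContext.createDataFrame(sc.parallelize(Seq(Row(2), Row(3))), schema)
df1.printSchema()
{code}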



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit

2016-03-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13635.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11483
[https://github.com/apache/spark/pull/11483]

> Enable LimitPushdown optimizer rule because we have whole-stage codegen for 
> Limit
> -
>
> Key: SPARK-13635
> URL: https://issues.apache.org/jira/browse/SPARK-13635
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> LimitPushdown optimizer rule has been disabled due to no whole-stage codegen 
> for Limit. As we have whole-stage codegen for Limit now, we should enable it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13589) Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType

2016-03-02 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177437#comment-15177437
 ] 

Liang-Chi Hsieh commented on SPARK-13589:
-

[~lian cheng] I think this is already solved in SPARK-13537.

> Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType
> ---
>
> Key: SPARK-13589
> URL: https://issues.apache.org/jira/browse/SPARK-13589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>  Labels: flaky-test
>
> Here are a few sample build failures caused by this test case:
> # 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52164/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/
> # 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52154/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/
> # 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52153/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/
> (I've pinned these builds on Jenkins so that they won't be cleaned up.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12941) Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype

2016-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177436#comment-15177436
 ] 

Apache Spark commented on SPARK-12941:
--

User 'thomastechs' has created a pull request for this issue:
https://github.com/apache/spark/pull/11489

> Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR 
> datatype
> --
>
> Key: SPARK-12941
> URL: https://issues.apache.org/jira/browse/SPARK-12941
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
> Environment: Apache Spark 1.4.2.2
>Reporter: Jose Martinez Poblete
>Assignee: Thomas Sebastian
> Fix For: 1.4.2, 1.5.3, 1.6.2, 2.0.0
>
>
> When exporting data from Spark to Oracle, string datatypes are translated to 
> TEXT for Oracle, this is leading to the following error
> {noformat}
> java.sql.SQLSyntaxErrorException: ORA-00902: invalid datatype
> {noformat}
> As per the following code:
> https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L144
> See also:
> http://stackoverflow.com/questions/31287182/writing-to-oracle-database-using-apache-spark-1-4-0
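
For illustration, a hedged sketch of the kind of dialect override involved
(not necessarily what the linked pull request does): Spark's {{JdbcDialect}}
API lets a dialect map Catalyst types to database-specific column types, so an
Oracle dialect can map {{StringType}} to VARCHAR2 instead of TEXT. The column
size below is an arbitrary illustrative choice.

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Sketch only: map StringType to VARCHAR2 for Oracle JDBC URLs.
object OracleDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR2(255)", java.sql.Types.VARCHAR))
    case _ => None
  }
}

JdbcDialects.registerDialect(OracleDialectSketch)
{code}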



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13612) Multiplication of BigDecimal columns not working as expected

2016-03-02 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177428#comment-15177428
 ] 

Liang-Chi Hsieh edited comment on SPARK-13612 at 3/3/16 7:35 AM:
-

Because the internal type for BigDecimal is Decimal(38, 18) by default (you can 
print the schema of x and y), the result scale of x("a") * y("b") will be 
18 + 18 = 36. That is detected as an overflow, so you get a null value back.

You can cast the decimal columns to a proper precision and scale, e.g.:

{code}

val newX = x.withColumn("a", x("a").cast(DecimalType(10, 1)))
val newY = y.withColumn("b", y("b").cast(DecimalType(10, 1)))

newX.join(newY, newX("id") === newY("id")).withColumn("z", newX("a") * 
newY("b")).show

+---+----+---+----+------+
| id|   a| id|   b|     z|
+---+----+---+----+------+
|  1|10.0|  1|10.0|100.00|
+---+----+---+----+------+

{code}



was (Author: viirya):
Because the internal type for BigDecimal would be Decimal(38, 18) by default, 
(you can print the schema of x and y), the result scale of x("a") * y("b") will 
be 18 + 18 = 36. That is detected to have overflow so you get a null value back.

You can cast the decimal column to proper precision and scale, e.g.:

{{code}}

val newX = x.withColumn("a", x("a").cast(DecimalType(10, 1)))
val newY = y.withColumn("b", y("b").cast(DecimalType(10, 1)))

newX.join(newY, newX("id") === newY("id")).withColumn("z", newX("a") * 
newY("b")).show

+---+----+---+----+------+
| id|   a| id|   b|     z|
+---+----+---+----+------+
|  1|10.0|  1|10.0|100.00|
+---+----+---+----+------+

{{code}}


> Multiplication of BigDecimal columns not working as expected
> 
>
> Key: SPARK-13612
> URL: https://issues.apache.org/jira/browse/SPARK-13612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Varadharajan
>
> Please consider the below snippet:
> {code}
> case class AM(id: Int, a: BigDecimal)
> case class AX(id: Int, b: BigDecimal)
> val x = sc.parallelize(List(AM(1, 10))).toDF
> val y = sc.parallelize(List(AX(1, 10))).toDF
> x.join(y, x("id") === y("id")).withColumn("z", x("a") * y("b")).show
> {code}
> output:
> {code}
> | id|   a| id|   b|   z|
> |  1|10.00...|  1|10.00...|null|
> {code}
> Here the multiplication of the columns ("z") returns null instead of 100.
> As of now we are using the workaround below, but this definitely looks like a 
> serious issue.
> {code}
> x.join(y, x("id") === y("id")).withColumn("z", x("a") / (expr("1") / 
> y("b"))).show
> {code}
> {code}
> | id|   a| id|   b|   z|
> |  1|10.00...|  1|10.00...|100.0...|
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13612) Multiplication of BigDecimal columns not working as expected

2016-03-02 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177428#comment-15177428
 ] 

Liang-Chi Hsieh commented on SPARK-13612:
-

Because the internal type for BigDecimal is Decimal(38, 18) by default (you can 
print the schema of x and y), the result scale of x("a") * y("b") will be 
18 + 18 = 36. That is detected as an overflow, so you get a null value back.

You can cast the decimal columns to a proper precision and scale, e.g.:

{{code}}

val newX = x.withColumn("a", x("a").cast(DecimalType(10, 1)))
val newY = y.withColumn("b", y("b").cast(DecimalType(10, 1)))

newX.join(newY, newX("id") === newY("id")).withColumn("z", newX("a") * 
newY("b")).show

+---+----+---+----+------+
| id|   a| id|   b|     z|
+---+----+---+----+------+
|  1|10.0|  1|10.0|100.00|
+---+----+---+----+------+

{{code}}


> Multiplication of BigDecimal columns not working as expected
> 
>
> Key: SPARK-13612
> URL: https://issues.apache.org/jira/browse/SPARK-13612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Varadharajan
>
> Please consider the below snippet:
> {code}
> case class AM(id: Int, a: BigDecimal)
> case class AX(id: Int, b: BigDecimal)
> val x = sc.parallelize(List(AM(1, 10))).toDF
> val y = sc.parallelize(List(AX(1, 10))).toDF
> x.join(y, x("id") === y("id")).withColumn("z", x("a") * y("b")).show
> {code}
> output:
> {code}
> | id|   a| id|   b|   z|
> |  1|10.00...|  1|10.00...|null|
> {code}
> Here the multiplication of the columns ("z") returns null instead of 100.
> As of now we are using the workaround below, but this definitely looks like a 
> serious issue.
> {code}
> x.join(y, x("id") === y("id")).withColumn("z", x("a") / (expr("1") / 
> y("b"))).show
> {code}
> {code}
> | id|   a| id|   b|   z|
> |  1|10.00...|  1|10.00...|100.0...|
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13643) Create SparkSession interface

2016-03-02 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-13643:
---

 Summary: Create SparkSession interface
 Key: SPARK-13643
 URL: https://issues.apache.org/jira/browse/SPARK-13643
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer

2016-03-02 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177392#comment-15177392
 ] 

Xusen Yin commented on SPARK-13600:
---

Vote for the new method.

> Incorrect number of buckets in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>Assignee: Oliver Pierson
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the 
> correct splits resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitsCandidates 
> method.
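
To make the off-by-one concrete (independent of the eventual fix): a set of
n+1 split points defines n buckets, so the splits shown above yield 6 buckets.

{code}
// The splits returned above define splits.length - 1 buckets.
val splits = Array(Double.NegativeInfinity, 2.0, 4.0, 6.0, 8.0, 10.0, Double.PositiveInfinity)
val numBuckets = splits.length - 1  // 6, although 5 were requested
{code}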



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13568) Create feature transformer to impute missing values

2016-03-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177368#comment-15177368
 ] 

Nick Pentreath commented on SPARK-13568:


Ok - the Imputer will need to compute column stats ignoring NaNs, so 
SPARK-13639 should add that (whether as default behaviour or as an optional 
argument).
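
For illustration, a minimal sketch (not the eventual Imputer API) of the kind
of NaN-ignoring column statistic mentioned above, assuming a DataFrame {{df}}
with a numeric column "value":

{code}
import org.apache.spark.sql.functions.avg

// Drop NaN rows for the column before aggregating, so the mean ignores NaNs.
val meanIgnoringNaN = df.filter(!df("value").isNaN).agg(avg(df("value"))).first().getDouble(0)
{code}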

> Create feature transformer to impute missing values
> ---
>
> Key: SPARK-13568
> URL: https://issues.apache.org/jira/browse/SPARK-13568
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> It is quite common to encounter missing values in data sets. It would be 
> useful to implement a {{Transformer}} that can impute missing data points, 
> similar to e.g. {{Imputer}} in 
> [scikit-learn|http://scikit-learn.org/dev/modules/preprocessing.html#imputation-of-missing-values].
> Initially, options for imputation could include {{mean}}, {{median}} and 
> {{most frequent}}, but we could add various other approaches. Where possible 
> existing DataFrame code can be used (e.g. for approximate quantiles etc).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2016-03-02 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177356#comment-15177356
 ] 

Adrian Wang commented on SPARK-13446:
-

That's not enough. We still need some code change.

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has 
> been released, it's better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13311) prettyString of IN is not good

2016-03-02 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177357#comment-15177357
 ] 

Xiao Li commented on SPARK-13311:
-

After the merge of https://github.com/apache/spark/pull/10757, I think the 
problem is resolved. 

> prettyString of IN is not good
> --
>
> Key: SPARK-13311
> URL: https://issues.apache.org/jira/browse/SPARK-13311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>
> In(i_class,[Ljava.lang.Object;@1a575883))
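
For context, the unreadable token above is just the JVM's default {{toString}}
for an object array; a small sketch of the difference a readable prettyString
would make (the values here are made up):

{code}
val values: Array[Any] = Array("i_class_a", "i_class_b")
println(values.toString)                  // [Ljava.lang.Object;@... (what the report shows)
println(values.mkString("(", ", ", ")"))  // (i_class_a, i_class_b)
{code}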



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2016-03-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13446:
--
Issue Type: Improvement  (was: Bug)

Can't you build against the newer version of Hive? That much is needed, of 
course; I don't know if it's all that's needed.
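
For reference, a hedged sketch of the existing knobs for pointing Spark SQL at
a different Hive metastore client (values illustrative); per the discussion
here, these alone are not expected to be enough for 2.0.0 without code changes.

{code}
// Select the metastore client version and where its jars come from.
val conf = new org.apache.spark.SparkConf()
  .set("spark.sql.hive.metastore.version", "1.2.1")
  .set("spark.sql.hive.metastore.jars", "builtin")
{code}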

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has 
> been released, it's better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13642) Inconsistent finishing state between driver and AM

2016-03-02 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177347#comment-15177347
 ] 

Saisai Shao commented on SPARK-13642:
-

[~tgraves] [~vanzin], would you please comment on this: why is the default 
application final state "SUCCESS"? Would it be better to mark this application 
as "SUCCESS" only after the user class has exited? Thanks a lot.

> Inconsistent finishing state between driver and AM 
> ---
>
> Key: SPARK-13642
> URL: https://issues.apache.org/jira/browse/SPARK-13642
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> Currently when running Spark on Yarn in yarn-cluster mode, the default 
> application final state is "SUCCEED"; if any exception occurs, this final 
> state is changed to "FAILED" and a reattempt is triggered if possible.
> This is OK in the normal case, but there is a race condition: if the AM 
> receives a signal (SIGTERM) and no exception occurs, the shutdown hook will 
> be invoked and mark this application as finished successfully, and there is 
> no further attempt.
> In such a situation, from Spark's point of view the application has failed 
> and needs another attempt, but from Yarn's point of view the application has 
> finished successfully.
> This can happen when an NM fails: the NM failure sends SIGTERM to the AM, 
> and the AM should mark this attempt as failed and rerun it, not invoke 
> unregister.
> To increase the chance of hitting this race condition, here is code to 
> reproduce it:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM fails while sleeping, no exception is thrown, so from Yarn's point 
> of view the application finished successfully, but from Spark's point of 
> view it should be reattempted.
> So basically, I think we should mark the application as "SUCCESS" only after 
> the user class has finished; otherwise, especially in the signal-stopped 
> scenario, it would be better to mark it as failed and try again (except for 
> an explicit KILL command from yarn).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13642) Inconsistent finishing state between driver and AM

2016-03-02 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-13642:

Description: 
Currently when running Spark on Yarn in yarn-cluster mode, the default 
application final state is "SUCCEED"; if any exception occurs, this final state 
is changed to "FAILED" and a reattempt is triggered if possible.

This is OK in the normal case, but there is a race condition: if the AM 
receives a signal (SIGTERM) and no exception occurs, the shutdown hook will be 
invoked and mark this application as finished successfully, and there is no 
further attempt.

In such a situation, from Spark's point of view the application has failed and 
needs another attempt, but from Yarn's point of view the application has 
finished successfully.

This can happen when an NM fails: the NM failure sends SIGTERM to the AM, and 
the AM should mark this attempt as failed and rerun it, not invoke unregister.

To increase the chance of hitting this race condition, here is code to 
reproduce it:

{code}
val sc = ...
Thread.sleep(3L)
sc.parallelize(1 to 100).collect()
{code}

If the AM fails while sleeping, no exception is thrown, so from Yarn's point of 
view the application finished successfully, but from Spark's point of view it 
should be reattempted.

So basically, I think we should mark the application as "SUCCESS" only after 
the user class has finished; otherwise, especially in the signal-stopped 
scenario, it would be better to mark it as failed and try again (except for an 
explicit KILL command from yarn).


  was:
Currently when running Spark on Yarn with yarn cluster mode, the default 
application final state is "SUCCEED", if any exception is occurred, this final 
state will be changed to "FAILED" and trigger the reattempt if possible. 

This is OK in normal case, but there's a race condition when AM received a 
signal (SIGTERM), no any exception is occurred. In this situation, shutdown 
hook will be invoked and marked this application as finished with success, and 
there's no another attempt.

In such situation, actually from Spark's aspect this application is failed and 
need another attempt, but from Yarn's aspect the application is finished with 
success.

This could happened in NM failure situation, the failure of NM will send 
SIGTERM to AM, AM should make this attempt as failure and rerun again, not 
invoke unregister.

So to increase the chance of this race condition, here is the reproduced code:

{code}
val sc = ...
Thread.sleep(3L)
sc.parallelize(1 to 100).collect()
{code}

If the AM is failed in sleeping, there's no exception been thrown, so from 
Yarn's point this application is finished successfully, but from Spark's point, 
this application should be reattempted.

So basically, I think only after the finish of user class, we could mark this 
application as "SUCCESS", otherwise, especially in the signal stopped scenario, 
it would be better to mark as failed and try again (except explicitly KILL 
command by yarn).



> Inconsistent finishing state between driver and AM 
> ---
>
> Key: SPARK-13642
> URL: https://issues.apache.org/jira/browse/SPARK-13642
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> Currently when running Spark on Yarn in yarn-cluster mode, the default 
> application final state is "SUCCEED"; if any exception occurs, this final 
> state is changed to "FAILED" and a reattempt is triggered if possible.
> This is OK in the normal case, but there is a race condition: if the AM 
> receives a signal (SIGTERM) and no exception occurs, the shutdown hook will 
> be invoked and mark this application as finished successfully, and there is 
> no further attempt.
> In such a situation, from Spark's point of view the application has failed 
> and needs another attempt, but from Yarn's point of view the application has 
> finished successfully.
> This can happen when an NM fails: the NM failure sends SIGTERM to the AM, 
> and the AM should mark this attempt as failed and rerun it, not invoke 
> unregister.
> To increase the chance of hitting this race condition, here is code to 
> reproduce it:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM fails while sleeping, no exception is thrown, so from Yarn's point 
> of view the application finished successfully, but from Spark's point of 
> view it should be reattempted.
> So basically, I think we should mark the application as "SUCCESS" only after 
> the user class has finished; otherwise, especially in the signal-stopped 
> scenario, it would be better to mark it as failed and try again (except for 
> an explicit KILL command from yarn).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (SPARK-13642) Inconsistent finishing state between driver and AM

2016-03-02 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-13642:
---

 Summary: Inconsistent finishing state between driver and AM 
 Key: SPARK-13642
 URL: https://issues.apache.org/jira/browse/SPARK-13642
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.6.0
Reporter: Saisai Shao


Currently when running Spark on Yarn in yarn-cluster mode, the default 
application final state is "SUCCEED"; if any exception occurs, this final state 
is changed to "FAILED" and a reattempt is triggered if possible.

This is OK in the normal case, but there is a race condition: if the AM 
receives a signal (SIGTERM) and no exception occurs, the shutdown hook will be 
invoked and mark this application as finished successfully, and there is no 
further attempt.

In such a situation, from Spark's point of view the application has failed and 
needs another attempt, but from Yarn's point of view the application has 
finished successfully.

This can happen when an NM fails: the NM failure sends SIGTERM to the AM, and 
the AM should mark this attempt as failed and rerun it, not invoke unregister.

To increase the chance of hitting this race condition, here is code to 
reproduce it:

{code}
val sc = ...
Thread.sleep(3L)
sc.parallelize(1 to 100).collect()
{code}

If the AM fails while sleeping, no exception is thrown, so from Yarn's point of 
view the application finished successfully, but from Spark's point of view it 
should be reattempted.

So basically, I think we should mark the application as "SUCCESS" only after 
the user class has finished; otherwise, especially in the signal-stopped 
scenario, it would be better to mark it as failed and try again (except for an 
explicit KILL command from yarn).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13621) TestExecutor.scala needs to be moved to test package

2016-03-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13621.
-
   Resolution: Fixed
 Assignee: Devaraj K
Fix Version/s: 2.0.0

> TestExecutor.scala needs to be moved to test package
> 
>
> Key: SPARK-13621
> URL: https://issues.apache.org/jira/browse/SPARK-13621
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Devaraj K
>Assignee: Devaraj K
>Priority: Trivial
> Fix For: 2.0.0
>
>
> TestExecutor.scala is in the package 
> core\src\main\scala\org\apache\spark\deploy\client\ and it is used only by 
> test classes. It needs to be moved to the test package, i.e. 
> core\src\test\scala\org\apache\spark\deploy\client\, since its purpose is 
> testing.
> Also, core\src\main\scala\org\apache\spark\deploy\client\TestClient.scala is 
> not used anywhere but is present in src; I think it can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-02 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177315#comment-15177315
 ] 

Mark Grover commented on SPARK-12177:
-

One more thing as a potential con for Proposal 1:
There are places that have to use the kafka artifact; the 'examples' subproject 
is a good example of that. The subproject pulls in the kafka artifact as a 
dependency and has examples of Kafka usage. However, it can't depend on the new 
implementation's artifact at the same time because the two depend on different 
versions of kafka. Therefore, unless I am missing something, the new 
implementation's example can't go there.

And that's fine, we can put it within the subproject itself instead of 
examples, but that won't necessarily work with tooling like run-example, etc.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API 
> that is not compatible with the old one. So, I added the new consumer API. I 
> made separate classes in package org.apache.spark.streaming.kafka.v09 with 
> the changed API. I didn't remove the old classes, for better backward 
> compatibility. Users will not need to change their old Spark applications 
> when they upgrade to a new Spark version.
> Please review my changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13616) Let SQLBuilder convert logical plan without a Project on top of it

2016-03-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13616.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.0.0

> Let SQLBuilder convert logical plan without a Project on top of it
> --
>
> Key: SPARK-13616
> URL: https://issues.apache.org/jira/browse/SPARK-13616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> It is possible that a logical plan has had the Project removed from the top 
> of it, or that the plan doesn't have a top Project to begin with. Currently 
> the SQLBuilder can't convert such plans back to SQL. This issue is opened to 
> add this feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-02 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-13641:
-

 Summary: getModelFeatures of ml.api.r.SparkRWrapper cannot 
(always) reveal the original column names
 Key: SPARK-13641
 URL: https://issues.apache.org/jira/browse/SPARK-13641
 Project: Spark
  Issue Type: Bug
  Components: ML, SparkR
Reporter: Xusen Yin
Priority: Minor


getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original 
column names. Let's take the HouseVotes84 data set as an example:

{code}
case m: XXXModel =>
  val attrs = AttributeGroup.fromStructField(
m.summary.predictions.schema(m.summary.featuresCol))
  attrs.attributes.get.map(_.name.get)
{code}

The code above gets the feature names from the features column. Usually, the 
features column is generated by RFormula, which has a VectorAssembler in it, 
and that causes the output attributes to differ from the original ones.

E.g., we want to get the HouseVotes84 feature names "V1, V2, ..., V16", but 
with RFormula we can only get "V1_n, V2_y, ..., V16_y" because [the transform 
function of 
VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
 appends suffixes to the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2016-03-02 Thread Lifeng Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lifeng Wang updated SPARK-13446:

Summary: Spark need to support reading data from Hive 2.0.0 metastore  
(was: Spark need to support reading data from HIve 2.0.0 metastore)

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has 
> been released, it's better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13640) Synchronize ScalaReflection.mirror method.

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13640:


Assignee: Apache Spark

> Synchronize ScalaReflection.mirror method.
> --
>
> Key: SPARK-13640
> URL: https://issues.apache.org/jira/browse/SPARK-13640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>
> {{ScalaReflection.mirror}} method should be synchronized when scala version 
> is 2.10 because {{universe.runtimeMirror}} is not thread safe.
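
For illustration, a minimal sketch (not Spark's actual code) of the kind of
guard described above: serialize runtime-mirror creation so concurrent callers
on Scala 2.10 don't race.

{code}
import scala.reflect.runtime.{universe => ru}

// Sketch only: synchronize runtimeMirror creation, which is not thread safe on Scala 2.10.
object MirrorGuardSketch {
  private val lock = new Object
  def mirror: ru.Mirror = lock.synchronized {
    ru.runtimeMirror(Thread.currentThread().getContextClassLoader)
  }
}
{code}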



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13640) Synchronize ScalaReflection.mirror method.

2016-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177301#comment-15177301
 ] 

Apache Spark commented on SPARK-13640:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11487

> Synchronize ScalaReflection.mirror method.
> --
>
> Key: SPARK-13640
> URL: https://issues.apache.org/jira/browse/SPARK-13640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> {{ScalaReflection.mirror}} method should be synchronized when scala version 
> is 2.10 because {{universe.runtimeMirror}} is not thread safe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13640) Synchronize ScalaReflection.mirror method.

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13640:


Assignee: (was: Apache Spark)

> Synchronize ScalaReflection.mirror method.
> --
>
> Key: SPARK-13640
> URL: https://issues.apache.org/jira/browse/SPARK-13640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> {{ScalaReflection.mirror}} method should be synchronized when scala version 
> is 2.10 because {{universe.runtimeMirror}} is not thread safe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13449) Naive Bayes wrapper in SparkR

2016-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177298#comment-15177298
 ] 

Apache Spark commented on SPARK-13449:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11486

> Naive Bayes wrapper in SparkR
> -
>
> Key: SPARK-13449
> URL: https://issues.apache.org/jira/browse/SPARK-13449
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Following SPARK-13011, we can add a wrapper for naive Bayes in SparkR. R's 
> naive Bayes implementation is from package e1071 with signature:
> {code}
> ## S3 method for class 'formula'
> naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
> ## Default S3 method, which we don't want to support
> # naiveBayes(x, y, laplace = 0, ...)
> ## S3 method for class 'naiveBayes'
> predict(object, newdata,
>   type = c("class", "raw"), threshold = 0.001, eps = 0, ...)
> {code}
> It should be easy for us to match the parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13449) Naive Bayes wrapper in SparkR

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13449:


Assignee: Xusen Yin  (was: Apache Spark)

> Naive Bayes wrapper in SparkR
> -
>
> Key: SPARK-13449
> URL: https://issues.apache.org/jira/browse/SPARK-13449
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Following SPARK-13011, we can add a wrapper for naive Bayes in SparkR. R's 
> naive Bayes implementation is from package e1071 with signature:
> {code}
> ## S3 method for class 'formula'
> naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
> ## Default S3 method, which we don't want to support
> # naiveBayes(x, y, laplace = 0, ...)
> ## S3 method for class 'naiveBayes'
> predict(object, newdata,
>   type = c("class", "raw"), threshold = 0.001, eps = 0, ...)
> {code}
> It should be easy for us to match the parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13449) Naive Bayes wrapper in SparkR

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13449:


Assignee: Apache Spark  (was: Xusen Yin)

> Naive Bayes wrapper in SparkR
> -
>
> Key: SPARK-13449
> URL: https://issues.apache.org/jira/browse/SPARK-13449
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> Following SPARK-13011, we can add a wrapper for naive Bayes in SparkR. R's 
> naive Bayes implementation is from package e1071 with signature:
> {code}
> ## S3 method for class 'formula'
> naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
> ## Default S3 method, which we don't want to support
> # naiveBayes(x, y, laplace = 0, ...)
> ## S3 method for class 'naiveBayes'
> predict(object, newdata,
>   type = c("class", "raw"), threshold = 0.001, eps = 0, ...)
> {code}
> It should be easy for us to match the parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13631) getPreferredLocations race condition in spark 1.6.0?

2016-03-02 Thread Andy Sloane (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177261#comment-15177261
 ] 

Andy Sloane commented on SPARK-13631:
-

Did some digging with git bisect.

It turns out to be directly linked to {{spark.shuffle.reduceLocality.enabled}}. 
The difference between Spark 1.6 and 1.5 here is that 1.5 has it {{false}} by 
default, and 1.6 has it {{true}} by default.

Setting it to false cures this in 1.6, and setting it to true causes it to 
re-emerge in 1.5.
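
For anyone hitting this, a minimal sketch of the workaround described above
(disabling the reduce-locality preference whose default changed in 1.6.0):

{code}
// Turn the 1.6.0 default back off.
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.reduceLocality.enabled", "false")
{code}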


> getPreferredLocations race condition in spark 1.6.0?
> 
>
> Key: SPARK-13631
> URL: https://issues.apache.org/jira/browse/SPARK-13631
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Andy Sloane
>
> We are seeing something that looks a lot like a regression from spark 1.2. 
> When we run jobs with multiple threads, we have a crash somewhere inside 
> getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside 
> org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs 
> instead of DAGScheduler directly.
> I tried Spark 1.2 post-SPARK-4454 (before this patch it's only slightly 
> flaky), 1.4.1, and 1.5.2 and all are fine. 1.6.0 immediately crashes on our 
> threaded test case, though once in a while it passes.
> The stack trace is huge, but starts like this:
> Caused by: java.lang.NullPointerException: null
>   at 
> org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406)
>   at 
> org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366)
>   at 
> org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545)
> The full trace is available here:
> https://gist.github.com/andy256/97611f19924bbf65cf49



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13568) Create feature transformer to impute missing values

2016-03-02 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172423#comment-15172423
 ] 

yuhao yang edited comment on SPARK-13568 at 3/3/16 5:48 AM:


Yes, I'm working on supporting numeric values too.

And I agree that the imputation for vectors should check the elements in the 
vector. I intend to support the 3 use cases you mentioned.

I'll send a PR after some refinement and performance benchmarking. Thanks

updated:
created a new jira to discuss how to handle NaN in Statistics


was (Author: yuhaoyan):
Yes, I'm working on supporting numeric values too. 

And I agree about the imputation for vector should check the elements in the 
vector. I intends to support the 3 use cases you mentioned.

I'll send a PR today or tomorrow after some refine and performance benchmark. 
Thanks

> Create feature transformer to impute missing values
> ---
>
> Key: SPARK-13568
> URL: https://issues.apache.org/jira/browse/SPARK-13568
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> It is quite common to encounter missing values in data sets. It would be 
> useful to implement a {{Transformer}} that can impute missing data points, 
> similar to e.g. {{Imputer}} in 
> [scikit-learn|http://scikit-learn.org/dev/modules/preprocessing.html#imputation-of-missing-values].
> Initially, options for imputation could include {{mean}}, {{median}} and 
> {{most frequent}}, but we could add various other approaches. Where possible 
> existing DataFrame code can be used (e.g. for approximate quantiles etc).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13638) Support for saving with a quote mode

2016-03-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-13638:
-
Description: 
https://github.com/databricks/spark-csv/pull/254

tobithiel reported this.

{quote}
I'm dealing with some messy csv files and being able to just quote all fields 
is very useful,
so that other applications don't misunderstand the file because of some sketchy 
characters
{quote}

When writing there are several quote modes in apache commons csv. (See 
https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)

This might have to be supported.

However, it looks univocity parser used for writing (it looks currently only 
this library is supported) does not support this quote mode. I think we can 
drop this backwards compatibility if we are not going to add apache commons csv.

This is a reminder that it will break backwards compatibility for the options, 
{{quoteMode}}.

  was:
https://github.com/databricks/spark-csv/pull/254

tobithiel reported this.

{quote}
I'm dealing with some messy csv files and being able to just quote all fields 
is very useful,
so that other applications don't misunderstand the file because of some sketchy 
characters
{quote}

When writing there are several quote modes in apache commons csv. (See 
https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)

This might have to be supported.

However, it looks univocity parser used for writing (it looks currently only 
this library is supported) does not support this quote mode. I think we can 
drop this backwards compatibility if we are not going to add apache commons csv.

This is a reminder that it will break backwards compatibility for the options, 
{{quoteMode}} and {{parserLib}}.


> Support for saving with a quote mode
> 
>
> Key: SPARK-13638
> URL: https://issues.apache.org/jira/browse/SPARK-13638
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> https://github.com/databricks/spark-csv/pull/254
> tobithiel reported this.
> {quote}
> I'm dealing with some messy csv files and being able to just quote all fields 
> is very useful,
> so that other applications don't misunderstand the file because of some 
> sketchy characters
> {quote}
> When writing there are several quote modes in apache commons csv. (See 
> https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)
> This might have to be supported.
> However, it looks univocity parser used for writing (it looks currently only 
> this library is supported) does not support this quote mode. I think we can 
> drop this backwards compatibility if we are not going to add apache commons 
> csv.
> This is a reminder that it will break backwards compatibility for the 
> options, {{quoteMode}}.
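
For context, a hedged sketch of how the {{quoteMode}} option is used when
saving with the external databricks/spark-csv package (the option whose
backwards compatibility is discussed above), assuming a DataFrame {{df}}; this
is not the built-in 2.0 CSV writer.

{code}
// Quote every field when writing, via spark-csv's commons-csv backed quoteMode.
df.write
  .format("com.databricks.spark.csv")
  .option("quoteMode", "ALL")
  .save("/tmp/out_csv")
{code}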



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13638) Support for saving with a quote mode

2016-03-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-13638:
-
Description: 
https://github.com/databricks/spark-csv/pull/254

tobithiel reported this.

{quote}
I'm dealing with some messy csv files and being able to just quote all fields 
is very useful,
so that other applications don't misunderstand the file because of some sketchy 
characters
{quote}

When writing there are several quote modes in apache commons csv. (See 
https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)

This might have to be supported.

However, it looks univocity parser used for writing does not support this quote 
mode. I think we can drop this backwards compatibility if we are not going to 
add apache commons csv.

This is a reminder that it will break backwards compatibility for the options, 
{{quoteMode}} and {{parserLib}}.

  was:
https://github.com/databricks/spark-csv/pull/254

tobithiel reported this.

>I'm dealing with some messy csv files and being able to just quote all fields 
>is very useful, so that other applications don't misunderstand the file 
>because of some sketchy characters

When writing there are several quote modes in apache commons csv. (See 
https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)

This might have to be supported.

However, it looks univocity parser used for writing does not support this quote 
mode. I think we can drop this backwards compatibility if we are not going to 
add apache commons csv.

This is a reminder that it will break backwards compatibility for the options, 
{{quoteMode}} and {{parserLib}}.


> Support for saving with a quote mode
> 
>
> Key: SPARK-13638
> URL: https://issues.apache.org/jira/browse/SPARK-13638
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> https://github.com/databricks/spark-csv/pull/254
> tobithiel reported this.
> {quote}
> I'm dealing with some messy csv files and being able to just quote all fields 
> is very useful,
> so that other applications don't misunderstand the file because of some 
> sketchy characters
> {quote}
> When writing there are several quote modes in apache commons csv. (See 
> https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)
> This might have to be supported.
> However, it looks like the univocity parser used for writing does not support 
> this quote mode. I think we can drop this backwards compatibility if we are 
> not going to add apache commons csv.
> This is a reminder that it will break backwards compatibility for the 
> options, {{quoteMode}} and {{parserLib}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13637) use more information to simplify the code in Expand builder

2016-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177256#comment-15177256
 ] 

Apache Spark commented on SPARK-13637:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11485

> use more information to simplify the code in Expand builder
> ---
>
> Key: SPARK-13637
> URL: https://issues.apache.org/jira/browse/SPARK-13637
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13638) Support for saving with a quote mode

2016-03-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-13638:
-
Description: 
https://github.com/databricks/spark-csv/pull/254

tobithiel reported this.

{quote}
I'm dealing with some messy csv files and being able to just quote all fields 
is very useful,
so that other applications don't misunderstand the file because of some sketchy 
characters
{quote}

When writing there are several quote modes in apache commons csv. (See 
https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)

This might have to be supported.

However, it looks like the univocity parser used for writing (it looks like this 
is currently the only library supported) does not support this quote mode. I 
think we can drop this backwards compatibility if we are not going to add apache 
commons csv.

This is a reminder that it will break backwards compatibility for the options, 
{{quoteMode}} and {{parserLib}}.

  was:
https://github.com/databricks/spark-csv/pull/254

tobithiel reported this.

{quote}
I'm dealing with some messy csv files and being able to just quote all fields 
is very useful,
so that other applications don't misunderstand the file because of some sketchy 
characters
{quote}

When writing there are several quote modes in apache commons csv. (See 
https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)

This might have to be supported.

However, it looks like the univocity parser used for writing does not support 
this quote mode. I think we can drop this backwards compatibility if we are not 
going to add apache commons csv.

This is a reminder that it will break backwards compatibility for the options, 
{{quoteMode}} and {{parserLib}}.


> Support for saving with a quote mode
> 
>
> Key: SPARK-13638
> URL: https://issues.apache.org/jira/browse/SPARK-13638
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> https://github.com/databricks/spark-csv/pull/254
> tobithiel reported this.
> {quote}
> I'm dealing with some messy csv files and being able to just quote all fields 
> is very useful,
> so that other applications don't misunderstand the file because of some 
> sketchy characters
> {quote}
> When writing there are several quote modes in apache commons csv. (See 
> https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)
> This might have to be supported.
> However, it looks like the univocity parser used for writing (it looks like 
> this is currently the only library supported) does not support this quote 
> mode. I think we can drop this backwards compatibility if we are not going to 
> add apache commons csv.
> This is a reminder that it will break backwards compatibility for the 
> options, {{quoteMode}} and {{parserLib}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13640) Synchronize ScalaReflection.mirror method.

2016-03-02 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-13640:
-

 Summary: Synchronize ScalaReflection.mirror method.
 Key: SPARK-13640
 URL: https://issues.apache.org/jira/browse/SPARK-13640
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin


{{ScalaReflection.mirror}} method should be synchronized when scala version is 
2.10 because {{universe.runtimeMirror}} is not thread safe.
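
Not the actual patch, only a minimal sketch of the idea (the object and method 
names below are made up): serialize access to {{runtimeMirror}} so concurrent 
callers on Scala 2.10 cannot race inside it.

{code:scala}
import scala.reflect.runtime.{universe => ru}

object SafeMirror {
  private val mirrorLock = new Object

  // universe.runtimeMirror is not thread safe on Scala 2.10, so take a lock
  // around every call that creates a mirror.
  def mirror(classLoader: ClassLoader): ru.Mirror = mirrorLock.synchronized {
    ru.runtimeMirror(classLoader)
  }
}

// e.g. SafeMirror.mirror(Thread.currentThread().getContextClassLoader)
{code}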



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13637) use more information to simplify the code in Expand builder

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13637:


Assignee: (was: Apache Spark)

> use more information to simplify the code in Expand builder
> ---
>
> Key: SPARK-13637
> URL: https://issues.apache.org/jira/browse/SPARK-13637
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13637) use more information to simplify the code in Expand builder

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13637:


Assignee: Apache Spark

> use more information to simplify the code in Expand builder
> ---
>
> Key: SPARK-13637
> URL: https://issues.apache.org/jira/browse/SPARK-13637
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13639) Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors

2016-03-02 Thread yuhao yang (JIRA)
yuhao yang created SPARK-13639:
--

 Summary: Statistics.colStats(rdd).mean and variance should handle 
NaN in the input vectors
 Key: SPARK-13639
 URL: https://issues.apache.org/jira/browse/SPARK-13639
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
Priority: Trivial


   val denseData = Array(
  Vectors.dense(3.8, 0.0, 1.8),
  Vectors.dense(1.7, 0.9, 0.0),
  Vectors.dense(Double.NaN, 0, 0.0)
)

val rdd = sc.parallelize(denseData)
println(Statistics.colStats(rdd).mean)

[NaN,0.3,0.6]

This is just a proposal for discussion on how to handle the NaN value in the 
vectors. We can ignore the NaN value in the computation or just output NaN as 
it is now as a warning.
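
A spark-shell sketch ({{sc}} provided by the shell; the {{cleaned}} name is made 
up) of the "ignore NaN" behaviour done by hand today, i.e. dropping vectors that 
contain a NaN before calling {{colStats}}:

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val denseData = Array(
  Vectors.dense(3.8, 0.0, 1.8),
  Vectors.dense(1.7, 0.9, 0.0),
  Vectors.dense(Double.NaN, 0.0, 0.0)
)
val rdd = sc.parallelize(denseData)

// Drop any vector that has a NaN component before computing the summary.
val cleaned = rdd.filter(v => !v.toArray.exists(_.isNaN))
println(Statistics.colStats(cleaned).mean)  // [2.75,0.45,0.9] instead of [NaN,0.3,0.6]
{code}

Whether a built-in option should skip NaNs per column rather than per vector (as 
above) is part of the question.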





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13638) Support for saving with a quote mode

2016-03-02 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-13638:


 Summary: Support for saving with a quote mode
 Key: SPARK-13638
 URL: https://issues.apache.org/jira/browse/SPARK-13638
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Minor


https://github.com/databricks/spark-csv/pull/254

tobithiel reported this.

>I'm dealing with some messy csv files and being able to just quote all fields 
>is very useful, so that other applications don't misunderstand the file 
>because of some sketchy characters

When writing there are several quote modes in apache commons csv. (See 
https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)

This might have to be supported.

However, it looks like the univocity parser used for writing does not support 
this quote mode. I think we can drop this backwards compatibility if we are not 
going to add apache commons csv.

This is a reminder that it will break backwards compatibility for the options, 
{{quoteMode}} and {{parserLib}}.
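
For reference, a minimal sketch (Scala calling the Apache Commons CSV API linked 
above; the record values are made up) of what quote-all writing looks like with 
{{QuoteMode.ALL}}:

{code:scala}
import java.io.StringWriter
import org.apache.commons.csv.{CSVFormat, CSVPrinter, QuoteMode}

// Quote every field, not only the ones that contain delimiters or quote chars.
val format = CSVFormat.DEFAULT.withQuoteMode(QuoteMode.ALL)
val out = new StringWriter()
val printer = new CSVPrinter(out, format)
printer.printRecord("1", "plain text", "sketchy, \"text\"")
printer.close()
println(out.toString)  // "1","plain text","sketchy, ""text"""
{code}

Quoting everything costs some bytes but removes ambiguity for downstream tools, 
which is the reporter's use case.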



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13637) use more information to simplify the code in Expand builder

2016-03-02 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13637:
---

 Summary: use more information to simplify the code in Expand 
builder
 Key: SPARK-13637
 URL: https://issues.apache.org/jira/browse/SPARK-13637
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-03-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13634:
--
Priority: Minor  (was: Major)

I doubt it's a Spark problem; this is more a function of how Scala puts things 
in its closure. Usually you can tinker with equivalent code to find a different 
version that works as expected. For example, declare a def containing the 
function you want to map -- that may happen to work.
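
An untested sketch of that workaround applied to the snippet in the description 
(spark-shell, {{sc}} provided by the shell; {{addTemp}}, {{localTemp}} and 
{{newRDD}} are just names for the example):

{code:scala}
val temp = 10
val newSC = sc

// Copy the captured value into a local inside a def, so the closure only needs
// to serialize an Int instead of the REPL line object that also holds newSC.
def addTemp(rdd: org.apache.spark.rdd.RDD[Int]): org.apache.spark.rdd.RDD[Int] = {
  val localTemp = temp
  rdd.map(p => p + localTemp)
}

val newRDD = addTemp(newSC.parallelize(0 to 100))
newRDD.count()
{code}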

> Assigning spark context to variable results in serialization error
> --
>
> Key: SPARK-13634
> URL: https://issues.apache.org/jira/browse/SPARK-13634
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Rahul Palamuttam
>Priority: Minor
>
> The following lines of code cause a task serialization error when executed in 
> the spark-shell. 
> Note that the error does not occur when submitting the code as a batch job - 
> via spark-submit.
> val temp = 10
> val newSC = sc
> val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)
> For some reason, when temp is being pulled into the referencing environment 
> of the closure, so is the SparkContext. 
> We originally hit this issue in the SciSpark project, when referencing a 
> string variable inside of a lambda expression in RDD.map(...)
> Any insight into how this could be resolved would be appreciated.
> While the above code is trivial, SciSpark uses a wrapper around the 
> SparkContext to read from various file formats. We want to keep this class 
> structure and also use it in notebook and shell environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes

2016-03-02 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177169#comment-15177169
 ] 

Bryan Cutler commented on SPARK-13602:
--

Great! Thanks :D

> o.a.s.deploy.worker.DriverRunner may leak the driver processes
> --
>
> Key: SPARK-13602
> URL: https://issues.apache.org/jira/browse/SPARK-13602
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> If Worker calls "System.exit", DriverRunner will not kill the driver 
> processes. We should add a shutdown hook in DriverRunner like 
> o.a.s.deploy.worker.ExecutorRunner 
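
Not the actual fix, only an illustration of the shutdown-hook idea using the 
plain JVM API ({{DriverProcessGuard}} and {{driverProcess}} are invented names; 
the real change would live inside DriverRunner):

{code:scala}
object DriverProcessGuard {
  // Handle to the launched driver process so the hook can reach it later.
  @volatile var driverProcess: Option[Process] = None

  private val killDriverHook = new Thread("worker-shutdown-kill-driver") {
    override def run(): Unit = driverProcess.foreach(_.destroy())
  }

  // Runs when the Worker JVM exits, including exits via System.exit.
  Runtime.getRuntime.addShutdownHook(killDriverHook)
}
{code}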



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer

2016-03-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13600:
--
Assignee: Oliver Pierson

> Incorrect number of buckets in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>Assignee: Oliver Pierson
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the 
> correct splits resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitsCandidates 
> method.
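
For reference, a quick sanity check on the bucket count: a Bucketizer with 
{{n + 1}} split points produces {{n}} buckets, so a correct 5-bucket result for 
the example above should have 6 boundaries (the inner values below are 
illustrative, not the expected output of the fix):

{code:scala}
val expectedSplits = Array(Double.NegativeInfinity, 2.0, 4.0, 6.0, 8.0, Double.PositiveInfinity)
val numBuckets = expectedSplits.length - 1  // 5, whereas the reported array yields 7 - 1 = 6
{code}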



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13636) Direct consume UnsafeRow in wholestage codegen plans

2016-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177146#comment-15177146
 ] 

Apache Spark commented on SPARK-13636:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/11484

> Direct consume UnsafeRow in wholestage codegen plans
> 
>
> Key: SPARK-13636
> URL: https://issues.apache.org/jira/browse/SPARK-13636
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> As shown in the wholestage codegen version of the Sort operator, when Sort is 
> on top of Exchange (or another operator that produces UnsafeRow), we create 
> variables from the UnsafeRow and then create another UnsafeRow using these 
> variables. We should avoid this unnecessary unpacking and repacking of 
> variables from UnsafeRows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13636) Direct consume UnsafeRow in wholestage codegen plans

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13636:


Assignee: Apache Spark

> Direct consume UnsafeRow in wholestage codegen plans
> 
>
> Key: SPARK-13636
> URL: https://issues.apache.org/jira/browse/SPARK-13636
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> As shown in the wholestage codegen version of the Sort operator, when Sort is 
> on top of Exchange (or another operator that produces UnsafeRow), we create 
> variables from the UnsafeRow and then create another UnsafeRow using these 
> variables. We should avoid this unnecessary unpacking and repacking of 
> variables from UnsafeRows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13636) Direct consume UnsafeRow in wholestage codegen plans

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13636:


Assignee: (was: Apache Spark)

> Direct consume UnsafeRow in wholestage codegen plans
> 
>
> Key: SPARK-13636
> URL: https://issues.apache.org/jira/browse/SPARK-13636
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> As shown in the wholestage codegen version of the Sort operator, when Sort is 
> on top of Exchange (or another operator that produces UnsafeRow), we create 
> variables from the UnsafeRow and then create another UnsafeRow using these 
> variables. We should avoid this unnecessary unpacking and repacking of 
> variables from UnsafeRows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13627) Fix simple deprecation warnings

2016-03-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13627.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> Fix simple deprecation warnings
> ---
>
> Key: SPARK-13627
> URL: https://issues.apache.org/jira/browse/SPARK-13627
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, SQL, YARN
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.0
>
>
> This issue aims to fix the following deprecation warnings.
>   * MethodSymbolApi.paramss -> paramLists
>   * AnnotationApi.tpe -> tree.tpe
>   * BufferLike.readOnly -> toList
>   * StandardNames.nme -> termNames
>   * scala.tools.nsc.interpreter.AbstractFileClassLoader -> 
> scala.reflect.internal.util.AbstractFileClassLoader
>   * TypeApi.declarations -> decls



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13636) Direct consume UnsafeRow in wholestage codegen plans

2016-03-02 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-13636:
---

 Summary: Direct consume UnsafeRow in wholestage codegen plans
 Key: SPARK-13636
 URL: https://issues.apache.org/jira/browse/SPARK-13636
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


As shown in the wholestage codegen version of the Sort operator, when Sort is on 
top of Exchange (or another operator that produces UnsafeRow), we create 
variables from the UnsafeRow and then create another UnsafeRow using these 
variables. We should avoid this unnecessary unpacking and repacking of variables 
from UnsafeRows.
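
Purely illustrative pseudo-generated code (not what Spark actually emits) to 
show the unpack/repack this proposes to skip:

{noformat}
// current pattern: copy each field out of the child's UnsafeRow ...
val value0 = inputRow.getInt(0)
val value1 = inputRow.getLong(1)
// ... and immediately write the same fields into a fresh UnsafeRow
rowWriter.write(0, value0)
rowWriter.write(1, value1)

// proposed: when the child already produces an UnsafeRow, hand inputRow
// to the parent operator directly and skip the copy
{noformat}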



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13617) remove unnecessary GroupingAnalytics trait

2016-03-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13617.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.0.0

> remove unnecessary GroupingAnalytics trait
> --
>
> Key: SPARK-13617
> URL: https://issues.apache.org/jira/browse/SPARK-13617
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-03-02 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma closed SPARK-13634.
---
Resolution: Duplicate

> Assigning spark context to variable results in serialization error
> --
>
> Key: SPARK-13634
> URL: https://issues.apache.org/jira/browse/SPARK-13634
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Rahul Palamuttam
>
> The following lines of code cause a task serialization error when executed in 
> the spark-shell. 
> Note that the error does not occur when submitting the code as a batch job - 
> via spark-submit.
> val temp = 10
> val newSC = sc
> val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)
> For some reason, when temp is being pulled into the referencing environment 
> of the closure, so is the SparkContext. 
> We originally hit this issue in the SciSpark project, when referencing a 
> string variable inside of a lambda expression in RDD.map(...)
> Any insight into how this could be resolved would be appreciated.
> While the above code is trivial, SciSpark uses a wrapper around the 
> SparkContext to read from various file formats. We want to keep this class 
> structure and also use it in notebook and shell environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13635:


Assignee: (was: Apache Spark)

> Enable LimitPushdown optimizer rule because we have whole-stage codegen for 
> Limit
> -
>
> Key: SPARK-13635
> URL: https://issues.apache.org/jira/browse/SPARK-13635
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> LimitPushdown optimizer rule has been disabled due to no whole-stage codegen 
> for Limit. As we have whole-stage codegen for Limit now, we should enable it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13635:


Assignee: Apache Spark

> Enable LimitPushdown optimizer rule because we have whole-stage codegen for 
> Limit
> -
>
> Key: SPARK-13635
> URL: https://issues.apache.org/jira/browse/SPARK-13635
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> LimitPushdown optimizer rule has been disabled due to no whole-stage codegen 
> for Limit. As we have whole-stage codegen for Limit now, we should enable it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit

2016-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177108#comment-15177108
 ] 

Apache Spark commented on SPARK-13635:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/11483

> Enable LimitPushdown optimizer rule because we have whole-stage codegen for 
> Limit
> -
>
> Key: SPARK-13635
> URL: https://issues.apache.org/jira/browse/SPARK-13635
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> LimitPushdown optimizer rule has been disabled due to no whole-stage codegen 
> for Limit. As we have whole-stage codegen for Limit now, we should enable it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit

2016-03-02 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-13635:
---

 Summary: Enable LimitPushdown optimizer rule because we have 
whole-stage codegen for Limit
 Key: SPARK-13635
 URL: https://issues.apache.org/jira/browse/SPARK-13635
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


LimitPushdown optimizer rule has been disabled due to no whole-stage codegen 
for Limit. As we have whole-stage codegen for Limit now, we should enable it.
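
For context, a spark-shell sketch ({{sqlContext}} provided by the shell) of a 
query shape where the pushdown matters; the expected plan change is an 
assumption based on the rule's name, not taken from the implementation:

{code:scala}
val df1 = sqlContext.range(0L, 1000L)
val df2 = sqlContext.range(1000L, 2000L)

// With LimitPushdown enabled, a LocalLimit should also appear inside each
// union branch instead of only above the union.
val limited = df1.unionAll(df2).limit(10)
limited.explain(true)
{code}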



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13593) improve the `toDF()` method to accept data type string and verify the data

2016-03-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-13593:

Summary: improve the `toDF()` method to accept data type string and verify 
the data  (was: add a `schema()` method to convert python RDD to DataFrame 
easily)

> improve the `toDF()` method to accept data type string and verify the data
> --
>
> Key: SPARK-13593
> URL: https://issues.apache.org/jira/browse/SPARK-13593
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-03-02 Thread Rahul Palamuttam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rahul Palamuttam updated SPARK-13634:
-
Description: 
The following lines of code cause a task serialization error when executed in 
the spark-shell. 
Note that the error does not occur when submitting the code as a batch job - 
via spark-submit.

val temp = 10
val newSC = sc
val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)

For some reason, when temp is being pulled into the referencing environment of 
the closure, so is the SparkContext. 

We originally hit this issue in the SciSpark project, when referencing a string 
variable inside of a lambda expression in RDD.map(...)

Any insight into how this could be resolved would be appreciated.
While the above code is trivial, SciSpark uses a wrapper around the 
SparkContext to read from various file formats. We want to keep this class 
structure and also use it in notebook and shell environments.

  was:
The following lines of code cause a task serialization error when executed in 
the spark-shell. Note that the error does not occur when submitting the code as 
a batch job - via spark-submit.

val temp = 10
val newSC = sc
val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)

For some reason, when temp is being pulled into the referencing environment of 
the closure, so is the SparkContext. 

We originally hit this issue in the SciSpark project, when referencing a string 
variable inside of a lambda expression in RDD.map(...)

Any insight into how this could be resolved would be appreciated.
While the above code is trivial, SciSpark uses wrapper around the SparkContext 
to read from various file formats. We want to keep this class structure and 
also use it notebook and shell environments.


> Assigning spark context to variable results in serialization error
> --
>
> Key: SPARK-13634
> URL: https://issues.apache.org/jira/browse/SPARK-13634
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Rahul Palamuttam
>
> The following lines of code cause a task serialization error when executed in 
> the spark-shell. 
> Note that the error does not occur when submitting the code as a batch job - 
> via spark-submit.
> val temp = 10
> val newSC = sc
> val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)
> For some reason, when temp is being pulled into the referencing environment 
> of the closure, so is the SparkContext. 
> We originally hit this issue in the SciSpark project, when referencing a 
> string variable inside of a lambda expression in RDD.map(...)
> Any insight into how this could be resolved would be appreciated.
> While the above code is trivial, SciSpark uses a wrapper around the 
> SparkContext to read from various file formats. We want to keep this class 
> structure and also use it in notebook and shell environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-03-02 Thread Rahul Palamuttam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177093#comment-15177093
 ] 

Rahul Palamuttam commented on SPARK-13634:
--

[~chrismattmann]

> Assigning spark context to variable results in serialization error
> --
>
> Key: SPARK-13634
> URL: https://issues.apache.org/jira/browse/SPARK-13634
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Rahul Palamuttam
>
> The following lines of code cause a task serialization error when executed in 
> the spark-shell. Note that the error does not occur when submitting the code 
> as a batch job - via spark-submit.
> val temp = 10
> val newSC = sc
> val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)
> For some reason, when temp is being pulled into the referencing environment 
> of the closure, so is the SparkContext. 
> We originally hit this issue in the SciSpark project, when referencing a 
> string variable inside of a lambda expression in RDD.map(...)
> Any insight into how this could be resolved would be appreciated.
> While the above code is trivial, SciSpark uses wrapper around the 
> SparkContext to read from various file formats. We want to keep this class 
> structure and also use it notebook and shell environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-03-02 Thread Rahul Palamuttam (JIRA)
Rahul Palamuttam created SPARK-13634:


 Summary: Assigning spark context to variable results in 
serialization error
 Key: SPARK-13634
 URL: https://issues.apache.org/jira/browse/SPARK-13634
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Reporter: Rahul Palamuttam


The following lines of code cause a task serialization error when executed in 
the spark-shell. Note that the error does not occur when submitting the code as 
a batch job - via spark-submit.

val temp = 10
val newSC = sc
val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)

For some reason, when temp is being pulled into the referencing environment of 
the closure, so is the SparkContext. 

We originally hit this issue in the SciSpark project, when referencing a string 
variable inside of a lambda expression in RDD.map(...)

Any insight into how this could be resolved would be appreciated.
While the above code is trivial, SciSpark uses wrapper around the SparkContext 
to read from various file formats. We want to keep this class structure and 
also use it notebook and shell environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13632) Create new o.a.s.sql.execution.commands package

2016-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177082#comment-15177082
 ] 

Apache Spark commented on SPARK-13632:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/11482

> Create new o.a.s.sql.execution.commands package
> ---
>
> Key: SPARK-13632
> URL: https://issues.apache.org/jira/browse/SPARK-13632
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13632) Create new o.a.s.sql.execution.commands package

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13632:


Assignee: Apache Spark  (was: Andrew Or)

> Create new o.a.s.sql.execution.commands package
> ---
>
> Key: SPARK-13632
> URL: https://issues.apache.org/jira/browse/SPARK-13632
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13632) Create new o.a.s.sql.execution.commands package

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13632:


Assignee: Andrew Or  (was: Apache Spark)

> Create new o.a.s.sql.execution.commands package
> ---
>
> Key: SPARK-13632
> URL: https://issues.apache.org/jira/browse/SPARK-13632
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13633) Move parser classes to o.a.s.sql.catalyst.parser package

2016-03-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13633:
--
Summary: Move parser classes to o.a.s.sql.catalyst.parser package  (was: 
Create new o.a.s.sql.catalyst.parser package)

> Move parser classes to o.a.s.sql.catalyst.parser package
> 
>
> Key: SPARK-13633
> URL: https://issues.apache.org/jira/browse/SPARK-13633
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13633) Create new o.a.s.sql.catalyst.parser package

2016-03-02 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13633:
-

 Summary: Create new o.a.s.sql.catalyst.parser package
 Key: SPARK-13633
 URL: https://issues.apache.org/jira/browse/SPARK-13633
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Andrew Or
Assignee: Andrew Or






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13632) Create new o.a.s.sql.execution.commands package

2016-03-02 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13632:
-

 Summary: Create new o.a.s.sql.execution.commands package
 Key: SPARK-13632
 URL: https://issues.apache.org/jira/browse/SPARK-13632
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Andrew Or
Assignee: Andrew Or






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13614) show() trigger memory leak,why?

2016-03-02 Thread chillon_m (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175516#comment-15175516
 ] 

chillon_m edited comment on SPARK-13614 at 3/3/16 2:16 AM:
---

@[~srowen]
For the same size of dataset (hot.count()=599147, ghot.size=21844, 10 bytes/row), 
collect() doesn't trigger the memory leak (first image), but show() triggers it. 
Why? In general it is collect() that should trigger it easily ("Keep in mind that 
your entire dataset must fit in memory on a single machine to use collect() on 
it, so collect() shouldn’t be used on large datasets."), yet collect() doesn't 
trigger it here.



was (Author: chillon_m):
@[~srowen]
the same size of dataset(hot.count()=599147,ghot.size=21844),collect don't 
trigger memory leak(first image),but show() trigger it.why?in general,collect 
trigger it easily("Keep in mind that your entire dataset must fit in memory on 
a single machine to use collect() on it, so collect() shouldn’t be used on 
large datasets." in ),but collect don't trigger.


> show() trigger memory leak,why?
> ---
>
> Key: SPARK-13614
> URL: https://issues.apache.org/jira/browse/SPARK-13614
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: chillon_m
> Attachments: memory leak.png, memory.png
>
>
> hot.count()=599147
> ghot.size=21844
> [bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell 
> --driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar 
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>   /_/
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val hot=sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:mysql://:/?user=&password=","dbtable" -> "")).load()
> Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> hot: org.apache.spark.sql.DataFrame = []
> scala> val ghot=hot.groupBy("Num","pNum").count().collect()
> Wed Mar 02 14:22:59 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> ghot: Array[org.apache.spark.sql.Row] = Array([[],[],[], [,42310...
> scala> ghot.take(20)
> res0: Array[org.apache.spark.sql.Row] = Array([],[],[],[],[],[],[],[])
> scala> hot.groupBy("Num","pNum").count().show()
> Wed Mar 02 14:26:05 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> 16/03/02 14:26:33 ERROR Executor: Managed memory leak detected; size = 
> 4194304 bytes, TID = 202
> +--+-+-+
> | QQNum| TroopNum|count|
> +--+-+-+
> |1X|38XXX|1|
> |1X| 5XXX|2|
> |1X|26XXX|6|
> |1X|14XXX|3|
> |1X|41XXX|   14|
> |1X|48XXX|   18|
> |1X|23XXX|2|
> |1X|  XXX|   34|
> |1X|52XXX|1|
> |1X|52XXX|2|
> |1X|49XXX|3|
> |1X|42XXX|3|
> |1X|17XXX|   11|
> |1X|25XXX|  129|
> |1X|13XXX|2|
> |1X|19XXX|1|
> |1X|32XXX|9|
> |1X|38XXX|6|
> |1X|38XXX|   13|
> |1X|30XXX|4|
> +--+-+-+
> only showing top 20 rows



--
This message was sent 

[jira] [Comment Edited] (SPARK-13614) show() trigger memory leak,why?

2016-03-02 Thread chillon_m (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175516#comment-15175516
 ] 

chillon_m edited comment on SPARK-13614 at 3/3/16 2:14 AM:
---

@[~srowen]
the same size of dataset(hot.count()=599147,ghot.size=21844),collect don't 
trigger memory leak(first image),but show() trigger it.why?in general,collect 
trigger it easily("Keep in mind that your entire dataset must fit in memory on 
a single machine to use collect() on it, so collect() shouldn’t be used on 
large datasets." in ),but collect don't trigger.



was (Author: chillon_m):
[~srowen]
the same size of dataset(hot.count()=599147,ghot.size=21844),collect don't 
trigger memory leak(first image),but show() trigger it.why?in general,collect 
trigger it easily("Keep in mind that your entire dataset must fit in memory on 
a single machine to use collect() on it, so collect() shouldn’t be used on 
large datasets." in ),but collect don't trigger.


> show() trigger memory leak,why?
> ---
>
> Key: SPARK-13614
> URL: https://issues.apache.org/jira/browse/SPARK-13614
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: chillon_m
> Attachments: memory leak.png, memory.png
>
>
> hot.count()=599147
> ghot.size=21844
> [bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell 
> --driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar 
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>   /_/
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val hot=sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:mysql://:/?user=&password=","dbtable" -> "")).load()
> Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> hot: org.apache.spark.sql.DataFrame = []
> scala> val ghot=hot.groupBy("Num","pNum").count().collect()
> Wed Mar 02 14:22:59 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> ghot: Array[org.apache.spark.sql.Row] = Array([[],[],[], [,42310...
> scala> ghot.take(20)
> res0: Array[org.apache.spark.sql.Row] = Array([],[],[],[],[],[],[],[])
> scala> hot.groupBy("Num","pNum").count().show()
> Wed Mar 02 14:26:05 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> 16/03/02 14:26:33 ERROR Executor: Managed memory leak detected; size = 
> 4194304 bytes, TID = 202
> +--+-+-+
> | QQNum| TroopNum|count|
> +--+-+-+
> |1X|38XXX|1|
> |1X| 5XXX|2|
> |1X|26XXX|6|
> |1X|14XXX|3|
> |1X|41XXX|   14|
> |1X|48XXX|   18|
> |1X|23XXX|2|
> |1X|  XXX|   34|
> |1X|52XXX|1|
> |1X|52XXX|2|
> |1X|49XXX|3|
> |1X|42XXX|3|
> |1X|17XXX|   11|
> |1X|25XXX|  129|
> |1X|13XXX|2|
> |1X|19XXX|1|
> |1X|32XXX|9|
> |1X|38XXX|6|
> |1X|38XXX|   13|
> |1X|30XXX|4|
> +--+-+-+
> only showing top 20 rows



--
This message was sent by Atlassian 

[jira] [Comment Edited] (SPARK-13614) show() trigger memory leak,why?

2016-03-02 Thread chillon_m (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175516#comment-15175516
 ] 

chillon_m edited comment on SPARK-13614 at 3/3/16 2:14 AM:
---

[~srowen]
the same size of dataset(hot.count()=599147,ghot.size=21844),collect don't 
trigger memory leak(first image),but show() trigger it.why?in general,collect 
trigger it easily("Keep in mind that your entire dataset must fit in memory on 
a single machine to use collect() on it, so collect() shouldn’t be used on 
large datasets." in ),but collect don't trigger.



was (Author: chillon_m):
the same size of dataset,collect don't trigger memory leak(first image),but 
show() trigger it.why?

> show() trigger memory leak,why?
> ---
>
> Key: SPARK-13614
> URL: https://issues.apache.org/jira/browse/SPARK-13614
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: chillon_m
> Attachments: memory leak.png, memory.png
>
>
> hot.count()=599147
> ghot.size=21844
> [bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell 
> --driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar 
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>   /_/
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val hot=sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:mysql://:/?user=&password=","dbtable" -> "")).load()
> Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> hot: org.apache.spark.sql.DataFrame = []
> scala> val ghot=hot.groupBy("Num","pNum").count().collect()
> Wed Mar 02 14:22:59 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> ghot: Array[org.apache.spark.sql.Row] = Array([[],[],[], [,42310...
> scala> ghot.take(20)
> res0: Array[org.apache.spark.sql.Row] = Array([],[],[],[],[],[],[],[])
> scala> hot.groupBy("Num","pNum").count().show()
> Wed Mar 02 14:26:05 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> 16/03/02 14:26:33 ERROR Executor: Managed memory leak detected; size = 
> 4194304 bytes, TID = 202
> +--+-+-+
> | QQNum| TroopNum|count|
> +--+-+-+
> |1X|38XXX|1|
> |1X| 5XXX|2|
> |1X|26XXX|6|
> |1X|14XXX|3|
> |1X|41XXX|   14|
> |1X|48XXX|   18|
> |1X|23XXX|2|
> |1X|  XXX|   34|
> |1X|52XXX|1|
> |1X|52XXX|2|
> |1X|49XXX|3|
> |1X|42XXX|3|
> |1X|17XXX|   11|
> |1X|25XXX|  129|
> |1X|13XXX|2|
> |1X|19XXX|1|
> |1X|32XXX|9|
> |1X|38XXX|6|
> |1X|38XXX|   13|
> |1X|30XXX|4|
> +--+-+-+
> only showing top 20 rows



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-13614) show() trigger memory leak,why?

2016-03-02 Thread chillon_m (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chillon_m updated SPARK-13614:
--
Comment: was deleted

(was: the same size of dataset,collect don't trigger memory leak(first 
image),but show() trigger it.why?)

> show() trigger memory leak,why?
> ---
>
> Key: SPARK-13614
> URL: https://issues.apache.org/jira/browse/SPARK-13614
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: chillon_m
> Attachments: memory leak.png, memory.png
>
>
> hot.count()=599147
> ghot.size=21844
> [bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell 
> --driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar 
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>   /_/
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val hot=sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:mysql://:/?user=&password=","dbtable" -> "")).load()
> Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> hot: org.apache.spark.sql.DataFrame = []
> scala> val ghot=hot.groupBy("Num","pNum").count().collect()
> Wed Mar 02 14:22:59 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> ghot: Array[org.apache.spark.sql.Row] = Array([[],[],[], [,42310...
> scala> ghot.take(20)
> res0: Array[org.apache.spark.sql.Row] = Array([],[],[],[],[],[],[],[])
> scala> hot.groupBy("Num","pNum").count().show()
> Wed Mar 02 14:26:05 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> 16/03/02 14:26:33 ERROR Executor: Managed memory leak detected; size = 
> 4194304 bytes, TID = 202
> +--+-+-+
> | QQNum| TroopNum|count|
> +--+-+-+
> |1X|38XXX|1|
> |1X| 5XXX|2|
> |1X|26XXX|6|
> |1X|14XXX|3|
> |1X|41XXX|   14|
> |1X|48XXX|   18|
> |1X|23XXX|2|
> |1X|  XXX|   34|
> |1X|52XXX|1|
> |1X|52XXX|2|
> |1X|49XXX|3|
> |1X|42XXX|3|
> |1X|17XXX|   11|
> |1X|25XXX|  129|
> |1X|13XXX|2|
> |1X|19XXX|1|
> |1X|32XXX|9|
> |1X|38XXX|6|
> |1X|38XXX|   13|
> |1X|30XXX|4|
> +--+-+-+
> only showing top 20 rows



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13485) (Dataset-oriented) API evolution in Spark 2.0

2016-03-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13485:

Summary: (Dataset-oriented) API evolution in Spark 2.0  (was: 
Dataset-oriented API foundation in Spark 2.0)

> (Dataset-oriented) API evolution in Spark 2.0
> -
>
> Key: SPARK-13485
> URL: https://issues.apache.org/jira/browse/SPARK-13485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Attachments: API Evolution in Spark 2.0.pdf
>
>
> As part of Spark 2.0, we want to create a stable API foundation for Dataset 
> to become the main user-facing API in Spark. This ticket tracks various tasks 
> related to that.
> The main high level changes are:
> 1. Merge Dataset/DataFrame
> 2. Create a more natural entry point for Dataset (SQLContext is not ideal 
> because of the name "SQL")
> 3. First class support for sessions
> 4. First class support for some system catalog



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13485) (Dataset-oriented) API evolution in Spark 2.0

2016-03-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13485:

Description: 
As part of Spark 2.0, we want to create a stable API foundation for Dataset to 
become the main user-facing API in Spark. This ticket tracks various tasks 
related to that.

The main high level changes are:

1. Merge Dataset/DataFrame
2. Create a more natural entry point for Dataset (SQLContext is not ideal 
because of the name "SQL")
3. First class support for sessions
4. First class support for some system catalog


See the design doc for more details.



  was:
As part of Spark 2.0, we want to create a stable API foundation for Dataset to 
become the main user-facing API in Spark. This ticket tracks various tasks 
related to that.

The main high level changes are:

1. Merge Dataset/DataFrame
2. Create a more natural entry point for Dataset (SQLContext is not ideal 
because of the name "SQL")
3. First class support for sessions
4. First class support for some system catalog





> (Dataset-oriented) API evolution in Spark 2.0
> -
>
> Key: SPARK-13485
> URL: https://issues.apache.org/jira/browse/SPARK-13485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Attachments: API Evolution in Spark 2.0.pdf
>
>
> As part of Spark 2.0, we want to create a stable API foundation for Dataset 
> to become the main user-facing API in Spark. This ticket tracks various tasks 
> related to that.
> The main high level changes are:
> 1. Merge Dataset/DataFrame
> 2. Create a more natural entry point for Dataset (SQLContext is not ideal 
> because of the name "SQL")
> 3. First class support for sessions
> 4. First class support for some system catalog
> See the design doc for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13485) Dataset-oriented API foundation in Spark 2.0

2016-03-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13485:

Attachment: API Evolution in Spark 2.0.pdf

> Dataset-oriented API foundation in Spark 2.0
> 
>
> Key: SPARK-13485
> URL: https://issues.apache.org/jira/browse/SPARK-13485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Attachments: API Evolution in Spark 2.0.pdf
>
>
> As part of Spark 2.0, we want to create a stable API foundation for Dataset 
> to become the main user-facing API in Spark. This ticket tracks various tasks 
> related to that.
> The main high level changes are:
> 1. Merge Dataset/DataFrame
> 2. Create a more natural entry point for Dataset (SQLContext is not ideal 
> because of the name "SQL")
> 3. First class support for sessions
> 4. First class support for some system catalog



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13583) Remove unused imports and add checkstyle rule

2016-03-02 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-13583:
--
Summary: Remove unused imports and add checkstyle rule  (was: Support 
`UnusedImports` Java checkstyle rule)

> Remove unused imports and add checkstyle rule
> -
>
> Key: SPARK-13583
> URL: https://issues.apache.org/jira/browse/SPARK-13583
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, Streaming
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> After SPARK-6990, `dev/lint-java` keeps Java code healthy and saves a lot of 
> time during PR review.
> This issue aims to enforce the `UnusedImports` rule by adding it to 
> `checkstyle.xml` and fixing all existing unused imports.
> {code:title=checkstyle.xml|borderStyle=solid}
> +
> {code}
> Unfortunately, `dev/lint-java` is not tested by Jenkins. ( 
> https://github.com/apache/spark/blob/master/dev/run-tests.py#L546 )
> This will also help Spark contributors check their code by themselves before 
> submitting their PRs.
> According to [~srowen]'s comments, this PR also includes the removal of 
> unused imports in Scala code. That part will be done manually, for the 
> following two reasons. 
>   * Scalastyle does not have `UnusedImport` rule yet.
>   * Scala 2.11.7 has a bug with `-Ywarn-unused-import` option.
> (https://issues.scala-lang.org/browse/SI-9616)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes

2016-03-02 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176844#comment-15176844
 ] 

Shixiong Zhu commented on SPARK-13602:
--

Sure. Go ahead.

> o.a.s.deploy.worker.DriverRunner may leak the driver processes
> --
>
> Key: SPARK-13602
> URL: https://issues.apache.org/jira/browse/SPARK-13602
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> If Worker calls "System.exit", DriverRunner will not kill the driver 
> processes. We should add a shutdown hook in DriverRunner, like 
> o.a.s.deploy.worker.ExecutorRunner does.
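A minimal sketch of the guard being proposed, using the plain Scala shutdown-hook helper. The real fix would presumably go through Spark's own shutdown-hook machinery, as ExecutorRunner does, and the process handle below is only a stand-in for what DriverRunner actually manages.

{code}
object DriverRunnerSketch {
  @volatile private var process: Option[Process] = None

  def main(args: Array[String]): Unit = {
    // Register the hook before launching, so a System.exit in the Worker
    // cannot leave the child process running.
    val hook = sys.addShutdownHook {
      process.foreach(_.destroy())
    }

    // Stand-in for launching the driver JVM.
    process = Some(new ProcessBuilder("sleep", "1").inheritIO().start())

    // ... supervise the process ...
    process.foreach(_.waitFor())

    // Normal completion: the hook is no longer needed.
    hook.remove()
  }
}
{code}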



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13627) Fix simple deprecation warnings

2016-03-02 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-13627:
--
Component/s: (was: PySpark)

> Fix simple deprecation warnings
> ---
>
> Key: SPARK-13627
> URL: https://issues.apache.org/jira/browse/SPARK-13627
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, SQL, YARN
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to fix the following deprecation warnings.
>   * MethodSymbolApi.paramss--> paramLists
>   * AnnotationApi.tpe -> tree.tpe
>   * BufferLike.readOnly -> toList.
>   * StandardNames.nme -> termNames
>   * scala.tools.nsc.interpreter.AbstractFileClassLoader -> 
> scala.reflect.internal.util.AbstractFileClassLoader
>   * TypeApi.declarations-> decls



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13630) Add optimizer rule to collapse sorts

2016-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176810#comment-15176810
 ] 

Apache Spark commented on SPARK-13630:
--

User 'skambha' has created a pull request for this issue:
https://github.com/apache/spark/pull/11480

> Add optimizer rule to collapse sorts
> 
>
> Key: SPARK-13630
> URL: https://issues.apache.org/jira/browse/SPARK-13630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunitha Kambhampati
> Fix For: 2.0.0
>
>
> It is possible to collapse adjacent sorts and keep only the last one. This 
> task is to add an optimizer rule that collapses adjacent sorts when possible. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13630) Add optimizer rule to collapse sorts

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13630:


Assignee: (was: Apache Spark)

> Add optimizer rule to collapse sorts
> 
>
> Key: SPARK-13630
> URL: https://issues.apache.org/jira/browse/SPARK-13630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunitha Kambhampati
> Fix For: 2.0.0
>
>
> It is possible to collapse adjacent sorts and keep only the last one. This 
> task is to add an optimizer rule that collapses adjacent sorts when possible. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13630) Add optimizer rule to collapse sorts

2016-03-02 Thread Sunitha Kambhampati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176811#comment-15176811
 ] 

Sunitha Kambhampati commented on SPARK-13630:
-

Here is the pull request with changes: 
https://github.com/apache/spark/pull/11480

> Add optimizer rule to collapse sorts
> 
>
> Key: SPARK-13630
> URL: https://issues.apache.org/jira/browse/SPARK-13630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunitha Kambhampati
> Fix For: 2.0.0
>
>
> It is possible to collapse adjacent sorts and keep only the last one. This 
> task is to add an optimizer rule that collapses adjacent sorts when possible. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13630) Add optimizer rule to collapse sorts

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13630:


Assignee: Apache Spark

> Add optimizer rule to collapse sorts
> 
>
> Key: SPARK-13630
> URL: https://issues.apache.org/jira/browse/SPARK-13630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunitha Kambhampati
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> It is possible to collapse adjacent sorts and keep only the last one. This 
> task is to add an optimizer rule that collapses adjacent sorts when possible. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13631) getPreferredLocations race condition in spark 1.6.0?

2016-03-02 Thread Andy Sloane (JIRA)
Andy Sloane created SPARK-13631:
---

 Summary: getPreferredLocations race condition in spark 1.6.0?
 Key: SPARK-13631
 URL: https://issues.apache.org/jira/browse/SPARK-13631
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.6.0
Reporter: Andy Sloane


We are seeing something that looks a lot like a regression from Spark 1.2. When 
we run jobs with multiple threads, we get a crash somewhere inside 
getPreferredLocations, like the one fixed in SPARK-4454, except now it's inside 
org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs instead 
of DAGScheduler directly.

I tried Spark 1.2 post-SPARK-4454 (before that patch it is only slightly flaky), 
1.4.1, and 1.5.2, and all of them are fine. 1.6.0 immediately crashes on our 
threaded test case, though once in a while it passes.

The stack trace is huge, but starts like this:

Caused by: java.lang.NullPointerException: null
at 
org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406)
at 
org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366)
at 
org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92)
at 
org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
at 
org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545)

The full trace is available here:
https://gist.github.com/andy256/97611f19924bbf65cf49
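For reference, a sketch of the shape of the workload that hits this path: several threads submitting shuffle jobs against one SparkContext, which drives DAGScheduler (and, per the trace above, MapOutputTrackerMaster.getPreferredLocationsForShuffle) concurrently. This is illustrative only, not a guaranteed reproduction of the race.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object ThreadedShuffleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("threaded-shuffle-sketch"))
    val pairs = sc.parallelize(1 to 100000).map(i => (i % 100, i)).cache()

    // Each job contains a ShuffledRDD, whose preferred locations are computed
    // via MapOutputTrackerMaster.getPreferredLocationsForShuffle.
    val threads = (1 to 8).map { _ =>
      new Thread(new Runnable {
        override def run(): Unit = {
          pairs.reduceByKey(_ + _).count()
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    sc.stop()
  }
}
{code}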




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13627) Fix simple deprecation warnings

2016-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176802#comment-15176802
 ] 

Apache Spark commented on SPARK-13627:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11479

> Fix simple deprecation warnings
> ---
>
> Key: SPARK-13627
> URL: https://issues.apache.org/jira/browse/SPARK-13627
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, PySpark, SQL, YARN
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to fix the following deprecation warnings.
>   * MethodSymbolApi.paramss--> paramLists
>   * AnnotationApi.tpe -> tree.tpe
>   * BufferLike.readOnly -> toList.
>   * StandardNames.nme -> termNames
>   * scala.tools.nsc.interpreter.AbstractFileClassLoader -> 
> scala.reflect.internal.util.AbstractFileClassLoader
>   * TypeApi.declarations-> decls



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13627) Fix simple deprecation warnings

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13627:


Assignee: (was: Apache Spark)

> Fix simple deprecation warnings
> ---
>
> Key: SPARK-13627
> URL: https://issues.apache.org/jira/browse/SPARK-13627
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, PySpark, SQL, YARN
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to fix the following deprecation warnings.
>   * MethodSymbolApi.paramss--> paramLists
>   * AnnotationApi.tpe -> tree.tpe
>   * BufferLike.readOnly -> toList.
>   * StandardNames.nme -> termNames
>   * scala.tools.nsc.interpreter.AbstractFileClassLoader -> 
> scala.reflect.internal.util.AbstractFileClassLoader
>   * TypeApi.declarations-> decls



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13627) Fix simple deprecation warnings

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13627:


Assignee: Apache Spark

> Fix simple deprecation warnings
> ---
>
> Key: SPARK-13627
> URL: https://issues.apache.org/jira/browse/SPARK-13627
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, PySpark, SQL, YARN
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> This issue aims to fix the following deprecation warnings.
>   * MethodSymbolApi.paramss--> paramLists
>   * AnnotationApi.tpe -> tree.tpe
>   * BufferLike.readOnly -> toList.
>   * StandardNames.nme -> termNames
>   * scala.tools.nsc.interpreter.AbstractFileClassLoader -> 
> scala.reflect.internal.util.AbstractFileClassLoader
>   * TypeApi.declarations-> decls



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes

2016-03-02 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176787#comment-15176787
 ] 

Bryan Cutler commented on SPARK-13602:
--

Hi [~zsxwing], mind if I work on this one?

> o.a.s.deploy.worker.DriverRunner may leak the driver processes
> --
>
> Key: SPARK-13602
> URL: https://issues.apache.org/jira/browse/SPARK-13602
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> If Worker calls "System.exit", DriverRunner will not kill the driver 
> processes. We should add a shutdown hook in DriverRunner, like 
> o.a.s.deploy.worker.ExecutorRunner does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13630) Add optimizer rule to collapse sorts

2016-03-02 Thread Sunitha Kambhampati (JIRA)
Sunitha Kambhampati created SPARK-13630:
---

 Summary: Add optimizer rule to collapse sorts
 Key: SPARK-13630
 URL: https://issues.apache.org/jira/browse/SPARK-13630
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Sunitha Kambhampati
 Fix For: 2.0.0


It is possible to collapse adjacent sorts and keep only the last one. This task 
is to add an optimizer rule that collapses adjacent sorts when possible. 
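A minimal sketch of the shape such a rule can take in Catalyst. This is illustration only; the rule in the linked pull request may differ, for example in how it treats global versus local sorts.

{code}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

object CollapseSortsSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // A Sort directly on top of another Sort makes the inner one redundant:
    // only the outer (last) ordering is observable downstream.
    case outer @ Sort(_, _, Sort(_, _, grandChild)) => outer.copy(child = grandChild)
  }
}
{code}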



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13627) Fix simple deprecation warnings

2016-03-02 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-13627:
--
Description: 
This issue aims to fix the following deprecation warnings.
  * MethodSymbolApi.paramss--> paramLists
  * AnnotationApi.tpe -> tree.tpe
  * BufferLike.readOnly -> toList.
  * StandardNames.nme -> termNames
  * scala.tools.nsc.interpreter.AbstractFileClassLoader -> 
scala.reflect.internal.util.AbstractFileClassLoader
  * TypeApi.declarations-> decls


  was:
This issue aims to fix the following 21 deprecation warnings.
  * (6) MethodSymbolApi.paramss--> paramLists
  * (4) AnnotationApi.tpe -> tree.tpe
  * (3) BufferLike.readOnly -> toList.
  * (3) StandardNames.nme -> termNames
  * (3) scala.tools.nsc.interpreter.AbstractFileClassLoader -> 
scala.reflect.internal.util.AbstractFileClassLoader
  * (2) TypeApi.declarations-> decls



> Fix simple deprecation warnings
> ---
>
> Key: SPARK-13627
> URL: https://issues.apache.org/jira/browse/SPARK-13627
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, PySpark, SQL, YARN
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to fix the following deprecation warnings.
>   * MethodSymbolApi.paramss--> paramLists
>   * AnnotationApi.tpe -> tree.tpe
>   * BufferLike.readOnly -> toList.
>   * StandardNames.nme -> termNames
>   * scala.tools.nsc.interpreter.AbstractFileClassLoader -> 
> scala.reflect.internal.util.AbstractFileClassLoader
>   * TypeApi.declarations-> decls



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13629) Add binary toggle Param to CountVectorizer

2016-03-02 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-13629:
-

 Summary: Add binary toggle Param to CountVectorizer
 Key: SPARK-13629
 URL: https://issues.apache.org/jira/browse/SPARK-13629
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


It would be handy to add a binary toggle Param to CountVectorizer, as in the 
scikit-learn one: 
[http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html]

If set, then all non-zero counts will be set to 1.
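A sketch of how the proposed toggle could look from the user side. {{setBinary}} is the Param this ticket asks for and does not exist yet, so the snippet is illustrative only and will not compile against current branches.

{code}
import org.apache.spark.ml.feature.CountVectorizer

// Hypothetical usage of the proposed Param: every term that occurs at least
// once in a document would contribute 1.0 instead of its raw count.
val cv = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setBinary(true)   // proposed toggle, not an existing setter
{code}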



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13628) Temporary intermediate output file should be renamed before copying to destination filesystem

2016-03-02 Thread Chen He (JIRA)
Chen He created SPARK-13628:
---

 Summary: Temporary intermediate output file should be renamed 
before copying to destination filesystem
 Key: SPARK-13628
 URL: https://issues.apache.org/jira/browse/SPARK-13628
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 1.6.0
Reporter: Chen He


The Spark executor dumps a temporary file into a local temp dir, copies it to the 
destination filesystem, and then renames it. This can be costly for blobstores 
(such as OpenStack Swift), which perform the actual copy when a file is renamed. 
If it does not affect other components, we could switch the order of the copy and 
rename so that Spark can use a blobstore as the final output destination.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13465) Add a task failure listener to TaskContext

2016-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176750#comment-15176750
 ] 

Apache Spark commented on SPARK-13465:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11478

> Add a task failure listener to TaskContext
> --
>
> Key: SPARK-13465
> URL: https://issues.apache.org/jira/browse/SPARK-13465
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> TaskContext supports task completion callback, which gets called regardless 
> of task failures. However, there is no way for the listener to know if there 
> is an error. This ticket proposes adding a new listener that gets called when 
> a task fails.
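A sketch of how the proposed hook could be used next to the existing completion callback. The completion listener exists today; the failure variant is what this ticket adds, and the name below follows the linked pull request, so treat it as illustrative.

{code}
import org.apache.spark.TaskContext

def processPartition(iter: Iterator[Int]): Iterator[Int] = {
  val ctx = TaskContext.get()

  // Existing hook: runs whether the task succeeds or fails, but cannot see the error.
  ctx.addTaskCompletionListener { _ =>
    println(s"task ${ctx.taskAttemptId()} finished")
  }

  // Proposed hook: runs only when the task fails, with the causing Throwable available.
  ctx.addTaskFailureListener { (_, error) =>
    println(s"task ${ctx.taskAttemptId()} failed: ${error.getMessage}")
  }

  iter.map(_ * 2)
}
{code}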



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13161) Extend MLlib LDA to include options for Author Topic Modeling

2016-03-02 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176747#comment-15176747
 ] 

Joseph K. Bradley commented on SPARK-13161:
---

There are many generalizations of LDA, so it would be valuable to know about 
people's use cases and needs.  Do you have a use case you could describe for 
this?

It would be great to have this feature as a Spark package in the meantime.

> Extend MLlib LDA to include options for Author Topic Modeling
> -
>
> Key: SPARK-13161
> URL: https://issues.apache.org/jira/browse/SPARK-13161
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: John Hogue
>
> The author-topic model is a generative model for documents that extends Latent 
> Dirichlet Allocation.
> By modeling the interests of authors, we can answer a range of important 
> queries about the content of document collections. With an appropriate author 
> model, we can establish which subjects an author writes about, which authors 
> are likely to have written documents similar to an observed document, and 
> which authors produce similar work.
> Full whitepaper here.
> http://mimno.infosci.cornell.edu/info6150/readings/398.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13161) Extend MLlib LDA to include options for Author Topic Modeling

2016-03-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13161:
--
Priority: Minor  (was: Major)

> Extend MLlib LDA to include options for Author Topic Modeling
> -
>
> Key: SPARK-13161
> URL: https://issues.apache.org/jira/browse/SPARK-13161
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: John Hogue
>Priority: Minor
>
> The author-topic model is a generative model for documents that extends Latent 
> Dirichlet Allocation.
> By modeling the interests of authors, we can answer a range of important 
> queries about the content of document collections. With an appropriate author 
> model, we can establish which subjects an author writes about, which authors 
> are likely to have written documents similar to an observed document, and 
> which authors produce similar work.
> Full whitepaper here.
> http://mimno.infosci.cornell.edu/info6150/readings/398.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13627) Fix simple deprecation warnings

2016-03-02 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-13627:
-

 Summary: Fix simple deprecation warnings
 Key: SPARK-13627
 URL: https://issues.apache.org/jira/browse/SPARK-13627
 Project: Spark
  Issue Type: Bug
  Components: Examples, PySpark, SQL, YARN
Reporter: Dongjoon Hyun
Priority: Minor


This issue aims to fix the following 21 deprecation warnings.
  * (6) MethodSymbolApi.paramss--> paramLists
  * (4) AnnotationApi.tpe -> tree.tpe
  * (3) BufferLike.readOnly -> toList.
  * (3) StandardNames.nme -> termNames
  * (3) scala.tools.nsc.interpreter.AbstractFileClassLoader -> 
scala.reflect.internal.util.AbstractFileClassLoader
  * (2) TypeApi.declarations-> decls
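For readers unfamiliar with these APIs, a small sketch of what one of the renames looks like at a call site under scala-reflect 2.11; the other items in the list follow the same pattern.

{code}
import scala.reflect.runtime.universe._

case class Person(name: String) {
  def greet(prefix: String, punctuation: String): String = s"$prefix $name$punctuation"
}

object DeprecationSketch {
  def main(args: Array[String]): Unit = {
    // 2.11 deprecates Type.declarations in favour of Type.decls,
    // and MethodSymbol.paramss in favour of MethodSymbol.paramLists.
    val greet = typeOf[Person].decl(TermName("greet")).asMethod
    val paramNames = greet.paramLists.flatten.map(_.name.toString)
    println(paramNames)   // List(prefix, punctuation)
  }
}
{code}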




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12925) Improve HiveInspectors.unwrap for StringObjectInspector.getPrimitiveWritableObject

2016-03-02 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176734#comment-15176734
 ] 

Rajesh Balamohan commented on SPARK-12925:
--

The earlier fix had a problem when the Text object was reused. Posting a revised 
patch for review that handles the reused-Text case. 

> Improve HiveInspectors.unwrap for 
> StringObjectInspector.getPrimitiveWritableObject
> --
>
> Key: SPARK-12925
> URL: https://issues.apache.org/jira/browse/SPARK-12925
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Fix For: 2.0.0
>
> Attachments: SPARK-12925_profiler_cpu_samples.png
>
>
> Text is in UTF-8 and converting it via "UTF8String.fromString" incurs 
> decoding and encoding, which turns out to be expensive. (to be specific: 
> https://github.com/apache/spark/blob/0d543b98f3e3da5053f0476f4647a765460861f3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L323)
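A sketch of the kind of change this targets, not the exact code in the patch: Text already holds UTF-8 bytes, so they can be handed to UTF8String directly instead of decoding to a java.lang.String and re-encoding. The explicit copy is there because the record reader reuses the Text instance, which is exactly the case the revised patch has to handle.

{code}
import java.util.Arrays
import org.apache.hadoop.io.Text
import org.apache.spark.unsafe.types.UTF8String

def unwrapText(t: Text): UTF8String = {
  // Copy only the valid range: Text's backing array is reused across records
  // and is usually larger than getLength.
  val bytes = Arrays.copyOfRange(t.getBytes, 0, t.getLength)
  UTF8String.fromBytes(bytes)
}
{code}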



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12925) Improve HiveInspectors.unwrap for StringObjectInspector.getPrimitiveWritableObject

2016-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176729#comment-15176729
 ] 

Apache Spark commented on SPARK-12925:
--

User 'rajeshbalamohan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11477

> Improve HiveInspectors.unwrap for 
> StringObjectInspector.getPrimitiveWritableObject
> --
>
> Key: SPARK-12925
> URL: https://issues.apache.org/jira/browse/SPARK-12925
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Fix For: 2.0.0
>
> Attachments: SPARK-12925_profiler_cpu_samples.png
>
>
> Text is in UTF-8 and converting it via "UTF8String.fromString" incurs 
> decoding and encoding, which turns out to be expensive. (to be specific: 
> https://github.com/apache/spark/blob/0d543b98f3e3da5053f0476f4647a765460861f3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L323)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13626) SparkConf deprecation log messages are printed multiple times

2016-03-02 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-13626:
--

 Summary: SparkConf deprecation log messages are printed multiple 
times
 Key: SPARK-13626
 URL: https://issues.apache.org/jira/browse/SPARK-13626
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.0.0
Reporter: Marcelo Vanzin
Priority: Minor


I noticed that if I have a deprecated config in my spark-defaults.conf, I'll 
see multiple warnings when running, for example, spark-shell. I collected the 
backtrace from when the messages are printed, and here are a few instances. The 
first one is the only one I expect to be printed.

{noformat}
java.lang.Exception:
...
at org.apache.spark.SparkConf.(SparkConf.scala:53)
at org.apache.spark.repl.Main$.(Main.scala:30)
{noformat}

The following ones are causing duplicate log messages and we should clean those 
up:

{noformat}
java.lang.Exception:
at 
org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
...
at org.apache.spark.SparkConf.(SparkConf.scala:53)
at org.apache.spark.repl.Main$.createSparkContext(Main.scala:82)
{noformat}

{noformat}
java.lang.Exception:
at 
org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
...
at org.apache.spark.SparkConf.setAll(SparkConf.scala:139)
at org.apache.spark.SparkConf.clone(SparkConf.scala:358)
at org.apache.spark.SparkContext.(SparkContext.scala:368)
at org.apache.spark.repl.Main$.createSparkContext(Main.scala:98)
{noformat}

There are also a few more caused by the use of {{SparkConf.clone()}}.

{noformat}
java.lang.Exception:
at 
org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
...
at org.apache.spark.SparkConf.(SparkConf.scala:59)
at org.apache.spark.SparkConf.(SparkConf.scala:53)
at 
org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:48)
{noformat}

{noformat}
java.lang.Exception:
at 
org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
...
at org.apache.spark.SparkConf.(SparkConf.scala:53)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:93)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
{noformat}

{noformat}
java.lang.Exception:
at 
org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
...
at org.apache.spark.SparkConf.(SparkConf.scala:53)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:93)
{noformat}
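One possible way to keep each deprecation warning to a single occurrence per JVM, regardless of how many SparkConf instances are constructed or cloned. A sketch only, not necessarily the approach the eventual fix takes.

{code}
import java.util.concurrent.ConcurrentHashMap

object DeprecationWarningsSketch {
  // Keys we have already warned about in this JVM.
  private val alreadyWarned = ConcurrentHashMap.newKeySet[String]()

  def warnOnce(key: String, message: String): Unit = {
    // add() returns true only for the first thread to insert the key,
    // so later SparkConf instances (including clones) stay silent.
    if (alreadyWarned.add(key)) {
      System.err.println(s"WARN SparkConf: $message")
    }
  }
}
{code}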





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13625) PySpark-ML method to get list of params for an obj should not check property attr

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13625:


Assignee: Apache Spark

> PySpark-ML method to get list of params for an obj should not check property 
> attr
> -
>
> Key: SPARK-13625
> URL: https://issues.apache.org/jira/browse/SPARK-13625
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>
> In PySpark params.__init__.py, the method {{Param.params()}} returns a list 
> of Params belonging to that object.  This method should not check an 
> attribute to be an instance of {{Param}} if it is a property (uses the 
> {{@property}} decorator).  This causes the property to be invoked to 'get' 
> the attribute, and that can lead to an error, depending on the property.  If 
> an attribute is a property it will not be an ML {{Param}}, so no need to 
> check it.
> I came across this in working on SPARK-13430 while adding 
> {{LinearRegressionModel.summary}} as a property to give a training summary, 
> similar to the Scala API.  It is possible that a training summary does not 
> exist and will then raise an exception if the {{summary}} property is 
> invoked.  
> Calling {{getattr(self, x)}} will cause the property to be invoked if {{x}} 
> is a property. To fix this, we just need to check whether it is a class property 
> before making the call to {{getattr()}} in {{Param.params()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13625) PySpark-ML method to get list of params for an obj should not check property attr

2016-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13625:


Assignee: (was: Apache Spark)

> PySpark-ML method to get list of params for an obj should not check property 
> attr
> -
>
> Key: SPARK-13625
> URL: https://issues.apache.org/jira/browse/SPARK-13625
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>
> In PySpark params.__init__.py, the method {{Param.params()}} returns a list 
> of Params belonging to that object.  This method should not check an 
> attribute to be an instance of {{Param}} if it is a property (uses the 
> {{@property}} decorator).  This causes the property to be invoked to 'get' 
> the attribute, and that can lead to an error, depending on the property.  If 
> an attribute is a property it will not be an ML {{Param}}, so no need to 
> check it.
> I came across this in working on SPARK-13430 while adding 
> {{LinearRegressionModel.summary}} as a property to give a training summary, 
> similar to the Scala API.  It is possible that a training summary does not 
> exist and will then raise an exception if the {{summary}} property is 
> invoked.  
> Calling {{getattr(self, x)}} will cause the property to be invoked if {{x}} 
> is a property. To fix this, we just need to check whether it is a class property 
> before making the call to {{getattr()}} in {{Param.params()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13625) PySpark-ML method to get list of params for an obj should not check property attr

2016-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176715#comment-15176715
 ] 

Apache Spark commented on SPARK-13625:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/11476

> PySpark-ML method to get list of params for an obj should not check property 
> attr
> -
>
> Key: SPARK-13625
> URL: https://issues.apache.org/jira/browse/SPARK-13625
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>
> In PySpark params.__init__.py, the method {{Param.params()}} returns a list 
> of Params belonging to that object.  This method should not check an 
> attribute to be an instance of {{Param}} if it is a property (uses the 
> {{@property}} decorator).  This causes the property to be invoked to 'get' 
> the attribute, and that can lead to an error, depending on the property.  If 
> an attribute is a property it will not be an ML {{Param}}, so no need to 
> check it.
> I came across this in working on SPARK-13430 while adding 
> {{LinearRegressionModel.summary}} as a property to give a training summary, 
> similar to the Scala API.  It is possible that a training summary does not 
> exist and will then raise an exception if the {{summary}} property is 
> invoked.  
> Calling {{getattr(self, x)}} will cause the property to be invoked if {{x}} 
> is a property. To fix this, we just need to check whether it is a class property 
> before making the call to {{getattr()}} in {{Param.params()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13528) Make the short names of compression codecs consistent in spark

2016-03-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13528.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.0.0

> Make the short names of compression codecs consistent in spark
> --
>
> Key: SPARK-13528
> URL: https://issues.apache.org/jira/browse/SPARK-13528
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add a common utility code to map short names to fully-qualified codec names.
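A sketch of the kind of shared mapping this asks for, so that "lz4", "lzf" and "snappy" resolve to the same codec classes in core and SQL alike. The codec class names are the existing Spark ones; the object and method names here are illustrative.

{code}
object CompressionCodecNamesSketch {
  private val shortNames: Map[String, String] = Map(
    "lz4"    -> "org.apache.spark.io.LZ4CompressionCodec",
    "lzf"    -> "org.apache.spark.io.LZFCompressionCodec",
    "snappy" -> "org.apache.spark.io.SnappyCompressionCodec")

  /** Resolve a user-supplied short name to a fully-qualified codec class name. */
  def resolve(name: String): String =
    shortNames.getOrElse(name.toLowerCase, name)
}
{code}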



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13594) remove typed operations (map, flatMap, mapPartitions) from Python DataFrame

2016-03-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13594:

Description: 
Once we implement Dataset-equivalent API in Python, we'd need to change the 
return type of map, flatMap, and mapPartitions. In this case, we should just 
remove them from Python DataFrame now in 2.0, so we don't break APIs in 2.x.

Users can still use those operations after the removal, but must go through the 
rdd attribute. For example, df.rdd.map, df.rdd.flatMap, and df.rdd.mapPartitions.


  was:
Once we implement Dataset-equivalent API in Python, we'd need to change the 
return type of map, flatMap, and mapPartitions. In this case, we should just 
remove them from Python DataFrame now in 2.0, so we don't break APIs in 2.x.



> remove typed operations (map, flatMap, mapPartitions) from Python DataFrame 
> 
>
> Key: SPARK-13594
> URL: https://issues.apache.org/jira/browse/SPARK-13594
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> Once we implement Dataset-equivalent API in Python, we'd need to change the 
> return type of map, flatMap, and mapPartitions. In this case, we should just 
> remove them from Python DataFrame now in 2.0, so we don't break APIs in 2.x.
> Users can still use those operations after the removal, but must go through the 
> rdd attribute. For example, df.rdd.map, df.rdd.flatMap, and df.rdd.mapPartitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13594) remove typed operations (map, flatMap, mapPartitions) from Python DataFrame

2016-03-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13594:

Description: 
Once we implement Dataset-equivalent API in Python, we'd need to change the 
return type of map, flatMap, and mapPartitions. In this case, we should just 
remove them from Python DataFrame now in 2.0, so we don't break APIs in 2.x.


> remove typed operations (map, flatMap, mapPartitions) from Python DataFrame 
> 
>
> Key: SPARK-13594
> URL: https://issues.apache.org/jira/browse/SPARK-13594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> Once we implement Dataset-equivalent API in Python, we'd need to change the 
> return type of map, flatMap, and mapPartitions. In this case, we should just 
> remove them from Python DataFrame now in 2.0, so we don't break APIs in 2.x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13594) remove typed operations (map, flatMap, mapPartitions) from Python DataFrame

2016-03-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13594:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-11806

> remove typed operations (map, flatMap, mapPartitions) from Python DataFrame 
> 
>
> Key: SPARK-13594
> URL: https://issues.apache.org/jira/browse/SPARK-13594
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> Once we implement Dataset-equivalent API in Python, we'd need to change the 
> return type of map, flatMap, and mapPartitions. In this case, we should just 
> remove them from Python DataFrame now in 2.0, so we don't break APIs in 2.x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


