[jira] [Commented] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit
[ https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177445#comment-15177445 ] Liang-Chi Hsieh commented on SPARK-13635: - [~davies] Can you help update the Assignee field? Thanks! > Enable LimitPushdown optimizer rule because we have whole-stage codegen for > Limit > - > > Key: SPARK-13635 > URL: https://issues.apache.org/jira/browse/SPARK-13635 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 2.0.0 > > > LimitPushdown optimizer rule has been disabled due to no whole-stage codegen > for Limit. As we have whole-stage codegen for Limit now, we should enable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13531) Some DataFrame joins stopped working with UnsupportedOperationException: No size estimation available for objects
[ https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177441#comment-15177441 ] Zuo Wang commented on SPARK-13531: -- Caused by the commit in https://issues.apache.org/jira/browse/SPARK-13329 > Some DataFrame joins stopped working with UnsupportedOperationException: No > size estimation available for objects > - > > Key: SPARK-13531 > URL: https://issues.apache.org/jira/browse/SPARK-13531 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: koert kuipers >Priority: Minor > > this is using spark 2.0.0-SNAPSHOT > dataframe df1: > schema: > {noformat}StructType(StructField(x,IntegerType,true)){noformat} > explain: > {noformat}== Physical Plan == > MapPartitions , obj#135: object, [if (input[0, object].isNullAt) > null else input[0, object].get AS x#128] > +- MapPartitions , createexternalrow(if (isnull(x#9)) null else > x#9), [input[0, object] AS obj#135] >+- WholeStageCodegen > : +- Project [_1#8 AS x#9] > : +- Scan ExistingRDD[_1#8]{noformat} > show: > {noformat}+---+ > | x| > +---+ > | 2| > | 3| > +---+{noformat} > dataframe df2: > schema: > {noformat}StructType(StructField(x,IntegerType,true), > StructField(y,StringType,true)){noformat} > explain: > {noformat}== Physical Plan == > MapPartitions , createexternalrow(x#2, if (isnull(y#3)) null else > y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get > AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class > org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, > object].get, true) AS y#131] > +- WholeStageCodegen >: +- Project [_1#0 AS x#2,_2#1 AS y#3] >: +- Scan ExistingRDD[_1#0,_2#1]{noformat} > show: > {noformat}+---+---+ > | x| y| > +---+---+ > | 1| 1| > | 2| 2| > | 3| 3| > +---+---+{noformat} > i run: > df1.join(df2, Seq("x")).show > i get: > {noformat}java.lang.UnsupportedOperationException: No size estimation > available for objects. 
> at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41) > at > org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323) > at > org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323) > at > org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87){noformat} > not sure what changed, this ran about a week ago without issues (in our > internal unit tests). it is fully reproducible; however, when i tried to > minimize the issue i could not reproduce it by just creating data frames in > the repl with the same contents, so it probably has something to do with the > way these are created (from Row objects and StructTypes). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
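The stack trace above shows the planner asking every column for a default size while computing statistics for the `CanBroadcast` check. A minimal, self-contained sketch of that failure mode (illustrative types only, not Spark's actual classes — the real code lives in `UnaryNode.statistics` and `ObjectType.defaultSize`):

```scala
// Simplified model: each column type reports a default size in bytes; the
// object-typed columns produced by MapPartitions cannot, so any plan that
// contains one throws as soon as the planner asks for a size estimate.
sealed trait ColType { def defaultSize: Int }
case object IntCol extends ColType { def defaultSize = 4 }
case object StringCol extends ColType { def defaultSize = 20 }
case object ObjectCol extends ColType {
  def defaultSize =
    throw new UnsupportedOperationException("No size estimation available for objects.")
}

// Mirrors the statistics step: sum per-column sizes to estimate a row's width.
def estimatedRowSize(schema: Seq[ColType]): Int = schema.map(_.defaultSize).sum

estimatedRowSize(Seq(IntCol, StringCol))    // fine: 24
// estimatedRowSize(Seq(ObjectCol, IntCol)) // would throw, as in the stack trace
```

This matches the report: the join itself is fine, but the broadcast-join size check trips over the object-typed intermediate plan introduced by the map over external rows.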
[jira] [Resolved] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit
[ https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13635. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11483 [https://github.com/apache/spark/pull/11483] > Enable LimitPushdown optimizer rule because we have whole-stage codegen for > Limit > - > > Key: SPARK-13635 > URL: https://issues.apache.org/jira/browse/SPARK-13635 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 2.0.0 > > > LimitPushdown optimizer rule has been disabled due to no whole-stage codegen > for Limit. As we have whole-stage codegen for Limit now, we should enable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13589) Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType
[ https://issues.apache.org/jira/browse/SPARK-13589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177437#comment-15177437 ] Liang-Chi Hsieh commented on SPARK-13589: - [~lian cheng] I think this is already solved in SPARK-13537. > Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType > --- > > Key: SPARK-13589 > URL: https://issues.apache.org/jira/browse/SPARK-13589 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian > Labels: flaky-test > > Here are a few sample build failures caused by this test case: > # > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52164/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/ > # > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52154/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/ > # > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52153/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/ > (I've pinned these builds on Jenkins so that they won't be cleaned up.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12941) Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype
[ https://issues.apache.org/jira/browse/SPARK-12941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177436#comment-15177436 ] Apache Spark commented on SPARK-12941: -- User 'thomastechs' has created a pull request for this issue: https://github.com/apache/spark/pull/11489 > Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR > datatype > -- > > Key: SPARK-12941 > URL: https://issues.apache.org/jira/browse/SPARK-12941 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 > Environment: Apache Spark 1.4.2.2 >Reporter: Jose Martinez Poblete >Assignee: Thomas Sebastian > Fix For: 1.4.2, 1.5.3, 1.6.2, 2.0.0 > > > When exporting data from Spark to Oracle, string datatypes are translated to > TEXT for Oracle, this is leading to the following error > {noformat} > java.sql.SQLSyntaxErrorException: ORA-00902: invalid datatype > {noformat} > As per the following code: > https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L144 > See also: > http://stackoverflow.com/questions/31287182/writing-to-oracle-database-using-apache-spark-1-4-0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13612) Multiplication of BigDecimal columns not working as expected
[ https://issues.apache.org/jira/browse/SPARK-13612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177428#comment-15177428 ] Liang-Chi Hsieh edited comment on SPARK-13612 at 3/3/16 7:35 AM: - Because the internal type for BigDecimal would be Decimal(38, 18) by default, (you can print the schema of x and y), the result scale of x("a") * y("b") will be 18 + 18 = 36. That is detected to have overflow so you get a null value back. You can cast the decimal column to proper precision and scale, e.g.:
{code}
val newX = x.withColumn("a", x("a").cast(DecimalType(10, 1)))
val newY = y.withColumn("b", y("b").cast(DecimalType(10, 1)))
newX.join(newY, newX("id") === newY("id")).withColumn("z", newX("a") * newY("b")).show
+---+----+---+----+------+
| id|   a| id|   b|     z|
+---+----+---+----+------+
|  1|10.0|  1|10.0|100.00|
+---+----+---+----+------+
{code}
was (Author: viirya): Because the internal type for BigDecimal would be Decimal(38, 18) by default, (you can print the schema of x and y), the result scale of x("a") * y("b") will be 18 + 18 = 36. That is detected to have overflow so you get a null value back. 
You can cast the decimal column to proper precision and scale, e.g.: {{code}} val newX = x.withColumn("a", x("a").cast(DecimalType(10, 1))) val newY = y.withColumn("b", y("b").cast(DecimalType(10, 1))) newX.join(newY, newX("id") === newY("id")).withColumn("z", newX("a") * newY("b")).show +---++---++--+ | id| a| id| b| z| +---++---++--+ | 1|10.0| 1|10.0|100.00| +---++---++--+ {{code}} > Multiplication of BigDecimal columns not working as expected > > > Key: SPARK-13612 > URL: https://issues.apache.org/jira/browse/SPARK-13612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Varadharajan > > Please consider the below snippet: > {code} > case class AM(id: Int, a: BigDecimal) > case class AX(id: Int, b: BigDecimal) > val x = sc.parallelize(List(AM(1, 10))).toDF > val y = sc.parallelize(List(AX(1, 10))).toDF > x.join(y, x("id") === y("id")).withColumn("z", x("a") * y("b")).show > {code} > output: > {code} > | id| a| id| b| z| > | 1|10.00...| 1|10.00...|null| > {code} > Here the multiplication of the columns ("z") return null instead of 100. > As of now we are using the below workaround, but definitely looks like a > serious issue. > {code} > x.join(y, x("id") === y("id")).withColumn("z", x("a") / (expr("1") / > y("b"))).show > {code} > {code} > | id| a| id| b| z| > | 1|10.00...| 1|10.00...|100.0...| > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13612) Multiplication of BigDecimal columns not working as expected
[ https://issues.apache.org/jira/browse/SPARK-13612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177428#comment-15177428 ] Liang-Chi Hsieh commented on SPARK-13612: - Because the internal type for BigDecimal would be Decimal(38, 18) by default, (you can print the schema of x and y), the result scale of x("a") * y("b") will be 18 + 18 = 36. That is detected to have overflow so you get a null value back. You can cast the decimal column to proper precision and scale, e.g.:
{code}
val newX = x.withColumn("a", x("a").cast(DecimalType(10, 1)))
val newY = y.withColumn("b", y("b").cast(DecimalType(10, 1)))
newX.join(newY, newX("id") === newY("id")).withColumn("z", newX("a") * newY("b")).show
+---+----+---+----+------+
| id|   a| id|   b|     z|
+---+----+---+----+------+
|  1|10.0|  1|10.0|100.00|
+---+----+---+----+------+
{code}
> Multiplication of BigDecimal columns not working as expected > > > Key: SPARK-13612 > URL: https://issues.apache.org/jira/browse/SPARK-13612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Varadharajan > > Please consider the below snippet: > {code} > case class AM(id: Int, a: BigDecimal) > case class AX(id: Int, b: BigDecimal) > val x = sc.parallelize(List(AM(1, 10))).toDF > val y = sc.parallelize(List(AX(1, 10))).toDF > x.join(y, x("id") === y("id")).withColumn("z", x("a") * y("b")).show > {code} > output: > {code} > | id| a| id| b| z| > | 1|10.00...| 1|10.00...|null| > {code} > Here the multiplication of the columns ("z") return null instead of 100. > As of now we are using the below workaround, but definitely looks like a > serious issue. > {code} > x.join(y, x("id") === y("id")).withColumn("z", x("a") / (expr("1") / > y("b"))).show > {code} > {code} > | id| a| id| b| z| > | 1|10.00...| 1|10.00...|100.0...| > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
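The comment's "18 + 18 = 36" arithmetic can be made concrete with a small sketch of the usual SQL result-type rule for decimal multiplication (an assumption here: this mirrors the convention Spark's analyzer applies, with result precision p1 + p2 + 1 capped at 38 and result scale s1 + s2; the function name is illustrative, not Spark's API):

```scala
// Result type of Decimal(p1, s1) * Decimal(p2, s2), capped at 38 digits.
val MaxPrecision = 38

def multiplyResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val precision = math.min(p1 + p2 + 1, MaxPrecision)
  val scale = s1 + s2
  (precision, scale)
}

// Default BigDecimal columns are Decimal(38, 18): scale 36 leaves only
// 38 - 36 = 2 digits before the point, so 10 * 10 = 100 overflows to null.
multiplyResultType(38, 18, 38, 18)

// After casting both sides to Decimal(10, 1) there is plenty of headroom.
multiplyResultType(10, 1, 10, 1)
```

This makes clear why the cast workaround helps: it shrinks the input scales so the result scale no longer swallows the integer digits.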
[jira] [Created] (SPARK-13643) Create SparkSession interface
Reynold Xin created SPARK-13643: --- Summary: Create SparkSession interface Key: SPARK-13643 URL: https://issues.apache.org/jira/browse/SPARK-13643 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177392#comment-15177392 ] Xusen Yin commented on SPARK-13600: --- Vote for the new method. > Incorrect number of buckets in QuantileDiscretizer > -- > > Key: SPARK-13600 > URL: https://issues.apache.org/jira/browse/SPARK-13600 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.0, 2.0.0 >Reporter: Oliver Pierson >Assignee: Oliver Pierson > > Under certain circumstances, QuantileDiscretizer fails to calculate the > correct splits resulting in an incorrect number of buckets/bins. > E.g. > val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x") > val discretizer = new > QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5) > discretizer.fit(df).getSplits > gives: > Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity) > which corresponds to 6 buckets (not 5). > The problem appears to be in the QuantileDiscretizer.findSplitsCandidates > method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
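The report's bucket count follows from a simple invariant: a bucketizer with splits s0 < s1 < ... < sn produces n buckets, i.e. one fewer than the number of splits. A tiny sketch of that check against the splits quoted in the issue (the helper name is illustrative, not Spark's API):

```scala
// With splits s0 < s1 < ... < sn, values fall into n = splits.length - 1 buckets.
def bucketCount(splits: Array[Double]): Int = splits.length - 1

// The splits reported by QuantileDiscretizer in the issue:
val reported = Array(Double.NegativeInfinity, 2.0, 4.0, 6.0, 8.0, 10.0,
  Double.PositiveInfinity)

bucketCount(reported) // 6 buckets, one more than the 5 requested
```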
[jira] [Commented] (SPARK-13568) Create feature transformer to impute missing values
[ https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177368#comment-15177368 ] Nick Pentreath commented on SPARK-13568: Ok - the Imputer will need to compute column stats ignoring NaNs, so SPARK-13639 should add that (whether as default behaviour, or an optional argument) > Create feature transformer to impute missing values > --- > > Key: SPARK-13568 > URL: https://issues.apache.org/jira/browse/SPARK-13568 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > It is quite common to encounter missing values in data sets. It would be > useful to implement a {{Transformer}} that can impute missing data points, > similar to e.g. {{Imputer}} in > [scikit-learn|http://scikit-learn.org/dev/modules/preprocessing.html#imputation-of-missing-values]. > Initially, options for imputation could include {{mean}}, {{median}} and > {{most frequent}}, but we could add various other approaches. Where possible > existing DataFrame code can be used (e.g. for approximate quantiles etc). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177356#comment-15177356 ] Adrian Wang commented on SPARK-13446: - That's not enough. We still need some code change. > Spark need to support reading data from Hive 2.0.0 metastore > > > Key: SPARK-13446 > URL: https://issues.apache.org/jira/browse/SPARK-13446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Lifeng Wang > > Spark provided HIveContext class to read data from hive metastore directly. > While it only supports hive 1.2.1 version and older. Since hive 2.0.0 has > released, it's better to upgrade to support Hive 2.0.0. > {noformat} > 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI > thrift://hsw-node13:9083 > 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current > connections: 1 > 16/02/23 02:35:02 INFO metastore: Connected to metastore. > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at > org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13311) prettyString of IN is not good
[ https://issues.apache.org/jira/browse/SPARK-13311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177357#comment-15177357 ] Xiao Li commented on SPARK-13311: - After the merge of https://github.com/apache/spark/pull/10757, I think the problem is resolved. > prettyString of IN is not good > -- > > Key: SPARK-13311 > URL: https://issues.apache.org/jira/browse/SPARK-13311 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu > > In(i_class,[Ljava.lang.Object;@1a575883)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13446: -- Issue Type: Improvement (was: Bug) Can't you build against the newer version of Hive? that much is needed of course; I don't know if it's all that's needed. > Spark need to support reading data from Hive 2.0.0 metastore > > > Key: SPARK-13446 > URL: https://issues.apache.org/jira/browse/SPARK-13446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Lifeng Wang > > Spark provided HIveContext class to read data from hive metastore directly. > While it only supports hive 1.2.1 version and older. Since hive 2.0.0 has > released, it's better to upgrade to support Hive 2.0.0. > {noformat} > 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI > thrift://hsw-node13:9083 > 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current > connections: 1 > 16/02/23 02:35:02 INFO metastore: Connected to metastore. 
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at > org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13642) Inconsistent finishing state between driver and AM
[ https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177347#comment-15177347 ] Saisai Shao commented on SPARK-13642: - [~tgraves] [~vanzin], would you please comment on this, why the default application final state is "SUCCESS"? Is it better to mark this application as "SUCCESS" only after user class is exited? Thanks a lot. > Inconsistent finishing state between driver and AM > --- > > Key: SPARK-13642 > URL: https://issues.apache.org/jira/browse/SPARK-13642 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: Saisai Shao > > Currently when running Spark on Yarn with yarn cluster mode, the default > application final state is "SUCCEED", if any exception is occurred, this > final state will be changed to "FAILED" and trigger the reattempt if > possible. > This is OK in normal case, but if there's a race condition when AM received a > signal (SIGTERM) and no any exception is occurred. In this situation, > shutdown hook will be invoked and marked this application as finished with > success, and there's no another attempt. > In such situation, actually from Spark's aspect this application is failed > and need another attempt, but from Yarn's aspect the application is finished > with success. > This could happened in NM failure situation, the failure of NM will send > SIGTERM to AM, AM should make this attempt as failure and rerun again, not > invoke unregister. > So to increase the chance of this race condition, here is the reproduced code: > {code} > val sc = ... > Thread.sleep(3L) > sc.parallelize(1 to 100).collect() > {code} > If the AM is failed in sleeping, there's no exception been thrown, so from > Yarn's point this application is finished successfully, but from Spark's > point, this application should be reattempted. 
> So basically, I think only after the finish of user class, we could mark this > application as "SUCCESS", otherwise, especially in the signal stopped > scenario, it would be better to mark as failed and try again (except > explicitly KILL command by yarn). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13642) Inconsistent finishing state between driver and AM
[ https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-13642: Description: Currently when running Spark on Yarn with yarn cluster mode, the default application final state is "SUCCEED", if any exception is occurred, this final state will be changed to "FAILED" and trigger the reattempt if possible. This is OK in normal case, but if there's a race condition when AM received a signal (SIGTERM) and no any exception is occurred. In this situation, shutdown hook will be invoked and marked this application as finished with success, and there's no another attempt. In such situation, actually from Spark's aspect this application is failed and need another attempt, but from Yarn's aspect the application is finished with success. This could happened in NM failure situation, the failure of NM will send SIGTERM to AM, AM should make this attempt as failure and rerun again, not invoke unregister. So to increase the chance of this race condition, here is the reproduced code: {code} val sc = ... Thread.sleep(3L) sc.parallelize(1 to 100).collect() {code} If the AM is failed in sleeping, there's no exception been thrown, so from Yarn's point this application is finished successfully, but from Spark's point, this application should be reattempted. So basically, I think only after the finish of user class, we could mark this application as "SUCCESS", otherwise, especially in the signal stopped scenario, it would be better to mark as failed and try again (except explicitly KILL command by yarn). was: Currently when running Spark on Yarn with yarn cluster mode, the default application final state is "SUCCEED", if any exception is occurred, this final state will be changed to "FAILED" and trigger the reattempt if possible. This is OK in normal case, but there's a race condition when AM received a signal (SIGTERM), no any exception is occurred. 
In this situation, shutdown hook will be invoked and marked this application as finished with success, and there's no another attempt. In such situation, actually from Spark's aspect this application is failed and need another attempt, but from Yarn's aspect the application is finished with success. This could happened in NM failure situation, the failure of NM will send SIGTERM to AM, AM should make this attempt as failure and rerun again, not invoke unregister. So to increase the chance of this race condition, here is the reproduced code: {code} val sc = ... Thread.sleep(3L) sc.parallelize(1 to 100).collect() {code} If the AM is failed in sleeping, there's no exception been thrown, so from Yarn's point this application is finished successfully, but from Spark's point, this application should be reattempted. So basically, I think only after the finish of user class, we could mark this application as "SUCCESS", otherwise, especially in the signal stopped scenario, it would be better to mark as failed and try again (except explicitly KILL command by yarn). > Inconsistent finishing state between driver and AM > --- > > Key: SPARK-13642 > URL: https://issues.apache.org/jira/browse/SPARK-13642 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: Saisai Shao > > Currently when running Spark on Yarn with yarn cluster mode, the default > application final state is "SUCCEED", if any exception is occurred, this > final state will be changed to "FAILED" and trigger the reattempt if > possible. > This is OK in normal case, but if there's a race condition when AM received a > signal (SIGTERM) and no any exception is occurred. In this situation, > shutdown hook will be invoked and marked this application as finished with > success, and there's no another attempt. > In such situation, actually from Spark's aspect this application is failed > and need another attempt, but from Yarn's aspect the application is finished > with success. 
> This could happened in NM failure situation, the failure of NM will send > SIGTERM to AM, AM should make this attempt as failure and rerun again, not > invoke unregister. > So to increase the chance of this race condition, here is the reproduced code: > {code} > val sc = ... > Thread.sleep(3L) > sc.parallelize(1 to 100).collect() > {code} > If the AM is failed in sleeping, there's no exception been thrown, so from > Yarn's point this application is finished successfully, but from Spark's > point, this application should be reattempted. > So basically, I think only after the finish of user class, we could mark this > application as "SUCCESS", otherwise, especially in the signal stopped > scenario, it would be better to mark as failed and try again (except > explicitly KILL command by yarn). -- Th
[jira] [Created] (SPARK-13642) Inconsistent finishing state between driver and AM
Saisai Shao created SPARK-13642: --- Summary: Inconsistent finishing state between driver and AM Key: SPARK-13642 URL: https://issues.apache.org/jira/browse/SPARK-13642 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.6.0 Reporter: Saisai Shao Currently when running Spark on Yarn with yarn cluster mode, the default application final state is "SUCCEED", if any exception is occurred, this final state will be changed to "FAILED" and trigger the reattempt if possible. This is OK in normal case, but there's a race condition when AM received a signal (SIGTERM), no any exception is occurred. In this situation, shutdown hook will be invoked and marked this application as finished with success, and there's no another attempt. In such situation, actually from Spark's aspect this application is failed and need another attempt, but from Yarn's aspect the application is finished with success. This could happened in NM failure situation, the failure of NM will send SIGTERM to AM, AM should make this attempt as failure and rerun again, not invoke unregister. So to increase the chance of this race condition, here is the reproduced code: {code} val sc = ... Thread.sleep(3L) sc.parallelize(1 to 100).collect() {code} If the AM is failed in sleeping, there's no exception been thrown, so from Yarn's point this application is finished successfully, but from Spark's point, this application should be reattempted. So basically, I think only after the finish of user class, we could mark this application as "SUCCESS", otherwise, especially in the signal stopped scenario, it would be better to mark as failed and try again (except explicitly KILL command by yarn). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
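The fix the reporter proposes — mark the attempt SUCCESS only after the user class returns — can be sketched in a few lines. This is illustrative only (an assumption of the proposed behaviour, not Spark's actual ApplicationMaster code): the final status defaults to FAILED, the shutdown hook reports whatever status has been recorded, and SUCCESS is recorded only once user code finishes, so a SIGTERM that interrupts the user class can no longer be reported as a successful finish.

```scala
import java.util.concurrent.atomic.AtomicReference

object AmSketch {
  // Pessimistic default: a signal-triggered shutdown reports FAILED,
  // which lets YARN schedule another attempt.
  val finalStatus = new AtomicReference[String]("FAILED")

  def runUserClass(userMain: () => Unit): Unit = {
    sys.addShutdownHook {
      // On SIGTERM this runs while userMain may still be mid-flight,
      // so it sees FAILED unless the user class actually completed.
      println(s"unregistering with final status: ${finalStatus.get}")
    }
    userMain()                 // may never return if the AM is signalled
    finalStatus.set("SUCCESS") // only reached when user code finished
  }
}
```

The key design point is the ordering: the status flip happens strictly after `userMain()` returns, closing the race window the issue describes.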
[jira] [Resolved] (SPARK-13621) TestExecutor.scala needs to be moved to test package
[ https://issues.apache.org/jira/browse/SPARK-13621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13621. - Resolution: Fixed Assignee: Devaraj K Fix Version/s: 2.0.0 > TestExecutor.scala needs to be moved to test package > > > Key: SPARK-13621 > URL: https://issues.apache.org/jira/browse/SPARK-13621 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 1.6.0, 2.0.0 >Reporter: Devaraj K >Assignee: Devaraj K >Priority: Trivial > Fix For: 2.0.0 > > > TestExecutor.scala is in the package > core\src\main\scala\org\apache\spark\deploy\client\ and it is getting used > only by test classes. It needs to be moved to test package i.e. > core\src\test\scala\org\apache\spark\deploy\client\ since the purpose of it > is for test. > And also core\src\main\scala\org\apache\spark\deploy\client\TestClient.scala > is not getting used any where and present in the src, I think it can be > removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177315#comment-15177315 ] Mark Grover commented on SPARK-12177: - One more thing as a potential con for Proposal 1: There are places that have to use the kafka artifact; the 'examples' subproject is a good example of that. That subproject pulls in the kafka artifact as a dependency and has examples of Kafka usage. However, it can't depend on the new implementation's artifact at the same time, because the two depend on different versions of Kafka. Therefore, unless I am missing something, the new implementation's example can't go there. And that's fine; we can put it within the subproject itself instead of examples, but that won't necessarily work with tooling like run-example, etc. > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released and it introduces a new consumer API that is > not compatible with the old one, so I added the new consumer API. I made separate > classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for backward compatibility: users will not need > to change their old Spark applications when they upgrade to a new Spark version. > Please review my changes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13616) Let SQLBuilder convert logical plan without a Project on top of it
[ https://issues.apache.org/jira/browse/SPARK-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13616. - Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.0.0 > Let SQLBuilder convert logical plan without a Project on top of it > -- > > Key: SPARK-13616 > URL: https://issues.apache.org/jira/browse/SPARK-13616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > It is possible that a logical plan has had its top-level Project removed, or that > the plan never had a top Project to begin with. Currently SQLBuilder can't convert > such plans back to SQL. This issue is opened to add this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
Xusen Yin created SPARK-13641: - Summary: getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names Key: SPARK-13641 URL: https://issues.apache.org/jira/browse/SPARK-13641 Project: Spark Issue Type: Bug Components: ML, SparkR Reporter: Xusen Yin Priority: Minor getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names. Let's take the HouseVotes84 data set as an example: {code} case m: XXXModel => val attrs = AttributeGroup.fromStructField( m.summary.predictions.schema(m.summary.featuresCol)) attrs.attributes.get.map(_.name.get) {code} The code above gets feature names from the features column. Usually, the features column is generated by RFormula. The latter has a VectorAssembler in it, which causes the output attribute names to differ from the original ones. E.g., we want the HouseVotes84 feature names "V1, V2, ..., V16", but with RFormula we can only get "V1_n, V2_y, ..., V16_y" because [the transform function of VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75] appends suffixes to the column names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lifeng Wang updated SPARK-13446: Summary: Spark need to support reading data from Hive 2.0.0 metastore (was: Spark need to support reading data from HIve 2.0.0 metastore) > Spark need to support reading data from Hive 2.0.0 metastore > > > Key: SPARK-13446 > URL: https://issues.apache.org/jira/browse/SPARK-13446 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Lifeng Wang > > Spark provides the HiveContext class to read data from the Hive metastore > directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has > been released, it would be better to upgrade to support Hive 2.0.0. > {noformat} > 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI > thrift://hsw-node13:9083 > 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current > connections: 1 > 16/02/23 02:35:02 INFO metastore: Connected to metastore. > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at > org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13640) Synchronize ScalaReflection.mirror method.
[ https://issues.apache.org/jira/browse/SPARK-13640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13640: Assignee: Apache Spark > Synchronize ScalaReflection.mirror method. > -- > > Key: SPARK-13640 > URL: https://issues.apache.org/jira/browse/SPARK-13640 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Apache Spark > > {{ScalaReflection.mirror}} method should be synchronized when scala version > is 2.10 because {{universe.runtimeMirror}} is not thread safe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13640) Synchronize ScalaReflection.mirror method.
[ https://issues.apache.org/jira/browse/SPARK-13640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177301#comment-15177301 ] Apache Spark commented on SPARK-13640: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/11487 > Synchronize ScalaReflection.mirror method. > -- > > Key: SPARK-13640 > URL: https://issues.apache.org/jira/browse/SPARK-13640 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > {{ScalaReflection.mirror}} method should be synchronized when scala version > is 2.10 because {{universe.runtimeMirror}} is not thread safe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13640) Synchronize ScalaReflection.mirror method.
[ https://issues.apache.org/jira/browse/SPARK-13640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13640: Assignee: (was: Apache Spark) > Synchronize ScalaReflection.mirror method. > -- > > Key: SPARK-13640 > URL: https://issues.apache.org/jira/browse/SPARK-13640 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > {{ScalaReflection.mirror}} method should be synchronized when scala version > is 2.10 because {{universe.runtimeMirror}} is not thread safe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13449) Naive Bayes wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177298#comment-15177298 ] Apache Spark commented on SPARK-13449: -- User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/11486 > Naive Bayes wrapper in SparkR > - > > Key: SPARK-13449 > URL: https://issues.apache.org/jira/browse/SPARK-13449 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Following SPARK-13011, we can add a wrapper for naive Bayes in SparkR. R's > naive Bayes implementation is from package e1071 with signature: > {code} > ## S3 method for class 'formula' > naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass) > ## Default S3 method, which we don't want to support > # naiveBayes(x, y, laplace = 0, ...) > ## S3 method for class 'naiveBayes' > predict(object, newdata, > type = c("class", "raw"), threshold = 0.001, eps = 0, ...) > {code} > It should be easy for us to match the parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13449) Naive Bayes wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13449: Assignee: Xusen Yin (was: Apache Spark) > Naive Bayes wrapper in SparkR > - > > Key: SPARK-13449 > URL: https://issues.apache.org/jira/browse/SPARK-13449 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Following SPARK-13011, we can add a wrapper for naive Bayes in SparkR. R's > naive Bayes implementation is from package e1071 with signature: > {code} > ## S3 method for class 'formula' > naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass) > ## Default S3 method, which we don't want to support > # naiveBayes(x, y, laplace = 0, ...) > ## S3 method for class 'naiveBayes' > predict(object, newdata, > type = c("class", "raw"), threshold = 0.001, eps = 0, ...) > {code} > It should be easy for us to match the parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13449) Naive Bayes wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-13449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13449: Assignee: Apache Spark (was: Xusen Yin) > Naive Bayes wrapper in SparkR > - > > Key: SPARK-13449 > URL: https://issues.apache.org/jira/browse/SPARK-13449 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Apache Spark > > Following SPARK-13011, we can add a wrapper for naive Bayes in SparkR. R's > naive Bayes implementation is from package e1071 with signature: > {code} > ## S3 method for class 'formula' > naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass) > ## Default S3 method, which we don't want to support > # naiveBayes(x, y, laplace = 0, ...) > ## S3 method for class 'naiveBayes' > predict(object, newdata, > type = c("class", "raw"), threshold = 0.001, eps = 0, ...) > {code} > It should be easy for us to match the parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13631) getPreferredLocations race condition in spark 1.6.0?
[ https://issues.apache.org/jira/browse/SPARK-13631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177261#comment-15177261 ] Andy Sloane commented on SPARK-13631: - Did some digging with git bisect. It turns out to be directly linked to {{spark.shuffle.reduceLocality.enabled}}. The difference between Spark 1.6 and 1.5 here is that 1.5 has it {{false}} by default, and 1.6 has it {{true}} by default. Setting it to false cures this in 1.6, and setting it to true causes it to re-emerge in 1.5. > getPreferredLocations race condition in spark 1.6.0? > > > Key: SPARK-13631 > URL: https://issues.apache.org/jira/browse/SPARK-13631 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: Andy Sloane > > We are seeing something that looks a lot like a regression from spark 1.2. > When we run jobs with multiple threads, we have a crash somewhere inside > getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside > org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs > instead of DAGScheduler directly. > I tried Spark 1.2 post-SPARK-4454 (before this patch it's only slightly > flaky), 1.4.1, and 1.5.2 and all are fine. 1.6.0 immediately crashes on our > threaded test case, though once in a while it passes. 
> The stack trace is huge, but starts like this: > Caused by: java.lang.NullPointerException: null > at > org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406) > at > org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366) > at > org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545) > The full trace is available here: > https://gist.github.com/andy256/97611f19924bbf65cf49 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13568) Create feature transformer to impute missing values
[ https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172423#comment-15172423 ] yuhao yang edited comment on SPARK-13568 at 3/3/16 5:48 AM: Yes, I'm working on supporting numeric values too. And I agree that imputation for vectors should check the elements in the vector. I intend to support the 3 use cases you mentioned. I'll send a PR after some refinement and performance benchmarking. Thanks. Updated: created a new JIRA to discuss how to handle NaN in Statistics was (Author: yuhaoyan): Yes, I'm working on supporting numeric values too. And I agree about the imputation for vector should check the elements in the vector. I intends to support the 3 use cases you mentioned. I'll send a PR today or tomorrow after some refine and performance benchmark. Thanks > Create feature transformer to impute missing values > --- > > Key: SPARK-13568 > URL: https://issues.apache.org/jira/browse/SPARK-13568 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > It is quite common to encounter missing values in data sets. It would be > useful to implement a {{Transformer}} that can impute missing data points, > similar to e.g. {{Imputer}} in > [scikit-learn|http://scikit-learn.org/dev/modules/preprocessing.html#imputation-of-missing-values]. > Initially, options for imputation could include {{mean}}, {{median}} and > {{most frequent}}, but we could add various other approaches. Where possible > existing DataFrame code can be used (e.g. for approximate quantiles etc). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
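For reference, the {{mean}} strategy discussed in this issue amounts to the following. This is a minimal standalone Java sketch of the per-column imputation logic only, not the eventual {{Transformer}} API:

```java
import java.util.Arrays;

public class ImputerSketch {
    // Replace each NaN with the mean of the non-NaN values in the column;
    // the median and most-frequent strategies would follow the same shape.
    static double[] imputeMean(double[] xs) {
        double sum = 0.0;
        int n = 0;
        for (double x : xs) {
            if (!Double.isNaN(x)) { sum += x; n++; }
        }
        double mean = sum / n;
        double[] out = new double[xs.length];
        for (int i = 0; i < xs.length; i++) {
            out[i] = Double.isNaN(xs[i]) ? mean : xs[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] col = {1.0, Double.NaN, 3.0};
        System.out.println(Arrays.toString(imputeMean(col))); // [1.0, 2.0, 3.0]
    }
}
```

A real implementation would compute the statistic distributedly (e.g. via approximate quantiles for the median, as the description suggests) rather than over a local array.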
[jira] [Updated] (SPARK-13638) Support for saving with a quote mode
[ https://issues.apache.org/jira/browse/SPARK-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-13638: - Description: https://github.com/databricks/spark-csv/pull/254 tobithiel reported this. {quote} I'm dealing with some messy csv files and being able to just quote all fields is very useful, so that other applications don't misunderstand the file because of some sketchy characters {quote} When writing there are several quote modes in apache commons csv. (See https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html) This might have to be supported. However, it looks univocity parser used for writing (it looks currently only this library is supported) does not support this quote mode. I think we can drop this backwards compatibility if we are not going to add apache commons csv. This is a reminder that it will break backwards compatibility for the options, {{quoteMode}}. was: https://github.com/databricks/spark-csv/pull/254 tobithiel reported this. {quote} I'm dealing with some messy csv files and being able to just quote all fields is very useful, so that other applications don't misunderstand the file because of some sketchy characters {quote} When writing there are several quote modes in apache commons csv. (See https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html) This might have to be supported. However, it looks univocity parser used for writing (it looks currently only this library is supported) does not support this quote mode. I think we can drop this backwards compatibility if we are not going to add apache commons csv. This is a reminder that it will break backwards compatibility for the options, {{quoteMode}} and {{parserLib}}. 
> Support for saving with a quote mode > > > Key: SPARK-13638 > URL: https://issues.apache.org/jira/browse/SPARK-13638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/databricks/spark-csv/pull/254 > tobithiel reported this. > {quote} > I'm dealing with some messy csv files and being able to just quote all fields > is very useful, > so that other applications don't misunderstand the file because of some > sketchy characters > {quote} > When writing, there are several quote modes in Apache Commons CSV (see > https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html); > these might have to be supported. > However, it looks like the univocity parser used for writing (currently the only > supported library) does not support these quote modes. I think we can > drop this backward compatibility if we are not going to add Apache Commons > CSV. > This is a reminder that it will break backward compatibility for the > option {{quoteMode}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
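The QUOTE_ALL behavior being requested boils down to quoting every field unconditionally, rather than only fields that contain delimiters or quotes. A minimal standalone Java sketch of that writing behavior (not the actual univocity or Commons CSV API):

```java
public class QuoteAllSketch {
    // QUOTE_ALL-style writing: wrap every field in double quotes and double
    // any embedded quote characters, whether or not the field needs quoting.
    static String writeRow(String[] fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) row.append(',');
            row.append('"').append(fields[i].replace("\"", "\"\"")).append('"');
        }
        return row.toString();
    }

    public static void main(String[] args) {
        System.out.println(writeRow(new String[] {"a", "b,c", "d\"e"}));
        // prints: "a","b,c","d""e"
    }
}
```

The other Commons CSV modes (MINIMAL, NON_NUMERIC, NONE) vary only in the condition deciding whether a given field gets quoted.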
[jira] [Updated] (SPARK-13638) Support for saving with a quote mode
[ https://issues.apache.org/jira/browse/SPARK-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-13638: - Description: https://github.com/databricks/spark-csv/pull/254 tobithiel reported this. {quote} I'm dealing with some messy csv files and being able to just quote all fields is very useful, so that other applications don't misunderstand the file because of some sketchy characters {quote} When writing there are several quote modes in apache commons csv. (See https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html) This might have to be supported. However, it looks univocity parser used for writing does not support this quote mode. I think we can drop this backwards compatibility if we are not going to add apache commons csv. This is a reminder that it will break backwards compatibility for the options, {{quoteMode}} and {{parserLib}}. was: https://github.com/databricks/spark-csv/pull/254 tobithiel reported this. >I'm dealing with some messy csv files and being able to just quote all fields >is very useful, so that other applications don't misunderstand the file >because of some sketchy characters When writing there are several quote modes in apache commons csv. (See https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html) This might have to be supported. However, it looks univocity parser used for writing does not support this quote mode. I think we can drop this backwards compatibility if we are not going to add apache commons csv. This is a reminder that it will break backwards compatibility for the options, {{quoteMode}} and {{parserLib}}. 
> Support for saving with a quote mode > > > Key: SPARK-13638 > URL: https://issues.apache.org/jira/browse/SPARK-13638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/databricks/spark-csv/pull/254 > tobithiel reported this. > {quote} > I'm dealing with some messy csv files and being able to just quote all fields > is very useful, > so that other applications don't misunderstand the file because of some > sketchy characters > {quote} > When writing, there are several quote modes in Apache Commons CSV (see > https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html); > these might have to be supported. > However, it looks like the univocity parser used for writing does not support these > quote modes. I think we can drop this backward compatibility if we are not > going to add Apache Commons CSV. > This is a reminder that it will break backward compatibility for the > options {{quoteMode}} and {{parserLib}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13637) use more information to simplify the code in Expand builder
[ https://issues.apache.org/jira/browse/SPARK-13637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177256#comment-15177256 ] Apache Spark commented on SPARK-13637: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/11485 > use more information to simplify the code in Expand builder > --- > > Key: SPARK-13637 > URL: https://issues.apache.org/jira/browse/SPARK-13637 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13638) Support for saving with a quote mode
[ https://issues.apache.org/jira/browse/SPARK-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-13638: - Description: https://github.com/databricks/spark-csv/pull/254 tobithiel reported this. {quote} I'm dealing with some messy csv files and being able to just quote all fields is very useful, so that other applications don't misunderstand the file because of some sketchy characters {quote} When writing there are several quote modes in apache commons csv. (See https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html) This might have to be supported. However, it looks univocity parser used for writing (it looks currently only this library is supported) does not support this quote mode. I think we can drop this backwards compatibility if we are not going to add apache commons csv. This is a reminder that it will break backwards compatibility for the options, {{quoteMode}} and {{parserLib}}. was: https://github.com/databricks/spark-csv/pull/254 tobithiel reported this. {quote} I'm dealing with some messy csv files and being able to just quote all fields is very useful, so that other applications don't misunderstand the file because of some sketchy characters {quote} When writing there are several quote modes in apache commons csv. (See https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html) This might have to be supported. However, it looks univocity parser used for writing does not support this quote mode. I think we can drop this backwards compatibility if we are not going to add apache commons csv. This is a reminder that it will break backwards compatibility for the options, {{quoteMode}} and {{parserLib}}. 
> Support for saving with a quote mode > > > Key: SPARK-13638 > URL: https://issues.apache.org/jira/browse/SPARK-13638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/databricks/spark-csv/pull/254 > tobithiel reported this. > {quote} > I'm dealing with some messy csv files and being able to just quote all fields > is very useful, > so that other applications don't misunderstand the file because of some > sketchy characters > {quote} > When writing, there are several quote modes in Apache Commons CSV (see > https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html); > these might have to be supported. > However, it looks like the univocity parser used for writing (currently the only > supported library) does not support these quote modes. I think we can > drop this backward compatibility if we are not going to add Apache Commons > CSV. > This is a reminder that it will break backward compatibility for the > options {{quoteMode}} and {{parserLib}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13640) Synchronize ScalaReflection.mirror method.
Takuya Ueshin created SPARK-13640: - Summary: Synchronize ScalaReflection.mirror method. Key: SPARK-13640 URL: https://issues.apache.org/jira/browse/SPARK-13640 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin {{ScalaReflection.mirror}} method should be synchronized when scala version is 2.10 because {{universe.runtimeMirror}} is not thread safe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13637) use more information to simplify the code in Expand builder
[ https://issues.apache.org/jira/browse/SPARK-13637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13637: Assignee: (was: Apache Spark) > use more information to simplify the code in Expand builder > --- > > Key: SPARK-13637 > URL: https://issues.apache.org/jira/browse/SPARK-13637 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13637) use more information to simplify the code in Expand builder
[ https://issues.apache.org/jira/browse/SPARK-13637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13637: Assignee: Apache Spark > use more information to simplify the code in Expand builder > --- > > Key: SPARK-13637 > URL: https://issues.apache.org/jira/browse/SPARK-13637 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13639) Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors
yuhao yang created SPARK-13639: -- Summary: Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors Key: SPARK-13639 URL: https://issues.apache.org/jira/browse/SPARK-13639 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Priority: Trivial {code} val denseData = Array( Vectors.dense(3.8, 0.0, 1.8), Vectors.dense(1.7, 0.9, 0.0), Vectors.dense(Double.NaN, 0, 0.0) ) val rdd = sc.parallelize(denseData) println(Statistics.colStats(rdd).mean) {code} which prints [NaN,0.3,0.6]. This is just a proposal for discussion on how to handle NaN values in the input vectors. We could ignore NaN values in the computation, or keep outputting NaN as it does now and emit a warning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
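For the example above, the NaN-ignoring alternative would yield column means of [2.75, 0.3, 0.6] instead of [NaN, 0.3, 0.6], since the first column averages only its two non-NaN entries. A standalone Java sketch of that proposed behavior (not the actual Statistics.colStats implementation):

```java
import java.util.Arrays;

public class NanMeanSketch {
    // Per-column mean that skips NaN entries -- one possible behavior under
    // discussion; the current behavior propagates NaN into the result.
    static double[] nanMeanByColumn(double[][] rows) {
        int cols = rows[0].length;
        double[] mean = new double[cols];
        for (int j = 0; j < cols; j++) {
            double sum = 0.0;
            int n = 0;
            for (double[] row : rows) {
                if (!Double.isNaN(row[j])) { sum += row[j]; n++; }
            }
            mean[j] = sum / n;
        }
        return mean;
    }

    public static void main(String[] args) {
        double[][] denseData = {
            {3.8, 0.0, 1.8},
            {1.7, 0.9, 0.0},
            {Double.NaN, 0.0, 0.0}
        };
        System.out.println(Arrays.toString(nanMeanByColumn(denseData)));
    }
}
```

Note the design question hiding in the sketch: skipping NaNs means different columns can end up averaged over different counts, which also affects the variance computation mentioned in the summary.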
[jira] [Created] (SPARK-13638) Support for saving with a quote mode
Hyukjin Kwon created SPARK-13638: Summary: Support for saving with a quote mode Key: SPARK-13638 URL: https://issues.apache.org/jira/browse/SPARK-13638 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Hyukjin Kwon Priority: Minor https://github.com/databricks/spark-csv/pull/254 tobithiel reported this. >I'm dealing with some messy csv files and being able to just quote all fields >is very useful, so that other applications don't misunderstand the file >because of some sketchy characters When writing, there are several quote modes in Apache Commons CSV (see https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html); these might have to be supported. However, it looks like the univocity parser used for writing does not support these quote modes. I think we can drop this backward compatibility if we are not going to add Apache Commons CSV. This is a reminder that it will break backward compatibility for the options {{quoteMode}} and {{parserLib}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13637) use more information to simplify the code in Expand builder
Wenchen Fan created SPARK-13637: --- Summary: use more information to simplify the code in Expand builder Key: SPARK-13637 URL: https://issues.apache.org/jira/browse/SPARK-13637 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13634: -- Priority: Minor (was: Major) I doubt it's a Spark problem; this is more a function of how Scala puts things into its closures. Usually you can tinker with equivalent code to find a different version that works as expected. For example, declare a def containing the function you want to map -- that may happen to work. > Assigning spark context to variable results in serialization error > -- > > Key: SPARK-13634 > URL: https://issues.apache.org/jira/browse/SPARK-13634 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Rahul Palamuttam >Priority: Minor > > The following lines of code cause a task serialization error when executed in > the spark-shell. > Note that the error does not occur when submitting the code as a batch job - > via spark-submit. > {code} > val temp = 10 > val newSC = sc > val newRDD = newSC.parallelize(0 to 100).map(p => p + temp) > {code} > For some reason, when temp is pulled into the referencing environment > of the closure, so is the SparkContext. > We originally hit this issue in the SciSpark project, when referencing a > string variable inside of a lambda expression in RDD.map(...) > Any insight into how this could be resolved would be appreciated. > While the above code is trivial, SciSpark uses a wrapper around the > SparkContext to read from various file formats. We want to keep this class > structure and also use it in notebook and shell environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
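The mechanism behind this class of error can be demonstrated without Spark at all: a lambda that reads a field of an enclosing object captures the enclosing `this`, so serializing the lambda drags the whole (possibly non-serializable) outer object along, while copying the field into a local variable first avoids the capture. A standalone Java sketch, where the hypothetical `Holder` stands in for a shell line or a wrapper class holding a SparkContext:

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.IntUnaryOperator;

public class ClosureDemo {
    // A serializable function type, as Spark requires of its closures.
    interface SerFn extends IntUnaryOperator, Serializable {}

    // Deliberately NOT serializable, standing in for a wrapper that holds a
    // SparkContext.
    static class Holder {
        int temp = 10;

        // Reading the field makes the lambda capture `this`, so serializing
        // it drags the whole non-serializable Holder along.
        SerFn capturesOuter() {
            return p -> p + temp;
        }

        // Copy the field into a local first: the lambda captures only an int.
        SerFn capturesLocal() {
            final int t = temp;
            return p -> p + t;
        }
    }

    static boolean serializes(Object f) {
        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f);
            return true;
        } catch (Exception e) {
            return false;  // typically java.io.NotSerializableException
        }
    }

    public static void main(String[] args) {
        Holder h = new Holder();
        System.out.println(serializes(h.capturesLocal()));
        System.out.println(serializes(h.capturesOuter()));
    }
}
```

This is the same local-copy workaround Sean suggests above: `val localTemp = temp` before the `map` gives the Scala closure a primitive to capture instead of the shell's outer object.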
[jira] [Commented] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes
[ https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177169#comment-15177169 ] Bryan Cutler commented on SPARK-13602: -- Great! Thanks :D > o.a.s.deploy.worker.DriverRunner may leak the driver processes > -- > > Key: SPARK-13602 > URL: https://issues.apache.org/jira/browse/SPARK-13602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu > > If Worker calls "System.exit", DriverRunner will not kill the driver > processes. We should add a shutdown hook in DriverRunner like > o.a.s.deploy.worker.ExecutorRunner -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13600: -- Assignee: Oliver Pierson > Incorrect number of buckets in QuantileDiscretizer > -- > > Key: SPARK-13600 > URL: https://issues.apache.org/jira/browse/SPARK-13600 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.0, 2.0.0 >Reporter: Oliver Pierson >Assignee: Oliver Pierson > > Under certain circumstances, QuantileDiscretizer fails to calculate the > correct splits, resulting in an incorrect number of buckets/bins. > E.g. > val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x") > val discretizer = new > QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5) > discretizer.fit(df).getSplits > gives: > Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity) > which corresponds to 6 buckets (not 5). > The problem appears to be in the QuantileDiscretizer.findSplitsCandidates > method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
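The arithmetic behind this report is worth making explicit: an array of n split points defines n - 1 buckets, so 5 buckets require exactly 6 splits, and the 7-element array above yields 6 buckets. A standalone Python check (plain `bisect`, not Spark's Bucketizer, though the `splits[i] <= x < splits[i+1]` bucket convention is the same):

```python
import bisect
import math

# The splits returned in the report above.
splits = [-math.inf, 2.0, 4.0, 6.0, 8.0, 10.0, math.inf]

# n split points define n - 1 buckets: 7 splits -> 6 buckets, not 5.
num_buckets = len(splits) - 1

# Bucketize the input data 1.0..10.0; every one of the 6 buckets gets hit.
data = [float(x) for x in range(1, 11)]
buckets = {bisect.bisect_right(splits, x) - 1 for x in data}
```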
[jira] [Commented] (SPARK-13636) Direct consume UnsafeRow in wholestage codegen plans
[ https://issues.apache.org/jira/browse/SPARK-13636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177146#comment-15177146 ] Apache Spark commented on SPARK-13636: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/11484 > Direct consume UnsafeRow in wholestage codegen plans > > > Key: SPARK-13636 > URL: https://issues.apache.org/jira/browse/SPARK-13636 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > As shown in the whole-stage codegen version of the Sort operator, when Sort is on top > of Exchange (or another operator that produces UnsafeRow), we create > variables from an UnsafeRow, then create another UnsafeRow from those > variables. We should avoid this unnecessary unpacking and repacking of > UnsafeRows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13636) Direct consume UnsafeRow in wholestage codegen plans
[ https://issues.apache.org/jira/browse/SPARK-13636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13636: Assignee: Apache Spark > Direct consume UnsafeRow in wholestage codegen plans > > > Key: SPARK-13636 > URL: https://issues.apache.org/jira/browse/SPARK-13636 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > As shown in the whole-stage codegen version of the Sort operator, when Sort is on top > of Exchange (or another operator that produces UnsafeRow), we create > variables from an UnsafeRow, then create another UnsafeRow from those > variables. We should avoid this unnecessary unpacking and repacking of > UnsafeRows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13636) Direct consume UnsafeRow in wholestage codegen plans
[ https://issues.apache.org/jira/browse/SPARK-13636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13636: Assignee: (was: Apache Spark) > Direct consume UnsafeRow in wholestage codegen plans > > > Key: SPARK-13636 > URL: https://issues.apache.org/jira/browse/SPARK-13636 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > As shown in the whole-stage codegen version of the Sort operator, when Sort is on top > of Exchange (or another operator that produces UnsafeRow), we create > variables from an UnsafeRow, then create another UnsafeRow from those > variables. We should avoid this unnecessary unpacking and repacking of > UnsafeRows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13627) Fix simple deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13627. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.0.0 > Fix simple deprecation warnings > --- > > Key: SPARK-13627 > URL: https://issues.apache.org/jira/browse/SPARK-13627 > Project: Spark > Issue Type: Bug > Components: Examples, SQL, YARN >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > This issue aims to fix the following deprecation warnings. > * MethodSymbolApi.paramss -> paramLists > * AnnotationApi.tpe -> tree.tpe > * BufferLike.readOnly -> toList > * StandardNames.nme -> termNames > * scala.tools.nsc.interpreter.AbstractFileClassLoader -> > scala.reflect.internal.util.AbstractFileClassLoader > * TypeApi.declarations -> decls -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13636) Direct consume UnsafeRow in wholestage codegen plans
Liang-Chi Hsieh created SPARK-13636: --- Summary: Direct consume UnsafeRow in wholestage codegen plans Key: SPARK-13636 URL: https://issues.apache.org/jira/browse/SPARK-13636 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh As shown in the whole-stage codegen version of the Sort operator, when Sort is on top of Exchange (or another operator that produces UnsafeRow), we create variables from an UnsafeRow, then create another UnsafeRow from those variables. We should avoid this unnecessary unpacking and repacking of UnsafeRows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
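The cost described in SPARK-13636 is a round trip: the downstream operator materializes variables out of an already-packed row only to pack an identical row again. A Python sketch with struct-packed rows (the "operators" here are invented stand-ins, not Spark's generated code) illustrates that the round trip is pure overhead when the row layout is unchanged:

```python
import struct

ROW = struct.Struct("<qd")  # a packed row: one long, one double

def upstream():
    """Stands in for Exchange: already produces packed rows."""
    return [ROW.pack(i, i * 1.5) for i in range(4)]

def downstream_unpack_repack(rows):
    """Current behavior: unpack fields into variables, then pack again."""
    out = []
    for r in rows:
        a, b = ROW.unpack(r)        # create variables from the packed row
        out.append(ROW.pack(a, b))  # ...then rebuild an identical row
    return out

def downstream_direct(rows):
    """Proposed behavior: consume the packed rows as-is."""
    return list(rows)

rows = upstream()
# Same bytes either way; the unpack/pack pair did no useful work.
assert downstream_unpack_repack(rows) == downstream_direct(rows)
```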
[jira] [Resolved] (SPARK-13617) remove unnecessary GroupingAnalytics trait
[ https://issues.apache.org/jira/browse/SPARK-13617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13617. - Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 2.0.0 > remove unnecessary GroupingAnalytics trait > -- > > Key: SPARK-13617 > URL: https://issues.apache.org/jira/browse/SPARK-13617 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma closed SPARK-13634. --- Resolution: Duplicate > Assigning spark context to variable results in serialization error > -- > > Key: SPARK-13634 > URL: https://issues.apache.org/jira/browse/SPARK-13634 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Rahul Palamuttam > > The following lines of code cause a task serialization error when executed in > the spark-shell. > Note that the error does not occur when submitting the code as a batch job > via spark-submit. > val temp = 10 > val newSC = sc > val newRDD = newSC.parallelize(0 to 100).map(p => p + temp) > For some reason, when temp is pulled into the referencing environment > of the closure, so is the SparkContext. > We originally hit this issue in the SciSpark project, when referencing a > string variable inside of a lambda expression in RDD.map(...) > Any insight into how this could be resolved would be appreciated. > While the above code is trivial, SciSpark uses a wrapper around the > SparkContext to read from various file formats. We want to keep this class > structure and also use it in notebook and shell environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit
[ https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13635: Assignee: (was: Apache Spark) > Enable LimitPushdown optimizer rule because we have whole-stage codegen for > Limit > - > > Key: SPARK-13635 > URL: https://issues.apache.org/jira/browse/SPARK-13635 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > The LimitPushdown optimizer rule was disabled because Limit had no whole-stage > codegen. Now that Limit has whole-stage codegen, we should enable the rule. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit
[ https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13635: Assignee: Apache Spark > Enable LimitPushdown optimizer rule because we have whole-stage codegen for > Limit > - > > Key: SPARK-13635 > URL: https://issues.apache.org/jira/browse/SPARK-13635 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > The LimitPushdown optimizer rule was disabled because Limit had no whole-stage > codegen. Now that Limit has whole-stage codegen, we should enable the rule. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit
[ https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177108#comment-15177108 ] Apache Spark commented on SPARK-13635: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/11483 > Enable LimitPushdown optimizer rule because we have whole-stage codegen for > Limit > - > > Key: SPARK-13635 > URL: https://issues.apache.org/jira/browse/SPARK-13635 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > The LimitPushdown optimizer rule was disabled because Limit had no whole-stage > codegen. Now that Limit has whole-stage codegen, we should enable the rule. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit
Liang-Chi Hsieh created SPARK-13635: --- Summary: Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit Key: SPARK-13635 URL: https://issues.apache.org/jira/browse/SPARK-13635 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh The LimitPushdown optimizer rule was disabled because Limit had no whole-stage codegen. Now that Limit has whole-stage codegen, we should enable the rule. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
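The rewrite this rule performs can be sanity-checked outside Spark. Modeling child plans as Python lists (an illustrative model, not Spark's implementation), pushing the limit into each side of a union while keeping the outer limit as the final cut returns the same rows:

```python
def limit(rows, n):
    return rows[:n]

def union(*children):
    out = []
    for c in children:
        out.extend(c)
    return out

def plan_original(a, b, n):
    # LIMIT n on top of UNION.
    return limit(union(a, b), n)

def plan_pushed(a, b, n):
    # LimitPushdown: push the limit into each union child so each side
    # produces at most n rows, keeping the outer limit as the final cut.
    return limit(union(limit(a, n), limit(b, n)), n)

a = list(range(100))
b = list(range(100, 200))
for n in (0, 3, 50, 150, 300):
    assert plan_original(a, b, n) == plan_pushed(a, b, n)
```

The payoff in Spark is that each child plan emits at most n rows instead of its full output; the final limit on top preserves the query's result.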
[jira] [Updated] (SPARK-13593) improve the `toDF()` method to accept data type string and verify the data
[ https://issues.apache.org/jira/browse/SPARK-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-13593: Summary: improve the `toDF()` method to accept data type string and verify the data (was: add a `schema()` method to convert python RDD to DataFrame easily) > improve the `toDF()` method to accept data type string and verify the data > -- > > Key: SPARK-13593 > URL: https://issues.apache.org/jira/browse/SPARK-13593 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
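The ticket carries no description, but the new summary implies `toDF()` should accept a compact type string and validate rows against it. A rough Python sketch of that idea -- the mini-parser, the `name: type` string format, and the function names here are all invented for illustration and are not Spark's API:

```python
def parse_schema(type_string):
    """Parse a minimal 'name: type, name: type' string (illustrative only)."""
    types = {"int": int, "double": float, "string": str}
    fields = []
    for part in type_string.split(","):
        name, tname = (s.strip() for s in part.split(":"))
        fields.append((name, types[tname]))
    return fields

def verify(rows, schema):
    """Check each row's arity and field types against the schema."""
    for row in rows:
        if len(row) != len(schema):
            raise ValueError(f"expected {len(schema)} fields, got {len(row)}")
        for value, (name, typ) in zip(row, schema):
            if value is not None and not isinstance(value, typ):
                raise TypeError(f"field '{name}': expected {typ.__name__}")
    return rows

schema = parse_schema("x: int, y: string")
verify([(1, "a"), (2, None)], schema)  # passes; None is allowed as a null
```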
[jira] [Updated] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rahul Palamuttam updated SPARK-13634: - Description: The following lines of code cause a task serialization error when executed in the spark-shell. Note that the error does not occur when submitting the code as a batch job via spark-submit. val temp = 10 val newSC = sc val newRDD = newSC.parallelize(0 to 100).map(p => p + temp) For some reason, when temp is pulled into the referencing environment of the closure, so is the SparkContext. We originally hit this issue in the SciSpark project, when referencing a string variable inside of a lambda expression in RDD.map(...) Any insight into how this could be resolved would be appreciated. While the above code is trivial, SciSpark uses a wrapper around the SparkContext to read from various file formats. We want to keep this class structure and also use it in notebook and shell environments. was: The following lines of code cause a task serialization error when executed in the spark-shell. Note that the error does not occur when submitting the code as a batch job - via spark-submit. val temp = 10 val newSC = sc val new RDD = newSC.parallelize(0 to 100).map(p => p + temp) For some reason when temp is being pulled in to the referencing environment of the closure, so is the SparkContext. We originally hit this issue in the SciSpark project, when referencing a string variable inside of a lambda expression in RDD.map(...) Any insight into how this could be resolved would be appreciated. While the above code is trivial, SciSpark uses wrapper around the SparkContext to read from various file formats. We want to keep this class structure and also use it notebook and shell environments. 
> Assigning spark context to variable results in serialization error > -- > > Key: SPARK-13634 > URL: https://issues.apache.org/jira/browse/SPARK-13634 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Rahul Palamuttam > > The following lines of code cause a task serialization error when executed in > the spark-shell. > Note that the error does not occur when submitting the code as a batch job > via spark-submit. > val temp = 10 > val newSC = sc > val newRDD = newSC.parallelize(0 to 100).map(p => p + temp) > For some reason, when temp is pulled into the referencing environment > of the closure, so is the SparkContext. > We originally hit this issue in the SciSpark project, when referencing a > string variable inside of a lambda expression in RDD.map(...) > Any insight into how this could be resolved would be appreciated. > While the above code is trivial, SciSpark uses a wrapper around the > SparkContext to read from various file formats. We want to keep this class > structure and also use it in notebook and shell environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177093#comment-15177093 ] Rahul Palamuttam commented on SPARK-13634: -- [~chrismattmann] > Assigning spark context to variable results in serialization error > -- > > Key: SPARK-13634 > URL: https://issues.apache.org/jira/browse/SPARK-13634 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Rahul Palamuttam > > The following lines of code cause a task serialization error when executed in > the spark-shell. Note that the error does not occur when submitting the code > as a batch job via spark-submit. > val temp = 10 > val newSC = sc > val newRDD = newSC.parallelize(0 to 100).map(p => p + temp) > For some reason, when temp is pulled into the referencing environment > of the closure, so is the SparkContext. > We originally hit this issue in the SciSpark project, when referencing a > string variable inside of a lambda expression in RDD.map(...) > Any insight into how this could be resolved would be appreciated. > While the above code is trivial, SciSpark uses a wrapper around the > SparkContext to read from various file formats. We want to keep this class > structure and also use it in notebook and shell environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13634) Assigning spark context to variable results in serialization error
Rahul Palamuttam created SPARK-13634: Summary: Assigning spark context to variable results in serialization error Key: SPARK-13634 URL: https://issues.apache.org/jira/browse/SPARK-13634 Project: Spark Issue Type: Bug Components: Spark Shell Reporter: Rahul Palamuttam The following lines of code cause a task serialization error when executed in the spark-shell. Note that the error does not occur when submitting the code as a batch job via spark-submit. val temp = 10 val newSC = sc val newRDD = newSC.parallelize(0 to 100).map(p => p + temp) For some reason, when temp is pulled into the referencing environment of the closure, so is the SparkContext. We originally hit this issue in the SciSpark project, when referencing a string variable inside of a lambda expression in RDD.map(...) Any insight into how this could be resolved would be appreciated. While the above code is trivial, SciSpark uses a wrapper around the SparkContext to read from various file formats. We want to keep this class structure and also use it in notebook and shell environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13632) Create new o.a.s.sql.execution.commands package
[ https://issues.apache.org/jira/browse/SPARK-13632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177082#comment-15177082 ] Apache Spark commented on SPARK-13632: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/11482 > Create new o.a.s.sql.execution.commands package > --- > > Key: SPARK-13632 > URL: https://issues.apache.org/jira/browse/SPARK-13632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13632) Create new o.a.s.sql.execution.commands package
[ https://issues.apache.org/jira/browse/SPARK-13632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13632: Assignee: Apache Spark (was: Andrew Or) > Create new o.a.s.sql.execution.commands package > --- > > Key: SPARK-13632 > URL: https://issues.apache.org/jira/browse/SPARK-13632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13632) Create new o.a.s.sql.execution.commands package
[ https://issues.apache.org/jira/browse/SPARK-13632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13632: Assignee: Andrew Or (was: Apache Spark) > Create new o.a.s.sql.execution.commands package > --- > > Key: SPARK-13632 > URL: https://issues.apache.org/jira/browse/SPARK-13632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13633) Move parser classes to o.a.s.sql.catalyst.parser package
[ https://issues.apache.org/jira/browse/SPARK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-13633: -- Summary: Move parser classes to o.a.s.sql.catalyst.parser package (was: Create new o.a.s.sql.catalyst.parser package) > Move parser classes to o.a.s.sql.catalyst.parser package > > > Key: SPARK-13633 > URL: https://issues.apache.org/jira/browse/SPARK-13633 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13633) Create new o.a.s.sql.catalyst.parser package
Andrew Or created SPARK-13633: - Summary: Create new o.a.s.sql.catalyst.parser package Key: SPARK-13633 URL: https://issues.apache.org/jira/browse/SPARK-13633 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Andrew Or Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13632) Create new o.a.s.sql.execution.commands package
Andrew Or created SPARK-13632: - Summary: Create new o.a.s.sql.execution.commands package Key: SPARK-13632 URL: https://issues.apache.org/jira/browse/SPARK-13632 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Andrew Or Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13614) show() trigger memory leak,why?
[ https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175516#comment-15175516 ] chillon_m edited comment on SPARK-13614 at 3/3/16 2:16 AM: --- @[~srowen] the same size of dataset(hot.count()=599147,ghot.size=21844,10Byte/row),collect don't trigger memory leak(first image),but show() trigger it.why?in general,collect trigger it easily("Keep in mind that your entire dataset must fit in memory on a single machine to use collect() on it, so collect() shouldn’t be used on large datasets." in ),but collect don't trigger. was (Author: chillon_m): @[~srowen] the same size of dataset(hot.count()=599147,ghot.size=21844),collect don't trigger memory leak(first image),but show() trigger it.why?in general,collect trigger it easily("Keep in mind that your entire dataset must fit in memory on a single machine to use collect() on it, so collect() shouldn’t be used on large datasets." in ),but collect don't trigger. > show() trigger memory leak,why? > --- > > Key: SPARK-13614 > URL: https://issues.apache.org/jira/browse/SPARK-13614 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.5.2 >Reporter: chillon_m > Attachments: memory leak.png, memory.png > > > hot.count()=599147 > ghot.size=21844 > [bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell > --driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.5.2 > /_/ > Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80) > Type in expressions to have them evaluated. > Type :help for more information. > Spark context available as sc. > SQL context available as sqlContext. 
> scala> val hot=sqlContext.read.format("jdbc").options(Map("url" -> > "jdbc:mysql://:/?user=&password=","dbtable" -> "")).load() > Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without > server's identity verification is not recommended. According to MySQL > 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established > by default if explicit option isn't set. For compliance with existing > applications not using SSL the verifyServerCertificate property is set to > 'false'. You need either to explicitly disable SSL by setting useSSL=false, > or set useSSL=true and provide truststore for server certificate verification. > hot: org.apache.spark.sql.DataFrame = [] > scala> val ghot=hot.groupBy("Num","pNum").count().collect() > Wed Mar 02 14:22:59 CST 2016 WARN: Establishing SSL connection without > server's identity verification is not recommended. According to MySQL > 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established > by default if explicit option isn't set. For compliance with existing > applications not using SSL the verifyServerCertificate property is set to > 'false'. You need either to explicitly disable SSL by setting useSSL=false, > or set useSSL=true and provide truststore for server certificate verification. > ghot: Array[org.apache.spark.sql.Row] = Array([[],[],[], [,42310... > scala> ghot.take(20) > res0: Array[org.apache.spark.sql.Row] = Array([],[],[],[],[],[],[],[]) > scala> hot.groupBy("Num","pNum").count().show() > Wed Mar 02 14:26:05 CST 2016 WARN: Establishing SSL connection without > server's identity verification is not recommended. According to MySQL > 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established > by default if explicit option isn't set. For compliance with existing > applications not using SSL the verifyServerCertificate property is set to > 'false'. 
You need either to explicitly disable SSL by setting useSSL=false, > or set useSSL=true and provide truststore for server certificate verification. > 16/03/02 14:26:33 ERROR Executor: Managed memory leak detected; size = > 4194304 bytes, TID = 202 > +--+-+-+ > | QQNum| TroopNum|count| > +--+-+-+ > |1X|38XXX|1| > |1X| 5XXX|2| > |1X|26XXX|6| > |1X|14XXX|3| > |1X|41XXX| 14| > |1X|48XXX| 18| > |1X|23XXX|2| > |1X| XXX| 34| > |1X|52XXX|1| > |1X|52XXX|2| > |1X|49XXX|3| > |1X|42XXX|3| > |1X|17XXX| 11| > |1X|25XXX| 129| > |1X|13XXX|2| > |1X|19XXX|1| > |1X|32XXX|9| > |1X|38XXX|6| > |1X|38XXX| 13| > |1X|30XXX|4| > +--+-+-+ > only showing top 20 rows -- This message was sent
[jira] [Comment Edited] (SPARK-13614) show() trigger memory leak,why?
[ https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175516#comment-15175516 ] chillon_m edited comment on SPARK-13614 at 3/3/16 2:14 AM: --- @[~srowen] the same size of dataset(hot.count()=599147,ghot.size=21844),collect don't trigger memory leak(first image),but show() trigger it.why?in general,collect trigger it easily("Keep in mind that your entire dataset must fit in memory on a single machine to use collect() on it, so collect() shouldn’t be used on large datasets." in ),but collect don't trigger. was (Author: chillon_m): [~srowen] the same size of dataset(hot.count()=599147,ghot.size=21844),collect don't trigger memory leak(first image),but show() trigger it.why?in general,collect trigger it easily("Keep in mind that your entire dataset must fit in memory on a single machine to use collect() on it, so collect() shouldn’t be used on large datasets." in ),but collect don't trigger. > show() trigger memory leak,why? > --- > > Key: SPARK-13614 > URL: https://issues.apache.org/jira/browse/SPARK-13614 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.5.2 >Reporter: chillon_m > Attachments: memory leak.png, memory.png > > > hot.count()=599147 > ghot.size=21844 > [bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell > --driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.5.2 > /_/ > Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80) > Type in expressions to have them evaluated. > Type :help for more information. > Spark context available as sc. > SQL context available as sqlContext. 
> scala> val hot=sqlContext.read.format("jdbc").options(Map("url" -> > "jdbc:mysql://:/?user=&password=","dbtable" -> "")).load() > Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without > server's identity verification is not recommended. According to MySQL > 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established > by default if explicit option isn't set. For compliance with existing > applications not using SSL the verifyServerCertificate property is set to > 'false'. You need either to explicitly disable SSL by setting useSSL=false, > or set useSSL=true and provide truststore for server certificate verification. > hot: org.apache.spark.sql.DataFrame = [] > scala> val ghot=hot.groupBy("Num","pNum").count().collect() > Wed Mar 02 14:22:59 CST 2016 WARN: Establishing SSL connection without > server's identity verification is not recommended. According to MySQL > 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established > by default if explicit option isn't set. For compliance with existing > applications not using SSL the verifyServerCertificate property is set to > 'false'. You need either to explicitly disable SSL by setting useSSL=false, > or set useSSL=true and provide truststore for server certificate verification. > ghot: Array[org.apache.spark.sql.Row] = Array([[],[],[], [,42310... > scala> ghot.take(20) > res0: Array[org.apache.spark.sql.Row] = Array([],[],[],[],[],[],[],[]) > scala> hot.groupBy("Num","pNum").count().show() > Wed Mar 02 14:26:05 CST 2016 WARN: Establishing SSL connection without > server's identity verification is not recommended. According to MySQL > 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established > by default if explicit option isn't set. For compliance with existing > applications not using SSL the verifyServerCertificate property is set to > 'false'. 
You need either to explicitly disable SSL by setting useSSL=false, > or set useSSL=true and provide truststore for server certificate verification. > 16/03/02 14:26:33 ERROR Executor: Managed memory leak detected; size = > 4194304 bytes, TID = 202 > +--+-+-+ > | QQNum| TroopNum|count| > +--+-+-+ > |1X|38XXX|1| > |1X| 5XXX|2| > |1X|26XXX|6| > |1X|14XXX|3| > |1X|41XXX| 14| > |1X|48XXX| 18| > |1X|23XXX|2| > |1X| XXX| 34| > |1X|52XXX|1| > |1X|52XXX|2| > |1X|49XXX|3| > |1X|42XXX|3| > |1X|17XXX| 11| > |1X|25XXX| 129| > |1X|13XXX|2| > |1X|19XXX|1| > |1X|32XXX|9| > |1X|38XXX|6| > |1X|38XXX| 13| > |1X|30XXX|4| > +--+-+-+ > only showing top 20 rows -- This message was sent by Atlassian
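To make the asymmetry in the report above concrete, here is a minimal sketch (plain Python, not the Spark API; the names only mirror the DataFrame calls) of the semantic difference: show() consumes just the first N rows of a result iterator and then abandons it, while collect() materializes every row on the driver.

```python
# Illustrative only: mimics the consumption patterns of DataFrame.show()
# (take the first n rows, then stop) versus DataFrame.collect()
# (materialize every row on the driver).
from itertools import islice

def result_rows(n_rows=599147):  # row count taken from the report above
    """Simulate an iterator over a large query result."""
    for i in range(n_rows):
        yield ("1X", "38XXX", i)

def collect(rows):
    # Pulls the entire result into memory at once.
    return list(rows)

def show(rows, n=20):
    # Consumes only the first n rows; the rest of the iterator is abandoned,
    # so downstream operators can be left mid-stream after a show().
    return list(islice(rows, n))
```

The early termination in show() is the notable difference here: an operator stopped mid-stream may not get a chance to release its working memory cleanly, which is one known way to produce the "Managed memory leak detected" executor message seen in the transcript.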
[jira] [Comment Edited] (SPARK-13614) show() trigger memory leak,why?
[ https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175516#comment-15175516 ] chillon_m edited comment on SPARK-13614 at 3/3/16 2:14 AM: --- [~srowen] With the same dataset (hot.count()=599147, ghot.size=21844), collect() does not trigger the memory leak (first image), but show() does. Why? In general, collect() is the call one would expect to trigger it ("Keep in mind that your entire dataset must fit in memory on a single machine to use collect() on it, so collect() shouldn’t be used on large datasets."), yet here collect() does not trigger it. was (Author: chillon_m): With the same dataset, collect() does not trigger the memory leak (first image), but show() does. Why? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-13614) show() trigger memory leak,why?
[ https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chillon_m updated SPARK-13614: -- Comment: was deleted (was: With the same dataset, collect() does not trigger the memory leak (first image), but show() does. Why?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13485) (Dataset-oriented) API evolution in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-13485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13485: Summary: (Dataset-oriented) API evolution in Spark 2.0 (was: Dataset-oriented API foundation in Spark 2.0) > (Dataset-oriented) API evolution in Spark 2.0 > - > > Key: SPARK-13485 > URL: https://issues.apache.org/jira/browse/SPARK-13485 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Attachments: API Evolution in Spark 2.0.pdf > > > As part of Spark 2.0, we want to create a stable API foundation for Dataset > to become the main user-facing API in Spark. This ticket tracks various tasks > related to that. > The main high level changes are: > 1. Merge Dataset/DataFrame > 2. Create a more natural entry point for Dataset (SQLContext is not ideal > because of the name "SQL") > 3. First class support for sessions > 4. First class support for some system catalog -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13485) (Dataset-oriented) API evolution in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-13485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13485: Description: As part of Spark 2.0, we want to create a stable API foundation for Dataset to become the main user-facing API in Spark. This ticket tracks various tasks related to that. The main high level changes are: 1. Merge Dataset/DataFrame 2. Create a more natural entry point for Dataset (SQLContext is not ideal because of the name "SQL") 3. First class support for sessions 4. First class support for some system catalog See the design doc for more details. was: As part of Spark 2.0, we want to create a stable API foundation for Dataset to become the main user-facing API in Spark. This ticket tracks various tasks related to that. The main high level changes are: 1. Merge Dataset/DataFrame 2. Create a more natural entry point for Dataset (SQLContext is not ideal because of the name "SQL") 3. First class support for sessions 4. First class support for some system catalog > (Dataset-oriented) API evolution in Spark 2.0 > - > > Key: SPARK-13485 > URL: https://issues.apache.org/jira/browse/SPARK-13485 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Attachments: API Evolution in Spark 2.0.pdf > > > As part of Spark 2.0, we want to create a stable API foundation for Dataset > to become the main user-facing API in Spark. This ticket tracks various tasks > related to that. > The main high level changes are: > 1. Merge Dataset/DataFrame > 2. Create a more natural entry point for Dataset (SQLContext is not ideal > because of the name "SQL") > 3. First class support for sessions > 4. First class support for some system catalog > See the design doc for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13485) Dataset-oriented API foundation in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-13485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13485: Attachment: API Evolution in Spark 2.0.pdf > Dataset-oriented API foundation in Spark 2.0 > > > Key: SPARK-13485 > URL: https://issues.apache.org/jira/browse/SPARK-13485 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Attachments: API Evolution in Spark 2.0.pdf > > > As part of Spark 2.0, we want to create a stable API foundation for Dataset > to become the main user-facing API in Spark. This ticket tracks various tasks > related to that. > The main high level changes are: > 1. Merge Dataset/DataFrame > 2. Create a more natural entry point for Dataset (SQLContext is not ideal > because of the name "SQL") > 3. First class support for sessions > 4. First class support for some system catalog -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13583) Remove unused imports and add checkstyle rule
[ https://issues.apache.org/jira/browse/SPARK-13583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-13583: -- Summary: Remove unused imports and add checkstyle rule (was: Support `UnusedImports` Java checkstyle rule) > Remove unused imports and add checkstyle rule > - > > Key: SPARK-13583 > URL: https://issues.apache.org/jira/browse/SPARK-13583 > Project: Spark > Issue Type: Task > Components: Spark Core, Streaming >Reporter: Dongjoon Hyun >Priority: Minor > > After SPARK-6990, `dev/lint-java` keeps Java code healthy and saves much time in PR review. > This issue aims to enforce the `UnusedImports` rule by adding it to `checkstyle.xml` and fixing all existing unused imports. > {code:title=checkstyle.xml|borderStyle=solid} > + <module name="UnusedImports"/> > {code} > Unfortunately, `dev/lint-java` is not tested by Jenkins. ( https://github.com/apache/spark/blob/master/dev/run-tests.py#L546 ) > This will also help Spark contributors to check by themselves before > submitting their PRs. > According to [~srowen]'s comments, this PR also includes the removal of > unused imports in Scala code. It will be done manually for the following two reasons. > * Scalastyle does not have an `UnusedImport` rule yet. > * Scala 2.11.7 has a bug with the `-Ywarn-unused-import` option. > (https://issues.scala-lang.org/browse/SI-9616) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes
[ https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176844#comment-15176844 ] Shixiong Zhu commented on SPARK-13602: -- Sure. Go ahead. > o.a.s.deploy.worker.DriverRunner may leak the driver processes > -- > > Key: SPARK-13602 > URL: https://issues.apache.org/jira/browse/SPARK-13602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu > > If Worker calls "System.exit", DriverRunner will not kill the driver > processes. We should add a shutdown hook in DriverRunner like > o.a.s.deploy.worker.ExecutorRunner -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
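The fix proposed above can be sketched as follows (plain Python standing in for the Scala/JVM code; `launch_driver` and the hook are illustrative, not the actual DriverRunner API): register a process-exit hook that kills the child process, mirroring what ExecutorRunner does with its JVM shutdown hook.

```python
import atexit
import subprocess

def launch_driver(cmd):
    """Launch a child 'driver' process and ensure it cannot outlive us.

    Analogue of adding a shutdown hook in DriverRunner: even if the parent
    exits abruptly (the Worker calling System.exit in Spark's case), the
    registered hook still kills the child, so no process is leaked.
    """
    proc = subprocess.Popen(cmd)

    def _kill_child():
        if proc.poll() is None:  # child still running
            proc.kill()

    atexit.register(_kill_child)
    return proc
```

Python's atexit hooks run on normal interpreter exit, which roughly matches JVM shutdown hooks running on System.exit; neither mechanism helps if the parent is killed with SIGKILL.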
[jira] [Updated] (SPARK-13627) Fix simple deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-13627: -- Component/s: (was: PySpark) > Fix simple deprecation warnings > --- > > Key: SPARK-13627 > URL: https://issues.apache.org/jira/browse/SPARK-13627 > Project: Spark > Issue Type: Bug > Components: Examples, SQL, YARN >Reporter: Dongjoon Hyun >Priority: Minor > > This issue aims to fix the following deprecation warnings. > * MethodSymbolApi.paramss--> paramLists > * AnnotationApi.tpe -> tree.tpe > * BufferLike.readOnly -> toList. > * StandardNames.nme -> termNames > * scala.tools.nsc.interpreter.AbstractFileClassLoader -> > scala.reflect.internal.util.AbstractFileClassLoader > * TypeApi.declarations-> decls -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13630) Add optimizer rule to collapse sorts
[ https://issues.apache.org/jira/browse/SPARK-13630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176810#comment-15176810 ] Apache Spark commented on SPARK-13630: -- User 'skambha' has created a pull request for this issue: https://github.com/apache/spark/pull/11480 > Add optimizer rule to collapse sorts > > > Key: SPARK-13630 > URL: https://issues.apache.org/jira/browse/SPARK-13630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunitha Kambhampati > Fix For: 2.0.0 > > > It is possible to collapse adjacent sorts and keep the last one.This > task is to add optimizer rule to collapse adjacent sorts if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13630) Add optimizer rule to collapse sorts
[ https://issues.apache.org/jira/browse/SPARK-13630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13630: Assignee: (was: Apache Spark) > Add optimizer rule to collapse sorts > > > Key: SPARK-13630 > URL: https://issues.apache.org/jira/browse/SPARK-13630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunitha Kambhampati > Fix For: 2.0.0 > > > It is possible to collapse adjacent sorts and keep the last one.This > task is to add optimizer rule to collapse adjacent sorts if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13630) Add optimizer rule to collapse sorts
[ https://issues.apache.org/jira/browse/SPARK-13630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176811#comment-15176811 ] Sunitha Kambhampati commented on SPARK-13630: - Here is the pull request with changes: https://github.com/apache/spark/pull/11480 > Add optimizer rule to collapse sorts > > > Key: SPARK-13630 > URL: https://issues.apache.org/jira/browse/SPARK-13630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunitha Kambhampati > Fix For: 2.0.0 > > > It is possible to collapse adjacent sorts and keep the last one.This > task is to add optimizer rule to collapse adjacent sorts if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13630) Add optimizer rule to collapse sorts
[ https://issues.apache.org/jira/browse/SPARK-13630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13630: Assignee: Apache Spark > Add optimizer rule to collapse sorts > > > Key: SPARK-13630 > URL: https://issues.apache.org/jira/browse/SPARK-13630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunitha Kambhampati >Assignee: Apache Spark > Fix For: 2.0.0 > > > It is possible to collapse adjacent sorts and keep the last one.This > task is to add optimizer rule to collapse adjacent sorts if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13631) getPreferredLocations race condition in spark 1.6.0?
Andy Sloane created SPARK-13631: --- Summary: getPreferredLocations race condition in spark 1.6.0? Key: SPARK-13631 URL: https://issues.apache.org/jira/browse/SPARK-13631 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.6.0 Reporter: Andy Sloane We are seeing something that looks a lot like a regression from spark 1.2. When we run jobs with multiple threads, we have a crash somewhere inside getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs instead of DAGScheduler directly. I tried Spark 1.2 post-SPARK-4454 (before this patch it's only slightly flaky), 1.4.1, and 1.5.2 and all are fine. 1.6.0 immediately crashes on our threaded test case, though once in a while it passes. The stack trace is huge, but starts like this: Caused by: java.lang.NullPointerException: null at org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406) at org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366) at org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257) at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545) The full trace is available here: https://gist.github.com/andy256/97611f19924bbf65cf49 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
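Without the actual fix in hand, the failure mode described above reads like a classic check-then-act race on shared scheduler state: one thread walks the map output statuses while another thread nulls entries out. A toy sketch (plain Python, hypothetical names, not Spark's code) of reading a consistent snapshot under a lock:

```python
import threading

class OutputTracker:
    """Toy stand-in for a registry of map output statuses shared across
    scheduler threads (all names are illustrative, not Spark's)."""

    def __init__(self, hosts):
        self._lock = threading.Lock()
        self._statuses = list(hosts)

    def unregister(self, i):
        with self._lock:
            self._statuses[i] = None  # another thread may do this mid-query

    def locations_with_largest_outputs(self):
        # Take a snapshot under the lock, so a concurrent unregister()
        # cannot null an entry between the None-check and the dereference —
        # the interleaving that would produce an NPE like the one reported.
        with self._lock:
            snapshot = list(self._statuses)
        return [s.upper() for s in snapshot if s is not None]
```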
[jira] [Commented] (SPARK-13627) Fix simple deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176802#comment-15176802 ] Apache Spark commented on SPARK-13627: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/11479 > Fix simple deprecation warnings > --- > > Key: SPARK-13627 > URL: https://issues.apache.org/jira/browse/SPARK-13627 > Project: Spark > Issue Type: Bug > Components: Examples, PySpark, SQL, YARN >Reporter: Dongjoon Hyun >Priority: Minor > > This issue aims to fix the following deprecation warnings. > * MethodSymbolApi.paramss--> paramLists > * AnnotationApi.tpe -> tree.tpe > * BufferLike.readOnly -> toList. > * StandardNames.nme -> termNames > * scala.tools.nsc.interpreter.AbstractFileClassLoader -> > scala.reflect.internal.util.AbstractFileClassLoader > * TypeApi.declarations-> decls -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13627) Fix simple deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13627: Assignee: (was: Apache Spark) > Fix simple deprecation warnings > --- > > Key: SPARK-13627 > URL: https://issues.apache.org/jira/browse/SPARK-13627 > Project: Spark > Issue Type: Bug > Components: Examples, PySpark, SQL, YARN >Reporter: Dongjoon Hyun >Priority: Minor > > This issue aims to fix the following deprecation warnings. > * MethodSymbolApi.paramss--> paramLists > * AnnotationApi.tpe -> tree.tpe > * BufferLike.readOnly -> toList. > * StandardNames.nme -> termNames > * scala.tools.nsc.interpreter.AbstractFileClassLoader -> > scala.reflect.internal.util.AbstractFileClassLoader > * TypeApi.declarations-> decls -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13627) Fix simple deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13627: Assignee: Apache Spark > Fix simple deprecation warnings > --- > > Key: SPARK-13627 > URL: https://issues.apache.org/jira/browse/SPARK-13627 > Project: Spark > Issue Type: Bug > Components: Examples, PySpark, SQL, YARN >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > This issue aims to fix the following deprecation warnings. > * MethodSymbolApi.paramss--> paramLists > * AnnotationApi.tpe -> tree.tpe > * BufferLike.readOnly -> toList. > * StandardNames.nme -> termNames > * scala.tools.nsc.interpreter.AbstractFileClassLoader -> > scala.reflect.internal.util.AbstractFileClassLoader > * TypeApi.declarations-> decls -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13602) o.a.s.deploy.worker.DriverRunner may leak the driver processes
[ https://issues.apache.org/jira/browse/SPARK-13602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176787#comment-15176787 ] Bryan Cutler commented on SPARK-13602: -- Hi [~zsxwing], mind if I work on this one? > o.a.s.deploy.worker.DriverRunner may leak the driver processes > -- > > Key: SPARK-13602 > URL: https://issues.apache.org/jira/browse/SPARK-13602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu > > If Worker calls "System.exit", DriverRunner will not kill the driver > processes. We should add a shutdown hook in DriverRunner like > o.a.s.deploy.worker.ExecutorRunner -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13630) Add optimizer rule to collapse sorts
Sunitha Kambhampati created SPARK-13630: --- Summary: Add optimizer rule to collapse sorts Key: SPARK-13630 URL: https://issues.apache.org/jira/browse/SPARK-13630 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.0 Reporter: Sunitha Kambhampati Fix For: 2.0.0 It is possible to collapse adjacent sorts and keep only the last one. This task is to add an optimizer rule that collapses adjacent sorts where possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
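The rule is easy to picture on a toy plan tree (plain Python, not Catalyst; node and rule names are illustrative): when a Sort's child is itself a Sort, the inner ordering is shadowed by the outer one and can be dropped.

```python
class Node:
    """Minimal logical-plan node: operator name, sort keys, one child."""
    def __init__(self, name, keys=None, child=None):
        self.name, self.keys, self.child = name, keys, child

def collapse_sorts(plan):
    """Bottom-up rewrite: drop any Sort directly beneath another Sort,
    keeping the outermost (i.e. last-applied) ordering."""
    if plan is None:
        return None
    child = collapse_sorts(plan.child)
    while plan.name == "Sort" and child is not None and child.name == "Sort":
        child = child.child  # inner sort is shadowed by the outer one
    return Node(plan.name, plan.keys, child)
```

A real rule would also have to verify that nothing order-sensitive sits between the two sorts; here the nodes are adjacent by construction, which is the case the ticket targets.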
[jira] [Updated] (SPARK-13627) Fix simple deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-13627: -- Description: This issue aims to fix the following deprecation warnings. * MethodSymbolApi.paramss--> paramLists * AnnotationApi.tpe -> tree.tpe * BufferLike.readOnly -> toList. * StandardNames.nme -> termNames * scala.tools.nsc.interpreter.AbstractFileClassLoader -> scala.reflect.internal.util.AbstractFileClassLoader * TypeApi.declarations-> decls was: This issue aims to fix the following 21 deprecation warnings. * (6) MethodSymbolApi.paramss--> paramLists * (4) AnnotationApi.tpe -> tree.tpe * (3) BufferLike.readOnly -> toList. * (3) StandardNames.nme -> termNames * (3) scala.tools.nsc.interpreter.AbstractFileClassLoader -> scala.reflect.internal.util.AbstractFileClassLoader * (2) TypeApi.declarations-> decls > Fix simple deprecation warnings > --- > > Key: SPARK-13627 > URL: https://issues.apache.org/jira/browse/SPARK-13627 > Project: Spark > Issue Type: Bug > Components: Examples, PySpark, SQL, YARN >Reporter: Dongjoon Hyun >Priority: Minor > > This issue aims to fix the following deprecation warnings. > * MethodSymbolApi.paramss--> paramLists > * AnnotationApi.tpe -> tree.tpe > * BufferLike.readOnly -> toList. > * StandardNames.nme -> termNames > * scala.tools.nsc.interpreter.AbstractFileClassLoader -> > scala.reflect.internal.util.AbstractFileClassLoader > * TypeApi.declarations-> decls -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13629) Add binary toggle Param to CountVectorizer
Joseph K. Bradley created SPARK-13629: - Summary: Add binary toggle Param to CountVectorizer Key: SPARK-13629 URL: https://issues.apache.org/jira/browse/SPARK-13629 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Priority: Minor It would be handy to add a binary toggle Param to CountVectorizer, as in the scikit-learn one: [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html] If set, then all non-zero counts will be set to 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
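The proposed toggle is simple to state (sketch in plain Python, not the ML API; the function and parameter names are illustrative): with binary=True, every non-zero term count in the output vector becomes 1, matching scikit-learn's CountVectorizer(binary=True).

```python
def count_vector(doc_tokens, vocabulary, binary=False):
    """Map a tokenized document to a count vector over a fixed vocabulary.

    With binary=True, presence/absence replaces raw counts, which some
    discrete models (e.g. Bernoulli naive Bayes) expect.
    """
    counts = [0] * len(vocabulary)
    index = {term: i for i, term in enumerate(vocabulary)}
    for tok in doc_tokens:
        if tok in index:
            counts[index[tok]] += 1
    if binary:
        counts = [1 if c > 0 else 0 for c in counts]
    return counts
```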
[jira] [Created] (SPARK-13628) Temporary intermediate output file should be renamed before copying to destination filesystem
Chen He created SPARK-13628: --- Summary: Temporary intermediate output file should be renamed before copying to destination filesystem Key: SPARK-13628 URL: https://issues.apache.org/jira/browse/SPARK-13628 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 1.6.0 Reporter: Chen He A Spark Executor will dump a temporary file into a local temp dir, copy it to the destination filesystem, and then rename it. This can be costly for a blobstore (such as OpenStack Swift) that does the actual copy when a file is renamed. If it does not affect other components, we may switch the sequence of copy and rename so that Spark can use a blobstore as the final output destination. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
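The proposed reordering can be sketched like this (plain Python; `commit_output` and the layout are illustrative, not Spark's output committer): give the file its final name while it is still local, so a blobstore whose rename is implemented as copy-plus-delete never copies the bytes twice.

```python
import os
import shutil
import tempfile

def commit_output(data: bytes, dest_dir: str, final_name: str) -> str:
    """Write to a local temp file, rename it locally (cheap, metadata-only),
    then copy the already-final-named file to the destination exactly once."""
    tmp_dir = tempfile.mkdtemp()
    tmp_path = os.path.join(tmp_dir, "part-tmp")
    with open(tmp_path, "wb") as f:
        f.write(data)
    local_final = os.path.join(tmp_dir, final_name)
    os.rename(tmp_path, local_final)      # local rename: metadata only
    dest_path = os.path.join(dest_dir, final_name)
    shutil.copy(local_final, dest_path)   # single copy to the destination
    shutil.rmtree(tmp_dir)
    return dest_path
```

Under the current copy-then-rename order, the final rename happens on the destination filesystem, and a rename-as-copy blobstore ends up transferring the data a second time.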
[jira] [Commented] (SPARK-13465) Add a task failure listener to TaskContext
[ https://issues.apache.org/jira/browse/SPARK-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176750#comment-15176750 ] Apache Spark commented on SPARK-13465: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11478 > Add a task failure listener to TaskContext > -- > > Key: SPARK-13465 > URL: https://issues.apache.org/jira/browse/SPARK-13465 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > TaskContext supports task completion callback, which gets called regardless > of task failures. However, there is no way for the listener to know if there > is an error. This ticket proposes adding a new listener that gets called when > a task fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
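A minimal sketch of the semantics described above (plain Python, not the TaskContext API; method names are illustrative): completion listeners always fire, while the proposed failure listeners fire only on error and receive the exception.

```python
class TaskContext:
    """Toy task context supporting both listener kinds."""
    def __init__(self):
        self._on_completion = []
        self._on_failure = []

    def add_task_completion_listener(self, fn):
        self._on_completion.append(fn)

    def add_task_failure_listener(self, fn):
        self._on_failure.append(fn)

    def run_task(self, body):
        try:
            return body()
        except Exception as exc:
            for fn in self._on_failure:     # proposed: listeners see the error
                fn(exc)
            raise
        finally:
            for fn in self._on_completion:  # existing: fires either way
                fn()
```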
[jira] [Commented] (SPARK-13161) Extend MLlib LDA to include options for Author Topic Modeling
[ https://issues.apache.org/jira/browse/SPARK-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176747#comment-15176747 ] Joseph K. Bradley commented on SPARK-13161: --- There are many generalizations of LDA, so it would be valuable to know about people's use cases and needs. Do you have a use case you could describe for this? It would be great to have this feature as a Spark package in the meantime. > Extend MLlib LDA to include options for Author Topic Modeling > - > > Key: SPARK-13161 > URL: https://issues.apache.org/jira/browse/SPARK-13161 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.0 >Reporter: John Hogue > > The author-topic model is a generative model for documents that extends Latent Dirichlet Allocation. > By modeling the interests of authors, we can answer a range of important > queries about the content of document collections. With an appropriate author > model, we can establish which subjects an author writes about, which authors > are likely to have written documents similar to an observed document, and > which authors produce similar work. > Full whitepaper here: > http://mimno.infosci.cornell.edu/info6150/readings/398.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13161) Extend MLlib LDA to include options for Author Topic Modeling
[ https://issues.apache.org/jira/browse/SPARK-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-13161: -- Priority: Minor (was: Major) > Extend MLlib LDA to include options for Author Topic Modeling > - > > Key: SPARK-13161 > URL: https://issues.apache.org/jira/browse/SPARK-13161 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.0 >Reporter: John Hogue >Priority: Minor > > The author-topic model is a generative model for documents that extends Latent > Dirichlet Allocation. > By modeling the interests of authors, we can answer a range of important > queries about the content of document collections. With an appropriate author > model, we can establish which subjects an author writes about, which authors > are likely to have written documents similar to an observed document, and > which authors produce similar work. > Full whitepaper here: > http://mimno.infosci.cornell.edu/info6150/readings/398.pdf
[jira] [Created] (SPARK-13627) Fix simple deprecation warnings
Dongjoon Hyun created SPARK-13627: - Summary: Fix simple deprecation warnings Key: SPARK-13627 URL: https://issues.apache.org/jira/browse/SPARK-13627 Project: Spark Issue Type: Bug Components: Examples, PySpark, SQL, YARN Reporter: Dongjoon Hyun Priority: Minor This issue aims to fix the following 21 deprecation warnings. * (6) MethodSymbolApi.paramss -> paramLists * (4) AnnotationApi.tpe -> tree.tpe * (3) BufferLike.readOnly -> toList * (3) StandardNames.nme -> termNames * (3) scala.tools.nsc.interpreter.AbstractFileClassLoader -> scala.reflect.internal.util.AbstractFileClassLoader * (2) TypeApi.declarations -> decls
[jira] [Commented] (SPARK-12925) Improve HiveInspectors.unwrap for StringObjectInspector.getPrimitiveWritableObject
[ https://issues.apache.org/jira/browse/SPARK-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176734#comment-15176734 ] Rajesh Balamohan commented on SPARK-12925: -- The earlier fix had a problem when the Text object was reused. Posting a revised patch for review that fixes that case. > Improve HiveInspectors.unwrap for > StringObjectInspector.getPrimitiveWritableObject > -- > > Key: SPARK-12925 > URL: https://issues.apache.org/jira/browse/SPARK-12925 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Fix For: 2.0.0 > > Attachments: SPARK-12925_profiler_cpu_samples.png > > > Text is in UTF-8 and converting it via "UTF8String.fromString" incurs > decoding and encoding, which turns out to be expensive. (to be specific: > https://github.com/apache/spark/blob/0d543b98f3e3da5053f0476f4647a765460861f3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L323)
[jira] [Commented] (SPARK-12925) Improve HiveInspectors.unwrap for StringObjectInspector.getPrimitiveWritableObject
[ https://issues.apache.org/jira/browse/SPARK-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176729#comment-15176729 ] Apache Spark commented on SPARK-12925: -- User 'rajeshbalamohan' has created a pull request for this issue: https://github.com/apache/spark/pull/11477 > Improve HiveInspectors.unwrap for > StringObjectInspector.getPrimitiveWritableObject > -- > > Key: SPARK-12925 > URL: https://issues.apache.org/jira/browse/SPARK-12925 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Fix For: 2.0.0 > > Attachments: SPARK-12925_profiler_cpu_samples.png > > > Text is in UTF-8 and converting it via "UTF8String.fromString" incurs > decoding and encoding, which turns out to be expensive. (to be specific: > https://github.com/apache/spark/blob/0d543b98f3e3da5053f0476f4647a765460861f3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L323)
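The reuse pitfall mentioned in the comment above is generic: Hadoop's Text objects are mutated in place between records, so any conversion that keeps a reference to the backing bytes must take a defensive copy. A minimal Python illustration of the same hazard with a reused buffer (this is not Spark's HiveInspectors code, just the pattern):

```python
# A reader that reuses one buffer for every record, like Hadoop's Text.
buf = bytearray(5)
records = [b"hello", b"world"]


def read_all(copy_bytes):
    out = []
    for rec in records:
        buf[:] = rec  # mutate the shared buffer in place
        view = memoryview(buf)
        # With copy_bytes=True we snapshot the bytes; otherwise we keep
        # a live view into the buffer that the next record will overwrite.
        out.append(bytes(view) if copy_bytes else view)
    return out


# Without a copy, every retained value reflects the last record written.
no_copy = [bytes(v) for v in read_all(copy_bytes=False)]
with_copy = read_all(copy_bytes=True)
print(no_copy)    # both entries show the final buffer contents
print(with_copy)  # each entry preserves its own record
```

This is the same reason a `getPrimitiveWritableObject`-based fast path has to copy the Text's bytes rather than wrap them.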
[jira] [Created] (SPARK-13626) SparkConf deprecation log messages are printed multiple times
Marcelo Vanzin created SPARK-13626: -- Summary: SparkConf deprecation log messages are printed multiple times Key: SPARK-13626 URL: https://issues.apache.org/jira/browse/SPARK-13626 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 2.0.0 Reporter: Marcelo Vanzin Priority: Minor I noticed that if I have a deprecated config in my spark-defaults.conf, I'll see multiple warnings when running, for example, spark-shell. I collected the backtrace from when the messages are printed, and here are a few instances. The first one is the only one I expect to be printed. {noformat} java.lang.Exception: ... at org.apache.spark.SparkConf.<init>(SparkConf.scala:53) at org.apache.spark.repl.Main$.<init>(Main.scala:30) {noformat} The following ones are causing duplicate log messages and we should clean those up: {noformat} java.lang.Exception: at org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) ... at org.apache.spark.SparkConf.<init>(SparkConf.scala:53) at org.apache.spark.repl.Main$.createSparkContext(Main.scala:82) {noformat} {noformat} java.lang.Exception: at org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) ... at org.apache.spark.SparkConf.setAll(SparkConf.scala:139) at org.apache.spark.SparkConf.clone(SparkConf.scala:358) at org.apache.spark.SparkContext.<init>(SparkContext.scala:368) at org.apache.spark.repl.Main$.createSparkContext(Main.scala:98) {noformat} There are also a few more caused by the use of {{SparkConf.clone()}}. {noformat} java.lang.Exception: at org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) ... at org.apache.spark.SparkConf.<init>(SparkConf.scala:59) at org.apache.spark.SparkConf.<init>(SparkConf.scala:53) at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:48) {noformat} {noformat} java.lang.Exception: at org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) ...
at org.apache.spark.SparkConf.<init>(SparkConf.scala:53) at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:93) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238) {noformat} {noformat} java.lang.Exception: at org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) ... at org.apache.spark.SparkConf.<init>(SparkConf.scala:53) at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:93) {noformat}
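A common way to fix the duplication described in SPARK-13626 is to remember, process-wide, which keys have already been warned about, so that clones and copies of the config don't re-report them. A hedged sketch of that log-once pattern (not the actual SparkConf change; the key and message below are only examples):

```python
import threading

_warned = set()
_lock = threading.Lock()
log_lines = []  # stand-in for a real logger


def log_deprecation_once(key, message):
    """Emit a deprecation warning only the first time a key is seen,
    no matter how many SparkConf copies/clones re-read the config."""
    with _lock:
        if key in _warned:
            return
        _warned.add(key)
    log_lines.append(f"WARN: config '{key}' is deprecated: {message}")


# Simulate the clone()/setAll() code paths each re-reporting the same key.
for _ in range(4):
    log_deprecation_once("spark.some.deprecated.key",
                         "use the replacement key instead")
print(log_lines)  # exactly one warning despite four reports
```

The set lives at module scope rather than on the config object precisely because `clone()` produces new objects; per-instance state would reintroduce the duplicates.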
[jira] [Assigned] (SPARK-13625) PySpark-ML method to get list of params for an obj should not check property attr
[ https://issues.apache.org/jira/browse/SPARK-13625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13625: Assignee: Apache Spark > PySpark-ML method to get list of params for an obj should not check property > attr > - > > Key: SPARK-13625 > URL: https://issues.apache.org/jira/browse/SPARK-13625 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Apache Spark > > In PySpark params.__init__.py, the method {{Param.params()}} returns a list > of Params belonging to that object. This method should not check whether an > attribute is an instance of {{Param}} if that attribute is a property (uses the > {{@property}} decorator). Doing so causes the property to be invoked to 'get' > the attribute, and that can lead to an error, depending on the property. If > an attribute is a property it will not be an ML {{Param}}, so there is no need to > check it. > I came across this while working on SPARK-13430, adding > {{LinearRegressionModel.summary}} as a property to give a training summary, > similar to the Scala API. It is possible that a training summary does not > exist, in which case invoking the {{summary}} property raises an exception. > Calling {{getattr(self, x)}} will cause the property to be invoked if {{x}} > is a property. To fix this, we just need to check whether it is a class property before > making the call to {{getattr()}} in {{Param.params()}}.
[jira] [Assigned] (SPARK-13625) PySpark-ML method to get list of params for an obj should not check property attr
[ https://issues.apache.org/jira/browse/SPARK-13625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13625: Assignee: (was: Apache Spark) > PySpark-ML method to get list of params for an obj should not check property > attr > - > > Key: SPARK-13625 > URL: https://issues.apache.org/jira/browse/SPARK-13625 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler > > In PySpark params.__init__.py, the method {{Param.params()}} returns a list > of Params belonging to that object. This method should not check whether an > attribute is an instance of {{Param}} if that attribute is a property (uses the > {{@property}} decorator). Doing so causes the property to be invoked to 'get' > the attribute, and that can lead to an error, depending on the property. If > an attribute is a property it will not be an ML {{Param}}, so there is no need to > check it. > I came across this while working on SPARK-13430, adding > {{LinearRegressionModel.summary}} as a property to give a training summary, > similar to the Scala API. It is possible that a training summary does not > exist, in which case invoking the {{summary}} property raises an exception. > Calling {{getattr(self, x)}} will cause the property to be invoked if {{x}} > is a property. To fix this, we just need to check whether it is a class property before > making the call to {{getattr()}} in {{Param.params()}}.
[jira] [Commented] (SPARK-13625) PySpark-ML method to get list of params for an obj should not check property attr
[ https://issues.apache.org/jira/browse/SPARK-13625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176715#comment-15176715 ] Apache Spark commented on SPARK-13625: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/11476 > PySpark-ML method to get list of params for an obj should not check property > attr > - > > Key: SPARK-13625 > URL: https://issues.apache.org/jira/browse/SPARK-13625 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler > > In PySpark params.__init__.py, the method {{Param.params()}} returns a list > of Params belonging to that object. This method should not check whether an > attribute is an instance of {{Param}} if that attribute is a property (uses the > {{@property}} decorator). Doing so causes the property to be invoked to 'get' > the attribute, and that can lead to an error, depending on the property. If > an attribute is a property it will not be an ML {{Param}}, so there is no need to > check it. > I came across this while working on SPARK-13430, adding > {{LinearRegressionModel.summary}} as a property to give a training summary, > similar to the Scala API. It is possible that a training summary does not > exist, in which case invoking the {{summary}} property raises an exception. > Calling {{getattr(self, x)}} will cause the property to be invoked if {{x}} > is a property. To fix this, we just need to check whether it is a class property before > making the call to {{getattr()}} in {{Param.params()}}.
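The fix the issue describes — inspect the class attribute first, and skip it if it is a property, so `getattr` on the instance never triggers the property getter — can be sketched in plain Python. The class names and helper below are illustrative, not PySpark's actual Param machinery:

```python
class Param:
    """Toy stand-in for pyspark.ml.param.Param."""

    def __init__(self, name):
        self.name = name


class Model:
    maxIter = Param("maxIter")
    regParam = Param("regParam")

    @property
    def summary(self):
        # Like LinearRegressionModel.summary: raises when no summary exists.
        raise RuntimeError("no training summary available")


def params(obj):
    names = []
    for name in dir(obj):
        # Check the *class* attribute first: if it is a property,
        # getattr(obj, name) would invoke its getter (and may raise),
        # and a property can never be an ML Param anyway, so skip it.
        if isinstance(getattr(type(obj), name, None), property):
            continue
        if isinstance(getattr(obj, name), Param):
            names.append(name)
    return sorted(names)


print(params(Model()))  # 'summary' is skipped, so no exception is raised
```

Naively calling `getattr(Model(), "summary")` inside the loop would raise `RuntimeError`; the class-level check avoids invoking the getter at all.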
[jira] [Resolved] (SPARK-13528) Make the short names of compression codecs consistent in spark
[ https://issues.apache.org/jira/browse/SPARK-13528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13528. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.0.0 > Make the short names of compression codecs consistent in spark > -- > > Key: SPARK-13528 > URL: https://issues.apache.org/jira/browse/SPARK-13528 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 1.6.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 2.0.0 > > > Add common utility code to map short names to fully-qualified codec names.
[jira] [Updated] (SPARK-13594) remove typed operations (map, flatMap, mapPartitions) from Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13594: Description: Once we implement Dataset-equivalent API in Python, we'd need to change the return type of map, flatMap, and mapPartitions. In this case, we should just remove them from Python DataFrame now in 2.0, so we don't break APIs in 2.x. Users can still use them after the removal, but must prefix them with "rdd". For example, df.rdd.map, df.rdd.flatMap, and df.rdd.mapPartitions. was: Once we implement Dataset-equivalent API in Python, we'd need to change the return type of map, flatMap, and mapPartitions. In this case, we should just remove them from Python DataFrame now in 2.0, so we don't break APIs in 2.x. > remove typed operations (map, flatMap, mapPartitions) from Python DataFrame > > > Key: SPARK-13594 > URL: https://issues.apache.org/jira/browse/SPARK-13594 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > Once we implement Dataset-equivalent API in Python, we'd need to change the > return type of map, flatMap, and mapPartitions. In this case, we should just > remove them from Python DataFrame now in 2.0, so we don't break APIs in 2.x. > Users can still use them after the removal, but must prefix them with "rdd". For > example, df.rdd.map, df.rdd.flatMap, and df.rdd.mapPartitions.
[jira] [Updated] (SPARK-13594) remove typed operations (map, flatMap, mapPartitions) from Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13594: Description: Once we implement Dataset-equivalent API in Python, we'd need to change the return type of map, flatMap, and mapPartitions. In this case, we should just remove them from Python DataFrame now in 2.0, so we don't break APIs in 2.x. > remove typed operations (map, flatMap, mapPartitions) from Python DataFrame > > > Key: SPARK-13594 > URL: https://issues.apache.org/jira/browse/SPARK-13594 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > Once we implement Dataset-equivalent API in Python, we'd need to change the > return type of map, flatMap, and mapPartitions. In this case, we should just > remove them from Python DataFrame now in 2.0, so we don't break APIs in 2.x.
[jira] [Updated] (SPARK-13594) remove typed operations (map, flatMap, mapPartitions) from Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13594: Issue Type: Sub-task (was: Improvement) Parent: SPARK-11806 > remove typed operations (map, flatMap, mapPartitions) from Python DataFrame > > > Key: SPARK-13594 > URL: https://issues.apache.org/jira/browse/SPARK-13594 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > Once we implement Dataset-equivalent API in Python, we'd need to change the > return type of map, flatMap, and mapPartitions. In this case, we should just > remove them from Python DataFrame now in 2.0, so we don't break APIs in 2.x.