[jira] [Comment Edited] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698070#comment-16698070 ] Cheng Su edited comment on SPARK-26164 at 11/25/18 5:37 AM: I think this is something we need in Spark. Let me know if this is not needed; otherwise I will start working on a PR to address the issue. cc [~cloud_fan], [~tejasp], and [~sameerag] who might have the most context on this, thanks! was (Author: chengsu): cc [~cloud_fan], [~tejasp], and [~sameerag] who might have the most context on this. I think this is something we need in Spark. Let me know if this is not needed; otherwise I will start working on a PR to address the issue. > [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort > -- > > Key: SPARK-26164 > URL: https://issues.apache.org/jira/browse/SPARK-26164 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Cheng Su >Priority: Minor > > Problem: > Currently, Spark always requires a local sort before writing to an output table on > partition/bucket columns [1]. The disadvantage is that the sort may waste > reserved CPU time on the executor due to spill. Hive does not require a local > sort before writing an output table [2], and we saw performance regressions when > migrating Hive workloads to Spark. > > Proposal: > We can avoid the local sort by keeping a mapping from file path to > output writer. When a row targets a new file path, we create a new > output writer; otherwise, we reuse the existing writer for that path (the main > change would be in FileFormatDataWriter.scala). This is very > similar to what Hive does in [2]. > Because the new behavior (i.e. avoiding the sort by keeping multiple output writers) > consumes more memory on the executor (multiple output writers need to be open > at the same time) than the current behavior (i.e. only one output writer > open), we can add a config to switch between the current and new behavior. > > [1]: spark FileFormatWriter.scala - > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123] > [2]: hive FileSinkOperator.java - > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698070#comment-16698070 ] Cheng Su commented on SPARK-26164: -- cc [~cloud_fan], [~tejasp], and [~sameerag] who might have the most context on this. I think this is something we need in Spark. Let me know if this is not needed; otherwise I will start working on a PR to address the issue. > [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort > -- > > Key: SPARK-26164 > URL: https://issues.apache.org/jira/browse/SPARK-26164 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Cheng Su >Priority: Minor > > Problem: > Currently, Spark always requires a local sort before writing to an output table on > partition/bucket columns [1]. The disadvantage is that the sort may waste > reserved CPU time on the executor due to spill. Hive does not require a local > sort before writing an output table [2], and we saw performance regressions when > migrating Hive workloads to Spark. > > Proposal: > We can avoid the local sort by keeping a mapping from file path to > output writer. When a row targets a new file path, we create a new > output writer; otherwise, we reuse the existing writer for that path (the main > change would be in FileFormatDataWriter.scala). This is very > similar to what Hive does in [2]. > Because the new behavior (i.e. avoiding the sort by keeping multiple output writers) > consumes more memory on the executor (multiple output writers need to be open > at the same time) than the current behavior (i.e. only one output writer > open), we can add a config to switch between the current and new behavior. > > [1]: spark FileFormatWriter.scala - > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123] > [2]: hive FileSinkOperator.java - > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-26164: - Description: Problem: Currently, Spark always requires a local sort before writing to an output table on partition/bucket columns [1]. The disadvantage is that the sort may waste reserved CPU time on the executor due to spill. Hive does not require a local sort before writing an output table [2], and we saw performance regressions when migrating Hive workloads to Spark. Proposal: We can avoid the local sort by keeping a mapping from file path to output writer. When a row targets a new file path, we create a new output writer; otherwise, we reuse the existing writer for that path (the main change would be in FileFormatDataWriter.scala). This is very similar to what Hive does in [2]. Because the new behavior (i.e. avoiding the sort by keeping multiple output writers) consumes more memory on the executor (multiple output writers need to be open at the same time) than the current behavior (i.e. only one output writer open), we can add a config to switch between the current and new behavior. [1]: spark FileFormatWriter.scala - [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123] [2]: hive FileSinkOperator.java - [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510] was: Problem: Currently, Spark always requires a local sort before writing to an output table on partition/bucket/sort columns [1]. The disadvantage is that the sort may waste reserved CPU time on the executor due to spill. Hive does not require a local sort before writing an output table [2], and we saw performance regressions when migrating Hive workloads to Spark. Proposal: We can avoid the local sort by keeping a mapping from file path to output writer. When a row targets a new file path, we create a new output writer; otherwise, we reuse the existing writer for that path (the main change would be in FileFormatDataWriter.scala). This is very similar to what Hive does in [2]. Because the new behavior (i.e. avoiding the sort by keeping multiple output writers) consumes more memory on the executor (multiple output writers need to be open at the same time) than the current behavior (i.e. only one output writer open), we can add a config to switch between the current and new behavior. [1]: spark FileFormatWriter.scala - [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123] [2]: hive FileSinkOperator.java - https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510 > [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort > -- > > Key: SPARK-26164 > URL: https://issues.apache.org/jira/browse/SPARK-26164 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Cheng Su >Priority: Minor > > Problem: > Currently, Spark always requires a local sort before writing to an output table on > partition/bucket columns [1]. The disadvantage is that the sort may waste > reserved CPU time on the executor due to spill. Hive does not require a local > sort before writing an output table [2], and we saw performance regressions when > migrating Hive workloads to Spark. 
> > Proposal: > We can avoid the local sort by keeping a mapping from file path to > output writer. When a row targets a new file path, we create a new > output writer; otherwise, we reuse the existing writer for that path (the main > change would be in FileFormatDataWriter.scala). This is very > similar to what Hive does in [2]. > Because the new behavior (i.e. avoiding the sort by keeping multiple output writers) > consumes more memory on the executor (multiple output writers need to be open > at the same time) than the current behavior (i.e. only one output writer > open), we can add a config to switch between the current and new behavior. > > [1]: spark FileFormatWriter.scala - > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123] > [2]: hive FileSinkOperator.java - > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Created] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
Cheng Su created SPARK-26164: Summary: [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort Key: SPARK-26164 URL: https://issues.apache.org/jira/browse/SPARK-26164 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Cheng Su Problem: Currently, Spark always requires a local sort before writing to an output table on partition/bucket/sort columns [1]. The disadvantage is that the sort may waste reserved CPU time on the executor due to spill. Hive does not require a local sort before writing an output table [2], and we saw performance regressions when migrating Hive workloads to Spark. Proposal: We can avoid the local sort by keeping a mapping from file path to output writer. When a row targets a new file path, we create a new output writer; otherwise, we reuse the existing writer for that path (the main change would be in FileFormatDataWriter.scala). This is very similar to what Hive does in [2]. Because the new behavior (i.e. avoiding the sort by keeping multiple output writers) consumes more memory on the executor (multiple output writers need to be open at the same time) than the current behavior (i.e. only one output writer open), we can add a config to switch between the current and new behavior. [1]: spark FileFormatWriter.scala - [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123] [2]: hive FileSinkOperator.java - https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
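[Editor's note] A minimal, self-contained sketch of the writer bookkeeping the proposal above describes. All names here are illustrative stand-ins, not Spark's actual API; in Spark the relevant types would be InternalRow, OutputWriter, and FileFormatDataWriter.

{code:java}
import scala.collection.mutable

// Stand-in types so the sketch compiles on its own.
case class Row(partition: String, value: String)
trait Writer { def write(row: Row): Unit; def close(): Unit }

class ConcurrentOutputWriters(newWriter: String => Writer) {
  // One open writer per target path, instead of sorting rows so that a
  // single writer can be closed before the next path starts.
  private val writers = mutable.Map.empty[String, Writer]

  def write(row: Row): Unit = {
    val path = s"part=${row.partition}"
    // Reuse the writer if the path was seen before; otherwise open a new one.
    writers.getOrElseUpdate(path, newWriter(path)).write(row)
  }

  def commit(): Unit = {
    writers.values.foreach(_.close()) // every writer stays open until the task ends
    writers.clear()
  }
}
{code}

The memory trade-off mentioned in the description falls out directly: the map holds one open writer (and its buffers) per distinct partition/bucket the task sees, which is why a config to switch between the two behaviors is proposed.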
[jira] [Commented] (SPARK-26163) Parsing decimals from JSON using locale
[ https://issues.apache.org/jira/browse/SPARK-26163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697990#comment-16697990 ] Apache Spark commented on SPARK-26163: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23132 > Parsing decimals from JSON using locale > --- > > Key: SPARK-26163 > URL: https://issues.apache.org/jira/browse/SPARK-26163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, the decimal type can be inferred from a JSON field value only if > it is recognised by JacksonParser, and the inference does not depend on locale. Also, > decimals in locale-specific formats cannot be parsed. This ticket aims to > support parsing/converting/inferring decimals from JSON input using a > locale. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26163) Parsing decimals from JSON using locale
[ https://issues.apache.org/jira/browse/SPARK-26163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26163: Assignee: Apache Spark > Parsing decimals from JSON using locale > --- > > Key: SPARK-26163 > URL: https://issues.apache.org/jira/browse/SPARK-26163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Currently, the decimal type can be inferred from a JSON field value only if > it is recognised by JacksonParser, and the inference does not depend on locale. Also, > decimals in locale-specific formats cannot be parsed. This ticket aims to > support parsing/converting/inferring decimals from JSON input using a > locale. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26163) Parsing decimals from JSON using locale
[ https://issues.apache.org/jira/browse/SPARK-26163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697989#comment-16697989 ] Apache Spark commented on SPARK-26163: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23132 > Parsing decimals from JSON using locale > --- > > Key: SPARK-26163 > URL: https://issues.apache.org/jira/browse/SPARK-26163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, the decimal type can be inferred from a JSON field value only if > it is recognised by JacksonParser, and the inference does not depend on locale. Also, > decimals in locale-specific formats cannot be parsed. This ticket aims to > support parsing/converting/inferring decimals from JSON input using a > locale. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26163) Parsing decimals from JSON using locale
[ https://issues.apache.org/jira/browse/SPARK-26163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26163: Assignee: (was: Apache Spark) > Parsing decimals from JSON using locale > --- > > Key: SPARK-26163 > URL: https://issues.apache.org/jira/browse/SPARK-26163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, the decimal type can be inferred from a JSON field value only if > it is recognised by JacksonParser, and the inference does not depend on locale. Also, > decimals in locale-specific formats cannot be parsed. This ticket aims to > support parsing/converting/inferring decimals from JSON input using a > locale. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26163) Parsing decimals from JSON using locale
Maxim Gekk created SPARK-26163: -- Summary: Parsing decimals from JSON using locale Key: SPARK-26163 URL: https://issues.apache.org/jira/browse/SPARK-26163 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Currently, the decimal type can be inferred from a JSON field value only if it is recognised by JacksonParser, and the inference does not depend on locale. Also, decimals in locale-specific formats cannot be parsed. This ticket aims to support parsing/converting/inferring decimals from JSON input using a locale. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
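[Editor's note] A minimal sketch of locale-aware decimal parsing of the kind this ticket asks for, using java.text.DecimalFormat from the JDK; this is an illustration, not the code from the linked pull request.

{code:java}
import java.text.{DecimalFormat, NumberFormat}
import java.util.Locale

def parseDecimal(s: String, locale: Locale): java.math.BigDecimal = {
  // NumberFormat.getInstance returns a DecimalFormat for standard locales.
  val format = NumberFormat.getInstance(locale).asInstanceOf[DecimalFormat]
  format.setParseBigDecimal(true) // return BigDecimal instead of a lossy Double
  format.parse(s).asInstanceOf[java.math.BigDecimal]
}

// In Locale.GERMANY '.' groups digits and ',' is the decimal separator.
println(parseDecimal("1.234,56", Locale.GERMANY)) // 1234.56
{code}

Inference would work the same way in reverse: a string that this locale-aware parse accepts can be typed as a decimal even though JacksonParser alone would not recognise it.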
[jira] [Updated] (SPARK-26162) ALS results vary with user or item ID encodings
[ https://issues.apache.org/jira/browse/SPARK-26162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo J. Villacorta updated SPARK-26162: Description: When calling ALS.fit() with the same seed on a dataset, the results (both the latent factors matrices and the accuracy of the recommendations) differ when we change the labels used to encode the users or items. The code example below illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The user factors matrix changes, but not only the rows corresponding to users 1 or 2 but also the other rows. Is this the intended behaviour? {code:java} val r = scala.util.Random r.setSeed(123456) val trainDataset1 = spark.sparkContext.parallelize( (1 to 1000).map(_=> (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 19 ).toDF("user", "item", "rating") val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 30).otherwise(col("user"))) val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 30).otherwise(col("user"))) val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map( _.groupBy("user").agg(collect_list("item").alias("watched")) ) val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, trainDataset3).map(new ALS().setSeed(12345).fit(_)) als1.userFactors.show(5, false) als2.userFactors.show(5, false) als3.userFactors.show(5, false){code} If we ask for recommendations and compare them with a test dataset also modified accordingly (in this example, the test dataset is exactly the train dataset) the results also differ: {code:java} val recommendations = Array(als1, als2, als3).map(x => x.recommendForAllUsers(20).map{ case Row(user: Int, recommendations: WrappedArray[Row]) => { val items = recommendations.map{case Row(item: Int, score: Float) => item} (user, items) } }.toDF("user", "recommendations") ) val predictionsAndActualRDD = testDatasets.zip(recommendations).map{ case (testDataset, recommendationsDF) => testDataset.join(recommendationsDF, "user") .rdd.map(r => { (r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array, r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array ) }) } val metrics = predictionsAndActualRDD.map(new RankingMetrics(_)) println(s"Precision at 5 of first model = ${metrics(0).precisionAt(5)}") println(s"Precision at 5 of second model = ${metrics(1).precisionAt(5)}") println(s"Precision at 5 of third model = ${metrics(2).precisionAt(5)}") {code} EDIT: The results also change if we just swap the IDs of some users, like: {code:java} val trainDataset4 = trainDataset1.withColumn("user", when(col("user")===1, 4) .when(col("user")===4, 1).otherwise(col("user"))) {code} was: When calling ALS.fit() with the same seed on a dataset, the results (both the latent factors matrices and the accuracy of the recommendations) differ when we change the labels used to encode the users or items. The code example below illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The user factors matrix changes, but not only the rows corresponding to users 1 or 2 but also the other rows. Is this the intended behaviour? 
{code:java} val r = scala.util.Random r.setSeed(123456) val trainDataset1 = spark.sparkContext.parallelize( (1 to 1000).map(_=> (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 19 ).toDF("user", "item", "rating") val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 30).otherwise(col("user"))) val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 30).otherwise(col("user"))) val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map( _.groupBy("user").agg(collect_list("item").alias("watched")) ) val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, trainDataset3).map(new ALS().setSeed(12345).fit(_)) als1.userFactors.show(5, false) als2.userFactors.show(5, false) als3.userFactors.show(5, false){code} If we ask for recommendations and compare them with a test dataset also modified accordingly (in this example, the test dataset is exactly the train dataset) the results also differ: {code:java} val recommendations = Array(als1, als2, als3).map(x => x.recommendForAllUsers(20).map{ case Row(user: Int, recommendations: WrappedArray[Row]) => { val items = recommendations.map{case Row(item: Int, score: Float) => item} (user, items) } }.toDF("user", "recommendations") ) val predictionsAndActualRDD = testDatasets.zip(recommendations).map{ case (testDataset, recommendationsDF) => testDataset.join(recommendationsDF, "user") .rdd.map(r => {
[jira] [Commented] (SPARK-24710) Information Gain Ratio for decision trees
[ https://issues.apache.org/jira/browse/SPARK-24710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697974#comment-16697974 ] Pablo J. Villacorta commented on SPARK-24710: - It seems this is not considered a very interesting feature by the community... have you worked on it, Aleksey? > Information Gain Ratio for decision trees > - > > Key: SPARK-24710 > URL: https://issues.apache.org/jira/browse/SPARK-24710 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.1 >Reporter: Pablo J. Villacorta >Priority: Minor > Labels: features > > Spark currently uses Information Gain (IG) to decide the next feature to > branch on when building a decision tree. In the case of categorical features, IG > is known to be biased towards features with a large number of categories. > [Information Gain Ratio|https://en.wikipedia.org/wiki/Information_gain_ratio] > solves this problem by dividing the IG by a quantity that characterizes the > intrinsic information of a feature. > As far as I know, Spark has IG but not IGR. It would be nice to have the > option to choose IGR instead of IG. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
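[Editor's note] The textbook formulas behind the request, sketched in plain Scala; this is not Spark ML's internal API. IGR = IG / IV, where the intrinsic value IV is the entropy of the split proportions and penalises many-valued features.

{code:java}
def entropy(counts: Seq[Long]): Double = {
  val n = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / n
    -p * math.log(p) / math.log(2) // log base 2
  }.sum
}

// parentCounts: class counts before the split.
// splits: class counts within each branch of the candidate categorical split.
def infoGainRatio(parentCounts: Seq[Long], splits: Seq[Seq[Long]]): Double = {
  val n = parentCounts.sum.toDouble
  val ig = entropy(parentCounts) -
    splits.map(s => (s.sum / n) * entropy(s)).sum // information gain
  val iv = entropy(splits.map(_.sum))             // intrinsic value
  if (iv == 0.0) 0.0 else ig / iv
}
{code}

A feature split into many tiny branches gets a large IV, so its IGR is smaller than its raw IG, which is exactly the bias correction the ticket describes.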
[jira] [Updated] (SPARK-26162) ALS results vary with user or item ID encodings
[ https://issues.apache.org/jira/browse/SPARK-26162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo J. Villacorta updated SPARK-26162: Description: When calling ALS.fit() with the same seed on a dataset, the results (both the latent factors matrices and the accuracy of the recommendations) differ when we change the labels used to encode the users or items. The code example below illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The user factors matrix changes, but not only the rows corresponding to users 1 or 2 but also the other rows. Is this the intended behaviour? {code:java} val r = scala.util.Random r.setSeed(123456) val trainDataset1 = spark.sparkContext.parallelize( (1 to 1000).map(_=> (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 19 ).toDF("user", "item", "rating") val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 30).otherwise(col("user"))) val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 30).otherwise(col("user"))) val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map( _.groupBy("user").agg(collect_list("item").alias("watched")) ) val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, trainDataset3).map(new ALS().setSeed(12345).fit(_)) als1.userFactors.show(5, false) als2.userFactors.show(5, false) als3.userFactors.show(5, false){code} If we ask for recommendations and compare them with a test dataset also modified accordingly (in this example, the test dataset is exactly the train dataset) the results also differ: {code:java} val recommendations = Array(als1, als2, als3).map(x => x.recommendForAllUsers(20).map{ case Row(user: Int, recommendations: WrappedArray[Row]) => { val items = recommendations.map{case Row(item: Int, score: Float) => item} (user, items) } }.toDF("user", "recommendations") ) val predictionsAndActualRDD = testDatasets.zip(recommendations).map{ case (testDataset, recommendationsDF) => testDataset.join(recommendationsDF, "user") .rdd.map(r => { (r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array, r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array ) }) } val metrics = predictionsAndActualRDD.map(new RankingMetrics(_)) println(s"Precision at 5 of first model = ${metrics(0).precisionAt(5)}") println(s"Precision at 5 of second model = ${metrics(1).precisionAt(5)}") println(s"Precision at 5 of third model = ${metrics(2).precisionAt(5)}") {code} was: When calling ALS.fit() with the same seed on a dataset, the results (both the latent factors matrices and the accuracy of the recommendations) differ when we change the labels used to encode the users or items. The code example below illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The user factors matrix changes, but not only the rows corresponding to users 1 or 2 but also the other rows. Is this the intended behaviour? 
{code:java} val r = scala.util.Random r.setSeed(123456) val trainDataset1 = spark.sparkContext.parallelize( (1 to 1000).map(_=> (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 19 ).toDF("user", "item", "rating") val maxuser = trainDataset1.select(max("user")).head.getAs[Int](0) println(s"maxuser is ${maxuser}") val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 30).otherwise(col("user"))) val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 30).otherwise(col("user"))) val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map( _.groupBy("user").agg(collect_list("item").alias("watched")) ) val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, trainDataset3).map(new ALS().setSeed(12345).fit(_)) als1.userFactors.show(5, false) als2.userFactors.show(5, false) als3.userFactors.show(5, false){code} If we ask for recommendations and compare them with a test dataset also modified accordingly (in this example, the test dataset is exactly the train dataset) the results also differ: {code:java} val recommendations = Array(als1, als2, als3).map(x => x.recommendForAllUsers(20).map{ case Row(user: Int, recommendations: WrappedArray[Row]) => { val items = recommendations.map{case Row(item: Int, score: Float) => item} (user, items) } }.toDF("user", "recommendations") ) val predictionsAndActualRDD = testDatasets.zip(recommendations).map{ case (testDataset, recommendationsDF) => testDataset.join(recommendationsDF, "user") .rdd.map(r => { (r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array, r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array ) }) } val metrics =
[jira] [Updated] (SPARK-26162) ALS results vary with user or item ID encodings
[ https://issues.apache.org/jira/browse/SPARK-26162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo J. Villacorta updated SPARK-26162: Description: When calling ALS.fit() with the same seed on a dataset, the results (both the latent factors matrices and the accuracy of the recommendations) differ when we change the labels used to encode the users or items. The code example below illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The user factors matrix changes, but not only the rows corresponding to users 1 or 2 but also the other rows. Is this the intended behaviour? {code:java} val r = scala.util.Random r.setSeed(123456) val trainDataset1 = spark.sparkContext.parallelize( (1 to 1000).map(_=> (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 19 ).toDF("user", "item", "rating") val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 30).otherwise(col("user"))) val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 30).otherwise(col("user"))) val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map( _.groupBy("user").agg(collect_list("item").alias("watched")) ) val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, trainDataset3).map(new ALS().setSeed(12345).fit(_)) als1.userFactors.show(5, false) als2.userFactors.show(5, false) als3.userFactors.show(5, false){code} If we ask for recommendations and compare them with a test dataset also modified accordingly (in this example, the test dataset is exactly the train dataset) the results also differ: {code:java} val recommendations = Array(als1, als2, als3).map(x => x.recommendForAllUsers(20).map{ case Row(user: Int, recommendations: WrappedArray[Row]) => { val items = recommendations.map{case Row(item: Int, score: Float) => item} (user, items) } }.toDF("user", "recommendations") ) val predictionsAndActualRDD = testDatasets.zip(recommendations).map{ case (testDataset, recommendationsDF) => testDataset.join(recommendationsDF, "user") .rdd.map(r => { (r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array, r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array ) }) } val metrics = predictionsAndActualRDD.map(new RankingMetrics(_)) println(s"Precision at 5 of first model = ${metrics(0).precisionAt(5)}") println(s"Precision at 5 of second model = ${metrics(1).precisionAt(5)}") println(s"Precision at 5 of third model = ${metrics(2).precisionAt(5)}") {code} was: When calling ALS.fit() with the same seed on a dataset, the results (both the latent factors matrices and the accuracy of the recommendations) differ when we change the labels used to encode the users or items. The code example below illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The user factors matrix changes, but not only the rows corresponding to users 1 or 2 but also the other rows. Is this the intended behaviour? 
{code:java} val r = scala.util.Random r.setSeed(123456) val trainDataset1 = spark.sparkContext.parallelize( (1 to 1000).map(_=> (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 19 ).toDF("user", "item", "rating") val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 30).otherwise(col("user"))) val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 30).otherwise(col("user"))) val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map( _.groupBy("user").agg(collect_list("item").alias("watched")) ) val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, trainDataset3).map(new ALS().setSeed(12345).fit(_)) als1.userFactors.show(5, false) als2.userFactors.show(5, false) als3.userFactors.show(5, false){code} If we ask for recommendations and compare them with a test dataset also modified accordingly (in this example, the test dataset is exactly the train dataset) the results also differ: {code:java} val recommendations = Array(als1, als2, als3).map(x => x.recommendForAllUsers(20).map{ case Row(user: Int, recommendations: WrappedArray[Row]) => { val items = recommendations.map{case Row(item: Int, score: Float) => item} (user, items) } }.toDF("user", "recommendations") ) val predictionsAndActualRDD = testDatasets.zip(recommendations).map{ case (testDataset, recommendationsDF) => testDataset.join(recommendationsDF, "user") .rdd.map(r => { (r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array, r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array ) }) } val metrics = predictionsAndActualRDD.map(new RankingMetrics(_)) println(s"Precision at 5 of first model =
[jira] [Created] (SPARK-26162) ALS results vary with user or item ID encodings
Pablo J. Villacorta created SPARK-26162: --- Summary: ALS results vary with user or item ID encodings Key: SPARK-26162 URL: https://issues.apache.org/jira/browse/SPARK-26162 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.0 Reporter: Pablo J. Villacorta When calling ALS.fit() with the same seed on a dataset, the results (both the latent factors matrices and the accuracy of the recommendations) differ when we change the labels used to encode the users or items. The code example below illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The user factors matrix changes, but not only the rows corresponding to users 1 or 2 but also the other rows. Is this the intended behaviour? {code:java} val r = scala.util.Random r.setSeed(123456) val trainDataset1 = spark.sparkContext.parallelize( (1 to 1000).map(_=> (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 19 ).toDF("user", "item", "rating") val maxuser = trainDataset1.select(max("user")).head.getAs[Int](0) println(s"maxuser is ${maxuser}") val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 30).otherwise(col("user"))) val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 30).otherwise(col("user"))) val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map( _.groupBy("user").agg(collect_list("item").alias("watched")) ) val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, trainDataset3).map(new ALS().setSeed(12345).fit(_)) als1.userFactors.show(5, false) als2.userFactors.show(5, false) als3.userFactors.show(5, false){code} If we ask for recommendations and compare them with a test dataset also modified accordingly (in this example, the test dataset is exactly the train dataset) the results also differ: {code:java} val recommendations = Array(als1, als2, als3).map(x => x.recommendForAllUsers(20).map{ case Row(user: Int, recommendations: WrappedArray[Row]) => { val items = recommendations.map{case Row(item: Int, score: Float) => item} (user, items) } }.toDF("user", "recommendations") ) val predictionsAndActualRDD = testDatasets.zip(recommendations).map{ case (testDataset, recommendationsDF) => testDataset.join(recommendationsDF, "user") .rdd.map(r => { (r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array, r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array ) }) } val metrics = predictionsAndActualRDD.map(new RankingMetrics(_)) println(s"Precision at 5 of first model = ${metrics(0).precisionAt(5)}") println(s"Precision at 5 of second model = ${metrics(1).precisionAt(5)}") println(s"Precision at 5 of third model = ${metrics(2).precisionAt(5)}") {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
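[Editor's note] One plausible, unconfirmed explanation for the observed behaviour: ALS hash-partitions ratings into blocks by ID and seeds its random factor initialisation per block, so remapping even one ID can move users between blocks and change the random starting values of rows that were never touched. A toy illustration of that mechanism (not Spark's actual ALS code):

{code:java}
import scala.util.Random
import scala.util.hashing.byteswap32

// Toy model: factors come from a per-block RNG stream derived from
// (seed, blockId). Changing an ID changes its block, which shifts the
// stream positions consumed by every other ID in the affected blocks.
def initFactors(ids: Seq[Int], numBlocks: Int, seed: Int, rank: Int): Map[Int, Seq[Float]] =
  ids.groupBy(id => id % numBlocks).flatMap { case (block, blockIds) => // non-negative ids assumed
    val rng = new Random(byteswap32(seed ^ block))
    blockIds.sorted.map(id => id -> Seq.fill(rank)(rng.nextFloat()))
  }

val a = initFactors(Seq(0, 1, 2, 3), numBlocks = 2, seed = 12345, rank = 2)
val b = initFactors(Seq(0, 30, 2, 3), numBlocks = 2, seed = 12345, rank = 2) // ID 1 -> 30
println(a(3) == b(3)) // false: user 3's init changed even though its ID did not
{code}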
[jira] [Updated] (SPARK-25908) Remove old deprecated items in Spark 3
[ https://issues.apache.org/jira/browse/SPARK-25908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25908: -- Description: There are many deprecated methods and classes in Spark. They _can_ be removed in Spark 3, and for those that have been deprecated a long time (i.e. since Spark <= 2.0), we should probably do so. This addresses most of these cases, the easiest ones, those that are easy to remove and are old: - Remove some AccumulableInfo .apply() methods - Remove non-label-specific multiclass precision/recall/fScore in favor of accuracy - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only deprecated) - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only deprecated) - Remove unused Python StorageLevel constants - Remove unused multiclass option in libsvm parsing - Remove references to deprecated spark configs like spark.yarn.am.port - Remove TaskContext.isRunningLocally - Remove ShuffleMetrics.shuffle* methods - Remove BaseReadWrite.context in favor of session - Remove Column.!== in favor of =!= - Remove Dataset.explode - Remove Dataset.registerTempTable - Remove SQLContext.getOrCreate, setActive, clearActive, constructors Not touched yet: - everything else in MLLib - HiveContext - Anything deprecated more recently than 2.0.0, generally was: There are many deprecated methods and classes in Spark. They _can_ be removed in Spark 3, and for those that have been deprecated a long time (i.e. since Spark <= 2.0), we should probably do so. This addresses most of these cases, the easiest ones, those that are easy to remove and are old: - Remove some AccumulableInfo .apply() methods - Remove non-label-specific multiclass precision/recall/fScore in favor of accuracy - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only deprecated) - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only deprecated) - Remove unused Python StorageLevel constants - Remove Dataset unionAll in favor of union - Remove unused multiclass option in libsvm parsing - Remove references to deprecated spark configs like spark.yarn.am.port - Remove TaskContext.isRunningLocally - Remove ShuffleMetrics.shuffle* methods - Remove BaseReadWrite.context in favor of session - Remove Column.!== in favor of =!= - Remove Dataset.explode - Remove Dataset.registerTempTable - Remove SQLContext.getOrCreate, setActive, clearActive, constructors Not touched yet: - everything else in MLLib - HiveContext - Anything deprecated more recently than 2.0.0, generally > Remove old deprecated items in Spark 3 > -- > > Key: SPARK-25908 > URL: https://issues.apache.org/jira/browse/SPARK-25908 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > There are many deprecated methods and classes in Spark. They _can_ be removed > in Spark 3, and for those that have been deprecated a long time (i.e. since > Spark <= 2.0), we should probably do so. 
This addresses most of these cases, > the easiest ones, those that are easy to remove and are old: > - Remove some AccumulableInfo .apply() methods > - Remove non-label-specific multiclass precision/recall/fScore in favor of > accuracy > - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only > deprecated) > - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only > deprecated) > - Remove unused Python StorageLevel constants > - Remove unused multiclass option in libsvm parsing > - Remove references to deprecated spark configs like spark.yarn.am.port > - Remove TaskContext.isRunningLocally > - Remove ShuffleMetrics.shuffle* methods > - Remove BaseReadWrite.context in favor of session > - Remove Column.!== in favor of =!= > - Remove Dataset.explode > - Remove Dataset.registerTempTable > - Remove SQLContext.getOrCreate, setActive, clearActive, constructors > Not touched yet: > - everything else in MLLib > - HiveContext > - Anything deprecated more recently than 2.0.0, generally -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
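[Editor's note] A few hedged before/after examples for some of the removals listed above, based on the replacements named in the Spark 2.x deprecation notes; this is illustrative usage, not code from the PR. Assumes an existing SparkSession {{spark}}.

{code:java}
import org.apache.spark.sql.functions._

val df = spark.range(10).toDF("a").withColumn("b", col("a") + 1)

df.filter(col("a") =!= col("b"))    // Column.!== was removed in favor of =!=
df.agg(approx_count_distinct("a")) // approxCountDistinct => approx_count_distinct
df.select(degrees(col("a")))       // toDegrees => degrees
df.createOrReplaceTempView("t")    // Dataset.registerTempTable => createOrReplaceTempView
{code}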
[jira] [Commented] (SPARK-25908) Remove old deprecated items in Spark 3
[ https://issues.apache.org/jira/browse/SPARK-25908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697947#comment-16697947 ] Apache Spark commented on SPARK-25908: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/23131 > Remove old deprecated items in Spark 3 > -- > > Key: SPARK-25908 > URL: https://issues.apache.org/jira/browse/SPARK-25908 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > There are many deprecated methods and classes in Spark. They _can_ be removed > in Spark 3, and for those that have been deprecated a long time (i.e. since > Spark <= 2.0), we should probably do so. This addresses most of these cases, > the easiest ones, those that are easy to remove and are old: > - Remove some AccumulableInfo .apply() methods > - Remove non-label-specific multiclass precision/recall/fScore in favor of > accuracy > - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only > deprecated) > - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only > deprecated) > - Remove unused Python StorageLevel constants > - Remove Dataset unionAll in favor of union > - Remove unused multiclass option in libsvm parsing > - Remove references to deprecated spark configs like spark.yarn.am.port > - Remove TaskContext.isRunningLocally > - Remove ShuffleMetrics.shuffle* methods > - Remove BaseReadWrite.context in favor of session > - Remove Column.!== in favor of =!= > - Remove Dataset.explode > - Remove Dataset.registerTempTable > - Remove SQLContext.getOrCreate, setActive, clearActive, constructors > Not touched yet: > - everything else in MLLib > - HiveContext > - Anything deprecated more recently than 2.0.0, generally -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25908) Remove old deprecated items in Spark 3
[ https://issues.apache.org/jira/browse/SPARK-25908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697946#comment-16697946 ] Apache Spark commented on SPARK-25908: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/23131 > Remove old deprecated items in Spark 3 > -- > > Key: SPARK-25908 > URL: https://issues.apache.org/jira/browse/SPARK-25908 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > There are many deprecated methods and classes in Spark. They _can_ be removed > in Spark 3, and for those that have been deprecated a long time (i.e. since > Spark <= 2.0), we should probably do so. This addresses most of these cases, > the easiest ones, those that are easy to remove and are old: > - Remove some AccumulableInfo .apply() methods > - Remove non-label-specific multiclass precision/recall/fScore in favor of > accuracy > - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only > deprecated) > - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only > deprecated) > - Remove unused Python StorageLevel constants > - Remove Dataset unionAll in favor of union > - Remove unused multiclass option in libsvm parsing > - Remove references to deprecated spark configs like spark.yarn.am.port > - Remove TaskContext.isRunningLocally > - Remove ShuffleMetrics.shuffle* methods > - Remove BaseReadWrite.context in favor of session > - Remove Column.!== in favor of =!= > - Remove Dataset.explode > - Remove Dataset.registerTempTable > - Remove SQLContext.getOrCreate, setActive, clearActive, constructors > Not touched yet: > - everything else in MLLib > - HiveContext > - Anything deprecated more recently than 2.0.0, generally -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26161) Ignore empty files in load
[ https://issues.apache.org/jira/browse/SPARK-26161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26161: Assignee: (was: Apache Spark) > Ignore empty files in load > -- > > Key: SPARK-26161 > URL: https://issues.apache.org/jira/browse/SPARK-26161 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, empty files are opened during load, and Spark tries to read data from > them. In some cases, empty partitions are produced from such empty files, for > example in the case of *wholetext* mode in the Text datasource and *multiLine* mode > in the CSV/JSON datasources. This behaviour is unnecessary, and empty files can be > skipped on read, which reduces the number of tasks submitted for loading empty > files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26161) Ignore empty files in load
[ https://issues.apache.org/jira/browse/SPARK-26161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26161: Assignee: Apache Spark > Ignore empty files in load > -- > > Key: SPARK-26161 > URL: https://issues.apache.org/jira/browse/SPARK-26161 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Currently, empty files are opened during load, and Spark tries to read data from > them. In some cases, empty partitions are produced from such empty files, for > example in the case of *wholetext* mode in the Text datasource and *multiLine* mode > in the CSV/JSON datasources. This behaviour is unnecessary, and empty files can be > skipped on read, which reduces the number of tasks submitted for loading empty > files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26161) Ignore empty files in load
[ https://issues.apache.org/jira/browse/SPARK-26161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697913#comment-16697913 ] Apache Spark commented on SPARK-26161: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23130 > Ignore empty files in load > -- > > Key: SPARK-26161 > URL: https://issues.apache.org/jira/browse/SPARK-26161 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, empty files are opened during load, and Spark tries to read data from > them. In some cases, empty partitions are produced from such empty files, for > example in the case of *wholetext* mode in the Text datasource and *multiLine* mode > in the CSV/JSON datasources. This behaviour is unnecessary, and empty files can be > skipped on read, which reduces the number of tasks submitted for loading empty > files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26161) Ignore empty files in load
Maxim Gekk created SPARK-26161: -- Summary: Ignore empty files in load Key: SPARK-26161 URL: https://issues.apache.org/jira/browse/SPARK-26161 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Currently, empty files are opened during load, and Spark tries to read data from them. In some cases, empty partitions are produced from such empty files, for example in the case of *wholetext* mode in the Text datasource and *multiLine* mode in the CSV/JSON datasources. This behaviour is unnecessary, and empty files can be skipped on read, which reduces the number of tasks submitted for loading empty files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
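[Editor's note] An illustrative sketch of the proposed filter at file-listing time; the actual change would live in Spark's datasource file-index code paths, and the function below is a made-up name, not an existing API.

{code:java}
import org.apache.hadoop.fs.FileStatus

// Keep directories (they are expanded later) and only files with content;
// zero-length files would otherwise each produce a useless read task.
def selectFilesToRead(statuses: Seq[FileStatus]): Seq[FileStatus] =
  statuses.filter(s => s.isDirectory || s.getLen > 0)
{code}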
[jira] [Created] (SPARK-26160) Make assertNotBucketed call in DataFrameWriter::save optional
Ximo Guanter created SPARK-26160: Summary: Make assertNotBucketed call in DataFrameWriter::save optional Key: SPARK-26160 URL: https://issues.apache.org/jira/browse/SPARK-26160 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Ximo Guanter Currently, the `.save` method in DataFrameWriter will fail if the dataframe is bucketed and/or sorted. This makes sense, since there is no way of storing metadata in the current file-based data sources to know whether a file was bucketed or not. I have a use case where I would like to implement a new, file-based data source which could keep track of that kind of metadata, so I would like to be able to `.save` bucketed dataframes. Would a patch to extend the datasource api with an indicator of whether that source is able to serialize bucketed dataframes be a welcome addition? I'm happy to work on it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
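[Editor's note] A hypothetical shape for the indicator the reporter suggests; the trait name and method are made up for illustration and are not an existing Spark API.

{code:java}
// A capability marker a file-based source could mix in when it can persist
// bucketing/sort metadata alongside the data it writes.
trait SupportsBucketedWrites {
  /** True if this source can round-trip bucket/sort metadata with the data. */
  def supportsBucketedWrites: Boolean
}

// DataFrameWriter.save (simplified pseudologic): only assert when the target
// source cannot preserve the bucket metadata.
// if (!dataSource.isInstanceOf[SupportsBucketedWrites]) assertNotBucketed("save")
{code}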
[jira] [Updated] (SPARK-25786) If the ByteBuffer.hasArray is false, it will throw UnsupportedOperationException for Kryo
[ https://issues.apache.org/jira/browse/SPARK-25786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25786: -- Fix Version/s: 2.4.1 2.3.3 > If the ByteBuffer.hasArray is false, it will throw > UnsupportedOperationException for Kryo > -- > > Key: SPARK-25786 > URL: https://issues.apache.org/jira/browse/SPARK-25786 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Assignee: liuxian >Priority: Minor > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > `{color:#ffc66d}deserialize{color}` for Kryo takes a ByteBuffer as its input > parameter; if the buffer is not backed by an accessible byte array, it will throw > UnsupportedOperationException. > Exception Info: > java.lang.UnsupportedOperationException was thrown. > java.lang.UnsupportedOperationException > at java.nio.ByteBuffer.array(ByteBuffer.java:994) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25786) If the ByteBuffer.hasArray is false, it will throw UnsupportedOperationException for Kryo
[ https://issues.apache.org/jira/browse/SPARK-25786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25786: - Assignee: liuxian > If the ByteBuffer.hasArray is false, it will throw > UnsupportedOperationException for Kryo > -- > > Key: SPARK-25786 > URL: https://issues.apache.org/jira/browse/SPARK-25786 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Assignee: liuxian >Priority: Major > > `{color:#ffc66d}deserialize{color}` for Kryo takes a ByteBuffer as its input > parameter; if the buffer is not backed by an accessible byte array, it will throw > UnsupportedOperationException. > Exception Info: > java.lang.UnsupportedOperationException was thrown. > java.lang.UnsupportedOperationException > at java.nio.ByteBuffer.array(ByteBuffer.java:994) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25786) If the ByteBuffer.hasArray is false, it will throw UnsupportedOperationException for Kryo
[ https://issues.apache.org/jira/browse/SPARK-25786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25786. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22779 [https://github.com/apache/spark/pull/22779] > If the ByteBuffer.hasArray is false, it will throw > UnsupportedOperationException for Kryo > -- > > Key: SPARK-25786 > URL: https://issues.apache.org/jira/browse/SPARK-25786 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Assignee: liuxian >Priority: Major > Fix For: 3.0.0 > > > `{color:#ffc66d}deserialize{color}` for Kryo takes a ByteBuffer as its input > parameter; if the buffer is not backed by an accessible byte array, it will throw > UnsupportedOperationException. > Exception Info: > java.lang.UnsupportedOperationException was thrown. > java.lang.UnsupportedOperationException > at java.nio.ByteBuffer.array(ByteBuffer.java:994) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
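[Editor's note] A sketch of the kind of fix this describes: read via the backing array when one is accessible, otherwise copy the bytes out of the (e.g. direct/off-heap) buffer. Illustrative only, not the literal patch from PR 22779; the function name is made up.

{code:java}
import java.nio.ByteBuffer

// Returns (array, offset, length) that a Kryo Input can safely wrap.
def readableBytes(bytes: ByteBuffer): (Array[Byte], Int, Int) =
  if (bytes.hasArray) {
    // Array-backed buffer: read in place, no copy needed.
    (bytes.array(), bytes.arrayOffset() + bytes.position(), bytes.remaining())
  } else {
    // Direct buffer: bytes.array() would throw, so copy the remaining bytes.
    val copy = new Array[Byte](bytes.remaining())
    bytes.duplicate().get(copy) // duplicate() leaves the caller's position untouched
    (copy, 0, copy.length)
  }
{code}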
[jira] [Resolved] (SPARK-26156) Revise summary section of stage page
[ https://issues.apache.org/jira/browse/SPARK-26156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26156. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23125 [https://github.com/apache/spark/pull/23125] > Revise summary section of stage page > > > Key: SPARK-26156 > URL: https://issues.apache.org/jira/browse/SPARK-26156 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > Fix For: 3.0.0 > > > 1. In the summary section of the stage page, the following metric names can be > revised: > Output => Output Size / Records > Shuffle Read: => Shuffle Read Size / Records > Shuffle Write => Shuffle Write Size / Records > After the changes, the names are clearer and consistent with the other names > on the same page. > 2. The associated job ID URL should not contain the three trailing spaces. Reduce > the number of spaces to one, and exclude the space from the link. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26156) Revise summary section of stage page
[ https://issues.apache.org/jira/browse/SPARK-26156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26156: - Assignee: Gengliang Wang > Revise summary section of stage page > > > Key: SPARK-26156 > URL: https://issues.apache.org/jira/browse/SPARK-26156 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > > 1. In the summary section of the stage page, the following metric names can be > revised: > Output => Output Size / Records > Shuffle Read: => Shuffle Read Size / Records > Shuffle Write => Shuffle Write Size / Records > After the changes, the names are clearer and consistent with the other names > on the same page. > 2. The associated job ID URL should not contain the three trailing spaces. Reduce > the number of spaces to one, and exclude the space from the link. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25786) If the ByteBuffer.hasArray is false, it will throw UnsupportedOperationException for Kryo
[ https://issues.apache.org/jira/browse/SPARK-25786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25786: -- Priority: Minor (was: Major) Issue Type: Bug (was: Improvement) > If the ByteBuffer.hasArray is false, it will throw > UnsupportedOperationException for Kryo > -- > > Key: SPARK-25786 > URL: https://issues.apache.org/jira/browse/SPARK-25786 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Assignee: liuxian >Priority: Minor > Fix For: 3.0.0 > > > `{color:#ffc66d}deserialize{color}` for Kryo takes a ByteBuffer as its input > parameter; if the buffer is not backed by an accessible byte array, it will throw > UnsupportedOperationException. > Exception Info: > java.lang.UnsupportedOperationException was thrown. > java.lang.UnsupportedOperationException > at java.nio.ByteBuffer.array(ByteBuffer.java:994) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations
[ https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697831#comment-16697831 ] Chris Baker commented on SPARK-19700: - I was getting ready to start looking at this in earnest as well, in support of the Nomad scheduler for Spark (which we're currently maintaining in a fork of Spark). > Design an API for pluggable scheduler implementations > - > > Key: SPARK-19700 > URL: https://issues.apache.org/jira/browse/SPARK-19700 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Matt Cheah >Priority: Major > > One point that was brought up in discussing SPARK-18278 was that schedulers > cannot easily be added to Spark without forking the whole project. The main > reason is that much of the scheduler's behavior fundamentally depends on the > CoarseGrainedSchedulerBackend class, which is not part of the public API of > Spark and is in fact quite a complex module. As resource management and > allocation continues to evolve, Spark will need to be integrated with more > cluster managers, but maintaining support for all possible allocators in the > Spark project would be untenable. Furthermore, it would be impossible for > Spark to support proprietary frameworks that are developed by specific users > for their own particular use cases. > Therefore, this ticket proposes making scheduler implementations fully > pluggable. The idea is that Spark will provide a Java/Scala interface that is > to be implemented by a scheduler that is backed by the cluster manager of > interest. The user can compile their scheduler's code into a JAR that is > placed on the driver's classpath. Finally, as is the case in the current > world, the scheduler implementation is selected and dynamically loaded > depending on the user's provided master URL. > Determining the correct API is the most challenging problem. The current > CoarseGrainedSchedulerBackend handles many responsibilities, some of which > will be common across all cluster managers, and some which will be specific > to a particular cluster manager. For example, the particular mechanism for > creating the executor processes will differ between YARN and Mesos, but, once > these executors have started running, the means to submit tasks to them over > the Netty RPC is identical across the board. > We must also consider a plugin model and interface for submitting the > application as well, because different cluster managers support different > configuration options, and thus the driver must be bootstrapped accordingly. > For example, in YARN mode the application and Hadoop configuration must be > packaged and shipped to the distributed cache prior to launching the job. A > prototype of a Kubernetes implementation starts a Kubernetes pod that runs > the driver in cluster mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
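[Editor's note] A hypothetical sketch of what the pluggable interface could look like, loosely modeled on the ExternalClusterManager trait that exists inside Spark core (currently private[spark]); the trait and method names below are illustrative, and AnyRef stands in for Spark's TaskScheduler and SchedulerBackend types.

{code:java}
trait PluggableClusterManager {
  /** Whether this plugin handles the given master URL, e.g. "nomad://...". */
  def canCreate(masterURL: String): Boolean

  /** Build the cluster-manager-specific task scheduler. */
  def createTaskScheduler(masterURL: String): AnyRef

  /** Build the backend that launches executors on this cluster manager. */
  def createSchedulerBackend(masterURL: String, scheduler: AnyRef): AnyRef

  /** Wire scheduler and backend together after construction, before task offers. */
  def initialize(scheduler: AnyRef, backend: AnyRef): Unit
}
{code}

Implementations discovered on the driver's classpath (for example via ServiceLoader) could then be selected by matching canCreate against the user's master URL, which is the dynamic-loading flow the description proposes.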