[jira] [Comment Edited] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2018-11-24 Thread Cheng Su (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698070#comment-16698070
 ] 

Cheng Su edited comment on SPARK-26164 at 11/25/18 5:37 AM:


I think this is something we need in Spark. Let me know if it is not needed; 
otherwise I will start working on a PR to address the issue. cc [~cloud_fan], 
[~tejasp], and [~sameerag], who might have the most context on this. Thanks!


was (Author: chengsu):
cc [~cloud_fan], [~tejasp], and [~sameerag], who might have the most context on 
this. I think this is something we need in Spark. Let me know if it is not 
needed; otherwise I will start working on a PR to address the issue.

> [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
> --
>
> Key: SPARK-26164
> URL: https://issues.apache.org/jira/browse/SPARK-26164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Cheng Su
>Priority: Minor
>
> Problem:
> Currently, Spark always requires a local sort on the partition/bucket columns 
> before writing to an output table [1]. The disadvantage is that the sort might 
> waste reserved CPU time on executors due to spill. Hive does not require a 
> local sort before writing the output table [2], and we saw a performance 
> regression when migrating Hive workloads to Spark.
>  
> Proposal:
> We can avoid the local sort by keeping a mapping from file path to output 
> writer. When a row targets a new file path, we create a new output writer; 
> otherwise, we reuse the existing writer for that path (the main change should 
> be in FileFormatDataWriter.scala). This is very similar to what Hive does in [2].
> Since the new behavior (avoiding the sort by keeping multiple output writers 
> open at the same time) consumes more memory on executors than the current 
> behavior (only one output writer open), we can add a config to switch between 
> the current and new behavior.
>  
> [1]: spark FileFormatWriter.scala - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]
> [2]: hive FileSinkOperator.java - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510]
>  
>  






[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2018-11-24 Thread Cheng Su (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698070#comment-16698070
 ] 

Cheng Su commented on SPARK-26164:
--

cc [~cloud_fan], [~tejasp], and [~sameerag], who might have the most context on 
this. I think this is something we need in Spark. Let me know if it is not 
needed; otherwise I will start working on a PR to address the issue.

> [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
> --
>
> Key: SPARK-26164
> URL: https://issues.apache.org/jira/browse/SPARK-26164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Cheng Su
>Priority: Minor
>
> Problem:
> Currently, Spark always requires a local sort on the partition/bucket columns 
> before writing to an output table [1]. The disadvantage is that the sort might 
> waste reserved CPU time on executors due to spill. Hive does not require a 
> local sort before writing the output table [2], and we saw a performance 
> regression when migrating Hive workloads to Spark.
>  
> Proposal:
> We can avoid the local sort by keeping a mapping from file path to output 
> writer. When a row targets a new file path, we create a new output writer; 
> otherwise, we reuse the existing writer for that path (the main change should 
> be in FileFormatDataWriter.scala). This is very similar to what Hive does in [2].
> Since the new behavior (avoiding the sort by keeping multiple output writers 
> open at the same time) consumes more memory on executors than the current 
> behavior (only one output writer open), we can add a config to switch between 
> the current and new behavior.
>  
> [1]: spark FileFormatWriter.scala - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]
> [2]: hive FileSinkOperator.java - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510]
>  
>  






[jira] [Updated] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2018-11-24 Thread Cheng Su (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-26164:
-
Description: 
Problem:

Currently, Spark always requires a local sort on the partition/bucket columns 
before writing to an output table [1]. The disadvantage is that the sort might 
waste reserved CPU time on executors due to spill. Hive does not require a local 
sort before writing the output table [2], and we saw a performance regression 
when migrating Hive workloads to Spark.

 

Proposal:

We can avoid the local sort by keeping a mapping from file path to output 
writer. When a row targets a new file path, we create a new output writer; 
otherwise, we reuse the existing writer for that path (the main change should be 
in FileFormatDataWriter.scala). This is very similar to what Hive does in [2].

Since the new behavior (avoiding the sort by keeping multiple output writers 
open at the same time) consumes more memory on executors than the current 
behavior (only one output writer open), we can add a config to switch between 
the current and new behavior.
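
As an illustration, here is a minimal, hypothetical sketch of the create-or-reuse 
logic described above (the names and types are invented for the example; this is 
not Spark's actual writer API):
{code:scala}
import scala.collection.mutable

// Hypothetical writer interface, only to illustrate the idea (not Spark's API).
trait OutputWriter {
  def write(row: Seq[Any]): Unit
  def close(): Unit
}

// Keep one writer per output path; create it lazily on first use and reuse it
// afterwards, so incoming rows do not need to be sorted by partition columns.
class MultiWriterSink(createWriter: String => OutputWriter) {
  private val writers = mutable.Map.empty[String, OutputWriter]

  def writeRow(outputPath: String, row: Seq[Any]): Unit = {
    val writer = writers.getOrElseUpdate(outputPath, createWriter(outputPath))
    writer.write(row)
  }

  // More open writers means more executor memory, which is why a config to
  // switch back to the sort-based path makes sense.
  def closeAll(): Unit = writers.values.foreach(_.close())
}
{code}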

 

[1]: spark FileFormatWriter.scala - 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]

[2]: hive FileSinkOperator.java - 
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510]

 

 

  was:
Problem:

Currently, Spark always requires a local sort on the partition/bucket/sort 
columns before writing to an output table [1]. The disadvantage is that the sort 
might waste reserved CPU time on executors due to spill. Hive does not require a 
local sort before writing the output table [2], and we saw a performance 
regression when migrating Hive workloads to Spark.

 

Proposal:

We can avoid the local sort by keeping a mapping from file path to output 
writer. When a row targets a new file path, we create a new output writer; 
otherwise, we reuse the existing writer for that path (the main change should be 
in FileFormatDataWriter.scala). This is very similar to what Hive does in [2].

Since the new behavior (avoiding the sort by keeping multiple output writers 
open at the same time) consumes more memory on executors than the current 
behavior (only one output writer open), we can add a config to switch between 
the current and new behavior.

 

[1]: spark FileFormatWriter.scala - 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]

[2]: hive FileSinkOperator.java - 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510

 

 


> [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
> --
>
> Key: SPARK-26164
> URL: https://issues.apache.org/jira/browse/SPARK-26164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Cheng Su
>Priority: Minor
>
> Problem:
> Currently, Spark always requires a local sort on the partition/bucket columns 
> before writing to an output table [1]. The disadvantage is that the sort might 
> waste reserved CPU time on executors due to spill. Hive does not require a 
> local sort before writing the output table [2], and we saw a performance 
> regression when migrating Hive workloads to Spark.
>  
> Proposal:
> We can avoid the local sort by keeping a mapping from file path to output 
> writer. When a row targets a new file path, we create a new output writer; 
> otherwise, we reuse the existing writer for that path (the main change should 
> be in FileFormatDataWriter.scala). This is very similar to what Hive does in [2].
> Since the new behavior (avoiding the sort by keeping multiple output writers 
> open at the same time) consumes more memory on executors than the current 
> behavior (only one output writer open), we can add a config to switch between 
> the current and new behavior.
>  
> [1]: spark FileFormatWriter.scala - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]
> [2]: hive FileSinkOperator.java - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510]
>  
>  




[jira] [Created] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2018-11-24 Thread Cheng Su (JIRA)
Cheng Su created SPARK-26164:


 Summary: [SQL] Allow FileFormatWriter to write multiple 
partitions/buckets without sort
 Key: SPARK-26164
 URL: https://issues.apache.org/jira/browse/SPARK-26164
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Cheng Su


Problem:

Currently, Spark always requires a local sort on the partition/bucket/sort 
columns before writing to an output table [1]. The disadvantage is that the sort 
might waste reserved CPU time on executors due to spill. Hive does not require a 
local sort before writing the output table [2], and we saw a performance 
regression when migrating Hive workloads to Spark.

 

Proposal:

We can avoid the local sort by keeping a mapping from file path to output 
writer. When a row targets a new file path, we create a new output writer; 
otherwise, we reuse the existing writer for that path (the main change should be 
in FileFormatDataWriter.scala). This is very similar to what Hive does in [2].

Since the new behavior (avoiding the sort by keeping multiple output writers 
open at the same time) consumes more memory on executors than the current 
behavior (only one output writer open), we can add a config to switch between 
the current and new behavior.

 

[1]: spark FileFormatWriter.scala - 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]

[2]: hive FileSinkOperator.java - 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510

 

 






[jira] [Commented] (SPARK-26163) Parsing decimals from JSON using locale

2018-11-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697990#comment-16697990
 ] 

Apache Spark commented on SPARK-26163:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23132

> Parsing decimals from JSON using locale
> ---
>
> Key: SPARK-26163
> URL: https://issues.apache.org/jira/browse/SPARK-26163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, the decimal type can be inferred from a JSON field value only if 
> it is recognised by JacksonParser, and the inference does not depend on the 
> locale. Also, decimals in a locale-specific format cannot be parsed. This 
> ticket aims to support parsing/converting/inferring decimals from JSON input 
> using the locale.






[jira] [Assigned] (SPARK-26163) Parsing decimals from JSON using locale

2018-11-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26163:


Assignee: Apache Spark

> Parsing decimals from JSON using locale
> ---
>
> Key: SPARK-26163
> URL: https://issues.apache.org/jira/browse/SPARK-26163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the decimal type can be inferred from a JSON field value only if 
> it is recognised by JacksonParser, and the inference does not depend on the 
> locale. Also, decimals in a locale-specific format cannot be parsed. This 
> ticket aims to support parsing/converting/inferring decimals from JSON input 
> using the locale.






[jira] [Commented] (SPARK-26163) Parsing decimals from JSON using locale

2018-11-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697989#comment-16697989
 ] 

Apache Spark commented on SPARK-26163:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23132

> Parsing decimals from JSON using locale
> ---
>
> Key: SPARK-26163
> URL: https://issues.apache.org/jira/browse/SPARK-26163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, the decimal type can be inferred from a JSON field value only if 
> it is recognised by JacksonParser, and the inference does not depend on the 
> locale. Also, decimals in a locale-specific format cannot be parsed. This 
> ticket aims to support parsing/converting/inferring decimals from JSON input 
> using the locale.






[jira] [Assigned] (SPARK-26163) Parsing decimals from JSON using locale

2018-11-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26163:


Assignee: (was: Apache Spark)

> Parsing decimals from JSON using locale
> ---
>
> Key: SPARK-26163
> URL: https://issues.apache.org/jira/browse/SPARK-26163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, the decimal type can be inferred from a JSON field value only if 
> it is recognised by JacksonParser, and the inference does not depend on the 
> locale. Also, decimals in a locale-specific format cannot be parsed. This 
> ticket aims to support parsing/converting/inferring decimals from JSON input 
> using the locale.






[jira] [Created] (SPARK-26163) Parsing decimals from JSON using locale

2018-11-24 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26163:
--

 Summary: Parsing decimals from JSON using locale
 Key: SPARK-26163
 URL: https://issues.apache.org/jira/browse/SPARK-26163
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently, the decimal type can be inferred from a JSON field value only if it 
is recognised by JacksonParser, and the inference does not depend on the locale. 
Also, decimals in a locale-specific format cannot be parsed. This ticket aims to 
support parsing/converting/inferring decimals from JSON input using the locale.
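
For reference, a small standalone sketch (using plain java.text, not Spark's 
parser) of the kind of locale-aware conversion this would enable:
{code:scala}
import java.text.{DecimalFormat, NumberFormat}
import java.util.Locale

// Parse a locale-specific decimal string into a BigDecimal. Illustration only;
// the ticket itself wires a locale into the JSON datasource.
def parseDecimal(text: String, locale: Locale): java.math.BigDecimal = {
  // For common locales, NumberFormat.getInstance returns a DecimalFormat.
  val format = NumberFormat.getInstance(locale).asInstanceOf[DecimalFormat]
  format.setParseBigDecimal(true) // parse() then yields java.math.BigDecimal
  format.parse(text).asInstanceOf[java.math.BigDecimal]
}

// German locale uses '.' for grouping and ',' as the decimal separator:
// parseDecimal("1.000,01", Locale.GERMANY)  // => 1000.01
{code}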






[jira] [Updated] (SPARK-26162) ALS results vary with user or item ID encodings

2018-11-24 Thread Pablo J. Villacorta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo J. Villacorta updated SPARK-26162:

Description: 
When calling ALS.fit() with the same seed on a dataset, the results (both the 
latent factor matrices and the accuracy of the recommendations) differ when we 
change the labels used to encode the users or items. The code example below 
illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The 
user factors matrix changes, and not only in the rows corresponding to users 1 
or 2 but also in the other rows.

Is this the intended behaviour?
{code:java}
val r = scala.util.Random
r.setSeed(123456)
val trainDataset1 = spark.sparkContext.parallelize(
(1 to 1000).map(_ => (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 19
).toDF("user", "item", "rating")

val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 
30).otherwise(col("user")))
val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 
30).otherwise(col("user")))

val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map(
_.groupBy("user").agg(collect_list("item").alias("watched"))
)

val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, 
trainDataset3).map(new ALS().setSeed(12345).fit(_))

als1.userFactors.show(5, false)
als2.userFactors.show(5, false)
als3.userFactors.show(5, false){code}
If we ask for recommendations and compare them with a test dataset that is also 
modified accordingly (in this example, the test dataset is exactly the train 
dataset), the results also differ:
{code:java}
val recommendations = Array(als1, als2, als3).map(x =>
x.recommendForAllUsers(20).map{
case Row(user: Int, recommendations: WrappedArray[Row]) => {
val items = recommendations.map{case Row(item: Int, score: Float) 
=> item}
(user, items)
}
}.toDF("user", "recommendations")
)

val predictionsAndActualRDD = testDatasets.zip(recommendations).map{
case (testDataset, recommendationsDF) =>
testDataset.join(recommendationsDF, "user")
.rdd.map(r => {
(r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array,
r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array
)
})
}

val metrics = predictionsAndActualRDD.map(new RankingMetrics(_))

println(s"Precision at 5 of first model = ${metrics(0).precisionAt(5)}")
println(s"Precision at 5 of second model = ${metrics(1).precisionAt(5)}")
println(s"Precision at 5 of third model = ${metrics(2).precisionAt(5)}")
{code}
EDIT: The results also change if we just swap the IDs of some users, like:
{code:java}
val trainDataset4 = trainDataset1.withColumn("user", when(col("user")===1, 4)
.when(col("user")===4, 
1).otherwise(col("user")))
{code}

  was:
When calling ALS.fit() with the same seed on a dataset, the results (both the 
latent factor matrices and the accuracy of the recommendations) differ when we 
change the labels used to encode the users or items. The code example below 
illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The 
user factors matrix changes, and not only in the rows corresponding to users 1 
or 2 but also in the other rows.

Is this the intended behaviour?
{code:java}
val r = scala.util.Random
r.setSeed(123456)
val trainDataset1 = spark.sparkContext.parallelize(
(1 to 1000).map(_ => (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 19
).toDF("user", "item", "rating")

val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 
30).otherwise(col("user")))
val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 
30).otherwise(col("user")))

val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map(
_.groupBy("user").agg(collect_list("item").alias("watched"))
)

val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, 
trainDataset3).map(new ALS().setSeed(12345).fit(_))

als1.userFactors.show(5, false)
als2.userFactors.show(5, false)
als3.userFactors.show(5, false){code}
If we ask for recommendations and compare them with a test dataset that is also 
modified accordingly (in this example, the test dataset is exactly the train 
dataset), the results also differ:
{code:java}
val recommendations = Array(als1, als2, als3).map(x =>
x.recommendForAllUsers(20).map{
case Row(user: Int, recommendations: WrappedArray[Row]) => {
val items = recommendations.map{case Row(item: Int, score: Float) 
=> item}
(user, items)
}
}.toDF("user", "recommendations")
)

val predictionsAndActualRDD = testDatasets.zip(recommendations).map{
case (testDataset, recommendationsDF) =>
testDataset.join(recommendationsDF, "user")
.rdd.map(r => {

[jira] [Commented] (SPARK-24710) Information Gain Ratio for decision trees

2018-11-24 Thread Pablo J. Villacorta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697974#comment-16697974
 ] 

Pablo J. Villacorta commented on SPARK-24710:
-

It seems this is not considered a very interesting feature by the community... 
have you worked on it, Aleksey?

> Information Gain Ratio for decision trees
> -
>
> Key: SPARK-24710
> URL: https://issues.apache.org/jira/browse/SPARK-24710
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Pablo J. Villacorta
>Priority: Minor
>  Labels: features
>
> Spark currently uses Information Gain (IG) to decide the next feature to 
> branch on when building a decision tree. In case of categorical features, IG 
> is known to be biased towards features with a large number of categories. 
> [Information Gain Ratio|https://en.wikipedia.org/wiki/Information_gain_ratio] 
> solves this problem by dividing the IG by a number that characterizes the 
> intrinsic information of a feature.
> As far as I know, Spark has IG but not IGR. It would be nice to have the 
> possibility to choose IGR instead of IG.






[jira] [Updated] (SPARK-26162) ALS results vary with user or item ID encodings

2018-11-24 Thread Pablo J. Villacorta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo J. Villacorta updated SPARK-26162:

Description: 
When calling ALS.fit() with the same seed on a dataset, the results (both the 
latent factor matrices and the accuracy of the recommendations) differ when we 
change the labels used to encode the users or items. The code example below 
illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The 
user factors matrix changes, and not only in the rows corresponding to users 1 
or 2 but also in the other rows.

Is this the intended behaviour?
{code:java}
val r = scala.util.Random
r.setSeed(123456)
val trainDataset1 = spark.sparkContext.parallelize(
(1 to 1000).map(_ => (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 4
).toDF("user", "item", "rating")

val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 
30).otherwise(col("user")))
val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 
30).otherwise(col("user")))

val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map(
_.groupBy("user").agg(collect_list("item").alias("watched"))
)

val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, 
trainDataset3).map(new ALS().setSeed(12345).fit(_))

als1.userFactors.show(5, false)
als2.userFactors.show(5, false)
als3.userFactors.show(5, false){code}
If we ask for recommendations and compare them with a test dataset that is also 
modified accordingly (in this example, the test dataset is exactly the train 
dataset), the results also differ:
{code:java}
val recommendations = Array(als1, als2, als3).map(x =>
x.recommendForAllUsers(20).map{
case Row(user: Int, recommendations: WrappedArray[Row]) => {
val items = recommendations.map{case Row(item: Int, score: Float) 
=> item}
(user, items)
}
}.toDF("user", "recommendations")
)

val predictionsAndActualRDD = testDatasets.zip(recommendations).map{
case (testDataset, recommendationsDF) =>
testDataset.join(recommendationsDF, "user")
.rdd.map(r => {
(r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array,
r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array
)
})
}

val metrics = predictionsAndActualRDD.map(new RankingMetrics(_))

println(s"Precision at 5 of first model = ${metrics(0).precisionAt(5)}")
println(s"Precision at 5 of second model = ${metrics(1).precisionAt(5)}")
println(s"Precision at 5 of third model = ${metrics(2).precisionAt(5)}")
{code}

  was:
When calling ALS.fit() with the same seed on a dataset, the results (both the 
latent factor matrices and the accuracy of the recommendations) differ when we 
change the labels used to encode the users or items. The code example below 
illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The 
user factors matrix changes, and not only in the rows corresponding to users 1 
or 2 but also in the other rows.

Is this the intended behaviour?
{code:java}
val r = scala.util.Random
r.setSeed(123456)
val trainDataset1 = spark.sparkContext.parallelize(
(1 to 1000).map(_ => (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 4
).toDF("user", "item", "rating")

val maxuser = trainDataset1.select(max("user")).head.getAs[Int](0)
println(s"maxuser is ${maxuser}")

val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 
30).otherwise(col("user")))
val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 
30).otherwise(col("user")))

val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map(
_.groupBy("user").agg(collect_list("item").alias("watched"))
)

val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, 
trainDataset3).map(new ALS().setSeed(12345).fit(_))

als1.userFactors.show(5, false)
als2.userFactors.show(5, false)
als3.userFactors.show(5, false){code}
If we ask for recommendations and compare them with a test dataset that is also 
modified accordingly (in this example, the test dataset is exactly the train 
dataset), the results also differ:
{code:java}
val recommendations = Array(als1, als2, als3).map(x =>
x.recommendForAllUsers(20).map{
case Row(user: Int, recommendations: WrappedArray[Row]) => {
val items = recommendations.map{case Row(item: Int, score: Float) 
=> item}
(user, items)
}
}.toDF("user", "recommendations")
)

val predictionsAndActualRDD = testDatasets.zip(recommendations).map{
case (testDataset, recommendationsDF) =>
testDataset.join(recommendationsDF, "user")
.rdd.map(r => {
(r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array,
r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array
)
})
}

val metrics = 

[jira] [Updated] (SPARK-26162) ALS results vary with user or item ID encodings

2018-11-24 Thread Pablo J. Villacorta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo J. Villacorta updated SPARK-26162:

Description: 
When calling ALS.fit() with the same seed on a dataset, the results (both the 
latent factor matrices and the accuracy of the recommendations) differ when we 
change the labels used to encode the users or items. The code example below 
illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The 
user factors matrix changes, and not only in the rows corresponding to users 1 
or 2 but also in the other rows.

Is this the intended behaviour?
{code:java}
val r = scala.util.Random
r.setSeed(123456)
val trainDataset1 = spark.sparkContext.parallelize(
(1 to 1000).map(_ => (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 19
).toDF("user", "item", "rating")

val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 
30).otherwise(col("user")))
val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 
30).otherwise(col("user")))

val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map(
_.groupBy("user").agg(collect_list("item").alias("watched"))
)

val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, 
trainDataset3).map(new ALS().setSeed(12345).fit(_))

als1.userFactors.show(5, false)
als2.userFactors.show(5, false)
als3.userFactors.show(5, false){code}
If we ask for recommendations and compare them with a test dataset that is also 
modified accordingly (in this example, the test dataset is exactly the train 
dataset), the results also differ:
{code:java}
val recommendations = Array(als1, als2, als3).map(x =>
x.recommendForAllUsers(20).map{
case Row(user: Int, recommendations: WrappedArray[Row]) => {
val items = recommendations.map{case Row(item: Int, score: Float) 
=> item}
(user, items)
}
}.toDF("user", "recommendations")
)

val predictionsAndActualRDD = testDatasets.zip(recommendations).map{
case (testDataset, recommendationsDF) =>
testDataset.join(recommendationsDF, "user")
.rdd.map(r => {
(r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array,
r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array
)
})
}

val metrics = predictionsAndActualRDD.map(new RankingMetrics(_))

println(s"Precision at 5 of first model = ${metrics(0).precisionAt(5)}")
println(s"Precision at 5 of second model = ${metrics(1).precisionAt(5)}")
println(s"Precision at 5 of third model = ${metrics(2).precisionAt(5)}")
{code}

  was:
When calling ALS.fit() with the same seed on a dataset, the results (both the 
latent factor matrices and the accuracy of the recommendations) differ when we 
change the labels used to encode the users or items. The code example below 
illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The 
user factors matrix changes, and not only in the rows corresponding to users 1 
or 2 but also in the other rows.

Is this the intended behaviour?
{code:java}
val r = scala.util.Random
r.setSeed(123456)
val trainDataset1 = spark.sparkContext.parallelize(
(1 to 1000).map(_ => (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 4
).toDF("user", "item", "rating")

val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 
30).otherwise(col("user")))
val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 
30).otherwise(col("user")))

val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map(
_.groupBy("user").agg(collect_list("item").alias("watched"))
)

val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, 
trainDataset3).map(new ALS().setSeed(12345).fit(_))

als1.userFactors.show(5, false)
als2.userFactors.show(5, false)
als3.userFactors.show(5, false){code}
If we ask for recommendations and compare them with a test dataset that is also 
modified accordingly (in this example, the test dataset is exactly the train 
dataset), the results also differ:
{code:java}
val recommendations = Array(als1, als2, als3).map(x =>
x.recommendForAllUsers(20).map{
case Row(user: Int, recommendations: WrappedArray[Row]) => {
val items = recommendations.map{case Row(item: Int, score: Float) 
=> item}
(user, items)
}
}.toDF("user", "recommendations")
)

val predictionsAndActualRDD = testDatasets.zip(recommendations).map{
case (testDataset, recommendationsDF) =>
testDataset.join(recommendationsDF, "user")
.rdd.map(r => {
(r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array,
r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array
)
})
}

val metrics = predictionsAndActualRDD.map(new RankingMetrics(_))

println(s"Precision at 5 of first model = 

[jira] [Created] (SPARK-26162) ALS results vary with user or item ID encodings

2018-11-24 Thread Pablo J. Villacorta (JIRA)
Pablo J. Villacorta created SPARK-26162:
---

 Summary: ALS results vary with user or item ID encodings
 Key: SPARK-26162
 URL: https://issues.apache.org/jira/browse/SPARK-26162
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.0
Reporter: Pablo J. Villacorta


When calling ALS.fit() with the same seed on a dataset, the results (both the 
latent factor matrices and the accuracy of the recommendations) differ when we 
change the labels used to encode the users or items. The code example below 
illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The 
user factors matrix changes, and not only in the rows corresponding to users 1 
or 2 but also in the other rows.

Is this the intended behaviour?
{code:java}
val r = scala.util.Random
r.setSeed(123456)
val trainDataset1 = spark.sparkContext.parallelize(
(1 to 1000).map(_ => (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 4
).toDF("user", "item", "rating")

val maxuser = trainDataset1.select(max("user")).head.getAs[Int](0)
println(s"maxuser is ${maxuser}")

val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 
30).otherwise(col("user")))
val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 
30).otherwise(col("user")))

val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map(
_.groupBy("user").agg(collect_list("item").alias("watched"))
)

val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, 
trainDataset3).map(new ALS().setSeed(12345).fit(_))

als1.userFactors.show(5, false)
als2.userFactors.show(5, false)
als3.userFactors.show(5, false){code}
If we ask for recommendations and compare them with a test dataset that is also 
modified accordingly (in this example, the test dataset is exactly the train 
dataset), the results also differ:
{code:java}
val recommendations = Array(als1, als2, als3).map(x =>
x.recommendForAllUsers(20).map{
case Row(user: Int, recommendations: WrappedArray[Row]) => {
val items = recommendations.map{case Row(item: Int, score: Float) 
=> item}
(user, items)
}
}.toDF("user", "recommendations")
)

val predictionsAndActualRDD = testDatasets.zip(recommendations).map{
case (testDataset, recommendationsDF) =>
testDataset.join(recommendationsDF, "user")
.rdd.map(r => {
(r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array,
r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array
)
})
}

val metrics = predictionsAndActualRDD.map(new RankingMetrics(_))

println(s"Precision at 5 of first model = ${metrics(0).precisionAt(5)}")
println(s"Precision at 5 of second model = ${metrics(1).precisionAt(5)}")
println(s"Precision at 5 of third model = ${metrics(2).precisionAt(5)}")
{code}






[jira] [Updated] (SPARK-25908) Remove old deprecated items in Spark 3

2018-11-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25908:
--
Description: 
There are many deprecated methods and classes in Spark. They _can_ be removed 
in Spark 3, and for those that have been deprecated a long time (i.e. since 
Spark <= 2.0), we should probably do so. This addresses most of these cases, 
the easiest ones, those that are easy to remove and are old:
 - Remove some AccumulableInfo .apply() methods
 - Remove non-label-specific multiclass precision/recall/fScore in favor of 
accuracy
 - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only 
deprecated)
 - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only 
deprecated)
 - Remove unused Python StorageLevel constants
 - Remove unused multiclass option in libsvm parsing
 - Remove references to deprecated spark configs like spark.yarn.am.port
 - Remove TaskContext.isRunningLocally
 - Remove ShuffleMetrics.shuffle* methods
 - Remove BaseReadWrite.context in favor of session
 - Remove Column.!== in favor of =!=
 - Remove Dataset.explode
 - Remove Dataset.registerTempTable
 - Remove SQLContext.getOrCreate, setActive, clearActive, constructors

Not touched yet:
 - everything else in MLLib
 - HiveContext
 - Anything deprecated more recently than 2.0.0, generally
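
For illustration, a few user-side migrations corresponding to items in the list 
above (a sketch; the replacements are the ones the list names, available since 
the Spark 2.x line):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{approx_count_distinct, degrees}

val spark = SparkSession.builder().master("local[*]").appName("migrations").getOrCreate()
import spark.implicits._

val df = Seq((1, 0.5), (2, 1.0)).toDF("id", "x")

df.select(approx_count_distinct($"id"))   // instead of approxCountDistinct
df.select(degrees($"x"))                  // instead of toDegrees
df.filter($"id" =!= 1)                    // instead of Column.!==
df.createOrReplaceTempView("t")           // instead of Dataset.registerTempTable
{code}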

  was:
There are many deprecated methods and classes in Spark. They _can_ be removed 
in Spark 3, and for those that have been deprecated a long time (i.e. since 
Spark <= 2.0), we should probably do so. This addresses most of these cases, 
the easiest ones, those that are easy to remove and are old:
 - Remove some AccumulableInfo .apply() methods
 - Remove non-label-specific multiclass precision/recall/fScore in favor of 
accuracy
 - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only 
deprecated)
 - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only 
deprecated)
 - Remove unused Python StorageLevel constants
 - Remove Dataset unionAll in favor of union
 - Remove unused multiclass option in libsvm parsing
 - Remove references to deprecated spark configs like spark.yarn.am.port
 - Remove TaskContext.isRunningLocally
 - Remove ShuffleMetrics.shuffle* methods
 - Remove BaseReadWrite.context in favor of session
 - Remove Column.!== in favor of =!=
 - Remove Dataset.explode
 - Remove Dataset.registerTempTable
 - Remove SQLContext.getOrCreate, setActive, clearActive, constructors

Not touched yet:
 - everything else in MLLib
 - HiveContext
 - Anything deprecated more recently than 2.0.0, generally


> Remove old deprecated items in Spark 3
> --
>
> Key: SPARK-25908
> URL: https://issues.apache.org/jira/browse/SPARK-25908
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> There are many deprecated methods and classes in Spark. They _can_ be removed 
> in Spark 3, and for those that have been deprecated a long time (i.e. since 
> Spark <= 2.0), we should probably do so. This addresses most of these cases, 
> the easiest ones, those that are easy to remove and are old:
>  - Remove some AccumulableInfo .apply() methods
>  - Remove non-label-specific multiclass precision/recall/fScore in favor of 
> accuracy
>  - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only 
> deprecated)
>  - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only 
> deprecated)
>  - Remove unused Python StorageLevel constants
>  - Remove unused multiclass option in libsvm parsing
>  - Remove references to deprecated spark configs like spark.yarn.am.port
>  - Remove TaskContext.isRunningLocally
>  - Remove ShuffleMetrics.shuffle* methods
>  - Remove BaseReadWrite.context in favor of session
>  - Remove Column.!== in favor of =!=
>  - Remove Dataset.explode
>  - Remove Dataset.registerTempTable
>  - Remove SQLContext.getOrCreate, setActive, clearActive, constructors
> Not touched yet:
>  - everything else in MLLib
>  - HiveContext
>  - Anything deprecated more recently than 2.0.0, generally






[jira] [Commented] (SPARK-25908) Remove old deprecated items in Spark 3

2018-11-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697947#comment-16697947
 ] 

Apache Spark commented on SPARK-25908:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/23131

> Remove old deprecated items in Spark 3
> --
>
> Key: SPARK-25908
> URL: https://issues.apache.org/jira/browse/SPARK-25908
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> There are many deprecated methods and classes in Spark. They _can_ be removed 
> in Spark 3, and for those that have been deprecated a long time (i.e. since 
> Spark <= 2.0), we should probably do so. This addresses most of these cases, 
> the easiest ones, those that are easy to remove and are old:
>  - Remove some AccumulableInfo .apply() methods
>  - Remove non-label-specific multiclass precision/recall/fScore in favor of 
> accuracy
>  - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only 
> deprecated)
>  - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only 
> deprecated)
>  - Remove unused Python StorageLevel constants
>  - Remove Dataset unionAll in favor of union
>  - Remove unused multiclass option in libsvm parsing
>  - Remove references to deprecated spark configs like spark.yarn.am.port
>  - Remove TaskContext.isRunningLocally
>  - Remove ShuffleMetrics.shuffle* methods
>  - Remove BaseReadWrite.context in favor of session
>  - Remove Column.!== in favor of =!=
>  - Remove Dataset.explode
>  - Remove Dataset.registerTempTable
>  - Remove SQLContext.getOrCreate, setActive, clearActive, constructors
> Not touched yet:
>  - everything else in MLLib
>  - HiveContext
>  - Anything deprecated more recently than 2.0.0, generally






[jira] [Commented] (SPARK-25908) Remove old deprecated items in Spark 3

2018-11-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697946#comment-16697946
 ] 

Apache Spark commented on SPARK-25908:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/23131

> Remove old deprecated items in Spark 3
> --
>
> Key: SPARK-25908
> URL: https://issues.apache.org/jira/browse/SPARK-25908
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> There are many deprecated methods and classes in Spark. They _can_ be removed 
> in Spark 3, and for those that have been deprecated a long time (i.e. since 
> Spark <= 2.0), we should probably do so. This addresses most of these cases, 
> the easiest ones, those that are easy to remove and are old:
>  - Remove some AccumulableInfo .apply() methods
>  - Remove non-label-specific multiclass precision/recall/fScore in favor of 
> accuracy
>  - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only 
> deprecated)
>  - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only 
> deprecated)
>  - Remove unused Python StorageLevel constants
>  - Remove Dataset unionAll in favor of union
>  - Remove unused multiclass option in libsvm parsing
>  - Remove references to deprecated spark configs like spark.yarn.am.port
>  - Remove TaskContext.isRunningLocally
>  - Remove ShuffleMetrics.shuffle* methods
>  - Remove BaseReadWrite.context in favor of session
>  - Remove Column.!== in favor of =!=
>  - Remove Dataset.explode
>  - Remove Dataset.registerTempTable
>  - Remove SQLContext.getOrCreate, setActive, clearActive, constructors
> Not touched yet:
>  - everything else in MLLib
>  - HiveContext
>  - Anything deprecated more recently than 2.0.0, generally






[jira] [Assigned] (SPARK-26161) Ignore empty files in load

2018-11-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26161:


Assignee: (was: Apache Spark)

> Ignore empty files in load
> --
>
> Key: SPARK-26161
> URL: https://issues.apache.org/jira/browse/SPARK-26161
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, empty files are opened in load, and Spark tries to read data from 
> them. In some cases, empty partitions are produced from such empty files, for 
> example with *wholetext* in the Text datasource and the *multiLine* mode in 
> the CSV/JSON datasources. The behaviour is unnecessary, and empty files can be 
> skipped when reading. This can reduce the number of tasks submitted for 
> loading empty files.






[jira] [Assigned] (SPARK-26161) Ignore empty files in load

2018-11-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26161:


Assignee: Apache Spark

> Ignore empty files in load
> --
>
> Key: SPARK-26161
> URL: https://issues.apache.org/jira/browse/SPARK-26161
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, empty files are opened in load, and Spark tries to read data from 
> them. In some cases, empty partitions are produced from such empty files, for 
> example with *wholetext* in the Text datasource and the *multiLine* mode in 
> the CSV/JSON datasources. The behaviour is unnecessary, and empty files can be 
> skipped when reading. This can reduce the number of tasks submitted for 
> loading empty files.






[jira] [Commented] (SPARK-26161) Ignore empty files in load

2018-11-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697913#comment-16697913
 ] 

Apache Spark commented on SPARK-26161:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23130

> Ignore empty files in load
> --
>
> Key: SPARK-26161
> URL: https://issues.apache.org/jira/browse/SPARK-26161
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, empty files are opened in load, and Spark tries to read data from 
> them. In some cases, empty partitions are produced from such empty files, for 
> example with *wholetext* in the Text datasource and the *multiLine* mode in 
> the CSV/JSON datasources. The behaviour is unnecessary, and empty files can be 
> skipped when reading. This can reduce the number of tasks submitted for 
> loading empty files.






[jira] [Created] (SPARK-26161) Ignore empty files in load

2018-11-24 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26161:
--

 Summary: Ignore empty files in load
 Key: SPARK-26161
 URL: https://issues.apache.org/jira/browse/SPARK-26161
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently, empty files are opened in load, and Spark tries to read data from 
them. In some cases, empty partitions are produced from such empty files, for 
example with *wholetext* in the Text datasource and the *multiLine* mode in the 
CSV/JSON datasources. The behaviour is unnecessary, and empty files can be 
skipped when reading. This can reduce the number of tasks submitted for loading 
empty files.
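
The core of the change is a simple filter at planning time; a hypothetical 
sketch (names are illustrative, not Spark's internals, which work on 
FileStatus/PartitionedFile objects):
{code:scala}
// Drop zero-length files before creating read tasks, so empty files produce
// neither partitions nor tasks.
case class FileEntry(path: String, length: Long)

def filesToRead(files: Seq[FileEntry]): Seq[FileEntry] =
  files.filter(_.length > 0)
{code}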






[jira] [Created] (SPARK-26160) Make assertNotBucketed call in DataFrameWriter::save optional

2018-11-24 Thread Ximo Guanter (JIRA)
Ximo Guanter created SPARK-26160:


 Summary: Make assertNotBucketed call in DataFrameWriter::save 
optional
 Key: SPARK-26160
 URL: https://issues.apache.org/jira/browse/SPARK-26160
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Ximo Guanter


Currently, the `.save` method in DataFrameWriter will fail if the dataframe is 
bucketed and/or sorted. This makes sense, since there is no way of storing 
metadata in the current file-based data sources to know whether a file was 
bucketed or not.

I have a use case where I would like to implement a new, file-based data source 
which could keep track of that kind of metadata, so I would like to be able to 
`.save` bucketed dataframes.

 

 Would a patch to extend the data source API with an indicator of whether a 
source is able to serialize bucketed dataframes be a welcome addition? I'm 
happy to work on it.
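
One possible shape for such an indicator, sketched with made-up names (this is 
not an existing Spark API):
{code:scala}
// A capability flag a data source could expose so that DataFrameWriter.save
// skips the bucketing assertion when the source can persist bucket metadata.
trait SupportsBucketedWrite {
  def canWriteBucketed: Boolean
}

def checkBucketing(source: AnyRef, isBucketed: Boolean): Unit = source match {
  case s: SupportsBucketedWrite if s.canWriteBucketed =>
    // allowed: the source tracks bucketing metadata itself
  case _ if isBucketed =>
    throw new UnsupportedOperationException("'save' does not support bucketing for this source")
  case _ =>
    // not bucketed, nothing to check
}
{code}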






[jira] [Updated] (SPARK-25786) If the ByteBuffer.hasArray is false , it will throw UnsupportedOperationException for Kryo

2018-11-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25786:
--
Fix Version/s: 2.4.1
   2.3.3

> If the ByteBuffer.hasArray is false , it will throw 
> UnsupportedOperationException for Kryo
> --
>
> Key: SPARK-25786
> URL: https://issues.apache.org/jira/browse/SPARK-25786
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> In `{color:#ffc66d}deserialize{color}` for Kryo, the type of the input parameter 
> is ByteBuffer; if it is not backed by an accessible byte array, it will throw 
> UnsupportedOperationException.
> Exception Info:
> java.lang.UnsupportedOperationException was thrown.
>  java.lang.UnsupportedOperationException
>      at java.nio.ByteBuffer.array(ByteBuffer.java:994)
>      at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)
>  
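
For context, the usual way to handle a ByteBuffer that is not array-backed looks 
roughly like this (a sketch of the general pattern, not necessarily Spark's 
exact patch):
{code:scala}
import java.nio.ByteBuffer

// Copy the buffer's remaining bytes without calling .array(), which throws
// UnsupportedOperationException for direct or read-only buffers.
def toBytes(buffer: ByteBuffer): Array[Byte] = {
  if (buffer.hasArray) {
    // Array-backed buffer: slice the accessible backing array.
    java.util.Arrays.copyOfRange(
      buffer.array(),
      buffer.arrayOffset() + buffer.position(),
      buffer.arrayOffset() + buffer.limit())
  } else {
    // Direct/read-only buffer: bulk-get into a fresh array via a duplicate,
    // so the original buffer's position is left untouched.
    val bytes = new Array[Byte](buffer.remaining())
    buffer.duplicate().get(bytes)
    bytes
  }
}
{code}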






[jira] [Assigned] (SPARK-25786) If the ByteBuffer.hasArray is false , it will throw UnsupportedOperationException for Kryo

2018-11-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25786:
-

Assignee: liuxian

> If the ByteBuffer.hasArray is false , it will throw 
> UnsupportedOperationException for Kryo
> --
>
> Key: SPARK-25786
> URL: https://issues.apache.org/jira/browse/SPARK-25786
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Major
>
> In `{color:#ffc66d}deserialize{color}` for Kryo, the type of the input parameter 
> is ByteBuffer; if it is not backed by an accessible byte array, it will throw 
> UnsupportedOperationException.
> Exception Info:
> java.lang.UnsupportedOperationException was thrown.
>  java.lang.UnsupportedOperationException
>      at java.nio.ByteBuffer.array(ByteBuffer.java:994)
>      at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)
>  






[jira] [Resolved] (SPARK-25786) If the ByteBuffer.hasArray is false , it will throw UnsupportedOperationException for Kryo

2018-11-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25786.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22779
[https://github.com/apache/spark/pull/22779]

> If the ByteBuffer.hasArray is false , it will throw 
> UnsupportedOperationException for Kryo
> --
>
> Key: SPARK-25786
> URL: https://issues.apache.org/jira/browse/SPARK-25786
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Major
> Fix For: 3.0.0
>
>
> In `{color:#ffc66d}deserialize{color}` for Kryo, the type of the input parameter 
> is ByteBuffer; if it is not backed by an accessible byte array, it will throw 
> UnsupportedOperationException.
> Exception Info:
> java.lang.UnsupportedOperationException was thrown.
>  java.lang.UnsupportedOperationException
>      at java.nio.ByteBuffer.array(ByteBuffer.java:994)
>      at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)
>  






[jira] [Resolved] (SPARK-26156) Revise summary section of stage page

2018-11-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26156.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23125
[https://github.com/apache/spark/pull/23125]

> Revise summary section of stage page
> 
>
> Key: SPARK-26156
> URL: https://issues.apache.org/jira/browse/SPARK-26156
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> 1. In the summary section of the stage page, the following metric names can be 
> revised:
> Output => Output Size / Records
> Shuffle Read: => Shuffle Read Size / Records
> Shuffle Write => Shuffle Write Size / Records
> After the changes, the names are clearer and consistent with the other names 
> on the same page.
> 2. The associated job ID URL should not contain the three trailing spaces. 
> Reduce the number of spaces to one, and exclude the space from the link.






[jira] [Assigned] (SPARK-26156) Revise summary section of stage page

2018-11-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26156:
-

Assignee: Gengliang Wang

> Revise summary section of stage page
> 
>
> Key: SPARK-26156
> URL: https://issues.apache.org/jira/browse/SPARK-26156
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> 1. In the summary section of the stage page, the following metric names can be 
> revised:
> Output => Output Size / Records
> Shuffle Read: => Shuffle Read Size / Records
> Shuffle Write => Shuffle Write Size / Records
> After the changes, the names are clearer and consistent with the other names 
> on the same page.
> 2. The associated job ID URL should not contain the three trailing spaces. 
> Reduce the number of spaces to one, and exclude the space from the link.






[jira] [Updated] (SPARK-25786) If the ByteBuffer.hasArray is false , it will throw UnsupportedOperationException for Kryo

2018-11-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25786:
--
  Priority: Minor  (was: Major)
Issue Type: Bug  (was: Improvement)

> If the ByteBuffer.hasArray is false , it will throw 
> UnsupportedOperationException for Kryo
> --
>
> Key: SPARK-25786
> URL: https://issues.apache.org/jira/browse/SPARK-25786
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 3.0.0
>
>
> In `{color:#ffc66d}deserialize{color}` for Kryo, the type of the input parameter 
> is ByteBuffer; if it is not backed by an accessible byte array, it will throw 
> UnsupportedOperationException.
> Exception Info:
> java.lang.UnsupportedOperationException was thrown.
>  java.lang.UnsupportedOperationException
>      at java.nio.ByteBuffer.array(ByteBuffer.java:994)
>      at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)
>  






[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations

2018-11-24 Thread Chris Baker (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697831#comment-16697831
 ] 

Chris Baker commented on SPARK-19700:
-

I was getting ready to start looking at this in earnest as well, in support of 
the Nomad scheduler for Spark (which we're currently maintaining in a fork of 
Spark).

> Design an API for pluggable scheduler implementations
> -
>
> Key: SPARK-19700
> URL: https://issues.apache.org/jira/browse/SPARK-19700
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Matt Cheah
>Priority: Major
>
> One point that was brought up in discussing SPARK-18278 was that schedulers 
> cannot easily be added to Spark without forking the whole project. The main 
> reason is that much of the scheduler's behavior fundamentally depends on the 
> CoarseGrainedSchedulerBackend class, which is not part of the public API of 
> Spark and is in fact quite a complex module. As resource management and 
> allocation continue to evolve, Spark will need to be integrated with more 
> cluster managers, but maintaining support for all possible allocators in the 
> Spark project would be untenable. Furthermore, it would be impossible for 
> Spark to support proprietary frameworks that are developed by specific users 
> for their other particular use cases.
> Therefore, this ticket proposes making scheduler implementations fully 
> pluggable. The idea is that Spark will provide a Java/Scala interface that is 
> to be implemented by a scheduler that is backed by the cluster manager of 
> interest. The user can compile their scheduler's code into a JAR that is 
> placed on the driver's classpath. Finally, as is the case in the current 
> world, the scheduler implementation is selected and dynamically loaded 
> depending on the user's provided master URL.
> Determining the correct API is the most challenging problem. The current 
> CoarseGrainedSchedulerBackend handles many responsibilities, some of which 
> will be common across all cluster managers, and some which will be specific 
> to a particular cluster manager. For example, the particular mechanism for 
> creating the executor processes will differ between YARN and Mesos, but, once 
> these executors have started running, the means to submit tasks to them over 
> the Netty RPC is identical across the board.
> We must also consider a plugin model and interface for submitting the 
> application as well, because different cluster managers support different 
> configuration options, and thus the driver must be bootstrapped accordingly. 
> For example, in YARN mode the application and Hadoop configuration must be 
> packaged and shipped to the distributed cache prior to launching the job. A 
> prototype of a Kubernetes implementation starts a Kubernetes pod that runs 
> the driver in cluster mode.
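
To make the discussion concrete, here is a purely hypothetical sketch of the 
kind of plugin seam being proposed (all names invented; a provider is selected 
by master URL and loaded from the driver's classpath):
{code:scala}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Invented interfaces: what a pluggable scheduler backend and its provider might look like.
trait PluggableSchedulerBackend {
  def start(): Unit
  def stop(): Unit
  def requestExecutors(count: Int): Boolean
}

trait SchedulerBackendProvider {
  def canHandle(masterUrl: String): Boolean
  def create(masterUrl: String): PluggableSchedulerBackend
}

object SchedulerBackendLoader {
  // Pick the first provider on the classpath (via ServiceLoader) that claims the master URL.
  def load(masterUrl: String): PluggableSchedulerBackend =
    ServiceLoader.load(classOf[SchedulerBackendProvider]).asScala
      .find(_.canHandle(masterUrl))
      .map(_.create(masterUrl))
      .getOrElse(throw new IllegalArgumentException(s"No scheduler backend for $masterUrl"))
}
{code}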


