[jira] [Updated] (SPARK-23176) REPL project build failing in Spark v2.2.0

2018-01-21 Thread shekhar reddy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shekhar reddy updated SPARK-23176:
--
Description: 
I tried building Spark v2.2.0 and got a compilation error in 
repl/scala-2.11/src/main/scala/org/apache/spark/repl/*Main.scala* 
([https://github.com/apache/spark/tree/v2.2.0/repl/scala-2.11/src/main/scala/org/apache/spark/repl])

at line 59: {{val jars = Utils.getUserJars(conf, isShell = true).mkString(File.pathSeparator)}}

I replaced it with the Spark v2.2.1 source and it compiled and worked fine.

Could you please fix this build error so that it helps future users?

 

 

  was:
I tried building Spark v2.2.0 and got a compilation error in 

repl/scala-2.11/src/main/scala/org/apache/spark/repl/*Main.scala* 
([https://github.com/apache/spark/tree/v2.2.0/repl/scala-2.11/src/main/scala/org/apache/spark/repl])

at line 59: {{val jars = Utils.getUserJars(conf, isShell = true).mkString(File.pathSeparator)}}

I replaced it with the Spark v2.2.1 source and it compiled and worked fine.

Could you please fix this build error so that it helps future users?

 

 


> REPL project  build failing in Spark v2.2.0
> ---
>
> Key: SPARK-23176
> URL: https://issues.apache.org/jira/browse/SPARK-23176
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: shekhar reddy
>Priority: Major
>
> I tried building Spark v2.2.0 and got a compilation error in 
> repl/scala-2.11/src/main/scala/org/apache/spark/repl/*Main.scala* 
> ([https://github.com/apache/spark/tree/v2.2.0/repl/scala-2.11/src/main/scala/org/apache/spark/repl])
> at line 59: {{val jars = Utils.getUserJars(conf, isShell = true).mkString(File.pathSeparator)}}
> I replaced it with the Spark v2.2.1 source and it compiled and worked fine.
> Could you please fix this build error so that it helps future users?
>  
>  






[jira] [Updated] (SPARK-23176) REPL project build failing in Spark v2.2.0

2018-01-21 Thread shekhar reddy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shekhar reddy updated SPARK-23176:
--
Priority: Blocker  (was: Major)

> REPL project  build failing in Spark v2.2.0
> ---
>
> Key: SPARK-23176
> URL: https://issues.apache.org/jira/browse/SPARK-23176
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: shekhar reddy
>Priority: Blocker
>
> I tried building Spark v2.2.0 and got a compilation error in 
> repl/scala-2.11/src/main/scala/org/apache/spark/repl/*Main.scala* 
> ([https://github.com/apache/spark/tree/v2.2.0/repl/scala-2.11/src/main/scala/org/apache/spark/repl])
> at line 59: {{val jars = Utils.getUserJars(conf, isShell = true).mkString(File.pathSeparator)}}
> I replaced it with the Spark v2.2.1 source and it compiled and worked fine.
> Could you please fix this build error so that it helps future users?
>  
>  






[jira] [Created] (SPARK-23176) REPL project build failing in Spark v2.2.0

2018-01-21 Thread shekhar reddy (JIRA)
shekhar reddy created SPARK-23176:
-

 Summary: REPL project  build failing in Spark v2.2.0
 Key: SPARK-23176
 URL: https://issues.apache.org/jira/browse/SPARK-23176
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.0
Reporter: shekhar reddy


I tried building Spark v2.2.0 and got a compilation error in 

repl/scala-2.11/src/main/scala/org/apache/spark/repl/*Main.scala* 
([https://github.com/apache/spark/tree/v2.2.0/repl/scala-2.11/src/main/scala/org/apache/spark/repl])

at line 59: {{val jars = Utils.getUserJars(conf, isShell = true).mkString(File.pathSeparator)}}

I replaced it with the Spark v2.2.1 source and it compiled and worked fine.

Could you please fix this build error so that it helps future users?

 

 






[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2018-01-21 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333963#comment-16333963
 ] 

Wenchen Fan commented on SPARK-19217:
-

Then we also need to define how to do cast, which needs a careful design.

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage (edit: at least as ORC) without 
> converting the vector columns to array columns, and there doesn't appear to be 
> an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?
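
For reference, a minimal Scala sketch of the UDF workaround (option 2 above), assuming an ML {{Vector}} column; the dataframe contents and column names are illustrative, not taken from the ticket:

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative data: an id column plus an ML vector column named "features".
val df = Seq((1, Vectors.dense(0.1, 0.2))).toDF("id", "features")

// UDF that unpacks the vector into array<double>, since cast() cannot do it directly.
val vecToArray = udf((v: Vector) => v.toArray)

df.select(col("id"), vecToArray(col("features")).alias("features")).show()
{code}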






[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2018-01-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333959#comment-16333959
 ] 

Takeshi Yamamuro commented on SPARK-19217:
--

If we can, I think it's best to reuse `sqlType`, but IMO it's difficult to 
do so because `sqlType` only describes the internal type of the underlying data 
structure. If `sqlType` happens to be the type that most users want to cast UDT 
data to, that's totally fine. But if not, we cannot tell which type we should cast 
the data to. `VectorUDT` is a good example; I think most users want to cast 
vectors to arrays, but its `sqlType` is not an array type.

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage (edit: at least as ORC) without 
> converting the vector columns to array columns, and there doesn't appear to be 
> an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?






[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-01-21 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333940#comment-16333940
 ] 

Wenchen Fan commented on SPARK-23173:
-

+1 on proposal 1.

> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-23173
> URL: https://issues.apache.org/jira/browse/SPARK-23173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Herman van Hovell
>Priority: Major
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check if a resulting struct respects 
> the nullability of the schema. This leads to very weird problems in consuming 
> expressions. In our case parquet writing would produce an illegal parquet 
> file.
> There are roughly two solutions here:
>  # Assume that each field in schema passed to {{from_json}} is nullable, and 
> ignore the nullability information set in the passed schema.
>  # Validate the object during runtime, and fail execution if the data is null 
> where we are not expecting this.
> I currently am slightly in favor of option 1, since this is the more 
> performant option and a lot easier to do.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]
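
For illustration, a minimal Scala sketch of the behaviour described above; the schema and JSON document are made up, not taken from the reporter's case:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Schema that declares field "a" as non-nullable.
val schema = StructType(Seq(StructField("a", IntegerType, nullable = false)))

// The JSON document has no "a" at all, yet from_json returns null for it,
// silently violating the declared nullability (the problem described above).
val parsed = Seq("""{"b": 1}""").toDF("json")
  .select(from_json(col("json"), schema).alias("parsed"))

parsed.select(col("parsed.a")).show()
{code}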






[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2018-01-21 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333936#comment-16333936
 ] 

Wenchen Fan commented on SPARK-19217:
-

Before publishing UDT, I'm a little worried about adding more functionality. 
For the cast, can we just recursively check whether `UDT.sqlType` can be cast to 
some type?

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage (edit: at least as ORC) without 
> converting the vector columns to array columns, and there doesn't appear to be 
> an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?






[jira] [Resolved] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23020.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20297
[https://github.com/apache/spark/pull/20297]

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/






[jira] [Resolved] (SPARK-23066) Master Page increase master start-up time.

2018-01-21 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte resolved SPARK-23066.

Resolution: Won't Fix

> Master Page increase master start-up time.
> --
>
> Key: SPARK-23066
> URL: https://issues.apache.org/jira/browse/SPARK-23066
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> When a Spark cluster runs stably for a long time, we do not know how long it 
> has actually been running and cannot get its start-up time from the UI. 
> So, it is necessary to add the Master start-up time to the Master Page.






[jira] [Comment Edited] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2018-01-21 Thread Gaurav Garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333917#comment-16333917
 ] 

Gaurav Garg edited comment on SPARK-18016 at 1/22/18 6:32 AM:
--

[~Tagar], it is working fine for me when pivoting around 20K columns of a 
dataframe, using the above snippet of code.


was (Author: gaurav.garg):
[~Tagar], it is working fine for me when pivoting around 10K columns of a 
dataframe, using the above snippet of code.

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> 

[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2018-01-21 Thread Gaurav Garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333917#comment-16333917
 ] 

Gaurav Garg commented on SPARK-18016:
-

[~Tagar], it is working fine for me when pivoting around 10K columns of a 
dataframe, using the above snippet of code.

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 

[jira] [Updated] (SPARK-23122) Deprecate register* for UDFs in SQLContext and Catalog in PySpark

2018-01-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23122:

Description: 
Deprecate register* for UDFs in SQLContext and Catalog in PySpark

Seems we allow many other ways to register UDFs in SQL statements. Some are in 
{{SQLContext}} and some are in {{Catalog}}.

These are also inconsistent with the Java / Scala APIs. It seems we'd better deprecate 
them and put the logic into {{UDFRegistration}} ({{spark.udf.register*}}).

Please see this discussion too - 
[https://github.com/apache/spark/pull/20217#issuecomment-357134926].

  was:
Seems we allow many other ways to register UDFs in SQL statements. Some are in 
{{SQLContext}} and some are in {{Catalog}}.

These are also inconsistent with the Java / Scala APIs. It seems we'd better deprecate 
them and put the logic into {{UDFRegistration}} ({{spark.udf.register*}}).

Please see this discussion too - 
[https://github.com/apache/spark/pull/20217#issuecomment-357134926].


> Deprecate register* for UDFs in SQLContext and Catalog in PySpark
> -
>
> Key: SPARK-23122
> URL: https://issues.apache.org/jira/browse/SPARK-23122
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.0
>
>
> Deprecate register* for UDFs in SQLContext and Catalog in PySpark
> Seems we allow many other ways to register UDFs in SQL statements. Some are 
> in {{SQLContext}} and some are in {{Catalog}}.
> These are also inconsistent with the Java / Scala APIs. It seems we'd better 
> deprecate them and put the logic into {{UDFRegistration}} 
> ({{spark.udf.register*}}).
> Please see this discussion too - 
> [https://github.com/apache/spark/pull/20217#issuecomment-357134926].
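
For context, a small Scala sketch of the {{UDFRegistration}} route ({{spark.udf.register*}}) that the Java / Scala side already funnels through; the UDF name and body are made up:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Registering through spark.udf (UDFRegistration) makes the function usable in SQL.
spark.udf.register("plusOne", (x: Int) => x + 1)

spark.sql("SELECT plusOne(41)").show()
{code}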






[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2018-01-21 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333916#comment-16333916
 ] 

Ruslan Dautkhanov commented on SPARK-18016:
---

In Spark 2.2 I have the same issue with pivoting much narrower datasets - 
around 5800 columns. Not sure if pivot code generation has some inefficiency 
for wider datasets?

Will test with Spark 2.3 once it is released.

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at 

[jira] [Commented] (SPARK-23122) Deprecate register* for UDFs in SQLContext and Catalog in PySpark

2018-01-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333911#comment-16333911
 ] 

Apache Spark commented on SPARK-23122:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20348

> Deprecate register* for UDFs in SQLContext and Catalog in PySpark
> -
>
> Key: SPARK-23122
> URL: https://issues.apache.org/jira/browse/SPARK-23122
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.0
>
>
> Seems we allow many other ways to register UDFs in SQL statements. Some are 
> in {{SQLContext}} and some are in {{Catalog}}.
> These are also inconsistent with the Java / Scala APIs. It seems we'd better 
> deprecate them and put the logic into {{UDFRegistration}} 
> ({{spark.udf.register*}}).
> Please see this discussion too - 
> [https://github.com/apache/spark/pull/20217#issuecomment-357134926].






[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2018-01-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333897#comment-16333897
 ] 

Kazuaki Ishizaki commented on SPARK-18016:
--

Thanks, I will look at this.

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> 

[jira] [Resolved] (SPARK-23175) Type conversion does not make sense under case like select ’0.1’ = 0

2018-01-21 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-23175.
-
Resolution: Duplicate

> Type conversion does not make sense under case like select ’0.1’ = 0
> 
>
> Key: SPARK-23175
> URL: https://issues.apache.org/jira/browse/SPARK-23175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shaoquan Zhang
>Priority: Major
>
> The SQL {{select '0.1' = 0}} returns true. The result seems unreasonable.
> From the logical plan, the SQL is parsed as 'Project [(cast(cast(0.1 as 
> decimal(20,0)) as int) = 0) AS #6]'. The type conversion converts the string 
> to an integer, which leads to the unreasonable result.
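
To reproduce the coercion described above, a minimal Scala sketch (assuming a local {{SparkSession}}):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// The string literal is coerced toward the numeric side, so the comparison
// unexpectedly evaluates to true.
spark.sql("SELECT '0.1' = 0").show()

// The extended plan shows the inserted casts reported above,
// e.g. cast(cast('0.1' as decimal(20,0)) as int) = 0.
spark.sql("SELECT '0.1' = 0").explain(true)
{code}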






[jira] [Resolved] (SPARK-23000) Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3

2018-01-21 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal resolved SPARK-23000.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by [https://github.com/apache/spark/pull/20328]

> Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3
> -
>
> Key: SPARK-23000
> URL: https://issues.apache.org/jira/browse/SPARK-23000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/
> The test suite DataSourceWithHiveMetastoreCatalogSuite of branch-2.3 always 
> fails in the Hadoop 2.6 build.






[jira] [Resolved] (SPARK-22838) Avoid unnecessary copying of data

2018-01-21 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-22838.
-
Resolution: Invalid

> Avoid unnecessary copying of data
> -
>
> Key: SPARK-22838
> URL: https://issues.apache.org/jira/browse/SPARK-22838
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Xianyang Liu
>Priority: Major
>
> If we read data from a FileChannel into a HeapByteBuffer, the data has to be 
> copied from off-heap memory into the on-heap buffer, as you can see in the following JDK code:
> ```java
> static int read(FileDescriptor var0, ByteBuffer var1, long var2,
>                 NativeDispatcher var4) throws IOException {
>     if (var1.isReadOnly()) {
>         throw new IllegalArgumentException("Read-only buffer");
>     } else if (var1 instanceof DirectBuffer) {
>         return readIntoNativeBuffer(var0, var1, var2, var4);
>     } else {
>         ByteBuffer var5 = Util.getTemporaryDirectBuffer(var1.remaining());
>         int var7;
>         try {
>             int var6 = readIntoNativeBuffer(var0, var5, var2, var4);
>             var5.flip();
>             if (var6 > 0) {
>                 var1.put(var5);
>             }
>             var7 = var6;
>         } finally {
>             Util.offerFirstTemporaryDirectBuffer(var5);
>         }
>         return var7;
>     }
> }
> ```
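
For comparison, a small Scala sketch (plain {{java.nio}}; the file path and buffer size are made up) of the direct-buffer branch above, where the read lands directly in native memory and the extra on-heap copy is skipped:

{code}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

val channel = FileChannel.open(Paths.get("/tmp/data.bin"), StandardOpenOption.READ)

// A direct buffer takes the `instanceof DirectBuffer` branch in the JDK code above,
// so no temporary direct buffer and no copy back into the heap are needed.
val buf = ByteBuffer.allocateDirect(8192)
val bytesRead = channel.read(buf)

channel.close()
{code}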






[jira] [Assigned] (SPARK-22976) Worker cleanup can remove running driver directories

2018-01-21 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-22976:
---

Assignee: Russell Spitzer

> Worker cleanup can remove running driver directories
> 
>
> Key: SPARK-22976
> URL: https://issues.apache.org/jira/browse/SPARK-22976
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.0.2
>Reporter: Russell Spitzer
>Assignee: Russell Spitzer
>Priority: Major
> Fix For: 2.3.0
>
>
> Spark Standalone worker cleanup finds directories to remove with a listFiles 
> command
> This includes both application directories and driver directories from 
> cluster mode submitted applications. 
> A directory is considered to not be part of a running app if the worker does 
> not have an executor with a matching ID.
> https://github.com/apache/spark/blob/v2.2.1/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L432
> {code}
>   val appIds = executors.values.map(_.appId).toSet
>   val isAppStillRunning = appIds.contains(appIdFromDir)
> {code}
> If a driver has been started on a node, but all of the executors are on other 
> nodes, the worker running the driver will always assume that the driver 
> directory is not part of a running app.
> Consider a two node spark cluster with Worker A and Worker B where each node 
> has a single core available. We submit our application in deploy mode 
> cluster, the driver begins running on Worker A while the Executor starts on B.
> Worker A has a cleanup triggered and looks and finds it has a directory
> {code}
> /var/lib/spark/worker/driver-20180105234824-
> {code}
> Worker A checks its executor list and finds no entries which match, 
> since it has no corresponding executors for this application. Worker A then 
> removes the directory even though the driver may still be actively running.
> I think this could be fixed by modifying line 432 to be
> {code}
>   val appIds = executors.values.map(_.appId).toSet ++ 
> drivers.values.map(_.driverId)
> {code}
> I'll run a test and submit a PR soon.






[jira] [Commented] (SPARK-23081) Add colRegex API to PySpark

2018-01-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333861#comment-16333861
 ] 

Xiao Li commented on SPARK-23081:
-

Yeah. Please go ahead. 

> Add colRegex API to PySpark
> ---
>
> Key: SPARK-23081
> URL: https://issues.apache.org/jira/browse/SPARK-23081
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>







[jira] [Resolved] (SPARK-22976) Worker cleanup can remove running driver directories

2018-01-21 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-22976.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Worker cleanup can remove running driver directories
> 
>
> Key: SPARK-22976
> URL: https://issues.apache.org/jira/browse/SPARK-22976
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.0.2
>Reporter: Russell Spitzer
>Priority: Major
> Fix For: 2.3.0
>
>
> Spark Standalone worker cleanup finds directories to remove with a listFiles 
> command
> This includes both application directories and driver directories from 
> cluster mode submitted applications. 
> A directory is considered to not be part of a running app if the worker does 
> not have an executor with a matching ID.
> https://github.com/apache/spark/blob/v2.2.1/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L432
> {code}
>   val appIds = executors.values.map(_.appId).toSet
>   val isAppStillRunning = appIds.contains(appIdFromDir)
> {code}
> If a driver has been started on a node, but all of the executors are on other 
> nodes, the worker running the driver will always assume that the driver 
> directory is not part of a running app.
> Consider a two node spark cluster with Worker A and Worker B where each node 
> has a single core available. We submit our application in deploy mode 
> cluster, the driver begins running on Worker A while the Executor starts on B.
> Worker A has a cleanup triggered and looks and finds it has a directory
> {code}
> /var/lib/spark/worker/driver-20180105234824-
> {code}
> Worker A checks its executor list and finds no entries which match, 
> since it has no corresponding executors for this application. Worker A then 
> removes the directory even though the driver may still be actively running.
> I think this could be fixed by modifying line 432 to be
> {code}
>   val appIds = executors.values.map(_.appId).toSet ++ 
> drivers.values.map(_.driverId)
> {code}
> I'll run a test and submit a PR soon.






[jira] [Resolved] (SPARK-23026) Add RegisterUDF to PySpark

2018-01-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23026.
-
Resolution: Won't Fix

> Add RegisterUDF to PySpark
> --
>
> Key: SPARK-23026
> URL: https://issues.apache.org/jira/browse/SPARK-23026
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> Add a new API for registering row-at-a-time or scalar vectorized UDFs. The 
> registered UDFs can be used in the SQL statement.
> {noformat}
> >>> from pyspark.sql.types import IntegerType
> >>> from pyspark.sql.functions import udf
> >>> slen = udf(lambda s: len(s), IntegerType())
> >>> _ = spark.udf.registerUDF("slen", slen)
> >>> spark.sql("SELECT slen('test')").collect()
> [Row(slen(test)=4)]
> >>> import random
> >>> from pyspark.sql.functions import udf
> >>> from pyspark.sql.types import IntegerType
> >>> random_udf = udf(lambda: random.randint(0, 100), 
> >>> IntegerType()).asNondeterministic()
> >>> newRandom_udf = spark.catalog.registerUDF("random_udf", random_udf)
> >>> spark.sql("SELECT random_udf()").collect()  
> [Row(random_udf()=82)]
> >>> spark.range(1).select(newRandom_udf()).collect()  
> [Row(random_udf()=62)]
> >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
> >>> @pandas_udf("integer", PandasUDFType.SCALAR)  
> ... def add_one(x):
> ... return x + 1
> ...
> >>> _ = spark.udf.registerUDF("add_one", add_one)  
> >>> spark.sql("SELECT add_one(id) FROM range(10)").collect()  
> {noformat}






[jira] [Commented] (SPARK-23084) Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark

2018-01-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333860#comment-16333860
 ] 

Xiao Li commented on SPARK-23084:
-

Yeah. Please go ahead.

> Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark 
> ---
>
> Key: SPARK-23084
> URL: https://issues.apache.org/jira/browse/SPARK-23084
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Add the new APIs (introduced by https://github.com/apache/spark/pull/18814) 
> to PySpark. Also update the rangeBetween API
> {noformat}
> /**
>  * Window function: returns the special frame boundary that represents the 
> first row in the
>  * window partition.
>  *
>  * @group window_funcs
>  * @since 2.3.0
>  */
>  def unboundedPreceding(): Column = Column(UnboundedPreceding)
> /**
>  * Window function: returns the special frame boundary that represents the 
> last row in the
>  * window partition.
>  *
>  * @group window_funcs
>  * @since 2.3.0
>  */
>  def unboundedFollowing(): Column = Column(UnboundedFollowing)
> /**
>  * Window function: returns the special frame boundary that represents the 
> current row in the
>  * window partition.
>  *
>  * @group window_funcs
>  * @since 2.3.0
>  */
>  def currentRow(): Column = Column(CurrentRow)
> {noformat}






[jira] [Comment Edited] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2018-01-21 Thread Gaurav Garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333857#comment-16333857
 ] 

Gaurav Garg edited comment on SPARK-18016 at 1/22/18 4:27 AM:
--

In sequence,

{code}
df
  .groupBy("col1")
  .pivot("col2")
  .agg(count("col2"))
  .na.fill(0)
  .drop("col1")
{code}

Each of "col1" and "col2" has around 80K distinct entries, which makes the 
pivoted dataframe roughly 80K * 80K in size. After this, when I try to 
save or otherwise manipulate this dataframe, I get the following error:


*Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
 has grown past JVM limit of 0x*
 *at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)*

 


was (Author: gaurav.garg):
In sequence,

{code}
df
  .groupBy("col1")
  .pivot("col2")
  .agg(count("col2"))
  .na.fill(0)
  .drop("col1")
{code}

Each of "col1" and "col2" has around 80K distinct entries, which makes the 
pivoted dataframe roughly 80K * 80K in size. After this, while I try to 
save or otherwise manipulate this dataframe, I get the following error:


*Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
 has grown past JVM limit of 0x*
 *at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)*

 

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at 

[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2018-01-21 Thread Gaurav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333857#comment-16333857
 ] 

Gaurav commented on SPARK-18016:


In sequence,

{code}
df
  .groupBy("col1")
  .pivot("col2")
  .agg(count("col2"))
  .na.fill(0)
  .drop("col1")
{code}

Each of "col1" and "col2" has around 80K distinct entries, which makes the 
pivoted dataframe roughly 80K * 80K in size. After this, while I try to 
save or otherwise manipulate this dataframe, I get the following error:


*Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
 has grown past JVM limit of 0x*
 *at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)*

 

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at 

[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-01-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333855#comment-16333855
 ] 

Hyukjin Kwon commented on SPARK-23173:
--

I believe this one is related to SPARK-17763. My first try was option 2 (roughly 1.5 
years ago).

The root cause seems to be the same - I just double-checked that the Jackson parser 
produces {{null}} regardless of the nullability.

Just FYI, there was a related discussion in SPARK-16472 too. If I understood 
correctly, the preference there was roughly to leave the fields nullable rather 
than fail at runtime.

So, +1 for option 1 from me, given the past discussion; it seems the simplest and 
least invasive.

cc [~cloud_fan] too who was in the discussion of SPARK-16472.
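
For what it's worth, a minimal spark-shell sketch of the behaviour described in the 
issue (a hypothetical illustration, not the fix):

{code}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Schema that claims "a" is non-nullable.
val schema = new StructType().add("a", IntegerType, nullable = false)

// Jackson happily parses the explicit null, and JsonToStructs does not
// re-check nullability, so the resulting struct carries a null "a".
Seq("""{"a": null}""").toDF("value")
  .select(from_json($"value", schema).as("parsed"))
  .selectExpr("parsed.a")
  .show()
{code}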

> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-23173
> URL: https://issues.apache.org/jira/browse/SPARK-23173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Herman van Hovell
>Priority: Major
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check if a resulting struct respects 
> the nullability of the schema. This leads to very weird problems in consuming 
> expressions. In our case parquet writing would produce an illegal parquet 
> file.
> There are roughly two solutions here:
>  # Assume that each field in schema passed to {{from_json}} is nullable, and 
> ignore the nullability information set in the passed schema.
>  # Validate the object during runtime, and fail execution if the data is null 
> where we are not expecting this.
> I currently am slightly in favor of option 1, since this is the more 
> performant option and a lot easier to do.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22808) saveAsTable() should be marked as deprecated

2018-01-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22808.
-
Resolution: Duplicate

> saveAsTable() should be marked as deprecated
> 
>
> Key: SPARK-22808
> URL: https://issues.apache.org/jira/browse/SPARK-22808
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: Jason Vaccaro
>Priority: Major
>
> As discussed in SPARK-16803, saveAsTable is not supported as a method for 
> writing to Hive and insertInto should be used instead. However, on the java 
> api documentation for version 2.1.1, the saveAsTable method is not marked as 
> deprecated and the programming guides indicate that saveAsTable is the proper 
> way to write to Hive. 
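
For illustration, a minimal sketch of the two calls being contrasted (the table name 
{{my_db.my_table}} is made up and assumed to already exist in Hive):

{code}
// Recommended in SPARK-16803 for writing into an existing Hive table:
df.write.mode("append").insertInto("my_db.my_table")

// Still shown by the programming guide without a deprecation note:
df.write.mode("overwrite").saveAsTable("my_db.my_table")
{code}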



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333755#comment-16333755
 ] 

Hyukjin Kwon edited comment on SPARK-23114 at 1/22/18 3:05 AM:
---

[~felixcheung], I may have misunderstood, but do you mean we should have an actual 
explicit test case that always reproduces the issue in SPARK-21093, because we 
were unable to include such a test case in the fix for SPARK-21093?


was (Author: hyukjin.kwon):
[~felixcheung], I maybe misunderstood but you mean if we can have an actual 
explicit test case that always reproduces the issue in SPARK-21093 because we 
are unable to have the test case in the fix for SPARK-21093?

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23050) Structured Streaming with S3 file source duplicates data because of eventual consistency.

2018-01-21 Thread Yash Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333809#comment-16333809
 ] 

Yash Sharma commented on SPARK-23050:
-

Hi [~ste...@apache.org], Thanks for bringing this great discussion on the 
ticket.

I will turn on the debug logs and provide more feedback soon. Until then, here are 
the details of the job:
 * Instance type: The instances are fairly large currently (r3.8xlarge: 64 vCPUs, 
244 GB memory), but I have seen the issue on smaller instances as well.
 * Data processed: input ~110 GB, output ~2 GB per day on streaming. The data 
volume is usually uniform and there is no load on the machines. Output files 
written: ~1600.
 * Smaller data: I suspect the issue will occur less often with smaller data, but I 
need to confirm that. Currently the error is seen in about 5% of runs at the 
current data load; not all runs have the issue. The ratio of duplicated files is 
also very low - 3/1600. I strongly suspect that a lower volume would not reproduce 
the issue.

> Structured Streaming with S3 file source duplicates data because of eventual 
> consistency.
> -
>
> Key: SPARK-23050
> URL: https://issues.apache.org/jira/browse/SPARK-23050
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yash Sharma
>Priority: Major
>
> Spark Structured streaming with S3 file source duplicates data because of 
> eventual consistency.
> Reproducing the scenario -
> - Structured streaming reading from S3 source. Writing back to S3.
> - Spark tries to commitTask on completion of a task, by verifying if all the 
> files have been written to Filesystem. 
> {{ManifestFileCommitProtocol.commitTask}}.
> - [Eventual consistency issue] Spark finds that the file is not present and 
> fails the task. {{org.apache.spark.SparkException: Task failed while writing 
> rows. No such file or directory 
> 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet'}}
> - By this time S3 eventually gets the file.
> - Spark reruns the task and completes the task, but gets a new file name this 
> time. {{ManifestFileCommitProtocol.newTaskTempFile. 
> part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet.}}
> - Data duplicates in results and the same data is processed twice and written 
> to S3.
> - There is no data duplication if spark is able to list presence of all 
> committed files and all tasks succeed.
> Code:
> {code}
> query = selected_df.writeStream \
> .format("parquet") \
> .option("compression", "snappy") \
> .option("path", "s3://path/data/") \
> .option("checkpointLocation", "s3://path/checkpoint/") \
> .start()
> {code}
> Same sized duplicate S3 Files:
> {code}
> $ aws s3 ls s3://path/data/ | grep part-00256
> 2018-01-11 03:37:00  17070 
> part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet
> 2018-01-11 03:37:10  17070 
> part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet
> {code}
> Exception on S3 listing and task failure:
> {code}
> [Stage 5:>(277 + 100) / 
> 597]18/01/11 03:36:59 WARN TaskSetManager: Lost task 256.0 in stage 5.0 (TID  
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>  Caused by: java.io.FileNotFoundException: No such file or directory 
> 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet'
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:816)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:509)
>   at 
> org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol$$anonfun$4.apply(ManifestFileCommitProtocol.scala:109)
>   at 
> 

[jira] [Commented] (SPARK-20129) JavaSparkContext should use SparkContext.getOrCreate

2018-01-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333808#comment-16333808
 ] 

Apache Spark commented on SPARK-20129:
--

User 'rekhajoshm' has created a pull request for this issue:
https://github.com/apache/spark/pull/20347

> JavaSparkContext should use SparkContext.getOrCreate
> 
>
> Key: SPARK-20129
> URL: https://issues.apache.org/jira/browse/SPARK-20129
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> It should re-use an existing SparkContext if there is a live one.
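
For reference, a small sketch of the getOrCreate behaviour the issue asks the Java 
API to mirror (illustrative only):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Returns the currently active SparkContext if one exists,
// otherwise creates a new one from the supplied conf.
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("getOrCreate-example"))
{code}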



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-01-21 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-23173:
--
Description: 
The {{from_json}} function uses a schema to convert a string into a Spark SQL 
struct. This schema can contain non-nullable fields. The underlying 
{{JsonToStructs}} expression does not check if a resulting struct respects the 
nullability of the schema. This leads to very weird problems in consuming 
expressions. In our case parquet writing would produce an illegal parquet file.

There are roughly two solutions here:
 # Assume that each field in schema passed to {{from_json}} is nullable, and 
ignore the nullability information set in the passed schema.
 # Validate the object during runtime, and fail execution if the data is null 
where we are not expecting this.
I currently am slightly in favor of option 1, since this is the more performant 
option and a lot easier to do.

WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]

  was:
The {{from_json}} function uses a schema to convert a string into a Spark SQL 
struct. This schema can contain non-nullable fields. The underlying 
{{JsonToStructs}} expression does not check if a resulting struct respects the 
nullability of the schema. This leads to very weird problems in consuming 
expressions. In our case parquet writing would produce an illegal parquet file.

There are roughly two solutions here:
 # Assume that each field in schema passed to {{from_json}} is nullable, and 
ignore the nullability information set in the passed schema.
 # Validate the object during runtime, and fail execution if the data is null 
where we are not expecting this.
I currently am slightly in favor of option 1, since this is the more performant 
option and a lot easier to do.

WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]]


> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-23173
> URL: https://issues.apache.org/jira/browse/SPARK-23173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Herman van Hovell
>Priority: Major
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check if a resulting struct respects 
> the nullability of the schema. This leads to very weird problems in consuming 
> expressions. In our case parquet writing would produce an illegal parquet 
> file.
> There are roughly two solutions here:
>  # Assume that each field in schema passed to {{from_json}} is nullable, and 
> ignore the nullability information set in the passed schema.
>  # Validate the object during runtime, and fail execution if the data is null 
> where we are not expecting this.
> I currently am slightly in favor of option 1, since this is the more 
> performant option and a lot easier to do.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-01-21 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-23173:
--
Description: 
The {{from_json}} function uses a schema to convert a string into a Spark SQL 
struct. This schema can contain non-nullable fields. The underlying 
{{JsonToStructs}} expression does not check if a resulting struct respects the 
nullability of the schema. This leads to very weird problems in consuming 
expressions. In our case parquet writing would produce an illegal parquet file.

There are roughly two solutions here:
 # Assume that each field in schema passed to {{from_json}} is nullable, and 
ignore the nullability information set in the passed schema.
 # Validate the object during runtime, and fail execution if the data is null 
where we are not expecting this.
I currently am slightly in favor of option 1, since this is the more performant 
option and a lot easier to do.

WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]]

  was:
The {{from_json}} function uses a schema to convert a string into a Spark SQL 
struct. This schema can contain non-nullable fields. The underlying 
{{JsonToStructs}} expression does not check if a resulting struct respects the 
nullability of the schema. This leads to very weird problems in consuming 
expressions. In our case parquet writing would produce an illegal parquet file.

There are roughly two solutions here:
 # Assume that each field in schema passed to {{from_json}} is nullable, and 
ignore the nullability information set in the passed schema.
 # Validate the object during runtime, and fail execution if the data is null 
where we are not expecting this.
I currently am slightly in favor of option 1, since this is the more performant 
option and a lot easier to do. WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] 
[~brkyvz]]


> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-23173
> URL: https://issues.apache.org/jira/browse/SPARK-23173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Herman van Hovell
>Priority: Major
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check if a resulting struct respects 
> the nullability of the schema. This leads to very weird problems in consuming 
> expressions. In our case parquet writing would produce an illegal parquet 
> file.
> There are roughly two solutions here:
>  # Assume that each field in schema passed to {{from_json}} is nullable, and 
> ignore the nullability information set in the passed schema.
>  # Validate the object during runtime, and fail execution if the data is null 
> where we are not expecting this.
> I currently am slightly in favor of option 1, since this is the more 
> performant option and a lot easier to do.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23175) Type conversion does not make sense under case like select ’0.1’ = 0

2018-01-21 Thread Shaoquan Zhang (JIRA)
Shaoquan Zhang created SPARK-23175:
--

 Summary: Type conversion does not make sense under case like 
select ’0.1’ = 0
 Key: SPARK-23175
 URL: https://issues.apache.org/jira/browse/SPARK-23175
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Shaoquan Zhang


The SQL query {{select '0.1' = 0}} returns true. The result seems unreasonable.

From the logical plan, the SQL is parsed as 'Project [(cast(cast(0.1 as 
decimal(20,0)) as int) = 0) AS #6]'. The type conversion converts the string 
to an integer, which leads to the unreasonable result.
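
A quick spark-shell reproduction of the comparison described above (a sketch; the 
exact plan output may vary by version):

{code}
// Per the report this returns true rather than false.
spark.sql("select '0.1' = 0").show()

// Prints the analyzed plan containing the string-to-decimal-to-int cast described above.
spark.sql("select '0.1' = 0").explain(true)
{code}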



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23174) Fix pep8 to latest official version

2018-01-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23174:


Assignee: Apache Spark

> Fix pep8 to latest official version
> ---
>
> Key: SPARK-23174
> URL: https://issues.apache.org/jira/browse/SPARK-23174
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rekha Joshi
>Assignee: Apache Spark
>Priority: Trivial
>
> As per discussion with [~hyukjin.kwon] , this Jira to fix python code style 
> to latest official version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23174) Fix pep8 to latest official version

2018-01-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333793#comment-16333793
 ] 

Apache Spark commented on SPARK-23174:
--

User 'rekhajoshm' has created a pull request for this issue:
https://github.com/apache/spark/pull/20338

> Fix pep8 to latest official version
> ---
>
> Key: SPARK-23174
> URL: https://issues.apache.org/jira/browse/SPARK-23174
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rekha Joshi
>Priority: Trivial
>
> As per discussion with [~hyukjin.kwon] , this Jira to fix python code style 
> to latest official version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23174) Fix pep8 to latest official version

2018-01-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23174:


Assignee: (was: Apache Spark)

> Fix pep8 to latest official version
> ---
>
> Key: SPARK-23174
> URL: https://issues.apache.org/jira/browse/SPARK-23174
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rekha Joshi
>Priority: Trivial
>
> As per discussion with [~hyukjin.kwon] , this Jira to fix python code style 
> to latest official version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23174) Fix pep8 to latest official version

2018-01-21 Thread Rekha Joshi (JIRA)
Rekha Joshi created SPARK-23174:
---

 Summary: Fix pep8 to latest official version
 Key: SPARK-23174
 URL: https://issues.apache.org/jira/browse/SPARK-23174
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.2.1
Reporter: Rekha Joshi


As per discussion with [~hyukjin.kwon] , this Jira to fix python code style to 
latest official version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22320) ORC should support VectorUDT/MatrixUDT

2018-01-21 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333787#comment-16333787
 ] 

Dongjoon Hyun edited comment on SPARK-22320 at 1/22/18 2:18 AM:


With the above workaround, I think this seems to be a minor issue.

I tested the workaround at 2.1.2 and 2.2.1.


was (Author: dongjoon):
With the above workaround, I think this seems to be a minor issue.

> ORC should support VectorUDT/MatrixUDT
> --
>
> Key: SPARK-22320
> URL: https://issues.apache.org/jira/browse/SPARK-22320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> I save dataframe containing vectors in ORC format, when I read it back, the 
> format is changed.
> {code}
> scala> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.linalg._
> scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, 
> Array(4), Array(1.0))))
> data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), 
> (2,(8,[4],[1.0])))
> scala> val df = data.toDF("i", "vec")
> df: org.apache.spark.sql.DataFrame = [i: int, vec: vector]
> scala> df.schema
> res0: org.apache.spark.sql.types.StructType = 
> StructType(StructField(i,IntegerType,false), 
> StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
> scala> df.write.orc("/tmp/123")
> scala> val df2 = spark.sqlContext.read.orc("/tmp/123")
> df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct<type: tinyint, size: int ... 2 more fields>]
> scala> df2.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(i,IntegerType,true), 
> StructField(vec,StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true)),true))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22320) ORC should support VectorUDT/MatrixUDT

2018-01-21 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22320:
--
Priority: Minor  (was: Major)

> ORC should support VectorUDT/MatrixUDT
> --
>
> Key: SPARK-22320
> URL: https://issues.apache.org/jira/browse/SPARK-22320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> I save dataframe containing vectors in ORC format, when I read it back, the 
> format is changed.
> {code}
> scala> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.linalg._
> scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, 
> Array(4), Array(1.0))))
> data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), 
> (2,(8,[4],[1.0])))
> scala> val df = data.toDF("i", "vec")
> df: org.apache.spark.sql.DataFrame = [i: int, vec: vector]
> scala> df.schema
> res0: org.apache.spark.sql.types.StructType = 
> StructType(StructField(i,IntegerType,false), 
> StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
> scala> df.write.orc("/tmp/123")
> scala> val df2 = spark.sqlContext.read.orc("/tmp/123")
> df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct<type: tinyint, size: int ... 2 more fields>]
> scala> df2.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(i,IntegerType,true), 
> StructField(vec,StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true)),true))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11222) Add style checker rules to validate doc tests aren't included in docs

2018-01-21 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333786#comment-16333786
 ] 

Rekha Joshi commented on SPARK-11222:
-

I have raised the doctest blank line as an 
[issue|https://github.com/PyCQA/pycodestyle/issues/723] under pep8, as it does 
not belong within Spark.

> Add style checker rules to validate doc tests aren't included in docs
> -
>
> Key: SPARK-11222
> URL: https://issues.apache.org/jira/browse/SPARK-11222
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: holdenk
>Priority: Trivial
>
> Add style checker test to make sure we have the blank line before starting 
> doctests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT

2018-01-21 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333787#comment-16333787
 ] 

Dongjoon Hyun commented on SPARK-22320:
---

With the above workaround, I think this seems to be a minor issue.

> ORC should support VectorUDT/MatrixUDT
> --
>
> Key: SPARK-22320
> URL: https://issues.apache.org/jira/browse/SPARK-22320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> I save dataframe containing vectors in ORC format, when I read it back, the 
> format is changed.
> {code}
> scala> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.linalg._
> scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, 
> Array(4), Array(1.0))))
> data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), 
> (2,(8,[4],[1.0])))
> scala> val df = data.toDF("i", "vec")
> df: org.apache.spark.sql.DataFrame = [i: int, vec: vector]
> scala> df.schema
> res0: org.apache.spark.sql.types.StructType = 
> StructType(StructField(i,IntegerType,false), 
> StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
> scala> df.write.orc("/tmp/123")
> scala> val df2 = spark.sqlContext.read.orc("/tmp/123")
> df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct<type: tinyint, size: int ... 2 more fields>]
> scala> df2.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(i,IntegerType,true), 
> StructField(vec,StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true)),true))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22320) ORC should support VectorUDT/MatrixUDT

2018-01-21 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333784#comment-16333784
 ] 

Dongjoon Hyun commented on SPARK-22320:
---

For this one, Parquet saves the original schema as a metadata field, 
`org.apache.spark.sql.parquet.row.attributes`; see 
[here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L111-L113].
{code}
// We want to clear this temporary metadata from saving into Parquet file.
// This metadata is only useful for detecting optional columns when 
pushdowning filters.
ParquetWriteSupport.setSchema(dataSchema, conf)
{code}

For the other formats like JSON/ORC, Spark doesn't have an 
`org.apache.spark.sql.(json/orc).row.attributes` equivalent. BTW, since the comment 
is wrong, we had better fix it via 
[PR-20346|https://github.com/apache/spark/pull/20346].

As a workaround, if a user supplies the correct schema, there is no problem reading 
back both the data and the schema.
{code}
scala> import org.apache.spark.ml.linalg._
scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, 
Array(4), Array(1.0))))
data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), 
(2,(8,[4],[1.0])))

scala> val df = data.toDF("i", "vec")
scala> df.write.json("/tmp/json")

scala> spark.read.schema(df.schema).json("/tmp/json").show()
+---+-+
|  i|  vec|
+---+-+
|  2|(8,[4],[1.0])|
|  1|[1.0,2.0]|
+---+-+

scala> df.write.orc("/tmp/orc")
scala> spark.read.schema(df.schema).orc("/tmp/orc").show()
+---+-+
|  i|  vec|
+---+-+
|  1|[1.0,2.0]|
|  2|(8,[4],[1.0])|
+---+-+

scala> spark.read.schema(df.schema).orc("/tmp/orc").printSchema
root
 |-- i: integer (nullable = true)
 |-- vec: vector (nullable = true)
{code}

> ORC should support VectorUDT/MatrixUDT
> --
>
> Key: SPARK-22320
> URL: https://issues.apache.org/jira/browse/SPARK-22320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> I save dataframe containing vectors in ORC format, when I read it back, the 
> format is changed.
> {code}
> scala> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.linalg._
> scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, 
> Array(4), Array(1.0))))
> data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), 
> (2,(8,[4],[1.0])))
> scala> val df = data.toDF("i", "vec")
> df: org.apache.spark.sql.DataFrame = [i: int, vec: vector]
> scala> df.schema
> res0: org.apache.spark.sql.types.StructType = 
> StructType(StructField(i,IntegerType,false), 
> StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
> scala> df.write.orc("/tmp/123")
> scala> val df2 = spark.sqlContext.read.orc("/tmp/123")
> df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct<type: tinyint, size: int ... 2 more fields>]
> scala> df2.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(i,IntegerType,true), 
> StructField(vec,StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true)),true))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20947) Encoding/decoding issue in PySpark pipe implementation

2018-01-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-20947:


Assignee: Xiaozhe Wang

> Encoding/decoding issue in PySpark pipe implementation
> --
>
> Key: SPARK-20947
> URL: https://issues.apache.org/jira/browse/SPARK-20947
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 
> 2.1.1
>Reporter: Xiaozhe Wang
>Assignee: Xiaozhe Wang
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: fix-pipe-encoding-error.patch
>
>
> The pipe action converts objects into strings in a way that is affected by the 
> default encoding setting of the Python environment.
> Here is the related code fragment (L717-721@python/pyspark/rdd.py):
> {code}
> def pipe_objs(out):
> for obj in iterator:
> s = str(obj).rstrip('\n') + '\n'
> out.write(s.encode('utf-8'))
> out.close()
> {code}
> The `str(obj)` part implicitly convert `obj` to an unicode string, then 
> encode it into a byte string using default encoding; On the other hand, the 
> `s.encode('utf-8')` part implicitly decode `s` into an unicode string using 
> default encoding and then encode it (AGAIN!) into a UTF-8 encoded byte string.
> Typically the default encoding of Python environment would be 'ascii', which 
> means passing  an unicode string containing characters beyond 'ascii' charset 
> will raise UnicodeEncodeError exception at `str(obj)` and passing a byte 
> string containing bytes greater than 128 will again raise UnicodeEncodeError 
> exception at 's.encode('utf-8')`.
> Changing `str(obj)` to `unicode(obj)` would eliminate these problems.
> The following code snippet reproduces these errors:
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.6.3
>   /_/
> Using Python version 2.7.12 (default, Jul 25 2016 15:06:45)
> SparkContext available as sc, HiveContext available as sqlContext.
> >>> sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
> [Stage 0:>  (0 + 4) / 
> 4]Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 719, in pipe_objs
> s = str(obj).rstrip('\n') + '\n'
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: 
> ordinal not in range(128)
> >>>
> >>> sc.parallelize([u'\u6d4b\u8bd5']).map(lambda x: 
> >>> x.encode('utf-8')).pipe('cat').collect()
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 720, in pipe_objs
> out.write(s.encode('utf-8'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: 
> ordinal not in range(128)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20947) Encoding/decoding issue in PySpark pipe implementation

2018-01-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20947.
--
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/18277

> Encoding/decoding issue in PySpark pipe implementation
> --
>
> Key: SPARK-20947
> URL: https://issues.apache.org/jira/browse/SPARK-20947
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 
> 2.1.1
>Reporter: Xiaozhe Wang
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: fix-pipe-encoding-error.patch
>
>
> The pipe action converts objects into strings in a way that is affected by the 
> default encoding setting of the Python environment.
> Here is the related code fragment (L717-721@python/pyspark/rdd.py):
> {code}
> def pipe_objs(out):
> for obj in iterator:
> s = str(obj).rstrip('\n') + '\n'
> out.write(s.encode('utf-8'))
> out.close()
> {code}
> The `str(obj)` part implicitly convert `obj` to an unicode string, then 
> encode it into a byte string using default encoding; On the other hand, the 
> `s.encode('utf-8')` part implicitly decode `s` into an unicode string using 
> default encoding and then encode it (AGAIN!) into a UTF-8 encoded byte string.
> Typically the default encoding of Python environment would be 'ascii', which 
> means passing  an unicode string containing characters beyond 'ascii' charset 
> will raise UnicodeEncodeError exception at `str(obj)` and passing a byte 
> string containing bytes greater than 128 will again raise UnicodeEncodeError 
> exception at 's.encode('utf-8')`.
> Changing `str(obj)` to `unicode(obj)` would eliminate these problems.
> The following code snippet reproduces these errors:
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.6.3
>   /_/
> Using Python version 2.7.12 (default, Jul 25 2016 15:06:45)
> SparkContext available as sc, HiveContext available as sqlContext.
> >>> sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
> [Stage 0:>  (0 + 4) / 
> 4]Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 719, in pipe_objs
> s = str(obj).rstrip('\n') + '\n'
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: 
> ordinal not in range(128)
> >>>
> >>> sc.parallelize([u'\u6d4b\u8bd5']).map(lambda x: 
> >>> x.encode('utf-8')).pipe('cat').collect()
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 720, in pipe_objs
> out.write(s.encode('utf-8'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: 
> ordinal not in range(128)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20947) Encoding/decoding issue in PySpark pipe implementation

2018-01-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20947:
-
Fix Version/s: 2.4.0

> Encoding/decoding issue in PySpark pipe implementation
> --
>
> Key: SPARK-20947
> URL: https://issues.apache.org/jira/browse/SPARK-20947
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 
> 2.1.1
>Reporter: Xiaozhe Wang
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: fix-pipe-encoding-error.patch
>
>
> The pipe action converts objects into strings in a way that is affected by the 
> default encoding setting of the Python environment.
> Here is the related code fragment (L717-721@python/pyspark/rdd.py):
> {code}
> def pipe_objs(out):
> for obj in iterator:
> s = str(obj).rstrip('\n') + '\n'
> out.write(s.encode('utf-8'))
> out.close()
> {code}
> The `str(obj)` part implicitly convert `obj` to an unicode string, then 
> encode it into a byte string using default encoding; On the other hand, the 
> `s.encode('utf-8')` part implicitly decode `s` into an unicode string using 
> default encoding and then encode it (AGAIN!) into a UTF-8 encoded byte string.
> Typically the default encoding of Python environment would be 'ascii', which 
> means passing  an unicode string containing characters beyond 'ascii' charset 
> will raise UnicodeEncodeError exception at `str(obj)` and passing a byte 
> string containing bytes greater than 128 will again raise UnicodeEncodeError 
> exception at 's.encode('utf-8')`.
> Changing `str(obj)` to `unicode(obj)` would eliminate these problems.
> The following code snippet reproduces these errors:
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.6.3
>   /_/
> Using Python version 2.7.12 (default, Jul 25 2016 15:06:45)
> SparkContext available as sc, HiveContext available as sqlContext.
> >>> sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
> [Stage 0:>  (0 + 4) / 
> 4]Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 719, in pipe_objs
> s = str(obj).rstrip('\n') + '\n'
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: 
> ordinal not in range(128)
> >>>
> >>> sc.parallelize([u'\u6d4b\u8bd5']).map(lambda x: 
> >>> x.encode('utf-8')).pipe('cat').collect()
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 720, in pipe_objs
> out.write(s.encode('utf-8'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: 
> ordinal not in range(128)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-01-21 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-23173:
-

 Summary: from_json can produce nulls for fields which are marked 
as non-nullable
 Key: SPARK-23173
 URL: https://issues.apache.org/jira/browse/SPARK-23173
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Herman van Hovell


The {{from_json}} function uses a schema to convert a string into a Spark SQL 
struct. This schema can contain non-nullable fields. The underlying 
{{JsonToStructs}} expression does not check if a resulting struct respects the 
nullability of the schema. This leads to very weird problems in consuming 
expressions. In our case parquet writing would produce an illegal parquet file.

There are roughly two solutions here:
 # Assume that each field in schema passed to {{from_json}} is nullable, and 
ignore the nullability information set in the passed schema.
 # Validate the object during runtime, and fail execution if the data is null 
where we are not expecting this.
I currently am slightly in favor of option 1, since this is the more performant 
option and a lot easier to do. WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] 
[~brkyvz]]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23169) Run lintr on the changes of lint-r script and .lintr configuration

2018-01-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23169.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20339
[https://github.com/apache/spark/pull/20339]

> Run lintr on the changes of lint-r script and .lintr configuration
> --
>
> Key: SPARK-23169
> URL: https://issues.apache.org/jira/browse/SPARK-23169
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.0
>
>
> When running the {{run-tests}} script, seems we don't run lintr on the 
> changes of lint-r script and .lintr configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23169) Run lintr on the changes of lint-r script and .lintr configuration

2018-01-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23169:


Assignee: Hyukjin Kwon

> Run lintr on the changes of lint-r script and .lintr configuration
> --
>
> Key: SPARK-23169
> URL: https://issues.apache.org/jira/browse/SPARK-23169
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.0
>
>
> When running the {{run-tests}} script, seems we don't run lintr on the 
> changes of lint-r script and .lintr configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333755#comment-16333755
 ] 

Hyukjin Kwon commented on SPARK-23114:
--

[~felixcheung], I maybe misunderstood but you mean if we can have an actual 
explicit test case that always reproduces the issue in SPARK-21093 because we 
are unable to have the test case in the fix for SPARK-21093?

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333712#comment-16333712
 ] 

Felix Cheung edited comment on SPARK-23107 at 1/21/18 11:08 PM:


We don't have doc on RFormula but it'll be good idea to add now and also to 
allow for documenting changes like 
h1. SPARK-20619 
h1. SPARK-20899

in a language independent way


was (Author: felixcheung):
We don't have doc on RFormula but it'll be good idea to also allow for 
documenting changes like 
h1. SPARK-20619 
h1. SPARK-20899

in a language independent way

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333733#comment-16333733
 ] 

Felix Cheung commented on SPARK-21727:
--

how are we doing?

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil Alexander McQuarrie
>Assignee: Neil Alexander McQuarrie
>Priority: Major
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333730#comment-16333730
 ] 

Felix Cheung edited comment on SPARK-23114 at 1/21/18 11:03 PM:


[~falaki] [~hyukjin.kwon]

About SPARK-21093, do you think you could test with real data and a real workload, 
for long-haul runs, heavy load, or many short/bursty tasks?

 


was (Author: felixcheung):
[~falaki] [~hyukjin.kwon]

About SPARK-21093, do you think you could test with real data and a real workload, 
for long-haul runs, heavy load, or many tasks?

 

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333725#comment-16333725
 ] 

Felix Cheung edited comment on SPARK-23114 at 1/21/18 11:02 PM:


[~sameerag]

Here are some ideas for the release notes (that go to spark-website in the 
announcements)

For SparkR, new in 2.3.0:

SQL changes:

SQL functions, cubing & nested structure

collect_list, collect_set, split_string, repeat_string, rollup, cube
 explode_outer, posexplode_outer, %<=>%, !, not, create_array, create_map, 
grouping_bit, grouping_id
 input_file_name, alias, trunc, date_trunc, map_keys, map_values, current_date, 
current_timestamp, trim/trimString,
 dayofweek, unionByName,

to_json (map or array of maps)

Data Source -  multiLine (json/csv)

 

ML changes:

Decision Tree (regression and classification)

Constrained Logistic Regression
 offset in SparkR GLM [https://github.com/apache/spark/pull/18831]
 stringIndexerOrderType
 handleInvalid (spark.svmLinear, spark.logit, spark.mlp, spark.naiveBayes, 
spark.gbt, spark.decisionTree, spark.randomForest)

 

SS changes:

Structured Streaming API for withWatermark, trigger (once, processingTime), 
partitionBy

stream-stream join

 

Documentation:

major overhaul and simplification of API doc for SQL functions
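
A minimal sketch of a couple of the items above, on made-up data (assumes a local SparkR 2.3.0 session; the data frame and column names are illustrative only):

{code}
library(SparkR)
sparkR.session(master = "local[*]")

df <- createDataFrame(data.frame(k = c("a", "a", "b"), v = c(1, 2, 3)))

# collect_list as a grouped aggregate (new to SparkR in 2.3.0)
head(agg(groupBy(df, df$k), collect_list(df$v)))

# create_map + to_json; to_json on map columns is new in 2.3.0
head(select(df, to_json(create_map(df$k, df$v))))
{code}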

 


was (Author: felixcheung):
[~sameerag]

Here are some ideas for the release notes (that go to spark-website in the 
announcements)

For SparkR, new in 2.3.0:

SQL changes:

SQL functions, cubing & nested structure

collect_list, collect_set, split_string, repeat_string, rollup, cube
 explode_outer, posexplode_outer, %<=>%, !, not, create_array, create_map, 
grouping_bit, grouping_id
 input_file_name, alias, trunc, date_trunc, map_keys, map_values, current_date, 
current_timestamp, trim/trimString,
 dayofweek, unionByName,

to_json (map or array of maps)

Data Source -  multiLine (json/csv)

 

ML changes:

Decision Tree (regression and classification)

Constrained Logistic Regression
offset in SparkR GLM https://github.com/apache/spark/pull/18831
stringIndexerOrderType
handleInvalid (spark.svmLinear, spark.logit, spark.mlp, spark.naiveBayes, 
spark.gbt, spark.decisionTree, spark.randomForest)

 

SS changes:

Structured Streaming API for withWatermark, trigger (once, processingTime), 
partitionBy

stream-stream join

 

Documentation:

major overhaul and simplification of API doc

 

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333730#comment-16333730
 ] 

Felix Cheung commented on SPARK-23114:
--

[~falaki] [~hyukjin.kwon]

About SPARK-21093, do you think you could test with real data and a real workload, 
for long-haul runs, heavy load, or many tasks?

 

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333725#comment-16333725
 ] 

Felix Cheung commented on SPARK-23114:
--

[~sameerag]

Here are some ideas for the release notes (that go to spark-website in the 
announcements)

For SparkR, new in 2.3.0:

SQL changes:

SQL functions, cubing & nested structure

collect_list, collect_set, split_string, repeat_string, rollup, cube
 explode_outer, posexplode_outer, %<=>%, !, not, create_array, create_map, 
grouping_bit, grouping_id
 input_file_name, alias, trunc, date_trunc, map_keys, map_values, current_date, 
current_timestamp, trim/trimString,
 dayofweek, unionByName,

to_json (map or array of maps)

Data Source -  multiLine (json/csv)

 

ML changes:

Decision Tree (regression and classification)

Constrained Logistic Regression
offset in SparkR GLM https://github.com/apache/spark/pull/18831
stringIndexerOrderType
handleInvalid (spark.svmLinear, spark.logit, spark.mlp, spark.naiveBayes, 
spark.gbt, spark.decisionTree, spark.randomForest)

 

SS changes:

Structured Streaming API for withWatermark, trigger (once, processingTime), 
partitionBy

stream-stream join

 

Documentation:

major overhaul and simplification of API doc

 

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23117) SparkR 2.3 QA: Check for new R APIs requiring example code

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333718#comment-16333718
 ] 

Felix Cheung edited comment on SPARK-23117 at 1/21/18 10:47 PM:


I did a pass; I think these could use an example, preferably a bit more detailed 
one:

SPARK-20307

SPARK-21381

 

Others:

Constrained Logistic Regression - SPARK-20906 - should go to ML guide

stringIndexerOrderType - SPARK-20619 SPARK-14659 SPARK-20899 - should have 
RFormula in ML guide


was (Author: felixcheung):
I did a pass; I think these could use an example, preferably a bit more detailed 
one:

SPARK-20307

SPARK-21381

 

Others:

Constrained Logistic Regression - SPARK-20906 - should go to ML guide

stringIndexerOrderType - SPARK-20899 - should have RFormula in ML guide

> SparkR 2.3 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-23117
> URL: https://issues.apache.org/jira/browse/SPARK-23117
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23116) SparkR 2.3 QA: Update user guide for new features & APIs

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333717#comment-16333717
 ] 

Felix Cheung commented on SPARK-23116:
--

I did a pass.

> SparkR 2.3 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-23116
> URL: https://issues.apache.org/jira/browse/SPARK-23116
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.3.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23116) SparkR 2.3 QA: Update user guide for new features & APIs

2018-01-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23116.
--
   Resolution: Fixed
 Assignee: Felix Cheung
Fix Version/s: 2.3.0

> SparkR 2.3 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-23116
> URL: https://issues.apache.org/jira/browse/SPARK-23116
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.3.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23117) SparkR 2.3 QA: Check for new R APIs requiring example code

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333718#comment-16333718
 ] 

Felix Cheung commented on SPARK-23117:
--

I did a pass; I think these could use an example, preferably a bit more detailed 
one:

SPARK-20307

SPARK-21381

 

Others:

Constrained Logistic Regression - SPARK-20906 - should go to ML guide

stringIndexerOrderType - SPARK-20899 - should have RFormula in ML guide

> SparkR 2.3 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-23117
> URL: https://issues.apache.org/jira/browse/SPARK-23117
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333716#comment-16333716
 ] 

Felix Cheung commented on SPARK-20307:
--

I think [~wm624] if you have the time
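
One possible shape for such an example, reusing the traindf / testdf built in the snippet of the issue description below; the handleInvalid parameter and its "keep" value are taken from the 2.3.0 feature list earlier in this digest, so the exact signature should be double-checked:

{code}
# traindf / testdf come from the code snippet in the issue description below
rf <- spark.randomForest(traindf, clicked ~ ., type = "classification",
                         maxDepth = 10, maxBins = 41, numTrees = 100,
                         handleInvalid = "keep")  # assumed: keep unseen factor levels
predictions <- predict(rf, testdf)                # should no longer hit "Unseen label"
head(SparkR::collect(predictions))
{code}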

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> when training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume this is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at 

[jira] [Comment Edited] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333682#comment-16333682
 ] 

Felix Cheung edited comment on SPARK-20307 at 1/21/18 10:40 PM:


Hi Felix,
 I can do that, but I have had a family emergency recently, so it will not happen soon.
 Best
 Joseph

 


was (Author: monday0927!):
Hi Felix,
I can do that, but I have had a family emergency recently, so it will not happen soon.
Best
Joseph

On 1/21/18, 2:45 PM, "Felix Cheung (JIRA)"  wrote:


[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333663#comment-16333663
 ] 

Felix Cheung commented on SPARK-20307:
--

for SPARK-20307 and SPARK-21381, do you think you can write up example on 
how to use them and also a mention in the R programming guide?

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
StringIndexer
> 

>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> when training a model in SparkR with string variables (tested with 
spark.randomForest, but I assume this is valid for all spark.xx functions that apply 
a StringIndexer under the hood), testing on a new dataset with factor levels 
that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid 
on to the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context 
loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", 
"that"), 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 607.0 failed 1 times, most recent failure: Lost task 0.0 in stage 607.0 
(TID 1581, localhost, executor driver): org.apache.spark.SparkException: Failed 
to execute user defined function($anonfun$4: (string) => double)
> at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
> at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 

[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333712#comment-16333712
 ] 

Felix Cheung commented on SPARK-23107:
--

We don't have a doc on RFormula, but it would be a good idea to also allow for 
documenting changes like SPARK-20619 and SPARK-20899 in a language-independent way.

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20906) Constrained Logistic Regression for SparkR

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333654#comment-16333654
 ] 

Felix Cheung edited comment on SPARK-20906 at 1/21/18 10:30 PM:


[~wm624] would you like to add example of this in the API doc?

roxygen2 doc for spark.logit
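
A rough sketch of what such an example could look like (toy data; the bound parameter names follow the SPARK-20906 PR and should be verified against the final roxygen2 doc):

{code}
library(SparkR)
sparkR.session(master = "local[*]")

training <- createDataFrame(data.frame(label = c(0, 0, 1, 1),
                                       f1 = c(0.5, 1.1, 2.8, 3.2),
                                       f2 = c(0.9, 0.2, 2.0, 1.5)))

# binomial case: coefficient bounds are a 1 x numFeatures matrix,
# intercept bounds a length-1 vector
lower <- matrix(c(0, 0), nrow = 1, ncol = 2)
model <- spark.logit(training, label ~ f1 + f2,
                     lowerBoundsOnCoefficients = lower,
                     lowerBoundsOnIntercepts = c(0))
summary(model)
{code}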


was (Author: felixcheung):
[~wm624] would you like to add example of this in the API doc?

> Constrained Logistic Regression for SparkR
> --
>
> Key: SPARK-20906
> URL: https://issues.apache.org/jira/browse/SPARK-20906
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic 
> Regression for ML. We should add it to SparkR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333708#comment-16333708
 ] 

Felix Cheung commented on SPARK-23108:
--

From reviewing R, it would be good to document constrained optimization for 
logistic regression SPARK-20906 (the R guide just links to the ML guide, so we 
should add the doc there)

> ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-23108
> URL: https://issues.apache.org/jira/browse/SPARK-23108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23118) SparkR 2.3 QA: Programming guide, migration guide, vignettes updates

2018-01-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23118.
--
   Resolution: Fixed
 Assignee: Felix Cheung
Fix Version/s: 2.3.0

> SparkR 2.3 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-23118
> URL: https://issues.apache.org/jira/browse/SPARK-23118
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.3.0
>
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")
>  * Update R vignettes
> Note: This task is for large changes to the guides. New features are handled 
> in SPARK-23116.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23118) SparkR 2.3 QA: Programming guide, migration guide, vignettes updates

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333707#comment-16333707
 ] 

Felix Cheung commented on SPARK-23118:
--

for programming guide, perhaps 

SPARK-20906

But it mostly just links to API doc and ML programming guide. Will add a 
comment on ML programming guide instead.

> SparkR 2.3 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-23118
> URL: https://issues.apache.org/jira/browse/SPARK-23118
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.3.0
>
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")
>  * Update R vignettes
> Note: This task is for large changes to the guides. New features are handled 
> in SPARK-23116.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2018-01-21 Thread Ioana Delaney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333703#comment-16333703
 ] 

Ioana Delaney commented on SPARK-19842:
---

The benefit of this work is that it opens up an area of query optimization 
techniques that rely on RI semantics. Some are listed in section 5 of the above 
design document. For the performance numbers, I can refer you to our summit 
talk:

[https://databricks.com/session/informational-referential-integrity-constraints-support-in-apache-spark]

 

> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
>Priority: Major
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint validation, and maintenance. The document shows many examples of 
> query performance improvements that utilize referential integrity constraints 
> and can be implemented in Spark.
> Link to the google doc: 
> [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2018-01-21 Thread Joseph Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333682#comment-16333682
 ] 

Joseph Wang commented on SPARK-20307:
-

Hi Felix,
I can do that, but I have had a family emergency recently, so it will not happen soon.
Best
Joseph

On 1/21/18, 2:45 PM, "Felix Cheung (JIRA)"  wrote:


[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333663#comment-16333663
 ] 

Felix Cheung commented on SPARK-20307:
--

for SPARK-20307 and SPARK-21381, do you think you can write up example on 
how to use them and also a mention in the R programming guide?

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
StringIndexer
> 

>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> when training a model in SparkR with string variables (tested with 
spark.randomForest, but I assume this is valid for all spark.xx functions that apply 
a StringIndexer under the hood), testing on a new dataset with factor levels 
that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid 
on to the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context 
loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", 
"that"), 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 607.0 failed 1 times, most recent failure: Lost task 0.0 in stage 607.0 
(TID 1581, localhost, executor driver): org.apache.spark.SparkException: Failed 
to execute user defined function($anonfun$4: (string) => double)
> at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
> at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 

[jira] [Commented] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0

2018-01-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333678#comment-16333678
 ] 

Sean Owen commented on SPARK-22208:
---

It's a bug fix, and more of a corner case of behavior, so I don't know if it 
must be called out in the release notes as a behavior change, if that's what 
you mean.
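
For concreteness, the corner case from the description below can be reproduced from SparkR (a sketch; the SQL call is what matters): the 10% percentile of 1..10 should be 1, while the pre-fix behavior returned 2.

{code}
library(SparkR)
sparkR.session(master = "local[*]")

df <- createDataFrame(data.frame(v = as.numeric(1:10)))
createOrReplaceTempView(df, "tbl")

# per the issue description: should return 1; before the fix it returned 2
head(sql("SELECT percentile_approx(v, 0.1) FROM tbl"))
{code}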

> Improve percentile_approx by not rounding up targetError and starting from 
> index 0
> --
>
> Key: SPARK-22208
> URL: https://issues.apache.org/jira/browse/SPARK-22208
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> percentile_approx never returns the first element when percentile is in 
> (relativeError, 1/N], where the relativeError default is 1/10000, and N is the 
> total number of elements. But ideally, percentiles in [0, 1/N] should all 
> return the first element as the answer.
> For example, given input data 1 to 10, if a user queries 10% (or even less) 
> percentile, it should return 1, because the first value 1 already reaches 
> 10%. Currently it returns 2.
> Based on the paper, targetError is not rounded up, and searching index should 
> start from 0 instead of 1. By following the paper, we should be able to fix 
> the cases mentioned above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333677#comment-16333677
 ] 

Felix Cheung commented on SPARK-23115:
--

Another pass: we should add API doc for

SPARK-20906

> SparkR 2.3 QA: New R APIs and API docs
> --
>
> Key: SPARK-23115
> URL: https://issues.apache.org/jira/browse/SPARK-23115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20906) Constrained Logistic Regression for SparkR

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333654#comment-16333654
 ] 

Felix Cheung edited comment on SPARK-20906 at 1/21/18 8:54 PM:
---

[~wm624] would you like to add example of this in the API doc?


was (Author: felixcheung):
[~wm624] would you like to add example of this in the R vignettes?

> Constrained Logistic Regression for SparkR
> --
>
> Key: SPARK-20906
> URL: https://issues.apache.org/jira/browse/SPARK-20906
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic 
> Regression for ML. We should add it to SparkR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0

2018-01-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22208:
-
Labels: releasenotes  (was: )

> Improve percentile_approx by not rounding up targetError and starting from 
> index 0
> --
>
> Key: SPARK-22208
> URL: https://issues.apache.org/jira/browse/SPARK-22208
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> percentile_approx never returns the first element when percentile is in 
> (relativeError, 1/N], where the relativeError default is 1/10000, and N is the 
> total number of elements. But ideally, percentiles in [0, 1/N] should all 
> return the first element as the answer.
> For example, given input data 1 to 10, if a user queries 10% (or even less) 
> percentile, it should return 1, because the first value 1 already reaches 
> 10%. Currently it returns 2.
> Based on the paper, targetError is not rounded up, and searching index should 
> start from 0 instead of 1. By following the paper, we should be able to fix 
> the cases mentioned above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333663#comment-16333663
 ] 

Felix Cheung commented on SPARK-20307:
--

for SPARK-20307 and SPARK-21381, do you think you can write up example on how 
to use them and also a mention in the R programming guide?

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> when training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume this is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at 
> 

[jira] [Commented] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333659#comment-16333659
 ] 

Felix Cheung commented on SPARK-22208:
--

Is this documented in the SQL programming guide/ migration guide?

[~ZenWzh]

[~smilegator]

 

> Improve percentile_approx by not rounding up targetError and starting from 
> index 0
> --
>
> Key: SPARK-22208
> URL: https://issues.apache.org/jira/browse/SPARK-22208
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> percentile_approx never returns the first element when percentile is in 
> (relativeError, 1/N], where the relativeError default is 1/10000, and N is the 
> total number of elements. But ideally, percentiles in [0, 1/N] should all 
> return the first element as the answer.
> For example, given input data 1 to 10, if a user queries 10% (or even less) 
> percentile, it should return 1, because the first value 1 already reaches 
> 10%. Currently it returns 2.
> Based on the paper, targetError is not rounded up, and searching index should 
> start from 0 instead of 1. By following the paper, we should be able to fix 
> the cases mentioned above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20906) Constrained Logistic Regression for SparkR

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333654#comment-16333654
 ] 

Felix Cheung commented on SPARK-20906:
--

[~wm624] would you like to add example of this in the R vignettes?

> Constrained Logistic Regression for SparkR
> --
>
> Key: SPARK-20906
> URL: https://issues.apache.org/jira/browse/SPARK-20906
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic 
> Regression for ML. We should add it to SparkR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21293) R document update structured streaming

2018-01-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-21293.
--
  Resolution: Fixed
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> R document update structured streaming
> --
>
> Key: SPARK-21293
> URL: https://issues.apache.org/jira/browse/SPARK-21293
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
> Fix For: 2.3.0
>
>
> add examples for
> Window Operations on Event Time
> Join Operations
> Streaming Deduplication
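
For reference, a minimal sketch of the first item above (window operations on event time) with the SparkR 2.3 streaming API; the rate source and its option value are illustrative assumptions, not part of the documented example:

{code}
library(SparkR)
sparkR.session(master = "local[*]")

# rate source emits (timestamp, value) rows; the rate chosen here is arbitrary
events <- withWatermark(read.stream("rate", rowsPerSecond = "5"),
                        "timestamp", "10 minutes")

# count events in 5-minute event-time windows, with a 10-minute watermark
counts <- count(groupBy(events, window(events$timestamp, "5 minutes")))

query <- write.stream(counts, "console", outputMode = "complete")
awaitTermination(query, 10000)  # run briefly for illustration, then stop
stopQuery(query)
{code}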



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23172) Respect Project nodes in ReorderJoin

2018-01-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23172:


Assignee: Apache Spark

> Respect Project nodes in ReorderJoin
> 
>
> Key: SPARK-23172
> URL: https://issues.apache.org/jira/browse/SPARK-23172
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> The current `ReorderJoin` optimizer rule cannot flatten a pattern `Join -> 
> Project -> Join` because `ExtractFiltersAndInnerJoins`
> doesn't handle `Project` nodes. So, the current master cannot reorder joins 
> in the query below:
> {code}
> val df1 = spark.range(100).selectExpr("id % 10 AS k0", s"id % 10 AS k1", s"id 
> % 10 AS k2", "id AS v1")
> val df2 = spark.range(10).selectExpr("id AS k0", "id AS v2")
> val df3 = spark.range(10).selectExpr("id AS k1", "id AS v3")
> val df4 = spark.range(10).selectExpr("id AS k2", "id AS v4")
> df1.join(df2, "k0").join(df3, "k1").join(df4, "k2").explain(true)
> == Analyzed Logical Plan ==
> k2: bigint, k1: bigint, k0: bigint, v1: bigint, v2: bigint, v3: bigint, v4: 
> bigint
> Project [k2#5L, k1#4L, k0#3L, v1#6L, v2#16L, v3#24L, v4#32L]
> +- Join Inner, (k2#5L = k2#31L)
>:- Project [k1#4L, k0#3L, k2#5L, v1#6L, v2#16L, v3#24L]
>:  +- Join Inner, (k1#4L = k1#23L)
>: :- Project [k0#3L, k1#4L, k2#5L, v1#6L, v2#16L]
>: :  +- Join Inner, (k0#3L = k0#15L)
>: : :- Project [(id#0L % cast(10 as bigint)) AS k0#3L, (id#0L % 
> cast(10 as bigint)) AS k1#4L, (id#0L % cast(10 as bigint)) AS k2#5L, id#0
> L AS v1#6L]
>: : :  +- Range (0, 100, step=1, splits=Some(4))
>: : +- Project [id#12L AS k0#15L, id#12L AS v2#16L]
>: :+- Range (0, 10, step=1, splits=Some(4))
>: +- Project [id#20L AS k1#23L, id#20L AS v3#24L]
>:+- Range (0, 10, step=1, splits=Some(4))
>+- Project [id#28L AS k2#31L, id#28L AS v4#32L]
>   +- Range (0, 10, step=1, splits=Some(4))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23172) Respect Project nodes in ReorderJoin

2018-01-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333627#comment-16333627
 ] 

Apache Spark commented on SPARK-23172:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20345

> Respect Project nodes in ReorderJoin
> 
>
> Key: SPARK-23172
> URL: https://issues.apache.org/jira/browse/SPARK-23172
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The current `ReorderJoin` optimizer rule cannot flatten a pattern `Join -> 
> Project -> Join` because `ExtractFiltersAndInnerJoins`
> doesn't handle `Project` nodes. So, the current master cannot reorder joins 
> in the query below:
> {code}
> val df1 = spark.range(100).selectExpr("id % 10 AS k0", s"id % 10 AS k1", s"id 
> % 10 AS k2", "id AS v1")
> val df2 = spark.range(10).selectExpr("id AS k0", "id AS v2")
> val df3 = spark.range(10).selectExpr("id AS k1", "id AS v3")
> val df4 = spark.range(10).selectExpr("id AS k2", "id AS v4")
> df1.join(df2, "k0").join(df3, "k1").join(df4, "k2").explain(true)
> == Analyzed Logical Plan ==
> k2: bigint, k1: bigint, k0: bigint, v1: bigint, v2: bigint, v3: bigint, v4: 
> bigint
> Project [k2#5L, k1#4L, k0#3L, v1#6L, v2#16L, v3#24L, v4#32L]
> +- Join Inner, (k2#5L = k2#31L)
>:- Project [k1#4L, k0#3L, k2#5L, v1#6L, v2#16L, v3#24L]
>:  +- Join Inner, (k1#4L = k1#23L)
>: :- Project [k0#3L, k1#4L, k2#5L, v1#6L, v2#16L]
>: :  +- Join Inner, (k0#3L = k0#15L)
>: : :- Project [(id#0L % cast(10 as bigint)) AS k0#3L, (id#0L % 
> cast(10 as bigint)) AS k1#4L, (id#0L % cast(10 as bigint)) AS k2#5L, id#0
> L AS v1#6L]
>: : :  +- Range (0, 100, step=1, splits=Some(4))
>: : +- Project [id#12L AS k0#15L, id#12L AS v2#16L]
>: :+- Range (0, 10, step=1, splits=Some(4))
>: +- Project [id#20L AS k1#23L, id#20L AS v3#24L]
>:+- Range (0, 10, step=1, splits=Some(4))
>+- Project [id#28L AS k2#31L, id#28L AS v4#32L]
>   +- Range (0, 10, step=1, splits=Some(4))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23172) Respect Project nodes in ReorderJoin

2018-01-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23172:


Assignee: (was: Apache Spark)

> Respect Project nodes in ReorderJoin
> 
>
> Key: SPARK-23172
> URL: https://issues.apache.org/jira/browse/SPARK-23172
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The current `ReorderJoin` optimizer rule cannot flatten a pattern `Join -> 
> Project -> Join` because `ExtractFiltersAndInnerJoins`
> doesn't handle `Project` nodes. So, the current master cannot reorder joins 
> in the query below:
> {code}
> val df1 = spark.range(100).selectExpr("id % 10 AS k0", s"id % 10 AS k1", s"id 
> % 10 AS k2", "id AS v1")
> val df2 = spark.range(10).selectExpr("id AS k0", "id AS v2")
> val df3 = spark.range(10).selectExpr("id AS k1", "id AS v3")
> val df4 = spark.range(10).selectExpr("id AS k2", "id AS v4")
> df1.join(df2, "k0").join(df3, "k1").join(df4, "k2").explain(true)
> == Analyzed Logical Plan ==
> k2: bigint, k1: bigint, k0: bigint, v1: bigint, v2: bigint, v3: bigint, v4: 
> bigint
> Project [k2#5L, k1#4L, k0#3L, v1#6L, v2#16L, v3#24L, v4#32L]
> +- Join Inner, (k2#5L = k2#31L)
>:- Project [k1#4L, k0#3L, k2#5L, v1#6L, v2#16L, v3#24L]
>:  +- Join Inner, (k1#4L = k1#23L)
>: :- Project [k0#3L, k1#4L, k2#5L, v1#6L, v2#16L]
>: :  +- Join Inner, (k0#3L = k0#15L)
>: : :- Project [(id#0L % cast(10 as bigint)) AS k0#3L, (id#0L % 
> cast(10 as bigint)) AS k1#4L, (id#0L % cast(10 as bigint)) AS k2#5L, id#0
> L AS v1#6L]
>: : :  +- Range (0, 100, step=1, splits=Some(4))
>: : +- Project [id#12L AS k0#15L, id#12L AS v2#16L]
>: :+- Range (0, 10, step=1, splits=Some(4))
>: +- Project [id#20L AS k1#23L, id#20L AS v3#24L]
>:+- Range (0, 10, step=1, splits=Some(4))
>+- Project [id#28L AS k2#31L, id#28L AS v4#32L]
>   +- Range (0, 10, step=1, splits=Some(4))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23172) Respect Project nodes in ReorderJoin

2018-01-21 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-23172:


 Summary: Respect Project nodes in ReorderJoin
 Key: SPARK-23172
 URL: https://issues.apache.org/jira/browse/SPARK-23172
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1
Reporter: Takeshi Yamamuro


The current `ReorderJoin` optimizer rule cannot flatten a `Join -> Project -> Join` pattern because `ExtractFiltersAndInnerJoins` doesn't handle `Project` nodes. So, the current master cannot reorder the joins in the query below:
{code}
val df1 = spark.range(100).selectExpr("id % 10 AS k0", s"id % 10 AS k1", s"id % 10 AS k2", "id AS v1")
val df2 = spark.range(10).selectExpr("id AS k0", "id AS v2")
val df3 = spark.range(10).selectExpr("id AS k1", "id AS v3")
val df4 = spark.range(10).selectExpr("id AS k2", "id AS v4")
df1.join(df2, "k0").join(df3, "k1").join(df4, "k2").explain(true)

== Analyzed Logical Plan ==
k2: bigint, k1: bigint, k0: bigint, v1: bigint, v2: bigint, v3: bigint, v4: bigint
Project [k2#5L, k1#4L, k0#3L, v1#6L, v2#16L, v3#24L, v4#32L]
+- Join Inner, (k2#5L = k2#31L)
   :- Project [k1#4L, k0#3L, k2#5L, v1#6L, v2#16L, v3#24L]
   :  +- Join Inner, (k1#4L = k1#23L)
   :     :- Project [k0#3L, k1#4L, k2#5L, v1#6L, v2#16L]
   :     :  +- Join Inner, (k0#3L = k0#15L)
   :     :     :- Project [(id#0L % cast(10 as bigint)) AS k0#3L, (id#0L % cast(10 as bigint)) AS k1#4L, (id#0L % cast(10 as bigint)) AS k2#5L, id#0L AS v1#6L]
   :     :     :  +- Range (0, 100, step=1, splits=Some(4))
   :     :     +- Project [id#12L AS k0#15L, id#12L AS v2#16L]
   :     :        +- Range (0, 10, step=1, splits=Some(4))
   :     +- Project [id#20L AS k1#23L, id#20L AS v3#24L]
   :        +- Range (0, 10, step=1, splits=Some(4))
   +- Project [id#28L AS k2#31L, id#28L AS v4#32L]
      +- Range (0, 10, step=1, splits=Some(4))
{code}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8294) Break down large methods in YARN code

2018-01-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333615#comment-16333615
 ] 

Hudson commented on SPARK-8294:
---

[ 
https://issues-test.apache.org/jira/browse/SPARK-8294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265593#comment-16265593
 ] 

Sebastian Piu commented on SPARK-8294:
--

I had a quick look, and all those methods seem to have been refactored now; is this issue still relevant?




--
This message was sent by Atlassian JIRA
(v7.6.0#76001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


> Break down large methods in YARN code
> -
>
> Key: SPARK-8294
> URL: https://issues.apache.org/jira/browse/SPARK-8294
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Priority: Major
>
> What large methods am I talking about?
> Client#prepareLocalResources ~ 170 lines
> ExecutorRunnable#prepareCommand ~ 100 lines
> Client#setupLaunchEnv ~ 100 lines
> ... many others that hover around 80 - 90 lines
> There are several things wrong with this. First, it's difficult to follow / review the code. Second, it's difficult to test it at a fine granularity. In the past we as a community have been reluctant to add new regression tests for YARN changes. This stems from the fact that it is difficult to write tests, and the cost is that we can't really ensure the correctness of the code easily.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8294) Break down large methods in YARN code

2018-01-21 Thread Sebastian Piu (JIRATEST)

[ 
https://issues-test.apache.org/jira/browse/SPARK-8294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265593#comment-16265593
 ] 

Sebastian Piu commented on SPARK-8294:
--

I had a quick look, and all those methods seem to have been refactored now; is this issue still relevant?

> Break down large methods in YARN code
> -
>
> Key: SPARK-8294
> URL: https://issues-test.apache.org/jira/browse/SPARK-8294
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Priority: Major
>
> What large methods am I talking about?
> Client#prepareLocalResources ~ 170 lines
> ExecutorRunnable#prepareCommand ~ 100 lines
> Client#setupLaunchEnv ~ 100 lines
> ... many others that hover around 80 - 90 lines
> There are several things wrong with this. First, it's difficult to follow / review the code. Second, it's difficult to test it at a fine granularity. In the past we as a community have been reluctant to add new regression tests for YARN changes. This stems from the fact that it is difficult to write tests, and the cost is that we can't really ensure the correctness of the code easily.
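
For readers new to the thread, here is a minimal, generic sketch of the refactoring direction this ticket asks for; the names below are invented for illustration and are not the actual Client/ExecutorRunnable code. The idea is to pull self-contained pieces of a 100+ line method into small helpers with explicit inputs and outputs, so each piece can be unit-tested without a YARN cluster.

{code:java}
object CommandBuilderSketch {
  // Each helper is a pure function over its inputs, so it can be tested in isolation.
  def javaOpts(executorMemoryMb: Int, extraJavaOpts: Seq[String]): Seq[String] =
    Seq(s"-Xmx${executorMemoryMb}m") ++ extraJavaOpts

  def classPath(entries: Seq[String]): String =
    entries.mkString(":")

  // The former "one big method" becomes a thin composition of the helpers above.
  def buildCommand(executorMemoryMb: Int,
                   extraJavaOpts: Seq[String],
                   cpEntries: Seq[String],
                   mainClass: String): Seq[String] =
    Seq("java", "-cp", classPath(cpEntries)) ++
      javaOpts(executorMemoryMb, extraJavaOpts) ++
      Seq(mainClass)

  def main(args: Array[String]): Unit =
    println(buildCommand(2048, Seq("-XX:+UseG1GC"), Seq("spark.jar", "conf/"), "example.Main").mkString(" "))
}
{code}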



--
This message was sent by Atlassian JIRA
(v7.6.0#76001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2018-01-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333576#comment-16333576
 ] 

Takeshi Yamamuro commented on SPARK-19842:
--

What's the status of this ticket now? Do we need to discuss the benefits of this work first?
https://github.com/apache/spark/pull/18994#issuecomment-331368062

> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
>Priority: Major
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize query processing. They provide semantic information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint validation, and maintenance. The document shows many examples of 
> query performance improvements that utilize referential integrity constraints 
> and can be implemented in Spark.
> Link to the google doc: 
> [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]
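
To make the join-elimination case above concrete, here is a small, self-contained sketch. It only illustrates the idea: the tables, columns, and the informational FK/PK relationship are invented for this example, and no Spark API for declaring such constraints exists at this point.

{code:java}
import org.apache.spark.sql.SparkSession

object JoinEliminationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ri-constraints-sketch").getOrCreate()
    import spark.implicits._

    // Toy tables; under the proposal, sales.cust_id would carry an informational FOREIGN KEY
    // referencing customers.cust_id (an informational PRIMARY KEY). Neither is enforced.
    Seq((1L, 10.0), (2L, 20.0), (2L, 5.0)).toDF("cust_id", "amount").createOrReplaceTempView("sales")
    Seq((1L, "alice"), (2L, "bob")).toDF("cust_id", "name").createOrReplaceTempView("customers")

    // Only sales columns are referenced, so with the constraints above the join adds no rows
    // and no columns; a constraint-aware Catalyst could rewrite this to a plain scan of sales.
    spark.sql(
      """SELECT s.cust_id, s.amount
        |FROM sales s
        |JOIN customers c ON s.cust_id = c.cust_id""".stripMargin).explain(true)

    spark.stop()
  }
}
{code}

Because no column of `customers` survives into the output and the assumed FK/PK pair guarantees exactly one match per `sales` row, an optimizer that trusts the informational constraint could drop the join entirely.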



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23168) Hints for fact tables and unique columns

2018-01-21 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-23168:
-
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-19842

> Hints for fact tables and unique columns
> 
>
> Key: SPARK-23168
> URL: https://issues.apache.org/jira/browse/SPARK-23168
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> We already have fact-table and unique-column inference in StarSchemaDetection for decision-making queries. IMHO, in most cases, users already know which table is a fact table and which columns are unique. So, fact-table and unique-column hints might help these users.
>  For example,
> {code:java}
> scala> factTable.hint("factTable").hint("uid").join(dimTable)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23168) Hints for fact tables and unique columns

2018-01-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333572#comment-16333572
 ] 

Takeshi Yamamuro commented on SPARK-23168:
--

ok

> Hints for fact tables and unique columns
> 
>
> Key: SPARK-23168
> URL: https://issues.apache.org/jira/browse/SPARK-23168
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> We already have fact-table and unique-column inference in StarSchemaDetection for decision-making queries. IMHO, in most cases, users already know which table is a fact table and which columns are unique. So, fact-table and unique-column hints might help these users.
>  For example,
> {code:java}
> scala> factTable.hint("factTable").hint("uid").join(dimTable)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23167) Update TPCDS queries from v1.4 to v2.7 (latest)

2018-01-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23167:


Assignee: (was: Apache Spark)

> Update TPCDS queries from v1.4 to v2.7 (latest)
> ---
>
> Key: SPARK-23167
> URL: https://issues.apache.org/jira/browse/SPARK-23167
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> We currently use TPCDS v1.4 ([https://github.com/apache/spark/commits/master/sql/core/src/test/resources/tpcds]), though the latest one is v2.7 ([http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp]). I found that some queries differ between v1.4 and v2.7 (e.g., q4, q5, q6, ...) and that some new queries appear (e.g., q10a, ...). I think it makes sense to update the queries for a more accurate evaluation.
> Raw generated queries from TPCDS v2.7 query templates:
>  [https://github.com/maropu/spark_tpcds_v2.7.0/tree/master/generated]
> Modified TPCDS v2.7 queries to pass TPCDSQuerySuite (e.g., replacing unsupported syntax such as {{+ 14 days}} -> {{interval 14 days}}):
>  [https://github.com/apache/spark/compare/master...maropu:TPCDSV2_7]
>  
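
As a minimal sketch of the last rewrite mentioned above (assuming the parser behaviour of current Spark releases), the raw TPC-DS v2.7 template form {{+ 14 days}} is replaced with an interval literal:

{code:java}
import org.apache.spark.sql.SparkSession

object TpcdsIntervalRewriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("tpcds-rewrite-sketch").getOrCreate()

    // Raw TPC-DS v2.7 template form (rejected by Spark's SQL parser):
    //   cast('1999-02-22' as date) + 14 days
    // Rewritten form used so the query parses and can run in TPCDSQuerySuite:
    spark.sql("SELECT cast('1999-02-22' AS date) + interval 14 days AS end_date").show(false)

    spark.stop()
  }
}
{code}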



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23167) Update TPCDS queries from v1.4 to v2.7 (latest)

2018-01-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23167:


Assignee: Apache Spark

> Update TPCDS queries from v1.4 to v2.7 (latest)
> ---
>
> Key: SPARK-23167
> URL: https://issues.apache.org/jira/browse/SPARK-23167
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> We currently use TPCDS v1.4 ([https://github.com/apache/spark/commits/master/sql/core/src/test/resources/tpcds]), though the latest one is v2.7 ([http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp]). I found that some queries differ between v1.4 and v2.7 (e.g., q4, q5, q6, ...) and that some new queries appear (e.g., q10a, ...). I think it makes sense to update the queries for a more accurate evaluation.
> Raw generated queries from TPCDS v2.7 query templates:
>  [https://github.com/maropu/spark_tpcds_v2.7.0/tree/master/generated]
> Modified TPCDS v2.7 queries to pass TPCDSQuerySuite (e.g., replacing unsupported syntax such as {{+ 14 days}} -> {{interval 14 days}}):
>  [https://github.com/apache/spark/compare/master...maropu:TPCDSV2_7]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23167) Update TPCDS queries from v1.4 to v2.7 (latest)

2018-01-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333563#comment-16333563
 ] 

Apache Spark commented on SPARK-23167:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20343

> Update TPCDS queries from v1.4 to v2.7 (latest)
> ---
>
> Key: SPARK-23167
> URL: https://issues.apache.org/jira/browse/SPARK-23167
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> We currently use TPCDS v1.4 ([https://github.com/apache/spark/commits/master/sql/core/src/test/resources/tpcds]), though the latest one is v2.7 ([http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp]). I found that some queries differ between v1.4 and v2.7 (e.g., q4, q5, q6, ...) and that some new queries appear (e.g., q10a, ...). I think it makes sense to update the queries for a more accurate evaluation.
> Raw generated queries from TPCDS v2.7 query templates:
>  [https://github.com/maropu/spark_tpcds_v2.7.0/tree/master/generated]
> Modified TPCDS v2.7 queries to pass TPCDSQuerySuite (e.g., replacing unsupported syntax such as {{+ 14 days}} -> {{interval 14 days}}):
>  [https://github.com/apache/spark/compare/master...maropu:TPCDSV2_7]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23171) Reduce the time costs of the rule runs that do not change the plans

2018-01-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333551#comment-16333551
 ] 

Takeshi Yamamuro commented on SPARK-23171:
--

ok, I'll check the code based on these metrics.

> Reduce the time costs of the rule runs that do not change the plans 
> 
>
> Key: SPARK-23171
> URL: https://issues.apache.org/jira/browse/SPARK-23171
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Below are the time stats of the Analyzer/Optimizer rules. Try to improve the rules and reduce the time costs, especially for the runs that do not change the plans.
> {noformat}
> === Metrics of Analyzer/Optimizer Rules ===
> Total number of runs = 175827
> Total time: 20.699042877 seconds
>
> Rule  Total Time  Effective Time  Total Runs  Effective Runs
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning  2340563794  1338268224  1875  761
> org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution  1632672623  1625071881  788  37
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions  1395087131  347339931  1982  38
> org.apache.spark.sql.catalyst.optimizer.PruneFilters  1177711364  21344174  1590  3
> org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries  1145135465  1131417128  285  39
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences  1008347217  663112062  1982  616
> org.apache.spark.sql.catalyst.optimizer.ReorderJoin  767024424  693001699  1590  132
> org.apache.spark.sql.catalyst.analysis.Analyzer$FixNullability  598524650  40802876  742  12
> org.apache.spark.sql.catalyst.analysis.DecimalPrecision  595384169  436153128  1982  211
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery  548178270  459695885  1982  49
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts  423002864  139869503  1982  86
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification  405544962  17250184  1590  7
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin  383837603  284174662  1590  708
> org.apache.spark.sql.catalyst.optimizer.RemoveRedundantAliases  372901885  3362332  1590  9
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints  364628214  343815519  285  192
> org.apache.spark.sql.execution.datasources.FindDataSourceTable  303293296  285344766  1982  233
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions  233195019  92648171  1982  294
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion  220568919  73932736  1982  38
> org.apache.spark.sql.catalyst.optimizer.NullPropagation  207976072  9072305  1590  26

[jira] [Resolved] (SPARK-23156) Code of method "initialize(I)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2018-01-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23156.
---
Resolution: Duplicate

>  Code of method "initialize(I)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> 
>
> Key: SPARK-23156
> URL: https://issues.apache.org/jira/browse/SPARK-23156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 2.1.1, 2.1.2
> Environment: Ubuntu 16.04, Scala 2.11, Java 8, 8-node YARN cluster.
>Reporter: Krystian Zawistowski
>Priority: Major
>
> I am getting this while trying to generate a random DataFrame (300 columns, 5000 rows; Ints, Floats, and Timestamps in equal ratios). This is similar (but not identical) to SPARK-18492 and a few more tickets that were supposed to be fixed in 2.1.1.
> Part of the logs is below. The logs contain hundreds of millions of lines of generated code, apparently for each of the 1,500,000 fields of the DataFrame, which is very suspicious.
> {code:java}
> 18/01/19 06:33:15 INFO CodeGenerator: Code generated in 246.168393 ms$
> 18/01/19 06:33:21 ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method "initialize(I)V" 
> of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB$
> /* 001 */ public java.lang.Object generate(Object[] references) {$
> /* 002 */ return new SpecificUnsafeProjection(references);$
> /* 003 */ }$
> /* 004 */$
> /* 005 */ class SpecificUnsafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection {$
> /* 006 */$
> /* 007 */ private Object[] references;$
> /* 008 */ private org.apache.spark.util.random.XORShiftRandom rng;$
> /* 009 */ private org.apache.spark.util.random.XORShiftRandom rng1;$
> /* 010 */ private org.apache.spark.util.random.XORShiftRandom rng2;$
> /* 011 */ private org.apache.spark.util.random.XORShiftRandom rng3;$
> /* 012 */ private org.apache.spark.util.random.XORShiftRandom rng4;$
> {code}
> Reproduction:
> {code:java}
> import java.sql.Timestamp  // needed for Timestamp.valueOf below; missing from the original snippet
>
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.{Column, DataFrame, SparkSession}
>
> class RandomData(val numberOfColumns: Int, val numberOfRows: Int) extends Serializable {
>   import org.apache.spark.sql.functions._
>
>   private val minEpoch = Timestamp.valueOf("1800-01-01 00:00:00").getTime
>   private val maxEpoch = Timestamp.valueOf("2200-01-01 00:00:00").getTime
>   val idColumn = "id"
>
>   def generateData(path: String): Unit = {
>     val spark: SparkSession = SparkSession.builder().getOrCreate()
>     materializeTable(spark).write.parquet(path + "/source")
>   }
>
>   // Builds a DataFrame with an id column plus (time, number, category) columns per index.
>   private def materializeTable(spark: SparkSession): DataFrame = {
>     val sourceDF = spark.sqlContext.range(0, numberOfRows).withColumnRenamed("id", idColumn)
>     val columns = sourceDF(idColumn) +: (0 until numberOfColumns)
>       .flatMap(x => Seq(getTimeColumn(x), getNumberColumn(x), getCategoryColumn(x)))
>     sourceDF.select(columns: _*)
>   }
>
>   private def getTimeColumn(seed: Int): Column = {
>     val uniqueSeed = seed + numberOfColumns * 3
>     rand(seed = uniqueSeed)
>       .multiply(maxEpoch - minEpoch)
>       .divide(1000).cast("long")
>       .plus(minEpoch / 1000).cast(TimestampType).alias(s"time$seed")
>   }
>
>   private def getNumberColumn(seed: Int, namePrefix: String = "number"): Column = {
>     val uniqueSeed = seed + numberOfColumns * 4
>     randn(seed = uniqueSeed).alias(s"$namePrefix$seed")
>   }
>
>   private def getCategoryColumn(seed: Int): Column = {
>     val uniqueSeed = seed + numberOfColumns * 4
>     rand(seed = uniqueSeed).multiply(100).cast("int").alias(s"category$seed")
>   }
> }
>
> object GenerateData {
>   def main(args: Array[String]): Unit = {
>     new RandomData(args(0).toInt, args(1).toInt).generateData(args(2))
>   }
> }
> {code}
> Please package a jar and run as follows:
> {code:java}
> spark-submit --master yarn \
>  --driver-memory 12g \
>  --executor-memory 12g \
>  --deploy-mode cluster \
>  --class GenerateData \
>  --master yarn \
>  100 5000 "hdfs:///tmp/parquet"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22119) Add cosine distance to KMeans

2018-01-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22119.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 19340
[https://github.com/apache/spark/pull/19340]

> Add cosine distance to KMeans
> -
>
> Key: SPARK-22119
> URL: https://issues.apache.org/jira/browse/SPARK-22119
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently, KMeans assumes that the only possible distance measure is the Euclidean one.
> In some use cases, e.g. text mining, other distance measures like the cosine distance are widely used. Thus, for such use cases, it would be good to support multiple distance measures.
> This ticket is to support the cosine distance measure in KMeans. Later, other algorithms can be extended to support several distance measures, and further distance measures can be added.
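
For completeness, here is a minimal usage sketch against the ML API as it ended up for 2.4; the setter name ({{setDistanceMeasure}}) reflects the merged pull request and should be double-checked against the final documentation.

{code:java}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object CosineKMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cosine-kmeans-sketch").getOrCreate()
    import spark.implicits._

    // Tiny toy dataset: vectors pointing in two distinct directions.
    val data = Seq(
      Vectors.dense(1.0, 0.1), Vectors.dense(2.0, 0.2),
      Vectors.dense(0.1, 1.0), Vectors.dense(0.2, 2.0)
    ).map(Tuple1.apply).toDF("features")

    // With the cosine measure, points are clustered by direction rather than magnitude.
    val model = new KMeans()
      .setK(2)
      .setSeed(1L)
      .setDistanceMeasure("cosine")   // "euclidean" remains the default
      .fit(data)

    model.clusterCenters.foreach(println)
    spark.stop()
  }
}
{code}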



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22119) Add cosine distance to KMeans

2018-01-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22119:
-

Assignee: Marco Gaido

> Add cosine distance to KMeans
> -
>
> Key: SPARK-22119
> URL: https://issues.apache.org/jira/browse/SPARK-22119
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently, KMeans assumes that the only possible distance measure is the Euclidean one.
> In some use cases, e.g. text mining, other distance measures like the cosine distance are widely used. Thus, for such use cases, it would be good to support multiple distance measures.
> This ticket is to support the cosine distance measure in KMeans. Later, other algorithms can be extended to support several distance measures, and further distance measures can be added.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23166) Add maxDF Parameter to CountVectorizer

2018-01-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23166:
--
Priority: Minor  (was: Major)

Seems fine; open a pull request.

> Add maxDF Parameter to CountVectorizer
> --
>
> Key: SPARK-23166
> URL: https://issues.apache.org/jira/browse/SPARK-23166
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Yacine Mazari
>Priority: Minor
>
> Currently, the {{CountVectorizer}} has a {{minDF}} parameter.
> It might be useful to also have a {{maxDF}} parameter.
> It will be used as a threshold for filtering out terms that occur very frequently in a text corpus, because they are not very informative or could even be stop words.
> This is analogous to scikit-learn's [CountVectorizer|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html] {{max_df}} parameter.
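
A hypothetical usage sketch of the proposal follows; {{setMaxDF}} does not exist in 2.2.x, and the name is assumed here purely by analogy with {{setMinDF}}.

{code:java}
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

object MaxDFSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("maxdf-sketch").getOrCreate()
    import spark.implicits._

    // "the" and "and" appear in all three documents and carry little information.
    val docs = Seq(
      "spark spark the and", "flink the and", "kafka the and"
    ).map(_.split(" ").toSeq).map(Tuple1.apply).toDF("words")

    val cv = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setMinDF(1)
      .setMaxDF(2)   // hypothetical setter: drop terms appearing in more than 2 documents
    cv.fit(docs).transform(docs).show(truncate = false)

    spark.stop()
  }
}
{code}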



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23171) Reduce the time costs of the rule runs that do not change the plans

2018-01-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333540#comment-16333540
 ] 

Xiao Li edited comment on SPARK-23171 at 1/21/18 2:24 PM:
--

cc [~maropu] This is an umbrella JIRA. Feel free to create sub-tasks


was (Author: smilegator):
cc [~maropu]

> Reduce the time costs of the rule runs that do not change the plans 
> 
>
> Key: SPARK-23171
> URL: https://issues.apache.org/jira/browse/SPARK-23171
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Below are the time stats of the Analyzer/Optimizer rules. Try to improve the rules and reduce the time costs, especially for the runs that do not change the plans.
> {noformat}
> === Metrics of Analyzer/Optimizer Rules ===
> Total number of runs = 175827
> Total time: 20.699042877 seconds
>
> Rule  Total Time  Effective Time  Total Runs  Effective Runs
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning  2340563794  1338268224  1875  761
> org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution  1632672623  1625071881  788  37
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions  1395087131  347339931  1982  38
> org.apache.spark.sql.catalyst.optimizer.PruneFilters  1177711364  21344174  1590  3
> org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries  1145135465  1131417128  285  39
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences  1008347217  663112062  1982  616
> org.apache.spark.sql.catalyst.optimizer.ReorderJoin  767024424  693001699  1590  132
> org.apache.spark.sql.catalyst.analysis.Analyzer$FixNullability  598524650  40802876  742  12
> org.apache.spark.sql.catalyst.analysis.DecimalPrecision  595384169  436153128  1982  211
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery  548178270  459695885  1982  49
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts  423002864  139869503  1982  86
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification  405544962  17250184  1590  7
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin  383837603  284174662  1590  708
> org.apache.spark.sql.catalyst.optimizer.RemoveRedundantAliases  372901885  3362332  1590  9
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints  364628214  343815519  285  192
> org.apache.spark.sql.execution.datasources.FindDataSourceTable  303293296  285344766  1982  233
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions  233195019  92648171  1982  294
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion  220568919  73932736  1982  38
> org.apache.spark.sql.catalyst.optimizer.NullPropagation  207976072  9072305  1590  26

[jira] [Created] (SPARK-23171) Reduce the time costs of the rule runs that do not change the plans

2018-01-21 Thread Xiao Li (JIRA)
Xiao Li created SPARK-23171:
---

 Summary: Reduce the time costs of the rule runs that do not change 
the plans 
 Key: SPARK-23171
 URL: https://issues.apache.org/jira/browse/SPARK-23171
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


Below are the time stats of the Analyzer/Optimizer rules. Try to improve the rules and reduce the time costs, especially for the runs that do not change the plans.

{noformat}
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs = 175827
Total time: 20.699042877 seconds

Rule  Total Time  Effective Time  Total Runs  Effective Runs
org.apache.spark.sql.catalyst.optimizer.ColumnPruning  2340563794  1338268224  1875  761
org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution  1632672623  1625071881  788  37
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions  1395087131  347339931  1982  38
org.apache.spark.sql.catalyst.optimizer.PruneFilters  1177711364  21344174  1590  3
org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries  1145135465  1131417128  285  39
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences  1008347217  663112062  1982  616
org.apache.spark.sql.catalyst.optimizer.ReorderJoin  767024424  693001699  1590  132
org.apache.spark.sql.catalyst.analysis.Analyzer$FixNullability  598524650  40802876  742  12
org.apache.spark.sql.catalyst.analysis.DecimalPrecision  595384169  436153128  1982  211
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery  548178270  459695885  1982  49
org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts  423002864  139869503  1982  86
org.apache.spark.sql.catalyst.optimizer.BooleanSimplification  405544962  17250184  1590  7
org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin  383837603  284174662  1590  708
org.apache.spark.sql.catalyst.optimizer.RemoveRedundantAliases  372901885  3362332  1590  9
org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints  364628214  343815519  285  192
org.apache.spark.sql.execution.datasources.FindDataSourceTable  303293296  285344766  1982  233
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions  233195019  92648171  1982  294
org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion  220568919  73932736  1982  38
org.apache.spark.sql.catalyst.optimizer.NullPropagation  207976072  9072305  1590  26
org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings  207027618  37834145  1982  40
org.apache.spark.sql.catalyst.optimizer.PushDownPredicate  203382836  176482044  1590  783
org.apache.spark.sql.catalyst.analysis.TypeCoercion$InConversion
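
Not Spark code, but a generic sketch of one way to cut the cost of ineffective runs, assuming a cheap, conservative applicability check can be found for a rule; the trait and method names below are invented for illustration and are not part of Catalyst.

{code:java}
// Generic illustration only: guard the expensive rewrite behind a cheap precondition.
trait Plan

trait GuardedRule[P <: Plan] {
  /** Cheap, conservative test: may return true for plans the rule will not change,
    * but must never return false for a plan the rule would change. */
  def mayApply(plan: P): Boolean

  /** The (potentially expensive) rewrite itself. */
  def rewrite(plan: P): P

  /** Ineffective runs then cost only the guard, not a full tree traversal. */
  final def apply(plan: P): P = if (mayApply(plan)) rewrite(plan) else plan
}
{code}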

[jira] [Commented] (SPARK-23171) Reduce the time costs of the rule runs that do not change the plans

2018-01-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333540#comment-16333540
 ] 

Xiao Li commented on SPARK-23171:
-

cc [~maropu]

> Reduce the time costs of the rule runs that do not change the plans 
> 
>
> Key: SPARK-23171
> URL: https://issues.apache.org/jira/browse/SPARK-23171
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Below are the time stats of the Analyzer/Optimizer rules. Try to improve the rules and reduce the time costs, especially for the runs that do not change the plans.
> {noformat}
> === Metrics of Analyzer/Optimizer Rules ===
> Total number of runs = 175827
> Total time: 20.699042877 seconds
>
> Rule  Total Time  Effective Time  Total Runs  Effective Runs
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning  2340563794  1338268224  1875  761
> org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution  1632672623  1625071881  788  37
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions  1395087131  347339931  1982  38
> org.apache.spark.sql.catalyst.optimizer.PruneFilters  1177711364  21344174  1590  3
> org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries  1145135465  1131417128  285  39
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences  1008347217  663112062  1982  616
> org.apache.spark.sql.catalyst.optimizer.ReorderJoin  767024424  693001699  1590  132
> org.apache.spark.sql.catalyst.analysis.Analyzer$FixNullability  598524650  40802876  742  12
> org.apache.spark.sql.catalyst.analysis.DecimalPrecision  595384169  436153128  1982  211
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery  548178270  459695885  1982  49
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts  423002864  139869503  1982  86
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification  405544962  17250184  1590  7
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin  383837603  284174662  1590  708
> org.apache.spark.sql.catalyst.optimizer.RemoveRedundantAliases  372901885  3362332  1590  9
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints  364628214  343815519  285  192
> org.apache.spark.sql.execution.datasources.FindDataSourceTable  303293296  285344766  1982  233
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions  233195019  92648171  1982  294
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion  220568919  73932736  1982  38
> org.apache.spark.sql.catalyst.optimizer.NullPropagation  207976072  9072305  1590  26

[jira] [Commented] (SPARK-23167) Update TPCDS queries from v1.4 to v2.7 (latest)

2018-01-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333536#comment-16333536
 ] 

Takeshi Yamamuro commented on SPARK-23167:
--

ok, will do.

> Update TPCDS queries from v1.4 to v2.7 (latest)
> ---
>
> Key: SPARK-23167
> URL: https://issues.apache.org/jira/browse/SPARK-23167
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> We currently use TPCDS v1.4 ([https://github.com/apache/spark/commits/master/sql/core/src/test/resources/tpcds]), though the latest one is v2.7 ([http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp]). I found that some queries differ between v1.4 and v2.7 (e.g., q4, q5, q6, ...) and that some new queries appear (e.g., q10a, ...). I think it makes sense to update the queries for a more accurate evaluation.
> Raw generated queries from TPCDS v2.7 query templates:
>  [https://github.com/maropu/spark_tpcds_v2.7.0/tree/master/generated]
> Modified TPCDS v2.7 queries to pass TPCDSQuerySuite (e.g., replacing unsupported syntax such as {{+ 14 days}} -> {{interval 14 days}}):
>  [https://github.com/apache/spark/compare/master...maropu:TPCDSV2_7]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23170) Dump the statistics of effective runs of analyzer and optimizer rules

2018-01-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23170:


Assignee: Apache Spark  (was: Xiao Li)

> Dump the statistics of effective runs of analyzer and optimizer rules
> -
>
> Key: SPARK-23170
> URL: https://issues.apache.org/jira/browse/SPARK-23170
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Dump the statistics of effective runs of analyzer and optimizer rules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23170) Dump the statistics of effective runs of analyzer and optimizer rules

2018-01-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333535#comment-16333535
 ] 

Apache Spark commented on SPARK-23170:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20342

> Dump the statistics of effective runs of analyzer and optimizer rules
> -
>
> Key: SPARK-23170
> URL: https://issues.apache.org/jira/browse/SPARK-23170
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> Dump the statistics of effective runs of analyzer and optimizer rules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23170) Dump the statistics of effective runs of analyzer and optimizer rules

2018-01-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23170:


Assignee: Xiao Li  (was: Apache Spark)

> Dump the statistics of effective runs of analyzer and optimizer rules
> -
>
> Key: SPARK-23170
> URL: https://issues.apache.org/jira/browse/SPARK-23170
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> Dump the statistics of effective runs of analyzer and optimizer rules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


