[jira] [Updated] (SPARK-22891) NullPointerException when using a UDF

2017-12-22 Thread gaoyang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyang updated SPARK-22891:

Description: 
In my application I use multiple threads. Each thread has its own SparkSession and 
uses SparkSession.sqlContext.udf.register to register my UDF. Sometimes an exception 
like this is thrown:

{code:java}
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
    at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1062)
    at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:137)
    at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:136)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:136)
    at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:133)
    at org.apache.spark.sql.SparkSession.udf(SparkSession.scala:207)
    at org.apache.spark.sql.SQLContext.udf(SQLContext.scala:203)
    at com.game.data.stat.clusterTask.tools.standard.IpConverterRegister$.run(IpConverterRegister.scala:63)
    at ...
    ... 20 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
    at org.apache.spark.sql.hive.client.HiveClientImpl.newSession(HiveClientImpl.scala:789)
    at org.apache.spark.sql.hive.client.HiveClientImpl.newSession(HiveClientImpl.scala:79)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.resourceLoader$lzycompute(HiveSessionStateBuilder.scala:45)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.resourceLoader(HiveSessionStateBuilder.scala:44)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:61)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
    at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1059)
    ... 20 more
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException
    at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:744)
    at org.apache.hadoop.hive.ql.session.SessionState.getAuthenticator(SessionState.java:1391)
    at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:210)
    ... 34 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException
    at org.apache.hadoop.hive.ql.session.SessionState.setAuthorizerV2Config(SessionState.java:769)
    at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:736)
    ... 36 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.isCompatibleWith(HiveMetaStoreClient.java:287)
    at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
    at com.sun.proxy.$Proxy25.isCompatibleWith(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:206)
    at org.apache.hadoop.hive.ql.session.SessionState.setAuthorizerV2Config(SessionState.java:765)
    ... 37 more

{code}

Also, I use Apache Hive 2.1.1 in my cluster.
When I use Spark 2.1.x, the exception above never happens.
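
A minimal reproduction sketch of the pattern described above (class, thread count, and UDF names are illustrative, not the reporter's actual code; Hive support is assumed to be enabled since the trace goes through HiveSessionStateBuilder):

{code:scala}
// Hypothetical sketch only: several threads, each working with its own SparkSession
// (created via newSession() here) and registering a UDF through sqlContext.udf.
// Registering the UDF triggers lazy SessionState construction, which is where the
// HiveSessionStateBuilder instantiation from the stack trace above happens.
import org.apache.spark.sql.SparkSession

object UdfRegisterSketch {
  def main(args: Array[String]): Unit = {
    val base = SparkSession.builder()
      .appName("udf-register-sketch")
      .master("local[4]")
      .enableHiveSupport()
      .getOrCreate()

    val threads = (1 to 4).map { i =>
      new Thread(new Runnable {
        override def run(): Unit = {
          val session = base.newSession()  // per-thread session sharing the SparkContext
          session.sqlContext.udf.register("toUpper" + i, (s: String) => s.toUpperCase)
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    base.stop()
  }
}
{code}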


  was:
In my application I use multiple threads. Each thread has its own SparkSession and 
uses SparkSession.sqlContext.udf.register to register my UDF. Sometimes an exception 
like this is thrown:

{code:java}
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
    at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1062)
    at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:137)
  

[jira] [Created] (SPARK-22891) NullPointerException when using a UDF

2017-12-22 Thread gaoyang (JIRA)
gaoyang created SPARK-22891:
---

 Summary: NullPointerException when using a UDF
 Key: SPARK-22891
 URL: https://issues.apache.org/jira/browse/SPARK-22891
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1, 2.2.0
 Environment: hadoop 2.7.2
Reporter: gaoyang


In my application I use multiple threads. Each thread has its own SparkSession and 
uses SparkSession.sqlContext.udf.register to register my UDF. Sometimes an exception 
like this is thrown:

{code:java}
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
    at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1062)
    at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:137)
    at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:136)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:136)
    at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:133)
    at org.apache.spark.sql.SparkSession.udf(SparkSession.scala:207)
    at org.apache.spark.sql.SQLContext.udf(SQLContext.scala:203)
    at com.game.data.stat.clusterTask.tools.standard.IpConverterRegister$.run(IpConverterRegister.scala:63)
    at ...
    ... 20 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
    at org.apache.spark.sql.hive.client.HiveClientImpl.newSession(HiveClientImpl.scala:789)
    at org.apache.spark.sql.hive.client.HiveClientImpl.newSession(HiveClientImpl.scala:79)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.resourceLoader$lzycompute(HiveSessionStateBuilder.scala:45)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.resourceLoader(HiveSessionStateBuilder.scala:44)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:61)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
    at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1059)
    ... 20 more
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException
    at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:744)
    at org.apache.hadoop.hive.ql.session.SessionState.getAuthenticator(SessionState.java:1391)
    at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:210)
    ... 34 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException
    at org.apache.hadoop.hive.ql.session.SessionState.setAuthorizerV2Config(SessionState.java:769)
    at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:736)
    ... 36 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.isCompatibleWith(HiveMetaStoreClient.java:287)
    at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
    at com.sun.proxy.$Proxy25.isCompatibleWith(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:206)
    at org.apache.hadoop.hive.ql.session.SessionState.setAuthorizerV2Config(SessionState.java:765)
    ... 37 more

{code}

Also, I use Apache Hive 2.1.1 in my cluster.







[jira] [Resolved] (SPARK-22789) Add ContinuousExecution for continuous processing queries

2017-12-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-22789.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19984
[https://github.com/apache/spark/pull/19984]

> Add ContinuousExecution for continuous processing queries
> -
>
> Key: SPARK-22789
> URL: https://issues.apache.org/jira/browse/SPARK-22789
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
> Fix For: 2.3.0
>
>







[jira] [Assigned] (SPARK-22789) Add ContinuousExecution for continuous processing queries

2017-12-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-22789:


Assignee: Jose Torres

> Add ContinuousExecution for continuous processing queries
> -
>
> Key: SPARK-22789
> URL: https://issues.apache.org/jira/browse/SPARK-22789
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Assignee: Jose Torres
> Fix For: 2.3.0
>
>







[jira] [Assigned] (SPARK-22890) Basic tests for DateTimeOperations

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22890:


Assignee: (was: Apache Spark)

> Basic tests for DateTimeOperations
> --
>
> Key: SPARK-22890
> URL: https://issues.apache.org/jira/browse/SPARK-22890
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>







[jira] [Commented] (SPARK-22890) Basic tests for DateTimeOperations

2017-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302174#comment-16302174
 ] 

Apache Spark commented on SPARK-22890:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/20061

> Basic tests for DateTimeOperations
> --
>
> Key: SPARK-22890
> URL: https://issues.apache.org/jira/browse/SPARK-22890
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>







[jira] [Assigned] (SPARK-22890) Basic tests for DateTimeOperations

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22890:


Assignee: Apache Spark

> Basic tests for DateTimeOperations
> --
>
> Key: SPARK-22890
> URL: https://issues.apache.org/jira/browse/SPARK-22890
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-22890) Basic tests for DateTimeOperations

2017-12-22 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-22890:
---

 Summary: Basic tests for DateTimeOperations
 Key: SPARK-22890
 URL: https://issues.apache.org/jira/browse/SPARK-22890
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Yuming Wang









[jira] [Assigned] (SPARK-22881) ML test for StructuredStreaming: spark.ml.regression

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22881:


Assignee: Weichen Xu  (was: Apache Spark)

> ML test for StructuredStreaming: spark.ml.regression
> 
>
> Key: SPARK-22881
> URL: https://issues.apache.org/jira/browse/SPARK-22881
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843
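
For readers unfamiliar with the pattern, a rough self-contained sketch of exercising a Transformer against a streaming Dataset (this is not the MLTest helper referenced in the PR above; Binarizer is just a convenient stateless Transformer and all names here are illustrative):

{code:scala}
// Feed a Transformer a streaming Dataset backed by a MemoryStream and collect the
// output from an in-memory sink, so assertions can be made on the results.
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().master("local[2]").appName("ml-streaming-sketch").getOrCreate()
import spark.implicits._
implicit val sqlCtx = spark.sqlContext

val input = MemoryStream[Double]
val streamingDF = input.toDF().toDF("feature")

val binarizer = new Binarizer()
  .setInputCol("feature")
  .setOutputCol("binarized")
  .setThreshold(1.5)

val query = binarizer.transform(streamingDF)
  .writeStream
  .format("memory")
  .queryName("predictions")
  .outputMode("append")
  .start()

input.addData(1.0, 2.0, 3.0)
query.processAllAvailable()
spark.table("predictions").show()  // inspect / assert on the transformed rows here
query.stop()
spark.stop()
{code}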






[jira] [Commented] (SPARK-22881) ML test for StructuredStreaming: spark.ml.regression

2017-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302166#comment-16302166
 ] 

Apache Spark commented on SPARK-22881:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/19979

> ML test for StructuredStreaming: spark.ml.regression
> 
>
> Key: SPARK-22881
> URL: https://issues.apache.org/jira/browse/SPARK-22881
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Assigned] (SPARK-22881) ML test for StructuredStreaming: spark.ml.regression

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22881:


Assignee: Apache Spark  (was: Weichen Xu)

> ML test for StructuredStreaming: spark.ml.regression
> 
>
> Key: SPARK-22881
> URL: https://issues.apache.org/jira/browse/SPARK-22881
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Assigned] (SPARK-22889) CRAN checks can fail if older Spark install exists

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22889:


Assignee: (was: Apache Spark)

> CRAN checks can fail if older Spark install exists
> --
>
> Key: SPARK-22889
> URL: https://issues.apache.org/jira/browse/SPARK-22889
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Shivaram Venkataraman
>
> Since all CRAN checks go through the same machine, if there is an older 
> partial download or partial install of Spark left behind the tests fail. One 
> solution is to overwrite the install files when running tests. 






[jira] [Assigned] (SPARK-22889) CRAN checks can fail if older Spark install exists

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22889:


Assignee: Apache Spark

> CRAN checks can fail if older Spark install exists
> --
>
> Key: SPARK-22889
> URL: https://issues.apache.org/jira/browse/SPARK-22889
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>
> Since all CRAN checks go through the same machine, if there is an older 
> partial download or partial install of Spark left behind the tests fail. One 
> solution is to overwrite the install files when running tests. 






[jira] [Commented] (SPARK-22889) CRAN checks can fail if older Spark install exists

2017-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302139#comment-16302139
 ] 

Apache Spark commented on SPARK-22889:
--

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/20060

> CRAN checks can fail if older Spark install exists
> --
>
> Key: SPARK-22889
> URL: https://issues.apache.org/jira/browse/SPARK-22889
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Shivaram Venkataraman
>
> Since all CRAN checks go through the same machine, if there is an older 
> partial download or partial install of Spark left behind the tests fail. One 
> solution is to overwrite the install files when running tests. 






[jira] [Created] (SPARK-22889) CRAN checks can fail if older Spark install exists

2017-12-22 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-22889:
-

 Summary: CRAN checks can fail if older Spark install exists
 Key: SPARK-22889
 URL: https://issues.apache.org/jira/browse/SPARK-22889
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.1, 2.3.0
Reporter: Shivaram Venkataraman


Since all CRAN checks go through the same machine, if an older partial download or 
partial install of Spark is left behind, the tests fail. One solution is to 
overwrite the install files when running tests.






[jira] [Assigned] (SPARK-22881) ML test for StructuredStreaming: spark.ml.regression

2017-12-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-22881:
-

Assignee: Weichen Xu

> ML test for StructuredStreaming: spark.ml.regression
> 
>
> Key: SPARK-22881
> URL: https://issues.apache.org/jira/browse/SPARK-22881
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Comment Edited] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming

2017-12-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212962#comment-16212962
 ] 

Joseph K. Bradley edited comment on SPARK-21926 at 12/22/17 10:40 PM:
--

One more report from the dev list: HashingTF and IDFModel fail with Structured 
Streaming: 
http://apache-spark-developers-list.1001551.n3.nabble.com/HashingTFModel-IDFModel-in-Structured-Streaming-td22680.html
--> the issue seems to have been VectorAssembler


was (Author: josephkb):
One more report from the dev list: HashingTF and IDFModel fail with Structured 
Streaming: 
http://apache-spark-developers-list.1001551.n3.nabble.com/HashingTFModel-IDFModel-in-Structured-Streaming-td22680.html

> Compatibility between ML Transformers and Structured Streaming
> --
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> More details here SPARK-22346.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> -b) Allow user to set the cardinality of OneHotEncoder.-






[jira] [Created] (SPARK-22888) OneVsRestModel does not work with Structured Streaming

2017-12-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-22888:
-

 Summary: OneVsRestModel does not work with Structured Streaming
 Key: SPARK-22888
 URL: https://issues.apache.org/jira/browse/SPARK-22888
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.1
Reporter: Joseph K. Bradley
Priority: Critical


OneVsRestModel uses Dataset.persist, which does not work with streaming.  This 
should be avoided when the input is a streaming Dataset.
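
A minimal sketch (not the actual OneVsRestModel code) of guarding a persist call so it is skipped for streaming input, which is the kind of change this ticket suggests:

{code:scala}
// Sketch only: cache the input when it is a batch Dataset, but skip persist/unpersist
// entirely when Dataset.isStreaming is true, since persist is unsupported there.
import org.apache.spark.sql.Dataset

def withOptionalCache[T, R](dataset: Dataset[T])(body: Dataset[T] => R): R = {
  val handlePersistence = !dataset.isStreaming
  if (handlePersistence) dataset.persist()
  try {
    body(dataset)
  } finally {
    if (handlePersistence) dataset.unpersist()
  }
}
{code}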






[jira] [Created] (SPARK-22886) ML test for StructuredStreaming: spark.ml.recommendation

2017-12-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-22886:
-

 Summary: ML test for StructuredStreaming: spark.ml.recommendation
 Key: SPARK-22886
 URL: https://issues.apache.org/jira/browse/SPARK-22886
 Project: Spark
  Issue Type: Test
  Components: ML, Tests
Affects Versions: 2.3.0
Reporter: Joseph K. Bradley


Task for adding Structured Streaming tests for all Models/Transformers in a 
sub-module in spark.ml

For an example, see LinearRegressionSuite.scala in 
https://github.com/apache/spark/pull/19843






[jira] [Created] (SPARK-22887) ML test for StructuredStreaming: spark.ml.fpm

2017-12-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-22887:
-

 Summary: ML test for StructuredStreaming: spark.ml.fpm
 Key: SPARK-22887
 URL: https://issues.apache.org/jira/browse/SPARK-22887
 Project: Spark
  Issue Type: Test
  Components: ML, Tests
Affects Versions: 2.3.0
Reporter: Joseph K. Bradley


Task for adding Structured Streaming tests for all Models/Transformers in a 
sub-module in spark.ml

For an example, see LinearRegressionSuite.scala in 
https://github.com/apache/spark/pull/19843






[jira] [Created] (SPARK-22885) ML test for StructuredStreaming: spark.ml.tuning

2017-12-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-22885:
-

 Summary: ML test for StructuredStreaming: spark.ml.tuning
 Key: SPARK-22885
 URL: https://issues.apache.org/jira/browse/SPARK-22885
 Project: Spark
  Issue Type: Test
  Components: ML, Tests
Affects Versions: 2.3.0
Reporter: Joseph K. Bradley


Task for adding Structured Streaming tests for all Models/Transformers in a 
sub-module in spark.ml

For an example, see LinearRegressionSuite.scala in 
https://github.com/apache/spark/pull/19843






[jira] [Created] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering

2017-12-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-22884:
-

 Summary: ML test for StructuredStreaming: spark.ml.clustering
 Key: SPARK-22884
 URL: https://issues.apache.org/jira/browse/SPARK-22884
 Project: Spark
  Issue Type: Test
  Components: ML, Tests
Affects Versions: 2.3.0
Reporter: Joseph K. Bradley


Task for adding Structured Streaming tests for all Models/Transformers in a 
sub-module in spark.ml

For an example, see LinearRegressionSuite.scala in 
https://github.com/apache/spark/pull/19843






[jira] [Created] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature

2017-12-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-22883:
-

 Summary: ML test for StructuredStreaming: spark.ml.feature
 Key: SPARK-22883
 URL: https://issues.apache.org/jira/browse/SPARK-22883
 Project: Spark
  Issue Type: Test
  Components: ML, Tests
Affects Versions: 2.3.0
Reporter: Joseph K. Bradley


Task for adding Structured Streaming tests for all Models/Transformers in a 
sub-module in spark.ml

For an example, see LinearRegressionSuite.scala in 
https://github.com/apache/spark/pull/19843






[jira] [Created] (SPARK-22882) ML test for StructuredStreaming: spark.ml.classification

2017-12-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-22882:
-

 Summary: ML test for StructuredStreaming: spark.ml.classification
 Key: SPARK-22882
 URL: https://issues.apache.org/jira/browse/SPARK-22882
 Project: Spark
  Issue Type: Test
  Components: ML, Tests
Affects Versions: 2.3.0
Reporter: Joseph K. Bradley


Task for adding Structured Streaming tests for all Models/Transformers in a 
sub-module in spark.ml

For an example, see LinearRegressionSuite.scala in 
https://github.com/apache/spark/pull/19843






[jira] [Created] (SPARK-22881) ML test for StructuredStreaming: spark.ml.regression

2017-12-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-22881:
-

 Summary: ML test for StructuredStreaming: spark.ml.regression
 Key: SPARK-22881
 URL: https://issues.apache.org/jira/browse/SPARK-22881
 Project: Spark
  Issue Type: Test
  Components: ML, Tests
Affects Versions: 2.3.0
Reporter: Joseph K. Bradley


Task for adding Structured Streaming tests for all Models/Transformers in a 
sub-module in spark.ml

For an example, see LinearRegressionSuite.scala in 
https://github.com/apache/spark/pull/19843






[jira] [Commented] (SPARK-22648) Documentation for Kubernetes Scheduler Backend

2017-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302006#comment-16302006
 ] 

Apache Spark commented on SPARK-22648:
--

User 'liyinan926' has created a pull request for this issue:
https://github.com/apache/spark/pull/20059

> Documentation for Kubernetes Scheduler Backend
> --
>
> Key: SPARK-22648
> URL: https://issues.apache.org/jira/browse/SPARK-22648
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
> Fix For: 2.3.0
>
>
> This covers documentation needed to use the Kubernetes backend. This will 
> likely be marked as experimental.






[jira] [Resolved] (SPARK-22346) Update VectorAssembler to work with Structured Streaming

2017-12-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-22346.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Resolved via https://github.com/apache/spark/pull/19746

> Update VectorAssembler to work with Structured Streaming
> 
>
> Key: SPARK-22346
> URL: https://issues.apache.org/jira/browse/SPARK-22346
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Critical
> Fix For: 2.3.0
>
>
> The issue
> In batch mode, VectorAssembler can take multiple columns of VectorType and 
> assemble and output a new column of VectorType containing the concatenated 
> vectors. In streaming mode, this transformation can fail because 
> VectorAssembler does not have enough information to produce metadata 
> (AttributeGroup) for the new column. Because VectorAssembler is such a 
> ubiquitous part of mllib pipelines, this issue effectively means spark 
> structured streaming does not support prediction using mllib pipelines.
> I've created this ticket so we can discuss ways to potentially improve 
> VectorAssembler. Please let me know if there are any issues I have not 
> considered or potential fixes I haven't outlined. I'm happy to submit a patch 
> once I know which strategy is the best approach.
> Potential fixes
> 1) Replace VectorAssembler with an estimator/model pair like was recently 
> done with OneHotEncoder, 
> [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The 
> Estimator can "learn" the size of the inputs vectors during training and save 
> it to use during prediction.
> Pros:
> * Possibly simplest of the potential fixes
> Cons:
> * We'll need to deprecate current VectorAssembler
> 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty 
> major change, but it could be done in stages. We could first ensure that 
> metadata is not used during prediction and allow the VectorAssembler to drop 
> metadata for streaming dataframes. Going forward, it would be important to 
> not use any metadata on Vector columns for any prediction tasks.
> Pros:
> * Potentially, easy short term fix for VectorAssembler
> (drop metadata for vector columns in streaming).
> * Current Attributes implementation is also causing other issues, eg 
> [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
> Cons:
> * To fully remove ML Attributes would be a major refactor of MLlib and would 
> most likely require breaking changes.
> * A partial removal of ML attributes (eg: ensure ML attributes are not used 
> during transform, only during fit) might be tricky. This would require 
> testing or other enforcement mechanism to prevent regressions.
> 3) Require Vector columns to have fixed length vectors. Most mllib 
> transformers that produce vectors already include the size of the vector in 
> the column metadata. This change would be to deprecate APIs that allow 
> creating a vector column of unknown length and replace those APIs with 
> equivalents that enforce a fixed size.
> Pros:
> * We already treat vectors as fixed size, for example VectorAssembler assumes 
> the input and output columns are fixed-size vectors and creates metadata 
> accordingly. In the spirit of explicit is better than implicit, we would be 
> codifying something we already assume.
> * This could potentially enable performance optimizations that are only 
> possible if the Vector size of a column is fixed & known.
> Cons:
> * This would require breaking changes.
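
As a practical follow-up to the quoted description: a hedged sketch of declaring the vector size up front so VectorAssembler has the metadata it needs on streaming input. This assumes the VectorSizeHint transformer available in Spark 2.3+ (column names and the size value are illustrative):

{code:scala}
// Declare the size of a vector column before assembling, so the same pipeline can
// run on both batch and streaming DataFrames.
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}

val sizeHint = new VectorSizeHint()
  .setInputCol("userFeatures")   // illustrative column name
  .setSize(3)                    // the declared, fixed vector length
  .setHandleInvalid("skip")

val assembler = new VectorAssembler()
  .setInputCols(Array("userFeatures", "hourOfDay"))
  .setOutputCol("features")

val stages: Array[PipelineStage] = Array(sizeHint, assembler)
val pipeline = new Pipeline().setStages(stages)
{code}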






[jira] [Assigned] (SPARK-22346) Update VectorAssembler to work with Structured Streaming

2017-12-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-22346:
-

Assignee: Bago Amirbekian

> Update VectorAssembler to work with Structured Streaming
> 
>
> Key: SPARK-22346
> URL: https://issues.apache.org/jira/browse/SPARK-22346
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Critical
>
> The issue
> In batch mode, VectorAssembler can take multiple columns of VectorType and 
> assemble and output a new column of VectorType containing the concatenated 
> vectors. In streaming mode, this transformation can fail because 
> VectorAssembler does not have enough information to produce metadata 
> (AttributeGroup) for the new column. Because VectorAssembler is such a 
> ubiquitous part of mllib pipelines, this issue effectively means spark 
> structured streaming does not support prediction using mllib pipelines.
> I've created this ticket so we can discuss ways to potentially improve 
> VectorAssembler. Please let me know if there are any issues I have not 
> considered or potential fixes I haven't outlined. I'm happy to submit a patch 
> once I know which strategy is the best approach.
> Potential fixes
> 1) Replace VectorAssembler with an estimator/model pair like was recently 
> done with OneHotEncoder, 
> [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The 
> Estimator can "learn" the size of the inputs vectors during training and save 
> it to use during prediction.
> Pros:
> * Possibly simplest of the potential fixes
> Cons:
> * We'll need to deprecate current VectorAssembler
> 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty 
> major change, but it could be done in stages. We could first ensure that 
> metadata is not used during prediction and allow the VectorAssembler to drop 
> metadata for streaming dataframes. Going forward, it would be important to 
> not use any metadata on Vector columns for any prediction tasks.
> Pros:
> * Potentially, easy short term fix for VectorAssembler
> (drop metadata for vector columns in streaming).
> * Current Attributes implementation is also causing other issues, eg 
> [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
> Cons:
> * To fully remove ML Attributes would be a major refactor of MLlib and would 
> most likely require breaking changes.
> * A partial removal of ML attributes (eg: ensure ML attributes are not used 
> during transform, only during fit) might be tricky. This would require 
> testing or other enforcement mechanism to prevent regressions.
> 3) Require Vector columns to have fixed length vectors. Most mllib 
> transformers that produce vectors already include the size of the vector in 
> the column metadata. This change would be to deprecate APIs that allow 
> creating a vector column of unknown length and replace those APIs with 
> equivalents that enforce a fixed size.
> Pros:
> * We already treat vectors as fixed size, for example VectorAssembler assumes 
> the input and output columns are fixed-size vectors and creates metadata 
> accordingly. In the spirit of explicit is better than implicit, we would be 
> codifying something we already assume.
> * This could potentially enable performance optimizations that are only 
> possible if the Vector size of a column is fixed & known.
> Cons:
> * This would require breaking changes.






[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301986#comment-16301986
 ] 

Apache Spark commented on SPARK-22126:
--

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/20058

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> more discussion is here
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> Anyone who's following might want to scan the design doc (in the links 
> above), the latest api proposal is:
> {code}
> def fitMultiple(
> dataset: Dataset[_],
> paramMaps: Array[ParamMap]
>   ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]]
> {code}
> Old discussion:
> I copy discussion from gist to here:
> I propose to design API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
>  It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, if we doing cross validation 
> and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets 
> say that the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. a new method in Estimator that return such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, we can see that, models computed from ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread 
> code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, 
> maxIter=10))  in another thread. In this example, there're 4 paramMaps, but 
> we can at most generate two threads to compute the models for them.
> The API above allow "callable.call()" to return multiple models, and return 
> type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap 
> index for corresponding model. Use the example above, there're 4 paramMaps, 
> but only return 2 callable objects, one callable object for ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, 
> maxIter=5), (regParam=0.3, maxIter=10)).
> and the default "fitCallables/fit with paramMaps" can be implemented as 
> following:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
> Array[Callable[Map[Int, M]]] = {
>   paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
> new Callable[Map[Int, M]] {
>   override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
> }
>   }
> }
> def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
>fitCallables(dataset, paramMaps).map { _.call().toSeq }
>  .flatMap(_).sortBy(_._1).map(_._2)
> }
> {code}
> If use the API I proposed above, the code in 
> [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159]
> can be changed to:
> {code}
>   val trainingDataset = sparkSession.createDataFrame(training, 
> schema).cache()
>   val validationDataset = sparkSession.createDataFrame(validation, 
> schema).cache()
>   // Fit models in a Future for training in parallel
>   val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { 
> callable =>
>  Future[Map[Int, Model[_]]] {
> val modelMap = callable.call()
> if (collectSubModelsParam) {
>...
> }
> modelMap
>  } (executionContext)
>   }
>   // Unpersist training data only when all models have trained
>   Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, 
> executionContext)
> .onComplete { _ => trainingDataset.unpersist() } (executionContext)
>   // Evaluate models in a Future that will calculate a metric and allow 
> model to be cleaned up
>   val foldMetricMapFutures = modelMapFutures.map { modelMapFuture =>
> modelMapFuture.map { modelMap =>
>   modelMap.map { case (index: Int, model: Model[_]) =>
> val metric = eval.evaluate(model.transform(validationDataset, 
> paramMaps(index)))
> (index, metric)
>   }
> 

[jira] [Commented] (SPARK-22879) LogisticRegression inconsistent prediction when proba == threshold

2017-12-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301801#comment-16301801
 ] 

Sean Owen commented on SPARK-22879:
---

Yes these are algebraically the same but not exactly the same due to roundoff. 
I guess I'd argue the right answer is 'false' in your example, because it's the 
comparison with the value the user supplied. Yes it should be consistent, but I 
don't know if this can be avoided, without avoiding the 'raw' comparison 
altogether. That's for performance reasons though.

In general this seems like an extreme corner case, but I see you're trying to 
exactly reproduce certain comparisons. What about rounding the result to your 
nearest test set value, to satisfy your particular use case?

Is there any way to get the speed up (win with no downside in almost all cases) 
without this behavior?

> LogisticRegression inconsistent prediction when proba == threshold
> --
>
> Key: SPARK-22879
> URL: https://issues.apache.org/jira/browse/SPARK-22879
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.3
>Reporter: Adrien Lavoillotte
>Priority: Minor
>
> I'm using {{org.apache.spark.ml.classification.LogisticRegression}} for 
> binary classification.
> If I predict on a record that yields exactly the probability of the 
> threshold, then the result of {{transform}} is different depending on whether 
> the {{rawPredictionCol}} param is empty on the model or not.
> If it is empty, as most ML tools I've seen, it correctly predicts 0, the rule 
> being {{ if (proba > threshold) then 1 else 0 }} (implemented in 
> {{probability2prediction}}).
> If however {{rawPredictionCol}} is set (default), then it avoids 
> recomputation by calling {{raw2prediction}}, and the rule becomes {{if 
> (rawPrediction(1) > rawThreshold) 1 else 0}}. The {{rawThreshold = math.log(t 
> / (1.0 - t))}} is ever-so-slightly below the {{rawPrediction(1)}}, so it 
> predicts 1.
> The use case is that I choose the threshold amongst 
> {{BinaryClassificationMetrics#thresholds}}, so I get one that corresponds to 
> the probability for one or more of my test set's records. Re-scoring that 
> record or one that yields the same probability exhibits this behaviour.
> Tested this on Spark 1.6 but the code involved seems to be similar on Spark 
> 2.2.






[jira] [Comment Edited] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)

2017-12-22 Thread Daniel van der Ende (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301772#comment-16301772
 ] 

Daniel van der Ende edited comment on SPARK-22880 at 12/22/17 6:31 PM:
---

It's different in the sense that SPARK-22717's scope was to correct a bug; 
PostgreSQL does not cascade by default. The reason this was particularly nasty was 
that it actually prevented any PostgreSQL table from being truncated, as the logic 
dictates that if a database truncates a table with cascade by default, it cannot be 
truncated at all. This issue (SPARK-22880) is about adding an option to allow users 
to enable cascade for truncating tables if they want to. 


was (Author: danielvdende):
It's different in the sense that 
[SPARK-22717|https://github.com/apache/spark/pull/19911]'s scope was to correct 
a bug; PostgreSQL does not cascade by default. The reason this was particularly 
nasty, was that it actually prevented any PostgreSQL table from being 
truncated, as the logic dictates that if a table is truncated by default by a 
database, it cannot be truncated. This issue (SPARK-22880) is about adding an 
option to allow users to enable cascade for truncating tables if they want to. 

> Add option to cascade jdbc truncate if database supports this (PostgreSQL and 
> Oracle)
> -
>
> Key: SPARK-22880
> URL: https://issues.apache.org/jira/browse/SPARK-22880
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Priority: Minor
>
> When truncating tables, PostgreSQL and Oracle support an option `TRUNCATE`. 
> This cascades the truncate to tables with foreign key constraints on a column 
> in the table specified to truncate. It would be nice to be able to optionally 
> set this `TRUNCATE` option for PostgreSQL and Oracle.
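
To make the proposal concrete, a sketch of how such an option might look from the DataFrame writer; {{cascadeTruncate}} is a hypothetical option name used purely for illustration (only the existing {{truncate}} option is real here):

{code:scala}
// Illustration of the proposed behaviour; "cascadeTruncate" is NOT an existing
// JDBC writer option at the time of this issue, it is the feature being requested.
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

def overwriteWithCascade(df: DataFrame, url: String, table: String): Unit = {
  df.write
    .mode(SaveMode.Overwrite)
    .option("truncate", "true")          // existing option: TRUNCATE instead of DROP/CREATE
    .option("cascadeTruncate", "true")   // proposed: emit TRUNCATE ... CASCADE on PostgreSQL/Oracle
    .jdbc(url, table, new Properties())
}
{code}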






[jira] [Commented] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)

2017-12-22 Thread Daniel van der Ende (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301772#comment-16301772
 ] 

Daniel van der Ende commented on SPARK-22880:
-

It's different in the sense that 
[SPARK-22717|https://github.com/apache/spark/pull/19911]'s scope was to correct 
a bug; PostgreSQL does not cascade by default. The reason this was particularly 
nasty, was that it actually prevented any PostgreSQL table from being 
truncated, as the logic dictates that if a table is truncated by default by a 
database, it cannot be truncated. This issue (SPARK-22880) is about adding an 
option to allow users to enable cascade for truncating tables if they want to. 

> Add option to cascade jdbc truncate if database supports this (PostgreSQL and 
> Oracle)
> -
>
> Key: SPARK-22880
> URL: https://issues.apache.org/jira/browse/SPARK-22880
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Priority: Minor
>
> When truncating tables, PostgreSQL and Oracle support an option `TRUNCATE`. 
> This cascades the truncate to tables with foreign key constraints on a column 
> in the table specified to truncate. It would be nice to be able to optionally 
> set this `TRUNCATE` option for PostgreSQL and Oracle.






[jira] [Commented] (SPARK-22766) Install R linter package in spark lib directory

2017-12-22 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301751#comment-16301751
 ] 

Hossein Falaki commented on SPARK-22766:


[~felixcheung] This just makes installation of whatever version of {{lintr}} we 
pick more stable. The two tickets are orthogonal.

> Install R linter package in spark lib directory
> ---
>
> Key: SPARK-22766
> URL: https://issues.apache.org/jira/browse/SPARK-22766
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Hossein Falaki
>
> The {{dev/lint-r.R}} file uses devtools to install the {{jimhester/lintr}} 
> package in the default site library location, which is 
> {{/usr/local/lib/R/site-library}}. This is not recommended and can fail 
> because we are running this script as jenkins while that directory is owned 
> by root.
> We need to install the linter package in a local directory.






[jira] [Commented] (SPARK-22879) LogisticRegression inconsistent prediction when proba == threshold

2017-12-22 Thread Adrien Lavoillotte (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301724#comment-16301724
 ] 

Adrien Lavoillotte commented on SPARK-22879:


In my case, the threshold was 0.2739699931543393, and the raw prediction was 
-0.974572759358285.

* If you compute the proba from the raw prediction, you also get 
0.2739699931543393, so the > comparison returns false
* If, however, you compute the {{rawThreshold}} from the threshold, you get 
-0.9745727593582851 (notice that last 1), so the > comparison now returns true
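
A small sketch of the two code paths with the values from this comment (the booleans noted in the comments are those reported above):

{code:scala}
// Reproduce the two comparisons described in this ticket with the reported values.
val t = 0.2739699931543393    // probability threshold picked from BinaryClassificationMetrics
val raw = -0.974572759358285  // rawPrediction(1) for the record in question

// Path taken when rawPredictionCol is empty: compare the probability to the threshold.
val proba = 1.0 / (1.0 + math.exp(-raw))
val predFromProba = if (proba > t) 1.0 else 0.0          // reported: comparison is false, predicts 0

// Path taken when rawPredictionCol is set: compare the raw score to a derived raw threshold.
val rawThreshold = math.log(t / (1.0 - t))               // reported: -0.9745727593582851
val predFromRaw = if (raw > rawThreshold) 1.0 else 0.0   // reported: comparison is true, predicts 1
{code}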

> LogisticRegression inconsistent prediction when proba == threshold
> --
>
> Key: SPARK-22879
> URL: https://issues.apache.org/jira/browse/SPARK-22879
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.3
>Reporter: Adrien Lavoillotte
>Priority: Minor
>
> I'm using {{org.apache.spark.ml.classification.LogisticRegression}} for 
> binary classification.
> If I predict on a record that yields exactly the probability of the 
> threshold, then the result of {{transform}} is different depending on whether 
> the {{rawPredictionCol}} param is empty on the model or not.
> If it is empty, as most ML tools I've seen, it correctly predicts 0, the rule 
> being {{ if (proba > threshold) then 1 else 0 }} (implemented in 
> {{probability2prediction}}).
> If however {{rawPredictionCol}} is set (default), then it avoids 
> recomputation by calling {{raw2prediction}}, and the rule becomes {{if 
> (rawPrediction(1) > rawThreshold) 1 else 0}}. The {{rawThreshold = math.log(t 
> / (1.0 - t))}} is ever-so-slightly below the {{rawPrediction(1)}}, so it 
> predicts 1.
> The use case is that I choose the threshold amongst 
> {{BinaryClassificationMetrics#thresholds}}, so I get one that corresponds to 
> the probability for one or more of my test set's records. Re-scoring that 
> record or one that yields the same probability exhibits this behaviour.
> Tested this on Spark 1.6 but the code involved seems to be similar on Spark 
> 2.2.






[jira] [Commented] (SPARK-22879) LogisticRegression inconsistent prediction when proba == threshold

2017-12-22 Thread Adrien Lavoillotte (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301718#comment-16301718
 ] 

Adrien Lavoillotte commented on SPARK-22879:


I'm saying that, given a threshold {{t}}, the two conditions below are not 
exactly the same when {{proba == t}} because of floating-point math:

* {{proba(1) > t}} (false)
* {{rawPrediction(1) > rawThreshold}} (true)

Which condition is used depends on whether the raw predictions are already 
computed, which depends on whether the {{rawPredictionCol}} param is empty or 
not. If it's empty, it will predict 0, else (default), it will predict 1.

I hope it's clearer

> LogisticRegression inconsistent prediction when proba == threshold
> --
>
> Key: SPARK-22879
> URL: https://issues.apache.org/jira/browse/SPARK-22879
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.3
>Reporter: Adrien Lavoillotte
>Priority: Minor
>
> I'm using {{org.apache.spark.ml.classification.LogisticRegression}} for 
> binary classification.
> If I predict on a record that yields exactly the probability of the 
> threshold, then the result of {{transform}} is different depending on whether 
> the {{rawPredictionCol}} param is empty on the model or not.
> If it is empty, as most ML tools I've seen, it correctly predicts 0, the rule 
> being {{ if (proba > threshold) then 1 else 0 }} (implemented in 
> {{probability2prediction}}).
> If however {{rawPredictionCol}} is set (default), then it avoids 
> recomputation by calling {{raw2prediction}}, and the rule becomes {{if 
> (rawPrediction(1) > rawThreshold) 1 else 0}}. The {{rawThreshold = math.log(t 
> / (1.0 - t))}} is ever-so-slightly below the {{rawPrediction(1)}}, so it 
> predicts 1.
> The use case is that I choose the threshold amongst 
> {{BinaryClassificationMetrics#thresholds}}, so I get one that corresponds to 
> the probability for one or more of my test set's records. Re-scoring that 
> record or one that yields the same probability exhibits this behaviour.
> Tested this on Spark 1.6 but the code involved seems to be similar on Spark 
> 2.2.






[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301717#comment-16301717
 ] 

Felix Cheung commented on SPARK-22683:
--

I think the challenge here is the ability to determine upfront if the task is 
going to be small/quick.



> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance, and such old executors could become contention 
> points for other executors trying to remotely access blocks on the old ones (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881
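
A back-of-the-envelope sketch of the sizing arithmetic behind the proposal; the 
function and parameter names below are illustrative only, not Spark's actual 
configuration keys or allocation code:

{code:python}
import math

def target_executors(pending_tasks, executor_cores=5, task_cpus=1, tasks_per_slot=1):
    task_slots = executor_cores // task_cpus              # parallel tasks per executor
    return int(math.ceil(float(pending_tasks) / (task_slots * tasks_per_slot)))

print(target_executors(6000))                      # one slot per task (current): 1200 executors
print(target_executors(6000, tasks_per_slot=6))    # reporter's tuned setting:     200 executors
{code}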



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22809) pyspark is sensitive to imports with dots

2017-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22809.
---
Resolution: Not A Problem

It's still not clear this isn't some other error swallowed by the error handling 
in your example. I say that because you show the construct works the second 
time, so it doesn't really seem to be related to the import itself. Spark doesn't 
handle imports anyway; that's the interpreter. At worst you have a workaround for 
whatever this is. If you have a specific change to propose that pinpoints or 
explains what's happening and is attributable to Spark, please reopen.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Cricket Temple
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Exception("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301709#comment-16301709
 ] 

Felix Cheung edited comment on SPARK-22683 at 12/22/17 5:35 PM:


I couldn't find the exact source line, but from running Flink previously I'm 
reasonably sure the number of task slots == the number of cores. Therefore I 
don't think it's meant to increase utilization by overcommitting tasks or 
concurrently running multiple tasks, and so on.


was (Author: felixcheung):
I couldn't find the exact source line, but from running Flink previously I'm 
reasonably sure number of task slots  == number of cores. Therefore I don't 
think it's meant to increase utilization by concurrently running multiple tasks 
and so on.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance, and such old executors could become contention 
> points for other executors trying to remotely access blocks on the old ones (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881

[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301709#comment-16301709
 ] 

Felix Cheung commented on SPARK-22683:
--

I couldn't find the exact source line, but from running Flink previously I'm 
reasonably sure the number of task slots == the number of cores. Therefore I 
don't think it's meant to increase utilization by concurrently running multiple 
tasks, and so on.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance, and such old executors could become contention 
> points for other executors trying to remotely access blocks on the old ones (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To 

[jira] [Commented] (SPARK-22870) Dynamic allocation should allow 0 idle time

2017-12-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301701#comment-16301701
 ] 

Felix Cheung commented on SPARK-22870:
--

+1
Yes, there is more to this than just allowing the value 0, since the cleanup is 
not currently synchronous.

> Dynamic allocation should allow 0 idle time
> ---
>
> Key: SPARK-22870
> URL: https://issues.apache.org/jira/browse/SPARK-22870
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>Priority: Minor
>
> As discussed in SPARK-22765, with SPARK-21656, an executor will not idle out 
> when there are pending tasks to run. When there is no task to run, an 
> executor will die out after {{spark.dynamicAllocation.executorIdleTimeout}}, 
> which is currently required to be greater than zero. However, for efficiency, 
> a user should be able to specify that an executor can die out immediately w/o 
> being required to be idle for at least 1s.
> This is to make {{0}} a valid value for 
> {{spark.dynamicAllocation.executorIdleTimeout}}; special handling of such a 
> case might be needed.
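
A minimal sketch of the validation change this implies, assuming the check sits 
in config validation (illustrative only, not the actual Spark code):

{code:python}
def validate_idle_timeout(seconds):
    # today the timeout must be strictly positive; the proposal relaxes it to >= 0
    if seconds < 0:
        raise ValueError("spark.dynamicAllocation.executorIdleTimeout must be >= 0")
    return seconds

validate_idle_timeout(0)   # rejected today, accepted under this proposal
{code}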



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)

2017-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22880:
--
Priority: Minor  (was: Major)

How is this different from SPARK-22717?

> Add option to cascade jdbc truncate if database supports this (PostgreSQL and 
> Oracle)
> -
>
> Key: SPARK-22880
> URL: https://issues.apache.org/jira/browse/SPARK-22880
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Priority: Minor
>
> When truncating tables, PostgreSQL and Oracle support a `CASCADE` option. 
> This cascades the truncate to tables with foreign key constraints on a column 
> in the table specified to truncate. It would be nice to be able to optionally 
> set this `CASCADE` option for PostgreSQL and Oracle.
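
A minimal sketch of the statement difference such an option would control, 
using PostgreSQL syntax; the function and flag names here are illustrative and 
not the option name the linked work settles on:

{code:python}
def truncate_statement(table, cascade=False):
    # CASCADE also truncates tables that reference `table` via foreign keys
    return "TRUNCATE TABLE {0}{1}".format(table, " CASCADE" if cascade else "")

print(truncate_statement("events"))                  # TRUNCATE TABLE events
print(truncate_statement("events", cascade=True))    # TRUNCATE TABLE events CASCADE
{code}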



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22872) Spark ML Pipeline Model Persistent Support Save Schema Info

2017-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22872.
---
Resolution: Invalid

Questions to the mailing list please

> Spark ML Pipeline Model Persistent Support Save Schema Info
> ---
>
> Key: SPARK-22872
> URL: https://issues.apache.org/jira/browse/SPARK-22872
> Project: Spark
>  Issue Type: IT Help
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Cyanny
>Priority: Minor
> Attachments: jpmml-research.jpg
>
>
> Hi all,
> I have a project about model transformation with PMML; it needs to transform 
> models of different types to PMML files.
> JPMML (https://github.com/jpmml) provides tools to do that, such as 
> jpmml-sklearn, jpmml-xgboost etc. Our transformation API parameters must be 
> concise and simple; in other words, the fewer the better.
> The issue I ran into is that sklearn, TensorFlow, and LightGBM can each 
> produce a single model file that includes both the schema info and the model 
> data, but a Spark PipelineModel only exports model files in Parquet with no 
> schema info. However, the JPMML-Spark converter needs two arguments: the data 
> schema and the PipelineModel.
> *Can a Spark PipelineModel include the input data schema as metadata on 
> export?*
> The support status of machine learning libraries for JPMML is shown in the 
> attached image; only XGBoost and Spark cannot include schema info in the 
> exported model file.
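
Until something like this exists, a hedged workaround sketch: persist the input 
schema as JSON next to the saved PipelineModel, so a converter can be handed 
both pieces. Only documented PySpark APIs are used; the paths are placeholders:

{code:python}
import json
from pyspark.sql.types import StructType

def save_schema(df, path):
    # StructType serializes to JSON via DataFrame.schema.json()
    with open(path, "w") as f:
        f.write(df.schema.json())

def load_schema(path):
    # StructType.fromJson rebuilds the schema from the parsed JSON dict
    with open(path) as f:
        return StructType.fromJson(json.loads(f.read()))
{code}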



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22879) LogisticRegression inconsistent prediction when proba == threshold

2017-12-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301681#comment-16301681
 ] 

Sean Owen commented on SPARK-22879:
---

I'm not quite following here. In both cases the rule is "1 if > threshold". Are 
you saying that the recomputation of the threshold from t isn't equal to the 
original one? That's a little different.

> LogisticRegression inconsistent prediction when proba == threshold
> --
>
> Key: SPARK-22879
> URL: https://issues.apache.org/jira/browse/SPARK-22879
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.3
>Reporter: Adrien Lavoillotte
>Priority: Minor
>
> I'm using {{org.apache.spark.ml.classification.LogisticRegression}} for 
> binary classification.
> If I predict on a record that yields exactly the probability of the 
> threshold, then the result of {{transform}} is different depending on whether 
> the {{rawPredictionCol}} param is empty on the model or not.
> If it is empty, as most ML tools I've seen, it correctly predicts 0, the rule 
> being {{ if (proba > threshold) then 1 else 0 }} (implemented in 
> {{probability2prediction}}).
> If however {{rawPredictionCol}} is set (default), then it avoids 
> recomputation by calling {{raw2prediction}}, and the rule becomes {{if 
> (rawPrediction(1) > rawThreshold) 1 else 0}}. The {{rawThreshold = math.log(t 
> / (1.0 - t))}} is ever-so-slightly below the {{rawPrediction(1)}}, so it 
> predicts 1.
> The use case is that I choose the threshold amongst 
> {{BinaryClassificationMetrics#thresholds}}, so I get one that corresponds to 
> the probability for one or more of my test set's records. Re-scoring that 
> record or one that yields the same probability exhibits this behaviour.
> Tested this on Spark 1.6 but the code involved seems to be similar on Spark 
> 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22875) Assembly build fails for a high user id

2017-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22875:
--
Shepherd:   (was: Steve Loughran)
   Flags:   (was: Patch)
Priority: Minor  (was: Major)

> Assembly build fails for a high user id
> ---
>
> Key: SPARK-22875
> URL: https://issues.apache.org/jira/browse/SPARK-22875
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>Priority: Minor
>
> {code}
> ./build/mvn package -Pbigtop-dist -DskipTests
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project 
> spark-assembly_2.11: Execution dist of goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id 
> '123456789' is too big ( > 2097151 ). -> [Help 1]
> {code}
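
For context on the quoted bound: 2097151 is the largest uid/gid the classic 
(USTAR) tar header can encode, since the field is seven octal digits; the 
assembly's tar.gz packaging runs into exactly that limit here. The snippet 
below just shows the arithmetic:

{code:python}
print(8 ** 7 - 1)    # 2097151, the limit quoted in the Maven error
print(0o7777777)     # the same value written as seven octal digits
{code}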



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22870) Dynamic allocation should allow 0 idle time

2017-12-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301667#comment-16301667
 ] 

Sean Owen commented on SPARK-22870:
---

Seems reasonable to me.

> Dynamic allocation should allow 0 idle time
> ---
>
> Key: SPARK-22870
> URL: https://issues.apache.org/jira/browse/SPARK-22870
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>Priority: Minor
>
> As discussed in SPARK-22765, with SPARK-21656, an executor will not idle out 
> when there are pending tasks to run. When there is no task to run, an 
> executor will die out after {{spark.dynamicAllocation.executorIdleTimeout}}, 
> which is currently required to be greater than zero. However, for efficiency, 
> a user should be able to specify that an executor can die out immediately w/o 
> being required to be idle for at least 1s.
> This is to make {{0}} a valid value for 
> {{spark.dynamicAllocation.executorIdleTimeout}}; special handling of such a 
> case might be needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22880:


Assignee: (was: Apache Spark)

> Add option to cascade jdbc truncate if database supports this (PostgreSQL and 
> Oracle)
> -
>
> Key: SPARK-22880
> URL: https://issues.apache.org/jira/browse/SPARK-22880
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>
> When truncating tables, PostgreSQL and Oracle support a `CASCADE` option. 
> This cascades the truncate to tables with foreign key constraints on a column 
> in the table specified to truncate. It would be nice to be able to optionally 
> set this `CASCADE` option for PostgreSQL and Oracle.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)

2017-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301591#comment-16301591
 ] 

Apache Spark commented on SPARK-22880:
--

User 'danielvdende' has created a pull request for this issue:
https://github.com/apache/spark/pull/20057

> Add option to cascade jdbc truncate if database supports this (PostgreSQL and 
> Oracle)
> -
>
> Key: SPARK-22880
> URL: https://issues.apache.org/jira/browse/SPARK-22880
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>
> When truncating tables, PostgreSQL and Oracle support a `CASCADE` option. 
> This cascades the truncate to tables with foreign key constraints on a column 
> in the table specified to truncate. It would be nice to be able to optionally 
> set this `CASCADE` option for PostgreSQL and Oracle.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22880:


Assignee: Apache Spark

> Add option to cascade jdbc truncate if database supports this (PostgreSQL and 
> Oracle)
> -
>
> Key: SPARK-22880
> URL: https://issues.apache.org/jira/browse/SPARK-22880
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Assignee: Apache Spark
>
> When truncating tables, PostgreSQL and Oracle support a `CASCADE` option. 
> This cascades the truncate to tables with foreign key constraints on a column 
> in the table specified to truncate. It would be nice to be able to optionally 
> set this `CASCADE` option for PostgreSQL and Oracle.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)

2017-12-22 Thread Daniel van der Ende (JIRA)
Daniel van der Ende created SPARK-22880:
---

 Summary: Add option to cascade jdbc truncate if database supports 
this (PostgreSQL and Oracle)
 Key: SPARK-22880
 URL: https://issues.apache.org/jira/browse/SPARK-22880
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.1
Reporter: Daniel van der Ende


When truncating tables, PostgreSQL and Oracle support a `CASCADE` option. 
This cascades the truncate to tables with foreign key constraints on a column 
in the table specified to truncate. It would be nice to be able to optionally 
set this `CASCADE` option for PostgreSQL and Oracle.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22867) Add Isolation Forest algorithm to MLlib

2017-12-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301498#comment-16301498
 ] 

Sean Owen commented on SPARK-22867:
---

The problem is that this argument goes for a hundred things. I don't think MLlib 
aspires to be like scikit-learn, especially because you can easily add a 
third-party package to any app; MLlib covers just the basics. Given that most 
JIRAs like this have been rejected, I doubt this would be different. At the very 
least you'd need to argue that this is widely used (e.g. papers, other 
implementations).

> Add Isolation Forest algorithm to MLlib
> ---
>
> Key: SPARK-22867
> URL: https://issues.apache.org/jira/browse/SPARK-22867
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Fangzhou Yang
>
> Isolation Forest (iForest) is an effective model that focuses on anomaly 
> isolation. 
> iForest uses a tree structure for modeling data; an iTree isolates anomalies 
> closer to the root of the tree than normal points. 
> An anomaly score is calculated by the iForest model to measure the abnormality 
> of the data instances; the lower, the more abnormal.
> More details about iForest can be found in the following papers: 
> Isolation Forest [1] (https://dl.acm.org/citation.cfm?id=1511387) and 
> Isolation-Based Anomaly Detection [2] 
> (https://dl.acm.org/citation.cfm?id=2133363).
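
For orientation, the anomaly score in the cited papers is s(x, n) = 
2^(-E[h(x)] / c(n)), where E[h(x)] is the average path length of point x over 
the iTrees and c(n) normalizes by the average path length of an unsuccessful 
BST search on n points. In the papers' convention a score close to 1 flags an 
anomaly; a given implementation may rescale or invert the score, as the 
description above suggests. A small sketch of the published formula (not an 
implementation of the proposed algorithm):

{code:python}
import math

EULER_GAMMA = 0.5772156649

def c(n):
    # average path length of an unsuccessful BST search over n points
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # H(n-1) approximation
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    return 2.0 ** (-avg_path_length / c(n))

print(anomaly_score(avg_path_length=4.0, n=256))    # ~0.76, leaning anomalous
print(anomaly_score(avg_path_length=10.0, n=256))   # ~0.51, indistinct
{code}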



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22879) LogisticRegression inconsistent prediction when proba == threshold

2017-12-22 Thread Adrien Lavoillotte (JIRA)
Adrien Lavoillotte created SPARK-22879:
--

 Summary: LogisticRegression inconsistent prediction when proba == 
threshold
 Key: SPARK-22879
 URL: https://issues.apache.org/jira/browse/SPARK-22879
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 1.6.3
Reporter: Adrien Lavoillotte
Priority: Minor


I'm using {{org.apache.spark.ml.classification.LogisticRegression}} for binary 
classification.

If I predict on a record that yields exactly the probability of the threshold, 
then the result of {{transform}} is different depending on whether the 
{{rawPredictionCol}} param is empty on the model or not.
If it is empty, as most ML tools I've seen, it correctly predicts 0, the rule 
being {{ if (proba > threshold) then 1 else 0 }} (implemented in 
{{probability2prediction}}).
If however {{rawPredictionCol}} is set (default), then it avoids recomputation 
by calling {{raw2prediction}}, and the rule becomes {{if (rawPrediction(1) > 
rawThreshold) 1 else 0}}. The {{rawThreshold = math.log(t / (1.0 - t))}} is 
ever-so-slightly below the {{rawPrediction(1)}}, so it predicts 1.

The use case is that I choose the threshold amongst 
{{BinaryClassificationMetrics#thresholds}}, so I get one that corresponds to 
the probability for one or more of my test set's records. Re-scoring that 
record or one that yields the same probability exhibits this behaviour.

Tested this on Spark 1.6 but the code involved seems to be similar on Spark 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22878) Count totalDroppedEvents for LiveListenerBus

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22878:


Assignee: (was: Apache Spark)

> Count totalDroppedEvents for LiveListenerBus
> 
>
> Key: SPARK-22878
> URL: https://issues.apache.org/jira/browse/SPARK-22878
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: wuyi
>Priority: Minor
>
> Count total dropped events from all queues' numDroppedEvents for 
> LiveListenerBus and log the dropped events periodically for LiveListenerBus 
> like queues do.
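
An illustrative sketch of the aggregation being asked for; the queue names and 
the field name are assumptions for the example, not the actual LiveListenerBus 
internals:

{code:python}
def total_dropped_events(queues):
    return sum(q["numDroppedEvents"] for q in queues)

queues = [
    {"name": "appStatus", "numDroppedEvents": 12},
    {"name": "executorManagement", "numDroppedEvents": 0},
    {"name": "eventLog", "numDroppedEvents": 3},
]
print(total_dropped_events(queues))   # 15, the value to log periodically
{code}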



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22878) Count totalDroppedEvents for LiveListenerBus

2017-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301341#comment-16301341
 ] 

Apache Spark commented on SPARK-22878:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/20056

> Count totalDroppedEvents for LiveListenerBus
> 
>
> Key: SPARK-22878
> URL: https://issues.apache.org/jira/browse/SPARK-22878
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: wuyi
>Priority: Minor
>
> Count total dropped events from all queues' numDroppedEvents for 
> LiveListenerBus and log the dropped events periodically for LiveListenerBus 
> like queues do.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22878) Count totalDroppedEvents for LiveListenerBus

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22878:


Assignee: Apache Spark

> Count totalDroppedEvents for LiveListenerBus
> 
>
> Key: SPARK-22878
> URL: https://issues.apache.org/jira/browse/SPARK-22878
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Minor
>
> Count total dropped events from all queues' numDroppedEvents for 
> LiveListenerBus and log the dropped events periodically for LiveListenerBus 
> like queues do.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22878) Count totalDroppedEvents for LiveListenerBus

2017-12-22 Thread wuyi (JIRA)
wuyi created SPARK-22878:


 Summary: Count totalDroppedEvents for LiveListenerBus
 Key: SPARK-22878
 URL: https://issues.apache.org/jira/browse/SPARK-22878
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: wuyi
Priority: Minor


Count total dropped events from all queues' numDroppedEvents for 
LiveListenerBus and log the dropped events periodically for LiveListenerBus 
like queues do.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22874) Modify checking pandas version to use LooseVersion.

2017-12-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22874.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Fixed in https://github.com/apache/spark/pull/20054

> Modify checking pandas version to use LooseVersion.
> ---
>
> Key: SPARK-22874
> URL: https://issues.apache.org/jira/browse/SPARK-22874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.3.0
>
>
> Currently we check the pandas version by capturing whether {{ImportError}} is 
> raised for the specific imports, but we can instead compare the 
> {{LooseVersion}} of the version string, the same way we check the pyarrow 
> version.
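
A sketch of the check style being referred to, comparing {{LooseVersion}} 
objects instead of catching {{ImportError}}; the function name and the 
"0.19.2" minimum below are illustrative:

{code:python}
from distutils.version import LooseVersion

def require_minimum_pandas_version(minimum="0.19.2"):
    import pandas
    if LooseVersion(pandas.__version__) < LooseVersion(minimum):
        raise ImportError(
            "pandas >= %s must be installed; found %s" % (minimum, pandas.__version__))
{code}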



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22874) Modify checking pandas version to use LooseVersion.

2017-12-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22874:


Assignee: Takuya Ueshin

> Modify checking pandas version to use LooseVersion.
> ---
>
> Key: SPARK-22874
> URL: https://issues.apache.org/jira/browse/SPARK-22874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>
> Currently we check the pandas version by capturing whether {{ImportError}} is 
> raised for the specific imports, but we can instead compare the 
> {{LooseVersion}} of the version string, the same way we check the pyarrow 
> version.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-22877) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2017-12-22 Thread Jinhan Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinhan Zhong closed SPARK-22877.


> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22877
> URL: https://issues.apache.org/jira/browse/SPARK-22877
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to let a long-running 
> application avoid stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times (n is the minimum of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
> For example, the following setup should allow the application master to fail 
> 20 times:
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> After checking the source code, I found in ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> that there is a ShutdownHook which checks the attempt id against 
> maxAppAttempts; if attempt id >= maxAppAttempts, it will try to unregister 
> the application and the application will finish.
> Is this an expected design or a bug?
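
A simplified sketch of the check being pointed at: the shutdown hook compares 
the raw attempt id against the attempt cap and never consults 
spark.yarn.am.attemptFailuresValidityInterval (names simplified, not the actual 
ApplicationMaster.scala code):

{code:python}
def should_unregister(attempt_id, spark_max_app_attempts=20, client_rm_max_attempts=20):
    # the effective cap is the minimum of the two client-side settings
    max_app_attempts = min(spark_max_app_attempts, client_rm_max_attempts)
    return attempt_id >= max_app_attempts   # the validity interval plays no part here

print(should_unregister(20))   # True -> the app finishes, even though old failures
                               # should have expired under the 1s validity interval
{code}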



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22877) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2017-12-22 Thread Jinhan Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinhan Zhong resolved SPARK-22877.
--
Resolution: Duplicate

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22877
> URL: https://issues.apache.org/jira/browse/SPARK-22877
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to let a long-running 
> application avoid stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times (n is the minimum of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
> For example, the following setup should allow the application master to fail 
> 20 times:
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> After checking the source code, I found in ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> that there is a ShutdownHook which checks the attempt id against 
> maxAppAttempts; if attempt id >= maxAppAttempts, it will try to unregister 
> the application and the application will finish.
> Is this an expected design or a bug?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2017-12-22 Thread Jinhan Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinhan Zhong updated SPARK-22876:
-
Description: 
I assume we can use spark.yarn.maxAppAttempts together with 
spark.yarn.am.attemptFailuresValidityInterval to let a long-running application 
avoid stopping after an acceptable number of failures.

But after testing, I found that the application always stops after failing n 
times (n is the minimum of spark.yarn.maxAppAttempts and 
yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).

For example, the following setup should allow the application master to fail 20 
times:
* spark.yarn.am.attemptFailuresValidityInterval=1s
* spark.yarn.maxAppAttempts=20
* yarn client: yarn.resourcemanager.am.max-attempts=20
* yarn resource manager: yarn.resourcemanager.am.max-attempts=3

After checking the source code, I found in ApplicationMaster.scala 
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
that there is a ShutdownHook which checks the attempt id against maxAppAttempts; 
if attempt id >= maxAppAttempts, it will try to unregister the application and 
the application will finish.

Is this an expected design or a bug?


  was:
I assume we can use spark.yarn.maxAppAttempts together with 
spark.yarn.am.attemptFailuresValidityInterval to make a long running 
application avoid stopping  after acceptable number of failures.

But after testing, I found that the application always stops after failing n 
times ( n is minimum value of spark.yarn.maxAppAttempts and 
yarn.resourcemanager.am.max-attempts from client yarn-site.xml)

for example, following setup will allow the application master to fail 20 times.
* spark.yarn.am.attemptFailuresValidityInterval=1s
* spark.yarn.maxAppAttempts=20
* yarn client: yarn.resourcemanager.am.max-attempts=20
* yarn resource manager: yarn.resourcemanager.am.max-attempts=3

And after checking the source code, I found in source file 
ApplicationMaster.scala 
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293[https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293]

there's a ShutdownHook that checks the attempt id against the maxAppAttempts, 
if attempt id >= maxAppAttempts, it will try to unregister the application and 
the application will finish.

is this a expected design or a bug?



> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22876
> URL: https://issues.apache.org/jira/browse/SPARK-22876
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to let a long-running 
> application avoid stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times (n is the minimum of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
> For example, the following setup should allow the application master to fail 
> 20 times:
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> After checking the source code, I found in ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> that there is a ShutdownHook which checks the attempt id against 
> maxAppAttempts; if attempt id >= maxAppAttempts, it will try to unregister 
> the application and the application will finish.
> Is this an expected design or a bug?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22875) Assembly build fails for a high user id

2017-12-22 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-22875:
--
Flags: Patch

> Assembly build fails for a high user id
> ---
>
> Key: SPARK-22875
> URL: https://issues.apache.org/jira/browse/SPARK-22875
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>
> {code}
> ./build/mvn package -Pbigtop-dist -DskipTests
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project 
> spark-assembly_2.11: Execution dist of goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id 
> '123456789' is too big ( > 2097151 ). -> [Help 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22877) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2017-12-22 Thread Jinhan Zhong (JIRA)
Jinhan Zhong created SPARK-22877:


 Summary: spark.yarn.am.attemptFailuresValidityInterval does not 
work correctly
 Key: SPARK-22877
 URL: https://issues.apache.org/jira/browse/SPARK-22877
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0
 Environment: hadoop version 2.7.3
Reporter: Jinhan Zhong
Priority: Minor


I assume we can use spark.yarn.maxAppAttempts together with 
spark.yarn.am.attemptFailuresValidityInterval to let a long-running application 
avoid stopping after an acceptable number of failures.

But after testing, I found that the application always stops after failing n 
times (n is the minimum of spark.yarn.maxAppAttempts and 
yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).

For example, the following setup should allow the application master to fail 20 
times:
* spark.yarn.am.attemptFailuresValidityInterval=1s
* spark.yarn.maxAppAttempts=20
* yarn client: yarn.resourcemanager.am.max-attempts=20
* yarn resource manager: yarn.resourcemanager.am.max-attempts=3

After checking the source code, I found in ApplicationMaster.scala 
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
that there is a ShutdownHook which checks the attempt id against maxAppAttempts; 
if attempt id >= maxAppAttempts, it will try to unregister the application and 
the application will finish.

Is this an expected design or a bug?




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22875) Assembly build fails for a high user id

2017-12-22 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-22875:
--
Shepherd: Steve Loughran

> Assembly build fails for a high user id
> ---
>
> Key: SPARK-22875
> URL: https://issues.apache.org/jira/browse/SPARK-22875
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>
> {code}
> ./build/mvn package -Pbigtop-dist -DskipTests
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project 
> spark-assembly_2.11: Execution dist of goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id 
> '123456789' is too big ( > 2097151 ). -> [Help 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2017-12-22 Thread Jinhan Zhong (JIRA)
Jinhan Zhong created SPARK-22876:


 Summary: spark.yarn.am.attemptFailuresValidityInterval does not 
work correctly
 Key: SPARK-22876
 URL: https://issues.apache.org/jira/browse/SPARK-22876
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0
 Environment: hadoop version 2.7.3
Reporter: Jinhan Zhong
Priority: Minor


I assume we can use spark.yarn.maxAppAttempts together with 
spark.yarn.am.attemptFailuresValidityInterval to let a long-running application 
avoid stopping after an acceptable number of failures.

But after testing, I found that the application always stops after failing n 
times (n is the minimum of spark.yarn.maxAppAttempts and 
yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).

For example, the following setup should allow the application master to fail 20 
times:
* spark.yarn.am.attemptFailuresValidityInterval=1s
* spark.yarn.maxAppAttempts=20
* yarn client: yarn.resourcemanager.am.max-attempts=20
* yarn resource manager: yarn.resourcemanager.am.max-attempts=3

After checking the source code, I found in ApplicationMaster.scala 
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
that there is a ShutdownHook which checks the attempt id against maxAppAttempts; 
if attempt id >= maxAppAttempts, it will try to unregister the application and 
the application will finish.

Is this an expected design or a bug?




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22875) Assembly build fails for a high user id

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22875:


Assignee: (was: Apache Spark)

> Assembly build fails for a high user id
> ---
>
> Key: SPARK-22875
> URL: https://issues.apache.org/jira/browse/SPARK-22875
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>
> {code}
> ./build/mvn package -Pbigtop-dist -DskipTests
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project 
> spark-assembly_2.11: Execution dist of goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id 
> '123456789' is too big ( > 2097151 ). -> [Help 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22875) Assembly build fails for a high user id

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22875:


Assignee: Apache Spark

> Assembly build fails for a high user id
> ---
>
> Key: SPARK-22875
> URL: https://issues.apache.org/jira/browse/SPARK-22875
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>Assignee: Apache Spark
>
> {code}
> ./build/mvn package -Pbigtop-dist -DskipTests
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project 
> spark-assembly_2.11: Execution dist of goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id 
> '123456789' is too big ( > 2097151 ). -> [Help 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22875) Assembly build fails for a high user id

2017-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301148#comment-16301148
 ] 

Apache Spark commented on SPARK-22875:
--

User 'gerashegalov' has created a pull request for this issue:
https://github.com/apache/spark/pull/20055

> Assembly build fails for a high user id
> ---
>
> Key: SPARK-22875
> URL: https://issues.apache.org/jira/browse/SPARK-22875
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>
> {code}
> ./build/mvn package -Pbigtop-dist -DskipTests
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project 
> spark-assembly_2.11: Execution dist of goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id 
> '123456789' is too big ( > 2097151 ). -> [Help 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22875) Assembly build fails for a high user id

2017-12-22 Thread Gera Shegalov (JIRA)
Gera Shegalov created SPARK-22875:
-

 Summary: Assembly build fails for a high user id
 Key: SPARK-22875
 URL: https://issues.apache.org/jira/browse/SPARK-22875
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.1
Reporter: Gera Shegalov


{code}
./build/mvn package -Pbigtop-dist -DskipTests
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project 
spark-assembly_2.11: Execution dist of goal 
org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id 
'123456789' is too big ( > 2097151 ). -> [Help 1]
{code}





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22872) Spark ML Pipeline Model Persistent Support Save Schema Info

2017-12-22 Thread Cyanny (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyanny updated SPARK-22872:
---
Description: 
Hi all,
I have a project about model transformation with PMML; it needs to transform 
models of different types into PMML files.
JPMML (https://github.com/jpmml) provides tools for this, such as 
jpmml-sklearn, jpmml-xgboost, etc. Our transformation API parameters must be 
concise and simple; in other words, the fewer the better.

The issue I ran into is that sklearn, tensorflow, and lightgbm can each produce 
a single model file that contains both the schema info and the model data, but 
a Spark PipelineModel only exports a model file in parquet, with no schema info 
in it. However, the JPMML-SPARK converter needs two arguments: the data schema 
and the PipelineModel.

*Can a Spark PipelineModel include the input data schema as metadata when it is exported?*

The support of the different machine learning libraries in JPMML is shown in 
the attached image; only xgboost and Spark cannot include schema info in the 
exported model file.
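
A minimal Scala sketch of one possible workaround (this is not an existing 
PipelineModel feature; the object name, helper methods, and paths are made up 
for illustration): persist the input schema as JSON next to the saved 
pipeline, so a converter such as JPMML-SPARK can load both pieces.

{code:scala}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructType}

object PipelineWithSchemaSketch {
  // Save the fitted pipeline and, next to it, the input schema as JSON
  // (the StructType.json / DataType.fromJson round-trip is part of Spark SQL).
  def save(model: PipelineModel, trainingData: DataFrame, dir: String): Unit = {
    model.write.overwrite().save(s"$dir/model")
    Files.write(
      Paths.get(s"$dir/schema.json"),
      trainingData.schema.json.getBytes(StandardCharsets.UTF_8))
  }

  // Load both pieces back: the schema plus the PipelineModel is exactly what a
  // JPMML-SPARK style converter needs.
  def load(dir: String): (PipelineModel, StructType) = {
    val json = new String(
      Files.readAllBytes(Paths.get(s"$dir/schema.json")), StandardCharsets.UTF_8)
    (PipelineModel.load(s"$dir/model"), DataType.fromJson(json).asInstanceOf[StructType])
  }
}
{code}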


  was:
Hi all,
I have a project about model transformation with PMML, it  needs to transform 
models with different types to pmml files.
Moreover, JPMML(https://github.com/jpmml/jpmml-sparkml) has provided many tools 
to do that. I need to provide a uniform API for user, the API parameters must 
be concise and simple, in other words the less the better.

I came with a issue that, sklearn, tensorflow, and lightgbm can produce only 
one model file, including schema info and model data info.
but Spark PipelineModel only export a model file in parquet, there is no schema 
info in the model file. However, JPMML-SPARK converter needs two arguments: 
Data Schema and PipelineModel

*Can spark PipelineModel include input data schema as metadata when do export? *

The situations about machine learning libraries to jpmml are as the attached 
img, only xgboost and spark can't include schema info in exported model file.



> Spark ML Pipeline Model Persistent Support Save Schema Info
> ---
>
> Key: SPARK-22872
> URL: https://issues.apache.org/jira/browse/SPARK-22872
> Project: Spark
>  Issue Type: IT Help
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Cyanny
>Priority: Minor
> Attachments: jpmml-research.jpg
>
>
> Hi all,
> I have a project about model transformation with PMML, it  needs to transform 
> models with different types to pmml files.
> Moreover, JPMML(https://github.com/jpmml) has provided tools to do that,such 
> as jpmml-sklearn, jpmml-xgboost etc. Our transformation API parameters must 
> be concise and simple, in other words the less the better.
> I came with a issue that, sklearn, tensorflow, and lightgbm can produce only 
> one model file, including schema info and model data info.
> but Spark PipelineModel only export a model file in parquet, there is no 
> schema info in the model file. However, JPMML-SPARK converter needs two 
> arguments: Data Schema and PipelineModel
> *Can spark PipelineModel include input data schema as metadata when do 
> export? *
> The situations about machine learning libraries to jpmml are as the attached 
> image, only xgboost and spark can't include schema info in exported model 
> file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22872) Spark ML Pipeline Model Persistent Support Save Schema Info

2017-12-22 Thread Cyanny (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyanny updated SPARK-22872:
---
Description: 
Hi all,
I have a project about model transformation with PMML; it needs to transform 
models of different types into PMML files.
JPMML (https://github.com/jpmml/jpmml-sparkml) provides many tools for this. I 
need to provide a uniform API for users, and the API parameters must be 
concise and simple; in other words, the fewer the better.

The issue I ran into is that sklearn, tensorflow, and lightgbm can each produce 
a single model file that contains both the schema info and the model data, but 
a Spark PipelineModel only exports a model file in parquet, with no schema info 
in it. However, the JPMML-SPARK converter needs two arguments: the data schema 
and the PipelineModel.

*Can a Spark PipelineModel include the input data schema as metadata when it is exported?*

The support of the different machine learning libraries in JPMML is shown in 
the attached image; only xgboost and Spark cannot include schema info in the 
exported model file.


  was:
Hi all,
I recently did a research about pmml, and our project needs to transform many 
models with different types to pmml files.
Moreover, JPMML(https://github.com/jpmml/jpmml-sparkml) has provided many tools 
to do that. I need to provide a uniform API for user, the API parameters must 
be concise and simple, in other words the less the better.

I came with a issue that, sklearn, tensorflow, and lightgbm can produce only 
one model file, including schema info and model data info.
but Spark PipelineModel only export a model file in parquet, there is no schema 
info in the model file. However, JPMML-SPARK converter needs two arguments: 
Data Schema and PipelineModel

*Can spark PipelineModel include input data schema as metadata when do export? *

The situations about machine learning libraries to jpmml are as the attached 
img, only xgboost and spark can't include schema info in exported model file.



> Spark ML Pipeline Model Persistent Support Save Schema Info
> ---
>
> Key: SPARK-22872
> URL: https://issues.apache.org/jira/browse/SPARK-22872
> Project: Spark
>  Issue Type: IT Help
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Cyanny
>Priority: Minor
> Attachments: jpmml-research.jpg
>
>
> Hi all,
> I have a project about model transformation with PMML, it  needs to transform 
> models with different types to pmml files.
> Moreover, JPMML(https://github.com/jpmml/jpmml-sparkml) has provided many 
> tools to do that. I need to provide a uniform API for user, the API 
> parameters must be concise and simple, in other words the less the better.
> I came with a issue that, sklearn, tensorflow, and lightgbm can produce only 
> one model file, including schema info and model data info.
> but Spark PipelineModel only export a model file in parquet, there is no 
> schema info in the model file. However, JPMML-SPARK converter needs two 
> arguments: Data Schema and PipelineModel
> *Can spark PipelineModel include input data schema as metadata when do 
> export? *
> The situations about machine learning libraries to jpmml are as the attached 
> img, only xgboost and spark can't include schema info in exported model file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22874) Modify checking pandas version to use LooseVersion.

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22874:


Assignee: Apache Spark

> Modify checking pandas version to use LooseVersion.
> ---
>
> Key: SPARK-22874
> URL: https://issues.apache.org/jira/browse/SPARK-22874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>
> Currently we check pandas version by capturing if {{ImportError}} for the 
> specific imports is raised or not but we can compare {{LooseVersion}} of the 
> version string as the same as we're checking pyarrow version.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22874) Modify checking pandas version to use LooseVersion.

2017-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22874:


Assignee: (was: Apache Spark)

> Modify checking pandas version to use LooseVersion.
> ---
>
> Key: SPARK-22874
> URL: https://issues.apache.org/jira/browse/SPARK-22874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>
> Currently we check pandas version by capturing if {{ImportError}} for the 
> specific imports is raised or not but we can compare {{LooseVersion}} of the 
> version string as the same as we're checking pyarrow version.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22874) Modify checking pandas version to use LooseVersion.

2017-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301101#comment-16301101
 ] 

Apache Spark commented on SPARK-22874:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20054

> Modify checking pandas version to use LooseVersion.
> ---
>
> Key: SPARK-22874
> URL: https://issues.apache.org/jira/browse/SPARK-22874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>
> Currently we check pandas version by capturing if {{ImportError}} for the 
> specific imports is raised or not but we can compare {{LooseVersion}} of the 
> version string as the same as we're checking pyarrow version.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22874) Modify checking pandas version to use LooseVersion.

2017-12-22 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-22874:
-

 Summary: Modify checking pandas version to use LooseVersion.
 Key: SPARK-22874
 URL: https://issues.apache.org/jira/browse/SPARK-22874
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.3.0
Reporter: Takuya Ueshin


Currently we check the pandas version by capturing whether an {{ImportError}} 
is raised for the specific imports, but we could instead compare the 
{{LooseVersion}} of the version string, in the same way we check the pyarrow 
version.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22872) Spark ML Pipeline Model Persistent Support Save Schema Info

2017-12-22 Thread Cyanny (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyanny updated SPARK-22872:
---
Description: 
Hi all,
I recently did some research on PMML, and our project needs to transform many 
models of different types into PMML files.
JPMML (https://github.com/jpmml/jpmml-sparkml) provides many tools for this. I 
need to provide a uniform API for users, and the API parameters must be 
concise and simple; in other words, the fewer the better.

The issue I ran into is that sklearn, tensorflow, and lightgbm can each produce 
a single model file that contains both the schema info and the model data, but 
a Spark PipelineModel only exports a model file in parquet, with no schema info 
in it. However, the JPMML-SPARK converter needs two arguments: the data schema 
and the PipelineModel.

*Can a Spark PipelineModel include the input data schema as metadata when it is exported?*

The support of the different machine learning libraries in JPMML is shown in 
the attached image; only xgboost and Spark cannot include schema info in the 
exported model file.


  was:
Hi all,
I recently did a research about pmml, and our project needs to transform many 
models with different types to pmml files.
Moreover, JPMML(https://github.com/jpmml/jpmml-sparkml) has provided many tools 
to do that. I need to provide a uniform API for user, the API parameters must 
be concise and simple, in other words the less the better.

I came with a issue that, sklearn, tensorflow, and lightgbm can produce only 
one model file, including schema info and model data info.
but Spark PipelineModel only export a model file in parquet, there is no schema 
info in the model file. However, JPMML-SPARK converter needs two arguments: 
Data Schema and PipelineModel

*Can spark PipelineModel include input data schema when export to a file? *

I found a solution, use dataframe API to export schema:
```dataframe.limit(1).write.format("parquet").save("./model.schema")```

*Are there any solutions to get the PipelineModel input schema?*

The situations about machine learning libraries to jpmml are as the attached img



> Spark ML Pipeline Model Persistent Support Save Schema Info
> ---
>
> Key: SPARK-22872
> URL: https://issues.apache.org/jira/browse/SPARK-22872
> Project: Spark
>  Issue Type: IT Help
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Cyanny
>Priority: Minor
> Attachments: jpmml-research.jpg
>
>
> Hi all,
> I recently did a research about pmml, and our project needs to transform many 
> models with different types to pmml files.
> Moreover, JPMML(https://github.com/jpmml/jpmml-sparkml) has provided many 
> tools to do that. I need to provide a uniform API for user, the API 
> parameters must be concise and simple, in other words the less the better.
> I came with a issue that, sklearn, tensorflow, and lightgbm can produce only 
> one model file, including schema info and model data info.
> but Spark PipelineModel only export a model file in parquet, there is no 
> schema info in the model file. However, JPMML-SPARK converter needs two 
> arguments: Data Schema and PipelineModel
> *Can spark PipelineModel include input data schema as metadata when do 
> export? *
> The situations about machine learning libraries to jpmml are as the attached 
> img, only xgboost and spark can't include schema info in exported model file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22872) Spark ML Pipeline Model Persistent Support Save Schema Info

2017-12-22 Thread Cyanny (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyanny updated SPARK-22872:
---
Description: 
Hi all,
I recently did some research on PMML, and our project needs to transform many 
models of different types into PMML files.
JPMML (https://github.com/jpmml/jpmml-sparkml) provides many tools for this. I 
need to provide a uniform API for users, and the API parameters must be 
concise and simple; in other words, the fewer the better.

The issue I ran into is that sklearn, tensorflow, and lightgbm can each produce 
a single model file that contains both the schema info and the model data, but 
a Spark PipelineModel only exports a model file in parquet, with no schema info 
in it. However, the JPMML-SPARK converter needs two arguments: the data schema 
and the PipelineModel.

*Can a Spark PipelineModel include the input data schema when it is exported to a file?*

I found a workaround that uses the DataFrame API to export the schema:
{{dataframe.limit(1).write.format("parquet").save("./model.schema")}}

*Are there any better solutions for getting the PipelineModel input schema?*

The support of the different machine learning libraries in JPMML is shown in 
the attached image.


  was:
Hi all,
I recently did a research about pmml, and my project needs to transform many 
models with different type to pmml files.
Moreover, JPMML(https://github.com/jpmml/jpmml-sparkml) has provided many tools 
to do that. I need to provide a uniform API for user, the API parameters must 
be concise and simple, in other words the less the better.

I came with a issue that, sklearn, tensorflow, and lightgbm can produce only 
one model file, including schema info and model data info.
but Spark PipelineModel only export a model file in parquet, there is no schema 
info in the model file. However, JPMML-SPARK converter needs two arguments: 
Data Schema and PipelineModel

*Can spark PipelineModel include input data schema when export to a file? *

I found a solution, use dataframe API to export schema:
```dataframe.limit(1).write.format("parquet").save("./model.schema")```

*Are there any solutions to get the PipelineModel input schema?*

The situations about machine learning libraries to jpmml are as the attached img



> Spark ML Pipeline Model Persistent Support Save Schema Info
> ---
>
> Key: SPARK-22872
> URL: https://issues.apache.org/jira/browse/SPARK-22872
> Project: Spark
>  Issue Type: IT Help
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Cyanny
>Priority: Minor
> Attachments: jpmml-research.jpg
>
>
> Hi all,
> I recently did a research about pmml, and our project needs to transform many 
> models with different types to pmml files.
> Moreover, JPMML(https://github.com/jpmml/jpmml-sparkml) has provided many 
> tools to do that. I need to provide a uniform API for user, the API 
> parameters must be concise and simple, in other words the less the better.
> I came with a issue that, sklearn, tensorflow, and lightgbm can produce only 
> one model file, including schema info and model data info.
> but Spark PipelineModel only export a model file in parquet, there is no 
> schema info in the model file. However, JPMML-SPARK converter needs two 
> arguments: Data Schema and PipelineModel
> *Can spark PipelineModel include input data schema when export to a file? *
> I found a solution, use dataframe API to export schema:
> ```dataframe.limit(1).write.format("parquet").save("./model.schema")```
> *Are there any solutions to get the PipelineModel input schema?*
> The situations about machine learning libraries to jpmml are as the attached 
> img



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org