[jira] [Updated] (SPARK-22891) NullPointerException when using a UDF
[ https://issues.apache.org/jira/browse/SPARK-22891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaoyang updated SPARK-22891: Description: In my application, I use multiple threads. Each thread has its own SparkSession and uses SparkSession.sqlContext.udf.register to register my UDF. Sometimes an exception like this is thrown: {code:java} java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder': at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1062) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:137) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:136) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:136) at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:133) at org.apache.spark.sql.SparkSession.udf(SparkSession.scala:207) at org.apache.spark.sql.SQLContext.udf(SQLContext.scala:203) at com.game.data.stat.clusterTask.tools.standard.IpConverterRegister$.run(IpConverterRegister.scala:63) at ... 
20 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264) at org.apache.spark.sql.hive.client.HiveClientImpl.newSession(HiveClientImpl.scala:789) at org.apache.spark.sql.hive.client.HiveClientImpl.newSession(HiveClientImpl.scala:79) at org.apache.spark.sql.hive.HiveSessionStateBuilder.resourceLoader$lzycompute(HiveSessionStateBuilder.scala:45) at org.apache.spark.sql.hive.HiveSessionStateBuilder.resourceLoader(HiveSessionStateBuilder.scala:44) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:61) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35) at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289) at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1059) ... 20 more Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:744) at org.apache.hadoop.hive.ql.session.SessionState.getAuthenticator(SessionState.java:1391) at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:210) ... 
34 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException at org.apache.hadoop.hive.ql.session.SessionState.setAuthorizerV2Config(SessionState.java:769) at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:736) ... 36 more Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.isCompatibleWith(HiveMetaStoreClient.java:287) at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156) at com.sun.proxy.$Proxy25.isCompatibleWith(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:206) at org.apache.hadoop.hive.ql.session.SessionState.setAuthorizerV2Config(SessionState.java:765) ... 37 more {code} Also, I use Apache Hive 2.1.1 in my cluster. When I use Spark 2.1.x, the exception above never happens. was: In my application,i use multi threads. Each thread has a SparkSession and use SparkSession.sqlContext.udf.register to register my udf. Sometimes there throw exception like this: {code:java} java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder': at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1062) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:137)
[jira] [Created] (SPARK-22891) NullPointerException when using a UDF
gaoyang created SPARK-22891: --- Summary: NullPointerException when using a UDF Key: SPARK-22891 URL: https://issues.apache.org/jira/browse/SPARK-22891 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1, 2.2.0 Environment: hadoop 2.7.2 Reporter: gaoyang In my application, I use multiple threads. Each thread has its own SparkSession and uses SparkSession.sqlContext.udf.register to register my UDF. Sometimes an exception like this is thrown: {code:java} java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder': at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1062) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:137) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:136) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:136) at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:133) at org.apache.spark.sql.SparkSession.udf(SparkSession.scala:207) at org.apache.spark.sql.SQLContext.udf(SQLContext.scala:203) at com.game.data.stat.clusterTask.tools.standard.IpConverterRegister$.run(IpConverterRegister.scala:63) at ... 
20 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264) at org.apache.spark.sql.hive.client.HiveClientImpl.newSession(HiveClientImpl.scala:789) at org.apache.spark.sql.hive.client.HiveClientImpl.newSession(HiveClientImpl.scala:79) at org.apache.spark.sql.hive.HiveSessionStateBuilder.resourceLoader$lzycompute(HiveSessionStateBuilder.scala:45) at org.apache.spark.sql.hive.HiveSessionStateBuilder.resourceLoader(HiveSessionStateBuilder.scala:44) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:61) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35) at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289) at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1059) ... 20 more Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:744) at org.apache.hadoop.hive.ql.session.SessionState.getAuthenticator(SessionState.java:1391) at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:210) ... 
34 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException at org.apache.hadoop.hive.ql.session.SessionState.setAuthorizerV2Config(SessionState.java:769) at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:736) ... 36 more Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.isCompatibleWith(HiveMetaStoreClient.java:287) at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156) at com.sun.proxy.$Proxy25.isCompatibleWith(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:206) at org.apache.hadoop.hive.ql.session.SessionState.setAuthorizerV2Config(SessionState.java:765) ... 37 more {code} Also, I use Apache Hive 2.1.1 in my cluster. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
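The trace above bottoms out in a NullPointerException raised while Hive session state is being set up from several threads at once. Spark internals aside, the shape of the bug is a classic lazy-initialization race: each thread's session ends up touching shared state that is built on first use. A stdlib-only Python sketch of the pattern and the usual lock-based guard (nothing here is Spark code; `LazyClient` is a stand-in for the lazily built Hive client):

```python
import threading

class LazyClient:
    """Lazily-built shared resource, guarded by a lock.

    Without the lock, two threads can race through the None check and
    observe a half-built client -- the analogue of the NPE in the report.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._client = None

    def get(self):
        with self._lock:                      # serialize first-use construction
            if self._client is None:
                self._client = {"conf": "loaded"}   # stand-in for client setup
            return self._client

shared = LazyClient()
results = []

def worker():
    # Each "session" thread ends up using the same lazily-built client.
    results.append(shared.get())

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread saw the same, fully-initialized client.
assert all(r is results[0] for r in results)
```

The reporter's note that Spark 2.1.x does not show the problem is consistent with this being a concurrency issue in the 2.2.x session-state construction path rather than in the UDF itself.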
[jira] [Resolved] (SPARK-22789) Add ContinuousExecution for continuous processing queries
[ https://issues.apache.org/jira/browse/SPARK-22789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-22789. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19984 [https://github.com/apache/spark/pull/19984] > Add ContinuousExecution for continuous processing queries > - > > Key: SPARK-22789 > URL: https://issues.apache.org/jira/browse/SPARK-22789 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres > Fix For: 2.3.0 > >
[jira] [Assigned] (SPARK-22789) Add ContinuousExecution for continuous processing queries
[ https://issues.apache.org/jira/browse/SPARK-22789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-22789: Assignee: Jose Torres > Add ContinuousExecution for continuous processing queries > - > > Key: SPARK-22789 > URL: https://issues.apache.org/jira/browse/SPARK-22789 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >Assignee: Jose Torres > Fix For: 2.3.0 > >
[jira] [Assigned] (SPARK-22890) Basic tests for DateTimeOperations
[ https://issues.apache.org/jira/browse/SPARK-22890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22890: Assignee: (was: Apache Spark) > Basic tests for DateTimeOperations > -- > > Key: SPARK-22890 > URL: https://issues.apache.org/jira/browse/SPARK-22890 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang >
[jira] [Commented] (SPARK-22890) Basic tests for DateTimeOperations
[ https://issues.apache.org/jira/browse/SPARK-22890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302174#comment-16302174 ] Apache Spark commented on SPARK-22890: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/20061 > Basic tests for DateTimeOperations > -- > > Key: SPARK-22890 > URL: https://issues.apache.org/jira/browse/SPARK-22890 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang >
[jira] [Assigned] (SPARK-22890) Basic tests for DateTimeOperations
[ https://issues.apache.org/jira/browse/SPARK-22890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22890: Assignee: Apache Spark > Basic tests for DateTimeOperations > -- > > Key: SPARK-22890 > URL: https://issues.apache.org/jira/browse/SPARK-22890 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang >Assignee: Apache Spark >
[jira] [Created] (SPARK-22890) Basic tests for DateTimeOperations
Yuming Wang created SPARK-22890: --- Summary: Basic tests for DateTimeOperations Key: SPARK-22890 URL: https://issues.apache.org/jira/browse/SPARK-22890 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Yuming Wang
[jira] [Assigned] (SPARK-22881) ML test for StructuredStreaming: spark.ml.regression
[ https://issues.apache.org/jira/browse/SPARK-22881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22881: Assignee: Weichen Xu (was: Apache Spark) > ML test for StructuredStreaming: spark.ml.regression > > > Key: SPARK-22881 > URL: https://issues.apache.org/jira/browse/SPARK-22881 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843
[jira] [Commented] (SPARK-22881) ML test for StructuredStreaming: spark.ml.regression
[ https://issues.apache.org/jira/browse/SPARK-22881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302166#comment-16302166 ] Apache Spark commented on SPARK-22881: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/19979 > ML test for StructuredStreaming: spark.ml.regression > > > Key: SPARK-22881 > URL: https://issues.apache.org/jira/browse/SPARK-22881 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843
[jira] [Assigned] (SPARK-22881) ML test for StructuredStreaming: spark.ml.regression
[ https://issues.apache.org/jira/browse/SPARK-22881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22881: Assignee: Apache Spark (was: Weichen Xu) > ML test for StructuredStreaming: spark.ml.regression > > > Key: SPARK-22881 > URL: https://issues.apache.org/jira/browse/SPARK-22881 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843
[jira] [Assigned] (SPARK-22889) CRAN checks can fail if older Spark install exists
[ https://issues.apache.org/jira/browse/SPARK-22889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22889: Assignee: (was: Apache Spark) > CRAN checks can fail if older Spark install exists > -- > > Key: SPARK-22889 > URL: https://issues.apache.org/jira/browse/SPARK-22889 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1, 2.3.0 >Reporter: Shivaram Venkataraman > > Since all CRAN checks go through the same machine, if there is an older > partial download or partial install of Spark left behind the tests fail. One > solution is to overwrite the install files when running tests.
[jira] [Assigned] (SPARK-22889) CRAN checks can fail if older Spark install exists
[ https://issues.apache.org/jira/browse/SPARK-22889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22889: Assignee: Apache Spark > CRAN checks can fail if older Spark install exists > -- > > Key: SPARK-22889 > URL: https://issues.apache.org/jira/browse/SPARK-22889 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1, 2.3.0 >Reporter: Shivaram Venkataraman >Assignee: Apache Spark > > Since all CRAN checks go through the same machine, if there is an older > partial download or partial install of Spark left behind the tests fail. One > solution is to overwrite the install files when running tests.
[jira] [Commented] (SPARK-22889) CRAN checks can fail if older Spark install exists
[ https://issues.apache.org/jira/browse/SPARK-22889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302139#comment-16302139 ] Apache Spark commented on SPARK-22889: -- User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/20060 > CRAN checks can fail if older Spark install exists > -- > > Key: SPARK-22889 > URL: https://issues.apache.org/jira/browse/SPARK-22889 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1, 2.3.0 >Reporter: Shivaram Venkataraman > > Since all CRAN checks go through the same machine, if there is an older > partial download or partial install of Spark left behind the tests fail. One > solution is to overwrite the install files when running tests.
[jira] [Created] (SPARK-22889) CRAN checks can fail if older Spark install exists
Shivaram Venkataraman created SPARK-22889: - Summary: CRAN checks can fail if older Spark install exists Key: SPARK-22889 URL: https://issues.apache.org/jira/browse/SPARK-22889 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.1, 2.3.0 Reporter: Shivaram Venkataraman Since all CRAN checks go through the same machine, if there is an older partial download or partial install of Spark left behind the tests fail. One solution is to overwrite the install files when running tests.
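The fix proposed in the ticket is to overwrite leftover install files when running tests. A minimal Python sketch of that "discard anything stale, then install fresh" idea (the function and file names here are invented for illustration; SparkR's actual install code may differ):

```python
import shutil
import tempfile
from pathlib import Path

def install_fresh(install_dir: Path, payload: dict) -> None:
    """Remove any stale/partial install before writing the new one.

    A leftover partial download would otherwise be mistaken for a
    complete install and make later checks fail.
    """
    if install_dir.exists():
        shutil.rmtree(install_dir)        # overwrite: discard partial leftovers
    install_dir.mkdir(parents=True)
    for name, content in payload.items():
        (install_dir / name).write_text(content)

# Simulate a stale partial install, then a clean re-install.
root = Path(tempfile.mkdtemp())
target = root / "spark-2.2.1-bin-hadoop2.7"
target.mkdir(parents=True)
(target / "partial.tmp").write_text("truncated download")

install_fresh(target, {"RELEASE": "Spark 2.2.1"})
assert not (target / "partial.tmp").exists()
assert (target / "RELEASE").read_text() == "Spark 2.2.1"
shutil.rmtree(root)
```

The design point is simply that the install step is made idempotent: running it on a clean machine and on a machine with debris from a previous run produces the same result.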
[jira] [Assigned] (SPARK-22881) ML test for StructuredStreaming: spark.ml.regression
[ https://issues.apache.org/jira/browse/SPARK-22881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-22881: - Assignee: Weichen Xu > ML test for StructuredStreaming: spark.ml.regression > > > Key: SPARK-22881 > URL: https://issues.apache.org/jira/browse/SPARK-22881 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843
[jira] [Comment Edited] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212962#comment-16212962 ] Joseph K. Bradley edited comment on SPARK-21926 at 12/22/17 10:40 PM: -- One more report from the dev list: HashingTF and IDFModel fail with Structured Streaming: http://apache-spark-developers-list.1001551.n3.nabble.com/HashingTFModel-IDFModel-in-Structured-Streaming-td22680.html --> the issue seems to have been VectorAssembler was (Author: josephkb): One more report from the dev list: HashingTF and IDFModel fail with Structured Streaming: http://apache-spark-developers-list.1001551.n3.nabble.com/HashingTFModel-IDFModel-in-Structured-Streaming-td22680.html > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Umbrella > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > More details here SPARK-22346. > 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > -b) Allow user to set the cardinality of OneHotEncoder.-
[jira] [Created] (SPARK-22888) OneVsRestModel does not work with Structured Streaming
Joseph K. Bradley created SPARK-22888: - Summary: OneVsRestModel does not work with Structured Streaming Key: SPARK-22888 URL: https://issues.apache.org/jira/browse/SPARK-22888 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.2.1 Reporter: Joseph K. Bradley Priority: Critical OneVsRestModel uses Dataset.persist, which does not work with streaming. This should be avoided when the input is a streaming Dataset.
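The ticket says OneVsRestModel calls Dataset.persist unconditionally, and persist is unsupported on streaming Datasets. One plausible shape for the fix is to branch on whether the input is streaming. A hedged sketch with a stand-in class rather than Spark's real Dataset API (the names here are invented; Spark's actual fix may look different):

```python
class FakeDataset:
    """Stand-in for a Spark Dataset: just tracks persist calls."""
    def __init__(self, is_streaming: bool):
        self.is_streaming = is_streaming
        self.persisted = False

    def persist(self):
        if self.is_streaming:
            # Spark rejects caching a streaming Dataset; model that as an error.
            raise RuntimeError("persist() is not supported on streaming Datasets")
        self.persisted = True
        return self

def fit_one_vs_rest(dataset: FakeDataset) -> FakeDataset:
    # Guard: only cache batch inputs; leave streaming inputs alone.
    if not dataset.is_streaming:
        dataset.persist()
    return dataset

# Batch input gets cached; streaming input passes through untouched.
assert fit_one_vs_rest(FakeDataset(is_streaming=False)).persisted
assert not fit_one_vs_rest(FakeDataset(is_streaming=True)).persisted
```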
[jira] [Created] (SPARK-22886) ML test for StructuredStreaming: spark.ml.recommendation
Joseph K. Bradley created SPARK-22886: - Summary: ML test for StructuredStreaming: spark.ml.recommendation Key: SPARK-22886 URL: https://issues.apache.org/jira/browse/SPARK-22886 Project: Spark Issue Type: Test Components: ML, Tests Affects Versions: 2.3.0 Reporter: Joseph K. Bradley Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843
[jira] [Created] (SPARK-22887) ML test for StructuredStreaming: spark.ml.fpm
Joseph K. Bradley created SPARK-22887: - Summary: ML test for StructuredStreaming: spark.ml.fpm Key: SPARK-22887 URL: https://issues.apache.org/jira/browse/SPARK-22887 Project: Spark Issue Type: Test Components: ML, Tests Affects Versions: 2.3.0 Reporter: Joseph K. Bradley Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843
[jira] [Created] (SPARK-22885) ML test for StructuredStreaming: spark.ml.tuning
Joseph K. Bradley created SPARK-22885: - Summary: ML test for StructuredStreaming: spark.ml.tuning Key: SPARK-22885 URL: https://issues.apache.org/jira/browse/SPARK-22885 Project: Spark Issue Type: Test Components: ML, Tests Affects Versions: 2.3.0 Reporter: Joseph K. Bradley Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843
[jira] [Created] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
Joseph K. Bradley created SPARK-22884: - Summary: ML test for StructuredStreaming: spark.ml.clustering Key: SPARK-22884 URL: https://issues.apache.org/jira/browse/SPARK-22884 Project: Spark Issue Type: Test Components: ML, Tests Affects Versions: 2.3.0 Reporter: Joseph K. Bradley Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843
[jira] [Created] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature
Joseph K. Bradley created SPARK-22883: - Summary: ML test for StructuredStreaming: spark.ml.feature Key: SPARK-22883 URL: https://issues.apache.org/jira/browse/SPARK-22883 Project: Spark Issue Type: Test Components: ML, Tests Affects Versions: 2.3.0 Reporter: Joseph K. Bradley Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843
[jira] [Created] (SPARK-22882) ML test for StructuredStreaming: spark.ml.classification
Joseph K. Bradley created SPARK-22882: - Summary: ML test for StructuredStreaming: spark.ml.classification Key: SPARK-22882 URL: https://issues.apache.org/jira/browse/SPARK-22882 Project: Spark Issue Type: Test Components: ML, Tests Affects Versions: 2.3.0 Reporter: Joseph K. Bradley Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843
[jira] [Created] (SPARK-22881) ML test for StructuredStreaming: spark.ml.regression
Joseph K. Bradley created SPARK-22881: - Summary: ML test for StructuredStreaming: spark.ml.regression Key: SPARK-22881 URL: https://issues.apache.org/jira/browse/SPARK-22881 Project: Spark Issue Type: Test Components: ML, Tests Affects Versions: 2.3.0 Reporter: Joseph K. Bradley Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843
[jira] [Commented] (SPARK-22648) Documentation for Kubernetes Scheduler Backend
[ https://issues.apache.org/jira/browse/SPARK-22648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302006#comment-16302006 ] Apache Spark commented on SPARK-22648: -- User 'liyinan926' has created a pull request for this issue: https://github.com/apache/spark/pull/20059 > Documentation for Kubernetes Scheduler Backend > -- > > Key: SPARK-22648 > URL: https://issues.apache.org/jira/browse/SPARK-22648 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Assignee: Anirudh Ramanathan > Fix For: 2.3.0 > > > This covers documentation needed to use the Kubernetes backend. This will > likely be marked as experimental.
[jira] [Resolved] (SPARK-22346) Update VectorAssembler to work with Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-22346. --- Resolution: Fixed Fix Version/s: 2.3.0 Resolved via https://github.com/apache/spark/pull/19746 > Update VectorAssembler to work with Structured Streaming > > > Key: SPARK-22346 > URL: https://issues.apache.org/jira/browse/SPARK-22346 > Project: Spark > Issue Type: Improvement > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Critical > Fix For: 2.3.0 > > > The issue > In batch mode, VectorAssembler can take multiple columns of VectorType and > assemble them into a new column of VectorType containing the concatenated > vectors. In streaming mode, this transformation can fail because > VectorAssembler does not have enough information to produce metadata > (AttributeGroup) for the new column. Because VectorAssembler is such a > ubiquitous part of mllib pipelines, this issue effectively means spark > structured streaming does not support prediction using mllib pipelines. > I've created this ticket so we can discuss ways to potentially improve > VectorAssembler. Please let me know if there are any issues I have not > considered or potential fixes I haven't outlined. I'm happy to submit a patch > once I know which strategy is the best approach. > Potential fixes > 1) Replace VectorAssembler with an estimator/model pair like was recently > done with OneHotEncoder, > [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The > Estimator can "learn" the size of the inputs vectors during training and save > it to use during prediction. > Pros: > * Possibly simplest of the potential fixes > Cons: > * We'll need to deprecate current VectorAssembler > 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty > major change, but it could be done in stages. 
We could first ensure that > metadata is not used during prediction and allow the VectorAssembler to drop > metadata for streaming dataframes. Going forward, it would be important to > not use any metadata on Vector columns for any prediction tasks. > Pros: > * Potentially, easy short term fix for VectorAssembler > (drop metadata for vector columns in streaming). > * Current Attributes implementation is also causing other issues, eg > [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141]. > Cons: > * To fully remove ML Attributes would be a major refactor of MLlib and would > most likely require breaking changings. > * A partial removal of ML attributes (eg: ensure ML attributes are not used > during transform, only during fit) might be tricky. This would require > testing or other enforcement mechanism to prevent regressions. > 3) Require Vector columns to have fixed length vectors. Most mllib > transformers that produce vectors already include the size of the vector in > the column metadata. This change would be to deprecate APIs that allow > creating a vector column of unknown length and replace those APIs with > equivalents that enforce a fixed size. > Pros: > * We already treat vectors as fixed size, for example VectorAssembler assumes > the inputs * output col are fixed size vectors and creates metadata > accordingly. In the spirit of explicit is better than implicit, we would be > codifying something we already assume. > * This could potentially enable performance optimizations that are only > possible if the Vector size of a column is fixed & known. > Cons: > * This would require breaking changes.
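Option 3 above hinges on every vector column declaring a fixed, known size, which is also exactly the metadata VectorAssembler lacks on a streaming input. The core size check can be sketched without Spark (the column names and sizes are invented for illustration; this is not Spark's implementation):

```python
def assemble(columns, sizes):
    """Concatenate per-column vectors into one flat vector.

    Insist that each column's declared size is known and matches its
    data -- the metadata VectorAssembler needs but may be missing when
    the input is a streaming dataframe.
    """
    out = []
    for name, vec in columns:
        size = sizes.get(name)
        if size is None:
            raise ValueError(f"column {name!r} has no declared vector size")
        if len(vec) != size:
            raise ValueError(f"column {name!r}: expected {size} values, got {len(vec)}")
        out.extend(vec)
    return out

row = [("features_a", [1.0, 2.0]), ("features_b", [3.0])]
assert assemble(row, {"features_a": 2, "features_b": 1}) == [1.0, 2.0, 3.0]

# Missing size metadata is the streaming failure mode described above.
try:
    assemble(row, {"features_a": 2})
    raised = False
except ValueError as e:
    raised = "features_b" in str(e)
assert raised
```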
[jira] [Assigned] (SPARK-22346) Update VectorAssembler to work with Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-22346: - Assignee: Bago Amirbekian > Update VectorAssembler to work with Structured Streaming > > > Key: SPARK-22346 > URL: https://issues.apache.org/jira/browse/SPARK-22346 > Project: Spark > Issue Type: Improvement > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Critical -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301986#comment-16301986 ] Apache Spark commented on SPARK-22126: -- User 'MrBago' has created a pull request for this issue: https://github.com/apache/spark/pull/20058 > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357; more discussion is here: > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 > Anyone who's following might want to scan the design doc (in the links > above); the latest API proposal is: > {code} > def fitMultiple( > dataset: Dataset[_], > paramMaps: Array[ParamMap] > ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]] > {code} > Old discussion: > I copy the discussion from the gist here: > I propose to design the API as: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] > {code} > Let me use an example to explain the API: > {quote} > It could be possible to still use the current parallelism and still allow > for model-specific optimizations. For example, if we are doing cross validation > and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Let's > say that the cross validator could know that maxIter is optimized for the > model being evaluated (e.g. a new method in Estimator that returns such > params). It would then be straightforward for the cross validator to remove > maxIter from the param map that will be parallelized over and use it to > create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, > maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)). 
> {quote} > In this example, we can see that models computed from ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread, > and models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, > maxIter=10)) in another thread. In this example, there're 4 paramMaps, but > we can at most generate two threads to compute the models for them. > The API above allows "callable.call()" to return multiple models; the return > type is {code}Map[Int, M]{code}, where the key is an integer used to mark the paramMap > index for the corresponding model. Using the example above, there're 4 paramMaps, > but only 2 callable objects are returned: one callable object for ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, > maxIter=5), (regParam=0.3, maxIter=10)). > The default "fitCallables/fit with paramMaps" can be implemented as > follows: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] = { > paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) => > new Callable[Map[Int, M]] { > override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap)) > } > } > } > def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = { >fitCallables(dataset, paramMaps).map { _.call().toSeq } > .flatten.sortBy(_._1).map(_._2) > } > {code} > If we use the API I proposed above, the code in > [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159] > can be changed to: > {code} > val trainingDataset = sparkSession.createDataFrame(training, > schema).cache() > val validationDataset = sparkSession.createDataFrame(validation, > schema).cache() > // Fit models in a Future for training in parallel > val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { > callable => > Future[Map[Int, Model[_]]] { > val modelMap = callable.call() > if (collectSubModelsParam) { >... 
> } > modelMap > } (executionContext) > } > // Unpersist training data only when all models have trained > Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, > executionContext) > .onComplete { _ => trainingDataset.unpersist() } (executionContext) > // Evaluate models in a Future that will calculate a metric and allow the > model to be cleaned up > val foldMetricMapFutures = modelMapFutures.map { modelMapFuture => > modelMapFuture.map { modelMap => > modelMap.map { case (index: Int, model: Model[_]) => > val metric = eval.evaluate(model.transform(validationDataset, > paramMaps(index))) > (index, metric) > } >
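The proposed `fitMultiple` contract above (an iterator of `(paramMap index, model)` pairs) can be sketched in plain Python. This is a rough, hypothetical analogue of the Scala signature, not Spark's implementation; `fit` here is a stand-in that just records its params:

```python
from concurrent.futures import ThreadPoolExecutor

def fit(dataset, param_map):
    # Stand-in for a real Estimator.fit; returns a dummy "model" that
    # remembers which params it was trained with.
    return ("model", sorted(param_map.items()))

def fit_multiple(dataset, param_maps):
    """Default behaviour sketched: one (index, model) pair per paramMap,
    trained in parallel. A model-specific override could instead train
    several related paramMaps inside one call (e.g. sharing maxIter work)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fit, dataset, pm) for pm in param_maps]
        # Yield results keyed by paramMap index, as the proposal requires.
        for index, future in enumerate(futures):
            yield index, future.result()

param_maps = [{"regParam": 0.1, "maxIter": 5}, {"regParam": 0.1, "maxIter": 10}]
results = dict(fit_multiple(None, param_maps))
assert set(results) == {0, 1}   # every paramMap index gets a model back
```

The index keys are what let a caller (e.g. CrossValidator) match each returned model back to its paramMap even when one callable produces several models.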
[jira] [Commented] (SPARK-22879) LogisticRegression inconsistent prediction when proba == threshold
[ https://issues.apache.org/jira/browse/SPARK-22879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301801#comment-16301801 ] Sean Owen commented on SPARK-22879: --- Yes these are algebraically the same but not exactly the same due to roundoff. I guess I'd argue the right answer is 'false' in your example, because it's the comparison with the value the user supplied. Yes it should be consistent, but I don't know if this can be avoided, without avoiding the 'raw' comparison altogether. That's for performance reasons though. In general this seems like an extreme corner case, but I see you're trying to exactly reproduce certain comparisons. What about rounding the result to your nearest test set value, to satisfy your particular use case? Is there any way to get the speed up (win with no downside in almost all cases) without this behavior? > LogisticRegression inconsistent prediction when proba == threshold > -- > > Key: SPARK-22879 > URL: https://issues.apache.org/jira/browse/SPARK-22879 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.3 >Reporter: Adrien Lavoillotte >Priority: Minor > > I'm using {{org.apache.spark.ml.classification.LogisticRegression}} for > binary classification. > If I predict on a record that yields exactly the probability of the > threshold, then the result of {{transform}} is different depending on whether > the {{rawPredictionCol}} param is empty on the model or not. > If it is empty, as most ML tools I've seen, it correctly predicts 0, the rule > being {{ if (proba > threshold) then 1 else 0 }} (implemented in > {{probability2prediction}}). > If however {{rawPredictionCol}} is set (default), then it avoids > recomputation by calling {{raw2prediction}}, and the rule becomes {{if > (rawPrediction(1) > rawThreshold) 1 else 0}}. The {{rawThreshold = math.log(t > / (1.0 - t))}} is ever-so-slightly below the {{rawPrediction(1)}}, so it > predicts 1. 
> The use case is that I choose the threshold amongst > {{BinaryClassificationMetrics#thresholds}}, so I get one that corresponds to > the probability for one or more of my test set's records. Re-scoring that > record or one that yields the same probability exhibits this behaviour. > Tested this on Spark 1.6 but the code involved seems to be similar on Spark > 2.2. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
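The two prediction rules described in this report can be sketched in a few lines of plain Python (an illustration of the floating-point issue, not Spark's code). The threshold value is the one the reporter quotes; the point is that `math.log(t / (1 - t))` followed by the inverse sigmoid is only approximately a round trip:

```python
import math

# Illustrative sketch of the two rules, not Spark's implementation.
t = 0.2739699931543393                 # probability threshold chosen by the user
raw_threshold = math.log(t / (1.0 - t))  # threshold mapped into raw-score space

def predict_from_proba(p):
    """probability2prediction rule: 1 if proba > threshold else 0."""
    return 1 if p > t else 0

def predict_from_raw(raw):
    """raw2prediction rule: 1 if rawPrediction > rawThreshold else 0."""
    return 1 if raw > raw_threshold else 0

# A raw score whose sigmoid is exactly t predicts 0 under the first rule;
# after the log/exp round trip the second rule can disagree in the last bit.
p = 1.0 / (1.0 + math.exp(-raw_threshold))
assert predict_from_proba(t) == 0        # p > t is false when p == t
assert predict_from_raw(raw_threshold) == 0
assert abs(p - t) < 1e-12                # round trip is close to exact, but
# bit-exact equality (and hence agreement of the two rules at the boundary)
# is not guaranteed by IEEE-754 arithmetic.
```

This is exactly the corner case of the report: away from the threshold the rules always agree; at a probability that equals the threshold to the last bit, they may not.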
[jira] [Comment Edited] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)
[ https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301772#comment-16301772 ] Daniel van der Ende edited comment on SPARK-22880 at 12/22/17 6:31 PM: --- It's different in the sense that SPARK-22717's scope was to correct a bug; PostgreSQL does not cascade by default. The reason this was particularly nasty was that it actually prevented any PostgreSQL table from being truncated, as the logic dictated that if a table is truncated by default by a database, it cannot be truncated. This issue (SPARK-22880) is about adding an option to allow users to enable cascade for truncating tables if they want to. was (Author: danielvdende): It's different in the sense that [SPARK-22717|https://github.com/apache/spark/pull/19911]'s scope was to correct a bug; PostgreSQL does not cascade by default. The reason this was particularly nasty, was that it actually prevented any PostgreSQL table from being truncated, as the logic dictates that if a table is truncated by default by a database, it cannot be truncated. This issue (SPARK-22880) is about adding an option to allow users to enable cascade for truncating tables if they want to. > Add option to cascade jdbc truncate if database supports this (PostgreSQL and > Oracle) > - > > Key: SPARK-22880 > URL: https://issues.apache.org/jira/browse/SPARK-22880 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.1 >Reporter: Daniel van der Ende >Priority: Minor > > When truncating tables, PostgreSQL and Oracle support a `CASCADE` option on > `TRUNCATE`. This cascades the truncate to tables with foreign key constraints on a column > in the table specified to truncate. It would be nice to be able to optionally > set this `CASCADE` option for PostgreSQL and Oracle. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)
[ https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301772#comment-16301772 ] Daniel van der Ende commented on SPARK-22880: - It's different in the sense that [SPARK-22717|https://github.com/apache/spark/pull/19911]'s scope was to correct a bug; PostgreSQL does not cascade by default. The reason this was particularly nasty, was that it actually prevented any PostgreSQL table from being truncated, as the logic dictates that if a table is truncated by default by a database, it cannot be truncated. This issue (SPARK-22880) is about adding an option to allow users to enable cascade for truncating tables if they want to. > Add option to cascade jdbc truncate if database supports this (PostgreSQL and > Oracle) > - > > Key: SPARK-22880 > URL: https://issues.apache.org/jira/browse/SPARK-22880 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.1 >Reporter: Daniel van der Ende >Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
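The feature being requested is small: when the user opts in, append `CASCADE` to the dialect's truncate statement. A minimal hypothetical sketch in Python (the helper name and flag are illustrative, not Spark's JdbcDialect API):

```python
def truncate_query(table, cascade=False):
    """Hypothetical helper: build a TRUNCATE statement, appending CASCADE
    only when the caller explicitly opts in (PostgreSQL / Oracle syntax)."""
    query = "TRUNCATE TABLE {}".format(table)
    # CASCADE extends the truncate to tables with foreign keys referencing
    # this one, so it must never be the default.
    return query + " CASCADE" if cascade else query

assert truncate_query("events") == "TRUNCATE TABLE events"
assert truncate_query("events", cascade=True) == "TRUNCATE TABLE events CASCADE"
```

Keeping `cascade=False` as the default preserves the SPARK-22717 bug-fix behaviour; the option only widens what an explicit user request can do.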
[jira] [Commented] (SPARK-22766) Install R linter package in spark lib directory
[ https://issues.apache.org/jira/browse/SPARK-22766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301751#comment-16301751 ] Hossein Falaki commented on SPARK-22766: [~felixcheung] This just makes installation of whatever version of {{lintr}} we pick more stable. The two tickets are orthogonal. > Install R linter package in spark lib directory > --- > > Key: SPARK-22766 > URL: https://issues.apache.org/jira/browse/SPARK-22766 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1 >Reporter: Hossein Falaki > > The {{dev/lint-r.R}} file uses devtools to install the {{jimhester/lintr}} > package in the default site library location, which is > {{/usr/local/lib/R/site-library}}. This is not recommended and can fail > because we are running this script as jenkins while that directory is owned > by root. > We need to install the linter package in a local directory. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22879) LogisticRegression inconsistent prediction when proba == threshold
[ https://issues.apache.org/jira/browse/SPARK-22879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301724#comment-16301724 ] Adrien Lavoillotte commented on SPARK-22879: In my case, the threshold was 0.2739699931543393, and the raw prediction was -0.974572759358285. * If you compute the proba from the raw prediction, you also get 0.2739699931543393, so the > comparison returns false * If, however, you compute the {{rawThreshold}} from the threshold, you get -0.9745727593582851 (notice that last 1), so the > comparison now returns true > LogisticRegression inconsistent prediction when proba == threshold > -- > > Key: SPARK-22879 > URL: https://issues.apache.org/jira/browse/SPARK-22879 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.3 >Reporter: Adrien Lavoillotte >Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22879) LogisticRegression inconsistent prediction when proba == threshold
[ https://issues.apache.org/jira/browse/SPARK-22879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301718#comment-16301718 ] Adrien Lavoillotte commented on SPARK-22879: I'm saying that, given a threshold {{t}}, the two conditions below are not exactly the same when {{proba == t}} because of floating-point math: * {{proba(1) > t}} (false) * {{rawPrediction(1) > rawThreshold}} (true) Which condition is used depends on whether the raw predictions are already computed, which depends on whether the {{rawPredictionCol}} param is empty or not. If it's empty, it will predict 0, else (default), it will predict 1. I hope it's clearer > LogisticRegression inconsistent prediction when proba == threshold > -- > > Key: SPARK-22879 > URL: https://issues.apache.org/jira/browse/SPARK-22879 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.3 >Reporter: Adrien Lavoillotte >Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301717#comment-16301717 ] Felix Cheung commented on SPARK-22683: -- I think the challenge here is the ability to determine upfront if the task is going to be small/quick. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor.cores = 5 and 10; only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands of splits in the data > partitioning and between 400 and 9000 seconds of wall clock time. 
> - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments)? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands of splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependent on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance, and such old executors could become contention > points for other executors trying to remotely access blocks in the old executors (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
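The sizing rule behind the proposal is simple arithmetic, sketched below in Python. Names are illustrative (not Spark's internals); `tasks_per_slot=1` reproduces the current one-taskSlot-per-task target, and larger values shrink the executor target proportionally:

```python
import math

# Illustrative sketch of the proposed target-executor computation:
# instead of one task slot per pending task, aim for several tasks per slot.
def target_executors(pending_tasks, executor_cores=5, task_cpus=1,
                     tasks_per_slot=1):
    slots_per_executor = executor_cores // task_cpus
    # Round up so every pending task still has a slot it will eventually use.
    return math.ceil(pending_tasks / (slots_per_executor * tasks_per_slot))

# 600 pending tasks, 5 cores per executor, 1 cpu per task:
assert target_executors(600) == 120                    # current behaviour
assert target_executors(600, tasks_per_slot=2) == 60   # 2 tasks per slot
assert target_executors(600, tasks_per_slot=6) == 20   # reporter's sweet spot
```

The trade-off reported above falls out of this directly: fewer, longer-lived executors amortize allocation and idle-timeout overhead at the cost of some added latency.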
[jira] [Resolved] (SPARK-22809) pyspark is sensitive to imports with dots
[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22809. --- Resolution: Not A Problem It's still not clear this isn't some other error swallowed by error-handling in your example. I say that because you show this construct works, the second time, so it doesn't really seem to be related to the import. Spark doesn't handle imports anyway; it's the interpreter. At worst you have a workaround to whatever this is. If you have a specific change to propose that pinpoints or explains what's happening that's attributable to Spark, reopen. > pyspark is sensitive to imports with dots > - > > Key: SPARK-22809 > URL: https://issues.apache.org/jira/browse/SPARK-22809 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Cricket Temple > > User code can fail with dotted imports. Here's a repro script. > {noformat} > import numpy as np > import pandas as pd > import pyspark > import scipy.interpolate > import scipy.interpolate as scipy_interpolate > import py4j > scipy_interpolate2 = scipy.interpolate > sc = pyspark.SparkContext() > spark_session = pyspark.SQLContext(sc) > ### > # The details of this dataset are irrelevant # > # Sorry if you'd have preferred something more boring # > ### > x__ = np.linspace(0,10,1000) > freq__ = np.arange(1,5) > x_, freq_ = np.ix_(x__, freq__) > y = np.sin(x_ * freq_).ravel() > x = (x_ * np.ones(freq_.shape)).ravel() > freq = (np.ones(x_.shape) * freq_).ravel() > df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq']) > df_sk = spark_session.createDataFrame(df_pd) > assert(df_sk.toPandas() == df_pd).all().all() > try: > import matplotlib.pyplot as plt > for f, data in df_pd.groupby("freq"): > plt.plot(*data[['x','y']].values.T) > plt.show() > except: > print("I guess we can't plot anything") > def mymap(x, interp_fn): > df = pd.DataFrame.from_records([row.asDict() for row in list(x)]) > return 
interp_fn(df.x.values, df.y.values)(np.pi) > df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey() > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > try: > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > raise Exception("Not going to reach this line") > except py4j.protocol.Py4JJavaError, e: > print("See?") > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy_interpolate2.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > # But now it works! > result = df_by_freq.mapValues(lambda x: mymap(x, > scipy.interpolate.interp1d)).collect() > assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), > atol=1e-6)) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
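The difference between the failing and working forms in the repro comes down to Python's name-binding rules for dotted imports, which can be shown without Spark at all. A small sketch using the stdlib's `os.path` in place of `scipy.interpolate` (the Spark-side failure itself is not reproduced here):

```python
# "import a.b" binds only the top-level name "a"; "a.b" is then resolved by
# attribute lookup on the module object every time a shipped function runs.
# "import a.b as alias" (or assigning a.b to a variable) binds the submodule
# object directly, which is why the reporter's aliases behave differently.
import sys
import os.path                 # binds the name "os", not "os.path"
import os.path as os_path_alias  # binds the submodule itself

assert os_path_alias is os.path             # the alias is the submodule object
assert sys.modules['os.path'] is os.path    # attribute lookup reaches the
                                            # same entry in sys.modules
```

So a closure referencing `scipy.interpolate.interp1d` captures the `scipy` module and defers the `.interpolate` lookup until the worker calls it, whereas `scipy_interpolate.interp1d` captures the submodule directly.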
[jira] [Comment Edited] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301709#comment-16301709 ] Felix Cheung edited comment on SPARK-22683 at 12/22/17 5:35 PM: I couldn't find the exact source line, but from running Flink previously I'm reasonably sure number of task slots == number of cores. Therefore I don't think it's meant to increase utilization by over committing tasks or concurrently running multiple tasks and so on. was (Author: felixcheung): I couldn't find the exact source line, but from running Flink previously I'm reasonably sure number of task slots == number of cores. Therefore I don't think it's meant to increase utilization by concurrently running multiple tasks and so on. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. 
> - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301709#comment-16301709 ] Felix Cheung commented on SPARK-22683: -- I couldn't find the exact source line, but from running Flink previously I'm reasonably sure number of task slots == number of cores. Therefore I don't think it's meant to increase utilization by concurrently running multiple tasks and so on. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. 
> - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To
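The sizing rule the proposal describes can be sketched as follows (illustrative Python, not Spark's ExecutorAllocationManager code; the function and parameter names are assumptions). An executor has spark.executor.cores / spark.task.cpus task slots, and the proposed tasksPerExecutorSlot knob divides the one-slot-per-pending-task target by that factor:

```python
import math

def target_executors(pending_tasks, executor_cores, task_cpus, tasks_per_slot=1):
    """Sketch of the proposed sizing rule (names are illustrative).

    slots_per_executor mirrors spark.executor.cores / spark.task.cpus;
    tasks_per_slot is the proposed tasksPerExecutorSlot knob. With the
    default of 1 this reduces to today's behaviour: one slot per task.
    """
    slots_per_executor = executor_cores // task_cpus
    return math.ceil(pending_tasks / (slots_per_executor * tasks_per_slot))

# 1000 small tasks on 5-core executors with 1 cpu per task: the default
# asks for 200 executors, while 6 tasks per slot asks for only 34.
default_target = target_executors(1000, 5, 1)
tuned_target = target_executors(1000, 5, 1, tasks_per_slot=6)
```

Under this sketch, the reporter's figures (5% resource reduction at 2 tasks per slot, 30% at 6) come from trading per-task latency for fewer, better-utilized containers.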
[jira] [Commented] (SPARK-22870) Dynamic allocation should allow 0 idle time
[ https://issues.apache.org/jira/browse/SPARK-22870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301701#comment-16301701 ] Felix Cheung commented on SPARK-22870: -- +1 yes there is more than the check for the value 0 as the clean up is not currently synchronous. > Dynamic allocation should allow 0 idle time > --- > > Key: SPARK-22870 > URL: https://issues.apache.org/jira/browse/SPARK-22870 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: Xuefu Zhang >Priority: Minor > > As discussed in SPARK-22765, with SPARK-21656, an executor will not idle out > when there are pending tasks to run. When there is no task to run, an > executor will die out after {{spark.dynamicAllocation.executorIdleTimeout}}, > which is currently required to be greater than zero. However, for efficiency, > a user should be able to specify that an executor can die out immediately w/o > being required to be idle for at least 1s. > This is to make {{0}} a valid value for > {{spark.dynamicAllocation.executorIdleTimeout}}, and special handling such a > case might be needed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
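The requested semantics of a zero idle timeout can be sketched as follows (illustrative, not Spark's actual code; as the comment above notes, the real cleanup is not currently synchronous, so this only captures the eligibility check):

```python
def executor_expired(idle_since, now, idle_timeout):
    """Illustrative check: an executor with no running tasks becomes
    eligible for removal once it has been idle for at least idle_timeout
    seconds. Allowing idle_timeout == 0 makes an executor removable the
    moment it goes idle, instead of requiring at least 1s of idleness."""
    return now - idle_since >= idle_timeout

# With a 0s timeout, an executor is removable as soon as it goes idle;
# with a 1s timeout, it survives the first half second of idleness.
immediate = executor_expired(idle_since=10.0, now=10.0, idle_timeout=0)
waiting = executor_expired(idle_since=10.0, now=10.5, idle_timeout=1)
```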
[jira] [Updated] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)
[ https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22880: -- Priority: Minor (was: Major) How is this different from SPARK-22717? > Add option to cascade jdbc truncate if database supports this (PostgreSQL and > Oracle) > - > > Key: SPARK-22880 > URL: https://issues.apache.org/jira/browse/SPARK-22880 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.1 >Reporter: Daniel van der Ende >Priority: Minor > > When truncating tables, PostgreSQL and Oracle support an option `TRUNCATE`. > This cascades the truncate to tables with foreign key constraints on a column > in the table specified to truncate. It would be nice to be able to optionally > set this `TRUNCATE` option for PostgreSQL and Oracle. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22872) Spark ML Pipeline Model Persistent Support Save Schema Info
[ https://issues.apache.org/jira/browse/SPARK-22872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22872. --- Resolution: Invalid Questions to the mailing list please > Spark ML Pipeline Model Persistent Support Save Schema Info > --- > > Key: SPARK-22872 > URL: https://issues.apache.org/jira/browse/SPARK-22872 > Project: Spark > Issue Type: IT Help > Components: ML >Affects Versions: 2.2.0 >Reporter: Cyanny >Priority: Minor > Attachments: jpmml-research.jpg > > > Hi all, > I have a project about model transformation with PMML; it needs to transform > models of different types into PMML files. > Moreover, JPMML (https://github.com/jpmml) has provided tools to do that, such > as jpmml-sklearn, jpmml-xgboost etc. Our transformation API parameters must > be concise and simple, in other words, the fewer the better. > I ran into an issue: sklearn, tensorflow, and lightgbm can produce a single > model file, including both schema info and model data info, > but Spark PipelineModel only exports a model file in Parquet, with no > schema info in the model file. However, the JPMML-SPARK converter needs two > arguments: Data Schema and PipelineModel. > *Can Spark PipelineModel include the input data schema as metadata when doing > export? * > The support status of machine-learning libraries for JPMML is shown in the attached > image; only xgboost and Spark can't include schema info in the exported model > file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22879) LogisticRegression inconsistent prediction when proba == threshold
[ https://issues.apache.org/jira/browse/SPARK-22879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301681#comment-16301681 ] Sean Owen commented on SPARK-22879: --- I'm not quite following here. In both cases the rule is "1 if > threshold". Are you saying that the recomputation of the threshold from t isn't equal to the original one? that's a little different. > LogisticRegression inconsistent prediction when proba == threshold > -- > > Key: SPARK-22879 > URL: https://issues.apache.org/jira/browse/SPARK-22879 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.3 >Reporter: Adrien Lavoillotte >Priority: Minor > > I'm using {{org.apache.spark.ml.classification.LogisticRegression}} for > binary classification. > If I predict on a record that yields exactly the probability of the > threshold, then the result of {{transform}} is different depending on whether > the {{rawPredictionCol}} param is empty on the model or not. > If it is empty, as most ML tools I've seen, it correctly predicts 0, the rule > being {{ if (proba > threshold) then 1 else 0 }} (implemented in > {{probability2prediction}}). > If however {{rawPredictionCol}} is set (default), then it avoids > recomputation by calling {{raw2prediction}}, and the rule becomes {{if > (rawPrediction(1) > rawThreshold) 1 else 0}}. The {{rawThreshold = math.log(t > / (1.0 - t))}} is ever-so-slightly below the {{rawPrediction(1)}}, so it > predicts 1. > The use case is that I choose the threshold amongst > {{BinaryClassificationMetrics#thresholds}}, so I get one that corresponds to > the probability for one or more of my test set's records. Re-scoring that > record or one that yields the same probability exhibits this behaviour. > Tested this on Spark 1.6 but the code involved seems to be similar on Spark > 2.2. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
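The tie-breaking inconsistency reported above can be reproduced numerically. A sketch of the two prediction rules the issue names (illustrative Python, mirroring the described behaviour of probability2prediction and raw2prediction rather than Spark's actual Scala code):

```python
import math

def sigmoid(m):
    return 1.0 / (1.0 + math.exp(-m))

def predict_from_proba(proba, threshold):
    # probability2prediction rule: predict 1 iff proba > threshold
    return 1 if proba > threshold else 0

def predict_from_raw(margin, threshold):
    # raw2prediction rule: predict 1 iff margin > log(t / (1 - t))
    raw_threshold = math.log(threshold / (1.0 - threshold))
    return 1 if margin > raw_threshold else 0

# Choose the threshold to be exactly the probability of some record, as the
# reporter does by picking it from BinaryClassificationMetrics#thresholds.
margin = 0.3
t = sigmoid(margin)

# The probability path predicts 0 on an exact tie (proba > t is false).
tie_pred = predict_from_proba(t, t)
# The raw path depends on how log(t / (1 - t)) rounds relative to the
# margin: the round trip lands within ~1 ulp of the margin but is not
# guaranteed equal, which is exactly the reported inconsistency.
raw_pred = predict_from_raw(margin, t)
```

Whenever the rounded-back raw threshold falls strictly below the stored margin, the two paths disagree on the tie record.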
[jira] [Updated] (SPARK-22875) Assembly build fails for a high user id
[ https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22875: -- Shepherd: (was: Steve Loughran) Flags: (was: Patch) Priority: Minor (was: Major) > Assembly build fails for a high user id > --- > > Key: SPARK-22875 > URL: https://issues.apache.org/jira/browse/SPARK-22875 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Gera Shegalov >Priority: Minor > > {code} > ./build/mvn package -Pbigtop-dist -DskipTests > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project > spark-assembly_2.11: Execution dist of goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id > '123456789' is too big ( > 2097151 ). -> [Help 1] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22870) Dynamic allocation should allow 0 idle time
[ https://issues.apache.org/jira/browse/SPARK-22870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301667#comment-16301667 ] Sean Owen commented on SPARK-22870: --- Seems reasonable to me. > Dynamic allocation should allow 0 idle time > --- > > Key: SPARK-22870 > URL: https://issues.apache.org/jira/browse/SPARK-22870 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: Xuefu Zhang >Priority: Minor > > As discussed in SPARK-22765, with SPARK-21656, an executor will not idle out > when there are pending tasks to run. When there is no task to run, an > executor will die out after {{spark.dynamicAllocation.executorIdleTimeout}}, > which is currently required to be greater than zero. However, for efficiency, > a user should be able to specify that an executor can die out immediately w/o > being required to be idle for at least 1s. > This is to make {{0}} a valid value for > {{spark.dynamicAllocation.executorIdleTimeout}}, and special handling such a > case might be needed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)
[ https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22880: Assignee: (was: Apache Spark) > Add option to cascade jdbc truncate if database supports this (PostgreSQL and > Oracle) > - > > Key: SPARK-22880 > URL: https://issues.apache.org/jira/browse/SPARK-22880 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.1 >Reporter: Daniel van der Ende > > When truncating tables, PostgreSQL and Oracle support an option `TRUNCATE`. > This cascades the truncate to tables with foreign key constraints on a column > in the table specified to truncate. It would be nice to be able to optionally > set this `TRUNCATE` option for PostgreSQL and Oracle. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)
[ https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301591#comment-16301591 ] Apache Spark commented on SPARK-22880: -- User 'danielvdende' has created a pull request for this issue: https://github.com/apache/spark/pull/20057 > Add option to cascade jdbc truncate if database supports this (PostgreSQL and > Oracle) > - > > Key: SPARK-22880 > URL: https://issues.apache.org/jira/browse/SPARK-22880 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.1 >Reporter: Daniel van der Ende > > When truncating tables, PostgreSQL and Oracle support an option `TRUNCATE`. > This cascades the truncate to tables with foreign key constraints on a column > in the table specified to truncate. It would be nice to be able to optionally > set this `TRUNCATE` option for PostgreSQL and Oracle. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)
[ https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22880: Assignee: Apache Spark > Add option to cascade jdbc truncate if database supports this (PostgreSQL and > Oracle) > - > > Key: SPARK-22880 > URL: https://issues.apache.org/jira/browse/SPARK-22880 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.1 >Reporter: Daniel van der Ende >Assignee: Apache Spark > > When truncating tables, PostgreSQL and Oracle support an option `TRUNCATE`. > This cascades the truncate to tables with foreign key constraints on a column > in the table specified to truncate. It would be nice to be able to optionally > set this `TRUNCATE` option for PostgreSQL and Oracle. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)
Daniel van der Ende created SPARK-22880: --- Summary: Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle) Key: SPARK-22880 URL: https://issues.apache.org/jira/browse/SPARK-22880 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.2.1 Reporter: Daniel van der Ende When truncating tables, PostgreSQL and Oracle support an option `TRUNCATE`. This cascades the truncate to tables with foreign key constraints on a column in the table specified to truncate. It would be nice to be able to optionally set this `TRUNCATE` option for PostgreSQL and Oracle. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
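The requested option amounts to appending a dialect-gated CASCADE clause to the truncate statement. A minimal sketch (illustrative only; the function name and dialect handling are assumptions, not Spark's JdbcDialect API):

```python
def truncate_query(table, cascade=False, dialect="postgresql"):
    """Build a TRUNCATE statement, optionally cascading to tables that
    hold foreign-key references to `table`. PostgreSQL and Oracle accept
    a CASCADE clause on TRUNCATE TABLE; other dialects ignore the flag
    here rather than emit unsupported SQL."""
    query = "TRUNCATE TABLE " + table
    if cascade and dialect in ("postgresql", "oracle"):
        query += " CASCADE"
    return query

plain = truncate_query("events")
cascading = truncate_query("events", cascade=True)
```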
[jira] [Commented] (SPARK-22867) Add Isolation Forest algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-22867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301498#comment-16301498 ] Sean Owen commented on SPARK-22867: --- The problem is that this goes for a hundred things. I don't think MLlib aspires to be like scikit, especially because you can easily add a third-party package to any app. It's just the basics. Given that most JIRAs like this have been rejected, I doubt this would be different. At least you'd need to argue this is widely used (e.g. papers, other implementations) > Add Isolation Forest algorithm to MLlib > --- > > Key: SPARK-22867 > URL: https://issues.apache.org/jira/browse/SPARK-22867 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Fangzhou Yang > > Isolation Forest (iForest) is an effective model that focuses on anomaly > isolation. > iForest uses a tree structure to model data; an iTree isolates anomalies > closer to the root of the tree as compared to normal points. > An anomaly score is calculated by the iForest model to measure the abnormality of > the data instances. > More details about iForest can be found in the following papers: > Isolation Forest (https://dl.acm.org/citation.cfm?id=1511387) [1] > and Isolation-Based > Anomaly Detection (https://dl.acm.org/citation.cfm?id=2133363) [2]. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
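For context, the scoring formula from the two papers cited above: s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the average path length of point x over the trees and c(n) normalises by the average unsuccessful-search path length of a binary search tree over n points. In the papers' convention, scores near 1 flag anomalies. A sketch (illustrative, not tied to any Spark API):

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Average path length of an unsuccessful BST search over n points,
    used to normalise tree depths (per the iForest papers)."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)): isolating a point quickly
    (short average path) yields a score near 1, i.e. anomalous."""
    return 2.0 ** (-avg_path_length / c(n))

# Shorter average path => higher score => more anomalous.
shallow = anomaly_score(2.0, 256)
deep = anomaly_score(10.0, 256)
```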
[jira] [Created] (SPARK-22879) LogisticRegression inconsistent prediction when proba == threshold
Adrien Lavoillotte created SPARK-22879: -- Summary: LogisticRegression inconsistent prediction when proba == threshold Key: SPARK-22879 URL: https://issues.apache.org/jira/browse/SPARK-22879 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 1.6.3 Reporter: Adrien Lavoillotte Priority: Minor I'm using {{org.apache.spark.ml.classification.LogisticRegression}} for binary classification. If I predict on a record that yields exactly the probability of the threshold, then the result of {{transform}} is different depending on whether the {{rawPredictionCol}} param is empty on the model or not. If it is empty, as most ML tools I've seen, it correctly predicts 0, the rule being {{ if (proba > threshold) then 1 else 0 }} (implemented in {{probability2prediction}}). If however {{rawPredictionCol}} is set (default), then it avoids recomputation by calling {{raw2prediction}}, and the rule becomes {{if (rawPrediction(1) > rawThreshold) 1 else 0}}. The {{rawThreshold = math.log(t / (1.0 - t))}} is ever-so-slightly below the {{rawPrediction(1)}}, so it predicts 1. The use case is that I choose the threshold amongst {{BinaryClassificationMetrics#thresholds}}, so I get one that corresponds to the probability for one or more of my test set's records. Re-scoring that record or one that yields the same probability exhibits this behaviour. Tested this on Spark 1.6 but the code involved seems to be similar on Spark 2.2. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22878) Count totalDroppedEvents for LiveListenerBus
[ https://issues.apache.org/jira/browse/SPARK-22878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22878: Assignee: (was: Apache Spark) > Count totalDroppedEvents for LiveListenerBus > > > Key: SPARK-22878 > URL: https://issues.apache.org/jira/browse/SPARK-22878 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: wuyi >Priority: Minor > > Count total dropped events from all queues' numDroppedEvents for > LiveListenerBus and log the dropped events periodically for LiveListenerBus > like queues do. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22878) Count totalDroppedEvents for LiveListenerBus
[ https://issues.apache.org/jira/browse/SPARK-22878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301341#comment-16301341 ] Apache Spark commented on SPARK-22878: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/20056 > Count totalDroppedEvents for LiveListenerBus > > > Key: SPARK-22878 > URL: https://issues.apache.org/jira/browse/SPARK-22878 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: wuyi >Priority: Minor > > Count total dropped events from all queues' numDroppedEvents for > LiveListenerBus and log the dropped events periodically for LiveListenerBus > like queues do. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22878) Count totalDroppedEvents for LiveListenerBus
[ https://issues.apache.org/jira/browse/SPARK-22878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22878: Assignee: Apache Spark > Count totalDroppedEvents for LiveListenerBus > > > Key: SPARK-22878 > URL: https://issues.apache.org/jira/browse/SPARK-22878 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: wuyi >Assignee: Apache Spark >Priority: Minor > > Count total dropped events from all queues' numDroppedEvents for > LiveListenerBus and log the dropped events periodically for LiveListenerBus > like queues do. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22878) Count totalDroppedEvents for LiveListenerBus
wuyi created SPARK-22878: Summary: Count totalDroppedEvents for LiveListenerBus Key: SPARK-22878 URL: https://issues.apache.org/jira/browse/SPARK-22878 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.1 Reporter: wuyi Priority: Minor Count total dropped events from all queues' numDroppedEvents for LiveListenerBus and log the dropped events periodically for LiveListenerBus like queues do. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
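The proposed aggregation is a straightforward sum of per-queue drop counters. A minimal sketch (illustrative Python; the class and field names are assumptions standing in for the queues' numDroppedEvents):

```python
class QueueStats:
    """Stand-in for a listener-bus queue exposing its drop counter."""
    def __init__(self, name, num_dropped_events=0):
        self.name = name
        self.num_dropped_events = num_dropped_events

def total_dropped_events(queues):
    """Sum per-queue drop counters into one bus-level figure, as the
    ticket proposes for LiveListenerBus; the bus could then log this
    total periodically the same way individual queues already do."""
    return sum(q.num_dropped_events for q in queues)

queues = [QueueStats("appStatus", 3),
          QueueStats("executorManagement", 0),
          QueueStats("eventLog", 7)]
total = total_dropped_events(queues)
```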
[jira] [Resolved] (SPARK-22874) Modify checking pandas version to use LooseVersion.
[ https://issues.apache.org/jira/browse/SPARK-22874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22874. -- Resolution: Fixed Fix Version/s: 2.3.0 Fixed in https://github.com/apache/spark/pull/20054 > Modify checking pandas version to use LooseVersion. > --- > > Key: SPARK-22874 > URL: https://issues.apache.org/jira/browse/SPARK-22874 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 2.3.0 > > > Currently we check pandas version by capturing if {{ImportError}} for the > specific imports is raised or not but we can compare {{LooseVersion}} of the > version string as the same as we're checking pyarrow version. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
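The point of comparing parsed versions rather than catching ImportError is that plain string comparison misorders double-digit components. A minimal stand-in for a LooseVersion-style comparison (illustrative; PySpark itself uses LooseVersion, this sketch only shows why numeric comparison is needed):

```python
def version_tuple(v):
    """Split a dotted version string into a tuple of ints so that
    comparisons are numeric per component, not lexicographic."""
    return tuple(int(part) for part in v.split(".") if part.isdigit())

minimum = "0.19.2"
# Lexicographic string comparison claims "0.10.1" < "0.9.0" because the
# character "1" sorts before "9"; numeric comparison gets it right.
string_says_older = "0.10.1" < "0.9.0"
numeric_says_newer = version_tuple("0.10.1") > version_tuple("0.9.0")
meets_minimum = version_tuple("0.21.0") >= version_tuple(minimum)
```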
[jira] [Assigned] (SPARK-22874) Modify checking pandas version to use LooseVersion.
[ https://issues.apache.org/jira/browse/SPARK-22874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-22874: Assignee: Takuya Ueshin > Modify checking pandas version to use LooseVersion. > --- > > Key: SPARK-22874 > URL: https://issues.apache.org/jira/browse/SPARK-22874 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > > Currently we check pandas version by capturing if {{ImportError}} for the > specific imports is raised or not but we can compare {{LooseVersion}} of the > version string as the same as we're checking pyarrow version. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-22877) spark.yarn.am.attemptFailuresValidityInterval does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-22877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinhan Zhong closed SPARK-22877. > spark.yarn.am.attemptFailuresValidityInterval does not work correctly > - > > Key: SPARK-22877 > URL: https://issues.apache.org/jira/browse/SPARK-22877 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 > Environment: hadoop version 2.7.3 >Reporter: Jinhan Zhong >Priority: Minor > > I assume we can use spark.yarn.maxAppAttempts together with > spark.yarn.am.attemptFailuresValidityInterval to make a long running > application avoid stopping after acceptable number of failures. > But after testing, I found that the application always stops after failing n > times ( n is minimum value of spark.yarn.maxAppAttempts and > yarn.resourcemanager.am.max-attempts from client yarn-site.xml) > for example, following setup will allow the application master to fail 20 > times. > * spark.yarn.am.attemptFailuresValidityInterval=1s > * spark.yarn.maxAppAttempts=20 > * yarn client: yarn.resourcemanager.am.max-attempts=20 > * yarn resource manager: yarn.resourcemanager.am.max-attempts=3 > And after checking the source code, I found in source file > ApplicationMaster.scala > https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293[https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293] > there's a ShutdownHook that checks the attempt id against the maxAppAttempts, > if attempt id >= maxAppAttempts, it will try to unregister the application > and the application will finish. > is this a expected design or a bug? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22877) spark.yarn.am.attemptFailuresValidityInterval does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-22877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinhan Zhong resolved SPARK-22877. -- Resolution: Duplicate > spark.yarn.am.attemptFailuresValidityInterval does not work correctly > - > > Key: SPARK-22877 > URL: https://issues.apache.org/jira/browse/SPARK-22877 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 > Environment: hadoop version 2.7.3 >Reporter: Jinhan Zhong >Priority: Minor > > I assume we can use spark.yarn.maxAppAttempts together with > spark.yarn.am.attemptFailuresValidityInterval to make a long running > application avoid stopping after acceptable number of failures. > But after testing, I found that the application always stops after failing n > times ( n is minimum value of spark.yarn.maxAppAttempts and > yarn.resourcemanager.am.max-attempts from client yarn-site.xml) > for example, following setup will allow the application master to fail 20 > times. > * spark.yarn.am.attemptFailuresValidityInterval=1s > * spark.yarn.maxAppAttempts=20 > * yarn client: yarn.resourcemanager.am.max-attempts=20 > * yarn resource manager: yarn.resourcemanager.am.max-attempts=3 > And after checking the source code, I found in source file > ApplicationMaster.scala > https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293[https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293] > there's a ShutdownHook that checks the attempt id against the maxAppAttempts, > if attempt id >= maxAppAttempts, it will try to unregister the application > and the application will finish. > is this a expected design or a bug? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinhan Zhong updated SPARK-22876: - Description: I assume we can use spark.yarn.maxAppAttempts together with spark.yarn.am.attemptFailuresValidityInterval to keep a long-running application from stopping after an acceptable number of failures. But after testing, I found that the application always stops after failing n times (n is the minimum of spark.yarn.maxAppAttempts and yarn.resourcemanager.am.max-attempts from the client yarn-site.xml). For example, the following setup should allow the application master to fail 20 times. * spark.yarn.am.attemptFailuresValidityInterval=1s * spark.yarn.maxAppAttempts=20 * yarn client: yarn.resourcemanager.am.max-attempts=20 * yarn resource manager: yarn.resourcemanager.am.max-attempts=3 After checking the source code, I found in ApplicationMaster.scala https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293 a ShutdownHook that checks the attempt id against maxAppAttempts; if attempt id >= maxAppAttempts, it tries to unregister the application and the application finishes. Is this an expected design or a bug? was: I assume we can use spark.yarn.maxAppAttempts together with spark.yarn.am.attemptFailuresValidityInterval to keep a long-running application from stopping after an acceptable number of failures. But after testing, I found that the application always stops after failing n times (n is the minimum of spark.yarn.maxAppAttempts and yarn.resourcemanager.am.max-attempts from the client yarn-site.xml). For example, the following setup should allow the application master to fail 20 times.
* spark.yarn.am.attemptFailuresValidityInterval=1s * spark.yarn.maxAppAttempts=20 * yarn client: yarn.resourcemanager.am.max-attempts=20 * yarn resource manager: yarn.resourcemanager.am.max-attempts=3 After checking the source code, I found in ApplicationMaster.scala https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293 a ShutdownHook that checks the attempt id against maxAppAttempts; if attempt id >= maxAppAttempts, it tries to unregister the application and the application finishes. Is this an expected design or a bug? > spark.yarn.am.attemptFailuresValidityInterval does not work correctly > - > > Key: SPARK-22876 > URL: https://issues.apache.org/jira/browse/SPARK-22876 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 > Environment: hadoop version 2.7.3 >Reporter: Jinhan Zhong >Priority: Minor > > I assume we can use spark.yarn.maxAppAttempts together with > spark.yarn.am.attemptFailuresValidityInterval to keep a long-running > application from stopping after an acceptable number of failures. > But after testing, I found that the application always stops after failing n > times (n is the minimum of spark.yarn.maxAppAttempts and > yarn.resourcemanager.am.max-attempts from the client yarn-site.xml). > For example, the following setup should allow the application master to fail 20 > times.
> * spark.yarn.am.attemptFailuresValidityInterval=1s > * spark.yarn.maxAppAttempts=20 > * yarn client: yarn.resourcemanager.am.max-attempts=20 > * yarn resource manager: yarn.resourcemanager.am.max-attempts=3 > After checking the source code, I found in ApplicationMaster.scala > https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293 > a ShutdownHook that checks the attempt id against maxAppAttempts; > if attempt id >= maxAppAttempts, it tries to unregister the application > and the application finishes. > Is this an expected design or a bug? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
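For quick reference, the reporter's reproduction can be written out as a configuration fragment (values copied from the description above; the property names are the standard Spark/YARN ones):

{code}
# client spark-defaults.conf
spark.yarn.am.attemptFailuresValidityInterval   1s
spark.yarn.maxAppAttempts                       20

# client yarn-site.xml:
#   yarn.resourcemanager.am.max-attempts = 20
# resource manager yarn-site.xml:
#   yarn.resourcemanager.am.max-attempts = 3
{code}

With a 1s validity interval, attempts older than one second should stop counting toward the limit, so the AM would effectively be allowed to keep failing; instead the application stops after n = min(spark.yarn.maxAppAttempts, client yarn.resourcemanager.am.max-attempts) attempts, because the shutdown hook compares the raw attempt id against maxAppAttempts rather than counting only attempts inside the validity window.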
[jira] [Updated] (SPARK-22875) Assembly build fails for a high user id
[ https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-22875: -- Flags: Patch > Assembly build fails for a high user id > --- > > Key: SPARK-22875 > URL: https://issues.apache.org/jira/browse/SPARK-22875 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Gera Shegalov > > {code} > ./build/mvn package -Pbigtop-dist -DskipTests > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project > spark-assembly_2.11: Execution dist of goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id > '123456789' is too big ( > 2097151 ). -> [Help 1] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22877) spark.yarn.am.attemptFailuresValidityInterval does not work correctly
Jinhan Zhong created SPARK-22877: Summary: spark.yarn.am.attemptFailuresValidityInterval does not work correctly Key: SPARK-22877 URL: https://issues.apache.org/jira/browse/SPARK-22877 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.2.0 Environment: hadoop version 2.7.3 Reporter: Jinhan Zhong Priority: Minor I assume we can use spark.yarn.maxAppAttempts together with spark.yarn.am.attemptFailuresValidityInterval to keep a long-running application from stopping after an acceptable number of failures. But after testing, I found that the application always stops after failing n times (n is the minimum of spark.yarn.maxAppAttempts and yarn.resourcemanager.am.max-attempts from the client yarn-site.xml). For example, the following setup should allow the application master to fail 20 times. * spark.yarn.am.attemptFailuresValidityInterval=1s * spark.yarn.maxAppAttempts=20 * yarn client: yarn.resourcemanager.am.max-attempts=20 * yarn resource manager: yarn.resourcemanager.am.max-attempts=3 After checking the source code, I found in ApplicationMaster.scala https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293 a ShutdownHook that checks the attempt id against maxAppAttempts; if attempt id >= maxAppAttempts, it tries to unregister the application and the application finishes. Is this an expected design or a bug? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22875) Assembly build fails for a high user id
[ https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-22875: -- Shepherd: Steve Loughran > Assembly build fails for a high user id > --- > > Key: SPARK-22875 > URL: https://issues.apache.org/jira/browse/SPARK-22875 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Gera Shegalov > > {code} > ./build/mvn package -Pbigtop-dist -DskipTests > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project > spark-assembly_2.11: Execution dist of goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id > '123456789' is too big ( > 2097151 ). -> [Help 1] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly
Jinhan Zhong created SPARK-22876: Summary: spark.yarn.am.attemptFailuresValidityInterval does not work correctly Key: SPARK-22876 URL: https://issues.apache.org/jira/browse/SPARK-22876 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.2.0 Environment: hadoop version 2.7.3 Reporter: Jinhan Zhong Priority: Minor I assume we can use spark.yarn.maxAppAttempts together with spark.yarn.am.attemptFailuresValidityInterval to keep a long-running application from stopping after an acceptable number of failures. But after testing, I found that the application always stops after failing n times (n is the minimum of spark.yarn.maxAppAttempts and yarn.resourcemanager.am.max-attempts from the client yarn-site.xml). For example, the following setup should allow the application master to fail 20 times. * spark.yarn.am.attemptFailuresValidityInterval=1s * spark.yarn.maxAppAttempts=20 * yarn client: yarn.resourcemanager.am.max-attempts=20 * yarn resource manager: yarn.resourcemanager.am.max-attempts=3 After checking the source code, I found in ApplicationMaster.scala https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293 a ShutdownHook that checks the attempt id against maxAppAttempts; if attempt id >= maxAppAttempts, it tries to unregister the application and the application finishes. Is this an expected design or a bug? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22875) Assembly build fails for a high user id
[ https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22875: Assignee: (was: Apache Spark) > Assembly build fails for a high user id > --- > > Key: SPARK-22875 > URL: https://issues.apache.org/jira/browse/SPARK-22875 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Gera Shegalov > > {code} > ./build/mvn package -Pbigtop-dist -DskipTests > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project > spark-assembly_2.11: Execution dist of goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id > '123456789' is too big ( > 2097151 ). -> [Help 1] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22875) Assembly build fails for a high user id
[ https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22875: Assignee: Apache Spark > Assembly build fails for a high user id > --- > > Key: SPARK-22875 > URL: https://issues.apache.org/jira/browse/SPARK-22875 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Gera Shegalov >Assignee: Apache Spark > > {code} > ./build/mvn package -Pbigtop-dist -DskipTests > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project > spark-assembly_2.11: Execution dist of goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id > '123456789' is too big ( > 2097151 ). -> [Help 1] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22875) Assembly build fails for a high user id
[ https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301148#comment-16301148 ] Apache Spark commented on SPARK-22875: -- User 'gerashegalov' has created a pull request for this issue: https://github.com/apache/spark/pull/20055 > Assembly build fails for a high user id > --- > > Key: SPARK-22875 > URL: https://issues.apache.org/jira/browse/SPARK-22875 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Gera Shegalov > > {code} > ./build/mvn package -Pbigtop-dist -DskipTests > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project > spark-assembly_2.11: Execution dist of goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id > '123456789' is too big ( > 2097151 ). -> [Help 1] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22875) Assembly build fails for a high user id
Gera Shegalov created SPARK-22875: - Summary: Assembly build fails for a high user id Key: SPARK-22875 URL: https://issues.apache.org/jira/browse/SPARK-22875 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.1 Reporter: Gera Shegalov {code} ./build/mvn package -Pbigtop-dist -DskipTests [ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project spark-assembly_2.11: Execution dist of goal org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id '123456789' is too big ( > 2097151 ). -> [Help 1] {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
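For context on the error: the limit 2097151 comes from the classic ustar tar header, which stores uid/gid in 7 octal digits (7777777 octal = 2097151), so any build user with a larger numeric uid breaks the tar step. One plausible fix direction, sketched here as an assumption rather than the patch actually merged for this issue, is to switch the assembly tar to the POSIX (PAX) format via the maven-assembly-plugin's {{tarLongFileMode}} option, which stores large numeric values in extended headers:

{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <!-- "posix" selects the PAX tar format, which is not limited to
         7-octal-digit (2097151) uid/gid values like classic ustar -->
    <tarLongFileMode>posix</tarLongFileMode>
  </configuration>
</plugin>
{code}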
[jira] [Updated] (SPARK-22872) Spark ML Pipeline Model Persistent Support Save Schema Info
[ https://issues.apache.org/jira/browse/SPARK-22872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyanny updated SPARK-22872: --- Description: Hi all, I have a project about model transformation with PMML; it needs to transform models of different types to pmml files. JPMML (https://github.com/jpmml) provides tools to do that, such as jpmml-sklearn, jpmml-xgboost, etc. Our transformation API parameters must be concise and simple; in other words, the fewer the better. I ran into an issue: sklearn, tensorflow, and lightgbm each produce a single model file that includes both schema info and model data. But a Spark PipelineModel only exports a model file in parquet, with no schema info in the model file, while the JPMML-SPARK converter needs two arguments: a data schema and a PipelineModel. *Can a Spark PipelineModel include the input data schema as metadata on export?* The situation of machine learning libraries versus jpmml is shown in the attached image; only xgboost and spark can't include schema info in the exported model file. was: Hi all, I have a project about model transformation with PMML; it needs to transform models of different types to pmml files. JPMML (https://github.com/jpmml/jpmml-sparkml) provides many tools to do that. I need to provide a uniform API for users; the API parameters must be concise and simple, in other words, the fewer the better. I ran into an issue: sklearn, tensorflow, and lightgbm each produce a single model file that includes both schema info and model data. But a Spark PipelineModel only exports a model file in parquet, with no schema info in the model file, while the JPMML-SPARK converter needs two arguments: a data schema and a PipelineModel. *Can a Spark PipelineModel include the input data schema as metadata on export?* The situation of machine learning libraries versus jpmml is shown in the attached image; only xgboost and spark can't include schema info in the exported model file.
> Spark ML Pipeline Model Persistent Support Save Schema Info > --- > > Key: SPARK-22872 > URL: https://issues.apache.org/jira/browse/SPARK-22872 > Project: Spark > Issue Type: IT Help > Components: ML >Affects Versions: 2.2.0 >Reporter: Cyanny >Priority: Minor > Attachments: jpmml-research.jpg > > > Hi all, > I have a project about model transformation with PMML; it needs to transform > models of different types to pmml files. > JPMML (https://github.com/jpmml) provides tools to do that, such > as jpmml-sklearn, jpmml-xgboost, etc. Our transformation API parameters must > be concise and simple; in other words, the fewer the better. > I ran into an issue: sklearn, tensorflow, and lightgbm each produce a single > model file that includes both schema info and model data. > But a Spark PipelineModel only exports a model file in parquet, with no > schema info in the model file, while the JPMML-SPARK converter needs two > arguments: a data schema and a PipelineModel. > *Can a Spark PipelineModel include the input data schema as metadata on > export?* > The situation of machine learning libraries versus jpmml is shown in the attached > image; only xgboost and spark can't include schema info in the exported model > file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22872) Spark ML Pipeline Model Persistent Support Save Schema Info
[ https://issues.apache.org/jira/browse/SPARK-22872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyanny updated SPARK-22872: --- Description: Hi all, I have a project about model transformation with PMML; it needs to transform models of different types to pmml files. JPMML (https://github.com/jpmml/jpmml-sparkml) provides many tools to do that. I need to provide a uniform API for users; the API parameters must be concise and simple, in other words, the fewer the better. I ran into an issue: sklearn, tensorflow, and lightgbm each produce a single model file that includes both schema info and model data. But a Spark PipelineModel only exports a model file in parquet, with no schema info in the model file, while the JPMML-SPARK converter needs two arguments: a data schema and a PipelineModel. *Can a Spark PipelineModel include the input data schema as metadata on export?* The situation of machine learning libraries versus jpmml is shown in the attached image; only xgboost and spark can't include schema info in the exported model file. was: Hi all, I recently did some research on pmml, and our project needs to transform many models of different types to pmml files. JPMML (https://github.com/jpmml/jpmml-sparkml) provides many tools to do that. I need to provide a uniform API for users; the API parameters must be concise and simple, in other words, the fewer the better. I ran into an issue: sklearn, tensorflow, and lightgbm each produce a single model file that includes both schema info and model data. But a Spark PipelineModel only exports a model file in parquet, with no schema info in the model file, while the JPMML-SPARK converter needs two arguments: a data schema and a PipelineModel. *Can a Spark PipelineModel include the input data schema as metadata on export?* The situation of machine learning libraries versus jpmml is shown in the attached image; only xgboost and spark can't include schema info in the exported model file.
> Spark ML Pipeline Model Persistent Support Save Schema Info > --- > > Key: SPARK-22872 > URL: https://issues.apache.org/jira/browse/SPARK-22872 > Project: Spark > Issue Type: IT Help > Components: ML >Affects Versions: 2.2.0 >Reporter: Cyanny >Priority: Minor > Attachments: jpmml-research.jpg > > > Hi all, > I have a project about model transformation with PMML; it needs to transform > models of different types to pmml files. > JPMML (https://github.com/jpmml/jpmml-sparkml) provides many > tools to do that. I need to provide a uniform API for users; the API > parameters must be concise and simple, in other words, the fewer the better. > I ran into an issue: sklearn, tensorflow, and lightgbm each produce a single > model file that includes both schema info and model data. > But a Spark PipelineModel only exports a model file in parquet, with no > schema info in the model file, while the JPMML-SPARK converter needs two > arguments: a data schema and a PipelineModel. > *Can a Spark PipelineModel include the input data schema as metadata on > export?* > The situation of machine learning libraries versus jpmml is shown in the attached > image; only xgboost and spark can't include schema info in the exported model file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22874) Modify checking pandas version to use LooseVersion.
[ https://issues.apache.org/jira/browse/SPARK-22874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22874: Assignee: Apache Spark > Modify checking pandas version to use LooseVersion. > --- > > Key: SPARK-22874 > URL: https://issues.apache.org/jira/browse/SPARK-22874 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark > > Currently we check the pandas version by catching {{ImportError}} on specific > imports, but we can instead compare the {{LooseVersion}} of the version > string, the same way we check the pyarrow version. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22874) Modify checking pandas version to use LooseVersion.
[ https://issues.apache.org/jira/browse/SPARK-22874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22874: Assignee: (was: Apache Spark) > Modify checking pandas version to use LooseVersion. > --- > > Key: SPARK-22874 > URL: https://issues.apache.org/jira/browse/SPARK-22874 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin > > Currently we check the pandas version by catching {{ImportError}} on specific > imports, but we can instead compare the {{LooseVersion}} of the version > string, the same way we check the pyarrow version. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22874) Modify checking pandas version to use LooseVersion.
[ https://issues.apache.org/jira/browse/SPARK-22874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301101#comment-16301101 ] Apache Spark commented on SPARK-22874: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/20054 > Modify checking pandas version to use LooseVersion. > --- > > Key: SPARK-22874 > URL: https://issues.apache.org/jira/browse/SPARK-22874 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin > > Currently we check the pandas version by catching {{ImportError}} on specific > imports, but we can instead compare the {{LooseVersion}} of the version > string, the same way we check the pyarrow version. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22874) Modify checking pandas version to use LooseVersion.
Takuya Ueshin created SPARK-22874: - Summary: Modify checking pandas version to use LooseVersion. Key: SPARK-22874 URL: https://issues.apache.org/jira/browse/SPARK-22874 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 2.3.0 Reporter: Takuya Ueshin Currently we check the pandas version by catching {{ImportError}} on specific imports, but we can instead compare the {{LooseVersion}} of the version string, the same way we check the pyarrow version. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22872) Spark ML Pipeline Model Persistent Support Save Schema Info
[ https://issues.apache.org/jira/browse/SPARK-22872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyanny updated SPARK-22872: --- Description: Hi all, I recently did some research on pmml, and our project needs to transform many models of different types to pmml files. JPMML (https://github.com/jpmml/jpmml-sparkml) provides many tools to do that. I need to provide a uniform API for users; the API parameters must be concise and simple, in other words, the fewer the better. I ran into an issue: sklearn, tensorflow, and lightgbm each produce a single model file that includes both schema info and model data. But a Spark PipelineModel only exports a model file in parquet, with no schema info in the model file, while the JPMML-SPARK converter needs two arguments: a data schema and a PipelineModel. *Can a Spark PipelineModel include the input data schema as metadata on export?* The situation of machine learning libraries versus jpmml is shown in the attached image; only xgboost and spark can't include schema info in the exported model file. was: Hi all, I recently did some research on pmml, and our project needs to transform many models of different types to pmml files. JPMML (https://github.com/jpmml/jpmml-sparkml) provides many tools to do that. I need to provide a uniform API for users; the API parameters must be concise and simple, in other words, the fewer the better. I ran into an issue: sklearn, tensorflow, and lightgbm each produce a single model file that includes both schema info and model data. But a Spark PipelineModel only exports a model file in parquet, with no schema info in the model file, while the JPMML-SPARK converter needs two arguments: a data schema and a PipelineModel. *Can a Spark PipelineModel include the input data schema when exporting to a file?*
I found a workaround: use the dataframe API to export the schema: {{dataframe.limit(1).write.format("parquet").save("./model.schema")}} *Are there any solutions to get the PipelineModel input schema?* The situation of machine learning libraries versus jpmml is shown in the attached image. > Spark ML Pipeline Model Persistent Support Save Schema Info > --- > > Key: SPARK-22872 > URL: https://issues.apache.org/jira/browse/SPARK-22872 > Project: Spark > Issue Type: IT Help > Components: ML >Affects Versions: 2.2.0 >Reporter: Cyanny >Priority: Minor > Attachments: jpmml-research.jpg > > > Hi all, > I recently did some research on pmml, and our project needs to transform many > models of different types to pmml files. > JPMML (https://github.com/jpmml/jpmml-sparkml) provides many > tools to do that. I need to provide a uniform API for users; the API > parameters must be concise and simple, in other words, the fewer the better. > I ran into an issue: sklearn, tensorflow, and lightgbm each produce a single > model file that includes both schema info and model data. > But a Spark PipelineModel only exports a model file in parquet, with no > schema info in the model file, while the JPMML-SPARK converter needs two > arguments: a data schema and a PipelineModel. > *Can a Spark PipelineModel include the input data schema as metadata on > export?* > The situation of machine learning libraries versus jpmml is shown in the attached > image; only xgboost and spark can't include schema info in the exported model file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22872) Spark ML Pipeline Model Persistent Support Save Schema Info
[ https://issues.apache.org/jira/browse/SPARK-22872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyanny updated SPARK-22872: --- Description: Hi all, I recently did some research on pmml, and our project needs to transform many models of different types to pmml files. JPMML (https://github.com/jpmml/jpmml-sparkml) provides many tools to do that. I need to provide a uniform API for users; the API parameters must be concise and simple, in other words, the fewer the better. I ran into an issue: sklearn, tensorflow, and lightgbm each produce a single model file that includes both schema info and model data. But a Spark PipelineModel only exports a model file in parquet, with no schema info in the model file, while the JPMML-SPARK converter needs two arguments: a data schema and a PipelineModel. *Can a Spark PipelineModel include the input data schema when exporting to a file?* I found a workaround: use the dataframe API to export the schema: {{dataframe.limit(1).write.format("parquet").save("./model.schema")}} *Are there any solutions to get the PipelineModel input schema?* The situation of machine learning libraries versus jpmml is shown in the attached image. was: Hi all, I recently did some research on pmml, and my project needs to transform many models of different types to pmml files. JPMML (https://github.com/jpmml/jpmml-sparkml) provides many tools to do that. I need to provide a uniform API for users; the API parameters must be concise and simple, in other words, the fewer the better. I ran into an issue: sklearn, tensorflow, and lightgbm each produce a single model file that includes both schema info and model data. But a Spark PipelineModel only exports a model file in parquet, with no schema info in the model file, while the JPMML-SPARK converter needs two arguments: a data schema and a PipelineModel. *Can a Spark PipelineModel include the input data schema when exporting to a file?*
I found a workaround: use the dataframe API to export the schema: {{dataframe.limit(1).write.format("parquet").save("./model.schema")}} *Are there any solutions to get the PipelineModel input schema?* The situation of machine learning libraries versus jpmml is shown in the attached image. > Spark ML Pipeline Model Persistent Support Save Schema Info > --- > > Key: SPARK-22872 > URL: https://issues.apache.org/jira/browse/SPARK-22872 > Project: Spark > Issue Type: IT Help > Components: ML >Affects Versions: 2.2.0 >Reporter: Cyanny >Priority: Minor > Attachments: jpmml-research.jpg > > > Hi all, > I recently did some research on pmml, and our project needs to transform many > models of different types to pmml files. > JPMML (https://github.com/jpmml/jpmml-sparkml) provides many > tools to do that. I need to provide a uniform API for users; the API > parameters must be concise and simple, in other words, the fewer the better. > I ran into an issue: sklearn, tensorflow, and lightgbm each produce a single > model file that includes both schema info and model data. > But a Spark PipelineModel only exports a model file in parquet, with no > schema info in the model file, while the JPMML-SPARK converter needs two > arguments: a data schema and a PipelineModel. > *Can a Spark PipelineModel include the input data schema when exporting to a file?* > I found a workaround: use the dataframe API to export the schema: > {{dataframe.limit(1).write.format("parquet").save("./model.schema")}} > *Are there any solutions to get the PipelineModel input schema?* > The situation of machine learning libraries versus jpmml is shown in the attached > image. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
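On the workaround above: writing a one-row parquet just to carry the schema is heavyweight. A lighter alternative is to persist the schema as a JSON sidecar file next to the model. This is a sketch under assumptions: in real PySpark code the dict below would come from {{df.schema.jsonValue()}} and be restored with {{StructType.fromJson}}; here plain dicts stand in so the sketch runs without Spark, and the field names are made up for illustration:

```python
import json
import os
import tempfile

# Hypothetical schema, in the JSON shape Spark's StructType.jsonValue() produces.
schema = {
    "type": "struct",
    "fields": [
        {"name": "label", "type": "double", "nullable": True, "metadata": {}},
        {"name": "features", "type": "string", "nullable": True, "metadata": {}},
    ],
}

# Save the schema as a sidecar file next to the (imaginary) model directory.
model_dir = tempfile.mkdtemp()
schema_path = os.path.join(model_dir, "model.schema.json")
with open(schema_path, "w") as f:
    json.dump(schema, f)

# At load time, read it back; in real code, rebuild with StructType.fromJson(restored).
with open(schema_path) as f:
    restored = json.load(f)
```

This keeps the converter call down to one artifact directory, since the data schema the JPMML-SPARK converter needs travels with the model instead of being supplied separately.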