[jira] [Resolved] (SPARK-30533) Add classes to represent Java Regressors and RegressionModels
[ https://issues.apache.org/jira/browse/SPARK-30533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30533. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27241 [https://github.com/apache/spark/pull/27241] > Add classes to represent Java Regressors and RegressionModels > - > > Key: SPARK-30533 > URL: https://issues.apache.org/jira/browse/SPARK-30533 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.0.0 > > > Right now PySpark provides classes representing Java {{Classifiers}} and > {{ClassifierModels}}, but lacks their regression counterparts. > We should provide these for consistency, feature parity and as a prerequisite > for SPARK-29212. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
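The hierarchy the ticket asks for can be pictured with a small, purely illustrative Python sketch. These are NOT the real PySpark classes (which wrap JVM objects via Py4J); the names simply mirror the existing classifier-side wrappers to show the intended shape.

```python
from abc import ABC, abstractmethod

# Illustrative sketch only -- not PySpark's actual implementation.
# JavaRegressor / JavaRegressionModel mirror the classifier-side wrappers
# the ticket references; MeanRegressor is a toy algorithm to show how a
# concrete class would slot into the hierarchy.
class JavaRegressor(ABC):
    @abstractmethod
    def fit(self, data):
        """Train on data and return a JavaRegressionModel."""

class JavaRegressionModel(ABC):
    @abstractmethod
    def predict(self, features):
        """Predict a continuous label."""

class MeanRegressor(JavaRegressor):
    def fit(self, data):
        # trivial "training": predict the mean of the training labels
        return MeanRegressionModel(sum(data) / len(data))

class MeanRegressionModel(JavaRegressionModel):
    def __init__(self, mean):
        self.mean = mean

    def predict(self, features):
        return self.mean
```

Having a shared base pair like this is what gives the classifier side its consistency; the ticket asks for the same on the regression side.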
[jira] [Updated] (SPARK-30533) Add classes to represent Java Regressors and RegressionModels
[ https://issues.apache.org/jira/browse/SPARK-30533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-30533: - Priority: Minor (was: Major) > Add classes to represent Java Regressors and RegressionModels > - > > Key: SPARK-30533 > URL: https://issues.apache.org/jira/browse/SPARK-30533 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 3.0.0 > > > Right now PySpark provides classes representing Java {{Classifiers}} and > {{ClassifierModels}}, but lacks their regression counterparts. > We should provide these for consistency, feature parity and as a prerequisite > for SPARK-29212.
[jira] [Assigned] (SPARK-30533) Add classes to represent Java Regressors and RegressionModels
[ https://issues.apache.org/jira/browse/SPARK-30533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30533: Assignee: Maciej Szymkiewicz > Add classes to represent Java Regressors and RegressionModels > - > > Key: SPARK-30533 > URL: https://issues.apache.org/jira/browse/SPARK-30533 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > > Right now PySpark provides classes representing Java {{Classifiers}} and > {{ClassifierModels}}, but lacks their regression counterparts. > We should provide these for consistency, feature parity and as a prerequisite > for SPARK-29212.
[jira] [Updated] (SPARK-25993) Add test cases for CREATE EXTERNAL TABLE with subdirectories
[ https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25993: -- Issue Type: Improvement (was: Bug) > Add test cases for CREATE EXTERNAL TABLE with subdirectories > > > Key: SPARK-25993 > URL: https://issues.apache.org/jira/browse/SPARK-25993 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.2 >Reporter: Xiao Li >Assignee: kevin yu >Priority: Major > Fix For: 3.0.0 > > > Add a test case based on the following example. The behavior was changed in > the 2.3 release. We also need to update the migration guide. > {code:java} > val someDF1 = Seq( > (1, 1, "blah"), > (1, 2, "blahblah") > ).toDF("folder", "number", "word").repartition(1) > someDF1.write.orc("/tmp/orctab1/dir1/") > someDF1.write.orc("/tmp/orctab1/dir2/") > create external table tab1(folder int,number int,word string) STORED AS ORC > LOCATION '/tmp/orctab1/'; > select * from tab1; > create external table tab2(folder int,number int,word string) STORED AS ORC > LOCATION '/tmp/orctab1/*'; > select * from tab2; > {code}
[jira] [Updated] (SPARK-25993) Add test cases for CREATE EXTERNAL TABLE with subdirectories
[ https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25993: -- Summary: Add test cases for CREATE EXTERNAL TABLE with subdirectories (was: Add test cases for resolution of ORC table location) > Add test cases for CREATE EXTERNAL TABLE with subdirectories > > > Key: SPARK-25993 > URL: https://issues.apache.org/jira/browse/SPARK-25993 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.2 >Reporter: Xiao Li >Assignee: kevin yu >Priority: Major > Labels: starter > Fix For: 3.0.0 > > > Add a test case based on the following example. The behavior was changed in > the 2.3 release. We also need to update the migration guide. > {code:java} > val someDF1 = Seq( > (1, 1, "blah"), > (1, 2, "blahblah") > ).toDF("folder", "number", "word").repartition(1) > someDF1.write.orc("/tmp/orctab1/dir1/") > someDF1.write.orc("/tmp/orctab1/dir2/") > create external table tab1(folder int,number int,word string) STORED AS ORC > LOCATION '/tmp/orctab1/'; > select * from tab1; > create external table tab2(folder int,number int,word string) STORED AS ORC > LOCATION '/tmp/orctab1/*'; > select * from tab2; > {code}
[jira] [Updated] (SPARK-25993) Add test cases for CREATE EXTERNAL TABLE with subdirectories
[ https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25993: -- Labels: (was: starter) > Add test cases for CREATE EXTERNAL TABLE with subdirectories > > > Key: SPARK-25993 > URL: https://issues.apache.org/jira/browse/SPARK-25993 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.2 >Reporter: Xiao Li >Assignee: kevin yu >Priority: Major > Fix For: 3.0.0 > > > Add a test case based on the following example. The behavior was changed in > the 2.3 release. We also need to update the migration guide. > {code:java} > val someDF1 = Seq( > (1, 1, "blah"), > (1, 2, "blahblah") > ).toDF("folder", "number", "word").repartition(1) > someDF1.write.orc("/tmp/orctab1/dir1/") > someDF1.write.orc("/tmp/orctab1/dir2/") > create external table tab1(folder int,number int,word string) STORED AS ORC > LOCATION '/tmp/orctab1/'; > select * from tab1; > create external table tab2(folder int,number int,word string) STORED AS ORC > LOCATION '/tmp/orctab1/*'; > select * from tab2; > {code}
[jira] [Assigned] (SPARK-25993) Add test cases for resolution of ORC table location
[ https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25993: - Assignee: kevin yu > Add test cases for resolution of ORC table location > --- > > Key: SPARK-25993 > URL: https://issues.apache.org/jira/browse/SPARK-25993 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.2 >Reporter: Xiao Li >Assignee: kevin yu >Priority: Major > Labels: starter > > Add a test case based on the following example. The behavior was changed in > the 2.3 release. We also need to update the migration guide. > {code:java} > val someDF1 = Seq( > (1, 1, "blah"), > (1, 2, "blahblah") > ).toDF("folder", "number", "word").repartition(1) > someDF1.write.orc("/tmp/orctab1/dir1/") > someDF1.write.orc("/tmp/orctab1/dir2/") > create external table tab1(folder int,number int,word string) STORED AS ORC > LOCATION '/tmp/orctab1/'; > select * from tab1; > create external table tab2(folder int,number int,word string) STORED AS ORC > LOCATION '/tmp/orctab1/*'; > select * from tab2; > {code}
[jira] [Resolved] (SPARK-25993) Add test cases for resolution of ORC table location
[ https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25993. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27130 [https://github.com/apache/spark/pull/27130] > Add test cases for resolution of ORC table location > --- > > Key: SPARK-25993 > URL: https://issues.apache.org/jira/browse/SPARK-25993 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.2 >Reporter: Xiao Li >Assignee: kevin yu >Priority: Major > Labels: starter > Fix For: 3.0.0 > > > Add a test case based on the following example. The behavior was changed in > the 2.3 release. We also need to update the migration guide. > {code:java} > val someDF1 = Seq( > (1, 1, "blah"), > (1, 2, "blahblah") > ).toDF("folder", "number", "word").repartition(1) > someDF1.write.orc("/tmp/orctab1/dir1/") > someDF1.write.orc("/tmp/orctab1/dir2/") > create external table tab1(folder int,number int,word string) STORED AS ORC > LOCATION '/tmp/orctab1/'; > select * from tab1; > create external table tab2(folder int,number int,word string) STORED AS ORC > LOCATION '/tmp/orctab1/*'; > select * from tab2; > {code}
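The behavior these test cases target can be shown with a plain-Python sketch (no Spark involved; the helper names are illustrative): a table whose LOCATION is a parent directory only sees data in subdirectories such as dir1/ and dir2/ if the reader lists files recursively.

```python
import os
import tempfile

# Hypothetical helpers contrasting the two listing strategies a reader
# of an external table LOCATION could use.
def list_top_level(root):
    # non-recursive: only files directly under LOCATION are visible
    return [entry.path for entry in os.scandir(root) if entry.is_file()]

def list_recursive(root):
    # recursive: files inside subdirectories are found as well
    return [os.path.join(d, f) for d, _, files in os.walk(root) for f in files]

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "dir1"))
# simulate a data file written into a subdirectory, as in the ticket's example
open(os.path.join(root, "dir1", "part-00000.orc"), "w").close()

# A non-recursive reader sees no data under root; a recursive one finds dir1's file.
```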
[jira] [Comment Edited] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018329#comment-17018329 ] Nick Afshartous edited comment on SPARK-27249 at 1/17/20 9:29 PM: -- [~enrush] Hi Everett, The {{Dataset}} API has an experimental function {{mapPartitions}} for transforming {{Datasets}}. Does this satisfy your requirements? https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset was (Author: nafshartous): [~enrush] Hi Everett, The {{Dataset}} API has an experimental function {{mapPartitions}} for transforming {{Dataset}}s. Does this satisfy your requirements? https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset > Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png > > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (i.e. UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize > repeatedly in a UDF such as a database connection. > > Design: > Abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row] > NB: This parallels the UnaryTransformer createTransformFunc method > > When developers subclass this transformer, they can provide their own schema > for the output Row in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively the developer can set > output Datatype and output col name. Then the PartitionTransformer class will > create a new schema, a row encoder, and execute the transformation.
[jira] [Commented] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018329#comment-17018329 ] Nick Afshartous commented on SPARK-27249: - [~enrush] Hi Everett, The {{Dataset}} API has an experimental function {{mapPartitions}} for transforming {{Dataset}}s. Does this satisfy your requirements? https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset > Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png > > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (i.e. UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize > repeatedly in a UDF such as a database connection. > > Design: > Abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row] > NB: This parallels the UnaryTransformer createTransformFunc method > > When developers subclass this transformer, they can provide their own schema > for the output Row in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively the developer can set > output Datatype and output col name. Then the PartitionTransformer class will > create a new schema, a row encoder, and execute the transformation.
[jira] [Comment Edited] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018329#comment-17018329 ] Nick Afshartous edited comment on SPARK-27249 at 1/17/20 9:28 PM: -- [~enrush] Hi Everett, The {{Dataset}} API has an experimental function {{mapPartitions}} for transforming {{Dataset}}s. Does this satisfy your requirements? https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset was (Author: nafshartous): [~enrush] Hi Everett, The {{Dataset}} API has an experimental function {{mapPartitions}} for transforming {{Dataset}} . Does this satisfy your requirements? https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset > Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png > > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (i.e. UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize > repeatedly in a UDF such as a database connection. > > Design: > Abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row] > NB: This parallels the UnaryTransformer createTransformFunc method > > When developers subclass this transformer, they can provide their own schema > for the output Row in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively the developer can set > output Datatype and output col name. Then the PartitionTransformer class will > create a new schema, a row encoder, and execute the transformation.
[jira] [Updated] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Afshartous updated SPARK-27249: Attachment: Screen Shot 2020-01-17 at 4.20.57 PM.png > Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png > > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (i.e. UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize > repeatedly in a UDF such as a database connection. > > Design: > Abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row] > NB: This parallels the UnaryTransformer createTransformFunc method > > When developers subclass this transformer, they can provide their own schema > for the output Row in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively the developer can set > output Datatype and output col name. Then the PartitionTransformer class will > create a new schema, a row encoder, and execute the transformation.
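The mapPartitions pattern suggested in the comments can be sketched in plain Python (the Connection class is a stand-in for an expensive resource such as a database connection, not a real API): the transform function receives a whole partition as an iterator, so the resource is built once per partition rather than once per row, and all columns of each row are available together.

```python
# Plain-Python sketch of the per-partition transform the ticket discusses.
class Connection:
    instances = 0                      # count initializations, for illustration

    def __init__(self):
        Connection.instances += 1      # imagine an expensive handshake here

    def enrich(self, row):
        return {**row, "tagged": True}

def transform_partition(rows):
    conn = Connection()                # one init per partition, not per row
    for row in rows:
        yield conn.enrich(row)         # multiple columns of the row are visible

partition = [{"id": 1, "word": "blah"}, {"id": 2, "word": "blahblah"}]
out = list(transform_partition(iter(partition)))
```

In Spark, the equivalent would be passing such a function to `Dataset.mapPartitions`; the proposed PartitionTransformer would wrap the same Iterator-to-Iterator shape in a Transformer so it composes with ML Pipelines.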
[jira] [Commented] (SPARK-30557) Add public documentation for SPARK_SUBMIT_OPTS
[ https://issues.apache.org/jira/browse/SPARK-30557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018256#comment-17018256 ] Marcelo Masiero Vanzin commented on SPARK-30557: I don't exactly remember what that does, but a quick look seems to indicate it's basically another way of setting JVM options, used in some internal code. We have {{--driver-java-options}} for users already. > Add public documentation for SPARK_SUBMIT_OPTS > -- > > Key: SPARK-30557 > URL: https://issues.apache.org/jira/browse/SPARK-30557 > Project: Spark > Issue Type: Improvement > Components: Deploy, Documentation >Affects Versions: 2.4.4 >Reporter: Nicholas Chammas >Priority: Minor > > Is `SPARK_SUBMIT_OPTS` part of Spark's public interface? If so, it needs some > documentation. I cannot see it documented > [anywhere|https://github.com/apache/spark/search?q=SPARK_SUBMIT_OPTS_q=SPARK_SUBMIT_OPTS] > in the docs. > How do you use it? What is it useful for? What's an example usage? etc.
[jira] [Commented] (SPARK-30557) Add public documentation for SPARK_SUBMIT_OPTS
[ https://issues.apache.org/jira/browse/SPARK-30557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018254#comment-17018254 ] Nicholas Chammas commented on SPARK-30557: -- [~vanzin] - Do you know if this is something we should document? Would the documentation go in [Submitting Applications|https://spark.apache.org/docs/latest/submitting-applications.html]? (Another possibility is to put it in spark-env.sh.template.) > Add public documentation for SPARK_SUBMIT_OPTS > -- > > Key: SPARK-30557 > URL: https://issues.apache.org/jira/browse/SPARK-30557 > Project: Spark > Issue Type: Improvement > Components: Deploy, Documentation >Affects Versions: 2.4.4 >Reporter: Nicholas Chammas >Priority: Minor > > Is `SPARK_SUBMIT_OPTS` part of Spark's public interface? If so, it needs some > documentation. I cannot see it documented > [anywhere|https://github.com/apache/spark/search?q=SPARK_SUBMIT_OPTS_q=SPARK_SUBMIT_OPTS] > in the docs. > How do you use it? What is it useful for? What's an example usage? etc.
[jira] [Commented] (SPARK-27868) Better document shuffle / RPC listen backlog
[ https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018242#comment-17018242 ] Dongjoon Hyun commented on SPARK-27868: --- I see. Thank you, [~vanzin]. > Better document shuffle / RPC listen backlog > > > Key: SPARK-27868 > URL: https://issues.apache.org/jira/browse/SPARK-27868 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.4.3 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > The option to control the listen socket backlog for RPC and shuffle servers > is not documented in our public docs. > The only piece of documentation is in a Java class, and even that > documentation is incorrect: > {code} > /** Requested maximum length of the queue of incoming connections. Default > -1 for no backlog. */ > public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, > -1); } > {code} > The default value actually causes the default value from the JRE to be used, > which is 50 according to the docs.
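The corrected semantics can be sketched in plain Python sockets (the helper name is hypothetical, not Spark code): a configured backlog of -1 does not mean "no backlog"; it means no explicit value is passed, so the runtime's own default applies (50 per the JRE docs referenced above).

```python
import socket

# Hypothetical helper mirroring the Java snippet's intent: only a positive
# configured value is passed through; -1 (or any non-positive value) defers
# to the runtime's default listen backlog.
def listen_with_backlog(sock, configured_backlog):
    if configured_backlog > 0:
        sock.listen(configured_backlog)   # explicit pending-connection queue length
    else:
        sock.listen()                     # fall back to the implementation default

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))             # port 0: let the OS pick a free port
listen_with_backlog(server, -1)           # analogous to Spark's -1 default
```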
[jira] [Resolved] (SPARK-29876) Delete/archive file source completed files in separate thread
[ https://issues.apache.org/jira/browse/SPARK-29876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29876. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26502 [https://github.com/apache/spark/pull/26502] > Delete/archive file source completed files in separate thread > - > > Key: SPARK-29876 > URL: https://issues.apache.org/jira/browse/SPARK-29876 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.0 > > > SPARK-20568 added the possibility to clean up completed files in streaming > query. Deleting/archiving uses the main thread which can slow down > processing. It would be good to do this on separate thread(s).
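The idea behind the change can be sketched in plain Python (the names `cleanup_pool` and `delete_completed` are illustrative, not Spark's): completed source files are handed to a small background pool so the main processing loop is not blocked by filesystem deletes or archive moves.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch only: a dedicated pool absorbs cleanup latency
# instead of the main processing thread.
cleanup_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="file-cleanup")

def delete_completed(path):
    # in the real feature this is a delete or an archive move
    os.remove(path)

# simulate a completed source file
fd, path = tempfile.mkstemp()
os.close(fd)

# Main loop stays responsive: the deletion runs off-thread.
future = cleanup_pool.submit(delete_completed, path)
future.result()  # shown here only to make the example deterministic
```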
[jira] [Assigned] (SPARK-29876) Delete/archive file source completed files in separate thread
[ https://issues.apache.org/jira/browse/SPARK-29876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-29876: -- Assignee: Gabor Somogyi > Delete/archive file source completed files in separate thread > - > > Key: SPARK-29876 > URL: https://issues.apache.org/jira/browse/SPARK-29876 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > > SPARK-20568 added the possibility to clean up completed files in streaming > query. Deleting/archiving uses the main thread which can slow down > processing. It would be good to do this on separate thread(s).
[jira] [Created] (SPARK-30557) Add public documentation for SPARK_SUBMIT_OPTS
Nicholas Chammas created SPARK-30557: Summary: Add public documentation for SPARK_SUBMIT_OPTS Key: SPARK-30557 URL: https://issues.apache.org/jira/browse/SPARK-30557 Project: Spark Issue Type: Improvement Components: Deploy, Documentation Affects Versions: 2.4.4 Reporter: Nicholas Chammas Is `SPARK_SUBMIT_OPTS` part of Spark's public interface? If so, it needs some documentation. I cannot see it documented [anywhere|https://github.com/apache/spark/search?q=SPARK_SUBMIT_OPTS_q=SPARK_SUBMIT_OPTS] in the docs. How do you use it? What is it useful for? What's an example usage? etc.
[jira] [Commented] (SPARK-27868) Better document shuffle / RPC listen backlog
[ https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018219#comment-17018219 ] Marcelo Masiero Vanzin commented on SPARK-27868: It's ok for now, it's done. Hopefully 3.0 will come out soon and the "main" documentation on the site will have the info. > Better document shuffle / RPC listen backlog > > > Key: SPARK-27868 > URL: https://issues.apache.org/jira/browse/SPARK-27868 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.4.3 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > The option to control the listen socket backlog for RPC and shuffle servers > is not documented in our public docs. > The only piece of documentation is in a Java class, and even that > documentation is incorrect: > {code} > /** Requested maximum length of the queue of incoming connections. Default > -1 for no backlog. */ > public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, > -1); } > {code} > The default value actually causes the default value from the JRE to be used, > which is 50 according to the docs.
[jira] [Updated] (SPARK-22590) Broadcast thread propagates the localProperties to task
[ https://issues.apache.org/jira/browse/SPARK-22590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S updated SPARK-22590: Affects Version/s: 3.0.0 2.4.4 > Broadcast thread propagates the localProperties to task > --- > > Key: SPARK-22590 > URL: https://issues.apache.org/jira/browse/SPARK-22590 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.4.4, 3.0.0 >Reporter: Ajith S >Priority: Major > Labels: bulk-closed > Attachments: TestProps.scala > > > Local properties set via sparkContext are not available as TaskContext > properties when executing parallel jobs and threadpools have idle threads > Explanation: > When executing parallel jobs via {{BroadcastExchangeExec}}, the > {{relationFuture}} is evaluated via a separate thread. The threads inherit > the {{localProperties}} from sparkContext as they are the child threads. > These threads are controlled via the executionContext (thread pools). Each > Thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle > threads. > Scenarios where the thread pool has threads which are idle and reused for a > subsequent new query, the thread local properties will not be inherited from > spark context (thread properties are inherited only on thread creation) hence > end up having old or no properties set. This will cause taskset properties to > be missing when properties are transferred by child thread via > {{sparkContext.runJob/submitJob}} > Attached is a test-case to simulate this behavior
[jira] [Created] (SPARK-30556) SubqueryExec passes local properties to SubqueryExec.executionContext
Ajith S created SPARK-30556: --- Summary: SubqueryExec passes local properties to SubqueryExec.executionContext Key: SPARK-30556 URL: https://issues.apache.org/jira/browse/SPARK-30556 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4, 3.0.0 Reporter: Ajith S Local properties set via sparkContext are not available as TaskContext properties when executing jobs and threadpools have idle threads which are reused Explanation: When executing SubqueryExec, the {{relationFuture}} is evaluated via a separate thread. The threads inherit the {{localProperties}} from sparkContext as they are the child threads. These threads are controlled via the executionContext (thread pools). Each Thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads. Scenarios where the thread pool has threads which are idle and reused for a subsequent new query, the thread local properties will not be inherited from spark context (thread properties are inherited only on thread creation) hence end up having old or no properties set. This will cause taskset properties to be missing when properties are transferred by child thread via {{sparkContext.runJob/submitJob}}
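The failure mode described above can be reproduced in plain Python with no Spark involved (all names here are stand-ins: `parent_props` plays the role of SparkContext's localProperties, and the pool initializer mimics Java's InheritableThreadLocal, which copies values only when a thread is spawned): values inherited at thread creation go stale when an idle pool thread is reused.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

parent_props = {}                 # stand-in for SparkContext.localProperties
inherited = threading.local()

def snapshot_on_creation():
    # runs once per pool thread, at creation time -- this is the only
    # moment the "child" copies the parent's properties
    inherited.props = dict(parent_props)

def read_inherited():
    return getattr(inherited, "props", None)

pool = ThreadPoolExecutor(max_workers=1, initializer=snapshot_on_creation)
pool.submit(read_inherited).result()            # forces the worker to be created now

parent_props["spark.jobGroup.id"] = "query-2"   # set *after* the thread was created
stale = pool.submit(read_inherited).result()    # reused idle thread, stale snapshot
```

The reused thread still holds the empty snapshot taken at creation, so the newly set property never reaches it; this is exactly why `relationFuture` threads can run tasks with old or missing local properties.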
[jira] [Assigned] (SPARK-30041) Add Codegen Stage Id to Stage DAG visualization in Web UI
[ https://issues.apache.org/jira/browse/SPARK-30041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30041: --- Assignee: Luca Canali > Add Codegen Stage Id to Stage DAG visualization in Web UI > - > > Key: SPARK-30041 > URL: https://issues.apache.org/jira/browse/SPARK-30041 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Attachments: Snippet_StagesDags_with_CodegenId _annotated.png > > > SPARK-29894 provides information on the Codegen Stage Id in WEBUI for SQL > Plan graphs. Similarly, this proposes to add Codegen Stage Id in the DAG > visualization for Stage execution. DAGs for Stage execution are available in > the WEBUI under the Jobs and Stages tabs. > This is proposed as an aid for drill-down analysis of complex SQL statement > execution, as it is not always easy to match parts of the SQL Plan graph with > the corresponding Stage DAG execution graph. Adding Codegen Stage Id for > WholeStageCodegen operations makes this task easier. >
[jira] [Resolved] (SPARK-30041) Add Codegen Stage Id to Stage DAG visualization in Web UI
[ https://issues.apache.org/jira/browse/SPARK-30041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30041. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26675 [https://github.com/apache/spark/pull/26675] > Add Codegen Stage Id to Stage DAG visualization in Web UI > - > > Key: SPARK-30041 > URL: https://issues.apache.org/jira/browse/SPARK-30041 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.0.0 > > Attachments: Snippet_StagesDags_with_CodegenId _annotated.png > > > SPARK-29894 provides information on the Codegen Stage Id in WEBUI for SQL > Plan graphs. Similarly, this proposes to add Codegen Stage Id in the DAG > visualization for Stage execution. DAGs for Stage execution are available in > the WEBUI under the Jobs and Stages tabs. > This is proposed as an aid for drill-down analysis of complex SQL statement > execution, as it is not always easy to match parts of the SQL Plan graph with > the corresponding Stage DAG execution graph. Adding Codegen Stage Id for > WholeStageCodegen operations makes this task easier. >
[jira] [Updated] (SPARK-22590) Broadcast thread propagates the localProperties to task
[ https://issues.apache.org/jira/browse/SPARK-22590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S updated SPARK-22590: Summary: Broadcast thread propagates the localProperties to task (was: SparkContext's local properties missing from TaskContext properties) > Broadcast thread propagates the localProperties to task > --- > > Key: SPARK-22590 > URL: https://issues.apache.org/jira/browse/SPARK-22590 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ajith S >Priority: Major > Labels: bulk-closed > Attachments: TestProps.scala > > > Local properties set via sparkContext are not available as TaskContext > properties when executing parallel jobs and thread pools have idle threads. > Explanation: > When executing parallel jobs via {{BroadcastExchangeExec}}, the > {{relationFuture}} is evaluated on a separate thread. The threads inherit > the {{localProperties}} from sparkContext as they are its child threads. > These threads are controlled via the executionContext (thread pools). Each > thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle > threads. > In scenarios where the thread pool has idle threads that are reused for a > subsequent query, the thread-local properties will not be inherited from the > spark context (thread properties are inherited only on thread creation), so the > threads end up with old or no properties set. This causes taskset properties to > be missing when properties are transferred by the child thread via > {{sparkContext.runJob/submitJob}}. > Attached is a test case that simulates this behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22590) SparkContext's local properties missing from TaskContext properties
[ https://issues.apache.org/jira/browse/SPARK-22590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S updated SPARK-22590: Description: Local properties set via sparkContext are not available as TaskContext properties when executing parallel jobs and threadpools have idle threads Explanation: When executing parallel jobs via {{BroadcastExchangeExec}}, the {{relationFuture}} is evaluated via a seperate thread. The threads inherit the {{localProperties}} from sparkContext as they are the child threads. These threads are controlled via the executionContext (thread pools). Each Thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads. Scenarios where the thread pool has threads which are idle and reused for a subsequent new query, the thread local properties will not be inherited from spark context (thread properties are inherited only on thread creation) hence end up having old or no properties set. This will cause taskset properties to be missing when properties are transferred by child thread via {{sparkContext.runJob/submitJob}} Attached is a test-case to simulate this behavior was: Local properties set via sparkContext are not available as TaskContext properties when executing parallel jobs and threadpools have idle threads Explanation: When executing parallel jobs via {{BroadcastExchangeExec}} or {{SubqueryExec}}, the {{relationFuture}} is evaluated via a seperate thread. The threads inherit the {{localProperties}} from sparkContext as they are the child threads. These threads are controlled via the executionContext (thread pools). Each Thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads. Scenarios where the thread pool has threads which are idle and reused for a subsequent new query, the thread local properties will not be inherited from spark context (thread properties are inherited only on thread creation) hence end up having old or no properties set. 
This will cause taskset properties to be missing when properties are transferred by child thread via {{sparkContext.runJob/submitJob}} Attached is a test-case to simulate this behavior > SparkContext's local properties missing from TaskContext properties > --- > > Key: SPARK-22590 > URL: https://issues.apache.org/jira/browse/SPARK-22590 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ajith S >Priority: Major > Labels: bulk-closed > Attachments: TestProps.scala > > > Local properties set via sparkContext are not available as TaskContext > properties when executing parallel jobs and threadpools have idle threads > Explanation: > When executing parallel jobs via {{BroadcastExchangeExec}}, the > {{relationFuture}} is evaluated via a seperate thread. The threads inherit > the {{localProperties}} from sparkContext as they are the child threads. > These threads are controlled via the executionContext (thread pools). Each > Thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle > threads. > Scenarios where the thread pool has threads which are idle and reused for a > subsequent new query, the thread local properties will not be inherited from > spark context (thread properties are inherited only on thread creation) hence > end up having old or no properties set. This will cause taskset properties to > be missing when properties are transferred by child thread via > {{sparkContext.runJob/submitJob}} > Attached is a test-case to simulate this behavior -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-22590) SparkContext's local properties missing from TaskContext properties
[ https://issues.apache.org/jira/browse/SPARK-22590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S reopened SPARK-22590: - Adding Fix > SparkContext's local properties missing from TaskContext properties > --- > > Key: SPARK-22590 > URL: https://issues.apache.org/jira/browse/SPARK-22590 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ajith S >Priority: Major > Labels: bulk-closed > Attachments: TestProps.scala > > > Local properties set via sparkContext are not available as TaskContext > properties when executing parallel jobs and threadpools have idle threads > Explanation: > When executing parallel jobs via {{BroadcastExchangeExec}} or > {{SubqueryExec}}, the {{relationFuture}} is evaluated via a seperate thread. > The threads inherit the {{localProperties}} from sparkContext as they are the > child threads. > These threads are controlled via the executionContext (thread pools). Each > Thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle > threads. > Scenarios where the thread pool has threads which are idle and reused for a > subsequent new query, the thread local properties will not be inherited from > spark context (thread properties are inherited only on thread creation) hence > end up having old or no properties set. This will cause taskset properties to > be missing when properties are transferred by child thread via > {{sparkContext.runJob/submitJob}} > Attached is a test-case to simulate this behavior -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
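The root cause described in SPARK-22590 above — thread-local properties are copied to a child thread only at thread creation, never when an idle pooled thread is reused — can be reproduced with a plain JDK thread pool, independent of Spark. This is a minimal sketch with illustrative names, not Spark's actual BroadcastExchangeExec code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class InheritDemo {
    // InheritableThreadLocal values are copied to a child thread only when the
    // thread is constructed -- never when an already-running pooled thread is reused.
    static final InheritableThreadLocal<String> PROP = new InheritableThreadLocal<>();

    static List<String> run() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        List<String> seen = new ArrayList<>();
        try {
            PROP.set("first-query");
            // The pool's single worker thread is created by this first submit,
            // so it inherits "first-query" from the submitting thread.
            seen.add(pool.submit(PROP::get).get());

            PROP.set("second-query");
            // The same worker thread is reused: no inheritance happens,
            // so it still sees the stale value from the first query.
            seen.add(pool.submit(PROP::get).get());
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [first-query, first-query]
    }
}
```

Both submits observe "first-query": the second query's property never reaches the pooled thread, which mirrors the stale/missing TaskContext properties reported in the ticket.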
[jira] [Created] (SPARK-30555) MERGE INTO insert action should only access columns from source table
Wenchen Fan created SPARK-30555: --- Summary: MERGE INTO insert action should only access columns from source table Key: SPARK-30555 URL: https://issues.apache.org/jira/browse/SPARK-30555 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30554) Return Iterable from FailureSafeParser.rawParser
Maxim Gekk created SPARK-30554: -- Summary: Return Iterable from FailureSafeParser.rawParser Key: SPARK-30554 URL: https://issues.apache.org/jira/browse/SPARK-30554 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: Maxim Gekk Currently, the rawParser of FailureSafeParser has the signature `IN => Seq[InternalRow]`, which is an overly strict requirement on the underlying parser; it could return an Option, for example, as the CSV datasource does. This ticket aims to refactor FailureSafeParser and change Seq to Iterable. The CSV and JSON datasources, where FailureSafeParser is used, need to be modified accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
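The relaxation proposed in SPARK-30554 can be sketched outside Spark: a failure-safe wrapper whose raw parser returns any Iterable of rows (zero, one, or many) instead of being forced to build a Seq. This is a hypothetical sketch — String stands in for InternalRow, and the names only loosely mirror Spark's classes:

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.function.Function;

// Hypothetical sketch of the proposed signature change: the wrapped raw parser
// may return any Iterable, so a zero-or-one-row parser (the Option case the
// ticket mentions for CSV) plugs in directly.
public class FailureSafeParserSketch<IN> {
    private final Function<IN, Iterable<String>> rawParser;

    public FailureSafeParserSketch(Function<IN, Iterable<String>> rawParser) {
        this.rawParser = rawParser;
    }

    public Iterator<String> parse(IN input) {
        try {
            return rawParser.apply(input).iterator();
        } catch (RuntimeException e) {
            // stand-in for PERMISSIVE mode: drop the malformed record
            return Collections.emptyIterator();
        }
    }

    public static void main(String[] args) {
        // A CSV-like parser that yields at most one row per input line:
        FailureSafeParserSketch<String> parser = new FailureSafeParserSketch<>(
                line -> line.isEmpty()
                        ? Collections.emptyList()
                        : Collections.singletonList(line.toUpperCase()));
        System.out.println(parser.parse("a,b").next()); // A,B
        System.out.println(parser.parse("").hasNext()); // false
    }
}
```

With `Iterable` as the contract, the wrapper never needs to materialize a full sequence; the same `parse` works for empty, single-row, and multi-row raw parsers.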
[jira] [Updated] (SPARK-30553) structured-streaming documentation java watermark group by
[ https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bettermouse updated SPARK-30553: Description: [http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking] I wrote code following this guide in both Java and Scala.
java
{code:java}
public static void main(String[] args) throws StreamingQueryException {
    SparkSession spark = SparkSession.builder().appName("test").master("local[*]")
        .config("spark.sql.shuffle.partitions", 1)
        .getOrCreate();
    Dataset<Row> lines = spark.readStream().format("socket")
        .option("host", "skynet")
        .option("includeTimestamp", true)
        .option("port", ).load();
    Dataset<Row> words = lines.select("timestamp", "value");
    Dataset<Row> count = words.withWatermark("timestamp", "10 seconds")
        .groupBy(functions.window(words.col("timestamp"), "10 seconds", "10 seconds"),
            words.col("value")).count();
    StreamingQuery start = count.writeStream()
        .outputMode("update")
        .format("console").start();
    start.awaitTermination();
}
{code}
scala
{code:java}
def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder.appName("test").
    master("local[*]").
    config("spark.sql.shuffle.partitions", 1)
    .getOrCreate
  import spark.implicits._
  val lines = spark.readStream.format("socket").
    option("host", "skynet").option("includeTimestamp", true).
    option("port", ).load
  val words = lines.select("timestamp", "value")
  val count = words.withWatermark("timestamp", "10 seconds").
    groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"value")
    .count()
  val start = count.writeStream.outputMode("update").format("console").start
  start.awaitTermination()
}
{code}
Both programs follow the official documentation, but with the Java version I found that the metric "stateOnCurrentVersionSizeBytes" always increases, while the Scala version is fine.
java {code:java} == Physical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4176a001 +- *(4) HashAggregate(keys=[window#11, value#0], functions=[count(1)], output=[window#11, value#0, count#10L]) +- StateStoreSave [window#11, value#0], state info [ checkpoint = file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state, runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, numPartitions = 1], Update, 1579274372624, 2 +- *(3) HashAggregate(keys=[window#11, value#0], functions=[merge_count(1)], output=[window#11, value#0, count#21L]) +- StateStoreRestore [window#11, value#0], state info [ checkpoint = file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state, runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, numPartitions = 1], 2 +- *(2) HashAggregate(keys=[window#11, value#0], functions=[merge_count(1)], output=[window#11, value#0, count#21L]) +- Exchange hashpartitioning(window#11, value#0, 1) +- *(1) HashAggregate(keys=[window#11, value#0], functions=[partial_count(1)], output=[window#11, value#0, count#21L]) +- *(1) Project [named_struct(start, precisetimestampconversion(CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) as double) = (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) THEN (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 0), LongType, TimestampType), end, precisetimestampconversion(CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) as double) = (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 
1.0E7)) THEN (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 1000), LongType, TimestampType)) AS window#11, value#0] +- *(1) Filter isnotnull(timestamp#1) +- EventTimeWatermark timestamp#1: timestamp, interval 10 seconds +- LocalTableScan , [timestamp#1, value#0] {code} scala {code:java} WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4149892c +- *(4)
[jira] [Updated] (SPARK-30553) structured-streaming documentation java watermark group by
[ https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bettermouse updated SPARK-30553: Priority: Trivial (was: Major) > structured-streaming documentation java watermark group by > - > > Key: SPARK-30553 > URL: https://issues.apache.org/jira/browse/SPARK-30553 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.4 >Reporter: bettermouse >Priority: Trivial > > [http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking] > I write code according to this by java and scala. > java > {code:java} > public static void main(String[] args) throws StreamingQueryException { > SparkSession spark = > SparkSession.builder().appName("test").master("local[*]") > .config("spark.sql.shuffle.partitions", 1) > .getOrCreate();Dataset lines = > spark.readStream().format("socket") > .option("host", "skynet") > .option("includeTimestamp",true) > .option("port", ).load(); > Dataset words = lines.select("timestamp", "value"); > Dataset count = words.withWatermark("timestamp", "10 seconds") > .groupBy(functions.window(words.col("timestamp"), "10 > seconds", "10 seconds") > , words.col("value")).count(); > StreamingQuery start = count.writeStream() > .outputMode("update") > .format("console").start(); > start.awaitTermination();} > {code} > scala > > {code:java} > def main(args: Array[String]): Unit = { > val spark = SparkSession.builder.appName("test"). > master("local[*]"). > config("spark.sql.shuffle.partitions", 1) > .getOrCreate > import spark.implicits._ > val lines = spark.readStream.format("socket"). > option("host", "skynet").option("includeTimestamp", true). > option("port", ).load > val words = lines.select("timestamp", "value") > val count = words.withWatermark("timestamp", "10 seconds"). 
> groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"value") > .count() > val start = count.writeStream.outputMode("update").format("console").start > start.awaitTermination() > } > {code} > This is according to official documents. written in Java I found metrics > "stateOnCurrentVersionSizeBytes" always increase .but scala is ok. > > java > > {code:java} > == Physical Plan == > == Physical Plan == > WriteToDataSourceV2 > org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4176a001 > +- *(4) HashAggregate(keys=[window#11, value#0], functions=[count(1)], > output=[window#11, value#0, count#10L]) >+- StateStoreSave [window#11, value#0], state info [ checkpoint = > file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state, > runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, > numPartitions = 1], Update, 1579274372624, 2 > +- *(3) HashAggregate(keys=[window#11, value#0], > functions=[merge_count(1)], output=[window#11, value#0, count#21L]) > +- StateStoreRestore [window#11, value#0], state info [ checkpoint = > file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state, > runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, > numPartitions = 1], 2 > +- *(2) HashAggregate(keys=[window#11, value#0], > functions=[merge_count(1)], output=[window#11, value#0, count#21L]) >+- Exchange hashpartitioning(window#11, value#0, 1) > +- *(1) HashAggregate(keys=[window#11, value#0], > functions=[partial_count(1)], output=[window#11, value#0, count#21L]) > +- *(1) Project [named_struct(start, > precisetimestampconversion(CASE WHEN > (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, > LongType) - 0) as double) / 1.0E7)) as double) = > (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) > as double) / 1.0E7)) THEN > (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) > - 0) as double) / 
1.0E7)) + 1) ELSE > CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) > - 0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 0), LongType, > TimestampType), end, precisetimestampconversion(CASE WHEN > (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, > LongType) - 0) as double) / 1.0E7)) as double) = > (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) > as double) / 1.0E7)) THEN > (CEIL((cast((precisetimestampconversion(timestamp#1,
[jira] [Created] (SPARK-30553) structured-streaming documentation java watermark group by
bettermouse created SPARK-30553: --- Summary: structured-streaming documentation java watermark group by Key: SPARK-30553 URL: https://issues.apache.org/jira/browse/SPARK-30553 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.4.4 Reporter: bettermouse [http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking] I wrote code following this guide in both Java and Scala.
java
{code:java}
public static void main(String[] args) throws StreamingQueryException {
    SparkSession spark = SparkSession.builder().appName("test").master("local[*]")
        .config("spark.sql.shuffle.partitions", 1)
        .getOrCreate();
    Dataset<Row> lines = spark.readStream().format("socket")
        .option("host", "skynet")
        .option("includeTimestamp", true)
        .option("port", ).load();
    Dataset<Row> words = lines.select("timestamp", "value");
    Dataset<Row> count = words.withWatermark("timestamp", "10 seconds")
        .groupBy(functions.window(words.col("timestamp"), "10 seconds", "10 seconds"),
            words.col("value")).count();
    StreamingQuery start = count.writeStream()
        .outputMode("update")
        .format("console").start();
    start.awaitTermination();
}
{code}
scala
{code:java}
def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder.appName("test").
    master("local[*]").
    config("spark.sql.shuffle.partitions", 1)
    .getOrCreate
  import spark.implicits._
  val lines = spark.readStream.format("socket").
    option("host", "skynet").option("includeTimestamp", true).
    option("port", ).load
  val words = lines.select("timestamp", "value")
  val count = words.withWatermark("timestamp", "10 seconds").
    groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"value")
    .count()
  val start = count.writeStream.outputMode("update").format("console").start
  start.awaitTermination()
}
{code}
Both programs follow the official documentation, but with the Java version I found that the metric "stateOnCurrentVersionSizeBytes" always increases, while the Scala version is fine.
java {code:java} == Physical Plan == == Physical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4176a001 +- *(4) HashAggregate(keys=[window#11, value#0], functions=[count(1)], output=[window#11, value#0, count#10L]) +- StateStoreSave [window#11, value#0], state info [ checkpoint = file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state, runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, numPartitions = 1], Update, 1579274372624, 2 +- *(3) HashAggregate(keys=[window#11, value#0], functions=[merge_count(1)], output=[window#11, value#0, count#21L]) +- StateStoreRestore [window#11, value#0], state info [ checkpoint = file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state, runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, numPartitions = 1], 2 +- *(2) HashAggregate(keys=[window#11, value#0], functions=[merge_count(1)], output=[window#11, value#0, count#21L]) +- Exchange hashpartitioning(window#11, value#0, 1) +- *(1) HashAggregate(keys=[window#11, value#0], functions=[partial_count(1)], output=[window#11, value#0, count#21L]) +- *(1) Project [named_struct(start, precisetimestampconversion(CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) as double) = (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) THEN (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 0), LongType, TimestampType), end, precisetimestampconversion(CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) as double) = (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 
0) as double) / 1.0E7)) THEN (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 1000), LongType, TimestampType)) AS window#11, value#0] +- *(1) Filter isnotnull(timestamp#1) +- EventTimeWatermark timestamp#1: timestamp, interval 10 seconds
[jira] [Assigned] (SPARK-30310) SparkUncaughtExceptionHandler halts running process unexpectedly
[ https://issues.apache.org/jira/browse/SPARK-30310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30310: Assignee: Tin Hang To > SparkUncaughtExceptionHandler halts running process unexpectedly > > > Key: SPARK-30310 > URL: https://issues.apache.org/jira/browse/SPARK-30310 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Tin Hang To >Assignee: Tin Hang To >Priority: Major > > During 2.4.x testing, we have many occasions where the Worker process would > just DEAD unexpectedly, with the Worker log ends with: > > {{ERROR SparkUncaughtExceptionHandler: scala.MatchError: <...callstack...>}} > > We get the same callstack during our 2.3.x testing but the Worker process > stays up. > Upon looking at the 2.4.x SparkUncaughtExceptionHandler.scala compared to the > 2.3.x version, we found out SPARK-24294 introduced the following change: > {{exception catch {}} > {{ case _: OutOfMemoryError =>}} > {{ System.exit(SparkExitCode.OOM)}} > {{ case e: SparkFatalException if e.throwable.isInstanceOf[OutOfMemoryError] > =>}} > {{ // SPARK-24294: This is defensive code, in case that > SparkFatalException is}} > {{ // misused and uncaught.}} > {{ System.exit(SparkExitCode.OOM)}} > {{ case _ if exitOnUncaughtException =>}} > {{ System.exit(SparkExitCode.UNCAUGHT_EXCEPTION)}} > {{}}} > > This code has the _ if exitOnUncaughtException case, but not the other _ > cases. As a result, when exitOnUncaughtException is false (Master and > Worker) and exception doesn't match any of the match cases (e.g., > IllegalStateException), Scala throws MatchError(exception) ("MatchError" > wrapper of the original exception). Then the other catch block down below > thinks we have another uncaught exception, and halts the entire process with > SparkExitCode.UNCAUGHT_EXCEPTION_TWICE. 
> > {{catch {}} > {{ case oom: OutOfMemoryError => Runtime.getRuntime.halt(SparkExitCode.OOM)}} > {{ case t: Throwable => > Runtime.getRuntime.halt(SparkExitCode.UNCAUGHT_EXCEPTION_TWICE)}} > {{}}} > > Therefore, even when exitOnUncaughtException is false, the process will halt. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30310) SparkUncaughtExceptionHandler halts running process unexpectedly
[ https://issues.apache.org/jira/browse/SPARK-30310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30310. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26955 [https://github.com/apache/spark/pull/26955] > SparkUncaughtExceptionHandler halts running process unexpectedly > > > Key: SPARK-30310 > URL: https://issues.apache.org/jira/browse/SPARK-30310 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Tin Hang To >Assignee: Tin Hang To >Priority: Major > Fix For: 3.0.0 > > > During 2.4.x testing, we have many occasions where the Worker process would > just DEAD unexpectedly, with the Worker log ends with: > > {{ERROR SparkUncaughtExceptionHandler: scala.MatchError: <...callstack...>}} > > We get the same callstack during our 2.3.x testing but the Worker process > stays up. > Upon looking at the 2.4.x SparkUncaughtExceptionHandler.scala compared to the > 2.3.x version, we found out SPARK-24294 introduced the following change: > {{exception catch {}} > {{ case _: OutOfMemoryError =>}} > {{ System.exit(SparkExitCode.OOM)}} > {{ case e: SparkFatalException if e.throwable.isInstanceOf[OutOfMemoryError] > =>}} > {{ // SPARK-24294: This is defensive code, in case that > SparkFatalException is}} > {{ // misused and uncaught.}} > {{ System.exit(SparkExitCode.OOM)}} > {{ case _ if exitOnUncaughtException =>}} > {{ System.exit(SparkExitCode.UNCAUGHT_EXCEPTION)}} > {{}}} > > This code has the _ if exitOnUncaughtException case, but not the other _ > cases. As a result, when exitOnUncaughtException is false (Master and > Worker) and exception doesn't match any of the match cases (e.g., > IllegalStateException), Scala throws MatchError(exception) ("MatchError" > wrapper of the original exception). Then the other catch block down below > thinks we have another uncaught exception, and halts the entire process with > SparkExitCode.UNCAUGHT_EXCEPTION_TWICE. 
> > {{catch {}} > {{ case oom: OutOfMemoryError => Runtime.getRuntime.halt(SparkExitCode.OOM)}} > {{ case t: Throwable => > Runtime.getRuntime.halt(SparkExitCode.UNCAUGHT_EXCEPTION_TWICE)}} > {{}}} > > Therefore, even when exitOnUncaughtException is false, the process will halt. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
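The control flow behind SPARK-30310 above — a non-exhaustive Scala match raises a MatchError, which the handler's own fallback catch then treats as a second uncaught exception — can be sketched in Java. This is an illustrative stand-in for the Scala code, not Spark's actual handler: the string return values stand in for the `System.exit`/`Runtime.getRuntime.halt` calls.

```java
public class HandlerDemo {
    // Stand-in for scala.MatchError, thrown when no pattern matches.
    static class MatchError extends RuntimeException {
        MatchError(Object scrutinee) { super(String.valueOf(scrutinee)); }
    }

    // Mirrors the shape of the 2.4.x match: an OOM case plus a guarded
    // catch-all. With exitOnUncaughtException == false, any non-OOM exception
    // (e.g. IllegalStateException) matches nothing -- Scala raises MatchError.
    static String classify(Throwable exception, boolean exitOnUncaughtException) {
        if (exception instanceof OutOfMemoryError) return "OOM";
        if (exitOnUncaughtException) return "UNCAUGHT_EXCEPTION";
        throw new MatchError(exception);
    }

    // The handler's outer catch then sees the MatchError as a *second*
    // uncaught exception and halts the process.
    static String handle(Throwable exception, boolean exitOnUncaughtException) {
        try {
            return classify(exception, exitOnUncaughtException);
        } catch (Throwable t) {
            return "UNCAUGHT_EXCEPTION_TWICE"; // halt() in the real handler
        }
    }

    public static void main(String[] args) {
        // Master and Worker run with exitOnUncaughtException = false,
        // yet the process still halts:
        System.out.println(handle(new IllegalStateException("boom"), false));
        // prints UNCAUGHT_EXCEPTION_TWICE
    }
}
```

Adding an unguarded catch-all case to the match (the fix merged in PR 26955) removes the MatchError path, so a Master or Worker with exitOnUncaughtException = false keeps running.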
[jira] [Commented] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018105#comment-17018105 ] Nick Afshartous commented on SPARK-27249: - Thanks Everett, and can someone with permission assign this ticket to me. > Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (ie UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize > repeatedly in a UDF such as a database connection. > > Design: > Abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row] > NB: This parallels the UnaryTransformer createTransformFunc method > > When developers subclass this transformer, they can provide their own schema > for the output Row in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively the developer can set > output Datatype and output col name. Then the PartitionTransformer class will > create a new schema, a row encoder, and execute the transformation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30552) Chained spark column expressions with distinct windows specs produce inefficient DAG
[ https://issues.apache.org/jira/browse/SPARK-30552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franz updated SPARK-30552: -- Environment: python : 3.6.9.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel pyspark: 2.4.4 pandas : 0.25.3 numpy : 1.17.4 pyarrow : 0.15.1 was: INSTALLED VERSIONS -- commit : None python : 3.6.9.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel byteorder : little LC_ALL : None LANG : de_DE.UTF-8 LOCALE : None.None pandas : 0.25.3 numpy : 1.17.4 pytz : 2019.3 dateutil : 2.8.1 pip : 19.3.1 setuptools : 41.6.0.post20191030 Cython : None pytest : 5.3.0 hypothesis : None sphinx : 2.2.1 blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 2.10.3 IPython : 7.11.1 pandas_datareader: None bs4 : None bottleneck : None fastparquet : None gcsfs : None lxml.etree : None matplotlib : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 0.15.1 pytables : None s3fs : None scipy : None sqlalchemy : None tables : None xarray : None xlrd : None xlwt : None xlsxwriter : None > Chained spark column expressions with distinct windows specs produce > inefficient DAG > > > Key: SPARK-30552 > URL: https://issues.apache.org/jira/browse/SPARK-30552 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.4 > Environment: python : 3.6.9.final.0 > python-bits : 64 > OS : Windows > OS-release : 10 > machine : AMD64 > processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel > pyspark: 2.4.4 > pandas : 0.25.3 > numpy : 1.17.4 > pyarrow : 0.15.1 >Reporter: Franz >Priority: Major > > h2. Context > Let's say you deal with time series data. Your desired outcome relies on > multiple window functions with distinct window specifications. 
The result may > resemble a single spark column expression, like an identifier for intervals. > h2. Status Quo > Usually, I don't store intermediate results with `df.withColumn` but rather > chain/stack column expressions and trust Spark to find the most effective DAG > (when dealing with a DataFrame). > h2. Reproducible example > However, in the following example (PySpark 2.4.4 standalone), storing an > intermediate result with `df.withColumn` reduces the DAG complexity. Let's > consider the following test setup:
> {code:python}
> import pandas as pd
> import numpy as np
> from pyspark.sql import SparkSession, Window
> from pyspark.sql import functions as F
> spark = SparkSession.builder.getOrCreate()
> dfp = pd.DataFrame(
>     {
>         "col1": np.random.randint(0, 5, size=100),
>         "col2": np.random.randint(0, 5, size=100),
>         "col3": np.random.randint(0, 5, size=100),
>         "col4": np.random.randint(0, 5, size=100),
>     }
> )
> df = spark.createDataFrame(dfp)
> df.show(5)
> +----+----+----+----+
> |col1|col2|col3|col4|
> +----+----+----+----+
> |   1|   2|   4|   1|
> |   0|   2|   3|   0|
> |   2|   0|   1|   0|
> |   4|   1|   1|   2|
> |   1|   3|   0|   4|
> +----+----+----+----+
> only showing top 5 rows
> {code}
> The computation is arbitrary. Basically we have 2 window specs and 3 computational steps. The 3 computational steps are dependent on each other and use alternating window specs:
> {code:python}
> w1 = Window.partitionBy("col1").orderBy("col2")
> w2 = Window.partitionBy("col3").orderBy("col4")
> # first step, arbitrary window func over 1st window
> step1 = F.lag("col3").over(w1)
> # second step, arbitrary window func over 2nd window with step 1
> step2 = F.lag(step1).over(w2)
> # third step, arbitrary window func over 1st window with step 2
> step3 = F.when(step2 > 1, F.max(step2).over(w1))
> df_result = df.withColumn("result", step3)
> {code}
> Inspecting the physical plan via `df_result.explain()` reveals 4 exchanges and sorts! However, only 3 should be necessary here because we change the window spec only twice.
> {code:python} > df_result.explain() > == Physical Plan == > *(7) Project [col1#0L, col2#1L, col3#2L, col4#3L, CASE WHEN (_we0#25L > 1) > THEN _we1#26L END AS result#22L] > +- Window [lag(_w0#23L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC > NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _we0#25L], [col3#2L], > [col4#3L ASC NULLS FIRST] >+- *(6) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(col3#2L, 200) > +- *(5) Project [col1#0L, col2#1L, col3#2L, col4#3L, _w0#23L, > _we1#26L] > +- Window [max(_w1#24L)
[jira] [Created] (SPARK-30552) Chained spark column expressions with distinct windows specs produce inefficient DAG
Franz created SPARK-30552: - Summary: Chained spark column expressions with distinct windows specs produce inefficient DAG Key: SPARK-30552 URL: https://issues.apache.org/jira/browse/SPARK-30552 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 2.4.4 Environment: INSTALLED VERSIONS -- commit : None python : 3.6.9.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel byteorder : little LC_ALL : None LANG : de_DE.UTF-8 LOCALE : None.None pandas : 0.25.3 numpy : 1.17.4 pytz : 2019.3 dateutil : 2.8.1 pip : 19.3.1 setuptools : 41.6.0.post20191030 Cython : None pytest : 5.3.0 hypothesis : None sphinx : 2.2.1 blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 2.10.3 IPython : 7.11.1 pandas_datareader: None bs4 : None bottleneck : None fastparquet : None gcsfs : None lxml.etree : None matplotlib : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 0.15.1 pytables : None s3fs : None scipy : None sqlalchemy : None tables : None xarray : None xlrd : None xlwt : None xlsxwriter : None Reporter: Franz h2. Context Let's say you deal with time series data. Your desired outcome relies on multiple window functions with distinct window specifications. The result may resemble a single spark column expression, like an identifier for intervals. h2. Status Quo Usually, I don't store intermediate results with `df.withColumn` but rather chain/stack column expressions and trust Spark to find the most effective DAG (when dealing with DataFrame). h2. Reproducible example However, in the following example (PySpark 2.4.4 standalone), storing an intermediate result with `df.withColumn` reduces the DAG complexity. 
Let's consider the following test setup: {code:python} import pandas as pd import numpy as np from pyspark.sql import SparkSession, Window from pyspark.sql import functions as F spark = SparkSession.builder.getOrCreate() dfp = pd.DataFrame( { "col1": np.random.randint(0, 5, size=100), "col2": np.random.randint(0, 5, size=100), "col3": np.random.randint(0, 5, size=100), "col4": np.random.randint(0, 5, size=100), } ) df = spark.createDataFrame(dfp) df.show(5) +----+----+----+----+ |col1|col2|col3|col4| +----+----+----+----+ | 1| 2| 4| 1| | 0| 2| 3| 0| | 2| 0| 1| 0| | 4| 1| 1| 2| | 1| 3| 0| 4| +----+----+----+----+ only showing top 5 rows {code} The computation is arbitrary. Basically we have 2 window specs and 3 computational steps. The 3 computational steps are dependent on each other and use alternating window specs: {code:python} w1 = Window.partitionBy("col1").orderBy("col2") w2 = Window.partitionBy("col3").orderBy("col4") # first step, arbitrary window func over 1st window step1 = F.lag("col3").over(w1) # second step, arbitrary window func over 2nd window with step 1 step2 = F.lag(step1).over(w2) # third step, arbitrary window func over 1st window with step 2 step3 = F.when(step2 > 1, F.max(step2).over(w1)) df_result = df.withColumn("result", step3) {code} Inspecting the physical plan via `df_result.explain()` reveals 4 exchanges and sorts! However, only 3 should be necessary here because we change the window spec only twice. 
{code:python} df_result.explain() == Physical Plan == *(7) Project [col1#0L, col2#1L, col3#2L, col4#3L, CASE WHEN (_we0#25L > 1) THEN _we1#26L END AS result#22L] +- Window [lag(_w0#23L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _we0#25L], [col3#2L], [col4#3L ASC NULLS FIRST] +- *(6) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(col3#2L, 200) +- *(5) Project [col1#0L, col2#1L, col3#2L, col4#3L, _w0#23L, _we1#26L] +- Window [max(_w1#24L) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS _we1#26L], [col1#0L], [col2#1L ASC NULLS FIRST] +- *(4) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(col1#0L, 200) +- *(3) Project [col1#0L, col2#1L, col3#2L, col4#3L, _w0#23L, _w1#24L] +- Window [lag(_w0#27L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _w1#24L], [col3#2L], [col4#3L ASC NULLS FIRST] +- *(2) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(col3#2L, 200) +- Window [lag(col3#2L, 1,
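A sketch of the workaround the report alludes to (assuming PySpark 2.4 and the same `df`, `w1`, `w2` as in the example above): materializing the `w2`-dependent intermediate result with `withColumn` before applying the final `w1` expression, so the remaining expressions all share one window spec.

```python
# Hedged sketch, not the reporter's exact code: store step2 explicitly so
# the planner only has to switch partitioning w1 -> w2 -> w1 (3 exchanges),
# instead of re-deriving the chained expression (4 exchanges).
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).selectExpr(
    "id % 5 as col1", "id % 7 as col2", "id % 3 as col3", "id % 11 as col4"
)

w1 = Window.partitionBy("col1").orderBy("col2")
w2 = Window.partitionBy("col3").orderBy("col4")

step1 = F.lag("col3").over(w1)
# materialize the only w2-dependent expression as a real column ...
df_tmp = df.withColumn("step2", F.lag(step1).over(w2))
# ... so the final expression depends on w1 alone
df_result = df_tmp.withColumn(
    "result", F.when(F.col("step2") > 1, F.max("step2").over(w1))
).drop("step2")

df_result.explain()  # compare the number of Exchange/Sort nodes to the plan above
```

Whether this always produces the minimal number of exchanges is plan-dependent; the point is that naming the intermediate column gives the optimizer an explicit reuse boundary.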
[jira] [Resolved] (SPARK-29306) Executors need to track what ResourceProfile they are created with
[ https://issues.apache.org/jira/browse/SPARK-29306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-29306. --- Fix Version/s: 3.0.0 Resolution: Fixed > Executors need to track what ResourceProfile they are created with > --- > > Key: SPARK-29306 > URL: https://issues.apache.org/jira/browse/SPARK-29306 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.0 > > > For stage level scheduling, the Executors need to report what ResourceProfile > they are created with so that the ExecutorMonitor can track them and the > ExecutorAllocationManager can use that information to know how many to > request, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path
[ https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017820#comment-17017820 ] Sivakumar edited comment on SPARK-30542 at 1/17/20 12:40 PM: - Hi Jungtaek, I thought this might be a feature that should be added to Structured Streaming. Earlier, with Spark DStreams, two jobs could have the same base path, but with Spark Structured Streaming I don't have that flexibility. I guess this is a feature that Structured Streaming should support. Also, please let me know if you have any workaround for this. was (Author: sparksiva): Earlier with Spark Dstreams two jobs can have a same base path. But with Spark structured streaming I don't have that flexibility. I guess this should be a feature that structured streaming should support. > Two Spark structured streaming jobs cannot write to same base path > -- > > Key: SPARK-30542 > URL: https://issues.apache.org/jira/browse/SPARK-30542 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Sivakumar >Priority: Major > > Hi All, > Spark Structured Streaming doesn't allow two structured streaming jobs to > write data to the same base directory, which is possible with DStreams. > As a _spark_metadata directory is created by default for one job, the second > job cannot use the same directory as its base path; since the _spark_metadata > directory is already created by the other job, it throws an exception. > Is there any workaround for this, other than creating separate base paths > for both jobs? > Is it possible to create the _spark_metadata directory elsewhere, or disable > it without any data loss? > If I had to change the base path for both jobs, then my whole framework > would be impacted, so I don't want to do that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path
[ https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sivakumar updated SPARK-30542: -- Comment: was deleted (was: Hi Jungtaek, I thought this might be a feature that should be added to structured streaming. Also Please lemme know If you have any work around for this.) > Two Spark structured streaming jobs cannot write to same base path > -- > > Key: SPARK-30542 > URL: https://issues.apache.org/jira/browse/SPARK-30542 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Sivakumar >Priority: Major > > Hi All, > Spark Structured Streaming doesn't allow two structured streaming jobs to > write data to the same base directory, which is possible with DStreams. > As a _spark_metadata directory is created by default for one job, the second > job cannot use the same directory as its base path; since the _spark_metadata > directory is already created by the other job, it throws an exception. > Is there any workaround for this, other than creating separate base paths > for both jobs? > Is it possible to create the _spark_metadata directory elsewhere, or disable > it without any data loss? > If I had to change the base path for both jobs, then my whole framework > would be impacted, so I don't want to do that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path
[ https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017967#comment-17017967 ] Sivakumar edited comment on SPARK-30542 at 1/17/20 12:40 PM: - Hi Jungtaek, I thought this might be a feature that should be added to Structured Streaming. Also, please let me know if you have any workaround for this. was (Author: sparksiva): Hi Jungtaek, I thought this might be a feature that should be added to structured streaming. Also Please lemme know If you have any work around for this. > Two Spark structured streaming jobs cannot write to same base path > -- > > Key: SPARK-30542 > URL: https://issues.apache.org/jira/browse/SPARK-30542 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Sivakumar >Priority: Major > > Hi All, > Spark Structured Streaming doesn't allow two structured streaming jobs to > write data to the same base directory, which is possible with DStreams. > As a _spark_metadata directory is created by default for one job, the second > job cannot use the same directory as its base path; since the _spark_metadata > directory is already created by the other job, it throws an exception. > Is there any workaround for this, other than creating separate base paths > for both jobs? > Is it possible to create the _spark_metadata directory elsewhere, or disable > it without any data loss? > If I had to change the base path for both jobs, then my whole framework > would be impacted, so I don't want to do that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path
[ https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017967#comment-17017967 ] Sivakumar commented on SPARK-30542: --- Hi Jungtaek, I thought this might be a feature that should be added to Structured Streaming. Also, please let me know if you have any workaround for this. > Two Spark structured streaming jobs cannot write to same base path > -- > > Key: SPARK-30542 > URL: https://issues.apache.org/jira/browse/SPARK-30542 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Sivakumar >Priority: Major > > Hi All, > Spark Structured Streaming doesn't allow two structured streaming jobs to > write data to the same base directory, which is possible with DStreams. > As a _spark_metadata directory is created by default for one job, the second > job cannot use the same directory as its base path; since the _spark_metadata > directory is already created by the other job, it throws an exception. > Is there any workaround for this, other than creating separate base paths > for both jobs? > Is it possible to create the _spark_metadata directory elsewhere, or disable > it without any data loss? > If I had to change the base path for both jobs, then my whole framework > would be impacted, so I don't want to do that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30282) Migrate SHOW TBLPROPERTIES to new framework
[ https://issues.apache.org/jira/browse/SPARK-30282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30282: Summary: Migrate SHOW TBLPROPERTIES to new framework (was: UnresolvedV2Relation should be resolved to temp view first) > Migrate SHOW TBLPROPERTIES to new framework > --- > > Key: SPARK-30282 > URL: https://issues.apache.org/jira/browse/SPARK-30282 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > For the following v2 commands, _Analyzer.ResolveTables_ does not check > against the temp views before resolving _UnresolvedV2Relation_, thus it > always resolves _UnresolvedV2Relation_ to a table: > * ALTER TABLE > * DESCRIBE TABLE > * SHOW TBLPROPERTIES > Thus, in the following example, 't' will be resolved to a table, not a temp > view: > {code:java} > sql("CREATE TEMPORARY VIEW t AS SELECT 2 AS i") > sql("CREATE TABLE testcat.ns.t USING csv AS SELECT 1 AS i") > sql("USE testcat.ns") > sql("SHOW TBLPROPERTIES t") // 't' is resolved to a table > {code} > For V2 commands, if a table is resolved to a temp view, it should error out > with a message that v2 command cannot handle temp views. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29299) Intermittently getting "Cannot create the managed table error" while creating table from spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-29299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017932#comment-17017932 ] Steve Loughran commented on SPARK-29299: Have you tried using the S3-optimised committer in EMR? It still materializes files in the destination on task commit, so potentially retains the issue - I'd like to know if it does. Thanks. > Intermittently getting "Cannot create the managed table error" while creating > table from spark 2.4 > -- > > Key: SPARK-29299 > URL: https://issues.apache.org/jira/browse/SPARK-29299 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Abhijeet >Priority: Major > > We are facing the below error in Spark 2.4 intermittently when saving a managed > table from Spark. > Error - > pyspark.sql.utils.AnalysisException: u"Can not create the managed > table('`hive_issue`.`table`'). The associated > location('s3://\{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table') > already exists.;" > Steps to reproduce-- > 1. Create a dataframe from mid-size data (a 30 MB CSV file) > 2. Save the dataframe as a table > 3. Terminate the session while the above-mentioned operation is in progress > Note-- > Session termination is just a way to reproduce this issue. In real time we > face this issue intermittently when running the same Spark jobs > multiple times. We use EMRFS and HDFS from the EMR cluster and we face the same > issue on both systems. > The only way we can fix this is by deleting the target folder where the table > keeps its files, which is not an option for us, as we need to keep historical > information in the table; hence we use APPEND mode while writing to the table. 
> Sample code-- > from pyspark.sql import SparkSession > sc = SparkSession.builder.enableHiveSupport().getOrCreate() > df = sc.read.csv("s3://\{sample-bucket}1/DATA/consumecomplians.csv") > print "STARTED WRITING TO TABLE" > # Terminate session using ctrl + c after this statement post df.write action > started > df.write.mode("append").saveAsTable("hive_issue.table") > print "COMPLETED WRITING TO TABLE" > We went through the documentation of Spark 2.4 [1] and found that Spark no > longer allows creating managed tables on non-empty folders. > 1. Is there any reason behind the change in Spark's behavior? > 2. To us it looks like a breaking change, as despite specifying the "overwrite" > option, Spark is unable to wipe out existing data and create tables. > 3. Do we have any solution for this issue other than setting the > "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" flag? > [1] > [https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
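The legacy flag named in point 3 is a session-level SQL config in Spark 2.4, so it can be set when the session is built. A hedged sketch (the bucket path is copied from the report, not a real location; whether this is acceptable depends on tolerating leftover files from the killed run mixing into the table location):

```python
# Sketch only: restores the pre-2.4 behaviour of allowing a managed table
# to be created over a non-empty location (Spark 2.4.x legacy flag).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .enableHiveSupport()
    .config(
        "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation",
        "true",
    )
    .getOrCreate()
)

df = spark.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
# append now succeeds even if the table location already contains files
df.write.mode("append").saveAsTable("hive_issue.table")
```

A safer alternative, since the flag only masks the leftover-files problem, is to clean up partial output from the failed run before retrying.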
[jira] [Commented] (SPARK-30393) Too much ProvisionedThroughputExceededException while recover from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-30393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017930#comment-17017930 ] Steve Loughran commented on SPARK-30393: {{ProvisionedThroughputExceededException}} means your client(s) are making more requests per second than that AWS endpoint permits. Applications are expected to recognise this and perform some kind of exponential backoff. It looks suspiciously like the spark-kinesis module is not doing this. If this is the case, I recommend doing so (it's what S3A does for S3 503s and DDB ProvisionedThroughput). See https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AUtils.java#L390 for the code to identify the problem; retries are straightforward as you can be confident the request was not processed. In the meantime, reduce the number of workers trying to talk to that particular stream. AWS endpoint throttling means that their scalability can be sub-linear. Side issue: EMR's Spark is a fork of Apache Spark. You probably need to talk to them. > Too much ProvisionedThroughputExceededException while recover from checkpoint > - > > Key: SPARK-30393 > URL: https://issues.apache.org/jira/browse/SPARK-30393 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 2.4.3 > Environment: I am using EMR 5.25.0, Spark 2.4.3, > spark-streaming-kinesis-asl 2.4.3 I have 6 r5.4xLarge in my cluster, plenty > of memory. 6 kinesis shards, I even increased to 12 shards but still see the > kinesis error >Reporter: Stephen >Priority: Major > Attachments: kinesisexceedreadlimit.png, > kinesisusagewhilecheckpointrecoveryerror.png, > sparkuiwhilecheckpointrecoveryerror.png > > > I have a spark application which consume from Kinesis with 6 shards. Data was > produced to Kinesis at at most 2000 records/second. At non-peak time data > only comes in at 200 records/second. Each record is 0.5K Bytes. So 6 shards > is enough to handle that. 
> I use reduceByKeyAndWindow and mapWithState in the program and the sliding > window is one hour long. > Recently I have been trying to checkpoint the application to S3. I am testing this > at non-peak time, so the data incoming rate is very low, around 200 records/sec. I > run the Spark application by creating a new context; the checkpoint is created at > S3, but when I kill the app and restart it, it fails to recover from the > checkpoint. The error message is the following, my SparkUI shows all > the batches are stuck, and it takes a long time for the checkpoint recovery > to complete, 15 minutes to over an hour. > I found lots of error messages in the log related to Kinesis exceeding the read > limit: > {quote}19/12/24 00:15:21 WARN TaskSetManager: Lost task 571.0 in stage 33.0 > (TID 4452, ip-172-17-32-11.ec2.internal, executor 9): > org.apache.spark.SparkException: Gave up after 3 retries while getting shard > iterator from sequence number > 49601654074184110438492229476281538439036626028298502210, last exception: > bq. at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288) > bq. at scala.Option.getOrElse(Option.scala:121) > bq. at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282) > bq. at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getKinesisIterator(KinesisBackedBlockRDD.scala:246) > bq. at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:206) > bq. at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162) > bq. at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133) > bq. at > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > bq. 
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > bq. at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > bq. at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187) > bq. at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > bq. at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > bq. at org.apache.spark.scheduler.Task.run(Task.scala:121) > bq.
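The exponential-backoff pattern recommended in the comment above can be sketched generically. The helper below is illustrative, not the actual spark-streaming-kinesis-asl or AWS SDK API; it shows the shape of the fix: retry throttled calls with a capped, jittered exponential delay.

```python
import random
import time


def with_backoff(fn, max_retries=5, base_delay=0.1, max_delay=10.0,
                 retryable=(Exception,), sleep=time.sleep):
    """Call fn(), retrying retryable errors with capped exponential backoff.

    Uses "full jitter": each delay is a random amount up to the capped
    exponential bound, which spreads out competing clients. The sleep
    function is injectable so the logic can be tested without waiting.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                raise
            # cap the exponential growth, then add jitter
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

For a `ProvisionedThroughputExceededException` specifically, `retryable` would be narrowed to that exception class; retrying is safe here because, as the comment notes, a throttled request was never processed.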
[jira] [Commented] (SPARK-30460) Spark checkpoint failing after some run with S3 path
[ https://issues.apache.org/jira/browse/SPARK-30460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017928#comment-17017928 ] Steve Loughran commented on SPARK-30460: bq. I understand spark 3.0 has new committer but as you said it is not deeply tested. Well, it has been tested and shipping with Hortonworks and Cloudera releases of Spark for a while, but: # it doesn't do stream checkpoints. Nobody has looked at that, though it's something I'd like to see. # you are seeing a stack trace on EMR; they have their own reimplementation of the committer; it may be different. bq. I had a long discussion with AWS folks but they are asking to report this to open source to verify it That stack trace shows it's their S3 connector which is rejecting the request - but the S3A one is going to reject it in exactly the same way. You're going to need a way to checkpoint that does not use append. Sorry > Spark checkpoint failing after some run with S3 path > - > > Key: SPARK-30460 > URL: https://issues.apache.org/jira/browse/SPARK-30460 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.4 >Reporter: Sachin Pasalkar >Priority: Major > > We are using EMR with SQS as the source of the stream. However, it fails > after 4-6 hours of running with the below exception. 
Application shows its running > but stops the processing the messages > {code:java} > 2020-01-06 13:04:10,548 WARN [BatchedWriteAheadLog Writer] > org.apache.spark.streaming.util.BatchedWriteAheadLog:BatchedWriteAheadLog > Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 > lim=1226 cap=1226],1578315850302,Future())) > java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142) > at java.lang.Thread.run(Thread.java:748) > 2020-01-06 13:04:10,554 WARN [wal-batching-thread-pool-0] > org.apache.spark.streaming.scheduler.ReceivedBlockTracker:Exception thrown > while writing record: > 
BlockAdditionEvent(ReceivedBlockInfo(0,Some(3),None,WriteAheadLogBasedStoreResult(input-0-1578315849800,Some(3),FileBasedWriteAheadLogSegment(s3://mss-prod-us-east-1-ueba-bucket/streaming/checkpoint/receivedData/0/log-1578315850001-1578315910001,0,5175 > to the WriteAheadLog. > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89) > at > org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:520) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at
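The stack trace above fails inside `FileSystem.append()` while writing the streaming write-ahead log, which S3 connectors do not support. Spark's streaming configuration documents flags for file systems without flush/append support (such as S3) that close the WAL file after every write instead of appending. A hedged sketch of trying them:

```python
# Sketch only: Spark Streaming config flags documented for object stores
# that do not support flushing/appending (e.g. S3). Each WAL record then
# goes to a freshly opened-and-closed file rather than an append.
from pyspark import SparkConf

conf = (
    SparkConf()
    .set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
    .set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")
)

# Alternatively, keep the checkpoint on a file system that supports append:
# ssc.checkpoint("hdfs:///checkpoints/myapp")   # ssc: hypothetical StreamingContext
```

Whether the EMR connector behaves the same way as S3A under these flags is an open question in the thread; the HDFS fallback is the conservative option.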
[jira] [Created] (SPARK-30551) Disable comparison for interval type
Kent Yao created SPARK-30551: Summary: Disable comparison for interval type Key: SPARK-30551 URL: https://issues.apache.org/jira/browse/SPARK-30551 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao As we are not going to follow ANSI, it is weird to compare the year-month part to the day-time part in our current implementation of intervals. Additionally, the current ordering logic comes from PostgreSQL, where the implementation of intervals is messy, and we are not aiming for PostgreSQL compliance at all. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30550) Random pyspark-shell applications being generated
[ https://issues.apache.org/jira/browse/SPARK-30550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram updated SPARK-30550: Attachment: Screenshot from 2020-01-17 15-43-33.png > Random pyspark-shell applications being generated > - > > Key: SPARK-30550 > URL: https://issues.apache.org/jira/browse/SPARK-30550 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell, Spark Submit, Web UI >Affects Versions: 2.3.2, 2.4.4 >Reporter: Ram >Priority: Major > Attachments: Screenshot from 2020-01-17 15-43-33.png > > > > When we submit a particular Spark job, this happens. We are not sure where > these pyspark-shell applications get generated from, but they persist for about 5s > and get killed. We're not able to figure out why this happens -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30550) Random pyspark-shell applications being generated
[ https://issues.apache.org/jira/browse/SPARK-30550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram updated SPARK-30550: Description: When we submit a particular Spark job, this happens. We are not sure where these pyspark-shell applications get generated from, but they persist for about 5s and get killed. We're not able to figure out why this happens was: !image-2020-01-17-15-40-12-899.png! When we submit a particular spark job, this happens. Not sure from where these pyspark-shell applications get generated, but they persist for like 5s and gets killed. We're not able to figure put why this happens > Random pyspark-shell applications being generated > - > > Key: SPARK-30550 > URL: https://issues.apache.org/jira/browse/SPARK-30550 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell, Spark Submit, Web UI >Affects Versions: 2.3.2, 2.4.4 >Reporter: Ram >Priority: Major > > > When we submit a particular Spark job, this happens. We are not sure where > these pyspark-shell applications get generated from, but they persist for about 5s > and get killed. We're not able to figure out why this happens -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30550) Random pyspark-shell applications being generated
Ram created SPARK-30550: --- Summary: Random pyspark-shell applications being generated Key: SPARK-30550 URL: https://issues.apache.org/jira/browse/SPARK-30550 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell, Spark Submit, Web UI Affects Versions: 2.4.4, 2.3.2 Reporter: Ram !image-2020-01-17-15-40-12-899.png! When we submit a particular Spark job, this happens. We are not sure where these pyspark-shell applications get generated from, but they persist for about 5s and get killed. We're not able to figure out why this happens -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30549) Fix the subquery metrics showing issue in UI When enable AQE
[ https://issues.apache.org/jira/browse/SPARK-30549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ke Jia updated SPARK-30549: --- Summary: Fix the subquery metrics showing issue in UI When enable AQE (was: Fix the subquery metrics showing issue in UI) > Fix the subquery metrics showing issue in UI When enable AQE > > > Key: SPARK-30549 > URL: https://issues.apache.org/jira/browse/SPARK-30549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Priority: Major > > After merging [PR#25316|https://github.com/apache/spark/pull/25316], the > subquery metrics cannot be shown in the UI. This PR will fix the subquery > showing issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30549) Fix the subquery metrics showing issue in UI when AQE is enabled
[ https://issues.apache.org/jira/browse/SPARK-30549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ke Jia updated SPARK-30549: --- Description: After merging [https://github.com/apache/spark/pull/25316], the subquery metrics cannot be shown in the UI when AQE is enabled. This PR will fix the subquery display issue. (was: After merging [PR#25316|https://github.com/apache/spark/pull/25316], the subquery metrics cannot be shown in the UI. This PR will fix the subquery display issue.) > Fix the subquery metrics showing issue in UI when AQE is enabled > > > Key: SPARK-30549 > URL: https://issues.apache.org/jira/browse/SPARK-30549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Priority: Major > > After merging [https://github.com/apache/spark/pull/25316], the subquery > metrics cannot be shown in the UI when AQE is enabled. This PR will fix the subquery > display issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30549) Fix the subquery metrics showing issue in UI
Ke Jia created SPARK-30549: -- Summary: Fix the subquery metrics showing issue in UI Key: SPARK-30549 URL: https://issues.apache.org/jira/browse/SPARK-30549 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ke Jia After merging [PR#25316|https://github.com/apache/spark/pull/25316], the subquery metrics cannot be shown in the UI. This PR will fix the subquery display issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30319) Adds a stricter version of as[T]
[ https://issues.apache.org/jira/browse/SPARK-30319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-30319: -- Affects Version/s: (was: 2.4.4) 3.0.0 > Adds a stricter version of as[T] > > > Key: SPARK-30319 > URL: https://issues.apache.org/jira/browse/SPARK-30319 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Enrico Minack >Priority: Major > > The behaviour of as[T] is not intuitive when you read code like > df.as[T].write.csv("data.csv"). The result depends on the actual schema of > df, whereas def as[T](): Dataset[T] should be agnostic to the schema of df. The > expected behaviour is not provided elsewhere: > * Extra columns that are not part of the type {{T}} are not dropped. > * The order of columns is not aligned with the schema of {{T}}. > * Columns are not cast to the types of {{T}}'s fields; they have to be cast > explicitly. > A method that enforces the schema of T on a given Dataset would be very > convenient and would allow you to articulate and guarantee the above assumptions about > your data with the native Spark Dataset API. This method plays a more > explicit and enforcing role than as[T] with respect to columns, column order > and column types. > Possible names for a stricter version of {{as[T]}}: > * {{as[T](strict = true)}} > * {{toDS[T]}} (as in {{toDF}}) > * {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}}) > The name {{toDS[T]}} is chosen here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
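The three missing behaviours listed above (drop extra columns, align column order, cast to the target types) can be sketched outside Spark. This is a pure-Python illustration of the proposed semantics, not the Spark API; `TARGET_SCHEMA` and `enforce_schema` are hypothetical names standing in for the fields of `T` and the proposed `toDS[T]`:

```python
# Illustration of the proposed strict-as[T] semantics on a dict-like "row":
# select only the schema's fields, in the schema's order, cast to its types.
TARGET_SCHEMA = [("name", str), ("age", int)]  # stands in for the fields of T

def enforce_schema(row, schema=TARGET_SCHEMA):
    """Project a row onto the schema: select, reorder, and cast each column."""
    return {field: typ(row[field]) for field, typ in schema}

row = {"age": "42", "name": "Ada", "extra": True}  # extra column, wrong order, wrong type
print(enforce_schema(row))  # -> {'name': 'Ada', 'age': 42}
```

The extra column is dropped, the order follows the schema rather than the input, and `"42"` is cast to `42`, mirroring the three bullet points.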
[jira] [Updated] (SPARK-30319) Adds a stricter version of as[T]
[ https://issues.apache.org/jira/browse/SPARK-30319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-30319: -- Fix Version/s: (was: 3.0.0) > Adds a stricter version of as[T] > > > Key: SPARK-30319 > URL: https://issues.apache.org/jira/browse/SPARK-30319 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Enrico Minack >Priority: Major > > The behaviour of as[T] is not intuitive when you read code like > df.as[T].write.csv("data.csv"). The result depends on the actual schema of > df, whereas def as[T](): Dataset[T] should be agnostic to the schema of df. The > expected behaviour is not provided elsewhere: > * Extra columns that are not part of the type {{T}} are not dropped. > * The order of columns is not aligned with the schema of {{T}}. > * Columns are not cast to the types of {{T}}'s fields; they have to be cast > explicitly. > A method that enforces the schema of T on a given Dataset would be very > convenient and would allow you to articulate and guarantee the above assumptions about > your data with the native Spark Dataset API. This method plays a more > explicit and enforcing role than as[T] with respect to columns, column order > and column types. > Possible names for a stricter version of {{as[T]}}: > * {{as[T](strict = true)}} > * {{toDS[T]}} (as in {{toDF}}) > * {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}}) > The name {{toDS[T]}} is chosen here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path
[ https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017820#comment-17017820 ] Sivakumar commented on SPARK-30542: --- Earlier, with Spark DStreams, two jobs could share the same base path, but with Spark Structured Streaming I don't have that flexibility. I think this is a feature that Structured Streaming should support. > Two Spark structured streaming jobs cannot write to same base path > -- > > Key: SPARK-30542 > URL: https://issues.apache.org/jira/browse/SPARK-30542 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Sivakumar >Priority: Major > > Hi All, > Spark Structured Streaming doesn't allow two structured streaming jobs to > write data to the same base directory, which is possible with DStreams. > Because a _spark_metadata directory is created by default for one job, the > second job cannot use the same directory as its base path; since the > _spark_metadata directory has already been created by the other job, it throws an exception. > Is there any workaround for this, other than creating separate base paths > for both jobs? > Is it possible to create the _spark_metadata directory elsewhere or > disable it without any data loss? > If I had to change the base path for both jobs, my whole framework > would be impacted, so I don't want to do that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path
[ https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sivakumar updated SPARK-30542: -- Description: Hi All, Spark Structured Streaming doesn't allow two structured streaming jobs to write data to the same base directory, which is possible with DStreams. Because a _spark_metadata directory is created by default for one job, the second job cannot use the same directory as its base path; since the _spark_metadata directory has already been created by the other job, it throws an exception. Is there any workaround for this, other than creating separate base paths for both jobs? Is it possible to create the _spark_metadata directory elsewhere or disable it without any data loss? If I had to change the base path for both jobs, my whole framework would be impacted, so I don't want to do that. was: Hi All, I have two structured streaming jobs which should write data to the same base directory. Because a _spark_metadata directory is created by default for one job, the second job cannot use the same directory as its base path; since the _spark_metadata directory has already been created by the other job, it throws an exception. Is there any workaround for this, other than creating separate base paths for both jobs? Is it possible to create the _spark_metadata directory elsewhere or disable it without any data loss? If I had to change the base path for both jobs, my whole framework would be impacted, so I don't want to do that. > Two Spark structured streaming jobs cannot write to same base path > -- > > Key: SPARK-30542 > URL: https://issues.apache.org/jira/browse/SPARK-30542 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Sivakumar >Priority: Major > > Hi All, > Spark Structured Streaming doesn't allow two structured streaming jobs to > write data to the same base directory, which is possible with DStreams.
> Because a _spark_metadata directory is created by default for one job, > the second job cannot use the same directory as its base path; since the > _spark_metadata directory has already been created by the other job, it throws an exception. > Is there any workaround for this, other than creating separate base paths > for both jobs? > Is it possible to create the _spark_metadata directory elsewhere or > disable it without any data loss? > If I had to change the base path for both jobs, my whole framework > would be impacted, so I don't want to do that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30282) UnresolvedV2Relation should be resolved to temp view first
[ https://issues.apache.org/jira/browse/SPARK-30282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30282. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26921 [https://github.com/apache/spark/pull/26921] > UnresolvedV2Relation should be resolved to temp view first > -- > > Key: SPARK-30282 > URL: https://issues.apache.org/jira/browse/SPARK-30282 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > For the following v2 commands, _Analyzer.ResolveTables_ does not check > against the temp views before resolving _UnresolvedV2Relation_, thus it > always resolves _UnresolvedV2Relation_ to a table: > * ALTER TABLE > * DESCRIBE TABLE > * SHOW TBLPROPERTIES > Thus, in the following example, 't' will be resolved to a table, not a temp > view: > {code:java} > sql("CREATE TEMPORARY VIEW t AS SELECT 2 AS i") > sql("CREATE TABLE testcat.ns.t USING csv AS SELECT 1 AS i") > sql("USE testcat.ns") > sql("SHOW TBLPROPERTIES t") // 't' is resolved to a table > {code} > For V2 commands, if a table is resolved to a temp view, it should error out > with a message that v2 command cannot handle temp views. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
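The resolution order the issue proposes — check temp views before catalog tables, and error out when a v2 command hits a temp view — can be sketched in a few lines. This is a hypothetical pure-Python illustration of the lookup logic, not Spark's internal `Analyzer.ResolveTables` API:

```python
# Temp views shadow catalog tables; a v2 command that resolves to a temp
# view should fail with a clear message instead of silently hitting a table.
temp_views = {"t": "TEMP VIEW t"}
catalog_tables = {
    ("testcat", "ns", "t"): "TABLE testcat.ns.t",
    ("testcat", "ns", "u"): "TABLE testcat.ns.u",
}

def resolve_for_v2_command(name, namespace=("testcat", "ns")):
    """Resolve a single-part name for a v2 command: temp views are checked first."""
    if name in temp_views:
        raise ValueError(f"v2 command cannot handle temp view '{name}'")
    return catalog_tables[namespace + (name,)]
```

With this order, `SHOW TBLPROPERTIES t` in the example would fail fast on the temp view `t` rather than resolving to `testcat.ns.t`.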
[jira] [Assigned] (SPARK-30282) UnresolvedV2Relation should be resolved to temp view first
[ https://issues.apache.org/jira/browse/SPARK-30282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30282: --- Assignee: Terry Kim > UnresolvedV2Relation should be resolved to temp view first > -- > > Key: SPARK-30282 > URL: https://issues.apache.org/jira/browse/SPARK-30282 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > For the following v2 commands, _Analyzer.ResolveTables_ does not check > against the temp views before resolving _UnresolvedV2Relation_, thus it > always resolves _UnresolvedV2Relation_ to a table: > * ALTER TABLE > * DESCRIBE TABLE > * SHOW TBLPROPERTIES > Thus, in the following example, 't' will be resolved to a table, not a temp > view: > {code:java} > sql("CREATE TEMPORARY VIEW t AS SELECT 2 AS i") > sql("CREATE TABLE testcat.ns.t USING csv AS SELECT 1 AS i") > sql("USE testcat.ns") > sql("SHOW TBLPROPERTIES t") // 't' is resolved to a table > {code} > For V2 commands, if a table is resolved to a temp view, it should error out > with a message that v2 command cannot handle temp views. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30548) Cached blockInfo in BlockMatrix.scala is never released
Dong Wang created SPARK-30548: - Summary: Cached blockInfo in BlockMatrix.scala is never released Key: SPARK-30548 URL: https://issues.apache.org/jira/browse/SPARK-30548 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.4.4 Reporter: Dong Wang The private variable _blockInfo_ in mllib.linalg.distributed.BlockMatrix is never unpersisted once a BlockMatrix instance is created. {code:scala} private lazy val blockInfo = blocks.mapValues(block => (block.numRows, block.numCols)).cache() {code} I think we should add an API to unpersist this variable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
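The pattern at issue — a lazily computed, cached value with no way to release it — and the proposed fix can be sketched outside Spark. All names below are hypothetical illustrations, not the MLlib API:

```python
# A lazily cached derived value (like `lazy val blockInfo ... .cache()`)
# plus the proposed explicit-release method the issue asks for.
class BlockMatrixLike:
    def __init__(self, blocks):
        self._blocks = blocks        # list of 2-D matrices (lists of rows)
        self._block_info = None      # computed on first access, then cached

    @property
    def block_info(self):
        """(numRows, numCols) per block, computed lazily and cached."""
        if self._block_info is None:
            self._block_info = [(len(b), len(b[0])) for b in self._blocks]
        return self._block_info

    def unpersist_block_info(self):
        # The proposed API: let callers release the cached data explicitly,
        # instead of holding it for the lifetime of the instance.
        self._block_info = None
```

Without the release method, the cached value lives as long as the instance does, which is exactly the leak described above.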
[jira] [Commented] (SPARK-30530) CSV load followed by "is null" filter produces incorrect results
[ https://issues.apache.org/jira/browse/SPARK-30530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017806#comment-17017806 ] Maxim Gekk commented on SPARK-30530: [~jlowe] I prepared a fix for the issue. [~hyukjin.kwon] [~cloud_fan] Could you review it, please. > CSV load followed by "is null" filter produces incorrect results > > > Key: SPARK-30530 > URL: https://issues.apache.org/jira/browse/SPARK-30530 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jason Darrell Lowe >Priority: Major > > Trying to filter on is null from values loaded from a CSV file has regressed > recently and now produces incorrect results. > Given a CSV file with the contents: > {noformat:title=floats.csv} > 100.0,1.0, > 200.0,, > 300.0,3.0, > 1.0,4.0, > ,4.0, > 500.0,, > ,6.0, > -500.0,50.5 > {noformat} > Filtering this data for the first column being null should return exactly two > rows, but it is returning extraneous rows with nulls: > {noformat} > scala> val schema = StructType(Array(StructField("floats", FloatType, > true),StructField("more_floats", FloatType, true))) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(floats,FloatType,true), > StructField(more_floats,FloatType,true)) > scala> val df = spark.read.schema(schema).csv("floats.csv") > df: org.apache.spark.sql.DataFrame = [floats: float, more_floats: float] > scala> df.filter("floats is null").show > +--+---+ > |floats|more_floats| > +--+---+ > | null| null| > | null| null| > | null| null| > | null| null| > | null|4.0| > | null| null| > | null|6.0| > +--+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30547) Add unstable annotation to the CalendarInterval class
Kent Yao created SPARK-30547: Summary: Add unstable annotation to the CalendarInterval class Key: SPARK-30547 URL: https://issues.apache.org/jira/browse/SPARK-30547 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao People already use CalendarInterval as UDF inputs so it’s better to make it clear it’s unstable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30544) Upgrade Genjavadoc to 0.15
[ https://issues.apache.org/jira/browse/SPARK-30544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-30544: --- Parent: SPARK-25075 Issue Type: Sub-task (was: Improvement) > Upgrade Genjavadoc to 0.15 > -- > > Key: SPARK-30544 > URL: https://issues.apache.org/jira/browse/SPARK-30544 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > Genjavadoc 0.14 doesn't support Scala 2.13, so sbt -Pscala-2.13 will fail to build. > Let's upgrade it to 0.15. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path
[ https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017791#comment-17017791 ] Jungtaek Lim commented on SPARK-30542: -- This is more likely a question than an actual bug; questions like this are encouraged to be posted to the user/dev mailing lists. > Two Spark structured streaming jobs cannot write to same base path > -- > > Key: SPARK-30542 > URL: https://issues.apache.org/jira/browse/SPARK-30542 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Sivakumar >Priority: Major > > Hi All, > I have two structured streaming jobs which should write data to the same base > directory. > Because a _spark_metadata directory is created by default for one job, > the second job cannot use the same directory as its base path; since the > _spark_metadata directory has already been created by the other job, it throws an exception. > Is there any workaround for this, other than creating separate base paths > for both jobs? > Is it possible to create the _spark_metadata directory elsewhere or > disable it without any data loss? > If I had to change the base path for both jobs, my whole framework > would be impacted, so I don't want to do that. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30546) Make interval type more future-proof
Kent Yao created SPARK-30546: Summary: Make interval type more future-proof Key: SPARK-30546 URL: https://issues.apache.org/jira/browse/SPARK-30546 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao Before 3.0, we may make some efforts to make the current interval type more future-proof, e.g.: 1. Add an unstable annotation to the CalendarInterval class. People already use it as a UDF input, so it's better to make it clear that it's unstable. 2. Add a schema checker to prohibit creating v2 custom catalog tables with intervals, the same as what we do for the built-in catalog. 3. Add a schema checker for DataFrameWriterV2 too. 4. Make the interval type incomparable, as in version 2.4, to avoid ambiguous comparisons between year-month and day-time fields. 5. The newly added to_csv in 3.0 should not support outputting intervals, the same as the CSV file format. 6. The function to_json should not allow using an interval as a key field, the same as for the value field and the JSON data source, with a legacy config to restore the old behaviour. 7. Revert the interval ISO/ANSI SQL standard output since we decided not to follow ANSI, so there is no round trip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30544) Upgrade Genjavadoc to 0.15
[ https://issues.apache.org/jira/browse/SPARK-30544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-30544: --- Affects Version/s: (was: 3.1.0) 3.0.0 > Upgrade Genjavadoc to 0.15 > -- > > Key: SPARK-30544 > URL: https://issues.apache.org/jira/browse/SPARK-30544 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > Genjavadoc 0.14 doesn't support Scala 2.13, so sbt -Pscala-2.13 will fail to build. > Let's upgrade it to 0.15. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30543) RandomForest add Param bootstrap to control sampling method
[ https://issues.apache.org/jira/browse/SPARK-30543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-30543: - Issue Type: Improvement (was: Bug) > RandomForest add Param bootstrap to control sampling method > --- > > Key: SPARK-30543 > URL: https://issues.apache.org/jira/browse/SPARK-30543 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > Currently, RF with numTrees=1 directly builds a tree using the original > dataset, > while with numTrees>1 it uses bootstrap samples to build the trees. > This design exists so that a DecisionTreeModel can be trained by the RandomForest impl; > however, it is somewhat confusing. > In Scikit-Learn, there is a param bootstrap to control whether bootstrap samples are > used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
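The semantics of such a `bootstrap` param are simple enough to sketch in plain Python: with `bootstrap=True` each tree trains on a sample drawn with replacement; with `bootstrap=False` it trains on the original dataset. This is a hypothetical illustration of the behaviour being proposed, not the Spark ML API:

```python
import random

def training_set(data, bootstrap, seed=0):
    """Return the per-tree training set under the proposed `bootstrap` param."""
    if not bootstrap:
        return list(data)                       # train on the original dataset
    rng = random.Random(seed)
    # Bootstrap: draw len(data) samples with replacement.
    return [rng.choice(data) for _ in range(len(data))]
```

This makes the sampling behaviour an explicit, user-controlled choice instead of an implicit side effect of `numTrees`.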
[jira] [Created] (SPARK-30545) Impl Extremely Randomized Trees
zhengruifeng created SPARK-30545: Summary: Impl Extremely Randomized Trees Key: SPARK-30545 URL: https://issues.apache.org/jira/browse/SPARK-30545 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng 1. Extremely Randomized Trees (ExtraTrees) is widely used and is implemented in Scikit-Learn and OpenCV; 2. ExtraTrees is quite similar to RandomForest, and the main difference lies in how splits are chosen: at each node, candidate splits (only one split per feature in Scikit-Learn's implementation) are drawn at random for each feature, and the best of these randomly chosen splits is selected. Based on the current implementation of ensemble trees, it can be implemented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
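The split rule that distinguishes ExtraTrees — draw one random threshold per feature and keep the best of these randomly chosen splits — can be sketched in a few lines. Everything here is a toy illustration (the impurity measure is a simple variance reduction for regression labels), not Spark's or Scikit-Learn's implementation:

```python
import random

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_random_split(rows, labels, n_features, seed=0):
    """ExtraTrees-style split: one random threshold per feature, keep the best."""
    rng = random.Random(seed)
    best = None
    for f in range(n_features):
        values = [r[f] for r in rows]
        # Unlike RandomForest, the cut point is drawn at random, not optimized.
        threshold = rng.uniform(min(values), max(values))
        left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[f] > threshold]
        if not left or not right:
            continue
        # Impurity decrease of this random split.
        score = variance(labels) - (
            len(left) * variance(left) + len(right) * variance(right)
        ) / len(labels)
        if best is None or score > best[0]:
            best = (score, f, threshold)
    return best  # (score, feature_index, threshold), or None if no valid split
```

Because only the threshold selection changes, the surrounding tree-growing machinery of the existing ensemble implementation can be reused, which is the point made above.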
[jira] [Assigned] (SPARK-30545) Impl Extremely Randomized Trees
[ https://issues.apache.org/jira/browse/SPARK-30545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-30545: Assignee: zhengruifeng > Impl Extremely Randomized Trees > --- > > Key: SPARK-30545 > URL: https://issues.apache.org/jira/browse/SPARK-30545 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > > 1. Extremely Randomized Trees (ExtraTrees) is widely used and is implemented in > Scikit-Learn and OpenCV; > 2. ExtraTrees is quite similar to RandomForest, and the main difference lies > in how splits are chosen: at each node, candidate splits (only one split per feature in Scikit-Learn's > implementation) are drawn at random for each feature, and the best of these > randomly chosen splits is selected. > Based on the current implementation of ensemble trees, it can be implemented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30544) Upgrade Genjavadoc to 0.15
Kousuke Saruta created SPARK-30544: -- Summary: Upgrade Genjavadoc to 0.15 Key: SPARK-30544 URL: https://issues.apache.org/jira/browse/SPARK-30544 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Genjavadoc 0.14 doesn't support Scala 2.13, so sbt -Pscala-2.13 will fail to build. Let's upgrade it to 0.15. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30462) Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects
[ https://issues.apache.org/jira/browse/SPARK-30462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017782#comment-17017782 ] Sivakumar commented on SPARK-30462: --- Hi All, I have two structured streaming jobs which should write data to the same base directory. Because a _spark_metadata directory is created by default for one job, the second job cannot use the same directory as its base path; since the _spark_metadata directory has already been created by the other job, it throws an exception. Is there any workaround for this, other than creating separate base paths for both jobs? Is it possible to create the _spark_metadata directory elsewhere or disable it without any data loss? If I had to change the base path for both jobs, my whole framework would be impacted, so I don't want to do that. (SPARK-30542) - I have created a separate ticket for this. > Structured Streaming _spark_metadata fills up Spark Driver memory when having > lots of objects > - > > Key: SPARK-30462 > URL: https://issues.apache.org/jira/browse/SPARK-30462 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3, 2.4.4, 3.0.0 >Reporter: Vladimir Yankov >Priority: Critical > > Hi, > With the current implementation of Spark Structured Streaming, it does not > seem to be possible to have a constantly running stream, writing millions of > files, without increasing the Spark driver's memory to dozens of GBs. > In our scenario we are using Spark Structured Streaming to consume messages > from a Kafka cluster, transform them, and write them as compressed Parquet > files to an S3 object store service. > Every 30 seconds a new batch of the streaming job writes hundreds of > objects, which over time results in millions of objects in S3. > As all written objects are recorded in _spark_metadata, the size of the > compact files there grows to GBs that eventually fill up the Spark driver's > memory and lead to OOM errors.
> We need the functionality to configure the spark structured streaming to run > without loading all the historically accumulated metadata in its memory. > Regularly resetting the _spark_metadata and the checkpoint folders is not an > option in our use-case, as we are using the information from the > _spark_metadata to have a register of the objects for faster querying and > search of the written objects. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org