[jira] [Created] (SPARK-27861) get_json_object in sql will truncate long value gotten from jsonpath
jing.yan created SPARK-27861: Summary: get_json_object in sql will truncate long value gotten from jsonpath Key: SPARK-27861 URL: https://issues.apache.org/jira/browse/SPARK-27861 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: jing.yan {code:java} select get_json_object('{"key":"eyJib29sIjp7ImZpbHRlciI6W3sidGVybSI6eyJ3YXJlaG91c2VfaWQiOjMwfX0seyJ0ZXJtcyI6eyJpZCI6WzE2MTgzMjk3NywxNjE4MjM1OTgsMTYxODIzNTk3LDE2MTgyMzU5NiwxNjE4MjM1OTUsMTYxODAwMjI0LDE2MTgwMDIyMSwxNjE4MDAyMjAsMTYxODAwMjE5LDE2MTgwMDIxOCwxNjE4MDAyMTMsMTYxODAwMjEyLDE2MTgwMDIxMSwxNjE4MDAyMTAsMTYxODAwMjA2LDE2MTgwMDIwMiwxNjE4MDAyMDEsMTYxODAwMTk4LDE2MTgwMDE5NywxNjE4MDAxOTYsMTYxODAwMTk1LDE2MTgwMDE5NCwxNjE4MDAxOTMsMTYxODAwMTkyLDE2MTgwMDE5MSwxNjE4MDAxOTAsMTYxODAwMTg5LDE2MTc5NDAyOSwxNjEzNTg2ODMsMTYxMzU4NjExLDE2MTM1ODYxMCwxNjEzNTYzMTIsMTYxMzU0NzU5LDE2MTM1NDY5NywxNjEzNTQ2OTYsMTYxMzQ0NTkyLDE2MTM0NDUwNywxNjEzMzc2MzQsMTYxMzM3MjA2LDE2MTMwMDM4OCwxNjEzMDAzODcsMTYxMzAwMzg2LDE2MTI4ODAxMCwxNjEyODgwMDksMTYxMjc5NTI0LDE2MTI3OTUyMiwxNjEyNzk1MTUsMTYxMjc5NTE0LDE2MTI3OTUxMywxNjEyNzg5NzcsMTYxMjc4OTc2LDE2MTI3ODk0NCwxNjEyNzg5NDMsMTYwODk2NjM3LDE2MDg5NjYzNiwxNjA4NzQ0ODcsMTYwODc0NDg2LDE2MDg3NDQ2NywxNjA4NzQ0NjYsMTYwODc0NDY1LDE2MDg3NDQ2NCwxNjA4NzQ0NjMsMTYwODc0NDU2LDE2MDg3NDQ1NSwxNjA4NzQxNTksMTYwODc0MTU2LDE2MDg3NDE1NSwxNjA4NzQxNTIsMTYwODc0MTUxLDE2MDg3NDE1MCwxNjA4NzQxNDksMTYwODc0MTQ4LDE2MDg3NDE0NSwxNjA4NzQxNDQsMTYwODc0MTQzLDE2MDg3NDE0MiwxNjA4NzQxNDEsMTYwODc0MTQwLDE2MDg3NDEzNywxNjA4NzQxMzYsMTYwODc0MTMzLDE2MDg3NDEzMiwxNjA4MjI0OTksMTYwODExMDE5LDE2MDgxMDkxNSwxNjA4MTA5MTQsMTYwODEwNDU4LDE2MDgxMDQ1NywxNjA4MTA0NTYsMTYwODEwNDU1LDE2MDgxMDQ1NCwxNjA4MTAzNTMsMTYwNzk5NTU1LDE2MDc5MjQ1OSwxNjA3OTI0NTYsMTYwNzkyNDU1LDE2MDc5MjQ1NCwxNjA3OTI0NTEsMTYwNzkyNDUwLDE2MDc5MjQ0OSwxNjA3OTI0MzUsMTYwNzkyNDI4LDE2MDc5MjQyNywxNjA3OTI0MjYsMTYwNzkyNDI1LDE2MDc5MjQyNCwxNjA3OTI0MjMsMTYwNzkyNDIyLDE2MDc5MjQxNCwxNjA3OTI0MTEsMTYwNzkyNDA4LDE2MDc5MjQwMywxNjA3OTI0MDIsMTYwNzkyNDAxLDE2MDc5MjM5MiwxNjA3OTIzODgsMTYwNzkyMzg3LDE2MDc5MjM4NiwxNjA3OTIzODUsMTYwNzkyMzg0LD
E2MDc5MjM2NSwxNjA3OTIzNjIsMTYwNzkyMzYxLDE2MDc5MjM2MCwxNjA3OTIzNTksMTYwNzkyMzU4LDE2MDc5MjM1NywxNjA3OTIzNTYsMTYwNzkyMzU1LDE2MDc5MjM1NCwxNjA3OTIzNTMsMTYwNzkyMzUyLDE2MDc5MjM1MSwxNjA3OTIzNTAsMTYwNzkwNjA4LDE2MDc5MDYwNywxNjA3OTA2MDQsMTYwNzkwNjAzLDE2MDc5MDYwMCwxNjA3OTA1OTYsMTYwNzkwNTk1LDE2MDc5MDU5NCwxNjA3Nzg3MjQsMTYwNzc4NzIzLDE2MDc3ODcyMiwxNjA3Nzg3MjEsMTYwNzc4NzIwLDE2MDc3ODcxOSwxNjA3Nzg3MTgsMTYwNzc4NzE3LDE2MDc3ODcxNiwxNjA3Nzg3MTUsMTYwNzc4NzE0LDE2MDc3ODcxMywxNjA3Nzg3MTIsMTYwNzY4MjMxLDE2MDc2NDE3NywxNjA3NjQxNTgsMTYwNzYyNTI0LDE2MDc2MjUyMywxNjA3NjI1MjIsMTYwNzYyNTIxLDE2MDc2MjUyMCwxNjA3NjI1MTksMTYwNzYyNTE4LDE2MDc2MjUxNywxNjA3NjI1MTYsMTYwNzYyNTE1LDE2MDc2MjQ5MiwxNjA3NjI0OTEsMTYwNzYyNDkwLDE2MDc2MjQ4OSwxNjA3NjI0ODgsMTYwNzYyNDg3LDE2MDc2MjQ4NiwxNjA3NTY4MDksMTYwNzU2Nzc5LDE2MDc1Njc0OSwxNjA3NTY3NDgsMTYwNzU2NzM4LDE2MDc1NjczNywxNjA3NTY3MjAsMTYwNzU2NzE5LDE2MDc1NjcxOCwxNjA3NTY3MTcsMTYwNzU2Njk1LDE2MDczNzIzNSwxNjA3MzcyMzQsMTYwNzM3MjMzLDE2MDczNzIzMiwxNjA3MzcyMzEsMTYwNzM3MjMwLDE2MDczNzIyOSwxNjA3MzcyMjgsMTYwNzM3MjI3LDE2MDczNzIyNiwxNjA3MzcyMjUsMTYwNzM3MjI0LDE2MDczNzE3MywxNjA3MzcxNzJdfX0seyJ0ZXJtIjp7ImlzX2RlbGV0ZWQiOjB9fV19fQ=="}','$.key') result will be truncated in spark-sql. 
{code} {code:java} val spark = SparkSession.builder .appName("test").master("local[1]") .getOrCreate import spark.implicits._ // << add this val sc = spark.sparkContext val rdd: RDD[Row] = sc.parallelize(Seq( Row("""{"key":"eyJib29sIjp7ImZpbHRlciI6W3sidGVybSI6eyJ3YXJlaG91c2VfaWQiOjMwfX0seyJ0ZXJtcyI6eyJpZCI6WzE2MTgzMjk3NywxNjE4MjM1OTgsMTYxODIzNTk3LDE2MTgyMzU5NiwxNjE4MjM1OTUsMTYxODAwMjI0LDE2MTgwMDIyMSwxNjE4MDAyMjAsMTYxODAwMjE5LDE2MTgwMDIxOCwxNjE4MDAyMTMsMTYxODAwMjEyLDE2MTgwMDIxMSwxNjE4MDAyMTAsMTYxODAwMjA2LDE2MTgwMDIwMiwxNjE4MDAyMDEsMTYxODAwMTk4LDE2MTgwMDE5NywxNjE4MDAxOTYsMTYxODAwMTk1LDE2MTgwMDE5NCwxNjE4MDAxOTMsMTYxODAwMTkyLDE2MTgwMDE5MSwxNjE4MDAxOTAsMTYxODAwMTg5LDE2MTc5NDAyOSwxNjEzNTg2ODMsMTYxMzU4NjExLDE2MTM1ODYxMCwxNjEzNTYzMTIsMTYxMzU0NzU5LDE2MTM1NDY5NywxNjEzNTQ2OTYsMTYxMzQ0NTkyLDE2MTM0NDUwNywxNjEzMzc2MzQsMTYxMzM3MjA2LDE2MTMwMDM4OCwxNjEzMDAzODcsMTYxMzAwMzg2LDE2MTI4ODAxMCwxNjEyODgwMDksMTYxMjc5NTI0LDE2MTI3OTUyMiwxNjEyNzk1MTUsMTYxMjc5NTE0LDE2MTI3OTUxMywxNjEyNzg5NzcsMTYxMjc4OTc2LDE2MTI3ODk0NCwxNjEyNzg5NDMsMTYwODk2NjM3LDE2MDg5NjYzNiwxNjA4NzQ0ODcsMTYwODc0NDg2LDE2MDg3NDQ2NywxNjA4NzQ0NjYsMTYwODc0NDY1LDE2MDg3NDQ2NCwxNjA4NzQ0NjMsMTYwODc0NDU2LDE2MDg3NDQ1NSwxNjA4NzQxNTksMTYwODc0MTU2LDE2MDg3NDE1NSwxNjA4NzQxNTIsMTYwODc0MTUxLDE2MDg3NDE1MCwxNjA4NzQxNDksMTYwODc0MTQ4LDE2MDg3NDE0NSwxNjA4NzQxNDQsMTYwODc0MTQzLDE2MDg3NDE0MiwxNjA4NzQxNDEsMTYwODc0MTQwLDE2MDg3NDEzNywxNjA4NzQxMzYsMTYwODc0MTMzLDE2MDg3NDEzMiwxNjA4MjI0OTksMTYwODExMDE5LDE2MDgxMDkxNSwxNjA4MTA5MTQsMTYwODEwNDU4LDE2MDgxMDQ1NywxNjA4MTA0NTYsMTYwODEwNDU1LDE2MDgxMDQ1NCwxNjA4MTAzNTMsMTYwNzk5NTU1LDE2MDc5MjQ1OSwxNjA3OTI0NTYsMTYwNzkyNDU1LDE2MDc5MjQ1NCwxNjA3OTI0NTEsMTYwNzkyNDUwLDE2MDc5MjQ0OSwxNjA3OTI0MzUsMTYwNzkyNDI4LDE2MDc5MjQyNywxNjA3OTI0M
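As a point of reference for the reported bug, a correct JSONPath extraction returns the entire value regardless of its length. The sketch below is a deliberately simplified Python stand-in for SQL's `get_json_object` (it handles only plain `$.field` paths, not full JSONPath; the 4000-character string is a hypothetical stand-in for the long base64 value in the report) and illustrates the expected, untruncated behavior:

```python
import json

def get_json_object(json_str, path):
    # Minimal reference model of SQL's get_json_object for simple
    # '$.field' (and '$.a.b') paths: parse the document and walk the
    # dotted path. A correct implementation returns the full value,
    # however long it is.
    if not path.startswith("$."):
        return None
    obj = json.loads(json_str)
    for field in path[2:].split("."):
        if not isinstance(obj, dict) or field not in obj:
            return None
        obj = obj[field]
    return obj if isinstance(obj, str) else json.dumps(obj)

# A long value standing in for the ~2700-character base64 string above.
long_value = "A" * 4000
doc = json.dumps({"key": long_value})
extracted = get_json_object(doc, "$.key")
assert extracted == long_value  # nothing is truncated
```

The bug report, by contrast, is that Spark SQL returns only a prefix of such a value; the sketch only fixes the expected contract, not Spark's implementation.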
[jira] [Updated] (SPARK-27848) AppVeyor change to latest R version (3.6.0)
[ https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27848: - Description: R 3.6.0 was released last month. We should test newer versions of R in AppVeyor. We should set `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the warnings below: {code} Error in strptime(xx, f, tz = tz) : (converted from warning) unable to identify current timezone 'C': please set environment variable 'TZ' Error in i.p(...) : (converted from warning) installation of package 'praise' had non-zero exit status Calls: ... with_rprofile_user -> with_envvar -> force -> force -> i.p Execution halted {code} This JIRA targets the latest R version, 3.6.0. was:R 3.6.0 was released last month. We should test newer versions of R in AppVeyor. > AppVeyor change to latest R version (3.6.0) > --- > > Key: SPARK-27848 > URL: https://issues.apache.org/jira/browse/SPARK-27848 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > R 3.6.0 was released last month. We should test newer versions of R in > AppVeyor. > We should set `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the > warnings below: > {code} > Error in strptime(xx, f, tz = tz) : > (converted from warning) unable to identify current timezone 'C': > please set environment variable 'TZ' > Error in i.p(...) : > (converted from warning) installation of package 'praise' had non-zero exit > status > Calls: ... with_rprofile_user -> with_envvar -> force -> force -> > i.p > Execution halted > {code} > This JIRA targets the latest R version, 3.6.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
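For reference, an AppVeyor environment variable of this kind is typically declared in the `environment` section of `appveyor.yml`. The fragment below is a hypothetical sketch of that setting, not Spark's actual CI configuration:

```yaml
# Hypothetical appveyor.yml fragment: R_REMOTES_NO_ERRORS_FROM_WARNINGS
# stops remotes-based R package installation from promoting warnings
# (like the ones quoted above) to errors.
environment:
  R_REMOTES_NO_ERRORS_FROM_WARNINGS: "true"
  TZ: "UTC"  # also addresses the strptime "unable to identify current timezone" warning
```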
[jira] [Updated] (SPARK-25944) AppVeyor change to latest R version
[ https://issues.apache.org/jira/browse/SPARK-25944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25944: - Description: R 3.5.1 is released few months ago. We better test higher versions of R in AppVeyor. R 3.6.0 is released 2019-04-26. This PR targets to change R version from 3.5.1 to 3.6.0 in AppVeyor. We should set `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the warnings below: {code} Error in strptime(xx, f, tz = tz) : (converted from warning) unable to identify current timezone 'C': please set environment variable 'TZ' Error in i.p(...) : (converted from warning) installation of package 'praise' had non-zero exit status Calls: ... with_rprofile_user -> with_envvar -> force -> force -> i.p Execution halted {code} This JIRA targets to the latest of R version 3.6.0. was: R 3.5.1 is released few months ago. We better test higher versions of R in AppVeyor. R 3.6.0 is released 2019-04-26. This PR targets to change R version from 3.5.1 to 3.6.0 in AppVeyor. This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the warnings below: {code} Error in strptime(xx, f, tz = tz) : (converted from warning) unable to identify current timezone 'C': please set environment variable 'TZ' Error in i.p(...) : (converted from warning) installation of package 'praise' had non-zero exit status Calls: ... with_rprofile_user -> with_envvar -> force -> force -> i.p Execution halted {code} This JIRA targets to the latest of R version 3.6.0 for now. > AppVeyor change to latest R version > --- > > Key: SPARK-25944 > URL: https://issues.apache.org/jira/browse/SPARK-25944 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > R 3.5.1 is released few months ago. We better test higher versions of R in > AppVeyor. > R 3.6.0 is released 2019-04-26. 
This PR targets to change R version from > 3.5.1 to 3.6.0 in AppVeyor. > We should set `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the > warnings below: > {code} > Error in strptime(xx, f, tz = tz) : > (converted from warning) unable to identify current timezone 'C': > please set environment variable 'TZ' > Error in i.p(...) : > (converted from warning) installation of package 'praise' had non-zero exit > status > Calls: ... with_rprofile_user -> with_envvar -> force -> force -> > i.p > Execution halted > {code} > This JIRA targets to the latest of R version 3.6.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25944) AppVeyor change to latest R version
[ https://issues.apache.org/jira/browse/SPARK-25944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25944: - Description: R 3.5.1 is released few months ago. We better test higher versions of R in AppVeyor. R 3.6.0 is released 2019-04-26. This PR targets to change R version from 3.5.1 to 3.6.0 in AppVeyor. This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the warnings below: {code} Error in strptime(xx, f, tz = tz) : (converted from warning) unable to identify current timezone 'C': please set environment variable 'TZ' Error in i.p(...) : (converted from warning) installation of package 'praise' had non-zero exit status Calls: ... with_rprofile_user -> with_envvar -> force -> force -> i.p Execution halted {code} This JIRA targets to the latest of R version 3.6.0 for now. was: R 3.5.1 is released few months ago. We better test higher versions of R in AppVeyor. R 3.6.0 is released 2019-04-26. This PR targets to change R version from 3.5.1 to 3.6.0 in AppVeyor. This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the warnings below: ``` Error in strptime(xx, f, tz = tz) : (converted from warning) unable to identify current timezone 'C': please set environment variable 'TZ' Error in i.p(...) : (converted from warning) installation of package 'praise' had non-zero exit status Calls: ... with_rprofile_user -> with_envvar -> force -> force -> i.p Execution halted ``` This JIRA targets to the latest of R version 3.6.0 for now. > AppVeyor change to latest R version > --- > > Key: SPARK-25944 > URL: https://issues.apache.org/jira/browse/SPARK-25944 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > R 3.5.1 is released few months ago. We better test higher versions of R in > AppVeyor. > R 3.6.0 is released 2019-04-26. 
This PR targets to change R version from > 3.5.1 to 3.6.0 in AppVeyor. > This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the > warnings below: > {code} > Error in strptime(xx, f, tz = tz) : > (converted from warning) unable to identify current timezone 'C': > please set environment variable 'TZ' > Error in i.p(...) : > (converted from warning) installation of package 'praise' had non-zero exit > status > Calls: ... with_rprofile_user -> with_envvar -> force -> force -> > i.p > Execution halted > {code} > This JIRA targets to the latest of R version 3.6.0 for now. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25944) AppVeyor change to latest R version
[ https://issues.apache.org/jira/browse/SPARK-25944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849354#comment-16849354 ] Hyukjin Kwon commented on SPARK-25944: -- Ah .. I realised that I messed up JIRA and PR links. I fixed it accordingly. > AppVeyor change to latest R version > --- > > Key: SPARK-25944 > URL: https://issues.apache.org/jira/browse/SPARK-25944 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > R 3.5.1 is released few months ago. We better test higher versions of R in > AppVeyor. > R 3.6.0 is released 2019-04-26. This PR targets to change R version from > 3.5.1 to 3.6.0 in AppVeyor. > This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the > warnings below: > ``` > Error in strptime(xx, f, tz = tz) : > (converted from warning) unable to identify current timezone 'C': > please set environment variable 'TZ' > Error in i.p(...) : > (converted from warning) installation of package 'praise' had non-zero exit > status > Calls: ... with_rprofile_user -> with_envvar -> force -> force -> > i.p > Execution halted > ``` > This JIRA targets to the latest of R version 3.6.0 for now. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25944) AppVeyor change to latest R version
[ https://issues.apache.org/jira/browse/SPARK-25944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25944: - Description: R 3.5.1 is released few months ago. We better test higher versions of R in AppVeyor. R 3.6.0 is released 2019-04-26. This PR targets to change R version from 3.5.1 to 3.6.0 in AppVeyor. This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the warnings below: ``` Error in strptime(xx, f, tz = tz) : (converted from warning) unable to identify current timezone 'C': please set environment variable 'TZ' Error in i.p(...) : (converted from warning) installation of package 'praise' had non-zero exit status Calls: ... with_rprofile_user -> with_envvar -> force -> force -> i.p Execution halted ``` This JIRA targets to the latest of R version 3.6.0 for now. was:R 3.5.1 is released few months ago. We better test higher versions of R in AppVeyor. > AppVeyor change to latest R version > --- > > Key: SPARK-25944 > URL: https://issues.apache.org/jira/browse/SPARK-25944 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > R 3.5.1 is released few months ago. We better test higher versions of R in > AppVeyor. > R 3.6.0 is released 2019-04-26. This PR targets to change R version from > 3.5.1 to 3.6.0 in AppVeyor. > This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the > warnings below: > ``` > Error in strptime(xx, f, tz = tz) : > (converted from warning) unable to identify current timezone 'C': > please set environment variable 'TZ' > Error in i.p(...) : > (converted from warning) installation of package 'praise' had non-zero exit > status > Calls: ... with_rprofile_user -> with_envvar -> force -> force -> > i.p > Execution halted > ``` > This JIRA targets to the latest of R version 3.6.0 for now. 
[jira] [Issue Comment Deleted] (SPARK-27848) AppVeyor change to latest R version (3.6.0)
[ https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27848: - Comment: was deleted (was: Fixed in https://github.com/apache/spark/pull/24716) > AppVeyor change to latest R version (3.6.0) > --- > > Key: SPARK-27848 > URL: https://issues.apache.org/jira/browse/SPARK-27848 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > R 3.6.0 is released last month. We better test higher versions of R in > AppVeyor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27848) AppVeyor change to latest R version (3.6.0)
[ https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27848. -- Resolution: Duplicate > AppVeyor change to latest R version (3.6.0) > --- > > Key: SPARK-27848 > URL: https://issues.apache.org/jira/browse/SPARK-27848 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > R 3.6.0 is released last month. We better test higher versions of R in > AppVeyor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-27848) AppVeyor change to latest R version (3.6.0)
[ https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27848: - Comment: was deleted (was: User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/24716) > AppVeyor change to latest R version (3.6.0) > --- > > Key: SPARK-27848 > URL: https://issues.apache.org/jira/browse/SPARK-27848 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > R 3.6.0 is released last month. We better test higher versions of R in > AppVeyor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25944) AppVeyor change to latest R version
[ https://issues.apache.org/jira/browse/SPARK-25944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25944: - Summary: AppVeyor change to latest R version (was: AppVeyor change to latest R version (3.5.1)) > AppVeyor change to latest R version > --- > > Key: SPARK-25944 > URL: https://issues.apache.org/jira/browse/SPARK-25944 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > R 3.5.1 is released few months ago. We better test higher versions of R in > AppVeyor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-27848) AppVeyor change to latest R version (3.6.0)
[ https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-27848: -- > AppVeyor change to latest R version (3.6.0) > --- > > Key: SPARK-27848 > URL: https://issues.apache.org/jira/browse/SPARK-27848 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > R 3.6.0 is released last month. We better test higher versions of R in > AppVeyor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27848) AppVeyor change to latest R version (3.6.0)
[ https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849353#comment-16849353 ] Apache Spark commented on SPARK-27848: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/24716 > AppVeyor change to latest R version (3.6.0) > --- > > Key: SPARK-27848 > URL: https://issues.apache.org/jira/browse/SPARK-27848 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > R 3.6.0 is released last month. We better test higher versions of R in > AppVeyor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27848) AppVeyor change to latest R version (3.6.0)
[ https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27848. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/24716 > AppVeyor change to latest R version (3.6.0) > --- > > Key: SPARK-27848 > URL: https://issues.apache.org/jira/browse/SPARK-27848 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > R 3.6.0 is released last month. We better test higher versions of R in > AppVeyor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27848) AppVeyor change to latest R version (3.6.0)
[ https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27848: - Fix Version/s: 3.0.0 > AppVeyor change to latest R version (3.6.0) > --- > > Key: SPARK-27848 > URL: https://issues.apache.org/jira/browse/SPARK-27848 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > R 3.6.0 is released last month. We better test higher versions of R in > AppVeyor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-27860) Use efficient sorting instead of `.sorted.reverse` sequence
[ https://issues.apache.org/jira/browse/SPARK-27860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wenxuanguan closed SPARK-27860. --- > Use efficient sorting instead of `.sorted.reverse` sequence > --- > > Key: SPARK-27860 > URL: https://issues.apache.org/jira/browse/SPARK-27860 > Project: Spark > Issue Type: Improvement > Components: DStreams, Structured Streaming, Web UI >Affects Versions: 2.4.3 >Reporter: wenxuanguan >Priority: Minor > > use descending sort instead of two action of ascending sort and reverse. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27860) Use efficient sorting instead of `.sorted.reverse` sequence
[ https://issues.apache.org/jira/browse/SPARK-27860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wenxuanguan resolved SPARK-27860. - Resolution: Duplicate > Use efficient sorting instead of `.sorted.reverse` sequence > --- > > Key: SPARK-27860 > URL: https://issues.apache.org/jira/browse/SPARK-27860 > Project: Spark > Issue Type: Improvement > Components: DStreams, Structured Streaming, Web UI >Affects Versions: 2.4.3 >Reporter: wenxuanguan >Priority: Minor > > use descending sort instead of two action of ascending sort and reverse. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-341) Added MapPartitionsWithSplitRDD.
[ https://issues.apache.org/jira/browse/SPARK-341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wenxuanguan closed SPARK-341. - > Added MapPartitionsWithSplitRDD. > > > Key: SPARK-341 > URL: https://issues.apache.org/jira/browse/SPARK-341 > Project: Spark > Issue Type: Bug >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27860) Use efficient sorting instead of `.sorted.reverse` sequence
wenxuanguan created SPARK-27860: --- Summary: Use efficient sorting instead of `.sorted.reverse` sequence Key: SPARK-27860 URL: https://issues.apache.org/jira/browse/SPARK-27860 Project: Spark Issue Type: Improvement Components: DStreams, Structured Streaming, Web UI Affects Versions: 2.4.3 Reporter: wenxuanguan Use a descending sort instead of the two-step sequence of an ascending sort followed by a reverse. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
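The improvement proposed above can be sketched in Python (the Spark code itself is Scala, where the analogous change is sorting once with a reversed `Ordering` instead of `.sorted.reverse`):

```python
data = [3, 1, 4, 1, 5, 9, 2, 6]

# Two passes: an ascending sort, then a full reversal of the result.
two_pass = list(reversed(sorted(data)))

# One pass: sort directly in descending order.
one_pass = sorted(data, reverse=True)

assert one_pass == two_pass == [9, 6, 5, 4, 3, 2, 1, 1]
```

One design caveat: for elements that compare equal under the sort key, `sorted(..., reverse=True)` is stable (it preserves their original relative order), while reversing an ascending sort inverts it. For distinct keys, as in this numeric example, the two forms produce identical results.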
[jira] [Resolved] (SPARK-27859) Use efficient sorting instead of `.sorted.reverse` sequence
[ https://issues.apache.org/jira/browse/SPARK-27859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27859. --- Resolution: Fixed Assignee: wenxuanguan Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24711 > Use efficient sorting instead of `.sorted.reverse` sequence > --- > > Key: SPARK-27859 > URL: https://issues.apache.org/jira/browse/SPARK-27859 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: wenxuanguan >Priority: Trivial > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27859) Use efficient sorting instead of `.sorted.reverse` sequence
[ https://issues.apache.org/jira/browse/SPARK-27859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27859: Assignee: Apache Spark > Use efficient sorting instead of `.sorted.reverse` sequence > --- > > Key: SPARK-27859 > URL: https://issues.apache.org/jira/browse/SPARK-27859 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Trivial > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27859) Use efficient sorting instead of `.sorted.reverse` sequence
[ https://issues.apache.org/jira/browse/SPARK-27859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27859: Assignee: (was: Apache Spark) > Use efficient sorting instead of `.sorted.reverse` sequence > --- > > Key: SPARK-27859 > URL: https://issues.apache.org/jira/browse/SPARK-27859 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Trivial > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27859) Use efficient sorting instead of `.sorted.reverse` sequence
Dongjoon Hyun created SPARK-27859: - Summary: Use efficient sorting instead of `.sorted.reverse` sequence Key: SPARK-27859 URL: https://issues.apache.org/jira/browse/SPARK-27859 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27833) java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark
[ https://issues.apache.org/jira/browse/SPARK-27833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849320#comment-16849320 ] Raviteja edited comment on SPARK-27833 at 5/28/19 4:25 AM: --- I have tried step by step, i faced the issue when i add watermark along with group and agg functions. Without watermarking the job is successfully completing. A similar issue has been raise [27564|https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-27564] we cannot use "console sink" because it will print the output in console. we are using "custom sink" because we need to write the data to other databases. I dont know much about scala but i think error is because the query planner is not able find a plan for watermarking when creating a physical plan. was (Author: ravitejasutrave): I have tried step by step, i faced the issue when i add watermark along with group and agg functions. Without watermarking the job is successfully completing. we cannot use "console sink" because it will print the output in console. we are using "custom sink" because we need to write the data to other databases. I dont know much about scala but i think error is because the query planner is not able find a plan for watermarking when creating a physical plan. > java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > > > Key: SPARK-27833 > URL: https://issues.apache.org/jira/browse/SPARK-27833 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 > Environment: spark 2.3.0 > java 1.8 > kafka version 0.10. 
>Reporter: Raviteja >Priority: Minor > Labels: spark-streaming-kafka > Attachments: kafka_consumer_code.java, kafka_custom_sink.java, > kafka_error_log.txt > > > Hi , > We have a requirement to read data from kafka, apply some transformation and > store data to database .For this we are implementing watermarking feature > along with aggregate function and for storing we are writing our own sink > (Structured streaming) .we are using spark 2.3.0, java 1.8 and kafka version > 0.10. > We are getting the below error. > "*java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > timestamp#39: timestamp, interval 2 minutes*" > > works perfectly fine when we use Console as sink instead custom sink. For > Debugging the issue, we are performing "dataframe.show()" in our custom sink > and nothing else. > Please find the attachment for the Error log and the code. Please look into > this issue as this a blocker and we are not able to proceed further or find > any alternatives as we need watermarking feature. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27833) java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark
[ https://issues.apache.org/jira/browse/SPARK-27833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849320#comment-16849320 ] Raviteja commented on SPARK-27833: -- I have tried step by step, i faced the issue when i add watermark along with group and agg functions. Without watermarking the job is successfully completing. we cannot use "console sink" because it will print the output in console. we are using "custom sink" because we need to write the data to other databases. I dont know much about scala but i think error is because the query planner is not able find a plan for watermarking when creating a physical plan. > java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > > > Key: SPARK-27833 > URL: https://issues.apache.org/jira/browse/SPARK-27833 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 > Environment: spark 2.3.0 > java 1.8 > kafka version 0.10. >Reporter: Raviteja >Priority: Minor > Labels: spark-streaming-kafka > Attachments: kafka_consumer_code.java, kafka_custom_sink.java, > kafka_error_log.txt > > > Hi , > We have a requirement to read data from kafka, apply some transformation and > store data to database .For this we are implementing watermarking feature > along with aggregate function and for storing we are writing our own sink > (Structured streaming) .we are using spark 2.3.0, java 1.8 and kafka version > 0.10. > We are getting the below error. > "*java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > timestamp#39: timestamp, interval 2 minutes*" > > works perfectly fine when we use Console as sink instead custom sink. For > Debugging the issue, we are performing "dataframe.show()" in our custom sink > and nothing else. > Please find the attachment for the Error log and the code. 
Please look into > this issue as this is a blocker and we are not able to proceed further or find > any alternatives, as we need the watermarking feature. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23191) Workers registration fails in case of network drop
[ https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23191: --- Assignee: wuyi > Workers registration fails in case of network drop > --- > > Key: SPARK-23191 > URL: https://issues.apache.org/jira/browse/SPARK-23191 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3, 2.2.1, 2.3.0 > Environment: OS:- Centos 6.9(64 bit) > >Reporter: Neeraj Gupta >Assignee: wuyi >Priority: Critical > > We have a 3 node cluster. We were facing issues of multiple drivers running in > some scenarios in production. > On further investigation we were able to reproduce the scenario in both 1.6.3 and 2.2.1 > with the following steps: > # Set up a 3 node cluster. Start the master and slaves. > # On any node where the worker process is running, block the connections on > port 7077 using iptables. > {code:java} > iptables -A OUTPUT -p tcp --dport 7077 -j DROP > {code} > # After about 10-15 secs we get the error on the node that it is unable to > connect to the master. 
> {code:java} > 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN > org.apache.spark.network.server.TransportChannelHandler - Exception in > connection from > java.io.IOException: Connection timed out > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221) > at > io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR > org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting > for master to reconnect... > 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR > org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting > for master to reconnect... > {code} > # Once we get this exception we re-enable the connections to port 7077 using > {code:java} > iptables -D OUTPUT -p tcp --dport 7077 -j DROP > {code} > # Worker tries to register again with the master but is unable to do so. 
It > gives following error > {code:java} > 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN > org.apache.spark.deploy.worker.Worker - Failed to connect to master > :7077 > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100) > at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108) > at > org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to :7077 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182) > at > org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197) > at org.apache.spark.rpc.netty.O
[jira] [Resolved] (SPARK-23191) Workers registration fails in case of network drop
[ https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23191. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24569 [https://github.com/apache/spark/pull/24569] > Workers registration fails in case of network drop > --- > > Key: SPARK-23191 > URL: https://issues.apache.org/jira/browse/SPARK-23191 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3, 2.2.1, 2.3.0 > Environment: OS:- Centos 6.9(64 bit) > >Reporter: Neeraj Gupta >Assignee: wuyi >Priority: Critical > Fix For: 3.0.0 > > > We have a 3 node cluster. We were facing issues of multiple drivers running in > some scenarios in production. > On further investigation we were able to reproduce the scenario in both 1.6.3 and 2.2.1 > with the following steps: > # Set up a 3 node cluster. Start the master and slaves. > # On any node where the worker process is running, block the connections on > port 7077 using iptables. > {code:java} > iptables -A OUTPUT -p tcp --dport 7077 -j DROP > {code} > # After about 10-15 secs we get the error on the node that it is unable to > connect to the master. 
> {code:java} > 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN > org.apache.spark.network.server.TransportChannelHandler - Exception in > connection from > java.io.IOException: Connection timed out > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221) > at > io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR > org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting > for master to reconnect... > 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR > org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting > for master to reconnect... > {code} > # Once we get this exception we re-enable the connections to port 7077 using > {code:java} > iptables -D OUTPUT -p tcp --dport 7077 -j DROP > {code} > # Worker tries to register again with the master but is unable to do so. 
It > gives following error > {code:java} > 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN > org.apache.spark.deploy.worker.Worker - Failed to connect to master > :7077 > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100) > at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108) > at > org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to :7077 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.ja
[jira] [Updated] (SPARK-27578) Support INTERVAL ... HOUR TO SECOND syntax
[ https://issues.apache.org/jira/browse/SPARK-27578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27578: -- Summary: Support INTERVAL ... HOUR TO SECOND syntax (was: Add support for "interval '23:59:59' hour to second") > Support INTERVAL ... HOUR TO SECOND syntax > -- > > Key: SPARK-27578 > URL: https://issues.apache.org/jira/browse/SPARK-27578 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > Currently, SparkSQL can support interval format like this. > > {code:java} > select interval '5 23:59:59.155' day to second.{code} > > Can SparkSQL support grammar like below, as Presto/Teradata can support it > well now. > {code:java} > select interval '23:59:59.155' hour to second > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27578) Add support for "interval '23:59:59' hour to second"
[ https://issues.apache.org/jira/browse/SPARK-27578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27578: -- Issue Type: Improvement (was: Bug) > Add support for "interval '23:59:59' hour to second" > > > Key: SPARK-27578 > URL: https://issues.apache.org/jira/browse/SPARK-27578 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > Currently, SparkSQL can support interval format like this. > > {code:java} > select interval '5 23:59:59.155' day to second.{code} > > Can SparkSQL support grammar like below, as Teradata can support it well now. > {code:java} > select interval '23:59:59.155' hour to second > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27578) Add support for "interval '23:59:59' hour to second"
[ https://issues.apache.org/jira/browse/SPARK-27578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27578: -- Description: Currently, SparkSQL can support interval format like this. {code:java} select interval '5 23:59:59.155' day to second.{code} Can SparkSQL support grammar like below, as Presto/Teradata can support it well now. {code:java} select interval '23:59:59.155' hour to second {code} was: Currently, SparkSQL can support interval format like this. {code:java} select interval '5 23:59:59.155' day to second.{code} Can SparkSQL support grammar like below, as Teradata can support it well now. {code:java} select interval '23:59:59.155' hour to second {code} > Add support for "interval '23:59:59' hour to second" > > > Key: SPARK-27578 > URL: https://issues.apache.org/jira/browse/SPARK-27578 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > Currently, SparkSQL can support interval format like this. > > {code:java} > select interval '5 23:59:59.155' day to second.{code} > > Can SparkSQL support grammar like below, as Presto/Teradata can support it > well now. > {code:java} > select interval '23:59:59.155' hour to second > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
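To make the requested semantics concrete, here is a minimal, hypothetical sketch in plain Java (not Spark's actual parser; `HourToSecondInterval` and `toMillis` are illustrative names only) of what an HOUR TO SECOND literal such as '23:59:59.155' denotes:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HourToSecondInterval {
    // 'HH:MM:SS' with an optional fractional-seconds part of up to 3 digits.
    private static final Pattern P =
        Pattern.compile("(\\d+):(\\d{1,2}):(\\d{1,2})(?:\\.(\\d{1,3}))?");

    /** Parses an 'HH:MM:SS[.mmm]' literal into total milliseconds. */
    public static long toMillis(String literal) {
        Matcher m = P.matcher(literal);
        if (!m.matches()) throw new IllegalArgumentException(literal);
        long hours = Long.parseLong(m.group(1));
        long minutes = Long.parseLong(m.group(2));
        long seconds = Long.parseLong(m.group(3));
        // Right-pad the fraction so ".1" means 100 ms and ".155" means 155 ms.
        String frac = m.group(4) == null ? "0" : (m.group(4) + "000").substring(0, 3);
        return ((hours * 60 + minutes) * 60 + seconds) * 1000 + Long.parseLong(frac);
    }
}
```

For example, '23:59:59.155' works out to 86,399,155 ms, i.e. one day minus 845 ms.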
[jira] [Resolved] (SPARK-27858) Fix for avro deserialization on union types with multiple non-null types
[ https://issues.apache.org/jira/browse/SPARK-27858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27858. --- Resolution: Fixed Fix Version/s: 3.0.0 2.4.4 This is resolved via https://github.com/apache/spark/pull/24722 > Fix for avro deserialization on union types with multiple non-null types > > > Key: SPARK-27858 > URL: https://issues.apache.org/jira/browse/SPARK-27858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > This issue fixes a bug where, for a union avro type with more than one non-null type > (for instance `["string", "null", "int"]`), deserialization to a DataFrame > would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that > the `fieldWriter` relied on the index from the avro schema before nulls were > filtered out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27858) Fix for avro deserialization on union types with multiple non-null types
[ https://issues.apache.org/jira/browse/SPARK-27858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27858: Assignee: (was: Apache Spark) > Fix for avro deserialization on union types with multiple non-null types > > > Key: SPARK-27858 > URL: https://issues.apache.org/jira/browse/SPARK-27858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to fix a union avro type with more than one non-null value > (for instance `["string", "null", "int"]`) the deserialization to a DataFrame > would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that > the `fieldWriter` relied on the index from the avro schema before null were > filtered out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27858) Fix for avro deserialization on union types with multiple non-null types
[ https://issues.apache.org/jira/browse/SPARK-27858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849257#comment-16849257 ] Apache Spark commented on SPARK-27858: -- User 'gcmerz' has created a pull request for this issue: https://github.com/apache/spark/pull/24722 > Fix for avro deserialization on union types with multiple non-null types > > > Key: SPARK-27858 > URL: https://issues.apache.org/jira/browse/SPARK-27858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to fix a union avro type with more than one non-null value > (for instance `["string", "null", "int"]`) the deserialization to a DataFrame > would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that > the `fieldWriter` relied on the index from the avro schema before null were > filtered out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27858) Fix for avro deserialization on union types with multiple non-null types
[ https://issues.apache.org/jira/browse/SPARK-27858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27858: Assignee: Apache Spark > Fix for avro deserialization on union types with multiple non-null types > > > Key: SPARK-27858 > URL: https://issues.apache.org/jira/browse/SPARK-27858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > This issue aims to fix a union avro type with more than one non-null value > (for instance `["string", "null", "int"]`) the deserialization to a DataFrame > would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that > the `fieldWriter` relied on the index from the avro schema before null were > filtered out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27858) Fix for avro deserialization on union types with multiple non-null types
Dongjoon Hyun created SPARK-27858: - Summary: Fix for avro deserialization on union types with multiple non-null types Key: SPARK-27858 URL: https://issues.apache.org/jira/browse/SPARK-27858 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3, 3.0.0 Reporter: Dongjoon Hyun This issue fixes a bug where, for a union avro type with more than one non-null type (for instance `["string", "null", "int"]`), deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
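The index mismatch described above can be sketched without Spark or Avro. This is a simplified model; `UnionBranchMapping.nonNullIndex` is a hypothetical helper, not the actual `AvroDeserializer` code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UnionBranchMapping {
    /**
     * For a union like ["string", "null", "int"], Spark keeps one field
     * writer per non-null branch (here: 2 writers). Indexing the writer
     * array with the raw avro branch index (int is branch 2) overruns it,
     * producing the ArrayIndexOutOfBoundsException; the fix is to remap
     * avro branch indices to their non-null positions first.
     */
    public static Map<Integer, Integer> nonNullIndex(List<String> unionTypes) {
        Map<Integer, Integer> mapping = new HashMap<>();
        int next = 0;
        for (int i = 0; i < unionTypes.size(); i++) {
            if (!"null".equals(unionTypes.get(i))) {
                mapping.put(i, next++);   // avro branch index -> writer index
            }
        }
        return mapping;
    }
}
```

With `["string", "null", "int"]` this yields branch 0 -> writer 0 and branch 2 -> writer 1, so the int branch no longer indexes past the end of the writer array.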
[jira] [Assigned] (SPARK-27857) DataSourceV2: Support ALTER TABLE statements
[ https://issues.apache.org/jira/browse/SPARK-27857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27857: Assignee: Apache Spark > DataSourceV2: Support ALTER TABLE statements > > > Key: SPARK-27857 > URL: https://issues.apache.org/jira/browse/SPARK-27857 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > > ALTER TABLE statements should be supported for v2 tables. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27857) DataSourceV2: Support ALTER TABLE statements
[ https://issues.apache.org/jira/browse/SPARK-27857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27857: Assignee: (was: Apache Spark) > DataSourceV2: Support ALTER TABLE statements > > > Key: SPARK-27857 > URL: https://issues.apache.org/jira/browse/SPARK-27857 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > > ALTER TABLE statements should be supported for v2 tables. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27857) DataSourceV2: Support ALTER TABLE statements
Ryan Blue created SPARK-27857: - Summary: DataSourceV2: Support ALTER TABLE statements Key: SPARK-27857 URL: https://issues.apache.org/jira/browse/SPARK-27857 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.3 Reporter: Ryan Blue ALTER TABLE statements should be supported for v2 tables. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27768) Infinity, -Infinity, NaN should be recognized in a case insensitive manner
[ https://issues.apache.org/jira/browse/SPARK-27768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849078#comment-16849078 ] Dilip Biswal commented on SPARK-27768: -- [~dongjoon] Thanks for trying out Presto. Just want to share my 2 cents before we take a final call on it. I am okay with whatever you guys decide :). There seems to be a subtle difference between Presto and Spark: Spark returns "NULL" in this case, whereas Presto returns an error. Because of this I think we should be more accommodating of data that is accepted in other systems. I am afraid that, because of the "authoring null" semantics, sometimes during the ETL process we will treat some valid input from other systems as nulls, and it's probably hard for users to locate the bad record and fix it. Let's say for a second that we decide to accept this case. Technically, we will not be portable with Hive and Presto, but we are allowing something more than these two systems, right? Do we think that some users would actually want strings such as "infinity" to be treated as null and would be negatively surprised to see the new behaviour? Let me know what you think. > Infinity, -Infinity, NaN should be recognized in a case insensitive manner > -- > > Key: SPARK-27768 > URL: https://issues.apache.org/jira/browse/SPARK-27768 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > When the inputs contain the constant 'infinity', Spark SQL does not generate > the expected results. 
> {code:java} > SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) > FROM (VALUES ('1'), (CAST('infinity' AS DOUBLE))) v(x); > SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) > FROM (VALUES ('infinity'), ('1')) v(x); > SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) > FROM (VALUES ('infinity'), ('infinity')) v(x); > SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) > FROM (VALUES ('-infinity'), ('infinity')) v(x);{code} > The root cause: Spark SQL does not recognize the special constants in a case > insensitive way. In PostgreSQL, they are recognized in a case insensitive > way. > Link: https://www.postgresql.org/docs/9.3/datatype-numeric.html > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
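For comparison, plain Java's `Double.parseDouble` is also case-sensitive about these constants ("infinity" in lowercase throws `NumberFormatException`), so a case-insensitive front end along the following lines would be needed. This is only an illustrative sketch, not Spark's actual cast logic; the `inf` shorthand is an assumption borrowed from PostgreSQL:

```java
public class SpecialFloats {
    /**
     * Recognizes 'Infinity', '-Infinity' and 'NaN' case-insensitively
     * (as PostgreSQL does) and falls back to normal parsing otherwise.
     */
    public static double parse(String s) {
        String t = s.trim();
        boolean negative = t.startsWith("-");
        String body = (negative || t.startsWith("+")) ? t.substring(1) : t;
        if (body.equalsIgnoreCase("infinity") || body.equalsIgnoreCase("inf")) {
            return negative ? Double.NEGATIVE_INFINITY : Double.POSITIVE_INFINITY;
        }
        if (body.equalsIgnoreCase("nan")) {
            return Double.NaN;   // sign is irrelevant for NaN
        }
        return Double.parseDouble(t);
    }
}
```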
[jira] [Resolved] (SPARK-27071) Expose additional metrics in status.api.v1.StageData
[ https://issues.apache.org/jira/browse/SPARK-27071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-27071. --- Resolution: Fixed Fix Version/s: 3.0.0 > Expose additional metrics in status.api.v1.StageData > > > Key: SPARK-27071 > URL: https://issues.apache.org/jira/browse/SPARK-27071 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Tom van Bussel >Assignee: Tom van Bussel >Priority: Major > Fix For: 3.0.0 > > > Currently StageData exposes the following metrics: > * executorRunTime > * executorCpuTime > * inputBytes > * inputRecords > * outputBytes > * outputRecords > * shuffleReadBytes > * shuffleReadRecords > * shuffleWriteBytes > * shuffleWriteRecords > * memoryBytesSpilled > * diskBytesSpilled > These metrics are computed by aggregating the metrics of the tasks in the > stage. For the task metrics however we keep track of a lot more metrics. > Currently these metrics are also computed for stages (such shuffle read fetch > wait time), but these are not exposed through the api. It would be very > useful if these were also exposed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27071) Expose additional metrics in status.api.v1.StageData
[ https://issues.apache.org/jira/browse/SPARK-27071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell reassigned SPARK-27071: - Assignee: Tom van Bussel > Expose additional metrics in status.api.v1.StageData > > > Key: SPARK-27071 > URL: https://issues.apache.org/jira/browse/SPARK-27071 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Tom van Bussel >Assignee: Tom van Bussel >Priority: Major > > Currently StageData exposes the following metrics: > * executorRunTime > * executorCpuTime > * inputBytes > * inputRecords > * outputBytes > * outputRecords > * shuffleReadBytes > * shuffleReadRecords > * shuffleWriteBytes > * shuffleWriteRecords > * memoryBytesSpilled > * diskBytesSpilled > These metrics are computed by aggregating the metrics of the tasks in the > stage. For the task metrics however we keep track of a lot more metrics. > Currently these metrics are also computed for stages (such shuffle read fetch > wait time), but these are not exposed through the api. It would be very > useful if these were also exposed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27777) Eliminate unnecessary sliding job in AreaUnderCurve
[ https://issues.apache.org/jira/browse/SPARK-27777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27777: - Assignee: zhengruifeng > Eliminate unnecessary sliding job in AreaUnderCurve > - > > Key: SPARK-27777 > URL: https://issues.apache.org/jira/browse/SPARK-27777 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > The current impl of \{AreaUnderCurve} uses \{SlidingRDD} to perform sliding, in > which a prepending job is needed to compute the head items on each partition. > However, this job can be eliminated in the computation of AUC by collecting the local > areas and head/last items at once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27777) Eliminate unnecessary sliding job in AreaUnderCurve
[ https://issues.apache.org/jira/browse/SPARK-27777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27777. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24648 [https://github.com/apache/spark/pull/24648] > Eliminate unnecessary sliding job in AreaUnderCurve > - > > Key: SPARK-27777 > URL: https://issues.apache.org/jira/browse/SPARK-27777 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > The current impl of \{AreaUnderCurve} uses \{SlidingRDD} to perform sliding, in > which a prepending job is needed to compute the head items on each partition. > However, this job can be eliminated in the computation of AUC by collecting the local > areas and head/last items at once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
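The optimization described above (one pass that yields each partition's local area plus its head and last points, with the driver adding the bridging trapezoids between partitions) can be sketched with plain Java arrays standing in for RDD partitions. This is hypothetical code, not the actual mllib implementation:

```java
public class PartitionedAuc {
    /** Trapezoid-rule area under one sorted (x, y) polyline. */
    static double localArea(double[][] pts) {
        double area = 0.0;
        for (int i = 1; i < pts.length; i++) {
            area += (pts[i][0] - pts[i - 1][0]) * (pts[i][1] + pts[i - 1][1]) / 2.0;
        }
        return area;
    }

    /**
     * Each "partition" contributes (localArea, head, last) in a single
     * pass; the driver then adds the trapezoid bridging the last point of
     * partition p-1 to the head of partition p, so no extra sliding job
     * is needed to see each partition's neighbor.
     */
    public static double auc(double[][][] partitions) {
        double total = 0.0;
        for (double[][] part : partitions) total += localArea(part);
        for (int p = 1; p < partitions.length; p++) {
            double[] last = partitions[p - 1][partitions[p - 1].length - 1];
            double[] head = partitions[p][0];
            total += (head[0] - last[0]) * (head[1] + last[1]) / 2.0;
        }
        return total;
    }
}
```

Splitting the curve across partitions gives the same area as computing it on the whole curve at once.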
[jira] [Commented] (SPARK-27855) Union failed between 2 datasets of the same type converted from different dataframes
[ https://issues.apache.org/jira/browse/SPARK-27855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849016#comment-16849016 ] Liang-Chi Hsieh commented on SPARK-27855: - If you notice, the printed schemas of the two Datasets are different. The columns have a different order. Dataset.union resolves columns by position. This is well documented in the API doc. If you want to resolve columns by name, please use the Dataset.unionByName API. > Union failed between 2 datasets of the same type converted from different > dataframes > > > Key: SPARK-27855 > URL: https://issues.apache.org/jira/browse/SPARK-27855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Hao Ren >Priority: Major > > 2 Datasets of the same type converted from different dataframes cannot be unioned. > Here is the code to reproduce the problem. It seems `union` just checks the > schema of the original dataframe, even if the two datasets have already been > converted to the same type of dataset. > {code:java} > case class Entity(key: Int, a: Int, b: String) > val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity] > val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity] > df1.printSchema > df2.printSchema > df1 union df2 > {code} > Result > {code:java} > defined class Entity > df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more > field] > df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 
1 more > field] > converted > root > |-- key: integer (nullable = false) > |-- a: integer (nullable = false) > |-- b: string (nullable = true) > root > |-- key: integer (nullable = false) > |-- b: string (nullable = true) > |-- a: integer (nullable = false) > org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int > as it may truncate > The type path of the target object is: > - field (class: "scala.Int", name: "a") > - root class: "Entity"{code} > The problem is that the two datasets of the same type have different schemas. > The schema of the dataset does not preserve the order of the fields in the > case class definition, but rather that of the original dataframe. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
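The by-position vs. by-name resolution that the comment points to can be illustrated without Spark: aligning one row's values to a target column order by name, which is essentially what `unionByName` does for you. A hedged sketch only; `ColumnAlign.alignByName` is a hypothetical helper, not a Spark API:

```java
import java.util.List;

public class ColumnAlign {
    /**
     * Reorders `values`, laid out in `from` column order, into `to`
     * column order by matching column names. Positional union would
     * instead pair values blindly by index, mixing up `a` and `b`.
     */
    public static Object[] alignByName(List<String> from, List<String> to, Object[] values) {
        Object[] out = new Object[values.length];
        for (int i = 0; i < to.size(); i++) {
            int j = from.indexOf(to.get(i));
            if (j < 0) throw new IllegalArgumentException("missing column: " + to.get(i));
            out[i] = values[j];
        }
        return out;
    }
}
```

For the report's df2 row `(1, "1", 1)` with columns `(key, b, a)`, aligning to `(key, a, b)` yields `(1, 1, "1")`, so the int column lines up with the int field.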
[jira] [Assigned] (SPARK-27665) Split fetch shuffle blocks protocol from OpenBlocks
[ https://issues.apache.org/jira/browse/SPARK-27665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27665: --- Assignee: Yuanjian Li > Split fetch shuffle blocks protocol from OpenBlocks > --- > > Key: SPARK-27665 > URL: https://issues.apache.org/jira/browse/SPARK-27665 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > > In the current approach in OneForOneBlockFetcher, we reuse the OpenBlocks > protocol to describe the fetch request for shuffle blocks, and it makes the > extension work for shuffle fetching, like SPARK-9853 and SPARK-25341, very > awkward. We need a new protocol just for the shuffle block fetcher. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27665) Split fetch shuffle blocks protocol from OpenBlocks
[ https://issues.apache.org/jira/browse/SPARK-27665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27665. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24565 [https://github.com/apache/spark/pull/24565] > Split fetch shuffle blocks protocol from OpenBlocks > --- > > Key: SPARK-27665 > URL: https://issues.apache.org/jira/browse/SPARK-27665 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > > In the current approach in OneForOneBlockFetcher, we reuse the OpenBlocks > protocol to describe the fetch request for shuffle blocks, and it makes the > extension work for shuffle fetching, like SPARK-9853 and SPARK-25341, very > awkward. We need a new protocol just for the shuffle block fetcher. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
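For context, shuffle blocks sent through OpenBlocks are identified by string ids of the form `shuffle_<shuffleId>_<mapId>_<reduceId>` that the receiving side has to re-parse, whereas a dedicated fetch message can carry the ids as structured integers. A small illustrative sketch of that id format (this class is a stand-in for illustration, not the new protocol message itself):

```java
public class ShuffleBlockId {
    public final int shuffleId, mapId, reduceId;

    public ShuffleBlockId(int shuffleId, int mapId, int reduceId) {
        this.shuffleId = shuffleId;
        this.mapId = mapId;
        this.reduceId = reduceId;
    }

    /**
     * OpenBlocks ships opaque strings like "shuffle_2_7_3" that every
     * consumer must split and parse; a dedicated FetchShuffleBlocks-style
     * message can transmit the three integers directly instead.
     */
    public static ShuffleBlockId parse(String blockId) {
        String[] parts = blockId.split("_");
        if (parts.length != 4 || !"shuffle".equals(parts[0])) {
            throw new IllegalArgumentException("not a shuffle block id: " + blockId);
        }
        return new ShuffleBlockId(Integer.parseInt(parts[1]),
                                  Integer.parseInt(parts[2]),
                                  Integer.parseInt(parts[3]));
    }

    @Override public String toString() {
        return "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId;
    }
}
```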
[jira] [Updated] (SPARK-27776) Avoid duplicate Java reflection in DataSource
[ https://issues.apache.org/jira/browse/SPARK-27776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-27776: -- Priority: Trivial (was: Minor) > Avoid duplicate Java reflection in DataSource > - > > Key: SPARK-27776 > URL: https://issues.apache.org/jira/browse/SPARK-27776 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Trivial > > I checked the code of > {code:java} > org.apache.spark.sql.execution.datasources.DataSource{code} > and found that it performs duplicate Java reflection, > which I want to avoid.
[jira] [Assigned] (SPARK-27856) do not forcibly add cast when inserting table
[ https://issues.apache.org/jira/browse/SPARK-27856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27856: Assignee: Apache Spark (was: Wenchen Fan) > do not forcibly add cast when inserting table > - > > Key: SPARK-27856 > URL: https://issues.apache.org/jira/browse/SPARK-27856 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-27856) do not forcibly add cast when inserting table
[ https://issues.apache.org/jira/browse/SPARK-27856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27856: Assignee: Wenchen Fan (was: Apache Spark) > do not forcibly add cast when inserting table > - > > Key: SPARK-27856 > URL: https://issues.apache.org/jira/browse/SPARK-27856 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Created] (SPARK-27856) do not forcibly add cast when inserting table
Wenchen Fan created SPARK-27856: --- Summary: do not forcibly add cast when inserting table Key: SPARK-27856 URL: https://issues.apache.org/jira/browse/SPARK-27856 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Updated] (SPARK-27855) Union failed between 2 datasets of the same type converted from different dataframes
[ https://issues.apache.org/jira/browse/SPARK-27855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao Ren updated SPARK-27855: Description: 2 Datasets of the same type converted from different dataframes cannot be unioned. Here is the code to reproduce the problem. It seems `union` just checks the schema of the original dataframe, even if the two datasets have already been converted to the same type of dataset. {code:java} case class Entity(key: Int, a: Int, b: String) val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity] val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity] df1.printSchema df2.printSchema df1 union df2 {code} Result {code:java} defined class Entity df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more field] df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more field] converted root |-- key: integer (nullable = false) |-- a: integer (nullable = false) |-- b: string (nullable = true) root |-- key: integer (nullable = false) |-- b: string (nullable = true) |-- a: integer (nullable = false) org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int as it may truncate The type path of the target object is: - field (class: "scala.Int", name: "a") - root class: "Entity"{code} The problem is that the two datasets of the same type have different schemas. The schema of the dataset does not preserve the order of the fields in the case class definition, but keeps that of the original dataframe. was: 2 Datasets of the same type converted from different dataframes cannot be unioned. Here is the code to reproduce the problem. It seems `union` just checks the schema of the original dataframe, even if the two datasets have already been converted to the same type of dataset. 
{code:java} case class Entity(key: Int, a: Int, b: String) val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity] val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity] df1.printSchema df2.printSchema df1 union df2 {code} Result {code:java} defined class Entity df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more field] df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more field] converted root |-- key: integer (nullable = false) |-- a: integer (nullable = false) |-- b: string (nullable = true) root |-- key: integer (nullable = false) |-- b: string (nullable = true) |-- a: integer (nullable = false) org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int as it may truncate The type path of the target object is: - field (class: "scala.Int", name: "a") - root class: "Entity"{code} > Union failed between 2 datasets of the same type converted from different > dataframes > > > Key: SPARK-27855 > URL: https://issues.apache.org/jira/browse/SPARK-27855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Hao Ren >Priority: Major > > 2 Datasets of the same type converted from different dataframes cannot be unioned. > Here is the code to reproduce the problem. It seems `union` just checks the > schema of the original dataframe, even if the two datasets have already been > converted to the same type of dataset. > {code:java} > case class Entity(key: Int, a: Int, b: String) > val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity] > val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity] > df1.printSchema > df2.printSchema > df1 union df2 > {code} > Result > {code:java} > defined class Entity > df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more > field] > df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 
1 more > field] > converted > root > |-- key: integer (nullable = false) > |-- a: integer (nullable = false) > |-- b: string (nullable = true) > root > |-- key: integer (nullable = false) > |-- b: string (nullable = true) > |-- a: integer (nullable = false) > org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int > as it may truncate > The type path of the target object is: > - field (class: "scala.Int", name: "a") > - root class: "Entity"{code} > The problem is that the two datasets of the same type have different schemas. > The schema of the dataset does not preserve the order of the fields in the > case class definition, but keeps that of the original dataframe.
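A possible workaround (a sketch added here, not part of the original report; it assumes Spark 2.3+, where Dataset.unionByName is available): since `union` resolves columns by position, either reorder the columns into the case-class order before the union, or resolve them by name:

{code:scala}
// Sketch of two workarounds; Entity, df1 and df2 are the definitions
// from the reproduction above.
case class Entity(key: Int, a: Int, b: String)

val df1 = Seq((2, 2, "2")).toDF("key", "a", "b").as[Entity]
val df2 = Seq((1, "1", 1)).toDF("key", "b", "a")

// Workaround 1: put the columns in case-class order before converting,
// so both datasets carry the same positional schema.
df1.union(df2.select("key", "a", "b").as[Entity])

// Workaround 2 (Spark 2.3+): resolve the columns by name instead of
// by position.
df1.toDF.unionByName(df2).as[Entity]
{code}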
[jira] [Updated] (SPARK-27855) Union failed between 2 datasets of the same type converted from different dataframes
[ https://issues.apache.org/jira/browse/SPARK-27855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao Ren updated SPARK-27855: Description: 2 Datasets of the same type converted from different dataframes cannot be unioned. Here is the code to reproduce the problem. It seems `union` just checks the schema of the original dataframe, even if the two datasets have already been converted to the same type of dataset. {code:java} case class Entity(key: Int, a: Int, b: String) val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity] val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity] df1.printSchema df2.printSchema df1 union df2 {code} Result {code:java} defined class Entity df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more field] df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more field] converted root |-- key: integer (nullable = false) |-- a: integer (nullable = false) |-- b: string (nullable = true) root |-- key: integer (nullable = false) |-- b: string (nullable = true) |-- a: integer (nullable = false) org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int as it may truncate The type path of the target object is: - field (class: "scala.Int", name: "a") - root class: "Entity"{code} was: 2 Datasets of the same type converted from different dataframes cannot be unioned. Here is the code to reproduce the problem. It seems `union` just checks the schema of the original dataframe, even if the two datasets have already been converted to the same type of dataset. {code:java} case class Entity(key: Int, a: Int, b: String) val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity] val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity] df1.printSchema df2.printSchema df1 union df2 {code} Result {code:java} defined class Entity df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more field] df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 
1 more field] converted root |-- key: integer (nullable = false) |-- a: integer (nullable = false) |-- b: string (nullable = true) root |-- key: integer (nullable = false) |-- b: string (nullable = true) |-- a: integer (nullable = false) org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int as it may truncate The type path of the target object is: - field (class: "scala.Int", name: "a") - root class: "Entity" You can either add an expl {code} > Union failed between 2 datasets of the same type converted from different > dataframes > > > Key: SPARK-27855 > URL: https://issues.apache.org/jira/browse/SPARK-27855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Hao Ren >Priority: Major > > 2 Datasets of the same type converted from different dataframes cannot be unioned. > Here is the code to reproduce the problem. It seems `union` just checks the > schema of the original dataframe, even if the two datasets have already been > converted to the same type of dataset. > {code:java} > case class Entity(key: Int, a: Int, b: String) > val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity] > val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity] > df1.printSchema > df2.printSchema > df1 union df2 > {code} > Result > {code:java} > defined class Entity > df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more > field] > df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 
1 more > field] > converted > root > |-- key: integer (nullable = false) > |-- a: integer (nullable = false) > |-- b: string (nullable = true) > root > |-- key: integer (nullable = false) > |-- b: string (nullable = true) > |-- a: integer (nullable = false) > org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int > as it may truncate > The type path of the target object is: > - field (class: "scala.Int", name: "a") > - root class: "Entity"{code}
[jira] [Created] (SPARK-27855) Union failed between 2 datasets of the same type converted from different dataframes
Hao Ren created SPARK-27855: --- Summary: Union failed between 2 datasets of the same type converted from different dataframes Key: SPARK-27855 URL: https://issues.apache.org/jira/browse/SPARK-27855 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.3 Reporter: Hao Ren 2 Datasets of the same type converted from different dataframes cannot be unioned. Here is the code to reproduce the problem. It seems `union` just checks the schema of the original dataframe, even if the two datasets have already been converted to the same type of dataset. {code:java} case class Entity(key: Int, a: Int, b: String) val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity] val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity] df1.printSchema df2.printSchema df1 union df2 {code} Result {code:java} defined class Entity df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more field] df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more field] converted root |-- key: integer (nullable = false) |-- a: integer (nullable = false) |-- b: string (nullable = true) root |-- key: integer (nullable = false) |-- b: string (nullable = true) |-- a: integer (nullable = false) org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int as it may truncate The type path of the target object is: - field (class: "scala.Int", name: "a") - root class: "Entity" You can either add an expl {code}
[jira] [Commented] (SPARK-13182) Spark Executor retries infinitely
[ https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848904#comment-16848904 ] Sean Owen commented on SPARK-13182: --- On its face that sounds like a YARN-related issue. Rescheduling a preempted task is correct, but not if there are not enough resources available to execute it. If resources became available but not for long enough to finish it, that's still an app-level and YARN policy issue. > Spark Executor retries infinitely > - > > Key: SPARK-13182 > URL: https://issues.apache.org/jira/browse/SPARK-13182 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Prabhu Joseph >Priority: Minor > > When a Spark job (Spark-1.5.2) is submitted with a single executor and if > user passes some wrong JVM arguments with spark.executor.extraJavaOptions, > the first executor fails. But the job keeps on retrying, creating a new > executor and failing every time, until CTRL-C is pressed. > ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077" --conf > "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" > /SPARK/SimpleApp.jar > Here when user submits job with ConcGCThreads 16 which is greater than > ParallelGCThreads, JVM will crash. But the job does not exit, keeps on > creating executors and retrying. > .. 
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID > app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, > 2.0 GB RAM > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2846 is now LOADING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2846 is now RUNNING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2846 is now EXITED (Command exited with code 1) > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor > app-20160201065319-0014/2846 removed: Command exited with code 1 > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove > non-existent executor 2846 > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: > app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 > (10.10.72.145:36558) with 12 cores > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID > app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, > 2.0 GB RAM > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2847 is now LOADING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2847 is now EXITED (Command exited with code 1) > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor > app-20160201065319-0014/2847 removed: Command exited with code 1 > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove > non-existent executor 2847 > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: > app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 > (10.10.72.145:36558) with 12 cores > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID > app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 cores, > 2.0 GB RAM > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor 
updated: > app-20160201065319-0014/2848 is now LOADING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2848 is now RUNNING > Spark should not fall into a trap on these kinds of user errors on a > production cluster.
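One mitigation (a hedged sketch, not part of the original report: spark.deploy.maxExecutorRetries was added to the standalone master in releases later than the 1.5.2 reported here, so treat the property name and its effect as an assumption) is to cap how many times the standalone master re-launches failing executors:

{code}
# Sketch: limit executor re-launches on a standalone cluster so a job whose
# executors always crash eventually fails instead of retrying forever.
./spark-submit --class SimpleApp \
  --master spark://10.10.72.145:7077 \
  --conf spark.deploy.maxExecutorRetries=10 \
  /SPARK/SimpleApp.jar
{code}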
[jira] [Resolved] (SPARK-27803) fix column pruning for python UDF
[ https://issues.apache.org/jira/browse/SPARK-27803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27803. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24675 [https://github.com/apache/spark/pull/24675] > fix column pruning for python UDF > - > > Key: SPARK-27803 > URL: https://issues.apache.org/jira/browse/SPARK-27803 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > >
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848881#comment-16848881 ] Xianyin Xin commented on SPARK-15348: - The starting point (or goal) of Delta Lake is not ACID, but "Data Lake", and ACID is just one of its features. The ACID designs of Hive and Delta are very different; both have pros and cons. However, Hive tables and Delta tables are two data sources from Spark's perspective, so pluggable ACID support for different data sources within one framework is an option. Maybe the DataSource V2 API can handle this. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of Hive's transactional tables: > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > Also, it seems that compaction is not supported - alter table ... partition > COMPACT 'major'
[jira] [Commented] (SPARK-27833) java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark
[ https://issues.apache.org/jira/browse/SPARK-27833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848840#comment-16848840 ] Gabor Somogyi commented on SPARK-27833: --- Maybe you could just copy the working sink under a different name and change things step by step until it breaks. At first glance this doesn't look like a Spark problem. > java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > > > Key: SPARK-27833 > URL: https://issues.apache.org/jira/browse/SPARK-27833 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 > Environment: Spark 2.3.0 > Java 1.8 > Kafka version 0.10 >Reporter: Raviteja >Priority: Minor > Labels: spark-streaming-kafka > Attachments: kafka_consumer_code.java, kafka_custom_sink.java, > kafka_error_log.txt > > > Hi, > We have a requirement to read data from Kafka, apply some transformations, and > store the data to a database. For this we are implementing the watermarking feature > along with an aggregate function, and for storing we are writing our own sink > (Structured Streaming). We are using Spark 2.3.0, Java 1.8 and Kafka version > 0.10. > We are getting the below error: > "*java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > timestamp#39: timestamp, interval 2 minutes*" > > It works perfectly fine when we use Console as the sink instead of the custom sink. For > debugging the issue, we are performing "dataframe.show()" in our custom sink > and nothing else. > Please find attached the error log and the code. Please look into > this issue, as this is a blocker and we are not able to proceed further or find > any alternatives, since we need the watermarking feature. >
[jira] [Commented] (SPARK-27553) Operation log is not closed when close session
[ https://issues.apache.org/jira/browse/SPARK-27553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848823#comment-16848823 ] jinwensc commented on SPARK-27553: -- It was closed in org.apache.hive.service.cli.operation.Operation afterRun, but the operation log cannot be retrieved from the Thrift server. > Operation log is not closed when close session > -- > > Key: SPARK-27553 > URL: https://issues.apache.org/jira/browse/SPARK-27553 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: pin_zhang >Priority: Major > > On Windows > 1. start spark-shell > 2. start hive server in shell by > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark.sqlContext) > 3. beeline connect to hive server > 3.1 connect > beeline -u jdbc:hive2://localhost:1 > 3.2 Run SQL > show tables; > 3.3 quit beeline > !quit > Get exception log > {code} > 19/04/24 11:38:22 ERROR HiveSessionImpl: Failed to cleanup > session log dir: SessionHandle [5827428b-d140-4fc0-8ad4-721c39b3ead0] > java.io.IOException: Unable to delete file: > C:\Users\test\AppData\Local\Temp\test\operation_logs\5827428b-d140-4fc0-8ad4-721c39b3ead0\df9cd631-66e7-4303-9a41-a09bdefcf888 > at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279) > at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) > at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) > at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270) > at > org.apache.hive.service.cli.session.HiveSessionImpl.cleanupSessionLogDir(HiveSessionImpl.java:671) > at > org.apache.hive.service.cli.session.HiveSessionImpl.close(HiveSessionImpl.java:643) > at > org.apache.hive.service.cli.session.HiveSessionImplwithUGI.close(HiveSessionImplwithUGI.java:109) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy19.close(Unknown Source) > at > org.apache.hive.service.cli.session.SessionManager.closeSession(SessionManager.java:280) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLSessionManager.closeSession(SparkSQLSessionManager.scala:76) > at org.apache.hive.service.cli.CLIService.closeSession(CLIService.java:237) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.CloseSession(ThriftCLIService.java:397) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1273) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1258) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA 
[jira] [Commented] (SPARK-27826) saveAsTable() function case table have "HiveFileFormat" "ParquetFileFormat" format issue
[ https://issues.apache.org/jira/browse/SPARK-27826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848811#comment-16848811 ] fengtlyer commented on SPARK-27826: --- Hi Hyukjin, Our team thinks this is a compatibility issue. We fully understand that if we use format("hive") the code would work. However, all of our actions should work with the "parquet" format. Why is it not Parquet format when the table was created as Parquet by Impala? If we create the table with an Impala SQL query using "stored as parquet" in Hue and then check HDFS, the files end with ".parq", but we cannot use "write.format("parquet").mode("append").saveAsTable()" to append to this table. We think there is a compatibility issue. > saveAsTable() function case table have "HiveFileFormat" "ParquetFileFormat" > format issue > > > Key: SPARK-27826 > URL: https://issues.apache.org/jira/browse/SPARK-27826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.4.0 > Environment: CDH 5.13.1 - Spark version 2.2.0.cloudera2 > CDH 6.1.1 - Spark version 2.4.0-cdh6.1.1 >Reporter: fengtlyer >Priority: Minor > > Hi Spark Dev Team, > We tested a few times and found this bug can be reproduced in multiple Spark > versions. > We tested in CDH 5.13.1 - Spark version 2.2.0.cloudera2 and CDH 6.1.1 - Spark > version 2.4.0-cdh6.1.1. > Both of them have this bug: > 1. If a table is created by Impala or Hive in Hue, then in Spark code, > "write.format("parquet").mode("append").saveAsTable()" will cause the format > issue (see the error log below). > 2. For a table created by Hive/Impala in Hue, > "write.format("parquet").mode("overwrite").saveAsTable()" still > does not work. > 2.1 For a table created by Hive/Impala in Hue, after > "write.format("parquet").mode("overwrite").saveAsTable()", a subsequent > "write.format("parquet").mode("append").saveAsTable()" works. > 3. For a table created by Hive/Impala in Hue, "insertInto()" still works. 
> 3.1 For a table created by Hive/Impala in Hue, after using "insertInto()" to insert > some new records, trying > "write.format("parquet").mode("append").saveAsTable()" produces the same > format error log. > 4. For a Parquet table created and populated via the Hive shell, > "write.format("parquet").mode("append").saveAsTable()" can insert data, but > Spark only shows the data inserted by Spark, and Hive only shows the data > inserted by Hive. > === > Error Log > === > {code} > spark.read.format("csv").option("sep",",").option("header","true").load("hdfs:///temp1/test_paquettest.csv").write.format("parquet").mode("append").saveAsTable("parquet_test_table") > {code} > {code} > org.apache.spark.sql.AnalysisException: The format of the existing table > default.parquet_test_table is `HiveFileFormat`. It doesn't match the > specified format `ParquetFileFormat`.; > at > org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:115) > at > org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75) > at > org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.e
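A hedged sketch of the workarounds implied in this thread (it assumes the Hive-created table "parquet_test_table" from the report already exists; this is not an official resolution of the issue):

{code:scala}
val df = spark.read.format("csv")
  .option("sep", ",")
  .option("header", "true")
  .load("hdfs:///temp1/test_paquettest.csv")

// insertInto appends by column position and does not compare the
// "parquet" source against the table's HiveFileFormat provider.
df.write.mode("append").insertInto("parquet_test_table")

// Alternatively, declare the provider as "hive" so it matches the
// existing Hive-created table instead of specifying "parquet".
df.write.format("hive").mode("append").saveAsTable("parquet_test_table")
{code}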
[jira] [Commented] (SPARK-27808) Ability to ignore existing files for structured streaming
[ https://issues.apache.org/jira/browse/SPARK-27808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848769#comment-16848769 ] Gabor Somogyi commented on SPARK-27808: --- What I've suggested is a workaround of course, and not efficient for huge data sets. I've had a look and it seems doable. > Ability to ignore existing files for structured streaming > - > > Key: SPARK-27808 > URL: https://issues.apache.org/jira/browse/SPARK-27808 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.3, 2.4.3 >Reporter: Vladimir Matveev >Priority: Major > > Currently it is not easily possible to make a structured streaming query > ignore all of the existing data inside a directory and only process new > files, created after the job was started. See here for example: > [https://stackoverflow.com/questions/51391722/spark-structured-streaming-file-source-starting-offset] > > My use case is to ignore everything which existed in the directory when the > streaming job is first started (and there are no checkpoints), but to behave > as usual when the stream is restarted, e.g. catch up reading new files since > the last restart. This would allow us to use the streaming job for continuous > processing, with all the benefits it brings, but also to keep the possibility > to reprocess the data in batch fashion with a different job, drop the > checkpoints, and make the streaming job only run for the new data. 
> > It would be great to have an option similar to the `newFilesOnly` option on > the original StreamingContext.fileStream method: > https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext@fileStream[K,V,F%3C:org.apache.hadoop.mapreduce.InputFormat[K,V]](directory:String,filter:org.apache.hadoop.fs.Path=%3EBoolean,newFilesOnly:Boolean)(implicitevidence$7:scala.reflect.ClassTag[K],implicitevidence$8:scala.reflect.ClassTag[V],implicitevidence$9:scala.reflect.ClassTag[F]):org.apache.spark.streaming.dstream.InputDStream[(K,V])] > but probably with slightly different semantics, described above (ignore all > existing for the first run, catch up for the following runs)> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
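The workaround discussed above amounts to snapshotting the input directory before the query first starts and filtering those paths out afterwards. A minimal pure-Python sketch of that idea (outside Spark; the function names are illustrative only, and a real implementation would live inside the file source's listing/offset logic):

```python
import os
import tempfile

def snapshot_existing(directory):
    """Record the file names present when the streaming job first starts."""
    return set(os.listdir(directory))

def new_files(directory, ignored):
    """List only files that appeared after the snapshot was taken."""
    return sorted(f for f in os.listdir(directory) if f not in ignored)

# Demo: one file exists before startup, one arrives afterwards.
d = tempfile.mkdtemp()
open(os.path.join(d, "preexisting.json"), "w").close()
seen_at_start = snapshot_existing(d)
open(os.path.join(d, "arrived-later.json"), "w").close()
# Only the file created after startup should be picked up.
```

As noted in the comment, listing and diffing like this is a workaround and does not scale to huge directories, which is why a built-in option on the file source would be preferable.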
[jira] [Updated] (SPARK-27854) [Spark-SQL] OOM when using unequal join sql
[ https://issues.apache.org/jira/browse/SPARK-27854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kai zhao updated SPARK-27854: - Environment: Spark Version:1.6.2 HDP Version:2.5 JDK Version:1.8 OS Version:Redhat 7.3 Cluster Info: 8 nodes Each node : RAM: 256G CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (40 cores) Disk:10*4T HDD+1T SSD Yarn Config: NodeManager Memory:210G NodeManager Vcores:70 Runtime Information Java Home=/opt/jdk1.8.0_131/jre Java Version=1.8.0_131 (Oracle Corporation) Scala Version=version 2.10.5 Spark Properties spark.app.id=application_1558686555626_0024 spark.app.name=org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark.driver.appUIAddress=[http://172.17.3.2:4040|http://172.17.3.2:4040/] spark.driver.extraClassPath=/yinhai_platform/resources/spark_dep_jar/* spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 spark.driver.host=172.17.3.2 spark.driver.maxResultSize=16g spark.driver.port=44591 spark.dynamicAllocation.enabled=true spark.dynamicAllocation.initialExecutors=0 spark.dynamicAllocation.maxExecutors=200 spark.dynamicAllocation.minExecutors=0 spark.eventLog.dir=hdfs:///spark-history spark.eventLog.enabled=true spark.executor.cores=5 spark.executor.extraClassPath=/yinhai_platform/resources/spark_dep_jar/* spark.executor.extraJavaOptions=-XX:MaxPermSize=10240m spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 spark.executor.id=driver spark.executor.memory=16g spark.externalBlockStore.folderName=spark-058bff7c-f76c-4a0e-86a3-b390f2f06d1a spark.hadoop.cacheConf=false spark.history.fs.logDirectory=hdfs:///spark-history spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider spark.kryo.referenceTracking=false spark.kryoserializer.buffer.max=1024m spark.local.dir=/data/disk1/spark-local-dir spark.master=yarn-client 
spark.network.timeout=600s spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS=ambari-node-2 spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES=[http://ambari-node-2:8088/proxy/application_1558686555626_0024] spark.scheduler.allocation.file /usr/hdp/current/spark-thriftserver/conf/spark-thrift-fairscheduler.xml spark.scheduler.mode=FAIR spark.serializer=org.apache.spark.serializer.KryoSerializer spark.shuffle.managr=SORT spark.shuffle.service.enabled=true spark.shuffle.service.port=9339 spark.submit.deployMode=client spark.ui.filters=org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter spark.yarn.am.cores=5 spark.yarn.am.memory=16g spark.yarn.queue=default was: Spark Version:1.6.2 HDP Version:2.5 JDK Version:1.8 OS Version:Redhat 7.3 Runtime Information Java Home=/opt/jdk1.8.0_131/jre Java Version=1.8.0_131 (Oracle Corporation) Scala Version=version 2.10.5 Spark Properties spark.app.id=application_1558686555626_0024 spark.app.name=org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark.driver.appUIAddress=http://172.17.3.2:4040 spark.driver.extraClassPath=/yinhai_platform/resources/spark_dep_jar/* spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 spark.driver.host=172.17.3.2 spark.driver.maxResultSize=16g spark.driver.port=44591 spark.dynamicAllocation.enabled=true spark.dynamicAllocation.initialExecutors=0 spark.dynamicAllocation.maxExecutors=200 spark.dynamicAllocation.minExecutors=0 spark.eventLog.dir=hdfs:///spark-history spark.eventLog.enabled=true spark.executor.cores=5 spark.executor.extraClassPath=/yinhai_platform/resources/spark_dep_jar/* spark.executor.extraJavaOptions=-XX:MaxPermSize=10240m spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 spark.executor.id=driver spark.executor.memory=16g 
spark.externalBlockStore.folderName=spark-058bff7c-f76c-4a0e-86a3-b390f2f06d1a spark.hadoop.cacheConf=false spark.history.fs.logDirectory=hdfs:///spark-history spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider spark.kryo.referenceTracking=false spark.kryoserializer.buffer.max=1024m spark.local.dir=/data/disk1/spark-local-dir spark.master=yarn-client spark.network.timeout=600s spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS=ambari-node-2 spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES=http://ambari-node-2:8088/proxy/application_1558686555626_0024 spark.scheduler.allocation.file /usr/hdp/current/spark-thriftserver/conf/spark-thrift-fairscheduler.xml spark.scheduler.mode=FAIR spark.serializer=org.apache.spark.serializer.KryoSerializer spark.shuffle.managr=SORT spark.shuffle.service.enabled=true spark.shuffle.servi
[jira] [Updated] (SPARK-27854) [Spark-SQL] OOM when using unequal join sql
[ https://issues.apache.org/jira/browse/SPARK-27854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kai zhao updated SPARK-27854: - Environment: Spark Version:1.6.2 HDP Version:2.5 JDK Version:1.8 OS Version:Redhat 7.3 Runtime Information Java Home=/opt/jdk1.8.0_131/jre Java Version=1.8.0_131 (Oracle Corporation) Scala Version=version 2.10.5 Spark Properties spark.app.id=application_1558686555626_0024 spark.app.name=org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark.driver.appUIAddress=http://172.17.3.2:4040 spark.driver.extraClassPath=/yinhai_platform/resources/spark_dep_jar/* spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 spark.driver.host=172.17.3.2 spark.driver.maxResultSize=16g spark.driver.port=44591 spark.dynamicAllocation.enabled=true spark.dynamicAllocation.initialExecutors=0 spark.dynamicAllocation.maxExecutors=200 spark.dynamicAllocation.minExecutors=0 spark.eventLog.dir=hdfs:///spark-history spark.eventLog.enabled=true spark.executor.cores=5 spark.executor.extraClassPath=/yinhai_platform/resources/spark_dep_jar/* spark.executor.extraJavaOptions=-XX:MaxPermSize=10240m spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 spark.executor.id=driver spark.executor.memory=16g spark.externalBlockStore.folderName=spark-058bff7c-f76c-4a0e-86a3-b390f2f06d1a spark.hadoop.cacheConf=false spark.history.fs.logDirectory=hdfs:///spark-history spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider spark.kryo.referenceTracking=false spark.kryoserializer.buffer.max=1024m spark.local.dir=/data/disk1/spark-local-dir spark.master=yarn-client spark.network.timeout=600s spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS=ambari-node-2 
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES=http://ambari-node-2:8088/proxy/application_1558686555626_0024 spark.scheduler.allocation.file /usr/hdp/current/spark-thriftserver/conf/spark-thrift-fairscheduler.xml spark.scheduler.mode=FAIR spark.serializer=org.apache.spark.serializer.KryoSerializer spark.shuffle.managr=SORT spark.shuffle.service.enabled=true spark.shuffle.service.port=9339 spark.submit.deployMode=client spark.ui.filters=org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter spark.yarn.am.cores=5 spark.yarn.am.memory=16g spark.yarn.queue=default > [Spark-SQL] OOM when using unequal join sql > > > Key: SPARK-27854 > URL: https://issues.apache.org/jira/browse/SPARK-27854 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 > Environment: Spark Version:1.6.2 > HDP Version:2.5 > JDK Version:1.8 > OS Version:Redhat 7.3 > > Runtime Information > Java Home=/opt/jdk1.8.0_131/jre > Java Version=1.8.0_131 (Oracle Corporation) > Scala Version=version 2.10.5 > Spark Properties > spark.app.id=application_1558686555626_0024 > spark.app.name=org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 > spark.driver.appUIAddress=http://172.17.3.2:4040 > spark.driver.extraClassPath=/yinhai_platform/resources/spark_dep_jar/* > spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 > spark.driver.host=172.17.3.2 > spark.driver.maxResultSize=16g > spark.driver.port=44591 > spark.dynamicAllocation.enabled=true > spark.dynamicAllocation.initialExecutors=0 > spark.dynamicAllocation.maxExecutors=200 > spark.dynamicAllocation.minExecutors=0 > spark.eventLog.dir=hdfs:///spark-history > spark.eventLog.enabled=true > spark.executor.cores=5 > spark.executor.extraClassPath=/yinhai_platform/resources/spark_dep_jar/* > spark.executor.extraJavaOptions=-XX:MaxPermSize=10240m > 
spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 > spark.executor.id=driver > spark.executor.memory=16g > spark.externalBlockStore.folderName=spark-058bff7c-f76c-4a0e-86a3-b390f2f06d1a > spark.hadoop.cacheConf=false > spark.history.fs.logDirectory=hdfs:///spark-history > spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider > spark.kryo.referenceTracking=false > spark.kryoserializer.buffer.max=1024m > spark.local.dir=/data/disk1/spark-local-dir > spark.master=yarn-client > spark.network.timeout=600s > spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS=ambari-node-2 > spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES=http://ambari-node-2:8088/proxy/application_1558686555626_0024 > spark.scheduler.allocation.file > /usr/hdp/current/spark-thriftserver/conf/spark-thrift-fairscheduler.xml > spark.scheduler.mode=FAIR > spar
[jira] [Commented] (SPARK-27837) Running rand() in SQL with seed of column results in error (rand(col1))
[ https://issues.apache.org/jira/browse/SPARK-27837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848752#comment-16848752 ] Liang-Chi Hsieh commented on SPARK-27837: - I don't see how this makes sense. I checked a few DBs and didn't see a rand function work this way. > Running rand() in SQL with seed of column results in error (rand(col1)) > --- > > Key: SPARK-27837 > URL: https://issues.apache.org/jira/browse/SPARK-27837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jason Ferrell >Priority: Major > > Running this sql: > with a as > ( > select 123 val1 > union all > select 123 val1 > union all > select 123 val1 > ) > select val1,rand(123),rand(val1) > from a > Results in error: org.apache.spark.sql.AnalysisException: Input argument to > rand must be an integer, long or null literal.; > It doesn't appear to recognize the value of the column as an int.
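Spark's rand requires a literal seed, so the usual workaround when a per-row "seed" is wanted is not rand(col) but a deterministic hash of the column mapped into [0, 1). A language-agnostic sketch of that mapping (plain Python, illustrative only; this is not how Spark's rand itself is defined):

```python
import hashlib

def pseudo_rand(seed_value):
    """Deterministically map a per-row value to a float in [0, 1)."""
    digest = hashlib.sha256(str(seed_value).encode()).digest()
    # Use the first 8 bytes as an unsigned 64-bit integer, then normalize.
    return int.from_bytes(digest[:8], "big") / 2**64
```

The same effect can be approximated in Spark SQL with hash-based expressions on the column; the key property is that equal inputs always produce the same "random" value, which is presumably what rand(val1) was expected to do.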
[jira] [Comment Edited] (SPARK-13182) Spark Executor retries infinitely
[ https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848749#comment-16848749 ] Atul Anand edited comment on SPARK-13182 at 5/27/19 9:16 AM: - [~srowen] The issue here is spark does not consider this as failures, and so keeps retrying. I have hit infinite retry in a valid scenario, please see. [here|[https://stackoverflow.com/questions/56236216/spark-keeps-relaunching-executors-after-yarn-kills-them]] Basically yarn preempted spark containers as they were running on lower priority queue. But spark restarted the containers right away. Yarn again killed them. Spark should have hit max failures count after few kills, but it does not consider these as failures. {noformat} 2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.{noformat} Hence it keeps relaunching containers, while Yarn keeps killing them. was (Author: zxcvmnb): [~srowen] The issue here is spark does not consider this as failures, and so keeps retrying. I have hit infinite retry in a valid scenario, please see. [here|[https://stackoverflow.com/questions/56236216/spark-keeps-relaunching-executors-after-yarn-kills-them]]. Basically yarn preempted spark containers as they were running on lower priority queue. But spark restarted the containers right away. Yarn again killed them. Spark should have hit max failures count after few kills, but it does not consider these as failures. {noformat} 2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.{noformat} Hence it keeps relaunching containers, while Yarn keeps killing them. 
> Spark Executor retries infinitely > - > > Key: SPARK-13182 > URL: https://issues.apache.org/jira/browse/SPARK-13182 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Prabhu Joseph >Priority: Minor > > When a Spark job (Spark-1.5.2) is submitted with a single executor and if > user passes some wrong JVM arguments with spark.executor.extraJavaOptions, > the first executor fails. But the job keeps on retrying, creating a new > executor and failing every time, until CTRL-C is pressed. > ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077" --conf > "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" > /SPARK/SimpleApp.jar > Here when user submits job with ConcGCThreads 16 which is greater than > ParallelGCThreads, JVM will crash. But the job does not exit, keeps on > creating executors and retrying. > .. > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID > app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, > 2.0 GB RAM > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2846 is now LOADING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2846 is now RUNNING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2846 is now EXITED (Command exited with code 1) > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor > app-20160201065319-0014/2846 removed: Command exited with code 1 > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove > non-existent executor 2846 > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: > app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 > (10.10.72.145:36558) with 12 cores > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID > 
app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, > 2.0 GB RAM > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2847 is now LOADING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2847 is now EXITED (Command exited with code 1) > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor > app-20160201065319-0014/2847 removed: Command exited with code 1 > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove > non-existent executor 2847 > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: > app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 > (10.10.72.145:36558) with 12 core
[jira] [Updated] (SPARK-27854) [Spark-SQL] OOM when using unequal join sql
[ https://issues.apache.org/jira/browse/SPARK-27854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kai zhao updated SPARK-27854: - Description: I am using Spark 1.6.2 from the HDP package. Reproduce Steps: 1. Start the Spark thrift server or write DataFrame code 2. Execute SQL like: select * from table_a left join table_b on table_a.fieldA <= table_b.fieldB 3. Wait for the job to finish Actual Result: The SQL fails with multiple task errors: a) ExecutorLostFailure (executor 119 exited caused by one of the running tasks) Reason: Container marked as failed: b) java.lang.OutOfMemoryError: Java heap space Expected: The SQL runs successfully. I have tried every method on the Internet, but it still won't work was: I am using Spark 1.6.2 from the HDP package. Reproduce Steps: 1. Start the Spark thrift server or write DataFrame code 2. Execute SQL like: select * from table_a left join table_b on table_a.fieldA <= table_b.fieldB 3. Wait for the job to finish Actual Result: The SQL fails with multiple task errors: Expected: The SQL runs successfully. I have tried every method on the Internet, but it still won't work > [Spark-SQL] OOM when using unequal join sql > > > Key: SPARK-27854 > URL: https://issues.apache.org/jira/browse/SPARK-27854 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: kai zhao >Priority: Major > > I am using Spark 1.6.2 from the HDP package. > Reproduce Steps: > 1. Start the Spark thrift server or write DataFrame code > 2. Execute SQL like: select * from table_a left join table_b on table_a.fieldA > <= table_b.fieldB > 3. Wait for the job to finish > > Actual Result: > The SQL fails with multiple task errors: > a) ExecutorLostFailure (executor 119 exited caused by one of the running > tasks) Reason: Container marked as failed: > b) java.lang.OutOfMemoryError: Java heap space > Expected: > The SQL runs successfully. > I have tried every method on the Internet. 
But it still won't work.
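The OOM is not surprising for this query shape: a non-equality predicate such as fieldA <= fieldB cannot use a hash join, so the engine falls back to a (broadcast) nested-loop join whose output can approach |A| x |B| rows. A tiny Python model of why that blows up (illustrative only, not Spark code):

```python
def theta_join(left, right, predicate):
    """Nested-loop join: the only general strategy for a non-equi predicate."""
    return [(a, b) for a in left for b in right if predicate(a, b)]

# For 100 x 100 inputs, `a <= b` already keeps over half of the 10000
# candidate pairs; at table scale the candidate space (and the broadcast
# side held in executor memory) grows quadratically.
pairs = theta_join(range(100), range(100), lambda a, b: a <= b)
```

Common mitigations are adding an equality component to the join condition or pre-bucketing the range so each bucket pair joins only a bounded slice of data.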
[jira] [Comment Edited] (SPARK-13182) Spark Executor retries infinitely
[ https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848749#comment-16848749 ] Atul Anand edited comment on SPARK-13182 at 5/27/19 9:16 AM: - [~srowen] The issue here is spark does not consider this as failures, and so keeps retrying. I have hit infinite retry in a valid scenario, please see [https://stackoverflow.com/questions/56236216/spark-keeps-relaunching-executors-after-yarn-kills-them]. Basically yarn preempted spark containers as they were running on lower priority queue. But spark restarted the containers right away. Yarn again killed them. Spark should have hit max failures count after few kills, but it does not consider these as failures. {noformat} 2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.{noformat} Hence it keeps relaunching containers, while Yarn keeps killing them. was (Author: zxcvmnb): [~srowen] The issue here is spark does not consider this as failures, and so keeps retrying. I have hit infinite retry in a valid scenario, please see. [here|[https://stackoverflow.com/questions/56236216/spark-keeps-relaunching-executors-after-yarn-kills-them]] Basically yarn preempted spark containers as they were running on lower priority queue. But spark restarted the containers right away. Yarn again killed them. Spark should have hit max failures count after few kills, but it does not consider these as failures. {noformat} 2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.{noformat} Hence it keeps relaunching containers, while Yarn keeps killing them. 
> Spark Executor retries infinitely > - > > Key: SPARK-13182 > URL: https://issues.apache.org/jira/browse/SPARK-13182 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Prabhu Joseph >Priority: Minor > > When a Spark job (Spark-1.5.2) is submitted with a single executor and if > user passes some wrong JVM arguments with spark.executor.extraJavaOptions, > the first executor fails. But the job keeps on retrying, creating a new > executor and failing every time, until CTRL-C is pressed. > ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077" --conf > "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" > /SPARK/SimpleApp.jar > Here when user submits job with ConcGCThreads 16 which is greater than > ParallelGCThreads, JVM will crash. But the job does not exit, keeps on > creating executors and retrying. > .. > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID > app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, > 2.0 GB RAM > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2846 is now LOADING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2846 is now RUNNING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2846 is now EXITED (Command exited with code 1) > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor > app-20160201065319-0014/2846 removed: Command exited with code 1 > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove > non-existent executor 2846 > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: > app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 > (10.10.72.145:36558) with 12 cores > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID > 
app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, > 2.0 GB RAM > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2847 is now LOADING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2847 is now EXITED (Command exited with code 1) > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor > app-20160201065319-0014/2847 removed: Command exited with code 1 > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove > non-existent executor 2847 > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: > app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 > (10.10.72.145:36558) with 12 cores > 16/0
[jira] [Commented] (SPARK-13182) Spark Executor retries infinitely
[ https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848749#comment-16848749 ] Atul Anand commented on SPARK-13182: [~srowen] The issue here is spark does not consider this as failures, and so keeps retrying. I have hit infinite retry in a valid scenario, please see. [here|[https://stackoverflow.com/questions/56236216/spark-keeps-relaunching-executors-after-yarn-kills-them]]. Basically yarn preempted spark containers as they were running on lower priority queue. But spark restarted the containers right away. Yarn again killed them. Spark should have hit max failures count after few kills, but it does not consider these as failures. {noformat} 2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.{noformat} Hence it keeps relaunching containers, while Yarn keeps killing them. > Spark Executor retries infinitely > - > > Key: SPARK-13182 > URL: https://issues.apache.org/jira/browse/SPARK-13182 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Prabhu Joseph >Priority: Minor > > When a Spark job (Spark-1.5.2) is submitted with a single executor and if > user passes some wrong JVM arguments with spark.executor.extraJavaOptions, > the first executor fails. But the job keeps on retrying, creating a new > executor and failing every time, until CTRL-C is pressed. > ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077" --conf > "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" > /SPARK/SimpleApp.jar > Here when user submits job with ConcGCThreads 16 which is greater than > ParallelGCThreads, JVM will crash. 
But the job does not exit, keeps on > creating executors and retrying. > .. > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID > app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, > 2.0 GB RAM > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2846 is now LOADING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2846 is now RUNNING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2846 is now EXITED (Command exited with code 1) > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor > app-20160201065319-0014/2846 removed: Command exited with code 1 > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove > non-existent executor 2846 > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: > app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 > (10.10.72.145:36558) with 12 cores > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID > app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, > 2.0 GB RAM > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2847 is now LOADING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2847 is now EXITED (Command exited with code 1) > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor > app-20160201065319-0014/2847 removed: Command exited with code 1 > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove > non-existent executor 2847 > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: > app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 > (10.10.72.145:36558) with 12 cores > 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID > app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 
cores, > 2.0 GB RAM > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2848 is now LOADING > 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: > app-20160201065319-0014/2848 is now RUNNING > Spark should not fall into a trap on these kind of user errors on a > production cluster.
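The quoted TaskSetManager message is the crux: failures attributed to the executor rather than the task (e.g. YARN preemption) are deliberately excluded from the spark.task.maxFailures count, so the retry limit is never reached. A simplified Python model of that counting rule (not Spark's actual code; the reason strings are illustrative):

```python
MAX_TASK_FAILURES = 4  # default value of spark.task.maxFailures

def should_abort(failure_reasons):
    """Mirror the rule in the log: executor-caused failures don't count
    towards the maximum number of failures for the task."""
    counted = sum(1 for reason in failure_reasons
                  if reason != "executor_lost_unrelated_to_task")
    return counted >= MAX_TASK_FAILURES

# Endless preemptions never trip the limit, hence the relaunch loop.
```

Four genuine task failures abort the stage, but any number of executor losses does not, which matches the infinite-retry behaviour both reporters describe.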
[jira] [Updated] (SPARK-27854) [Spark-SQL] OOM when using unequal join sql
[ https://issues.apache.org/jira/browse/SPARK-27854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kai zhao updated SPARK-27854: - Description: I am using Spark 1.6.2 from the HDP package. Reproduce Steps: 1. Start the Spark thrift server or write DataFrame code 2. Execute SQL like: select * from table_a left join table_b on table_a.fieldA <= table_b.fieldB 3. Wait for the job to finish Actual Result: The SQL fails with multiple task errors: Expected: The SQL runs successfully. I have tried every method on the Internet, but it still won't work was: I am using Spark 1.6.2 from the HDP package. Reproduce Steps: 1. Start the Spark thrift server or write DataFrame code 2. Execute SQL like: select * from table_a left join table_b on table_a.fieldA <= table_b.fieldB 3. Wait for the job to finish Actual Result: The SQL fails with an error: Expected: The SQL runs successfully. I have tried every method on the Internet, but it still won't work > [Spark-SQL] OOM when using unequal join sql > > > Key: SPARK-27854 > URL: https://issues.apache.org/jira/browse/SPARK-27854 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: kai zhao >Priority: Major > > I am using Spark 1.6.2 from the HDP package. > Reproduce Steps: > 1. Start the Spark thrift server or write DataFrame code > 2. Execute SQL like: select * from table_a left join table_b on table_a.fieldA > <= table_b.fieldB > 3. Wait for the job to finish > > Actual Result: > The SQL fails with multiple task errors: > > Expected: > The SQL runs successfully. > I have tried every method on the Internet, but it still won't work
[jira] [Created] (SPARK-27854) [Spark-SQL] OOM when using unequal join sql
kai zhao created SPARK-27854: Summary: [Spark-SQL] OOM when using unequal join sql Key: SPARK-27854 URL: https://issues.apache.org/jira/browse/SPARK-27854 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.2 Reporter: kai zhao I am using Spark 1.6.2 from the HDP package. Reproduce Steps: 1. Start the Spark thrift server or write DataFrame code 2. Execute SQL like: select * from table_a left join table_b on table_a.fieldA <= table_b.fieldB 3. Wait for the job to finish Actual Result: The SQL fails with an error: Expected: The SQL runs successfully. I have tried every method on the Internet, but it still won't work
[jira] [Updated] (SPARK-27851) Allow for custom BroadcastMode#transform return values
[ https://issues.apache.org/jira/browse/SPARK-27851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marc Arndt updated SPARK-27851: --- Summary: Allow for custom BroadcastMode#transform return values (was: Allow for custom BroadcastMode return values) > Allow for custom BroadcastMode#transform return values > -- > > Key: SPARK-27851 > URL: https://issues.apache.org/jira/browse/SPARK-27851 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 2.4.3 >Reporter: Marc Arndt >Priority: Major > > According to the BroadcastMode API the BroadcastMode#transform methods are > allowed to return a result object of an arbitrary type: > {code:scala} > /** > * Marker trait to identify the shape in which tuples are broadcasted. > Typical examples of this are > * identity (tuples remain unchanged) or hashed (tuples are converted into > some hash index). > */ > trait BroadcastMode { > def transform(rows: Array[InternalRow]): Any > def transform(rows: Iterator[InternalRow], sizeHint: Option[Long]): Any > def canonicalized: BroadcastMode > } > {code} > When looking at the code which later uses the instantiated BroadcastMode > objects in BroadcastExchangeExec, it becomes clear that this is not really the case. > The following lines in BroadcastExchangeExec suggest that only objects of > type HashedRelation and Array[InternalRow] are allowed as a result for the > BroadcastMode#transform methods: > {code:scala} > // Construct the relation. 
> val relation = mode.transform(input, Some(numRows)) > val dataSize = relation match { > case map: HashedRelation => > map.estimatedSize > case arr: Array[InternalRow] => > arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum > case _ => > throw new SparkException("[BUG] BroadcastMode.transform returned > unexpected type: " + > relation.getClass.getName) > } > {code} > I believe this is the only place in the code where the result of > the BroadcastMode#transform method must be either of type HashedRelation or > Array[InternalRow]. Therefore, to allow for broader custom implementations of > BroadcastMode, I believe it would be a good idea to calculate > the data size of the broadcast value in a way that is independent of the > BroadcastMode implementation used. > One way this could be done is by providing an additional > BroadcastMode#calculateDataSize method, which would need to be implemented by the > BroadcastMode implementations. This method could then return the required > number of bytes for a given broadcast value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
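The calculateDataSize method proposed in the issue could look like the following sketch (a hypothetical signature illustrating the proposal, not part of the actual Spark API):

{code:scala}
// Hypothetical extension of the BroadcastMode trait as proposed above.
// calculateDataSize would replace the hardcoded type match in
// BroadcastExchangeExec, letting each implementation report its own size.
trait BroadcastMode {
  def transform(rows: Array[InternalRow]): Any
  def transform(rows: Iterator[InternalRow], sizeHint: Option[Long]): Any
  def canonicalized: BroadcastMode

  /** Number of bytes required by a value previously produced by transform. */
  def calculateDataSize(relation: Any): Long
}

// BroadcastExchangeExec could then compute the size without pattern matching
// on concrete result types:
//   val dataSize = mode.calculateDataSize(relation)
{code}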
[jira] [Created] (SPARK-27853) Allow for custom Partitioning implementations
Marc Arndt created SPARK-27853: -- Summary: Allow for custom Partitioning implementations Key: SPARK-27853 URL: https://issues.apache.org/jira/browse/SPARK-27853 Project: Spark Issue Type: Improvement Components: Optimizer, SQL Affects Versions: 2.4.3 Reporter: Marc Arndt When partitioning a Dataset, Spark uses the physical plan element ShuffleExchangeExec together with a Partitioning instance. I find myself in a situation where I need to provide my own partitioning criteria that decide to which partition each InternalRow should belong. According to the Spark API, I would expect to be able to provide my custom partitioning criteria as a custom implementation of the Partitioning interface. Sadly, after implementing a custom Partitioning you will receive an "Exchange not implemented for $newPartitioning" error message, because of the following code inside the ShuffleExchangeExec#prepareShuffleDependency method: {code:scala} val part: Partitioner = newPartitioning match { case RoundRobinPartitioning(numPartitions) => new HashPartitioner(numPartitions) case HashPartitioning(_, n) => new Partitioner { override def numPartitions: Int = n // For HashPartitioning, the partitioning key is already a valid partition ID, as we use // `HashPartitioning.partitionIdExpression` to produce partitioning key. override def getPartition(key: Any): Int = key.asInstanceOf[Int] } case RangePartitioning(sortingExpressions, numPartitions) => // Internally, RangePartitioner runs a job on the RDD that samples keys to compute // partition bounds. To get accurate samples, we need to copy the mutable keys. 
val rddForSampling = rdd.mapPartitionsInternal { iter => val mutablePair = new MutablePair[InternalRow, Null]() iter.map(row => mutablePair.update(row.copy(), null)) } implicit val ordering = new LazilyGeneratedOrdering(sortingExpressions, outputAttributes) new RangePartitioner( numPartitions, rddForSampling, ascending = true, samplePointsPerPartitionHint = SQLConf.get.rangeExchangeSampleSizePerPartition) case SinglePartition => new Partitioner { override def numPartitions: Int = 1 override def getPartition(key: Any): Int = 0 } case _ => sys.error(s"Exchange not implemented for $newPartitioning") // TODO: Handle BroadcastPartitioning. } def getPartitionKeyExtractor(): InternalRow => Any = newPartitioning match { case RoundRobinPartitioning(numPartitions) => // Distributes elements evenly across output partitions, starting from a random partition. var position = new Random(TaskContext.get().partitionId()).nextInt(numPartitions) (row: InternalRow) => { // The HashPartitioner will handle the `mod` by the number of partitions position += 1 position } case h: HashPartitioning => val projection = UnsafeProjection.create(h.partitionIdExpression :: Nil, outputAttributes) row => projection(row).getInt(0) case RangePartitioning(_, _) | SinglePartition => identity case _ => sys.error(s"Exchange not implemented for $newPartitioning") } {code} The code in the above code snippet matches the given Partitioning instance "newPartitioning" against a set of given hardcoded Partitioning types. When adding a new Partitioning implementation the pattern matching won't be able to find a pattern for it and therefore will use the fallback case: {code:java} case _ => sys.error(s"Exchange not implemented for $newPartitioning") {code} and throw an exception. To be able to provide custom partition behaviour I would suggest to change the implementation in ShuffleExchangeExec to be able to work with an arbitrary Partitioning implementation. 
For the Partitioner creation, I would imagine this could be done cleanly inside the Partitioning classes via a Partitioning#createPartitioner method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
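A sketch of the shape this proposal could take (hypothetical signatures illustrating the suggestion, not actual Spark API):

{code:scala}
// Hypothetical shape of the proposal: each Partitioning builds its own
// Partitioner and partition-key extractor, so ShuffleExchangeExec no longer
// needs an exhaustive hardcoded match with a sys.error fallback.
trait Partitioning {
  def numPartitions: Int

  /** Build the shuffle Partitioner for this partitioning scheme. */
  def createPartitioner(outputAttributes: Seq[Attribute]): Partitioner

  /** Extract the key passed to Partitioner#getPartition for each row. */
  def createPartitionKeyExtractor(outputAttributes: Seq[Attribute]): InternalRow => Any
}

// ShuffleExchangeExec#prepareShuffleDependency would then reduce to:
//   val part = newPartitioning.createPartitioner(outputAttributes)
//   val getPartitionKey = newPartitioning.createPartitionKeyExtractor(outputAttributes)
// and a user-defined Partitioning would work without touching Spark internals.
{code}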
[jira] [Assigned] (SPARK-27852) One updateBytesWritten operation may be missed in DiskBlockObjectWriter.scala
[ https://issues.apache.org/jira/browse/SPARK-27852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27852: Assignee: (was: Apache Spark) > One updateBytesWritten operaton may be missed in DiskBlockObjectWriter.scala > > > Key: SPARK-27852 > URL: https://issues.apache.org/jira/browse/SPARK-27852 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Shuaiqi Ge >Priority: Major > > In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* > functions, the first of which executes _*updateBytesWritten*_ function while > the other doesn't. I think writeMetrics should record all the information > about writing operations, some data of which will be displayed in the Spark > jobs UI such as the data size of shuffle read and shuffle write. > {code:java} > def write(key: Any, value: Any) { >if (!streamOpen) { > open() >} >objOut.writeKey(key) >objOut.writeValue(value) >recordWritten() > } > override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = { >if (!streamOpen) { > open() >} >bs.write(kvBytes, offs, len) >// updateBytesWritten() // the function is missed > } > ** > * Notify the writer that a record worth of bytes has been written with > OutputStream#write. > */ > def recordWritten(): Unit = { >numRecordsWritten += 1 >writeMetrics.incRecordsWritten(1) >if (numRecordsWritten % 16384 == 0) { > updateBytesWritten() >} > } > /** > * Report the number of bytes written in this writer's shuffle write metrics. > * Note that this is only valid before the underlying streams are closed. > */ > private def updateBytesWritten() { >val pos = channel.position() >writeMetrics.incBytesWritten(pos - reportedPosition) >reportedPosition = pos > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27852) One updateBytesWritten operation may be missed in DiskBlockObjectWriter.scala
[ https://issues.apache.org/jira/browse/SPARK-27852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27852: Assignee: Apache Spark > One updateBytesWritten operaton may be missed in DiskBlockObjectWriter.scala > > > Key: SPARK-27852 > URL: https://issues.apache.org/jira/browse/SPARK-27852 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Shuaiqi Ge >Assignee: Apache Spark >Priority: Major > > In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* > functions, the first of which executes _*updateBytesWritten*_ function while > the other doesn't. I think writeMetrics should record all the information > about writing operations, some data of which will be displayed in the Spark > jobs UI such as the data size of shuffle read and shuffle write. > {code:java} > def write(key: Any, value: Any) { >if (!streamOpen) { > open() >} >objOut.writeKey(key) >objOut.writeValue(value) >recordWritten() > } > override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = { >if (!streamOpen) { > open() >} >bs.write(kvBytes, offs, len) >// updateBytesWritten() // the function is missed > } > ** > * Notify the writer that a record worth of bytes has been written with > OutputStream#write. > */ > def recordWritten(): Unit = { >numRecordsWritten += 1 >writeMetrics.incRecordsWritten(1) >if (numRecordsWritten % 16384 == 0) { > updateBytesWritten() >} > } > /** > * Report the number of bytes written in this writer's shuffle write metrics. > * Note that this is only valid before the underlying streams are closed. > */ > private def updateBytesWritten() { >val pos = channel.position() >writeMetrics.incBytesWritten(pos - reportedPosition) >reportedPosition = pos > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27852) One updateBytesWritten operation may be missed in DiskBlockObjectWriter.scala
[ https://issues.apache.org/jira/browse/SPARK-27852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuaiqi Ge updated SPARK-27852: --- Description: In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* functions, the first of which executes _*updateBytesWritten*_ function while the other doesn't. I think writeMetrics should record all the information about writing operations, some data of which will be displayed in the Spark jobs UI such as the data size of shuffle read and shuffle write. {code:java} def write(key: Any, value: Any) { if (!streamOpen) { open() } objOut.writeKey(key) objOut.writeValue(value) recordWritten() } override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = { if (!streamOpen) { open() } bs.write(kvBytes, offs, len) // updateBytesWritten() // the function is missed } ** * Notify the writer that a record worth of bytes has been written with OutputStream#write. */ def recordWritten(): Unit = { numRecordsWritten += 1 writeMetrics.incRecordsWritten(1) if (numRecordsWritten % 16384 == 0) { updateBytesWritten() } } /** * Report the number of bytes written in this writer's shuffle write metrics. * Note that this is only valid before the underlying streams are closed. */ private def updateBytesWritten() { val pos = channel.position() writeMetrics.incBytesWritten(pos - reportedPosition) reportedPosition = pos } {code} was: In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* functions, the first of which executes _*updateBytesWritten*_ function while the other doesn't. I think writeMetrics should record all the information about writing operations, some data of which will be displayed in the Spark jobs UI such as the data size of shuffle read and shuffle write. 
{code:java} def write(key: Any, value: Any) { if (!streamOpen) { open() } objOut.writeKey(key) objOut.writeValue(value) recordWritten() } override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = { if (!streamOpen) { open() } bs.write(kvBytes, offs, len) } ** * Notify the writer that a record worth of bytes has been written with OutputStream#write. */ def recordWritten(): Unit = { numRecordsWritten += 1 writeMetrics.incRecordsWritten(1) if (numRecordsWritten % 16384 == 0) { updateBytesWritten() } } /** * Report the number of bytes written in this writer's shuffle write metrics. * Note that this is only valid before the underlying streams are closed. */ private def updateBytesWritten() { val pos = channel.position() writeMetrics.incBytesWritten(pos - reportedPosition) reportedPosition = pos } {code} > One updateBytesWritten operaton may be missed in DiskBlockObjectWriter.scala > > > Key: SPARK-27852 > URL: https://issues.apache.org/jira/browse/SPARK-27852 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Shuaiqi Ge >Priority: Major > > In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* > functions, the first of which executes _*updateBytesWritten*_ function while > the other doesn't. I think writeMetrics should record all the information > about writing operations, some data of which will be displayed in the Spark > jobs UI such as the data size of shuffle read and shuffle write. > {code:java} > def write(key: Any, value: Any) { >if (!streamOpen) { > open() >} >objOut.writeKey(key) >objOut.writeValue(value) >recordWritten() > } > override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = { >if (!streamOpen) { > open() >} >bs.write(kvBytes, offs, len) >// updateBytesWritten() // the function is missed > } > ** > * Notify the writer that a record worth of bytes has been written with > OutputStream#write. 
> */ > def recordWritten(): Unit = { >numRecordsWritten += 1 >writeMetrics.incRecordsWritten(1) >if (numRecordsWritten % 16384 == 0) { > updateBytesWritten() >} > } > /** > * Report the number of bytes written in this writer's shuffle write metrics. > * Note that this is only valid before the underlying streams are closed. > */ > private def updateBytesWritten() { >val pos = channel.position() >writeMetrics.incBytesWritten(pos - reportedPosition) >reportedPosition = pos > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27852) One updateBytesWritten operation may be missed in DiskBlockObjectWriter.scala
[ https://issues.apache.org/jira/browse/SPARK-27852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuaiqi Ge updated SPARK-27852: --- Description: In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* functions, the first of which executes _*updateBytesWritten*_ function while the other doesn't. I think writeMetrics should record all the information about writing operations, some data of which will be displayed in the Spark jobs UI such as the data size of shuffle read and shuffle write. {code:java} def write(key: Any, value: Any) { if (!streamOpen) { open() } objOut.writeKey(key) objOut.writeValue(value) recordWritten() } override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = { if (!streamOpen) { open() } bs.write(kvBytes, offs, len) } ** * Notify the writer that a record worth of bytes has been written with OutputStream#write. */ def recordWritten(): Unit = { numRecordsWritten += 1 writeMetrics.incRecordsWritten(1) if (numRecordsWritten % 16384 == 0) { updateBytesWritten() } } /** * Report the number of bytes written in this writer's shuffle write metrics. * Note that this is only valid before the underlying streams are closed. */ private def updateBytesWritten() { val pos = channel.position() writeMetrics.incBytesWritten(pos - reportedPosition) reportedPosition = pos } {code} was: In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* functions, the first of which executes _*updateBytesWritten*_ function while the other doesn't. I think writeMetrics should record all the information about writing operation, some data of which will displayed in the Spark jobs UI such as the data size of shuffle read and shuffle write. 
{code:java} def write(key: Any, value: Any) { if (!streamOpen) { open() } objOut.writeKey(key) objOut.writeValue(value) recordWritten() } override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = { if (!streamOpen) { open() } bs.write(kvBytes, offs, len) } ** * Notify the writer that a record worth of bytes has been written with OutputStream#write. */ def recordWritten(): Unit = { numRecordsWritten += 1 writeMetrics.incRecordsWritten(1) if (numRecordsWritten % 16384 == 0) { updateBytesWritten() } } /** * Report the number of bytes written in this writer's shuffle write metrics. * Note that this is only valid before the underlying streams are closed. */ private def updateBytesWritten() { val pos = channel.position() writeMetrics.incBytesWritten(pos - reportedPosition) reportedPosition = pos } {code} > One updateBytesWritten operaton may be missed in DiskBlockObjectWriter.scala > > > Key: SPARK-27852 > URL: https://issues.apache.org/jira/browse/SPARK-27852 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Shuaiqi Ge >Priority: Major > > In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* > functions, the first of which executes _*updateBytesWritten*_ function while > the other doesn't. I think writeMetrics should record all the information > about writing operations, some data of which will be displayed in the Spark > jobs UI such as the data size of shuffle read and shuffle write. > {code:java} > def write(key: Any, value: Any) { >if (!streamOpen) { > open() >} >objOut.writeKey(key) >objOut.writeValue(value) >recordWritten() > } > override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = { >if (!streamOpen) { > open() >} >bs.write(kvBytes, offs, len) > } > ** > * Notify the writer that a record worth of bytes has been written with > OutputStream#write. 
> */ > def recordWritten(): Unit = { >numRecordsWritten += 1 >writeMetrics.incRecordsWritten(1) >if (numRecordsWritten % 16384 == 0) { > updateBytesWritten() >} > } > /** > * Report the number of bytes written in this writer's shuffle write metrics. > * Note that this is only valid before the underlying streams are closed. > */ > private def updateBytesWritten() { >val pos = channel.position() >writeMetrics.incBytesWritten(pos - reportedPosition) >reportedPosition = pos > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27852) One updateBytesWritten operation may be missed in DiskBlockObjectWriter.scala
Shuaiqi Ge created SPARK-27852: -- Summary: One updateBytesWritten operation may be missed in DiskBlockObjectWriter.scala Key: SPARK-27852 URL: https://issues.apache.org/jira/browse/SPARK-27852 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.4.3 Reporter: Shuaiqi Ge In *_DiskBlockObjectWriter.scala_*, there are two overloaded *_write_* functions, the first of which reaches the _*updateBytesWritten*_ function (via recordWritten) while the other doesn't. I think writeMetrics should record all information about write operations, some of which is displayed in the Spark jobs UI, such as the data size of shuffle read and shuffle write. {code:java} def write(key: Any, value: Any) { if (!streamOpen) { open() } objOut.writeKey(key) objOut.writeValue(value) recordWritten() } override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = { if (!streamOpen) { open() } bs.write(kvBytes, offs, len) } /** * Notify the writer that a record worth of bytes has been written with OutputStream#write. */ def recordWritten(): Unit = { numRecordsWritten += 1 writeMetrics.incRecordsWritten(1) if (numRecordsWritten % 16384 == 0) { updateBytesWritten() } } /** * Report the number of bytes written in this writer's shuffle write metrics. * Note that this is only valid before the underlying streams are closed. */ private def updateBytesWritten() { val pos = channel.position() writeMetrics.incBytesWritten(pos - reportedPosition) reportedPosition = pos } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
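A sketch of the change the reporter seems to have in mind, i.e. updating the metrics on the byte-based path too. This is the reporter's suggestion, not a committed fix; note also that because bs is a buffered stream, channel.position() may lag the bytes actually handed to write, so a real fix would have to account for buffering:

{code:scala}
// The byte-based overload could mirror the record-based one by updating the
// write metrics after handing bytes to the stream:
override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = {
  if (!streamOpen) {
    open()
  }
  bs.write(kvBytes, offs, len)
  updateBytesWritten()  // proposed: keep writeMetrics in sync on this path too
}
{code}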
[jira] [Updated] (SPARK-27851) Allow for custom BroadcastMode return values
[ https://issues.apache.org/jira/browse/SPARK-27851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marc Arndt updated SPARK-27851: --- Description: According to the BroadcastMode API the BroadcastMode#transform methods are allows to return a result object of an arbitrary type: {code:scala} /** * Marker trait to identify the shape in which tuples are broadcasted. Typical examples of this are * identity (tuples remain unchanged) or hashed (tuples are converted into some hash index). */ trait BroadcastMode { def transform(rows: Array[InternalRow]): Any def transform(rows: Iterator[InternalRow], sizeHint: Option[Long]): Any def canonicalized: BroadcastMode } {code} When looking at the code which later uses the instantiated BroadcastMode objects in BroadcastExchangeExec it becomes that this is not really the base. The following lines in BroadcastExchangeExec suggest that only objects of type HashedRelation and Array[InternalRow] are allowed as a result for the BroadcastMode#transform methods: {code:scala} // Construct the relation. val relation = mode.transform(input, Some(numRows)) val dataSize = relation match { case map: HashedRelation => map.estimatedSize case arr: Array[InternalRow] => arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum case _ => throw new SparkException("[BUG] BroadcastMode.transform returned unexpected type: " + relation.getClass.getName) } {code} I believe that this is the only occurrence in the code where the result of the BroadcastMode#transform method must be either of type HashedRelation or Array[InternalRow]. Therefore to allow for broader custom implementations of the BroadcastMode I believe it would be a good idea to solve the calculation of the data size of the broadcast value in an independent way of the used BroadcastMode implemented. One way this could be done is by providing an additional BroadcastMode#calculateDataSize method, which needs to be implemented by the BroadcastMode implementations. 
This methods could then return the required number of bytes for a given broadcast value. was: According to the BroadcastMode API the BroadcastMode#transform methods are allows to return a result object of an arbitrary type: {code:scala} /** * Marker trait to identify the shape in which tuples are broadcasted. Typical examples of this are * identity (tuples remain unchanged) or hashed (tuples are converted into some hash index). */ trait BroadcastMode { def transform(rows: Array[InternalRow]): Any def transform(rows: Iterator[InternalRow], sizeHint: Option[Long]): Any def canonicalized: BroadcastMode } {code} When looking at the code which later uses the instantiated BroadcastMode objects in BroadcastExchangeExec it becomes that this is not really the base. The following lines in BroadcastExchangeExec suggest that only objects of type HashRElation and Array[InternalRow] are allowed as a result for the BroadcastMode#transform methods: {code:scala} // Construct the relation. val relation = mode.transform(input, Some(numRows)) val dataSize = relation match { case map: HashedRelation => map.estimatedSize case arr: Array[InternalRow] => arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum case _ => throw new SparkException("[BUG] BroadcastMode.transform returned unexpected type: " + relation.getClass.getName) } {code} I believe that this is the only occurrence in the code where the result of the BroadcastMode#transform method must be either of type HashedRelation or Array[InternalRow]. Therefore to allow for broader custom implementations of the BroadcastMode I believe it would be a good idea to solve the calculation of the data size of the broadcast value in an independent way of the used BroadcastMode implemented. One way this could be done is by providing an additional BroadcastMode#calculateDataSize method, which needs to be implemented by the BroadcastMode implementations. 
This methods could then return the required number of bytes for a given broadcast value. > Allow for custom BroadcastMode return values > > > Key: SPARK-27851 > URL: https://issues.apache.org/jira/browse/SPARK-27851 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 2.4.3 >Reporter: Marc Arndt >Priority: Major > > According to the BroadcastMode API the BroadcastMode#transform methods are > allows to return a result object of an arbitrary type: > {code:scala} > /** > * Marker trait to identify the shape in which tuples are broadcasted. > Typical examples of this are > * identity (tuples remain unchanged) or hashed (tuples are converted into > some hash index). > */ > trait BroadcastMode { > de
[jira] [Created] (SPARK-27851) Allow for custom BroadcastMode return values
Marc Arndt created SPARK-27851: -- Summary: Allow for custom BroadcastMode return values Key: SPARK-27851 URL: https://issues.apache.org/jira/browse/SPARK-27851 Project: Spark Issue Type: Improvement Components: Optimizer, SQL Affects Versions: 2.4.3 Reporter: Marc Arndt According to the BroadcastMode API, the BroadcastMode#transform methods are allowed to return a result object of an arbitrary type: {code:scala} /** * Marker trait to identify the shape in which tuples are broadcasted. Typical examples of this are * identity (tuples remain unchanged) or hashed (tuples are converted into some hash index). */ trait BroadcastMode { def transform(rows: Array[InternalRow]): Any def transform(rows: Iterator[InternalRow], sizeHint: Option[Long]): Any def canonicalized: BroadcastMode } {code} When looking at the code that later uses the instantiated BroadcastMode objects in BroadcastExchangeExec, it becomes apparent that this is not really the case. The following lines in BroadcastExchangeExec suggest that only objects of type HashedRelation and Array[InternalRow] are allowed as a result of the BroadcastMode#transform methods: {code:scala} // Construct the relation. val relation = mode.transform(input, Some(numRows)) val dataSize = relation match { case map: HashedRelation => map.estimatedSize case arr: Array[InternalRow] => arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum case _ => throw new SparkException("[BUG] BroadcastMode.transform returned unexpected type: " + relation.getClass.getName) } {code} I believe this is the only place in the code where the result of the BroadcastMode#transform method must be either of type HashedRelation or Array[InternalRow]. Therefore, to allow for broader custom implementations of BroadcastMode, I believe it would be a good idea to calculate the data size of the broadcast value in a way that is independent of the BroadcastMode implementation used. 
One way this could be done is by providing an additional BroadcastMode#calculateDataSize method, which would need to be implemented by the BroadcastMode implementations. This method could then return the required number of bytes for a given broadcast value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time
[ https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848702#comment-16848702 ] Gabor Somogyi commented on SPARK-27648: --- OK. Could you please plot the two state graphs, like numRowsTotal, etc.? > In Spark2.4 Structured Streaming:The executor storage memory increasing over > time > - > > Key: SPARK-27648 > URL: https://issues.apache.org/jira/browse/SPARK-27648 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: tommy duan >Priority: Major > Attachments: houragg(1).out, houragg_filter.csv, > image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-49-42-034.png, > image-2019-05-24-10-20-25-723.png, image-2019-05-27-10-10-30-460.png > > > *Spark Program Code Business:* > Read a topic from Kafka, aggregate the stream data sources, and then output > the result to another Kafka topic. > *Problem Description:* > *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory > overflow problems often occurred (because of too many versions of state stored > in memory; this bug has been fixed in Spark 2.4). > {code:java} > /spark-submit \ > --conf "spark.yarn.executor.memoryOverhead=4096M" \ > --num-executors 15 \ > --executor-memory 3G \ > --executor-cores 2 \ > --driver-memory 6G{code} > Executor memory exceptions occur when running with these submit resources under > Spark 2.2, and the normal running time does not exceed one day. > The solution was to set the executor memory larger than before. > My spark-submit script was as follows: > {code:java} > /spark-submit \ > --conf "spark.yarn.executor.memoryOverhead=4096M" \ > --num-executors 15 \ > --executor-memory 46G \ > --executor-cores 3 \ > --driver-memory 6G \ > ...{code} > In this case, the Spark program can be guaranteed to run stably for a long > time, and the executor storage memory stays below 10M (it has been running > stably for more than 20 days). 
> *2) From the upgrade information of Spark 2.4, we can see that the problem of > large memory consumption of state storage has been solved in Spark 2.4.* > So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, > and found that the use of memory was reduced. > But a problem arises, as the running time increases, the storage memory of > executor is growing (see Executors - > Storage Memory from the Spark on Yarn > Resource Manager UI). > This program has been running for 14 days (under SPARK 2.2, running with > this submit resource, the normal running time is not more than one day, > Executor memory abnormalities will occur). > The script submitted by the program under spark2.4 is as follows: > {code:java} > /spark-submit \ > --conf “spark.yarn.executor.memoryOverhead=4096M” > --num-executors 15 \ > --executor-memory 3G \ > --executor-cores 2 \ > --driver-memory 6G > {code} > Under Spark 2.4, I counted the size of executor memory as time went by during > the running of the spark program: > |Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)| > |23.5H|41.6MB/1.5GB|1.770212766| > |108.4H|460.2MB/1.5GB|4.245387454| > |131.7H|559.1MB/1.5GB|4.245254366| > |135.4H|575MB/1.5GB|4.246676514| > |153.6H|641.2MB/1.5GB|4.174479167| > |219H|888.1MB/1.5GB|4.055251142| > |263H|1126.4MB/1.5GB|4.282889734| > |309H|1228.8MB/1.5GB|3.976699029| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
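The "Memory growth rate" column in the table above is simply storage memory used divided by hours run; a quick sketch reproducing the reporter's arithmetic (sample values taken from the table):

{code:scala}
// Recomputing the reporter's "Memory growth rate" column: MB used / hours run.
val samples = Seq(23.5 -> 41.6, 108.4 -> 460.2, 309.0 -> 1228.8)
samples.foreach { case (hours, mb) =>
  println(f"$hours%6.1f h: ${mb / hours}%.6f MB/hour")
}
// 41.6 / 23.5   = 1.770213 MB/hour
// 1228.8 / 309  = 3.976699 MB/hour
// The rate is roughly constant, i.e. storage memory grows linearly with
// run time -- the signature of a leak rather than a warm-up effect.
{code}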
[jira] [Created] (SPARK-27850) Make SparkPlan#doExecuteBroadcast public
Marc Arndt created SPARK-27850: -- Summary: Make SparkPlan#doExecuteBroadcast public Key: SPARK-27850 URL: https://issues.apache.org/jira/browse/SPARK-27850 Project: Spark Issue Type: Improvement Components: Optimizer, Spark Core, SQL Affects Versions: 2.4.3 Reporter: Marc Arndt Broadcasting of SparkPlan results is handled inside the SparkPlan#executeBroadcast method. According to the documentation of SparkPlan, custom broadcast functionality should be provided by overriding the `doExecuteBroadcast` method, as indicated by the comment: {code:scala} /** * Returns the result of this query as a broadcast variable by delegating to `doExecuteBroadcast` * after preparations. * * Concrete implementations of SparkPlan should override `doExecuteBroadcast`. */ final def executeBroadcast[T](): broadcast.Broadcast[T] = executeQuery { if (isCanonicalizedPlan) { throw new IllegalStateException("A canonicalized plan is not supposed to be executed.") } doExecuteBroadcast() } {code} When looking at the definition of SparkPlan#doExecuteBroadcast: {code:scala} /** * Produces the result of the query as a broadcast variable. * * Overridden by concrete implementations of SparkPlan. */ protected[sql] def doExecuteBroadcast[T](): broadcast.Broadcast[T] = { throw new UnsupportedOperationException(s"$nodeName does not implement doExecuteBroadcast") } {code} it becomes apparent that it is not possible to override the method from user-defined SparkPlan implementations, because the method is package-protected. To allow custom SparkPlan implementations to provide their own broadcast operations, I ask that SparkPlan#doExecuteBroadcast be made a public method, so that all SparkPlan implementations, independent of the package they belong to, can override it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
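A sketch of the requested change and how a user-defined plan node could then hook in (the MyBroadcastPlan class is hypothetical, for illustration only):

{code:scala}
// Today (actual Spark code): protected[sql], so only classes under
// org.apache.spark.sql can override it:
//   protected[sql] def doExecuteBroadcast[T](): broadcast.Broadcast[T] = ...
//
// Proposed (sketch): drop the access restriction so SparkPlan subclasses
// outside the sql package can override it:
//   def doExecuteBroadcast[T](): broadcast.Broadcast[T] =
//     throw new UnsupportedOperationException(
//       s"$nodeName does not implement doExecuteBroadcast")

// A custom plan node could then supply its own broadcast value; buildMyRelation
// is a hypothetical user-defined helper:
// class MyBroadcastPlan(child: SparkPlan) extends SparkPlan {
//   override def doExecuteBroadcast[T](): broadcast.Broadcast[T] =
//     sparkContext.broadcast(buildMyRelation(child).asInstanceOf[T])
// }
{code}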
[jira] [Assigned] (SPARK-27849) Redact treeString of FileTable and DataSourceV2ScanExecBase
[ https://issues.apache.org/jira/browse/SPARK-27849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27849: Assignee: Apache Spark
[jira] [Assigned] (SPARK-27849) Redact treeString of FileTable and DataSourceV2ScanExecBase
[ https://issues.apache.org/jira/browse/SPARK-27849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27849: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-27849) Redact treeString of FileTable and DataSourceV2ScanExecBase
Gengliang Wang created SPARK-27849: -- Summary: Redact treeString of FileTable and DataSourceV2ScanExecBase Key: SPARK-27849 URL: https://issues.apache.org/jira/browse/SPARK-27849 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang

As a follow-up to https://github.com/apache/spark/pull/17397, the output of FileTable and DataSourceV2ScanExecBase can contain sensitive information (such as Amazon keys). Such information should not end up in logs or be exposed to non-privileged users.

Add a redaction facility for this output to resolve the issue. A user can enable it by setting a regex in the same spark.redaction.string.regex configuration used for V1.
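As an illustration of how such redaction is typically enabled (the property name comes from the issue; the regex itself is a made-up example, not a recommendation), one might set in spark-defaults.conf:

{code}
# Hypothetical example: any string in plan output matching this pattern
# would be redacted. Adjust the regex to your own secrets.
spark.redaction.string.regex  (?i)(secret|password|accesskey)\S*
{code}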
[jira] [Assigned] (SPARK-27322) DataSourceV2: Select from multiple catalogs
[ https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27322: Assignee: Apache Spark

> DataSourceV2: Select from multiple catalogs
> Key: SPARK-27322
> URL: https://issues.apache.org/jira/browse/SPARK-27322
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: John Zhuge
> Assignee: Apache Spark
> Priority: Major
>
> Support multi-catalog in the following SELECT code paths:
> * SELECT * FROM catalog.db.tbl
> * TABLE catalog.db.tbl
> * JOIN or UNION tables from different catalogs
> * SparkSession.table("catalog.db.tbl")
[jira] [Assigned] (SPARK-27322) DataSourceV2: Select from multiple catalogs
[ https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27322: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC
[ https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848667#comment-16848667 ] Hyukjin Kwon commented on SPARK-13283: -- It was just closed because its affected versions have all reached end-of-life. See http://apache-spark-developers-list.1001551.n3.nabble.com/Resolving-all-JIRAs-affecting-EOL-releases-td27238.html

> Spark doesn't escape column names when creating table on JDBC
> Key: SPARK-13283
> URL: https://issues.apache.org/jira/browse/SPARK-13283
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Maciej Bryński
> Priority: Major
> Labels: bulk-closed
>
> Hi,
> I have the following problem: I have a DataFrame where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I save it to a MySQL database I get this error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an
> error in your SQL syntax; check the manual that corresponds to your MySQL
> server version for the right syntax to use near 'from DECIMAL(20,0) , ' at
> line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` character when creating the table:
> {code}
> `from`
> {code}
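The fix amounts to quoting identifiers before embedding them in the generated DDL. A minimal, self-contained sketch of that behavior (the helper name quoteIdentifier is illustrative; Spark's actual JDBC dialects provide their own quoting logic):

{code:scala}
// Illustrative sketch, not Spark's implementation: quote a column name
// with MySQL backticks, doubling any embedded backtick characters.
def quoteIdentifier(name: String): String =
  "`" + name.replace("`", "``") + "`"

// Building the CREATE TABLE column list with quoting makes reserved
// words such as `from` safe to use as column names.
val columns = Seq("from" -> "DECIMAL(20,0)", "value" -> "TEXT")
val columnList =
  columns.map { case (n, t) => s"${quoteIdentifier(n)} $t" }.mkString(", ")
println(columnList) // `from` DECIMAL(20,0), `value` TEXT
{code}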