[jira] [Commented] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811720#comment-16811720 ] Dongjoon Hyun commented on SPARK-27278: --- The reverting PR should not reuse this JIRA because the purpose is different. This JIRA is dedicated to [~mgaido]'s improvement approach and his PR code. I prefer Marco's way and believe that you do, too. > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Priority: Minor > > With a map that isn't constant-foldable, Spark will optimise an access into a > series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}}, for instance > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as > "x").explain > == Physical Plan == > *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L > as int) = 2) THEN id#180L END AS x#182L] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > This results in an efficient series of ifs and elses in the generated code: > {code:java} > /* 037 */ boolean project_isNull_3 = false; > /* 038 */ int project_value_3 = -1; > /* 039 */ if (!false) { > /* 040 */ project_value_3 = (int) project_expr_0_0; > /* 041 */ } > /* 042 */ > /* 043 */ boolean project_value_2 = false; > /* 044 */ project_value_2 = project_value_3 == 1; > /* 045 */ if (!false && project_value_2) { > /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); > /* 047 */ project_project_value_1_0 = 1L; > /* 048 */ continue; > /* 049 */ } > /* 050 */ > /* 051 */ boolean project_isNull_8 = false; > /* 052 */ int project_value_8 = -1; > /* 053 */ if (!false) { > /* 054 */ project_value_8 = (int) project_expr_0_0; > /* 055 */ } > /* 056 */ > /* 057 */ boolean project_value_7 = false; > /* 058 */ project_value_7 = project_value_8 == 2; > /* 059 */ if (!false && project_value_7) { > /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); > /* 061 */ project_project_value_1_0 = project_expr_0_0; > /* 062 */ continue; > /* 063 */ } > {code} > If the map can be constant folded, the constant folding happens first, and > the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting in > a map traversal and more dynamic checks: > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as > "x").explain > == Physical Plan == > *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which > is what is stored in the {{Literal}} form of the {{map(...)}} expression in > that select. 
The code generated is less efficient, since it has to do a > manual dynamic traversal of the map's array of keys, with type casts etc.: > {code:java} > /* 099 */ int project_index_0 = 0; > /* 100 */ boolean project_found_0 = false; > /* 101 */ while (project_index_0 < project_length_0 && > !project_found_0) { > /* 102 */ final int project_key_0 = > project_keys_0.getInt(project_index_0); > /* 103 */ if (project_key_0 == project_value_2) { > /* 104 */ project_found_0 = true; > /* 105 */ } else { > /* 106 */ project_index_0++; > /* 107 */ } > /* 108 */ } > /* 109 */ > /* 110 */ if (!project_found_0) { > /* 111 */ project_isNull_0 = true; > /* 112 */ } else { > /* 113 */ project_value_0 = > project_values_0.getInt(project_index_0); > /* 114 */ } > {code} > It looks like the problem is in {{SimplifyExtractValueOps}}, which doesn't > handle {{GetMapValue(Literal(...), key)}}, only the {{CreateMap}} form: > {code:scala} > case GetMapValue(CreateMap(elems), key) => CaseKeyWhen(key, elems) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
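A rough sketch of the kind of extra rule case the issue above asks for: handling a map that is already stored as a {{Literal}} of {{ArrayBasedMapData}}, mirroring the existing {{CreateMap}} case quoted in the description. The rule name and body below are illustrative assumptions, not the actual Spark patch.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{CaseKeyWhen, GetMapValue, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.util.ArrayBasedMapData
import org.apache.spark.sql.types.MapType

// Hypothetical rule: rewrite a lookup into a literal map back into a
// CaseKeyWhen chain, mirroring what SimplifyExtractValueOps already does
// for the CreateMap form.
object SimplifyLiteralMapLookup extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case GetMapValue(Literal(map: ArrayBasedMapData, MapType(kt, vt, _)), key) =>
      // Expand the stored keys/values back into literal branches:
      // key1, value1, key2, value2, ...
      val keys = map.keyArray.toSeq[Any](kt)
      val values = map.valueArray.toSeq[Any](vt)
      val branches = keys.zip(values).flatMap { case (k, v) =>
        Seq(Literal(k, kt), Literal(v, vt))
      }
      CaseKeyWhen(key, branches)
  }
}
{code}

With a case like this in place, the lookup into a constant-folded map would be rewritten into the same {{CaseKeyWhen}} chain that the non-foldable map already gets.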
[jira] [Commented] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811717#comment-16811717 ] Dongjoon Hyun commented on SPARK-27278: --- [~huonw], you can make a PR (to revert the old one) if you are uncomfortable. Your PR will be reviewed through the same review process, weighing the pros and cons. Keep in mind that anything that was missed is also the current behavior. > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Priority: Minor > > With a map that isn't constant-foldable, Spark will optimise an access into a > series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}}, for instance > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as > "x").explain > == Physical Plan == > *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L > as int) = 2) THEN id#180L END AS x#182L] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > This results in an efficient series of ifs and elses in the generated code: > {code:java} > /* 037 */ boolean project_isNull_3 = false; > /* 038 */ int project_value_3 = -1; > /* 039 */ if (!false) { > /* 040 */ project_value_3 = (int) project_expr_0_0; > /* 041 */ } > /* 042 */ > /* 043 */ boolean project_value_2 = false; > /* 044 */ project_value_2 = project_value_3 == 1; > /* 045 */ if (!false && project_value_2) { > /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); > /* 047 */ project_project_value_1_0 = 1L; > /* 048 */ continue; > /* 049 */ } > /* 050 */ > /* 051 */ boolean project_isNull_8 = false; > /* 052 */ int project_value_8 = -1; > /* 053 */ if (!false) { > /* 054 */ project_value_8 = (int) project_expr_0_0; > /* 055 */ } > /* 056 */ > /* 057 */ boolean project_value_7 = false; > /* 058 */ project_value_7 = project_value_8 == 2; > /* 059 */ if (!false && project_value_7) { > /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); > /* 061 */ project_project_value_1_0 = project_expr_0_0; > /* 062 */ continue; > /* 063 */ } > {code} > If the map can be constant folded, the constant folding happens first, and > the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting in > a map traversal and more dynamic checks: > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as > "x").explain > == Physical Plan == > *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which > is what is stored in the {{Literal}} form of the {{map(...)}} expression in > that select. 
The code generated is less efficient, since it has to do a > manual dynamic traversal of the map's array of keys, with type casts etc.: > {code:java} > /* 099 */ int project_index_0 = 0; > /* 100 */ boolean project_found_0 = false; > /* 101 */ while (project_index_0 < project_length_0 && > !project_found_0) { > /* 102 */ final int project_key_0 = > project_keys_0.getInt(project_index_0); > /* 103 */ if (project_key_0 == project_value_2) { > /* 104 */ project_found_0 = true; > /* 105 */ } else { > /* 106 */ project_index_0++; > /* 107 */ } > /* 108 */ } > /* 109 */ > /* 110 */ if (!project_found_0) { > /* 111 */ project_isNull_0 = true; > /* 112 */ } else { > /* 113 */ project_value_0 = > project_values_0.getInt(project_index_0); > /* 114 */ } > {code} > It looks like the problem is in {{SimplifyExtractValueOps}}, which doesn't > handle {{GetMapValue(Literal(...), key)}}, only the {{CreateMap}} form: > {code:scala} > case GetMapValue(CreateMap(elems), key) => CaseKeyWhen(key, elems) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27352) Apply for translation of the Chinese version, I hope to get authorization!
[ https://issues.apache.org/jira/browse/SPARK-27352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811765#comment-16811765 ] Teng Peng commented on SPARK-27352: --- Correct me if I am wrong, but I do not think any authorization is required for translation into other languages. > Apply for translation of the Chinese version, I hope to get authorization! > --- > > Key: SPARK-27352 > URL: https://issues.apache.org/jira/browse/SPARK-27352 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Yuan Yifan >Priority: Minor > > Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source > community in China, focusing on Big Data and AI. > Recently, we have been making progress on translating Spark documents. > - [Source Of Document|https://github.com/apachecn/spark-doc-zh] > - [Document Preview|http://spark.apachecn.org/] > There are several reasons: > *1. The English level of many Chinese users is not very good.* > *2. Network problems, you know (China's magic network)!* > *3. Online blogs are very messy.* > We are very willing to do some Chinese localization for your project. If > possible, please give us some authorization. > Yifan Yuan from Apache CN > You may contact me by mail [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] > for more details -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26992) Fix STS scheduler pool correct delivery
[ https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26992: - Assignee: dzcxzl > Fix STS scheduler pool correct delivery > --- > > Key: SPARK-26992 > URL: https://issues.apache.org/jira/browse/SPARK-26992 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.4.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Attachments: error_session.png, error_stage.png > > > The user sets the value of spark.sql.thriftserver.scheduler.pool. > The Spark Thrift Server saves this value in a thread-local LocalProperty, but does > not clean it up after running, causing other sessions to run with the previously > set pool name. > > For example, the second session does not manually set the pool name, so the > default pool name should be used, but the pool name from the previous user's > settings is used instead. This is incorrect. > !error_session.png! > > !error_stage.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
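For context on the bug above, a minimal sketch of the mechanism and the fix direction: the scheduler pool is a thread-local SparkContext property, so whatever sets it per session must clear it afterwards. The helper below is hypothetical and only illustrates the set/clear pattern; it is not the Thrift Server code from the actual fix.

{code:scala}
import org.apache.spark.SparkContext

// Sketch only: the scheduler pool is carried in a thread-local property on
// SparkContext, which is why a value set for one session leaks into the next
// statement on the same thread unless it is cleared again.
def runInPool(sc: SparkContext, pool: Option[String])(body: => Unit): Unit = {
  pool.foreach(p => sc.setLocalProperty("spark.scheduler.pool", p))
  try {
    body
  } finally {
    // Setting the property to null removes it, so sessions that did not ask
    // for a pool fall back to the default instead of the previous one.
    sc.setLocalProperty("spark.scheduler.pool", null)
  }
}
{code}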
[jira] [Resolved] (SPARK-26992) Fix STS scheduler pool correct delivery
[ https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26992. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23895 [https://github.com/apache/spark/pull/23895] > Fix STS scheduler pool correct delivery > --- > > Key: SPARK-26992 > URL: https://issues.apache.org/jira/browse/SPARK-26992 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.4.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Fix For: 3.0.0 > > Attachments: error_session.png, error_stage.png > > > The user sets the value of spark.sql.thriftserver.scheduler.pool. > The Spark Thrift Server saves this value in a thread-local LocalProperty, but does > not clean it up after running, causing other sessions to run with the previously > set pool name. > > For example, the second session does not manually set the pool name, so the > default pool name should be used, but the pool name from the previous user's > settings is used instead. This is incorrect. > !error_session.png! > > !error_stage.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27401) Refactoring conversion of Date/Timestamp to/from java.sql.Date/Timestamp
Maxim Gekk created SPARK-27401: -- Summary: Refactoring conversion of Date/Timestamp to/from java.sql.Date/Timestamp Key: SPARK-27401 URL: https://issues.apache.org/jira/browse/SPARK-27401 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk The fromJavaTimestamp/toJavaTimestamp and toJavaDate/fromJavaDate methods can be implemented using existing DateTimeUtils methods like instantToMicros/microsToInstant and daysToLocalDate/localDateToDays. This should allow us to: # Avoid the invocation of millisToDays and the time zone offset calculation # Simplify the implementation of toJavaTimestamp and properly handle negative inputs # Detect arithmetic overflow of Long -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
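A rough sketch of the refactoring direction described above. The method names follow the issue text; the exact signatures in DateTimeUtils may differ from what the eventual patch does.

{code:scala}
import java.sql.{Date, Timestamp}

import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Express the java.sql conversions through the Instant/LocalDate based
// helpers instead of millisToDays and manual time zone offset handling.
def toJavaTimestamp(micros: Long): Timestamp =
  Timestamp.from(DateTimeUtils.microsToInstant(micros))

def fromJavaTimestamp(ts: Timestamp): Long =
  DateTimeUtils.instantToMicros(ts.toInstant)

def toJavaDate(days: Int): Date =
  Date.valueOf(DateTimeUtils.daysToLocalDate(days))

def fromJavaDate(date: Date): Int =
  DateTimeUtils.localDateToDays(date.toLocalDate)
{code}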
[jira] [Created] (SPARK-27400) LinearSVC only supports binary classification
baris created SPARK-27400: - Summary: LinearSVC only supports binary classification Key: SPARK-27400 URL: https://issues.apache.org/jira/browse/SPARK-27400 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.4.1 Reporter: baris IllegalArgumentException: u'requirement failed: LinearSVC only supports binary classification. 99 classes detected in LinearSVC_6596220b55a3__labelCol' -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
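Not a fix for the ticket itself, but the usual way to use a binary-only classifier such as LinearSVC on a multi-class label is a one-vs-rest reduction. A minimal Scala sketch follows (the original report appears to come from PySpark; column names and parameter values here are placeholders):

{code:scala}
import org.apache.spark.ml.classification.{LinearSVC, OneVsRest}

// LinearSVC is a binary classifier, so a label column with 99 classes has to
// go through a reduction such as one-vs-rest.
val svc = new LinearSVC()
  .setMaxIter(50)
  .setRegParam(0.1)

val ovr = new OneVsRest()
  .setClassifier(svc)
  .setLabelCol("label")
  .setFeaturesCol("features")

// val model = ovr.fit(trainingDf)  // trainingDf is a hypothetical DataFrame
{code}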
[jira] [Updated] (SPARK-21805) disable R vignettes code on Windows
[ https://issues.apache.org/jira/browse/SPARK-21805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-21805: - Issue Type: Sub-task (was: Bug) Parent: SPARK-15799 > disable R vignettes code on Windows > --- > > Key: SPARK-21805 > URL: https://issues.apache.org/jira/browse/SPARK-21805 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.2.0, 2.3.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Major > Fix For: 2.2.1, 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22344) Prevent R CMD check from using /tmp
[ https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-22344: - Issue Type: Sub-task (was: Bug) Parent: SPARK-15799 > Prevent R CMD check from using /tmp > --- > > Key: SPARK-22344 > URL: https://issues.apache.org/jira/browse/SPARK-22344 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0 >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman >Priority: Major > Fix For: 2.2.1, 2.3.0 > > > When R CMD check is run on the SparkR package it leaves behind files in /tmp > which is a violation of CRAN policy. We should instead write to Rtmpdir. > Notes from CRAN are below > {code} > Checking this leaves behind dirs >hive/$USER >$USER > and files named like >b4f6459b-0624-4100-8358-7aa7afbda757_resources > in /tmp, in violation of the CRAN Policy. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24535) Fix java version parsing in SparkR on Windows
[ https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-24535: - Issue Type: Sub-task (was: Bug) Parent: SPARK-15799 > Fix java version parsing in SparkR on Windows > - > > Key: SPARK-24535 > URL: https://issues.apache.org/jira/browse/SPARK-24535 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.3.1, 2.4.0 >Reporter: Shivaram Venkataraman >Assignee: Felix Cheung >Priority: Blocker > Fix For: 2.3.2, 2.4.0 > > > We see errors on CRAN of the form > {code:java} > java version "1.8.0_144" > Java(TM) SE Runtime Environment (build 1.8.0_144-b01) > Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode) > Picked up _JAVA_OPTIONS: -XX:-UsePerfData > -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21) > -- > subscript out of bounds > 1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, > sparkConfig = sparkRTestConfig) at > D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21 > 2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, > sparkExecutorEnvMap, > sparkJars, sparkPackages) > 3: checkJavaVersion() > 4: strsplit(javaVersionFilter[[1]], "[\"]") > {code} > The complete log file is at > http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25572) SparkR tests failed on CRAN on Java 10
[ https://issues.apache.org/jira/browse/SPARK-25572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-25572: - Issue Type: Sub-task (was: Bug) Parent: SPARK-15799 > SparkR tests failed on CRAN on Java 10 > -- > > Key: SPARK-25572 > URL: https://issues.apache.org/jira/browse/SPARK-25572 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Major > Fix For: 2.3.3, 2.4.0 > > > follow up to SPARK-24255 > from 2.3.2 release we can see that CRAN doesn't seem to respect the system > requirements as running tests - we have seen cases where SparkR is run on > Java 10, which unfortunately Spark does not start on. For 2.4.x, lets attempt > skipping all tests -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26010) SparkR vignette fails on CRAN on Java 11
[ https://issues.apache.org/jira/browse/SPARK-26010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-26010: - Issue Type: Sub-task (was: Bug) Parent: SPARK-15799 > SparkR vignette fails on CRAN on Java 11 > > > Key: SPARK-26010 > URL: https://issues.apache.org/jira/browse/SPARK-26010 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Major > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > follow up to SPARK-25572 > but for vignettes > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811654#comment-16811654 ] Felix Cheung commented on SPARK-15799: -- more fixes for this (did not open a JIRA) [https://github.com/apache/spark/commit/fa0f791d4d9f083a45ab631a2e9f88a6b749e416#diff-e1e1d3d40573127e9ee0480caf1283d6] [https://github.com/apache/spark/commit/927081dd959217ed6bf014557db20026d7e22672#diff-e1e1d3d40573127e9ee0480caf1283d6] > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng >Assignee: Shivaram Venkataraman >Priority: Major > Fix For: 2.1.2 > > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us from releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26910) Re-release SparkR to CRAN
[ https://issues.apache.org/jira/browse/SPARK-26910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-26910. -- Resolution: Fixed Fix Version/s: 2.4.1 2.4.1. released [https://cran.r-project.org/web/packages/SparkR/index.html] > Re-release SparkR to CRAN > - > > Key: SPARK-26910 > URL: https://issues.apache.org/jira/browse/SPARK-26910 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Michael Chirico >Assignee: Felix Cheung >Priority: Major > Fix For: 2.4.1 > > > The logical successor to https://issues.apache.org/jira/browse/SPARK-15799 > I don't see anything specifically tracking re-release in the Jira list. It > would be helpful to have an issue tracking this to refer to as an outsider, > as well as to document what the blockers are in case some outside help could > be useful. > * Is there a plan to re-release SparkR to CRAN? > * What are the major blockers to doing so at the moment? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27176) Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4
[ https://issues.apache.org/jira/browse/SPARK-27176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810652#comment-16810652 ] Yuming Wang edited comment on SPARK-27176 at 4/6/19 2:37 PM: - For Hive 2.3.4, we also need {{hive-llap-common}} and {{hive-llap-client}}: {{hive-llap-common}} is used for registry functions: {noformat} scala> spark.range(10).write.saveAsTable("test_hadoop3") java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671) at java.lang.Class.getConstructor0(Class.java:3075) at java.lang.Class.getDeclaredConstructor(Class.java:2178) at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79) at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208) at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.(FunctionRegistry.java:500) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:247) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:388) at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:332) at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:312) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:288) at org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:250) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:272) ... {noformat} {{hive-llap-client}} is used for test Hive(StatisticsSuite, SQLQuerySuite and HiveOrcSourceSuite): {noformat} spark.sharedState.externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog] .client.runSqlHive("SELECT COUNT(*) FROM test_hadoop3") ... java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/io/api/LlapProxy at org.apache.hadoop.hive.ql.exec.GlobalWorkMapFactory.get(GlobalWorkMapFactory.java:102) at org.apache.hadoop.hive.ql.exec.Utilities.clearWorkMapForConf(Utilities.java:3435) at org.apache.hadoop.hive.ql.exec.Utilities.clearWork(Utilities.java:290) at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:443) at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:151) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:733) ... {noformat} We can exclude {{org.apache.curator:curator-framework:jar}} and {{org.apache.curator:apache-curator.jar}} as they are used for add consistent node replacement to LLAP for splits, see HIVE-14589. 
was (Author: q79969786): For Hive 2.3.4, we also need {{hive-llap-common}} and {{hive-llap-client}}: {{hive-llap-common}} is used for registry functions: {noformat} scala> spark.range(10).write.saveAsTable("test_hadoop3") java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671) at java.lang.Class.getConstructor0(Class.java:3075) at java.lang.Class.getDeclaredConstructor(Class.java:2178) at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79) at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208) at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.(FunctionRegistry.java:500) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:247) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:388) at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:332) at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:312) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:288) at org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:250) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:272) ... {noformat} {{hive-llap-client}} is used for test Hive: {noformat} spar
[jira] [Updated] (SPARK-27399) Spark streaming of kafka 0.10 contains some scattered config
[ https://issues.apache.org/jira/browse/SPARK-27399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27399: --- Description: I found a lot of scattered configs in Kafka streaming. I think we should arrange these configs in a unified place. There are also some hardcoded values like {code:java} spark.network.timeout{code} that need to change. was: I found a lot of scattered configs in Kafka streaming. I think we should arrange these configs in a unified place. > Spark streaming of kafka 0.10 contains some scattered config > > > Key: SPARK-27399 > URL: https://issues.apache.org/jira/browse/SPARK-27399 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Priority: Minor > > I found a lot of scattered configs in Kafka streaming. > I think we should arrange these configs in a unified place. > There are also some hardcoded values like > {code:java} > spark.network.timeout{code} > that need to change. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
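An illustrative sketch of the kind of centralized config entries the issue asks for, using Spark's internal ConfigBuilder DSL. The object name, entry names, docs, and defaults below are assumptions for illustration only, not actual Spark settings.

{code:scala}
import java.util.concurrent.TimeUnit

import org.apache.spark.internal.config.ConfigBuilder

// Collect the scattered settings behind typed ConfigEntry definitions
// instead of string literals and hardcoded fallbacks.
object KafkaStreamingConfig {

  val CONSUMER_CACHE_CAPACITY = ConfigBuilder("spark.streaming.kafka.consumer.cache.maxCapacity")
    .doc("Maximum number of cached Kafka consumers kept per executor.")
    .intConf
    .createWithDefault(64)

  val CONSUMER_POLL_TIMEOUT = ConfigBuilder("spark.streaming.kafka.consumer.poll.ms")
    .doc("Timeout for polling data from Kafka, instead of silently falling " +
      "back to a hardcoded spark.network.timeout.")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefault(512L)
}
{code}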
[jira] [Created] (SPARK-27399) Spark streaming of kafka 0.10 contains some scattered config
jiaan.geng created SPARK-27399: -- Summary: Spark streaming of kafka 0.10 contains some scattered config Key: SPARK-27399 URL: https://issues.apache.org/jira/browse/SPARK-27399 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.0, 2.3.0 Reporter: jiaan.geng I found a lot of scattered configs in Kafka streaming. I think we should arrange these configs in a unified place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27213) Unexpected results when filter is used after distinct
[ https://issues.apache.org/jira/browse/SPARK-27213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rinaz Belhaj updated SPARK-27213: - Shepherd: Holden Karau > Unexpected results when filter is used after distinct > - > > Key: SPARK-27213 > URL: https://issues.apache.org/jira/browse/SPARK-27213 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Rinaz Belhaj >Priority: Major > Labels: distinct, filter > > The following code gives unexpected output due to the filter not getting > pushed down in catalyst optimizer. > {code:java} > df = > spark.createDataFrame([['a','123','12.3','n'],['a','123','12.3','y'],['a','123','12.4','y']],['x','y','z','y_n']) > df.show(5) > df.filter("y_n='y'").select('x','y','z').distinct().show() > df.select('x','y','z').distinct().filter("y_n='y'").show() > {code} > {panel:title=Output} > |x|y|z|y_n| > |a|123|12.3|n| > |a|123|12.3|y| > |a|123|12.4|y| > > |x|y|z| > |a|123|12.3| > |a|123|12.4| > > |x|y|z| > |a|123|12.4| > {panel} > Ideally, the second statement should result in an error since the column used > in the filter is not present in the preceding select statement. But the > catalyst optimizer is using first() on column y_n and then applying the > filter. > Even if the filter was pushed down, the result would have been accurate. > {code:java} > df = > spark.createDataFrame([['a','123','12.3','n'],['a','123','12.3','y'],['a','123','12.4','y']],['x','y','z','y_n']) > df.filter("y_n='y'").select('x','y','z').distinct().explain(True) > df.select('x','y','z').distinct().filter("y_n='y'").explain(True) > {code} > {panel:title=Output} > > == Parsed Logical Plan == > Deduplicate [x#74, y#75, z#76|#74, y#75, z#76] > +- AnalysisBarrier > +- Project [x#74, y#75, z#76|#74, y#75, z#76] > +- Filter (y_n#77 = y) > +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false > > == Analyzed Logical Plan == > x: string, y: string, z: string > Deduplicate [x#74, y#75, z#76|#74, y#75, z#76] > +- Project [x#74, y#75, z#76|#74, y#75, z#76] > +- Filter (y_n#77 = y) > +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false > > == Optimized Logical Plan == > Aggregate [x#74, y#75, z#76|#74, y#75, z#76], [x#74, y#75, z#76|#74, y#75, > z#76] > +- Project [x#74, y#75, z#76|#74, y#75, z#76] > +- Filter (isnotnull(y_n#77) && (y_n#77 = y)) > +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false > > == Physical Plan == > *(2) HashAggregate(keys=[x#74, y#75, z#76|#74, y#75, z#76], functions=[], > output=[x#74, y#75, z#76|#74, y#75, z#76]) > +- Exchange hashpartitioning(x#74, y#75, z#76, 10) > +- *(1) HashAggregate(keys=[x#74, y#75, z#76|#74, y#75, z#76], functions=[], > output=[x#74, y#75, z#76|#74, y#75, z#76]) > +- *(1) Project [x#74, y#75, z#76|#74, y#75, z#76] > +- *(1) Filter (isnotnull(y_n#77) && (y_n#77 = y)) > +- Scan ExistingRDD[x#74,y#75,z#76,y_n#77|#74,y#75,z#76,y_n#77] > > > --- > > > == Parsed Logical Plan == > 'Filter ('y_n = y) > +- AnalysisBarrier > +- Deduplicate [x#74, y#75, z#76|#74, y#75, z#76] > +- Project [x#74, y#75, z#76|#74, y#75, z#76] > +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false > > == Analyzed Logical Plan == > x: string, y: string, z: string > Project [x#74, y#75, z#76|#74, y#75, z#76] > +- Filter (y_n#77 = y) > +- Deduplicate [x#74, y#75, z#76|#74, y#75, z#76] > +- Project [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77] > +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false > > == Optimized Logical 
Plan == > Project [x#74, y#75, z#76|#74, y#75, z#76] > +- Filter (isnotnull(y_n#77) && (y_n#77 = y)) > +- Aggregate [x#74, y#75, z#76|#74, y#75, z#76], [x#74, y#75, z#76, > first(y_n#77, false) AS y_n#77|#74, y#75, z#76, first(y_n#77, false) AS > y_n#77] > +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false > > == Physical Plan == > *(3) Project [x#74, y#75, z#76|#74, y#75, z#76] > +- *(3) Filter (isnotnull(y_n#77) && (y_n#77 = y)) > +- SortAggregate(key=[x#74, y#75, z#76|#74, y#75, z#76], > functions=[first(y_n#77, false)|#77, false)], output=[x#74, y#75, z#76, > y_n#77|#74, y#75, z#76, y_n#77]) > +- *(2) Sort [x#74 ASC NULLS FIRST, y#75 ASC NULLS FIRST, z#76 ASC NULLS > FIRST|#74 ASC NULLS FIRST, y#75 ASC NULLS FIRST, z#76 ASC NULLS FIRST], > false, 0 > +- Exchange hashpartitioning(x#74, y#75, z#76, 10) > +- SortAggregate(key=[x#74, y#75, z#76|#74, y#75, z#76], > functions=[partial_first(y_n#77, false)|#77, false)], output=[x#74, y#75, > z#76, first#95, valueSet#96|#74, y#75, z#76, first#95, v
[jira] [Created] (SPARK-27398) Get rid of sun.nio.cs.StreamDecoder in CreateJacksonParser
Maxim Gekk created SPARK-27398: -- Summary: Get rid of sun.nio.cs.StreamDecoder in CreateJacksonParser Key: SPARK-27398 URL: https://issues.apache.org/jira/browse/SPARK-27398 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk The CreateJacksonParser.getStreamDecoder method creates an instance of ReadableByteChannel and returns the result as a sun.nio.cs.StreamDecoder. This is unnecessary and overcomplicates the method. This code can be replaced by: {code:scala} val bais = new ByteArrayInputStream(in, 0, length) new InputStreamReader(bais, enc) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
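A self-contained form of the simplification quoted above, assuming the surrounding method takes the byte array, length, and charset name as described in the issue; the method name here is illustrative.

{code:scala}
import java.io.{ByteArrayInputStream, InputStreamReader}

// Illustrative stand-in for the simplified getStreamDecoder.
def getStreamReader(in: Array[Byte], length: Int, enc: String): InputStreamReader = {
  val bais = new ByteArrayInputStream(in, 0, length)
  // InputStreamReader decodes with the named charset directly, so no
  // ReadableByteChannel or sun.nio.cs.StreamDecoder is needed.
  new InputStreamReader(bais, enc)
}
{code}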
[jira] [Created] (SPARK-27397) Take care of OpenJ9 in JVM dependent parts
Kazuaki Ishizaki created SPARK-27397: Summary: Take care of OpenJ9 in JVM dependent parts Key: SPARK-27397 URL: https://issues.apache.org/jira/browse/SPARK-27397 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Kazuaki Ishizaki Spark includes JVM-dependent code in multiple places, such as {{SizeEstimator}}. The current Spark takes care of IBM JDK and OpenJDK. Recently, OpenJ9 has been released; however, it is not considered yet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
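A minimal sketch of the kind of vendor check that SizeEstimator-like code could add. The exact java.vm.name / java.vendor values reported by OpenJ9 builds are an assumption to verify on real JDKs.

{code:scala}
// Sketch of a JVM vendor check for size-estimation heuristics.
object JvmVendorInfo {
  private val vmName = System.getProperty("java.vm.name", "")
  private val vendor = System.getProperty("java.vendor", "")

  // Existing Spark code keys off "IBM" in java.vendor; OpenJ9 builds usually
  // report "Eclipse OpenJ9 VM" in java.vm.name instead.
  val isIbmJdk: Boolean = vendor.contains("IBM")
  val isOpenJ9: Boolean = vmName.toLowerCase.contains("openj9")

  // Both share the J9 object layout, so size estimation could treat them
  // alike (assumption to be verified).
  val usesJ9ObjectModel: Boolean = isIbmJdk || isOpenJ9
}
{code}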