[jira] [Commented] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not

2019-04-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811720#comment-16811720
 ] 

Dongjoon Hyun commented on SPARK-27278:
---

The reverting PR should not reuse this JIRA because its purpose is different. 
This JIRA is dedicated to [~mgaido]'s improvement approach and his PR code. I 
prefer Marco's way and believe you do too.

> Optimize GetMapValue when the map is a foldable and the key is not
> --
>
> Key: SPARK-27278
> URL: https://issues.apache.org/jira/browse/SPARK-27278
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0
>Reporter: Huon Wilson
>Priority: Minor
>
> With a map that isn't constant-foldable, Spark will optimise an access into a 
> series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}} expressions, for instance:
> {code:none}
> scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as 
> "x").explain
> == Physical Plan ==
> *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L 
> as int) = 2) THEN id#180L END AS x#182L]
> +- *(1) Range (0, 1000, step=1, splits=12)
> {code}
> This results in an efficient series of ifs and elses in the generated code:
> {code:java}
> /* 037 */   boolean project_isNull_3 = false;
> /* 038 */   int project_value_3 = -1;
> /* 039 */   if (!false) {
> /* 040 */ project_value_3 = (int) project_expr_0_0;
> /* 041 */   }
> /* 042 */
> /* 043 */   boolean project_value_2 = false;
> /* 044 */   project_value_2 = project_value_3 == 1;
> /* 045 */   if (!false && project_value_2) {
> /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0);
> /* 047 */ project_project_value_1_0 = 1L;
> /* 048 */ continue;
> /* 049 */   }
> /* 050 */
> /* 051 */   boolean project_isNull_8 = false;
> /* 052 */   int project_value_8 = -1;
> /* 053 */   if (!false) {
> /* 054 */ project_value_8 = (int) project_expr_0_0;
> /* 055 */   }
> /* 056 */
> /* 057 */   boolean project_value_7 = false;
> /* 058 */   project_value_7 = project_value_8 == 2;
> /* 059 */   if (!false && project_value_7) {
> /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0);
> /* 061 */ project_project_value_1_0 = project_expr_0_0;
> /* 062 */ continue;
> /* 063 */   }
> {code}
> If the map can be constant folded, the constant folding happens first and 
> the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting in 
> a map traversal and more dynamic checks:
> {code:none}
> scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as 
> "x").explain
> == Physical Plan ==
> *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197]
> +- *(1) Range (0, 1000, step=1, splits=12)
> {code}
> The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which 
> is what is stored in the {{Literal}} form of the {{map(...)}} expression in 
> that select. The code generated is less efficient, since it has to do a 
> manual dynamic traversal of the map's array of keys, with type casts etc.:
> {code:java}
> /* 099 */   int project_index_0 = 0;
> /* 100 */   boolean project_found_0 = false;
> /* 101 */   while (project_index_0 < project_length_0 && 
> !project_found_0) {
> /* 102 */ final int project_key_0 = 
> project_keys_0.getInt(project_index_0);
> /* 103 */ if (project_key_0 == project_value_2) {
> /* 104 */   project_found_0 = true;
> /* 105 */ } else {
> /* 106 */   project_index_0++;
> /* 107 */ }
> /* 108 */   }
> /* 109 */
> /* 110 */   if (!project_found_0) {
> /* 111 */ project_isNull_0 = true;
> /* 112 */   } else {
> /* 113 */ project_value_0 = 
> project_values_0.getInt(project_index_0);
> /* 114 */   }
> {code}
> It looks like the problem is in {{SimplifyExtractValueOps}}, which doesn't 
> handle {{GetMapValue(Literal(...), key)}}, only the {{CreateMap}} form:
> {code:scala}
>   case GetMapValue(CreateMap(elems), key) => CaseKeyWhen(key, elems)
> {code}
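
For illustration, a minimal sketch (hypothetical, not [~mgaido]'s actual PR code) of how the rule might be extended to also cover a literal map with a non-foldable key, mirroring the existing {{CreateMap}} case:

{code:scala}
// Hypothetical sketch only (ignores null-key/value-nullability details):
// expand the literal map back into alternating key/value literals and reuse CaseKeyWhen.
case GetMapValue(Literal(map: MapData, MapType(kt, vt, _)), key) if !key.foldable =>
  val keys = map.keyArray().toArray[Any](kt).map(Literal(_, kt))
  val values = map.valueArray().toArray[Any](vt).map(Literal(_, vt))
  CaseKeyWhen(key, keys.zip(values).flatMap { case (k, v) => Seq(k, v) })
{code}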






[jira] [Commented] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not

2019-04-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811717#comment-16811717
 ] 

Dongjoon Hyun commented on SPARK-27278:
---

[~huonw], you can open a PR (reverting the old one) if you are 
uncomfortable. Your PR will go through the same review process, weighing the 
pros and cons. Anything that was missed is also the current behavior.

> Optimize GetMapValue when the map is a foldable and the key is not
> --
>
> Key: SPARK-27278
> URL: https://issues.apache.org/jira/browse/SPARK-27278
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0
>Reporter: Huon Wilson
>Priority: Minor
>
> With a map that isn't constant-foldable, Spark will optimise an access into a 
> series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}} expressions, for instance:
> {code:none}
> scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as 
> "x").explain
> == Physical Plan ==
> *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L 
> as int) = 2) THEN id#180L END AS x#182L]
> +- *(1) Range (0, 1000, step=1, splits=12)
> {code}
> This results in an efficient series of ifs and elses in the generated code:
> {code:java}
> /* 037 */   boolean project_isNull_3 = false;
> /* 038 */   int project_value_3 = -1;
> /* 039 */   if (!false) {
> /* 040 */ project_value_3 = (int) project_expr_0_0;
> /* 041 */   }
> /* 042 */
> /* 043 */   boolean project_value_2 = false;
> /* 044 */   project_value_2 = project_value_3 == 1;
> /* 045 */   if (!false && project_value_2) {
> /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0);
> /* 047 */ project_project_value_1_0 = 1L;
> /* 048 */ continue;
> /* 049 */   }
> /* 050 */
> /* 051 */   boolean project_isNull_8 = false;
> /* 052 */   int project_value_8 = -1;
> /* 053 */   if (!false) {
> /* 054 */ project_value_8 = (int) project_expr_0_0;
> /* 055 */   }
> /* 056 */
> /* 057 */   boolean project_value_7 = false;
> /* 058 */   project_value_7 = project_value_8 == 2;
> /* 059 */   if (!false && project_value_7) {
> /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0);
> /* 061 */ project_project_value_1_0 = project_expr_0_0;
> /* 062 */ continue;
> /* 063 */   }
> {code}
> If the map can be constant folded, the constant folding happens first and 
> the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting in 
> a map traversal and more dynamic checks:
> {code:none}
> scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as 
> "x").explain
> == Physical Plan ==
> *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197]
> +- *(1) Range (0, 1000, step=1, splits=12)
> {code}
> The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which 
> is what is stored in the {{Literal}} form of the {{map(...)}} expression in 
> that select. The code generated is less efficient, since it has to do a 
> manual dynamic traversal of the map's array of keys, with type casts etc.:
> {code:java}
> /* 099 */   int project_index_0 = 0;
> /* 100 */   boolean project_found_0 = false;
> /* 101 */   while (project_index_0 < project_length_0 && 
> !project_found_0) {
> /* 102 */ final int project_key_0 = 
> project_keys_0.getInt(project_index_0);
> /* 103 */ if (project_key_0 == project_value_2) {
> /* 104 */   project_found_0 = true;
> /* 105 */ } else {
> /* 106 */   project_index_0++;
> /* 107 */ }
> /* 108 */   }
> /* 109 */
> /* 110 */   if (!project_found_0) {
> /* 111 */ project_isNull_0 = true;
> /* 112 */   } else {
> /* 113 */ project_value_0 = 
> project_values_0.getInt(project_index_0);
> /* 114 */   }
> {code}
> It looks like the problem is in {{SimplifyExtractValueOps}}, which doesn't 
> handle {{GetMapValue(Literal(...), key)}}, only the {{CreateMap}} form:
> {code:scala}
>   case GetMapValue(CreateMap(elems), key) => CaseKeyWhen(key, elems)
> {code}






[jira] [Commented] (SPARK-27352) Apply for translation of the Chinese version, I hope to get authorization!

2019-04-06 Thread Teng Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811765#comment-16811765
 ] 

Teng Peng commented on SPARK-27352:
---

Correct me if I am wrong, but I do not think any authorization is required for 
translation into other languages. 

> Apply for translation of the Chinese version, I hope to get authorization! 
> ---
>
> Key: SPARK-27352
> URL: https://issues.apache.org/jira/browse/SPARK-27352
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuan Yifan
>Priority: Minor
>
> Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source 
> community in China, focusing on Big Data and AI.
> Recently, we have been making progress on translating Spark documents.
>  - [Source Of Document|https://github.com/apachecn/spark-doc-zh]
>  - [Document Preview|http://spark.apachecn.org/]
> There are several reasons:
>  *1. The English level of many Chinese users is not very good.*
>  *2. Network problems, you know (China's magic network)!*
>  *3. Online blogs are very messy.*
> We are very willing to do some Chinese localization for your project. If 
> possible, please give us some authorization.
> Yifan Yuan from Apache CN
> You may contact me by email at [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] 
> for more details.






[jira] [Assigned] (SPARK-26992) Fix STS scheduler pool correct delivery

2019-04-06 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26992:
-

Assignee: dzcxzl

> Fix STS scheduler pool correct delivery
> ---
>
> Key: SPARK-26992
> URL: https://issues.apache.org/jira/browse/SPARK-26992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.4.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
> Attachments: error_session.png, error_stage.png
>
>
> When a user sets the value of spark.sql.thriftserver.scheduler.pool, the 
> Spark Thrift Server saves it in a thread-local LocalProperty but does not 
> clean it up after the statement runs, causing other sessions to run in the 
> previously set pool.
>  
> For example:
> The second session does not set a pool name explicitly, so the default pool 
> should be used; instead, the pool name from the previous user's settings is 
> applied. This is incorrect.
> !error_session.png!
>  
> !error_stage.png!
>  
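
For context, a minimal sketch (assumed usage of the standard SparkContext.setLocalProperty API; the surrounding names are illustrative, not the actual Thrift Server code) of setting the pool per statement and clearing it afterwards:

{code:scala}
// Illustrative only: `sparkContext` and `sessionPool` are assumed to be in scope.
val sessionPool = "user_defined_pool" // value of spark.sql.thriftserver.scheduler.pool
sparkContext.setLocalProperty("spark.scheduler.pool", sessionPool)
try {
  // run the statement for this session ...
} finally {
  // Clearing the thread-local property prevents it from leaking into
  // statements of other sessions that reuse this thread.
  sparkContext.setLocalProperty("spark.scheduler.pool", null)
}
{code}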






[jira] [Resolved] (SPARK-26992) Fix STS scheduler pool correct delivery

2019-04-06 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26992.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23895
[https://github.com/apache/spark/pull/23895]

> Fix STS scheduler pool correct delivery
> ---
>
> Key: SPARK-26992
> URL: https://issues.apache.org/jira/browse/SPARK-26992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.4.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: error_session.png, error_stage.png
>
>
> When a user sets the value of spark.sql.thriftserver.scheduler.pool, the 
> Spark Thrift Server saves it in a thread-local LocalProperty but does not 
> clean it up after the statement runs, causing other sessions to run in the 
> previously set pool.
>  
> For example:
> The second session does not set a pool name explicitly, so the default pool 
> should be used; instead, the pool name from the previous user's settings is 
> applied. This is incorrect.
> !error_session.png!
>  
> !error_stage.png!
>  






[jira] [Created] (SPARK-27401) Refactoring conversion of Date/Timestamp to/from java.sql.Date/Timestamp

2019-04-06 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-27401:
--

 Summary: Refactoring conversion of Date/Timestamp to/from 
java.sql.Date/Timestamp
 Key: SPARK-27401
 URL: https://issues.apache.org/jira/browse/SPARK-27401
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The fromJavaTimestamp/toJavaTimestamp and toJavaDate/fromJavaDate conversions 
can be implemented using existing DateTimeUtils methods such as 
instantToMicros/microsToInstant and daysToLocalDate/localDateToDays (see the 
sketch after this list). This should allow us:
 # To avoid invoking millisToDays and the time zone offset calculation
 # To simplify the implementation of toJavaTimestamp and properly handle 
negative inputs
 # To detect arithmetic overflow of Long
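
A rough sketch of the proposed direction, using the DateTimeUtils methods named above (an illustration only, not the final implementation):

{code:scala}
import java.sql.{Date, Timestamp}

// Sketch: route the java.sql conversions through Instant/LocalDate based helpers.
def toJavaTimestamp(micros: Long): Timestamp =
  Timestamp.from(DateTimeUtils.microsToInstant(micros))

def fromJavaTimestamp(t: Timestamp): Long =
  DateTimeUtils.instantToMicros(t.toInstant)

def toJavaDate(days: Int): Date =
  Date.valueOf(DateTimeUtils.daysToLocalDate(days))

def fromJavaDate(d: Date): Int =
  DateTimeUtils.localDateToDays(d.toLocalDate)
{code}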






[jira] [Created] (SPARK-27400) LinearSVC only supports binary classification

2019-04-06 Thread baris (JIRA)
baris created SPARK-27400:
-

 Summary: LinearSVC only supports binary classification
 Key: SPARK-27400
 URL: https://issues.apache.org/jira/browse/SPARK-27400
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.4.1
Reporter: baris


IllegalArgumentException: u'requirement failed: LinearSVC only supports binary 
classification. 99 classes detected in LinearSVC_6596220b55a3__labelCol'
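
The error itself is expected behaviour: LinearSVC only supports binary labels. A common workaround (a sketch using the standard Spark ML OneVsRest wrapper; `training` is an assumed DataFrame with "label" and "features" columns) is:

{code:scala}
import org.apache.spark.ml.classification.{LinearSVC, OneVsRest}

// Sketch: reduce the multiclass problem to one binary LinearSVC per class.
val svc = new LinearSVC().setMaxIter(10).setRegParam(0.1)
val ovr = new OneVsRest().setClassifier(svc)
val ovrModel = ovr.fit(training) // `training` is an assumed input DataFrame
{code}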






[jira] [Updated] (SPARK-21805) disable R vignettes code on Windows

2019-04-06 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-21805:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-15799

> disable R vignettes code on Windows
> ---
>
> Key: SPARK-21805
> URL: https://issues.apache.org/jira/browse/SPARK-21805
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
> Fix For: 2.2.1, 2.3.0
>
>







[jira] [Updated] (SPARK-22344) Prevent R CMD check from using /tmp

2019-04-06 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22344:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-15799

> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>Priority: Major
> Fix For: 2.2.1, 2.3.0
>
>
> When R CMD check is run on the SparkR package it leaves behind files in /tmp 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}






[jira] [Updated] (SPARK-24535) Fix java version parsing in SparkR on Windows

2019-04-06 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24535:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-15799

> Fix java version parsing in SparkR on Windows
> -
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
>Priority: Blocker
> Fix For: 2.3.2, 2.4.0
>
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log






[jira] [Updated] (SPARK-25572) SparkR tests failed on CRAN on Java 10

2019-04-06 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-25572:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-15799

> SparkR tests failed on CRAN on Java 10
> --
>
> Key: SPARK-25572
> URL: https://issues.apache.org/jira/browse/SPARK-25572
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
> Fix For: 2.3.3, 2.4.0
>
>
> follow up to SPARK-24255
> From the 2.3.2 release we can see that CRAN doesn't seem to respect the system 
> requirements when running tests - we have seen cases where SparkR is run on 
> Java 10, which Spark unfortunately does not start on. For 2.4.x, let's attempt 
> skipping all tests






[jira] [Updated] (SPARK-26010) SparkR vignette fails on CRAN on Java 11

2019-04-06 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-26010:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-15799

> SparkR vignette fails on CRAN on Java 11
> 
>
> Key: SPARK-26010
> URL: https://issues.apache.org/jira/browse/SPARK-26010
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> follow up to SPARK-25572
> but for vignettes
>  






[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2019-04-06 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811654#comment-16811654
 ] 

Felix Cheung commented on SPARK-15799:
--

More fixes for this (did not open a separate JIRA):

[https://github.com/apache/spark/commit/fa0f791d4d9f083a45ab631a2e9f88a6b749e416#diff-e1e1d3d40573127e9ee0480caf1283d6]

[https://github.com/apache/spark/commit/927081dd959217ed6bf014557db20026d7e22672#diff-e1e1d3d40573127e9ee0480caf1283d6]

 

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>Assignee: Shivaram Venkataraman
>Priority: Major
> Fix For: 2.1.2
>
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?






[jira] [Resolved] (SPARK-26910) Re-release SparkR to CRAN

2019-04-06 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-26910.
--
   Resolution: Fixed
Fix Version/s: 2.4.1

2.4.1 released: [https://cran.r-project.org/web/packages/SparkR/index.html]

> Re-release SparkR to CRAN
> -
>
> Key: SPARK-26910
> URL: https://issues.apache.org/jira/browse/SPARK-26910
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Michael Chirico
>Assignee: Felix Cheung
>Priority: Major
> Fix For: 2.4.1
>
>
> The logical successor to https://issues.apache.org/jira/browse/SPARK-15799
> I don't see anything specifically tracking re-release in the Jira list. It 
> would be helpful to have an issue tracking this to refer to as an outsider, 
> as well as to document what the blockers are in case some outside help could 
> be useful.
>  * Is there a plan to re-release SparkR to CRAN?
>  * What are the major blockers to doing so at the moment?






[jira] [Comment Edited] (SPARK-27176) Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4

2019-04-06 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810652#comment-16810652
 ] 

Yuming Wang edited comment on SPARK-27176 at 4/6/19 2:37 PM:
-

For Hive 2.3.4, we also need {{hive-llap-common}} and {{hive-llap-client}}:

{{hive-llap-common}} is needed for function registration:
{noformat}
scala> spark.range(10).write.saveAsTable("test_hadoop3")
java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/llap/security/LlapSigner$Signable
  at java.lang.Class.getDeclaredConstructors0(Native Method)
  at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
  at java.lang.Class.getConstructor0(Class.java:3075)
  at java.lang.Class.getDeclaredConstructor(Class.java:2178)
  at 
org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79)
  at 
org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208)
  at 
org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201)
  at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.(FunctionRegistry.java:500)
  at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:247)
  at 
org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
  at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:388)
  at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:332)
  at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:312)
  at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:288)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:250)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:272)
...
{noformat}
{{hive-llap-client}} is needed by the Hive tests (StatisticsSuite, SQLQuerySuite and 
HiveOrcSourceSuite):
{noformat}
spark.sharedState.externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog]
  .client.runSqlHive("SELECT COUNT(*) FROM test_hadoop3")

...
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/io/api/LlapProxy
at 
org.apache.hadoop.hive.ql.exec.GlobalWorkMapFactory.get(GlobalWorkMapFactory.java:102)
at 
org.apache.hadoop.hive.ql.exec.Utilities.clearWorkMapForConf(Utilities.java:3435)
at 
org.apache.hadoop.hive.ql.exec.Utilities.clearWork(Utilities.java:290)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:443)
at 
org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:151)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:733)
...
{noformat}
We can exclude {{org.apache.curator:curator-framework:jar}} and 
{{org.apache.curator:apache-curator.jar}}, as they are only used to add consistent 
node replacement to LLAP for splits; see HIVE-14589.


was (Author: q79969786):
For Hive 2.3.4, we also need {{hive-llap-common}} and {{hive-llap-client}}:

{{hive-llap-common}} is used for registry functions:
{noformat}
scala> spark.range(10).write.saveAsTable("test_hadoop3")
java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/llap/security/LlapSigner$Signable
  at java.lang.Class.getDeclaredConstructors0(Native Method)
  at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
  at java.lang.Class.getConstructor0(Class.java:3075)
  at java.lang.Class.getDeclaredConstructor(Class.java:2178)
  at 
org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79)
  at 
org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208)
  at 
org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201)
  at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.(FunctionRegistry.java:500)
  at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:247)
  at 
org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
  at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:388)
  at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:332)
  at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:312)
  at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:288)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:250)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:272)
...
{noformat}
{{hive-llap-client}} is used for test Hive:
{noformat}
spar

[jira] [Updated] (SPARK-27399) Spark streaming of kafka 0.10 contains some scattered config

2019-04-06 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-27399:
---
Description: 
I found a lot of scattered config in the Kafka streaming module.

I think these configs should be arranged in a unified place.

There are also some hardcoded values, such as
{code:java}
spark.network.timeout{code}
which need to change.

 

  was:
I found a lot of scattered config in the Kafka streaming module.

I think these configs should be arranged in a unified place.


> Spark streaming of kafka 0.10 contains some scattered config
> 
>
> Key: SPARK-27399
> URL: https://issues.apache.org/jira/browse/SPARK-27399
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Priority: Minor
>
> I found a lot of scattered config in the Kafka streaming module.
> I think these configs should be arranged in a unified place.
> There are also some hardcoded values, such as
> {code:java}
> spark.network.timeout{code}
> which need to change.
>  
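
For illustration, a minimal sketch (assuming Spark's internal {{ConfigBuilder}} API) of declaring such a value as a named config entry in one central place instead of hardcoding the string:

{code:scala}
import java.util.concurrent.TimeUnit
import org.apache.spark.internal.config.ConfigBuilder

// Sketch only: a centrally declared config entry replacing a hardcoded string.
private[spark] val NETWORK_TIMEOUT = ConfigBuilder("spark.network.timeout")
  .doc("Default timeout for all network interactions.")
  .timeConf(TimeUnit.SECONDS)
  .createWithDefaultString("120s")
{code}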






[jira] [Created] (SPARK-27399) Spark streaming of kafka 0.10 contains some scattered config

2019-04-06 Thread jiaan.geng (JIRA)
jiaan.geng created SPARK-27399:
--

 Summary: Spark streaming of kafka 0.10 contains some scattered 
config
 Key: SPARK-27399
 URL: https://issues.apache.org/jira/browse/SPARK-27399
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.0, 2.3.0
Reporter: jiaan.geng


I found a lot of scattered config in the Kafka streaming module.

I think these configs should be arranged in a unified place.






[jira] [Updated] (SPARK-27213) Unexpected results when filter is used after distinct

2019-04-06 Thread Rinaz Belhaj (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rinaz Belhaj updated SPARK-27213:
-
Shepherd: Holden Karau

> Unexpected results when filter is used after distinct
> -
>
> Key: SPARK-27213
> URL: https://issues.apache.org/jira/browse/SPARK-27213
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Rinaz Belhaj
>Priority: Major
>  Labels: distinct, filter
>
> The following code gives unexpected output due to the filter not getting 
> pushed down by the Catalyst optimizer.
> {code:java}
> df = 
> spark.createDataFrame([['a','123','12.3','n'],['a','123','12.3','y'],['a','123','12.4','y']],['x','y','z','y_n'])
> df.show(5)
> df.filter("y_n='y'").select('x','y','z').distinct().show()
> df.select('x','y','z').distinct().filter("y_n='y'").show()
> {code}
> {panel:title=Output}
> |x|y|z|y_n|
> |a|123|12.3|n|
> |a|123|12.3|y|
> |a|123|12.4|y|
>  
> |x|y|z|
> |a|123|12.3|
> |a|123|12.4|
>  
> |x|y|z|
> |a|123|12.4|
> {panel}
> Ideally, the second statement should result in an error, since the column used 
> in the filter is not present in the preceding select statement. Instead, the 
> Catalyst optimizer takes first() of column y_n and then applies the 
> filter.
> Had the filter been pushed down, the result would have been accurate.
> {code:java}
> df = 
> spark.createDataFrame([['a','123','12.3','n'],['a','123','12.3','y'],['a','123','12.4','y']],['x','y','z','y_n'])
> df.filter("y_n='y'").select('x','y','z').distinct().explain(True)
> df.select('x','y','z').distinct().filter("y_n='y'").explain(True) 
> {code}
> {panel:title=Output}
>  
>  == Parsed Logical Plan ==
>  Deduplicate [x#74, y#75, z#76|#74, y#75, z#76]
>  +- AnalysisBarrier
>  +- Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Filter (y_n#77 = y)
>  +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false
>   
>  == Analyzed Logical Plan ==
>  x: string, y: string, z: string
>  Deduplicate [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Filter (y_n#77 = y)
>  +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false
>   
>  == Optimized Logical Plan ==
>  Aggregate [x#74, y#75, z#76|#74, y#75, z#76], [x#74, y#75, z#76|#74, y#75, 
> z#76]
>  +- Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Filter (isnotnull(y_n#77) && (y_n#77 = y))
>  +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false
>   
>  == Physical Plan ==
>  *(2) HashAggregate(keys=[x#74, y#75, z#76|#74, y#75, z#76], functions=[], 
> output=[x#74, y#75, z#76|#74, y#75, z#76])
>  +- Exchange hashpartitioning(x#74, y#75, z#76, 10)
>  +- *(1) HashAggregate(keys=[x#74, y#75, z#76|#74, y#75, z#76], functions=[], 
> output=[x#74, y#75, z#76|#74, y#75, z#76])
>  +- *(1) Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- *(1) Filter (isnotnull(y_n#77) && (y_n#77 = y))
>  +- Scan ExistingRDD[x#74,y#75,z#76,y_n#77|#74,y#75,z#76,y_n#77]
>   
>  
> ---
>  
>   
>  == Parsed Logical Plan ==
>  'Filter ('y_n = y)
>  +- AnalysisBarrier
>  +- Deduplicate [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false
>   
>  == Analyzed Logical Plan ==
>  x: string, y: string, z: string
>  Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Filter (y_n#77 = y)
>  +- Deduplicate [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Project [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77]
>  +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false
>   
>  == Optimized Logical Plan ==
>  Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Filter (isnotnull(y_n#77) && (y_n#77 = y))
>  +- Aggregate [x#74, y#75, z#76|#74, y#75, z#76], [x#74, y#75, z#76, 
> first(y_n#77, false) AS y_n#77|#74, y#75, z#76, first(y_n#77, false) AS 
> y_n#77]
>  +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false
>   
>  == Physical Plan ==
>  *(3) Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- *(3) Filter (isnotnull(y_n#77) && (y_n#77 = y))
>  +- SortAggregate(key=[x#74, y#75, z#76|#74, y#75, z#76], 
> functions=[first(y_n#77, false)|#77, false)], output=[x#74, y#75, z#76, 
> y_n#77|#74, y#75, z#76, y_n#77])
>  +- *(2) Sort [x#74 ASC NULLS FIRST, y#75 ASC NULLS FIRST, z#76 ASC NULLS 
> FIRST|#74 ASC NULLS FIRST, y#75 ASC NULLS FIRST, z#76 ASC NULLS FIRST], 
> false, 0
>  +- Exchange hashpartitioning(x#74, y#75, z#76, 10)
>  +- SortAggregate(key=[x#74, y#75, z#76|#74, y#75, z#76], 
> functions=[partial_first(y_n#77, false)|#77, false)], output=[x#74, y#75, 
> z#76, first#95, valueSet#96|#74, y#75, z#76, first#95, v

[jira] [Created] (SPARK-27398) Get rid of sun.nio.cs.StreamDecoder in CreateJacksonParser

2019-04-06 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-27398:
--

 Summary: Get rid of sun.nio.cs.StreamDecoder in CreateJacksonParser
 Key: SPARK-27398
 URL: https://issues.apache.org/jira/browse/SPARK-27398
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The CreateJacksonParser.getStreamDecoder method creates an instance of 
ReadableByteChannel and returns the result as a sun.nio.cs.StreamDecoder. This 
is unnecessary and overcomplicates the method. The code can be replaced by:
{code:scala}
val bais = new ByteArrayInputStream(in, 0, length)
new InputStreamReader(bais, enc)
{code}
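
For context, a self-contained sketch of what the simplified helper could look like (the method name and signature here are assumptions for illustration):

{code:scala}
import java.io.{ByteArrayInputStream, InputStreamReader}
import java.nio.charset.Charset

// Hypothetical helper: build a Reader for Jackson straight from the byte array,
// without going through ReadableByteChannel/StreamDecoder.
def getStreamReader(in: Array[Byte], length: Int, enc: String): InputStreamReader = {
  val bais = new ByteArrayInputStream(in, 0, length)
  new InputStreamReader(bais, Charset.forName(enc))
}
{code}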






[jira] [Created] (SPARK-27397) Take care of OpenJ9 in JVM dependant parts

2019-04-06 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-27397:


 Summary: Take care of OpenJ9 in JVM dependant parts
 Key: SPARK-27397
 URL: https://issues.apache.org/jira/browse/SPARK-27397
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Kazuaki Ishizaki


Spark includes JVM-dependent code in several places, such as {{SizeEstimator}}. The 
current code takes care of the IBM JDK and OpenJDK. OpenJ9 has recently been 
released, but it is not considered yet. 
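
For illustration, a minimal sketch (not Spark's actual detection code) of telling the JVM implementations apart via standard system properties:

{code:scala}
// Sketch: OpenJ9 reports its name in java.vm.name (e.g. "Eclipse OpenJ9 VM"),
// while the IBM JDK has traditionally been detected via the vendor property.
val vmName = System.getProperty("java.vm.name", "")
val vendor = System.getProperty("java.vendor", "")
val isOpenJ9 = vmName.contains("OpenJ9")
val isIbmJdk = vendor.contains("IBM")
{code}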


