[jira] [Commented] (SPARK-24703) Unable to multiply calender interval with long/int

2018-07-01 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529456#comment-16529456
 ] 

Takeshi Yamamuro commented on SPARK-24703:
--

yea, I've noticed that the SQL standard supports the syntax: 
http://download.mimer.com/pub/developer/docs/html_100/Mimer_SQL_Engine_DocSet/Syntax_Rules4.html#wp1113535

> Unable to multiply calender interval with long/int
> --
>
> Key: SPARK-24703
> URL: https://issues.apache.org/jira/browse/SPARK-24703
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Priyanka Garg
>Priority: Major
>
> When I try to multiply a calendar interval with a long/int, I am getting the 
> error below. The same syntax is supported in Postgres.
>  spark.sql("select 3 *  interval '1' day").show()
> org.apache.spark.sql.AnalysisException: cannot resolve '(3 * interval 1 
> days)' due to data type mismatch: differing types in '(3 * interval 1 days)' 
> (int and calendarinterval).; line 1 pos 7;
> 'Project [unresolvedalias((3 * interval 1 days), None)]
> +- OneRowRelation
>  
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
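A minimal spark-shell sketch of a possible workaround on 2.3.x, until the multiplication itself is supported: fold the integer factor into the interval literal instead of multiplying an int by a calendar interval (the variable n and the alias delta below are illustrative only).
{code:java}
// Workaround sketch, assuming Spark 2.3.x: build the interval literal with the
// factor already applied, since `int * calendarinterval` is rejected by the analyzer.
val n = 3
spark.sql(s"SELECT interval '$n' day AS delta").show()

// spark.sql("SELECT 3 * interval '1' day")  // fails with the AnalysisException above
{code}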






[jira] [Assigned] (SPARK-24665) Add SQLConf in PySpark to manage all sql configs

2018-07-01 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24665:


Assignee: Li Yuanjian

> Add SQLConf in PySpark to manage all sql configs
> 
>
> Key: SPARK-24665
> URL: https://issues.apache.org/jira/browse/SPARK-24665
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Assignee: Li Yuanjian
>Priority: Major
> Fix For: 2.4.0
>
>
> When new configs are added in PySpark, we currently get them by hard-coding the 
> config name and default value. We should move all the configs into a class, 
> as we did with SQLConf in Spark SQL.






[jira] [Resolved] (SPARK-24665) Add SQLConf in PySpark to manage all sql configs

2018-07-01 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24665.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21648
[https://github.com/apache/spark/pull/21648]

> Add SQLConf in PySpark to manage all sql configs
> 
>
> Key: SPARK-24665
> URL: https://issues.apache.org/jira/browse/SPARK-24665
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Yuanjian
>Assignee: Li Yuanjian
>Priority: Major
> Fix For: 2.4.0
>
>
> When new configs are added in PySpark, we currently get them by hard-coding the 
> config name and default value. We should move all the configs into a class, 
> as we did with SQLConf in Spark SQL.






[jira] [Comment Edited] (SPARK-24705) Spark.sql.adaptive.enabled=true is enabled and self-join query

2018-07-01 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529421#comment-16529421
 ] 

Takeshi Yamamuro edited comment on SPARK-24705 at 7/2/18 6:20 AM:
--

I checked that master has the same issue. Also, it seems this issue only happens 
when using JDBC sources.
{code:java}
// Prepare test data in postgresql
postgres=# create table device_loc(imei int, speed int);
CREATE TABLE
postgres=# insert into device_loc values (1, 1);
INSERT 0 1
postgres=# select * from device_loc;
 imei | speed 
--+---
1 | 1
(1 row)


// Register as a jdbc table
scala> val jdbcTable = spark.read.jdbc("jdbc:postgresql:postgres", 
"device_loc", options)
scala> jdbcTable.registerTempTable("device_loc")
scala> sql("SELECT * FROM device_loc").show
++-+
|imei|speed|
++-+
|   1|1|
++-+

// Prepare a query
scala> :paste
val df = sql("""
select tv_a.imei
  from ( select a.imei,a.speed from device_loc a) tv_a
  inner join ( select a.imei,a.speed from device_loc a ) tv_b on tv_a.imei = 
tv_b.imei
  group by tv_a.imei
""")

// Run tests
scala> sql("SET spark.sql.adaptive.enabled=false")
scala> df.show
++
|imei|
++
|   1|
++

scala> sql("SET spark.sql.adaptive.enabled=true")
scala> df.show
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange(coordinator id: 1401717308) hashpartitioning(imei#0, 200), 
coordinator[target post-shuffle partition size: 67108864]
+- *(1) Scan JDBCRelation(device_loc) [numPartitions=1] [imei#0] PushedFilters: 
[*IsNotNull(imei)], ReadSchema: struct

Caused by: java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:156)
  at 
org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:201)
  at 
org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:259)
  at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:124)
  at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
  ... 100 more
{code}
It seems this issue doesn't happen with the other data sources:
{code:java}
scala> sql("SET spark.sql.adaptive.enabled=true")
scala> spark.range(1).selectExpr("id AS imei", "id AS 
speed").write.saveAsTable("device_loc")
scala> :paste
val df = sql("""
select tv_a.imei
  from ( select a.imei,a.speed from device_loc a) tv_a
  inner join ( select a.imei,a.speed from device_loc a ) tv_b on tv_a.imei = 
tv_b.imei
  group by tv_a.imei
""")
scala> df.show()
++
|imei|
++
|   0|
++
{code}


was (Author: maropu):
I checked the master has the same issue. Also, it seems this issue only happens 
when using jdbc sources.
{code}
// Prepare test data in postgresql
postgres=# create table device_loc(imei int, speed int);
CREATE TABLE
postgres=# insert into device_loc values (1, 1);
INSERT 0 1
postgres=# select * from device_loc;
 imei | speed 
--+---
1 | 1
(1 row)


// Register as a jdbc table
scala> val jdbcTable = spark.read.jdbc("jdbc:postgresql:postgres", 
"device_loc", options)
scala> jdbcTable.registerTempTable("device_loc")
scala> sql("SELECT * FROM device_loc").show
++-+
|imei|speed|
++-+
|   1|1|
++-+

// Prepare a query
scala> :paste
val df = sql("""
select tv_a.imei
  from ( select a.imei,a.speed from device_loc a) tv_a
  inner join ( select a.imei,a.speed from device_loc a ) tv_b on tv_a.imei = 
tv_b.imei
  group by tv_a.imei
""")

// Run tests
scala> sql("SET spark.sql.adaptive.enabled=false")
scala> df.show
++
|imei|
++
|   1|
++

scala> sql("SET spark.sql.adaptive.enabled=true")
scala> df.show
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange(coordinator id: 1401717308) hashpartitioning(imei#0, 200), 
coordinator[target post-shuffle partition size: 67108864]
+- *(1) Scan JDBCRelation(device_loc) [numPartitions=1] [imei#0] PushedFilters: 
[*IsNotNull(imei)], ReadSchema: struct

Caused by: java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:156)
  at 
org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:201)
  at 
org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:259)
  at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:124)
  at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
  ... 100 more
{code}

This issue doesn't ha

[jira] [Commented] (SPARK-24705) Spark.sql.adaptive.enabled=true is enabled and self-join query

2018-07-01 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529421#comment-16529421
 ] 

Takeshi Yamamuro commented on SPARK-24705:
--

I checked that master has the same issue. Also, it seems this issue only happens 
when using JDBC sources.
{code}
// Prepare test data in postgresql
postgres=# create table device_loc(imei int, speed int);
CREATE TABLE
postgres=# insert into device_loc values (1, 1);
INSERT 0 1
postgres=# select * from device_loc;
 imei | speed 
--+---
1 | 1
(1 row)


// Register as a jdbc table
scala> val jdbcTable = spark.read.jdbc("jdbc:postgresql:postgres", 
"device_loc", options)
scala> jdbcTable.registerTempTable("device_loc")
scala> sql("SELECT * FROM device_loc").show
++-+
|imei|speed|
++-+
|   1|1|
++-+

// Prepare a query
scala> :paste
val df = sql("""
select tv_a.imei
  from ( select a.imei,a.speed from device_loc a) tv_a
  inner join ( select a.imei,a.speed from device_loc a ) tv_b on tv_a.imei = 
tv_b.imei
  group by tv_a.imei
""")

// Run tests
scala> sql("SET spark.sql.adaptive.enabled=false")
scala> df.show
++
|imei|
++
|   1|
++

scala> sql("SET spark.sql.adaptive.enabled=true")
scala> df.show
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange(coordinator id: 1401717308) hashpartitioning(imei#0, 200), 
coordinator[target post-shuffle partition size: 67108864]
+- *(1) Scan JDBCRelation(device_loc) [numPartitions=1] [imei#0] PushedFilters: 
[*IsNotNull(imei)], ReadSchema: struct

Caused by: java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:156)
  at 
org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:201)
  at 
org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:259)
  at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:124)
  at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
  ... 100 more
{code}

This issue doesn't happen with other data sources:
{code}
scala> sql("SET spark.sql.adaptive.enabled=true")
scala> spark.range(1).selectExpr("id AS imei", "id AS 
speed").write.saveAsTable("device_loc")
scala> :paste
val df = sql("""
select tv_a.imei
  from ( select a.imei,a.speed from device_loc a) tv_a
  inner join ( select a.imei,a.speed from device_loc a ) tv_b on tv_a.imei = 
tv_b.imei
  group by tv_a.imei
""")
scala> df.show()
++
|imei|
++
|   0|
++
{code}

> Spark.sql.adaptive.enabled=true is enabled and self-join query
> --
>
> Key: SPARK-24705
> URL: https://issues.apache.org/jira/browse/SPARK-24705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: cheng
>Priority: Minor
> Attachments: Error stack.txt
>
>
> [~smilegator]
> When loading data using JDBC with spark.sql.adaptive.enabled=true, for example 
> when loading a table tableA, unexpected results can occur when you use the 
> following query.
> For example:
> device_loc table comes from the jdbc data source
> select tv_a.imei
> from ( select a.imei,a.speed from device_loc a) tv_a
> inner join ( select a.imei,a.speed from device_loc a ) tv_b on tv_a.imei = 
> tv_b.imei
> group by tv_a.imei
> When cache table device_loc is executed before this query, everything is fine (see 
> the sketch below). However, if cache table is not executed, unexpected results 
> occur and the query fails to execute.
> Remarks: the attachment records the stack trace from when the error occurred.
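For what it's worth, a minimal spark-shell sketch of the workaround mentioned above (assuming the df self-join query from the comment is already defined): caching the JDBC-backed table before running the query sidesteps the failing exchange coordination.
{code:java}
// Workaround sketch: materialize the JDBC table in the cache first, then run the
// adaptive-execution query; `df` is the self-join DataFrame defined earlier.
sql("CACHE TABLE device_loc")
df.show()
{code}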






[jira] [Assigned] (SPARK-24706) Support ByteType and ShortType pushdown to parquet

2018-07-01 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-24706:
---

Assignee: Yuming Wang

> Support ByteType and ShortType pushdown to parquet
> --
>
> Key: SPARK-24706
> URL: https://issues.apache.org/jira/browse/SPARK-24706
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>







[jira] [Updated] (SPARK-24706) Support ByteType and ShortType pushdown to parquet

2018-07-01 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24706:

Target Version/s: 2.4.0

> Support ByteType and ShortType pushdown to parquet
> --
>
> Key: SPARK-24706
> URL: https://issues.apache.org/jira/browse/SPARK-24706
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Commented] (SPARK-24714) AnalysisSuite should use ClassTag to check the runtime instance

2018-07-01 Thread Chia-Ping Tsai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529405#comment-16529405
 ] 

Chia-Ping Tsai commented on SPARK-24714:


[~maropu] thank you again. :)

> AnalysisSuite should use ClassTag to check the runtime instance
> ---
>
> Key: SPARK-24714
> URL: https://issues.apache.org/jira/browse/SPARK-24714
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chia-Ping Tsai
>Priority: Minor
>
> {code:java}
> test("SPARK-22614 RepartitionByExpression partitioning") {
> def checkPartitioning[T <: Partitioning](numPartitions: Int, exprs: 
> Expression*): Unit = {
>   val partitioning = RepartitionByExpression(exprs, testRelation2, 
> numPartitions).partitioning
>   assert(partitioning.isInstanceOf[T]) // this is always true because of type 
> erasure
> }{code}
> Spark supports Scala 2.10 and 2.11, so it is OK to introduce ClassTag to 
> correct the type check (see the sketch below).
>  
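For reference, a rough sketch of what the ClassTag-based check could look like (assuming the suite's existing Partitioning, Expression, RepartitionByExpression and testRelation2 are in scope; this is an illustration, the actual PR may differ):
{code:java}
import scala.reflect.{classTag, ClassTag}

// With a ClassTag context bound the runtime class survives erasure, so the
// assertion really checks the partitioning type instead of always passing.
def checkPartitioning[T <: Partitioning : ClassTag](numPartitions: Int, exprs: Expression*): Unit = {
  val partitioning = RepartitionByExpression(exprs, testRelation2, numPartitions).partitioning
  assert(classTag[T].runtimeClass.isInstance(partitioning))
}
{code}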






[jira] [Commented] (SPARK-24708) Document the default spark url of master in standalone is "spark://localhost:7070"

2018-07-01 Thread Chia-Ping Tsai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529404#comment-16529404
 ] 

Chia-Ping Tsai commented on SPARK-24708:


[~maropu] Thanks for the kind reminder.

> Document the default spark url of master in standalone is 
> "spark://localhost:7070"
> --
>
> Key: SPARK-24708
> URL: https://issues.apache.org/jira/browse/SPARK-24708
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.1
>Reporter: Chia-Ping Tsai
>Priority: Trivial
>
> In the section "Starting a Cluster Manually" we give an example of starting a 
> worker.
> {code:java}
> ./sbin/start-slave.sh {code}
> However, we only mention the default "web port" so readers may be misled into 
> using the "web port" to start the worker. (of course, I am a "reader" too :()
> It seems to me that adding a brief description of the master's default spark 
> URL would avoid the above ambiguity.
> for example:
> {code:java}
> - Similarly, you can start one or more workers and connect them to the master 
> via:
> + Similarly, you can start one or more workers and connect them to the 
> master's spark URL (default: spark://:7070) via:{code}






[jira] [Commented] (SPARK-24708) Document the default spark url of master in standalone is "spark://localhost:7070"

2018-07-01 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529394#comment-16529394
 ] 

Takeshi Yamamuro commented on SPARK-24708:
--

Feel free to make a PR and discuss it there. Btw, you don't need to file a 
JIRA because this is a trivial fix.

> Document the default spark url of master in standalone is 
> "spark://localhost:7070"
> --
>
> Key: SPARK-24708
> URL: https://issues.apache.org/jira/browse/SPARK-24708
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.1
>Reporter: Chia-Ping Tsai
>Priority: Trivial
>
> In the section "Starting a Cluster Manually" we give an example of starting a 
> worker.
> {code:java}
> ./sbin/start-slave.sh {code}
> However, we only mention the default "web port" so readers may be misled into 
> using the "web port" to start the worker. (of course, I am a "reader" too :()
> It seems to me that adding a brief description of the master's default spark 
> URL would avoid the above ambiguity.
> for example:
> {code:java}
> - Similarly, you can start one or more workers and connect them to the master 
> via:
> + Similarly, you can start one or more workers and connect them to the 
> master's spark URL (default: spark://:7070) via:{code}






[jira] [Updated] (SPARK-24715) sbt build brings a wrong jline versions

2018-07-01 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24715:
-
Priority: Critical  (was: Major)

> sbt build brings a wrong jline versions
> ---
>
> Key: SPARK-24715
> URL: https://issues.apache.org/jira/browse/SPARK-24715
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> During SPARK-24418 (Upgrade Scala to 2.11.12 and 2.12.6), we upgraded the `jline` 
> version together with Scala, so `mvn` works correctly. However, `sbt` brings in the 
> old jline library and hits `NoSuchMethodError` on the `master` branch. Since we 
> mainly use `mvn`, this is a dev-environment issue.
> {code}
> $ ./build/sbt -Pyarn -Phadoop-2.7 -Phadoop-cloud -Phive -Phive-thriftserver 
> -Psparkr test:package
> $ bin/spark-shell
> scala> Spark context Web UI available at http://localhost:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1530385877441).
> Spark session available as 'spark'.
> Exception in thread "main" java.lang.NoSuchMethodError: 
> jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V
> {code}
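In case it helps while a proper fix lands, a hedged sbt sketch of one way to force the newer jline that the upgraded Scala REPL expects (the coordinates and the 2.14.6 version are assumptions, not taken from this thread; the right version is whatever the bundled scala-compiler depends on):
{code:java}
// Hypothetical override for the sbt build (e.g. a local build tweak);
// dependencyOverrides pins jline so sbt stops resolving the older artifact.
dependencyOverrides += "jline" % "jline" % "2.14.6"
{code}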






[jira] [Updated] (SPARK-24715) sbt build brings a wrong jline versions

2018-07-01 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24715:
--
Description: 
During SPARK-24418 (Upgrade Scala to 2.11.12 and 2.12.6), we upgraded the `jline` 
version together with Scala, so `mvn` works correctly. However, `sbt` brings in the 
old jline library and hits `NoSuchMethodError` on the `master` branch. Since we 
mainly use `mvn`, this is a dev-environment issue.

{code}
$ ./build/sbt -Pyarn -Phadoop-2.7 -Phadoop-cloud -Phive -Phive-thriftserver 
-Psparkr test:package
$ bin/spark-shell
scala> Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1530385877441).
Spark session available as 'spark'.
Exception in thread "main" java.lang.NoSuchMethodError: 
jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V
{code}

  was:
During SPARK-24418 (Upgrade Scala to 2.11.12 and 2.12.6), we upgrade `jline` 
version together. So, `mvn` works correctly. However, `sbt` brings old jline 
library and is hitting `NoSuchMethodError` in `master` branch.

{code}
$ ./build/sbt -Pyarn -Phadoop-2.7 -Phadoop-cloud -Phive -Phive-thriftserver 
-Psparkr test:package
$ bin/spark-shell
scala> Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1530385877441).
Spark session available as 'spark'.
Exception in thread "main" java.lang.NoSuchMethodError: 
jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V
{code}


> sbt build brings a wrong jline versions
> ---
>
> Key: SPARK-24715
> URL: https://issues.apache.org/jira/browse/SPARK-24715
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> During SPARK-24418 (Upgrade Scala to 2.11.12 and 2.12.6), we upgraded the `jline` 
> version together with Scala, so `mvn` works correctly. However, `sbt` brings in the 
> old jline library and hits `NoSuchMethodError` on the `master` branch. Since we 
> mainly use `mvn`, this is a dev-environment issue.
> {code}
> $ ./build/sbt -Pyarn -Phadoop-2.7 -Phadoop-cloud -Phive -Phive-thriftserver 
> -Psparkr test:package
> $ bin/spark-shell
> scala> Spark context Web UI available at http://localhost:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1530385877441).
> Spark session available as 'spark'.
> Exception in thread "main" java.lang.NoSuchMethodError: 
> jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V
> {code}






[jira] [Updated] (SPARK-24715) sbt build brings a wrong jline versions

2018-07-01 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24715:
--
Summary: sbt build brings a wrong jline versions  (was: sbt build bring a 
wrong jline versions)

> sbt build brings a wrong jline versions
> ---
>
> Key: SPARK-24715
> URL: https://issues.apache.org/jira/browse/SPARK-24715
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> During SPARK-24418 (Upgrade Scala to 2.11.12 and 2.12.6), we upgraded the `jline` 
> version together with Scala, so `mvn` works correctly. However, `sbt` brings in the 
> old jline library and hits `NoSuchMethodError` on the `master` branch.
> {code}
> $ ./build/sbt -Pyarn -Phadoop-2.7 -Phadoop-cloud -Phive -Phive-thriftserver 
> -Psparkr test:package
> $ bin/spark-shell
> scala> Spark context Web UI available at http://localhost:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1530385877441).
> Spark session available as 'spark'.
> Exception in thread "main" java.lang.NoSuchMethodError: 
> jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V
> {code}






[jira] [Created] (SPARK-24715) sbt build bring a wrong jline versions

2018-07-01 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-24715:
-

 Summary: sbt build bring a wrong jline versions
 Key: SPARK-24715
 URL: https://issues.apache.org/jira/browse/SPARK-24715
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.0
Reporter: Dongjoon Hyun


During SPARK-24418 (Upgrade Scala to 2.11.12 and 2.12.6), we upgraded the `jline` 
version together with Scala, so `mvn` works correctly. However, `sbt` brings in the 
old jline library and hits `NoSuchMethodError` on the `master` branch.

{code}
$ ./build/sbt -Pyarn -Phadoop-2.7 -Phadoop-cloud -Phive -Phive-thriftserver 
-Psparkr test:package
$ bin/spark-shell
scala> Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1530385877441).
Spark session available as 'spark'.
Exception in thread "main" java.lang.NoSuchMethodError: 
jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V
{code}






[jira] [Commented] (SPARK-24714) AnalysisSuite should use ClassTag to check the runtime instance

2018-07-01 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529373#comment-16529373
 ] 

Takeshi Yamamuro commented on SPARK-24714:
--

You needn't do that; feel free to make a PR for this ticket.
Btw, since this is trivial, I don't think you need to file a JIRA.

> AnalysisSuite should use ClassTag to check the runtime instance
> ---
>
> Key: SPARK-24714
> URL: https://issues.apache.org/jira/browse/SPARK-24714
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chia-Ping Tsai
>Priority: Minor
>
> {code:java}
> test("SPARK-22614 RepartitionByExpression partitioning") {
> def checkPartitioning[T <: Partitioning](numPartitions: Int, exprs: 
> Expression*): Unit = {
>   val partitioning = RepartitionByExpression(exprs, testRelation2, 
> numPartitions).partitioning
>   assert(partitioning.isInstanceOf[T]) // this is always true because of type 
> erasure
> }{code}
> Spark supports Scala 2.10 and 2.11, so it is OK to introduce ClassTag to 
> correct the type check.
>  






[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-07-01 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529371#comment-16529371
 ] 

Hyukjin Kwon commented on SPARK-24530:
--

Yea, I will post an email to the related threads and try to deal with it very soon. 
[~mengxr], do you mind if I set the priority to Critical, since we have a workaround 
to get through this anyway?

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached screenshots 
> from the Spark 2.3 docs and from the master docs generated on my local machine. 
> Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]






[jira] [Assigned] (SPARK-24713) AppMatser of spark streaming kafka OOM if there are hundreds of topics consumed

2018-07-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24713:


Assignee: Apache Spark

> AppMatser of spark streaming kafka OOM if there are hundreds of topics 
> consumed
> ---
>
> Key: SPARK-24713
> URL: https://issues.apache.org/jira/browse/SPARK-24713
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Yuanbo Liu
>Assignee: Apache Spark
>Priority: Major
>
> We have hundreds of Kafka topics that need to be consumed in one application. The 
> application master throws an OOM exception after hanging for nearly half an 
> hour.






[jira] [Assigned] (SPARK-24713) AppMatser of spark streaming kafka OOM if there are hundreds of topics consumed

2018-07-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24713:


Assignee: (was: Apache Spark)

> AppMatser of spark streaming kafka OOM if there are hundreds of topics 
> consumed
> ---
>
> Key: SPARK-24713
> URL: https://issues.apache.org/jira/browse/SPARK-24713
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Yuanbo Liu
>Priority: Major
>
> We have hundreds of Kafka topics that need to be consumed in one application. The 
> application master throws an OOM exception after hanging for nearly half an 
> hour.






[jira] [Commented] (SPARK-24713) AppMatser of spark streaming kafka OOM if there are hundreds of topics consumed

2018-07-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529367#comment-16529367
 ] 

Apache Spark commented on SPARK-24713:
--

User 'yuanboliu' has created a pull request for this issue:
https://github.com/apache/spark/pull/21690

> AppMatser of spark streaming kafka OOM if there are hundreds of topics 
> consumed
> ---
>
> Key: SPARK-24713
> URL: https://issues.apache.org/jira/browse/SPARK-24713
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Yuanbo Liu
>Priority: Major
>
> We have hundreds of Kafka topics that need to be consumed in one application. The 
> application master throws an OOM exception after hanging for nearly half an 
> hour.






[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2018-07-01 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529361#comment-16529361
 ] 

Liang-Chi Hsieh commented on SPARK-24528:
-

I think we can have a sql config to control enabling/disabling this behavior 
too.
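A rough sketch of what such a flag might look like as an entry in org.apache.spark.sql.internal.SQLConf (the config name, doc text and default below are made up for illustration, not an agreed-upon proposal):
{code:java}
// Hypothetical SQLConf entry; buildConf is the existing helper inside the SQLConf object.
val BUCKETED_SCAN_SORTED_OUTPUT_ENABLED =
  buildConf("spark.sql.sources.bucketing.sortedScan.enabled")
    .doc("When true, a scan over a bucketed table with at most one file per bucket " +
      "reports its sort columns, so redundant Sort operators can be removed.")
    .booleanConf
    .createWithDefault(false)
{code}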

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to  SPARK-24410, we're trying to optimize a very common use 
> case we have of getting the most updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we're still 
> "wasting" time on the Sort operator even though the data is already sorted.
> here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> and here's a bad example, but more realistic:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to 
> the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort 
> columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
> files.map(_.getPath.getName).groupBy(file => 
> BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 
> 1){code}
> So obviously the code avoids dealing with this situation for now.
> Could you think of a way to solve this or bypass it?






[jira] [Commented] (SPARK-24714) AnalysisSuite should use ClassTag to check the runtime instance

2018-07-01 Thread Chia-Ping Tsai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529356#comment-16529356
 ] 

Chia-Ping Tsai commented on SPARK-24714:


I have no permission to assign this JIRA to myself; I need help.

> AnalysisSuite should use ClassTag to check the runtime instance
> ---
>
> Key: SPARK-24714
> URL: https://issues.apache.org/jira/browse/SPARK-24714
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chia-Ping Tsai
>Priority: Minor
>
> {code:java}
> test("SPARK-22614 RepartitionByExpression partitioning") {
> def checkPartitioning[T <: Partitioning](numPartitions: Int, exprs: 
> Expression*): Unit = {
>   val partitioning = RepartitionByExpression(exprs, testRelation2, 
> numPartitions).partitioning
>   assert(partitioning.isInstanceOf[T]) // this is always true because of type 
> erasure
> }{code}
> Spark supports Scala 2.10 and 2.11, so it is OK to introduce ClassTag to 
> correct the type check.
>  






[jira] [Created] (SPARK-24714) AnalysisSuite should use ClassTag to check the runtime instance

2018-07-01 Thread Chia-Ping Tsai (JIRA)
Chia-Ping Tsai created SPARK-24714:
--

 Summary: AnalysisSuite should use ClassTag to check the runtime 
instance
 Key: SPARK-24714
 URL: https://issues.apache.org/jira/browse/SPARK-24714
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.3.1
Reporter: Chia-Ping Tsai


{code:java}
test("SPARK-22614 RepartitionByExpression partitioning") {
def checkPartitioning[T <: Partitioning](numPartitions: Int, exprs: 
Expression*): Unit = {
  val partitioning = RepartitionByExpression(exprs, testRelation2, 
numPartitions).partitioning
  assert(partitioning.isInstanceOf[T]) // this is always true because of type erasure
}{code}
Spark supports Scala 2.10 and 2.11, so it is OK to introduce ClassTag to correct the 
type check.

 






[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-07-01 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529352#comment-16529352
 ] 

Saisai Shao commented on SPARK-24530:
-

[~hyukjin.kwon], Spark 2.1.3 and 2.2.2 are under vote; can you please fix the 
issue and leave comments in the related threads?

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached screenshots 
> from the Spark 2.3 docs and from the master docs generated on my local machine. 
> Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]






[jira] [Updated] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-07-01 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24530:

Target Version/s: 2.1.3, 2.2.2, 2.3.2, 2.4.0  (was: 2.4.0)

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached screenshots 
> from the Spark 2.3 docs and from the master docs generated on my local machine. 
> Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]






[jira] [Created] (SPARK-24713) AppMatser of spark streaming kafka OOM if there are hundreds of topics consumed

2018-07-01 Thread Yuanbo Liu (JIRA)
Yuanbo Liu created SPARK-24713:
--

 Summary: AppMatser of spark streaming kafka OOM if there are 
hundreds of topics consumed
 Key: SPARK-24713
 URL: https://issues.apache.org/jira/browse/SPARK-24713
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.3.1
Reporter: Yuanbo Liu


We have hundreds of Kafka topics that need to be consumed in one application. The 
application master throws an OOM exception after hanging for nearly half an hour.






[jira] [Updated] (SPARK-24711) Integration tests will not work with exclude/include tags

2018-07-01 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24711:

Priority: Minor  (was: Major)

> Integration tests will not work with exclude/include tags
> -
>
> Key: SPARK-24711
> URL: https://issues.apache.org/jira/browse/SPARK-24711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Priority: Minor
> Fix For: 2.4.0
>
>
> I tried to exclude some tests when adding mine and I got errors of the form:
> [INFO] BUILD FAILURE
> [INFO] 
> 
>  [INFO] Total time: 6.798 s
>  [INFO] Finished at: 2018-07-01T18:34:13+03:00
>  [INFO] Final Memory: 36M/652M
>  [INFO] 
> 
>  [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.20.1:test (default-test) on 
> project spark-kubernetes-integration-tests_2.11: There are test failures.
>  [ERROR] 
>  [ERROR] Please refer to 
> /home/stavros/Desktop/workspace/OSS/spark/resource-managers/kubernetes/integration-tests/target/surefire-reports
>  for the individual test results.
>  [ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, 
> [date].dumpstream and [date]-jvmRun[N].dumpstream.
>  [ERROR] There was an error in the forked process
>  [ERROR] Unable to load category: noDcos
>  
> This will not happen if the Maven Surefire plugin is disabled, as stated here: 
> [http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin]
> I will create a PR shortly.






[jira] [Commented] (SPARK-21809) Change Stage Page to use datatables to support sorting columns and searching

2018-07-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529206#comment-16529206
 ] 

Apache Spark commented on SPARK-21809:
--

User 'pgandhi999' has created a pull request for this issue:
https://github.com/apache/spark/pull/21688

> Change Stage Page to use datatables to support sorting columns and searching
> 
>
> Key: SPARK-21809
> URL: https://issues.apache.org/jira/browse/SPARK-21809
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Nuochen Lyu
>Priority: Minor
>
> Support column sorting and searching for the Stage page using jQuery DataTables 
> and the REST API. Before this change, the Stage page was generated as hard-coded 
> HTML and could not support search; sorting was also disabled if any application 
> had more than one attempt. Supporting search and sort (over all applications 
> rather than just the 20 entries on the current page) will greatly improve the 
> user experience.






[jira] [Updated] (SPARK-24712) TrainValidationSplit ignores label column name and forces to be "label"

2018-07-01 Thread Pablo J. Villacorta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo J. Villacorta updated SPARK-24712:

Description: 
When a TrainValidationSplit is fit on a Pipeline containing an ML model, the 
labelCol property of the model is ignored, and the call to fit() will fail 
unless the labelCol equals "label". As an example, the following pyspark code 
only works when the variable labelColumn is set to "label":
{code:java}
from pyspark.sql.functions import rand, randn
from pyspark.ml.regression import LinearRegression

labelColumn = "target"  # CHANGE THIS TO "label" AND THE CODE WORKS

df = spark.range(0, 10).select(rand(seed=10).alias("uniform"), 
randn(seed=27).alias(labelColumn))
vectorAssembler = 
VectorAssembler().setInputCols(["uniform"]).setOutputCol("features")
lr = LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn)
mypipeline = Pipeline(stages = [vectorAssembler, lr])

paramGrid = ParamGridBuilder()\
.addGrid(lr.regParam, [0.01, 0.1])\
.build()

trainValidationSplit = TrainValidationSplit()\
.setEstimator(mypipeline)\
.setEvaluator(RegressionEvaluator())\
.setEstimatorParamMaps(paramGrid)\
.setTrainRatio(0.8)

trainValidationSplit.fit(df)  # FAIL UNLESS labelColumn IS SET TO "label"
{code}

  was:
When a TrainValidationSplit is fit on a Pipeline containing a ML model, the 
labelCol property of the model is ignored, and the call to fit() will fail 
unless the labelCol equals "label". As an example, the following pyspark code 
only wors when the variable labelColumn is set to "label"
{code:java}
from pyspark.sql.functions import rand, randn
from pyspark.ml.regression import LinearRegression

labelColumn = "target"  # CHANGE THIS TO "label" AND THE CODE WORKS

df = spark.range(0, 10).select(rand(seed=10).alias("uniform"), 
randn(seed=27).alias(labelColumn))
vectorAssembler = 
VectorAssembler().setInputCols(["uniform"]).setOutputCol("features")
lr = LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn)
mypipeline = Pipeline(stages = [vectorAssembler, lr])

paramGrid = ParamGridBuilder()\
.addGrid(lr.regParam, [0.01, 0.1])\
.build()

trainValidationSplit = TrainValidationSplit()\
.setEstimator(mypipeline)\
.setEvaluator(RegressionEvaluator())\
.setEstimatorParamMaps(paramGrid)\
.setTrainRatio(0.8)

trainValidationSplit.fit(df)  # FAIL UNLESS labelColumn IS SET TO "label"
{code}


> TrainValidationSplit ignores label column name and forces to be "label"
> ---
>
> Key: SPARK-24712
> URL: https://issues.apache.org/jira/browse/SPARK-24712
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Pablo J. Villacorta
>Priority: Major
>
> When a TrainValidationSplit is fit on a Pipeline containing an ML model, the 
> labelCol property of the model is ignored, and the call to fit() will fail 
> unless the labelCol equals "label". As an example, the following pyspark code 
> only works when the variable labelColumn is set to "label":
> {code:java}
> from pyspark.sql.functions import rand, randn
> from pyspark.ml.regression import LinearRegression
> labelColumn = "target"  # CHANGE THIS TO "label" AND THE CODE WORKS
> df = spark.range(0, 10).select(rand(seed=10).alias("uniform"), 
> randn(seed=27).alias(labelColumn))
> vectorAssembler = 
> VectorAssembler().setInputCols(["uniform"]).setOutputCol("features")
> lr = LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn)
> mypipeline = Pipeline(stages = [vectorAssembler, lr])
> paramGrid = ParamGridBuilder()\
> .addGrid(lr.regParam, [0.01, 0.1])\
> .build()
> trainValidationSplit = TrainValidationSplit()\
> .setEstimator(mypipeline)\
> .setEvaluator(RegressionEvaluator())\
> .setEstimatorParamMaps(paramGrid)\
> .setTrainRatio(0.8)
> trainValidationSplit.fit(df)  # FAIL UNLESS labelColumn IS SET TO "label"
> {code}






[jira] [Created] (SPARK-24712) TrainValidationSplit ignores label column name and forces to be "label"

2018-07-01 Thread Pablo J. Villacorta (JIRA)
Pablo J. Villacorta created SPARK-24712:
---

 Summary: TrainValidationSplit ignores label column name and forces 
to be "label"
 Key: SPARK-24712
 URL: https://issues.apache.org/jira/browse/SPARK-24712
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: Pablo J. Villacorta


When a TrainValidationSplit is fit on a Pipeline containing an ML model, the 
labelCol property of the model is ignored, and the call to fit() will fail 
unless the labelCol equals "label". As an example, the following pyspark code 
only works when the variable labelColumn is set to "label":
{code:java}
from pyspark.sql.functions import rand, randn
from pyspark.ml.regression import LinearRegression

labelColumn = "target"  # CHANGE THIS TO "label" AND THE CODE WORKS

df = spark.range(0, 10).select(rand(seed=10).alias("uniform"), 
randn(seed=27).alias(labelColumn))
vectorAssembler = 
VectorAssembler().setInputCols(["uniform"]).setOutputCol("features")
lr = LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn)
mypipeline = Pipeline(stages = [vectorAssembler, lr])

paramGrid = ParamGridBuilder()\
.addGrid(lr.regParam, [0.01, 0.1])\
.build()

trainValidationSplit = TrainValidationSplit()\
.setEstimator(mypipeline)\
.setEvaluator(RegressionEvaluator())\
.setEstimatorParamMaps(paramGrid)\
.setTrainRatio(0.8)

trainValidationSplit.fit(df)  # FAIL UNLESS labelColumn IS SET TO "label"
{code}
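One likely explanation (stated here as an assumption, not a confirmed diagnosis): TrainValidationSplit evaluates candidate models with the supplied evaluator, and RegressionEvaluator's own labelCol defaults to "label", so the label column has to be set on the evaluator as well as on the model. A minimal spark-shell (Scala) sketch of the same pipeline with that one extra setter:
{code:java}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.sql.functions.{rand, randn}

val labelColumn = "target"
val df = spark.range(0, 10).select(rand(10).alias("uniform"), randn(27).alias(labelColumn))

val vectorAssembler = new VectorAssembler().setInputCols(Array("uniform")).setOutputCol("features")
val lr = new LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn)
val pipeline = new Pipeline().setStages(Array(vectorAssembler, lr))

val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build()

// Setting labelCol on the evaluator too is the difference from the snippet above.
val evaluator = new RegressionEvaluator().setLabelCol(labelColumn)

val trainValidationSplit = new TrainValidationSplit()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)

trainValidationSplit.fit(df)
{code}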






[jira] [Updated] (SPARK-24711) Integration tests will not work with exclude/include tags

2018-07-01 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24711:

Description: 
I tried to exclude some tests when adding mine and I got errors of the form:

[INFO] BUILD FAILURE

[INFO] 
 [INFO] Total time: 6.798 s
 [INFO] Finished at: 2018-07-01T18:34:13+03:00
 [INFO] Final Memory: 36M/652M
 [INFO] 
 [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:2.20.1:test (default-test) on 
project spark-kubernetes-integration-tests_2.11: There are test failures.
 [ERROR] 
 [ERROR] Please refer to 
/home/stavros/Desktop/workspace/OSS/spark/resource-managers/kubernetes/integration-tests/target/surefire-reports
 for the individual test results.
 [ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, 
[date].dumpstream and [date]-jvmRun[N].dumpstream.
 [ERROR] There was an error in the forked process
 [ERROR] Unable to load category: noDcos

 

This will not happen if the Maven Surefire plugin is disabled, as stated here: 
[http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin]

I will create a PR shortly.

  was:
I tried to exclude some tests when adding mine and I got errors of the form:

[INFO] BUILD FAILURE

[INFO] 
 [INFO] Total time: 6.798 s
 [INFO] Finished at: 2018-07-01T18:34:13+03:00
 [INFO] Final Memory: 36M/652M
 [INFO] 
 [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:2.20.1:test (default-test) on 
project spark-kubernetes-integration-tests_2.11: There are test failures.
 [ERROR] 
 [ERROR] Please refer to 
/home/stavros/Desktop/workspace/OSS/spark/resource-managers/kubernetes/integration-tests/target/surefire-reports
 for the individual test results.
 [ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, 
[date].dumpstream and [date]-jvmRun[N].dumpstream.
 [ERROR] There was an error in the forked process
 [ERROR] Unable to load category: noDcos

 

This will not happen if maven surfire plugin is disabled as stated here: 
[http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin]


> Integration tests will not work with exclude/include tags
> -
>
> Key: SPARK-24711
> URL: https://issues.apache.org/jira/browse/SPARK-24711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Priority: Major
> Fix For: 2.4.0
>
>
> I tried to exclude some tests when adding mine and I got errors of the form:
> [INFO] BUILD FAILURE
> [INFO] 
> 
>  [INFO] Total time: 6.798 s
>  [INFO] Finished at: 2018-07-01T18:34:13+03:00
>  [INFO] Final Memory: 36M/652M
>  [INFO] 
> 
>  [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.20.1:test (default-test) on 
> project spark-kubernetes-integration-tests_2.11: There are test failures.
>  [ERROR] 
>  [ERROR] Please refer to 
> /home/stavros/Desktop/workspace/OSS/spark/resource-managers/kubernetes/integration-tests/target/surefire-reports
>  for the individual test results.
>  [ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, 
> [date].dumpstream and [date]-jvmRun[N].dumpstream.
>  [ERROR] There was an error in the forked process
>  [ERROR] Unable to load category: noDcos
>  
> This will not happen if the Maven Surefire plugin is disabled, as stated here: 
> [http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin]
> I will create a PR shortly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24711) Integration tests will not work with exclude/include tags

2018-07-01 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24711:

Description: 
I tried to exclude some tests when adding mine and I got errors of the form:

[INFO] BUILD FAILURE

[INFO] 
 [INFO] Total time: 6.798 s
 [INFO] Finished at: 2018-07-01T18:34:13+03:00
 [INFO] Final Memory: 36M/652M
 [INFO] 
 [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:2.20.1:test (default-test) on 
project spark-kubernetes-integration-tests_2.11: There are test failures.
 [ERROR] 
 [ERROR] Please refer to 
/home/stavros/Desktop/workspace/OSS/spark/resource-managers/kubernetes/integration-tests/target/surefire-reports
 for the individual test results.
 [ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, 
[date].dumpstream and [date]-jvmRun[N].dumpstream.
 [ERROR] There was an error in the forked process
 [ERROR] Unable to load category: noDcos

 

This will not happen if the Maven Surefire plugin is disabled, as stated here: 
[http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin]

  was:
I tried to exclude some tests when adding mine and I got errors of the form:

[INFO] BUILD FAILURE

[INFO] 
[INFO] Total time: 6.798 s
[INFO] Finished at: 2018-07-01T18:34:13+03:00
[INFO] Final Memory: 36M/652M
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:2.20.1:test (default-test) on 
project spark-kubernetes-integration-tests_2.11: There are test failures.
[ERROR] 
[ERROR] Please refer to 
/home/stavros/Desktop/workspace/OSS/spark/resource-managers/kubernetes/integration-tests/target/surefire-reports
 for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, 
[date].dumpstream and [date]-jvmRun[N].dumpstream.
[ERROR] There was an error in the forked process
[ERROR] Unable to load category: noDcos

 

This will not happen if the Maven Surefire plugin is disabled, as stated here:

http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin


> Integration tests will not work with exclude/include tags
> -
>
> Key: SPARK-24711
> URL: https://issues.apache.org/jira/browse/SPARK-24711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Priority: Major
> Fix For: 2.4.0
>
>
> I tried to exclude some tests when adding mine and I got errors of the form:
> [INFO] BUILD FAILURE
> [INFO] 
> 
>  [INFO] Total time: 6.798 s
>  [INFO] Finished at: 2018-07-01T18:34:13+03:00
>  [INFO] Final Memory: 36M/652M
>  [INFO] 
> 
>  [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.20.1:test (default-test) on 
> project spark-kubernetes-integration-tests_2.11: There are test failures.
>  [ERROR] 
>  [ERROR] Please refer to 
> /home/stavros/Desktop/workspace/OSS/spark/resource-managers/kubernetes/integration-tests/target/surefire-reports
>  for the individual test results.
>  [ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, 
> [date].dumpstream and [date]-jvmRun[N].dumpstream.
>  [ERROR] There was an error in the forked process
>  [ERROR] Unable to load category: noDcos
>  
> This will not happen if the Maven Surefire plugin is disabled, as stated here: 
> [http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24711) Integration tests will not work with exclude/include tags

2018-07-01 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-24711:
---

 Summary: Integration tests will not work with exclude/include tags
 Key: SPARK-24711
 URL: https://issues.apache.org/jira/browse/SPARK-24711
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.1
Reporter: Stavros Kontopoulos
 Fix For: 2.4.0


I tried to exclude some tests when adding mine and I got errors of the form:

[INFO] BUILD FAILURE

[INFO] 
[INFO] Total time: 6.798 s
[INFO] Finished at: 2018-07-01T18:34:13+03:00
[INFO] Final Memory: 36M/652M
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:2.20.1:test (default-test) on 
project spark-kubernetes-integration-tests_2.11: There are test failures.
[ERROR] 
[ERROR] Please refer to 
/home/stavros/Desktop/workspace/OSS/spark/resource-managers/kubernetes/integration-tests/target/surefire-reports
 for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, 
[date].dumpstream and [date]-jvmRun[N].dumpstream.
[ERROR] There was an error in the forked process
[ERROR] Unable to load category: noDcos

 

This will not happen if the Maven Surefire plugin is disabled, as stated here:

http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin
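
For background, the include/exclude switches involved here operate on test tags; a 
minimal, purely illustrative ScalaTest sketch of how a test carries such a tag (the 
tag name noDcos is taken from the error above, the suite itself is hypothetical):
{code:java}
// Illustrative ScalaTest sketch: define a tag and attach it to a test so that a
// runner can include or exclude tests carrying it. Not actual Spark test code.
import org.scalatest.{FunSuite, Tag}

object NoDcos extends Tag("noDcos")

class ExampleIntegrationSuite extends FunSuite {
  test("only runs when the noDcos tag is not excluded", NoDcos) {
    assert(1 + 1 == 2)
  }
}
{code}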



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24165) UDF within when().otherwise() raises NullPointerException

2018-07-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529108#comment-16529108
 ] 

Apache Spark commented on SPARK-24165:
--

User 'mn-mikke' has created a pull request for this issue:
https://github.com/apache/spark/pull/21687

> UDF within when().otherwise() raises NullPointerException
> -
>
> Key: SPARK-24165
> URL: https://issues.apache.org/jira/browse/SPARK-24165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jingxuan Wang
>Priority: Major
>
> I have a UDF which takes java.sql.Timestamp and String as input column types 
> and returns an Array of (Seq[case class], Double) as output. Since some of the 
> values in the input columns can be null, I put the UDF inside a 
> when($input.isNull, null).otherwise(UDF) expression. Such a function works well 
> when I test it in the spark shell. But when run as a Scala jar via spark-submit 
> in yarn cluster mode, it raises a NullPointerException that points to the UDF 
> function. If I remove the when().otherwise() condition and instead put the null 
> check inside the UDF, the function works without issue in spark-submit.
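
A minimal, hypothetical sketch of the two patterns being compared (column names, 
types and the UDF body are made up, not taken from the reporter's job):
{code:java}
// Illustrative only: the guard-outside vs. guard-inside patterns described above.
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, lit, udf, when}

// UDF with no internal null check; it relies on when/otherwise to shield it.
val score = udf((ts: Timestamp, s: String) => ts.getTime % 10 + s.length)

// Pattern 1 (reported to fail in yarn cluster mode): null guard outside the UDF.
val guarded = when(col("ts").isNull || col("s").isNull, lit(null))
  .otherwise(score(col("ts"), col("s")))

// Pattern 2 (reported to work): null check inside the UDF itself.
val safeScore = udf((ts: Timestamp, s: String) =>
  if (ts == null || s == null) None else Some(ts.getTime % 10 + s.length))
{code}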



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24165) UDF within when().otherwise() raises NullPointerException

2018-07-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24165:


Assignee: Apache Spark

> UDF within when().otherwise() raises NullPointerException
> -
>
> Key: SPARK-24165
> URL: https://issues.apache.org/jira/browse/SPARK-24165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jingxuan Wang
>Assignee: Apache Spark
>Priority: Major
>
> I have a UDF which takes java.sql.Timestamp and String as input column types 
> and returns an Array of (Seq[case class], Double) as output. Since some of the 
> values in the input columns can be null, I put the UDF inside a 
> when($input.isNull, null).otherwise(UDF) expression. Such a function works well 
> when I test it in the spark shell. But when run as a Scala jar via spark-submit 
> in yarn cluster mode, it raises a NullPointerException that points to the UDF 
> function. If I remove the when().otherwise() condition and instead put the null 
> check inside the UDF, the function works without issue in spark-submit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24165) UDF within when().otherwise() raises NullPointerException

2018-07-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24165:


Assignee: (was: Apache Spark)

> UDF within when().otherwise() raises NullPointerException
> -
>
> Key: SPARK-24165
> URL: https://issues.apache.org/jira/browse/SPARK-24165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jingxuan Wang
>Priority: Major
>
> I have a UDF which takes java.sql.Timestamp and String as input column types 
> and returns an Array of (Seq[case class], Double) as output. Since some of the 
> values in the input columns can be null, I put the UDF inside a 
> when($input.isNull, null).otherwise(UDF) expression. Such a function works well 
> when I test it in the spark shell. But when run as a Scala jar via spark-submit 
> in yarn cluster mode, it raises a NullPointerException that points to the UDF 
> function. If I remove the when().otherwise() condition and instead put the null 
> check inside the UDF, the function works without issue in spark-submit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24710) Information Gain Ratio for decision trees

2018-07-01 Thread Pablo J. Villacorta (JIRA)
Pablo J. Villacorta created SPARK-24710:
---

 Summary: Information Gain Ratio for decision trees
 Key: SPARK-24710
 URL: https://issues.apache.org/jira/browse/SPARK-24710
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.3.1
Reporter: Pablo J. Villacorta
 Fix For: 2.3.1


Spark currently uses Information Gain (IG) to decide the next feature to branch 
on when building a decision tree. In the case of categorical features, IG is known 
to be biased towards features with a large number of categories. [Information 
Gain Ratio|https://en.wikipedia.org/wiki/Information_gain_ratio] solves this 
problem by dividing the IG by a quantity that characterizes the intrinsic 
information of the feature.

As far as I know, Spark has IG but not IGR. It would be nice to be able to 
choose IGR instead of IG.
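
For reference, the standard definition from the linked article divides the 
information gain by the split (intrinsic) information of the feature:
{code}
\mathrm{IGR}(T, a) = \frac{\mathrm{IG}(T, a)}{\mathrm{IV}(T, a)},
\qquad
\mathrm{IV}(T, a) = -\sum_{i} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}
{code}
where T is the training set, a is the candidate feature, and the T_i are the 
subsets of T induced by the distinct values of a.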



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24642) Add a function which infers schema from a JSON column

2018-07-01 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529030#comment-16529030
 ] 

Maxim Gekk edited comment on SPARK-24642 at 7/1/18 10:05 AM:
-

[~rxin] I created a new ticket, SPARK-24709, which aims to add a simpler function. 
Here is the PR for that ticket: https://github.com/apache/spark/pull/21686


was (Author: maxgekk):
I created a new ticket, SPARK-24709, which aims to add a simpler function.

> Add a function which infers schema from a JSON column
> -
>
> Key: SPARK-24642
> URL: https://issues.apache.org/jira/browse/SPARK-24642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add a new aggregate function - *infer_schema()*. The function should 
> infer a schema for a set of JSON strings. The result of the function is a schema 
> in DDL format (or JSON format).
> One of the use cases is passing the output of *infer_schema()* to *from_json()*. 
> Currently, the from_json() function requires a schema as a mandatory 
> argument. It is possible to infer the schema programmatically in Scala/Python and 
> pass it as the second argument, but in SQL it is not possible. A user has to 
> pass the schema as a string literal in SQL. The new function should allow it to 
> be used in SQL as in the example:
> {code:sql}
> select from_json(json_col, infer_schema(json_col))
> from json_table;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24709) Inferring schema from JSON string literal

2018-07-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24709:


Assignee: Apache Spark

> Inferring schema from JSON string literal
> -
>
> Key: SPARK-24709
> URL: https://issues.apache.org/jira/browse/SPARK-24709
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Need to add a new function - *schema_of_json()*. The function should infer 
> the schema of a JSON string literal. The result of the function is a schema in 
> DDL format.
> One of the use cases is passing the output of _schema_of_json()_ to 
> *from_json()*. Currently, the _from_json()_ function requires a schema as a 
> mandatory argument. A user has to pass a schema as a string literal in SQL. 
> The new function should allow inferring the schema from an example. Let's say 
> json_col is a column containing JSON strings that share the same schema. It 
> should be possible to pass a JSON string with that schema to _schema_of_json()_, 
> which infers the schema from that particular example.
> {code:sql}
> select from_json(json_col, schema_of_json('{"f1": 0, "f2": [0], "f2": "a"}'))
> from json_table;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24709) Inferring schema from JSON string literal

2018-07-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24709:


Assignee: (was: Apache Spark)

> Inferring schema from JSON string literal
> -
>
> Key: SPARK-24709
> URL: https://issues.apache.org/jira/browse/SPARK-24709
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add a new function - *schema_of_json()*. The function should infer 
> the schema of a JSON string literal. The result of the function is a schema in 
> DDL format.
> One of the use cases is passing the output of _schema_of_json()_ to 
> *from_json()*. Currently, the _from_json()_ function requires a schema as a 
> mandatory argument. A user has to pass a schema as a string literal in SQL. 
> The new function should allow inferring the schema from an example. Let's say 
> json_col is a column containing JSON strings that share the same schema. It 
> should be possible to pass a JSON string with that schema to _schema_of_json()_, 
> which infers the schema from that particular example.
> {code:sql}
> select from_json(json_col, schema_of_json('{"f1": 0, "f2": [0], "f2": "a"}'))
> from json_table;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24709) Inferring schema from JSON string literal

2018-07-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529038#comment-16529038
 ] 

Apache Spark commented on SPARK-24709:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21686

> Inferring schema from JSON string literal
> -
>
> Key: SPARK-24709
> URL: https://issues.apache.org/jira/browse/SPARK-24709
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add a new function - *schema_of_json()*. The function should infer 
> the schema of a JSON string literal. The result of the function is a schema in 
> DDL format.
> One of the use cases is passing the output of _schema_of_json()_ to 
> *from_json()*. Currently, the _from_json()_ function requires a schema as a 
> mandatory argument. A user has to pass a schema as a string literal in SQL. 
> The new function should allow inferring the schema from an example. Let's say 
> json_col is a column containing JSON strings that share the same schema. It 
> should be possible to pass a JSON string with that schema to _schema_of_json()_, 
> which infers the schema from that particular example.
> {code:sql}
> select from_json(json_col, schema_of_json('{"f1": 0, "f2": [0], "f2": "a"}'))
> from json_table;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24642) Add a function which infers schema from a JSON column

2018-07-01 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529030#comment-16529030
 ] 

Maxim Gekk commented on SPARK-24642:


I created a new ticket, SPARK-24709, which aims to add a simpler function.

> Add a function which infers schema from a JSON column
> -
>
> Key: SPARK-24642
> URL: https://issues.apache.org/jira/browse/SPARK-24642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add a new aggregate function - *infer_schema()*. The function should 
> infer a schema for a set of JSON strings. The result of the function is a schema 
> in DDL format (or JSON format).
> One of the use cases is passing the output of *infer_schema()* to *from_json()*. 
> Currently, the from_json() function requires a schema as a mandatory 
> argument. It is possible to infer the schema programmatically in Scala/Python and 
> pass it as the second argument, but in SQL it is not possible. A user has to 
> pass the schema as a string literal in SQL. The new function should allow it to 
> be used in SQL as in the example:
> {code:sql}
> select from_json(json_col, infer_schema(json_col))
> from json_table;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24642) Add a function which infers schema from a JSON column

2018-07-01 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-24642.

Resolution: Won't Fix

> Add a function which infers schema from a JSON column
> -
>
> Key: SPARK-24642
> URL: https://issues.apache.org/jira/browse/SPARK-24642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add a new aggregate function - *infer_schema()*. The function should 
> infer a schema for a set of JSON strings. The result of the function is a schema 
> in DDL format (or JSON format).
> One of the use cases is passing the output of *infer_schema()* to *from_json()*. 
> Currently, the from_json() function requires a schema as a mandatory 
> argument. It is possible to infer the schema programmatically in Scala/Python and 
> pass it as the second argument, but in SQL it is not possible. A user has to 
> pass the schema as a string literal in SQL. The new function should allow it to 
> be used in SQL as in the example:
> {code:sql}
> select from_json(json_col, infer_schema(json_col))
> from json_table;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24709) Inferring schema from JSON string literal

2018-07-01 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24709:
--

 Summary: Inferring schema from JSON string literal
 Key: SPARK-24709
 URL: https://issues.apache.org/jira/browse/SPARK-24709
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Need to add a new function - *schema_of_json()*. The function should infer the 
schema of a JSON string literal. The result of the function is a schema in DDL format.

One of the use cases is passing the output of _schema_of_json()_ to *from_json()*. 
Currently, the _from_json()_ function requires a schema as a mandatory 
argument. A user has to pass a schema as a string literal in SQL. The new 
function should allow inferring the schema from an example. Let's say json_col is a 
column containing JSON strings that share the same schema. It should be possible to 
pass a JSON string with that schema to _schema_of_json()_, which infers the 
schema from that particular example.

{code:sql}
select from_json(json_col, schema_of_json('{"f1": 0, "f2": [0], "f2": "a"}'))
from json_table;
{code}
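
For comparison, the programmatic route that already exists outside of SQL looks 
roughly like this; a minimal Scala sketch assuming a DataFrame df with a JSON 
string column json_col (names are illustrative):
{code:java}
// Sketch of the current workaround: infer the schema on the driver with the JSON
// reader, then feed it to from_json(). `df` and `json_col` are illustrative names.
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val inferredSchema = spark.read.json(df.select("json_col").as[String]).schema
val parsed = df.select(from_json($"json_col", inferredSchema).alias("parsed"))
{code}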



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24621) WebUI - application 'name' urls point to http instead of https (even when ssl enabled)

2018-07-01 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528572#comment-16528572
 ] 

t oo edited comment on SPARK-24621 at 7/1/18 9:36 AM:
--

[https://github.com/apache/spark/pull/21514/commits] 

 

[core/src/main/scala/org/apache/spark/deploy/master/Master.scala|https://github.com/apache/spark#diff-29dffdccd5a7f4c8b496c293e87c8668]

 

val SSL_ENABLED = conf.getBoolean("spark.ssl.enabled", false)
var uriScheme = "http://"
if (SSL_ENABLED) {
  uriScheme = "https://"
}
masterWebUiUrl = uriScheme + masterPublicAddress + ":" + webUi.boundPort
// masterWebUiUrl = "http://" + masterPublicAddress + ":" + webUi.boundPort

 

 


was (Author: toopt4):
[https://github.com/apache/spark/pull/21514/commits] 

 

[core/src/main/scala/org/apache/spark/deploy/master/Master.scala|https://github.com/apache/spark#diff-29dffdccd5a7f4c8b496c293e87c8668]

 

val SSL_ENABLED = conf.getBoolean("spark.ssl.enabled", false)
val uriScheme = "http://"
if (SSL_ENABLED) {
  uriScheme = "https://"
}
masterWebUiUrl = uriScheme + masterPublicAddress + ":" + webUi.boundPort
// masterWebUiUrl = "http://" + masterPublicAddress + ":" + webUi.boundPort

 

 

> WebUI - application 'name' urls point to http instead of https (even when ssl 
> enabled)
> --
>
> Key: SPARK-24621
> URL: https://issues.apache.org/jira/browse/SPARK-24621
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: t oo
>Priority: Major
> Attachments: spark_master-one-app.png
>
>
> See attached
> ApplicationID correctly points to DNS url
> but Name points to IP address
> Update: I found setting SPARK_PUBLIC_DNS to DNS hostname will make Name point 
> to DNS. BUT it will use http instead of https!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24708) Document the default spark url of master in standalone is "spark://localhost:7077"

2018-07-01 Thread Chia-Ping Tsai (JIRA)
Chia-Ping Tsai created SPARK-24708:
--

 Summary: Document the default spark url of master in standalone is 
"spark://localhost:7077"
 Key: SPARK-24708
 URL: https://issues.apache.org/jira/browse/SPARK-24708
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.3.1
Reporter: Chia-Ping Tsai


In the section "Starting a Cluster Manually" we give a example of starting a 
worker.
{code:java}
./sbin/start-slave.sh {code}
However, we only mention the default "web port", so readers may be misled into 
using the "web port" to start the worker. (Of course, I am a "reader" too :()

It seems to me that adding a brief description of the master's default spark URL 
would avoid the above ambiguity.

for example:
{code:java}
- Similarly, you can start one or more workers and connect them to the master 
via:
+ Similarly, you can start one or more workers and connect them to the master's 
spark URL (default: spark://:7077) via:{code}
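
For example, with the default master port the worker is started against the 
spark:// URL rather than the web UI port:
{code:java}
# assumes a master already running on this host with the default port
./sbin/start-slave.sh spark://localhost:7077
{code}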



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2018-07-01 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529013#comment-16529013
 ] 

Yuming Wang commented on SPARK-20427:
-

[~ORichard] Please try using {{customSchema}} to specify the custom data 
types of the read schema: 
https://github.com/apache/spark/blob/v2.3.1/examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala#L197
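
A minimal sketch of that suggestion (connection details and column names below 
are placeholders, not taken from the report):
{code:java}
// Illustrative only: customSchema overrides the DECIMAL precision/scale that Spark
// would otherwise derive from the Oracle NUMBER metadata.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")   // placeholder URL
  .option("dbtable", "MY_SCHEMA.MY_TABLE")                    // placeholder table
  .option("user", "scott")
  .option("password", "tiger")
  .option("customSchema", "ID DECIMAL(38, 10), NAME STRING")  // explicit read schema
  .load()
{code}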




> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> Oracle has a data type NUMBER. When defining a field of type NUMBER in a 
> table, the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a field, Spark can create numbers with precision exceeding 
> 38. In our case it created fields with precision 44,
> calculated as the sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was that a data frame read from a table in one schema could not 
> be inserted into the identical table in another schema.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2018-07-01 Thread Oliver Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529005#comment-16529005
 ] 

Oliver Richardson commented on SPARK-20427:
---

I'm still getting the same problem even in the newest version.

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> Oracle has a data type NUMBER. When defining a field of type NUMBER in a 
> table, the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a field, Spark can create numbers with precision exceeding 
> 38. In our case it created fields with precision 44,
> calculated as the sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was that a data frame read from a table in one schema could not 
> be inserted into the identical table in another schema.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org