[jira] [Commented] (SPARK-29584) NOT NULL is not supported in Spark
[ https://issues.apache.org/jira/browse/SPARK-29584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958540#comment-16958540 ] pavithra ramachandran commented on SPARK-29584: --- I shall work on this > NOT NULL is not supported in Spark > -- > > Key: SPARK-29584 > URL: https://issues.apache.org/jira/browse/SPARK-29584 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Creating a table with a column restricted to non-NULL values is not supported in Spark, > as shown below. > PostgreSQL: SUCCESS, no exception > CREATE TABLE Persons (ID int *NOT NULL*, LastName varchar(255) *NOT > NULL*, FirstName varchar(255) NOT NULL, Age int); > insert into Persons values(1,'GUPTA','Abhi',NULL); > select * from persons; > > Spark: ParseException > jdbc:hive2://10.18.19.208:23040/default> CREATE TABLE Persons (ID int NOT > NULL, LastName varchar(255) NOT NULL, FirstName varchar(255) NOT NULL, Age > int); > Error: org.apache.spark.sql.catalyst.parser.ParseException: > no viable alternative at input 'CREATE TABLE Persons (ID int NOT' (line 1, pos > 29) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29566) Imputer should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958539#comment-16958539 ] Huaxin Gao commented on SPARK-29566: I will work on this. Thanks! [~podongfeng] > Imputer should support single-column input/output > > > Key: SPARK-29566 > URL: https://issues.apache.org/jira/browse/SPARK-29566 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > Imputer should support single-column input/output > refer to https://issues.apache.org/jira/browse/SPARK-29565
[jira] [Commented] (SPARK-29565) OneHotEncoder should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958537#comment-16958537 ] Huaxin Gao commented on SPARK-29565: I will work on this. Thanks for pinging me [~podongfeng] > OneHotEncoder should support single-column input/output > -- > > Key: SPARK-29565 > URL: https://issues.apache.org/jira/browse/SPARK-29565 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > The current feature algorithms > ({color:#5a6e5a}QuantileDiscretizer/Binarizer/Bucketizer/StringIndexer{color}) > are designed to support both single-column & multi-column usage, > and there are already internal utils (like > {color:#c7a65d}checkSingleVsMultiColumnParams{color}) for this. > For OneHotEncoder, it is reasonable to support single-column usage as well.
[jira] [Created] (SPARK-29584) NOT NULL is not supported in Spark
ABHISHEK KUMAR GUPTA created SPARK-29584: Summary: NOT NULL is not supported in Spark Key: SPARK-29584 URL: https://issues.apache.org/jira/browse/SPARK-29584 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA Creating a table with a column restricted to non-NULL values is not supported in Spark, as shown below. PostgreSQL: SUCCESS, no exception CREATE TABLE Persons (ID int *NOT NULL*, LastName varchar(255) *NOT NULL*, FirstName varchar(255) NOT NULL, Age int); insert into Persons values(1,'GUPTA','Abhi',NULL); select * from persons; Spark: ParseException jdbc:hive2://10.18.19.208:23040/default> CREATE TABLE Persons (ID int NOT NULL, LastName varchar(255) NOT NULL, FirstName varchar(255) NOT NULL, Age int); Error: org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input 'CREATE TABLE Persons (ID int NOT' (line 1, pos 29)
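For a point of comparison outside Spark, the NOT NULL semantics the report expects can be demonstrated with Python's built-in sqlite3 module, which, like PostgreSQL, accepts the constraint at CREATE TABLE time and rejects NULL inserts into constrained columns. This is an illustration of standard SQL behavior, not Spark code.

```python
import sqlite3

# In-memory database; sqlite3 ships with the Python standard library.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same shape as the table in the report: three NOT NULL columns, one nullable.
cur.execute(
    "CREATE TABLE Persons ("
    "ID INTEGER NOT NULL, "
    "LastName VARCHAR(255) NOT NULL, "
    "FirstName VARCHAR(255) NOT NULL, "
    "Age INTEGER)"
)

# NULL in the nullable Age column succeeds, as in the PostgreSQL example.
cur.execute("INSERT INTO Persons VALUES (1, 'GUPTA', 'Abhi', NULL)")

# NULL in a NOT NULL column is rejected with an IntegrityError.
try:
    cur.execute("INSERT INTO Persons VALUES (NULL, 'X', 'Y', 2)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # the constrained insert was rejected
print(cur.execute("SELECT COUNT(*) FROM Persons").fetchone()[0])
```

Spark, by contrast, fails at parse time: its DDL grammar at the affected version does not accept the NOT NULL column constraint at all.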
[jira] [Updated] (SPARK-16483) Unifying struct fields and columns
[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-16483: - Labels: sql (was: bulk-closed sql) > Unifying struct fields and columns > -- > > Key: SPARK-16483 > URL: https://issues.apache.org/jira/browse/SPARK-16483 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Simeon Simeonov >Priority: Major > Labels: sql > > This issue comes as a result of an exchange with Michael Armbrust outside of > the usual JIRA/dev list channels. > DataFrame provides a full set of manipulation operations for top-level > columns. They can be added, removed, modified and renamed. The same is not > yet true for fields inside structs; from a logical standpoint, Spark users > may very well want to perform the same operations on struct fields, > especially since automatic schema discovery from JSON input tends to create > deeply nested structs. > Common use-cases include: > - Remove and/or rename struct field(s) to adjust the schema > - Fix a data quality issue with a struct field (update/rewrite) > To do this with the existing API by hand requires manually calling > {{named_struct}} and listing all fields, including ones we don't want to > manipulate. This leads to complex, fragile code that cannot survive schema > evolution. > It would be far better if the various APIs that can now manipulate top-level > columns were extended to handle struct fields at arbitrary locations or, > alternatively, if we introduced new APIs for modifying any field in a > dataframe, whether it is a top-level one or one nested inside a struct. > Purely for discussion purposes (overloaded methods are not shown): > {code:java} > class Column(val expr: Expression) extends Logging { > // ... 
> // matches Dataset.schema semantics > def schema: StructType > // matches Dataset.select() semantics > // '* support allows multiple new fields to be added easily, saving > cumbersome repeated withColumn() calls > def select(cols: Column*): Column > // matches Dataset.withColumn() semantics of add or replace > def withColumn(colName: String, col: Column): Column > // matches Dataset.drop() semantics > def drop(colName: String): Column > } > class Dataset[T] ... { > // ... > // Equivalent to sparkSession.createDataset(toDF.rdd, newSchema) > def cast(newSchema: StructType): DataFrame > } > {code} > The benefit of the above API is that it unifies manipulating top-level & > nested columns. The addition of {{schema}} and {{select()}} to {{Column}} > allows for nested field reordering, casting, etc., which is important in data > exchange scenarios where field position matters. That's also the reason to > add {{cast}} to {{Dataset}}: it improves consistency and readability (with > method chaining). Another way to think of {{Dataset.cast}} is as the Spark > schema equivalent of {{Dataset.as}}. {{as}} is to {{cast}} as a Scala > encodable type is to a {{StructType}} instance.
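The pain point the description names, rebuilding an entire struct just to change one field, can be illustrated outside Spark with plain Python dicts standing in for rows with a nested struct. Both helper functions here are hypothetical, for illustration only; the second one mimics the targeted, name-based manipulation the proposed `Column.drop()`-style API would give.

```python
# A row with a nested struct, as schema inference from JSON often produces.
row = {"id": 1, "address": {"street": "Main St", "zip": "94105", "obsolete": "x"}}

# Hand-rolled equivalent of rebuilding a struct with named_struct: every field
# must be listed explicitly, including those we don't want to touch, so the
# code breaks as soon as the schema gains a field.
def drop_nested_field_manual(r):
    a = r["address"]
    return {"id": r["id"],
            "address": {"street": a["street"], "zip": a["zip"]}}

# A name-based drop targets one field and leaves the rest of the schema alone,
# so it survives schema evolution.
def drop_nested_field(r, col, field):
    out = dict(r)
    out[col] = {k: v for k, v in r[col].items() if k != field}
    return out

print(drop_nested_field_manual(row))
print(drop_nested_field(row, "address", "obsolete"))
```

Both produce the same result on this row, but only the second keeps working if `address` later grows a `country` field.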
[jira] [Commented] (SPARK-29583) extract support interval type
[ https://issues.apache.org/jira/browse/SPARK-29583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958479#comment-16958479 ] Yuming Wang commented on SPARK-29583: - cc [~maxgekk] > extract support interval type > - > > Key: SPARK-29583 > URL: https://issues.apache.org/jira/browse/SPARK-29583 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > postgres=# select extract(minute from INTERVAL '1 YEAR 10 DAYS 50 MINUTES'); > date_part > --- > 50 > (1 row) > postgres=# select extract(minute from cast('2019-07-01 17:12:33.068' as > timestamp) - cast('2019-07-01 15:57:07.912' as timestamp)); > date_part > --- > 15 > (1 row) > {code}
[jira] [Updated] (SPARK-29583) extract support interval type
[ https://issues.apache.org/jira/browse/SPARK-29583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29583: Description: {code:sql} postgres=# select extract(minute from INTERVAL '1 YEAR 10 DAYS 50 MINUTES'); date_part --- 50 (1 row) postgres=# select extract(minute from cast('2019-07-01 17:12:33.068' as timestamp) - cast('2019-07-01 15:57:07.912' as timestamp)); date_part --- 15 (1 row) {code} was: {code:sql} postgres=# select extract(minute from cast('2019-07-01 17:12:33.068' as timestamp) - cast('2019-07-01 15:57:07.912' as timestamp)); date_part --- 15 (1 row) {code} > extract support interval type > - > > Key: SPARK-29583 > URL: https://issues.apache.org/jira/browse/SPARK-29583 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > postgres=# select extract(minute from INTERVAL '1 YEAR 10 DAYS 50 MINUTES'); > date_part > --- > 50 > (1 row) > postgres=# select extract(minute from cast('2019-07-01 17:12:33.068' as > timestamp) - cast('2019-07-01 15:57:07.912' as timestamp)); > date_part > --- > 15 > (1 row) > {code}
[jira] [Created] (SPARK-29583) extract support interval type
Yuming Wang created SPARK-29583: --- Summary: extract support interval type Key: SPARK-29583 URL: https://issues.apache.org/jira/browse/SPARK-29583 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang {code:sql} postgres=# select extract(minute from cast('2019-07-01 17:12:33.068' as timestamp) - cast('2019-07-01 15:57:07.912' as timestamp)); date_part --- 15 (1 row) {code}
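For reference, the PostgreSQL semantics shown above, where extract(minute ...) returns the minute component of the interval (0-59) rather than the total number of minutes, can be sketched with Python's datetime arithmetic. This is an illustration of the expected behavior, not Spark code.

```python
from datetime import datetime, timedelta

# Interval obtained by subtracting two timestamps, as in the second example.
t1 = datetime(2019, 7, 1, 17, 12, 33, 68000)
t2 = datetime(2019, 7, 1, 15, 57, 7, 912000)
interval = t1 - t2  # timedelta of 1:15:25.156000

def extract_minute(td: timedelta) -> int:
    """Minute field of the interval (0-59), matching PostgreSQL's
    extract(minute from interval); not the interval's total minutes."""
    return (td.seconds // 60) % 60

print(extract_minute(interval))  # 15, matching the date_part output above
```

Note that hours (and days, in `td.days`) are deliberately not folded into the result: an interval of 2 hours 50 minutes yields 50, not 170.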
[jira] [Created] (SPARK-29582) Unify the behavior of pyspark.TaskContext with spark core
Xianyang Liu created SPARK-29582: Summary: Unify the behavior of pyspark.TaskContext with spark core Key: SPARK-29582 URL: https://issues.apache.org/jira/browse/SPARK-29582 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Reporter: Xianyang Liu In Spark core, the `TaskContext` object is a singleton. We set a task context instance, which can be a TaskContext or a BarrierTaskContext, before the task function starts, and unset it to None after the function ends, so both TaskContext and BarrierTaskContext can be retrieved through that object. In PySpark, however, we can only get the BarrierTaskContext via `BarrierTaskContext`; `TaskContext.get` returns `None` in a barrier stage. This patch unifies the behavior of TaskContext in PySpark with Spark core, which is useful when people switch from normal code to barrier code and should only need a small update.
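The set-before/unset-after singleton mechanism described here can be sketched in plain Python: a base class owns one shared slot, so a `get()` on either the base class or the barrier subclass sees whatever instance the runtime installed. The class and method names mirror the description, but this is a hypothetical illustration, not the PySpark implementation.

```python
class TaskContext:
    _current = None  # singleton slot shared by base class and subclass

    @classmethod
    def get(cls):
        # Always read the shared slot, regardless of which class is asked.
        return TaskContext._current

    @classmethod
    def _set(cls, ctx):
        TaskContext._current = ctx

    @classmethod
    def _unset(cls):
        TaskContext._current = None

class BarrierTaskContext(TaskContext):
    pass

# The runtime installs the proper context before the task function runs...
TaskContext._set(BarrierTaskContext())

# ...so both accessors return the same object, as in Spark core:
print(isinstance(TaskContext.get(), BarrierTaskContext))   # True
print(TaskContext.get() is BarrierTaskContext.get())       # True

# ...and unsets it after the task function ends.
TaskContext._unset()
print(TaskContext.get())  # None
```

The behavior reported as the bug corresponds to the subclass keeping its own slot instead of sharing the base class's, so `TaskContext.get()` misses a context installed as a `BarrierTaskContext`.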
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958471#comment-16958471 ] Zhaoyang Qin commented on SPARK-15348: -- [~asomani] [~georg.kf.hei...@gmail.com] Thank you very much for your advice. But I focused on this because I wanted to use Hive managed tables from Spark. My concern is that Spark SQL reads Hive's internal tables much faster than external tables, especially with large data. I compared the two and found that using an internal table was five times faster than using an external table on 1 TB of TPC-DS data. So I'm more interested in Spark's solution for reading managed tables. I'll keep going. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of Hive's transactional tables; > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > Also it seems that compaction is not supported - alter table ... partition > COMPACT 'major'
[jira] [Resolved] (SPARK-29576) Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus
[ https://issues.apache.org/jira/browse/SPARK-29576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29576. --- Resolution: Fixed Issue resolved by pull request 26235 [https://github.com/apache/spark/pull/26235] > Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus > - > > Key: SPARK-29576 > URL: https://issues.apache.org/jira/browse/SPARK-29576 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > > Instead of using the ZStd codec directly, we use Spark's CompressionCodec, which > wraps the ZStd codec in a buffered stream to avoid the excessive overhead of > JNI calls when compressing small amounts of data. > Also, by using Spark's CompressionCodec, we can easily make it > configurable in the future if needed.
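The rationale above, that feeding many small payloads through one buffered compression stream beats invoking the codec once per payload, can be illustrated with gzip from the Python standard library (standing in for the Zstd JNI codec; the choice of gzip and the payload shape are assumptions for illustration):

```python
import gzip
import io

chunks = [f"status-{i}".encode() for i in range(1000)]  # many small payloads

# One codec call per chunk: every tiny output pays the full per-call
# header and setup overhead, so the "compressed" total is bloated.
per_call = sum(len(gzip.compress(c)) for c in chunks)

# One stream wrapping the codec: overhead is paid once and the compressor
# sees enough data across chunk boundaries to actually compress.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as stream:
    for c in chunks:
        stream.write(c)
streamed = len(buf.getvalue())

print(streamed < per_call)  # True: the single wrapped stream is far smaller
```

The same argument applies to call overhead, not just output size: the buffered stream collapses a thousand codec invocations (JNI crossings, in the Spark case) into a handful.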
[jira] [Assigned] (SPARK-29576) Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus
[ https://issues.apache.org/jira/browse/SPARK-29576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29576: - Assignee: DB Tsai > Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus > - > > Key: SPARK-29576 > URL: https://issues.apache.org/jira/browse/SPARK-29576 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > > Instead of using the ZStd codec directly, we use Spark's CompressionCodec, which > wraps the ZStd codec in a buffered stream to avoid the excessive overhead of > JNI calls when compressing small amounts of data. > Also, by using Spark's CompressionCodec, we can easily make it > configurable in the future if needed.
[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.
[ https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958412#comment-16958412 ] Jungtaek Lim commented on SPARK-28594: -- Please note that SPARK-29579 and SPARK-29581 could be moved out of SPARK-28594, since the reason for splitting these issues out of the existing one is that we couldn't find a good way to do it. Things can change if we get some brilliant idea before finishing SPARK-28870, but if not, I'd rather set SPARK-28870 as the finish line for this and move both issues out. > Allow event logs for running streaming apps to be rolled over. > -- > > Key: SPARK-28594 > URL: https://issues.apache.org/jira/browse/SPARK-28594 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 > Environment: This has been reported on 2.0.2.22 but affects all > currently available versions. >Reporter: Stephen Levett >Priority: Major > > In all current Spark releases, when event logging is enabled for Spark Streaming, > the event logs grow massively. The files continue to grow until the > application is stopped or killed. > The Spark history server then has difficulty processing the files. > https://issues.apache.org/jira/browse/SPARK-8617 > addresses .inprogress files but not event log files of applications that are still running. > Identify a mechanism to set a "max file" size so that the file is rolled over > when it reaches this size.
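The "max file size with rollover" mechanism this issue asks for is the same pattern Python's standard library implements in logging.handlers.RotatingFileHandler: once the current file reaches maxBytes, it is renamed to a numbered backup and a fresh file is started. As an illustration of the requested behavior (not Spark code; the file names and sizes are arbitrary):

```python
import logging
import logging.handlers
import os
import tempfile

log_dir = tempfile.mkdtemp()
path = os.path.join(log_dir, "events.log")

# Roll the file over whenever it reaches ~1 KB, keeping up to 3 old segments.
handler = logging.handlers.RotatingFileHandler(path, maxBytes=1024, backupCount=3)
logger = logging.getLogger("eventlog-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A long-running writer (like a streaming app's event log) no longer grows
# a single file without bound; old segments can be deleted or compacted.
for i in range(200):
    logger.info("event %05d payload", i)

files = sorted(os.listdir(log_dir))
print(files)  # events.log plus rolled-over segments events.log.1, .2, ...
```

For the history server the analogous win is that each rolled segment is a bounded, finished file it can process incrementally, instead of one ever-growing .inprogress file.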
[jira] [Commented] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958410#comment-16958410 ] Dongjoon Hyun commented on SPARK-29569: --- Oh.. Too bad. Got it. Thank you for the update, [~jiangxb1987]. > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 3.0.0 > > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Running `jekyll build` under `./spark/docs` fails with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on the master branch; the command works on branch-2.4
[jira] [Assigned] (SPARK-29567) Update JDBC Integration Test Docker Images
[ https://issues.apache.org/jira/browse/SPARK-29567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29567: - Assignee: Dongjoon Hyun > Update JDBC Integration Test Docker Images > -- > > Key: SPARK-29567 > URL: https://issues.apache.org/jira/browse/SPARK-29567 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor >
[jira] [Resolved] (SPARK-29567) Update JDBC Integration Test Docker Images
[ https://issues.apache.org/jira/browse/SPARK-29567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29567. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26224 [https://github.com/apache/spark/pull/26224] > Update JDBC Integration Test Docker Images > -- > > Key: SPARK-29567 > URL: https://issues.apache.org/jira/browse/SPARK-29567 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > >
[jira] [Created] (SPARK-29581) Enable cleanup old event log files
Jungtaek Lim created SPARK-29581: Summary: Enable cleanup old event log files Key: SPARK-29581 URL: https://issues.apache.org/jira/browse/SPARK-29581 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Jungtaek Lim This issue can be started only once SPARK-29579 is addressed properly. After SPARK-29579, Spark would guarantee strong compatibility for both live entities and snapshots, which means a snapshot file could replace the older original event log files. This issue tracks the effort to automatically clean up old event logs once a snapshot file can replace them, which keeps the overall size of event logs for a streaming query manageable.
[jira] [Updated] (SPARK-29580) KafkaDelegationTokenSuite fails to create new KafkaAdminClient
[ https://issues.apache.org/jira/browse/SPARK-29580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29580: -- Issue Type: Bug (was: Improvement) > KafkaDelegationTokenSuite fails to create new KafkaAdminClient > -- > > Key: SPARK-29580 > URL: https://issues.apache.org/jira/browse/SPARK-29580 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112562/testReport/org.apache.spark.sql.kafka010/KafkaDelegationTokenSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > {code} > sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: Failed to > create new KafkaAdminClient > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:407) > at > org.apache.kafka.clients.admin.AdminClient.create(AdminClient.java:55) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:227) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:249) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: > javax.security.auth.login.LoginException: Server not found in Kerberos > database (7) - Server not found in Kerberos database > at > org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:160) > at > org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:146) > at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:67) > at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:99) > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:382) > ... 16 more > Caused by: sbt.ForkMain$ForkError: javax.security.auth.login.LoginException: > Server not found in Kerberos database (7) - Server not found in Kerberos > database > at > com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) > at > com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) > at > javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) > at java.security.AccessController.doPrivileged(Native Method) > at > javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) > at javax.security.auth.login.LoginContext.login(LoginContext.java:587) > at > 
org.apache.kafka.common.security.authenticator.AbstractLogin.login(AbstractLogin.java:60) > at > org.apache.kafka.common.security.kerberos.KerberosLogin.login(KerberosLogin.java:103) > at > org.apache.kafka.common.security.authenticator.LoginManager.(LoginManager.java:61) > at > org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:104) > at > org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:149) > ... 20 more > Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException:
[jira] [Commented] (SPARK-29580) KafkaDelegationTokenSuite fails to create new KafkaAdminClient
[ https://issues.apache.org/jira/browse/SPARK-29580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958374#comment-16958374 ] Dongjoon Hyun commented on SPARK-29580: --- Hi, [~gsomogyi]. Could you take a look at this failure? > KafkaDelegationTokenSuite fails to create new KafkaAdminClient > -- > > Key: SPARK-29580 > URL: https://issues.apache.org/jira/browse/SPARK-29580 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112562/testReport/org.apache.spark.sql.kafka010/KafkaDelegationTokenSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > {code} > sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: Failed to > create new KafkaAdminClient > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:407) > at > org.apache.kafka.clients.admin.AdminClient.create(AdminClient.java:55) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:227) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:249) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: > javax.security.auth.login.LoginException: Server not found in Kerberos > database (7) - Server not found in Kerberos database > at > org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:160) > at > org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:146) > at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:67) > at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:99) > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:382) > ... 16 more > Caused by: sbt.ForkMain$ForkError: javax.security.auth.login.LoginException: > Server not found in Kerberos database (7) - Server not found in Kerberos > database > at > com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) > at > com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) > at > javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) > at java.security.AccessController.doPrivileged(Native Method) > at > javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) > at 
javax.security.auth.login.LoginContext.login(LoginContext.java:587) > at > org.apache.kafka.common.security.authenticator.AbstractLogin.login(AbstractLogin.java:60) > at > org.apache.kafka.common.security.kerberos.KerberosLogin.login(KerberosLogin.java:103) > at > org.apache.kafka.common.security.authenticator.LoginManager.(LoginManager.java:61) > at > org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:104) > at > org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:149) > ... 20 more > Caused by:
[jira] [Commented] (SPARK-29580) KafkaDelegationTokenSuite fails to create new KafkaAdminClient
[ https://issues.apache.org/jira/browse/SPARK-29580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958372#comment-16958372 ] Dongjoon Hyun commented on SPARK-29580: --- This is a different failure from SPARK-29027. > KafkaDelegationTokenSuite fails to create new KafkaAdminClient > -- > > Key: SPARK-29580 > URL: https://issues.apache.org/jira/browse/SPARK-29580 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112562/testReport/org.apache.spark.sql.kafka010/KafkaDelegationTokenSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > {code} > sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: Failed to > create new KafkaAdminClient > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:407) > at > org.apache.kafka.clients.admin.AdminClient.create(AdminClient.java:55) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:227) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:249) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: > javax.security.auth.login.LoginException: Server not found in Kerberos > database (7) - Server not found in Kerberos database > at > org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:160) > at > org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:146) > at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:67) > at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:99) > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:382) > ... 16 more > Caused by: sbt.ForkMain$ForkError: javax.security.auth.login.LoginException: > Server not found in Kerberos database (7) - Server not found in Kerberos > database > at > com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) > at > com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) > at > javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) > at java.security.AccessController.doPrivileged(Native Method) > at > javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) > at 
javax.security.auth.login.LoginContext.login(LoginContext.java:587) > at > org.apache.kafka.common.security.authenticator.AbstractLogin.login(AbstractLogin.java:60) > at > org.apache.kafka.common.security.kerberos.KerberosLogin.login(KerberosLogin.java:103) > at > org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:61) > at > org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:104) > at > org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:149) > ... 20 more > Caused by:
[jira] [Created] (SPARK-29580) KafkaDelegationTokenSuite fails to create new KafkaAdminClient
Dongjoon Hyun created SPARK-29580: - Summary: KafkaDelegationTokenSuite fails to create new KafkaAdminClient Key: SPARK-29580 URL: https://issues.apache.org/jira/browse/SPARK-29580 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.0.0 Reporter: Dongjoon Hyun - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112562/testReport/org.apache.spark.sql.kafka010/KafkaDelegationTokenSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ {code} sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: Failed to create new KafkaAdminClient at org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:407) at org.apache.kafka.clients.admin.AdminClient.create(AdminClient.java:55) at org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:227) at org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:249) at org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) at sbt.ForkMain$Run$2.call(ForkMain.java:296) at sbt.ForkMain$Run$2.call(ForkMain.java:286) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: Server not found in Kerberos 
database (7) - Server not found in Kerberos database at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:160) at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:146) at org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:67) at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:99) at org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:382) ... 16 more Caused by: sbt.ForkMain$ForkError: javax.security.auth.login.LoginException: Server not found in Kerberos database (7) - Server not found in Kerberos database at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) at javax.security.auth.login.LoginContext.login(LoginContext.java:587) at org.apache.kafka.common.security.authenticator.AbstractLogin.login(AbstractLogin.java:60) at org.apache.kafka.common.security.kerberos.KerberosLogin.login(KerberosLogin.java:103) at org.apache.kafka.common.security.authenticator.LoginManager.(LoginManager.java:61) at 
org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:104) at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:149) ... 20 more Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:82) at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316) at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361) at
[jira] [Created] (SPARK-29579) Guarantee compatibility of snapshot (live entities, KVstore entities)
Jungtaek Lim created SPARK-29579: Summary: Guarantee compatibility of snapshot (live entities, KVstore entities) Key: SPARK-29579 URL: https://issues.apache.org/jira/browse/SPARK-29579 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Jungtaek Lim This is a follow-up to SPARK-29111 and SPARK-29261, neither of which guarantees compatibility. To safely clean up old event log files after a snapshot has been written for them, we have to ensure the snapshot file can restore the same state as replaying those event log files would. The issue arises when migrating to a newer Spark version - if the snapshot is not readable due to incompatibility, the app cannot be read at all, since the old event log files have already been removed. If we can guarantee compatibility, we can move on to the next item: cleaning up old event log files to save space. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
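The failure mode described above (a snapshot that turns out to be unreadable after the old event logs are already gone) suggests gating restore on an explicit format version stamp. A minimal sketch of that idea follows; all names are hypothetical, not Spark's actual API:

```python
# Hypothetical sketch: gate snapshot restore on a version stamp written
# alongside the snapshot, so an incompatible snapshot is detected up
# front instead of silently producing bad state. Names are illustrative.
SUPPORTED_SNAPSHOT_VERSIONS = {1, 2}

def can_restore(snapshot_meta: dict) -> bool:
    """True only when the snapshot's format version is one this reader
    knows how to restore."""
    return snapshot_meta.get("version") in SUPPORTED_SNAPSHOT_VERSIONS

def restore_or_replay(snapshot_meta: dict) -> str:
    # Fall back to replaying raw event logs when the snapshot is not
    # compatible -- which is only safe if the old logs were not deleted.
    return "restore" if can_restore(snapshot_meta) else "replay"
```

The point of the gate is that the "replay" fallback only exists while the raw event logs still exist, which is exactly the trade-off the issue describes.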
[jira] [Updated] (SPARK-29261) Support recover live entities from KVStore for (SQL)AppStatusListener
[ https://issues.apache.org/jira/browse/SPARK-29261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-29261: - Description: To achieve incremental reply goal in SHS, we need to support recover live entities from KVStore for both SQLAppStatusListener and AppStatusListener. Note that we don't guarantee any compatibility of live entities here - we will file another issue to deal with it altogether. was:To achieve incremental reply goal in SHS, we need to support recover live entities from KVStore for both SQLAppStatusListener and AppStatusListener. > Support recover live entities from KVStore for (SQL)AppStatusListener > - > > Key: SPARK-29261 > URL: https://issues.apache.org/jira/browse/SPARK-29261 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > To achieve incremental reply goal in SHS, we need to support recover live > entities from KVStore for both SQLAppStatusListener and AppStatusListener. > Note that we don't guarantee any compatibility of live entities here - we > will file another issue to deal with it altogether. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29111) Support snapshot/restore of KVStore
[ https://issues.apache.org/jira/browse/SPARK-29111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-29111: - Description: This issue tracks the effort of supporting snapshot/restore from/to KVStore. Note that this issue will not touch current behavior - following issue will leverage the output of this issue. This is to reduce the size of each PR. This will not be guaranteeing any compatibility on snapshot - it means this issue must have an approach to determine whether the snapshot is compatible with current version of SHS. was: This issue tracks the effort of supporting snapshot/restore from/to KVStore. Note that this issue will not touch current behavior - following issue will leverage the output of this issue. This is to reduce the size of each PR. > Support snapshot/restore of KVStore > --- > > Key: SPARK-29111 > URL: https://issues.apache.org/jira/browse/SPARK-29111 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the effort of supporting snapshot/restore from/to KVStore. > Note that this issue will not touch current behavior - following issue will > leverage the output of this issue. This is to reduce the size of each PR. > This will not be guaranteeing any compatibility on snapshot - it means this > issue must have an approach to determine whether the snapshot is compatible > with current version of SHS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28870) Snapshot event log files to support incremental reading
[ https://issues.apache.org/jira/browse/SPARK-28870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-28870: - Description: This issue tracks the effort on compacting event log files into snapshot and enable incremental reading to speed up replaying event logs. This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled event log files. This issue will be also on top of SPARK-29111 and SPARK-29261, as SPARK-29111 will add the ability to snapshot/restore from/to KVStore and SPARK-29261 will add the ability to snapshot/restore of state of (SQL)AppStatusListeners. was: This issue tracks the effort on compacting event log files into snapshot and enable incremental reading to speed up replaying event logs. This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 will add the ability to snapshot/restore from/to KVStore. > Snapshot event log files to support incremental reading > --- > > Key: SPARK-28870 > URL: https://issues.apache.org/jira/browse/SPARK-28870 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the effort on compacting event log files into snapshot and > enable incremental reading to speed up replaying event logs. > This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled > event log files. This issue will be also on top of SPARK-29111 and > SPARK-29261, as SPARK-29111 will add the ability to snapshot/restore from/to > KVStore and SPARK-29261 will add the ability to snapshot/restore of state of > (SQL)AppStatusListeners. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28870) Snapshot event log files to support incremental reading
[ https://issues.apache.org/jira/browse/SPARK-28870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-28870: - Description: This issue tracks the effort on compacting event log files into snapshot and enable incremental reading to speed up replaying event logs. This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 will add the ability to snapshot/restore from/to KVStore. was: This issue tracks the effort on compacting old event log files into snapshot and achieve both goals, 1) reduce overall event log file size 2) speed up replaying event logs. It also deals with cleaning event log files as snapshot will provide the safe way to clean up old event log files without losing ability to replay whole event logs. This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 will add the ability to snapshot/restore from/to KVStore. > Snapshot event log files to support incremental reading > --- > > Key: SPARK-28870 > URL: https://issues.apache.org/jira/browse/SPARK-28870 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the effort on compacting event log files into snapshot and > enable incremental reading to speed up replaying event logs. > This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled > event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 > will add the ability to snapshot/restore from/to KVStore. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28870) Snapshot old event log files to support compaction
[ https://issues.apache.org/jira/browse/SPARK-28870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958300#comment-16958300 ] Jungtaek Lim commented on SPARK-28870: -- Discussed with Marcelo/Imran offline: I'm changing the goal of the snapshot here to only allow incremental reading, as "how to guarantee compatibility of live entities/snapshots" has been a blocker for a long time and we haven't figured out a good solution yet. The goal of opening up the chance to clean up old event log files will be split out into another issue, with explicit requirements added - we should guarantee strong compatibility of both live entities and snapshots to achieve that functionality. > Snapshot old event log files to support compaction > -- > > Key: SPARK-28870 > URL: https://issues.apache.org/jira/browse/SPARK-28870 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the effort on compacting old event log files into snapshot > and achieve both goals, 1) reduce overall event log file size 2) speed up > replaying event logs. It also deals with cleaning event log files as snapshot > will provide the safe way to clean up old event log files without losing > ability to replay whole event logs. > This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled > event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 > will add the ability to snapshot/restore from/to KVStore.
[jira] [Updated] (SPARK-28870) Snapshot event log files to support incremental reading
[ https://issues.apache.org/jira/browse/SPARK-28870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-28870: - Summary: Snapshot event log files to support incremental reading (was: Snapshot old event log files to support compaction) > Snapshot event log files to support incremental reading > --- > > Key: SPARK-28870 > URL: https://issues.apache.org/jira/browse/SPARK-28870 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the effort on compacting old event log files into snapshot > and achieve both goals, 1) reduce overall event log file size 2) speed up > replaying event logs. It also deals with cleaning event log files as snapshot > will provide the safe way to clean up old event log files without losing > ability to replay whole event logs. > This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled > event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 > will add the ability to snapshot/restore from/to KVStore. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29578) JDK 1.8.0_232 timezone updates cause "Kwajalein" test failures again
Sean R. Owen created SPARK-29578: Summary: JDK 1.8.0_232 timezone updates cause "Kwajalein" test failures again Key: SPARK-29578 URL: https://issues.apache.org/jira/browse/SPARK-29578 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.4.4, 3.0.0 Reporter: Sean R. Owen Assignee: Sean R. Owen I have a report that tests fail in JDK 1.8.0_232 because of timezone changes in (I believe) tzdata2018i or later, per https://www.oracle.com/technetwork/java/javase/tzdata-versions-138805.html: {{*** FAILED *** with 8634 did not equal 8633 Round trip of 8633 did not work in tz}} See also https://issues.apache.org/jira/browse/SPARK-24950 I say "I have a report" because I can't get this version easily on my Mac. However, the fix is so inconsequential that I think we can just make it, allowing this additional variation just as before.
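The kind of tolerance the fix adds can be sketched generically: a round-trip check that accepts either the pre- or post-tzdata-update offset instead of pinning a single value. The numbers below come from the error message quoted above; the code is illustrative and not Spark's actual test.

```python
# Sketch: tzdata updates can shift a historical zone offset by a second
# (8633 vs 8634 in the report), so a robust round-trip test accepts any
# known variant rather than hard-coding one. Purely illustrative.
ACCEPTED_KWAJALEIN_OFFSETS = {8633, 8634}

def round_trip_ok(offset_seconds: int) -> bool:
    """True when the observed offset matches any tzdata variant we
    know about."""
    return offset_seconds in ACCEPTED_KWAJALEIN_OFFSETS
```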
[jira] [Created] (SPARK-29577) Implement p-value simulation and unit tests for chi2 test
Alexander Tronchin-James created SPARK-29577: Summary: Implement p-value simulation and unit tests for chi2 test Key: SPARK-29577 URL: https://issues.apache.org/jira/browse/SPARK-29577 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.4.5, 3.0.0 Reporter: Alexander Tronchin-James Spark MLlib's chi-squared test does not yet include p-value simulation for the goodness-of-fit test, and implementing a robust/scalable approach was non-trivial for us, so we wanted to give this work back to the community for others to use. https://github.com/apache/spark/pull/26197
[jira] [Created] (SPARK-29576) Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus
DB Tsai created SPARK-29576: --- Summary: Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus Key: SPARK-29576 URL: https://issues.apache.org/jira/browse/SPARK-29576 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.4 Reporter: DB Tsai Fix For: 3.0.0 Instead of using the ZStd codec directly, we use Spark's CompressionCodec, which wraps the ZStd codec in a buffered stream to avoid the excessive overhead of JNI calls when compressing small amounts of data. Also, by using Spark's CompressionCodec, we can easily make it configurable in the future if needed.
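The rationale above (buffering coalesces many tiny writes so the per-call codec overhead, such as a JNI crossing, is paid rarely) can be sketched generically. This is not Spark's code: zlib stands in for ZStd, and the counting wrapper is purely illustrative.

```python
# Sketch of why wrapping a compressor in a buffered stream helps: the
# buffer turns many tiny writes into a few large writes to the codec,
# so the per-call (e.g. JNI) overhead is amortized. zlib is used here
# as a stand-in for ZStd; the principle is the same.
import io
import zlib

class CountingCompressor(io.RawIOBase):
    """Counts how many times the (expensive) codec boundary is crossed."""
    def __init__(self):
        self.calls = 0
        self.compressed = b""
        self._c = zlib.compressobj()

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1                       # one "expensive" codec call
        self.compressed += self._c.compress(bytes(b))
        return len(b)

    def finish(self):
        self.compressed += self._c.flush()

raw = CountingCompressor()
buffered = io.BufferedWriter(raw, buffer_size=64 * 1024)
for _ in range(10_000):
    buffered.write(b"x" * 8)                  # 10,000 tiny 8-byte writes
buffered.flush()
raw.finish()
# The buffer collapses 10,000 tiny writes into a handful of codec calls,
# while the compressed stream still round-trips to the original data.
```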
[jira] [Commented] (SPARK-29571) Fix UT in AllExecutionsPageSuite class
[ https://issues.apache.org/jira/browse/SPARK-29571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958237#comment-16958237 ] Ankit raj boudh commented on SPARK-29571: - The assert condition is wrong in AllExecutionsPageSuite.scala for the test "sorting should be successful": if an IllegalArgumentException occurs, the unit test still passes (it should actually fail). > Fix UT in AllExecutionsPageSuite class > --- > > Key: SPARK-29571 > URL: https://issues.apache.org/jira/browse/SPARK-29571 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Ankit Raj Boudh >Priority: Minor > > sorting should be successful UT in class AllExecutionsPageSuite failing due > to invalid assert condition. Needs to handle this.
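The class of bug described in this comment can be shown in miniature: when the check sits inside a try block whose catch clause swallows the exception, the failure path "passes". This is a generic sketch, not the actual Scala suite code.

```python
# Generic sketch of the bug class described above (not the actual
# AllExecutionsPageSuite code): the flawed variant swallows the
# exception, so an invalid input never fails the "test".

def sort_rows(rows, key):
    if key not in ("id", "name"):
        raise ValueError(f"unknown sort key: {key}")
    return sorted(rows, key=lambda r: r[key])

def flawed_check(rows, key):
    # BAD: if sort_rows raises, we fall through and return True anyway.
    try:
        result = sort_rows(rows, key)
        assert result == sorted(rows, key=lambda r: r[key])
    except ValueError:
        pass
    return True

def fixed_check(rows, key):
    # GOOD: an unexpected exception propagates and fails the test.
    result = sort_rows(rows, key)
    return result == sorted(rows, key=lambda r: r[key])
```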
[jira] [Resolved] (SPARK-29538) Test failure: org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.multiple joins
[ https://issues.apache.org/jira/browse/SPARK-29538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-29538. -- Resolution: Duplicate SPARK-29552 dealt with this. Will reopen if it is still flaky. > Test failure: > org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.multiple joins > --- > > Key: SPARK-29538 > URL: https://issues.apache.org/jira/browse/SPARK-29538 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112373/testReport] > org.scalatest.exceptions.TestFailedException: 2 did not equal 1 > > This doesn't look like a rare occurrence - it had been passing, then > failed once, and shows roughly 1 or 2 failures per page of history. > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112373/testReport/junit/org.apache.spark.sql.execution.adaptive/AdaptiveQueryExecSuite/multiple_joins/history] > (Please page back through older runs to see how often it failed.)
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958225#comment-16958225 ] Georg Heiler commented on SPARK-15348: -- Another workaround could be external tables, as outlined in [https://stackoverflow.com/questions/58406125/how-to-write-a-table-to-hive-from-spark-without-using-the-warehouse-connector-in]. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of Hive's transactional tables: > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction has been done. > Also, it seems that compaction is not supported - alter table ... partition > COMPACT 'major'
[jira] [Commented] (SPARK-29571) Fix UT in AllExecutionsPageSuite class
[ https://issues.apache.org/jira/browse/SPARK-29571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958171#comment-16958171 ] shahid commented on SPARK-29571: Could you clarify which UT is failing? > Fix UT in AllExecutionsPageSuite class > --- > > Key: SPARK-29571 > URL: https://issues.apache.org/jira/browse/SPARK-29571 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Ankit Raj Boudh >Priority: Minor > > sorting should be successful UT in class AllExecutionsPageSuite failing due > to invalid assert condition. Needs to handle this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958166#comment-16958166 ] Xingbo Jiang commented on SPARK-29569: -- [~dongjoon] Not yet, the release script is still failing, Wenchen and I are investigating more. > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 3.0.0 > > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Run `jekyll build` under `./spark/docs`, the command fail with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on master branch, the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29557) Upgrade dropwizard metrics library to 3.2.6
[ https://issues.apache.org/jira/browse/SPARK-29557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29557: -- Description: This proposes to upgrade the dropwizard/codahale metrics library version used by Spark to `3.2.6` which is the last version supporting Ganglia. Spark is currently using Dropwizard metrics version 3.1.5, a version that is no more actively developed nor maintained, according to the project's Github repo README. (was: This proposes to upgrade the dropwizard/codahale metrics library version used by Spark to a recent version, tentatively 4.1.1. Spark is currently using Dropwizard metrics version 3.1.5, a version that is no more actively developed nor maintained, according to the project's Github repo README.) > Upgrade dropwizard metrics library to 3.2.6 > --- > > Key: SPARK-29557 > URL: https://issues.apache.org/jira/browse/SPARK-29557 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.0.0 > > > This proposes to upgrade the dropwizard/codahale metrics library version used > by Spark to `3.2.6` which is the last version supporting Ganglia. Spark is > currently using Dropwizard metrics version 3.1.5, a version that is no more > actively developed nor maintained, according to the project's Github repo > README. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29557) Upgrade dropwizard metrics library to 3.2.6
[ https://issues.apache.org/jira/browse/SPARK-29557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29557. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26212 [https://github.com/apache/spark/pull/26212] > Upgrade dropwizard metrics library to 3.2.6 > --- > > Key: SPARK-29557 > URL: https://issues.apache.org/jira/browse/SPARK-29557 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.0.0 > > > This proposes to upgrade the dropwizard/codahale metrics library version used > by Spark to a recent version, tentatively 4.1.1. Spark is currently using > Dropwizard metrics version 3.1.5, a version that is no more actively > developed nor maintained, according to the project's Github repo README. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29557) Upgrade dropwizard metrics library to 3.2.6
[ https://issues.apache.org/jira/browse/SPARK-29557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29557: - Assignee: Luca Canali > Upgrade dropwizard metrics library to 3.2.6 > --- > > Key: SPARK-29557 > URL: https://issues.apache.org/jira/browse/SPARK-29557 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > > This proposes to upgrade the dropwizard/codahale metrics library version used > by Spark to a recent version, tentatively 4.1.1. Spark is currently using > Dropwizard metrics version 3.1.5, a version that is no more actively > developed nor maintained, according to the project's Github repo README. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958065#comment-16958065 ] Dongjoon Hyun commented on SPARK-29569: --- Sorry for being late to the party! Thank you for swift fixing, [~hyukjin.kwon]! I saw new `3.0.0-preview-rc1` tag. Now, it's ready for vote? :) > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 3.0.0 > > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Run `jekyll build` under `./spark/docs`, the command fail with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on master branch, the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29575) from_json can produce nulls for fields which are marked as non-nullable
[ https://issues.apache.org/jira/browse/SPARK-29575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Victor Lopez updated SPARK-29575: - Description: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. The issue appears when using {{from_json}} to parse a column in a Spark dataframe. It seems like {{from_json}} ignores whether the schema provided has any {{nullable:False}} property. {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} was: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. {{The issue appears when using from_json to parse a column in a Spark dataframe. It seems like from_json ignores whether the schema provided has any nullable:False property.}} {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} > from_json can produce nulls for fields which are marked as non-nullable > --- > > Key: SPARK-29575 > URL: https://issues.apache.org/jira/browse/SPARK-29575 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Victor Lopez >Priority: Major > > I believe this issue was resolved elsewhere > (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this > bug seems to still be there. 
> The issue appears when using {{from_json}} to parse a column in a Spark > dataframe. It seems like {{from_json}} ignores whether the schema provided > has any {{nullable:False}} property. > {code:java} > schema = T.StructType().add(T.StructField('id', T.LongType(), > nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) > data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': > 'jane'})}] > df = spark.read.json(sc.parallelize(data)) > df.withColumn("details", F.from_json("user", > schema)).select("details.*").show() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
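The snippet in the report assumes a live PySpark session and the usual `T`/`F` aliases, so it is not runnable on its own. The semantics the reporter expects — a record missing a non-nullable field gets rejected instead of silently filled with null — can be illustrated outside Spark in a few lines of plain Python (a conceptual sketch only; the `parse_strict` validator and the schema shape are hypothetical, not PySpark's API):

```python
import json

# Toy schema mirroring the report: both fields are declared non-nullable.
schema = {"id": {"nullable": False}, "name": {"nullable": False}}

def parse_strict(raw, schema):
    """Parse a JSON record; reject it if any non-nullable field is missing."""
    rec = json.loads(raw)
    missing = [field for field, spec in schema.items()
               if not spec["nullable"] and rec.get(field) is None]
    # Spark's from_json instead keeps the record and fills the field with null.
    return rec if not missing else None

rows = ['{"name": "joe", "id": 1}', '{"name": "jane"}']
print([parse_strict(r, schema) for r in rows])
# → [{'name': 'joe', 'id': 1}, None]
```

The second record is rejected because `id` is marked `nullable: False`; the bug report is that PySpark's `from_json` performs no such check against the supplied schema.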
[jira] [Updated] (SPARK-29575) from_json can produce nulls for fields which are marked as non-nullable
[ https://issues.apache.org/jira/browse/SPARK-29575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Victor Lopez updated SPARK-29575: - Description: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. {{The issue appears when using `from_json` to parse a column in a Spark dataframe. It seems like `from_json` ignores whether the schema provided has any `nullable:False` property.}} {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} was: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. The issue appears when using `from_json` to parse a column in a Spark dataframe. It seems like `from_json` ignores whether the schema provided has any `nullable:False` property. {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} > from_json can produce nulls for fields which are marked as non-nullable > --- > > Key: SPARK-29575 > URL: https://issues.apache.org/jira/browse/SPARK-29575 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Victor Lopez >Priority: Major > > I believe this issue was resolved elsewhere > (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this > bug seems to still be there. 
> {{The issue appears when using `from_json` to parse a column in a Spark > dataframe. It seems like `from_json` ignores whether the schema provided has > any `nullable:False` property.}} > {code:java} > schema = T.StructType().add(T.StructField('id', T.LongType(), > nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) > data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': > 'jane'})}] > df = spark.read.json(sc.parallelize(data)) > df.withColumn("details", F.from_json("user", > schema)).select("details.*").show() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29575) from_json can produce nulls for fields which are marked as non-nullable
[ https://issues.apache.org/jira/browse/SPARK-29575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Victor Lopez updated SPARK-29575: - Description: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. {{The issue appears when using from_json to parse a column in a Spark dataframe. It seems like from_json ignores whether the schema provided has any nullable:False property.}} {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} was: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. {{The issue appears when using `from_json` to parse a column in a Spark dataframe. It seems like `from_json` ignores whether the schema provided has any `nullable:False` property.}} {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} > from_json can produce nulls for fields which are marked as non-nullable > --- > > Key: SPARK-29575 > URL: https://issues.apache.org/jira/browse/SPARK-29575 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Victor Lopez >Priority: Major > > I believe this issue was resolved elsewhere > (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this > bug seems to still be there. 
> {{The issue appears when using from_json to parse a column in a Spark > dataframe. It seems like from_json ignores whether the schema provided has > any nullable:False property.}} > {code:java} > schema = T.StructType().add(T.StructField('id', T.LongType(), > nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) > data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': > 'jane'})}] > df = spark.read.json(sc.parallelize(data)) > df.withColumn("details", F.from_json("user", > schema)).select("details.*").show() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958056#comment-16958056 ] Shane Knapp edited comment on SPARK-29106 at 10/23/19 5:29 PM: --- > For pyspark test, you mentioned we didn't install any python debs for > testing. Is there any "requirements.txt" or "test-requirements.txt" in the > spark repo? I failed to find them. When we tested pyspark before, we just > realized that we needed to install the numpy package with pip, because the > failure message told us so when we ran the pyspark test scripts. You > mentioned "pyspark testing debs" before; do you mean that we should figure it > all out manually? Is there any kind of suggestion from your side? i manage the jenkins configs via ansible, and python specifically through anaconda. anaconda was my initial choice for package management because we need to support multiple python versions (2.7, 3.x, pypy) and specific package versions for each python version itself. sadly there is no official ARM anaconda python distribution, which is a BIG hurdle for this project. i also don't use requirements.txt and pip to do the initial python env setup as pip is flakier than i like, and the conda envs just work a LOT better. see: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#building-identical-conda-environments i could check the specific python package configs in to the spark repo, but they're specific to our worker configs, and even though the worker deployment process is automated (via ansible) there is ALWAYS some stupid dependency loop that pops up and requires manual intervention. another issue is that i do NOT want any builds installing/updating/creating either python environments OR packages. builds should NEVER EVER modify the bare-metal (or VM) system-level configs.
so, to summarize what needs to happen to get the python tests up and running: 1) there is no conda distribution for the ARM architecture, meaning... 2) i need to use venv to install everything... 3) which means i need to use pip/requirements.txt, which is known to be flaky... 4) and the python packages for ARM are named differently than x86... 5) or don't exist... 6) or are the wrong version... 7) meaning that setting up and testing three different python versions with differing package names and versions makes this a lot of trial and error. i would like to get this done asap, but i will need to carve out some serious time to get my brain wrapped around the problem. > For sparkR test, we compiled a newer R version, 3.6.1, by fixing many lib > dependencies, and made it work. We then ran the R test script until all of > the tests passed. So we wonder about the difficulty of this testing when it > truly runs in amplab; could you please share more with us? i have a deep and comprehensive hatred of installing and setting up R. i've attached a couple of files showing the packages installed, their versions, and some of the ansible snippets i use to do the initial install. https://issues.apache.org/jira/secure/attachment/12983856/R-ansible.yml https://issues.apache.org/jira/secure/attachment/12983857/R-libs.txt just like you, i need to go back and manually fix lib dependency and version errors once the initial setup is complete. this is why i have a deep and comprehensive hatred of installing and setting up R. > For the current periodic jobs, you said they will be triggered 2 times per > day, and each build will cost at most 11 hours. I have a thought about the > next job deployment and wish to know your opinion. My thought is we can set > up 2 jobs per day: one is the current maven UT test triggered by SCM changes > (11h), the other will run the pyspark and sparkR tests, also triggered by SCM > changes (including spark build and tests, which may cost 5-6 hours). How > about this?
> We can talk and discuss it if we don't yet realize how difficult these are > to do. yeah, i am amenable to having a second ARM build. i'd be curious as to the impact on the VM's performance when we have two builds running simultaneously. if i have some time today i'll experiment. shane was (Author: shaneknapp): > For pyspark test, you mentioned we didn't install any python debs for > testing. Is there any "requirements.txt" or "test-requirements.txt" in the > spark repo? I failed to find them. When we tested pyspark before, we just > realized that we needed to install the numpy package with pip, because the > failure message told us so when we ran the pyspark test scripts. You > mentioned "pyspark testing debs" before; do you mean that we should figure it > all out manually? Is there any kind of suggestion from your side? i manage the jenkins configs via ansible, and python specifically through anaconda. anaconda was my initial choice for package management because we need to support multiple python versions (2.7, 3.x, pypy) and specific package
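The "building identical conda environments" approach Shane links to pins every package to an exact version so each worker gets a byte-for-byte reproducible env. As a hypothetical illustration only (the env name and package versions below are invented, not the actual Jenkins configuration), a pinned conda environment file for one of the test Python versions might look like:

```yaml
# Hypothetical pinned environment for a py3 Spark test env (illustrative versions).
# In practice `conda list --explicit > spec-file.txt` captures exact builds.
name: spark-py36
channels:
  - defaults
dependencies:
  - python=3.6.8
  - numpy=1.16.4
  - pandas=0.24.2
  - pip
```

The exact-pin style is what makes conda envs reproducible across workers; the lack of an official ARM anaconda distribution is why this approach breaks down for the ARM build.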
[jira] [Updated] (SPARK-29575) from_json can produce nulls for fields which are marked as non-nullable
[ https://issues.apache.org/jira/browse/SPARK-29575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Victor Lopez updated SPARK-29575: - Description: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. The issue appears when using `from_json` to parse a column in a Spark dataframe. It seems like `from_json` ignores whether the schema provided has any `nullable:False` property. {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} was: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. The issue appears when using `from_json` to parse a column in a Spark dataframe. It seems like `from_json` ignores whether the schema provided has any `nullable:False` property. {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} > from_json can produce nulls for fields which are marked as non-nullable > --- > > Key: SPARK-29575 > URL: https://issues.apache.org/jira/browse/SPARK-29575 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Victor Lopez >Priority: Major > > I believe this issue was resolved elsewhere > (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this > bug seems to still be there. 
> The issue appears when using `from_json` to parse a column in a Spark > dataframe. It seems like `from_json` ignores whether the schema provided has > any `nullable:False` property. > {code:java} > schema = T.StructType().add(T.StructField('id', T.LongType(), > nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) > data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': > 'jane'})}] > df = spark.read.json(sc.parallelize(data)) > df.withColumn("details", F.from_json("user", > schema)).select("details.*").show() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp updated SPARK-29106: Attachment: R-libs.txt R-ansible.yml > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt > > > Add arm test jobs to amplab jenkins for spark. > So far we have made two periodic arm test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), > the other is based on a new branch which we made on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the arm test with > amplab jenkins. > About the k8s test on arm, we have tested it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan to test other stable branches too, and we can integrate them to > amplab when they are ready. > We have offered an arm instance and sent the info to shane knapp; thanks to > shane for adding the first arm job to amplab jenkins :) > The other important thing is about the leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. 
So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like a > 'property'/'profile' to choose the correct jar package on the arm or x86 > platform, because spark depends on some hadoop packages like hadoop-hdfs, and > those packages depend on leveldbjni-all-1.8 too, unless hadoop releases with > a new arm-supporting leveldbjni jar. For now we download the > leveldbjni-all-1.8 of openlabtesting and 'mvn install' it for use when arm > testing spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958056#comment-16958056 ] Shane Knapp commented on SPARK-29106: - > For pyspark test, you mentioned we didn't install any python debs for > testing. Is there any "requirements.txt" or "test-requirements.txt" in the > spark repo? I failed to find them. When we tested pyspark before, we just > realized that we needed to install the numpy package with pip, because the > failure message told us so when we ran the pyspark test scripts. You > mentioned "pyspark testing debs" before; do you mean that we should figure it > all out manually? Is there any kind of suggestion from your side? i manage the jenkins configs via ansible, and python specifically through anaconda. anaconda was my initial choice for package management because we need to support multiple python versions (2.7, 3.x, pypy) and specific package versions for each python version itself. sadly there is no official ARM anaconda python distribution, which is a BIG hurdle for this project. i also don't use requirements.txt and pip to do the initial python env setup as pip is flakier than i like, and the conda envs just work a LOT better. see: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#building-identical-conda-environments i could check the specific python package configs in to the spark repo, but they're specific to our worker configs, and even though the worker deployment process is automated (via ansible) there is ALWAYS some stupid dependency loop that pops up and requires manual intervention. another issue is that i do NOT want any builds installing/updating/creating either python environments OR packages. builds should NEVER EVER modify the bare-metal (or VM) system-level configs. so, to summarize what needs to happen to get the python tests up and running: 1) there is no conda distribution for the ARM architecture, meaning... 
2) i need to use venv to install everything... 3) which means i need to use pip/requirements.txt, which is known to be flaky... 4) and the python packages for ARM are named differently than x86... 5) or don't exist... 6) or are the wrong version... 7) meaning that setting up and testing three different python versions with differing package names and versions makes this a lot of trial and error. i would like to get this done asap, but i will need to carve out some serious time to get my brain wrapped around the problem. > For sparkR test, we compiled a newer R version, 3.6.1, by fixing many lib > dependencies, and made it work. We then ran the R test script until all of > the tests passed. So we wonder about the difficulty of this testing when it > truly runs in amplab; could you please share more with us? i have a deep and comprehensive hatred of installing and setting up R. i'll attach a couple of files showing the packages installed, their versions, and some of the ansible snippets i use to do the initial install. just like you, i need to go back and manually fix lib dependency and version errors once the initial setup is complete. this is why i have a deep and comprehensive hatred of installing and setting up R. > For the current periodic jobs, you said they will be triggered 2 times per > day, and each build will cost at most 11 hours. I have a thought about the > next job deployment and wish to know your opinion. My thought is we can set > up 2 jobs per day: one is the current maven UT test triggered by SCM changes > (11h), the other will run the pyspark and sparkR tests, also triggered by SCM > changes (including spark build and tests, which may cost 5-6 hours). How > about this? > We can talk and discuss it if we don't yet realize how difficult these are > to do. yeah, i am amenable to having a second ARM build. i'd be curious as to the impact on the VM's performance when we have two builds running simultaneously. if i have some time today i'll experiment. 
shane > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > > Add arm test jobs to amplab jenkins for spark. > So far we have made two periodic arm test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), > the other is based on a new branch which we made on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when
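Points 1)–3) in Shane's summary — falling back from conda to the stdlib `venv` module plus pip on ARM — can be sketched with a few lines of Python. This is an illustration under stated assumptions (the env name and layout are invented, not the actual Jenkins provisioning):

```python
import sys
import tempfile
import venv
from pathlib import Path

# Create a throwaway virtualenv the way an ARM provisioning script might.
# with_pip=True would also bootstrap pip, so a pinned requirements.txt
# could then be installed into the env (the flaky step Shane describes).
env_dir = Path(tempfile.mkdtemp()) / "spark-py3"
venv.EnvBuilder(with_pip=False).create(env_dir)

# A Jenkins job would invoke this interpreter for the pyspark tests,
# leaving the bare-metal/VM system Python untouched.
interpreter = env_dir / ("Scripts" if sys.platform == "win32" else "bin") / "python"
print(interpreter.exists())  # → True
```

Keeping all packages inside per-build virtualenvs is one way to honor the constraint that builds never modify system-level configs.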
[jira] [Assigned] (SPARK-29552) Fix the flaky test failed in AdaptiveQueryExecSuite # multiple joins
[ https://issues.apache.org/jira/browse/SPARK-29552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29552: --- Assignee: Ke Jia > Fix the flaky test failed in AdaptiveQueryExecSuite # multiple joins > > > Key: SPARK-29552 > URL: https://issues.apache.org/jira/browse/SPARK-29552 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > > AQE re-optimizes the logical plan once a query stage has finished. So for an > inner join where both sides are small enough to be the build side, the > planner converting the logical plan to a physical plan will select the build > side as BuildRight if the right side finished first, or BuildLeft if the left > side finished first. In some cases BuildRight or BuildLeft may introduce an > additional exchange at the parent node. The revert approach in the > OptimizeLocalShuffleReader rule may be too conservative: it reverts all the > local shuffle readers when an additional exchange is introduced, rather than > reverting only the local shuffle readers that introduced the shuffle. It may > also be expensive to revert only the local shuffle readers that introduced > the shuffle. The workaround is to apply the OptimizeLocalShuffleReader rule > again when creating a new query stage, to further optimize the subtree's > shuffle readers into local shuffle readers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29552) Fix the flaky test failed in AdaptiveQueryExecSuite # multiple joins
[ https://issues.apache.org/jira/browse/SPARK-29552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29552. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26207 [https://github.com/apache/spark/pull/26207] > Fix the flaky test failed in AdaptiveQueryExecSuite # multiple joins > > > Key: SPARK-29552 > URL: https://issues.apache.org/jira/browse/SPARK-29552 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.0.0 > > > AQE re-optimizes the logical plan once a query stage has finished. So for an > inner join where both sides are small enough to be the build side, the > planner converting the logical plan to a physical plan will select the build > side as BuildRight if the right side finished first, or BuildLeft if the left > side finished first. In some cases BuildRight or BuildLeft may introduce an > additional exchange at the parent node. The revert approach in the > OptimizeLocalShuffleReader rule may be too conservative: it reverts all the > local shuffle readers when an additional exchange is introduced, rather than > reverting only the local shuffle readers that introduced the shuffle. It may > also be expensive to revert only the local shuffle readers that introduced > the shuffle. The workaround is to apply the OptimizeLocalShuffleReader rule > again when creating a new query stage, to further optimize the subtree's > shuffle readers into local shuffle readers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
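The adaptive re-optimization this ticket exercises only runs when AQE is enabled. As a minimal configuration sketch (flag names as documented for Spark 3.0; default values differ between versions, so treat this as illustrative rather than authoritative), a spark-defaults.conf fragment to turn it on might be:

```properties
# Enable Adaptive Query Execution so rules such as OptimizeLocalShuffleReader apply
spark.sql.adaptive.enabled                     true
# Allow AQE to convert shuffle readers into local shuffle readers
spark.sql.adaptive.localShuffleReader.enabled  true
```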
[jira] [Assigned] (SPARK-29503) MapObjects doesn't copy Unsafe data when nested under Safe data
[ https://issues.apache.org/jira/browse/SPARK-29503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29503: --- Assignee: Jungtaek Lim > MapObjects doesn't copy Unsafe data when nested under Safe data > --- > > Key: SPARK-29503 > URL: https://issues.apache.org/jira/browse/SPARK-29503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 3.0.0 >Reporter: Aaron Lewis >Assignee: Jungtaek Lim >Priority: Major > Labels: correctness > > In order for MapObjects to operate safely, it checks to see if the result of > the mapping function is an Unsafe type (UnsafeRow, UnsafeArrayData, > UnsafeMapData) and performs a copy before writing it into MapObjects' output > array. This is to protect against expressions which re-use the same native > memory buffer to represent its result across evaluations; if the copy wasn't > here, all results would be pointing to the same native buffer and would > represent the last result written to the buffer. However, MapObjects misses > this needed copy if the Unsafe data is nested below some safe structure, for > instance a GenericArrayData whose elements are all UnsafeRows. In this > scenario, all elements of the GenericArrayData will be pointing to the same > native UnsafeRow buffer which will hold the last value written to it. > > Right now, this bug seems to only occur when a `ProjectExec` goes down the > `execute` path, as opposed to WholeStageCodegen's `produce` and `consume` > path. 
> > Example Reproduction Code: > {code:scala} > import org.apache.spark.sql.catalyst.expressions.objects.MapObjects > import org.apache.spark.sql.catalyst.expressions.CreateArray > import org.apache.spark.sql.catalyst.expressions.Expression > import org.apache.spark.sql.functions.{array, struct} > import org.apache.spark.sql.Column > import org.apache.spark.sql.types.ArrayType > // For the purpose of demonstration, we need to disable WholeStage codegen > spark.conf.set("spark.sql.codegen.wholeStage", "false") > val exampleDS = spark.sparkContext.parallelize(Seq(Seq(1, 2, > 3))).toDF("items") > // Trivial example: Nest unsafe struct inside safe array > // items: Seq[Int] => items.map{item => Seq(Struct(item))} > val result = exampleDS.select( > new Column(MapObjects( > {item: Expression => array(struct(new Column(item))).expr}, > $"items".expr, > exampleDS.schema("items").dataType.asInstanceOf[ArrayType].elementType > )) as "items" > ) > result.show(10, false) > {code} > > Actual Output: > {code:java} > +-+ > |items| > +-+ > |[WrappedArray([3]), WrappedArray([3]), WrappedArray([3])]| > +-+ > {code} > > Expected Output: > {code:java} > +-+ > |items| > +-+ > |[WrappedArray([1]), WrappedArray([2]), WrappedArray([3])]| > +-+ > {code} > > We've confirmed that the bug exists on version 2.1.1 as well as on master > (which I assume corresponds to version 3.0.0?) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29503) MapObjects doesn't copy Unsafe data when nested under Safe data
[ https://issues.apache.org/jira/browse/SPARK-29503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29503. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26173 [https://github.com/apache/spark/pull/26173] > MapObjects doesn't copy Unsafe data when nested under Safe data > --- > > Key: SPARK-29503 > URL: https://issues.apache.org/jira/browse/SPARK-29503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 3.0.0 >Reporter: Aaron Lewis >Assignee: Jungtaek Lim >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > In order for MapObjects to operate safely, it checks to see if the result of > the mapping function is an Unsafe type (UnsafeRow, UnsafeArrayData, > UnsafeMapData) and performs a copy before writing it into MapObjects' output > array. This is to protect against expressions which re-use the same native > memory buffer to represent its result across evaluations; if the copy wasn't > here, all results would be pointing to the same native buffer and would > represent the last result written to the buffer. However, MapObjects misses > this needed copy if the Unsafe data is nested below some safe structure, for > instance a GenericArrayData whose elements are all UnsafeRows. In this > scenario, all elements of the GenericArrayData will be pointing to the same > native UnsafeRow buffer which will hold the last value written to it. > > Right now, this bug seems to only occur when a `ProjectExec` goes down the > `execute` path, as opposed to WholeStageCodegen's `produce` and `consume` > path. 
> > Example Reproduction Code: > {code:scala} > import org.apache.spark.sql.catalyst.expressions.objects.MapObjects > import org.apache.spark.sql.catalyst.expressions.CreateArray > import org.apache.spark.sql.catalyst.expressions.Expression > import org.apache.spark.sql.functions.{array, struct} > import org.apache.spark.sql.Column > import org.apache.spark.sql.types.ArrayType > // For the purpose of demonstration, we need to disable WholeStage codegen > spark.conf.set("spark.sql.codegen.wholeStage", "false") > val exampleDS = spark.sparkContext.parallelize(Seq(Seq(1, 2, > 3))).toDF("items") > // Trivial example: Nest unsafe struct inside safe array > // items: Seq[Int] => items.map{item => Seq(Struct(item))} > val result = exampleDS.select( > new Column(MapObjects( > {item: Expression => array(struct(new Column(item))).expr}, > $"items".expr, > exampleDS.schema("items").dataType.asInstanceOf[ArrayType].elementType > )) as "items" > ) > result.show(10, false) > {code} > > Actual Output: > {code:java} > +-+ > |items| > +-+ > |[WrappedArray([3]), WrappedArray([3]), WrappedArray([3])]| > +-+ > {code} > > Expected Output: > {code:java} > +-+ > |items| > +-+ > |[WrappedArray([1]), WrappedArray([2]), WrappedArray([3])]| > +-+ > {code} > > We've confirmed that the bug exists on version 2.1.1 as well as on master > (which I assume corresponds to version 3.0.0?) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
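The buffer-reuse hazard behind this bug can be illustrated outside Spark with a minimal Python analogy (not Spark code; the names are invented for illustration): a producer that reuses one mutable buffer across evaluations makes every stored reference show only the last value, unless each result is copied before being collected.

```python
# Analogy for the MapObjects bug: a "row producer" that reuses one
# mutable buffer across evaluations, like an UnsafeRow-backed expression.
buffer = [None]

def evaluate(value, copy_result):
    buffer[0] = value                          # overwrite shared buffer in place
    return list(buffer) if copy_result else buffer

# Without the defensive copy, every element aliases the same buffer
# and reflects only the last value written (the reported symptom).
no_copy = [evaluate(v, copy_result=False) for v in [1, 2, 3]]
print(no_copy)    # [[3], [3], [3]]

# With the copy (what MapObjects must do even under safe nesting),
# each intermediate result is preserved.
with_copy = [evaluate(v, copy_result=True) for v in [1, 2, 3]]
print(with_copy)  # [[1], [2], [3]]
```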
[jira] [Created] (SPARK-29575) from_json can produce nulls for fields which are marked as non-nullable
Victor Lopez created SPARK-29575: Summary: from_json can produce nulls for fields which are marked as non-nullable Key: SPARK-29575 URL: https://issues.apache.org/jira/browse/SPARK-29575 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Reporter: Victor Lopez I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for PySpark the bug still seems to be present. The issue appears when using `from_json` to parse a column in a Spark dataframe. It seems that `from_json` ignores whether the schema provided has any `nullable:False` property. {code:python} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
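The reported behavior can be mimicked without Spark in a small Python sketch (an analogy with hypothetical helper names, not PySpark's implementation): a parser that projects schema fields at parse time surfaces a missing field as None even when the schema declares it non-nullable, unless nullability is explicitly enforced.

```python
import json

# Hypothetical schema: field name -> nullable flag, mirroring the
# StructType above (both fields declared non-nullable).
schema = {"id": False, "name": False}

def from_json_like(payload, schema):
    """Parse JSON and project schema fields. Like the reported from_json
    behavior, missing fields come back as None even when nullable=False:
    the declared nullability is not enforced at parse time."""
    record = json.loads(payload)
    return {field: record.get(field) for field in schema}

row = from_json_like('{"name": "jane"}', schema)
print(row)  # {'id': None, 'name': 'jane'} -- None despite nullable=False
```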
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958019#comment-16958019 ] Shane Knapp commented on SPARK-29106: - [~huangtianhua]: > we don't have to download and install leveldbjni-all-1.8 in our arm test > instance, we have installed it and it was there. it's a very inexpensive step to execute and i'd rather have builds be atomic. if for some reason the dependency gets wiped/corrupted/etc, the download will ensure we're properly building. > maybe we can try to use 'mvn clean package ' instead of 'mvn clean > install '? sure, i'll give that a shot now. > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > > Add arm test jobs to amplab jenkins for spark. > So far we have made two periodic arm test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), > the other is based on a new branch which we made on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the arm test with amplab > jenkins. > About the k8s test on arm, we have tested it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan to test on other stable branches too, and we can integrate them > into amplab when they are ready. 
> We have offered an arm instance and sent the info to shane knapp; thanks to > shane for adding the first arm job to amplab jenkins :) > The other important thing is about leveldbjni > [https://github.com/fusesource/leveldbjni] (see also > [https://github.com/fusesource/leveldbjni/issues/80]): > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like a > 'property'/'profile' to choose the correct jar package on arm or x86 platforms, > because spark depends on some hadoop packages like hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases with a new arm-supporting > leveldbjni jar. For now we download the leveldbjni-all-1.8 of > openlabtesting and 'mvn install' it when arm testing spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
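For reference, the kind of pom.xml change the comment describes could be sketched as an os.arch-activated Maven profile (a hypothetical fragment: only the two leveldbjni group IDs come from the thread, the rest is assumed, and as noted above it would not fix Hadoop's own transitive dependency on the x86 jar):

```xml
<!-- Hypothetical sketch: pick the arm64-capable leveldbjni build on
     aarch64 hosts, the stock artifact elsewhere, via os.arch activation. -->
<profiles>
  <profile>
    <id>leveldbjni-arm64</id>
    <activation><os><arch>aarch64</arch></os></activation>
    <properties>
      <leveldbjni.group>org.openlabtesting.leveldbjni</leveldbjni.group>
    </properties>
  </profile>
  <profile>
    <id>leveldbjni-default</id>
    <activation><activeByDefault>true</activeByDefault></activation>
    <properties>
      <leveldbjni.group>org.fusesource.leveldbjni</leveldbjni.group>
    </properties>
  </profile>
</profiles>
<!-- dependencies would then reference ${leveldbjni.group}:leveldbjni-all:1.8 -->
```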
[jira] [Commented] (SPARK-29415) Stage Level Sched: Add base ResourceProfile and Request classes
[ https://issues.apache.org/jira/browse/SPARK-29415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957974#comment-16957974 ] Thomas Graves commented on SPARK-29415: --- From a high-level design point, these are the base classes needed for other jiras/components to be implemented. You can see the design doc attached to SPARK-27495 for the entire overview, but for this specifically this is what we are looking to add. These will start out private until we have other parts implemented, and then be made public, in case this isn't fully implemented for a release. ResourceProfile: The user will have to build up a _ResourceProfile_ to pass into an RDD withResources call. This profile will have a limited set of resources the user is allowed to specify. It will allow both task and executor resources. It will be a builder-type interface where the main function called will be _ResourceProfile.require_. Adding the ResourceProfile API class leaves it open to do more advanced things in the future. For instance, perhaps you want a _ResourceProfile.prefer_ option where it would run on a node with some resources if available but then fall back if they aren't. The config names supported correspond to the regular spark configs with the prefix removed. For instance, overhead memory in this api is memoryOverhead, which is spark.executor.memoryOverhead with the spark.executor removed. Resources like GPUs are resource.gpu (spark configs spark.executor.resource.gpu.*).
{code:scala}
def require(request: TaskResourceRequest): this.type
def require(request: ExecutorResourceRequest): this.type
{code}
It will also have functions to get the resources out for both scala and java. 
*Resource Requests:*
{code:scala}
class ExecutorResourceRequest(
  val resourceName: String,
  val amount: Int, // potentially make this handle fractional resources
  val units: String, // to handle memory unit types
  val discoveryScript: Option[String] = None,
  val vendor: Option[String] = None)

class TaskResourceRequest(
  val resourceName: String,
  val amount: Double) // double to handle fractional resources (ie 2 tasks using 1 resource)
{code}
This will allow the user to programmatically set the resources vs just using the configs like they can in Spark 3.0 now. The first implementation would support cpu, memory (overhead, pyspark, on heap, off heap), and the generic resources. An example of the way this might work is:
{code:scala}
val rp = new ResourceProfile()
rp.require(new ExecutorResourceRequest("memory", 2048))
rp.require(new ExecutorResourceRequest("cores", 2))
rp.require(new ExecutorResourceRequest("gpu", 1, Some("/opt/gpuScripts/getGpus")))
rp.require(new TaskResourceRequest("gpu", 1))
{code}
Internally we will also create a default profile, which will be based on the normal spark configs passed in. This default one can be used everywhere the user hasn't explicitly set the ResourceProfile. > Stage Level Sched: Add base ResourceProfile and Request classes > --- > > Key: SPARK-29415 > URL: https://issues.apache.org/jira/browse/SPARK-29415 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > this is just to add initial ResourceProfile, ExecutorResourceRequest and > taskResourceRequest classes that are used by the other parts of the code. > Initially we will have them private until we have other pieces in place. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
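The builder-style accumulation described above can be sketched in plain Python (illustrative only; method and class names here are invented, and the real API is the Scala/Java one proposed in the comment):

```python
# Illustrative sketch of a builder-style resource profile: require()
# accumulates executor- and task-level requests and returns self so
# calls can be chained. Not Spark's implementation.
class ResourceRequest:
    def __init__(self, resource_name, amount):
        self.resource_name = resource_name
        self.amount = amount

class ResourceProfileSketch:
    def __init__(self):
        self.executor_resources = {}
        self.task_resources = {}

    def require_executor(self, request):
        self.executor_resources[request.resource_name] = request
        return self  # builder style: enables chaining

    def require_task(self, request):
        self.task_resources[request.resource_name] = request
        return self

rp = (ResourceProfileSketch()
      .require_executor(ResourceRequest("memory", 2048))
      .require_executor(ResourceRequest("cores", 2))
      .require_executor(ResourceRequest("gpu", 1))
      .require_task(ResourceRequest("gpu", 1)))
print(sorted(rp.executor_resources))  # ['cores', 'gpu', 'memory']
```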
[jira] [Created] (SPARK-29574) spark with user provided hadoop doesn't work on kubernetes
Michał Wesołowski created SPARK-29574: - Summary: spark with user provided hadoop doesn't work on kubernetes Key: SPARK-29574 URL: https://issues.apache.org/jira/browse/SPARK-29574 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.4.4 Reporter: Michał Wesołowski When spark-submit is run with an image built with "hadoop free" spark and user-provided hadoop, it fails on kubernetes (hadoop libraries are not on spark's classpath). I downloaded spark [Pre-built with user-provided Apache Hadoop|https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-without-hadoop.tgz]. I created a docker image with [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]. Based on this image (2.4.4-without-hadoop) I created another one with this Dockerfile: {code:java} FROM spark-py:2.4.4-without-hadoop ENV SPARK_HOME=/opt/spark/ # This is needed for newer kubernetes versions ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.6.1/kubernetes-client-4.6.1.jar $SPARK_HOME/jars COPY spark-env.sh /opt/spark/conf/spark-env.sh RUN chmod +x /opt/spark/conf/spark-env.sh RUN wget -qO- https://www-eu.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz | tar xz -C /opt/ ENV HADOOP_HOME=/opt/hadoop-3.2.1 ENV PATH=${HADOOP_HOME}/bin:${PATH} {code} Contents of spark-env.sh: {code:java} #!/usr/bin/env bash export SPARK_DIST_CLASSPATH=$(hadoop classpath):$HADOOP_HOME/share/hadoop/tools/lib/* export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native {code} spark-submit run with an image created this way fails, since spark-env.sh is overwritten by the [volume created when the pod starts|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L108] As a quick workaround I tried to modify the [entrypoint 
script|https://github.com/apache/spark/blob/ea8b5df47476fe66b63bd7f7bcd15acfb80bde78/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh] to run spark-env.sh during startup, and moved spark-env.sh to a different directory. The driver starts without issues in this setup; however, even though SPARK_DIST_CLASSPATH is set, the executor is constantly failing: {code:java} PS C:\Sandbox\projekty\roboticdrive-analytics\components\docker-images\spark-rda> kubectl logs rda-script-1571835692837-exec-12 ++ id -u + myuid=0 ++ id -g + mygid=0 + set +e ++ getent passwd 0 + uidentry=root:x:0:0:root:/root:/bin/ash + set -e + '[' -z root:x:0:0:root:/root:/bin/ash ']' + source /opt/spark-env.sh +++ hadoop classpath ++ export 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoo++ SPARK_DIST_CLASSPATH='/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*' ++ export LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native ++ LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native ++ echo 
'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*' SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/* ++ echo LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native + SPARK_K8S_CMD=executor LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native + case "$SPARK_K8S_CMD" in + shift 1 + SPARK_CLASSPATH=':/opt/spark//jars/*' + env + sed 's/[^=]*=\(.*\)/\1/g' + sort -t_
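The reporter's workaround (moving spark-env.sh out of /opt/spark/conf and sourcing it from the entrypoint) can be restated as a Dockerfile fragment (a sketch: the paths follow the log above, everything else is an assumption):

```dockerfile
FROM spark-py:2.4.4-without-hadoop
# Keep the env script outside /opt/spark/conf: Spark-on-K8s mounts a
# config-map volume over that directory at pod startup, overwriting it.
COPY spark-env.sh /opt/spark-env.sh
RUN chmod +x /opt/spark-env.sh
# A patched entrypoint.sh then runs `source /opt/spark-env.sh` before the
# driver/executor command, as seen in the executor log above.
```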
[jira] [Assigned] (SPARK-29513) REFRESH TABLE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-29513: --- Assignee: Terry Kim > REFRESH TABLE should look up catalog/table like v2 commands > --- > > Key: SPARK-29513 > URL: https://issues.apache.org/jira/browse/SPARK-29513 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > REFRESH TABLE should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29513) REFRESH TABLE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-29513. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26183 [https://github.com/apache/spark/pull/26183] > REFRESH TABLE should look up catalog/table like v2 commands > --- > > Key: SPARK-29513 > URL: https://issues.apache.org/jira/browse/SPARK-29513 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > REFRESH TABLE should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29557) Upgrade dropwizard metrics library to 3.2.6
[ https://issues.apache.org/jira/browse/SPARK-29557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957945#comment-16957945 ] Luca Canali commented on SPARK-29557: - Upgrading Apache Spark to use dropwizard/codahale metrics library version 4.x or higher is currently blocked by the fact that the Ganglia reporter has been dropped by the Dropwizard metrics library in version 4.0. Dropwizard metrics library version 3.2 still includes a Ganglia reporter. > Upgrade dropwizard metrics library to 3.2.6 > --- > > Key: SPARK-29557 > URL: https://issues.apache.org/jira/browse/SPARK-29557 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > > This proposes to upgrade the dropwizard/codahale metrics library version used > by Spark to a recent version, tentatively 4.1.1. Spark is currently using > Dropwizard metrics version 3.1.5, a version that is no longer actively > developed or maintained, according to the project's Github repo README. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29557) Upgrade dropwizard metrics library to 3.2.6
[ https://issues.apache.org/jira/browse/SPARK-29557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-29557: Summary: Upgrade dropwizard metrics library to 3.2.6 (was: Upgrade dropwizard metrics library to 4.1.1) > Upgrade dropwizard metrics library to 3.2.6 > --- > > Key: SPARK-29557 > URL: https://issues.apache.org/jira/browse/SPARK-29557 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > > This proposes to upgrade the dropwizard/codahale metrics library version used > by Spark to a recent version, tentatively 4.1.1. Spark is currently using > Dropwizard metrics version 3.1.5, a version that is no longer actively > developed or maintained, according to the project's Github repo README. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21287) Cannot use Int.MIN_VALUE as Spark SQL fetchsize
[ https://issues.apache.org/jira/browse/SPARK-21287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957939#comment-16957939 ] Hu Fuwang commented on SPARK-21287: --- [~smilegator] [~srowen] Just submitted a PR for this: [https://github.com/apache/spark/pull/26230] Please help review. > Cannot use Int.MIN_VALUE as Spark SQL fetchsize > --- > > Key: SPARK-21287 > URL: https://issues.apache.org/jira/browse/SPARK-21287 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Maciej Bryński >Priority: Major > > The MySQL JDBC driver offers the possibility of not storing the ResultSet in > memory. > We can do this by setting fetchSize to Int.MIN_VALUE. > Unfortunately this configuration is rejected by Spark. > {code} > java.lang.IllegalArgumentException: requirement failed: Invalid value > `-2147483648` for parameter `fetchsize`. The minimum value is 0. When the > value is 0, the JDBC driver ignores the value and does the estimates. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:105) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:206) > at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:748) > {code} > https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
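The incompatibility above reduces to a range check: Spark validates `fetchsize >= 0`, while MySQL's driver treats Integer.MIN_VALUE as a "stream rows one at a time" sentinel. Here is a minimal Python sketch of a check that would admit the sentinel (hypothetical logic, not the actual patch in the PR above):

```python
INT_MIN = -2**31  # JDBC Integer.MIN_VALUE, MySQL's row-streaming sentinel

def validate_fetchsize(fetchsize):
    """Sketch of a JDBC-option check that, unlike the one in the stack
    trace above, admits MySQL's Integer.MIN_VALUE streaming sentinel."""
    if fetchsize >= 0 or fetchsize == INT_MIN:
        return fetchsize
    raise ValueError(
        f"Invalid value `{fetchsize}` for parameter `fetchsize`.")

print(validate_fetchsize(0))        # 0: the driver does its own estimates
print(validate_fetchsize(INT_MIN))  # -2147483648: stream row by row
```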
[jira] [Created] (SPARK-29573) Spark should work as PostgreSQL when using + Operator
ABHISHEK KUMAR GUPTA created SPARK-29573: Summary: Spark should work as PostgreSQL when using + Operator Key: SPARK-29573 URL: https://issues.apache.org/jira/browse/SPARK-29573 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA Spark and PostgreSQL give different results when concatenating with + as below.
Spark: gives a NULL result
0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12;
+-----+---------+
| id  | name    |
+-----+---------+
| 20  | test    |
| 10  | number  |
+-----+---------+
2 rows selected (3.683 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address from emp12;
+-----+----------+
| ID  | address  |
+-----+----------+
| 20  | NULL     |
| 10  | NULL     |
+-----+----------+
2 rows selected (0.649 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address from emp12;
+-----+----------+
| ID  | address  |
+-----+----------+
| 20  | NULL     |
| 10  | NULL     |
+-----+----------+
2 rows selected (0.406 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as address from emp12;
+-----+----------+
| ID  | address  |
+-----+----------+
| 20  | NULL     |
| 10  | NULL     |
+-----+----------+
PostgreSQL: throws an error saying the operation is not supported
create table emp12(id int,name varchar(255));
insert into emp12 values(10,'number');
insert into emp12 values(20,'test');
select id as ID, id+','+name as address from emp12;
Output: invalid input syntax for integer: ","
create table emp12(id int,name varchar(255));
insert into emp12 values(10,'number');
insert into emp12 values(20,'test');
select id as ID, id+name as address from emp12;
Output: 42883: operator does not exist: integer + character varying
Spark should also throw an error if the operation is not supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
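The divergence above likely comes from implicit casting: Spark coerces the varchar operand to a numeric type for `+`, the cast of 'test' fails and yields NULL, and NULL then propagates through the addition, whereas PostgreSQL refuses to resolve `integer + varchar` at all. A small Python sketch of the two behaviors (illustrative, not either engine's code):

```python
def add_spark_style(a, b):
    """Spark-like: implicitly cast strings to numbers; a failed cast
    becomes None (SQL NULL), and NULL propagates through +."""
    def to_num(v):
        if isinstance(v, str):
            try:
                return float(v)
            except ValueError:
                return None  # cast failure -> NULL
        return v
    a, b = to_num(a), to_num(b)
    return None if a is None or b is None else a + b

def add_postgres_style(a, b):
    """PostgreSQL-like: no implicit integer + varchar; raise instead."""
    if isinstance(a, str) or isinstance(b, str):
        raise TypeError("operator does not exist: integer + character varying")
    return a + b

print(add_spark_style(20, "test"))  # None (NULL), matching the Spark output
```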
[jira] [Created] (SPARK-29572) add v1 read fallback API in DS v2
Wenchen Fan created SPARK-29572: --- Summary: add v1 read fallback API in DS v2 Key: SPARK-29572 URL: https://issues.apache.org/jira/browse/SPARK-29572 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957904#comment-16957904 ] Abhishek Somani commented on SPARK-15348: - [~Kelvin.FE] This seems to be happening because you might have "hive.strict.managed.tables" set to true on the hive metastore server. You can either try setting it to false or running the above query as "create external table test.cars ... " instead of "create table". If you still face an issue or have more questions, please feel free to open an issue at [https://github.com/qubole/spark-acid/issues] > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of hive's transactional tables, > you cannot use spark to delete/update a table and it also has problems > reading the aggregated data when no compaction was done. > Also it seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
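The two workarounds from the comment can be sketched as follows (the table name comes from the thread; column definitions are elided as in the original):

```sql
-- Option 1 (metastore-side): set hive.strict.managed.tables=false in the
-- metastore server's configuration (hive-site.xml), then restart it.

-- Option 2: create the table as external instead of managed:
CREATE EXTERNAL TABLE test.cars ( ... );
```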
[jira] [Created] (SPARK-29571) Fix UT in AllExecutionsPageSuite class
Ankit Raj Boudh created SPARK-29571: --- Summary: Fix UT in AllExecutionsPageSuite class Key: SPARK-29571 URL: https://issues.apache.org/jira/browse/SPARK-29571 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.0.0 Reporter: Ankit Raj Boudh The UT "sorting should be successful" in class AllExecutionsPageSuite is failing due to an invalid assert condition. This needs to be handled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29571) Fix UT in AllExecutionsPageSuite class
[ https://issues.apache.org/jira/browse/SPARK-29571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957898#comment-16957898 ] Ankit Raj Boudh commented on SPARK-29571: - I will raise the PR soon > Fix UT in AllExecutionsPageSuite class > --- > > Key: SPARK-29571 > URL: https://issues.apache.org/jira/browse/SPARK-29571 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Ankit Raj Boudh >Priority: Minor > > The UT "sorting should be successful" in class AllExecutionsPageSuite is > failing due to an invalid assert condition. This needs to be handled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29570) Improve tooltip for Executor Tab for Shuffle Write,Blacklisted,Logs,Threaddump columns
[ https://issues.apache.org/jira/browse/SPARK-29570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957897#comment-16957897 ] Ankit Raj Boudh commented on SPARK-29570: - I will fix this issue > Improve tooltip for Executor Tab for Shuffle > Write,Blacklisted,Logs,Threaddump columns > -- > > Key: SPARK-29570 > URL: https://issues.apache.org/jira/browse/SPARK-29570 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > When the user moves the mouse over the Shuffle Write, Blacklisted, Logs, and > Threaddump columns in the Executors tab, the tooltip is not displayed at the > center, whereas for the other columns it is. > Please fix this issue on all Spark Web UI pages and the History UI page. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29570) Improve tooltip for Executor Tab for Shuffle Write,Blacklisted,Logs,Threaddump columns
ABHISHEK KUMAR GUPTA created SPARK-29570: Summary: Improve tooltip for Executor Tab for Shuffle Write,Blacklisted,Logs,Threaddump columns Key: SPARK-29570 URL: https://issues.apache.org/jira/browse/SPARK-29570 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA When the user moves the mouse over the Shuffle Write, Blacklisted, Logs, and Threaddump columns in the Executors tab, the tooltip is not displayed at the center, whereas for the other columns it is. Please fix this issue on all Spark Web UI pages and the History UI page. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29499) Add mapPartitionsWithIndex for RDDBarrier
[ https://issues.apache.org/jira/browse/SPARK-29499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang resolved SPARK-29499. -- Assignee: Xianyang Liu Resolution: Fixed > Add mapPartitionsWithIndex for RDDBarrier > - > > Key: SPARK-29499 > URL: https://issues.apache.org/jira/browse/SPARK-29499 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.4 >Reporter: Xianyang Liu >Assignee: Xianyang Liu >Priority: Major > > There is only one method in `RDDBarrier`. We often use the partition index as > a label for the current partition. We need to get the index from > `TaskContext` inside `mapPartitions`, which is not convenient. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957785#comment-16957785 ] Hyukjin Kwon commented on SPARK-29569: -- This seems to start to happen after Scala 2.12 upgrade. It seems pretty critical since it's unable to generate the doc ... > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Blocker > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Run `jekyll build` under `./spark/docs`, the command fail with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on master branch, the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29569: - Component/s: docs > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Blocker > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Run `jekyll build` under `./spark/docs`, the command fail with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on master branch, the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29542) [SQL][DOC] The descriptions of `spark.sql.files.*` are confusing.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29542: Assignee: feiwang > [SQL][DOC] The descriptions of `spark.sql.files.*` are confusing. > > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi, the description of `spark.sql.files.maxPartitionBytes` is shown below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It suggests that each partition processes at most that many bytes in Spark SQL. > As shown in the attachment, the value of spark.sql.files.maxPartitionBytes > is 128MB. > For stage 1, its input is 16.3TB, but there are only 6400 tasks. > I checked the code; it is only effective for data source tables. > So, its description is confusing. > The same applies to all the descriptions of `spark.sql.files.*`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
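The mismatch the reporter describes can be checked with simple arithmetic: if `spark.sql.files.maxPartitionBytes` really capped every partition at 128MB, a 16.3TB input would need far more than 6400 tasks. A back-of-the-envelope sketch (assuming binary units for TB and MB, which the JIRA does not state explicitly):

```python
# Back-of-the-envelope check of the reporter's numbers (binary units assumed).
input_bytes = 16.3 * 1024**4           # stage 1 input: 16.3 TB
max_partition_bytes = 128 * 1024**2    # spark.sql.files.maxPartitionBytes: 128 MB

expected_tasks = input_bytes / max_partition_bytes
print(round(expected_tasks))  # number of partitions if the cap applied everywhere

actual_tasks = 6400
print(round(expected_tasks / actual_tasks))  # factor by which each task exceeds the "maximum"
```

The gap (on the order of 20x) is consistent with the reporter's observation that the setting is only honored for data source tables, so the blanket wording of the description is misleading.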
[jira] [Resolved] (SPARK-29542) [SQL][DOC] The descriptions of `spark.sql.files.*` are confusing.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29542. -- Fix Version/s: 3.0.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/26200 > [SQL][DOC] The descriptions of `spark.sql.files.*` are confusing. > > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: feiwang >Priority: Minor > Fix For: 3.0.0 > > Attachments: screenshot-1.png > > > Hi, the description of `spark.sql.files.maxPartitionBytes` is shown below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It suggests that each partition processes at most that many bytes in Spark SQL. > As shown in the attachment, the value of spark.sql.files.maxPartitionBytes > is 128MB. > For stage 1, its input is 16.3TB, but there are only 6400 tasks. > I checked the code; it is only effective for data source tables. > So, its description is confusing. > The same applies to all the descriptions of `spark.sql.files.*`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957772#comment-16957772 ] Hyukjin Kwon commented on SPARK-29569: -- I attached the ScalaDoc output from the current master. Seems like, at some point, the documentation style became completely different. > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Blocker > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Run `jekyll build` under `./spark/docs`; the command fails with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on the master branch; the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29569: - Attachment: Screen Shot 2019-10-23 at 8.25.01 PM.png > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Blocker > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Run `jekyll build` under `./spark/docs`; the command fails with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on the master branch; the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957731#comment-16957731 ] Xingbo Jiang commented on SPARK-29569: -- [~sowen][~dongjoon] Can you take a look at this issue? > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Blocker > > Run `jekyll build` under `./spark/docs`; the command fails with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on the master branch; the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang updated SPARK-29569: - Summary: doc build fails with `/api/scala/lib/jquery.js` doesn't exist (was: doc build fails because `/api/scala/lib/jquery.js` doesn't exist) > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Blocker > > Run `jekyll build` under `./spark/docs`; the command fails with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on the master branch; the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29569) doc build fails because `/api/scala/lib/jquery.js` doesn't exist
Xingbo Jiang created SPARK-29569: Summary: doc build fails because `/api/scala/lib/jquery.js` doesn't exist Key: SPARK-29569 URL: https://issues.apache.org/jira/browse/SPARK-29569 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0 Reporter: Xingbo Jiang Run `jekyll build` under `./spark/docs`; the command fails with the following error message: {code} Making directory api/scala cp -r ../target/scala-2.12/unidoc/. api/scala Making directory api/java cp -r ../target/javaunidoc/. api/java Updating JavaDoc files for badge post-processing Copying jquery.js from Scala API to Java API for page post-processing of badges jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - ./api/scala/lib/jquery.js {code} This error only happens on the master branch; the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29568) Add flag to stop existing stream when new copy starts
Burak Yavuz created SPARK-29568: --- Summary: Add flag to stop existing stream when new copy starts Key: SPARK-29568 URL: https://issues.apache.org/jira/browse/SPARK-29568 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Burak Yavuz In multi-tenant environments where you have multiple SparkSessions, you can accidentally start multiple copies of the same stream (i.e. streams using the same checkpoint location). This causes all new instantiations of the stream to fail. However, sometimes you may want to shut down the old stream, as it may have turned into a zombie (you no longer have access to the query handle or SparkSession). It would be nice to have a SQL flag that allows stopping the old stream in such zombie cases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957679#comment-16957679 ] Zhaoyang Qin edited comment on SPARK-15348 at 10/23/19 9:15 AM: [~asomani] when I run the following code: {{scala> spark.sql("create table test.cars using HiveAcid options ('table' 'test.acidtbl')")}}, I got an AnalysisException (HiveException): `org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table test.cars failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);` Any help with this? Also, this cluster is HDP 3.0 based, with Spark 2.3.1 and Hive 3.0.0. was (Author: kelvin.fe): [~asomani] when i use the following codes:` {{scala> spark.sql("create table symlinkacidtable using HiveAcid options ('table' 'default.acidtbl')")}}`, i got a AnalysisException(HiveException) : `org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table test.cars failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);` Any help for this? Also, this cluster is HDP3.0 based,the Spark ver2.3.1 & hive 3.0.0. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of Hive's transactional tables; > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > Also, it seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957679#comment-16957679 ] Zhaoyang Qin edited comment on SPARK-15348 at 10/23/19 9:15 AM: [~asomani] when I run the following code: {{scala> spark.sql("create table test.cars using HiveAcid options ('table' 'test.acidtbl')")}}, I got an AnalysisException (HiveException): `org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table test.cars failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);` Any help with this? Also, this cluster is HDP 3.0 based, with Spark 2.3.1 and Hive 3.0.0. was (Author: kelvin.fe): [~asomani] when i use the following codes:` {{scala> spark.sql("create table test.cars using HiveAcid options ('table' 'test.acidtbl')")}}`, i got a AnalysisException(HiveException) : `org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table test.cars failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);` Any help for this? Also, this cluster is HDP3.0 based,the Spark ver2.3.1 & hive 3.0.0. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of Hive's transactional tables; > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > Also, it seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957679#comment-16957679 ] Zhaoyang Qin commented on SPARK-15348: -- [~asomani] when I run the following code: {{scala> spark.sql("create table symlinkacidtable using HiveAcid options ('table' 'default.acidtbl')")}}, I got an AnalysisException (HiveException): `org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table test.cars failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);` Any help with this? Also, this cluster is HDP 3.0 based, with Spark 2.3.1 and Hive 3.0.0. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of Hive's transactional tables; > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > Also, it seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
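The "failed strict managed table checks" error above comes from Hive 3's strict managed-table enforcement on HDP 3.x. As an editorial aside not confirmed in this thread: a commonly cited workaround (an assumption about the reporter's environment) is to relax the Hive property governing that check, or to create the table as EXTERNAL so the check does not apply:

```
# Hive 3 / HDP 3.x configuration sketch (assumption: verify the exact
# property name and implications for your distribution before changing it).
# Relaxing this disables the "strict managed table checks" in the error above.
hive.strict.managed.tables=false
```

Disabling the check cluster-wide has side effects for ACID enforcement, so creating the table as EXTERNAL is generally the safer option.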
[jira] [Resolved] (SPARK-29352) Move active streaming query state to the SharedState
[ https://issues.apache.org/jira/browse/SPARK-29352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-29352. - Fix Version/s: 3.0.0 Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/26018] > Move active streaming query state to the SharedState > > > Key: SPARK-29352 > URL: https://issues.apache.org/jira/browse/SPARK-29352 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > We have checks to prevent the restarting of the same stream on the same spark > session, but we can actually make that better in multi-tenant environments by > actually putting that state in the SharedState instead of SessionState. This > would allow a more comprehensive check for multi-tenant clusters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29564) Cluster deploy mode should support Spark Thrift server
[ https://issues.apache.org/jira/browse/SPARK-29564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29564: --- Description: Cluster deploy mode is not applicable to the Spark Thrift server now. This restriction is too strict. In our production environment, we run multiple Spark Thrift servers as long-running services launched in yarn-cluster mode. The life cycle of each STS is managed by an upper-layer management system, which also dispatches users' JDBC connections to the appropriate STS. was: Cluster deploy mode is not applicable to Spark Thrift server from SPARK-21403. This restriction is too rude. In our production, we use multiple Spark Thrift servers as long running services which are used yarn-cluster mode to launch. The life cycle of STS is managed by upper layer manager system which is also used to dispatcher user's JDBC connection to applicable STS. SPARK-21403 banned this case. > Cluster deploy mode should support Spark Thrift server > -- > > Key: SPARK-29564 > URL: https://issues.apache.org/jira/browse/SPARK-29564 > Project: Spark > Issue Type: Bug > Components: Spark Submit, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Cluster deploy mode is not applicable to the Spark Thrift server now. This > restriction is too strict. > In our production environment, we run multiple Spark Thrift servers as long-running > services launched in yarn-cluster mode. The life cycle of each STS is > managed by an upper-layer management system, which also dispatches users' > JDBC connections to the appropriate STS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29567) Update JDBC Integration Test Docker Images
[ https://issues.apache.org/jira/browse/SPARK-29567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29567: -- Summary: Update JDBC Integration Test Docker Images (was: Upgrade JDBC Integration Test Docker Images) > Update JDBC Integration Test Docker Images > -- > > Key: SPARK-29567 > URL: https://issues.apache.org/jira/browse/SPARK-29567 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29567) Upgrade JDBC Integration Test Docker Images
Dongjoon Hyun created SPARK-29567: - Summary: Upgrade JDBC Integration Test Docker Images Key: SPARK-29567 URL: https://issues.apache.org/jira/browse/SPARK-29567 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29565) OneHotEncoder should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957664#comment-16957664 ] zhengruifeng commented on SPARK-29565: -- [~huaxingao] In [https://github.com/apache/spark/pull/26064], I guess you may be interested in these tickets (SPARK-29565/SPARK-29566). If you would like to work on this, please feel free to ping me in the PRs. > OneHotEncoder should support single-column input/output > -- > > Key: SPARK-29565 > URL: https://issues.apache.org/jira/browse/SPARK-29565 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > Current feature algs > ({color:#5a6e5a}QuantileDiscretizer/Binarizer/Bucketizer/StringIndexer{color}) > are designed to support both single-col & multi-col. > And there are already some internal utils (like > {color:#c7a65d}checkSingleVsMultiColumnParams{color}) for this. > For OneHotEncoder, it's reasonable to support single-col. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29566) Imputer should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-29566: - Description: Imputer should support single-column input/output; refer to https://issues.apache.org/jira/browse/SPARK-29565 was:Imputer should support single-column input/ouput > Imputer should support single-column input/output > > > Key: SPARK-29566 > URL: https://issues.apache.org/jira/browse/SPARK-29566 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > Imputer should support single-column input/output > refer to https://issues.apache.org/jira/browse/SPARK-29565 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29566) Imputer should support single-column input/output
zhengruifeng created SPARK-29566: Summary: Imputer should support single-column input/output Key: SPARK-29566 URL: https://issues.apache.org/jira/browse/SPARK-29566 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng Imputer should support single-column input/output -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29565) OneHotEncoder should support single-column input/output
zhengruifeng created SPARK-29565: Summary: OneHotEncoder should support single-column input/output Key: SPARK-29565 URL: https://issues.apache.org/jira/browse/SPARK-29565 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng Current feature algs ({color:#5a6e5a}QuantileDiscretizer/Binarizer/Bucketizer/StringIndexer{color}) are designed to support both single-col & multi-col. And there are already some internal utils (like {color:#c7a65d}checkSingleVsMultiColumnParams{color}) for this. For OneHotEncoder, it's reasonable to support single-col. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29564) Cluster deploy mode should support Spark Thrift server
[ https://issues.apache.org/jira/browse/SPARK-29564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29564: --- Description: Cluster deploy mode is not applicable to the Spark Thrift server since SPARK-21403. This restriction is too strict. In our production environment, we run multiple Spark Thrift servers as long-running services launched in yarn-cluster mode. The life cycle of each STS is managed by an upper-layer management system, which also dispatches users' JDBC connections to the appropriate STS. SPARK-21403 banned this case. was: Cluster deploy mode is not applicable to Spark Thrift server from [SPARK-21403|https://issues.apache.org/jira/browse/SPARK-21403]. This restriction is too rude. In our production, we use multiple Spark Thrift servers as long running services which are used yarn-cluster mode to launch. The life cycle of STS is managed by upper layer manager system which is also used to dispatcher user's JDBC connection to applicable STS. SPARK-21403 banned this case. > Cluster deploy mode should support Spark Thrift server > -- > > Key: SPARK-29564 > URL: https://issues.apache.org/jira/browse/SPARK-29564 > Project: Spark > Issue Type: Bug > Components: Spark Submit, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Cluster deploy mode is not applicable to the Spark Thrift server since > SPARK-21403. This restriction is too strict. > In our production environment, we run multiple Spark Thrift servers as long-running > services launched in yarn-cluster mode. The life cycle of each STS is > managed by an upper-layer management system, which also dispatches users' > JDBC connections to the appropriate STS. SPARK-21403 banned this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29564) Cluster deploy mode should support to launch Spark Thrift server
Lantao Jin created SPARK-29564: -- Summary: Cluster deploy mode should support to launch Spark Thrift server Key: SPARK-29564 URL: https://issues.apache.org/jira/browse/SPARK-29564 Project: Spark Issue Type: Bug Components: Spark Submit, SQL Affects Versions: 2.4.4, 3.0.0 Reporter: Lantao Jin Cluster deploy mode is not applicable to the Spark Thrift server since [SPARK-21403|https://issues.apache.org/jira/browse/SPARK-21403]. This restriction is too strict. In our production environment, we run multiple Spark Thrift servers as long-running services launched in yarn-cluster mode. The life cycle of each STS is managed by an upper-layer management system, which also dispatches users' JDBC connections to the appropriate STS. SPARK-21403 banned this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29564) Cluster deploy mode should support Spark Thrift server
[ https://issues.apache.org/jira/browse/SPARK-29564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29564: --- Summary: Cluster deploy mode should support Spark Thrift server (was: Cluster deploy mode should support to launch Spark Thrift server) > Cluster deploy mode should support Spark Thrift server > -- > > Key: SPARK-29564 > URL: https://issues.apache.org/jira/browse/SPARK-29564 > Project: Spark > Issue Type: Bug > Components: Spark Submit, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Cluster deploy mode is not applicable to the Spark Thrift server since > [SPARK-21403|https://issues.apache.org/jira/browse/SPARK-21403]. This > restriction is too strict. > In our production environment, we run multiple Spark Thrift servers as long-running > services launched in yarn-cluster mode. The life cycle of each STS is > managed by an upper-layer management system, which also dispatches users' > JDBC connections to the appropriate STS. SPARK-21403 banned this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21492) Memory leak in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-21492: Fix Version/s: 2.4.5 > Memory leak in SortMergeJoin > > > Key: SPARK-21492 > URL: https://issues.apache.org/jira/browse/SPARK-21492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.0, 2.3.1, 3.0.0 >Reporter: Zhan Zhang >Assignee: Yuanjian Li >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > In SortMergeJoin, if the iterator is not exhausted, there will be a memory leak > caused by the sort. The memory is not released until the task ends, and cannot > be used by other operators, causing a performance drop or OOM. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large
[ https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957623#comment-16957623 ] carlos yan commented on SPARK-24666: I also hit this issue, and my Spark version is 2.1.0. I trained on about 10 million records, with a vocabulary of about 1 million words. When numIterations > 10, the generated vectors contain *infinity* and *NaN*. > Word2Vec generate infinity vectors when numIterations are large > --- > > Key: SPARK-24666 > URL: https://issues.apache.org/jira/browse/SPARK-24666 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.3.1 > Environment: 2.0.X, 2.1.X, 2.2.X, 2.3.X >Reporter: ZhongYu >Priority: Critical > > We found that Word2Vec generates large-absolute-value vectors when > numIterations is large, and if numIterations is large enough (>20), the > vector's values may be *infinity (or -infinity)*, resulting in useless > vectors. > In normal situations, vector values are mainly around -1.0~1.0 when > numIterations = 1. > The bug is shown on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X. > There are already issues reporting this bug: > https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be > missing. > Other people's reports: > [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec] > [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
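Since diverged Word2Vec runs surface as infinities or NaNs in the exported vectors, a quick sanity check over the trained vectors catches the problem before they are used downstream. A minimal sketch in plain Python (the vector values here are illustrative, not from a real model):

```python
import math

def has_nonfinite(vec):
    """Return True if any component of the vector is NaN or +/- infinity."""
    return any(math.isnan(x) or math.isinf(x) for x in vec)

# Healthy vectors sit roughly in [-1.0, 1.0]; diverged ones contain inf/NaN.
print(has_nonfinite([0.12, -0.83, 0.37]))        # False
print(has_nonfinite([float("inf"), 0.1, -0.5]))  # True
print(has_nonfinite([float("nan"), 0.1, -0.5]))  # True
```

Applied to a model's exported word vectors, any True result indicates the training diverged, and lowering the iteration count (or step size) is the usual mitigation suggested in the linked reports.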
[jira] [Resolved] (SPARK-29324) saveAsTable with overwrite mode results in metadata loss
[ https://issues.apache.org/jira/browse/SPARK-29324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29324. -- Resolution: Not A Problem > saveAsTable with overwrite mode results in metadata loss > > > Key: SPARK-29324 > URL: https://issues.apache.org/jira/browse/SPARK-29324 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Karuppayya >Priority: Major > > {code:java} > scala> spark.range(1).write.option("path", > "file:///tmp/tbl").format("orc").saveAsTable("tbl") > scala> spark.sql("desc extended tbl").collect.foreach(println) > [id,bigint,null] > [,,] > [# Detailed Table Information,,] > [Database,default,] > [Table,tbl,] > [Owner,karuppayyar,] > [Created Time,Wed Oct 02 09:29:06 IST 2019,] > [Last Access,UNKNOWN,] > [Created By,Spark 3.0.0-SNAPSHOT,] > [Type,EXTERNAL,] > [Provider,orc,] > [Location,file:/tmp/tbl_loc,] > [Serde Library,org.apache.hadoop.hive.ql.io.orc.OrcSerde,] > [InputFormat,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,] > [OutputFormat,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,] > {code} > {code:java} > // Overwriting table > scala> spark.range(100).write.mode("overwrite").saveAsTable("tbl") > scala> spark.sql("desc extended tbl").collect.foreach(println) > [id,bigint,null] > [,,] > [# Detailed Table Information,,] > [Database,default,] > [Table,tbl,] > [Owner,karuppayyar,] > [Created Time,Wed Oct 02 09:30:36 IST 2019,] > [Last Access,UNKNOWN,] > [Created By,Spark 3.0.0-SNAPSHOT,] > [Type,MANAGED,] > [Provider,parquet,] > [Location,file:/tmp/tbl,] > [Serde Library,org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe,] > [InputFormat,org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat,] > [OutputFormat,org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat,] > {code} > > > The first code block creates an EXTERNAL table in Orc format > The second code block overwrites it with more data > After the overwrite, > 1. 
The external table became a managed table. > 2. The file format changed from ORC to Parquet (the default file format). > Other information (such as the owner and TBLPROPERTIES) was also overwritten. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
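As a hedged workaround sketch for the report above (assuming the same table name and `/tmp/tbl` path from the reproduction, and a live spark-shell session): re-stating the format and path on every overwrite keeps the re-created table external and ORC, instead of letting it fall back to the managed/Parquet defaults. This is illustrative only, not a confirmed fix from the ticket.

```scala
// Hedged sketch: re-specify format and path on overwrite so the table is
// re-created as EXTERNAL ORC rather than falling back to the defaults
// (MANAGED, parquet). Assumes a running spark-shell session and the
// /tmp/tbl path from the report above.
spark.range(100)
  .write
  .mode("overwrite")
  .format("orc")                       // keep the original file format
  .option("path", "file:///tmp/tbl")   // keep the original location
  .saveAsTable("tbl")
```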
[jira] [Resolved] (SPARK-29546) Recover jersey-guava test dependency in docker-integration-tests
[ https://issues.apache.org/jira/browse/SPARK-29546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29546. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26203 [https://github.com/apache/spark/pull/26203] > Recover jersey-guava test dependency in docker-integration-tests > > > Key: SPARK-29546 > URL: https://issues.apache.org/jira/browse/SPARK-29546 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > While SPARK-28737 upgrades `Jersey` to 2.29, `docker-integration-tests` is > broken because `com.spotify.docker-client` depends on `jersey-guava`. The > latest `com.spotify.docker-client` still depends on it as well. > - https://mvnrepository.com/artifact/com.spotify/docker-client/5.0.2 > -> > https://mvnrepository.com/artifact/org.glassfish.jersey.core/jersey-client/2.19 > -> > https://mvnrepository.com/artifact/org.glassfish.jersey.core/jersey-common/2.19 > -> > https://mvnrepository.com/artifact/org.glassfish.jersey.bundles.repackaged/jersey-guava/2.19 > **AFTER** > {code} > build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 > -Dtest=none > -DwildcardSuites=org.apache.spark.sql.jdbc.PostgresIntegrationSuite test > Tests: succeeded 6, failed 0, canceled 0, ignored 0, pending 0 > All tests passed. > {code}
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957603#comment-16957603 ] zhao bo commented on SPARK-29106: - Hi [~shaneknapp], sorry to disturb you. I have some questions about the upcoming work that I would like to discuss with you. # For the pyspark tests, you mentioned that we didn't install any Python packages for testing. Is there a "requirements.txt" or "test-requirements.txt" in the spark repo? I failed to find one. When we ran the pyspark tests before, we only realized we needed to install the numpy package with pip because the failure messages from the test scripts told us so. So when you mentioned "pyspark testing debs", did you mean we should figure them all out manually? Do you have any suggestions? # For the sparkR tests, we built a newer R version (3.6.1) by fixing many library dependencies and got it working, then ran the R test scripts until all of them passed. So we wonder what difficulties we will face when we actually run these tests in amplab; could you please share more with us? # For the current periodic jobs, you said they will be triggered twice per day, and each build will take at most 11 hours. I have a thought about the next job deployment and would like to hear your opinion: we could set up two jobs per day, one being the current Maven UT test triggered by SCM changes (11h), and the other running the pyspark and sparkR tests, also triggered by SCM changes (including the Spark build and tests, which may take 5-6 hours). How does that sound? We can discuss further if any of this turns out to be more difficult than we realize. Thanks very much, shane, and I hope you can reply when you are free. ;) > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > > Add arm test jobs to amplab jenkins for spark.
> So far we have made two periodic ARM test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test on amplab jenkins), > and the other is based on a new branch we created on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the ARM test with amplab > jenkins. > About the k8s test on ARM, we have tested it, see > [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it > later. > We also plan to test other stable branches, and we can integrate them into > amplab when they are ready. > We have offered an ARM instance and sent the details to shane knapp; thanks to > shane for adding the first ARM job to amplab jenkins :) > The other important thing is leveldbjni > [https://github.com/fusesource/leveldbjni] (see also [https://github.com/fusesource/leveldbjni/issues/80]): > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > which has no arm64 support. So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like a > 'property'/'profile' to choose the correct jar on ARM or x86, > because spark depends on hadoop packages such as hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases with a new > ARM-supporting leveldbjni jar. For now we download the leveldbjni-all-1.8 from > openlabtesting and 'mvn install' it when testing spark on ARM.
> PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > >
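The pom.xml limitation described in the thread above can be made concrete with a sketch. The artifact coordinates come from the comment; the profile itself is hypothetical and, as the thread explains, would not actually work for Spark because transitive Hadoop dependencies still pin the org.fusesource artifact.

```xml
<!-- Hypothetical sketch only: OS-arch-activated Maven profiles that swap the
     leveldbjni groupId. The thread notes this does NOT solve Spark's case,
     because hadoop-hdfs and friends still pull in
     org.fusesource.leveldbjni:leveldbjni-all:1.8 transitively. -->
<profiles>
  <profile>
    <id>arm64</id>
    <activation>
      <os><arch>aarch64</arch></os>
    </activation>
    <properties>
      <leveldbjni.group>org.openlabtesting.leveldbjni</leveldbjni.group>
    </properties>
  </profile>
  <profile>
    <id>x86</id>
    <activation>
      <os><arch>amd64</arch></os>
    </activation>
    <properties>
      <leveldbjni.group>org.fusesource.leveldbjni</leveldbjni.group>
    </properties>
  </profile>
</profiles>

<dependency>
  <groupId>${leveldbjni.group}</groupId>
  <artifactId>leveldbjni-all</artifactId>
  <version>1.8</version>
</dependency>
```

This is why the thread settles on `mvn install`-ing the openlabtesting artifact locally instead: profile-based selection cannot rewrite what Hadoop's own poms declare.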
[jira] [Assigned] (SPARK-29093) Remove automatically generated param setters in _shared_params_code_gen.py
[ https://issues.apache.org/jira/browse/SPARK-29093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29093: Assignee: Huaxin Gao > Remove automatically generated param setters in _shared_params_code_gen.py > -- > > Key: SPARK-29093 > URL: https://issues.apache.org/jira/browse/SPARK-29093 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Major > > The main difference between the Scala and Python sides comes from the automatically > generated param setters in _shared_params_code_gen.py. > To keep them in sync, we should remove those setters in _shared_.py, and add > the corresponding setters manually.
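The manual-setter pattern the ticket describes can be sketched in plain Python. The class and param names below are illustrative, not Spark's actual `_shared.py` code: the point is that each setter is written by hand, mirroring the Scala side, rather than emitted by `_shared_params_code_gen.py`.

```python
# Sketch of the manual-setter pattern from SPARK-29093: each mixin class
# defines its setter explicitly instead of having it code-generated.
# Names (HasMaxIter, setMaxIter, _paramMap) are illustrative only.
class HasMaxIter:
    """Mixin holding the maxIter param, with a hand-written setter."""

    def __init__(self):
        self._paramMap = {}

    def _set(self, **kwargs):
        """Store param values; return self to allow setter chaining."""
        self._paramMap.update(kwargs)
        return self

    def setMaxIter(self, value):
        """Sets the value of maxIter (manually written, not generated)."""
        return self._set(maxIter=value)


est = HasMaxIter().setMaxIter(10)
print(est._paramMap["maxIter"])
```

The design choice is the same trade-off the ticket names: a little duplication per class in exchange for setters that are visible in the source and easy to keep in sync with the Scala API.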
[jira] [Commented] (SPARK-29093) Remove automatically generated param setters in _shared_params_code_gen.py
[ https://issues.apache.org/jira/browse/SPARK-29093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957601#comment-16957601 ] zhengruifeng commented on SPARK-29093: -- [~huaxingao] Thanks! > Remove automatically generated param setters in _shared_params_code_gen.py > -- > > Key: SPARK-29093 > URL: https://issues.apache.org/jira/browse/SPARK-29093 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > The main difference between the Scala and Python sides comes from the automatically > generated param setters in _shared_params_code_gen.py. > To keep them in sync, we should remove those setters in _shared_.py, and add > the corresponding setters manually.
[jira] [Commented] (SPARK-23171) Reduce the time costs of the rule runs that do not change the plans
[ https://issues.apache.org/jira/browse/SPARK-23171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957593#comment-16957593 ] Takeshi Yamamuro commented on SPARK-23171: -- oh, nice, the performance looks much better. > Reduce the time costs of the rule runs that do not change the plans > > > Key: SPARK-23171 > URL: https://issues.apache.org/jira/browse/SPARK-23171 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > Labels: bulk-closed > > Below is the time stats of Analyzer/Optimizer rules. Try to improve the rules > and reduce the time costs, especially for the runs that do not change the > plans. > {noformat} > === Metrics of Analyzer/Optimizer Rules === > Total number of runs = 175827 > Total time: 20.699042877 seconds > Rule > Total Time Effective Time Total Runs > Effective Runs > org.apache.spark.sql.catalyst.optimizer.ColumnPruning > 2340563794 1338268224 1875 > 761 > org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution > 1632672623 1625071881 788 > 37 > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions > 1395087131 347339931 1982 > 38 > org.apache.spark.sql.catalyst.optimizer.PruneFilters > 1177711364 21344174 1590 > 3 > org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries > 1145135465 1131417128 285 > 39 > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences > 1008347217 663112062 1982 > 616 > org.apache.spark.sql.catalyst.optimizer.ReorderJoin > 767024424 693001699 1590 > 132 > org.apache.spark.sql.catalyst.analysis.Analyzer$FixNullability > 598524650 40802876 742 > 12 > org.apache.spark.sql.catalyst.analysis.DecimalPrecision > 595384169 436153128 1982 > 211 > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery > 548178270 459695885 1982 > 49 > org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts > 423002864 139869503 1982 > 86 > 
org.apache.spark.sql.catalyst.optimizer.BooleanSimplification > 405544962 17250184 1590 > 7 > org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin > 383837603 284174662 1590 > 708 > org.apache.spark.sql.catalyst.optimizer.RemoveRedundantAliases > 372901885 3362332 1590 > 9 > org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints > 364628214 343815519 285 > 192 > org.apache.spark.sql.execution.datasources.FindDataSourceTable > 303293296 285344766 1982 > 233 > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions > 233195019 92648171 1982 > 294 > org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion > 220568919 73932736 1982 > 38 > org.apache.spark.sql.catalyst.optimizer.NullPropagation > 207976072 9072305 1590 >