[jira] [Commented] (SPARK-25004) Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS

2020-03-19 Thread Xiaochen Ouyang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062333#comment-17062333
 ] 

Xiaochen Ouyang commented on SPARK-25004:
-

[~rdblue] This configuration only limits the worker.py process; the maximum 
memory of the child processes it forks cannot be limited by it.

Worker (JVM) --> Executor --> python daemon --> python daemon: the last python 
daemon process cannot be controlled by this configuration.
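
For reference, this is how the limit under discussion is enabled on the Spark 
side; a minimal sketch with arbitrary example values (the config was added by 
this ticket in 2.4.0):

{code:java}
// Minimal sketch: enable the Python-side memory limit introduced by SPARK-25004.
// The sizes below are arbitrary examples, not recommendations.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")          // JVM heap of the executor
  .set("spark.executor.pyspark.memory", "2g")  // applied as resource.RLIMIT_AS in the Python worker
{code}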

> Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS
> --
>
> Key: SPARK-25004
> URL: https://issues.apache.org/jira/browse/SPARK-25004
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> Some platforms support limiting Python's addressable memory space by limiting 
> [{{resource.RLIMIT_AS}}|https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS].
> We've found that adding a limit is very useful when running in YARN because 
> when Python doesn't know about memory constraints, it doesn't know when to 
> garbage collect and will continue using memory when it doesn't need to. 
> Adding a limit reduces PySpark memory consumption and avoids YARN killing 
> containers because Python hasn't cleaned up memory.
> This also improves error messages for users, allowing them to see when Python 
> is allocating too much memory instead of YARN killing the container:
> {code:lang=python}
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in 
> fe_engineer
> fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
> comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, 
> []), mat_rec_prep.get(item, []))
>   File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in 
> leven_list_compare
> permutations = sorted(permutations, reverse=True)
>   MemoryError
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30989) TABLE.COLUMN reference doesn't work with new columns created by UDF

2020-03-19 Thread hemanth meka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062332#comment-17062332
 ] 

hemanth meka commented on SPARK-30989:
--

[~cloud_fan] or [~viirya], can you confirm if this needs a fix? I can work on 
this if needed.

> TABLE.COLUMN reference doesn't work with new columns created by UDF
> ---
>
> Key: SPARK-30989
> URL: https://issues.apache.org/jira/browse/SPARK-30989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Chris Suchanek
>Priority: Major
>
> When a dataframe is created with an alias (`.as("...")`), its columns can be 
> referred to as `TABLE.COLUMN`, but this doesn't work for new columns created 
> with a UDF.
> {code:java}
> // code placeholder
> df1 = sc.parallelize(l).toDF("x","y").as("cat")
> val squared = udf((s: Int) => s * s)
> val df2 = df1.withColumn("z", squared(col("y")))
> df2.columns //Array[String] = Array(x, y, z)
> df2.select("cat.x") // works
> df2.select("cat.z") // Doesn't work
> // org.apache.spark.sql.AnalysisException: cannot resolve '`cat.z`' given 
> input 
> // columns: [cat.x, cat.y, z];;
> {code}
> Might be related to: https://issues.apache.org/jira/browse/SPARK-30532



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31189) Fix errors and missing parts for datetime pattern document

2020-03-19 Thread Kent Yao (Jira)
Kent Yao created SPARK-31189:


 Summary: Fix errors and missing parts for datetime pattern document
 Key: SPARK-31189
 URL: https://issues.apache.org/jira/browse/SPARK-31189
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


Fix errors and missing parts for datetime pattern document



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30989) TABLE.COLUMN reference doesn't work with new columns created by UDF

2020-03-19 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062355#comment-17062355
 ] 

Wenchen Fan commented on SPARK-30989:
-

https://github.com/apache/spark/pull/27916 can't fix it? I don't have a strong 
opinion, as there is no clear rule about how we retain the df alias after many 
transformations.
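
For anyone hitting this before a fix lands, a possible workaround (an untested 
sketch, reusing the names from the reproduction in the issue description) is to 
re-apply the alias after adding the column, so the alias wraps the whole plan 
including the new column:

{code:java}
// Sketch of a workaround: alias the DataFrame again after withColumn,
// so that "cat" also qualifies the new column z.
val df3 = df1.withColumn("z", squared(col("y"))).as("cat")
df3.select("cat.z")  // expected to resolve once the alias covers the full plan
{code}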

> TABLE.COLUMN reference doesn't work with new columns created by UDF
> ---
>
> Key: SPARK-30989
> URL: https://issues.apache.org/jira/browse/SPARK-30989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Chris Suchanek
>Priority: Major
>
> When a dataframe is created with an alias (`.as("...")`), its columns can be 
> referred to as `TABLE.COLUMN`, but this doesn't work for new columns created 
> with a UDF.
> {code:java}
> // code placeholder
> df1 = sc.parallelize(l).toDF("x","y").as("cat")
> val squared = udf((s: Int) => s * s)
> val df2 = df1.withColumn("z", squared(col("y")))
> df2.columns //Array[String] = Array(x, y, z)
> df2.select("cat.x") // works
> df2.select("cat.z") // Doesn't work
> // org.apache.spark.sql.AnalysisException: cannot resolve '`cat.z`' given 
> input 
> // columns: [cat.x, cat.y, z];;
> {code}
> Might be related to: https://issues.apache.org/jira/browse/SPARK-30532



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31188) Spark shell version miss match

2020-03-19 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062457#comment-17062457
 ] 

Kent Yao commented on SPARK-31188:
--

I guess you may have set SPARK_HOME to the wrong place.

> Spark shell version miss match
> --
>
> Key: SPARK-31188
> URL: https://issues.apache.org/jira/browse/SPARK-31188
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.0.0
> Environment: Tried on a standalone Ubuntu machine.
>Reporter: Timmanna Channal
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> Hi Team,
>  I downloaded the latest Spark 3.x tarball from the Spark website.
> When I try to run spark-shell, it reports version 2.4.4. Attaching a 
> screenshot.
>  
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31188) Spark shell version miss match

2020-03-19 Thread Timmanna Channal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062471#comment-17062471
 ] 

Timmanna Channal commented on SPARK-31188:
--

Hi Kent,

Thanks, it worked.

I had set SPARK_HOME to the Spark 2.4.4 installation. As you can see in the 
attachment, I was inside the Spark 3.x folder, so I didn't understand why it 
was picking up the spark-2.4.4 scripts.

> Spark shell version miss match
> --
>
> Key: SPARK-31188
> URL: https://issues.apache.org/jira/browse/SPARK-31188
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.0.0
> Environment: Tried on a standalone Ubuntu machine.
>Reporter: Timmanna Channal
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> Hi Team,
>  I downloaded the latest Spark 3.x tarball from the Spark website.
> When I try to run spark-shell, it reports version 2.4.4. Attaching a 
> screenshot.
>  
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31188) Spark shell version miss match

2020-03-19 Thread Timmanna Channal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062472#comment-17062472
 ] 

Timmanna Channal edited comment on SPARK-31188 at 3/19/20, 11:30 AM:
-

I am very new to the Spark issue space. Should I resolve the ticket?


was (Author: timmanna):
I am very new the the spark issue space. Should I resolve the ticket ?.

> Spark shell version miss match
> --
>
> Key: SPARK-31188
> URL: https://issues.apache.org/jira/browse/SPARK-31188
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.0.0
> Environment: Tried on a standalone Ubuntu machine.
>Reporter: Timmanna Channal
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> Hi Team,
>  I downloaded the latest Spark 3.x tarball from the Spark website.
> When I try to run spark-shell, it reports version 2.4.4. Attaching a 
> screenshot.
>  
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31188) Spark shell version miss match

2020-03-19 Thread Timmanna Channal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062472#comment-17062472
 ] 

Timmanna Channal commented on SPARK-31188:
--

I am very new to the Spark issue space. Should I resolve the ticket?

> Spark shell version miss match
> --
>
> Key: SPARK-31188
> URL: https://issues.apache.org/jira/browse/SPARK-31188
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.0.0
> Environment: Tried on a standalone Ubuntu machine.
>Reporter: Timmanna Channal
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> Hi Team,
>  I downloaded the latest Spark 3.x tarball from the Spark website.
> When I try to run spark-shell, it reports version 2.4.4. Attaching a 
> screenshot.
>  
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31187) Sort the whole-stage codegen debug output by codegenStageId

2020-03-19 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31187.
--
Fix Version/s: 3.0.0
 Assignee: Kris Mok
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/27955]

> Sort the whole-stage codegen debug output by codegenStageId
> ---
>
> Key: SPARK-31187
> URL: https://issues.apache.org/jira/browse/SPARK-31187
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Minor
> Fix For: 3.0.0
>
>
> Spark SQL's whole-stage codegen (WSCG) supports dumping the generated code to 
> help with debugging. One way to get the generated code is through 
> {{df.queryExecution.debug.codegen}}, or the SQL {{explain codegen}} statement.
> The generated code is currently printed without specific ordering, which can 
> make debugging a bit annoying. This ticket tracks a minor improvement to sort 
> the codegen dump by the {{codegenStageId}}, ascending.
> After this change, the following query:
> {code}
> spark.range(10).agg(sum('id)).queryExecution.debug.codegen
> {code}
> will always dump the generated code in a natural, stable order.
> The number of codegen stages within a single SQL query tends to be very 
> small, most likely < 50, so the overhead of adding the sorting shouldn't be 
> significant.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31190) ScalaReflection should erasure non user defined AnyVal type

2020-03-19 Thread wuyi (Jira)
wuyi created SPARK-31190:


 Summary: ScalaReflection should erasure non user defined AnyVal 
type
 Key: SPARK-31190
 URL: https://issues.apache.org/jira/browse/SPARK-31190
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: wuyi


We should do erasure for non-user-defined AnyVal types, and still do erasure 
for other types such as Any, which could give a better error message for the 
end user; only user-defined AnyVal types should be left unerased.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31191) Spark SQL and hive metastore are incompatible

2020-03-19 Thread leishuiyu (Jira)
leishuiyu created SPARK-31191:
-

 Summary: Spark SQL and hive metastore are incompatible
 Key: SPARK-31191
 URL: https://issues.apache.org/jira/browse/SPARK-31191
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
 Environment:  

the spark version 2.3.0

the hive version 2.3.3

 

 

 

 

 

 

 
Reporter: leishuiyu
 Fix For: 2.3.0


h3. 1. When I execute bin/spark-sql, an exception occurs

 
{code:java}
Caused by: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClientCaused by: 
java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
 at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
 at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
 at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
 at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005) 
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024) at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) ... 
12 moreCaused by: java.lang.reflect.InvocationTargetException at 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
 ... 18 moreCaused by: MetaException(message:Hive Schema version 1.2.0 does not 
match metastore's schema version 2.3.0 Metastore is not upgraded or corrupt) at 
org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:6679) 
at 
org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:6645)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:114) 
at com.sun.proxy.$Proxy6.verifySchema(Unknown Source) at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:572)
 at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
 at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
 at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66)
 at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
 at 
org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
 at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:199)
 at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
 ... 23 more
{code}
h3. 2. Find the reason

Looking at the source code, the Spark jars directory contains 
hive-metastore-1.2.1.spark2.jar, and version 1.2.1 is mapped to schema version 
1.2.0, which does not match the metastore's schema version 2.3.0, hence the 
exception.

 
{code:java}
// code placeholder
private static final Map<String, String> EQUIVALENT_VERSIONS =
ImmutableMap.of("0.13.1", "0.13.0",
"1.0.0", "0.14.0",
"1.0.1", "1.0.0",
"1.1.1", "1.1.0",
"1.2.1", "1.2.0"
);
{code}
 
h3. 3. Is there any solution to this problem?

You can edit hive-site.xml and set hive.metastore.schema.verification to true, 
but new problems may arise.
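
A different approach that is often used for this kind of mismatch (not from the 
report, and only a sketch) is to point Spark at a Hive metastore client version 
that matches the metastore instead of the built-in 1.2.1 jars, via 
spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars; whether a 
2.3.x client version is accepted depends on the Spark release in use:

{code:java}
// Sketch only: ask Spark to talk to the existing metastore with a matching
// client version. The version value and jar source below are illustrative.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("hive-metastore-version-example")
  .config("spark.sql.hive.metastore.version", "2.3.3")
  .config("spark.sql.hive.metastore.jars", "maven")  // or a classpath containing Hive 2.3.3 jars
  .enableHiveSupport()
  .getOrCreate()
{code}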

 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31192) Introduce PushProjectThroughLimit

2020-03-19 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-31192:


 Summary: Introduce PushProjectThroughLimit
 Key: SPARK-31192
 URL: https://issues.apache.org/jira/browse/SPARK-31192
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ali Afroozeh


Currently the {{CollapseProject}} rule does many things: not only does it 
collapse stacked projects, it also pushes projects down into limits, windows, 
etc. In this PR we factor the rules that push projects into limits out of 
{{CollapseProject}} and introduce a new rule called {{PushProjectThroughLimit}}.
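
As a rough illustration (not taken from the PR), this is the kind of plan shape 
the rule is concerned with: a project on top of a limit on top of another 
project. Comparing the analyzed and optimized plans shows how the projects are 
handled around the limit:

{code:java}
// Sketch only: build Project -> Limit -> Project and inspect the plans.
// explain(true) prints the analyzed and optimized logical plans, so the effect
// of the project-related rewrites around the limit can be observed.
import org.apache.spark.sql.functions.col

val df = spark.range(100)                      // assumes a spark-shell style `spark` session
  .select((col("id") + 1).as("a"))
  .limit(10)
  .select((col("a") * 2).as("b"))

df.explain(true)
{code}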



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31193) set spark.master and spark.app.name conf default value

2020-03-19 Thread daile (Jira)
daile created SPARK-31193:
-

 Summary: set spark.master and spark.app.name conf default value
 Key: SPARK-31193
 URL: https://issues.apache.org/jira/browse/SPARK-31193
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.0, 2.3.3, 2.3.0, 3.1.0
Reporter: daile
 Fix For: 3.1.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31193) set spark.master and spark.app.name conf default value

2020-03-19 Thread daile (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

daile updated SPARK-31193:
--
Description: 
I see how the default value of the master is set in the spark-submit client:
{code:java}
// Global defaults. These should be keep to minimum to avoid confusing behavior.
master = Option(master).getOrElse("local[*]")
{code}
but during development and debugging we run into this problem:

Exception in thread "main" org.apache.spark.SparkException: A master URL must 
be set in your configuration

This conflicts with the default setting:
{code:java}
// If we do
val sparkConf = new SparkConf().setAppName("app")
// When using the client to submit tasks to the cluster, the master will be
// overwritten by the local value
sparkConf.set("spark.master", "local[*]")
{code}
so we have to do this:
{code:java}
val sparkConf = new SparkConf().setAppName("app")
// Because a master set in the program takes priority, we first have to check
// whether spark.master is already set, so that a cluster submission is not
// forced to run locally.
sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]"))
{code}
The same applies to spark.app.name.

Would it be better to handle this for users the way the submit client does?
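
As an aside, a minimal sketch of an existing alternative (not proposed in this 
ticket): {{SparkConf.setIfMissing}} only applies a value when the key has not 
been set yet, so a local fallback does not override a master supplied by 
spark-submit:

{code:java}
// Sketch: provide local[*] as a fallback for local debugging without
// clobbering a master already set by spark-submit or the environment.
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("app")
  .setIfMissing("spark.master", "local[*]")
{code}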

> set spark.master and spark.app.name conf default value
> --
>
> Key: SPARK-31193
> URL: https://issues.apache.org/jira/browse/SPARK-31193
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.3, 2.4.0, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.1.0
>Reporter: daile
>Priority: Major
> Fix For: 3.1.0
>
>
>  
>  
> {code:java}
> // code placeholder
> {code}
> I see the default value of master setting in spark-submit client
>  
>  
> ```scala
>  // Global defaults. These should be keep to minimum to avoid confusing 
> behavior.
>  master = Option(master).getOrElse("local[*]")
>  ```
> but during our development and debugging, We will encounter this kind of 
> problem
> Exception in thread "main" org.apache.spark.SparkException: A master URL must 
> be set in your configuration
> This conflicts with the default setting
> ```scala
>  //If we do
>  val sparkConf = new SparkConf().setAppName(“app”)
>  //When using the client to submit tasks to the cluster, the matser will be 
> overwritten by the local
>  sparkConf.set("spark.master", "local[*]")
>  ```
> so we have to do like this
> ```scala
>  val sparkConf = new SparkConf().setAppName(“app”)
>  //Because the program runs to set the priority of the master, we have to 
> first determine whether to set the master to avoid submitting the cluster to 
> run.
>  sparkConf.set("spark.master",sparkConf.get("spark.master","local[*]"))
>  ```
> so is spark.app.name
> Is it better for users to handle it like submit client ?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31193) set spark.master and spark.app.name conf default value

2020-03-19 Thread daile (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

daile updated SPARK-31193:
--
Description: 
I see how the default value of the master is set in the spark-submit client:
{code:java}
// Global defaults. These should be keep to minimum to avoid confusing behavior.
master = Option(master).getOrElse("local[*]")
{code}
but during development and debugging we run into this problem:

Exception in thread "main" org.apache.spark.SparkException: A master URL must 
be set in your configuration

This conflicts with the default setting:
{code:java}
// If we do
val sparkConf = new SparkConf().setAppName("app")
// When using the client to submit tasks to the cluster, the master will be
// overwritten by the local value
sparkConf.set("spark.master", "local[*]")
{code}
so we have to do this:
{code:java}
val sparkConf = new SparkConf().setAppName("app")
// Because a master set in the program takes priority, we first have to check
// whether spark.master is already set, so that a cluster submission is not
// forced to run locally.
sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]"))
{code}
The same applies to spark.app.name.

Would it be better to handle this for users the way the submit client does?

  was:
 

 
{code:java}
// code placeholder
{code}
I see the default value of master setting in spark-submit client

 

 

```scala
 // Global defaults. These should be keep to minimum to avoid confusing 
behavior.
 master = Option(master).getOrElse("local[*]")
 ```

but during our development and debugging, We will encounter this kind of problem

Exception in thread "main" org.apache.spark.SparkException: A master URL must 
be set in your configuration

This conflicts with the default setting

```scala
 //If we do
 val sparkConf = new SparkConf().setAppName(“app”)
 //When using the client to submit tasks to the cluster, the matser will be 
overwritten by the local
 sparkConf.set("spark.master", "local[*]")
 ```

so we have to do like this

```scala
 val sparkConf = new SparkConf().setAppName(“app”)
 //Because the program runs to set the priority of the master, we have to first 
determine whether to set the master to avoid submitting the cluster to run.
 sparkConf.set("spark.master",sparkConf.get("spark.master","local[*]"))
 ```

so is spark.app.name

Is it better for users to handle it like submit client ?


> set spark.master and spark.app.name conf default value
> --
>
> Key: SPARK-31193
> URL: https://issues.apache.org/jira/browse/SPARK-31193
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.3, 2.4.0, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.1.0
>Reporter: daile
>Priority: Major
> Fix For: 3.1.0
>
>
> I see how the default value of the master is set in the spark-submit client:
> {code:java}
> // Global defaults. These should be keep to minimum to avoid confusing behavior.
> master = Option(master).getOrElse("local[*]")
> {code}
> but during development and debugging we run into this problem:
> Exception in thread "main" org.apache.spark.SparkException: A master URL must 
> be set in your configuration
> This conflicts with the default setting:
> {code:java}
> // If we do
> val sparkConf = new SparkConf().setAppName("app")
> // When using the client to submit tasks to the cluster, the master will be
> // overwritten by the local value
> sparkConf.set("spark.master", "local[*]")
> {code}
> so we have to do this:
> {code:java}
> val sparkConf = new SparkConf().setAppName("app")
> // Because a master set in the program takes priority, we first have to check
> // whether spark.master is already set, so that a cluster submission is not
> // forced to run locally.
> sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]"))
> {code}
> The same applies to spark.app.name.
> Would it be better to handle this for users the way the submit client does?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31194) spark sql runs successfully with query not specifying condition next to where

2020-03-19 Thread Ayoub Omari (Jira)
Ayoub Omari created SPARK-31194:
---

 Summary: spark sql runs successfully with query not specifying 
condition next to where 
 Key: SPARK-31194
 URL: https://issues.apache.org/jira/browse/SPARK-31194
 Project: Spark
  Issue Type: Story
  Components: SQL
Affects Versions: 2.4.5
Reporter: Ayoub Omari


Given a SQL query such as the following:

```
SELECT *
FROM people
WHERE
```

shouldn't we throw a parsing exception because the condition is unspecified?
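
A minimal, hypothetical reproduction of what is described above (table name and 
data are made up; the behaviour is as reported, not verified here):

{code:java}
// Hypothetical repro sketch for the reported behaviour on Spark 2.4.5,
// assuming a spark-shell style `spark` session.
import spark.implicits._

Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age").createOrReplaceTempView("people")

// According to the report, this does not fail with a parse error even though
// the WHERE clause has no condition.
spark.sql("SELECT * FROM people WHERE").show()
{code}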



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31194) spark sql runs successfully with query not specifying condition next to where

2020-03-19 Thread Ayoub Omari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayoub Omari updated SPARK-31194:

Description: 
Given a SQL query such as the following:

{color:#00875a}_SELECT *_{color}

{color:#00875a}_FROM people_{color}

{color:#00875a}_WHERE_{color}

shouldn't we throw a parsing exception because the condition is unspecified?

  was:
When having a sql query as follows:

```

SELECT *

FROM people

WHERE

```

shouldn't we throw a parsing exception because of __unspecified _condition_  _?_


> spark sql runs successfully with query not specifying condition next to where 
> --
>
> Key: SPARK-31194
> URL: https://issues.apache.org/jira/browse/SPARK-31194
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Ayoub Omari
>Priority: Major
>
> Given a SQL query such as the following:
> {color:#00875a}_SELECT *_{color}
> {color:#00875a}_FROM people_{color}
> {color:#00875a}_WHERE_{color}
> shouldn't we throw a parsing exception because the condition is unspecified?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31191) Spark SQL and hive metastore are incompatible

2020-03-19 Thread leishuiyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leishuiyu updated SPARK-31191:
--
Environment: 
 the spark version 2.3.0

the hive version 2.3.3

  was:
 

the spark version 2.3.0

the hive version 2.3.3

 

 

 

 

 

 

 


> Spark SQL and hive metastore are incompatible
> -
>
> Key: SPARK-31191
> URL: https://issues.apache.org/jira/browse/SPARK-31191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment:  the spark version 2.3.0
> the hive version 2.3.3
>Reporter: leishuiyu
>Priority: Major
> Fix For: 2.3.0
>
>
> h3. 1. When I execute bin/spark-sql, an exception occurs
>  
> {code:java}
> Caused by: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClientCaused by: 
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
>  at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
>  at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
>  at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
>  at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005) 
> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024) at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) 
> ... 12 moreCaused by: java.lang.reflect.InvocationTargetException at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
>  ... 18 moreCaused by: MetaException(message:Hive Schema version 1.2.0 does 
> not match metastore's schema version 2.3.0 Metastore is not upgraded or 
> corrupt) at 
> org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:6679)
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:6645)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:114) 
> at com.sun.proxy.$Proxy6.verifySchema(Unknown Source) at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:572)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
>  at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66)
>  at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:199)
>  at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
>  ... 23 more
> {code}
> h3. 2. Find the reason
> Looking at the source code, the Spark jars directory contains 
> hive-metastore-1.2.1.spark2.jar, and version 1.2.1 is mapped to schema version 
> 1.2.0, which does not match the metastore's schema version 2.3.0, hence the 
> exception.
>  
> {code:java}
> // code placeholder
> private static final Map<String, String> EQUIVALENT_VERSIONS =
> ImmutableMap.of("0.13.1", "0.13.0",
> "1.0.0", "0.14.0",
> "1.0.1", "1.0.0",
> "1.1.1", "1.1.0",
> "1.2.1", "1.2.0"
> );
> {code}
>  
> h3. 3. Is there any solution to this problem?
> You can edit hive-site.xml and set hive.metastore.schema.verification to true, 
> but new problems may arise.
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31173) Spark Kubernetes add tolerations and nodeName support

2020-03-19 Thread Jiaxin Shan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062785#comment-17062785
 ] 

Jiaxin Shan commented on SPARK-31173:
-

I am trying to get more details.

There are two levels of performance issues:
 # As every pod needs to be mutated by the webhook, overall throughput is 
dragged down.
 # Node selectors, tolerations, and node affinities have an impact on Kubernetes 
scheduler performance.

Could you clarify whether your benchmark difference reflects both of the points 
above?

BTW, tolerations should be supported via pod templates in the 3.0.0 release.
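
For context, a sketch of how the pod-template route can be wired up in 3.0 (the 
file paths are placeholders; the tolerations and nodeName themselves would live 
in the YAML templates):

{code:java}
// Sketch: point Spark 3.0+ at pod templates that carry tolerations/nodeName,
// avoiding the mutating webhook. Paths are illustrative placeholders.
val conf = new org.apache.spark.SparkConf()
  .set("spark.kubernetes.driver.podTemplateFile", "/templates/driver-pod-template.yaml")
  .set("spark.kubernetes.executor.podTemplateFile", "/templates/executor-pod-template.yaml")
{code}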

 

 

 

> Spark Kubernetes add tolerations and nodeName support
> -
>
> Key: SPARK-31173
> URL: https://issues.apache.org/jira/browse/SPARK-31173
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.1.0, 2.4.6
> Environment: Alibaba Cloud ACK with spark 
> operator(v1beta2-1.1.0-2.4.5) and spark(2.4.5)
>Reporter: zhongwei liu
>Priority: Trivial
>  Labels: features
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When you run Spark on a serverless Kubernetes cluster (virtual-kubelet), you 
> need to specify nodeSelectors, tolerations, and even nodeName to get better 
> scheduling performance. Currently Spark doesn't support tolerations. If you 
> want to use this feature, you must use an admission controller webhook to 
> decorate the pod, but its performance is extremely bad. Here is the benchmark:
> With webhook: 
> Batch size: 500, pod creation: about 7 pods/s, all pods running: 5 min
> Without webhook: 
> Batch size: 500, pod creation: more than 500 pods/s, all pods running: 45 s
> Adding tolerations and nodeName support to Spark will be a great help when 
> running large-scale jobs on a serverless Kubernetes cluster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2020-03-19 Thread Udit Mehrotra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062898#comment-17062898
 ] 

Udit Mehrotra commented on SPARK-29767:
---

[~hyukjin.kwon] Can you take a look at it? There has been no activity on this 
for months now. I have provided the executor dump. Please let me know if there 
is any more information I can provide to help drive this.

> Core dump happening on executors while doing simple union of Data Frames
> 
>
> Key: SPARK-29767
> URL: https://issues.apache.org/jira/browse/SPARK-29767
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.4
> Environment: AWS EMR 5.27.0, Spark 2.4.4
>Reporter: Udit Mehrotra
>Priority: Major
> Attachments: coredump.zip, hs_err_pid13885.log, 
> part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet
>
>
> Running a union operation on two DataFrames through both the Scala Spark 
> shell and PySpark results in executor containers doing a *core dump* and 
> exiting with exit code 134.
> The trace from the *Driver*:
> {noformat}
> Container exited with a non-zero exit code 134
> .
> 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure 
> (executor 11 exited caused by one of the running tasks) Reason: Container 
> from a bad node: container_1572981097605_0021_01_77 on host: 
> ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_1572981097605_0021_01_77
> Exit code: 134
> Exception message: /bin/bash: line 1: 12611 Aborted 
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack
>  trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted  
>
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' 
> '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' 
> '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' 
> -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' 
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77
>  org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 
> --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id 
> application_1572981097605_0021 --user-class-path 
> file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar
>  > 
> /var/log/

[jira] [Created] (SPARK-31195) Reuse days rebase functions of DateTimeUtils in DaysWritable

2020-03-19 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31195:
--

 Summary: Reuse days rebase functions of DateTimeUtils in 
DaysWritable
 Key: SPARK-31195
 URL: https://issues.apache.org/jira/browse/SPARK-31195
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


The functions rebaseJulianToGregorianDays() and rebaseGregorianToJulianDays() 
were added by the PR https://github.com/apache/spark/pull/27915. This ticket 
aims to replace similar code in org.apache.spark.sql.hive.DaysWritable with 
these functions in order to:
# deduplicate code, and
# reuse functions that are better tested and cross-checked by reading Parquet 
files saved by Spark 2.4. A sketch of the intended reuse follows below.
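
A sketch of the intended reuse, assuming the helpers live in DateTimeUtils as 
described and take/return days as Int (the exact signatures are an assumption, 
not verified here):

{code:java}
// Sketch only: DaysWritable delegating to the shared rebase helpers instead of
// keeping its own copy. The Int => Int signatures are assumed from the ticket.
import org.apache.spark.sql.catalyst.util.DateTimeUtils

def toProlepticGregorianDays(julianDays: Int): Int =
  DateTimeUtils.rebaseJulianToGregorianDays(julianDays)

def toJulianDays(gregorianDays: Int): Int =
  DateTimeUtils.rebaseGregorianToJulianDays(gregorianDays)
{code}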



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31196) Server-side processing of History UI list of applications

2020-03-19 Thread Jira
Pavol Vidlička created SPARK-31196:
--

 Summary: Server-side processing of History UI list of applications
 Key: SPARK-31196
 URL: https://issues.apache.org/jira/browse/SPARK-31196
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.3.0
Reporter: Pavol Vidlička


Loading the list of applications in the History UI does not scale well with a 
large number of applications. Fetching and rendering the list for 10k+ 
applications takes over a minute.

Using `spark.history.ui.maxApplications` is not a great solution, because (as 
the name implies), it limits the number of applications shown in the UI, which 
hinders usability of the History Server.

A solution would be to use [server-side processing of the 
DataTable|https://datatables.net/examples/data_sources/server_side]. This 
would limit the amount of data sent to the client and processed by the browser.

This proposed change plays nicely with the KVStore abstraction implemented in 
SPARK-18085, which was supposed to solve some of the scalability issues.
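
As a point of reference (an aside, not part of the proposal): the History 
Server REST API can already return a bounded application list, which hints at 
how a server-side-processed table could fetch pages; the host, port, and 
availability of the limit parameter depend on the deployment and Spark version:

{code:java}
// Rough sketch: fetch only the first 100 applications from the History Server
// REST API instead of the full list. The URL and the limit parameter support
// are assumptions about the deployment and Spark version.
val json = scala.io.Source
  .fromURL("http://history-server:18080/api/v1/applications?limit=100")
  .mkString
{code}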



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31196) Server-side processing of History UI list of applications

2020-03-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-31196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavol Vidlička updated SPARK-31196:
---
Description: 
Loading the list of applications in the History UI does not scale well with a 
large number of applications. Fetching and rendering the list for 10k+ 
applications takes over a minute.

Using `spark.history.ui.maxApplications` is not a great solution, because (as 
the name implies), it limits the number of applications shown in the UI, which 
hinders usability of the History Server.

A solution would be to use [server-side processing of the 
DataTable|https://datatables.net/examples/data_sources/server_side]. This would 
limit the amount of data sent to the client and processed by the browser.

This proposed change plays nicely with the KVStore abstraction implemented in 
SPARK-18085, which was supposed to solve some of the scalability issues.

  was:
Loading the list of applications in the History UI does not scale well with a 
large number of applications. Fetching and rendering the list for 10k+ 
applications takes over a minute.

Using `spark.history.ui.maxApplications` is not a great solution, because (as 
the name implies), it limits the number of applications shown in the UI, which 
hinders usability of the History Server.

A solution would be to use [server side processing of the 
DataTable](https://datatables.net/examples/data_sources/server_side). This 
would limit amount of data sent to the client and processed by the browser.

This proposed change plays nicely with KVStore abstraction implemented in 
SPARK-18085, which was supposed to solve some of the scalability issues.


> Server-side processing of History UI list of applications
> -
>
> Key: SPARK-31196
> URL: https://issues.apache.org/jira/browse/SPARK-31196
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Pavol Vidlička
>Priority: Minor
>
> Loading the list of applications in the History UI does not scale well with a 
> large number of applications. Fetching and rendering the list for 10k+ 
> applications takes over a minute.
> Using `spark.history.ui.maxApplications` is not a great solution, because (as 
> the name implies), it limits the number of applications shown in the UI, 
> which hinders usability of the History Server.
> A solution would be to use [server-side processing of the 
> DataTable|https://datatables.net/examples/data_sources/server_side]. This 
> would limit the amount of data sent to the client and processed by the browser.
> This proposed change plays nicely with the KVStore abstraction implemented in 
> SPARK-18085, which was supposed to solve some of the scalability issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31196) Server-side processing of History UI list of applications

2020-03-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-31196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavol Vidlička updated SPARK-31196:
---
Description: 
Loading the list of applications in the History UI does not scale well with a 
large number of applications. Fetching and rendering the list for 10k+ 
applications takes over a minute.

Using `spark.history.ui.maxApplications` is not a great solution, because (as 
the name implies), it limits the number of applications shown in the UI, which 
hinders usability of the History Server.

A solution would be to use [server-side processing of the 
DataTable|https://datatables.net/examples/data_sources/server_side]. This would 
limit the amount of data sent to the client and processed by the browser.

This proposed change plays nicely with the KVStore abstraction implemented in 
SPARK-18085, which was supposed to solve some of the scalability issues. It 
could also address the History UI scalability issues reported, for example, in 
SPARK-21254, SPARK-17243, and SPARK-17671.

  was:
Loading the list of applications in the History UI does not scale well with a 
large number of applications. Fetching and rendering the list for 10k+ 
applications takes over a minute.

Using `spark.history.ui.maxApplications` is not a great solution, because (as 
the name implies), it limits the number of applications shown in the UI, which 
hinders usability of the History Server.

A solution would be to use server [side processing of the 
DataTable|https://datatables.net/examples/data_sources/server_side]. This would 
limit amount of data sent to the client and processed by the browser.

This proposed change plays nicely with KVStore abstraction implemented in 
SPARK-18085, which was supposed to solve some of the scalability issues.


> Server-side processing of History UI list of applications
> -
>
> Key: SPARK-31196
> URL: https://issues.apache.org/jira/browse/SPARK-31196
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Pavol Vidlička
>Priority: Minor
>
> Loading the list of applications in the History UI does not scale well with a 
> large number of applications. Fetching and rendering the list for 10k+ 
> applications takes over a minute.
> Using `spark.history.ui.maxApplications` is not a great solution, because (as 
> the name implies), it limits the number of applications shown in the UI, 
> which hinders usability of the History Server.
> A solution would be to use [server-side processing of the 
> DataTable|https://datatables.net/examples/data_sources/server_side]. This 
> would limit the amount of data sent to the client and processed by the browser.
> This proposed change plays nicely with the KVStore abstraction implemented in 
> SPARK-18085, which was supposed to solve some of the scalability issues. It 
> could also address the History UI scalability issues reported, for example, in 
> SPARK-21254, SPARK-17243, and SPARK-17671.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31196) Server-side processing of History UI list of applications

2020-03-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-31196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavol Vidlička updated SPARK-31196:
---
Affects Version/s: 2.4.5

> Server-side processing of History UI list of applications
> -
>
> Key: SPARK-31196
> URL: https://issues.apache.org/jira/browse/SPARK-31196
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.5
>Reporter: Pavol Vidlička
>Priority: Minor
>
> Loading the list of applications in the History UI does not scale well with a 
> large number of applications. Fetching and rendering the list for 10k+ 
> applications takes over a minute.
> Using `spark.history.ui.maxApplications` is not a great solution, because (as 
> the name implies), it limits the number of applications shown in the UI, 
> which hinders usability of the History Server.
> A solution would be to use [server-side processing of the 
> DataTable|https://datatables.net/examples/data_sources/server_side]. This 
> would limit the amount of data sent to the client and processed by the browser.
> This proposed change plays nicely with the KVStore abstraction implemented in 
> SPARK-18085, which was supposed to solve some of the scalability issues. It 
> could also address the History UI scalability issues reported, for example, in 
> SPARK-21254, SPARK-17243, and SPARK-17671.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30981) Fix flaky "Test basic decommissioning" test

2020-03-19 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-30981.
--
Fix Version/s: 3.1.0
 Assignee: Holden Karau
   Resolution: Fixed

> Fix flaky "Test basic decommissioning" test
> ---
>
> Key: SPARK-30981
> URL: https://issues.apache.org/jira/browse/SPARK-30981
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.1.0
>
>
> - https://github.com/apache/spark/pull/27721
> {code}
> - Test basic decommissioning *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 126 times 
> over 2.010095245067 minutes. Last failure message: "++ id -u
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30981) Fix flaky "Test basic decommissioning" test

2020-03-19 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062942#comment-17062942
 ] 

Holden Karau commented on SPARK-30981:
--

I believe this was resolved in [https://github.com/apache/spark/pull/27905]

> Fix flaky "Test basic decommissioning" test
> ---
>
> Key: SPARK-30981
> URL: https://issues.apache.org/jira/browse/SPARK-30981
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - https://github.com/apache/spark/pull/27721
> {code}
> - Test basic decommissioning *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 126 times 
> over 2.010095245067 minutes. Last failure message: "++ id -u
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20629) Copy shuffle data when nodes are being shut down using PVs

2020-03-19 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-20629:
-
Summary: Copy shuffle data when nodes are being shut down using PVs  (was: 
Copy shuffle data when nodes are being shut down)

> Copy shuffle data when nodes are being shut down using PVs
> --
>
> Key: SPARK-20629
> URL: https://issues.apache.org/jira/browse/SPARK-20629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>
> We decided not to do this for YARN, but for EC2/GCE and similar systems nodes 
> may be shut down entirely without the ability to keep an AuxiliaryService 
> around.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20629) Copy shuffle data when nodes are being shut down using PVs

2020-03-19 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-20629:
-
Component/s: (was: Spark Core)
 Kubernetes

> Copy shuffle data when nodes are being shut down using PVs
> --
>
> Key: SPARK-20629
> URL: https://issues.apache.org/jira/browse/SPARK-20629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>
> We decided not to do this for YARN, but for EC2/GCE and similar systems nodes 
> may be shut down entirely without the ability to keep an AuxiliaryService 
> around.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20629) Copy shuffle data when nodes are being shut down using PVs

2020-03-19 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-20629:
-
Description: We decided not to do this for YARN, but for Kubernetes and 
similar systems nodes may be shut down entirely without the ability to keep an 
AuxiliaryService around.  (was: We decided not to do this for YARN, but for 
EC2/GCE and similar systems nodes may be shut down entirely without the ability 
to keep an AuxiliaryService around.)

> Copy shuffle data when nodes are being shut down using PVs
> --
>
> Key: SPARK-20629
> URL: https://issues.apache.org/jira/browse/SPARK-20629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>
> We decided not to do this for YARN, but for Kubernetes and similar systems 
> nodes may be shut down entirely without the ability to keep an 
> AuxiliaryService around.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30836) Improve the decommissioning K8s integration tests

2020-03-19 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062944#comment-17062944
 ] 

Holden Karau commented on SPARK-30836:
--

Resolved in [https://github.com/apache/spark/pull/27905]

> Improve the decommissioning K8s integration tests
> -
>
> Key: SPARK-30836
> URL: https://issues.apache.org/jira/browse/SPARK-30836
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
> Fix For: 3.1.0
>
>
> See [https://github.com/apache/spark/pull/26440#discussion_r373155825] & 
> [https://github.com/apache/spark/pull/26440#discussion_r373153511] 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30836) Improve the decommissioning K8s integration tests

2020-03-19 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-30836.
--
  Assignee: Holden Karau
Resolution: Fixed

> Improve the decommissioning K8s integration tests
> -
>
> Key: SPARK-30836
> URL: https://issues.apache.org/jira/browse/SPARK-30836
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
>
> See [https://github.com/apache/spark/pull/26440#discussion_r373155825] & 
> [https://github.com/apache/spark/pull/26440#discussion_r373153511] 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30836) Improve the decommissioning K8s integration tests

2020-03-19 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-30836:
-
Fix Version/s: 3.1.0

> Improve the decommissioning K8s integration tests
> -
>
> Key: SPARK-30836
> URL: https://issues.apache.org/jira/browse/SPARK-30836
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
> Fix For: 3.1.0
>
>
> See [https://github.com/apache/spark/pull/26440#discussion_r373155825] & 
> [https://github.com/apache/spark/pull/26440#discussion_r373153511] 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31196) Server-side processing of History UI list of applications

2020-03-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-31196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavol Vidlička updated SPARK-31196:
---
Description: 
Loading the list of applications in the History UI does not scale well with a 
large number of applications. Fetching and rendering the list for 10k+ 
applications takes over a minute (much longer for more applications) and tends 
to freeze the browser.

Using `spark.history.ui.maxApplications` is not a great solution, because (as 
the name implies), it limits the number of applications shown in the UI, which 
hinders usability of the History Server.

A solution would be to use [server-side processing of the 
DataTable|https://datatables.net/examples/data_sources/server_side]. This would 
limit the amount of data sent to the client and processed by the browser.

This proposed change plays nicely with the KVStore abstraction implemented in 
SPARK-18085, which was supposed to solve some of the scalability issues. It 
could also solve the History UI scalability issues reported, for example, in 
SPARK-21254, SPARK-17243, and SPARK-17671.

  was:
Loading the list of applications in the History UI does not scale well with a 
large number of applications. Fetching and rendering the list for 10k+ 
applications takes over a minute.

Using `spark.history.ui.maxApplications` is not a great solution, because (as 
the name implies), it limits the number of applications shown in the UI, which 
hinders usability of the History Server.

A solution would be to use server [side processing of the 
DataTable|https://datatables.net/examples/data_sources/server_side]. This would 
limit amount of data sent to the client and processed by the browser.

This proposed change plays nicely with KVStore abstraction implemented in 
SPARK-18085, which was supposed to solve some of the scalability issues. It 
could also definitely solve History UI scalability issues reported for example 
in SPARK-21254, SPARK-17243, SPARK-17671


> Server-side processing of History UI list of applications
> -
>
> Key: SPARK-31196
> URL: https://issues.apache.org/jira/browse/SPARK-31196
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.5
>Reporter: Pavol Vidlička
>Priority: Minor
>
> Loading the list of applications in the History UI does not scale well with a 
> large number of applications. Fetching and rendering the list for 10k+ 
> applications takes over a minute (much longer for more applications) and 
> tends to freeze the browser.
> Using `spark.history.ui.maxApplications` is not a great solution, because (as 
> the name implies), it limits the number of applications shown in the UI, 
> which hinders usability of the History Server.
> A solution would be to use [server-side processing of the 
> DataTable|https://datatables.net/examples/data_sources/server_side]. This 
> would limit the amount of data sent to the client and processed by the browser.
> This proposed change plays nicely with the KVStore abstraction implemented in 
> SPARK-18085, which was supposed to solve some of the scalability issues. It 
> could also solve the History UI scalability issues reported, for example, in 
> SPARK-21254, SPARK-17243, and SPARK-17671.
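
As a rough illustration of what server-side processing buys, a minimal sketch of applying DataTables-style paging parameters (`start`, `length`, `search`) to an in-memory listing. `AppSummary` and `page` are hypothetical stand-ins, not History Server APIs.

{code:java}
// Hypothetical sketch: filter, sort and slice on the server so the browser only
// receives one page of applications instead of the full 10k+ list.
case class AppSummary(id: String, name: String, endTimeMs: Long)

def page(apps: Seq[AppSummary], start: Int, length: Int, search: String): (Int, Seq[AppSummary]) = {
  val filtered =
    if (search.isEmpty) apps
    else apps.filter(a => a.id.contains(search) || a.name.contains(search))
  // The total filtered count is returned as well, so the client can render the pager.
  (filtered.size, filtered.sortBy(-_.endTimeMs).slice(start, start + length))
}
{code}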



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31197) Exit the executor once all tasks & migrations are finished

2020-03-19 Thread Holden Karau (Jira)
Holden Karau created SPARK-31197:


 Summary: Exit the executor once all tasks & migrations are finished
 Key: SPARK-31197
 URL: https://issues.apache.org/jira/browse/SPARK-31197
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: 3.1.0
Reporter: Holden Karau






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31198) Use graceful decommissioning as part of dynamic scaling

2020-03-19 Thread Holden Karau (Jira)
Holden Karau created SPARK-31198:


 Summary: Use graceful decommissioning as part of dynamic scaling
 Key: SPARK-31198
 URL: https://issues.apache.org/jira/browse/SPARK-31198
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: 3.1.0
Reporter: Holden Karau
Assignee: Holden Karau






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-31170) Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-31170:
---

This is reverted because it broke all `hive-1.2` profile Jenkins jobs (2 SBT / 
2 Maven).

> Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir
> 
>
> Key: SPARK-31170
> URL: https://issues.apache.org/jira/browse/SPARK-31170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> In the Spark CLI, we create a Hive CliSessionState and it does not load 
> hive-site.xml. So the configurations in hive-site.xml will not take effect as 
> they do in other spark-hive integration apps.
> Also, the warehouse directory is not picked up correctly. If the `default` 
> database does not exist, the CliSessionState will create one the first time it 
> talks to the metastore. The `Location` of the default DB will be neither the 
> value of spark.sql.warehouse.dir nor the user-specified value of 
> hive.metastore.warehouse.dir, but the default value of 
> hive.metastore.warehouse.dir, which will always be `/user/hive/warehouse`.
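
For reference, a quick way to check which warehouse directory actually took effect in a Hive-enabled session; a minimal sketch (the app name is illustrative):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-check")
  .enableHiveSupport()
  .getOrCreate()

// What Spark thinks the warehouse is ...
println(spark.conf.get("spark.sql.warehouse.dir"))
// ... versus where the `default` database actually points in the metastore.
spark.sql("DESCRIBE DATABASE EXTENDED default").show(truncate = false)
{code}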



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31170) Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31170:
--
Fix Version/s: (was: 3.0.0)

> Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir
> 
>
> Key: SPARK-31170
> URL: https://issues.apache.org/jira/browse/SPARK-31170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> In the Spark CLI, we create a Hive CliSessionState and it does not load 
> hive-site.xml. So the configurations in hive-site.xml will not take effect as 
> they do in other spark-hive integration apps.
> Also, the warehouse directory is not picked up correctly. If the `default` 
> database does not exist, the CliSessionState will create one the first time it 
> talks to the metastore. The `Location` of the default DB will be neither the 
> value of spark.sql.warehouse.dir nor the user-specified value of 
> hive.metastore.warehouse.dir, but the default value of 
> hive.metastore.warehouse.dir, which will always be `/user/hive/warehouse`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30931) ML 3.0 QA: API: Python API coverage

2020-03-19 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30931:


Assignee: Huaxin Gao

> ML 3.0 QA: API: Python API coverage
> ---
>
> Key: SPARK-30931
> URL: https://issues.apache.org/jira/browse/SPARK-30931
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Major
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
>  * *GOAL*: Audit and create JIRAs to fix in the next release.
>  * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
>  * Inconsistency: Do class/method/parameter names match?
>  * Docs: Is the Python doc missing or just a stub? We want the Python doc to 
> be as complete as the Scala doc.
>  * API breaking changes: These should be very rare but are occasionally 
> either necessary (intentional) or accidental. These must be recorded and 
> added in the Migration Guide for this release.
>  ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
>  * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle. 
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*
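
One low-tech way to compare names is to dump an estimator's params and public methods on the Scala side and eyeball them against the corresponding Python class; a sketch (LogisticRegression is just an example):

{code:java}
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()

// Param names as exposed on the Scala side
// (compare with pyspark.ml.classification.LogisticRegression).
lr.params.map(_.name).sorted.foreach(println)

// Public method names, for spotting missing setters/getters in the Python wrapper.
classOf[LogisticRegression].getMethods.map(_.getName).distinct.sorted.foreach(println)
{code}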



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30935) Update MLlib, GraphX websites for 3.0

2020-03-19 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30935:


Assignee: Huaxin Gao

> Update MLlib, GraphX websites for 3.0
> -
>
> Key: SPARK-30935
> URL: https://issues.apache.org/jira/browse/SPARK-30935
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25121) Support multi-part column name for hint resolution

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25121.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27935
[https://github.com/apache/spark/pull/27935]

> Support multi-part column name for hint resolution
> --
>
> Key: SPARK-25121
> URL: https://issues.apache.org/jira/browse/SPARK-25121
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>
> After supporting multi-part names in 
> https://github.com/apache/spark/pull/17185, we also need to consider how to 
> resolve such names in broadcast hints.
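
For context, a sketch of the general shape of a broadcast hint in SQL and in the Dataset API. Whether a qualified (multi-part) name such as `db.t` resolves inside the hint is exactly what this ticket tracks, so treat that form as illustrative; the tables, the `id` column, and the active SparkSession `spark` are assumptions.

{code:java}
import org.apache.spark.sql.functions.broadcast

// SQL hint form; the qualified name inside the hint is the case under discussion.
spark.sql("SELECT /*+ BROADCAST(db.t) */ * FROM db.t JOIN s ON t.id = s.id")

// Equivalent Dataset API form.
val joined = spark.table("db.t").join(broadcast(spark.table("s")), "id")
{code}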



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30932) ML 3.0 QA: API: Java compatibility, docs

2020-03-19 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30932:


Assignee: zhengruifeng

> ML 3.0 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-30932
> URL: https://issues.apache.org/jira/browse/SPARK-30932
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Attachments: 1_process_script.sh, added_ml_class, common_ml_class, 
> signature.diff
>
>
> Check Java compatibility for this release:
>  * APIs in {{spark.ml}}
>  * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
>  * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
>  ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
>  *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
>  ** Check Scala objects (especially with nesting!) carefully. These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
>  ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc. (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
>  * Check for differences in generated Scala vs Java docs. E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
>  * Remember that we should not break APIs from previous releases. If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
>  * If needed for complex issues, create small Java unit tests which execute 
> each method. (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
>  * There are not great tools. In the past, this task has been done by:
>  ** Generating API docs
>  ** Building JAR and outputting the Java class signatures for MLlib
>  ** Manually inspecting and searching the docs and class signatures for issues
>  * If you do have ideas for better tooling, please say so, so that we can 
> make this task easier in the future!
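
A small helper along the lines of the "output the Java class signatures" step; a sketch only (the class choice is arbitrary, and `javap` on the built JAR remains the authoritative check):

{code:java}
import org.apache.spark.ml.classification.LogisticRegression

// Print the generic Java signatures the class exposes, to scan for Object-typed
// parameters, Scala-only types, or $-mangled names that Java callers would see.
classOf[LogisticRegression]
  .getMethods
  .map(_.toGenericString)
  .sorted
  .foreach(println)
{code}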



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30932) ML 3.0 QA: API: Java compatibility, docs

2020-03-19 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30932.
--
Resolution: Fixed

> ML 3.0 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-30932
> URL: https://issues.apache.org/jira/browse/SPARK-30932
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Attachments: 1_process_script.sh, added_ml_class, common_ml_class, 
> signature.diff
>
>
> Check Java compatibility for this release:
>  * APIs in {{spark.ml}}
>  * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
>  * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
>  ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
>  *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
>  ** Check Scala objects (especially with nesting!) carefully. These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
>  ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc. (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
>  * Check for differences in generated Scala vs Java docs. E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
>  * Remember that we should not break APIs from previous releases. If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
>  * If needed for complex issues, create small Java unit tests which execute 
> each method. (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
>  * There are not great tools. In the past, this task has been done by:
>  ** Generating API docs
>  ** Building JAR and outputting the Java class signatures for MLlib
>  ** Manually inspecting and searching the docs and class signatures for issues
>  * If you do have ideas for better tooling, please say so, so that we can 
> make this task easier in the future!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25121) Support multi-part column name for hint resolution

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25121:
-

Assignee: Takeshi Yamamuro

> Support multi-part column name for hint resolution
> --
>
> Key: SPARK-25121
> URL: https://issues.apache.org/jira/browse/SPARK-25121
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> After supporting multi-part names in 
> https://github.com/apache/spark/pull/17185, we also need to consider how to 
> resolve such names in broadcast hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25121) Support multi-part column name for hint resolution

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25121:
--
Affects Version/s: (was: 3.1.0)
   3.0.0

> Support multi-part column name for hint resolution
> --
>
> Key: SPARK-25121
> URL: https://issues.apache.org/jira/browse/SPARK-25121
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>
> After supporting multi-part names in 
> https://github.com/apache/spark/pull/17185, we also need to consider how to 
> resolve such names in broadcast hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26293) Cast exception when having python udf in subquery

2020-03-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26293:
-
Fix Version/s: (was: 2.4.1)
   2.4.6

> Cast exception when having python udf in subquery
> -
>
> Key: SPARK-26293
> URL: https://issues.apache.org/jira/browse/SPARK-26293
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0, 2.4.6
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30951:
--
Labels:   (was: correctness)

> Potential data loss for legacy applications after switch to proleptic 
> Gregorian calendar
> 
>
> Key: SPARK-30951
> URL: https://issues.apache.org/jira/browse/SPARK-30951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Priority: Blocker
>
> tl;dr: We recently discovered some Spark 2.x sites that have lots of data 
> containing dates before October 15, 1582. This could be an issue when such 
> sites try to upgrade to Spark 3.0.
> From SPARK-26651:
> {quote}"The changes might impact on the results for dates and timestamps 
> before October 15, 1582 (Gregorian)
> {quote}
> We recently discovered that some large scale Spark 2.x applications rely on 
> dates before October 15, 1582.
> Two cases came up recently:
>  * An application that uses a commercial third-party library to encode 
> sensitive dates. On insert, the library encodes the actual date as some other 
> date. On select, the library decodes the date back to the original date. The 
> encoded value could be any date, including one before October 15, 1582 (e.g., 
> "0602-04-04").
>  * An application that uses a specific unlikely date (e.g., "1200-01-01") as 
> a marker to indicate "unknown date" (in lieu of null)
> Both sites ran into problems after another component in their system was 
> upgraded to use the proleptic Gregorian calendar. Spark applications that 
> read files created by the upgraded component were interpreting encoded or 
> marker dates incorrectly, and vice versa. Also, their data now had a mix of 
> calendars (hybrid and proleptic Gregorian) with no metadata to indicate which 
> file used which calendar.
> Both sites had enormous amounts of existing data, so re-encoding the dates 
> using some other scheme was not a feasible solution.
> This is relevant to Spark 3:
> Any Spark 2 application that uses such date-encoding schemes may run into 
> trouble when run on Spark 3. The application may not properly interpret the 
> dates previously written by Spark 2. Also, once the Spark 3 version of the 
> application writes data, the tables will have a mix of calendars (hybrid and 
> proleptic Gregorian) with no metadata to indicate which file uses which 
> calendar.
> Similarly, sites might run with mixed Spark versions, resulting in data 
> written by one version that cannot be interpreted by the other. And as above, 
> the tables will now have a mix of calendars with no way to detect which file 
> uses which calendar.
> As with the two real-life example cases, these applications may have enormous 
> amounts of legacy data, so re-encoding the dates using some other scheme may 
> not be feasible.
> We might want to consider a configuration setting to allow the user to 
> specify the calendar for storing and retrieving date and timestamp values 
> (not sure how such a flag would affect other date and timestamp-related 
> functions). I realize the change is far bigger than just adding a 
> configuration setting.
> Here's a quick example of where trouble may happen, using the real-life case 
> of the marker date.
> In Spark 2.4:
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 1
> scala>
> {noformat}
> In Spark 3.0 (reading from the same legacy file):
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 0
> scala> 
> {noformat}
> By the way, Hive had a similar problem. Hive switched from hybrid calendar to 
> proleptic Gregorian calendar between 2.x and 3.x. After some upgrade 
> headaches related to dates before 1582, the Hive community made the following 
> changes:
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> checks a configuration setting to determine which calendar to use.
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> stores the calendar type in the metadata.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files, 
> Hive checks the metadata for the calendar type.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files that 
> lack calendar metadata, Hive's behavior is determined by a configuration 
> setting. This allows Hive to read legacy data (note: if the data already 
> consists of a mix of calendar types with no metadata, there is no good 
> solution).
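
For sites trying to gauge their exposure before upgrading, a minimal sketch that counts pre-cutover dates; it reuses the `$home/data/datefile` path from the example above and assumes an active SparkSession `spark` on the Spark 2.x side.

{code:java}
import org.apache.spark.sql.functions.{col, lit}

// Count date values written before the Gregorian cutover; these are the rows whose
// interpretation can shift once the proleptic Gregorian calendar is used everywhere.
val affected = spark.read.orc(s"$home/data/datefile")
  .filter(col("dt") < lit("1582-10-15").cast("date"))
  .count()
println(s"rows with pre-cutover dates: $affected")
{code}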



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar

2020-03-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063050#comment-17063050
 ] 

Dongjoon Hyun commented on SPARK-30951:
---

Thanks. According to [~cloud_fan]'s comment, the `correctness` label is removed. 
However, it seems that we need more documentation along the lines of 
[~cloud_fan]'s comment above.

> Potential data loss for legacy applications after switch to proleptic 
> Gregorian calendar
> 
>
> Key: SPARK-30951
> URL: https://issues.apache.org/jira/browse/SPARK-30951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Priority: Blocker
>
> tl;dr: We recently discovered some Spark 2.x sites that have lots of data 
> containing dates before October 15, 1582. This could be an issue when such 
> sites try to upgrade to Spark 3.0.
> From SPARK-26651:
> {quote}"The changes might impact on the results for dates and timestamps 
> before October 15, 1582 (Gregorian)
> {quote}
> We recently discovered that some large scale Spark 2.x applications rely on 
> dates before October 15, 1582.
> Two cases came up recently:
>  * An application that uses a commercial third-party library to encode 
> sensitive dates. On insert, the library encodes the actual date as some other 
> date. On select, the library decodes the date back to the original date. The 
> encoded value could be any date, including one before October 15, 1582 (e.g., 
> "0602-04-04").
>  * An application that uses a specific unlikely date (e.g., "1200-01-01") as 
> a marker to indicate "unknown date" (in lieu of null)
> Both sites ran into problems after another component in their system was 
> upgraded to use the proleptic Gregorian calendar. Spark applications that 
> read files created by the upgraded component were interpreting encoded or 
> marker dates incorrectly, and vice versa. Also, their data now had a mix of 
> calendars (hybrid and proleptic Gregorian) with no metadata to indicate which 
> file used which calendar.
> Both sites had enormous amounts of existing data, so re-encoding the dates 
> using some other scheme was not a feasible solution.
> This is relevant to Spark 3:
> Any Spark 2 application that uses such date-encoding schemes may run into 
> trouble when run on Spark 3. The application may not properly interpret the 
> dates previously written by Spark 2. Also, once the Spark 3 version of the 
> application writes data, the tables will have a mix of calendars (hybrid and 
> proleptic Gregorian) with no metadata to indicate which file uses which 
> calendar.
> Similarly, sites might run with mixed Spark versions, resulting in data 
> written by one version that cannot be interpreted by the other. And as above, 
> the tables will now have a mix of calendars with no way to detect which file 
> uses which calendar.
> As with the two real-life example cases, these applications may have enormous 
> amounts of legacy data, so re-encoding the dates using some other scheme may 
> not be feasible.
> We might want to consider a configuration setting to allow the user to 
> specify the calendar for storing and retrieving date and timestamp values 
> (not sure how such a flag would affect other date and timestamp-related 
> functions). I realize the change is far bigger than just adding a 
> configuration setting.
> Here's a quick example of where trouble may happen, using the real-life case 
> of the marker date.
> In Spark 2.4:
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 1
> scala>
> {noformat}
> In Spark 3.0 (reading from the same legacy file):
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 0
> scala> 
> {noformat}
> By the way, Hive had a similar problem. Hive switched from hybrid calendar to 
> proleptic Gregorian calendar between 2.x and 3.x. After some upgrade 
> headaches related to dates before 1582, the Hive community made the following 
> changes:
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> checks a configuration setting to determine which calendar to use.
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> stores the calendar type in the metadata.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files, 
> Hive checks the metadata for the calendar type.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files that 
> lack calendar metadata, Hive's behavior is determined by a configuration 
> setting. This allows Hive to read legacy data (note: if the data already 
> consists of a mix of calendar types with no metadata, there is no good 
> solution).

[jira] [Created] (SPARK-31199) Separate connection timeout and idle timeout for shuffle

2020-03-19 Thread runnings (Jira)
runnings created SPARK-31199:


 Summary: Separate connection timeout and idle timeout for shuffle
 Key: SPARK-31199
 URL: https://issues.apache.org/jira/browse/SPARK-31199
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.1.0
Reporter: runnings


spark.shuffle.io.connectionTimeout is only used as the timeout for connection 
setup, while spark.shuffle.io.idleTimeout is used to control how long an 
apparently idle connection is kept before it is killed 
([#27963|https://github.com/apache/spark/pull/27963]).

These two timeouts could be quite different, and a shorter connection timeout 
could help the shuffle task fail fast in some cases.
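
For illustration, how the two settings would be configured once they are separate. The values are arbitrary, and the idle-timeout key name follows the proposal above, so it is an assumption rather than an existing config.

{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Fail fast if the shuffle connection cannot even be established.
  .set("spark.shuffle.io.connectionTimeout", "30s")
  // But keep an established-yet-idle connection around much longer
  // (proposed key, per this issue).
  .set("spark.shuffle.io.idleTimeout", "120s")
{code}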



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31199) Separate connection timeout and idle timeout for shuffle

2020-03-19 Thread runnings (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

runnings updated SPARK-31199:
-
Description: 
spark.shuffle.io.connectionTimeout is only used as the timeout for connection 
setup, while spark.shuffle.io.idleTimeout is used to control how long an 
apparently idle connection is kept before it is killed 
([https://github.com/apache/spark/pull/5584]).

These two timeouts could be quite different, and a shorter connection timeout 
could help the shuffle task fail fast in some cases.

  was:
spark.shuffle.io.connectionTimeout only used for connection timeout for 
connection setup while spark.shuffle.io.idleTimeout is used to control how long 
to kill the connection if it seems to be 
idle([#27963|https://github.com/apache/spark/pull/27963])

 

These 2 timeouts could be quite different and shorten connectiontimeout could 
help fast fail the shuffle task in some cases


> Separate connection timeout and idle timeout for shuffle
> 
>
> Key: SPARK-31199
> URL: https://issues.apache.org/jira/browse/SPARK-31199
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: runnings
>Priority: Major
>
> spark.shuffle.io.connectionTimeout is only used as the timeout for connection 
> setup, while spark.shuffle.io.idleTimeout is used to control how long an 
> apparently idle connection is kept before it is killed 
> ([https://github.com/apache/spark/pull/5584]).
>  
> These two timeouts could be quite different, and a shorter connection timeout 
> could help the shuffle task fail fast in some cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31199) Separate connection timeout and idle timeout for shuffle

2020-03-19 Thread runnings (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063067#comment-17063067
 ] 

runnings commented on SPARK-31199:
--

cc [~rxin], who worked on [https://github.com/apache/spark/pull/5584] before

 

> Separate connection timeout and idle timeout for shuffle
> 
>
> Key: SPARK-31199
> URL: https://issues.apache.org/jira/browse/SPARK-31199
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: runnings
>Priority: Major
>
> spark.shuffle.io.connectionTimeout is only used as the timeout for connection 
> setup, while spark.shuffle.io.idleTimeout is used to control how long an 
> apparently idle connection is kept before it is killed 
> ([https://github.com/apache/spark/pull/5584]).
>  
> These two timeouts could be quite different, and a shorter connection timeout 
> could help the shuffle task fail fast in some cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31199) Separate connection timeout and idle timeout for shuffle

2020-03-19 Thread runnings (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063067#comment-17063067
 ] 

runnings edited comment on SPARK-31199 at 3/20/20, 4:15 AM:


cc [~rxin]  [~aaron.davidson_impala_647b]  who worked on 
[https://github.com/apache/spark/pull/5584] before

 

 


was (Author: runnings):
cc [~rxin]  **  who worked on [https://github.com/apache/spark/pull/5584] before

 

> Separate connection timeout and idle timeout for shuffle
> 
>
> Key: SPARK-31199
> URL: https://issues.apache.org/jira/browse/SPARK-31199
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: runnings
>Priority: Major
>
> spark.shuffle.io.connectionTimeout is only used as the timeout for connection 
> setup, while spark.shuffle.io.idleTimeout is used to control how long an 
> apparently idle connection is kept before it is killed 
> ([https://github.com/apache/spark/pull/5584]).
>  
> These two timeouts could be quite different, and a shorter connection timeout 
> could help the shuffle task fail fast in some cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-31199) Separate connection timeout and idle timeout for shuffle

2020-03-19 Thread runnings (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

runnings updated SPARK-31199:
-
Comment: was deleted

(was: cc [~rxin]  [~aaron.davidson_impala_647b]  who worked on 
[https://github.com/apache/spark/pull/5584] before

 

 )

> Separate connection timeout and idle timeout for shuffle
> 
>
> Key: SPARK-31199
> URL: https://issues.apache.org/jira/browse/SPARK-31199
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: runnings
>Priority: Major
>
> spark.shuffle.io.connectionTimeout is only used as the timeout for connection 
> setup, while spark.shuffle.io.idleTimeout is used to control how long an 
> apparently idle connection is kept before it is killed 
> ([https://github.com/apache/spark/pull/5584]).
>  
> These two timeouts could be quite different, and a shorter connection timeout 
> could help the shuffle task fail fast in some cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31181) Remove the default value assumption on CREATE TABLE test cases

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31181.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

> Remove the default value assumption on CREATE TABLE test cases
> --
>
> Key: SPARK-31181
> URL: https://issues.apache.org/jira/browse/SPARK-31181
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31181) Remove the default value assumption on CREATE TABLE test cases

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31181:
-

Assignee: Dongjoon Hyun

> Remove the default value assumption on CREATE TABLE test cases
> --
>
> Key: SPARK-31181
> URL: https://issues.apache.org/jira/browse/SPARK-31181
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31181) Remove the default value assumption on CREATE TABLE test cases

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31181:
--
Affects Version/s: (was: 3.1.0)
   3.0.0

> Remove the default value assumption on CREATE TABLE test cases
> --
>
> Key: SPARK-31181
> URL: https://issues.apache.org/jira/browse/SPARK-31181
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31181) Remove the default value assumption on CREATE TABLE test cases

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31181:
--
Fix Version/s: (was: 3.1.0)
   3.0.0

> Remove the default value assumption on CREATE TABLE test cases
> --
>
> Key: SPARK-31181
> URL: https://issues.apache.org/jira/browse/SPARK-31181
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31136) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31136.
---
Resolution: Won't Do

Hi, all. This issue was specifically for reverting SPARK-30098. I'm closing it 
as "Won't Do" because we discussed it here and did not reach agreement on a 
revert. In other words, we are going to move forward instead of simply 
reverting. SPARK-31147 will follow up with a proper action on the CHAR type, 
and SPARK-31133 will follow up on the documentation (including the syntax 
change, its meaning, and maybe the `LOAD` behavior). We may open more 
follow-ups, but not under this issue. If there is another request to revert 
SPARK-30098, we will reuse this issue for further discussion.

> Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
> -
>
> Key: SPARK-31136
> URL: https://issues.apache.org/jira/browse/SPARK-31136
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> We need to consider the behavior change of SPARK-30098.
> This is a placeholder to keep the discussion and the final decision.
> The `CREATE TABLE` syntax changes its behavior silently.
> The following is one example of breaking existing user data pipelines.
> *Apache Spark 2.4.5*
> {code}
> spark-sql> CREATE TABLE t(a STRING);
> spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
> spark-sql> SELECT * FROM t LIMIT 1;
> # Apache Spark
> Time taken: 2.05 seconds, Fetched 1 row(s)
> {code}
> {code}
> spark-sql> CREATE TABLE t(a CHAR(3));
> spark-sql> INSERT INTO TABLE t SELECT 'a ';
> spark-sql> SELECT a, length(a) FROM t;
> a 3
> {code}
> *Apache Spark 3.0.0-preview2*
> {code}
> spark-sql> CREATE TABLE t(a STRING);
> spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
> Error in query: LOAD DATA is not supported for datasource tables: 
> `default`.`t`;
> {code}
> {code}
> spark-sql> CREATE TABLE t(a CHAR(3));
> spark-sql> INSERT INTO TABLE t SELECT 'a ';
> spark-sql> SELECT a, length(a) FROM t;
> a 2
> {code}
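
One way for affected pipelines to make the intent explicit regardless of the default provider is to spell it out in the DDL; a minimal sketch, assuming an active Hive-enabled SparkSession `spark` and illustrative table names:

{code:java}
// Spell out the provider instead of relying on the CREATE TABLE default:
// a Hive serde table keeps Hive semantics such as LOAD DATA support,
// while a datasource table opts into the new behavior explicitly.
spark.sql("CREATE TABLE t_hive (a STRING) USING hive")
spark.sql("CREATE TABLE t_ds (a STRING) USING parquet")
{code}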



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31136) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31136:
--
Labels:   (was: correctness)

> Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
> -
>
> Key: SPARK-31136
> URL: https://issues.apache.org/jira/browse/SPARK-31136
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> We need to consider the behavior change of SPARK-30098.
> This is a placeholder to keep the discussion and the final decision.
> The `CREATE TABLE` syntax changes its behavior silently.
> The following is one example of breaking existing user data pipelines.
> *Apache Spark 2.4.5*
> {code}
> spark-sql> CREATE TABLE t(a STRING);
> spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
> spark-sql> SELECT * FROM t LIMIT 1;
> # Apache Spark
> Time taken: 2.05 seconds, Fetched 1 row(s)
> {code}
> {code}
> spark-sql> CREATE TABLE t(a CHAR(3));
> spark-sql> INSERT INTO TABLE t SELECT 'a ';
> spark-sql> SELECT a, length(a) FROM t;
> a 3
> {code}
> *Apache Spark 3.0.0-preview2*
> {code}
> spark-sql> CREATE TABLE t(a STRING);
> spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
> Error in query: LOAD DATA is not supported for datasource tables: 
> `default`.`t`;
> {code}
> {code}
> spark-sql> CREATE TABLE t(a CHAR(3));
> spark-sql> INSERT INTO TABLE t SELECT 'a ';
> spark-sql> SELECT a, length(a) FROM t;
> a 2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31136) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-03-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063092#comment-17063092
 ] 

Dongjoon Hyun commented on SPARK-31136:
---

BTW, [~kabhwan]'s thread should be considered a separate topic because it's 
about "Resolve ambiguous parser rule".

> Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
> -
>
> Key: SPARK-31136
> URL: https://issues.apache.org/jira/browse/SPARK-31136
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> We need to consider the behavior change of SPARK-30098.
> This is a placeholder to keep the discussion and the final decision.
> The `CREATE TABLE` syntax changes its behavior silently.
> The following is one example of breaking existing user data pipelines.
> *Apache Spark 2.4.5*
> {code}
> spark-sql> CREATE TABLE t(a STRING);
> spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
> spark-sql> SELECT * FROM t LIMIT 1;
> # Apache Spark
> Time taken: 2.05 seconds, Fetched 1 row(s)
> {code}
> {code}
> spark-sql> CREATE TABLE t(a CHAR(3));
> spark-sql> INSERT INTO TABLE t SELECT 'a ';
> spark-sql> SELECT a, length(a) FROM t;
> a 3
> {code}
> *Apache Spark 3.0.0-preview2*
> {code}
> spark-sql> CREATE TABLE t(a STRING);
> spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
> Error in query: LOAD DATA is not supported for datasource tables: 
> `default`.`t`;
> {code}
> {code}
> spark-sql> CREATE TABLE t(a CHAR(3));
> spark-sql> INSERT INTO TABLE t SELECT 'a ';
> spark-sql> SELECT a, length(a) FROM t;
> a 2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31171) size(null) should return null under ansi mode

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31171:
--
Parent: SPARK-31085
Issue Type: Sub-task  (was: Improvement)

> size(null) should return null under ansi mode
> -
>
> Key: SPARK-31171
> URL: https://issues.apache.org/jira/browse/SPARK-31171
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
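
A quick way to check the behavior described in the title; a sketch, assuming an active SparkSession `spark`:

{code:java}
// Legacy behavior returns -1 for size of a NULL collection; with ANSI mode
// enabled, NULL is expected instead once this change is in.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT size(cast(null AS array<int>))").show()
{code}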




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31193) set spark.master and spark.app.name conf default value

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31193:
--
Fix Version/s: (was: 3.1.0)

> set spark.master and spark.app.name conf default value
> --
>
> Key: SPARK-31193
> URL: https://issues.apache.org/jira/browse/SPARK-31193
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.3, 2.4.0, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.1.0
>Reporter: daile
>Priority: Major
>
> I see the default value of the master setting in the spark-submit client:
> {code:java}
> // Global defaults. These should be keep to minimum to avoid confusing behavior.
> master = Option(master).getOrElse("local[*]")
> {code}
> but during development and debugging we will encounter this kind of problem:
> Exception in thread "main" org.apache.spark.SparkException: A master URL must 
> be set in your configuration
> This conflicts with the default setting.
>  
> {code:java}
> // If we do
> val sparkConf = new SparkConf().setAppName("app")
> // then, when using the client to submit tasks to the cluster, the master will
> // be overwritten by the local value
> sparkConf.set("spark.master", "local[*]"){code}
>  
> so we have to do this:
> {code:java}
> val sparkConf = new SparkConf().setAppName("app")
> // Because a master set in code takes priority, we have to check whether it is
> // already set, to avoid overriding the cluster master on submit
> sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]")){code}
>  
>  
> The same applies to spark.app.name.
> Is it better for users to handle it the way the spark-submit client does?
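
For reference, a minimal sketch of the usual way to provide an in-code default without clobbering a master passed by spark-submit; the app name is illustrative:

{code:java}
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("app")
  // Only applies when no master was supplied via spark-submit or spark-defaults,
  // so local debugging works and cluster submission is left untouched.
  .setIfMissing("spark.master", "local[*]")
{code}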



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31193) set spark.master and spark.app.name conf default value

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31193:
--
Affects Version/s: (was: 2.4.5)
   (was: 2.4.4)
   (was: 2.4.3)
   (was: 2.4.2)
   (was: 2.3.3)
   (was: 2.4.0)
   (was: 2.3.0)

> set spark.master and spark.app.name conf default value
> --
>
> Key: SPARK-31193
> URL: https://issues.apache.org/jira/browse/SPARK-31193
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: daile
>Priority: Major
>
> I see the default value of the master setting in the spark-submit client:
> {code:java}
> // Global defaults. These should be keep to minimum to avoid confusing behavior.
> master = Option(master).getOrElse("local[*]")
> {code}
> but during development and debugging we will encounter this kind of problem:
> Exception in thread "main" org.apache.spark.SparkException: A master URL must 
> be set in your configuration
> This conflicts with the default setting.
>  
> {code:java}
> // If we do
> val sparkConf = new SparkConf().setAppName("app")
> // then, when using the client to submit tasks to the cluster, the master will
> // be overwritten by the local value
> sparkConf.set("spark.master", "local[*]"){code}
>  
> so we have to do this:
> {code:java}
> val sparkConf = new SparkConf().setAppName("app")
> // Because a master set in code takes priority, we have to check whether it is
> // already set, to avoid overriding the cluster master on submit
> sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]")){code}
>  
>  
> The same applies to spark.app.name.
> Is it better for users to handle it the way the spark-submit client does?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31193) set spark.master and spark.app.name conf default value

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31193.
---
Resolution: Not A Bug

> set spark.master and spark.app.name conf default value
> --
>
> Key: SPARK-31193
> URL: https://issues.apache.org/jira/browse/SPARK-31193
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: daile
>Priority: Major
>
> I see the default value of the master setting in the spark-submit client:
> {code:java}
> // Global defaults. These should be keep to minimum to avoid confusing behavior.
> master = Option(master).getOrElse("local[*]")
> {code}
> but during development and debugging we will encounter this kind of problem:
> Exception in thread "main" org.apache.spark.SparkException: A master URL must 
> be set in your configuration
> This conflicts with the default setting.
>  
> {code:java}
> // If we do
> val sparkConf = new SparkConf().setAppName("app")
> // then, when using the client to submit tasks to the cluster, the master will
> // be overwritten by the local value
> sparkConf.set("spark.master", "local[*]"){code}
>  
> so we have to do this:
> {code:java}
> val sparkConf = new SparkConf().setAppName("app")
> // Because a master set in code takes priority, we have to check whether it is
> // already set, to avoid overriding the cluster master on submit
> sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]")){code}
>  
>  
> The same applies to spark.app.name.
> Is it better for users to handle it the way the spark-submit client does?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31193) set spark.master and spark.app.name conf default value

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31193:
--
Target Version/s:   (was: 3.1.0)

> set spark.master and spark.app.name conf default value
> --
>
> Key: SPARK-31193
> URL: https://issues.apache.org/jira/browse/SPARK-31193
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.3, 2.4.0, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.1.0
>Reporter: daile
>Priority: Major
>
> I see the default value of the master setting in the spark-submit client:
> {code:java}
> // Global defaults. These should be keep to minimum to avoid confusing behavior.
> master = Option(master).getOrElse("local[*]")
> {code}
> but during development and debugging we will encounter this kind of problem:
> Exception in thread "main" org.apache.spark.SparkException: A master URL must 
> be set in your configuration
> This conflicts with the default setting.
>  
> {code:java}
> // If we do
> val sparkConf = new SparkConf().setAppName("app")
> // then, when using the client to submit tasks to the cluster, the master will
> // be overwritten by the local value
> sparkConf.set("spark.master", "local[*]"){code}
>  
> so we have to do this:
> {code:java}
> val sparkConf = new SparkConf().setAppName("app")
> // Because a master set in code takes priority, we have to check whether it is
> // already set, to avoid overriding the cluster master on submit
> sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]")){code}
>  
>  
> The same applies to spark.app.name.
> Is it better for users to handle it the way the spark-submit client does?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-31193) set spark.master and spark.app.name conf default value

2020-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-31193.
-

> set spark.master and spark.app.name conf default value
> --
>
> Key: SPARK-31193
> URL: https://issues.apache.org/jira/browse/SPARK-31193
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: daile
>Priority: Major
>
> I see how the default value of the master is handled in the spark-submit client:
> {code:java}
> // Global defaults. These should be keep to minimum to avoid confusing behavior.
> master = Option(master).getOrElse("local[*]")
> {code}
> but during development and debugging we run into this problem:
> Exception in thread "main" org.apache.spark.SparkException: A master URL must 
> be set in your configuration
> This conflicts with the default setting.
>  
> {code:java}
> // If we do this
> val sparkConf = new SparkConf().setAppName("app")
> // then, when the client submits the job to a cluster, the cluster master is
> // overwritten by the hard-coded local one:
> sparkConf.set("spark.master", "local[*]"){code}
>  
> so we have to do this instead:
> {code:java}
> val sparkConf = new SparkConf().setAppName("app")
> // Because a master set in code takes priority, we have to check whether
> // spark.master is already set, to avoid overriding a cluster submission.
> sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]")){code}
>  
>  
> The same applies to spark.app.name.
> Would it be better to handle this for the user, as the spark-submit client does?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31139) Fileformat datasources (ORC, Json) case sensitivity regressions

2020-03-19 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063127#comment-17063127
 ] 

Xiao Li commented on SPARK-31139:
-

ping [~viirya] [~dongjoon]

> Fileformat datasources (ORC, Json) case sensitivity regressions
> ---
>
> Key: SPARK-31139
> URL: https://issues.apache.org/jira/browse/SPARK-31139
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tae-kyeom, Kim
>Priority: Blocker
> Attachments: FileBasedDataSourceSuite.scala.diff
>
>
> In addition to https://issues.apache.org/jira/browse/SPARK-31116:
> not only Parquet, but JSON and ORC also have case-sensitivity issues.
> The following test failures are based on SPARK-31116's test cases (the diff of 
> FileBasedDataSourceSuite is attached); a standalone reproduction sketch is 
> appended after the quoted output below.
> 
>  
> {code:java}
> [info] - SPARK-31116: Select simple columns correctly in case insensitive 
> manner *** FAILED *** (4 seconds, 277 milliseconds) [info] Results do not 
> match for query: [info] Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
>  [info] Timezone Env: [info] [info] == Parsed Logical Plan == [info] 
> Relation[camelcase#56] json [info] [info] == Analyzed Logical Plan == [info] 
> camelcase: string [info] Relation[camelcase#56] json [info] [info] == 
> Optimized Logical Plan == [info] Relation[camelcase#56] json [info] [info] == 
> Physical Plan == [info] FileScan json [camelcase#56] Batched: false, 
> DataFilters: [], Format: JSON, Location: 
> InMemoryFileIndex[file:/Users/kimtkyeom/Dev/spark_devel/target/tmp/spark-95f1357a-85c9-444f-bdcc-...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct [info] [info] == Results == [info] [info] == Results 
> == [info] !== Correct Answer - 1 == == Spark Answer - 1 == [info] !struct<> 
> struct [info] ![A] [null] (QueryTest.scala:248)
> [info] - SPARK-31116: Select nested columns correctly in case insensitive 
> manner *** FAILED *** (2 seconds, 117 milliseconds) [info] Results do not 
> match for query: [info] Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
>  [info] Timezone Env: [info] [info] == Parsed Logical Plan == [info] 
> Relation[StructColumn#147] json [info] [info] == Analyzed Logical Plan == 
> [info] StructColumn: struct [info] 
> Relation[StructColumn#147] json [info] [info] == Optimized Logical Plan == 
> [info] Relation[StructColumn#147] json [info] [info] == Physical Plan == 
> [info] FileScan json [StructColumn#147] Batched: false, DataFilters: [], 
> Format: JSON, Location: 
> InMemoryFileIndex[file:/Users/kimtkyeom/Dev/spark_devel/target/tmp/spark-f9ecd1a4-e5aa-4dd7-bdfd-...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct> [info] [info] 
> == Results == [info] [info] == Results == [info] !== Correct Answer - 1 == == 
> Spark Answer - 1 == [info] !struct<> 
> struct> [info] 
> ![[0,1]] [[null,null]] (QueryTest.scala:248)
> [info] - SPARK-31116: Select nested columns correctly in case sensitive 
> manner *** FAILED *** (871 milliseconds) [info] Results do not match for 
> query: [info] Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
>  [info] Timezone Env: [info] [info] == Parsed Logical Plan == [info] 
> Relation[StructColumn#329] json [info] [info] == Analyzed Logical Plan == 
> [info] StructColumn: struct [info] 
> Relation[StructColumn#329] json [info] [info] == Optimized Logical Plan == 
> [info] Relation[StructColumn#329] json [info] [info] == Physical Plan == 
> [info] FileScan json [StructColumn#329] Batched: false, DataFilters: [], 
> Format: JSON, Location: 
> InMemoryFileIndex[file:/Users/kimtkyeom/Dev/spark_devel/target/tmp/spark-612baf76-a9d0-41e5-89f4-...,
>  PartitionFilters: [], 
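
A hypothetical, self-contained reproduction sketch of the reported regression, as referenced in the description above; the column name, path, and values are made up for illustration:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("spark-31139-sketch")
  .getOrCreate()
import spark.implicits._

// Case-insensitive analysis is the default, but set it explicitly for clarity.
spark.conf.set("spark.sql.caseSensitive", "false")

val path = "/tmp/spark-31139-json"
// Write a single row whose JSON field is camelCased.
Seq("A").toDF("camelCase").write.mode("overwrite").json(path)

// Read it back with a lower-cased user-specified schema. With case-insensitive
// resolution the expected answer is Row("A"); the regression reported here
// makes the value come back as null instead.
val readBack = spark.read
  .schema(StructType(Seq(StructField("camelcase", StringType))))
  .json(path)
readBack.show()
{code}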

[jira] [Resolved] (SPARK-31195) Reuse days rebase functions of DateTimeUtils in DaysWritable

2020-03-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31195.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27962
[https://github.com/apache/spark/pull/27962]

> Reuse days rebase functions of DateTimeUtils in DaysWritable
> 
>
> Key: SPARK-31195
> URL: https://issues.apache.org/jira/browse/SPARK-31195
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> The functions rebaseJulianToGregorianDays() and rebaseGregorianToJulianDays() 
> were added by the PR https://github.com/apache/spark/pull/27915. This ticket 
> aims to replace the similar code in org.apache.spark.sql.hive.DaysWritable with 
> these functions, in order to:
> # Deduplicate code
> # Reuse functions that are better tested and cross-checked by reading Parquet 
> files saved by Spark 2.4
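
For context, a rough conceptual sketch (not Spark's actual implementation) of what a Julian-to-proleptic-Gregorian day rebase does: it keeps the local year-month-day the same while re-encoding the day count on the other calendar, which is why day counts for old dates shift:

{code:java}
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Interpret `hybridDays` (days since 1970-01-01 on the legacy hybrid
// Julian/Gregorian calendar) and re-encode the same local date as a day
// count on the proleptic Gregorian calendar used by java.time.
def hybridToProlepticDays(hybridDays: Int): Int = {
  val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
  cal.clear()
  cal.setTimeInMillis(hybridDays.toLong * 24L * 60 * 60 * 1000L)
  val sameLocalDate = LocalDate.of(
    cal.get(Calendar.YEAR),
    cal.get(Calendar.MONTH) + 1,
    cal.get(Calendar.DAY_OF_MONTH))
  sameLocalDate.toEpochDay.toInt
}

// For modern dates the two calendars agree, so nothing changes;
// for dates before the Gregorian cutover the counts differ by several days.
hybridToProlepticDays(0)  // 0 (1970-01-01 is unaffected)
{code}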



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31195) Reuse days rebase functions of DateTimeUtils in DaysWritable

2020-03-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31195:


Assignee: Maxim Gekk

> Reuse days rebase functions of DateTimeUtils in DaysWritable
> 
>
> Key: SPARK-31195
> URL: https://issues.apache.org/jira/browse/SPARK-31195
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> The functions rebaseJulianToGregorianDays() and rebaseGregorianToJulianDays() 
> were added by the PR https://github.com/apache/spark/pull/27915. This ticket 
> aims to replace the similar code in org.apache.spark.sql.hive.DaysWritable with 
> these functions, in order to:
> # Deduplicate code
> # Reuse functions that are better tested and cross-checked by reading Parquet 
> files saved by Spark 2.4



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org