[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-06-10 Thread Henry Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860580#comment-16860580
 ] 

Henry Yu commented on SPARK-27812:
--

So, is there any other idea on how to fix it? Otherwise I will try to make a PR with the 
SparkUncaughtExceptionHandler solution.  [~dongjoon] 
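
For illustration only (a minimal sketch, not Spark's or OkHttp's actual code): a lingering non-daemon thread keeps the JVM alive after main() returns, which is the symptom reported here, and forcing an explicit exit once the driver has finished is roughly what routing the failure through SparkUncaughtExceptionHandler achieves.

{code}
// Minimal sketch of the symptom and a blunt mitigation; all names are illustrative.
object NonDaemonThreadDemo {
  def main(args: Array[String]): Unit = {
    val t = new Thread(() => while (true) Thread.sleep(1000))
    t.setDaemon(false)   // a non-daemon thread prevents the JVM from exiting
    t.start()
    println("main() returned, but the process keeps running")
    // System.exit(0)    // an explicit exit would terminate it regardless
  }
}
{code}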

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier

2019-06-10 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860579#comment-16860579
 ] 

zhengruifeng commented on SPARK-27018:
--

I tested in both a local env and a cluster env, and your patch works fine.

[~pkolaczk]  Could you please create a PR against the master branch? I think committers can 
help backport it to older versions.
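
For context, the idea behind the attached patch (check that the next checkpoint exists before deleting the old one) can be sketched roughly as follows; the helper name and signature are hypothetical, not Spark's internal API.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch: only remove the previous checkpoint once the next one has
// actually materialized, so a recomputation never points at a deleted directory.
def removeOldCheckpointIfSafe(fs: FileSystem, previous: Path, next: Path): Unit = {
  if (fs.exists(next)) {
    fs.delete(previous, true)  // recursive delete of the superseded checkpoint
  }
  // otherwise keep the old checkpoint; running tasks may still need to read it
}
{code}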

> Checkpointed RDD deleted prematurely when using GBTClassifier
> -
>
> Key: SPARK-27018
> URL: https://issues.apache.org/jira/browse/SPARK-27018
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0
> Environment: OS: Ubuntu Linux 18.10
> Java: java version "1.8.0_201"
> Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
> Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
> Reproducible with a single-node Spark in standalone mode.
> Reproducible with Zepellin or Spark shell.
>  
>Reporter: Piotr Kołaczkowski
>Priority: Major
> Attachments: 
> Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch
>
>
> Steps to reproduce:
> {noformat}
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.classification.GBTClassifier
> case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int)
> sc.setCheckpointDir("/checkpoints")
> val trainingData = sc.parallelize(1 to 2426874, 256).map(x => 
> Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF
> val classifier = new GBTClassifier()
>   .setLabelCol("label")
>   .setFeaturesCol("features")
>   .setProbabilityCol("probability")
>   .setMaxIter(100)
>   .setMaxDepth(10)
>   .setCheckpointInterval(2)
> classifier.fit(trainingData){noformat}
>  
> The last line fails with:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 
> (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: 
> /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39)
> at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> 

[jira] [Created] (SPARK-27997) kubernetes client token expired

2019-06-10 Thread Henry Yu (JIRA)
Henry Yu created SPARK-27997:


 Summary: kubernetes client token expired 
 Key: SPARK-27997
 URL: https://issues.apache.org/jira/browse/SPARK-27997
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.3
Reporter: Henry Yu


Hi,

When I try to submit Spark to k8s in cluster mode, I need an auth token to talk with k8s.

Unfortunately, many cloud providers issue tokens that expire within 10-15 minutes, so we need to refresh the token.

Client mode is even worse, because the scheduler is created in the submit process.

Should I also make a PR for this? I fixed it by adding a RotatingOAuthTokenProvider and some configuration.

[~dongjoon]
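
A minimal sketch of what such a rotating provider could look like, assuming the token is a file that the platform refreshes on disk; the trait, class name, and refresh interval below are illustrative assumptions, not the fabric8 client's or Spark's actual API.

{code}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical sketch: re-read the token from disk shortly before the cached copy
// would expire, so requests to the API server never carry an expired credential.
trait TokenProvider { def token(): String }

class RotatingFileTokenProvider(path: String, refreshEveryMs: Long = 5 * 60 * 1000L)
    extends TokenProvider {
  @volatile private var cached = read()
  @volatile private var lastRead = System.currentTimeMillis()

  private def read(): String =
    new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8).trim

  override def token(): String = {
    val now = System.currentTimeMillis()
    if (now - lastRead > refreshEveryMs) { cached = read(); lastRead = now }
    cached
  }
}
{code}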






[jira] [Comment Edited] (SPARK-27499) Support mapping spark.local.dir to hostPath volume

2019-06-10 Thread Junjie Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860519#comment-16860519
 ] 

Junjie Chen edited comment on SPARK-27499 at 6/11/19 5:20 AM:
--

Hi [~dongjoon], I know SPARK_LOCAL_DIRS can be mounted as an emptyDir. However, an 
emptyDir is just one directory on the node. I opened this Jira to track a feature for 
setting multiple directories, to fully utilize the node's disk bandwidth for spilling, 
which I think currently cannot be achieved by setting spark.local.dir: even if I set 
multiple dirs, they still map to one directory on the node.

This Jira was intended to use hostPath volume mounts as spark.local.dir, which 
requires the mountVolumeFeature to be built before the localDirFeature, while currently 
the localDirFeature is built before the mountVolumeFeature.

 


was (Author: junjie):
Hi, [~dongjoon], I know SPARK_LOCAL_DIRS can be mounted as emptyDir. However, 
emptyDir just one directory on node. I opened this Jira to track a feature to 
setting multiple directories to full utilize the nodes' disks bandwidth for 
spilling, which I think currently it can not be achieve through setting 
spark.local.dir. Even I set to multiple dirs, they still map to one directory 
on node.

 

This Jira is intended to use hostPath volumes mounts as spark.local.dir, for 
exmaple:

spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/data/mnt-x
 

 

> Support mapping spark.local.dir to hostPath volume
> --
>
> Key: SPARK-27499
> URL: https://issues.apache.org/jira/browse/SPARK-27499
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Junjie Chen
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently, the k8s executor builder mount spark.local.dir as emptyDir or 
> memory, it should satisfy some small workload, while in some heavily workload 
> like TPCDS, both of them can have some problem, such as pods are evicted due 
> to disk pressure when using emptyDir, and OOM when using tmpfs.
> In particular on cloud environment, users may allocate cluster with minimum 
> configuration and add cloud storage when running workload. In this case, we 
> can specify multiple elastic storage as spark.local.dir to accelerate the 
> spilling. 






[jira] [Updated] (SPARK-27845) DataSourceV2: InsertTable

2019-06-10 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27845:
---
Summary: DataSourceV2: InsertTable  (was: DataSourceV2: Insert into tables 
in multiple catalogs)

> DataSourceV2: InsertTable
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl
>  * DataFrameWriter.insertInto("catalog.db.tbl")






[jira] [Updated] (SPARK-27322) DataSourceV2 table relation

2019-06-10 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27322:
---
Summary: DataSourceV2 table relation  (was: DataSourceV2: Select from 
multiple catalogs)

> DataSourceV2 table relation
> ---
>
> Key: SPARK-27322
> URL: https://issues.apache.org/jira/browse/SPARK-27322
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multi-catalog in the following SELECT code paths:
>  * SELECT * FROM catalog.db.tbl
>  * TABLE catalog.db.tbl
>  * JOIN or UNION tables from different catalogs
>  * SparkSession.table("catalog.db.tbl")






[jira] [Updated] (SPARK-27372) Standalone executor process-level isolation to support GPU scheduling

2019-06-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27372:
--
Issue Type: Story  (was: Sub-task)
Parent: (was: SPARK-27360)

> Standalone executor process-level isolation to support GPU scheduling
> -
>
> Key: SPARK-27372
> URL: https://issues.apache.org/jira/browse/SPARK-27372
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> As an admin, I can configure standalone to have multiple executor processes 
> on the same worker node and processes are configured via cgroups so they only 
> have access to assigned GPUs. So I don't need to worry about resource 
> contention between processes on the same host.






[jira] [Updated] (SPARK-27371) Standalone master receives resource info from worker and allocate driver/executor properly

2019-06-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27371:
--
Summary: Standalone master receives resource info from worker and allocate 
driver/executor properly  (was: Master receives resource info from worker and 
allocate driver/executor properly)

> Standalone master receives resource info from worker and allocate 
> driver/executor properly
> --
>
> Key: SPARK-27371
> URL: https://issues.apache.org/jira/browse/SPARK-27371
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> As an admin, I can let Spark Standalone worker automatically discover GPUs 
> installed on worker nodes. So I don't need to manually configure them.






[jira] [Updated] (SPARK-27371) Master receives resource info from worker and allocate driver/executor properly

2019-06-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27371:
--
Summary: Master receives resource info from worker and allocate 
driver/executor properly  (was: Standalone worker can auto discover GPUs)

> Master receives resource info from worker and allocate driver/executor 
> properly
> ---
>
> Key: SPARK-27371
> URL: https://issues.apache.org/jira/browse/SPARK-27371
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> As an admin, I can let Spark Standalone worker automatically discover GPUs 
> installed on worker nodes. So I don't need to manually configure them.






[jira] [Deleted] (SPARK-27370) spark-submit requests GPUs in standalone mode

2019-06-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng deleted SPARK-27370:
--


> spark-submit requests GPUs in standalone mode
> -
>
> Key: SPARK-27370
> URL: https://issues.apache.org/jira/browse/SPARK-27370
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Xiangrui Meng
>Priority: Major
>
> As a user, I can use spark-submit to request GPUs per task in standalone mode 
> when I submit an Spark application.






[jira] [Updated] (SPARK-27369) Standalone worker can load resource conf and discover resources

2019-06-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27369:
--
Summary: Standalone worker can load resource conf and discover resources  
(was: Standalone support static conf to describe GPU resources)

> Standalone worker can load resource conf and discover resources
> ---
>
> Key: SPARK-27369
> URL: https://issues.apache.org/jira/browse/SPARK-27369
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>







[jira] [Reopened] (SPARK-27499) Support mapping spark.local.dir to hostPath volume

2019-06-10 Thread Junjie Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junjie Chen reopened SPARK-27499:
-

> Support mapping spark.local.dir to hostPath volume
> --
>
> Key: SPARK-27499
> URL: https://issues.apache.org/jira/browse/SPARK-27499
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Junjie Chen
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently, the k8s executor builder mount spark.local.dir as emptyDir or 
> memory, it should satisfy some small workload, while in some heavily workload 
> like TPCDS, both of them can have some problem, such as pods are evicted due 
> to disk pressure when using emptyDir, and OOM when using tmpfs.
> In particular on cloud environment, users may allocate cluster with minimum 
> configuration and add cloud storage when running workload. In this case, we 
> can specify multiple elastic storage as spark.local.dir to accelerate the 
> spilling. 






[jira] [Commented] (SPARK-27499) Support mapping spark.local.dir to hostPath volume

2019-06-10 Thread Junjie Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860519#comment-16860519
 ] 

Junjie Chen commented on SPARK-27499:
-

Hi [~dongjoon], I know SPARK_LOCAL_DIRS can be mounted as an emptyDir. However, an 
emptyDir is just one directory on the node. I opened this Jira to track a feature for 
setting multiple directories, to fully utilize the node's disk bandwidth for spilling, 
which I think currently cannot be achieved by setting spark.local.dir: even if I set 
multiple dirs, they still map to one directory on the node.

This Jira is intended to use hostPath volume mounts as spark.local.dir, for example:

spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/data/mnt-x
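
To make the intent concrete, a hypothetical configuration sketch (illustrative paths and volume names; not something the current feature-step ordering supports, per the discussion above) could declare one hostPath volume per physical disk and point spark.local.dir at the resulting mounts:

{code}
# hypothetical example: two hostPath volumes, one per disk, backing separate local dirs
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/mnt/disk1
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/data/mnt-1
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/mnt/disk2
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/data/mnt-2
spark.local.dir=/data/mnt-1,/data/mnt-2
{code}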
 

 

> Support mapping spark.local.dir to hostPath volume
> --
>
> Key: SPARK-27499
> URL: https://issues.apache.org/jira/browse/SPARK-27499
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Junjie Chen
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently, the k8s executor builder mount spark.local.dir as emptyDir or 
> memory, it should satisfy some small workload, while in some heavily workload 
> like TPCDS, both of them can have some problem, such as pods are evicted due 
> to disk pressure when using emptyDir, and OOM when using tmpfs.
> In particular on cloud environment, users may allocate cluster with minimum 
> configuration and add cloud storage when running workload. In this case, we 
> can specify multiple elastic storage as spark.local.dir to accelerate the 
> spilling. 






[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling

2019-06-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27368:
--
Description: 
Design draft:

Scenarios:
* client-mode, worker might create one or more executor processes, from 
different Spark applications.
* cluster-mode, worker might create driver process as well.
* local-cluster model, there could be multiple worker processes on the same 
node. This is an undocumented use of standalone mode, which is mainly for tests.
* Resource isolation is not considered here.

Because executor and driver processes on the same node will share the 
accelerator resources, worker must take the role that allocates resources. So 
we will add spark.worker.resource.[resourceName].discoveryScript conf for 
workers to discover resources. User need to match the resourceName in driver 
and executor requests. Besides CPU cores and memory, worker now also considers 
resources in creating new executors or drivers.

Example conf:

{code}
# static worker conf
spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh

# application conf
spark.driver.resource.gpu.amount=4
spark.executor.resource.gpu.amount=2
spark.task.resource.gpu.amount=1
{code}
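
For illustration, the worker side of step 2 in the timeline below could look roughly like this. The JSON shape emitted by the discovery script (a resource name plus a list of addresses) is an assumption made for this sketch, not something fixed by the draft.

{code}
import scala.sys.process._
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Hypothetical sketch: run the configured discovery script and parse its stdout,
// assumed to be JSON such as {"name": "gpu", "addresses": ["0", "1"]}, into a
// (resourceName -> addresses) pair that the worker can report to the master.
def discoverResource(scriptPath: String): (String, Seq[String]) = {
  implicit val formats: Formats = DefaultFormats
  val output = Seq("bash", scriptPath).!!
  val json = parse(output)
  ((json \ "name").extract[String], (json \ "addresses").extract[Seq[String]])
}
{code}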

In client mode, driver process is not launched by worker. So user can specify 
driver resource discovery script. In cluster mode, if user still specify driver 
resource discovery script, it is ignored with a warning.

Supporting resource isolation is tricky because Spark worker doesn't know how 
to isolate resources unless we hardcode some resource names like GPU support in 
YARN, which is less ideal. Support resource isolation of multiple resource 
types is even harder. In the first version, we will implement accelerator 
support without resource isolation.

Timeline:
1. Worker starts.
2. Worker loads `work.source.*` conf and runs discovery scripts to discover 
resources.
3. Worker reports to master cores, memory, and resources (new) and registers.
4. An application starts.
5. Master finds workers with sufficient available resources and let worker 
start executor or driver process.
6. Worker assigns executor / driver resources by passing the resource info from 
command-line.
7. Application ends.
8. Master requests worker to kill driver/executor process.
9. Master updates available resources.

  was:
Design draft:

Scenarios:
* client-mode, worker might create one or more executor processes, from 
different Spark applications.
* cluster-mode, worker might create driver process as well.
* local-cluster model, there could be multiple worker processes on the same 
node. This is an undocumented use of standalone mode, which is mainly for tests.
* Resource isolation is not considered here.

Because executor and driver processes on the same node will share the 
accelerator resources, worker must take the role that allocates resources. So 
we will add spark.worker.resource.[resourceName].discoveryScript conf for 
workers to discover resources. User need to match the resourceName in driver 
and executor requests. Besides CPU cores and memory, worker now also considers 
resources in creating new executors or drivers.

Example conf:

{code}
spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
spark.driver.resource.gpu.count=4
spark.executor.resource.gpu.count=1
{code}

In client mode, driver process is not launched by worker. So user can specify 
driver resource discovery script. In cluster mode, if user still specify driver 
resource discovery script, it is ignored with a warning.

Supporting resource isolation is tricky because Spark worker doesn't know how 
to isolate resources unless we hardcode some resource names like GPU support in 
YARN, which is less ideal. Support resource isolation of multiple resource 
types is even harder. In the first version, we will implement accelerator 
support without resource isolation.

Timeline:
1. Worker starts.
2. Worker loads `work.source.*` conf and runs discovery scripts to discover 
resources.
3. Worker reports to master cores, memory, and resources (new) and registers.
4. An application starts.
5. Master finds workers with sufficient available resources and let worker 
start executor or driver process.
6. Worker assigns executor / driver resources by passing the resource info from 
command-line.
7. Application ends.
8. Master requests worker to kill driver/executor process.
9. Master updates available resources.


> Design: Standalone supports GPU scheduling
> --
>
> Key: SPARK-27368
> URL: https://issues.apache.org/jira/browse/SPARK-27368
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> Design draft:
> Scenarios:
> * client-mode, worker might create 

[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling

2019-06-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27368:
--
Description: 
Design draft:

Scenarios:
* client-mode, worker might create one or more executor processes, from 
different Spark applications.
* cluster-mode, worker might create driver process as well.
* local-cluster model, there could be multiple worker processes on the same 
node. This is an undocumented use of standalone mode, which is mainly for tests.
* Resource isolation is not considered here.

Because executor and driver processes on the same node will share the 
accelerator resources, worker must take the role that allocates resources. So 
we will add spark.worker.resource.[resourceName].discoveryScript conf for 
workers to discover resources. User need to match the resourceName in driver 
and executor requests. Besides CPU cores and memory, worker now also considers 
resources in creating new executors or drivers.

Example conf:

{code}
spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
spark.driver.resource.gpu.count=4
spark.executor.resource.gpu.count=1
{code}

In client mode, driver process is not launched by worker. So user can specify 
driver resource discovery script. In cluster mode, if user still specify driver 
resource discovery script, it is ignored with a warning.

Supporting resource isolation is tricky because Spark worker doesn't know how 
to isolate resources unless we hardcode some resource names like GPU support in 
YARN, which is less ideal. Support resource isolation of multiple resource 
types is even harder. In the first version, we will implement accelerator 
support without resource isolation.

Timeline:
1. Worker starts.
2. Worker loads `work.source.*` conf and runs discovery scripts to discover 
resources.
3. Worker reports to master cores, memory, and resources (new) and registers.
4. An application starts.
5. Master finds workers with sufficient available resources and let worker 
start executor or driver process.
6. Worker assigns executor / driver resources by passing the resource info from 
command-line.
7. Application ends.
8. Master requests worker to kill driver/executor process.
9. Master updates available resources.

  was:
Design draft:

Scenarios:
* client-mode, worker might create one or more executor processes, from 
different Spark applications.
* cluster-mode, worker might create driver process as well.
* local-cluster model, there could be multiple worker processes on the same 
node. This is an undocumented use of standalone mode, which is mainly for tests.
* Resource isolation is not considered here.

Because executor and driver processes on the same node will share the 
accelerator resources, worker must take the role that allocates resources. So 
we will add spark.worker.resource.[resourceName].discoveryScript conf for 
workers to discover resources. User need to match the resourceName in driver 
and executor requests. Besides CPU cores and memory, worker now also considers 
resources in creating new executors or drivers.

Example conf:

{code}
spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
spark.driver.resource.gpu.count=4
spark.worker.resource.gpu.count=1
{code}

In client mode, driver process is not launched by worker. So user can specify 
driver resource discovery script. In cluster mode, if user still specify driver 
resource discovery script, it is ignored with a warning.

Supporting resource isolation is tricky because Spark worker doesn't know how 
to isolate resources unless we hardcode some resource names like GPU support in 
YARN, which is less ideal. Support resource isolation of multiple resource 
types is even harder. In the first version, we will implement accelerator 
support without resource isolation.

Timeline:
1. Worker starts.
2. Worker loads `work.source.*` conf and runs discovery scripts to discover 
resources.
3. Worker reports to master cores, memory, and resources (new) and registers.
4. An application starts.
5. Master finds workers with sufficient available resources and let worker 
start executor or driver process.
6. Worker assigns executor / driver resources by passing the resource info from 
command-line.
7. Application ends.
8. Master requests worker to kill driver/executor process.
9. Master updates available resources.


> Design: Standalone supports GPU scheduling
> --
>
> Key: SPARK-27368
> URL: https://issues.apache.org/jira/browse/SPARK-27368
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> Design draft:
> Scenarios:
> * client-mode, worker might create one or more executor processes, from 
> different Spark applications.
> * 

[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling

2019-06-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27368:
--
Description: 
Design draft:

Scenarios:
* client-mode, worker might create one or more executor processes, from 
different Spark applications.
* cluster-mode, worker might create driver process as well.
* local-cluster model, there could be multiple worker processes on the same 
node. This is an undocumented use of standalone mode, which is mainly for tests.
* Resource isolation is not considered here.

Because executor and driver processes on the same node will share the 
accelerator resources, worker must take the role that allocates resources. So 
we will add spark.worker.resource.[resourceName].discoveryScript conf for 
workers to discover resources. User need to match the resourceName in driver 
and executor requests. Besides CPU cores and memory, worker now also considers 
resources in creating new executors or drivers.

Example conf:

{code}
spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
spark.driver.resource.gpu.count=4
spark.worker.resource.gpu.count=1
{code}

In client mode, driver process is not launched by worker. So user can specify 
driver resource discovery script. In cluster mode, if user still specify driver 
resource discovery script, it is ignored with a warning.

Supporting resource isolation is tricky because Spark worker doesn't know how 
to isolate resources unless we hardcode some resource names like GPU support in 
YARN, which is less ideal. Support resource isolation of multiple resource 
types is even harder. In the first version, we will implement accelerator 
support without resource isolation.

Timeline:
1. Worker starts.
2. Worker loads `work.source.*` conf and runs discovery scripts to discover 
resources.
3. Worker reports to master cores, memory, and resources (new) and registers.
4. An application starts.
5. Master finds workers with sufficient available resources and let worker 
start executor or driver process.
6. Worker assigns executor / driver resources by passing the resource info from 
command-line.
7. Application ends.
8. Master requests worker to kill driver/executor process.
9. Master updates available resources.

  was:
Design draft:

Scenarios:
* client-mode, worker might create one or more executor processes, from 
different Spark applications.
* cluster-mode, worker might create driver process as well.
* local-cluster model, there could be multiple worker processes on the same 
node. This is an undocumented use of standalone mode, which is mainly for tests.
* Resource isolation is not considered here.

Because executor and driver processes on the same node will share the 
accelerator resources, worker must take the role that allocates resources. So 
we will add spark.worker.resource.[resourceName].discoveryScript conf for 
workers to discover resources. User need to match the resourceName in driver 
and executor requests. Besides CPU cores and memory, worker now also considers 
resources in creating new executors or drivers.

Example conf:

{code}
spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
spark.driver.resource.gpu.count=4
spark.worker.resource.gpu.count=1
{code}

In client mode, driver process is not launched by worker. So user can specify 
driver resource discovery script. In cluster mode, if user still specify driver 
resource discovery script, it is ignored with a warning.

Supporting resource isolation is tricky because Spark worker doesn't know how 
to isolate resources unless we hardcode some resource names like GPU support in 
YARN, which is less ideal. Support resource isolation of multiple resource 
types is even harder. In the first version, we will implement accelerator 
support without resource isolation.


> Design: Standalone supports GPU scheduling
> --
>
> Key: SPARK-27368
> URL: https://issues.apache.org/jira/browse/SPARK-27368
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> Design draft:
> Scenarios:
> * client-mode, worker might create one or more executor processes, from 
> different Spark applications.
> * cluster-mode, worker might create driver process as well.
> * local-cluster model, there could be multiple worker processes on the same 
> node. This is an undocumented use of standalone mode, which is mainly for 
> tests.
> * Resource isolation is not considered here.
> Because executor and driver processes on the same node will share the 
> accelerator resources, worker must take the role that allocates resources. So 
> we will add spark.worker.resource.[resourceName].discoveryScript conf for 
> workers to discover resources. User need to match 

[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2019-06-10 Thread HonglunChen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860512#comment-16860512
 ] 

HonglunChen commented on SPARK-18112:
-

Why does my issue still exist? I use Spark 2.4.3 and Hive 2.3.3.

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive2.0 has been released in February 2016, after that Hive2.0.1 and 
> Hive2.1.0 have also been released for a long time, but till now spark only 
> support to read hive metastore data from Hive1.2.1 and older version, since 
> Hive2.x has many bugs fixed and performance improvement it's better and 
> urgent to upgrade to support Hive2.x
> failed to load data from hive2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)






[jira] [Created] (SPARK-27996) Spark UI redirect will be failed behind the https reverse proxy

2019-06-10 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-27996:
---

 Summary: Spark UI redirect will be failed behind the https reverse 
proxy
 Key: SPARK-27996
 URL: https://issues.apache.org/jira/browse/SPARK-27996
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.3
Reporter: Saisai Shao


When the Spark live/history UI is proxied behind a reverse proxy, redirects can return 
the wrong scheme. For example:

If the reverse proxy is SSL enabled, the request from the client to the reverse proxy is 
an HTTPS request, whereas if Spark's UI is not SSL enabled, the request from the 
reverse proxy to the Spark UI is a plain HTTP request. Spark itself treats all requests 
as HTTP, so the redirect URL starts with "http" and the redirect fails on the client side.

Actually, most reverse proxies add an additional header, "X-Forwarded-Proto", to tell 
the backend server that the original client request is an HTTPS request, so Spark should 
leverage this header to return the correct URL.
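
A minimal sketch of the idea (a hypothetical helper, not Spark's actual UI/Jetty code): when building a redirect location, prefer the scheme announced by the proxy over the scheme of the proxied request.

{code}
import javax.servlet.http.HttpServletRequest

// Hypothetical sketch: pick the scheme for a redirect URL, preferring the
// X-Forwarded-Proto header set by the reverse proxy over req.getScheme, which
// only reflects the (possibly plain-HTTP) hop from the proxy to Spark.
def redirectScheme(req: HttpServletRequest): String =
  Option(req.getHeader("X-Forwarded-Proto")).map(_.trim).filter(_.nonEmpty)
    .getOrElse(req.getScheme)

def redirectLocation(req: HttpServletRequest, path: String): String =
  s"${redirectScheme(req)}://${req.getServerName}:${req.getServerPort}$path"
{code}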






[jira] [Assigned] (SPARK-27995) Note the difference between str of Python 2 and 3 at Arrow optimized toPandas

2019-06-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27995:


Assignee: (was: Apache Spark)

> Note the difference between str of Python 2 and 3 at Arrow optimized toPandas
> -
>
> Key: SPARK-27995
> URL: https://issues.apache.org/jira/browse/SPARK-27995
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> When Arrow optimization is enabled in Python 2.7, 
> {code}
> import pandas
> pdf = pandas.DataFrame(["test1", "test2"])
> df = spark.createDataFrame(pdf)
> df.show()
> {code}
> I got the following output:
> {code}
> ++
> |   0|
> ++
> |[74 65 73 74 31]|
> |[74 65 73 74 32]|
> ++
> {code}
> This looks because Python's {{str}} and {{byte}} are same. it does look right:
> {code}
> >>> str == bytes
> True
> >>> isinstance("a", bytes)
> True
> {code}
> 1. Python 2 treats `str` as `bytes`.
> 2. PySpark added some special codes and hacks to recognizes `str` as string 
> types.
> 3. PyArrow / Pandas followed Python 2 difference
> We might have to match the behaviour to PySpark's but Python 2 is deprecated 
> anyway. I think it's better to just note it.






[jira] [Assigned] (SPARK-27995) Note the difference between str of Python 2 and 3 at Arrow optimized toPandas

2019-06-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27995:


Assignee: Apache Spark

> Note the difference between str of Python 2 and 3 at Arrow optimized toPandas
> -
>
> Key: SPARK-27995
> URL: https://issues.apache.org/jira/browse/SPARK-27995
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> When Arrow optimization is enabled in Python 2.7, 
> {code}
> import pandas
> pdf = pandas.DataFrame(["test1", "test2"])
> df = spark.createDataFrame(pdf)
> df.show()
> {code}
> I got the following output:
> {code}
> ++
> |   0|
> ++
> |[74 65 73 74 31]|
> |[74 65 73 74 32]|
> ++
> {code}
> This looks because Python's {{str}} and {{byte}} are same. it does look right:
> {code}
> >>> str == bytes
> True
> >>> isinstance("a", bytes)
> True
> {code}
> 1. Python 2 treats `str` as `bytes`.
> 2. PySpark added some special codes and hacks to recognizes `str` as string 
> types.
> 3. PyArrow / Pandas followed Python 2 difference
> We might have to match the behaviour to PySpark's but Python 2 is deprecated 
> anyway. I think it's better to just note it.






[jira] [Updated] (SPARK-27995) Note the difference between str of Python 2 and 3 at Arrow optimized toPandas

2019-06-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27995:
-
Description: 
When Arrow optimization is enabled in Python 2.7, 

{code}
import pandas
pdf = pandas.DataFrame(["test1", "test2"])
df = spark.createDataFrame(pdf)
df.show()
{code}

I got the following output:

{code}
++
|   0|
++
|[74 65 73 74 31]|
|[74 65 73 74 32]|
++
{code}

This looks because Python's {{str}} and {{byte}} are same. it does look right:

{code}
>>> str == bytes
True
>>> isinstance("a", bytes)
True
{code}

1. Python 2 treats `str` as `bytes`.
2. PySpark added some special codes and hacks to recognizes `str` as string 
types.
3. PyArrow / Pandas followed Python 2 difference

We might have to match the behaviour to PySpark's but Python 2 is deprecated 
anyway. I think it's better to just note it.

  was:
When Arrow optimization is enabled in Python 2.7, 

{code}
import pandas
pdf = pandas.DataFrame(["test1", "test2"])
df = spark.createDataFrame(pdf)
df.show()
{code}

I got the following output:

{code}
++
|   0|
++
|[74 65 73 74 31]|
|[74 65 73 74 32]|
++```
{code}

This looks because Python's {{str}} and {{byte}} are same. it does look right:

{code}
>>> str == bytes
True
>>> isinstance("a", bytes)
True
{code}

1. Python 2 treats `str` as `bytes`.
2. PySpark added some special codes and hacks to recognizes `str` as string 
types.
3. PyArrow / Pandas followed Python 2 difference

We might have to match the behaviour to PySpark's but Python 2 is deprecated 
anyway. I think it's better to just note it.


> Note the difference between str of Python 2 and 3 at Arrow optimized toPandas
> -
>
> Key: SPARK-27995
> URL: https://issues.apache.org/jira/browse/SPARK-27995
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> When Arrow optimization is enabled in Python 2.7, 
> {code}
> import pandas
> pdf = pandas.DataFrame(["test1", "test2"])
> df = spark.createDataFrame(pdf)
> df.show()
> {code}
> I got the following output:
> {code}
> ++
> |   0|
> ++
> |[74 65 73 74 31]|
> |[74 65 73 74 32]|
> ++
> {code}
> This looks because Python's {{str}} and {{byte}} are same. it does look right:
> {code}
> >>> str == bytes
> True
> >>> isinstance("a", bytes)
> True
> {code}
> 1. Python 2 treats `str` as `bytes`.
> 2. PySpark added some special codes and hacks to recognizes `str` as string 
> types.
> 3. PyArrow / Pandas followed Python 2 difference
> We might have to match the behaviour to PySpark's but Python 2 is deprecated 
> anyway. I think it's better to just note it.






[jira] [Created] (SPARK-27995) Note the difference between str of Python 2 and 3 at Arrow optimized toPandas

2019-06-10 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-27995:


 Summary: Note the difference between str of Python 2 and 3 at 
Arrow optimized toPandas
 Key: SPARK-27995
 URL: https://issues.apache.org/jira/browse/SPARK-27995
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


When Arrow optimization is enabled in Python 2.7, 

{code}
import pandas
pdf = pandas.DataFrame(["test1", "test2"])
df = spark.createDataFrame(pdf)
df.show()
{code}

I got the following output:

{code}
++
|   0|
++
|[74 65 73 74 31]|
|[74 65 73 74 32]|
++```
{code}

This looks because Python's {{str}} and {{byte}} are same. it does look right:

{code}
>>> str == bytes
True
>>> isinstance("a", bytes)
True
{code}

1. Python 2 treats `str` as `bytes`.
2. PySpark added some special codes and hacks to recognizes `str` as string 
types.
3. PyArrow / Pandas followed Python 2 difference

We might have to match the behaviour to PySpark's but Python 2 is deprecated 
anyway. I think it's better to just note it.






[jira] [Comment Edited] (SPARK-27546) Should repalce DateTimeUtils#defaultTimeZoneuse with sessionLocalTimeZone

2019-06-10 Thread Jiatao Tao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860491#comment-16860491
 ] 

Jiatao Tao edited comment on SPARK-27546 at 6/11/19 2:46 AM:
-

Hi [~dongjoon]

After reading the code: although I have set "spark.sql.session.timeZone" to UTC, 
"DateTimeUtils#defaultTimeZone" will still use "TimeZone.getDefault()".

I think the root cause is that when I cast to "TimestampType", it uses 
UTC (spark.sql.session.timeZone), but the subsequent conversion to DateType uses 
GMT+8 (TimeZone.getDefault).

Remember, what I got is a "ts"; it should not change with the time zone.
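
To illustrate the dependency being described (a minimal sketch assuming a local SparkSession; the example timestamp and zones are illustrative, and no claim is made here about which internal code path Spark takes):

{code}
import java.util.TimeZone
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Minimal sketch: set the session time zone and the JVM default time zone to
// different values, then compare a timestamp cast with a date cast.
val spark = SparkSession.builder().master("local[1]").appName("tz-demo").getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "UTC")
TimeZone.setDefault(TimeZone.getTimeZone("GMT+8"))

import spark.implicits._
val df = Seq("2013-01-01 20:00:00").toDF("s")
  .select(col("s").cast("timestamp").as("ts"))
  .select(col("ts"), col("ts").cast("date").as("d"))
// Whether d comes out as 2013-01-01 or 2013-01-02 depends on which of the two
// zones the date conversion uses, which is exactly the inconsistency reported.
df.show(false)
{code}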

 


was (Author: aron.tao):
Hi [~dongjoon]

After reading the code, 

Although I have set "spark.sql.session.timeZone" to UTC, 
"DateTimeUtils#defaultTimeZone" will still use "TimeZone.getDefault()".

 

> Should repalce DateTimeUtils#defaultTimeZoneuse with sessionLocalTimeZone
> -
>
> Key: SPARK-27546
> URL: https://issues.apache.org/jira/browse/SPARK-27546
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jiatao Tao
>Priority: Minor
> Attachments: image-2019-04-23-08-10-00-475.png, 
> image-2019-04-23-08-10-50-247.png
>
>







[jira] [Commented] (SPARK-27546) Should repalce DateTimeUtils#defaultTimeZoneuse with sessionLocalTimeZone

2019-06-10 Thread Jiatao Tao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860491#comment-16860491
 ] 

Jiatao Tao commented on SPARK-27546:


After reading the code: although I have set "spark.sql.session.timeZone" to UTC, 
"DateTimeUtils#defaultTimeZone" will still use "TimeZone.getDefault()".

 

> Should repalce DateTimeUtils#defaultTimeZoneuse with sessionLocalTimeZone
> -
>
> Key: SPARK-27546
> URL: https://issues.apache.org/jira/browse/SPARK-27546
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jiatao Tao
>Priority: Minor
> Attachments: image-2019-04-23-08-10-00-475.png, 
> image-2019-04-23-08-10-50-247.png
>
>







[jira] [Comment Edited] (SPARK-27546) Should repalce DateTimeUtils#defaultTimeZoneuse with sessionLocalTimeZone

2019-06-10 Thread Jiatao Tao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860491#comment-16860491
 ] 

Jiatao Tao edited comment on SPARK-27546 at 6/11/19 2:42 AM:
-

Hi [~dongjoon]

After reading the code, 

Although I have set "spark.sql.session.timeZone" to UTC, 
"DateTimeUtils#defaultTimeZone" will still use "TimeZone.getDefault()".

 


was (Author: aron.tao):
After reading the code, 

Although I have set "spark.sql.session.timeZone" to UTC, 
"DateTimeUtils#defaultTimeZone" will still use "TimeZone.getDefault()".

 

> Should repalce DateTimeUtils#defaultTimeZoneuse with sessionLocalTimeZone
> -
>
> Key: SPARK-27546
> URL: https://issues.apache.org/jira/browse/SPARK-27546
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jiatao Tao
>Priority: Minor
> Attachments: image-2019-04-23-08-10-00-475.png, 
> image-2019-04-23-08-10-50-247.png
>
>







[jira] [Comment Edited] (SPARK-27546) Should repalce DateTimeUtils#defaultTimeZoneuse with sessionLocalTimeZone

2019-06-10 Thread Jiatao Tao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860489#comment-16860489
 ] 

Jiatao Tao edited comment on SPARK-27546 at 6/11/19 2:40 AM:
-

Hi [~dongjoon]

 

What I get is "ts", not the "Mon Dec 31 16:00:00 PST 2012".

```
 jshell> new Date(135699840L).getTime
 $1 ==> 135699840L

jshell> TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

jshell> new Date(135699840L).getTime

$3 ==> 135699840L
```

Whatere the timezone I set, should not influence the ts I get.


was (Author: aron.tao):
Hi [~dongjoon]

 

What I get is "ts", not the "Mon Dec 31 16:00:00 PST 2012".
jshell> new Date(135699840L).getTime
$1 ==> 135699840L

jshell> TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

jshell> new Date(135699840L).getTime

$3 ==> 135699840L
 

Whatere the timezone I set, should not influence the ts I get.

> Should repalce DateTimeUtils#defaultTimeZoneuse with sessionLocalTimeZone
> -
>
> Key: SPARK-27546
> URL: https://issues.apache.org/jira/browse/SPARK-27546
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jiatao Tao
>Priority: Minor
> Attachments: image-2019-04-23-08-10-00-475.png, 
> image-2019-04-23-08-10-50-247.png
>
>







[jira] [Commented] (SPARK-27546) Should repalce DateTimeUtils#defaultTimeZoneuse with sessionLocalTimeZone

2019-06-10 Thread Jiatao Tao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860489#comment-16860489
 ] 

Jiatao Tao commented on SPARK-27546:


Hi [~dongjoon]

 

What I get is "ts", not the "Mon Dec 31 16:00:00 PST 2012".
jshell> new Date(135699840L).getTime
$1 ==> 135699840L

jshell> TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

jshell> new Date(135699840L).getTime

$3 ==> 135699840L
 

Whatere the timezone I set, should not influence the ts I get.

> Should repalce DateTimeUtils#defaultTimeZoneuse with sessionLocalTimeZone
> -
>
> Key: SPARK-27546
> URL: https://issues.apache.org/jira/browse/SPARK-27546
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jiatao Tao
>Priority: Minor
> Attachments: image-2019-04-23-08-10-00-475.png, 
> image-2019-04-23-08-10-50-247.png
>
>







[jira] [Issue Comment Deleted] (SPARK-20894) Error while checkpointing to HDFS

2019-06-10 Thread phan minh duc (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

phan minh duc updated SPARK-20894:
--
Comment: was deleted

(was: I'm using spark 2.4.0 and facing the same issue when i submit structured 
streaming app on cluster with 2 executor, but that error not appear if i only 
deploy on 1 executor.

EDIT: even running with only 1 executor i'm still facing the same issue, all 
the checkpoint Location i'm using was in hdfs, and the HDFSStateProvider report 
an error about reading the .delta state file in /tmp.

A part of my log

2019-06-10 02:47:21 WARN  TaskSetManager:66 - Lost task 44.1 in stage 92852.0 
(TID 305080, 10.244.2.205, executor 2): java.lang.IllegalStateException: Error 
reading delta file 
file:/tmp/temporary-06b7ccbd-b9d4-438b-8ed9-8238031ef075/state/2/44/1.delta of 
HDFSStateStoreProvider[id = (op=2,part=44),dir = 
file:/tmp/temporary-06b7ccbd-b9d4-438b-8ed9-8238031ef075/state/2/44]: 
file:/tmp/temporary-06b7ccbd-b9d4-438b-8ed9-8238031ef075/state/2/44/1.delta 
does not exist
    at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:427)
    at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6$$anonfun$apply$1.apply$mcVJ$sp(HDFSBackedStateStoreProvider.scala:384)
    at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6$$anonfun$apply$1.apply(HDFSBackedStateStoreProvider.scala:383)
    at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6$$anonfun$apply$1.apply(HDFSBackedStateStoreProvider.scala:383)
    at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:73)
    at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:383)
    at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:356)
    at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:535)
    at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.loadMap(HDFSBackedStateStoreProvider.scala:356)
    at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:204)
    at 
org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:371)
    at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:88)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: 
file:/tmp/temporary-06b7ccbd-b9d4-438b-8ed9-8238031ef075/state/2/44/1.delta
    at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:200)
    at 
org.apache.hadoop.fs.DelegateToFileSystem.open(DelegateToFileSystem.java:183)
    at org.apache.hadoop.fs.AbstractFileSystem.open(AbstractFileSystem.java:628)
    at org.apache.hadoop.fs.FilterFs.open(FilterFs.java:205)
    at org.apache.hadoop.fs.FileContext$6.next(FileContext.java:795)
    at org.apache.hadoop.fs.FileContext$6.next(FileContext.java:791)
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
    at org.apache.hadoop.fs.FileContext.open(FileContext.java:797)
    at 
org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.open(CheckpointFileManager.scala:322)
    at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:424)
    ... 28 more)

[jira] [Updated] (SPARK-27825) spark thriftserver session's first username causes the impersonation issue.

2019-06-10 Thread wangxinxin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxinxin updated SPARK-27825:
---
Attachment: test.zip

> spark thriftserver session's first username causes the impersonation issue. 
> -
>
> Key: SPARK-27825
> URL: https://issues.apache.org/jira/browse/SPARK-27825
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: wangxinxin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27825) spark thriftserver session's first username causes the impersonation issue.

2019-06-10 Thread wangxinxin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxinxin updated SPARK-27825:
---
Attachment: (was: test.zip)

> spark thriftserver session's first username causes the impersonation issue. 
> -
>
> Key: SPARK-27825
> URL: https://issues.apache.org/jira/browse/SPARK-27825
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: wangxinxin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27979) Remove deprecated `--force` option in `build/mvn` and `run-tests.py`

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27979.
---
Resolution: Fixed

This is resolved again via https://github.com/apache/spark/pull/24833

> Remove deprecated `--force` option in `build/mvn` and `run-tests.py`
> 
>
> Key: SPARK-27979
> URL: https://issues.apache.org/jira/browse/SPARK-27979
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> Since 2.0.0, SPARK-14867 deprecated `--force` option and ignores it. This 
> issue cleans up the code completely at 3.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27994) Spark Avro Failed to read logical type decimal backed by bytes

2019-06-10 Thread Nicolas Pascal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860442#comment-16860442
 ] 

Nicolas Pascal commented on SPARK-27994:


I'll push a PR with failing test cases to reproduce the issue

> Spark Avro Failed to read logical type decimal backed by bytes
> --
>
> Key: SPARK-27994
> URL: https://issues.apache.org/jira/browse/SPARK-27994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Nicolas Pascal
>Priority: Major
>
> A field with the following schema causes Spark to fail to read the Avro 
> file.
> {noformat}
>  
> {"name":"process_insert_id","type":["null",{"type":"bytes","logicalType":"decimal","precision":10,"scale":0}]}
>  {noformat}
> The following record is failing:
> {code:java}
> Array[Byte] [32 30 30 30 31 31 30 39 37 34]
> actual: BigDecimal 237007240188420354029364
> expected: 2000110974
> {code}
> The following code in the Spark Avro library 2.4.0, at 
> org.apache.spark.sql.avro.AvroDeserializer line 149, is the code in question:
> {noformat}
> val bigDecimal = 
> decimalConversions.fromFixed(value.asInstanceOf[GenericFixed], avroType,
>   LogicalTypes.decimal(d.precision, d.scale))
> {noformat}
> The Avro file is readable and produces the expected values when converted to JSON 
> using the Apache Avro tools jar 
> (https://search.maven.org/artifact/org.apache.avro/avro-tools/1.8.2/jar).
> Full stacktrace below:
> {noformat}
> 19/04/17 05:50:45 INFO Client: 
>client token: N/A
>diagnostics: User class threw exception: 
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   at au.com.nbnco.io.Io$.writeParquet(Io.scala:38)
>   at au.com.nbnco.fwk.Outputs$.write(Output.scala:27)
>   at au.com.nbnco.fwk.Context.write(Context.scala:41)
>   at 
> au.com.nbnco.job.merge.MergeToActiveDatasetJob$.run(MergeToActiveDatasetJob.scala:10)
>   at 
> au.com.nbnco.fwk.SparkJobRunner$.au$com$nbnco$fwk$SparkJobRunner$$executeJobRunner(SparkJobRunner.scala:63)
>   at 
> au.com.nbnco.fwk.SparkJobRunner$$anonfun$2$$anonfun$apply$1.apply$mcV$sp(SparkJobRunner.scala:40)
>   at 
> au.com.nbnco.fwk.SparkJobRunner$$anonfun$2$$anonfun$apply$1.apply(SparkJobRunner.scala:37)
>   at 
> au.com.nbnco.fwk.SparkJobRunner$$anonfun$2$$anonfun$apply$1.apply(SparkJobRunner.scala:37)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> 

[jira] [Closed] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-25053.
-

Thank you, [~skonto].

> Allow additional port forwarding on Spark on K8S as needed
> --
>
> Key: SPARK-25053
> URL: https://issues.apache.org/jira/browse/SPARK-25053
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> In some cases, like setting up remote debuggers, adding additional ports to 
> be forwarded would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-06-10 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860439#comment-16860439
 ] 

Dongjoon Hyun commented on SPARK-27812:
---

Thank you for the update, [~Andrew HUALI].

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit because of 
> an OkHttp WebSocket non-daemon thread.
>  
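
For illustration, a minimal Scala sketch of the symptom described above: a single non-daemon thread keeps the JVM alive after main() returns. This is a generic demo, not Spark or OkHttp code.

{code:scala}
// Generic demo: the process does not exit because a non-daemon thread is still running.
object NonDaemonThreadDemo {
  def main(args: Array[String]): Unit = {
    val t = new Thread(new Runnable {
      override def run(): Unit = Thread.sleep(Long.MaxValue)
    })
    t.setDaemon(false) // analogous to the OkHttp WebSocket thread in this report
    t.start()
    println("main finished, but the JVM will not exit")
  }
}
{code}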



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27697) KubernetesClientApplication alway exit with 0

2019-06-10 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860438#comment-16860438
 ] 

Dongjoon Hyun commented on SPARK-27697:
---

Please make a PR. Let's discuss it there, [~Andrew HUALI].

> KubernetesClientApplication alway exit with 0
> -
>
> Key: SPARK-27697
> URL: https://issues.apache.org/jira/browse/SPARK-27697
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Henry Yu
>Priority: Minor
>
> When submitting a Spark job to k8s, workflows try to get the job status from the 
> submission process exit code.
> The YARN client throws a SparkException when the application fails.
> I have fixed this in our in-house maintained Spark version. I can make a PR for this 
> issue.
>  
>  
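
A hedged Scala sketch of the behaviour described above, purely as an illustration (the method name and phase values are assumptions, not the actual submission code):

{code:scala}
// Illustrative only: make the submission JVM exit non-zero when the driver did not succeed,
// mirroring how the YARN client surfaces failures by throwing SparkException.
import org.apache.spark.SparkException

def checkDriverOutcome(finalPhase: String): Unit = finalPhase match {
  case "Succeeded" => () // launcher exits with 0, the workflow sees success
  case other =>
    // Propagating the exception makes spark-submit exit with a non-zero code.
    throw new SparkException(s"Driver pod terminated with phase: $other")
}
{code}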



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27993) Port HIVE-12981 to hive-thriftserver

2019-06-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27993:


Assignee: (was: Apache Spark)

> Port HIVE-12981 to hive-thriftserver
> 
>
> Key: SPARK-27993
> URL: https://issues.apache.org/jira/browse/SPARK-27993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We need to port HIVE-12981 to our hive-thriftserver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27993) Port HIVE-12981 to hive-thriftserver

2019-06-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27993:


Assignee: Apache Spark

> Port HIVE-12981 to hive-thriftserver
> 
>
> Key: SPARK-27993
> URL: https://issues.apache.org/jira/browse/SPARK-27993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> We need to port HIVE-12981 to our hive-thriftserver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27994) Spark Avro Failed to read logical type decimal backed by bytes

2019-06-10 Thread Nicolas Pascal (JIRA)
Nicolas Pascal created SPARK-27994:
--

 Summary: Spark Avro Failed to read logical type decimal backed by 
bytes
 Key: SPARK-27994
 URL: https://issues.apache.org/jira/browse/SPARK-27994
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Nicolas Pascal


A field with the following schema causes Spark to fail to read the Avro file.
{noformat}
 
{"name":"process_insert_id","type":["null",{"type":"bytes","logicalType":"decimal","precision":10,"scale":0}]}
 {noformat}
The following record is failing:
{code:java}
Array[Byte] [32 30 30 30 31 31 30 39 37 34]
actual: BigDecimal 237007240188420354029364
expected: 2000110974

{code}
The following code in the Spark Avro library 2.4.0, at 
org.apache.spark.sql.avro.AvroDeserializer line 149, is the code in question:
{noformat}
val bigDecimal = decimalConversions.fromFixed(value.asInstanceOf[GenericFixed], 
avroType,
  LogicalTypes.decimal(d.precision, d.scale))
{noformat}
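
For context, a hedged Scala sketch of the distinction in the Avro conversion API between fixed-backed and bytes-backed decimals; it illustrates the API shapes involved, not the actual Spark fix:

{code:scala}
// Illustrative only: a decimal declared as "type":"bytes" arrives as a ByteBuffer and must go
// through fromBytes; fromFixed is only valid for "type":"fixed" values (GenericFixed).
import java.nio.ByteBuffer
import org.apache.avro.{Conversions, LogicalTypes, Schema}
import org.apache.avro.generic.GenericFixed

val decimalConversion = new Conversions.DecimalConversion()

def readDecimal(value: Any, avroType: Schema, precision: Int, scale: Int): java.math.BigDecimal = {
  val logicalType = LogicalTypes.decimal(precision, scale)
  value match {
    case buf: ByteBuffer     => decimalConversion.fromBytes(buf, avroType, logicalType)
    case fixed: GenericFixed => decimalConversion.fromFixed(fixed, avroType, logicalType)
    case other               => sys.error(s"Unexpected decimal representation: $other")
  }
}
{code}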
The Avro file is readable and produces the expected values when converted to JSON 
using the Apache Avro tools jar 
(https://search.maven.org/artifact/org.apache.avro/avro-tools/1.8.2/jar).

Full stacktrace below:
{noformat}
19/04/17 05:50:45 INFO Client: 
 client token: N/A
 diagnostics: User class threw exception: 
org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
at au.com.nbnco.io.Io$.writeParquet(Io.scala:38)
at au.com.nbnco.fwk.Outputs$.write(Output.scala:27)
at au.com.nbnco.fwk.Context.write(Context.scala:41)
at 
au.com.nbnco.job.merge.MergeToActiveDatasetJob$.run(MergeToActiveDatasetJob.scala:10)
at 
au.com.nbnco.fwk.SparkJobRunner$.au$com$nbnco$fwk$SparkJobRunner$$executeJobRunner(SparkJobRunner.scala:63)
at 
au.com.nbnco.fwk.SparkJobRunner$$anonfun$2$$anonfun$apply$1.apply$mcV$sp(SparkJobRunner.scala:40)
at 
au.com.nbnco.fwk.SparkJobRunner$$anonfun$2$$anonfun$apply$1.apply(SparkJobRunner.scala:37)
at 
au.com.nbnco.fwk.SparkJobRunner$$anonfun$2$$anonfun$apply$1.apply(SparkJobRunner.scala:37)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 5 in stage 10.0 failed 4 times, most recent failure: Lost task 

[jira] [Created] (SPARK-27993) Port HIVE-12981 to hive-thriftserver

2019-06-10 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27993:
---

 Summary: Port HIVE-12981 to hive-thriftserver
 Key: SPARK-27993
 URL: https://issues.apache.org/jira/browse/SPARK-27993
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


We need to port HIVE-12981 to our hive-thriftserver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27217) Nested schema pruning doesn't work for aggregation e.g. `sum`.

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27217:
--
Affects Version/s: (was: 2.4.0)
   3.0.0

> Nested schema pruning doesn't work for aggregation e.g. `sum`.
> --
>
> Key: SPARK-27217
> URL: https://issues.apache.org/jira/browse/SPARK-27217
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: colin fang
>Priority: Major
>
> Since SPARK-4502 is fixed, I would expect queries such as `select sum(b.x)` 
> not to have to read other nested fields.
> {code:python}
> rdd = spark.range(1000).rdd.map(lambda x: [x.id+3, [x.id+1, x.id-1]])
> df = spark.createDataFrame(rdd, schema='a:int,b:struct<x:int,y:int>')
> df.repartition(1).write.mode('overwrite').parquet('test.parquet')
> df = spark.read.parquet('test.parquet')
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
> df.select('b.x').explain()
> # ReadSchema: struct<b:struct<x:int>>
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'false')
> df.select('b.x').explain()
> # ReadSchema: struct<b:struct<x:int,y:int>>
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
> df.selectExpr('sum(b.x)').explain()
> # ReadSchema: struct<b:struct<x:int,y:int>>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27217) Nested schema pruning doesn't work for aggregation e.g. `sum`.

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27217:
--
Priority: Major  (was: Minor)

> Nested schema pruning doesn't work for aggregation e.g. `sum`.
> --
>
> Key: SPARK-27217
> URL: https://issues.apache.org/jira/browse/SPARK-27217
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: colin fang
>Priority: Major
>
> Since SPARK-4502 is fixed, I would expect queries such as `select sum(b.x)` 
> not to have to read other nested fields.
> {code:python}
> rdd = spark.range(1000).rdd.map(lambda x: [x.id+3, [x.id+1, x.id-1]])
> df = spark.createDataFrame(rdd, schema='a:int,b:struct<x:int,y:int>')
> df.repartition(1).write.mode('overwrite').parquet('test.parquet')
> df = spark.read.parquet('test.parquet')
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
> df.select('b.x').explain()
> # ReadSchema: struct<b:struct<x:int>>
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'false')
> df.select('b.x').explain()
> # ReadSchema: struct<b:struct<x:int,y:int>>
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
> df.selectExpr('sum(b.x)').explain()
> # ReadSchema: struct<b:struct<x:int,y:int>>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27217) Nested schema pruning doesn't work for aggregation e.g. `sum`.

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27217:
--
Description: 
Since SPARK-4502 is fixed, I would expect queries such as `select sum(b.x)` 
not to have to read other nested fields.

{code:python}
rdd = spark.range(1000).rdd.map(lambda x: [x.id+3, [x.id+1, x.id-1]])
df = spark.createDataFrame(rdd, schema='a:int,b:struct<x:int,y:int>')
df.repartition(1).write.mode('overwrite').parquet('test.parquet')
df = spark.read.parquet('test.parquet')

spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
df.select('b.x').explain()
# ReadSchema: struct<b:struct<x:int>>

spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'false')
df.select('b.x').explain()
# ReadSchema: struct<b:struct<x:int,y:int>>

spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
df.selectExpr('sum(b.x)').explain()
# ReadSchema: struct<b:struct<x:int,y:int>>
{code}

  was:
Since SPARK-4502 is fixed, I would expect queries such as `select sum(b.x)` 
not to have to read other nested fields.

{code:python}
rdd = spark.range(1000).rdd.map(lambda x: [x.id+3, [x.id+1, x.id-1]])
df = spark.createDataFrame(rdd, schema='a:int,b:struct<x:int,y:int>')
df.repartition(1).write.mode('overwrite').parquet('test.parquet')
df = spark.read.parquet('test.parquet')

spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
df.select('b.x').explain()
# ReadSchema: struct<b:struct<x:int>>

spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'false')
df.select('b.x').explain()
# ReadSchema: struct<b:struct<x:int,y:int>>

spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
df.selectExpr('sum(b.x)').explain()
# ReadSchema: struct<b:struct<x:int,y:int>>
{code}


> Nested schema pruning doesn't work for aggregation e.g. `sum`.
> --
>
> Key: SPARK-27217
> URL: https://issues.apache.org/jira/browse/SPARK-27217
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: colin fang
>Priority: Minor
>
> Since SPARK-4502 is fixed, I would expect queries such as `select sum(b.x)` 
> not to have to read other nested fields.
> {code:python}
> rdd = spark.range(1000).rdd.map(lambda x: [x.id+3, [x.id+1, x.id-1]])
> df = spark.createDataFrame(rdd, schema='a:int,b:struct<x:int,y:int>')
> df.repartition(1).write.mode('overwrite').parquet('test.parquet')
> df = spark.read.parquet('test.parquet')
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
> df.select('b.x').explain()
> # ReadSchema: struct<b:struct<x:int>>
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'false')
> df.select('b.x').explain()
> # ReadSchema: struct<b:struct<x:int,y:int>>
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
> df.selectExpr('sum(b.x)').explain()
> # ReadSchema: struct<b:struct<x:int,y:int>>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27217) Nested schema pruning doesn't work for aggregation e.g. `sum`.

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27217:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-25603

> Nested schema pruning doesn't work for aggregation e.g. `sum`.
> --
>
> Key: SPARK-27217
> URL: https://issues.apache.org/jira/browse/SPARK-27217
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: colin fang
>Priority: Minor
>
> Since SPARK-4502 is fixed, I would expect queries such as `select sum(b.x)` 
> not to have to read other nested fields.
> {code:python}
> rdd = spark.range(1000).rdd.map(lambda x: [x.id+3, [x.id+1, x.id-1]])
> df = spark.createDataFrame(rdd, schema='a:int,b:struct<x:int,y:int>')
> df.repartition(1).write.mode('overwrite').parquet('test.parquet')
> df = spark.read.parquet('test.parquet')
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
> df.select('b.x').explain()
> # ReadSchema: struct<b:struct<x:int>>
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'false')
> df.select('b.x').explain()
> # ReadSchema: struct<b:struct<x:int,y:int>>
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
> df.selectExpr('sum(b.x)').explain()
> # ReadSchema: struct<b:struct<x:int,y:int>>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27290) remove unneeded sort under Aggregate

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27290:
--
Affects Version/s: (was: 2.4.0)
   3.0.0

> remove unneeded sort under Aggregate
> --
>
> Key: SPARK-27290
> URL: https://issues.apache.org/jira/browse/SPARK-27290
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiaoju Wu
>Priority: Minor
>
> I saw some tickets about removing unneeded sorts from the plan, but I think there is 
> another case in which a sort is redundant:
> a Sort just under a non-order-preserving node is redundant, for example:
> {code}
> select count(*) from (select a1 from A order by a2);
> +- Aggregate
>   +- Sort
>      +- FileScan parquet
> {code}
> But one of the existing test cases conflicts with this example:
> {code}
> test("sort should not be removed when there is a node which doesn't guarantee 
> any order") {
>    val orderedPlan = testRelation.select('a, 'b).orderBy('a.asc)
>    val groupedAndResorted = orderedPlan.groupBy('a)(sum('a)).orderBy('a.asc)
>    val optimized = Optimize.execute(groupedAndResorted.analyze)
>    val correctAnswer = groupedAndResorted.analyze
>    comparePlans(optimized, correctAnswer)
> }
> {code}
> Why is it designed like this? In my opinion, since Aggregate does not pass the 
> ordering up, the Sort below it is useless.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19799) Support WITH clause in subqueries

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19799:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-27764

> Support WITH clause in subqueries
> -
>
> Key: SPARK-19799
> URL: https://issues.apache.org/jira/browse/SPARK-19799
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Major
>
> Because of SPARK-17590 it should be relatively easy to support the WITH clause in 
> subqueries, besides nested CTE definitions.
> Here is an example of a query that does not run on Spark:
> create table test (seqno int, k string, v int) using parquet;
> insert into TABLE test values (1,'a', 99),(2, 'b', 88),(3, 'a', 77),(4, 'b', 
> 66),(5, 'c', 55),(6, 'a', 44),(7, 'b', 33);
> SELECT percentile(b, 0.5) FROM (WITH mavg AS (SELECT k, AVG(v) OVER 
> (PARTITION BY k ORDER BY seqno ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) as b 
> FROM test ORDER BY seqno) SELECT k, MAX(b) as b  FROM mavg GROUP BY k);



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27970) Support Hive 3.0 metastore

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27970:
--
Description: 
It seems that some users are using Hive 3.0.0, at least HDP 3.0.0



  was:
It seems that some users are using Hive 3.0.0, at least HDP 3.0.0:
!https://camo.githubusercontent.com/736d8a9f04d3960e0cdc3a8ee09aa199ce103b51/68747470733a2f2f32786262686a786336776b3376323170363274386e3464342d7770656e67696e652e6e6574646e612d73736c2e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f31322f6864702d332e312e312d4173706172616775732e706e67!
 




> Support Hive 3.0 metastore
> --
>
> Key: SPARK-27970
> URL: https://issues.apache.org/jira/browse/SPARK-27970
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: screenshot-1.png
>
>
> It seems that some users are using Hive 3.0.0, at least HDP 3.0.0
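
For reference, a configuration sketch of what this enables. The property names are the standard Spark Hive-metastore settings; the version string and jar path are illustrative assumptions for a build that contains this change:

{code:scala}
// Illustrative only: point Spark SQL at an external Hive 3.0 metastore.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-3.0-metastore-demo")
  .config("spark.sql.hive.metastore.version", "3.0")                // assumed value once this change is in
  .config("spark.sql.hive.metastore.jars", "/opt/hive-3.0.0/lib/*") // placeholder path to Hive 3.0 client jars
  .enableHiveSupport()
  .getOrCreate()
{code}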



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27917) Semantic equals of CaseWhen is failing with case sensitivity of column Names

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27917.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24766 at 3.0.0.
We are going to backport to the older branches.

> Semantic equals of CaseWhen is failing with case sensitivity of column Names
> 
>
> Key: SPARK-27917
> URL: https://issues.apache.org/jira/browse/SPARK-27917
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.2, 2.4.3
>Reporter: Akash R Nilugal
>Assignee: Sandeep Katta
>Priority: Major
> Fix For: 3.0.0
>
>
> Semantic equals of CaseWhen is failing with case sensitivity of column Names



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27917) Semantic equals of CaseWhen is failing with case sensitivity of column Names

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27917:
-

Assignee: Sandeep Katta

> Semantic equals of CaseWhen is failing with case sensitivity of column Names
> 
>
> Key: SPARK-27917
> URL: https://issues.apache.org/jira/browse/SPARK-27917
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.2, 2.4.3
>Reporter: Akash R Nilugal
>Assignee: Sandeep Katta
>Priority: Major
>
> Semantic equals of CaseWhen is failing with case sensitivity of column Names



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27992:


Assignee: (was: Apache Spark)

> PySpark socket server should sync with JVM connection thread future
> ---
>
> Key: SPARK-27992
> URL: https://issues.apache.org/jira/browse/SPARK-27992
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark 
> job are not propagated to Python. This is because toLocalIterator() and 
> toPandas() with Arrow enabled run Spark jobs asynchronously in a background 
> thread, after creating the socket connection info. The fix for these was to 
> catch a SparkException if the job errored and then send the exception through 
> the pyspark serializer.
> A better fix would be to allow Python to synchronize on the serving thread 
> future. That way if the serving thread throws an exception, it will be 
> propagated on the synchronization call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27992:


Assignee: Apache Spark

> PySpark socket server should sync with JVM connection thread future
> ---
>
> Key: SPARK-27992
> URL: https://issues.apache.org/jira/browse/SPARK-27992
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Major
>
> Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark 
> job are not propagated to Python. This is because toLocalIterator() and 
> toPandas() with Arrow enabled run Spark jobs asynchronously in a background 
> thread, after creating the socket connection info. The fix for these was to 
> catch a SparkException if the job errored and then send the exception through 
> the pyspark serializer.
> A better fix would be to allow Python to synchronize on the serving thread 
> future. That way if the serving thread throws an exception, it will be 
> propagated on the synchronization call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-10 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27992:
-
Affects Version/s: (was: 2.4.3)
   3.0.0

> PySpark socket server should sync with JVM connection thread future
> ---
>
> Key: SPARK-27992
> URL: https://issues.apache.org/jira/browse/SPARK-27992
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark 
> job are not propagated to Python. This is because toLocalIterator() and 
> toPandas() with Arrow enabled run Spark jobs asynchronously in a background 
> thread, after creating the socket connection info. The fix for these was to 
> catch a SparkException if the job errored and then send the exception through 
> the pyspark serializer.
> A better fix would be to allow Python to synchronize on the serving thread 
> future. That way if the serving thread throws an exception, it will be 
> propagated on the synchronization call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-10 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27992:
-
Description: 
Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark job 
are not propagated to Python. This is because toLocalIterator() and toPandas() 
with Arrow enabled run Spark jobs asynchronously in a background thread, after 
creating the socket connection info. The fix for these was to catch a 
SparkException if the job errored and then send the exception through the 
pyspark serializer.

A better fix would be to allow Python to synchronize on the serving thread 
future. That way if the serving thread throws an exception, it will be 
propagated on the synchronization call.
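
A small Scala sketch of the synchronization idea using plain futures (generic illustration, not the actual PySpark/Py4J wiring):

{code:scala}
// Generic illustration: run the serving work in a background thread but hand back a Future,
// so that blocking on it rethrows any error instead of the error being silently lost.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.util.Try

def serveAsync(serve: () => Unit): Future[Unit] = Future { serve() }

val handle = serveAsync(() => throw new RuntimeException("simulated job failure"))
// The caller (the Python side, in the real proposal) synchronizes here and observes the failure:
println(Try(Await.result(handle, Duration.Inf))) // Failure(java.lang.RuntimeException: simulated job failure)
{code}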

> PySpark socket server should sync with JVM connection thread future
> ---
>
> Key: SPARK-27992
> URL: https://issues.apache.org/jira/browse/SPARK-27992
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Bryan Cutler
>Priority: Major
>
> Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark 
> job are not propagated to Python. This is because toLocalIterator() and 
> toPandas() with Arrow enabled run Spark jobs asynchronously in a background 
> thread, after creating the socket connection info. The fix for these was to 
> catch a SparkException if the job errored and then send the exception through 
> the pyspark serializer.
> A better fix would be to allow Python to synchronize on the serving thread 
> future. That way if the serving thread throws an exception, it will be 
> propagated on the synchronization call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-10 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27992:
-
Environment: (was: Both SPARK-27805 and SPARK-27548 identified an issue 
that errors in a Spark job are not propagated to Python. This is because 
toLocalIterator() and toPandas() with Arrow enabled run Spark jobs 
asynchronously in a background thread, after creating the socket connection 
info. The fix for these was to catch a SparkException if the job errored and 
then send the exception through the pyspark serializer.

A better fix would be to allow Python to synchronize on the serving thread 
future. That way if the serving thread throws an exception, it will be 
propagated on the synchronization call.)

> PySpark socket server should sync with JVM connection thread future
> ---
>
> Key: SPARK-27992
> URL: https://issues.apache.org/jira/browse/SPARK-27992
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Bryan Cutler
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-10 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-27992:


 Summary: PySpark socket server should sync with JVM connection 
thread future
 Key: SPARK-27992
 URL: https://issues.apache.org/jira/browse/SPARK-27992
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.3
 Environment: Both SPARK-27805 and SPARK-27548 identified an issue that 
errors in a Spark job are not propagated to Python. This is because 
toLocalIterator() and toPandas() with Arrow enabled run Spark jobs 
asynchronously in a background thread, after creating the socket connection 
info. The fix for these was to catch a SparkException if the job errored and 
then send the exception through the pyspark serializer.

A better fix would be to allow Python to synchronize on the serving thread 
future. That way if the serving thread throws an exception, it will be 
propagated on the synchronization call.
Reporter: Bryan Cutler






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27947) Enhance redactOptions to accept any Map type

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27947:
--
Priority: Major  (was: Minor)

> Enhance redactOptions to accept any Map type
> 
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
> TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives this warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  
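
A minimal Scala sketch of the direction the new summary suggests (accept any Map and stringify entries before redaction); illustrative only, not the actual Spark change:

{code:scala}
// Illustrative only: convert an arbitrary Map to Map[String, String] up front, so a
// `case mapArg: Map[_, _]` match never needs an unchecked asInstanceOf cast.
def toRedactable(mapArg: Map[_, _]): Map[String, String] =
  mapArg.map { case (k, v) => k.toString -> v.toString } // null keys/values not handled in this sketch

// e.g. toRedactable(Map("abc" -> 1)) == Map("abc" -> "1")
{code}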



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27947) Enhance redactOptions to accept any Map type

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27947.
---
   Resolution: Fixed
 Assignee: John Zhuge
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24800

> Enhance redactOptions to accept any Map type
> 
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Major
> Fix For: 3.0.0
>
>
> In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
> TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives this warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27947) Enhance redactOptions to accept any Map type

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27947:
--
Summary: Enhance redactOptions to accept any Map type  (was: 
ParsedStatement subclass toString may throw ClassCastException)

> Enhance redactOptions to accept any Map type
> 
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
> TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives this warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27947) Enhance redactOptions to accept any Map type

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27947:
--
Issue Type: Improvement  (was: Bug)

> Enhance redactOptions to accept any Map type
> 
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
> TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives this warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27845) DataSourceV2: Insert into tables in multiple catalogs

2019-06-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27845:


Assignee: (was: Apache Spark)

> DataSourceV2: Insert into tables in multiple catalogs
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl
>  * DataFrameWriter.insertInto("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27845) DataSourceV2: Insert into tables in multiple catalogs

2019-06-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27845:


Assignee: Apache Spark

> DataSourceV2: Insert into tables in multiple catalogs
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: Apache Spark
>Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl
>  * DataFrameWriter.insertInto("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27845) DataSourceV2: Insert into tables in multiple catalogs

2019-06-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860221#comment-16860221
 ] 

Apache Spark commented on SPARK-27845:
--

User 'jzhuge' has created a pull request for this issue:
https://github.com/apache/spark/pull/24832

> DataSourceV2: Insert into tables in multiple catalogs
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl
>  * DataFrameWriter.insertInto("catalog.db.tbl")
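
For illustration, the listed forms as they would appear in user code. All names are placeholders: an active SparkSession `spark`, a registered v2 catalog `testcat`, and a source table or view `src` are assumed.

{code:scala}
// Placeholder catalog/namespace/table names throughout.
spark.sql("INSERT INTO testcat.db.tbl SELECT * FROM src")
spark.sql("INSERT OVERWRITE TABLE testcat.db.tbl SELECT * FROM src")
spark.table("src").write.insertInto("testcat.db.tbl")
{code}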



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27845) DataSourceV2: Insert into tables in multiple catalogs

2019-06-10 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27845:
---
Summary: DataSourceV2: Insert into tables in multiple catalogs  (was: 
DataSourceV2: InsertInto multiple catalogs)

> DataSourceV2: Insert into tables in multiple catalogs
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl
>  * DataFrameWriter.insertInto("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27991) ShuffleBlockFetcherIterator should take Netty constant-factor overheads into account when limiting number of simultaneous block fetches

2019-06-10 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-27991:
--

 Summary: ShuffleBlockFetcherIterator should take Netty 
constant-factor overheads into account when limiting number of simultaneous 
block fetches
 Key: SPARK-27991
 URL: https://issues.apache.org/jira/browse/SPARK-27991
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 2.4.0
Reporter: Josh Rosen


ShuffleBlockFetcherIterator has logic to limit the number of simultaneous block 
fetches. By default, this logic tries to keep the number of outstanding block 
fetches [beneath a data size 
limit|https://github.com/apache/spark/blob/v2.4.3/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L274]
 ({{maxBytesInFlight}}). However, this limiting does not take fixed overheads 
into account: even though a remote block might be, say, 4KB, there are certain 
fixed-size internal overheads due to Netty buffer sizes which may cause the 
actual space requirements to be larger.

As a result, if a map stage produces a huge number of extremely tiny blocks 
then we may see errors like
{code:java}
org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 
byte(s) of direct memory (used: 39325794304, max: 39325794304)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485)
[...]
Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 
16777216 byte(s) of direct memory (used: 39325794304, max: 39325794304)
at 
io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
at 
io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
at 
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
[...]{code}
SPARK-24989 is another report of this problem (but with a different proposed 
fix).

This problem can currently be mitigated by setting 
{{spark.reducer.maxReqsInFlight}} to some non-IntMax value (SPARK-6166), 
but this additional manual configuration step is cumbersome.

Instead, I think that Spark should take these fixed overheads into account in 
the {{maxBytesInFlight}} calculation: instead of using blocks' actual sizes, 
use {{Math.max(blockSize, minimumNettyBufferSize)}}, i.e. count every block as at 
least the fixed buffer size. There might be some tricky 
details involved to make this work on all configurations (e.g. to use a 
different minimum when direct buffers are disabled, etc.), but I think the core 
idea behind the fix is pretty simple.

This will improve Spark's stability and removes configuration / tuning burden 
from end users.
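
A tiny Scala sketch of the proposed accounting; the names and the 64 KB constant below are illustrative assumptions, not Spark's actual values:

{code:scala}
// Illustrative only: budget every in-flight block at no less than the fixed Netty buffer size,
// so a huge number of tiny blocks cannot collectively exceed the real direct-memory footprint.
val assumedMinNettyBufferBytes: Long = 64 * 1024 // placeholder for the fixed per-fetch overhead

def effectiveFetchSize(blockSizeBytes: Long): Long =
  math.max(blockSizeBytes, assumedMinNettyBufferBytes)

// 10,000 blocks of 1 KB are counted against maxBytesInFlight as 10,000 * 64 KB, not 10,000 * 1 KB.
val budgetedBytes = (1 to 10000).map(_ => effectiveFetchSize(1024L)).sum
{code}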



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27635) Prevent from splitting too many partitions smaller than row group size in Parquet file format

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27635.
---
Resolution: Invalid

Please see the discussion on the PR.

> Prevent from splitting too many partitions smaller than row group size in 
> Parquet file format
> -
>
> Key: SPARK-27635
> URL: https://issues.apache.org/jira/browse/SPARK-27635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2, 3.0.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2019-05-05 at 5.45.15 PM.png
>
>
> The scenario is submitting multiple jobs concurrently with Spark dynamic 
> allocation enabled. The issue happens when determining RDD partition numbers. 
> When more CPU cores are available, Spark will try to split the RDD into more 
> pieces. But since the file is stored in Parquet format, Parquet's row group 
> is actually the basic unit for reading data, so splitting the RDD into too many 
> small pieces doesn't make sense.
> Jobs will launch too many partitions and never complete.
>  !Screen Shot 2019-05-05 at 5.45.15 PM.png! 
> Setting the default parallelism to a fixed number (for example 200) can work 
> around the issue.
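
A configuration sketch of the workaround mentioned above; the property names are standard Spark settings and the values are illustrative:

{code:scala}
// Illustrative values only: cap the split count and keep file splits around row-group size.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("row-group-friendly-splits")
  .config("spark.default.parallelism", "200")               // the fixed number suggested above
  .config("spark.sql.files.maxPartitionBytes", "134217728") // 128 MB splits, roughly one row group
  .getOrCreate()
{code}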



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-27635) Prevent from splitting too many partitions smaller than row group size in Parquet file format

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-27635.
-

> Prevent from splitting too many partitions smaller than row group size in 
> Parquet file format
> -
>
> Key: SPARK-27635
> URL: https://issues.apache.org/jira/browse/SPARK-27635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2, 3.0.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2019-05-05 at 5.45.15 PM.png
>
>
> The scenario is submitting multiple jobs concurrently with Spark dynamic 
> allocation enabled. The issue happens when determining RDD partition numbers: 
> when more CPU cores are available, Spark tries to split the RDD into more 
> pieces. But since the file is stored in Parquet format, Parquet's row group 
> is the basic unit for reading data, so splitting the RDD into too many small 
> pieces doesn't make sense.
> Jobs launch too many partitions and never complete.
>  !Screen Shot 2019-05-05 at 5.45.15 PM.png! 
> Setting the default parallelism to a fixed number (for example, 200) works 
> around the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27949) Support SUBSTRING(str FROM n1 [FOR n2]) syntax

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27949.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24802

> Support SUBSTRING(str FROM n1 [FOR n2]) syntax
> --
>
> Key: SPARK-27949
> URL: https://issues.apache.org/jira/browse/SPARK-27949
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Assignee: Zhu, Lipeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, the usage of `substr/substring` is 
> `substring(string_expression, n1 [,n2])`. 
> But ANSI SQL defines the pattern for substr/substring as 
> `SUBSTRING(str FROM n1 [FOR n2])`. This gap causes some inconvenience when we 
> switch to Spark SQL.
> - ANSI SQL-92: http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
> Below are the main DB engines that support the ANSI standard for substring.
> - PostgreSQL https://www.postgresql.org/docs/9.1/functions-string.html
> - MySQL 
> https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_substring
> - Redshift https://docs.aws.amazon.com/redshift/latest/dg/r_SUBSTRING.html
> - Teradata 
> https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/XnePye0Cwexw6Pny_qnxVA
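
As a quick illustration of the gap being closed, assuming a Spark 3.0.0+ 
spark-shell session where {{spark}} is already defined; the literal is only an 
example:
{code:scala}
// Existing comma-separated form:
spark.sql("SELECT substring('Spark SQL', 3, 5)").show()        // ark S
// ANSI FROM/FOR form added by this ticket:
spark.sql("SELECT substring('Spark SQL' FROM 3 FOR 5)").show() // ark S
{code}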



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed

2019-06-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860014#comment-16860014
 ] 

Stavros Kontopoulos edited comment on SPARK-25053 at 6/10/19 1:59 PM:
--

Sure [~dongjoon]. With the pod template feature all containers can be 
customized to expose the ports needed. [~holdenk] if you had something else in 
mind with this issue we can re-open it (e.g. ports exposed via the service).


was (Author: skonto):
Sure [~dongjoon]. With the pod template feature all containers can be 
customized to expose the ports needed. [~holdenk] if you describe something 
else with this issue we can re-open it (eg. service).

> Allow additional port forwarding on Spark on K8S as needed
> --
>
> Key: SPARK-25053
> URL: https://issues.apache.org/jira/browse/SPARK-25053
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> In some cases, like setting up remote debuggers, adding additional ports to 
> be forwarded would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed

2019-06-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860014#comment-16860014
 ] 

Stavros Kontopoulos edited comment on SPARK-25053 at 6/10/19 1:58 PM:
--

Sure [~dongjoon]. With the pod template feature all containers can be 
customized to expose the ports needed. [~holdenk] if you describe something 
else with this issue we can re-open it (eg. service).


was (Author: skonto):
Sure [~dongjoon]. With the pod template feature all containers can be 
customized to expose the ports needed. [~holdenk] if you describe something 
else with this issue we can re-open it.

> Allow additional port forwarding on Spark on K8S as needed
> --
>
> Key: SPARK-25053
> URL: https://issues.apache.org/jira/browse/SPARK-25053
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> In some cases, like setting up remote debuggers, adding additional ports to 
> be forwarded would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed

2019-06-10 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos resolved SPARK-25053.
-
Resolution: Duplicate

> Allow additional port forwarding on Spark on K8S as needed
> --
>
> Key: SPARK-25053
> URL: https://issues.apache.org/jira/browse/SPARK-25053
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> In some cases, like setting up remote debuggers, adding additional ports to 
> be forwarded would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed

2019-06-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860014#comment-16860014
 ] 

Stavros Kontopoulos edited comment on SPARK-25053 at 6/10/19 1:57 PM:
--

Sure [~dongjoon]. With the pod template feature all containers can be 
customized to expose the ports needed. [~holdenk] if you describe something 
else with the description we can re-open it.


was (Author: skonto):
Sure [~dongjoon]. With the pod template feature all containers can be 
customized to expose the ports needed. 
[SPARK-24434|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24434]. 
[~holdenk] if you describe something else with the description we can re-open 
it.

> Allow additional port forwarding on Spark on K8S as needed
> --
>
> Key: SPARK-25053
> URL: https://issues.apache.org/jira/browse/SPARK-25053
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> In some cases, like setting up remote debuggers, adding additional ports to 
> be forwarded would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed

2019-06-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860014#comment-16860014
 ] 

Stavros Kontopoulos edited comment on SPARK-25053 at 6/10/19 1:57 PM:
--

Sure [~dongjoon]. With the pod template feature all containers can be 
customized to expose the ports needed. [~holdenk] if you describe something 
else with this issue we can re-open it.


was (Author: skonto):
Sure [~dongjoon]. With the pod template feature all containers can be 
customized to expose the ports needed. [~holdenk] if you describe something 
else with the description we can re-open it.

> Allow additional port forwarding on Spark on K8S as needed
> --
>
> Key: SPARK-25053
> URL: https://issues.apache.org/jira/browse/SPARK-25053
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> In some cases, like setting up remote debuggers, adding additional ports to 
> be forwarded would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed

2019-06-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860014#comment-16860014
 ] 

Stavros Kontopoulos edited comment on SPARK-25053 at 6/10/19 1:56 PM:
--

Sure [~dongjoon]. With the pod template feature all containers can be 
customized to expose the ports needed. 
[SPARK-24434|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24434]. 
[~holdenk] if you describe something else with the description we can re-open 
it.


was (Author: skonto):
Sure [~dongjoon]. With the pod template feature all containers can be 
customized to expose the ports needed. 
[SPARK-24434|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24434]. 
[~holdenk] if you describe something else with the description let me 
know,otherwise I can close it as duplicate.

> Allow additional port forwarding on Spark on K8S as needed
> --
>
> Key: SPARK-25053
> URL: https://issues.apache.org/jira/browse/SPARK-25053
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> In some cases, like setting up remote debuggers, adding additional ports to 
> be forwarded would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed

2019-06-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860014#comment-16860014
 ] 

Stavros Kontopoulos edited comment on SPARK-25053 at 6/10/19 1:55 PM:
--

Sure [~dongjoon]. With the pod template feature all containers can be 
customized to expose the ports needed. 
[SPARK-24434|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24434]. 
[~holdenk] if you describe something else with the description let me 
know,otherwise I can close it as duplicate.


was (Author: skonto):
Sure [~dongjoon]. With the pod template feature all containers can be 
customized to expose the ports needed. 
[SPARK-24434|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24434]. 
[~holdenk] if you describe something with the description let me know so I can 
close it as duplicate.

> Allow additional port forwarding on Spark on K8S as needed
> --
>
> Key: SPARK-25053
> URL: https://issues.apache.org/jira/browse/SPARK-25053
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> In some cases, like setting up remote debuggers, adding additional ports to 
> be forwarded would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed

2019-06-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860014#comment-16860014
 ] 

Stavros Kontopoulos commented on SPARK-25053:
-

Sure [~dongjoon]. With the pod template feature all containers can be 
customized to expose the ports needed. 
[SPARK-24434|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24434]. 
[~holdenk] if you describe something with the description let me know so I can 
close it as duplicate.

> Allow additional port forwarding on Spark on K8S as needed
> --
>
> Key: SPARK-25053
> URL: https://issues.apache.org/jira/browse/SPARK-25053
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> In some cases, like setting up remote debuggers, adding additional ports to 
> be forwarded would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27708) Add documentation for v2 data sources

2019-06-10 Thread Jacek Laskowski (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860013#comment-16860013
 ] 

Jacek Laskowski commented on SPARK-27708:
-

[~rdblue] Mind if I ask you to update the requirements (= answer my 
questions)? Thanks.

> Add documentation for v2 data sources
> -
>
> Key: SPARK-27708
> URL: https://issues.apache.org/jira/browse/SPARK-27708
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: documentation
>
> Before the 3.0 release, the new v2 data sources should be documented. This 
> includes:
>  * How to plug in catalog implementations
>  * Catalog plugin configuration
>  * Multi-part identifier behavior
>  * Partition transforms
>  * Table properties that are used to pass table info (e.g. "provider")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27930) Add built-in Math Function: RANDOM

2019-06-10 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859628#comment-16859628
 ] 

Yuming Wang edited comment on SPARK-27930 at 6/10/19 12:54 PM:
---

Workaround:
{code:sql}
select rand()
{code}
{code:sql}
select reflect("java.lang.Math", "random")
{code}


was (Author: q79969786):
Workaround:
{code:sql}
select reflect("java.lang.Math", "random")
{code}

> Add built-in Math Function: RANDOM
> --
>
> Key: SPARK-27930
> URL: https://issues.apache.org/jira/browse/SPARK-27930
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> The RANDOM function generates a random value between 0.0 and 1.0. Syntax:
> {code:sql}
> RANDOM()
> {code}
> More details:
> https://www.postgresql.org/docs/12/functions-math.html
> Other DBs:
> https://docs.aws.amazon.com/redshift/latest/dg/r_RANDOM.html
> https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Mathematical/RANDOM.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CMathematical%20Functions%7C_24



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27425) Add count_if functions

2019-06-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27425.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24335
[https://github.com/apache/spark/pull/24335]

> Add count_if functions
> --
>
> Key: SPARK-27425
> URL: https://issues.apache.org/jira/browse/SPARK-27425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Chaerim Yeo
>Assignee: Chaerim Yeo
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add an aggregation function which returns the number of records satisfying a 
> given condition.
> Presto supports the 
> [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html]
>  function, so such counts can be written concisely.
> However, Spark does not support it yet, so we need to write something like 
> {{COUNT(CASE WHEN some_condition THEN 1 END)}} or {{SUM(CASE WHEN 
> some_condition THEN 1 END)}}, which is painful.
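
A small illustration of the difference, assuming a Spark 3.0.0+ spark-shell 
session; the inline table is just an example:
{code:scala}
// With the new aggregate:
spark.sql("SELECT count_if(x % 2 = 0) FROM VALUES (1), (2), (4) AS t(x)").show()
// Pre-3.0 workaround described above:
spark.sql("SELECT count(CASE WHEN x % 2 = 0 THEN 1 END) FROM VALUES (1), (2), (4) AS t(x)").show()
{code}
Both queries return 2.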



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27425) Add count_if functions

2019-06-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27425:


Assignee: Chaerim Yeo

> Add count_if functions
> --
>
> Key: SPARK-27425
> URL: https://issues.apache.org/jira/browse/SPARK-27425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Chaerim Yeo
>Assignee: Chaerim Yeo
>Priority: Minor
>
> Add an aggregation function which returns the number of records satisfying a 
> given condition.
> Presto supports the 
> [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html]
>  function, so such counts can be written concisely.
> However, Spark does not support it yet, so we need to write something like 
> {{COUNT(CASE WHEN some_condition THEN 1 END)}} or {{SUM(CASE WHEN 
> some_condition THEN 1 END)}}, which is painful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier

2019-06-10 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859889#comment-16859889
 ] 

zhengruifeng commented on SPARK-27018:
--

[~pkolaczk]  With the code you provided, I reproduced this failure. 

Moreover, I suspect this bug may also affect computation in a distributed 
environment.

I also encountered a similar case on a cluster; I will look into it.
{code:java}
java.io.FileNotFoundException: File does not exist: 
/tmp/sparkGBM/application_1551338088092_2518369/checkpoints/edbd13db-b61f-445a-8703-691acd595d62/rdd-46484/_partitioner
    at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
    at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
    at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1929)
    at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1900)
    at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1803)
    at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:604)
    at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:388)
    at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:624)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2094)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2090)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1803)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090)

    at sun.reflect.GeneratedConstructorAccessor79.newInstance(Unknown 
Source)
    at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
    at 
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1268)
    at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1253)
    at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1241)
    at 
org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
    at 
org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
    at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:261)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1566)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:303)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
    at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:299)
    at 
org.apache.spark.rdd.ReliableCheckpointRDD$.org$apache$spark$rdd$ReliableCheckpointRDD$$readCheckpointedPartitionerFile(ReliableCheckpointRDD.scala:255)
    at 
org.apache.spark.rdd.ReliableCheckpointRDD$$anonfun$3.apply(ReliableCheckpointRDD.scala:59)
    at 
org.apache.spark.rdd.ReliableCheckpointRDD$$anonfun$3.apply(ReliableCheckpointRDD.scala:59)
    at scala.Option.orElse(Option.scala:289)
    at 
org.apache.spark.rdd.ReliableCheckpointRDD.<init>(ReliableCheckpointRDD.scala:58)
    at 
org.apache.spark.rdd.ReliableCheckpointRDD$.writeRDDToCheckpointDirectory(ReliableCheckpointRDD.scala:151)
    at 
org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:58)
    at 
org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:75)
    at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply$mcV$sp(RDD.scala:1734)
    at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1724)
    at 
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1724){code}

> Checkpointed RDD deleted prematurely when using GBTClassifier
> -
>
> Key: SPARK-27018
> URL: 

[jira] [Commented] (SPARK-27400) LinearSVC only supports binary classification

2019-06-10 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859878#comment-16859878
 ] 

zhengruifeng commented on SPARK-27400:
--

According to the current design, LinearSVC only supports binary classification.

To handle multi-class cases, you may try the one-vs-rest meta-algorithm, as in 
the sketch below.
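
For illustration, a minimal sketch of that suggestion; the column names, 
hyperparameters and the {{trainingData}} DataFrame are assumptions about the 
caller's setup:
{code:scala}
import org.apache.spark.ml.classification.{LinearSVC, OneVsRest}

// Wrap the binary LinearSVC in a one-vs-rest meta-classifier so it can be
// trained against a multi-class label column.
val svc = new LinearSVC().setMaxIter(50).setRegParam(0.1)
val ovr = new OneVsRest().setClassifier(svc)
val ovrModel = ovr.fit(trainingData) // trainingData has "features"/"label" columns
{code}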

> LinearSVC only supports binary classification
> -
>
> Key: SPARK-27400
> URL: https://issues.apache.org/jira/browse/SPARK-27400
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.1
>Reporter: baris
>Priority: Major
>
> IllegalArgumentException: u'requirement failed: LinearSVC only supports 
> binary classification. 99 classes detected in 
> LinearSVC_6596220b55a3__labelCol'



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27990) Provide a way to recursively load data from datasource

2019-06-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27990:


Assignee: (was: Apache Spark)

> Provide a way to recursively load data from datasource
> --
>
> Key: SPARK-27990
> URL: https://issues.apache.org/jira/browse/SPARK-27990
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Priority: Major
>
> Provide a way to recursively load data from datasource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27867) RegressionEvaluator cache lastest RegressionMetrics to avoid duplicated computation

2019-06-10 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-27867.
--
Resolution: Not A Problem

> RegressionEvaluator cache lastest RegressionMetrics to avoid duplicated 
> computation
> ---
>
> Key: SPARK-27867
> URL: https://issues.apache.org/jira/browse/SPARK-27867
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> In most cases, given a model, we have to obtain multiple metrics for it.
> For example, for a regression model we may need to obtain R2, MAE and MSE.
> However, the current design of `Evaluator` does not support computing 
> multiple metrics at once.
> In practice, we usually use RegressionEvaluator like this:
> {code:java}
> val evaluator = new RegressionEvaluator()
> val r2 = evaluator.setMetricName("r2").evaluate(df)
> val mae = evaluator.setMetricName("mae").evaluate(df)
> val mse = evaluator.setMetricName("mse").evaluate(df){code}
>  
> However, the current implementation of RegressionEvaluator needs one pass 
> over the whole input dataset to compute a single metric, so the example above 
> needs 3 passes.
> This can be optimized since {{RegressionMetrics}} can compute all metrics at 
> once.
> If we cache the latest inputs, and the next evaluate call keeps the same 
> inputs (except the metricName), then we can obtain the metric directly from 
> the internal intermediate summary.
>  
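
For comparison, a hedged sketch of getting all three metrics in one pass via the 
RDD-based {{RegressionMetrics}}; the column names of {{df}} are assumptions:
{code:scala}
import org.apache.spark.mllib.evaluation.RegressionMetrics

// A single pass over the data yields every metric at once.
val predictionAndLabel = df.select("prediction", "label")
  .rdd.map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new RegressionMetrics(predictionAndLabel)
val (r2, mae, mse) = (metrics.r2, metrics.meanAbsoluteError, metrics.meanSquaredError)
{code}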



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27990) Provide a way to recursively load data from datasource

2019-06-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27990:


Assignee: Apache Spark

> Provide a way to recursively load data from datasource
> --
>
> Key: SPARK-27990
> URL: https://issues.apache.org/jira/browse/SPARK-27990
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Major
>
> Provide a way to recursively load data from datasource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27990) Provide a way to recursively load data from datasource

2019-06-10 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-27990:
--

 Summary: Provide a way to recursively load data from datasource
 Key: SPARK-27990
 URL: https://issues.apache.org/jira/browse/SPARK-27990
 Project: Spark
  Issue Type: New Feature
  Components: ML, SQL
Affects Versions: 2.4.3
Reporter: Weichen Xu


Provide a way to recursively load data from datasource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27303) PropertyGraph construction (Scala/Java)

2019-06-10 Thread Martin Junghanns (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859853#comment-16859853
 ] 

Martin Junghanns commented on SPARK-27303:
--

[~mengxr] I will work on this, you can assign me. It seems my permissions to 
assign myself got lost again :/

> PropertyGraph construction (Scala/Java)
> ---
>
> Key: SPARK-27303
> URL: https://issues.apache.org/jira/browse/SPARK-27303
> Project: Spark
>  Issue Type: Story
>  Components: Graph
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> As a user, I can construct a PropertyGraph and view its nodes and 
> relationships as DataFrames.
> Required:
> * Scala API to construct a PropertyGraph.
> * Scala API to view nodes and relationships as DataFrames.
> * Scala/Java test suites.
> Out of scope:
> * Cypher queries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27989) Add retries on the connection to the driver

2019-06-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27989:


Assignee: (was: Apache Spark)

> Add retries on the connection to the driver
> ---
>
> Key: SPARK-27989
> URL: https://issues.apache.org/jira/browse/SPARK-27989
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Priority: Minor
>
> Java caches negative DNS resolutions, so failed requests are never retried.
> Any DNS failure when trying to connect to the driver therefore makes a 
> connection from that process impossible.
> This happens especially in Kubernetes, where the network setup of pods can 
> take some time.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27989) Add retries on the connection to the driver

2019-06-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859833#comment-16859833
 ] 

Apache Spark commented on SPARK-27989:
--

User 'jlpedrosa' has created a pull request for this issue:
https://github.com/apache/spark/pull/24702

> Add retries on the connection to the driver
> ---
>
> Key: SPARK-27989
> URL: https://issues.apache.org/jira/browse/SPARK-27989
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Priority: Minor
>
> Java caches negative DNS resolutions, so failed requests are never retried.
> Any DNS failure when trying to connect to the driver therefore makes a 
> connection from that process impossible.
> This happens especially in Kubernetes, where the network setup of pods can 
> take some time.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27989) Add retries on the connection to the driver

2019-06-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27989:


Assignee: Apache Spark

> Add retries on the connection to the driver
> ---
>
> Key: SPARK-27989
> URL: https://issues.apache.org/jira/browse/SPARK-27989
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Assignee: Apache Spark
>Priority: Minor
>
> Java caches negative DNS resolutions, so failed requests are never retried.
> Any DNS failure when trying to connect to the driver therefore makes a 
> connection from that process impossible.
> This happens especially in Kubernetes, where the network setup of pods can 
> take some time.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27989) Add retries on the connection to the driver

2019-06-10 Thread Jose Luis Pedrosa (JIRA)
Jose Luis Pedrosa created SPARK-27989:
-

 Summary: Add retries on the connection to the driver
 Key: SPARK-27989
 URL: https://issues.apache.org/jira/browse/SPARK-27989
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Spark Core
Affects Versions: 2.4.3
Reporter: Jose Luis Pedrosa


Java caches negative DNS resolutions, so failed requests are never retried.

Any DNS failure when trying to connect to the driver therefore makes a 
connection from that process impossible.

This happens especially in Kubernetes, where the network setup of pods can take 
some time.
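
A minimal sketch of the kind of mitigation meant here, assuming it is acceptable 
to tune the JVM-wide DNS cache and to wrap the first resolution of the driver 
host in a bounded retry; the retry helper is hypothetical, only the security 
property name is standard JDK:
{code:scala}
import java.net.{InetAddress, UnknownHostException}
import java.security.Security

// Do not cache failed lookups at all (the JDK default keeps them for 10 seconds).
Security.setProperty("networkaddress.cache.negative.ttl", "0")

// Bounded retry around the first resolution of the driver host.
def resolveWithRetry(host: String, attempts: Int = 5, delayMs: Long = 2000): InetAddress =
  try InetAddress.getByName(host)
  catch {
    case _: UnknownHostException if attempts > 1 =>
      Thread.sleep(delayMs)
      resolveWithRetry(host, attempts - 1, delayMs)
  }
{code}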

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27949) Support SUBSTRING(str FROM n1 [FOR n2]) syntax

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27949:
--
Description: 
Currently, the usage of `substr/substring` is 
`substring(string_expression, n1 [,n2])`. 

But ANSI SQL defines the pattern for substr/substring as 
`SUBSTRING(str FROM n1 [FOR n2])`. This gap causes some inconvenience when we 
switch to Spark SQL.

- ANSI SQL-92: http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

Below are the main DB engines that support the ANSI standard for substring.
- PostgreSQL https://www.postgresql.org/docs/9.1/functions-string.html
- MySQL 
https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_substring
- Redshift https://docs.aws.amazon.com/redshift/latest/dg/r_SUBSTRING.html
- Teradata 
https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/XnePye0Cwexw6Pny_qnxVA


  was:
Currently, function substr/substring's usage is like 
substring(string_expression, n1 [,n2]). 

But the ANSI SQL defined the pattern for substr/substring is like 
substring(string_expression from n1 [for n2]). This gap make some inconvenient 
when we switch to the SparkSQL.

Can we support the ANSI pattern like substring(string_expression from n1 [for 
n2])?

 

[http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]


> Support SUBSTRING(str FROM n1 [FOR n2]) syntax
> --
>
> Key: SPARK-27949
> URL: https://issues.apache.org/jira/browse/SPARK-27949
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Assignee: Zhu, Lipeng
>Priority: Minor
>
> Currently, the usage of `substr/substring` is 
> `substring(string_expression, n1 [,n2])`. 
> But ANSI SQL defines the pattern for substr/substring as 
> `SUBSTRING(str FROM n1 [FOR n2])`. This gap causes some inconvenience when we 
> switch to Spark SQL.
> - ANSI SQL-92: http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
> Below are the main DB engines that support the ANSI standard for substring.
> - PostgreSQL https://www.postgresql.org/docs/9.1/functions-string.html
> - MySQL 
> https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_substring
> - Redshift https://docs.aws.amazon.com/redshift/latest/dg/r_SUBSTRING.html
> - Teradata 
> https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/XnePye0Cwexw6Pny_qnxVA



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27949) Support SUBSTRING(str FROM n1 [FOR n2]) syntax

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27949:
-

Assignee: Zhu, Lipeng

> Support SUBSTRING(str FROM n1 [FOR n2]) syntax
> --
>
> Key: SPARK-27949
> URL: https://issues.apache.org/jira/browse/SPARK-27949
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Assignee: Zhu, Lipeng
>Priority: Minor
>
> Currently, the usage of substr/substring is 
> substring(string_expression, n1 [,n2]). 
> But ANSI SQL defines the pattern for substr/substring as 
> substring(string_expression from n1 [for n2]). This gap causes some 
> inconvenience when we switch to Spark SQL.
> Can we support the ANSI pattern substring(string_expression from n1 [for 
> n2])?
>  
> [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27949) Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27949:
--
Affects Version/s: (was: 3.1.0)
   3.0.0

> Support ANSI SQL grammar `substring(string_expression from n1 [for n2])`
> 
>
> Key: SPARK-27949
> URL: https://issues.apache.org/jira/browse/SPARK-27949
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Minor
>
> Currently, the usage of substr/substring is 
> substring(string_expression, n1 [,n2]). 
> But ANSI SQL defines the pattern for substr/substring as 
> substring(string_expression from n1 [for n2]). This gap causes some 
> inconvenience when we switch to Spark SQL.
> Can we support the ANSI pattern substring(string_expression from n1 [for 
> n2])?
>  
> [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27949) Support SUBSTRING(str FROM n1 [FOR n2]) syntax

2019-06-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27949:
--
Summary: Support SUBSTRING(str FROM n1 [FOR n2]) syntax  (was: Support ANSI 
SQL grammar `substring(string_expression from n1 [for n2])`)

> Support SUBSTRING(str FROM n1 [FOR n2]) syntax
> --
>
> Key: SPARK-27949
> URL: https://issues.apache.org/jira/browse/SPARK-27949
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Minor
>
> Currently, the usage of substr/substring is 
> substring(string_expression, n1 [,n2]). 
> But ANSI SQL defines the pattern for substr/substring as 
> substring(string_expression from n1 [for n2]). This gap causes some 
> inconvenience when we switch to Spark SQL.
> Can we support the ANSI pattern substring(string_expression from n1 [for 
> n2])?
>  
> [http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org