[jira] [Commented] (SPARK-28360) The serviceAccountName configuration item does not take effect in client mode.

2019-09-23 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936207#comment-16936207
 ] 

holdenk commented on SPARK-28360:
-

Don't we need a service account name to create the executor pods?

> The serviceAccountName configuration item does not take effect in client mode.
> --
>
> Key: SPARK-28360
> URL: https://issues.apache.org/jira/browse/SPARK-28360
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: zhixingheyi_tian
>Priority: Major
>
> From the configuration item description in the Spark documentation 
> (https://spark.apache.org/docs/latest/running-on-kubernetes.html):
>  
> “spark.kubernetes.authenticate.driver.serviceAccountName default Service 
> account that is used when running the driver pod. The driver pod uses this 
> service account when requesting executor pods from the API server. Note that 
> this cannot be specified alongside a CA cert file, client key file, client 
> cert file, and/or OAuth token. In client mode, use 
> spark.kubernetes.authenticate.serviceAccountName instead.”
> But in client mode, “spark.kubernetes.authenticate.serviceAccountName” does 
> not in fact take effect.
> From an analysis of the source code, Spark never reads the configuration item 
> "spark.kubernetes.authenticate.serviceAccountName".
>  The unit tests only cover 
> "spark.kubernetes.authenticate.driver.serviceAccountName".
> In Kubernetes, a service account provides an identity for processes that run 
> in a pod. When you create a pod, if you do not specify a service account, it 
> is automatically assigned the default service account in the same namespace. 
>  Setting “spec.serviceAccountName” when creating a pod specifies a custom 
> service account.
>  So in client mode, if you run your driver inside a Kubernetes pod, the 
> service account already exists. If your application is not running inside 
> a pod, no service account is needed at all.
> In my view, we should just update the documentation and remove the 
> "spark.kubernetes.authenticate.serviceAccountName" configuration item 
> description: it doesn't work at the moment, and it doesn't need to work.
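For reference, a minimal sketch of the two properties discussed above as they would be set through SparkConf; the service account name "spark" is illustrative only, and (per this report) the client-mode property is currently never read:

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Cluster mode: service account the driver pod uses to request executor pods.
  .set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
  // Client mode: the documented alternative, which this report says is ignored.
  .set("spark.kubernetes.authenticate.serviceAccountName", "spark")
{code}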






[jira] [Comment Edited] (SPARK-28362) Error communicating with MapOutputTracker when many tasks are launched concurrently

2019-09-23 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936206#comment-16936206
 ] 

holdenk edited comment on SPARK-28362 at 9/23/19 9:34 PM:
--

Why is your default parallelism configured to `149 * 13 (cores) * 20 = 38740`? 
That seems maybe a bit high for 149 physical machines.


was (Author: holdenk):
Why is your default parallelism configured to `49 * 13 (cores) * 20 = 38740`? 
That seems maybe a bit high for 49 physical machines.
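As a rough reference only (values are illustrative, not a recommendation from this thread), the two knobs in play here are the default parallelism and the RPC ask timeout that appears in the stack trace below:

{code:scala}
import org.apache.spark.SparkConf

// Illustrative values: lower parallelism and/or a longer ask timeout are the usual
// levers when a large wave of tasks hits the MapOutputTracker at once.
val conf = new SparkConf()
  .set("spark.default.parallelism", (149 * 13 * 2).toString) // e.g. 2x cores instead of 20x
  .set("spark.rpc.askTimeout", "240s")                       // the trace shows the 120s default expiring
{code}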

> Error communicating with MapOutputTracker when many tasks are launched 
> concurrently
> ---
>
> Key: SPARK-28362
> URL: https://issues.apache.org/jira/browse/SPARK-28362
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.2
> Environment: AWS EMR 5.24.0 with Yarn, Spark 2.4.2 and Beam 2.12.0
>Reporter: Peter Backx
>Priority: Major
>
> It looks like the scheduler is unknowingly creating a DoS attack on the 
> MapOutputTracker when many tasks are launched at the same time.
> We are running a Beam on Spark job on AWS EMR 5.24.0 (Yarn, Spark 2.4.2, Beam 
> 2.12.0)
> The job is running on 150 r4.4xlarge machines in cluster mode with executors 
> sized to take up the full machine. So we have 1 machine acting as driver and 
> 149 executors. Default parallelism is 149 * 13 (cores) * 20 = 38740
> When a new stage is launched, sometimes tasks will error out with the 
> message: "Error communicating with MapOutputTracker". 
> I've gone over the logs and it looks like the following is happening:
>  # When the final task of a previous stage completes, the driver launches new 
> tasks (driver log):
>  ## 19/07/10 14:04:57 INFO DAGScheduler: Submitting 38740 missing tasks from 
> ShuffleMapStage 29 (MapPartitionsRDD[632] at mapToPair at 
> GroupCombineFunctions.java:147) (first 15 tasks are for partitions Vector(0, 
> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
>  # Executors use the MapOutputTracker to fetch the location of the data they 
> need to work on (executor log):
>  ## 19/07/10 14:04:57 INFO MapOutputTrackerWorker: Don't have map outputs for 
> shuffle 36, fetching them
>  19/07/10 14:04:57 INFO MapOutputTrackerWorker: Doing the fetch; tracker 
> endpoint = 
> NettyRpcEndpointRef(spark://mapoutputtrac...@ip-172-28-95-2.eu-west-1.compute.internal:42033)
>  # Usually all executors time out after 2 minutes. In rare cases some of the 
> executors seem to receive a reply (executor log):
>  ## 19/07/10 14:06:57 ERROR MapOutputTrackerWorker: Error communicating with 
> MapOutputTracker
>  org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
>  at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
>  # Driver marks the task as failed and retries (driver log):
>  ## 19/07/10 14:06:57 WARN TaskSetManager: Lost task 1490.0 in stage 29.0 
> (TID 2724105, ip-172-28-94-245.eu-west-1.compute.internal, executor 248): 
> org.apache.spark.SparkException: Error communicating with MapOutputTracker
>  at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:270)
> I can't find any log with the reason why the executors don't get a reply from 
> the MapOutputTracker. 
> So my questions:
>  * Is there a separate log file for the MapOutputTracker where I can find 
> more info?
>  * Is the parallelism set too high? It seems to be fine for the rest of the 
> job.
>  * Is there anything else we can do? Is there maybe a way to stagger the task 
> launches so they don't happen all at once?
> This is not super critical, but I'd like to get rid of the errors. It happens 
> 2 or 3 times during a 10 hour job and retries always work correctly.






[jira] [Commented] (SPARK-28362) Error communicating with MapOutputTracker when many tasks are launched concurrently

2019-09-23 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936206#comment-16936206
 ] 

holdenk commented on SPARK-28362:
-

Why is your default parallelism configured to `49 * 13 (cores) * 20 = 38740`? 
That seems maybe a bit high for 49 physical machines.

> Error communicating with MapOutputTracker when many tasks are launched 
> concurrently
> ---
>
> Key: SPARK-28362
> URL: https://issues.apache.org/jira/browse/SPARK-28362
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.2
> Environment: AWS EMR 5.24.0 with Yarn, Spark 2.4.2 and Beam 2.12.0
>Reporter: Peter Backx
>Priority: Major
>
> It looks like the scheduler is unknowingly creating a DoS attack on the 
> MapOutputTracker when many tasks are launched at the same time.
> We are running a Beam on Spark job on AWS EMR 5.24.0 (Yarn, Spark 2.4.2, Beam 
> 2.12.0)
> The job is running on 150 r4.4xlarge machines in cluster mode with executors 
> sized to take up the full machine. So we have 1 machine acting as driver and 
> 149 executors. Default parallelism is 149 * 13 (cores) * 20 = 38740
> When a new stage is launched, sometimes tasks will error out with the 
> message: "Error communicating with MapOutputTracker". 
> I've gone over the logs and it looks like the following is happening:
>  # When the final task of a previous stage completes, the driver launches new 
> tasks (driver log):
>  ## 19/07/10 14:04:57 INFO DAGScheduler: Submitting 38740 missing tasks from 
> ShuffleMapStage 29 (MapPartitionsRDD[632] at mapToPair at 
> GroupCombineFunctions.java:147) (first 15 tasks are for partitions Vector(0, 
> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
>  # Executors use the MapOutputTracker to fetch the location of the data they 
> need to work on (executor log):
>  ## 19/07/10 14:04:57 INFO MapOutputTrackerWorker: Don't have map outputs for 
> shuffle 36, fetching them
>  19/07/10 14:04:57 INFO MapOutputTrackerWorker: Doing the fetch; tracker 
> endpoint = 
> NettyRpcEndpointRef(spark://mapoutputtrac...@ip-172-28-95-2.eu-west-1.compute.internal:42033)
>  # Usually all executors time out after 2 minutes. In rare cases some of the 
> executors seem to receive a reply (executor log):
>  ## 19/07/10 14:06:57 ERROR MapOutputTrackerWorker: Error communicating with 
> MapOutputTracker
>  org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
>  at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
>  # Driver marks the task as failed and retries (driver log):
>  ## 19/07/10 14:06:57 WARN TaskSetManager: Lost task 1490.0 in stage 29.0 
> (TID 2724105, ip-172-28-94-245.eu-west-1.compute.internal, executor 248): 
> org.apache.spark.SparkException: Error communicating with MapOutputTracker
>  at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:270)
> I can't find any log with the reason why the executors don't get a reply from 
> the MapOutputTracker. 
> So my questions:
>  * Is there a separate log file for the MapOutputTracker where I can find 
> more info?
>  * Is the parallelism set too high? It seems to be fine for the rest of the 
> job.
>  * Is there anything else we can do? Is there maybe a way to stagger the task 
> launches so they don't happen all at once?
> This is not super critical, but I'd like to get rid of the errors. It happens 
> 2 or 3 times during a 10 hour job and retries always work correctly.






[jira] [Updated] (SPARK-28403) Executor Allocation Manager can add an extra executor when speculative tasks

2019-09-23 Thread holdenk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-28403:

Shepherd: holdenk

> Executor Allocation Manager can add an extra executor when speculative tasks
> 
>
> Key: SPARK-28403
> URL: https://issues.apache.org/jira/browse/SPARK-28403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> It looks like SPARK-19326 added a bug in the executor allocation manager 
> where it adds an extra executor when it shouldn't: when we have pending 
> speculative tasks but the target number didn't change. 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L377]
> It doesn't look like this is necessary since the pending speculative tasks 
> are already added in.
> See the questioning of this on the PR at:
> https://github.com/apache/spark/pull/18492/files#diff-b096353602813e47074ace09a3890d56R379






[jira] [Commented] (SPARK-28517) pyspark with --conf spark.jars.packages causes duplicate jars to be uploaded

2019-09-23 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936201#comment-16936201
 ] 

holdenk commented on SPARK-28517:
-

cc [~bryanc] / [~ifilonenko]

> pyspark with --conf spark.jars.packages causes duplicate jars to be uploaded
> 
>
> Key: SPARK-28517
> URL: https://issues.apache.org/jira/browse/SPARK-28517
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 2.4.3
> Environment: spark 2.4.3_2.12 without hadoop
> yarn 2.6
> python 2.7.16
> centos 7
>Reporter: Barry
>Priority: Major
>  Labels: ivy, pyspark, yarn
>
> h2. Steps to reproduce:
> {{spark-submit --master yarn --conf 
> "spark.jars.packages=org.apache.spark:spark-avro_2.12:2.4.3" 
> ${SPARK_HOME}/examples/src/main/python/pi.py 100}}
> h2. Undesirable behavior:
> warnings are printed that package jars have been added to the distributed cache 
> multiple times:
> {{19/07/25 23:25:07 WARN Client: Same path resource 
> file:///home/barryl/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.3.jar 
> added multiple times to distributed cache.}}
> {{19/07/25 23:25:07 WARN Client: Same path resource 
> file:///home/barryl/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added 
> multiple times to distributed cache.}}
> This does not happen for Scala jobs, only Pyspark
>  
> h2. Full output of example run.
> {{[barryl@hostname ~]$ /opt/spark2/bin/spark-submit --master yarn --conf 
> "spark.jars.packages=org.apache.spark:spark-avro_2.12:2.4.3" 
> /opt/spark2/examples/src/main/python/pi.py 100}}
> {{Ivy Default Cache set to: /home/barryl/.ivy2/cache}}
> {{The jars for the packages stored in: /home/barryl/.ivy2/jars}}
> {{:: loading settings :: url = 
> jar:file:/opt/spark-2.4.3-bin-without-hadoop-scala-2.12/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml}}
> {{org.apache.spark#spark-avro_2.12 added as a dependency}}
> {{:: resolving dependencies :: 
> org.apache.spark#spark-submit-parent-2c34ecff-b060-4af9-9b9f-83867672748c;1.0}}
> {{    confs: [default]}}
> {{    found org.apache.spark#spark-avro_2.12;2.4.3 in central}}
> {{    found org.spark-project.spark#unused;1.0.0 in central}}
> {{:: resolution report :: resolve 457ms :: artifacts dl 5ms}}
> {{    :: modules in use:}}
> {{    org.apache.spark#spark-avro_2.12;2.4.3 from central in [default]}}
> {{    org.spark-project.spark#unused;1.0.0 from central in [default]}}
> {{    -}}
> {{    |  |    modules    ||   artifacts   |}}
> {{    |   conf   | number| search|dwnlded|evicted|| number|dwnlded|}}
> {{    -}}
> {{    |  default |   2   |   0   |   0   |   0   ||   2   |   0   |}}
> {{    -}}
> {{:: retrieving :: 
> org.apache.spark#spark-submit-parent-2c34ecff-b060-4af9-9b9f-83867672748c}}
> {{    confs: [default]}}
> {{    0 artifacts copied, 2 already retrieved (0kB/7ms)}}
> {{19/07/25 23:25:03 WARN Client: Neither spark.yarn.jars nor 
> spark.yarn.archive is set, falling back to uploading libraries under 
> SPARK_HOME.}}
> {{19/07/25 23:25:07 WARN Client: Same path resource 
> file:///home/barryl/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.3.jar 
> added multiple times to distributed cache.}}
> {{19/07/25 23:25:07 WARN Client: Same path resource 
> file:///home/barryl/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added 
> multiple times to distributed cache.}}
> {{19/07/25 23:25:28 WARN TaskSetManager: Stage 0 contains a task of very 
> large size (365 KB). The maximum recommended task size is 100 KB.}}
> {{Pi is roughly 3.142308}}
>  






[jira] [Commented] (SPARK-28558) DatasetWriter partitionBy is changing the group file permissions in 2.4 for parquets

2019-09-23 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936200#comment-16936200
 ] 

holdenk commented on SPARK-28558:
-

What storage system are y'all using, [~nladuguie] & [~spearson]?

> DatasetWriter partitionBy is changing the group file permissions in 2.4 for 
> parquets
> 
>
> Key: SPARK-28558
> URL: https://issues.apache.org/jira/browse/SPARK-28558
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: Hadoop 2.7
> Scala 2.11
> Tested:
>  * Spark 2.3.3 - Works
>  * Spark 2.4.x - All have the same issue
>Reporter: Stephen Pearson
>Priority: Minor
>
> When writing a parquet using partitionBy the group file permissions are being 
> changed as shown below. This causes members of the group to get 
> "org.apache.hadoop.security.AccessControlException: Open failed for file 
> error: Permission denied (13)"
> This worked in 2.3. I found a workaround which was to set 
> "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2" which gives 
> the correct behaviour
>  
> Code I used to reproduce issue:
> {quote}Seq(("H", 1), ("I", 2))
>  .toDF("Letter", "Number")
>  .write
>  .partitionBy("Letter")
>  .parquet(...){quote}
>  
> {quote}sparktesting$ tree -dp
> ├── [drwxrws---]  letter_testing2.3-defaults
> │   ├── [drwxrws---]  Letter=H
> │   └── [drwxrws---]  Letter=I
> ├── [drwxrws---]  letter_testing2.4-defaults
> │   ├── [drwxrwS---]  Letter=H
> │   └── [drwxrwS---]  Letter=I
> └── [drwxrws---]  letter_testing2.4-file-writer2
>     ├── [drwxrws---]  Letter=H
>     └── [drwxrws---]  Letter=I
> {quote}
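A minimal sketch of the repro above with the reporter's workaround applied (the output path is illustrative); `spark.hadoop.*` settings are passed through to the underlying Hadoop configuration:

{code:scala}
import org.apache.spark.sql.SparkSession

// Workaround from the report: fall back to the v2 file output committer algorithm.
val spark = SparkSession.builder()
  .appName("partitionBy-permissions-repro")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
import spark.implicits._

Seq(("H", 1), ("I", 2))
  .toDF("Letter", "Number")
  .write
  .partitionBy("Letter")
  .parquet("/tmp/sparktesting/letter_testing")  // illustrative path
{code}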






[jira] [Commented] (SPARK-28592) Mark new Shuffle apis as @Experimental (instead of @Private)

2019-09-23 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936199#comment-16936199
 ] 

holdenk commented on SPARK-28592:
-

Should we set this to blocker so we don't forget?

> Mark new Shuffle apis as @Experimental (instead of @Private)
> 
>
> Key: SPARK-28592
> URL: https://issues.apache.org/jira/browse/SPARK-28592
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> The new Shuffle API is initially marked as {{@Private}}, just to discourage 
> anyone from trying to make use of it before all the pieces are in place (in 
> particular if Spark 3.0 is released before everything is merged). But once 
> it's all merged we can change it to {{@Experimental}}.






[jira] [Commented] (SPARK-28653) Create table using DDL statement should not auto create the destination folder

2019-09-23 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936198#comment-16936198
 ] 

holdenk commented on SPARK-28653:
-

[~thanida.t] can you confirm if you're still experiencing this issue, or if it's 
resolved as [~angerszhuuu] has suggested might be the case?

> Create table using DDL statement should not auto create the destination folder
> --
>
> Key: SPARK-28653
> URL: https://issues.apache.org/jira/browse/SPARK-28653
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Thanida
>Priority: Minor
>
> When I create an external table using the following DDL statement, the destination 
> path is auto-created.
> {code:java}
> CREATE TABLE ${tableName} USING parquet LOCATION ${path}
> {code}
> But if I specify the file format as CSV or JSON, the destination path is not 
> created.
> {code:java}
> CREATE TABLE ${tableName} USING CSV LOCATION ${path}
> {code}






[jira] [Commented] (SPARK-28727) Request for partial least square (PLS) regression model

2019-09-23 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936196#comment-16936196
 ] 

holdenk commented on SPARK-28727:
-

I don't believe we'll be adding new algorithms to Spark ML in the next release. 
You may want to look at Spark packages or Spark's integration with other ML 
systems if Spark does not currently meet your needs.

> Request for partial least square (PLS) regression model
> ---
>
> Key: SPARK-28727
> URL: https://issues.apache.org/jira/browse/SPARK-28727
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Affects Versions: 3.0.0
> Environment: I am using Windows 10, Spark v2.3.2
>Reporter: Nikunj
>Priority: Major
>  Labels: PLS, least, partial, regression, square
>
> Hi.
> Is there any development going on with regards to a PLS model? Or is there a 
> plan for it in the near future? The application I am developing needs a PLS 
> model as it is mandatory in that particular industry. I am using sparklyr, 
> and have started a bit of the implementation, but was wondering if something 
> is already in the pipeline.
> Thanks.






[jira] [Updated] (SPARK-28781) Unneccesary persist in PeriodicCheckpointer.update()

2019-09-23 Thread holdenk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-28781:

Issue Type: Improvement  (was: Bug)

> Unneccesary persist in PeriodicCheckpointer.update()
> 
>
> Key: SPARK-28781
> URL: https://issues.apache.org/jira/browse/SPARK-28781
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> Once update() is called, newData is persisted at line 82. However, the 
> persisted data is only used a second time when a checkpoint is actually 
> performed (the condition at line 94 is satisfied and checkpoint() runs at 
> line 97). Data that never satisfies the checkpoint condition does not need 
> to be cached. The persistedQueue limits how much unnecessary data stays 
> cached, but it would be better to avoid every unnecessary persist operation.
> {code:scala}
> def update(newData: T): Unit = {
> persist(newData)
> persistedQueue.enqueue(newData)
> // We try to maintain 2 Datasets in persistedQueue to support the 
> semantics of this class:
> // Users should call [[update()]] when a new Dataset has been created,
> // before the Dataset has been materialized.
> while (persistedQueue.size > 3) {
>   val dataToUnpersist = persistedQueue.dequeue()
>   unpersist(dataToUnpersist)
> }
> updateCount += 1
> // Handle checkpointing (after persisting)
> if (checkpointInterval != -1 && (updateCount % checkpointInterval) == 0
>   && sc.getCheckpointDir.nonEmpty) {
>   // Add new checkpoint before removing old checkpoints.
>   checkpoint(newData)
> {code}






[jira] [Updated] (SPARK-28978) PySpark: Can't pass more than 256 arguments to a UDF

2019-09-23 Thread holdenk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-28978:

Target Version/s: 3.0.0

> PySpark: Can't pass more than 256 arguments to a UDF
> 
>
> Key: SPARK-28978
> URL: https://issues.apache.org/jira/browse/SPARK-28978
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jim Fulton
>Priority: Major
>  Labels: koalas, mlflow, pyspark
>
> This code:
> [https://github.com/apache/spark/blob/712874fa0937f0784f47740b127c3bab20da8569/python/pyspark/worker.py#L367-L379]
> Creates Python lambdas that call UDF functions passing arguments singly, 
> rather than using varargs.  For example: `lambda a: f(a[0], a[1], ...)`.
> This fails when there are more than 256 arguments.
> mlflow, when generating model predictions, uses an argument for each feature 
> column.  I have a model with > 500 features.
> I was able to easily hack around this by changing the generated lambdas to 
> use varargs, as in `lambda a: f(*a)`. 
> IDK why these lambdas were created the way they were.  Using varargs is much 
> simpler and works fine in my testing.
>  
>  






[jira] [Assigned] (SPARK-29083) Speed up toLocalIterator with prefetching when enabled

2019-09-23 Thread holdenk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-29083:
---

Assignee: holdenk

> Speed up toLocalIterator with prefetching when enabled
> --
>
> Key: SPARK-29083
> URL: https://issues.apache.org/jira/browse/SPARK-29083
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
>
> As a follow up to SPARK-27659 see if we can do the same thing in Scala.






[jira] [Commented] (SPARK-29217) How to read streaming output path by ignoring metadata log files

2019-09-23 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936186#comment-16936186
 ] 

holdenk commented on SPARK-29217:
-

Can you clarify what you mean by "Moving some files in the output while 
streaming"?

> How to read streaming output path by ignoring metadata log files
> 
>
> Key: SPARK-29217
> URL: https://issues.apache.org/jira/browse/SPARK-29217
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Thanida
>Priority: Minor
>
> Because the output path of Spark Structured Streaming contains a `_spark_metadata` 
> directory, reading it with
> {code:java}
> spark.read.format("parquet").load(filepath)
> {code}
> always depends on the file listing in the metadata log.
> Moving some files in the output directory while the stream is running causes 
> reads to fail. So, how can we read data in the streaming output path while 
> ignoring the metadata log files?






[jira] [Commented] (SPARK-29163) Provide a mixin to simplify HadoopConf access patterns in DataSource V2

2019-09-23 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936053#comment-16936053
 ] 

holdenk commented on SPARK-29163:
-

I'm going to try and do some work on this before the end of the month.

> Provide a mixin to simplify HadoopConf access patterns in DataSource V2
> ---
>
> Key: SPARK-29163
> URL: https://issues.apache.org/jira/browse/SPARK-29163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
>
> Since many data sources need the Hadoop config, we should make it easy for 
> them to get access to it with minimal overhead (e.g. broadcasting + a mixin).
>  
> TODO after SPARK-29158. Also look at DSV1 and see if there were any 
> interesting hacks we did before to make this fast.






[jira] [Resolved] (SPARK-27659) Allow PySpark toLocalIterator to prefetch data

2019-09-20 Thread holdenk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-27659.
-
Fix Version/s: 3.0.0
 Assignee: holdenk
   Resolution: Fixed

> Allow PySpark toLocalIterator to prefetch data
> --
>
> Key: SPARK-27659
> URL: https://issues.apache.org/jira/browse/SPARK-27659
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: holdenk
>Priority: Minor
> Fix For: 3.0.0
>
>
> In SPARK-23961, data was no longer prefetched so that the local iterator 
> could close cleanly in the case of not consuming all of the data. If the user 
> intends to iterate over all elements, then prefetching data could bring back 
> any lost performance. We would need to run some tests to see if the 
> performance is worth it given the additional complexity and extra memory 
> usage. The option to prefetch could be added as a user conf, and it's possible 
> this could improve the Scala toLocalIterator also.
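As a generic, element-level illustration of the prefetching idea only (Spark's actual implementation works at partition granularity; the helper name is hypothetical):

{code:scala}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Wraps an iterator so the next element is fetched in the background while the
// caller is still processing the current one.
def prefetched[T](underlying: Iterator[T]): Iterator[T] = new Iterator[T] {
  private def fetch(): Future[Option[T]] =
    Future(if (underlying.hasNext) Some(underlying.next()) else None)
  private var pending: Future[Option[T]] = fetch()
  def hasNext: Boolean = Await.result(pending, Duration.Inf).isDefined
  def next(): T = {
    val value = Await.result(pending, Duration.Inf).get
    pending = fetch()  // start fetching the following element immediately
    value
  }
}
{code}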






[jira] [Resolved] (SPARK-28936) Simplify Spark K8s tests by replacing race condition during command execution

2019-09-20 Thread holdenk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-28936.
-
Resolution: Fixed

> Simplify Spark K8s tests by replacing race condition during command execution
> -
>
> Key: SPARK-28936
> URL: https://issues.apache.org/jira/browse/SPARK-28936
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Major
>
> Currently our command execution for Spark Kubernetes integration tests 
> depends on a Thread.sleep which sometimes doesn't wait long enough. This 
> normally doesn't show up because we automatically retry the commands 
> inside of an eventually, but on some machines it may result in flaky tests 
> (see the sketch after this quote).
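Illustrative sketch of the eventually-based approach (ScalaTest's Eventually; runCommandInPod is a placeholder, not the real test helper):

{code:scala}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

def runCommandInPod(): String = ???  // placeholder for the real exec helper

// Poll until the command output looks right instead of sleeping a fixed amount.
eventually(timeout(60.seconds), interval(1.second)) {
  val out = runCommandInPod()
  assert(out.contains("expected marker"), s"unexpected output: $out")
}
{code}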






[jira] [Updated] (SPARK-28936) Simplify Spark K8s tests by replacing race condition during command execution

2019-09-20 Thread holdenk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-28936:

Fix Version/s: 3.0.0

> Simplify Spark K8s tests by replacing race condition during command execution
> -
>
> Key: SPARK-28936
> URL: https://issues.apache.org/jira/browse/SPARK-28936
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently our command execution for Spark Kubernetes integration tests 
> depends on a Thread.sleep which sometimes doesn't wait long enough. This 
> normally doesn't show up because we automatically retry the commands 
> inside of an eventually, but on some machines it may result in flaky tests.






[jira] [Resolved] (SPARK-28937) Improve error reporting in Spark Secrets Test Suite

2019-09-20 Thread holdenk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-28937.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Improve error reporting in Spark Secrets Test Suite
> ---
>
> Key: SPARK-28937
> URL: https://issues.apache.org/jira/browse/SPARK-28937
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Right now most of the checks for the Secrets test suite are done inside an 
> eventually condition, meaning that when they fail, they fail with a last 
> exception saying they cannot connect to the pod; this can mask the actual 
> failure.






[jira] [Commented] (SPARK-29193) Update fabric8 version to 4.3 continue docker 4 desktop support

2019-09-20 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934798#comment-16934798
 ] 

holdenk commented on SPARK-29193:
-

My bad, looks like we fixed this in SPARK-28921.

> Update fabric8 version to 4.3 continue docker 4 desktop support
> ---
>
> Key: SPARK-29193
> URL: https://issues.apache.org/jira/browse/SPARK-29193
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Blocker
>
> The current version of the kubernetes client we are using has some issues 
> with not setting origin ( 
> [https://github.com/fabric8io/kubernetes-client/issues/1667] ) which cause 
> failures on new versions of Docker 4 Desktop Kubernetes.
>  
> This is fixed in 4.3-snapshot, so we will need to wait for the 4.3 release or 
> backport this.






[jira] [Resolved] (SPARK-29193) Update fabric8 version to 4.3 continue docker 4 desktop support

2019-09-20 Thread holdenk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-29193.
-
Fix Version/s: 3.0.0
   Resolution: Duplicate

> Update fabric8 version to 4.3 continue docker 4 desktop support
> ---
>
> Key: SPARK-29193
> URL: https://issues.apache.org/jira/browse/SPARK-29193
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Blocker
> Fix For: 3.0.0
>
>
> The current version of the kubernetes client we are using has some issues 
> with not setting origin ( 
> [https://github.com/fabric8io/kubernetes-client/issues/1667] ) which cause 
> failures on new versions of Docker 4 Desktop Kubernetes.
>  
> This is fixed in 4.3-snapshot, so we will need to wait for the 4.3 release or 
> backport this.






[jira] [Updated] (SPARK-29193) Update fabric8 version to 4.3 continue docker 4 desktop support

2019-09-20 Thread holdenk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-29193:

Description: 
The current version of the kubernetes client we are using has some issues with 
not setting origin ( 
[https://github.com/fabric8io/kubernetes-client/issues/1667] ) which cause 
failures on new versions of Docker 4 Desktop Kubernetes.

 

This is fixed in 4.3-snapshot, so we will need to wait for the 4.3 release or 
backport this.

  was:The current version of the kubernetes client we are using has some issues 
with not setting origin ( 
[https://github.com/fabric8io/kubernetes-client/issues/1667] ) which cause 
failures on new versions of Docker 4 Desktop Kubernetes.

 Issue Type: Bug  (was: Improvement)
   Priority: Blocker  (was: Major)
Summary: Update fabric8 version to 4.3 continue docker 4 desktop 
support  (was: Update fabric8 version to continue docker 4 desktop support)

> Update fabric8 version to 4.3 continue docker 4 desktop support
> ---
>
> Key: SPARK-29193
> URL: https://issues.apache.org/jira/browse/SPARK-29193
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Blocker
>
> The current version of the kubernetes client we are using has some issues 
> with not setting origin ( 
> [https://github.com/fabric8io/kubernetes-client/issues/1667] ) which cause 
> failures on new versions of Docker 4 Desktop Kubernetes.
>  
> This is fixed in 4.3-snapshot, so we will need to wait for the 4.3 release or 
> backport this.






[jira] [Commented] (SPARK-29193) Update fabric8 version to 4.3 continue docker 4 desktop support

2019-09-20 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934795#comment-16934795
 ] 

holdenk commented on SPARK-29193:
-

While I've only observed the issue on docker 4 desktop, it's possible more 
versions may start doing strict origin checking.

> Update fabric8 version to 4.3 continue docker 4 desktop support
> ---
>
> Key: SPARK-29193
> URL: https://issues.apache.org/jira/browse/SPARK-29193
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Blocker
>
> The current version of the kubernetes client we are using has some issues 
> with not setting origin ( 
> [https://github.com/fabric8io/kubernetes-client/issues/1667] ) which cause 
> failures on new versions of Docker 4 Desktop Kubernetes.
>  
> This is fixed in 4.3-snapshot, so we will need to wait for the 4.3 release or 
> backport this.






[jira] [Created] (SPARK-29193) Update fabric8 version to continue docker 4 desktop support

2019-09-20 Thread holdenk (Jira)
holdenk created SPARK-29193:
---

 Summary: Update fabric8 version to continue docker 4 desktop 
support
 Key: SPARK-29193
 URL: https://issues.apache.org/jira/browse/SPARK-29193
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: holdenk


The current version of the kubernetes client we are using has some issues with 
not setting origin ( 
[https://github.com/fabric8io/kubernetes-client/issues/1667] ) which cause 
failures on new versions of Docker 4 Desktop Kubernetes.






[jira] [Created] (SPARK-29163) Provide a mixin to simplify HadoopConf access patterns in DataSource V2

2019-09-18 Thread holdenk (Jira)
holdenk created SPARK-29163:
---

 Summary: Provide a mixin to simplify HadoopConf access patterns in 
DataSource V2
 Key: SPARK-29163
 URL: https://issues.apache.org/jira/browse/SPARK-29163
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: holdenk
Assignee: holdenk


Since many data sources need the Hadoop config, we should make it easy for 
them to get access to it with minimal overhead (e.g. broadcasting + a mixin).

 

TODO after SPARK-29158. Also look at DSV1 and see if there were any interesting 
hacks we did before to make this fast.






[jira] [Created] (SPARK-29158) Expose SerializableConfiguration for DSv2

2019-09-18 Thread holdenk (Jira)
holdenk created SPARK-29158:
---

 Summary: Expose SerializableConfiguration for DSv2
 Key: SPARK-29158
 URL: https://issues.apache.org/jira/browse/SPARK-29158
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.4.5, 3.0.0
Reporter: holdenk
Assignee: holdenk


Since we use it frequently inside of our own DataSourceV2 implementations (13 
times, per `grep -r broadcastedConf ./sql/core/src/ | grep val | wc -l`), we 
should expose SerializableConfiguration for DSv2 dev work.
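For context, the internal pattern being counted above looks roughly like this (sketch only; SerializableConfiguration is currently private[spark], which is exactly what this issue proposes to change):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.SerializableConfiguration

// Wrap the non-serializable Hadoop Configuration so it can be broadcast once and
// then read on executors via broadcastedConf.value.value.
val spark = SparkSession.builder().getOrCreate()
val broadcastedConf = spark.sparkContext.broadcast(
  new SerializableConfiguration(spark.sparkContext.hadoopConfiguration))
{code}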






[jira] [Commented] (SPARK-22390) Aggregate push down

2019-09-18 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932739#comment-16932739
 ] 

holdenk commented on SPARK-22390:
-

Love to follow where this is going, especially if it gets broken into smaller 
pieces of work.

> Aggregate push down
> ---
>
> Key: SPARK-22390
> URL: https://issues.apache.org/jira/browse/SPARK-22390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Created] (SPARK-29083) Speed up toLocalIterator with prefetching when enabled

2019-09-13 Thread holdenk (Jira)
holdenk created SPARK-29083:
---

 Summary: Speed up toLocalIterator with prefetching when enabled
 Key: SPARK-29083
 URL: https://issues.apache.org/jira/browse/SPARK-29083
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.0.0
Reporter: holdenk


As a follow up to SPARK-27659 see if we can do the same thing in Scala.






[jira] [Created] (SPARK-29076) Generalize the PVTestSuite to no longer need the minikube tag

2019-09-13 Thread holdenk (Jira)
holdenk created SPARK-29076:
---

 Summary: Generalize the PVTestSuite to no longer need the minikube 
tag
 Key: SPARK-29076
 URL: https://issues.apache.org/jira/browse/SPARK-29076
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 3.0.0
Reporter: holdenk


Currently the PVTestSuite has the MiniKube test tag applied so it can be 
skipped for non-minikube tests. It should be somewhat easily generalizable to 
at least other local k8s test environments; however, as written it depends on 
being able to mount a local folder as a PV, so it may take more work to 
generalize to arbitrary k8s.






[jira] [Commented] (SPARK-28937) Improve error reporting in Spark Secrets Test Suite

2019-08-30 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919991#comment-16919991
 ] 

holdenk commented on SPARK-28937:
-

I'm working on this

> Improve error reporting in Spark Secrets Test Suite
> ---
>
> Key: SPARK-28937
> URL: https://issues.apache.org/jira/browse/SPARK-28937
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
>
> Right now most of the checks for the Secrets test suite are done inside an 
> eventually condition, meaning that when they fail, they fail with a last 
> exception saying they cannot connect to the pod; this can mask the actual 
> failure.






[jira] [Created] (SPARK-28937) Improve error reporting in Spark Secrets Test Suite

2019-08-30 Thread holdenk (Jira)
holdenk created SPARK-28937:
---

 Summary: Improve error reporting in Spark Secrets Test Suite
 Key: SPARK-28937
 URL: https://issues.apache.org/jira/browse/SPARK-28937
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 3.0.0
Reporter: holdenk
Assignee: holdenk


Right now most of the checks for the Secrets test suite are done inside an 
eventually condition, meaning that when they fail, they fail with a last 
exception saying they cannot connect to the pod; this can mask the actual 
failure.






[jira] [Commented] (SPARK-28936) Simplify Spark K8s tests by replacing race condition during command execution

2019-08-30 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919990#comment-16919990
 ] 

holdenk commented on SPARK-28936:
-

I'm working on this.

> Simplify Spark K8s tests by replacing race condition during command execution
> -
>
> Key: SPARK-28936
> URL: https://issues.apache.org/jira/browse/SPARK-28936
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Major
>
> Currently our command execution for Spark Kubernetes integration tests 
> depends on a Thread.sleep which sometimes doesn't wait long enough. This 
> normally doesn't show up because we automatically retry the commands 
> inside of an eventually, but on some machines it may result in flaky tests.






[jira] [Created] (SPARK-28936) Simplify Spark K8s tests by replacing race condition during command execution

2019-08-30 Thread holdenk (Jira)
holdenk created SPARK-28936:
---

 Summary: Simplify Spark K8s tests by replacing race condition 
during command execution
 Key: SPARK-28936
 URL: https://issues.apache.org/jira/browse/SPARK-28936
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 3.0.0
Reporter: holdenk
Assignee: holdenk


Currently our command execution for Spark Kubernetes integration tests depends 
on a Thread.sleep which sometimes doesn't wait long enough. This normally 
doesn't show up because we automatically retry the commands inside of an 
eventually, but on some machines it may result in flaky tests.






[jira] [Commented] (SPARK-28904) Spark PV tests don't create required mount

2019-08-28 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918142#comment-16918142
 ] 

holdenk commented on SPARK-28904:
-

Related PV FSGroup issue: https://issues.apache.org/jira/browse/SPARK-28905 . However, 
minikube doesn't support PV FSGroup, so we still probably want to do the manual 
bind of owner id 185.

> Spark PV tests don't create required mount
> --
>
> Key: SPARK-28904
> URL: https://issues.apache.org/jira/browse/SPARK-28904
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Major
>
> The Spark PVTestsSuite assumes that there exists a mount from the host system 
> into minikube but doesn't set up that mount. It also currently assumes the 
> mount is done with the same UID as the Spark user – although that could also 
> be viewed as a bug in how we configure volumes (will create a follow-up Jira 
> to explore this).






[jira] [Created] (SPARK-28905) PVs mounted into Spark may not be writable by Spark

2019-08-28 Thread holdenk (Jira)
holdenk created SPARK-28905:
---

 Summary: PVs mounted into Spark may not be writable by Spark
 Key: SPARK-28905
 URL: https://issues.apache.org/jira/browse/SPARK-28905
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: holdenk
Assignee: holdenk


Right now, for Spark to have write access to the PVs we provide it, they need 
to be created with UID 185. This _may_ be a feature, but if we wanted to make 
this more flexible we could set a FsGroup in our security context when setting 
up the volumes. This would give Spark access provided the PV supports fsgroup 
(which on minikube does not appear to be the case - 
[https://github.com/kubernetes/minikube/issues/1990] )

 

If we did want to do this, we could add the security context in 
MountVolumesFeatureStep, but I don't know if we want to do this or if the 
current behaviour is desired.
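A rough sketch of the fsGroup idea using the fabric8 builder API (class and method names from memory, value illustrative; this is not a proposed patch):

{code:scala}
import io.fabric8.kubernetes.api.model.{PodBuilder, PodSecurityContextBuilder}

// Give the pod a security context with an fsGroup so that volumes which honour
// fsGroup become group-writable for the Spark user (UID/GID 185 in the images).
val podWithFsGroup = new PodBuilder()
  .editOrNewSpec()
    .withSecurityContext(new PodSecurityContextBuilder().withFsGroup(185L).build())
  .endSpec()
  .build()
{code}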

I ran into this while getting the integration tests to work w/minikube 1.3.1 on 
OSX. See https://issues.apache.org/jira/browse/SPARK-28904 for details.

 

cc [~ifilonenko] for thoughts






[jira] [Created] (SPARK-28904) Spark PV tests don't create required mount

2019-08-28 Thread holdenk (Jira)
holdenk created SPARK-28904:
---

 Summary: Spark PV tests don't create required mount
 Key: SPARK-28904
 URL: https://issues.apache.org/jira/browse/SPARK-28904
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Tests
Affects Versions: 3.0.0
Reporter: holdenk
Assignee: holdenk


The Spark PVTestsSuite assumes that there exists a mount from the host system 
into minikube but doesn't set up that mount. It also currently assumes the 
mount is done with the same UID as the Spark user – although that could also be 
viewed as a bug in how we configure volumes (will create a follow-up Jira to 
explore this).






[jira] [Created] (SPARK-28886) Kubernetes DepsTestsSuite fails on OSX with minikube 1.3.1 due to formatting

2019-08-27 Thread holdenk (Jira)
holdenk created SPARK-28886:
---

 Summary: Kubernetes DepsTestsSuite fails on OSX with minikube 
1.3.1 due to formatting
 Key: SPARK-28886
 URL: https://issues.apache.org/jira/browse/SPARK-28886
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Tests
Affects Versions: 3.0.0
Reporter: holdenk
Assignee: holdenk


With minikube 1.3.1 on OSX the service discovery command returns an extra "* " 
which doesn't parse into a URL, causing the DepsTestsSuite to fail.

I've got a fix; I just need to double-check some stuff.
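As an illustration of the formatting problem only (the actual fix may differ), the stray marker can simply be stripped before the output is parsed as a URL:

{code:scala}
// Example output shape from the minikube service-url command on minikube 1.3.x;
// the leading "* " is what breaks URL parsing.
val rawLines = Seq("* http://192.168.99.100:30080")
val serviceUrls = rawLines.map(_.stripPrefix("* ").trim).filter(_.startsWith("http"))
{code}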






[jira] [Updated] (SPARK-28842) Cleanup the formatting/trailing spaces in resource-managers/kubernetes/integration-tests/README.md

2019-08-21 Thread holdenk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-28842:

Labels: starter  (was: )

> Cleanup the formatting/trailing spaces in 
> resource-managers/kubernetes/integration-tests/README.md
> --
>
> Key: SPARK-28842
> URL: https://issues.apache.org/jira/browse/SPARK-28842
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> The K8s integration testing guide currently has a bunch of trailing spaces on 
> lines which we could clean up.






[jira] [Created] (SPARK-28842) Cleanup the formatting/trailing spaces in resource-managers/kubernetes/integration-tests/README.md

2019-08-21 Thread holdenk (Jira)
holdenk created SPARK-28842:
---

 Summary: Cleanup the formatting/trailing spaces in 
resource-managers/kubernetes/integration-tests/README.md
 Key: SPARK-28842
 URL: https://issues.apache.org/jira/browse/SPARK-28842
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Kubernetes
Affects Versions: 3.0.0
Reporter: holdenk


The K8s integration testing guide currently has a bunch of trailing spaces on 
lines which we could clean up.






[jira] [Assigned] (SPARK-28784) StreamExecution and StreamingQueryManager should utilize CheckpointFileManager to interact with checkpoint directories

2019-08-20 Thread holdenk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-28784:
---

Assignee: Shruti Gumma

> StreamExecution and StreamingQueryManager should utilize 
> CheckpointFileManager to interact with checkpoint directories
> --
>
> Key: SPARK-28784
> URL: https://issues.apache.org/jira/browse/SPARK-28784
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shruti Gumma
>Assignee: Shruti Gumma
>Priority: Major
>
> After PR [https://github.com/apache/spark/pull/21048], the 
> CheckpointFileManager interface was created to handle all structured 
> streaming checkpointing operations and helps users to choose how they wish to 
> write checkpointing files atomically.
> StreamExecution and StreamingQueryManager still use some FileSystem 
> operations without using the CheckpointFileManager.
> For instance,
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L137]
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L392]
> Instead, StreamExecution and StreamingQueryManager should use 
> CheckpointFileManager for these operations.






[jira] [Commented] (SPARK-27659) Allow PySpark toLocalIterator to prefetch data

2019-08-16 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909509#comment-16909509
 ] 

holdenk commented on SPARK-27659:
-

I'm working on this.

> Allow PySpark toLocalIterator to prefetch data
> --
>
> Key: SPARK-27659
> URL: https://issues.apache.org/jira/browse/SPARK-27659
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> In SPARK-23961, data was no longer prefetched so that the local iterator 
> could close cleanly in the case of not consuming all of the data. If the user 
> intends to iterate over all elements, then prefetching data could bring back 
> any lost performance. We would need to run some tests to see if the 
> performance is worth it given the additional complexity and extra memory 
> usage. The option to prefetch could be added as a user conf, and it's possible 
> this could improve the Scala toLocalIterator also.






[jira] [Commented] (SPARK-27683) Remove usage of TraversableOnce

2019-08-15 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908410#comment-16908410
 ] 

holdenk commented on SPARK-27683:
-

Interesting related discussion over in 
[https://contributors.scala-lang.org/t/concerns-about-deprecation-of-iterableonce-member-methods-in-2-13/3583/14]
 , it looks like there might be a PR which would fix this (although I'm not 
sure) - [https://github.com/scala/scala/pull/8330]

> Remove usage of TraversableOnce
> ---
>
> Key: SPARK-27683
> URL: https://issues.apache.org/jira/browse/SPARK-27683
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>
> As with {{Traversable}}, {{TraversableOnce}} is going away in Scala 2.13. We 
> should use {{IterableOnce}} instead. This one is a bigger change as there are 
> more API methods with the existing signature.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large

2019-08-15 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908267#comment-16908267
 ] 

holdenk commented on SPARK-24666:
-

[~zhongyu09] specific code & data which leads to a repro would help.
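In the meantime, a hedged way to check whether a fitted model is affected; not a fix, and it assumes a fitted pyspark.ml Word2VecModel bound to {{model}}:

{code}
# Hedged repro aid: list words whose learned vectors contain non-finite values.
# Assumes `model` is a fitted pyspark.ml.feature.Word2VecModel.
import numpy as np

bad_words = [
    row.word
    for row in model.getVectors().collect()   # DataFrame of (word, vector)
    if not np.isfinite(row.vector.toArray()).all()
]
print(len(bad_words), "words have non-finite vectors")
{code}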

> Word2Vec generate infinity vectors when numIterations are large
> ---
>
> Key: SPARK-24666
> URL: https://issues.apache.org/jira/browse/SPARK-24666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.1
> Environment:  2.0.X, 2.1.X, 2.2.X, 2.3.X
>Reporter: ZhongYu
>Priority: Critical
>
> We found that Word2Vec generates large absolute value vectors when 
> numIterations is large, and if numIterations is large enough (>20), the 
> vector's values may be *infinity (or -infinity)*, resulting in useless 
> vectors.
> In normal situations, vector values are mainly around -1.0~1.0 when 
> numIterations = 1.
> The bug is shown on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X.
> There is already an issue reporting this bug: 
> https://issues.apache.org/jira/browse/SPARK-5261 , but the fix for it seems 
> to be missing.
> Other people's reports:
> [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]
> [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28740) Add support for building with bloop

2019-08-14 Thread holdenk (JIRA)
holdenk created SPARK-28740:
---

 Summary: Add support for building with bloop
 Key: SPARK-28740
 URL: https://issues.apache.org/jira/browse/SPARK-28740
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: holdenk


bloop can, in theory, build Scala faster. However, the JAR layout is a little 
different when you try to run the tests. It would be useful if we updated our 
test JAR discovery to work with bloop.

Before working on this, check to make sure that bloop itself has changed to 
work with Spark.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9792) PySpark DenseMatrix, SparseMatrix should override __eq__

2019-04-01 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-9792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-9792.

   Resolution: Fixed
Fix Version/s: 3.0.0

> PySpark DenseMatrix, SparseMatrix should override __eq__
> 
>
> Key: SPARK-9792
> URL: https://issues.apache.org/jira/browse/SPARK-9792
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Priority: Critical
> Fix For: 3.0.0
>
>
> See [SPARK-9750].  Equality should be defined semantically, not in terms of 
> representation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27095) We depend on silently accepting failures in setup-integration-test-env.sh

2019-03-07 Thread holdenk (JIRA)
holdenk created SPARK-27095:
---

 Summary: We depend on silently accepting failures in 
setup-integration-test-env.sh
 Key: SPARK-27095
 URL: https://issues.apache.org/jira/browse/SPARK-27095
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 3.0.0
Reporter: holdenk


You can add `set -e` to the top of `setup-integration-test-env.sh` and watch CI 
fail. We shouldn't be depending on silently accepting failures in our 
integration shell scripts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21094) Allow stdout/stderr pipes in pyspark.java_gateway.launch_gateway

2019-02-15 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-21094:
---

Assignee: Peter Parente

> Allow stdout/stderr pipes in pyspark.java_gateway.launch_gateway
> 
>
> Key: SPARK-21094
> URL: https://issues.apache.org/jira/browse/SPARK-21094
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.1
>Reporter: Peter Parente
>Assignee: Peter Parente
>Priority: Major
>
> The Popen call to launch the py4j gateway specifies no stdout and stderr 
> options, meaning logging from the JVM always goes to the parent process 
> terminal. 
> https://github.com/apache/spark/blob/v2.1.1/python/pyspark/java_gateway.py#L77
> It would be super handy if the launch_gateway function took an additional 
> dict parameter called popen_kwargs and passed it along to the Popen calls. 
> This API enhancement, for example, would allow Python applications to capture 
> all stdout and stderr coming from Spark and process it programmatically, 
> without resorting to reading from log files or other hijinks.
> Example use:
> {code}
> import pyspark
> import subprocess
> from pyspark.java_gateway import launch_gateway
> # Make the py4j JVM stdout and stderr available without buffering
> popen_kwargs = {
>   'stdout': subprocess.PIPE,
>   'stderr': subprocess.PIPE,
>   'bufsize': 0
> }
> # Launch the gateway with our custom settings
> gateway = launch_gateway(popen_kwargs=popen_kwargs)
> # Use the gateway we launched
> sc = pyspark.SparkContext(gateway=gateway)
> # This could be done in a thread or event loop or ...
> # Written briefly / poorly here only as a demo
> while True:
>   buf = gateway.proc.stdout.read()
>   print(buf.decode('utf-8'))
> {code}
> To get access to the stdout and stderr pipes, the "proc" instance created in 
> launch_gateway also needs to be exposed to the application. I'm thinking that 
> stashing it on the JavaGateway instance that the function already returns is 
> the cleanest from the client perspective, but means hanging an extra 
> attribute off the py4j.JavaGateway object. 
> I can submit a PR with this addition for further discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21094) Allow stdout/stderr pipes in pyspark.java_gateway.launch_gateway

2019-02-15 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-21094.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Allow stdout/stderr pipes in pyspark.java_gateway.launch_gateway
> 
>
> Key: SPARK-21094
> URL: https://issues.apache.org/jira/browse/SPARK-21094
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.1
>Reporter: Peter Parente
>Assignee: Peter Parente
>Priority: Major
> Fix For: 3.0.0
>
>
> The Popen call to launch the py4j gateway specifies no stdout and stderr 
> options, meaning logging from the JVM always goes to the parent process 
> terminal. 
> https://github.com/apache/spark/blob/v2.1.1/python/pyspark/java_gateway.py#L77
> It would be super handy if the launch_gateway function took an additional 
> dict parameter called popen_kwargs and passed it along to the Popen calls. 
> This API enhancement, for example, would allow Python applications to capture 
> all stdout and stderr coming from Spark and process it programmatically, 
> without resorting to reading from log files or other hijinks.
> Example use:
> {code}
> import pyspark
> import subprocess
> from pyspark.java_gateway import launch_gateway
> # Make the py4j JVM stdout and stderr available without buffering
> popen_kwargs = {
>   'stdout': subprocess.PIPE,
>   'stderr': subprocess.PIPE,
>   'bufsize': 0
> }
> # Launch the gateway with our custom settings
> gateway = launch_gateway(popen_kwargs=popen_kwargs)
> # Use the gateway we launched
> sc = pyspark.SparkContext(gateway=gateway)
> # This could be done in a thread or event loop or ...
> # Written briefly / poorly here only as a demo
> while True:
>   buf = gateway.proc.stdout.read()
>   print(buf.decode('utf-8'))
> {code}
> To get access to the stdout and stderr pipes, the "proc" instance created in 
> launch_gateway also needs to be exposed to the application. I'm thinking that 
> stashing it on the JavaGateway instance that the function already returns is 
> the cleanest from the client perspective, but means hanging an extra 
> attribute off the py4j.JavaGateway object. 
> I can submit a PR with this addition for further discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26898) Scalastyle should run during k8s integration tests

2019-02-15 Thread holdenk (JIRA)
holdenk created SPARK-26898:
---

 Summary: Scalastyle should run during k8s integration tests
 Key: SPARK-26898
 URL: https://issues.apache.org/jira/browse/SPARK-26898
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.0, 3.0.0
Reporter: holdenk


Right now the scalastyle checks are not run during integration testing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26882) lint-scala script does not check all components

2019-02-14 Thread holdenk (JIRA)
holdenk created SPARK-26882:
---

 Summary: lint-scala script does not check all components
 Key: SPARK-26882
 URL: https://issues.apache.org/jira/browse/SPARK-26882
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: holdenk
Assignee: holdenk


The scala linter script doesn't currently check the kubernetes integration tests



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator

2019-02-08 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-26185:
---

Assignee: Huaxin Gao

> add weightCol in python MulticlassClassificationEvaluator
> -
>
> Key: SPARK-26185
> URL: https://issues.apache.org/jira/browse/SPARK-26185
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-24101 added weightCol in 
> MulticlassClassificationEvaluator.scala. This Jira will add weightCol in 
> python version of MulticlassClassificationEvaluator.
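A hedged sketch of how the Python-side parameter is expected to be used once exposed (weightCol already exists on the Scala evaluator per SPARK-24101); the column names and values below are made up for illustration:

{code}
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
scored = spark.createDataFrame(
    [(0.0, 0.0, 1.0), (0.0, 1.0, 0.5), (1.0, 1.0, 2.0)],
    ["label", "prediction", "weight"],
)
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction",
    metricName="accuracy", weightCol="weight",   # weightCol is the new bit
)
print(evaluator.evaluate(scored))
{code}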



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24489) No check for invalid input type of weight data in ml.PowerIterationClustering

2019-01-07 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-24489.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Thanks for working on this, I've merged the fix into master :)

> No check for invalid input type of weight data in ml.PowerIterationClustering
> -
>
> Key: SPARK-24489
> URL: https://issues.apache.org/jira/browse/SPARK-24489
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
> Fix For: 3.0.0
>
>
> The test case below will result in the following failure. Currently in ml.PIC, 
> there is no check for the data type of the weight column. We should check for 
> a valid data type of the weight.
> {code:java}
>   test("invalid input types for weight") {
> val invalidWeightData = spark.createDataFrame(Seq(
>   (0L, 1L, "a"),
>   (2L, 3L, "b")
> )).toDF("src", "dst", "weight")
> val pic = new PowerIterationClustering()
>   .setWeightCol("weight")
> val result = pic.assignClusters(invalidWeightData)
>   }
> {code}
> {code:java}
> Job aborted due to stage failure: Task 0 in stage 8077.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 8077.0 (TID 882, localhost, executor 
> driver): scala.MatchError: [0,1,null] (of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>   at 
> org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
>   at 
> org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24489) No check for invalid input type of weight data in ml.PowerIterationClustering

2019-01-07 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-24489:
---

Assignee: shahid

> No check for invalid input type of weight data in ml.PowerIterationClustering
> -
>
> Key: SPARK-24489
> URL: https://issues.apache.org/jira/browse/SPARK-24489
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
>
> The test case below will result in the following failure. Currently in ml.PIC, 
> there is no check for the data type of the weight column. We should check for 
> a valid data type of the weight.
> {code:java}
>   test("invalid input types for weight") {
> val invalidWeightData = spark.createDataFrame(Seq(
>   (0L, 1L, "a"),
>   (2L, 3L, "b")
> )).toDF("src", "dst", "weight")
> val pic = new PowerIterationClustering()
>   .setWeightCol("weight")
> val result = pic.assignClusters(invalidWeightData)
>   }
> {code}
> {code:java}
> Job aborted due to stage failure: Task 0 in stage 8077.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 8077.0 (TID 882, localhost, executor 
> driver): scala.MatchError: [0,1,null] (of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>   at 
> org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
>   at 
> org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26497) Show users where the pre-packaged SparkR and PySpark Dockerfiles are in the image build script.

2018-12-28 Thread holdenk (JIRA)
holdenk created SPARK-26497:
---

 Summary: Show users where the pre-packaged SparkR and PySpark 
Dockerfiles are in the image build script.
 Key: SPARK-26497
 URL: https://issues.apache.org/jira/browse/SPARK-26497
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Kubernetes
Affects Versions: 3.0.0
Reporter: holdenk
Assignee: holdenk






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26343) Running the kubernetes

2018-12-11 Thread holdenk (JIRA)
holdenk created SPARK-26343:
---

 Summary: Running the kubernetes 
 Key: SPARK-26343
 URL: https://issues.apache.org/jira/browse/SPARK-26343
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 3.0.0
Reporter: holdenk
Assignee: holdenk


The Kubernetes integration tests right now allow you to specify a docker tag, 
but even when you do they still require a distribution tgz to extract, and then 
don't really need that extracted version. We could make it easier/faster for folks 
to run the integration tests locally by not requiring a distribution tarball.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26343) Speed up running the kubernetes integration tests locally

2018-12-11 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-26343:

Summary: Speed up running the kubernetes integration tests locally  (was: 
Running the kubernetes )

> Speed up running the kubernetes integration tests locally
> -
>
> Key: SPARK-26343
> URL: https://issues.apache.org/jira/browse/SPARK-26343
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
>
> The Kubernetes integration tests right now allow you to specify a docker tag, 
> but even when you do they still require a distribution tgz to extract, and 
> then don't really need that extracted version. We could make it easier/faster 
> for folks to run the integration tests locally by not requiring a distribution tarball.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25255) Add getActiveSession to SparkSession in PySpark

2018-10-26 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-25255.
-
Resolution: Fixed

Thanks for the PR and fixing this issue :)
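For reference, a minimal usage sketch of the added API (3.0+):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Returns the session active in the current thread, or None if there is none.
assert SparkSession.getActiveSession() is spark
{code}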

> Add getActiveSession to SparkSession in PySpark
> ---
>
> Key: SPARK-25255
> URL: https://issues.apache.org/jira/browse/SPARK-25255
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: Huaxin Gao
>Priority: Trivial
>  Labels: starter
> Fix For: 3.0.0
>
>
> Add getActiveSession to PySpark session API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25255) Add getActiveSession to SparkSession in PySpark

2018-10-26 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-25255:

Fix Version/s: 3.0.0

> Add getActiveSession to SparkSession in PySpark
> ---
>
> Key: SPARK-25255
> URL: https://issues.apache.org/jira/browse/SPARK-25255
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
> Fix For: 3.0.0
>
>
> Add getActiveSession to PySpark session API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25255) Add getActiveSession to SparkSession in PySpark

2018-10-26 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-25255:
---

Assignee: Huaxin Gao

> Add getActiveSession to SparkSession in PySpark
> ---
>
> Key: SPARK-25255
> URL: https://issues.apache.org/jira/browse/SPARK-25255
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: Huaxin Gao
>Priority: Trivial
>  Labels: starter
> Fix For: 3.0.0
>
>
> Add getActiveSession to PySpark session API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20598) Iterative checkpoints do not get removed from HDFS

2018-09-19 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621396#comment-16621396
 ] 

holdenk commented on SPARK-20598:
-

Huh, that's interesting. I suspect that could be because we're keeping the 
reference inside of PySpark.

> Iterative checkpoints do not get removed from HDFS
> --
>
> Key: SPARK-20598
> URL: https://issues.apache.org/jira/browse/SPARK-20598
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, YARN
>Affects Versions: 2.1.0
>Reporter: Guillem Palou
>Priority: Major
>
> I am running a pyspark  application that makes use of dataframe.checkpoint() 
> because Spark needs exponential time to compute the plan and eventually I had 
> to stop it. Using {{checkpoint}} allowed the application to proceed with the 
> computation, but I noticed that the HDFS cluster was filling up with RDD 
> files. Spark is running on YARN client mode. 
> I managed to reproduce the problem in a toy example as below:
> {code}
> df = spark.createDataFrame([T.Row(a=1, b=2)]).checkpoint()
> for i in range(4):
> # either line of the following 2 will produce the error   
> df = df.select('*', F.concat(*df.columns)).cache().checkpoint()
> df = df.join(df, on='a').cache().checkpoint()
> # the following two lines do not seem to have an effect
> gc.collect()
> sc._jvm.System.gc()
> {code}
> After running the code and {{sc.stop()}}, I can still see the RDDs 
> checkpointed in HDFS:
> {quote}
> guillem@ip-10-9-94-0:~$ hdfs dfs -du -h $CHECKPOINT_PATH
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-12
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-18
> 5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-24
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-30
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-6
> {quote}
> The config flag {{spark.cleaner.referenceTracking.cleanCheckpoints}} is set 
> to {{true}}. I would expect Spark to clean up all RDDs that can't be 
> accessed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25467) Python date/datetime objects in dataframes increment by 1 day when converted to JSON

2018-09-19 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621391#comment-16621391
 ] 

holdenk commented on SPARK-25467:
-

cc [~bryanc]

> Python date/datetime objects in dataframes increment by 1 day when converted 
> to JSON
> 
>
> Key: SPARK-25467
> URL: https://issues.apache.org/jira/browse/SPARK-25467
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1
> Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:39:56) 
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
> openjdk version "1.8.0_181"
> OpenJDK Runtime Environment (build 1.8.0_181-b13)
> OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
> Centos 7 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 
> x86_64 x86_64 GNU/Linux
>Reporter: David V. Hill
>Priority: Major
>
> When Dataframes contain datetime.date or datetime.datetime instances and 
> toJSON() is called on the Dataframe, the day is incremented in the JSON date 
> representation.
> {code}
> # Create a Dataframe containing datetime.date instances, convert to JSON and 
> display
> rows = [Row(cx=1, cy=2, dates=[datetime.date.fromordinal(1), 
> datetime.date.fromordinal(2)])]
> df = sqc.createDataFrame(rows)
> df.collect()
> [Row(cx=1, cy=2, dates=[datetime.date(1, 1, 1), datetime.date(1, 1, 2)])]
> df.toJSON().collect()
> ['{"cx":1,"cy":2,"dates":["0001-01-03","0001-01-04"]}']
> # Issue also occurs with datetime.datetime instances
> rows = [Row(cx=1, cy=2, dates=[datetime.datetime.fromordinal(1), 
> datetime.datetime.fromordinal(2)])]
> df = sqc.createDataFrame(rows)
> df.collect()
> [Row(cx=1, cy=2, dates=[datetime.datetime(1, 1, 1, 0, 0, fold=1), 
> datetime.datetime(1, 1, 2, 0, 0)])]
> df.toJSON().collect()
> ['{"cx":1,"cy":2,"dates":["0001-01-02T23:50:36.000-06:00","0001-01-03T23:50:36.000-06:00"]}']
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14352) approxQuantile should support multi columns

2018-09-19 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-14352:
---

Assignee: zhengruifeng

> approxQuantile should support multi columns
> ---
>
> Key: SPARK-14352
> URL: https://issues.apache.org/jira/browse/SPARK-14352
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> It will be convenient and efficient to calculate quantiles of multi-columns 
> with approxQuantile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

2018-09-19 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621389#comment-16621389
 ] 

holdenk commented on SPARK-17602:
-

Did we end up going anywhere with this?

> PySpark - Performance Optimization Large Size of Broadcast Variable
> ---
>
> Key: SPARK-17602
> URL: https://issues.apache.org/jira/browse/SPARK-17602
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.2, 2.0.0
> Environment: Linux
>Reporter: Xiao Ming Bao
>Priority: Major
> Attachments: PySpark – Performance Optimization for Large Size of 
> Broadcast variable.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Problem: currently on the executor side, the broadcast variable is written to 
> disk as a file, and each Python worker process reads it from local disk and 
> de-serializes it to a Python object before executing a task. When the size of 
> the broadcast variable is large, the read/de-serialization takes a lot of time, 
> and when the Python worker is NOT reused and the number of tasks is large, 
> performance is very bad since the Python worker needs to 
> read/de-serialize for each task. 
> Brief of the solution:
>  transfer the broadcast variable to the daemon Python process via a file (or 
> socket/mmap) and deserialize it to an object in the daemon Python process; after 
> a worker Python process is forked by the daemon Python process, the worker 
> automatically has the deserialized object and can use it directly because 
> of the copy-on-write memory behavior of fork on Linux.
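An illustrative sketch of the copy-on-write idea described above, outside of Spark: deserialize once in a parent process, then fork workers that read the object without re-deserializing it. POSIX-only, and the names and sizes are made up for illustration:

{code}
import os
import pickle

payload = pickle.dumps(list(range(1_000_000)))   # stand-in for a serialized broadcast
shared = pickle.loads(payload)                   # deserialize once, before forking

children = []
for _ in range(2):
    pid = os.fork()
    if pid == 0:
        # "Worker" child: `shared` is already usable via copy-on-write pages,
        # no per-task read/deserialize needed.
        print(os.getpid(), len(shared))
        os._exit(0)
    children.append(pid)

for pid in children:
    os.waitpid(pid, 0)
{code}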



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14352) approxQuantile should support multi columns

2018-09-19 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-14352.
-
  Resolution: Fixed
Target Version/s: 2.2.0

> approxQuantile should support multi columns
> ---
>
> Key: SPARK-14352
> URL: https://issues.apache.org/jira/browse/SPARK-14352
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: zhengruifeng
>Priority: Major
>
> It will be convenient and efficient to calculate quantiles of multi-columns 
> with approxQuantile.
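For reference, a small usage sketch of the multi-column form this issue tracked (a list of column names instead of a single name):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(float(i), i * 2.0) for i in range(100)], ["a", "b"])

# Approximate medians of both columns in one pass, with 5% relative error.
print(df.approxQuantile(["a", "b"], [0.5], 0.05))
{code}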



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25021) Add spark.executor.pyspark.memory support to Kubernetes

2018-09-19 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-25021:

Fix Version/s: 2.4.0

> Add spark.executor.pyspark.memory support to Kubernetes
> ---
>
> Key: SPARK-25021
> URL: https://issues.apache.org/jira/browse/SPARK-25021
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ilan Filonenko
>Priority: Major
> Fix For: 2.4.0, 3.0.0
>
>
> SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory 
> allocation for PySpark and updates YARN to add this memory to its container 
> requests. Kubernetes should do something similar to account for the python 
> memory allocation.
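For context, the conf being plumbed through here is set like any other Spark conf; a hedged sketch, with the 2g value purely illustrative:

{code}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.pyspark.memory", "2g")   # per-executor Python memory
    .getOrCreate()
)
{code}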



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25432) Consider if using standard getOrCreate from PySpark into JVM SparkSession would simplify code

2018-09-14 Thread holdenk (JIRA)
holdenk created SPARK-25432:
---

 Summary: Consider if using standard getOrCreate from PySpark into 
JVM SparkSession would simplify code
 Key: SPARK-25432
 URL: https://issues.apache.org/jira/browse/SPARK-25432
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
 Environment: As we saw in 
[https://github.com/apache/spark/pull/22295/files] the logic can get a bit out 
of sync. It _might_ make sense to try and simplify this so there's less 
duplicated logic in Python & Scala around session set up.
Reporter: holdenk






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25021) Add spark.executor.pyspark.memory support to Kubernetes

2018-09-08 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-25021.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Merged for 3.0 - open to the discussion around backporting. Thanks for doing this!

> Add spark.executor.pyspark.memory support to Kubernetes
> ---
>
> Key: SPARK-25021
> URL: https://issues.apache.org/jira/browse/SPARK-25021
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ilan Filonenko
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory 
> allocation for PySpark and updates YARN to add this memory to its container 
> requests. Kubernetes should do something similar to account for the python 
> memory allocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25021) Add spark.executor.pyspark.memory support to Kubernetes

2018-09-08 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-25021:
---

Assignee: Ilan Filonenko

> Add spark.executor.pyspark.memory support to Kubernetes
> ---
>
> Key: SPARK-25021
> URL: https://issues.apache.org/jira/browse/SPARK-25021
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ilan Filonenko
>Priority: Major
>
> SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory 
> allocation for PySpark and updates YARN to add this memory to its container 
> requests. Kubernetes should do something similar to account for the python 
> memory allocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25373) Support mixed language pipelines on Spark on K8s

2018-09-07 Thread holdenk (JIRA)
holdenk created SPARK-25373:
---

 Summary: Support mixed language pipelines on Spark on K8s
 Key: SPARK-25373
 URL: https://issues.apache.org/jira/browse/SPARK-25373
 Project: Spark
  Issue Type: Improvement
  Components: R, Kubernetes, PySpark
Affects Versions: 3.0.0
Reporter: holdenk


It would be good if we supported having Python/R workers in Spark on K8s even 
if it's not the language used to launch the driver program. An example of this 
is sparklingml, and another more common one would be something like livy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25270) lint-python: Add flake8 to find syntax errors and undefined names

2018-09-07 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-25270:
---

Assignee: cclauss

> lint-python: Add flake8 to find syntax errors and undefined names
> -
>
> Key: SPARK-25270
> URL: https://issues.apache.org/jira/browse/SPARK-25270
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: cclauss
>Assignee: cclauss
>Priority: Minor
>
> Flake8 has been a useful tool for finding and fixing undefined names in 
> Python code.  See: SPARK-23698  We should add flake8 testing to the 
> lint-python process to automate this testing on all pull requests.  
> https://github.com/apache/spark/pull/22266



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25370) Undefined name _exception_message in java_gateway

2018-09-07 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-25370.
-
Resolution: Duplicate

Issue was already fixed later.

> Undefined name _exception_message in java_gateway
> -
>
> Key: SPARK-25370
> URL: https://issues.apache.org/jira/browse/SPARK-25370
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
>
> While [~cclauss] has been working on SPARK-25270, we introduced a new 
> undefined name error; we should clean it up so we can merge that flake8 
> script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25370) Undefined name _exception_message in java_gateway

2018-09-07 Thread holdenk (JIRA)
holdenk created SPARK-25370:
---

 Summary: Undefined name _exception_message in java_gateway
 Key: SPARK-25370
 URL: https://issues.apache.org/jira/browse/SPARK-25370
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
Reporter: holdenk


While [~cclauss] has been working on SPARK-25270, we introduced a new undefined 
name error; we should clean it up so we can merge that flake8 script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25370) Undefined name _exception_message in java_gateway

2018-09-07 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-25370:
---

Assignee: holdenk

> Undefined name _exception_message in java_gateway
> -
>
> Key: SPARK-25370
> URL: https://issues.apache.org/jira/browse/SPARK-25370
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
>
> While [~cclauss] has been working on SPARK-25270, we introduced a new 
> undefined name error; we should clean it up so we can merge that flake8 
> script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25360) Parallelized RDDs of Ranges could have known partitioner

2018-09-06 Thread holdenk (JIRA)
holdenk created SPARK-25360:
---

 Summary: Parallelized RDDs of Ranges could have known partitioner
 Key: SPARK-25360
 URL: https://issues.apache.org/jira/browse/SPARK-25360
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: holdenk


We already have the logic to split up the generator; we could expose the same 
logic as a partitioner. This would be useful when joining a small parallelized 
collection with a larger collection, among other cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25255) Add getActiveSession to SparkSession in PySpark

2018-08-27 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-25255:

Labels: starter  (was: )

> Add getActiveSession to SparkSession in PySpark
> ---
>
> Key: SPARK-25255
> URL: https://issues.apache.org/jira/browse/SPARK-25255
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> Add getActiveSession to PySpark session API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25255) Add getActiveSession to SparkSession in PySpark

2018-08-27 Thread holdenk (JIRA)
holdenk created SPARK-25255:
---

 Summary: Add getActiveSession to SparkSession in PySpark
 Key: SPARK-25255
 URL: https://issues.apache.org/jira/browse/SPARK-25255
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: holdenk


Add getActiveSession to PySpark session API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25236) Investigate using a logging library inside of PySpark on the workers instead of print

2018-08-26 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593098#comment-16593098
 ] 

holdenk commented on SPARK-25236:
-

Probably. The only thing would be wanting to pass log level config 
from the driver to the executors, but that could be a V2 feature.

> Investigate using a logging library inside of PySpark on the workers instead 
> of print
> -
>
> Key: SPARK-25236
> URL: https://issues.apache.org/jira/browse/SPARK-25236
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> We don't have a logging library on the workers to use, which means that it's 
> difficult for folks to tune the log level on the workers. On the driver 
> processes we _could_ just call the JVM logging, but on the workers that won't 
> work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25236) Investigate using a logging library inside of PySpark on the workers instead of print

2018-08-24 Thread holdenk (JIRA)
holdenk created SPARK-25236:
---

 Summary: Investigate using a logging library inside of PySpark on 
the workers instead of print
 Key: SPARK-25236
 URL: https://issues.apache.org/jira/browse/SPARK-25236
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: holdenk


We don't have a logging library on the workers to use, which means that it's 
difficult for folks to tune the log level on the workers. On the driver 
processes we _could_ just call the JVM logging, but on the workers that won't 
work.
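As a point of comparison, a sketch of what using the stdlib logging module inside a worker function looks like today; the logger name and level are illustrative:

{code}
import logging

def process_partition(rows):
    log = logging.getLogger("myapp.worker")      # illustrative name
    if not logging.getLogger().handlers:
        logging.basicConfig(level=logging.INFO)  # configure once per worker process
    count = sum(1 for _ in rows)
    log.info("processed %d rows", count)
    return [count]

# Usage sketch: rdd.mapPartitions(process_partition).collect()
{code}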



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9636) Treat $SPARK_HOME as write-only

2018-08-24 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-9636:
---
Labels:   (was: easyfix)

> Treat $SPARK_HOME as write-only
> ---
>
> Key: SPARK-9636
> URL: https://issues.apache.org/jira/browse/SPARK-9636
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.4.1
> Environment: Linux
>Reporter: Philipp Angerer
>Priority: Minor
>
> When starting Spark scripts as a user and Spark is installed in a directory the 
> user has no write permissions on, many things work fine, except for the logs 
> (e.g. for {{start-master.sh}}).
> Logs are per default written to {{$SPARK_LOG_DIR}} or (if unset) to 
> {{$SPARK_HOME/logs}}.
> If installed in this way, it should, instead of throwing an error, write logs 
> to {{/var/log/spark/}}. That's easy to fix by simply testing a few log dirs 
> in sequence for writability before trying to use one. I suggest using 
> {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} 
> → {{$SPARK_HOME/logs/}}
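A sketch of the proposed fallback order; the actual scripts are shell, so this just spells out the selection logic for clarity:

{code}
import os

def pick_log_dir():
    candidates = [
        os.environ.get("SPARK_LOG_DIR"),
        "/var/log/spark",
        os.path.expanduser("~/.cache/spark-logs"),
        os.path.join(os.environ.get("SPARK_HOME", "."), "logs"),
    ]
    for d in candidates:
        if d and os.path.isdir(d) and os.access(d, os.W_OK):
            return d
    raise RuntimeError("no writable log directory found")
{code}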



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19094) Plumb through logging/error messages from the JVM to Jupyter PySpark

2018-08-24 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-19094.
-
Resolution: Won't Fix

No longer as important given other changes.

> Plumb through logging/error messages from the JVM to Jupyter PySpark
> 
>
> Key: SPARK-19094
> URL: https://issues.apache.org/jira/browse/SPARK-19094
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Trivial
>
> Jupyter/IPython notebooks work by overriding sys.stdout & sys.stderr; as 
> such the error messages that show up in Jupyter/IPython are often missing 
> the related logs - which are often more useful than the exception itself.
> This could make it easier for Python developers getting started with Spark on 
> their local laptops to debug their applications, since otherwise they need to 
> remember to keep going to the terminal where they launched the notebook from.
> One counterpoint to this is that Spark's logging is fairly verbose, but since 
> we provide the ability for the user to tune the log messages from within the 
> notebook that should be OK.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25153) Improve error messages for columns with dots/periods

2018-08-18 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-25153:

Labels: starter  (was: )

> Improve error messages for columns with dots/periods
> 
>
> Key: SPARK-25153
> URL: https://issues.apache.org/jira/browse/SPARK-25153
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> When we fail to resolve a column name with a dot in it, and the column name 
> is present as a string literal, the error message could mention using 
> backticks to have the string treated as a literal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25153) Improve error messages for columns with dots/periods

2018-08-18 Thread holdenk (JIRA)
holdenk created SPARK-25153:
---

 Summary: Improve error messages for columns with dots/periods
 Key: SPARK-25153
 URL: https://issues.apache.org/jira/browse/SPARK-25153
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: holdenk


When we fail to resolve a column name with a dot in it, and the column name is 
present as a string literal, the error message could mention using backticks to 
have the string treated as a literal.
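For reference, the backtick form the improved message could point to:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a.b", "c"])

df.select("`a.b`").show()    # works: backticks treat the dotted name as one column
# df.select("a.b").show()    # fails: "a.b" is resolved as field `b` of column `a`
{code}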



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24735) Improve exception when mixing up pandas_udf types

2018-08-13 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578719#comment-16578719
 ] 

holdenk commented on SPARK-24735:
-

So [~bryanc], what do you think of adding an AggregatePythonUDF and using it for 
grouped_map / grouped_agg so we get treated the correct way by the Scala SQL 
engine?

> Improve exception when mixing up pandas_udf types
> -
>
> Key: SPARK-24735
> URL: https://issues.apache.org/jira/browse/SPARK-24735
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> From the discussion here 
> https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up 
> Pandas UDF types, like using GROUPED_MAP as a SCALAR {{foo = 
> pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}} produces an 
> exception which is hard to understand.  It should tell the user that the UDF 
> type is wrong.  This is the full output:
> {code}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> >>> df.select(foo(df['v'])).show()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", 
> line 353, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> (input[0, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
>   at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24735) Improve exception when mixing up pandas_udf types

2018-08-13 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578710#comment-16578710
 ] 

holdenk commented on SPARK-24735:
-

I think we could do better than just improving the exception: if we look at the 
other aggregates in PySpark, when we call them with select it does the grouping 
for us:

 
{code:java}
>>> df.select(sumDistinct(df._1)).show()
+----------------+
|sum(DISTINCT _1)|
+----------------+
|            4950|
+----------------+{code}

> Improve exception when mixing up pandas_udf types
> -
>
> Key: SPARK-24735
> URL: https://issues.apache.org/jira/browse/SPARK-24735
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> From the discussion here 
> https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up 
> Pandas UDF types, like using GROUPED_MAP as a SCALAR {{foo = 
> pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}} produces an 
> exception which is hard to understand.  It should tell the user that the UDF 
> type is wrong.  This is the full output:
> {code}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> >>> df.select(foo(df['v'])).show()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", 
> line 353, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> (input[0, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
>   at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well

2018-08-13 Thread holdenk (JIRA)
holdenk created SPARK-25105:
---

 Summary: Importing all of pyspark.sql.functions should bring 
PandasUDFType in as well
 Key: SPARK-25105
 URL: https://issues.apache.org/jira/browse/SPARK-25105
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: holdenk


 
{code:java}
>>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
Traceback (most recent call last):
 File "", line 1, in 
NameError: name 'PandasUDFType' is not defined
 
{code}
When explicitly imported it works fine:
{code:java}
 
>>> from pyspark.sql.functions import PandasUDFType
>>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
{code}
 

We just need to make sure it's included in __all__.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24735) Improve exception when mixing up pandas_udf types

2018-08-13 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-24735:

Summary: Improve exception when mixing up pandas_udf types  (was: Improve 
exception when mixing pandas_udf types)

> Improve exception when mixing up pandas_udf types
> -
>
> Key: SPARK-24735
> URL: https://issues.apache.org/jira/browse/SPARK-24735
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> From the discussion here 
> https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up 
> Pandas UDF types, like using GROUPED_MAP as a SCALAR {{foo = 
> pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}} produces an 
> exception which is hard to understand.  It should tell the user that the UDF 
> type is wrong.  This is the full output:
> {code}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> >>> df.select(foo(df['v'])).show()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", 
> line 353, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> (input[0, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
>   at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
> ...
> {code}
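
By way of contrast, a short sketch of the intended usage with the 2.3/2.4-era
API: a GROUPED_MAP UDF goes through groupby().apply(), while only a SCALAR UDF
belongs inside select(). The toy DataFrame is just for illustration:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

# GROUPED_MAP receives and returns a whole pandas DataFrame per group...
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()

# ...while SCALAR works element-wise on pandas Series and is the one that
# belongs inside select().
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

df.select(plus_one(df["v"])).show()
{code}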



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24736) --py-files not functional for non local URLs. It appears to pass non-local URL's into PYTHONPATH directly.

2018-08-13 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578624#comment-16578624
 ] 

holdenk commented on SPARK-24736:
-

cc [~ifilonenko]

> --py-files not functional for non local URLs. It appears to pass non-local 
> URL's into PYTHONPATH directly.
> --
>
> Key: SPARK-24736
> URL: https://issues.apache.org/jira/browse/SPARK-24736
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
> Environment: Recent 2.4.0 from master branch, submitted on Linux to a 
> KOPS Kubernetes cluster created on AWS.
>  
>Reporter: Jonathan A Weaver
>Priority: Minor
>
> My spark-submit
> bin/spark-submit \
>         --master 
> k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/]
>  \
>         --deploy-mode cluster \
>         --name pytest \
>         --conf 
> spark.kubernetes.container.image=[412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest|http://412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest]
>  \
>         --conf 
> [spark.kubernetes.driver.pod.name|http://spark.kubernetes.driver.pod.name/]=spark-pi-driver
>  \
>         --conf 
> spark.kubernetes.authenticate.submission.caCertFile=[cluster.ca|http://cluster.ca/]
>  \
>         --conf spark.kubernetes.authenticate.submission.oauthToken=$TOK \
>         --conf spark.kubernetes.authenticate.driver.oauthToken=$TOK \
> --py-files "https://s3.amazonaws.com/maxar-ids-fids/screw.zip" \
> [https://s3.amazonaws.com/maxar-ids-fids/it.py]
>  
> *screw.zip is successfully downloaded and placed in SparkFiles.getRootPath()*
> 2018-07-01 07:33:43 INFO  SparkContext:54 - Added file 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] at 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] with timestamp 
> 1530430423297
> 2018-07-01 07:33:43 INFO  Utils:54 - Fetching 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] to 
> /var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240/fetchFileTemp1549645948768432992.tmp
> *I print out the  PYTHONPATH and PYSPARK_FILES environment variables from the 
> driver script:*
>      PYTHONPATH 
> /opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-0.10.7-src.zip:/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar:/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-*.zip:*[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]*
>     PYSPARK_FILES [https://s3.amazonaws.com/maxar-ids-fids/screw.zip]
>  
> *I print out sys.path*
> ['/tmp/spark-fec3684b-8b63-4f43-91a4-2f2fa41a1914', 
> u'/var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240',
>  '/opt/spark/python/lib/pyspark.zip', 
> '/opt/spark/python/lib/py4j-0.10.7-src.zip', 
> '/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar', 
> '/opt/spark/python/lib/py4j-*.zip', *'/opt/spark/work-dir/https', 
> '//[s3.amazonaws.com/maxar-ids-fids/screw.zip|http://s3.amazonaws.com/maxar-ids-fids/screw.zip]',*
>  '/usr/lib/python27.zip', '/usr/lib/python2.7', 
> '/usr/lib/python2.7/plat-linux2', '/usr/lib/python2.7/lib-tk', 
> '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', 
> '/usr/lib/python2.7/site-packages']
>  
> *URL from PYTHONPATH gets placed in sys.path verbatim with obvious results.*
>  
> *Dump of spark config from container.*
> Spark config dumped:
> [(u'spark.master', 
> u'k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/]'),
>  (u'spark.kubernetes.authenticate.submission.oauthToken', 
> u''), 
> (u'spark.kubernetes.authenticate.driver.oauthToken', 
> u''), (u'spark.kubernetes.executor.podNamePrefix', 
> u'pytest-1530430411996'), (u'spark.kubernetes.memoryOverheadFactor', u'0.4'), 
> (u'spark.driver.blockManager.port', u'7079'), 
> (u'[spark.app.id|http://spark.app.id/]', u'spark-application-1530430424433'), 
> (u'[spark.app.name|http://spark.app.name/]', u'pytest'), 
> (u'[spark.executor.id|http://spark.executor.id/]', u'driver'), 
> (u'spark.driver.host', u'pytest-1530430411996-driver-svc.default.svc'), 
> (u'spark.kubernetes.container.image', 
> u'[412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest'|http://412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest']),
>  (u'spark.driver.port', u'7078'), 
> (u'spark.kubernetes.python.mainAppResource', 
> u'[https://s3.amazonaws.com/maxar-ids-fids/it.py']), 
> 
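
As an illustration of the symptom rather than of Spark's internals: whatever
lands in PYTHONPATH verbatim is split on ':' when the interpreter builds
sys.path (on POSIX), which is how the https URL turns into the two bogus
entries seen above. A standalone sketch with example values:
{code:python}
import subprocess
import sys

# A PYTHONPATH that contains a remote URL verbatim (example values only).
env = {"PYTHONPATH": "/opt/spark/python:https://s3.amazonaws.com/some-bucket/screw.zip"}

out = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.path)"],
    env=env, capture_output=True, text=True,
)
print(out.stdout)
# The URL is split at the colon, so the zip is never importable this way; it
# would have to be fetched locally first and the local path added instead.
{code}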

[jira] [Created] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed

2018-08-07 Thread holdenk (JIRA)
holdenk created SPARK-25053:
---

 Summary: Allow additional port forwarding on Spark on K8S as needed
 Key: SPARK-25053
 URL: https://issues.apache.org/jira/browse/SPARK-25053
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: holdenk


In some cases, like setting up remote debuggers, adding additional ports to be 
forwarded would be useful.
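
Until something like that exists, the usual workaround for the remote-debugger
case is to forward the port by hand; a minimal sketch driving kubectl from
Python, where the pod name and port number are only placeholders:
{code:python}
import subprocess

# Forward a JVM debug port from the driver pod to the local machine.
# "spark-pi-driver" and 5005 stand in for your actual pod name and whatever
# port the debug agent was configured to listen on.
subprocess.run(
    ["kubectl", "port-forward", "pod/spark-pi-driver", "5005:5005"],
    check=True,
)
{code}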



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21436) Take advantage of known partitioner for distinct on RDDs

2018-08-06 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570517#comment-16570517
 ] 

holdenk edited comment on SPARK-21436 at 8/6/18 5:19 PM:
-

@[~podongfeng] 

So distinct triggers a `map` first (e.g. it is
{code:java}
map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1){code}
) and the `map` will throw away our partition information.

 

I did a quick test in the shell and you can see it causes a second shuffle if 
you run the same commands:

 
{code:java}
scala> sc.parallelize(1.to(100))
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:25
scala> sc.parallelize(1.to(100)).count()
res1: Long = 100
scala> sc.parallelize(1.to(100))
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:25
scala> res2.sort()
<console>:26: error: value sort is not a member of org.apache.spark.rdd.RDD[Int]
 res2.sort()
 ^
scala> res2.map(x => (x, x))
res4: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[3] at map at <console>:26
scala> res4.sortByKey()
res5: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[4] at sortByKey at <console>:26
scala> res5.cache()
res6: res5.type = ShuffledRDD[4] at sortByKey at <console>:26
scala> res6.count()
res7: Long = 100
scala> res6.distinct()
res8: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[7] at distinct at <console>:26
scala> res8.count()
res9: Long = 100
{code}
 


was (Author: holdenk):
@[~podongfeng] 

So distinct triggers a `map` first (e.g. it is `map(x => (x, 
null)).reduceByKey((x, y) => x, numPartitions).map(_._1)`) and the `map` will 
throw away our partition information.

 

I did a quick test in the shell and you can see it causes a second shuffle if 
you run the same commands:

 
{code:java}

scala> sc.parallelize(1.to(100))
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:25
scala> sc.parallelize(1.to(100)).count()
res1: Long = 100
scala> sc.parallelize(1.to(100))
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:25
scala> res2.sort()
<console>:26: error: value sort is not a member of org.apache.spark.rdd.RDD[Int]
 res2.sort()
 ^
scala> res2.map(x => (x, x))
res4: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[3] at map at <console>:26
scala> res4.sortByKey()
res5: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[4] at sortByKey at <console>:26
scala> res5.cache()
res6: res5.type = ShuffledRDD[4] at sortByKey at <console>:26
scala> res6.count()
res7: Long = 100
scala> res6.distinct()
res8: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[7] at distinct at <console>:26
scala> res8.count()
res9: Long = 100
{code}
 

> Take advantage of known partitioner for distinct on RDDs
> --
>
> Key: SPARK-21436
> URL: https://issues.apache.org/jira/browse/SPARK-21436
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Minor
>
> If we have a known partitioner we should be able to avoid the shuffle.
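
For comparison, a rough sketch of the idea in PySpark terms (the Scala change
would be analogous); distinct_with_known_partitioner is an illustrative helper,
not an existing API:
{code:python}
# If the RDD already has a partitioner, equal (k, v) pairs are guaranteed to
# sit in the same partition, so a per-partition dedup is enough and the extra
# shuffle from distinct() can be skipped.
def distinct_with_known_partitioner(rdd):
    if rdd.partitioner is not None:
        return rdd.mapPartitions(
            lambda part: iter(set(part)), preservesPartitioning=True)
    return rdd.distinct()
{code}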



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21436) Take advantage of known partitioner for distinct on RDDs

2018-08-06 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570517#comment-16570517
 ] 

holdenk commented on SPARK-21436:
-

@[~podongfeng] 

So distinct triggers a `map` first (e.g. it is `map(x => (x, 
null)).reduceByKey((x, y) => x, numPartitions).map(_._1)`) and the `map` will 
throw away our partition information.

 

I did a quick test in the shell and you can see it causes a second shuffle if 
you run the same commands:

 
{code:java}

scala> sc.parallelize(1.to(100))
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:25
scala> sc.parallelize(1.to(100)).count()
res1: Long = 100
scala> sc.parallelize(1.to(100))
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:25
scala> res2.sort()
<console>:26: error: value sort is not a member of org.apache.spark.rdd.RDD[Int]
 res2.sort()
 ^
scala> res2.map(x => (x, x))
res4: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[3] at map at <console>:26
scala> res4.sortByKey()
res5: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[4] at sortByKey at <console>:26
scala> res5.cache()
res6: res5.type = ShuffledRDD[4] at sortByKey at <console>:26
scala> res6.count()
res7: Long = 100
scala> res6.distinct()
res8: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[7] at distinct at <console>:26
scala> res8.count()
res9: Long = 100
{code}
 

> Take advantage of known partitioner for distinct on RDDs
> --
>
> Key: SPARK-21436
> URL: https://issues.apache.org/jira/browse/SPARK-21436
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Minor
>
> If we have a known partitioner we should be able to avoid the shuffle.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-07-30 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562638#comment-16562638
 ] 

holdenk commented on SPARK-24579:
-

[~mengxr] How about you just open comments up in general and then turn it off if
spam becomes a problem?

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache 
> MXNet (incubating).
> Both big data and AI are indispensable components to drive business 
> innovation and there have
> been multiple attempts from both communities to bring them together.
> We saw efforts from AI community to implement data solutions for AI 
> frameworks like tf.data and tf.Transform. However, with 50+ data sources and 
> built-in SQL, DataFrames, and Streaming features, Spark remains the community 
> choice for big data. This is why we saw many efforts to integrate DL/AI 
> frameworks with Spark to leverage its power, for example, TFRecords data 
> source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project 
> Hydrogen, this SPIP takes a different angle at Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark 
> and external DL/AI frameworks. And the performance matters. However, there 
> doesn’t exist a standard way to exchange data and hence implementation and 
> performance optimization fall into pieces. For example, TensorFlowOnSpark 
> uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and 
> save data and pass the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and optimize data conversion from/to this 
> interface.  So DL/AI frameworks can leverage Spark to load data virtually 
> from anywhere without spending extra effort building complex data solutions, 
> like reading features from a production data warehouse or streaming model 
> inference. Spark users can use DL/AI frameworks without learning specific 
> data APIs implemented there. And developers from both sides can work on 
> performance optimizations independently given the interface itself doesn’t 
> introduce big overhead.
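
As one concrete point of reference for the kind of exchange the SPIP wants to
standardize, Spark's existing pandas conversion path already goes through Arrow
when the conf below is enabled (conf name as in the 2.3/2.4 docs); a small
sketch with toy data:
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("arrow-exchange-sketch")
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())

df = spark.range(0, 1000).selectExpr("id", "id * 2 AS feature")

# Columnar, Arrow-backed handoff to pandas; a DL/AI framework could consume the
# resulting arrays without a row-by-row Python conversion.
pdf = df.toPandas()
print(pdf.dtypes)
{code}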



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23451) Deprecate KMeans computeCost

2018-07-20 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-23451.
-
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 2.4.0

> Deprecate KMeans computeCost
> 
>
> Key: SPARK-23451
> URL: https://issues.apache.org/jira/browse/SPARK-23451
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Trivial
> Fix For: 2.4.0
>
>
> SPARK-11029 added the {{computeCost}} method as a temp fix for the lack of 
> proper cluster evaluators. Now SPARK-14516 introduces a proper 
> {{ClusteringEvaluator}}, so we should deprecate this method and maybe remove 
> it in the next releases.
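
For reference, a short sketch of the replacement path: evaluate a fitted KMeans
model with ClusteringEvaluator (silhouette) instead of calling computeCost. The
toy data is only for illustration:
{code:python}
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

predictions = KMeans(k=2, seed=1).fit(df).transform(df)

# Silhouette score from the dedicated evaluator replaces the ad-hoc cost metric.
print(ClusteringEvaluator().evaluate(predictions))
{code}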



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23528) Add numIter to ClusteringSummary

2018-07-13 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-23528:
---

Assignee: Marco Gaido

> Add numIter to ClusteringSummary
> 
>
> Key: SPARK-23528
> URL: https://issues.apache.org/jira/browse/SPARK-23528
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Erich Schubert
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Spark ML should expose vital statistics of the GMM model:
>  * *Number of iterations* (actual, not max) until the tolerance threshold was 
> hit: we can set a maximum, but how do we know the limit was large enough, and 
> how many iterations it really took?
> Follow up: Final log likelihood of the model: if we run multiple times with 
> different starting conditions, how do we know which run converged to the 
> better fit?
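
For reference, a sketch of how the summary would be consumed; that the Python
wrapper exposes numIter alongside logLikelihood is an assumption here (the
change itself is about the Scala ClusteringSummary):
{code:python}
from pyspark.ml.clustering import GaussianMixture
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([-0.1, -0.05]),), (Vectors.dense([-0.01, -0.1]),),
     (Vectors.dense([0.9, 0.8]),), (Vectors.dense([0.75, 0.935]),)],
    ["features"])

model = GaussianMixture(k=2, tol=0.0001, maxIter=10, seed=10).fit(df)
summary = model.summary

# Actual iteration count vs. the configured maximum, plus the final fit quality.
print(summary.logLikelihood)
print(summary.numIter)  # assumes the Python wrapper exposes numIter
{code}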



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23528) Add numIter to ClusteringSummary

2018-07-13 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-23528.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Thanks!

> Add numIter to ClusteringSummary
> 
>
> Key: SPARK-23528
> URL: https://issues.apache.org/jira/browse/SPARK-23528
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Erich Schubert
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Spark ML should expose vital statistics of the GMM model:
>  * *Number of iterations* (actual, not max) until the tolerance threshold was 
> hit: we can set a maximum, but how do we know the limit was large enough, and 
> how many iterations it really took?
> Follow up: Final log likelihood of the model: if we run multiple times with 
> different starting conditions, how do we know which run converged to the 
> better fit?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23528) Add numIter to ClusteringSummary

2018-07-13 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-23528:

Description: 
Spark ML should expose vital statistics of the GMM model:
 * *Number of iterations* (actual, not max) until the tolerance threshold was 
hit: we can set a maximum, but how do we know the limit was large enough, and 
how many iterations it really took?

Follow up: Final log likelihood of the model: if we run multiple times with 
different starting conditions, how do we know which run converged to the better 
fit?

  was:
Spark ML should expose vital statistics of the GMM model:
 * *Number of iterations* (actual, not max) until the tolerance threshold was 
hit: we can set a maximum, but how do we know the limit was large enough, and 
how many iterations it really took?
 * Final *log likelihood* of the model: if we run multiple times with different 
starting conditions, how do we know which run converged to the better fit?


> Add numIter to ClusteringSummary
> 
>
> Key: SPARK-23528
> URL: https://issues.apache.org/jira/browse/SPARK-23528
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Erich Schubert
>Priority: Minor
>
> Spark ML should expose vital statistics of the GMM model:
>  * *Number of iterations* (actual, not max) until the tolerance threshold was 
> hit: we can set a maximum, but how do we know the limit was large enough, and 
> how many iterations it really took?
> Follow up: Final log likelihood of the model: if we run multiple times with 
> different starting conditions, how do we know which run converged to the 
> better fit?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23528) Add numIter to ClusteringSummary

2018-07-13 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-23528:

Summary: Add numIter to ClusteringSummary  (was: Expose vital statistics of 
GaussianMixtureModel)

> Add numIter to ClusteringSummary
> 
>
> Key: SPARK-23528
> URL: https://issues.apache.org/jira/browse/SPARK-23528
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Erich Schubert
>Priority: Minor
>
> Spark ML should expose vital statistics of the GMM model:
>  * *Number of iterations* (actual, not max) until the tolerance threshold was 
> hit: we can set a maximum, but how do we know the limit was large enough, and 
> how many iterations it really took?
>  * Final *log likelihood* of the model: if we run multiple times with 
> different starting conditions, how do we know which run converged to the 
> better fit?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24780) DataFrame.column_name should resolve to a distinct ref

2018-07-10 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-24780:

Summary: DataFrame.column_name should resolve to a distinct ref  (was: 
DataFrame.column_name should take into account DataFrame alias for future joins)

> DataFrame.column_name should resolve to a distinct ref
> --
>
> Key: SPARK-24780
> URL: https://issues.apache.org/jira/browse/SPARK-24780
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Minor
>
> If we join a dataframe with another dataframe which has the same column name 
> of the conditions (e.g. shared lineage on one of the conditions) even though 
> the join condition may be written with the full name, the columns returned 
> don't have the dataframe alias and as such will create a cross-join.
> For example this currently works even if both posts_by_sampled_authors  &  
> mailing_list_posts_in_reply_to contain both in_reply_to and message_id fields.
>  
> {code:java}
> posts_with_replies = posts_by_sampled_authors.join(
>  mailing_list_posts_in_reply_to,
>  [F.col("mailing_list_posts_in_reply_to.in_reply_to") == 
> F.col("posts_by_sampled_authors.message_id")],
>  "inner"){code}
>  
> But a similarly written expression:
> {code:java}
> posts_with_replies = posts_by_sampled_authors.join(
>  mailing_list_posts_in_reply_to,
>  [mailing_list_posts_in_reply_to.in_reply_to == 
> posts_by_sampled_authors.message_id],
>  "inner"){code}
> will fail.
>  
> I'm not super sure what's going on inside of the resolution that's causing it 
> to get confused.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24780) DataFrame.column_name should take into account DataFrame alias for future joins

2018-07-10 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-24780:

Description: 
If we join a dataframe with another dataframe which has the same column name of 
the conditions (e.g. shared lineage on one of the conditions) even though the 
join condition may be written with the full name, the columns returned don't 
have the dataframe alias and as such will create a cross-join.

For example this currently works even if both posts_by_sampled_authors  &  
mailing_list_posts_in_reply_to contain both in_reply_to and message_id fields.

 
{code:java}
posts_with_replies = posts_by_sampled_authors.join(
 mailing_list_posts_in_reply_to,
 [F.col("mailing_list_posts_in_reply_to.in_reply_to") == 
F.col("posts_by_sampled_authors.message_id")],
 "inner"){code}
 

But a similarly written expression:
{code:java}
posts_with_replies = posts_by_sampled_authors.join(
 mailing_list_posts_in_reply_to,
 [mailing_list_posts_in_reply_to.in_reply_to == 
posts_by_sampled_authors.message_id],
 "inner"){code}
will fail.

 

I'm not super sure what's going on inside of the resolution that's causing it to 
get confused.

  was:
If we join a dataframe with another dataframe which has the same column name of 
the conditions (e.g. shared lineage on one of the conditions) even though the 
join condition may be written with the full name, the columns returned don't 
have the dataframe alias and as such will create a cross-join.

For example this currently works even if both posts_by_sampled_authors  &  
mailing_list_posts_in_reply_to contain both in_reply_to and message_id fields.

 
{code:java}
posts_with_replies = posts_by_sampled_authors.join(
 mailing_list_posts_in_reply_to,
 [F.col("mailing_list_posts_in_reply_to.in_reply_to") == 
F.col("posts_by_sampled_authors.message_id")],
 "inner"){code}
 

But a similarly written expression:
{code:java}
posts_with_replies = posts_by_sampled_authors.join(
 mailing_list_posts_in_reply_to,
 [mailing_list_posts_in_reply_to.in_reply_to == 
posts_by_sampled_authors.message_id],
 "inner"){code}
will fail.

 

We could fix this by changing it so that dataframe.column in PySpark returns 
the fully qualified column reference if the dataframe has an alias.


> DataFrame.column_name should take into account DataFrame alias for future 
> joins
> ---
>
> Key: SPARK-24780
> URL: https://issues.apache.org/jira/browse/SPARK-24780
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Minor
>
> If we join a dataframe with another dataframe which has the same column name 
> of the conditions (e.g. shared lineage on one of the conditions) even though 
> the join condition may be written with the full name, the columns returned 
> don't have the dataframe alias and as such will create a cross-join.
> For example this currently works even if both posts_by_sampled_authors  &  
> mailing_list_posts_in_reply_to contain both in_reply_to and message_id fields.
>  
> {code:java}
> posts_with_replies = posts_by_sampled_authors.join(
>  mailing_list_posts_in_reply_to,
>  [F.col("mailing_list_posts_in_reply_to.in_reply_to") == 
> F.col("posts_by_sampled_authors.message_id")],
>  "inner"){code}
>  
> But a similarly written expression:
> {code:java}
> posts_with_replies = posts_by_sampled_authors.join(
>  mailing_list_posts_in_reply_to,
>  [mailing_list_posts_in_reply_to.in_reply_to == 
> posts_by_sampled_authors.message_id],
>  "inner"){code}
> will fail.
>  
> I'm not super sure what's going on inside of the resolution that's causing it 
> to get confused.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24780) DataFrame.column_name should take into account DataFrame alias for future joins

2018-07-10 Thread holdenk (JIRA)
holdenk created SPARK-24780:
---

 Summary: DataFrame.column_name should take into account DataFrame 
alias for future joins
 Key: SPARK-24780
 URL: https://issues.apache.org/jira/browse/SPARK-24780
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.4.0
Reporter: holdenk


If we join a dataframe with another dataframe which has the same column name of 
the conditions (e.g. shared lineage on one of the conditions) even though the 
join condition may be written with the full name, the columns returned don't 
have the dataframe alias and as such will create a cross-join.

For example this currently works even if both posts_by_sampled_authors  &  
mailing_list_posts_in_reply_to contain both in_reply_to and message_id fields.

 
{code:java}
posts_with_replies = posts_by_sampled_authors.join(
 mailing_list_posts_in_reply_to,
 [F.col("mailing_list_posts_in_reply_to.in_reply_to") == 
F.col("posts_by_sampled_authors.message_id")],
 "inner"){code}
 

But a similarly written expression:
{code:java}
posts_with_replies = posts_by_sampled_authors.join(
 mailing_list_posts_in_reply_to,
 [mailing_list_posts_in_reply_to.in_reply_to == 
posts_by_sampled_authors.message_id],
 "inner"){code}
will fail.

 

We could fix this by changing it so that dataframe.column in PySpark returns 
the fully qualified column reference if the dataframe has an alias.
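
Until the resolution behaviour changes, a common way to keep the second form
unambiguous is to alias both sides and refer to the columns through those
aliases; a sketch with toy stand-in DataFrames sharing lineage:
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

posts = spark.createDataFrame(
    [("m2", "m1"), ("m1", None)], ["message_id", "in_reply_to"])
posts_by_sampled_authors = posts.alias("authors")
mailing_list_posts_in_reply_to = posts.alias("replies")

# Referring to the columns through the aliases keeps the two sides distinct,
# which avoids the ambiguous/cross-join behaviour described above.
posts_with_replies = posts_by_sampled_authors.join(
    mailing_list_posts_in_reply_to,
    [F.col("replies.in_reply_to") == F.col("authors.message_id")],
    "inner",
)
posts_with_replies.show()
{code}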



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24668) PySpark crashes when getting the webui url if the webui is disabled

2018-07-02 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530284#comment-16530284
 ] 

holdenk commented on SPARK-24668:
-

So there is also the case where Spark is run without the history server or the
web UI, which I think is important to consider as well, so a getOrElse is needed
in both cases even if we do #1.

 

I think either overriding uiWebUrl inside of SparkUI to point to the history
server UI URL when the interactive one isn't launched would be OK, or changing
the repr to show both the Spark web UI and the history UI (with Nones as
appropriate) might be better. What do you think?

 

Regardless I'm happy to take on the review of this PR :)
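
A rough sketch of what the getOrElse-style guard could look like from the
Python side; ui_web_url_or_none is only an illustrative helper, not the actual
patch:
{code:python}
def ui_web_url_or_none(sc):
    """Return the web UI URL for a live SparkContext, or None if the UI is disabled.

    The scala.Option handle returned through py4j is checked with isDefined()
    instead of calling .get() unconditionally, which is what currently raises.
    """
    web_url = sc._jsc.sc().uiWebUrl()
    return web_url.get() if web_url.isDefined() else None
{code}
The _repr_html_ path could then render the history server URL or a plain None
instead of raising.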

> PySpark crashes when getting the webui url if the webui is disabled
> ---
>
> Key: SPARK-24668
> URL: https://issues.apache.org/jira/browse/SPARK-24668
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0, 2.4.0
> Environment: * Spark 2.3.0
>  * Spark-on-YARN
>  * Java 8
>  * Python 3.6.5
>  * Jupyter 4.4.0
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> Repro:
>  
> Evaluate `sc` in a Jupyter notebook:
>  
>  
> {{---}}
> {{Py4JJavaError                             Traceback (most recent call 
> last)}}
> {{/opt/conda/lib/python3.6/site-packages/IPython/core/formatters.py in 
> __call__(self, obj)}}
> {{    343             method = get_real_method(obj, self.print_method)}}
> {{    344             if method is not None:}}
> {{--> 345                 return method()}}
> {{    346             return None}}
> {{    347         else:}}
> {{/usr/lib/spark/python/pyspark/context.py in _repr_html_(self)}}
> {{    261         }}
> {{    262         """.format(}}
> {{--> 263             sc=self}}
> {{    264         )}}
> {{    265 }}
> {{/usr/lib/spark/python/pyspark/context.py in uiWebUrl(self)}}
> {{    373     def uiWebUrl(self):}}
> {{    374         """Return the URL of the SparkUI instance started by this 
> SparkContext"""}}
> {{--> 375         return 
> self._jsc.sc().uiWebUrl().get()}}
> {{    376 }}
> {{    377     @property}}
> {{/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)}}
> {{   1158         answer = self.gateway_client.send_command(command)}}
> {{   1159         return_value = get_return_value(}}
> {{-> 1160             answer, self.gateway_client, self.target_id, 
> self.name)}}
> {{   1161 }}
> {{   1162         for temp_arg in temp_args:}}
> {{/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)}}
> {{     61     def deco(*a, **kw):}}
> {{     62         try:}}
> {{---> 63             return f(*a, **kw)}}
> {{     64         except py4j.protocol.Py4JJavaError as e:}}
> {{     65             s = e.java_exception.toString()}}
> {{/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)}}
> {{    318                 raise Py4JJavaError(}}
> {{    319                     "An error occurred while calling 
> \{0}{1}\{2}.\n".}}
> {{--> 320                     format(target_id, ".", name), value)}}
> {{    321             else:}}
> {{    322                 raise Py4JError(}}
> {{Py4JJavaError: An error occurred while calling o80.get.}}
> {{: java.util.NoSuchElementException: None.get}}
> {{        at scala.None$.get(Option.scala:347)}}
> {{        at scala.None$.get(Option.scala:345)}}
> {{        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}}
> {{        at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)}}
> {{        at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
> {{        at java.lang.reflect.Method.invoke(Method.java:498)}}
> {{        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)}}
> {{        at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)}}
> {{        at py4j.Gateway.invoke(Gateway.java:282)}}
> {{        at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)}}
> {{        at py4j.commands.CallCommand.execute(CallCommand.java:79)}}
> {{        at py4j.GatewayConnection.run(GatewayConnection.java:214)}}
> {{        at java.lang.Thread.run(Thread.java:748)}}
>  
> PySpark only prints out the web ui url in `_repr_html`, not `__repr__`, so 
> this only happens in notebooks that render html, not the pyspark shell. 
> [https://github.com/apache/spark/commit/f654b39a63d4f9b118733733c7ed2a1b58649e3d]
>  
> Disabling Spark's UI 

[jira] [Updated] (SPARK-24668) PySpark crashes when getting the webui url if the webui is disabled

2018-07-02 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-24668:

 Shepherd: holdenk
Affects Version/s: 2.4.0

> PySpark crashes when getting the webui url if the webui is disabled
> ---
>
> Key: SPARK-24668
> URL: https://issues.apache.org/jira/browse/SPARK-24668
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0, 2.4.0
> Environment: * Spark 2.3.0
>  * Spark-on-YARN
>  * Java 8
>  * Python 3.6.5
>  * Jupyter 4.4.0
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> Repro:
>  
> Evaluate `sc` in a Jupyter notebook:
>  
>  
> {{---}}
> {{Py4JJavaError                             Traceback (most recent call 
> last)}}
> {{/opt/conda/lib/python3.6/site-packages/IPython/core/formatters.py in 
> __call__(self, obj)}}
> {{    343             method = get_real_method(obj, self.print_method)}}
> {{    344             if method is not None:}}
> {{--> 345                 return method()}}
> {{    346             return None}}
> {{    347         else:}}
> {{/usr/lib/spark/python/pyspark/context.py in _repr_html_(self)}}
> {{    261         }}
> {{    262         """.format(}}
> {{--> 263             sc=self}}
> {{    264         )}}
> {{    265 }}
> {{/usr/lib/spark/python/pyspark/context.py in uiWebUrl(self)}}
> {{    373     def uiWebUrl(self):}}
> {{    374         """Return the URL of the SparkUI instance started by this 
> SparkContext"""}}
> {{--> 375         return 
> self._jsc.sc().uiWebUrl().get()}}
> {{    376 }}
> {{    377     @property}}
> {{/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)}}
> {{   1158         answer = self.gateway_client.send_command(command)}}
> {{   1159         return_value = get_return_value(}}
> {{-> 1160             answer, self.gateway_client, self.target_id, 
> self.name)}}
> {{   1161 }}
> {{   1162         for temp_arg in temp_args:}}
> {{/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)}}
> {{     61     def deco(*a, **kw):}}
> {{     62         try:}}
> {{---> 63             return f(*a, **kw)}}
> {{     64         except py4j.protocol.Py4JJavaError as e:}}
> {{     65             s = e.java_exception.toString()}}
> {{/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)}}
> {{    318                 raise Py4JJavaError(}}
> {{    319                     "An error occurred while calling 
> \{0}{1}\{2}.\n".}}
> {{--> 320                     format(target_id, ".", name), value)}}
> {{    321             else:}}
> {{    322                 raise Py4JError(}}
> {{Py4JJavaError: An error occurred while calling o80.get.}}
> {{: java.util.NoSuchElementException: None.get}}
> {{        at scala.None$.get(Option.scala:347)}}
> {{        at scala.None$.get(Option.scala:345)}}
> {{        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}}
> {{        at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)}}
> {{        at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
> {{        at java.lang.reflect.Method.invoke(Method.java:498)}}
> {{        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)}}
> {{        at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)}}
> {{        at py4j.Gateway.invoke(Gateway.java:282)}}
> {{        at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)}}
> {{        at py4j.commands.CallCommand.execute(CallCommand.java:79)}}
> {{        at py4j.GatewayConnection.run(GatewayConnection.java:214)}}
> {{        at java.lang.Thread.run(Thread.java:748)}}
>  
> PySpark only prints out the web ui url in `_repr_html`, not `__repr__`, so 
> this only happens in notebooks that render html, not the pyspark shell. 
> [https://github.com/apache/spark/commit/f654b39a63d4f9b118733733c7ed2a1b58649e3d]
>  
> Disabling Spark's UI with `spark.ui.enabled` *is* valuable outside of tests. 
> A couple reasons that come to mind:
> 1) If you run multiple spark applications from one machine, Spark 
> irritatingly starts picking the same port (4040), as the first application, 
> then increments (4041, 4042, etc) until it finds an open port. If you are 
> running 10 spark apps, then the 11th prints out 10 warnings about ports being 
> taken until it finally finds one.
> 2) You can serve the spark web ui from a dedicated spark history server 
> instead of 
