[jira] [Resolved] (SPARK-22964) don't allow task restarts for continuous processing

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22964.
--
Resolution: Incomplete

> don't allow task restarts for continuous processing
> ---
>
> Key: SPARK-22964
> URL: https://issues.apache.org/jira/browse/SPARK-22964
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Priority: Major
>  Labels: bulk-closed
>







[jira] [Resolved] (SPARK-24266) Spark client terminates while driver is still running

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24266.
--
Resolution: Incomplete

> Spark client terminates while driver is still running
> -
>
> Key: SPARK-24266
> URL: https://issues.apache.org/jira/browse/SPARK-24266
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Chun Chen
>Priority: Major
>  Labels: bulk-closed
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: N/A
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>phase: Pending
>status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start ti

[jira] [Resolved] (SPARK-24729) Spark - stackoverflow error - org.apache.spark.sql.catalyst.plans.QueryPlan

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24729.
--
Resolution: Incomplete

> Spark - stackoverflow error - org.apache.spark.sql.catalyst.plans.QueryPlan
> ---
>
> Key: SPARK-24729
> URL: https://issues.apache.org/jira/browse/SPARK-24729
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.1
>Reporter: t oo
>Priority: Major
>  Labels: bulk-closed
>
> Ran a Spark (v2.1.1) job that joins 2 RDDs (one is a .txt file from S3, the other 
> is Parquet from S3); the job then merges the dataset (i.e. gets the latest row per 
> PK, and if a PK exists in both the .txt and the Parquet, takes the row from the 
> .txt) and writes out a new Parquet file to S3. Got this error, but upon re-running 
> it worked fine. Both the .txt and the Parquet have 302 columns. The .txt has 191 
> rows, the Parquet has 156,300 rows. Does anyone know the cause?
>  
> {code:java}
>  
> 18/07/02 13:51:56 INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID 
> 134, 10.160.122.226, executor 0, partition 0, PROCESS_LOCAL, 6337 bytes)
> 18/07/02 13:51:56 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory 
> on 10.160.122.226:38011 (size: 27.2 KB, free: 4.6 GB)
> 18/07/02 13:51:56 INFO TaskSetManager: Finished task 0.0 in stage 14.0 (TID 
> 134) in 295 ms on 10.160.122.226 (executor 0) (1/1)
> 18/07/02 13:51:56 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks 
> have all completed, from pool
> 18/07/02 13:51:56 INFO DAGScheduler: ResultStage 14 (load at Data.scala:25) 
> finished in 0.295 s
> 18/07/02 13:51:56 INFO DAGScheduler: Job 7 finished: load at Data.scala:25, 
> took 0.310932 s
> 18/07/02 13:51:57 INFO FileSourceStrategy: Pruning directories with:
> 18/07/02 13:51:57 INFO FileSourceStrategy: Post-Scan Filters:
> 18/07/02 13:51:57 INFO FileSourceStrategy: Output Data Schema: struct<... : 
> string, created: timestamp, created_by: string, last_upd: timestamp, 
> last_upd_by: string ... 300 more fields>
> 18/07/02 13:51:57 INFO FileSourceStrategy: Pushed Filters:
> 18/07/02 13:51:57 INFO MemoryStore: Block broadcast_19 stored as values in 
> memory (estimated size 387.2 KB, free 911.2 MB)
> 18/07/02 13:51:57 INFO MemoryStore: Block broadcast_19_piece0 stored as bytes 
> in memory (estimated size 33.7 KB, free 911.1 MB)
> 18/07/02 13:51:57 INFO BlockManagerInfo: Added broadcast_19_piece0 in memory 
> on 10.160.123.242:38105 (size: 33.7 KB, free: 912.2 MB)
> 18/07/02 13:51:57 INFO SparkContext: Created broadcast 19 from cache at 
> Upsert.scala:25
> 18/07/02 13:51:57 INFO FileSourceScanExec: Planning scan with bin packing, 
> max size: 48443541 bytes, open cost is considered as scanning 4194304 bytes.
> 18/07/02 13:51:57 INFO SparkContext: Starting job: take at Utils.scala:28
> 18/07/02 13:51:57 INFO DAGScheduler: Got job 8 (take at Utils.scala:28) with 
> 1 output partitions
> 18/07/02 13:51:57 INFO DAGScheduler: Final stage: ResultStage 15 (take at 
> Utils.scala:28)
> 18/07/02 13:51:57 INFO DAGScheduler: Parents of final stage: List()
> 18/07/02 13:51:57 INFO DAGScheduler: Missing parents: List()
> 18/07/02 13:51:57 INFO DAGScheduler: Submitting ResultStage 15 
> (MapPartitionsRDD[65] at take at Utils.scala:28), which has no missing parents
> 18/07/02 13:51:57 INFO MemoryStore: Block broadcast_20 stored as values in 
> memory (estimated size 321.5 KB, free 910.8 MB)
> 18/07/02 13:51:57 INFO MemoryStore: Block broadcast_20_piece0 stored as bytes 
> in memory (estimated size 93.0 KB, free 910.7 MB)
> 18/07/02 13:51:57 INFO BlockManagerInfo: Added broadcast_20_piece0 in memory 
> on 10.160.123.242:38105 (size: 93.0 KB, free: 912.1 MB)
> 18/07/02 13:51:57 INFO SparkContext: Created broadcast 20 from broadcast at 
> DAGScheduler.scala:996
> 18/07/02 13:51:57 INFO DAGScheduler: Submitting 1 missing tasks from 
> ResultStage 15 (MapPartitionsRDD[65] at take at Utils.scala:28)
> 18/07/02 13:51:57 INFO TaskSchedulerImpl: Adding task set 15.0 with 1 tasks
> 18/07/02 13:51:57 INFO TaskSetManager: Starting task 0.0 in stage 15.0 (TID 
> 135, 10.160.122.226, executor 0, partition 0, PROCESS_LOCAL, 9035 bytes)
> 18/07/02 13:51:57 INFO BlockManagerInfo: Added broadcast_20_piece0 in memory 
> on 10.160.122.226:38011 (size: 93.0 KB, free: 4.6 GB)
> 18/07/02 13:51:57 INFO BlockManagerInfo: Added broadcast_19_piece0 in memory 
> on 10.160.122.226:38011 (size: 33.7 KB, free: 4.6 GB)
> 18/07/02 13:52:05 INFO BlockManagerInfo: Added rdd_61_0 in memory on 
> 10.160.122.226:38011 (size: 38.9 MB, free: 4.5 GB)
> 18/07/02 13:52:09 INFO BlockManagerInfo: Added rdd_63_0 in memory on 
> 10.160.122.226:38011 (size: 38.9 MB, free: 4.5 GB)
> 18/07/02 13:52:09 INFO TaskSetManager: Finished task 0.0 in stage 15.0 (TID 
> 135) in 11751 ms on 10.1

[jira] [Resolved] (SPARK-21536) Remove the workaround to allow dots in field names in R's createDataFrame

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21536.
--
Resolution: Incomplete

> Remove the workaround to allow dots in field names in R's createDataFrame
> ---
>
> Key: SPARK-21536
> URL: https://issues.apache.org/jira/browse/SPARK-21536
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: bulk-closed
>
> We currently convert dots to underscores in some cases in SparkR, as shown below:
> {code}
> > createDataFrame(list(list(1)), "a.a")
> SparkDataFrame[a_a:double]
> ...
> In FUN(X[[i]], ...) : Use a_a instead of a.a  as column name
> {code}
> {code}
> > createDataFrame(list(list(a.a = 1)))
> SparkDataFrame[a_a:double]
> ...
> In FUN(X[[i]], ...) : Use a_a instead of a.a  as column name
> {code}
> This looks like it was introduced in the first place because of SPARK-2775, but 
> that is now fixed by SPARK-6898.






[jira] [Resolved] (SPARK-23839) consider bucket join in cost-based JoinReorder rule

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23839.
--
Resolution: Incomplete

> consider bucket join in cost-based JoinReorder rule
> ---
>
> Key: SPARK-23839
> URL: https://issues.apache.org/jira/browse/SPARK-23839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiaoju Wu
>Priority: Minor
>  Labels: bulk-closed
>
> The cost-based JoinReorder rule was introduced in Spark 2.2 and improved with 
> histograms in the Spark 2.3 release. However, it does not take the cost of the 
> different join implementations into account. For example:
> TableA JOIN TableB JOIN TableC
> TableA  will output 10,000 rows after filter and projection. 
> TableB  will output 10,000 rows after filter and projection. 
> TableC  will output 8,000 rows after filter and projection. 
> The current JoinReorder rule will possibly optimize the plan to join TableC 
> with TableA first and then TableB. But if TableA and TableB are bucketed tables 
> to which a bucket join can be applied, it could be a different story. 
>  
> Also, to support bucket joins of more than 2 tables when one table's bucket number 
> is a multiple of another's (SPARK-17570), whether the bucket join can take effect 
> depends on the result of JoinReorder. For example, for "A join B join C" with 
> bucket numbers 8, 4 and 12, the JoinReorder rule should keep the order 
> "A join B join C" so that the bucket join takes effect, instead of producing 
> "C join A join B". 
>  
> Based on the current CBO JoinReorder, there are possibly 2 parts to be changed:
>  # The CostBasedJoinReorder rule is applied in the optimizer phase, while join 
> selection is done in the planner phase and bucket join optimization in 
> EnsureRequirements, which is in the preparation phase; both come after the 
> optimizer. 
>  # The current statistics and join cost formula are based on data selectivity and 
> cardinality; we need to add statistics that represent the cost of the join method 
> (shuffle, sort, hash, etc.) and feed them into the formula used to estimate the 
> join cost. 
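> For illustration only (not part of the original report), a minimal sketch of the 
> scenario with made-up table names, assuming three tables bucketed by the join key 
> with bucket counts 8, 4 and 12:
> {code:java}
> // Hypothetical bucketed tables; dfA/dfB/dfC and the key column "k" are assumptions.
> dfA.write.bucketBy(8,  "k").sortBy("k").saveAsTable("a")
> dfB.write.bucketBy(4,  "k").sortBy("k").saveAsTable("b")
> dfC.write.bucketBy(12, "k").sortBy("k").saveAsTable("c")
>
> // If the optimizer keeps the order "a JOIN b JOIN c", the a-b join can reuse the
> // bucketing (8 is a multiple of 4, per SPARK-17570); reordering to "c JOIN a JOIN b"
> // would force extra shuffles instead.
> val joined = spark.sql("SELECT * FROM a JOIN b ON a.k = b.k JOIN c ON a.k = c.k")
> joined.explain()  // inspect whether Exchange nodes appear between the joins
> {code}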






[jira] [Resolved] (SPARK-24656) SparkML Transformers and Estimators with multiple columns

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24656.
--
Resolution: Incomplete

> SparkML Transformers and Estimators with multiple columns
> -
>
> Key: SPARK-24656
> URL: https://issues.apache.org/jira/browse/SPARK-24656
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.3.1
>Reporter: Michael Dreibelbis
>Priority: Major
>  Labels: bulk-closed
>
> Currently, SparkML Transformers and Estimators operate on single input/output 
> column pairs. This makes pipelines extremely cumbersome (as well as 
> non-performant) when transformations on multiple columns need to be made.
>  
> I am proposing to implement ParallelPipelineStage/Transformer/Estimator/Model 
> that would operate on the input columns in parallel.
>  
> {code:java}
>  // old way
> val pipeline = new Pipeline().setStages(Array(
>   new CountVectorizer().setInputCol("_1").setOutputCol("_1_cv"),
>   new CountVectorizer().setInputCol("_2").setOutputCol("_2_cv"),
>   new IDF().setInputCol("_1_cv").setOutputCol("_1_idf"),
>   new IDF().setInputCol("_2_cv").setOutputCol("_2_idf")
> ))
> // proposed way
> val pipeline2 = new Pipeline().setStages(Array(
>   new ParallelCountVectorizer().setInputCols(Array("_1", 
> "_2")).setOutputCols(Array("_1_cv", "_2_cv")),
>   new ParallelIDF().setInputCols(Array("_1_cv", 
> "_2_cv")).setOutputCols(Array("_1_idf", "_2_idf"))
> ))
> {code}






[jira] [Resolved] (SPARK-24301) Add Instrumentation test coverage

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24301.
--
Resolution: Incomplete

> Add Instrumentation test coverage
> -
>
> Key: SPARK-24301
> URL: https://issues.apache.org/jira/browse/SPARK-24301
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Priority: Minor
>  Labels: bulk-closed
>
> Spark has no test coverage for Instrumentation logging. It would be good to add 
> some instrumentation coverage tests.






[jira] [Resolved] (SPARK-23537) Logistic Regression without standardization

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23537.
--
Resolution: Incomplete

> Logistic Regression without standardization
> ---
>
> Key: SPARK-23537
> URL: https://issues.apache.org/jira/browse/SPARK-23537
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.0.2, 2.2.1
>Reporter: Jordi
>Priority: Major
>  Labels: bulk-closed
> Attachments: non-standardization.log, standardization.log
>
>
> I'm trying to train a Logistic Regression model using Spark 2.2.1. I prefer 
> not to use standardization since all my features are binary, using the 
> hashing trick (2^20 sparse vector).
> I trained two models to compare results. I expected to end up with two 
> similar models, since it seems that internally the optimizer performs 
> standardization and "de-standardization" (when it is deactivated) in order to 
> improve convergence.
> Here is the code I used:
> {code:java}
> val lr = new org.apache.spark.ml.classification.LogisticRegression()
> .setRegParam(0.05)
> .setElasticNetParam(0.0)
> .setFitIntercept(true)
> .setMaxIter(5000)
> .setStandardization(false)
> val model = lr.fit(data)
> {code}
> The results are puzzling: I end up with two significantly different models.
> *Standardization:*
> Training time: 8min.
> Iterations: 37
> Intercept: -4.386090107224499
> Max weight: 4.724752299455218
> Min weight: -3.560570478164854
> Mean weight: -0.049325201841722795
> l1 norm: 116710.39522171849
> l2 norm: 402.2581552373957
> Non zero weights: 128084
> Non zero ratio: 0.12215042114257812
> Last 10 LBFGS Val and Grad Norms:
> {code:java}
> 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
> 0.000559057
> 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
> 0.000267527
> 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
> 0.000205888
> 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
> 0.000144173
> 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 
> 0.000140296
> 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 
> 0.000122709
> 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 
> 3.08789e-05
> 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 
> 2.23806e-05
> 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 
> 1.47422e-05
> 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 
> 2.37442e-05
> {code}
> *No standardization:*
> Training time: 7h 14 min.
> Iterations: 4992
> Intercept: -4.216690468849263
> Max weight: 0.41930559767624725
> Min weight: -0.5949182537565524
> Mean weight: -1.2659769019012E-6
> l1 norm: 14.262025330648694
> l2 norm: 1.2508777025612263
> Non zero weights: 128955
> Non zero ratio: 0.12298107147216797
> Last 10 LBFGS Val and Grad Norms:
> {code:java}
> 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 
> 0.217581
> 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 
> 0.185812
> 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 
> 0.214570
> 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 
> 0.489464
> 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 
> 0.178448
> 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 
> 0.172527
> 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.38e-07) 
> 0.189389
> 18/02/28 00:29:31 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.13e-07) 
> 0.480678
> 18/02/28 00:29:36 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.75e-07) 
> 0.184529
> 18/02/28 00:29:41 INFO LBFGS: Val and Grad Norm: 0.559319 (rel: 8.90e-08) 
> 0.154329
> {code}
> Am I missing something?






[jira] [Resolved] (SPARK-25020) Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25020.
--
Resolution: Incomplete

> Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8
> --
>
> Key: SPARK-25020
> URL: https://issues.apache.org/jira/browse/SPARK-25020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: Spark Streaming
> -- Tested on 2.2 & 2.3 (more than likely affects all versions with graceful 
> shutdown) 
> Hadoop 2.8
>Reporter: Ricky Saltzer
>Priority: Major
>  Labels: bulk-closed
>
> Opening this up to give you some insight into an issue that will occur 
> when using Spark Streaming with Hadoop 2.8. 
> Hadoop 2.8 added HADOOP-12950, which adds an upper limit of a 10 second timeout 
> for its shutdown hook. From our tests, if the Spark job takes longer than 10 
> seconds to shut down gracefully, the Hadoop shutdown thread seems to trample 
> over the graceful shutdown and throw an exception like:
> {code:java}
> 18/08/03 17:21:04 ERROR Utils: Uncaught exception in thread pool-1-thread-1
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker.stop(ReceiverTracker.scala:177)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114)
> at 
> org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:682)
> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
> at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:681)
> at 
> org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:715)
> at 
> org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:599)
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> The reason I hit this issue is that we recently upgraded to EMR 5.15, 
> which has both Spark 2.3 & Hadoop 2.8. The following workaround has proven 
> successful for us (in limited testing).
> Instead of just running
> {code:java}
> ...
> ssc.start()
> ssc.awaitTermination(){code}
> We needed to do the following
> {code:java}
> ...
> ssc.start()
> sys.ShutdownHookThread {
>   ssc.stop(true, true)
> }
> ssc.awaitTermination(){code}
> As far as I can tell, there is no way to override the default {{10 second}} 
> timeout in HADOOP-12950, which is why we had to go with the workaround. 
> Note: I also verified this bug exists even with EMR 5.12.1 which runs Spark 
> 2.2.x & Hadoop 2.8. 
> Ricky
>  Epic Games






[jira] [Resolved] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24357.
--
Resolution: Incomplete

> createDataFrame in Python infers large integers as long type and then fails 
> silently when converting them
> -
>
> Key: SPARK-24357
> URL: https://issues.apache.org/jira/browse/SPARK-24357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Major
>  Labels: bulk-closed
>
> When inferring the schema type of an RDD passed to createDataFrame, PySpark 
> SQL will infer any integral type as a LongType, which is a 64-bit integer, 
> without actually checking whether the values will fit into a 64-bit slot. If 
> the values are larger than 64 bits, then when pickled and unpickled in Java, 
> Unpickler will convert them to BigIntegers. When applySchemaToPythonRDD is 
> called, it will ignore the BigInteger type and return Null. This results in 
> any large integers in the resulting DataFrame being silently converted to 
> None. This can create some very surprising and difficult to debug behavior, 
> in particular if you are not aware of this limitation. There should either be 
> a runtime error at some point in this conversion chain, or else _infer_type 
> should infer larger integers as DecimalType with appropriate precision, or as 
> BinaryType. The former would be less convenient, but the latter may be 
> problematic to implement in practice. In any case, we should stop silently 
> converting large integers to None.






[jira] [Resolved] (SPARK-15867) Use bucket files for TABLESAMPLE BUCKET

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-15867.
--
Resolution: Incomplete

> Use bucket files for TABLESAMPLE BUCKET
> ---
>
> Key: SPARK-15867
> URL: https://issues.apache.org/jira/browse/SPARK-15867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Andrew Or
>Priority: Major
>  Labels: bulk-closed
>
> {code}
> SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16)
> {code}
> In Hive, this would select the 3rd bucket out of every 16 buckets there are 
> in the table. E.g. if the table was clustered by 32 buckets then this would 
> sample the 3rd and the 19th bucket. (See 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling)
> In Spark, however, we simply sample 3/16 of the number of input rows.
> Either we shouldn't support it in Spark, or we should do it in a way that's 
> consistent with Hive.
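> For illustration (not from the original report), a tiny sketch of the Hive 
> semantics described above: with a table clustered into n buckets, {{BUCKET x OUT 
> OF y}} picks every y-th bucket starting from the x-th.
> {code:java}
> // 1-based bucket ids selected by TABLESAMPLE(BUCKET x OUT OF y) on n buckets,
> // assuming n is a multiple of y.
> def sampledBuckets(x: Int, y: Int, n: Int): Seq[Int] = (x to n by y)
>
> sampledBuckets(x = 3, y = 16, n = 32)  // Seq(3, 19), matching the example above
> {code}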






[jira] [Resolved] (SPARK-23996) Implement the optimal KLL algorithms for quantiles in streams

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23996.
--
Resolution: Incomplete

> Implement the optimal KLL algorithms for quantiles in streams
> -
>
> Key: SPARK-23996
> URL: https://issues.apache.org/jira/browse/SPARK-23996
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>  Labels: bulk-closed
>
> The current implementation for approximate quantiles - a variant of 
> Greenwald-Khanna, which I implemented - is not the best in light of recent 
> papers:
>  - it is not exactly the one from the paper, for performance reasons, but the 
> changes are not documented beyond comments in the code
>  - there are now more optimal algorithms with proven bounds (unlike q-digest, 
> the other contender at the time)
> I propose that we revisit the current implementation and look at the 
> Karnin-Lang-Liberty algorithm (KLL) for example:
> [https://arxiv.org/abs/1603.05346]
> [https://edoliberty.github.io//papers/streamingQuantiles.pdf]
> This algorithm seems to have favorable characteristics for streaming and a 
> distributed implementation, and there is a python implementation for 
> reference.
> It is a fairly standalone piece, and in that respect accessible to people who 
> don't know too much about Spark internals.
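> For illustration only, a heavily simplified sketch of the compaction idea behind 
> KLL (fixed buffer size, no error analysis); this is not the algorithm from the 
> paper, just the core trick of sorting a full buffer and promoting a random half 
> with doubled weight.
> {code:java}
> import scala.collection.mutable.ArrayBuffer
> import scala.util.Random
>
> // Each level holds items of weight 2^level; when a level fills up, it is sorted
> // and a random half (every other element) is promoted to the next level.
> class SimpleKLL(k: Int = 64, seed: Long = 0L) {
>   private val rng = new Random(seed)
>   private val levels = ArrayBuffer(ArrayBuffer.empty[Double])
>
>   def update(x: Double): Unit = { levels(0) += x; compact() }
>
>   private def compact(): Unit = {
>     var l = 0
>     while (l < levels.length) {
>       if (levels(l).length >= k) {
>         if (l + 1 == levels.length) levels += ArrayBuffer.empty[Double]
>         val sorted = levels(l).sorted
>         val offset = if (rng.nextBoolean()) 0 else 1
>         levels(l + 1) ++= sorted.indices.collect { case i if i % 2 == offset => sorted(i) }
>         levels(l).clear()
>       }
>       l += 1
>     }
>   }
>
>   // Approximate rank of x: weighted count of retained items <= x.
>   def rank(x: Double): Long =
>     levels.zipWithIndex.map { case (buf, lvl) => buf.count(_ <= x).toLong << lvl }.sum
> }
> {code}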






[jira] [Resolved] (SPARK-23368) Avoid unnecessary Exchange or Sort after projection

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23368.
--
Resolution: Incomplete

> Avoid unnecessary Exchange or Sort after projection
> ---
>
> Key: SPARK-23368
> URL: https://issues.apache.org/jira/browse/SPARK-23368
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wei Xue
>Priority: Minor
>  Labels: bulk-closed
>
> After column rename projection, the ProjectExec's outputOrdering and 
> outputPartitioning should reflect the projected columns as well. For example,
> {code:java}
> SELECT b1
> FROM (
> SELECT a a1, b b1
> FROM testData2
> ORDER BY a
> )
> ORDER BY a1{code}
> The inner query is ordered on a1 as well. If we had a rule to eliminate Sort 
> on sorted result, together with this fix, the order-by in the outer query 
> could have been optimized out.
>  
> Similarly, the below query
> {code:java}
> SELECT *
> FROM (
> SELECT t1.a a1, t2.a a2, t1.b b1, t2.b b2
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> )
> JOIN testData2 t3
> ON a1 = t3.a{code}
> is equivalent to
> {code:java}
> SELECT *
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> JOIN testData2 t3
> ON t1.a = t3.a{code}
> , so the unnecessary sorting and hash-partitioning that have been optimized out 
> for the second query should be eliminated in the first query as well.
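> Conceptually (this is a toy model, not Spark's catalyst classes), propagating the 
> ordering through a rename is just rewriting the ordering expressions via the alias 
> map:
> {code:java}
> // Toy model: a projection renames child columns, and the child is sorted by "a".
> val aliases = Map("a" -> "a1", "b" -> "b1")   // child column -> projected name
> val childOrdering = Seq("a")
>
> // Rewriting the ordering through the alias map yields the projection's ordering,
> // so an outer ORDER BY a1 is already satisfied and needs no extra Sort.
> val outputOrdering = childOrdering.flatMap(aliases.get)   // Seq("a1")
> {code}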






[jira] [Resolved] (SPARK-23730) Save and expose "in bag" tracking for random forest model

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23730.
--
Resolution: Incomplete

> Save and expose "in bag" tracking for random forest model
> -
>
> Key: SPARK-23730
> URL: https://issues.apache.org/jira/browse/SPARK-23730
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Julian King
>Priority: Minor
>  Labels: bulk-closed
>
> In a random forest model, it is often useful to be able to keep track of 
> which samples ended up in each of the bootstrap replications (and how many 
> times this happened). For instance, in the R randomForest package this is 
> accomplished through the option keep.inbag=TRUE.
> Similar functionality in Spark ML's random forest would be helpful.
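> For illustration only (this is not Spark's internal code, and commons-math3 on the 
> classpath is an assumption), bagging with explicit sample counts shows the kind of 
> "in bag" record being asked for; Poisson(1) draws approximate sampling with 
> replacement:
> {code:java}
> import org.apache.commons.math3.distribution.PoissonDistribution
>
> // counts(i)(t) = how many times row i appears in tree t's bootstrap sample;
> // a value of 0 means row i is "out of bag" for tree t.
> def inBagCounts(numRows: Int, numTrees: Int, seed: Long = 42L): Array[Array[Int]] = {
>   val poisson = new PoissonDistribution(1.0)
>   poisson.reseedRandomGenerator(seed)
>   Array.fill(numRows)(Array.fill(numTrees)(poisson.sample()))
> }
>
> val counts = inBagCounts(numRows = 5, numTrees = 3)
> {code}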






[jira] [Resolved] (SPARK-24473) It is no need to clip the predictive value by maxValue and minValue when computing gradient on SVDplusplus model

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24473.
--
Resolution: Incomplete

> It is no need to clip the predictive value by maxValue and minValue when 
> computing gradient on SVDplusplus model
> 
>
> Key: SPARK-24473
> URL: https://issues.apache.org/jira/browse/SPARK-24473
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.0
>Reporter: caijianming
>Priority: Major
>  Labels: bulk-closed
>
> I think there is no need to clip the predicted value. Clipping changes the convex 
> loss function into a non-convex one, which might have a bad influence on 
> convergence.
> Other well-known recommender systems and the original paper also do not include 
> this step.






[jira] [Resolved] (SPARK-5572) LDA improvement listing

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-5572.
-
Resolution: Incomplete

> LDA improvement listing
> ---
>
> Key: SPARK-5572
> URL: https://issues.apache.org/jira/browse/SPARK-5572
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Major
>  Labels: bulk-closed
>
> This is an umbrella JIRA for listing planned improvements to Latent Dirichlet 
> Allocation (LDA).






[jira] [Resolved] (SPARK-22415) lint-r fails if lint-r.R installs any new packages

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22415.
--
Resolution: Incomplete

> lint-r fails if lint-r.R installs any new packages
> --
>
> Key: SPARK-22415
> URL: https://issues.apache.org/jira/browse/SPARK-22415
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
> Environment: OSX 10.12.6 R 3.4.2
>Reporter: Joel Croteau
>Priority: Minor
>  Labels: bulk-closed
>
> The dev/lint-r script checks for lint failures by seeing if anything is 
> output to stdout by lint-r.R. Since package installations will often produce 
> output to stdout, this will cause it to report a failure even if lint-r.R 
> succeeded. This would also mean that if there were a failure further up in 
> lint-r.R that output a message to stderr, it would not detect this and think 
> lint-r.R had succeeded. It would be preferable for lint-r.R to output lints 
> to stderr and for lint-r to check that and/or for lint-r.R to return a 
> failure code if there are any lints, which lint-r could check for. I will 
> write a patch to do this.






[jira] [Resolved] (SPARK-23987) Unused mailing lists

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23987.
--
Resolution: Incomplete

> Unused mailing lists
> 
>
> Key: SPARK-23987
> URL: https://issues.apache.org/jira/browse/SPARK-23987
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Sebb
>Priority: Major
>  Labels: bulk-closed
>
> The following mailing lists were set up in Jan 2015 but have not been used, 
> and don't appear to be mentioned on the website:
> ml@
> sql@
> streaming@
> If they are not needed, please file a JIRA request with INFRA to close them 
> down.






[jira] [Resolved] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24974.
--
Resolution: Incomplete

> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>  Labels: bulk-closed
>
> SharedInMemoryCache holds the FileStatus of every file, no matter whether you 
> specify partition columns or not. This causes long load times for queries that 
> use only a couple of partitions, because Spark loads the file paths for files 
> from all partitions.
> I partitioned the files by *report_date* and *type*, and I have a directory 
> structure like 
> {code:java}
> /custom_path/report_date=2018-07-24/type=A/file_1.parquet
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query I need to load only files of type *A*, which is just a couple of 
> files. But Spark loads all 19K files from all partitions into 
> SharedInMemoryCache, which takes about 60 secs, and only after that discards the 
> unused partitions. 
>  
> This could be related to [https://jira.apache.org/jira/browse/SPARK-17994] 
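> As a workaround sketch (it does not fix the caching itself; the paths follow the 
> example layout above), pointing the reader directly at the partition directory 
> keeps the file listing small:
> {code:java}
> // Lists only the files under the selected partition directory, so the in-memory
> // file index never sees the other ~19K files.
> val count = spark.read
>   .option("basePath", "/custom_path")   // keep report_date/type as columns
>   .parquet("/custom_path/report_date=2018-07-24/type=A")
>   .count()
> {code}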






[jira] [Resolved] (SPARK-24955) spark continuing to execute on a task despite not reading all data from a downed machine

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24955.
--
Resolution: Incomplete

> spark continuing to execute on a task despite not reading all data from a 
> downed machine
> 
>
> Key: SPARK-24955
> URL: https://issues.apache.org/jira/browse/SPARK-24955
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Shuffle
>Affects Versions: 2.3.0
>Reporter: San Tung
>Priority: Major
>  Labels: bulk-closed
>
> We've recently run into a few instances where a downed node has led to 
> incomplete data, causing correctness issues, which we can reproduce some of 
> the time.
> *Setup:*
>  - we're currently on spark 2.3.0
>  - we allow retries on failed tasks and stages
>  - we use PySpark to perform these operations
> *Stages:*
> Simplistically, the job does the following:
>  - Stage 1/2: computes a number of `(sha256 hash, 0, 1)` partitioned into 
> 65536 partitions
>  - Stage 3/4: computes a number of `(sha256 hash, 1, 0)` partitioned into 
> 6408 partitions (one hash may exist in multiple partitions)
>  - Stage 5:
>  - repartitions stage 2 and stage 4 by the first 2 bytes of each hash, and 
> find which ones are not in common (stage 2 hashes - stage 4 hashes).
>  - store this partition into a persistent data source.
> *Failure Scenario:*
>  - We take out one of the machines (do a forced shutdown, for example)
>  - For some tasks, stage 5 will die immediately with one of the following:
>  ** `ExecutorLostFailure (executor 24 exited caused by one of the running 
> tasks) Reason: worker lost`
>  ** `FetchFailed(BlockManagerId(24, [redacted], 36829, None), shuffleId=2, 
> mapId=14377, reduceId=48402, message=`
>  - these tasks are reused to recalculate the stage 1-2 and 3-4 outputs that were 
> missing on the downed node, which Spark recalculates correctly.
>  - However, some tasks still continue executing from Stage 5, seemingly 
> missing stage 4 data, dumping incorrect data to the stage 5 data source. We 
> noticed the subtract operation taking ~1-2 minutes after the machine goes 
> down and storing a lot more data than usual (which on inspection is wrong).
>  - we've seen this happen with slightly different execution plans too which 
> don't involve or-ing, but end up being some variant of missing some stage 4 
> data.
> However, we cannot reproduce this consistently - sometimes all tasks fail 
> gracefully. When the downed node is handled correctly, all these tasks fail and 
> stages 1-2/3-4 are re-run. Note that this job produces the correct results if 
> the machines stay alive!
> We were wondering if a machine going down can result in a state where a task 
> could keep executing even though not all data has been fetched, which would give 
> us incorrect results (or if there is a setting that allows this - we tried 
> scanning the Spark configs up and down). This seems similar to 
> https://issues.apache.org/jira/browse/SPARK-24160 (maybe we get an empty 
> packet?), but it doesn't look like that one was meant to explicitly resolve any 
> known bug.






[jira] [Resolved] (SPARK-24550) Add support for Kubernetes specific metrics

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24550.
--
Resolution: Incomplete

> Add support for Kubernetes specific metrics
> ---
>
> Key: SPARK-24550
> URL: https://issues.apache.org/jira/browse/SPARK-24550
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Priority: Major
>  Labels: bulk-closed
>
> Spark by default offers a specific set of metrics for monitoring. It is 
> possible to add platform specific metrics or enhance the existing ones if it 
> makes sense. Here is an example for 
> [mesos|https://github.com/apache/spark/pull/21516]. Some examples of metrics 
> that could be added are: number of blacklisted nodes, utilization of 
> executors per node, evicted pods, driver restarts, back-off time when 
> requesting executors, etc.
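> For illustration, a minimal sketch of the shape such a source could take, built on 
> the Dropwizard registry that Spark's metrics system uses; the trait below is a 
> stand-in for Spark's internal {{Source}} interface, and the gauge names and 
> supplier functions are made up:
> {code:java}
> import com.codahale.metrics.{Gauge, MetricRegistry}
>
> // Stand-in with the same two members as Spark's (internal) metrics Source trait.
> trait MetricsSourceLike {
>   def sourceName: String
>   def metricRegistry: MetricRegistry
> }
>
> // Hypothetical Kubernetes-specific gauges; the suppliers are placeholders.
> class KubernetesMetricsSource(evictedPods: () => Long, driverRestarts: () => Long)
>     extends MetricsSourceLike {
>   override val sourceName = "kubernetes"
>   override val metricRegistry = new MetricRegistry()
>
>   metricRegistry.register("evictedPods",
>     new Gauge[Long] { override def getValue: Long = evictedPods() })
>   metricRegistry.register("driverRestarts",
>     new Gauge[Long] { override def getValue: Long = driverRestarts() })
> }
> {code}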






[jira] [Resolved] (SPARK-25585) Allow users to specify scale of result in Decimal arithmetic

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25585.
--
Resolution: Incomplete

> Allow users to specify scale of result in Decimal arithmetic
> 
>
> Key: SPARK-25585
> URL: https://issues.apache.org/jira/browse/SPARK-25585
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Benito Kestelman
>Priority: Major
>  Labels: bulk-closed, decimal, usability
>
> The current behavior of Spark Decimal during arithmetic makes it difficult 
> for users to achieve their desired level of precision. Numeric literals are 
> automatically cast to unlimited precision during arithmetic, but the final 
> result is cast down depending on the precision and scale of the operands, 
> according to MS SQL rules (discussed in other JIRA's). This final cast can 
> cause substantial loss of scale.
> For example:
> {noformat}
> scala> spark.sql("select 1.3/3.41").show(false)
> ++
> |(CAST(1.3 AS DECIMAL(3,2)) / CAST(3.41 AS DECIMAL(3,2)))|
> ++
> |0.381232    |
> ++{noformat}
> To get higher scale in the result, a user must cast the operands to higher 
> scale:
> {noformat}
> scala> spark.sql("select cast(1.3 as decimal(5,4))/cast(3.41 as 
> decimal(5,4))").show(false)
> ++
> |(CAST(1.3 AS DECIMAL(5,4)) / CAST(3.41 AS DECIMAL(5,4)))|
> ++
> |0.3812316716    |
> ++
> scala> spark.sql("select cast(1.3 as decimal(10,9))/cast(3.41 as 
> decimal(10,9))").show(false)
> +--+
> |(CAST(1.3 AS DECIMAL(10,9)) / CAST(3.41 AS DECIMAL(10,9)))|
> +--+
> |0.38123167155425219941    |
> +--+{noformat}
> But if the user casts too high, the result's scale decreases. 
> {noformat}
> scala> spark.sql("select cast(1.3 as decimal(25,24))/cast(3.41 as 
> decimal(25,24))").show(false)
> ++
> |(CAST(1.3 AS DECIMAL(25,24)) / CAST(3.41 AS DECIMAL(25,24)))|
> ++
> |0.3812316715543 |
> ++{noformat}
> Thus, the user has no way of knowing how to cast to get the scale he wants. 
> This problem is even harder to deal with when using variables instead of 
> literals. 
> The user should be able to explicitly set the desired scale of the result. 
> MySQL offers this capability in the form of a system variable called 
> "div_precision_increment."
> From the MySQL docs: "In division performed with 
> [{{/}}|https://dev.mysql.com/doc/refman/8.0/en/arithmetic-functions.html#operator_divide],
>  the scale of the result when using two exact-value operands is the scale of 
> the first operand plus the value of the 
> [{{div_precision_increment}}|https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_div_precision_increment]
>  system variable (which is 4 by default). For example, the result of the 
> expression {{5.05 / 0.014}} has a scale of six decimal places 
> ({{360.714286}})."
> {noformat}
> mysql> SELECT 1/7;
> ++
> | 1/7    |
> ++
> | 0.1429 |
> ++
> mysql> SET div_precision_increment = 12;
> mysql> SELECT 1/7;
> ++
> | 1/7    |
> ++
> | 0.142857142857 |
> ++{noformat}
> This gives the user full control of the result's scale after arithmetic and 
> obviates the need for casting all over the place.
> Since Spark 2.3, we already have DecimalType.MINIMUM_ADJUSTED_SCALE, which is 
> similar to div_precision_increment. It just needs to be made modifiable by 
> the user. 
>  
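> For illustration, the MS SQL-style division rule referred to above can be written 
> down directly; this little sketch (not Spark code) reproduces the scales seen in 
> the examples, before the precision-38 adjustment that trims overly large scales:
> {code:java}
> // Result type of Decimal(p1,s1) / Decimal(p2,s2): scale = max(6, s1 + p2 + 1),
> // precision = p1 - s1 + s2 + scale. If precision > 38, the scale is cut back to
> // max(38 - (precision - scale), 6), which is what loses digits in the last example.
> def divisionResult(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
>   val scale = math.max(6, s1 + p2 + 1)
>   val precision = p1 - s1 + s2 + scale
>   (precision, scale)
> }
>
> divisionResult(3, 2, 3, 2)      // (9, 6)   -> the 6-digit result 0.381232
> divisionResult(10, 9, 10, 9)    // (30, 20) -> the 20-digit result
> divisionResult(25, 24, 25, 24)  // (75, 50) -> over 38, so the scale is trimmed to 13
> {code}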






[jira] [Resolved] (SPARK-21406) Add logLikelihood to GLR families

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21406.
--
Resolution: Incomplete

> Add logLikelihood to GLR families
> -
>
> Key: SPARK-21406
> URL: https://issues.apache.org/jira/browse/SPARK-21406
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Priority: Minor
>  Labels: bulk-closed
>
> To be able to implement the typical gradient-based aggregator for GLR, we'd 
> need to add a {{logLikelihood(y: Double, mu: Double, weight: Double)}} method 
> to the GLR {{Family}} class. 
> One possible hiccup - Tweedie family log likelihood is not computationally 
> feasible [link| 
> http://support.sas.com/documentation/cdl/en/stathpug/67524/HTML/default/viewer.htm#stathpug_hpgenselect_details16.htm].
>  H2O gets around this by using the deviance instead. We could leave it 
> unimplemented initially.
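> For illustration only (free-standing functions with the suggested signature, not 
> the actual {{Family}} API; the weight is applied as a simple case multiplier and 
> {{phi}} stands for the dispersion parameter), two of the easier families could 
> look like:
> {code:java}
> import scala.math.{log, Pi}
> import org.apache.commons.math3.special.Gamma.logGamma
>
> // Gaussian: -0.5 * (log(2*pi*phi) + (y - mu)^2 / phi), weighted per instance.
> def gaussianLogLikelihood(y: Double, mu: Double, weight: Double, phi: Double): Double =
>   -0.5 * weight * (log(2 * Pi * phi) + (y - mu) * (y - mu) / phi)
>
> // Poisson: y*log(mu) - mu - log(y!), with log(y!) = logGamma(y + 1).
> def poissonLogLikelihood(y: Double, mu: Double, weight: Double): Double =
>   weight * (y * log(mu) - mu - logGamma(y + 1.0))
> {code}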






[jira] [Resolved] (SPARK-12878) Dataframe fails with nested User Defined Types

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-12878.
--
Resolution: Incomplete

> Dataframe fails with nested User Defined Types
> --
>
> Key: SPARK-12878
> URL: https://issues.apache.org/jira/browse/SPARK-12878
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Joao Duarte
>Priority: Major
>  Labels: bulk-closed
>
> Spark 1.6.0 crashes when using nested User Defined Types in a Dataframe. 
> In version 1.5.2 the code below worked just fine:
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
> import org.apache.spark.sql.types._
> @SQLUserDefinedType(udt = classOf[AUDT])
> case class A(list:Seq[B])
> class AUDT extends UserDefinedType[A] {
>   override def sqlType: DataType = StructType(Seq(StructField("list", 
> ArrayType(BUDT, containsNull = false), nullable = true)))
>   override def userClass: Class[A] = classOf[A]
>   override def serialize(obj: Any): Any = obj match {
> case A(list) =>
>   val row = new GenericMutableRow(1)
>   row.update(0, new 
> GenericArrayData(list.map(_.asInstanceOf[Any]).toArray))
>   row
>   }
>   override def deserialize(datum: Any): A = {
> datum match {
>   case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq)
> }
>   }
> }
> object AUDT extends AUDT
> @SQLUserDefinedType(udt = classOf[BUDT])
> case class B(text:Int)
> class BUDT extends UserDefinedType[B] {
>   override def sqlType: DataType = StructType(Seq(StructField("num", 
> IntegerType, nullable = false)))
>   override def userClass: Class[B] = classOf[B]
>   override def serialize(obj: Any): Any = obj match {
> case B(text) =>
>   val row = new GenericMutableRow(1)
>   row.setInt(0, text)
>   row
>   }
>   override def deserialize(datum: Any): B = {
> datum match {  case row: InternalRow => new B(row.getInt(0))  }
>   }
> }
> object BUDT extends BUDT
> object Test {
>   def main(args:Array[String]) = {
> val col = Seq(new A(Seq(new B(1), new B(2))),
>   new A(Seq(new B(3), new B(4))))
> val sc = new SparkContext(new 
> SparkConf().setMaster("local[1]").setAppName("TestSpark"))
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> val df = sc.parallelize(1 to 2 zip col).toDF("id","b")
> df.select("b").show()
> df.collect().foreach(println)
>   }
> }
> {code}
> In the new version (1.6.0) I needed to include the following import:
> `import org.apache.spark.sql.catalyst.expressions.GenericMutableRow`
> However, Spark crashes at runtime:
> {code}
> 16/01/18 14:36:22 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to 
> org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:248)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
>   at 
> org.

[jira] [Resolved] (SPARK-9636) Treat $SPARK_HOME as write-only

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-9636.
-
Resolution: Incomplete

> Treat $SPARK_HOME as write-only
> ---
>
> Key: SPARK-9636
> URL: https://issues.apache.org/jira/browse/SPARK-9636
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.4.1
> Environment: Linux
>Reporter: Philipp Angerer
>Priority: Minor
>  Labels: bulk-closed
>
> When starting the Spark scripts as a user, with Spark installed in a directory the 
> user has no write permissions on, many things work fine, except for the logs 
> (e.g. for {{start-master.sh}}).
> Logs are by default written to {{$SPARK_LOG_DIR}} or (if unset) to 
> {{$SPARK_HOME/logs}}.
> If installed in this way, Spark should, instead of throwing an error, write logs 
> to {{/var/log/spark/}}. That's easy to fix by simply testing a few log dirs 
> in sequence for writability before trying to use one. I suggest using 
> {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} 
> → {{$SPARK_HOME/logs/}}
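> The suggested fallback order, sketched as selection logic only (the real change 
> would live in the launcher scripts; directory creation is ignored here):
> {code:java}
> import java.nio.file.{Files, Paths}
>
> // First writable candidate in the order suggested above:
> // $SPARK_LOG_DIR -> /var/log/spark -> ~/.cache/spark-logs -> $SPARK_HOME/logs
> val candidates = Seq(
>   sys.env.get("SPARK_LOG_DIR"),
>   Some("/var/log/spark"),
>   Some(sys.props("user.home") + "/.cache/spark-logs"),
>   sys.env.get("SPARK_HOME").map(_ + "/logs")
> ).flatten
>
> val logDir = candidates.find { dir =>
>   val p = Paths.get(dir)
>   Files.isDirectory(p) && Files.isWritable(p)
> }
> {code}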






[jira] [Resolved] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-8614.
-
Resolution: Incomplete

> Row order preservation for operations on MLlib IndexedRowMatrix
> ---
>
> Key: SPARK-8614
> URL: https://issues.apache.org/jira/browse/SPARK-8614
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Jan Luts
>Priority: Major
>  Labels: bulk-closed
>
> In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply indices are 
> dropped before calling the methods from RowMatrix. For example for 
> IndexedRowMatrix.computeSVD:
>val svd = toRowMatrix().computeSVD(k, computeU, rCond)
> and for IndexedRowMatrix.multiply:
>val mat = toRowMatrix().multiply(B).
> After computing these results, they are zipped with the original indices, 
> e.g. for IndexedRowMatrix.computeSVD
>val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
>   IndexedRow(i, v)
>}
> and for IndexedRowMatrix.multiply:
>
>val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) =>
>   IndexedRow(i, v)
>}
> I have experienced that for IndexedRowMatrix.computeSVD().U and 
> IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply) row 
> indices can get mixed (when running Spark jobs with multiple 
> executors/machines): i.e. the vectors and indices of the result do not seem 
> to correspond anymore. 
> To me it looks like this is caused by zipping RDDs that have a different 
> ordering?
> For the IndexedRowMatrix.multiply I have observed that ordering within 
> partitions is preserved, but that it seems to get mixed up between 
> partitions. For example, for:
> part1Index1 part1Vector1
> part1Index2 part1Vector2
> part2Index1 part2Vector1
> part2Index2 part2Vector2
> I got:
> part2Index1 part1Vector1
> part2Index2 part1Vector2
> part1Index1 part2Vector1
> part1Index2 part2Vector2
> Another observation is that the mapPartitions in RowMatrix.multiply :
> val AB = rows.mapPartitions { iter =>
> had an "preservesPartitioning = true" argument in version 1.0, but this is no 
> longer there.
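
A small illustration of why the current pattern is fragile: {{zip}} pairs elements purely by position, so it is only safe when both RDDs have identical partitioning and per-partition order. Carrying the index through the computation as a key side-steps the problem. The helper names below are illustrative sketches, not MLlib code:

{code:scala}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.rdd.RDD

// Order-sensitive: correct only if `indices` and `vectors` are partitioned
// and ordered identically, which a shuffle can break.
def reattachByZip(indices: RDD[Long], vectors: RDD[Vector]): RDD[IndexedRow] =
  indices.zip(vectors).map { case (i, v) => IndexedRow(i, v) }

// Order-insensitive: the index travels with its vector as a key,
// so shuffles cannot mismatch them.
def reattachByKey(indexedVectors: RDD[(Long, Vector)]): RDD[IndexedRow] =
  indexedVectors.map { case (i, v) => IndexedRow(i, v) }
{code}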



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24585) Adding ability to audit file system before and after test to ensure all files are cleaned up.

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24585.
--
Resolution: Incomplete

> Adding ability to audit file system before and after test to ensure all files 
> are cleaned up.
> -
>
> Key: SPARK-24585
> URL: https://issues.apache.org/jira/browse/SPARK-24585
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 2.3.1
>Reporter: David Lewis
>Priority: Minor
>  Labels: bulk-closed
>
> Some Spark tests use temporary files and folders on the file system. This 
> proposal is to audit the file system before and after each test to ensure that 
> all files created are cleaned up.
> Concretely, the proposal is to add two flags to SparkFunSuite: one to enable the 
> auditing, which will log a warning if there is a discrepancy, and one to 
> throw an exception if there is a discrepancy (see the sketch below).
> This is intended to help prevent file leakage in Spark.
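
A rough sketch of what such an audit could look like, assuming a helper wrapped around the test body; {{withFileAudit}} and its parameters are made-up names, not the actual SparkFunSuite flags:

{code:scala}
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

// Snapshot every path currently under `root`.
def snapshot(root: Path): Set[Path] =
  Files.walk(root).iterator().asScala.toSet

// Run `body`, then compare the file system before and after; warn or fail on leaked files.
def withFileAudit[T](root: String, failOnLeak: Boolean)(body: => T): T = {
  val dir = Paths.get(root)
  val before = snapshot(dir)
  try body
  finally {
    val leaked = snapshot(dir) -- before
    if (leaked.nonEmpty) {
      val msg = s"Test leaked ${leaked.size} file(s): ${leaked.mkString(", ")}"
      if (failOnLeak) throw new IllegalStateException(msg) else println(s"WARNING: $msg")
    }
  }
}
{code}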



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18245) Improving support for bucketed table

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-18245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-18245.
--
Resolution: Incomplete

> Improving support for bucketed table
> 
>
> Key: SPARK-18245
> URL: https://issues.apache.org/jira/browse/SPARK-18245
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>  Labels: bulk-closed
>
> This is an umbrella ticket for improving various execution planning for 
> bucketed tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24081) Spark SQL drops the table while writing into table in "overwrite" mode.

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24081.
--
Resolution: Incomplete

> Spark SQL drops the table  while writing into table in "overwrite" mode.
> 
>
> Key: SPARK-24081
> URL: https://issues.apache.org/jira/browse/SPARK-24081
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ashish
>Priority: Major
>  Labels: bulk-closed
>
> I am reading data from a table, modifying the data, and writing it back to the 
> same table in overwrite mode; doing so deletes all the records.
> Expectation: the table is updated with the modified data. A workaround sketch 
> follows below.
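
The usual cause is that the overwrite truncates the target table before the lazy read of that same table has been materialized. A common workaround, sketched below with made-up table and column names, is to materialize the modified data somewhere else first:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Some modification of the source data (column name is an example).
val updated = spark.table("db.target_table")
  .withColumn("amount", col("amount") * 2)

// Materialize to a staging table first, then overwrite the original from the staging copy.
updated.write.mode("overwrite").saveAsTable("db.target_table_staging")
spark.table("db.target_table_staging")
  .write.mode("overwrite").saveAsTable("db.target_table")
{code}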



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-15690.
--
Resolution: Incomplete

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>Priority: Major
>  Labels: bulk-closed
>
> Spark's current shuffle implementation sorts all intermediate data by 
> partition id and then writes the data to disk. On clusters this is not a big 
> bottleneck, because the network throughput on commodity hardware tends to be low. 
> However, an increasing number of Spark users are using the system to process data 
> on a single node. When operating on a single node against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use an in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fall back to disk 
> if the data does not fit in memory. Given that the number of partitions is 
> usually small (say less than 256), it would require only a single pass to do the 
> radix sort with pretty decent CPU efficiency (see the sketch below).
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process) and aims to actually 
> productionize this code.
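
A toy illustration of the single-pass idea (not Spark's shuffle code): with a small, known number of partitions, records can be bucketed by partition id with one counting-sort pass over an in-memory buffer:

{code:scala}
// Stable counting sort of (partitionId, payload) records; one pass to count,
// one pass to place, no comparisons.
def countingSortByPartition(
    records: Array[(Int, Array[Byte])],
    numPartitions: Int): Array[(Int, Array[Byte])] = {
  val counts = new Array[Int](numPartitions + 1)
  records.foreach { case (pid, _) => counts(pid + 1) += 1 }
  for (i <- 1 to numPartitions) counts(i) += counts(i - 1) // prefix sums = bucket start offsets

  val out = new Array[(Int, Array[Byte])](records.length)
  val cursor = counts.clone()
  records.foreach { rec =>
    out(cursor(rec._1)) = rec
    cursor(rec._1) += 1
  }
  out
}
{code}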



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24745) Map function does not keep rdd name

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24745.
--
Resolution: Incomplete

> Map function does not keep rdd name 
> 
>
> Key: SPARK-24745
> URL: https://issues.apache.org/jira/browse/SPARK-24745
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Igor Pergenitsa
>Priority: Minor
>  Labels: bulk-closed
>
> This snippet
> {code:scala}
> val namedRdd = sparkContext.makeRDD(List("abc", "123")).setName("named_rdd")
> println(namedRdd.name)
> val mappedRdd = namedRdd.map(_.length)
> println(mappedRdd.name){code}
> outputs:
> named_rdd
> null
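
Each transformation produces a new RDD whose name defaults to null, so the output above is the current expected behaviour; if the name should survive, it has to be carried over explicitly. A tiny helper (illustrative only, not a Spark API) could do that:

{code:scala}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Apply `f` and copy the parent's name onto the resulting RDD, if one was set.
def mapKeepingName[T, U: ClassTag](rdd: RDD[T])(f: T => U): RDD[U] = {
  val mapped = rdd.map(f)
  if (rdd.name != null) mapped.setName(rdd.name) else mapped
}
{code}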



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22731) Add a test for ROWID type to OracleIntegrationSuite

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22731.
--
Resolution: Incomplete

> Add a test for ROWID type to OracleIntegrationSuite
> ---
>
> Key: SPARK-22731
> URL: https://issues.apache.org/jira/browse/SPARK-22731
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: bulk-closed
>
> We need to add a test case to OracleIntegrationSuite for checking whether the 
> current support of ROWID type works well for Oracle. If not, we also need a 
> fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24964) Please add OWASP Dependency Check to all component builds (pom.xml)

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24964.
--
Resolution: Incomplete

>  Please add OWASP Dependency Check to all component builds (pom.xml)
> --
>
> Key: SPARK-24964
> URL: https://issues.apache.org/jira/browse/SPARK-24964
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, MLlib, Spark Core, SparkR
>Affects Versions: 2.3.1
> Environment: All development, build, test, environments.
> ~/workspace/spark-2.3.1/pom.xml
> ~/workspace/spark-2.3.1/assembly/pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/pom.xml
> ~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-common/pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/pom.xml
> ~/workspace/spark-2.3.1/common/network-yarn/pom.xml
> ~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/sketch/pom.xml
> ~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/tags/pom.xml
> ~/workspace/spark-2.3.1/common/unsafe/pom.xml
> ~/workspace/spark-2.3.1/core/pom.xml
> ~/workspace/spark-2.3.1/examples/pom.xml
> ~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml
> ~/workspace/spark-2.3.1/external/flume/pom.xml
> ~/workspace/spark-2.3.1/external/flume-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/flume-sink/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml
> ~/workspace/spark-2.3.1/graphx/pom.xml
> ~/workspace/spark-2.3.1/hadoop-cloud/pom.xml
> ~/workspace/spark-2.3.1/launcher/pom.xml
> ~/workspace/spark-2.3.1/mllib/pom.xml
> ~/workspace/spark-2.3.1/mllib-local/pom.xml
> ~/workspace/spark-2.3.1/repl/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml
> ~/workspace/spark-2.3.1/sql/catalyst/pom.xml
> ~/workspace/spark-2.3.1/sql/core/pom.xml
> ~/workspace/spark-2.3.1/sql/hive/pom.xml
> ~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml
> ~/workspace/spark-2.3.1/streaming/pom.xml
> ~/workspace/spark-2.3.1/tools/pom.xml
>Reporter: Albert Baker
>Priority: Major
>  Labels: build, bulk-closed, easy-fix, security
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> OWASP Dependency Check makes an outbound REST call to MITRE Common Vulnerabilities & 
> Exposures (CVE) to perform a lookup for each dependent .jar and list any/all 
> known vulnerabilities for each jar. This step is needed because a manual 
> MITRE CVE lookup/check on the main component does not include checking for 
> vulnerabilities in dependent libraries.
> OWASP Dependency Check 
> (https://www.owasp.org/index.php/OWASP_Dependency_Check) has plug-ins for most 
> Java build types (ant, maven, ivy, gradle). Also, add the appropriate 
> command to the nightly build to generate a report of all known 
> vulnerabilities in any/all third-party libraries/dependencies that get pulled 
> in. Example: mvn -Powasp -Dtest=false -DfailIfNoTests=false clean aggregate
> Generating this report nightly/weekly will help inform the project's 
> development team if any dependent libraries have a reported known 
> vulnerability. Project teams that keep up with removing vulnerabilities on a 
> weekly basis will help protect businesses that rely on these open source 
> components.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24122) Allow automatic driver restarts on K8s

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24122.
--
Resolution: Incomplete

> Allow automatic driver restarts on K8s
> --
>
> Key: SPARK-24122
> URL: https://issues.apache.org/jira/browse/SPARK-24122
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>  Labels: bulk-closed
>
> [~foxish]
> Right now SparkSubmit creates the driver as a bare pod, rather than a managed 
> controller like a Deployment or a StatefulSet. This means there is no way to 
> guarantee automatic restarts, eg in case a node has an issue. Note Pod 
> RestartPolicy does not apply if a node fails. A StatefulSet would allow us to 
> guarantee that, and keep the ability for executors to find the driver using 
> DNS.
> This is particularly helpful for long-running streaming workloads, where we 
> currently use {{yarn.resourcemanager.am.max-attempts}} with YARN. I can 
> confirm that Spark Streaming and Structured Streaming applications can be 
> made to recover from such a restart, with the help of checkpointing. The 
> executors will have to be started again by the driver, but this should not be 
> a problem.
> For batch processing, we could alternatively use Kubernetes {{Job}} objects, 
> which restart pods on failure but not success. For example, note the 
> semantics provided by the {{kubectl run}} 
> [command|https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run]
>  * {{--restart=Never}}: bare Pod
>  * {{--restart=Always}}: Deployment
>  * {{--restart=OnFailure}}: Job
> https://github.com/apache-spark-on-k8s/spark/issues/288



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25311) `SPARK_LOCAL_HOSTNAME` does not support IPv6 when doing host checking

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25311.
--
Resolution: Incomplete

> `SPARK_LOCAL_HOSTNAME` does not support IPv6 when doing host checking
> ---
>
> Key: SPARK-25311
> URL: https://issues.apache.org/jira/browse/SPARK-25311
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.2.2
>Reporter: Xiaochen Ouyang
>Priority: Major
>  Labels: bulk-closed
>
> IPv4 addresses can pass the following check:
> {code:java}
>   def checkHost(host: String, message: String = "") {
> assert(host.indexOf(':') == -1, message)
>   }
> {code}
> But the check fails for IPv6 literals; a sketch of an IPv6-tolerant check follows below.
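
A sketch of an IPv6-tolerant variant, assuming IPv6 literals are passed in the bracketed {{[::1]}} form (this is an illustration, not the actual Utils.checkHost change):

{code:scala}
// Accept plain hostnames/IPv4 addresses (no ':') and bracketed IPv6 literals without a port.
def checkHost(host: String, message: String = ""): Unit = {
  val isBracketedIpv6 = host.startsWith("[") && host.endsWith("]")
  assert(isBracketedIpv6 || host.indexOf(':') == -1,
    s"Expected a hostname or bracketed IPv6 address without a port, got '$host'. $message")
}
{code}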



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24910) Spark Bloom Filter Closure Serialization improvement for very high volume of Data

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24910.
--
Resolution: Incomplete

> Spark Bloom Filter Closure Serialization improvement for very high volume of 
> Data
> -
>
> Key: SPARK-24910
> URL: https://issues.apache.org/jira/browse/SPARK-24910
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: Himangshu Ranjan Borah
>Priority: Minor
>  Labels: bulk-closed
>
> I am proposing an improvement to the Bloom Filter Generation logic being used 
> in the DataFrameStatFunctions' Bloom Filter API using mapPartitions() instead 
> of aggregate() to avoid closure serialization which fails for huge BitArrays.
> Spark's stat functions' bloom filter implementation uses 
> aggregate/treeAggregate operations, which use a closure with a dependency on 
> the bloom filter created in the driver. Since Spark hard-codes the 
> closure serializer to the Java serializer, it fails during closure cleanup for 
> very large bloom filters (typically with num items ~ billions and fpp ~ 0.001). 
> The Kryo serializer works fine at that scale, but there were issues using Kryo 
> for closure serialization, due to which Spark 2.0 hardcoded it to Java. 
> The call stack we typically get looks like:
> java.lang.OutOfMemoryError
>   at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>   at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
>   at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2124)
>   at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1092)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>   at org.apache.spark.rdd.RDD.fold(RDD.scala:1086)
>   at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1155)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1131)
>
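
A minimal sketch of the direction proposed above, assuming the items are longs (this is not the DataFrameStatFunctions implementation): build one filter per partition on the executors with {{mapPartitions}} and merge the partial filters, so no large driver-side filter is captured in a task closure:

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.util.sketch.BloomFilter

def buildBloomFilter(items: RDD[Long], expectedNumItems: Long, fpp: Double): BloomFilter =
  items.mapPartitions { iter =>
    // The filter is created on the executor, never shipped from the driver in a closure.
    val bf = BloomFilter.create(expectedNumItems, fpp)
    iter.foreach(bf.putLong)
    Iterator.single(bf)
  }.reduce { (a, b) => a.mergeInPlace(b); a } // merge the per-partition filters
{code}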

[jira] [Resolved] (SPARK-22114) The condition of OnlineLDAOptimizer convergence should be configurable

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22114.
--
Resolution: Incomplete

> The condition of OnlineLDAOptimizer convergence should be configurable  
> 
>
> Key: SPARK-22114
> URL: https://issues.apache.org/jira/browse/SPARK-22114
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>  Labels: bulk-closed
>
> The current convergence of OnlineLDAOptimizer is:
> {code:java}
> while(meanGammaChange > 1e-3)
> {code}
> The condition of this is critical for the performance and accuracy of LDA. 
> We should keep this configurable, like it is in Vowpal Wabbit: 
> https://github.com/JohnLangford/vowpal_wabbit/blob/430f69453bc4876a39351fba1f18771bdbdb7122/vowpalwabbit/lda_core.cc
>  :638
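
A minimal sketch of what a configurable tolerance could look like; {{gammaTolerance}} and the surrounding loop are illustrative only, not the OnlineLDAOptimizer code:

{code:scala}
// The hard-coded 1e-3 becomes a parameter (with the old value as the default),
// so callers can trade accuracy for speed.
class VariationalLoop(gammaTolerance: Double = 1e-3, maxIterations: Int = 100) {
  def run(step: () => Double): Unit = {
    var meanGammaChange = Double.MaxValue
    var iterations = 0
    while (meanGammaChange > gammaTolerance && iterations < maxIterations) {
      meanGammaChange = step() // step() returns the mean change in gamma for this pass
      iterations += 1
    }
  }
}
{code}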



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25103) CompletionIterator may delay GC of completed resources

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25103.
--
Resolution: Incomplete

> CompletionIterator may delay GC of completed resources
> --
>
> Key: SPARK-25103
> URL: https://issues.apache.org/jira/browse/SPARK-25103
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0, 2.2.0, 2.3.0
>Reporter: Eyal Farago
>Priority: Major
>  Labels: bulk-closed
>
> While working on SPARK-22713, I found (and partially fixed) a scenario in 
> which an iterator is already exhausted but still holds a reference to 
> resources that could be GCed at that point.
> However, these resources cannot be GCed because of this reference.
> The specific fix applied in SPARK-22713 was to wrap the iterator with a 
> CompletionIterator that cleans it up when exhausted; the thing is that it's quite 
> easy to get this wrong by closing over local variables or the _this_ reference in 
> the cleanup function itself.
> I propose solving this by modifying CompletionIterator to discard its references 
> to the wrapped iterator and the cleanup function once exhausted (see the sketch below).
>  
>  * a dive into the code showed that most CompletionIterators are eventually 
> used by 
> {code:java}
> org.apache.spark.scheduler.ShuffleMapTask#runTask{code}
> which does:
> {code:java}
> writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: 
> Product2[Any, Any]]]){code}
> looking at 
> {code:java}
> org.apache.spark.shuffle.ShuffleWriter#write{code}
> implementations, it seems all of them first exhaust the iterator and then 
> perform some kind of post-processing: i.e. merging spills, sorting, writing 
> partitions files and then concatenating them into a single file... bottom 
> line the Iterator may actually be 'sitting' for some time after being 
> exhausted.
>  
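
A simplified sketch of the proposed change (not Spark's actual CompletionIterator): once the wrapped iterator reports exhaustion, run the completion function and drop both references, so they can be GCed even while the wrapper itself stays reachable:

{code:scala}
class DroppingCompletionIterator[A](
    private var sub: Iterator[A],
    private var completion: () => Unit) extends Iterator[A] {

  override def hasNext: Boolean = {
    val more = sub != null && sub.hasNext
    if (!more && sub != null) {
      completion()
      sub = null        // release the exhausted iterator
      completion = null // release the cleanup closure (and anything it captured)
    }
    more
  }

  override def next(): A = sub.next()
}
{code}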



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23686.
--
Resolution: Incomplete

> Make better usage of org.apache.spark.ml.util.Instrumentation
> -
>
> Key: SPARK-23686
> URL: https://issues.apache.org/jira/browse/SPARK-23686
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>  Labels: bulk-closed
>
> This Jira is a bit high level and might require subtasks or other jiras for 
> more specific tasks.
> I've noticed that we don't make the best usage of the instrumentation class. 
> Specifically sometimes we bypass the instrumentation class and use the 
> debugger instead. For example, 
> [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]
> Also there are some things that might be useful to log in the instrumentation 
> class that we currently don't. For example:
> number of training examples
> mean/var of label (regression)
> I know computing these things can be expensive in some cases, but especially 
> when this data is already available we can log it for free. For example, 
> Logistic Regression Summarizer computes some useful data including numRows 
> that we don't log.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15573) Backwards-compatible persistence for spark.ml

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-15573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-15573.
--
Resolution: Incomplete

> Backwards-compatible persistence for spark.ml
> -
>
> Key: SPARK-15573
> URL: https://issues.apache.org/jira/browse/SPARK-15573
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>  Labels: bulk-closed
>
> This JIRA is for imposing backwards-compatible persistence for the 
> DataFrames-based API for MLlib.  I.e., we want to be able to load models 
> saved in previous versions of Spark.  We will not require loading models 
> saved in later versions of Spark.
> This requires:
> * Putting unit tests in place to check loading models from previous versions
> * Notifying all committers active on MLlib to be aware of this requirement in 
> the future
> The unit tests could be written as in spark.mllib, where we essentially 
> copied and pasted the save() code every time it changed.  This happens 
> rarely, so it should be acceptable, though other designs are fine.
> Subtasks of this JIRA should cover checking and adding tests for existing 
> cases, such as KMeansModel (whose format changed between 1.6 and 2.0).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20732) Copy cache data when node is being shut down

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20732.
--
Resolution: Incomplete

> Copy cache data when node is being shut down
> 
>
> Key: SPARK-20732
> URL: https://issues.apache.org/jira/browse/SPARK-20732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Holden Karau
>Priority: Major
>  Labels: bulk-closed
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24163) Support "ANY" or sub-query for Pivot "IN" clause

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24163.
--
Resolution: Incomplete

> Support "ANY" or sub-query for Pivot "IN" clause
> 
>
> Key: SPARK-24163
> URL: https://issues.apache.org/jira/browse/SPARK-24163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wei Xue
>Priority: Major
>  Labels: bulk-closed
>
> This is part of a functionality extension to Pivot SQL support as SPARK-24035.
> Currently, only literal values are allowed in Pivot "IN" clause. To support 
> ANY or a sub-query in the "IN" clause (the examples of which provided below), 
> we need to enable evaluation of a sub-query before/during query analysis time.
> {code:java}
> SELECT * FROM (
>   SELECT year, course, earnings FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN ANY
> );{code}
> {code:java}
> SELECT * FROM (
>   SELECT year, course, earnings FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN (
> SELECT course FROM courses
> WHERE region = 'AZ'
>   )
> );
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20744) Predicates with multiple columns do not work

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20744.
--
Resolution: Incomplete

> Predicates with multiple columns do not work
> 
>
> Key: SPARK-20744
> URL: https://issues.apache.org/jira/browse/SPARK-20744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>Priority: Major
>  Labels: bulk-closed
>
> The following code reproduces the problem:
> {code}
> scala> spark.range(10).selectExpr("id as a", "id as b").where("(a,b) in 
> ((1,1))").show
> org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', 
> `a`, 'b', `b`) IN (named_struct('col1', 1, 'col2', 1)))' due to data type 
> mismatch: Arguments must be same type; line 1 pos 6;
> 'Filter named_struct(a, a#42L, b, b#43L) IN (named_struct(col1, 1, col2, 1))
> +- Project [id#39L AS a#42L, id#39L AS b#43L]
>+- Range (0, 10, step=1, splits=Some(1))
> {code}
> Similarly it won't work from SQL either, which is something that other SQL DB 
> support:
> {code}
> scala> spark.range(10).selectExpr("id as a", "id as 
> b").createOrReplaceTempView("tab1")
> scala> sql("select * from tab1 where (a,b) in ((1,1), (2,2))").show
> org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', 
> tab1.`a`, 'b', tab1.`b`) IN (named_struct('col1', 1, 'col2', 1), 
> named_struct('col1', 2, 'col2', 2)))' due to data type mismatch: Arguments 
> must be same type; line 1 pos 31;
> 'Project [*]
> +- 'Filter named_struct(a, a#50L, b, b#51L) IN (named_struct(col1, 1, col2, 
> 1),named_struct(col1, 2, col2, 2))
>+- SubqueryAlias tab1
>   +- Project [id#47L AS a#50L, id#47L AS b#51L]
>  +- Range (0, 10, step=1, splits=Some(1))
> {code}
> Other examples:
> {code}
> scala> sql("select * from tab1 where (a,b) =(1,1)").show
> org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', 
> tab1.`a`, 'b', tab1.`b`) = named_struct('col1', 1, 'col2', 1))' due to data 
> type mismatch: differing types in '(named_struct('a', tab1.`a`, 'b', 
> tab1.`b`) = named_struct('col1', 1, 'col2', 1))' (struct 
> and struct).; line 1 pos 25;
> 'Project [*]
> +- 'Filter (named_struct(a, a#50L, b, b#51L) = named_struct(col1, 1, col2, 1))
>+- SubqueryAlias tab1
>   +- Project [id#47L AS a#50L, id#47L AS b#51L]
>  +- Range (0, 10, step=1, splits=Some(1))
> {code}
> Expressions such as (1,1) are apparently read as structs and then the types 
> do not match. Perhaps they should be arrays.
> The following code works:
> {code}
> sql("select * from tab1 where array(a,b) in (array(1,1),array(2,2))").show
> {code}
> This also works, but requires the cast:
> {code}
> sql("select * from tab1 where (a,b) in (named_struct('a', cast(1 as bigint), 
> 'b', cast(1 as bigint)))").show
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24100) Add the CompressionCodec to the saveAsTextFiles interface.

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24100.
--
Resolution: Incomplete

> Add the CompressionCodec to the saveAsTextFiles interface.
> --
>
> Key: SPARK-24100
> URL: https://issues.apache.org/jira/browse/SPARK-24100
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: caijie
>Priority: Minor
>  Labels: bulk-closed
>
> Add the CompressionCodec to the saveAsTextFiles interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24750) HiveCaseSensitiveInferenceMode with INFER_AND_SAVE will show WRITE permission denied even if select table operation

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24750.
--
Resolution: Incomplete

> HiveCaseSensitiveInferenceMode with INFER_AND_SAVE will show WRITE permission 
> denied even if select table operation
> ---
>
> Key: SPARK-24750
> URL: https://issues.apache.org/jira/browse/SPARK-24750
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Lantao Jin
>Priority: Major
>  Labels: bulk-closed
>
> The default HiveCaseSensitiveInferenceMode is INFER_AND_SAVE. In this mode, 
> even just selecting from a table on which I have no write permission logs a 
> WRITE permission-denied error.
> spark-sql> select col1 from table1 limit 10;
> table1 is a Hive extended table, and user user_me has no write permission on 
> the table1 location /path/table1/dt=20180705 
> (b_someone:group_company:drwxr-xr-x)
> {code}
> 18/07/05 20:13:18 WARN hive.HiveMetastoreCatalog: Unable to save 
> case-sensitive schema for table default.table1
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. 
> java.security.AccessControlException: Permission denied: user=user_me, 
> access=WRITE, 
> inode="/path/table1/dt=20180705":b_someone:group_company:drwxr-xr-x
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1780)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1764)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1738)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8445)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:2022)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1451)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
> ;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.alterTableDataSchema(HiveExternalCatalog.scala:646)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableDataSchema(SessionCatalog.scala:369)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.updateDataSchema(HiveMetastoreCatalog.scala:266)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$inferIfNeeded(HiveMetastoreCatalog.scala:250)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$6$$anonfun$7.apply(HiveMetastoreCatalog.scala:194)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$6$$anonfun$7.apply(HiveMetastoreCatalog.scala:193)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$6.apply(HiveMetastoreCatalog.scala:193)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$6.apply(HiveMetastoreCatalog.scala:186)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.withTableCreationLock(HiveMetastoreCatalog.scala:54)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:186)
> at 
> org.apache.spark.sql.hive.RelationConversions.org$apache$spark$sql$hive$RelationConversions$$convert(HiveStrategies.scala:199)
> at 
> org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:219)
> at 
> org.apache.spark.sql.hive.Rel

[jira] [Resolved] (SPARK-25217) Error thrown when creating BlockMatrix

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25217.
--
Resolution: Incomplete

> Error thrown when creating BlockMatrix
> --
>
> Key: SPARK-25217
> URL: https://issues.apache.org/jira/browse/SPARK-25217
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: cs5090237
>Priority: Major
>  Labels: bulk-closed
>
> dm1 = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])
> dm2 = Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12])
> sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 1, 2], [7, 11, 12])
> blocks1 = sc.parallelize([((0, 0), dm1)])
> sm_ = Matrix(3,2,sm)
> blocks2 = sc.parallelize([((0, 0), sm), ((1, 0), sm)])
> blocks3 = sc.parallelize([((0, 0), sm), ((1, 0), dm2)])
> mat2 = BlockMatrix(blocks2, 3, 2)
> mat3 = BlockMatrix(blocks3, 3, 2)
>  
> *Running the above sample code from the PySpark documentation raises the 
> following error:* 
>  
> An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in 
> stage 53.0 failed 4 times, most recent failure: Lost task 14.3 in stage 53.0 
> (TID 1081, , executor 15): org.apache.spark.api.python.PythonException: 
> Traceback (most recent call last): File 
> "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/worker.py", line 230, 
> in main process() File 
> "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/worker.py", line 225, 
> in process serializer.dump_stream(func(split_index, iterator), outfile) File 
> "/mnt/yarn/usercache/livy/appcache/application_1535051034290_0001/container_1535051034290_0001_01_23/pyspark.zip/pyspark/serializers.py",
>  line 372, in dump_stream vs = list(itertools.islice(iterator, batch)) File 
> "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1371, in 
> takeUpToNumLeft File 
> "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/util.py", line 55, in 
> wrapper return f(*args, **kwargs) File 
> "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/mllib/linalg/distributed.py",
>  line 975, in _convert_to_matrix_block_tuple raise TypeError("Cannot convert 
> type %s into a sub-matrix block tuple" % type(block)) TypeError: Cannot 
> convert type  into a sub-matrix block tuple
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22359) Improve the test coverage of window functions

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22359.
--
Resolution: Incomplete

> Improve the test coverage of window functions
> -
>
> Key: SPARK-22359
> URL: https://issues.apache.org/jira/browse/SPARK-22359
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xingbo Jiang
>Priority: Major
>  Labels: bulk-closed
>
> There are already quite a few integration tests using window functions, but 
> the unit test coverage for window functions is not ideal.
> We'd like to test the following aspects:
> * Specifications
> ** different partition clauses (none, one, multiple)
> ** different order clauses (none, one, multiple, asc/desc, nulls first/last)
> * Frames and their combinations
> ** OffsetWindowFunctionFrame
> ** UnboundedWindowFunctionFrame
> ** SlidingWindowFunctionFrame
> ** UnboundedPrecedingWindowFunctionFrame
> ** UnboundedFollowingWindowFunctionFrame
> * Aggregate function types
> ** Declarative
> ** Imperative
> ** UDAF
> * Spilling
> ** Cover the conditions that WindowExec should spill at least once 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18600) BZ2 CRC read error needs better reporting

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-18600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-18600.
--
Resolution: Incomplete

> BZ2 CRC read error needs better reporting
> -
>
> Key: SPARK-18600
> URL: https://issues.apache.org/jira/browse/SPARK-18600
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Charles R Allen
>Priority: Minor
>  Labels: bulk-closed, spree
>
> {code}
> 16/11/25 20:05:03 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 148 
> in stage 5.0 failed 1 times, most recent failure: Lost task 148.0 in stage 
> 5.0 (TID 5945, localhost): org.apache.spark.SparkException: Task failed while 
> writing rows
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalStateException - Error reading from input
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=[INTERVALSTARTTIME_GMT, INTERVALENDTIME_GMT, OPR_DT, OPR_HR, 
> NODE_ID_XML, NODE_ID, NODE, MARKET_RUN_ID, LMP_TYPE, XML_DATA_ITEM, 
> PNODE_RESMRID, GRP_TYPE, POS, VALUE, OPR_INTERVAL, GROUP]
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Line separator detection enabled=false
> Maximum number of characters per column=100
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Row processor=none
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
> CsvFormat:
> Comment character=\0
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\n
> Quote character="
> Quote escape character=\
> Quote escape escape character=null
> Internal state when error was thrown: line=27089, column=13, record=27089, 
> charIndex=4451456, headers=[INTERVALSTARTTIME_GMT, INTERVALENDTIME_GMT, 
> OPR_DT, OPR_HR, NODE_ID_XML, NODE_ID, NODE, MARKET_RUN_ID, LMP_TYPE, 
> XML_DATA_ITEM, PNODE_RESMRID, GRP_TYPE, POS, VALUE, OPR_INTERVAL, GROUP]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:302)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:431)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:148)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:131)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.

[jira] [Resolved] (SPARK-21962) Distributed Tracing in Spark

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21962.
--
Resolution: Incomplete

> Distributed Tracing in Spark
> 
>
> Key: SPARK-21962
> URL: https://issues.apache.org/jira/browse/SPARK-21962
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>Priority: Major
>  Labels: bulk-closed
>
> Spark should support distributed tracing, which is the mechanism, widely 
> popularized by Google in the [Dapper 
> Paper|https://research.google.com/pubs/pub36356.html], where network requests 
> have additional metadata used for tracing requests between services.
> This would be useful for me since I have OpenZipkin style tracing in my 
> distributed application up to the Spark driver, and from the executors out to 
> my other services, but the link is broken in Spark between driver and 
> executor since the Span IDs aren't propagated across that link.
> An initial implementation could instrument the most important network calls 
> with trace ids (like launching and finishing tasks), and incrementally add 
> more tracing to other calls (torrent block distribution, external shuffle 
> service, etc) as the feature matures.
> Search keywords: Dapper, Brave, OpenZipkin, HTrace



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20443) The blockSize of MLlib ALS should be settable by the user

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20443.
--
Resolution: Incomplete

> The blockSize of MLlib ALS should be settable by the user
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>  Labels: bulk-closed
> Attachments: blockSize.jpg
>
>
> The blockSize of MLlib ALS is very important for ALS performance. 
> In our test, with a blockSize of 128 the performance is about 4X better than 
> with a blockSize of 4096 (the default value).
> The following are our test results (blockSize → recommendationForAll time):
> 128 (124s), 256 (160s), 512 (184s), 1024 (244s), 2048 (332s), 4096 (488s), 8192 (OOM)
> The test environment:
> 3 workers: each worker has 10 cores, 30 GB of memory, and 1 executor.
> The data: 480,000 users and 17,000 items.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24607) Distribute by rand() can lead to data inconsistency

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24607.
--
Resolution: Incomplete

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: SPARK-24607
> URL: https://issues.apache.org/jira/browse/SPARK-24607
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: zenglinxi
>Priority: Major
>  Labels: bulk-closed
>
> Noticed the following queries can give different results:
> {code:java}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;{code}
> This issue was first reported by someone using Kylin to build a cube with 
> HiveQL that includes "distribute by rand()"; data inconsistency may happen 
> when tasks are retried for fault tolerance. Since Spark has a similar 
> fault-tolerance mechanism, I think it is also a hidden, serious problem in 
> Spark SQL. A deterministic alternative is sketched below.
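
One deterministic alternative, sketched below with made-up column names: distribute by a hash of the row's columns instead of {{rand()}}, so a retried task puts every row back into the same bucket and the row count cannot drift:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// rand() is re-evaluated on task retry, so rows can land in different buckets across
// attempts; a hash of stable columns always maps the same row to the same bucket.
val deterministic = spark.sql(
  "SELECT * FROM tbl DISTRIBUTE BY pmod(hash(col1, col2), 200)")
deterministic.count()
{code}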



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25107) Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: CatalogRelation Errors

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25107.
--
Resolution: Incomplete

> Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: 
> CatalogRelation Errors
> --
>
> Key: SPARK-25107
> URL: https://issues.apache.org/jira/browse/SPARK-25107
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Spark Version : 2.2.0.cloudera2
>Reporter: Karan
>Priority: Major
>  Labels: bulk-closed
>
> I am in the process of upgrading Spark 1.6 to Spark 2.2.
> I have a two-stage query and I am running it with hiveContext.
> 1) hiveContext.sql("SELECT *, ROW_NUMBER() OVER (PARTITION BY ConfigID, rowid 
> ORDER BY date DESC) AS ro 
>  FROM (SELECT DISTINCT PC.ConfigID, VPM.seqNo, VC.rowid ,VC.recordid 
> ,VC.data,CASE 
>  WHEN pcs.score BETWEEN PC.from AND PC.to 
>  AND ((PC.csacnt IS NOT NULL AND CC.status = 4 
>  AND CC.cnclusion = mc.ca) OR (PC.csacnt IS NULL)) THEN 1 ELSE 0 END AS Flag 
>  FROM maindata VC 
>  INNER JOIN scoretable pcs ON VC.s_recordid = pcs.s_recordid 
>  INNER JOIN cnfgtable PC ON PC.subid = VC.subid 
>  INNER JOIN prdtable VPM ON PC.configID = VPM.CNFG_ID 
>  LEFT JOIN casetable CC ON CC.rowid = VC.rowid 
>  LEFT JOIN prdcnfg mc ON mc.configID = PC.configID WHERE VC.date BETWEEN 
> VPM.StartDate and VPM.EndDate) A 
>  WHERE A.Flag =1").createOrReplaceTempView.("stage1")
> 2) hiveContext.sql("SELECT DISTINCT t1.ConfigID As cnfg_id ,vct.* 
>  FROM stage1 t1 
>  INNER JOIN stage1 t2 ON t1.rowid = t2.rowid AND T1.ConfigID = t2.ConfigID 
>  INNER JOIN cnfgtable PCR ON PCR.ConfigID = t2.ConfigID 
>  INNER JOIN maindata vct on vct.recordid = t1.recordid
>  WHERE t2.ro = PCR.datacount”)
> The same query sequence works fine in Spark 1.6 but fails with the below 
> exception in Spark 2.2. It throws the exception while parsing the 2nd query above.
>  
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
> CatalogRelation `database_name`.`maindata`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [recordid#1506 ... 89 more fields]
>   at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:722)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:721)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.Tre

[jira] [Resolved] (SPARK-23797) SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23797.
--
Resolution: Incomplete

> SparkSQL performance on small TPCDS tables is very low when compared to Drill 
> or Presto
> ---
>
> Key: SPARK-23797
> URL: https://issues.apache.org/jira/browse/SPARK-23797
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Submit, SQL
>Affects Versions: 2.3.0
>Reporter: Tin Vu
>Priority: Major
>  Labels: bulk-closed
>
> I am executing a benchmark to compare performance of SparkSQL, Apache Drill 
> and Presto. My experimental setup:
>  * TPCDS dataset with scale factor 100 (size 100GB).
>  * Spark, Drill, Presto have a same number of workers: 12.
>  * Each worker has the same allocated amount of memory: 4 GB.
>  * Data is stored by Hive with ORC format.
> I executed a very simple SQL query: "SELECT * from table_name"
> The issue is that for some small tables (even tables with a few dozen 
> records), SparkSQL still required about 7-8 seconds to finish, while Drill 
> and Presto needed less than 1 second.
>  For other large tables with billions of records, SparkSQL performance was 
> reasonable, requiring 20-30 seconds to scan the whole table.
>  Do you have any idea or reasonable explanation for this issue?
> Thanks,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-8582.
-
Resolution: Incomplete

> Optimize checkpointing to avoid computing an RDD twice
> --
>
> Key: SPARK-8582
> URL: https://issues.apache.org/jira/browse/SPARK-8582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
>
> In Spark, checkpointing allows the user to truncate the lineage of an RDD 
> and save the intermediate contents to HDFS for fault tolerance. However, the 
> current implementation is not very efficient:
> Every time we checkpoint an RDD, we actually compute it twice: once during 
> the action that triggered the checkpointing in the first place, and once 
> while we checkpoint (we iterate through the RDD's partitions and write them 
> to disk). See this line for more detail: 
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint 
> data to HDFS while we run the action. This would speed up many usages of 
> `RDD#checkpoint` by 2x.
> (Alternatively, the user can just cache the RDD before checkpointing it, but 
> this is not always viable for very large input data. It's also not a great 
> API to use in general.)
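
A minimal sketch of the cache-before-checkpoint workaround mentioned above, assuming a plain RDD job; the paths and the word-count pipeline are only illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/checkpoints")       // hypothetical checkpoint dir

// Persist first so the checkpoint write reuses the cached partitions
// instead of recomputing the whole lineage a second time.
val counts = sc.textFile("/tmp/input.txt")    // hypothetical input file
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)
  .persist(StorageLevel.MEMORY_AND_DISK)

counts.checkpoint()
counts.count()  // one action materializes the cache and triggers the checkpoint write
{code}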



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21076) R dapply doesn't return array or raw columns when array have different length

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21076.
--
Resolution: Incomplete

> R dapply doesn't return array or raw columns when array have different length
> -
>
> Key: SPARK-21076
> URL: https://issues.apache.org/jira/browse/SPARK-21076
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Xu Yang
>Priority: Major
>  Labels: bulk-closed
>
> Calling SparkR::dapplyCollect with R functions that return data frames 
> produces an error. This comes up when returning columns of binary data, i.e. 
> serialized fitted models. It also happens when functions return columns 
> containing vectors. 
> [SPARK-16785|https://issues.apache.org/jira/browse/SPARK-16785]
> still has this issue when the input data is an array column whose vectors do 
> not all have the same length, like:
> {code}
> head(test1)
>key  value
> 1 4dda7d68a202e9e3  1595297780
> 2  4e08f349deb7392  641991337
> 3 4e105531747ee00b  374773009
> 4 4f1d5ef7fdb4620a  2570136926
> 5 4f63a71e6dde04cd  2117602722
> 6 4fa2f96b689624fc  3489692062, 1344510747, 1095592237, 
> 424510360, 3211239587
> sparkR.stop()
> sc <- sparkR.init()
> sqlContext <- sparkRSQL.init(sc)
> spark_df = createDataFrame(sqlContext, test1)
> # Fails
> dapplyCollect(spark_df, function(x) x)
> Caused by: org.apache.spark.SparkException: R computation failed with
>  Error in (function (..., deparse.level = 1, make.row.names = TRUE, 
> stringsAsFactors = default.stringsAsFactors())  : 
>   invalid list argument: all variables should have the same length
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
>   at 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
>   at 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:186)
>   at 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:183)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   ... 1 more
> # Works fine
> spark_df <- selectExpr(spark_df, "key", "explode(value) value") 
> dapplyCollect(spark_df, function(x) x)
> key value
> 1  4dda7d68a202e9e3 1595297780
> 2   4e08f349deb7392  641991337
> 3  4e105531747ee00b  374773009
> 4  4f1d5ef7fdb4620a 2570136926
> 5  4f63a71e6dde04cd 2117602722
> 6  4fa2f96b689624fc 3489692062
> 7  4fa2f96b689624fc 1344510747
> 8  4fa2f96b689624fc 1095592237
> 9  4fa2f96b689624fc  424510360
> 10 4fa2f96b689624fc 3211239587
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16418) DataFrame.filter fails if it references a window function

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16418.
--
Resolution: Incomplete

> DataFrame.filter fails if it references a window function
> -
>
> Key: SPARK-16418
> URL: https://issues.apache.org/jira/browse/SPARK-16418
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Erik Wright
>Priority: Major
>  Labels: bulk-closed
>
> I'm using Data Frames in Python. If I build up a column expression that 
> includes a window function, then filter on it, the resulting Data Frame 
> cannot be evaluated.
> If I first add that column expression to the Data Frame as a column (or add 
> the sub-expression that includes the window function as a column), the filter 
> works. This works even if I later drop the added column.
> It seems like this shouldn't be required. In the worst case, the platform 
> should be able to do this for me under the hood when/if necessary.
> {code:none}
> In [1]: from pyspark.sql import types
> In [2]: from pyspark.sql import Window
> In [3]: from pyspark.sql import functions
> In [4]: schema = types.StructType([types.StructField('id', 
> types.IntegerType(), False),
>...:types.StructField('state', 
> types.StringType(), True),
>...:types.StructField('seq', 
> types.IntegerType(), False)])
> In [5]: original_data_frame = sc.sql.createDataFrame([(1, 'hello', 1),(1, 
> 'world', 2),(1,'world',3)], schema)
> In [6]: previous_state = 
> functions.lag('state').over(Window.partitionBy('id').orderBy('seq').rowsBetween(-1,-1))
> In [7]: filter_condition = (original_data_frame['state'] == 'world') & 
> (previous_state != 'world')
> In [8]: data_frame = original_data_frame.withColumn('filter_condition', 
> filter_condition)
> In [9]: data_frame.show()
> +---+-----+---+----------------+
> | id|state|seq|filter_condition|
> +---+-----+---+----------------+
> |  1|hello|  1|           false|
> |  1|world|  2|            true|
> |  1|world|  3|           false|
> +---+-----+---+----------------+
> In [10]: data_frame = 
> data_frame.filter(data_frame['filter_condition']).drop('filter_condition')
> In [11]: data_frame.explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- Project [id#0,state#1,seq#2]
> : +- Filter (((isnotnull(state#1) && isnotnull(_we0#6)) && (state#1 = 
> world)) && NOT (_we0#6 = world))
> :+- INPUT
> +- Window [lag(state#1, 1, null) windowspecdefinition(id#0, seq#2 ASC, ROWS 
> BETWEEN 1 PRECEDING AND 1 PRECEDING) AS _we0#6], [id#0], [seq#2 ASC]
>+- WholeStageCodegen
>   :  +- Sort [id#0 ASC,seq#2 ASC], false, 0
>   : +- INPUT
>   +- Exchange hashpartitioning(id#0, 200), None
>  +- Scan ExistingRDD[id#0,state#1,seq#2]
> In [12]: data_frame.show()
> +---+-----+---+
> | id|state|seq|
> +---+-----+---+
> |  1|world|  2|
> +---+-----+---+
> In [13]: data_frame = original_data_frame.withColumn('previous_state', 
> previous_state)
> In [14]: data_frame.show()
> +---+-----+---+--------------+
> | id|state|seq|previous_state|
> +---+-----+---+--------------+
> |  1|hello|  1|          null|
> |  1|world|  2|         hello|
> |  1|world|  3|         world|
> +---+-----+---+--------------+
> In [15]: filter_condition = (data_frame['state'] == 'world') & 
> (data_frame['previous_state'] != 'world')
> In [16]: data_frame = 
> data_frame.filter(filter_condition).drop('previous_state')
> In [17]: data_frame.explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- Project [id#0,state#1,seq#2]
> : +- Filter (((isnotnull(state#1) && isnotnull(previous_state#12)) && 
> (state#1 = world)) && NOT (previous_state#12 = world))
> :+- INPUT
> +- Window [lag(state#1, 1, null) windowspecdefinition(id#0, seq#2 ASC, ROWS 
> BETWEEN 1 PRECEDING AND 1 PRECEDING) AS previous_state#12], [id#0], [seq#2 
> ASC]
>+- WholeStageCodegen
>   :  +- Sort [id#0 ASC,seq#2 ASC], false, 0
>   : +- INPUT
>   +- Exchange hashpartitioning(id#0, 200), None
>  +- Scan ExistingRDD[id#0,state#1,seq#2]
> In [18]: data_frame.show()
> +---+-----+---+
> | id|state|seq|
> +---+-----+---+
> |  1|world|  2|
> +---+-----+---+
> In [19]: filter_condition = (original_data_frame['state'] == 'world') & 
> (previous_state != 'world')
> In [20]: data_frame = original_data_frame.filter(filter_condition)
> In [21]: data_frame.explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- Filter ((isnotnull(state#1) && (state#1 = world)) && NOT (lag(state#1, 
> 1, null) windowspecdefinition(id#0, seq#2 ASC, ROWS BETWEEN 1 PRECEDING AND 1 
> PRECEDING) = world))
> : +- INPUT
> +- Scan ExistingRDD[id#0,state#1,seq#2]
> In [22]: data_frame.show()
> 

[jira] [Resolved] (SPARK-22565) Session-based windowing

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22565.
--
Resolution: Incomplete

> Session-based windowing
> ---
>
> Key: SPARK-22565
> URL: https://issues.apache.org/jira/browse/SPARK-22565
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Richard Xin
>Priority: Major
>  Labels: bulk-closed
> Attachments: screenshot-1.png
>
>
> I came across a requirement to support session-based windowing. For example, 
> user activity comes in from Kafka, and we want to create a window per user 
> session (if the time gap between activities from the same user exceeds a 
> predefined value, a new window is created).
> I noticed that Flink supports this kind of windowing; is there any 
> plan/schedule for Spark to support it? 
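
For reference, later Spark releases (3.2 and up) added a built-in session_window function that covers this request; a minimal sketch under that assumption, with a purely illustrative rate source and column names:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, session_window}

val spark = SparkSession.builder().appName("session-window-sketch").getOrCreate()

// Illustrative streaming source with userId and eventTime columns.
val events = spark.readStream
  .format("rate")
  .load()
  .withColumnRenamed("timestamp", "eventTime")
  .withColumn("userId", col("value") % 10)

// One window per user session: a new session starts when the gap between
// consecutive events from the same user exceeds 5 minutes.
val sessions = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(col("userId"), session_window(col("eventTime"), "5 minutes"))
  .agg(count(lit(1)).as("activityCount"))
{code}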



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24969) SQL: to_date function can't parse date strings in different locales.

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24969.
--
Resolution: Incomplete

> SQL: to_date function can't parse date strings in different locales.
> 
>
> Key: SPARK-24969
> URL: https://issues.apache.org/jira/browse/SPARK-24969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Bare Spark 2.2.1 installation, on RHEL 6.
>Reporter: Valentino Pinna
>Priority: Major
>  Labels: bulk-closed
>
> The locale for {{org.apache.spark.sql.catalyst.util.DateTimeUtils}}, that is 
> internally used by {{to_date}} SQL function, is set in code to be 
> {{Locale.US}}.
> This causes problems when parsing a dataset whose dates are in a different 
> language (Italian in this case).
> {code:java}
> spark.read.format("csv")
>   .option("sep", ";")
>   .csv(logFile)
>   .toDF("DATA", .)
>   .withColumn("DATA2", to_date(col("DATA"), "yyyy MMM"))
>   .show(10)
> {code}
> Results from example dataset:
> |*DATA*|*DATA2*|
> |2018 giu|null|
> |2018 mag|null|
> |2018 apr|2018-04-01|
> |2018 mar|2018-03-01|
> |2018 feb|2018-02-01|
> |2018 gen|null|
> |2017 dic|null|
> |2017 nov|2017-11-01|
> |2017 ott|null|
> |2017 set|null|
> Expected results: All values converted.
> TEMPORARY WORKAROUND:
> In object {{org.apache.spark.sql.catalyst.util.DateTimeUtils}}, replace all 
> instances of {{Locale.US}} with {{Locale.}}
> ADDITIONAL NOTES:
> I can make a pull request available on GitHub.
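
A minimal sketch of a locale-aware workaround (not part of the report), assuming Spark 2.x Scala APIs and that the JDK's Italian short month names match the data; the UDF name and column usage are illustrative:

{code:scala}
import java.sql.Date
import java.time.YearMonth
import java.time.format.DateTimeFormatterBuilder
import java.util.Locale
import org.apache.spark.sql.functions.{col, udf}

// The formatter is built inside the lambda so the closure stays serializable.
val parseItalianYearMonth = udf { s: String =>
  if (s == null) null
  else {
    val fmt = new DateTimeFormatterBuilder()
      .parseCaseInsensitive()
      .appendPattern("yyyy MMM")
      .toFormatter(Locale.ITALIAN)
    Date.valueOf(YearMonth.parse(s.trim, fmt).atDay(1))
  }
}

// df is assumed to be the CSV DataFrame from the report, with a DATA column:
// val withDate = df.withColumn("DATA2", parseItalianYearMonth(col("DATA")))
{code}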



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-15516.
--
Resolution: Incomplete

> Schema merging in driver fails for parquet when merging LongType and 
> IntegerType
> 
>
> Key: SPARK-15516
> URL: https://issues.apache.org/jira/browse/SPARK-15516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Databricks
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: bulk-closed
>
> I tried to create a table from partitioned Parquet directories that requires 
> schema merging. I get the following error:
> {code}
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to merge incompatible data 
> types LongType and IntegerType
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415)
> at org.apache.spark.sql.types.StructType.merge(StructType.scala:333)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829)
> {code}
> cc @rxin and [~mengxr]
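
A sketch that reproduces the shape of the problem under stated assumptions (hypothetical paths; the shared column is written as INT in one directory and LONG in the other):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge-schema-sketch").getOrCreate()

// Write the same column with two different integral widths.
spark.range(5).selectExpr("CAST(id AS INT) AS value")
  .write.mode("overwrite").parquet("/tmp/merge_demo/part=a")
spark.range(5).selectExpr("CAST(id AS LONG) AS value")
  .write.mode("overwrite").parquet("/tmp/merge_demo/part=b")

// Schema merging must now reconcile IntegerType and LongType for `value`;
// the stack trace above shows that reconciliation failing on 2.0.0.
val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/merge_demo")
merged.printSchema()
{code}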



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19241) remove hive generated table properties if they are not useful in Spark

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19241.
--
Resolution: Incomplete

> remove hive generated table properties if they are not useful in Spark
> --
>
> Key: SPARK-19241
> URL: https://issues.apache.org/jira/browse/SPARK-19241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: bulk-closed
>
> When we save a table into the Hive metastore, Hive will generate some table 
> properties automatically, e.g. transient_lastDdlTime, last_modified_by, 
> rawDataSize, etc. Some of them are useless in Spark SQL, so we should remove 
> them.
> It would be good if we could get the list of Hive-generated table properties 
> via the Hive API, so that we don't need to hardcode them.
> We can take a look at the Hive code to see how it excludes these 
> auto-generated table properties when describing a table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21885) HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference enabled

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21885.
--
Resolution: Incomplete

> HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference 
> enabled
> ---
>
> Key: SPARK-21885
> URL: https://issues.apache.org/jira/browse/SPARK-21885
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
> Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to 
> INFER_ONLY
>Reporter: liupengcheng
>Priority: Major
>  Labels: bulk-closed, slow, sql
>
> Currently, SparkSQL schema inference is too slow, taking almost 2 minutes.
> I dug into the code and finally found the reasons:
> 1. In the analysis of a LogicalPlan, Spark will try to infer the table 
> schema if `spark.sql.hive.caseSensitiveInferenceMode` is set to INFER_ONLY. 
> It will list all the leaf files of the rootPaths (just the tableLocation) 
> and then call `getFileBlockLocations` to turn each `FileStatus` into a 
> `LocatedFileStatus`. Calling `getFileBlockLocations` for so many leaf files 
> takes a long time, and it seems that the location info is never used.
> 2. When inferring a Parquet schema, even if there is only one file, Spark 
> will still launch a job to merge schemas. I think that's expensive.
> The time-costly stack is as follows:
> {code:java}
> at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259)
> at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224)
> at 
> org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:301)
> at 
> scala.collection.Tra
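
A minimal sketch of one mitigation when the case-sensitive schema is not actually needed: switch the inference mode so no listing or merge job runs at read time (the supported modes are INFER_AND_SAVE, INFER_ONLY and NEVER_INFER; note that NEVER_INFER changes behavior for Parquet files with mixed-case column names):

{code:scala}
// Assumes an existing SparkSession `spark`; this only avoids the inference
// cost, it does not fix the listing/merge behavior described above.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
{code}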

[jira] [Resolved] (SPARK-23664) Add interface to collect query result through file iterator

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23664.
--
Resolution: Incomplete

> Add interface to collect query result through file iterator
> ---
>
> Key: SPARK-23664
> URL: https://issues.apache.org/jira/browse/SPARK-23664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1, 2.3.0
>Reporter: zhoukang
>Priority: Major
>  Labels: bulk-closed
>
> Currently, we use Spark SQL through JDBC.
> Results may consume a lot of memory since we collect them and cache them in 
> memory for performance reasons.
> However, we could add an API to collect results through a file iterator 
> (like a Parquet file iterator), so that we avoid OOMs in the Thrift server 
> for big queries.
> Like below:
> {code:java}
> result.toLocalIteratorThroughFile.asScala
> {code}
> I will work on this if it makes sense!
> In our internal cluster we have used this API for about a year.
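
For comparison only (this is not the proposed file-backed API), a sketch of the closest existing primitive: Dataset.toLocalIterator() streams one partition at a time to the driver, which bounds driver memory relative to a full collect(). The table name is illustrative:

{code:scala}
import scala.collection.JavaConverters._

// Assumes an existing SparkSession `spark`.
val rows = spark.sql("SELECT * FROM some_big_table")
  .toLocalIterator()
  .asScala

rows.take(10).foreach(println)
{code}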



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25428) Support plain Kerberos Authentication with Spark

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25428.
--
Resolution: Incomplete

> Support plain Kerberos Authentication with Spark
> 
>
> Key: SPARK-25428
> URL: https://issues.apache.org/jira/browse/SPARK-25428
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Shruti Gumma
>Priority: Major
>  Labels: bulk-closed, features
>
> Spark should work with plain Kerberos authentication. Currently, Spark can 
> work with Hadoop delegation tokens, but not plain Kerberos. Hadoop's 
> UserGroupInformation(UGI) class is responsible for handling security 
> authentication in Spark. This UserGroupInformation(UGI) has support for 
> Kerberos authentication, as well as Token authentication. Since Spark does 
> not work correctly with the Kerberos auth method, it leads to a gap in fully 
> supporting all the security authentication mechanisms.
>  
>  If Kerberos is used to login in UserGroupInformation(UGI) using keytabs at 
> the startup of drivers and executors, then Spark does not allow this 
> logged-in UserGroupInformation(UGI) user to correctly propagate. The 
> exception arises from the implementation of the runAsSparkUser method in 
> SparkHadoopUtil.
>  
>  The runAsSparkUser method in SparkHadoopUtil creates a new UGI based on the 
> current static UGI and then transfers credentials from this current static 
> UGI to the new UGI. This works well with other auth methods, except Kerberos. 
> Transfer credentials implementation is not conducive for Kerberos auth model 
> since it does not transfer all the required internal state of UGI( such as 
> isKeytab and isKrbTkt). For Kerberos, the UGI has to be created from 
> UGI.loginUserFromKeytab method only and not simply by doing a transfer 
> credentials from the previous UGI to the new UGI. 
>  
>  Ideally, the CoarseGrainedExecutorBackend should login using keytab, similar 
> to MesosCoarseGrainedExecutorBackend.
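
A minimal sketch of the keytab-based login path the report argues for, using the Hadoop UGI API directly (the principal and keytab path are hypothetical):

{code:scala}
import org.apache.hadoop.security.UserGroupInformation

// Log in from a keytab instead of copying credentials between UGIs.
UserGroupInformation.loginUserFromKeytab(
  "spark/worker01.example.com@EXAMPLE.COM",
  "/etc/security/keytabs/spark.keytab")

val ugi = UserGroupInformation.getLoginUser
println(s"login user: ${ugi.getShortUserName}, from keytab: ${ugi.isFromKeytab}")

// Subsequent Hadoop calls in this JVM then run as this Kerberos identity.
{code}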



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24764) Add ServiceLoader implementation for SparkHadoopUtil

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24764.
--
Resolution: Incomplete

> Add ServiceLoader implementation for SparkHadoopUtil
> 
>
> Key: SPARK-24764
> URL: https://issues.apache.org/jira/browse/SPARK-24764
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Shruti Gumma
>Assignee: Shruti Gumma
>Priority: Major
>  Labels: bulk-closed
>
> Currently SparkHadoopUtil creation is static and cannot be changed. This 
> proposal is to move the creation of SparkHadoopUtil to a ServiceLoader, so 
> that external cluster managers can create their own SparkHadoopUtil.
> In the case of Yarn, a specific YarnSparkHadoopUtil is used in the Yarn 
> packages whereas in other places, SparkHadoopUtil is used. SparkHadoopUtil 
> has been changed in v2.3.0 to work with Yarn, leaving external cluster 
> managers with no configurable way of doing the same. 
> The util classes such as SparkHadoopUtil should be made configurable, similar 
> to how the ExternalClusterManager was made pluggable through a ServiceLoader.
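
A generic sketch of the ServiceLoader pattern being proposed; the trait and object names are hypothetical, not Spark API. Implementations would be listed under META-INF/services/<fully-qualified provider name> on the classpath:

{code:scala}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Hypothetical provider interface an external cluster manager could implement.
trait HadoopUtilProvider {
  def createHadoopUtil(): AnyRef
}

object HadoopUtilLoader {
  // Pick the first registered implementation, or fail loudly if none exists.
  def load(): HadoopUtilProvider =
    ServiceLoader.load(classOf[HadoopUtilProvider]).asScala.headOption
      .getOrElse(throw new IllegalStateException(
        "no HadoopUtilProvider registered via META-INF/services"))
}
{code}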



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18822) Support ML Pipeline in SparkR

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-18822.
--
Resolution: Incomplete

> Support ML Pipeline in SparkR
> -
>
> Key: SPARK-18822
> URL: https://issues.apache.org/jira/browse/SPARK-18822
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Priority: Major
>  Labels: bulk-closed
>
> From Joseph Bradley:
> "
> Supporting Pipelines and advanced use cases: There really needs to be more 
> design discussion around SparkR. Felix Cheung would you be interested in 
> leading some discussion? I'm envisioning something similar to what was done a 
> while back for Pipelines in Scala/Java/Python, where we consider several use 
> cases of MLlib: fitting a single model, creating and tuning a complex 
> Pipeline, and working with multiple languages. That should help inform what 
> APIs should look like in Spark R.
> "
> Certain ML models, such as OneVsRest, are harder to represent in a 
> single-call R API. Having an advanced API or a Pipeline API like this could 
> help expose them to our users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24617) Spark driver not requesting another executor once original executor exits due to 'lost worker'

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24617.
--
Resolution: Incomplete

> Spark driver not requesting another executor once original executor exits due 
> to 'lost worker'
> --
>
> Key: SPARK-24617
> URL: https://issues.apache.org/jira/browse/SPARK-24617
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: t oo
>Priority: Major
>  Labels: bulk-closed
>
> I am running Spark v2.1.1 in 'standalone' mode (no YARN/Mesos) across EC2s. I 
> have one master EC2 instance that acts as the driver (since spark-submit is 
> called on this host); spark.master is set up and deploy mode is client (so 
> spark-submit only returns a return code to the putty window once it finishes 
> processing). I have one worker EC2 instance that is registered with the Spark 
> master. When I run spark-submit on the master, I can see in the web UI that 
> executors start on the worker, and I can verify successful completion. 
> However, if the worker EC2 instance gets terminated while spark-submit is 
> running and a new EC2 worker comes up 3 minutes later and registers with the 
> master, the web UI shows 'cannot find address' in the executor status, but 
> the driver either keeps waiting forever (I killed it 2 days later) or, in 
> some cases, allocates tasks to the new worker only 5 hours later and then 
> completes! Is there some setting I am missing that would explain this 
> behavior?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24240) Add a config to control whether InMemoryFileIndex should update cache when refresh.

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24240.
--
Resolution: Incomplete

> Add a config to control whether InMemoryFileIndex should update cache when 
> refresh.
> ---
>
> Key: SPARK-24240
> URL: https://issues.apache.org/jira/browse/SPARK-24240
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: jin xing
>Priority: Major
>  Labels: bulk-closed
>
> In the current code 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L172]),
>  after data is inserted, Spark will always refresh the file index and update 
> the cache. If the target table has tons of files, the job will suffer long 
> run times and OOM issues. Could we add a config to control whether 
> InMemoryFileIndex should update the cache on refresh?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15691) Refactor and improve Hive support

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-15691.
--
Resolution: Incomplete

> Refactor and improve Hive support
> -
>
> Key: SPARK-15691
> URL: https://issues.apache.org/jira/browse/SPARK-15691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>  Labels: bulk-closed
>
> Hive support is important to Spark SQL, as many Spark users use it to read 
> from Hive. The current architecture is very difficult to maintain, and this 
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to accomplish are:
> - Move the Hive specific catalog logic into HiveExternalCatalog.
>   -- Remove HiveSessionCatalog. All Hive-related stuff should go into 
> HiveExternalCatalog. This would require moving caching either into 
> HiveExternalCatalog, or just into SessionCatalog.
>   -- Move using properties to store data source options into 
> HiveExternalCatalog (So, for a CatalogTable returned by HiveExternalCatalog, 
> we do not need to distinguish tables stored in hive formats and data source 
> tables).
>   -- Potentially more.
> - Remove Hive's specific ScriptTransform implementation and make it more 
> general so we can put it in sql/core.
> - Implement HiveTableScan (and write path) as a data source, so we don't need 
> a special planner rule for HiveTableScan.
> - Remove HiveSharedState and HiveSessionState.
> One thing that is still unclear to me is how to work with Hive UDF support. 
> We might still need a special planner rule there.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24450) Error: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.curator.utils.PathUtils.validatePath(Ljava/lang/String;)Ljava/lang/String;

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24450.
--
Resolution: Incomplete

> Error: Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.curator.utils.PathUtils.validatePath(Ljava/lang/String;)Ljava/lang/String;
> 
>
> Key: SPARK-24450
> URL: https://issues.apache.org/jira/browse/SPARK-24450
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2, 2.2.0
>Reporter: Maxim Dahenko
>Priority: Major
>  Labels: bulk-closed
>
> Hello,
> The org.apache.curator artifact, version 2.7.1 and higher, doesn't work in a 
> Spark job.
> pom.xml file:
> {code:java}
> <project xmlns="http://maven.apache.org/POM/4.0.0"
>          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
> http://maven.apache.org/maven-v4_0_0.xsd">
>   <modelVersion>4.0.0</modelVersion>
>   <groupId>com.test</groupId>
>   <artifactId>test</artifactId>
>   <version>0.0.1-SNAPSHOT</version>
>   <packaging>jar</packaging>
>   <name>"test"</name>
>   <build>
>     <plugins>
>       <plugin>
>         <artifactId>maven-assembly-plugin</artifactId>
>         <configuration>
>           <finalName>test</finalName>
>         </configuration>
>       </plugin>
>     </plugins>
>   </build>
>   <dependencies>
>     <dependency>
>       <groupId>org.apache.curator</groupId>
>       <artifactId>curator-client</artifactId>
>       <version>2.7.1</version>
>     </dependency>
>   </dependencies>
> </project>
> {code}
>  Source code src/main/java/com/test/Test.java:
> {code:java}
> package com.test;
> import org.apache.curator.utils.PathUtils;
> public class Test {
>  public static void main(String[] args) throws Exception {
>  PathUtils.validatePath("/tmp");
>  }
> }
> {code}
>  Result
> {code:java}
> spark-submit --class com.test.Test --master local test-0.0.1-SNAPSHOT.jar
> Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.curator.utils.PathUtils.validatePath(Ljava/lang/String;)Ljava/lang/String;
>  at com.test.Test.main(Test.java:7)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25030) SparkSubmit.doSubmit will not return result if the mainClass submitted creates a Timer()

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25030.
--
Resolution: Incomplete

> SparkSubmit.doSubmit will not return result if the mainClass submitted 
> creates a Timer()
> 
>
> Key: SPARK-25030
> URL: https://issues.apache.org/jira/browse/SPARK-25030
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Xingbo Jiang
>Priority: Major
>  Labels: bulk-closed
>
> Creating a Timer() in the mainClass submitted to SparkSubmit makes it unable 
> to fetch the result; the issue is very easy to reproduce.
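
A sketch of a likely reproduction, assuming the hang comes from java.util.Timer's default non-daemon thread; the object name and delay are illustrative:

{code:scala}
import java.util.{Timer, TimerTask}

object TimerMain {
  def main(args: Array[String]): Unit = {
    // java.util.Timer starts a non-daemon thread by default, so the JVM stays
    // alive after main() returns -- which would keep an in-process
    // SparkSubmit.doSubmit from ever seeing the submission finish.
    val timer = new Timer()
    timer.schedule(new TimerTask { def run(): Unit = println("tick") }, 60000L)
    println("main() returned, but the non-daemon timer thread keeps the JVM alive")
    // Using `new Timer(true)` (a daemon timer) or calling timer.cancel()
    // before returning avoids the hang.
  }
}
{code}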



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23292) python tests related to pandas are skipped with python 2

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23292.
--
Resolution: Incomplete

> python tests related to pandas are skipped with python 2
> 
>
> Key: SPARK-23292
> URL: https://issues.apache.org/jira/browse/SPARK-23292
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Yin Huai
>Priority: Critical
>  Labels: bulk-closed
>
> I was running python tests and found that 
> [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548]
>  does not run with Python 2 because the test uses "assertRaisesRegex" 
> (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python 
> 2). However, spark jenkins does not fail because of this issue (see run 
> history at 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]).
>  After looking into this issue, [seems test script will skip tests related to 
> pandas if pandas is not 
> installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63],
>  which means that jenkins does not have pandas installed. 
>  
> Since pyarrow related tests have the same skipping logic, we will need to 
> check if jenkins has pyarrow installed correctly as well. 
>  
> Since features using pandas and pyarrow are in 2.3, we should fix the test 
> issue and make sure all tests pass before we make the release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15777) Catalog federation

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-15777.
--
Resolution: Incomplete

> Catalog federation
> --
>
> Key: SPARK-15777
> URL: https://issues.apache.org/jira/browse/SPARK-15777
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>  Labels: bulk-closed
> Attachments: SparkFederationDesign.pdf
>
>
> This is a ticket to track progress to support federating multiple external 
> catalogs. This would require establishing an API (similar to the current 
> ExternalCatalog API) for getting information about external catalogs, and 
> ability to convert a table into a data source table.
> As part of this, we would also need to be able to support more than a 
> two-level table identifier (database.table). At the very least we would need 
> a three-level identifier for tables (catalog.database.table). A possible 
> direction is to support arbitrary-level hierarchical namespaces similar to 
> file systems.
> Once we have this implemented, we can convert the current Hive catalog 
> implementation into an external catalog that is "mounted" into an internal 
> catalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21040) On executor/worker decommission consider speculatively re-launching current tasks

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21040.
--
Resolution: Incomplete

> On executor/worker decommission consider speculatively re-launching current 
> tasks
> -
>
> Key: SPARK-21040
> URL: https://issues.apache.org/jira/browse/SPARK-21040
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Holden Karau
>Priority: Major
>  Labels: bulk-closed
>
> If speculative execution is enabled, we may wish to consider the 
> decommissioning of a worker as a weight for speculative execution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24905) Spark 2.3 Internal URL env variable

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24905.
--
Resolution: Incomplete

> Spark 2.3 Internal URL env variable
> ---
>
> Key: SPARK-24905
> URL: https://issues.apache.org/jira/browse/SPARK-24905
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Björn Wenzel
>Priority: Major
>  Labels: bulk-closed
>
> Currently the Kubernetes Master internal URL is hardcoded in the 
> Constants.scala file 
> ([https://github.com/apache/spark/blob/85fe1297e35bcff9cf86bd53fee615e140ee5bfb/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L75)]
> In some cases this URL should be changed, e.g. if the certificate is valid 
> for another hostname.
> Is it possible to make this URL a property like: 
> spark.kubernetes.authenticate.driver.hostname?
> Kubernetes The Hard Way, maintained by Kelsey Hightower, for example uses 
> kubernetes.default as the hostname; this will again produce an 
> SSLPeerUnverifiedException.
>  
> Here is the use of the Hardcoded Host: 
> [https://github.com/apache/spark/blob/85fe1297e35bcff9cf86bd53fee615e140ee5bfb/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala#L52]
>  maybe this could be changed like the KUBERNETES_NAMESPACE property in Line 
> 53.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24494) Give users possibility to skip own classes in SparkContext.getCallSite()

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24494.
--
Resolution: Incomplete

> Give users possibility to skip own classes in SparkContext.getCallSite()
> 
>
> Key: SPARK-24494
> URL: https://issues.apache.org/jira/browse/SPARK-24494
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: any (currently using Spark 2.2.0 in both local mode and 
> with YARN)
>Reporter: Florian Kaspar
>Priority: Minor
>  Labels: bulk-closed
>
> org.apache.spark.SparkContext.getCallSite() uses 
> org.apache.spark.util.Utils.getCallSite() which by default skips Spark and 
> Scala classes.
> It would be nice to be able to add user-defined classes/patterns to skip here 
> as well.
> We have one central class that acts as some kind of proxy for RDD 
> transformations and is used by many other classes.
> In the SparkUI we would like to see the real origin of a transformation being 
> not the proxy but its caller. This would make the UI even more useful to us.
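
A sketch of an existing workaround rather than the requested feature: SparkContext.setCallSite lets the proxy class label the jobs it submits, so the UI shows the real caller instead of the proxy frame. The helper name and usage here are hypothetical:

{code:scala}
import org.apache.spark.SparkContext

// Wrap each proxied action with an explicit call-site label.
def runForCaller[T](sc: SparkContext, caller: String)(body: => T): T = {
  sc.setCallSite(s"requested by $caller")
  try body
  finally sc.clearCallSite()
}

// usage: runForCaller(sc, "ReportJob") { someRdd.count() }
{code}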



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23796) There's no API to change state RDD's name

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23796.
--
Resolution: Incomplete

> There's no API to change state RDD's name
> -
>
> Key: SPARK-23796
> URL: https://issues.apache.org/jira/browse/SPARK-23796
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: István Gansperger
>Priority: Minor
>  Labels: bulk-closed
>
> I use a few {{mapWithState}} stream operations in my application, and at some 
> point it became a minor inconvenience that I could not figure out how to set 
> the state RDD's name or serialization level. Searching around didn't really 
> help, and I have not come across any issues regarding this (pardon my 
> inability to find one if it exists). It could be useful to see how much 
> memory each state uses if the user has multiple such transformations.
> I have used some ugly reflection based code to be able to set the name of the 
> state RDD and also the serialization level. I understand that the latter may 
> be intentionally limited, but I haven't come across any issues caused by this 
> apart from slightly degraded performance in exchange for a bit less memory 
> usage. Are these limitations in place intentionally or is it just an 
> oversight? Having some extra methods for these on {{StateSpec}} could be 
> useful in my opinion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14604) Modify design of ML model summaries

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-14604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-14604.
--
Resolution: Incomplete

> Modify design of ML model summaries
> ---
>
> Key: SPARK-14604
> URL: https://issues.apache.org/jira/browse/SPARK-14604
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>  Labels: bulk-closed
>
> Several spark.ml models now have summaries containing evaluation metrics and 
> training info:
> * LinearRegressionModel
> * LogisticRegressionModel
> * GeneralizedLinearRegressionModel
> These summaries have unfortunately been added in an inconsistent way.  I 
> propose to reorganize them to have:
> * For each model, 1 summary (without training info) and 1 training summary 
> (with info from training).  The non-training summary can be produced for a 
> new dataset via {{evaluate}}.
> * A summary should not store the model itself as a public field.
> * A summary should provide a transient reference to the dataset used to 
> produce the summary.
> This task will involve reorganizing the GLM summary (which lacks a 
> training/non-training distinction) and deprecating the model method in the 
> LinearRegressionSummary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23837.
--
Resolution: Incomplete

> Create table as select gives exception if the spark generated alias name 
> contains comma
> ---
>
> Key: SPARK-23837
> URL: https://issues.apache.org/jira/browse/SPARK-23837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: shahid
>Priority: Minor
>  Labels: bulk-closed
>
> When a Spark-generated alias name contains a comma, the Hive metastore throws 
> an exception.
>  
> 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 
> decimal(18,5));
> +------+
> |Result|
> +------+
> +------+
>  No rows selected (0.171 seconds)
>  0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
>  
> +-----------------------------------------------------------+
> |(CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))|
> +-----------------------------------------------------------+
> +-----------------------------------------------------------+
>  No rows selected (0.168 seconds)
>  0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from 
> a;
> Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a 
> column whose name contains commas in Hive metastore. Table: `default`.`b`; 
> Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); 
> (state=,code=0)
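
A minimal sketch of the usual workaround (using the table and column names from the report): alias the derived column explicitly so the comma-bearing generated name never reaches the Hive metastore:

{code:scala}
// Assumes an existing SparkSession `spark` with Hive support enabled.
spark.sql("CREATE TABLE b AS SELECT col1 * col2 AS col1_col2 FROM a")
{code}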



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24837) Add kafka as spark metrics sink

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24837.
--
Resolution: Incomplete

> Add kafka as spark metrics sink
> ---
>
> Key: SPARK-24837
> URL: https://issues.apache.org/jira/browse/SPARK-24837
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Sandish Kumar HN
>Priority: Major
>  Labels: bulk-closed
>
> Sink Spark metrics to a Kafka producer, under 
> spark/core/src/main/scala/org/apache/spark/metrics/sink/.
>  Could someone assign this to me?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24273) Failure while using .checkpoint method to private S3 store via S3A connector

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24273.
--
Resolution: Incomplete

> Failure while using .checkpoint method to private S3 store via S3A connector
> 
>
> Key: SPARK-24273
> URL: https://issues.apache.org/jira/browse/SPARK-24273
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Jami Malikzade
>Priority: Major
>  Labels: bulk-closed
>
> We are getting the following error:
> {code}
> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 416, AWS 
> Service: Amazon S3, AWS Request ID: 
> tx14126-005ae9bfd9-9ed9ac2-default, AWS Error Code: 
> InvalidRange, AWS Error Message: null, S3 Extended Request ID: 
> 9ed9ac2-default-default"
> {code}
> when we use the checkpoint method as below.
> {code}
> val streamBucketDF = streamPacketDeltaDF
>  .filter('timeDelta > maxGap && 'timeDelta <= 3)
>  .withColumn("bucket", when('timeDelta <= mediumGap, "medium")
>  .otherwise("large")
>  )
>  .checkpoint()
> {code}
> Do you have an idea how to prevent the invalid Range header from being sent, 
> or how it can be worked around or fixed?
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25351.
--
Resolution: Incomplete

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: bulk-closed
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23669) Executors fetch jars and name the jars with md5 prefix

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23669.
--
Resolution: Incomplete

> Executors fetch jars and name the jars with md5 prefix
> --
>
> Key: SPARK-23669
> URL: https://issues.apache.org/jira/browse/SPARK-23669
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: jin xing
>Priority: Minor
>  Labels: bulk-closed
>
> In our cluster there are lots of UDF jars, and some of them have the same 
> filename but different paths, for example:
> ```
> hdfs://A/B/udf.jar  -> udfA
> hdfs://C/D/udf.jar -> udfB
> ```
> When a user uses udfA and udfB in the same SQL statement, the executor will fetch 
> both *hdfs://A/B/udf.jar* and *hdfs://C/D/udf.jar*, and there will be a conflict 
> over the same local filename.
>  
> Can we add a configuration to fetch jars and save them under a filename with an 
> MD5 prefix, so that there is no conflict?
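> To illustrate the proposed naming scheme (a sketch of the idea only, not Spark's 
> actual fetch code), the local file name could be prefixed with a hash of the full 
> remote path so that same-named jars from different directories no longer collide:
> {code}
> import hashlib
> import posixpath
> 
> def local_name(remote_path):
>     # Prefix the basename with an MD5 digest of the full remote path.
>     digest = hashlib.md5(remote_path.encode("utf-8")).hexdigest()[:8]
>     return "{}_{}".format(digest, posixpath.basename(remote_path))
> 
> # The two paths get different prefixes, so the local names no longer collide.
> print(local_name("hdfs://A/B/udf.jar"))
> print(local_name("hdfs://C/D/udf.jar"))
> {code}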



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25219.
--
Resolution: Incomplete

> KMeans Clustering - Text Data - Results are incorrect
> -
>
> Key: SPARK-25219
> URL: https://issues.apache.org/jira/browse/SPARK-25219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Vasanthkumar Velayudham
>Priority: Major
>  Labels: bulk-closed
> Attachments: Apache_Logs_Results.xlsx, SKLearn_Kmeans.txt, 
> Spark_Kmeans.txt
>
>
> Hello Everyone,
> I am facing issues with KMeans clustering on my text data. When I apply 
> clustering after performing various transformations such as RegexTokenizer, 
> stop-word removal, HashingTF, and IDF, the generated clusters are not proper and 
> one cluster ends up with a lot of the data points assigned to it.
> I am able to perform clustering with a similar kind of processing and with the 
> same attributes using the SKLearn KMeans algorithm. 
> Searching the internet, I see that many users have reported the same issue with 
> Spark's KMeans clustering library.
> Request your help in fixing this issue.
> Please let me know if you require any additional details.
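> For reference, a PySpark sketch of the pipeline described above (placeholder data 
> and parameters). One common cause of a single dominant cluster is running 
> Euclidean k-means on unnormalized TF-IDF vectors, so a {{Normalizer}} stage may be 
> worth trying:
> {code}
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, HashingTF, IDF, Normalizer
> from pyspark.ml.clustering import KMeans
> 
> # Placeholder documents standing in for the real corpus.
> docs = spark.createDataFrame(
>     [("the quick brown fox",), ("spark kmeans on text data",),
>      ("another text document",), ("yet another document about spark",)], ["text"])
> 
> tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")
> remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
> tf = HashingTF(inputCol="filtered", outputCol="tf", numFeatures=1 << 18)
> idf = IDF(inputCol="tf", outputCol="tfidf")
> norm = Normalizer(inputCol="tfidf", outputCol="features", p=2.0)
> kmeans = KMeans(featuresCol="features", k=2, seed=1)
> 
> model = Pipeline(stages=[tokenizer, remover, tf, idf, norm, kmeans]).fit(docs)
> model.transform(docs).groupBy("prediction").count().show()  # how skewed are the clusters?
> {code}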



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25180) Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25180.
--
Resolution: Incomplete

> Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname 
> fails
> ---
>
> Key: SPARK-25180
> URL: https://issues.apache.org/jira/browse/SPARK-25180
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: mac laptop running on a corporate guest wifi, presumably 
> a wifi with odd DNS settings.
>Reporter: Steve Loughran
>Priority: Minor
>  Labels: bulk-closed
>
> Trying to save work on Spark standalone can fail if the Netty RPC layer cannot 
> determine the local hostname. While that is a valid failure on a real cluster, in 
> standalone mode falling back to localhost rather than the inferred "hw13176.lan" 
> value may be the better option.
> Note also that the abort* call itself failed with an NPE.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12126) JDBC datasource processes filters only commonly pushed down.

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-12126.
--
Resolution: Incomplete

> JDBC datasource processes filters only commonly pushed down.
> 
>
> Key: SPARK-12126
> URL: https://issues.apache.org/jira/browse/SPARK-12126
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: bulk-closed
>
> As suggested 
> [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646],
>  the JDBC datasource currently only processes the filters pushed down from 
> {{DataSourceStrategy}}.
> Unlike ORC or Parquet, JDBC could handle quite a lot of filters (for example, 
> a + b > 3), since the filters only need to be compiled into the SQL query string.
> As 
> [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526],
>  using the {{CatalystScan}} trait might be one of the solutions.
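> As a concrete illustration of the current behaviour (the connection options below 
> are placeholders), {{explain()}} shows which predicates end up in 
> {{PushedFilters}} for a JDBC scan and which stay as a Spark-side {{Filter}}:
> {code}
> # Placeholder connection options; any JDBC-accessible table works here.
> df = (spark.read.format("jdbc")
>       .option("url", "jdbc:postgresql://host:5432/db")
>       .option("dbtable", "events")
>       .option("user", "user")
>       .option("password", "pw")
>       .load())
> 
> # A plain comparison is translated by DataSourceStrategy and shows up in PushedFilters.
> df.filter(df.a > 3).explain()
> 
> # An expression like a + b > 3 has no source-level Filter representation today,
> # so it is evaluated in Spark even though the database could run it.
> df.filter(df.a + df.b > 3).explain()
> {code}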



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20629) Copy shuffle data when nodes are being shut down

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20629.
--
Resolution: Incomplete

> Copy shuffle data when nodes are being shut down
> 
>
> Key: SPARK-20629
> URL: https://issues.apache.org/jira/browse/SPARK-20629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Holden Karau
>Priority: Major
>  Labels: bulk-closed
>
> We decided not to do this for YARN, but on EC2/GCE and similar systems, nodes 
> may be shut down entirely without the ability to keep an AuxiliaryService 
> around.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24293) Serialized shuffle supports mapSideCombine

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24293.
--
Resolution: Incomplete

> Serialized shuffle supports mapSideCombine
> --
>
> Key: SPARK-24293
> URL: https://issues.apache.org/jira/browse/SPARK-24293
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Xianjin YE
>Priority: Major
>  Labels: bulk-closed
>
> While doing research on integrating my company's internal Shuffle Service 
> with Spark, I found it is possible to support mapSideCombine with serialized 
> shuffle.
> The simple idea is that the `UnsafeShuffleWriter` uses a `Combiner` to 
> accumulate records when mapSideCombine is required before inserting into 
> `ShuffleExternalSorter`. The `Combiner` tracks its memory usage or the number of 
> elements accumulated and is never spilled. When the `Combiner` has accumulated 
> enough records (the threshold can vary by strategy), the accumulated (K, C) pairs 
> are inserted into the `ShuffleExternalSorter`. After that, the `Combiner` is 
> reset to an empty state.
> After this change, combined values are sent to the sorter segment by segment, and 
> the `BlockStoreShuffleReader` already handles this case.
> I did a local POC, and it looks like I can get the same result as with the normal 
> SortShuffle. The performance is not optimized yet. The most significant part of 
> the code is shown below:
> {code:java}
> // POC sketch: accumulate (K, V) records in a map-side Combiner and flush the
> // combined (K, C) pairs into the ShuffleExternalSorter in batches.
> while (records.hasNext()) {
>   Product2<K, V> record = records.next();
>   if (this.mapSideCombine) {
>     this.aggregator.accumulateRecord(record);
>     if (this.aggregator.accumulatedKeyNum() >= 160_000) { // fixed threshold, for POC only
>       scala.collection.Iterator<Product2<K, C>> combinedIterator =
>           this.aggregator.accumulatedIterator();
>       while (combinedIterator.hasNext()) {
>         insertRecordIntoSorter(combinedIterator.next());
>       }
>       this.aggregator.resetAccumulation();
>     }
>   } else {
>     insertRecordIntoSorter(record);
>   }
> }
> // Flush whatever is left in the Combiner after the last record.
> if (this.mapSideCombine && this.aggregator.accumulatedKeyNum() > 0) {
>   scala.collection.Iterator<Product2<K, C>> combinedIterator =
>       this.aggregator.accumulatedIterator();
>   while (combinedIterator.hasNext()) {
>     insertRecordIntoSorter(combinedIterator.next());
>   }
>   this.aggregator.resetAccumulation(1);
> }
> {code}
>  
>  Is there something I am missing? cc [~joshrosen] [~cloud_fan] [~XuanYuan]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23740) Add FPGrowth Param for filtering out very common items

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23740.
--
Resolution: Incomplete

> Add FPGrowth Param for filtering out very common items
> --
>
> Key: SPARK-23740
> URL: https://issues.apache.org/jira/browse/SPARK-23740
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>  Labels: bulk-closed
>
> It would be handy to have a Param in FPGrowth for filtering out very common 
> items.  This is from a use case where the dataset had items appearing in 
> 99.9%+ of the rows.  These common items were useless, but they caused the 
> algorithm to generate many unnecessary itemsets.  Filtering useless common 
> items beforehand can make the algorithm much faster.
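> Until such a Param exists, a rough sketch of the manual pre-filtering described 
> above (placeholder data; the 99% threshold is arbitrary):
> {code}
> from pyspark.sql import functions as F
> from pyspark.sql.types import ArrayType, StringType
> from pyspark.ml.fpm import FPGrowth
> 
> # Placeholder transactions; "a" appears in every row and would be filtered out.
> transactions = spark.createDataFrame(
>     [(["a", "b", "c"],), (["a", "b"],), (["a", "c"],)], ["items"])
> 
> n = transactions.count()
> common = (transactions
>           .select(F.explode("items").alias("item"))
>           .groupBy("item").count()
>           .filter(F.col("count") > 0.99 * n))          # items in more than 99% of rows
> common_items = {r["item"] for r in common.collect()}
> 
> drop_common = F.udf(lambda items: [i for i in items if i not in common_items],
>                     ArrayType(StringType()))
> filtered = transactions.withColumn("items", drop_common("items"))
> 
> model = FPGrowth(itemsCol="items", minSupport=0.2, minConfidence=0.5).fit(filtered)
> model.freqItemsets.show()
> {code}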



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15041) adding mode strategy for ml.feature.Imputer for categorical features

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-15041.
--
Resolution: Incomplete

> adding mode strategy for ml.feature.Imputer for categorical features
> 
>
> Key: SPARK-15041
> URL: https://issues.apache.org/jira/browse/SPARK-15041
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>  Labels: bulk-closed
>
> Adding a mode strategy to ml.feature.Imputer for categorical features. This 
> needs to wait until the PR for SPARK-13568 gets merged.
> https://github.com/apache/spark/pull/11601
> From comments of jkbradley and Nick Pentreath in the PR
> {quote}
> Investigate efficiency of approaches using DataFrame/Dataset and/or approx 
> approaches such as frequentItems or Count-Min Sketch (will require an update 
> to CMS to return "heavy-hitters").
> investigate if we can use metadata to only allow mode for categorical 
> features (or perhaps as an easier alternative, allow mode for only Int/Long 
> columns)
> {quote}
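> In the meantime, a rough sketch of the manual equivalent (placeholder data and 
> column names): compute each column's most frequent non-null value and fill with it:
> {code}
> from pyspark.sql import functions as F
> 
> # Placeholder DataFrame and categorical columns.
> df = spark.createDataFrame(
>     [("red", "round"), ("red", None), (None, "square"), ("blue", "round")],
>     ["color", "shape"])
> 
> def column_mode(frame, col_name):
>     # Most frequent non-null value of the column.
>     return (frame.where(F.col(col_name).isNotNull())
>                  .groupBy(col_name).count()
>                  .orderBy(F.desc("count"))
>                  .first()[0])
> 
> modes = {c: column_mode(df, c) for c in ["color", "shape"]}
> imputed = df.fillna(modes)
> imputed.show()
> {code}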



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23858) Need to apply pyarrow adjustments to complex types with DateType/TimestampType

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23858.
--
Resolution: Incomplete

> Need to apply pyarrow adjustments to complex types with 
> DateType/TimestampType 
> ---
>
> Key: SPARK-23858
> URL: https://issues.apache.org/jira/browse/SPARK-23858
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: bulk-closed
>
> Currently, ArrayType columns with DateType or TimestampType elements need the 
> same adjustments as the simple types, e.g. 
> {{_check_series_localize_timestamps}}, and that should work for nested types 
> as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23237) Add UI / endpoint for threaddumps for executors with active tasks

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23237.
--
Resolution: Incomplete

> Add UI / endpoint for threaddumps for executors with active tasks
> -
>
> Key: SPARK-23237
> URL: https://issues.apache.org/jira/browse/SPARK-23237
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: bulk-closed
>
> Frequently, when there are a handful of straggler tasks, users want to know 
> what is going on in those executors running the stragglers.  Currently, that 
> is a bit of a pain to do: you have to go to the page for your active stage, 
> find the task, figure out which executor it's on, then go to the executors 
> page, and get the thread dump.  Or maybe you just go to the executors page, 
> find the executor with an active task, and then click on that, but that 
> doesn't work if you've got multiple stages running.
> Users could figure this out by extracting the info from the stage REST endpoint, 
> but it's such a common thing to do that we should make it easy.
> I realize that figuring out a good way to do this is a little tricky.  We 
> don't want to make it easy to end up pulling thread dumps from 1000 executors 
> back to the driver.  So we've got to come up with a reasonable heuristic for 
> choosing which executors to poll.  And we've also got to find a suitable 
> place to put this.
> My suggestion is that the stage page always has a link to the thread dumps 
> for the *one* executor with the longest running task.  And there would be a 
> corresponding endpoint in the rest api with the same info, maybe at 
> {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/slowestTaskThreadDump}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22728) Unify artifact access for (mesos, standalone and yarn) when HDFS is available

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22728.
--
Resolution: Incomplete

> Unify artifact access for (mesos, standalone and yarn) when HDFS is available
> -
>
> Key: SPARK-22728
> URL: https://issues.apache.org/jira/browse/SPARK-22728
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>  Labels: bulk-closed
>
> A unified cluster layer for caching artifacts would be very useful, along the 
> lines of the work that has been done for Flink: 
> https://issues.apache.org/jira/browse/FLINK-6177
> It would be great to make the Hadoop Distributed Cache available when we 
> deploy jobs in Mesos and Standalone environments. HDFS is often present in many 
> end-to-end applications out there, so we should have an option to use it.
> I am creating this JIRA as a follow up of the discussion here: 
> https://github.com/apache/spark/pull/18587#issuecomment-314718391



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24748.
--
Resolution: Incomplete

> Support for reporting custom metrics via Streaming Query Progress
> -
>
> Key: SPARK-24748
> URL: https://issues.apache.org/jira/browse/SPARK-24748
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Arun Mahadevan
>Priority: Major
>  Labels: bulk-closed
>
> Currently, Structured Streaming sources and sinks do not have a way to 
> report custom metrics. Providing an option to report custom metrics and 
> making them available via the Streaming Query progress would enable sources and 
> sinks to report custom progress information (e.g. the lag metric for the Kafka 
> source).
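> As context only (a sketch using the built-in rate source as a stand-in for a real 
> source), the per-trigger progress objects are where such custom metrics would 
> naturally surface for users:
> {code}
> import time
> 
> stream = spark.readStream.format("rate").load()
> query = stream.writeStream.format("memory").queryName("rates").start()
> 
> time.sleep(5)                       # let at least one trigger complete
> p = query.lastProgress              # dict with 'sources' and 'sink' sections
> print(p["sources"][0] if p else p)  # custom source metrics would be reported here
> query.stop()
> {code}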



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine", "

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22918.
--
Resolution: Incomplete

> sbt test (spark - local) fail after upgrading to 2.2.1 with: 
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
> 
>
> Key: SPARK-22918
> URL: https://issues.apache.org/jira/browse/SPARK-22918
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Damian Momot
>Priority: Major
>  Labels: bulk-closed
>
> After upgrading from 2.2.0 to 2.2.1, the sbt test command in one of my projects 
> started to fail with the following exception:
> {noformat}
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
>   at 
> java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
>   at 
> java.security.AccessController.checkPermission(AccessController.java:884)
>   at 
> org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown
>  Source)
>   at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown 
> Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at java.lang.Class.newInstance(Class.java:442)
>   at 
> org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
>   at 
> org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325)
>   at 
> org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282)
>   at 
> org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:240)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:286)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at 
> org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
>   at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
>   at 
> or

[jira] [Resolved] (SPARK-21084) Improvements to dynamic allocation for notebook use cases

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21084.
--
Resolution: Incomplete

> Improvements to dynamic allocation for notebook use cases
> -
>
> Key: SPARK-21084
> URL: https://issues.apache.org/jira/browse/SPARK-21084
> Project: Spark
>  Issue Type: Umbrella
>  Components: Block Manager, Scheduler, Spark Core, YARN
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Frederick Reiss
>Priority: Major
>  Labels: bulk-closed
>
> One important application of Spark is to support many notebook users with a 
> single YARN or Spark Standalone cluster.  We at IBM have seen this 
> requirement across multiple deployments of Spark: on-premises and private 
> cloud deployments at our clients, as well as on the IBM cloud.  The scenario 
> goes something like this: "Every morning at 9am, 500 analysts log into their 
> computers and start running Spark notebooks intermittently for the next 8 
> hours." I'm sure that many other members of the community are interested in 
> making similar scenarios work.
> 
> Dynamic allocation is supposed to support these kinds of use cases by 
> shifting cluster resources towards users who are currently executing scalable 
> code.  In our own testing, we have encountered a number of issues with using 
> the current implementation of dynamic allocation for this purpose:
> *Issue #1: Starvation.* A Spark job acquires all available containers, 
> preventing other jobs or applications from starting.
> *Issue #2: Request latency.* Jobs that would normally finish in less than 30 
> seconds take 2-4x longer than normal with dynamic allocation.
> *Issue #3: Unfair resource allocation due to cached data.* Applications that 
> have cached RDD partitions hold onto executors indefinitely, denying those 
> resources to other applications.
> *Issue #4: Loss of cached data leads to thrashing.*  Applications repeatedly 
> lose partitions of cached RDDs because the underlying executors are removed; 
> the applications then need to rerun expensive computations.
> 
> This umbrella JIRA covers efforts to address these issues by making 
> enhancements to Spark.
> Here's a high-level summary of the current planned work:
> * [SPARK-21097]: Preserve an executor's cached data when shutting down the 
> executor.
> * [SPARK-21122]: Make Spark give up executors in a controlled fashion when 
> the RM indicates it is running low on capacity.
> * (JIRA TBD): Reduce the delay for dynamic allocation to spin up new 
> executors.
> Note that this overall plan is subject to change, and other members of the 
> community should feel free to suggest changes and to help out.
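> For context, a sketch of the existing configuration knobs that partially touch 
> issues #1-#3 today (the keys are existing configs; the values are illustrative 
> only, and none of them fully solve the issues above):
> {code}
> from pyspark.sql import SparkSession
> 
> builder = SparkSession.builder
> for key, value in {
>     "spark.dynamicAllocation.enabled": "true",
>     "spark.shuffle.service.enabled": "true",
>     "spark.dynamicAllocation.maxExecutors": "50",                  # bounds starvation (#1)
>     "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",       # request latency (#2)
>     "spark.dynamicAllocation.cachedExecutorIdleTimeout": "30min",  # cached-data holding (#3)
> }.items():
>     builder = builder.config(key, value)
> spark = builder.getOrCreate()
> {code}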



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23631) Add summary to RandomForestClassificationModel

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23631.
--
Resolution: Incomplete

> Add summary to RandomForestClassificationModel
> --
>
> Key: SPARK-23631
> URL: https://issues.apache.org/jira/browse/SPARK-23631
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Evan Zamir
>Priority: Major
>  Labels: bulk-closed
>
> I'm using the RandomForestClassificationModel and noticed that there is no 
> summary attribute like there is for LogisticRegressionModel. Specifically, 
> I'd like to have the ROC and PR curves. Is that on the Spark roadmap 
> anywhere? Is there a reason it hasn't been implemented?
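> While there is no {{summary}}, a rough sketch of the manual route in the meantime 
> (tiny placeholder data below) is to score a held-out set and use 
> {{BinaryClassificationEvaluator}} for the areas under the ROC and PR curves; the 
> full curves themselves are still not exposed for this model in Python:
> {code}
> from pyspark.ml.classification import RandomForestClassifier
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
> 
> # Tiny placeholder dataset; replace with the real features/label DataFrame.
> data = spark.createDataFrame(
>     [(Vectors.dense([0.0, 1.0]), 0.0), (Vectors.dense([1.0, 0.0]), 1.0)] * 20,
>     ["features", "label"])
> train, test = data.randomSplit([0.8, 0.2], seed=7)
> 
> model = RandomForestClassifier(labelCol="label", featuresCol="features").fit(train)
> scored = model.transform(test)
> 
> evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
>                                           labelCol="label")
> print(evaluator.evaluate(scored, {evaluator.metricName: "areaUnderROC"}))
> print(evaluator.evaluate(scored, {evaluator.metricName: "areaUnderPR"}))
> {code}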



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24939) Support YARN Shared Cache in Spark

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24939.
--
Resolution: Incomplete

> Support YARN Shared Cache in Spark
> --
>
> Key: SPARK-24939
> URL: https://issues.apache.org/jira/browse/SPARK-24939
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.1
>Reporter: Jonathan Bender
>Priority: Minor
>  Labels: bulk-closed
>
> https://issues.apache.org/jira/browse/YARN-1492 introduced support for the 
> YARN Shared Cache, which when configured allows clients to cache submitted 
> application resources (jars, archives) in HDFS and avoid having to re-upload 
> them for successive jobs. MapReduce YARN applications support this feature, 
> it would be great to add support for it in Spark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24049) Add a feature to not start speculative tasks when average task duration is less than a configurable absolute number

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24049.
--
Resolution: Incomplete

> Add a feature to not start speculative tasks when average task duration is 
> less than a configurable absolute number
> ---
>
> Key: SPARK-24049
> URL: https://issues.apache.org/jira/browse/SPARK-24049
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Lars Francke
>Priority: Minor
>  Labels: bulk-closed
>
> Speculation currently has four different configuration options according to 
> the docs:
> * spark.speculation to turn it on or off
> * .interval - how often it'll check
> * .multiplier - how many times slower than average a task must be
> * .quantile - fraction of tasks that have to be completed before speculation 
> starts
> What I'd love to see is a feature that allows me to set a minimum average 
> duration. We are using a Jobserver to submit varying queries to Spark. Some 
> of those have stages with tasks of an average length of ~20ms (yes, not ideal, 
> but it happens). So with a multiplier of 2, a speculative task starts after 
> 40ms. This happens quite a lot, and in our use case it doesn't make sense. We 
> don't consider 40ms tasks "stragglers".
> So it would make sense to add a parameter to only start speculation when the 
> average task time is greater than X seconds/minutes/millis.
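> For reference, the four existing knobs in one place (the values below are 
> illustrative; the proposed minimum average duration would be an additional 
> setting alongside them):
> {code}
> from pyspark.sql import SparkSession
> 
> spark = (SparkSession.builder
>          .config("spark.speculation", "true")
>          .config("spark.speculation.interval", "100ms")  # how often it checks
>          .config("spark.speculation.multiplier", "2")    # how many times slower a task must be
>          .config("spark.speculation.quantile", "0.75")   # fraction of tasks that must finish first
>          .getOrCreate())
> {code}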



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24689) java.io.NotSerializableException: org.apache.spark.mllib.clustering.DistributedLDAModel

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24689.
--
Resolution: Incomplete

> java.io.NotSerializableException: 
> org.apache.spark.mllib.clustering.DistributedLDAModel
> ---
>
> Key: SPARK-24689
> URL: https://issues.apache.org/jira/browse/SPARK-24689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.1
> Environment: !image-2018-06-29-13-42-30-255.png!
>Reporter: konglingbo
>Priority: Major
>  Labels: bulk-closed
> Attachments: @CLZ98635A644[_edx...@e.png
>
>
> {code}
> scala> val predictionAndLabels = testing.map { case LabeledPoint(label, features) =>
>      | val prediction = model.predict(features)
>      | (prediction, label)
>      | }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23181) Add compatibility tests for SHS serialized data / disk format

2019-10-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23181.
--
Resolution: Incomplete

> Add compatibility tests for SHS serialized data / disk format
> -
>
> Key: SPARK-23181
> URL: https://issues.apache.org/jira/browse/SPARK-23181
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Masiero Vanzin
>Priority: Major
>  Labels: bulk-closed
>
> The SHS in 2.3.0 has the ability to serialize history data to disk (see 
> SPARK-18085 and its sub-tasks). This means that if either the serialized data 
> or the disk format changes, the code needs to be modified to either support 
> the old formats, or discard the old data (and re-create it from logs).
> We should add integration tests that help us detect whether one of these 
> changes has occurred. They should check data generated by old versions of 
> Spark and fail if that data cannot be read back.
> The Hive suites recently added the ability to download old Spark versions and 
> generate data from those old versions to test that new code can read it; we 
> could use something similar to test this (starting once 2.3.0 is 
> released).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


