[jira] [Resolved] (SPARK-22964) don't allow task restarts for continuous processing
[ https://issues.apache.org/jira/browse/SPARK-22964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22964. -- Resolution: Incomplete > don't allow task restarts for continuous processing > --- > > Key: SPARK-22964 > URL: https://issues.apache.org/jira/browse/SPARK-22964 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >Priority: Major > Labels: bulk-closed > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24266) Spark client terminates while driver is still running
[ https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24266. -- Resolution: Incomplete > Spark client terminates while driver is still running > - > > Key: SPARK-24266 > URL: https://issues.apache.org/jira/browse/SPARK-24266 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Chun Chen >Priority: Major > Labels: bulk-closed > > {code} > Warning: Ignoring non-spark config property: Default=system properties > included when running spark-submit. > 18/05/11 14:50:12 WARN Config: Error reading service account token from: > [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring. > 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: > Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf) > 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. > Mounting Hadoop specific files > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: N/A >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: 2018-05-11T06:50:17Z >container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9 >phase: Pending >status: [ContainerStatus(containerID=null, > image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, > lastState=ContainerState(running=null, terminated=null, waiting=null, > additionalProperties={}), name=spark-kubernetes-driver, ready=false, > restartCount=0, 
state=ContainerState(running=null, terminated=null, > waiting=ContainerStateWaiting(message=null, reason=PodInitializing, > additionalProperties={}), additionalProperties={}), additionalProperties={})] > 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to > finish... > 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start ti
[jira] [Resolved] (SPARK-24729) Spark - stackoverflow error - org.apache.spark.sql.catalyst.plans.QueryPlan
[ https://issues.apache.org/jira/browse/SPARK-24729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24729. -- Resolution: Incomplete > Spark - stackoverflow error - org.apache.spark.sql.catalyst.plans.QueryPlan > --- > > Key: SPARK-24729 > URL: https://issues.apache.org/jira/browse/SPARK-24729 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.1 >Reporter: t oo >Priority: Major > Labels: bulk-closed > > Ran a spark (v2.1.1) job that joins 2 rdds (one is .txt file from S3, another > is parquet from S3) the job then merges the dataset (ie get latest row per > PK, if PK exists in txt and parquet then take the row from the .txt) and > writes out a new parquet to S3. Got this error but upon re-running it worked > fine. Both the .txt and parquet have 302 columns. The .txt has 191 rows, the > parquet has 156300 rows. Does anyone know the cause? > > {code:java} > > 18/07/02 13:51:56 INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID > 134, 10.160.122.226, executor 0, partition 0, PROCESS_LOCAL, 6337 bytes) > 18/07/02 13:51:56 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory > on 10.160.122.226:38011 (size: 27.2 KB, free: 4.6 GB) > 18/07/02 13:51:56 INFO TaskSetManager: Finished task 0.0 in stage 14.0 (TID > 134) in 295 ms on 10.160.122.226 (executor 0) (1/1) > 18/07/02 13:51:56 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks > have all completed, from pool > 18/07/02 13:51:56 INFO DAGScheduler: ResultStage 14 (load at Data.scala:25) > finished in 0.295 s > 18/07/02 13:51:56 INFO DAGScheduler: Job 7 finished: load at Data.scala:25, > took 0.310932 s > 18/07/02 13:51:57 INFO FileSourceStrategy: Pruning directories with: > 18/07/02 13:51:57 INFO FileSourceStrategy: Post-Scan Filters: > 18/07/02 13:51:57 INFO FileSourceStrategy: Output Data Schema: struct string, created: timestamp, created_by: string, last_upd: timestamp, > last_upd_by: string ... 300 more fields> > 18/07/02 13:51:57 INFO FileSourceStrategy: Pushed Filters: > 18/07/02 13:51:57 INFO MemoryStore: Block broadcast_19 stored as values in > memory (estimated size 387.2 KB, free 911.2 MB) > 18/07/02 13:51:57 INFO MemoryStore: Block broadcast_19_piece0 stored as bytes > in memory (estimated size 33.7 KB, free 911.1 MB) > 18/07/02 13:51:57 INFO BlockManagerInfo: Added broadcast_19_piece0 in memory > on 10.160.123.242:38105 (size: 33.7 KB, free: 912.2 MB) > 18/07/02 13:51:57 INFO SparkContext: Created broadcast 19 from cache at > Upsert.scala:25 > 18/07/02 13:51:57 INFO FileSourceScanExec: Planning scan with bin packing, > max size: 48443541 bytes, open cost is considered as scanning 4194304 bytes. 
> 18/07/02 13:51:57 INFO SparkContext: Starting job: take at Utils.scala:28 > 18/07/02 13:51:57 INFO DAGScheduler: Got job 8 (take at Utils.scala:28) with > 1 output partitions > 18/07/02 13:51:57 INFO DAGScheduler: Final stage: ResultStage 15 (take at > Utils.scala:28) > 18/07/02 13:51:57 INFO DAGScheduler: Parents of final stage: List() > 18/07/02 13:51:57 INFO DAGScheduler: Missing parents: List() > 18/07/02 13:51:57 INFO DAGScheduler: Submitting ResultStage 15 > (MapPartitionsRDD[65] at take at Utils.scala:28), which has no missing parents > 18/07/02 13:51:57 INFO MemoryStore: Block broadcast_20 stored as values in > memory (estimated size 321.5 KB, free 910.8 MB) > 18/07/02 13:51:57 INFO MemoryStore: Block broadcast_20_piece0 stored as bytes > in memory (estimated size 93.0 KB, free 910.7 MB) > 18/07/02 13:51:57 INFO BlockManagerInfo: Added broadcast_20_piece0 in memory > on 10.160.123.242:38105 (size: 93.0 KB, free: 912.1 MB) > 18/07/02 13:51:57 INFO SparkContext: Created broadcast 20 from broadcast at > DAGScheduler.scala:996 > 18/07/02 13:51:57 INFO DAGScheduler: Submitting 1 missing tasks from > ResultStage 15 (MapPartitionsRDD[65] at take at Utils.scala:28) > 18/07/02 13:51:57 INFO TaskSchedulerImpl: Adding task set 15.0 with 1 tasks > 18/07/02 13:51:57 INFO TaskSetManager: Starting task 0.0 in stage 15.0 (TID > 135, 10.160.122.226, executor 0, partition 0, PROCESS_LOCAL, 9035 bytes) > 18/07/02 13:51:57 INFO BlockManagerInfo: Added broadcast_20_piece0 in memory > on 10.160.122.226:38011 (size: 93.0 KB, free: 4.6 GB) > 18/07/02 13:51:57 INFO BlockManagerInfo: Added broadcast_19_piece0 in memory > on 10.160.122.226:38011 (size: 33.7 KB, free: 4.6 GB) > 18/07/02 13:52:05 INFO BlockManagerInfo: Added rdd_61_0 in memory on > 10.160.122.226:38011 (size: 38.9 MB, free: 4.5 GB) > 18/07/02 13:52:09 INFO BlockManagerInfo: Added rdd_63_0 in memory on > 10.160.122.226:38011 (size: 38.9 MB, free: 4.5 GB) > 18/07/02 13:52:09 INFO TaskSetManager: Finished task 0.0 in stage 15.0 (TID > 135) in 11751 ms on 10.1
[jira] [Resolved] (SPARK-21536) Remove the workaround to allow dots in field names in R's createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-21536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21536. -- Resolution: Incomplete > Remove the workaround to allow dots in field names in R's createDataFrame > --- > > Key: SPARK-21536 > URL: https://issues.apache.org/jira/browse/SPARK-21536 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: bulk-closed > > We are currently converting dots to underscores in some cases in SparkR, as shown below > {code} > > createDataFrame(list(list(1)), "a.a") > SparkDataFrame[a_a:double] > ... > In FUN(X[[i]], ...) : Use a_a instead of a.a as column name > {code} > {code} > > createDataFrame(list(list(a.a = 1))) > SparkDataFrame[a_a:double] > ... > In FUN(X[[i]], ...) : Use a_a instead of a.a as column name > {code} > This workaround was originally introduced because of SPARK-2775, but that issue is now fixed by SPARK-6898. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23839) consider bucket join in cost-based JoinReorder rule
[ https://issues.apache.org/jira/browse/SPARK-23839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23839. -- Resolution: Incomplete > consider bucket join in cost-based JoinReorder rule > --- > > Key: SPARK-23839 > URL: https://issues.apache.org/jira/browse/SPARK-23839 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiaoju Wu >Priority: Minor > Labels: bulk-closed > > The cost-based JoinReorder rule was implemented in Spark 2.2 and improved with histograms in Spark 2.3. However, it does not take into account the cost of the different join implementations. For example: > TableA JOIN TableB JOIN TableC > TableA will output 10,000 rows after filter and projection. > TableB will output 10,000 rows after filter and projection. > TableC will output 8,000 rows after filter and projection. > The current JoinReorder rule will possibly optimize the plan to join TableC with TableA first and then TableB. But if TableA and TableB are bucketed tables to which a bucket join can be applied, it could be a different story. > > Also, to support bucket joins of more than 2 tables when one table's bucket number is a multiple of another's (SPARK-17570), whether the bucket join can take effect depends on the result of JoinReorder. For example, for "A join B join C" with bucket numbers 8, 4 and 12, the JoinReorder rule should keep the order "A join B join C" so that the bucket join can take effect, instead of producing "C join A join B" (see the sketch below). > > Based on the current CBO JoinReorder, there are possibly two parts to be changed: > # The CostBasedJoinReorder rule is applied in the optimizer phase, while join selection is done in the planner phase and bucket join optimization in EnsureRequirements, which is in the preparation phase. Both come after the optimizer. > # The current statistics and join cost formula are based on data selectivity and cardinality; we need to add statistics that represent the cost of the join method (shuffle, sort, hash, etc.) and incorporate them into the formula used to estimate the join cost. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
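As an illustrative sketch (the table names a, b, c and the join key k are hypothetical, not from the original report), the bucket layout in the example above can be set up with the DataFrameWriter bucketing API; a cost-based reorder that ignores bucketing can then discard the shuffle-free plan:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucket-join-reorder-sketch").getOrCreate()
import spark.implicits._

// Hypothetical bucketed tables with the bucket counts from the example above (8, 4, 12).
Seq((1, "a1")).toDF("k", "v").write.bucketBy(8, "k").sortBy("k").saveAsTable("a")
Seq((1, "b1")).toDF("k", "v").write.bucketBy(4, "k").sortBy("k").saveAsTable("b")
Seq((1, "c1")).toDF("k", "v").write.bucketBy(12, "k").sortBy("k").saveAsTable("c")

// With SPARK-17570, "a JOIN b" could avoid a shuffle because 8 is a multiple of 4;
// a cost-based reorder that joins c with a first would lose that opportunity.
spark.sql("SELECT * FROM a JOIN b ON a.k = b.k JOIN c ON b.k = c.k").explain()
{code}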
[jira] [Resolved] (SPARK-24656) SparkML Transformers and Estimators with multiple columns
[ https://issues.apache.org/jira/browse/SPARK-24656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24656. -- Resolution: Incomplete > SparkML Transformers and Estimators with multiple columns > - > > Key: SPARK-24656 > URL: https://issues.apache.org/jira/browse/SPARK-24656 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 2.3.1 >Reporter: Michael Dreibelbis >Priority: Major > Labels: bulk-closed > > Currently SparkML Transformers and Estimators operate on single input/output > column pairs. This makes pipelines extremely cumbersome (as well as > non-performant) when transformations on multiple columns needs to be made. > > I am proposing to implement ParallelPipelineStage/Transformer/Estimator/Model > that would operate on the input columns in parallel. > > {code:java} > // old way > val pipeline = new Pipeline().setStages(Array( > new CountVectorizer().setInputCol("_1").setOutputCol("_1_cv"), > new CountVectorizer().setInputCol("_2").setOutputCol("_2_cv"), > new IDF().setInputCol("_1_cv").setOutputCol("_1_idf"), > new IDF().setInputCol("_2_cv").setOutputCol("_2_idf") > )) > // proposed way > val pipeline2 = new Pipeline().setStages(Array( > new ParallelCountVectorizer().setInputCols(Array("_1", > "_2")).setOutputCols(Array("_1_cv", "_2_cv")), > new ParallelIDF().setInputCols(Array("_1_cv", > "_2_cv")).setOutputCols(Array("_1_idf", "_2_idf")) > )) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24301) Add Instrumentation test coverage
[ https://issues.apache.org/jira/browse/SPARK-24301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24301. -- Resolution: Incomplete > Add Instrumentation test coverage > - > > Key: SPARK-24301 > URL: https://issues.apache.org/jira/browse/SPARK-24301 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Lu Wang >Priority: Minor > Labels: bulk-closed > > Spark has no coverage for Logging. It will be good to add some > instrumentation coverage test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23537) Logistic Regression without standardization
[ https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23537. -- Resolution: Incomplete > Logistic Regression without standardization > --- > > Key: SPARK-23537 > URL: https://issues.apache.org/jira/browse/SPARK-23537 > Project: Spark > Issue Type: Bug > Components: ML, Optimizer >Affects Versions: 2.0.2, 2.2.1 >Reporter: Jordi >Priority: Major > Labels: bulk-closed > Attachments: non-standardization.log, standardization.log > > > I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer > to not use standardization since all my features are binary, using the > hashing trick (2^20 sparse vector). > I trained two models to compare results, I've been expecting to end with two > similar models since it seems that internally the optimizer performs > standardization and "de-standardization" (when it's deactivated) in order to > improve the convergence. > Here you have the code I used: > {code:java} > val lr = new org.apache.spark.ml.classification.LogisticRegression() > .setRegParam(0.05) > .setElasticNetParam(0.0) > .setFitIntercept(true) > .setMaxIter(5000) > .setStandardization(false) > val model = lr.fit(data) > {code} > The results are disturbing me, I end with two significantly different models. > *Standardization:* > Training time: 8min. > Iterations: 37 > Intercept: -4.386090107224499 > Max weight: 4.724752299455218 > Min weight: -3.560570478164854 > Mean weight: -0.049325201841722795 > l1 norm: 116710.39522171849 > l2 norm: 402.2581552373957 > Non zero weights: 128084 > Non zero ratio: 0.12215042114257812 > Last 10 LBFGS Val and Grand Norms: > {code:java} > 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) > 0.000559057 > 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) > 0.000267527 > 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) > 0.000205888 > 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) > 0.000144173 > 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) > 0.000140296 > 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) > 0.000122709 > 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) > 3.08789e-05 > 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) > 2.23806e-05 > 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) > 1.47422e-05 > 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) > 2.37442e-05 > {code} > *No standardization:* > Training time: 7h 14 min. 
> Iterations: 4992 > Intercept: -4.216690468849263 > Max weight: 0.41930559767624725 > Min weight: -0.5949182537565524 > Mean weight: -1.2659769019012E-6 > l1 norm: 14.262025330648694 > l2 norm: 1.2508777025612263 > Non zero weights: 128955 > Non zero ratio: 0.12298107147216797 > Last 10 LBFGS Val and Grand Norms: > {code:java} > 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) > 0.217581 > 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) > 0.185812 > 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) > 0.214570 > 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) > 0.489464 > 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) > 0.178448 > 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) > 0.172527 > 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.38e-07) > 0.189389 > 18/02/28 00:29:31 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.13e-07) > 0.480678 > 18/02/28 00:29:36 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.75e-07) > 0.184529 > 18/02/28 00:29:41 INFO LBFGS: Val and Grad Norm: 0.559319 (rel: 8.90e-08) > 0.154329 > {code} > Am I missing something? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25020) Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-25020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25020. -- Resolution: Incomplete > Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8 > -- > > Key: SPARK-25020 > URL: https://issues.apache.org/jira/browse/SPARK-25020 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1 > Environment: Spark Streaming > -- Tested on 2.2 & 2.3 (more than likely affects all versions with graceful > shutdown) > Hadoop 2.8 >Reporter: Ricky Saltzer >Priority: Major > Labels: bulk-closed > > Opening this up to give you guys some insight in an issue that will occur > when using Spark Streaming with Hadoop 2.8. > Hadoop 2.8 added HADOOP-12950 which adds a upper limit of a 10 second timeout > for its shutdown hook. From our tests, if the Spark job takes longer than 10 > seconds to shutdown gracefully, the Hadoop shutdown thread seems to trample > over the graceful shutdown and throw an exception like > {code:java} > 18/08/03 17:21:04 ERROR Utils: Uncaught exception in thread pool-1-thread-1 > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) > at > org.apache.spark.streaming.scheduler.ReceiverTracker.stop(ReceiverTracker.scala:177) > at > org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114) > at > org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:682) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317) > at > org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:681) > at > org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:715) > at > org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:599) > at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code} > The reason I hit this issue is because we recently upgraded to EMR 5.15, > which has both Spark 2.3 & Hadoop 2.8. The following workaround has proven > successful to us (in limited testing) > Instead of just running > {code:java} > ... > ssc.start() > ssc.awaitTermination(){code} > We needed to do the following > {code:java} > ... > ssc.start() > sys.ShutdownHookThread { > ssc.stop(true, true) > } > ssc.awaitTermination(){code} > As far as I can tell, there is no way to override the default {{10 second}} > timeout in HADOOP-12950, which is why we had to go with the workaround. > Note: I also verified this bug exists even with EMR 5.12.1 which runs Spark > 2.2.x & Hadoop 2.8. > Ricky > Epic Games -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them
[ https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24357. -- Resolution: Incomplete > createDataFrame in Python infers large integers as long type and then fails > silently when converting them > - > > Key: SPARK-24357 > URL: https://issues.apache.org/jira/browse/SPARK-24357 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Joel Croteau >Priority: Major > Labels: bulk-closed > > When inferring the schema type of an RDD passed to createDataFrame, PySpark > SQL will infer any integral type as a LongType, which is a 64-bit integer, > without actually checking whether the values will fit into a 64-bit slot. If > the values are larger than 64 bits, then when pickled and unpickled in Java, > Unpickler will convert them to BigIntegers. When applySchemaToPythonRDD is > called, it will ignore the BigInteger type and return Null. This results in > any large integers in the resulting DataFrame being silently converted to > None. This can create some very surprising and difficult to debug behavior, > in particular if you are not aware of this limitation. There should either be > a runtime error at some point in this conversion chain, or else _infer_type > should infer larger integers as DecimalType with appropriate precision, or as > BinaryType. The former would be less convenient, but the latter may be > problematic to implement in practice. In any case, we should stop silently > converting large integers to None. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15867) Use bucket files for TABLESAMPLE BUCKET
[ https://issues.apache.org/jira/browse/SPARK-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15867. -- Resolution: Incomplete > Use bucket files for TABLESAMPLE BUCKET > --- > > Key: SPARK-15867 > URL: https://issues.apache.org/jira/browse/SPARK-15867 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Andrew Or >Priority: Major > Labels: bulk-closed > > {code} > SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16) > {code} > In Hive, this would select the 3rd bucket out of every 16 buckets there are > in the table. E.g. if the table was clustered by 32 buckets then this would > sample the 3rd and the 19th bucket. (See > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling) > In Spark, however, we simply sample 3/16 of the number of input rows. > Either we don't support it in Spark or do it in a way that's consistent with > Hive. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
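To make the Hive semantics described above concrete, here is a small sketch (the helper name is made up) that computes which physical buckets {{TABLESAMPLE (BUCKET x OUT OF y)}} would read from a table clustered into n buckets, assuming n is a multiple of y as in the Hive documentation:

{code:scala}
// Buckets (1-indexed) that Hive reads for TABLESAMPLE (BUCKET x OUT OF y)
// on a table clustered into n buckets, assuming n is a multiple of y.
def sampledBuckets(x: Int, y: Int, n: Int): Seq[Int] = (x to n by y)

// The example from the description: BUCKET 3 OUT OF 16 on a table with 32 buckets
// samples the 3rd and the 19th bucket.
assert(sampledBuckets(3, 16, 32) == Seq(3, 19))

// When the table's bucket count equals y, exactly one bucket file is read.
assert(sampledBuckets(3, 16, 16) == Seq(3))
{code}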
[jira] [Resolved] (SPARK-23996) Implement the optimal KLL algorithms for quantiles in streams
[ https://issues.apache.org/jira/browse/SPARK-23996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23996. -- Resolution: Incomplete > Implement the optimal KLL algorithms for quantiles in streams > - > > Key: SPARK-23996 > URL: https://issues.apache.org/jira/browse/SPARK-23996 > Project: Spark > Issue Type: Improvement > Components: MLlib, SQL >Affects Versions: 2.3.0 >Reporter: Timothy Hunter >Priority: Major > Labels: bulk-closed > > The current implementation for approximate quantiles - a variant of Greenwald-Khanna, which I implemented - is not the best in light of recent papers: > - it is not exactly the one from the paper, for performance reasons, but the changes are not documented beyond comments in the code > - there are now more optimal algorithms with proven bounds (unlike q-digest, the other contender at the time) > I propose that we revisit the current implementation and look at the Karnin-Lang-Liberty (KLL) algorithm, for example: > [https://arxiv.org/abs/1603.05346] > [https://edoliberty.github.io//papers/streamingQuantiles.pdf] > This algorithm seems to have favorable characteristics for streaming and a distributed implementation, and there is a Python implementation for reference. > It is a fairly standalone piece, and in that respect accessible to people who don't know too much about Spark internals. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
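For context, the implementation discussed above is exposed through {{DataFrameStatFunctions.approxQuantile}}; a minimal usage sketch (the column name and data are made up) showing the user-facing API that a KLL-based sketch would need to preserve:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("approx-quantile-sketch").getOrCreate()
import spark.implicits._

val df = (1 to 100000).map(_.toDouble).toDF("x")

// relativeError trades accuracy against the size of the sketch kept per column.
val quantiles = df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.01)
println(quantiles.mkString(", "))
{code}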
[jira] [Resolved] (SPARK-23368) Avoid unnecessary Exchange or Sort after projection
[ https://issues.apache.org/jira/browse/SPARK-23368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23368. -- Resolution: Incomplete > Avoid unnecessary Exchange or Sort after projection > --- > > Key: SPARK-23368 > URL: https://issues.apache.org/jira/browse/SPARK-23368 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wei Xue >Priority: Minor > Labels: bulk-closed > > After column rename projection, the ProjectExec's outputOrdering and > outputPartitioning should reflect the projected columns as well. For example, > {code:java} > SELECT b1 > FROM ( > SELECT a a1, b b1 > FROM testData2 > ORDER BY a > ) > ORDER BY a1{code} > The inner query is ordered on a1 as well. If we had a rule to eliminate Sort > on sorted result, together with this fix, the order-by in the outer query > could have been optimized out. > > Similarly, the below query > {code:java} > SELECT * > FROM ( > SELECT t1.a a1, t2.a a2, t1.b b1, t2.b b2 > FROM testData2 t1 > LEFT JOIN testData2 t2 > ON t1.a = t2.a > ) > JOIN testData2 t3 > ON a1 = t3.a{code} > is equivalent to > {code:java} > SELECT * > FROM testData2 t1 > LEFT JOIN testData2 t2 > ON t1.a = t2.a > JOIN testData2 t3 > ON t1.a = t3.a{code} > , so the unnecessary sorting and hash-partitioning that have been optimized > out for the second query should have be eliminated in the first query as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23730) Save and expose "in bag" tracking for random forest model
[ https://issues.apache.org/jira/browse/SPARK-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23730. -- Resolution: Incomplete > Save and expose "in bag" tracking for random forest model > - > > Key: SPARK-23730 > URL: https://issues.apache.org/jira/browse/SPARK-23730 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Julian King >Priority: Minor > Labels: bulk-closed > > In a random forest model, it is often useful to be able to keep track of > which samples ended up in each of the bootstrap replications (and how many > times this happened). For instance, in the R randomForest package this is > accomplished through the option keep.inbag=TRUE > Similar functionality in Spark ML's random forest would be helpful -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
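A sketch of the kind of per-tree "in bag" count matrix the request asks to expose (this is not an existing Spark API; the helper names are made up). Spark's random forest approximates bootstrap sampling with Poisson-distributed weights, which the sketch mimics with Knuth's sampler:

{code:scala}
import scala.util.Random

// Knuth's Poisson sampler; sufficient for an illustration.
def poisson(lambda: Double, rng: Random): Int = {
  val limit = math.exp(-lambda)
  var k = 0
  var p = 1.0
  do { k += 1; p *= rng.nextDouble() } while (p > limit)
  k - 1
}

// inBag(i)(t) = how many times sample i appears in the bootstrap replicate of tree t,
// analogous to what R's randomForest returns with keep.inbag = TRUE.
def inBagCounts(numSamples: Int, numTrees: Int, subsamplingRate: Double = 1.0,
                seed: Long = 42L): Array[Array[Int]] = {
  val rng = new Random(seed)
  Array.fill(numSamples)(Array.fill(numTrees)(poisson(subsamplingRate, rng)))
}

inBagCounts(numSamples = 5, numTrees = 3).foreach(row => println(row.mkString(" ")))
{code}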
[jira] [Resolved] (SPARK-24473) There is no need to clip the predicted value by maxValue and minValue when computing the gradient in the SVDPlusPlus model
[ https://issues.apache.org/jira/browse/SPARK-24473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24473. -- Resolution: Incomplete > There is no need to clip the predicted value by maxValue and minValue when > computing the gradient in the SVDPlusPlus model > > > Key: SPARK-24473 > URL: https://issues.apache.org/jira/browse/SPARK-24473 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.3.0 >Reporter: caijianming >Priority: Major > Labels: bulk-closed > > I think there is no need to clip the predicted value. Clipping turns the convex loss function into a non-convex one, which might have a bad influence on convergence. Other well-known recommender systems and the original paper also do not include this step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5572) LDA improvement listing
[ https://issues.apache.org/jira/browse/SPARK-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-5572. - Resolution: Incomplete > LDA improvement listing > --- > > Key: SPARK-5572 > URL: https://issues.apache.org/jira/browse/SPARK-5572 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed > > This is an umbrella JIRA for listing planned improvements to Latent Dirichlet > Allocation (LDA). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22415) lint-r fails if lint-r.R installs any new packages
[ https://issues.apache.org/jira/browse/SPARK-22415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22415. -- Resolution: Incomplete > lint-r fails if lint-r.R installs any new packages > -- > > Key: SPARK-22415 > URL: https://issues.apache.org/jira/browse/SPARK-22415 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 > Environment: OSX 10.12.6 R 3.4.2 >Reporter: Joel Croteau >Priority: Minor > Labels: bulk-closed > > The dev/lint-r script checks for lint failures by seeing if anything is > output to stdout by lint-r.R. Since package installations will often produce > output to stdout, this will cause it to report a failure even if lint-r.R > succeeded. This would also mean that if there were a failure further up in > lint-r.R that output a message to stderr, it would not detect this and think > lint-r.R had succeeded. It would be preferable for lint-r.R to output lints > to stderr and for lint-r to check that and/or for lint-r.R to return a > failure code if there are any lints, which lint-r could check for. I will > write a patch to do this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23987) Unused mailing lists
[ https://issues.apache.org/jira/browse/SPARK-23987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23987. -- Resolution: Incomplete > Unused mailing lists > > > Key: SPARK-23987 > URL: https://issues.apache.org/jira/browse/SPARK-23987 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Sebb >Priority: Major > Labels: bulk-closed > > The following mailing lists were set up in Jan 2015 but have not been used, > and don't appear to be mentioned on the website: > ml@ > sql@ > streaming@ > If they are not needed, please file a JIRA request with INFR to close them > down. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24974) Spark puts all file paths into SharedInMemoryCache even for unused partitions.
[ https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24974. -- Resolution: Incomplete > Spark puts all file paths into SharedInMemoryCache even for unused > partitions. > --- > > Key: SPARK-24974 > URL: https://issues.apache.org/jira/browse/SPARK-24974 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: andrzej.stankev...@gmail.com >Priority: Major > Labels: bulk-closed > > SharedInMemoryCache holds the FileStatus of every file, no matter whether you specify partition columns or not. This causes long load times for queries that use only a couple of partitions, because Spark loads the file paths of all partitions. > I partitioned the files by *report_date* and *type* and I have a directory structure like > {code:java} > /custom_path/report_date=2018-07-24/type=A/file_1.parquet > {code} > > I am trying to execute > {code:java} > val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( > "type == 'A'").count > {code} > > In my query I need to load only the files of type *A*, which is just a couple of files. But Spark loads all 19K files from all partitions into SharedInMemoryCache, which takes about 60 seconds, and only after that discards the unused partitions. > > This could be related to [https://jira.apache.org/jira/browse/SPARK-17994] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24955) spark continuing to execute on a task despite not reading all data from a downed machine
[ https://issues.apache.org/jira/browse/SPARK-24955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24955. -- Resolution: Incomplete > spark continuing to execute on a task despite not reading all data from a > downed machine > > > Key: SPARK-24955 > URL: https://issues.apache.org/jira/browse/SPARK-24955 > Project: Spark > Issue Type: Bug > Components: PySpark, Shuffle >Affects Versions: 2.3.0 >Reporter: San Tung >Priority: Major > Labels: bulk-closed > > We've recently run into a few instances where a downed node has led to > incomplete data, causing correctness issues, which we can reproduce some of > the time. > *Setup:* > - we're currently on spark 2.3.0 > - we allow retries on failed tasks and stages > - we use PySpark to perform these operations > *Stages:* > Simplistically, the job does the following: > - Stage 1/2: computes a number of `(sha256 hash, 0, 1)` partitioned into > 65536 partitions > - Stage 3/4: computes a number of `(sha256 hash, 1, 0)` partitioned into > 6408 partitions (one hash may exist in multiple partitions) > - Stage 5: > - repartitions stage 2 and stage 4 by the first 2 bytes of each hash, and > find which ones are not in common (stage 2 hashes - stage 4 hashes). > - store this partition into a persistent data source. > *Failure Scenario:* > - We take out one of the machines (do a forced shutdown, for example) > - For some tasks, stage 5 will die immediately with one of the following: > ** `ExecutorLostFailure (executor 24 exited caused by one of the running > tasks) Reason: worker lost` > ** `FetchFailed(BlockManagerId(24, [redacted], 36829, None), shuffleId=2, > mapId=14377, reduceId=48402, message=` > - these tasks are reused to calculate stage 1-2 and 3-4 again that were > missing on downed nodes, which is correctly recalculated by spark. > - However, some tasks still continue executing from Stage 5, seemingly > missing stage 4 data, dumping incorrect data to the stage 5 data source. We > noticed the subtract operation taking ~1-2 minutes after the machine goes > down, and stores a lot more data than usual (which on inspection is wrong). > - we've seen this happen with slightly different execution plans too which > don't involve or-ing, but end up being some variant of missing some stage 4 > data. > However, we cannot reproduce this consistently - sometimes all tasks fail > gracefully. Correctly downed nodes means all these tasks fail and re-work on > stage 1-2/3-4. Note that this solution produces the correct results if > machines stay alive! > We were wondering if a machine going down can result in a state where a task > could keep executing even though not all data has been fetched which gives us > incorrect results (or if there is setting that allows this - we tried > scanning spark configs up and down). This seems similar to > https://issues.apache.org/jira/browse/SPARK-24160 (maybe we get an empty > packet?), but it doesn't look like that was to explicitly resolve any known > bug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24550) Add support for Kubernetes specific metrics
[ https://issues.apache.org/jira/browse/SPARK-24550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24550. -- Resolution: Incomplete > Add support for Kubernetes specific metrics > --- > > Key: SPARK-24550 > URL: https://issues.apache.org/jira/browse/SPARK-24550 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Stavros Kontopoulos >Priority: Major > Labels: bulk-closed > > Spark by default offers a specific set of metrics for monitoring. It is > possible to add platform specific metrics or enhance the existing ones if it > makes sense. Here is an example for > [mesos|https://github.com/apache/spark/pull/21516]. Some example of metrics > that could be added are: number of blacklisted nodes, utilization of > executors per node, evicted pods and driver restarts, back-off time when > requesting executors etc... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25585) Allow users to specify scale of result in Decimal arithmetic
[ https://issues.apache.org/jira/browse/SPARK-25585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25585. -- Resolution: Incomplete > Allow users to specify scale of result in Decimal arithmetic > > > Key: SPARK-25585 > URL: https://issues.apache.org/jira/browse/SPARK-25585 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Benito Kestelman >Priority: Major > Labels: bulk-closed, decimal, usability > > The current behavior of Spark Decimal during arithmetic makes it difficult > for users to achieve their desired level of precision. Numeric literals are > automatically cast to unlimited precision during arithmetic, but the final > result is cast down depending on the precision and scale of the operands, > according to MS SQL rules (discussed in other JIRA's). This final cast can > cause substantial loss of scale. > For example: > {noformat} > scala> spark.sql("select 1.3/3.41").show(false) > ++ > |(CAST(1.3 AS DECIMAL(3,2)) / CAST(3.41 AS DECIMAL(3,2)))| > ++ > |0.381232 | > ++{noformat} > To get higher scale in the result, a user must cast the operands to higher > scale: > {noformat} > scala> spark.sql("select cast(1.3 as decimal(5,4))/cast(3.41 as > decimal(5,4))").show(false) > ++ > |(CAST(1.3 AS DECIMAL(5,4)) / CAST(3.41 AS DECIMAL(5,4)))| > ++ > |0.3812316716 | > ++ > scala> spark.sql("select cast(1.3 as decimal(10,9))/cast(3.41 as > decimal(10,9))").show(false) > +--+ > |(CAST(1.3 AS DECIMAL(10,9)) / CAST(3.41 AS DECIMAL(10,9)))| > +--+ > |0.38123167155425219941 | > +--+{noformat} > But if the user casts too high, the result's scale decreases. > {noformat} > scala> spark.sql("select cast(1.3 as decimal(25,24))/cast(3.41 as > decimal(25,24))").show(false) > ++ > |(CAST(1.3 AS DECIMAL(25,24)) / CAST(3.41 AS DECIMAL(25,24)))| > ++ > |0.3812316715543 | > ++{noformat} > Thus, the user has no way of knowing how to cast to get the scale he wants. > This problem is even harder to deal with when using variables instead of > literals. > The user should be able to explicitly set the desired scale of the result. > MySQL offers this capability in the form of a system variable called > "div_precision_increment." > From the MySQL docs: "In division performed with > [{{/}}|https://dev.mysql.com/doc/refman/8.0/en/arithmetic-functions.html#operator_divide], > the scale of the result when using two exact-value operands is the scale of > the first operand plus the value of the > [{{div_precision_increment}}|https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_div_precision_increment] > system variable (which is 4 by default). For example, the result of the > expression {{5.05 / 0.014}} has a scale of six decimal places > ({{360.714286}})." > {noformat} > mysql> SELECT 1/7; > ++ > | 1/7 | > ++ > | 0.1429 | > ++ > mysql> SET div_precision_increment = 12; > mysql> SELECT 1/7; > ++ > | 1/7 | > ++ > | 0.142857142857 | > ++{noformat} > This gives the user full control of the result's scale after arithmetic and > obviates the need for casting all over the place. > Since Spark 2.3, we already have DecimalType.MINIMUM_ADJUSTED_SCALE, which is > similar to div_precision_increment. It just needs to be made modifiable by > the user. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21406) Add logLikelihood to GLR families
[ https://issues.apache.org/jira/browse/SPARK-21406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21406. -- Resolution: Incomplete > Add logLikelihood to GLR families > - > > Key: SPARK-21406 > URL: https://issues.apache.org/jira/browse/SPARK-21406 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Priority: Minor > Labels: bulk-closed > > To be able to implement the typical gradient based aggregator for GLR, we'd > need to add a {{logLikelihood(y: Double, mu: Double, weight: Double)}} method > to GLR {{Family}} class. > One possible hiccup - Tweedie family log likelihood is not computationally > feasible [link| > http://support.sas.com/documentation/cdl/en/stathpug/67524/HTML/default/viewer.htm#stathpug_hpgenselect_details16.htm]. > H2O gets around this by using the deviance instead. We could leave it > unimplemented initially. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
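A purely hypothetical sketch of the proposed method's shape (the real GLR {{Family}} class is internal to Spark ML and does not currently expose this), using the Poisson family and dropping the log(y!) term, which does not depend on the model parameters and so can be omitted for optimization purposes:

{code:scala}
// Hypothetical shape of the proposed Family.logLikelihood(y, mu, weight).
// Poisson log likelihood: y * log(mu) - mu - log(y!); the last term is constant in mu.
def poissonLogLikelihood(y: Double, mu: Double, weight: Double): Double = {
  require(mu > 0.0, "the Poisson mean must be positive")
  weight * (y * math.log(mu) - mu)
}

// Example: observed count 3, predicted mean 2.5, unit weight.
println(poissonLogLikelihood(y = 3.0, mu = 2.5, weight = 1.0))
{code}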
[jira] [Resolved] (SPARK-12878) Dataframe fails with nested User Defined Types
[ https://issues.apache.org/jira/browse/SPARK-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-12878. -- Resolution: Incomplete > Dataframe fails with nested User Defined Types > -- > > Key: SPARK-12878 > URL: https://issues.apache.org/jira/browse/SPARK-12878 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Joao Duarte >Priority: Major > Labels: bulk-closed > > Spark 1.6.0 crashes when using nested User Defined Types in a Dataframe. > In version 1.5.2 the code below worked just fine: > {code} > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.catalyst.InternalRow > import org.apache.spark.sql.catalyst.expressions.GenericMutableRow > import org.apache.spark.sql.types._ > @SQLUserDefinedType(udt = classOf[AUDT]) > case class A(list:Seq[B]) > class AUDT extends UserDefinedType[A] { > override def sqlType: DataType = StructType(Seq(StructField("list", > ArrayType(BUDT, containsNull = false), nullable = true))) > override def userClass: Class[A] = classOf[A] > override def serialize(obj: Any): Any = obj match { > case A(list) => > val row = new GenericMutableRow(1) > row.update(0, new > GenericArrayData(list.map(_.asInstanceOf[Any]).toArray)) > row > } > override def deserialize(datum: Any): A = { > datum match { > case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq) > } > } > } > object AUDT extends AUDT > @SQLUserDefinedType(udt = classOf[BUDT]) > case class B(text:Int) > class BUDT extends UserDefinedType[B] { > override def sqlType: DataType = StructType(Seq(StructField("num", > IntegerType, nullable = false))) > override def userClass: Class[B] = classOf[B] > override def serialize(obj: Any): Any = obj match { > case B(text) => > val row = new GenericMutableRow(1) > row.setInt(0, text) > row > } > override def deserialize(datum: Any): B = { > datum match { case row: InternalRow => new B(row.getInt(0)) } > } > } > object BUDT extends BUDT > object Test { > def main(args:Array[String]) = { > val col = Seq(new A(Seq(new B(1), new B(2))), > new A(Seq(new B(3), new B(4 > val sc = new SparkContext(new > SparkConf().setMaster("local[1]").setAppName("TestSpark")) > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.implicits._ > val df = sc.parallelize(1 to 2 zip col).toDF("id","b") > df.select("b").show() > df.collect().foreach(println) > } > } > {code} > In the new version (1.6.0) I needed to include the following import: > `import org.apache.spark.sql.catalyst.expressions.GenericMutableRow` > However, Spark crashes in runtime: > {code} > 16/01/18 14:36:22 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:51) > at > org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51) > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > 
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212) > at > org.
[jira] [Resolved] (SPARK-9636) Treat $SPARK_HOME as write-only
[ https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-9636. - Resolution: Incomplete > Treat $SPARK_HOME as write-only > --- > > Key: SPARK-9636 > URL: https://issues.apache.org/jira/browse/SPARK-9636 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 1.4.1 > Environment: Linux >Reporter: Philipp Angerer >Priority: Minor > Labels: bulk-closed > > When starting the Spark scripts as a regular user while Spark is installed in a directory the user has no write permission on, most things work fine, except for the logs (e.g. for {{start-master.sh}}). > Logs are written by default to {{$SPARK_LOG_DIR}} or (if unset) to {{$SPARK_HOME/logs}}. > If Spark is installed this way, it should, instead of throwing an error, write the logs to {{/var/log/spark/}}. That is easy to fix by simply testing a few log dirs in sequence for writability before trying to use one. I suggest using {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} → {{$SPARK_HOME/logs/}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix
[ https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8614. - Resolution: Incomplete > Row order preservation for operations on MLlib IndexedRowMatrix > --- > > Key: SPARK-8614 > URL: https://issues.apache.org/jira/browse/SPARK-8614 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Jan Luts >Priority: Major > Labels: bulk-closed > > In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply indices are > dropped before calling the methods from RowMatrix. For example for > IndexedRowMatrix.computeSVD: >val svd = toRowMatrix().computeSVD(k, computeU, rCond) > and for IndexedRowMatrix.multiply: >val mat = toRowMatrix().multiply(B). > After computing these results, they are zipped with the original indices, > e.g. for IndexedRowMatrix.computeSVD >val indexedRows = indices.zip(svd.U.rows).map { case (i, v) => > IndexedRow(i, v) >} > and for IndexedRowMatrix.multiply: > >val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) => > IndexedRow(i, v) >} > I have experienced that for IndexedRowMatrix.computeSVD().U and > IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply) row > indices can get mixed (when running Spark jobs with multiple > executors/machines): i.e. the vectors and indices of the result do not seem > to correspond anymore. > To me it looks like this is caused by zipping RDDs that have a different > ordering? > For the IndexedRowMatrix.multiply I have observed that ordering within > partitions is preserved, but that it seems to get mixed up between > partitions. For example, for: > part1Index1 part1Vector1 > part1Index2 part1Vector2 > part2Index1 part2Vector1 > part2Index2 part2Vector2 > I got: > part2Index1 part1Vector1 > part2Index2 part1Vector2 > part1Index1 part2Vector1 > part1Index2 part2Vector2 > Another observation is that the mapPartitions in RowMatrix.multiply : > val AB = rows.mapPartitions { iter => > had an "preservesPartitioning = true" argument in version 1.0, but this is no > longer there. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24585) Adding ability to audit file system before and after test to ensure all files are cleaned up.
[ https://issues.apache.org/jira/browse/SPARK-24585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24585. -- Resolution: Incomplete > Adding ability to audit file system before and after test to ensure all files > are cleaned up. > - > > Key: SPARK-24585 > URL: https://issues.apache.org/jira/browse/SPARK-24585 > Project: Spark > Issue Type: Test > Components: Build >Affects Versions: 2.3.1 >Reporter: David Lewis >Priority: Minor > Labels: bulk-closed > > Some spark tests use temporary files and folders on the file system. This > proposal is to audit the file system before and after to ensure that all > files created are cleaned up. > This proposal is to add two flags to SparkFunSuite. One to enable the > auditing, which will log a warning if there is a discrepancy, and one to > throw an exception if there is a discrepancy. > This is intended to help prevent file leakage in spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18245) Improving support for bucketed table
[ https://issues.apache.org/jira/browse/SPARK-18245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-18245. -- Resolution: Incomplete > Improving support for bucketed table > > > Key: SPARK-18245 > URL: https://issues.apache.org/jira/browse/SPARK-18245 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Major > Labels: bulk-closed > > This is an umbrella ticket for improving various execution planning for > bucketed tables. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24081) Spark SQL drops the table while writing into table in "overwrite" mode.
[ https://issues.apache.org/jira/browse/SPARK-24081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24081. -- Resolution: Incomplete > Spark SQL drops the table while writing into table in "overwrite" mode. > > > Key: SPARK-24081 > URL: https://issues.apache.org/jira/browse/SPARK-24081 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.0 >Reporter: Ashish >Priority: Major > Labels: bulk-closed > > I read data from a table, modify it, and write it back to the same table in overwrite mode; all of the records are deleted. > Expectation: the table is updated with the modified data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
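A likely explanation is that an overwrite clears the target before writing, so a lazily evaluated plan that still reads the same table ends up with nothing to write. A minimal sketch of the usual workaround, assuming `spark` is an active SparkSession and the table/column names are illustrative: stage the modified data first, then overwrite.
{code:scala}
import org.apache.spark.sql.functions.col

// The modified rows still reference my_table; overwriting it in the same plan would
// read from a table that has already been cleared, so stage the result first.
val updated = spark.table("my_table").withColumn("amount", col("amount") * 2)

updated.write.mode("overwrite").saveAsTable("my_table_staged")
spark.table("my_table_staged").write.mode("overwrite").saveAsTable("my_table")
{code}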
[jira] [Resolved] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15690. -- Resolution: Incomplete > Fast single-node (single-process) in-memory shuffle > --- > > Key: SPARK-15690 > URL: https://issues.apache.org/jira/browse/SPARK-15690 > Project: Spark > Issue Type: New Feature > Components: Shuffle, SQL >Reporter: Reynold Xin >Priority: Major > Labels: bulk-closed > > Spark's current shuffle implementation sorts all intermediate data by their > partition id, and then write the data to disk. This is not a big bottleneck > because the network throughput on commodity clusters tend to be low. However, > an increasing number of Spark users are using the system to process data on a > single-node. When in a single node operating against intermediate data that > fits in memory, the existing shuffle code path can become a big bottleneck. > The goal of this ticket is to change Spark so it can use in-memory radix sort > to do data shuffling on a single node, and still gracefully fallback to disk > if the data size does not fit in memory. Given the number of partitions is > usually small (say less than 256), it'd require only a single pass do to the > radix sort with pretty decent CPU efficiency. > Note that there have been many in-memory shuffle attempts in the past. This > ticket has a smaller scope (single-process), and aims to actually > productionize this code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
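To make the sorting idea concrete, here is a minimal, Spark-agnostic sketch (purely illustrative, not the proposed implementation), assuming each in-memory record has already been tagged with its target partition id: with a small number of partitions, one counting pass plus one placement pass orders everything by partition id.
{code:scala}
// Counting-sort records by partition id: pass 1 builds a histogram, pass 2 places
// each record at its slot. No comparison sort is needed because the key space is tiny.
def sortByPartitionId(records: Array[(Int, Array[Byte])],
                      numPartitions: Int): Array[(Int, Array[Byte])] = {
  val counts = new Array[Int](numPartitions)
  records.foreach { case (pid, _) => counts(pid) += 1 }
  val cursor = counts.scanLeft(0)(_ + _) // running start offset per partition id
  val out = new Array[(Int, Array[Byte])](records.length)
  records.foreach { rec =>
    out(cursor(rec._1)) = rec
    cursor(rec._1) += 1
  }
  out
}
{code}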
[jira] [Resolved] (SPARK-24745) Map function does not keep rdd name
[ https://issues.apache.org/jira/browse/SPARK-24745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24745. -- Resolution: Incomplete > Map function does not keep rdd name > > > Key: SPARK-24745 > URL: https://issues.apache.org/jira/browse/SPARK-24745 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Igor Pergenitsa >Priority: Minor > Labels: bulk-closed > > This snippet > {code:scala} > val namedRdd = sparkContext.makeRDD(List("abc", "123")).setName("named_rdd") > println(namedRdd.name) > val mappedRdd = namedRdd.map(_.length) > println(mappedRdd.name){code} > outputs: > named_rdd > null -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
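The behaviour is easy to work around by re-applying a name to the derived RDD; a minimal Scala equivalent of the snippet above, assuming `sc` is an active SparkContext:
{code:scala}
val namedRdd = sc.makeRDD(List("abc", "123")).setName("named_rdd")
println(namedRdd.name)   // named_rdd

val mappedRdd = namedRdd.map(_.length)
println(mappedRdd.name)  // null: the name is not inherited by derived RDDs

// Workaround: set a name on the derived RDD explicitly.
val renamed = namedRdd.map(_.length).setName(namedRdd.name + "_mapped")
println(renamed.name)    // named_rdd_mapped
{code}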
[jira] [Resolved] (SPARK-22731) Add a test for ROWID type to OracleIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-22731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22731. -- Resolution: Incomplete > Add a test for ROWID type to OracleIntegrationSuite > --- > > Key: SPARK-22731 > URL: https://issues.apache.org/jira/browse/SPARK-22731 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > Labels: bulk-closed > > We need to add a test case to OracleIntegrationSuite for checking whether the > current support of ROWID type works well for Oracle. If not, we also need a > fix. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24964) Please add OWASP Dependency Check to all component builds (pom.xml)
[ https://issues.apache.org/jira/browse/SPARK-24964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24964. -- Resolution: Incomplete > Please add OWASP Dependency Check to all comonent builds(pom.xml) > -- > > Key: SPARK-24964 > URL: https://issues.apache.org/jira/browse/SPARK-24964 > Project: Spark > Issue Type: New Feature > Components: Build, MLlib, Spark Core, SparkR >Affects Versions: 2.3.1 > Environment: All development, build, test, environments. > ~/workspace/spark-2.3.1/pom.xml > ~/workspace/spark-2.3.1/assembly/pom.xml > ~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/kvstore/pom.xml > ~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/network-common/pom.xml > ~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/network-shuffle/pom.xml > ~/workspace/spark-2.3.1/common/network-yarn/pom.xml > ~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/sketch/pom.xml > ~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/tags/pom.xml > ~/workspace/spark-2.3.1/common/unsafe/pom.xml > ~/workspace/spark-2.3.1/core/pom.xml > ~/workspace/spark-2.3.1/examples/pom.xml > ~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml > ~/workspace/spark-2.3.1/external/flume/pom.xml > ~/workspace/spark-2.3.1/external/flume-assembly/pom.xml > ~/workspace/spark-2.3.1/external/flume-sink/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml > ~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml > ~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml > ~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml > ~/workspace/spark-2.3.1/graphx/pom.xml > ~/workspace/spark-2.3.1/hadoop-cloud/pom.xml > ~/workspace/spark-2.3.1/launcher/pom.xml > ~/workspace/spark-2.3.1/mllib/pom.xml > ~/workspace/spark-2.3.1/mllib-local/pom.xml > ~/workspace/spark-2.3.1/repl/pom.xml > ~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml > ~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml > ~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml > ~/workspace/spark-2.3.1/sql/catalyst/pom.xml > ~/workspace/spark-2.3.1/sql/core/pom.xml > ~/workspace/spark-2.3.1/sql/hive/pom.xml > ~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml > ~/workspace/spark-2.3.1/streaming/pom.xml > ~/workspace/spark-2.3.1/tools/pom.xml >Reporter: Albert Baker >Priority: Major > Labels: build, bulk-closed, easy-fix, security > Original Estimate: 3h > Remaining Estimate: 3h > > OWASP DC makes an outbound REST call to MITRE Common Vulnerabilities & > Exposures (CVE) to perform a lookup for each dependant .jar to list any/all > known vulnerabilities for each jar. This step is needed because a manual > MITRE CVE lookup/check on the main component does not include checking for > vulnerabilities in dependant libraries. > OWASP Dependency check : > https://www.owasp.org/index.php/OWASP_Dependency_Check has plug-ins for most > Java build/make types (ant, maven, ivy, gradle). 
Also, add the appropriate command to the nightly build to generate a report of all known vulnerabilities in any/all third-party libraries/dependencies that get pulled in. Example: mvn -Powasp -Dtest=false -DfailIfNoTests=false clean aggregate > Generating this report nightly/weekly will help inform the project's development team if any dependent libraries have a reported known vulnerability. Project teams that keep up with removing vulnerabilities on a weekly basis will help protect businesses that rely on these open source components. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24122) Allow automatic driver restarts on K8s
[ https://issues.apache.org/jira/browse/SPARK-24122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24122. -- Resolution: Incomplete > Allow automatic driver restarts on K8s > -- > > Key: SPARK-24122 > URL: https://issues.apache.org/jira/browse/SPARK-24122 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > Labels: bulk-closed > > [~foxish] > Right now SparkSubmit creates the driver as a bare pod, rather than a managed > controller like a Deployment or a StatefulSet. This means there is no way to > guarantee automatic restarts, eg in case a node has an issue. Note Pod > RestartPolicy does not apply if a node fails. A StatefulSet would allow us to > guarantee that, and keep the ability for executors to find the driver using > DNS. > This is particularly helpful for long-running streaming workloads, where we > currently use {{yarn.resourcemanager.am.max-attempts}} with YARN. I can > confirm that Spark Streaming and Structured Streaming applications can be > made to recover from such a restart, with the help of checkpointing. The > executors will have to be started again by the driver, but this should not be > a problem. > For batch processing, we could alternatively use Kubernetes {{Job}} objects, > which restart pods on failure but not success. For example, note the > semantics provided by the {{kubectl run}} > [command|https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run] > * {{--restart=Never}}: bare Pod > * {{--restart=Always}}: Deployment > * {{--restart=OnFailure}}: Job > https://github.com/apache-spark-on-k8s/spark/issues/288 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25311) `SPARK_LOCAL_HOSTNAME` does not support IPv6 in host checking
[ https://issues.apache.org/jira/browse/SPARK-25311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25311. -- Resolution: Incomplete > `SPARK_LOCAL_HOSTNAME` does not support IPv6 in host checking > --- > > Key: SPARK-25311 > URL: https://issues.apache.org/jira/browse/SPARK-25311 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2 >Reporter: Xiaochen Ouyang >Priority: Major > Labels: bulk-closed > > An IPv4 address can pass the following check > {code:java} > def checkHost(host: String, message: String = "") { > assert(host.indexOf(':') == -1, message) > } > {code} > but an IPv6 address fails it, because the address itself contains ':' characters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
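A minimal sketch of an address check that tolerates IPv6 literals, assuming the bracketed form (e.g. `[::1]`) is the accepted way to write an IPv6 host without a port; this is illustrative, not the actual change to Spark's utilities:
{code:scala}
// Accept either a plain hostname/IPv4 address (no ':') or a bracketed IPv6 literal.
def checkHost(host: String, message: String = ""): Unit = {
  val isIpv6Literal = host.startsWith("[") && host.endsWith("]")
  assert(isIpv6Literal || host.indexOf(':') == -1, message)
}

checkHost("10.0.0.1") // passes, as before
checkHost("[::1]")    // passes with the relaxed check
// checkHost("::1")   // would still fail: an un-bracketed IPv6 literal contains ':'
{code}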
[jira] [Resolved] (SPARK-24910) Spark Bloom Filter Closure Serialization improvement for very high volume of Data
[ https://issues.apache.org/jira/browse/SPARK-24910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24910. -- Resolution: Incomplete > Spark Bloom Filter Closure Serialization improvement for very high volume of > Data > - > > Key: SPARK-24910 > URL: https://issues.apache.org/jira/browse/SPARK-24910 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.3.1 >Reporter: Himangshu Ranjan Borah >Priority: Minor > Labels: bulk-closed > > I am proposing an improvement to the Bloom Filter Generation logic being used > in the DataFrameStatFunctions' Bloom Filter API using mapPartitions() instead > of aggregate() to avoid closure serialization which fails for huge BitArrays. > Spark's Stat Functions' Bloom Filter Implementation uses > aggregate/treeAggregate operations which uses a closure with a dependency on > the bloom filter that is created in the driver. Since Spark hard codes the > closure serializer to Java Serializer it fails in closure cleanup for very > big sizes of Bloom Filters (Typically with num items ~ Billions and with fpp > ~ 0.001). Kryo serializer work's fine in such a scale but seems like there > were some issues using Kryo for closure serialization due to which Spark 2.0 > hardcoded it to Java. The call-stack that we get typically looks like, > {{{color:#f79232}java.lang.OutOfMemoryError{color}}} > {{{color:#f79232} at > java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123){color}}} > {{{color:#f79232} at > java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117){color}}} > {{{color:#f79232} at > java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93){color}}} > {{{color:#f79232} at > java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153){color}}} > {{{color:#f79232} at > org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41){color}}} > {{{color:#f79232} at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877){color}}} > {{{color:#f79232} at > java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786){color}}} > {{{color:#f79232} at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189){color}}} > {{{color:#f79232} at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548){color}}} > {{{color:#f79232} at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509){color}}} > {{{color:#f79232} at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432){color}}} > {{{color:#f79232} at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178){color}}} > {{{color:#f79232} at > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348){color}}} > {{{color:#f79232} at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43){color}}} > {{{color:#f79232} at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100){color}}} > {{{color:#f79232} at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342){color}}} > {{{color:#f79232} at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335){color}}} > {{{color:#f79232} at > org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159){color}}} > {{{color:#f79232} at > org.apache.spark.SparkContext.clean(SparkContext.scala:2292){color}}} > 
{{{color:#f79232} at > org.apache.spark.SparkContext.runJob(SparkContext.scala:2022){color}}} > {{{color:#f79232} at > org.apache.spark.SparkContext.runJob(SparkContext.scala:2124){color}}} > {{{color:#f79232} at > org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1092){color}}} > {{{color:#f79232} at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151){color}}} > {{{color:#f79232} at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112){color}}} > {{{color:#f79232} at > org.apache.spark.rdd.RDD.withScope(RDD.scala:363){color}}} > {{{color:#f79232} at org.apache.spark.rdd.RDD.fold(RDD.scala:1086){color}}} > {{{color:#f79232} at > org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1155){color}}} > {{{color:#f79232} at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151){color}}} > {{{color:#f79232} at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112){color}}} > {{{color:#f79232} at > org.apache.spark.rdd.RDD.withScope(RDD.scala:363){color}}} > {{{color:#f79232} at > org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1131){color}}} >
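A minimal sketch of the proposed direction, assuming a LongType column and illustrative sizing parameters: each partition builds its own partial BloomFilter inside mapPartitions (so no driver-side filter is captured in a task closure), and the partials, created with identical parameters so they are compatible, are merged with mergeInPlace.
{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.sketch.BloomFilter

// Build one partial filter per partition on the executors and merge the partials,
// instead of aggregating into a single filter that travels inside the closure.
def bloomFilterViaMapPartitions(df: DataFrame, colName: String,
                                expectedItems: Long, fpp: Double): BloomFilter = {
  df.select(colName).rdd
    .mapPartitions { rows =>
      val partial = BloomFilter.create(expectedItems, fpp)
      rows.foreach(r => partial.putLong(r.getLong(0)))
      Iterator.single(partial)
    }
    .reduce((a, b) => a.mergeInPlace(b))
}
{code}
The partial filters still travel from the executors to the driver as task results during reduce, but they no longer pass through the hard-coded Java closure serializer.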
[jira] [Resolved] (SPARK-22114) The condition of OnlineLDAOptimizer convergence should be configurable
[ https://issues.apache.org/jira/browse/SPARK-22114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22114. -- Resolution: Incomplete > The condition of OnlineLDAOptimizer convergence should be configurable > > > Key: SPARK-22114 > URL: https://issues.apache.org/jira/browse/SPARK-22114 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Minor > Labels: bulk-closed > > The current convergence of OnlineLDAOptimizer is: > {code:java} > while(meanGammaChange > 1e-3) > {code} > The condition of this is critical for the performance and accuracy of LDA. > We should keep this configurable, like it is in Vowpal Vabbit: > https://github.com/JohnLangford/vowpal_wabbit/blob/430f69453bc4876a39351fba1f18771bdbdb7122/vowpalwabbit/lda_core.cc > :638 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
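A sketch of what the configurability could look like, following the builder-style setters OnlineLDAOptimizer already exposes (setTau0, setKappa, ...); the class and setter names here are illustrative, not the actual MLlib change:
{code:scala}
// Illustrative only: expose the hard-coded 1e-3 threshold as a positive, settable tolerance.
class ConfigurableConvergenceSketch {
  private var gammaConvergenceTol: Double = 1e-3

  def setGammaConvergenceTol(tol: Double): this.type = {
    require(tol > 0.0, s"Convergence tolerance must be positive, got $tol")
    gammaConvergenceTol = tol
    this
  }

  // The variational-inference inner loop would then read:
  //   while (meanGammaChange > gammaConvergenceTol) { ... }
}
{code}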
[jira] [Resolved] (SPARK-25103) CompletionIterator may delay GC of completed resources
[ https://issues.apache.org/jira/browse/SPARK-25103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25103. -- Resolution: Incomplete > CompletionIterator may delay GC of completed resources > -- > > Key: SPARK-25103 > URL: https://issues.apache.org/jira/browse/SPARK-25103 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0, 2.2.0, 2.3.0 >Reporter: Eyal Farago >Priority: Major > Labels: bulk-closed > > While working on SPARK-22713, I found (and partially fixed) a scenario in which an iterator is already exhausted but still holds a reference to some resources that could be GCed at this point. However, these resources cannot be GCed because of this reference. > The specific fix applied in SPARK-22713 was to wrap the iterator with a CompletionIterator that cleans it up when exhausted; the problem is that it's quite easy to get this wrong by closing over local variables or the _this_ reference in the cleanup function itself. > I propose solving this by modifying CompletionIterator to discard references to the wrapped iterator and cleanup function once exhausted. > > * A dive into the code showed that most CompletionIterators are eventually used by > {code:java} > org.apache.spark.scheduler.ShuffleMapTask#runTask{code} > which does: > {code:java} > writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: > Product2[Any, Any]]]){code} > Looking at > {code:java} > org.apache.spark.shuffle.ShuffleWriter#write{code} > implementations, it seems all of them first exhaust the iterator and then perform some kind of post-processing: merging spills, sorting, writing partition files and then concatenating them into a single file... bottom line: the iterator may actually be 'sitting' for some time after being exhausted. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
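A minimal sketch of the proposed change, with illustrative names: once the wrapped iterator reports it is exhausted, run the completion function and drop the references to both the iterator and the function, so they can be collected even while the wrapper itself stays reachable during post-processing.
{code:scala}
// Wrapper that releases its references as soon as the underlying iterator is exhausted.
class GcFriendlyCompletionIterator[A](sub: Iterator[A], completion: () => Unit)
  extends Iterator[A] {

  private[this] var iter: Iterator[A] = sub
  private[this] var onComplete: () => Unit = completion
  private[this] var completed = false

  override def hasNext: Boolean = {
    if (completed) return false
    val more = iter.hasNext
    if (!more) {
      completed = true
      val cleanup = onComplete
      iter = Iterator.empty // release the wrapped iterator
      onComplete = null     // release the cleanup closure
      cleanup()
    }
    more
  }

  override def next(): A = iter.next()
}
{code}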
[jira] [Resolved] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation
[ https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23686. -- Resolution: Incomplete > Make better usage of org.apache.spark.ml.util.Instrumentation > - > > Key: SPARK-23686 > URL: https://issues.apache.org/jira/browse/SPARK-23686 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > Labels: bulk-closed > > This Jira is a bit high level and might require subtasks or other jiras for > more specific tasks. > I've noticed that we don't make the best usage of the instrumentation class. > Specifically sometimes we bypass the instrumentation class and use the > debugger instead. For example, > [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143] > Also there are some things that might be useful to log in the instrumentation > class that we currently don't. For example: > number of training examples > mean/var of label (regression) > I know computing these things can be expensive in some cases, but especially > when this data is already available we can log it for free. For example, > Logistic Regression Summarizer computes some useful data including numRows > that we don't log. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15573) Backwards-compatible persistence for spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15573. -- Resolution: Incomplete > Backwards-compatible persistence for spark.ml > - > > Key: SPARK-15573 > URL: https://issues.apache.org/jira/browse/SPARK-15573 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed > > This JIRA is for imposing backwards-compatible persistence for the > DataFrames-based API for MLlib. I.e., we want to be able to load models > saved in previous versions of Spark. We will not require loading models > saved in later versions of Spark. > This requires: > * Putting unit tests in place to check loading models from previous versions > * Notifying all committers active on MLlib to be aware of this requirement in > the future > The unit tests could be written as in spark.mllib, where we essentially > copied and pasted the save() code every time it changed. This happens > rarely, so it should be acceptable, though other designs are fine. > Subtasks of this JIRA should cover checking and adding tests for existing > cases, such as KMeansModel (whose format changed between 1.6 and 2.0). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20732) Copy cache data when node is being shut down
[ https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20732. -- Resolution: Incomplete > Copy cache data when node is being shut down > > > Key: SPARK-20732 > URL: https://issues.apache.org/jira/browse/SPARK-20732 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Holden Karau >Priority: Major > Labels: bulk-closed > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24163) Support "ANY" or sub-query for Pivot "IN" clause
[ https://issues.apache.org/jira/browse/SPARK-24163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24163. -- Resolution: Incomplete > Support "ANY" or sub-query for Pivot "IN" clause > > > Key: SPARK-24163 > URL: https://issues.apache.org/jira/browse/SPARK-24163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wei Xue >Priority: Major > Labels: bulk-closed > > This is part of a functionality extension to Pivot SQL support as SPARK-24035. > Currently, only literal values are allowed in Pivot "IN" clause. To support > ANY or a sub-query in the "IN" clause (the examples of which provided below), > we need to enable evaluation of a sub-query before/during query analysis time. > {code:java} > SELECT * FROM ( > SELECT year, course, earnings FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ANY > );{code} > {code:java} > SELECT * FROM ( > SELECT year, course, earnings FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ( > SELECT course FROM courses > WHERE region = 'AZ' > ) > ); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20744) Predicates with multiple columns do not work
[ https://issues.apache.org/jira/browse/SPARK-20744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20744. -- Resolution: Incomplete > Predicates with multiple columns do not work > > > Key: SPARK-20744 > URL: https://issues.apache.org/jira/browse/SPARK-20744 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bogdan Raducanu >Priority: Major > Labels: bulk-closed > > The following code reproduces the problem: > {code} > scala> spark.range(10).selectExpr("id as a", "id as b").where("(a,b) in > ((1,1))").show > org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', > `a`, 'b', `b`) IN (named_struct('col1', 1, 'col2', 1)))' due to data type > mismatch: Arguments must be same type; line 1 pos 6; > 'Filter named_struct(a, a#42L, b, b#43L) IN (named_struct(col1, 1, col2, 1)) > +- Project [id#39L AS a#42L, id#39L AS b#43L] >+- Range (0, 10, step=1, splits=Some(1)) > {code} > Similarly it won't work from SQL either, which is something that other SQL DB > support: > {code} > scala> spark.range(10).selectExpr("id as a", "id as > b").createOrReplaceTempView("tab1") > scala> sql("select * from tab1 where (a,b) in ((1,1), (2,2))").show > org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', > tab1.`a`, 'b', tab1.`b`) IN (named_struct('col1', 1, 'col2', 1), > named_struct('col1', 2, 'col2', 2)))' due to data type mismatch: Arguments > must be same type; line 1 pos 31; > 'Project [*] > +- 'Filter named_struct(a, a#50L, b, b#51L) IN (named_struct(col1, 1, col2, > 1),named_struct(col1, 2, col2, 2)) >+- SubqueryAlias tab1 > +- Project [id#47L AS a#50L, id#47L AS b#51L] > +- Range (0, 10, step=1, splits=Some(1)) > {code} > Other examples: > {code} > scala> sql("select * from tab1 where (a,b) =(1,1)").show > org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', > tab1.`a`, 'b', tab1.`b`) = named_struct('col1', 1, 'col2', 1))' due to data > type mismatch: differing types in '(named_struct('a', tab1.`a`, 'b', > tab1.`b`) = named_struct('col1', 1, 'col2', 1))' (struct > and struct).; line 1 pos 25; > 'Project [*] > +- 'Filter (named_struct(a, a#50L, b, b#51L) = named_struct(col1, 1, col2, 1)) >+- SubqueryAlias tab1 > +- Project [id#47L AS a#50L, id#47L AS b#51L] > +- Range (0, 10, step=1, splits=Some(1)) > {code} > Expressions such as (1,1) are apparently read as structs and then the types > do not match. Perhaps they should be arrays. > The following code works: > {code} > sql("select * from tab1 where array(a,b) in (array(1,1),array(2,2))").show > {code} > This also works, but requires the cast: > {code} > sql("select * from tab1 where (a,b) in (named_struct('a', cast(1 as bigint), > 'b', cast(1 as bigint)))").show > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24100) Add the CompressionCodec to the saveAsTextFiles interface.
[ https://issues.apache.org/jira/browse/SPARK-24100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24100. -- Resolution: Incomplete > Add the CompressionCodec to the saveAsTextFiles interface. > -- > > Key: SPARK-24100 > URL: https://issues.apache.org/jira/browse/SPARK-24100 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: caijie >Priority: Minor > Labels: bulk-closed > > Add the CompressionCodec to the saveAsTextFiles interface. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
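For reference, the batch RDD API in Scala already accepts a codec class, which is the knob being requested here for the streaming/PySpark saveAsTextFiles path; a small example assuming `sc` is an active SparkContext and the output path is illustrative:
{code:scala}
import org.apache.hadoop.io.compress.GzipCodec

val lines = sc.parallelize(Seq("a", "b", "c"))
// Write the text output gzip-compressed by passing the codec class explicitly.
lines.saveAsTextFile("/tmp/out-gzipped", classOf[GzipCodec])
{code}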
[jira] [Resolved] (SPARK-24750) HiveCaseSensitiveInferenceMode with INFER_AND_SAVE will show WRITE permission denied even if select table operation
[ https://issues.apache.org/jira/browse/SPARK-24750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24750. -- Resolution: Incomplete > HiveCaseSensitiveInferenceMode with INFER_AND_SAVE will show WRITE permission > denied even if select table operation > --- > > Key: SPARK-24750 > URL: https://issues.apache.org/jira/browse/SPARK-24750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: Lantao Jin >Priority: Major > Labels: bulk-closed > > The default HiveCaseSensitiveInferenceMode is INFER_AND_SAVE. In this mode, > even if I just select a table which I have no write permission will log a > WRITE permission denied. > spark-sql> select col1 from table1 limit 10; > table1 is a hive extended table. And user user_me has no write permission for > the table1 location /path/table1/dt=20180705 > (b_someone:group_company:drwxr-xr-x) > {code} > 18/07/05 20:13:18 WARN hive.HiveMetastoreCatalog: Unable to save > case-sensitive schema for table default.table1 > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. > java.security.AccessControlException: Permission denied: user=user_me, > access=WRITE, > inode="/path/table1/dt=20180705":b_someone:group_company:drwxr-xr-x > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1780) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1764) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1738) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8445) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:2022) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1451) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200) > ; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.alterTableDataSchema(HiveExternalCatalog.scala:646) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableDataSchema(SessionCatalog.scala:369) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.updateDataSchema(HiveMetastoreCatalog.scala:266) > at > 
org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$inferIfNeeded(HiveMetastoreCatalog.scala:250) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$6$$anonfun$7.apply(HiveMetastoreCatalog.scala:194) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$6$$anonfun$7.apply(HiveMetastoreCatalog.scala:193) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$6.apply(HiveMetastoreCatalog.scala:193) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$6.apply(HiveMetastoreCatalog.scala:186) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.withTableCreationLock(HiveMetastoreCatalog.scala:54) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:186) > at > org.apache.spark.sql.hive.RelationConversions.org$apache$spark$sql$hive$RelationConversions$$convert(HiveStrategies.scala:199) > at > org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:219) > at > org.apache.spark.sql.hive.Rel
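Since the warning comes from the step that writes the inferred case-sensitive schema back to the metastore, a hedged mitigation for read-only users is to pick one of the other modes of the same setting: INFER_ONLY still infers the schema per query but does not try to alter the table, and NEVER_INFER skips inference entirely. A minimal example, with the application name being illustrative:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("infer-only-example")
  .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY") // avoid the alter-table attempt
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SELECT col1 FROM table1 LIMIT 10").show()
{code}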
[jira] [Resolved] (SPARK-25217) Error thrown when creating BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-25217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25217. -- Resolution: Incomplete > Error thrown when creating BlockMatrix > -- > > Key: SPARK-25217 > URL: https://issues.apache.org/jira/browse/SPARK-25217 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: cs5090237 >Priority: Major > Labels: bulk-closed > > dm1 = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6]) > dm2 = Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]) > sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 1, 2], [7, 11, 12]) > blocks1 = sc.parallelize([((0, 0), dm1)]) > sm_ = Matrix(3,2,sm) > blocks2 = sc.parallelize([((0, 0), sm), ((1, 0), sm)]) > blocks3 = sc.parallelize([((0, 0), sm), ((1, 0), dm2)]) > mat2 = BlockMatrix(blocks2, 3, 2) > mat3 = BlockMatrix(blocks3, 3, 2) > > *Running above sample code in Pyspark from documentation raises following > error:* > > An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. : > org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in > stage 53.0 failed 4 times, most recent failure: Lost task 14.3 in stage 53.0 > (TID 1081, , executor 15): org.apache.spark.api.python.PythonException: > Traceback (most recent call last): File > "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/worker.py", line 230, > in main process() File > "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/worker.py", line 225, > in process serializer.dump_stream(func(split_index, iterator), outfile) File > "/mnt/yarn/usercache/livy/appcache/application_1535051034290_0001/container_1535051034290_0001_01_23/pyspark.zip/pyspark/serializers.py", > line 372, in dump_stream vs = list(itertools.islice(iterator, batch)) File > "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1371, in > takeUpToNumLeft File > "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/util.py", line 55, in > wrapper return f(*args, **kwargs) File > "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/mllib/linalg/distributed.py", > line 975, in _convert_to_matrix_block_tuple raise TypeError("Cannot convert > type %s into a sub-matrix block tuple" % type(block)) TypeError: Cannot > convert type into a sub-matrix block tuple > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22359) Improve the test coverage of window functions
[ https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22359. -- Resolution: Incomplete > Improve the test coverage of window functions > - > > Key: SPARK-22359 > URL: https://issues.apache.org/jira/browse/SPARK-22359 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xingbo Jiang >Priority: Major > Labels: bulk-closed > > There are already quite a few integration tests using window functions, but > the unit tests coverage for window funtions is not ideal. > We'd like to test the following aspects: > * Specifications > ** different partition clauses (none, one, multiple) > ** different order clauses (none, one, multiple, asc/desc, nulls first/last) > * Frames and their combinations > ** OffsetWindowFunctionFrame > ** UnboundedWindowFunctionFrame > ** SlidingWindowFunctionFrame > ** UnboundedPrecedingWindowFunctionFrame > ** UnboundedFollowingWindowFunctionFrame > * Aggregate function types > ** Declarative > ** Imperative > ** UDAF > * Spilling > ** Cover the conditions that WindowExec should spill at least once -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18600) BZ2 CRC read error needs better reporting
[ https://issues.apache.org/jira/browse/SPARK-18600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-18600. -- Resolution: Incomplete > BZ2 CRC read error needs better reporting > - > > Key: SPARK-18600 > URL: https://issues.apache.org/jira/browse/SPARK-18600 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Charles R Allen >Priority: Minor > Labels: bulk-closed, spree > > {code} > 16/11/25 20:05:03 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 148 > in stage 5.0 failed 1 times, most recent failure: Lost task 148.0 in stage > 5.0 (TID 5945, localhost): org.apache.spark.SparkException: Task failed while > writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: com.univocity.parsers.common.TextParsingException: > java.lang.IllegalStateException - Error reading from input > Parser Configuration: CsvParserSettings: > Auto configuration enabled=true > Autodetect column delimiter=false > Autodetect quotes=false > Column reordering enabled=true > Empty value=null > Escape unquoted values=false > Header extraction enabled=null > Headers=[INTERVALSTARTTIME_GMT, INTERVALENDTIME_GMT, OPR_DT, OPR_HR, > NODE_ID_XML, NODE_ID, NODE, MARKET_RUN_ID, LMP_TYPE, XML_DATA_ITEM, > PNODE_RESMRID, GRP_TYPE, POS, VALUE, OPR_INTERVAL, GROUP] > Ignore leading whitespaces=false > Ignore trailing whitespaces=false > Input buffer size=128 > Input reading on separate thread=false > Keep escape sequences=false > Line separator detection enabled=false > Maximum number of characters per column=100 > Maximum number of columns=20480 > Normalize escaped line separators=true > Null value= > Number of records to read=all > Row processor=none > RowProcessor error handler=null > Selected fields=none > Skip empty lines=true > Unescaped quote handling=STOP_AT_DELIMITERFormat configuration: > CsvFormat: > Comment character=\0 > Field delimiter=, > Line separator (normalized)=\n > Line separator sequence=\n > Quote character=" > Quote escape character=\ > Quote escape escape character=null > Internal state when error was thrown: line=27089, column=13, record=27089, > charIndex=4451456, headers=[INTERVALSTARTTIME_GMT, INTERVALENDTIME_GMT, > OPR_DT, OPR_HR, NODE_ID_XML, NODE_ID, NODE, MARKET_RUN_ID, LMP_TYPE, > XML_DATA_ITEM, PNODE_RESMRID, GRP_TYPE, POS, VALUE, OPR_INTERVAL, GROUP] > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:302) > at > com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:431) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:148) > 
at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:131) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.
[jira] [Resolved] (SPARK-21962) Distributed Tracing in Spark
[ https://issues.apache.org/jira/browse/SPARK-21962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21962. -- Resolution: Incomplete > Distributed Tracing in Spark > > > Key: SPARK-21962 > URL: https://issues.apache.org/jira/browse/SPARK-21962 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Major > Labels: bulk-closed > > Spark should support distributed tracing, which is the mechanism, widely > popularized by Google in the [Dapper > Paper|https://research.google.com/pubs/pub36356.html], where network requests > have additional metadata used for tracing requests between services. > This would be useful for me since I have OpenZipkin style tracing in my > distributed application up to the Spark driver, and from the executors out to > my other services, but the link is broken in Spark between driver and > executor since the Span IDs aren't propagated across that link. > An initial implementation could instrument the most important network calls > with trace ids (like launching and finishing tasks), and incrementally add > more tracing to other calls (torrent block distribution, external shuffle > service, etc) as the feature matures. > Search keywords: Dapper, Brave, OpenZipkin, HTrace -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User
[ https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20443. -- Resolution: Incomplete > The blockSize of MLLIB ALS should be setting by the User > - > > Key: SPARK-20443 > URL: https://issues.apache.org/jira/browse/SPARK-20443 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Minor > Labels: bulk-closed > Attachments: blockSize.jpg > > > The blockSize of MLLIB ALS is very important for ALS performance. > In our test, when the blockSize is 128, the performance is about 4X comparing > with the blockSize is 4096 (default value). > The following are our test results: > BlockSize(recommendationForAll time) > 128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM) > The Test Environment: > 3 workers: each work 10 core, each work 30G memory, each work 1 executor. > The Data: User 480,000, and Item 17,000 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24607) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24607. -- Resolution: Incomplete > Distribute by rand() can lead to data inconsistency > --- > > Key: SPARK-24607 > URL: https://issues.apache.org/jira/browse/SPARK-24607 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: zenglinxi >Priority: Major > Labels: bulk-closed > > Noticed the following queries can give different results: > {code:java} > select count(*) from tbl; > select count(*) from (select * from tbl distribute by rand()) a;{code} > This issue was first reported by someone using Kylin to build cubes with HiveQL that includes distribute by rand(). Data inconsistency can happen during failure-recovery operations: rand() is non-deterministic, so a re-executed task can assign rows to different partitions than the original attempt did, while downstream tasks may already have consumed the original output. Since Spark has a similar fault-tolerance mechanism, I think it's also a hidden but serious problem in Spark SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
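A commonly suggested way to keep the spread across partitions while staying deterministic under task retries is to distribute by a hash of real columns instead of rand(); a minimal sketch, assuming `spark` is an active SparkSession and the table/column names are illustrative:
{code:scala}
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// Hashing a real column gives a stable partition assignment across task attempts.
val spread = spark.table("tbl").repartition(200, pmod(hash(col("id")), lit(200)))

// The SQL equivalent of the rewritten query:
spark.sql("SELECT * FROM (SELECT * FROM tbl DISTRIBUTE BY pmod(hash(id), 200)) a")
{code}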
[jira] [Resolved] (SPARK-25107) Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: CatalogRelation Errors
[ https://issues.apache.org/jira/browse/SPARK-25107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25107. -- Resolution: Incomplete > Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: > CatalogRelation Errors > -- > > Key: SPARK-25107 > URL: https://issues.apache.org/jira/browse/SPARK-25107 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: Spark Version : 2.2.0.cloudera2 >Reporter: Karan >Priority: Major > Labels: bulk-closed > > I am in the process of upgrading Spark 1.6 to Spark 2.2. > I have two stage query and I am running with hiveContext. > 1) hiveContext.sql("SELECT *, ROW_NUMBER() OVER (PARTITION BY ConfigID, rowid > ORDER BY date DESC) AS ro > FROM (SELECT DISTINCT PC.ConfigID, VPM.seqNo, VC.rowid ,VC.recordid > ,VC.data,CASE > WHEN pcs.score BETWEEN PC.from AND PC.to > AND ((PC.csacnt IS NOT NULL AND CC.status = 4 > AND CC.cnclusion = mc.ca) OR (PC.csacnt IS NULL)) THEN 1 ELSE 0 END AS Flag > FROM maindata VC > INNER JOIN scoretable pcs ON VC.s_recordid = pcs.s_recordid > INNER JOIN cnfgtable PC ON PC.subid = VC.subid > INNER JOIN prdtable VPM ON PC.configID = VPM.CNFG_ID > LEFT JOIN casetable CC ON CC.rowid = VC.rowid > LEFT JOIN prdcnfg mc ON mc.configID = PC.configID WHERE VC.date BETWEEN > VPM.StartDate and VPM.EndDate) A > WHERE A.Flag =1").createOrReplaceTempView.("stage1") > 2) hiveContext.sql("SELECT DISTINCT t1.ConfigID As cnfg_id ,vct.* > FROM stage1 t1 > INNER JOIN stage1 t2 ON t1.rowid = t2.rowid AND T1.ConfigID = t2.ConfigID > INNER JOIN cnfgtable PCR ON PCR.ConfigID = t2.ConfigID > INNER JOIN maindata vct on vct.recordid = t1.recordid > WHERE t2.ro = PCR.datacount”) > The same query sequency is working fine in Spark 1.6 but failing with below > exeption in Spark 2,2. It throws exception while parsing above 2nd query. > > {{+org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree:+}} > +CatalogRelation `database_name`.`{color:#f79232}maindata{color}`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [recordid#1506... 89 more > fields|#1506... 
89 more fields]+ > \{{ at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)}} > \{{ at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)}} > \{{ at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)}} > \{{ at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)}} > \{{ at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:722)}} > \{{ at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:721)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}} > \{{ at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} > \{{ at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} > \{{ at > org.apache.spark.sql.catalyst.trees.Tre
[jira] [Resolved] (SPARK-23797) SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto
[ https://issues.apache.org/jira/browse/SPARK-23797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23797. -- Resolution: Incomplete > SparkSQL performance on small TPCDS tables is very low when compared to Drill > or Presto > --- > > Key: SPARK-23797 > URL: https://issues.apache.org/jira/browse/SPARK-23797 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Submit, SQL >Affects Versions: 2.3.0 >Reporter: Tin Vu >Priority: Major > Labels: bulk-closed > > I am executing a benchmark to compare the performance of SparkSQL, Apache Drill and Presto. My experimental setup: > * TPCDS dataset with scale factor 100 (size 100GB). > * Spark, Drill, and Presto have the same number of workers: 12. > * Each worker has the same amount of allocated memory: 4GB. > * Data is stored by Hive in ORC format. > I executed a very simple SQL query: "SELECT * from table_name" > The issue is that for some small tables (even tables with a few dozen records), SparkSQL still required about 7-8 seconds to finish, while Drill and Presto needed less than 1 second. > For other large tables with billions of records, SparkSQL performance was reasonable: it required 20-30 seconds to scan the whole table. > Do you have any idea or a reasonable explanation for this issue? > Thanks, -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice
[ https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8582. - Resolution: Incomplete > Optimize checkpointing to avoid computing an RDD twice > -- > > Key: SPARK-8582 > URL: https://issues.apache.org/jira/browse/SPARK-8582 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Shixiong Zhu >Priority: Major > Labels: bulk-closed > > In Spark, checkpointing allows the user to truncate the lineage of his RDD > and save the intermediate contents to HDFS for fault tolerance. However, this > is not currently implemented super efficiently: > Every time we checkpoint an RDD, we actually compute it twice: once during > the action that triggered the checkpointing in the first place, and once > while we checkpoint (we iterate through an RDD's partitions and write them to > disk). See this line for more detail: > https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102. > Instead, we should have a `CheckpointingIterator` that writes checkpoint > data to HDFS while we run the action. This will speed up many usages of > `RDD#checkpoint` by 2X. > (Alternatively, the user can just cache the RDD before checkpointing it, but > this is not always viable for very large input data. It's also not a great > API to use in general.) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
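An illustrative sketch of the idea (not Spark's actual checkpointing code): wrap a partition's iterator so that every element is persisted to checkpoint storage as it is handed to the action, rather than recomputing the partition a second time only to write it out.
{code:scala}
// Writes each element as a side effect of consumption and commits the checkpoint
// file once the wrapped iterator is exhausted. write/commit are caller-supplied.
class CheckpointingIterator[T](sub: Iterator[T],
                               write: T => Unit,
                               commit: () => Unit) extends Iterator[T] {
  private[this] var committed = false

  override def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !committed) { // commit exactly once, when exhausted
      committed = true
      commit()
    }
    more
  }

  override def next(): T = {
    val elem = sub.next()
    write(elem) // persist the element while the action consumes it
    elem
  }
}
{code}
A real implementation would also need to handle consumers that stop early without draining the iterator.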
[jira] [Resolved] (SPARK-21076) R dapply doesn't return array or raw columns when array have different length
[ https://issues.apache.org/jira/browse/SPARK-21076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21076. -- Resolution: Incomplete > R dapply doesn't return array or raw columns when array have different length > - > > Key: SPARK-21076 > URL: https://issues.apache.org/jira/browse/SPARK-21076 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Xu Yang >Priority: Major > Labels: bulk-closed > > Calling SparkR::dapplyCollect with R functions that return dataframes > produces an error. This comes up when returning columns of binary data- ie. > serialized fitted models. Also happens when functions return columns > containing vectors. > [SPARK-16785|https://issues.apache.org/jira/browse/SPARK-16785] > still have this issue when input data is an array column not having the same > length on each vector, like: > {code} > head(test1) >key value > 1 4dda7d68a202e9e3 1595297780 > 2 4e08f349deb7392 641991337 > 3 4e105531747ee00b 374773009 > 4 4f1d5ef7fdb4620a 2570136926 > 5 4f63a71e6dde04cd 2117602722 > 6 4fa2f96b689624fc 3489692062, 1344510747, 1095592237, > 424510360, 3211239587 > sparkR.stop() > sc <- sparkR.init() > sqlContext <- sparkRSQL.init(sc) > spark_df = createDataFrame(sqlContext, test1) > # Fails > dapplyCollect(spark_df, function(x) x) > Caused by: org.apache.spark.SparkException: R computation failed with > Error in (function (..., deparse.level = 1, make.row.names = TRUE, > stringsAsFactors = default.stringsAsFactors()) : > invalid list argument: all variables should have the same length > at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108) > at > org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59) > at > org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29) > at > org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:186) > at > org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:183) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > ... 
1 more > # Works fine > spark_df <- selectExpr(spark_df, "key", "explode(value) value") > dapplyCollect(spark_df, function(x) x) > key value > 1 4dda7d68a202e9e3 1595297780 > 2 4e08f349deb7392 641991337 > 3 4e105531747ee00b 374773009 > 4 4f1d5ef7fdb4620a 2570136926 > 5 4f63a71e6dde04cd 2117602722 > 6 4fa2f96b689624fc 3489692062 > 7 4fa2f96b689624fc 1344510747 > 8 4fa2f96b689624fc 1095592237 > 9 4fa2f96b689624fc 424510360 > 10 4fa2f96b689624fc 3211239587 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16418) DataFrame.filter fails if it references a window function
[ https://issues.apache.org/jira/browse/SPARK-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-16418. -- Resolution: Incomplete > DataFrame.filter fails if it references a window function > - > > Key: SPARK-16418 > URL: https://issues.apache.org/jira/browse/SPARK-16418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Erik Wright >Priority: Major > Labels: bulk-closed > > I'm using Data Frames in Python. If I build up a column expression that > includes a window function, then filter on it, the resulting Data Frame > cannot be evaluated. > If I first add that column expression to the Data Frame as a column (or add > the sub-expression that includes the window function as a column), the filter > works. This works even if I later drop the added column. > It seems like this shouldn't be required. In the worst case, the platform > should be able to do this for me under the hood when/if necessary. > {code:none} > In [1]: from pyspark.sql import types > In [2]: from pyspark.sql import Window > In [3]: from pyspark.sql import functions > In [4]: schema = types.StructType([types.StructField('id', > types.IntegerType(), False), >...:types.StructField('state', > types.StringType(), True), >...:types.StructField('seq', > types.IntegerType(), False)]) > In [5]: original_data_frame = sc.sql.createDataFrame([(1, 'hello', 1),(1, > 'world', 2),(1,'world',3)], schema) > In [6]: previous_state = > functions.lag('state').over(Window.partitionBy('id').orderBy('seq').rowsBetween(-1,-1)) > In [7]: filter_condition = (original_data_frame['state'] == 'world') & > (previous_state != 'world') > In [8]: data_frame = original_data_frame.withColumn('filter_condition', > filter_condition) > In [9]: data_frame.show() > +---+-+---++ > | id|state|seq|filter_condition| > +---+-+---++ > | 1|hello| 1| false| > | 1|world| 2|true| > | 1|world| 3| false| > +---+-+---++ > In [10]: data_frame = > data_frame.filter(data_frame['filter_condition']).drop('filter_condition') > In [11]: data_frame.explain() > == Physical Plan == > WholeStageCodegen > : +- Project [id#0,state#1,seq#2] > : +- Filter (((isnotnull(state#1) && isnotnull(_we0#6)) && (state#1 = > world)) && NOT (_we0#6 = world)) > :+- INPUT > +- Window [lag(state#1, 1, null) windowspecdefinition(id#0, seq#2 ASC, ROWS > BETWEEN 1 PRECEDING AND 1 PRECEDING) AS _we0#6], [id#0], [seq#2 ASC] >+- WholeStageCodegen > : +- Sort [id#0 ASC,seq#2 ASC], false, 0 > : +- INPUT > +- Exchange hashpartitioning(id#0, 200), None > +- Scan ExistingRDD[id#0,state#1,seq#2] > In [12]: data_frame.show() > +---+-+---+ > | id|state|seq| > +---+-+---+ > | 1|world| 2| > +---+-+---+ > In [13]: data_frame = original_data_frame.withColumn('previous_state', > previous_state) > In [14]: data_frame.show() > +---+-+---+--+ > | id|state|seq|previous_state| > +---+-+---+--+ > | 1|hello| 1| null| > | 1|world| 2| hello| > | 1|world| 3| world| > +---+-+---+--+ > In [15]: filter_condition = (data_frame['state'] == 'world') & > (data_frame['previous_state'] != 'world') > In [16]: data_frame = > data_frame.filter(filter_condition).drop('previous_state') > In [17]: data_frame.explain() > == Physical Plan == > WholeStageCodegen > : +- Project [id#0,state#1,seq#2] > : +- Filter (((isnotnull(state#1) && isnotnull(previous_state#12)) && > (state#1 = world)) && NOT (previous_state#12 = world)) > :+- INPUT > +- Window [lag(state#1, 1, null) windowspecdefinition(id#0, seq#2 ASC, ROWS > BETWEEN 1 PRECEDING AND 1 PRECEDING) AS 
previous_state#12], [id#0], [seq#2 > ASC] >+- WholeStageCodegen > : +- Sort [id#0 ASC,seq#2 ASC], false, 0 > : +- INPUT > +- Exchange hashpartitioning(id#0, 200), None > +- Scan ExistingRDD[id#0,state#1,seq#2] > In [18]: data_frame.show() > +---+-+---+ > | id|state|seq| > +---+-+---+ > | 1|world| 2| > +---+-+---+ > In [19]: filter_condition = (original_data_frame['state'] == 'world') & > (previous_state != 'world') > In [20]: data_frame = original_data_frame.filter(filter_condition) > In [21]: data_frame.explain() > == Physical Plan == > WholeStageCodegen > : +- Filter ((isnotnull(state#1) && (state#1 = world)) && NOT (lag(state#1, > 1, null) windowspecdefinition(id#0, seq#2 ASC, ROWS BETWEEN 1 PRECEDING AND 1 > PRECEDING) = world)) > : +- INPUT > +- Scan ExistingRDD[id#0,state#1,seq#2] > In [22]: data_frame.show() >
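The workaround demonstrated above (materialize the window expression as a column, filter on it, then drop it) translates to Scala roughly as follows; this is a hypothetical sketch, with column names taken from the report.
{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

// Hypothetical Scala version of the workaround shown in the report: add the
// window expression as a column, filter on it, then drop the helper column.
val w = Window.partitionBy("id").orderBy("seq")
val previousState = lag(col("state"), 1).over(w)

val filtered = originalDataFrame            // DataFrame with columns id, state, seq
  .withColumn("previous_state", previousState)
  .filter(col("state") === "world" && col("previous_state") =!= "world")
  .drop("previous_state")
{code}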
[jira] [Resolved] (SPARK-22565) Session-based windowing
[ https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22565. -- Resolution: Incomplete > Session-based windowing > --- > > Key: SPARK-22565 > URL: https://issues.apache.org/jira/browse/SPARK-22565 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Richard Xin >Priority: Major > Labels: bulk-closed > Attachments: screenshot-1.png > > > I came across a requirement to support session-based windowing. For example, > user activity comes in from Kafka, and we want to create a window per user session > (if the time gap between activities from the same user exceeds a predefined value, > a new window will be created). > I noticed that Flink already supports this kind of windowing; is there any plan/schedule for > Spark to support it? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
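For reference, later Spark releases (3.2 and above) added a built-in session_window function that covers this use case; a rough sketch of how the requested gap-based sessions would look with it, assuming an events DataFrame with userId and eventTime columns:
{code:scala}
import org.apache.spark.sql.functions.{col, session_window}

// Rough sketch only: session_window was added in Spark 3.2, long after this
// ticket was filed. `events` with columns userId and eventTime is assumed.
val sessions = events
  .groupBy(session_window(col("eventTime"), "30 minutes"), col("userId"))
  .count()
{code}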
[jira] [Resolved] (SPARK-24969) SQL: to_date function can't parse date strings in different locales.
[ https://issues.apache.org/jira/browse/SPARK-24969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24969. -- Resolution: Incomplete > SQL: to_date function can't parse date strings in different locales. > > > Key: SPARK-24969 > URL: https://issues.apache.org/jira/browse/SPARK-24969 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Bare Spark 2.2.1 installation, on RHEL 6. >Reporter: Valentino Pinna >Priority: Major > Labels: bulk-closed > > The locale for {{org.apache.spark.sql.catalyst.util.DateTimeUtils}}, that is > internally used by {{to_date}} SQL function, is set in code to be > {{Locale.US}}. > This causes problems parsing a dataset which has dates in a different > (italian in this case) language. > {code:java} > spark.read.format("csv") > .option("sep", ";") > .csv(logFile) > .toDF("DATA", .) > .withColumn("DATA2", to_date(col("DATA"), " MMM")) > .show(10) > {code} > Results from example dataset: > |*DATA*|*DATA2*| > |2018 giu|null| > |2018 mag|null| > |2018 apr|2018-04-01| > |2018 mar|2018-03-01| > |2018 feb|2018-02-01| > |2018 gen|null| > |2017 dic|null| > |2017 nov|2017-11-01| > |2017 ott|null| > |2017 set|null| > Expected results: All values converted. > TEMPORARY WORKAROUND: > In object {{org.apache.spark.sql.catalyst.util.DateTimeUtils}}, replace all > instances of {{Locale.US}} with {{Locale.}} > ADDITIONAL NOTES: > I can make a pull request available on GitHub. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
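One user-side workaround (a sketch, not something proposed in the ticket) is to parse the dates through a UDF with an explicit java.util.Locale, sidestepping the hardcoded Locale.US; the "yyyy MMM" pattern and the column names are assumptions based on the sample data.
{code:scala}
import java.text.SimpleDateFormat
import java.util.Locale
import org.apache.spark.sql.functions.{col, udf}

// Sketch of a user-side workaround: parse with an explicit locale in a UDF
// instead of relying on to_date. The "yyyy MMM" pattern and column names are
// assumptions based on the example in the report.
val parseItalianMonth = udf { (s: String) =>
  if (s == null) null
  else new java.sql.Date(new SimpleDateFormat("yyyy MMM", Locale.ITALIAN).parse(s).getTime)
}

val parsed = df.withColumn("DATA2", parseItalianMonth(col("DATA")))
{code}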
[jira] [Resolved] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15516. -- Resolution: Incomplete > Schema merging in driver fails for parquet when merging LongType and > IntegerType > > > Key: SPARK-15516 > URL: https://issues.apache.org/jira/browse/SPARK-15516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Databricks >Reporter: Hossein Falaki >Priority: Major > Labels: bulk-closed > > I tried to create a table from partitioned parquet directories that requires > schema merging. I get following error: > {code} > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to merge incompatible data > types LongType and IntegerType > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:333) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829) > {code} > cc @rxin and [~mengxr] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19241) remove hive generated table properties if they are not useful in Spark
[ https://issues.apache.org/jira/browse/SPARK-19241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19241. -- Resolution: Incomplete > remove hive generated table properties if they are not useful in Spark > -- > > Key: SPARK-19241 > URL: https://issues.apache.org/jira/browse/SPARK-19241 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Major > Labels: bulk-closed > > When we save a table into hive metastore, hive will generate some table > properties automatically. e.g. transient_lastDdlTime, last_modified_by, > rawDataSize, etc. Some of them are useless in Spark SQL, we should remove > them. > It will be good if we can get the list of Hive-generated table properties via > Hive API, so that we don't need to hardcode them. > We can take a look at Hive code to see how it excludes these auto-generated > table properties when describe table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
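The hardcoded-exclusion approach the ticket would like to replace with a proper Hive API call looks roughly like this (a sketch; the property names are the ones listed above and the helper is hypothetical):
{code:scala}
// Hypothetical helper illustrating the hardcoded exclusion list the ticket
// wants to avoid maintaining by hand. Property names come from the description.
val hiveGeneratedProps = Set("transient_lastDdlTime", "last_modified_by", "rawDataSize")

def stripHiveGeneratedProps(props: Map[String, String]): Map[String, String] =
  props.filter { case (key, _) => !hiveGeneratedProps.contains(key) }
{code}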
[jira] [Resolved] (SPARK-21885) HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference enabled
[ https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21885. -- Resolution: Incomplete > HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference > enabled > --- > > Key: SPARK-21885 > URL: https://issues.apache.org/jira/browse/SPARK-21885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 > Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to > INFER_ONLY >Reporter: liupengcheng >Priority: Major > Labels: bulk-closed, slow, sql > > Currently, SparkSQL infer schema is too slow, almost take 2 minutes. > I digged into the code, and finally findout the reason: > 1. In the analysis process of LogicalPlan spark will try to infer table > schema if `spark.sql.hive.caseSensitiveInferenceMode` set to INFER_ONLY, and > it will list all the leaf files of the rootPaths(just tableLocation), and > then call `getFileBlockLocations` to turn `FileStatus` into > `LocatedFileStatus`. This `getFileBlockLocations` for so manny leaf files > will take a long time, and it seems that the locations info is never used. > 2. When infer a parquet schema, if there is only one file, it will still > launch a spark job to merge schema. I think it's expensive. > Time costly stack is as follow: > {code:java} > at org.apache.hadoop.ipc.Client.call(Client.java:1403) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224) > at > org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:301) > at > scala.collection.Tra
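If case-sensitive inference is not actually needed, one mitigation is to switch the mode off entirely; NEVER_INFER is one of the documented values of this config. A sketch, assuming an existing SparkSession `spark`:
{code:scala}
// Sketch: disable the expensive inference path described above when a
// case-sensitive schema is not required. NEVER_INFER and INFER_AND_SAVE are
// the other documented values besides INFER_ONLY.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")

// or set it when building the session
val session = org.apache.spark.sql.SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  .getOrCreate()
{code}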
[jira] [Resolved] (SPARK-23664) Add interface to collect query result through file iterator
[ https://issues.apache.org/jira/browse/SPARK-23664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23664. -- Resolution: Incomplete > Add interface to collect query result through file iterator > --- > > Key: SPARK-23664 > URL: https://issues.apache.org/jira/browse/SPARK-23664 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1, 2.3.0 >Reporter: zhoukang >Priority: Major > Labels: bulk-closed > > Currently, we use Spark SQL through JDBC. > Results may consume a lot of memory since we collect them and cache them in memory for > performance reasons. > However, we can also add an API to collect results through a file iterator (like the > parquet file iterator), so we can avoid OOMs in the thriftserver for big queries. > Like below: > {code:java} > result.toLocalIteratorThroughFile.asScala > {code} > I will work on this if it makes sense! > And in our internal cluster we have used this API for about a year. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
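For comparison, the existing Dataset.toLocalIterator() already streams a result back one partition at a time, although it still goes through driver memory rather than spilling to files as proposed above. A sketch, assuming an existing SparkSession `spark` and a placeholder table name:
{code:scala}
import scala.collection.JavaConverters._

// Existing alternative to collect(): iterate the result partition by partition.
// Unlike the file-backed iterator proposed above, this still buffers one
// partition at a time in driver memory. Table name is a placeholder.
val result = spark.sql("SELECT * FROM some_big_table")
result.toLocalIterator().asScala.foreach { row =>
  println(row) // process each row incrementally
}
{code}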
[jira] [Resolved] (SPARK-25428) Support plain Kerberos Authentication with Spark
[ https://issues.apache.org/jira/browse/SPARK-25428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25428. -- Resolution: Incomplete > Support plain Kerberos Authentication with Spark > > > Key: SPARK-25428 > URL: https://issues.apache.org/jira/browse/SPARK-25428 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Shruti Gumma >Priority: Major > Labels: bulk-closed, features > > Spark should work with plain Kerberos authentication. Currently, Spark can > work with Hadoop delegation tokens, but not plain Kerberos. Hadoop's > UserGroupInformation(UGI) class is responsible for handling security > authentication in Spark. This UserGroupInformation(UGI) has support for > Kerberos authentication, as well as Token authentication. Since Spark does > not work correctly with the Kerberos auth method, it leads to a gap in fully > supporting all the security authentication mechanisms. > > If Kerberos is used to login in UserGroupInformation(UGI) using keytabs at > the startup of drivers and executors, then Spark does not allow this > logged-in UserGroupInformation(UGI) user to correctly propagate. The > exception arises from the implementation of the runAsSparkUser method in > SparkHadoopUtil. > > The runAsSparkUser method in SparkHadoopUtil creates a new UGI based on the > current static UGI and then transfers credentials from this current static > UGI to the new UGI. This works well with other auth methods, except Kerberos. > Transfer credentials implementation is not conducive for Kerberos auth model > since it does not transfer all the required internal state of UGI( such as > isKeytab and isKrbTkt). For Kerberos, the UGI has to be created from > UGI.loginUserFromKeytab method only and not simply by doing a transfer > credentials from the previous UGI to the new UGI. > > Ideally, the CoarseGrainedExecutorBackend should login using keytab, similar > to MesosCoarseGrainedExecutorBackend. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
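The UGI.loginUserFromKeytab flow the ticket says drivers and executors should use directly is the standard Hadoop API shown below; the principal and keytab path are placeholders.
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// Standard Hadoop keytab login, as referenced in the ticket. The principal and
// keytab path are placeholders.
val hadoopConf = new Configuration()
hadoopConf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(hadoopConf)

UserGroupInformation.loginUserFromKeytab("spark/host@EXAMPLE.COM", "/etc/security/keytabs/spark.keytab")
val ugi = UserGroupInformation.getLoginUser
{code}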
[jira] [Resolved] (SPARK-24764) Add ServiceLoader implementation for SparkHadoopUtil
[ https://issues.apache.org/jira/browse/SPARK-24764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24764. -- Resolution: Incomplete > Add ServiceLoader implementation for SparkHadoopUtil > > > Key: SPARK-24764 > URL: https://issues.apache.org/jira/browse/SPARK-24764 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Shruti Gumma >Assignee: Shruti Gumma >Priority: Major > Labels: bulk-closed > > Currently SparkHadoopUtil creation is static and cannot be changed. This > proposal is to move the creation of SparkHadoopUtil to a ServiceLoader, so > that external cluster managers can create their own SparkHadoopUtil. > In the case of Yarn, a specific YarnSparkHadoopUtil is used in the Yarn > packages whereas in other places, SparkHadoopUtil is used. SparkHadoopUtil > has been changed in v2.3.0 to work with Yarn, leaving external cluster > managers with no configurable way of doing the same. > The util classes such as SparkHadoopUtil should be made configurable, similar > to how the ExternalClusterManager was made pluggable through a ServiceLoader. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
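The ServiceLoader mechanism being proposed follows the standard JDK pattern sketched below; the provider trait and its discovery are illustrative only, not actual Spark classes.
{code:scala}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Illustrative ServiceLoader pattern of the kind the ticket proposes; the
// trait name is hypothetical, not an actual Spark interface.
trait SparkHadoopUtilProvider {
  def createHadoopUtil(): AnyRef
}

// Implementations are listed in a classpath resource under META-INF/services/
// named after the fully qualified provider interface; pick the first
// registered provider, if any, otherwise a built-in default would be used.
val provider: Option[SparkHadoopUtilProvider] =
  ServiceLoader.load(classOf[SparkHadoopUtilProvider]).asScala.headOption
{code}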
[jira] [Resolved] (SPARK-18822) Support ML Pipeline in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-18822. -- Resolution: Incomplete > Support ML Pipeline in SparkR > - > > Key: SPARK-18822 > URL: https://issues.apache.org/jira/browse/SPARK-18822 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Felix Cheung >Priority: Major > Labels: bulk-closed > > From Joseph Bradley: > " > Supporting Pipelines and advanced use cases: There really needs to be more > design discussion around SparkR. Felix Cheung would you be interested in > leading some discussion? I'm envisioning something similar to what was done a > while back for Pipelines in Scala/Java/Python, where we consider several use > cases of MLlib: fitting a single model, creating and tuning a complex > Pipeline, and working with multiple languages. That should help inform what > APIs should look like in Spark R. > " > Certain ML model, such as OneVsRest, is harder to represent in a single call > R API. Having advanced API or Pipeline API like this could help to expose > that to our users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24617) Spark driver not requesting another executor once original executor exits due to 'lost worker'
[ https://issues.apache.org/jira/browse/SPARK-24617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24617. -- Resolution: Incomplete > Spark driver not requesting another executor once original executor exits due > to 'lost worker' > -- > > Key: SPARK-24617 > URL: https://issues.apache.org/jira/browse/SPARK-24617 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: t oo >Priority: Major > Labels: bulk-closed > > I am running Spark v2.1.1 in 'standalone' mode (no yarn/mesos) across EC2s. I > have 1 master ec2 that acts as the driver (since spark-submit is called on > this host), spark.master is setup, deploymode is client (so sparksubmit only > returns a ReturnCode to the putty window once it finishes processing). I have > 1 worker ec2 that is registered with the spark master. When i run sparksubmit > on the master, I can see in the WebUI that executors starting on the worker > and I can verify successful completion. However if while the sparksubmit is > running and the worker ec2 gets terminated and then new ec2 worker becomes > alive 3mins later and registers with the master, I have noticed on the webui > that it shows 'cannot find address' in the executor status but the driver > keeps waiting forever (2 days later I kill it) or in some cases the driver > allocates tasks to the new worker only 5 hours later and then completes! Is > there some setting i am missing that would explain this behavior? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24240) Add a config to control whether InMemoryFileIndex should update cache when refresh.
[ https://issues.apache.org/jira/browse/SPARK-24240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24240. -- Resolution: Incomplete > Add a config to control whether InMemoryFileIndex should update cache when > refresh. > --- > > Key: SPARK-24240 > URL: https://issues.apache.org/jira/browse/SPARK-24240 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: jin xing >Priority: Major > Labels: bulk-closed > > In the current > code ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L172]), > after data is inserted, Spark will always refresh the file index and update the > cache. If the target table has tons of files, the job will suffer long refresh times and OOM > issues. Could we add a config to control whether InMemoryFileIndex should > update the cache on refresh? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15691) Refactor and improve Hive support
[ https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15691. -- Resolution: Incomplete > Refactor and improve Hive support > - > > Key: SPARK-15691 > URL: https://issues.apache.org/jira/browse/SPARK-15691 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Major > Labels: bulk-closed > > Hive support is important to Spark SQL, as many Spark users use it to read > from Hive. The current architecture is very difficult to maintain, and this > ticket tracks progress towards getting us to a sane state. > A number of things we want to accomplish are: > - Move the Hive specific catalog logic into HiveExternalCatalog. > -- Remove HiveSessionCatalog. All Hive-related stuff should go into > HiveExternalCatalog. This would require moving caching either into > HiveExternalCatalog, or just into SessionCatalog. > -- Move using properties to store data source options into > HiveExternalCatalog (So, for a CatalogTable returned by HiveExternalCatalog, > we do not need to distinguish tables stored in hive formats and data source > tables). > -- Potentially more. > - Remove HIve's specific ScriptTransform implementation and make it more > general so we can put it in sql/core. > - Implement HiveTableScan (and write path) as a data source, so we don't need > a special planner rule for HiveTableScan. > - Remove HiveSharedState and HiveSessionState. > One thing that is still unclear to me is how to work with Hive UDF support. > We might still need a special planner rule there. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24450) Error: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.curator.utils.PathUtils.validatePath(Ljava/lang/String;)Ljava/lang/String;
[ https://issues.apache.org/jira/browse/SPARK-24450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24450. -- Resolution: Incomplete > Error: Exception in thread "main" java.lang.NoSuchMethodError: > org.apache.curator.utils.PathUtils.validatePath(Ljava/lang/String;)Ljava/lang/String; > > > Key: SPARK-24450 > URL: https://issues.apache.org/jira/browse/SPARK-24450 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.2, 2.2.0 >Reporter: Maxim Dahenko >Priority: Major > Labels: bulk-closed > > Hello, > artifact org.apache.curator, version 2.7.1 and higher doesn't work in a spark > job. > pom.xml file: > {code:java} > > http://maven.apache.org/POM/4.0.0"; > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"; > xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 > http://maven.apache.org/maven-v4_0_0.xsd";> > 4.0.0 > com.test > test > 0.0.1-SNAPSHOT > jar > "test" > > > > maven-assembly-plugin > > > test > > > > > > > > org.apache.curator > curator-client > 2.7.1 > > > > {code} > Source code src/main/java/com/test/Test.java: > {code:java} > package com.test; > import org.apache.curator.utils.PathUtils; > public class Test { > public static void main(String[] args) throws Exception { > PathUtils.validatePath("/tmp"); > } > } > {code} > Result > {code:java} > spark-submit --class com.test.Test --master local test-0.0.1-SNAPSHOT.jar > Exception in thread "main" java.lang.NoSuchMethodError: > org.apache.curator.utils.PathUtils.validatePath(Ljava/lang/String;)Ljava/lang/String; > at com.test.Test.main(Test.java:7) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25030) SparkSubmit.doSubmit will not return result if the mainClass submitted creates a Timer()
[ https://issues.apache.org/jira/browse/SPARK-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25030. -- Resolution: Incomplete > SparkSubmit.doSubmit will not return result if the mainClass submitted > creates a Timer() > > > Key: SPARK-25030 > URL: https://issues.apache.org/jira/browse/SPARK-25030 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Xingbo Jiang >Priority: Major > Labels: bulk-closed > > Creating a Timer() in the mainClass submitted to SparkSubmit makes it unable to > fetch the result; it is very easy to reproduce the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
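A plausible mechanism (my reading, not stated in the ticket) is that java.util.Timer() starts a non-daemon thread by default, so the JVM running the submitted main class never exits on its own; a minimal reproduction sketch:
{code:scala}
import java.util.{Timer, TimerTask}

// Minimal reproduction sketch. new Timer() starts a NON-daemon thread, so the
// JVM does not exit when main() returns, which plausibly explains why the
// submission never sees a result. (Assumption: this is the mechanism at play.)
object TimerApp {
  def main(args: Array[String]): Unit = {
    val timer = new Timer() // non-daemon by default
    timer.schedule(new TimerTask { def run(): Unit = println("tick") }, 1000L, 1000L)
    println("main() returned, but the timer thread keeps the JVM alive")
    // Possible fixes: new Timer(true) for a daemon thread, or timer.cancel().
  }
}
{code}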
[jira] [Resolved] (SPARK-23292) python tests related to pandas are skipped with python 2
[ https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23292. -- Resolution: Incomplete > python tests related to pandas are skipped with python 2 > > > Key: SPARK-23292 > URL: https://issues.apache.org/jira/browse/SPARK-23292 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Yin Huai >Priority: Critical > Labels: bulk-closed > > I was running python tests and found that > [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548] > does not run with Python 2 because the test uses "assertRaisesRegex" > (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python > 2). However, spark jenkins does not fail because of this issue (see run > history at > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]). > After looking into this issue, [seems test script will skip tests related to > pandas if pandas is not > installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63], > which means that jenkins does not have pandas installed. > > Since pyarrow related tests have the same skipping logic, we will need to > check if jenkins has pyarrow installed correctly as well. > > Since features using pandas and pyarrow are in 2.3, we should fix the test > issue and make sure all tests pass before we make the release. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15777) Catalog federation
[ https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15777. -- Resolution: Incomplete > Catalog federation > -- > > Key: SPARK-15777 > URL: https://issues.apache.org/jira/browse/SPARK-15777 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Major > Labels: bulk-closed > Attachments: SparkFederationDesign.pdf > > > This is a ticket to track progress to support federating multiple external > catalogs. This would require establishing an API (similar to the current > ExternalCatalog API) for getting information about external catalogs, and > ability to convert a table into a data source table. > As part of this, we would also need to be able to support more than a > two-level table identifier (database.table). At the very least we would need > a three level identifier for tables (catalog.database.table). A possibly > direction is to support arbitrary level hierarchical namespaces similar to > file systems. > Once we have this implemented, we can convert the current Hive catalog > implementation into an external catalog that is "mounted" into an internal > catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21040) On executor/worker decommission consider speculatively re-launching current tasks
[ https://issues.apache.org/jira/browse/SPARK-21040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21040. -- Resolution: Incomplete > On executor/worker decommission consider speculatively re-launching current > tasks > - > > Key: SPARK-21040 > URL: https://issues.apache.org/jira/browse/SPARK-21040 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Holden Karau >Priority: Major > Labels: bulk-closed > > If speculative execution is enabled we may wish to consider decommissioning > of worker as a weight for speculative execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24905) Spark 2.3 Internal URL env variable
[ https://issues.apache.org/jira/browse/SPARK-24905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24905. -- Resolution: Incomplete > Spark 2.3 Internal URL env variable > --- > > Key: SPARK-24905 > URL: https://issues.apache.org/jira/browse/SPARK-24905 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Björn Wenzel >Priority: Major > Labels: bulk-closed > > Currently the Kubernetes Master internal URL is hardcoded in the > Constants.scala file > ([https://github.com/apache/spark/blob/85fe1297e35bcff9cf86bd53fee615e140ee5bfb/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L75)] > In some cases these URL should be changed e.g. if the Certificate is valid > for another Hostname. > Is it possible to make this URL a property like: > spark.kubernetes.authenticate.driver.hostname? > Kubernetes The Hard Way maintained by Kelsey Hightower for example uses > kubernetes.default as hostname, this will produce again a > SSLPeerUnverifiedException. > > Here is the use of the Hardcoded Host: > [https://github.com/apache/spark/blob/85fe1297e35bcff9cf86bd53fee615e140ee5bfb/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala#L52] > maybe this could be changed like the KUBERNETES_NAMESPACE property in Line > 53. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24494) Give users possibility to skip own classes in SparkContext.getCallSite()
[ https://issues.apache.org/jira/browse/SPARK-24494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24494. -- Resolution: Incomplete > Give users possibility to skip own classes in SparkContext.getCallSite() > > > Key: SPARK-24494 > URL: https://issues.apache.org/jira/browse/SPARK-24494 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.2.0 > Environment: any (currently using Spark 2.2.0 in both local mode and > with YARN) >Reporter: Florian Kaspar >Priority: Minor > Labels: bulk-closed > > org.apache.spark.SparkContext.getCallSite() uses > org.apache.spark.util.Utils.getCallSite() which by default skips Spark and > Scala classes. > It would be nice to be able to add user-defined classes/patterns to skip here > as well. > We have one central class that acts as some kind of proxy for RDD > transformations and is used by many other classes. > In the SparkUI we would like to see the real origin of a transformation being > not the proxy but its caller. This would make the UI even more useful to us. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23796) There's no API to change state RDD's name
[ https://issues.apache.org/jira/browse/SPARK-23796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23796. -- Resolution: Incomplete > There's no API to change state RDD's name > - > > Key: SPARK-23796 > URL: https://issues.apache.org/jira/browse/SPARK-23796 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: István Gansperger >Priority: Minor > Labels: bulk-closed > > I use a few {{mapWithState}} stream operations in my application and at some > point it became a minor inconvenience that I could not figure out how to set > the state RDD's name or serialization level. Searching around didn't really > help and I have not come across any issues regarding this (pardon my > inability to find it if there's one). It could be useful to see how much > memory each state uses if the user has multiple such transformations. > I have used some ugly reflection-based code to be able to set the name of the > state RDD and also the serialization level. I understand that the latter may > be intentionally limited, but I haven't come across any issues caused by this > apart from slightly degraded performance in exchange for a bit less memory > usage. Are these limitations in place intentionally or is it just an > oversight? Having some extra methods for these on {{StateSpec}} could be > useful in my opinion. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14604) Modify design of ML model summaries
[ https://issues.apache.org/jira/browse/SPARK-14604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-14604. -- Resolution: Incomplete > Modify design of ML model summaries > --- > > Key: SPARK-14604 > URL: https://issues.apache.org/jira/browse/SPARK-14604 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed > > Several spark.ml models now have summaries containing evaluation metrics and > training info: > * LinearRegressionModel > * LogisticRegressionModel > * GeneralizedLinearRegressionModel > These summaries have unfortunately been added in an inconsistent way. I > propose to reorganize them to have: > * For each model, 1 summary (without training info) and 1 training summary > (with info from training). The non-training summary can be produced for a > new dataset via {{evaluate}}. > * A summary should not store the model itself as a public field. > * A summary should provide a transient reference to the dataset used to > produce the summary. > This task will involve reorganizing the GLM summary (which lacks a > training/non-training distinction) and deprecating the model method in the > LinearRegressionSummary. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma
[ https://issues.apache.org/jira/browse/SPARK-23837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23837. -- Resolution: Incomplete > Create table as select gives exception if the spark generated alias name > contains comma > --- > > Key: SPARK-23837 > URL: https://issues.apache.org/jira/browse/SPARK-23837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: shahid >Priority: Minor > Labels: bulk-closed > > For spark generated alias name contains comma, Hive metastore throws > exception. > > 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 > decimal(18,5)); > +---++ > |Result| > +---++ > +---++ > No rows selected (0.171 seconds) > 0: jdbc:hive2://ha-cluster/default> select col1*col2 from a; > > +---+ > |(CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))| > +---+ > > +---+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from > a; > Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a > column whose name contains commas in Hive metastore. Table: `default`.`b`; > Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); > (state=,code=0) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
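A common way to avoid this class of error (a sketch, not taken from the ticket) is to give the computed column an explicit alias so the generated name containing commas never reaches the Hive metastore:
{code:scala}
// Sketch of a typical workaround: alias the expression so the stored column
// name contains no commas. Assumes an existing SparkSession `spark`.
spark.sql("CREATE TABLE b AS SELECT col1 * col2 AS col1_times_col2 FROM a")
{code}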
[jira] [Resolved] (SPARK-24837) Add kafka as spark metrics sink
[ https://issues.apache.org/jira/browse/SPARK-24837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24837. -- Resolution: Incomplete > Add kafka as spark metrics sink > --- > > Key: SPARK-24837 > URL: https://issues.apache.org/jira/browse/SPARK-24837 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Sandish Kumar HN >Priority: Major > Labels: bulk-closed > > Sink Spark metrics to a Kafka producer, under > spark/core/src/main/scala/org/apache/spark/metrics/sink/ > Could someone assign this to me? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24273) Failure while using .checkpoint method to private S3 store via S3A connector
[ https://issues.apache.org/jira/browse/SPARK-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24273. -- Resolution: Incomplete > Failure while using .checkpoint method to private S3 store via S3A connector > > > Key: SPARK-24273 > URL: https://issues.apache.org/jira/browse/SPARK-24273 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.0 >Reporter: Jami Malikzade >Priority: Major > Labels: bulk-closed > > We are getting the following error: > {code} > com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 416, AWS > Service: Amazon S3, AWS Request ID: > tx14126-005ae9bfd9-9ed9ac2-default, AWS Error Code: > InvalidRange, AWS Error Message: null, S3 Extended Request ID: > 9ed9ac2-default-default" > {code} > when we use the checkpoint method as below. > {code} > val streamBucketDF = streamPacketDeltaDF > .filter('timeDelta > maxGap && 'timeDelta <= 3) > .withColumn("bucket", when('timeDelta <= mediumGap, "medium") > .otherwise("large") > ) > .checkpoint() > {code} > Do you have an idea how to prevent the invalid range header from being sent, or how it > can be worked around or fixed? > Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25351. -- Resolution: Incomplete > Handle Pandas category type when converting from Python with Arrow > -- > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Bryan Cutler >Priority: Major > Labels: bulk-closed > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: >A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > Bcategory > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) >1667 spark_type = ArrayType(from_arrow_type(at.value_type)) >1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) >1670 return spark_type >1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23669) Executors fetch jars and name the jars with md5 prefix
[ https://issues.apache.org/jira/browse/SPARK-23669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23669. -- Resolution: Incomplete > Executors fetch jars and name the jars with md5 prefix > -- > > Key: SPARK-23669 > URL: https://issues.apache.org/jira/browse/SPARK-23669 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: jin xing >Priority: Minor > Labels: bulk-closed > > In our cluster, there are lots of UDF jars, some of them have the same > filename but different paths, for example: > ``` > hdfs://A/B/udf.jar -> udfA > hdfs://C/D/udf.jar -> udfB > ``` > When a user uses udfA and udfB in the same SQL, the executor will fetch both > *hdfs://A/B/udf.jar* and *hdfs://C/D/udf.jar.* There will be a conflict for > the same name. > > Can we add a config to fetch jars and save them with an MD5-prefixed filename, so > there will be no conflict? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect
[ https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25219. -- Resolution: Incomplete > KMeans Clustering - Text Data - Results are incorrect > - > > Key: SPARK-25219 > URL: https://issues.apache.org/jira/browse/SPARK-25219 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Vasanthkumar Velayudham >Priority: Major > Labels: bulk-closed > Attachments: Apache_Logs_Results.xlsx, SKLearn_Kmeans.txt, > Spark_Kmeans.txt > > > Hello Everyone, > I am facing issues with the usage of KMeans Clustering on my text data. When > I apply clustering on my text data, after performing various transformations > such as RegexTokenizer, Stopword Processing, HashingTF, IDF, generated > clusters are not proper and one cluster is found to have lot of data points > assigned to it. > I am able to perform clustering with similar kind of processing and with the > same attributes on the SKLearn KMeans algorithm. > Upon searching in internet, I observe many have reported the same issue with > KMeans clustering library of Spark. > Request your help in fixing this issue. > Please let me know if you require any additional details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
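For reference, the preprocessing chain described in the report corresponds roughly to the spark.ml pipeline below; the input column name, number of features, and k are assumptions, not values from the ticket.
{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, IDF, RegexTokenizer, StopWordsRemover}

// Rough reconstruction of the preprocessing described in the report; the
// input column "text", numFeatures, and k=10 are assumptions.
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("tokens")
val remover   = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
val hashingTF = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val kmeans    = new KMeans().setK(10).setFeaturesCol("features").setSeed(1L)

val model = new Pipeline()
  .setStages(Array(tokenizer, remover, hashingTF, idf, kmeans))
  .fit(textDF) // textDF: DataFrame with a string column "text"
{code}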
[jira] [Resolved] (SPARK-25180) Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails
[ https://issues.apache.org/jira/browse/SPARK-25180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25180. -- Resolution: Incomplete > Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname > fails > --- > > Key: SPARK-25180 > URL: https://issues.apache.org/jira/browse/SPARK-25180 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: mac laptop running on a corporate guest wifi, presumably > a wifi with odd DNS settings. >Reporter: Steve Loughran >Priority: Minor > Labels: bulk-closed > > trying to save work on spark standalone can fail if netty RPC cannot > determine the hostname. While that's a valid failure on a real cluster, in > standalone falling back to localhost rather than inferred "hw13176.lan" value > may be the better option. > note also, the abort* call failed; NPE. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12126) JDBC datasource processes filters only commonly pushed down.
[ https://issues.apache.org/jira/browse/SPARK-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-12126. -- Resolution: Incomplete > JDBC datasource processes filters only commonly pushed down. > > > Key: SPARK-12126 > URL: https://issues.apache.org/jira/browse/SPARK-12126 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Major > Labels: bulk-closed > > As suggested > [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646], > currently the JDBC datasource only processes the filters pushed down from > {{DataSourceStrategy}}. > Unlike ORC or Parquet, it could handle a much wider range of filters (for example, > a + b > 3) since pushing a filter to JDBC is just a matter of SQL string generation. > As > [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526], > using the {{CatalystScan}} trait might be one possible solution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
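Until broader pushdown is available, a common user-side pattern is to hand the predicate to the database yourself through a subquery in the dbtable option (a sketch; the URL, table, and predicate are examples only):
{code:scala}
// Sketch of the usual user-side workaround: let the database evaluate an
// arbitrary predicate by wrapping it in a subquery passed as dbtable.
// URL, table, credentials and predicate here are examples only.
val pushed = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "(SELECT * FROM t WHERE a + b > 3) AS pushed_t")
  .option("user", "spark")
  .option("password", "secret")
  .load()
{code}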
[jira] [Resolved] (SPARK-20629) Copy shuffle data when nodes are being shut down
[ https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20629. -- Resolution: Incomplete > Copy shuffle data when nodes are being shut down > > > Key: SPARK-20629 > URL: https://issues.apache.org/jira/browse/SPARK-20629 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Holden Karau >Priority: Major > Labels: bulk-closed > > We decided not to do this for YARN, but for EC2/GCE and similar systems nodes > may be shut down entirely without the ability to keep an AuxiliaryService > around. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24293) Serialized shuffle supports mapSideCombine
[ https://issues.apache.org/jira/browse/SPARK-24293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24293. -- Resolution: Incomplete > Serialized shuffle supports mapSideCombine > -- > > Key: SPARK-24293 > URL: https://issues.apache.org/jira/browse/SPARK-24293 > Project: Spark > Issue Type: Brainstorming > Components: Shuffle >Affects Versions: 2.3.0 >Reporter: Xianjin YE >Priority: Major > Labels: bulk-closed > > While doing research on integrating my company's internal Shuffle Service > with Spark, I found it is possible to support mapSideCombine with serialized > shuffle. > The simple idea is that the `UnsafeShuffleWriter` uses a `Combiner` to > accumulate records when mapSideCombine is required before inserting into > `ShuffleExternalSorter`. The `Combiner` will tracking it's memory usage or > elements accumulated and is never spilled. When the `Combiner` accumulates > enough records(varied by different strategies), the accumulated (K, C) pairs > are then inserted into the `ShuffleExternalSorter`. After that, the > `Combiner` is reset to empty state. > After this change, combinedValues are sent to sorter segment by segment, and > the `BlockStoreShuffleReader` already handles this case. > I did a local POC, and looks like that I can get the same result with normal > SortShuffle. The performance is not optimized yet. The most significant part > of code is shown as below: > {code:java} > // code placeholder > while (records.hasNext()) { > Product2 record = records.next(); > if (this.mapSideCombine) { > this.aggregator.accumulateRecord(record); > if (this.aggregator.accumulatedKeyNum() >= 160_000) { // for poc > scala.collection.Iterator> combinedIterator = > this.aggregator.accumulatedIterator(); > while (combinedIterator.hasNext()) { > insertRecordIntoSorter(combinedIterator.next()); > } > this.aggregator.resetAccumulation(); > } > } else { > insertRecordIntoSorter(record); > } > } > if (this.mapSideCombine && this.aggregator.accumulatedKeyNum() > 0) { > scala.collection.Iterator> combinedIterator = > this.aggregator.accumulatedIterator(); > while (combinedIterator.hasNext()) { > insertRecordIntoSorter(combinedIterator.next()); > } > this.aggregator.resetAccumulation(1); > } > {code} > > Is there something I am missing? cc [~joshrosen] [~cloud_fan] [~XuanYuan] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23740) Add FPGrowth Param for filtering out very common items
[ https://issues.apache.org/jira/browse/SPARK-23740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23740. -- Resolution: Incomplete > Add FPGrowth Param for filtering out very common items > -- > > Key: SPARK-23740 > URL: https://issues.apache.org/jira/browse/SPARK-23740 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed > > It would be handy to have a Param in FPGrowth for filtering out very common > items. This is from a use case where the dataset had items appearing in > 99.9%+ of the rows. These common items were useless, but they caused the > algorithm to generate many unnecessary itemsets. Filtering useless common > items beforehand can make the algorithm much faster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
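The ticket above asks for a built-in Param; until one exists, a pre-filter on the input is the usual workaround. A minimal sketch, assuming a toy DataFrame with an {{items}} column and an illustrative {{maxSupport}} threshold (neither is part of the FPGrowth API):
{code:scala}
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("fp-prefilter").getOrCreate()
import spark.implicits._

val transactions = Seq(
  Seq("a", "b", "common"),
  Seq("a", "c", "common"),
  Seq("b", "c", "common")
).toDF("items")

// Drop items that occur in more than maxSupport of all rows before mining.
val maxSupport = 0.999
val rowCount = transactions.count()
val tooCommon = transactions
  .select(explode(col("items")).as("item"))
  .groupBy("item")
  .count()
  .filter(col("count") > maxSupport * rowCount)
  .select("item")
  .as[String]
  .collect()
  .toSet

val dropCommon = udf((items: Seq[String]) => items.filterNot(tooCommon))
val cleaned = transactions.withColumn("items", dropCommon(col("items")))

val model = new FPGrowth().setItemsCol("items").setMinSupport(0.3).fit(cleaned)
model.freqItemsets.show()
{code}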
[jira] [Resolved] (SPARK-15041) adding mode strategy for ml.feature.Imputer for categorical features
[ https://issues.apache.org/jira/browse/SPARK-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15041. -- Resolution: Incomplete > adding mode strategy for ml.feature.Imputer for categorical features > > > Key: SPARK-15041 > URL: https://issues.apache.org/jira/browse/SPARK-15041 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: yuhao yang >Priority: Minor > Labels: bulk-closed > > Adding a mode strategy to ml.feature.Imputer for categorical features. This > needs to wait until the PR for SPARK-13568 gets merged. > https://github.com/apache/spark/pull/11601 > From the comments of jkbradley and Nick Pentreath in the PR: > {quote} > Investigate efficiency of approaches using DataFrame/Dataset and/or approx > approaches such as frequentItems or Count-Min Sketch (will require an update > to CMS to return "heavy-hitters"). > Investigate if we can use metadata to only allow mode for categorical > features (or perhaps as an easier alternative, allow mode for only Int/Long > columns). > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
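There is no mode strategy in the Imputer yet; as a rough sense of what it would compute, here is a minimal DataFrame-level sketch, assuming a single column whose values are strings or numerics (the helper name is illustrative and this is not the ml.feature.Imputer API):
{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Pick the most frequent non-null value of `column` and fill missing entries with it.
def imputeWithMode(df: DataFrame, column: String): DataFrame = {
  val mode = df
    .filter(col(column).isNotNull)
    .groupBy(column)
    .count()
    .orderBy(desc("count"))
    .select(column)
    .head()
    .get(0)
  // na.fill with a Map only supports primitive and string values.
  df.na.fill(Map(column -> mode))
}
{code}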
[jira] [Resolved] (SPARK-23858) Need to apply pyarrow adjustments to complex types with DateType/TimestampType
[ https://issues.apache.org/jira/browse/SPARK-23858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23858. -- Resolution: Incomplete > Need to apply pyarrow adjustments to complex types with > DateType/TimestampType > --- > > Key: SPARK-23858 > URL: https://issues.apache.org/jira/browse/SPARK-23858 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Priority: Major > Labels: bulk-closed > > Currently, ArrayTypes with DateType and TimestampType need to perform the > same adjustments as simple types, e.g. > {{_check_series_localize_timestamps}}, and that should work for nested types > as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23237) Add UI / endpoint for threaddumps for executors with active tasks
[ https://issues.apache.org/jira/browse/SPARK-23237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23237. -- Resolution: Incomplete > Add UI / endpoint for threaddumps for executors with active tasks > - > > Key: SPARK-23237 > URL: https://issues.apache.org/jira/browse/SPARK-23237 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > Labels: bulk-closed > > Frequently, when there are a handful of straggler tasks, users want to know > what is going on in those executors running the stragglers. Currently, that > is a bit of a pain to do: you have to go to the page for your active stage, > find the task, figure out which executor it's on, then go to the executors > page, and get the thread dump. Or maybe you just go to the executors page, > find the executor with an active task, and then click on that, but that > doesn't work if you've got multiple stages running. > Users could figure this out by extracting the info from the stage REST endpoint, > but it's such a common thing to do that we should make it easy. > I realize that figuring out a good way to do this is a little tricky. We > don't want to make it easy to end up pulling thread dumps from 1000 executors > back to the driver. So we've got to come up with a reasonable heuristic for > choosing which executors to poll. And we've also got to find a suitable > place to put this. > My suggestion is that the stage page always has a link to the thread dumps > for the *one* executor with the longest-running task. And there would be a > corresponding endpoint in the REST API with the same info, maybe at > {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/slowestTaskThreadDump}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22728) Unify artifact access for (mesos, standalone and yarn) when HDFS is available
[ https://issues.apache.org/jira/browse/SPARK-22728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22728. -- Resolution: Incomplete > Unify artifact access for (mesos, standalone and yarn) when HDFS is available > - > > Key: SPARK-22728 > URL: https://issues.apache.org/jira/browse/SPARK-22728 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > Labels: bulk-closed > > A unified cluster layer for caching artifacts would be very useful, similar to > the work that has been done for Flink: > https://issues.apache.org/jira/browse/FLINK-6177 > It would be great to make the Hadoop Distributed Cache available when we > deploy jobs in Mesos and Standalone envs. HDFS is often present in many > end-to-end apps out there, so we should have an option for using it. > I am creating this JIRA as a follow-up of the discussion here: > https://github.com/apache/spark/pull/18587#issuecomment-314718391 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress
[ https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24748. -- Resolution: Incomplete > Support for reporting custom metrics via Streaming Query Progress > - > > Key: SPARK-24748 > URL: https://issues.apache.org/jira/browse/SPARK-24748 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Arun Mahadevan >Priority: Major > Labels: bulk-closed > > Currently the Structured Streaming sources and sinks do not have a way to > report custom metrics. Providing an option to report custom metrics and > making them available via the Streaming Query progress would enable sources and sinks > to report custom progress information (e.g. the lag metrics for the Kafka source). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
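Today the only hook for consuming per-batch metrics is the progress object handed to a {{StreamingQueryListener}}; the ticket asks for a way to extend that object with source/sink-specific fields. A minimal sketch of the existing plumbing (the listener class name is illustrative, and the proposed custom-metrics field itself does not exist):
{code:scala}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class ProgressLogger extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Per-source progress already carries standard metrics such as numInputRows;
    // custom metrics (e.g. Kafka lag) would have to be surfaced alongside these.
    event.progress.sources.foreach { s =>
      println(s"${s.description}: ${s.numInputRows} rows in this trigger")
    }
  }
}

// Attach to an active SparkSession:
// spark.streams.addListener(new ProgressLogger())
{code}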
[jira] [Resolved] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine", "
[ https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22918. -- Resolution: Incomplete > sbt test (spark - local) fail after upgrading to 2.2.1 with: > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > > > Key: SPARK-22918 > URL: https://issues.apache.org/jira/browse/SPARK-22918 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Damian Momot >Priority: Major > Labels: bulk-closed > > After upgrading 2.2.0 -> 2.2.1 sbt test command in one of my projects started > to fail with following exception: > {noformat} > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > at > java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) > at > java.security.AccessController.checkPermission(AccessController.java:884) > at > org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown > Source) > at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown > Source) > at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at java.lang.Class.newInstance(Class.java:442) > at > org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47) > at > org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:240) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:286) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187) > at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333) > at > or
[jira] [Resolved] (SPARK-21084) Improvements to dynamic allocation for notebook use cases
[ https://issues.apache.org/jira/browse/SPARK-21084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21084. -- Resolution: Incomplete > Improvements to dynamic allocation for notebook use cases > - > > Key: SPARK-21084 > URL: https://issues.apache.org/jira/browse/SPARK-21084 > Project: Spark > Issue Type: Umbrella > Components: Block Manager, Scheduler, Spark Core, YARN >Affects Versions: 2.2.0, 2.3.0 >Reporter: Frederick Reiss >Priority: Major > Labels: bulk-closed > > One important application of Spark is to support many notebook users with a > single YARN or Spark Standalone cluster. We at IBM have seen this > requirement across multiple deployments of Spark: on-premises and private > cloud deployments at our clients, as well as on the IBM cloud. The scenario > goes something like this: "Every morning at 9am, 500 analysts log into their > computers and start running Spark notebooks intermittently for the next 8 > hours." I'm sure that many other members of the community are interested in > making similar scenarios work. > > Dynamic allocation is supposed to support these kinds of use cases by > shifting cluster resources towards users who are currently executing scalable > code. In our own testing, we have encountered a number of issues with using > the current implementation of dynamic allocation for this purpose: > *Issue #1: Starvation.* A Spark job acquires all available containers, > preventing other jobs or applications from starting. > *Issue #2: Request latency.* Jobs that would normally finish in less than 30 > seconds take 2-4x longer than normal with dynamic allocation. > *Issue #3: Unfair resource allocation due to cached data.* Applications that > have cached RDD partitions hold onto executors indefinitely, denying those > resources to other applications. > *Issue #4: Loss of cached data leads to thrashing.* Applications repeatedly > lose partitions of cached RDDs because the underlying executors are removed; > the applications then need to rerun expensive computations. > > This umbrella JIRA covers efforts to address these issues by making > enhancements to Spark. > Here's a high-level summary of the current planned work: > * [SPARK-21097]: Preserve an executor's cached data when shutting down the > executor. > * [SPARK-21122]: Make Spark give up executors in a controlled fashion when > the RM indicates it is running low on capacity. > * (JIRA TBD): Reduce the delay for dynamic allocation to spin up new > executors. > Note that this overall plan is subject to change, and other members of the > community should feel free to suggest changes and to help out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
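For reference, the knobs the four issues above revolve around are the standard dynamic-allocation settings; the values below are purely illustrative, not recommendations from this ticket:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("notebook-session")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "0")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  // Idle executors are released after this timeout (the issue #1 / #2 trade-off).
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Executors holding cached data are only released after this longer timeout
  // (issues #3 and #4).
  .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "30m")
  // Dynamic allocation requires the external shuffle service.
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()
{code}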
[jira] [Resolved] (SPARK-23631) Add summary to RandomForestClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-23631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23631. -- Resolution: Incomplete > Add summary to RandomForestClassificationModel > -- > > Key: SPARK-23631 > URL: https://issues.apache.org/jira/browse/SPARK-23631 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.0 >Reporter: Evan Zamir >Priority: Major > Labels: bulk-closed > > I'm using the RandomForestClassificationModel and noticed that there is no > summary attribute like there is for LogisticRegressionModel. Specifically, > I'd like to have the roc and pr curves. Is that on the Spark roadmap > anywhere? Is there a reason it hasn't been implemented? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
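For comparison, this is roughly what the existing LogisticRegression summary offers and what the ticket asks to mirror for random forests; the data path is illustrative, and {{binarySummary}} assumes a binary label (available since 2.3.0):
{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lr-summary").getOrCreate()
val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val lrModel = new LogisticRegression().setMaxIter(10).fit(training)
val binarySummary = lrModel.binarySummary
binarySummary.roc.show()  // DataFrame of (FPR, TPR) points
binarySummary.pr.show()   // DataFrame of (recall, precision) points
{code}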
[jira] [Resolved] (SPARK-24939) Support YARN Shared Cache in Spark
[ https://issues.apache.org/jira/browse/SPARK-24939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24939. -- Resolution: Incomplete > Support YARN Shared Cache in Spark > -- > > Key: SPARK-24939 > URL: https://issues.apache.org/jira/browse/SPARK-24939 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.1 >Reporter: Jonathan Bender >Priority: Minor > Labels: bulk-closed > > https://issues.apache.org/jira/browse/YARN-1492 introduced support for the > YARN Shared Cache, which when configured allows clients to cache submitted > application resources (jars, archives) in HDFS and avoid having to re-upload > them for successive jobs. MapReduce YARN applications support this feature, > it would be great to add support for it in Spark as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24049) Add a feature to not start speculative tasks when average task duration is less than a configurable absolute number
[ https://issues.apache.org/jira/browse/SPARK-24049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24049. -- Resolution: Incomplete > Add a feature to not start speculative tasks when average task duration is > less than a configurable absolute number > --- > > Key: SPARK-24049 > URL: https://issues.apache.org/jira/browse/SPARK-24049 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Lars Francke >Priority: Minor > Labels: bulk-closed > > Speculation currently has four different configuration options according to > the docs: > * spark.speculation to turn it on or off > * .interval - how often it'll check > * .multiplier - how many times slower than average a task must be > * .quantile - fraction of tasks that have to be completed before speculation > starts > What I'd love to see is a feature that allows me to set a minimum average > duration. We are using a Jobserver to submit varying queries to Spark. Some > of those have stages with tasks of an average length of ~20ms (yes, not ideal, > but it happens). So with a multiplier of 2 a speculative task starts after > 40ms. This happens quite a lot and in our use case it doesn't make sense. We > don't consider 40ms tasks "stragglers". > So it would make sense to add a parameter to only start speculation when the > average task time is greater than X seconds/minutes/millis. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
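The four existing settings can be combined today; the minimum-average-duration knob requested above does not exist and appears only as a comment. Values are illustrative:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculation-demo")
  .config("spark.speculation", "true")
  .config("spark.speculation.interval", "100ms")  // how often to check for stragglers
  .config("spark.speculation.multiplier", "2")    // how many times slower a task must be
  .config("spark.speculation.quantile", "0.75")   // fraction of tasks finished before speculating
  // Hypothetical, as proposed in this ticket: suppress speculation entirely when the
  // average task duration is below some absolute threshold, e.g. 1s.
  .getOrCreate()
{code}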
[jira] [Resolved] (SPARK-24689) java.io.NotSerializableException: org.apache.spark.mllib.clustering.DistributedLDAModel
[ https://issues.apache.org/jira/browse/SPARK-24689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24689. -- Resolution: Incomplete > java.io.NotSerializableException: > org.apache.spark.mllib.clustering.DistributedLDAModel > --- > > Key: SPARK-24689 > URL: https://issues.apache.org/jira/browse/SPARK-24689 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.1 > Environment: !image-2018-06-29-13-42-30-255.png! >Reporter: konglingbo >Priority: Major > Labels: bulk-closed > Attachments: @CLZ98635A644[_edx...@e.png > > > scala> val predictionAndLabels=testing.map{case LabeledPoint(label,features)=> > | val prediction = model.predict(features) > | (prediction, label) > | } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
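The snippet in the report captures the model inside a closure that runs on executors, which is where this kind of {{NotSerializableException}} typically comes from: a {{DistributedLDAModel}} is backed by RDDs and cannot be shipped to executors. A minimal sketch of the usual workaround, assuming {{model}} refers to the reporter's distributed model (this does not reproduce or fix the exact code in the report):
{code:scala}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}

// Convert to a local, serializable model before using it inside RDD closures.
def toSerializable(model: DistributedLDAModel): LocalLDAModel = model.toLocal
{code}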
[jira] [Resolved] (SPARK-23181) Add compatibility tests for SHS serialized data / disk format
[ https://issues.apache.org/jira/browse/SPARK-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23181. -- Resolution: Incomplete > Add compatibility tests for SHS serialized data / disk format > - > > Key: SPARK-23181 > URL: https://issues.apache.org/jira/browse/SPARK-23181 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 2.3.0 >Reporter: Marcelo Masiero Vanzin >Priority: Major > Labels: bulk-closed > > The SHS in 2.3.0 has the ability to serialize history data to disk (see > SPARK-18085 and its sub-tasks). This means that if either the serialized data > or the disk format changes, the code needs to be modified to either support > the old formats, or discard the old data (and re-create it from logs). > We should add integration tests that help us detect whether one of these > changes has occurred. They should check data generated by old versions of > Spark and fail if that data cannot be read back. > The Hive suites recently added the ability to download old Spark versions and > generate data from those old versions to test that new code can read it; we > could use something similar to test this (starting when 2.3.0 is > released). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org