[jira] [Commented] (SPARK-20898) spark.blacklist.killBlacklistedExecutors doesn't work in YARN
[ https://issues.apache.org/jira/browse/SPARK-20898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030710#comment-16030710 ] Saisai Shao commented on SPARK-20898: - [~tgraves], I addressed this issue in https://github.com/apache/spark/pull/17113 > spark.blacklist.killBlacklistedExecutors doesn't work in YARN > - > > Key: SPARK-20898 > URL: https://issues.apache.org/jira/browse/SPARK-20898 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Thomas Graves > > I was trying out the new spark.blacklist.killBlacklistedExecutors on YARN but > it doesn't appear to work. Everytime I get: > 17/05/26 16:28:12 WARN BlacklistTracker: Not attempting to kill blacklisted > executor id 4 since allocation client is not defined > Even though dynamic allocation is on. Taking a quick look, I think the way > it creates the blacklisttracker and passes the allocation client is wrong. > The scheduler backend is > not set yet so it never passes the allocation client to the blacklisttracker > correctly. Thus it will never kill. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
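A minimal driver-side sketch of the configuration combination being exercised in this report (blacklisting with executor killing on top of YARN dynamic allocation). The property names are the ones from the report plus the standard dynamic-allocation settings; the app name is made up, and in practice the YARN master and shuffle-service setup come from spark-submit and the cluster config.

{code}
import org.apache.spark.sql.SparkSession

// Sketch of the setup from this report: blacklisting with executor killing,
// running on YARN with dynamic allocation enabled.
val spark = SparkSession.builder()
  .appName("blacklist-kill-demo")                        // hypothetical app name
  .config("spark.blacklist.enabled", "true")
  .config("spark.blacklist.killBlacklistedExecutors", "true")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")       // needed for dynamic allocation on YARN
  .getOrCreate()
{code}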
[jira] [Created] (SPARK-20934) Task is hung at inner join, would work with other kinds of joins
Mohamed Elagamy created SPARK-20934: --- Summary: Task is hung at inner join, would work with other kind of joins Key: SPARK-20934 URL: https://issues.apache.org/jira/browse/SPARK-20934 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 2.1.0 Reporter: Mohamed Elagamy I am using spark 2.1.0 to read from parquets and inner join between different dataframes, but it gets stuck at the inner join step and never show any progress, here is the thread dump sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169) scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169) scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) scala.concurrent.Await$.ready(package.scala:169) com.wd.perf.collector.spark.report.utils.GroupUtils$.generateReportInGroups(GroupUtils.scala:70) com.wd.perf.collector.spark.report.jobs.definition.JobSummaryDefinitionParquetCreator$.com$wd$perf$collector$spark$report$jobs$definition$JobSummaryDefinitionParquetCreator$$generateHourlyBySWH(JobSummaryDefinitionParquetCreator.scala:82) com.wd.perf.collector.spark.report.jobs.definition.JobSummaryDefinitionParquetCreator$$anonfun$1$$anonfun$apply$1.apply(JobSummaryDefinitionParquetCreator.scala:40) com.wd.perf.collector.spark.report.jobs.definition.JobSummaryDefinitionParquetCreator$$anonfun$1$$anonfun$apply$1.apply(JobSummaryDefinitionParquetCreator.scala:39) scala.collection.immutable.List.foreach(List.scala:381) com.wd.perf.collector.spark.report.jobs.definition.JobSummaryDefinitionParquetCreator$$anonfun$1.apply(JobSummaryDefinitionParquetCreator.scala:39) com.wd.perf.collector.spark.report.jobs.definition.JobSummaryDefinitionParquetCreator$$anonfun$1.apply(JobSummaryDefinitionParquetCreator.scala:38) com.wd.perf.collector.spark.report.utils.ParquetReportInstaller$$anonfun$5.apply(ParquetReportInstaller.scala:148) com.wd.perf.collector.spark.report.utils.ParquetReportInstaller$$anonfun$5.apply(ParquetReportInstaller.scala:110) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) scala.collection.immutable.List.foreach(List.scala:381) scala.collection.TraversableLike$class.map(TraversableLike.scala:234) scala.collection.immutable.List.map(List.scala:285) com.wd.perf.collector.spark.report.utils.ParquetReportInstaller$.executeReportStep(ParquetReportInstaller.scala:110) com.wd.perf.collector.spark.report.utils.ParquetReportInstaller$.executeByInstallOption(ParquetReportInstaller.scala:51) com.wd.perf.collector.spark.report.jobs.definition.JobSummaryDefinitionParquetCreator$.generateReport(JobSummaryDefinitionParquetCreator.scala:44) com.wd.perf.collector.spark.actor.report.parquet.ReportNameInstanceCreatorActor.com$wd$perf$collector$spark$actor$report$parquet$ReportNameInstanceCreatorActor$$processMessage(ReportNameInstanceCreatorActor.scala:67) 
com.wd.perf.collector.spark.actor.report.parquet.ReportNameInstanceCreatorActor$$anonfun$receive$1$$anonfun$1$$anonfun$apply$1.apply$mcV$sp(ReportNameInstanceCreatorActor.scala:36) com.wd.perf.collector.spark.actor.report.parquet.ReportNameInstanceCreatorActor$$anonfun$receive$1$$anonfun$1$$anonfun$apply$1.apply(ReportNameInstanceCreatorActor.scala:35) com.wd.perf.collector.spark.actor.report.parquet.ReportNameInstanceCreatorActor$$anonfun$receive$1$$anonfun$1$$anonfun$apply$1.apply(ReportNameInstanceCreatorActor.scala:35) scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
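The user frames in the dump above show the join being awaited from inside a Future running on an Akka dispatcher. A stripped-down sketch of that blocking shape, with made-up parquet paths and an assumed join key, looks roughly like this:

{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("inner-join-sketch").getOrCreate()

// Hypothetical parquet inputs standing in for the reporter's dataframes.
val left  = spark.read.parquet("/data/left.parquet")
val right = spark.read.parquet("/data/right.parquet")

// The thread dump shows the caller parked in Await while the join's action
// runs inside a Future; this mirrors that pattern.
val work = Future {
  left.join(right, Seq("id"), "inner").count()   // "id" is an assumed join key
}
Await.ready(work, Duration.Inf)
{code}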
[jira] [Assigned] (SPARK-20933) when the input parameter is float type for 'round' or 'bround', it can't work well
[ https://issues.apache.org/jira/browse/SPARK-20933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20933: Assignee: (was: Apache Spark) > when the input parameter is float type for ’round ’ or ‘bround’ ,it can't > work well > > > Key: SPARK-20933 > URL: https://issues.apache.org/jira/browse/SPARK-20933 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian > > spark-sql>select round(cast(3.1415 as float), 3); > spark-sql>3.141 > For this case, the result we expected is 3.142 > spark-sql>select bround(cast(3.1415 as float), 3); > spark-sql>3.141 > For this case, the result we expected is 3.142 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20933) when the input parameter is float type for 'round' or 'bround', it can't work well
[ https://issues.apache.org/jira/browse/SPARK-20933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20933: Assignee: Apache Spark > when the input parameter is float type for ’round ’ or ‘bround’ ,it can't > work well > > > Key: SPARK-20933 > URL: https://issues.apache.org/jira/browse/SPARK-20933 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian >Assignee: Apache Spark > > spark-sql>select round(cast(3.1415 as float), 3); > spark-sql>3.141 > For this case, the result we expected is 3.142 > spark-sql>select bround(cast(3.1415 as float), 3); > spark-sql>3.141 > For this case, the result we expected is 3.142 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20933) when the input parameter is float type for 'round' or 'bround', it can't work well
[ https://issues.apache.org/jira/browse/SPARK-20933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030706#comment-16030706 ] Apache Spark commented on SPARK-20933: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/18156 > when the input parameter is float type for ’round ’ or ‘bround’ ,it can't > work well > > > Key: SPARK-20933 > URL: https://issues.apache.org/jira/browse/SPARK-20933 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian > > spark-sql>select round(cast(3.1415 as float), 3); > spark-sql>3.141 > For this case, the result we expected is 3.142 > spark-sql>select bround(cast(3.1415 as float), 3); > spark-sql>3.141 > For this case, the result we expected is 3.142 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20933) when the input parameter is float type for 'round' or 'bround', it can't work well
[ https://issues.apache.org/jira/browse/SPARK-20933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-20933: Description: spark-sql>select round(cast(3.1415 as float), 3); spark-sql>3.141 For this case, the result we expected is 3.142 spark-sql>select bround(cast(3.1415 as float), 3); spark-sql>3.141 For this case, the result we expected is 3.142 was: spark-sql>select round(cast(3.1415 as float), 3); spark-sql>3.141 spark-sql>select bround(cast(3.1415 as float), 3); spark-sql>3.141 > when the input parameter is float type for ’round ’ or ‘bround’ ,it can't > work well > > > Key: SPARK-20933 > URL: https://issues.apache.org/jira/browse/SPARK-20933 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian > > spark-sql>select round(cast(3.1415 as float), 3); > spark-sql>3.141 > For this case, the result we expected is 3.142 > spark-sql>select bround(cast(3.1415 as float), 3); > spark-sql>3.141 > For this case, the result we expected is 3.142 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20933) when the input parameter is float type for 'round' or 'bround', it can't work well
liuxian created SPARK-20933: --- Summary: when the input parameter is float type for ’round ’ or ‘bround’ ,it can't work well Key: SPARK-20933 URL: https://issues.apache.org/jira/browse/SPARK-20933 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: liuxian spark-sql>select round(cast(3.1415 as float), 3); spark-sql>3.141 spark-sql>select bround(cast(3.1415 as float), 3); spark-sql>3.141 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
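For reference, the reported behaviour can also be reproduced from a spark-shell session (where {{spark}} is predefined); per the report the FLOAT input rounds to 3.141 where 3.142 is expected. The DOUBLE line is only a hypothetical comparison point, not part of the report.

{code}
// As reported: FLOAT input to round/bround
spark.sql("SELECT round(CAST(3.1415 AS FLOAT), 3)").show()
spark.sql("SELECT bround(CAST(3.1415 AS FLOAT), 3)").show()

// Hypothetical comparison with DOUBLE input
spark.sql("SELECT round(CAST(3.1415 AS DOUBLE), 3)").show()
{code}

One plausible factor is that the FLOAT value closest to 3.1415 is slightly below it (roughly 3.1414999), so the rounding sees a value under the tie point; the linked pull request is where the actual fix is discussed.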
[jira] [Resolved] (SPARK-20865) caching dataset throws "Queries with streaming sources must be executed with writeStream.start()"
[ https://issues.apache.org/jira/browse/SPARK-20865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski resolved SPARK-20865. - Resolution: Won't Fix Fix Version/s: 2.3.0 2.2.0 {{cache}} is not allowed due to its eager execution. > caching dataset throws "Queries with streaming sources must be executed with > writeStream.start()" > - > > Key: SPARK-20865 > URL: https://issues.apache.org/jira/browse/SPARK-20865 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.0.2, 2.1.0, 2.1.1 >Reporter: Martin Brišiak > Fix For: 2.2.0, 2.3.0 > > > {code} > SparkSession > .builder > .master("local[*]") > .config("spark.sql.warehouse.dir", "C:/tmp/spark") > .config("spark.sql.streaming.checkpointLocation", > "C:/tmp/spark/spark-checkpoint") > .appName("my-test") > .getOrCreate > .readStream > .schema(schema) > .json("src/test/data") > .cache > .writeStream > .start > .awaitTermination > {code} > While executing this sample in spark got error. Without the .cache option it > worked as intended but with .cache option i got: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries > with streaming sources must be executed with writeStream.start();; > FileSource[src/test/data] at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:196) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:33) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:33) > at > org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:58) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:67) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:102) > at > org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65) > at > org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89) > at org.apache.spark.sql.Dataset.persist(Dataset.scala:2479) at > org.apache.spark.sql.Dataset.cache(Dataset.scala:2489) at > org.me.App$.main(App.scala:23) at org.me.App.main(App.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
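Since {{cache}} is eager and therefore rejected on a streaming Dataset, the working variant the reporter mentions is simply the same pipeline without it. A sketch follows; the schema and the sink are assumptions, since the original snippet does not define them.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "C:/tmp/spark")
  .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
  .appName("my-test")
  .getOrCreate()

// Assumed schema; the ticket references a `schema` value but does not show it.
val schema = new StructType().add("id", "string").add("value", "double")

spark.readStream
  .schema(schema)
  .json("src/test/data")
  .writeStream
  .format("console")          // assumed sink, not named in the original snippet
  .start()
  .awaitTermination()
{code}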
[jira] [Assigned] (SPARK-20877) Investigate if tests will time out on CRAN
[ https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman reassigned SPARK-20877: - Assignee: Felix Cheung > Investigate if tests will time out on CRAN > -- > > Key: SPARK-20877 > URL: https://issues.apache.org/jira/browse/SPARK-20877 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20877) Investigate if tests will time out on CRAN
[ https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-20877. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18104 [https://github.com/apache/spark/pull/18104] > Investigate if tests will time out on CRAN > -- > > Key: SPARK-20877 > URL: https://issues.apache.org/jira/browse/SPARK-20877 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20854) extend hint syntax to support any expression, not just identifiers or strings
[ https://issues.apache.org/jira/browse/SPARK-20854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20854: Issue Type: Improvement (was: Bug) > extend hint syntax to support any expression, not just identifiers or strings > - > > Key: SPARK-20854 > URL: https://issues.apache.org/jira/browse/SPARK-20854 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu >Assignee: Bogdan Raducanu >Priority: Blocker > > Currently the SQL hint syntax supports as parameters only identifiers while > the Dataset hint syntax supports only strings. > They should support any expression as parameters, for example numbers. This > is useful for implementing other hints in the future. > Examples: > {code} > df.hint("hint1", Seq(1, 2, 3)) > df.hint("hint2", "A", 1) > sql("select /*+ hint1((1,2,3)) */") > sql("select /*+ hint2('A', 1) */") > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20854) extend hint syntax to support any expression, not just identifiers or strings
[ https://issues.apache.org/jira/browse/SPARK-20854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-20854: --- Assignee: Bogdan Raducanu Target Version/s: 2.2.0 Priority: Blocker (was: Major) > extend hint syntax to support any expression, not just identifiers or strings > - > > Key: SPARK-20854 > URL: https://issues.apache.org/jira/browse/SPARK-20854 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu >Assignee: Bogdan Raducanu >Priority: Blocker > > Currently the SQL hint syntax supports as parameters only identifiers while > the Dataset hint syntax supports only strings. > They should support any expression as parameters, for example numbers. This > is useful for implementing other hints in the future. > Examples: > {code} > df.hint("hint1", Seq(1, 2, 3)) > df.hint("hint2", "A", 1) > sql("select /*+ hint1((1,2,3)) */") > sql("select /*+ hint2('A', 1) */") > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
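For context, a sketch of what the two hint surfaces accept today in 2.2, using the broadcast hint as the concrete case; the table and column names are made up, and a spark-shell style {{spark}} session is assumed. The examples inside the ticket show the proposed extension to arbitrary expressions.

{code}
// Dataset-side hint: a string name (no parameters needed for broadcast).
val small = spark.table("small_table")                  // hypothetical tables
val big   = spark.table("big_table")
val joined = big.join(small.hint("broadcast"), "id")    // "id" is an assumed join key

// SQL-side hint: the parser currently accepts identifiers as hint parameters.
spark.sql(
  "SELECT /*+ BROADCAST(s) */ * FROM big_table b JOIN small_table s ON b.id = s.id")
{code}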
[jira] [Comment Edited] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030629#comment-16030629 ] pralabhkumar edited comment on SPARK-20199 at 5/31/17 4:56 AM: --- please review the pull request . was (Author: pralabhkumar): please review the pull request . https://github.com/apache/spark/commit/16ccbdfd8862c528c90fdde94c8ec20d6631126e > GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have featureSubsetStrategy . It Uses > random forest internally ,which have featureSubsetStrategy hardcoded "all". > It should be provided by the user to have randomness at the feature level. > This parameter is available in H2O and XGBoost. > Sample from H2O.ai > gbmParams._col_sample_rate > Please provide the parameter . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030629#comment-16030629 ] pralabhkumar commented on SPARK-20199: -- please review the pull request . https://github.com/apache/spark/commit/16ccbdfd8862c528c90fdde94c8ec20d6631126e > GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have featureSubsetStrategy . It Uses > random forest internally ,which have featureSubsetStrategy hardcoded "all". > It should be provided by the user to have randomness at the feature level. > This parameter is available in H2O and XGBoost. > Sample from H2O.ai > gbmParams._col_sample_rate > Please provide the parameter . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
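Per the description, the GBT implementation reuses random forests internally with {{featureSubsetStrategy}} fixed to "all", while the RandomForest estimators already expose the setter. A sketch contrasting the two; the commented GBT setter is the requested addition, not an existing 2.1 API.

{code}
import org.apache.spark.ml.classification.{GBTClassifier, RandomForestClassifier}

// RandomForest already exposes the knob this ticket asks for:
val rf = new RandomForestClassifier()
  .setFeatureSubsetStrategy("sqrt")     // e.g. "all", "sqrt", "log2", "onethird"

// GBT wraps random forests internally with the strategy fixed to "all" (per the
// ticket); a setter like the commented line is what is being requested.
val gbt = new GBTClassifier()
  .setMaxIter(20)
// .setFeatureSubsetStrategy("sqrt")    // hypothetical, what the ticket proposes
{code}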
[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20392: Target Version/s: 2.3.0 > Slow performance when calling fit on ML pipeline for dataset with many > columns but few rows > --- > > Key: SPARK-20392 > URL: https://issues.apache.org/jira/browse/SPARK-20392 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker >Assignee: Liang-Chi Hsieh >Priority: Blocker > Attachments: blockbuster.csv, blockbuster_fewCols.csv, > giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip > > > This started as a [question on stack > overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], > but it seems like a bug. > I am testing spark pipelines using a simple dataset (attached) with 312 > (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 > minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. > This seems much to long for such a tiny dataset. Similar pipelines run > quickly on datasets that have fewer columns and more rows. It's something > about the number of columns that is causing the slow performance. > Here are a list of the stages in my pipeline: > {code} > 000_strIdx_5708525b2b6c > 001_strIdx_ec2296082913 > 002_bucketizer_3cbc8811877b > 003_bucketizer_5a01d5d78436 > 004_bucketizer_bf290d11364d > 005_bucketizer_c3296dfe94b2 > 006_bucketizer_7071ca50eb85 > 007_bucketizer_27738213c2a1 > 008_bucketizer_bd728fd89ba1 > 009_bucketizer_e1e716f51796 > 010_bucketizer_38be665993ba > 011_bucketizer_5a0e41e5e94f > 012_bucketizer_b5a3d5743aaa > 013_bucketizer_4420f98ff7ff > 014_bucketizer_777cc4fe6d12 > 015_bucketizer_f0f3a3e5530e > 016_bucketizer_218ecca3b5c1 > 017_bucketizer_0b083439a192 > 018_bucketizer_4520203aec27 > 019_bucketizer_462c2c346079 > 020_bucketizer_47435822e04c > 021_bucketizer_eb9dccb5e6e8 > 022_bucketizer_b5f63dd7451d > 023_bucketizer_e0fd5041c841 > 024_bucketizer_ffb3b9737100 > 025_bucketizer_e06c0d29273c > 026_bucketizer_36ee535a425f > 027_bucketizer_ee3a330269f1 > 028_bucketizer_094b58ea01c0 > 029_bucketizer_e93ea86c08e2 > 030_bucketizer_4728a718bc4b > 031_bucketizer_08f6189c7fcc > 032_bucketizer_11feb74901e6 > 033_bucketizer_ab4add4966c7 > 034_bucketizer_4474f7f1b8ce > 035_bucketizer_90cfa5918d71 > 036_bucketizer_1a9ff5e4eccb > 037_bucketizer_38085415a4f4 > 038_bucketizer_9b5e5a8d12eb > 039_bucketizer_082bb650ecc3 > 040_bucketizer_57e1e363c483 > 041_bucketizer_337583fbfd65 > 042_bucketizer_73e8f6673262 > 043_bucketizer_0f9394ed30b8 > 044_bucketizer_8530f3570019 > 045_bucketizer_c53614f1e507 > 046_bucketizer_8fd99e6ec27b > 047_bucketizer_6a8610496d8a > 048_bucketizer_888b0055c1ad > 049_bucketizer_974e0a1433a6 > 050_bucketizer_e848c0937cb9 > 051_bucketizer_95611095a4ac > 052_bucketizer_660a6031acd9 > 053_bucketizer_aaffe5a3140d > 054_bucketizer_8dc569be285f > 055_bucketizer_83d1bffa07bc > 056_bucketizer_0c6180ba75e6 > 057_bucketizer_452f265a000d > 058_bucketizer_38e02ddfb447 > 059_bucketizer_6fa4ad5d3ebd > 060_bucketizer_91044ee766ce > 061_bucketizer_9a9ef04a173d > 062_bucketizer_3d98eb15f206 > 063_bucketizer_c4915bb4d4ed > 064_bucketizer_8ca2b6550c38 > 065_bucketizer_417ee9b760bc > 066_bucketizer_67f3556bebe8 > 067_bucketizer_0556deb652c6 > 068_bucketizer_067b4b3d234c > 069_bucketizer_30ba55321538 > 070_bucketizer_ad826cc5d746 > 071_bucketizer_77676a898055 > 072_bucketizer_05c37a38ce30 > 
073_bucketizer_6d9ae54163ed > 074_bucketizer_8cd668b2855d > 075_bucketizer_d50ea1732021 > 076_bucketizer_c68f467c9559 > 077_bucketizer_ee1dfc840db1 > 078_bucketizer_83ec06a32519 > 079_bucketizer_741d08c1b69e > 080_bucketizer_b7402e4829c7 > 081_bucketizer_8adc590dc447 > 082_bucketizer_673be99bdace > 083_bucketizer_77693b45f94c > 084_bucketizer_53529c6b1ac4 > 085_bucketizer_6a3ca776a81e > 086_bucketizer_6679d9588ac1 > 087_bucketizer_6c73af456f65 > 088_bucketizer_2291b2c5ab51 > 089_bucketizer_cb3d0fe669d8 > 090_bucketizer_e71f913c1512 > 091_bucketizer_156528f65ce7 > 092_bucketizer_f3ec5dae079b > 093_bucketizer_809fab77eee1 > 094_bucketizer_6925831511e6 > 095_bucketizer_c5d853b95707 > 096_bucketizer_e677659ca253 > 097_bucketizer_396e35548c72 > 098_bucketizer_78a6410d7a84 > 099_bucketizer_e3ae6e54bca1 > 100_bucketizer_9fed5923fe8a > 101_bucketizer_8925ba4c3ee2 > 102_bucketizer_95750b6942b8 > 103_bucketizer_6e8b50a1918b > 104_bucketizer_36cfcc13d4ba > 105_bucketizer_2716d0455512 > 106_bucketizer_9bcf2891652f > 107_bucketizer_8c3d352915f7 > 108_bucketizer_0786c17d5ef9 > 109_bucketizer_f22df23ef56f > 110_bucketizer_bad04578bd20 > 111_bucketizer_35cfbde7e28f > 112_bucketizer_cf89177a528b > 113_b
[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20392: Priority: Blocker (was: Major) > Slow performance when calling fit on ML pipeline for dataset with many > columns but few rows > --- > > Key: SPARK-20392 > URL: https://issues.apache.org/jira/browse/SPARK-20392 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker >Assignee: Liang-Chi Hsieh >Priority: Blocker > Attachments: blockbuster.csv, blockbuster_fewCols.csv, > giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip > > > This started as a [question on stack > overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], > but it seems like a bug. > I am testing spark pipelines using a simple dataset (attached) with 312 > (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 > minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. > This seems much to long for such a tiny dataset. Similar pipelines run > quickly on datasets that have fewer columns and more rows. It's something > about the number of columns that is causing the slow performance. > Here are a list of the stages in my pipeline: > {code} > 000_strIdx_5708525b2b6c > 001_strIdx_ec2296082913 > 002_bucketizer_3cbc8811877b > 003_bucketizer_5a01d5d78436 > 004_bucketizer_bf290d11364d > 005_bucketizer_c3296dfe94b2 > 006_bucketizer_7071ca50eb85 > 007_bucketizer_27738213c2a1 > 008_bucketizer_bd728fd89ba1 > 009_bucketizer_e1e716f51796 > 010_bucketizer_38be665993ba > 011_bucketizer_5a0e41e5e94f > 012_bucketizer_b5a3d5743aaa > 013_bucketizer_4420f98ff7ff > 014_bucketizer_777cc4fe6d12 > 015_bucketizer_f0f3a3e5530e > 016_bucketizer_218ecca3b5c1 > 017_bucketizer_0b083439a192 > 018_bucketizer_4520203aec27 > 019_bucketizer_462c2c346079 > 020_bucketizer_47435822e04c > 021_bucketizer_eb9dccb5e6e8 > 022_bucketizer_b5f63dd7451d > 023_bucketizer_e0fd5041c841 > 024_bucketizer_ffb3b9737100 > 025_bucketizer_e06c0d29273c > 026_bucketizer_36ee535a425f > 027_bucketizer_ee3a330269f1 > 028_bucketizer_094b58ea01c0 > 029_bucketizer_e93ea86c08e2 > 030_bucketizer_4728a718bc4b > 031_bucketizer_08f6189c7fcc > 032_bucketizer_11feb74901e6 > 033_bucketizer_ab4add4966c7 > 034_bucketizer_4474f7f1b8ce > 035_bucketizer_90cfa5918d71 > 036_bucketizer_1a9ff5e4eccb > 037_bucketizer_38085415a4f4 > 038_bucketizer_9b5e5a8d12eb > 039_bucketizer_082bb650ecc3 > 040_bucketizer_57e1e363c483 > 041_bucketizer_337583fbfd65 > 042_bucketizer_73e8f6673262 > 043_bucketizer_0f9394ed30b8 > 044_bucketizer_8530f3570019 > 045_bucketizer_c53614f1e507 > 046_bucketizer_8fd99e6ec27b > 047_bucketizer_6a8610496d8a > 048_bucketizer_888b0055c1ad > 049_bucketizer_974e0a1433a6 > 050_bucketizer_e848c0937cb9 > 051_bucketizer_95611095a4ac > 052_bucketizer_660a6031acd9 > 053_bucketizer_aaffe5a3140d > 054_bucketizer_8dc569be285f > 055_bucketizer_83d1bffa07bc > 056_bucketizer_0c6180ba75e6 > 057_bucketizer_452f265a000d > 058_bucketizer_38e02ddfb447 > 059_bucketizer_6fa4ad5d3ebd > 060_bucketizer_91044ee766ce > 061_bucketizer_9a9ef04a173d > 062_bucketizer_3d98eb15f206 > 063_bucketizer_c4915bb4d4ed > 064_bucketizer_8ca2b6550c38 > 065_bucketizer_417ee9b760bc > 066_bucketizer_67f3556bebe8 > 067_bucketizer_0556deb652c6 > 068_bucketizer_067b4b3d234c > 069_bucketizer_30ba55321538 > 070_bucketizer_ad826cc5d746 > 071_bucketizer_77676a898055 > 072_bucketizer_05c37a38ce30 > 
073_bucketizer_6d9ae54163ed > 074_bucketizer_8cd668b2855d > 075_bucketizer_d50ea1732021 > 076_bucketizer_c68f467c9559 > 077_bucketizer_ee1dfc840db1 > 078_bucketizer_83ec06a32519 > 079_bucketizer_741d08c1b69e > 080_bucketizer_b7402e4829c7 > 081_bucketizer_8adc590dc447 > 082_bucketizer_673be99bdace > 083_bucketizer_77693b45f94c > 084_bucketizer_53529c6b1ac4 > 085_bucketizer_6a3ca776a81e > 086_bucketizer_6679d9588ac1 > 087_bucketizer_6c73af456f65 > 088_bucketizer_2291b2c5ab51 > 089_bucketizer_cb3d0fe669d8 > 090_bucketizer_e71f913c1512 > 091_bucketizer_156528f65ce7 > 092_bucketizer_f3ec5dae079b > 093_bucketizer_809fab77eee1 > 094_bucketizer_6925831511e6 > 095_bucketizer_c5d853b95707 > 096_bucketizer_e677659ca253 > 097_bucketizer_396e35548c72 > 098_bucketizer_78a6410d7a84 > 099_bucketizer_e3ae6e54bca1 > 100_bucketizer_9fed5923fe8a > 101_bucketizer_8925ba4c3ee2 > 102_bucketizer_95750b6942b8 > 103_bucketizer_6e8b50a1918b > 104_bucketizer_36cfcc13d4ba > 105_bucketizer_2716d0455512 > 106_bucketizer_9bcf2891652f > 107_bucketizer_8c3d352915f7 > 108_bucketizer_0786c17d5ef9 > 109_bucketizer_f22df23ef56f > 110_bucketizer_bad04578bd20 > 111_bucketizer_35cfbde7e28f > 112_bucketizer_cf89177a528b > 113_b
[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20392: Issue Type: Improvement (was: Bug) > Slow performance when calling fit on ML pipeline for dataset with many > columns but few rows > --- > > Key: SPARK-20392 > URL: https://issues.apache.org/jira/browse/SPARK-20392 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker >Assignee: Liang-Chi Hsieh >Priority: Blocker > Attachments: blockbuster.csv, blockbuster_fewCols.csv, > giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip > > > This started as a [question on stack > overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], > but it seems like a bug. > I am testing spark pipelines using a simple dataset (attached) with 312 > (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 > minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. > This seems much to long for such a tiny dataset. Similar pipelines run > quickly on datasets that have fewer columns and more rows. It's something > about the number of columns that is causing the slow performance. > Here are a list of the stages in my pipeline: > {code} > 000_strIdx_5708525b2b6c > 001_strIdx_ec2296082913 > 002_bucketizer_3cbc8811877b > 003_bucketizer_5a01d5d78436 > 004_bucketizer_bf290d11364d > 005_bucketizer_c3296dfe94b2 > 006_bucketizer_7071ca50eb85 > 007_bucketizer_27738213c2a1 > 008_bucketizer_bd728fd89ba1 > 009_bucketizer_e1e716f51796 > 010_bucketizer_38be665993ba > 011_bucketizer_5a0e41e5e94f > 012_bucketizer_b5a3d5743aaa > 013_bucketizer_4420f98ff7ff > 014_bucketizer_777cc4fe6d12 > 015_bucketizer_f0f3a3e5530e > 016_bucketizer_218ecca3b5c1 > 017_bucketizer_0b083439a192 > 018_bucketizer_4520203aec27 > 019_bucketizer_462c2c346079 > 020_bucketizer_47435822e04c > 021_bucketizer_eb9dccb5e6e8 > 022_bucketizer_b5f63dd7451d > 023_bucketizer_e0fd5041c841 > 024_bucketizer_ffb3b9737100 > 025_bucketizer_e06c0d29273c > 026_bucketizer_36ee535a425f > 027_bucketizer_ee3a330269f1 > 028_bucketizer_094b58ea01c0 > 029_bucketizer_e93ea86c08e2 > 030_bucketizer_4728a718bc4b > 031_bucketizer_08f6189c7fcc > 032_bucketizer_11feb74901e6 > 033_bucketizer_ab4add4966c7 > 034_bucketizer_4474f7f1b8ce > 035_bucketizer_90cfa5918d71 > 036_bucketizer_1a9ff5e4eccb > 037_bucketizer_38085415a4f4 > 038_bucketizer_9b5e5a8d12eb > 039_bucketizer_082bb650ecc3 > 040_bucketizer_57e1e363c483 > 041_bucketizer_337583fbfd65 > 042_bucketizer_73e8f6673262 > 043_bucketizer_0f9394ed30b8 > 044_bucketizer_8530f3570019 > 045_bucketizer_c53614f1e507 > 046_bucketizer_8fd99e6ec27b > 047_bucketizer_6a8610496d8a > 048_bucketizer_888b0055c1ad > 049_bucketizer_974e0a1433a6 > 050_bucketizer_e848c0937cb9 > 051_bucketizer_95611095a4ac > 052_bucketizer_660a6031acd9 > 053_bucketizer_aaffe5a3140d > 054_bucketizer_8dc569be285f > 055_bucketizer_83d1bffa07bc > 056_bucketizer_0c6180ba75e6 > 057_bucketizer_452f265a000d > 058_bucketizer_38e02ddfb447 > 059_bucketizer_6fa4ad5d3ebd > 060_bucketizer_91044ee766ce > 061_bucketizer_9a9ef04a173d > 062_bucketizer_3d98eb15f206 > 063_bucketizer_c4915bb4d4ed > 064_bucketizer_8ca2b6550c38 > 065_bucketizer_417ee9b760bc > 066_bucketizer_67f3556bebe8 > 067_bucketizer_0556deb652c6 > 068_bucketizer_067b4b3d234c > 069_bucketizer_30ba55321538 > 070_bucketizer_ad826cc5d746 > 071_bucketizer_77676a898055 > 072_bucketizer_05c37a38ce30 > 
073_bucketizer_6d9ae54163ed > 074_bucketizer_8cd668b2855d > 075_bucketizer_d50ea1732021 > 076_bucketizer_c68f467c9559 > 077_bucketizer_ee1dfc840db1 > 078_bucketizer_83ec06a32519 > 079_bucketizer_741d08c1b69e > 080_bucketizer_b7402e4829c7 > 081_bucketizer_8adc590dc447 > 082_bucketizer_673be99bdace > 083_bucketizer_77693b45f94c > 084_bucketizer_53529c6b1ac4 > 085_bucketizer_6a3ca776a81e > 086_bucketizer_6679d9588ac1 > 087_bucketizer_6c73af456f65 > 088_bucketizer_2291b2c5ab51 > 089_bucketizer_cb3d0fe669d8 > 090_bucketizer_e71f913c1512 > 091_bucketizer_156528f65ce7 > 092_bucketizer_f3ec5dae079b > 093_bucketizer_809fab77eee1 > 094_bucketizer_6925831511e6 > 095_bucketizer_c5d853b95707 > 096_bucketizer_e677659ca253 > 097_bucketizer_396e35548c72 > 098_bucketizer_78a6410d7a84 > 099_bucketizer_e3ae6e54bca1 > 100_bucketizer_9fed5923fe8a > 101_bucketizer_8925ba4c3ee2 > 102_bucketizer_95750b6942b8 > 103_bucketizer_6e8b50a1918b > 104_bucketizer_36cfcc13d4ba > 105_bucketizer_2716d0455512 > 106_bucketizer_9bcf2891652f > 107_bucketizer_8c3d352915f7 > 108_bucketizer_0786c17d5ef9 > 109_bucketizer_f22df23ef56f > 110_bucketizer_bad04578bd20 > 111_bucketizer_35cfbde7e28f > 112_bucketizer_cf89177a
[jira] [Reopened] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reopened SPARK-20392: - will re-merge it at the end of Spark 2.3, to avoid conflicts when backporting analyzer related PRs to 2.2 in the future. > Slow performance when calling fit on ML pipeline for dataset with many > columns but few rows > --- > > Key: SPARK-20392 > URL: https://issues.apache.org/jira/browse/SPARK-20392 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker >Assignee: Liang-Chi Hsieh > Attachments: blockbuster.csv, blockbuster_fewCols.csv, > giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip > > > This started as a [question on stack > overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], > but it seems like a bug. > I am testing spark pipelines using a simple dataset (attached) with 312 > (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 > minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. > This seems much to long for such a tiny dataset. Similar pipelines run > quickly on datasets that have fewer columns and more rows. It's something > about the number of columns that is causing the slow performance. > Here are a list of the stages in my pipeline: > {code} > 000_strIdx_5708525b2b6c > 001_strIdx_ec2296082913 > 002_bucketizer_3cbc8811877b > 003_bucketizer_5a01d5d78436 > 004_bucketizer_bf290d11364d > 005_bucketizer_c3296dfe94b2 > 006_bucketizer_7071ca50eb85 > 007_bucketizer_27738213c2a1 > 008_bucketizer_bd728fd89ba1 > 009_bucketizer_e1e716f51796 > 010_bucketizer_38be665993ba > 011_bucketizer_5a0e41e5e94f > 012_bucketizer_b5a3d5743aaa > 013_bucketizer_4420f98ff7ff > 014_bucketizer_777cc4fe6d12 > 015_bucketizer_f0f3a3e5530e > 016_bucketizer_218ecca3b5c1 > 017_bucketizer_0b083439a192 > 018_bucketizer_4520203aec27 > 019_bucketizer_462c2c346079 > 020_bucketizer_47435822e04c > 021_bucketizer_eb9dccb5e6e8 > 022_bucketizer_b5f63dd7451d > 023_bucketizer_e0fd5041c841 > 024_bucketizer_ffb3b9737100 > 025_bucketizer_e06c0d29273c > 026_bucketizer_36ee535a425f > 027_bucketizer_ee3a330269f1 > 028_bucketizer_094b58ea01c0 > 029_bucketizer_e93ea86c08e2 > 030_bucketizer_4728a718bc4b > 031_bucketizer_08f6189c7fcc > 032_bucketizer_11feb74901e6 > 033_bucketizer_ab4add4966c7 > 034_bucketizer_4474f7f1b8ce > 035_bucketizer_90cfa5918d71 > 036_bucketizer_1a9ff5e4eccb > 037_bucketizer_38085415a4f4 > 038_bucketizer_9b5e5a8d12eb > 039_bucketizer_082bb650ecc3 > 040_bucketizer_57e1e363c483 > 041_bucketizer_337583fbfd65 > 042_bucketizer_73e8f6673262 > 043_bucketizer_0f9394ed30b8 > 044_bucketizer_8530f3570019 > 045_bucketizer_c53614f1e507 > 046_bucketizer_8fd99e6ec27b > 047_bucketizer_6a8610496d8a > 048_bucketizer_888b0055c1ad > 049_bucketizer_974e0a1433a6 > 050_bucketizer_e848c0937cb9 > 051_bucketizer_95611095a4ac > 052_bucketizer_660a6031acd9 > 053_bucketizer_aaffe5a3140d > 054_bucketizer_8dc569be285f > 055_bucketizer_83d1bffa07bc > 056_bucketizer_0c6180ba75e6 > 057_bucketizer_452f265a000d > 058_bucketizer_38e02ddfb447 > 059_bucketizer_6fa4ad5d3ebd > 060_bucketizer_91044ee766ce > 061_bucketizer_9a9ef04a173d > 062_bucketizer_3d98eb15f206 > 063_bucketizer_c4915bb4d4ed > 064_bucketizer_8ca2b6550c38 > 065_bucketizer_417ee9b760bc > 066_bucketizer_67f3556bebe8 > 067_bucketizer_0556deb652c6 > 068_bucketizer_067b4b3d234c > 069_bucketizer_30ba55321538 > 070_bucketizer_ad826cc5d746 > 
071_bucketizer_77676a898055 > 072_bucketizer_05c37a38ce30 > 073_bucketizer_6d9ae54163ed > 074_bucketizer_8cd668b2855d > 075_bucketizer_d50ea1732021 > 076_bucketizer_c68f467c9559 > 077_bucketizer_ee1dfc840db1 > 078_bucketizer_83ec06a32519 > 079_bucketizer_741d08c1b69e > 080_bucketizer_b7402e4829c7 > 081_bucketizer_8adc590dc447 > 082_bucketizer_673be99bdace > 083_bucketizer_77693b45f94c > 084_bucketizer_53529c6b1ac4 > 085_bucketizer_6a3ca776a81e > 086_bucketizer_6679d9588ac1 > 087_bucketizer_6c73af456f65 > 088_bucketizer_2291b2c5ab51 > 089_bucketizer_cb3d0fe669d8 > 090_bucketizer_e71f913c1512 > 091_bucketizer_156528f65ce7 > 092_bucketizer_f3ec5dae079b > 093_bucketizer_809fab77eee1 > 094_bucketizer_6925831511e6 > 095_bucketizer_c5d853b95707 > 096_bucketizer_e677659ca253 > 097_bucketizer_396e35548c72 > 098_bucketizer_78a6410d7a84 > 099_bucketizer_e3ae6e54bca1 > 100_bucketizer_9fed5923fe8a > 101_bucketizer_8925ba4c3ee2 > 102_bucketizer_95750b6942b8 > 103_bucketizer_6e8b50a1918b > 104_bucketizer_36cfcc13d4ba > 105_bucketizer_2716d0455512 > 106_bucketizer_9bcf2891652f > 107_bucketizer_8c3d352915f7 > 108_bucketizer_0786c17d5ef9 > 109_bucketizer_f22df23ef56f > 110_bucketizer_bad04578bd20 > 111_buc
[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20392: Fix Version/s: (was: 2.3.0) > Slow performance when calling fit on ML pipeline for dataset with many > columns but few rows > --- > > Key: SPARK-20392 > URL: https://issues.apache.org/jira/browse/SPARK-20392 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker >Assignee: Liang-Chi Hsieh > Attachments: blockbuster.csv, blockbuster_fewCols.csv, > giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip > > > This started as a [question on stack > overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], > but it seems like a bug. > I am testing spark pipelines using a simple dataset (attached) with 312 > (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 > minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. > This seems much to long for such a tiny dataset. Similar pipelines run > quickly on datasets that have fewer columns and more rows. It's something > about the number of columns that is causing the slow performance. > Here are a list of the stages in my pipeline: > {code} > 000_strIdx_5708525b2b6c > 001_strIdx_ec2296082913 > 002_bucketizer_3cbc8811877b > 003_bucketizer_5a01d5d78436 > 004_bucketizer_bf290d11364d > 005_bucketizer_c3296dfe94b2 > 006_bucketizer_7071ca50eb85 > 007_bucketizer_27738213c2a1 > 008_bucketizer_bd728fd89ba1 > 009_bucketizer_e1e716f51796 > 010_bucketizer_38be665993ba > 011_bucketizer_5a0e41e5e94f > 012_bucketizer_b5a3d5743aaa > 013_bucketizer_4420f98ff7ff > 014_bucketizer_777cc4fe6d12 > 015_bucketizer_f0f3a3e5530e > 016_bucketizer_218ecca3b5c1 > 017_bucketizer_0b083439a192 > 018_bucketizer_4520203aec27 > 019_bucketizer_462c2c346079 > 020_bucketizer_47435822e04c > 021_bucketizer_eb9dccb5e6e8 > 022_bucketizer_b5f63dd7451d > 023_bucketizer_e0fd5041c841 > 024_bucketizer_ffb3b9737100 > 025_bucketizer_e06c0d29273c > 026_bucketizer_36ee535a425f > 027_bucketizer_ee3a330269f1 > 028_bucketizer_094b58ea01c0 > 029_bucketizer_e93ea86c08e2 > 030_bucketizer_4728a718bc4b > 031_bucketizer_08f6189c7fcc > 032_bucketizer_11feb74901e6 > 033_bucketizer_ab4add4966c7 > 034_bucketizer_4474f7f1b8ce > 035_bucketizer_90cfa5918d71 > 036_bucketizer_1a9ff5e4eccb > 037_bucketizer_38085415a4f4 > 038_bucketizer_9b5e5a8d12eb > 039_bucketizer_082bb650ecc3 > 040_bucketizer_57e1e363c483 > 041_bucketizer_337583fbfd65 > 042_bucketizer_73e8f6673262 > 043_bucketizer_0f9394ed30b8 > 044_bucketizer_8530f3570019 > 045_bucketizer_c53614f1e507 > 046_bucketizer_8fd99e6ec27b > 047_bucketizer_6a8610496d8a > 048_bucketizer_888b0055c1ad > 049_bucketizer_974e0a1433a6 > 050_bucketizer_e848c0937cb9 > 051_bucketizer_95611095a4ac > 052_bucketizer_660a6031acd9 > 053_bucketizer_aaffe5a3140d > 054_bucketizer_8dc569be285f > 055_bucketizer_83d1bffa07bc > 056_bucketizer_0c6180ba75e6 > 057_bucketizer_452f265a000d > 058_bucketizer_38e02ddfb447 > 059_bucketizer_6fa4ad5d3ebd > 060_bucketizer_91044ee766ce > 061_bucketizer_9a9ef04a173d > 062_bucketizer_3d98eb15f206 > 063_bucketizer_c4915bb4d4ed > 064_bucketizer_8ca2b6550c38 > 065_bucketizer_417ee9b760bc > 066_bucketizer_67f3556bebe8 > 067_bucketizer_0556deb652c6 > 068_bucketizer_067b4b3d234c > 069_bucketizer_30ba55321538 > 070_bucketizer_ad826cc5d746 > 071_bucketizer_77676a898055 > 072_bucketizer_05c37a38ce30 > 073_bucketizer_6d9ae54163ed > 
074_bucketizer_8cd668b2855d > 075_bucketizer_d50ea1732021 > 076_bucketizer_c68f467c9559 > 077_bucketizer_ee1dfc840db1 > 078_bucketizer_83ec06a32519 > 079_bucketizer_741d08c1b69e > 080_bucketizer_b7402e4829c7 > 081_bucketizer_8adc590dc447 > 082_bucketizer_673be99bdace > 083_bucketizer_77693b45f94c > 084_bucketizer_53529c6b1ac4 > 085_bucketizer_6a3ca776a81e > 086_bucketizer_6679d9588ac1 > 087_bucketizer_6c73af456f65 > 088_bucketizer_2291b2c5ab51 > 089_bucketizer_cb3d0fe669d8 > 090_bucketizer_e71f913c1512 > 091_bucketizer_156528f65ce7 > 092_bucketizer_f3ec5dae079b > 093_bucketizer_809fab77eee1 > 094_bucketizer_6925831511e6 > 095_bucketizer_c5d853b95707 > 096_bucketizer_e677659ca253 > 097_bucketizer_396e35548c72 > 098_bucketizer_78a6410d7a84 > 099_bucketizer_e3ae6e54bca1 > 100_bucketizer_9fed5923fe8a > 101_bucketizer_8925ba4c3ee2 > 102_bucketizer_95750b6942b8 > 103_bucketizer_6e8b50a1918b > 104_bucketizer_36cfcc13d4ba > 105_bucketizer_2716d0455512 > 106_bucketizer_9bcf2891652f > 107_bucketizer_8c3d352915f7 > 108_bucketizer_0786c17d5ef9 > 109_bucketizer_f22df23ef56f > 110_bucketizer_bad04578bd20 > 111_bucketizer_35cfbde7e28f > 112_bucketizer_cf89177a528b > 113_bucketizer_183a0d393ef0 > 114_bu
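A rough sketch of the shape of this workload: fitting a pipeline made of one Bucketizer per numeric column of a wide-but-short CSV. It assumes a spark-shell style {{spark}} session; the file name follows the attachment, and the split points and column selection are placeholders rather than the reporter's actual pipeline.

{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.types.DoubleType

// Wide-but-short input, standing in for the attached blockbuster.csv (312 cols, 421 rows).
val df = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("blockbuster.csv")

val numericCols = df.schema.fields.filter(_.dataType == DoubleType).map(_.name)

// One Bucketizer per numeric column, mirroring the ~300 bucketizer stages listed above.
val stages: Array[PipelineStage] = numericCols.map { c =>
  new Bucketizer()
    .setInputCol(c)
    .setOutputCol(s"${c}_binned")
    .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity)) // arbitrary splits
}

val model = new Pipeline().setStages(stages).fit(df)   // the step reported to take ~3 minutes
{code}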
[jira] [Commented] (SPARK-20876) If the input parameter is float type for ceil or floor, the result is not what we expected
[ https://issues.apache.org/jira/browse/SPARK-20876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030604#comment-16030604 ] Apache Spark commented on SPARK-20876: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/18155 > If the input parameter is float type for ceil or floor ,the result is not we > expected > -- > > Key: SPARK-20876 > URL: https://issues.apache.org/jira/browse/SPARK-20876 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: liuxian >Assignee: liuxian > Fix For: 2.3.0 > > > spark-sql>SELECT ceil(cast(12345.1233 as float)); > spark-sql>12345 > For this case, the result we expected is 12346 > spark-sql>SELECT floor(cast(-12345.1233 as float)); > spark-sql>-12345 > For this case, the result we expected is -12346 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
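The same report, reproduced from a spark-shell session (where {{spark}} is predefined); per the ticket the FLOAT inputs give 12345 and -12345 where 12346 and -12346 are expected, with the fix targeted at 2.3.0. The DOUBLE line is only a hypothetical comparison point.

{code}
// As reported: FLOAT input to ceil/floor
spark.sql("SELECT ceil(CAST(12345.1233 AS FLOAT))").show()
spark.sql("SELECT floor(CAST(-12345.1233 AS FLOAT))").show()

// Hypothetical comparison with DOUBLE input
spark.sql("SELECT ceil(CAST(12345.1233 AS DOUBLE)), floor(CAST(-12345.1233 AS DOUBLE))").show()
{code}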
[jira] [Assigned] (SPARK-20932) CountVectorizer support handle persistence
[ https://issues.apache.org/jira/browse/SPARK-20932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20932: Assignee: (was: Apache Spark) > CountVectorizer support handle persistence > -- > > Key: SPARK-20932 > URL: https://issues.apache.org/jira/browse/SPARK-20932 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > in {{CountVectorizer.fit}}, RDDs {{input}} & {{wordCounts}} should be > unpersisted after computation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20932) CountVectorizer support handle persistence
[ https://issues.apache.org/jira/browse/SPARK-20932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20932: Assignee: Apache Spark > CountVectorizer support handle persistence > -- > > Key: SPARK-20932 > URL: https://issues.apache.org/jira/browse/SPARK-20932 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Assignee: Apache Spark > > in {{CountVectorizer.fit}}, RDDs {{input}} & {{wordCounts}} should be > unpersisted after computation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20932) CountVectorizer support handle persistence
[ https://issues.apache.org/jira/browse/SPARK-20932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030601#comment-16030601 ] Apache Spark commented on SPARK-20932: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/18154 > CountVectorizer support handle persistence > -- > > Key: SPARK-20932 > URL: https://issues.apache.org/jira/browse/SPARK-20932 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > in {{CountVectorizer.fit}}, RDDs {{input}} & {{wordCounts}} should be > unpersisted after computation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20932) CountVectorizer support handle persistence
zhengruifeng created SPARK-20932: Summary: CountVectorizer support handle persistence Key: SPARK-20932 URL: https://issues.apache.org/jira/browse/SPARK-20932 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: zhengruifeng in {{CountVectorizer.fit}}, RDDs {{input}} & {{wordCounts}} should be unpersisted after computation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
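A small sketch of the cache-then-release pattern the ticket asks {{CountVectorizer.fit}} to follow. The RDD names mirror the ones mentioned ({{input}} and {{wordCounts}}), but the surrounding code is illustrative only, not the actual implementation, and a spark-shell style {{spark}} session is assumed.

{code}
import org.apache.spark.storage.StorageLevel
import spark.implicits._

// Toy input standing in for CountVectorizer's tokenized column.
val input = Seq(Seq("a", "b", "a"), Seq("b", "c")).toDF("tokens")
  .rdd.map(_.getSeq[String](0))
  .persist(StorageLevel.MEMORY_AND_DISK)

val wordCounts = input
  .flatMap(tokens => tokens.map(t => (t, 1L)))
  .reduceByKey(_ + _)
  .persist(StorageLevel.MEMORY_AND_DISK)

// Stand-in for the vocabulary selection done during fit().
val vocabulary = wordCounts.collect().sortBy(-_._2).map(_._1)

// The ticket's point: both cached RDDs should be released once fit() has what it needs.
input.unpersist()
wordCounts.unpersist()
{code}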
[jira] [Issue Comment Deleted] (SPARK-20931) Built-in SQL Function - ABS support string type
[ https://issues.apache.org/jira/browse/SPARK-20931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-20931: Comment: was deleted (was: I'm working on.) > Built-in SQL Function - ABS support string type > --- > > Key: SPARK-20931 > URL: https://issues.apache.org/jira/browse/SPARK-20931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Yuming Wang > Labels: starter > > {noformat} > ABS() > {noformat} > Hive/MySQL support this. > Ref: > https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20931) Built-in SQL Function - ABS support string type
[ https://issues.apache.org/jira/browse/SPARK-20931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20931: Assignee: Apache Spark > Built-in SQL Function - ABS support string type > --- > > Key: SPARK-20931 > URL: https://issues.apache.org/jira/browse/SPARK-20931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Apache Spark > Labels: starter > > {noformat} > ABS() > {noformat} > Hive/MySQL support this. > Ref: > https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20931) Built-in SQL Function - ABS support string type
[ https://issues.apache.org/jira/browse/SPARK-20931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20931: Assignee: (was: Apache Spark) > Built-in SQL Function - ABS support string type > --- > > Key: SPARK-20931 > URL: https://issues.apache.org/jira/browse/SPARK-20931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Yuming Wang > Labels: starter > > {noformat} > ABS() > {noformat} > Hive/MySQL support this. > Ref: > https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20931) Built-in SQL Function - ABS support string type
[ https://issues.apache.org/jira/browse/SPARK-20931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030577#comment-16030577 ] Apache Spark commented on SPARK-20931: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/18153 > Built-in SQL Function - ABS support string type > --- > > Key: SPARK-20931 > URL: https://issues.apache.org/jira/browse/SPARK-20931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Yuming Wang > Labels: starter > > {noformat} > ABS() > {noformat} > Hive/MySQL support this. > Ref: > https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20275) HistoryServer page shows incorrect completed date of in-progress apps
[ https://issues.apache.org/jira/browse/SPARK-20275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20275. - Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 2.2.0 2.1.2 > HistoryServer page shows incorrect complete date of inprogress apps > --- > > Key: SPARK-20275 > URL: https://issues.apache.org/jira/browse/SPARK-20275 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.1.2, 2.2.0 > > Attachments: screenshot-1.png > > > The HistoryServer's incomplete page shows in-progress application's completed > date as {{1969-12-31 23:59:59}}, which is not meaningful and could be > improved. > !https://issues.apache.org/jira/secure/attachment/12862656/screenshot-1.png! > So instead of showing this date, here proposed to not display this column > since it is not required for in-progress applications. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
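The 1969-12-31 23:59:59 value in the screenshot is what an unset end time renders as, assuming the unset value is stored as a negative epoch sentinel such as -1 ms. A plain-JVM sketch of where that date comes from:

{code}
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// An in-progress application has no end time; an end time stored as -1 ms
// formats as one millisecond before the epoch, i.e. 1969-12-31 23:59:59
// in UTC (and earlier-looking still in zones behind UTC).
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
println(fmt.format(new Date(-1L)))   // 1969-12-31 23:59:59
{code}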
[jira] [Created] (SPARK-20931) Built-in SQL Function - ABS support string type
Yuming Wang created SPARK-20931: --- Summary: Built-in SQL Function - ABS support string type Key: SPARK-20931 URL: https://issues.apache.org/jira/browse/SPARK-20931 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.2.0 Reporter: Yuming Wang {noformat} ABS() {noformat} Hive/MySQL support this. Ref: https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20931) Built-in SQL Function - ABS support string type
[ https://issues.apache.org/jira/browse/SPARK-20931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030562#comment-16030562 ] Yuming Wang commented on SPARK-20931: - I'm working on. > Built-in SQL Function - ABS support string type > --- > > Key: SPARK-20931 > URL: https://issues.apache.org/jira/browse/SPARK-20931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Yuming Wang > Labels: starter > > {noformat} > ABS() > {noformat} > Hive/MySQL support this. > Ref: > https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
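For reference, the behaviour being asked for follows the linked Hive UDF and MySQL, which coerce a string argument to a number so that abs('-12.3') evaluates to 12.3. The snippet below only shows the target usage from a spark-shell session (where {{spark}} is predefined); whether the string form is accepted, and with what result type, depends on the outcome of the linked pull request.

{code}
// Numeric argument: supported today.
spark.sql("SELECT abs(-12.3)").show()

// String argument: the usage this sub-task targets, mirroring Hive's
// GenericUDFAbs and MySQL, which coerce the string to a number.
spark.sql("SELECT abs('-12.3')").show()
{code}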
[jira] [Assigned] (SPARK-20930) Destroy broadcasted centers after computing cost
[ https://issues.apache.org/jira/browse/SPARK-20930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20930: Assignee: Apache Spark > Destroy broadcasted centers after computing cost > - > > Key: SPARK-20930 > URL: https://issues.apache.org/jira/browse/SPARK-20930 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Trivial > > Destroy broadcasted centers after computing cost -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20930) Destroy broadcasted centers after computing cost
[ https://issues.apache.org/jira/browse/SPARK-20930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20930: Assignee: (was: Apache Spark) > Destroy broadcasted centers after computing cost > - > > Key: SPARK-20930 > URL: https://issues.apache.org/jira/browse/SPARK-20930 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Priority: Trivial > > Destroy broadcasted centers after computing cost -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20930) Destroy broadcasted centers after computing cost
[ https://issues.apache.org/jira/browse/SPARK-20930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030559#comment-16030559 ] Apache Spark commented on SPARK-20930: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/18152 > Destroy broadcasted centers after computing cost > - > > Key: SPARK-20930 > URL: https://issues.apache.org/jira/browse/SPARK-20930 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Priority: Trivial > > Destroy broadcasted centers after computing cost -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20930) Destroy broadcasted centers after computing cost
zhengruifeng created SPARK-20930: Summary: Destroy broadcasted centers after computing cost Key: SPARK-20930 URL: https://issues.apache.org/jira/browse/SPARK-20930 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: zhengruifeng Priority: Trivial Destroy broadcasted centers after computing cost -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
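The change itself lives inside MLlib, but the pattern it applies is general; a self-contained PySpark sketch of it on toy data (not the actual clustering code):
{code:python}
# Pattern behind this improvement: destroy a broadcast variable explicitly once the
# computation that used it (here, a toy cost against fixed centers) has finished, so
# its blocks are released on the executors instead of lingering until GC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-destroy-sketch").getOrCreate()
sc = spark.sparkContext

centers = [(0.0, 0.0), (5.0, 5.0)]                             # toy cluster centers
points = sc.parallelize([(1.0, 1.0), (4.0, 6.0), (0.5, 0.2)])  # toy data

bc_centers = sc.broadcast(centers)
cost = points.map(lambda p: min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                                for c in bc_centers.value)).sum()
bc_centers.destroy()    # destroy the broadcasted centers after computing the cost
print(cost)
{code}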
[jira] [Assigned] (SPARK-20213) DataFrameWriter operations do not show up in SQL tab
[ https://issues.apache.org/jira/browse/SPARK-20213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20213: --- Assignee: Wenchen Fan > DataFrameWriter operations do not show up in SQL tab > > > Key: SPARK-20213 > URL: https://issues.apache.org/jira/browse/SPARK-20213 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.2, 2.1.0 >Reporter: Ryan Blue >Assignee: Wenchen Fan > Fix For: 2.3.0 > > Attachments: Screen Shot 2017-05-03 at 5.00.19 PM.png > > > In 1.6.1, {{DataFrame}} writes started using {{DataFrameWriter}} actions like > {{insertInto}} would show up in the SQL tab. In 2.0.0 and later, they no > longer do. The problem is that 2.0.0 and later no longer wrap execution with > {{SQLExecution.withNewExecutionId}}, which emits > {{SparkListenerSQLExecutionStart}}. > Here are the relevant parts of the stack traces: > {code:title=Spark 1.6.1} > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56) > org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56) > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56) > => holding > Monitor(org.apache.spark.sql.hive.HiveContext$QueryExecution@424773807}) > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:196) > {code} > {code:title=Spark 2.0.0} > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > => holding Monitor(org.apache.spark.sql.execution.QueryExecution@490977924}) > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:301) > {code} > I think this was introduced by > [54d23599|https://github.com/apache/spark/commit/54d23599]. The fix should be > to add withNewExecutionId to > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L610 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20213) DataFrameWriter operations do not show up in SQL tab
[ https://issues.apache.org/jira/browse/SPARK-20213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20213. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18064 [https://github.com/apache/spark/pull/18064] > DataFrameWriter operations do not show up in SQL tab > > > Key: SPARK-20213 > URL: https://issues.apache.org/jira/browse/SPARK-20213 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.2, 2.1.0 >Reporter: Ryan Blue > Fix For: 2.3.0 > > Attachments: Screen Shot 2017-05-03 at 5.00.19 PM.png > > > In 1.6.1, {{DataFrame}} writes started using {{DataFrameWriter}} actions like > {{insertInto}} would show up in the SQL tab. In 2.0.0 and later, they no > longer do. The problem is that 2.0.0 and later no longer wrap execution with > {{SQLExecution.withNewExecutionId}}, which emits > {{SparkListenerSQLExecutionStart}}. > Here are the relevant parts of the stack traces: > {code:title=Spark 1.6.1} > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56) > org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56) > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56) > => holding > Monitor(org.apache.spark.sql.hive.HiveContext$QueryExecution@424773807}) > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:196) > {code} > {code:title=Spark 2.0.0} > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > => holding Monitor(org.apache.spark.sql.execution.QueryExecution@490977924}) > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:301) > {code} > I think this was introduced by > [54d23599|https://github.com/apache/spark/commit/54d23599]. The fix should be > to add withNewExecutionId to > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L610 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
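For context, a user-level sketch (table name and data are placeholders, not from the report) of the kind of DataFrameWriter action that, per this issue, produced no entry in the Web UI's SQL tab until execution was wrapped in SQLExecution.withNewExecutionId again:
{code:python}
# Placeholder repro sketch: run a DataFrameWriter action and then check the SQL tab.
# Before the fix such writes did not appear there; after it they should.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-tab-sketch").getOrCreate()

df = spark.range(1000).withColumnRenamed("id", "key")
df.write.mode("overwrite").saveAsTable("sql_tab_demo")  # inspect the SQL tab in the Web UI
{code}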
[jira] [Assigned] (SPARK-20929) LinearSVC should not use shared Param HasThresholds
[ https://issues.apache.org/jira/browse/SPARK-20929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20929: Assignee: Apache Spark (was: Joseph K. Bradley) > LinearSVC should not use shared Param HasThresholds > --- > > Key: SPARK-20929 > URL: https://issues.apache.org/jira/browse/SPARK-20929 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > LinearSVC applies the Param 'threshold' to the rawPrediction, not the > probability. It has different semantics than the shared Param HasThreshold, > so it should not use the shared Param. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20929) LinearSVC should not use shared Param HasThresholds
[ https://issues.apache.org/jira/browse/SPARK-20929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20929: Assignee: Joseph K. Bradley (was: Apache Spark) > LinearSVC should not use shared Param HasThresholds > --- > > Key: SPARK-20929 > URL: https://issues.apache.org/jira/browse/SPARK-20929 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > LinearSVC applies the Param 'threshold' to the rawPrediction, not the > probability. It has different semantics than the shared Param HasThreshold, > so it should not use the shared Param. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20929) LinearSVC should not use shared Param HasThresholds
[ https://issues.apache.org/jira/browse/SPARK-20929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030437#comment-16030437 ] Apache Spark commented on SPARK-20929: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/18151 > LinearSVC should not use shared Param HasThresholds > --- > > Key: SPARK-20929 > URL: https://issues.apache.org/jira/browse/SPARK-20929 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > LinearSVC applies the Param 'threshold' to the rawPrediction, not the > probability. It has different semantics than the shared Param HasThreshold, > so it should not use the shared Param. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20929) LinearSVC should not use shared Param HasThresholds
[ https://issues.apache.org/jira/browse/SPARK-20929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20929: -- Priority: Minor (was: Major) > LinearSVC should not use shared Param HasThresholds > --- > > Key: SPARK-20929 > URL: https://issues.apache.org/jira/browse/SPARK-20929 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > LinearSVC applies the Param 'threshold' to the rawPrediction, not the > probability. It has different semantics than the shared Param HasThreshold, > so it should not use the shared Param. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20929) LinearSVC should not use shared Param HasThresholds
Joseph K. Bradley created SPARK-20929: - Summary: LinearSVC should not use shared Param HasThresholds Key: SPARK-20929 URL: https://issues.apache.org/jira/browse/SPARK-20929 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley LinearSVC applies the Param 'threshold' to the rawPrediction, not the probability. It has different semantics than the shared Param HasThreshold, so it should not use the shared Param. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
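To make the distinction concrete, a small PySpark sketch on toy data (values are illustrative): LinearSVC's threshold is compared against the raw margin in rawPrediction, not against a probability, which is why it differs from the shared HasThreshold semantics:
{code:python}
# Toy illustration of the Param discussed above; LinearSVC exposes rawPrediction
# (the SVM margin) and applies `threshold` to it rather than to a probability.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LinearSVC
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("linearsvc-threshold-sketch").getOrCreate()

train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.0])),
     (0.0, Vectors.dense([0.5, 0.5])),
     (1.0, Vectors.dense([2.0, 3.0])),
     (1.0, Vectors.dense([3.0, 2.5]))],
    ["label", "features"])

svc = LinearSVC(maxIter=10, regParam=0.01, threshold=0.0)  # threshold applies to the margin
model = svc.fit(train)
model.transform(train).select("rawPrediction", "prediction").show(truncate=False)
{code}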
[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030379#comment-16030379 ] Nan Zhu commented on SPARK-20928: - Hi, is there any description of what this means? > Continuous Processing Mode for Structured Streaming > --- > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20651) Speed up the new app state listener
[ https://issues.apache.org/jira/browse/SPARK-20651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-20651. Resolution: Won't Do I've done some perf work to make sure live applications don't regress, and made a bunch of changes that make the original code I had for this milestone obsolete, so I removed it from my plan. The current list of "upcoming" PRs can be seen at: https://github.com/vanzin/spark/pulls > Speed up the new app state listener > --- > > Key: SPARK-20651 > URL: https://issues.apache.org/jira/browse/SPARK-20651 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks enhancements to the code added in previous tasks so that the > new app state listener is faster; it adds a caching layer and an asynchronous > write layer that also does deduplication, so that it avoids blocking the > listener thread and also avoids unnecessary writes to disk. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2183) Avoid loading/shuffling data twice in self-join query
[ https://issues.apache.org/jira/browse/SPARK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-2183. -- Resolution: Fixed Assignee: Reynold Xin This shouldn't be an issue anymore with reuse exchange in the latest release (I checked 2.2). > Avoid loading/shuffling data twice in self-join query > - > > Key: SPARK-2183 > URL: https://issues.apache.org/jira/browse/SPARK-2183 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Minor > > {code} > scala> hql("select * from src a join src b on (a.key=b.key)") > res2: org.apache.spark.sql.SchemaRDD = > SchemaRDD[3] at RDD at SchemaRDD.scala:100 > == Query Plan == > Project [key#3:0,value#4:1,key#5:2,value#6:3] > HashJoin [key#3], [key#5], BuildRight > Exchange (HashPartitioning [key#3:0], 200) >HiveTableScan [key#3,value#4], (MetastoreRelation default, src, Some(a)), > None > Exchange (HashPartitioning [key#5:0], 200) >HiveTableScan [key#5,value#6], (MetastoreRelation default, src, Some(b)), > None > {code} > The optimal execution strategy for the above example is to load data only > once and repartition once. > If we want to hyper optimize it, we can also have a self join operator that > builds the hashmap and then simply traverses the hashmap ... -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
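A quick way to confirm the behaviour described above on a recent release (assuming a Hive table named src, as in the original plan): the physical plan of the self-join should contain a ReusedExchange node rather than two independent scans and shuffles of src.
{code:python}
# Sketch for checking exchange reuse on a self-join; the `src` table is assumed to exist.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

plan_df = spark.sql("SELECT * FROM src a JOIN src b ON (a.key = b.key)")
plan_df.explain()   # look for a ReusedExchange node in the printed physical plan
{code}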
[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures
[ https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030284#comment-16030284 ] Sital Kedia commented on SPARK-20178: - https://github.com/apache/spark/pull/18150 > Improve Scheduler fetch failures > > > Key: SPARK-20178 > URL: https://issues.apache.org/jira/browse/SPARK-20178 > Project: Spark > Issue Type: Epic > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > We have been having a lot of discussions around improving the handling of > fetch failures. There are 4 jira currently related to this. > We should try to get a list of things we want to improve and come up with one > cohesive design. > SPARK-20163, SPARK-20091, SPARK-14649 , and SPARK-19753 > I will put my initial thoughts in a follow on comment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20883) Improve StateStore APIs for efficiency
[ https://issues.apache.org/jira/browse/SPARK-20883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20883. -- Resolution: Fixed Fix Version/s: 2.3.0 > Improve StateStore APIs for efficiency > -- > > Key: SPARK-20883 > URL: https://issues.apache.org/jira/browse/SPARK-20883 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.3.0 > > > Current state store API has a bunch of problems that causes too many > transient objects causing memory pressure. > - StateStore.get() returns Options which forces creation of Some/None objects > for every get > - StateStore.iterator() returns tuples which forces creation of new tuple for > each record returned > - StateStore.updates() requires the implementation to keep track of updates, > while this is used minimally (only by Append mode in streaming aggregations). > This can be totally removed. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19753) Remove all shuffle files on a host in case of slave lost of fetch failure
[ https://issues.apache.org/jira/browse/SPARK-19753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030270#comment-16030270 ] Apache Spark commented on SPARK-19753: -- User 'sitalkedia' has created a pull request for this issue: https://github.com/apache/spark/pull/18150 > Remove all shuffle files on a host in case of slave lost of fetch failure > - > > Key: SPARK-19753 > URL: https://issues.apache.org/jira/browse/SPARK-19753 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.1 >Reporter: Sital Kedia > > Currently, when we detect fetch failure, we only remove the shuffle files > produced by the executor, while the host itself might be down and all the > shuffle files are not accessible. In case we are running multiple executors > on a host, any host going down currently results in multiple fetch failures > and multiple retries of the stage, which is very inefficient. If we remove > all the shuffle files on that host, on first fetch failure, we can rerun all > the tasks on that host in a single stage retry. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20894) Error while checkpointing to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20894: Assignee: (was: Apache Spark) > Error while checkpointing to HDFS > - > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20894) Error while checkpointing to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20894: Assignee: Apache Spark > Error while checkpointing to HDFS > - > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali >Assignee: Apache Spark > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030261#comment-16030261 ] Apache Spark commented on SPARK-20894: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/18149 > Error while checkpointing to HDFS > - > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures
[ https://issues.apache.org/jira/browse/SPARK-20926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20926: Assignee: Apache Spark > Exposure to Guava libraries by directly accessing tableRelationCache in > SessionCatalog caused failures > -- > > Key: SPARK-20926 > URL: https://issues.apache.org/jira/browse/SPARK-20926 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Reza Safi >Assignee: Apache Spark > > Because of shading that we did for guava libraries, we see test failures > whenever those components directly access tableRelationCache in > SessionCatalog. > This can happen in any component that shaded guava library. Failures looks > like this: > {noformat} > java.lang.NoSuchMethodError: > org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache; > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492) > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138) > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32) > 01:25:14 at > org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213) > 01:25:14 at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278) > 01:25:14 at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures
[ https://issues.apache.org/jira/browse/SPARK-20926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030257#comment-16030257 ] Apache Spark commented on SPARK-20926: -- User 'rezasafi' has created a pull request for this issue: https://github.com/apache/spark/pull/18148 > Exposure to Guava libraries by directly accessing tableRelationCache in > SessionCatalog caused failures > -- > > Key: SPARK-20926 > URL: https://issues.apache.org/jira/browse/SPARK-20926 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Reza Safi > > Because of shading that we did for guava libraries, we see test failures > whenever those components directly access tableRelationCache in > SessionCatalog. > This can happen in any component that shaded guava library. Failures looks > like this: > {noformat} > java.lang.NoSuchMethodError: > org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache; > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492) > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138) > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32) > 01:25:14 at > org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213) > 01:25:14 at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278) > 01:25:14 at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures
[ https://issues.apache.org/jira/browse/SPARK-20926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20926: Assignee: (was: Apache Spark) > Exposure to Guava libraries by directly accessing tableRelationCache in > SessionCatalog caused failures > -- > > Key: SPARK-20926 > URL: https://issues.apache.org/jira/browse/SPARK-20926 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Reza Safi > > Because of shading that we did for guava libraries, we see test failures > whenever those components directly access tableRelationCache in > SessionCatalog. > This can happen in any component that shaded guava library. Failures looks > like this: > {noformat} > java.lang.NoSuchMethodError: > org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache; > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492) > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138) > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32) > 01:25:14 at > org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213) > 01:25:14 at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278) > 01:25:14 at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
[ https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030197#comment-16030197 ] Jeffrey Quinn commented on SPARK-20925: --- Apologies, will move to the mailing list next time I have a general question like that. I agree key skew is often an issue, but for the data we were testing with the cardinality of the partition column is 1, which helps rule some things out. I wanted to post again because after taking another crack at looking through the source I think I may have found a root cause: The ExecuteWriteTask implementation for a partitioned table (org.apache.spark.sql.execution.datasources.FileFormatWriter.DynamicPartitionWriteTask) sorts the rows of the table by the partition keys before writing. This makes sense as it minimizes the number of OutputWriters that need to be created. In the course of doing this, the ExecuteWriteTask uses org.apache.spark.sql.execution.UnsafeKVExternalSorter to sort the rows to be written. It then gets an iterator over the sorted rows via org.apache.spark.sql.execution.UnsafeKVExternalSorter#sortedIterator. The scaladoc of that method advises that it is the callers responsibility to call org.apache.spark.sql.execution.UnsafeKVExternalSorter#cleanupResources (see https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java#L176). However in ExecuteWriteTask, we appear to never call cleanupResources() when we are done with the iterator (see https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L379). This seems like it could create a memory leak, which would explain the behavior that we have observed. Luckily, it seems like this possible memory leak was fixed totally coincidentally by this revision: https://github.com/apache/spark/commit/776b8f17cfc687a57c005a421a81e591c8d44a3f Which changes this behavior for stated performance reasons. So the best solution to this issue may be to upgrade to v2.1.1. > Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy > -- > > Key: SPARK-20925 > URL: https://issues.apache.org/jira/browse/SPARK-20925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jeffrey Quinn > > Observed under the following conditions: > Spark Version: Spark 2.1.0 > Hadoop Version: Amazon 2.7.3 (emr-5.5.0) > spark.submit.deployMode = client > spark.master = yarn > spark.driver.memory = 10g > spark.shuffle.service.enabled = true > spark.dynamicAllocation.enabled = true > The job we are running is very simple: Our workflow reads data from a JSON > format stored on S3, and write out partitioned parquet files to HDFS. > As a one-liner, the whole workflow looks like this: > ``` > sparkSession.sqlContext > .read > .schema(inputSchema) > .json(expandedInputPath) > .select(columnMap:_*) > .write.partitionBy("partition_by_column") > .parquet(outputPath) > ``` > Unfortunately, for larger inputs, this job consistently fails with containers > running out of memory. We observed containers of up to 20GB OOMing, which is > surprising because the input data itself is only 15 GB compressed and maybe > 100GB uncompressed. > The error message we get indicates yarn is killing the containers. The > executors are running out of memory and not the driver. 
> ```Caused by: org.apache.spark.SparkException: Job aborted due to stage > failure: Task 184 in stage 74.0 failed 4 times, most recent failure: Lost > task 184.3 in stage 74.0 (TID 19110, ip-10-242-15-251.ec2.internal, executor > 14): ExecutorLostFailure (executor 14 exited caused by one of the running > tasks) Reason: Container killed by YARN for exceeding memory limits. 21.5 GB > of 20.9 GB physical memory used. Consider boosting > spark.yarn.executor.memoryOverhead.``` > We tried a full parameter sweep, including using dynamic allocation and > setting executor memory as high as 20GB. The result was the same each time, > with the job failing due to lost executors due to YARN killing containers. > We were able to bisect that `partitionBy` is the problem by progressively > removing/commenting out parts of our workflow. Finally when we get to the > above state, if we remove `partitionBy` the job succeeds with no OOM. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20928) Continuous Processing Mode for Structured Streaming
Michael Armbrust created SPARK-20928: Summary: Continuous Processing Mode for Structured Streaming Key: SPARK-20928 URL: https://issues.apache.org/jira/browse/SPARK-20928 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19236) Add createOrReplaceGlobalTempView
[ https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030165#comment-16030165 ] Apache Spark commented on SPARK-19236: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/18147 > Add createOrReplaceGlobalTempView > - > > Key: SPARK-19236 > URL: https://issues.apache.org/jira/browse/SPARK-19236 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Arman Yazdani >Priority: Minor > > There are 3 methods for saving temp tables: > createTempView > createOrReplaceTempView > createGlobalTempView > but there isn't: > createOrReplaceGlobalTempView -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
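Until such a method exists, dropping and recreating the global view gives the same effect; a small sketch against the existing API (view names are illustrative):
{code:python}
# Sketch of the gap described above: temp views have a create-or-replace variant,
# global temp views do not, but a manual drop-then-create achieves the same result.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("global-temp-view-sketch").getOrCreate()
df = spark.range(10)

df.createOrReplaceTempView("local_view")          # session-scoped; replace is supported

spark.catalog.dropGlobalTempView("global_view")   # no error if the view does not exist
df.createGlobalTempView("global_view")            # would fail if the view already existed
spark.sql("SELECT COUNT(*) FROM global_temp.global_view").show()
{code}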
[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030160#comment-16030160 ] Shixiong Zhu commented on SPARK-20894: -- The root issue here is the driver uses the local file system for checkpoints but executors use HDFS. I reopened this ticket because I think we can improve the error message here. > Error while checkpointing to HDFS (similar to JIRA SPARK-19268) > --- > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
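Given that root cause (the driver resolving the path on its local file system while executors resolve it on HDFS), one way for users to avoid the ambiguity is to pass a fully qualified URI; a PySpark sketch of the reported query with that change, where df1, the console sink, and the namenode address are placeholders rather than details from the report:
{code:python}
# Same query shape as in the report (the reporter's custom KafkaSink is replaced by a
# console sink for brevity), with the checkpoint location given as an explicit hdfs://
# URI so the driver and the executors resolve the same filesystem.
from pyspark.sql import functions as F

df2 = (df1.groupBy(F.window(df1["Timestamp5"], "24 hours", "24 hours"),
                   df1["AppName"])
          .count())

query = (df2.writeStream
            .format("console")
            .option("checkpointLocation",
                    "hdfs://namenode:8020/usr/local/hadoop/checkpoint")
            .outputMode("update")
            .start())
query.awaitTermination()
{code}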
[jira] [Updated] (SPARK-20894) Error while checkpointing to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20894: - Summary: Error while checkpointing to HDFS (was: Error while checkpointing to HDFS (similar to JIRA SPARK-19268)) > Error while checkpointing to HDFS > - > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20894: - Issue Type: Improvement (was: Bug) > Error while checkpointing to HDFS (similar to JIRA SPARK-19268) > --- > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reopened SPARK-20894: -- > Error while checkpointing to HDFS (similar to JIRA SPARK-19268) > --- > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20924) Unable to call the function registered in the not-current database
[ https://issues.apache.org/jira/browse/SPARK-20924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20924. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18146 [https://github.com/apache/spark/pull/18146] > Unable to call the function registered in the not-current database > -- > > Key: SPARK-20924 > URL: https://issues.apache.org/jira/browse/SPARK-20924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > We are unable to call the function registered in the not-current database. > {noformat} > sql("CREATE DATABASE dAtABaSe1") > sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS > '${classOf[GenericUDAFAverage].getName}'") > sql("SELECT dAtABaSe1.test_avg(1)") > {noformat} > The above code returns an error: > {noformat} > Undefined function: 'dAtABaSe1.test_avg'. This function is neither a > registered temporary function nor a permanent function registered in the > database 'default'.; line 1 pos 7 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
[ https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030108#comment-16030108 ] Sean Owen commented on SPARK-20925: --- This is better for the mailing list. Spark allocates off heap memory for lots of things (look up "spark tungsten"). Sometimes the default isn't enough. It's not a Spark issue per se, no, but a matter of how much YARN is asked to give the JVM. Partitioning isn't necessarily a trivial operation, and you might have some issue with key skew. By the way, the error message tells you about spark.yarn.executor.memoryOverhead already. > Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy > -- > > Key: SPARK-20925 > URL: https://issues.apache.org/jira/browse/SPARK-20925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jeffrey Quinn > > Observed under the following conditions: > Spark Version: Spark 2.1.0 > Hadoop Version: Amazon 2.7.3 (emr-5.5.0) > spark.submit.deployMode = client > spark.master = yarn > spark.driver.memory = 10g > spark.shuffle.service.enabled = true > spark.dynamicAllocation.enabled = true > The job we are running is very simple: Our workflow reads data from a JSON > format stored on S3, and write out partitioned parquet files to HDFS. > As a one-liner, the whole workflow looks like this: > ``` > sparkSession.sqlContext > .read > .schema(inputSchema) > .json(expandedInputPath) > .select(columnMap:_*) > .write.partitionBy("partition_by_column") > .parquet(outputPath) > ``` > Unfortunately, for larger inputs, this job consistently fails with containers > running out of memory. We observed containers of up to 20GB OOMing, which is > surprising because the input data itself is only 15 GB compressed and maybe > 100GB uncompressed. > The error message we get indicates yarn is killing the containers. The > executors are running out of memory and not the driver. > ```Caused by: org.apache.spark.SparkException: Job aborted due to stage > failure: Task 184 in stage 74.0 failed 4 times, most recent failure: Lost > task 184.3 in stage 74.0 (TID 19110, ip-10-242-15-251.ec2.internal, executor > 14): ExecutorLostFailure (executor 14 exited caused by one of the running > tasks) Reason: Container killed by YARN for exceeding memory limits. 21.5 GB > of 20.9 GB physical memory used. Consider boosting > spark.yarn.executor.memoryOverhead.``` > We tried a full parameter sweep, including using dynamic allocation and > setting executor memory as high as 20GB. The result was the same each time, > with the job failing due to lost executors due to YARN killing containers. > We were able to bisect that `partitionBy` is the problem by progressively > removing/commenting out parts of our workflow. Finally when we get to the > above state, if we remove `partitionBy` the job succeeds with no OOM. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19732) DataFrame.fillna() does not work for bools in PySpark
[ https://issues.apache.org/jira/browse/SPARK-19732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030102#comment-16030102 ] Ruben Berenguel commented on SPARK-19732: - I'll give this a go! > DataFrame.fillna() does not work for bools in PySpark > - > > Key: SPARK-19732 > URL: https://issues.apache.org/jira/browse/SPARK-19732 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Len Frodgers >Priority: Minor > > In PySpark, the fillna function of DataFrame inadvertently casts bools to > ints, so fillna cannot be used to fill True/False. > e.g. > `spark.createDataFrame([Row(a=True),Row(a=None)]).fillna(True).collect()` > yields > `[Row(a=True), Row(a=None)]` > It should be a=True for the second Row > The cause is this bit of code: > {code} > if isinstance(value, (int, long)): > value = float(value) > {code} > There needs to be a separate check for isinstance(bool), since in python, > bools are ints too > Additionally there's another anomaly: > Spark (and pyspark) supports filling of bools if you specify the args as a > map: > {code} > fillna({"a": False}) > {code} > , but not if you specify it as > {code} > fillna(False) > {code} > This is because (scala-)Spark has no > {code} > def fill(value: Boolean): DataFrame = fill(value, df.columns) > {code} > method. I find that strange/buggy -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
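For whoever picks this up, a stand-alone sketch in plain Python (a hypothetical helper, not the actual PySpark code path) of why the check has to treat bool before int:
{code:python}
# Because bool is a subclass of int in Python, the bool test must come before the
# int-to-float coercion quoted in the issue; otherwise True/False fill values are
# silently turned into 1.0/0.0.
def normalize_fill_value(value):
    if isinstance(value, bool):   # must be tested first: True/False are also ints
        return value
    if isinstance(value, int):    # the int/long branch from the issue
        return float(value)
    return value

print(normalize_fill_value(True))   # True  (not 1.0)
print(normalize_fill_value(3))      # 3.0
print(normalize_fill_value("n/a"))  # 'n/a'
{code}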
[jira] [Commented] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
[ https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030089#comment-16030089 ] Jeffrey Quinn commented on SPARK-20925: --- Thanks Sean, Sorry to continue to comment on a resolved issue, but I'm extremely curious to learn how this works since I have run into this issue several times before on other applications. In the the scenario you describe, the spark application logic thinks that it can allocate more memory, but that calculation is incorrect because there is a a significant amount of off-heap memory already in use and the spark application logic does not take that into account. Instead to provide for off heap allocation, a static overhead in terms of percentage of total JVM memory is used. Is that correct? The thing that really boggles my mind is, what could be using that off-heap memory? As you can see from my question, we do not have any sort of custom UDF code here, we are just calling the spark api in the most straightforward way possible. Why would the default setting not be sufficient for this case? Our schema has a significant number of columns (~100), perhaps that is to blame? Is catalyst using the off heap memory maybe? Thanks, Jeff > Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy > -- > > Key: SPARK-20925 > URL: https://issues.apache.org/jira/browse/SPARK-20925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jeffrey Quinn > > Observed under the following conditions: > Spark Version: Spark 2.1.0 > Hadoop Version: Amazon 2.7.3 (emr-5.5.0) > spark.submit.deployMode = client > spark.master = yarn > spark.driver.memory = 10g > spark.shuffle.service.enabled = true > spark.dynamicAllocation.enabled = true > The job we are running is very simple: Our workflow reads data from a JSON > format stored on S3, and write out partitioned parquet files to HDFS. > As a one-liner, the whole workflow looks like this: > ``` > sparkSession.sqlContext > .read > .schema(inputSchema) > .json(expandedInputPath) > .select(columnMap:_*) > .write.partitionBy("partition_by_column") > .parquet(outputPath) > ``` > Unfortunately, for larger inputs, this job consistently fails with containers > running out of memory. We observed containers of up to 20GB OOMing, which is > surprising because the input data itself is only 15 GB compressed and maybe > 100GB uncompressed. > The error message we get indicates yarn is killing the containers. The > executors are running out of memory and not the driver. > ```Caused by: org.apache.spark.SparkException: Job aborted due to stage > failure: Task 184 in stage 74.0 failed 4 times, most recent failure: Lost > task 184.3 in stage 74.0 (TID 19110, ip-10-242-15-251.ec2.internal, executor > 14): ExecutorLostFailure (executor 14 exited caused by one of the running > tasks) Reason: Container killed by YARN for exceeding memory limits. 21.5 GB > of 20.9 GB physical memory used. Consider boosting > spark.yarn.executor.memoryOverhead.``` > We tried a full parameter sweep, including using dynamic allocation and > setting executor memory as high as 20GB. The result was the same each time, > with the job failing due to lost executors due to YARN killing containers. > We were able to bisect that `partitionBy` is the problem by progressively > removing/commenting out parts of our workflow. Finally when we get to the > above state, if we remove `partitionBy` the job succeeds with no OOM. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
[ https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20925. --- Resolution: Not A Problem That doesn't mean the JVM is out of memory; it kind of means the opposite. It thinks it can use more than YARN does, due to off-heap allocation. Setting the heap size higher only helps if you make it so high that the default off-heap cushion is sufficient. Increase spark.yarn.executor.memoryOverhead instead, as your heap is likely far too big. > Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy > -- > > Key: SPARK-20925 > URL: https://issues.apache.org/jira/browse/SPARK-20925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jeffrey Quinn > > Observed under the following conditions: > Spark Version: Spark 2.1.0 > Hadoop Version: Amazon 2.7.3 (emr-5.5.0) > spark.submit.deployMode = client > spark.master = yarn > spark.driver.memory = 10g > spark.shuffle.service.enabled = true > spark.dynamicAllocation.enabled = true > The job we are running is very simple: Our workflow reads data from a JSON > format stored on S3, and write out partitioned parquet files to HDFS. > As a one-liner, the whole workflow looks like this: > ``` > sparkSession.sqlContext > .read > .schema(inputSchema) > .json(expandedInputPath) > .select(columnMap:_*) > .write.partitionBy("partition_by_column") > .parquet(outputPath) > ``` > Unfortunately, for larger inputs, this job consistently fails with containers > running out of memory. We observed containers of up to 20GB OOMing, which is > surprising because the input data itself is only 15 GB compressed and maybe > 100GB uncompressed. > The error message we get indicates yarn is killing the containers. The > executors are running out of memory and not the driver. > ```Caused by: org.apache.spark.SparkException: Job aborted due to stage > failure: Task 184 in stage 74.0 failed 4 times, most recent failure: Lost > task 184.3 in stage 74.0 (TID 19110, ip-10-242-15-251.ec2.internal, executor > 14): ExecutorLostFailure (executor 14 exited caused by one of the running > tasks) Reason: Container killed by YARN for exceeding memory limits. 21.5 GB > of 20.9 GB physical memory used. Consider boosting > spark.yarn.executor.memoryOverhead.``` > We tried a full parameter sweep, including using dynamic allocation and > setting executor memory as high as 20GB. The result was the same each time, > with the job failing due to lost executors due to YARN killing containers. > We were able to bisect that `partitionBy` is the problem by progressively > removing/commenting out parts of our workflow. Finally when we get to the > above state, if we remove `partitionBy` the job succeeds with no OOM. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
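A configuration sketch of the suggested remedy; the sizes below are illustrative, not a recommendation from this ticket: keep the executor heap modest and enlarge the off-heap cushion that YARN accounts for.
{code:python}
# Illustrative sizing only: a somewhat smaller heap plus a larger YARN memory overhead
# (this property is in MiB) so off-heap allocations fit inside the container limit.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partitioned-parquet-write")
         .config("spark.executor.memory", "16g")
         .config("spark.yarn.executor.memoryOverhead", "4096")
         .getOrCreate())
{code}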
[jira] [Commented] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030059#comment-16030059 ] Thomas Graves commented on SPARK-20923: --- Taking a quick look at the history of _updatedBlockStatuses, it looks like this used to be used for StorageStatusListener, but it has since been changed to rely on the SparkListenerBlockUpdated event. That BlockUpdated event comes from BlockManagerMaster.updateBlockInfo, which is called by the executors. So I'm not seeing anything use _updatedBlockStatuses. I'll start to rip it out and see what I hit. > TaskMetrics._updatedBlockStatuses uses a lot of memory > -- > > Key: SPARK-20923 > URL: https://issues.apache.org/jira/browse/SPARK-20923 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > The driver appears to use a ton of memory in certain cases to store the task > metrics updated block status'. For instance I had a user reading data form > hive and caching it. The # of tasks to read was around 62,000, they were > using 1000 executors and it ended up caching a couple TB's of data. The > driver kept running out of memory. > I investigated and it looks like there was 5GB of a 10GB heap being used up > by the TaskMetrics._updatedBlockStatuses because there are a lot of blocks. > The updatedBlockStatuses was already removed from the task end event under > SPARK-20084. I don't see anything else that seems to be using this. Anybody > know if I missed something? > If its not being used we should remove it, otherwise we need to figure out a > better way of doing it so it doesn't use so much memory. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
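For reference, a small sketch (not code from any patch in this ticket) of consuming block updates through the SparkListenerBlockUpdated event mentioned above, which is the path that replaced the old StorageStatusListener usage; the listener class name is made up for illustration:

{code}
// Sketch: track block status via listener events instead of
// TaskMetrics._updatedBlockStatuses.
import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

class BlockUpdateLogger extends SparkListener {
  override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
    val info = event.blockUpdatedInfo
    println(s"${info.blockId} on executor ${info.blockManagerId.executorId}: " +
      s"mem=${info.memSize} disk=${info.diskSize} level=${info.storageLevel}")
  }
}

// Registered with: sc.addSparkListener(new BlockUpdateLogger())
{code}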
[jira] [Comment Edited] (SPARK-18881) Spark never finishes jobs and stages, JobProgressListener fails
[ https://issues.apache.org/jira/browse/SPARK-18881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030025#comment-16030025 ] Mathieu D edited comment on SPARK-18881 at 5/30/17 7:52 PM: Just to mention a workaround for those experiencing the problem : try increase {{spark.scheduler.listenerbus.eventqueue.size}} (default 1). It may only postpone the problem, if the queue filling is faster than listeners for a long time. In our case, we have bursts of activity and raising this limit helps. was (Author: mathieude): Just to mention a workaround for those experiencing the problem : try increase {{spark.scheduler.listenerbus.eventqueue.size}} (default 1). It may only postpone the problem, if the queue filling is faster than listeners for a long time. In our case, we have bursts of activity and raising this limits helps. > Spark never finishes jobs and stages, JobProgressListener fails > --- > > Key: SPARK-18881 > URL: https://issues.apache.org/jira/browse/SPARK-18881 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 > Environment: yarn, deploy-mode = client >Reporter: Mathieu D > > We have a Spark application that process continuously a lot of incoming jobs. > Several jobs are processed in parallel, on multiple threads. > During intensive workloads, at some point, we start to have hundreds of > warnings like this : > {code} > 16/12/14 21:04:03 WARN JobProgressListener: Task end for unknown stage 147379 > 16/12/14 21:04:03 WARN JobProgressListener: Job completed for unknown job > 64610 > 16/12/14 21:04:04 WARN JobProgressListener: Task start for unknown stage > 147405 > 16/12/14 21:04:04 WARN JobProgressListener: Task end for unknown stage 147406 > 16/12/14 21:04:04 WARN JobProgressListener: Job completed for unknown job > 64622 > {code} > Starting from that, the performance of the app plummet, most of Stages and > Jobs never finish. On SparkUI, I can see figures like 13000 pending jobs. > I can't see clearly another related exception happening before. Maybe this > one, but it concerns another listener : > {code} > 16/12/14 21:03:54 ERROR LiveListenerBus: Dropping SparkListenerEvent because > no remaining room in event queue. This likely means one of the SparkListeners > is too slow and cannot keep up with the rate at which tasks are being started > by the scheduler. > 16/12/14 21:03:54 WARN LiveListenerBus: Dropped 1 SparkListenerEvents since > Thu Jan 01 01:00:00 CET 1970 > {code} > This is very problematic for us, since it's hard to detect, and requires an > app restart. > *EDIT :* > I confirm the sequence : > 1- ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining > room in event queue > then > 2- JobProgressListener losing track of job and stages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18881) Spark never finishes jobs and stages, JobProgressListener fails
[ https://issues.apache.org/jira/browse/SPARK-18881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030025#comment-16030025 ] Mathieu D commented on SPARK-18881: --- Just to mention a workaround for those experiencing the problem : try increase {{spark.scheduler.listenerbus.eventqueue.size}} (default 1). It may only postpone the problem, if the queue filling is faster than listeners for a long time. In our case, we have bursts of activity and raising this limits helps. > Spark never finishes jobs and stages, JobProgressListener fails > --- > > Key: SPARK-18881 > URL: https://issues.apache.org/jira/browse/SPARK-18881 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 > Environment: yarn, deploy-mode = client >Reporter: Mathieu D > > We have a Spark application that process continuously a lot of incoming jobs. > Several jobs are processed in parallel, on multiple threads. > During intensive workloads, at some point, we start to have hundreds of > warnings like this : > {code} > 16/12/14 21:04:03 WARN JobProgressListener: Task end for unknown stage 147379 > 16/12/14 21:04:03 WARN JobProgressListener: Job completed for unknown job > 64610 > 16/12/14 21:04:04 WARN JobProgressListener: Task start for unknown stage > 147405 > 16/12/14 21:04:04 WARN JobProgressListener: Task end for unknown stage 147406 > 16/12/14 21:04:04 WARN JobProgressListener: Job completed for unknown job > 64622 > {code} > Starting from that, the performance of the app plummet, most of Stages and > Jobs never finish. On SparkUI, I can see figures like 13000 pending jobs. > I can't see clearly another related exception happening before. Maybe this > one, but it concerns another listener : > {code} > 16/12/14 21:03:54 ERROR LiveListenerBus: Dropping SparkListenerEvent because > no remaining room in event queue. This likely means one of the SparkListeners > is too slow and cannot keep up with the rate at which tasks are being started > by the scheduler. > 16/12/14 21:03:54 WARN LiveListenerBus: Dropped 1 SparkListenerEvents since > Thu Jan 01 01:00:00 CET 1970 > {code} > This is very problematic for us, since it's hard to detect, and requires an > app restart. > *EDIT :* > I confirm the sequence : > 1- ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining > room in event queue > then > 2- JobProgressListener losing track of job and stages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
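A sketch of applying the workaround at configuration time. The usual default for this queue in Spark 2.x is 10000 events; the 30000 below is an arbitrary illustration, and as noted above a bigger queue only postpones trouble if listeners stay slower than the event rate:

{code}
// Hedged sketch: enlarge the listener bus event queue for bursty workloads.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("bursty-app")
  .set("spark.scheduler.listenerbus.eventqueue.size", "30000")
{code}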
[jira] [Commented] (SPARK-19044) PySpark dropna() can fail with AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-19044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030021#comment-16030021 ] Ruben Berenguel commented on SPARK-19044: - Oh, there's a typo in the "equivalent Scala code" in the bug report, where we have v1 instead of v2: val v1 = spark.range(10) val v2 = v1.crossJoin(v1) v2.na.drop().explain() org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id#0L, id#3L.; I think this can't be really considered a bug, since at least it's as bad in Scala as it is in Python (not much more can be done here since this is the expected behaviour of a LogicalPlan). [~holdenk] do you know who I need to ping in this case? > PySpark dropna() can fail with AnalysisException > > > Key: SPARK-19044 > URL: https://issues.apache.org/jira/browse/SPARK-19044 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Josh Rosen >Priority: Minor > > In PySpark, the following fails with an AnalysisException: > {code} > v1 = spark.range(10) > v2 = v1.crossJoin(v1) > v2.dropna() > {code} > {code} > AnalysisException: u"Reference 'id' is ambiguous, could be: id#66L, id#69L.;" > {code} > However, the equivalent Scala code works fine: > {code} > val v1 = spark.range(10) > val v2 = v1.crossJoin(v1) > v1.na.drop() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
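A hedged workaround sketch, not taken from the ticket: renaming one side before the self cross join removes the ambiguity, so na.drop (or dropna() in PySpark) has distinct column names to work with. The id_right name is arbitrary:

{code}
// Sketch: disambiguate the self cross join by renaming one column.
val v1 = spark.range(10)
val v2 = v1.crossJoin(v1.withColumnRenamed("id", "id_right"))
v2.na.drop().explain()   // no ambiguous 'id' reference
{code}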
[jira] [Commented] (SPARK-20803) KernelDensity.estimate in pyspark.mllib.stat.KernelDensity throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is n
[ https://issues.apache.org/jira/browse/SPARK-20803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030004#comment-16030004 ] Bettadapura Srinath Sharma commented on SPARK-20803: In Java, the (correct) result is: Code: KernelDensity kd = new KernelDensity().setSample(col3).setBandwidth(3.0); double[] densities = kd.estimate(samplePoints); [0.06854408498726733, 1.4028730306237974E-174, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] > KernelDensity.estimate in pyspark.mllib.stat.KernelDensity throws > net.razorvine.pickle.PickleException when input data is normally distributed > (no error when data is not normally distributed) > --- > > Key: SPARK-20803 > URL: https://issues.apache.org/jira/browse/SPARK-20803 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.1.1 > Environment: Linux version 4.4.14-smp > x86/fpu: Legacy x87 FPU detected. > using command line: > bash-4.3$ ./bin/spark-submit ~/work/python/Features.py > bash-4.3$ pwd > /home/bsrsharma/spark-2.1.1-bin-hadoop2.7 > export JAVA_HOME=/home/bsrsharma/jdk1.8.0_121 >Reporter: Bettadapura Srinath Sharma > > When data is NOT normally distributed (correct behavior): > This code: > vecRDD = sc.parallelize(colVec) > kd = KernelDensity() > kd.setSample(vecRDD) > kd.setBandwidth(3.0) > # Find density estimates for the given values > densities = kd.estimate(samplePoints) > produces: > 17/05/18 15:40:36 INFO SparkContext: Starting job: aggregate at > KernelDensity.scala:92 > 17/05/18 15:40:36 INFO DAGScheduler: Got job 21 (aggregate at > KernelDensity.scala:92) with 1 output partitions > 17/05/18 15:40:36 INFO DAGScheduler: Final stage: ResultStage 24 (aggregate > at KernelDensity.scala:92) > 17/05/18 15:40:36 INFO DAGScheduler: Parents of final stage: List() > 17/05/18 15:40:36 INFO DAGScheduler: Missing parents: List() > 17/05/18 15:40:36 INFO DAGScheduler: Submitting ResultStage 24 > (MapPartitionsRDD[44] at mapPartitions at PythonMLLibAPI.scala:1345), which > has no missing parents > 17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25 stored as values in > memory (estimated size 6.6 KB, free 413.6 MB) > 17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes > in memory (estimated size 3.6 KB, free 413.6 MB) > 17/05/18 15:40:36 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory > on 192.168.0.115:38645 (size: 3.6 KB, free: 413.9 MB) > 17/05/18 15:40:36 INFO SparkContext: Created broadcast 25 from broadcast at > DAGScheduler.scala:996 > 17/05/18 15:40:36 INFO DAGScheduler: Submitting 1 missing tasks from > ResultStage 24 (MapPartitionsRDD[44] at mapPartitions at > PythonMLLibAPI.scala:1345) > 17/05/18 15:40:36 INFO TaskSchedulerImpl: Adding task set 24.0 with 1 tasks > 17/05/18 15:40:36 INFO TaskSetManager: Starting task 0.0 in stage 24.0 (TID > 24, localhost, executor driver, partition 0, PROCESS_LOCAL, 96186 bytes) > 17/05/18 15:40:36 INFO Executor: Running task 0.0 in stage 24.0 (TID 24) > 17/05/18 15:40:37 INFO PythonRunner: Times: total = 66, boot = -1831, init = > 1844, finish = 53 > 17/05/18 
15:40:37 INFO Executor: Finished task 0.0 in stage 24.0 (TID 24). > 2476 bytes result sent to driver > 17/05/18 15:40:37 INFO DAGScheduler: ResultStage 24 (aggregate at > KernelDensity.scala:92) finished in 1.001 s > 17/05/18 15:40:37 INFO TaskSetManager: Finished task 0.0 in stage 24.0 (TID > 24) in 1004 ms on localhost (executor driver) (1/1) > 17/05/18 15:40:37 INFO TaskSchedulerImpl: Removed TaskSet 24.0, whose tasks > have all completed, from pool > 17/05/18 15:40:37 INFO DAGScheduler: Job 21 finished: aggregate at > KernelDensity.scala:92, took 1.136263 s > 17/05/18 15:40:37 INFO BlockManagerInfo: Removed broadcast_25_piece0 on > 192.168.0.115:38645 in memory (size: 3.6 KB, free: 413.9 MB) > 5.6654703477e-05,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001 > ,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.0001000
[jira] [Created] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming
Jacek Laskowski created SPARK-20927: --- Summary: Add cache operator to Unsupported Operations in Structured Streaming Key: SPARK-20927 URL: https://issues.apache.org/jira/browse/SPARK-20927 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Trivial Just [found out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries] that {{cache}} is not allowed on streaming datasets. {{cache}} on streaming datasets leads to the following exception: {code} scala> spark.readStream.text("files").cache org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();; FileSource[files] at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34) at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63) at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74) at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78) at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84) at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89) at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104) at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68) at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92) at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603) at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613) ... 48 elided {code} It should be included in Structured Streaming's [Unsupported Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures
[ https://issues.apache.org/jira/browse/SPARK-20926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029996#comment-16029996 ] Reza Safi commented on SPARK-20926: --- I will post a pull request for this issue soon, by tonight at the latest. > Exposure to Guava libraries by directly accessing tableRelationCache in > SessionCatalog caused failures > -- > > Key: SPARK-20926 > URL: https://issues.apache.org/jira/browse/SPARK-20926 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Reza Safi > > Because of shading that we did for guava libraries, we see test failures > whenever those components directly access tableRelationCache in > SessionCatalog. > This can happen in any component that shaded guava library. Failures looks > like this: > {noformat} > java.lang.NoSuchMethodError: > org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache; > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492) > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138) > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32) > 01:25:14 at > org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213) > 01:25:14 at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278) > 01:25:14 at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
[ https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeffrey Quinn updated SPARK-20925: -- Description: Observed under the following conditions: Spark Version: Spark 2.1.0 Hadoop Version: Amazon 2.7.3 (emr-5.5.0) spark.submit.deployMode = client spark.master = yarn spark.driver.memory = 10g spark.shuffle.service.enabled = true spark.dynamicAllocation.enabled = true The job we are running is very simple: Our workflow reads data from a JSON format stored on S3, and write out partitioned parquet files to HDFS. As a one-liner, the whole workflow looks like this: ``` sparkSession.sqlContext .read .schema(inputSchema) .json(expandedInputPath) .select(columnMap:_*) .write.partitionBy("partition_by_column") .parquet(outputPath) ``` Unfortunately, for larger inputs, this job consistently fails with containers running out of memory. We observed containers of up to 20GB OOMing, which is surprising because the input data itself is only 15 GB compressed and maybe 100GB uncompressed. The error message we get indicates yarn is killing the containers. The executors are running out of memory and not the driver. ```Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 184 in stage 74.0 failed 4 times, most recent failure: Lost task 184.3 in stage 74.0 (TID 19110, ip-10-242-15-251.ec2.internal, executor 14): ExecutorLostFailure (executor 14 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 21.5 GB of 20.9 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.``` We tried a full parameter sweep, including using dynamic allocation and setting executor memory as high as 20GB. The result was the same each time, with the job failing due to lost executors due to YARN killing containers. We were able to bisect that `partitionBy` is the problem by progressively removing/commenting out parts of our workflow. Finally when we get to the above state, if we remove `partitionBy` the job succeeds with no OOM. was: Observed under the following conditions: Spark Version: Spark 2.1.0 Hadoop Version: Amazon 2.7.3 (emr-5.5.0) spark.submit.deployMode = client spark.master = yarn spark.driver.memory = 10g spark.shuffle.service.enabled = true spark.dynamicAllocation.enabled = true The job we are running is very simple: Our workflow reads data from a JSON format stored on S3, and write out partitioned parquet files to HDFS. As a one-liner, the whole workflow looks like this: ``` sparkSession.sqlContext .read .schema(inputSchema) .json(expandedInputPath) .select(columnMap:_*) .write.partitionBy("partition_by_column") .parquet(outputPath) ``` Unfortunately, for larger inputs, this job consistently fails with containers running out of memory. We observed containers of up to 20GB OOMing, which is surprising because the input data itself is only 15 GB compressed and maybe 100GB uncompressed. We were able to bisect that `partitionBy` is the problem by progressively removing/commenting out parts of our workflow. Finally when we get to the above state, if we remove `partitionBy` the job succeeds with no OOM. 
> Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy > -- > > Key: SPARK-20925 > URL: https://issues.apache.org/jira/browse/SPARK-20925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jeffrey Quinn > > Observed under the following conditions: > Spark Version: Spark 2.1.0 > Hadoop Version: Amazon 2.7.3 (emr-5.5.0) > spark.submit.deployMode = client > spark.master = yarn > spark.driver.memory = 10g > spark.shuffle.service.enabled = true > spark.dynamicAllocation.enabled = true > The job we are running is very simple: Our workflow reads data from a JSON > format stored on S3, and write out partitioned parquet files to HDFS. > As a one-liner, the whole workflow looks like this: > ``` > sparkSession.sqlContext > .read > .schema(inputSchema) > .json(expandedInputPath) > .select(columnMap:_*) > .write.partitionBy("partition_by_column") > .parquet(outputPath) > ``` > Unfortunately, for larger inputs, this job consistently fails with containers > running out of memory. We observed containers of up to 20GB OOMing, which is > surprising because the input data itself is only 15 GB compressed and maybe > 100GB uncompressed. > The error message we get indicates yarn is killing the containers. The > executors are running out of memory and not the driver. > ```Caused by: org.apache.spark.SparkException: Job aborted due to stage > failure: Task 184 in st
[jira] [Commented] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
[ https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029993#comment-16029993 ] Jeffrey Quinn commented on SPARK-20925: --- Hi Sean, Sorry for not providing adequate information. Thank you for responding so quickly. The error message we get indicates yarn is killing the containers. The executors are running out of memory and not the driver. ```Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 184 in stage 74.0 failed 4 times, most recent failure: Lost task 184.3 in stage 74.0 (TID 19110, ip-10-242-15-251.ec2.internal, executor 14): ExecutorLostFailure (executor 14 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 21.5 GB of 20.9 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.``` We tried a full parameter sweep, including using dynamic allocation and setting executor memory as high as 20GB. The result was the same each time, with the job failing due to lost executors due to YARN killing containers. I attempted to trace through the source code to see how `partitionBy` is implemented, I was surprised to see it cause this problem since it seems like it should not require a shuffle. Unfortunately I am not experienced enough with the DataFrame source code to figure out what is going on. For now we are reimplementing the partitioning ourselves as a work around, but very curious to know what could have been happening here. My next step was going to be to obtain a full heap dump and poke around in it with my profiler, does that sound like a reasonable approach? Thanks! Jeff > Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy > -- > > Key: SPARK-20925 > URL: https://issues.apache.org/jira/browse/SPARK-20925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jeffrey Quinn > > Observed under the following conditions: > Spark Version: Spark 2.1.0 > Hadoop Version: Amazon 2.7.3 (emr-5.5.0) > spark.submit.deployMode = client > spark.master = yarn > spark.driver.memory = 10g > spark.shuffle.service.enabled = true > spark.dynamicAllocation.enabled = true > The job we are running is very simple: Our workflow reads data from a JSON > format stored on S3, and write out partitioned parquet files to HDFS. > As a one-liner, the whole workflow looks like this: > ``` > sparkSession.sqlContext > .read > .schema(inputSchema) > .json(expandedInputPath) > .select(columnMap:_*) > .write.partitionBy("partition_by_column") > .parquet(outputPath) > ``` > Unfortunately, for larger inputs, this job consistently fails with containers > running out of memory. We observed containers of up to 20GB OOMing, which is > surprising because the input data itself is only 15 GB compressed and maybe > 100GB uncompressed. > We were able to bisect that `partitionBy` is the problem by progressively > removing/commenting out parts of our workflow. Finally when we get to the > above state, if we remove `partitionBy` the job succeeds with no OOM. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
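One mitigation that is often suggested for this write pattern, offered here only as a hedged sketch and not something verified in this ticket: repartition by the partition column before the write, so each task buffers writers for far fewer partition directories at once.

{code}
// Hedged sketch: cluster rows by the partition column before partitionBy,
// reducing the number of simultaneously open Parquet writers per task.
import org.apache.spark.sql.functions.col

sparkSession.sqlContext
  .read
  .schema(inputSchema)
  .json(expandedInputPath)
  .select(columnMap: _*)
  .repartition(col("partition_by_column"))
  .write
  .partitionBy("partition_by_column")
  .parquet(outputPath)
{code}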
[jira] [Created] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures
Reza Safi created SPARK-20926: - Summary: Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures Key: SPARK-20926 URL: https://issues.apache.org/jira/browse/SPARK-20926 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Reza Safi Because of shading that we did for guava libraries, we see test failures whenever those components directly access tableRelationCache in SessionCatalog. This can happen in any component that shaded guava library. Failures looks like this: {noformat} java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache; 01:25:14 at org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492) 01:25:14 at org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138) 01:25:14 at org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32) 01:25:14 at org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34) 01:25:14 at org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213) 01:25:14 at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31) 01:25:14 at org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280) 01:25:14 at org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278) 01:25:14 at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377) 01:25:14 at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278) {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
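One possible shape for a fix, sketched here purely as an illustration (the eventual pull request may do something different): keep the Guava cache private and expose narrow accessors, so downstream code that shades Guava never sees com.google.common types in a method signature. The class and key names below are placeholders:

{code}
// Sketch: hide the Guava-typed cache behind plain accessor methods.
import com.google.common.cache.{Cache, CacheBuilder}

case class TableKey(database: String, table: String)   // placeholder key type

class RelationCacheFacade[V <: AnyRef](maxSize: Long) {
  private val cache: Cache[TableKey, V] =
    CacheBuilder.newBuilder().maximumSize(maxSize).build[TableKey, V]()

  def get(key: TableKey): Option[V]      = Option(cache.getIfPresent(key))
  def put(key: TableKey, value: V): Unit = cache.put(key, value)
  def invalidate(key: TableKey): Unit    = cache.invalidate(key)
  def invalidateAll(): Unit              = cache.invalidateAll()
}
{code}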
[jira] [Commented] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
[ https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029984#comment-16029984 ] Sean Owen commented on SPARK-20925: --- Not enough info here -- is the JVM running out of memory? is YARN killing it? is the driver or executor running out of memory? All of those are typically matters of setting memory config properly, and not a Spark issue, so I am not sure this stands as a JIRA. > Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy > -- > > Key: SPARK-20925 > URL: https://issues.apache.org/jira/browse/SPARK-20925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jeffrey Quinn > > Observed under the following conditions: > Spark Version: Spark 2.1.0 > Hadoop Version: Amazon 2.7.3 (emr-5.5.0) > spark.submit.deployMode = client > spark.master = yarn > spark.driver.memory = 10g > spark.shuffle.service.enabled = true > spark.dynamicAllocation.enabled = true > The job we are running is very simple: Our workflow reads data from a JSON > format stored on S3, and write out partitioned parquet files to HDFS. > As a one-liner, the whole workflow looks like this: > ``` > sparkSession.sqlContext > .read > .schema(inputSchema) > .json(expandedInputPath) > .select(columnMap:_*) > .write.partitionBy("partition_by_column") > .parquet(outputPath) > ``` > Unfortunately, for larger inputs, this job consistently fails with containers > running out of memory. We observed containers of up to 20GB OOMing, which is > surprising because the input data itself is only 15 GB compressed and maybe > 100GB uncompressed. > We were able to bisect that `partitionBy` is the problem by progressively > removing/commenting out parts of our workflow. Finally when we get to the > above state, if we remove `partitionBy` the job succeeds with no OOM. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
[ https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeffrey Quinn updated SPARK-20925: -- Description: Observed under the following conditions: Spark Version: Spark 2.1.0 Hadoop Version: Amazon 2.7.3 (emr-5.5.0) spark.submit.deployMode = client spark.master = yarn spark.driver.memory = 10g spark.shuffle.service.enabled = true spark.dynamicAllocation.enabled = true The job we are running is very simple: Our workflow reads data from a JSON format stored on S3, and write out partitioned parquet files to HDFS. As a one-liner, the whole workflow looks like this: ``` sparkSession.sqlContext .read .schema(inputSchema) .json(expandedInputPath) .select(columnMap:_*) .write.partitionBy("partition_by_column") .parquet(outputPath) ``` Unfortunately, for larger inputs, this job consistently fails with containers running out of memory. We observed containers of up to 20GB OOMing, which is surprising because the input data itself is only 15 GB compressed and maybe 100GB uncompressed. We were able to bisect that `partitionBy` is the problem by progressively removing/commenting out parts of our workflow. Finally when we get to the above state, if we remove `partitionBy` the job succeeds with no OOM. was: Observed under the following conditions: Spark Version: Spark 2.1.0 Hadoop Version: Amazon 2.7.3 (emr-5.5.0) spark.submit.deployMode = client spark.master = yarn spark.driver.memory = 10g spark.shuffle.service.enabled = true spark.dynamicAllocation.enabled = true The job we are running is very simple: Our workflow reads data from a JSON format stored on S3, and write out partitioned parquet files to HDFS. As a one-liner, the whole workflow looks like this: ``` sparkSession.sqlContext .read .schema(inputSchema) .json(expandedInputPath) .select(columnMap:_*) .write.partitionBy("partition_by_column") .parquet(outputPath) ``` Unfortunately, for larger inputs, this job consistently fails with containers running out of memory. We observed containers of up to 20GB OOMing, which is surprising because the input data itself is only 15 GB compressed and maybe 100GB uncompressed. > Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy > -- > > Key: SPARK-20925 > URL: https://issues.apache.org/jira/browse/SPARK-20925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jeffrey Quinn > > Observed under the following conditions: > Spark Version: Spark 2.1.0 > Hadoop Version: Amazon 2.7.3 (emr-5.5.0) > spark.submit.deployMode = client > spark.master = yarn > spark.driver.memory = 10g > spark.shuffle.service.enabled = true > spark.dynamicAllocation.enabled = true > The job we are running is very simple: Our workflow reads data from a JSON > format stored on S3, and write out partitioned parquet files to HDFS. > As a one-liner, the whole workflow looks like this: > ``` > sparkSession.sqlContext > .read > .schema(inputSchema) > .json(expandedInputPath) > .select(columnMap:_*) > .write.partitionBy("partition_by_column") > .parquet(outputPath) > ``` > Unfortunately, for larger inputs, this job consistently fails with containers > running out of memory. We observed containers of up to 20GB OOMing, which is > surprising because the input data itself is only 15 GB compressed and maybe > 100GB uncompressed. > We were able to bisect that `partitionBy` is the problem by progressively > removing/commenting out parts of our workflow. 
Finally when we get to the > above state, if we remove `partitionBy` the job succeeds with no OOM. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
Jeffrey Quinn created SPARK-20925: - Summary: Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy Key: SPARK-20925 URL: https://issues.apache.org/jira/browse/SPARK-20925 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Jeffrey Quinn Observed under the following conditions: Spark Version: Spark 2.1.0 Hadoop Version: Amazon 2.7.3 (emr-5.5.0) spark.submit.deployMode = client spark.master = yarn spark.driver.memory = 10g spark.shuffle.service.enabled = true spark.dynamicAllocation.enabled = true The job we are running is very simple: Our workflow reads data from a JSON format stored on S3, and write out partitioned parquet files to HDFS. As a one-liner, the whole workflow looks like this: ``` sparkSession.sqlContext .read .schema(inputSchema) .json(expandedInputPath) .select(columnMap:_*) .write.partitionBy("partition_by_column") .parquet(outputPath) ``` Unfortunately, for larger inputs, this job consistently fails with containers running out of memory. We observed containers of up to 20GB OOMing, which is surprising because the input data itself is only 15 GB compressed and maybe 100GB uncompressed. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029962#comment-16029962 ] Josh Rosen commented on SPARK-20923: It doesn't seem to be used, as far as I can tell from a quick skim. The best way to confirm would probably be to start removing it, deleting things which depend on this as you go (e.g. the TaskMetrics getter method for accessing the current value) and see if you run into anything which looks like a non-test use. I'll be happy to review a patch to clean this up. > TaskMetrics._updatedBlockStatuses uses a lot of memory > -- > > Key: SPARK-20923 > URL: https://issues.apache.org/jira/browse/SPARK-20923 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > The driver appears to use a ton of memory in certain cases to store the task > metrics updated block status'. For instance I had a user reading data form > hive and caching it. The # of tasks to read was around 62,000, they were > using 1000 executors and it ended up caching a couple TB's of data. The > driver kept running out of memory. > I investigated and it looks like there was 5GB of a 10GB heap being used up > by the TaskMetrics._updatedBlockStatuses because there are a lot of blocks. > The updatedBlockStatuses was already removed from the task end event under > SPARK-20084. I don't see anything else that seems to be using this. Anybody > know if I missed something? > If its not being used we should remove it, otherwise we need to figure out a > better way of doing it so it doesn't use so much memory. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite
[ https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-20333: Assignee: jin xing > Fix HashPartitioner in DAGSchedulerSuite > > > Key: SPARK-20333 > URL: https://issues.apache.org/jira/browse/SPARK-20333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: jin xing >Priority: Minor > Fix For: 2.3.0 > > > In test > "don't submit stage until its dependencies map outputs are registered > (SPARK-5259)" > "run trivial shuffle with out-of-band executor failure and retry" > "reduce tasks should be placed locally with map output" > HashPartitioner should be compatible with num of child RDD's partitions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20802) kolmogorovSmirnovTest in pyspark.mllib.stat.Statistics throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not n
[ https://issues.apache.org/jira/browse/SPARK-20802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029950#comment-16029950 ] Bettadapura Srinath Sharma commented on SPARK-20802: In Java, (Correct behavior) code: KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(col1, "norm", mean[1], stdDev[1]); produces: Kolmogorov-Smirnov test summary: degrees of freedom = 0 statistic = 0.005983051038968901 pValue = 0.8643736171652615 No presumption against null hypothesis: Sample follows theoretical distribution. > kolmogorovSmirnovTest in pyspark.mllib.stat.Statistics throws > net.razorvine.pickle.PickleException when input data is normally distributed > (no error when data is not normally distributed) > --- > > Key: SPARK-20802 > URL: https://issues.apache.org/jira/browse/SPARK-20802 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.1.1 > Environment: Linux version 4.4.14-smp > x86/fpu: Legacy x87 FPU detected. > using command line: > bash-4.3$ ./bin/spark-submit ~/work/python/Features.py > bash-4.3$ pwd > /home/bsrsharma/spark-2.1.1-bin-hadoop2.7 > export JAVA_HOME=/home/bsrsharma/jdk1.8.0_121 >Reporter: Bettadapura Srinath Sharma > > In Scala,(correct behavior) > code: > testResult = Statistics.kolmogorovSmirnovTest(vecRDD, "norm", means(j), > stdDev(j)) > produces: > 17/05/18 10:52:53 INFO FeatureLogger: Kolmogorov-Smirnov test summary: > degrees of freedom = 0 > statistic = 0.005495681749849268 > pValue = 0.9216108887428276 > No presumption against null hypothesis: Sample follows theoretical > distribution. > in python (incorrect behavior): > the code: > testResult = Statistics.kolmogorovSmirnovTest(vecRDD, 'norm', numericMean[j], > numericSD[j]) > causes this error: > 17/05/17 21:59:23 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14) > net.razorvine.pickle.PickleException: expected zero arguments for > construction of ClassDict (for numpy.dtype) > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite
[ https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-20333. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 17634 [https://github.com/apache/spark/pull/17634] > Fix HashPartitioner in DAGSchedulerSuite > > > Key: SPARK-20333 > URL: https://issues.apache.org/jira/browse/SPARK-20333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing >Priority: Minor > Fix For: 2.3.0 > > > In test > "don't submit stage until its dependencies map outputs are registered > (SPARK-5259)" > "run trivial shuffle with out-of-band executor failure and retry" > "reduce tasks should be placed locally with map output" > HashPartitioner should be compatible with num of child RDD's partitions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029917#comment-16029917 ] Thomas Graves commented on SPARK-20923: --- [~joshrosen] [~zsxwing] [~eseyfe] I think you have looked at this fairly recently, do you know if this is used by anything or anybody? I'm not finding it used anywhere in the code or UI but maybe I'm missing some obscure reference > TaskMetrics._updatedBlockStatuses uses a lot of memory > -- > > Key: SPARK-20923 > URL: https://issues.apache.org/jira/browse/SPARK-20923 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > The driver appears to use a ton of memory in certain cases to store the task > metrics updated block status'. For instance I had a user reading data form > hive and caching it. The # of tasks to read was around 62,000, they were > using 1000 executors and it ended up caching a couple TB's of data. The > driver kept running out of memory. > I investigated and it looks like there was 5GB of a 10GB heap being used up > by the TaskMetrics._updatedBlockStatuses because there are a lot of blocks. > The updatedBlockStatuses was already removed from the task end event under > SPARK-20084. I don't see anything else that seems to be using this. Anybody > know if I missed something? > If its not being used we should remove it, otherwise we need to figure out a > better way of doing it so it doesn't use so much memory. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15905) Driver hung while writing to console progress bar
[ https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil closed SPARK-15905. --- Resolution: Cannot Reproduce > Driver hung while writing to console progress bar > - > > Key: SPARK-15905 > URL: https://issues.apache.org/jira/browse/SPARK-15905 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Tejas Patil >Priority: Minor > > This leads to driver being not able to get heartbeats from its executors and > job being stuck. After looking at the locking dependency amongst the driver > threads per the jstack, this is where the driver seems to be stuck. > {noformat} > "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 > nid=0x7887d runnable [0x7f6d3507a000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:326) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream) > at java.io.PrintStream.write(PrintStream.java:482) >- locked <0x7f6eb81dd258> (a java.io.PrintStream) > at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) > at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) > at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104) > - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter) > at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185) > at java.io.PrintStream.write(PrintStream.java:527) > - locked <0x7f6eb81dd258> (a java.io.PrintStream) > at java.io.PrintStream.print(PrintStream.java:669) > at > org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69) > - locked <0x7f6ed33b48a0> (a > org.apache.spark.ui.ConsoleProgressBar) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029847#comment-16029847 ] Ryan Blue commented on SPARK-20923: --- I didn't look at the code path up to writing history files. I just confirmed that nothing based on the history file actually used them. Sounds like if we can stop tracking this entirely, that would be great! > TaskMetrics._updatedBlockStatuses uses a lot of memory > -- > > Key: SPARK-20923 > URL: https://issues.apache.org/jira/browse/SPARK-20923 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > The driver appears to use a ton of memory in certain cases to store the task > metrics updated block status'. For instance I had a user reading data form > hive and caching it. The # of tasks to read was around 62,000, they were > using 1000 executors and it ended up caching a couple TB's of data. The > driver kept running out of memory. > I investigated and it looks like there was 5GB of a 10GB heap being used up > by the TaskMetrics._updatedBlockStatuses because there are a lot of blocks. > The updatedBlockStatuses was already removed from the task end event under > SPARK-20084. I don't see anything else that seems to be using this. Anybody > know if I missed something? > If its not being used we should remove it, otherwise we need to figure out a > better way of doing it so it doesn't use so much memory. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar
[ https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029846#comment-16029846 ] Tejas Patil commented on SPARK-15905: - I haven't seen this in a while with Spark 2.0. Closing. If anyone is still experiencing this issue, please comment with details. > Driver hung while writing to console progress bar > - > > Key: SPARK-15905 > URL: https://issues.apache.org/jira/browse/SPARK-15905 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Tejas Patil >Priority: Minor > > This leads to driver being not able to get heartbeats from its executors and > job being stuck. After looking at the locking dependency amongst the driver > threads per the jstack, this is where the driver seems to be stuck. > {noformat} > "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 > nid=0x7887d runnable [0x7f6d3507a000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:326) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream) > at java.io.PrintStream.write(PrintStream.java:482) >- locked <0x7f6eb81dd258> (a java.io.PrintStream) > at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) > at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) > at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104) > - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter) > at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185) > at java.io.PrintStream.write(PrintStream.java:527) > - locked <0x7f6eb81dd258> (a java.io.PrintStream) > at java.io.PrintStream.print(PrintStream.java:669) > at > org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69) > - locked <0x7f6ed33b48a0> (a > org.apache.spark.ui.ConsoleProgressBar) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
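For anyone who does still see this, a heavily hedged mitigation that is not discussed in the ticket itself: the console progress bar can simply be disabled, which removes the console writes the thread above is blocked on.

{code}
// Sketch: turn off the console progress bar entirely.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.showConsoleProgress", "false")
{code}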
[jira] [Updated] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic
[ https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20597: - Labels: starter (was: ) > KafkaSourceProvider falls back on path as synonym for topic > --- > > Key: SPARK-20597 > URL: https://issues.apache.org/jira/browse/SPARK-20597 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Labels: starter > > # {{KafkaSourceProvider}} supports {{topic}} option that sets the Kafka topic > to save a DataFrame's rows to > # {{KafkaSourceProvider}} can use {{topic}} column to assign rows to Kafka > topics for writing > What seems a quite interesting option is to support {{start(path: String)}} > as the least precedence option in which {{path}} would designate the default > topic when no other options are used. > {code} > df.writeStream.format("kafka").start("topic") > {code} > See > http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html > for discussion -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
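For contrast with the proposed start(path) fallback, a sketch of what the Kafka sink accepts today; df is assumed to be a streaming Dataset with a value column, and the checkpoint path is illustrative:

{code}
// Sketch: current Kafka sink usage, with the topic given explicitly.
df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "topic1")
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")  // illustrative path
  .start()
{code}

Under the proposal, omitting both the topic option and a topic column would let start("topic1") supply the default.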
[jira] [Updated] (SPARK-20599) ConsoleSink should work with write (batch)
[ https://issues.apache.org/jira/browse/SPARK-20599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20599: - Labels: starter (was: ) > ConsoleSink should work with write (batch) > -- > > Key: SPARK-20599 > URL: https://issues.apache.org/jira/browse/SPARK-20599 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > Labels: starter > > I think the following should just work. > {code} > spark. > read. // <-- it's a batch query not streaming query if that matters > format("kafka"). > option("subscribe", "topic1"). > option("kafka.bootstrap.servers", "localhost:9092"). > load. > write. > format("console"). // <-- that's not supported currently > save > {code} > The above combination of {{kafka}} source and {{console}} sink leads to the > following exception: > {code} > java.lang.RuntimeException: > org.apache.spark.sql.execution.streaming.ConsoleSinkProvider does not allow > create table as select. > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:479) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:93) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:93) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
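Until something like this is supported, a batch query can get roughly the same effect with show(); a sketch using the same Kafka options as the example above:

{code}
// Sketch: print a batch Kafka read to the console without the console sink.
spark.read
  .format("kafka")
  .option("subscribe", "topic1")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .load()
  .show(20, truncate = false)
{code}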
[jira] [Updated] (SPARK-20919) Simplificaiton of CachedKafkaConsumer using guava cache.
[ https://issues.apache.org/jira/browse/SPARK-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20919: - Affects Version/s: (was: 2.3.0) 2.2.0 Target Version/s: 2.3.0 > Simplificaiton of CachedKafkaConsumer using guava cache. > > > Key: SPARK-20919 > URL: https://issues.apache.org/jira/browse/SPARK-20919 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Prashant Sharma > > On the lines of SPARK-19968, guava cache can be used to simplify the code in > CachedKafkaConsumer as well. With an additional feature of automatic cleanup > of a consumer unused for a configurable time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20919) Simplification of CachedKafkaConsumer using Guava cache.
[ https://issues.apache.org/jira/browse/SPARK-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20919: - Issue Type: Improvement (was: Bug) > Simplification of CachedKafkaConsumer using Guava cache. > > > Key: SPARK-20919 > URL: https://issues.apache.org/jira/browse/SPARK-20919 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Prashant Sharma > > Along the lines of SPARK-19968, a Guava cache can be used to simplify the code in > CachedKafkaConsumer as well, with the added benefit of automatically cleaning up > consumers that have been unused for a configurable time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
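A minimal sketch of the Guava-cache idea, assuming Guava's {{CacheBuilder}} and the Kafka consumer API are on the classpath; {{CacheKey}}, {{expiryMinutes}} and {{createConsumer}} are hypothetical placeholders rather than Spark's actual {{CachedKafkaConsumer}} code:
{code}
// Hedged sketch only: caching Kafka consumers in a Guava LoadingCache so that entries
// unused for a configurable time are evicted and closed. CacheKey, expiryMinutes and
// createConsumer are hypothetical placeholders, not Spark's actual implementation.
import java.{util => ju}
import java.util.concurrent.TimeUnit

import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache, RemovalListener, RemovalNotification}
import org.apache.kafka.clients.consumer.KafkaConsumer

object CachedConsumers {
  type Consumer = KafkaConsumer[Array[Byte], Array[Byte]]
  final case class CacheKey(groupId: String, topic: String, partition: Int)

  private val expiryMinutes = 5L  // how long an idle consumer may stay cached

  private def createConsumer(key: CacheKey): Consumer = {
    val props = new ju.Properties()
    props.put("bootstrap.servers", "localhost:9092")  // placeholder config
    props.put("group.id", key.groupId)
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    new KafkaConsumer[Array[Byte], Array[Byte]](props)
  }

  // Close the consumer whenever the cache evicts it (expiry or explicit invalidation).
  private val onRemoval = new RemovalListener[CacheKey, Consumer] {
    override def onRemoval(n: RemovalNotification[CacheKey, Consumer]): Unit = n.getValue.close()
  }

  private val loader = new CacheLoader[CacheKey, Consumer] {
    override def load(key: CacheKey): Consumer = createConsumer(key)
  }

  private val cache: LoadingCache[CacheKey, Consumer] =
    CacheBuilder.newBuilder()
      .expireAfterAccess(expiryMinutes, TimeUnit.MINUTES)   // automatic cleanup of idle consumers
      .removalListener[CacheKey, Consumer](onRemoval)       // explicit type args help Scala's inference
      .build[CacheKey, Consumer](loader)

  def acquire(key: CacheKey): Consumer = cache.get(key)     // creates the consumer on first use
}
{code}
Note that {{expireAfterAccess}} only evicts lazily during cache reads and writes, so a real implementation would likely also call {{cache.cleanUp()}} periodically, or run a small maintenance thread, so idle consumers are closed promptly even when the cache goes quiet.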
[jira] [Assigned] (SPARK-20924) Unable to call the function registered in the not-current database
[ https://issues.apache.org/jira/browse/SPARK-20924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20924: Assignee: Apache Spark (was: Xiao Li) > Unable to call the function registered in the not-current database > -- > > Key: SPARK-20924 > URL: https://issues.apache.org/jira/browse/SPARK-20924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Critical > > We are unable to call the function registered in the not-current database. > {noformat} > sql("CREATE DATABASE dAtABaSe1") > sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS > '${classOf[GenericUDAFAverage].getName}'") > sql("SELECT dAtABaSe1.test_avg(1)") > {noformat} > The above code returns an error: > {noformat} > Undefined function: 'dAtABaSe1.test_avg'. This function is neither a > registered temporary function nor a permanent function registered in the > database 'default'.; line 1 pos 7 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20924) Unable to call the function registered in the not-current database
[ https://issues.apache.org/jira/browse/SPARK-20924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20924: Assignee: Xiao Li (was: Apache Spark) > Unable to call the function registered in the not-current database > -- > > Key: SPARK-20924 > URL: https://issues.apache.org/jira/browse/SPARK-20924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > > We are unable to call the function registered in the not-current database. > {noformat} > sql("CREATE DATABASE dAtABaSe1") > sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS > '${classOf[GenericUDAFAverage].getName}'") > sql("SELECT dAtABaSe1.test_avg(1)") > {noformat} > The above code returns an error: > {noformat} > Undefined function: 'dAtABaSe1.test_avg'. This function is neither a > registered temporary function nor a permanent function registered in the > database 'default'.; line 1 pos 7 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20924) Unable to call the function registered in the not-current database
[ https://issues.apache.org/jira/browse/SPARK-20924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029815#comment-16029815 ] Apache Spark commented on SPARK-20924: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/18146 > Unable to call the function registered in the not-current database > -- > > Key: SPARK-20924 > URL: https://issues.apache.org/jira/browse/SPARK-20924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > > We are unable to call the function registered in the not-current database. > {noformat} > sql("CREATE DATABASE dAtABaSe1") > sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS > '${classOf[GenericUDAFAverage].getName}'") > sql("SELECT dAtABaSe1.test_avg(1)") > {noformat} > The above code returns an error: > {noformat} > Undefined function: 'dAtABaSe1.test_avg'. This function is neither a > registered temporary function nor a permanent function registered in the > database 'default'.; line 1 pos 7 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20924) Unable to call the function registered in the not-current database
Xiao Li created SPARK-20924: --- Summary: Unable to call the function registered in the not-current database Key: SPARK-20924 URL: https://issues.apache.org/jira/browse/SPARK-20924 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1, 2.0.2, 2.2.0 Reporter: Xiao Li Assignee: Xiao Li Priority: Critical We are unable to call the function registered in the not-current database. {noformat} sql("CREATE DATABASE dAtABaSe1") sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS '${classOf[GenericUDAFAverage].getName}'") sql("SELECT dAtABaSe1.test_avg(1)") {noformat} The above code returns an error: {noformat} Undefined function: 'dAtABaSe1.test_avg'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7 {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
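Until the fix is in, a plausible workaround (hedged, inferred only from the error message above, which suggests function resolution consults just the current database) is to switch databases before calling the function unqualified:
{code}
// Hedged workaround sketch: make dAtABaSe1 the current database so the
// unqualified call should resolve against it.
spark.sql("USE dAtABaSe1")
spark.sql("SELECT test_avg(1)").show()
{code}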
[jira] [Commented] (SPARK-20832) Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs
[ https://issues.apache.org/jira/browse/SPARK-20832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029778#comment-16029778 ] Jiang Xingbo commented on SPARK-20832: -- I'm working on this. > Standalone master should explicitly inform drivers of worker deaths and > invalidate external shuffle service outputs > --- > > Key: SPARK-20832 > URL: https://issues.apache.org/jira/browse/SPARK-20832 > Project: Spark > Issue Type: Bug > Components: Deploy, Scheduler >Affects Versions: 2.0.0 >Reporter: Josh Rosen > > In SPARK-17370 (a patch authored by [~ekhliang] and reviewed by me), we added > logic to the DAGScheduler to mark external shuffle service instances as > unavailable upon task failure when the task failure reason was "SlaveLost" > and this was known to be caused by worker death. If the Spark Master > discovered that a worker was dead then it would notify any drivers with > executors on those workers to mark those executors as dead. The linked patch > simply piggybacked on this logic to have the executor death notification also > imply worker death and to have worker-death-caused-executor-death imply > shuffle file loss. > However, there are modes of external shuffle service loss which this > mechanism does not detect, leaving the system prone to race conditions. Consider > the following: > * Spark standalone is configured to run an external shuffle service embedded > in the Worker. > * Application has shuffle outputs and executors on Worker A. > * Stage depending on outputs of tasks that ran on Worker A starts. > * All executors on Worker A are removed due to dying with exceptions or > scaling down via the dynamic allocation APIs, but _not_ due to worker death. > Worker A is still healthy at this point. > * At this point the MapOutputTracker still records map output locations on > Worker A's shuffle service. This is expected behavior. > * Worker A dies at an instant where the application has no executors running > on it. > * The Master knows that Worker A died but does not inform the driver (which > had no executors on that worker at the time of its death). > * Some task from the running stage attempts to fetch map outputs from Worker > A but these requests time out because Worker A's shuffle service isn't > available. > * Due to other logic in the scheduler, these preventable FetchFailures don't > wind up invalidating the now-unavailable map output locations (this is > a distinct bug / behavior which I'll discuss in a separate JIRA ticket). > * This behavior leads to several unsuccessful stage reattempts and ultimately > to a job failure. > A simple way to address this would be to have the Master explicitly notify > drivers of all Worker deaths, even if those drivers don't currently have > executors. The Spark Standalone scheduler backend can receive the explicit > WorkerLost message and can bubble up the right calls to the task scheduler > and DAGScheduler to invalidate map output locations from the now-dead > external shuffle service. > This relates to SPARK-20115 in the sense that both tickets aim to address > issues where the external shuffle service is unavailable. The key difference > is the mechanism for detection: SPARK-20115 marks the external shuffle > service as unavailable whenever any fetch failure occurs from it, whereas the > proposal here relies on more explicit signals. This JIRA ticket's proposal is > scoped only to Spark Standalone mode. 
As a compromise, we might be able to > consider marking all of a single shuffle's outputs on a single external shuffle > service as lost following a fetch failure (to be discussed in a separate JIRA). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
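Purely as an illustration of the proposed notification flow (every name below, such as {{WorkerLost}}, {{MasterSketch}}, {{StandaloneBackendSketch}} and {{invalidateOutputsOnHost}}, is a hypothetical stand-in, not Spark's actual internals):
{code}
// Hypothetical sketch of the proposed notification path. None of these names come from
// Spark's codebase; they only illustrate "Master tells every driver about a worker death,
// driver invalidates map outputs served by that worker's shuffle service".
case class WorkerLost(workerId: String, host: String)

trait DriverRef {
  def send(msg: Any): Unit
}

class MasterSketch(registeredDrivers: Seq[DriverRef]) {
  // Notify every driver, even ones with no executors currently on the dead worker.
  def onWorkerDeath(workerId: String, host: String): Unit =
    registeredDrivers.foreach(_.send(WorkerLost(workerId, host)))
}

class StandaloneBackendSketch(invalidateOutputsOnHost: String => Unit) extends DriverRef {
  // Bubble the event up so map-output locations on the dead worker's external
  // shuffle service are removed before the next stage attempt.
  override def send(msg: Any): Unit = msg match {
    case WorkerLost(_, host) => invalidateOutputsOnHost(host)
    case _ => ()
  }
}
{code}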
[jira] [Updated] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula
[ https://issues.apache.org/jira/browse/SPARK-20899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-20899: Component/s: ML > PySpark supports stringIndexerOrderType in RFormula > --- > > Key: SPARK-20899 > URL: https://issues.apache.org/jira/browse/SPARK-20899 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.1.1 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula
[ https://issues.apache.org/jira/browse/SPARK-20899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-20899. - Resolution: Fixed Assignee: Wayne Zhang Fix Version/s: 2.3.0 > PySpark supports stringIndexerOrderType in RFormula > --- > > Key: SPARK-20899 > URL: https://issues.apache.org/jira/browse/SPARK-20899 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.1.1 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029701#comment-16029701 ] Thomas Graves commented on SPARK-20923: --- [~rdblue], with SPARK-20084, did you see anything using these updatedBlockStatuses in TaskMetrics? > TaskMetrics._updatedBlockStatuses uses a lot of memory > -- > > Key: SPARK-20923 > URL: https://issues.apache.org/jira/browse/SPARK-20923 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > The driver appears to use a ton of memory in certain cases to store the task > metrics' updated block statuses. For instance, I had a user reading data from > Hive and caching it. The # of tasks to read was around 62,000; they were > using 1000 executors and it ended up caching a couple of TBs of data. The > driver kept running out of memory. > I investigated and it looks like there was 5GB of a 10GB heap being used up > by the TaskMetrics._updatedBlockStatuses because there are a lot of blocks. > The updatedBlockStatuses was already removed from the task end event under > SPARK-20084. I don't see anything else that seems to be using this. Anybody > know if I missed something? > If it's not being used we should remove it; otherwise we need to figure out a > better way of doing it so it doesn't use so much memory. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory
Thomas Graves created SPARK-20923: - Summary: TaskMetrics._updatedBlockStatuses uses a lot of memory Key: SPARK-20923 URL: https://issues.apache.org/jira/browse/SPARK-20923 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Thomas Graves The driver appears to use a ton of memory in certain cases to store the task metrics' updated block statuses. For instance, I had a user reading data from Hive and caching it. The # of tasks to read was around 62,000; they were using 1000 executors and it ended up caching a couple of TBs of data. The driver kept running out of memory. I investigated and it looks like there was 5GB of a 10GB heap being used up by the TaskMetrics._updatedBlockStatuses because there are a lot of blocks. The updatedBlockStatuses was already removed from the task end event under SPARK-20084. I don't see anything else that seems to be using this. Anybody know if I missed something? If it's not being used we should remove it; otherwise we need to figure out a better way of doing it so it doesn't use so much memory. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
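For reference, the one obvious external consumer of this data would be a {{SparkListener}} reading the metrics at task end; below is a hedged sketch (the class name and output format are made up) of what such a consumer looks like, which can help check whether anything downstream actually needs the per-task list:
{code}
// Hedged sketch of a listener that reads updatedBlockStatuses off TaskMetrics at task end.
// If neither Spark internals nor listeners like this need the per-task list, dropping it
// (or trimming it) would directly save the driver memory described above.
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class UpdatedBlockStatusLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null when the task failed before producing metrics
    Option(taskEnd.taskMetrics).foreach { metrics =>
      metrics.updatedBlockStatuses.foreach { case (blockId, status) =>
        println(s"task ${taskEnd.taskInfo.taskId}: $blockId mem=${status.memSize} disk=${status.diskSize}")
      }
    }
  }
}

// Registered with: sc.addSparkListener(new UpdatedBlockStatusLogger)
{code}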