[jira] [Commented] (SPARK-24891) Fix HandleNullInputsForUDF rule
[ https://issues.apache.org/jira/browse/SPARK-24891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555111#comment-16555111 ]

Apache Spark commented on SPARK-24891:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/21869

> Fix HandleNullInputsForUDF rule
> -------------------------------
>
>                 Key: SPARK-24891
>                 URL: https://issues.apache.org/jira/browse/SPARK-24891
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Maryann Xue
>            Assignee: Maryann Xue
>            Priority: Major
>             Fix For: 2.3.2, 2.4.0
>
> The HandleNullInputsForUDF rule can generate new {{If}} nodes indefinitely,
> causing problems such as missed SQL cache matches.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23957) Sorts in subqueries are redundant and can be removed
[ https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-23957.
-----------------------------
       Resolution: Fixed
         Assignee: Henry Robinson
    Fix Version/s: 2.4.0

> Sorts in subqueries are redundant and can be removed
> ----------------------------------------------------
>
>                 Key: SPARK-23957
>                 URL: https://issues.apache.org/jira/browse/SPARK-23957
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Henry Robinson
>            Assignee: Henry Robinson
>            Priority: Major
>             Fix For: 2.4.0
>
> Unless combined with a {{LIMIT}}, there's no correctness reason that planned
> and optimized subqueries should have any sort operators (since the result of
> the subquery is an unordered collection of tuples).
> For example:
> {{SELECT count(1) FROM (select id FROM dft ORDER by id)}}
> has the following plan:
> {code:java}
> == Physical Plan ==
> *(3) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>    +- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
>       +- *(2) Project
>          +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
>             +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>                +- *(1) Project [id#0L]
>                   +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {code}
> ... but the sort operator is redundant.
> Less intuitively, the sort is also redundant in selections from an ordered
> subquery:
> {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}
> has plan:
> {code:java}
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>    +- *(1) Project [id#0L]
>       +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {code}
> ... but again, since the subquery returns a bag of tuples, the sort is
> unnecessary.
> We should consider adding an optimizer rule that removes a sort inside a
> subquery. SPARK-23375 is related, but removes sorts that are functionally
> redundant because they perform the same ordering.
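The redundancy argument above (a subquery's result is an unordered bag, so a Sort directly beneath it, with no intervening Limit, is unobservable from outside) can be sketched as a toy tree rewrite. This is an illustrative Python model with made-up node classes, not Catalyst code:

```python
# Toy model of the proposed rule (hypothetical node classes, not Catalyst):
# a Sort sitting directly under a Subquery, with no Limit in between,
# cannot be observed by the outer query, so it can be elided.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Plan:
    child: Optional["Plan"] = None

@dataclass
class Scan(Plan): ...

@dataclass
class Sort(Plan): ...

@dataclass
class Limit(Plan): ...

@dataclass
class Subquery(Plan): ...

def remove_redundant_sort(plan: Optional[Plan]) -> Optional[Plan]:
    """Recursively drop Sort nodes that sit immediately under a Subquery."""
    if plan is None:
        return None
    if isinstance(plan, Subquery) and isinstance(plan.child, Sort):
        # Skip the Sort node entirely; its ordering cannot be observed.
        return Subquery(remove_redundant_sort(plan.child.child))
    plan.child = remove_redundant_sort(plan.child)
    return plan
```

With this sketch, `Subquery(Sort(Scan()))` rewrites to `Subquery(Scan())`, while `Subquery(Limit(Sort(Scan())))` is left untouched, matching the {{LIMIT}} caveat in the description.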
[jira] [Resolved] (SPARK-24890) Short circuiting the `if` condition when `trueValue` and `falseValue` are the same
[ https://issues.apache.org/jira/browse/SPARK-24890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-24890.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

> Short circuiting the `if` condition when `trueValue` and `falseValue` are the same
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-24890
>                 URL: https://issues.apache.org/jira/browse/SPARK-24890
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>            Priority: Major
>             Fix For: 2.4.0
>
> When `trueValue` and `falseValue` are semantically equivalent, the condition
> expression in `if` can be removed to avoid extra computation at runtime.
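The simplification in SPARK-24890 above amounts to: when both branches of a conditional are (semantically) equal, the condition never needs to be evaluated. A minimal Python sketch, using plain `==` as a stand-in for Catalyst's semantic-equality check; the function name is hypothetical:

```python
# Hedged sketch of the idea (not Spark's actual optimizer rule): if both
# branches are equal, return one of them and drop the condition entirely.
def simplify_if(cond, true_value, false_value):
    if true_value == false_value:    # stand-in for semantic equivalence
        return true_value            # condition short-circuited away
    # Otherwise keep the conditional (represented here as a tuple).
    return ("if", cond, true_value, false_value)
```

For example, `simplify_if("a > 0", 1, 1)` reduces to `1` without ever touching the condition, while `simplify_if("a > 0", 1, 2)` keeps the conditional intact.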
[jira] [Updated] (SPARK-24914) totalSize is not a good estimate for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-24914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-24914:
----------------------------------
    Description: 
When determining whether to do a broadcast join, Spark estimates the size of the smaller table as follows:
 - if totalSize is defined and greater than 0, use it.
 - else, if rawDataSize is defined and greater than 0, use it
 - else, use spark.sql.defaultSizeInBytes (default: Long.MaxValue)

Therefore, Spark prefers totalSize over rawDataSize. Unfortunately, totalSize is often quite a bit smaller than the actual table size, since it represents the size of the table's files on disk. Parquet and Orc files, for example, are encoded and compressed. This can result in the JVM throwing an OutOfMemoryError while Spark is loading the table into a HashedRelation, or when Spark actually attempts to broadcast the data.

On the other hand, rawDataSize represents the uncompressed size of the dataset, according to Hive documentation. This seems like a pretty good number to use in preference to totalSize. However, due to HIVE-20079, this value is simply #columns * #rows. Once that bug is fixed, it may be a superior statistic, at least for managed tables.

In the meantime, we could apply a configurable "fudge factor" to totalSize, at least for types of files that are encoded and compressed. Hive has the setting hive.stats.deserialization.factor, which defaults to 1.0, and is described as follows:
{quote}in the absence of uncompressed/raw data size, total file size will be used for statistics annotation. But the file may be compressed, encoded and serialized which may be lesser in size than the actual uncompressed/raw data size. This factor will be multiplied to file size to estimate the raw data size.
{quote}
In addition to the fudge factor, we could compare the adjusted totalSize to rawDataSize and use the bigger of the two.

Caveat: This mitigates the issue only for Hive tables. It does not help much when the user is reading files using {{spark.read.parquet}}, unless we apply the same fudge factor there.

  was:
When determining whether to do a broadcast join, Spark estimates the size of the smaller table as follows:
 - if totalSize is defined and greater than 0, use it.
 - else, if rawDataSize is defined and greater than 0, use it
 - else, use spark.sql.defaultSizeInBytes (default: Long.MaxValue)

Therefore, Spark prefers totalSize over rawDataSize. Unfortunately, totalSize is often quite a bit smaller than the actual table size, since it represents the size of the table's files on disk. Parquet and Orc files, for example, are encoded and compressed. This can result in the JVM throwing an OutOfMemoryError while Spark is loading the table into a HashedRelation, or when Spark actually attempts to broadcast the data.

On the other hand, rawDataSize represents the uncompressed size of the dataset, according to Hive documentation. This seems like a pretty good number to use in preference to totalSize. However, due to HIVE-20079, this value is simply #columns * #rows. Once that bug is fixed, it may be a superior statistic, at least for managed tables.

In the meantime, we could apply a configurable "fudge factor" to totalSize, at least for types of files that are encoded and compressed. Hive has the setting hive.stats.deserialization.factor, which defaults to 1.0, and is described as follows:
{quote}in the absence of uncompressed/raw data size, total file size will be used for statistics annotation. But the file may be compressed, encoded and serialized which may be lesser in size than the actual uncompressed/raw data size. This factor will be multiplied to file size to estimate the raw data size.
{quote}
In addition to the fudge factor, we could compare the adjusted totalSize to rawDataSize and use the bigger of the two:
{noformat}
size1 = totalSize.isDefined ? totalSize * fudgeFactor : Long.MAX_VALUE
size2 = rawDataSize.isDefined ? rawDataSize : Long.MAX_VALUE
size = max(size1, size2)
{noformat}

Caveat: This mitigates the issue only for Hive tables. It does not help much when the user is reading files using {{spark.read.parquet}}, unless we apply the same fudge factor there.

> totalSize is not a good estimate for broadcast joins
> ----------------------------------------------------
>
>                 Key: SPARK-24914
>                 URL: https://issues.apache.org/jira/browse/SPARK-24914
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> When determining whether to do a broadcast join, Spark estimates the size of
> the smaller table as follows:
> - if totalSize is defined and greater than 0, use it.
> - else, if rawDataSize is defined and greater than 0, use it
> - else, use spark.sql.defaultSizeInBytes (default:
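The estimate proposed in SPARK-24914 above (scale totalSize by a fudge factor, take the larger of that and rawDataSize, and fall back to the default only when neither statistic is usable) can be sketched as follows. This is a Python toy model, not Spark code; the function name and defaults are assumptions:

```python
# Hypothetical sketch of the proposed size estimate. default_size_in_bytes
# plays the role of spark.sql.defaultSizeInBytes (Long.MaxValue by default).
def estimate_table_size(total_size, raw_data_size,
                        fudge_factor=1.0,
                        default_size_in_bytes=2**63 - 1):
    candidates = []
    if total_size is not None and total_size > 0:
        # totalSize is on-disk (possibly compressed) bytes; scale it up.
        candidates.append(int(total_size * fudge_factor))
    if raw_data_size is not None and raw_data_size > 0:
        candidates.append(raw_data_size)
    # Use the larger (more conservative) of the available estimates,
    # falling back to the default when neither statistic is defined.
    return max(candidates) if candidates else default_size_in_bytes
```

For example, a table with totalSize 100 MB and a fudge factor of 3 would be estimated at 300 MB, unless rawDataSize is defined and larger.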
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555080#comment-16555080 ]

Saisai Shao commented on SPARK-24615:
-------------------------------------

Hi [~tgraves] [~irashid], thanks a lot for your comments. In my current design I don't insert a specific stage boundary for different resources; the stage boundary stays the same (by shuffle or by result), so {{withResources}} is not an eager action that triggers a stage. Instead, it just adds a resource hint to the RDD. This means that RDDs with different resource requirements in one stage may conflict. For example, in {{rdd1.withResources.mapPartitions \{ xxx \}.withResources.mapPartitions \{ xxx \}.collect}}, the resources of rdd1 may differ from those of the mapped RDD. The options I can currently think of are:

1. Always pick the latter, with a warning log saying that multiple different resource requirements in one stage are illegal.
2. Fail the stage, with a warning log saying that multiple different resource requirements in one stage are illegal.
3. Merge conflicts using the maximum resource needs. For example, if rdd1 requires 3 GPUs per task and rdd2 requires 4 GPUs per task, the merged requirement would be 4 GPUs per task. (This is the high-level description; the details will be per-partition merging.) [chosen]

Take join for example, where rdd1 and rdd2 may have different resource requirements, and the joined RDD will potentially have yet other resource requirements:

{code}
val rddA = rdd.mapPartitions().withResources
val rddB = rdd.mapPartitions().withResources
val rddC = rddA.join(rddB).withResources
rddC.collect()
{code}

Here these 3 {{withResources}} may have different requirements. Since {{rddC}} runs in a different stage, there's no need to merge its resource conflicts. But {{rddA}} and {{rddB}} run in the same stage with different tasks (partitions). So the merging strategy I'm thinking of is partition-based: tasks running on partitions from {{rddA}} will use the resources specified by {{rddA}}, and likewise for {{rddB}}.

> Accelerator-aware task scheduling for Spark
> -------------------------------------------
>
>                 Key: SPARK-24615
>                 URL: https://issues.apache.org/jira/browse/SPARK-24615
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>            Priority: Major
>              Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are
> predominant compared to CPUs. For the current Spark architecture to work
> with accelerator cards, Spark itself should understand the existence of
> accelerators and know how to schedule tasks onto executors where
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data
> plus the availability of CPUs. This introduces some problems when scheduling
> tasks that require accelerators:
> # CPU cores usually outnumber accelerators on one node, so using CPU cores
> to schedule accelerator-required tasks introduces a mismatch.
> # In one cluster, we always assume that CPUs are equipped in each node, but
> this is not true of accelerator cards.
> # The existence of heterogeneous tasks (accelerator-required or not)
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous
> tasks (accelerator-required or not). This can be part of the work of Project
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation
> details, just highlights the parts that should be changed.
>
> CC [~yanboliang] [~merlintang]
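Option 3 in the comment above (merge conflicting per-task requirements within a stage by taking the maximum of each resource) can be illustrated with a small dict-based sketch. This is a hypothetical toy model, not the actual `withResources` API:

```python
# Hedged sketch: merge two per-task resource requirement maps by taking
# the maximum amount requested for each resource type.
def merge_resource_requirements(a, b):
    merged = dict(a)
    for resource, amount in b.items():
        merged[resource] = max(merged.get(resource, 0), amount)
    return merged
```

For instance, merging a requirement of 3 GPUs per task with one of 4 GPUs per task yields 4 GPUs per task, matching the example in the comment.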
[jira] [Comment Edited] (SPARK-23622) Flaky Test: HiveClientSuites
[ https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555041#comment-16555041 ]

Yuming Wang edited comment on SPARK-23622 at 7/25/18 2:44 AM:
--------------------------------------------------------------

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93521/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93506/testReport/

{noformat}
java.lang.RuntimeException: [unresolved dependency: com.sun.jersey#jersey-core;1.14: configuration not found in com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 compile]
{noformat}

{noformat}
sbt.ForkMain$ForkError: java.lang.RuntimeException: [unresolved dependency: com.sun.jersey#jersey-core;1.14: configuration not found in com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 compile]
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1305)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$2.apply(IsolatedClientLoader.scala:115)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$2.apply(IsolatedClientLoader.scala:115)
    at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:114)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:74)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:62)
    at org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:51)
    at org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41)
    at org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48)
    at org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:79)
    at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
    at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
    at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
    at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210)
    at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257)
    at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255)
    at org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24)
    at org.scalatest.Suite$class.run(Suite.scala:1144)
    at org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24)
    at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
    at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
    at sbt.ForkMain$Run$2.call(ForkMain.java:296)
    at sbt.ForkMain$Run$2.call(ForkMain.java:286)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{noformat}

was (Author: q79969786):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93521/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93506/testReport/

{noformat}
java.lang.RuntimeException: [unresolved dependency: com.sun.jersey#jersey-core;1.14: configuration not found in com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 compile]
{noformat}

> Flaky Test: HiveClientSuites
> ----------------------------
>
>                 Key: SPARK-23622
>                 URL: https://issues.apache.org/jira/browse/SPARK-23622
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Dongjoon Hyun
>            Priority: Major
>
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test >
[jira] [Resolved] (SPARK-24891) Fix HandleNullInputsForUDF rule
[ https://issues.apache.org/jira/browse/SPARK-24891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-24891.
-----------------------------
       Resolution: Fixed
         Assignee: Maryann Xue
    Fix Version/s: 2.4.0
                   2.3.2

> Fix HandleNullInputsForUDF rule
> -------------------------------
>
>                 Key: SPARK-24891
>                 URL: https://issues.apache.org/jira/browse/SPARK-24891
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Maryann Xue
>            Assignee: Maryann Xue
>            Priority: Major
>             Fix For: 2.3.2, 2.4.0
>
> The HandleNullInputsForUDF rule can generate new {{If}} nodes indefinitely,
> causing problems such as missed SQL cache matches.
[jira] [Commented] (SPARK-23622) Flaky Test: HiveClientSuites
[ https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555041#comment-16555041 ]

Yuming Wang commented on SPARK-23622:
-------------------------------------

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93521/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93506/testReport/

{noformat}
java.lang.RuntimeException: [unresolved dependency: com.sun.jersey#jersey-core;1.14: configuration not found in com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 compile]
{noformat}

> Flaky Test: HiveClientSuites
> ----------------------------
>
>                 Key: SPARK-23622
>                 URL: https://issues.apache.org/jira/browse/SPARK-23622
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Dongjoon Hyun
>            Priority: Major
>
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test > (Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325
> {code}
> Error Message
> java.lang.reflect.InvocationTargetException: null
> Stacktrace
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>     at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270)
>     at org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:58)
>     at org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41)
>     at org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48)
>     at org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:71)
>     at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
>     at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>     at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>     at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210)
>     at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257)
>     at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255)
>     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>     at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255)
>     at org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24)
>     at org.scalatest.Suite$class.run(Suite.scala:1144)
>     at org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24)
>     at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>     at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>     at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>     at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>     at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444)
>     at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
>     at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
>     ... 29 more
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>     at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1453)
>     at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:63)
>     at
[jira] [Issue Comment Deleted] (SPARK-24663) Flaky test: StreamingContextSuite "stop slow receiver gracefully"
[ https://issues.apache.org/jira/browse/SPARK-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-24663:
--------------------------------
    Comment: was deleted

(was:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93521/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93506/testReport/

{noformat}
java.lang.RuntimeException: [unresolved dependency: com.sun.jersey#jersey-core;1.14: configuration not found in com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 compile]
{noformat}
)

> Flaky test: StreamingContextSuite "stop slow receiver gracefully"
> -----------------------------------------------------------------
>
>                 Key: SPARK-24663
>                 URL: https://issues.apache.org/jira/browse/SPARK-24663
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.4.0
>            Reporter: Marcelo Vanzin
>            Priority: Minor
>
> This is another test that sometimes fails on our build machines, although I
> can't find failures on the riselab jenkins servers. Failure looks like:
> {noformat}
> org.scalatest.exceptions.TestFailedException: 0 was not greater than 0
>     at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>     at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>     at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>     at org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply$mcV$sp(StreamingContextSuite.scala:356)
>     at org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335)
>     at org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335)
> {noformat}
> The test fails in about 2s, while a successful run generally takes 15s.
> Looking at the logs, the receiver hasn't even started when things fail, which
> points at a race during test initialization.
[jira] [Updated] (SPARK-24529) Add spotbugs into maven build process
[ https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-24529:
---------------------------------
    Fix Version/s: (was: 2.4.0)

> Add spotbugs into maven build process
> -------------------------------------
>
>                 Key: SPARK-24529
>                 URL: https://issues.apache.org/jira/browse/SPARK-24529
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 2.4.0
>            Reporter: Kazuaki Ishizaki
>            Assignee: Kazuaki Ishizaki
>            Priority: Minor
>
> We will enable a Java bytecode check tool
> [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at
> multiplication. Due to the tool limitation, some other checks will be enabled.
[jira] [Reopened] (SPARK-24529) Add spotbugs into maven build process
[ https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reopened SPARK-24529:
----------------------------------

This was reverted in favour of https://github.com/apache/spark/pull/21865 and SPARK-24895 for now.

> Add spotbugs into maven build process
> -------------------------------------
>
>                 Key: SPARK-24529
>                 URL: https://issues.apache.org/jira/browse/SPARK-24529
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 2.4.0
>            Reporter: Kazuaki Ishizaki
>            Assignee: Kazuaki Ishizaki
>            Priority: Minor
>
> We will enable a Java bytecode check tool
> [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at
> multiplication. Due to the tool limitation, some other checks will be enabled.
[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555027#comment-16555027 ]

Hyukjin Kwon commented on SPARK-24895:
--------------------------------------

Thank you [~yhuai]. I couldn't foresee this problem.

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-24895
>                 URL: https://issues.apache.org/jira/browse/SPARK-24895
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 2.4.0
>            Reporter: Eric Chang
>            Assignee: Eric Chang
>            Priority: Major
>             Fix For: 2.4.0
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache
> Maven repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce (enforce-banned-dependencies) on project spark_2.4: Execution enforce-banned-dependencies of goal org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: Could not resolve following dependencies: [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not resolve dependencies for project com.databricks:spark_2.4:pom:1: The following artifacts could not be resolved: org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>
> If you check the artifact metadata you will see the pom and jar files are
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>
> This behavior is very similar to this issue:
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same Maven 3.3.9 version and maven-deploy
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that
> causes this. The most recent addition is the spot-bugs plugin, which is known
> to have incompatibilities with other plugins:
> https://github.com/spotbugs/spotbugs-maven-plugin/issues/21
> We may want to try building without it to sanity check.
[jira] [Commented] (SPARK-24663) Flaky test: StreamingContextSuite "stop slow receiver gracefully"
[ https://issues.apache.org/jira/browse/SPARK-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555028#comment-16555028 ]

Yuming Wang commented on SPARK-24663:
-------------------------------------

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93521/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93506/testReport/

{noformat}
java.lang.RuntimeException: [unresolved dependency: com.sun.jersey#jersey-core;1.14: configuration not found in com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 compile]
{noformat}

> Flaky test: StreamingContextSuite "stop slow receiver gracefully"
> -----------------------------------------------------------------
>
>                 Key: SPARK-24663
>                 URL: https://issues.apache.org/jira/browse/SPARK-24663
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.4.0
>            Reporter: Marcelo Vanzin
>            Priority: Minor
>
> This is another test that sometimes fails on our build machines, although I
> can't find failures on the riselab jenkins servers. Failure looks like:
> {noformat}
> org.scalatest.exceptions.TestFailedException: 0 was not greater than 0
>     at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>     at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>     at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>     at org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply$mcV$sp(StreamingContextSuite.scala:356)
>     at org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335)
>     at org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335)
> {noformat}
> The test fails in about 2s, while a successful run generally takes 15s.
> Looking at the logs, the receiver hasn't even started when things fail, which
> points at a race during test initialization.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data
[ https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555023#comment-16555023 ] Apache Spark commented on SPARK-24906: -- User 'habren' has created a pull request for this issue: https://github.com/apache/spark/pull/21868 > Enlarge split size for columnar file to ensure the task read enough data > > > Key: SPARK-24906 > URL: https://issues.apache.org/jira/browse/SPARK-24906 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jason Guo >Priority: Critical > Attachments: image-2018-07-24-20-26-32-441.png, > image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, > image-2018-07-24-20-30-24-552.png > > > For columnar file formats (e.g. Parquet), when Spark SQL reads a table, each split is 128 MB by default, since spark.sql.files.maxPartitionBytes defaults to 128 MB. Even when the user sets it to a larger value, such as 512 MB, a task may read only a few MB or even hundreds of KB, because the table may consist of dozens of columns while the SQL needs only a few of them, and Spark prunes the unnecessary columns. > > In this case, Spark's DataSourceScanExec can enlarge maxPartitionBytes adaptively. > For example, suppose a table has 40 columns: 20 integers and 20 longs. When a query reads one integer column and one long column, maxPartitionBytes should be 20 times larger: (20*4+20*8) / (4+8) = 20. > > With this optimization, the number of tasks will be smaller and the job will run faster. More importantly, for a very large cluster (more than 10 thousand nodes), it relieves the RM's scheduling pressure. > > Here is the test: > > The table named test2 has more than 40 columns and more than 5 TB of data each hour.
> When we issue a very simple query > > {code:java} > select count(device_id) from test2 where date=20180708 and hour='23'{code} > > There are 72176 tasks and the duration of the job is 4.8 minutes. > !image-2018-07-24-20-26-32-441.png! > > Most tasks last less than 1 second and read less than 1.5 MB of data. > !image-2018-07-24-20-28-06-269.png! > > After the optimization, there are only 1615 tasks and the job lasts only 30 seconds, almost 10 times faster. > !image-2018-07-24-20-29-24-797.png! > > The median of data read is 44.2 MB. > !image-2018-07-24-20-30-24-552.png! >
[jira] [Assigned] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data
[ https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24906: Assignee: (was: Apache Spark) > Enlarge split size for columnar file to ensure the task read enough data > > > Key: SPARK-24906 > URL: https://issues.apache.org/jira/browse/SPARK-24906 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jason Guo >Priority: Critical > Attachments: image-2018-07-24-20-26-32-441.png, > image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, > image-2018-07-24-20-30-24-552.png > > > For columnar file, such as, when spark sql read the table, each split will be > 128 MB by default since spark.sql.files.maxPartitionBytes is default to > 128MB. Even when user set it to a large value, such as 512MB, the task may > read only few MB or even hundreds of KB. Because the table (Parquet) may > consists of dozens of columns while the SQL only need few columns. And spark > will prune the unnecessary columns. > > In this case, spark DataSourceScanExec can enlarge maxPartitionBytes > adaptively. > For example, there is 40 columns , 20 are integer while another 20 are long. > When use query on an integer type column and an long type column, the > maxPartitionBytes should be 20 times larger. (20*4+20*8) / (4+8) = 20. > > With this optimization, the number of task will be smaller and the job will > run faster. More importantly, for a very large cluster (more the 10 thousand > nodes), it will relieve RM's schedule pressure. > > Here is the test > > The table named test2 has more than 40 columns and there are more than 5 TB > data each hour. > When we issue a very simple query > > {code:java} > select count(device_id) from test2 where date=20180708 and hour='23'{code} > > There are 72176 tasks and the duration of the job is 4.8 minutes > !image-2018-07-24-20-26-32-441.png! 
> > Most tasks last less than 1 second and read less than 1.5 MB of data. > !image-2018-07-24-20-28-06-269.png! > > After the optimization, there are only 1615 tasks and the job lasts only 30 seconds, almost 10 times faster. > !image-2018-07-24-20-29-24-797.png! > > The median of data read is 44.2 MB. > !image-2018-07-24-20-30-24-552.png! >
[jira] [Assigned] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data
[ https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24906: Assignee: Apache Spark > Enlarge split size for columnar file to ensure the task read enough data > > > Key: SPARK-24906 > URL: https://issues.apache.org/jira/browse/SPARK-24906 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jason Guo >Assignee: Apache Spark >Priority: Critical > Attachments: image-2018-07-24-20-26-32-441.png, > image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, > image-2018-07-24-20-30-24-552.png > > > For columnar file, such as, when spark sql read the table, each split will be > 128 MB by default since spark.sql.files.maxPartitionBytes is default to > 128MB. Even when user set it to a large value, such as 512MB, the task may > read only few MB or even hundreds of KB. Because the table (Parquet) may > consists of dozens of columns while the SQL only need few columns. And spark > will prune the unnecessary columns. > > In this case, spark DataSourceScanExec can enlarge maxPartitionBytes > adaptively. > For example, there is 40 columns , 20 are integer while another 20 are long. > When use query on an integer type column and an long type column, the > maxPartitionBytes should be 20 times larger. (20*4+20*8) / (4+8) = 20. > > With this optimization, the number of task will be smaller and the job will > run faster. More importantly, for a very large cluster (more the 10 thousand > nodes), it will relieve RM's schedule pressure. > > Here is the test > > The table named test2 has more than 40 columns and there are more than 5 TB > data each hour. > When we issue a very simple query > > {code:java} > select count(device_id) from test2 where date=20180708 and hour='23'{code} > > There are 72176 tasks and the duration of the job is 4.8 minutes > !image-2018-07-24-20-26-32-441.png! 
> > Most tasks last less than 1 second and read less than 1.5 MB of data. > !image-2018-07-24-20-28-06-269.png! > > After the optimization, there are only 1615 tasks and the job lasts only 30 seconds, almost 10 times faster. > !image-2018-07-24-20-29-24-797.png! > > The median of data read is 44.2 MB. > !image-2018-07-24-20-30-24-552.png! >
[jira] [Commented] (SPARK-24897) DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed
[ https://issues.apache.org/jira/browse/SPARK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555002#comment-16555002 ] Sean Owen commented on SPARK-24897: --- I don't understand what you're reporting. I think you should maybe make reference to a specific change you would suggest that addresses this, or else I think it would be closed. > DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for > stage fetchFailed > -- > > Key: SPARK-24897 > URL: https://issues.apache.org/jira/browse/SPARK-24897 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.1.0, 2.3.1 >Reporter: liupengcheng >Priority: Major > > In Spark 2.1, when a stage hits a fetch failure, the DAGScheduler retries both that stage and its parent stage; however, when the parent stage is resubmitted and starts running, its map statuses can still be invalidated by the failed stage's outstanding tasks hitting further fetch failures. Those outstanding tasks may unregister the map statuses under a new epoch, causing the parent stage to see repeated MetadataFetchFailed errors and finally failing the job.
> > > {code:java} > 2018-07-23,01:52:33,012 WARN org.apache.spark.scheduler.TaskSetManager: Lost > task 174.0 in stage 71.0 (TID 154127, , executor 96): > FetchFailed(BlockManagerId(4945, , 22409), shuffleId=24, mapId=667, > reduceId=174, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:33,013 INFO org.apache.spark.scheduler.DAGScheduler: Marking > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) as failed due to a > fetch failure from ShuffleMapStage 69 ($plus$plus at > DeviceLocateMain.scala:236) > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) failed in 246.856 s > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,004 WARN org.apache.spark.scheduler.TaskSetManager: Lost > task 120.0 in stage 71.0 (TID 154073, , executor 286): > FetchFailed(BlockManagerId(4208, , 22409), shuffleId=24, mapId=241, > reduceId=120, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:36,005 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,017 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece0 stored as bytes in memory (estimated size 4.0 MB, free > 26.7 MB) > 2018-07-23,01:52:36,025 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_59_piece1 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,029 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_61_piece6 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,079 INFO 
org.apache.spark.deploy.yarn.YarnAllocator: > Canceling requests for 0 executor containers > 2018-07-23,01:52:36,079 WARN org.apache.spark.deploy.yarn.YarnAllocator: > Expected to find pending requests, but found none. > 2018-07-23,01:52:36,094 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_63_piece0 in memory on :56780 (size: 4.0 MB, free: 3.7 > GB) > 2018-07-23,01:52:36,095 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece1 stored as bytes in memory (estimated size 4.0 MB, free > 30.7 MB) > 2018-07-23,01:52:36,107 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece1 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece2 stored as bytes in memory (estimated size 4.0 MB, free > 34.7 MB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece2 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece3 stored as bytes in memory (estimated size 4.0 MB, free > 38.7 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece3 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece4 stored as bytes in memory (estimated size 3.8 MB, free > 42.5 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added >
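The failure mode in this report centers on the scheduler's epoch counter. As a rough illustration of the intended behavior, here is a toy Python model (the class and method names are invented for illustration; Spark's real logic lives in DAGScheduler and MapOutputTrackerMaster): once a fetch failure has bumped the epoch, a later failure report that still carries the old epoch should be ignored rather than unregistering map outputs and bumping the epoch again.

```python
# Toy model of epoch-based fetch-failure handling (hypothetical names;
# not Spark's actual classes). A failure reported against a stale epoch
# is dropped instead of repeatedly invalidating map outputs.
class MapOutputTracker:
    def __init__(self):
        self.epoch = 0
        self.map_statuses = {}  # (shuffle_id, map_id) -> executor location

    def register(self, shuffle_id, map_id, location):
        self.map_statuses[(shuffle_id, map_id)] = location

    def handle_fetch_failure(self, shuffle_id, map_id, failure_epoch):
        # Ignore failures from an older epoch: that map output was already
        # unregistered and the parent stage already resubmitted.
        if failure_epoch < self.epoch:
            return False
        self.map_statuses.pop((shuffle_id, map_id), None)
        self.epoch += 1  # force executors to refetch map output locations
        return True

tracker = MapOutputTracker()
tracker.register(24, 667, "host-a")
first = tracker.handle_fetch_failure(24, 667, failure_epoch=0)   # acted on
second = tracker.handle_fetch_failure(24, 667, failure_epoch=0)  # stale, ignored
```

In this sketch the second, stale report leaves the epoch untouched, which is the behavior the reporter argues the real DAGScheduler should have.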
[jira] [Commented] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names
[ https://issues.apache.org/jira/browse/SPARK-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554990#comment-16554990 ] Yuming Wang commented on SPARK-24911: - Can we show non printable field delimiter when {{SHOW CREATE TABLE}}. I have reported in https://issues.apache.org/jira/browse/SPARK-23058 > SHOW CREATE TABLE drops escaping of nested column names > --- > > Key: SPARK-24911 > URL: https://issues.apache.org/jira/browse/SPARK-24911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Major > > Create a table with quoted nested column - *`b`*: > {code:sql} > create table `test` (`a` STRUCT<`b`:STRING>); > {code} > and show how the table was created: > {code:sql} > SHOW CREATE TABLE `test` > {code} > {code} > CREATE TABLE `test`(`a` struct) > {code} > The column *b* becomes unquoted. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
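For context on the escaping bug itself: generating correct DDL requires backtick-quoting every identifier, including nested struct field names. A minimal sketch of such a generator (hypothetical helper names; not Spark's actual SHOW CREATE TABLE code path):

```python
# Illustrative sketch: render a struct type to DDL with every field name
# backtick-quoted, including nested ones. Names here are invented.
def quote(name: str) -> str:
    # Escape embedded backticks by doubling them, then wrap in backticks.
    return "`" + name.replace("`", "``") + "`"

def to_ddl(dtype) -> str:
    # dtype is either a primitive type name ("string") or a tuple
    # ("struct", [(field_name, dtype), ...]).
    if isinstance(dtype, str):
        return dtype
    _, fields = dtype
    inner = ",".join(f"{quote(n)}:{to_ddl(t)}" for n, t in fields)
    return f"struct<{inner}>"

schema = [("a", ("struct", [("b", "string")]))]
cols = ", ".join(f"{quote(n)} {to_ddl(t)}" for n, t in schema)
ddl = f"CREATE TABLE `test`({cols})"
# The nested field `b` stays quoted, unlike the buggy output in the report.
```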
[jira] [Commented] (SPARK-24897) DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed
[ https://issues.apache.org/jira/browse/SPARK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554988#comment-16554988 ] liupengcheng commented on SPARK-24897: -- [~srowen] Could anybody verify this issue? Thanks a lot. > DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for > stage fetchFailed > -- > > Key: SPARK-24897 > URL: https://issues.apache.org/jira/browse/SPARK-24897 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.1.0, 2.3.1 >Reporter: liupengcheng >Priority: Major > > In Spark 2.1, when a stage hits a fetch failure, the DAGScheduler retries both that stage and its parent stage; however, when the parent stage is resubmitted and starts running, its map statuses can still be invalidated by the failed stage's outstanding tasks hitting further fetch failures. Those outstanding tasks may unregister the map statuses under a new epoch, causing the parent stage to see repeated MetadataFetchFailed errors and finally failing the job.
> > > {code:java} > 2018-07-23,01:52:33,012 WARN org.apache.spark.scheduler.TaskSetManager: Lost > task 174.0 in stage 71.0 (TID 154127, , executor 96): > FetchFailed(BlockManagerId(4945, , 22409), shuffleId=24, mapId=667, > reduceId=174, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:33,013 INFO org.apache.spark.scheduler.DAGScheduler: Marking > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) as failed due to a > fetch failure from ShuffleMapStage 69 ($plus$plus at > DeviceLocateMain.scala:236) > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) failed in 246.856 s > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,004 WARN org.apache.spark.scheduler.TaskSetManager: Lost > task 120.0 in stage 71.0 (TID 154073, , executor 286): > FetchFailed(BlockManagerId(4208, , 22409), shuffleId=24, mapId=241, > reduceId=120, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:36,005 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,017 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece0 stored as bytes in memory (estimated size 4.0 MB, free > 26.7 MB) > 2018-07-23,01:52:36,025 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_59_piece1 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,029 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_61_piece6 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,079 INFO 
org.apache.spark.deploy.yarn.YarnAllocator: > Canceling requests for 0 executor containers > 2018-07-23,01:52:36,079 WARN org.apache.spark.deploy.yarn.YarnAllocator: > Expected to find pending requests, but found none. > 2018-07-23,01:52:36,094 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_63_piece0 in memory on :56780 (size: 4.0 MB, free: 3.7 > GB) > 2018-07-23,01:52:36,095 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece1 stored as bytes in memory (estimated size 4.0 MB, free > 30.7 MB) > 2018-07-23,01:52:36,107 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece1 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece2 stored as bytes in memory (estimated size 4.0 MB, free > 34.7 MB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece2 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece3 stored as bytes in memory (estimated size 4.0 MB, free > 38.7 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece3 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece4 stored as bytes in memory (estimated size 3.8 MB, free > 42.5 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece4 in memory on :56780 (size: 3.8 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO
[jira] [Commented] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554986#comment-16554986 ] Saisai Shao commented on SPARK-24867: - [~smilegator] what's the ETA of this issue? > Add AnalysisBarrier to DataFrameWriter > --- > > Key: SPARK-24867 > URL: https://issues.apache.org/jira/browse/SPARK-24867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > > {code} > val udf1 = udf({(x: Int, y: Int) => x + y}) > val df = spark.range(0, 3).toDF("a") > .withColumn("b", udf1($"a", udf1($"a", lit(10 > df.cache() > df.write.saveAsTable("t") > df.write.saveAsTable("t1") > {code} > Cache is not being used because the plans do not match with the cached plan. > This is a regression caused by the changes we made in AnalysisBarrier, since > not all the Analyzer rules are idempotent. We need to fix it to Spark 2.3.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
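A toy model of why a non-idempotent analyzer rule defeats plan-based caching (invented names; not Spark internals): if a rule wraps a UDF in a null check every time it fires, then analyzing a plan that was already analyzed — as happens when DataFrameWriter re-enters the analyzer without an AnalysisBarrier — produces a different tree, so a cache keyed on plan equality misses.

```python
# Hypothetical non-idempotent rewrite rule: it adds a null-check wrapper
# around a UDF even when one is already present (modeled on the
# HandleNullInputsForUDF symptom, with invented tuple-based plan nodes).
def null_check_rule(expr):
    if expr[0] == "udf":
        return ("if", ("isnull", expr[1]), "null", expr)
    if expr[0] == "if":
        # Recurse into the else-branch, re-wrapping the inner UDF again.
        return ("if", expr[1], expr[2], null_check_rule(expr[3]))
    return expr

plan = ("udf", "a")
analyzed_once = null_check_rule(plan)
# Re-analysis of an already-analyzed plan (no AnalysisBarrier in between):
analyzed_twice = null_check_rule(analyzed_once)

cache = {analyzed_once: "cached result"}
hit = analyzed_twice in cache  # misses: the trees differ
```

Each extra pass grows the tree with another `if` wrapper, which is the cache-miss regression described above.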
[jira] [Assigned] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
[ https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-24297: --- Assignee: Imran Rashid > Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB > --- > > Key: SPARK-24297 > URL: https://issues.apache.org/jira/browse/SPARK-24297 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Shuffle, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.4.0 > > > Any network request which does not use stream-to-disk that is sending over > 2GB is doomed to fail, so we might as well at least set the default value of > spark.maxRemoteBlockSizeFetchToMem to something < 2GB. > It probably makes sense to set it to something even lower still, but that > might require more careful testing; this is a totally safe first step. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
[ https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24297. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21474 [https://github.com/apache/spark/pull/21474] > Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB > --- > > Key: SPARK-24297 > URL: https://issues.apache.org/jira/browse/SPARK-24297 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Shuffle, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.4.0 > > > Any network request which does not use stream-to-disk that is sending over > 2GB is doomed to fail, so we might as well at least set the default value of > spark.maxRemoteBlockSizeFetchToMem to something < 2GB. > It probably makes sense to set it to something even lower still, but that > might require more careful testing; this is a totally safe first step. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
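The rationale can be sketched as a simple decision rule (illustrative Python; the helper name and the exact sub-2 GB default are assumptions, not Spark's code): any remote block larger than the threshold is streamed to disk, because a single in-memory network message is capped near Int.MaxValue bytes.

```python
# Illustrative sketch of what spark.maxRemoteBlockSizeFetchToMem controls.
# The default shown (Int.MaxValue minus a little headroom) is an assumed
# "just under 2 GB" value, not necessarily Spark's exact constant.
INT_MAX = 2**31 - 1

def fetch_strategy(block_size, max_fetch_to_mem=INT_MAX - 512):
    # Blocks above the threshold are streamed to disk; anything larger
    # than ~2 GB would fail if buffered as one in-memory message anyway.
    return "disk" if block_size > max_fetch_to_mem else "memory"

small = fetch_strategy(64 * 1024 * 1024)  # typical shuffle block
huge = fetch_strategy(3 * 1024**3)        # would overflow a 2 GB buffer
```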
[jira] [Comment Edited] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data
[ https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554972#comment-16554972 ] Jason Guo edited comment on SPARK-24906 at 7/25/18 1:03 AM: Thanks [~maropu] and [~viirya] for your comments. Here is my solution for our cluster (more than 40 thousands nodes in total and more than 10 thousands nodes for a single cluster). 1. When this will be enabled? There is a configuration item named {code:java} spark.sql.parquet.adaptiveFileSplit=false{code} Only when this is enabled, DataSourceScanExec will enlarge the mapPartitionBytes. By default, this parameter is set to false. (Our ad-hoc query will set it to true) With this configuration, user will know that spark will adjust the partition / split size adaptively. If user do not want to use this, he or she can disable it 2. How to calculate maxPartitionBytes and openCostInBytes Different data type has different length (calculated with DataType.defaultSize). First we get the total size of the whole table (henceforth referred to as the “T”). Then we get the total size of all the requiredSchema (henceforth referred to as the “R”). The multiplier should be T / R .Then the maxPartitionBytes and openCostInBytes will be enlarge with T / R times. was (Author: habren): Thanks [~maropu] and [~viirya] for your comments. Here is my solution for our cluster (more than 40 thousands nodes in total and more than 10 thousands nodes for a single cluster). 1. When this is enabled There is a configuration item named {code:java} spark.sql.parquet.adaptiveFileSplit=false{code} Only when this is enabled, DataSourceScanExec will enlarge the mapPartitionBytes. By default, this parameter is set to false. (Our ad-hoc query will set it to true) 2. How to calculate maxPartitionBytes and openCostInBytes Different data type has different length (calculated with DataType.defaultSize). First we get the total size of the whole table (henceforth referred to as the “T”). 
Then we get the total size of all the requiredSchema (henceforth referred to as the “R”). The multiplier should be T / R .Then the maxPartitionBytes and openCostInBytes will be enlarge with T / R times. > Enlarge split size for columnar file to ensure the task read enough data > > > Key: SPARK-24906 > URL: https://issues.apache.org/jira/browse/SPARK-24906 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jason Guo >Priority: Critical > Attachments: image-2018-07-24-20-26-32-441.png, > image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, > image-2018-07-24-20-30-24-552.png > > > For columnar file, such as, when spark sql read the table, each split will be > 128 MB by default since spark.sql.files.maxPartitionBytes is default to > 128MB. Even when user set it to a large value, such as 512MB, the task may > read only few MB or even hundreds of KB. Because the table (Parquet) may > consists of dozens of columns while the SQL only need few columns. And spark > will prune the unnecessary columns. > > In this case, spark DataSourceScanExec can enlarge maxPartitionBytes > adaptively. > For example, there is 40 columns , 20 are integer while another 20 are long. > When use query on an integer type column and an long type column, the > maxPartitionBytes should be 20 times larger. (20*4+20*8) / (4+8) = 20. > > With this optimization, the number of task will be smaller and the job will > run faster. More importantly, for a very large cluster (more the 10 thousand > nodes), it will relieve RM's schedule pressure. > > Here is the test > > The table named test2 has more than 40 columns and there are more than 5 TB > data each hour. > When we issue a very simple query > > {code:java} > select count(device_id) from test2 where date=20180708 and hour='23'{code} > > There are 72176 tasks and the duration of the job is 4.8 minutes > !image-2018-07-24-20-26-32-441.png! 
> > Most tasks last less than 1 second and read less than 1.5 MB of data. > !image-2018-07-24-20-28-06-269.png! > > After the optimization, there are only 1615 tasks and the job lasts only 30 seconds, almost 10 times faster. > !image-2018-07-24-20-29-24-797.png! > > The median of data read is 44.2 MB. > !image-2018-07-24-20-30-24-552.png! >
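The T / R computation described in the comment can be sketched as follows (Python for illustration; the function name and per-type size table are hypothetical, with the sizes standing in for DataType.defaultSize):

```python
# Sketch of the proposed adaptive split sizing: scale maxPartitionBytes by
# T / R, where T is the row width of the full schema and R the width of
# only the columns the query actually reads. Names are illustrative.
DEFAULT_SIZE = {"int": 4, "long": 8, "double": 8, "string": 20}

def adaptive_max_partition_bytes(full_schema, required_cols,
                                 max_partition_bytes=128 * 1024 * 1024):
    total = sum(DEFAULT_SIZE[t] for t in full_schema.values())        # T
    required = sum(DEFAULT_SIZE[full_schema[c]] for c in required_cols)  # R
    return max_partition_bytes * total // required

# The JIRA's example: 40 columns, 20 int + 20 long, query reads one of each.
schema = {f"i{k}": "int" for k in range(20)}
schema.update({f"l{k}": "long" for k in range(20)})
split = adaptive_max_partition_bytes(schema, ["i0", "l0"])
# (20*4 + 20*8) / (4 + 8) = 20, so splits grow from 128 MB to 2560 MB
```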
[jira] [Commented] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data
[ https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554972#comment-16554972 ] Jason Guo commented on SPARK-24906: --- Thanks [~maropu] and [~viirya] for your comments. Here is my solution for our cluster (more than 40 thousands nodes in total and more than 10 thousands nodes for a single cluster). 1. When this is enabled There is a configuration item named {code:java} spark.sql.parquet.adaptiveFileSplit=false{code} Only when this is enabled, DataSourceScanExec will enlarge the mapPartitionBytes. By default, this parameter is set to false. (Our ad-hoc query will set it to true) 2. How to calculate maxPartitionBytes and openCostInBytes Different data type has different length (calculated with DataType.defaultSize). First we get the total size of the whole table (henceforth referred to as the “T”). Then we get the total size of all the requiredSchema (henceforth referred to as the “R”). The multiplier should be T / R .Then the maxPartitionBytes and openCostInBytes will be enlarge with T / R times. > Enlarge split size for columnar file to ensure the task read enough data > > > Key: SPARK-24906 > URL: https://issues.apache.org/jira/browse/SPARK-24906 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jason Guo >Priority: Critical > Attachments: image-2018-07-24-20-26-32-441.png, > image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, > image-2018-07-24-20-30-24-552.png > > > For columnar file, such as, when spark sql read the table, each split will be > 128 MB by default since spark.sql.files.maxPartitionBytes is default to > 128MB. Even when user set it to a large value, such as 512MB, the task may > read only few MB or even hundreds of KB. Because the table (Parquet) may > consists of dozens of columns while the SQL only need few columns. And spark > will prune the unnecessary columns. 
> > In this case, spark DataSourceScanExec can enlarge maxPartitionBytes > adaptively. > For example, there is 40 columns , 20 are integer while another 20 are long. > When use query on an integer type column and an long type column, the > maxPartitionBytes should be 20 times larger. (20*4+20*8) / (4+8) = 20. > > With this optimization, the number of task will be smaller and the job will > run faster. More importantly, for a very large cluster (more the 10 thousand > nodes), it will relieve RM's schedule pressure. > > Here is the test > > The table named test2 has more than 40 columns and there are more than 5 TB > data each hour. > When we issue a very simple query > > {code:java} > select count(device_id) from test2 where date=20180708 and hour='23'{code} > > There are 72176 tasks and the duration of the job is 4.8 minutes > !image-2018-07-24-20-26-32-441.png! > > Most tasks last less than 1 second and read less than 1.5 MB data > !image-2018-07-24-20-28-06-269.png! > > After the optimization, there are only 1615 tasks and the job last only 30 > seconds. It almost 10 times faster. > !image-2018-07-24-20-29-24-797.png! > > The median of read data is 44.2MB. > !image-2018-07-24-20-30-24-552.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554932#comment-16554932 ] Yin Huai commented on SPARK-24895: -- [~hyukjin.kwon] [~kiszk] seems this revert indeed fixed the problem :) > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > Fix For: 2.4.0 > > > Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven > repo have mismatched filenames: > {noformat} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > {noformat} > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
> > This behavior is very similar to this issue: https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check.
[jira] [Created] (SPARK-24914) totalSize is not a good estimate for broadcast joins
Bruce Robbins created SPARK-24914: - Summary: totalSize is not a good estimate for broadcast joins Key: SPARK-24914 URL: https://issues.apache.org/jira/browse/SPARK-24914 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Bruce Robbins When determining whether to do a broadcast join, Spark estimates the size of the smaller table as follows: - if totalSize is defined and greater than 0, use it. - else, if rawDataSize is defined and greater than 0, use it - else, use spark.sql.defaultSizeInBytes (default: Long.MaxValue) Therefore, Spark prefers totalSize over rawDataSize. Unfortunately, totalSize is often quite a bit smaller than the actual table size, since it represents the size of the table's files on disk. Parquet and Orc files, for example, are encoded and compressed. This can result in the JVM throwing an OutOfMemoryError while Spark is loading the table into a HashedRelation, or when Spark actually attempts to broadcast the data. On the other hand, rawDataSize represents the uncompressed size of the dataset, according to Hive documentation. This seems like a pretty good number to use in preference to totalSize. However, due to HIVE-20079, this value is simply #columns * #rows. Once that bug is fixed, it may be a superior statistic, at least for managed tables. In the meantime, we could apply a configurable "fudge factor" to totalSize, at least for types of files that are encoded and compressed. Hive has the setting hive.stats.deserialization.factor, which defaults to 1.0, and is described as follows: {quote}in the absence of uncompressed/raw data size, total file size will be used for statistics annotation. But the file may be compressed, encoded and serialized which may be lesser in size than the actual uncompressed/raw data size. This factor will be multiplied to file size to estimate the raw data size. 
{quote} In addition to the fudge factor, we could compare the adjusted totalSize to rawDataSize and use the bigger of the two: {noformat} size1 = totalSize.isDefined ? totalSize * fudgeFactor : Long.MAX_VALUE size2 = rawDataSize.isDefined ? rawDataSize : Long.MAX_VALUE size = max(size1, size2) {noformat} Caveat: This mitigates the issue only for Hive tables. It does not help much when the user is reading files using {{spark.read.parquet}}, unless we apply the same fudge factor there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
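To make the proposed rule concrete, here is a small sketch of the estimate described above: treat missing or non-positive statistics as undefined, apply the fudge factor to totalSize, take the larger of the two candidates, and fall back to the configured default when neither statistic is usable. The function name and shape are hypothetical illustrations, not Spark's actual code (Spark's logic lives in its Catalyst statistics code, and `LONG_MAX` stands in for `Long.MaxValue` / `spark.sql.defaultSizeInBytes`).

```python
# Sketch of the proposed size estimate (hypothetical helper, not Spark code).
# Missing/non-positive statistics are treated as undefined; taking the max
# guards against a too-small statistic triggering an oversized broadcast.
LONG_MAX = 2**63 - 1

def estimate_table_size(total_size=None, raw_data_size=None,
                        fudge_factor=1.0, default_size=LONG_MAX):
    candidates = []
    if total_size is not None and total_size > 0:
        candidates.append(int(total_size * fudge_factor))
    if raw_data_size is not None and raw_data_size > 0:
        candidates.append(raw_data_size)
    return max(candidates) if candidates else default_size

# Compressed Parquet: 100 MB on disk, so a 3x fudge factor estimates 300 MB.
print(estimate_table_size(total_size=100 << 20, fudge_factor=3.0))
```

Note that this reads the original pseudocode's `max(size1, size2)` as "the max of the *defined* estimates"; a literal reading (undefined means `Long.MAX_VALUE`) would disable broadcasting whenever either statistic is missing.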
[jira] [Commented] (SPARK-24778) DateTimeUtils.getTimeZone method returns GMT time if timezone cannot be parsed
[ https://issues.apache.org/jira/browse/SPARK-24778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554907#comment-16554907 ] Vinitha Reddy Gankidi commented on SPARK-24778: --- [~maropu] Sorry, I missed seeing this comment. Failing the query when the timezone is not supported works too. In some cases, Spark returns NULL where the inputs are incorrect. For instance, {{to_date('2018-02-31')}} returns NULL instead of throwing an error. > DateTimeUtils.getTimeZone method returns GMT time if timezone cannot be parsed > -- > > Key: SPARK-24778 > URL: https://issues.apache.org/jira/browse/SPARK-24778 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Vinitha Reddy Gankidi >Priority: Major > > {{DateTimeUtils.getTimeZone}} calls java's {{Timezone.getTimezone}} method > that defaults to GMT if the timezone cannot be parsed. This can be misleading > for users and its better to return NULL instead of returning an incorrect > value. > To reproduce: {{from_utc_timestamp}} is one of the functions that calls > {{DateTimeUtils.getTimeZone}}. Session timezone is GMT for the following > queries. > {code:java} > SELECT from_utc_timestamp('2018-07-10 12:00:00', 'GMT+05:00') -> 2018-07-10 > 17:00:00 > SELECT from_utc_timestamp('2018-07-10 12:00:00', '+05:00') -> 2018-07-10 > 12:00:00 (Defaults to GMT as the timezone is not recognized){code} > We could fix it by using the workaround mentioned here: > [https://bugs.openjdk.java.net/browse/JDK-4412864]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
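The behavior proposed above can be illustrated with a small Python sketch (the `get_time_zone` helper below is hypothetical, not Spark's implementation): only accept a GMT result when GMT was actually requested, and report an unparseable ID as `None` so the caller can return NULL or raise, instead of Java's silent GMT fallback.

```python
# Illustrative sketch (not Spark code): a timezone lookup that returns None
# for unrecognized IDs instead of silently defaulting to GMT, in the spirit
# of the JDK-4412864 workaround of validating the requested ID.
import re
from datetime import timedelta, timezone

def get_time_zone(tz_id):
    if tz_id == "GMT":
        return timezone.utc                      # GMT explicitly requested
    m = re.fullmatch(r"GMT([+-])(\d{2}):(\d{2})", tz_id)
    if m:
        sign = 1 if m.group(1) == "+" else -1
        return timezone(sign * timedelta(hours=int(m.group(2)),
                                         minutes=int(m.group(3))))
    # Java's TimeZone.getTimeZone would return GMT here; surface the
    # failure instead so callers can return NULL or fail the query.
    return None

print(get_time_zone("GMT+05:00"))  # UTC+05:00
print(get_time_zone("+05:00"))     # None
```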
[jira] [Created] (SPARK-24913) Make `AssertTrue` and `AssertNotNull` non-deterministic
DB Tsai created SPARK-24913: --- Summary: Make `AssertTrue` and `AssertNotNull` non-deterministic Key: SPARK-24913 URL: https://issues.apache.org/jira/browse/SPARK-24913 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: DB Tsai -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24908) [R] remove spaces to make lintr happy
[ https://issues.apache.org/jira/browse/SPARK-24908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-24908. - Resolution: Fixed Target Version/s: 2.4.0 > [R] remove spaces to make lintr happy > - > > Key: SPARK-24908 > URL: https://issues.apache.org/jira/browse/SPARK-24908 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: shane knapp >Priority: Critical > > during my travails in porting spark builds to run on our centos worker, i > managed to recreate (as best i could) the centos environment on our new > ubuntu-testing machine. > while running my initial builds, lintr was crashing on some extraneous spaces > in test_basic.R (see: > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)] > after removing those spaces, the ubuntu build happily passed the lintr tests. > i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build > (see > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),] > which scp'ed a copy of test_basic.R in to the repo after the git clone. > everything seems to be working happily. > i will be creating a pull request for this now. > > attn: [~felixcheung] [~shivaram] [~ifilonenko] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-24895. -- Resolution: Fixed Fix Version/s: 2.4.0 [https://github.com/apache/spark/pull/21865] has been merged. > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > Fix For: 2.4.0 > > > Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven repo have mismatched filenames: > {noformat} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > {noformat} > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24912) Broadcast join OutOfMemory stack trace obscures actual cause of OOM
[ https://issues.apache.org/jira/browse/SPARK-24912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-24912: -- Priority: Minor (was: Major) > Broadcast join OutOfMemory stack trace obscures actual cause of OOM > --- > > Key: SPARK-24912 > URL: https://issues.apache.org/jira/browse/SPARK-24912 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Minor > > When the Spark driver suffers an OutOfMemoryError while attempting to > broadcast a table for a broadcast join, the resulting stack trace obscures > the actual cause of the OOM. For e.g.: > {noformat} > [GC (Allocation Failure) 585453K->585453K(928768K), 0.0060025 secs] > [Full GC (Allocation Failure) 585453K->582524K(928768K), 0.4019639 secs] > java.lang.OutOfMemoryError: Java heap space > Dumping heap to java_pid12446.hprof ... > Heap dump file created [632701033 bytes in 1.016 secs] > Exception in thread "main" java.lang.OutOfMemoryError: Not enough memory to > build and broadcast the table to all worker nodes. 
As a workaround, you can > either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to > -1 or increase the spark driver memory by setting spark.driver.memory to a > higher value > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:122) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:101) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:98) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 30 > 18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 35 > {noformat} > The above stack trace blames BroadcastExchangeExec. However, the given line > is actually where the original OutOfMemoryError was caught and a new one was > created and wrapped by a SparkException. 
The actual location where the OOM > occurred was in LongToUnsafeRowMap#grow, at this line: > {noformat} > val newPage = new Array[Long](newNumWords.toInt) > {noformat} > Sometimes it is helpful to know the actual location from which an OOM is > thrown. In the above case, the location indicated that Spark underestimated > the size of a large-ish table and ran out of memory trying to load it into > memory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
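The underlying fix pattern here is ordinary exception chaining: when wrapping the OutOfMemoryError in a friendlier error, keep the original as the cause so the real allocation site survives in the trace. A minimal Python illustration (the function names are hypothetical stand-ins, not Spark's code):

```python
# Sketch of the general pattern: wrap the OOM in an actionable error but
# chain the original, so both the advice and the real throw site appear.
def grow():
    raise MemoryError("allocation failed in grow()")  # stand-in for the real OOM

def broadcast():
    try:
        grow()
    except MemoryError as oom:
        # 'raise ... from oom' keeps the original traceback as __cause__.
        raise RuntimeError(
            "Not enough memory to build and broadcast the table; consider "
            "spark.sql.autoBroadcastJoinThreshold=-1 or more driver memory"
        ) from oom

try:
    broadcast()
except RuntimeError as e:
    print(type(e.__cause__).__name__)  # MemoryError -- original cause preserved
```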
[jira] [Commented] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names
[ https://issues.apache.org/jira/browse/SPARK-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554839#comment-16554839 ] Apache Spark commented on SPARK-24911: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21803 > SHOW CREATE TABLE drops escaping of nested column names > --- > > Key: SPARK-24911 > URL: https://issues.apache.org/jira/browse/SPARK-24911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Major > > Create a table with quoted nested column - *`b`*: > {code:sql} > create table `test` (`a` STRUCT<`b`:STRING>); > {code} > and show how the table was created: > {code:sql} > SHOW CREATE TABLE `test` > {code} > {code} > CREATE TABLE `test`(`a` struct) > {code} > The column *b* becomes unquoted. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names
[ https://issues.apache.org/jira/browse/SPARK-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24911: Assignee: Apache Spark > SHOW CREATE TABLE drops escaping of nested column names > --- > > Key: SPARK-24911 > URL: https://issues.apache.org/jira/browse/SPARK-24911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Create a table with quoted nested column - *`b`*: > {code:sql} > create table `test` (`a` STRUCT<`b`:STRING>); > {code} > and show how the table was created: > {code:sql} > SHOW CREATE TABLE `test` > {code} > {code} > CREATE TABLE `test`(`a` struct) > {code} > The column *b* becomes unquoted. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names
[ https://issues.apache.org/jira/browse/SPARK-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24911: Assignee: (was: Apache Spark) > SHOW CREATE TABLE drops escaping of nested column names > --- > > Key: SPARK-24911 > URL: https://issues.apache.org/jira/browse/SPARK-24911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Major > > Create a table with quoted nested column - *`b`*: > {code:sql} > create table `test` (`a` STRUCT<`b`:STRING>); > {code} > and show how the table was created: > {code:sql} > SHOW CREATE TABLE `test` > {code} > {code} > CREATE TABLE `test`(`a` struct) > {code} > The column *b* becomes unquoted. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24912) Broadcast join OutOfMemory stack trace obscures actual cause of OOM
Bruce Robbins created SPARK-24912: - Summary: Broadcast join OutOfMemory stack trace obscures actual cause of OOM Key: SPARK-24912 URL: https://issues.apache.org/jira/browse/SPARK-24912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Bruce Robbins When the Spark driver suffers an OutOfMemoryError while attempting to broadcast a table for a broadcast join, the resulting stack trace obscures the actual cause of the OOM. For e.g.: {noformat} [GC (Allocation Failure) 585453K->585453K(928768K), 0.0060025 secs] [Full GC (Allocation Failure) 585453K->582524K(928768K), 0.4019639 secs] java.lang.OutOfMemoryError: Java heap space Dumping heap to java_pid12446.hprof ... Heap dump file created [632701033 bytes in 1.016 secs] Exception in thread "main" java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:122) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:101) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:98) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 30 18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 35 {noformat} The above stack trace blames BroadcastExchangeExec. However, the given line is actually where the original OutOfMemoryError was caught and a new one was created and wrapped by a SparkException. The actual location where the OOM occurred was in LongToUnsafeRowMap#grow, at this line: {noformat} val newPage = new Array[Long](newNumWords.toInt) {noformat} Sometimes it is helpful to know the actual location from which an OOM is thrown. In the above case, the location indicated that Spark underestimated the size of a large-ish table and ran out of memory trying to load it into memory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names
Maxim Gekk created SPARK-24911: -- Summary: SHOW CREATE TABLE drops escaping of nested column names Key: SPARK-24911 URL: https://issues.apache.org/jira/browse/SPARK-24911 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Create a table with quoted nested column - *`b`*: {code:sql} create table `test` (`a` STRUCT<`b`:STRING>); {code} and show how the table was created: {code:sql} SHOW CREATE TABLE `test` {code} {code} CREATE TABLE `test`(`a` struct) {code} The column *b* becomes unquoted. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
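One way to preserve the escaping reported lost above is to apply identifier quoting recursively when a struct type is rendered back to DDL, rather than only at the top-level column name. A hedged Python sketch of that idea (hypothetical helpers, not Spark's `StructType`/`SHOW CREATE TABLE` code):

```python
# Hypothetical illustration of the fix: quote every field name, including
# nested struct fields, when rendering a column's DDL.
def quote(name):
    # Backtick-quote, doubling embedded backticks as SQL identifiers require.
    return "`" + name.replace("`", "``") + "`"

def to_ddl(dtype):
    """dtype is a type-name string or a list of (name, dtype) struct fields."""
    if isinstance(dtype, str):
        return dtype
    fields = ",".join(f"{quote(n)}:{to_ddl(t)}" for n, t in dtype)
    return f"struct<{fields}>"

# Column `a` STRUCT<`b`:STRING> keeps its nested backticks:
print(f"{quote('a')} {to_ddl([('b', 'STRING')])}")
```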
[jira] [Created] (SPARK-24910) Spark Bloom Filter Closure Serialization improvement for very high volume of Data
Himangshu Ranjan Borah created SPARK-24910: -- Summary: Spark Bloom Filter Closure Serialization improvement for very high volume of Data Key: SPARK-24910 URL: https://issues.apache.org/jira/browse/SPARK-24910 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.3.1 Reporter: Himangshu Ranjan Borah I am proposing an improvement to the Bloom Filter generation logic used in the DataFrameStatFunctions' Bloom Filter API: use mapPartitions() instead of aggregate() to avoid closure serialization, which fails for huge BitArrays. Spark's stat functions' Bloom Filter implementation uses aggregate/treeAggregate operations, which use a closure with a dependency on the Bloom filter that is created in the driver. Since Spark hard-codes the closure serializer to the Java serializer, it fails during closure cleanup for very large Bloom Filters (typically with num items ~ billions and fpp ~ 0.001). The Kryo serializer works fine at that scale, but there were some issues using Kryo for closure serialization, due to which Spark 2.0 hardcoded it to Java. 
The call-stack that we get typically looks like:
{noformat}
java.lang.OutOfMemoryError
 at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
 at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
 at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
 at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
 at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
 at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
 at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
 at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
 at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2124)
 at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1092)
 at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
 at org.apache.spark.rdd.RDD.fold(RDD.scala:1086)
 at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1155)
 at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
 at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1131)
 at org.apache.spark.sql.DataFrameStatFunctions.buildBloomFilter(DataFrameStatFunctions.scala:554)
 at org.apache.spark.sql.DataFrameStatFunctions.bloomFilter(DataFrameStatFunctions.scala:505)
{noformat}
This issue can be overcome if we *don't* use the *aggregate()* operations for the Bloom Filter generation and use *mapPartitions()* kind of operations where we create the
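The (truncated) proposal above amounts to building one partial Bloom filter per partition inside each task and then merging the partial filters, so the large bit array is never captured in a closure shipped from the driver. A toy, Spark-free sketch of that shape (hypothetical helpers with a deliberately tiny filter; a real version would use `rdd.mapPartitions` and a proper Bloom filter sized from the item count and fpp):

```python
# Toy sketch: per-partition Bloom filters merged by OR, standing in for
# rdd.mapPartitions(build_partition_filter).reduce(merge).
import hashlib
from functools import reduce

NUM_BITS = 1 << 16  # toy size; real filters are sized from n items and fpp

def _positions(item, k=3):
    return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % NUM_BITS
            for i in range(k)]

def bloom_add(bits, item):
    for p in _positions(item):
        bits |= 1 << p
    return bits

def bloom_contains(bits, item):
    return all((bits >> p) & 1 for p in _positions(item))

def build_partition_filter(rows):
    # Runs inside each task: the filter is created in the executor,
    # never serialized into the closure from the driver.
    return reduce(bloom_add, rows, 0)

partitions = [[1, 2, 3], [4, 5], [6]]  # stand-in for an RDD's partitions
merged = reduce(lambda a, b: a | b,
                (build_partition_filter(p) for p in partitions))
print(bloom_contains(merged, 4))  # True
```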
[jira] [Commented] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554803#comment-16554803 ] Thomas Graves commented on SPARK-24909: --- I haven't come up with a fix yet but have been looking at essentially all the things you have mentioned. will continue working on it, except I'm out tomorrow so will continue thursday. > Spark scheduler can hang when fetch failures, executor lost, task running on > lost executor, and multiple stage attempts > --- > > Key: SPARK-24909 > URL: https://issues.apache.org/jira/browse/SPARK-24909 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.1 >Reporter: Thomas Graves >Priority: Critical > > The DAGScheduler can hang if the executor was lost (due to fetch failure) and > all the tasks in the tasks sets are marked as completed. > ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)] > It never creates new task attempts in the task scheduler but the dag > scheduler still has pendingPartitions. 
> {code:java} > 8/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in > stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, > PROCESS_LOCAL, 7874 bytes) > 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 > (repartition at Lift.scala:191) as failed due to a fetch failure from > ShuffleMapStage 42 (map at foo.scala:27) > 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage > 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at > bar.scala:191) due to fetch failure > > 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18) > 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for > executor: 33 (epoch 18) > 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 > (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing > parents > 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 > with 59955 tasks > 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in > stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320) > 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus > ShuffleMapTask(44, 55769) completion from executor 33{code} > > > In the logs above you will see that task 55769.0 finished after the executor > was lost and a new task set was started. The DAG scheduler says "Ignoring > possibly bogus".. but in the TaskSetManager side it has marked those tasks as > completed for all stage attempts. The DAGScheduler gets hung here. I did a > heap dump on the process and can see that 55769 is still in the DAGScheduler > pendingPartitions list but the tasksetmanagers are all complete > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24307) Support sending messages over 2GB from memory
[ https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554784#comment-16554784 ] Apache Spark commented on SPARK-24307: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/21867 > Support sending messages over 2GB from memory > - > > Key: SPARK-24307 > URL: https://issues.apache.org/jira/browse/SPARK-24307 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.4.0 > > > Spark's networking layer supports sending messages backed by a {{FileRegion}} > or a {{ByteBuf}}. Sending large FileRegion's works, as netty supports large > FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly > a problem for sending large datasets that are already in memory, eg. cached > RDD blocks. > eg. if you try to replicate a block stored in memory that is over 2 GB, you > will see an exception like: > {noformat} > 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC > 7420542363232096629 to xyz.com/172.31.113.213:44358: > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816) > at > 
io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723) > at > io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081) > at > io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) > at > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: > -1294617291 (expected: 0 <= readerIndex <= writerIndex <= > capacity(-1294617291)) > at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129) > at > io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688) > at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110) > at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359) > at > org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87) > at > org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95) > at > 
org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) > ... 17 more > {noformat} > A simple solution to this is to create a "FileRegion" which is backed by a > {{ChunkedByteBuffer}} (spark's existing datastructure to support
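The negative writerIndex in the stack trace is the tell-tale signature of this limit: Netty's {{ByteBuf}} tracks its indices as Java ints, so a buffer larger than Integer.MAX_VALUE wraps negative. A minimal sketch (the ~2.8 GB block size below is inferred from the logged index, not stated anywhere in the report):

```java
// Netty ByteBuf indices are Java ints, so any buffer over Integer.MAX_VALUE
// (2 GB - 1) wraps negative when the long size is narrowed to int. The
// writerIndex in the stack trace above is exactly what a ~2.8 GB block
// looks like after the wrap.
public class ByteBufOverflow {
    public static void main(String[] args) {
        long blockSize = 3_000_350_005L;       // ~2.8 GB cached block (inferred)
        int wrapped = (int) blockSize;         // what CompositeByteBuf ends up with
        System.out.println(wrapped);           // -1294617291, matching the log
        System.out.println(Integer.MAX_VALUE); // the 2 GB - 1 ceiling
    }
}
```

This is why a {{FileRegion}} backed by a {{ChunkedByteBuffer}} sidesteps the problem: each chunk stays under the int-indexed limit while the region as a whole can exceed 2 GB.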
[jira] [Commented] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554748#comment-16554748 ] Imran Rashid commented on SPARK-24909: -- ugh, yeah I think you're right. do you have a fix in mind? we could undo SPARK-23433, so that the active taskset doesn't care if an earlier taskset completes the tasks it has (and rethink how to solve that case). Or we could have markPartitionCompletedInAllTaskSets take the epoch into account. Or, we could change that condition in the DAGScheduler where it's getting hung -- it could resubmit the taskset if it thinks there is still work to be done, but all tasksets are complete (which might be a more stable approach in general), rather than only doing it if {{shuffleStage.pendingPartitions.isEmpty}}. All of these changes always have a ton of corner cases and past bugs I need to remind myself of, though ... > Spark scheduler can hang when fetch failures, executor lost, task running on > lost executor, and multiple stage attempts > --- > > Key: SPARK-24909 > URL: https://issues.apache.org/jira/browse/SPARK-24909 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.1 >Reporter: Thomas Graves >Priority: Critical > > The DAGScheduler can hang if the executor was lost (due to fetch failure) and > all the tasks in the tasks sets are marked as completed. > ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)] > It never creates new task attempts in the task scheduler but the dag > scheduler still has pendingPartitions. 
> {code:java} > 8/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in > stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, > PROCESS_LOCAL, 7874 bytes) > 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 > (repartition at Lift.scala:191) as failed due to a fetch failure from > ShuffleMapStage 42 (map at foo.scala:27) > 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage > 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at > bar.scala:191) due to fetch failure > > 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18) > 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for > executor: 33 (epoch 18) > 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 > (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing > parents > 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 > with 59955 tasks > 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in > stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320) > 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus > ShuffleMapTask(44, 55769) completion from executor 33{code} > > > In the logs above you will see that task 55769.0 finished after the executor > was lost and a new task set was started. The DAG scheduler says "Ignoring > possibly bogus".. but in the TaskSetManager side it has marked those tasks as > completed for all stage attempts. The DAGScheduler gets hung here. I did a > heap dump on the process and can see that 55769 is still in the DAGScheduler > pendingPartitions list but the tasksetmanagers are all complete > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
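The deadlocked state described above can be sketched with a toy model (class and field names here are illustrative only, not Spark's actual scheduler internals): the DAGScheduler still counts a partition as pending while every TaskSetManager for the stage believes it is done, so no event will ever arrive to clear it.

```java
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the hang: partition 55769 is still
// pending on the DAGScheduler side, but both attempts of stage 44 are
// complete on the TaskSetManager side, so nothing will ever run the
// partition again and the job hangs.
public class HangSketch {
    record TaskSetManager(int stageAttempt, boolean isZombie, int runningTasks) {}

    public static void main(String[] args) {
        Set<Integer> pendingPartitions = Set.of(55769);  // DAGScheduler side
        List<TaskSetManager> attempts = List.of(
                new TaskSetManager(0, true, 0),          // 44.0: task finished on lost executor
                new TaskSetManager(1, true, 0));         // 44.1: tasks marked complete anyway

        boolean hung = !pendingPartitions.isEmpty()
                && attempts.stream().allMatch(t -> t.isZombie() && t.runningTasks() == 0);
        System.out.println(hung);  // true: pending work with no task set left to run it
    }
}
```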
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554742#comment-16554742 ] Imran Rashid commented on SPARK-24615: -- hi, just catching up here -- agree with a lot of Tom's concerns. There are a lot of corner cases we'd need to think about. Eg., you mentioned that joins in general have stage boundaries -- but then again you might in some cases have a shared partitioner, which would remove the stage boundary. And then there are other cases like {{zip()}}. What if you had two RDDs in one stage with conflicting requirements? Eg. they request different accelerator types, and no node in the cluster has both available? Would you fail? Or introduce a stage boundary? I think failing is fine, and it would be OK if we require the user to put in their own boundary there (eg. by writing to hdfs); I just want to think it through. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator cards (GPU, FPGA, TPU) are > predominant compared to CPUs. To make the current Spark architecture work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule tasks onto the executors where > accelerators are equipped. > Current Spark's scheduler schedules tasks based on the locality of the data > plus the availability of CPUs. This will introduce some problems when scheduling > tasks that require accelerators: > # CPU cores usually outnumber accelerators on one node, so using CPU cores > to schedule accelerator-required tasks will introduce a mismatch. > # In one cluster, we always assume that CPUs are equipped in each node, but > this is not true of accelerator cards. 
> # The existence of heterogeneous tasks (accelerator required or not) > requires the scheduler to schedule tasks in a smart way. > So here we propose to improve the current scheduler to support heterogeneous > tasks (accelerator required or not). This can be part of the work of Project > Hydrogen. > Details are attached in a Google doc. It doesn't cover all the implementation > details, just highlights the parts that should be changed. > > CC [~yanboliang] [~merlintang]
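The conflicting-requirements case raised in the comment above can be made concrete with a small sketch (names are illustrative, not a proposed Spark API): a stage's accelerator demand is the union of its RDDs' requests, and if no node satisfies the union, the scheduler must either fail fast or fall back to a stage boundary.

```java
import java.util.List;
import java.util.Set;

// Hypothetical check: two RDDs zipped into one stage each demand a
// different accelerator, so the stage as a whole needs a node offering both.
public class AcceleratorConflict {
    static boolean schedulable(Set<String> required, Iterable<Set<String>> nodeAccelerators) {
        for (Set<String> node : nodeAccelerators) {
            if (node.containsAll(required)) return true;  // some node satisfies the union
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stageNeeds = Set.of("gpu", "fpga");           // union of both RDDs' requests
        List<Set<String>> cluster = List.of(Set.of("gpu"), Set.of("fpga"));
        // No single node has both -> fail fast, or force a stage boundary.
        System.out.println(schedulable(stageNeeds, cluster));     // false
    }
}
```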
[jira] [Updated] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24909: -- Description: The DAGScheduler can hang if the executor was lost (due to fetch failure) and all the tasks in the tasks sets are marked as completed. ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)] It never creates new task attempts in the task scheduler but the dag scheduler still has pendingPartitions. {code:java} 8/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, PROCESS_LOCAL, 7874 bytes) 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 (repartition at Lift.scala:191) as failed due to a fetch failure from ShuffleMapStage 42 (map at foo.scala:27) 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due to fetch failure 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18) 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for executor: 33 (epoch 18) 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing parents 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with 59955 tasks 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320) 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus ShuffleMapTask(44, 55769) completion from executor 33{code} In the logs above you will see that task 55769.0 finished after the executor was lost and a new task set was started. The DAG scheduler says "Ignoring possibly bogus".. 
but in the TaskSetManager side it has marked those tasks as completed for all stage attempts. The DAGScheduler gets hung here. I did a heap dump on the process and can see that 55769 is still in the DAGScheduler pendingPartitions list but the tasksetmanagers are all complete was: The DAGScheduler can hang if the executor was lost (due to fetch failure) and all the tasks in the tasks sets are marked as completed. ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)] It never creates new task attempts in the task scheduler but the dag scheduler still has pendingPartitions. 18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, PROCESS_LOCAL, 7874 bytes) 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 (repartition at Lift.scala:191) as failed due to a fetch failure from ShuffleMapStage 42 (map at foo.scala:27) 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due to fetch failure 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18) 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for executor: 33 (epoch 18) 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing parents 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with 59955 tasks 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320) 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus ShuffleMapTask(44, 55769) completion from executor 33 In the logs above you will see that task 55769.0 finished after the executor was lost and a new task set was started. 
The DAG scheduler says "Ignoring possibly bogus".. but in the TaskSetManager side it has marked those tasks as completed for all stage attempts. The DAGScheduler gets hung here. I did a heap dump on the process and can see that 55769 is still in the DAGScheduler pendingPartitions list but the tasksetmanagers are all complete > Spark scheduler can hang when fetch failures, executor lost, task running on > lost executor, and multiple stage attempts > --- > > Key: SPARK-24909 > URL: https://issues.apache.org/jira/browse/SPARK-24909 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.1 >Reporter: Thomas Graves >Priority: Critical > > The DAGScheduler can hang if the executor was lost
[jira] [Updated] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24909: -- Summary: Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts (was: Spark scheduler can hang with fetch failures and executor lost and multiple stage attempts) > Spark scheduler can hang when fetch failures, executor lost, task running on > lost executor, and multiple stage attempts > --- > > Key: SPARK-24909 > URL: https://issues.apache.org/jira/browse/SPARK-24909 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.1 >Reporter: Thomas Graves >Priority: Critical > > The DAGScheduler can hang if the executor was lost (due to fetch failure) and > all the tasks in the tasks sets are marked as completed. > ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)] > It never creates new task attempts in the task scheduler but the dag > scheduler still has pendingPartitions. 
> 18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in > stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, > PROCESS_LOCAL, 7874 bytes) > 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 > (repartition at Lift.scala:191) as failed due to a fetch failure from > ShuffleMapStage 42 (map at foo.scala:27) > 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage > 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at > bar.scala:191) due to fetch failure > > 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18) > 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for > executor: 33 (epoch 18) > 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 > (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing > parents > 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 > with 59955 tasks > 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in > stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320) > 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus > ShuffleMapTask(44, 55769) completion from executor 33 > > In the logs above you will see that task 55769.0 finished after the executor > was lost and a new task set was started. The DAG scheduler says "Ignoring > possibly bogus".. but in the TaskSetManager side it has marked those tasks as > completed for all stage attempts. The DAGScheduler gets hung here. I did a > heap dump on the process and can see that 55769 is still in the DAGScheduler > pendingPartitions list but the tasksetmanagers are all complete > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24768) Have a built-in AVRO data source implementation
[ https://issues.apache.org/jira/browse/SPARK-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554715#comment-16554715 ] Apache Spark commented on SPARK-24768: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21866 > Have a built-in AVRO data source implementation > --- > > Key: SPARK-24768 > URL: https://issues.apache.org/jira/browse/SPARK-24768 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > Attachments: Built-in AVRO Data Source In Spark 2.4.pdf > > > Apache Avro (https://avro.apache.org) is a popular data serialization format. > It is widely used in the Spark and Hadoop ecosystem, especially for > Kafka-based data pipelines. Using the external package > [https://github.com/databricks/spark-avro], Spark SQL can read and write > Avro data. Making spark-avro built-in can provide a better experience for > first-time users of Spark SQL and structured streaming. We expect the > built-in Avro data source will further improve the adoption of structured > streaming. The proposal is to inline code from the spark-avro package > ([https://github.com/databricks/spark-avro]). The target release is Spark > 2.4.
[jira] [Commented] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data
[ https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554716#comment-16554716 ] Liang-Chi Hsieh commented on SPARK-24906: - A {{maxPartitionBytes}} value adapted improperly might hurt performance, and in this case users may not know what has happened. Do we have a general rule to calculate a proper {{maxPartitionBytes}}? > Enlarge split size for columnar file to ensure the task read enough data > > > Key: SPARK-24906 > URL: https://issues.apache.org/jira/browse/SPARK-24906 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jason Guo >Priority: Critical > Attachments: image-2018-07-24-20-26-32-441.png, > image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, > image-2018-07-24-20-30-24-552.png > > > For columnar files, when Spark SQL reads a table, each split will be > 128 MB by default since spark.sql.files.maxPartitionBytes defaults to > 128 MB. Even when users set it to a large value, such as 512 MB, a task may > read only a few MB or even hundreds of KB, because the table (Parquet) may > consist of dozens of columns while the SQL needs only a few of them, and Spark > will prune the unnecessary columns. > > In this case, Spark's DataSourceScanExec can enlarge maxPartitionBytes > adaptively. > For example, suppose there are 40 columns: 20 are integer while the other 20 are long. > When a query reads one integer column and one long column, > maxPartitionBytes should be 20 times larger: (20*4 + 20*8) / (4 + 8) = 20. > > With this optimization, the number of tasks will be smaller and the job will > run faster. More importantly, for a very large cluster (more than 10 thousand > nodes), it will relieve the RM's scheduling pressure. > > Here is the test. > > The table named test2 has more than 40 columns and there are more than 5 TB of > data each hour. 
> When we issue a very simple query > > {code:java} > select count(device_id) from test2 where date=20180708 and hour='23'{code} > > There are 72176 tasks and the duration of the job is 4.8 minutes. > !image-2018-07-24-20-26-32-441.png! > > Most tasks last less than 1 second and read less than 1.5 MB of data. > !image-2018-07-24-20-28-06-269.png! > > After the optimization, there are only 1615 tasks and the job lasts only 30 > seconds. It is almost 10 times faster. > !image-2018-07-24-20-29-24-797.png! > > The median of data read is 44.2 MB. > !image-2018-07-24-20-30-24-552.png!
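The scaling rule in the description can be sketched as follows (method and variable names are illustrative; Spark's actual DataSourceScanExec does not expose such a helper):

```java
import java.util.stream.IntStream;

// Sketch of the ratio from the description: scale maxPartitionBytes by
// (total row width) / (width of the columns actually read), so each task
// reads roughly the intended number of bytes after column pruning.
public class SplitScale {
    static int scaleFactor(int[] allColumnBytes, int[] readColumnBytes) {
        return IntStream.of(allColumnBytes).sum() / IntStream.of(readColumnBytes).sum();
    }

    public static void main(String[] args) {
        int[] all = new int[40];
        for (int i = 0; i < 40; i++) all[i] = i < 20 ? 4 : 8;  // 20 int + 20 long columns
        int[] read = {4, 8};                                   // one int + one long column read
        int factor = scaleFactor(all, read);                   // (20*4 + 20*8) / (4 + 8) = 20
        System.out.println(128 * factor + " MB effective split");  // 2560 MB effective split
    }
}
```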
[jira] [Commented] (SPARK-24909) Spark scheduler can hang with fetch failures and executor lost and multiple stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554701#comment-16554701 ] Thomas Graves commented on SPARK-24909: --- Note this may have been introduced as part of SPARK-23433 where we now mark all the partitions in stage attempts as completed. If the task that completed had been in the latest attempt it would have been removed from pendingPartitions. If we didn't have SPARK-23433 then I think the stage attempt 44.1 would have gone on to run the task again. cc [~irashid] > Spark scheduler can hang with fetch failures and executor lost and multiple > stage attempts > -- > > Key: SPARK-24909 > URL: https://issues.apache.org/jira/browse/SPARK-24909 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.1 >Reporter: Thomas Graves >Priority: Critical > > The DAGScheduler can hang if the executor was lost (due to fetch failure) and > all the tasks in the tasks sets are marked as completed. > ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)] > It never creates new task attempts in the task scheduler but the dag > scheduler still has pendingPartitions. 
> 18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in > stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, > PROCESS_LOCAL, 7874 bytes) > 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 > (repartition at Lift.scala:191) as failed due to a fetch failure from > ShuffleMapStage 42 (map at foo.scala:27) > 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage > 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at > bar.scala:191) due to fetch failure > > 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18) > 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for > executor: 33 (epoch 18) > 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 > (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing > parents > 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 > with 59955 tasks > 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in > stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320) > 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus > ShuffleMapTask(44, 55769) completion from executor 33 > > In the logs above you will see that task 55769.0 finished after the executor > was lost and a new task set was started. The DAG scheduler says "Ignoring > possibly bogus".. but in the TaskSetManager side it has marked those tasks as > completed for all stage attempts. The DAGScheduler gets hung here. I did a > heap dump on the process and can see that 55769 is still in the DAGScheduler > pendingPartitions list but the tasksetmanagers are all complete > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24909) Spark scheduler can hang with fetch failures and executor lost and multiple stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24909: -- Description: The DAGScheduler can hang if the executor was lost (due to fetch failure) and all the tasks in the tasks sets are marked as completed. ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)] It never creates new task attempts in the task scheduler but the dag scheduler still has pendingPartitions. 18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, PROCESS_LOCAL, 7874 bytes) 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 (repartition at Lift.scala:191) as failed due to a fetch failure from ShuffleMapStage 42 (map at foo.scala:27) 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due to fetch failure 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18) 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for executor: 33 (epoch 18) 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing parents 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with 59955 tasks 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320) 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus ShuffleMapTask(44, 55769) completion from executor 33 In the logs above you will see that task 55769.0 finished after the executor was lost and a new task set was started. The DAG scheduler says "Ignoring possibly bogus".. 
but in the TaskSetManager side it has marked those tasks as completed for all stage attempts. The DAGScheduler gets hung here. I did a heap dump on the process and can see that 55769 is still in the DAGScheduler pendingPartitions list but the tasksetmanagers are all complete was: The DAGScheduler can hang if the executor was lost (due to fetch failure) and all the tasks in the tasks sets are marked as completed. ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)] It never creates new task attempts in the task scheduler but the dag scheduler still has pendingPartitions. 18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, PROCESS_LOCAL, 7874 bytes) 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 (repartition at Lift.scala:191) as failed due to a fetch failure from ShuffleMapStage 42 (map at foo.scala:27) 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due to fetch failure 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18) 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for executor: 33 (epoch 18) 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing parents 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with 59955 tasks 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320) 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus ShuffleMapTask(44, 55769) completion from executor 33 In the logs above you will see that task 55769.0 finished after the executor was lost and a new task set was started. 
The DAG scheduler says "Ignoring possibly bogus".. but in the TaskSetManager side it has marked those tasks as completed for all stage attempts. > Spark scheduler can hang with fetch failures and executor lost and multiple > stage attempts > -- > > Key: SPARK-24909 > URL: https://issues.apache.org/jira/browse/SPARK-24909 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.1 >Reporter: Thomas Graves >Priority: Critical > > The DAGScheduler can hang if the executor was lost (due to fetch failure) and > all the tasks in the tasks sets are marked as completed. > ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)] > It never creates new task attempts in the task
[jira] [Created] (SPARK-24909) Spark scheduler can hang with fetch failures and executor lost and multiple stage attempts
Thomas Graves created SPARK-24909: - Summary: Spark scheduler can hang with fetch failures and executor lost and multiple stage attempts Key: SPARK-24909 URL: https://issues.apache.org/jira/browse/SPARK-24909 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.3.1 Reporter: Thomas Graves The DAGScheduler can hang if the executor was lost (due to fetch failure) and all the tasks in the tasks sets are marked as completed. ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)] It never creates new task attempts in the task scheduler but the dag scheduler still has pendingPartitions. 18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, PROCESS_LOCAL, 7874 bytes) 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 (repartition at Lift.scala:191) as failed due to a fetch failure from ShuffleMapStage 42 (map at foo.scala:27) 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due to fetch failure 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18) 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for executor: 33 (epoch 18) 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing parents 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with 59955 tasks 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320) 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus ShuffleMapTask(44, 55769) completion from executor 33 In the logs above you will see that task 55769.0 finished after the executor was lost and a new task 
set was started. The DAG scheduler says "Ignoring possibly bogus".. but in the TaskSetManager side it has marked those tasks as completed for all stage attempts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24909) Spark scheduler can hang with fetch failures and executor lost and multiple stage attempts
[ https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24909: -- Priority: Critical (was: Major) > Spark scheduler can hang with fetch failures and executor lost and multiple > stage attempts > -- > > Key: SPARK-24909 > URL: https://issues.apache.org/jira/browse/SPARK-24909 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.1 >Reporter: Thomas Graves >Priority: Critical > > The DAGScheduler can hang if the executor was lost (due to fetch failure) and > all the tasks in the tasks sets are marked as completed. > ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)] > It never creates new task attempts in the task scheduler but the dag > scheduler still has pendingPartitions. > 18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in > stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, > PROCESS_LOCAL, 7874 bytes) > 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 > (repartition at Lift.scala:191) as failed due to a fetch failure from > ShuffleMapStage 42 (map at foo.scala:27) > 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage > 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at > bar.scala:191) due to fetch failure > > 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18) > 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for > executor: 33 (epoch 18) > 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 > (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing > parents > 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 > with 59955 tasks > 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in > stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) 
(15081/73320) > 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus > ShuffleMapTask(44, 55769) completion from executor 33 > > In the logs above you will see that task 55769.0 finished after the executor > was lost and a new task set was started. The DAG scheduler says "Ignoring > possibly bogus".. but in the TaskSetManager side it has marked those tasks as > completed for all stage attempts. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24812) Last Access Time in the table description is not valid
[ https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24812. - Resolution: Fixed Assignee: Sujith Fix Version/s: 2.4.0 > Last Access Time in the table description is not valid > -- > > Key: SPARK-24812 > URL: https://issues.apache.org/jira/browse/SPARK-24812 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: Sujith >Assignee: Sujith >Priority: Minor > Fix For: 2.4.0 > > Attachments: image-2018-07-16-15-37-28-896.png, > image-2018-07-16-15-38-26-717.png > > > Last Access Time in the table description is not valid. > Test steps: > Step 1 - create a table > Step 2 - Run command "DESC FORMATTED table" > Last Access Time is always displayed as a wrong date: > Wed Dec 31 15:59:59 PST 1969 - which is wrong. > !image-2018-07-16-15-37-28-896.png! > In Hive it is displayed as "UNKNOWN", which makes more sense than displaying a > wrong date. > Please find the snapshot tested in Hive for the same command > !image-2018-07-16-15-38-26-717.png! > > This seems to be a limitation as of now; it would be better to follow the Hive behavior in > this scenario. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
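For context, the bogus date is exactly what an unset last-access time of -1 milliseconds looks like when formatted naively in PST. A hedged sketch (not Spark's actual code path) of both the symptom and a Hive-style fix:

```python
from datetime import datetime, timezone, timedelta

def format_last_access(millis):
    # Treat the "unset" sentinel the way Hive's own DESCRIBE output does,
    # instead of formatting it as an epoch timestamp.
    if millis <= 0:
        return "UNKNOWN"
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc).isoformat()

# An unset value of -1 ms, naively rendered in PST (UTC-8), lands one
# millisecond before the epoch: Wed Dec 31 15:59:59 PST 1969 -- the wrong
# date reported above.
pst = timezone(timedelta(hours=-8))
naive = datetime.fromtimestamp(-1 / 1000, tz=pst)
print(naive.year, format_last_access(-1))
```

A real last-access timestamp (positive milliseconds) still formats normally; only the sentinel is special-cased.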
[jira] [Assigned] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-24895: -- Assignee: Eric Chang > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > > Spark 2.4.0 has Maven build errors because artifacts uploaded to apache maven > repo has mismatched filenames: > {noformat} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > {noformat} > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml} > > org.apache.spark > spark-mllib-local_2.11 > 2.4.0-SNAPSHOT > > > 20180723.232411 > 177 > > 20180723232411 > > > jar > 
2.4.0-20180723.232411-177 > 20180723232411 > > > pom > 2.4.0-20180723.232411-177 > 20180723232411 > > > tests > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > test-sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > > > {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
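The mismatch above is mechanical enough to detect from maven-metadata.xml alone. A sketch (using a trimmed-down, hypothetical metadata snippet modeled on the values quoted in the issue) that flags snapshotVersion entries whose value disagrees with the advertised snapshot timestamp:

```python
import xml.etree.ElementTree as ET

# Minimal reproduction of the broken metadata: the <snapshot> block advertises
# timestamp 20180723.232411, but one <snapshotVersion> entry still carries the
# previous deploy's 20180723.232410 value.
METADATA = """<metadata>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib-local_2.11</artifactId>
  <version>2.4.0-SNAPSHOT</version>
  <versioning>
    <snapshot><timestamp>20180723.232411</timestamp><buildNumber>177</buildNumber></snapshot>
    <snapshotVersions>
      <snapshotVersion><extension>jar</extension><value>2.4.0-20180723.232411-177</value></snapshotVersion>
      <snapshotVersion><extension>pom</extension><value>2.4.0-20180723.232411-177</value></snapshotVersion>
      <snapshotVersion><classifier>sources</classifier><extension>jar</extension><value>2.4.0-20180723.232410-177</value></snapshotVersion>
    </snapshotVersions>
  </versioning>
</metadata>"""

def mismatched_snapshot_versions(xml_text):
    root = ET.fromstring(xml_text)
    ts = root.findtext("versioning/snapshot/timestamp")
    build = root.findtext("versioning/snapshot/buildNumber")
    base = root.findtext("version").replace("-SNAPSHOT", "")
    expected = f"{base}-{ts}-{build}"
    # Collect every entry that does not match the advertised snapshot version.
    return [(sv.findtext("classifier"), sv.findtext("extension"), sv.findtext("value"))
            for sv in root.iterfind("versioning/snapshotVersions/snapshotVersion")
            if sv.findtext("value") != expected]

print(mismatched_snapshot_versions(METADATA))
```

Running the same check against the real repository metadata would surface exactly the classifier/extension pairs whose filenames Maven then fails to resolve.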
[jira] [Updated] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-24895: --- Description: Spark 2.4.0 has Maven build errors because artifacts uploaded to apache maven repo has mismatched filenames: {noformat} [ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce (enforce-banned-dependencies) on project spark_2.4: Execution enforce-banned-dependencies of goal org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: Could not resolve following dependencies: [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not resolve dependencies for project com.databricks:spark_2.4:pom:1: The following artifacts could not be resolved: org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] {noformat} If you check the artifact metadata you will see the pom and jar files are 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: {code:xml} org.apache.spark spark-mllib-local_2.11 2.4.0-SNAPSHOT 20180723.232411 177 20180723232411 jar 2.4.0-20180723.232411-177 20180723232411 pom 2.4.0-20180723.232411-177 20180723232411 tests jar 2.4.0-20180723.232410-177 20180723232411 sources jar 2.4.0-20180723.232410-177 20180723232411 test-sources jar 2.4.0-20180723.232410-177 20180723232411 {code} This behavior is very similar to this issue: https://issues.apache.org/jira/browse/MDEPLOY-221 Since 2.3.0 snapshots work with the same maven 
3.3.9 version and maven deploy 2.8.2 plugin, it is highly possible that we introduced a new plugin that causes this. The most recent addition is the spot-bugs plugin, which is known to have incompatibilities with other plugins: [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] We may want to try building without it to sanity check. was: Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven repo has mismatched filenames: {code} [ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce (enforce-banned-dependencies) on project spark_2.4: Execution enforce-banned-dependencies of goal org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: Could not resolve following dependencies: [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not resolve dependencies for project com.databricks:spark_2.4:pom:1: The following artifacts could not be resolved: org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] {code} If you check the artifact metadata you will see the pom and jar files are 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: {code:xml} org.apache.spark spark-mllib-local_2.11 2.4.0-SNAPSHOT 20180723.232411 177 20180723232411 jar 2.4.0-20180723.232411-177 20180723232411 pom 2.4.0-20180723.232411-177 20180723232411 tests jar 2.4.0-20180723.232410-177 20180723232411 sources jar 2.4.0-20180723.232410-177 20180723232411 test-sources jar 
2.4.0-20180723.232410-177 20180723232411 {code} This behavior is very similar to this issue: https://issues.apache.org/jira/browse/MDEPLOY-221 Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 2.8.2 plugin, it is highly possible that we introduced a new plugin that causes this. The most recent addition is the spot-bugs plugin, which is known to have incompatibilities with other plugins: [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] We may want to try building without it to sanity check. > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames >
[jira] [Commented] (SPARK-24894) Invalid DNS name due to hostname truncation
[ https://issues.apache.org/jira/browse/SPARK-24894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554602#comment-16554602 ] Yinan Li commented on SPARK-24894: -- [~mcheah]. We need to make sure the truncation leads to a valid hostname. > Invalid DNS name due to hostname truncation > > > Key: SPARK-24894 > URL: https://issues.apache.org/jira/browse/SPARK-24894 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Dharmesh Kakadia >Priority: Major > > The hostname truncation happening here > [https://github.com/apache/spark/blob/5ff1b9ba1983d5601add62aef64a3e87d07050eb/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L77] > is problematic and can lead to DNS names starting with "-". > Originally filed here: > https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/229 > ``` > {{2018-07-23 21:21:42 ERROR Utils:91 - Uncaught exception in thread > kubernetes-pod-allocator > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/default/pods. > Message: Pod > "user-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9" is > invalid: spec.hostname: Invalid value: > "-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9": a DNS-1123 > label must consist of lower case alphanumeric characters or '-', and must > start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', > regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'). Received > status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.hostname, > message=Invalid value: > "-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9": a DNS-1123 > label must consist of lower case alphanumeric characters or '-', and must > start and end with an alphanumeric character (e.g. 
'my-name', or '123-abc', > regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), > reason=FieldValueInvalid, additionalProperties={})], group=null, kind=Pod, > name=user-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9, > retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, > message=Pod > "user-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9" is > invalid: spec.hostname: Invalid value: > "-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9": a DNS-1123 > label must consist of lower case alphanumeric characters or '-', and must > start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', > regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), > metadata=ListMeta(resourceVersion=null, selfLink=null, > additionalProperties={}), reason=Invalid, status=Failure, > additionalProperties={}). at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:470) > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:409) > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379) > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343) > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:226) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:769) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:356) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3$$anonfun$apply$3.apply(KubernetesClusterSchedulerBackend.scala:140) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3$$anonfun$apply$3.apply(KubernetesClusterSchedulerBackend.scala:140) > at org.apache.spark.util.Utils$.tryLog(Utils.scala:1922) at > 
org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3.apply(KubernetesClusterSchedulerBackend.scala:139) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3.apply(KubernetesClusterSchedulerBackend.scala:138) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at
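The underlying validation rule is the DNS-1123 label regex quoted in the error message above. One possible shape of a fix (an illustrative helper, not the actual Spark patch) is to trim to the 63-character label limit first and then strip invalid edge characters:

```python
import re

# The validation regex quoted verbatim in the Kubernetes error message.
DNS1123_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")

def dns1123_safe_truncate(name, max_len=63):
    # Naive truncation of a long pod name can leave a label that starts or
    # ends with '-', which fails DNS-1123 validation. Enforce the length
    # limit first, then strip '-' from both ends.
    label = name.lower()[:max_len]
    return label.strip("-")

# The rejected hostname from the stack trace above:
bad = "-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9"
fixed = dns1123_safe_truncate(bad)
print(DNS1123_LABEL.match(bad) is not None,
      DNS1123_LABEL.match(fixed) is not None)
```

Stripping the edges can shorten the label slightly, which is harmless; what matters is that the result always starts and ends with an alphanumeric character.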
[jira] [Commented] (SPARK-24908) [R] remove spaces to make lintr happy
[ https://issues.apache.org/jira/browse/SPARK-24908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554597#comment-16554597 ] shane knapp commented on SPARK-24908: - i noticed lintr complaining in my PRB build ([https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93509/console)] :( `Error: StartTag: invalid element name [68] Execution halted lintr checks passed.` so i checked other R-based PRB builds and saw the same thing: pull request: https://github.com/apache/spark/pull/21835 associated build: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93402/consoleFull] i'm guessing that this is a red herring. > [R] remove spaces to make lintr happy > - > > Key: SPARK-24908 > URL: https://issues.apache.org/jira/browse/SPARK-24908 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: shane knapp >Priority: Critical > > during my travails in porting spark builds to run on our centos worker, i > managed to recreate (as best i could) the centos environment on our new > ubuntu-testing machine. > while running my initial builds, lintr was crashing on some extraneous spaces > in test_basic.R (see: > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)] > after removing those spaces, the ubuntu build happily passed the lintr tests. > i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build > (see > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),] > which scp'ed a copy of test_basic.R in to the repo after the git clone. > everything seems to be working happily. > i will be creating a pull request for this now. 
> > attn: [~felixcheung] [~shivaram] [~ifilonenko] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
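The offending lint rule is simple to reproduce outside of lintr. A minimal stand-in check for trailing whitespace (the R snippet below is hypothetical content, not the actual test_basic.R diff):

```python
# Tiny stand-in for the lintr trailing-whitespace check that was failing:
# report the 1-based line numbers whose content changes when right-stripped.
def trailing_whitespace_lines(text):
    return [i for i, line in enumerate(text.splitlines(), start=1)
            if line != line.rstrip()]

# Hypothetical R test-file content; lines 1 and 3 carry trailing spaces.
r_source = 'expect_equal(1, 1)  \nexpect_true(TRUE)\nx <- 1 \n'
print(trailing_whitespace_lines(r_source))
```

Removing the reported spaces is exactly the fix the pull request applies to test_basic.R.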
[jira] [Assigned] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24895: Assignee: Apache Spark > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Assignee: Apache Spark >Priority: Major > > Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven > repo has mismatched filenames: > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > {code} > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml} > > org.apache.spark > spark-mllib-local_2.11 > 2.4.0-SNAPSHOT > > > 20180723.232411 > 177 > > 20180723232411 > > > jar > 
2.4.0-20180723.232411-177 > 20180723232411 > > > pom > 2.4.0-20180723.232411-177 > 20180723232411 > > > tests > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > test-sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > > > {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554595#comment-16554595 ] Apache Spark commented on SPARK-24895: -- User 'ericfchang' has created a pull request for this issue: https://github.com/apache/spark/pull/21865 > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Priority: Major > > Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven > repo has mismatched filenames: > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > {code} > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml} > > org.apache.spark > 
spark-mllib-local_2.11 > 2.4.0-SNAPSHOT > > > 20180723.232411 > 177 > > 20180723232411 > > > jar > 2.4.0-20180723.232411-177 > 20180723232411 > > > pom > 2.4.0-20180723.232411-177 > 20180723232411 > > > tests > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > test-sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > > > {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24895: Assignee: (was: Apache Spark) > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Priority: Major > > Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven > repo has mismatched filenames: > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > {code} > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml} > > org.apache.spark > spark-mllib-local_2.11 > 2.4.0-SNAPSHOT > > > 20180723.232411 > 177 > > 20180723232411 > > > jar > 
2.4.0-20180723.232411-177 > 20180723232411 > > > pom > 2.4.0-20180723.232411-177 > 20180723232411 > > > tests > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > test-sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > > > {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.
[ https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23325. - Resolution: Fixed Assignee: Ryan Blue Fix Version/s: 2.4.0 > DataSourceV2 readers should always produce InternalRow. > --- > > Key: SPARK-23325 > URL: https://issues.apache.org/jira/browse/SPARK-23325 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 2.4.0 > > > DataSourceV2 row-oriented implementations are limited to producing either > {{Row}} instances or {{UnsafeRow}} instances by implementing > {{SupportsScanUnsafeRow}}. Instead, I think that implementations should > always produce {{InternalRow}}. > The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither > one is appropriate for implementers. > File formats don't produce {{Row}} instances or the data values used by > {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation > that uses {{Row}} instances must produce data that is immediately translated > from the representation that was just produced by Spark. In my experience, it > made little sense to translate a timestamp in microseconds to a > (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass > that instance to Spark for immediate translation back. > On the other hand, {{UnsafeRow}} is very difficult to produce unless data is > already held in memory. Even the Parquet support built into Spark > deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce > unsafe rows. When I went to build an implementation that deserializes Parquet > or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be > done without first deserializing into memory because the size of an array > must be known before any values are written. 
> I ended up deciding to deserialize to {{InternalRow}} and use > {{UnsafeProjection}} to convert to unsafe. There are two problems with this: > first, this is Scala and was difficult to call from Java (it required > reflection), and second, this causes double projection in the physical plan > (a copy for unsafe to unsafe) if there is a projection that wasn't fully > pushed to the data source. > I think the solution is to have a single interface for readers that expects > {{InternalRow}}. Then, a projection should be added in the Spark plan to > convert to unsafe and avoid projection in the plan and in the data source. If > the data source already produces unsafe rows by deserializing directly, this > still minimizes the number of copies because the unsafe projection will check > whether the incoming data is already {{UnsafeRow}}. > Using {{InternalRow}} would also match the interface on the write side. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
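The remark that "the unsafe projection will check whether the incoming data is already {{UnsafeRow}}" is an instance of a general convert-only-if-needed pattern, sketched here with hypothetical stand-in classes (not Spark's real row types):

```python
# Stand-ins for the two row representations discussed above.
class InternalRow:
    def __init__(self, values):
        self.values = values

class UnsafeRow(InternalRow):
    pass  # placeholder for Spark's binary row format

def to_unsafe(row):
    # Convert-only-if-needed: if the source already produced an UnsafeRow,
    # the projection is the identity and no copy is made; otherwise convert.
    if isinstance(row, UnsafeRow):
        return row
    return UnsafeRow(list(row.values))

plain = InternalRow([1, "a"])
already_unsafe = UnsafeRow([2, "b"])
print(to_unsafe(plain) is plain, to_unsafe(already_unsafe) is already_unsafe)
```

This is why a single InternalRow-based reader interface minimizes copies: sources that already emit the unsafe representation pass through the added projection untouched.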
[jira] [Commented] (SPARK-24908) [R] remove spaces to make lintr happy
[ https://issues.apache.org/jira/browse/SPARK-24908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554561#comment-16554561 ] Apache Spark commented on SPARK-24908: -- User 'shaneknapp' has created a pull request for this issue: https://github.com/apache/spark/pull/21864 > [R] remove spaces to make lintr happy > - > > Key: SPARK-24908 > URL: https://issues.apache.org/jira/browse/SPARK-24908 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: shane knapp >Priority: Critical > > during my travails in porting spark builds to run on our centos worker, i > managed to recreate (as best i could) the centos environment on our new > ubuntu-testing machine. > while running my initial builds, lintr was crashing on some extraneous spaces > in test_basic.R (see: > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)] > after removing those spaces, the ubuntu build happily passed the lintr tests. > i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build > (see > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),] > which scp'ed a copy of test_basic.R in to the repo after the git clone. > everything seems to be working happily. > i will be creating a pull request for this now. > > attn: [~felixcheung] [~shivaram] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24908) [R] remove spaces to make lintr happy
[ https://issues.apache.org/jira/browse/SPARK-24908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24908: Assignee: shane knapp (was: Apache Spark) > [R] remove spaces to make lintr happy > - > > Key: SPARK-24908 > URL: https://issues.apache.org/jira/browse/SPARK-24908 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: shane knapp >Priority: Critical > > during my travails in porting spark builds to run on our centos worker, i > managed to recreate (as best i could) the centos environment on our new > ubuntu-testing machine. > while running my initial builds, lintr was crashing on some extraneous spaces > in test_basic.R (see: > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)] > after removing those spaces, the ubuntu build happily passed the lintr tests. > i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build > (see > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),] > which scp'ed a copy of test_basic.R in to the repo after the git clone. > everything seems to be working happily. > i will be creating a pull request for this now. > > attn: [~felixcheung] [~shivaram] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24908) [R] remove spaces to make lintr happy
[ https://issues.apache.org/jira/browse/SPARK-24908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp updated SPARK-24908: Description: during my travails in porting spark builds to run on our centos worker, i managed to recreate (as best i could) the centos environment on our new ubuntu-testing machine. while running my initial builds, lintr was crashing on some extraneous spaces in test_basic.R (see: [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)] after removing those spaces, the ubuntu build happily passed the lintr tests. i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build (see [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),] which scp'ed a copy of test_basic.R in to the repo after the git clone. everything seems to be working happily. i will be creating a pull request for this now. attn: [~felixcheung] [~shivaram] [~ifilonenko] was: during my travails in porting spark builds to run on our centos worker, i managed to recreate (as best i could) the centos environment on our new ubuntu-testing machine. while running my initial builds, lintr was crashing on some extraneous spaces in test_basic.R (see: [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)] after removing those spaces, the ubuntu build happily passed the lintr tests. i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build (see [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),] which scp'ed a copy of test_basic.R in to the repo after the git clone. everything seems to be working happily. i will be creating a pull request for this now. 
attn: [~felixcheung] [~shivaram] > [R] remove spaces to make lintr happy > - > > Key: SPARK-24908 > URL: https://issues.apache.org/jira/browse/SPARK-24908 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: shane knapp >Priority: Critical > > during my travails in porting spark builds to run on our centos worker, i > managed to recreate (as best i could) the centos environment on our new > ubuntu-testing machine. > while running my initial builds, lintr was crashing on some extraneous spaces > in test_basic.R (see: > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)] > after removing those spaces, the ubuntu build happily passed the lintr tests. > i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build > (see > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),] > which scp'ed a copy of test_basic.R in to the repo after the git clone. > everything seems to be working happily. > i will be creating a pull request for this now. > > attn: [~felixcheung] [~shivaram] [~ifilonenko] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24908) [R] remove spaces to make lintr happy
[ https://issues.apache.org/jira/browse/SPARK-24908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24908: Assignee: Apache Spark (was: shane knapp) > [R] remove spaces to make lintr happy > - > > Key: SPARK-24908 > URL: https://issues.apache.org/jira/browse/SPARK-24908 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: Apache Spark >Priority: Critical > > during my travails in porting spark builds to run on our centos worker, i > managed to recreate (as best i could) the centos environment on our new > ubuntu-testing machine. > while running my initial builds, lintr was crashing on some extraneous spaces > in test_basic.R (see: > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)] > after removing those spaces, the ubuntu build happily passed the lintr tests. > i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build > (see > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),] > which scp'ed a copy of test_basic.R in to the repo after the git clone. > everything seems to be working happily. > i will be creating a pull request for this now. > > attn: [~felixcheung] [~shivaram] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24908) [R] remove spaces to make lintr happy
shane knapp created SPARK-24908: --- Summary: [R] remove spaces to make lintr happy Key: SPARK-24908 URL: https://issues.apache.org/jira/browse/SPARK-24908 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.3.1 Reporter: shane knapp Assignee: shane knapp during my travails in porting spark builds to run on our centos worker, i managed to recreate (as best i could) the centos environment on our new ubuntu-testing machine. while running my initial builds, lintr was crashing on some extraneous spaces in test_basic.R (see: [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)] after removing those spaces, the ubuntu build happily passed the lintr tests. i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build (see [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),] which scp'ed a copy of test_basic.R in to the repo after the git clone. everything seems to be working happily. i will be creating a pull request for this now. attn: [~felixcheung] [~shivaram] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase
[ https://issues.apache.org/jira/browse/SPARK-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554431#comment-16554431 ] Apache Spark commented on SPARK-18874: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/21863 > First phase: Deferring the correlated predicate pull up to Optimizer phase > -- > > Key: SPARK-18874 > URL: https://issues.apache.org/jira/browse/SPARK-18874 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Nattavut Sutyanyong >Assignee: Dilip Biswal >Priority: Major > Fix For: 2.2.0 > > Attachments: SPARK-18874-3.pdf > > > This JIRA implements the first phase of SPARK-18455 by deferring the > correlated predicate pull up from Analyzer to Optimizer. The goal is to > preserve the current functionality of subquery in Spark 2.0 (if it works, it > continues to work after this JIRA, if it does not, it won't). The performance > of subquery processing is expected to be at par with Spark 2.0. > The representation of the LogicalPlan after Analyzer will be different after > this JIRA that it will preserve the original positions of correlated > predicates in a subquery. This new representation is a preparation work for > the second phase of extending the support of correlated subquery to cases > Spark 2.0 does not support such as deep correlation, outer references in > SELECT clause. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24581) Design: BarrierTaskContext.barrier()
[ https://issues.apache.org/jira/browse/SPARK-24581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554409#comment-16554409 ] Jiang Xingbo commented on SPARK-24581: -- Design doc: https://docs.google.com/document/d/1r07-vU5JTH6s1jJ6azkmK0K5it6jwpfO6b_K3mJmxR4/edit?usp=sharing > Design: BarrierTaskContext.barrier() > > > Key: SPARK-24581 > URL: https://issues.apache.org/jira/browse/SPARK-24581 > Project: Spark > Issue Type: Story > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > We need to provide a communication barrier function to users to help > coordinate tasks within a barrier stage. This is very similar to MPI_Barrier > function in MPI. This story is for its design. > > Requirements: > * Low-latency. The tasks should be unblocked soon after all tasks have > reached this barrier. The latency is more important than CPU cycles here. > * Support unlimited timeout with proper logging. For DL tasks, it might take > very long to converge, we should support unlimited timeout with proper > logging. So users know why a task is waiting. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
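The barrier semantics in the requirements above — every task blocks until all tasks in the stage have reached the barrier, with a configurable timeout — can be simulated outside Spark with Python's `threading.Barrier`. This is a conceptual sketch only, not the `BarrierTaskContext` design itself; the function name and structure are illustrative assumptions.

```python
import threading

def run_barrier_stage(num_tasks, work, timeout=None):
    """Run `num_tasks` workers that all block at a barrier before
    continuing, mimicking the proposed BarrierTaskContext.barrier()."""
    barrier = threading.Barrier(num_tasks)
    results = [None] * num_tasks

    def task(i):
        results[i] = work(i)           # phase 1: independent work
        barrier.wait(timeout=timeout)  # block until every task arrives
        results[i] = (results[i], "past-barrier")

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Passing `timeout=None` corresponds to the "unlimited timeout" requirement; a real implementation would additionally log which tasks are still being waited on.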
[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data
[ https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554404#comment-16554404 ] Brad commented on SPARK-21097: -- Hi [~menelaus], I have stalled out on this project, if you would like to work on it feel free. There is a *lot* of discussion on the PR (it might take a couple of attempts to load the page). The last recommendation from Squito was that I open up a SPIP to get more input. I'm not sure if the final state of the code is 100% correct at this point, I was making a lot of changes at the end. Thanks Brad > Dynamic allocation will preserve cached data > > > Key: SPARK-21097 > URL: https://issues.apache.org/jira/browse/SPARK-21097 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Scheduler, Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Brad >Priority: Major > Attachments: Preserving Cached Data with Dynamic Allocation.pdf > > > We want to use dynamic allocation to distribute resources among many notebook > users on our spark clusters. One difficulty is that if a user has cached data > then we are either prevented from de-allocating any of their executors, or we > are forced to drop their cached data, which can lead to a bad user experience. > We propose adding a feature to preserve cached data by copying it to other > executors before de-allocation. This behavior would be enabled by a simple > spark config. Now when an executor reaches its configured idle timeout, > instead of just killing it on the spot, we will stop sending it new tasks, > replicate all of its rdd blocks onto other executors, and then kill it. If > there is an issue while we replicate the data, like an error, it takes too > long, or there isn't enough space, then we will fall back to the original > behavior and drop the data and kill the executor. > This feature should allow anyone with notebook users to use their cluster > resources more efficiently. 
Also, since it will be completely opt-in, it is > unlikely to cause problems for other use cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
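The decommission flow proposed above — replicate an idle executor's cached blocks to surviving executors, falling back to dropping the data when they don't fit — can be sketched as a simple placement decision in Python. The names and the greedy policy here are illustrative assumptions, not the logic of the actual PR:

```python
def decommission_executor(cached_blocks, free_space):
    """Decide what happens to an idle executor's cached RDD blocks.

    cached_blocks: {block_id: size_in_bytes} on the idle executor.
    free_space: {executor_id: free_bytes} on the surviving executors.
    Returns (action, placement) where action is 'replicated' or 'dropped'.
    """
    placement = {}
    remaining = dict(free_space)
    # Greedily place each block, largest first, on the executor with the
    # most free space left.
    for block, size in sorted(cached_blocks.items(), key=lambda kv: -kv[1]):
        target = max(remaining, key=remaining.get, default=None)
        if target is None or remaining[target] < size:
            # Not enough room anywhere: fall back to the old behaviour
            # (drop the cached data and kill the executor).
            return "dropped", {}
        placement[block] = target
        remaining[target] -= size
    return "replicated", placement
```

Either outcome ends with the executor being killed; the difference is only whether its cached blocks survive elsewhere.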
[jira] [Assigned] (SPARK-24903) Allow driver container name to be configurable in k8 spark
[ https://issues.apache.org/jira/browse/SPARK-24903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24903: Assignee: Apache Spark > Allow driver container name to be configurable in k8 spark > -- > > Key: SPARK-24903 > URL: https://issues.apache.org/jira/browse/SPARK-24903 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Yifei Huang >Assignee: Apache Spark >Priority: Major > > We'd like to expose the container name as a configurable value. In case it > changes in spark, we want to be able to override it to guarantee consistency > on our end. > https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L76 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24903) Allow driver container name to be configurable in k8 spark
[ https://issues.apache.org/jira/browse/SPARK-24903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554379#comment-16554379 ] Apache Spark commented on SPARK-24903: -- User 'yifeih' has created a pull request for this issue: https://github.com/apache/spark/pull/21862 > Allow driver container name to be configurable in k8 spark > -- > > Key: SPARK-24903 > URL: https://issues.apache.org/jira/browse/SPARK-24903 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Yifei Huang >Priority: Major > > We'd like to expose the container name as a configurable value. In case it > changes in spark, we want to be able to override it to guarantee consistency > on our end. > https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L76 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24903) Allow driver container name to be configurable in k8 spark
[ https://issues.apache.org/jira/browse/SPARK-24903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24903: Assignee: (was: Apache Spark) > Allow driver container name to be configurable in k8 spark > -- > > Key: SPARK-24903 > URL: https://issues.apache.org/jira/browse/SPARK-24903 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Yifei Huang >Priority: Major > > We'd like to expose the container name as a configurable value. In case it > changes in spark, we want to be able to override it to guarantee consistency > on our end. > https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L76 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data
[ https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554340#comment-16554340 ] Takeshi Yamamuro commented on SPARK-24906: -- Ah, I see. It make some sense to me. In DataSourceScanExec, in case that `requiredSchema` has the relatively smaller number of columns than `dataSchema`, we might consider an additional term to make `maxSplitBytes` larger in `createNonBucketedReadRDD`. https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L316 WDYT? [~smilegator] [~viirya] > Enlarge split size for columnar file to ensure the task read enough data > > > Key: SPARK-24906 > URL: https://issues.apache.org/jira/browse/SPARK-24906 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jason Guo >Priority: Critical > Attachments: image-2018-07-24-20-26-32-441.png, > image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, > image-2018-07-24-20-30-24-552.png > > > For columnar file, such as, when spark sql read the table, each split will be > 128 MB by default since spark.sql.files.maxPartitionBytes is default to > 128MB. Even when user set it to a large value, such as 512MB, the task may > read only few MB or even hundreds of KB. Because the table (Parquet) may > consists of dozens of columns while the SQL only need few columns. And spark > will prune the unnecessary columns. > > In this case, spark DataSourceScanExec can enlarge maxPartitionBytes > adaptively. > For example, there is 40 columns , 20 are integer while another 20 are long. > When use query on an integer type column and an long type column, the > maxPartitionBytes should be 20 times larger. (20*4+20*8) / (4+8) = 20. > > With this optimization, the number of task will be smaller and the job will > run faster. 
More importantly, for a very large cluster (more the 10 thousand > nodes), it will relieve RM's schedule pressure. > > Here is the test > > The table named test2 has more than 40 columns and there are more than 5 TB > data each hour. > When we issue a very simple query > > {code:java} > select count(device_id) from test2 where date=20180708 and hour='23'{code} > > There are 72176 tasks and the duration of the job is 4.8 minutes > !image-2018-07-24-20-26-32-441.png! > > Most tasks last less than 1 second and read less than 1.5 MB data > !image-2018-07-24-20-28-06-269.png! > > After the optimization, there are only 1615 tasks and the job last only 30 > seconds. It almost 10 times faster. > !image-2018-07-24-20-29-24-797.png! > > The median of read data is 44.2MB. > !image-2018-07-24-20-30-24-552.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554327#comment-16554327 ] Thomas Graves commented on SPARK-24615: --- Right so I think part of this is trying to make it more obvious to the user what the scope actually is. This in some ways is similar to caching. I see people many times force an evaluation after a cache to force the data to actually be cached because otherwise it might not do what they expect. that is why I mentioned the .eval() type functionality. For the example: val rddA = rdd.withResources.mapPartitions() val rddB = rdd.withResources.mapPartitions() val rddC = rddA.join(rddB) Above the mapPartitions would normally get their own stages correct? So I would think those stages would be with the resources specific but the join would be with the default resources. then you wouldn't have to worry about merging, etc. But you have the case with map or others where they wouldn't normally get their own stage so the question is perhaps should they, or do you provide something to? > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. 
> # There are usually more CPU cores than accelerators on one node, so using CPU cores > to schedule accelerator-required tasks introduces a mismatch. > # In one cluster, we can always assume that CPUs are present on each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator-required or not) > requires the scheduler to schedule tasks in a smart way. > So here we propose to improve the current scheduler to support heterogeneous > tasks (accelerator-required or not). This can be part of the work of Project > Hydrogen. > Details are attached in a Google doc. It doesn't cover all the implementation > details, just highlights the parts that should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
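The mismatch in point 1 can be shown with a toy placement function in Python (entirely hypothetical, not the proposed Spark API): a CPU-only policy would happily place a GPU task on an executor with many free cores but no accelerator, while an accelerator-aware policy checks both resources before placing the task.

```python
def pick_executor(executors, cpus_needed, gpus_needed):
    """Return the first executor that satisfies BOTH the CPU and the
    accelerator demand of a task, or None if no executor fits.

    executors: {executor_id: {"cpus": free_cores, "gpus": free_gpus}}
    """
    for exec_id, res in executors.items():
        if res["cpus"] >= cpus_needed and res["gpus"] >= gpus_needed:
            return exec_id
    return None
```

A CPU-only scheduler is the same loop without the `gpus` check — which is exactly how an accelerator task can land on a GPU-less node.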
[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554306#comment-16554306 ] Thomas Graves commented on SPARK-23128: --- we also did some initial evaluation with it as well and it was looking good so that is why I pinged on here. We should work to get it in rather then everyone having their own version. It might be nice to post on https://issues.apache.org/jira/browse/SPARK-9850 since it initial version but I don't see any activity there. Ideally we would have a committer that is more familiar with the sql code shepherd it in. I'm just still learning the sql side of the code here so don't consider myself an expert there. Have you posted this to the dev list at all? We should probably have a SPIP for this, which your doc above should pretty much cover, although you may want to make sure its up to date. So I think the first step would be to post to the dev list to get any feedback and see if someone else is willing to volunteer to review. Could you just post a DISCUSS thread about it? If no one else will review I will, it may just take me longer. > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Priority: Major > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. 
First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
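The runtime reducer-count adjustment discussed above — a coordinator folding small post-shuffle partitions together based on map-output statistics — can be sketched in a few lines of Python. This is an illustration of the general technique only, not the actual ExchangeCoordinator implementation:

```python
def coalesce_partitions(partition_bytes, target_size):
    """Group adjacent post-shuffle partitions until each group reaches
    roughly `target_size`, choosing the reducer count at runtime from
    map-output statistics instead of a fixed shuffle partition number.

    partition_bytes: per-partition output sizes reported by the map stage.
    Returns a list of groups, each a list of original partition indices.
    """
    groups, current, current_size = [], [], 0
    for i, size in enumerate(partition_bytes):
        current.append(i)
        current_size += size
        if current_size >= target_size:
            groups.append(current)
            current, current_size = [], 0
    if current:
        groups.append(current)
    return groups
```

Each resulting group becomes one reducer, so many tiny map outputs no longer translate into many tiny reduce tasks.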
[jira] [Assigned] (SPARK-24907) Migrate JDBC data source to DataSource API v2
[ https://issues.apache.org/jira/browse/SPARK-24907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24907: Assignee: Apache Spark > Migrate JDBC data source to DataSource API v2 > - > > Key: SPARK-24907 > URL: https://issues.apache.org/jira/browse/SPARK-24907 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Teng Peng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24907) Migrate JDBC data source to DataSource API v2
[ https://issues.apache.org/jira/browse/SPARK-24907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554221#comment-16554221 ] Apache Spark commented on SPARK-24907: -- User 'tengpeng' has created a pull request for this issue: https://github.com/apache/spark/pull/21861 > Migrate JDBC data source to DataSource API v2 > - > > Key: SPARK-24907 > URL: https://issues.apache.org/jira/browse/SPARK-24907 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Teng Peng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24907) Migrate JDBC data source to DataSource API v2
[ https://issues.apache.org/jira/browse/SPARK-24907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24907: Assignee: (was: Apache Spark) > Migrate JDBC data source to DataSource API v2 > - > > Key: SPARK-24907 > URL: https://issues.apache.org/jira/browse/SPARK-24907 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Teng Peng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554210#comment-16554210 ] Jackey Lee commented on SPARK-24630: In SQLStreaming, we also support standard SQL as batch queries, except for the stream keyword. For example, you can run this query to do some window aggregations, which is similar to batch queries. select stream count(*) from kafka_sql_test group by window(timestamp, '10 seconds', '5 seconds') Are there any other queries you can think of that aren't supported by the batch syntax? > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL (which is actually based on Calcite), > SQLStream, and StormSQL all provide a stream-type SQL interface, with which users > with little knowledge about streaming can easily develop a flow-processing > model. In Spark, we can also support a SQL API based on > Structured Streaming. > To support SQL Streaming, there are two key points: > 1. Analysis should be able to parse streaming-type SQL. > 2. Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
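The `stream` keyword handling the comment describes — everything after the keyword is ordinary batch SQL — can be sketched as a trivial pre-parse step in Python. This is a hypothetical illustration of the idea, not the SPIP's proposed parser changes:

```python
import re

def parse_streaming_sql(sql):
    """Split a SQLStreaming-style query into (is_stream, batch_sql):
    detect the `stream` keyword right after SELECT and strip it,
    leaving a standard batch query for the regular parser."""
    match = re.match(r"(?is)^\s*select\s+stream\s+(.*)$", sql)
    if match:
        return True, "select " + match.group(1)
    return False, sql
```

The flag would then steer the analyzer toward resolving the source as a streaming Relation rather than a batch one.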
[jira] [Created] (SPARK-24907) Migrate JDBC data source to DataSource API v2
Teng Peng created SPARK-24907: - Summary: Migrate JDBC data source to DataSource API v2 Key: SPARK-24907 URL: https://issues.apache.org/jira/browse/SPARK-24907 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.3.0 Reporter: Teng Peng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data
[ https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Guo updated SPARK-24906: -- Description: For columnar file, such as, when spark sql read the table, each split will be 128 MB by default since spark.sql.files.maxPartitionBytes is default to 128MB. Even when user set it to a large value, such as 512MB, the task may read only few MB or even hundreds of KB. Because the table (Parquet) may consists of dozens of columns while the SQL only need few columns. And spark will prune the unnecessary columns. In this case, spark DataSourceScanExec can enlarge maxPartitionBytes adaptively. For example, there is 40 columns , 20 are integer while another 20 are long. When use query on an integer type column and an long type column, the maxPartitionBytes should be 20 times larger. (20*4+20*8) / (4+8) = 20. With this optimization, the number of task will be smaller and the job will run faster. More importantly, for a very large cluster (more the 10 thousand nodes), it will relieve RM's schedule pressure. Here is the test The table named test2 has more than 40 columns and there are more than 5 TB data each hour. When we issue a very simple query {code:java} select count(device_id) from test2 where date=20180708 and hour='23'{code} There are 72176 tasks and the duration of the job is 4.8 minutes !image-2018-07-24-20-26-32-441.png! Most tasks last less than 1 second and read less than 1.5 MB data !image-2018-07-24-20-28-06-269.png! After the optimization, there are only 1615 tasks and the job last only 30 seconds. It almost 10 times faster. !image-2018-07-24-20-29-24-797.png! The median of read data is 44.2MB. !image-2018-07-24-20-30-24-552.png! was: For columnar file, such as, when spark sql read the table, each split will be 128 MB by default since spark.sql.files.maxPartitionBytes is default to 128MB. Even when user set it to a large value, such as 512MB, the task may read only few MB or even hundreds of KB. 
Because the table (Parquet) may consists of dozens of columns while the SQL only need few columns. In this case, spark DataSourceScanExec can enlarge maxPartitionBytes adaptively. For example, there is 40 columns , 20 are integer while another 20 are long. When use query on an integer type column and an long type column, the maxPartitionBytes should be 20 times larger. (20*4+20*8) / (4+8) = 20. With this optimization, the number of task will be smaller and the job will run faster. More importantly, for a very large cluster (more the 10 thousand nodes), it will relieve RM's schedule pressure. > Enlarge split size for columnar file to ensure the task read enough data > > > Key: SPARK-24906 > URL: https://issues.apache.org/jira/browse/SPARK-24906 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jason Guo >Priority: Critical > Attachments: image-2018-07-24-20-26-32-441.png, > image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, > image-2018-07-24-20-30-24-552.png > > > For columnar file, such as, when spark sql read the table, each split will be > 128 MB by default since spark.sql.files.maxPartitionBytes is default to > 128MB. Even when user set it to a large value, such as 512MB, the task may > read only few MB or even hundreds of KB. Because the table (Parquet) may > consists of dozens of columns while the SQL only need few columns. And spark > will prune the unnecessary columns. > > In this case, spark DataSourceScanExec can enlarge maxPartitionBytes > adaptively. > For example, there is 40 columns , 20 are integer while another 20 are long. > When use query on an integer type column and an long type column, the > maxPartitionBytes should be 20 times larger. (20*4+20*8) / (4+8) = 20. > > With this optimization, the number of task will be smaller and the job will > run faster. More importantly, for a very large cluster (more the 10 thousand > nodes), it will relieve RM's schedule pressure. 
> > Here is the test > > The table named test2 has more than 40 columns and there are more than 5 TB > data each hour. > When we issue a very simple query > > {code:java} > select count(device_id) from test2 where date=20180708 and hour='23'{code} > > There are 72176 tasks and the duration of the job is 4.8 minutes > !image-2018-07-24-20-26-32-441.png! > > Most tasks last less than 1 second and read less than 1.5 MB data > !image-2018-07-24-20-28-06-269.png! > > After the optimization, there are only 1615 tasks and the job last only 30 > seconds. It almost 10 times faster. > !image-2018-07-24-20-29-24-797.png! > > The median of read data is 44.2MB. >
[jira] [Updated] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data
[ https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Guo updated SPARK-24906: -- Attachment: image-2018-07-24-20-30-24-552.png > Enlarge split size for columnar file to ensure the task read enough data > > > Key: SPARK-24906 > URL: https://issues.apache.org/jira/browse/SPARK-24906 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jason Guo >Priority: Critical > Attachments: image-2018-07-24-20-26-32-441.png, > image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, > image-2018-07-24-20-30-24-552.png > > > For columnar file, such as, when spark sql read the table, each split will be > 128 MB by default since spark.sql.files.maxPartitionBytes is default to > 128MB. Even when user set it to a large value, such as 512MB, the task may > read only few MB or even hundreds of KB. Because the table (Parquet) may > consists of dozens of columns while the SQL only need few columns. > > In this case, spark DataSourceScanExec can enlarge maxPartitionBytes > adaptively. > For example, there is 40 columns , 20 are integer while another 20 are long. > When use query on an integer type column and an long type column, the > maxPartitionBytes should be 20 times larger. (20*4+20*8) / (4+8) = 20. > > With this optimization, the number of task will be smaller and the job will > run faster. More importantly, for a very large cluster (more the 10 thousand > nodes), it will relieve RM's schedule pressure. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data
[ https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Guo updated SPARK-24906: -- Attachment: image-2018-07-24-20-29-24-797.png > Enlarge split size for columnar file to ensure the task read enough data > > > Key: SPARK-24906 > URL: https://issues.apache.org/jira/browse/SPARK-24906 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jason Guo >Priority: Critical > Attachments: image-2018-07-24-20-26-32-441.png, > image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png > > > For columnar files such as Parquet, when Spark SQL reads a table, each split will be > 128 MB by default since spark.sql.files.maxPartitionBytes defaults to > 128 MB. Even when the user sets it to a larger value, such as 512 MB, a task may > read only a few MB or even hundreds of KB, because the table (Parquet) may > consist of dozens of columns while the SQL query needs only a few of them. > > In this case, Spark's DataSourceScanExec could enlarge maxPartitionBytes > adaptively. > For example, suppose there are 40 columns: 20 are integer while the other 20 are long. > When a query reads one integer column and one long column, > maxPartitionBytes should be 20 times larger: (20*4+20*8) / (4+8) = 20. > > With this optimization, the number of tasks will be smaller and the job will > run faster. More importantly, for a very large cluster (more than 10 thousand > nodes), it will relieve the RM's scheduling pressure.
[jira] [Updated] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data
[ https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Guo updated SPARK-24906: -- Attachment: image-2018-07-24-20-28-06-269.png > Enlarge split size for columnar file to ensure the task read enough data > > > Key: SPARK-24906 > URL: https://issues.apache.org/jira/browse/SPARK-24906 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jason Guo >Priority: Critical > Attachments: image-2018-07-24-20-26-32-441.png, > image-2018-07-24-20-28-06-269.png > > > For columnar files such as Parquet, when Spark SQL reads a table, each split will be > 128 MB by default since spark.sql.files.maxPartitionBytes defaults to > 128 MB. Even when the user sets it to a larger value, such as 512 MB, a task may > read only a few MB or even hundreds of KB, because the table (Parquet) may > consist of dozens of columns while the SQL query needs only a few of them. > > In this case, Spark's DataSourceScanExec could enlarge maxPartitionBytes > adaptively. > For example, suppose there are 40 columns: 20 are integer while the other 20 are long. > When a query reads one integer column and one long column, > maxPartitionBytes should be 20 times larger: (20*4+20*8) / (4+8) = 20. > > With this optimization, the number of tasks will be smaller and the job will > run faster. More importantly, for a very large cluster (more than 10 thousand > nodes), it will relieve the RM's scheduling pressure.
[jira] [Updated] (SPARK-24905) Spark 2.3 Internal URL env variable
[ https://issues.apache.org/jira/browse/SPARK-24905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Björn Wenzel updated SPARK-24905: - Priority: Critical (was: Major) > Spark 2.3 Internal URL env variable > --- > > Key: SPARK-24905 > URL: https://issues.apache.org/jira/browse/SPARK-24905 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Björn Wenzel >Priority: Critical > > Currently the Kubernetes Master internal URL is hardcoded in the > Constants.scala file > ([https://github.com/apache/spark/blob/85fe1297e35bcff9cf86bd53fee615e140ee5bfb/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L75]) > In some cases this URL should be changed, e.g. if the certificate is valid > for another hostname. > Is it possible to make this URL a property like > spark.kubernetes.authenticate.driver.hostname? > Kubernetes The Hard Way, maintained by Kelsey Hightower, for example uses > kubernetes.default as the hostname; this will again produce an > SSLPeerUnverifiedException. > > Here is the use of the hardcoded host: > [https://github.com/apache/spark/blob/85fe1297e35bcff9cf86bd53fee615e140ee5bfb/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala#L52] > Maybe this could be changed like the KUBERNETES_NAMESPACE property in line > 53.
[jira] [Comment Edited] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554043#comment-16554043 ] Genmao Yu edited comment on SPARK-24630 at 7/24/18 11:38 AM: - [~zsxwing] {{Structured Streaming supports standard SQL as the batch queries, so the users can switch their queries between batch and streaming easily.}} IIUC, there are some queries we cannot switch from stream to batch, like "groupBy window aggregation" or "over window aggregation" on a stream. Isn't it? was (Author: unclegen): {{Structured Streaming supports standard SQL as the batch queries, so the users can switch their queries between batch and streaming easily.}} IIUC, there are some queries we cannot switch from stream to batch, like "groupBy window aggregation" or "over window aggregation" on a stream. Isn't it? > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL (which is actually based on Calcite), > SQLStream, and StormSQL all provide a stream-type SQL interface, with which users > with little knowledge about streaming can easily develop a flow > processing model. In Spark, we can also support a SQL API based on > Structured Streaming. > To support SQL Streaming, there are two key points: > 1. Analysis should be able to parse streaming-type SQL. > 2. The Analyzer should be able to map metadata information to the corresponding > Relation.
[jira] [Created] (SPARK-24904) Join with broadcasted dataframe causes shuffle of redundant data
Shay Elbaz created SPARK-24904: -- Summary: Join with broadcasted dataframe causes shuffle of redundant data Key: SPARK-24904 URL: https://issues.apache.org/jira/browse/SPARK-24904 Project: Spark Issue Type: Question Components: SQL Affects Versions: 2.1.2 Reporter: Shay Elbaz When joining a "large" dataframe with a broadcasted small one, and the join type is on the small DF side (see the right join below), the physical plan does not include broadcasting the small table. But when the join is on the large DF side, the broadcast does take place. Is there a good reason for this? In the example below it surely doesn't make any sense to shuffle the entire large table: {code:java} val small = spark.range(1, 10) val big = spark.range(1, 1 << 30) .withColumnRenamed("id", "id2") big.join(broadcast(small), $"id" === $"id2", "right") .explain == Physical Plan == SortMergeJoin [id2#16307L], [id#16310L], RightOuter :- *Sort [id2#16307L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id2#16307L, 1000) : +- *Project [id#16304L AS id2#16307L] : +- *Range (1, 1073741824, step=1, splits=Some(600)) +- *Sort [id#16310L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#16310L, 1000) +- *Range (1, 10, step=1, splits=Some(600)) {code} As a workaround, users need to perform an inner join instead of a right join, and then join the result back with the small DF to fill in the missing rows.
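The workaround at the end of the report (inner join first, then bring back the unmatched small-side rows) is just a manual reconstruction of a right outer join. A minimal sketch of that decomposition with plain Python pairs, independent of Spark's API:

```python
def right_outer_join(big, small):
    """Right outer join of (key, value) pairs, built the way the
    workaround describes: an inner join plus the unmatched
    right-side keys padded with None."""
    big_index = {}
    for k, v in big:
        big_index.setdefault(k, []).append(v)
    result = []
    for k, v in small:
        if k in big_index:                       # the inner-join part
            result.extend((k, big_v, v) for big_v in big_index[k])
        else:                                    # fill in the missing rows
            result.append((k, None, v))
    return result

rows = right_outer_join(big=[(1, "a"), (2, "b")], small=[(2, "x"), (3, "y")])
# the unmatched small-side key 3 survives with None on the big side
```

In Spark the first branch would be the broadcast inner join and the second a small anti-join against its result, so the large table is never shuffled.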
[jira] [Commented] (SPARK-24288) Enable preventing predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554109#comment-16554109 ] Tomasz Gawęda commented on SPARK-24288: --- [~smilegator] Yes, you are right. If we don't want to use barriers (as mentioned by [~rxin] in mail), we can add an option to disable predicate pushdown for the JDBC source. I've also thought about adding a custom optimizer rule, but probably I'm not good enough in Spark internals yet ;) > Enable preventing predicate pushdown > > > Key: SPARK-24288 > URL: https://issues.apache.org/jira/browse/SPARK-24288 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Tomasz Gawęda >Priority: Major > Attachments: SPARK-24288.simple.patch > > > Issue discussed on Mailing List: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html] > While working with the JDBC datasource I saw that many "or" clauses with > non-equality operators cause huge performance degradation of SQL queries > to the database (DB2). For example: > val df = spark.read.format("jdbc").(other options to parallelize > load).load() > df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > > 100)").show() // in the real application the predicates were pushed > many lines below, with many ANDs and ORs > If I use cache() before the where, there is no predicate pushdown of this > "where" clause. However, in a production system caching many sources is a > waste of memory (especially if the pipeline is long and I must cache many > times). There are also a few more workarounds, but it would be great if Spark > supported letting the user prevent predicate pushdown. > > For example: df.withAnalysisBarrier().where(...) ? > > Note that this should not be a global configuration option. 
If I read 2 > DataFrames, df1 and df2, I would like to specify that df1 should not have > some predicates pushed down (though some may be), while df2 should have all > predicates pushed down, even if the target query joins df1 and df2. As far as I > understand the Spark optimizer, if we use functions like `withAnalysisBarrier` > and put an AnalysisBarrier explicitly in the logical plan, then predicates won't be > pushed down into this particular DataFrame, and PP will still be possible on > the second one.
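The per-DataFrame behaviour requested above (a barrier stops pushdown on df1 while df2 keeps it) can be sketched as a toy optimizer rule. `Scan`, `Filter`, and `Barrier` here are illustrative stand-ins for plan nodes, not Spark classes:

```python
class Scan:
    """Leaf node; `pushed` collects predicates pushed into the source."""
    def __init__(self, name):
        self.name, self.pushed = name, []

class Barrier:
    """Marker node that the pushdown rule refuses to look through."""
    def __init__(self, child):
        self.child = child

class Filter:
    def __init__(self, predicate, child):
        self.predicate, self.child = predicate, child

def push_filters(plan):
    """Push a Filter into its Scan unless a Barrier sits in between."""
    if isinstance(plan, Filter):
        child = push_filters(plan.child)
        if isinstance(child, Scan):
            child.pushed.append(plan.predicate)
            return child                  # filter absorbed by the scan
        return Filter(plan.predicate, child)
    if isinstance(plan, Barrier):
        return Barrier(push_filters(plan.child))
    return plan

plan1 = push_filters(Filter("x > 100", Barrier(Scan("df1"))))  # blocked
plan2 = push_filters(Filter("x > 100", Scan("df2")))           # pushed
```

The rule simply pattern-matches parent/child pairs, which is why a single opaque marker node in the plan is enough to stop it, the same mechanism AnalysisBarrier relied on.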
[jira] [Assigned] (SPARK-24901) Merge the codegen of RegularHashMap and fastHashMap to reduce compiler maxCodesize when VectorizedHashMap is false
[ https://issues.apache.org/jira/browse/SPARK-24901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24901: Assignee: Apache Spark > Merge the codegen of RegularHashMap and fastHashMap to reduce compiler > maxCodesize when VectorizedHashMap is false > -- > > Key: SPARK-24901 > URL: https://issues.apache.org/jira/browse/SPARK-24901 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: caoxuewen >Assignee: Apache Spark >Priority: Major > > Currently, the generated code that updates UnsafeRow in hash aggregation > keeps FastHashMap and RegularHashMap as two separate code paths. These two > separate paths are needed only when VectorizedHashMap is true; in other cases we can merge > them to reduce the compiler maxCodesize. Thanks. > case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String) > spark.sparkContext.parallelize( > DistinctAgg(8, 2, 3, 4, "a") :: > DistinctAgg(9, 3, 4, 5, "b") > ::Nil).toDF().createOrReplaceTempView("distinctAgg") > val df = sql("select a,b,e, min(d) as mind, min(case when a > 10 then a else > null end) as mincasea, min(a) as mina from distinctAgg group by a, b, e") > println(org.apache.spark.sql.execution.debug.codegenString(df.queryExecution.executedPlan)) > df.show() > The generated code looks like: > *Before the change:* > Generated code: > /* 001 */ public Object generate(Object[] references) { > /* 002 */ return new GeneratedIteratorForCodegenStage1(references); > /* 003 */ } > /* 004 */ > ... > /* 354 */ > /* 355 */ if (agg_fastAggBuffer_0 != null) { > /* 356 */ // common sub-expressions > /* 357 */ > /* 358 */ // evaluate aggregate function > /* 359 */ agg_agg_isNull_31_0 = true; > /* 360 */ int agg_value_34 = -1; > /* 361 */ > /* 362 */ boolean agg_isNull_32 = agg_fastAggBuffer_0.isNullAt(0); > /* 363 */ int agg_value_35 = agg_isNull_32 ? 
> /* 364 */ -1 : (agg_fastAggBuffer_0.getInt(0)); > /* 365 */ > /* 366 */ if (!agg_isNull_32 && (agg_agg_isNull_31_0 || > /* 367 */ agg_value_34 > agg_value_35)) { > /* 368 */ agg_agg_isNull_31_0 = false; > /* 369 */ agg_value_34 = agg_value_35; > /* 370 */ } > /* 371 */ > /* 372 */ if (!false && (agg_agg_isNull_31_0 || > /* 373 */ agg_value_34 > agg_expr_2_0)) { > /* 374 */ agg_agg_isNull_31_0 = false; > /* 375 */ agg_value_34 = agg_expr_2_0; > /* 376 */ } > /* 377 */ agg_agg_isNull_34_0 = true; > /* 378 */ int agg_value_37 = -1; > /* 379 */ > /* 380 */ boolean agg_isNull_35 = agg_fastAggBuffer_0.isNullAt(1); > /* 381 */ int agg_value_38 = agg_isNull_35 ? > /* 382 */ -1 : (agg_fastAggBuffer_0.getInt(1)); > /* 383 */ > /* 384 */ if (!agg_isNull_35 && (agg_agg_isNull_34_0 || > /* 385 */ agg_value_37 > agg_value_38)) { > /* 386 */ agg_agg_isNull_34_0 = false; > /* 387 */ agg_value_37 = agg_value_38; > /* 388 */ } > /* 389 */ > /* 390 */ byte agg_caseWhenResultState_1 = -1; > /* 391 */ do { > /* 392 */ boolean agg_value_40 = false; > /* 393 */ agg_value_40 = agg_expr_0_0 > 10; > /* 394 */ if (!false && agg_value_40) { > /* 395 */ agg_caseWhenResultState_1 = (byte)(false ? 1 : 0); > /* 396 */ agg_agg_value_39_0 = agg_expr_0_0; > /* 397 */ continue; > /* 398 */ } > /* 399 */ > /* 400 */ agg_caseWhenResultState_1 = (byte)(true ? 1 : 0); > /* 401 */ agg_agg_value_39_0 = -1; > /* 402 */ > /* 403 */ } while (false); > /* 404 */ // TRUE if any condition is met and the result is null, or no > any condition is met. 
> /* 405 */ final boolean agg_isNull_36 = (agg_caseWhenResultState_1 != > 0); > /* 406 */ > /* 407 */ if (!agg_isNull_36 && (agg_agg_isNull_34_0 || > /* 408 */ agg_value_37 > agg_agg_value_39_0)) { > /* 409 */ agg_agg_isNull_34_0 = false; > /* 410 */ agg_value_37 = agg_agg_value_39_0; > /* 411 */ } > /* 412 */ agg_agg_isNull_42_0 = true; > /* 413 */ int agg_value_45 = -1; > /* 414 */ > /* 415 */ boolean agg_isNull_43 = agg_fastAggBuffer_0.isNullAt(2); > /* 416 */ int agg_value_46 = agg_isNull_43 ? > /* 417 */ -1 : (agg_fastAggBuffer_0.getInt(2)); > /* 418 */ > /* 419 */ if (!agg_isNull_43 && (agg_agg_isNull_42_0 || > /* 420 */ agg_value_45 > agg_value_46)) { > /* 421 */ agg_agg_isNull_42_0 = false; > /* 422 */ agg_value_45 = agg_value_46; > /* 423 */ } > /* 424 */ > /* 425 */ if (!false && (agg_agg_isNull_42_0 || > /* 426 */ agg_value_45 > agg_expr_0_0)) { > /* 427
[jira] [Commented] (SPARK-24901) Merge the codegen of RegularHashMap and fastHashMap to reduce compiler maxCodesize when VectorizedHashMap is false
[ https://issues.apache.org/jira/browse/SPARK-24901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554106#comment-16554106 ] Apache Spark commented on SPARK-24901: -- User 'heary-cao' has created a pull request for this issue: https://github.com/apache/spark/pull/21860 > Merge the codegen of RegularHashMap and fastHashMap to reduce compiler > maxCodesize when VectorizedHashMap is false > -- > > Key: SPARK-24901 > URL: https://issues.apache.org/jira/browse/SPARK-24901 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: caoxuewen >Priority: Major > > Currently, Generate code of update UnsafeRow in hash aggregation. > FastHashMap and RegularHashMap are two separate codes,These two separate > codes need only when VectorizedHashMap is true. but other cases, we can merge > together to reduce compiler maxCodesize. thanks. > case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String) > spark.sparkContext.parallelize( > DistinctAgg(8, 2, 3, 4, "a") :: > DistinctAgg(9, 3, 4, 5, "b") > ::Nil).toDF()createOrReplaceTempView("distinctAgg") > val df = sql("select a,b,e, min(d) as mind, min(case when a > 10 then a else > null end) as mincasea, min(a) as mina from distinctAgg group by a, b, e") > println(org.apache.spark.sql.execution.debug.codegenString(df.queryExecution.executedPlan)) > df.show() > Generate code like: > *Before modified:* > Generated code: > /* 001 */ public Object generate(Object[] references) { > /* 002 */ return new GeneratedIteratorForCodegenStage1(references); > /* 003 */ } > /* 004 */ > ... > /* 354 */ > /* 355 */ if (agg_fastAggBuffer_0 != null) { > /* 356 */ // common sub-expressions > /* 357 */ > /* 358 */ // evaluate aggregate function > /* 359 */ agg_agg_isNull_31_0 = true; > /* 360 */ int agg_value_34 = -1; > /* 361 */ > /* 362 */ boolean agg_isNull_32 = agg_fastAggBuffer_0.isNullAt(0); > /* 363 */ int agg_value_35 = agg_isNull_32 ? 
> /* 364 */ -1 : (agg_fastAggBuffer_0.getInt(0)); > /* 365 */ > /* 366 */ if (!agg_isNull_32 && (agg_agg_isNull_31_0 || > /* 367 */ agg_value_34 > agg_value_35)) { > /* 368 */ agg_agg_isNull_31_0 = false; > /* 369 */ agg_value_34 = agg_value_35; > /* 370 */ } > /* 371 */ > /* 372 */ if (!false && (agg_agg_isNull_31_0 || > /* 373 */ agg_value_34 > agg_expr_2_0)) { > /* 374 */ agg_agg_isNull_31_0 = false; > /* 375 */ agg_value_34 = agg_expr_2_0; > /* 376 */ } > /* 377 */ agg_agg_isNull_34_0 = true; > /* 378 */ int agg_value_37 = -1; > /* 379 */ > /* 380 */ boolean agg_isNull_35 = agg_fastAggBuffer_0.isNullAt(1); > /* 381 */ int agg_value_38 = agg_isNull_35 ? > /* 382 */ -1 : (agg_fastAggBuffer_0.getInt(1)); > /* 383 */ > /* 384 */ if (!agg_isNull_35 && (agg_agg_isNull_34_0 || > /* 385 */ agg_value_37 > agg_value_38)) { > /* 386 */ agg_agg_isNull_34_0 = false; > /* 387 */ agg_value_37 = agg_value_38; > /* 388 */ } > /* 389 */ > /* 390 */ byte agg_caseWhenResultState_1 = -1; > /* 391 */ do { > /* 392 */ boolean agg_value_40 = false; > /* 393 */ agg_value_40 = agg_expr_0_0 > 10; > /* 394 */ if (!false && agg_value_40) { > /* 395 */ agg_caseWhenResultState_1 = (byte)(false ? 1 : 0); > /* 396 */ agg_agg_value_39_0 = agg_expr_0_0; > /* 397 */ continue; > /* 398 */ } > /* 399 */ > /* 400 */ agg_caseWhenResultState_1 = (byte)(true ? 1 : 0); > /* 401 */ agg_agg_value_39_0 = -1; > /* 402 */ > /* 403 */ } while (false); > /* 404 */ // TRUE if any condition is met and the result is null, or no > any condition is met. 
> /* 405 */ final boolean agg_isNull_36 = (agg_caseWhenResultState_1 != > 0); > /* 406 */ > /* 407 */ if (!agg_isNull_36 && (agg_agg_isNull_34_0 || > /* 408 */ agg_value_37 > agg_agg_value_39_0)) { > /* 409 */ agg_agg_isNull_34_0 = false; > /* 410 */ agg_value_37 = agg_agg_value_39_0; > /* 411 */ } > /* 412 */ agg_agg_isNull_42_0 = true; > /* 413 */ int agg_value_45 = -1; > /* 414 */ > /* 415 */ boolean agg_isNull_43 = agg_fastAggBuffer_0.isNullAt(2); > /* 416 */ int agg_value_46 = agg_isNull_43 ? > /* 417 */ -1 : (agg_fastAggBuffer_0.getInt(2)); > /* 418 */ > /* 419 */ if (!agg_isNull_43 && (agg_agg_isNull_42_0 || > /* 420 */ agg_value_45 > agg_value_46)) { > /* 421 */ agg_agg_isNull_42_0 = false; > /* 422 */ agg_value_45 = agg_value_46; > /* 423 */ } > /* 424 */ > /* 425 */ if (!false &&
[jira] [Assigned] (SPARK-24901) Merge the codegen of RegularHashMap and fastHashMap to reduce compiler maxCodesize when VectorizedHashMap is false
[ https://issues.apache.org/jira/browse/SPARK-24901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24901: Assignee: (was: Apache Spark) > Merge the codegen of RegularHashMap and fastHashMap to reduce compiler > maxCodesize when VectorizedHashMap is false > -- > > Key: SPARK-24901 > URL: https://issues.apache.org/jira/browse/SPARK-24901 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: caoxuewen >Priority: Major > > Currently, Generate code of update UnsafeRow in hash aggregation. > FastHashMap and RegularHashMap are two separate codes,These two separate > codes need only when VectorizedHashMap is true. but other cases, we can merge > together to reduce compiler maxCodesize. thanks. > case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String) > spark.sparkContext.parallelize( > DistinctAgg(8, 2, 3, 4, "a") :: > DistinctAgg(9, 3, 4, 5, "b") > ::Nil).toDF()createOrReplaceTempView("distinctAgg") > val df = sql("select a,b,e, min(d) as mind, min(case when a > 10 then a else > null end) as mincasea, min(a) as mina from distinctAgg group by a, b, e") > println(org.apache.spark.sql.execution.debug.codegenString(df.queryExecution.executedPlan)) > df.show() > Generate code like: > *Before modified:* > Generated code: > /* 001 */ public Object generate(Object[] references) { > /* 002 */ return new GeneratedIteratorForCodegenStage1(references); > /* 003 */ } > /* 004 */ > ... > /* 354 */ > /* 355 */ if (agg_fastAggBuffer_0 != null) { > /* 356 */ // common sub-expressions > /* 357 */ > /* 358 */ // evaluate aggregate function > /* 359 */ agg_agg_isNull_31_0 = true; > /* 360 */ int agg_value_34 = -1; > /* 361 */ > /* 362 */ boolean agg_isNull_32 = agg_fastAggBuffer_0.isNullAt(0); > /* 363 */ int agg_value_35 = agg_isNull_32 ? 
> /* 364 */ -1 : (agg_fastAggBuffer_0.getInt(0)); > /* 365 */ > /* 366 */ if (!agg_isNull_32 && (agg_agg_isNull_31_0 || > /* 367 */ agg_value_34 > agg_value_35)) { > /* 368 */ agg_agg_isNull_31_0 = false; > /* 369 */ agg_value_34 = agg_value_35; > /* 370 */ } > /* 371 */ > /* 372 */ if (!false && (agg_agg_isNull_31_0 || > /* 373 */ agg_value_34 > agg_expr_2_0)) { > /* 374 */ agg_agg_isNull_31_0 = false; > /* 375 */ agg_value_34 = agg_expr_2_0; > /* 376 */ } > /* 377 */ agg_agg_isNull_34_0 = true; > /* 378 */ int agg_value_37 = -1; > /* 379 */ > /* 380 */ boolean agg_isNull_35 = agg_fastAggBuffer_0.isNullAt(1); > /* 381 */ int agg_value_38 = agg_isNull_35 ? > /* 382 */ -1 : (agg_fastAggBuffer_0.getInt(1)); > /* 383 */ > /* 384 */ if (!agg_isNull_35 && (agg_agg_isNull_34_0 || > /* 385 */ agg_value_37 > agg_value_38)) { > /* 386 */ agg_agg_isNull_34_0 = false; > /* 387 */ agg_value_37 = agg_value_38; > /* 388 */ } > /* 389 */ > /* 390 */ byte agg_caseWhenResultState_1 = -1; > /* 391 */ do { > /* 392 */ boolean agg_value_40 = false; > /* 393 */ agg_value_40 = agg_expr_0_0 > 10; > /* 394 */ if (!false && agg_value_40) { > /* 395 */ agg_caseWhenResultState_1 = (byte)(false ? 1 : 0); > /* 396 */ agg_agg_value_39_0 = agg_expr_0_0; > /* 397 */ continue; > /* 398 */ } > /* 399 */ > /* 400 */ agg_caseWhenResultState_1 = (byte)(true ? 1 : 0); > /* 401 */ agg_agg_value_39_0 = -1; > /* 402 */ > /* 403 */ } while (false); > /* 404 */ // TRUE if any condition is met and the result is null, or no > any condition is met. 
> /* 405 */ final boolean agg_isNull_36 = (agg_caseWhenResultState_1 != > 0); > /* 406 */ > /* 407 */ if (!agg_isNull_36 && (agg_agg_isNull_34_0 || > /* 408 */ agg_value_37 > agg_agg_value_39_0)) { > /* 409 */ agg_agg_isNull_34_0 = false; > /* 410 */ agg_value_37 = agg_agg_value_39_0; > /* 411 */ } > /* 412 */ agg_agg_isNull_42_0 = true; > /* 413 */ int agg_value_45 = -1; > /* 414 */ > /* 415 */ boolean agg_isNull_43 = agg_fastAggBuffer_0.isNullAt(2); > /* 416 */ int agg_value_46 = agg_isNull_43 ? > /* 417 */ -1 : (agg_fastAggBuffer_0.getInt(2)); > /* 418 */ > /* 419 */ if (!agg_isNull_43 && (agg_agg_isNull_42_0 || > /* 420 */ agg_value_45 > agg_value_46)) { > /* 421 */ agg_agg_isNull_42_0 = false; > /* 422 */ agg_value_45 = agg_value_46; > /* 423 */ } > /* 424 */ > /* 425 */ if (!false && (agg_agg_isNull_42_0 || > /* 426 */ agg_value_45 > agg_expr_0_0)) { > /* 427 */
[jira] [Updated] (SPARK-24902) Add integration tests for PVs
[ https://issues.apache.org/jira/browse/SPARK-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-24902: Description: PVs and hostpath support has been added recently (https://github.com/apache/spark/pull/21260/files) for Spark on K8s. We should have some integration tests based on local storage. It is easy to add PVs to minikube and attach them to pods. We could target a known dir like /tmp for the PV path. was: PVs and hostpath support has been added recently for Spark on K8s. We should have some integration tests based on local storage. It is easy to add PVs to minikube and attach them to pods. We could target a known dir like /tmp for the PV path. > Add integration tests for PVs > - > > Key: SPARK-24902 > URL: https://issues.apache.org/jira/browse/SPARK-24902 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > PVs and hostpath support has been added recently > (https://github.com/apache/spark/pull/21260/files) for Spark on K8s. > We should have some integration tests based on local storage. > It is easy to add PVs to minikube and attach them to pods. > We could target a known dir like /tmp for the PV path.
[jira] [Updated] (SPARK-24903) Allow driver container name to be configurable in k8 spark
[ https://issues.apache.org/jira/browse/SPARK-24903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yifei Huang updated SPARK-24903: Description: We'd like to expose the container name as a configurable value. In case it changes in spark, we want to be able to override it to guarantee consistency on our end. https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L76 was:We'd like to expose the container name as a configurable value. In case it changes in spark, we want to be able to override it to guarantee consistency on our end. > Allow driver container name to be configurable in k8 spark > -- > > Key: SPARK-24903 > URL: https://issues.apache.org/jira/browse/SPARK-24903 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Yifei Huang >Priority: Major > > We'd like to expose the container name as a configurable value. In case it > changes in spark, we want to be able to override it to guarantee consistency > on our end. > https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L76 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24903) Allow driver container name to be configurable in k8 spark
Yifei Huang created SPARK-24903: --- Summary: Allow driver container name to be configurable in k8 spark Key: SPARK-24903 URL: https://issues.apache.org/jira/browse/SPARK-24903 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.3.1 Reporter: Yifei Huang We'd like to expose the container name as a configurable value. In case it changes in spark, we want to be able to override it to guarantee consistency on our end. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24902) Add integration tests for PVs
Stavros Kontopoulos created SPARK-24902: --- Summary: Add integration tests for PVs Key: SPARK-24902 URL: https://issues.apache.org/jira/browse/SPARK-24902 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0 Reporter: Stavros Kontopoulos PVs and hostpath support has been added recently for Spark on K8s. We should have some integration tests based on local storage. It is easy to add PVs to minikube and attach them to pods. We could target a known dir like /tmp for the PV path.
[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554099#comment-16554099 ] Stavros Kontopoulos edited comment on SPARK-24434 at 7/24/18 11:06 AM: --- [~liyinan926] [~eje] Should we move with the implementation? Any more questions? Could you do another review round pls? was (Author: skonto): [~liyinan926] Should we move with the implementation? Any more questions? Could you do another review round pls? > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554099#comment-16554099 ] Stavros Kontopoulos commented on SPARK-24434: - [~liyinan926] Should we move with the implementation? Any more questions? Could you do another review round pls? > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24900) speed up sort when the dataset is small
[ https://issues.apache.org/jira/browse/SPARK-24900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24900:

    Assignee: Apache Spark

> speed up sort when the dataset is small
> ---
>
> Key: SPARK-24900
> URL: https://issues.apache.org/jira/browse/SPARK-24900
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: SongXun
> Assignee: Apache Spark
> Priority: Minor
> Labels: pull-request-available
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> When running a SQL statement like 'select * from order where order_status = 4 order by order_id', the physical plan is:
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl01gtso1j31kw0ag43q.jpg!
> The Exchange rangepartitioning has two steps:
> 1. Sample the RDD and build a RangePartitioner, which holds the rangeBounds.
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl0a6ziytj30le0su0vw.jpg!
> 2. Compute the rddWithPartitionIds from the rangeBounds and perform the shuffle.
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl0bapn6mj30l40ke769.jpg!
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl0brnj3jj30kq0mu40q.jpg!
> The file scan and filter are therefore executed twice, which can take a long time. If the final dataset is small and the sample data covers all of the data, there is no need to do so.
> !https://ws3.sinaimg.cn/large/006tNc79ly1ftl3u0091wj31600a8wh1.jpg!
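The two-step range partitioning described above can be sketched in plain Python. This is a toy simulation, not Spark code; the sample size and bound-selection logic are illustrative assumptions, but it shows why the input is traversed twice (once for sampling, once for partition assignment):

```python
import bisect
import random

def range_partition(rows, num_partitions, sample_size=20, seed=42):
    """Toy simulation of Exchange rangepartitioning.

    Step 1: sample the data and derive the range bounds.
    Step 2: scan the full data again and assign each row a partition id.
    The second full pass mirrors the repeated file scan and filter that
    the issue complains about.
    """
    random.seed(seed)
    # Step 1: sample, sort, and pick (num_partitions - 1) range bounds.
    sample = sorted(random.sample(rows, min(sample_size, len(rows))))
    step = max(1, len(sample) // num_partitions)
    bounds = [sample[i] for i in range(step, len(sample), step)][:num_partitions - 1]
    # Step 2: second pass over the full data to assign partition ids.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[bisect.bisect_left(bounds, row)].append(row)
    return bounds, partitions

bounds, parts = range_partition(list(range(100)), num_partitions=4)
```

If the sampled data already covers the whole (small) dataset, step 2 recomputes what the sample pass has effectively already seen, which is the redundancy the issue proposes to eliminate.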
[jira] [Commented] (SPARK-24900) speed up sort when the dataset is small
[ https://issues.apache.org/jira/browse/SPARK-24900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554094#comment-16554094 ] Apache Spark commented on SPARK-24900:
---
User 'sddyljsx' has created a pull request for this issue: https://github.com/apache/spark/pull/21859

> speed up sort when the dataset is small
> ---
>
> Key: SPARK-24900
> URL: https://issues.apache.org/jira/browse/SPARK-24900
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: SongXun
> Priority: Minor
> Labels: pull-request-available
[jira] [Assigned] (SPARK-24900) speed up sort when the dataset is small
[ https://issues.apache.org/jira/browse/SPARK-24900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24900:

    Assignee: (was: Apache Spark)

> speed up sort when the dataset is small
> ---
>
> Key: SPARK-24900
> URL: https://issues.apache.org/jira/browse/SPARK-24900
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: SongXun
> Priority: Minor
> Labels: pull-request-available
[jira] [Created] (SPARK-24901) Merge the codegen of RegularHashMap and fastHashMap to reduce compiler maxCodesize when VectorizedHashMap is false
caoxuewen created SPARK-24901:
---

    Summary: Merge the codegen of RegularHashMap and fastHashMap to reduce compiler maxCodesize when VectorizedHashMap is false
    Key: SPARK-24901
    URL: https://issues.apache.org/jira/browse/SPARK-24901
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Affects Versions: 2.4.0
    Reporter: caoxuewen

Currently, the code generated to update the UnsafeRow in hash aggregation keeps two separate code paths, one for the FastHashMap and one for the RegularHashMap. The two separate paths are only needed when VectorizedHashMap is true; in all other cases they can be merged to reduce the size of the generated method (compiler maxCodesize). For example:

case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String)
spark.sparkContext.parallelize(
  DistinctAgg(8, 2, 3, 4, "a") ::
  DistinctAgg(9, 3, 4, 5, "b") :: Nil).toDF().createOrReplaceTempView("distinctAgg")
val df = sql("select a,b,e, min(d) as mind, min(case when a > 10 then a else null end) as mincasea, min(a) as mina from distinctAgg group by a, b, e")
println(org.apache.spark.sql.execution.debug.codegenString(df.queryExecution.executedPlan))
df.show()

The generated code looks like the following.

*Before the change:*

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */ ...
/* 354 */
/* 355 */ if (agg_fastAggBuffer_0 != null) {
/* 356 */   // common sub-expressions
/* 357 */
/* 358 */   // evaluate aggregate function
/* 359 */   agg_agg_isNull_31_0 = true;
/* 360 */   int agg_value_34 = -1;
/* 361 */
/* 362 */   boolean agg_isNull_32 = agg_fastAggBuffer_0.isNullAt(0);
/* 363 */   int agg_value_35 = agg_isNull_32 ?
/* 364 */     -1 : (agg_fastAggBuffer_0.getInt(0));
/* 365 */
/* 366 */   if (!agg_isNull_32 && (agg_agg_isNull_31_0 ||
/* 367 */       agg_value_34 > agg_value_35)) {
/* 368 */     agg_agg_isNull_31_0 = false;
/* 369 */     agg_value_34 = agg_value_35;
/* 370 */   }
/* 371 */
/* 372 */   if (!false && (agg_agg_isNull_31_0 ||
/* 373 */       agg_value_34 > agg_expr_2_0)) {
/* 374 */     agg_agg_isNull_31_0 = false;
/* 375 */     agg_value_34 = agg_expr_2_0;
/* 376 */   }
/* 377 */   agg_agg_isNull_34_0 = true;
/* 378 */   int agg_value_37 = -1;
/* 379 */
/* 380 */   boolean agg_isNull_35 = agg_fastAggBuffer_0.isNullAt(1);
/* 381 */   int agg_value_38 = agg_isNull_35 ?
/* 382 */     -1 : (agg_fastAggBuffer_0.getInt(1));
/* 383 */
/* 384 */   if (!agg_isNull_35 && (agg_agg_isNull_34_0 ||
/* 385 */       agg_value_37 > agg_value_38)) {
/* 386 */     agg_agg_isNull_34_0 = false;
/* 387 */     agg_value_37 = agg_value_38;
/* 388 */   }
/* 389 */
/* 390 */   byte agg_caseWhenResultState_1 = -1;
/* 391 */   do {
/* 392 */     boolean agg_value_40 = false;
/* 393 */     agg_value_40 = agg_expr_0_0 > 10;
/* 394 */     if (!false && agg_value_40) {
/* 395 */       agg_caseWhenResultState_1 = (byte)(false ? 1 : 0);
/* 396 */       agg_agg_value_39_0 = agg_expr_0_0;
/* 397 */       continue;
/* 398 */     }
/* 399 */
/* 400 */     agg_caseWhenResultState_1 = (byte)(true ? 1 : 0);
/* 401 */     agg_agg_value_39_0 = -1;
/* 402 */
/* 403 */   } while (false);
/* 404 */   // TRUE if any condition is met and the result is null, or no any condition is met.
/* 405 */   final boolean agg_isNull_36 = (agg_caseWhenResultState_1 != 0);
/* 406 */
/* 407 */   if (!agg_isNull_36 && (agg_agg_isNull_34_0 ||
/* 408 */       agg_value_37 > agg_agg_value_39_0)) {
/* 409 */     agg_agg_isNull_34_0 = false;
/* 410 */     agg_value_37 = agg_agg_value_39_0;
/* 411 */   }
/* 412 */   agg_agg_isNull_42_0 = true;
/* 413 */   int agg_value_45 = -1;
/* 414 */
/* 415 */   boolean agg_isNull_43 = agg_fastAggBuffer_0.isNullAt(2);
/* 416 */   int agg_value_46 = agg_isNull_43 ?
/* 417 */     -1 : (agg_fastAggBuffer_0.getInt(2));
/* 418 */
/* 419 */   if (!agg_isNull_43 && (agg_agg_isNull_42_0 ||
/* 420 */       agg_value_45 > agg_value_46)) {
/* 421 */     agg_agg_isNull_42_0 = false;
/* 422 */     agg_value_45 = agg_value_46;
/* 423 */   }
/* 424 */
/* 425 */   if (!false && (agg_agg_isNull_42_0 ||
/* 426 */       agg_value_45 > agg_expr_0_0)) {
/* 427 */     agg_agg_isNull_42_0 = false;
/* 428 */     agg_value_45 = agg_expr_0_0;
/* 429 */   }
/* 430 */   // update fast row
/* 431 */   agg_fastAggBuffer_0.setInt(0, agg_value_34);
/* 432 */
/* 433 */   if (!agg_agg_isNull_34_0) {
/* 434 */     agg_fastAggBuffer_0.setInt(1, agg_value_37);
/* 435 */   } else {
/* 436 */     agg_fastAggBuffer_0.setNullAt(1);
/* 437 */   }
/* 438 */
/* 439 */   agg_fastAggBuffer_0.setInt(2, agg_value_45);
/* 440 */ } else {
/* 441 */   // common
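The shape of the proposed merge can be illustrated with a toy code generator. This is pure Python with hypothetical names, not Spark's actual codegen; it only shows why parameterizing one snippet over the buffer variable shrinks the emitted method:

```python
def gen_min_update(buffer_expr, slot, value_expr):
    """Emit one snippet of (Java-like) code that updates a min() aggregate
    slot on a row buffer. Before the proposed change, near-identical
    snippets were generated twice: once against the fast hash map buffer
    and once against the regular hash map buffer. Emitting a single
    parameterized snippet roughly halves the generated method body, which
    helps stay under the compiler's method-size limit (maxCodesize)."""
    return (
        f"boolean isNull = {buffer_expr}.isNullAt({slot});\n"
        f"int current = isNull ? -1 : {buffer_expr}.getInt({slot});\n"
        f"if (isNull || current > {value_expr}) {{\n"
        f"  {buffer_expr}.setInt({slot}, {value_expr});\n"
        f"}}\n"
    )

# 'Before' shape: two near-duplicate bodies, one per hash map.
before = (gen_min_update("agg_fastAggBuffer_0", 0, "agg_expr_2_0")
          + gen_min_update("agg_regularAggBuffer_0", 0, "agg_expr_2_0"))

# 'After' shape: one body against a unified buffer variable that is
# bound to whichever map matched the grouping key.
after = gen_min_update("agg_aggBuffer_0", 0, "agg_expr_2_0")
```

The point is not the exact emitted Java but the size accounting: duplicating every aggregate-update snippet per map doubles the method body for no behavioral gain when the vectorized map is disabled.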
[jira] [Created] (SPARK-24900) speed up sort when the dataset is small
SongXun created SPARK-24900:
---

    Summary: speed up sort when the dataset is small
    Key: SPARK-24900
    URL: https://issues.apache.org/jira/browse/SPARK-24900
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Affects Versions: 2.3.1
    Reporter: SongXun

When running a SQL statement like 'select * from order where order_status = 4 order by order_id', the physical plan is:
!https://ws1.sinaimg.cn/large/006tNc79ly1ftl01gtso1j31kw0ag43q.jpg!
The Exchange rangepartitioning has two steps:
1. Sample the RDD and build a RangePartitioner, which holds the rangeBounds.
!https://ws1.sinaimg.cn/large/006tNc79ly1ftl0a6ziytj30le0su0vw.jpg!
2. Compute the rddWithPartitionIds from the rangeBounds and perform the shuffle.
!https://ws1.sinaimg.cn/large/006tNc79ly1ftl0bapn6mj30l40ke769.jpg!
!https://ws1.sinaimg.cn/large/006tNc79ly1ftl0brnj3jj30kq0mu40q.jpg!
The file scan and filter are therefore executed twice, which can take a long time. If the final dataset is small and the sample data covers all of the data, there is no need to do so.
!https://ws3.sinaimg.cn/large/006tNc79ly1ftl3u0091wj31600a8wh1.jpg!
[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554043#comment-16554043 ] Genmao Yu commented on SPARK-24630:
---
{{Structured Streaming supports standard SQL as the batch queries, so the users can switch their queries between batch and streaming easily.}}
IIUC, there are some queries that we cannot switch from streaming to batch, such as "groupBy window aggregation" or "over window aggregation" on a stream. Isn't that the case?

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 2.2.0, 2.2.1
> Reporter: Jackey Lee
> Priority: Minor
> Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), SQLStream, and StormSQL all provide a stream-oriented SQL interface, with which users with little knowledge about streaming can easily develop a stream-processing model. In Spark, we can also support a SQL API based on Structured Streaming.
> To support SQL streaming, there are two key points:
> 1. The parser should be able to parse streaming-type SQL.
> 2. The analyzer should be able to map metadata information to the corresponding Relation.
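The "groupBy window aggregation" the comment refers to can be sketched without Spark: events are bucketed into fixed, tumbling windows by event time and aggregated per window. The sketch below is toy Python with illustrative field names; the streaming-specific part that it deliberately omits is the watermark logic deciding when a window may be finalized, which is exactly why such queries do not translate one-to-one between batch and streaming:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Assign each (event_time, key) pair to a tumbling window and count.

    In a batch query this is an ordinary group-by over a derived window
    column. On a stream, the engine must additionally track watermarks to
    know when a window's result can be emitted and its state dropped.
    """
    counts = defaultdict(int)
    for event_time, key in events:
        # Tumbling window: half-open interval [start, start + window_size).
        window_start = (event_time // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1, "a"), (3, "a"), (7, "b"), (12, "a")]
result = tumbling_window_counts(events, window_size=5)
# Windows: [0,5) holds two 'a' events, [5,10) one 'b', [10,15) one 'a'.
```

Run on a bounded list this is trivially a batch query; run on an unbounded stream the same grouping requires per-window state and an emission policy, which is the asymmetry the comment raises.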