[jira] [Assigned] (SPARK-23129) Lazy init DiskMapIterator#deserializeStream to reduce memory usage when ExternalAppendOnlyMap spill too much times
[ https://issues.apache.org/jira/browse/SPARK-23129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-23129:
-----------------------------------

    Assignee: zhoukang

> Lazy init DiskMapIterator#deserializeStream to reduce memory usage when
> ExternalAppendOnlyMap spill too much times
> -----------------------------------------------------------------------
>
>                 Key: SPARK-23129
>                 URL: https://issues.apache.org/jira/browse/SPARK-23129
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>             Fix For: 2.3.0
>
> Currently, the deserializeStream in ExternalAppendOnlyMap#DiskMapIterator is
> initialized when the DiskMapIterator instance is created. This causes memory
> overhead when ExternalAppendOnlyMap spills too many times.
> We can avoid this by initializing deserializeStream lazily, the first time
> it is used.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23129) Lazy init DiskMapIterator#deserializeStream to reduce memory usage when ExternalAppendOnlyMap spill too much times
[ https://issues.apache.org/jira/browse/SPARK-23129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-23129.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 20292
[https://github.com/apache/spark/pull/20292]

> Lazy init DiskMapIterator#deserializeStream to reduce memory usage when
> ExternalAppendOnlyMap spill too much times
> -----------------------------------------------------------------------
>
>                 Key: SPARK-23129
>                 URL: https://issues.apache.org/jira/browse/SPARK-23129
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: zhoukang
>            Assignee: zhoukang
>            Priority: Major
>             Fix For: 2.3.0
>
> Currently, the deserializeStream in ExternalAppendOnlyMap#DiskMapIterator is
> initialized when the DiskMapIterator instance is created. This causes memory
> overhead when ExternalAppendOnlyMap spills too many times.
> We can avoid this by initializing deserializeStream lazily, the first time
> it is used.
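For illustration, a minimal sketch of the lazy-initialization pattern the ticket describes, in plain Java with hypothetical names (Spark's actual DiskMapIterator is considerably more involved): the stream is opened on the first read instead of at construction, so iterators over not-yet-read spill files hold no buffer memory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of the lazy-init pattern; hypothetical class, not Spark's code.
class LazyStreamIterator {
    private final byte[] spillData;
    private InputStream stream; // null until first use

    LazyStreamIterator(byte[] spillData) {
        this.spillData = spillData;
        // Before the fix, the stream (and its buffers) would be opened here,
        // costing memory for every spill file at once.
    }

    private InputStream stream() {
        if (stream == null) {
            stream = new ByteArrayInputStream(spillData); // opened on first read
        }
        return stream;
    }

    int readNext() {
        try {
            return stream().read();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    boolean isStreamOpen() {
        return stream != null;
    }
}
```

With many spill files, only the iterators actually being read pay the buffer cost.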
[jira] [Resolved] (SPARK-23211) SparkR MLlib randomFroest parameter problem
[ https://issues.apache.org/jira/browse/SPARK-23211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-23211.
-------------------------------
    Resolution: Invalid

I can't make out what you're asking. Please take this to the mailing list first.

> SparkR MLlib randomFroest parameter problem
> -------------------------------------------
>
>                 Key: SPARK-23211
>                 URL: https://issues.apache.org/jira/browse/SPARK-23211
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.1.0
>         Environment: {code:R}
> sdf_list <- randomSplit(train_data, rep(7, 3), 10086)
> model <- spark.randomForest(
>   sdf_list[[1]],
>   forward_count ~ .,
>   type = "regression",
>   path = paste0("./predict/model/randomForest_", x),
>   overwrite = TRUE,
>   newData = sdf_list[[2]])
> {code}
> train_data is a SparkDataFrame.
> The notes for the parameter newData say "a SparkDataFrame for testing."
> The notes for the parameter path say "The directory where the model is saved."
> Neither works normally. Why?
>            Reporter: 黄龙龙
>            Priority: Major
>              Labels: documentation, usability
>
> spark.randomForest() and randomSplit() problem
[jira] [Created] (SPARK-23211) SparkR MLlib randomFroest parameter problem
黄龙龙 created SPARK-23211:
------------------------

             Summary: SparkR MLlib randomFroest parameter problem
                 Key: SPARK-23211
                 URL: https://issues.apache.org/jira/browse/SPARK-23211
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 2.1.0
         Environment: {code:R}
sdf_list <- randomSplit(train_data, rep(7, 3), 10086)
model <- spark.randomForest(
  sdf_list[[1]],
  forward_count ~ .,
  type = "regression",
  path = paste0("./predict/model/randomForest_", x),
  overwrite = TRUE,
  newData = sdf_list[[2]])
{code}
train_data is a SparkDataFrame.
The notes for the parameter newData say "a SparkDataFrame for testing."
The notes for the parameter path say "The directory where the model is saved."
Neither works normally. Why?
            Reporter: 黄龙龙

spark.randomForest() and randomSplit() problem
[jira] [Commented] (SPARK-23187) Accumulator object can not be sent from Executor to Driver
[ https://issues.apache.org/jira/browse/SPARK-23187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338737#comment-16338737 ]

Saisai Shao commented on SPARK-23187:
-------------------------------------

Actually, the heartbeat report is OK according to my investigation: a registered accumulator reports its current updates from the executors back to the driver, with no need to wait for task end. The only catch is that in the Spark UI an accumulator is displayed only once its task has finished, whereas internal metric accumulators are displayed live. So I guess the UI simply doesn't display your registered accumulator in time, which makes it look as if the accumulator is not reported in the heartbeat.

> Accumulator object can not be sent from Executor to Driver
> ----------------------------------------------------------
>
>                 Key: SPARK-23187
>                 URL: https://issues.apache.org/jira/browse/SPARK-23187
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1, 2.3.0
>            Reporter: Lantao Jin
>            Priority: Major
>
> In Executor.scala#reportHeartBeat(), task metrics values cannot be sent to
> the driver (on the receiving side all values are zero).
> I wrote a UT as an explanation.
> {code}
> diff --git a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> index f9481f8..57fb096 100644
> --- a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> +++ b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> @@ -17,11 +17,16 @@
>  package org.apache.spark.rpc.netty
> +import scala.collection.mutable.ArrayBuffer
> +
>  import org.scalatest.mockito.MockitoSugar
>  import org.apache.spark._
>  import org.apache.spark.network.client.TransportClient
>  import org.apache.spark.rpc._
> +import org.apache.spark.util.AccumulatorContext
> +import org.apache.spark.util.AccumulatorV2
> +import org.apache.spark.util.LongAccumulator
>  class NettyRpcEnvSuite extends RpcEnvSuite with MockitoSugar {
> @@ -83,5 +88,21 @@ class NettyRpcEnvSuite extends RpcEnvSuite with MockitoSugar {
>      assertRequestMessageEquals(
>        msg3,
>        RequestMessage(nettyEnv, client, msg3.serialize(nettyEnv)))
> +
> +    val acc = new LongAccumulator
> +    val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local").setAppName("testAcc"))
> +    sc.register(acc, "testAcc")
> +    acc.setValue(1)
> +    // val msg4 = new RequestMessage(senderAddress, receiver, acc)
> +    // assertRequestMessageEquals(
> +    //   msg4,
> +    //   RequestMessage(nettyEnv, client, msg4.serialize(nettyEnv)))
> +
> +    val accbuf = new ArrayBuffer[AccumulatorV2[_, _]]()
> +    accbuf += acc
> +    val msg5 = new RequestMessage(senderAddress, receiver, accbuf)
> +    assertRequestMessageEquals(
> +      msg5,
> +      RequestMessage(nettyEnv, client, msg5.serialize(nettyEnv)))
>    }
> }
> {code}
> msg4 and msg5 both fail.
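As background for the comment above: per-task accumulator updates are merged into the driver-side copy. A minimal sketch of that merge semantics with hypothetical names (Spark's AccumulatorV2 API is richer):

```java
// Sketch of accumulator merge semantics: each task keeps a local count,
// and the driver merges the copies reported back via heartbeat.
// Hypothetical class, not Spark's AccumulatorV2.
class LongAcc {
    private long value;

    void add(long v) { value += v; }           // called on the executor side

    void merge(LongAcc other) { value += other.value; } // called on the driver

    long value() { return value; }
}
```

The driver's value thus converges as heartbeats arrive, without waiting for task end.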
[jira] [Commented] (SPARK-23210) Introduce the concept of default value to schema
[ https://issues.apache.org/jira/browse/SPARK-23210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338675#comment-16338675 ]

LvDongrong commented on SPARK-23210:
------------------------------------

Can we set the default value to null, like Hive? @maropu @gatorsmile Thank you!

> Introduce the concept of default value to schema
> ------------------------------------------------
>
>                 Key: SPARK-23210
>                 URL: https://issues.apache.org/jira/browse/SPARK-23210
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: LvDongrong
>            Priority: Major
>
> There is no concept of DEFAULT VALUE for a schema in Spark now.
> Our team wants to support inserting into a subset of a table's columns, like
> "insert into (a, c) values ('value1', 'value2')", for our use case, but the
> default value of the omitted columns is not defined. In Hive, the default
> value of a column is NULL if we don't specify one.
> So I think it may be necessary to introduce the concept of a default value
> to schemas in Spark.
[jira] [Created] (SPARK-23210) Introduce the concept of default value to schema
LvDongrong created SPARK-23210:
------------------------------

             Summary: Introduce the concept of default value to schema
                 Key: SPARK-23210
                 URL: https://issues.apache.org/jira/browse/SPARK-23210
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.1
            Reporter: LvDongrong

There is no concept of DEFAULT VALUE for a schema in Spark now. Our team wants to support inserting into a subset of a table's columns, like "insert into (a, c) values ('value1', 'value2')", for our use case, but the default value of the omitted columns is not defined. In Hive, the default value of a column is NULL if we don't specify one. So I think it may be necessary to introduce the concept of a default value to schemas in Spark.
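The requested behavior, sketched with a hypothetical helper (not a Spark API): when an INSERT names only a subset of columns, the remaining columns are filled with the schema default, NULL in Hive's case.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: build a full row from a partial column list, filling omitted
// columns with a default value (null here, following Hive's behavior).
class DefaultFill {
    static Map<String, Object> buildRow(List<String> schema,
                                        Map<String, Object> provided,
                                        Object defaultValue) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (String col : schema) {
            // getOrDefault keeps the provided value, else inserts the default
            row.put(col, provided.getOrDefault(col, defaultValue));
        }
        return row;
    }
}
```

So "insert into (a, c) values (...)" on a table with schema (a, b, c) would yield a row whose b is NULL.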
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338559#comment-16338559 ]

Edwina Lu commented on SPARK-23206:
-----------------------------------

Thanks, [~zsxwing]. Making the new metrics available in the metrics system would be very useful. Could this be done separately? We would also like the metrics to be available in the Spark web UI, since it is easy for users to see and use, and in the REST API, which is used by [Dr. Elephant|https://github.com/linkedin/dr-elephant] (an open source tool for analyzing and tuning Hadoop and now Spark applications) and by other metrics gathering and analysis projects that we have.

> Additional Memory Tuning Metrics
> --------------------------------
>
>                 Key: SPARK-23206
>                 URL: https://issues.apache.org/jira/browse/SPARK-23206
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: Edwina Lu
>            Priority: Major
>         Attachments: MemoryTuningMetricsDesignDoc.pdf
>
> At LinkedIn, we have multiple clusters running thousands of Spark
> applications, and these numbers are growing rapidly. We need to ensure that
> these Spark applications are well tuned: cluster resources, including
> memory, should be used efficiently so that the cluster can support running
> more applications concurrently, and applications should run quickly and
> reliably.
> Currently there is limited visibility into how much memory executors are
> using, and users are guessing numbers for executor and driver memory
> sizing. These estimates are often much larger than needed, leading to
> memory wastage. Examining the metrics for one cluster for a month, the
> average percentage of used executor memory (max JVM used memory across
> executors / spark.executor.memory) is 35%, leading to an average of 591GB
> unused memory per application (number of executors *
> (spark.executor.memory - max JVM used memory)).
> Spark has multiple memory regions (user memory, execution memory, storage
> memory, and overhead memory), and to understand how memory is being used
> and fine-tune allocation between regions, it would be useful to have
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and
> the different memory regions, the following additional memory metrics can
> be tracked for each executor and driver:
> * JVM used memory: the JVM heap size for the executor/driver.
> * Execution memory: memory used for computation in shuffles, joins, sorts
> and aggregations.
> * Storage memory: memory used for caching and propagating internal data
> across the cluster.
> * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor,
> and also per stage. This information can be shown in the Spark UI and the
> REST APIs. Information about peak JVM used memory can help with determining
> appropriate values for spark.executor.memory and spark.driver.memory, and
> information about the unified memory region can help with determining
> appropriate values for spark.memory.fraction and
> spark.memory.storageFraction. Stage memory information can help identify
> which stages are most memory intensive, and users can look into the
> relevant code to determine whether it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory,
> execution memory and storage memory to the heartbeat. SparkListeners are
> modified to collect the new metrics for the executors, stages and Spark
> history log. Only interesting values (peak values per stage per executor)
> are recorded in the Spark history log, to minimize the amount of additional
> logging.
> We have attached our design documentation to this ticket and would like to
> receive feedback from the community on this proposal.
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338553#comment-16338553 ]

Saisai Shao commented on SPARK-23206:
-------------------------------------

I think this Jira duplicates SPARK-9103. Also, it seems some of these metrics already exist in Spark, right?

> Additional Memory Tuning Metrics
> --------------------------------
>
>                 Key: SPARK-23206
>                 URL: https://issues.apache.org/jira/browse/SPARK-23206
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: Edwina Lu
>            Priority: Major
>         Attachments: MemoryTuningMetricsDesignDoc.pdf
>
> At LinkedIn, we have multiple clusters running thousands of Spark
> applications, and these numbers are growing rapidly. We need to ensure that
> these Spark applications are well tuned: cluster resources, including
> memory, should be used efficiently so that the cluster can support running
> more applications concurrently, and applications should run quickly and
> reliably.
> Currently there is limited visibility into how much memory executors are
> using, and users are guessing numbers for executor and driver memory
> sizing. These estimates are often much larger than needed, leading to
> memory wastage. Examining the metrics for one cluster for a month, the
> average percentage of used executor memory (max JVM used memory across
> executors / spark.executor.memory) is 35%, leading to an average of 591GB
> unused memory per application (number of executors *
> (spark.executor.memory - max JVM used memory)).
> Spark has multiple memory regions (user memory, execution memory, storage
> memory, and overhead memory), and to understand how memory is being used
> and fine-tune allocation between regions, it would be useful to have
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and
> the different memory regions, the following additional memory metrics can
> be tracked for each executor and driver:
> * JVM used memory: the JVM heap size for the executor/driver.
> * Execution memory: memory used for computation in shuffles, joins, sorts
> and aggregations.
> * Storage memory: memory used for caching and propagating internal data
> across the cluster.
> * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor,
> and also per stage. This information can be shown in the Spark UI and the
> REST APIs. Information about peak JVM used memory can help with determining
> appropriate values for spark.executor.memory and spark.driver.memory, and
> information about the unified memory region can help with determining
> appropriate values for spark.memory.fraction and
> spark.memory.storageFraction. Stage memory information can help identify
> which stages are most memory intensive, and users can look into the
> relevant code to determine whether it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory,
> execution memory and storage memory to the heartbeat. SparkListeners are
> modified to collect the new metrics for the executors, stages and Spark
> history log. Only interesting values (peak values per stage per executor)
> are recorded in the Spark history log, to minimize the amount of additional
> logging.
> We have attached our design documentation to this ticket and would like to
> receive feedback from the community on this proposal.
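The metric the proposal calls "JVM used memory" can be sampled with the standard java.lang.management API; a minimal sketch (Spark's actual heartbeat plumbing differs):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Sample the current JVM heap usage, the quantity the proposal would
// report per executor/driver in the heartbeat.
class JvmMemorySampler {
    static long usedHeapBytes() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        return heap.getUsed(); // bytes of heap currently in use
    }
}
```

Tracking the maximum of successive samples per stage per executor gives the "peak JVM used memory" values the design document describes.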
[jira] [Created] (SPARK-23209) HiveDelegationTokenProvider throws an exception if Hive jars are not the classpath
Sahil Takiar created SPARK-23209:
--------------------------------

             Summary: HiveDelegationTokenProvider throws an exception if Hive jars are not the classpath
                 Key: SPARK-23209
                 URL: https://issues.apache.org/jira/browse/SPARK-23209
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.0
         Environment: OSX, Java(TM) SE Runtime Environment (build 1.8.0_92-b14), Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
            Reporter: Sahil Takiar

While doing some Hive-on-Spark testing against the Spark 2.3.0 release candidates, we came across a bug (see HIVE-18436). Stack trace:

{code}
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
	at org.apache.spark.deploy.security.HadoopDelegationTokenManager.getDelegationTokenProviders(HadoopDelegationTokenManager.scala:68)
	at org.apache.spark.deploy.security.HadoopDelegationTokenManager.<init>(HadoopDelegationTokenManager.scala:54)
	at org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.<init>(YARNHadoopDelegationTokenManager.scala:44)
	at org.apache.spark.deploy.yarn.Client.<init>(Client.scala:123)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1502)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 10 more
{code}

Looks like the bug was introduced by SPARK-20434. SPARK-20434 changed {{HiveDelegationTokenProvider}} so that it constructs {{o.a.h.hive.conf.HiveConf}} inside {{HiveCredentialProvider#hiveConf}} rather than manually loading the class via the class loader. With the new code, the JVM tries to load {{HiveConf}} as soon as {{HiveDelegationTokenProvider}} is referenced. Since there is no try-catch around the construction of {{HiveDelegationTokenProvider}}, a {{ClassNotFoundException}} is thrown, which causes spark-submit to crash.

Spark's {{docs/running-on-yarn.md}} says "a Hive token will be obtained if Hive is on the classpath". This behavior would seem to contradict that.
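One conventional guard for this kind of failure, sketched with a hypothetical helper (not the actual Spark fix): probe for the dependency class reflectively and skip the provider when it is absent, instead of letting a static reference raise NoClassDefFoundError.

```java
// Sketch: only construct a provider whose dependency class is present.
// For SPARK-23209 the probed class would be
// "org.apache.hadoop.hive.conf.HiveConf".
class OptionalProvider {
    static boolean classAvailable(String className, ClassLoader loader) {
        try {
            // initialize=false: just check presence, don't run static init
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            return false; // dependency jars not on the classpath: skip
        }
    }
}
```

This preserves the documented behavior ("a Hive token will be obtained if Hive is on the classpath") by degrading gracefully when it is not.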
[jira] [Commented] (SPARK-23201) Cannot create view when duplicate columns exist in subquery
[ https://issues.apache.org/jira/browse/SPARK-23201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338551#comment-16338551 ]

Dongjoon Hyun commented on SPARK-23201:
---------------------------------------

Apache Spark 1.6.3 (released on November 7, 2016) also has this issue. I don't think there will be an Apache Spark 1.6.4 for this bug.

> Cannot create view when duplicate columns exist in subquery
> -----------------------------------------------------------
>
>                 Key: SPARK-23201
>                 URL: https://issues.apache.org/jira/browse/SPARK-23201
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0, 1.6.3
>            Reporter: Johannes Mayer
>            Priority: Critical
>
> I have two tables, A(colA, col2, col3) and B(colB, col3, col5).
> If I join them in a subquery on A.colA = B.colB, I can select the
> non-duplicate columns, but I cannot create a view (col3 is duplicated, but
> not selected):
>
> {code:java}
> create view testview as select
>   tmp.colA, tmp.col2, tmp.colB, tmp.col5
> from (
>   select * from A left join B
>   on (A.colA = B.colB)
> ) tmp
> {code}
>
> This works:
>
> {code:java}
> select
>   tmp.colA, tmp.col2, tmp.colB, tmp.col5
> from (
>   select * from A left join B
>   on (A.colA = B.colB)
> ) tmp
> {code}
[jira] [Updated] (SPARK-23201) Cannot create view when duplicate columns exist in subquery
[ https://issues.apache.org/jira/browse/SPARK-23201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-23201:
----------------------------------
    Affects Version/s: 1.6.3

> Cannot create view when duplicate columns exist in subquery
> -----------------------------------------------------------
>
>                 Key: SPARK-23201
>                 URL: https://issues.apache.org/jira/browse/SPARK-23201
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0, 1.6.3
>            Reporter: Johannes Mayer
>            Priority: Critical
>
> I have two tables, A(colA, col2, col3) and B(colB, col3, col5).
> If I join them in a subquery on A.colA = B.colB, I can select the
> non-duplicate columns, but I cannot create a view (col3 is duplicated, but
> not selected):
>
> {code:java}
> create view testview as select
>   tmp.colA, tmp.col2, tmp.colB, tmp.col5
> from (
>   select * from A left join B
>   on (A.colA = B.colB)
> ) tmp
> {code}
>
> This works:
>
> {code:java}
> select
>   tmp.colA, tmp.col2, tmp.colB, tmp.col5
> from (
>   select * from A left join B
>   on (A.colA = B.colB)
> ) tmp
> {code}
[jira] [Updated] (SPARK-21717) Decouple the generated codes of consuming rows in operators under whole-stage codegen
[ https://issues.apache.org/jira/browse/SPARK-21717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Agarwal updated SPARK-21717:
-----------------------------------
    Target Version/s: 2.3.0
            Priority: Critical  (was: Major)

> Decouple the generated codes of consuming rows in operators under whole-stage
> codegen
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-21717
>                 URL: https://issues.apache.org/jira/browse/SPARK-21717
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Liang-Chi Hsieh
>            Priority: Critical
>
> It has been observed in SPARK-21603 that whole-stage codegen suffers
> performance degradation if the generated functions are too long to be
> optimized by the JIT.
> We basically produce a single function incorporating the generated code from
> all physical operators in a whole-stage. Thus, it is possible for the
> generated function to grow past the threshold beyond which the JIT no longer
> optimizes it.
> This ticket tries to decouple the logic of consuming rows in physical
> operators, to avoid one giant row-processing function.
[jira] [Updated] (SPARK-22221) Add User Documentation for Working with Arrow in Spark
[ https://issues.apache.org/jira/browse/SPARK-22221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin updated SPARK-22221:
----------------------------------
    Target Version/s: 2.3.0

> Add User Documentation for Working with Arrow in Spark
> ------------------------------------------------------
>
>                 Key: SPARK-22221
>                 URL: https://issues.apache.org/jira/browse/SPARK-22221
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>            Priority: Major
>
> There needs to be user-facing documentation that will show how to enable/use
> Arrow with Spark, what the user should expect, and describe any differences
> with similar existing functionality.
> A comment from Xiao Li on https://github.com/apache/spark/pull/18664:
> Given that users/applications contain Timestamps in their Datasets, and
> their processing algorithms also have code based on the corresponding
> time-zone-related assumptions:
> * New users/applications first enable Arrow and later hit an Arrow bug.
> Can they simply turn off spark.sql.execution.arrow.enable? If not, what
> should they do?
> * Existing users/applications want to utilize Arrow for better performance.
> Can they just turn on spark.sql.execution.arrow.enable? What should they do?
> Note: hopefully, the guides/solutions are user-friendly. That means they
> must be very simple to understand for most users.
[jira] [Commented] (SPARK-23208) GenArrayData produces illegal code
[ https://issues.apache.org/jira/browse/SPARK-23208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338537#comment-16338537 ]

Apache Spark commented on SPARK-23208:
--------------------------------------

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/20391

> GenArrayData produces illegal code
> ----------------------------------
>
>                 Key: SPARK-23208
>                 URL: https://issues.apache.org/jira/browse/SPARK-23208
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Herman van Hovell
>            Assignee: Herman van Hovell
>            Priority: Blocker
>
> GenArrayData.genCodeToCreateArrayData produces illegal Java code when code
> splitting is enabled. This is caused by a typo on the following line:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L114
[jira] [Assigned] (SPARK-23208) GenArrayData produces illegal code
[ https://issues.apache.org/jira/browse/SPARK-23208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23208:
------------------------------------

    Assignee: Apache Spark  (was: Herman van Hovell)

> GenArrayData produces illegal code
> ----------------------------------
>
>                 Key: SPARK-23208
>                 URL: https://issues.apache.org/jira/browse/SPARK-23208
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Herman van Hovell
>            Assignee: Apache Spark
>            Priority: Blocker
>
> GenArrayData.genCodeToCreateArrayData produces illegal Java code when code
> splitting is enabled. This is caused by a typo on the following line:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L114
[jira] [Assigned] (SPARK-23208) GenArrayData produces illegal code
[ https://issues.apache.org/jira/browse/SPARK-23208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23208:
------------------------------------

    Assignee: Herman van Hovell  (was: Apache Spark)

> GenArrayData produces illegal code
> ----------------------------------
>
>                 Key: SPARK-23208
>                 URL: https://issues.apache.org/jira/browse/SPARK-23208
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Herman van Hovell
>            Assignee: Herman van Hovell
>            Priority: Blocker
>
> GenArrayData.genCodeToCreateArrayData produces illegal Java code when code
> splitting is enabled. This is caused by a typo on the following line:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L114
[jira] [Assigned] (SPARK-23207) Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss
[ https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Agarwal reassigned SPARK-23207:
--------------------------------------

    Assignee: Jiang Xingbo

> Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss
> ---------------------------------------------------------------
>
>                 Key: SPARK-23207
>                 URL: https://issues.apache.org/jira/browse/SPARK-23207
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Jiang Xingbo
>            Assignee: Jiang Xingbo
>            Priority: Major
>
> Currently, shuffle repartition uses RoundRobinPartitioning; the generated
> result is nondeterministic, since the sequence of input rows is not
> determined.
> The bug can be triggered when there is a repartition call following a
> shuffle (which leads to non-deterministic row ordering), as in the pattern
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks of the repartition
> stage will be retried and generate an inconsistent ordering, and then some
> tasks of the result stage will be retried, generating different data.
> The following code returns 931532 instead of 1000000:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
>     throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}
[jira] [Updated] (SPARK-23207) Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss
[ https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Agarwal updated SPARK-23207:
-----------------------------------
    Priority: Blocker  (was: Major)

> Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss
> ---------------------------------------------------------------
>
>                 Key: SPARK-23207
>                 URL: https://issues.apache.org/jira/browse/SPARK-23207
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Jiang Xingbo
>            Assignee: Jiang Xingbo
>            Priority: Blocker
>
> Currently, shuffle repartition uses RoundRobinPartitioning; the generated
> result is nondeterministic, since the sequence of input rows is not
> determined.
> The bug can be triggered when there is a repartition call following a
> shuffle (which leads to non-deterministic row ordering), as in the pattern
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks of the repartition
> stage will be retried and generate an inconsistent ordering, and then some
> tasks of the result stage will be retried, generating different data.
> The following code returns 931532 instead of 1000000:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
>     throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}
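For context on why round-robin is fragile here: its partition assignment depends on the order in which rows are encountered, while hash partitioning depends only on the value, so a retried task reproduces the same assignment. A minimal sketch with hypothetical helpers (not Spark's partitioner code):

```java
import java.util.List;

// Round-robin assignment depends on encounter order; hash assignment
// depends only on the value, so retries are reproducible.
class PartitionSketch {
    // Assign partitions by position: the same value lands in a different
    // partition if the input order changes (as it can on task retry).
    static int[] roundRobin(List<Integer> rows, int numParts) {
        int[] parts = new int[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            parts[i] = i % numParts; // order-sensitive
        }
        return parts;
    }

    // Assign partitions by value: the same value always lands in the same
    // partition, regardless of input order.
    static int hashPartition(int value, int numParts) {
        return Math.floorMod(Integer.hashCode(value), numParts);
    }
}
```

With round-robin, a retried repartition task that sees its rows in a different order sends some rows to different partitions than the original attempt did, so downstream tasks that are not retried see inconsistent inputs; hash partitioning avoids this failure mode.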
[jira] [Comment Edited] (SPARK-23201) Cannot create view when duplicate columns exist in subquery
[ https://issues.apache.org/jira/browse/SPARK-23201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338525#comment-16338525 ] Dongjoon Hyun edited comment on SPARK-23201 at 1/25/18 1:11 AM: Hi, [~joha0123]. It seems to work in the latest Apache Spark (2.2.1) and (2.1.2). Do you really want to report at *1.6.0*? {code} scala> sql("create view v1 as select tmp.colA, tmp.col2, tmp.colB, tmp.col5 from (select * from A left join B on (A.colA = B.colB)) tmp").show scala> sql("select * from v1").show +----+----+----+----+ |colA|col2|colB|col5| +----+----+----+----+ +----+----+----+----+ scala> spark.version res7: String = 2.2.1 {code} was (Author: dongjoon): Hi, [~joha0123]. It seems to work in the latest Apache Spark (2.2.1). Do you really want to report at *1.6.0*? {code} scala> sql("create view v1 as select tmp.colA, tmp.col2, tmp.colB, tmp.col5 from (select * from A left join B on (A.colA = B.colB)) tmp").show scala> sql("select * from v1").show +----+----+----+----+ |colA|col2|colB|col5| +----+----+----+----+ +----+----+----+----+ scala> spark.version res7: String = 2.2.1 {code} > Cannot create view when duplicate columns exist in subquery > --- > > Key: SPARK-23201 > URL: https://issues.apache.org/jira/browse/SPARK-23201 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Johannes Mayer >Priority: Critical > > I have two tables A(colA, col2, col3), B(colB, col3, col5) > If I join them in a subquery on A.colA = B.colB, I can select the > non-duplicate columns, but I cannot create a view (col3 is duplicate, but not > selected) > > {code:java} > create view testview as select > tmp.colA, tmp.col2, tmp.colB, tmp.col5 > from ( > select * from A left join B > on (A.colA = B.colB) > ) tmp > {code} > > > This works: > > {code:java} > select > tmp.colA, tmp.col2, tmp.colB, tmp.col5 > from ( > select * from A left join B > on (A.colA = B.colB) > ) tmp > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23208) GenArrayData produces illegal code
[ https://issues.apache.org/jira/browse/SPARK-23208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-23208: -- Target Version/s: 2.3.0 > GenArrayData produces illegal code > -- > > Key: SPARK-23208 > URL: https://issues.apache.org/jira/browse/SPARK-23208 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Blocker > > The GenArrayData.genCodeToCreateArrayData method produces illegal Java code when > code splitting is enabled. This is caused by a typo on the following line: > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L114 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23201) Cannot create view when duplicate columns exist in subquery
[ https://issues.apache.org/jira/browse/SPARK-23201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338525#comment-16338525 ] Dongjoon Hyun commented on SPARK-23201: --- Hi, [~joha0123]. It seems to work in the latest Apache Spark (2.2.1). Do you really want to report at *1.6.0*? {code} scala> sql("create view v1 as select tmp.colA, tmp.col2, tmp.colB, tmp.col5 from (select * from A left join B on (A.colA = B.colB)) tmp").show scala> sql("select * from v1").show +----+----+----+----+ |colA|col2|colB|col5| +----+----+----+----+ +----+----+----+----+ scala> spark.version res7: String = 2.2.1 {code} > Cannot create view when duplicate columns exist in subquery > --- > > Key: SPARK-23201 > URL: https://issues.apache.org/jira/browse/SPARK-23201 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Johannes Mayer >Priority: Critical > > I have two tables A(colA, col2, col3), B(colB, col3, col5) > If I join them in a subquery on A.colA = B.colB, I can select the > non-duplicate columns, but I cannot create a view (col3 is duplicate, but not > selected) > > {code:java} > create view testview as select > tmp.colA, tmp.col2, tmp.colB, tmp.col5 > from ( > select * from A left join B > on (A.colA = B.colB) > ) tmp > {code} > > > This works: > > {code:java} > select > tmp.colA, tmp.col2, tmp.colB, tmp.col5 > from ( > select * from A left join B > on (A.colA = B.colB) > ) tmp > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
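The working query suggests the usual workaround: project only the needed columns inside the subquery so the duplicate col3 never reaches the view definition. A minimal sketch of that workaround, using Python's built-in sqlite3 as a stand-in engine (not Hive or Spark SQL) with the table and column names from the report:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Tables from the report: col3 exists in both A and B.
cur.execute("CREATE TABLE A (colA INTEGER, col2 TEXT, col3 TEXT)")
cur.execute("CREATE TABLE B (colB INTEGER, col3 TEXT, col5 TEXT)")
cur.execute("INSERT INTO A VALUES (1, 'a2', 'a3')")
cur.execute("INSERT INTO B VALUES (1, 'b3', 'b5')")

# Workaround: list the wanted columns explicitly in the subquery, instead of
# 'select *', so the duplicated col3 is dropped before the view is defined.
cur.execute("""
    CREATE VIEW v1 AS
    SELECT tmp.colA, tmp.col2, tmp.colB, tmp.col5
    FROM (
        SELECT A.colA, A.col2, B.colB, B.col5
        FROM A LEFT JOIN B ON A.colA = B.colB
    ) AS tmp
""")

print(cur.execute("SELECT * FROM v1").fetchall())
```

Whether the original `select *` form is rejected depends on the engine; the point of the sketch is only that an explicit projection sidesteps the duplicate-column check entirely.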
[jira] [Assigned] (SPARK-23081) Add colRegex API to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23081: Assignee: (was: Apache Spark) > Add colRegex API to PySpark > --- > > Key: SPARK-23081 > URL: https://issues.apache.org/jira/browse/SPARK-23081 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23081) Add colRegex API to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23081: Assignee: Apache Spark > Add colRegex API to PySpark > --- > > Key: SPARK-23081 > URL: https://issues.apache.org/jira/browse/SPARK-23081 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23081) Add colRegex API to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338486#comment-16338486 ] Apache Spark commented on SPARK-23081: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/20390 > Add colRegex API to PySpark > --- > > Key: SPARK-23081 > URL: https://issues.apache.org/jira/browse/SPARK-23081 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
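For context, `colRegex` (already present in the Scala API) selects the DataFrame columns whose names match a regular expression. A rough pure-Python sketch of that selection logic follows; the `col_regex` helper is hypothetical and only illustrates the idea, it is not the PySpark API being added:

```python
import re

def col_regex(columns, pattern):
    """Return the column names that fully match the given regex,
    roughly what a colRegex-style selector does."""
    compiled = re.compile(pattern)
    return [name for name in columns if compiled.fullmatch(name)]

columns = ["Col1", "Col2", "Col2_renamed", "metric_a", "metric_b"]
print(col_regex(columns, "Col[0-9]"))   # ['Col1', 'Col2']
print(col_regex(columns, "metric_.*"))  # ['metric_a', 'metric_b']
```

The real API returns Column objects usable in `df.select(...)` rather than plain names, and Spark matches with Java regex syntax, but the selection semantics are as sketched.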
[jira] [Created] (SPARK-23208) GenArrayData produces illegal code
Herman van Hovell created SPARK-23208: - Summary: GenArrayData produces illegal code Key: SPARK-23208 URL: https://issues.apache.org/jira/browse/SPARK-23208 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Herman van Hovell Assignee: Herman van Hovell The GenArrayData.genCodeToCreateArrayData method produces illegal Java code when code splitting is enabled. This is caused by a typo on the following line: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L114 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338427#comment-16338427 ] Shixiong Zhu commented on SPARK-23206: -- We can also just add more information to metrics system and let the external system stores the metrics data and display them. > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: MemoryTuningMetricsDesignDoc.pdf > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. 
> To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used for caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
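The utilization and waste figures quoted in the description follow from two simple formulas: used fraction = max JVM used memory across executors / spark.executor.memory, and unused memory = number of executors * (spark.executor.memory - max JVM used memory). A small sketch with made-up inputs (the 35% / 591GB numbers in the ticket come from LinkedIn's real cluster data; the values below are illustrative only):

```python
def executor_memory_stats(executor_memory_gb, max_jvm_used_gb, num_executors):
    """The utilization and waste formulas as described in the ticket."""
    used_fraction = max_jvm_used_gb / executor_memory_gb
    unused_gb = num_executors * (executor_memory_gb - max_jvm_used_gb)
    return used_fraction, unused_gb

# Hypothetical application: 100 executors of 8 GB each, peaking at 2.8 GB
# of used JVM heap per executor.
frac, wasted = executor_memory_stats(8.0, 2.8, 100)
print(f"used executor memory: {frac:.0%}")  # 35%
print(f"unused memory: {wasted:.0f} GB")    # 520 GB
```

The same arithmetic applied per stage is what would let users shrink spark.executor.memory toward the observed peak instead of guessing.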
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338421#comment-16338421 ] Shixiong Zhu commented on SPARK-23206: -- Also cc [~jerryshao] since you were working on metrics system. > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: MemoryTuningMetricsDesignDoc.pdf > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. 
> To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23207) Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss
[ https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338419#comment-16338419 ] Jiang Xingbo commented on SPARK-23207: -- I'm working on this. > Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss > --- > > Key: SPARK-23207 > URL: https://issues.apache.org/jira/browse/SPARK-23207 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Priority: Major > > Currently shuffle repartition uses RoundRobinPartitioning, the generated > result is nondeterministic since the sequence of input rows are not > determined. > The bug can be triggered when there is a repartition call following a shuffle > (which would lead to non-deterministic row ordering), as the pattern shows > below: > upstream stage -> repartition stage -> result stage > (-> indicate a shuffle) > When one of the executors process goes down, some tasks on the repartition > stage will be retried and generate inconsistent ordering, and some tasks of > the result stage will be retried generating different data. > The following code returns 931532, instead of 100: > {code} > import scala.sys.process._ > import org.apache.spark.TaskContext > val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x => > x > }.repartition(200).map { x => > if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) { > throw new Exception("pkill -f java".!!) > } > x > } > res.distinct().count() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23207) Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss
Jiang Xingbo created SPARK-23207: Summary: Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss Key: SPARK-23207 URL: https://issues.apache.org/jira/browse/SPARK-23207 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Jiang Xingbo Currently shuffle repartition uses RoundRobinPartitioning; the generated result is nondeterministic since the ordering of input rows is not determined. The bug can be triggered when there is a repartition call following a shuffle (which would lead to non-deterministic row ordering), as the pattern shows below: upstream stage -> repartition stage -> result stage (-> indicates a shuffle) When one of the executor processes goes down, some tasks on the repartition stage will be retried and generate inconsistent ordering, and some tasks of the result stage will be retried, generating different data. The following code returns 931532 instead of 1000000: {code} import scala.sys.process._ import org.apache.spark.TaskContext val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x => x }.repartition(200).map { x => if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) { throw new Exception("pkill -f java".!!) } x } res.distinct().count() {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20641) Key-value store abstraction and implementation for storing application data
[ https://issues.apache.org/jira/browse/SPARK-20641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338402#comment-16338402 ] Marcelo Vanzin commented on SPARK-20641: [~rxin] sorry I missed your comment. As I explained in the spec, I chose LevelDB because Spark already depends on it. We can definitely add a RocksDB implementation if it's better, and I tried to create the abstraction so that it's not too hard to add these things later on. HDFS support sounds like an interesting thing to have. > Key-value store abstraction and implementation for storing application data > --- > > Key: SPARK-20641 > URL: https://issues.apache.org/jira/browse/SPARK-20641 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.3.0 > > > See spec in parent issue (SPARK-18085) for more details. > This task tracks adding a key-value store abstraction and initial LevelDB > implementation to be used to store application data for building the UI and > REST API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
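As a rough illustration of the kind of abstraction this ticket describes, the sketch below defines a minimal key-value interface with a dict-backed implementation standing in for the LevelDB backend. The interface is hypothetical (it is not Spark's actual KVStore API); the point is only that backends such as LevelDB or RocksDB can be swapped behind one surface, as the comment above suggests:

```python
from abc import ABC, abstractmethod

class KVStore(ABC):
    """Minimal key-value abstraction; concrete backends (LevelDB, RocksDB,
    in-memory) plug in behind the same interface."""

    @abstractmethod
    def write(self, key, value): ...

    @abstractmethod
    def read(self, key): ...

    @abstractmethod
    def delete(self, key): ...

class InMemoryStore(KVStore):
    """Dict-backed stand-in for a LevelDB-style implementation."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data[key]

    def delete(self, key):
        self._data.pop(key, None)

store = InMemoryStore()
store.write("app_1/stage_0", {"peak_jvm_used": 123})
print(store.read("app_1/stage_0"))
store.delete("app_1/stage_0")
```

UI and REST-API code written against the abstract interface stays unchanged when the backing store is replaced, which is the extensibility goal described in the comment.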
[jira] [Assigned] (SPARK-20650) Remove JobProgressListener (and other unneeded classes)
[ https://issues.apache.org/jira/browse/SPARK-20650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-20650: -- Assignee: Marcelo Vanzin > Remove JobProgressListener (and other unneeded classes) > --- > > Key: SPARK-20650 > URL: https://issues.apache.org/jira/browse/SPARK-20650 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.3.0 > > > See spec in parent issue (SPARK-18085) for more details. > This task tracks removing JobProgressListener and other classes that will be > made obsolete by the other changes in this project, and making adjustments to > parts of the code that still rely on them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20641) Key-value store abstraction and implementation for storing application data
[ https://issues.apache.org/jira/browse/SPARK-20641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-20641: -- Assignee: Marcelo Vanzin > Key-value store abstraction and implementation for storing application data > --- > > Key: SPARK-20641 > URL: https://issues.apache.org/jira/browse/SPARK-20641 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.3.0 > > > See spec in parent issue (SPARK-18085) for more details. > This task tracks adding a key-value store abstraction and initial LevelDB > implementation to be used to store application data for building the UI and > REST API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338396#comment-16338396 ] Shixiong Zhu commented on SPARK-23206: -- cc [~vanzin] > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: MemoryTuningMetricsDesignDoc.pdf > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. 
> To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338378#comment-16338378 ] Ye Zhou commented on SPARK-23206: - [~zsxwing] Hi, Can you help find some one who can help review this design doc? Thanks. > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: MemoryTuningMetricsDesignDoc.pdf > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. 
> To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edwina Lu updated SPARK-23206: -- Description: At LinkedIn, we have multiple clusters, running thousands of Spark applications, and these numbers are growing rapidly. We need to ensure that these Spark applications are well tuned – cluster resources, including memory, should be used efficiently so that the cluster can support running more applications concurrently, and applications should run quickly and reliably. Currently there is limited visibility into how much memory executors are using, and users are guessing numbers for executor and driver memory sizing. These estimates are often much larger than needed, leading to memory wastage. Examining the metrics for one cluster for a month, the average percentage of used executor memory (max JVM used memory across executors / spark.executor.memory) is 35%, leading to an average of 591GB unused memory per application (number of executors * (spark.executor.memory - max JVM used memory)). Spark has multiple memory regions (user memory, execution memory, storage memory, and overhead memory), and to understand how memory is being used and fine-tune allocation between regions, it would be useful to have information about how much memory is being used for the different regions. To improve visibility into memory usage for the driver and executors and different memory regions, the following additional memory metrics can be tracked for each executor and driver: * JVM used memory: the JVM heap size for the executor/driver. * Execution memory: memory used for computation in shuffles, joins, sorts and aggregations. * Storage memory: memory used for caching and propagating internal data across the cluster. * Unified memory: sum of execution and storage memory. The peak values for each memory metric can be tracked for each executor, and also per stage. This information can be shown in the Spark UI and the REST APIs. 
Information for peak JVM used memory can help with determining appropriate values for spark.executor.memory and spark.driver.memory, and information about the unified memory region can help with determining appropriate values for spark.memory.fraction and spark.memory.storageFraction. Stage memory information can help identify which stages are most memory intensive, and users can look into the relevant code to determine if it can be optimized. The memory metrics can be gathered by adding the current JVM used memory, execution memory and storage memory to the heartbeat. SparkListeners are modified to collect the new metrics for the executors, stages and Spark history log. Only interesting values (peak values per stage per executor) are recorded in the Spark history log, to minimize the amount of additional logging. We have attached our design documentation with this ticket and would like to receive feedback from the community for this proposal. was: At LinkedIn, we have multiple clusters, running thousands of Spark applications, and these numbers are growing rapidly. We need to ensure that these Spark applications are well tuned -- cluster resources, including memory, should be used efficiently so that the cluster can support running more applications concurrently, and applications should run quickly and reliably. Currently there is limited visibility into how much memory executors are using, and users are guessing numbers for executor and driver memory sizing. These estimates are often much larger than needed, leading to memory wastage. Examining the metrics for one cluster for a month, the average percentage of used executor memory (max JVM used memory across executors / spark.executor.memory) is 35%, leading to an average of 591GB unused memory per application (number of executors * (spark.executor.memory - max JVM used memory)). 
Spark has multiple memory regions (user memory, execution memory, storage memory, and overhead memory), and to understand how memory is being used and fine-tune allocation between regions, it would be useful to have information about how much memory is being used for the different regions. To improve visibility into memory usage for the driver and executors and different memory regions, the following additional memory metrics can be be tracked for each executor and driver: * JVM used memory: the JVM heap size for the executor/driver. * Execution memory: memory used for computation in shuffles, joins, sorts and aggregations. * Storage memory: memory used caching and propagating internal data across the cluster. * Unified memory: sum of execution and storage memory. The peak values for each memory metric can be tracked for each executor, and also per stage. This information can be shown in the Spark UI and the REST APIs. Information for peak JVM used memory can help with determining appropriate values for
[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edwina Lu updated SPARK-23206: -- Attachment: MemoryTuningMetricsDesignDoc.pdf > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: MemoryTuningMetricsDesignDoc.pdf > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned -- cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. > To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. 
> * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23206) Additional Memory Tuning Metrics
Edwina Lu created SPARK-23206: - Summary: Additional Memory Tuning Metrics Key: SPARK-23206 URL: https://issues.apache.org/jira/browse/SPARK-23206 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.1 Reporter: Edwina Lu At LinkedIn, we have multiple clusters, running thousands of Spark applications, and these numbers are growing rapidly. We need to ensure that these Spark applications are well tuned -- cluster resources, including memory, should be used efficiently so that the cluster can support running more applications concurrently, and applications should run quickly and reliably. Currently there is limited visibility into how much memory executors are using, and users are guessing numbers for executor and driver memory sizing. These estimates are often much larger than needed, leading to memory wastage. Examining the metrics for one cluster for a month, the average percentage of used executor memory (max JVM used memory across executors / spark.executor.memory) is 35%, leading to an average of 591GB unused memory per application (number of executors * (spark.executor.memory - max JVM used memory)). Spark has multiple memory regions (user memory, execution memory, storage memory, and overhead memory), and to understand how memory is being used and fine-tune allocation between regions, it would be useful to have information about how much memory is being used for the different regions. To improve visibility into memory usage for the driver and executors and different memory regions, the following additional memory metrics can be tracked for each executor and driver: * JVM used memory: the JVM heap size for the executor/driver. * Execution memory: memory used for computation in shuffles, joins, sorts and aggregations. * Storage memory: memory used for caching and propagating internal data across the cluster. * Unified memory: sum of execution and storage memory. 
The peak values for each memory metric can be tracked for each executor, and also per stage. This information can be shown in the Spark UI and the REST APIs. Information for peak JVM used memory can help with determining appropriate values for spark.executor.memory and spark.driver.memory, and information about the unified memory region can help with determining appropriate values for spark.memory.fraction and spark.memory.storageFraction. Stage memory information can help identify which stages are most memory intensive, and users can look into the relevant code to determine if it can be optimized. The memory metrics can be gathered by adding the current JVM used memory, execution memory and storage memory to the heartbeat. SparkListeners are modified to collect the new metrics for the executors, stages and Spark history log. Only interesting values (peak values per stage per executor) are recorded in the Spark history log, to minimize the amount of additional logging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
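As a back-of-the-envelope check on the wastage estimate quoted in this ticket (number of executors * (spark.executor.memory - max JVM used memory)), the following standalone Java sketch plugs the formula in with hypothetical figures; the executor count and memory sizes below are illustrative, not LinkedIn's actual cluster data:

```java
// Illustrates the unused-memory estimate from SPARK-23206:
// unused = numExecutors * (spark.executor.memory - max JVM used memory).
// All numbers here are hypothetical examples, not measurements from the ticket.
public class MemoryWaste {
    static long unusedBytes(int numExecutors, long executorMemoryBytes, long maxJvmUsedBytes) {
        return (long) numExecutors * (executorMemoryBytes - maxJvmUsedBytes);
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        // 100 executors at 8 GiB each, but peak JVM usage only ~35% of the allocation
        // (the average utilization reported in the ticket):
        long unused = unusedBytes(100, 8 * gb, (long) (8 * gb * 0.35));
        System.out.println("Unused memory: " + unused / gb + " GiB"); // prints 520 GiB
    }
}
```

With peak usage visible per executor and per stage, the same arithmetic lets a user pick a tighter spark.executor.memory value instead of guessing.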
[jira] [Assigned] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images
[ https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23205: Assignee: Apache Spark > ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel > images > > > Key: SPARK-23205 > URL: https://issues.apache.org/jira/browse/SPARK-23205 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Siddharth Murching >Assignee: Apache Spark >Priority: Critical > > When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color > constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)] > that sets alpha = 255, even for four-channel images. > See the offending line here: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172 > A fix is to simply update the line to: > val color = new Color(img.getRGB(w, h), nChannels == 4) > instead of > val color = new Color(img.getRGB(w, h)) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images
[ https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338362#comment-16338362 ] Apache Spark commented on SPARK-23205: -- User 'smurching' has created a pull request for this issue: https://github.com/apache/spark/pull/20389 > ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel > images > > > Key: SPARK-23205 > URL: https://issues.apache.org/jira/browse/SPARK-23205 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Siddharth Murching >Priority: Critical > > When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color > constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)] > that sets alpha = 255, even for four-channel images. > See the offending line here: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172 > A fix is to simply update the line to: > val color = new Color(img.getRGB(w, h), nChannels == 4) > instead of > val color = new Color(img.getRGB(w, h)) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images
[ https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23205: Assignee: (was: Apache Spark) > ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel > images > > > Key: SPARK-23205 > URL: https://issues.apache.org/jira/browse/SPARK-23205 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Siddharth Murching >Priority: Critical > > When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color > constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)] > that sets alpha = 255, even for four-channel images. > See the offending line here: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172 > A fix is to simply update the line to: > val color = new Color(img.getRGB(w, h), nChannels == 4) > instead of > val color = new Color(img.getRGB(w, h)) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images
[ https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338360#comment-16338360 ] Siddharth Murching commented on SPARK-23205: Working on a PR to address this issue > ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel > images > > > Key: SPARK-23205 > URL: https://issues.apache.org/jira/browse/SPARK-23205 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Siddharth Murching >Priority: Critical > > When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color > constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)] > that sets alpha = 255, even for four-channel images. > See the offending line here: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172 > A fix is to simply update the line to: > val color = new Color(img.getRGB(w, h), nChannels == 4) > instead of > val color = new Color(img.getRGB(w, h)) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images
[ https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338360#comment-16338360 ] Siddharth Murching edited comment on SPARK-23205 at 1/24/18 10:40 PM: -- I'm working on a PR to address this issue if that's alright :) was (Author: siddharth murching): Working on a PR to address this issue > ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel > images > > > Key: SPARK-23205 > URL: https://issues.apache.org/jira/browse/SPARK-23205 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Siddharth Murching >Priority: Critical > > When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color > constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)] > that sets alpha = 255, even for four-channel images. > See the offending line here: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172 > A fix is to simply update the line to: > val color = new Color(img.getRGB(w, h), nChannels == 4) > instead of > val color = new Color(img.getRGB(w, h)) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images
Siddharth Murching created SPARK-23205: -- Summary: ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images Key: SPARK-23205 URL: https://issues.apache.org/jira/browse/SPARK-23205 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.3.0 Reporter: Siddharth Murching When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)] that sets alpha = 255, even for four-channel images. See the offending line here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172 A fix is to simply update the line to: val color = new Color(img.getRGB(w, h), nChannels == 4) instead of val color = new Color(img.getRGB(w, h)) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
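The java.awt.Color behavior behind this bug can be observed directly outside of Spark. This small standalone Java example (not Spark code) shows the one-argument constructor discarding the alpha bits and forcing alpha to 255, while the two-argument form with hasalpha = true preserves them, which is exactly the change proposed above:

```java
import java.awt.Color;

// Demonstrates why ImageSchema.decode() loses the alpha channel:
// Color(int rgb) ignores bits 24-31 and fixes alpha at 255, while
// Color(int rgba, boolean hasalpha) with hasalpha = true keeps them.
public class AlphaDemo {
    public static void main(String[] args) {
        int argb = 0x80FF0000; // alpha 0x80 (128), red 0xFF, green 0, blue 0

        Color opaque = new Color(argb);          // current behavior for 4-channel images
        Color withAlpha = new Color(argb, true); // proposed fix: honor the alpha bits

        System.out.println(opaque.getAlpha());    // prints 255 -- alpha discarded
        System.out.println(withAlpha.getAlpha()); // prints 128 -- alpha preserved
        System.out.println(opaque.getRed());      // prints 255, same in both cases
    }
}
```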
[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338317#comment-16338317 ] Apache Spark commented on SPARK-23020: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/20388 > Re-enable Flaky Test: > org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.3.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23198) Fix KafkaContinuousSourceStressForDontFailOnDataLossSuite to test ContinuousExecution
[ https://issues.apache.org/jira/browse/SPARK-23198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-23198: Assignee: Dongjoon Hyun > Fix KafkaContinuousSourceStressForDontFailOnDataLossSuite to test > ContinuousExecution > - > > Key: SPARK-23198 > URL: https://issues.apache.org/jira/browse/SPARK-23198 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.3.0 > > > Currently, `KafkaContinuousSourceStressForDontFailOnDataLossSuite` runs on > `MicroBatchExecution`. It should test `ContinuousExecution`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23198) Fix KafkaContinuousSourceStressForDontFailOnDataLossSuite to test ContinuousExecution
[ https://issues.apache.org/jira/browse/SPARK-23198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-23198. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20374 [https://github.com/apache/spark/pull/20374] > Fix KafkaContinuousSourceStressForDontFailOnDataLossSuite to test > ContinuousExecution > - > > Key: SPARK-23198 > URL: https://issues.apache.org/jira/browse/SPARK-23198 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.3.0 > > > Currently, `KafkaContinuousSourceStressForDontFailOnDataLossSuite` runs on > `MicroBatchExecution`. It should test `ContinuousExecution`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23203) DataSourceV2 should use immutable trees.
[ https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338120#comment-16338120 ] Marcelo Vanzin commented on SPARK-23203: Ah, cool, wasn't aware that it was still experimental. Was just wondering since I saw this before I meant to reply to the rc1 e-mail. (I'm also concerned about the amount of patches going in after rcs started, but I generally count that as standard operating procedure during initial rcs...) > DataSourceV2 should use immutable trees. > > > Key: SPARK-23203 > URL: https://issues.apache.org/jira/browse/SPARK-23203 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > The DataSourceV2 integration doesn't use [immutable > trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html], > which is a basic requirement of Catalyst. The v2 relation should not wrap a > mutable reader and change the logical plan by pushing projections and filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338116#comment-16338116 ] Bryan Cutler commented on SPARK-22711: -- Yes, normally you would not need to import inside the functions. I think it is because the wordnet module initializes lazily, so when cloudpickle tries to pickle it, it is not complete and something is missed. I tried calling {{wordnet.ensure_loaded()}} in main, which forces the module to load, and that looks like it allows it to be pickled properly, but then it has a problem trying to unpickle it. So I think it is best to use the workaround above for this case. > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.2.0, 2.2.1 > Environment: Ubuntu pseudo distributed installation of Spark 2.2.0 >Reporter: Prateek >Priority: Major > Attachments: Jira_Spark_minimized_code.py > > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a Pyspark program with spark-submit command this error is > thrown. 
> It happens when for code like below > RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v) > Traceback (most recent call last): > File "/home/prateek/Project/textrank.py", line 299, in > summaryRDD = sentenceTokensReduceRDD.map(lambda m: > get_summary(m)).reduceByKey(lambda c,v :c+v) > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, > in reduceByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, > in combineByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, > in partitionBy > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, > in _jrdd > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, > in _wrap_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, > in _prepare_for_python_RDD > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 460, in dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 704, in dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 148, in dump > File "/usr/lib/python3.5/pickle.py", line 408, in dump > self.save(obj) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with 
explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line >
[jira] [Commented] (SPARK-23203) DataSourceV2 should use immutable trees.
[ https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338112#comment-16338112 ] Ryan Blue commented on SPARK-23203: --- [~vanzin], given that DataSourceV2 is experimental, I don't think it should block the release. But I also don't think it was a great idea to rush patches lately to get them into 2.3.0. We are shipping an experimental implementation that will no doubt change a lot as we add basics, like checking a data source table's schema against the data frame that is being written (which is completely by-passed right now). > DataSourceV2 should use immutable trees. > > > Key: SPARK-23203 > URL: https://issues.apache.org/jira/browse/SPARK-23203 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > The DataSourceV2 integration doesn't use [immutable > trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html], > which is a basic requirement of Catalyst. The v2 relation should not wrap a > mutable reader and change the logical plan by pushing projections and filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338089#comment-16338089 ] Apache Spark commented on SPARK-23204: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/20387 > DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter > > > Key: SPARK-23204 > URL: https://issues.apache.org/jira/browse/SPARK-23204 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > DataSourceV2 is currently only configured with a path, passed in options as > {{path}}. For many data sources, like JDBC, a table name is more appropriate. > I propose testing the "location" passed to load(String) and save(String) to > see if it is a path and if not, parsing it as a table name and passing > "database" and "table" options to readers and writers. > This also creates a way to pass the table identifier when using DataSourceV2 > tables from SQL. For example, {{SELECT * FROM db.table}} creates an > {{UnresolvedRelation(db,table)}} that could be resolved using the default > source, passing the db and table name using the same options. Similarly, we > can add a table property for the datasource implementation to metastore > tables and add a rule to convert them to DataSourceV2 relations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
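As an illustration only of the proposal quoted above (not Spark's actual implementation; every name in this sketch is hypothetical), a "location" string passed to load(String)/save(String) could be routed to either a path option or database/table options along these lines:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the SPARK-23204 proposal: treat the load/save string
// as a filesystem path when it looks like one, otherwise parse it as a
// [database.]table identifier and pass "database"/"table" options to the
// data source. The heuristic and class name are illustrative, not Spark API.
public class LocationParser {
    static Map<String, String> toOptions(String location) {
        Map<String, String> options = new HashMap<>();
        if (location.contains("/")) {
            // crude heuristic: anything with a path separator is a path
            options.put("path", location);
        } else {
            int dot = location.indexOf('.');
            if (dot >= 0) {
                options.put("database", location.substring(0, dot));
                options.put("table", location.substring(dot + 1));
            } else {
                options.put("table", location);
            }
        }
        return options;
    }

    public static void main(String[] args) {
        System.out.println(toOptions("hdfs:///data/events")); // path option
        System.out.println(toOptions("db.table"));            // database + table options
    }
}
```

The same mapping would let {{SELECT * FROM db.table}} resolve an UnresolvedRelation through the default source by handing it the identical database/table options.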
[jira] [Commented] (SPARK-23203) DataSourceV2 should use immutable trees.
[ https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338088#comment-16338088 ] Apache Spark commented on SPARK-23203: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/20387 > DataSourceV2 should use immutable trees. > > > Key: SPARK-23203 > URL: https://issues.apache.org/jira/browse/SPARK-23203 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > The DataSourceV2 integration doesn't use [immutable > trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html], > which is a basic requirement of Catalyst. The v2 relation should not wrap a > mutable reader and change the logical plan by pushing projections and filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23203) DataSourceV2 should use immutable trees.
[ https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23203: Assignee: (was: Apache Spark) > DataSourceV2 should use immutable trees. > > > Key: SPARK-23203 > URL: https://issues.apache.org/jira/browse/SPARK-23203 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > The DataSourceV2 integration doesn't use [immutable > trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html], > which is a basic requirement of Catalyst. The v2 relation should not wrap a > mutable reader and change the logical plan by pushing projections and filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23204: Assignee: Apache Spark > DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter > > > Key: SPARK-23204 > URL: https://issues.apache.org/jira/browse/SPARK-23204 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > > DataSourceV2 is currently only configured with a path, passed in options as > {{path}}. For many data sources, like JDBC, a table name is more appropriate. > I propose testing the "location" passed to load(String) and save(String) to > see if it is a path and if not, parsing it as a table name and passing > "database" and "table" options to readers and writers. > This also creates a way to pass the table identifier when using DataSourceV2 > tables from SQL. For example, {{SELECT * FROM db.table}} creates an > {{UnresolvedRelation(db,table)}} that could be resolved using the default > source, passing the db and table name using the same options. Similarly, we > can add a table property for the datasource implementation to metastore > tables and add a rule to convert them to DataSourceV2 relations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23203) DataSourceV2 should use immutable trees.
[ https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23203: Assignee: Apache Spark > DataSourceV2 should use immutable trees. > > > Key: SPARK-23203 > URL: https://issues.apache.org/jira/browse/SPARK-23203 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > > The DataSourceV2 integration doesn't use [immutable > trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html], > which is a basic requirement of Catalyst. The v2 relation should not wrap a > mutable reader and change the logical plan by pushing projections and filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23204: Assignee: (was: Apache Spark) > DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter > > > Key: SPARK-23204 > URL: https://issues.apache.org/jira/browse/SPARK-23204 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > DataSourceV2 is currently only configured with a path, passed in options as > {{path}}. For many data sources, like JDBC, a table name is more appropriate. > I propose testing the "location" passed to load(String) and save(String) to > see if it is a path and if not, parsing it as a table name and passing > "database" and "table" options to readers and writers. > This also creates a way to pass the table identifier when using DataSourceV2 > tables from SQL. For example, {{SELECT * FROM db.table}} creates an > {{UnresolvedRelation(db,table)}} that could be resolved using the default > source, passing the db and table name using the same options. Similarly, we > can add a table property for the datasource implementation to metastore > tables and add a rule to convert them to DataSourceV2 relations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338086#comment-16338086 ] Prateek commented on SPARK-22711: - Thanks [~bryanc]. I will check. Do we have to import the libraries in all dependent functions? That's uncommon, right? > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.2.0, 2.2.1 > Environment: Ubuntu pseudo distributed installation of Spark 2.2.0 >Reporter: Prateek >Priority: Major > Attachments: Jira_Spark_minimized_code.py > > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a Pyspark program with spark-submit command this error is > thrown. > It happens for code like the following > RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v) > Traceback (most recent call last): > File "/home/prateek/Project/textrank.py", line 299, in > summaryRDD = sentenceTokensReduceRDD.map(lambda m: > get_summary(m)).reduceByKey(lambda c,v :c+v) > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, > in reduceByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, > in combineByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, > in partitionBy > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, > in _jrdd > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, > in _wrap_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, > in _prepare_for_python_RDD > File 
"/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 460, in dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 704, in dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 148, in dump > File "/usr/lib/python3.5/pickle.py", line 408, in dump > self.save(obj) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File 
"/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr
[jira] [Updated] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-23204: -- Description: DataSourceV2 is currently only configured with a path, passed in options as {{path}}. For many data sources, like JDBC, a table name is more appropriate. I propose testing the "location" passed to load(String) and save(String) to see if it is a path and if not, parsing it as a table name and passing "database" and "table" options to readers and writers. This also creates a way to pass the table identifier when using DataSourceV2 tables from SQL. For example, {{SELECT * FROM db.table}} creates an {{UnresolvedRelation(db,table)}} that could be resolved using the default source, passing the db and table name using the same options. Similarly, we can add a table property for the datasource implementation to metastore tables and add a rule to convert them to DataSourceV2 relations. was: DataSourceV2 is currently only configured with a path, passed in options as {{path}}. For many data sources, like JDBC, a table name is more appropriate. I propose testing the "location" passed to load(String) and save(String) to see if it is a path and if not, parsing it as a table name and passing "database" and "table" options to readers and writers. This also creates a way to pass the table identifier when using DataSourceV2 tables from SQL. For example, {{SELECT * FROM db.table}} creates an {{UnresolvedRelation(db,table)}} that could be resolved using the default source, passing the db and table name using the same options. > DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter > > > Key: SPARK-23204 > URL: https://issues.apache.org/jira/browse/SPARK-23204 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > DataSourceV2 is currently only configured with a path, passed in options as > {{path}}. 
For many data sources, like JDBC, a table name is more appropriate. > I propose testing the "location" passed to load(String) and save(String) to > see if it is a path and if not, parsing it as a table name and passing > "database" and "table" options to readers and writers. > This also creates a way to pass the table identifier when using DataSourceV2 > tables from SQL. For example, {{SELECT * FROM db.table}} creates an > {{UnresolvedRelation(db,table)}} that could be resolved using the default > source, passing the db and table name using the same options. Similarly, we > can add a table property for the datasource implementation to metastore > tables and add a rule to convert them to DataSourceV2 relations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
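The resolution rule proposed above can be sketched as: treat the string given to load(String)/save(String) as a path when it looks like one, otherwise parse it as a database.table identifier and pass "database" and "table" options. The function names and the path heuristic below are hypothetical, for illustration only; this is not Spark's actual API.

```python
def looks_like_path(location: str) -> bool:
    # Hypothetical heuristic: URIs and slash-containing strings are paths.
    return "/" in location or ":" in location


def resolve_options(location: str) -> dict:
    """Map a load()/save() location string to reader/writer options."""
    if looks_like_path(location):
        return {"path": location}
    parts = location.split(".")
    if len(parts) == 2:                        # e.g. "db.table"
        return {"database": parts[0], "table": parts[1]}
    return {"table": location}                 # bare table name


print(resolve_options("hdfs:/data/events"))    # {'path': 'hdfs:/data/events'}
print(resolve_options("db.events"))
```

The same mapping would let {{SELECT * FROM db.table}} be resolved through the default source by passing the db and table name as options, as the description suggests.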
[jira] [Commented] (SPARK-23189) reflect stage level blacklisting on executor tab
[ https://issues.apache.org/jira/browse/SPARK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338083#comment-16338083 ] Imran Rashid commented on SPARK-23189: -- OK, since nobody feels strongly and to avoid bike-shedding here, my suggestion to move forward is that Attila goes ahead with this since it should be a small change, but if it gets complicated for whatever reason it's probably not worth sinking a lot of time into. I'd rather we just fixed the stage page to be better -- I don't like adding this if we plan on ripping it out later. But I also realize we don't have a ton of contributors working on the UI now, so who knows when the other UI work will happen; we can make this small change, since I think you probably have a pretty good sense of what would be useful in these pages. > reflect stage level blacklisting on executor tab > - > > Key: SPARK-23189 > URL: https://issues.apache.org/jira/browse/SPARK-23189 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.1 >Reporter: Attila Zsolt Piros >Priority: Major > > This issue came up during work on SPARK-22577, where the conclusion was that > not only should the stage tab reflect stage- and application-level blacklisting but > also the executor tab should be extended with stage-level blacklisting > information. > As [~irashid] and [~tgraves] discussed, the blacklisted stages should be > listed for an executor like "*stage[ , ,...]*". One idea was to list only the > most recent 3 blacklisted stages; another was to list all the active > stages which are blacklisted. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-23204: -- Description: DataSourceV2 is currently only configured with a path, passed in options as {{path}}. For many data sources, like JDBC, a table name is more appropriate. I propose testing the "location" passed to load(String) and save(String) to see if it is a path and if not, parsing it as a table name and passing "database" and "table" options to readers and writers. This also creates a way to pass the table identifier when using DataSourceV2 tables from SQL. For example, {{SELECT * FROM db.table}} creates an {{UnresolvedRelation(db,table)}} that could be resolved using the default source, passing the db and table name using the same options. was:DataSourceV2 is currently only configured with a path, passed in options as `path`. For many data sources, like JDBC, a table name is more appropriate. I propose testing the "location" passed to load(String) and save(String) to see if it is a path and if not, parsing it as a table name and passing "database" and "table" options to readers and writers. > DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter > > > Key: SPARK-23204 > URL: https://issues.apache.org/jira/browse/SPARK-23204 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > DataSourceV2 is currently only configured with a path, passed in options as > {{path}}. For many data sources, like JDBC, a table name is more appropriate. > I propose testing the "location" passed to load(String) and save(String) to > see if it is a path and if not, parsing it as a table name and passing > "database" and "table" options to readers and writers. > This also creates a way to pass the table identifier when using DataSourceV2 > tables from SQL. 
For example, {{SELECT * FROM db.table}} creates an > {{UnresolvedRelation(db,table)}} that could be resolved using the default > source, passing the db and table name using the same options. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
Ryan Blue created SPARK-23204: - Summary: DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter Key: SPARK-23204 URL: https://issues.apache.org/jira/browse/SPARK-23204 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Ryan Blue DataSourceV2 is currently only configured with a path, passed in options as `path`. For many data sources, like JDBC, a table name is more appropriate. I propose testing the "location" passed to load(String) and save(String) to see if it is a path and if not, parsing it as a table name and passing "database" and "table" options to readers and writers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22386) Data Source V2 improvements
[ https://issues.apache.org/jira/browse/SPARK-22386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338071#comment-16338071 ] Apache Spark commented on SPARK-22386: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/20387 > Data Source V2 improvements > --- > > Key: SPARK-22386 > URL: https://issues.apache.org/jira/browse/SPARK-22386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22386) Data Source V2 improvements
[ https://issues.apache.org/jira/browse/SPARK-22386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22386: Assignee: (was: Apache Spark) > Data Source V2 improvements > --- > > Key: SPARK-22386 > URL: https://issues.apache.org/jira/browse/SPARK-22386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22386) Data Source V2 improvements
[ https://issues.apache.org/jira/browse/SPARK-22386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22386: Assignee: Apache Spark > Data Source V2 improvements > --- > > Key: SPARK-22386 > URL: https://issues.apache.org/jira/browse/SPARK-22386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338070#comment-16338070 ] Bryan Cutler commented on SPARK-22711: -- Hi [~PrateekRM], here is your code trimmed down to where the problem is. It seems like CloudPickle in pyspark is having trouble with wordnet {code} from pyspark import SparkContext from nltk.corpus import wordnet as wn def to_synset(word): return str(wn.synsets(word)) sc = SparkContext(appName="Text Rank") rdd = sc.parallelize(["cat", "dog"]) print(rdd.map(to_synset).collect()) {code} I can look into it, but as a workaround if you import wordnet in your function, it seems to work fine {code} def to_synset(word): from nltk.corpus import wordnet as wn return str(wn.synsets(word)) {code} > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.2.0, 2.2.1 > Environment: Ubuntu pseudo distributed installation of Spark 2.2.0 >Reporter: Prateek >Priority: Major > Attachments: Jira_Spark_minimized_code.py > > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a Pyspark program with spark-submit command this error is > thrown. 
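Bryan's workaround above generalizes: cloudpickle serializes a function together with the module-level globals it references, so a global that cannot be pickled (nltk's lazily loaded wordnet reader, in this case) poisons every closure using it, while re-creating the dependency inside the function keeps it out of the pickled state. The sketch below reproduces the pattern with a plain lambda standing in for the unpicklable wordnet object (nltk is not assumed to be installed) and uses the standard pickle module rather than cloudpickle.

```python
import pickle

# A module-level object the stdlib pickler cannot serialize; it plays the
# role of nltk's wordnet corpus reader in the report above.
unpicklable = lambda w: w.upper()


def to_synset_bad(word):
    # References the module-level global, so shipping this function to a
    # worker would require pickling `unpicklable` as well.
    return unpicklable(word)


try:
    pickle.dumps(unpicklable)
    pickling_failed = False
except Exception:
    pickling_failed = True


def to_synset_good(word):
    # The workaround: bind the dependency inside the function, so each
    # worker re-creates it instead of receiving it via pickle.
    f = lambda w: w.upper()
    return f(word)


print(pickling_failed, to_synset_good("cat"))
```

In the PySpark case, moving `from nltk.corpus import wordnet as wn` inside the mapped function plays the role of the local binding here.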
[jira] [Commented] (SPARK-23203) DataSourceV2 should use immutable trees.
[ https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338063#comment-16338063 ] Marcelo Vanzin commented on SPARK-23203: [~rdblue] is this something that affects the API or is it just an internal change? (Or, in other words, "should it block 2.3"?) > DataSourceV2 should use immutable trees. > > > Key: SPARK-23203 > URL: https://issues.apache.org/jira/browse/SPARK-23203 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > The DataSourceV2 integration doesn't use [immutable > trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html], > which is a basic requirement of Catalyst. The v2 relation should not wrap a > mutable reader and change the logical plan by pushing projections and filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23203) DataSourceV2 should use immutable trees.
Ryan Blue created SPARK-23203: - Summary: DataSourceV2 should use immutable trees. Key: SPARK-23203 URL: https://issues.apache.org/jira/browse/SPARK-23203 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Environment: The DataSourceV2 integration doesn't use [immutable trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html], which is a basic requirement of Catalyst. The v2 relation should not wrap a mutable reader and change the logical plan by pushing projections and filters. Reporter: Ryan Blue -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23203) DataSourceV2 should use immutable trees.
[ https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-23203: -- Environment: (was: The DataSourceV2 integration doesn't use [immutable trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html], which is a basic requirement of Catalyst. The v2 relation should not wrap a mutable reader and change the logical plan by pushing projections and filters.) > DataSourceV2 should use immutable trees. > > > Key: SPARK-23203 > URL: https://issues.apache.org/jira/browse/SPARK-23203 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23203) DataSourceV2 should use immutable trees.
[ https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-23203: -- Description: The DataSourceV2 integration doesn't use [immutable trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html], which is a basic requirement of Catalyst. The v2 relation should not wrap a mutable reader and change the logical plan by pushing projections and filters. > DataSourceV2 should use immutable trees. > > > Key: SPARK-23203 > URL: https://issues.apache.org/jira/browse/SPARK-23203 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > The DataSourceV2 integration doesn't use [immutable > trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html], > which is a basic requirement of Catalyst. The v2 relation should not wrap a > mutable reader and change the logical plan by pushing projections and filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23117) SparkR 2.3 QA: Check for new R APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-23117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338043#comment-16338043 ] Felix Cheung commented on SPARK-23117: -- I'm ok to sign off if we don't have examples for SPARK-20307 or SPARK-21381. Perhaps this is something we should explain more in the ML guide, since the changes go into the Python and Scala APIs as well. > SparkR 2.3 QA: Check for new R APIs requiring example code > -- > > Key: SPARK-23117 > URL: https://issues.apache.org/jira/browse/SPARK-23117 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Major > > Audit list of new features added to MLlib's R API, and see which major items > are missing example code (in the examples folder). We do not need examples > for everything, only for major items such as new algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338032#comment-16338032 ] Justin Miller commented on SPARK-17147: --- I'm also seeing this behavior on a topic that has cleanup.policy=delete. The volume on this topic is very large, > 10 billion messages per day, and it seems to happen about once per day. Another topic with lower volume but larger messages happens every few days. 18/01/23 18:30:10 WARN TaskSetManager: Lost task 28.0 in stage 26.0 (TID 861, ,executor 15): java.lang.AssertionError: assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662 18/01/23 18:30:12 INFO TaskSetManager: Lost task 28.1 in stage 26.0 (TID 865) on ,executor 24: java.lang.AssertionError (assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662) [duplicate 1] 18/01/23 18:30:14 INFO TaskSetManager: Lost task 28.2 in stage 26.0 (TID 866) on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662) [duplicate 2] 18/01/23 18:30:15 INFO TaskSetManager: Lost task 28.3 in stage 26.0 (TID 867) on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662) [duplicate 3] 18/01/23 18:30:18 WARN TaskSetManager: Lost task 28.0 in stage 27.0 (TID 898, ,executor 6): java.lang.AssertionError: assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662 18/01/23 18:30:19 INFO TaskSetManager: Lost task 28.1 in stage 27.0 (TID 900) on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662) [duplicate 1] 
18/01/23 18:30:20 INFO TaskSetManager: Lost task 28.2 in stage 27.0 (TID 901) on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662) [duplicate 2] 18/01/23 18:30:21 INFO TaskSetManager: Lost task 28.3 in stage 27.0 (TID 902) on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong record for -8002 124 even after seeking to offset 1769485661 got back record.offset 1769485662) [duplicate 3] When checked with kafka-simple-consumer-shell, the offset is in fact missing: next offset = 1769485661 next offset = 1769485663 next offset = 1769485664 next offset = 1769485665 I'm currently testing out this branch in the persister and will post if it crashes again over the next few days (I currently have the kafka-10 source from the branch with a few extra log lines deployed). We're currently on log format 0.10.2 (upgraded yesterday) but saw the same issue on 0.9.0.0. chao.wu - Is this behavior similar to what you're seeing? > Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > (i.e. Log Compaction) > -- > > Key: SPARK-17147 > URL: https://issues.apache.org/jira/browse/SPARK-17147 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: Robert Conrad >Priority: Major > > When Kafka does log compaction, offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset > will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
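The fix described above can be modeled abstractly: on a compacted topic the returned offsets have gaps, so a consumer that advances by `previous + 1` and asserts on the returned offset fails exactly like the "Got wrong record" errors quoted earlier, while advancing by `record.offset + 1` walks the gaps cleanly. `Record` and `read_from` below are toy stand-ins for a Kafka seek-and-poll, not the kafka-clients API.

```python
from dataclasses import dataclass


@dataclass
class Record:
    offset: int
    value: str


# A compacted log: offset 2 was removed by log compaction.
log = [Record(0, "a"), Record(1, "b"), Record(3, "d"), Record(4, "e")]


def read_from(log, offset):
    """Return the first record at or after `offset` (a seek + poll)."""
    return next((r for r in log if r.offset >= offset), None)


def consume_consecutive(log):
    # Old logic: assume the record returned is exactly the one requested.
    out, next_offset = [], 0
    while (r := read_from(log, next_offset)) is not None:
        assert r.offset == next_offset, (
            f"Got wrong record even after seeking to offset {next_offset} "
            f"got back record.offset {r.offset}")
        out.append(r.value)
        next_offset += 1                  # nextOffset = offset + 1
    return out


def consume_by_record_offset(log):
    # Fixed logic: advance from the offset actually returned.
    out, next_offset = [], 0
    while (r := read_from(log, next_offset)) is not None:
        out.append(r.value)
        next_offset = r.offset + 1        # nextOffset = record.offset + 1
    return out


print(consume_by_record_offset(log))      # ['a', 'b', 'd', 'e']
```

`consume_consecutive(log)` raises an AssertionError at the gap (offset 2), mirroring the failed assertion in CachedKafkaConsumer; `consume_by_record_offset` returns every surviving record.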
[jira] [Assigned] (SPARK-22297) Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf"
[ https://issues.apache.org/jira/browse/SPARK-22297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22297: -- Assignee: Mark Petruska > Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts > conf" > - > > Key: SPARK-22297 > URL: https://issues.apache.org/jira/browse/SPARK-22297 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Mark Petruska >Priority: Minor > Fix For: 2.4.0 > > > Ran into this locally; the test code seems to use timeouts which generally > end up in flakiness like this. > {noformat} > [info] - SPARK-20640: Shuffle registration timeout and maxAttempts conf are > working *** FAILED *** (1 second, 203 milliseconds) > [info] "Unable to register with external shuffle server due to : > java.util.concurrent.TimeoutException: Timeout waiting for task." did not > contain "test_spark_20640_try_again" (BlockManagerSuite.scala:1370) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply$mcV$sp(BlockManagerSuite.scala:1370) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22297) Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf"
[ https://issues.apache.org/jira/browse/SPARK-22297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22297. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 19671 [https://github.com/apache/spark/pull/19671] > Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts > conf" > - > > Key: SPARK-22297 > URL: https://issues.apache.org/jira/browse/SPARK-22297 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Minor > Fix For: 2.4.0 > > > Ran into this locally; the test code seems to use timeouts which generally > end up in flakiness like this. > {noformat} > [info] - SPARK-20640: Shuffle registration timeout and maxAttempts conf are > working *** FAILED *** (1 second, 203 milliseconds) > [info] "Unable to register with external shuffle server due to : > java.util.concurrent.TimeoutException: Timeout waiting for task." did not > contain "test_spark_20640_try_again" (BlockManagerSuite.scala:1370) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply$mcV$sp(BlockManagerSuite.scala:1370) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338023#comment-16338023 ] Marcelo Vanzin commented on SPARK-23020: Argh. Feel free to disable it in branch-2.3; please leave it on on master so we can get more info while I look at it. > Re-enable Flaky Test: > org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.3.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338016#comment-16338016 ] Sameer Agarwal commented on SPARK-23020: FYI The {{SparkLauncherSuite}} test is still failing occasionally (a lot less common though): https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/142/testReport/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ > Re-enable Flaky Test: > org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.3.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23152: - Assignee: Matthew Tovbin > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Assignee: Matthew Tovbin >Priority: Minor > Labels: easyfix > Fix For: 2.4.0 > > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} > > The problem happens due to an incorrect guard condition in function > getNumClasses at org.apache.spark.ml.classification.Classifier:106 > {code:java} > val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) > if (maxLabelRow.isEmpty) { > throw new SparkException("ML algorithm was given empty dataset.") > } > 
{code} > When the input data is empty, the resulting "maxLabelRow" array is not empty; > instead it contains a single Row(null) element. > > Proposed solution: the condition can be modified to verify that as well. > {code:java} > if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23152. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20321 [https://github.com/apache/spark/pull/20321] > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Assignee: Matthew Tovbin >Priority: Minor > Labels: easyfix > Fix For: 2.4.0 > > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} > > The problem happens due to an incorrect guard condition in function > getNumClasses at org.apache.spark.ml.classification.Classifier:106 > {code:java} > val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) > if 
(maxLabelRow.isEmpty) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > When the input data is empty, the resulting "maxLabelRow" array is not empty; > instead it contains a single Row(null) element. > > Proposed solution: the condition can be modified to verify that as well. > {code:java} > if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
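The null-check half of the corrected guard can be exercised without Spark using a minimal stand-in (the `Row` record below is a hypothetical substitute for org.apache.spark.sql.Row, not the real class). The key fact it encodes is that SQL max() over an empty input yields one row containing null, not zero rows:

```java
import java.util.List;

public class Main {
    // Hypothetical stand-in for org.apache.spark.sql.Row holding one value.
    record Row(Object value) {}

    // Mirrors the proposed guard: reject both a truly empty result and
    // the single-Row(null) result produced by max() over an empty dataset.
    static void requireNonEmptyLabels(List<Row> maxLabelRow) {
        if (maxLabelRow.isEmpty() || maxLabelRow.get(0).value() == null) {
            throw new IllegalArgumentException("ML algorithm was given empty dataset.");
        }
    }

    public static void main(String[] args) {
        // What dataset.select(max(labelCol)).take(1) returns on empty input.
        List<Row> fromEmptyInput = List.of(new Row(null));
        try {
            requireNonEmptyLabels(fromEmptyInput);
        } catch (IllegalArgumentException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

With the original single condition, the Row(null) case would pass the guard and fail later with the misleading NullPointerException from getDouble.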
[jira] [Resolved] (SPARK-22837) Session timeout checker does not work in SessionManager
[ https://issues.apache.org/jira/browse/SPARK-22837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22837. - Resolution: Fixed Assignee: zuotingbing Fix Version/s: 2.3.0 > Session timeout checker does not work in SessionManager > --- > > Key: SPARK-22837 > URL: https://issues.apache.org/jira/browse/SPARK-22837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.2.1 >Reporter: zuotingbing >Assignee: zuotingbing >Priority: Major > Fix For: 2.3.0 > > > Currently, > {code:java} > SessionManager.init > {code} > is never called, so the session timeout configuration settings > {code:java} > HIVE_SERVER2_SESSION_CHECK_INTERVAL HIVE_SERVER2_IDLE_SESSION_TIMEOUT > HIVE_SERVER2_IDLE_SESSION_CHECK_OPERATION > {code} > cannot be loaded, and as a result the session timeout > checker does not work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23189) reflect stage level blacklisting on executor tab
[ https://issues.apache.org/jira/browse/SPARK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338005#comment-16338005 ] Thomas Graves commented on SPARK-23189: --- For large jobs the specific stage page is a pain to navigate. Most of the time when I go there I immediately shrink the executors table just to be able to see the task table more easily. All of those pages are, in my opinion, a pain to navigate and find things in, which is why we had a PR to switch to DataTables; it got back-burnered at Vanzin's request because of changes in the history server. That work is done now, so I think the switch could be made, but I don't have time at the moment. But that is just one reason. Most of the time I take a quick look at the job or stages tab (the list of all stages) to see what is happening at a high level, jump to the executors tab to see what is going on there at a high level, and then load the specific stage page only if needed. It might just be the way I use it. It also depends on what the user is complaining about as to where I go first. I think if we made the stage page more usable that might change. Like I already said, I'm fine with waiting on this. > reflect stage level blacklisting on executor tab > - > > Key: SPARK-23189 > URL: https://issues.apache.org/jira/browse/SPARK-23189 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.1 >Reporter: Attila Zsolt Piros >Priority: Major > > This issue came up while working on SPARK-22577, where the conclusion was > that not only should the stage tab reflect stage- and application-level > blacklisting, but the executor tab should also be extended with stage-level > blacklisting information. > As [~irashid] and [~tgraves] discussed, the blacklisted stages should be > listed for an executor like "*stage[ , ,...]*". One idea was to list only the > most recent 3 blacklisted stages; another was to list all the active > stages which are blacklisted. 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23189) reflect stage level blacklisting on executor tab
[ https://issues.apache.org/jira/browse/SPARK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337957#comment-16337957 ] Imran Rashid commented on SPARK-23189: -- [~tgraves] -- why do you use the executors page instead of the stage page for the active stage? Is it because you're typically running lots of jobs simultaneously? Or is it just because it's easier to navigate directly to the executors page -- no need to have a jobId / stageId? If it's just about having a convenient spot to navigate to, we could easily add a "lastStage" / "lastJob" redirect; that would be pretty trivial. Or maybe there is some other summary view we're missing to capture the current state of the cluster -- the "executors" page may be close to what you want to know, but perhaps we're actually missing something else. I'm not trying to block this change or anything, I just want to avoid UI clutter and think a bit about the right way to add this. > reflect stage level blacklisting on executor tab > - > > Key: SPARK-23189 > URL: https://issues.apache.org/jira/browse/SPARK-23189 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.1 >Reporter: Attila Zsolt Piros >Priority: Major > > This issue came up while working on SPARK-22577, where the conclusion was > that not only should the stage tab reflect stage- and application-level > blacklisting, but the executor tab should also be extended with stage-level > blacklisting information. > As [~irashid] and [~tgraves] discussed, the blacklisted stages should be > listed for an executor like "*stage[ , ,...]*". One idea was to list only the > most recent 3 blacklisted stages; another was to list all the active > stages which are blacklisted. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-23115. -- Resolution: Fixed Assignee: Felix Cheung > SparkR 2.3 QA: New R APIs and API docs > -- > > Key: SPARK-23115 > URL: https://issues.apache.org/jira/browse/SPARK-23115 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Felix Cheung >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337947#comment-16337947 ] Felix Cheung commented on SPARK-23115: -- done > SparkR 2.3 QA: New R APIs and API docs > -- > > Key: SPARK-23115 > URL: https://issues.apache.org/jira/browse/SPARK-23115 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22577) executor page blacklist status should update with TaskSet level blacklisting
[ https://issues.apache.org/jira/browse/SPARK-22577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-22577: Assignee: Attila Zsolt Piros > executor page blacklist status should update with TaskSet level blacklisting > > > Key: SPARK-22577 > URL: https://issues.apache.org/jira/browse/SPARK-22577 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Thomas Graves >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 2.4.0 > > Attachments: app_blacklisting.png, node_blacklisting_for_stage.png, > stage_blacklisting.png > > > right now the executor blacklist status only updates with the > BlacklistTracker after a task set has finished and propagated the > blacklisting to the application level. We should change that to show at the > taskset level as well. Without this it can be very confusing to the user why > things aren't running. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22577) executor page blacklist status should update with TaskSet level blacklisting
[ https://issues.apache.org/jira/browse/SPARK-22577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-22577. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20203 [https://github.com/apache/spark/pull/20203] > executor page blacklist status should update with TaskSet level blacklisting > > > Key: SPARK-22577 > URL: https://issues.apache.org/jira/browse/SPARK-22577 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Thomas Graves >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 2.4.0 > > Attachments: app_blacklisting.png, node_blacklisting_for_stage.png, > stage_blacklisting.png > > > right now the executor blacklist status only updates with the > BlacklistTracker after a task set has finished and propagated the > blacklisting to the application level. We should change that to show at the > taskset level as well. Without this it can be very confusing to the user why > things aren't running. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337927#comment-16337927 ] Apache Spark commented on SPARK-23202: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/20386 > Break down DataSourceV2Writer.commit into two phase > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Priority: Major > > Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a > write job with a list of commit messages. > That makes sense in some scenarios, e.g. MicroBatchExecution. > However, on receiving a commit message, the driver can start processing > messages (e.g. persisting them into files) before all the messages are > collected. > The proposal is to break down DataSourceV2Writer.commit into two phases: > # add(WriterCommitMessage message): Handles a commit message produced by > \{@link DataWriter#commit()}. > # commit(): Commits the write job. > This should make the API more flexible, and more reasonable for implementing > some data sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23202: Assignee: (was: Apache Spark) > Break down DataSourceV2Writer.commit into two phase > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Priority: Major > > Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a > write job with a list of commit messages. > That makes sense in some scenarios, e.g. MicroBatchExecution. > However, on receiving a commit message, the driver can start processing > messages (e.g. persisting them into files) before all the messages are > collected. > The proposal is to break down DataSourceV2Writer.commit into two phases: > # add(WriterCommitMessage message): Handles a commit message produced by > \{@link DataWriter#commit()}. > # commit(): Commits the write job. > This should make the API more flexible, and more reasonable for implementing > some data sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23202: Assignee: Apache Spark > Break down DataSourceV2Writer.commit into two phase > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a > write job with a list of commit messages. > That makes sense in some scenarios, e.g. MicroBatchExecution. > However, on receiving a commit message, the driver can start processing > messages (e.g. persisting them into files) before all the messages are > collected. > The proposal is to break down DataSourceV2Writer.commit into two phases: > # add(WriterCommitMessage message): Handles a commit message produced by > \{@link DataWriter#commit()}. > # commit(): Commits the write job. > This should make the API more flexible, and more reasonable for implementing > some data sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase
Gengliang Wang created SPARK-23202: -- Summary: Break down DataSourceV2Writer.commit into two phase Key: SPARK-23202 URL: https://issues.apache.org/jira/browse/SPARK-23202 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.1 Reporter: Gengliang Wang Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a write job with a list of commit messages. That makes sense in some scenarios, e.g. MicroBatchExecution. However, on receiving a commit message, the driver can start processing messages (e.g. persisting them into files) before all the messages are collected. The proposal is to break down DataSourceV2Writer.commit into two phases: # add(WriterCommitMessage message): Handles a commit message produced by \{@link DataWriter#commit()}. # commit(): Commits the write job. This should make the API more flexible, and more reasonable for implementing some data sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
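A minimal sketch of the proposed split (plain Java; the class and method names below are hypothetical illustrations, not the actual Spark API). Phase one lets the driver handle each task's commit message as it arrives; phase two finalizes the job once all messages are in:

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    interface WriterCommitMessage {}
    record FileCommitted(String path) implements WriterCommitMessage {}

    // Hypothetical writer following the proposed two-phase protocol.
    static class IncrementalWriter {
        private final List<String> staged = new ArrayList<>();

        // Phase 1: called per message; the driver may start incremental work
        // (e.g. persisting the message to a log) before all tasks finish.
        void add(WriterCommitMessage m) {
            staged.add(((FileCommitted) m).path());
        }

        // Phase 2: called once, after every message has been delivered via add().
        void commit() {
            System.out.println("committed " + staged.size() + " files");
        }
    }

    public static void main(String[] args) {
        IncrementalWriter w = new IncrementalWriter();
        w.add(new FileCommitted("part-00000"));
        w.add(new FileCommitted("part-00001"));
        w.commit(); // prints: committed 2 files
    }
}
```

The design point is that the single `commit(WriterCommitMessage[])` call forces the driver to buffer the whole message array, whereas the split lets implementations stream work per message.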
[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337875#comment-16337875 ] Prateek commented on SPARK-22711: - I don't understand what you mean. updated code is in attachment [~seemab] > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.2.0, 2.2.1 > Environment: Ubuntu pseudo distributed installation of Spark 2.2.0 >Reporter: Prateek >Priority: Major > Attachments: Jira_Spark_minimized_code.py > > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a Pyspark program with spark-submit command this error is > thrown. > It happens when for code like below > RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v) > Traceback (most recent call last): > File "/home/prateek/Project/textrank.py", line 299, in > summaryRDD = sentenceTokensReduceRDD.map(lambda m: > get_summary(m)).reduceByKey(lambda c,v :c+v) > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, > in reduceByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, > in combineByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, > in partitionBy > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, > in _jrdd > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, > in _wrap_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, > in _prepare_for_python_RDD > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 460, in dumps > File 
"/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 704, in dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 148, in dump > File "/usr/lib/python3.5/pickle.py", line 408, in dump > self.save(obj) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", 
line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f
[jira] [Assigned] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field
[ https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21396: Assignee: Apache Spark > Spark Hive Thriftserver doesn't return UDT field > > > Key: SPARK-21396 > URL: https://issues.apache.org/jira/browse/SPARK-21396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Haopu Wang >Assignee: Apache Spark >Priority: Major > Labels: Hive, ThriftServer2, user-defined-type > > I want to query a table with a MLLib Vector field and get below exception. > Can Spark Hive Thriftserver be enhanced to return UDT field? > == > 2017-07-13 13:14:25,435 WARN > [org.apache.hive.service.cli.thrift.ThriftCLIService] > (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: > java.lang.RuntimeException: scala.MatchError: > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class > org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy29.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) 
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: scala.MatchError: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 > (of class org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(SparkExecuteStatementOperation.scala:80) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:144) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220) > at > org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685) > at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) > ... 18 more -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23195) Hint of cached data is lost
[ https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23195: Assignee: Apache Spark (was: Xiao Li) > Hint of cached data is lost > --- > > Key: SPARK-23195 > URL: https://issues.apache.org/jira/browse/SPARK-23195 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > {noformat} > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { > val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value") > val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value") > broadcast(df2).cache() > df2.collect() > val df3 = df1.join(df2, Seq("key"), "inner") > val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect { > case b: BroadcastHashJoinExec => b > }.size > assert(numBroadCastHashJoin === 1) > } > {noformat} > The broadcast hint is not respected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23195) Hint of cached data is lost
[ https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23195: Assignee: Xiao Li (was: Apache Spark) > Hint of cached data is lost > --- > > Key: SPARK-23195 > URL: https://issues.apache.org/jira/browse/SPARK-23195 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > {noformat} > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { > val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value") > val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value") > broadcast(df2).cache() > df2.collect() > val df3 = df1.join(df2, Seq("key"), "inner") > val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect { > case b: BroadcastHashJoinExec => b > }.size > assert(numBroadCastHashJoin === 1) > } > {noformat} > The broadcast hint is not respected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field
[ https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21396: Assignee: (was: Apache Spark) > Spark Hive Thriftserver doesn't return UDT field > > > Key: SPARK-21396 > URL: https://issues.apache.org/jira/browse/SPARK-21396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Haopu Wang >Priority: Major > Labels: Hive, ThriftServer2, user-defined-type > > I want to query a table with a MLLib Vector field and get below exception. > Can Spark Hive Thriftserver be enhanced to return UDT field? > == > 2017-07-13 13:14:25,435 WARN > [org.apache.hive.service.cli.thrift.ThriftCLIService] > (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: > java.lang.RuntimeException: scala.MatchError: > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class > org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy29.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) > at 
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: scala.MatchError: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 > (of class org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(SparkExecuteStatementOperation.scala:80) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:144) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220) > at > org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685) > at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) > ... 18 more -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field
[ https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337820#comment-16337820 ] Apache Spark commented on SPARK-21396: -- User 'atallahhezbor' has created a pull request for this issue: https://github.com/apache/spark/pull/20385 > Spark Hive Thriftserver doesn't return UDT field > > > Key: SPARK-21396 > URL: https://issues.apache.org/jira/browse/SPARK-21396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Haopu Wang >Priority: Major > Labels: Hive, ThriftServer2, user-defined-type > > I want to query a table with a MLLib Vector field and get below exception. > Can Spark Hive Thriftserver be enhanced to return UDT field? > == > 2017-07-13 13:14:25,435 WARN > [org.apache.hive.service.cli.thrift.ThriftCLIService] > (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: > java.lang.RuntimeException: scala.MatchError: > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class > org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy29.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) > at > 
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: scala.MatchError: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 > (of class org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(SparkExecuteStatementOperation.scala:80) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:144) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220) > at > org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685) > at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) > ... 18 more -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
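[editor's note] The MatchError in the traces above arises because SparkExecuteStatementOperation#addNonNullColumnValue matches on the column's data type and has no case for user-defined types such as VectorUDT. A minimal Python sketch of that failure mode and one possible fallback follows; the handler table and type names are hypothetical illustrations, not Spark's actual API.

```python
# Hypothetical sketch: a type-dispatch table in the spirit of the
# Thriftserver's addNonNullColumnValue, which aborts on any type it
# has no case for (the Python analogue of the scala.MatchError above).
handlers = {
    "IntegerType": int,
    "StringType": str,
}

def add_column_value(value, data_type):
    if data_type not in handlers:
        # A UDT-friendly fallback: render unknown types as text so the
        # fetch succeeds instead of raising. (One plausible fix shape;
        # the real PR may convert UDTs differently.)
        return str(value)
    return handlers[data_type](value)
```

With a registered type the specific handler runs; with an unregistered UDT the value still round-trips as a string instead of killing the result fetch.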
[jira] [Reopened] (SPARK-23195) Hint of cached data is lost
[ https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-23195: --- Since this is reverted, let's keep this open until the on-going PR is merged back. > Hint of cached data is lost > --- > > Key: SPARK-23195 > URL: https://issues.apache.org/jira/browse/SPARK-23195 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > {noformat} > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { > val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value") > val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value") > broadcast(df2).cache() > df2.collect() > val df3 = df1.join(df2, Seq("key"), "inner") > val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect { > case b: BroadcastHashJoinExec => b > }.size > assert(numBroadCastHashJoin === 1) > } > {noformat} > The broadcast hint is not respected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23195) Hint of cached data is lost
[ https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23195: -- Fix Version/s: (was: 2.3.1) > Hint of cached data is lost > --- > > Key: SPARK-23195 > URL: https://issues.apache.org/jira/browse/SPARK-23195 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > {noformat} > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { > val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value") > val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value") > broadcast(df2).cache() > df2.collect() > val df3 = df1.join(df2, Seq("key"), "inner") > val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect { > case b: BroadcastHashJoinExec => b > }.size > assert(numBroadCastHashJoin === 1) > } > {noformat} > The broadcast hint is not respected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13108) Encoding not working with non-ascii compatible encodings (UTF-16/32 etc.)
[ https://issues.apache.org/jira/browse/SPARK-13108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337775#comment-16337775 ] Rafael Cavazin commented on SPARK-13108: [~hyukjin.kwon] was this issue fixed? the PR was closed > Encoding not working with non-ascii compatible encodings (UTF-16/32 etc.) > - > > Key: SPARK-13108 > URL: https://issues.apache.org/jira/browse/SPARK-13108 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > This library uses Hadoop's > [{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java], > which uses > [{{LineRecordReader}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java]. > According to > [MAPREDUCE-232|https://issues.apache.org/jira/browse/MAPREDUCE-232], it looks > [{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java] > does not guarantee all encoding types but officially only UTF-8 (as > commented in > [{{LineRecordReader#L147}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L147]). > According to > [MAPREDUCE-232#comment-13183601|https://issues.apache.org/jira/browse/MAPREDUCE-232?focusedCommentId=13183601&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13183601], > it still looks fine with most encodings though but without UTF-16/32. 
> In more details, > I tested this in Max OS. I converted `cars_iso-8859-1.csv` into > `cars_utf-16.csv` as below: > {code} > iconv -f iso-8859-1 -t utf-16 < cars_iso-8859-1.csv > cars_utf-16.csv > {code} > and run the codes below: > {code} > val cars = "cars_utf-16.csv" > sqlContext.read > .format("csv") > .option("charset", "utf-16") > .option("delimiter", 'þ') > .load(cars) > .show() > {code} > This produces a wrong results below: > {code} > ++-+-++--+ > |year| make|model| comment|blank�| > ++-+-++--+ > |2012|Tesla|S| No comment| �| > | �| null| null|null| null| > |1997| Ford| E350|Go get one now th...| �| > |2015|Chevy|Volt�|null| null| > | �| null| null|null| null| > ++-+-++--+ > {code} > Instead of the correct results below: > {code} > ++-+-++-+ > |year| make|model| comment|blank| > ++-+-++-+ > |2012|Tesla|S| No comment| | > |1997| Ford| E350|Go get one now th...| | > |2015|Chevy| Volt|null| null| > ++-+-++-+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
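[editor's note] The root cause described above is that Hadoop's LineRecordReader splits raw bytes on 0x0A, which is only safe for ASCII-compatible encodings. The following self-contained Python sketch (no Hadoop involved) shows why a byte-level newline split mangles UTF-16 and produces stray-byte columns like those in the wrong results quoted above:

```python
# In little-endian UTF-16, '\n' is the two bytes 0x0A 0x00, so a
# single-byte split on 0x0A cuts the newline in half.
text = "year,make\n2012,Tesla\n"

# UTF-8 is ASCII-compatible: splitting the raw bytes on 0x0A is safe.
utf8_lines = text.encode("utf-8").split(b"\n")
assert utf8_lines[1].decode("utf-8") == "2012,Tesla"

# UTF-16: the split leaves the newline's trailing null byte glued to
# the start of every subsequent "line" (and, with a BOM-producing
# codec, the BOM glued to the first), which surfaces as the garbage
# columns in the output above.
utf16_lines = text.encode("utf-16-le").split(b"\n")
assert utf16_lines[1].startswith(b"\x00")   # leftover half of '\n'
```

A 0x0A byte can also occur as half of an unrelated code unit, so the reader may even split in the middle of a character; hence line-based readers must be encoding-aware to support UTF-16/32.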
[jira] [Commented] (SPARK-23195) Hint of cached data is lost
[ https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337771#comment-16337771 ] Apache Spark commented on SPARK-23195: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20384 > Hint of cached data is lost > --- > > Key: SPARK-23195 > URL: https://issues.apache.org/jira/browse/SPARK-23195 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.3.1 > > > {noformat} > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { > val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value") > val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value") > broadcast(df2).cache() > df2.collect() > val df3 = df1.join(df2, Seq("key"), "inner") > val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect { > case b: BroadcastHashJoinExec => b > }.size > assert(numBroadCastHashJoin === 1) > } > {noformat} > The broadcast hint is not respected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
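[editor's note] The mechanism behind the failing assertion above: when the planner substitutes the cached relation for the original plan node, hints attached to the original node (here, the broadcast hint) are not carried over. A toy Python model of that substitution and the carry-over fix; all names here are hypothetical illustrations, not Spark internals.

```python
class Plan:
    """Toy query-plan node carrying optional hints such as 'broadcast'."""
    def __init__(self, name, hints=()):
        self.name = name
        self.hints = tuple(hints)

def substitute_cached(plan, cache):
    """Replace a plan with its cached equivalent, keeping its hints."""
    cached = cache.get(plan.name)
    if cached is None:
        return plan
    # The buggy substitution returned `cached` as-is, silently dropping
    # plan.hints -- so BroadcastHashJoinExec was never chosen. The fix
    # is to re-attach the original node's hints to the cached plan.
    return Plan(cached.name, hints=plan.hints + cached.hints)
```

Under the buggy behavior the substituted plan has an empty hint set even though the user wrote `broadcast(df2)`; re-attaching the hints restores the expected single BroadcastHashJoinExec.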
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337760#comment-16337760 ] Arvind Jajoo commented on SPARK-15348: -- I think that in order to have an end-to-end streaming ETL implementation within Spark, this feature needs to be supported in Spark SQL now, especially after Structured Streaming; i.e., the MERGE INTO statement should be runnable directly from Spark SQL for batch or micro-batch incremental updates. Currently, this needs to be done outside of Spark using Hive, but that breaks the end-to-end streaming ETL semantics. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of Hive's transactional tables: > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > Also it seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
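[editor's note] For context on the comment above: MERGE INTO is an upsert, where source rows that match a target key update the target row and unmatched source rows are inserted. A minimal sketch of those semantics over in-memory rows (pure illustration of the statement's behavior, not Spark or Hive API):

```python
def merge_into(target, source, update_cols):
    """Upsert in the spirit of SQL MERGE INTO, over dicts keyed by the
    join key: matched keys are updated, unmatched keys are inserted."""
    for key, row in source.items():
        if key in target:
            # WHEN MATCHED THEN UPDATE SET <update_cols>
            target[key].update({c: row[c] for c in update_cols})
        else:
            # WHEN NOT MATCHED THEN INSERT
            target[key] = dict(row)
    return target
```

Incremental micro-batch updates are exactly repeated applications of this operation against the target table, which is why the commenter wants it expressible in Spark SQL directly.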
[jira] [Resolved] (SPARK-22784) Configure reading buffer size in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22784. --- Resolution: Won't Fix > Configure reading buffer size in Spark History Server > - > > Key: SPARK-22784 > URL: https://issues.apache.org/jira/browse/SPARK-22784 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Mikhail Erofeev >Priority: Minor > Attachments: replay-baseline.svg > > > Motivation: > Our Spark History Server spends most of the backfill time inside > BufferedReader and StringBuffer. It happens because the average line size of our > events is ~1,500,000 chars (due to a lot of partitions and iterations), > whereas the default buffer size is 2048 bytes. See the attached flame graph. > Implementation: > I've added logging of spent time and line size for each job. > Parametrised ReplayListenerBus with a new buffer size parameter. > Measured the best buffer size: 20x the average line size (30 MB) gives a 32% > speedup in a local test. > Result: > Backfill of Spark History and reading to the cache will be up to 30% faster > after tuning. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
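[editor's note] The effect measured in the ticket above is reproducible outside Spark: when every line is far larger than the read buffer, the reader must refill (and copy) many times per line. A plain-Python sketch of ReplayListenerBus-style line reading with a configurable buffer; the file contents and buffer sizes are made up for illustration.

```python
import os
import tempfile

def count_lines(path, buffer_size):
    # Line-by-line replay with a configurable read buffer. A buffer far
    # smaller than the line length (2 KB vs ~100 KB here) forces many
    # refills per event line, which is the overhead the ticket profiled.
    with open(path, "rb", buffering=buffer_size) as f:
        return sum(1 for _ in f)

# Fake "event log" with three ~100 KB lines, standing in for the
# ~1,500,000-char event lines described in the report.
fd, log = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write((b"x" * 100_000 + b"\n") * 3)

small = count_lines(log, 2048)             # default-sized buffer
large = count_lines(log, 2 * 1024 * 1024)  # buffer sized past a line
os.remove(log)
```

Both calls return the same line count; the difference the ticket measured is wall-clock time, and the reported sweet spot was a buffer around 20x the average line size.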