[jira] [Updated] (SPARK-24689) java.io.NotSerializableException: org.apache.spark.mllib.clustering.DistributedLDAModel
[ https://issues.apache.org/jira/browse/SPARK-24689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] konglingbo updated SPARK-24689: --- Attachment: @CLZ98635A644[_edx...@e.png > java.io.NotSerializableException: > org.apache.spark.mllib.clustering.DistributedLDAModel > --- > > Key: SPARK-24689 > URL: https://issues.apache.org/jira/browse/SPARK-24689 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.1 > Environment: !image-2018-06-29-13-42-30-255.png! >Reporter: konglingbo >Priority: Blocker > Attachments: @CLZ98635A644[_edx...@e.png > > > scala> val predictionAndLabels=testing.map{case LabeledPoint(label,features)=> > | val prediction = model.predict(features) > | (prediction, label) > | } -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24689) java.io.NotSerializableException: org.apache.spark.mllib.clustering.DistributedLDAModel
[ https://issues.apache.org/jira/browse/SPARK-24689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] konglingbo updated SPARK-24689: --- Description: scala> val predictionAndLabels=testing.map{case LabeledPoint(label,features)=> | val prediction = model.predict(features) | (prediction, label) | } was:!image-2018-06-29-13-41-55-635.png! > java.io.NotSerializableException: > org.apache.spark.mllib.clustering.DistributedLDAModel > --- > > Key: SPARK-24689 > URL: https://issues.apache.org/jira/browse/SPARK-24689 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.1 > Environment: !image-2018-06-29-13-42-30-255.png! >Reporter: konglingbo >Priority: Blocker > > scala> val predictionAndLabels=testing.map{case LabeledPoint(label,features)=> > | val prediction = model.predict(features) > | (prediction, label) > | } -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24689) java.io.NotSerializableException: org.apache.spark.mllib.clustering.DistributedLDAModel
konglingbo created SPARK-24689: -- Summary: java.io.NotSerializableException: org.apache.spark.mllib.clustering.DistributedLDAModel Key: SPARK-24689 URL: https://issues.apache.org/jira/browse/SPARK-24689 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.3.1 Environment: !image-2018-06-29-13-42-30-255.png! Reporter: konglingbo !image-2018-06-29-13-41-55-635.png!
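For context, this exception typically appears when a driver-side model object is captured by an RDD closure, as in the scala> snippet above: task serialization then tries to serialize the model itself. A minimal sketch of the usual workaround, assuming the in-scope model is a DistributedLDAModel (variable names below are illustrative, not from the report):

```scala
// Sketch only: DistributedLDAModel keeps its state in RDDs and is not
// serializable, so referencing it inside an RDD closure fails at task
// serialization time. Converting it to a local model first is the
// common workaround.
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}

val distributedModel: DistributedLDAModel = ???          // trained on the driver
val localModel: LocalLDAModel = distributedModel.toLocal // local, serializable copy

// localModel can now be referenced from closures, e.g. scoring documents:
// val topics = localModel.topicDistributions(documents)
```

More generally, only the serializable pieces a closure actually needs should be pulled into it; any driver-side handle like a DistributedLDAModel left in scope will be dragged along by the closure cleaner.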
[jira] [Commented] (SPARK-23636) [SPARK 2.2] | Kafka Consumer | KafkaUtils.createRDD throws Exception - java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
[ https://issues.apache.org/jira/browse/SPARK-23636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527204#comment-16527204 ] Ted Yu commented on SPARK-23636: It seems in KafkaDataConsumer#close : {code} def close(): Unit = consumer.close() {code} The code should catch ConcurrentModificationException and try closing the consumer again. > [SPARK 2.2] | Kafka Consumer | KafkaUtils.createRDD throws Exception - > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > - > > Key: SPARK-23636 > URL: https://issues.apache.org/jira/browse/SPARK-23636 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1, 2.2.0 >Reporter: Deepak >Priority: Major > Labels: performance > > h2. > h2. Summary > > While using the KafkaUtils.createRDD API - we receive below listed error, > specifically when 1 executor connects to 1 kafka topic-partition, but with > more than 1 core & fetches an Array(OffsetRanges) > > _I've tagged this issue to "Structured Streaming" - as I could not find a > more appropriate component_ > > > h2. 
Error Faced > {noformat} > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access{noformat} > Stack Trace > {noformat} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in > stage 1.0 (TID 17, host, executor 16): > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1629) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1528) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1508) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.close(CachedKafkaConsumer.scala:59) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer$.remove(CachedKafkaConsumer.scala:185) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.(KafkaRDD.scala:204) > at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:181) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323){noformat} > > > h2. Config Used to simulate the error > A session with : > * Executors - 1 > * Cores - 2 or More > * Kafka Topic - has only 1 partition > * While fetching - More than one Array of Offset Range , Example > {noformat} > Array(OffsetRange("kafka_topic",0,608954201,608954202), > OffsetRange("kafka_topic",0,608954202,608954203) > ){noformat} > > > h2. Was this approach working before? > > This was working in spark 1.6.2 > However, from spark 2.1 onwards - the approach throws exception > > > h2. Why are we fetching from kafka as mentioned above. > > This gives us the capability to establish a connection to Kafka Broker for > every spark executor's core, thus each core can fetch/process its own set of > messages based on the specified (offset ranges). > > > > h2. 
Sample Code > > {quote}scala snippet - on versions spark 2.2.0 or 2.1.0 > // Bunch of imports > import kafka.serializer.\{DefaultDecoder, StringDecoder} > import org.apache.avro.generic.GenericRecord > import org.apache.kafka.clients.consumer.ConsumerRecord > import org.apache.kafka.common.serialization._ > import org.apache.spark.rdd.RDD > import org.apache.spark.sql.\{DataFrame, Row, SQLContext} > import org.apache.spark.sql.Row > import org.apache.spark.sql.hive.HiveContext > import org.apache.spark.sql.types.\{StringType, StructField, StructType} > import org.apache.spark.streaming.kafka010._ > import org.apache.spark.streaming.kafka010.KafkaUtils._ > {quote} > {quote}// This forces two connections - from a single executor - to > topic-partition . > // And with 2 cores assigned to 1 executor : each core has a task - pulling > respective offsets : OffsetRange("kafka_topic",0,1,2) & > OffsetRange("kafka_topic",0,2,3) > val parallelizedRanges = Array(OffsetRange("kafka_topic",0,1,2), // Fetching > sample 2 records > OffsetRange("kafka_topic",0,2,3) // Fetching sample 2 records > ) > > // Initiate kafka properties > val kafkaParams1: java.util.Map[String, Object] = new java.util.HashMap() > // kafkaParams1.put("key","val") add all the parameters such as broker, > topic Not listing every property here. > > // Create RDD > val rDDConsumerRec: RDD[ConsumerRecord[String, String]] = > createRDD[String, String](sparkContext > , kafkaParams1, parallelizedRanges, LocationStrategies.PreferConsistent) >
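The fix suggested in Ted Yu's comment above could look roughly like the following. This is a sketch against spark-streaming-kafka-0-10's KafkaDataConsumer, not a tested patch, and the assumption that a single immediate retry suffices is exactly that — an assumption:

```scala
import java.util.ConcurrentModificationException

// Sketch of the suggested change: instead of letting close() propagate the
// cross-thread error and fail the task, catch it and retry the close.
def close(): Unit = {
  try {
    consumer.close()
  } catch {
    case _: ConcurrentModificationException =>
      // Another thread currently holds the consumer. Retrying once is an
      // assumption; a real patch may need to wait, or mark the consumer
      // for deferred close by its owning thread.
      consumer.close()
  }
}
```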
[jira] [Updated] (SPARK-24535) Fix java version parsing in SparkR
[ https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-24535: - Target Version/s: 2.3.2 > Fix java version parsing in SparkR > -- > > Key: SPARK-24535 > URL: https://issues.apache.org/jira/browse/SPARK-24535 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1, 2.4.0 >Reporter: Shivaram Venkataraman >Assignee: Felix Cheung >Priority: Blocker > > We see errors on CRAN of the form > {code:java} > java version "1.8.0_144" > Java(TM) SE Runtime Environment (build 1.8.0_144-b01) > Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode) > Picked up _JAVA_OPTIONS: -XX:-UsePerfData > -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21) > -- > subscript out of bounds > 1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, > sparkConfig = sparkRTestConfig) at > D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21 > 2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, > sparkExecutorEnvMap, > sparkJars, sparkPackages) > 3: checkJavaVersion() > 4: strsplit(javaVersionFilter[[1]], "[\"]") > {code} > The complete log file is at > http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24535) Fix java version parsing in SparkR
[ https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-24535: - Priority: Blocker (was: Major) > Fix java version parsing in SparkR > -- > > Key: SPARK-24535 > URL: https://issues.apache.org/jira/browse/SPARK-24535 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1, 2.4.0 >Reporter: Shivaram Venkataraman >Priority: Blocker > > We see errors on CRAN of the form > {code:java} > java version "1.8.0_144" > Java(TM) SE Runtime Environment (build 1.8.0_144-b01) > Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode) > Picked up _JAVA_OPTIONS: -XX:-UsePerfData > -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21) > -- > subscript out of bounds > 1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, > sparkConfig = sparkRTestConfig) at > D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21 > 2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, > sparkExecutorEnvMap, > sparkJars, sparkPackages) > 3: checkJavaVersion() > 4: strsplit(javaVersionFilter[[1]], "[\"]") > {code} > The complete log file is at > http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24535) Fix java version parsing in SparkR
[ https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung reassigned SPARK-24535: Assignee: Felix Cheung > Fix java version parsing in SparkR > -- > > Key: SPARK-24535 > URL: https://issues.apache.org/jira/browse/SPARK-24535 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1, 2.4.0 >Reporter: Shivaram Venkataraman >Assignee: Felix Cheung >Priority: Blocker > > We see errors on CRAN of the form > {code:java} > java version "1.8.0_144" > Java(TM) SE Runtime Environment (build 1.8.0_144-b01) > Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode) > Picked up _JAVA_OPTIONS: -XX:-UsePerfData > -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21) > -- > subscript out of bounds > 1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, > sparkConfig = sparkRTestConfig) at > D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21 > 2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, > sparkExecutorEnvMap, > sparkJars, sparkPackages) > 3: checkJavaVersion() > 4: strsplit(javaVersionFilter[[1]], "[\"]") > {code} > The complete log file is at > http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
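The failing R code in the traceback above (`strsplit(javaVersionFilter[[1]], "[\"]")`) assumes the version string is on the first line of `java -version` output, but a `Picked up _JAVA_OPTIONS:` banner can precede it, which makes the subscript go out of bounds. A hedged sketch of the defensive parse — shown in Scala for consistency with the rest of this digest; the real fix lives in SparkR's R code:

```scala
// Pick the line that actually contains the version instead of blindly
// taking the first line of `java -version` output.
val javaVersionOutput = Seq(
  "Picked up _JAVA_OPTIONS: -XX:-UsePerfData",   // banner line that breaks naive parsing
  "java version \"1.8.0_144\"",
  "Java(TM) SE Runtime Environment (build 1.8.0_144-b01)"
)
val versionLine = javaVersionOutput.find(_.contains("version \""))
// Split on the double quotes, as checkJavaVersion does, but only on a matching line.
val javaVersion = versionLine.map(_.split("\"")(1))    // Some("1.8.0_144")
```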
[jira] [Assigned] (SPARK-24688) Comments of the example code have some typos
[ https://issues.apache.org/jira/browse/SPARK-24688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24688: Assignee: Apache Spark > Comments of the example code have some typos > > > Key: SPARK-24688 > URL: https://issues.apache.org/jira/browse/SPARK-24688 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.3.1 >Reporter: Weizhe Huang >Assignee: Apache Spark >Priority: Minor >
[jira] [Assigned] (SPARK-24688) Comments of the example code have some typos
[ https://issues.apache.org/jira/browse/SPARK-24688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24688: Assignee: (was: Apache Spark) > Comments of the example code have some typos > > > Key: SPARK-24688 > URL: https://issues.apache.org/jira/browse/SPARK-24688 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.3.1 >Reporter: Weizhe Huang >Priority: Minor >
[jira] [Issue Comment Deleted] (SPARK-24144) monotonically_increasing_id on streaming dataFrames
[ https://issues.apache.org/jira/browse/SPARK-24144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Yu updated SPARK-24144: --- Comment: was deleted (was: So do you propose to send the information regarding monotonically_increasing_id to checkpoint data storage which could later be retrieved?) > monotonically_increasing_id on streaming dataFrames > --- > > Key: SPARK-24144 > URL: https://issues.apache.org/jira/browse/SPARK-24144 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Hemant Bhanawat >Priority: Major > > For our use case, we want to assign snapshot ids (incrementing counters) to > the incoming records. In case of failures, the same record should get the > same id after failure so that the downstream DB can handle the records in a > correct manner. > We were trying to do this by zipping the streaming rdds with that counter > using a modified version of ZippedWithIndexRDD. There are other ways to do > that but it turns out all ways are cumbersome and error prone in failure > scenarios. > As suggested on the spark user dev list, one way to do this would be to > support monotonically_increasing_id on streaming dataFrames in Spark code > base. This would ensure that counters are incrementing for the records of the > stream. Also, since the counter can be checkpointed, it would work well in > case of failure scenarios. Last but not the least, doing this in spark would > be the most performance efficient way. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24688) Comments of the example code have some typos
[ https://issues.apache.org/jira/browse/SPARK-24688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527134#comment-16527134 ] Apache Spark commented on SPARK-24688: -- User 'uzmijnlm' has created a pull request for this issue: https://github.com/apache/spark/pull/21665 > Comments of the example code have some typos > > > Key: SPARK-24688 > URL: https://issues.apache.org/jira/browse/SPARK-24688 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.3.1 >Reporter: Weizhe Huang >Priority: Minor >
[jira] [Reopened] (SPARK-24672) No job is running but there are active tasks
[ https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deegue reopened SPARK-24672: Conditions for this issue: When the amount of data selected is larger than spark.driver.maxResultSize, the job returns the info below and fails automatically. !image4.png! After that, several active tasks remain and occupy executors. !image2.png! Thanks a lot for your comment. > No job is running but there are active tasks > > > Key: SPARK-24672 > URL: https://issues.apache.org/jira/browse/SPARK-24672 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core, SQL >Affects Versions: 2.2.1 > Environment: hadoop 2.6.0 > spark 2.2.1 > CDH 5.8.3 > java 1.8.0 > > More info : > image1.png in Attachments >Reporter: Deegue >Priority: Major > Attachments: image1.png, image2.png, image3.png, image4.png > > Original Estimate: 120h > Remaining Estimate: 120h > > Issues: > > There are active tasks while no job is running. > These active tasks occupy the executors (resources), and I don't know why > they haven't been killed or stopped after their jobs failed. > > More info : > image2.png & image3.png in Attachments > > I'd really appreciate it if anyone can help me...
[jira] [Updated] (SPARK-24672) No job is running but there are active tasks
[ https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deegue updated SPARK-24672: --- Attachment: image4.png > No job is running but there are active tasks > > > Key: SPARK-24672 > URL: https://issues.apache.org/jira/browse/SPARK-24672 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core, SQL >Affects Versions: 2.2.1 > Environment: hadoop 2.6.0 > spark 2.2.1 > CDH 5.8.3 > java 1.8.0 > > More info : > image1.png in Attachments >Reporter: Deegue >Priority: Major > Attachments: image1.png, image2.png, image3.png, image4.png > > Original Estimate: 120h > Remaining Estimate: 120h > > Issues: > > There are active tasks while no job is running. > These active tasks occupy the executors(resources) , and I don't know why > they haven't been killed or stopped after its jobs failed. > > More info : > image2.png & image3.png in Attachments > > I'd be very appreciated it if anyone can help me... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8659) Spark SQL Thrift Server does NOT honour hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
[ https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527087#comment-16527087 ] Takeshi Yamamuro commented on SPARK-8659: - I think Spark doesn't support GRANT/REVOKE now. > Spark SQL Thrift Server does NOT honour > hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory > > --- > > Key: SPARK-8659 > URL: https://issues.apache.org/jira/browse/SPARK-8659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 > Environment: Linux >Reporter: Premchandra Preetham Kukillaya >Priority: Major > > It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the > Hive's security feature SQL based authorisation is not working. It ignores > the security settings passed through the command line. The arguments for > command line is given below for reference > The problem is user X can do select on table belonging to user Y, though > permission for table is explicitly defined and its a data security risk. > I am using Hive .13.1 and Spark 1.3.1 and here is the list arguments passed > to Spark SQL Thrift Server. 
> ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf > hostname.compute.amazonaws.com --hiveconf > hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator > --hiveconf > hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory > --hiveconf hive.server2.enable.doAs=false --hiveconf > hive.security.authorization.enabled=true --hiveconf > javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true > --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver > --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf > javax.jdo.option.ConnectionPassword=hive -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming
[ https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527084#comment-16527084 ] Apache Spark commented on SPARK-24678: -- User 'caneGuy' has created a pull request for this issue: https://github.com/apache/spark/pull/21664 > We should use 'PROCESS_LOCAL' first for Spark-Streaming > --- > > Key: SPARK-24678 > URL: https://issues.apache.org/jira/browse/SPARK-24678 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.3.1 >Reporter: sharkd tu >Priority: Major > > Currently, `BlockRDD.getPreferredLocations` only returns host info for blocks, > which means the subsequent scheduling level can be no better than 'NODE_LOCAL'. > With a small change, the scheduling level can be improved to > 'PROCESS_LOCAL' >
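The improvement described above amounts to emitting executor-aware task locations instead of bare host names. A hedged sketch of what such a change might look like — the lookup method and field names here are illustrative, not the actual patch, which is in the PR linked above:

```scala
// Spark's scheduler treats preferred-location strings of the form
// "executor_<host>_<executorId>" as executor-specific, which enables
// PROCESS_LOCAL scheduling; a plain host name only allows NODE_LOCAL.
override def getPreferredLocations(split: Partition): Seq[String] = {
  // Hypothetical lookup returning (host, executorId) pairs for the block.
  val locations = blockLocations(split.index)
  locations.map(loc => s"executor_${loc.host}_${loc.executorId}")
}
```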
[jira] [Created] (SPARK-24688) Comments of the example code have some typos
Weizhe Huang created SPARK-24688: Summary: Comments of the example code have some typos Key: SPARK-24688 URL: https://issues.apache.org/jira/browse/SPARK-24688 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 2.3.1 Reporter: Weizhe Huang
[jira] [Updated] (SPARK-24687) When NoClassDefError thrown during task serialization will cause job hang
[ https://issues.apache.org/jira/browse/SPARK-24687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-24687: - Description: When below exception thrown: {code:java} Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: Lcom/xxx/data/recommend/aggregator/queue/QueueName; at java.lang.Class.getDeclaredFields0(Native Method) at java.lang.Class.privateGetDeclaredFields(Class.java:2436) at java.lang.Class.getDeclaredField(Class.java:1946) at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659) at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72) at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480) at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468) at java.security.AccessController.doPrivileged(Native Method) at java.io.ObjectStreamClass.(ObjectStreamClass.java:468) at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365) at java.io.ObjectOutputStream.writeClass(ObjectOutputStream.java:1212) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1119) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) {code} The stage hangs forever instead of aborting. !hanging-960.png! 
This is because NoClassDefFoundError is not caught by the code below. {code} var taskBinary: Broadcast[Array[Byte]] = null try { // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep). // For ResultTask, serialize and broadcast (rdd, func). val taskBinaryBytes: Array[Byte] = stage match { case stage: ShuffleMapStage => JavaUtils.bufferToArray( closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef)) case stage: ResultStage => JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef)) } taskBinary =
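The snippet above is cut off, but the shape of the proposed fix is to extend the existing catch in DAGScheduler.submitMissingTasks — which handles NotSerializableException and NonFatal throwables — so that a fatal error like NoClassDefFoundError also aborts the stage instead of killing the dag-scheduler-event-loop thread. A hedged sketch, not the actual patch:

```scala
try {
  // ... build taskBinaryBytes and broadcast it, as in the snippet above ...
} catch {
  case e: NotSerializableException =>
    abortStage(stage, "Task not serializable: " + e.toString, Some(e))
    return
  // Proposed addition (sketch): NoClassDefFoundError is an Error, so it is
  // not matched by NonFatal below; without this case it escapes the event
  // loop and the stage hangs with no abort.
  case e: NoClassDefFoundError =>
    abortStage(stage, "Could not serialize task: " + e.toString, Some(e))
    return
  case NonFatal(e) =>
    abortStage(stage, s"Task serialization failed: $e", Some(e))
    return
}
```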
[jira] [Created] (SPARK-24687) When NoClassDefError thrown during task serialization will cause job hang
zhoukang created SPARK-24687: Summary: When NoClassDefError thrown during task serialization will cause job hang Key: SPARK-24687 URL: https://issues.apache.org/jira/browse/SPARK-24687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.1, 2.1.0 Reporter: zhoukang Attachments: hanging-960.png When below exception thrown: {code:java} Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: Lcom/xxx/data/recommend/aggregator/queue/QueueName; at java.lang.Class.getDeclaredFields0(Native Method) at java.lang.Class.privateGetDeclaredFields(Class.java:2436) at java.lang.Class.getDeclaredField(Class.java:1946) at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659) at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72) at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480) at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468) at java.security.AccessController.doPrivileged(Native Method) at java.io.ObjectStreamClass.(ObjectStreamClass.java:468) at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365) at java.io.ObjectOutputStream.writeClass(ObjectOutputStream.java:1212) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1119) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) {code} The stage hangs forever instead of aborting.
[jira] [Updated] (SPARK-24687) When NoClassDefError thrown during task serialization will cause job hang
[ https://issues.apache.org/jira/browse/SPARK-24687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-24687: - Description: When the exception below is thrown: {code:java}
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: Lcom/xxx/data/recommend/aggregator/queue/QueueName;
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2436)
at java.lang.Class.getDeclaredField(Class.java:1946)
at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectOutputStream.writeClass(ObjectOutputStream.java:1212)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1119)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
{code}
Stage will always hang, not abort. !hanging-960.png!
was: When below exception thrown: {code:java} Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: Lcom/xxx/data/recommend/aggregator/queue/QueueName; at java.lang.Class.getDeclaredFields0(Native Method) at java.lang.Class.privateGetDeclaredFields(Class.java:2436) at java.lang.Class.getDeclaredField(Class.java:1946) at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659) at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72) at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480) at
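A plausible reading of this hang: NoClassDefFoundError is a java.lang.Error, not an Exception, so an event loop that only guards against Exception lets it kill the scheduler thread, and nothing ever marks the job as failed. The same failure mode can be sketched in plain Python, where `except Exception` similarly misses BaseException-level errors (the class name and job names below are just taken from, or modeled on, the report):

```python
class FatalLinkageError(BaseException):
    """Stands in for java.lang.NoClassDefFoundError: an Error, not an Exception."""

def load_missing_class():
    # Simulates task serialization touching a class that is missing at runtime.
    raise FatalLinkageError("com/xxx/data/recommend/aggregator/queue/QueueName")

def run_event_loop(events, job_status):
    # Like a scheduler event loop that only guards against Exception-level
    # failures: anything worse kills the loop without aborting the job.
    for name, action in events:
        try:
            action()
            job_status[name] = "succeeded"
        except Exception:
            job_status[name] = "aborted"  # ordinary failures abort the job

job_status = {}
events = [("job-1", lambda: None), ("job-2", load_missing_class), ("job-3", lambda: None)]
try:
    run_event_loop(events, job_status)
except BaseException:
    pass  # the "scheduler thread" is dead; nobody aborts job-2 or runs job-3

print(job_status)  # {'job-1': 'succeeded'} -- job-2 is neither succeeded nor aborted: it hangs
```

This is only an analogy for the reported behaviour, not Spark's actual DAGScheduler code; the fix discussed in the ticket is to abort the stage instead of letting the error escape.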
[jira] [Updated] (SPARK-24687) When NoClassDefError thrown during task serialization will cause job hang
[ https://issues.apache.org/jira/browse/SPARK-24687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-24687: - Attachment: hanging-960.png > When NoClassDefError thrown during task serialization will cause job hang > - > > Key: SPARK-24687 > URL: https://issues.apache.org/jira/browse/SPARK-24687 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: zhoukang >Priority: Major > Attachments: hanging-960.png > > > When below exception thrown: > {code:java} > Exception in thread "dag-scheduler-event-loop" > java.lang.NoClassDefFoundError: > Lcom/xxx/data/recommend/aggregator/queue/QueueName; > at java.lang.Class.getDeclaredFields0(Native Method) > at java.lang.Class.privateGetDeclaredFields(Class.java:2436) > at java.lang.Class.getDeclaredField(Class.java:1946) > at > java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659) > at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72) > at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480) > at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468) > at java.security.AccessController.doPrivileged(Native Method) > at java.io.ObjectStreamClass.(ObjectStreamClass.java:468) > at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365) > at java.io.ObjectOutputStream.writeClass(ObjectOutputStream.java:1212) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1119) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > {code} > Stage will always hang, not abort.
[jira] [Assigned] (SPARK-24680) spark.executorEnv.JAVA_HOME does not take effect in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-24680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24680: Assignee: (was: Apache Spark) > spark.executorEnv.JAVA_HOME does not take effect in Standalone mode > --- > > Key: SPARK-24680 > URL: https://issues.apache.org/jira/browse/SPARK-24680 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.1, 2.2.1, 2.3.1 >Reporter: StanZhai >Priority: Minor > > spark.executorEnv.JAVA_HOME does not take effect when a Worker starts an > Executor process in Standalone mode.
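For context, the setting in question is normally supplied as a Spark conf entry like the sketch below; the master URL, JDK path, and jar name are illustrative, not from the report. Per this ticket, in standalone mode the Worker ignored the value when forking the executor JVM.

```shell
# Ask executor JVMs to use a specific JDK (illustrative path).
spark-submit \
  --master spark://master:7077 \
  --conf spark.executorEnv.JAVA_HOME=/opt/jdk1.8.0 \
  app.jar
```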
[jira] [Commented] (SPARK-24680) spark.executorEnv.JAVA_HOME does not take effect in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-24680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527044#comment-16527044 ] Apache Spark commented on SPARK-24680: -- User 'stanzhai' has created a pull request for this issue: https://github.com/apache/spark/pull/21663 > spark.executorEnv.JAVA_HOME does not take effect in Standalone mode > --- > > Key: SPARK-24680 > URL: https://issues.apache.org/jira/browse/SPARK-24680 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.1, 2.2.1, 2.3.1 >Reporter: StanZhai >Priority: Minor > > spark.executorEnv.JAVA_HOME does not take effect when a Worker starts an > Executor process in Standalone mode.
[jira] [Assigned] (SPARK-24680) spark.executorEnv.JAVA_HOME does not take effect in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-24680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24680: Assignee: Apache Spark > spark.executorEnv.JAVA_HOME does not take effect in Standalone mode > --- > > Key: SPARK-24680 > URL: https://issues.apache.org/jira/browse/SPARK-24680 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.1, 2.2.1, 2.3.1 >Reporter: StanZhai >Assignee: Apache Spark >Priority: Minor > > spark.executorEnv.JAVA_HOME does not take effect when a Worker starts an > Executor process in Standalone mode.
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527036#comment-16527036 ] Takeshi Yamamuro commented on SPARK-24498: -- I added some benchmark results on different conditions: sf=5 and splitConsumeFuncByOperator=false (I still didn't get any benefit, though) > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, the JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will still be our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696
[jira] [Commented] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String
[ https://issues.apache.org/jira/browse/SPARK-24673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527020#comment-16527020 ] Takeshi Yamamuro commented on SPARK-24673: -- I'm not 100% sure, but I think we cannot touch the existing signature. So we need to add a new entry `from_utc_timestamp(ts: Column, tz: Column)` there. But user-facing API issues are more sensitive, so you need to ask qualified committers first before making a PR: [~smilegator] [~ueshin] > scala sql function from_utc_timestamp second argument could be Column instead > of String > --- > > Key: SPARK-24673 > URL: https://issues.apache.org/jira/browse/SPARK-24673 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Antonio Murgia >Priority: Minor > > As of 2.3.1 the scala API for the built-in function from_utc_timestamp > (org.apache.spark.sql.functions#from_utc_timestamp) is less powerful than its > SQL counterpart. In particular, given a dataset/dataframe with the following > schema: > {code:java} > CREATE TABLE MY_TABLE ( > ts TIMESTAMP, > tz STRING > ){code} > from the SQL api I can do something like: > {code:java} > SELECT FROM_UTC_TIMESTAMP(TS, TZ){code} > while from the programmatic api I simply cannot because > {code:java} > functions.from_utc_timestamp(ts: Column, tz: String){code} > second argument is a String.
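The semantics being asked for, shifting a UTC timestamp into a timezone that varies per row, can be illustrated in plain Python (this is not Spark code; `zoneinfo` ships with Python 3.9+, and the sample rows are invented):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def from_utc_timestamp(ts_utc, tz_name):
    # Interpret ts_utc as UTC, then render it as naive wall-clock time in
    # tz_name -- what FROM_UTC_TIMESTAMP(TS, TZ) computes per row in Spark SQL.
    aware = ts_utc.replace(tzinfo=timezone.utc).astimezone(ZoneInfo(tz_name))
    return aware.replace(tzinfo=None)

# With tz as a per-row value, each row can use a different zone -- the whole
# point of accepting a Column instead of one String literal.
rows = [
    (datetime(2018, 6, 29, 12, 0), "Asia/Shanghai"),     # UTC+8
    (datetime(2018, 6, 29, 12, 0), "America/New_York"),  # UTC-4 during DST
]
for ts, tz in rows:
    print(tz, from_utc_timestamp(ts, tz))
```

With a `tz: String` signature, the second value is fixed for the whole query; the proposed `tz: Column` overload would let it come from the `tz` column, matching the SQL form.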
[jira] [Commented] (SPARK-24643) from_json should accept an aggregate function as schema
[ https://issues.apache.org/jira/browse/SPARK-24643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527013#comment-16527013 ] Hyukjin Kwon commented on SPARK-24643: -- [~maxgekk] shall we leave this closed for now? > from_json should accept an aggregate function as schema > --- > > Key: SPARK-24643 > URL: https://issues.apache.org/jira/browse/SPARK-24643 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Currently, the *from_json()* function accepts only string literals as schema: > - Checking of schema argument inside of JsonToStructs: > [https://github.com/apache/spark/blob/b8f27ae3b34134a01998b77db4b7935e7f82a4fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L530] > - Accepting only string literal: > [https://github.com/apache/spark/blob/b8f27ae3b34134a01998b77db4b7935e7f82a4fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L749-L752] > JsonToStructs should be modified to accept results of aggregate functions > like *infer_schema* (see SPARK-24642). It should be possible to write SQL > like: > {code:sql} > select from_json(json_col, infer_schema(json_col)) from json_table > {code} > Here is a test case with existing aggregate function - *first()*: > {code:sql} > create temporary view schemas(schema) as select * from values > ('struct'), > ('map'); > select from_json('{"a":1}', first(schema)) from schemas; > {code}
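What the request amounts to: today the schema argument must be a single string literal for the whole query, and the proposal is to let it come from another expression. A plain-Python sketch of the behaviour (the `name:type` schema syntax is a deliberate simplification, not Spark's DDL parser):

```python
import json

# Toy "from_json": parse a JSON string and keep only the fields named in a
# simplified schema like "a:int,b:string" (illustrative only).
CASTS = {"int": int, "string": str}

def from_json(js, schema):
    fields = dict(f.split(":") for f in schema.split(","))
    doc = json.loads(js)
    return {k: CASTS[t](doc[k]) for k, t in fields.items() if k in doc}

# A literal schema works today; the ticket asks for the schema to be the
# result of another expression (e.g. an aggregate like infer_schema).
rows = [('{"a": 1}', "a:int"), ('{"a": "2", "b": "x"}', "a:int,b:string")]
print([from_json(js, schema) for js, schema in rows])
```

In the toy version the schema is just another per-call argument, which is exactly the flexibility JsonToStructs currently rejects by requiring a literal.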
[jira] [Created] (SPARK-24686) Provide spark distributions for hadoop-2.8 rather than hadoop-2.7 as releases on apache mirrors
t oo created SPARK-24686: Summary: Provide spark distributions for hadoop-2.8 rather than hadoop-2.7 as releases on apache mirrors Key: SPARK-24686 URL: https://issues.apache.org/jira/browse/SPARK-24686 Project: Spark Issue Type: Improvement Components: Build, Spark Core Affects Versions: 2.3.1 Reporter: t oo Hadoop 2.7 is quite old; it is time to move on to providing released Spark builds for the Hadoop 2.8 distribution on the Apache mirrors.
[jira] [Commented] (SPARK-18258) Sinks need access to offset representation
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526936#comment-16526936 ] Richard Yu commented on SPARK-18258: {quote} * We need agreement on whether it is worth making a change to the public Sink api (probably not any time soon, judging from the spark 3.0 vs 2.4 discussion), or whether there is a different way to accomplish the goal. - Cody {quote} cc [~rxin] [~lwlin] [~marmbrus] Hi all, I would like to poll what you think on this issue. > Sinks need access to offset representation > -- > > Key: SPARK-18258 > URL: https://issues.apache.org/jira/browse/SPARK-18258 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > Transactional "exactly-once" semantics for output require storing an offset > identifier in the same transaction as results. > The Sink.addBatch method currently only has access to batchId and data, not > the actual offset representation. > I want to store the actual offsets, so that they are recoverable as long as > the results are and I'm not locked in to a particular streaming engine. > I could see this being accomplished by adding parameters to Sink.addBatch for > the starting and ending offsets (either the offsets themselves, or the > SPARK-17829 string/json representation). That would be an API change, but if > there's another way to map batch ids to offset representations without > changing the Sink api that would work as well. > I'm assuming we don't need the same level of access to offsets throughout a > job as e.g. the Kafka dstream gives, because Sinks are the main place that > should need them. 
> After SPARK-17829 is complete and offsets have a .json method, an api for > this ticket might look like > {code} > trait Sink { > def addBatch(batchId: Long, data: DataFrame, start: OffsetSeq, end: > OffsetSeq): Unit > } > {code} > where start and end were provided by StreamExecution.runBatch using > committedOffsets and availableOffsets. > I'm not 100% certain that the offsets in the seq could always be mapped back > to the correct source when restarting complicated multi-source jobs, but I > think it'd be sufficient. Passing the string/json representation of the seq > instead of the seq itself would probably be sufficient as well, but the > convention of rendering a None as "-" in the json is maybe a little > idiosyncratic to parse, and the constant defining that is private.
[jira] [Resolved] (SPARK-24386) implement continuous processing coalesce(1)
[ https://issues.apache.org/jira/browse/SPARK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-24386. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21560 [https://github.com/apache/spark/pull/21560] > implement continuous processing coalesce(1) > --- > > Key: SPARK-24386 > URL: https://issues.apache.org/jira/browse/SPARK-24386 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 3.0.0 > > > [~marmbrus] suggested this as a good implementation checkpoint. If we do the > shuffle reader and writer correctly, it should be easy to make a custom > coalesce(1) execution for continuous processing using them, without having to > implement the logic for shuffle writers finding out where shuffle readers are > located. (The coalesce(1) can just get the RpcEndpointRef directly from the > reader and pass it to the writers.)
[jira] [Assigned] (SPARK-24386) implement continuous processing coalesce(1)
[ https://issues.apache.org/jira/browse/SPARK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-24386: - Assignee: Jose Torres > implement continuous processing coalesce(1) > --- > > Key: SPARK-24386 > URL: https://issues.apache.org/jira/browse/SPARK-24386 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > > [~marmbrus] suggested this as a good implementation checkpoint. If we do the > shuffle reader and writer correctly, it should be easy to make a custom > coalesce(1) execution for continuous processing using them, without having to > implement the logic for shuffle writers finding out where shuffle readers are > located. (The coalesce(1) can just get the RpcEndpointRef directly from the > reader and pass it to the writers.)
[jira] [Commented] (SPARK-24662) Structured Streaming should support LIMIT
[ https://issues.apache.org/jira/browse/SPARK-24662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526859#comment-16526859 ] Apache Spark commented on SPARK-24662: -- User 'mukulmurthy' has created a pull request for this issue: https://github.com/apache/spark/pull/21662 > Structured Streaming should support LIMIT > - > > Key: SPARK-24662 > URL: https://issues.apache.org/jira/browse/SPARK-24662 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mukul Murthy >Priority: Major > > Make structured streams support the LIMIT operator. > This will undo SPARK-24525 as the limit operator would be a superior solution.
[jira] [Assigned] (SPARK-24662) Structured Streaming should support LIMIT
[ https://issues.apache.org/jira/browse/SPARK-24662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24662: Assignee: (was: Apache Spark) > Structured Streaming should support LIMIT > - > > Key: SPARK-24662 > URL: https://issues.apache.org/jira/browse/SPARK-24662 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mukul Murthy >Priority: Major > > Make structured streams support the LIMIT operator. > This will undo SPARK-24525 as the limit operator would be a superior solution.
[jira] [Assigned] (SPARK-24662) Structured Streaming should support LIMIT
[ https://issues.apache.org/jira/browse/SPARK-24662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24662: Assignee: Apache Spark > Structured Streaming should support LIMIT > - > > Key: SPARK-24662 > URL: https://issues.apache.org/jira/browse/SPARK-24662 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mukul Murthy >Assignee: Apache Spark >Priority: Major > > Make structured streams support the LIMIT operator. > This will undo SPARK-24525 as the limit operator would be a superior solution.
[jira] [Assigned] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-24530: - Assignee: Hyukjin Kwon > Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) > and pyspark.ml docs are broken > --- > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Hyukjin Kwon >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. 
> *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
[jira] [Commented] (SPARK-24662) Structured Streaming should support LIMIT
[ https://issues.apache.org/jira/browse/SPARK-24662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526848#comment-16526848 ] Mukul Murthy commented on SPARK-24662: -- Calling .limit(n) on a DataFrame (or in SQL, SELECT ... LIMIT n) returns only n rows from that query. limit is currently not supported on streaming dataframes; my fix is going to support it for streams writing in append and complete output modes. It's still going to be unsupported for streams in update output mode, because for updating streams, limit doesn't make sense. > Structured Streaming should support LIMIT > - > > Key: SPARK-24662 > URL: https://issues.apache.org/jira/browse/SPARK-24662 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mukul Murthy >Priority: Major > > Make structured streams support the LIMIT operator. > This will undo SPARK-24525 as the limit operator would be a superior solution.
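The append-mode behaviour described in that comment, keep only the first n rows across all micro-batches, can be sketched with a tiny stateful sink (pure Python, purely illustrative, not Spark's Sink API):

```python
class LimitedAppendSink:
    """Emits at most `n` rows over the lifetime of the stream, mirroring
    what .limit(n) on an append-mode streaming query should produce."""

    def __init__(self, n):
        self.n = n
        self.rows = []

    def add_batch(self, batch):
        # Only take as many rows as are still allowed under the limit.
        remaining = self.n - len(self.rows)
        if remaining > 0:
            self.rows.extend(batch[:remaining])

sink = LimitedAppendSink(3)
for micro_batch in [[1, 2], [3, 4], [5]]:
    sink.add_batch(micro_batch)

print(sink.rows)  # [1, 2, 3] -- rows past the limit are dropped
```

Because the count carries over between batches, the operator is stateful, which is also why it is easy to define for append and complete modes but ill-defined for update mode, where earlier rows can change.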
[jira] [Commented] (SPARK-24684) DAGScheduler reports the wrong attempt number to the commit coordinator
[ https://issues.apache.org/jira/browse/SPARK-24684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526834#comment-16526834 ] Ryan Blue commented on SPARK-24684: --- Yeah, I just backported this wrong and moved to using unique ids in the canCommit calls. I don't think it affects master or the on-going patch releases. Sorry for the false alarm. > DAGScheduler reports the wrong attempt number to the commit coordinator > --- > > Key: SPARK-24684 > URL: https://issues.apache.org/jira/browse/SPARK-24684 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.3, 2.3.2 >Reporter: Ryan Blue >Priority: Major > > SPARK-24552 changes writers to pass the task ID to the output coordinator so > that the coordinator tracks each task uniquely because attempt numbers can be > reused across stage attempts. However, the DAGScheduler still passes the > attempt number when notifying the coordinator that a task has finished. The > result is that when a task is authorized and then fails due to OOM or a > similar error, the scheduler is notified but doesn't remove the commit > authorization because the attempt number doesn't match. This causes infinite > task retries. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
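The failure mode in the description, authorization keyed one way and failure notifications keyed another, so the revoke never matches, can be shown with a toy coordinator (this is a simplified model, not Spark's OutputCommitCoordinator, and the key names are invented):

```python
class CommitCoordinator:
    """Toy coordinator: authorizes one attempt per partition and must revoke
    that authorization when the attempt fails, or no retry can ever commit."""

    def __init__(self):
        self.authorized = {}  # partition -> key of the authorized attempt

    def can_commit(self, partition, key):
        # First caller wins; later callers with a different key are denied.
        return self.authorized.setdefault(partition, key) == key

    def task_failed(self, partition, key):
        # Revoke only when the reported key matches what was authorized.
        if self.authorized.get(partition) == key:
            del self.authorized[partition]

coord = CommitCoordinator()
granted = coord.can_commit(0, "task-attempt-7")   # authorized under a unique task id

# The failure is reported under a different identifier (a reused attempt
# number), so the revoke misses and the stale authorization survives:
coord.task_failed(0, "attempt-0")

retry_denied = not coord.can_commit(0, "task-attempt-8")
print(granted, retry_denied)  # True True -- every retry is denied: endless retries
```

The fix direction described in the ticket is to use the same unique identifier in both the authorization and the failure path.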
[jira] [Resolved] (SPARK-24679) Download page should not link to unreleased code
[ https://issues.apache.org/jira/browse/SPARK-24679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-24679. Resolution: Fixed Assignee: Luciano Resende > Download page should not link to unreleased code > > > Key: SPARK-24679 > URL: https://issues.apache.org/jira/browse/SPARK-24679 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.1 >Reporter: Luciano Resende >Assignee: Luciano Resende >Priority: Major > > The download pages currently link to the git code repository. > Whilst the instructions show how to check out master or a particular release > branch, this also gives access to the rest of the repo, i.e. to non-released > code. > Links to code repos should only be published on pages intended for developers.
[jira] [Commented] (SPARK-23940) High-order function: transform_values(map, function) → map
[ https://issues.apache.org/jira/browse/SPARK-23940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526808#comment-16526808 ] Neha Patil commented on SPARK-23940: I can work on this one. > High-order function: transform_values(map, function) → > map > --- > > Key: SPARK-23940 > URL: https://issues.apache.org/jira/browse/SPARK-23940 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > Labels: starter > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map that applies function to each entry of map and transforms the > values. > {noformat} > SELECT transform_values(MAP(ARRAY[], ARRAY[]), (k, v) -> v + 1); -- {} > SELECT transform_values(MAP(ARRAY [1, 2, 3], ARRAY [10, 20, 30]), (k, v) -> v > + k); -- {1 -> 11, 2 -> 22, 3 -> 33} > SELECT transform_values(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) > -> k * k); -- {1 -> 1, 2 -> 4, 3 -> 9} > SELECT transform_values(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || > CAST(v as VARCHAR)); -- {a -> a1, b -> b2} > SELECT transform_values(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {1 -> > one_1.0, 2 -> two_1.4} > (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k] || > '_' || CAST(v AS VARCHAR)); > {noformat}
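The Presto semantics quoted in that description are simple to restate: apply f(key, value) to every value while keeping the keys unchanged. In plain Python (mirroring the quoted SQL examples, not a Spark implementation):

```python
def transform_values(m, f):
    # Apply f(key, value) to each entry, keeping keys unchanged --
    # the semantics of Presto's transform_values.
    return {k: f(k, v) for k, v in m.items()}

print(transform_values({}, lambda k, v: v + 1))                     # {}
print(transform_values({1: 10, 2: 20, 3: 30}, lambda k, v: v + k))  # {1: 11, 2: 22, 3: 33}
print(transform_values({"a": 1, "b": 2}, lambda k, v: k + str(v)))  # {'a': 'a1', 'b': 'b2'}
```

The companion function transform_keys does the mirror-image operation on keys.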
[jira] [Resolved] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark
[ https://issues.apache.org/jira/browse/SPARK-24439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-24439. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21557 [https://github.com/apache/spark/pull/21557] > Add distanceMeasure to BisectingKMeans in PySpark > - > > Key: SPARK-24439 > URL: https://issues.apache.org/jira/browse/SPARK-24439 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 2.4.0 > > > https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to > BisectingKMeans. We will do the same for PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark
[ https://issues.apache.org/jira/browse/SPARK-24439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-24439: Assignee: Huaxin Gao > Add distanceMeasure to BisectingKMeans in PySpark > - > > Key: SPARK-24439 > URL: https://issues.apache.org/jira/browse/SPARK-24439 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to > BisectingKMeans. We will do the same for PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24685) Adjust release scripts to build all versions for older releases
[ https://issues.apache.org/jira/browse/SPARK-24685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24685: Assignee: (was: Apache Spark) > Adjust release scripts to build all versions for older releases > --- > > Key: SPARK-24685 > URL: https://issues.apache.org/jira/browse/SPARK-24685 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > If using the current release scripts to build old branches (e.g. 2.1), some > hadoop versions are missing, since they were removed in newer versions of > Spark. > To keep the RM job consistent for multiple branches, let's keep support for > older versions in the scripts until we really decide they're EOL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24685) Adjust release scripts to build all versions for older releases
[ https://issues.apache.org/jira/browse/SPARK-24685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526764#comment-16526764 ] Apache Spark commented on SPARK-24685: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/21661 > Adjust release scripts to build all versions for older releases > --- > > Key: SPARK-24685 > URL: https://issues.apache.org/jira/browse/SPARK-24685 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > If using the current release scripts to build old branches (e.g. 2.1), some > hadoop versions are missing, since they were removed in newer versions of > Spark. > To keep the RM job consistent for multiple branches, let's keep support for > older versions in the scripts until we really decide they're EOL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24685) Adjust release scripts to build all versions for older releases
[ https://issues.apache.org/jira/browse/SPARK-24685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24685: Assignee: Apache Spark > Adjust release scripts to build all versions for older releases > --- > > Key: SPARK-24685 > URL: https://issues.apache.org/jira/browse/SPARK-24685 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > If using the current release scripts to build old branches (e.g. 2.1), some > hadoop versions are missing, since they were removed in newer versions of > Spark. > To keep the RM job consistent for multiple branches, let's keep support for > older versions in the scripts until we really decide they're EOL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24685) Adjust release scripts to build all versions for older releases
Marcelo Vanzin created SPARK-24685: -- Summary: Adjust release scripts to build all versions for older releases Key: SPARK-24685 URL: https://issues.apache.org/jira/browse/SPARK-24685 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.0 Reporter: Marcelo Vanzin If using the current release scripts to build old branches (e.g. 2.1), some hadoop versions are missing, since they were removed in newer versions of Spark. To keep the RM job consistent for multiple branches, let's keep support for older versions in the scripts until we really decide they're EOL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24670) How to stream only newer files from a folder in Apache Spark?
[ https://issues.apache.org/jira/browse/SPARK-24670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahbub Murshed resolved SPARK-24670. Resolution: Fixed The count difference was ultimately solved by setting the maxFilesPerTrigger option to a low number. Initially it was set to 1000, meaning Spark tries to process 1000 files at a time. For the given data, that is about 50 days' worth, which takes about 2 days to complete before anything is written to disk. Setting it to a lower number solves the problem. > How to stream only newer files from a folder in Apache Spark? > - > > Key: SPARK-24670 > URL: https://issues.apache.org/jira/browse/SPARK-24670 > Project: Spark > Issue Type: Question > Components: Input/Output, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Mahbub Murshed >Priority: Major > > Background: > I have a directory in Google Cloud Storage containing files for 1.5 years of > data. The files are named hits_<date>_<sequence>.csv. For example, for June > 24, say there are three files: hits_20180624_000.csv, hits_20180624_001.csv, > hits_20180624_002.csv, etc. The folder has files since January 2017. New > files are dropped in the folder every day. > I am reading the files using Spark streaming and writing to AWS S3. > Problem: > For the first batch Spark processes ALL files in the folder. It will take > about a month to complete the entire set. > Moreover, when writing out the data, Spark isn't completely writing out each > day's data until the entire folder is complete. > Example: > Say each input file contains 100,000 records. > Input: > hits_20180624_000.csv > hits_20180624_001.csv > hits_20180624_002.csv > hits_20180623_000.csv > hits_20180623_001.csv > ... > hits_20170101_000.csv > hits_20170101_001.csv > Processing: > Drops (say) half the records, so each day's output should contain 50,000 > records.
> Output Expected (number of files may be different): > year=2018/month=6/day=24/hash0.parquet > year=2018/month=6/day=24/hash1.parquet > year=2018/month=6/day=24/hash2.parquet > year=2018/month=6/day=23/hash0.parquet > year=2018/month=6/day=23/hash1.parquet > ... > Problem: > Each day contains fewer than 50,000 records until the entire batch is > complete. This behavior was reproduced in a test with a small subset. > Question: > Is there a way to configure Spark not to load older files, even for the > first load? Why is Spark not writing out the remaining records? > Things I tried: > 1. A trigger of 1 hr > 2. Watermarking based on event time > [1]: > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala
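The resolution above (lowering maxFilesPerTrigger) can be sanity-checked with a little arithmetic: the batch size in files determines how many days of data each micro-batch spans before anything is committed downstream. A plain-Python sketch, assuming a hypothetical rate of roughly 20 files per day (consistent with the ticket's "1000 files is about 50 days of data"; the real rate depends on the source):

```python
# How many days of data one streaming micro-batch spans, given a
# per-day file arrival rate and the maxFilesPerTrigger setting.
# 20 files/day is a hypothetical rate inferred from the ticket.
def days_per_batch(max_files_per_trigger, files_per_day):
    return max_files_per_trigger / files_per_day

print(days_per_batch(1000, 20))  # 50.0 -> each batch spans ~50 days; output lags badly
print(days_per_batch(20, 20))    # 1.0  -> each batch spans ~1 day; output keeps pace
```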
[jira] [Resolved] (SPARK-24408) Move abs function to math_funcs group
[ https://issues.apache.org/jira/browse/SPARK-24408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-24408. - Resolution: Fixed Fix Version/s: 2.4.0 Thanks for helping improve our docs :) > Move abs function to math_funcs group > - > > Key: SPARK-24408 > URL: https://issues.apache.org/jira/browse/SPARK-24408 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Trivial > Fix For: 2.4.0 > > > One math function ( {{abs}} ) is not in the {{math_funcs}} group. It should > really be.
[jira] [Assigned] (SPARK-24408) Move abs function to math_funcs group
[ https://issues.apache.org/jira/browse/SPARK-24408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-24408: --- Assignee: Jacek Laskowski > Move abs function to math_funcs group > - > > Key: SPARK-24408 > URL: https://issues.apache.org/jira/browse/SPARK-24408 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Trivial > > One math function ( {{abs}} ) is not in the {{math_funcs}} group. It should > really be.
[jira] [Updated] (SPARK-24408) Move abs function to math_funcs group
[ https://issues.apache.org/jira/browse/SPARK-24408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-24408: Description: One math function ( {{abs}} ) is not in the {{math_funcs}} group. It should really be. (was: A few math functions ( {{abs}} , {{bitwiseNOT}}, {{isnan}}, {{nanvl}}) are not in {{math_funcs}} group. They should really be.) > Move abs function to math_funcs group > - > > Key: SPARK-24408 > URL: https://issues.apache.org/jira/browse/SPARK-24408 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > One math function ( {{abs}} ) is not in the {{math_funcs}} group. It should > really be.
[jira] [Updated] (SPARK-24408) Move abs function to math_funcs group
[ https://issues.apache.org/jira/browse/SPARK-24408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-24408: Summary: Move abs function to math_funcs group (was: Move abs, bitwiseNOT, isnan, nanvl functions to math_funcs group) > Move abs function to math_funcs group > - > > Key: SPARK-24408 > URL: https://issues.apache.org/jira/browse/SPARK-24408 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > A few math functions ( {{abs}} , {{bitwiseNOT}}, {{isnan}}, {{nanvl}}) are > not in {{math_funcs}} group. They should really be. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23120) Add PMML pipeline export support to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-23120. - Resolution: Fixed Fix Version/s: 2.4.0 > Add PMML pipeline export support to PySpark > --- > > Key: SPARK-23120 > URL: https://issues.apache.org/jira/browse/SPARK-23120 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: holdenk >Assignee: holdenk >Priority: Major > Fix For: 2.4.0 > > > Once we have Scala support go back and fill in Python. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24684) DAGScheduler reports the wrong attempt number to the commit coordinator
[ https://issues.apache.org/jira/browse/SPARK-24684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526714#comment-16526714 ] Marcelo Vanzin commented on SPARK-24684: The code still uses the attempt number currently (and there are unit tests for the case you're talking about). SPARK-24611 was filed to track switching to task IDs and other enhancements which we'll only do in master. > DAGScheduler reports the wrong attempt number to the commit coordinator > --- > > Key: SPARK-24684 > URL: https://issues.apache.org/jira/browse/SPARK-24684 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.3, 2.3.2 >Reporter: Ryan Blue >Priority: Major > > SPARK-24552 changes writers to pass the task ID to the output coordinator so > that the coordinator tracks each task uniquely because attempt numbers can be > reused across stage attempts. However, the DAGScheduler still passes the > attempt number when notifying the coordinator that a task has finished. The > result is that when a task is authorized and then fails due to OOM or a > similar error, the scheduler is notified but doesn't remove the commit > authorization because the attempt number doesn't match. This causes infinite > task retries. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14712) spark.ml LogisticRegressionModel.toString should summarize model
[ https://issues.apache.org/jira/browse/SPARK-14712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-14712: --- Assignee: Bravo Zhang > spark.ml LogisticRegressionModel.toString should summarize model > > > Key: SPARK-14712 > URL: https://issues.apache.org/jira/browse/SPARK-14712 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Bravo Zhang >Priority: Trivial > Labels: starter > Fix For: 2.4.0 > > > spark.mllib LogisticRegressionModel overrides toString to print a little > model info. We should do the same in spark.ml. I'd recommend: > * super.toString > * numClasses > * numFeatures > We should also override {{__repr__}} in pyspark to do the same. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14712) spark.ml LogisticRegressionModel.toString should summarize model
[ https://issues.apache.org/jira/browse/SPARK-14712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-14712. - Resolution: Fixed Fix Version/s: 2.4.0 > spark.ml LogisticRegressionModel.toString should summarize model > > > Key: SPARK-14712 > URL: https://issues.apache.org/jira/browse/SPARK-14712 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Bravo Zhang >Priority: Trivial > Labels: starter > Fix For: 2.4.0 > > > spark.mllib LogisticRegressionModel overrides toString to print a little > model info. We should do the same in spark.ml. I'd recommend: > * super.toString > * numClasses > * numFeatures > We should also override {{__repr__}} in pyspark to do the same. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24682) from_json / to_json do not handle java.sql.Date inside Maps correctly
[ https://issues.apache.org/jira/browse/SPARK-24682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526687#comment-16526687 ] Patrick McGloin commented on SPARK-24682: - I would like to work on this.
> from_json / to_json do not handle java.sql.Date inside Maps correctly
> -
>
> Key: SPARK-24682
> URL: https://issues.apache.org/jira/browse/SPARK-24682
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Patrick McGloin
> Priority: Minor
>
> When a date is one of the types inside a Map, the from_json and to_json
> functions do not handle it correctly. The date is persisted as a raw number,
> e.g. "17710", and when from_json tries to read it back it fails with:
> java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot
> be cast to java.lang.Integer
> Consider the following test, which fails on the final show():
>
> case class UnitTestCaseClassWithDateInsideMap(map: Map[Date, Int])
>
> test("Test a Date as key in a Map") {
>   val map = UnitTestCaseClassWithDateInsideMap(Map(Date.valueOf("2018-06-28") -> 1))
>   val options = Map("timestampFormat" -> "yyyy/MM/dd HH:mm:ss.SSS", "dateFormat" -> "yyyy/MM/dd")
>   val schema = Encoders.product[UnitTestCaseClassWithDateInsideMap].schema
>   val mapDF = Seq(map).toDF()
>   val jsonDF = mapDF.select(to_json(struct(mapDF.columns.head, mapDF.columns.tail: _*), options))
>   jsonDF.show()
>   val jsonString = jsonDF.map(_.getString(0)).collect().head
>   val stringDF = Seq(jsonString).toDF("json")
>   val parsedDF = stringDF.select(from_json($"json", schema, options))
>   parsedDF.show()
> }
>
> The result of the line "jsonDF.show()" is as follows:
> +---+
> |structstojson(named_struct(NamePlaceholder(), map))|
> +---+
> |{"map":{"17710":1}}|
> +---+
> As can be seen, the date is not formatted correctly. "parsedDF.show()" then
> fails with:
> java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot
> be cast to java.lang.Integer
> I added the date formats in the options, but they have no impact. The same
> test with a Date outside a Map works as expected.
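As a side note on where the "17710" in the JSON output comes from: Spark stores a DateType internally as the number of days since the Unix epoch, and to_json is leaking that raw value instead of applying the dateFormat. A quick plain-Python check (illustration only, no Spark involved):

```python
from datetime import date

# Spark's DateType is internally the number of days since the Unix
# epoch (1970-01-01). The "17710" key in the JSON output is exactly
# that raw value for 2018-06-28, emitted unformatted by to_json.
days_since_epoch = (date(2018, 6, 28) - date(1970, 1, 1)).days
print(days_since_epoch)  # 17710
```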
[jira] [Resolved] (SPARK-24684) DAGScheduler reports the wrong attempt number to the commit coordinator
[ https://issues.apache.org/jira/browse/SPARK-24684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-24684. --- Resolution: Not A Problem Closing this. In master, the attempt number is still used. Looks like this was just backported incorrectly by me. > DAGScheduler reports the wrong attempt number to the commit coordinator > --- > > Key: SPARK-24684 > URL: https://issues.apache.org/jira/browse/SPARK-24684 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.3, 2.3.2 >Reporter: Ryan Blue >Priority: Major > > SPARK-24552 changes writers to pass the task ID to the output coordinator so > that the coordinator tracks each task uniquely because attempt numbers can be > reused across stage attempts. However, the DAGScheduler still passes the > attempt number when notifying the coordinator that a task has finished. The > result is that when a task is authorized and then fails due to OOM or a > similar error, the scheduler is notified but doesn't remove the commit > authorization because the attempt number doesn't match. This causes infinite > task retries. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24684) DAGScheduler reports the wrong attempt number to the commit coordinator
Ryan Blue created SPARK-24684: - Summary: DAGScheduler reports the wrong attempt number to the commit coordinator Key: SPARK-24684 URL: https://issues.apache.org/jira/browse/SPARK-24684 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 2.1.3, 2.3.2 Reporter: Ryan Blue SPARK-24552 changes writers to pass the task ID to the output coordinator so that the coordinator tracks each task uniquely because attempt numbers can be reused across stage attempts. However, the DAGScheduler still passes the attempt number when notifying the coordinator that a task has finished. The result is that when a task is authorized and then fails due to OOM or a similar error, the scheduler is notified but doesn't remove the commit authorization because the attempt number doesn't match. This causes infinite task retries. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
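The retry loop described above can be modeled with a toy coordinator. This is a simplified sketch, not Spark's actual OutputCommitCoordinator API: the commit lock is taken under one identifier (the task ID), but the failure notification arrives under a different one (the attempt number), so the lock is never released and every retry is refused.

```python
# Toy model of the identifier mismatch: commits authorized by task ID,
# failures reported by attempt number, so stale locks are never cleared.
class ToyCommitCoordinator:
    def __init__(self):
        self.authorized = {}  # partition -> identifier holding the commit lock

    def can_commit(self, partition, ident):
        # First asker gets the lock; everyone else is refused.
        holder = self.authorized.setdefault(partition, ident)
        return holder == ident

    def task_failed(self, partition, ident):
        # The lock is only cleared if the reported identifier matches.
        if self.authorized.get(partition) == ident:
            del self.authorized[partition]

coord = ToyCommitCoordinator()
task_id, attempt_number = 42, 0            # two different identifiers for one task
assert coord.can_commit(3, task_id)        # authorized under the task ID
coord.task_failed(3, attempt_number)       # scheduler reports the attempt number
retry_ok = coord.can_commit(3, 43)         # the retry has a new task ID
print(retry_ok)  # False -> the stale lock survives, so retries loop forever
```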
[jira] [Commented] (SPARK-24683) SparkLauncher.NO_RESOURCE doesn't work with Java applications
[ https://issues.apache.org/jira/browse/SPARK-24683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526628#comment-16526628 ] Apache Spark commented on SPARK-24683: -- User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/21660 > SparkLauncher.NO_RESOURCE doesn't work with Java applications > - > > Key: SPARK-24683 > URL: https://issues.apache.org/jira/browse/SPARK-24683 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > On the tip of master, after we merged the Python bindings support, Spark > Submit on Kubernetes no longer works with {{SparkLauncher.NO_RESOURCE}}. > This is because spark-submit will not set the main class as a child argument. > See here: > https://github.com/apache/spark/blob/1a644afbac35c204f9ad55f86999319a9ab458c6/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L694
[jira] [Assigned] (SPARK-24683) SparkLauncher.NO_RESOURCE doesn't work with Java applications
[ https://issues.apache.org/jira/browse/SPARK-24683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24683: Assignee: Apache Spark > SparkLauncher.NO_RESOURCE doesn't work with Java applications > - > > Key: SPARK-24683 > URL: https://issues.apache.org/jira/browse/SPARK-24683 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Assignee: Apache Spark >Priority: Critical > > On the tip of master, after we merged the Python bindings support, Spark > Submit on Kubernetes no longer works with {{SparkLauncher.NO_RESOURCE}}. > This is because spark-submit will not set the main class as a child argument. > See here: > https://github.com/apache/spark/blob/1a644afbac35c204f9ad55f86999319a9ab458c6/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L694
[jira] [Assigned] (SPARK-24683) SparkLauncher.NO_RESOURCE doesn't work with Java applications
[ https://issues.apache.org/jira/browse/SPARK-24683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24683: Assignee: (was: Apache Spark) > SparkLauncher.NO_RESOURCE doesn't work with Java applications > - > > Key: SPARK-24683 > URL: https://issues.apache.org/jira/browse/SPARK-24683 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Critical > > On the tip of master, after we merged the Python bindings support, Spark > Submit on Kubernetes no longer works with {{SparkLauncher.NO_RESOURCE}}. > This is because spark-submit will not set the main class as a child argument. > See here: > https://github.com/apache/spark/blob/1a644afbac35c204f9ad55f86999319a9ab458c6/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L694
[jira] [Created] (SPARK-24683) SparkLauncher.NO_RESOURCE doesn't work with Java applications
Matt Cheah created SPARK-24683: -- Summary: SparkLauncher.NO_RESOURCE doesn't work with Java applications Key: SPARK-24683 URL: https://issues.apache.org/jira/browse/SPARK-24683 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0 Reporter: Matt Cheah On the tip of master, after we merged the Python bindings support, Spark Submit on Kubernetes no longer works with {{SparkLauncher.NO_RESOURCE}}. This is because spark-submit will not set the main class as a child argument. See here: https://github.com/apache/spark/blob/1a644afbac35c204f9ad55f86999319a9ab458c6/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L694
[jira] [Assigned] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24530: Assignee: (was: Apache Spark) > Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) > and pyspark.ml docs are broken > --- > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached > screenshots from the Spark 2.3 docs and the master docs generated locally. Not > sure if this is because of my local setup. > cc: [~dongjoon] Could you help verify? > > The following is our released doc status. Some recent docs seem to be > broken. 
> *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526560#comment-16526560 ] Apache Spark commented on SPARK-24530: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21659 > Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) > and pyspark.ml docs are broken > --- > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached > screenshots from the Spark 2.3 docs and the master docs generated locally. Not > sure if this is because of my local setup. > cc: [~dongjoon] Could you help verify? > > The following is our released doc status. Some recent docs seem to be > broken. 
> *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24530: Assignee: Apache Spark > Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) > and pyspark.ml docs are broken > --- > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached > screenshots from the Spark 2.3 docs and the master docs generated locally. Not > sure if this is because of my local setup. > cc: [~dongjoon] Could you help verify? > > The following is our released doc status. Some recent docs seem to be > broken. 
> *2.1.x*
> (O) [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24642) Add a function which infers schema from a JSON column
[ https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526547#comment-16526547 ] Maxim Gekk commented on SPARK-24642: > I think this is too complicated and unpredictable. OK. I will close the PR: https://github.com/apache/spark/pull/21626 > Add a function which infers schema from a JSON column > - > > Key: SPARK-24642 > URL: https://issues.apache.org/jira/browse/SPARK-24642 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > We need to add a new aggregate function - *infer_schema()*. The function > should infer a schema for a set of JSON strings. The result of the function > is a schema in DDL format (or JSON format). > One of the use cases is passing the output of *infer_schema()* to > *from_json()*. Currently, the from_json() function requires a schema as a > mandatory argument. It is possible to infer a schema programmatically in > Scala/Python and pass it as the second argument, but in SQL it is not > possible. A user has to pass the schema as a string literal in SQL. The new > function should allow using it in SQL, as in the example: > {code:sql} > select from_json(json_col, infer_schema(json_col)) > from json_table; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
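[Editor's note] The proposed aggregate never landed (the PR above was closed), but the idea it describes can be sketched outside Spark. The `infer_schema` helper below is hypothetical (it is not part of any Spark API) and only illustrates turning a set of JSON strings into a DDL-style schema string, assuming flat JSON objects and a deliberately crude type lattice that widens any conflict to STRING:

```python
import json

# Hypothetical helper: infer a DDL-style schema string from a collection of
# flat JSON objects. Any type conflict across rows is widened to STRING,
# a crude stand-in for real schema-merging logic.
def infer_schema(json_strings):
    fields = {}
    for s in json_strings:
        for name, value in json.loads(s).items():
            if isinstance(value, bool):
                t = "BOOLEAN"
            elif isinstance(value, int):
                t = "BIGINT"
            elif isinstance(value, float):
                t = "DOUBLE"
            else:
                t = "STRING"
            # Widen to STRING if this field had a different type in an earlier row.
            if fields.get(name, t) != t:
                t = "STRING"
            fields[name] = t
    return ", ".join(f"{name} {t}" for name, t in fields.items())

print(infer_schema(['{"a": 1, "b": "x"}', '{"a": 2, "b": "y"}']))
```

A real implementation would additionally handle nested objects, arrays, nulls, and numeric widening (BIGINT vs DOUBLE), which this sketch sidesteps.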
[jira] [Created] (SPARK-24682) from_json / to_json do not handle java.sql.Date inside Maps correctly
Patrick McGloin created SPARK-24682: --- Summary: from_json / to_json do not handle java.sql.Date inside Maps correctly Key: SPARK-24682 URL: https://issues.apache.org/jira/browse/SPARK-24682 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Patrick McGloin When a date is one of the types inside a Map, the from_json and to_json functions do not handle it correctly. A date is persisted like this: "17710". And when from_json tries to read it back in, it complains with the error: java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer Consider the following test, which fails on the final show:

{code:java}
case class UnitTestCaseClassWithDateInsideMap(map: Map[Date, Int])

test("Test a Date as key in a Map") {
  val map = UnitTestCaseClassWithDateInsideMap(Map(Date.valueOf("2018-06-28") -> 1))
  val options = Map("timestampFormat" -> "yyyy/MM/dd HH:mm:ss.SSS", "dateFormat" -> "yyyy/MM/dd")
  val schema = Encoders.product[UnitTestCaseClassWithDateInsideMap].schema

  val mapDF = Seq(map).toDF()
  val jsonDF = mapDF.select(to_json(struct(mapDF.columns.head, mapDF.columns.tail: _*), options))
  jsonDF.show()

  val jsonString = jsonDF.map(_.getString(0)).collect().head
  val stringDF = Seq(jsonString).toDF("json")
  val parsedDF = stringDF.select(from_json($"json", schema, options))
  parsedDF.show()
}
{code}

The result of the line "jsonDF.show()" is as follows:

{code}
+---------------------------------------------------+
|structstojson(named_struct(NamePlaceholder(), map))|
+---------------------------------------------------+
|                                {"map":{"17710":1}}|
+---------------------------------------------------+
{code}

As can be seen, the date is not formatted correctly. The error with "parsedDF.show()" is: java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer As you can see, I have added options with the date formats, but they have no impact. If I do the same test with a Date outside a Map, it works as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
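[Editor's note] The "17710" leaking into the JSON above is the date's internal representation: Spark stores DateType values as days since the Unix epoch, and for map keys to_json emits that raw integer instead of a formatted date. The correspondence can be checked in plain Python, no Spark needed:

```python
from datetime import date, timedelta

# Spark's DateType is stored internally as days since the Unix epoch (1970-01-01).
epoch = date(1970, 1, 1)

# 17710 days after the epoch lands exactly on the date used in the test above.
print(epoch + timedelta(days=17710))           # the calendar date behind "17710"
print((date(2018, 6, 28) - epoch).days)        # and the reverse computation
```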
[jira] [Updated] (SPARK-24681) Cannot create a view from a table when a nested column name contains ':'
[ https://issues.apache.org/jira/browse/SPARK-24681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Ionescu updated SPARK-24681: --- Description: Here's a patch that reproduces the issue:

{code:java}
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala
index 09c1547..29bb3db 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive

 import org.apache.spark.sql.{QueryTest, Row}
 import org.apache.spark.sql.execution.datasources.parquet.ParquetTest
+import org.apache.spark.sql.functions.{lit, struct}
 import org.apache.spark.sql.hive.test.TestHiveSingleton

 case class Cases(lower: String, UPPER: String)
@@ -76,4 +77,21 @@ class HiveParquetSuite extends QueryTest with ParquetTest with TestHiveSingleton
       }
     }
   }
+
+  test("column names including ':' characters") {
+    withTempPath { path =>
+      withTable("test_table") {
+        spark.range(0)
+          .select(struct(lit(0).as("nested:column")).as("toplevel:column"))
+          .write.format("parquet")
+          .option("path", path.getCanonicalPath)
+          .saveAsTable("test_table")
+
+        sql("CREATE VIEW test_view_1 AS SELECT `toplevel:column`.* FROM test_table")
+        sql("CREATE VIEW test_view_2 AS SELECT * FROM test_table")
+
+      }
+    }
+  }
 }
{code}

The first "CREATE VIEW" statement succeeds, but the second one fails with:

{code:java}
org.apache.spark.SparkException: Cannot recognize hive type string: struct
{code}

was:

Here's a patch that reproduces the issue:

{code:java}
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala
index 09c1547..29bb3db 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive

 import org.apache.spark.sql.{QueryTest, Row}
 import org.apache.spark.sql.execution.datasources.parquet.ParquetTest
+import org.apache.spark.sql.functions.{lit, struct}
 import org.apache.spark.sql.hive.test.TestHiveSingleton

 case class Cases(lower: String, UPPER: String)
@@ -76,4 +77,21 @@ class HiveParquetSuite extends QueryTest with ParquetTest with TestHiveSingleton
       }
     }
   }
+
+  test("column names including ':' characters") {
+    withTempPath { path =>
+      withTable("test_table") {
+        spark.range(0)
+          .select(struct(lit(0).as("nested:column")).as("toplevel:column"))
+          .write.format("parquet")
+          .option("path", path.getCanonicalPath)
+          .saveAsTable("test_table")
+
+        sql("CREATE VIEW test_view_1 AS SELECT `toplevel:column`.* FROM test_table")
+        sql("CREATE VIEW test_view_2 AS SELECT * FROM test_table")
+
+      }
+    }
+  }
 }
{code}

The last sql statement in there fails with:

{code:java}
org.apache.spark.SparkException: Cannot recognize hive type string: struct
{code}

> Cannot create a view from a table when a nested column name contains ':' > > > Key: SPARK-24681 > URL: https://issues.apache.org/jira/browse/SPARK-24681 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.0 >Reporter: Adrian Ionescu >Priority: Major
[jira] [Created] (SPARK-24681) Cannot create a view from a table when a nested column name contains ':'
Adrian Ionescu created SPARK-24681: -- Summary: Cannot create a view from a table when a nested column name contains ':' Key: SPARK-24681 URL: https://issues.apache.org/jira/browse/SPARK-24681 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.2.0, 2.4.0 Reporter: Adrian Ionescu Here's a patch that reproduces the issue:

{code:java}
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala
index 09c1547..29bb3db 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive

 import org.apache.spark.sql.{QueryTest, Row}
 import org.apache.spark.sql.execution.datasources.parquet.ParquetTest
+import org.apache.spark.sql.functions.{lit, struct}
 import org.apache.spark.sql.hive.test.TestHiveSingleton

 case class Cases(lower: String, UPPER: String)
@@ -76,4 +77,21 @@ class HiveParquetSuite extends QueryTest with ParquetTest with TestHiveSingleton
       }
     }
   }
+
+  test("column names including ':' characters") {
+    withTempPath { path =>
+      withTable("test_table") {
+        spark.range(0)
+          .select(struct(lit(0).as("nested:column")).as("toplevel:column"))
+          .write.format("parquet")
+          .option("path", path.getCanonicalPath)
+          .saveAsTable("test_table")
+
+        sql("CREATE VIEW test_view_1 AS SELECT `toplevel:column`.* FROM test_table")
+        sql("CREATE VIEW test_view_2 AS SELECT * FROM test_table")
+
+      }
+    }
+  }
 }
{code}

The last sql statement in there fails with:

{code:java}
org.apache.spark.SparkException: Cannot recognize hive type string: struct
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
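[Editor's note] A plausible reading of this failure (our assumption; the ticket does not spell it out) is that ':' is the delimiter Hive's type-string syntax uses between a field name and its type inside struct<...>, so a ':' inside the field name itself makes the serialized type string ambiguous. A toy illustration of that ambiguity (a hypothetical split, not Hive's actual parser):

```python
# Toy split of a Hive struct field spec on the first ':' -- the delimiter
# Hive's struct<name:type> syntax uses between field name and field type.
# A ':' inside the column name itself ("toplevel:column") shifts the
# boundary, so the name is truncated and the "type" becomes garbage.
def split_field(spec: str):
    name, _, field_type = spec.partition(":")
    return name, field_type

print(split_field("plain_column:bigint"))      # splits as intended
print(split_field("toplevel:column:bigint"))   # name truncated to "toplevel"
```

This also lines up with the observed behavior: the view over `test_table`'s top-level columns fails only when the colon-bearing nested name has to be serialized into a struct type string.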
[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-24020: -- Description: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range condition specified, Spark will first cross-join all the rows with the same X value and only then try to apply the range condition and any function calculations. This happens because, inside the generated sort-merge join (SMJ) code, these extra conditions are put in the same block with the function being calculated and there is no way to evaluate these conditions before reading all the rows to be checked into memory (into an {{ExternalAppendOnlyUnsafeRowArray}}). If the two tables have a large number of rows per X, this can result in a huge number of calculations and a huge number of rows in executor memory, which can be unfeasible. h3. The solution implementation We therefore propose a change to the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window through the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join; in literature it is sometimes called an “epsilon join”. We call it a "sort-merge inner range join". 
This design uses much less memory (not all rows with the same values of X need to be loaded into memory at once) and requires a much lower number of comparisons (the validity of this statement depends on the actual data and conditions used).
h3. The classes that need to be changed
For implementing the described change we propose changes to these classes:
* _ExtractEquiJoinKeys_ – a pattern that needs to be extended to be able to recognize the case where a simple range condition with lower and upper limits is used on a secondary column (a column not included in the equi-join condition). The pattern also needs to extract the information later required for code generation etc.
* _InMemoryUnsafeRowQueue_ – the moving window implementation to be used instead of the _ExternalAppendOnlyUnsafeRowArray_ class. Rows need to be added to and removed from the structure as the left key (X) changes, or the left secondary value (Y) changes, so the structure needs to be a queue. To make the change as minimally intrusive as possible, we propose to implement _InMemoryUnsafeRowQueue_ as a subclass of _ExternalAppendOnlyUnsafeRowArray_
* _JoinSelection_ – a strategy that uses _ExtractEquiJoinKeys_ and needs to be aware of the extracted range conditions
* _SortMergeJoinExec_ – the main implementation of the optimization. Needs to support two code paths:
** when whole-stage code generation is turned off (method doExecute, which uses sortMergeJoinInnerRangeScanner)
** when whole-stage code generation is turned on (methods doProduce and genScanner)
* _SortMergeJoinInnerRangeScanner_ – implements the SMJ with inner-range optimization in the case when whole-stage codegen is turned off
* _InnerJoinSuite_ – functional tests
* _JoinBenchmark_ – performance tests
h3. Triggering the optimization
The optimization should be triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. 
If the tables aren't sorted by both columns, appropriate sorts should be added. To limit the impact of this change we also propose adding a new parameter (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could be used to switch off the optimization entirely. h2. Applicable use cases Potential use-cases for this are joins based on spatial or temporal distance calculations. was: The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted within partitions by Y column and you need to calculate an expensive function on the joined rows, which reduces the number of output rows (e.g. condition based on a spatial distance calculation). But you could theoretically reduce the number of joined rows for which the calculation itself is performed by using a range condition on the Y column. Something like this: {{... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d AND }} However, during a sort-merge join with this range condition specified,
[jira] [Updated] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Zecevic updated SPARK-24020: -- Attachment: SMJ-innerRange-PR24020-designDoc.pdf > Sort-merge join inner range optimization > > > Key: SPARK-24020 > URL: https://issues.apache.org/jira/browse/SPARK-24020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Petar Zecevic >Priority: Major > Attachments: SMJ-innerRange-PR24020-designDoc.pdf > > > The problem we are solving is the case where you have two big tables > partitioned by X column, but also sorted by Y column (within partitions) and > you need to calculate an expensive function on the joined rows. During a > sort-merge join, Spark will do cross-joins of all rows that have the same X > values and calculate the function's value on all of them. If the two tables > have a large number of rows per X, this can result in a huge number of > calculations. > We hereby propose an optimization that would allow you to reduce the number > of matching rows per X using a range condition on Y columns of the two > tables. Something like: > ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d > The way SMJ is currently implemented, these extra conditions have no > influence on the number of rows (per X) being checked because these extra > conditions are put in the same block with the function being calculated. > Here we propose a change to the sort-merge join so that, when these extra > conditions are specified, a queue is used instead of the > ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a > moving window across the values from the right relation as the left row > changes. You could call this a combination of an equi-join and a theta join > (we call it "sort-merge inner range join"). > Potential use-cases for this are joins based on spatial or temporal distance > calculations. 
> The optimization should be triggered automatically when an equi-join > expression is present AND lower and upper range conditions on a secondary > column are specified. If the tables aren't sorted by both columns, > appropriate sorts should be added. > To limit the impact of this change we also propose adding a new parameter > (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which > could be used to switch off the optimization entirely. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
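To make the moving-window idea above concrete, here is a toy sketch of the proposed inner loop over two relations that are already sorted by the secondary column Y. This is a pure-Scala illustration with hypothetical names; Spark's actual implementation would operate on UnsafeRow within each equi-join (X) group and spill where needed.

```scala
import scala.collection.mutable

// Toy sketch of the "sort-merge inner range join" inner loop (illustration
// only, not Spark internals). Both inputs are ascending by the secondary
// key Y; `window` holds the right-side rows with |ly - ry| <= d for the
// current left row, so each right row is enqueued and dequeued at most once.
def innerRangeJoin(left: Seq[Int], right: Seq[Int], d: Int): Seq[(Int, Int)] = {
  val out = mutable.ArrayBuffer.empty[(Int, Int)]
  val window = mutable.Queue.empty[Int]
  var ri = 0
  for (ly <- left) {
    // Evict right rows that fell below the lower bound of the window.
    while (window.nonEmpty && window.head < ly - d) window.dequeue()
    // Admit right rows up to the upper bound; rows already below the lower
    // bound can never match a later (larger) left row, so they are skipped.
    while (ri < right.length && right(ri) <= ly + d) {
      if (right(ri) >= ly - d) window.enqueue(right(ri))
      ri += 1
    }
    window.foreach(ry => out += ((ly, ry)))
  }
  out.toSeq
}
```

Because the window moves forward monotonically, the per-X work drops from O(|left| * |right|) cross-join pairs to only the pairs that actually satisfy the range condition.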
[jira] [Created] (SPARK-24680) spark.executorEnv.JAVA_HOME does not take effect in Standalone mode
StanZhai created SPARK-24680: Summary: spark.executorEnv.JAVA_HOME does not take effect in Standalone mode Key: SPARK-24680 URL: https://issues.apache.org/jira/browse/SPARK-24680 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.3.1, 2.2.1, 2.1.1 Reporter: StanZhai spark.executorEnv.JAVA_HOME does not take effect when a Worker starts an Executor process in Standalone mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
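For context, `spark.executorEnv.[Name]` is the documented configuration namespace for passing environment variables to executors. A minimal sketch of the setting the report is about (the JDK path below is a placeholder, not from the issue):

```scala
import org.apache.spark.SparkConf

// Sketch: setting an executor environment variable via the documented
// spark.executorEnv.* namespace. Per this report, in Standalone mode the
// Worker does not propagate this particular value when launching Executors.
val conf = new SparkConf()
  .setAppName("executor-env-repro")
  .set("spark.executorEnv.JAVA_HOME", "/path/to/jdk8") // placeholder path
```

The same setting can equivalently go in `spark-defaults.conf` as `spark.executorEnv.JAVA_HOME /path/to/jdk8`.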
[jira] [Commented] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String
[ https://issues.apache.org/jira/browse/SPARK-24673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526307#comment-16526307 ] Antonio Murgia commented on SPARK-24673: Looks doable. Should I go with a method overload, resulting in: {code:java} functions.from_utc_timestamp(ts: Column, tz: String) functions.from_utc_timestamp(ts: Column, tz: Column) {code} Or is there some limitation I am not aware of? Also, do you think {code:java} to_utc_timestamp{code} should receive the same treatment? > scala sql function from_utc_timestamp second argument could be Column instead > of String > --- > > Key: SPARK-24673 > URL: https://issues.apache.org/jira/browse/SPARK-24673 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Antonio Murgia >Priority: Minor > > As of 2.3.1 the scala API for the built-in function from_utc_timestamp > (org.apache.spark.sql.functions#from_utc_timestamp) is less powerful than its > SQL counterpart. In particular, given a dataset/dataframe with the following > schema: > {code:java} > CREATE TABLE MY_TABLE ( > ts TIMESTAMP, > tz STRING > ){code} > from the SQL API I can do something like: > {code:java} > SELECT FROM_UTC_TIMESTAMP(TS, TZ){code} > while from the programmatic API I simply cannot, because the > {code:java} > functions.from_utc_timestamp(ts: Column, tz: String){code} > second argument is a String. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
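For illustration, here is the gap in spark-shell terms: today the column-based time zone is only reachable through a SQL expression string, while the discussed overload would make it expressible directly. This is a sketch assuming an active spark-shell session; the `(Column, Column)` overload is hypothetical as of 2.3.1.

```scala
// spark-shell sketch (assumes the usual `spark` session and implicits).
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("2018-06-29 00:00:00", "JST")).toDF("ts", "tz")
  .select($"ts".cast("timestamp").as("ts"), $"tz")

// Workaround available in 2.3.1: go through the SQL expression parser.
df.select(expr("from_utc_timestamp(ts, tz)").as("local_ts")).show()

// The proposed overload would allow this directly (hypothetical in 2.3.1):
// df.select(from_utc_timestamp($"ts", col("tz")))
```

The `expr(...)` workaround shows the overload is purely an API-surface change: the underlying SQL function already accepts a column-valued time zone.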
[jira] [Created] (SPARK-24679) Download page should not link to unreleased code
Luciano Resende created SPARK-24679: --- Summary: Download page should not link to unreleased code Key: SPARK-24679 URL: https://issues.apache.org/jira/browse/SPARK-24679 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.3.1 Reporter: Luciano Resende The download pages currently link to the git code repository. Whilst the instructions show how to check out master or a particular release branch, this also gives access to the rest of the repo, i.e. to non-released code. Links to code repos should only be published on pages intended for developers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526264#comment-16526264 ] Jackey Lee commented on SPARK-24630: Main Goal: * SQL API for StructStreaming Benefits: * Users who are unfamiliar with streaming can easily use SQL to run StructStreaming, especially when migrating offline tasks to real-time processing tasks. * Supporting a SQL API in StructStreaming also makes it possible to combine StructStreaming with Hive. Users can store the source/sink metadata in a table and use the Hive metastore to manage it. Users who want to read this data can easily create a stream by accessing the table, which can greatly reduce the development and maintenance costs of StructStreaming. * It becomes easier to achieve unified management and access control of sources and sinks, and to keep the handling of private data more controlled, especially in financial or security domains. > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL (which is actually based on Calcite), > SQLStream and StormSQL all provide a stream-type SQL interface, with which users > with little knowledge of streaming can easily develop a stream > processing model. In Spark, we can also support a SQL API based on > StructStreaming. > To support SQL Streaming, there are two key points: > 1, The parser should be able to parse streaming-type SQL. > 2, The analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24673) scala sql function from_utc_timestamp second argument could be Column instead of String
[ https://issues.apache.org/jira/browse/SPARK-24673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526118#comment-16526118 ] Takeshi Yamamuro commented on SPARK-24673: -- It makes sense. Can you make a PR? > scala sql function from_utc_timestamp second argument could be Column instead > of String > --- > > Key: SPARK-24673 > URL: https://issues.apache.org/jira/browse/SPARK-24673 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Antonio Murgia >Priority: Minor > > As of 2.3.1 the scala API for the built-in function from_utc_timestamp > (org.apache.spark.sql.functions#from_utc_timestamp) is less powerful than its > SQL counterpart. In particular, given a dataset/dataframe with the following > schema: > {code:java} > CREATE TABLE MY_TABLE ( > ts TIMESTAMP, > tz STRING > ){code} > from the SQL API I can do something like: > {code:java} > SELECT FROM_UTC_TIMESTAMP(TS, TZ){code} > while from the programmatic API I simply cannot, because the > {code:java} > functions.from_utc_timestamp(ts: Column, tz: String){code} > second argument is a String. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation
[ https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526102#comment-16526102 ] Alexander Gorokhov commented on SPARK-14834: So, basically, this is about making the "since" decorator require docs to be written for the underlying function? > Force adding doc for new api in pyspark with @since annotation > -- > > Key: SPARK-14834 > URL: https://issues.apache.org/jira/browse/SPARK-14834 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang >Priority: Minor > > This is for enforcing users to add a Python doc when adding a new Python API with > the @since annotation. But thinking about it again, this is only suitable for > adding a new API to an existing Python module. If it is a new Python module > migrated from a Scala API, a Python doc is not mandatory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24674) Spark on Kubernetes BLAS performance
[ https://issues.apache.org/jira/browse/SPARK-24674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-24674. -- Resolution: Invalid > Spark on Kubernetes BLAS performance > > > Key: SPARK-24674 > URL: https://issues.apache.org/jira/browse/SPARK-24674 > Project: Spark > Issue Type: Question > Components: Build, Kubernetes, MLlib >Affects Versions: 2.3.1 > Environment: Spark 2.3.1 SNAPSHOT (as of June 25th) > Kubernetes version 1.7.5 > Kubernetes cluster, consisting of 4 Nodes with 16 GB RAM, 8 core Intel > processors. >Reporter: Dennis Aumiller >Priority: Minor > Labels: performance > > > Usually, native BLAS libraries speed up the execution time of CPU-heavy > operations, as for example in MLlib, quite significantly. > Of course, the initial error > {code:java} > WARN BLAS:61 - Failed to load implementation from: > com.github.fommil.netlib.NativeSystemBLAS > {code} > cannot be resolved so easily since, as reported > [here|https://github.com/apache/spark/pull/19717/files/7d2b30373b2e4d8d5311e10c3f9a62a2d900d568], > the issue seems to stem from the underlying image used by the Spark > Dockerfile. > Re-building Spark with > {code:java} > -Pnetlib-lgpl > {code} > also does not solve the problem, but I managed to build BLAS and LAPACK into > Alpine, with a lot of tricks involved. > Interestingly, I noticed that the performance of PCA in my case dropped quite > significantly (with BLAS support, compared to the netlib-java fallback). I am > aware of [#SPARK-21305] as well, but that did not help my case, either. > Furthermore, calling SVD on a matrix of only size 5000x5000 (density 1%) > already throws an error when trying to use native ARPACK, but runs perfectly > fine with the fallback version. > The question would be whether there has been some investigation in that > direction already. 
> Or, if not, whether it would be interesting for the Spark community to > provide: > * a more detailed report with respect to timings/configurations/test setup > * a Dockerfile to build Spark with BLAS/LAPACK/ARPACK, using the > shipped Dockerfile as a basis > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24674) Spark on Kubernetes BLAS performance
[ https://issues.apache.org/jira/browse/SPARK-24674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526101#comment-16526101 ] Takeshi Yamamuro commented on SPARK-24674: -- You should first ask in the spark-user mailing list. If you still hit an actual problem, feel free to reopen this. (Also, I think this issue is not related to Kubernetes...) > Spark on Kubernetes BLAS performance > > > Key: SPARK-24674 > URL: https://issues.apache.org/jira/browse/SPARK-24674 > Project: Spark > Issue Type: Question > Components: Build, Kubernetes, MLlib >Affects Versions: 2.3.1 > Environment: Spark 2.3.1 SNAPSHOT (as of June 25th) > Kubernetes version 1.7.5 > Kubernetes cluster, consisting of 4 Nodes with 16 GB RAM, 8 core Intel > processors. >Reporter: Dennis Aumiller >Priority: Minor > Labels: performance > > > Usually, native BLAS libraries speed up the execution time of CPU-heavy > operations, as for example in MLlib, quite significantly. > Of course, the initial error > {code:java} > WARN BLAS:61 - Failed to load implementation from: > com.github.fommil.netlib.NativeSystemBLAS > {code} > cannot be resolved so easily since, as reported > [here|https://github.com/apache/spark/pull/19717/files/7d2b30373b2e4d8d5311e10c3f9a62a2d900d568], > the issue seems to stem from the underlying image used by the Spark > Dockerfile. > Re-building Spark with > {code:java} > -Pnetlib-lgpl > {code} > also does not solve the problem, but I managed to build BLAS and LAPACK into > Alpine, with a lot of tricks involved. > Interestingly, I noticed that the performance of PCA in my case dropped quite > significantly (with BLAS support, compared to the netlib-java fallback). I am > aware of [#SPARK-21305] as well, but that did not help my case, either. > Furthermore, calling SVD on a matrix of only size 5000x5000 (density 1%) > already throws an error when trying to use native ARPACK, but runs perfectly > fine with the fallback version. 
> The question would be whether there has been some investigation in that > direction already. > Or, if not, whether it would be interesting for the Spark community to > provide: > * a more detailed report with respect to timings/configurations/test setup > * a Dockerfile to build Spark with BLAS/LAPACK/ARPACK, using the > shipped Dockerfile as a basis > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming
[ https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24678: Assignee: (was: Apache Spark) > We should use 'PROCESS_LOCAL' first for Spark-Streaming > --- > > Key: SPARK-24678 > URL: https://issues.apache.org/jira/browse/SPARK-24678 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.3.1 >Reporter: sharkd tu >Priority: Major > > Currently, `BlockRDD.getPreferredLocations` only gets the hosts of blocks, > which means the subsequent scheduling level is no better than 'NODE_LOCAL'. > With a small change, the scheduling level can be improved to > 'PROCESS_LOCAL' > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming
[ https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24678: Assignee: Apache Spark > We should use 'PROCESS_LOCAL' first for Spark-Streaming > --- > > Key: SPARK-24678 > URL: https://issues.apache.org/jira/browse/SPARK-24678 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.3.1 >Reporter: sharkd tu >Assignee: Apache Spark >Priority: Major > > Currently, `BlockRDD.getPreferredLocations` only gets the hosts of blocks, > which means the subsequent scheduling level is no better than 'NODE_LOCAL'. > With a small change, the scheduling level can be improved to > 'PROCESS_LOCAL' > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming
[ https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526078#comment-16526078 ] Apache Spark commented on SPARK-24678: -- User 'sharkdtu' has created a pull request for this issue: https://github.com/apache/spark/pull/21658 > We should use 'PROCESS_LOCAL' first for Spark-Streaming > --- > > Key: SPARK-24678 > URL: https://issues.apache.org/jira/browse/SPARK-24678 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.3.1 >Reporter: sharkd tu >Priority: Major > > Currently, `BlockRDD.getPreferredLocations` only gets the hosts of blocks, > which means the subsequent scheduling level is no better than 'NODE_LOCAL'. > With a small change, the scheduling level can be improved to > 'PROCESS_LOCAL' > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming
[ https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sharkd tu updated SPARK-24678: -- Description: Currently, `BlockRDD.getPreferredLocations` only gets the hosts of blocks, which means the subsequent scheduling level is no better than 'NODE_LOCAL'. With a small change, the scheduling level can be improved to 'PROCESS_LOCAL' was: Currently, the meta-info of blocks that were received by spark-streaming receivers only contains host info, which means the subsequent scheduling level is no better than 'NODE_LOCAL'. With a small change, the scheduling level can be improved to 'PROCESS_LOCAL' > We should use 'PROCESS_LOCAL' first for Spark-Streaming > --- > > Key: SPARK-24678 > URL: https://issues.apache.org/jira/browse/SPARK-24678 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.3.1 >Reporter: sharkd tu >Priority: Major > > Currently, `BlockRDD.getPreferredLocations` only gets the hosts of blocks, > which means the subsequent scheduling level is no better than 'NODE_LOCAL'. > With a small change, the scheduling level can be improved to > 'PROCESS_LOCAL' > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24144) monotonically_increasing_id on streaming dataFrames
[ https://issues.apache.org/jira/browse/SPARK-24144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526075#comment-16526075 ] Richard Yu commented on SPARK-24144: So do you propose to send the information regarding monotonically_increasing_id to checkpoint data storage, from which it could later be retrieved? > monotonically_increasing_id on streaming dataFrames > --- > > Key: SPARK-24144 > URL: https://issues.apache.org/jira/browse/SPARK-24144 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Hemant Bhanawat >Priority: Major > > For our use case, we want to assign snapshot ids (incrementing counters) to > the incoming records. In case of failures, the same record should get the > same id after recovery so that the downstream DB can handle the records in a > correct manner. > We were trying to do this by zipping the streaming rdds with that counter > using a modified version of ZippedWithIndexRDD. There are other ways to do > that, but it turns out all of them are cumbersome and error-prone in failure > scenarios. > As suggested on the spark user dev list, one way to do this would be to > support monotonically_increasing_id on streaming dataFrames in the Spark code > base. This would ensure that counters keep incrementing for the records of the > stream. Also, since the counter can be checkpointed, it would work well in > failure scenarios. Last but not least, doing this in Spark would > be the most performant way. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24672) No job is running but there are active tasks
[ https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-24672. -- Resolution: Invalid > No job is running but there are active tasks > > > Key: SPARK-24672 > URL: https://issues.apache.org/jira/browse/SPARK-24672 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core, SQL >Affects Versions: 2.2.1 > Environment: hadoop 2.6.0 > spark 2.2.1 > CDH 5.8.3 > java 1.8.0 > > More info : > image1.png in Attachments >Reporter: Deegue >Priority: Major > Attachments: image1.png, image2.png, image3.png > > Original Estimate: 120h > Remaining Estimate: 120h > > Issues: > > There are active tasks while no job is running. > These active tasks occupy the executors (resources), and I don't know why > they haven't been killed or stopped after their jobs failed. > > More info : > image2.png & image3.png in Attachments > > I'd appreciate it if anyone could help me... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24672) No job is running but there are active tasks
[ https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526069#comment-16526069 ] Takeshi Yamamuro commented on SPARK-24672: -- You should first ask in the spark-user mailing list. If you can find some concrete conditions to reproduce this, feel free to reopen or file a new jira. Thanks. > No job is running but there are active tasks > > > Key: SPARK-24672 > URL: https://issues.apache.org/jira/browse/SPARK-24672 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core, SQL >Affects Versions: 2.2.1 > Environment: hadoop 2.6.0 > spark 2.2.1 > CDH 5.8.3 > java 1.8.0 > > More info : > image1.png in Attachments >Reporter: Deegue >Priority: Major > Attachments: image1.png, image2.png, image3.png > > Original Estimate: 120h > Remaining Estimate: 120h > > Issues: > > There are active tasks while no job is running. > These active tasks occupy the executors (resources), and I don't know why > they haven't been killed or stopped after their jobs failed. > > More info : > image2.png & image3.png in Attachments > > I'd appreciate it if anyone could help me... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24662) Structured Streaming should support LIMIT
[ https://issues.apache.org/jira/browse/SPARK-24662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526065#comment-16526065 ] Richard Yu commented on SPARK-24662: Just to be clear on the function of the limit operator, could you explain what it does? > Structured Streaming should support LIMIT > - > > Key: SPARK-24662 > URL: https://issues.apache.org/jira/browse/SPARK-24662 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mukul Murthy >Priority: Major > > Make structured streams support the LIMIT operator. > This will undo SPARK-24525 as the limit operator would be a superior solution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming
sharkd tu created SPARK-24678: - Summary: We should use 'PROCESS_LOCAL' first for Spark-Streaming Key: SPARK-24678 URL: https://issues.apache.org/jira/browse/SPARK-24678 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 2.3.1 Reporter: sharkd tu Currently, the meta-info of blocks that were received by spark-streaming receivers only contains host info, which means the subsequent scheduling level is no better than 'NODE_LOCAL'. With a small change, the scheduling level can be improved to 'PROCESS_LOCAL' -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
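To sketch the direction of the proposed change: in Spark, an executor-qualified preferred location is rendered as `executor_<host>_<executorId>` (the `ExecutorCacheTaskLocation` string form), which is what lets the scheduler reach 'PROCESS_LOCAL' instead of 'NODE_LOCAL'. The helper below is purely illustrative (hypothetical names, not Spark internals):

```scala
// Illustrative sketch: qualify block locations with the executor id when it
// is known, so the scheduler can prefer PROCESS_LOCAL; fall back to the bare
// host (NODE_LOCAL) otherwise. The "executor_<host>_<executorId>" string
// format mirrors Spark's ExecutorCacheTaskLocation.
def preferredLocations(hosts: Seq[String],
                       executorByHost: Map[String, String]): Seq[String] =
  hosts.map { host =>
    executorByHost.get(host) match {
      case Some(execId) => s"executor_${host}_$execId" // PROCESS_LOCAL candidate
      case None         => host                        // NODE_LOCAL fallback
    }
  }
```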
[jira] [Assigned] (SPARK-24676) Project required data from parsed data when csvColumnPruning disabled
[ https://issues.apache.org/jira/browse/SPARK-24676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24676: Assignee: (was: Apache Spark) > Project required data from parsed data when csvColumnPruning disabled > - > > Key: SPARK-24676 > URL: https://issues.apache.org/jira/browse/SPARK-24676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > I hit a bug below when parsing csv data; > {code} > ./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false > scala> val dir = "/tmp/spark-csv/csv" > scala> spark.range(10).selectExpr("id % 2 AS p", > "id").write.mode("overwrite").partitionBy("p").csv(dir) > scala> spark.read.csv(dir).selectExpr("sum(p)").collect() > 18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7) > java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot > be cast to java.lang.Integer > at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41) > ... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24676) Project required data from parsed data when csvColumnPruning disabled
[ https://issues.apache.org/jira/browse/SPARK-24676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24676: Assignee: Apache Spark > Project required data from parsed data when csvColumnPruning disabled > - > > Key: SPARK-24676 > URL: https://issues.apache.org/jira/browse/SPARK-24676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Minor > > I hit a bug below when parsing csv data; > {code} > ./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false > scala> val dir = "/tmp/spark-csv/csv" > scala> spark.range(10).selectExpr("id % 2 AS p", > "id").write.mode("overwrite").partitionBy("p").csv(dir) > scala> spark.read.csv(dir).selectExpr("sum(p)").collect() > 18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7) > java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot > be cast to java.lang.Integer > at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41) > ... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24676) Project required data from parsed data when csvColumnPruning disabled
[ https://issues.apache.org/jira/browse/SPARK-24676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526057#comment-16526057 ] Apache Spark commented on SPARK-24676: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/21657 > Project required data from parsed data when csvColumnPruning disabled > - > > Key: SPARK-24676 > URL: https://issues.apache.org/jira/browse/SPARK-24676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > I hit a bug below when parsing csv data; > {code} > ./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false > scala> val dir = "/tmp/spark-csv/csv" > scala> spark.range(10).selectExpr("id % 2 AS p", > "id").write.mode("overwrite").partitionBy("p").csv(dir) > scala> spark.read.csv(dir).selectExpr("sum(p)").collect() > 18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7) > java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot > be cast to java.lang.Integer > at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41) > ... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24677) MedianHeap is empty when speculation is enabled, causing the SparkContext to stop
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526051#comment-16526051 ] Apache Spark commented on SPARK-24677: -- User 'cxzl25' has created a pull request for this issue: https://github.com/apache/spark/pull/21656 > MedianHeap is empty when speculation is enabled, causing the SparkContext to > stop > - > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Priority: Critical > > The change introduced in SPARK-23433 may cause the SparkContext to stop. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
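The crash above happens because speculation asks for the median of successful task durations before any task has succeeded. A plain-Scala sketch of the defensive shape such a fix presumably takes (hypothetical helper, not the actual patch in PR 21656):

```scala
// Illustrative sketch: only compute a speculation threshold once at least one
// task has succeeded, avoiding the "MedianHeap is empty" crash. The real code
// consults a MedianHeap inside TaskSetManager; here a sorted Seq stands in.
def speculationThreshold(successfulDurations: Seq[Long],
                         multiplier: Double): Option[Double] =
  if (successfulDurations.isEmpty) None // nothing succeeded yet: don't speculate
  else {
    val sorted = successfulDurations.sorted
    Some(sorted(sorted.length / 2) * multiplier) // median * multiplier
  }
```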
[jira] [Assigned] (SPARK-24677) MedianHeap is empty when speculation is enabled, causing the SparkContext to stop
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24677: Assignee: (was: Apache Spark) > MedianHeap is empty when speculation is enabled, causing the SparkContext to > stop > - > > Key: SPARK-24677 > URL: https://issues.apache.org/jira/browse/SPARK-24677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: dzcxzl >Priority: Critical > > The change introduced in SPARK-23433 may cause the SparkContext to stop. > {code:java} > ERROR Utils: uncaught error in thread task-scheduler-speculation, stopping > SparkContext > java.util.NoSuchElementException: MedianHeap is empty. > at org.apache.spark.util.collection.MedianHeap.median(MedianHeap.scala:83) > at > org.apache.spark.scheduler.TaskSetManager.checkSpeculatableTasks(TaskSetManager.scala:968) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.Pool.checkSpeculatableTasks(Pool.scala:93) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:94) > at > org.apache.spark.scheduler.Pool$$anonfun$checkSpeculatableTasks$1.apply(Pool.scala:93) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24677) MedianHeap is empty when speculation is enabled, causing the SparkContext to stop
[ https://issues.apache.org/jira/browse/SPARK-24677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24677:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-24675) Rename table: validate existence of new location
[ https://issues.apache.org/jira/browse/SPARK-24675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24675:
------------------------------------

    Assignee: (was: Apache Spark)

> Rename table: validate existence of new location
> ------------------------------------------------
>
>                 Key: SPARK-24675
>                 URL: https://issues.apache.org/jira/browse/SPARK-24675
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Gengliang Wang
>            Priority: Major
>
> If a table is renamed to an existing location, its data won't show up:
>
> {code}
> scala> Seq("hello").toDF("a").write.format("parquet").saveAsTable("t")
>
> scala> sql("select * from t").show()
> +-----+
> |    a|
> +-----+
> |hello|
> +-----+
>
> scala> sql("alter table t rename to test")
> res2: org.apache.spark.sql.DataFrame = []
>
> scala> sql("select * from test").show()
> +---+
> |  a|
> +---+
> +---+
> {code}
>
> In Hive, if the new location exists, the rename fails even if the location
> is empty. We should have the same validation in the catalog, to guard
> against unexpected bugs.
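The proposed check can be sketched in plain Scala with `java.nio.file` (the function name and error message are illustrative; Spark's catalog code would operate on Hadoop filesystem paths instead):

```scala
import java.nio.file.{Files, Path}

// Hypothetical catalog-side validation: refuse the rename when the
// destination location already exists, even if it is an empty directory,
// matching Hive's behaviour described above.
def validateNewTableLocation(newLocation: Path): Unit = {
  if (Files.exists(newLocation)) {
    throw new IllegalStateException(
      s"Cannot rename table: destination location $newLocation already exists")
  }
}
```

Failing fast here turns silent data loss (an empty result set after the rename) into an explicit error at DDL time.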
[jira] [Assigned] (SPARK-24675) Rename table: validate existence of new location
[ https://issues.apache.org/jira/browse/SPARK-24675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24675:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-24675) Rename table: validate existence of new location
[ https://issues.apache.org/jira/browse/SPARK-24675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526048#comment-16526048 ]

Apache Spark commented on SPARK-24675:
--------------------------------------

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21655
[jira] [Created] (SPARK-24677) MedianHeap is empty when speculation is enabled, causing the SparkContext to stop
dzcxzl created SPARK-24677:
---------------------------

             Summary: MedianHeap is empty when speculation is enabled, causing the SparkContext to stop
                 Key: SPARK-24677
                 URL: https://issues.apache.org/jira/browse/SPARK-24677
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.1
            Reporter: dzcxzl
[jira] [Created] (SPARK-24676) Project required data from parsed data when csvColumnPruning disabled
Takeshi Yamamuro created SPARK-24676:
------------------------------------

             Summary: Project required data from parsed data when csvColumnPruning disabled
                 Key: SPARK-24676
                 URL: https://issues.apache.org/jira/browse/SPARK-24676
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.1
            Reporter: Takeshi Yamamuro

I hit the bug below when parsing CSV data:

{code}
./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false

scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
...
{code}
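The idea behind the fix in the summary can be sketched in plain Scala (illustrative only, not Spark internals): when column pruning is disabled the parser emits every field, so the required columns must be projected out of the full parsed row by name rather than assuming the row already matches the requested schema.

```scala
// With pruning disabled, the parser returns all columns; selecting the
// requested ones by name avoids reading a value at the wrong position,
// which is what produces the UTF8String-to-Integer cast error above.
def projectRequired(
    fullSchema: Seq[String],
    parsedRow: Seq[Any],
    requiredColumns: Seq[String]): Seq[Any] = {
  val index = fullSchema.zipWithIndex.toMap
  requiredColumns.map(c => parsedRow(index(c)))
}
```

Until a fix lands, leaving `spark.sql.csv.parser.columnPruning.enabled` at its default (`true`) avoids the mismatch.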