[jira] [Commented] (SPARK-23427) spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver
[ https://issues.apache.org/jira/browse/SPARK-23427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517326#comment-16517326 ]

Dean Wampler commented on SPARK-23427:
--------------------------------------

Hi, Kazuaki. Any update on this issue? Any pointers on what you discovered? Thanks.

> spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver
> ------------------------------------------------------------------------
>
>                 Key: SPARK-23427
>                 URL: https://issues.apache.org/jira/browse/SPARK-23427
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>         Environment: Spark 2.0
>            Reporter: Dhiraj
>            Priority: Critical
>
> We are facing an issue with the value of spark.sql.autoBroadcastJoinThreshold.
> With spark.sql.autoBroadcastJoinThreshold = -1 (disabled), we see driver memory use stay flat.
> With any other value (10MB, 5MB, 2MB, 1MB, 10K, 1K), we see driver memory use grow at a rate that depends on the size of the autoBroadcastJoinThreshold, until we get an OOM exception. The problem is that the memory used by auto-broadcast is not being freed in the driver.
> The application imports Oracle tables as master DataFrames, which are persisted. Each job applies a filter to these tables and registers the results as temp view tables. SQL queries are then used to process the data further. At the end, all intermediate DataFrames are unpersisted.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887342#comment-15887342 ]

Dean Wampler commented on SPARK-17147:
--------------------------------------

Cody, thanks for the suggestion. We'll try to test it, and also suggest that a customer who wants this functionality do the same.

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
> (i.e. Log Compaction)
> ------------------------------------------------------------------------
>
>                 Key: SPARK-17147
>                 URL: https://issues.apache.org/jira/browse/SPARK-17147
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 2.0.0
>            Reporter: Robert Conrad
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the next requested offset will frequently not be offset+1. The logic in KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset will always be just an increment of 1 above the previous offset.
> I have worked around this problem by changing CachedKafkaConsumer to use the returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer.)
> There's a strong possibility that I have misconstrued how to use the streaming kafka consumer, and I'm happy to close this out if that's the case. If, however, it is supposed to support non-consecutive offsets (e.g. due to log compaction), I am also happy to contribute a PR.
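The workaround described above can be illustrated with a plain-Scala sketch (no Kafka dependency; `Record` here is a hypothetical stand-in for Kafka's `ConsumerRecord`): advancing from the offset the record actually carries survives the gaps that log compaction leaves behind.

```scala
// Hypothetical stand-in for Kafka's ConsumerRecord. Compaction has removed
// offsets 1 and 3, so the remaining offsets are non-consecutive.
case class Record(offset: Long, value: String)

val compactedLog = Seq(Record(0, "a"), Record(2, "b"), Record(4, "c"))

// The buggy logic assumes nextOffset = previous + 1, so after reading
// offset 0 it would wait forever for offset 1, which no longer exists.
// The reported workaround advances from the offset actually returned:
var nextOffset = 0L
for (r <- compactedLog) {
  nextOffset = r.offset + 1 // r.offset may jump ahead; that's fine
}
println(nextOffset) // prints 5: the next offset to request
```

The same idea is what `nextOffset = record.offset + 1` and `requestOffset = r.offset() + 1` do inside CachedKafkaConsumer and KafkaRDD.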
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882778#comment-15882778 ]

Dean Wampler commented on SPARK-17147:
--------------------------------------

We're interested in this enhancement. Does anyone know if and when it will be implemented in Spark?

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
> (i.e. Log Compaction)
> ------------------------------------------------------------------------
>
>                 Key: SPARK-17147
>                 URL: https://issues.apache.org/jira/browse/SPARK-17147
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 2.0.0
>            Reporter: Robert Conrad
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the next requested offset will frequently not be offset+1. The logic in KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset will always be just an increment of 1 above the previous offset.
> I have worked around this problem by changing CachedKafkaConsumer to use the returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer.)
> There's a strong possibility that I have misconstrued how to use the streaming kafka consumer, and I'm happy to close this out if that's the case. If, however, it is supposed to support non-consecutive offsets (e.g. due to log compaction), I am also happy to contribute a PR.
[jira] [Comment Edited] (SPARK-16239) SQL issues with cast from date to string around daylight savings time
[ https://issues.apache.org/jira/browse/SPARK-16239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15478438#comment-15478438 ]

Dean Wampler edited comment on SPARK-16239 at 9/9/16 10:20 PM:
---------------------------------------------------------------

I investigated this a bit today for a customer. I could not reproduce this bug on MacOS X, Ubuntu, or RedHat releases with kernels 3.10.0-327.el7.x86_64 and 2.6.32-504.8.1.el6.x86_64, using Amazon AMIs. My customer has a private cloud environment with kernel 2.6.32-504.50.1.el6.x86_64 where he sees the bug. Anyway, I think it's something very specific to his cloud VM configuration, such as a buggy library. For all cases we used this JVM:

{code}
$ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
{code}

My point is that we should narrow down whether this is really a Spark bug or a bug in the underlying platform. For reference, here's the code example we used in his environment and my test environments (some output suppressed):

{code}
scala> sqlContext.udf.register("to_date", (s: String) =>
     |   new java.sql.Date(new java.text.SimpleDateFormat("yyyy-MM-dd").parse(s).getTime()))

scala> val dates = (0 to 5).map(i => s"1949-11-${25+i}")

scala> val df = sc.parallelize(dates).toDF("date")

scala> df.show
+----------+
|      date|
+----------+
|1949-11-25|
|1949-11-26|
|1949-11-27|
|1949-11-28|
|1949-11-29|
|1949-11-30|
+----------+

scala> val df2 = df.select(to_date($"date"))

scala> df2.show
+------------+
|todate(date)|
+------------+
|  1949-11-25|
|  1949-11-26|
|  1949-11-27|   // <--- my customer sees 1949-11-26
|  1949-11-28|
|  1949-11-29|
|  1949-11-30|
+------------+
{code}

If I'm right that this isn't really a Spark bug, then the following should be sufficient to demonstrate it in the Spark shell or a Scala interpreter of the same version:

{code}
scala> val f = (s: String) =>
     |   new java.sql.Date(new java.text.SimpleDateFormat("yyyy-MM-dd").parse(s).getTime())

scala> val d = f("1949-11-27")
d: java.sql.Date = 1949-11-27
{code}

> SQL issues with cast from date to string around daylight savings time
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16239
>                 URL: https://issues.apache.org/jira/browse/SPARK-16239
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Glen Maisey
>            Priority: Critical
>
> Hi all,
> I have a dataframe with a date column. When I cast to a string using the spark sql cast function, it converts it to the wrong date on certain days. Looking into it, it occurs once a year, when summer daylight savings starts.
> I've tried to show this issue in the code below. The toString() function works correctly, whereas the cast does not.
> Unfortunately my users are using SQL code rather than scala dataframes, and therefore this workaround does not apply.
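If the root cause really is the platform's timezone handling rather than Spark, pinning the formatter's timezone sidesteps it. A minimal, Spark-free sketch of the same parsing logic as the `to_date` UDF in the comment above, with the timezone explicitly set to UTC (an assumption for illustration, not what Spark itself does):

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

// Same parsing logic as the to_date UDF above, but with the timezone pinned
// to UTC so platform-specific DST rules cannot shift the result.
val fmt = new SimpleDateFormat("yyyy-MM-dd")
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))

val parsed = new java.sql.Date(fmt.parse("1949-11-27").getTime)

// Format it back with the same UTC formatter: the round trip is stable.
val roundTrip = fmt.format(parsed)
println(roundTrip) // prints 1949-11-27
```

On a platform with a buggy timezone database, the unpinned version of this round trip is where a one-day shift would show up.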
[jira] [Commented] (SPARK-16239) SQL issues with cast from date to string around daylight savings time
[ https://issues.apache.org/jira/browse/SPARK-16239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15478438#comment-15478438 ]

Dean Wampler commented on SPARK-16239:
--------------------------------------

I investigated this a bit today for a customer. I could not reproduce this bug on MacOS X, Ubuntu, or RedHat releases with kernels 3.10.0-327.el7.x86_64 and 2.6.32-504.8.1.el6.x86_64, using Amazon AMIs. My customer has a private cloud environment with kernel 2.6.32-504.50.1.el6.x86_64 where he sees the bug. Anyway, I think it's something very specific to his cloud VM configuration, such as a buggy library. For all cases we used this JVM:

{code}
$ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
{code}

My point is that we should narrow down whether this is really a Spark bug or a bug in the underlying platform. For reference, here's the code example we used in his environment and my test environments (some output suppressed):

{code}
scala> sqlContext.udf.register("to_date", (s: String) =>
     |   new java.sql.Date(new java.text.SimpleDateFormat("yyyy-MM-dd").parse(s).getTime()))

scala> val dates = (0 to 5).map(i => s"1949-11-${25+i}")

scala> val df = sc.parallelize(dates).toDF("date")

scala> df.show
+----------+
|      date|
+----------+
|1949-11-25|
|1949-11-26|
|1949-11-27|
|1949-11-28|
|1949-11-29|
|1949-11-30|
+----------+

scala> val df2 = df.select(to_date($"date"))

scala> df2.show
+------------+
|todate(date)|
+------------+
|  1949-11-25|
|  1949-11-26|
|  1949-11-27|   // <--- my customer sees 1949-11-26
|  1949-11-28|
|  1949-11-29|
|  1949-11-30|
+------------+
{code}

If I'm right that this isn't really a Spark bug, then the following should be sufficient to demonstrate it in the Spark shell or a Scala interpreter of the same version:

{code}
scala> val f = (s: String) =>
     |   new java.sql.Date(new java.text.SimpleDateFormat("yyyy-MM-dd").parse(s).getTime())

scala> val d = f("1949-11-27")
d: java.sql.Date = 1949-11-27
{code}

> SQL issues with cast from date to string around daylight savings time
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16239
>                 URL: https://issues.apache.org/jira/browse/SPARK-16239
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Glen Maisey
>            Priority: Critical
>
> Hi all,
> I have a dataframe with a date column. When I cast to a string using the spark sql cast function, it converts it to the wrong date on certain days. Looking into it, it occurs once a year, when summer daylight savings starts.
> I've tried to show this issue in the code below. The toString() function works correctly, whereas the cast does not.
> Unfortunately my users are using SQL code rather than scala dataframes, and therefore this workaround does not apply. This was actually picked up where a user was writing something like "SELECT date1 UNION ALL SELECT date2", where date1 was a string and date2 was a date type. It must be implicitly converting the date to a string, which gives this error.
> I'm in the Australia/Sydney timezone (see the time changes here: http://www.timeanddate.com/time/zone/australia/sydney)
> {code}
> val dates = Array("2014-10-03", "2014-10-04", "2014-10-05", "2014-10-06",
>                   "2015-10-02", "2015-10-03", "2015-10-04", "2015-10-05")
> val df = sc.parallelize(dates)
>   .toDF("txn_date")
>   .select(col("txn_date").cast("Date"))
> df.select(
>     col("txn_date"),
>     col("txn_date").cast("Timestamp").alias("txn_date_timestamp"),
>     col("txn_date").cast("String").alias("txn_date_str_cast"),
>     col("txn_date".toString()).alias("txn_date_str_toString")
>   )
>   .show()
>
> +----------+--------------------+-----------------+---------------------+
> |  txn_date|  txn_date_timestamp|txn_date_str_cast|txn_date_str_toString|
> +----------+--------------------+-----------------+---------------------+
> |2014-10-03|2014-10-02 14:00:...|       2014-10-03|           2014-10-03|
> |2014-10-04|2014-10-03 14:00:...|       2014-10-04|           2014-10-04|
> |2014-10-05|2014-10-04 13:00:...|       2014-10-04|           2014-10-05|
> |2014-10-06|2014-10-05 13:00:...|       2014-10-06|           2014-10-06|
> |2015-10-02|2015-10-01 14:00:...|       2015-10-02|           2015-10-02|
> |2015-10-03|2015-10-02 14:00:...|       2015-10-03|           2015-10-03|
> |2015-10-04|2015-10-03 13:00:...|       2015-10-03|           2015-10-04|
> |2015-10-05|2015-10-04 13:00:...|       2015-10-05|           2015-10-05|
> +----------+--------------------+-----------------+---------------------+
> {code}
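The Sydney transition the reporter links to can be checked directly against the JVM's timezone database, which is where any date-to-string conversion ultimately gets its offsets. A Spark-free diagnostic sketch:

```scala
import java.time.{Instant, ZoneId}

val sydney = ZoneId.of("Australia/Sydney")

// DST in Sydney started on 2014-10-05 at 02:00 local time: before the
// transition the offset from UTC is +10:00, after it +11:00. The rows
// above that show the wrong cast are exactly the first days after the
// transitions (2014-10-05 and 2015-10-04).
val before = sydney.getRules.getOffset(Instant.parse("2014-10-01T00:00:00Z"))
val after  = sydney.getRules.getOffset(Instant.parse("2014-10-10T00:00:00Z"))

println(before) // prints +10:00
println(after)  // prints +11:00
```

If the offsets reported here disagree with the platform's expected rules, the bug is in the timezone data rather than in Spark's cast.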
[jira] [Created] (SPARK-16684) Standalone mode local dirs not properly cleaned if job is killed
Dean Wampler created SPARK-16684:
------------------------------------

             Summary: Standalone mode local dirs not properly cleaned if job is killed
                 Key: SPARK-16684
                 URL: https://issues.apache.org/jira/browse/SPARK-16684
             Project: Spark
          Issue Type: Bug
          Components: Spark Shell
    Affects Versions: 1.6.2
         Environment: MacOS, but probably the same for Linux
            Reporter: Dean Wampler
            Priority: Minor

The shuffle service wasn't used. If you control-c out of a job, e.g., the spark-shell, cleanup does in fact occur correctly, but if you send a kill -9 to the process, then cleanup isn't done. (I used these methods to simulate certain crash scenarios.)

Possible solution: have the master and worker daemons delete temporary files older than a user-configurable time.

Workaround: set up a cron job that does this cleanup.
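The proposed age-based cleanup can be sketched with plain java.io (the directory name and 24-hour threshold below are illustrative assumptions, not actual Spark configuration):

```scala
import java.io.File

// Delete files under `dir` whose last-modified time is older than maxAgeMs.
// A sketch of the proposed daemon-side cleanup (or the cron-job workaround).
def cleanOldFiles(dir: File, maxAgeMs: Long,
                  now: Long = System.currentTimeMillis): Seq[File] = {
  val files = Option(dir.listFiles).getOrElse(Array.empty[File]).toSeq
  val stale = files.filter(f => f.isFile && now - f.lastModified > maxAgeMs)
  stale.foreach(_.delete())
  stale
}

// Example: clean a scratch directory of files older than 24 hours.
val tmp = java.nio.file.Files.createTempDirectory("spark-local-demo").toFile
val old = new File(tmp, "stale.tmp")
old.createNewFile()
old.setLastModified(System.currentTimeMillis - 48L * 3600 * 1000) // 48h old
val removed = cleanOldFiles(tmp, 24L * 3600 * 1000)
println(removed.map(_.getName)) // prints List(stale.tmp)
```

A cron job would run the same logic on each worker's local directories on a fixed schedule.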
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058856#comment-15058856 ]

Dean Wampler commented on SPARK-12177:
--------------------------------------

Since the new Kafka 0.9 consumer API supports SSL, would implementation of SPARK-12177 enable use of SSL, or would additional work be required in Spark's integration?

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --------------------------------------------------
>
>                 Key: SPARK-12177
>                 URL: https://issues.apache.org/jira/browse/SPARK-12177
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.6.0
>            Reporter: Nikita Tarasenko
>              Labels: consumer, kafka
>
> Kafka 0.9 has already been released, and it introduces a new consumer API that is not compatible with the old one. So I added the new consumer API. I made separate classes in package org.apache.spark.streaming.kafka.v09 with the changed API. I didn't remove the old classes, for more backward compatibility. Users will not need to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.
[jira] [Created] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream
Dean Wampler created SPARK-12105:
------------------------------------

             Summary: Add a DataFrame.show() with argument for output PrintStream
                 Key: SPARK-12105
                 URL: https://issues.apache.org/jira/browse/SPARK-12105
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.5.2
            Reporter: Dean Wampler
            Priority: Minor

It would be nice to send the output of DataFrame.show(...) to a different output stream than stdout, including just capturing the string itself. This is useful, e.g., for testing.

Actually, it would be sufficient, and perhaps better, to just make DataFrame.showString a public method.
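Until such an overload exists, one workaround is to capture stdout around the call, since show() prints there. A sketch (the `println` here is a Spark-free stand-in for the `df.show()` call):

```scala
import java.io.{ByteArrayOutputStream, PrintStream}

// Capture anything printed to Console.out inside `body` -- e.g. df.show().
def captureStdout(body: => Unit): String = {
  val buf = new ByteArrayOutputStream()
  Console.withOut(new PrintStream(buf, true, "UTF-8")) { body }
  buf.toString("UTF-8")
}

// A plain println stands in for the DataFrame call in this sketch.
val captured = captureStdout { println("+---+\n|  n|\n+---+") }
print(captured) // the table text, now available as a String
```

This relies on show() printing via Scala's Console.out; a public showString (or a PrintStream argument) would make the workaround unnecessary, which is the point of this issue.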
[jira] [Created] (SPARK-12052) DataFrame with self-join fails unless toDF() column aliases provided
Dean Wampler created SPARK-12052:
------------------------------------

             Summary: DataFrame with self-join fails unless toDF() column aliases provided
                 Key: SPARK-12052
                 URL: https://issues.apache.org/jira/browse/SPARK-12052
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.2, 1.5.1, 1.6.0
         Environment: spark-shell for Spark 1.5.1, 1.5.2, and 1.6.0-Preview2
            Reporter: Dean Wampler

Joining with the same DF twice appears to match on the wrong column unless the columns in the results of the first join are aliased with "toDF". Here is an example program:

{code}
val rdd = sc.parallelize(2 to 100, 1).cache
val numbers = rdd.map(i => (i, i*i)).toDF("n", "nsq")
val names = rdd.map(i => (i, i.toString)).toDF("id", "name")
numbers.show
names.show

val good = numbers.
  join(names, numbers("n") === names("id")).toDF("n", "nsq", "id1", "name1").
  join(names, $"nsq" === names("id")).toDF("n", "nsq", "id1", "name1", "id2", "name2")
// The last toDF can be omitted and you'll still get valid results.

good.printSchema
// root
//  |-- n: integer (nullable = false)
//  |-- nsq: integer (nullable = false)
//  |-- id1: integer (nullable = false)
//  |-- name1: string (nullable = true)
//  |-- id2: integer (nullable = false)
//  |-- name2: string (nullable = true)

good.count
// res3: Long = 9

good.show
// +---+---+---+-----+---+-----+
// |  n|nsq|id1|name1|id2|name2|
// +---+---+---+-----+---+-----+
// |  2|  4|  2|    2|  4|    4|
// |  4| 16|  4|    4| 16|   16|
// |  6| 36|  6|    6| 36|   36|
// |  8| 64|  8|    8| 64|   64|
// | 10|100| 10|   10|100|  100|
// |  3|  9|  3|    3|  9|    9|
// |  5| 25|  5|    5| 25|   25|
// |  7| 49|  7|    7| 49|   49|
// |  9| 81|  9|    9| 81|   81|
// +---+---+---+-----+---+-----+

val bad = numbers.
  join(names, numbers("n") === names("id")).
  join(names, $"nsq" === names("id"))

bad.printSchema
// root
//  |-- n: integer (nullable = false)
//  |-- nsq: integer (nullable = false)
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)

bad.count
// res6: Long = 0

bad.show
// +---+---+---+----+---+----+
// |  n|nsq| id|name| id|name|
// +---+---+---+----+---+----+
// +---+---+---+----+---+----+

// Curiously, if you change the original rdd line to this:
//   val rdd = sc.parallelize(1 to 100, 1).cache
// the first record in numbers is (1,1). Then bad will have the following
// content:
// +---+---+---+----+---+----+
// |  n|nsq| id|name| id|name|
// +---+---+---+----+---+----+
// |  1|  1|  1|   1|  1|   1|
// |  1|  1|  1|   1|  2|   2|
// |  1|  1|  1|   1|  3|   3|
// |  1|  1|  1|   1|  4|   4|
// |  1|  1|  1|   1|  5|   5|
// ...
// |  1|  1|  1|   1| 98|  98|
// |  1|  1|  1|   1| 99|  99|
// |  1|  1|  1|   1|100| 100|
// +---+---+---+----+---+----+
//
// This makes no sense to me.

// Breaking it up, so we can reference 'bad2("nsq")', doesn't help:
val bad2 = numbers.
  join(names, numbers("n") === names("id"))
val bad3 = bad2.
  join(names, bad2("nsq") === names("id"))
bad3.printSchema
bad3.count
bad3.show
{code}

Note the embedded comment that if you start with 1 to 100, you get a record in {{numbers}} with two {{1}} values. This yields the strange results shown in the comment, suggesting that the join was actually done on the wrong column of the first result set. However, the output actually makes no sense; based on the results you get from the first join alone, it's "impossible" to get this output!

Note: could be related to the following issues:
* https://issues.apache.org/jira/browse/SPARK-10838 (I observed this behavior while experimenting to examine this bug.)
* https://issues.apache.org/jira/browse/SPARK-11072
* https://issues.apache.org/jira/browse/SPARK-10925
[jira] [Commented] (SPARK-11700) Memory leak at SparkContext jobProgressListener stageIdToData map
[ https://issues.apache.org/jira/browse/SPARK-11700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15022455#comment-15022455 ]

Dean Wampler commented on SPARK-11700:
--------------------------------------

Actually, memory leaks like this aren't acceptable, IMHO. First, the object graph held by the SQLContext, including the SparkContext, can accumulate lots of data, such as metrics data, in large, long-running jobs. (In this case, setting spark.ui.retainedJobs and spark.ui.retainedStages to some reasonable N can help.) It gets worse for streaming jobs. Second, allowing some memory leaks inevitably makes it harder to track down more painful leaks when they occur.

> Memory leak at SparkContext jobProgressListener stageIdToData map
> -----------------------------------------------------------------
>
>                 Key: SPARK-11700
>                 URL: https://issues.apache.org/jira/browse/SPARK-11700
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.5.0, 1.5.1, 1.5.2
>         Environment: Ubuntu 14.04 LTS, Oracle JDK 1.8.51, Apache Tomcat 8.0.28, Spring 4
>            Reporter: Kostas papageorgopoulos
>            Assignee: Shixiong Zhu
>            Priority: Critical
>              Labels: leak, memory-leak
>         Attachments: AbstractSparkJobRunner.java, SparkContextPossibleMemoryLeakIDEA_DEBUG.png, SparkHeapSpaceProgress.png, SparkMemoryAfterLotsOfConsecutiveRuns.png, SparkMemoryLeakAfterLotsOfRunsWithinTheSameContext.png
>
> It seems that there is a SparkContext jobProgressListener memory leak. Below I describe the steps I take to reproduce it.
> I have created a java webapp trying to abstractly run some Spark SQL jobs that read data from HDFS (join them) and write them to ElasticSearch using the ES hadoop connector. After a lot of consecutive runs I noticed that my heap space was full, so I got an out-of-heap-space error.
> In the attached file {{AbstractSparkJobRunner}}, the {{public final void run(T jobConfiguration, ExecutionLog executionLog) throws Exception}} method runs each time a Spark SQL job is triggered. So I tried to reuse the same SparkContext for a number of consecutive runs. If certain rules apply, I try to clean up the SparkContext by first calling {{killSparkAndSqlContext}}. This code eventually runs:
> {code}
> synchronized (sparkContextThreadLock) {
>     if (javaSparkContext != null) {
>         LOGGER.info("!!! CLEARING SPARK CONTEXT !!!");
>         javaSparkContext.stop();
>         javaSparkContext = null;
>         sqlContext = null;
>         System.gc();
>     }
>     numberOfRunningJobsForSparkContext.getAndSet(0);
> }
> {code}
> So at some point in time, I suppose that if no other Spark SQL job should run, I should kill the SparkContext (AbstractSparkJobRunner.killSparkAndSqlContext runs), and this should be garbage collected by the garbage collector. However, this is not the case. Even though my debugger shows that my JavaSparkContext object is null (see attached picture SparkContextPossibleMemoryLeakIDEA_DEBUG.png), jvisualvm shows incrementally growing heap space even when the garbage collector is called (see attached picture SparkHeapSpaceProgress.png).
> The Memory Analyzer Tool shows that a big part of the retained heap is assigned to _jobProgressListener (see attached picture SparkMemoryAfterLotsOfConsecutiveRuns.png and summary picture SparkMemoryLeakAfterLotsOfRunsWithinTheSameContext.png), although at the same time the JavaSparkContext in the singleton service is null.
[jira] [Created] (SPARK-9409) make-distribution.sh should copy all files in conf, so that it's easy to create a distro with custom configuration and property settings
Dean Wampler created SPARK-9409:
------------------------------------

             Summary: make-distribution.sh should copy all files in conf, so that it's easy to create a distro with custom configuration and property settings
                 Key: SPARK-9409
                 URL: https://issues.apache.org/jira/browse/SPARK-9409
             Project: Spark
          Issue Type: Improvement
          Components: Build
    Affects Versions: 1.4.1
         Environment: MacOS, Linux
            Reporter: Dean Wampler
            Priority: Minor

When using make-distribution.sh to build a custom distribution, it would be nice to be able to drop custom configuration files in the conf directory and have them included in the archive. Currently, only the *.template files are included.
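The requested behavior amounts to copying everything in conf/ rather than only the *.template files. A Scala sketch of that copy step (the actual change would be a one-line edit to the shell script; the directory names below are illustrative):

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// Copy every regular file in confDir (not just *.template) into
// distConfDir -- the behavior this issue requests of make-distribution.sh.
def copyConf(confDir: Path, distConfDir: Path): Seq[Path] = {
  Files.createDirectories(distConfDir)
  val it = Files.list(confDir).iterator()
  var copied = List.empty[Path]
  while (it.hasNext) {
    val src = it.next()
    if (Files.isRegularFile(src)) {
      val dst = distConfDir.resolve(src.getFileName)
      Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING)
      copied ::= dst
    }
  }
  copied.reverse
}

// Example: a conf dir holding one template and one custom file.
val confDir = Files.createTempDirectory("conf-demo")
Files.write(confDir.resolve("spark-env.sh.template"), "# template".getBytes)
Files.write(confDir.resolve("log4j.properties"), "log4j.rootCategory=INFO".getBytes)
val copied = copyConf(confDir, confDir.resolveSibling(confDir.getFileName.toString + "-dist"))
println(copied.map(_.getFileName.toString).sorted) // both files are copied
```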
[jira] [Created] (SPARK-4564) SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema
Dean Wampler created SPARK-4564:
------------------------------------

             Summary: SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema
                 Key: SPARK-4564
                 URL: https://issues.apache.org/jira/browse/SPARK-4564
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.1.0
         Environment: Mac OSX, local mode, but should hold true for all environments
            Reporter: Dean Wampler

In the following example, I would expect the grouped schema to contain two fields, the String name and the Long count, but it only contains the Long count.

{code}
// Assumes val sc = new SparkContext(...), e.g., in Spark Shell
import org.apache.spark.sql.{SQLContext, SchemaRDD}
import org.apache.spark.sql.catalyst.expressions._
val sqlc = new SQLContext(sc)
import sqlc._

case class Record(name: String, n: Int)
val records = List(
  Record("three", 1),
  Record("three", 2),
  Record("two",   3),
  Record("three", 4),
  Record("two",   5))
val recs = sc.parallelize(records)
recs.registerTempTable("records")

val grouped = recs.select('name, 'n).groupBy('name)(Count('n) as 'count)
grouped.printSchema
// root
//  |-- count: long (nullable = false)
grouped foreach println
// [2]
// [3]
{code}
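For comparison, Scala's own collections keep the grouping key alongside the aggregate, which is the schema this report expects. A plain-Scala analogue of the example above (no Spark dependency):

```scala
case class Record(name: String, n: Int)

val records = List(
  Record("three", 1), Record("three", 2), Record("two", 3),
  Record("three", 4), Record("two", 5))

// Group by name and count, keeping the grouping key in the result:
// the (name, count) shape the report expects from SchemaRDD.groupBy.
val grouped = records.groupBy(_.name).map { case (name, rs) => (name, rs.size) }

println(grouped.toList.sorted) // prints List((three,3), (two,2))
```

RDD.groupByKey and PairRDDFunctions behave the same way, which is the inconsistency noted in the follow-up comment.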
[jira] [Commented] (SPARK-4564) SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema
[ https://issues.apache.org/jira/browse/SPARK-4564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222409#comment-14222409 ]

Dean Wampler commented on SPARK-4564:
-------------------------------------

As soon as I reported this, I thought of a way to project out the field: {{('name, Count('n) as 'count)}}, making the whole expression:

{code}
val grouped = recs.select('name, 'n).groupBy('name)('name, Count('n) as 'count)
{code}

However, the behavior is still inconsistent with similar methods in RDD and PairRDDFunctions, where the grouping expression is part of the output schema.

> SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4564
>                 URL: https://issues.apache.org/jira/browse/SPARK-4564
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.0
>         Environment: Mac OSX, local mode, but should hold true for all environments
>            Reporter: Dean Wampler
>
> In the following example, I would expect the grouped schema to contain two fields, the String name and the Long count, but it only contains the Long count.
> {code}
> // Assumes val sc = new SparkContext(...), e.g., in Spark Shell
> import org.apache.spark.sql.{SQLContext, SchemaRDD}
> import org.apache.spark.sql.catalyst.expressions._
> val sqlc = new SQLContext(sc)
> import sqlc._
>
> case class Record(name: String, n: Int)
> val records = List(
>   Record("three", 1),
>   Record("three", 2),
>   Record("two",   3),
>   Record("three", 4),
>   Record("two",   5))
> val recs = sc.parallelize(records)
> recs.registerTempTable("records")
>
> val grouped = recs.select('name, 'n).groupBy('name)(Count('n) as 'count)
> grouped.printSchema
> // root
> //  |-- count: long (nullable = false)
> grouped foreach println
> // [2]
> // [3]
> {code}