[jira] [Commented] (SPARK-23427) spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver

2018-06-19 Thread Dean Wampler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517326#comment-16517326
 ] 

Dean Wampler commented on SPARK-23427:
--

Hi, Kazuaki. Any update on this issue? Any pointers on what you discovered? 
Thanks.

> spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver 
> -
>
> Key: SPARK-23427
> URL: https://issues.apache.org/jira/browse/SPARK-23427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: SPARK 2.0 version
>Reporter: Dhiraj
>Priority: Critical
>
> We are facing an issue with the value of spark.sql.autoBroadcastJoinThreshold.
> With spark.sql.autoBroadcastJoinThreshold set to -1 (disabled), we see driver 
> memory usage stay flat.
> With any other value (10MB, 5MB, 2MB, 1MB, 10K, 1K), we see driver memory usage 
> grow at a rate that depends on the size of the autoBroadcastJoinThreshold, and 
> we eventually get an OOM exception. The problem is that the memory used by the 
> auto-broadcast is not freed in the driver.
> The application imports Oracle tables as master dataframes, which are 
> persisted. Each job applies a filter to these tables and then registers them as 
> temporary views. SQL queries are then used to process the data further. At the 
> end, all intermediate dataframes are unpersisted.
>  
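For readers who want to experiment with the setting described above, a minimal sketch of adjusting it at runtime follows; the JDBC URL, table name, and filter are illustrative assumptions, not values from the report:

{code}
// Disable automatic broadcast joins entirely (the "-1" case above), or set a
// byte threshold below which the smaller side of a join is broadcast.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")          // disable
// sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "10485760") // 10 MB

// Hypothetical outline matching the description: persist master data, filter,
// register a temp view, query, then unpersist the intermediate dataframe.
val master = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")  // assumed URL
  .option("dbtable", "MASTER_TABLE")                         // assumed table
  .load()
  .persist()
val filtered = master.filter("STATUS = 'ACTIVE'")            // assumed filter
filtered.registerTempTable("master_filtered")
sqlContext.sql("SELECT COUNT(*) FROM master_filtered").show()
filtered.unpersist()
{code}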






[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2017-02-27 Thread Dean Wampler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887342#comment-15887342
 ] 

Dean Wampler commented on SPARK-17147:
--

Cody, thanks for the suggestion. We'll try to test it, and we'll also suggest 
that a customer who wants this functionality do the same.

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.
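
To make the workaround concrete, here is a standalone sketch of the same idea with a plain Kafka 0.10 consumer (broker, group, and topic names are placeholders, and this is not the actual KafkaRDD / CachedKafkaConsumer code): advance by the returned record's offset rather than assuming offsets are consecutive.

{code}
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
props.put("group.id", "compaction-demo")          // placeholder group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("compacted-topic"))  // placeholder topic

var nextOffset = -1L
for (record <- consumer.poll(1000).asScala) {
  // With a compacted topic, `nextOffset += 1` would drift onto deleted offsets;
  // using the returned record's offset tolerates the gaps.
  nextOffset = record.offset + 1
}
consumer.close()
{code}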






[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2017-02-24 Thread Dean Wampler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882778#comment-15882778
 ] 

Dean Wampler commented on SPARK-17147:
--

We're interested in this enhancement. Does anyone know if and when it will be 
implemented in Spark?

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.






[jira] [Comment Edited] (SPARK-16239) SQL issues with cast from date to string around daylight savings time

2016-09-09 Thread Dean Wampler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478438#comment-15478438
 ] 

Dean Wampler edited comment on SPARK-16239 at 9/9/16 10:20 PM:
---

I investigated this a bit today for a customer. I could not reproduce this bug 
on Mac OS X, Ubuntu, or RedHat releases with kernels 3.10.0-327.el7.x86_64 and 
2.6.32-504.8.1.el6.x86_64, using Amazon AMIs. My customer has a private cloud 
environment with kernel 2.6.32-504.50.1.el6.x86_64 where he sees the bug. 
Anyway, I think it's something very specific to his cloud VM configuration, 
such as a buggy library. For all cases we used this JVM:

{code}
$ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
{code}

My point is that we should narrow down if this is really a Spark bug or a bug 
in the underlying platform.

For reference, here's the code example we used in his environment and my test 
environments (some output suppressed):

{code}
scala> sqlContext.udf.register("to_date", (s: String) =>
  new java.sql.Date(
    new java.text.SimpleDateFormat("yyyy-MM-dd").parse(s).getTime())
)

scala> val dates = (0 to 5).map(i => s"1949-11-${25+i}")

scala> val df = sc.parallelize(dates).toDF("date")
scala> df.show
+----------+
|      date|
+----------+
|1949-11-25|
|1949-11-26|
|1949-11-27|
|1949-11-28|
|1949-11-29|
|1949-11-30|
+----------+

scala> val df2 = df.select(to_date($"date"))
scala> df2.show

+------------+
|todate(date)|
+------------+
|  1949-11-25|
|  1949-11-26|
|  1949-11-27|  //  <--- my customer sees 1949-11-26
|  1949-11-28|
|  1949-11-29|
|  1949-11-30|
+------------+
{code}

If I'm right that this isn't really a Spark bug, then the following should be 
sufficient to demonstrate it in the Spark shell or a Scala interpreter of the 
same version:

{code}
scala> val f = (s: String) =>
  new java.sql.Date(
    new java.text.SimpleDateFormat("yyyy-MM-dd").parse(s).getTime())

scala> val d = f("1949-11-27")
d: java.sql.Date = 1949-11-27
{code}
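
Not part of the original comment, but possibly useful for isolating the platform dependency: a minimal sketch of the same conversion using java.time (Java 8+), which avoids SimpleDateFormat parsing altogether. The function name is mine, not from the report.

{code}
scala> val toSqlDate = (s: String) =>
  java.sql.Date.valueOf(java.time.LocalDate.parse(s))  // LocalDate.parse reads ISO yyyy-MM-dd

scala> toSqlDate("1949-11-27")
res0: java.sql.Date = 1949-11-27
{code}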




> SQL issues with cast from date to string around daylight savings time
> -
>
> Key: SPARK-16239
> URL: https://issues.apache.org/jira/browse/SPARK-16239
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Glen Maisey
>Priority: Critical
>
> Hi all,
> I have a dataframe with a date column. When I cast to a string using the 
> spark sql cast function it converts it to the wrong date on certain days. 
> Looking into it, it occurs once a year when summer daylight savings starts.
> I've tried to show this issue in the code below. The toString() function works 
> correctly whereas the cast does not.
> Unfortunately my users are using SQL code rather than scala dataframes and 
> therefore this workaround 

[jira] [Commented] (SPARK-16239) SQL issues with cast from date to string around daylight savings time

2016-09-09 Thread Dean Wampler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478438#comment-15478438
 ] 

Dean Wampler commented on SPARK-16239:
--

I investigated this a bit today for a customer. I could not reproduce this bug on 
Mac OS X, Ubuntu, or RedHat releases with kernels 3.10.0-327.el7.x86_64 and 
2.6.32-504.8.1.el6.x86_64, using Amazon AMIs. My customer has a private cloud 
environment with kernel 2.6.32-504.50.1.el6.x86_64 where he sees the bug. 
Anyway, I think it's something very specific to his cloud VM configuration, 
such as a buggy library. For all cases we used this JVM:

{code}
$ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
{code}

My point is that we should narrow down if this is really a Spark bug or a bug 
in the underlying platform.

For reference, here's the code example we used in his environment and my test 
environments (some output suppressed):

{code}
scala> sqlContext.udf.register("to_date", (s: String) =>
  new java.sql.Date(
    new java.text.SimpleDateFormat("yyyy-MM-dd").parse(s).getTime())
)

scala> val dates = (0 to 5).map(i => s"1949-11-${25+i}")

scala> val df = sc.parallelize(dates).toDF("date")
scala> df.show
+----------+
|      date|
+----------+
|1949-11-25|
|1949-11-26|
|1949-11-27|
|1949-11-28|
|1949-11-29|
|1949-11-30|
+----------+

scala> val df2 = df.select(to_date($"date"))
scala> df2.show

+------------+
|todate(date)|
+------------+
|  1949-11-25|
|  1949-11-26|
|  1949-11-27|  //  <--- my customer sees 1949-11-26
|  1949-11-28|
|  1949-11-29|
|  1949-11-30|
+------------+
{code}

If I'm right that this isn't really a Spark bug, then the following should be 
sufficient to demonstrate it in the Spark shell or a Scala interpreter of the 
same version:

{code}
scala> val f = (s: String) =>
  new java.sql.Date(
    new java.text.SimpleDateFormat("yyyy-MM-dd").parse(s).getTime())

scala> val d = f("1949-11-27")
d: java.sql.Date = 1949-11-27
{code}


> SQL issues with cast from date to string around daylight savings time
> -
>
> Key: SPARK-16239
> URL: https://issues.apache.org/jira/browse/SPARK-16239
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Glen Maisey
>Priority: Critical
>
> Hi all,
> I have a dataframe with a date column. When I cast to a string using the 
> spark sql cast function it converts it to the wrong date on certain days. 
> Looking into it, it occurs once a year when summer daylight savings starts.
> I've tried to show this issue in the code below. The toString() function works 
> correctly whereas the cast does not.
> Unfortunately my users are using SQL code rather than scala dataframes and 
> therefore this workaround does not apply. This was actually picked up where a 
> user was writing something like "SELECT date1 UNION ALL select date2" where 
> date1 was a string and date2 was a date type. It must be implicitly 
> converting the date to a string which gives this error.
> I'm in the Australia/Sydney timezone (see the time changes here 
> http://www.timeanddate.com/time/zone/australia/sydney) 
> val dates = 
>   Array("2014-10-03", "2014-10-04", "2014-10-05", "2014-10-06",
>         "2015-10-02", "2015-10-03", "2015-10-04", "2015-10-05")
> val df = sc.parallelize(dates)
>   .toDF("txn_date")
>   .select(col("txn_date").cast("Date"))
> df.select(
>     col("txn_date"),
>     col("txn_date").cast("Timestamp").alias("txn_date_timestamp"),
>     col("txn_date").cast("String").alias("txn_date_str_cast"),
>     col("txn_date".toString()).alias("txn_date_str_toString")
>   )
>   .show()
> +----------+--------------------+-----------------+---------------------+
> |  txn_date|  txn_date_timestamp|txn_date_str_cast|txn_date_str_toString|
> +----------+--------------------+-----------------+---------------------+
> |2014-10-03|2014-10-02 14:00:...|       2014-10-03|           2014-10-03|
> |2014-10-04|2014-10-03 14:00:...|       2014-10-04|           2014-10-04|
> |2014-10-05|2014-10-04 13:00:...|       2014-10-04|           2014-10-05|
> |2014-10-06|2014-10-05 13:00:...|       2014-10-06|           2014-10-06|
> |2015-10-02|2015-10-01 14:00:...|       2015-10-02|           2015-10-02|
> |2015-10-03|2015-10-02 14:00:...|       2015-10-03|           2015-10-03|
> |2015-10-04|2015-10-03 13:00:...|       2015-10-03|           2015-10-04|
> |2015-10-05|2015-10-04 13:00:...|       2015-10-05|           2015-10-05|
> +----------+--------------------+-----------------+---------------------+




[jira] [Created] (SPARK-16684) Standalone mode local dirs not properly cleaned if job is killed

2016-07-22 Thread Dean Wampler (JIRA)
Dean Wampler created SPARK-16684:


 Summary: Standalone mode local dirs not properly cleaned if job is 
killed
 Key: SPARK-16684
 URL: https://issues.apache.org/jira/browse/SPARK-16684
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.6.2
 Environment: MacOS, but probably the same for Linux
Reporter: Dean Wampler
Priority: Minor


The shuffle service wasn't used.

If you Ctrl-C out of a job, e.g., the spark-shell, cleanup does in fact occur 
correctly, but if you send a kill -9 to the process, cleanup isn't done. (We 
used these methods to simulate certain crash scenarios.)

Possible solution: have the master and worker daemons delete temporary files 
older than a user-configurable age.

Workaround: set up a cron job that does this cleanup.
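
As a rough illustration of the proposed TTL-based cleanup (a sketch only; the directory path and TTL below are assumptions, and the real fix would live in the master/worker daemons):

{code}
import java.io.File

// Recursively delete entries under `dir` whose last-modified time is older than
// `ttlMillis`. File.delete() on a non-empty directory is a no-op, so directories
// are only removed once their stale contents are gone.
def deleteOlderThan(dir: File, ttlMillis: Long): Unit = {
  val cutoff = System.currentTimeMillis() - ttlMillis
  Option(dir.listFiles()).getOrElse(Array.empty[File]).foreach { f =>
    if (f.isDirectory) deleteOlderThan(f, ttlMillis)
    if (f.lastModified() < cutoff) f.delete()
  }
}

// Example: clean anything older than 24 hours under an assumed local dir.
deleteOlderThan(new File("/tmp/spark-local"), 24L * 60 * 60 * 1000)
{code}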






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2015-12-15 Thread Dean Wampler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058856#comment-15058856
 ] 

Dean Wampler commented on SPARK-12177:
--

Since the new Kafka 0.9 consumer API supports SSL, would implementing #12177 
enable the use of SSL, or would additional work be required in Spark's 
integration?
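
For context, these are the kinds of SSL settings the Kafka 0.9 consumer itself accepts (the key names are standard consumer configs; the hosts, paths, and passwords are placeholders). Whether Spark's integration would simply pass them through is exactly the open question above.

{code}
// Hypothetical consumer parameters for an SSL-enabled broker.
val kafkaParams = Map[String, String](
  "bootstrap.servers"       -> "broker1:9093",
  "security.protocol"       -> "SSL",
  "ssl.truststore.location" -> "/path/to/client.truststore.jks",
  "ssl.truststore.password" -> "changeit",
  "ssl.keystore.location"   -> "/path/to/client.keystore.jks",
  "ssl.keystore.password"   -> "changeit"
)
{code}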

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released, and it introduces a new consumer API that 
> is not compatible with the old one. So I added the new consumer API in separate 
> classes in the package org.apache.spark.streaming.kafka.v09, with the changed 
> API. I didn't remove the old classes, for backward compatibility: users will 
> not need to change their old Spark applications when they upgrade to the new 
> Spark version.
> Please review my changes.






[jira] [Created] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream

2015-12-02 Thread Dean Wampler (JIRA)
Dean Wampler created SPARK-12105:


 Summary: Add a DataFrame.show() with argument for output 
PrintStream
 Key: SPARK-12105
 URL: https://issues.apache.org/jira/browse/SPARK-12105
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.2
Reporter: Dean Wampler
Priority: Minor


It would be nice to send the output of DataFrame.show(...) to a different 
output stream than stdout, including just capturing the string itself. This is 
useful, e.g., for testing. Actually, it would be sufficient and perhaps better 
to just make DataFrame.showString a public method.
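
In the meantime, a workaround sketch for capturing show() output as a string; this assumes show() prints via Scala's Console (the usual println routing in the shell) and that df is an existing DataFrame:

{code}
// Capture what df.show() would print instead of letting it go to stdout.
val buffer = new java.io.ByteArrayOutputStream()
Console.withOut(buffer) {
  df.show()
}
val shown: String = buffer.toString("UTF-8")
// `shown` can now be asserted on in tests or written to any PrintStream.
{code}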






[jira] [Created] (SPARK-12052) DataFrame with self-join fails unless toDF() column aliases provided

2015-11-30 Thread Dean Wampler (JIRA)
Dean Wampler created SPARK-12052:


 Summary: DataFrame with self-join fails unless toDF() column 
aliases provided
 Key: SPARK-12052
 URL: https://issues.apache.org/jira/browse/SPARK-12052
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2, 1.5.1, 1.6.0
 Environment: spark-shell for Spark 1.5.1, 1.5.2, and 1.6.0-Preview2
Reporter: Dean Wampler


Joining with the same DF twice appears to match on the wrong column unless the 
columns in the results of the first join are aliased with "toDF". Here is an 
example program:

{code}
val rdd = sc.parallelize(2 to 100, 1).cache
val numbers = rdd.map(i => (i, i*i)).toDF("n", "nsq")
val names   = rdd.map(i => (i, i.toString)).toDF("id", "name")
numbers.show
names.show

val good = numbers.
  join(names, numbers("n") === names("id")).toDF("n", "nsq", "id1", "name1").
  join(names, $"nsq" === names("id")).toDF("n", "nsq", "id1", "name1", "id2", 
"name2")

// The last toDF can be omitted and you'll still get valid results.

good.printSchema
// root
//  |-- n: integer (nullable = false)
//  |-- nsq: integer (nullable = false)
//  |-- id1: integer (nullable = false)
//  |-- name1: string (nullable = true)
//  |-- id2: integer (nullable = false)
//  |-- name2: string (nullable = true)

good.count
// res3: Long = 9

good.show
// +---+---+---+-+---+-+
// |  n|nsq|id1|name1|id2|name2|
// +---+---+---+-+---+-+
// |  2|  4|  2|2|  4|4|
// |  4| 16|  4|4| 16|   16|
// |  6| 36|  6|6| 36|   36|
// |  8| 64|  8|8| 64|   64|
// | 10|100| 10|   10|100|  100|
// |  3|  9|  3|3|  9|9|
// |  5| 25|  5|5| 25|   25|
// |  7| 49|  7|7| 49|   49|
// |  9| 81|  9|9| 81|   81|
// +---+---+---+-+---+-+

val bad = numbers.
  join(names, numbers("n") === names("id")).
  join(names, $"nsq" === names("id"))

bad.printSchema
// root
//  |-- n: integer (nullable = false)
//  |-- nsq: integer (nullable = false)
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)

bad.count
// res6: Long = 0

bad.show
// +---+---+---++---++
// |  n|nsq| id|name| id|name|
// +---+---+---++---++
// +---+---+---++---++

// Curiously, if you change the original rdd line to this:
//   val rdd = sc.parallelize(1 to 100, 1).cache
// then the first record in numbers is (1,1), and bad will have the following
// content:
// +---+---+---++---++
// |  n|nsq| id|name| id|name|
// +---+---+---++---++
// |  1|  1|  1|   1|  1|   1|
// |  1|  1|  1|   1|  2|   2|
// |  1|  1|  1|   1|  3|   3|
// |  1|  1|  1|   1|  4|   4|
// |  1|  1|  1|   1|  5|   5|
// |  1|  1|  1|   1|  6|   6|
// |  1|  1|  1|   1|  7|   7|
// |  1|  1|  1|   1|  8|   8|
// |  1|  1|  1|   1|  9|   9|
// |  1|  1|  1|   1| 10|  10|
// |  1|  1|  1|   1| 11|  11|
// |  1|  1|  1|   1| 12|  12|
// |  1|  1|  1|   1| 13|  13|
// |  1|  1|  1|   1| 14|  14|
// |  1|  1|  1|   1| 15|  15|
// |  1|  1|  1|   1| 16|  16|
// |  1|  1|  1|   1| 17|  17|
// |  1|  1|  1|   1| 18|  18|
// |  1|  1|  1|   1| 19|  19|
// |  1|  1|  1|   1| 20|  20|
// ...
// |  1|  1|  1|   1| 96|  96|
// |  1|  1|  1|   1| 97|  97|
// |  1|  1|  1|   1| 98|  98|
// |  1|  1|  1|   1| 99|  99|
// |  1|  1|  1|   1|100| 100|
// +---+---+---++---++
//
// This makes no sense to me.

// Breaking it up so we can reference 'bad2("nsq")' doesn't help:
val bad2 = numbers.
  join(names, numbers("n") === names("id"))
val bad3 = bad2.
  join(names, bad2("nsq") === names("id"))

bad3.printSchema
bad3.count
bad3.show
{code}

Note the embedded comment that if you start with 1 to 100, you get a record in 
{{numbers}} with two {{1}} values. This yields the strange results shown in the 
comment, suggesting that the join was actually done on the wrong column of the 
first result set. However, the output actually makes no sense; based on the 
results you get from the first join alone, it's "impossible" to get this output!
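
For what it's worth, a workaround sketch that avoids renaming everything through toDF: alias each use of names so the two join conditions can reference their columns unambiguously (the aliases n1/n2 are mine, not from the report):

{code}
// Each use of `names` gets its own alias, so the two join conditions refer to
// distinct column instances even though they come from the same DataFrame.
val n1 = names.as("n1")
val n2 = names.as("n2")

val joined = numbers.
  join(n1, numbers("n")   === $"n1.id").
  join(n2, numbers("nsq") === $"n2.id")

joined.count  // expected: 9, matching the `good` result above
{code}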

Note: Could be related to the following issues:

* https://issues.apache.org/jira/browse/SPARK-10838 (I observed this behavior 
while experimenting to examine this bug).
* https://issues.apache.org/jira/browse/SPARK-11072
* https://issues.apache.org/jira/browse/SPARK-10925






[jira] [Commented] (SPARK-11700) Memory leak at SparkContext jobProgressListener stageIdToData map

2015-11-23 Thread Dean Wampler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022455#comment-15022455
 ] 

Dean Wampler commented on SPARK-11700:
--

Actually, memory leaks like this aren't acceptable, IMHO. First, the object 
graph held by the SQLContext, including the SparkContext, can accumulate lots of 
data, such as metrics data, in large, long-running jobs (in this case, setting 
spark.ui.retainedJobs and spark.ui.retainedStages to some reasonable N can 
help). It gets worse for streaming jobs. Second, allowing some memory leaks 
inevitably makes it harder to track down more painful leaks when they occur. 
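
For reference, a minimal sketch of capping the UI listener's retained data; the values are arbitrary examples, chosen lower than the defaults:

{code}
import org.apache.spark.SparkConf

// Limit how many finished jobs/stages the jobProgressListener keeps around,
// which bounds the driver-side memory it can accumulate in long-running apps.
val conf = new SparkConf()
  .setAppName("bounded-ui-retention")   // assumed app name
  .set("spark.ui.retainedJobs", "200")
  .set("spark.ui.retainedStages", "200")
{code}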

> Memory leak at SparkContext jobProgressListener stageIdToData map
> -
>
> Key: SPARK-11700
> URL: https://issues.apache.org/jira/browse/SPARK-11700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: Ubuntu 14.04 LTS, Oracle JDK 1.8.51 Apache tomcat 
> 8.0.28. Spring 4
>Reporter: Kostas papageorgopoulos
>Assignee: Shixiong Zhu
>Priority: Critical
>  Labels: leak, memory-leak
> Attachments: AbstractSparkJobRunner.java, 
> SparkContextPossibleMemoryLeakIDEA_DEBUG.png, SparkHeapSpaceProgress.png, 
> SparkMemoryAfterLotsOfConsecutiveRuns.png, 
> SparkMemoryLeakAfterLotsOfRunsWithinTheSameContext.png
>
>
> It seems that there is a SparkContext jobProgressListener memory leak. Below I 
> describe the steps I take to reproduce it. 
> I have created a Java webapp that runs some Spark SQL jobs which read data 
> from HDFS (joining them) and write them to Elasticsearch using the ES-Hadoop 
> connector. After a lot of consecutive runs I noticed that my heap space was 
> full, so I got an out-of-heap-space error.
> In the attached file AbstractSparkJobRunner, the method {{public final void 
> run(T jobConfiguration, ExecutionLog executionLog) throws Exception}} runs 
> each time a Spark SQL job is triggered. So I tried to reuse the same 
> SparkContext for a number of consecutive runs. If some rules apply, I try to 
> clean up the SparkContext by first calling {{killSparkAndSqlContext}}, which 
> eventually runs:
> {code}
> synchronized (sparkContextThreadLock) {
>     if (javaSparkContext != null) {
>         LOGGER.info("!!! CLEARING SPARK CONTEXT!!!");
>         javaSparkContext.stop();
>         javaSparkContext = null;
>         sqlContext = null;
>         System.gc();
>     }
>     numberOfRunningJobsForSparkContext.getAndSet(0);
> }
> {code}
> So at some point, if no other Spark SQL job needs to run, I kill the 
> SparkContext (AbstractSparkJobRunner.killSparkAndSqlContext runs) and expect it 
> to be garbage collected. However, this is not the case. Even though my debugger 
> shows that my JavaSparkContext object is null (see the attached picture 
> SparkContextPossibleMemoryLeakIDEA_DEBUG.png), JVisualVM shows heap usage 
> growing even when the garbage collector is called (see the attached picture 
> SparkHeapSpaceProgress.png). The Memory Analyzer Tool shows that a big part of 
> the retained heap is assigned to _jobProgressListener (see the attached picture 
> SparkMemoryAfterLotsOfConsecutiveRuns.png and the summary picture 
> SparkMemoryLeakAfterLotsOfRunsWithinTheSameContext.png), although at the same 
> time the JavaSparkContext in the singleton service is null.






[jira] [Created] (SPARK-9409) make-distribution.sh should copy all files in conf, so that it's easy to create a distro with custom configuration and property settings

2015-07-28 Thread Dean Wampler (JIRA)
Dean Wampler created SPARK-9409:
---

 Summary: make-distribution.sh should copy all files in conf, so 
that it's easy to create a distro with custom configuration and property 
settings
 Key: SPARK-9409
 URL: https://issues.apache.org/jira/browse/SPARK-9409
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.4.1
 Environment: MacOS, Linux
Reporter: Dean Wampler
Priority: Minor


When using make-distribution.sh to build a custom distribution, it would be 
nice to be able to drop custom configuration files in the conf directory and 
have them included in the archive. Currently, only the *.template files are 
included.






[jira] [Created] (SPARK-4564) SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema

2014-11-23 Thread Dean Wampler (JIRA)
Dean Wampler created SPARK-4564:
---

 Summary: SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't 
return the groupingExprs as part of the output schema
 Key: SPARK-4564
 URL: https://issues.apache.org/jira/browse/SPARK-4564
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: Mac OSX, local mode, but should hold true for all 
environments
Reporter: Dean Wampler


In the following example, I would expect the grouped schema to contain two 
fields, the String name and the Long count, but it only contains the Long count.

{code}
// Assumes val sc = new SparkContext(...), e.g., in Spark Shell
import org.apache.spark.sql.{SQLContext, SchemaRDD}
import org.apache.spark.sql.catalyst.expressions._

val sqlc = new SQLContext(sc)
import sqlc._

case class Record(name: String, n: Int)

val records = List(
  Record("three", 1),
  Record("three", 2),
  Record("two",   3),
  Record("three", 4),
  Record("two",   5))
val recs = sc.parallelize(records)
recs.registerTempTable("records")

val grouped = recs.select('name, 'n).groupBy('name)(Count('n) as 'count)
grouped.printSchema
// root
//  |-- count: long (nullable = false)

grouped foreach println
// [2]
// [3]
{code}






[jira] [Commented] (SPARK-4564) SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema

2014-11-23 Thread Dean Wampler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222409#comment-14222409
 ] 

Dean Wampler commented on SPARK-4564:
-

As soon as I reported this, I thought of a way to project out the field: 
{{('name, Count('n) as 'count)}}, making the whole expression:

{code}
val grouped = recs.select('name, 'n).groupBy('name)('name, Count('n) as 'count)
{code}

However, the behavior is still inconsistent with similar methods in RDD and 
PairRDDFunctions, where the grouping expression is part of the output schema.

 SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the 
 groupingExprs as part of the output schema
 --

 Key: SPARK-4564
 URL: https://issues.apache.org/jira/browse/SPARK-4564
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: Mac OSX, local mode, but should hold true for all 
 environments
Reporter: Dean Wampler

 In the following example, I would expect the grouped schema to contain two 
 fields, the String name and the Long count, but it only contains the Long 
 count.
 {code}
 // Assumes val sc = new SparkContext(...), e.g., in Spark Shell
 import org.apache.spark.sql.{SQLContext, SchemaRDD}
 import org.apache.spark.sql.catalyst.expressions._
 val sqlc = new SQLContext(sc)
 import sqlc._
 case class Record(name: String, n: Int)
 val records = List(
   Record("three", 1),
   Record("three", 2),
   Record("two",   3),
   Record("three", 4),
   Record("two",   5))
 val recs = sc.parallelize(records)
 recs.registerTempTable("records")
 val grouped = recs.select('name, 'n).groupBy('name)(Count('n) as 'count)
 grouped.printSchema
 // root
 //  |-- count: long (nullable = false)
 grouped foreach println
 // [2]
 // [3]
 {code}


