Re: Back-pressure to Spark Kafka Streaming?

2016-12-05 Thread Richard Startin
I've seen the feature work very well. For tuning, you've got:

spark.streaming.backpressure.pid.proportional (defaults to 1, non-negative) -
weight for the response to the "error" (the change between the last batch and this batch).
spark.streaming.backpressure.pid.integral (defaults to 0.2, non-negative) -
weight for the response to the accumulation of error. This has a dampening effect.
spark.streaming.backpressure.pid.derived (defaults to 0, non-negative) -
weight for the response to the trend in error. This can cause arbitrary/noise-induced
fluctuations in batch size, but can also help react quickly to increased/reduced capacity.
spark.streaming.backpressure.pid.minRate (defaults to 100, must be positive) -
the computed rate won't drop below this.

spark.streaming.receiver.maxRate - the ingestion rate won't go above this.
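
For reference, here's a minimal sketch of wiring these up when building a
streaming context. The values are illustrative only, not recommendations, and
you'd tune them against your own workload:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("backpressure-example")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.pid.minRate", "100")
  .set("spark.streaming.receiver.maxRate", "10000")
  // for the direct Kafka stream the ceiling is per partition instead:
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val ssc = new StreamingContext(conf, Seconds(60))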


Cheers,

Richard


https://richardstartin.com/



From: Liren Ding 
Sent: 05 December 2016 22:18
To: dev@spark.apache.org; u...@spark.apache.org
Subject: Back-pressure to Spark Kafka Streaming?

Hey all,

Does backpressure actually work on Spark Kafka streaming? According to the 
latest Spark Streaming documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
"In Spark 1.5, we have introduced a feature called backpressure that eliminate 
the need to set this rate limit, as Spark Streaming automatically figures out 
the rate limits and dynamically adjusts them if the processing conditions 
change. This backpressure can be enabled by setting the configuration parameter 
spark.streaming.backpressure.enabled to true."
But I also see a few open spark jira tickets on this option:
https://issues.apache.org/jira/browse/SPARK-7398
https://issues.apache.org/jira/browse/SPARK-18371

The case in the second ticket describes an issue similar to ours. We use 
Kafka to send large batches (10~100M) to Spark Streaming, and the streaming 
interval is set to 1~4 minutes. With backpressure set to true, the queued 
active batches still pile up when the average batch processing time takes 
longer than the batch interval. After the Spark driver is restarted, all 
queued batches turn into one giant batch, which blocks subsequent batches and 
is also very likely to fail eventually. The only config we found that might 
help is "spark.streaming.kafka.maxRatePerPartition". It does limit the incoming 
batch size, but it is not a perfect solution since the cap depends on the number 
of partitions as well as the length of the batch interval. In our case, hundreds 
of partitions times minutes of interval still produces a per-batch count that is 
too large. So we still want to figure out how to make backpressure work in Spark 
Kafka streaming, if it is supposed to work there. Thanks.


Liren









SparkR Function for Step Wise Regression

2016-12-05 Thread Prasann modi
Hello,

I have an issue related to SparkR: I want to build a stepwise regression model
using SparkR. Is there any function in SparkR to build that kind of model?

In R, a function is available for stepwise regression; the code is given below:

step(glm(formula, data, family), direction = "forward")

If SparkR has a function for stepwise regression, please let me know.


Thanks & Regards
Prasann Modi

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



driver in queued state and not started

2016-12-05 Thread Yu Wei
Hi Guys,


I tried to run Spark on a Mesos cluster.

However, when I submit jobs via spark-submit, the driver stays in the "Queued" 
state and is never started.


What should I check?



Thanks,

Jared
Software developer
Interested in open source software, big data, Linux


Re: Spark-9487, Need some insight

2016-12-05 Thread Reynold Xin
Honestly it is pretty difficult. Given the difficulty, would it still make
sense to do that change? (the one that sets the same number of
workers/parallelism across different languages in testing)


On Mon, Dec 5, 2016 at 3:33 PM, Saikat Kanjilal  wrote:

> Hello again dev community,
>
> Ping on this; apologies for re-raising this thread, but I never heard back from
> anyone. Based on this link:  https://wiki.jenkins-ci.org/
> display/JENKINS/Installing+Jenkins  I could try to install Jenkins locally,
> but is that really needed?
>
>
> Thanks in advance.
>
>
> --
> *From:* Saikat Kanjilal 
> *Sent:* Tuesday, November 29, 2016 8:14 PM
> *To:* dev@spark.apache.org
> *Subject:* Spark-9487, Need some insight
>
>
> Hello Spark dev community,
>
> I took on the following JIRA item (https://github.com/apache/
> spark/pull/15848) and am looking for some general pointers. It seems that
> I am running into issues where things work successfully doing local
> development on my MacBook Pro but fail on Jenkins for a multitude of
> reasons and errors. Here's an example: if you look at this build
> output report: https://amplab.cs.berkeley.edu/jenkins//job/
> SparkPullRequestBuilder/69297/ you will see the DataFrameStatSuite, now
> locally I am running these individual tests with this command: ./build/mvn
> test -P... -DwildcardSuites=none 
> -Dtest=org.apache.spark.sql.DataFrameStatSuite.
> It seems that I need to emulate a jenkins like environment locally,
> this seems sort of like an untenable hurdle, granted that my changes
> involve changing the total number of workers in the sparkcontext and if so
> should I be testing my changes in an environment that more closely
> resembles jenkins.  I really want to work on/complete this PR but I keep
> getting hamstrung by a dev environment that is not equivalent to our CI
> environment.
>
>
>
> I'm guessing/hoping I'm not the first one to run into this, so any
> insights or pointers to get past this would be very much appreciated. I would love
> to keep contributing, and I hope this is a hurdle that can be overcome with
> some tweaks to my dev environment.
>
>
>
> Thanks in advance.
>


Re: Back-pressure to Spark Kafka Streaming?

2016-12-05 Thread Cody Koeninger
If you want finer-grained max rate settings, SPARK-17510 was merged a
while ago. There's also SPARK-18580, which might help address the
issue of the starting backpressure rate for the first batch.

On Mon, Dec 5, 2016 at 4:18 PM, Liren Ding  wrote:
> Hey all,
>
> Does backpressure actually work on Spark Kafka streaming? According to the
> latest Spark Streaming documentation:
> http://spark.apache.org/docs/latest/streaming-programming-guide.html
> "In Spark 1.5, we have introduced a feature called backpressure that
> eliminate the need to set this rate limit, as Spark Streaming automatically
> figures out the rate limits and dynamically adjusts them if the processing
> conditions change. This backpressure can be enabled by setting the
> configuration parameter spark.streaming.backpressure.enabled to true."
> But I also see a few open spark jira tickets on this option:
> https://issues.apache.org/jira/browse/SPARK-7398
> https://issues.apache.org/jira/browse/SPARK-18371
>
> The case in the second ticket describes an issue similar to ours. We
> use Kafka to send large batches (10~100M) to Spark Streaming, and the
> streaming interval is set to 1~4 minutes. With backpressure set to true,
> the queued active batches still pile up when the average batch processing
> time takes longer than the batch interval. After the Spark driver is
> restarted, all queued batches turn into one giant batch, which blocks
> subsequent batches and is also very likely to fail eventually. The only
> config we found that might help is "spark.streaming.kafka.maxRatePerPartition".
> It does limit the incoming batch size, but it is not a perfect solution since
> the cap depends on the number of partitions as well as the length of the batch
> interval. In our case, hundreds of partitions times minutes of interval still
> produces a per-batch count that is too large. So we still want to figure out
> how to make backpressure work in Spark Kafka streaming, if it is supposed to
> work there. Thanks.
>
>
> Liren
>
>
>
>
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Can I add a new method to RDD class?

2016-12-05 Thread Jakob Odersky
It looks like you're having issues with including your custom spark
version (with the extensions) in your test project. To use your local
spark version:
1) make sure it has a custom version (let's call it 2.1.0-CUSTOM)
2) publish it to your local machine with `sbt publishLocal`
3) include the modified version of spark in your test project with
`libraryDependencies += "org.apache.spark" %% "spark-core" %
"2.1.0-CUSTOM"`

However, as others have said, it can be quite a lot of work to
maintain a custom fork of Spark. If you're planning on contributing
these changes back to Spark, then forking is the way to go (although I
would recommend keeping an ongoing discussion with the maintainers to
make sure your work will be merged back). Otherwise, I would recommend
using "implicit extensions" to enrich your RDDs instead. An easy
workaround to access Spark-private fields is to simply define your
custom RDDs in an "org.apache.spark" package as well ;)
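
To make that concrete, here is a minimal sketch of the enrichment approach.
All names are made up, and the withScope call is only there to illustrate the
package trick (it assumes withScope is still a private[spark] member of RDD in
your build):

// compiled as part of your own library jar, NOT a Spark fork
package org.apache.spark.rdd

object RddEnrichment {
  implicit class RichRDD[T](rdd: RDD[T]) {
    // made-up transformation: pair every element with itself
    def paired(): RDD[(T, T)] = rdd.withScope {
      rdd.map(x => (x, x))
    }
  }
}

// in application code:
//   import org.apache.spark.rdd.RddEnrichment._
//   val pairs = someRdd.paired()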

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Difference between netty and netty-all

2016-12-05 Thread Shixiong(Ryan) Zhu
No. I meant only updating master. It's not worth to update a maintenance
branch unless there are critical issues.

On Mon, Dec 5, 2016 at 5:39 PM, Nicholas Chammas  wrote:

> You mean just for branch-2.0, right?
> ​
>
> On Mon, Dec 5, 2016 at 8:35 PM Shixiong(Ryan) Zhu 
> wrote:
>
>> Hey Nick,
>>
>> It should be safe to upgrade Netty to the latest 4.0.x version. Could you
>> submit a PR, please?
>>
>> On Mon, Dec 5, 2016 at 11:47 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>> That file is in Netty 4.0.29, but I believe the PR I referenced is not.
>> It's only in Netty 4.0.37 and up.
>>
>>
>> On Mon, Dec 5, 2016 at 1:57 PM Ted Yu  wrote:
>>
>> This should be in netty-all :
>>
>> $ jar tvf 
>> /home/x/.m2/repository/io/netty/netty-all/4.0.29.Final/netty-all-4.0.29.Final.jar
>> | grep ThreadLocalRandom
>>967 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/
>> ThreadLocalRandom$1.class
>>   1079 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/
>> ThreadLocalRandom$2.class
>>   5973 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/
>> ThreadLocalRandom.class
>>
>> On Mon, Dec 5, 2016 at 8:53 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>> I’m looking at the list of dependencies here:
>>
>> https://github.com/apache/spark/search?l=Groff=netty;
>> type=Code=%E2%9C%93
>>
>> What’s the difference between netty and netty-all?
>>
>> The reason I ask is because I’m looking at a Netty PR
>>  and trying to figure out if
>> Spark 2.0.2 is using a version of Netty that includes that PR or not.
>>
>> Nick
>> ​
>>
>>
>>
>>


Re: [MLLIB] RankingMetrics.precisionAt

2016-12-05 Thread Sean Owen
I read it again and that looks like it implements mean precision@k as I
would expect. What is the issue?

On Tue, Dec 6, 2016, 07:30 Maciej Szymkiewicz 
wrote:

> Hi,
>
> Could I ask for a fresh pair of eyes on this piece of code:
>
>
> https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L59-L80
>
>   @Since("1.2.0")
>   def precisionAt(k: Int): Double = {
> require(k > 0, "ranking position k should be positive")
> predictionAndLabels.map { case (pred, lab) =>
>   val labSet = lab.toSet
>
>   if (labSet.nonEmpty) {
> val n = math.min(pred.length, k)
> var i = 0
> var cnt = 0
> while (i < n) {
>   if (labSet.contains(pred(i))) {
> cnt += 1
>   }
>   i += 1
> }
> cnt.toDouble / k
>   } else {
> logWarning("Empty ground truth set, check input data")
> 0.0
>   }
> }.mean()
>   }
>
>
> Am I the only one who thinks this doesn't do what it claims? Just for
> reference:
>
>
>   * https://web.archive.org/web/20120415101144/http://sas.uwaterloo.ca/stats_navigation/techreports/04WorkingPapers/2004-09.pdf
>   * https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
>
> --
> Best,
> Maciej
>
>


Re: Difference between netty and netty-all

2016-12-05 Thread Nicholas Chammas
You mean just for branch-2.0, right?
​

On Mon, Dec 5, 2016 at 8:35 PM Shixiong(Ryan) Zhu 
wrote:

> Hey Nick,
>
> It should be safe to upgrade Netty to the latest 4.0.x version. Could you
> submit a PR, please?
>
> On Mon, Dec 5, 2016 at 11:47 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> That file is in Netty 4.0.29, but I believe the PR I referenced is not.
> It's only in Netty 4.0.37 and up.
>
>
> On Mon, Dec 5, 2016 at 1:57 PM Ted Yu  wrote:
>
> This should be in netty-all :
>
> $ jar tvf
> /home/x/.m2/repository/io/netty/netty-all/4.0.29.Final/netty-all-4.0.29.Final.jar
> | grep ThreadLocalRandom
>967 Tue Jun 23 11:10:30 UTC 2015
> io/netty/util/internal/ThreadLocalRandom$1.class
>   1079 Tue Jun 23 11:10:30 UTC 2015
> io/netty/util/internal/ThreadLocalRandom$2.class
>   5973 Tue Jun 23 11:10:30 UTC 2015
> io/netty/util/internal/ThreadLocalRandom.class
>
> On Mon, Dec 5, 2016 at 8:53 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> I’m looking at the list of dependencies here:
>
>
> https://github.com/apache/spark/search?l=Groff=netty=Code=%E2%9C%93
>
> What’s the difference between netty and netty-all?
>
> The reason I ask is because I’m looking at a Netty PR
>  and trying to figure out if
> Spark 2.0.2 is using a version of Netty that includes that PR or not.
>
> Nick
> ​
>
>
>
>


Re: Difference between netty and netty-all

2016-12-05 Thread Shixiong(Ryan) Zhu
Hey Nick,

It should be safe to upgrade Netty to the latest 4.0.x version. Could you
submit a PR, please?

On Mon, Dec 5, 2016 at 11:47 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> That file is in Netty 4.0.29, but I believe the PR I referenced is not.
> It's only in Netty 4.0.37 and up.
>
>
> On Mon, Dec 5, 2016 at 1:57 PM Ted Yu  wrote:
>
>> This should be in netty-all :
>>
>> $ jar tvf 
>> /home/x/.m2/repository/io/netty/netty-all/4.0.29.Final/netty-all-4.0.29.Final.jar
>> | grep ThreadLocalRandom
>>967 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/
>> ThreadLocalRandom$1.class
>>   1079 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/
>> ThreadLocalRandom$2.class
>>   5973 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/
>> ThreadLocalRandom.class
>>
>> On Mon, Dec 5, 2016 at 8:53 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>> I’m looking at the list of dependencies here:
>>
>> https://github.com/apache/spark/search?l=Groff=netty;
>> type=Code=%E2%9C%93
>>
>> What’s the difference between netty and netty-all?
>>
>> The reason I ask is because I’m looking at a Netty PR
>>  and trying to figure out if
>> Spark 2.0.2 is using a version of Netty that includes that PR or not.
>>
>> Nick
>> ​
>>
>>
>>


Re: Can I add a new method to RDD class?

2016-12-05 Thread Teng Long
Tarun,

I want to access some private methods, e.g. withScope, so I added a similar 
implicit class compiled with Spark, but I can’t import it into my 
application.

For example, I added org/apache/spark/rdd/RDDExtensions.scala, where 
I defined an implicit class inside an RDDExtensions object, and I successfully 
compiled Spark with it.

Then in my application code, when I try to import that implicit class using

import org.apache.spark.rdd.RDDExtensions._

my application doesn’t compile, and the error says "object RDDExtensions is not a 
member of package org.apache.spark.rdd". It seems like my import statement is 
wrong, but I don’t see how.

Thanks!

> On Dec 5, 2016, at 5:14 PM, Teng Long  wrote:
> 
> I’m trying to implement a transformation that can merge partitions (to align 
> with GPU specs) and move them onto GPU memory, for example rdd.toGPU() and 
> later transformations like map can automatically be performed on GPU. And 
> another transformation rdd.offGPU() to move partitions off GPU memory and 
> repartition them to the way they were on CPU before.
> 
> Thank you, Tarun, for creating that gist. I’ll look at it and see if it meets 
> my needs.
> 
>> On Dec 5, 2016, at 5:07 PM, Tarun Kumar  wrote:
>> 
>> Teng,
>> 
>> Can you please share the details of transformation that you want to 
>> implement in your method foo?
>> 
>> I have created a gist of one dummy transformation for your method foo , this 
>> foo method transforms from an RDD[T] to RDD[(T,T)]. Many such more 
>> transformations can easily be achieved.
>> 
>> https://gist.github.com/fidato13/3b46fe1c96b37ae0dd80c275fbe90e92
>> 
>> Thanks
>> Tarun Kumar
>> 
>> On 5 December 2016 at 22:33, Thakrar, Jayesh  
>> wrote:
>> Teng,
>> 
>>  
>> 
>> Before you go down creating your own custom Spark system, do give some 
>> thought to what Holden and others are suggesting, viz. using implicit 
>> methods.
>> 
>>  
>> 
>> If you want real concrete examples, have a look at the Spark Cassandra 
>> Connector -
>> 
>>  
>> 
>> Here you will see an example of "extending" SparkContext - 
>> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
>> 
>>  
>> 
>> // validation is deferred, so it is not triggered during rdd creation
>> 
>> val rdd = sc.cassandraTable[SomeType]("ks", "not_existing_table")
>> 
>> val emptyRDD = rdd.toEmptyCassandraRDD
>> 
>>  
>> 
>> val emptyRDD2 = sc.emptyCassandraTable[SomeType]("ks", "not_existing_table"))
>> 
>>  
>> 
>>  
>> 
>> And here you will see an example of "extending" RDD - 
>> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md
>> 
>>  
>> 
>> case class WordCount(word: String, count: Long)
>> 
>> val collection = sc.parallelize(Seq(WordCount("dog", 50), WordCount("cow", 
>> 60)))
>> 
>> collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
>> 
>>  
>> 
>> Hope that helps…
>> 
>> Jayesh
>> 
>>  
>> 
>>  
>> 
>> From: Teng Long 
>> Date: Monday, December 5, 2016 at 3:04 PM
>> To: Holden Karau , 
>> Subject: Re: Can I add a new method to RDD class?
>> 
>>  
>> 
>> Thank you for providing another answer, Holden.
>> 
>>  
>> 
>> So I did what Tarun and Michal suggested, and it didn’t work out as I want 
>> to have a new transformation method in RDD class, and need to use that RDD’s 
>> spark context which is private. So I guess the only thing I can do now is to 
>> sbt publishLocal?
>> 
>>  
>> 
>> On Dec 5, 2016, at 9:19 AM, Holden Karau  wrote:
>> 
>>  
>> 
>> Doing that requires publishing a custom version of Spark, you can edit the 
>> version number do do a publishLocal - but maintaining that change is going 
>> to be difficult. The other approaches suggested are probably better, but 
>> also does your method need to be defined on the RDD class? Could you instead 
>> make a helper object or class to expose whatever functionality you need?
>> 
>>  
>> 
>> On Mon, Dec 5, 2016 at 6:06 PM long  wrote:
>> 
>> Thank you very much! But why can’t I just add new methods in to the source 
>> code of RDD?
>> 
>>  
>> 
>> On Dec 5, 2016, at 3:15 AM, Michal Šenkýř [via Apache Spark Developers List] 
>> <[hidden email]> wrote:
>> 
>>  
>> 
>> A simple Scala example of implicit classes:
>> 
>> implicit class EnhancedString(str: String) {
>>   def prefix(prefix: String) = prefix + str
>> }
>>  
>> println("World".prefix("Hello "))
>> As Tarun said, you have to import it if it's not in the same class where you 
>> use it.
>> 
>> Hope this makes it clearer,
>> 
>> Michal Senkyr
>> 
>>  
>> 
>> On 5.12.2016 07:43, Tarun Kumar wrote:
>> 
>> Not sure if that's documented in terms of Spark but this is a fairly common 
>> pattern in scala known as "pimp my library" pattern, you can easily find 
>> many generic example of using this pattern. If you want I can 

Re: Spark-9487, Need some insight

2016-12-05 Thread Saikat Kanjilal
Hello again dev community,

Ping on this; apologies for re-raising this thread, but I never heard back from 
anyone. Based on this link: 
https://wiki.jenkins-ci.org/display/JENKINS/Installing+Jenkins  I could try to 
install Jenkins locally, but is that really needed?


Thanks in advance.



From: Saikat Kanjilal 
Sent: Tuesday, November 29, 2016 8:14 PM
To: dev@spark.apache.org
Subject: Spark-9487, Need some insight


Hello Spark dev community,

I took on the following JIRA item 
(https://github.com/apache/spark/pull/15848) and am looking for some general 
pointers. It seems that I am running into issues where things work successfully 
doing local development on my MacBook Pro but fail on Jenkins for a multitude 
of reasons and errors. Here's an example: if you look at this build output report: 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69297/ you 
will see the DataFrameStatSuite failures. Locally I am running these individual 
tests with this command: ./build/mvn test -P... -DwildcardSuites=none 
-Dtest=org.apache.spark.sql.DataFrameStatSuite. It seems that I need to 
emulate a Jenkins-like environment locally, which seems like an untenable 
hurdle, granted that my changes involve changing the total number of 
workers in the SparkContext. If so, should I be testing my changes in an 
environment that more closely resembles Jenkins? I really want to work 
on/complete this PR, but I keep getting hamstrung by a dev environment that is 
not equivalent to our CI environment.



I'm guessing/hoping I'm not the first one to run into this, so any insights or 
pointers to get past this would be very much appreciated. I would love to keep 
contributing, and I hope this is a hurdle that can be overcome with some tweaks 
to my dev environment.



Thanks in advance.


[MLLIB] RankingMetrics.precisionAt

2016-12-05 Thread Maciej Szymkiewicz
Hi,

Could I ask for a fresh pair of eyes on this piece of code:

https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L59-L80

  @Since("1.2.0")
  def precisionAt(k: Int): Double = {
require(k > 0, "ranking position k should be positive")
predictionAndLabels.map { case (pred, lab) =>
  val labSet = lab.toSet

  if (labSet.nonEmpty) {
val n = math.min(pred.length, k)
var i = 0
var cnt = 0
while (i < n) {
  if (labSet.contains(pred(i))) {
cnt += 1
  }
  i += 1
}
cnt.toDouble / k
  } else {
logWarning("Empty ground truth set, check input data")
0.0
  }
}.mean()
  }


Am I the only one who thinks this doesn't do what it claims? Just for
reference:

  * https://web.archive.org/web/20120415101144/http://sas.uwaterloo.ca/stats_navigation/techreports/04WorkingPapers/2004-09.pdf
  * https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
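
For concreteness, a tiny example of the case I have in mind, assuming an
existing SparkContext named sc (with only two predicted items and k = 5, the
code above yields 2/5 rather than 2/2):

import org.apache.spark.mllib.evaluation.RankingMetrics

// two relevant items, both predicted, but only two predictions in total
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1, 2), Array(1, 2))
))
val metrics = new RankingMetrics(predictionAndLabels)
// prints 0.4 (= 2/5) with the implementation above, not 1.0
println(metrics.precisionAt(5))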

-- 
Best,
Maciej



Re: Future of the Python 2 support.

2016-12-05 Thread Maciej Szymkiewicz
Fair enough. I have to admit I am a bit disappointed, but that's life :)


On 12/04/2016 07:28 PM, Reynold Xin wrote:
> Echoing Nick. I don't see any strong reason to drop Python 2 support.
> We typically drop support for X when it is rarely used and support for
> X is long past EOL. Python 2 is still very popular, and depending on
> the statistics it might be more popular than Python 3.
>
> On Sun, Dec 4, 2016 at 9:29 AM Nicholas Chammas
> > wrote:
>
> I don't think it makes sense to deprecate or drop support for
> Python 2.7 until at least 2020, when 2.7 itself will be EOLed. (As
> of Spark 2.0, Python 2.6 support is deprecated and will be removed
> by Spark 2.2. Python 2.7 is the only version of Python 2 that's still
> fully supported.)
>
> Given the widespread industry use of Python 2.7, and the fact that
> it is supported upstream by the Python core developers until 2020,
> I don't see why Spark should even consider dropping support for it
> before then. There is, of course, additional ongoing work to
> support Python 2.7, but it seems more than justified by its level
> of use and popularity in the broader community. And I say that as
> someone who almost exclusively develops in Python 3.5+ these days.
>
> Perhaps by 2018 the industry usage of Python 2 will drop
> precipitously and merit a discussion about dropping support, but I
> think at this point it's premature to discuss that and we should
> just wait and see.
>
> Nick
>
>
> On Sun, Dec 4, 2016 at 10:59 AM Maciej Szymkiewicz
> > wrote:
>
> Hi,
>
> I am aware there was a previous discussion about dropping
> support for different platforms
> 
> (http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html)
> but somehow it has been dominated by Scala and JVM and never
> touched the subject of Python 2.
>
> Some facts:
>
>   * Python 2 End Of Life is scheduled for 2020
> (http://legacy.python.org/dev/peps/pep-0373/), with
> "no guarantee that bugfix releases will be made on a
> regular basis" until then.
>   * Almost all commonly used libraries already support Python
> 3 (https://python3wos.appspot.com/). A single exception
> that can be important for Spark is thrift (Python 3
> support is already present on the master) and transitively
> PyHive and Blaze.
>   * Supporting both Python 2 and Python 3 introduces
> significant technical debt. In practice Python 3 is a
> different language with backward incompatible syntax and
> growing number of features which won't be backported to 2.x.
>
> Suggestions:
>
>   * We need a public discussion about possible date for
> dropping Python 2 support.
>   * Early 2018 should give enough time for a graceful transition.
>
> -- 
> Best,
> Maciej
>

-- 
Maciej Szymkiewicz



Back-pressure to Spark Kafka Streaming?

2016-12-05 Thread Liren Ding
Hey all,

Does backpressure actually work on Spark Kafka streaming? According to the
latest Spark Streaming documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
"In Spark 1.5, we have introduced a feature called backpressure that
eliminate the need to set this rate limit, as Spark Streaming automatically
figures out the rate limits and dynamically adjusts them if the processing
conditions change. This backpressure can be enabled by setting the
configuration parameter spark.streaming.backpressure.enabled to true."
But I also see a few open Spark JIRA tickets on this option:

https://issues.apache.org/jira/browse/SPARK-7398
https://issues.apache.org/jira/browse/SPARK-18371

The case in the second ticket describes an issue similar to ours. We
use Kafka to send large batches (10~100M) to Spark Streaming, and the
streaming interval is set to 1~4 minutes. With backpressure set to true,
the queued active batches still pile up when the average batch processing
time takes longer than the batch interval. After the Spark driver is
restarted, all queued batches turn into one giant batch, which blocks
subsequent batches and is also very likely to fail eventually. The only
config we found that might help is "spark.streaming.kafka.maxRatePerPartition".
It does limit the incoming batch size, but it is not a perfect solution since
the cap depends on the number of partitions as well as the length of the batch
interval. In our case, hundreds of partitions times minutes of interval still
produces a per-batch count that is too large. So we still want to figure out
how to make backpressure work in Spark Kafka streaming, if it is supposed to
work there. Thanks.
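
For a sense of scale, a rough sketch with hypothetical numbers (not our exact
configuration) of how that cap multiplies out for the direct stream:

// hypothetical: 200 partitions, 120-second batches, maxRatePerPartition = 1000 records/sec
val recordsPerBatchCap = 1000L * 200 * 120  // = 24,000,000 records per batch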


Liren


Re: Can I add a new method to RDD class?

2016-12-05 Thread Teng Long
I’m trying to implement a transformation that can merge partitions (to align 
with GPU specs) and move them into GPU memory, for example rdd.toGPU(), so that 
later transformations like map can automatically be performed on the GPU, and 
another transformation rdd.offGPU() to move partitions off GPU memory and 
repartition them the way they were on the CPU before.

Thank you, Tarun, for creating that gist. I’ll look at it and see if it meets 
my needs.
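
For what it’s worth, a purely illustrative sketch of the API shape I have in
mind, written as an implicit enrichment: only the partition merging
(coalesce/repartition) is real, the actual device transfer is left as a stub,
and all names are made up.

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

object GpuSketch {
  implicit class GpuRDDOps[T: ClassTag](rdd: RDD[T]) {
    def toGPU(targetPartitions: Int): RDD[T] =
      rdd.coalesce(targetPartitions, shuffle = false)
        .mapPartitions { iter =>
          // stub: copy this partition's elements into device memory here
          iter
        }

    def offGPU(originalPartitions: Int): RDD[T] =
      // stub: copy elements back from device memory before repartitioning
      rdd.repartition(originalPartitions)
  }
}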

> On Dec 5, 2016, at 5:07 PM, Tarun Kumar  wrote:
> 
> Teng,
> 
> Can you please share the details of transformation that you want to implement 
> in your method foo?
> 
> I have created a gist of one dummy transformation for your method foo , this 
> foo method transforms from an RDD[T] to RDD[(T,T)]. Many such more 
> transformations can easily be achieved.
> 
> https://gist.github.com/fidato13/3b46fe1c96b37ae0dd80c275fbe90e92 
> 
> 
> Thanks
> Tarun Kumar
> 
> On 5 December 2016 at 22:33, Thakrar, Jayesh  > wrote:
> Teng,
> 
>  
> 
> Before you go down creating your own custom Spark system, do give some 
> thought to what Holden and others are suggesting, viz. using implicit methods.
> 
>  
> 
> If you want real concrete examples, have a look at the Spark Cassandra 
> Connector -
> 
>  
> 
> Here you will see an example of "extending" SparkContext - 
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
>  
> 
>  
> 
> // validation is deferred, so it is not triggered during rdd creation
> 
> val rdd = sc.cassandraTable[SomeType]("ks", "not_existing_table")
> 
> val emptyRDD = rdd.toEmptyCassandraRDD
> 
>  
> 
> val emptyRDD2 = sc.emptyCassandraTable[SomeType]("ks", "not_existing_table"))
> 
>  
> 
>  
> 
> And here you will see an example of "extending" RDD - 
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md
>  
> 
>  
> 
> case class WordCount(word: String, count: Long)
> 
> val collection = sc.parallelize(Seq(WordCount("dog", 50), WordCount("cow", 
> 60)))
> 
> collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
> 
>  
> 
> Hope that helps…
> 
> Jayesh
> 
>  
> 
>  
> 
> From: Teng Long >
> Date: Monday, December 5, 2016 at 3:04 PM
> To: Holden Karau >, 
> >
> Subject: Re: Can I add a new method to RDD class?
> 
>  
> 
> Thank you for providing another answer, Holden.
> 
>  
> 
> So I did what Tarun and Michal suggested, and it didn’t work out as I want to 
> have a new transformation method in RDD class, and need to use that RDD’s 
> spark context which is private. So I guess the only thing I can do now is to 
> sbt publishLocal?
> 
>  
> 
> On Dec 5, 2016, at 9:19 AM, Holden Karau  > wrote:
> 
>  
> 
> Doing that requires publishing a custom version of Spark, you can edit the 
> version number do do a publishLocal - but maintaining that change is going to 
> be difficult. The other approaches suggested are probably better, but also 
> does your method need to be defined on the RDD class? Could you instead make 
> a helper object or class to expose whatever functionality you need?
> 
>  
> 
> On Mon, Dec 5, 2016 at 6:06 PM long  > wrote:
> 
> Thank you very much! But why can’t I just add new methods in to the source 
> code of RDD?
> 
>  
> 
> On Dec 5, 2016, at 3:15 AM, Michal Šenkýř [via Apache Spark Developers List] 
> <[hidden email] > wrote:
> 
>  
> 
> A simple Scala example of implicit classes:
> 
> implicit class EnhancedString(str: String) {
>   def prefix(prefix: String) = prefix + str
> }
>  
> println("World".prefix("Hello "))
> As Tarun said, you have to import it if it's not in the same class where you 
> use it.
> 
> Hope this makes it clearer,
> 
> Michal Senkyr
> 
>  
> 
> On 5.12.2016 07:43, Tarun Kumar wrote:
> 
> Not sure if that's documented in terms of Spark but this is a fairly common 
> pattern in scala known as "pimp my library" pattern, you can easily find many 
> generic example of using this pattern. If you want I can quickly cook up a 
>> short complete example with rdd (although there is nothing really more to my 
> example in earlier mail) ? Thanks Tarun Kumar
> 
>  
> 
>> On Mon, 5 Dec 2016 at 7:15 AM, long <[hidden email]> wrote:
> 
> So is there documentation of this I can refer to? 
> 
>  
> 
> On Dec 

Re: Can I add a new method to RDD class?

2016-12-05 Thread Tarun Kumar
Teng,

Can you please share the details of transformation that you want to
implement in your method foo?

I have created a gist of one dummy transformation for your method foo ,
this foo method transforms from an RDD[T] to RDD[(T,T)]. Many such more
transformations can easily be achieved.

https://gist.github.com/fidato13/3b46fe1c96b37ae0dd80c275fbe90e92

Thanks
Tarun Kumar

On 5 December 2016 at 22:33, Thakrar, Jayesh 
wrote:

> Teng,
>
>
>
> Before you go down creating your own custom Spark system, do give some
> thought to what Holden and others are suggesting, viz. using implicit
> methods.
>
>
>
> If you want real concrete examples, have a look at the Spark Cassandra
> Connector -
>
>
>
> Here you will see an example of "extending" SparkContext -
> https://github.com/datastax/spark-cassandra-connector/
> blob/master/doc/2_loading.md
>
>
>
> // validation is deferred, so it is not triggered during rdd creation
>
> val rdd = sc.cassandraTable[SomeType]("ks", "not_existing_table")
>
> val emptyRDD = rdd.toEmptyCassandraRDD
>
>
>
> val emptyRDD2 = sc.emptyCassandraTable[SomeType]("ks",
> "not_existing_table"))
>
>
>
>
>
> And here you will see an example of "extending" RDD -
> https://github.com/datastax/spark-cassandra-connector/
> blob/master/doc/5_saving.md
>
>
>
> case class WordCount(word: String, count: Long)
>
> val collection = sc.parallelize(Seq(WordCount("dog", 50),
> WordCount("cow", 60)))
>
> collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
>
>
>
> Hope that helps…
>
> Jayesh
>
>
>
>
>
> *From: *Teng Long 
> *Date: *Monday, December 5, 2016 at 3:04 PM
> *To: *Holden Karau , 
> *Subject: *Re: Can I add a new method to RDD class?
>
>
>
> Thank you for providing another answer, Holden.
>
>
>
> So I did what Tarun and Michal suggested, and it didn’t work out as I want
> to have a new transformation method in RDD class, and need to use that
> RDD’s spark context which is private. So I guess the only thing I can do
> now is to sbt publishLocal?
>
>
>
> On Dec 5, 2016, at 9:19 AM, Holden Karau  wrote:
>
>
>
> Doing that requires publishing a custom version of Spark, you can edit the
> version number do do a publishLocal - but maintaining that change is going
> to be difficult. The other approaches suggested are probably better, but
> also does your method need to be defined on the RDD class? Could you
> instead make a helper object or class to expose whatever functionality you
> need?
>
>
>
> On Mon, Dec 5, 2016 at 6:06 PM long  wrote:
>
> Thank you very much! But why can’t I just add new methods in to the source
> code of RDD?
>
>
>
> On Dec 5, 2016, at 3:15 AM, Michal Šenkýř [via Apache Spark Developers
> List] <[hidden email] >
> wrote:
>
>
>
> A simple Scala example of implicit classes:
>
> implicit class EnhancedString(str: String) {
>
>   def prefix(prefix: String) = prefix + str
>
> }
>
>
>
> println("World".prefix("Hello "))
>
> As Tarun said, you have to import it if it's not in the same class where
> you use it.
>
> Hope this makes it clearer,
>
> Michal Senkyr
>
>
>
> On 5.12.2016 07:43, Tarun Kumar wrote:
>
> Not sure if that's documented in terms of Spark but this is a fairly
> common pattern in scala known as "pimp my library" pattern, you can easily
> find many generic example of using this pattern. If you want I can quickly
> cook up a short complete example with rdd (although there is nothing really
> more to my example in earlier mail) ? Thanks Tarun Kumar
>
>
>
> On Mon, 5 Dec 2016 at 7:15 AM, long <[hidden email]> wrote:
>
> So is there documentation of this I can refer to?
>
>
>
> On Dec 5, 2016, at 1:07 AM, Tarun Kumar [via Apache Spark Developers List]
> <[hidden email] >
> wrote:
>
>
>
> Hi Tenglong, In addition to trsell's reply, you can add any method to an
> rdd without making changes to spark code. This can be achieved by using
> implicit class in your own client code: implicit class extendRDD[T](rdd:
> RDD[T]){ def foo() }. Then you basically need to import this implicit class
> in scope where you want to use the new foo method. Thanks Tarun Kumar
>
>
>
> On Mon, 5 Dec 2016 at 6:59 AM, <[hidden email]> wrote:
>
> How does your application fetch the spark dependency? Perhaps list your
> project dependencies and check it's using your dev build.
>
>
>
> On Mon, 5 Dec 2016, 08:47 tenglong, <[hidden email]> wrote:
>
> Hi,
>
> Apparently, I've already 

Re: Can I add a new method to RDD class?

2016-12-05 Thread Thakrar, Jayesh
Teng,

Before you go down creating your own custom Spark system, do give some thought 
to what Holden and others are suggesting, viz. using implicit methods.

If you want real concrete examples, have a look at the Spark Cassandra 
Connector -

Here you will see an example of "extending" SparkContext - 
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md

// validation is deferred, so it is not triggered during rdd creation
val rdd = sc.cassandraTable[SomeType]("ks", "not_existing_table")
val emptyRDD = rdd.toEmptyCassandraRDD

val emptyRDD2 = sc.emptyCassandraTable[SomeType]("ks", "not_existing_table"))


And here you will see an example of "extending" RDD - 
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md

case class WordCount(word: String, count: Long)
val collection = sc.parallelize(Seq(WordCount("dog", 50), WordCount("cow", 60)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))

Hope that helps…
Jayesh


From: Teng Long 
Date: Monday, December 5, 2016 at 3:04 PM
To: Holden Karau , 
Subject: Re: Can I add a new method to RDD class?

Thank you for providing another answer, Holden.

So I did what Tarun and Michal suggested, and it didn’t work out, as I want to 
have a new transformation method in the RDD class and need to use that RDD’s 
spark context, which is private. So I guess the only thing I can do now is sbt 
publishLocal?

On Dec 5, 2016, at 9:19 AM, Holden Karau 
> wrote:

Doing that requires publishing a custom version of Spark, you can edit the 
version number to do a publishLocal - but maintaining that change is going to 
be difficult. The other approaches suggested are probably better, but also does 
your method need to be defined on the RDD class? Could you instead make a 
helper object or class to expose whatever functionality you need?

On Mon, Dec 5, 2016 at 6:06 PM long 
> wrote:
Thank you very much! But why can’t I just add new methods in to the source code 
of RDD?

On Dec 5, 2016, at 3:15 AM, Michal Šenkýř [via Apache Spark Developers List] 
<[hidden email]> wrote:


A simple Scala example of implicit classes:

implicit class EnhancedString(str: String) {

  def prefix(prefix: String) = prefix + str

}



println("World".prefix("Hello "))

As Tarun said, you have to import it if it's not in the same class where you 
use it.

Hope this makes it clearer,

Michal Senkyr

On 5.12.2016 07:43, Tarun Kumar wrote:
Not sure if that's documented in terms of Spark but this is a fairly common 
pattern in scala known as "pimp my library" pattern, you can easily find many 
generic example of using this pattern. If you want I can quickly cook up a 
short complete example with rdd (although there is nothing really more to my 
example in earlier mail) ? Thanks Tarun Kumar

On Mon, 5 Dec 2016 at 7:15 AM, long <[hidden email]> wrote:
So is there documentation of this I can refer to?

On Dec 5, 2016, at 1:07 AM, Tarun Kumar [via Apache Spark Developers List] 
<[hidden email]> wrote:

Hi Tenglong, In addition to trsell's reply, you can add any method to an rdd 
without making changes to spark code. This can be achieved by using implicit 
class in your own client code: implicit class extendRDD[T](rdd: RDD[T]){ def 
foo() }. Then you basically need to import this implicit class in scope where 
you want to use the new foo method. Thanks Tarun Kumar

On Mon, 5 Dec 2016 at 6:59 AM, [hidden email]> wrote:

How does your application fetch the spark dependency? Perhaps list your project 
dependencies and check it's using your dev build.

On Mon, 5 Dec 2016, 08:47 tenglong, [hidden email]> wrote:
Hi,

Apparently, I've already tried adding a new method to RDD,

for example,

class RDD {
  def foo() // this is the one I added

  def map()

  def collect()
}

I can build Spark successfully, but I can't compile my application code
which calls rdd.foo(), and the error message says

value foo is not a member of org.apache.spark.rdd.RDD[String]

So I am wondering if there is any mechanism prevents me from doing this or
something I'm doing wrong?




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Can-I-add-a-new-method-to-RDD-class-tp20100.html
Sent from the Apache Spark Developers List mailing list archive at 
Nabble.com.

-
To unsubscribe e-mail: [hidden email]

Re: Can I add a new method to RDD class?

2016-12-05 Thread Teng Long
Thank you, Ryan. I didn’t know there was a method for that!

> On Dec 5, 2016, at 4:10 PM, Shixiong(Ryan) Zhu  
> wrote:
> 
> RDD.sparkContext is public: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@sparkContext:org.apache.spark.SparkContext
>  
> 
> 
> On Mon, Dec 5, 2016 at 1:04 PM, Teng Long  > wrote:
> Thank you for providing another answer, Holden.
> 
> So I did what Tarun and Michal suggested, and it didn’t work out as I want to 
> have a new transformation method in RDD class, and need to use that RDD’s 
> spark context which is private. So I guess the only thing I can do now is to 
> sbt publishLocal?
> 
>> On Dec 5, 2016, at 9:19 AM, Holden Karau > > wrote:
>> 
>> Doing that requires publishing a custom version of Spark, you can edit the 
>> version number do do a publishLocal - but maintaining that change is going 
>> to be difficult. The other approaches suggested are probably better, but 
>> also does your method need to be defined on the RDD class? Could you instead 
>> make a helper object or class to expose whatever functionality you need?
>> 
>> On Mon, Dec 5, 2016 at 6:06 PM long > > wrote:
>> Thank you very much! But why can’t I just add new methods in to the source 
>> code of RDD?
>> 
>> 
>>> On Dec 5, 2016, at 3:15 AM, Michal Šenkýř [via Apache Spark Developers 
>>> List] <[hidden email] > 
>>> wrote:
>>> 
>> 
>>> A simple Scala example of implicit classes:
>>> 
>>> implicit class EnhancedString(str: String) {
>>>   def prefix(prefix: String) = prefix + str
>>> }
>>> 
>>> println("World".prefix("Hello "))
>>> As Tarun said, you have to import it if it's not in the same class where 
>>> you use it.
>>> 
>>> Hope this makes it clearer,
>>> 
>>> Michal Senkyr
>>> 
>>> 
>>> On 5.12.2016 07:43, Tarun Kumar wrote:
>> 
 Not sure if that's documented in terms of Spark but this is a fairly 
 common pattern in scala known as "pimp my library" pattern, you can easily 
 find many generic example of using this pattern.
 
 If you want I can quickly cook up a short conplete example with 
 rdd(although there is nothing really more to my example in earlier mail) ?
 
 Thanks 
 Tarun Kumar
 
>> 
On Mon, 5 Dec 2016 at 7:15 AM, long <[hidden email]> wrote:
>> 
 So is there documentation of this I can refer to? 
 
> On Dec 5, 2016, at 1:07 AM, Tarun Kumar [via Apache Spark Developers 
> List] <[hidden email] 
> > wrote:
> 
 
> Hi Tenglong,
> 
> In addition to trsell's reply, you can add any method to an rdd without 
> making changes to spark code.
> 
> This can be achieved by using implicit class in your own client code:
> 
> implicit class extendRDD[T](rdd: RDD[T]){
> 
>  def foo()
> 
> }
> 
> Then you basically need to import this implicit class in scope where you 
> want to use the new foo method.
> 
> Thanks
> Tarun Kumar 
> 
>> 
> On Mon, 5 Dec 2016 at 6:59 AM, <[hidden email]> wrote:
 
> How does your application fetch the spark dependency? Perhaps list your 
> project dependencies and check it's using your dev build.
> 
> 
>> 
> On Mon, 5 Dec 2016, 08:47 tenglong, <[hidden email]> wrote:
 
> Hi,
> 
> Apparently, I've already tried adding a new method to RDD,
> 
> for example,
> 
> class RDD {
>   def foo() // this is the one I added
> 
>   def map()
> 
>   def collect()
> }
> 
> I can build Spark successfully, but I can't compile my application code
> which calls rdd.foo(), and the error message says
> 
> value foo is not a member of org.apache.spark.rdd.RDD[String]
> 
> So I am wondering if there is any mechanism prevents me from doing this or
> something I'm doing wrong?
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Can-I-add-a-new-method-to-RDD-class-tp20100.html
>  
> 

Re: Can I add a new method to RDD class?

2016-12-05 Thread Teng Long
Thank you for providing another answer, Holden.

So I did what Tarun and Michal suggested, and it didn’t work out, as I want to 
have a new transformation method in the RDD class and need to use that RDD’s 
spark context, which is private. So I guess the only thing I can do now is sbt 
publishLocal?

> On Dec 5, 2016, at 9:19 AM, Holden Karau  wrote:
> 
> Doing that requires publishing a custom version of Spark, you can edit the 
> version number do do a publishLocal - but maintaining that change is going to 
> be difficult. The other approaches suggested are probably better, but also 
> does your method need to be defined on the RDD class? Could you instead make 
> a helper object or class to expose whatever functionality you need?
> 
> On Mon, Dec 5, 2016 at 6:06 PM long  > wrote:
> Thank you very much! But why can’t I just add new methods in to the source 
> code of RDD?
> 
> 
>> On Dec 5, 2016, at 3:15 AM, Michal Šenkýř [via Apache Spark Developers List] 
>> <[hidden email] > wrote:
>> 
> 
>> A simple Scala example of implicit classes:
>> 
>> implicit class EnhancedString(str: String) {
>>   def prefix(prefix: String) = prefix + str
>> }
>> 
>> println("World".prefix("Hello "))
>> As Tarun said, you have to import it if it's not in the same class where you 
>> use it.
>> 
>> Hope this makes it clearer,
>> 
>> Michal Senkyr
>> 
>> 
>> On 5.12.2016 07:43, Tarun Kumar wrote:
> 
>>> Not sure if that's documented in terms of Spark but this is a fairly common 
>>> pattern in scala known as "pimp my library" pattern, you can easily find 
>>> many generic example of using this pattern.
>>> 
>>> If you want I can quickly cook up a short conplete example with 
>>> rdd(although there is nothing really more to my example in earlier mail) ?
>>> 
>>> Thanks 
>>> Tarun Kumar
>>> 
> 
>>> On Mon, 5 Dec 2016 at 7:15 AM, long <[hidden email]> wrote:
> 
>>> So is there documentation of this I can refer to? 
>>> 
 On Dec 5, 2016, at 1:07 AM, Tarun Kumar [via Apache Spark Developers List] 
 <[hidden email] > 
 wrote:
 
>>> 
 Hi Tenglong,
 
 In addition to trsell's reply, you can add any method to an rdd without 
 making changes to spark code.
 
 This can be achieved by using implicit class in your own client code:
 
 implicit class extendRDD[T](rdd: RDD[T]){
 
  def foo()
 
 }
 
Then you basically need to import this implicit class in scope where you 
 want to use the new foo method.
 
 Thanks
 Tarun Kumar 
 
> 
On Mon, 5 Dec 2016 at 6:59 AM, <[hidden email]> wrote:
>>> 
 How does your application fetch the spark dependency? Perhaps list your 
 project dependencies and check it's using your dev build.
 
 
> 
On Mon, 5 Dec 2016, 08:47 tenglong, <[hidden email]> wrote:
>>> 
 Hi,
 
 Apparently, I've already tried adding a new method to RDD,
 
 for example,
 
 class RDD {
   def foo() // this is the one I added
 
   def map()
 
   def collect()
 }
 
 I can build Spark successfully, but I can't compile my application code
 which calls rdd.foo(), and the error message says
 
 value foo is not a member of org.apache.spark.rdd.RDD[String]
 
 So I am wondering if there is any mechanism prevents me from doing this or
 something I'm doing wrong?
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Can-I-add-a-new-method-to-RDD-class-tp20100.html
  
 
 Sent from the Apache Spark Developers List mailing list archive at 
 Nabble.com .
 
 -
> 
To unsubscribe e-mail: [hidden email]
 
> 
 
 
 If you reply to this email, your message will be added to the discussion 
 below:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Can-I-add-a-new-method-to-RDD-class-tp20100p20102.html
  
 
 To unsubscribe from Can I add a 

Re: Can I add a new method to RDD class?

2016-12-05 Thread Shixiong(Ryan) Zhu
RDD.sparkContext is public:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@sparkContext:org.apache.spark.SparkContext
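
So an enrichment along these lines can get at the context without touching the
private field (a rough sketch with made-up names):

import org.apache.spark.rdd.RDD

object RddSyntax {
  implicit class RichRDD[T](rdd: RDD[T]) {
    // uses the public accessor rather than the private sc field
    def defaultParallelismHint: Int = rdd.sparkContext.defaultParallelism
  }
}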

On Mon, Dec 5, 2016 at 1:04 PM, Teng Long  wrote:

> Thank you for providing another answer, Holden.
>
> So I did what Tarun and Michal suggested, and it didn’t work out as I want
> to have a new transformation method in RDD class, and need to use that
> RDD’s spark context which is private. So I guess the only thing I can do
> now is to sbt publishLocal?
>
> On Dec 5, 2016, at 9:19 AM, Holden Karau  wrote:
>
> Doing that requires publishing a custom version of Spark, you can edit the
> version number do do a publishLocal - but maintaining that change is going
> to be difficult. The other approaches suggested are probably better, but
> also does your method need to be defined on the RDD class? Could you
> instead make a helper object or class to expose whatever functionality you
> need?
>
> On Mon, Dec 5, 2016 at 6:06 PM long  wrote:
>
>> Thank you very much! But why can’t I just add new methods in to the
>> source code of RDD?
>>
>> On Dec 5, 2016, at 3:15 AM, Michal Šenkýř [via Apache Spark Developers
>> List] <[hidden email]
>> > wrote:
>>
>> A simple Scala example of implicit classes:
>>
>> implicit class EnhancedString(str: String) {
>>   def prefix(prefix: String) = prefix + str
>> }
>>
>> println("World".prefix("Hello "))
>>
>> As Tarun said, you have to import it if it's not in the same class where
>> you use it.
>>
>> Hope this makes it clearer,
>>
>> Michal Senkyr
>>
>> On 5.12.2016 07:43, Tarun Kumar wrote:
>>
>> Not sure if that's documented in terms of Spark but this is a fairly
>> common pattern in scala known as "pimp my library" pattern, you can easily
>> find many generic example of using this pattern. If you want I can quickly
>> cook up a short conplete example with rdd(although there is nothing really
>> more to my example in earlier mail) ? Thanks Tarun Kumar
>>
>> On Mon, 5 Dec 2016 at 7:15 AM, long <[hidden email]> wrote:
>>
>> So is there documentation of this I can refer to?
>>
>> On Dec 5, 2016, at 1:07 AM, Tarun Kumar [via Apache Spark Developers
>> List] <[hidden email]
>> > wrote:
>>
>> Hi Tenglong, In addition to trsell's reply, you can add any method to an
>> rdd without making changes to spark code. This can be achieved by using
>> implicit class in your own client code: implicit class extendRDD[T](rdd:
>> RDD[T]){ def foo() }. Then you basically need to import this implicit class
>> in scope where you want to use the new foo method. Thanks Tarun Kumar
>>
>> On Mon, 5 Dec 2016 at 6:59 AM, <[hidden email]> wrote:
>>
>> How does your application fetch the spark dependency? Perhaps list your
>> project dependencies and check it's using your dev build.
>>
>> On Mon, 5 Dec 2016, 08:47 tenglong, <[hidden email]> wrote:
>>
>> Hi,
>>
>> Apparently, I've already tried adding a new method to RDD,
>>
>> for example,
>>
>> class RDD {
>>   def foo() // this is the one I added
>>
>>   def map()
>>
>>   def collect()
>> }
>>
>> I can build Spark successfully, but I can't compile my application code
>> which calls rdd.foo(), and the error message says
>>
>> value foo is not a member of org.apache.spark.rdd.RDD[String]
>>
>> So I am wondering if there is any mechanism prevents me from doing this or
>> something I'm doing wrong?
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-
>> developers-list.1001551.n3.nabble.com/Can-I-add-a-new-
>> method-to-RDD-class-tp20100.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com .
>>
>> -
>>
>> To unsubscribe e-mail: [hidden email]
>>
>>
>>
>> --
>> If you reply to this email, your message will be added to the discussion
>> below:
>> http://apache-spark-developers-list.1001551.n3.
>> nabble.com/Can-I-add-a-new-method-to-RDD-class-tp20100p20102.html
>> To unsubscribe from Can I add a new method to RDD class?, click here.
>> NAML
>> 

Re: ability to provide custom serializers

2016-12-05 Thread Michael Armbrust
Lets start with a new ticket, link them and we can merge if the solution
ends up working out for both cases.

On Sun, Dec 4, 2016 at 5:39 PM, Erik LaBianca 
wrote:

> Thanks Michael!
>
> On Dec 2, 2016, at 7:29 PM, Michael Armbrust 
> wrote:
>
> I would love to see something like this.  The closest related ticket is
> probably https://issues.apache.org/jira/browse/SPARK-7768 (though maybe
> there are enough people using UDTs in their current form that we should
> just make a new ticket)
>
>
> I’m not very familiar with UDT’s. Is this something I should research or
> just leave it be and create a new ticket? I did notice the presence of a
> registry in the source code but it seemed like it was targeted at a
> different use case.
>
> A few thoughts:
>  - even if you can do implicit search, we probably also want a registry
> for Java users.
>
>
> That’s fine. I’m not 100% sure I can get the right implicit in scope as
> things stand anyway, so let’s table that idea for now and do the registry.
>
>  - what is the output of the serializer going to be? one challenge here is
> that encoders write directly into the tungsten format, which is not a
> stable public API. Maybe this is more obvious if I understood MappedColumnType
> better?
>
>
> My assumption was that the output would be existing scalar data types. So
> string, long, double, etc. What I’d like to do is just “layer” the new ones
> on top of already existing ones, kinda like the case class encoder does.
>
> Either way, I'm happy to give further advice if you come up with a more
> concrete proposal and put it on JIRA.
>
>
> Great, let me know and I’ll create a ticket, or we can re-use SPARK-7768
> and we can move the discussion there.
>
> Thanks!
>
> —erik
>
>


Re: Difference between netty and netty-all

2016-12-05 Thread Nicholas Chammas
That file is in Netty 4.0.29, but I believe the PR I referenced is not.
It's only in Netty 4.0.37 and up.

On Mon, Dec 5, 2016 at 1:57 PM Ted Yu  wrote:

> This should be in netty-all :
>
> $ jar tvf
> /home/x/.m2/repository/io/netty/netty-all/4.0.29.Final/netty-all-4.0.29.Final.jar
> | grep ThreadLocalRandom
>967 Tue Jun 23 11:10:30 UTC 2015
> io/netty/util/internal/ThreadLocalRandom$1.class
>   1079 Tue Jun 23 11:10:30 UTC 2015
> io/netty/util/internal/ThreadLocalRandom$2.class
>   5973 Tue Jun 23 11:10:30 UTC 2015
> io/netty/util/internal/ThreadLocalRandom.class
>
> On Mon, Dec 5, 2016 at 8:53 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> I’m looking at the list of dependencies here:
>
>
> https://github.com/apache/spark/search?l=Groff=netty=Code=%E2%9C%93
>
> What’s the difference between netty and netty-all?
>
> The reason I ask is because I’m looking at a Netty PR
>  and trying to figure out if
> Spark 2.0.2 is using a version of Netty that includes that PR or not.
>
> Nick
> ​
>
>
>


Re: Difference between netty and netty-all

2016-12-05 Thread Sean Owen
netty should be Netty 3.x. It is all but unused but I couldn't manage to
get rid of it: https://issues.apache.org/jira/browse/SPARK-17875

netty-all should be 4.x, actually used.
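
If you want to double-check which 4.x version is actually on the driver classpath at runtime, Netty 4 can report the artifact versions it was built from. A quick sketch for spark-shell (assuming netty-all 4.x is loaded; Netty 3.x does not expose this class):

// Prints the Netty 4.x artifacts and versions actually loaded.
import scala.collection.JavaConverters._
import io.netty.util.Version

Version.identify().asScala.toSeq.sortBy(_._1).foreach { case (artifact, v) =>
  println(s"$artifact -> ${v.artifactVersion()}")
}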

On Tue, Dec 6, 2016 at 12:54 AM Nicholas Chammas 
wrote:

> I’m looking at the list of dependencies here:
>
>
> https://github.com/apache/spark/search?l=Groff&q=netty&type=Code&utf8=%E2%9C%93
>
> What’s the difference between netty and netty-all?
>
> The reason I ask is because I’m looking at a Netty PR
>  and trying to figure out if
> Spark 2.0.2 is using a version of Netty that includes that PR or not.
>
> Nick
> ​
>


Re: Please limit commits for branch-2.1

2016-12-05 Thread Reynold Xin
I would like to reiterate: committers, please be very conservative now
in merging patches into branch-2.1.

Spark is a very sophisticated (compiler, optimizer) project and sometimes
one-line changes can have huge consequences and introduce regressions. If
it is just a tiny optimization, don't merge it into branch-2.1.


On Tue, Nov 22, 2016 at 9:37 AM, Sean Owen  wrote:

> Thanks, this was another message that went to spam for me:
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Apache-
> Spark-branch-2-1-td19688.html
>
> Looks great -- cutting branch = in RC period.
>
> On Tue, Nov 22, 2016 at 5:31 PM Reynold Xin  wrote:
>
>> I did send an email out with those information on Nov 1st. It is not
>> meant to be in new feature development mode anymore.
>>
>> FWIW, I will cut an RC today to remind people of that. The RC will fail,
>> but it can serve as a good reminder.
>>
>> On Tue, Nov 22, 2016 at 1:53 AM Sean Owen  wrote:
>>
>> Maybe I missed it, but did anyone declare a QA period? In the past I've
>> not seen this, and just seen people start talking retrospectively about how
>> "we're in QA now" until it stops. We have https://cwiki.apache.org/
>> confluence/display/SPARK/Wiki+Homepage saying it is already over, but
>> clearly we're not doing RCs.
>>
>> We should make this more formal and predictable. We probably need a
>> clearer definition of what changes in QA. I'm moving the wiki to
>> spark.apache.org now and could try to put up some words around this when
>> I move this page above today.
>>
>> On Mon, Nov 21, 2016 at 11:20 PM Joseph Bradley 
>> wrote:
>>
>> To committers and contributors active in MLlib,
>>
>> Thanks to everyone who has started helping with the QA tasks in
>> SPARK-18316!  I'd like to request that we stop committing non-critical
>> changes to MLlib, including the Python and R APIs, since still-changing
>> public APIs make it hard to QA.  We have already started to sign off
>> on some QA tasks, but we may need to re-open them if changes are committed,
>> especially if those changes are to public APIs.  There's no need to push
>> Python and R wrappers into 2.1 at the last minute.
>>
>> Let's focus on completing QA, after which we can resume committing API
>> changes to master (not branch-2.1).
>>
>> Thanks everyone!
>> Joseph
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] 
>>
>>


Re: Difference between netty and netty-all

2016-12-05 Thread Ted Yu
This should be in netty-all :

$ jar tvf
/home/x/.m2/repository/io/netty/netty-all/4.0.29.Final/netty-all-4.0.29.Final.jar
| grep ThreadLocalRandom
   967 Tue Jun 23 11:10:30 UTC 2015
io/netty/util/internal/ThreadLocalRandom$1.class
  1079 Tue Jun 23 11:10:30 UTC 2015
io/netty/util/internal/ThreadLocalRandom$2.class
  5973 Tue Jun 23 11:10:30 UTC 2015
io/netty/util/internal/ThreadLocalRandom.class

On Mon, Dec 5, 2016 at 8:53 AM, Nicholas Chammas  wrote:

> I’m looking at the list of dependencies here:
>
> https://github.com/apache/spark/search?l=Groff&q=netty&type=Code&utf8=%E2%9C%93
>
> What’s the difference between netty and netty-all?
>
> The reason I ask is because I’m looking at a Netty PR
>  and trying to figure out if
> Spark 2.0.2 is using a version of Netty that includes that PR or not.
>
> Nick
> ​
>


Re: SPARK-18689: A proposal for priority based app scheduling utilizing linux cgroups.

2016-12-05 Thread Michal Šenkýř

Hello Travis,


I have only been on this list for a short time, but I can definitely see the 
benefit of using built-in OS resource management facilities to dynamically 
manage cluster resources at the node level in this manner. At our company we 
often fight for resources on our development cluster, and sometimes cancel 
running jobs in production to free up immediately needed resources. If I 
understand it correctly, this would solve a lot of our problems.



The only downside I see with this is that it is Linux-specific.


Michal



Difference between netty and netty-all

2016-12-05 Thread Nicholas Chammas
I’m looking at the list of dependencies here:

https://github.com/apache/spark/search?l=Groff=netty=Code=%E2%9C%93

What’s the difference between netty and netty-all?

The reason I ask is because I’m looking at a Netty PR
 and trying to figure out if
Spark 2.0.2 is using a version of Netty that includes that PR or not.

Nick
​


Re: SPARK-18689: A proposal for priority based app scheduling utilizing linux cgroups.

2016-12-05 Thread Hegner, Travis
My apologies, in my excitement of finding a rather simple way to accomplish the 
scheduling goal I have in mind, I hastily jumped straight into a technical 
solution, without explaining that goal, or the problem it's attempting to solve.


You are correct that I'm looking for an additional running mode for the 
standalone scheduler. Perhaps you could/should classify it as a different 
scheduler, but I don't want to give the impression that this will be as 
difficult to implement as most schedulers are. Initially, from a memory 
perspective, we would still allocate in a FIFO manner. This new scheduling mode 
(or new scheduler, if you'd rather) would mostly benefit any users with 
small-ish clusters, both on-premise and cloud based. Essentially, my end goal 
is to be able to run multiple *applications* simultaneously with each 
application having *access* to the entire core count of the cluster.


I have a very cpu intensive application that I'd like to run weekly. I have a 
second application that I'd like to run hourly. The hourly application is more 
time critical (higher priority), so I'd like it to finish in a small amount of 
time. If I allow the first app to run with all cores (this takes several days 
on my 64 core cluster), then nothing else can be executed when running with the 
default FIFO scheduler. All of the cores have been allocated to the first 
application, and it will not release them until it is finished. Dynamic 
allocation does not help in this case, as there is always a backlog of tasks to 
run until the first application is nearing the end anyway. Naturally, I could 
just limit the number of cores that the first application has access to, but 
then I have idle cpu time when the second app is not running, and that is not 
optimal. Secondly in that case, the second application only has access to the 
*leftover* cores that the first app has not allocated, and will take a 
considerably longer amount of time to run.


You could also imagine a scenario where a developer has a spark-shell running 
without specifying the number of cores they want to utilize (whether 
intentionally or not). As I'm sure you know, the default is to allocate the 
entire cluster to this application. The cores allocated to this shell are 
unavailable to other applications, even if they are just sitting idle while a 
developer is getting their environment set up to run a very big job 
interactively. Other developers that would like to launch interactive shells 
are stuck waiting for the first one to exit their shell.


My proposal would eliminate this static nature of core counts and allow as many 
simultaneous applications to be running as the cluster memory (still statically 
partitioned, at least initially) will allow. Applications could be configured 
with a "cpu shares" parameter (just an arbitrary integer relative only to other 
applications) which is essentially just passed through to the linux cgroup 
cpu.shares setting. Since each executor of an application on a given worker 
runs in its own process/jvm, that process can easily be placed into a cgroup 
created and dedicated for that application.


Linux cgroups cpu.shares are pretty well documented, but the gist is that 
processes competing for cpu time are allocated a percentage of time equal to 
their share count as a percentage of all shares in that level of the cgroup 
hierarchy. If two applications are both scheduled on the same core with the 
same weight, each will get to utilize 50% of the time on that core. This is all 
built into the kernel, and the only thing the spark worker has to do is create 
a cgroup for each application, set the cpu.shares parameter, and assign the 
executors for that application to the new cgroup. If multiple executors are 
running on a single worker, for a single application, the cpu time available to 
that application is divided among each of those executors equally. The default 
for cpu.shares is that they are not limiting in any way. A process can consume 
all available cpu time if it would otherwise be idle anyway.
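
To make the mechanics concrete, here is a rough sketch of what that worker-side step could look like (the CgroupCpuShares helper and its method names are purely illustrative; it assumes the cgroup v1 cpu controller is mounted at /sys/fs/cgroup/cpu and that the worker has permission to write there):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths, StandardOpenOption}

// Sketch only: create a per-application cgroup, set its cpu.shares, and move
// an executor's JVM into it. Error handling and cleanup are omitted.
object CgroupCpuShares {
  private val sparkRoot: Path = Paths.get("/sys/fs/cgroup/cpu/spark")

  def assignExecutor(appId: String, executorPid: Long, cpuShares: Int): Unit = {
    val appGroup = sparkRoot.resolve(appId)
    Files.createDirectories(appGroup)                             // per-app cgroup
    write(appGroup.resolve("cpu.shares"), cpuShares.toString)     // relative weight
    write(appGroup.resolve("cgroup.procs"), executorPid.toString) // move the executor in
  }

  private def write(file: Path, value: String): Unit =
    Files.write(file, (value + "\n").getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.WRITE)
}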


Another benefit to passing cpu.shares directly to the kernel (as opposed to 
some abstraction) is that cpu share allocations are heterogeneous to all 
processes running on a machine. An admin could have very fine grained control 
over which processes get priority access to cpu time, depending on their needs.


To continue my personal example above, my long running cpu intensive 
application could utilize 100% of all cluster cores if they are idle. Then my 
time sensitive app could be launched with nine times the priority and the linux 
kernel would scale back the first application to 10% of all cores (completely 
seamlessly and automatically: no pre-emption, just fewer time slices of cpu 
allocated by the kernel to the first application), while the second application 
gets 90% of all the cores until it completes.
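
As a toy illustration of that split (share values made up; under contention each application's fraction of a core is just its shares divided by the total shares at that cgroup level):

val shares = Seq("weekly-cpu-intensive" -> 100, "hourly-time-critical" -> 900)
val total = shares.map(_._2).sum.toDouble
shares.foreach { case (app, s) =>
  println(f"$app%-22s ${s / total * 100}%.0f%% of each contended core")
}
// weekly-cpu-intensive   10% of each contended core
// hourly-time-critical   90% of each contended core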


The only downside that I can think of currently is that this scheduling mode 

Re: Can I add a new method to RDD class?

2016-12-05 Thread Holden Karau
Doing that requires publishing a custom version of Spark; you can edit the
version number and do a publishLocal - but maintaining that change is going
to be difficult. The other approaches suggested are probably better, but
also does your method need to be defined on the RDD class? Could you
instead make a helper object or class to expose whatever functionality you
need?
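
For reference, a complete, self-contained sketch of that approach (the RDDExtensions object and the countWhere method are placeholders for whatever functionality you actually need):

import org.apache.spark.rdd.RDD

object RDDExtensions {
  // Adds a method to RDD without touching Spark's source: an implicit wrapper
  // the compiler applies wherever this object's members are imported.
  implicit class RichRDD[T](val rdd: RDD[T]) extends AnyVal {
    def countWhere(p: T => Boolean): Long = rdd.filter(p).count()
  }
}

// Usage, e.g. in spark-shell or application code:
//   import RDDExtensions._
//   sc.parallelize(1 to 10).countWhere(_ % 2 == 0)   // res: Long = 5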

On Mon, Dec 5, 2016 at 6:06 PM long  wrote:

> Thank you very much! But why can’t I just add new methods into the source
> code of RDD?
>
> On Dec 5, 2016, at 3:15 AM, Michal Šenkýř [via Apache Spark Developers
> List] <[hidden email]
> > wrote:
>
> A simple Scala example of implicit classes:
>
> implicit class EnhancedString(str: String) {
>   def prefix(prefix: String) = prefix + str
> }
>
> println("World".prefix("Hello "))
>
> As Tarun said, you have to import it if it's not in the same class where
> you use it.
>
> Hope this makes it clearer,
>
> Michal Senkyr
>
> On 5.12.2016 07:43, Tarun Kumar wrote:
>
> Not sure if that's documented in terms of Spark, but this is a fairly
> common pattern in Scala known as the "pimp my library" pattern; you can easily
> find many generic examples of using this pattern. If you want, I can quickly
> cook up a short complete example with RDD (although there is nothing really
> more to my example in the earlier mail)? Thanks, Tarun Kumar
>
> On Mon, 5 Dec 2016 at 7:15 AM, long <[hidden email]> wrote:
>
> So is there documentation of this I can refer to?
>
> On Dec 5, 2016, at 1:07 AM, Tarun Kumar [via Apache Spark Developers List]
> <[hidden email] >
> wrote:
>
> Hi Tenglong, In addition to trsell's reply, you can add any method to an
> RDD without making changes to Spark code. This can be achieved by using an
> implicit class in your own client code:
>
>   implicit class extendRDD[T](rdd: RDD[T]) {
>     def foo()
>   }
>
> Then you basically need to import this implicit class in the scope where you
> want to use the new foo method. Thanks, Tarun Kumar
>
> On Mon, 5 Dec 2016 at 6:59 AM, <[hidden email]> wrote:
>
> How does your application fetch the spark dependency? Perhaps list your
> project dependencies and check it's using your dev build.
>
> On Mon, 5 Dec 2016, 08:47 tenglong, <[hidden email]> wrote:
>
> Hi,
>
> Apparently, I've already tried adding a new method to RDD,
>
> for example,
>
> class RDD {
>   def foo() // this is the one I added
>
>   def map()
>
>   def collect()
> }
>
> I can build Spark successfully, but I can't compile my application code
> which calls rdd.foo(), and the error message says
>
> value foo is not a member of org.apache.spark.rdd.RDD[String]
>
> So I am wondering if there is any mechanism that prevents me from doing this or
> something I'm doing wrong?
>
>
>
>

java.lang.IllegalStateException: There is no space for new record

2016-12-05 Thread Nicholas Chammas
I was testing out a new project at scale on Spark 2.0.2 running on YARN,
and my job failed with an interesting error message:

TaskSetManager: Lost task 37.3 in stage 31.0 (TID 10684,
server.host.name): java.lang.IllegalStateException: There is no space
for new record
05:27:09.573 at
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:211)
05:27:09.574 at
org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:127)
05:27:09.574 at
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:244)
05:27:09.575 at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
Source)
05:27:09.575 at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
05:27:09.576 at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
05:27:09.576 at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
05:27:09.577 at
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
05:27:09.577 at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
05:27:09.577 at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
05:27:09.578 at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
05:27:09.578 at org.apache.spark.scheduler.Task.run(Task.scala:86)
05:27:09.578 at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
05:27:09.579 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
05:27:09.579 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
05:27:09.579 at java.lang.Thread.run(Thread.java:745)

I’ve never seen this before, and searching on Google/DDG/JIRA doesn’t yield
any results. There are no other errors coming from that executor, whether
related to memory, storage space, or otherwise.

Could this be a bug? If so, how would I narrow down the source? Otherwise,
how might I work around the issue?

Nick
​


Re: Can I add a new method to RDD class?

2016-12-05 Thread long
Thank you very much! But why can’t I just add new methods into the source code 
of RDD?

> On Dec 5, 2016, at 3:15 AM, Michal Šenkýř [via Apache Spark Developers List] 
>  wrote:
> 
> A simple Scala example of implicit classes:
> 
> implicit class EnhancedString(str: String) {
>   def prefix(prefix: String) = prefix + str
> }
> 
> println("World".prefix("Hello "))
> As Tarun said, you have to import it if it's not in the same class where you 
> use it.
> 
> Hope this makes it clearer,
> 
> Michal Senkyr
> 
> 
> On 5.12.2016 07:43, Tarun Kumar wrote:
>> Not sure if that's documented in terms of Spark, but this is a fairly common 
>> pattern in Scala known as the "pimp my library" pattern; you can easily find 
>> many generic examples of using this pattern.
>> 
>> If you want, I can quickly cook up a short complete example with RDD (although 
>> there is nothing really more to my example in the earlier mail)?
>> 
>> Thanks 
>> Tarun Kumar
>> 
>> On Mon, 5 Dec 2016 at 7:15 AM, long <[hidden email] 
>> > wrote:
>> So is there documentation of this I can refer to? 
>> 
>>> On Dec 5, 2016, at 1:07 AM, Tarun Kumar [via Apache Spark Developers List] 
>>> <[hidden email] > wrote:
>>> 
>> 
>>> Hi Tenglong,
>>> 
>>> In addition to trsell's reply, you can add any method to an rdd without 
>>> making changes to spark code.
>>> 
>>> This can be achieved by using implicit class in your own client code:
>>> 
>>> implicit class extendRDD[T](rdd: RDD[T]){
>>> 
>>>  def foo()
>>> 
>>> }
>>> 
>>> Then you basically need to import this implicit class in the scope where you 
>>> want to use the new foo method.
>>> 
>>> Thanks
>>> Tarun Kumar 
>>> 
>> 
>>> On Mon, 5 Dec 2016 at 6:59 AM, <[hidden email]> wrote:
>> 
>>> How does your application fetch the spark dependency? Perhaps list your 
>>> project dependencies and check it's using your dev build.
>>> 
>>> 
>> 
>>> On Mon, 5 Dec 2016, 08:47 tenglong, <[hidden email]> wrote:
>> 
>>> Hi,
>>> 
>>> Apparently, I've already tried adding a new method to RDD,
>>> 
>>> for example,
>>> 
>>> class RDD {
>>>   def foo() // this is the one I added
>>> 
>>>   def map()
>>> 
>>>   def collect()
>>> }
>>> 
>>> I can build Spark successfully, but I can't compile my application code
>>> which calls rdd.foo(), and the error message says
>>> 
>>> value foo is not a member of org.apache.spark.rdd.RDD[String]
>>> 
>>> So I am wondering if there is any mechanism that prevents me from doing this or
>>> something I'm doing wrong?
>>> 
>>> 
>>> 
>>> 

Re: Can I add a new method to RDD class?

2016-12-05 Thread Michal Šenkýř

A simple Scala example of implicit classes:

implicit class EnhancedString(str: String) {
  def prefix(prefix: String) = prefix + str
}

println("World".prefix("Hello "))

As Tarun said, you have to import it if it's not in the same class where 
you use it.


Hope this makes it clearer,

Michal Senkyr


On 5.12.2016 07:43, Tarun Kumar wrote:
Not sure if that's documented in terms of Spark, but this is a fairly 
common pattern in Scala known as the "pimp my library" pattern; you can 
easily find many generic examples of using this pattern. If you want, I 
can quickly cook up a short complete example with RDD (although there 
is nothing really more to my example in the earlier mail)? Thanks, Tarun Kumar


On Mon, 5 Dec 2016 at 7:15 AM, long > wrote:


So is there documentation of this I can refer to?


On Dec 5, 2016, at 1:07 AM, Tarun Kumar [via Apache Spark
Developers List] <[hidden email]
> wrote:

Hi Tenglong, In addition to trsell's reply, you can add any
method to an RDD without making changes to Spark code. This can
be achieved by using an implicit class in your own client code:

implicit class extendRDD[T](rdd: RDD[T]) {
  def foo()
}

Then you basically need to import this implicit class in the scope
where you want to use the new foo method. Thanks, Tarun Kumar

On Mon, 5 Dec 2016 at 6:59 AM, <[hidden
email]> wrote:

How does your application fetch the spark dependency? Perhaps
list your project dependencies and check it's using your dev
build.


On Mon, 5 Dec 2016, 08:47 tenglong, <[hidden
email]> wrote:

Hi,

Apparently, I've already tried adding a new method to RDD,

for example,

class RDD {
  def foo() // this is the one I added

  def map()

  def collect()
}

I can build Spark successfully, but I can't compile my
application code
which calls rdd.foo(), and the error message says

value foo is not a member of org.apache.spark.rdd.RDD[String]

So I am wondering if there is any mechanism that prevents me
from doing this or
something I'm doing wrong?



