Re: Can I add a new method to RDD class?

2016-12-04 Thread Tarun Kumar
Not sure if that's documented in Spark itself, but this is a fairly common
pattern in Scala known as the "pimp my library" pattern; you can easily find
many generic examples of using it.
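
For instance, a minimal generic sketch of the pattern (the names below are just
illustrative, nothing Spark-specific):

// "Pimp my library": an implicit class adds a method to a type we don't own.
object StringExtensions {
  implicit class RichWords(val s: String) extends AnyVal {
    // Counts whitespace-separated words in the wrapped string.
    def wordCount: Int = s.split("\\s+").count(_.nonEmpty)
  }
}

// Importing the implicit brings wordCount into scope for any String:
//   import StringExtensions._
//   "pimp my library in scala".wordCount   // == 5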

If you want, I can quickly cook up a short complete example with an RDD
(although there is nothing really more to it than my example in the earlier mail)?

Thanks
Tarun Kumar

On Mon, 5 Dec 2016 at 7:15 AM, long  wrote:

> So is there documentation of this I can refer to?
>
> On Dec 5, 2016, at 1:07 AM, Tarun Kumar [via Apache Spark Developers List]
> <[hidden email]> wrote:
>
> Hi Tenglong,
>
> In addition to trsell's reply, you can add any method to an rdd without
> making changes to spark code.
>
> This can be achieved by using implicit class in your own client code:
>
> implicit class extendRDD[T](rdd: RDD[T]) {
>
>   def foo(): Unit = {
>     // your new logic here, with access to the wrapped rdd
>   }
>
> }
>
> Then you basically need to import this implicit class into the scope where you
> want to use the new foo method.
>
> Thanks
> Tarun Kumar
>
> On Mon, 5 Dec 2016 at 6:59 AM, <[hidden email]> wrote:
>
> How does your application fetch the spark dependency? Perhaps list your
> project dependencies and check it's using your dev build.
>
> On Mon, 5 Dec 2016, 08:47 tenglong, <[hidden email]> wrote:
>
> Hi,
>
> Apparently, I've already tried adding a new method to RDD,
>
> for example,
>
> class RDD {
>   def foo() // this is the one I added
>
>   def map()
>
>   def collect()
> }
>
> I can build Spark successfully, but I can't compile my application code
> which calls rdd.foo(), and the error message says
>
> value foo is not a member of org.apache.spark.rdd.RDD[String]
>
> So I am wondering if there is any mechanism that prevents me from doing this, or
> whether there is something I'm doing wrong?


Re: Can I add a new method to RDD class?

2016-12-04 Thread long
So is there documentation of this I can refer to? 

> On Dec 5, 2016, at 1:07 AM, Tarun Kumar [via Apache Spark Developers List] 
>  wrote:
> 
> Hi Tenglong,
> 
> In addition to trsell's reply, you can add any method to an rdd without 
> making changes to spark code.
> 
> This can be achieved by using implicit class in your own client code:
> 
> implicit class extendRDD[T](rdd: RDD[T]) {
>
>   def foo(): Unit = {
>     // your new logic here, with access to the wrapped rdd
>   }
>
> }
> 
> Then you basically need to import this implicit class into the scope where you want
> to use the new foo method.
> 
> Thanks
> Tarun Kumar 
> 
> On Mon, 5 Dec 2016 at 6:59 AM, <[hidden email]> wrote:
> How does your application fetch the spark dependency? Perhaps list your 
> project dependencies and check it's using your dev build.
> 
> 
> On Mon, 5 Dec 2016, 08:47 tenglong, <[hidden email]> wrote:
> Hi,
> 
> Apparently, I've already tried adding a new method to RDD,
> 
> for example,
> 
> class RDD {
>   def foo() // this is the one I added
> 
>   def map()
> 
>   def collect()
> }
> 
> I can build Spark successfully, but I can't compile my application code
> which calls rdd.foo(), and the error message says
> 
> value foo is not a member of org.apache.spark.rdd.RDD[String]
> 
> So I am wondering if there is any mechanism that prevents me from doing this, or
> whether there is something I'm doing wrong?

Re: Can I add a new method to RDD class?

2016-12-04 Thread long
So in my sbt build script, I have the same line as instructed in the
quick start guide,

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2"

And since I was able to see all the other log statements I added to the Spark
source code, I'm pretty sure the application is using the one I just built.
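
For reference, a minimal sketch of how an sbt project can be pointed at a
locally built Spark instead of the published 2.0.2 artifact (the install step
and version string below are assumptions, not taken from this thread):

// build.sbt sketch, assuming the dev build was installed into the local Maven
// repository (e.g. with `build/mvn -DskipTests install`) under a custom
// version string such as 2.1.0-SNAPSHOT:
resolvers += Resolver.mavenLocal

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0-SNAPSHOT"

With coordinates like these, sbt resolves the locally installed jars rather
than the released 2.0.2 artifact.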

Thanks!






Re: Can I add a new method to RDD class?

2016-12-04 Thread Tarun Kumar
Hi Tenglong,

In addition to trsell's reply, you can add any method to an RDD without
making changes to the Spark code.

This can be achieved by using an implicit class in your own client code:

implicit class extendRDD[T](rdd: RDD[T]) {

  def foo(): Unit = {
    // your new logic here, with access to the wrapped rdd
  }

}

Then you basically need to import this implicit class into the scope where you
want to use the new foo method.
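
For example, a minimal self-contained sketch (the object name, the body of
foo, and the SparkContext called sc are illustrative assumptions):

import org.apache.spark.rdd.RDD

object RDDExtensions {
  // The implicit wrapper: importing it brings foo() into scope for any RDD[T].
  implicit class ExtendRDD[T](val rdd: RDD[T]) extends AnyVal {
    // Example body: simply delegates to the wrapped rdd.
    def foo(): Long = rdd.count()
  }
}

// In client code:
//   import RDDExtensions._
//   sc.textFile("data.txt").foo()   // foo() now resolves on the RDD

The import, not any change to Spark itself, is what makes foo() resolve on
org.apache.spark.rdd.RDD[String] in client code.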

Thanks
Tarun Kumar

On Mon, 5 Dec 2016 at 6:59 AM,  wrote:

> How does your application fetch the spark dependency? Perhaps list your
> project dependencies and check it's using your dev build.
>
> On Mon, 5 Dec 2016, 08:47 tenglong,  wrote:
>
> Hi,
>
> Apparently, I've already tried adding a new method to RDD,
>
> for example,
>
> class RDD {
>   def foo() // this is the one I added
>
>   def map()
>
>   def collect()
> }
>
> I can build Spark successfully, but I can't compile my application code
> which calls rdd.foo(), and the error message says
>
> value foo is not a member of org.apache.spark.rdd.RDD[String]
>
> So I am wondering if there is any mechanism that prevents me from doing this, or
> whether there is something I'm doing wrong?


Re: Can I add a new method to RDD class?

2016-12-04 Thread trsell
How does your application fetch the spark dependency? Perhaps list your
project dependencies and check it's using your dev build.

On Mon, 5 Dec 2016, 08:47 tenglong,  wrote:

> Hi,
>
> Apparently, I've already tried adding a new method to RDD,
>
> for example,
>
> class RDD {
>   def foo() // this is the one I added
>
>   def map()
>
>   def collect()
> }
>
> I can build Spark successfully, but I can't compile my application code
> which calls rdd.foo(), and the error message says
>
> value foo is not a member of org.apache.spark.rdd.RDD[String]
>
> So I am wondering if there is any mechanism that prevents me from doing this, or
> whether there is something I'm doing wrong?


Can I add a new method to RDD class?

2016-12-04 Thread tenglong
Hi,

Apparently, I've already tried adding a new method to RDD,

for example,

class RDD {
  def foo() // this is the one I added

  def map()

  def collect()
}

I can build Spark successfully, but I can't compile my application code
which calls rdd.foo(), and the error message says

value foo is not a member of org.apache.spark.rdd.RDD[String]

So I am wondering if there is any mechanism that prevents me from doing this, or
whether there is something I'm doing wrong?







Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-12-04 Thread Koert Kuipers
with the current branch-2.1 after rc1 i am now also seeing this error in
our unit tests:

 java.lang.UnsupportedOperationException: Cannot create encoder for Option
of Product type, because Product type is represented as a row, and the
entire row can not be null in Spark SQL like normal databases. You can wrap
your type with Tuple1 if you do want top level null Product objects, e.g.
instead of creating `Dataset[Option[MyClass]]`, you can do something like
`val ds: Dataset[Tuple1[MyClass]] = Seq(Tuple1(MyClass(...)),
Tuple1(null)).toDS`

the issue is that we have Aggregator[String, Option[SomeCaseClass], String]
and it doesn't like creating the Encoder for that Option[SomeCaseClass]
anymore.

this is related to SPARK-18251

we have a workaround for this: we will wrap all buffer encoder types in
Tuple1. a little inefficient but it's okay with me.
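
for illustration, a minimal sketch of that workaround (SomeCaseClass and the
aggregation logic below are placeholders, not the actual in-house code):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class SomeCaseClass(value: String)

// Buffer type wrapped in Tuple1 so Spark can derive an encoder for it,
// instead of using Option[SomeCaseClass] directly as the top-level buffer.
object FirstSeen extends Aggregator[String, Tuple1[Option[SomeCaseClass]], String] {
  def zero: Tuple1[Option[SomeCaseClass]] = Tuple1(None)

  def reduce(b: Tuple1[Option[SomeCaseClass]], a: String): Tuple1[Option[SomeCaseClass]] =
    Tuple1(b._1.orElse(Some(SomeCaseClass(a))))

  def merge(b1: Tuple1[Option[SomeCaseClass]], b2: Tuple1[Option[SomeCaseClass]]): Tuple1[Option[SomeCaseClass]] =
    Tuple1(b1._1.orElse(b2._1))

  def finish(r: Tuple1[Option[SomeCaseClass]]): String =
    r._1.map(_.value).getOrElse("")

  def bufferEncoder: Encoder[Tuple1[Option[SomeCaseClass]]] =
    Encoders.product[Tuple1[Option[SomeCaseClass]]]

  def outputEncoder: Encoder[String] = Encoders.STRING
}

the only change from an Option[SomeCaseClass] buffer is the Tuple1 wrapper in
the buffer type and in the buffer encoder.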

On Sun, Dec 4, 2016 at 11:16 PM, Koert Kuipers  wrote:

> somewhere between rc1 and the current head of branch-2.1 i started seeing
> an NPE in our in-house unit tests for Dataset + Aggregator. i created
> SPARK-18711  for this.
> 
>
> On Mon, Nov 28, 2016 at 8:25 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.0. The vote is open until Thursday, December 1, 2016 at 18:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.0-rc1 (80aabc0bd33dc5661a90133156247
>> e7a8c1bf7f5)
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1216/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/
>>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.1.0?
>> ===
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>>
>>
>>
>


Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-12-04 Thread Koert Kuipers
somewhere between rc1 and the current head of branch-2.1 i started seeing
an NPE in our in-house unit tests for Dataset + Aggregator. i created
SPARK-18711  for this.


On Mon, Nov 28, 2016 at 8:25 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.1.0. The vote is open until Thursday, December 1, 2016 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.0-rc1 (80aabc0bd33dc5661a90133156247e
> 7a8c1bf7f5)
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1216/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ===
> What should happen to JIRA tickets still targeting 2.1.0?
> ===
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>
>
>
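
For anyone who wants to point an existing sbt build at this candidate, a
sketch using the staging repository quoted above (the 2.1.0 version coordinate
is an assumption):

// build.sbt sketch for testing the release candidate:
resolvers += "spark-2.1.0-rc1-staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1216/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"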


Re: ability to provide custom serializers

2016-12-04 Thread Erik LaBianca
Thanks Michael!

> On Dec 2, 2016, at 7:29 PM, Michael Armbrust  wrote:
> 
> I would love to see something like this.  The closest related ticket is 
> probably https://issues.apache.org/jira/browse/SPARK-7768 
>  (though maybe there are 
> enough people using UDTs in their current form that we should just make a new 
> ticket)

I’m not very familiar with UDTs. Is this something I should research, or just 
leave it be and create a new ticket? I did notice the presence of a registry in 
the source code but it seemed like it was targeted at a different use case.

> A few thoughts:
>  - even if you can do implicit search, we probably also want a registry for 
> Java users.

That’s fine. I’m not 100% sure I can get the right implicit in scope as things 
stand anyway, so let’s table that idea for now and do the registry.

>  - what is the output of the serializer going to be? one challenge here is 
> that encoders write directly into the tungsten format, which is not a stable 
> public API. Maybe this is more obvious if I understood MappedColumnType 
> better?

My assumption was that the output would be existing scalar data types, so 
string, long, double, etc. What I’d like to do is just “layer” the new ones on 
top of already existing ones, kinda like the case class encoder does.
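
To illustrate the idea (purely hypothetical, not an existing Spark API; it
only sketches the kind of "layering" I mean, in the spirit of the
MappedColumnType idea mentioned above):

object MappingSketch {
  // Hypothetical sketch only, not a real Spark API: a custom type expressed
  // through an already-supported scalar type.
  case class Money(cents: Long)

  // A bidirectional mapping from the new type to an existing scalar (Long).
  case class ScalarMapping[A, B](to: A => B, from: B => A)

  val moneyAsLong: ScalarMapping[Money, Long] =
    ScalarMapping[Money, Long](to = m => m.cents, from = c => Money(c))

  // moneyAsLong.to(Money(1999))  == 1999L
  // moneyAsLong.from(1999L)      == Money(1999)
}

The open question is how such a mapping would plug into the encoder framework,
which is what the registry discussion above is about.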

> Either way, I'm happy to give further advice if you come up with a more 
> concrete proposal and put it on JIRA.

Great, let me know and I’ll create a ticket, or we can re-use SPARK-7768 and we 
can move the discussion there.

Thanks!

—erik



Re: Future of the Python 2 support.

2016-12-04 Thread Reynold Xin
Echoing Nick. I don't see any strong reason to drop Python 2 support.

We typically drop support for X when it is rarely used and support for X is
long past EOL. Python 2 is still very popular, and depending on the
statistics it might be more popular than Python 3.

On Sun, Dec 4, 2016 at 9:29 AM Nicholas Chammas 
wrote:

> I don't think it makes sense to deprecate or drop support for Python 2.7
> until at least 2020, when 2.7 itself will be EOLed. (As of Spark 2.0,
> Python 2.6 support is deprecated and will be removed by Spark 2.2. Python
> 2.7 is the only version of Python 2 that's still fully supported.)
>
> Given the widespread industry use of Python 2.7, and the fact that it is
> supported upstream by the Python core developers until 2020, I don't see
> why Spark should even consider dropping support for it before then. There
> is, of course, additional ongoing work to support Python 2.7, but it seems
> more than justified by its level of use and popularity in the broader
> community. And I say that as someone who almost exclusively develops in
> Python 3.5+ these days.
>
> Perhaps by 2018 the industry usage of Python 2 will drop precipitously and
> merit a discussion about dropping support, but I think at this point it's
> premature to discuss that and we should just wait and see.
>
> Nick
>
>
> On Sun, Dec 4, 2016 at 10:59 AM Maciej Szymkiewicz 
> wrote:
>
> Hi,
>
> I am aware there was a previous discussion about dropping support for
> different platforms (
> http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html)
> but somehow it has been dominated by Scala and JVM and never touched the
> subject of Python 2.
>
> Some facts:
>
>- Python 2 End Of Life is scheduled for 2020 (
>http://legacy.python.org/dev/peps/pep-0373/) with "no
>guarantee that bugfix releases will be made on a regular basis" until then.
>- Almost all commonly used libraries already support Python 3 (
>https://python3wos.appspot.com/). A single exception that can be
>important for Spark is thrift (Python 3 support is already present on the
>master) and transitively PyHive and Blaze.
>- Supporting both Python 2 and Python 3 introduces significant
>technical debt. In practice Python 3 is a different language with backward
>incompatible syntax and growing number of features which won't be
>backported to 2.x.
>
> Suggestions:
>
>- We need a public discussion about possible date for dropping Python
>2 support.
>- Early 2018 should give enough time for a graceful transition.
>
> --
> Best,
> Maciej
>
>


Re: Future of the Python 2 support.

2016-12-04 Thread Nicholas Chammas
I don't think it makes sense to deprecate or drop support for Python 2.7
until at least 2020, when 2.7 itself will be EOLed. (As of Spark 2.0,
Python 2.6 support is deprecated and will be removed by Spark 2.2. Python
2.7 is the only version of Python 2 that's still fully supported.)

Given the widespread industry use of Python 2.7, and the fact that it is
supported upstream by the Python core developers until 2020, I don't see
why Spark should even consider dropping support for it before then. There
is, of course, additional ongoing work to support Python 2.7, but it seems
more than justified by its level of use and popularity in the broader
community. And I say that as someone who almost exclusively develops in
Python 3.5+ these days.

Perhaps by 2018 the industry usage of Python 2 will drop precipitously and
merit a discussion about dropping support, but I think at this point it's
premature to discuss that and we should just wait and see.

Nick


On Sun, Dec 4, 2016 at 10:59 AM Maciej Szymkiewicz 
wrote:

> Hi,
>
> I am aware there was a previous discussion about dropping support for
> different platforms (
> http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html)
> but somehow it has been dominated by Scala and JVM and never touched the
> subject of Python 2.
>
> Some facts:
>
>- Python 2 End Of Life is scheduled for 2020 (
>http://legacy.python.org/dev/peps/pep-0373/) with "no
>guarantee that bugfix releases will be made on a regular basis" until then.
>- Almost all commonly used libraries already support Python 3 (
>https://python3wos.appspot.com/). A single exception that can be
>important for Spark is thrift (Python 3 support is already present on the
>master) and transitively PyHive and Blaze.
>- Supporting both Python 2 and Python 3 introduces significant
>technical debt. In practice Python 3 is a different language with backward
>incompatible syntax and growing number of features which won't be
>backported to 2.x.
>
> Suggestions:
>
>- We need a public discussion about possible date for dropping Python
>2 support.
>- Early 2018 should give enough time for a graceful transition.
>
> --
> Best,
> Maciej
>
>


Future of the Python 2 support.

2016-12-04 Thread Maciej Szymkiewicz
Hi,

I am aware there was a previous discussion about dropping support for
different platforms
(http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html)
but somehow it has been dominated by Scala and JVM and never touched the
subject of Python 2.

Some facts:

  * Python 2 End Of Life is scheduled for 2020
(http://legacy.python.org/dev/peps/pep-0373/) with "no
guarantee that bugfix releases will be made on a regular basis"
until then.
  * Almost all commonly used libraries already support Python 3
(https://python3wos.appspot.com/). A single exception that can be
important for Spark is thrift (Python 3 support is already present
on the master) and transitively PyHive and Blaze.
  * Supporting both Python 2 and Python 3 introduces significant
technical debt. In practice Python 3 is a different language with
backward-incompatible syntax and a growing number of features which
won't be backported to 2.x.

Suggestions:

  * We need a public discussion about possible date for dropping Python
2 support.
  * Early 2018 should give enough time for a graceful transition.

-- 
Best,
Maciej