Master options Cluster/Client discrepancies.

2016-03-28 Thread satyajit vegesna
Hi All,

I have written a Spark program on my dev box:
   IDE: IntelliJ
   Scala version: 2.11.7
   Spark version: 1.6.1

It runs fine from the IDE when I provide proper input and output paths, including
the master.

But when I try to deploy the code on my cluster, which consists of:

   Spark version: 1.6.1, built from the source package using Scala 2.11
   (yet when I run spark-shell on the cluster, it reports Scala version 2.10.5)
   Hadoop YARN cluster: 2.6.0

and submit with the additional options:

--executor-memory
--total-executor-cores
--deploy-mode cluster/client
--master yarn

I get:

Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
at com.movoto.SparkPost$.main(SparkPost.scala:36)
at com.movoto.SparkPost.main(SparkPost.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I understand this to be a Scala version issue, as I have faced this before.

Is there something I should change or try to get the same
program running on the cluster?

Regards,
Satyajit.
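
A minimal sketch of the usual fix, assuming an sbt build: compile the application
against the same Scala binary version as the Spark build installed on the cluster
(2.10.x here, going by the spark-shell output above), and mark the Spark
dependencies as "provided" so the cluster's own jars are used at runtime. Version
numbers below are illustrative.

    // build.sbt -- illustrative; match scalaVersion to the cluster's Spark build
    scalaVersion := "2.10.5"

    libraryDependencies ++= Seq(
      // "provided": compile against these, but use the cluster's Spark jars at runtime
      "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided"
    )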


Re: SPARK-13843 Next steps

2016-03-28 Thread Sean Owen
I tend to agree. If it's going to present a significant technical hurdle,
and the software is clearly non-ASF (e.g. via a different artifact), there's
a decent argument the namespace should stay. The artifact has to change,
though, and that is what David was referring to in his other message.

On Mon, Mar 28, 2016, 08:33 Cody Koeninger  wrote:

> I really think the only thing that should have to change is the maven
> group and identifier, not the java namespace.
>
> There are compatibility problems with the java namespace changing
> (e.g. access to private[spark]), and I don't think that someone who
> takes the time to change their build file to download a maven artifact
> without "apache" in the identifier is at significant risk of consumer
> confusion.
>
> I've tried to get a straight answer from ASF trademarks on this point,
> but the answers I've been getting are mixed, and personally disturbing
> to me in terms of over-reaching.
>
> On Sat, Mar 26, 2016 at 9:03 AM, Sean Owen  wrote:
> > Looks like this is done; docs have been moved, flume is back in, etc.
> >
> > For the moment Kafka streaming is still in the project and I know
> > there's still discussion about how to manage multiple versions within
> > the project.
> >
> > One other thing we need to finish up is stuff like the namespace of
> > the code that was moved out. I believe it'll have to move out of the
> > org.apache namespace as well as change its artifact group. At least,
> > David indicated Sonatype wouldn't let someone non-ASF push an artifact
> > from that group anyway.
> >
> > Also might be worth adding a description at
> > https://github.com/spark-packages explaining that these are just some
> > unofficial Spark-related packages.
> >
> > On Tue, Mar 22, 2016 at 7:27 AM, Kostas Sakellis 
> wrote:
> >> Hello all,
> >>
> >> I'd like to close out the discussion on SPARK-13843 by getting a poll
> from
> >> the community on which components we should seriously reconsider
> re-adding
> >> back to Apache Spark. For reference, here are the modules that were
> removed
> >> as part of SPARK-13843 and pushed to: https://github.com/spark-packages
> >>
> >> streaming-flume
> >> streaming-akka
> >> streaming-mqtt
> >> streaming-zeromq
> >> streaming-twitter
> >>
> >> For us, we'd like to see the streaming-flume added back to Apache Spark.
> >>
> >> Thanks,
> >> Kostas
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
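
A hypothetical illustration of the build-file change being discussed (the
coordinates below are invented, not published artifacts): only the Maven group
and identifier change, while the Java package names could, at least technically,
stay the same.

    // build.sbt -- hypothetical coordinates, for illustration of the discussion only
    // before: the connector shipped as part of Apache Spark
    // libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "1.6.1"

    // after: the same code published under a non-ASF group and identifier
    libraryDependencies += "org.spark-packages" %% "streaming-flume-connector" % "2.0.0"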


Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-28 Thread Kostas Sakellis
Also, +1 on dropping jdk7 in Spark 2.0.

Kostas

On Mon, Mar 28, 2016 at 2:01 PM, Marcelo Vanzin  wrote:

> Finally got some internal feedback on this, and we're ok with
> requiring people to deploy jdk8 for 2.0, so +1 too.
>
> On Mon, Mar 28, 2016 at 1:15 PM, Luciano Resende 
> wrote:
> > +1, I also checked with few projects inside IBM that consume Spark and
> they
> > seem to be ok with the direction of droping JDK 7.
> >
> > On Mon, Mar 28, 2016 at 11:24 AM, Michael Gummelt <
> mgumm...@mesosphere.io>
> > wrote:
> >>
> >> +1 from Mesosphere
> >>
> >> On Mon, Mar 28, 2016 at 5:12 AM, Steve Loughran  >
> >> wrote:
> >>>
> >>>
> >>> > On 25 Mar 2016, at 01:59, Mridul Muralidharan 
> wrote:
> >>> >
> >>> > Removing compatibility (with jdk, etc) can be done with a major
> >>> > release- given that 7 has been EOLed a while back and is now
> unsupported, we
> >>> > have to decide if we drop support for it in 2.0 or 3.0 (2+ years
> from now).
> >>> >
> >>> > Given the functionality & performance benefits of going to jdk8,
> future
> >>> > enhancements relevant in 2.x timeframe ( scala, dependencies) which
> requires
> >>> > it, and simplicity wrt code, test & support it looks like a good
> checkpoint
> >>> > to drop jdk7 support.
> >>> >
> >>> > As already mentioned in the thread, existing yarn clusters are
> >>> > unaffected if they want to continue running jdk7 and yet use spark2
> (install
> >>> > jdk8 on all nodes and use it via JAVA_HOME, or worst case distribute
> jdk8 as
> >>> > archive - suboptimal).
> >>>
> >>> you wouldn't want to dist it as an archive; it's not just the binaries,
> >>> it's the install phase. And you'd better remember to put the JCE jar
> in on
> >>> top of the JDK for kerberos to work.
> >>>
> >>> setting up environment vars to point to JDK8 in the launched
> >>> app/container avoids that. Yes, the ops team do need to install java,
> but if
> >>> you offer them the choice of "installing a centrally managed Java" and
> >>> "having my code try and install it", they should go for the managed
> option.
> >>>
> >>> One thing to consider for 2.0 is to make it easier to set up those env
> >>> vars for both python and java. And, as the techniques for mixing JDK
> >>> versions is clearly not that well known, documenting it.
> >>>
> >>> (FWIW I've done code which even uploads it's own hadoop-* JAR, but what
> >>> gets you is changes in the hadoop-native libs; you do need to get the
> PATH
> >>> var spot on)
> >>>
> >>>
> >>> > I am unsure about mesos (standalone might be easier upgrade I guess
> ?).
> >>> >
> >>> >
> >>> > Proposal is for 1.6x line to continue to be supported with critical
> >>> > fixes; newer features will require 2.x and so jdk8
> >>> >
> >>> > Regards
> >>> > Mridul
> >>> >
> >>> >
> >>>
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>
> >>
> >>
> >>
> >> --
> >> Michael Gummelt
> >> Software Engineer
> >> Mesosphere
> >
> >
> >
> >
> > --
> > Luciano Resende
> > http://twitter.com/lresende1975
> > http://lresende.blogspot.com/
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-28 Thread Marcelo Vanzin
Finally got some internal feedback on this, and we're ok with
requiring people to deploy jdk8 for 2.0, so +1 too.

On Mon, Mar 28, 2016 at 1:15 PM, Luciano Resende  wrote:
> +1, I also checked with few projects inside IBM that consume Spark and they
> seem to be ok with the direction of droping JDK 7.
>
> On Mon, Mar 28, 2016 at 11:24 AM, Michael Gummelt 
> wrote:
>>
>> +1 from Mesosphere
>>
>> On Mon, Mar 28, 2016 at 5:12 AM, Steve Loughran 
>> wrote:
>>>
>>>
>>> > On 25 Mar 2016, at 01:59, Mridul Muralidharan  wrote:
>>> >
>>> > Removing compatibility (with jdk, etc) can be done with a major
>>> > release- given that 7 has been EOLed a while back and is now unsupported, 
>>> > we
>>> > have to decide if we drop support for it in 2.0 or 3.0 (2+ years from 
>>> > now).
>>> >
>>> > Given the functionality & performance benefits of going to jdk8, future
>>> > enhancements relevant in 2.x timeframe ( scala, dependencies) which 
>>> > requires
>>> > it, and simplicity wrt code, test & support it looks like a good 
>>> > checkpoint
>>> > to drop jdk7 support.
>>> >
>>> > As already mentioned in the thread, existing yarn clusters are
>>> > unaffected if they want to continue running jdk7 and yet use spark2 
>>> > (install
>>> > jdk8 on all nodes and use it via JAVA_HOME, or worst case distribute jdk8 
>>> > as
>>> > archive - suboptimal).
>>>
>>> you wouldn't want to dist it as an archive; it's not just the binaries,
>>> it's the install phase. And you'd better remember to put the JCE jar in on
>>> top of the JDK for kerberos to work.
>>>
>>> setting up environment vars to point to JDK8 in the launched
>>> app/container avoids that. Yes, the ops team do need to install java, but if
>>> you offer them the choice of "installing a centrally managed Java" and
>>> "having my code try and install it", they should go for the managed option.
>>>
>>> One thing to consider for 2.0 is to make it easier to set up those env
>>> vars for both python and java. And, as the techniques for mixing JDK
>>> versions is clearly not that well known, documenting it.
>>>
>>> (FWIW I've done code which even uploads it's own hadoop-* JAR, but what
>>> gets you is changes in the hadoop-native libs; you do need to get the PATH
>>> var spot on)
>>>
>>>
>>> > I am unsure about mesos (standalone might be easier upgrade I guess ?).
>>> >
>>> >
>>> > Proposal is for 1.6x line to continue to be supported with critical
>>> > fixes; newer features will require 2.x and so jdk8
>>> >
>>> > Regards
>>> > Mridul
>>> >
>>> >
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/



-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-28 Thread Luciano Resende
+1, I also checked with a few projects inside IBM that consume Spark and they
seem to be ok with the direction of dropping JDK 7.

On Mon, Mar 28, 2016 at 11:24 AM, Michael Gummelt 
wrote:

> +1 from Mesosphere
>
> On Mon, Mar 28, 2016 at 5:12 AM, Steve Loughran 
> wrote:
>
>>
>> > On 25 Mar 2016, at 01:59, Mridul Muralidharan  wrote:
>> >
>> > Removing compatibility (with jdk, etc) can be done with a major
>> release- given that 7 has been EOLed a while back and is now unsupported,
>> we have to decide if we drop support for it in 2.0 or 3.0 (2+ years from
>> now).
>> >
>> > Given the functionality & performance benefits of going to jdk8, future
>> enhancements relevant in 2.x timeframe ( scala, dependencies) which
>> requires it, and simplicity wrt code, test & support it looks like a good
>> checkpoint to drop jdk7 support.
>> >
>> > As already mentioned in the thread, existing yarn clusters are
>> unaffected if they want to continue running jdk7 and yet use spark2
>> (install jdk8 on all nodes and use it via JAVA_HOME, or worst case
>> distribute jdk8 as archive - suboptimal).
>>
>> you wouldn't want to dist it as an archive; it's not just the binaries,
>> it's the install phase. And you'd better remember to put the JCE jar in on
>> top of the JDK for kerberos to work.
>>
>> setting up environment vars to point to JDK8 in the launched
>> app/container avoids that. Yes, the ops team do need to install java, but
>> if you offer them the choice of "installing a centrally managed Java" and
>> "having my code try and install it", they should go for the managed option.
>>
>> One thing to consider for 2.0 is to make it easier to set up those env
>> vars for both python and java. And, as the techniques for mixing JDK
>> versions is clearly not that well known, documenting it.
>>
>> (FWIW I've done code which even uploads it's own hadoop-* JAR, but what
>> gets you is changes in the hadoop-native libs; you do need to get the PATH
>> var spot on)
>>
>>
>> > I am unsure about mesos (standalone might be easier upgrade I guess ?).
>> >
>> >
>> > Proposal is for 1.6x line to continue to be supported with critical
>> fixes; newer features will require 2.x and so jdk8
>> >
>> > Regards
>> > Mridul
>> >
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: OOM and "spark.buffer.pageSize"

2016-03-28 Thread Steve Johnston
Yes I have. That’s the best source of information at the moment. Thanks.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/OOM-and-spark-buffer-pageSize-tp16890p16892.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: OOM and "spark.buffer.pageSize"

2016-03-28 Thread Ted Yu
I guess you have looked at MemoryManager#pageSizeBytes, where
the "spark.buffer.pageSize" config can override the default page size.

FYI

On Mon, Mar 28, 2016 at 12:07 PM, Steve Johnston <
sjohns...@algebraixdata.com> wrote:

> I'm attempting to address an OOM issue. I saw referenced in
> java.lang.OutOfMemoryError: Unable to acquire bytes of memory
> <
> http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-OutOfMemoryError-Unable-to-acquire-bytes-of-memory-td16773.html
> >
> the configuration setting "spark.buffer.pageSize" which was used in
> conjunction with "spark.sql.shuffle.partitions" to solve the OOM problem
> Nezih was having.
>
> What is "spark.buffer.pageSize"? How can it be used? I can find it in the
> code but there doesn't seem to be any other documentation.
>
> Thanks,
> Steve
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/OOM-and-spark-buffer-pageSize-tp16890.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
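
For reference, a minimal sketch of the combination described above (Spark 1.6.x
assumed; the values are illustrative, not recommendations): "spark.buffer.pageSize"
overrides the page size the memory manager would otherwise derive, and a larger
"spark.sql.shuffle.partitions" reduces per-task memory pressure.

    // Illustrative only -- tune both knobs together when hitting acquire-memory OOMs
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("oom-tuning-sketch")
      .set("spark.buffer.pageSize", "4m")          // explicit page size (size string)
      .set("spark.sql.shuffle.partitions", "400")  // more, smaller shuffle partitions
    val sc = new SparkContext(conf)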
>


OOM and "spark.buffer.pageSize"

2016-03-28 Thread Steve Johnston
I'm attempting to address an OOM issue. I saw referenced in
java.lang.OutOfMemoryError: Unable to acquire bytes of memory
<http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-OutOfMemoryError-Unable-to-acquire-bytes-of-memory-td16773.html>
the configuration setting "spark.buffer.pageSize", which was used in
conjunction with "spark.sql.shuffle.partitions" to solve the OOM problem
Nezih was having.

What is "spark.buffer.pageSize"? How can it be used? I can find it in the
code but there doesn't seem to be any other documentation.

Thanks,
Steve



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/OOM-and-spark-buffer-pageSize-tp16890.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-28 Thread Michael Gummelt
+1 from Mesosphere

On Mon, Mar 28, 2016 at 5:12 AM, Steve Loughran 
wrote:

>
> > On 25 Mar 2016, at 01:59, Mridul Muralidharan  wrote:
> >
> > Removing compatibility (with jdk, etc) can be done with a major release-
> given that 7 has been EOLed a while back and is now unsupported, we have to
> decide if we drop support for it in 2.0 or 3.0 (2+ years from now).
> >
> > Given the functionality & performance benefits of going to jdk8, future
> enhancements relevant in 2.x timeframe ( scala, dependencies) which
> requires it, and simplicity wrt code, test & support it looks like a good
> checkpoint to drop jdk7 support.
> >
> > As already mentioned in the thread, existing yarn clusters are
> unaffected if they want to continue running jdk7 and yet use spark2
> (install jdk8 on all nodes and use it via JAVA_HOME, or worst case
> distribute jdk8 as archive - suboptimal).
>
> you wouldn't want to dist it as an archive; it's not just the binaries,
> it's the install phase. And you'd better remember to put the JCE jar in on
> top of the JDK for kerberos to work.
>
> setting up environment vars to point to JDK8 in the launched app/container
> avoids that. Yes, the ops team do need to install java, but if you offer
> them the choice of "installing a centrally managed Java" and "having my
> code try and install it", they should go for the managed option.
>
> One thing to consider for 2.0 is to make it easier to set up those env
> vars for both python and java. And, as the techniques for mixing JDK
> versions is clearly not that well known, documenting it.
>
> (FWIW I've done code which even uploads it's own hadoop-* JAR, but what
> gets you is changes in the hadoop-native libs; you do need to get the PATH
> var spot on)
>
>
> > I am unsure about mesos (standalone might be easier upgrade I guess ?).
> >
> >
> > Proposal is for 1.6x line to continue to be supported with critical
> fixes; newer features will require 2.x and so jdk8
> >
> > Regards
> > Mridul
> >
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: SPARK-13843 Next steps

2016-03-28 Thread Marcelo Vanzin
On Mon, Mar 28, 2016 at 8:33 AM, Cody Koeninger  wrote:
> There are compatibility problems with the java namespace changing
> (e.g. access to private[spark])

I think it would be fine to keep the package names for backwards
compatibility, but I think if these external projects want to keep a
separate release cycle from Spark, they should refrain from using
"private[spark]" APIs; which I guess is an argument for changing the
package names at some point.

-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
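
A sketch of the compatibility point (package and object names below are invented):
Scala's qualified-private access means a private[spark] member is visible only to
code that itself lives under the org.apache.spark package, so a connector that keeps
the old package name keeps that access, while one moved to its own namespace loses it.

    // Why the package name matters for private[spark] access (names invented)
    package org.apache.spark {
      private[spark] object Internal { def secret: Int = 42 }
    }

    package org.apache.spark.streaming.somebackend {
      // Compiles: this package lives under org.apache.spark, so private[spark] is visible.
      object StillWorks { def value: Int = org.apache.spark.Internal.secret }
    }

    package com.example.somebackend {
      // Would NOT compile if uncommented: com.example is outside org.apache.spark.
      // object Broken { def value: Int = org.apache.spark.Internal.secret }
    }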



Re: SPARK-13843 and future of streaming backends

2016-03-28 Thread Cody Koeninger
Are you talking about group/identifier name, or contained classes?

Because there are plenty of org.apache.* classes distributed via maven
with non-apache group / identifiers.

On Fri, Mar 25, 2016 at 6:54 PM, David Nalley  wrote:
>
>> As far as group / artifact name compatibility, at least in the case of
>> Kafka we need different artifact names anyway, and people are going to
>> have to make changes to their build files for spark 2.0 anyway.   As
>> far as keeping the actual classes in org.apache.spark to not break
>> code despite the group name being different, I don't know whether that
>> would be enforced by maven central, just looked at as poor taste, or
>> ASF suing for trademark violation :)
>
>
> Sonatype, has strict instructions to only permit org.apache.* to originate 
> from repository.apache.org. Exceptions to that must be approved by VP, 
> Infrastructure.
> --
> Sent via Pony Mail for dev@spark.apache.org.
> View this email online at:
> https://pony-poc.apache.org/list.html?dev@spark.apache.org
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SPARK-13843 Next steps

2016-03-28 Thread Cody Koeninger
I really think the only thing that should have to change is the maven
group and identifier, not the java namespace.

There are compatibility problems with the java namespace changing
(e.g. access to private[spark]), and I don't think that someone who
takes the time to change their build file to download a maven artifact
without "apache" in the identifier is at significant risk of consumer
confusion.

I've tried to get a straight answer from ASF trademarks on this point,
but the answers I've been getting are mixed, and personally disturbing
to me in terms of over-reaching.

On Sat, Mar 26, 2016 at 9:03 AM, Sean Owen  wrote:
> Looks like this is done; docs have been moved, flume is back in, etc.
>
> For the moment Kafka streaming is still in the project and I know
> there's still discussion about how to manage multiple versions within
> the project.
>
> One other thing we need to finish up is stuff like the namespace of
> the code that was moved out. I believe it'll have to move out of the
> org.apache namespace as well as change its artifact group. At least,
> David indicated Sonatype wouldn't let someone non-ASF push an artifact
> from that group anyway.
>
> Also might be worth adding a description at
> https://github.com/spark-packages explaining that these are just some
> unofficial Spark-related packages.
>
> On Tue, Mar 22, 2016 at 7:27 AM, Kostas Sakellis  wrote:
>> Hello all,
>>
>> I'd like to close out the discussion on SPARK-13843 by getting a poll from
>> the community on which components we should seriously reconsider re-adding
>> back to Apache Spark. For reference, here are the modules that were removed
>> as part of SPARK-13843 and pushed to: https://github.com/spark-packages
>>
>> streaming-flume
>> streaming-akka
>> streaming-mqtt
>> streaming-zeromq
>> streaming-twitter
>>
>> For us, we'd like to see the streaming-flume added back to Apache Spark.
>>
>> Thanks,
>> Kostas
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-28 Thread Steve Loughran

> On 25 Mar 2016, at 01:59, Mridul Muralidharan  wrote:
> 
> Removing compatibility (with jdk, etc) can be done with a major release- 
> given that 7 has been EOLed a while back and is now unsupported, we have to 
> decide if we drop support for it in 2.0 or 3.0 (2+ years from now).
> 
> Given the functionality & performance benefits of going to jdk8, future 
> enhancements relevant in 2.x timeframe ( scala, dependencies) which requires 
> it, and simplicity wrt code, test & support it looks like a good checkpoint 
> to drop jdk7 support.
> 
> As already mentioned in the thread, existing yarn clusters are unaffected if 
> they want to continue running jdk7 and yet use spark2 (install jdk8 on all 
> nodes and use it via JAVA_HOME, or worst case distribute jdk8 as archive - 
> suboptimal).

you wouldn't want to dist it as an archive; it's not just the binaries, it's 
the install phase. And you'd better remember to put the JCE jar in on top of 
the JDK for kerberos to work.

setting up environment vars to point to JDK8 in the launched app/container 
avoids that. Yes, the ops team do need to install java, but if you offer them 
the choice of "installing a centrally managed Java" and "having my code try and 
install it", they should go for the managed option.

One thing to consider for 2.0 is to make it easier to set up those env vars for 
both Python and Java. And, as the techniques for mixing JDK versions are clearly 
not that well known, documenting them. 

(FWIW I've done code which even uploads its own hadoop-* JAR, but what gets 
you is changes in the hadoop-native libs; you do need to get the PATH var spot 
on)


> I am unsure about mesos (standalone might be easier upgrade I guess ?).
> 
> 
> Proposal is for 1.6x line to continue to be supported with critical fixes; 
> newer features will require 2.x and so jdk8
> 
> Regards 
> Mridul 
> 
> 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
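
A minimal sketch of the per-application approach described above, for Spark on YARN
(the JDK path is an assumption; it must point at a JDK 8 already installed, or
distributed, on every node):

    // Illustrative only: point the YARN application master and executors at a JDK 8
    // install without changing the cluster-wide default.
    import org.apache.spark.SparkConf

    val jdk8Home = "/usr/lib/jvm/java-8-openjdk"   // assumed install path on the nodes
    val conf = new SparkConf()
      .set("spark.yarn.appMasterEnv.JAVA_HOME", jdk8Home)
      .set("spark.executorEnv.JAVA_HOME", jdk8Home)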



Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-28 Thread Michał Zieliński
Hi Maciej,

Absolutely. We had to copy HasInputCol/s and HasOutputCol/s (along with a
couple of others like HasProbabilityCol) to our repo. For most use cases that
is good enough, but for some (e.g. operating on any Transformer that accepts
either our HasInputCol or Spark's) it makes the code clunky.
Opening those traits to the public would be a big gain.

Thanks,
Michal

On 28 March 2016 at 07:44, Jacek Laskowski  wrote:

> Hi,
>
> Never develop any custom Transformer (or UnaryTransformer in particular),
> but I'd be for it if that's the case.
>
> Jacek
> 28.03.2016 6:54 AM "Maciej Szymkiewicz" 
> napisał(a):
>
>> Hi Jacek,
>>
>> In this context, don't you think it would be useful, if at least some
>> traits from org.apache.spark.ml.param.shared.sharedParams were
>> public?HasInputCol(s) and HasOutputCol for example. These are useful
>> pretty much every time you create custom Transformer.
>>
>> --
>> Pozdrawiam,
>> Maciej Szymkiewicz
>>
>>
>> On 03/26/2016 10:26 AM, Jacek Laskowski wrote:
>> > Hi Joseph,
>> >
>> > Thanks for the response. I'm one who doesn't understand all the
>> > hype/need for Machine Learning...yet and through Spark ML(lib) glasses
>> > I'm looking at ML space. In the meantime I've got few assignments (in
>> > a project with Spark and Scala) that have required quite extensive
>> > dataset manipulation.
>> >
>> > It was when I sinked into using DataFrame/Dataset for data
>> > manipulation not RDD (I remember talking to Brian about how RDD is an
>> > "assembly" language comparing to the higher-level concept of
>> > DataFrames with Catalysts and other optimizations). After few days
>> > with DataFrame I learnt he was so right! (sorry Brian, it took me
>> > longer to understand your point).
>> >
>> > I started using DataFrames in far too many places than one could ever
>> > accept :-) I was so...carried away with DataFrames (esp. show vs
>> > foreach(println) and UDFs via udf() function)
>> >
>> > And then, when I moved to Pipeline API and discovered Transformers.
>> > And PipelineStage that can create pipelines of DataFrame manipulation.
>> > They read so well that I'm pretty sure people would love using them
>> > more often, but...they belong to MLlib so they are part of ML space
>> > (not many devs tackled yet). I applied the approach to using
>> > withColumn to have better debugging experience (if I ever need it). I
>> > learnt it after having watched your presentation about Pipeline API.
>> > It was so helpful in my RDD/DataFrame space.
>> >
>> > So, to promote a more extensive use of Pipelines, PipelineStages, and
>> > Transformers, I was thinking about moving that part to SQL/DataFrame
>> > API where they really belong. If not, I think people might miss the
>> > beauty of the very fine and so helpful Transformers.
>> >
>> > Transformers are *not* a ML thing -- they are DataFrame thing and
>> > should be where they really belong (for their greater adoption).
>> >
>> > What do you think?
>> >
>> >
>> > Pozdrawiam,
>> > Jacek Laskowski
>> > 
>> > https://medium.com/@jaceklaskowski/
>> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> > Follow me at https://twitter.com/jaceklaskowski
>> >
>> >
>> > On Sat, Mar 26, 2016 at 3:23 AM, Joseph Bradley 
>> wrote:
>> >> There have been some comments about using Pipelines outside of ML, but
>> I
>> >> have not yet seen a real need for it.  If a user does want to use
>> Pipelines
>> >> for non-ML tasks, they still can use Transformers + PipelineModels.
>> Will
>> >> that work?
>> >>
>> >> On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski 
>> wrote:
>> >>> Hi,
>> >>>
>> >>> After few weeks with spark.ml now, I came to conclusion that
>> >>> Transformer concept from Pipeline API (spark.ml/MLlib) should be part
>> >>> of DataFrame (SQL) where they fit better. Are there any plans to
>> >>> migrate Transformer API (ML) to DataFrame (SQL)?
>> >>>
>> >>> Pozdrawiam,
>> >>> Jacek Laskowski
>> >>> 
>> >>> https://medium.com/@jaceklaskowski/
>> >>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> >>> Follow me at https://twitter.com/jaceklaskowski
>> >>>
>> >>> -
>> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >>> For additional commands, e-mail: dev-h...@spark.apache.org
>> >>>
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>>
>>
>>
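
A minimal sketch related to the point above (Spark 1.6-era API assumed; class and
column names are invented): because the shared HasInputCol/HasOutputCol traits are
package-private to org.apache.spark.ml, a custom Transformer has to either copy
them, as described above, or declare its own column params, as sketched here.

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.{Param, ParamMap}
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, upper}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Invented example: upper-cases a string column; inputCol/outputCol are declared
    // locally because the shared HasInputCol/HasOutputCol traits are not public.
    class UpperCaseTransformer(override val uid: String) extends Transformer {
      def this() = this(Identifiable.randomUID("upperCase"))

      val inputCol = new Param[String](this, "inputCol", "input column name")
      val outputCol = new Param[String](this, "outputCol", "output column name")
      def setInputCol(value: String): this.type = set(inputCol, value)
      def setOutputCol(value: String): this.type = set(outputCol, value)

      override def transform(dataset: DataFrame): DataFrame =
        dataset.withColumn($(outputCol), upper(col($(inputCol))))

      override def transformSchema(schema: StructType): StructType =
        StructType(schema.fields :+ StructField($(outputCol), StringType, nullable = true))

      override def copy(extra: ParamMap): UpperCaseTransformer = defaultCopy(extra)
    }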


Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-28 Thread Jacek Laskowski
Hi,

I've never developed any custom Transformer (or UnaryTransformer in particular),
but I'd be for it if that's the case.

Jacek
28.03.2016 6:54 AM "Maciej Szymkiewicz"  napisał(a):

> Hi Jacek,
>
> In this context, don't you think it would be useful, if at least some
> traits from org.apache.spark.ml.param.shared.sharedParams were
> public?HasInputCol(s) and HasOutputCol for example. These are useful
> pretty much every time you create custom Transformer.
>
> --
> Pozdrawiam,
> Maciej Szymkiewicz
>
>
> On 03/26/2016 10:26 AM, Jacek Laskowski wrote:
> > Hi Joseph,
> >
> > Thanks for the response. I'm one who doesn't understand all the
> > hype/need for Machine Learning...yet and through Spark ML(lib) glasses
> > I'm looking at ML space. In the meantime I've got few assignments (in
> > a project with Spark and Scala) that have required quite extensive
> > dataset manipulation.
> >
> > It was when I sinked into using DataFrame/Dataset for data
> > manipulation not RDD (I remember talking to Brian about how RDD is an
> > "assembly" language comparing to the higher-level concept of
> > DataFrames with Catalysts and other optimizations). After few days
> > with DataFrame I learnt he was so right! (sorry Brian, it took me
> > longer to understand your point).
> >
> > I started using DataFrames in far too many places than one could ever
> > accept :-) I was so...carried away with DataFrames (esp. show vs
> > foreach(println) and UDFs via udf() function)
> >
> > And then, when I moved to Pipeline API and discovered Transformers.
> > And PipelineStage that can create pipelines of DataFrame manipulation.
> > They read so well that I'm pretty sure people would love using them
> > more often, but...they belong to MLlib so they are part of ML space
> > (not many devs tackled yet). I applied the approach to using
> > withColumn to have better debugging experience (if I ever need it). I
> > learnt it after having watched your presentation about Pipeline API.
> > It was so helpful in my RDD/DataFrame space.
> >
> > So, to promote a more extensive use of Pipelines, PipelineStages, and
> > Transformers, I was thinking about moving that part to SQL/DataFrame
> > API where they really belong. If not, I think people might miss the
> > beauty of the very fine and so helpful Transformers.
> >
> > Transformers are *not* a ML thing -- they are DataFrame thing and
> > should be where they really belong (for their greater adoption).
> >
> > What do you think?
> >
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
> >
> >
> > On Sat, Mar 26, 2016 at 3:23 AM, Joseph Bradley 
> wrote:
> >> There have been some comments about using Pipelines outside of ML, but I
> >> have not yet seen a real need for it.  If a user does want to use
> Pipelines
> >> for non-ML tasks, they still can use Transformers + PipelineModels.
> Will
> >> that work?
> >>
> >> On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski 
> wrote:
> >>> Hi,
> >>>
> >>> After few weeks with spark.ml now, I came to conclusion that
> >>> Transformer concept from Pipeline API (spark.ml/MLlib) should be part
> >>> of DataFrame (SQL) where they fit better. Are there any plans to
> >>> migrate Transformer API (ML) to DataFrame (SQL)?
> >>>
> >>> Pozdrawiam,
> >>> Jacek Laskowski
> >>> 
> >>> https://medium.com/@jaceklaskowski/
> >>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> >>> Follow me at https://twitter.com/jaceklaskowski
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
>
>