SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
Hi,

I have a web service that provides a REST API to train a random forest algo.
I train the random forest on a 5-node Spark cluster with enough memory -
everything is cached (~22 GB).
On small datasets of up to 100k samples everything is fine, but with the
biggest one (400k samples and ~70k features) I'm stuck with a
StackOverflowError.

Additional options for my web service:
spark.executor.extraJavaOptions="-XX:ThreadStackSize=8192"
spark.default.parallelism=200
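
Equivalently, as a SparkConf sketch (illustrative only - the same two settings
can also live in spark-defaults.conf; shown in Scala, the Java SparkConf API is
the same):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:ThreadStackSize=8192")
  .set("spark.default.parallelism", "200")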

On the 400k-sample dataset:
- with the default thread stack size, it took 4 hours of training to hit the
error;
- with the increased stack size, it took 60 hours to hit it.
I can increase it further, but it's hard to say how much stack it needs, and
since the setting applies to all threads it might waste a lot of memory.

Looking at the different stages in the event timeline, I see that task
deserialization time gradually increases, and at the end it is roughly the
same as the executor computing time.

Code I use to train model:

int MAX_BINS = 16;
int NUM_CLASSES = 0;            // 0 for regression
double MIN_INFO_GAIN = 0.0;
int MAX_MEMORY_IN_MB = 256;
double SUBSAMPLING_RATE = 1.0;
boolean USE_NODEID_CACHE = true;
int CHECKPOINT_INTERVAL = 10;
int RANDOM_SEED = 12345;

int NODE_SIZE = 5;              // minInstancesPerNode
int maxDepth = 30;
int numTrees = 50;

Strategy strategy = new Strategy(Algo.Regression(), Variance.instance(),
    maxDepth, NUM_CLASSES, MAX_BINS, QuantileStrategy.Sort(),
    new scala.collection.immutable.HashMap<>(), NODE_SIZE, MIN_INFO_GAIN,
    MAX_MEMORY_IN_MB, SUBSAMPLING_RATE, USE_NODEID_CACHE,
    CHECKPOINT_INTERVAL);
RandomForestModel model = RandomForest.trainRegressor(
    labeledPoints.rdd(), strategy, numTrees, "auto", RANDOM_SEED);


Any advice would be highly appreciated.

The exception (~3000 lines long):
java.lang.StackOverflowError
    at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2320)
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2333)
    at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2828)
    at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1453)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1512)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at scala.collection.immutable.$colon$colon.readObject(List.scala:366)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

--
Be well!
Jean Morozov


Re: SPARK-13843 Next steps

2016-03-29 Thread Steve Loughran
While Sonatype is utterly strict about the org.apache namespace (it guarantees 
that all such artifacts have come through the ASF release process, ideally 
including code signing), nobody checks the org.apache internals or worries too 
much about them. Note that Spark itself has some bits of code in 
org.apache.hive so as to subclass the thriftserver.

What are the costs of having a project's package used externally?

1. interesting debugging sessions if JARs with conflicting classes are loaded.
2. you can't sign the JARs in the metadata. Nobody does that with the maven 
artifacts anyway.
3. whoever's package name it is often gets to see the stack traces in bug 
reports filed against them.



On 29 Mar 2016, at 01:47, Sean Owen <so...@cloudera.com> wrote:

I tend to agree. If it's going to present a significant technical hurdle and 
the software is clearly non-ASF, e.g. via a different artifact, there's a decent 
argument the namespace should stay. The artifact has to change though, and that 
is what David was referring to in his other message.

On Mon, Mar 28, 2016, 08:33 Cody Koeninger <c...@koeninger.org> wrote:
I really think the only thing that should have to change is the maven
group and identifier, not the java namespace.

There are compatibility problems with the java namespace changing
(e.g. access to private[spark]), and I don't think that someone who
takes the time to change their build file to download a maven artifact
without "apache" in the identifier is at significant risk of consumer
confusion.

I've tried to get a straight answer from ASF trademarks on this point,
but the answers I've been getting are mixed, and personally disturbing
to me in terms of over-reaching.

On Sat, Mar 26, 2016 at 9:03 AM, Sean Owen <so...@cloudera.com> wrote:
> Looks like this is done; docs have been moved, flume is back in, etc.
>
> For the moment Kafka streaming is still in the project and I know
> there's still discussion about how to manage multiple versions within
> the project.
>
> One other thing we need to finish up is stuff like the namespace of
> the code that was moved out. I believe it'll have to move out of the
> org.apache namespace as well as change its artifact group. At least,
> David indicated Sonatype wouldn't let someone non-ASF push an artifact
> from that group anyway.
>
> Also might be worth adding a description at
> https://github.com/spark-packages explaining that these are just some
> unofficial Spark-related packages.
>
> On Tue, Mar 22, 2016 at 7:27 AM, Kostas Sakellis <kos...@cloudera.com> wrote:
>> Hello all,
>>
>> I'd like to close out the discussion on SPARK-13843 by getting a poll from
>> the community on which components we should seriously reconsider re-adding
>> back to Apache Spark. For reference, here are the modules that were removed
>> as part of SPARK-13843 and pushed to: https://github.com/spark-packages
>>
>> streaming-flume
>> streaming-akka
>> streaming-mqtt
>> streaming-zeromq
>> streaming-twitter
>>
>> For us, we'd like to see the streaming-flume added back to Apache Spark.
>>
>> Thanks,
>> Kostas
>
> -
> To unsubscribe, e-mail: 
> dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: 
> dev-h...@spark.apache.org
>



Understanding PySpark Internals

2016-03-29 Thread Adam Roberts
Hi, I'm interested in figuring out how the Python API for Spark works. 
I've come to the following conclusions and want to share them with the 
community; they could be of use in the PySpark docs, specifically the 
"Execution and pipelining" part.

Any sanity checking would be much appreciated. Here's the trivial Python 
example I've traced:
from pyspark import SparkContext
sc = SparkContext("local[1]", "Adam test")
sc.setCheckpointDir("foo checkpoint dir")

Added this JVM option:
export 
IBM_JAVA_OPTIONS="-Xtrace:methods={org/apache/spark/*,py4j/*},print=mt"

I added prints in py4j-java/src/py4j/commands/CallCommand.java, specifically 
in the execute method, then built and replaced the existing class in the 
py4j 0.9 jar in my Spark assembly jar. Example output:
In execute for CallCommand, commandName: c
target object id: o0
methodName: get

I'll launch the Spark application with:
$SPARK_HOME/bin/spark-submit --master local[1] Adam.py > checkme.txt 2>&1

I've quickly put together the following WIP diagram of what I think is 
happening:
http://postimg.org/image/nihylmset/

To summarise, I think:
- We're heavily using reflection (as evidenced by Py4j's ReflectionEngine 
and MethodInvoker classes) to invoke Spark's API in a JVM from Python.
- There's an agreed protocol (in Py4j's Protocol.java) for handling 
commands: said commands are exchanged using a local socket between Python 
and our JVM (the driver based on the docs, not the master).
- The Spark API is accessible by means of commands exchanged using said 
socket with the agreed protocol.
- Commands are read/written using BufferedReader/Writer.
- Type conversion is also performed from Python to Java (not looked at in 
detail yet).
- We keep track of the objects, with, for example, o0 representing the first 
object we know about.

Does this sound correct?
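
To make the moving parts concrete, here is a tiny standalone Py4J sketch of 
the pattern I believe PySpark builds on (nothing Spark-specific, and the class 
names are mine):

import py4j.GatewayServer

// An arbitrary entry point whose methods become callable from Python.
class EntryPoint {
  def greet(name: String): String = s"hello, $name"
}

object MiniGateway {
  def main(args: Array[String]): Unit = {
    // Listens on the default Py4J port (25333) for commands sent over a local socket.
    val server = new GatewayServer(new EntryPoint())
    server.start()
  }
}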

I've only checked the trace output in local mode; I'm curious what happens 
when we're running in standalone mode (I didn't see a Python interpreter 
appearing on all workers in order to process partitions of data, so I assume 
in standalone mode we use Python solely as an orchestrator - the driver - and 
not as an executor for distributed computing?).

Happy to provide the full trace output on request (I omitted timestamps and 
logging info and added spacing). I expect there's an O*JDK method-tracing 
equivalent, so the above can easily be reproduced regardless of Java 
vendor.

Cheers,


Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-29 Thread Koert Kuipers
if Scala prior to 2.10.4 didn't support Java 8, does that mean that 3rd
party Scala libraries compiled with a Scala version < 2.10.4 might not work
on Java 8?


On Mon, Mar 28, 2016 at 7:06 PM, Kostas Sakellis 
wrote:

> Also, +1 on dropping jdk7 in Spark 2.0.
>
> Kostas
>
> On Mon, Mar 28, 2016 at 2:01 PM, Marcelo Vanzin 
> wrote:
>
>> Finally got some internal feedback on this, and we're ok with
>> requiring people to deploy jdk8 for 2.0, so +1 too.
>>
>> On Mon, Mar 28, 2016 at 1:15 PM, Luciano Resende 
>> wrote:
>> > +1, I also checked with a few projects inside IBM that consume Spark and
>> they
>> > seem to be ok with the direction of dropping JDK 7.
>> >
>> > On Mon, Mar 28, 2016 at 11:24 AM, Michael Gummelt <
>> mgumm...@mesosphere.io>
>> > wrote:
>> >>
>> >> +1 from Mesosphere
>> >>
>> >> On Mon, Mar 28, 2016 at 5:12 AM, Steve Loughran <
>> ste...@hortonworks.com>
>> >> wrote:
>> >>>
>> >>>
>> >>> > On 25 Mar 2016, at 01:59, Mridul Muralidharan 
>> wrote:
>> >>> >
>> >>> > Removing compatibility (with jdk, etc) can be done with a major
>> >>> > release- given that 7 has been EOLed a while back and is now
>> unsupported, we
>> >>> > have to decide if we drop support for it in 2.0 or 3.0 (2+ years
>> from now).
>> >>> >
>> >>> > Given the functionality & performance benefits of going to jdk8,
>> future
>> >>> > enhancements relevant in 2.x timeframe ( scala, dependencies) which
>> requires
>> >>> > it, and simplicity wrt code, test & support it looks like a good
>> checkpoint
>> >>> > to drop jdk7 support.
>> >>> >
>> >>> > As already mentioned in the thread, existing yarn clusters are
>> >>> > unaffected if they want to continue running jdk7 and yet use spark2
>> (install
>> >>> > jdk8 on all nodes and use it via JAVA_HOME, or worst case
>> distribute jdk8 as
>> >>> > archive - suboptimal).
>> >>>
>> >>> you wouldn't want to dist it as an archive; it's not just the
>> binaries,
>> >>> it's the install phase. And you'd better remember to put the JCE jar
>> in on
>> >>> top of the JDK for kerberos to work.
>> >>>
>> >>> setting up environment vars to point to JDK8 in the launched
>> >>> app/container avoids that. Yes, the ops team do need to install java,
>> but if
>> >>> you offer them the choice of "installing a centrally managed Java" and
>> >>> "having my code try and install it", they should go for the managed
>> option.
>> >>>
>> >>> One thing to consider for 2.0 is to make it easier to set up those env
>> >>> vars for both python and java. And, as the techniques for mixing JDK
>> >>> versions is clearly not that well known, documenting it.
>> >>>
>> >>> > (FWIW I've done code which even uploads its own hadoop-* JAR, but
>> what
>> >>> gets you is changes in the hadoop-native libs; you do need to get the
>> PATH
>> >>> var spot on)
>> >>>
>> >>>
>> >>> > I am unsure about mesos (standalone might be easier upgrade I guess
>> ?).
>> >>> >
>> >>> >
>> >>> > Proposal is for 1.6x line to continue to be supported with critical
>> >>> > fixes; newer features will require 2.x and so jdk8
>> >>> >
>> >>> > Regards
>> >>> > Mridul
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>> -
>> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >>> For additional commands, e-mail: dev-h...@spark.apache.org
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Michael Gummelt
>> >> Software Engineer
>> >> Mesosphere
>> >
>> >
>> >
>> >
>> > --
>> > Luciano Resende
>> > http://twitter.com/lresende1975
>> > http://lresende.blogspot.com/
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Fwd: Master options Cluster/Client discrepancies.

2016-03-29 Thread satyajit vegesna
Hi All,

I have written a Spark program on my dev box:
   IDE: IntelliJ
   Scala version: 2.11.7
   Spark version: 1.6.1

It runs fine from the IDE when I provide proper input and output paths,
including the master.

But when I try to deploy the code on my cluster, which is made up of:

   Spark version: 1.6.1, built from the source package using Scala 2.11
   (however, when I run spark-shell on the cluster, it reports Scala 2.10.5)
   Hadoop YARN cluster: 2.6.0

and with the additional options

--executor-memory
--total-executor-cores
--deploy-mode cluster/client
--master yarn

I get:

Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
at com.movoto.SparkPost$.main(SparkPost.scala:36)
at com.movoto.SparkPost.main(SparkPost.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I understand this to be a Scala version issue, as I have faced this before.

Is there something I should change or try to get the same program running on
the cluster?
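
Would pinning the build to the cluster's Scala binary version be the right
direction? A minimal build.sbt sketch (versions are illustrative, assuming the
cluster's Spark really is on Scala 2.10 as spark-shell reports):

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided"
)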

Regards,
Satyajit.


Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-29 Thread Reynold Xin
They work.


On Tue, Mar 29, 2016 at 10:01 AM, Koert Kuipers  wrote:

> if Scala prior to 2.10.4 didn't support Java 8, does that mean that
> 3rd party Scala libraries compiled with a Scala version < 2.10.4 might not
> work on Java 8?
>
>
> On Mon, Mar 28, 2016 at 7:06 PM, Kostas Sakellis 
> wrote:
>
>> Also, +1 on dropping jdk7 in Spark 2.0.
>>
>> Kostas
>>
>> On Mon, Mar 28, 2016 at 2:01 PM, Marcelo Vanzin 
>> wrote:
>>
>>> Finally got some internal feedback on this, and we're ok with
>>> requiring people to deploy jdk8 for 2.0, so +1 too.
>>>
>>> On Mon, Mar 28, 2016 at 1:15 PM, Luciano Resende 
>>> wrote:
>>> > +1, I also checked with a few projects inside IBM that consume Spark and
>>> they
>>> > seem to be ok with the direction of dropping JDK 7.
>>> >
>>> > On Mon, Mar 28, 2016 at 11:24 AM, Michael Gummelt <
>>> mgumm...@mesosphere.io>
>>> > wrote:
>>> >>
>>> >> +1 from Mesosphere
>>> >>
>>> >> On Mon, Mar 28, 2016 at 5:12 AM, Steve Loughran <
>>> ste...@hortonworks.com>
>>> >> wrote:
>>> >>>
>>> >>>
>>> >>> > On 25 Mar 2016, at 01:59, Mridul Muralidharan 
>>> wrote:
>>> >>> >
>>> >>> > Removing compatibility (with jdk, etc) can be done with a major
>>> >>> > release- given that 7 has been EOLed a while back and is now
>>> unsupported, we
>>> >>> > have to decide if we drop support for it in 2.0 or 3.0 (2+ years
>>> from now).
>>> >>> >
>>> >>> > Given the functionality & performance benefits of going to jdk8,
>>> future
>>> >>> > enhancements relevant in 2.x timeframe ( scala, dependencies)
>>> which requires
>>> >>> > it, and simplicity wrt code, test & support it looks like a good
>>> checkpoint
>>> >>> > to drop jdk7 support.
>>> >>> >
>>> >>> > As already mentioned in the thread, existing yarn clusters are
>>> >>> > unaffected if they want to continue running jdk7 and yet use
>>> spark2 (install
>>> >>> > jdk8 on all nodes and use it via JAVA_HOME, or worst case
>>> distribute jdk8 as
>>> >>> > archive - suboptimal).
>>> >>>
>>> >>> you wouldn't want to dist it as an archive; it's not just the
>>> binaries,
>>> >>> it's the install phase. And you'd better remember to put the JCE jar
>>> in on
>>> >>> top of the JDK for kerberos to work.
>>> >>>
>>> >>> setting up environment vars to point to JDK8 in the launched
>>> >>> app/container avoids that. Yes, the ops team do need to install
>>> java, but if
>>> >>> you offer them the choice of "installing a centrally managed Java"
>>> and
>>> >>> "having my code try and install it", they should go for the managed
>>> option.
>>> >>>
>>> >>> One thing to consider for 2.0 is to make it easier to set up those
>>> env
>>> >>> vars for both python and java. And, as the techniques for mixing JDK
>>> >>> versions is clearly not that well known, documenting it.
>>> >>>
>>> >>> (FWIW I've done code which even uploads its own hadoop-* JAR, but
>>> what
>>> >>> gets you is changes in the hadoop-native libs; you do need to get
>>> the PATH
>>> >>> var spot on)
>>> >>>
>>> >>>
>>> >>> > I am unsure about mesos (standalone might be easier upgrade I
>>> guess ?).
>>> >>> >
>>> >>> >
>>> >>> > Proposal is for 1.6x line to continue to be supported with critical
>>> >>> > fixes; newer features will require 2.x and so jdk8
>>> >>> >
>>> >>> > Regards
>>> >>> > Mridul
>>> >>> >
>>> >>> >
>>> >>>
>>> >>>
>>> >>> -
>>> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> >>> For additional commands, e-mail: dev-h...@spark.apache.org
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Michael Gummelt
>>> >> Software Engineer
>>> >> Mesosphere
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Luciano Resende
>>> > http://twitter.com/lresende1975
>>> > http://lresende.blogspot.com/
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>


Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-29 Thread Tom Graves
+1.
Tom 

On Tuesday, March 29, 2016 1:17 PM, Reynold Xin  wrote:

They work.

On Tue, Mar 29, 2016 at 10:01 AM, Koert Kuipers  wrote:

if Scala prior to 2.10.4 didn't support Java 8, does that mean that 3rd 
party Scala libraries compiled with a Scala version < 2.10.4 might not work on 
Java 8?


On Mon, Mar 28, 2016 at 7:06 PM, Kostas Sakellis  wrote:

Also, +1 on dropping jdk7 in Spark 2.0. 
Kostas
On Mon, Mar 28, 2016 at 2:01 PM, Marcelo Vanzin  wrote:

Finally got some internal feedback on this, and we're ok with
requiring people to deploy jdk8 for 2.0, so +1 too.

On Mon, Mar 28, 2016 at 1:15 PM, Luciano Resende  wrote:
> +1, I also checked with a few projects inside IBM that consume Spark and they
> seem to be ok with the direction of dropping JDK 7.
>
> On Mon, Mar 28, 2016 at 11:24 AM, Michael Gummelt 
> wrote:
>>
>> +1 from Mesosphere
>>
>> On Mon, Mar 28, 2016 at 5:12 AM, Steve Loughran 
>> wrote:
>>>
>>>
>>> > On 25 Mar 2016, at 01:59, Mridul Muralidharan  wrote:
>>> >
>>> > Removing compatibility (with jdk, etc) can be done with a major
>>> > release- given that 7 has been EOLed a while back and is now unsupported, 
>>> > we
>>> > have to decide if we drop support for it in 2.0 or 3.0 (2+ years from 
>>> > now).
>>> >
>>> > Given the functionality & performance benefits of going to jdk8, future
>>> > enhancements relevant in 2.x timeframe ( scala, dependencies) which 
>>> > requires
>>> > it, and simplicity wrt code, test & support it looks like a good 
>>> > checkpoint
>>> > to drop jdk7 support.
>>> >
>>> > As already mentioned in the thread, existing yarn clusters are
>>> > unaffected if they want to continue running jdk7 and yet use spark2 
>>> > (install
>>> > jdk8 on all nodes and use it via JAVA_HOME, or worst case distribute jdk8 
>>> > as
>>> > archive - suboptimal).
>>>
>>> you wouldn't want to dist it as an archive; it's not just the binaries,
>>> it's the install phase. And you'd better remember to put the JCE jar in on
>>> top of the JDK for kerberos to work.
>>>
>>> setting up environment vars to point to JDK8 in the launched
>>> app/container avoids that. Yes, the ops team do need to install java, but if
>>> you offer them the choice of "installing a centrally managed Java" and
>>> "having my code try and install it", they should go for the managed option.
>>>
>>> One thing to consider for 2.0 is to make it easier to set up those env
>>> vars for both python and java. And, as the techniques for mixing JDK
>>> versions is clearly not that well known, documenting it.
>>>
>>> > (FWIW I've done code which even uploads its own hadoop-* JAR, but what
>>> gets you is changes in the hadoop-native libs; you do need to get the PATH
>>> var spot on)
>>>
>>>
>>> > I am unsure about mesos (standalone might be easier upgrade I guess ?).
>>> >
>>> >
>>> > Proposal is for 1.6x line to continue to be supported with critical
>>> > fixes; newer features will require 2.x and so jdk8
>>> >
>>> > Regards
>>> > Mridul
>>> >
>>> >
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/



--
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org


Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-29 Thread Joseph Bradley
This is great feedback to hear.  I think there was discussion about moving
Pipelines outside of ML at some point, but I'll have to spend more time to
dig it up.

In the meantime, I thought I'd mention this JIRA here in case people have
feedback:
https://issues.apache.org/jira/browse/SPARK-14033
--> It's about merging the concepts of Estimator and Model.  It would be a
breaking change in 2.0, but it would help to simplify the API and reduce
code duplication.

Regarding making shared params public:
https://issues.apache.org/jira/browse/SPARK-7146
--> I'd like to do this for 2.0, though maybe not for all shared params

Joseph

On Mon, Mar 28, 2016 at 12:49 AM, Michał Zieliński <
zielinski.mich...@gmail.com> wrote:

> Hi Maciej,
>
> Absolutely. We had to copy HasInputCol/s, HasOutputCol/s (along with a
> couple of others like HasProbabilityCol) to our repo, which for most
> use cases is good enough, but for some (e.g. operating on any Transformer
> that accepts either our or Spark's HasInputCol) it makes the code clunky.
> Opening those traits to the public would be a big gain.
>
> Thanks,
> Michal
>
> On 28 March 2016 at 07:44, Jacek Laskowski  wrote:
>
>> Hi,
>>
>> I've never developed any custom Transformer (or UnaryTransformer in
>> particular), but I'd be for it if that's the case.
>>
>> Jacek
>> 28.03.2016 6:54 AM "Maciej Szymkiewicz" 
>> napisał(a):
>>
>>> Hi Jacek,
>>>
>>> In this context, don't you think it would be useful if at least some
>>> traits from org.apache.spark.ml.param.shared.sharedParams were
>>> public? HasInputCol(s) and HasOutputCol, for example. These are useful
>>> pretty much every time you create a custom Transformer.
>>>
>>> --
>>> Pozdrawiam,
>>> Maciej Szymkiewicz
>>>
>>>
>>> On 03/26/2016 10:26 AM, Jacek Laskowski wrote:
>>> > Hi Joseph,
>>> >
>>> > Thanks for the response. I'm one who doesn't understand all the
>>> > hype/need for Machine Learning...yet and through Spark ML(lib) glasses
>>> > I'm looking at ML space. In the meantime I've got few assignments (in
>>> > a project with Spark and Scala) that have required quite extensive
>>> > dataset manipulation.
>>> >
>>> > It was when I sank into using DataFrame/Dataset for data
>>> > manipulation not RDD (I remember talking to Brian about how RDD is an
>>> > "assembly" language comparing to the higher-level concept of
>>> > DataFrames with Catalysts and other optimizations). After few days
>>> > with DataFrame I learnt he was so right! (sorry Brian, it took me
>>> > longer to understand your point).
>>> >
>>> > I started using DataFrames in far more places than one could ever
>>> > accept :-) I was so...carried away with DataFrames (esp. show vs
>>> > foreach(println) and UDFs via udf() function)
>>> >
>>> > And then, when I moved to Pipeline API and discovered Transformers.
>>> > And PipelineStage that can create pipelines of DataFrame manipulation.
>>> > They read so well that I'm pretty sure people would love using them
>>> > more often, but...they belong to MLlib so they are part of ML space
>>> > (not many devs tackled yet). I applied the approach to using
>>> > withColumn to have better debugging experience (if I ever need it). I
>>> > learnt it after having watched your presentation about Pipeline API.
>>> > It was so helpful in my RDD/DataFrame space.
>>> >
>>> > So, to promote a more extensive use of Pipelines, PipelineStages, and
>>> > Transformers, I was thinking about moving that part to SQL/DataFrame
>>> > API where they really belong. If not, I think people might miss the
>>> > beauty of the very fine and so helpful Transformers.
>>> >
>>> > Transformers are *not* a ML thing -- they are DataFrame thing and
>>> > should be where they really belong (for their greater adoption).
>>> >
>>> > What do you think?
>>> >
>>> >
>>> > Pozdrawiam,
>>> > Jacek Laskowski
>>> > 
>>> > https://medium.com/@jaceklaskowski/
>>> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> > Follow me at https://twitter.com/jaceklaskowski
>>> >
>>> >
>>> > On Sat, Mar 26, 2016 at 3:23 AM, Joseph Bradley 
>>> wrote:
>>> >> There have been some comments about using Pipelines outside of ML,
>>> but I
>>> >> have not yet seen a real need for it.  If a user does want to use
>>> Pipelines
>>> >> for non-ML tasks, they still can use Transformers + PipelineModels.
>>> Will
>>> >> that work?
>>> >>
>>> >> On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski 
>>> wrote:
>>> >>> Hi,
>>> >>>
>>> >>> After few weeks with spark.ml now, I came to conclusion that
>>> >>> Transformer concept from Pipeline API (spark.ml/MLlib) should be
>>> part
>>> >>> of DataFrame (SQL) where they fit better. Are there any plans to
>>> >>> migrate Transformer API (ML) to DataFrame (SQL)?
>>> >>>
>>> >>> Pozdrawiam,
>>> >>> Jacek Laskowski
>>> >>> 
>>> >>> https://medium.com/@jaceklaskowski/
>>> >>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> >>> Follow me at https://twitter.com/jaceklaskowski
>>> >>>
>>> >>> -

Any documentation on Spark's security model beyond YARN?

2016-03-29 Thread Michael Segel
Hi, 

So yeah, I know that Spark jobs running on a Hadoop cluster will inherit its 
security from the underlying YARN job. 
However… that’s not really saying much when you think about some use cases. 

Like using the thrift service … 

I’m wondering what else is new and what people have been thinking about how to 
enhance spark’s security. 

Thx

-Mike


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



aggregateByKey on PairRDD

2016-03-29 Thread Suniti Singh
Hi All,

I have an RDD with data in the following form:

tempRDD: RDD[(String, (String, String))]

(brand , (product, key))

("amazon",("book1","tech"))

("eBay",("book1","tech"))

("barns&noble",("book","tech"))

("amazon",("book2","tech"))


I would like to group the data by brand and get the result set in the
following format:

resultSetRDD: RDD[(String, List[(String, String)])]

I tried using aggregateByKey but I'm not quite getting how to achieve
this. Or is there any other way to achieve it?

val resultSetRDD = tempRDD.aggregateByKey("")(
  { case (aggr, value) => aggr + String.valueOf(value) + "," },
  (aggr1, aggr2) => aggr1 + aggr2)

resultSetRDD = (amazon,("book1","tech"),("book2","tech"))
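
Would something like the following be the right direction? (An untested sketch
that keeps the values as a list of tuples rather than a concatenated string:)

val resultSetRDD: RDD[(String, List[(String, String)])] =
  tempRDD.aggregateByKey(List.empty[(String, String)])(
    (acc, value) => value :: acc,    // add a value to the per-partition accumulator
    (acc1, acc2) => acc1 ::: acc2    // merge accumulators across partitions
  )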

Thanks,

Suniti


Re: Null pointer exception when using com.databricks.spark.csv

2016-03-29 Thread Hyukjin Kwon
Hi,

I guess this is not a CSV-datasource specific problem.

Does loading any file (e.g. textFile()) work as well?
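
For example, a quick check (the path is just illustrative):

sc.textFile("C:/tmp/any-text-file.txt").count()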

I think this is related to this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html


2016-03-30 12:44 GMT+09:00 Selvam Raman :

> Hi,
>
> I am using the Spark 1.6.0 build prebuilt for Hadoop 2.6.0 on my Windows machine.
>
> I was trying to use the Databricks CSV format to read a CSV file. I used the
> below command.
>
> [image: Inline image 1]
>
> I got a null pointer exception. Any help would be greatly appreciated.
>
> [image: Inline image 2]
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: Null pointer exception when using com.databricks.spark.csv

2016-03-29 Thread Akhil Das
Looks like winutils.exe is missing from the environment; see
https://issues.apache.org/jira/browse/SPARK-2356
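
A rough sketch of the usual workaround (the path is illustrative; hadoop.home.dir
should point at a directory that contains bin\winutils.exe, and it needs to be
set before the SparkContext is created):

System.setProperty("hadoop.home.dir", "C:\\hadoop")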

Thanks
Best Regards

On Wed, Mar 30, 2016 at 10:44 AM, Selvam Raman  wrote:

> Hi,
>
> I am using the Spark 1.6.0 build prebuilt for Hadoop 2.6.0 on my Windows machine.
>
> I was trying to use the Databricks CSV format to read a CSV file. I used the
> below command.
>
> [image: Inline image 1]
>
> I got a null pointer exception. Any help would be greatly appreciated.
>
> [image: Inline image 2]
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>