Re: yarn-cluster mode throwing NullPointerException

2015-10-12 Thread Venkatakrishnan Sowrirajan
Hi Rachana,


Are you by any chance doing something like this in your code?

"sparkConf.setMaster("yarn-cluster");"

Setting "yarn-cluster" as the master programmatically on the SparkConf/SparkContext is not supported.


I think you are hitting this bug: https://issues.apache.org/jira/browse/SPARK-7504. It was fixed in
Spark 1.4.0, so you can try upgrading to 1.4.0.
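In the meantime, a minimal sketch of the usual workaround (the app name below is just a
placeholder): leave the master out of the code entirely and let
spark-submit --master yarn-cluster supply it.

import org.apache.spark.{SparkConf, SparkContext}

// No setMaster(...) here; the master comes from spark-submit instead.
val sparkConf = new SparkConf().setAppName("KafkaURLStreaming")
val sc = new SparkContext(sparkConf)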

Regards
Venkata krishnan

On Sun, Oct 11, 2015 at 8:49 PM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> I am trying to submit a job using yarn-cluster mode using spark-submit
> command.  My code works fine when I use yarn-client mode.
>
>
>
> *Cloudera Version:*
>
> CDH-5.4.7-1.cdh5.4.7.p0.3
>
>
>
> *Command Submitted:*
>
> spark-submit --class "com.markmonitor.antifraud.ce.KafkaURLStreaming"  \
>
> --driver-java-options
> "-Dlog4j.configuration=file:///etc/spark/myconf/log4j.sample.properties" \
>
> --conf
> "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/myconf/log4j.sample.properties"
> \
>
> --conf
> "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/myconf/log4j.sample.properties"
> \
>
> --num-executors 2 \
>
> --executor-cores 2 \
>
> ../target/mm-XXX-ce-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
>
> yarn-cluster 10 "XXX:2181" "XXX:9092" groups kafkaurl 5 \
>
> "hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/ec2-user/urlFeature.properties"
> \
>
> "hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/ec2-user/urlFeatureContent.properties"
> \
>
> "hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/ec2-user/hdfsOutputNEWScript/OUTPUTYarn2"
> false
>
>
>
>
>
> *Log Details:*
>
> INFO : org.apache.spark.SparkContext - Running Spark version 1.3.0
>
> INFO : org.apache.spark.SecurityManager - Changing view acls to: ec2-user
>
> INFO : org.apache.spark.SecurityManager - Changing modify acls to: ec2-user
>
> INFO : org.apache.spark.SecurityManager - SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(ec2-user);
> users with modify permissions: Set(ec2-user)
>
> INFO : akka.event.slf4j.Slf4jLogger - Slf4jLogger started
>
> INFO : Remoting - Starting remoting
>
> INFO : Remoting - Remoting started; listening on addresses
> :[akka.tcp://sparkdri...@ip-10-0-0-xxx.us-west-2.compute.internal:49579]
>
> INFO : Remoting - Remoting now listens on addresses:
> [akka.tcp://sparkdri...@ip-10-0-0-xxx.us-west-2.compute.internal:49579]
>
> INFO : org.apache.spark.util.Utils - Successfully started service
> 'sparkDriver' on port 49579.
>
> INFO : org.apache.spark.SparkEnv - Registering MapOutputTracker
>
> INFO : org.apache.spark.SparkEnv - Registering BlockManagerMaster
>
> INFO : org.apache.spark.storage.DiskBlockManager - Created local directory
> at
> /tmp/spark-1c805495-c7c4-471d-973f-b1ae0e2c8ff9/blockmgr-fff1946f-a716-40fc-a62d-bacba5b17638
>
> INFO : org.apache.spark.storage.MemoryStore - MemoryStore started with
> capacity 265.4 MB
>
> INFO : org.apache.spark.HttpFileServer - HTTP File server directory is
> /tmp/spark-8ed6f513-854f-4ee4-95ea-87185364eeaf/httpd-75cee1e7-af7a-4c82-a9ff-a124ce7ca7ae
>
> INFO : org.apache.spark.HttpServer - Starting HTTP Server
>
> INFO : org.spark-project.jetty.server.Server - jetty-8.y.z-SNAPSHOT
>
> INFO : org.spark-project.jetty.server.AbstractConnector - Started
> SocketConnector@0.0.0.0:46671
>
> INFO : org.apache.spark.util.Utils - Successfully started service 'HTTP
> file server' on port 46671.
>
> INFO : org.apache.spark.SparkEnv - Registering OutputCommitCoordinator
>
> INFO : org.spark-project.jetty.server.Server - jetty-8.y.z-SNAPSHOT
>
> INFO : org.spark-project.jetty.server.AbstractConnector - Started
> SelectChannelConnector@0.0.0.0:4040
>
> INFO : org.apache.spark.util.Utils - Successfully started service
> 'SparkUI' on port 4040.
>
> INFO : org.apache.spark.ui.SparkUI - Started SparkUI at
> http://ip-10-0-0-XXX.us-west-2.compute.internal:4040
>
> INFO : org.apache.spark.SparkContext - Added JAR
> file:/home/ec2-user/CE/correlationengine/scripts/../target/mm-anti-fraud-ce-0.0.1-SNAPSHOT-jar-with-dependencies.jar
> at
> http://10.0.0.XXX:46671/jars/mm-anti-fraud-ce-0.0.1-SNAPSHOT-jar-with-dependencies.jar
> with timestamp 1444620509463
>
> INFO : org.apache.spark.scheduler.cluster.YarnClusterScheduler - Created
> YarnClusterScheduler
>
> ERROR: org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend -
> Application ID is not set.
>
> INFO : org.apache.spark.network.netty.NettyBlockTransferService - Server
> created on 33880
>
> INFO : org.apache.spark.storage.BlockManagerMaster - Trying to register
> BlockManager
>
> INFO : org.apache.spark.storage.BlockManagerMasterActor - Registering
> block manager ip-10-0-0-XXX.us-west-2.compute.internal:33880 with 265.4 MB
> RAM, BlockManagerId(, ip-10-0-0-XXX.us-west-2.compute.internal,
> 33880)
>
> INFO : org.apache.spark.storage.BlockManagerMaster - Registered
> BlockManager
>
> INFO : org.apache.spark.scheduler.EventLoggingListener - Logging events to
> 

Re: sbt test error -- "Could not reserve enough space"

2015-10-12 Thread Xiao Li
Hi, Robert,

Please check the following link. It might help you.

http://stackoverflow.com/questions/18155325/scala-error-occurred-during-initialization-of-vm-on-ubuntu-12-04
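In short, the JVM that sbt launches is asking for more heap than the machine has, so capping
the heap explicitly usually resolves it. A hedged example (whether your sbt launcher honors
SBT_OPTS is an assumption to verify for Spark's build/sbt script):

SBT_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=256m" build/sbt test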

Good luck,

Xiao Li


2015-10-09 9:41 GMT-07:00 Robert Dodier :

> Hi,
>
> I am trying to build and test the current master. My system is Ubuntu
> 14.04 with 4 GB of physical memory and Oracle Java 8.
>
> I have been running into various out-of-memory errors. I tried
> building with Maven but couldn't get all the way through compile and
> package. I'm having better luck with sbt. At this point build/sbt
> package runs to completion, so that's great.
>
> When I try to run build/sbt test, I get a lot of errors saying: "Could
> not reserve enough space for 3145728KB object heap". Unfortunately 3.1
> G is somewhat larger than the available memory, as reported by 'free'.
> Is there any way to convince sbt that it needs to allocate less
> memory?
>
> I tried build/sbt "test-only
> org.apache.spark.mllib.random.RandomDataGeneratorSuite" (I'm not
> particularly interested in that test, it's just one that I thought
> would be relatively simple) but it seems to do a lot more work than
> just running that one test, and I still get the out-of-memory errors.
>
> Aside from getting a machine with more memory (which is not out of the
> question), are there any strategies for coping with out-of-memory
> errors in Maven and/or sbt?
>
> Thanks in advance for any light you can shed on this problem.
>
> Robert Dodier
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Build spark 1.5.1 branch fails

2015-10-12 Thread Xiao Li
Hi, Chester,

Please check your pom.xml. Your java.version and maven.version might not
match your build environment.

Or use -Denforcer.skip=true on the command line to skip the enforcer check.

Good luck,

Xiao Li

2015-10-08 10:35 GMT-07:00 Chester Chen :

> Question regarding branch-1.5  build.
>
> Noticed that the Spark project no longer publishes the spark-assembly jar. We
> have to build it ourselves (until we find a way to not depend on the assembly
> jar).
>
>
> I checked out the v1.5.1 release tag and used sbt to build it, and
> I get the following error:
>
> build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive
> -Phive-thriftserver -DskipTests clean package assembly
>
>
> [warn] ::
> [warn] ::  UNRESOLVED DEPENDENCIES ::
> [warn] ::
> [warn] :: org.apache.spark#spark-network-common_2.10;1.5.1: configuration
> not public in org.apache.spark#spark-network-common_2.10;1.5.1: 'test'. It
> was required from org.apache.spark#spark-network-shuffle_2.10;1.5.1 test
> [warn] ::
> [warn]
> [warn] Note: Unresolved dependencies path:
> [warn] org.apache.spark:spark-network-common_2.10:1.5.1
> ((com.typesafe.sbt.pom.MavenHelper) MavenHelper.scala#L76)
> [warn]  +- org.apache.spark:spark-network-shuffle_2.10:1.5.1
> [info] Packaging
> /Users/chester/projects/alpine/apache/spark/launcher/target/scala-2.10/spark-launcher_2.10-1.5.1.jar
> ...
> [info] Done packaging.
> [warn] four warnings found
> [warn] Note: Some input files use unchecked or unsafe operations.
> [warn] Note: Recompile with -Xlint:unchecked for details.
> [warn] No main class detected
> [info] Packaging
> /Users/chester/projects/alpine/apache/spark/external/flume-sink/target/scala-2.10/spark-streaming-flume-sink_2.10-1.5.1.jar
> ...
> [info] Done packaging.
> sbt.ResolveException: unresolved dependency:
> org.apache.spark#spark-network-common_2.10;1.5.1: configuration not public
> in org.apache.spark#spark-network-common_2.10;1.5.1: 'test'. It was
> required from org.apache.spark#spark-network-shuffle_2.10;1.5.1 test
>
>
> Somehow the network-shuffle module can't find the test jar it needs (not sure
> why the test configuration is still required, even though -DskipTests is specified).
>
> I tried the Maven command and the build failed as well (without assembly):
>
> mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
> -DskipTests clean package
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-enforcer-plugin:1.4:enforce
> (enforce-versions) on project spark-parent_2.10: Some Enforcer rules have
> failed. Look above for specific messages explaining why the rule failed. ->
> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
>
>
>
> I checked out branch-1.5 and replaced "1.5.2-SNAPSHOT" with "1.5.1", and
> build/sbt still fails (same error as above for sbt).
>
> But if I keep the version string as "1.5.2-SNAPSHOT", the build/sbt works
> fine.
>
>
> Any ideas ?
>
> Chester
>
>
>
>
>
>
>
>


Re: taking the heap dump when an executor goes OOM

2015-10-12 Thread Ted Yu
http://stackoverflow.com/questions/542979/using-heapdumponoutofmemoryerror-parameter-for-heap-dump-for-jboss
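For Spark executors specifically, a sketch (not from the original reply) of passing those JVM
flags through the executor options; the dump directory is an illustrative assumption and must
exist on each worker node:

import org.apache.spark.SparkConf

// Ask executor JVMs to write an hprof dump when they hit an OutOfMemoryError.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/executor-dumps")

The same setting can equally be passed to spark-submit via
--conf "spark.executor.extraJavaOptions=...".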

> On Oct 11, 2015, at 10:45 PM, Niranda Perera  wrote:
> 
> Hi all, 
> 
> is there a way for me to get the heap-dump hprof of an executor jvm, when it 
> goes out of memory? 
> 
> is this currently supported or do I have to change some configurations? 
> 
> cheers 
> 
> -- 
> Niranda 
> @n1r44
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/


Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-12 Thread YiZhi Liu
Hi Joseph,

Thank you for clarifying the motivation for setting up a different API
for ML pipelines; it sounds great. But I still think we could extract
some common parts of the training & inference procedures for ml and
mllib. In ml.classification.LogisticRegression, you simply transform
the DataFrame into an RDD and follow the same procedures as in
mllib.optimization.{LBFGS,OWLQN}, right?

My suggestion, if I may, is that the ml package should focus on the public
API and leave the underlying implementations, e.g. numerical optimization,
to the mllib package.

Please let me know if there is any problem with my understanding. Thank you!
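For readers following the thread, a minimal sketch contrasting the two entry points (assumes a
spark-shell style sc and sqlContext, and a placeholder LibSVM path):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// The same data as an RDD[LabeledPoint] for mllib and as a DataFrame for ml.
val rdd = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
import sqlContext.implicits._
val df = rdd.toDF()

// spark.mllib: RDD-based API, built on the Updater/optimization abstractions.
val mllibModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(rdd)

// spark.ml: DataFrame-based Pipelines API; uses breeze.optimize.{LBFGS,OWLQN} internally.
val mlModel = new LogisticRegression().setMaxIter(100).setRegParam(0.01).fit(df)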

2015-10-08 1:15 GMT+08:00 Joseph Bradley :
> Hi YiZhi Liu,
>
> The spark.ml classes are part of the higher-level "Pipelines" API, which
> works with DataFrames.  When creating this API, we decided to separate it
> from the old API to avoid confusion.  You can read more about it here:
> http://spark.apache.org/docs/latest/ml-guide.html
>
> For (3): We use Breeze, but we have to modify it in order to do distributed
> optimization based on Spark.
>
> Joseph
>
> On Tue, Oct 6, 2015 at 11:47 PM, YiZhi Liu  wrote:
>>
>> Hi everyone,
>>
>> I'm curious about the difference between
>> ml.classification.LogisticRegression and
>> mllib.classification.LogisticRegressionWithLBFGS. Both of them are
>> optimized using LBFGS, the only difference I see is LogisticRegression
>> takes DataFrame while LogisticRegressionWithLBFGS takes RDD.
>>
>> So I wonder,
>> 1. Why not simply add a DataFrame training interface to
>> LogisticRegressionWithLBFGS?
>> 2. Whats the difference between ml.classification and
>> mllib.classification package?
>> 3. Why doesn't ml.classification.LogisticRegression call
>> mllib.optimization.LBFGS / mllib.optimization.OWLQN directly? Instead,
>> it uses breeze.optimize.LBFGS and re-implements most of the procedures
>> in mllib.optimization.{LBFGS,OWLQN}.
>>
>> Thank you.
>>
>> Best,
>>
>> --
>> Yizhi Liu
>> Senior Software Engineer / Data Mining
>> www.mvad.com, Shanghai, China
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>



-- 
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Regarding SPARK JIRA ID-10286

2015-10-12 Thread Jagadeesan A.S.
Hi,

I'm a newbie to the Spark community. For the last three months I have been
working on Spark and its various modules, and I have tried testing Spark with
the spark-perf workbench.

Now, as a step forward, I have started contributing via JIRA.

I took SPARK-10286 and sent a pull request.

Add @since annotation to pyspark.ml.param and pyspark.ml.*
https://github.com/Jetsonpaul/spark/commit/be1c2769c178746ff094abfdbdefe4869f75ba0d
https://github.com/Jetsonpaul/spark/commit/a1a1c62e9f5b520bef7ce50a09fb72ffde955167

Kindly check and review. If I'm doing something wrong, please point me in the
right direction so that I can correct my mistakes going forward.


with regards
Jagadeesan A S


Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?

2015-10-12 Thread Sean Owen
Yeah, was the issue that it had to be built vs Maven to show the error
and this uses SBT -- or vice versa? that's why the existing test
didn't detect it. Was just thinking of adding one more of these non-PR
builds, but I forget if there was a reason this is hard. Certainly not
worth building for each PR.

On Mon, Oct 12, 2015 at 5:16 PM, Patrick Wendell  wrote:
> We already do automated compile testing for Scala 2.11 similar to Hadoop
> versions:
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-master-Scala211-Compile/buildTimeTrend
>
>
> If you look, this build takes 7-10 minutes, so it's a nontrivial increase to
> add it to all new PR's. Also, it's only broken once in the last few months
> (despite many patches going in) - a pretty low failure rate. For scenarios
> like this it's better to test it asynchronously. We can even just revert a
> patch immediately if it's found to break 2.11.
>
> Put another way - we typically have 1000 patches or more per release. Even
> at one jenkins run per patch: 7 minutes * 1000 = 7 days of developer
> productivity loss. Compare that to having a few times where we have to
> revert a patch and ask someone to resubmit (which maybe takes at most one
> hour)... it's not worth it.
>
> - Patrick
>
> On Mon, Oct 12, 2015 at 8:24 AM, Sean Owen  wrote:
>>
>> There are many Jenkins jobs besides the pull request builder that
>> build against various Hadoop combinations, for example, in the
>> background. Is there an obstacle to building vs 2.11 on both Maven and
>> SBT this way?
>>
>> On Mon, Oct 12, 2015 at 2:55 PM, Iulian Dragoș
>>  wrote:
>> > Anything that can be done by a machine should be done by a machine. I am
>> > not
>> > sure we have enough data to say it's only once or twice per release, and
>> > even if we were to issue a PR for each breakage, it's additional load on
>> > committers and reviewers, not to mention our own work. I personally
>> > don't
>> > see how 2-3 minutes of compute time per PR can justify hours of work
>> > plus
>> > reviews.
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?

2015-10-12 Thread Sean Owen
There are many Jenkins jobs besides the pull request builder that
build against various Hadoop combinations, for example, in the
background. Is there an obstacle to building vs 2.11 on both Maven and
SBT this way?

On Mon, Oct 12, 2015 at 2:55 PM, Iulian Dragoș
 wrote:
> Anything that can be done by a machine should be done by a machine. I am not
> sure we have enough data to say it's only once or twice per release, and
> even if we were to issue a PR for each breakage, it's additional load on
> committers and reviewers, not to mention our own work. I personally don't
> see how 2-3 minutes of compute time per PR can justify hours of work plus
> reviews.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [ANNOUNCE] Announcing Spark 1.5.1

2015-10-12 Thread Krishna Sankar
I think the key is to vote on a specific set of source tarballs without any
binary artifacts. The specific binaries are useful but shouldn't be part of
the voting process. That makes sense: we really cannot prove (and have no need
to prove) that the binaries do not contain malware, but the source can be
proven to be clean by inspection, I assume.
Cheers


On Mon, Oct 12, 2015 at 6:56 AM, Tom Graves 
wrote:

> I know there are multiple things being talked about here, but  I agree
> with Patrick here, we vote on the source distribution - src tarball (and of
> course the tag should match).  Perhaps in principle we vote on all the
> other specific binary distributions since they are generated from source
> tarball but that isn't the main thing and I surely don't test and verify
> each one of those.
>
> Tom
>
>
>
> On Monday, October 12, 2015 12:13 AM, Sean Owen 
> wrote:
>
>
> No we are voting on the artifacts being released (too) in principle.
> Although of course the artifacts should be a deterministic function of the
> source at a certain point in time.
> I think the concern is about putting Spark binaries or its dependencies
> into a source release. That should not happen, but it is not what has
> happened here.
>
> On Mon, Oct 12, 2015, 6:03 AM Patrick Wendell  wrote:
>
> Oh I see - yes it's the build/. I always thought release votes related to
> a source tag rather than specific binaries. But maybe we can just fix it in
> 1.5.2 if there is concern about mutating binaries. It seems reasonable to
> me.
>
> For tests... in the past we've tried to avoid having jars inside of the
> source tree, including some effort to generate jars on the fly which a lot
> of our tests use. I am not sure whether it's a firm policy that you can't
> have jars in test folders, though. If it is, we could probably do some
> magic to get rid of these few ones that have crept in.
>
> - Patrick
>
> On Sun, Oct 11, 2015 at 9:57 PM, Sean Owen  wrote:
>
> Agree, but we are talking about the build/ bit right?
> I don't agree that it invalidates the release, which is probably the more
> important idea. As a point of process, you would not want to modify and
> republish the artifact that was already released after being voted on -
> unless it was invalid in which case we spin up 1.5.1.1 or something.
> But that build/ directory should go in future releases.
> I think he is talking about more than this though and the other jars look
> like they are part of tests, and still nothing to do with Spark binaries.
> Those can and should stay.
>
> On Mon, Oct 12, 2015, 5:35 AM Patrick Wendell  wrote:
>
> I think Daniel is correct here. The source artifact incorrectly includes
> jars. It is inadvertent and not part of our intended release process. This
> was something I noticed in Spark 1.5.0 and filed a JIRA and was fixed by
> updating our build scripts to fix it. However, our build environment was
> not using the most current version of the build scripts. See related links:
>
> https://issues.apache.org/jira/browse/SPARK-10511
> https://github.com/apache/spark/pull/8774/files
>
> I can update our build environment and we can repackage the Spark 1.5.1
> source tarball so that it does not include the binary jars.
>
>
> - Patrick
>
> On Sun, Oct 11, 2015 at 8:53 AM, Sean Owen  wrote:
>
> Daniel: we did not vote on a tag. Please again read the VOTE email I
> linked to you:
>
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-1-RC1-tt14310.html#none
>
> among other things, it contains a link to the concrete source (and
> binary) distribution under vote:
>
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>
> You can still examine it, sure.
>
> Dependencies are *not* bundled in the source release. You're again
> misunderstanding what you are seeing. Read my email again.
>
> I am still pretty confused about what the problem is. This is entirely
> business as usual for ASF projects. I'll follow up with you offline if
> you have any more doubts.
>
> On Sun, Oct 11, 2015 at 4:49 PM, Daniel Gruno 
> wrote:
> > Here's my issue:
> >
> > How am I to audit that the dependencies you bundle are in fact what you
> > claim they are?  How do I know they don't contain malware or - in light
> > of recent events - emissions test rigging? ;)
> >
> > I am not interested in a git tag - that means nothing in the ASF voting
> > process, you cannot vote on a tag, only on a release candidate. The VCS
> > in use is irrelevant in this issue. If you can point me to a release
> > candidate archive that was voted upon and does not contain binary
> > applications, all is well.
> >
> > If there is no such thing, and we cannot come to an understanding, I
> > will exercise my ASF Members' rights and bring this to the attention of
> > the board of directors and ask for a clarification of the legality of
> 

Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?

2015-10-12 Thread Patrick Wendell
It's really easy to create and modify those builds. If the issue is that we
need to add SBT or Maven to the existing one, it's a short change. We can
just have it build both of them. I wasn't aware of things breaking before
in one build but not another.

- Patrick

On Mon, Oct 12, 2015 at 9:21 AM, Sean Owen  wrote:

> Yeah, was the issue that it had to be built vs Maven to show the error
> and this uses SBT -- or vice versa? that's why the existing test
> didn't detect it. Was just thinking of adding one more of these non-PR
> builds, but I forget if there was a reason this is hard. Certainly not
> worth building for each PR.
>
> On Mon, Oct 12, 2015 at 5:16 PM, Patrick Wendell 
> wrote:
> > We already do automated compile testing for Scala 2.11 similar to Hadoop
> > versions:
> >
> > https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/
> >
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-master-Scala211-Compile/buildTimeTrend
> >
> >
> > If you look, this build takes 7-10 minutes, so it's a nontrivial
> increase to
> > add it to all new PR's. Also, it's only broken once in the last few
> months
> > (despite many patches going in) - a pretty low failure rate. For
> scenarios
> > like this it's better to test it asynchronously. We can even just revert
> a
> > patch immediately if it's found to break 2.11.
> >
> > Put another way - we typically have 1000 patches or more per release.
> Even
> > at one jenkins run per patch: 7 minutes * 1000 = 7 days of developer
> > productivity loss. Compare that to having a few times where we have to
> > revert a patch and ask someone to resubmit (which maybe takes at most one
> > hour)... it's not worth it.
> >
> > - Patrick
> >
> > On Mon, Oct 12, 2015 at 8:24 AM, Sean Owen  wrote:
> >>
> >> There are many Jenkins jobs besides the pull request builder that
> >> build against various Hadoop combinations, for example, in the
> >> background. Is there an obstacle to building vs 2.11 on both Maven and
> >> SBT this way?
> >>
> >> On Mon, Oct 12, 2015 at 2:55 PM, Iulian Dragoș
> >>  wrote:
> >> > Anything that can be done by a machine should be done by a machine. I
> am
> >> > not
> >> > sure we have enough data to say it's only once or twice per release,
> and
> >> > even if we were to issue a PR for each breakage, it's additional load
> on
> >> > committers and reviewers, not to mention our own work. I personally
> >> > don't
> >> > see how 2-3 minutes of compute time per PR can justify hours of work
> >> > plus
> >> > reviews.
> >
> >
>


Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?

2015-10-12 Thread Iulian Dragoș
On Fri, Oct 9, 2015 at 10:34 PM, Patrick Wendell  wrote:

> I would push back slightly. The reason we have the PR builds taking so
> long is death by a million small things that we add. Doing a full 2.11
> compile is order minutes... it's a nontrivial increase to the build times.
>

We can host the build if there's a way to post back a comment when the
build is broken.


>
> It doesn't seem that bad to me to go back post-hoc once in a while and fix
> 2.11 bugs when they come up. It's on the order of once or twice per release
> and the typesafe guys keep a close eye on it (thanks!). Compare that to
> literally thousands of PR runs and a few minutes every time, IMO it's not
> worth it.
>

Anything that can be done by a machine should be done by a machine. I am
not sure we have enough data to say it's only once or twice per release,
and even if we were to issue a PR for each breakage, it's additional load
on committers and reviewers, not to mention our own work. I personally
don't see how 2-3 minutes of compute time per PR can justify hours of work
plus reviews.

iulian


>
> On Fri, Oct 9, 2015 at 3:31 PM, Hari Shreedharan <
> hshreedha...@cloudera.com> wrote:
>
>> +1, much better than having a new PR each time to fix something for
>> scala-2.11 every time a patch breaks it.
>>
>> Thanks,
>> Hari Shreedharan
>>
>>
>>
>>
>> On Oct 9, 2015, at 11:47 AM, Michael Armbrust 
>> wrote:
>>
>> How about just fixing the warning? I get it; it doesn't stop this from
>>> happening again, but still seems less drastic than tossing out the
>>> whole mechanism.
>>>
>>
>> +1
>>
>> It also does not seem that expensive to test only compilation for Scala
>> 2.11 on PR builds.
>>
>>
>>
>


-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


SparkSQL can not extract values from UDT (like VectorUDT)

2015-10-12 Thread Hao Ren
Hi,

Consider the following code using spark.ml to get the probability column on
a data set:

model.transform(dataSet)
.selectExpr("probability.values")
.printSchema()

Note that "probability" is of `vector` type, which is a UDT with the
following implementation.

class VectorUDT extends UserDefinedType[Vector] {

  override def sqlType: StructType = {
    // type: 0 = sparse, 1 = dense
    // We only use "values" for dense vectors, and "size", "indices", and "values" for sparse
    // vectors. The "values" field is nullable because we might want to add binary vectors later,
    // which uses "size" and "indices", but not "values".
    StructType(Seq(
      StructField("type", ByteType, nullable = false),
      StructField("size", IntegerType, nullable = true),
      StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true),
      StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true)))
  }

  //...

}


`values` is one of its attributes. However, it cannot be extracted.

The first code snippet results in an exception from complexTypeExtractors:

org.apache.spark.sql.AnalysisException: Can't extract value from
probability#743;
  at ...
  at ...
  at ...
...

Here is the code:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L49

It seems that the pattern matching does not take UDT into consideration.

Is this an intended feature? If not, I would like to create a PR to fix it.
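In the meantime, a possible workaround sketch (not from the original mail; it assumes the
probability column holds org.apache.spark.mllib.linalg.Vector values, reusing model and
dataSet from above) is to go through a UDF instead of the UDT's internal fields:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Extract the values via a UDF rather than selectExpr("probability.values").
val vectorValues = udf { v: Vector => v.toArray }
model.transform(dataSet)
  .select(vectorValues(col("probability")).as("values"))
  .printSchema()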

-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France


Re: [ANNOUNCE] Announcing Spark 1.5.1

2015-10-12 Thread Tom Graves
I know there are multiple things being talked about here, but I agree with
Patrick here: we vote on the source distribution - the src tarball (and of course
the tag should match). Perhaps in principle we also vote on all the other specific
binary distributions, since they are generated from the source tarball, but that
isn't the main thing, and I surely don't test and verify each one of those.
Tom 


 On Monday, October 12, 2015 12:13 AM, Sean Owen  wrote:

 No we are voting on the artifacts being released (too) in principle. Although 
of course the artifacts should be a deterministic function of the source at a 
certain point in time. I think the concern is about putting Spark binaries or 
its dependencies into a source release. That should not happen, but it is not 
what has happened here.

On Mon, Oct 12, 2015, 6:03 AM Patrick Wendell  wrote:

Oh I see - yes it's the build/. I always thought release votes related to a 
source tag rather than specific binaries. But maybe we can just fix it in 1.5.2 
if there is concern about mutating binaries. It seems reasonable to me.
For tests... in the past we've tried to avoid having jars inside of the source 
tree, including some effort to generate jars on the fly which a lot of our 
tests use. I am not sure whether it's a firm policy that you can't have jars in 
test folders, though. If it is, we could probably do some magic to get rid of 
these few ones that have crept in.
- Patrick
On Sun, Oct 11, 2015 at 9:57 PM, Sean Owen  wrote:

Agree, but we are talking about the build/ bit right? I don't agree that it 
invalidates the release, which is probably the more important idea. As a point 
of process, you would not want to modify and republish the artifact that was 
already released after being voted on - unless it was invalid in which case we 
spin up 1.5.1.1 or something. But that build/ directory should go in future 
releases. I think he is talking about more than this though and the other jars 
look like they are part of tests, and still nothing to do with Spark binaries. 
Those can and should stay.

On Mon, Oct 12, 2015, 5:35 AM Patrick Wendell  wrote:

I think Daniel is correct here. The source artifact incorrectly includes jars. 
It is inadvertent and not part of our intended release process. This was 
something I noticed in Spark 1.5.0 and filed a JIRA and was fixed by updating 
our build scripts to fix it. However, our build environment was not using the 
most current version of the build scripts. See related links:
https://issues.apache.org/jira/browse/SPARK-10511
https://github.com/apache/spark/pull/8774/files
I can update our build environment and we can repackage the Spark 1.5.1 source 
tarball so that it does not include the binary jars.

- Patrick
On Sun, Oct 11, 2015 at 8:53 AM, Sean Owen  wrote:

Daniel: we did not vote on a tag. Please again read the VOTE email I
linked to you:

http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-1-RC1-tt14310.html#none

among other things, it contains a link to the concrete source (and
binary) distribution under vote:

http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/

You can still examine it, sure.

Dependencies are *not* bundled in the source release. You're again
misunderstanding what you are seeing. Read my email again.

I am still pretty confused about what the problem is. This is entirely
business as usual for ASF projects. I'll follow up with you offline if
you have any more doubts.

On Sun, Oct 11, 2015 at 4:49 PM, Daniel Gruno  wrote:
> Here's my issue:
>
> How am I to audit that the dependencies you bundle are in fact what you
> claim they are?  How do I know they don't contain malware or - in light
> of recent events - emissions test rigging? ;)
>
> I am not interested in a git tag - that means nothing in the ASF voting
> process, you cannot vote on a tag, only on a release candidate. The VCS
> in use is irrelevant in this issue. If you can point me to a release
> candidate archive that was voted upon and does not contain binary
> applications, all is well.
>
> If there is no such thing, and we cannot come to an understanding, I
> will exercise my ASF Members' rights and bring this to the attention of
> the board of directors and ask for a clarification of the legality of this.
>
> I find it highly irregular. Perhaps it is something some projects do in
> the Java community, but that doesn't make it permissible in my view.
>
> With regards,
> Daniel.
>
>
> On 10/11/2015 05:42 PM, Sean Owen wrote:
>> Still confused. Why are you saying we didn't vote on an archive? refer
>> to the email I linked, which includes both the git tag and a link to
>> all generated artifacts (also in my email).
>>
>> So, there are two things at play here:
>>
>> First, I am not sure what you mean that a source distro can't have
>> binary files. It's supposed to have the 

Re: Regarding SPARK JIRA ID-10286

2015-10-12 Thread Sean Owen
I don't see that you ever opened a pull request. You just linked to
commits in your branch. Please have a look at
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

On Mon, Oct 12, 2015 at 4:56 PM, Jagadeesan A.S.  wrote:
> Hi,
>
> I'm newbie to SPARK community. Last three months i started to working on
> spark and it's various modules. I tried to test spark with spark-perf
> workbench.
>
> Now one step forward, started to contribute in JIRA ID.
>
> i took SPARK JIRA ID - 10286 and sent pull request.
>
> Add @since annotation to pyspark.ml.param and pyspark.ml.*
>
> https://github.com/Jetsonpaul/spark/commit/be1c2769c178746ff094abfdbdefe4869f75ba0d
> https://github.com/Jetsonpaul/spark/commit/a1a1c62e9f5b520bef7ce50a09fb72ffde955167
>
> Kindly check and give reviews. As well if i'm wrong please direct me in
> correct way, so that i can correct my mistakes in further.
>
>
> with regards
> Jagadeesan A S
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-12 Thread Ulanov, Alexander
Hi Disha,

The problem might be as follows. The data that you have might physically reside 
only on two nodes, and Spark launches data-local tasks. As a result, only two 
workers are used. You might want to force Spark to distribute the data across 
all nodes; however, that does not seem to be worthwhile for this rather small 
dataset.
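If you do want to force a wider distribution anyway, a minimal sketch (reusing the train
DataFrame and trainer from the code quoted below; the partition count of 10 is an
illustrative choice, not a recommendation):

// Repartition the training DataFrame so tasks can be scheduled on more workers.
val distributedTrain = train.repartition(10)
val model = trainer.fit(distributedTrain)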

Best regards, Alexander

From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Sunday, October 11, 2015 9:29 AM
To: Mike Hynes
Cc: dev@spark.apache.org; Ulanov, Alexander
Subject: Re: No speedup in MultiLayerPerceptronClassifier with increase in 
number of cores

Actually I have 5 workers running (1 per physical machine), as displayed by the 
Spark UI on spark://IP_of_the_master:7077. I have entered all the physical 
machines' IPs in a file named slaves in the spark/conf directory and am using the 
script start-all.sh to start the cluster.
My question is: is there a way to control how the tasks are distributed 
among different workers? To my knowledge it is done by Spark automatically and 
is not in our control.

On Sun, Oct 11, 2015 at 9:49 PM, Mike Hynes 
<91m...@gmail.com> wrote:
Having only 2 workers for 5 machines would be your problem: you
probably want 1 worker per physical machine, which entails running the
spark-daemon.sh script to start a worker on those machines.
The partitioning is agnostic to how many executors are available for
running the tasks, so you can't do scalability tests in the manner
you're thinking of by changing the partitioning.

On 10/11/15, Disha Shrivastava 
> wrote:
> Dear Spark developers,
>
> I am trying to study the effect of increasing number of cores ( CPU's) on
> speedup and accuracy ( scalability with spark ANN ) performance for the
> MNIST dataset using ANN implementation provided in the latest spark
> release.
>
> I have formed a cluster of 5 machines with 88 cores in total. The thing
> which is troubling me is that even if I have more than 2 workers in my
> Spark cluster, the job gets divided to only 2 workers (executors), which
> Spark takes by default, and hence it takes the same time. I know we can set
> the number of partitions manually using, say, sc.parallelize(train_data, 10),
> which then divides the data into 10 partitions so that all the workers
> are involved in the computation. I am using the below code:
>
>
> import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
> import org.apache.spark.mllib.util.MLUtils
> import org.apache.spark.sql.Row
>
> // Load training data
> val data = MLUtils.loadLibSVMFile(sc, "data/1_libsvm").toDF()
> // Split the data into train and test
> val splits = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
> val train = splits(0)
> val test = splits(1)
> //val tr=sc.parallelize(train,10);
> // specify layers for the neural network:
> // input layer of size 784 (features), one intermediate layer of size 160,
> // and an output layer of size 10 (classes)
> val layers = Array[Int](784,160,10)
> // create the trainer and set its parameters
> val trainer = new
> MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
> // train the model
> val model = trainer.fit(train)
> // compute precision on the test set
> val result = model.transform(test)
> val predictionAndLabels = result.select("prediction", "label")
> val evaluator = new
> MulticlassClassificationEvaluator().setMetricName("precision")
> println("Precision:" + evaluator.evaluate(predictionAndLabels))
>
> Can you please suggest me how can I ensure that the data/task is divided
> equally to all the worker machines?
>
> Thanks and Regards,
> Disha Shrivastava
> Masters student, IIT Delhi
>

--
Thanks,
Mike



RE: Operations with cached RDD

2015-10-12 Thread Ulanov, Alexander
Thank you, Nitin. This does explain the problem. It seems that the UI should make 
this clearer to the user; otherwise it is simply misleading if you read it 
as is.

From: Nitin Goyal [mailto:nitin2go...@gmail.com]
Sent: Sunday, October 11, 2015 5:57 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Operations with cached RDD

The problem is not that zipWithIndex is executed again. The "groupByKey" triggered 
hash partitioning on your keys, a shuffle happened because of that, and that's 
why you are seeing 2 stages. You can confirm this by clicking on the latter 
"zipWithIndex" stage: its input has "(memory)" written, which means the input 
data has been fetched from memory (your cached RDD).

As far as the lineage/call site is concerned, I think there was a change in Spark 
1.3 which excluded some classes from appearing in the call site (I know that some 
Spark SQL related classes were removed for sure).
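One way to confirm this from the shell (a sketch, not from the original reply) is to look at
the cached RDD's lineage:

val rdd = sc.parallelize(1 to 5, 5).zipWithIndex.cache()
rdd.count()  // materializes the cache
// The lineage still lists the ZippedWithIndexRDD, but its partitions are marked as
// cached, which matches the "(memory)" input shown for the second zipWithIndex stage.
println(rdd.toDebugString)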

Thanks
-Nitin


On Sat, Oct 10, 2015 at 5:05 AM, Ulanov, Alexander 
> wrote:
Dear Spark developers,

I am trying to understand how Spark UI displays operation with the cached RDD.

For example, the following code caches an rdd:
>> val rdd = sc.parallelize(1 to 5, 5).zipWithIndex.cache
>> rdd.count
The Jobs tab shows me that the RDD is evaluated:
: 1 count at <console>:24    2015/10/09 16:15:43    0.4 s    1/1
: 0 zipWithIndex at <console>:21    2015/10/09 16:15:38    0.6 s    1/1
An I can observe this rdd in the Storage tab of Spark UI:
: ZippedWithIndexRDD  Memory Deserialized 1x Replicated

Then I want to make an operation over the cached RDD. I run the following code:
>> val g = rdd.groupByKey()
>> g.count
The Jobs tab shows me a new Job:
: 2 count at <console>:26
Inside this Job there are two stages:
: 3 count at <console>:26    2015/10/09 16:16:18    0.2 s    5/5
: 2 zipWithIndex at <console>:21
It shows that zipWithIndex is executed again. That does not seem reasonable, 
because the rdd is cached and zipWithIndex has already been executed 
previously.

Could you explain why if I perform an operation followed by an action on a 
cached RDD, then the last operation in the lineage of the cached RDD is shown 
to be executed in the Spark UI?


Best regards, Alexander



--
Regards
Nitin Goyal


Flaky Jenkins tests?

2015-10-12 Thread Meihua Wu
Hi Spark Devs,

I recently encountered several cases where Jenkins failed tests that
are supposed to be unrelated to my patch. For example, I made a patch
to the Spark ML Scala API, but some Scala RDD tests failed due to timeout,
or the java_gateway in PySpark failed. Just wondering if these are
isolated cases?

Thanks,

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-12 Thread DB Tsai
Hi Liu,

In ML, even after extracting the data into RDD, the versions between MLib
and ML are quite different. Due to legacy design, in MLlib, we use Updater
for handling regularization, and this layer of abstraction also does
adaptive step size which is only for SGD. In order to get it working with
LBFGS, some hacks were being done here and there, and in Updater, all the
components including intercept are regularized which is not desirable in
many cases. Also, in the legacy design, it's hard for us to do in-place
standardization to improve the convergency rate. As a result, at some
point, we decide to ditch those abstractions, and customize them for each
algorithms. (Even LiR and LoR use different tricks to have better
performance for numerical optimization, so it's hard to share code at that
time. But I can see the point that we have working code now, so it's time
to try to refactor those code to share more.)


Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Oct 12, 2015 at 1:24 AM, YiZhi Liu  wrote:

> Hi Joseph,
>
> Thank you for clarifying the motivation that you setup a different API
> for ml pipelines, it sounds great. But I still think we could extract
> some common parts of the training & inference procedures for ml and
> mllib. In ml.classification.LogisticRegression, you simply transform
> the DataFrame into RDD and follow the same procedures in
> mllib.optimization.{LBFGS,OWLQN}, right?
>
> My suggestion is, if I may, ml package should focus on the public API,
> and leave the underlying implementations, e.g. numerical optimization,
> to mllib package.
>
> Please let me know if my understanding has any problem. Thank you!
>
> 2015-10-08 1:15 GMT+08:00 Joseph Bradley :
> > Hi YiZhi Liu,
> >
> > The spark.ml classes are part of the higher-level "Pipelines" API, which
> > works with DataFrames.  When creating this API, we decided to separate it
> > from the old API to avoid confusion.  You can read more about it here:
> > http://spark.apache.org/docs/latest/ml-guide.html
> >
> > For (3): We use Breeze, but we have to modify it in order to do
> distributed
> > optimization based on Spark.
> >
> > Joseph
> >
> > On Tue, Oct 6, 2015 at 11:47 PM, YiZhi Liu  wrote:
> >>
> >> Hi everyone,
> >>
> >> I'm curious about the difference between
> >> ml.classification.LogisticRegression and
> >> mllib.classification.LogisticRegressionWithLBFGS. Both of them are
> >> optimized using LBFGS, the only difference I see is LogisticRegression
> >> takes DataFrame while LogisticRegressionWithLBFGS takes RDD.
> >>
> >> So I wonder,
> >> 1. Why not simply add a DataFrame training interface to
> >> LogisticRegressionWithLBFGS?
> >> 2. Whats the difference between ml.classification and
> >> mllib.classification package?
> >> 3. Why doesn't ml.classification.LogisticRegression call
> >> mllib.optimization.LBFGS / mllib.optimization.OWLQN directly? Instead,
> >> it uses breeze.optimize.LBFGS and re-implements most of the procedures
> >> in mllib.optimization.{LBFGS,OWLQN}.
> >>
> >> Thank you.
> >>
> >> Best,
> >>
> >> --
> >> Yizhi Liu
> >> Senior Software Engineer / Data Mining
> >> www.mvad.com, Shanghai, China
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>
>
>
> --
> Yizhi Liu
> Senior Software Engineer / Data Mining
> www.mvad.com, Shanghai, China
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Flaky Jenkins tests?

2015-10-12 Thread Ted Yu
You can go to:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN

and see if the test failure(s) you encountered appeared there.

FYI

On Mon, Oct 12, 2015 at 1:24 PM, Meihua Wu 
wrote:

> Hi Spark Devs,
>
> I recently encountered several cases that the Jenkin failed tests that
> are supposed to be unrelated to my patch. For example, I made a patch
> to Spark ML Scala API but some Scala RDD tests failed due to timeout,
> or the java_gateway in PySpark fails. Just wondering if these are
> isolated cases?
>
> Thanks,
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Adding Spark Testing functionality

2015-10-12 Thread Holden Karau
So here is a quick description of the current testing bits (I can expand on
it if people are interested): http://bit.ly/pandaPandaPanda
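For context, here is a minimal sketch (assuming ScalaTest; the suite and test names are
hypothetical) of the per-suite SparkContext boilerplate that spark-testing-base wraps so
users don't have to write it themselves:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class WordCountSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    // A small local context is enough for unit tests.
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
  }

  test("counts words") {
    val counts = sc.parallelize(Seq("a", "b", "a")).countByValue()
    assert(counts("a") === 2)
  }
}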

On Tue, Oct 6, 2015 at 3:49 PM, Holden Karau  wrote:

> I'll put together a google doc and send that out (in the meantime a quick
> guide of sort of how the current package can be used is in the blog post I
> did at
> http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
> )  If people think its better to keep as a package I am of course happy to
> keep doing that. It feels a little strange to have something as core as
> being able to test your code live outside.
>
> On Tue, Oct 6, 2015 at 3:44 PM, Patrick Wendell 
> wrote:
>
>> Hey Holden,
>>
>> It would be helpful if you could outline the set of features you'd
>> imagine being part of Spark in a short doc. I didn't see a README on the
>> existing repo, so it's hard to know exactly what is being proposed.
>>
>> As a general point of process, we've typically avoided merging modules
>> into Spark that can exist outside of the project. A testing utility package
>> that is based on Spark's public API's seems like a really useful thing for
>> the community, but it does seem like a good fit for a package library. At
>> least, this is my first question after taking a look at the project.
>>
>> In any case, getting some high level view of the functionality you
>> imagine would be helpful to give more detailed feedback.
>>
>> - Patrick
>>
>> On Tue, Oct 6, 2015 at 3:12 PM, Holden Karau 
>> wrote:
>>
>>> Hi Spark Devs,
>>>
>>> So this has been brought up a few times before, and generally on the
>>> user list people get directed to use spark-testing-base. I'd like to start
>>> moving some of spark-testing-base's functionality into Spark so that people
>>> don't need a library to do what is (hopefully :p) a very common requirement
>>> across all Spark projects.
>>>
>>> To that end I was wondering what peoples thoughts are on where this
>>> should live inside of Spark. I was thinking it could either be a separate
>>> testing project (like sql or similar), or just put the bits to enable
>>> testing inside of each relevant project.
>>>
>>> I was also thinking it probably makes sense to only move the unit
>>> testing parts at the start and leave things like integration testing in a
>>> testing project since that could vary depending on the users environment.
>>>
>>> What are peoples thoughts?
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
> Linked In: https://www.linkedin.com/in/holdenkarau
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau


Re: Flaky Jenkins tests?

2015-10-12 Thread Meihua Wu
Hi Ted,

Thanks for the info. I have checked but I did not find the failures though.

In my cases, I have seen

1) spilling in ExternalAppendOnlyMapSuite failed due to timeout.
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43531/console]

2) pySpark failure
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43553/console]

Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 316, in _get_connection
IndexError: pop from an empty deque



On Mon, Oct 12, 2015 at 1:36 PM, Ted Yu  wrote:
> You can go to:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN
>
> and see if the test failure(s) you encountered appeared there.
>
> FYI
>
> On Mon, Oct 12, 2015 at 1:24 PM, Meihua Wu 
> wrote:
>>
>> Hi Spark Devs,
>>
>> I recently encountered several cases that the Jenkin failed tests that
>> are supposed to be unrelated to my patch. For example, I made a patch
>> to Spark ML Scala API but some Scala RDD tests failed due to timeout,
>> or the java_gateway in PySpark fails. Just wondering if these are
>> isolated cases?
>>
>> Thanks,
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Live UI

2015-10-12 Thread Jakob Odersky
Hi everyone,
I am just getting started working on spark and was thinking of a first way
to contribute whilst still trying to wrap my head around the codebase.

Exploring the web UI, I noticed it is a classic request-response website,
requiring manual refresh to get the latest data.
I think it would be great to have a "live" website where data would be
displayed real-time without the need to hit the refresh button. I would be
very interested in contributing this feature if it is acceptable.

Specifically, I was thinking of using websockets with a ScalaJS front-end.
Please let me know if this design would be welcome or if it introduces
unwanted dependencies, I'll be happy to discuss this further in detail.

thanks for your feedback,
--Jakob


Re: Live UI

2015-10-12 Thread Ryan Williams
Yea, definitely check out Spree! It functions as a "live" UI, history server,
and archival storage of event log data.

There are pros and cons to building something like it in Spark trunk (and
running it in the Spark driver, presumably) that I've spent a lot of time
thinking about and am happy to talk through (here, offline, or in the Spree
gitter room) if you want to go that route.


On Mon, Oct 12, 2015 at 5:36 PM Jakob Odersky  wrote:

> Hi everyone,
> I am just getting started working on spark and was thinking of a first way
> to contribute whilst still trying to wrap my head around the codebase.
>
> Exploring the web UI, I noticed it is a classic request-response website,
> requiring manual refresh to get the latest data.
> I think it would be great to have a "live" website where data would be
> displayed real-time without the need to hit the refresh button. I would be
> very interested in contributing this feature if it is acceptable.
>
> Specifically, I was thinking of using websockets with a ScalaJS front-end.
> Please let me know if this design would be welcome or if it introduces
> unwanted dependencies, I'll be happy to discuss this further in detail.
>
> thanks for your feedback,
> --Jakob
>


Re: Flaky Jenkins tests?

2015-10-12 Thread Ted Yu
Can you re-submit your PR to trigger a new build - assuming the tests are
flaky ?

If any test fails again, consider contacting the owner of the module for
expert opinion.

Cheers

On Mon, Oct 12, 2015 at 2:07 PM, Meihua Wu 
wrote:

> Hi Ted,
>
> Thanks for the info. I have checked but I did not find the failures though.
>
> In my cases, I have seen
>
> 1) spilling in ExternalAppendOnlyMapSuite failed due to timeout.
> [
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43531/console
> ]
>
> 2) pySpark failure
> [
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43553/console
> ]
>
> Traceback (most recent call last):
>   File
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 316, in _get_connection
> IndexError: pop from an empty deque
>
>
>
> On Mon, Oct 12, 2015 at 1:36 PM, Ted Yu  wrote:
> > You can go to:
> > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN
> >
> > and see if the test failure(s) you encountered appeared there.
> >
> > FYI
> >
> > On Mon, Oct 12, 2015 at 1:24 PM, Meihua Wu  >
> > wrote:
> >>
> >> Hi Spark Devs,
> >>
> >> I recently encountered several cases that the Jenkin failed tests that
> >> are supposed to be unrelated to my patch. For example, I made a patch
> >> to Spark ML Scala API but some Scala RDD tests failed due to timeout,
> >> or the java_gateway in PySpark fails. Just wondering if these are
> >> isolated cases?
> >>
> >> Thanks,
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >>
> >
>


Re: Flaky Jenkins tests?

2015-10-12 Thread Ted Yu
Josh:
We're on the same page.

I used the term 're-submit your PR', which is different from opening a new PR.

On Mon, Oct 12, 2015 at 2:47 PM, Personal  wrote:

> Just ask Jenkins to retest; no need to open a new PR just to re-trigger
> the build.
>
>
> On October 12, 2015 at 2:45:13 PM, Ted Yu (yuzhih...@gmail.com) wrote:
>
> Can you re-submit your PR to trigger a new build - assuming the tests are
> flaky ?
>
> If any test fails again, consider contacting the owner of the module for
> expert opinion.
>
> Cheers
>
> On Mon, Oct 12, 2015 at 2:07 PM, Meihua Wu 
> wrote:
>
>> Hi Ted,
>>
>> Thanks for the info. I have checked but I did not find the failures
>> though.
>>
>> In my cases, I have seen
>>
>> 1) spilling in ExternalAppendOnlyMapSuite failed due to timeout.
>> [
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43531/console
>> ]
>>
>> 2) pySpark failure
>> [
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43553/console
>> ]
>>
>> Traceback (most recent call last):
>>   File
>> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>> line 316, in _get_connection
>> IndexError: pop from an empty deque
>>
>>
>>
>> On Mon, Oct 12, 2015 at 1:36 PM, Ted Yu  wrote:
>> > You can go to:
>> > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN
>> >
>> > and see if the test failure(s) you encountered appeared there.
>> >
>> > FYI
>> >
>> > On Mon, Oct 12, 2015 at 1:24 PM, Meihua Wu <
>> rotationsymmetr...@gmail.com>
>> > wrote:
>> >>
>> >> Hi Spark Devs,
>> >>
>> >> I recently encountered several cases that the Jenkin failed tests that
>> >> are supposed to be unrelated to my patch. For example, I made a patch
>> >> to Spark ML Scala API but some Scala RDD tests failed due to timeout,
>> >> or the java_gateway in PySpark fails. Just wondering if these are
>> >> isolated cases?
>> >>
>> >> Thanks,
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>> >>
>> >
>>
>
>


Re: Live UI

2015-10-12 Thread Holden Karau
I don't think there has been much work done with ScalaJS and Spark (outside
of the April fools press release), but there is a live Web UI project out
of hammerlab with Ryan Williams https://github.com/hammerlab/spree which
you may want to take a look at.

On Mon, Oct 12, 2015 at 2:36 PM, Jakob Odersky  wrote:

> Hi everyone,
> I am just getting started working on spark and was thinking of a first way
> to contribute whilst still trying to wrap my head around the codebase.
>
> Exploring the web UI, I noticed it is a classic request-response website,
> requiring manual refresh to get the latest data.
> I think it would be great to have a "live" website where data would be
> displayed real-time without the need to hit the refresh button. I would be
> very interested in contributing this feature if it is acceptable.
>
> Specifically, I was thinking of using websockets with a ScalaJS front-end.
> Please let me know if this design would be welcome or if it introduces
> unwanted dependencies, I'll be happy to discuss this further in detail.
>
> thanks for your feedback,
> --Jakob
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau


a few major changes / improvements for Spark 1.6

2015-10-12 Thread Reynold Xin
Hi Spark devs,

It is hard to track everything going on in Spark with so many pull requests
and JIRA tickets. Below are 4 major improvements that will likely be in
Spark 1.6. We have already done prototyping for all of them, and want
feedback on their design.


1. SPARK-9850 Adaptive query execution in Spark
https://issues.apache.org/jira/browse/SPARK-9850

Historically, query planning is done using statistics before the execution
begins. However, the query engine doesn't always have perfect statistics
before execution, especially on fresh data with blackbox UDFs. SPARK-9850
proposes adaptively picking executions plans based on runtime statistics.


2. SPARK- Type-safe API on top of Catalyst/DataFrame
https://issues.apache.org/jira/browse/SPARK-

A high level, typed API built on top of Catalyst/DataFrames. This API can
leverage all the work in Project Tungsten to have more robust and efficient
execution (including memory management, code generation, and query
optimization). This API is tentatively named Dataset (i.e. the last D in
RDD).


3. SPARK-1 Unified memory management (by consolidating cache and
execution memory)
https://issues.apache.org/jira/browse/SPARK-1

Spark statically divides memory into multiple fractions. The two biggest
ones are cache (aka storage) memory and execution memory. Out of the box,
only 16% of the memory is used for execution. That is to say, if an
application is not using caching, it is wasting majority of the memory
resource with the default configuration. SPARK-1 proposes a solution to
dynamically allocate memory for these two fractions, and should improve
performance for large workloads without configuration tuning.


4. SPARK-10810 Improved session management in Spark SQL and DataFrames
https://issues.apache.org/jira/browse/SPARK-10810

Session isolation & management is important in SQL query engines. In Spark,
this is slightly more complicated since users can also use DataFrames
interactively beyond SQL. SPARK-10810 implements session management for
both SQL's JDBC/ODBC servers, as well as the DataFrame API.

Most of this work has been merged already in this pull request:
https://github.com/apache/spark/pull/8909