Assorted project updates (tests, build, etc)

2014-06-22 Thread Patrick Wendell
Hey All,

1. The original test infrastructure hosted by the AMPLab has been
fully restored and also expanded with many more executor slots for
tests. Thanks to Matt Massie at the AMPLab for helping with this.

2. We now have a nightly build matrix across different Hadoop
versions. It appears that the Maven build is failing tests with some
of the newer Hadoop versions. If people from the community are
interested, diagnosing and fixing test issues would be welcome patches
(they are all dependency related).

https://issues.apache.org/jira/browse/SPARK-2232

3. Prashant Sharma has spent a lot of time making it possible for our
sbt build to read dependencies from Maven. This will save us a huge
amount of headache keeping the builds consistent. I just wanted to
give users a heads-up about this: we should retain compatibility
with features of the sbt build, but if you are, e.g., hooking into deep
internals of our build, it may affect you. I'm hoping this can be
updated and merged in the next week:

https://github.com/apache/spark/pull/77

4. We've moved most of the documentation over to recommending that users
build with Maven when creating official packages. This is just to
provide a single "reference build" of Spark: it's the one we test
and package for releases, and the one for which we make sure all
recursive dependencies are correct, etc. I'd recommend that all
downstream packagers use this build.

For day-to-day development I imagine sbt will remain more popular
(repl, incremental builds, etc). Prashant's work allows us to get the
"best of both worlds" which is great.

- Patrick


Re: Assorted project updates (tests, build, etc)

2014-06-22 Thread Mark Hamstra
Just a couple of FYI notes: With Zinc and the scala-maven-plugin, repl and
incremental builds are also available to those doing day-to-day development
using Maven.  As long as you don't have to delve into the extra boilerplate
and verbosity of Maven's POMs relative to an SBT build file, there is
little day-to-day functional difference between the two -- if anything, I
find that Maven supports faster development cycles.


GraphX's VertexRDD can not be materialized by calling count()

2014-06-22 Thread dash
Hi there,

It seems that one cannot materialize a VertexRDD by simply calling its
count method, which VertexRDD overrides. Calling RDD's count, however,
does materialize it.

Is this a feature designed to get the count without materializing the
VertexRDD?

If so, do you think it is necessary to add a materialize method to
VertexRDD? I have done that and am ready to send a pull request, but I
just want to make sure this is reasonable.

By the way, is count() the cheapest way to materialize an RDD, or does it
cost the same resources as other actions?

Best, 



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/GraphX-s-VertexRDD-can-not-be-materialized-by-calling-count-tp7065.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Checkpointed RDD still causing StackOverflow

2014-06-22 Thread dash
Hi,

I'm doing iterative computation, and because of the growing lineage chain
we need to checkpoint the RDD in order to cut off the lineage and prevent a
StackOverflowError.

The following code still throws a StackOverflowError, even though I checked
`isCheckpointed` and the result is true. I also wrote a function to count
the lineage, and the lineage is not long. Any ideas about that? Please give
me some hints so I can dig into the source code and try to fix it.




Best,



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Checkpointed-RDD-still-causing-StackOverflow-tp7066.html


Re: Checkpointed RDD still causing StackOverflow

2014-06-22 Thread Xiangrui Meng
After checkpoint(), please call count(). This is similar to cache():
checkpoint() only marks the RDD to be checkpointed, and the checkpoint
is actually written when an action runs on it. -Xiangrui
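
The difference between marking and materializing can be sketched with a
toy model. This is plain Python, not Spark, and it does not reproduce
Spark's implementation: `ToyRDD`, `lineage_depth` and `iterate` are
hypothetical names made up for illustration. It only assumes that a
transformation extends the lineage, that checkpoint() sets a flag, and
that an action such as count() is what stores the data and cuts the
lineage.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, data=None, parent=None, fn=None):
        self._data = data            # materialized elements, if any
        self._parent = parent        # lineage link to the upstream dataset
        self._fn = fn                # transformation applied to parent elements
        self._marked = False         # set by checkpoint(); nothing is saved yet
        self.is_checkpointed = False

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage chain.
        return ToyRDD(parent=self, fn=fn)

    def checkpoint(self):
        # Only marks the dataset; no computation happens here.
        self._marked = True

    def _compute(self):
        if self._data is not None:
            return self._data
        # One recursive call per lineage step: a long chain means a deep stack.
        return [self._fn(x) for x in self._parent._compute()]

    def count(self):
        # An action: computes the data, then honours a pending checkpoint mark.
        data = self._compute()
        if self._marked and not self.is_checkpointed:
            self._data = data
            self._parent = None      # lineage is cut here, not in checkpoint()
            self._fn = None
            self.is_checkpointed = True
        return len(data)

    def lineage_depth(self):
        depth, node = 0, self
        while node._parent is not None:
            depth, node = depth + 1, node._parent
        return depth


def iterate(steps, checkpoint_every):
    """Run an iterative job, checkpointing every few steps."""
    rdd = ToyRDD(data=[1, 2, 3])
    max_depth = 0
    for i in range(1, steps + 1):
        rdd = rdd.map(lambda x: x + 1)
        if i % checkpoint_every == 0:
            rdd.checkpoint()         # mark only
            rdd.count()              # action that actually cuts the lineage
        max_depth = max(max_depth, rdd.lineage_depth())
    return rdd, max_depth
```

In this sketch, the lineage depth seen at the end of each iteration stays
below `checkpoint_every` as long as each checkpoint() is followed by an
action. Without the count() call the chain keeps growing with every
iteration.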



Re: Checkpointed RDD still causing StackOverflow

2014-06-22 Thread dash
Hi Xiangrui,

As far as I know, calling count is a way to materialize the RDD. Does
collect do the same thing, since it is also an action? I cannot call count
because, for a Graph object, count does not materialize the RDD. I have
already raised an issue about that.

My question is: why is there still a stack overflow even though
`isCheckpointed` is true?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Checkpointed-RDD-still-causing-StackOverflow-tp7066p7068.html