Re: Updating docs for running on Mesos

2014-05-13 Thread Tim St Clair
Perhaps linking to a Mesos page, which then can list the various package 
incantations. 

Cheers,
Tim

- Original Message -
> From: "Matei Zaharia" 
> To: dev@spark.apache.org
> Sent: Tuesday, May 13, 2014 2:59:42 AM
> Subject: Re: Updating docs for running on Mesos
> 
> I’ll ask the Mesos folks about this. Unfortunately it might be tough to link
> only to a company’s builds; but we can perhaps include them in addition to
> instructions for building Mesos from Apache.
> 
> Matei
> 
> On May 12, 2014, at 11:55 PM, Gerard Maas  wrote:
> 
> > Andrew,
> > 
> > Mesosphere has binary releases here:
> > http://mesosphere.io/downloads/
> > 
> > (Anecdote: I actually burned a CPU building Mesos from source. No kidding -
> > it was coming, as the laptop was crashing from time to time, but the mesos
> > build was that one drop too much)
> > 
> > kr, Gerard.
> > 
> > 
> > 
> > On Tue, May 13, 2014 at 6:57 AM, Andrew Ash  wrote:
> > 
> >> As far as I know, the upstream doesn't release binaries, only source code.
> >> The downloads page  for 0.18.0 only
> >> has a source tarball.  Is there a binary release somewhere from Mesos that
> >> I'm missing?
> >> 
> >> 
> >> On Sun, May 11, 2014 at 2:16 PM, Patrick Wendell 
> >> wrote:
> >> 
> >>> Andrew,
> >>> 
> >>> Updating these docs would be great! I think this would be a welcome
> >> change.
> >>> 
> >>> In terms of packaging, it would be good to mention the binaries
> >>> produced by the upstream project as well, in addition to Mesosphere.
> >>> 
> >>> - Patrick
> >>> 
> >>> On Thu, May 8, 2014 at 12:51 AM, Andrew Ash 
> >> wrote:
>  The docs for how to run Spark on Mesos have changed very little since
>  0.6.0, but setting it up is much easier now than then.  Does it make
> >>> sense
>  to revamp with the below changes?
>  
>  
>  You no longer need to build mesos yourself as pre-built versions are
>  available from Mesosphere: http://mesosphere.io/downloads/
>  
>  And the instructions guide you towards compiling your own distribution
> >> of
>  Spark, when you can use the prebuilt versions of Spark as well.
>  
>  
>  I'd like to split that portion of the documentation into two sections,
> >> a
>  build-from-scratch section and a use-prebuilt section.  The new outline
>  would look something like this:
>  
>  
>  *Running Spark on Mesos*
>  
>  Installing Mesos
>  - using prebuilt (recommended)
>  - pointer to mesosphere's packages
>  - from scratch
>  - (similar to current)
>  
>  
>  Connecting Spark to Mesos
>  - loading distribution into an accessible location
>  - Spark settings
>  
>  Mesos Run Modes
>  - (same as current)
>  
>  Running Alongside Hadoop
>  - (trim this down)
>  
>  
>  
>  Does that work for people?
>  
>  
>  Thanks!
>  Andrew
>  
>  
>  PS Basically all the same:
>  
>  http://spark.apache.org/docs/0.6.0/running-on-mesos.html
>  http://spark.apache.org/docs/0.6.2/running-on-mesos.html
>  http://spark.apache.org/docs/0.7.3/running-on-mesos.html
>  http://spark.apache.org/docs/0.8.1/running-on-mesos.html
>  http://spark.apache.org/docs/0.9.1/running-on-mesos.html
>  
> >>> 
> >> https://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/running-on-mesos.html
> >>> 
> >> 
> 
> 

-- 
Cheers,
Tim
Freedom, Features, Friends, First -> Fedora
https://fedoraproject.org/wiki/SIGs/bigdata


Re: Updating docs for running on Mesos

2014-05-13 Thread Gerard Maas
Great work! I just left some comments in the PR. In summary, it would be
great to have more background on how Spark works on Mesos and how the
different elements interact. That will (hopefully) help in understanding the
practicalities of the common assembly location (http/hdfs) and how the jobs
are distributed to the Mesos infrastructure.
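
As an illustration of that "accessible location" point, here is a minimal
sketch of what such a setup typically looks like (the Mesos master address and
the HDFS path to the pre-built Spark tarball are placeholders, not from the PR):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: point Mesos executors at a Spark distribution stored somewhere
// every slave can reach, e.g. HDFS or an HTTP server.
val conf = new SparkConf()
  .setMaster("mesos://mesos-master.example.com:5050")
  .setAppName("SparkOnMesosExample")
  .set("spark.executor.uri", "hdfs:///spark/spark-1.0.0-bin-hadoop2.tgz")

val sc = new SparkContext(conf)
sc.parallelize(1 to 1000).count()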

Also, adding a chapter on troubleshooting (where we have spent most of our
time lately :-) would be a welcome addition.  I'm not sure I've figured it
out completely enough to attempt to contribute that myself.

-kr, Gerard.


On Tue, May 13, 2014 at 6:56 AM, Andrew Ash  wrote:

> For trimming the Running Alongside Hadoop section I mostly think there
> should be a separate Spark+HDFS section and have the CDH+HDP page be merged
> into that one, but I suppose that's a separate docs change.
>
>
> On Sun, May 11, 2014 at 4:28 PM, Andy Konwinski  >wrote:
>
> > Thanks for suggesting this and volunteering to do it.
> >
> > On May 11, 2014 3:32 AM, "Andrew Ash"  wrote:
> > >
> > > The docs for how to run Spark on Mesos have changed very little since
> > > 0.6.0, but setting it up is much easier now than then.  Does it make
> > sense
> > > to revamp with the below changes?
> > >
> > >
> > > You no longer need to build mesos yourself as pre-built versions are
> > > available from Mesosphere: http://mesosphere.io/downloads/
> > >
> > > And the instructions guide you towards compiling your own distribution
> of
> > > Spark, when you can use the prebuilt versions of Spark as well.
> > >
> > >
> > > I'd like to split that portion of the documentation into two sections,
> a
> > > build-from-scratch section and a use-prebuilt section.  The new outline
> > > would look something like this:
> > >
> > >
> > > *Running Spark on Mesos*
> > >
> > > Installing Mesos
> > > - using prebuilt (recommended)
> > >  - pointer to mesosphere's packages
> > > - from scratch
> > >  - (similar to current)
> > >
> > >
> > > Connecting Spark to Mesos
> > > - loading distribution into an accessible location
> > > - Spark settings
> > >
> > > Mesos Run Modes
> > > - (same as current)
> > >
> > > Running Alongside Hadoop
> > > - (trim this down)
> >
> > What trimming do you have in mind here?
> >
> > >
> > >
> > >
> > > Does that work for people?
> > >
> > >
> > > Thanks!
> > > Andrew
> > >
> > >
> > > PS Basically all the same:
> > >
> > > http://spark.apache.org/docs/0.6.0/running-on-mesos.html
> > > http://spark.apache.org/docs/0.6.2/running-on-mesos.html
> > > http://spark.apache.org/docs/0.7.3/running-on-mesos.html
> > > http://spark.apache.org/docs/0.8.1/running-on-mesos.html
> > > http://spark.apache.org/docs/0.9.1/running-on-mesos.html
> > >
> >
> >
> https://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/running-on-mesos.html
> >
>


Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Andrew Ash
Thanks for filing -- I'm keeping my eye out for updates on that ticket.

Cheers!
Andrew


On Tue, May 13, 2014 at 2:40 PM, Michael Armbrust wrote:

> >
> > It looks like currently the .count() on parquet is handled incredibly
> > inefficiently and all the columns are materialized.  But if I select just
> > that relevant column and then count, then the column-oriented storage of
> > Parquet really shines.
> >
> > There ought to be a potential optimization here such that a .count() on a
> > SchemaRDD backed by Parquet doesn't require re-assembling the rows, as
> > that's expensive.  I don't think .count() is handled specially in
> > SchemaRDDs, but it seems ripe for optimization.
> >
>
> Yeah, you are right.  Thanks for pointing this out!
>
> If you call .count() that is just the native Spark count, which is not
> aware of the potential optimizations.  We could just override count() in a
> schema RDD to be something like
> "groupBy()(Count(Literal(1))).collect().head.getInt(0)"
>
> Here is a JIRA: SPARK-1822 - SchemaRDD.count() should use the
> optimizer.
>
> Michael
>


Re: Class-based key in groupByKey?

2014-05-13 Thread Matei Zaharia
Your key needs to implement hashCode in addition to equals.

Matei

On May 13, 2014, at 3:30 PM, Michael Malak  wrote:

> Is it permissible to use a custom class (as opposed to e.g. the built-in 
> String or Int) for the key in groupByKey? It doesn't seem to be working for 
> me on Spark 0.9.0/Scala 2.10.3:
> 
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> 
> class C(val s:String) extends Serializable {
>   override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == 
> s else false
>   override def toString = s
> }
> 
> object SimpleApp {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "Simple App", null, null)
> val r1 = sc.parallelize(Array((new C("a"),11),(new C("a"),12)))
> println("r1=" + r1.groupByKey.collect.mkString(";"))
> val r2 = sc.parallelize(Array(("a",11),("a",12)))
> println("r2=" + r2.groupByKey.collect.mkString(";"))
>   }
> }
> 
> 
> Output
> ==
> r1=(a,ArrayBuffer(11));(a,ArrayBuffer(12))
> r2=(a,ArrayBuffer(11, 12))



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Madhu
I just built rc5 on Windows 7 and tried to reproduce the problem described in

https://issues.apache.org/jira/browse/SPARK-1712

It works on my machine:

14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at <console>:17) finished
in 4.548 s
14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
have all completed, from pool
14/05/13 21:06:47 INFO SparkContext: Job finished: sum at <console>:17, took
4.814991993 s
res1: Double = 5.05E11

I used all defaults, no config files were changed.
Not sure if that makes a difference...



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6560.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Anand Avati
On Tue, May 13, 2014 at 8:26 AM, Michael Malak wrote:

> Reposting here on dev since I didn't see a response on user:
>
> I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell.
> In the Spark Shell, equals() fails when I use the canonical equals()
> pattern of match{}, but works when I substitute with isInstanceOf[]. I am
> using Spark 0.9.0/Scala 2.10.3.
>
> Is this a bug?
>
> Spark Shell (equals uses match{})
> =
>
> class C(val s:String) extends Serializable {
>   override def equals(o: Any) = o match {
> case that: C => that.s == s
> case _ => false
>   }
> }
>
> val x = new C("a")
> val bos = new java.io.ByteArrayOutputStream()
> val out = new java.io.ObjectOutputStream(bos)
> out.writeObject(x);
> val b = bos.toByteArray();
> out.close
> bos.close
> val y = new java.io.ObjectInputStream(new
> java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
> x.equals(y)
>
> res: Boolean = false
>
> Spark Shell (equals uses isInstanceOf[])
> 
>
> class C(val s:String) extends Serializable {
>   override def equals(o: Any) = if (o.isInstanceOf[C])
> (o.asInstanceOf[C].s == s) else false
> }
>
> val x = new C("a")
> val bos = new java.io.ByteArrayOutputStream()
> val out = new java.io.ObjectOutputStream(bos)
> out.writeObject(x);
> val b = bos.toByteArray();
> out.close
> bos.close
> val y = new java.io.ObjectInputStream(new
> java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
> x.equals(y)
>
> res: Boolean = true
>
> Scala Shell (equals uses match{})
> =
>
> class C(val s:String) extends Serializable {
>   override def equals(o: Any) = o match {
> case that: C => that.s == s
> case _ => false
>   }
> }
>
> val x = new C("a")
> val bos = new java.io.ByteArrayOutputStream()
> val out = new java.io.ObjectOutputStream(bos)
> out.writeObject(x);
> val b = bos.toByteArray();
> out.close
> bos.close
> val y = new java.io.ObjectInputStream(new
> java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
> x.equals(y)
>
> res: Boolean = true
>


Hmm. I see that this can be reproduced without Spark in Scala 2.11, with
and without -Yrepl-class-based command line flag to the repl. Spark's REPL
has the effective behavior of 2.11's -Yrepl-class-based flag. Inspecting
the byte code generated, it appears -Yrepl-class-based results in the
creation of "$outer" field in the generated classes (including class C).
The first case match in equals() results in code along the lines of
(simplified):

if (o.isInstanceOf[C] && this.$outer == that.$outer) { /* do string compare */ }

$outer is the synthetic field pointing to the outer object in which the
object was created (in this case, the REPL environment). Now obviously,
when x is taken through the byte stream and deserialized, it would have a
new $outer created (it may have been deserialized in a different JVM or
machine for all we know). So the mismatch of the $outer fields is expected.
However, I'm still trying to understand why the outers need to be the same
for the case match.


Class-based key in groupByKey?

2014-05-13 Thread Michael Malak
Is it permissible to use a custom class (as opposed to e.g. the built-in String 
or Int) for the key in groupByKey? It doesn't seem to be working for me on 
Spark 0.9.0/Scala 2.10.3:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

class C(val s:String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s 
else false
  override def toString = s
}

object SimpleApp {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "Simple App", null, null)
    val r1 = sc.parallelize(Array((new C("a"),11),(new C("a"),12)))
    println("r1=" + r1.groupByKey.collect.mkString(";"))
    val r2 = sc.parallelize(Array(("a",11),("a",12)))
    println("r2=" + r2.groupByKey.collect.mkString(";"))
  }
}


Output
==
r1=(a,ArrayBuffer(11));(a,ArrayBuffer(12))
r2=(a,ArrayBuffer(11, 12))


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Nan Zhu
+1, replaced rc3 with rc5, all applications are working fine

Best, 

-- 
Nan Zhu


On Tuesday, May 13, 2014 at 8:03 PM, Madhu wrote:

> I built rc5 using sbt/sbt assembly on Linux without any problems.
> There used to be an sbt.cmd for the Windows build; has that been deprecated?
> If so, I can document the Windows build steps that worked for me.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6558.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> 
> 




Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread witgo
-1 
The following bugs should be fixed:
https://issues.apache.org/jira/browse/SPARK-1817
https://issues.apache.org/jira/browse/SPARK-1712


-- Original --
From: "Patrick Wendell"
Date: Wed, May 14, 2014 04:07 AM
To: "dev@spark.apache.org"

Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)



Hey all - there were some earlier RC's that were not presented to the
dev list because issues were found with them. Also, there seems to be
some issues with the reliability of the dev list e-mail. Just a heads
up.

I'll lead with a +1 for this.

On Tue, May 13, 2014 at 8:07 AM, Nan Zhu  wrote:
> just curious, where is rc4 VOTE?
>
> I searched my gmail but didn't find that?
>
>
>
>
> On Tue, May 13, 2014 at 9:49 AM, Sean Owen  wrote:
>
>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell 
>> wrote:
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>
>> Good news is that the sigs, MD5 and SHA are all correct.
>>
>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>> use SHA512, which took me a bit of head-scratching to figure out.
>>
>> If another RC comes out, I might suggest making it SHA1 everywhere?
>> But there is nothing wrong with these signatures and checksums.
>>
>> Now to look at the contents...
>>

Re: Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
Thank you for your investigation into this!

Just for completeness, I've confirmed it's a problem only in the REPL, not in
compiled Spark programs.

But within the REPL, a direct consequence of the classes not being the same
after serialization/deserialization is that lookup() doesn't work either:

scala> class C(val s:String) extends Serializable {
 |   override def equals(o: Any) = if (o.isInstanceOf[C]) 
o.asInstanceOf[C].s == s else false
 |   override def toString = s
 | }
defined class C

scala> val r = sc.parallelize(Array((new C("a"),11),(new C("a"),12)))
r: org.apache.spark.rdd.RDD[(C, Int)] = ParallelCollectionRDD[3] at parallelize 
at <console>:14

scala> r.lookup(new C("a"))
<console>:17: error: type mismatch;
 found   : C
 required: C
  r.lookup(new C("a"))
   ^



On Tuesday, May 13, 2014 3:05 PM, Anand Avati  wrote:

On Tue, May 13, 2014 at 8:26 AM, Michael Malak  wrote:

Reposting here on dev since I didn't see a response on user:
>
>I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In 
>the Spark Shell, equals() fails when I use the canonical equals() pattern of 
>match{}, but works when I substitute with isInstanceOf[]. I am using Spark 
>0.9.0/Scala 2.10.3.
>
>Is this a bug?
>
>Spark Shell (equals uses match{})
>=
>
>class C(val s:String) extends Serializable {
>  override def equals(o: Any) = o match {
>    case that: C => that.s == s
>    case _ => false
>  }
>}
>
>val x = new C("a")
>val bos = new java.io.ByteArrayOutputStream()
>val out = new java.io.ObjectOutputStream(bos)
>out.writeObject(x);
>val b = bos.toByteArray();
>out.close
>bos.close
>val y = new java.io.ObjectInputStream(new 
>java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
>x.equals(y)
>
>res: Boolean = false
>
>Spark Shell (equals uses isInstanceOf[])
>
>
>class C(val s:String) extends Serializable {
>  override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == 
>s) else false
>}
>
>val x = new C("a")
>val bos = new java.io.ByteArrayOutputStream()
>val out = new java.io.ObjectOutputStream(bos)
>out.writeObject(x);
>val b = bos.toByteArray();
>out.close
>bos.close
>val y = new java.io.ObjectInputStream(new 
>java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
>x.equals(y)
>
>res: Boolean = true
>
>Scala Shell (equals uses match{})
>=
>
>class C(val s:String) extends Serializable {
>  override def equals(o: Any) = o match {
>    case that: C => that.s == s
>    case _ => false
>  }
>}
>
>val x = new C("a")
>val bos = new java.io.ByteArrayOutputStream()
>val out = new java.io.ObjectOutputStream(bos)
>out.writeObject(x);
>val b = bos.toByteArray();
>out.close
>bos.close
>val y = new java.io.ObjectInputStream(new 
>java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
>x.equals(y)
>
>res: Boolean = true
>


Hmm. I see that this can be reproduced without Spark in Scala 2.11, with and 
without -Yrepl-class-based command line flag to the repl. Spark's REPL has the 
effective behavior of 2.11's -Yrepl-class-based flag. Inspecting the byte code 
generated, it appears -Yrepl-class-based results in the creation of "$outer" 
field in the generated classes (including class C). The first case match in 
equals() results in code along the lines of (simplified):

if (o.isInstanceOf[C] && this.$outer == that.$outer) { /* do string compare */ }

$outer is the synthetic field pointing to the outer object in which the object 
was created (in this case, the REPL environment). Now obviously, when x is 
taken through the byte stream and deserialized, it would have a new $outer 
created (it may have been deserialized in a different JVM or machine for all we 
know). So the mismatch of the $outer fields is expected. However, I'm still 
trying to understand why the outers need to be the same for the case match.


Re: Class-based key in groupByKey?

2014-05-13 Thread Andrew Ash
In Scala, if you override .equals() you also need to override .hashCode(),
just like in Java:

http://www.scala-lang.org/api/2.10.3/index.html#scala.AnyRef

I suspect if your .hashCode() delegates to just the hashCode of s then
you'd be good.
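
For example (a sketch only, adding the missing hashCode to the class from the
snippet below):

class C(val s: String) extends Serializable {
  override def equals(o: Any) =
    if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s else false
  // Delegate to the wrapped string so equal keys land in the same hash bucket.
  override def hashCode: Int = s.hashCode
  override def toString = s
}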


On Tue, May 13, 2014 at 3:30 PM, Michael Malak wrote:

> Is it permissible to use a custom class (as opposed to e.g. the built-in
> String or Int) for the key in groupByKey? It doesn't seem to be working for
> me on Spark 0.9.0/Scala 2.10.3:
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
>
> class C(val s:String) extends Serializable {
>   override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s
> == s else false
>   override def toString = s
> }
>
> object SimpleApp {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "Simple App", null, null)
> val r1 = sc.parallelize(Array((new C("a"),11),(new C("a"),12)))
> println("r1=" + r1.groupByKey.collect.mkString(";"))
> val r2 = sc.parallelize(Array(("a",11),("a",12)))
> println("r2=" + r2.groupByKey.collect.mkString(";"))
>   }
> }
>
>
> Output
> ==
> r1=(a,ArrayBuffer(11));(a,ArrayBuffer(12))
> r2=(a,ArrayBuffer(11, 12))
>


Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Michael Armbrust
>
> It looks like currently the .count() on parquet is handled incredibly
> inefficiently and all the columns are materialized.  But if I select just
> that relevant column and then count, then the column-oriented storage of
> Parquet really shines.
>
> There ought to be a potential optimization here such that a .count() on a
> SchemaRDD backed by Parquet doesn't require re-assembling the rows, as
> that's expensive.  I don't think .count() is handled specially in
> SchemaRDDs, but it seems ripe for optimization.
>

Yeah, you are right.  Thanks for pointing this out!

If you call .count() that is just the native Spark count, which is not
aware of the potential optimizations.  We could just override count() in a
schema RDD to be something like
"groupBy()(Count(Literal(1))).collect().head.getInt(0)"

Here is a JIRA: SPARK-1822 - SchemaRDD.count() should use the
optimizer.

Michael


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Patrick Wendell
Hey all - there were some earlier RC's that were not presented to the
dev list because issues were found with them. Also, there seems to be
some issues with the reliability of the dev list e-mail. Just a heads
up.

I'll lead with a +1 for this.

On Tue, May 13, 2014 at 8:07 AM, Nan Zhu  wrote:
> just curious, where is rc4 VOTE?
>
> I searched my gmail but didn't find that?
>
>
>
>
> On Tue, May 13, 2014 at 9:49 AM, Sean Owen  wrote:
>
>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell 
>> wrote:
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>
>> Good news is that the sigs, MD5 and SHA are all correct.
>>
>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>> use SHA512, which took me a bit of head-scratching to figure out.
>>
>> If another RC comes out, I might suggest making it SHA1 everywhere?
>> But there is nothing wrong with these signatures and checksums.
>>
>> Now to look at the contents...
>>


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Nan Zhu
Ah, I see, thanks 

-- 
Nan Zhu


On Tuesday, May 13, 2014 at 12:59 PM, Mark Hamstra wrote:

> There were a few early/test RCs this cycle that were never put to a vote.
> 
> 
> > On Tue, May 13, 2014 at 8:07 AM, Nan Zhu (zhunanmcg...@gmail.com) wrote:
> 
> > just curious, where is rc4 VOTE?
> > 
> > I searched my gmail but didn't find that?
> > 
> > 
> > 
> > 
> > On Tue, May 13, 2014 at 9:49 AM, Sean Owen (so...@cloudera.com) wrote:
> > 
> > > On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell (pwend...@gmail.com)
> > > wrote:
> > > > The release files, including signatures, digests, etc. can be found at:
> > > > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
> > > > 
> > > 
> > > 
> > > Good news is that the sigs, MD5 and SHA are all correct.
> > > 
> > > Tiny note: the Maven artifacts use SHA1, while the binary artifacts
> > > use SHA512, which took me a bit of head-scratching to figure out.
> > > 
> > > If another RC comes out, I might suggest making it SHA1 everywhere?
> > > But there is nothing wrong with these signatures and checksums.
> > > 
> > > Now to look at the contents... 



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Andrew Or
+1


2014-05-13 6:49 GMT-07:00 Sean Owen :

> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell 
> wrote:
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>
> Good news is that the sigs, MD5 and SHA are all correct.
>
> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
> use SHA512, which took me a bit of head-scratching to figure out.
>
> If another RC comes out, I might suggest making it SHA1 everywhere?
> But there is nothing wrong with these signatures and checksums.
>
> Now to look at the contents...
>


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Nan Zhu
just curious, where is rc4 VOTE?

I searched my gmail but didn't find that?




On Tue, May 13, 2014 at 9:49 AM, Sean Owen  wrote:

> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell 
> wrote:
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>
> Good news is that the sigs, MD5 and SHA are all correct.
>
> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
> use SHA512, which took me a bit of head-scratching to figure out.
>
> If another RC comes out, I might suggest making it SHA1 everywhere?
> But there is nothing wrong with these signatures and checksums.
>
> Now to look at the contents...
>


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Mark Hamstra
There were a few early/test RCs this cycle that were never put to a vote.


On Tue, May 13, 2014 at 8:07 AM, Nan Zhu  wrote:

> just curious, where is rc4 VOTE?
>
> I searched my gmail but didn't find that?
>
>
>
>
> On Tue, May 13, 2014 at 9:49 AM, Sean Owen  wrote:
>
> > On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell 
> > wrote:
> > > The release files, including signatures, digests, etc. can be found at:
> > > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
> >
> > Good news is that the sigs, MD5 and SHA are all correct.
> >
> > Tiny note: the Maven artifacts use SHA1, while the binary artifacts
> > use SHA512, which took me a bit of head-scratching to figure out.
> >
> > If another RC comes out, I might suggest making it SHA1 everywhere?
> > But there is nothing wrong with these signatures and checksums.
> >
> > Now to look at the contents...
> >
>


Re: Multinomial Logistic Regression

2014-05-13 Thread DB Tsai
Hi Deb,

For K possible outcomes in multinomial logistic regression, we can have
K-1 independent binary logistic regression models, in which one outcome is
chosen as a "pivot" and then the other K-1 outcomes are separately
regressed against the pivot outcome. See my presentation for the technical
details: http://www.slideshare.net/dbtsai/2014-0501-mlor

Since mllib only supports one linear model per classification model, there
will be some infrastructure work to support MLOR in mllib. But if you want
to implement yourself with the L-BFGS solver in mllib, you can follow the
equation in my slide, and it will not be too difficult.

I can give you the gradient method for multinomial logistic regression; you
just need to put the K-1 intercepts in the right place.

  def computeGradient(y: Double, x: Array[Double], lambda: Double,
      w: Array[Array[Double]], b: Array[Double],
      gradient: Array[Double]): (Double, Int) = {
    val classes = b.length + 1
    val yy = y.toInt

    def alpha(i: Int): Int = {
      if (i == 0) 1 else 0
    }

    def delta(i: Int, j: Int): Int = {
      if (i == j) 1 else 0
    }

    var denominator: Double = 1.0
    val numerators: Array[Double] = Array.ofDim[Double](b.length)

    var predicted = 1

    {
      var i = 0
      var j = 0
      var acc: Double = 0
      while (i < b.length) {
        acc = b(i)
        j = 0
        while (j < x.length) {
          acc += x(j) * w(i)(j)
          j += 1
        }
        numerators(i) = math.exp(acc)
        if (i > 0 && numerators(i) > numerators(predicted - 1)) {
          predicted = i + 1
        }
        denominator += numerators(i)
        i += 1
      }
      if (numerators(predicted - 1) < 1) {
        predicted = 0
      }
    }

    {
      // gradient has dim of (classes-1) * (x.length+1)
      var i = 0
      var m1: Int = 0
      var l1: Int = 0
      while (i < (classes - 1) * (x.length + 1)) {
        m1 = i % (x.length + 1) // m0 is intercept
        l1 = (i - m1) / (x.length + 1) // l + 1 is class
        if (m1 == 0) {
          gradient(i) += (1 - alpha(yy)) * delta(yy, l1 + 1) -
            numerators(l1) / denominator
        } else {
          gradient(i) += ((1 - alpha(yy)) * delta(yy, l1 + 1) -
            numerators(l1) / denominator) * x(m1 - 1)
        }
        i += 1
      }
    }
    val loglike: Double = math.round(y).toInt match {
      case 0 => math.log(1.0 / denominator)
      case _ => math.log(numerators(math.round(y - 1).toInt) / denominator)
    }
    (loglike, predicted)
  }
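
For what it's worth, a rough usage sketch based on the shapes the code implies
(my reading of it, not something from the slides): for K classes and D
features, w is (K-1) x D, b holds the K-1 intercepts, and gradient is a flat
array of length (K-1) * (D + 1) that the method accumulates into.

// Hypothetical usage, shapes inferred from computeGradient above.
val numClasses = 3
val numFeatures = 4
val w = Array.fill(numClasses - 1, numFeatures)(0.0)
val b = Array.fill(numClasses - 1)(0.0)
val gradient = Array.fill((numClasses - 1) * (numFeatures + 1))(0.0)

val x = Array(0.5, 1.0, -0.3, 2.0)  // one feature vector
val y = 1.0                         // its label, as a class index
val (logLikelihood, predictedClass) = computeGradient(y, x, 0.0, w, b, gradient)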



Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Tue, May 13, 2014 at 4:08 AM, Debasish Das wrote:

> Hi,
>
> Is there a PR for multinomial logistic regression which does one-vs-all and
> compares it to the other possibilities?
>
> @dbtsai, in your Strata presentation did you use one-vs-all? Did you add some
> constraints so that you penalize if mis-predicted labels are not
> very far from the true label?
>
> Thanks.
> Deb
>


Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
Reposting here on dev since I didn't see a response on user:

I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In 
the Spark Shell, equals() fails when I use the canonical equals() pattern of 
match{}, but works when I substitute with isInstanceOf[]. I am using Spark 
0.9.0/Scala 2.10.3.

Is this a bug?

Spark Shell (equals uses match{})
=

class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new 
java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = false

Spark Shell (equals uses isInstanceOf[])


class C(val s:String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == 
s) else false
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new 
java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true

Scala Shell (equals uses match{})
=

class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new 
java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Sean Owen
On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell  wrote:
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.0.0-rc5/

Good news is that the sigs, MD5 and SHA are all correct.

Tiny note: the Maven artifacts use SHA1, while the binary artifacts
use SHA512, which took me a bit of head-scratching to figure out.

If another RC comes out, I might suggest making it SHA1 everywhere?
But there is nothing wrong with these signatures and checksums.

Now to look at the contents...


[VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.0.0!

The tag to be voted on is v1.0.0-rc5 (commit 18f0623):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=18f062303303824139998e8fc8f4158217b0dbc3

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc5/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1012/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Friday, May 16, at 09:30 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

changes to the streaming API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

changes to the GraphX API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior
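
For illustration, a quick sketch of those two migration notes (rdd1 and rdd2
below are placeholder pair RDDs, not part of this release note):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair RDD functions

// cogroup values are now Iterable[T]; toSeq restores the old Seq-based shape.
val grouped = rdd1.cogroup(rdd2).mapValues {
  case (as, bs) => (as.toSeq, bs.toSeq)
}

// jarOfClass now returns Option[String]; toSeq restores the old Seq[String].
val jars: Seq[String] = SparkContext.jarOfClass(getClass).toSeq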


Multinomial Logistic Regression

2014-05-13 Thread Debasish Das
Hi,

Is there a PR for multinomial logistic regression which does one-vs-all and
compares it to the other possibilities?

@dbtsai, in your Strata presentation did you use one-vs-all? Did you add some
constraints so that you penalize if mis-predicted labels are not
very far from the true label?

Thanks.
Deb


Sparse vector toLibSvm API

2014-05-13 Thread Debasish Das
Hi,

In the sparse vector the toString API is as follows:

  override def toString: String = {
    "(" + size + "," + indices.zip(values).mkString("[", ",", "]") + ")"
  }

Does it make sense to keep it consistent with the libsvm format?

What does each line of the libsvm format look like?
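
(For reference, a libsvm-format line is conventionally "label index1:value1
index2:value2 ..." with 1-based, increasing indices. A rough sketch of what a
libsvm-style formatter for the sparse vector could look like; this is a
hypothetical helper, not an existing MLlib method:)

def toLibSvmLine(label: Double, indices: Array[Int], values: Array[Double]): String =
  label + " " + indices.zip(values).map { case (i, v) => (i + 1) + ":" + v }.mkString(" ")

// e.g. toLibSvmLine(1.0, Array(0, 3), Array(2.5, 7.0)) == "1.0 1:2.5 4:7.0"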

Thanks.

Deb


Re: Updating docs for running on Mesos

2014-05-13 Thread Gerard Maas
Andrew,

Mesosphere has binary releases here:
http://mesosphere.io/downloads/

(Anecdote: I actually burned a CPU building Mesos from source. No kidding -
it was coming, as the laptop was crashing from time to time, but the mesos
build was that one drop too much)

kr, Gerard.



On Tue, May 13, 2014 at 6:57 AM, Andrew Ash  wrote:

> As far as I know, the upstream doesn't release binaries, only source code.
>  The downloads page  for 0.18.0 only
> has a source tarball.  Is there a binary release somewhere from Mesos that
> I'm missing?
>
>
> On Sun, May 11, 2014 at 2:16 PM, Patrick Wendell 
> wrote:
>
> > Andrew,
> >
> > Updating these docs would be great! I think this would be a welcome
> change.
> >
> > In terms of packaging, it would be good to mention the binaries
> > produced by the upstream project as well, in addition to Mesosphere.
> >
> > - Patrick
> >
> > On Thu, May 8, 2014 at 12:51 AM, Andrew Ash 
> wrote:
> > > The docs for how to run Spark on Mesos have changed very little since
> > > 0.6.0, but setting it up is much easier now than then.  Does it make
> > sense
> > > to revamp with the below changes?
> > >
> > >
> > > You no longer need to build mesos yourself as pre-built versions are
> > > available from Mesosphere: http://mesosphere.io/downloads/
> > >
> > > And the instructions guide you towards compiling your own distribution
> of
> > > Spark, when you can use the prebuilt versions of Spark as well.
> > >
> > >
> > > I'd like to split that portion of the documentation into two sections,
> a
> > > build-from-scratch section and a use-prebuilt section.  The new outline
> > > would look something like this:
> > >
> > >
> > > *Running Spark on Mesos*
> > >
> > > Installing Mesos
> > > - using prebuilt (recommended)
> > >  - pointer to mesosphere's packages
> > > - from scratch
> > >  - (similar to current)
> > >
> > >
> > > Connecting Spark to Mesos
> > > - loading distribution into an accessible location
> > > - Spark settings
> > >
> > > Mesos Run Modes
> > > - (same as current)
> > >
> > > Running Alongside Hadoop
> > > - (trim this down)
> > >
> > >
> > >
> > > Does that work for people?
> > >
> > >
> > > Thanks!
> > > Andrew
> > >
> > >
> > > PS Basically all the same:
> > >
> > > http://spark.apache.org/docs/0.6.0/running-on-mesos.html
> > > http://spark.apache.org/docs/0.6.2/running-on-mesos.html
> > > http://spark.apache.org/docs/0.7.3/running-on-mesos.html
> > > http://spark.apache.org/docs/0.8.1/running-on-mesos.html
> > > http://spark.apache.org/docs/0.9.1/running-on-mesos.html
> > >
> >
> https://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/running-on-mesos.html
> >
>


Re: Kryo not default?

2014-05-13 Thread Dmitriy Lyubimov
On Mon, May 12, 2014 at 2:47 PM, Anand Avati  wrote:

> Hi,
> Can someone share the reason why Kryo serializer is not the default?

why should it be?

On top of that, the only way to serialize a closure into the backend (even
now) is Java serialization (which means Java serialization is required of
all closure attributes)


> Is
> there anything to be careful about (because of which it is not enabled by
> default)?
>

Yes. It kind of stems from the above. There are still a number of API calls
that use closure attributes to serialize data to the backend (see fold(), for
example), which means even if you enable Kryo, some APIs still require Java
serialization of an attribute.

I fixed parallelize(), collect(), and something else I don't remember in that
regard, but I think even now there are still a number of APIs lingering whose
data parameters wouldn't work with Kryo.
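
(For reference, a minimal sketch of turning Kryo on; the serializer and
registrator settings are the standard ones, while MyRegistrator and the
registered class are hypothetical examples:)

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator for an application's own classes.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Array[Double]])
  }
}

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
// Note (per the discussion above): closures themselves are still Java-serialized.
val sc = new SparkContext(conf)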


> Thanks!
>


Re: Updating docs for running on Mesos

2014-05-13 Thread Andrew Ash
Completely agree about preferring to link to the upstream project rather
than a company's -- the only reason I'm using Mesosphere's now is that I
see no alternative from mesos.apache.org.

I included instructions for both using Mesosphere's packages and building
from scratch in the PR: https://github.com/apache/spark/pull/756


On Tue, May 13, 2014 at 12:59 AM, Matei Zaharia wrote:

> I’ll ask the Mesos folks about this. Unfortunately it might be tough to
> link only to a company’s builds; but we can perhaps include them in
> addition to instructions for building Mesos from Apache.
>
> Matei
>
> On May 12, 2014, at 11:55 PM, Gerard Maas  wrote:
>
> > Andrew,
> >
> > Mesosphere has binary releases here:
> > http://mesosphere.io/downloads/
> >
> > (Anecdote: I actually burned a CPU building Mesos from source. No
> kidding -
> > it was coming, as the laptop was crashing from time to time, but the
> mesos
> > build was that one drop too much)
> >
> > kr, Gerard.
> >
> >
> >
> > On Tue, May 13, 2014 at 6:57 AM, Andrew Ash 
> wrote:
> >
> >> As far as I know, the upstream doesn't release binaries, only source
> code.
> >> The downloads page  for 0.18.0
> only
> >> has a source tarball.  Is there a binary release somewhere from Mesos
> that
> >> I'm missing?
> >>
> >>
> >> On Sun, May 11, 2014 at 2:16 PM, Patrick Wendell 
> >> wrote:
> >>
> >>> Andrew,
> >>>
> >>> Updating these docs would be great! I think this would be a welcome
> >> change.
> >>>
> >>> In terms of packaging, it would be good to mention the binaries
> >>> produced by the upstream project as well, in addition to Mesosphere.
> >>>
> >>> - Patrick
> >>>
> >>> On Thu, May 8, 2014 at 12:51 AM, Andrew Ash 
> >> wrote:
>  The docs for how to run Spark on Mesos have changed very little since
>  0.6.0, but setting it up is much easier now than then.  Does it make
> >>> sense
>  to revamp with the below changes?
> 
> 
>  You no longer need to build mesos yourself as pre-built versions are
>  available from Mesosphere: http://mesosphere.io/downloads/
> 
>  And the instructions guide you towards compiling your own distribution
> >> of
>  Spark, when you can use the prebuilt versions of Spark as well.
> 
> 
>  I'd like to split that portion of the documentation into two sections,
> >> a
>  build-from-scratch section and a use-prebuilt section.  The new
> outline
>  would look something like this:
> 
> 
>  *Running Spark on Mesos*
> 
>  Installing Mesos
>  - using prebuilt (recommended)
>  - pointer to mesosphere's packages
>  - from scratch
>  - (similar to current)
> 
> 
>  Connecting Spark to Mesos
>  - loading distribution into an accessible location
>  - Spark settings
> 
>  Mesos Run Modes
>  - (same as current)
> 
>  Running Alongside Hadoop
>  - (trim this down)
> 
> 
> 
>  Does that work for people?
> 
> 
>  Thanks!
>  Andrew
> 
> 
>  PS Basically all the same:
> 
>  http://spark.apache.org/docs/0.6.0/running-on-mesos.html
>  http://spark.apache.org/docs/0.6.2/running-on-mesos.html
>  http://spark.apache.org/docs/0.7.3/running-on-mesos.html
>  http://spark.apache.org/docs/0.8.1/running-on-mesos.html
>  http://spark.apache.org/docs/0.9.1/running-on-mesos.html
> 
> >>>
> >>
> https://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/running-on-mesos.html
> >>>
> >>
>
>


Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Andrew Ash
These numbers were run on git commit 756c96 (a few days after the 1.0.0-rc3
tag).  Do you have a link to the patch that avoids scanning all columns for
count(*) or count(1)?  I'd like to give it a shot.

Andrew


On Mon, May 12, 2014 at 11:41 PM, Reynold Xin  wrote:

> Thanks for the experiments and analysis!
>
> I think Michael already submitted a patch that avoids scanning all columns
> for count(*) or count(1).
>
>
> On Mon, May 12, 2014 at 9:46 PM, Andrew Ash  wrote:
>
> > Hi Spark devs,
> >
> > First of all, huge congrats on the parquet integration with SparkSQL!
>  This
> > is an incredible direction forward and something I can see being very
> > broadly useful.
> >
> > I was doing some preliminary tests to see how it works with one of my
> > workflows, and wanted to share some numbers that people might want to
> know
> > about.
> >
> > I also wanted to point out that .count() doesn't seem integrated with the
> > rest of the optimization framework, and some big gains could be possible.
> >
> >
> > So, the numbers:
> >
> > I took a table extracted from a SQL database and stored in HDFS:
> >
> >- 115 columns (several always-empty, mostly strings, some enums, some
> >numbers)
> >- 253,887,080 rows
> >- 182,150,295,881 bytes (raw uncompressed)
> >- 42,826,820,222 bytes (lzo compressed with .index file)
> >
> > And I converted it to Parquet using SparkSQL's SchemaRDD.saveAsParquet()
> > call:
> >
> >- Converting from .lzo in HDFS to .parquet in HDFS took 635s using 42
> >cores across 4 machines
> >- 17,517,922,117 bytes (parquet per SparkSQL defaults)
> >
> > So storing in parquet format vs lzo compresses the data down to less than
> > 50% of the .lzo size, and under 10% of the raw uncompressed size.  Nice!
> >
> >
> > I then did some basic interactions on it:
> >
> > *Row count*
> >
> >- LZO
> >   - lzoFile("/path/to/lzo").count
> >   - 31.632305953s
> >- Parquet
> >   - sqlContext.parquetFile("/path/to/parquet").count
> >   - 289.129487003s
> >
> > Reassembling rows from the separate column storage is clearly really
> > expensive.  Median task length is 33s vs 4s, and of that 33s in each task
> > (319 tasks total) about 1.75 seconds are spent in GC (inefficient object
> > allocation?)
> >
> >
> >
> > *Count number of rows with a particular key:*
> >
> >- LZO
> >- lzoFile("/path/to/lzo").filter(_.split("\\|")(0) ==
> > "1234567890").count
> >   - 73.988897511s
> >- Parquet
> >- sqlContext.parquetFile("/path/to/parquet").where('COL ===
> >   1234567890).count
> >   - 293.410470418s
> >- Parquet (hand-tuned to count on just one column)
> >- sqlContext.parquetFile("/path/to/parquet").where('COL ===
> >   1234567890).select('IDCOL).count
> >   - 1.160449187s
> >
> > It looks like currently the .count() on parquet is handled incredibly
> > inefficiently and all the columns are materialized.  But if I select just
> > that relevant column and then count, then the column-oriented storage of
> > Parquet really shines.
> >
> > There ought to be a potential optimization here such that a .count() on a
> > SchemaRDD backed by Parquet doesn't require re-assembling the rows, as
> > that's expensive.  I don't think .count() is handled specially in
> > SchemaRDDs, but it seems ripe for optimization.
> >
> >
> > *Count number of distinct values in a column*
> >
> >- LZO
> >- lzoFile("/path/to/lzo").map(sel(0)).distinct.count
> >   - 115.582916866s
> >- Parquet
> >-
> sqlContext.parquetFile("/path/to/parquet").select('COL).distinct.count
> >   - 16.839004826 s
> >
> > It turns out column selectivity is very useful!  I'm guessing that if I
> > could get byte counts read out of HDFS, that would just about match up
> with
> > the difference in read times.
> >
> >
> >
> >
> > Any thoughts on how to embed the knowledge of my hand-tuned additional
> > .select('IDCOL)
> > into Catalyst?
> >
> >
> > Thanks again for all the hard work and prep for the 1.0 release!
> >
> > Andrew
> >
>


Re: Updating docs for running on Mesos

2014-05-13 Thread Matei Zaharia
I’ll ask the Mesos folks about this. Unfortunately it might be tough to link 
only to a company’s builds; but we can perhaps include them in addition to 
instructions for building Mesos from Apache.

Matei

On May 12, 2014, at 11:55 PM, Gerard Maas  wrote:

> Andrew,
> 
> Mesosphere has binary releases here:
> http://mesosphere.io/downloads/
> 
> (Anecdote: I actually burned a CPU building Mesos from source. No kidding -
> it was coming, as the laptop was crashing from time to time, but the mesos
> build was that one drop too much)
> 
> kr, Gerard.
> 
> 
> 
> On Tue, May 13, 2014 at 6:57 AM, Andrew Ash  wrote:
> 
>> As far as I know, the upstream doesn't release binaries, only source code.
>> The downloads page  for 0.18.0 only
>> has a source tarball.  Is there a binary release somewhere from Mesos that
>> I'm missing?
>> 
>> 
>> On Sun, May 11, 2014 at 2:16 PM, Patrick Wendell 
>> wrote:
>> 
>>> Andrew,
>>> 
>>> Updating these docs would be great! I think this would be a welcome
>> change.
>>> 
>>> In terms of packaging, it would be good to mention the binaries
>>> produced by the upstream project as well, in addition to Mesosphere.
>>> 
>>> - Patrick
>>> 
>>> On Thu, May 8, 2014 at 12:51 AM, Andrew Ash 
>> wrote:
 The docs for how to run Spark on Mesos have changed very little since
 0.6.0, but setting it up is much easier now than then.  Does it make
>>> sense
 to revamp with the below changes?
 
 
 You no longer need to build mesos yourself as pre-built versions are
 available from Mesosphere: http://mesosphere.io/downloads/
 
 And the instructions guide you towards compiling your own distribution
>> of
 Spark, when you can use the prebuilt versions of Spark as well.
 
 
 I'd like to split that portion of the documentation into two sections,
>> a
 build-from-scratch section and a use-prebuilt section.  The new outline
 would look something like this:
 
 
 *Running Spark on Mesos*
 
 Installing Mesos
 - using prebuilt (recommended)
 - pointer to mesosphere's packages
 - from scratch
 - (similar to current)
 
 
 Connecting Spark to Mesos
 - loading distribution into an accessible location
 - Spark settings
 
 Mesos Run Modes
 - (same as current)
 
 Running Alongside Hadoop
 - (trim this down)
 
 
 
 Does that work for people?
 
 
 Thanks!
 Andrew
 
 
 PS Basically all the same:
 
 http://spark.apache.org/docs/0.6.0/running-on-mesos.html
 http://spark.apache.org/docs/0.6.2/running-on-mesos.html
 http://spark.apache.org/docs/0.7.3/running-on-mesos.html
 http://spark.apache.org/docs/0.8.1/running-on-mesos.html
 http://spark.apache.org/docs/0.9.1/running-on-mesos.html
 
>>> 
>> https://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/running-on-mesos.html
>>> 
>> 



Is this supported? : Spark on Windows, Hadoop YARN on Linux.

2014-05-13 Thread innowireless TaeYun Kim
I'm trying to run spark-shell on Windows that uses Hadoop YARN on Linux.
Specifically, the environment is as follows:

- Client
  - OS: Windows 7
  - Spark version: 1.0.0-SNAPSHOT (git cloned 2014.5.8)
- Server
  - Platform: hortonworks sandbox 2.1

I had to modify the Spark source code to apply
https://issues.apache.org/jira/browse/YARN-1824 so that the cross-platform
issues could be addressed (that is, change $() to $$(), and File.pathSeparator
to ApplicationConstants.CLASS_PATH_SEPARATOR).
Seeing this, I suspect that the Spark code is not currently prepared to
support cross-platform submission, that is, Spark on Windows -> Hadoop YARN
on Linux.
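
(For anyone hitting the same thing, a rough illustration of the kind of change
meant above; this is not the actual Spark patch, and the classpath entries are
made up, just the YARN-1824 pattern:)

import java.io.File
import org.apache.hadoop.yarn.api.ApplicationConstants
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment

// Illustrative classpath entries only.
val entries = Seq("{{PWD}}", "{{PWD}}/__spark__.jar")

// Before: expanded on the submitting (Windows) machine, so wrong on Linux nodes.
val javaCmdBefore = Environment.JAVA_HOME.$() + "/bin/java"
val classPathBefore = entries.mkString(File.pathSeparator)

// After: cross-platform placeholders left for the Linux NodeManager to resolve.
val javaCmdAfter = Environment.JAVA_HOME.$$() + "/bin/java"
val classPathAfter = entries.mkString(ApplicationConstants.CLASS_PATH_SEPARATOR)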

Anyway, after the modification and some configuration tweaks, the yarn-client
mode spark-shell submitted from Windows 7 at least seems to try to start, but
the ApplicationMaster fails to register.
The YARN server log is as follows ('owner' is the user name of the Windows 7
machine):

Log Type: stderr
Log Length: 1356
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.
14/05/12 01:13:54 INFO YarnSparkHadoopUtil: Using Spark's default log4j
profile: org/apache/spark/log4j-defaults.properties
14/05/12 01:13:54 INFO SecurityManager: Changing view acls to: yarn,owner
14/05/12 01:13:54 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(yarn, owner)
14/05/12 01:13:55 INFO Slf4jLogger: Slf4jLogger started
14/05/12 01:13:56 INFO Remoting: Starting remoting
14/05/12 01:13:56 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkyar...@sandbox.hortonworks.com:47074]
14/05/12 01:13:56 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkyar...@sandbox.hortonworks.com:47074]
14/05/12 01:13:56 INFO RMProxy: Connecting to ResourceManager at
/0.0.0.0:8030
14/05/12 01:13:56 INFO ExecutorLauncher: ApplicationAttemptId:
appattempt_1399856448891_0018_01
14/05/12 01:13:56 INFO ExecutorLauncher: Registering the ApplicationMaster
14/05/12 01:13:56 WARN Client: Exception encountered while connecting to the
server : org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN]

How can I handle this error?
Or, should I give up and use Linux for my client machine?
(I want to use Windows for the client, since it's more comfortable for me to
develop applications there.)
BTW, I'm a newbie with Spark and Hadoop.

Thanks in advance.