Re: Requirements of objects stored in RDDs

2014-05-13 Thread Andrew Ash
An RDD can hold objects of any type. If you generally think of it as a distributed Collection, then you won't ever be that far off. As far as serialization, the contents of an RDD must be serializable. There are two serialization libraries you can use with Spark: normal Java serialization or
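The "contents must be serializable" requirement can be illustrated without Spark at all: with the default serializer, Spark round-trips RDD elements through plain Java serialization when shipping them between nodes. A minimal sketch (the `Record` class and `roundTrip` helper are illustrative, not Spark API):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// A class must be Serializable (case classes already are) before its
// instances can be stored in an RDD and shipped between nodes.
case class Record(id: Int, name: String)

object SerializationCheck {
  // Round-trip an object through Java serialization, as Spark's default
  // serializer does when moving RDD elements across the cluster.
  def roundTrip[T](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val original = Record(1, "alpha")
    val copy = roundTrip(original)
    println(copy == original) // case classes compare by value
  }
}
```

An element type that does not implement `Serializable` would make `writeObject` throw `NotSerializableException`, which is the same failure you see at job time in Spark.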

Re: Updating docs for running on Mesos

2014-05-13 Thread Andrew Ash
As far as I know, the upstream doesn't release binaries, only source code. The downloads page https://mesos.apache.org/downloads/ for 0.18.0 only has a source tarball. Is there a binary release somewhere from Mesos that I'm missing? On Sun, May 11, 2014 at 2:16 PM, Patrick Wendell

Re: Kryo not default?

2014-05-13 Thread Reynold Xin
The main reason is that it doesn't always work (e.g. sometimes an application program already has special serialization / externalization written for Java that doesn't work with Kryo). On Mon, May 12, 2014 at 5:47 PM, Anand Avati av...@gluster.org wrote: Hi, Can someone share the reason why Kryo
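Because Kryo is opt-in, applications that know their types serialize cleanly enable it explicitly. A sketch of the relevant configuration (the registrator class name is a hypothetical example; the two property keys are Spark's):

```
# spark-defaults.conf (or pass via SparkConf.set): opt into Kryo explicitly
spark.serializer        org.apache.spark.serializer.KryoSerializer

# Classes with special serialization needs are registered through a
# KryoRegistrator subclass you provide:
spark.kryo.registrator  com.example.MyKryoRegistrator
```

Applications whose classes rely on custom Java `writeObject`/`readExternal` logic are exactly the ones this switch can break, which is the concern raised above.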

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Reynold Xin
Thanks for the experiments and analysis! I think Michael already submitted a patch that avoids scanning all columns for count(*) or count(1). On Mon, May 12, 2014 at 9:46 PM, Andrew Ash and...@andrewash.com wrote: Hi Spark devs, First of all, huge congrats on the parquet integration with

Is this supported? : Spark on Windows, Hadoop YARN on Linux.

2014-05-13 Thread innowireless TaeYun Kim
I'm trying to run spark-shell on Windows against Hadoop YARN on Linux. Specifically, the environment is as follows: - Client - OS: Windows 7 - Spark version: 1.0.0-SNAPSHOT (git cloned 2014.5.8) - Server - Platform: Hortonworks Sandbox 2.1 I had to modify the Spark source code to apply

Re: Updating docs for running on Mesos

2014-05-13 Thread Matei Zaharia
I’ll ask the Mesos folks about this. Unfortunately it might be tough to link only to a company’s builds; but we can perhaps include them in addition to instructions for building Mesos from Apache. Matei On May 12, 2014, at 11:55 PM, Gerard Maas gerard.m...@gmail.com wrote: Andrew,

Re: Kryo not default?

2014-05-13 Thread Dmitriy Lyubimov
On Mon, May 12, 2014 at 2:47 PM, Anand Avati av...@gluster.org wrote: Hi, Can someone share the reason why Kryo serializer is not the default? why should it be? On top of that, the only way to serialize a closure into the backend (even now) is Java serialization (which means java serialization
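The closure-serialization point can be demonstrated in plain Scala: function literals implement `Serializable`, so Java serialization can ship them, but only if everything they capture is also serializable. A self-contained sketch (the `serialize` helper is illustrative, not Spark's internal closure cleaner):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureSerialization {
  // Returns true if the object survives Java serialization, which is how
  // Spark ships closures from driver to executors.
  def serialize(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    // A plain closure is fine: Scala function values are Serializable.
    val plain = (x: Int) => x + 1
    println(serialize(plain))

    // A closure capturing a non-serializable object cannot be shipped.
    val handle = new Thread() // Thread is not Serializable
    val capturing = (x: Int) => x + handle.getPriority
    println(serialize(capturing))
  }
}
```

The second case is the familiar `Task not serializable` failure mode in Spark jobs, surfaced here without any cluster involved.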

Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
Reposting here on dev since I didn't see a response on user: I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In the Spark Shell, equals() fails when I use the canonical equals() pattern of match{}, but works when I substitute with isInstanceOf[]. I am using Spark
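For reference, the two equals() styles being compared look like this; in compiled code they behave identically, and the reported divergence appears only in the Spark shell after serialization round-trips (class names here are illustrative, chosen to echo the REPL snippets in this thread):

```scala
class C(val s: String) extends Serializable {
  // Canonical pattern-match form, reported to fail in the Spark shell
  override def equals(other: Any): Boolean = other match {
    case that: C => that.s == s
    case _       => false
  }
  override def hashCode: Int = s.hashCode
}

class D(val s: String) extends Serializable {
  // isInstanceOf form, reported to work in the Spark shell
  override def equals(other: Any): Boolean =
    other.isInstanceOf[D] && other.asInstanceOf[D].s == s
  override def hashCode: Int = s.hashCode
}

object EqualsDemo {
  def main(args: Array[String]): Unit = {
    println(new C("a") == new C("a")) // both forms agree when compiled
    println(new D("a") == new D("a"))
  }
}
```

A plausible culprit, discussed later in the thread, is that the REPL wraps each line in generated classes, so a deserialized instance's class may not be the same `Class` object the pattern match tests against.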

Re: Multinomial Logistic Regression

2014-05-13 Thread DB Tsai
Hi Deb, For K possible outcomes in multinomial logistic regression, we can have K-1 independent binary logistic regression models, in which one outcome is chosen as a pivot and then the other K-1 outcomes are separately regressed against the pivot outcome. See my presentation for technical
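The pivot formulation described above can be sketched in a few lines: with class 0 as the pivot (its weight vector fixed at zero), each of the K-1 remaining classes gets its own weight vector, and P(y = k | x) = exp(w_k . x) / (1 + sum_j exp(w_j . x)), where the 1 in the denominator is the pivot's exp(0 . x). This is an illustration of the formulation, not the MLlib implementation:

```scala
object MultinomialLR {
  // weights holds K-1 weight vectors, one per non-pivot class.
  // Returns K probabilities; index 0 is the pivot class.
  def probabilities(x: Array[Double], weights: Array[Array[Double]]): Array[Double] = {
    // exp(w_k . x) for each non-pivot class k
    val scores = weights.map(w => math.exp(w.zip(x).map { case (a, b) => a * b }.sum))
    // exp(0 . x) = 1 contributes the pivot's term to the normalizer
    val denom = 1.0 + scores.sum
    (Array(1.0) ++ scores).map(_ / denom)
  }
}
```

By construction the K probabilities sum to one, and recovering a full softmax is just a matter of shifting all K weight vectors by the pivot's.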

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Mark Hamstra
There were a few early/test RCs this cycle that were never put to a vote. On Tue, May 13, 2014 at 8:07 AM, Nan Zhu zhunanmcg...@gmail.com wrote: just curious, where is rc4 VOTE? I searched my gmail but didn't find that? On Tue, May 13, 2014 at 9:49 AM, Sean Owen so...@cloudera.com

Re: Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
Thank you for your investigation into this! Just for completeness, I've confirmed it's a problem only in REPL, not in compiled Spark programs. But within REPL, a direct consequence of non-same classes after serialization/deserialization also means that lookup() doesn't work: scala class C(val

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread witgo
-1 The following bug should be fixed: https://issues.apache.org/jira/browse/SPARK-1817 https://issues.apache.org/jira/browse/SPARK-1712 -- Original -- From: Patrick Wendell;pwend...@gmail.com; Date: Wed, May 14, 2014 04:07 AM To:

Re: Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Anand Avati
On Tue, May 13, 2014 at 8:26 AM, Michael Malak michaelma...@yahoo.com wrote: Reposting here on dev since I didn't see a response on user: I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In the Spark Shell, equals() fails when I use the canonical equals() pattern of

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Madhu
I just built rc5 on Windows 7 and tried to reproduce the problem described in https://issues.apache.org/jira/browse/SPARK-1712 It works on my machine: 14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at console:17) finished in 4.548 s 14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Andrew Ash
Thanks for filing -- I'm keeping my eye out for updates on that ticket. Cheers! Andrew On Tue, May 13, 2014 at 2:40 PM, Michael Armbrust mich...@databricks.com wrote: It looks like currently the .count() on parquet is handled incredibly inefficiently and all the columns are materialized.