Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Patrick Wendell
Hey Marcelo, Yes - I agree. That one trickled in just as I was packaging this RC. However, I still put this out here to allow people to test the existing fixes, etc. - Patrick On Wed, Mar 4, 2015 at 9:26 AM, Marcelo Vanzin van...@cloudera.com wrote: I haven't tested the rc2 bits yet, but I'd

Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Marcelo Vanzin
I haven't tested the rc2 bits yet, but I'd consider https://issues.apache.org/jira/browse/SPARK-6144 a serious regression from 1.2 (since it affects existing addFile() functionality if the URL is hdfs:...). Will test other parts separately. On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell

Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Marcelo Vanzin
-1 (non-binding) because of SPARK-6144. But aside from that I ran a set of tests on top of standalone and yarn and things look good. On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.0! The

short jenkins 7am downtime tomorrow morning (3-5-15)

2015-03-04 Thread shane knapp
the master and workers need some system and package updates, and I'll also be rebooting the machines. This shouldn't take very long, and I expect Jenkins to be back up and building by 9am at the *latest*. Important note: I will NOT be updating Jenkins or any of the plugins

Re: enum-like types in Spark

2015-03-04 Thread Michael Armbrust
#4 with a preference for CamelCaseEnums On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley jos...@databricks.com wrote: another vote for #4 People are already used to adding () in Java. On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch java...@gmail.com wrote: #4 but with MemoryOnly (more

enum-like types in Spark

2015-03-04 Thread Xiangrui Meng
Hi all, There are many places where we use enum-like types in Spark, but in different ways. Every approach has both pros and cons. I wonder whether there should be an “official” approach for enum-like types in Spark. 1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc) * All types
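The first option Xiangrui lists (Scala's `Enumeration`) can be sketched as follows. This is a minimal illustration of the pattern, not Spark's actual definition; the value names are borrowed from `SchedulingMode` for flavor.

```scala
// Approach #1 from the thread: Scala's built-in Enumeration.
// Values get an auto-assigned id and a name, and the set is iterable.
object SchedulingMode extends Enumeration {
  type SchedulingMode = Value
  val FAIR, FIFO, NONE = Value
}

val all = SchedulingMode.values          // ValueSet of the three members
val names = all.mkString(", ")
```

One commonly cited downside of this approach is that all `Enumeration` values share the erased type `Enumeration#Value`, so exhaustiveness checking and Java interop are weaker than with case objects.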

Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Patrick Wendell
Hey Mingyu, I think it's broken out separately so we can record the time taken to serialize the result. Once we've serialized it once, the second serialization should be really cheap since it's just wrapping something that has already been turned into a byte buffer. Do you see a specific issue

Re: enum-like types in Spark

2015-03-04 Thread Joseph Bradley
another vote for #4 People are already used to adding () in Java. On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch java...@gmail.com wrote: #4 but with MemoryOnly (more scala-like) http://docs.scala-lang.org/style/naming-conventions.html Constants, Values, Variable and Methods Constant

Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Mingyu Kim
The concern is really just the runtime overhead and memory footprint of Java-serializing an already-serialized byte array again. We originally noticed this when we were using RDD.toLocalIterator() which serializes the entire 64MB partition. We worked around this issue by kryo-serializing and
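The overhead being discussed can be illustrated with a small sketch in plain Scala: Java-serializing a payload once, then Java-serializing the resulting byte array again. The helper name `javaSerialize` is illustrative, not Spark's code; the point is that the second pass mostly copies the already-serialized bytes plus a small fixed header.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hedged illustration of the double serialization in this thread:
// the first pass turns the result into bytes; the second pass wraps
// that byte array, adding only a small fixed overhead on top.
def javaSerialize(obj: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(obj)
  oos.close()
  bos.toByteArray
}

val inner = javaSerialize("some task result")  // first serialization
val outer = javaSerialize(inner)               // second pass wraps the bytes
```

The CPU cost of the second pass is small, but as Mingyu notes, it does allocate a second copy of the bytes, which matters for large partitions.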

Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Mingyu Kim
Hi all, It looks like the result of a task is serialized twice, once by the serializer (i.e. Java/Kryo depending on configuration) and once again by the closure serializer (i.e. Java). To link the actual code, the first one:

Re: enum-like types in Spark

2015-03-04 Thread Aaron Davidson
I'm cool with #4 as well, but make sure we dictate that the values should be defined within an object with the same name as the enumeration (like we do for StorageLevel). Otherwise we may pollute a higher namespace. e.g. we SHOULD do: trait StorageLevel object StorageLevel { case object
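The nesting Aaron describes can be sketched as below. This is a minimal illustration of option #4, assuming illustrative member names; it is not Spark's actual `StorageLevel` definition.

```scala
// Option #4: a sealed trait with case objects, nested inside an object
// of the same name so the values don't pollute the enclosing namespace.
sealed trait StorageLevel
object StorageLevel {
  case object MemoryOnly extends StorageLevel
  case object MemoryAndDisk extends StorageLevel
  case object DiskOnly extends StorageLevel
}

// Pattern matching is exhaustive thanks to the sealed trait.
def describe(level: StorageLevel): String = level match {
  case StorageLevel.MemoryOnly    => "memory"
  case StorageLevel.MemoryAndDisk => "memory+disk"
  case StorageLevel.DiskOnly      => "disk"
}
```

Because the trait is sealed, the compiler warns on non-exhaustive matches, which is one of the advantages over `Enumeration` raised elsewhere in this thread.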

Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Sean Owen
I think we will have to fix https://issues.apache.org/jira/browse/SPARK-5143 as well before the final 1.3.x. But yes everything else checks out for me, including sigs and hashes and building the source release. I have been following JIRA closely and am not aware of other blockers besides the

Re: enum-like types in Spark

2015-03-04 Thread Stephen Boesch
#4 but with MemoryOnly (more scala-like) http://docs.scala-lang.org/style/naming-conventions.html Constants, Values, Variable and Methods Constant names should be in upper camel case. That is, if the member is final, immutable and it belongs to a package object or an object, it may be

Re: enum-like types in Spark

2015-03-04 Thread Patrick Wendell
I like #4 as well and agree with Aaron's suggestion. - Patrick On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson ilike...@gmail.com wrote: I'm cool with #4 as well, but make sure we dictate that the values should be defined within an object with the same name as the enumeration (like we do for

Fwd: Unable to Read/Write Avro RDD on cluster.

2015-03-04 Thread ๏̯͡๏
I am trying to read an Avro RDD, transform it, and write it back. I am able to run it fine locally, but when I run on the cluster, I see issues with Avro. export SPARK_HOME=/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1 export SPARK_YARN_USER_ENV=CLASSPATH=/apache/hadoop/conf export

Re: ideas for MLlib development

2015-03-04 Thread Robert Dodier
Thanks for your reply, Evan. It may make sense to have a more general Gibbs sampling framework, but it might be good to have a few desired applications in mind (e.g. higher level models that rely on Gibbs) to help API design, parallelization strategy, etc. I think I'm more interested in a

Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Patrick Wendell
Yeah, it will result in a second serialized copy of the array (costing some memory). But the computational overhead should be very small. The absolute worst case here will be when doing a collect() or something similar that just bundles the entire partition. - Patrick On Wed, Mar 4, 2015 at 5:47

Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Krishna Sankar
It is the LR over car-data at https://github.com/xsankar/cloaked-ironman. 1.2.0 gives Mean Squared Error = 40.8130551358 1.3.0 gives Mean Squared Error = 105.857603953 I will verify it one more time tomorrow. Cheers k/ On Tue, Mar 3, 2015 at 11:28 PM, Xiangrui Meng men...@gmail.com wrote: On

Spark Streaming and SchemaRDD usage

2015-03-04 Thread Haopu Wang
Hi, in the roadmap of Spark in 2015 (link: http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx), I saw SchemaRDD is designed to be the basis of BOTH Spark Streaming and Spark SQL. My question is: what's the typical usage of SchemaRDD in a Spark Streaming application?

Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Robin East
+1 (subject to comments on ec2 issues below) machine 1: Macbook Air, OSX 10.10.2 (Yosemite), Java 8 machine 2: iMac, OSX 10.8.4, Java 7 1. mvn clean package -DskipTests (33min/13min) 2. ran SVM benchmark https://github.com/insidedctm/spark-mllib-benchmark EC2 issues: 1) Unable to

Re: Google Summer of Code - Quick Query

2015-03-04 Thread Ulrich Stärk
Hi Manoj, this question is best asked on the Spark mailing lists (copied). From a formal point of view, all that counts is your proposal in Melange once applications start, but your mentor or the project you wish to contribute to may have additional requirements. Cheers, Uli On 2015-03-03