Re: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Patrick Wendell
Is this the serialization throughput per task or the serialization throughput for all the tasks? On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond raymond@intel.com wrote: Hi I am running a WordCount program which count words from HDFS, and I noticed that the serializer part of code

Re: JavaSparkConf

2014-04-29 Thread Patrick Wendell
This class was made to be Java-friendly so that we wouldn't have to use two versions. The class itself is simple. But I agree adding Java setters would be nice. On Tue, Apr 29, 2014 at 8:32 PM, Soren Macbeth so...@yieldbot.com wrote: There is a JavaSparkContext, but no JavaSparkConf object. I

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Patrick Wendell
You are right, once you sort() the RDD, then yes it has a well defined ordering. But that ordering is lost as soon as you transform the RDD, including if you union it with another RDD. On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim m...@palantir.com wrote: Hi Patrick, I'm a little confused
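
A minimal spark-shell sketch of the behaviour described above (assumes a running shell, so `sc` already exists); the ordering produced by sortByKey() is not promised to survive the union, so re-sort if you need one:

    // assumes spark-shell; sc is provided by the shell
    val a = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"))).sortByKey()
    val b = sc.parallelize(Seq((30, "z"), (10, "x"), (20, "y")))
    // union just concatenates the partitions of a and b, so the sorted
    // ordering of a is no longer guaranteed on u
    val u = a.union(b)
    // re-establish an ordering explicitly when one is required
    val ordered = u.sortByKey().collect()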

Re: pySpark memory usage

2014-04-28 Thread Patrick Wendell
the error first before the reader knows what is going on. Anyways maybe if you have a simpler solution you could sketch it out in the JIRA and we could talk over there. The current proposal in the JIRA is somewhat complicated... - Patrick On Mon, Apr 28, 2014 at 1:01 PM, Jim Blomo jim.bl

Re: Running a spark-submit compatible app in spark-shell

2014-04-28 Thread Patrick Wendell
What about if you run ./bin/spark-shell --driver-class-path=/path/to/your/jar.jar I think either this or the --jars flag should work, but it's possible there is a bug with the --jars flag when calling the Repl. On Mon, Apr 28, 2014 at 4:30 PM, Roger Hoover roger.hoo...@gmail.com wrote: A

Re: launching concurrent jobs programmatically

2014-04-28 Thread Patrick Wendell
You can also accomplish this by just having a separate service that submits multiple jobs to a cluster where those jobs e.g. use different jars. - Patrick On Mon, Apr 28, 2014 at 4:44 PM, Andrew Ash and...@andrewash.com wrote: For the second question, you can submit multiple jobs through

Re: compile spark 0.9.1 in hadoop 2.2 above exception

2014-04-24 Thread Patrick Wendell
Try running sbt/sbt clean and re-compiling. Any luck? On Thu, Apr 24, 2014 at 5:33 PM, martin.ou martin...@orchestrallinc.cn wrote: an exception occurs when compiling spark 0.9.1 using sbt, env: hadoop 2.3 1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly 2. found Exception:

Re: running tests selectively

2014-04-20 Thread Patrick Wendell
I put some notes in this doc: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools On Sun, Apr 20, 2014 at 8:58 PM, Arun Ramakrishnan sinchronized.a...@gmail.com wrote: I would like to run some of the tests selectively. I am in branch-1.0 Tried the following two

Re: Task splitting among workers

2014-04-20 Thread Patrick Wendell
For a HadoopRDD, first the spark scheduler calculates the number of tasks based on input splits. Usually people use this with HDFS data so in that case it's based on HDFS blocks. If the HDFS datanodes are co-located with the Spark cluster then it will try to run the tasks on the data node that
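
A small illustration of that mapping, assuming a spark-shell with `sc` and a hypothetical HDFS path:

    // one partition is created per input split, which for plain files on
    // HDFS usually means one per block (the path below is made up)
    val logs = sc.textFile("hdfs://namenode:8020/data/logs")
    println(logs.partitions.size)
    // a minimum number of partitions can be requested; Spark may still use more
    val logs64 = sc.textFile("hdfs://namenode:8020/data/logs", 64)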

Re: Hybrid GPU CPU computation

2014-04-11 Thread Patrick Grinaway
I've actually done it using PySpark and python libraries which call cuda code, though I've never done it from scala directly. The only major challenge I've hit is assigning tasks to GPUs on multi-GPU machines. Sent from my iPhone On Apr 11, 2014, at 8:38 AM, Jaonary Rabarisoa

Re: Spark on YARN performance

2014-04-11 Thread Patrick Wendell
To reiterate what Tom was saying - the code that runs inside of Spark on YARN is exactly the same code that runs in any deployment mode. There shouldn't be any performance difference once your application starts (assuming you are comparing apples-to-apples in terms of hardware). The differences

Re: programmatic way to tell Spark version

2014-04-10 Thread Patrick Wendell
: Hey Patrick, I've created SPARK-1458 https://issues.apache.org/jira/browse/SPARK-1458 to track this request, in case the team/community wants to implement it in the future. Nick On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: No use case at the moment

Re: programmatic way to tell Spark version

2014-04-10 Thread Patrick Wendell
Pierre - I'm not sure that would work. I just opened a Spark shell and did this: scala classOf[SparkContext].getClass.getPackage.getImplementationVersion res4: String = 1.7.0_25 It looks like this is the JVM version. - Patrick On Thu, Apr 10, 2014 at 2:08 PM, Pierre Borckmans pierre.borckm
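
The extra .getClass in the snippet above is what makes the JVM version come back: classOf[SparkContext].getClass is a Class[Class[_]] whose package is java.lang. A minimal spark-shell sketch of the alternatives; note the manifest-based lookup only works if the spark-core jar actually carries an Implementation-Version entry, and sc.version only exists from Spark 1.0 onward (the SPARK-1458 request above):

    import org.apache.spark.SparkContext
    // asks about the spark-core package rather than java.lang; returns null
    // if the jar's manifest has no Implementation-Version entry
    val v = classOf[SparkContext].getPackage.getImplementationVersion
    // from Spark 1.0 onward the context exposes the version directly:
    // sc.version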

Re: trouble with join on large RDDs

2014-04-07 Thread Patrick Wendell
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: I am running the latest version of PySpark branch-0.9 and having some trouble with join. One RDD is about 100G (25GB compressed and serialized in memory) with 130K records, the other RDD is about 10G (2.5G

Re: Largest Spark Cluster

2014-04-04 Thread Patrick Wendell
and on jobs that crunch hundreds of terabytes (uncompressed) of data. - Patrick On Fri, Apr 4, 2014 at 12:05 PM, Parviz Deyhim pdey...@gmail.com wrote: Spark community, What's the size of the largest Spark cluster ever deployed? I've heard Yahoo is running Spark on several hundred nodes

Re: How to create a RPM package

2014-04-04 Thread Patrick Wendell
in the community has feedback from trying this. - Patrick On Fri, Apr 4, 2014 at 12:43 PM, Rahul Singhal rahul.sing...@guavus.com wrote: Hi Christophe, Thanks for your reply and the spec file. I have solved my issue for now. I didn't want to rely on building spark using the spec file (%build

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-02 Thread Patrick Wendell
(default-cli) on project spark-0.9.0-incubating: Error reading assemblies: No assembly descriptors found. - [Help 1] upon running mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean assembly:assembly On Apr 1, 2014, at 4:13 PM, Patrick Wendell pwend...@gmail.com wrote: Do you get the same

Re: Spark output compression on HDFS

2014-04-02 Thread Patrick Wendell
For textFile I believe we overload it and let you set a codec directly: https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59 For saveAsSequenceFile yep, I think Mark is right, you need an option. On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra
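
A short sketch of the codec overload mentioned above (assumes spark-shell; the output path is made up):

    import org.apache.hadoop.io.compress.GzipCodec
    // saveAsTextFile accepts a compression codec class as a second argument
    val counts = sc.parallelize(Seq(("spark", 10), ("hdfs", 3)))
    counts.saveAsTextFile("hdfs://namenode:8020/out/counts-gz", classOf[GzipCodec])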

Re: Resilient nature of RDD

2014-04-02 Thread Patrick Wendell
The driver stores the meta-data associated with the partition, but the re-computation will occur on an executor. So if several partitions are lost, e.g. due to a few machines failing, the re-computation can be striped across the cluster making it fast. On Wed, Apr 2, 2014 at 11:27 AM, David

Re: Is there a way to get the current progress of the job?

2014-04-02 Thread Patrick Wendell
of functionality and something we might, e.g. want to change the API of over time. - Patrick On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com wrote: What I'd like is a way to capture the information provided on the stages page (i.e. cluster:4040/stages via IndexPage). Looking

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
Do you get the same problem if you build with maven? On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey vipan...@gmail.com wrote: SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly That's all I do. On Apr 1, 2014, at 11:41 AM, Patrick Wendell pwend...@gmail.com wrote: Vidal - could you show

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Patrick Grinaway
Also in NYC, definitely interested in a spark meetup! Sent from my iPhone On Mar 31, 2014, at 3:07 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: Happy to help with an NYC meet up (just emailed Andy). I recently moved to VA, but am back in NYC quite often, and have been turning several

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-03-31 Thread Patrick Wendell
dependencies including the exact Spark version and other libraries. - Patrick On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey vipan...@gmail.com wrote: I'm using ScalaBuff (which depends on protobuf2.5) and facing the same issue. any word on this one? On Mar 27, 2014, at 6:41 PM, Kanwaldeep kanwal

Re: Spark webUI - application details page

2014-03-30 Thread Patrick Wendell
This will be a feature in Spark 1.0 but is not yet released. In 1.0 Spark applications can persist their state so that the UI can be reloaded after they have completed. - Patrick On Sun, Mar 30, 2014 at 10:30 AM, David Thomas dt5434...@gmail.com wrote: Is there a way to see 'Application
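
For reference, a sketch of the Spark 1.0 event-log settings behind that feature, set in an application's driver; the log directory is hypothetical and the configuration keys are as introduced in Spark 1.0:

    import org.apache.spark.{SparkConf, SparkContext}
    // writes an event log that the UI (and history server) can replay
    // after the application finishes
    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs://namenode:8020/spark-events")
    val sc = new SparkContext(conf)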

Re: Announcing Spark SQL

2014-03-27 Thread Patrick Wendell
to the respective cassandra columns. I think all of this would be fairly easy to implement on SchemaRDD and likely will make it into Spark 1.1 - Patrick On Wed, Mar 26, 2014 at 10:59 PM, Rohit Rai ro...@tuplejump.com wrote: Great work guys! Have been looking forward to this . . . In the blog it mentions

Re: Building Spark 0.9.x for CDH5 with mrv1 installation (Protobuf 2.5 upgrade)

2014-03-26 Thread Patrick Wendell
I'm not sure exactly how your cluster is configured. But as far as I can tell Cloudera's MR1 CDH5 dependencies are against Hadoop 2.3. I'd just find the exact CDH version you have and link against the `mr1` version of their published dependencies in that version. So I think you want

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-25 Thread Patrick Wendell
Starting with Spark 0.9 the protobuf dependency we use is shaded and cannot interfere with other protobuf libraries including those in Hadoop. Not sure what's going on in this case. Would someone who is having this problem post exactly how they are building spark? - Patrick On Fri, Mar 21, 2014

Re: How many partitions is my RDD split into?

2014-03-24 Thread Patrick Wendell
Ah we should just add this directly in pyspark - it's as simple as the code Shivaram just wrote. - Patrick On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman shivaram.venkatara...@gmail.com wrote: There is no direct way to get this in pyspark, but you can get it from the underlying java

Re: combining operations elegantly

2014-03-23 Thread Patrick Wendell
... but that's not quite released yet :) - Patrick On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers ko...@tresata.com wrote: I currently typically do something like this: scala val rdd = sc.parallelize(1 to 10) scala import com.twitter.algebird.Operators._ scala import com.twitter.algebird.{Max, Min
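
The thread above is about computing several aggregates over an RDD in one pass; a plain-Spark sketch without the Algebird dependency, using aggregate to fold max and min together:

    // assumes spark-shell; max and min computed in a single pass
    val rdd = sc.parallelize(1 to 10)
    val (mx, mn) = rdd.aggregate((Int.MinValue, Int.MaxValue))(
      (acc, x) => (math.max(acc._1, x), math.min(acc._2, x)),  // fold each value in
      (a, b) => (math.max(a._1, b._1), math.min(a._2, b._2))   // merge partition results
    )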

Re: How many partitions is my RDD split into?

2014-03-23 Thread Patrick Wendell
if you do a highly selective filter on an RDD. For instance, you filter out one day of data from a dataset of a year. - Patrick On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra m...@clearstorydata.com wrote: It's much simpler: rdd.partitions.size On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
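
A sketch of the use case just described: a very selective filter leaves most partitions nearly empty, and coalesce shrinks the partition count afterwards (the path and date prefix are made up):

    val year = sc.textFile("hdfs://namenode:8020/events/2014")   // hypothetical dataset
    val oneDay = year.filter(_.startsWith("2014-03-23"))
    println(oneDay.partitions.size)    // unchanged by the filter
    val compact = oneDay.coalesce(8)   // fewer, fuller partitions
    println(compact.partitions.size)   // 8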

Re: No space left on device exception

2014-03-23 Thread Patrick Wendell
Ognen - just so I understand. The issue is that there weren't enough inodes and this was causing a No space left on device error? Is that correct? If so, that's good to know because it's definitely counter intuitive. On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski og...@nengoiksvelzud.com wrote:

Re: slf4j and log4j loop

2014-03-16 Thread Patrick Wendell
This is not released yet but we're planning to cut a 0.9.1 release very soon (e.g. most likely this week). In the meantime you'll have to check out branch-0.9 of Spark and publish it locally then depend on the snapshot version. Or just wait it out... On Fri, Mar 14, 2014 at 2:01 PM, Adrian Mocanu

Re: Round Robin Partitioner

2014-03-13 Thread Patrick Wendell
itself and override getPreferredLocations. Keep in mind this is tricky because the set of executors might change during the lifetime of a Spark job. - Patrick On Thu, Mar 13, 2014 at 11:50 AM, David Thomas dt5434...@gmail.com wrote: Is it possible to parition the RDD elements in a round robin
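
The getPreferredLocations route above is about pinning work to particular executors; if the goal is only to spread keyed records evenly, a simpler sketch is a custom Partitioner over a record index (this does not control which executor runs each partition):

    import org.apache.spark.Partitioner
    // partitions records by their position modulo the partition count
    class IndexPartitioner(val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = (key.asInstanceOf[Long] % numPartitions).toInt
    }
    val data = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"))
    val spread = data.zipWithIndex().map(_.swap).partitionBy(new IndexPartitioner(3))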

Re: [External] Re: no stdout output from worker

2014-03-10 Thread Patrick Wendell
Hey Sen, Suarav is right, and I think all of your print statements are inside of the driver program rather than inside of a closure. How are you running your program (i.e. what do you run that starts this job)? Where you run the driver you should expect to see the output. - Patrick On Mon, Mar
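
A small sketch of the distinction drawn above, with the print locations labelled:

    // assumes spark-shell; sc already exists
    val words = sc.parallelize(Seq("spark", "hdfs", "yarn"))
    println("runs in the driver, appears in the driver's console")
    words.foreach { w =>
      // runs in a closure on an executor, so it lands in that worker's
      // stdout file rather than in the driver's console
      println(w)
    }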

Re: Too many open files exception on reduceByKey

2014-03-10 Thread Patrick Wendell
change so it won't help the ulimit problem. This means you'll have to use fewer reducers (e.g. pass reduceByKey a number of reducers) or use fewer cores on each machine. - Patrick On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah matthew.c.ch...@gmail.com wrote: Hi everyone, My team (cc'ed
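
The "pass reduceByKey a number of reducers" suggestion looks like the following; the counts are only illustrative:

    // assumes spark-shell; cap the shuffle at 32 reduce partitions
    val pairs = sc.parallelize(1 to 100000).map(i => (i % 1000, 1))
    val counts = pairs.reduceByKey(_ + _, 32)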

Re: no stdout output from worker

2014-03-09 Thread Patrick Wendell
on the worker machines. If you see stderr but not stdout that's a bit of a puzzler since they both go through the same mechanism. - Patrick On Sun, Mar 9, 2014 at 2:32 PM, Sen, Ranjan [USA] sen_ran...@bah.com wrote: Hi I have some System.out.println in my Java code that is working ok in a local environment

Re: Python 2.7 + numpy break sortByKey()

2014-03-06 Thread Patrick Wendell
The difference between your two jobs is that take() is optimized and only runs on the machine where you are using the shell, whereas sortByKey requires using many machines. It seems like maybe python didn't get upgraded correctly on one of the slaves. I would look in the /root/spark/work/ folder

Re: Unable to redirect Spark logs to slf4j

2014-03-05 Thread Patrick Wendell
- Patrick On Wed, Mar 5, 2014 at 1:52 PM, Sergey Parhomenko sparhome...@gmail.com wrote: Hi Patrick, Thanks for the patch. I tried building a patched version of spark-core_2.10-0.9.0-incubating.jar but the Maven build fails: [ERROR] /home/das/Work/thx/incubator-spark/core/src/main/scala/org
