Re: ETL on pyspark

2014-02-24 Thread Matei Zaharia
collect() means to bring all the data back to the master node, and there might just be too much of it for that. How big is your file? If you can’t bring it back to the master node, try saveAsTextFile to write it out to a filesystem (in parallel). Matei On Feb 24, 2014, at 1:08 PM, Chengi Liu
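
A minimal sketch of the pattern described above, assuming an existing SparkContext sc on the driver and hypothetical HDFS paths:

    // Write the results out from the workers in parallel rather than
    // collect()-ing a large dataset back to the driver.
    sc.textFile("hdfs://namenode:9000/data/input")
      .map(_.toUpperCase)
      .saveAsTextFile("hdfs://namenode:9000/data/output")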

Re: ETL on pyspark

2014-02-24 Thread Matei Zaharia
...@gmail.com wrote: It's around 10 GB big? All I want is to do a frequency count? And then get top 10 entries based on count? How do I do this (again on pyspark)? Thanks On Mon, Feb 24, 2014 at 1:19 PM, Matei Zaharia matei.zaha...@gmail.com wrote: collect() means to bring all the data back
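
One way to express the frequency count plus top-10 step, sketched in Spark's Scala API (the PySpark RDD API mirrors it); sc and the input path are assumptions:

    val counts = sc.textFile("hdfs:///data/input")
      .map(item => (item, 1L))
      .reduceByKey(_ + _)               // distributed frequency count
    // Only the top 10 entries come back to the driver, not the full 10 GB.
    val top10 = counts.map { case (item, n) => (n, item) }
      .sortByKey(ascending = false)
      .take(10)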

Re: Spark + MongoDB

2014-02-18 Thread Matei Zaharia
Very cool, thanks for writing this. I’ll link it from our website. Matei On Feb 18, 2014, at 12:44 PM, Sampo Niskanen sampo.niska...@wellmo.com wrote: Hi, Since getting Spark + MongoDB to work together was not very obvious (at least to me) I wrote a tutorial about it in my blog with an

Re: Kmeans example with floats

2014-02-17 Thread Matei Zaharia
The Vector class is defined to work on doubles right now. You’d have to write your own version for floats. Matei On Feb 17, 2014, at 11:58 AM, agg agalaka...@gmail.com wrote: Hi, I would like to run the spark example with floats instead of doubles. When I change this: def

Re: [0.9.0] MEMORY_AND_DISK_SER not falling back to disk

2014-02-08 Thread Matei Zaharia
This probably means that there’s not enough free memory for the “scratch” space used for computations, so we OOM before the Spark cache decides that it’s full and starts to spill stuff. Try reducing spark.storage.memoryFraction (default is 0.66, try 0.5). Matei On Feb 5, 2014, at 10:29 PM,
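
In the configuration style of that era, the fraction is set as a system property before the SparkContext is created; a sketch:

    // Leave more heap for per-task "scratch" space, less for the cache.
    System.setProperty("spark.storage.memoryFraction", "0.5")
    val sc = new SparkContext("spark://master:7077", "App")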

Re: Hadoop MapReduce on Spark

2014-02-01 Thread Matei Zaharia
It’s fairly easy to take your existing Mapper and Reducer objects and call them within Spark. First, you can use SparkContext.hadoopRDD to read a file with any Hadoop InputFormat (you can even pass it the JobConf you would’ve created in Hadoop). Then use mapPartitions to iterate through each
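
A sketch of that approach, with a placeholder input path and a stand-in for the Mapper logic:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

    val conf = new JobConf()   // the JobConf you would have built for Hadoop
    FileInputFormat.setInputPaths(conf, "hdfs:///data/input")
    val records = sc.hadoopRDD(conf, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text])
    val mapped = records.mapPartitions { iter =>
      // stand-in for calling your existing Mapper's map() on each record
      iter.map { case (offset, line) => (line.toString, 1) }
    }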

Re: Single application using all the cores - preventing other applications from running

2014-01-31 Thread Matei Zaharia
You can set the spark.cores.max property in your application to limit the maximum number of cores it will take. Check out http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling. It’s also possible to control scheduling in more detail within a Spark application,
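
For example, set before the SparkContext is constructed:

    // Cap this application at 8 cores so other applications on the
    // standalone cluster can still be scheduled.
    System.setProperty("spark.cores.max", "8")
    val sc = new SparkContext("spark://master:7077", "CappedApp")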

Re: setting partitioners with hadoop rdds

2014-01-28 Thread Matei Zaharia
Hey Imran, You probably have to create a subclass of HadoopRDD to do this, or some RDD that wraps around the HadoopRDD. It would be a cool feature but HDFS itself has no information about partitioning, so your application needs to track it. Matei On Jan 27, 2014, at 11:57 PM, Imran Rashid

Re: What I am missing from configuration?

2014-01-27 Thread Matei Zaharia
Hi Dana, I think the problem is that your simple.sbt does not add a dependency on hadoop-client for CDH4, so you get a different version of the Hadoop library on your driver application compared to the cluster. Try adding a dependency on hadoop-client version 2.0.0-mr1-cdh4.X.X for your
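
A sketch of the simple.sbt dependency in question; the CDH4 minor version (X.X) is left as in the message, and the Cloudera resolver is an assumption about where these artifacts live:

    resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.X.X"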

Re: Stalling during large iterative PySpark jobs

2014-01-26 Thread Matei Zaharia
Jeremy, do you happen to have a small test case that reproduces it? Is it with the kmeans example that comes with PySpark? Matei On Jan 22, 2014, at 3:03 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: Thanks for the thoughts Matei! I poked at this some more. I ran top on each of the

Re: executor failed, cannot find compute-classpath.sh

2014-01-23 Thread Matei Zaharia
Hi Ken, This is unfortunately a limitation of spark-shell and the way it works on the standalone mode. spark-shell sets an environment variable, SPARK_HOME, which tells Spark where to find its code installed on the cluster. This means that the path on your laptop must be the same as on the

Re: .intersection() method on RDDs?

2014-01-23 Thread Matei Zaharia
, 2014 at 8:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I’d be happy to see this added to the core API. Matei On Jan 23, 2014, at 5:39 PM, Andrew Ash and...@andrewash.com wrote: Ah right of course -- perils of typing code without running it! It feels like this is a pretty core

Re: Running make-distribution.sh .. compilation errors in streaming/api/java/JavaPairDStream.scala

2014-01-22 Thread Matei Zaharia
Try doing a sbt clean before rebuilding. Matei On Jan 22, 2014, at 10:22 AM, Manoj Samel manojsamelt...@gmail.com wrote: See thread below. Reposted as compilation error thread -- Forwarded message -- From: Manoj Samel manojsamelt...@gmail.com Date: Wed, Jan 22, 2014 at

Re: Running make-distribution.sh .. compilation errors in streaming/api/java/JavaPairDStream.scala

2014-01-22 Thread Matei Zaharia
install - still same error. On Wed, Jan 22, 2014 at 10:46 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Try doing a sbt clean before rebuilding. Matei On Jan 22, 2014, at 10:22 AM, Manoj Samel manojsamelt...@gmail.com wrote: See thread below. Reposted as compilation error thread

Re: How to use cluster for large set of linux files

2014-01-22 Thread Matei Zaharia
see the value, just the println does not seems working On Wed, Jan 22, 2014 at 12:39 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Manoj, You’d have to make the files available at the same path on each machine through something like NFS. You don’t need to copy them, though

Re: Lazy evaluation of RDD data transformation

2014-01-21 Thread Matei Zaharia
If you don’t cache the RDD, the computation will happen over and over each time we scan through it. This is done to save memory in that case and because Spark can’t know at the beginning whether you plan to access a dataset multiple times. If you’d like to prevent this, use cache(), or maybe
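
A small illustration, assuming sc and a hypothetical path:

    val data = sc.textFile("hdfs:///data/input")   // nothing computed yet
    data.cache()    // mark the RDD to be kept in memory once computed
    data.count()    // first action: scans the input and fills the cache
    data.count()    // second action: served from memory, no recomputation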

Re: Quality of documentation (rant)

2014-01-20 Thread Matei Zaharia
Hi Ognen, It’s true that the documentation is partly targeting Hadoop users, and that’s something we need to fix. Perhaps the best solution would be some kind of tutorial on “here’s how to set up Spark by hand on EC2”. However it also sounds like you ran into some issues with S3 that it would

Re: Time frame / features in spark 0.9 release ?

2014-01-19 Thread Matei Zaharia
It’s being voted on right now on the dev list. Check out http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-0-9-0-incubating-rc2-td225.html. Matei On Jan 18, 2014, at 11:03 PM, Manoj Samel manojsamelt...@gmail.com wrote: Any time frame and list of enhancements

Re: Stalling during large iterative PySpark jobs

2014-01-14 Thread Matei Zaharia
Hi Jeremy, If you look at the stdout and stderr files on that worker, do you see any earlier errors? I wonder if one of the Python workers crashed earlier. It would also be good to run “top” and see if more memory is used during the computation. I guess the cached RDD itself fits in less than

Re: Spark writing to disk when there's enough memory?!

2014-01-14 Thread Matei Zaharia
Hey Majd, I believe Shark sets up data to spill to disk, even though the default storage level in Spark is memory-only. In terms of those executors, it looks like data distribution was unbalanced across them, possibly due to data locality in HDFS (some of the executors may have had more data).

Re: performance

2014-01-09 Thread Matei Zaharia
Typically you want 2-3 partitions per CPU core to get good load balancing. How big is the data you’re transferring in this case? And have you looked at the machines to see whether they’re spending lots of time on IO, CPU, etc? (Use top or dstat on each machine for this). For large datasets with

Re: is saveAsTextFile in java uses buffered I/O streams?

2014-01-09 Thread Matei Zaharia
It just uses the Hadoop FileSystem API, I don’t think there’s any extra buffering. That API itself may do buffering in the HDFS case, though newer versions of HDFS fix that. Matei On Jan 9, 2014, at 2:54 PM, hussam_jar...@dell.com wrote: Can someone provide me details on the spark java

Re: WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-08 Thread Matei Zaharia
Have you looked at the cluster UI, and do you see any workers registered there, and your application under running applications? Maybe you typed in the wrong master URL or something like that. Matei On Jan 8, 2014, at 7:07 PM, Aureliano Buendia buendia...@gmail.com wrote: The strange thing

Re: WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-08 Thread Matei Zaharia
...@gmail.com wrote: On Thu, Jan 9, 2014 at 3:59 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Have you looked at the cluster UI, and do you see any workers registered there, and your application under running applications? Maybe you typed in the wrong master URL or something like

Re: WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-08 Thread Matei Zaharia
, which will distribute it. You can launch your application with “scala”, “java”, or whatever tool you’d prefer. Matei On Jan 8, 2014, at 8:26 PM, Aureliano Buendia buendia...@gmail.com wrote: On Thu, Jan 9, 2014 at 4:11 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Oh, you shouldn’t use

Re: ship MatrixFactorizationModel with each partition?

2014-01-07 Thread Matei Zaharia
Sorry, you actually can’t call predict() on the cluster because the model contains some RDDs. There was a recent patch that added a parallel predict method, here: https://github.com/apache/incubator-spark/pull/328/files. You can grab the code from that method there (which does a join) and call

Re: Spark SequenceFile Java API Repeat Key Values

2014-01-07 Thread Matei Zaharia
Yeah, unfortunately sequenceFile() reuses the Writable object across records. If you plan to use each record repeatedly (e.g. cache it), you should clone it using a map function. It was originally designed assuming you only look at each record once, but it’s poorly documented. Matei On Jan
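
The cloning pattern, sketched for Text records with a hypothetical path:

    import org.apache.hadoop.io.Text

    val raw = sc.sequenceFile("hdfs:///data/seqfile", classOf[Text], classOf[Text])
    // sequenceFile() hands back the same Writable instances each time,
    // so copy every record before caching.
    val cached = raw.map { case (k, v) => (new Text(k), new Text(v)) }.cache()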

Re: Spark SequenceFile Java API Repeat Key Values

2014-01-07 Thread Matei Zaharia
, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, unfortunately sequenceFile() reuses the Writable object across records. If you plan to use each record repeatedly (e.g. cache it), you should clone them using a map function. It was originally designed assuming you only look at each record

Re: Unable to connect spark 0.8.1 (built for hadoop 2.2.0) to connect to mesos 0.14.2

2014-01-03 Thread Matei Zaharia
(Replying on new Spark mailing list since the old one closed). Are you sure Spark is finding your build of Mesos instead of the Apache one from Maven Central? Unfortunately, code compiled with different protobuf versions is not compatible, because the code generated by the protoc compiler

Re: Is spark-env.sh supposed to be stateless?

2014-01-02 Thread Matei Zaharia
I agree that it would be good to do it only once, if you can find a nice way of doing so. Matei On Jan 3, 2014, at 1:33 AM, Andrew Ash and...@andrewash.com wrote: In my spark-env.sh I append to the SPARK_CLASSPATH variable rather than overriding it, because I want to support both adding a

Re: Lazy execution

2013-12-27 Thread Matei Zaharia
If you’re trying to measure the performance assuming that a dataset is already in memory, then doing cache() and count() would work. However if you want to measure an end-to-end workflow, it might be good to leave the operations and the data loading to happen together, as Spark does by default.

Re: endless job and slant tasks

2013-12-25 Thread Matei Zaharia
Does that machine maybe have a full disk drive, or no space in /tmp (where Spark stores local files by default)? On Dec 25, 2013, at 7:50 AM, leosand...@gmail.com wrote: No , just standalone cluster leosand...@gmail.com From: Azuryy Yu Date: 2013-12-25 19:21 To:

Re: Unable to load additional JARs in yarn-client mode

2013-12-23 Thread Matei Zaharia
I’m surprised by this, but one way that will definitely work is to assemble your application into a single JAR. If passing them to the constructor doesn’t work, that’s probably a bug. Matei On Dec 23, 2013, at 12:03 PM, Karavany, Ido ido.karav...@intel.com wrote: Hi All, For our

Re: logs in PySpark?

2013-12-20 Thread Matei Zaharia
On Thu, Dec 19, 2013 at 2:23 PM, Matei Zaharia matei.zaha...@gmail.com wrote: It might also mean you don’t have Python installed on the worker. On Dec 19, 2013, at 1:17 PM, Jey Kottalam j...@cs.berkeley.edu wrote

Re: DoubleMatrix vs Array[Array[Double]] : Question about debugging serialization performance issues

2013-12-19 Thread Matei Zaharia
Hi Guillaume, I haven’t looked at the serialization of DoubleMatrix but I believe it just creates one big Array[Double] instead of many ones, and stores all the rows contiguously in that. I don’t think that would be slower to serialize. However, because the object is bigger overall, it might

Re: Spark Mesos Install Change Question

2013-12-19 Thread Matei Zaharia
Yup, this will still be supported. On Dec 18, 2013, at 12:40 PM, Gary Malouf malouf.g...@gmail.com wrote: In 0.7.3, the way of installing spark on mesos was to unpack it into the same directory across the cluster (I assume this includes the driver program). We automated this process in our

Re: Spark 0.8.1 Released

2013-12-19 Thread Matei Zaharia
Rosen, Henry Saputra, Jerry Shao, Mingfei Shi, Andre Schumacher, Karthik Tunga, Patrick Wendell, Neal Wiggins, Andrew Xia, Reynold Xin, Matei Zaharia, and Wu Zeming - Patrick

Re: spark pre-built binaries for 0.8.0

2013-12-18 Thread Matei Zaharia
It takes a while to download all the dependencies from Maven the first time you build. Just let it run, it won’t need to do that next time. Or see if you can build it on a machine with better Internet access and copy the binaries (you can even get an EC2 machine for a few cents if you want).

Re: Repartitioning an RDD

2013-12-17 Thread Matei Zaharia
I’m not sure if a method called repartition() ever existed in an official release, since we don’t remove methods, but there is a method called coalesce() that does what you want. You just tell it the desired new number of partitions. You can also have it shuffle the data across the cluster to
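
For example:

    val rdd = sc.textFile("hdfs:///data/input")        // say, 1000 partitions
    val fewer = rdd.coalesce(10)                       // merge partitions, no shuffle
    val balanced = rdd.coalesce(100, shuffle = true)   // shuffle to rebalance data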

Re: Best ways to use Spark with .NET code

2013-12-16 Thread Matei Zaharia
the latest status is alpha Its license terms (and code integrity) may not pass our legal department Its robustness and efficiency are dubious. Anyway, I'm looking at some other alternatives (e.g. JNBridge). Thanks. -Ken On Mon, Dec 16, 2013 at 12:04 PM, Matei Zaharia matei.zaha...@gmail.com

Re: writing to HDFS with a given username

2013-12-13 Thread Matei Zaharia
Yup, this should be in Spark 0.9 and 0.8.1. Matei On Dec 13, 2013, at 9:41 AM, Koert Kuipers ko...@tresata.com wrote: thats great. didn't realize this was in master already. On Thu, Dec 12, 2013 at 8:10 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi Koert, Spark with

Re: Scala driver, Python workers?

2013-12-12 Thread Matei Zaharia
Yeah, I’m curious which APIs you found missing in Python. I know we have a lot on the Scala side that aren’t yet in there, but I’m not sure how to prioritize them. If you do want to call Python from Scala, you can also use the RDD.pipe() operation to pass data through an external process. However
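
A minimal pipe() sketch; the external command is a placeholder:

    // Each partition's elements go to the command's stdin, one per line;
    // the command's stdout lines become the elements of the result RDD.
    val scored = sc.parallelize(1 to 100).pipe("python score.py")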

Re: Why SparkPi example is slower than LocalPi example

2013-12-12 Thread Matei Zaharia
How long did they run for? The JVM takes a few seconds to start up and compile code, not to mention that Spark takes some time to initialize too, so you won’t see a major difference unless the application is taking longer. One other problem in this job is that it might use Math.random(), which

Re: spark avro: caching leads to identical records?

2013-12-12 Thread Matei Zaharia
The hadoopFile method reuses the Writable object between records that it reads by default, so you get back the same object. You should clone them if you need to cache them. This is kind of an unintuitive behavior that we’ll probably need to turn off by default; it’s helpful when you don’t need

Re: Hadoop RDD incorrect data

2013-12-09 Thread Matei Zaharia
Hi Matt, The behavior for sequenceFile is there because we reuse the same Writable object when reading elements from the file. This is definitely unintuitive, but if you pass through each data item only once instead of caching it, it can be more efficient (probably should be off by default

Re: Biggest spark.akka.framesize possible

2013-12-08 Thread Matei Zaharia
Hey Matt, This setting shouldn’t really affect groupBy operations, because they don’t go through Akka. The frame size setting is for messages from the master to workers (specifically, sending out tasks), and for results that go directly from workers to the application (e.g. collect()). So it

Re: Spark Import Issue

2013-12-08 Thread Matei Zaharia
I’m not sure you can have a star inside that quoted classpath argument (the double quotes may cancel the *). Try using the JAR through its full name, or link to Spark through Maven (http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-java). Matei On Dec 6, 2013,

Re: Newbie questions

2013-12-08 Thread Matei Zaharia
Hi Kenneth, 1. Is Spark suited for online learning algorithms? From what I’ve read so far (mainly from this slide), it seems not but I could be wrong. You can probably use Spark Streaming (http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html) to implement

Re: Biggest spark.akka.framesize possible

2013-12-08 Thread Matei Zaharia
to know the maximum value for spark.akka.framesize, too and I am wondering if it will affect the performance of reduceByKey(). Thanks! 2013/12/8 Matei Zaharia matei.zaha...@gmail.com Hey Matt, This setting shouldn’t really affect groupBy operations, because they don’t go through Akka

Re: Build Spark with maven

2013-12-08 Thread Matei Zaharia
within the com/typesafe/akka subtree. On Sun, Dec 8, 2013 at 5:01 PM, Azuryy Yu azury...@gmail.com wrote: I build 0.8.1, maven try to download akka-actor-2.0.1, which is used by scala-core-io. On 2013-12-09 8:40 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Which version of Spark

Re: Writing an RDD to Hive

2013-12-07 Thread Matei Zaharia
Hi Philip, There are a few things you can do: - If you want to avoid the data copy with a CREATE TABLE statement, you can use CREATE EXTERNAL TABLE, which points to an existing file or directory. - If you always reuse the same table, you could CREATE TABLE only once and then simply place

Re: Cluster not accepting jobs

2013-12-06 Thread Matei Zaharia
Yeah, in general, make sure you use exactly the same “cluster URL” string shown on the master’s web UI. There’s currently a limitation in Akka where different ways of specifying the hostname won’t work. Matei On Dec 6, 2013, at 10:54 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote:

Re: Spark 0.8.0 Compiling issues

2013-12-06 Thread Matei Zaharia
Yeah, unfortunately the reason it pops up more in 0.8.0 is because our package names got longer! But if you just do the build in /tmp it will work. On Dec 6, 2013, at 11:35 AM, Josh Rosen rosenvi...@gmail.com wrote: This isn't a Spark 0.8.0-specific problem. I googled for sbt error filen

Re: Pre-build Spark for Windows 8.1

2013-12-06 Thread Matei Zaharia
. On Thu, Dec 5, 2013 at 2:43 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi, When you launch the worker, try using spark://ADRIBONA-DEV-1:7077 as the URL (uppercase instead of lowercase). Unfortunately Akka is very specific about seeing hostnames written in the same way on each

Re: takeSample() computation

2013-12-05 Thread Matei Zaharia
Hi Matt, Try using take() instead, which will only begin computing from the start of the RDD (first partition) if the number of elements you ask for is small. Note that if you’re doing any shuffle operations, like groupBy or sort, then the stages before that do have to be computed fully.
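
For instance:

    // take() starts computing from the first partition and stops once it
    // has 10 elements; takeSample() would evaluate the whole RDD first.
    val firstTen = sc.textFile("hdfs:///data/big").take(10)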

Re: Pre-build Spark for Windows 8.1

2013-12-05 Thread Matei Zaharia
, December 5, 2013 7:49 AM To: user@spark.incubator.apache.org Subject: RE: Pre-build Spark for Windows 8.1 Excellent! Thank you, Matei. From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: Wednesday, December 4, 2013 4:26 PM To: user@spark.incubator.apache.org Subject: Re: Pre-build

Re: takeSample() computation

2013-12-05 Thread Matei Zaharia
– we want as much data to be computed as possible. It's only for benchmarking purposes, of course. -Matt Cheah From: Matei Zaharia matei.zaha...@gmail.com Reply-To: user@spark.incubator.apache.org Date: Thursday, December 5, 2013 10:31 AM To: user

Re: Benchmark numbers for terabytes of data

2013-12-04 Thread Matei Zaharia
Yes, check out the Shark paper for example: https://amplab.cs.berkeley.edu/publication/shark-sql-and-rich-analytics-at-scale/ The numbers on that benchmark are for Shark. Matei On Dec 3, 2013, at 3:50 PM, Matt Cheah mch...@palantir.com wrote: Hi everyone, I notice the benchmark page for

Re: Benchmark numbers for terabytes of data

2013-12-04 Thread Matei Zaharia
these up. -Matt Cheah From: Matei Zaharia matei.zaha...@gmail.com Reply-To: user@spark.incubator.apache.org Date: Wednesday, December 4, 2013 10:53 AM To: user@spark.incubator.apache.org Cc: Mingyu Kim m...@palantir.com Subject

Re: Pre-build Spark for Windows 8.1

2013-12-04 Thread Matei Zaharia
Hey Adrian, Ideally you shouldn’t use Cygwin to run on Windows — use the .cmd scripts we provide instead. Cygwin might be made to work but we haven’t tried to do this so far so it’s not supported. If you can fix it, that would of course be welcome. Also, the deploy scripts don’t work on

Re: Removing broadcasts

2013-12-04 Thread Matei Zaharia
Hey Roman, It looks like that pull request was never migrated to the Apache GitHub, but I like the idea. If you migrate it over, we can merge in something like this. In terms of the API, I’d just add an unpersist() method on each Broadcast object. Matei On Dec 3, 2013, at 6:00 AM, Roman

Re: forcing node local processing

2013-12-01 Thread Matei Zaharia
Ah, interesting, thanks for reporting that. Do you mind opening a JIRA issue for it? I think the right way would be to wait at least X seconds after start before deciding that some blocks don’t have preferred locations available. Matei On Dec 1, 2013, at 9:08 AM, Erik Freed

Re: Not able run Apache Spark on Mesos

2013-11-29 Thread Matei Zaharia
I think this might be an issue with the tutorial — try asking the Mesosphere folks who created it. Matei On Nov 28, 2013, at 9:23 PM, om prakash pandey pande...@gmail.com wrote: Dear Sir/Madam, I have been trying to run Apache Spark over Mesos and have been following the below tutorial.

Re: Could not find resource path for Web UI: org/apache/spark/ui/static

2013-11-27 Thread Matei Zaharia
Sorry, what’s the full context for this? Do you have a stack trace? My guess is that Spark isn’t on your classpath, or maybe you only have an old version of it on there. Matei On Nov 27, 2013, at 6:04 PM, Walrus theCat walrusthe...@gmail.com wrote: To clarify, I just undid that var...

Re: Spark driver behind NAT

2013-11-24 Thread Matei Zaharia
Yup, it’s also important to have low latency between the drivers and the workers. If you plan to expose this to the outside (e.g. offer a shell interface), it would be better to write something on top. Matei On Nov 24, 2013, at 3:17 PM, Patrick Wendell pwend...@gmail.com wrote: Or more

Re: DataFrame RDDs

2013-11-18 Thread Matei Zaharia
Interesting idea — in Scala you can also use the Dynamic type (http://hacking-scala.org/post/49051516694/introduction-to-type-dynamic) to allow dynamic properties. It has the same potential pitfalls as string names, but with nicer syntax. Matei On Nov 18, 2013, at 3:45 PM, andy petrella
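
A small illustration of the Dynamic approach (Scala 2.10; the Row class is hypothetical):

    import scala.language.dynamics

    // row.age compiles to selectDynamic("age"), so a bad field name still
    // fails only at runtime, just as a string lookup would.
    class Row(fields: Map[String, Any]) extends Dynamic {
      def selectDynamic(name: String): Any = fields(name)
    }
    val row = new Row(Map("name" -> "Ada", "age" -> 36))
    println(row.age)   // 36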

Spark meetup in Boston on Nov 21st

2013-11-14 Thread Matei Zaharia
Hey folks, just a quick announcement -- in case you’re interested in learning more about Spark in the Boston area, I’m going to speak at the Boston Hadoop Meetup next Thursday: http://www.meetup.com/bostonhadoop/events/150875522/. This is a good chance to meet local users and learn more about

Re: executor failures w/ scala 2.10

2013-11-13 Thread Matei Zaharia
. This timeout can of course be configurable. Thoughts? On Sat, Nov 2, 2013 at 3:29 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Imran, Good to know that Akka 2.1 handles this — that at least will give us a start. In the old code, executors certainly did get flagged as “down

[ANNOUNCE] Welcoming two new Spark committers: Tom Graves and Prashant Sharma

2013-11-13 Thread Matei Zaharia
Hi folks, The Apache Spark PPMC is happy to welcome two new PPMC members and committers: Tom Graves and Prashant Sharma. Tom has been maintaining and expanding the YARN support in Spark over the past few months, including adding big features such as support for YARN security, and recently

Re: Removing RDDs' data from BlockManager

2013-11-13 Thread Matei Zaharia
Hi Meisam, Each block manager removes data from the cache in a least-recently-used fashion as space fills up. If you’d like to remove an RDD manually before that, you can call rdd.unpersist(). Matei On Nov 13, 2013, at 8:15 PM, Meisam Fathi meisam.fa...@gmail.com wrote: Hi Community,
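
For example:

    val rdd = sc.textFile("hdfs:///data/input").cache()
    rdd.count()       // materializes the cached blocks
    rdd.unpersist()   // drop them now rather than waiting for LRU eviction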

Re: interesting finding per using union

2013-11-13 Thread Matei Zaharia
Union just puts the data in two RDDs together, so you get an RDD containing the elements of both, and with the partitions that would’ve been in both. It’s not a unique set union (that would be union() then distinct()). Here you’ve unioned four RDDs of 32 partitions each to get 128. If you want
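
The distinction, sketched with two small RDDs:

    val a = sc.parallelize(1 to 10, 32)
    val b = sc.parallelize(5 to 15, 32)
    val bag = a.union(b)                   // 64 partitions, duplicates kept
    val set = a.union(b).distinct()        // unique elements (adds a shuffle)
    val compact = a.union(b).coalesce(32)  // fewer partitions, no shuffle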

Re: problems with sbt

2013-11-12 Thread Matei Zaharia
It’s hard to tell, but maybe you’ve run out of space in your working directory? The assembly command will try to write stuff in assembly/target. Matei On Nov 11, 2013, at 2:54 PM, Umar Javed umarj.ja...@gmail.com wrote: I keep getting these io.Exception Permission denied errors when building

Re: spark.akka.threads recommendations?

2013-11-11 Thread Matei Zaharia
Actually it doesn’t matter a lot from what I’ve seen. Only do it if you see a lot of communication going to the master (these threads do the serialization of tasks). I’ve never put more than 8 or so. Matei On Nov 11, 2013, at 12:13 PM, Walrus theCat walrusthe...@gmail.com wrote: Hi, The

Re: Spark Summit agenda posted

2013-11-08 Thread Matei Zaharia
. 2013/11/7 Matei Zaharia matei.zaha...@gmail.com Hi everyone, We're glad to announce the agenda of the Spark Summit, which will happen on December 2nd and 3rd in San Francisco. We have 5 keynotes and 24 talks lined up, from 18 different companies. Check out the agenda here: http://spark

Re: PMML support in spark

2013-11-07 Thread Matei Zaharia
Hi Pranay, I don’t think anyone’s working on this right now, but contributions would be welcome if this is a thing we could plug into MLlib. Matei On Nov 6, 2013, at 8:44 PM, Pranay Tonpay pranay.ton...@impetus.co.in wrote: Hi, Wanted to know if PMML support in Spark is there in the roadmap

Spark Summit agenda posted

2013-11-07 Thread Matei Zaharia
Hi everyone, We're glad to announce the agenda of the Spark Summit, which will happen on December 2nd and 3rd in San Francisco. We have 5 keynotes and 24 talks lined up, from 18 different companies. Check out the agenda here: http://spark-summit.org/agenda/. This will be the biggest Spark

Re: Where is reduceByKey?

2013-11-07 Thread Matei Zaharia
import statements On 11/7/2013 4:05 PM, Matei Zaharia wrote: Yeah, this is confusing and unfortunately as far as I know it’s API specific. Maybe we should add this to the documentation page for RDD. The reason for these conversions is to only allow some operations based

Re: rdd.foreach doesn't act as expected

2013-11-06 Thread Matei Zaharia
In general, you shouldn’t be mutating data in RDDs. That will make it impossible to recover from faults. In this particular case, you got 1 and 2 because the RDD isn’t cached. You just get the same list you called parallelize() with each time you iterate through it. But caching it and

Re: executor failures w/ scala 2.10

2013-11-01 Thread Matei Zaharia
if I am wrong. On Fri, Nov 1, 2013 at 10:08 AM, Matei Zaharia matei.zaha...@gmail.com wrote: It’s true that Akka’s delivery guarantees are in general at-most-once, but if you look at the text there it says that they differ by transport. In the previous version, I’m quite sure

Re: executor failures w/ scala 2.10

2013-10-31 Thread Matei Zaharia
never bothered looking into it more. I will keep digging ... On Thu, Oct 31, 2013 at 4:36 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW the problem might be the Akka failure detector settings that seem new in 2.2: http://doc.akka.io/docs/akka/2.2.3/scala/remoting.html

Re: How to exclude a library from sbt assembly

2013-10-30 Thread Matei Zaharia
Looking at https://github.com/sbt/sbt-assembly, it seems you can add the following into extraAssemblySettings: assemblyOption in assembly ~= { _.copy(includeScala = false) } Matei On Oct 30, 2013, at 9:58 AM, Mingyu Kim m...@palantir.com wrote: Hi, In order to work around the library

Re: Questions about the files that Spark will produce during its running

2013-10-29 Thread Matei Zaharia
The error is from a worker node -- did you check that /data2 is set up properly on the worker nodes too? In general that should be the only directory used. Matei On Oct 28, 2013, at 6:52 PM, Shangyu Luo lsy...@gmail.com wrote: Hello, I have some questions about the files that Spark will

Re: spark-0.8.0 and hadoop-2.1.0-beta

2013-10-29 Thread Matei Zaharia
) to ConverterUtils.convertFromYarn(containerToken, cmAddress). Not 100% sure that my changes are correct. Hope that helps, Viren On Sun, Sep 29, 2013 at 8:59 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Terence, YARN's API changed in an incompatible way in Hadoop 2.1.0, so I'd suggest

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Matei Zaharia
FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY caching is the input to each reduce task. Those currently don't spill to disk. The solution if datasets are large is to add more reduce tasks, whereas Hadoop would run along with a small number of tasks that do lots

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Matei Zaharia
of course we develop features and optimizations as we see demand for them, but if there's a lot of demand for this, we can do it. Matei On Oct 28, 2013, at 5:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote: FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY

Re: Task output before a shuffle

2013-10-28 Thread Matei Zaharia
Hi Ufuk, Yes, we still write out data after these tasks in Spark 0.8, and it needs to be written out before any stage that reads it can start. The main reason is simplicity when there are faults, as well as more flexible scheduling (you don't have to decide where each reduce task is in

Re: Failed to build Spark with YARN 2.2.0

2013-10-24 Thread Matei Zaharia
Yup, unfortunately YARN changed its API upon releasing 2.2, which puts us in an awkward position because all the major current users are on the old YARN API (from 0.23.x and 2.0.x) but new users will try this one. We'll probably change the default version in Spark 0.8.1 or 0.8.2. If you look on

Re: solution to write data to S3?

2013-10-23 Thread Matei Zaharia
Yes, take a look at http://spark.incubator.apache.org/docs/latest/ec2-scripts.html#accessing-data-in-s3 Matei On Oct 23, 2013, at 6:17 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, all Is there any solution running Spark with Amazon S3? Best, Nan

Re: solution to write data to S3?

2013-10-23 Thread Matei Zaharia
, Ayush Mishra ay...@knoldus.com wrote: You can check http://blog.knoldus.com/2013/09/09/running-standalone-scala-job-on-amazon-ec2-spark-cluster/. On Thu, Oct 24, 2013 at 6:54 AM, Nan Zhu zhunanmcg...@gmail.com wrote: Great!!! On Wed, Oct 23, 2013 at 9:21 PM, Matei Zaharia matei.zaha

Re: Help with Initial Cluster Configuration / Tuning

2013-10-22 Thread Matei Zaharia
of data etc. I was wondering if you could write up a little white paper or some guidelines on how to set memory values, and what to look at when something goes wrong? E.g. I would never have guessed that countByValue happens on a single machine etc. On Oct 21, 2013 6:18 PM, Matei

Re: Kafka dependency issues

2013-10-17 Thread Matei Zaharia
if the goal is to keep size down and you don't want to confuse new adopters who aren't using Kafka as part of their tech stack. -Ryan On Sat, Oct 12, 2013 at 10:52 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Ryan, Spark Streaming ships with a special version of the Kafka

Re: Spark Spark Streaming, how to get started for local development?

2013-10-13 Thread Matei Zaharia
Hi Ryan, If you're only going to run in local mode, there's no need to package the app with sbt and pass a JAR. You can just run it straight out of your IDE. Matei On Oct 13, 2013, at 9:17 PM, Ryan Chan ryanchan...@gmail.com wrote: Hi, Are there any guide on teaching how to get started

Re: Spark 0.8.0 on Mesos 0.13.0 (clustered) : NoClassDefFoundError

2013-10-12 Thread Matei Zaharia
Hey, this seems to be a problem in the docs about how to set the executor URI. It looks like the SPARK_EXECUTOR_URI variable is not actually used. Instead, set the spark.executor.uri Java system property using System.setProperty("spark.executor.uri", "<your URI>") before you create a SparkContext.
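
That workaround, sketched with a hypothetical URI:

    // Set before constructing the SparkContext so Mesos slaves know
    // where to fetch the Spark executor package.
    System.setProperty("spark.executor.uri", "hdfs:///frameworks/spark-0.8.0.tar.gz")
    val sc = new SparkContext("mesos://master:5050", "MesosApp")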

Re: Kafka dependency issues

2013-10-12 Thread Matei Zaharia
Hi Ryan, Spark Streaming ships with a special version of the Kafka 0.7.2 client that we ported to Scala 2.9, and you need to add that as a JAR explicitly in your project. The JAR is in streaming/lib/org/apache/kafka/kafka/0.7.2-spark/kafka-0.7.2-spark.jar under Spark. The streaming/lib

Re: Output configuration

2013-10-12 Thread Matei Zaharia
Hi Alex, Unfortunately there seems to be something wrong with how the generics on that method get seen by Java. You can work around it by calling this with: plans.saveAsHadoopFiles("hdfs://localhost:8020/user/hue/output/completed", "csv", String.class, String.class, (Class)

Re: Output to a single directory with multiple files rather multiple directories ?

2013-10-10 Thread Matei Zaharia
Hey, sorry, for this question, there's a similar answer to the previous one. You'll have to move the files from the output directories into a common directory by hand, possibly renaming them. The Hadoop InputFormat and OutputFormat APIs that we use are just designed to work at the level of

Re: Output to a single directory with multiple files rather multiple directories ?

2013-10-10 Thread Matei Zaharia
Yeah, Christopher answered this before I could, but you can list the directory in the driver nodes, find out all the filenames, and then use SparkContext.parallelize() on an array of filenames to split the set of filenames among tasks. After that, run a foreach() on the parallelized RDD and
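
A sketch of that pattern, assuming the directories sit on a filesystem every worker can see (paths hypothetical):

    import java.io.File

    // Listed once on the driver; the per-file work runs on the workers.
    val names = new File("/data/output").listFiles.map(_.getAbsolutePath)
    sc.parallelize(names.toSeq).foreach { path =>
      val f = new File(path)
      f.renameTo(new File("/data/merged/" + f.getName))
    }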

Re: Execution time of spark job

2013-10-10 Thread Matei Zaharia
Take a look at the org.apache.spark.scheduler.SparkListener class. You can register your own SparkListener with SparkContext that listens for job-start and job-end events. Matei On Oct 10, 2013, at 9:04 PM, prabeesh k prabsma...@gmail.com wrote: Is there any way to get execution time in the
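
A minimal listener along those lines (method and event names as in the scheduler API of that era; treat as a sketch):

    import org.apache.spark.scheduler._

    class TimingListener extends SparkListener {
      private var start = 0L
      override def onJobStart(jobStart: SparkListenerJobStart) {
        start = System.currentTimeMillis
      }
      override def onJobEnd(jobEnd: SparkListenerJobEnd) {
        println("Job took " + (System.currentTimeMillis - start) + " ms")
      }
    }
    sc.addSparkListener(new TimingListener)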

Re: Spark dependency library causing problems with conflicting versions at import

2013-10-07 Thread Matei Zaharia
Hi Mingyu, The latest version of Spark works with Scala 2.9.3, which is the latest Scala-2.9 version. There's also a branch called branch-2.10 on GitHub that uses 2.10.3. What specific libraries are you having trouble with? I see other open source projects private-namespacing the dependencies

Re: Roadblock with Spark 0.8.0 ActorStream

2013-10-04 Thread Matei Zaharia
Hi Paul, Just FYI, I'm not sure Akka was designed to pass ActorSystems across closures the way you're doing. Also, there's a bit of a misunderstanding about closures on RDDs. Consider this change you made to ActorWordCount: lines.flatMap(_.split("\\s+")).map(x => (x, 1)).reduceByKey(_ +
