Re: Size of RDD larger than Size of data on disk

2014-02-25 Thread Matei Zaharia
The problem is that Java objects can take more space than the underlying data, but there are options in Spark to store data in serialized form to get around this. Take a look at https://spark.incubator.apache.org/docs/latest/tuning.html. Matei On Feb 25, 2014, at 12:01 PM, Suraj Satishkumar
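
A minimal sketch of the serialized-storage option mentioned above, assuming a local context and a made-up input path:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object SerializedCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "serialized-cache")
    // MEMORY_ONLY_SER keeps each partition as one serialized buffer instead of
    // many Java objects, which is usually far more compact in the JVM heap.
    val data = sc.textFile("hdfs:///some/input").persist(StorageLevel.MEMORY_ONLY_SER)
    println(data.count())
    sc.stop()
  }
}
```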

Re: Implementing a custom Spark shell

2014-02-26 Thread Matei Zaharia
In Spark 0.9 and master, you can pass the -i argument to spark-shell to load a script containing commands before opening the prompt. This is also a feature of the Scala shell as a whole (try scala -help for details). Also, once you’re in the shell, you can use :load file.scala to execute the
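
For illustration, a made-up init script used both ways (the script name and its contents are not from the thread):

```
$ cat init.scala
val logs = sc.textFile("hdfs:///some/logs")   // runs before the prompt appears

$ ./bin/spark-shell -i init.scala

scala> :load init.scala      // same thing from inside an already-open shell
```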

Re: Building spark with native library support

2014-03-06 Thread Matei Zaharia
Is it an error, or just a warning? In any case, you need to get those libraries from a build of Hadoop for your platform. Then add them to the SPARK_LIBRARY_PATH environment variable in conf/spark-env.sh, or to your -Djava.library.path if launching an application separately. These libraries
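
For example, assuming the platform-specific Hadoop build put its native libraries under /opt/hadoop/lib/native (an illustrative path):

```
# conf/spark-env.sh
export SPARK_LIBRARY_PATH=/opt/hadoop/lib/native
# or, when launching an application yourself:
#   -Djava.library.path=/opt/hadoop/lib/native
```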

Re: major Spark performance problem

2014-03-09 Thread Matei Zaharia
Hi Dana, It’s hard to tell exactly what is consuming time, but I’d suggest starting by profiling the single application first. Three things to look at there: 1) How many stages and how many tasks per stage is Spark launching (in the application web UI at http://driver:4040)? If you have

Re: NO SUCH METHOD EXCEPTION

2014-03-11 Thread Matei Zaharia
Since it’s from Scala, it might mean you’re running with a different version of Scala than you compiled Spark with. Spark 0.8 and earlier use Scala 2.9, while Spark 0.9 uses Scala 2.10. Matei On Mar 11, 2014, at 8:19 AM, Jeyaraj, Arockia R (Arockia) arockia.r.jeya...@verizon.com wrote: Hi,

Re: Powered By Spark Page -- Companies & Organizations

2014-03-11 Thread Matei Zaharia
Thanks, added you. On Mar 11, 2014, at 2:47 AM, Christoph Böhm listenbru...@gmx.net wrote: Dear Spark team, thanks for the great work and congrats on becoming an Apache top-level project! You could add us to your Powered-by-page, because we are using Spark (and Shark) to perform

Re: RDD.saveAs...

2014-03-11 Thread Matei Zaharia
I agree that we can’t keep adding these to the core API, partly because it will get unwieldy to maintain and partly just because each storage system will bring in lots of dependencies. We can simply have helper classes in different modules for each storage system. There’s some discussion on

Re: possible bug in Spark's ALS implementation...

2014-03-16 Thread Matei Zaharia
On Mar 14, 2014, at 5:52 PM, Michael Allman m...@allman.ms wrote: I also found that the product and user RDDs were being rebuilt many times over in my tests, even for tiny data sets. By persisting the RDD returned from updateFeatures() I was able to avoid a raft of duplicate computations. Is

Re: How to kill a spark app ?

2014-03-16 Thread Matei Zaharia
If it’s a driver on the cluster, please open a JIRA issue about this — this kill command is indeed intended to work. Matei On Mar 16, 2014, at 2:35 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Are you embedding your driver inside the cluster? If not then that command will not kill the

Re: [Powered by] Yandex Islands powered by Spark

2014-03-16 Thread Matei Zaharia
Thanks, I’ve added you: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. Let me know if you want to change any wording. Matei On Mar 16, 2014, at 6:48 AM, Egor Pahomov pahomov.e...@gmail.com wrote: Hi, page https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

Re: example of non-line oriented input data?

2014-03-17 Thread Matei Zaharia
Hi Diana, Non-text input formats are only supported in Java and Scala right now, where you can use sparkContext.hadoopFile or .hadoopDataset to load data with any InputFormat that Hadoop MapReduce supports. In Python, you unfortunately only have textFile, which gives you one record per line.
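
A rough Scala sketch of the hadoopFile route, assuming a SequenceFile of (LongWritable, Text) records at a made-up path:

```scala
import org.apache.spark.SparkContext
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.SequenceFileInputFormat

object NonTextInputSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "non-text-input")
    // Any Hadoop InputFormat can be plugged in here; SequenceFileInputFormat is just one example.
    val records = sc.hadoopFile[LongWritable, Text, SequenceFileInputFormat[LongWritable, Text]](
      "hdfs:///some/seqfile")
    // Copy out of the reused Writable objects before collecting.
    records.map { case (k, v) => (k.get, v.toString) }.take(5).foreach(println)
    sc.stop()
  }
}
```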

Re: example of non-line oriented input data?

2014-03-17 Thread Matei Zaharia
to me how to do that as I probably should be. Thanks, Diana On Mon, Mar 17, 2014 at 1:02 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Diana, Non-text input formats are only supported in Java and Scala right now, where you can use sparkContext.hadoopFile or .hadoopDataset

Re: is collect exactly-once?

2014-03-17 Thread Matei Zaharia
Yup, it only returns each value once. Matei On Mar 17, 2014, at 1:14 PM, Adrian Mocanu amoc...@verticalscope.com wrote: Hi Quick question here, I know that .foreach is not idempotent. I am wondering if collect() is idempotent? Meaning that once I’ve collect()-ed if spark node crashes I

Re: example of non-line oriented input data?

2014-03-17 Thread Matei Zaharia
) On Mon, Mar 17, 2014 at 1:57 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Here’s an example of getting together all lines in a file as one string: $ cat dir/a.txt Hello world! $ cat dir/b.txt What's up?? $ bin/pyspark files = sc.textFile(“dir”) files.collect() [u'Hello

Re: links for the old versions are broken

2014-03-17 Thread Matei Zaharia
Thanks for reporting this, looking into it. On Mar 17, 2014, at 2:44 PM, Walrus theCat walrusthe...@gmail.com wrote: ping On Thu, Mar 13, 2014 at 11:05 AM, Aaron Davidson ilike...@gmail.com wrote: Looks like everything from 0.8.0 and before errors similarly (though Spark 0.3 for Scala

Re: Incrementally add/remove vertices in GraphX

2014-03-18 Thread Matei Zaharia
I just meant that you call union() before creating the RDDs that you pass to new Graph(). If you call it after it will produce other RDDs. The Graph() constructor actually shuffles and “indexes” the data to make graph operations efficient, so it’s not too easy to add elements after. You could

Re: Pyspark worker memory

2014-03-19 Thread Matei Zaharia
Try checking spark-env.sh on the workers as well. Maybe code there is somehow overriding the spark.executor.memory setting. Matei On Mar 18, 2014, at 6:17 PM, Jim Blomo jim.bl...@gmail.com wrote: Hello, I'm using the Github snapshot of PySpark and having trouble setting the worker memory

Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread Matei Zaharia
Yes, Spark automatically removes old RDDs from the cache when you make new ones. Unpersist forces it to remove them right away. In both cases though, note that Java doesn’t garbage-collect the objects released until later. Matei On Mar 19, 2014, at 7:22 PM, Nicholas Chammas

Re: Pyspark worker memory

2014-03-20 Thread Matei Zaharia
-Dspark.executor.memory in SPARK_JAVA_OPTS *on the master*. I'm not sure how this varies from 0.9.0 release, but it seems to work on SNAPSHOT. On Tue, Mar 18, 2014 at 11:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Try checking spark-env.sh on the workers as well. Maybe code there is somehow
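
The fix reported in this thread amounts to one line in spark-env.sh; the 4g value below is only an example:

```
# conf/spark-env.sh on the master (and check the workers for anything overriding it)
export SPARK_JAVA_OPTS="-Dspark.executor.memory=4g"
```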

Re: DStream spark paper

2014-03-20 Thread Matei Zaharia
Hi Adrian, On every timestep of execution, we receive new data, then report updated word counts for that new data plus the past 30 seconds. The latency here is about how quickly you get these updated counts once the new batch of data comes in. It’s true that the count reflects some data from

Re: How to save as a single file efficiently?

2014-03-21 Thread Matei Zaharia
Try passing the shuffle=true parameter to coalesce, then it will do the map in parallel but still pass all the data through one reduce node for writing it out. That’s probably the fastest it will get. No need to cache if you do that. Matei On Mar 21, 2014, at 4:04 PM, Aureliano Buendia
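
A minimal sketch of that suggestion, with placeholder paths:

```scala
import org.apache.spark.SparkContext

object SingleFileOutputSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "single-file-output")
    val processed = sc.textFile("hdfs:///some/input").map(_.toUpperCase)
    // shuffle = true lets the upstream map work run in parallel; only the final
    // write funnels through one partition, and therefore one output file.
    processed.coalesce(1, shuffle = true).saveAsTextFile("hdfs:///some/output")
    sc.stop()
  }
}
```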

Re: How to save as a single file efficiently?

2014-03-21 Thread Matei Zaharia
, at 5:01 PM, Aureliano Buendia buendia...@gmail.com wrote: Good to know it's as simple as that! I wonder why shuffle=true is not the default for coalesce(). On Fri, Mar 21, 2014 at 11:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Try passing the shuffle=true parameter to coalesce

Re: error loading large files in PySpark 0.9.0

2014-03-23 Thread Matei Zaharia
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your SparkContext? It tries to serialize that many objects together at a time, which might be too much. By default the batchSize is 1024. Matei On Mar 23, 2014, at 10:11 AM, Jeremy Freeman freeman.jer...@gmail.com wrote: Hi

Re: Announcing Spark SQL

2014-03-26 Thread Matei Zaharia
Congrats Michael & co for putting this together — this is probably the neatest piece of technology added to Spark in the past few months, and it will greatly change what users can do as more data sources are added. Matei On Mar 26, 2014, at 3:22 PM, Ognen Duzlevski og...@plainvanillagames.com

Re: All pairs shortest paths?

2014-03-26 Thread Matei Zaharia
wrote: Much thanks, I suspected this would be difficult. I was hoping to generate some 4 degrees of separation-like statistics. Looks like I'll just have to work with a subset of my graph. On Wed, Mar 26, 2014 at 5:20 PM, Matei Zaharia matei.zaha...@gmail.com wrote: All-pairs distances

Re: pySpark memory usage

2014-03-27 Thread Matei Zaharia
exceptions, but I think they all stem from the above, eg. org.apache.spark.SparkException: Error sending message to BlockManagerMaster Let me know if there are other settings I should try, or if I should try a newer snapshot. Thanks again! On Mon, Mar 24, 2014 at 9:35 AM, Matei Zaharia

Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Matei Zaharia
Weird, how exactly are you pulling out the sample? Do you have a small program that reproduces this? Matei On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: I forgot to mention that I don't really use all of my data. Instead I use a sample extracted with randomSample.

Re: [shark-users] SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Matei Zaharia
Hi Manoj, At the current time, for drop-in replacement of Hive, it will be best to stick with Shark. Over time, Shark will use the Spark SQL backend, but should remain deployable the way it is today (including launching the SharkServer, using the Hive CLI, etc). Spark SQL is better for

Re: Mllib in pyspark for 0.8.1

2014-04-01 Thread Matei Zaharia
You could probably port it back, but it required some changes on the Java side as well (a new PythonMLUtils class). It might be easier to fix the Mesos issues with 0.9. Matei On Apr 1, 2014, at 8:53 AM, Ian Ferreira ianferre...@hotmail.com wrote: Hi there, For some reason the

Re: Spark 1.0.0 release plan

2014-04-03 Thread Matei Zaharia
Hey Bhaskar, this is still the plan, though QAing might take longer than 15 days. Right now since we’ve passed April 1st, the only features considered for a merge are those that had pull requests in review before. (Some big ones are things like annotating the public APIs and simplifying

Re: Optimal Server Design for Spark

2014-04-03 Thread Matei Zaharia
, 2014 at 3:58 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Steve, This configuration sounds pretty good. The one thing I would consider is having more disks, for two reasons — Spark uses the disks for large shuffles and out-of-core operations, and often it’s better to run HDFS or your

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Matei Zaharia
Exceptions should be sent back to the driver program and logged there (with a SparkException thrown if a task fails more than 4 times), but there were some bugs before where this did not happen for non-Serializable exceptions. We changed it to pass back the stack traces only (as text), which

Re: example of non-line oriented input data?

2014-04-04 Thread Matei Zaharia
, Mar 18, 2014 at 8:14 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW one other thing — in your experience, Diana, which non-text InputFormats would be most useful to support in Python first? Would it be Parquet or Avro, simple SequenceFiles with the Hadoop Writable types, or something

Re: Having spark-ec2 join new slaves to existing cluster

2014-04-04 Thread Matei Zaharia
This can’t be done through the script right now, but you can do it manually as long as the cluster is stopped. If the cluster is stopped, just go into the AWS Console, right click a slave and choose “launch more of these” to add more. Or select multiple slaves and delete them. When you run

Re: Spark on other parallel filesystems

2014-04-04 Thread Matei Zaharia
As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won’t expose data locality info to Spark, the way HDFS does. That may not matter if it’s a
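
For example, assuming an existing SparkContext `sc` and a Lustre mount at an invented path:

```scala
// Works for Lustre, NFS, or any other filesystem visible at the same path on
// every node; the trade-off noted above is that Spark gets no locality hints.
val data = sc.textFile("file:///mnt/lustre/dataset/part-*")
```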

Re: Spark 0.9.1 released

2014-04-09 Thread Matei Zaharia
, Chen Chao, Christian Lundgren, Diana Carroll, Emtiaz Ahmed, Frank Dai, Henry Saputra, jianghan, Josh Rosen, Jyotiska NK, Kay Ousterhout, Kousuke Saruta, Mark Grover, Matei Zaharia, Nan Zhu, Nick Lanham, Patrick Wendell, Prabin Banka, Prashant Sharma, Qiuzhuang, Raymond Liu, Reynold Xin, Sandy

Re: pySpark memory usage

2014-04-09 Thread Matei Zaharia
) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85) On Thu, Apr 3, 2014 at 8:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Cool, thanks for the update. Have you tried running a branch with this fix (e.g. branch-0.9, or the 0.9.1 release candidate?) Also, what memory leak issue are you

Re: NPE using saveAsTextFile

2014-04-10 Thread Matei Zaharia
I haven’t seen this but it may be a bug in Typesafe Config, since this is serializing a Config object. We don’t actually use Typesafe Config ourselves. Do you have any nulls in the data itself by any chance? And do you know how that Config object is getting there? Matei On Apr 9, 2014, at

Re: Spark - ready for prime time?

2014-04-10 Thread Matei Zaharia
To add onto the discussion about memory working space, 0.9 introduced the ability to spill data within a task to disk, and in 1.0 we’re also changing the interface to allow spilling data within the same *group* to disk (e.g. when you do groupBy and get a key with lots of values). The main

Re: Spark 0.9.1 PySpark ImportError

2014-04-10 Thread Matei Zaharia
Kind of strange because we haven’t updated CloudPickle AFAIK. Is this a package you added on the PYTHONPATH? How did you set the path, was it in conf/spark-env.sh? Matei On Apr 10, 2014, at 7:39 AM, aazout albert.az...@velos.io wrote: I am getting a python ImportError on Spark standalone

Re: Spark - ready for prime time?

2014-04-11 Thread Matei Zaharia
, Surendranauth Hiraman suren.hira...@velos.io wrote: Matei, Where is the functionality in 0.9 to spill data within a task (separately from persist)? My apologies if this is something obvious but I don't see it in the api docs. -Suren On Thu, Apr 10, 2014 at 3:59 PM, Matei Zaharia

Re: RDD.tail()

2014-04-14 Thread Matei Zaharia
You can use mapPartitionsWithIndex and look at the partition index (0 will be the first partition) to decide whether to skip the first line. Matei On Apr 14, 2014, at 8:50 AM, Ethan Jewett esjew...@gmail.com wrote: We have similar needs but IIRC, I came to the conclusion that this would only
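
A short sketch of that approach, assuming an existing SparkContext `sc` and a made-up path:

```scala
val lines = sc.textFile("hdfs:///some/file.csv")
// Partition 0 holds the start of the file, so only its first element is the header line.
val withoutHeader = lines.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
```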

Re: process_local vs node_local

2014-04-14 Thread Matei Zaharia
Spark can actually launch multiple executors on the same node if you configure it that way, but if you haven’t done that, this might mean that some tasks are reading data from the cache, and some from HDFS. (In the HDFS case Spark will only report it as NODE_LOCAL since HDFS isn’t tied to a

Re: using Kryo with pyspark?

2014-04-14 Thread Matei Zaharia
Kryo won’t make a major impact on PySpark because it just stores data as byte[] objects, which are fast to serialize even with Java. But it may be worth a try — you would just set spark.serializer and not try to register any classes. What might make more impact is storing data as

Re: partitioning of small data sets

2014-04-15 Thread Matei Zaharia
Yup, one reason it’s 2 actually is to give people a similar experience to working with large files, in case their code doesn’t deal well with the file being partitioned. Matei On Apr 15, 2014, at 9:53 AM, Aaron Davidson ilike...@gmail.com wrote: Take a look at the minSplits argument for

Re: Multi-tenant?

2014-04-15 Thread Matei Zaharia
Yes, both things can happen. Take a look at http://spark.apache.org/docs/latest/job-scheduling.html, which includes scheduling concurrent jobs within the same driver. Matei On Apr 15, 2014, at 4:08 PM, Ian Ferreira ianferre...@hotmail.com wrote: What is the support for multi-tenancy in

Re: PySpark still reading only text?

2014-04-16 Thread Matei Zaharia
Hi Bertrand, We should probably add a SparkContext.pickleFile and RDD.saveAsPickleFile that will allow saving pickled objects. Unfortunately this is not in yet, but there is an issue up to track it: https://issues.apache.org/jira/browse/SPARK-1161. In 1.0, one feature we do have now is the

Re: extremely slow k-means version

2014-04-19 Thread Matei Zaharia
The problem is that groupByKey means “bring all the points with this same key to the same JVM”. Your input is a Seq[Point], so you have to have all the points there. This means that a) all points will be sent across the network in a cluster, which is slow (and Spark goes through this sending

Re: Spark Streaming source from Amazon Kinesis

2014-04-21 Thread Matei Zaharia
There was a patch posted a few weeks ago (https://github.com/apache/spark/pull/223), but it needs a few changes in packaging because it uses a license that isn’t fully compatible with Apache. I’d like to get this merged when the changes are made though — it would be a good input source to

Re: error in mllib lr example code

2014-04-23 Thread Matei Zaharia
See http://people.csail.mit.edu/matei/spark-unified-docs/ for a more recent build of the docs; if you spot any problems in those, let us know. Matei On Apr 23, 2014, at 9:49 AM, Xiangrui Meng men...@gmail.com wrote: The doc is for 0.9.1. You are running a later snapshot, which added sparse

Re: How do I access the SPARK SQL

2014-04-23 Thread Matei Zaharia
It’s currently in the master branch, on https://github.com/apache/spark. You can check that out from git, build it with sbt/sbt assembly, and then try it out. We’re also going to post some release candidates soon that will be pre-built. Matei On Apr 23, 2014, at 1:30 PM, diplomatic Guru

Re: Deploying a python code on a spark EC2 cluster

2014-04-24 Thread Matei Zaharia
Did you launch this using our EC2 scripts (http://spark.apache.org/docs/latest/ec2-scripts.html) or did you manually set up the daemons? My guess is that their hostnames are not being resolved properly on all nodes, so executor processes can’t connect back to your driver app. This error

Re: SparkPi performance-3 cluster standalone mode

2014-04-24 Thread Matei Zaharia
The problem is that SparkPi uses Math.random(), which is a synchronized method, so it can’t scale to multiple cores. In fact it will be slower on multiple cores due to lock contention. Try another example and you’ll see better scaling. I think we’ll have to update SparkPi to create a new Random
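
A hedged sketch of the proposed fix, with one Random per partition instead of the shared, synchronized Math.random(); the slice and point counts are arbitrary:

```scala
import scala.util.Random
import org.apache.spark.SparkContext

object PiWithLocalRandom {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "pi-local-random")
    val slices = 4
    val pointsPerSlice = 100000
    val hits = sc.parallelize(1 to slices, slices).mapPartitions { _ =>
      val rand = new Random() // one generator per task, so no contention on a shared lock
      Iterator((1 to pointsPerSlice).count { _ =>
        val x = rand.nextDouble() * 2 - 1
        val y = rand.nextDouble() * 2 - 1
        x * x + y * y <= 1
      })
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * hits / (slices.toLong * pointsPerSlice)}")
    sc.stop()
  }
}
```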

Re: Finding bad data

2014-04-24 Thread Matei Zaharia
Hey Jim, this is unfortunately harder than I’d like right now, but here’s how to do it. Look at the stderr file of the executor on that machine, and you’ll see lines like this: 14/04/24 19:17:24 INFO HadoopRDD: Input split: file:/Users/matei/workspace/apache-spark/README.md:0+2000 This says

Re: parallelize for a large Seq is extreamly slow.

2014-04-24 Thread Matei Zaharia
Try setting the serializer to org.apache.spark.serializer.KryoSerializer (see http://spark.apache.org/docs/0.9.1/tuning.html), it should be considerably faster. Matei On Apr 24, 2014, at 8:01 PM, Earthson Lu earthson...@gmail.com wrote:
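
A sketch of enabling Kryo through SparkConf; the app name and the data being parallelized are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object KryoParallelizeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("kryo-parallelize")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    // parallelize() serializes the whole local collection, so a faster serializer helps a lot.
    val big = sc.parallelize(Seq.tabulate(1000000)(i => (i, i.toString)))
    println(big.count())
    sc.stop()
  }
}
```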

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Matei Zaharia
From my point of view, both are supported equally. The YARN support is newer and that’s why there’s been a lot more action there in recent months. Matei On Apr 27, 2014, at 12:08 PM, Andrew Ash and...@andrewash.com wrote: That thread was mostly about benchmarking YARN vs standalone, and the

Re: Running a spark-submit compatible app in spark-shell

2014-04-27 Thread Matei Zaharia
Hi Roger, You should be able to use the --jars argument of spark-shell to add JARs onto the classpath and then work with those classes in the shell. (A recent patch, https://github.com/apache/spark/pull/542, made spark-shell use the same command-line arguments as spark-submit). But this is a

Re: Running out of memory Naive Bayes

2014-04-28 Thread Matei Zaharia
Not sure if this is always ideal for Naive Bayes, but you could also hash the features into a lower-dimensional space (e.g. reduce it to 50,000 features). For each feature simply take MurmurHash3(featureID) % 5 for example. Matei On Apr 27, 2014, at 11:24 PM, DB Tsai dbt...@stanford.edu
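
A rough Scala sketch of that hashing trick; the bucket count follows the 50,000-feature example in the message, and the feature ids are invented:

```scala
import scala.util.hashing.MurmurHash3

val numBuckets = 50000

// Map an arbitrary feature id into a fixed, smaller index space.
def hashedIndex(featureId: String): Int = {
  val h = MurmurHash3.stringHash(featureId) % numBuckets
  if (h < 0) h + numBuckets else h // keep the index non-negative
}

// e.g. build (hashed index -> count) pairs for one document's tokens
val doc = Seq("spark", "naive", "bayes", "spark")
val hashedCounts = doc.groupBy(hashedIndex).mapValues(_.size)
```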

Re: K-means with large K

2014-04-28 Thread Matei Zaharia
Try turning on the Kryo serializer as described at http://spark.apache.org/docs/latest/tuning.html. Also, are there any exceptions in the driver program’s log before this happens? Matei On Apr 28, 2014, at 9:19 AM, Buttler, David buttl...@llnl.gov wrote: Hi, I am trying to run the K-means

Re: processing s3n:// files in parallel

2014-04-28 Thread Matei Zaharia
Actually wildcards work too, e.g. s3n://bucket/file1*, and I believe so do comma-separated lists (e.g. s3n://file1,s3n://file2). These are all inherited from FileInputFormat in Hadoop. Matei On Apr 28, 2014, at 6:05 PM, Andrew Ash and...@andrewash.com wrote: This is already possible with the
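
For example, assuming an existing SparkContext `sc` and invented bucket and file names:

```scala
// Globs and comma-separated path lists are both expanded by Hadoop's FileInputFormat.
val globbed = sc.textFile("s3n://my-bucket/logs/2014-04-*")
val listed  = sc.textFile("s3n://my-bucket/file1,s3n://my-bucket/file2")
```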

Re: Python Spark on YARN

2014-04-29 Thread Matei Zaharia
This will be possible in 1.0 after this pull request: https://github.com/apache/spark/pull/30 Matei On Apr 29, 2014, at 9:51 AM, Guanhua Yan gh...@lanl.gov wrote: Hi all: Is it possible to develop Spark programs in Python and run them on YARN? From the Python SparkContext class, it

Re: performance improvement on second operation...without caching?

2014-05-03 Thread Matei Zaharia
Hi Diana, Apart from these reasons, in a multi-stage job, Spark saves the map output files from map stages to the filesystem, so it only needs to rerun the last reduce stage. This is why you only saw one stage executing. These files are saved for fault recovery but they speed up subsequent

Re: performance improvement on second operation...without caching?

2014-05-03 Thread Matei Zaharia
-uses that? On Sat, May 3, 2014 at 8:29 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Diana, Apart from these reasons, in a multi-stage job, Spark saves the map output files from map stages to the filesystem, so it only needs to rerun the last reduce stage. This is why you only saw

Re: Spark GCE Script

2014-05-05 Thread Matei Zaharia
Very cool! Have you thought about sending this as a pull request? We’d be happy to maintain it inside Spark, though it might be interesting to find a single Python package that can manage clusters across both EC2 and GCE. Matei On May 5, 2014, at 7:18 AM, Akhil Das ak...@sigmoidanalytics.com

Re: Increase Stack Size Workers

2014-05-06 Thread Matei Zaharia
Add export SPARK_JAVA_OPTS=“-Xss16m” to conf/spark-env.sh. Then it should apply to the executor. Matei On May 5, 2014, at 2:20 PM, Andrea Esposito and1...@gmail.com wrote: Hi there, i'm doing an iterative algorithm and sometimes i ended up with StackOverflowError, doesn't matter if i do
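
That is, in conf/spark-env.sh (16 MB is the size suggested in the thread; tune it to your recursion depth):

```
# conf/spark-env.sh
export SPARK_JAVA_OPTS="-Xss16m"
```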

Re: Spark and Java 8

2014-05-06 Thread Matei Zaharia
Java 8 support is a feature in Spark, but vendors need to decide for themselves when they’d like to support Java 8 commercially. You can still run Spark on Java 7 or 6 without taking advantage of the new features (indeed our builds are always against Java 6). Matei On May 6, 2014, at 8:59 AM,

Re: Spark to utilize HDFS's mmap caching

2014-05-12 Thread Matei Zaharia
Yes, Spark goes through the standard HDFS client and will automatically benefit from this. Matei On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi chan...@gmail.com wrote: Hi all, Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via sc.textFile() and other HDFS-related APIs?

Re: Is their a way to Create SparkContext object?

2014-05-12 Thread Matei Zaharia
You can just pass it around as a parameter. On May 12, 2014, at 12:37 PM, yh18190 yh18...@gmail.com wrote: Hi, Could anyone suggest an idea how can we create sparkContext object in other classes or functions where we need to convert a scala collection to RDD using sc object.like

Re: pySpark memory usage

2014-05-12 Thread Matei Zaharia
at ~54GB. stats() returns (count: 56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min: 343) On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Okay, thanks. Do you have any info on how large your records and data file are? I'd like to reproduce and fix

Test

2014-05-15 Thread Matei Zaharia

Re: pySpark memory usage

2014-05-15 Thread Matei Zaharia
400 for the textFile()s, 1500 for the join()s. On Mon, May 12, 2014 at 7:58 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Jim, unfortunately external spilling is not implemented in Python right now. While it would be possible to update combineByKey to do smarter stuff here, one

Re: persist @ disk-only failing

2014-05-19 Thread Matei Zaharia
, May 19, 2014 at 1:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: What version is this with? We used to build each partition first before writing it out, but this was fixed a while back (0.9.1, but it may also be in 0.9.0). Matei On May 19, 2014, at 12:41 AM, Sai Prasanna

Re: How to compile the examples directory?

2014-05-19 Thread Matei Zaharia
If you’d like to work on just this code for your own changes, it might be best to copy it to a separate project. Look at http://spark.apache.org/docs/latest/quick-start.html for how to set up a standalone job. Matei On May 19, 2014, at 4:53 AM, Hao Wang wh.s...@gmail.com wrote: Hi, I am

Re: advice on maintaining a production spark cluster?

2014-05-19 Thread Matei Zaharia
Which version is this with? I haven’t seen standalone masters lose workers. Is there other stuff on the machines that’s killing them, or what errors do you see? Matei On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote: Hey folks, I'm wondering what strategies other folks

Re: life if an executor

2014-05-19 Thread Matei Zaharia
They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long to they live for? as long as the worker/slave? or are

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Matei Zaharia
restarting the workers usually resolves this, but we often seen workers disappear after a failed or killed job. If we see this occur again, I'll try and provide some logs. On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Which version is this with? I

Re: Python, Spark and HBase

2014-05-20 Thread Matei Zaharia
Unfortunately this is not yet possible. There’s a patch in progress posted here though: https://github.com/apache/spark/pull/455 — it would be great to get your feedback on it. Matei On May 20, 2014, at 4:21 PM, twizansk twiza...@gmail.com wrote: Hello, This seems like a basic question

Re: Python, Spark and HBase

2014-05-28 Thread Matei Zaharia
It sounds like you made a typo in the code — perhaps you’re trying to call self._jvm.PythonRDDnewAPIHadoopFile instead of self._jvm.PythonRDD.newAPIHadoopFile? There should be a dot before the new. Matei On May 28, 2014, at 5:25 PM, twizansk twiza...@gmail.com wrote: Hi Nick, I finally

Re: Checking spark cache percentage programatically. And how to clear cache.

2014-05-28 Thread Matei Zaharia
You can remove cached RDDs by calling unpersist() on them. You can also use SparkContext.getRDDStorageInfo to get info on cache usage, though this is a developer API so it may change in future versions. We will add a standard API eventually but this is just very closely tied to framework
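
A sketch of both calls, assuming an existing SparkContext `sc`; as the message notes, getRDDStorageInfo was a developer API at the time, so its exact fields may differ between versions:

```scala
val cached = sc.textFile("hdfs:///some/input").setName("input").cache()
cached.count() // materialize the cache

// Rough view of how much of each cached RDD actually made it into memory.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions, " +
    s"${info.memSize} bytes in memory")
}

cached.unpersist() // drop it from the cache right away
```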

Re: Spark hook to create external process

2014-05-29 Thread Matei Zaharia
Hi Anand, This is probably already handled by the RDD.pipe() operation. It will spawn a process and let you feed data to it through its stdin and read data through stdout. Matei On May 29, 2014, at 9:39 AM, ansriniv ansri...@gmail.com wrote: I have a requirement where for every Spark
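
A minimal sketch of pipe(), assuming an existing SparkContext `sc` and that the external command (tr here) exists on every worker:

```scala
// Each partition's elements are written to the process's stdin, one per line,
// and each line of its stdout becomes an element of the resulting RDD.
val shouted = sc.parallelize(Seq("spark", "pipe", "example"))
  .pipe(Seq("tr", "[:lower:]", "[:upper:]"))
shouted.collect().foreach(println) // SPARK, PIPE, EXAMPLE
```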

Re: Driver OOM while using reduceByKey

2014-05-29 Thread Matei Zaharia
That hash map is just a list of where each task ran, it’s not the actual data. How many map and reduce tasks do you have? Maybe you need to give the driver a bit more memory, or use fewer tasks (e.g. do reduceByKey(_ + _, 100) to use only 100 tasks). Matei On May 29, 2014, at 2:03 AM, haitao

Re: Why Scala?

2014-05-29 Thread Matei Zaharia
Quite a few people ask this question and the answer is pretty simple. When we started Spark, we had two goals — we wanted to work with the Hadoop ecosystem, which is JVM-based, and we wanted a concise programming interface similar to Microsoft’s DryadLINQ (the first language-integrated big data

Re: Shuffle file consolidation

2014-05-29 Thread Matei Zaharia
It can be set in an individual application. Consolidation had some issues on ext3 as mentioned there, though we might enable it by default in the future because other optimizations now made it perform on par with the non-consolidation version. It also had some bugs in 0.9.0 so I’d suggest at

Re: Trouble with EC2

2014-05-31 Thread Matei Zaharia
What instance types did you launch on? Sometimes you also get a bad individual machine from EC2. It might help to remove the node it’s complaining about from the conf/slaves file. Matei On May 30, 2014, at 11:18 AM, PJ$ p...@chickenandwaffl.es wrote: Hey Folks, I'm really having quite a

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Matei Zaharia
More specifically with the -a flag, you *can* set your own AMI, but you’ll need to base it off ours. This is because spark-ec2 assumes that some packages (e.g. java, Python 2.6) are already available on the AMI. Matei On Jun 1, 2014, at 11:01 AM, Patrick Wendell pwend...@gmail.com wrote: Hey

Re: Trouble with EC2

2014-06-01 Thread Matei Zaharia
1, 2014, at 3:11 PM, PJ$ p...@chickenandwaffl.es wrote: Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't gotten any further. No clue what's wrong. I'd really appreciate any guidance y'all could offer. Best, PJ$ On Sat, May 31, 2014 at 1:40 PM, Matei Zaharia

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Matei Zaharia
FYI, I opened https://issues.apache.org/jira/browse/SPARK-1990 to track this. Matei On Jun 1, 2014, at 6:14 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Sort of.. there were two separate issues, but both related to AWS.. I've sorted the confusion about the Master/Worker AMI ... use

Re: SecurityException when running tests with Spark 1.0.0

2014-06-02 Thread Matei Zaharia
You can just use the Maven build for now, even for Spark 1.0.0. Matei On Jun 2, 2014, at 5:30 PM, Mohit Nayak wiza...@gmail.com wrote: Hey, Yup that fixed it. Thanks so much! Is this the only solution, or could this be resolved in future versions of Spark ? On Mon, Jun 2, 2014 at

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread Matei Zaharia
Yeah unfortunately Hadoop 2 requires these binaries on Windows. Hadoop 1 runs just fine without them. Matei On Jun 3, 2014, at 10:33 AM, Sean Owen so...@cloudera.com wrote: I'd try the internet / SO first -- these are actually generic Hadoop-related issues. Here I think you don't have

Re: Better line number hints for logging?

2014-06-03 Thread Matei Zaharia
You can use RDD.setName to give it a name. There’s also a creationSite field that is private[spark] — we may want to add a public setter for that later. If the name isn’t enough and you’d like this, please open a JIRA issue for it. Matei On Jun 3, 2014, at 5:22 PM, John Salvatier
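
For example, assuming an existing SparkContext `sc` and a made-up path:

```scala
// The name shows up in the web UI's storage tab and in log lines, which makes
// it much easier to map a logged RDD back to the code that created it.
val events = sc.textFile("hdfs:///some/logs").setName("raw event log")
events.cache().count()
```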

Re: Invalid Class Exception

2014-06-03 Thread Matei Zaharia
What Java version do you have, and how did you get Spark (did you build it yourself by any chance or download a pre-built one)? If you build Spark yourself you need to do it with Java 6 — it’s a known issue because of the way Java 6 and 7 package JAR files. But I haven’t seen it result in this

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-03 Thread Matei Zaharia
Ghost, it's the dream language we've theorized about for years! I hadn't realized! Indeed, glad you’re enjoying it. Matei On Mon, Jun 2, 2014 at 12:05 PM, Matei Zaharia matei.zaha...@gmail.com wrote: FYI, I opened https://issues.apache.org/jira/browse/SPARK-1990 to track this. Matei

Re: Upgradation to Spark 1.0.0

2014-06-03 Thread Matei Zaharia
You can copy your configuration from the old one. I’d suggest just downloading it to a different location on each node first for testing, then you can delete the old one if things work. On Jun 3, 2014, at 12:38 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi , I am currently using

Re: Join : Giving incorrect result

2014-06-04 Thread Matei Zaharia
If this isn’t the problem, it would be great if you can post the code for the program. Matei On Jun 4, 2014, at 12:58 PM, Xu (Simon) Chen xche...@gmail.com wrote: Maybe your two workers have different assembly jar files? I just ran into a similar problem that my spark-shell is using a

Re: reuse hadoop code in Spark

2014-06-04 Thread Matei Zaharia
Yes, you can write some glue in Spark to call these. Some functions to look at: - SparkContext.hadoopRDD lets you create an input RDD from an existing JobConf configured by Hadoop (including InputFormat, paths, etc) - RDD.mapPartitions lets you operate in all the values on one partition (block)
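
A rough sketch of that glue, with placeholder paths and trivial per-partition logic standing in for real Hadoop code:

```scala
import org.apache.spark.SparkContext
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

object ReuseHadoopCodeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "reuse-hadoop-code")

    // Configure the JobConf exactly the way existing Hadoop code already does.
    val jobConf = new JobConf()
    FileInputFormat.setInputPaths(jobConf, "hdfs:///some/input")

    val records = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text])

    // mapPartitions hands you every record in a block, so per-split logic ports over directly.
    val lineLengths = records.mapPartitions(iter => iter.map { case (_, line) => line.getLength })
    println(lineLengths.reduce(_ + _))
    sc.stop()
  }
}
```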

Re: Better line number hints for logging?

2014-06-04 Thread Matei Zaharia
than just one line? (Of course you would have to click to expand it.) On Wed, Jun 4, 2014 at 2:38 AM, John Salvatier jsalvat...@gmail.com wrote: Ok, I will probably open a Jira. On Tue, Jun 3, 2014 at 5:29 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You can use RDD.setName to give

Re: pyspark join crash

2014-06-04 Thread Matei Zaharia
In PySpark, the data processed by each reduce task needs to fit in memory within the Python process, so you should use more tasks to process this dataset. Data is spilled to disk across tasks. I’ve created https://issues.apache.org/jira/browse/SPARK-2021 to track this — it’s something we’ve

Re: How can I dispose an Accumulator?

2014-06-04 Thread Matei Zaharia
All of these are disposed of automatically if you stop the context or exit the program. Matei On Jun 4, 2014, at 2:22 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: Will the broadcast variables be disposed automatically if the context is stopped, or do I still need to unpersist()?

Re: pyspark join crash

2014-06-04 Thread Matei Zaharia
On Wed, Jun 4, 2014 at 1:42 PM, Matei Zaharia matei.zaha...@gmail.com wrote: In PySpark, the data processed by each reduce task needs to fit in memory within the Python process, so you should use more tasks to process this dataset. Data is spilled to disk across tasks. I’ve created https

Re: Why Scala?

2014-06-04 Thread Matei Zaharia
to include Python APIs in Spark Streaming? Anytime frame on this? Thanks! John On Thu, May 29, 2014 at 4:19 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Quite a few people ask this question and the answer is pretty simple. When we started Spark, we had two goals — we wanted to work
