[jira] [Updated] (SPARK-2661) Unpersist last RDD in bagel iteration

2014-07-23 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2661: - Fix Version/s: 1.1.0 Unpersist last RDD in bagel iteration

[jira] [Resolved] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator

2014-07-22 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2047. -- Resolution: Fixed Fix Version/s: 1.1.0 Use less memory

[jira] [Updated] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator

2014-07-22 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2047: - Assignee: Aaron Davidson Use less memory in AppendOnlyMap.destructiveSortedIterator

Re: Very wierd behavior

2014-07-22 Thread Matei Zaharia
Is the first() being computed locally on the driver program? Maybe it's too hard to compute with the memory, etc. available there. Take a look at the driver's log and see whether it has the message "Computing the requested partition locally". Matei On Jul 22, 2014, at 12:04 PM, Nathan Kronenfeld

[jira] [Updated] (SPARK-2494) Hash of None is different cross machines in CPython

2014-07-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2494: - Priority: Major (was: Blocker) Hash of None is different cross machines in CPython

[jira] [Updated] (SPARK-2494) Hash of None is different cross machines in CPython

2014-07-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2494: - Affects Version/s: 0.9.2 0.9.0 0.9.1 Hash of None

[jira] [Updated] (SPARK-2494) Hash of None is different cross machines in CPython

2014-07-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2494: - Fix Version/s: (was: 1.0.1) (was: 1.0.0) 0.9.3

[jira] [Updated] (SPARK-2494) Hash of None is different cross machines in CPython

2014-07-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2494: - Target Version/s: 1.1.0, 1.0.2, 0.9.3 (was: 1.1.0, 1.0.2) Hash of None is different cross

[jira] [Updated] (SPARK-2494) Hash of None is different cross machines in CPython

2014-07-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2494: - Assignee: Davies Liu Hash of None is different cross machines in CPython

Re: Launching with m3.2xlarge instances: /mnt and /mnt2 mounted on 7gb drive

2014-07-21 Thread Matei Zaharia
Actually the script in the master branch is also broken (it's pointing to an older AMI). Try 1.0.1 for launching clusters. On Jul 20, 2014, at 2:25 PM, Chris DuBois chris.dub...@gmail.com wrote: I pulled the latest last night. I'm on commit 4da01e3. On Sun, Jul 20, 2014 at 2:08 PM, Matei

Re: Launching with m3.2xlarge instances: /mnt and /mnt2 mounted on 7gb drive

2014-07-20 Thread Matei Zaharia
Is this with the 1.0.0 scripts? I believe it's fixed in 1.0.1. Matei On Jul 20, 2014, at 1:22 AM, Chris DuBois chris.dub...@gmail.com wrote: Using the spark-ec2 script with m3.2xlarge instances seems to not have /mnt and /mnt2 pointing to the 80gb SSDs that come with that instance. Does

[jira] [Updated] (SPARK-2553) CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key

2014-07-18 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2553: - Assignee: Sandy Ryza CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key

[jira] [Resolved] (SPARK-2553) CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key

2014-07-18 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2553. -- Resolution: Fixed Target Version/s: 1.1.0 CoGroupedRDD unnecessarily allocates

[jira] [Assigned] (SPARK-2045) Sort-based shuffle implementation

2014-07-18 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-2045: Assignee: Matei Zaharia Sort-based shuffle implementation

[jira] [Created] (SPARK-2558) Mention --queue argument in YARN documentation

2014-07-17 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2558: Summary: Mention --queue argument in YARN documentation Key: SPARK-2558 URL: https://issues.apache.org/jira/browse/SPARK-2558 Project: Spark Issue Type

[jira] [Updated] (SPARK-2558) Mention --queue argument in YARN documentation

2014-07-17 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2558: - Labels: Starter (was: ) Mention --queue argument in YARN documentation

Re: preservesPartitioning

2014-07-17 Thread Matei Zaharia
Hi Kamal, This is not what preservesPartitioning does -- actually what it means is that if the RDD has a Partitioner set (which means it's an RDD of key-value pairs and the keys are grouped into a known way, e.g. hashed or range-partitioned), your map function is not changing the partition of
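As a rough illustration of the point in this reply, the following is a plain-Python model, not Spark code. It assumes non-negative integer keys and a toy modulo partitioner; the names partition_of, build_partitions, map_partitions and still_partitioned are hypothetical. A function that leaves keys alone keeps every record where the partitioner expects it, which is the promise that preservesPartitioning makes; a function that rewrites keys breaks that promise.

```python
def partition_of(key, num_partitions):
    """Toy hash partitioner for non-negative integer keys (an assumption)."""
    return key % num_partitions


def build_partitions(pairs, num_partitions):
    """Place each (key, value) pair in the partition chosen by the partitioner."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return partitions


def map_partitions(partitions, fn):
    """Apply fn to every record without moving records between partitions."""
    return [[fn(record) for record in part] for part in partitions]


def still_partitioned(partitions):
    """True if every key still sits in the partition the partitioner assigns it."""
    n = len(partitions)
    return all(partition_of(key, n) == index
               for index, part in enumerate(partitions)
               for key, _ in part)


parts = build_partitions([(0, "a"), (1, "c"), (2, "b")], 2)
# Only the values change, so the existing partitioning remains valid.
values_only = map_partitions(parts, lambda kv: (kv[0], kv[1].upper()))
# The keys change, so the records no longer match the partitioner.
keys_changed = map_partitions(parts, lambda kv: (kv[0] + 1, kv[1]))
```

In this model, still_partitioned(values_only) holds while still_partitioned(keys_changed) does not, which is why the flag should only be set when the function leaves keys untouched.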

Re: Spark scheduling with Capacity scheduler

2014-07-17 Thread Matei Zaharia
It's possible using the --queue argument of spark-submit. Unfortunately this is not documented on http://spark.apache.org/docs/latest/running-on-yarn.html but it appears if you just type spark-submit --help or spark-submit with no arguments. Matei On Jul 17, 2014, at 2:33 AM, Konstantin

Re: Include permalinks in mail footer

2014-07-17 Thread Matei Zaharia
Good question.. I'll ask INFRA because I haven't seen other Apache mailing lists provide this. It would indeed be helpful. Matei On Jul 17, 2014, at 12:59 PM, Nick Chammas nicholas.cham...@gmail.com wrote: Can we modify the mailing list to include permalinks to the thread in the footer of

[jira] [Updated] (SPARK-2048) Optimizations to CPU usage of external spilling code

2014-07-16 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2048: - Description: In the external spilling code in ExternalAppendOnlyMap and CoGroupedRDD

[jira] [Commented] (SPARK-2048) Optimizations to CPU usage of external spilling code

2014-07-16 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064602#comment-14064602 ] Matei Zaharia commented on SPARK-2048: -- I added one more issue to this BTW, about

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Matei Zaharia
Hey Reynold, just to clarify, users will still have to manually broadcast objects that they want to use *across* operations (e.g. in multiple iterations of an algorithm, or multiple map functions, or stuff like that). But they won't have to broadcast something they only use once. Matei On Jul

Re: Release date for new pyspark

2014-07-16 Thread Matei Zaharia
Yeah, we try to have a regular 3 month release cycle; see https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage for the current window. Matei On Jul 16, 2014, at 4:21 PM, Mark Hamstra m...@clearstorydata.com wrote: You should expect master to compile and run: patches aren't merged

[jira] [Updated] (SPARK-2045) Sort-based shuffle implementation

2014-07-15 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2045: - Attachment: (was: Sort-basedshuffledesign.pdf) Sort-based shuffle implementation

[jira] [Updated] (SPARK-2045) Sort-based shuffle implementation

2014-07-15 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2045: - Attachment: Sort-basedshuffledesign.pdf I've posted a design doc for a simple version

[jira] [Updated] (SPARK-2045) Sort-based shuffle implementation

2014-07-15 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2045: - Attachment: Sort-basedshuffledesign.pdf Oops, attached the wrong file before. Here's the right

[jira] [Commented] (SPARK-2045) Sort-based shuffle implementation

2014-07-15 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063009#comment-14063009 ] Matei Zaharia commented on SPARK-2045: -- Right now I was thinking it would happen

Re: Catalyst dependency on Spark Core

2014-07-15 Thread Matei Zaharia
Yeah, that seems like something we can inline :). On Jul 15, 2014, at 7:30 PM, Baofeng Zhang pelickzh...@qq.com wrote: Is Matei following this? Catalyst uses the Utils to get the ClassLoader which loaded Spark. Can Catalyst directly do getClass.getClassLoader to avoid the dependency on

Re: Iteration question

2014-07-15 Thread Matei Zaharia
Hi Nathan, I think there are two possible reasons for this. One is that even though you are caching RDDs, their lineage chain gets longer and longer, and thus serializing each RDD takes more time. You can cut off the chain by using RDD.checkpoint() periodically, say every 5-10 iterations. The

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Matei Zaharia
Yeah, this is handled by the commit call of the FileOutputFormat. In general Hadoop OutputFormats have a concept called committing the output, which you should do only once per partition. In the file ones it does an atomic rename to make sure that the final output is a complete file. Matei On
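The commit-by-rename idea described in this reply can be sketched in plain Python. This is a simplified model under stated assumptions, not Hadoop's FileOutputFormat code: the helpers write_attempt, commit and abort and the file naming are hypothetical. Each task attempt writes to its own temporary file, exactly one attempt is committed with an atomic rename, and the losing speculative attempt is discarded.

```python
import os
import tempfile


def write_attempt(out_dir, partition, attempt, data):
    """Write one task attempt's output to its own temporary file."""
    tmp_path = os.path.join(
        out_dir, "_tmp-part-%05d-attempt-%d" % (partition, attempt))
    with open(tmp_path, "w") as f:
        f.write(data)
    return tmp_path


def commit(tmp_path, out_dir, partition):
    """Atomically rename a finished attempt to the final file name."""
    final_path = os.path.join(out_dir, "part-%05d" % partition)
    os.replace(tmp_path, final_path)
    return final_path


def abort(tmp_path):
    """Discard the output of an attempt that was not chosen for commit."""
    if os.path.exists(tmp_path):
        os.remove(tmp_path)


def demo_commit():
    """Run two attempts for partition 0, commit the first, abort the second."""
    with tempfile.TemporaryDirectory() as out_dir:
        first = write_attempt(out_dir, 0, 0, "a\nb\n")
        second = write_attempt(out_dir, 0, 1, "a\nb\n")
        final_path = commit(first, out_dir, 0)
        abort(second)
        with open(final_path) as f:
            return f.read(), sorted(os.listdir(out_dir))
```

Because the rename happens only after the attempt has finished writing, a reader of the output directory sees either no final file or a complete one, never a partial file.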

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Matei Zaharia
I haven't seen issues using the JVM's own tools (jstack, jmap, hprof and such), so maybe there's a problem in YourKit or in your release of the JVM. Otherwise I'd suggest increasing the heap size of the unit tests a bit (you can do this in the SBT build file). Maybe they are very close to full

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Matei Zaharia
Yeah, I'd just add a spark-util that has these things. Matei On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com wrote: Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Matei Zaharia
You can actually turn off shuffle compression by setting spark.shuffle.compress to false. Try that out, there will still be some buffers for the various OutputStreams, but they should be smaller. Matei On Jul 14, 2014, at 3:30 PM, Stephen Haberman stephen.haber...@gmail.com wrote: Just a

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Matei Zaharia
You currently can't use SparkContext inside a Spark task, so in this case you'd have to call some kind of local K-means library. One example you can try to use is Weka (http://www.cs.waikato.ac.nz/ml/weka/). You can then load your text files as an RDD of strings with SparkContext.wholeTextFiles

Re: Ideal core count within a single JVM

2014-07-14 Thread Matei Zaharia
Are you increasing the number of parallel tasks with cores as well? With more tasks there will be more data communicated and hence more calls to these functions. Unfortunately contention is kind of hard to measure, since often the result is that you see many cores idle as they're waiting on a

Re: Memory compute-intensive tasks

2014-07-14 Thread Matei Zaharia
I think coalesce with shuffle=true will force it to have one task per node. Without that, it might be that due to data locality it decides to launch multiple ones on the same node even though the total # of tasks is equal to the # of nodes. If this is the *only* thing you run on the cluster,

Re: Spark 1.0.1 EC2 - Launching Applications

2014-07-14 Thread Matei Zaharia
The script should be there, in the spark/bin directory. What command did you use to launch the cluster? Matei On Jul 14, 2014, at 1:12 PM, Josh Happoldt josh.happo...@trueffect.com wrote: Hi All, I've used the spark-ec2 scripts to build a simple 1.0.1 Standalone cluster on EC2. It

Re: Ideal core count within a single JVM

2014-07-14 Thread Matei Zaharia
of them here, but if your file is big it will also have at least one task per 32 MB block of the file. Matei On Jul 14, 2014, at 6:39 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I see, so here might be the problem. With more cores, there's less memory available per core, and now many of your

Re: hdfs replication on saving RDD

2014-07-14 Thread Matei Zaharia
You can change this setting through SparkContext.hadoopConfiguration, or put the conf/ directory of your Hadoop installation on the CLASSPATH when you launch your app so that it reads the config values from there. Matei On Jul 14, 2014, at 8:06 PM, valgrind_girl 124411...@qq.com wrote: eager

Re: spark ui on yarn

2014-07-12 Thread Matei Zaharia
The UI code is the same in both, but one possibility is that your executors were given less memory on YARN. Can you check that? Or otherwise, how do you know that some RDDs were cached? Matei On Jul 12, 2014, at 4:12 PM, Koert Kuipers ko...@tresata.com wrote: hey shuo, so far all stage

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Matei Zaharia
Unless you can diagnose the problem quickly, Gary, I think we need to go ahead with this release as is. This release didn't touch the Mesos support as far as I know, so the problem might be a nondeterministic issue with your application. But on the other hand the release does fix some critical

Re: Document page load fault

2014-07-08 Thread Matei Zaharia
Thanks for catching this. For now you can just access the page through http:// instead of https:// to avoid this. Matei On Jul 8, 2014, at 10:46 PM, binbinbin915 binbinbin...@live.cn wrote: https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression on Chrome 35 with

Re: the Pre-built packages for CDH4 can not support yarn ?

2014-07-07 Thread Matei Zaharia
They are for CDH4 without YARN, since YARN is experimental in that. You can download one of the Hadoop 2 packages if you want to run on YARN. Or you might have to build specifically against CDH4's version of YARN if that doesn't work. Matei On Jul 7, 2014, at 9:37 PM, ch huang

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-06 Thread Matei Zaharia
+1 Tested on Mac OS X. Matei On Jul 6, 2014, at 1:54 AM, Andrew Or and...@databricks.com wrote: +1, verified that the UI bug is in fact fixed in https://github.com/apache/spark/pull/1255. 2014-07-05 20:01 GMT-07:00 Soren Macbeth so...@yieldbot.com: +1 On Sat, Jul 5, 2014 at 7:41

[jira] [Created] (SPARK-2371) Show locally-running tasks (e.g. from take()) in web UI

2014-07-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2371: Summary: Show locally-running tasks (e.g. from take()) in web UI Key: SPARK-2371 URL: https://issues.apache.org/jira/browse/SPARK-2371 Project: Spark Issue

Re: java options for spark-1.0.0

2014-07-02 Thread Matei Zaharia
Try looking at the running processes with “ps” to see their full command line and see whether any options are different. It seems like in both cases, your young generation is quite large (11 GB), which doesn’t make a lot of sense with a heap of 15 GB. But maybe I’m misreading something. Matei

Re: Shark Vs Spark SQL

2014-07-02 Thread Matei Zaharia
Spark SQL in Spark 1.1 will include all the functionality in Shark; take a look at http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html. We decided to do this because at the end of the day, the only code left in Shark was the JDBC / Thrift

Re: AWS Credentials for private S3 reads

2014-07-02 Thread Matei Zaharia
When you use hadoopConfiguration directly, I don’t think you have to replace the “/“ with “%2f”. Have you tried it without that? Also make sure you’re not replacing slashes in the URL itself. Matei On Jul 2, 2014, at 4:17 PM, Brian Gawalt bgaw...@gmail.com wrote: Hello everyone, I'm

Re: Spark 1.0: Unable to Read LZO Compressed File

2014-07-01 Thread Matei Zaharia
I’d suggest asking the IBM Hadoop folks, but my guess is that the library cannot be found in /opt/IHC/lib/native/Linux-amd64-64/. Or maybe if this exception is happening in your driver program, the driver program’s java.library.path doesn’t include this. (SPARK_LIBRARY_PATH from spark-env.sh

Re: [VOTE] Release Apache Spark 1.0.1 (RC1)

2014-06-27 Thread Matei Zaharia
+1 Tested it out on Mac OS X and Windows, looked through docs. Matei On Jun 26, 2014, at 7:06 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.1! The tag to be voted on is v1.0.1-rc1 (commit 7feeda3):

[jira] [Updated] (SPARK-1937) Tasks can be submitted before executors are registered

2014-06-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1937: - Assignee: Rui Li Tasks can be submitted before executors are registered

[jira] [Resolved] (SPARK-1937) Tasks can be submitted before executors are registered

2014-06-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1937. -- Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 Tasks can

[jira] [Created] (SPARK-2248) spark.default.parallelism does not apply in local mode

2014-06-23 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2248: Summary: spark.default.parallelism does not apply in local mode Key: SPARK-2248 URL: https://issues.apache.org/jira/browse/SPARK-2248 Project: Spark Issue

[jira] [Resolved] (SPARK-2124) Move aggregation into ShuffleManager implementations

2014-06-23 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2124. -- Resolution: Fixed Fix Version/s: 1.1.0 Move aggregation into ShuffleManager

Re: RFC: [SPARK-529] Create constants for known config variables.

2014-06-23 Thread Matei Zaharia
Hey Marcelo, When we did the configuration pull request, we actually avoided having a big list of defaults in one class file, because this creates a file that all the components in the project depend on. For example, since we have some settings specific to streaming and the REPL, do we want

Re: Powered by Spark addition

2014-06-21 Thread Matei Zaharia
customer targetting, accurate inventory and efficient analysis. Thanks! Best Regards, Sonal Nube Technologies On Thu, Jun 12, 2014 at 11:33 PM, Derek Mansen de...@vistarmedia.com wrote: Awesome, thank you! On Wed, Jun 11, 2014 at 6:53 PM, Matei Zaharia matei.zaha

[jira] [Updated] (SPARK-2206) Automatically infer the number of classification classes in multiclass classification

2014-06-19 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2206: - Assignee: Manish Amde Automatically infer the number of classification classes in multiclass

[jira] [Updated] (SPARK-2207) Add minimum information gain and minimum instances per node as training parameters for decision tree.

2014-06-19 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2207: - Assignee: Manish Amde Add minimum information gain and minimum instances per node as training

Re: MLLib inside Storm : silly or not ?

2014-06-19 Thread Matei Zaharia
You should be able to use many of the MLlib Model objects directly in Storm, if you save them out using Java serialization. The only one that won’t work is probably ALS, because it’s a distributed model. Otherwise, you will have to output them in your own format and write code for evaluating

[jira] [Updated] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution

2014-06-18 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1112: - Affects Version/s: 1.0.0 When spark.akka.frameSize > 10, task results bigger than 10MiB block

Re: Spark is now available via Homebrew

2014-06-18 Thread Matei Zaharia
Interesting, does anyone know the people over there who set it up? It would be good if Apache itself could publish packages there, though I’m not sure what’s involved. Since Spark just depends on Java and Python it should be easy for us to update. Matei On Jun 18, 2014, at 1:37 PM, Nick

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Matei Zaharia
I was going to suggest the same thing :). On Jun 18, 2014, at 4:56 PM, Evan R. Sparks evan.spa...@gmail.com wrote: This looks like a job for SparkSQL! val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class MyRecord(country: String, name: String, age: Int,

Re: Un-serializable 3rd-party classes (Spark, Java)

2014-06-17 Thread Matei Zaharia
There are a few options: - Kryo might be able to serialize these objects out of the box, depending what’s inside them. Try turning it on as described at http://spark.apache.org/docs/latest/tuning.html. - If that doesn’t work, you can create your own “wrapper” objects that implement

Re: pyspark serializer can't handle functions?

2014-06-16 Thread Matei Zaharia
It’s true that it can’t. You can try to use the CloudPickle library instead, which is what we use within PySpark to serialize functions (see python/pyspark/cloudpickle.py). However I’m also curious, why do you need an RDD of functions? Matei On Jun 15, 2014, at 4:49 PM, madeleine
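For context on why the default serializer fails here: Python's standard pickle module stores functions by reference (module plus name), so an anonymous function has nothing it can be looked up by. The small check below uses only the standard library; can_pickle is a hypothetical helper name, and the snippet does not show how CloudPickle itself works.

```python
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A lambda has no importable name, so pickling by reference fails.
lambda_ok = can_pickle(lambda x: x * x)
# A builtin function is found by name in the builtins module, so it succeeds.
builtin_ok = can_pickle(len)
# Plain data pickles without any issue.
data_ok = can_pickle([1, 2, 3])
```

A library that serializes the function's code and captured variables by value sidesteps this limitation.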

Re: pyspark serializer can't handle functions?

2014-06-16 Thread Matei Zaharia
is that I'm using alternating minimization, so I'll be minimizing over the rows and columns of this matrix at alternating steps; hence I need to store both the matrix and its transpose to avoid data thrashing. On Mon, Jun 16, 2014 at 11:05 AM, Matei Zaharia [via Apache Spark User List] [hidden

[jira] [Resolved] (SPARK-1837) NumericRange should be partitioned in the same way as other sequences

2014-06-14 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1837. -- Resolution: Fixed Fix Version/s: 1.1.0 NumericRange should be partitioned in the same

Re: Is shuffle stable?

2014-06-14 Thread Matei Zaharia
The order is not guaranteed actually, only which keys end up in each partition. Reducers may fetch data from map tasks in an arbitrary order, depending on which ones are available first. If you’d like a specific order, you should sort each partition. Here you might be getting it because each
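A toy model of this behaviour, in plain Python rather than Spark: it assumes non-negative integer keys and a modulo partitioner, and the function names are hypothetical. The same records arriving in two different orders land in the same partitions but in different positions; sorting each partition removes the dependence on arrival order.

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a partition, keeping arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[key % num_partitions].append((key, value))
    return partitions


def sort_each_partition(partitions):
    """Sort the records inside every partition independently."""
    return [sorted(part) for part in partitions]


# The same records, fetched in two different orders.
arrival_a = [(3, "c"), (1, "a"), (2, "b"), (4, "d")]
arrival_b = [(4, "d"), (2, "b"), (1, "a"), (3, "c")]

parts_a = hash_partition(arrival_a, 2)
parts_b = hash_partition(arrival_b, 2)
```

Here parts_a and parts_b hold the same keys per partition but in different orders, while sort_each_partition(parts_a) and sort_each_partition(parts_b) are identical.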

Re: guidance on simple unit testing with Spark

2014-06-13 Thread Matei Zaharia
You need to factor your program so that it’s not just a main(). This is not a Spark-specific issue, it’s about how you’d unit test any program in general. In this case, your main() creates a SparkContext, so you can’t pass one from outside, and your code has to read data from a file and write
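The refactoring suggested in this reply can be sketched without Spark at all. The example below is an analogy in plain Python, with hypothetical names (count_words, run, main): the logic takes its input and output as arguments, so a test can pass in-memory data, and main() only wires up the real files. In a Spark program the same shape would apply, with the context or dataset passed in from outside rather than created inside the function under test.

```python
import io


def count_words(lines):
    """Pure logic: count word occurrences in an iterable of lines."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def run(source, sink):
    """Read lines from source and write sorted 'word<TAB>count' rows to sink."""
    for word, n in sorted(count_words(source).items()):
        sink.write("%s\t%d\n" % (word, n))


def main(in_path, out_path):
    """Thin entry point: the only place that touches the file system."""
    with open(in_path) as src, open(out_path, "w") as dst:
        run(src, dst)


def demo_run():
    """Exercise run() with in-memory input and output, as a unit test would."""
    sink = io.StringIO()
    run(["x y x"], sink)
    return sink.getvalue()
```

A unit test then calls count_words or run directly and never needs to create input files or inspect output files.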

[jira] [Commented] (SPARK-889) Bring back DFS broadcast

2014-06-12 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030080#comment-14030080 ] Matei Zaharia commented on SPARK-889: - This is a really old JIRA and actually I

Fwd: ApacheCon CFP closes June 25

2014-06-12 Thread Matei Zaharia
(I’m forwarding this message on behalf of the ApacheCon organizers, who’d like to see involvement from every Apache project!) As you may be aware, ApacheCon will be held this year in Budapest, on November 17-23. (See http://apachecon.eu for more info.) The Call For Papers for that conference

Re: How to specify executor memory in EC2 ?

2014-06-12 Thread Matei Zaharia
and will post it if I find it :) Thank you, anyway On Wed, Jun 11, 2014 at 12:19 AM, Matei Zaharia matei.zaha...@gmail.com wrote: It might be that conf/spark-env.sh on EC2 is configured to set it to 512, and is overriding the application’s settings. Take a look in there and delete

[jira] [Created] (SPARK-2123) Basic pluggable interface for shuffle

2014-06-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2123: Summary: Basic pluggable interface for shuffle Key: SPARK-2123 URL: https://issues.apache.org/jira/browse/SPARK-2123 Project: Spark Issue Type: Sub-task

[jira] [Created] (SPARK-2125) Add sorting flag to ShuffleManager, and implement it in HashShuffleManager

2014-06-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2125: Summary: Add sorting flag to ShuffleManager, and implement it in HashShuffleManager Key: SPARK-2125 URL: https://issues.apache.org/jira/browse/SPARK-2125 Project

[jira] [Created] (SPARK-2124) Move aggregation into ShuffleManager implementations

2014-06-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2124: Summary: Move aggregation into ShuffleManager implementations Key: SPARK-2124 URL: https://issues.apache.org/jira/browse/SPARK-2124 Project: Spark Issue

[jira] [Updated] (SPARK-2124) Move aggregation into ShuffleManager implementations

2014-06-11 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2124: - Assignee: Saisai Shao Move aggregation into ShuffleManager implementations

[jira] [Resolved] (SPARK-2123) Basic pluggable interface for shuffle

2014-06-11 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2123. -- Resolution: Fixed Resolved in https://github.com/apache/spark/pull/1009 Basic pluggable

[jira] [Created] (SPARK-2126) Move MapOutputTracker behind ShuffleManager interface

2014-06-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2126: Summary: Move MapOutputTracker behind ShuffleManager interface Key: SPARK-2126 URL: https://issues.apache.org/jira/browse/SPARK-2126 Project: Spark Issue

Re: Compression with DISK_ONLY persistence

2014-06-11 Thread Matei Zaharia
Yes, actually even if you don’t set it to true, on-disk data is compressed. (This setting only affects serialized data in memory). Matei On Jun 11, 2014, at 2:56 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Hi, Will spark.rdd.compress=true enable compression when using

Re: Powered by Spark addition

2014-06-11 Thread Matei Zaharia
Alright, added you. Matei On Jun 11, 2014, at 1:28 PM, Derek Mansen de...@vistarmedia.com wrote: Hello, I was wondering if we could add our organization to the Powered by Spark page. The information is: Name: Vistar Media URL: www.vistarmedia.com Description: Location technology company

Re: When to use CombineByKey vs reduceByKey?

2014-06-11 Thread Matei Zaharia
combineByKey is designed for when your return type from the aggregation is different from the values being aggregated (e.g. you group together objects), and it should allow you to modify the leftmost argument of each function (mergeCombiners, mergeValue, etc) and return that instead of
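To make the distinction concrete, here is a plain-Python model of the combineByKey contract, not Spark's implementation; the helper names are hypothetical. The values are numbers but the combiner is a [sum, count] list, so the aggregate type differs from the value type, and each merge function mutates and returns its leftmost argument as the reply describes.

```python
def combine_by_key(pairs, create_combiner, merge_value):
    """Fold the (key, value) pairs of one partition into one combiner per key."""
    combiners = {}
    for key, value in pairs:
        if key in combiners:
            combiners[key] = merge_value(combiners[key], value)
        else:
            combiners[key] = create_combiner(value)
    return combiners


def merge_partitions(partials, merge_combiners):
    """Merge the per-partition combiners into one combiner per key."""
    merged = {}
    for partial in partials:
        for key, comb in partial.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], comb)
            else:
                merged[key] = comb
    return merged


def create(value):
    return [value, 1]            # combiner is [sum, count], not a number


def merge_value(acc, value):
    acc[0] += value              # mutate the leftmost argument in place
    acc[1] += 1
    return acc


def merge_combiners(left, right):
    left[0] += right[0]          # again, reuse the leftmost argument
    left[1] += right[1]
    return left


def average_by_key(partitions):
    """Compute a per-key average across a list of partitions."""
    partials = [combine_by_key(p, create, merge_value) for p in partitions]
    merged = merge_partitions(partials, merge_combiners)
    return {key: total / count for key, (total, count) in merged.items()}
```

With reduceByKey-style aggregation the result would have to be the same type as the values, so a per-key average would need the values to be mapped to (sum, count) pairs first.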

[jira] [Commented] (SPARK-1416) Add support for SequenceFiles and binary Hadoop InputFormats in PySpark

2014-06-10 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026701#comment-14026701 ] Matei Zaharia commented on SPARK-1416: -- That pull request also added generic

Re: How to specify executor memory in EC2 ?

2014-06-10 Thread Matei Zaharia
It might be that conf/spark-env.sh on EC2 is configured to set it to 512, and is overriding the application’s settings. Take a look in there and delete that line if possible. Matei On Jun 10, 2014, at 2:38 PM, Aliaksei Litouka aliaksei.lito...@gmail.com wrote: I am testing my application in

[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles

2014-06-09 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025580#comment-14025580 ] Matei Zaharia commented on SPARK-2044: -- Hey Weihua, I'll look into the sorting flag

[jira] [Resolved] (SPARK-1416) Add support for SequenceFiles in PySpark

2014-06-09 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1416. -- Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 Implemented

[jira] [Updated] (SPARK-1416) Add support for SequenceFiles in PySpark

2014-06-09 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1416: - Assignee: Nick Pentreath Add support for SequenceFiles in PySpark

Re: How to enable fault-tolerance?

2014-06-09 Thread Matei Zaharia
If this is a useful feature for local mode, we should open a JIRA to document the setting or improve it (I’d prefer to add a spark.local.retries property instead of a special URL format). We initially disabled it for everything except unit tests because 90% of the time an exception in local

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread Matei Zaharia
You currently can’t have multiple SparkContext objects in the same JVM, but within a SparkContext, all of the APIs are thread-safe so you can share that context between multiple threads. The other issue you’ll run into is that in each thread where you want to use Spark, you need to use

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread Matei Zaharia
. If we can disable the UI http Server; it would be much simpler to handle than having two http containers to deal with. Chester On Monday, June 9, 2014 4:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You currently can’t have multiple SparkContext objects in the same JVM

[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles

2014-06-08 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021115#comment-14021115 ] Matei Zaharia commented on SPARK-2044: -- Alright so I've posted my code at https

[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles

2014-06-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020087#comment-14020087 ] Matei Zaharia commented on SPARK-2044: -- {quote} 1. Is it a goal to support more kind

[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles

2014-06-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020329#comment-14020329 ] Matei Zaharia commented on SPARK-2044: -- So BTW I think what I'll do is move over

[jira] [Created] (SPARK-2032) Add an RDD.samplePartitions method for partition-level sampling

2014-06-05 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2032: Summary: Add an RDD.samplePartitions method for partition-level sampling Key: SPARK-2032 URL: https://issues.apache.org/jira/browse/SPARK-2032 Project: Spark

[jira] [Updated] (SPARK-2032) Add an RDD.samplePartitions method for partition-level sampling

2014-06-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2032: - Priority: Minor (was: Major) Add an RDD.samplePartitions method for partition-level sampling

[jira] [Created] (SPARK-2043) ExternalAppendOnlyMap doesn't always find matching keys

2014-06-05 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2043: Summary: ExternalAppendOnlyMap doesn't always find matching keys Key: SPARK-2043 URL: https://issues.apache.org/jira/browse/SPARK-2043 Project: Spark Issue

[jira] [Created] (SPARK-2045) Sort-based shuffle implementation

2014-06-05 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2045: Summary: Sort-based shuffle implementation Key: SPARK-2045 URL: https://issues.apache.org/jira/browse/SPARK-2045 Project: Spark Issue Type: New Feature

[jira] [Created] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator

2014-06-05 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2047: Summary: Use less memory in AppendOnlyMap.destructiveSortedIterator Key: SPARK-2047 URL: https://issues.apache.org/jira/browse/SPARK-2047 Project: Spark

[jira] [Updated] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator

2014-06-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2047: - Priority: Minor (was: Major) Use less memory in AppendOnlyMap.destructiveSortedIterator

[jira] [Updated] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator

2014-06-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2047: - Priority: Major (was: Minor) Use less memory in AppendOnlyMap.destructiveSortedIterator

[jira] [Commented] (SPARK-2043) ExternalAppendOnlyMap doesn't always find matching keys

2014-06-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019482#comment-14019482 ] Matei Zaharia commented on SPARK-2043: -- https://github.com/apache/spark/pull/986
