Re: How to compile the examples directory?

2014-05-19 Thread Matei Zaharia
If you’d like to work on just this code for your own changes, it might be best to copy it to a separate project. Look at http://spark.apache.org/docs/latest/quick-start.html for how to set up a standalone job. Matei On May 19, 2014, at 4:53 AM, Hao Wang wh.s...@gmail.com wrote: Hi, I am

Re: advice on maintaining a production spark cluster?

2014-05-19 Thread Matei Zaharia
Which version is this with? I haven’t seen standalone masters lose workers. Is there other stuff on the machines that’s killing them, or what errors do you see? Matei On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote: Hey folks, I'm wondering what strategies other folks

Re: life if an executor

2014-05-19 Thread Matei Zaharia
They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave? or are

[jira] [Created] (SPARK-1874) Clean up MLlib sample data

2014-05-18 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1874: Summary: Clean up MLlib sample data Key: SPARK-1874 URL: https://issues.apache.org/jira/browse/SPARK-1874 Project: Spark Issue Type: Bug

[jira] [Updated] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1

2014-05-18 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1875: - Fix Version/s: 1.0.0 NoClassDefFoundError: StringUtils when building against Hadoop 1

[jira] [Updated] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1

2014-05-18 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1875: - Priority: Blocker (was: Critical) NoClassDefFoundError: StringUtils when building against

[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1

2014-05-18 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001297#comment-14001297 ] Matei Zaharia commented on SPARK-1875: -- This may have been broken by https

Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-18 Thread Matei Zaharia
I took the always fun task of testing it on Windows, and unfortunately, I found some small problems with the prebuilt packages due to recent changes to the launch scripts: bin/spark-class2.cmd looks in ./jars instead of ./lib for the assembly JAR, and bin/run-example2.cmd doesn’t quite match

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-18 Thread Matei Zaharia
if it's a common approach to have discussions in JIRA not here. I don't think it's the ASF way. Regards, Jacek Laskowski http://blog.japila.pl On 17 May 2014 at 23:55, Matei Zaharia matei.zaha...@gmail.com wrote: We do actually have replicated StorageLevels in Spark. You can use

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-18 Thread Matei Zaharia
it was the easiest way for people to continue. Matei On May 18, 2014, at 4:01 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Ah, maybe it’s just different in other Apache projects. All the ones I’ve participated in have had their design discussions on JIRA. For example take a look at https

Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-18 Thread Matei Zaharia
on a different version of org.apache.commons than Hadoop 2, but it needs investigation. Tom, any thoughts on this? Matei On May 18, 2014, at 12:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I took the always fun task of testing it on Windows, and unfortunately, I found some small problems

[jira] [Updated] (SPARK-1145) Memory mapping with many small blocks can cause JVM allocation failures

2014-05-17 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1145: - Fix Version/s: 0.9.2 Memory mapping with many small blocks can cause JVM allocation failures

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Matei Zaharia
As others have said, the 1.0 milestone is about API stability, not about saying “we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner users can confidently build on Spark, knowing that the application they build today will still run on Spark 1.9.9 three years from now. This is

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-17 Thread Matei Zaharia
BTW for what it’s worth I agree this is a good option to add, the only tricky thing will be making sure the checkpoint blocks are not garbage-collected by the block store. I don’t think they will be though. Matei On May 17, 2014, at 2:20 PM, Matei Zaharia matei.zaha...@gmail.com wrote: We do

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-17 Thread Matei Zaharia
We do actually have replicated StorageLevels in Spark. You can use MEMORY_AND_DISK_2 or construct your own StorageLevel with your own custom replication factor. BTW you guys should probably have this discussion on the JIRA rather than the dev list; I think the replies somehow ended up on the

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Matei Zaharia
stability can be addressed in minor releases if found, but behavioral change and/or interface changes would be a much more invasive issue for our users. Regards Mridul On 18-May-2014 2:19 am, Matei Zaharia matei.zaha...@gmail.com wrote: As others have said, the 1.0 milestone is about API

[jira] [Created] (SPARK-1858) Update third-party Hadoop distros doc to list more distros

2014-05-16 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1858: Summary: Update third-party Hadoop distros doc to list more distros Key: SPARK-1858 URL: https://issues.apache.org/jira/browse/SPARK-1858 Project: Spark

[jira] [Updated] (SPARK-1775) Unneeded lock in ShuffleMapTask.deserializeInfo

2014-05-15 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1775: - Fix Version/s: 0.9.2 Unneeded lock in ShuffleMapTask.deserializeInfo

[jira] [Created] (SPARK-1770) repartition and coalesce(shuffle=true) put objects with the same key in the same bucket

2014-05-15 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1770: Summary: repartition and coalesce(shuffle=true) put objects with the same key in the same bucket Key: SPARK-1770 URL: https://issues.apache.org/jira/browse/SPARK-1770

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-15 Thread Matei Zaharia
SHA-1 is being end-of-lived so I’d actually say switch to 512 for all of them instead. On May 13, 2014, at 6:49 AM, Sean Owen so...@cloudera.com wrote: On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell pwend...@gmail.com wrote: The release files, including signatures, digests, etc. can be
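For readers who want to check a downloaded release file against a published SHA-512 digest, a minimal sketch follows. It uses only Python's standard `hashlib`; the function name `sha512_hex` and the chunk size are illustrative choices, not part of any Spark release tooling.

```python
import hashlib


def sha512_hex(path, chunk_size=1 << 20):
    """Return the SHA-512 digest of a file as a lowercase hex string."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in chunks so large release archives do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result can be compared, case-insensitively, with the hex string in the digest file that accompanies the release.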

Test

2014-05-15 Thread Matei Zaharia

Re: pySpark memory usage

2014-05-15 Thread Matei Zaharia
400 for the textFile()s, 1500 for the join()s. On Mon, May 12, 2014 at 7:58 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Jim, unfortunately external spilling is not implemented in Python right now. While it would be possible to update combineByKey to do smarter stuff here, one

[jira] [Updated] (SPARK-1770) repartition and coalesce(shuffle=true) put objects with the same key in the same bucket

2014-05-14 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1770: - Fix Version/s: 1.0.0 repartition and coalesce(shuffle=true) put objects with the same key

Re: Updating docs for running on Mesos

2014-05-13 Thread Matei Zaharia
I’ll ask the Mesos folks about this. Unfortunately it might be tough to link only to a company’s builds; but we can perhaps include them in addition to instructions for building Mesos from Apache. Matei On May 12, 2014, at 11:55 PM, Gerard Maas gerard.m...@gmail.com wrote: Andrew,

Re: Bug is KryoSerializer under Mesos [work-around included]

2014-05-12 Thread Matei Zaharia
Hey Soren, are you sure that the JAR you used on the executors is for the right version of Spark? Maybe they’re running an older version. The Kryo serializer should be initialized the same way on both. Matei On May 12, 2014, at 10:39 AM, Soren Macbeth so...@yieldbot.com wrote: I finally

Re: Kryo not default?

2014-05-12 Thread Matei Zaharia
It was just because it might not work with some user data types that are Serializable. But we should investigate it, as it’s the easiest thing one can enable to improve performance. Matei On May 12, 2014, at 2:47 PM, Anand Avati av...@gluster.org wrote: Hi, Can someone share the reason why

Re: Spark to utilize HDFS's mmap caching

2014-05-12 Thread Matei Zaharia
Yes, Spark goes through the standard HDFS client and will automatically benefit from this. Matei On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi chan...@gmail.com wrote: Hi all, Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via sc.textFile() and other HDFS-related APIs?

Re: Is their a way to Create SparkContext object?

2014-05-12 Thread Matei Zaharia
You can just pass it around as a parameter. On May 12, 2014, at 12:37 PM, yh18190 yh18...@gmail.com wrote: Hi, Could anyone suggest how we can create a SparkContext object in other classes or functions where we need to convert a Scala collection to an RDD using the sc object, like
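The advice above, sketched in Python rather than the Scala of the original question: create the context once and hand it to whatever needs it. `parallelize` is the real SparkContext method; the names `to_rdd` and `CollectionLoader` are hypothetical and exist only for illustration.

```python
def to_rdd(sc, items):
    """Turn a local collection into an RDD using the caller's SparkContext."""
    # The context is received as an argument instead of being created here.
    return sc.parallelize(list(items))


class CollectionLoader:
    """Hypothetical helper class that keeps a reference to an existing context."""

    def __init__(self, sc):
        self.sc = sc  # passed in by the code that created the context

    def load(self, items):
        return to_rdd(self.sc, items)
```

With a live context this would be called as `to_rdd(sc, [1, 2, 3])` or `CollectionLoader(sc).load(words)`. No second SparkContext is created anywhere.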

Re: pySpark memory usage

2014-05-12 Thread Matei Zaharia
at ~54GB. stats() returns (count: 56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min: 343) On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Okay, thanks. Do you have any info on how large your records and data file are? I'd like to reproduce and fix

Re: Spark on Scala 2.11

2014-05-11 Thread Matei Zaharia
We do want to support it eventually, possibly as early as Spark 1.1 (which we’d cross-build on Scala 2.10 and 2.11). If someone wants to look at it before, feel free to do so! Scala 2.11 is very close to 2.10 so I think things will mostly work, except for possibly the REPL (which has require

[jira] [Created] (SPARK-1775) Unneeded lock in ShuffleMapTask.deserializeInfo

2014-05-10 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1775: Summary: Unneeded lock in ShuffleMapTask.deserializeInfo Key: SPARK-1775 URL: https://issues.apache.org/jira/browse/SPARK-1775 Project: Spark Issue Type

[jira] [Updated] (SPARK-1775) Unneeded lock in ShuffleMapTask.deserializeInfo

2014-05-10 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1775: - Labels: Starter (was: ) Unneeded lock in ShuffleMapTask.deserializeInfo

[jira] [Updated] (SPARK-1775) Unneeded lock in ShuffleMapTask.deserializeInfo

2014-05-10 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1775: - Priority: Critical (was: Major) Unneeded lock in ShuffleMapTask.deserializeInfo

[jira] [Resolved] (SPARK-1732) Support for primitive nulls in SparkSQL

2014-05-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1732. -- Resolution: Fixed Support for primitive nulls in SparkSQL

[jira] [Updated] (SPARK-1736) spark-submit on Windows

2014-05-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1736: - Component/s: Windows spark-submit on Windows --- Key

[jira] [Created] (SPARK-1736) Update remaining Windows scripts

2014-05-06 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1736: Summary: Update remaining Windows scripts Key: SPARK-1736 URL: https://issues.apache.org/jira/browse/SPARK-1736 Project: Spark Issue Type: Improvement

[jira] [Updated] (SPARK-1736) spark-submit on Windows

2014-05-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1736: - Priority: Blocker (was: Critical) spark-submit on Windows

[jira] [Updated] (SPARK-1620) Uncaught exception from Akka scheduler

2014-05-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1620: - Assignee: Mark Hamstra Uncaught exception from Akka scheduler

Re: Increase Stack Size Workers

2014-05-06 Thread Matei Zaharia
Add export SPARK_JAVA_OPTS="-Xss16m" to conf/spark-env.sh. Then it should apply to the executor. Matei On May 5, 2014, at 2:20 PM, Andrea Esposito and1...@gmail.com wrote: Hi there, i'm doing an iterative algorithm and sometimes i ended up with StackOverflowError, doesn't matter if i do

Re: Spark and Java 8

2014-05-06 Thread Matei Zaharia
Java 8 support is a feature in Spark, but vendors need to decide for themselves when they’d like to support Java 8 commercially. You can still run Spark on Java 7 or 6 without taking advantage of the new features (indeed our builds are always against Java 6). Matei On May 6, 2014, at 8:59 AM,

Re: Spark GCE Script

2014-05-05 Thread Matei Zaharia
Very cool! Have you thought about sending this as a pull request? We’d be happy to maintain it inside Spark, though it might be interesting to find a single Python package that can manage clusters across both EC2 and GCE. Matei On May 5, 2014, at 7:18 AM, Akhil Das ak...@sigmoidanalytics.com

[jira] [Assigned] (SPARK-1709) spark-submit should use main class attribute of JAR if no --class is given

2014-05-04 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-1709: Assignee: Matei Zaharia (was: Sandeep Singh) spark-submit should use main class

[jira] [Commented] (SPARK-1709) spark-submit should use main class attribute of JAR if no --class is given

2014-05-04 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989181#comment-13989181 ] Matei Zaharia commented on SPARK-1709: -- Sorry Sandeep, I actually have a patch done

[jira] [Commented] (SPARK-1709) spark-submit should use main class attribute of JAR if no --class is given

2014-05-04 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989182#comment-13989182 ] Matei Zaharia commented on SPARK-1709: -- Should've assigned it to myself earlier

[jira] [Assigned] (SPARK-1549) Add python support to spark-submit script

2014-05-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-1549: Assignee: Matei Zaharia Add python support to spark-submit script

[jira] [Created] (SPARK-1709) spark-submit should use main class attribute of JAR if no --class is given

2014-05-03 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1709: Summary: spark-submit should use main class attribute of JAR if no --class is given Key: SPARK-1709 URL: https://issues.apache.org/jira/browse/SPARK-1709 Project

[jira] [Updated] (SPARK-1710) spark-submit should print better errors than InvocationTargetException

2014-05-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1710: - Affects Version/s: 1.0.0 spark-submit should print better errors than InvocationTargetException

Re: Mailing list

2014-05-03 Thread Matei Zaharia
Hi Nicolas, Good catches on these things. Your website seems a little bit incomplete. I have found this page [1] which lists the two main mailing lists, users and dev. But I see a reference to a mailing list about issues which tracks the Spark issues when it was hosted at Atlassian. I

Re: performance improvement on second operation...without caching?

2014-05-03 Thread Matei Zaharia
Hi Diana, Apart from these reasons, in a multi-stage job, Spark saves the map output files from map stages to the filesystem, so it only needs to rerun the last reduce stage. This is why you only saw one stage executing. These files are saved for fault recovery but they speed up subsequent

Re: performance improvement on second operation...without caching?

2014-05-03 Thread Matei Zaharia
-uses that? On Sat, May 3, 2014 at 8:29 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Diana, Apart from these reasons, in a multi-stage job, Spark saves the map output files from map stages to the filesystem, so it only needs to rerun the last reduce stage. This is why you only saw

[jira] [Resolved] (SPARK-615) Add mapPartitionsWithIndex() to the Java API

2014-04-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-615. - Resolution: Fixed Fix Version/s: 1.0.0 Add mapPartitionsWithIndex() to the Java API

[jira] [Resolved] (SPARK-1268) Adding XOR and AND-NOT operations to spark.util.collection.BitSet

2014-04-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1268. -- Resolution: Fixed Fix Version/s: 1.0.0 Adding XOR and AND-NOT operations

[jira] [Assigned] (SPARK-544) Provide a Configuration class in addition to system properties

2014-04-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-544: --- Assignee: Matei Zaharia (was: Evan Chan) Provide a Configuration class in addition

[jira] [Resolved] (SPARK-544) Provide a Configuration class in addition to system properties

2014-04-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-544. - Resolution: Fixed Fix Version/s: 0.9.0 Provide a Configuration class in addition

Re: Python Spark on YARN

2014-04-29 Thread Matei Zaharia
This will be possible in 1.0 after this pull request: https://github.com/apache/spark/pull/30 Matei On Apr 29, 2014, at 9:51 AM, Guanhua Yan gh...@lanl.gov wrote: Hi all: Is it possible to develop Spark programs in Python and run them on YARN? From the Python SparkContext class, it

Re: Running out of memory Naive Bayes

2014-04-28 Thread Matei Zaharia
Not sure if this is always ideal for Naive Bayes, but you could also hash the features into a lower-dimensional space (e.g. reduce it to 50,000 features). For each feature simply take MurmurHash3(featureID) % 50000 for example. Matei On Apr 27, 2014, at 11:24 PM, DB Tsai dbt...@stanford.edu
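A minimal sketch of the feature-hashing idea in this reply, in plain Python. MurmurHash3 is not in the Python standard library, so `zlib.crc32` is swapped in as the hash function here; the bucket count of 50,000 comes from the message, while the function names and the choice to sum colliding values are assumptions made for illustration.

```python
import zlib

NUM_BUCKETS = 50000  # target dimensionality mentioned in the message


def hash_feature(feature_id, num_buckets=NUM_BUCKETS):
    """Map a feature ID to a bucket index in [0, num_buckets)."""
    # zlib.crc32 stands in for MurmurHash3, which the standard library lacks.
    return zlib.crc32(str(feature_id).encode("utf-8")) % num_buckets


def hash_features(features, num_buckets=NUM_BUCKETS):
    """Fold a sparse {feature_id: value} dict into at most num_buckets entries."""
    hashed = {}
    for feature_id, value in features.items():
        bucket = hash_feature(feature_id, num_buckets)
        # Features that collide in one bucket have their values summed.
        hashed[bucket] = hashed.get(bucket, 0.0) + value
    return hashed
```

Summing on collision keeps total counts intact, which suits count-based models such as Naive Bayes; other collision policies are possible.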

Re: K-means with large K

2014-04-28 Thread Matei Zaharia
Try turning on the Kryo serializer as described at http://spark.apache.org/docs/latest/tuning.html. Also, are there any exceptions in the driver program’s log before this happens? Matei On Apr 28, 2014, at 9:19 AM, Buttler, David buttl...@llnl.gov wrote: Hi, I am trying to run the K-means

Re: processing s3n:// files in parallel

2014-04-28 Thread Matei Zaharia
Actually wildcards work too, e.g. s3n://bucket/file1*, and I believe so do comma-separated lists (e.g. s3n://file1,s3n://file2). These are all inherited from FileInputFormat in Hadoop. Matei On Apr 28, 2014, at 6:05 PM, Andrew Ash and...@andrewash.com wrote: This is already possible with the

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Matei Zaharia
From my point of view, both are supported equally. The YARN support is newer and that’s why there’s been a lot more action there in recent months. Matei On Apr 27, 2014, at 12:08 PM, Andrew Ash and...@andrewash.com wrote: That thread was mostly about benchmarking YARN vs standalone, and the

Re: Running a spark-submit compatible app in spark-shell

2014-04-27 Thread Matei Zaharia
Hi Roger, You should be able to use the --jars argument of spark-shell to add JARs onto the classpath and then work with those classes in the shell. (A recent patch, https://github.com/apache/spark/pull/542, made spark-shell use the same command-line arguments as spark-submit). But this is a

[jira] [Updated] (SPARK-1242) Add aggregate to python API

2014-04-25 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1242: - Assignee: Holden Karau Add aggregate to python API

[jira] [Resolved] (SPARK-1607) Remove use of octal literals, deprecated in Scala 2.10 / removed in 2.11

2014-04-25 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1607. -- Resolution: Fixed Fix Version/s: 1.0.0 Remove use of octal literals, deprecated

[jira] [Resolved] (SPARK-1621) Update Chill to 0.3.6

2014-04-25 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1621. -- Resolution: Fixed Update Chill to 0.3.6 - Key: SPARK

[jira] [Created] (SPARK-1637) Clean up examples for 1.0

2014-04-25 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1637: Summary: Clean up examples for 1.0 Key: SPARK-1637 URL: https://issues.apache.org/jira/browse/SPARK-1637 Project: Spark Issue Type: Improvement

[jira] [Updated] (SPARK-1637) Clean up examples for 1.0

2014-04-25 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1637: - Description: - Move all of them into subpackages of org.apache.spark.examples (right now some

[jira] [Updated] (SPARK-1235) DAGScheduler ignores exceptions thrown in handleTaskCompletion

2014-04-25 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1235: - Affects Version/s: (was: 1.0.0) DAGScheduler ignores exceptions thrown

[jira] [Resolved] (SPARK-1235) DAGScheduler ignores exceptions thrown in handleTaskCompletion

2014-04-25 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1235. -- Resolution: Fixed Fix Version/s: 1.0.0 Resolved in https://github.com/apache/spark/pull

[jira] [Resolved] (SPARK-1540) Investigate whether we should require keys in PairRDD to be Comparable

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1540. -- Resolution: Fixed Resolved here: https://github.com/apache/spark/pull/487/files. We were able

[jira] [Commented] (SPARK-1540) Investigate whether we should require keys in PairRDD to be Comparable

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979393#comment-13979393 ] Matei Zaharia commented on SPARK-1540: -- Note that it will remain to add

[jira] [Updated] (SPARK-1548) Add Partial Random Forest algorithm to MLlib

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1548: - Assignee: Jason Day Add Partial Random Forest algorithm to MLlib

[jira] [Updated] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-928: Priority: Major (was: Minor) Add support for Unsafe-based serializer in Kryo 2.22

[jira] [Updated] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-928: Priority: Minor (was: Major) Add support for Unsafe-based serializer in Kryo 2.22

[jira] [Assigned] (SPARK-1621) Update Chill to 0.3.6

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-1621: Assignee: Matei Zaharia Update Chill to 0.3.6

[jira] [Created] (SPARK-1621) Update Chill to 0.3.6

2014-04-24 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1621: Summary: Update Chill to 0.3.6 Key: SPARK-1621 URL: https://issues.apache.org/jira/browse/SPARK-1621 Project: Spark Issue Type: Improvement

[jira] [Commented] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13980471#comment-13980471 ] Matei Zaharia commented on SPARK-928: - This probably can't be fixed in 1.0.0 because

[jira] [Updated] (SPARK-1438) Update RDD.sample() API to make seed parameter optional

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1438: - Assignee: Arun Ramakrishnan Update RDD.sample() API to make seed parameter optional

[jira] [Resolved] (SPARK-1438) Update RDD.sample() API to make seed parameter optional

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1438. -- Resolution: Fixed Update RDD.sample() API to make seed parameter optional

[jira] [Resolved] (SPARK-986) Add job cancellation to PySpark

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-986. - Resolution: Fixed Add job cancellation to PySpark

[jira] [Updated] (SPARK-986) Add job cancellation to PySpark

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-986: Affects Version/s: (was: 0.9.0) Add job cancellation to PySpark

[jira] [Updated] (SPARK-986) Add job cancellation to PySpark

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-986: Fix Version/s: 1.0.0 Add job cancellation to PySpark

[jira] [Updated] (SPARK-986) Add job cancellation to PySpark

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-986: Assignee: Ahir Reddy Add job cancellation to PySpark

[jira] [Resolved] (SPARK-1586) Fix issues with spark development under windows

2014-04-24 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1586. -- Resolution: Fixed Fix Version/s: 1.0.0 Fix issues with spark development under windows

Re: Deploying a python code on a spark EC2 cluster

2014-04-24 Thread Matei Zaharia
Did you launch this using our EC2 scripts (http://spark.apache.org/docs/latest/ec2-scripts.html) or did you manually set up the daemons? My guess is that their hostnames are not being resolved properly on all nodes, so executor processes can’t connect back to your driver app. This error

Re: SparkPi performance-3 cluster standalone mode

2014-04-24 Thread Matei Zaharia
The problem is that SparkPi uses Math.random(), which is a synchronized method, so it can’t scale to multiple cores. In fact it will be slower on multiple cores due to lock contention. Try another example and you’ll see better scaling. I think we’ll have to update SparkPi to create a new Random
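The fix hinted at here, giving each task its own random generator instead of a shared synchronized one, can be sketched as follows. This is a plain-Python illustration of the structure, not the SparkPi source; the function names and the use of the task index as the seed are assumptions, and the tasks run sequentially here rather than on a cluster.

```python
import random


def count_hits(num_samples, seed):
    """Count random points in the square [-1, 1] x [-1, 1] that land inside the unit circle."""
    rng = random.Random(seed)  # private generator, never shared between tasks
    hits = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_tasks, samples_per_task):
    """Combine per-task hit counts into a Monte Carlo estimate of pi."""
    # Each task index doubles as the seed for that task's own generator.
    hits = sum(count_hits(samples_per_task, seed) for seed in range(num_tasks))
    return 4.0 * hits / (num_tasks * samples_per_task)
```

Because no generator is shared, tasks never wait on a common lock, which is the property the reply says the original example lacks.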

Re: Finding bad data

2014-04-24 Thread Matei Zaharia
Hey Jim, this is unfortunately harder than I’d like right now, but here’s how to do it. Look at the stderr file of the executor on that machine, and you’ll see lines like this: 14/04/24 19:17:24 INFO HadoopRDD: Input split: file:/Users/matei/workspace/apache-spark/README.md:0+2000 This says
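To pull the file and byte range out of such log lines automatically, a small parser may help. The sketch below assumes the trailing `start+length` form seen in the example line (which matches how Hadoop prints a file split, as far as I know); the regular expression and the function name `parse_input_split` are illustrative and not part of Spark.

```python
import re

# Matches "... Input split: <path>:<start>+<length>" at the end of a log line.
SPLIT_RE = re.compile(r"Input split: (?P<path>.+):(?P<start>\d+)\+(?P<length>\d+)\s*$")


def parse_input_split(line):
    """Return (path, start_offset, length) for an 'Input split' log line, or None."""
    match = SPLIT_RE.search(line)
    if match is None:
        return None
    return match.group("path"), int(match.group("start")), int(match.group("length"))
```

Running this over an executor's stderr file gives the list of splits that executor read, which narrows down where a bad record can be.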

Re: parallelize for a large Seq is extreamly slow.

2014-04-24 Thread Matei Zaharia
Try setting the serializer to org.apache.spark.serializer.KryoSerializer (see http://spark.apache.org/docs/0.9.1/tuning.html), it should be considerably faster. Matei On Apr 24, 2014, at 8:01 PM, Earthson Lu earthson...@gmail.com wrote:

Re: Jekyll documentation generation error

2014-04-23 Thread Matei Zaharia
Try doing “gem install kramdown”. The maruku gem for Markdown throws these errors, but Kramdown doesn’t. Matei On Apr 22, 2014, at 11:31 PM, DB Tsai dbt...@dbtsai.com wrote: This is the trace. Conversion error: There was an error converting 'docs/cluster-overview.md '.

Re: error in mllib lr example code

2014-04-23 Thread Matei Zaharia
See http://people.csail.mit.edu/matei/spark-unified-docs/ for a more recent build of the docs; if you spot any problems in those, let us know. Matei On Apr 23, 2014, at 9:49 AM, Xiangrui Meng men...@gmail.com wrote: The doc is for 0.9.1. You are running a later snapshot, which added sparse

Re: How do I access the SPARK SQL

2014-04-23 Thread Matei Zaharia
It’s currently in the master branch, on https://github.com/apache/spark. You can check that out from git, build it with sbt/sbt assembly, and then try it out. We’re also going to post some release candidates soon that will be pre-built. Matei On Apr 23, 2014, at 1:30 PM, diplomatic Guru

[jira] [Created] (SPARK-1563) Add package-info.java files for all packages that appear in Javadoc

2014-04-22 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1563: Summary: Add package-info.java files for all packages that appear in Javadoc Key: SPARK-1563 URL: https://issues.apache.org/jira/browse/SPARK-1563 Project: Spark

[jira] [Created] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges

2014-04-22 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1564: Summary: Add JavaScript into Javadoc to turn ::Experimental:: and such into badges Key: SPARK-1564 URL: https://issues.apache.org/jira/browse/SPARK-1564 Project

[jira] [Created] (SPARK-1567) Add language tabs to quick start guide

2014-04-22 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1567: Summary: Add language tabs to quick start guide Key: SPARK-1567 URL: https://issues.apache.org/jira/browse/SPARK-1567 Project: Spark Issue Type: Sub-task

[jira] [Created] (SPARK-1566) Consolidate the Spark Programming Guide with tabs for all languages

2014-04-22 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1566: Summary: Consolidate the Spark Programming Guide with tabs for all languages Key: SPARK-1566 URL: https://issues.apache.org/jira/browse/SPARK-1566 Project: Spark

[jira] [Updated] (SPARK-1563) Add package-info.java and package.scala files for all packages that appear in docs

2014-04-22 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1563: - Summary: Add package-info.java and package.scala files for all packages that appear in docs

[jira] [Assigned] (SPARK-1439) Aggregate Scaladocs across projects

2014-04-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-1439: Assignee: Matei Zaharia Aggregate Scaladocs across projects

[jira] [Assigned] (SPARK-1440) Generate JavaDoc instead of ScalaDoc for Java API

2014-04-21 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-1440: Assignee: Matei Zaharia Generate JavaDoc instead of ScalaDoc for Java API

[jira] [Created] (SPARK-1554) Update doc overview page to not mention building if you get a pre-built distro

2014-04-21 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1554: Summary: Update doc overview page to not mention building if you get a pre-built distro Key: SPARK-1554 URL: https://issues.apache.org/jira/browse/SPARK-1554 Project

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Matei Zaharia
The wiki is actually maintained separately in https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We restricted editing of the wiki because bots would automatically add stuff. I’ve given you permissions now. Matei On Apr 21, 2014, at 6:22 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

Re: Spark Streaming source from Amazon Kinesis

2014-04-21 Thread Matei Zaharia
There was a patch posted a few weeks ago (https://github.com/apache/spark/pull/223), but it needs a few changes in packaging because it uses a license that isn’t fully compatible with Apache. I’d like to get this merged when the changes are made though — it would be a good input source to

[jira] [Commented] (SPARK-1439) Aggregate Scaladocs across projects

2014-04-20 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975226#comment-13975226 ] Matei Zaharia commented on SPARK-1439: -- Thanks for looking into this, Sean. Instead
