Re: Make off-heap store pluggable

2015-07-21 Thread Matei Zaharia
I agree with this -- basically, to build on Reynold's point, you should be able to get almost the same performance by implementing either the Hadoop FileSystem API or the Spark Data Source API over Ignite in the right way. This would let people save data persistently in Ignite in addition to

Re: work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Matei Zaharia
This means that one of your cached RDD partitions is bigger than 2 GB of data. You can fix it by having more partitions. If you read data from a file system like HDFS or S3, set the number of partitions higher in the sc.textFile, hadoopFile, etc methods (it's an optional second parameter to
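
The 2 GB cap exists because a cached partition is backed by a single buffer indexed by a Java int. A back-of-the-envelope helper for picking a partition count (the function name and its headroom factor are illustrative, not a Spark API):

```python
def min_safe_partitions(total_bytes, limit_bytes=2 * 1024**3, headroom=4):
    """Smallest partition count that keeps each partition well under the
    2 GB (Integer.MAX_VALUE) buffer limit, with a safety factor."""
    target = limit_bytes // headroom          # aim for ~512 MB per partition
    return max(1, -(-total_bytes // target))  # ceiling division

# e.g. a 100 GB input should be split into at least 200 partitions
parts = min_safe_partitions(100 * 1024**3)
```

The result can then be passed as the optional partitions argument mentioned above, e.g. sc.textFile(path, parts).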

Re: how can I write a language wrapper?

2015-06-23 Thread Matei Zaharia
Just FYI, it would be easiest to follow SparkR's example and add the DataFrame API first. Other APIs will be designed to work on DataFrames (most notably machine learning pipelines), and the surface of this API is much smaller than of the RDD API. This API will also give you great performance

Re: Spark or Storm

2015-06-17 Thread Matei Zaharia
This documentation is only for writes to an external system, but all the counting you do within your streaming app (e.g. if you use reduceByKeyAndWindow to keep track of a running count) is exactly-once. When you write to a storage system, no matter which streaming framework you use, you'll

Re: Spark or Storm

2015-06-17 Thread Matei Zaharia
[4,5,6] can be invoked before the operation for offset [1,2,3] 2) If you wanted to achieve something similar to what TridentState does, you'll have to do it yourself (for example using Zookeeper) Is this a correct understanding? On Wed, Jun 17, 2015 at 7:14 PM, Matei Zaharia matei.zaha

Welcoming some new committers

2015-06-17 Thread Matei Zaharia
Hey all, Over the past 1.5 months we added a number of new committers to the project, and I wanted to welcome them now that all of their respective forms, accounts, etc are in. Join me in welcoming the following new committers: - Davies Liu - DB Tsai - Kousuke Saruta - Sandy Ryza - Yin Huai

Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?

2015-06-12 Thread Matei Zaharia
I don't like the idea of removing Hadoop 1 unless it becomes a significant maintenance burden, which I don't think it is. You'll always be surprised how many people use old software, even though various companies may no longer support them. With Hadoop 2 in particular, I may be misremembering,

[jira] [Updated] (SPARK-8110) DAG visualizations sometimes look weird in Python

2015-06-04 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-8110: - Attachment: Screen Shot 2015-06-04 at 1.51.32 PM.png Screen Shot 2015-06-04

[jira] [Created] (SPARK-8110) DAG visualizations sometimes look weird in Python

2015-06-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-8110: Summary: DAG visualizations sometimes look weird in Python Key: SPARK-8110 URL: https://issues.apache.org/jira/browse/SPARK-8110 Project: Spark Issue Type

Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-04 Thread Matei Zaharia
+1 Tested on Mac OS X On Jun 4, 2015, at 1:09 PM, Patrick Wendell pwend...@gmail.com wrote: I will give +1 as well. On Wed, Jun 3, 2015 at 11:59 PM, Reynold Xin r...@databricks.com wrote: Let me give you the 1st +1 On Tue, Jun 2, 2015 at 10:47 PM, Patrick Wendell

Re: Equivalent to Storm's 'field grouping' in Spark.

2015-06-03 Thread Matei Zaharia
This happens automatically when you use the byKey operations, e.g. reduceByKey, updateStateByKey, etc. Spark Streaming keeps the state for a given set of keys on a specific node and sends new tuples with that key to that node. Matei On Jun 3, 2015, at 6:31 AM, allonsy luke1...@gmail.com wrote:
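
Under the hood the byKey operations get this behavior from hash partitioning; a minimal sketch of the idea (not Spark's actual HashPartitioner source):

```python
def partition_for(key, num_partitions):
    # Same scheme as a hash partitioner: hash of the key modulo the
    # partition count, so a given key always maps to the same partition.
    return hash(key) % num_partitions

# Every tuple with key "user42" lands in the same partition, which is why
# per-key streaming state can live on a single node.
p = partition_for("user42", 8)
assert all(partition_for("user42", 8) == p for _ in range(100))
```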

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
:-UseCompressedOops SPARK_DRIVER_MEMORY=129G spark version: 1.1.1 Thank you a lot for your help! 2015-06-02 4:40 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com mailto:matei.zaha...@gmail.com: As long as you don't use cache(), these operations will go from disk to disk, and will only use

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
? Thank you! 2015-06-02 21:25 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com mailto:matei.zaha...@gmail.com: You shouldn't have to persist the RDD at all, just call flatMap and reduce on it directly. If you try to persist it, that will try to load the original data into memory, but here

Re: Representing a recursive data type in Spark SQL

2015-05-28 Thread Matei Zaharia
Your best bet might be to use a mapstring,string in SQL and make the keys be longer paths (e.g. params_param1 and params_param2). I don't think you can have a map in some of them but not in others. Matei On May 28, 2015, at 3:48 PM, Jeremy Lucas jeremyalu...@gmail.com wrote: Hey Reynold,
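
The flattening workaround above can be sketched as follows; the helper name and separator are illustrative:

```python
def flatten(record, prefix="", sep="_"):
    """Flatten nested dicts into a single flat map<string,string>-style dict,
    e.g. {"params": {"param1": "a"}} -> {"params_param1": "a"}."""
    out = {}
    for k, v in record.items():
        key = f"{prefix}{sep}{k}" if prefix else k
        if isinstance(v, dict):
            out.update(flatten(v, key, sep))  # recurse with the longer path
        else:
            out[key] = v
    return out

flat = flatten({"params": {"param1": "a", "param2": "b"}, "id": "1"})
# -> {"params_param1": "a", "params_param2": "b", "id": "1"}
```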

Re: Spark logo license

2015-05-19 Thread Matei Zaharia
Check out Apache's trademark guidelines here: http://www.apache.org/foundation/marks/ Matei On May 20, 2015, at 12:02 AM, Justin Pihony justin.pih...@gmail.com wrote: What is the license on using the spark logo. Is it free to be used for displaying

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
Hey Tom, Are you using the fine-grained or coarse-grained scheduler? For the coarse-grained scheduler, there is a spark.cores.max config setting that will limit the total # of cores it grabs. This was there in earlier versions too. Matei On May 19, 2015, at 12:39 PM, Thomas Dudziak
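
The setting referred to above goes in conf/spark-defaults.conf (or on a SparkConf); the value here is illustrative:

```
# conf/spark-defaults.conf -- cap the total cores a coarse-grained
# Mesos application will acquire across the cluster (value illustrative)
spark.cores.max    48
```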

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
of tasks per job :) cheers, Tom On Tue, May 19, 2015 at 10:05 AM, Matei Zaharia matei.zaha...@gmail.com mailto:matei.zaha...@gmail.com wrote: Hey Tom, Are you using the fine-grained or coarse-grained scheduler? For the coarse-grained scheduler, there is a spark.cores.max config

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
...This is madness! On May 14, 2015, at 9:31 AM, dmoralesdf dmora...@stratio.com wrote: Hi there, We have released our real-time aggregation engine based on Spark Streaming. SPARKTA is fully open source (Apache2) You can checkout the slides showed up at the Strata past week:

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
(Sorry, for non-English people: that means it's a good thing.) Matei On May 14, 2015, at 10:53 AM, Matei Zaharia matei.zaha...@gmail.com wrote: ...This is madness! On May 14, 2015, at 9:31 AM, dmoralesdf dmora...@stratio.com wrote: Hi there, We have released our real-time

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-12 Thread Matei Zaharia
It could also be that your hash function is expensive. What is the key class you have for the reduceByKey / groupByKey? Matei On May 12, 2015, at 10:08 AM, Night Wolf nightwolf...@gmail.com wrote: I'm seeing a similar thing with a slightly different stack trace. Ideas?

[jira] [Resolved] (SPARK-7298) Harmonize style of new UI visualizations

2015-05-08 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-7298. -- Resolution: Fixed Fix Version/s: 1.4.0 Harmonize style of new UI visualizations

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Matei Zaharia
We should make sure to update our docs to mention s3a as well, since many people won't look at Hadoop's docs for this. Matei On May 7, 2015, at 12:57 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Ah, thanks for the pointers. So as far as Spark is concerned, is this a breaking

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Matei Zaharia
I don't know whether this is common, but we might also allow another separator for JSON objects, such as two blank lines. Matei On May 4, 2015, at 2:28 PM, Reynold Xin r...@databricks.com wrote: Joe - I think that's a legit and useful thing to do. Do you want to give it a shot? On Mon,
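
A blank-line separator like the one proposed could be handled along these lines (a sketch, not Spark SQL's actual JSON reader; it assumes no blank lines occur inside string values):

```python
import json

def parse_records(text, sep="\n\n"):
    """Parse multi-line JSON objects separated by blank lines."""
    return [json.loads(chunk) for chunk in text.split(sep) if chunk.strip()]

docs = parse_records('{\n  "a": 1\n}\n\n{\n  "a": 2\n}')
# -> [{'a': 1}, {'a': 2}]
```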

[jira] [Commented] (SPARK-7261) Change default log level to WARN in the REPL

2015-04-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520366#comment-14520366 ] Matei Zaharia commented on SPARK-7261: -- IMO we can do this even without SPARK-7260

Re: Spark on Windows

2015-04-16 Thread Matei Zaharia
You could build Spark with Scala 2.11 on Mac / Linux and transfer it over to Windows. AFAIK it should build on Windows too, the only problem is that Maven might take a long time to download dependencies. What errors are you seeing? Matei On Apr 16, 2015, at 9:23 AM, Arun Lists

Re: Dataset announcement

2015-04-15 Thread Matei Zaharia
Very neat, Olivier; thanks for sharing this. Matei On Apr 15, 2015, at 5:58 PM, Olivier Chapelle oliv...@chapelle.cc wrote: Dear Spark users, I would like to draw your attention to a dataset that we recently released, which is as of now the largest machine learning dataset ever released;

Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-08 Thread Matei Zaharia
+1. Tested on Mac OS X and verified that some of the bugs were fixed. Matei On Apr 8, 2015, at 7:13 AM, Sean Owen so...@cloudera.com wrote: Still a +1 from me; same result (except that now of course the UISeleniumSuite test does not fail) On Wed, Apr 8, 2015 at 1:46 AM, Patrick Wendell

[jira] [Created] (SPARK-6778) SQL contexts in spark-shell and pyspark should both be called sqlContext

2015-04-08 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-6778: Summary: SQL contexts in spark-shell and pyspark should both be called sqlContext Key: SPARK-6778 URL: https://issues.apache.org/jira/browse/SPARK-6778 Project

Re: Contributor CLAs

2015-04-07 Thread Matei Zaharia
You do actually sign a CLA when you become a committer, and in general, we should ask for CLAs from anyone who contributes a large piece of code. This is the individual CLA: https://www.apache.org/licenses/icla.txt. Some people have sent them proactively because their employer asks them to.

[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391456#comment-14391456 ] Matei Zaharia commented on SPARK-6646: -- Not to rain on the parade here, but I worry

Re: Experience using binary packages on various Hadoop distros

2015-03-24 Thread Matei Zaharia
Just a note, one challenge with the BYOH version might be that users who download that can't run in local mode without also having Hadoop. But if we describe it correctly then hopefully it's okay. Matei On Mar 24, 2015, at 3:05 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, For

Re: IPyhon notebook command for spark need to be updated?

2015-03-20 Thread Matei Zaharia
Feel free to send a pull request to fix the doc (or say which versions it's needed in). Matei On Mar 20, 2015, at 6:49 PM, Krishna Sankar ksanka...@gmail.com wrote: Yep the command-option is gone. No big deal, just add the '%pylab inline' command as part of your notebook. Cheers k/

Re: Querying JSON in Spark SQL

2015-03-16 Thread Matei Zaharia
The programming guide has a short example: http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets. Note that once you infer a schema for a JSON dataset, you can also use nested path notation
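
The nested path notation mentioned above resolves dotted field names against the inferred structure; in plain Python terms (helper name illustrative):

```python
import json

def select_path(obj, path):
    """Resolve a dotted path like 'address.city' against a parsed JSON object."""
    for part in path.split("."):
        obj = obj[part]  # descend one level per path component
    return obj

row = json.loads('{"name": "x", "address": {"city": "SF"}}')
select_path(row, "address.city")  # -> 'SF'
```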

[jira] [Commented] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges

2015-03-12 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359017#comment-14359017 ] Matei Zaharia commented on SPARK-1564: -- This is still a valid issue AFAIK, isn't

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-08 Thread Matei Zaharia
+1 Tested it on Mac OS X. One small issue I noticed is that the Scala 2.11 build is using Hadoop 1 without Hive, which is kind of weird because people will more likely want Hadoop 2 with Hive. So it would be good to publish a build for that configuration instead. We can do it if we do a new

Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-08 Thread Matei Zaharia
Hadoop-provided releases can help. It might kill several birds with one stone. On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Our goal is to let people use the latest Apache release even if vendors fall behind or don't want to package everything, so that's

Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-08 Thread Matei Zaharia
for the 2.10 build too. Pros and cons discussed more at https://issues.apache.org/jira/browse/SPARK-5134 https://github.com/apache/spark/pull/3917 On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Tested it on Mac OS X. One small issue I noticed is that the Scala

Re: Berlin Apache Spark Meetup

2015-02-17 Thread Matei Zaharia
Thanks! I've added you. Matei On Feb 17, 2015, at 4:06 PM, Ralph Bergmann | the4thFloor.eu ra...@the4thfloor.eu wrote: Hi, there is a small Spark Meetup group in Berlin, Germany :-) http://www.meetup.com/Berlin-Apache-Spark-Meetup/ Please add this group to the Meetups list at

Re: renaming SchemaRDD - DataFrame

2015-02-10 Thread Matei Zaharia
of fact most were for it). We can still change it if somebody lays out a strong argument. On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia matei.zaha...@gmail.com wrote: The type alias means your methods can specify either type and they will work. It's just another name

Re: Powered by Spark: Concur

2015-02-09 Thread Matei Zaharia
Thanks Denny; added you. Matei On Feb 9, 2015, at 10:11 PM, Denny Lee denny.g@gmail.com wrote: Forgot to add Concur to the Powered by Spark wiki: Concur https://www.concur.com Spark SQL, MLLib Using Spark for travel and expenses analytics and personalization Thanks! Denny

Re: [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-08 Thread Matei Zaharia
+1 Tested on Mac OS X. Matei On Feb 2, 2015, at 8:57 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.1! The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):

[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark

2015-02-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309782#comment-14309782 ] Matei Zaharia commented on SPARK-5654: -- Yup, there's a tradeoff, but given

Re: Beginner in Spark

2015-02-06 Thread Matei Zaharia
You don't need HDFS or virtual machines to run Spark. You can just download it, unzip it and run it on your laptop. See http://spark.apache.org/docs/latest/index.html. Matei On Feb 6, 2015, at 2:58 PM, David Fallside falls...@us.ibm.com wrote:

[jira] [Resolved] (SPARK-5608) Improve SEO of Spark documentation site to let Google find latest docs

2015-02-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-5608. -- Resolution: Fixed Fix Version/s: 1.3.0 Improve SEO of Spark documentation site to let

Welcoming three new committers

2015-02-03 Thread Matei Zaharia
Hi all, The PMC recently voted to add three new committers: Cheng Lian, Joseph Bradley and Sean Owen. All three have been major contributors to Spark in the past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many pieces throughout Spark Core. Join me in welcoming them as

Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-31 Thread Matei Zaharia
This looks like a pretty serious problem, thanks! Glad people are testing on Windows. Matei On Jan 31, 2015, at 11:57 AM, MartinWeindel martin.wein...@gmail.com wrote: FYI: Spark 1.2.1rc2 does not work on Windows! On creating a Spark context you get following log output on my Windows

Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Matei Zaharia
a package name for it that omits sql. I would also be in favor of adding a separate Spark Schema module for Spark SQL to rely on, but I imagine that might be too large a change at this point? -Sandy On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: (Actually

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Matei Zaharia
I believe this is needed for driver recovery in Spark Streaming. If your Spark driver program crashes, Spark Streaming can recover the application by reading the set of DStreams and output operations from a checkpoint file (see

Re: renaming SchemaRDD - DataFrame

2015-01-26 Thread Matei Zaharia
(Actually when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.) Matei On Jan 26, 2015, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: While it might be possible

Re: renaming SchemaRDD - DataFrame

2015-01-26 Thread Matei Zaharia
While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL though is to be more than a SQL server -- it's meant

Re: Spark performance gains for small queries

2015-01-23 Thread Matei Zaharia
It's hard to tell without more details, but the start-up latency in Hive can sometimes be high, especially if you are running Hive on MapReduce. MR just takes 20-30 seconds per job to spin up even if the job is doing nothing. For real use of Spark SQL for short queries by the way, I'd recommend

Re: Semantics of LGTM

2015-01-17 Thread Matei Zaharia
+1 on this. On Jan 17, 2015, at 6:16 PM, Reza Zadeh r...@databricks.com wrote: LGTM On Sat, Jan 17, 2015 at 5:40 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Just wanted to ping about a minor issue - but one that ends up having consequence given Spark's volume of reviews

Re: Spark UI and Spark Version on Google Compute Engine

2015-01-17 Thread Matei Zaharia
Unfortunately we don't have anything to do with Spark on GCE, so I'd suggest asking in the GCE support forum. You could also try to launch a Spark cluster by hand on nodes in there. Sigmoid Analytics published a package for this here: http://spark-packages.org/package/9 Matei On Jan 17,

Re: spark 1.2 compatibility

2015-01-16 Thread Matei Zaharia
The Apache Spark project should work with it, but I'm not sure you can get support from HDP (if you have that). Matei On Jan 16, 2015, at 5:36 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Should clarify on this. I personally have used HDP 2.1 + Spark 1.2 and have not seen a

Re: SciSpark: NASA AIST14 proposal

2015-01-14 Thread Matei Zaharia
Yeah, very cool! You may also want to check out https://issues.apache.org/jira/browse/SPARK-5097 as something to build upon for these operations. Matei On Jan 14, 2015, at 6:18 PM, Reynold Xin r...@databricks.com wrote: Chris, This is really cool. Congratulations and thanks for sharing

[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly on mesos

2015-01-13 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-5088: - Fix Version/s: (was: 1.2.1) Use spark-class for running executors directly on mesos

[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly on mesos

2015-01-13 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-5088: - Target Version/s: 1.3.0 (was: 1.3.0, 1.2.1) Use spark-class for running executors directly

Re: Pattern Matching / Equals on Case Classes in Spark Not Working

2015-01-12 Thread Matei Zaharia
Is this in the Spark shell? Case classes don't work correctly in the Spark shell unfortunately (though they do work in the Scala shell) because we change the way lines of code compile to allow shipping functions across the network. The best way to get case classes in there is to compile them

[jira] [Resolved] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2015-01-09 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3619. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Jongyoul Lee (was: Timothy

Fwd: ApacheCon North America 2015 Call For Papers

2015-01-05 Thread Matei Zaharia
FYI, ApacheCon North America call for papers is up. Matei Begin forwarded message: Date: January 5, 2015 at 9:40:41 AM PST From: Rich Bowen rbo...@rcbowen.com Reply-To: dev d...@community.apache.org To: dev d...@community.apache.org Subject: ApacheCon North America 2015 Call For Papers

Re: JetS3T settings spark

2014-12-30 Thread Matei Zaharia
This file needs to be on your CLASSPATH actually, not just in a directory. The best way to pass it in is probably to package it into your application JAR. You can put it in src/main/resources in a Maven or SBT project, and check that it makes it into the JAR using jar tf yourfile.jar. Matei
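
A JAR is just a ZIP archive, so the `jar tf yourfile.jar` check can also be done programmatically; this sketch builds a tiny in-memory archive to show the idea (file contents illustrative):

```python
import io
import zipfile

def jar_contains(jar_bytes, entry):
    """Emulate `jar tf yourfile.jar`: list the archive's entries and check
    that the resource (e.g. jets3t.properties) made it to the root."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return entry in jar.namelist()

# Build a tiny in-memory "jar" with the properties file at its root
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr("jets3t.properties", "s3service.https-only=true\n")
assert jar_contains(buf.getvalue(), "jets3t.properties")
```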

[jira] [Commented] (SPARK-4660) JavaSerializer uses wrong classloader

2014-12-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14260544#comment-14260544 ] Matei Zaharia commented on SPARK-4660: -- [~pkolaczk] mind sending a pull request

Re: How to become spark developer in jira?

2014-12-29 Thread Matei Zaharia
Please ask someone else to assign them for now, and just comment on them that you're working on them. Over time if you contribute a bunch we'll add you to that list. The problem is that in the past, people would assign issues to themselves and never actually work on them, making it confusing

Re: action progress in ipython notebook?

2014-12-29 Thread Matei Zaharia
Hey Eric, sounds like you are running into several issues, but thanks for reporting them. Just to comment on a few of these: I'm not seeing RDDs or SRDDs cached in the Spark UI. That page remains empty despite my calling cache(). This is expected until you compute the RDDs the first time

Re: When will spark 1.2 released?

2014-12-18 Thread Matei Zaharia
Yup, as he posted before, An Apache infrastructure issue prevented me from pushing this last night. The issue was resolved today and I should be able to push the final release artifacts tonight. On Dec 18, 2014, at 10:14 PM, Andrew Ash and...@andrewash.com wrote: Patrick is working on the

Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Matei Zaharia
The problem is very likely NFS, not Spark. What kind of network is it mounted over? You can also test the performance of your NFS by copying a file from it to a local disk or to /dev/null and seeing how many bytes per second it can copy. Matei On Dec 17, 2014, at 9:38 AM, Larryliu
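
The suggested sanity check, timing a straight read off the NFS mount, might look like this (shell dd works just as well; the helper is illustrative):

```python
import time

def copy_throughput(src_path, chunk=1 << 20):
    """Read src_path in 1 MB chunks, discarding the data, and return
    (total_bytes, bytes_per_second). Pointing src_path at a file on the
    NFS mount measures raw storage speed with Spark out of the picture."""
    total = 0
    start = time.monotonic()
    with open(src_path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            total += len(data)
    elapsed = max(time.monotonic() - start, 1e-9)
    return total, total / elapsed
```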

Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Matei Zaharia
is running on the same server that Spark is running on. So basically I mount the NFS on the same bare metal machine. Larry On Wed, Dec 17, 2014 at 11:42 AM, Matei Zaharia matei.zaha...@gmail.com mailto:matei.zaha...@gmail.com wrote: The problem is very likely NFS, not Spark. What kind

Re: Spark Web Site

2014-12-15 Thread Matei Zaharia
It's just Bootstrap checked into SVN and built using Jekyll. You can check out the raw source files from SVN from https://svn.apache.org/repos/asf/spark. IMO it's fine if you guys use the layout, but just make sure it doesn't look exactly the same because otherwise both sites will look like

Re: Spark SQL Roadmap?

2014-12-13 Thread Matei Zaharia
Spark SQL is already available, the reason for the alpha component label is that we are still tweaking some of the APIs so we have not yet guaranteed API stability for it. However, that is likely to happen soon (possibly 1.3). One of the major things added in Spark 1.2 was an external data

[jira] [Commented] (SPARK-3247) Improved support for external data sources

2014-12-11 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243253#comment-14243253 ] Matei Zaharia commented on SPARK-3247: -- For those looking to learn about

Re: what is the best way to implement mini batches?

2014-12-11 Thread Matei Zaharia
You can just do mapPartitions on the whole RDD, and then call sliding() on the iterator in each one to get a sliding window. One problem is that you will not be able to slide forward into the next partition at partition boundaries. If this matters to you, you need to do something more
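
The mapPartitions-plus-sliding() recipe relies on a lazy window over each partition's iterator; here is the same idea in plain Python (Scala's Iterator.sliding provides it natively), with the stated caveat that windows never cross partition boundaries:

```python
from collections import deque

def sliding(iterator, size):
    """Yield overlapping size-length windows over an iterator lazily,
    analogous to Scala's Iterator.sliding inside mapPartitions."""
    window = deque(maxlen=size)  # oldest element drops off automatically
    for item in iterator:
        window.append(item)
        if len(window) == size:
            yield tuple(window)

list(sliding(iter(range(5)), 3))
# -> [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
```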

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-10 Thread Matei Zaharia
+1 Tested on Mac OS X. Matei On Dec 10, 2014, at 1:08 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.0! The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):

[jira] [Commented] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc

2014-12-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1429#comment-1429 ] Matei Zaharia commented on SPARK-4690: -- Yup, that's the definition

[jira] [Closed] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc

2014-12-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-4690. Resolution: Invalid AppendOnlyMap seems not using Quadratic probing as the JavaDoc

Re: dockerized spark executor on mesos?

2014-12-03 Thread Matei Zaharia
I'd suggest asking about this on the Mesos list (CCed). As far as I know, there was actually some ongoing work for this. Matei On Dec 3, 2014, at 9:46 AM, Dick Davies d...@hellooperator.net wrote: Just wondered if anyone had managed to start spark jobs on mesos wrapped in a docker

[jira] [Created] (SPARK-4683) Add a beeline.cmd to run on Windows

2014-12-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4683: Summary: Add a beeline.cmd to run on Windows Key: SPARK-4683 URL: https://issues.apache.org/jira/browse/SPARK-4683 Project: Spark Issue Type: New Feature

[jira] [Created] (SPARK-4684) Add a script to run JDBC server on Windows

2014-12-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4684: Summary: Add a script to run JDBC server on Windows Key: SPARK-4684 URL: https://issues.apache.org/jira/browse/SPARK-4684 Project: Spark Issue Type: New

[jira] [Updated] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

2014-12-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4685: - Priority: Trivial (was: Major) Update JavaDoc settings to include spark.ml and all spark.mllib

[jira] [Created] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

2014-12-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4685: Summary: Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections Key: SPARK-4685 URL: https://issues.apache.org/jira/browse/SPARK-4685

[jira] [Updated] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

2014-12-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4685: - Target Version/s: 1.2.1 (was: 1.2.0) Update JavaDoc settings to include spark.ml and all

Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-01 Thread Matei Zaharia
+0.9 from me. Tested it on Mac and Windows (someone has to do it) and while things work, I noticed a few recent scripts don't have Windows equivalents, namely https://issues.apache.org/jira/browse/SPARK-4683 and https://issues.apache.org/jira/browse/SPARK-4684. The first one at least would be

Re: Spurious test failures, testing best practices

2014-11-30 Thread Matei Zaharia
Hi Ryan, As a tip (and maybe this isn't documented well), I normally use SBT for development to avoid the slow build process, and use its interactive console to run only specific tests. The nice advantage is that SBT can keep the Scala compiler loaded and JITed across builds, making it faster

Re: [RESULT] [VOTE] Designating maintainers for some Spark components

2014-11-30 Thread Matei Zaharia
the timeout for waiting for a maintainer to a week. Hopefully this will provide more options for reviewing in these components. The complete list is available at https://cwiki.apache.org/confluence/display/SPARK/Committers. Matei On Nov 8, 2014, at 7:28 PM, Matei Zaharia matei.zaha...@gmail.com

Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-11-28 Thread Matei Zaharia
Hey Patrick, unfortunately you got some of the text here wrong, saying 1.1.0 instead of 1.2.0. Not sure it will matter since there can well be another RC after testing, but we should be careful. Matei On Nov 28, 2014, at 9:16 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on

[jira] [Resolved] (SPARK-4613) Make JdbcRDD easier to use from Java

2014-11-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4613. -- Resolution: Fixed Fix Version/s: 1.2.0 Make JdbcRDD easier to use from Java

[jira] [Updated] (SPARK-4613) Make JdbcRDD easier to use from Java

2014-11-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4613: - Issue Type: Improvement (was: Bug) Make JdbcRDD easier to use from Java

[jira] [Resolved] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-11-26 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3628. -- Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.1.2 (was: 0.9.3

[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-11-26 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227077#comment-14227077 ] Matei Zaharia commented on SPARK-3628: -- FYI I merged this into 1.2.0, since the patch

[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-11-26 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227108#comment-14227108 ] Matei Zaharia commented on SPARK-732: - As discussed on https://github.com/apache/spark

[jira] [Reopened] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-11-26 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reopened SPARK-3628: -- Don't apply accumulator updates multiple times for tasks in result stages

Re: configure to run multiple tasks on a core

2014-11-26 Thread Matei Zaharia
Instead of SPARK_WORKER_INSTANCES you can also set SPARK_WORKER_CORES, to have one worker that thinks it has more cores. Matei On Nov 26, 2014, at 5:01 PM, Yotto Koga yotto.k...@autodesk.com wrote: Thanks Sean. That worked out well. For anyone who happens onto this post and wants to do
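As a sketch, both settings mentioned above go in `conf/spark-env.sh` on each standalone-mode machine; the numeric values here are illustrative, not from the thread.

```shell
# conf/spark-env.sh -- standalone-mode worker sizing (illustrative values)

# Option A: run several worker JVMs per machine.
# export SPARK_WORKER_INSTANCES=4
# export SPARK_WORKER_CORES=2

# Option B (the tip above): one worker that advertises more cores,
# letting the scheduler place more tasks on it concurrently.
export SPARK_WORKER_CORES=8
```

Either way the cluster sees the same total core count; option B avoids the per-JVM memory overhead of multiple workers.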

[jira] [Created] (SPARK-4613) Make JdbcRDD easier to use from Java

2014-11-25 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4613: Summary: Make JdbcRDD easier to use from Java Key: SPARK-4613 URL: https://issues.apache.org/jira/browse/SPARK-4613 Project: Spark Issue Type: Bug

[jira] [Commented] (SPARK-4613) Make JdbcRDD easier to use from Java

2014-11-25 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225615#comment-14225615 ] Matei Zaharia commented on SPARK-4613: -- BTW the strawman for this would be a version

Re: Spark SQL - Any time line to move beyond Alpha version ?

2014-11-25 Thread Matei Zaharia
The main reason for the alpha tag is actually that APIs might still be evolving, but we'd like to freeze the API as soon as possible. Hopefully it will happen in one of 1.3 or 1.4. In Spark 1.2, we're adding an external data source API that we'd like to get experience with before freezing it.

Re: Configuring custom input format

2014-11-25 Thread Matei Zaharia
How are you creating the object in your Scala shell? Maybe you can write a function that directly returns the RDD, without assigning the object to a temporary variable. Matei On Nov 5, 2014, at 2:54 PM, Corey Nolet cjno...@gmail.com wrote: The closer I look @ the stack trace in the Scala

Re: Configuring custom input format

2014-11-25 Thread Matei Zaharia
On Tue, Nov 25, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com mailto:matei.zaha...@gmail.com wrote: How are you creating the object in your Scala shell? Maybe you can write a function that directly returns the RDD, without assigning the object to a temporary variable. Matei

Re: do not assemble the spark example jar

2014-11-25 Thread Matei Zaharia
You can do sbt/sbt assembly/assembly to assemble only the main package. Matei On Nov 25, 2014, at 7:50 PM, lihu lihu...@gmail.com wrote: Hi, The spark assembly is time costly. If I only need the spark-assembly-1.1.0-hadoop2.3.0.jar, do not need the

Re: do not assemble the spark example jar

2014-11-25 Thread Matei Zaharia
BTW as another tip, it helps to keep the SBT console open as you make source changes (by just running sbt/sbt with no args). It's a lot faster the second time it builds something. Matei On Nov 25, 2014, at 8:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You can do sbt/sbt assembly
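Putting the two assembly tips together, a sketch of the fast build loop, assuming the standard `sbt/sbt` launcher script in the Spark source tree:

```shell
# Start the SBT console once with no arguments and leave it running.
./sbt/sbt

# Build only the main assembly jar, skipping the examples assembly:
> assembly/assembly

# After editing sources, re-run the same command; the warm, already-
# JITed compiler makes the incremental rebuild much faster.
> assembly/assembly
```

The `>` lines are entered at the SBT prompt; exiting and relaunching SBT between builds forfeits the speedup.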
