Re: How to run a single test suite?

2014-02-26 Thread Reynold Xin
You put your quotes in the wrong place. See https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools On Wed, Feb 26, 2014 at 10:04 PM, Bryn Keller xol...@xoltar.org wrote: Hi Folks, I've tried using sbt test-only '*PairRDDFunctionsSuite' to run only that test suite, which
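For readers hitting the same issue: the fix Reynold alludes to is quoting the whole sbt command, so that the shell passes the wildcard through to sbt instead of splitting it into two separate sbt commands. A sketch (the task name varies by sbt version):

```shell
# Wrong: sbt test-only '*PairRDDFunctionsSuite'
#   -- the shell hands sbt two separate commands, "test-only" and the pattern.
# Right: quote the entire command so sbt receives the wildcard intact.
sbt 'test-only *PairRDDFunctionsSuite'   # sbt 0.13-era task name
sbt 'testOnly *PairRDDFunctionsSuite'    # newer sbt renamed the task
```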

Re: Development methodology

2014-03-02 Thread Reynold Xin
and if I have to work on a PR, I should rather make use of my github account... Thanks for the clarification. On Sat, Mar 1, 2014 at 12:27 PM, Reynold Xin r...@databricks.com wrote: I'm not sure what you mean by enterprise stash. But a PR is a concept unique to GitHub. There is no PR

Re: Code documentation

2014-03-15 Thread Reynold Xin
Take a look at https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals On Sat, Mar 15, 2014 at 6:19 PM, David Thomas dt5434...@gmail.com wrote: Is there any documentation available that explains the code architecture that can help a new Spark framework developer?

Re: Spark AMI

2014-03-20 Thread Reynold Xin
It's mostly a stock CentOS installation with some scripts. On Thu, Mar 20, 2014 at 2:53 AM, Usman Ghani us...@platfora.com wrote: Is there anything special about the spark AMIs or are they just stock CentOS installations?

Re: Largest input data set observed for Spark.

2014-03-20 Thread Reynold Xin
was that job (I guess in terms of number of transforms and actions) and how long did that take to process? -Suren On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin r...@databricks.com wrote: Actually we just ran a job with 70TB+ compressed data on 28 worker nodes - I didn't count the size

Re: Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread Reynold Xin
Nick and Koert summarized it pretty well. Just to clarify and give some concrete examples: if you want to start with a specific vertex and follow some path, it is probably easier and faster to use a key-value store, or even MySQL or a graph database. If you want to count the average length

Re: minor optimizations to get my feet wet

2014-04-10 Thread Reynold Xin
Thanks for contributing! I think often unless the feature is gigantic, you can send a pull request directly for discussion. One rule of thumb in the Spark code base is that we typically prefer readability over conciseness, and thus we tend to avoid using too much Scala magic or operator

Re: bug using kryo as closure serializer

2014-05-04 Thread Reynold Xin
I added the config option to use the non-default serializer. However, at the time, Kryo failed to serialize pretty much any closure, so that option was never really used or recommended. Since then the Scala ecosystem has developed, and some other projects are starting to use Kryo to serialize more

Re: bug using kryo as closure serializer

2014-05-04 Thread Reynold Xin
/TaskResultGetter.scala#L39 Would storing my RDD as MEMORY_ONLY_SER prevent the closure serializer from trying to deal with my clojure.lang.PersistentVector class? Where do I go from here? On Sun, May 4, 2014 at 12:50 PM, Reynold Xin r...@databricks.com wrote: I added the config option to use

Re: bug using kryo as closure serializer

2014-05-04 Thread Reynold Xin
as MEMORY_ONLY_SER prevent the closure serializer from trying to deal with my clojure.lang.PersistentVector class? Where do I go from here? On Sun, May 4, 2014 at 12:50 PM, Reynold Xin r...@databricks.com wrote: I added the config option to use the non-default serializer. However

Re: Kryo not default?

2014-05-13 Thread Reynold Xin
The main reason is that it doesn't always work (e.g. sometimes an application already has special Java serialization / externalization logic written that doesn't work with Kryo). On Mon, May 12, 2014 at 5:47 PM, Anand Avati av...@gluster.org wrote: Hi, Can someone share the reason why Kryo

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Reynold Xin
Thanks for the experiments and analysis! I think Michael already submitted a patch that avoids scanning all columns for count(*) or count(1). On Mon, May 12, 2014 at 9:46 PM, Andrew Ash and...@andrewash.com wrote: Hi Spark devs, First of all, huge congrats on the parquet integration with

Re: Scala examples for Spark do not work as written in documentation

2014-05-16 Thread Reynold Xin
Thanks for pointing it out. We should update the website to fix the code. val count = spark.parallelize(1 to NUM_SAMPLES).map { i => val x = Math.random() val y = Math.random() if (x*x + y*y < 1) 1 else 0 }.reduce(_ + _) println("Pi is roughly " + 4.0 * count / NUM_SAMPLES) On Fri, May 16,
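The archive strips characters from the inlined code, so for reference, here is a plain-Scala version of the same Monte Carlo Pi estimate that runs without Spark (the Spark version simply replaces the Range with `spark.parallelize(1 to NUM_SAMPLES)`):

```scala
import scala.util.Random

// Monte Carlo estimate of Pi: sample points in the unit square and count
// the fraction that land inside the quarter circle of radius 1.
val NUM_SAMPLES = 100000
val count = (1 to NUM_SAMPLES).map { _ =>
  val x = Random.nextDouble()
  val y = Random.nextDouble()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
```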

Re: BUG: graph.triplets does not return proper values

2014-05-19 Thread Reynold Xin
This was an optimization that reuses a triplet object in GraphX, and when you do a collect directly on triplets, the same object is returned. It has been fixed in Spark 1.0 here: https://issues.apache.org/jira/browse/SPARK-1188 To work around in older version of Spark, you can add a copy step to
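The object-reuse hazard Reynold describes is easy to reproduce in plain Scala, with no GraphX involved. This sketch (the `Record` class and helper are illustrative stand-ins for GraphX's reused triplet object) shows why collecting from an iterator that reuses one mutable object goes wrong, and why adding a copy step fixes it:

```scala
// A mutable record, standing in for the reused EdgeTriplet object.
class Record(var value: Int)

// An iterator that reuses ONE object for every element, in the spirit of
// the pre-1.0 GraphX triplets optimization (SPARK-1188).
def reusingIterator(data: Seq[Int]): Iterator[Record] = {
  val shared = new Record(0)
  data.iterator.map { v => shared.value = v; shared } // same object each time
}

// Materializing directly keeps N references to the single shared object,
// so every element shows the LAST value written...
val wrong = reusingIterator(Seq(1, 2, 3)).toList.map(_.value)

// ...while copying each element before materializing preserves the values.
val right = reusingIterator(Seq(1, 2, 3)).map(r => new Record(r.value)).toList.map(_.value)
```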

Re: BUG: graph.triplets does not return proper values

2014-05-19 Thread Reynold Xin
reduce always return a single element - maybe you are misunderstanding what the reduce function in collections does. On Mon, May 19, 2014 at 3:32 PM, GlennStrycker glenn.stryc...@gmail.comwrote: I tried adding .copy() everywhere, but still only get one element returned, not even an RDD

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread Reynold Xin
You are probably looking for reduceByKey in that case. reduce just reduces everything in the collection into a single element. On Tue, May 20, 2014 at 12:16 PM, GlennStrycker glenn.stryc...@gmail.comwrote: Wait a minute... doesn't a reduce function return 1 element PER key pair? For example,
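The distinction holds for plain Scala collections as well. A sketch of `reduce` (whole collection collapses to one value) versus per-key aggregation, which is the semantics Spark's `reduceByKey` provides on an RDD of pairs:

```scala
val pairs = List(("a", 1), ("b", 2), ("a", 3))

// reduce collapses the ENTIRE collection into a single element.
val total = pairs.map(_._2).reduce(_ + _) // 6

// Per-key reduction (what reduceByKey does on a pair RDD) yields one
// element per distinct key instead.
val perKey = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
// Map("a" -> 4, "b" -> 2)
```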

Re: Graphx: GraphLoader.edgeListFile with edge weight

2014-05-22 Thread Reynold Xin
You can submit a pull request on the github mirror: https://github.com/apache/spark Thanks. On Wed, May 21, 2014 at 10:59 PM, npanj nitinp...@gmail.com wrote: Hi, For my project I needed to load a graph with edge weight; for this I have updated GraphLoader.edgeListFile to consider third

Re: Kryo serialization for closures: a workaround

2014-05-24 Thread Reynold Xin
Thanks for sending this in. The ASF list doesn't support html so the formatting of the code is a little messed up. For those who want to see the code in clearly formatted text, go to http://apache-spark-developers-list.1001551.n3.nabble.com/Kryo-serialization-for-closures-a-workaround-tp6787.html

Re: Clearspring Analytics Version

2014-05-27 Thread Reynold Xin
Would you like to submit a pull request to update it? Also in the latest version HyperLogLog is serializable. That means we can get rid of the SerializableHyperLogLog class. (and move to use HyperLogLogPlus). On Tue, May 27, 2014 at 3:01 PM, Surendranauth Hiraman suren.hira...@velos.io

Re: Clearspring Analytics Version

2014-05-27 Thread Reynold Xin
On Tue, May 27, 2014 at 6:02 PM, Reynold Xin r...@databricks.com wrote: Would you like to submit a pull request to update it? Also in the latest version HyperLogLog is serializable. That means we can get rid of the SerializableHyperLogLog class. (and move to use HyperLogLogPlus

Re: Clearspring Analytics Version

2014-05-27 Thread Reynold Xin
properly and I'm having various dependency issues running sbt/sbt assembly. Any chance you could go ahead and submit a pull request for this if it's easy for you? :-) -Suren On Tue, May 27, 2014 at 6:13 PM, Reynold Xin r...@databricks.com wrote: 2.7 sounds good. I was actually waiting for 2.7

Re: About JIRA SPARK-1825

2014-05-27 Thread Reynold Xin
It is actually pretty simple. You will first need to fork Spark on github, and then push your changes to it, and then follow: https://help.github.com/articles/using-pull-requests On Tue, May 27, 2014 at 6:10 PM, innowireless TaeYun Kim taeyun@innowireless.co.kr wrote: I'm afraid I don't
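The fork-push-PR steps Reynold lists look roughly like this on the command line (a sketch; YOUR_USERNAME, the branch name, and the commit message are placeholders):

```shell
# Clone your own fork of Spark (created via the Fork button on GitHub).
git clone https://github.com/YOUR_USERNAME/spark.git
cd spark
git checkout -b fix-spark-1825        # topic branch for the change
# ...edit files, then commit...
git commit -am "Fix for SPARK-1825"
git push origin fix-spark-1825
# Finally, open a pull request from that branch on github.com.
```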

Re: GraphX triplets on 5-node graph

2014-05-29 Thread Reynold Xin
Take a look at this one: https://issues.apache.org/jira/browse/SPARK-1188 It was an optimization that caused inconvenience for users. We got rid of that now in Spark 1.0. On Wed, May 28, 2014 at 11:48 PM, Michael Malak michaelma...@yahoo.comwrote: Shouldn't I be seeing N2 and N4 in the output

Re: Please change instruction about Launching Applications Inside the Cluster

2014-05-30 Thread Reynold Xin
Can you take a look at the latest Spark 1.0 docs and see if they are fixed? https://github.com/apache/spark/tree/master/docs Thanks. On Thu, May 29, 2014 at 5:29 AM, Lizhengbing (bing, BIPA) zhengbing...@huawei.com wrote: The instruction address is in

Re: Implementing rdd.scanLeft()

2014-06-05 Thread Reynold Xin
I think the main concern is this would require scanning the data twice, and maybe the user should be aware of it ... On Thu, Jun 5, 2014 at 10:29 AM, Andrew Ash and...@andrewash.com wrote: I have a use case that would greatly benefit from RDDs having a .scanLeft() method. Are the project
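For reference, the collection semantics being requested, plus a sketch of why a distributed `scanLeft` costs two passes: one pass to compute per-partition totals, and a second to rescan each partition with the cumulative offset of all preceding partitions (the `partitions` value here is an illustrative stand-in for an RDD's partitions):

```scala
// Local scanLeft: running totals, including the initial element.
val running = List(1, 2, 3, 4).scanLeft(0)(_ + _) // List(0, 1, 3, 6, 10)

// Distributed sketch, pass 1: each partition's total, scanned into offsets.
val partitions = Seq(Seq(1, 2), Seq(3, 4))
val offsets = partitions.map(_.sum).scanLeft(0)(_ + _) // Seq(0, 3, 10)

// Pass 2: rescan each partition starting from its offset.
val scanned = partitions.zip(offsets).map { case (p, off) =>
  p.scanLeft(off)(_ + _).tail
}
// Seq(Seq(1, 3), Seq(6, 10)) -- concatenated, this matches the full scan.
```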

OpenStack Swift integration with Spark

2014-06-13 Thread Reynold Xin
If you are interested in OpenStack Swift integration with Spark, please drop me a line. We are looking into improving the integration. Thanks.

Re: Big-Endian (IBM Power7) Spark Serialization issue

2014-06-16 Thread Reynold Xin
Thanks for sending the update. Do you mind posting a link to the bug reported in the lzf project here as well? Cheers. On Sun, Jun 15, 2014 at 7:04 PM, gchen chenguanch...@gmail.com wrote: To anyone who is interested in this issue, the root cause is from third-party code

Re: Big-Endian (IBM Power7) Spark Serialization issue

2014-06-16 Thread Reynold Xin
I think you guys are / will be leading the effort on that :) On Mon, Jun 16, 2014 at 4:15 PM, gchen chenguanch...@gmail.com wrote: Hi Reynold, thanks for your interest on this issue. The work here is part of incorporating Spark into PowerLinux ecosystem. Here is the bug raised in ning by

Re: Big-Endian (IBM Power7) Spark Serialization issue

2014-06-16 Thread Reynold Xin
It is here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala On Mon, Jun 16, 2014 at 4:26 PM, gchen chenguanch...@gmail.com wrote: I didn't find ning's source code in Spark git repository (or maybe I missed it?), so next time when we

Re: Big-Endian (IBM Power7) Spark Serialization issue

2014-06-17 Thread Reynold Xin
It is actually pluggable. You can implement new compression codecs and just change the config variable to use those. On Tuesday, June 17, 2014, gchen chenguanch...@gmail.com wrote: Cool, so maybe when we switch to Snappy instead of LZF, we can work around the bug until the LZF upstream fixes it,
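As a sketch of what "change the config variable" means here: the Spark 1.x property is `spark.io.compression.codec`, though the exact accepted values depend on the Spark version.

```shell
# Spark 1.0-era: name the codec class explicitly.
spark-submit --conf spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec app.jar
# Later releases also accept short names:
#   --conf spark.io.compression.codec=snappy
```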

Re: Contribute to Spark - Need a mentor.

2014-06-18 Thread Reynold Xin
Hi Michael, Unfortunately the Apache mailing list filters out attachments. That said, you can usually just start by looking at the JIRA for Spark and find issues tagged with the starter tag and work on them. You can submit pull requests to the github repo or email the dev list for feedback on

Re: What about a general schema registration method for JavaSchemaRDD?

2014-06-21 Thread Reynold Xin
Thanks for the message. There is an open issue about the public type / schema system that is related to this topic: https://issues.apache.org/jira/browse/SPARK-2179 You probably want to comment on that ticket as well. On Sat, Jun 21, 2014 at 7:52 AM, guxiaobo1982 guxiaobo1...@qq.com wrote:

Re: [jira] [Created] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-06-23 Thread Reynold Xin
Mridul, Can you comment a little bit more on this issue? We are running into the same stack trace but not sure whether it is just different Spark versions on each cluster (doesn't seem likely) or a bug in Spark. Thanks. On Sat, May 17, 2014 at 4:41 AM, Mridul Muralidharan mri...@gmail.com

Re: IntelliJ IDEA cannot compile TreeNode.scala

2014-06-26 Thread Reynold Xin
IntelliJ's parser/analyzer/compiler behaves differently from the Scala compiler, and sometimes leads to inconsistent behavior. This is one of those cases. In general, while we use IntelliJ, we don't use it to build stuff. I personally always build on the command line with sbt or Maven. On Thu, Jun 26, 2014

Re: NPE calling reduceByKey on JavaPairRDD

2014-06-26 Thread Reynold Xin
Responded on the jira... On Thu, Jun 26, 2014 at 9:17 PM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi, I've been encountering an NPE invoking reduceByKey on JavaPairRDD since upgrading to 1.0.0. The issue is straightforward to reproduce with 1.0.0 and doesn't occur with 0.9.0. The

Re: Eliminate copy while sending data : any Akka experts here ?

2014-07-01 Thread Reynold Xin
I was actually talking to tgraves today at the summit about this. Based on my understanding, the sizes we track and send (which is unfortunately O(M*R) regardless of how we change the implementation -- whether we send via task or send via MapOutputTracker) is only used to compute maxBytesInFlight

Re: process for contributing to mllib

2014-07-02 Thread Reynold Xin
Yes it would be great to mention the JIRA ticket number on the pull request. Thanks! On Wed, Jul 2, 2014 at 1:01 AM, Eustache DIEMERT eusta...@diemert.fr wrote: Hi there, I just created an issue [1] for MLlib on Jira. I also want to contribute a fix, is it a good idea to submit a PR on

Re: Cloudera's Hive on Spark vs AmpLab's Shark

2014-07-08 Thread Reynold Xin
This blog post probably clarifies a lot of things: http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html On Tue, Jul 8, 2014 at 12:24 PM, anishs...@yahoo.co.in anishs...@yahoo.co.in wrote: Hi All I read somewhere that Cloudera announced

Re: CPU/Disk/network performance instrumentation

2014-07-09 Thread Reynold Xin
Maybe it's time to create an advanced mode in the ui. On Wed, Jul 9, 2014 at 12:23 PM, Kay Ousterhout k...@eecs.berkeley.edu wrote: Hi all, I've been doing a bunch of performance measurement of Spark and, as part of doing this, added metrics that record the average CPU utilization, disk

Re: How pySpark works?

2014-07-11 Thread Reynold Xin
Also take a look at this: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals On Fri, Jul 11, 2014 at 10:29 AM, Andrew Or and...@databricks.com wrote: Hi Egor, Here are a few answers to your questions: 1) Python needs to be installed on all machines, but not pyspark. The

Re: sparkSQL thread safe?

2014-07-13 Thread Reynold Xin
Ian, The LZFOutputStream's large byte buffer is sort of annoying. It is much smaller if you use the Snappy one. The downside of the Snappy one is slightly less compression (I've seen 10 - 20% larger sizes). If we can find a compression scheme implementation that doesn't do very large buffers,

better compression codecs for shuffle blocks?

2014-07-14 Thread Reynold Xin
Hi Spark devs, I was looking into the memory usage of shuffle, and one annoying thing about the default compression codec (LZF) is that the implementation we use allocates buffers pretty generously. I did a simple experiment and found that creating 1000 LZFOutputStreams allocated 198976424 bytes

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Reynold Xin
Copying Jon here since he worked on the lzf library at Ning. Jon - any comments on this topic? On Mon, Jul 14, 2014 at 3:54 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You can actually turn off shuffle compression by setting spark.shuffle.compress to false. Try that out, there will

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Reynold Xin
of an algorithm, or multiple map functions, or stuff like that). But they won't have to broadcast something they only use once. Matei On Jul 16, 2014, at 10:07 PM, Reynold Xin r...@databricks.com wrote: Oops - the pull request should be https://github.com/apache/spark/pull/1452 On Wed, Jul 16

Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Reynold Xin
+1 On Thursday, July 17, 2014, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Tested on Mac, verified CHANGES.txt is good, verified several of the bug fixes. Matei On Jul 17, 2014, at 11:12 AM, Xiangrui Meng men...@gmail.com wrote: I start the voting with a +1. Ran

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-19 Thread Reynold Xin
Thanks :) FYI the pull request has been merged and will be part of Spark 1.1.0. On Thu, Jul 17, 2014 at 11:09 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Thu, Jul 17, 2014 at 1:23 AM, Stephen Haberman stephen.haber...@gmail.com wrote: I'd be ecstatic if more major changes

Re: Contributing to Spark needs PySpark build/test instructions

2014-07-21 Thread Reynold Xin
I added an automated testing section: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-AutomatedTesting Can you take a look to see if it is what you had in mind? On Mon, Jul 21, 2014 at 3:54 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Re: Contributing to Spark needs PySpark build/test instructions

2014-07-21 Thread Reynold Xin
Someone contributing to PySpark will want to be directed to run something in addition to (or instead of) sbt/sbt test, I believe. Nick On Mon, Jul 21, 2014 at 11:43 PM, Reynold Xin r...@databricks.com wrote: I added an automated testing section: https://cwiki.apache.org/confluence

Re: Dynamic variables in Spark

2014-07-21 Thread Reynold Xin
Thanks for the thoughtful email, Neil and Christopher. If I understand this correctly, it seems like the dynamic variable is just a variant of the accumulator (a static one since it is a global object). Accumulators are already implemented using thread-local variables under the hood. Am I

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Reynold Xin
If the purpose is for dropping csv headers, perhaps we don't really need a common drop and only one that drops the first line in a file? I'd really try hard to avoid a common drop/dropWhile because they can be expensive to do. Note that I think we will be adding this functionality (ignoring
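A commonly suggested workaround for the CSV-header case, shown here on plain Scala partitions so it runs standalone (the `partitions` data is illustrative; on a real RDD the equivalent call is `rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }`, assuming the header is the first line of partition 0):

```scala
// Sketch: drop only the first line of the first partition, leaving all
// other partitions untouched -- cheaper than a general drop().
val partitions = Seq(Seq("col_a,col_b", "1,2", "3,4"), Seq("5,6"))
val noHeader = partitions.zipWithIndex.map { case (part, idx) =>
  if (idx == 0) part.drop(1) else part
}
// noHeader.flatten keeps every data row and loses only the header.
```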

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Reynold Xin
...@cloudera.com wrote: It could make sense to add a skipHeader argument to SparkContext.textFile? On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin r...@databricks.com wrote: If the purpose is for dropping csv headers, perhaps we don't really need a common drop and only one that drops the first

Re: setting inputMetrics in HadoopRDD#compute()

2014-07-26 Thread Reynold Xin
, Reynold Xin r...@databricks.com wrote: There is one piece of information that'd be useful to know, which is the source of the input. Even in the presence of an IOException, the input metrics still specifies the task is reading from Hadoop. However, I'm slightly confused by this -- I think

Re: No such file or directory errors running tests

2014-07-27 Thread Reynold Xin
To run through all the tests you'd need to create the assembly jar first. I've seen this asked a few times. Maybe we should make it more obvious. http://spark.apache.org/docs/latest/building-with-maven.html Spark Tests in Maven Tests are run by default via the ScalaTest Maven plugin

Re: No such file or directory errors running tests

2014-07-27 Thread Reynold Xin
-Pyarn -Phadoop-2.3 -Phive test AFA documentation, yes adding another sentence to that same Building with Maven page would likely be helpful to future generations. 2014-07-27 19:10 GMT-07:00 Reynold Xin r...@databricks.com: To run through all the tests you'd need to create the assembly jar

Re: package/assemble with local spark

2014-07-28 Thread Reynold Xin
You can use publish-local in sbt. If you want to be more careful, you can give Spark a different version number and use that version number in your app. On Mon, Jul 28, 2014 at 4:33 AM, Larry Xiao xia...@sjtu.edu.cn wrote: Hi, How do you package an app with modified spark? In seems sbt
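A sketch of the publish-local flow, run from a Spark source checkout:

```shell
# Build the modified Spark and publish its artifacts to the local ivy cache.
sbt/sbt publish-local
```

Then point the application's build at the locally published artifacts, e.g. `libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0-SNAPSHOT"` in build.sbt, where the version string must match whatever version number the modified Spark build was given.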

Re: Github mirroring is running behind

2014-07-28 Thread Reynold Xin
Hi devs, I don't know if this is going to help, but if you can watch / vote on the ticket, it might help ASF INFRA prioritize and triage it faster: https://issues.apache.org/jira/browse/INFRA-8116 Please do. Thanks! On Mon, Jul 28, 2014 at 5:41 PM, Patrick Wendell pwend...@gmail.com wrote:

Re: pre-filtered hadoop RDD use case

2014-07-29 Thread Reynold Xin
Would something like this help? https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PartitionPruningRDD.scala On Thu, Jul 24, 2014 at 8:40 AM, Eugene Cheipesh echeip...@gmail.com wrote: Hello, I have an interesting use case for a pre-filtered RDD. I have

Re: pre-filtered hadoop RDD use case

2014-07-29 Thread Reynold Xin
Message- From: Reynold Xin [mailto:r...@databricks.com] Sent: Tuesday, July 29, 2014 12:55 AM To: dev@spark.apache.org Subject: Re: pre-filtered hadoop RDD use case Would something like this help? https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd

Re: JIRA content request

2014-07-29 Thread Reynold Xin
+1 on this. On Tue, Jul 29, 2014 at 4:34 PM, Mark Hamstra m...@clearstorydata.com wrote: Of late, I've been coming across quite a few pull requests and associated JIRA issues that contain nothing indicating their purpose beyond a pretty minimal description of what the pull request does. On

Re: Interested in contributing to GraphX in Python

2014-08-04 Thread Reynold Xin
Thanks for your interest. I think the main challenge is that if we have to call Python functions per record, it can be pretty expensive to serialize/deserialize across the boundary between the Python and JVM processes. I don't know if there is a good way to solve this problem yet. On Fri, Aug 1,

Re: Tiny curiosity question on closing the jdbc connection

2014-08-05 Thread Reynold Xin
I'm pretty sure it is an oversight. Would you like to submit a pull request to fix that? On Tue, Aug 5, 2014 at 12:14 PM, Stephen Boesch java...@gmail.com wrote: Within its compute.close method, the JdbcRDD class has this interesting logic for closing jdbc connection: try {

Re: Tiny curiosity question on closing the jdbc connection

2014-08-05 Thread Reynold Xin
for another reason, not intending to be a bother ;) 2014-08-05 13:03 GMT-07:00 Reynold Xin r...@databricks.com: I'm pretty sure it is an oversight. Would you like to submit a pull request to fix that? On Tue, Aug 5, 2014 at 12:14 PM, Stephen Boesch java...@gmail.com wrote: Within

Re: Tiny curiosity question on closing the jdbc connection

2014-08-05 Thread Reynold Xin
it. As for the leaking in the case of malformed statements, isn't that addressed by context.addOnCompleteCallback{ () = closeIfNeeded() } or am I misunderstanding? On Tue, Aug 5, 2014 at 3:15 PM, Reynold Xin r...@databricks.com wrote: Thanks. Those are definitely great problems to fix! On Tue, Aug 5

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-06 Thread Reynold Xin
I don't think it was a conscious design decision to not include the application classes in the connection manager serializer. We should fix that. Where is it deserializing data in that thread? 4 might make sense in the long run, but it adds a lot of complexity to the code base (whole separate

Re: Unit tests in 5 minutes

2014-08-08 Thread Reynold Xin
ScalaTest actually has support for parallelization built in. We can use that. The main challenge is to make sure all the test suites can work in parallel when running alongside each other. On Fri, Aug 8, 2014 at 9:47 AM, Ted Yu yuzhih...@gmail.com wrote: How about using parallel execution
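The sbt side of this is a single setting; a sketch in sbt 0.13-era syntax (whether enabling it is safe depends on suites not sharing global state such as a running SparkContext, which is exactly the challenge Reynold notes):

```scala
// sbt 0.13-style setting: allow test suites in the Test configuration
// to run in parallel rather than sequentially.
parallelExecution in Test := true
```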

Re: Unit tests in 5 minutes

2014-08-08 Thread Reynold Xin
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350 On Fri, Aug 8, 2014 at 10:10 AM, Reynold Xin r...@databricks.com wrote: ScalaTest actually has support for parallelization built-in. We can use that. The main challenge is to make sure all the test suites can work in parallel when

Re: 1.1.0-SNAPSHOT possible regression

2014-08-08 Thread Reynold Xin
Looks like you didn't actually paste the exception message. Do you mind doing that? On Fri, Aug 8, 2014 at 10:14 AM, Reynold Xin r...@databricks.com wrote: Pasting a better formatted trace: at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1180

Re: 1.1.0-SNAPSHOT possible regression

2014-08-08 Thread Reynold Xin
Pasting a better formatted trace: at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1180) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:137) at

Re: 1.1.0-SNAPSHOT possible regression

2014-08-08 Thread Reynold Xin
, Reynold Xin r...@databricks.com wrote: Looks like you didn't actually paste the exception message. Do you mind doing that? On Fri, Aug 8, 2014 at 10:14 AM, Reynold Xin r...@databricks.com wrote: Pasting a better formatted trace: at java.io.ObjectOutputStream.writeObject0

Re: 1.1.0-SNAPSHOT possible regression

2014-08-08 Thread Reynold Xin
I created a JIRA ticket to track this: https://issues.apache.org/jira/browse/SPARK-2928 Let me know if you need help with it. On Fri, Aug 8, 2014 at 10:40 AM, Reynold Xin r...@databricks.com wrote: Yes, I'm pretty sure it doesn't actually use the right serializer in TorrentBroadcast: https

Re: 1.1.0-SNAPSHOT possible regression

2014-08-08 Thread Reynold Xin
. I can compare Spark-1.0.1 code and see what's going on... Thanks, Ron On Friday, August 8, 2014 10:43 AM, Reynold Xin r...@databricks.com wrote: I created a JIRA ticket to track this: https://issues.apache.org/jira/browse/SPARK-2928 Let me know if you need help with it. On Fri

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
They only compared their own implementations of a couple of algorithms on different platforms rather than comparing the different platforms themselves (in the case of Spark -- PySpark). I can write two variants of an algorithm on Spark and make them perform drastically differently. I have no doubt if

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
, Reynold Xin r...@databricks.com wrote: They only compared their own implementations of a couple of algorithms on different platforms rather than comparing the different platforms themselves (in the case of Spark -- PySpark). I can write two variants of an algorithm on Spark and make them perform

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
BTW you can find the original Presto (rebranded as Distributed R) paper here: http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Venkataraman.pdf On Wed, Aug 13, 2014 at 2:16 PM, Reynold Xin r...@databricks.com wrote: Actually I believe the same person started both projects

Re: Added support for :cp jar to the Spark Shell

2014-08-13 Thread Reynold Xin
I haven't read the code yet, but if it is what I think it is, this is SUPER, UBER, HUGELY useful. On a related note, I asked about this on the Scala dev list but never got a satisfactory answer https://groups.google.com/forum/#!msg/scala-internals/_cZ1pK7q6cU/xyBQA0DdcYwJ On Wed, Aug 13,

proposal for pluggable block transfer interface

2014-08-13 Thread Reynold Xin
Hi devs, I posted a design doc proposing an interface for pluggable block transfer (used in shuffle, broadcast, block replication, etc). This is expected to be done in the 1.2 time frame. It should make our code base cleaner, and enable us to provide alternative implementations of block transfers

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-14 Thread Reynold Xin
the deserialisation to happen on that thread. See MemoryStore.scala:102. On 7 August 2014 11:53, Reynold Xin r...@databricks.com wrote: I don't think it was a conscious design decision to not include the application classes in the connection manager serializer. We should fix

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-14 Thread Reynold Xin
. The above approach wouldn't help with this problem. Additionally, the YARN scheduler currently uses this approach of adding the application jar to the Executor classpath, so it would make things a bit more uniform. Cheers, Graham On 14 August 2014 17:37, Reynold Xin r...@databricks.com

Re: Too late to contribute for 1.1.0?

2014-08-22 Thread Reynold Xin
I believe docs changes can go in anytime (because we can just publish new versions of docs). Critical bug fixes can still go in too. On Thu, Aug 21, 2014 at 11:43 PM, Evan Chan velvia.git...@gmail.com wrote: I'm hoping to get in some doc enhancements and small bug fixes for Spark SQL. Also

Re: Spark Contribution

2014-08-22 Thread Reynold Xin
Great idea. Added the link https://github.com/apache/spark/blob/master/README.md On Thu, Aug 21, 2014 at 4:06 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: We should add this link to the readme on GitHub btw. On Thursday, August 21, 2014, Henry Saputra henry.sapu...@gmail.com wrote: The

Re: Adding support for a new object store

2014-08-27 Thread Reynold Xin
Hi Rajendran, I'm assuming you have some concept of schema and you are intending to integrate with SchemaRDD instead of normal RDDs. More responses inline below. On Fri, Aug 22, 2014 at 2:21 AM, Rajendran Appavu appra...@in.ibm.com wrote: I am new to Spark source code and looking to see if

Re: Adding support for a new object store

2014-08-27 Thread Reynold Xin
Linking to the JIRA tracking APIs to hook into the planner: https://issues.apache.org/jira/browse/SPARK-3248 On Wed, Aug 27, 2014 at 1:56 PM, Reynold Xin r...@databricks.com wrote: Hi Rajendran, I'm assuming you have some concept of schema and you are intending to integrate with SchemaRDD

Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT

2014-08-28 Thread Reynold Xin
Thanks for doing this, Shane. On Thursday, August 28, 2014, shane knapp skn...@berkeley.edu wrote: all clear: jenkins and all plugins have been updated! On Thu, Aug 28, 2014 at 7:51 AM, shane knapp skn...@berkeley.edu wrote: jenkins is upgraded, but a few jobs sneaked in

Fwd: Partitioning strategy changed in Spark 1.0.x?

2014-08-30 Thread Reynold Xin
Sending the response back to the dev list so this is indexable and searchable by others. -- Forwarded message -- From: Milos Nikolic milos.nikoli...@gmail.com Date: Sat, Aug 30, 2014 at 5:50 PM Subject: Re: Partitioning strategy changed in Spark 1.0.x? To: Reynold Xin r

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Reynold Xin
Welcome, Shane! On Tuesday, September 2, 2014, shane knapp skn...@berkeley.edu wrote: so, i had a meeting w/the databricks guys on friday and they recommended i send an email out to the list to say 'hi' and give you guys a quick intro. :) hi! i'm shane knapp, the new AMPLab devops

Re: about spark assembly jar

2014-09-02 Thread Reynold Xin
Having an SSD helps tremendously with assembly time. Without that, you can do the following in order for Spark to pick up the compiled classes before the assembly at runtime. export SPARK_PREPEND_CLASSES=true On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza sandy.r...@cloudera.com wrote: This doesn't
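A sketch of the workflow this enables, run from a checkout of the Spark source tree (SPARK_PREPEND_CLASSES is the flag named in the thread; the surrounding commands are illustrative):

```shell
# Put freshly compiled classes ahead of the (possibly stale) assembly jar.
export SPARK_PREPEND_CLASSES=true
sbt/sbt compile          # fast incremental compile; no assembly rebuild
./bin/spark-shell        # runs against the newly compiled classes
unset SPARK_PREPEND_CLASSES   # restore normal assembly-only behavior
```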

Re: Ask something about spark

2014-09-02 Thread Reynold Xin
I think in general that is fine. It would be great if your slides come with proper attribution. On Tue, Sep 2, 2014 at 3:31 PM, Sanghoon Lee phoenixl...@gmail.com wrote: Hi, I am phoenixlee, a Spark programmer in Korea. I have a good opportunity this time to teach college students

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Reynold Xin
+1 On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian lian.cs@gmail.com wrote: +1 - Tested Thrift server and SQL CLI locally on OSX 10.9. - Checked datanucleus dependencies in distribution tarball built by make-distribution.sh without SPARK_HIVE defined. ​ On Tue, Sep 2, 2014 at

Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Reynold Xin
+1 Tested locally on Mac OS X with local-cluster mode. On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell pwend...@gmail.com wrote: I'll kick it off with a +1 On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as

Re: Ask something about spark

2014-09-03 Thread Reynold Xin
be willing to put some creative commons license information on the site and its content? best, matt On 09/02/2014 06:32 PM, Reynold Xin wrote: I think in general that is fine. It would be great if your slides come with proper attribution. On Tue, Sep 2, 2014 at 3:31 PM, Sanghoon Lee phoenixl

Re: Scala's Jenkins setup looks neat

2014-09-06 Thread Reynold Xin
That would require GitHub hooks permission, and unfortunately ASF Infra wouldn't allow that. Maybe they will change their mind one day, but so far we asked about this and the answer has been no for security reasons. On Saturday, September 6, 2014, Nicholas Chammas nicholas.cham...@gmail.com

Re: Junit spark tests

2014-09-09 Thread Reynold Xin
Can you be a little bit more specific, maybe give a code snippet? On Tue, Sep 9, 2014 at 5:14 PM, Sudershan Malpani sudershan.malp...@gmail.com wrote: Hi all, I am calling an object which in turn is calling a method inside a map RDD in spark. While writing the tests how can I mock that

Re: Reporting serialized task size after task broadcast change?

2014-09-11 Thread Reynold Xin
I don't think so. We should probably add a line to log it. On Thursday, September 11, 2014, Sandy Ryza sandy.r...@cloudera.com wrote: After the change to broadcast all task data, is there any easy way to discover the serialized size of the data getting sent down for a task? thanks, -Sandy

Re: Reporting serialized task size after task broadcast change?

2014-09-11 Thread Reynold Xin
I didn't know about that On Thu, Sep 11, 2014 at 6:29 PM, Sandy Ryza sandy.r...@cloudera.com wrote: It used to be available on the UI, no? On Thu, Sep 11, 2014 at 6:26 PM, Reynold Xin r...@databricks.com wrote: I don't think so. We should probably add a line to log

Re: Reporting serialized task size after task broadcast change?

2014-09-11 Thread Reynold Xin
at 6:33 PM, Reynold Xin r...@databricks.com wrote: I didn't know about that On Thu, Sep 11, 2014 at 6:29 PM, Sandy Ryza sandy.r...@cloudera.com wrote: It used to be available

Re: PSA: SI-8835 (Iterator 'drop' method has a complexity bug causing quadratic behavior)

2014-09-12 Thread Reynold Xin
Thanks for the email, Erik. The Scala collection library implementation is a complicated beast ... On Sat, Sep 6, 2014 at 8:27 AM, Erik Erlandson e...@redhat.com wrote: I tripped over this recently while preparing a solution for SPARK-3250 (efficient sampling): Iterator 'drop' method has a
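The quadratic behavior behind SI-8835 comes from `drop` returning a lazy wrapper around the underlying iterator: each call stacks another layer, so every later element must pass through all of them. Scala's actual slice-based implementation differs in detail; this Python analogue (using `itertools.islice` as the stand-in wrapper) only illustrates the wrapper-stacking pattern:

```python
import itertools

def drop(it, n):
    # Lazy-wrapper drop: returns a view that skips the first n elements.
    # Each call adds another islice layer rather than advancing in place,
    # so k successive drops make every subsequent next() traverse k
    # layers -- O(k^2) work overall instead of O(k).
    return itertools.islice(it, n, None)

it = iter(range(1000))
for _ in range(100):
    it = drop(it, 1)   # 100 stacked wrapper layers

print(next(it))        # 100 -- correct answer, reached through every layer
```

An in-place `drop` that eagerly advances the underlying iterator (or collapses the pending skip count, as the Scala fix does) avoids the stacking entirely.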

Re: Adding abstraction in MLlib

2014-09-12 Thread Reynold Xin
Xiangrui can comment more, but I believe Joseph and he are actually working on a standardized interface and pipeline feature for the 1.2 release. On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov pahomov.e...@gmail.com wrote: Some architect suggestions on this matter -

Re: don't trigger tests when only .md files are changed

2014-09-12 Thread Reynold Xin
I like that idea, but the load on Jenkins isn't very high. The more complexity we add to the test script, the easier it is to screw it up (at some point we would need to add unit tests for the build scripts). Maybe we can just add the message part, so it becomes clear that a pull request does not

Re: Network Communication - Akka or more?

2014-09-17 Thread Reynold Xin
I'm not familiar with Infiniband, but I can chime in on the Spark part. There are two kinds of communications in Spark: control plane and data plane. Task scheduling / dispatching is control, whereas fetching a block (e.g. shuffle) is data. On Tue, Sep 16, 2014 at 4:22 PM, Trident

Re: network.ConnectionManager error

2014-09-17 Thread Reynold Xin
This is during shutdown right? Looks ok to me since connections are being closed. We could've handled this more gracefully, but the logs look harmless. On Wednesday, September 17, 2014, wyphao.2007 wyphao.2...@163.com wrote: Hi, When I run spark job on yarn, and the job finished successfully, but I
