Re: Eliminate copy while sending data : any Akka experts here ?

2014-07-01 Thread Reynold Xin
I was actually talking to tgraves today at the summit about this. Based on my understanding, the sizes we track and send (which are unfortunately O(M*R) regardless of how we change the implementation -- whether we send via task or via MapOutputTracker) are only used to compute maxBytesInFlight

Re: process for contributing to mllib

2014-07-02 Thread Reynold Xin
Yes it would be great to mention the JIRA ticket number on the pull request. Thanks! On Wed, Jul 2, 2014 at 1:01 AM, Eustache DIEMERT wrote: > Hi there, > > I just created an issue [1] for MLlib on Jira. I also want to contribute a > fix, is it a good idea to submit a PR on github [2] ? > > Sh

Re: Eliminate copy while sending data : any Akka experts here ?

2014-07-02 Thread Reynold Xin
On Wed, Jul 2, 2014 at 3:44 AM, Mridul Muralidharan wrote: > > > > > The other thing we do need is the location of blocks. This is actually > just > > O(n) because we just need to know where the map was run. > > For well partitioned data, won't this involve a lot of unwanted > requests to node

Re: Eliminate copy while sending data : any Akka experts here ?

2014-07-03 Thread Reynold Xin
Note that in my original proposal, I was suggesting we could track whether block size = 0 using a compressed bitmap. That way we can still avoid requests for zero-sized blocks. On Thu, Jul 3, 2014 at 3:12 PM, Reynold Xin wrote: > Yes, that number is likely == 0 in any real workl
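For illustration, a minimal sketch of the bitmap idea (this is not the actual MapOutputTracker code; it uses java.util.BitSet, whereas the proposal was a compressed bitmap such as RoaringBitmap):

    import java.util.BitSet

    // Sketch: record which of `numMaps` map outputs produced a non-empty block,
    // so reducers never issue fetch requests for zero-sized blocks.
    class NonEmptyBlockTracker(numMaps: Int) {
      private val nonEmpty = new BitSet(numMaps)

      def markNonEmpty(mapId: Int): Unit = nonEmpty.set(mapId)

      // Only map outputs marked non-empty are worth fetching.
      def shouldFetch(mapId: Int): Boolean = nonEmpty.get(mapId)
    }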

Re: Eliminate copy while sending data : any Akka experts here ?

2014-07-03 Thread Reynold Xin
Yes, that number is likely == 0 in any real workload ... On Thu, Jul 3, 2014 at 8:01 AM, Mridul Muralidharan wrote: > On Thu, Jul 3, 2014 at 11:32 AM, Reynold Xin wrote: > > On Wed, Jul 2, 2014 at 3:44 AM, Mridul Muralidharan > > wrote: > > > >> > >>

Re: Invalid link for Spark 1.0.0 in Official Web Site

2014-07-07 Thread Reynold Xin
Thanks for reporting this. I just fixed it. On Fri, Jul 4, 2014 at 11:14 AM, Kousuke Saruta wrote: > Hi, > > I found there is an invalid link in > . > The link for the release note of Spark 1.0.0 indicates > http://spark.apache.org/releases/spark-release-1.0

Re: Cloudera's Hive on Spark vs AmpLab's Shark

2014-07-08 Thread Reynold Xin
This blog post probably clarifies a lot of things: http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html On Tue, Jul 8, 2014 at 12:24 PM, anishs...@yahoo.co.in < anishs...@yahoo.co.in> wrote: > Hi All > > I read somewhere that Cloudera announce

Re: CPU/Disk/network performance instrumentation

2014-07-09 Thread Reynold Xin
Maybe it's time to create an advanced mode in the ui. On Wed, Jul 9, 2014 at 12:23 PM, Kay Ousterhout wrote: > Hi all, > > I've been doing a bunch of performance measurement of Spark and, as part of > doing this, added metrics that record the average CPU utilization, disk > throughput and utili

Re: MIMA Compatibility Checks

2014-07-10 Thread Reynold Xin
You can take a look at https://github.com/apache/spark/blob/master/dev/run-tests dev/mima On Thu, Jul 10, 2014 at 12:21 AM, Liu, Raymond wrote: > so how to run the check locally? > > On master tree, sbt mimaReportBinaryIssues Seems to lead to a lot of > errors reported. Do we need to modify S

Re: How pySpark works?

2014-07-11 Thread Reynold Xin
Also take a look at this: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals On Fri, Jul 11, 2014 at 10:29 AM, Andrew Or wrote: > Hi Egor, > > Here are a few answers to your questions: > > 1) Python needs to be installed on all machines, but not pyspark. The way > the executors

Re: sparkSQL thread safe?

2014-07-12 Thread Reynold Xin
Ian, The LZFOutputStream's large byte buffer is sort of annoying. It is much smaller if you use the Snappy one. The downside of the Snappy one is slightly less compression (I've seen 10 - 20% larger sizes). If we can find a compression scheme implementation that doesn't do very large buffers, tha

better compression codecs for shuffle blocks?

2014-07-14 Thread Reynold Xin
Hi Spark devs, I was looking into the memory usage of shuffle, and one annoying thing about the default compression codec (LZF) is that the implementation we use allocates buffers pretty generously. I did a simple experiment and found that creating 1000 LZFOutputStreams allocated 198976424 bytes (~190M
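For context, a rough reconstruction of that experiment (a sketch only, assuming the compress-lzf library is on the classpath; the exact numbers depend on JVM and library versions):

    import java.io.ByteArrayOutputStream
    import com.ning.compress.lzf.LZFOutputStream

    object CompressionBufferProbe {
      private def usedMemory(): Long = {
        System.gc()
        val rt = Runtime.getRuntime
        rt.totalMemory() - rt.freeMemory()
      }

      def main(args: Array[String]): Unit = {
        val before = usedMemory()
        // Keep references so the streams (and their internal buffers) stay live.
        val streams = (1 to 1000).map(_ => new LZFOutputStream(new ByteArrayOutputStream()))
        val after = usedMemory()
        println(s"~${(after - before) / (1024 * 1024)} MB held by ${streams.size} streams")
      }
    }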

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Reynold Xin
Copying Jon here since he worked on the lzf library at Ning. Jon - any comments on this topic? On Mon, Jul 14, 2014 at 3:54 PM, Matei Zaharia wrote: > You can actually turn off shuffle compression by setting > spark.shuffle.compress to false. Try that out, there will still be some > buffers fo
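Both knobs mentioned in the thread are ordinary Spark configs, so this is easy to try (a sketch; older 1.x versions take the full codec class name instead of the short name "snappy"):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("shuffle-compression-experiment")
      // Option 1: skip shuffle compression entirely.
      .set("spark.shuffle.compress", "false")
      // Option 2: keep compression but use Snappy, whose streams buffer far less than LZF.
      .set("spark.io.compression.codec", "snappy")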

Re: better compression codecs for shuffle blocks?

2014-07-15 Thread Reynold Xin
> > needed? I admit I have not done much homework to determine if this is > > viable. > > > > -Jon > > > > > > On Mon, Jul 14, 2014 at 4:08 PM, Reynold Xin > wrote: > > > > > Copying Jon here since he worked on the lzf library at Ning. >

small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Reynold Xin
Hi Spark devs, Want to give you guys a heads up that I'm working on a small (but major) change with respect to how task dispatching works. Currently (as of Spark 1.0.1), Spark sends RDD object and closures using Akka along with the task itself to the executors. This is however inefficient because
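The same principle applies at the user level: anything large captured in a closure is serialized into every task, while a broadcast variable is shipped to each executor only once. A hedged sketch (the input path is hypothetical):

    import org.apache.spark.SparkContext

    def example(sc: SparkContext): Unit = {
      val bigLookup: Map[String, Int] = (1 to 1000000).map(i => (s"k$i", i)).toMap

      // Referencing bigLookup directly would ship it with every task;
      // broadcasting it means tasks only carry a small handle.
      val lookupBc = sc.broadcast(bigLookup)

      val total = sc.textFile("hdfs:///path/to/input")
        .map(line => lookupBc.value.getOrElse(line, 0))
        .reduce(_ + _)
      println(total)
    }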

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Reynold Xin
Oops - the pull request should be https://github.com/apache/spark/pull/1452 On Wed, Jul 16, 2014 at 10:06 PM, Reynold Xin wrote: > Hi Spark devs, > > Want to give you guys a heads up that I'm working on a small (but major) > change with respect to how task dispatching works.

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Reynold Xin
tiple map functions, or stuff like that). > But they won't have to broadcast something they only use once. > > Matei > > On Jul 16, 2014, at 10:07 PM, Reynold Xin wrote: > > > Oops - the pull request should be > https://github.com/apache/spark/pull/1452 > > &g

Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Reynold Xin
+1 On Thursday, July 17, 2014, Matei Zaharia wrote: > +1 > > Tested on Mac, verified CHANGES.txt is good, verified several of the bug > fixes. > > Matei > > On Jul 17, 2014, at 11:12 AM, Xiangrui Meng > wrote: > > > I start the voting with a +1. > > > > Ran tests on the release candidates and s

Re: Building Spark against Scala 2.10.1 virtualized

2014-07-18 Thread Reynold Xin
Yes. On Fri, Jul 18, 2014 at 12:50 PM, Meisam Fathi wrote: > Sorry for resurrecting this thread but project/SparkBuild.scala is > completely rewritten recently (after this commit > https://github.com/apache/spark/tree/628932b). Should library > dependencies be defined in pox.xml files after thi

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-19 Thread Reynold Xin
Thanks :) FYI the pull request has been merged and will be part of Spark 1.1.0. On Thu, Jul 17, 2014 at 11:09 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > On Thu, Jul 17, 2014 at 1:23 AM, Stephen Haberman < > stephen.haber...@gmail.com> wrote: > >> I'd be ecstatic if more major

Re: Contributing to Spark needs PySpark build/test instructions

2014-07-21 Thread Reynold Xin
I added an automated testing section: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-AutomatedTesting Can you take a look to see if it is what you had in mind? On Mon, Jul 21, 2014 at 3:54 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: >

Re: Contributing to Spark needs PySpark build/test instructions

2014-07-21 Thread Reynold Xin
r “Contributing Code”. Someone contributing to > PySpark will want to be directed to run something in addition to (or > instead of) sbt/sbt test, I believe. > > Nick > ​ > > > On Mon, Jul 21, 2014 at 11:43 PM, Reynold Xin wrote: > > > I added an automated tes

Re: "Dynamic variables" in Spark

2014-07-21 Thread Reynold Xin
Thanks for the thoughtful email, Neil and Christopher. If I understand this correctly, it seems like the dynamic variable is just a variant of the accumulator (a static one since it is a global object). Accumulators are already implemented using thread-local variables under the hood. Am I misunder

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Reynold Xin
If the purpose is for dropping csv headers, perhaps we don't really need a common drop and only one that drops the first line in a file? I'd really try hard to avoid a common drop/dropWhile because they can be expensive to do. Note that I think we will be adding this functionality (ignoring header
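The workaround most people use in the meantime (a sketch, not a committed API) drops the header by touching only the first partition rather than implementing a general drop:

    import org.apache.spark.rdd.RDD

    // Drop the first line of the first partition only -- cheap compared to a
    // general drop(n), which would need to know partition sizes up front.
    def dropHeader(lines: RDD[String]): RDD[String] =
      lines.mapPartitionsWithIndex { (idx, iter) =>
        if (idx == 0 && iter.hasNext) { iter.next(); iter } else iter
      }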

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Reynold Xin
: > It could make sense to add a skipHeader argument to SparkContext.textFile? > > > On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin wrote: > > > If the purpose is for dropping csv headers, perhaps we don't really need > a > > common drop and only one that drops t

Re: Suggestion for SPARK-1825

2014-07-25 Thread Reynold Xin
Actually reflection is probably a better, lighter weight process for this. An extra project brings more overhead for something simple. On Fri, Jul 25, 2014 at 3:09 PM, Colin McCabe wrote: > So, I'm leaning more towards using reflection for this. Maven profiles > could work, but it's tough s

Re: setting inputMetrics in HadoopRDD#compute()

2014-07-26 Thread Reynold Xin
There is one piece of information that'd be useful to know, which is the source of the input. Even in the presence of an IOException, the input metrics still specifies the task is reading from Hadoop. However, I'm slightly confused by this -- I think usually we'd want to report the number of bytes

Re: setting inputMetrics in HadoopRDD#compute()

2014-07-26 Thread Reynold Xin
oth give an accurate count and allow > us to get metrics while the task is in progress. A hitch is that it relies > on https://issues.apache.org/jira/browse/HADOOP-10688, so we still might > want a fallback for versions of Hadoop that don't have this API. > > > On Sat, Jul

Re: No such file or directory errors running tests

2014-07-27 Thread Reynold Xin
To run through all the tests you'd need to create the assembly jar first. I've seen this asked a few times. Maybe we should make it more obvious. http://spark.apache.org/docs/latest/building-with-maven.html Spark Tests in Maven Tests are run by default via the ScalaTest Maven plugin

Re: No such file or directory errors running tests

2014-07-27 Thread Reynold Xin
skipTests -Phive clean package > mvn -Pyarn -Phadoop-2.3 -Phive test > > AFA documentation, yes adding another sentence to that same "Building with > Maven" page would likely be helpful to future generations. > > > 2014-07-27 19:10 GMT-07:00 Reynold Xin : >

Re: package/assemble with local spark

2014-07-28 Thread Reynold Xin
You can use publish-local in sbt. If you want to be more careful, you can give Spark a different version number and use that version number in your app. On Mon, Jul 28, 2014 at 4:33 AM, Larry Xiao wrote: > Hi, > > How do you package an app with modified spark? > > In seems sbt would resolve t

Re: Github mirroring is running behind

2014-07-28 Thread Reynold Xin
Hi devs, I don't know if this is going to help, but if you can watch & vote on the ticket, it might help ASF INFRA prioritize and triage it faster: https://issues.apache.org/jira/browse/INFRA-8116 Please do. Thanks! On Mon, Jul 28, 2014 at 5:41 PM, Patrick Wendell wrote: > https://issues.apa

Re: pre-filtered hadoop RDD use case

2014-07-29 Thread Reynold Xin
Would something like this help? https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PartitionPruningRDD.scala On Thu, Jul 24, 2014 at 8:40 AM, Eugene Cheipesh wrote: > Hello, > > I have an interesting use case for a pre-filtered RDD. I have two solutions > th
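For reference, a minimal sketch of how PartitionPruningRDD is used (it is a developer API, so treat the exact signature as subject to change):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.PartitionPruningRDD

    def firstTenPartitions(sc: SparkContext): Unit = {
      val data = sc.parallelize(1 to 1000000, numSlices = 100)

      // Keep only partitions 0-9; the other partitions are never scheduled,
      // unlike filter(), which still runs a task for every partition.
      val pruned = PartitionPruningRDD.create(data, partitionIndex => partitionIndex < 10)

      println(pruned.partitions.length) // 10
      println(pruned.count())
    }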

Re: pre-filtered hadoop RDD use case

2014-07-29 Thread Reynold Xin
lementation seems to be in place, and more optimization is desired > beyond just record-oriented execution pipelining. > > > > -Original Message- > From: Reynold Xin [mailto:r...@databricks.com] > Sent: Tuesday, July 29, 2014 12:55 AM > To: dev@spark.apache.org > Subject: Re: pre-f

Re: JIRA content request

2014-07-29 Thread Reynold Xin
+1 on this. On Tue, Jul 29, 2014 at 4:34 PM, Mark Hamstra wrote: > Of late, I've been coming across quite a few pull requests and associated > JIRA issues that contain nothing indicating their purpose beyond a pretty > minimal description of what the pull request does. On the pull request > it

Re: Interested in contributing to GraphX in Python

2014-08-04 Thread Reynold Xin
Thanks for your interest. I think the main challenge is if we have to call Python functions per record, it can be pretty expensive to serialize/deserialize across boundaries of the Python process and JVM process. I don't know if there is a good way to solve this problem yet. On Fri, Aug 1, 2

Re: Tiny curiosity question on closing the jdbc connection

2014-08-05 Thread Reynold Xin
I'm pretty sure it is an oversight. Would you like to submit a pull request to fix that? On Tue, Aug 5, 2014 at 12:14 PM, Stephen Boesch wrote: > Within its compute.close method, the JdbcRDD class has this interesting > logic for closing jdbc connection: > > > try { > if (null !=
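For anyone following along, the fix amounts to guarding the close() calls so cleanup of an already-closed (or never-opened) statement or connection cannot throw or mask the original error; a hedged sketch of the idiom, not the exact JdbcRDD code:

    import java.sql.{Connection, Statement}

    def closeQuietly(stmt: Statement, conn: Connection): Unit = {
      try {
        if (stmt != null && !stmt.isClosed) stmt.close()
      } catch {
        case e: Exception => println(s"Exception closing statement: $e")
      }
      try {
        if (conn != null && !conn.isClosed) conn.close()
      } catch {
        case e: Exception => println(s"Exception closing connection: $e")
      }
    }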

Re: Tiny curiosity question on closing the jdbc connection

2014-08-05 Thread Reynold Xin
king at this code for another reason, not intending to be a > bother ;) > > > > > 2014-08-05 13:03 GMT-07:00 Reynold Xin : > > I'm pretty sure it is an oversight. Would you like to submit a pull >> request to fix that? >> >> >> >> On Tue, Au

Re: Tiny curiosity question on closing the jdbc connection

2014-08-05 Thread Reynold Xin
aking in the case of malformed statements, isn't that > addressed by > > context.addOnCompleteCallback{ () => closeIfNeeded() } > > or am I misunderstanding? > > > On Tue, Aug 5, 2014 at 3:15 PM, Reynold Xin wrote: > > > Thanks. Those are definitely great prob

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-06 Thread Reynold Xin
I don't think it was a conscious design decision to not include the application classes in the connection manager serializer. We should fix that. Where is it deserializing data in that thread? 4 might make sense in the long run, but it adds a lot of complexity to the code base (whole separate code

Re: Unit tests in < 5 minutes

2014-08-08 Thread Reynold Xin
ScalaTest actually has support for parallelization built-in. We can use that. The main challenge is to make sure all the test suites can work in parallel when running along side each other. On Fri, Aug 8, 2014 at 9:47 AM, Ted Yu wrote: > How about using parallel execution feature of maven-sure
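In sbt terms the switch under discussion is roughly the following (a sketch for a build definition; making the suites actually safe to run side by side -- no shared SparkContext, no fixed ports -- is the hard part):

    // In project/SparkBuild.scala or build.sbt -- sketch only.
    parallelExecution in Test := true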

Re: Unit tests in < 5 minutes

2014-08-08 Thread Reynold Xin
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350 On Fri, Aug 8, 2014 at 10:10 AM, Reynold Xin wrote: > ScalaTest actually has support for parallelization built-in. We can use > that. > > The main challenge is to make sure all the test suites can work in > paralle

Re: 1.1.0-SNAPSHOT possible regression

2014-08-08 Thread Reynold Xin
Looks like you didn't actually paste the exception message. Do you mind doing that? On Fri, Aug 8, 2014 at 10:14 AM, Reynold Xin wrote: > Pasting a better formatted trace: > > > > at java.io.ObjectOutputStream.writeObject0(ObjectOutputSt

Re: 1.1.0-SNAPSHOT possible regression

2014-08-08 Thread Reynold Xin
Pasting a better formatted trace:

    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1180)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:137)
    at scala.collection.mutable.HashMa

Re: 1.1.0-SNAPSHOT possible regression

2014-08-08 Thread Reynold Xin
- custom writeObject data (class "scala.collection.mutable.HashMap") > > > > On Friday, August 8, 2014 10:16 AM, Reynold Xin > wrote: > > > Looks like you didn't actually paste the exception message. Do you mind > doing that? >

Re: 1.1.0-SNAPSHOT possible regression

2014-08-08 Thread Reynold Xin
I created a JIRA ticket to track this: https://issues.apache.org/jira/browse/SPARK-2928 Let me know if you need help with it. On Fri, Aug 8, 2014 at 10:40 AM, Reynold Xin wrote: > Yes, I'm pretty sure it doesn't actually use the right serializer in > TorrentBroadcast: >

Re: 1.1.0-SNAPSHOT possible regression

2014-08-08 Thread Reynold Xin
I can compare Spark-1.0.1 code and see what's going on... > > Thanks, > Ron > > > On Friday, August 8, 2014 10:43 AM, Reynold Xin > wrote: > > > I created a JIRA ticket to track this: > https://issues.apache.org/jira/browse/SPARK-2928 > > Let me know if y

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
They only compared their own implementations of a couple of algorithms on different platforms rather than comparing the different platforms themselves (in the case of Spark -- PySpark). I can write two variants of an algorithm on Spark and make them perform drastically differently. I have no doubt if y

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
edR/tree/master/doc/platform> > > Nick > > > On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin wrote: > >> They only compared their own implementations of couple algorithms on >> different platforms rather than comparing the different platforms >> themselves (in th

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
BTW you can find the original Presto (rebranded as Distributed R) paper here: http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Venkataraman.pdf On Wed, Aug 13, 2014 at 2:16 PM, Reynold Xin wrote: > Actually I believe the same person started both projects. > > The Dist

Re: Added support for :cp to the Spark Shell

2014-08-13 Thread Reynold Xin
I haven't read the code yet, but if it is what I think it is, this is SUPER, UBER, HUGELY useful. On a related note, I asked about this on the Scala dev list but never got a satisfactory answer https://groups.google.com/forum/#!msg/scala-internals/_cZ1pK7q6cU/xyBQA0DdcYwJ On Wed, Aug 13, 20

proposal for pluggable block transfer interface

2014-08-13 Thread Reynold Xin
Hi devs, I posted a design doc proposing an interface for pluggable block transfer (used in shuffle, broadcast, block replication, etc). This is expected to be done in 1.2 time frame. It should make our code base cleaner, and enable us to provide alternative implementations of block transfers (e.

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-14 Thread Reynold Xin
process before the first task is >>> > received and therefore before any user jars are downloaded. As this PR >>> > adds user jars to the Executor process at launch time, this won't be an >>> > issue. >>> > >>> > >>> > On 7

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-14 Thread Reynold Xin
is before any tasks are received by the > executor. The above approach wouldn't help with this problem. > Additionally, the YARN scheduler currently uses this approach of adding > the application jar to the Executor classpath, so it would make things a > bit more uniform. >

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-14 Thread Reynold Xin
f shipping custom *serializers* (not kryo registrators) in > user jars. > > On 14 August 2014 19:23, Reynold Xin wrote: > >> Graham, >> >> SparkEnv only creates a KryoSerializer, but as I understand that >> serializer doesn't actually initializes the registr

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-14 Thread Reynold Xin
Here: https://github.com/apache/spark/pull/1948 On Thu, Aug 14, 2014 at 5:45 PM, Debasish Das wrote: > Is there a fix that I can test ? I have the flows setup for both > standalone and YARN runs... > > Thanks. > Deb > > > > On Thu, Aug 14, 2014 at 10:59 AM, Reyn

Re: Too late to contribute for 1.1.0?

2014-08-21 Thread Reynold Xin
I believe docs changes can go in anytime (because we can just publish new versions of docs). Critical bug fixes can still go in too. On Thu, Aug 21, 2014 at 11:43 PM, Evan Chan wrote: > I'm hoping to get in some doc enhancements and small bug fixes for Spark > SQL. > > Also possibly a small ne

Re: Spark Contribution

2014-08-22 Thread Reynold Xin
Great idea. Added the link https://github.com/apache/spark/blob/master/README.md On Thu, Aug 21, 2014 at 4:06 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > We should add this link to the readme on GitHub btw. > > 2014년 8월 21일 목요일, Henry Saputra님이 작성한 메시지: > > > The Apache Spark wi

Re: Adding support for a new object store

2014-08-27 Thread Reynold Xin
Hi Rajendran, I'm assuming you have some concept of schema and you are intending to integrate with SchemaRDD instead of normal RDDs. More responses inline below. On Fri, Aug 22, 2014 at 2:21 AM, Rajendran Appavu wrote: > > I am new to Spark source code and looking to see if i can add push-do

Re: Adding support for a new object store

2014-08-27 Thread Reynold Xin
Linking to the JIRA tracking APIs to hook into the planner: https://issues.apache.org/jira/browse/SPARK-3248 On Wed, Aug 27, 2014 at 1:56 PM, Reynold Xin wrote: > Hi Rajendran, > > I'm assuming you have some concept of schema and you are intending to > integrate with Sch

Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT

2014-08-28 Thread Reynold Xin
Thanks for doing this, Shane. On Thursday, August 28, 2014, shane knapp wrote: > all clear: jenkins and all plugins have been updated! > > > On Thu, Aug 28, 2014 at 7:51 AM, shane knapp > wrote: > > > jenkins is upgraded, but a few jobs sneaked in before i could do the > > plugin updates. i'v

Fwd: Partitioning strategy changed in Spark 1.0.x?

2014-08-30 Thread Reynold Xin
Sending the response back to the dev list so this is indexable and searchable by others. -- Forwarded message -- From: Milos Nikolic Date: Sat, Aug 30, 2014 at 5:50 PM Subject: Re: Partitioning strategy changed in Spark 1.0.x? To: Reynold Xin Thank you, your insights were very

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Reynold Xin
Welcome, Shane! On Tuesday, September 2, 2014, shane knapp wrote: > so, i had a meeting w/the databricks guys on friday and they recommended i > send an email out to the list to say 'hi' and give you guys a quick intro. > :) > > hi! i'm shane knapp, the new AMPLab devops engineer, and will be

Re: about spark assembly jar

2014-09-02 Thread Reynold Xin
Having an SSD helps tremendously with assembly time. Without that, you can do the following in order for Spark to pick up the compiled classes before assembly at runtime: export SPARK_PREPEND_CLASSES=true On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza wrote: > This doesn't help for every dependency

Re: Ask something about spark

2014-09-02 Thread Reynold Xin
I think in general that is fine. It would be great if your slides come with proper attribution. On Tue, Sep 2, 2014 at 3:31 PM, Sanghoon Lee wrote: > Hi, I am phoenixlee and a Spark programmer in Korea. > > And be a good chance this time, it tries to teach college students and > office workers

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Reynold Xin
+1 On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian wrote: > +1 > >- Tested Thrift server and SQL CLI locally on OSX 10.9. >- Checked datanucleus dependencies in distribution tarball built by >make-distribution.sh without SPARK_HIVE defined. > > ​ > > > On Tue, Sep 2, 2014 at 2:30 PM, Wil

Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Reynold Xin
+1 Tested locally on Mac OS X with local-cluster mode. On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell wrote: > I'll kick it off with a +1 > > On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell > wrote: > > Please vote on releasing the following candidate as Apache Spark version > 1.1.0! > >

Re: Ask something about spark

2014-09-03 Thread Reynold Xin
ng to put some creative commons license > information on the site and its content? > > best, > > > matt > > > On 09/02/2014 06:32 PM, Reynold Xin wrote: > >> I think in general that is fine. It would be great if your slides come >> with >> proper attri

Re: Scala's Jenkins setup looks neat

2014-09-06 Thread Reynold Xin
that would require github hooks permission and unfortunately asf infra wouldn't allow that. Maybe they will change their mind one day, but so far we asked about this and the answer has been no for security reasons. On Saturday, September 6, 2014, Nicholas Chammas wrote: > After reading Erik's e

Re: Junit spark tests

2014-09-09 Thread Reynold Xin
Can you be a little bit more specific, maybe give a code snippet? On Tue, Sep 9, 2014 at 5:14 PM, Sudershan Malpani < sudershan.malp...@gmail.com> wrote: > Hi all, > > I am calling an object which in turn is calling a method inside a map RDD > in spark. While writing the tests how can I mock tha

Re: Reporting serialized task size after task broadcast change?

2014-09-11 Thread Reynold Xin
I don't think so. We should probably add a line to log it. On Thursday, September 11, 2014, Sandy Ryza wrote: > After the change to broadcast all task data, is there any easy way to > discover the serialized size of the data getting sent down for a task? > > thanks, > -Sandy >

Re: Reporting serialized task size after task broadcast change?

2014-09-11 Thread Reynold Xin
I didn't know about that On Thu, Sep 11, 2014 at 6:29 PM, Sandy Ryza wrote: > It used to be available on the UI, no? > > On Thu, Sep 11, 2014 at 6:26 PM, Reynold Xin wrote: > > > I don't think so. We should probably add a line to log it. > > > > &g

Re: Reporting serialized task size after task broadcast change?

2014-09-11 Thread Reynold Xin
14 at 6:33 PM, Reynold Xin > wrote: > >> I didn't know about that >> >> On Thu, Sep 11, 2014 at 6:29 PM, Sandy Ryza > > wrote: >> >>> It used to be available on the UI, no? >>> >>> On Thu, Sep 11, 2014 at 6:26 PM, R

Re: PSA: SI-8835 (Iterator 'drop' method has a complexity bug causing quadratic behavior)

2014-09-12 Thread Reynold Xin
Thanks for the email, Erik. The Scala collection library implementation is a complicated beast ... On Sat, Sep 6, 2014 at 8:27 AM, Erik Erlandson wrote: > I tripped over this recently while preparing a solution for SPARK-3250 > (efficient sampling): > > Iterator 'drop' method has a complexity
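Until the upstream fix lands, the usual workaround is to advance the underlying iterator manually instead of calling drop(); a sketch:

    // Advance an iterator by n elements without Iterator.drop, which (per
    // SI-8835) can layer wrappers so that repeated drops become quadratic.
    def fastDrop[T](it: Iterator[T], n: Int): Iterator[T] = {
      var i = 0
      while (i < n && it.hasNext) {
        it.next()
        i += 1
      }
      it
    }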

Re: Adding abstraction in MLlib

2014-09-12 Thread Reynold Xin
Xiangrui can comment more, but I believe he and Joseph are actually working on a standardized interface and pipeline feature for the 1.2 release. On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov wrote: > Some architect suggestions on this matter - > https://github.com/apache/spark/pull/2371 > > 2014-09-1

Re: don't trigger tests when only .md files are changed

2014-09-12 Thread Reynold Xin
I like that idea, but the load on Jenkins isn't very high. The more complexity we add to the test script, the easier it is to screw it up (at some point we would need to add unit tests for the build scripts). Maybe we can just add the message part, so it becomes clear that a pull request does not

Re: Adding abstraction in MLlib

2014-09-15 Thread Reynold Xin
ce and discuss them on the JIRA >> >> before submitting PRs. >> >> >> >> For performance tests, there is a spark-perf package >> >> (https://github.com/databricks/spark-perf) and we added performance >> >> tests for MLlib in v1.1. But defini

Re: Network Communication - Akka or more?

2014-09-17 Thread Reynold Xin
I'm not familiar with Infiniband, but I can chime in on the Spark part. There are two kinds of communications in Spark: control plane and data plane. Task scheduling / dispatching is control, whereas fetching a block (e.g. shuffle) is data. On Tue, Sep 16, 2014 at 4:22 PM, Trident wrote: > Th

Re: Workflow Scheduler for Spark

2014-09-17 Thread Reynold Xin
Hi Egor, I think the design doc for the pipeline feature has been posted. For the workflow, I believe Oozie actually works fine with Spark if you want some external workflow system. Do you have any trouble using that? On Tue, Sep 16, 2014 at 11:45 PM, Egor Pahomov wrote: > There are two thing

Re: network.ConnectionManager error

2014-09-17 Thread Reynold Xin
This is during shutdown, right? Looks ok to me since connections are being closed. We could've handled this more gracefully, but the logs look harmless. On Wednesday, September 17, 2014, wyphao.2007 wrote: > Hi, When I run spark job on yarn,and the job finished success,but I found > there are som

Re: Workflow Scheduler for Spark

2014-09-17 Thread Reynold Xin
>> > >> Reunold, can you help finding this doc? Do you mean just pipelining > spark > >> code or additional logic of persistence tasks, job server, task retry, > >> data > >> availability and extra? > >> > >> > >> 2014-09-17

Re: Eliminate copy while sending data : any Akka experts here ?

2014-09-20 Thread Reynold Xin
at maxBytesInFlight actually) and probably existing to track > non zero should be fine (we should not really track zero output for > reducer - just waste of space). > > > Regards, > Mridul > > On Fri, Jul 4, 2014 at 3:43 AM, Reynold Xin wrote: > > Note that in my origin

Re: BlockManager issues

2014-09-21 Thread Reynold Xin
It seems like you just need to raise the ulimit? On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi wrote: > Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of the > workloads. Tried tracing the problem through change set analysis. Looks > like the offending commit is 4fde28c from

Re: Question about SparkSQL and Hive-on-Spark

2014-09-23 Thread Reynold Xin
On Tue, Sep 23, 2014 at 12:47 AM, Yi Tian wrote: > Hi all, > > I have some questions about the SparkSQL and Hive-on-Spark > > Will SparkSQL support all the hive feature in the future? or just making > hive as a datasource of Spark? > Most likely not *ALL* Hive features, but almost all common fea

Re: thank you for reviewing our patches

2014-09-26 Thread Reynold Xin
Keep the patches coming :) On Fri, Sep 26, 2014 at 1:50 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > I recently came across this mailing list post by Linus Torvalds > about the value of reviewing even > “trivial” patches. The following passag

Spark meetup on Oct 15 in NYC

2014-09-28 Thread Reynold Xin
Hi Spark users and developers, Some of the most active Spark developers (including Matei Zaharia, Michael Armbrust, Joseph Bradley, TD, Paco Nathan, and me) will be in NYC for Strata NYC. We are working with the Spark NYC meetup group and Bloomberg to host a meetup event. This might be the event w

Re: FYI: i've doubled the jenkins executors for every build node

2014-09-29 Thread Reynold Xin
Thanks. We might see more failures due to contention on resources. Fingers crossed ... At some point it might make sense to run the tests in a VM or container. On Mon, Sep 29, 2014 at 2:20 PM, shane knapp wrote: > we were running at 8 executors per node, and BARELY even stressing the > machine

Re: Extending Scala style checks

2014-10-01 Thread Reynold Xin
There is scalariform but it can be disruptive. Last time I ran it on Spark it didn't compile due to some xml interpolation problem. On Wednesday, October 1, 2014, Nicholas Chammas wrote: > Does anyone know if Scala has something equivalent to autopep8 > ? I

Re: Unneeded branches/tags

2014-10-07 Thread Reynold Xin
Those branches are no longer active. However, I don't think we can delete branches from github due to the way ASF mirroring works. I might be wrong there. On Tue, Oct 7, 2014 at 6:25 PM, Nicholas Chammas wrote: > Just curious: Are there branches and/or tags on the repo that we don’t need > any

Re: Extending Scala style checks

2014-10-08 Thread Reynold Xin
Thanks. I added one. On Wed, Oct 8, 2014 at 8:49 AM, Nicholas Chammas wrote: > I've created SPARK-3849: Automate remaining Scala style rules > . > > Please create sub-tasks on this issue for rules that we have not automated > and let's work thro

Re: Scalastyle improvements / large code reformatting

2014-10-12 Thread Reynold Xin
I actually think we should just bite the bullet and follow through with the reformatting. Many rules are simply not possible to enforce only on deltas (e.g. import ordering). That said, maybe there are better windows to do this, e.g. during the QA period. On Sun, Oct 12, 2014 at 9:37 PM, Josh Rosen

Re: accumulators

2014-10-17 Thread Reynold Xin
is to have pagination of these and always sort them by the last update time. --  Reynold Xin On October 16, 2014 at 12:11:00 PM, Sean McNamara (sean.mcnam...@webtrends.com) wrote: Accumulators on the stage info page show the rolling life time value of accumulators as well as per task which is

Re: Get attempt number in a closure

2014-10-20 Thread Reynold Xin
I also ran into this earlier. It is a bug. Do you want to file a jira? I think part of the problem is that we don't actually have the attempt id on the executors. If we do, that's great. If not, we'd need to propagate that over. On Mon, Oct 20, 2014 at 7:17 AM, Yin Huai wrote: > Hello, > > Is t

Re: Get attempt number in a closure

2014-10-20 Thread Reynold Xin
rk, we just use a new taskId with the >>> same index. >>> >>> On Mon, Oct 20, 2014 at 12:38 PM, Yin Huai >>> wrote: >>> > Yeah, seems we need to pass the attempt id to executors through >>> > TaskDescription. I have created >>> &g

Re: Building and Running Spark on OS X

2014-10-20 Thread Reynold Xin
I usually use SBT on Mac and that one doesn't require any setup ... On Mon, Oct 20, 2014 at 4:43 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > If one were to put together a short but comprehensive guide to setting up > Spark to run locally on OS X, would it look like this? > > # In

Re: Kryo docs: do we include twitter/chill by default?

2014-02-24 Thread Reynold Xin
We do include Chill by default. It's a good idea to update the doc to include chill. On Mon, Feb 24, 2014 at 7:55 PM, Andrew Ash wrote: > Spark devs, > > I picked up somewhere that the Spark 0.9.0 release included Twitter's chill > library of default-registered Kryo serialization classes. Is t

Re: How to run a single test suite?

2014-02-26 Thread Reynold Xin
You put your quotes in the wrong place. See https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools On Wed, Feb 26, 2014 at 10:04 PM, Bryn Keller wrote: > Hi Folks, > > I've tried using "sbt test-only '*PairRDDFunctionsSuite'" to run only that > test suite, which is what I thi
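In other words, quote the whole thing so sbt receives test-only and the pattern as a single command (assuming the suite name matches):

    sbt "test-only *PairRDDFunctionsSuite"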

Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Reynold Xin
+1 On Mon, Mar 11 2024 at 7:38 PM, Jungtaek Lim < kabhwan.opensou...@gmail.com > wrote: > > +1 (non-binding), thanks Gengliang! > > > On Mon, Mar 11, 2024 at 5:46 PM Gengliang Wang < ltn...@gmail.com > wrote: > > > >> Hi all, >> >> I'd like to start the vote for SPIP: Structured Logging F

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Reynold Xin
One of the problems in the past when something like this was brought up was that the ASF couldn't have officially blessed venues beyond the already approved ones. So that's something to look into. Now of course you are welcome to run unofficial things unblessed as long as they follow trademark r

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Reynold Xin
+1 On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale wrote: > +1 > > On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun > wrote: > >> FYI, there is a proposal to drop Python 3.8 because its EOL is October >> 2024. >> >> https://github.com/apache/spark/pull/46228 >> [SPARK-47993][PYTHON] Drop Python 3.8
