Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Andrew Ash
was resolved > yesterday and tests have been quite healthy throughout this week and the > last. I'll cut the new RC as soon as the remaining blocker (SPARK-23202 > <https://issues.apache.org/jira/browse/SPARK-23202>) is resolved. > > > On 30 January 2018 at 10:12, Andr

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-30 Thread Andrew Ash
I'd like to nominate SPARK-23274 as a potential blocker for the 2.3.0 release as well, due to being a regression from 2.2.0. The ticket has a simple repro included, showing a query that works in prior releases but now fails with an exception in t

Re: Kubernetes: why use init containers?

2018-01-12 Thread Andrew Ash
+1 on the first release being marked experimental. Many major features coming into Spark in the past have gone through a stabilization process On Fri, Jan 12, 2018 at 1:18 PM, Marcelo Vanzin wrote: > BTW I most probably will not have time to get back to this at any time > soon, so if anyone is

Re: Kubernetes: why use init containers?

2018-01-10 Thread Andrew Ash
It seems we have two standard practices for resource distribution in place here: - the Spark way is that the application (Spark) distributes the resources *during* app execution, and does this by exposing files/jars on an http server on the driver (or pre-staged elsewhere), and executors downloadi

Re: Palantir release under org.apache.spark?

2018-01-09 Thread Andrew Ash
That source repo is at https://github.com/palantir/spark/ with artifacts published to Palantir's bintray at https://palantir.bintray.com/releases/org/apache/spark/ If you're seeing any of them in Maven Central please flag, as that's a mistake! Andrew On Tue, Jan 9, 2018 at 10:10 AM, Sean Owen w

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Andrew Ash
+0 (non-binding) I think there are benefits to unifying all the Spark-internal datasources into a common public API for sure. It will serve as a forcing function to ensure that those internal datasources aren't advantaged vs datasources developed externally as plugins to Spark, and that all Spark

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Andrew Ash
+1 (non-binding) We're moving large amounts of infrastructure from a combination of open source and homegrown cluster management systems to unify on Kubernetes and want to bring Spark workloads along with us. On Tue, Aug 15, 2017 at 2:29 PM, liyinan926 wrote: > +1 (non-binding) > > > > -- > Vie

Re: Use Apache ORC in Apache Spark 2.3

2017-08-10 Thread Andrew Ash
ORC > codes. > And, Spark without `-Phive` can use ORC like Parquet. > This is one milestone for `Feature parity for ORC with Parquet > (SPARK-20901)`. > Bests, > Dongjoon > *From: *Reynold Xin > *Date:

Re: Use Apache ORC in Apache Spark 2.3

2017-08-10 Thread Andrew Ash
I would support moving ORC from sql/hive -> sql/core because it brings me one step closer to eliminating Hive from my Spark distribution by removing -Phive at build time. On Thu, Aug 10, 2017 at 9:48 AM, Dong Joon Hyun wrote: > Thank you again for coming and reviewing this PR. > > > > So far, we

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-28 Thread Andrew Ash
-1 due to regression from 2.1.1 In 2.2.0-rc1 we bumped the Parquet version from 1.8.1 to 1.8.2 in commit 26a4cba3ff. Parquet 1.8.2 includes a backport from 1.9.0: PARQUET-389 in commit 2282c22c

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Andrew Ash
Spark 2.x has to be the time for Java 8. I'd rather increase JVM major version on a Spark major version than on a Spark minor version, and I'd rather Spark do that upgrade for the 2.x series than the 3.x series (~2yr from now based on the lifetime of Spark 1.x). If we wait until the next opportun

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-25 Thread Andrew Ash
I would guess that many tickets targeted at 1.4.1 were set that way during the tail end of the 1.4.0 voting process as people realized they wouldn't make the .0 release in time. In that case, they were likely aiming for a 1.4.x release, not necessarily 1.4.1 specifically. Maybe creating a "1.4.x"

Re: DataFrame.withColumn very slow when used iteratively?

2015-06-02 Thread Andrew Ash
Would it be valuable to create a .withColumns([colName], [ColumnObject]) method that adds in bulk rather than iteratively? Alternatively, effort might be better spent making the singular .withColumn() faster. On Tue, Jun 2, 2015 at 3:46 PM, Reynold Xin wrote: > We improved this in 1.4. Adding 100

Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-09 Thread Andrew Ash
Does the Apache project team have any ability to measure download counts of the various releases? That data could be useful when it comes time to sunset vendor-specific releases, like CDH4 for example. On Mon, Mar 9, 2015 at 5:34 AM, Mridul Muralidharan wrote: > In ideal situation, +1 on removi

Re: Block Transfer Service encryption support

2015-03-08 Thread Andrew Ash
I'm interested in seeing this data transfer occurring over encrypted communication channels as well. Many customers require that all network transfer occur encrypted to prevent the "soft underbelly" that's often found inside a corporate network. On Fri, Mar 6, 2015 at 4:20 PM, turp1twin wrote:

Re: Streaming partitions to driver for use in .toLocalIterator

2015-02-24 Thread Andrew Ash
se cases. > > > >An alternative would be to write your RDD to some other data store (eg, > >hdfs) which has better support for reading data in a streaming fashion, > >though you would probably be unhappy with the overhead. > > > > > > > >On Wed, Feb

Streaming partitions to driver for use in .toLocalIterator

2015-02-18 Thread Andrew Ash
Hi Spark devs, I'm creating a streaming export functionality for RDDs and am having some trouble with large partitions. The RDD.toLocalIterator() call pulls over a partition at a time to the driver, and then streams the RDD out from that partition before pulling in the next one. When you have la

Re: Pull Requests on github

2015-02-09 Thread Andrew Ash
Sam, I see your PR was merged -- many thanks for sending it in and getting it merged! In general for future reference, the most effective way to contribute is outlined on this wiki page: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark On Mon, Feb 9, 2015 at 1:04 AM, Akhil

Re: talk on interface design

2015-01-26 Thread Andrew Ash
In addition to the references you have at the end of the presentation, there's a great set of practical examples based on the learnings from Qt posted here: http://www21.in.tum.de/~blanchet/api-design.pdf Chapter 4's way of showing a principle and then an example from Qt is particularly instructio

Re: Join implementation in SparkSQL

2015-01-15 Thread Andrew Ash
What Reynold is describing is a performance optimization in implementation, but the semantics of the join (cartesian product plus relational algebra filter) should be the same and produce the same results. On Thu, Jan 15, 2015 at 1:36 PM, Reynold Xin wrote: > It's a bunch of strategies defined h

Maintainer for Mesos

2015-01-05 Thread Andrew Ash
Hi Spark devs, I'm interested in having a committer look at a PR [1] for Mesos, but there's not an entry for Mesos in the maintainers specialties on the wiki [2]. Which Spark committers have expertise in the Mesos features? Thanks! Andrew [1] https://github.com/apache/spark/pull/3074 [2] https

Re: More general submitJob API

2014-12-22 Thread Andrew Ash
Hi Alex, SparkContext.submitJob() is marked as experimental -- most client programs shouldn't be using it. What are you looking to do? For multiplexing jobs, one thing you can do is have multiple threads in your client JVM each submit jobs on your SparkContext. This is described here in the
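
The multi-threaded client pattern described above can be sketched with a plain thread pool. This is a generic illustration, not Spark API: `runJob` is a hypothetical stand-in for an action invoked against a shared SparkContext, whose job-submission methods are safe to call from multiple threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentJobs {
    // Hypothetical stand-in for one logical "job": in a real client this
    // callable would invoke an action (collect, count, ...) on a shared
    // SparkContext rather than compute a square.
    static int runJob(int jobId) {
        return jobId * jobId;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> results = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            final int id = i;
            results.add(pool.submit(() -> runJob(id))); // jobs run concurrently
        }
        int total = 0;
        for (Future<Integer> f : results) {
            total += f.get(); // block until that job finishes
        }
        pool.shutdown();
        System.out.println(total); // 0 + 1 + 4 + 9 = 14
    }
}
```

Each thread's action blocks until its own job completes, so a pool of N threads keeps up to N jobs in flight at once.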

Re: Announcing Spark Packages

2014-12-22 Thread Andrew Ash
Hi Xiangrui, That link is currently returning a 503 Over Quota error message. Would you mind pinging back out when the page is back up? Thanks! Andrew On Mon, Dec 22, 2014 at 12:37 PM, Xiangrui Meng wrote: > Dear Spark users and developers, > > I’m happy to announce Spark Packages (http://spa

Re: Spark JIRA Report

2014-12-15 Thread Andrew Ash
this list. >> >> If you already spend a good amount of time cleaning up on JIRA, then this >> report won't be that relevant to you. But given the number and growth of >> open issues on our tracker, I suspect we could do with quite a few more >> people chipping in and

Re: Spark JIRA Report

2014-12-13 Thread Andrew Ash
The goal of increasing visibility on open issues is a good one. How is this different from just a link to Jira though? Some might say this adds noise to the mailing list and doesn't contain any information not already available in Jira. The idea seems good but the formatting leaves a little to b

Governance of the Jenkins whitelist

2014-12-13 Thread Andrew Ash
Jenkins is a really valuable tool for increasing quality of incoming patches to Spark, but I've noticed that there are often a lot of patches waiting for testing because they haven't been approved for testing. Certain users can instruct Jenkins to run on a PR, or add other users to a whitelist. Ho

Re: Tachyon in Spark

2014-12-11 Thread Andrew Ash
I'm interested in understanding this as well. One of the main ways Tachyon is supposed to realize performance gains without sacrificing durability is by storing the lineage of data rather than full copies of it (similar to Spark). But if Spark isn't sending lineage information into Tachyon, then

Re: Regarding RecordReader of spark

2014-11-16 Thread Andrew Ash
Filed as https://issues.apache.org/jira/browse/SPARK-4437 On Sun, Nov 16, 2014 at 4:49 PM, Reynold Xin wrote: > I don't think the code is immediately obvious. > > Davies - I think you added the code, and Josh reviewed it. Can you guys > explain and maybe submit a patch to add more documentation

Re: Is there a way for scala compiler to catch unserializable app code?

2014-11-16 Thread Andrew Ash
Hi Jay, I just came across SPARK-720 Statically guarantee serialization will succeed which sounds like exactly what you're referring to. Like Reynold I think it's not possible at this time but it would be good to get your feedback on that ticket.

Raise Java dependency from 6 to 7

2014-10-17 Thread Andrew Ash
Hi Spark devs, I've heard a few times that keeping support for Java 6 is a priority for Apache Spark. Given that Java 6 has been publicly EOL'd since Feb 2013 and the last public update was Apr 2013

Re: Spark on Mesos 0.20

2014-10-05 Thread Andrew Ash
Hi Gurvinder, Is there a SPARK ticket tracking the issue you describe? On Mon, Oct 6, 2014 at 2:44 AM, Gurvinder Singh wrote: > On 10/06/2014 08:19 AM, Fairiz Azizi wrote: > > The Spark online docs indicate that Spark is compatible with Mesos 0.18.1 > > > > I've gotten it to work just fine on 0

Re: Parquet schema migrations

2014-10-05 Thread Andrew Ash
Hi Cody, I wasn't aware there were different versions of the parquet format. What's the difference between "raw parquet" and the Hive-written parquet files? As for your migration question, the approaches I've often seen are convert-on-read and convert-all-at-once. Apache Cassandra for example d

Re: BlockManager issues

2014-09-22 Thread Andrew Ash
Another data point on the 1.1.0 FetchFailures: Running this SQL command works on 1.0.2 but fails on 1.1.0 due to the exceptions mentioned earlier in this thread: "SELECT stringCol, SUM(doubleCol) FROM parquetTable GROUP BY stringCol" The FetchFailure exception has the remote block manager that fa

Re: PARSING_ERROR from kryo

2014-09-15 Thread Andrew Ash
On Mon, Sep 15, 2014 at 2:10 PM, Ankur Dave wrote: > At 2014-09-15 08:59:48 -0700, Andrew Ash wrote: > > I'm seeing the same exception now on the Spark 1.1.0 release. Did you > ever > > get this figured out? > > > > [...] > > > > On Thu, Aug 21,

Re: PARSING_ERROR from kryo

2014-09-15 Thread Andrew Ash
Hi npanj, I'm seeing the same exception now on the Spark 1.1.0 release. Did you ever get this figured out? Andrew On Thu, Aug 21, 2014 at 2:14 PM, npanj wrote: > Hi All, > > I am getting PARSING_ERROR while running my job on the code checked out up > to commit# db56f2df1b8027171da1b8d2571d1f2

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Andrew Ash
FWIW we use CDH4 extensively and would very much appreciate having a prebuilt version of Spark for it. We're doing a CDH 4.4 to 4.7 upgrade across all the clusters now and have plans for a 5.x transition after that. On Aug 28, 2014 11:57 PM, "Sean Owen" wrote: > On Fri, Aug 29, 2014 at 7:42 AM,

Re: take() reads every partition if the first one is empty

2014-08-25 Thread Andrew Ash
Filed as https://issues.apache.org/jira/browse/SPARK-3211 On Fri, Aug 22, 2014 at 1:06 PM, Andrew Ash wrote: > Yep, anyone can create a bug at > https://issues.apache.org/jira/browse/SPARK > > Then if you make a pull request on GitHub and have the bug number in the > header li

Re: take() reads every partition if the first one is empty

2014-08-22 Thread Andrew Ash
Yep, anyone can create a bug at https://issues.apache.org/jira/browse/SPARK Then if you make a pull request on GitHub and have the bug number in the header like "[SPARK-1234] Make take() less OOM-prone", then the PR gets linked to the Jira ticket. I think that's the best way to get feedback on a

Re: take() reads every partition if the first one is empty

2014-08-22 Thread Andrew Ash
Hi Paul, I agree that jumping straight from reading N rows from 1 partition to N rows from ALL partitions is pretty aggressive. The exponential growth strategy of doubling the partition count every time seems better -- 1, 2, 4, 8, 16, ... will be much more likely to prevent OOMs than the 1 -> ALL
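
The doubling strategy suggested above (1, 2, 4, 8, ... partitions per round) can be sketched as a small helper. This is a simplified illustration of the growth schedule only; a real take() would also stop scanning early once enough rows have been collected.

```java
import java.util.ArrayList;
import java.util.List;

public class TakeStrategy {
    /**
     * Number of partitions scanned in each round under a doubling
     * strategy: 1, 2, 4, ... until all partitions are covered.
     * (Sketch only: early exit on "enough rows" is omitted.)
     */
    static List<Integer> rounds(int totalPartitions) {
        List<Integer> scanned = new ArrayList<>();
        int covered = 0;
        int next = 1;
        while (covered < totalPartitions) {
            int batch = Math.min(next, totalPartitions - covered);
            scanned.add(batch);
            covered += batch;
            next *= 2; // double the attempt size each round
        }
        return scanned;
    }

    public static void main(String[] args) {
        System.out.println(rounds(16)); // [1, 2, 4, 8, 1]
    }
}
```

Compared with the 1 -> ALL jump, the worst case is only a few extra rounds (logarithmic in the partition count) while the common case touches far fewer partitions.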

Hang on Executor classloader lookup for the remote REPL URL classloader

2014-08-20 Thread Andrew Ash
Hi Spark devs, I'm seeing a stacktrace where the classloader that reads from the REPL is hung, and blocking all progress on that executor. Below is that hung thread's stacktrace, and also the stacktrace of another hung thread. I thought maybe there was an issue with the REPL's JVM on the other s

FileNotFoundException with _temporary in the name

2014-08-12 Thread Andrew Ash
Hi Spark devs, Several people on the mailing list have seen issues with FileNotFoundExceptions related to _temporary in the name. I've personally observed this several times, as have a few of my coworkers on various Spark clusters. Any ideas what might be going on? I've collected the various st

Re: Exception in Spark 1.0.1: com.esotericsoftware.kryo.KryoException: Buffer underflow

2014-08-01 Thread Andrew Ash
ng Guava 14... are you using Guava 16 in your > user app (i.e. you inverted the versions in your earlier e-mail)? > > - Patrick > > > On Fri, Aug 1, 2014 at 4:15 PM, Colin McCabe > wrote: > > > On Fri, Aug 1, 2014 at 2:45 PM, Andrew Ash wrote: > > > After seve

Re: Exception in Spark 1.0.1: com.esotericsoftware.kryo.KryoException: Buffer underflow

2014-08-01 Thread Andrew Ash
compatibility On Thu, Jul 31, 2014 at 10:47 AM, Andrew Ash wrote: > Hi everyone, > > I'm seeing the below exception coming out of Spark 1.0.1 when I call it > from my application. I can't share the source to that application, but the > quick gist is that it uses Spark's

Exception in Spark 1.0.1: com.esotericsoftware.kryo.KryoException: Buffer underflow

2014-07-31 Thread Andrew Ash
Hi everyone, I'm seeing the below exception coming out of Spark 1.0.1 when I call it from my application. I can't share the source to that application, but the quick gist is that it uses Spark's Java APIs to read from Avro files in HDFS, do processing, and write back to Avro files. It does this

Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-27 Thread Andrew Ash
Is that a regression since 1.0.0? On Jul 27, 2014 10:43 AM, "witgo" wrote: > -1 > The following bug should be fixed: > https://issues.apache.org/jira/browse/SPARK-2677‍ > > > > > > -- Original -- > From: "Tathagata Das";; > Date: Sat, Jul 26, 2014 07:08 AM > To:

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Andrew Ash
Personally I'd find the method useful -- I've often had a .csv file with a header row that I want to drop so filter it out, which touches all partitions anyway. I don't have any comments on the implementation quite yet though. On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson wrote: > A few week

Re: Hadoop's Configuration object isn't threadsafe

2014-07-16 Thread Andrew Ash
> > - Patrick > > On Wed, Jul 16, 2014 at 10:24 PM, Andrew Ash wrote: > > Hi Patrick, thanks for taking a look. I filed as > > https://issues.apache.org/jira/browse/SPARK-2546 > > > > Would you recommend I pursue the cloned Configuration object approach now &

Re: Hadoop's Configuration object isn't threadsafe

2014-07-16 Thread Andrew Ash
er" conflicts where we had multiple calls mutating the same object > at the same time. It won't deal with "reader writer" conflicts where > some of our initialization code touches state that is needed during > normal execution of other tasks. > > - Patrick

Re: Hadoop's Configuration object isn't threadsafe

2014-07-15 Thread Andrew Ash
threads). > > -Shengzhe > > > On Mon, Jul 14, 2014 at 10:22 PM, Andrew Ash wrote: > > > Hi Spark devs, > > > > We discovered a very interesting bug in Spark at work last week in Spark > > 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone &g

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
riority is addressing regressions between these two > releases. > > On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash wrote: > > I'm not sure either of those PRs will fix the concurrent adds to > > Configuration issue I observed. I've got a stack trace and writeup I'll > >

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
I'm not sure either of those PRs will fix the concurrent adds to Configuration issue I observed. I've got a stack trace and writeup I'll share in an hour or two (traveling today). On Jul 14, 2014 9:50 PM, "scwf" wrote: > hi,Cody > i met this issue days before and i post a PR for this( > https:/

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
I observed a deadlock here when using the AvroInputFormat as well. The short of the issue is that there's one configuration object per JVM, but multiple threads, one for each task. If each thread attempts to add a configuration option to the Configuration object at once you get issues because HashM
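
The one-object-per-JVM hazard described above, and the defensive-copy workaround, can be illustrated with a plain HashMap standing in for Hadoop's Configuration (whose option map is HashMap-backed and unsafe under concurrent mutation). The `copyAndTag` helper and the config values are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class PerThreadCopy {
    // Fix pattern: give each task thread its own defensive copy of the
    // shared settings, so concurrent puts never mutate the same HashMap.
    static Map<String, String> copyAndTag(Map<String, String> shared, int taskId) {
        Map<String, String> local = new HashMap<>(shared); // copy, not share
        local.put("task.id", Integer.toString(taskId));
        return local;
    }

    public static void main(String[] args) throws Exception {
        // Shared, JVM-wide settings (stand-in for a single Configuration).
        final Map<String, String> shared = new HashMap<>();
        shared.put("fs.defaultFS", "hdfs://namenode:8020");

        Thread[] tasks = new Thread[4];
        for (int i = 0; i < tasks.length; i++) {
            final int id = i;
            tasks[i] = new Thread(() -> copyAndTag(shared, id));
            tasks[i].start();
        }
        for (Thread t : tasks) t.join();

        // The shared map was never mutated by the task threads.
        System.out.println(shared.size()); // 1
    }
}
```

Sharing the map instead would let unsynchronized puts race on HashMap's internal buckets, which is exactly the corruption/hang mode described in this thread.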

Re: [VOTE] Release Apache Spark 1.0.1 (RC1)

2014-06-29 Thread Andrew Ash
important > bugs. For this reason, it probably can't block a release (I'm not even sure > if it should go into a maintenance release where we fix critical bugs for > Spark core). > > We should definitely include them for 1.1.0 though (~Aug). > > > > > On

Re: [VOTE] Release Apache Spark 1.0.1 (RC1)

2014-06-29 Thread Andrew Ash
Thanks for helping shepherd the voting on 1.0.1 Patrick. I'd like to call attention to https://issues.apache.org/jira/browse/SPARK-2157 and https://github.com/apache/spark/pull/1107 -- "Ability to write tight firewall rules for Spark" I'm currently unable to run Spark on some projects because our

Re: Contributing to MLlib on GLM

2014-06-17 Thread Andrew Ash
Hi Xiaokai, Also take a look through Xiangrui's slides from HadoopSummit a few weeks back: http://www.slideshare.net/xrmeng/m-llib-hadoopsummit The roadmap starting at slide 51 will probably be interesting to you. Andrew On Tue, Jun 17, 2014 at 7:37 PM, Sandy Ryza wrote: > Hi Xiaokai, > > I

Re: Compile failure with SBT on master

2014-06-16 Thread Andrew Ash
> On Mon, Jun 16, 2014 at 9:29 PM, Andrew Ash wrote: > > > I can't run sbt/sbt gen-idea on a clean checkout of Spark master. > > > > I get resolution errors on junit#junit;4.10!junit.zip(source) > > > > As shown below: > > > > aash@aas

Re: encounter jvm problem when integreation spark with mesos

2014-06-16 Thread Andrew Ash
Hi qingyang, This looks like an issue with the open source version of the Java runtime (called OpenJDK) that causes the JVM to fail. Can you try using the JVM released by Oracle and see if it has the same issue? Thanks! Andrew On Mon, Jun 16, 2014 at 9:24 PM, qingyang li wrote: > hi, I encou

Compile failure with SBT on master

2014-06-16 Thread Andrew Ash
I can't run sbt/sbt gen-idea on a clean checkout of Spark master. I get resolution errors on junit#junit;4.10!junit.zip(source) As shown below: aash@aash-mbp /tmp/git/spark$ sbt/sbt gen-idea Using /Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home as default JAVA_HOME. Note, this wi

Re: Contributing Spark Infrastructure Configuration Docs

2014-06-05 Thread Andrew Ash
I would appreciate seeing the specs you came up with as well but don't need to particularly quickly. I'll wait until seeing the PR to comment on the specifics, but have some questions about the thought process that went into configuring the hardware. Is the idea to see how you spec'd out memory/d

Re: Implementing rdd.scanLeft()

2014-06-05 Thread Andrew Ash
Is that something that documentation on the method can solve? On Thu, Jun 5, 2014 at 10:47 AM, Reynold Xin wrote: > I think the main concern is this would require scanning the data twice, and > maybe the user should be aware of it ... > > > On Thu, Jun 5, 2014 at 10:29 AM, An

Implementing rdd.scanLeft()

2014-06-05 Thread Andrew Ash
I have a use case that would greatly benefit from RDDs having a .scanLeft() method. Are the project developers interested in adding this to the public API? Looking through past message traffic, this has come up a few times. The recommendation from the list before has been to implement a paralle

Re: Scala Language NPE

2014-06-02 Thread Andrew Ash
Ah nevermind, the fix is to get rid of "return" from my method. There's probably a bug somewhere related to the repl taking bad input more cleanly, but this isn't the end of the world once you figure out what the issue is. Thanks for the time, Andrew On Mon, Jun 2, 2014 at

Scala Language NPE

2014-06-02 Thread Andrew Ash
// observed in Spark 1.0 Scala devs, I was observing an unusual NPE in my code recently, and came up with the below minimal test case: class Super extends Serializable { lazy val superVal: String = null } class Sub extends Super { lazy val subVal: String = { try { "l

bin/spark-shell --jars option

2014-05-30 Thread Andrew Ash
Hi Spark users, In past Spark releases I always had to add jars to multiple places when using the spark-shell, and I'm looking to cut down on those. The --jars option looks like it does what I want, but it doesn't work. I did a quick experiment on latest branch-1.0 and found this: *# 0) jar not

Re: Timestamp support in v1.0

2014-05-29 Thread Andrew Ash
I can confirm that the commit is included in the 1.0.0 release candidates (it was committed before branch-1.0 split off from master), but I can't confirm that it works in PySpark. Generally the Python and Java interfaces lag a little behind the Scala interface to Spark, but we're working to keep t

Re: all values for a key must fit in memory

2014-05-25 Thread Andrew Ash
Hi Nilesh, That change from Matei to change (Key, Seq[Value]) into (Key, Iterable[Value]) was to enable the optimization in future releases without breaking the API. Currently though, all values on a single key are still held in memory on a single machine. The way I've gotten around this is by i

Re: Should SPARK_HOME be needed with Mesos?

2014-05-22 Thread Andrew Ash
os on a mesos master as a > way to have the mesos executor in place. > > -kr, Gerard. > > [1] https://issues.apache.org/jira/browse/SPARK-1110 > > > On Thu, May 22, 2014 at 6:19 AM, Andrew Ash wrote: > >> Hi Gerard, >> >> I agree that your second opt

Re: Should SPARK_HOME be needed with Mesos?

2014-05-21 Thread Andrew Ash
Hi Gerard, I agree that your second option seems preferred. You shouldn't have to specify a SPARK_HOME if the executor is going to use the spark.executor.uri instead. Can you send in a pull request that includes your proposed changes? Andrew On Wed, May 21, 2014 at 10:19 AM, Gerard Maas wrot

Re: Sorting partitions in Java

2014-05-20 Thread Andrew Ash
Voted :) https://issues.apache.org/jira/browse/SPARK-983 On Tue, May 20, 2014 at 10:21 AM, Sandy Ryza wrote: > There is: SPARK-545 > > > On Tue, May 20, 2014 at 10:16 AM, Andrew Ash wrote: > > > Sandy, is there a Jira ticket for that? > > > > > > On Tu

Re: Sorting partitions in Java

2014-05-20 Thread Andrew Ash
Sandy, is there a Jira ticket for that? On Tue, May 20, 2014 at 10:12 AM, Sandy Ryza wrote: > sortByKey currently requires partitions to fit in memory, but there are > plans to add external sort > > > On Tue, May 20, 2014 at 10:10 AM, Madhu wrote: > > > Thanks Sean, I had seen that post you men

Re: TorrentBroadcast aka Cornet?

2014-05-19 Thread Andrew Ash
g with it yet (kind of an > oversight in this release unfortunately). > > Matei > > On May 19, 2014, at 12:07 AM, Andrew Ash wrote: > > > Hi Spark devs, > > > > Is the algorithm for > > TorrentBroadcast< > https://github.com/apache/spark/b

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Andrew Ash
Sounds like the problem is that classloaders always look in their parents before themselves, and Spark users want executors to pick up classes from their custom code before the ones in Spark plus its dependencies. Would a custom classloader that delegates to the parent after first checking itself
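
A minimal sketch of such a child-first classloader, using the standard java.lang.ClassLoader hooks. This is an illustration of the delegation order only, not Spark's implementation; a production version would still need to force java.* and the framework's own classes through the parent.

```java
public class ChildFirstClassLoader extends ClassLoader {
    public ChildFirstClassLoader(ClassLoader parent) {
        super(parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // Check this loader's own classpath FIRST...
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // ...and only then fall back to normal parent delegation.
                    c = super.loadClass(name, false);
                }
            }
            if (resolve) resolveClass(c);
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        ChildFirstClassLoader cl =
            new ChildFirstClassLoader(ChildFirstClassLoader.class.getClassLoader());
        // This loader defines no classes of its own, so lookups fall
        // through to the parent chain.
        System.out.println(cl.loadClass("java.lang.String") == String.class); // true
    }
}
```

Because findClass is not overridden here it always fails, so every lookup falls through to the parent; a real version would override findClass to read class bytes from the user's jars first.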

TorrentBroadcast aka Cornet?

2014-05-19 Thread Andrew Ash
Hi Spark devs, Is the algorithm for TorrentBroadcast the same as Cornet from the below paper? http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf If so it would be nic

Re: Matrix Multiplication of two RDD[Array[Double]]'s

2014-05-18 Thread Andrew Ash
Hi Liquan, There is some work being done on implementing linear algebra algorithms on Spark for use in higher-level machine learning algorithms. That work is happening in the MLlib project, which has an org.apache.spark.mllib.linalg package you may find useful. See https://github.com/apache/spa

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-18 Thread Andrew Ash
The nice thing about putting discussion on the Jira is that everything about the bug is in one place. So people looking to understand the discussion a few years from now only have to look on the jira ticket rather than also search the mailing list archives and hope commenters all put the string "S

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Andrew Ash
+1 on the next release feeling more like a 0.10 than a 1.0 On May 17, 2014 4:38 AM, "Mridul Muralidharan" wrote: > I had echoed similar sentiments a while back when there was a discussion > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api > changes, add missing functionalit

Updating docs for running on Mesos

2014-05-15 Thread Andrew Ash
The docs for how to run Spark on Mesos have changed very little since 0.6.0, but setting it up is much easier now than then. Does it make sense to revamp with the below changes? You no longer need to build mesos yourself as pre-built versions are available from Mesosphere: http://mesosphere.io/d

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Andrew Ash
Thanks for filing -- I'm keeping my eye out for updates on that ticket. Cheers! Andrew On Tue, May 13, 2014 at 2:40 PM, Michael Armbrust wrote: > > > > It looks like currently the .count() on parquet is handled incredibly > > inefficiently and all the columns are materialized. But if I select

Re: Class-based key in groupByKey?

2014-05-13 Thread Andrew Ash
In Scala, if you override .equals() you also need to override .hashCode(), just like in Java: http://www.scala-lang.org/api/2.10.3/index.html#scala.AnyRef I suspect if your .hashCode() delegates to just the hashcode of s then you'd be good. On Tue, May 13, 2014 at 3:30 PM, Michael Malak wrote:
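
The equals/hashCode contract above is easy to demonstrate on the JVM with a plain HashMap (the Key class here is hypothetical, for illustration): when hashCode() delegates to the same field that equals() compares, equal keys collapse into one entry, which is exactly what hash-based grouping relies on.

```java
import java.util.HashMap;
import java.util.Map;

public class KeyContract {
    // A key type used for grouping: equals and hashCode must agree,
    // otherwise hash-based grouping (HashMap, groupByKey) silently
    // treats equal keys as distinct.
    static final class Key {
        final String s;
        Key(String s) { this.s = s; }

        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).s.equals(s);
        }

        // Delegate to the same field that equals() compares.
        @Override public int hashCode() {
            return s.hashCode();
        }
    }

    public static void main(String[] args) {
        Map<Key, Integer> counts = new HashMap<>();
        counts.merge(new Key("a"), 1, Integer::sum);
        counts.merge(new Key("a"), 1, Integer::sum);
        System.out.println(counts.size()); // 1 -- both keys hash to the same entry
    }
}
```

Omitting the hashCode() override would leave Object's identity hash in place, and the two equal keys would land in separate entries.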

Re: Updating docs for running on Mesos

2014-05-13 Thread Andrew Ash
. No > kidding - > > it was coming, as the laptop was crashing from time to time, but the > mesos > > build was that one drop too much) > > > > kr, Gerard. > > > > > > > > On Tue, May 13, 2014 at 6:57 AM, Andrew Ash > wrote: > > >

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Andrew Ash
eriments and analysis! > > I think Michael already submitted a patch that avoids scanning all columns > for count(*) or count(1). > > > On Mon, May 12, 2014 at 9:46 PM, Andrew Ash wrote: > > > Hi Spark devs, > > > > First of all, huge congrats on the parqu

Re: Updating docs for running on Mesos

2014-05-12 Thread Andrew Ash
I have a draft of my proposed changes here: https://github.com/apache/spark/pull/756 https://issues.apache.org/jira/browse/SPARK-1818 Thanks! Andrew On Mon, May 12, 2014 at 9:57 PM, Andrew Ash wrote: > As far as I know, the upstream doesn't release binaries, only source code

Re: Updating docs for running on Mesos

2014-05-12 Thread Andrew Ash
k Wendell wrote: > Andrew, > > Updating these docs would be great! I think this would be a welcome change. > > In terms of packaging, it would be good to mention the binaries > produced by the upstream project as well, in addition to Mesosphere. > > - Patrick > > On T

Re: Requirements of objects stored in RDDs

2014-05-12 Thread Andrew Ash
An RDD can hold objects of any type. If you generally think of it as a distributed Collection, then you won't ever be that far off. As far as serialization, the contents of an RDD must be serializable. There are two serialization libraries you can use with Spark: normal Java serialization or Kry
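
A round-trip through the default Java serializer shows the minimum requirement on RDD element types: the class must implement java.io.Serializable. (Record, serialize, and deserialize below are hypothetical helpers; Kryo's API differs and benefits from explicit class registration.)

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class RoundTrip {
    // Any record stored in an RDD must survive this kind of round-trip;
    // with the default Java serializer that means implementing Serializable.
    static class Record implements Serializable {
        final String name;
        final int value;
        Record(String name, int value) { this.name = name; this.value = value; }
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(o);
        }
        return bos.toByteArray();
    }

    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Record r = (Record) deserialize(serialize(new Record("answer", 42)));
        System.out.println(r.name + "=" + r.value); // answer=42
    }
}
```

A NotSerializableException from a round-trip like this is the same failure Spark reports at task-serialization time.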

Re: Updating docs for running on Mesos

2014-05-12 Thread Andrew Ash
this and volunteering to do it. > > On May 11, 2014 3:32 AM, "Andrew Ash" wrote: > > > > The docs for how to run Spark on Mesos have changed very little since > > 0.6.0, but setting it up is much easier now than then. Does it make > sense > > to revam

Preliminary Parquet numbers and including .count() in Catalyst

2014-05-12 Thread Andrew Ash
Hi Spark devs, First of all, huge congrats on the parquet integration with SparkSQL! This is an incredible direction forward and something I can see being very broadly useful. I was doing some preliminary tests to see how it works with one of my workflows, and wanted to share some numbers that p

Re: Any ideas on SPARK-1021?

2014-05-12 Thread Andrew Ash
This is the issue where .sortByKey() launches a cluster job when it shouldn't because it's a transformation not an action. https://issues.apache.org/jira/browse/SPARK-1021 I'd appreciate a fix too but don't currently have any thoughts on how to proceed forward. Andrew On Thu, May 8, 2014 at 2:

Re: Kryo not default?

2014-05-12 Thread Andrew Ash
As an example of where it sometimes doesn't work, in older versions of Kryo / Chill the Joda LocalDate class didn't serialize properly -- https://groups.google.com/forum/#!topic/cascalog-user/35cdnNIamKU On Mon, May 12, 2014 at 4:39 PM, Reynold Xin wrote: > The main reason is that it doesn't al

Re: reading custom input format in Spark

2014-04-08 Thread Andrew Ash
ing the PatternInputFormat from the blog post you > referenced. > I know how to set the pattern in configuration while writing a MR job, how > do i do that from a spark shell? > > -anurag > > > > On Tue, Apr 8, 2014 at 1:41 PM, Andrew Ash wrote: > > > Are you usin

Re: reading custom input format in Spark

2014-04-08 Thread Andrew Ash
Are you using the PatternInputFormat from this blog post? https://hadoopi.wordpress.com/2013/05/31/custom-recordreader-processing-string-pattern-delimited-records/ If so you need to set the pattern in the configuration before attempting to read data with that InputFormat: String regex = "^[A-Za-

Re: Largest input data set observed for Spark.

2014-03-20 Thread Andrew Ash
Understood of course. Did the data fit comfortably in memory or did you experience memory pressure? I've had to do a fair amount of tuning when under memory pressure in the past (0.7.x) and was hoping that the handling of this scenario is improved in later Spark versions. On Thu, Mar 20, 2014 a

Re: SPARK-942 patch review

2014-02-25 Thread Andrew Ash
ween those. For now please just post on the dev list if > your PR is being ignored. We'll implement some kind of cleanup (at least > manually) to close the old ones. > > > > Matei > > > > On Feb 24, 2014, at 1:30 PM, Andrew Ash wrote: > > > >> Yep t

Re: Kryo docs: do we include twitter/chill by default?

2014-02-24 Thread Andrew Ash
ion). On Mon, Feb 24, 2014 at 8:30 PM, Reynold Xin wrote: > We do include Chill by default. It's a good idea to update the doc to > include chill. > > > On Mon, Feb 24, 2014 at 7:55 PM, Andrew Ash wrote: > > > Spark devs, > > > > I picked up somewh

Kryo docs: do we include twitter/chill by default?

2014-02-24 Thread Andrew Ash
Spark devs, I picked up somewhere that the Spark 0.9.0 release included Twitter's chill library of default-registered Kryo serialization classes. Is that the case? If so I'd like to mention in the data serialization docs that many things are registered by default and include a link to the releva